From unicode at unicode.org Thu Jun 1 03:11:12 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 1 Jun 2017 09:11:12 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: On 31 May 2017, at 20:24, Shawn Steele via Unicode wrote: > > > For implementations that emit FFFD while handling text conversion and repair (ie, converting ill-formed > > UTF-8 to well-formed), it is best for interoperability if they get the same results, so that indices within the > > resulting strings are consistent across implementations for all the correct characters thereafter. > > That seems optimistic :) > > If interoperability is the goal, then it would seem to me that changing the recommendation would be contrary to that goal. There are systems that will not or cannot change to a new recommendation. If such systems are updated, then adoption of those systems will likely take some time. Indeed, if interoperability is the goal, the behaviour should be fully specified, not merely recommended. At present, though, it appears that we have (broadly) two different behaviours in the wild, and nobody wants to change what they presently do. Personally I agree with Shawn on this; the presence of a U+FFFD indicates that the input was invalid somehow. You don?t know *how* it was invalid, and probably shouldn?t rely on equivalence with another invalid string. There are obviously some exceptions - e.g. it *may* be desirable in the context of browsers to specify the behaviour in order to avoid behavioural differences being used for Javascript-based ?fingerprinting?. But I don?t see why WHATWG (for instance) couldn?t do that. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu Jun 1 03:13:33 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 1 Jun 2017 09:13:33 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> <03634118-4070-409D-9D62-98488E9AB1E5@alastairs-place.net> Message-ID: On 31 May 2017, at 20:42, Shawn Steele via Unicode wrote: > >> And *that* is what the specification says. The whole problem here is that someone elevated >> one choice to the status of ?best practice?, and it?s a choice that some of us don?t think *should* >> be considered best practice. > >> Perhaps ?best practice? 
should simply be altered to say that you *clearly document* your behavior >> in the case of invalid UTF-8 sequences, and that code should not rely on the number of U+FFFDs >> generated, rather than suggesting a behaviour? > > That's what I've been suggesting. > > I think we could maybe go a little further though: > > * Best practice is clearly not to depend on the # of U+FFFDs generated by another component/app. Clearly that can't be relied upon, so I think everyone can agree with that. > * I think encouraging documentation of behavior is cool, though there are probably low priority bugs and people don't like to read the docs in that detail, so I wouldn't expect very much from that. > * As far as I can tell, there are two (maybe three) sane approaches to this problem: > * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence > * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. > * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). > * I'd be happy if the best practice encouraged one of those two (or maybe three) approaches. I think an approach that called rand() to see how many U+FFFDs to emit when it encountered bad data is fair to discourage. Agreed. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu Jun 1 04:32:08 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 1 Jun 2017 12:32:08 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170531181113.0fc7ea7a@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> Message-ID: On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode wrote: > On Wed, 31 May 2017 15:12:12 +0300 > Henri Sivonen via Unicode wrote: >> I am not claiming it's too difficult to implement. I think it >> inappropriate to ask implementations, even from-scratch ones, to take >> on added complexity in error handling on mere aesthetic grounds. Also, >> I think it's inappropriate to induce implementations already written >> according to the previous guidance to change (and risk bugs) or to >> make the developers who followed the previous guidance with precision >> be the ones who need to explain why they aren't following the new >> guidance. > > How straightforward is the FSM for back-stepping? This seems beside the point, since the new guidance wasn't advertised as improving backward stepping compared to the old guidance. (On the first look, I don't see the new guidance improving back stepping. In fact, if the UTC meant to adopt ICU's behavior for obsolete five and six-byte bit patterns, AFAICT, backstepping with the ICU behavior requires examining more bytes backward than the old guidance required.) 
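Setting back-stepping aside, the practical difference between the two pieces of guidance is simply how many U+FFFDs come out of a given ill-formed input. The sketch below is a toy model of the two counting policies - one U+FFFD per maximal subpart versus one U+FFFD per lead-byte-claimed sequence, as in the quote that follows - and is only an illustration of the counting rules (the table and function names are invented here), not the code of ICU, any browser, or any other shipping implementation:

    # Toy model of the two substitution policies discussed in this thread.
    # NOT the code of ICU, any browser, or any other shipping library;
    # it only illustrates how the two counting rules differ.

    # Permitted range of the *second* byte for each valid lead byte
    # (the constraints of Table 3-7).
    _SECOND = {**{b: (0x80, 0xBF) for b in range(0xC2, 0xE0)},
               0xE0: (0xA0, 0xBF),
               **{b: (0x80, 0xBF) for b in range(0xE1, 0xF0) if b != 0xED},
               0xED: (0x80, 0x9F),
               0xF0: (0x90, 0xBF), 0xF1: (0x80, 0xBF),
               0xF2: (0x80, 0xBF), 0xF3: (0x80, 0xBF),
               0xF4: (0x80, 0x8F)}

    def fffd_count_maximal(data: bytes) -> int:
        """One U+FFFD per maximal subpart of an ill-formed subsequence
        (the behaviour the pre-existing guidance recommends, per D93b)."""
        count = i = 0
        while i < len(data):
            b = data[i]
            if b <= 0x7F:
                i += 1
                continue
            if b not in _SECOND:            # stray trail byte, C0/C1, F5..FF
                count += 1
                i += 1
                continue
            need = 1 if b <= 0xDF else 2 if b <= 0xEF else 3
            j, ok = i + 1, True
            for k in range(need):
                lo, hi = _SECOND[b] if k == 0 else (0x80, 0xBF)
                if j >= len(data) or not lo <= data[j] <= hi:
                    ok = False              # the subpart ends before byte j
                    break
                j += 1
            count += 0 if ok else 1         # one U+FFFD for bytes i..j-1
            i = j
        return count

    def fffd_count_minimal(data: bytes) -> int:
        """One U+FFFD per lead-byte-claimed sequence: collect as many
        10xxxxxx trail bytes as the lead byte's bit pattern announces,
        and only range-check the assembled value at the end."""
        count = i = 0
        while i < len(data):
            b = data[i]
            if b <= 0x7F:
                i += 1
                continue
            if 0x80 <= b <= 0xBF or b >= 0xFE:   # cannot start any sequence
                count += 1
                i += 1
                continue
            need = (1 if b <= 0xDF else 2 if b <= 0xEF else
                    3 if b <= 0xF7 else 4 if b <= 0xFB else 5)
            cp, j = b & (0x7F >> (need + 1)), i + 1
            while j < i + 1 + need and j < len(data) and 0x80 <= data[j] <= 0xBF:
                cp = (cp << 6) | (data[j] & 0x3F)
                j += 1
            minimum = {1: 0x80, 2: 0x800, 3: 0x10000}.get(need)
            ok = (j - i - 1 == need and minimum is not None and
                  minimum <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF)
            count += 0 if ok else 1         # one U+FFFD for the whole clump
            i = j
        return count

    for bad in (b"\xC0\xAF", b"\xE0\x80\x80", b"\xF4\x90\x80\x80"):
        print(bad.hex(), fffd_count_maximal(bad), fffd_count_minimal(bad))

On those three inputs the first policy yields 2, 3 and 4 replacement characters while the second yields 1 each time; that divergence in output length and indices is exactly what is being argued over.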
>> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode >> wrote: >> > The UTF-8 conversion code that I wrote for ICU, and apparently the >> > code that various other people have written, collects sequences >> > starting from lead bytes, according to the original spec, and at >> > the end looks at whether the assembled code point is too low for >> > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a >> > non-trail byte is quite natural, and reading the PRI text >> > accordingly is quite natural too. >> >> I don't doubt that other people have written code with the same >> concept as ICU, but as far as non-shortest form handling goes in the >> implementations I tested (see URL at the start of this email) ICU is >> the lone outlier. > > You should have researched implementations as they were in 2007. I don't see how the state of things in 2007 is relevant to a decision taken in 2017. It's relevant that by 2017, prominent implementations had adopted the old Unicode guidance, and, that being the case, it's inappropriate to change the guidance for aesthetic reasons or to favor the Unicode Consortium-hosted implementation. On Wed, May 31, 2017 at 8:43 PM, Shawn Steele via Unicode wrote: > I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen. I'm a browser developer. I've explained previously on this list and in my blog post why the browser developer / Web standard culture favors well-defined behavior in error cases these days. On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode wrote: > Henri Sivonen wrote: > >> If anything, I hope this thread results in the establishment of a >> requirement for proposals to come with proper research about what >> multiple prominent implementations to about the subject matter of a >> proposal concerning changes to text about implementation behavior. > > Considering that several folks have objected that the U+FFFD > recommendation is perceived as having the weight of a requirement, I > think adding Henri's good advice above as a "requirement" seems > heavy-handed. Who will judge how much research qualifies as "proper"? In the Unicode scope, it's indeed harder to draw clear line to decide what the prominent implementations are than in the WHATWG scope. The point is that just checking ICU is not good enough. Someone making a proposal should check the four major browser engines and a bunch of system frameworks and standard libraries for well-known programming languages. Which frameworks and standard libraries and how many is not precisely definable objectively and depends on the subject matter (there are many UTF-8 decoders but e.g. fewer text shaping engines). There will be diminishing returns to checking them. Chances are that it's not necessary to check too many for a pattern to emerge to judge whether the existing spec language is being implemented (don't change it) or being ignored (probably should be changed then). In any case, "we can't check everything or choose fairly what exactly to check" shouldn't be a reason for it to be OK to just check ICU or to make abstract arguments without checking any implementations at all. Checking multiple popular implementations is homework better done than just checking ICU even if it's up to the person making the proposal to choose which implementations to check exactly. 
The committee should be able to recognize if the list of implementations tested looks like a list of broadly-deployed implementations. On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode wrote: > * As far as I can tell, there are two (maybe three) sane approaches to this problem: > * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence > * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. > * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). I think it's not useful to come up with new rules in the abstract. I'd like to focus on the fact that the Standard expressed a preference and the preference got implemented (in broadly-deployed well-known software). That being the case, it's not OK to change the preference expressed in the standard as a matter of what "feels right" or "sane" subsequently when there wasn't a super-serious problem with the previously-expressed preference that already got implemented in multiple pieces of broadly-deployed software whose developers took the Standard's expression of preference seriously. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Thu Jun 1 06:04:44 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 1 Jun 2017 12:04:44 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> Message-ID: <4250062B-5AFF-4BEF-AC4D-B55DBB1B12C5@alastairs-place.net> On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote: > > On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode > wrote: >> * As far as I can tell, there are two (maybe three) sane approaches to this problem: >> * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence >> * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. >> * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). > > I think it's not useful to come up with new rules in the abstract. The first two aren?t ?new? rules; they?re, respectively, the current ?Best Practice?, the proposed ?Best Practice? and one other potentially reasonable approach that might make sense e.g. 
if the problem you?re worrying about is serial data slip or corruption of a compressed or encrypted file (where corruption will occur until re-synchronisation happens, and as a result you wouldn?t expect to have any knowledge whatever of the number of characters represented in the data in question). All of these approaches are explicitly allowed by the standard at present. All three are reasonable, and each has its own pros and cons in a technical sense (leaving aside how prevalent the approach in question might be). In a general purpose library I?d probably go for the second one; if I knew I was dealing with a potentially corrupt compressed or encrypted stream, I might well plump for the third. I can even *imagine* there being circumstances under which I might choose the first for some reason, in spite of my preference for the second approach. I don?t think it makes sense to standardise on *one* of these approaches, so if what you?re saying is that the ?Best Practice? has been treated as if it was part of the specification (and I think that *is* essentially your claim), then I?m in favour of either removing it completely, or (better) replacing it with Shawn?s suggestion - i.e. listing three reasonable approaches and telling developers to document which they take and why. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu Jun 1 09:21:28 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Jun 2017 07:21:28 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> Message-ID: <083c780a-1d99-9586-ba71-dccf79258d37@ix.netcom.com> On 6/1/2017 2:32 AM, Henri Sivonen via Unicode wrote: > O > On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode > wrote: >> Henri Sivonen wrote: >> >>> If anything, I hope this thread results in the establishment of a >>> requirement for proposals to come with proper research about what >>> multiple prominent implementations to about the subject matter of a >>> proposal concerning changes to text about implementation behavior. >> Considering that several folks have objected that the U+FFFD >> recommendation is perceived as having the weight of a requirement, I >> think adding Henri's good advice above as a "requirement" seems >> heavy-handed. Who will judge how much research qualifies as "proper"? I agree with Henri on these general points: 1) Requiring extensive research on implementation practice is crucial in dealing with any changes to long standing definitions, algorithms, properties and recommendations. 2) Not having a perfect definition of what "extensive" means is not an excuse to do nothing. 3) Evaluating only the proposer's implementation (or only ICU) is not sufficient. 4) Changing a recommendation that many implementers (or worse, an implementers' collective) have chosen to adopt is a breaking change. 5) Breaking changes to fundamental algorithms require extraordinarily strong justification including, but not limited to "proof" that the existing definition/recommendation is not workable or presents grave security risks that cannot be mitigated any other way. 
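On point 3, gathering that evidence does not have to be expensive. The sketch below only records what the local Python runtime's own decoder does with the standard errors='replace' handler; it says nothing about ICU, the browser engines, or any other library, and a real survey would feed the same byte sequences to each of those as well:

    # Record what one local implementation does with ill-formed UTF-8:
    # here, whatever decoder the Python runtime uses when asked to
    # substitute errors.  The printed counts are one platform's answer,
    # not a statement about ICU or anything else.
    SAMPLES = [b"\xC0\xAF", b"\xE0\x80\x80", b"\xED\xA0\x80",
               b"\xF4\x90\x80\x80", b"\xFC\x80\x80\x80\x80\x80"]

    for raw in SAMPLES:
        decoded = raw.decode("utf-8", errors="replace")
        print(raw.hex(), "->", decoded.count("\ufffd"), "U+FFFD")

Tabulating such counts across several broadly-deployed implementations is the kind of record a proposal of this sort should come with.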
I continue to see a disturbing lack of appreciation of these issues in some of the replies to this discussion (and some past decisions by the UTC). A./ From unicode at unicode.org Thu Jun 1 12:41:45 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Thu, 1 Jun 2017 17:41:45 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <4250062B-5AFF-4BEF-AC4D-B55DBB1B12C5@alastairs-place.net> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> <4250062B-5AFF-4BEF-AC4D-B55DBB1B12C5@alastairs-place.net> Message-ID: I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance. When what this really needs is a "MAY". People reading standards tend to treat "SHOULD" and "MUST" as the same thing. So, when an implementation deviates, then you get bugs (as we see here). Given that there are very valid engineering reasons why someone might want to choose a different behavior for their needs - without harming the intent of the standard at all in most cases - I think the current/proposed language is too "strong". -Shawn -----Original Message----- From: Alastair Houghton [mailto:alastair at alastairs-place.net] Sent: Thursday, June 1, 2017 4:05 AM To: Henri Sivonen Cc: unicode Unicode Discussion ; Shawn Steele Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote: > > On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode > wrote: >> * As far as I can tell, there are two (maybe three) sane approaches to this problem: >> * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence >> * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. >> * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). > > I think it's not useful to come up with new rules in the abstract. The first two aren?t ?new? rules; they?re, respectively, the current ?Best Practice?, the proposed ?Best Practice? and one other potentially reasonable approach that might make sense e.g. if the problem you?re worrying about is serial data slip or corruption of a compressed or encrypted file (where corruption will occur until re-synchronisation happens, and as a result you wouldn?t expect to have any knowledge whatever of the number of characters represented in the data in question). All of these approaches are explicitly allowed by the standard at present. All three are reasonable, and each has its own pros and cons in a technical sense (leaving aside how prevalent the approach in question might be). 
In a general purpose library I?d probably go for the second one; if I knew I was dealing with a potentially corrupt compressed or encrypted stream, I might well plump for the third. I can even *imagine* there being circumstances under which I might choose the first for some reason, in spite of my preference for the second approach. I don?t think it makes sense to standardise on *one* of these approaches, so if what you?re saying is that the ?Best Practice? has been treated as if it was part of the specification (and I think that *is* essentially your claim), then I?m in favour of either removing it completely, or (better) replacing it with Shawn?s suggestion - i.e. listing three reasonable approaches and telling developers to document which they take and why. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu Jun 1 13:44:29 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Jun 2017 11:44:29 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> <4250062B-5AFF-4BEF-AC4D-B55DBB1B12C5@alastairs-place.net> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 1 13:53:36 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Thu, 1 Jun 2017 18:53:36 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> <4250062B-5AFF-4BEF-AC4D-B55DBB1B12C5@alastairs-place.net> Message-ID: But those are IETF definitions. They don?t have to mean the same thing in Unicode - except that people working in this field probably expect them to. From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Thursday, June 1, 2017 11:44 AM To: unicode at unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote: I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance. When what this really needs is a "MAY". People reading standards tend to treat "SHOULD" and "MUST" as the same thing. It's not that they "tend to", it's in RFC 2119: SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course. The clear inference is that while the non-recommended practice is not prohibited, you better have some valid reason why you are deviating from it (and, reading between the lines, it would not hurt if you documented those reasons). 
So, when an implementation deviates, then you get bugs (as we see here). Given that there are very valid engineering reasons why someone might want to choose a different behavior for their needs - without harming the intent of the standard at all in most cases - I think the current/proposed language is too "strong". Yes and no. ICU would be perfectly fine deviating from the existing recommendation and stating their engineering reasons for doing so. That would allow them to close their bug ("by documentation"). What's not OK is to take an existing recommendation and change it to something else, just to make bug reports go away for one implementations. That's like two sleepers fighting over a blanket that's too short. Whenever one is covered, the other is exposed. If it is discovered that the existing recommendation is not based on anything like truly better behavior, there may be a case to change it to something that's equivalent to a MAY. Perhaps a list of nearly equally capable options. (If that language is not in the standard already, a strong "an implementation MUST not depend on the use of a particular strategy for replacement of invalid code sequences", clearly ought to be added). A./ -Shawn -----Original Message----- From: Alastair Houghton [mailto:alastair at alastairs-place.net] Sent: Thursday, June 1, 2017 4:05 AM To: Henri Sivonen Cc: unicode Unicode Discussion ; Shawn Steele Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote: On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode wrote: * As far as I can tell, there are two (maybe three) sane approaches to this problem: * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). I think it's not useful to come up with new rules in the abstract. The first two aren?t ?new? rules; they?re, respectively, the current ?Best Practice?, the proposed ?Best Practice? and one other potentially reasonable approach that might make sense e.g. if the problem you?re worrying about is serial data slip or corruption of a compressed or encrypted file (where corruption will occur until re-synchronisation happens, and as a result you wouldn?t expect to have any knowledge whatever of the number of characters represented in the data in question). All of these approaches are explicitly allowed by the standard at present. All three are reasonable, and each has its own pros and cons in a technical sense (leaving aside how prevalent the approach in question might be). In a general purpose library I?d probably go for the second one; if I knew I was dealing with a potentially corrupt compressed or encrypted stream, I might well plump for the third. I can even *imagine* there being circumstances under which I might choose the first for some reason, in spite of my preference for the second approach. I don?t think it makes sense to standardise on *one* of these approaches, so if what you?re saying is that the ?Best Practice? 
has been treated as if it was part of the specification (and I think that *is* essentially your claim), then I?m in favour of either removing it completely, or (better) replacing it with Shawn?s suggestion - i.e. listing three reasonable approaches and telling developers to document which they take and why. Kind regards, Alastair. -- http://alastairs-place.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 1 14:16:52 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Jun 2017 20:16:52 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> Message-ID: <20170601201652.00b0e509@JRWUBU2> On Thu, 1 Jun 2017 12:32:08 +0300 Henri Sivonen via Unicode wrote: > On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode > wrote: > > On Wed, 31 May 2017 15:12:12 +0300 > > Henri Sivonen via Unicode wrote: > >> I am not claiming it's too difficult to implement. I think it > >> inappropriate to ask implementations, even from-scratch ones, to > >> take on added complexity in error handling on mere aesthetic > >> grounds. Also, I think it's inappropriate to induce > >> implementations already written according to the previous guidance > >> to change (and risk bugs) or to make the developers who followed > >> the previous guidance with precision be the ones who need to > >> explain why they aren't following the new guidance. > > > > How straightforward is the FSM for back-stepping? > > This seems beside the point, since the new guidance wasn't advertised > as improving backward stepping compared to the old guidance. > > (On the first look, I don't see the new guidance improving back > stepping. In fact, if the UTC meant to adopt ICU's behavior for > obsolete five and six-byte bit patterns, AFAICT, backstepping with the > ICU behavior requires examining more bytes backward than the old > guidance required.) The greater simplicity comes from the the alternative behaviour being more 'natural'. It's a little difficult to count states without constraints on the machines, but for forward stepping, even supporting 6-byte patterns just in case 20.1 bits eventually turn out not to be enough, there are five intermediate states - '1 byte to go', '2 bytes to go', ... '5 bytes to go'. For backward stepping, there are similarly five intermediate states - '1 trailing byte seen', and so on. For the recommended handling, forward stepping has seven intermediate states, each directly reachable from the starting state - start byte C2..DF; start byte E0; start byte E1..EC, EE or EF; start byte ED; start byte F0; start byte F1..F3; and start byte FF. No further intermediate states are required. For the recommended handling, I see a need for 8 intermediate steps, depending on how may trail bytes have been considered and whether the last one was in the range 80..8F (precludes E0 and F0 immediately preceding), 90..9F (precludes E0 and F4 immediately preceding) or A0..BF (precludes ED and F4 immediately preceding). The logic feels quite complicated. 
If I implement it, I'm not likely to code it up as an FSM. > > You should have researched implementations as they were in 2007. > I don't see how the state of things in 2007 is relevant to a decision > taken in 2017. Because the argument is that the original decision taken in 2008 was wrong. I have a feeling I have overlooked some of the discussion around then, because I can't find my contribution in the archives, and I thought I objected at the time. Richard. From unicode at unicode.org Thu Jun 1 14:30:23 2017 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Thu, 1 Jun 2017 12:30:23 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> <4250062B-5AFF-4BEF-AC4D-B55DBB1B12C5@alastairs-place.net> Message-ID: On 6/1/2017 11:53 AM, Shawn Steele wrote: > > But those are IETF definitions. They don?t have to mean the same > thing in Unicode - except that people working in this field probably > expect them to. > That's the thing. And even if Unicode had it's own version of RFC 2119 one would considered it recommended for Unicode to follow widespread industry practice (there's that "r" word again!). A./ > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag via Unicode > *Sent:* Thursday, June 1, 2017 11:44 AM > *To:* unicode at unicode.org > *Subject:* Re: Feedback on the proposal to change U+FFFD generation > when decoding ill-formed UTF-8 > > On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote: > > I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance. When what this really needs is a "MAY". > > People reading standards tend to treat "SHOULD" and "MUST" as the same thing. > > > It's not that they "tend to", it's in RFC 2119: > > > SHOULD > > This word, or the adjective "RECOMMENDED", mean that there > > may exist valid reasons in particular circumstances to ignore a > > particular item, but the full implications must be understood and > > carefully weighed before choosing a different course. > > The clear inference is that while the non-recommended practice is not > prohibited, you better have some valid reason why you are deviating > from it (and, reading between the lines, it would not hurt if you > documented those reasons). > > > So, when an implementation deviates, then you get bugs (as we see here). Given that there are very valid engineering reasons why someone might want to choose a different behavior for their needs - without harming the intent of the standard at all in most cases - I think the current/proposed language is too "strong". > > > Yes and no. ICU would be perfectly fine deviating from the existing > recommendation and stating their engineering reasons for doing so. > That would allow them to close their bug ("by documentation"). > > What's not OK is to take an existing recommendation and change it to > something else, just to make bug reports go away for one > implementations. That's like two sleepers fighting over a blanket > that's too short. Whenever one is covered, the other is exposed. 
> > If it is discovered that the existing recommendation is not based on > anything like truly better behavior, there may be a case to change it > to something that's equivalent to a MAY. Perhaps a list of nearly > equally capable options. > > (If that language is not in the standard already, a strong "an > implementation MUST not depend on the use of a particular strategy for > replacement of invalid code sequences", clearly ought to be added). > > A./ > > > -Shawn > > -----Original Message----- > > From: Alastair Houghton [mailto:alastair at alastairs-place.net] > > Sent: Thursday, June 1, 2017 4:05 AM > > To: Henri Sivonen > > Cc: unicode Unicode Discussion ; Shawn Steele > > Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 > > On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote: > > On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode > > wrote: > > * As far as I can tell, there are two (maybe three) sane approaches to this problem: > > * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence > > * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. > > * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). > > I think it's not useful to come up with new rules in the abstract. > > The first two aren?t ?new? rules; they?re, respectively, the current ?Best Practice?, the proposed ?Best Practice? and one other potentially reasonable approach that might make sense e.g. if the problem you?re worrying about is serial data slip or corruption of a compressed or encrypted file (where corruption will occur until re-synchronisation happens, and as a result you wouldn?t expect to have any knowledge whatever of the number of characters represented in the data in question). > > All of these approaches are explicitly allowed by the standard at present. All three are reasonable, and each has its own pros and cons in a technical sense (leaving aside how prevalent the approach in question might be). In a general purpose library I?d probably go for the second one; if I knew I was dealing with a potentially corrupt compressed or encrypted stream, I might well plump for the third. I can even *imagine* there being circumstances under which I might choose the first for some reason, in spite of my preference for the second approach. > > I don?t think it makes sense to standardise on *one* of these approaches, so if what you?re saying is that the ?Best Practice? has been treated as if it was part of the specification (and I think that *is* essentially your claim), then I?m in favour of either removing it completely, or (better) replacing it with Shawn?s suggestion - i.e. listing three reasonable approaches and telling developers to document which they take and why. > > Kind regards, > > Alastair. > > -- > > http://alastairs-place.net > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 1 14:54:45 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 01 Jun 2017 12:54:45 -0700 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) 
Message-ID: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> Richard Wordingham wrote: > even supporting 6-byte patterns just in case 20.1 bits eventually turn > out not to be enough, Oh, gosh, here we go with this. What will we do if 31 bits turn out not to be enough? -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu Jun 1 16:39:12 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Jun 2017 22:39:12 +0100 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> Message-ID: <20170601223912.5b4c1d49@JRWUBU2> On Thu, 01 Jun 2017 12:54:45 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > > even supporting 6-byte patterns just in case 20.1 bits eventually > > turn out not to be enough, > > Oh, gosh, here we go with this. You were implicitly invited to argue that there was no need to handle 5 and 6 byte invalid sequences. > What will we do if 31 bits turn out not to be enough? A compatible extension of UTF-16 to unbounded length has already been designed. Prefix bytes 0xFF can be used to extend the length for UTF-8 by 8 bytes at a time. Extending UTF-32 is not beyond the wit of man, and we know that UTF-16 could have been done better if the need had been foreseen. While it seems natural to hold a Unicode scalar value in a single machine word of some length, this is not necessary, just highly convenient. In short, it won't be a big problem intrinsically. The UCD may get a bit unwieldy, which may be a problem for small systems without Internet access. Richard. From unicode at unicode.org Thu Jun 1 16:47:31 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Jun 2017 23:47:31 +0200 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> Message-ID: This is still very unlikely to occur. Lot of discussions about emojis but they still don't count a lot in the total. The major updates were epected for CJK sinograms, but even the rate of updates has slowed down and we will eventually will have another sinographic plane, but it will not come soon and will be very slow to fill in. This still leaves enough planes for several decenials or more. May be in the next century a new encoding will be designed but we have ample time to prepare this to reflect the best practives and experiences acquired, and it will probably not happen because we lack of code points but only because some experimentations will have proven that another encoding is better performing and less complex to manage (just like the ongoing transition from XML to JSON for the UCD) and because current supporters of Unicode will prefer this new format and will have implemented it (starting first by an automatic conversion from the existing encoding in Unicode and ISO 10646, which will no longer be needed in deployed client applications) I bet it will still be an 8-bit based encoding using 7-bit ASCII (at least the ngraphic part plus a few controls, but some other controls will be remapped), but it could be simply a new 32 bit or 64-bit encoding. 
Before this change ever occurs, there will be the need to demonstrate that it is better performing, that it allows smooth transition and excellent compatibility (possibly with efficient transcoders) and many implementation "quirks" will have been resolved (including security risks). 2017-06-01 21:54 GMT+02:00 Doug Ewell via Unicode : > Richard Wordingham wrote: > > > even supporting 6-byte patterns just in case 20.1 bits eventually turn > > out not to be enough, > > Oh, gosh, here we go with this. > > What will we do if 31 bits turn out not to be enough? > > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 1 19:10:54 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 1 Jun 2017 17:10:54 -0700 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: <20170601223912.5b4c1d49@JRWUBU2> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <20170601223912.5b4c1d49@JRWUBU2> Message-ID: On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote: > You were implicitly invited to argue that there was no need to handle > 5 and 6 byte invalid sequences. > Well, working from the *current* specification: FC 80 80 80 80 80 and FF FF FF FF FF FF are equal trash, uninterpretable as *anything* in UTF-8. By definition D39b, either sequence of bytes, if encountered by an conformant UTF-8 conversion process, would be interpreted as a sequence of 6 maximal subparts of an ill-formed subsequence. Whatever your particular strategy for conversion fallbacks for uninterpretable sequences, it ought to treat either one of those trash sequences the same, in my book. I don't see a good reason to build in special logic to treat FC 80 80 80 80 80 as somehow privileged as a unit for conversion fallback, simply because *if* UTF-8 were defined as the Unix gods intended (which it ain't no longer) then that sequence *could* be interpreted as an out-of-bounds scalar value (which it ain't) on spec that the codespace *might* be extended past 10FFFF at some indefinite time in the future (which it won't). --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 1 20:21:55 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Jun 2017 02:21:55 +0100 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <20170601223912.5b4c1d49@JRWUBU2> Message-ID: <20170602022155.66af95f1@JRWUBU2> On Thu, 1 Jun 2017 17:10:54 -0700 Ken Whistler via Unicode wrote: > On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote: > > You were implicitly invited to argue that there was no need to > > handle 5 and 6 byte invalid sequences. > > > > Well, working from the *current* specification: > > FC 80 80 80 80 80 > and > FF FF FF FF FF FF > > are equal trash, uninterpretable as *anything* in UTF-8. > > By definition D39b, either sequence of bytes, if encountered by an > conformant UTF-8 conversion process, would be interpreted as a > sequence of 6 maximal subparts of an ill-formed subsequence. ("D39b" is a typo for "D93b".) Conformant with what? There is no mandatory *requirement* for a UTF-8 conversion process conformant with Unicode to have any concept of 'maximal subpart'. 
> I don't see a good reason to build in special logic to treat FC 80 80 > 80 80 80 as somehow privileged as a unit for conversion fallback, > simply because *if* UTF-8 were defined as the Unix gods intended > (which it ain't no longer) then that sequence *could* be interpreted > as an out-of-bounds scalar value (which it ain't) on spec that the > codespace *might* be extended past 10FFFF at some indefinite time in > the future (which it won't). Arguably, it requires special logic to treat FC 80 80 80 80 80 as an invalid sequence. FC is not ASCII, and has more than one leading bit set. It has the six leading bits set, and therefore should start a sequence of 6 characters. Richard. From unicode at unicode.org Thu Jun 1 20:45:29 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Jun 2017 02:45:29 +0100 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <20170601223912.5b4c1d49@JRWUBU2> Message-ID: <20170602024529.73d22d8c@JRWUBU2> On Thu, 1 Jun 2017 17:10:54 -0700 Ken Whistler via Unicode wrote: > Well, working from the *current* specification: > > FC 80 80 80 80 80 > and > FF FF FF FF FF FF > > are equal trash, uninterpretable as *anything* in UTF-8. > > By definition D39b, either sequence of bytes, if encountered by an > conformant UTF-8 conversion process, would be interpreted as a > sequence of 6 maximal subparts of an ill-formed subsequence. There is a very good argument that 0xFC and 0xFF are not code units (D77) - they are not used in the representation of any Unicode scalar values. By that argument, you have 5 maximal subparts and seven garbage bytes. Richard. From unicode at unicode.org Thu Jun 1 21:19:51 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 1 Jun 2017 19:19:51 -0700 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: <20170602022155.66af95f1@JRWUBU2> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <20170601223912.5b4c1d49@JRWUBU2> <20170602022155.66af95f1@JRWUBU2> Message-ID: On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote: >> By definition D39b, either sequence of bytes, if encountered by an >> conformant UTF-8 conversion process, would be interpreted as a >> sequence of 6 maximal subparts of an ill-formed subsequence. > ("D39b" is a typo for "D93b".) Sorry about that. :) > > Conformant with what? There is no mandatory*requirement* for a UTF-8 > conversion process conformant with Unicode to have any concept of > 'maximal subpart'. Conformant with the definition of UTF-8. I agree that nothing forces a conversion *process* to care anything about maximal subparts, but if *any* process using a conformant definition of UTF-8 then goes on to have any concept of "maximal subpart of an ill-formed subsequence" that departs from definition D93b in the Unicode Standard, then it is just making s**t up. > >> I don't see a good reason to build in special logic to treat FC 80 80 >> 80 80 80 as somehow privileged as a unit for conversion fallback, >> simply because*if* UTF-8 were defined as the Unix gods intended >> (which it ain't no longer) then that sequence*could* be interpreted >> as an out-of-bounds scalar value (which it ain't) on spec that the >> codespace*might* be extended past 10FFFF at some indefinite time in >> the future (which it won't). 
> Arguably, it requires special logic to treat FC 80 80 80 80 80 as an > invalid sequence. That would be equally true of FF FF FF FF FF FF. Which was my point, actually. > FC is not ASCII, True, of course. But irrelevant. Because we are talking about UTF-8 here. And just because some non-UTF-8 character encoding happened to include 0xFC as a valid (or invalid) value, might not require any special case processing. A simple 8-bit to 8-bit conversion table could be completely regular in its processing of 0xFC for a conversion. > and has more than one leading bit > set. It has the six leading bits set, True, of course. > and therefore should start a > sequence of 6 characters. That is completely false, and has nothing to do with the current definition of UTF-8. The current, normative definition of UTF-8, in the Unicode Standard, and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot start a sequence of anything identifiable as UTF-8. --Ken > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 1 22:32:35 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Jun 2017 04:32:35 +0100 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <20170601223912.5b4c1d49@JRWUBU2> <20170602022155.66af95f1@JRWUBU2> Message-ID: <20170602043235.2b7c284f@JRWUBU2> On Thu, 1 Jun 2017 19:19:51 -0700 Ken Whistler via Unicode wrote: > > and therefore should start a > > sequence of 6 characters. > > That is completely false, and has nothing to do with the current > definition of UTF-8. > > The current, normative definition of UTF-8, in the Unicode Standard, > and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly > "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot > start a sequence of anything identifiable as UTF-8. TUS Section 3 is like the Augean Stables. It is a complete mess as a standards document, imputing mental states to computing processes. Table 3-7 for example, should be a consequence of a 'definition' that UTF-8 only represents Unicode Scalar values and excludes 'non-shortest forms'. Instead, the exclusion of the sequence is presented as a brute definition, rather than as a consequence of 0xD800 not being a Unicode scalar value. Likewise, 0xFC fails to be legal because it would define either a 'non-shortest form' or a value that is not a Unicode scalar value. The differences are a matter of presentation; the outcome as to what is permitted is the same. The difference lies rather in whether the rules are comprehensible. A comprehensible definition is more likely to be implemented correctly. Where the presentation makes a difference is in how malformed sequences are naturally handled. Richard. From unicode at unicode.org Thu Jun 1 23:52:11 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 1 Jun 2017 21:52:11 -0700 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) 
In-Reply-To: <20170602043235.2b7c284f@JRWUBU2> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <20170601223912.5b4c1d49@JRWUBU2> <20170602022155.66af95f1@JRWUBU2> <20170602043235.2b7c284f@JRWUBU2> Message-ID: <5dc5839f-6939-c804-6a53-abbdcb4d96f3@att.net> On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote: > TUS Section 3 is like the Augean Stables. It is a complete mess as a > standards document, That is a matter of editorial taste, I suppose. > imputing mental states to computing processes. That, however, is false. The rhetorical turn in the Unicode Standard's conformance clauses, "A process shall interpret..." and "A process shall not interpret..." has been in the standard for 21 years, and seems to have done its general job in guiding interoperable, conformant implementations fairly well. And everyone -- well, perhaps almost everyone -- has been able to figure out that such wording is a shorthand for something along the lines of "Any person implementing software conforming to the Unicode Standard in which a process does X shall implement it in such a way that that process when doing X shall follow the specification part Y, relevant to doing X, exactly according to that specification of Y...", rather than a misguided assumption that software processes are cognitive agents equipped with mental states that the standard can "tell what to think". And I contend that the shorthand works just fine. > > Table 3-7 for example, should be a consequence of a 'definition' that > UTF-8 only represents Unicode Scalar values and excludes 'non-shortest > forms'. Well, Definition D92 does already explicitly limit UTF-8 to Unicode scalar values, and explicitly limits the form to sequences of one to four bytes. The reason why it doesn't explicitly include the exclusion of "non-shortest form" in the definition, but instead refers to Table 3-7 for the well-formed sequences (which, btw explicitly rule out all the non-shortest forms), is because that would create another terminological conundrum -- trying to specify an air-tight definition of "non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is terminologically cleaner to let people *derive* non-shortest form from the explicit exclusions of Table 3-7. > Instead, the exclusion of the sequence is presented > as a brute definition, rather than as a consequence of 0xD800 not being > a Unicode scalar value. Likewise, 0xFC fails to be legal because it > would define either a 'non-shortest form' or a value that is not a > Unicode scalar value. Actually 0xFC fails quite simply and unambiguously, because it is not in Table 3-7. End of story. Same for 0xFF. There is nothing architecturally special about 0xF5..0xFF. All are simply and unambiguously excluded from any well-formed UTF-8 byte sequence. > > The differences are a matter of presentation; the outcome as to what is > permitted is the same. The difference lies rather in whether the rules > are comprehensible. A comprehensible definition is more likely to be > implemented correctly. Where the presentation makes a difference is in > how malformed sequences are naturally handled. Well, I don't think implementers have all that much trouble figuring out what *well-formed* UTF-8 is these days. As for "how malformed sequences are naturally handled", I can't really say. Nor do I think the standard actually requires any particular handling to be conformant. 
It says thou shalt not emit them, and if you encounter them, thou shalt not interpret them as Unicode characters. Beyond that, it would be nice, of course, if people converged their error handling for malformed sequences in cooperative ways, but there is no conformance statement to that effect in the standard. I have no trouble with the contention that the wording about "best practice" and "recommendations" regarding the handling of U+FFFD has caused some confusion and differences of interpretation among implementers. I'm sure the language in that area could use cleanup, precisely because it has led to contending, incompatible interpretations of the text. As to what actually *is* best practice in use of U+FFFD when attempting to convert ill-formed sequences handed off to UTF-8 conversion processes, or whether the Unicode Standard should attempt to narrow down or change practice in that area, I am completely agnostic. Back to the U+FFFD thread for that discussion. --Ken From unicode at unicode.org Fri Jun 2 03:02:25 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Fri, 2 Jun 2017 09:02:25 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <20170531181113.0fc7ea7a@JRWUBU2> <4250062B-5AFF-4BEF-AC4D-B55DBB1B12C5@alastairs-place.net> Message-ID: On 1 Jun 2017, at 19:44, Asmus Freytag via Unicode wrote: > > What's not OK is to take an existing recommendation and change it to something else, just to make bug reports go away for one implementations. That's like two sleepers fighting over a blanket that's too short. Whenever one is covered, the other is exposed. That?s *not* what?s happening, however many times you and Henri make that claim. > (If that language is not in the standard already, a strong "an implementation MUST not depend on the use of a particular strategy for replacement of invalid code sequences", clearly ought to be added). It already says (p.127, section 3.9): Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. which probably covers that, no? Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Fri Jun 2 07:23:15 2017 From: unicode at unicode.org (Phake Nick via Unicode) Date: Fri, 2 Jun 2017 20:23:15 +0800 Subject: Encoding of character for new Japanese era name after Heisei In-Reply-To: References: Message-ID: Nowadays Unicode have encoded four characters, from U+337E to U+337B, as character for the four most recent Japanese era name, which people are using them quite a lot. In recent months, The intention for Japanese emperor to resign from the duty have been announced and Japan is expected to get a new era name together with the new emperor. It can be expected that people would want to type a single character for the new era name just like how people typed old era names now. 
However, with the new era name coming into effect on Jan 1, 2019, and the name of the new Japanese era expected to be announced only half a year ahead of the first use of the character, how will Unicode handle the new era name?
According to the Unicode release schedule of recent years, the announcement will come only a few weeks before the official release of Unicode 11.0, and well past the time of the beta. Is it possible for the character to be included in Unicode 11.0, or in an 11.0.1 released some time after? We won't know what the shape of the glyph would be until the era name is announced, and as the era name itself is included in the Unicode character name in past examples, it is also not possible to come up with a name for the expected new character before the era name actually gets announced, which means that, under the usual process, an application cannot really start until the era name announcement has been made. Is there some method to apply for inclusion of a character into Unicode without actually knowing what the character would be?
Or, if it is really too difficult to encode the character within the little amount of time ahead of the era's start, would it be possible to first reserve some code points for the encoding of upcoming Japanese eras, so that people can know what code point they will be using instead of using the PUA?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Jun 2 09:49:29 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 2 Jun 2017 16:49:29 +0200
Subject: Encoding of character for new Japanese era name after Heisei
In-Reply-To:
References:
Message-ID:

But will there really be a new era name with the new emperor? All that could be done is a reservation on principle, but this does not mean that it will really be encoded. The lack of a "representative glyph" is a blocker.

Maybe we could instead add a generic character for "New Japanese Era" (independent of the actual era) to be used in contexts where the precise era will not be available. The alternative would be to write the new era name using kana (or Latin) before the composed kanji appears. I don't think it will block the localisation in CLDR even if it is later changed to use a newer preferred kanji when it becomes available.

Anyway, the names of possible successors are probably already known: how do they currently write their names using kanji or composed kana in a square?

2017-06-02 14:23 GMT+02:00 Phake Nick via Unicode :

> Nowadays Unicode has encoded four characters, U+337B through U+337E, as
> characters for the four most recent Japanese era names, and people use
> them quite a lot. In recent months, the intention of the Japanese emperor
> to resign from his duties has been announced, and Japan is expected to get
> a new era name together with the new emperor. It can be expected that
> people would want to type a single character for the new era name, just
> like how people type the old era names now. However, with the new era name
> coming into effect on Jan 1, 2019, and the name of the new Japanese era
> expected to be announced only half a year ahead of the first use of the
> character, how will Unicode handle the new era name?
> According to the Unicode release schedule of recent years, the
> announcement will come only a few weeks before the official release of
> Unicode 11.0, and well past the time of the beta. Is it possible for the
> character to be included in Unicode 11.0, or in an 11.0.1 released some
> time after? We won't know what the shape of the glyph would be until the
> era name is announced, and as the era name itself is included in the
> Unicode character name in past examples, it is also not possible to come
> up with a name for the expected new character before the era name actually
> gets announced, which means that, under the usual process, an application
> cannot really start until the era name announcement has been made. Is
> there some method to apply for inclusion of a character into Unicode
> without actually knowing what the character would be?
> Or, if it is really too difficult to encode the character within the
> little amount of time ahead of the era's start, would it be possible to
> first reserve some code points for the encoding of upcoming Japanese eras,
> so that people can know what code point they will be using instead of
> using the PUA?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Jun 2 10:04:28 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 2 Jun 2017 17:04:28 +0200
Subject: Encoding of character for new Japanese era name after Heisei
In-Reply-To:
References:
Message-ID:

Anyway, since emperor Akihito (明仁), the era starting in 1989 is no longer named after the emperor, but is Heisei (平成), "Peace everywhere". This already occurred in the past in the nengō system. There's no absolute requirement to change the era name even if there's a new Emperor named.

Anyway it is true that this is a good question, but this will not depend on the new Emperor but on experts on Japanese history, public surveys, ministry decision and legislative adoption. The switch is expected to occur on New Year's Day (Jan 1, 2019) to allow a smooth transition. It may also be delayed one more year after the nomination of the new Emperor (so year 1 of the new Emperor would still be using the Heisei era without needing any year renumbering).

The experts will anyway focus on several candidate names from well-known historic names that are most probably already encoded and long used in Japanese literature.

2017-06-02 16:49 GMT+02:00 Philippe Verdy :

> But will there really be a new era name with the new emperor? All that
> could be done is a reservation on principle, but this does not mean that
> it will really be encoded. The lack of a "representative glyph" is a
> blocker.
>
> Maybe we could instead add a generic character for "New Japanese Era"
> (independent of the actual era) to be used in contexts where the precise
> era will not be available.
> The alternative would be to write the new era name using kana (or Latin)
> before the composed kanji appears. I don't think it will block the
> localisation in CLDR even if it is later changed to use a newer preferred
> kanji when it becomes available.
>
> Anyway, the names of possible successors are probably already known: how
> do they currently write their names using kanji or composed kana in a
> square?
> These existing characters may also be used as a substitute, and I think
> this will be the solution used at least in the first months/years, even if
> there is a new honorific glyph adopted for the Emperor's name.
>
> 2017-06-02 14:23 GMT+02:00 Phake Nick via Unicode :
>
>> Nowadays Unicode has encoded four characters, U+337B through U+337E, as
>> characters for the four most recent Japanese era names, and people use
>> them quite a lot. In recent months, the intention of the Japanese emperor
>> to resign from his duties has been announced, and Japan is expected to get
>> a new era name together with the new emperor. It can be expected that
>> people would want to type a single character for the new era name, just
>> like how people type the old era names now. However, with the new era name
>> coming into effect on Jan 1, 2019, and the name of the new Japanese era
>> expected to be announced only half a year ahead of the first use of the
>> character, how will Unicode handle the new era name?
>> According to the Unicode release schedule of recent years, the
>> announcement will come only a few weeks before the official release of
>> Unicode 11.0, and well past the time of the beta. Is it possible for the
>> character to be included in Unicode 11.0, or in an 11.0.1 released some
>> time after? We won't know what the shape of the glyph would be until the
>> era name is announced, and as the era name itself is included in the
>> Unicode character name in past examples, it is also not possible to come
>> up with a name for the expected new character before the era name actually
>> gets announced, which means that, under the usual process, an application
>> cannot really start until the era name announcement has been made. Is
>> there some method to apply for inclusion of a character into Unicode
>> without actually knowing what the character would be?
>> Or, if it is really too difficult to encode the character within the
>> little amount of time ahead of the era's start, would it be possible to
>> first reserve some code points for the encoding of upcoming Japanese eras,
>> so that people can know what code point they will be using instead of
>> using the PUA?
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Jun 2 10:23:25 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 2 Jun 2017 08:23:25 -0700
Subject: Encoding of character for new Japanese era name after Heisei
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Jun 2 14:03:36 2017
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Fri, 02 Jun 2017 12:03:36 -0700
Subject: Encoding of character for new Japanese era name after Heisei
Message-ID: <20170602120336.665a7a7059d7ee80bb4d670165c8327d.82e893d4e5.wbe@email03.godaddy.com>

> Anyway, since emperor Akihito (明仁), the era starting in 1989 is no
> longer named after the emperor, but is Heisei (平成), "Peace everywhere".
> This already occurred in the past in the nengō system. There's no
> absolute requirement to change the era name even if there's a new
> Emperor named.

The Wikipedia article is instructive here (sorry, the French version doesn't seem to have the same information):

https://en.wikipedia.org/wiki/Japanese_era_name#Neng.C5.8D_in_modern_Japan

Since 1868 Japan has adhered to a system of "one reign, one era name" (一世一元). The era name is determined upon accession of the emperor and is unrelated to his birth name.
The emperor continues to be known by his birth name until his death, at which point he becomes known by the name of his era instead (so Emperor Hirohito became Emperor Sh?wa upon his death in 1989). There are no indications that the abdication of an emperor, as opposed to his death, would cause this system to be suspended. Unicode does not have an extensive history of encoding "placeholder" characters without knowing what they will actually be. This is probably a Good Thing. The four existing characters at U+337x are square compatibility characters, with decompositions to unified ideographs. So, whatever era name is chosen for the new emperor (probably Crown Prince Naruhito), there is a near-guarantee that it will be immediately representable in Unicode using normal ideographs. A new square compatibility character, if necessary, can be encoded after the era name is chosen. It might be fast-tracked at that time, as the Euro sign was, but there is no emergency about this and no reason to invent any new encoding procedures or waive any existing ones. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Jun 3 23:09:01 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Sat, 3 Jun 2017 21:09:01 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: On Wed, May 31, 2017 at 5:12 AM, Henri Sivonen wrote: > On Sun, May 21, 2017 at 7:37 PM, Mark Davis ?? via Unicode > wrote: > > There is plenty of time for public comment, since it was targeted at > Unicode > > 11, the release for about a year from now, not Unicode 10, due this year. > > When the UTC "approves a change", that change is subject to comment, and > the > > UTC can always reverse or modify its approval up until the meeting before > > release date. So there are ca. 9 months in which to comment. > > What should I read to learn how to formulate an appeal correctly? > I suggest you submit a write-up via http://www.unicode.org/reporting.html and make the case there that you think the UTC should retract http://www.unicode.org/L2/L2017/17103.htm#151-C19 *B.13.3.3 Illegal UTF-8 [Scherer, L2/17-168 *] *[151-C19 ] Consensus:* Modify the section on "Best Practices for Using FFFD" in section "3.9 Encoding Forms" of TUS per the recommendation in L2/17-168 , for Unicode version 11.0. Does it matter if a proposal/appeal is submitted as a non-member > implementor person, as an individual person member or as a liaison > member? The reporting.html form exists for gathering feedback from the public. The UTC regularly reviews and considers such feedback in its quarterly meetings. Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU ticket via http://bugs.icu-project.org/trac/newticket and make the case there, too, that you think (assuming you do) that ICU should change its handling of illegal UTF-8 sequences. > If people really believed that the guidelines in that section should have > > been conformance clauses, they should have proposed that at some point. 
> > It seems to me that this thread does not support the conclusion that > the Unicode Standard's expression of preference for the number of > REPLACEMENT CHARACTERs should be made into a conformance requirement > in the Unicode Standard. This thread could be taken to support a > conclusion that the Unicode Standard should not express any preference > beyond "at least one and at most as many as there were bytes". > Given the discussion and controversy here, in my opinion, the standard should probably tone down the "best practice" and "recommendation" language. > Aside from UTF-8 history, there is a reason for preferring a more > > "structural" definition for UTF-8 over one purely along valid sequences. > > This applies to code that *works* on UTF-8 strings rather than just > > converting them. For UTF-8 *processing* you need to be able to iterate > both > > forward and backward, and sometimes you need not collect code points > while > > skipping over n units in either direction -- but your iteration needs to > be > > consistent in all cases. This is easier to implement (especially in fast, > > short, inline code) if you have to look only at how many trail bytes > follow > > a lead byte, without having to look whether the first trail byte is in a > > certain range for some specific lead bytes. > > But the matter at hand is decoding potentially-invalid UTF-8 input > into a valid in-memory Unicode representation, so later processing is > somewhat a red herring as being out of scope for this step. I do agree > that if you already know that the data is valid UTF-8, it makes sense > to work from the bit pattern definition only. No, it's not a red herring. Not every piece of software has a neat "inside" with all valid text, and with a controllable surface to the "outside". In a large project with a small surface for text to enter the system, such as a browser with a centralized chunk of code for handling streams of input text, it might well work to validate once and then assume "on the inside" that you only ever see well-formed text. In a library with API of the granularity of "compare two strings", "uppercase a string" or "normalize a string", you have no control over your input; you cannot assume that your input is valid; you cannot crash when it's not valid; you cannot overrun your buffer; you cannot go into an endless loop. It's also cumbersome to fail with an error whenever you encounter invalid text, because you need more code for error detection & handling, and because significant C++ code bases do not allow exceptions. (Besides, ICU also offers C APIs.) Processing potentially-invalid UTF-8, iterating over it, and looking up data for it, *can* definitely be simpler (take less code etc.) if for any given lead byte you always collect the same maximum number of trail bytes, and if you have fewer distinct types of lead bytes with their corresponding sequences. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jun 4 23:08:06 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Mon, 5 Jun 2017 13:08:06 +0900 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) 
In-Reply-To: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> Message-ID: <9f34fabe-c676-5354-e4a5-63e65079dbe4@it.aoyama.ac.jp> On 2017/06/02 04:54, Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > >> even supporting 6-byte patterns just in case 20.1 bits eventually turn >> out not to be enough, Sorry to be late with this, but if 20.1 bits turn out to not be enough, what about 21 bits? That would still limit UTF-8 to four bytes, but would almost double the code space. Assuming (conservatively) that it will take about a century to fill up all 17 (well, actually 15, because two are private) planes, this would give us another century. Just one more crazy idea :-(. Regards, Martin. From unicode at unicode.org Mon Jun 5 00:22:58 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 05 Jun 2017 05:22:58 +0000 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: <9f34fabe-c676-5354-e4a5-63e65079dbe4@it.aoyama.ac.jp> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <9f34fabe-c676-5354-e4a5-63e65079dbe4@it.aoyama.ac.jp> Message-ID: On Sun, Jun 4, 2017 at 9:13 PM Martin J. D?rst via Unicode < unicode at unicode.org> wrote: > Sorry to be late with this, but if 20.1 bits turn out to not be enough, > what about 21 bits? > > That would still limit UTF-8 to four bytes, but would almost double the > code space. Assuming (conservatively) that it will take about a century > to fill up all 17 (well, actually 15, because two are private) planes, > this would give us another century. > > Just one more crazy idea :-(. > It seems hard to estimate the value of that, without knowing why we ran out of characters. A slow collection of a huge number of Chinese ideographs and new Native American scripts, maybe. Access to a library with a trillion works over billions of years from millions of species, probably not. Given that we're in no risk of running out of characters right now, speculating on this seems pointless. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jun 5 03:20:02 2017 From: unicode at unicode.org (Neil Shadrach via Unicode) Date: Mon, 5 Jun 2017 09:20:02 +0100 Subject: CLDR 'B' Message-ID: http://cldr.unicode.org/translation/date-time-patterns How are 'B' values added for languages that do not have them? I cannot see an option for this in the survey tool which just refers to the existing list. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jun 5 07:37:16 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 5 Jun 2017 13:37:16 +0100 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: <9f34fabe-c676-5354-e4a5-63e65079dbe4@it.aoyama.ac.jp> References: <20170601125445.665a7a7059d7ee80bb4d670165c8327d.5e7f59113e.wbe@email03.godaddy.com> <9f34fabe-c676-5354-e4a5-63e65079dbe4@it.aoyama.ac.jp> Message-ID: <20170605133716.6722aed4@JRWUBU2> On Mon, 5 Jun 2017 13:08:06 +0900 "Martin J. 
D?rst via Unicode" wrote: > On 2017/06/02 04:54, Doug Ewell via Unicode wrote: > > Richard Wordingham wrote: > > > >> even supporting 6-byte patterns just in case 20.1 bits eventually > >> turn out not to be enough, > > Sorry to be late with this, but if 20.1 bits turn out to not be > enough, what about 21 bits? > > That would still limit UTF-8 to four bytes, but would almost double > the code space. Assuming (conservatively) that it will take about a > century to fill up all 17 (well, actually 15, because two are > private) planes, this would give us another century. It all depends on how the lead byte is parsed. With a block-if construct ignorant of the original design or a look-up table, it may be simplest to treat F5 onwards as out and out errors and not expect any trailing bytes. Code handling attempts at 6-byte code points was the most complex case. Of course, one **might** want to handle a list of mostly small positive integers, at which point the old UTF-8 design might be useful. Richard. From unicode at unicode.org Mon Jun 5 07:32:11 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 5 Jun 2017 13:32:11 +0100 (BST) Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) In-Reply-To: <16015126.32204.1496665542572.JavaMail.root@webmail34.bt.ext.cpcloud.co.uk> References: <16015126.32204.1496665542572.JavaMail.root@webmail34.bt.ext.cpcloud.co.uk> Message-ID: <16633473.30837.1496665931402.JavaMail.defaultUser@defaultHost> Martin J. D?rst > Sorry to be late with this, but if 20.1 bits turn out to not be enough, what about 21 bits? Martin J. D?rst > That would still limit UTF-8 to four bytes, but would almost double the code space. Assuming (conservatively) that it will take about a century to fill up all 17 (well, actually 15, because two are private) planes, this would give us another century. Martin J. D?rst > Just one more crazy idea :-(. An interesting possibility for application of some of the code points of those extra planes is to encode one code point for each Esperanto word that is in the PanLex database. https://www.panlex.org/ That could provide a platform for assisting communication through the language barrier. William Overington Monday 5 June 2017 From unicode at unicode.org Mon Jun 5 10:58:46 2017 From: unicode at unicode.org (Peter Edberg via Unicode) Date: Mon, 05 Jun 2017 08:58:46 -0700 Subject: CLDR 'B' In-Reply-To: References: Message-ID: <2AF8456B-8BED-44CD-A39E-ACD4AD261662@apple.com> > On Jun 5, 2017, at 1:20 AM, Neil Shadrach via Unicode wrote: > > > http://cldr.unicode.org/translation/date-time-patterns > > How are 'B' values added for languages that do not have them? > I cannot see an option for this in the survey tool which just refers to the existing list. If you want to override the inherited pattern for one of the 5 existing 'B' skeletons (Bh, Bhm, Bhms, EBhm, EBhms) you should be able to do that with no problem, please let us know if that does not work for you. If in a particular locale you want to add another skeleton to the existing 5, please file a ticket: http://unicode.org/cldr/trac/newticket (we shoud be able to get to that within a few days) - Peter E -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jun 5 12:34:49 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 05 Jun 2017 10:34:49 -0700 Subject: Running out of code points, redux (was: Re: Feedback on the proposal...) 
Message-ID: <20170605103449.665a7a7059d7ee80bb4d670165c8327d.1018d6cbcb.wbe@email03.godaddy.com>

Martin J. Dürst wrote:

> Assuming (conservatively) that it will take about a century to fill up
> all 17 (well, actually 15, because two are private) planes, this would
> give us another century.

Current estimates seem to indicate that 800 years is closer to the mark.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Wed Jun 14 16:31:09 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 14 Jun 2017 23:31:09 +0200
Subject: Looking for 8-bit computer designers
In-Reply-To: <20170530085056.665a7a7059d7ee80bb4d670165c8327d.a153952c07.wbe@email03.godaddy.com>
References: <20170530085056.665a7a7059d7ee80bb4d670165c8327d.a153952c07.wbe@email03.godaddy.com>
Message-ID:

These old platforms still have their fans, who are easily found on social networks. There's even an active market of designs and extensions, with new products being made by them and sold online. Some fablabs are using them because of the ease with which they can be modified or tweaked. The Commodore 64 platform, for example, is very active and works on various character set designs and implementations of emulated graphics using characters as fill patterns. Some of their creations are very artistic, and frequently combine displays and computer-generated music; they frequently develop new hardware as well.

What they are doing is almost what the inventors of the early Apple II did in their garage: they built a giant and solid company from this. What they show is that despite the limitations of the display, and even with small resolution and minimal color capabilities, they can create beautiful things. The same can be said about the old Amiga, PET, Atari and Sinclair personal computers. They are no longer built by their initial companies, but they are rebuilt, thanks to 3D printing, fablabs, CAD software and traditional electronic devices, but better and more innovatively, with new materials. Even their old games are now being ported to current computers, there are tons of emulators working remarkably well, and gamers like the simplicity of these old addictive games.

2017-05-30 17:50 GMT+02:00 Doug Ewell via Unicode :

> Not as OT as it might seem:
>
> If there are any engineers or designers on this list who worked on 8-bit
> and early 16-bit legacy computers (Apple II, Atari, Commodore, Tandy,
> etc.), and especially on character set design for these machines, please
> contact me privately at . Any desired degree of
> anonymity and confidentiality will be honored.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Wed Jun 14 16:47:58 2017
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Wed, 14 Jun 2017 14:47:58 -0700
Subject: Looking for 8-bit computer designers
Message-ID: <20170614144758.665a7a7059d7ee80bb4d670165c8327d.be210e2975.wbe@email03.godaddy.com>

Philippe Verdy wrote:

> These old platforms still have their fans, who are easily found on
> social networks. [...]

We know this. That's why a group of us is working on a proposal to add missing characters from these platforms. Some of the platforms have really obscure and hard-to-decipher characters, and we were looking for insight from the original folks who worked on them. We have no shortage of present-day expertise.
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Jun 17 02:59:41 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Sat, 17 Jun 2017 08:59:41 +0100 (BST) Subject: The management of the encoding process of emoji In-Reply-To: <1036413.42552.1497632385613.JavaMail.root@webmail39.bt.ext.cpcloud.co.uk> References: <1036413.42552.1497632385613.JavaMail.root@webmail39.bt.ext.cpcloud.co.uk> Message-ID: <11080423.4556.1497686381079.JavaMail.defaultUser@defaultHost> I have been reading the following document. http://www.unicode.org/L2/L2017/17192-response-cmts.pdf Comments in response to L2-17/147 To: UTC From: Peter Edberg & Mark Davis, for the Emoji Subcommittee Date: 2017 June 15 For convenience, here is a link to the L2-17/147 document. http://www.unicode.org/L2/L2017/17147-emoji-subcommittee.pdf In relation to the 17192-response-cmts.pdf document I write about two particular matters. In section 3.b of the document, there is the following. > ...; a great many proposals are received, many in an informal way, and many are ill-formed (a significant number come from children). How does Unicode Inc. respond to children please? As emoji are picture characters, just pieces of art, rather than something with safety issues such as, say, designing a new railway locomotive, is encouragement given to the children? In section 2.a.iv is the following. > Two key issues are whether the characters are likely to be popular and whether they would be supported by major vendors. I am rather concerned at what I am calling majorvendorism. I am concerned that progress in encoding may become subject to majorvendorization whereby only new ideas acceptable to at least one of a small number of major vendors can make progress. On many modern personal computers, fonts used by an end user do not necessarily need to have been produced by the producer of the operating system. Mostly, application programs can use any font that complies to the font standard. Fonts can be produced by many people, including an individual sat at home using a home computer using budget fontmaking software. That includes colour fonts. Fonts can be distributed electronically over the Internet, either for a fee or at no charge as desired by the publisher of the font. So a font produced other than by a major vendor could become widely used even though it is not bundled with an operating system. So there seems to me to be no fair reason for Unicode Inc. to include majorvendorism in its decision-making process. If a major vendor chooses, for commercial reasons, not to support some emoji then that is a matter for that major vendor and should not be a factor in the Unicode encoding process. I opine that progress should not be majorvendorized as that may impede the implementation of new ideas from individuals and small enterprises and new enterprises that are not connected to a major vendor. 
William Overington Saturday 17 June 2017 From unicode at unicode.org Mon Jun 19 07:14:05 2017 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Mon, 19 Jun 2017 14:14:05 +0200 (CEST) Subject: The management of the encoding process of emoji In-Reply-To: <11080423.4556.1497686381079.JavaMail.defaultUser@defaultHost> References: <1036413.42552.1497632385613.JavaMail.root@webmail39.bt.ext.cpcloud.co.uk> <11080423.4556.1497686381079.JavaMail.defaultUser@defaultHost> Message-ID: <743756490.159753.1497874445173@ox.hosteurope.de> William_J_G Overington: > > http://www.unicode.org/L2/L2017/17192-response-cmts.pdf > http://www.unicode.org/L2/L2017/17147-emoji-subcommittee.pdf > >> Two key issues are whether the characters are likely to be popular >> and whether they would be supported by major vendors. > > I am rather concerned at what I am calling majorvendorism. (...) > only new ideas acceptable to at least one of a small number of major vendors can make progress. I very much share this concern. > (...) a font produced other than by a major vendor could become widely used > even though it is not bundled with an operating system. > > So there seems to me to be no fair reason for Unicode Inc. to include majorvendorism in its decision-making process. > > If a major vendor chooses, for commercial reasons, not to support some emoji > then that is a matter for that major vendor and should not be a factor in the Unicode encoding process. I'm about to propose emojis for several body parts and am afraid that, although there is huge demand and a lot of prior art for them, it's futile for reproductive organs, because American vendors in particular, who make up the vast majority of (voting members in) the consortium, have a history of self-censorship in this regard (cf. Egyptian hieroglyphics) which they likely will extend upon others because they *feel* obliged to support graphically explicit images in their own emoji sets. That's just speculation, of course, because I don't and can't know whether there already has been a similar proposal that the ESC just declined to publish and forward to the UTC. This lack of documentation is a major point of L2/17-147, which hopefully gets addressed soon in the very basic way mentioned in 3.c. of the response: >> The ESC has been working on a list of at least the submitted names for proposed emoji, >> and is planning to make that public in the near future. It's also not unfounded speculation, as the case of the Rifle character shows, which made it into late beta stage of Unicode 9.0 as an emoji, but had its Emoji property withdrawn after a joint request by Apple and Google. Some vendors had already added support for it, but dropped it because they somehow felt obliged to not ship multi-color glyphs for "arbitrary" characters (whereas vendors such as Microsoft and Samsung, a non-member, rightfully don't seem to care much about that). If emojis were treated as proper characters that can come from any font, there would hardly be a problem, at least on platforms where users can install or embed custom typefaces. The fact that some vendors are either not able or not willing to change how emojis work on their operating systems, should not impact the proposal and encoding process for Unicode characters. 
From unicode at unicode.org Wed Jun 21 10:37:17 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 21 Jun 2017 16:37:17 +0100 (BST) Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_10.0?= In-Reply-To: <59498994.2030206@unicode.org> References: <59498994.2030206@unicode.org> Message-ID: <27418068.38928.1498059437514.JavaMail.defaultUser@defaultHost> Here is a mnemonic poem, that I wrote on Monday 20 February 2017, now published as U+1F91F is now officially in The Unicode Standard. One eff nine one eff Is the code number to say In one symbol A very special message To a loved one far away In an email Or a message of text From unicode at unicode.org Wed Jun 21 12:01:58 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 21 Jun 2017 10:01:58 -0700 Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=c2=ae_Standard=2c_Version_10.0?= In-Reply-To: <27418068.38928.1498059437514.JavaMail.defaultUser@defaultHost> References: <59498994.2030206@unicode.org> <27418068.38928.1498059437514.JavaMail.defaultUser@defaultHost> Message-ID: <2791f362-1d4c-d7de-0929-07eeb573139f@att.net> I wonder IF 9 times suffice, But IF more are required, I'll tweet ILY, tweet it twice -- Since spelling's been retired. On 6/21/2017 8:37 AM, William_J_G Overington via Unicode wrote: > Here is a mnemonic poem, that I wrote on Monday 20 February 2017, now published as U+1F91F is now officially in The Unicode Standard. > > One eff nine one eff > Is the code number to say > In one symbol > A very special message > To a loved one far away > > In an email > Or a message of text > > From unicode at unicode.org Wed Jun 21 12:58:27 2017 From: unicode at unicode.org (Michael Everson via Unicode) Date: Wed, 21 Jun 2017 18:58:27 +0100 Subject: =?utf-8?Q?Re=3A_Announcing_The_Unicode=C2=AE_Standard=2C_Version_?= =?utf-8?Q?10=2E0?= In-Reply-To: <2791f362-1d4c-d7de-0929-07eeb573139f@att.net> References: <59498994.2030206@unicode.org> <27418068.38928.1498059437514.JavaMail.defaultUser@defaultHost> <2791f362-1d4c-d7de-0929-07eeb573139f@att.net> Message-ID: <13C8DB6C-1F45-4771-8AEA-A1A54E2F9868@evertype.com> High 101F6, high 101F6, High FE4F? ? > On 21 Jun 2017, at 18:01, Ken Whistler via Unicode wrote: > > I wonder IF 9 times suffice, > But IF more are required, > I'll tweet ILY, tweet it twice -- > Since spelling's been retired. > > > On 6/21/2017 8:37 AM, William_J_G Overington via Unicode wrote: >> Here is a mnemonic poem, that I wrote on Monday 20 February 2017, now published as U+1F91F is now officially in The Unicode Standard. >> >> One eff nine one eff >> Is the code number to say >> In one symbol >> A very special message >> To a loved one far away >> >> In an email >> Or a message of text >> >> > From unicode at unicode.org Thu Jun 22 08:37:46 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Thu, 22 Jun 2017 13:37:46 +0000 Subject: 10.0 Code Charts Message-ID: When are the code charts (http://www.unicode.org/charts/) going to be updated for Unicode 10.0? Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 22 10:35:24 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 22 Jun 2017 08:35:24 -0700 Subject: 10.0 Code Charts Message-ID: <20170622083524.665a7a7059d7ee80bb4d670165c8327d.70d676d022.wbe@email03.godaddy.com> Michael Bear wrote: > When are the code charts (http://www.unicode.org/charts/) going to be > updated for Unicode 10.0? 
They look fine to me. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu Jun 29 13:32:51 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 29 Jun 2017 11:32:51 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: On Sat Jun 3 23:09:01 CDT 2017Sat Jun 3 23:09:01 CDT 2017 Markus Scherer wrote: > I suggest you submit a write-up via http://www.unicode.org/reporting.html > > and make the case there that you think the UTC should retract > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 The submission has been made: http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf > Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU > ticket via http://bugs.icu-project.org/trac/newticket Although they use ICU for most legacy encodings, they don't use ICU for UTF-8. Hence, the difference between Chrome and ICU in the above write-up. > and make the case there, too, that you think (assuming you do) that ICU > should change its handling of illegal UTF-8 sequences. Whether I think ICU should change isn't quite that simple. On one hand, a key worry that I have about Unicode changing the long-standing guidance for UTF-8 error handling is that inducing implementations to change (either by the developers feeling that they have to implement the "best practice" or by others complaining when "best practice" isn't implemented) is wasteful and a potential source of bugs. In that sense, I feel I shouldn't ask ICU to change, either. On the other hand, I care about implementations of the WHATWG Encoding Standard being compliant and it appears that Node.js is on track to exposing ICU's UTF-8 decoder via the WHATWG TextDecoder API: https://github.com/nodejs/node/pull/13644 . Additionally, this episode of ICU behavior getting cited in a proposal to change the guidance in the Unicode Standard is a reason why I'd be happier if ICU followed the Unicode 10-and-earlier / WHATWG behavior, since there wouldn't be the risk of ICU's behavior getting cited as a different reference as happened with the proposal to change the guidance for Unicode 11. Still, since I'm not affiliated with the Node.js implementation, I'm a bit worried that if I filed an ICU bug on Node's behalf, I'd be engaging in the kind of behavior towards ICU that I don't want to see towards other implementations, including the one I've written, in response to the new pending Unicode 11 guidance (which I'm requesting be retracted), so at this time I haven't filed an ICU bug on Node's behalf and have instead mentioned the difference between ICU and the WHATWG spec when my input on testing the Node TextDecoder implementation was sought (https://github.com/nodejs/node/issues/13646#issuecomment-308084459). >> But the matter at hand is decoding potentially-invalid UTF-8 input >> into a valid in-memory Unicode representation, so later processing is >> somewhat a red herring as being out of scope for this step. I do agree >> that if you already know that the data is valid UTF-8, it makes sense >> to work from the bit pattern definition only. > > No, it's not a red herring. Not every piece of software has a neat "inside" > with all valid text, and with a controllable surface to the "outside". Fair enough. However, I don't think this supports adopting the ICU behavior as "best practice" when looking at a prominent real-world example of such a system. 
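Before getting to that example, here is a rough sketch of what the Unicode 10-and-earlier / WHATWG behavior amounts to: one U+FFFD per maximal subpart of an ill-formed sequence. It is written purely for illustration -- it is not ICU's, Go's, or any other shipping decoder's actual code -- and the LEAD table and the function name are made up for this sketch; the table merely restates the byte ranges of Table 3-7.

# A sketch of "maximal subpart" error handling for UTF-8 decoding.
# For each lead byte: (number of trail bytes, allowed range for the first
# trail byte). The second and third trail bytes are always 0x80..0xBF.
LEAD = {}
for b in range(0xC2, 0xE0):
    LEAD[b] = (1, (0x80, 0xBF))
LEAD[0xE0] = (2, (0xA0, 0xBF))
for b in range(0xE1, 0xED):
    LEAD[b] = (2, (0x80, 0xBF))
LEAD[0xED] = (2, (0x80, 0x9F))
LEAD[0xEE] = (2, (0x80, 0xBF))
LEAD[0xEF] = (2, (0x80, 0xBF))
LEAD[0xF0] = (3, (0x90, 0xBF))
for b in range(0xF1, 0xF4):
    LEAD[b] = (3, (0x80, 0xBF))
LEAD[0xF4] = (3, (0x80, 0x8F))

def decode_utf8_with_fffd(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # ASCII passes through
            out.append(chr(b))
            i += 1
            continue
        if b not in LEAD:            # C0, C1, F5..FF, or a stray trail byte
            out.append('\uFFFD')
            i += 1
            continue
        need, (lo, hi) = LEAD[b]
        j = i + 1
        # The lead-byte-specific range for the first trail byte is what rules
        # out non-shortest forms, surrogates and values above U+10FFFF.
        if j < len(data) and lo <= data[j] <= hi:
            j += 1
            while j < i + 1 + need and j < len(data) and 0x80 <= data[j] <= 0xBF:
                j += 1
        if j == i + 1 + need:        # complete, well-formed sequence
            cp = b & (0x7F >> (need + 1))
            for t in data[i + 1:j]:
                cp = (cp << 6) | (t & 0x3F)
            out.append(chr(cp))
        else:                        # one U+FFFD for the whole maximal subpart
            out.append('\uFFFD')
        i = max(j, i + 1)
    return ''.join(out)

For instance, decode_utf8_with_fffd(b'\xE1\x80\xE2') yields two U+FFFD code points (E1 80 is a maximal subpart of a valid sequence, and the final E2 is another ill-formed subpart on its own), whereas a policy that replaces every bogus byte individually would yield three for the same input.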
The Go programming language is an example of a system that post-dates UTF-8, was even designed by the same people as UTF-8, and where strings in memory are potentially-invalid UTF-8, i.e. there isn't a clear distinction with UTF-8 on the outside and UTF-8 on the inside. (In contrast to e.g. Rust, where the type system maintains a clear distinction between byte buffers and strings, and strings are guaranteed-valid UTF-8.) Go bakes UTF-8 error handling into the language spec by specifying per-code point iteration over potentially-invalid in-memory UTF-8 buffers. See item 2 in the list at https://golang.org/ref/spec#For_range . The behavior baked into the language is one REPLACEMENT CHARACTER per bogus byte, which is neither the Unicode 10-and-earlier "best practice" nor the ICU behavior. However, it is closer to the Unicode 10-and-earlier "best practice" than to the ICU behavior. (It differs from the Unicode 10-and-earlier behavior only for truncated sequences that form a prefix of a valid sequence.)

(To be clear, I am not saying that the guidance in the Unicode Standard should be changed to match Go, either. I'm just saying that Go is an example of a prominent system with ambiguous inside and outside for UTF-8, and it exhibits behavior closer to Unicode 10 than to ICU and, therefore, is not a data point in favor of adopting the ICU behavior.)

--
Henri Sivonen
hsivonen at mozilla.com

From unicode at unicode.org Fri Jun 30 10:29:08 2017
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Fri, 30 Jun 2017 17:29:08 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
Message-ID: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de>

Hello,

der Rat für deutsche Rechtschreibung, which is responsible for the further development of the official German orthography, has finally recognized LATIN CAPITAL LETTER SHARP S as a possible upper-case equivalent for the LATIN SMALL LETTER SHARP S.

The report announcing the change is dated 2016-12-08, but the official rules have been updated only yesterday, so the change is currently in the news (not very prominently, though).

The pertinent section from the official 2017 rules reads thusly:

> § 25 E3
> Bei Schreibung mit Großbuchstaben schreibt man SS.
> Daneben ist auch die Verwendung des Großbuchstabens
> ẞ möglich. Beispiel: Straße – STRASSE – STRAẞE.

Which translates to:

> When writing all caps, you spell SS.
> Alternatively, it is possible to use the upper-case ẞ.
> Example: Straße – STRASSE – STRAẞE.

So, SS remains the primary upper-case equivalent of ß. Yesterday's note to the press says that the capital ẞ is meant mainly for passports and similar official documents, which have to reproduce personal names faithfully in their respective spelling variants. E. g., passports have to distinguish proper names such as GROßMANN and GROSSMANN; up to now, they have usually spelled GROßMANN, with a small ß between the capitals, which renders ugly in most fonts.

Best wishes,
Otto

From unicode at unicode.org Fri Jun 30 10:34:24 2017
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Fri, 30 Jun 2017 16:34:24 +0100
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de>
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de>
Message-ID: <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>

It would be sensible to case-map ß to ẞ however.
> On 30 Jun 2017, at 16:29, Otto Stolz via Unicode wrote:
>
> Hello,
>
> der Rat für deutsche Rechtschreibung, which is responsible for the further development of the official German orthography, has finally recognized LATIN CAPITAL LETTER SHARP S as a possible upper-case equivalent for the LATIN SMALL LETTER SHARP S.
>
> The report announcing the change is dated 2016-12-08, but the official rules have been updated only yesterday, so the change is currently in the news (not very prominently, though).
>
> The pertinent section from the official 2017 rules reads thusly:
>> § 25 E3
>> Bei Schreibung mit Großbuchstaben schreibt man SS. Daneben ist auch die Verwendung des Großbuchstabens ẞ möglich. Beispiel: Straße – STRASSE – STRAẞE.
>
> Which translates to:
>> When writing all caps, you spell SS. Alternatively, it is possible to use the upper-case ẞ. Example: Straße – STRASSE – STRAẞE.
>
> So, SS remains the primary upper-case equivalent of ß. Yesterday's note to the press says that the capital ẞ is meant mainly for passports and similar official documents, which have to reproduce personal names faithfully in their respective spelling variants. E. g., passports have to distinguish proper names such as GROßMANN and GROSSMANN; up to now, they have usually spelled GROßMANN, with a small ß between the capitals, which renders ugly in most fonts.
>
> Best wishes,
> Otto

From unicode at unicode.org Fri Jun 30 11:48:07 2017
From: unicode at unicode.org (Mathias Bynens via Unicode)
Date: Fri, 30 Jun 2017 18:48:07 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de> <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>
Message-ID:

On Fri, Jun 30, 2017 at 5:34 PM, Michael Everson via Unicode wrote:
>
> It would be sensible to case-map ß to ẞ however.

I'm hoping this can happen -- converting ß to SS is lossy, so mapping to ẞ would be far superior.

However, the stability policy says:

"If two characters form a case pair in a version of Unicode, they will remain a case pair in each subsequent version of Unicode.

If two characters do not form a case pair in a version of Unicode, they will never become a case pair in any subsequent version of Unicode."

From unicode at unicode.org Fri Jun 30 12:02:31 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 30 Jun 2017 19:02:31 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To:
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de> <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>
Message-ID:

True, but this only applies to "simple case mappings" (those in the main database), not to extended mappings (which are locale-dependent, such as the mappings for dotted and undotted i in Turkish). So the extended mappings can perfectly well be changed for German: they are not part of the stability policy and are designed to be extensible. And this is where you find the existing mapping from ß to SS (lossy case conversion), that will change to ẞ (non-lossy case conversion).

2017-06-30 18:48 GMT+02:00 Mathias Bynens via Unicode :

> On Fri, Jun 30, 2017 at 5:34 PM, Michael Everson via Unicode wrote:
> >
> > It would be sensible to case-map ß to ẞ however.
>
> I'm hoping this can happen -- converting ß to SS is lossy, so mapping
> to ẞ would be far superior.
>
> However, the stability policy says:
>
> "If two characters form a case pair in a version of Unicode, they will
> remain a case pair in each subsequent version of Unicode.
> If two characters do not form a case pair in a version of Unicode, they
> will never become a case pair in any subsequent version of Unicode."
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Jun 30 20:15:40 2017
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Sat, 1 Jul 2017 03:15:40 +0200 (CEST)
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To:
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de> <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>
Message-ID: <1590256394.310836.1498871740833@ox.hosteurope.de>

Letters in some scripts are a class of two or more characters. Usually, all letters have the same number of such case variants. Rarely, characters may be constituents of different letters within the same script. A closed set of letters, usually with a canonical sort order, makes an alphabet. Every writing system employs exactly one alphabet for each script it supports. Most writing systems only support a single script. Writing systems may have multiple systematically related orthographies, i.e. rules for combining letters into graphemes and these into words.

Any Unicode case pair is intended to be equivalent to a letter, but in some cases fails to be this. It fails in the case of Turkish , because every character can only be part of a single case pair. It fails in the case of German , because a categorical error (that cannot be corrected for compatibility and stability reasons) had been made: a grapheme rule was recorded as a letter rule.
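To make the simple-versus-tailored distinction concrete, here is a minimal sketch of locale-tailored uppercasing. The function name, the locale keys and the tiny tables are invented for this illustration -- they are not CLDR data or any library's actual API -- and they only restate the well-known Turkish dotted/dotless i pairs and the two possible treatments of U+00DF.

# A minimal sketch of locale-tailored uppercasing (illustration only).
DEFAULT_FULL_UPPER = {
    '\u00DF': 'SS',              # ß -> SS, the traditional full uppercase mapping
}

TAILORED_UPPER = {
    # Turkish: dotted and dotless i form their own pairs.
    'tr': {'i': '\u0130',        # i -> İ
           '\u0131': 'I'},       # ı -> I
    # A hypothetical German tailoring in the spirit of the 2017 rule change:
    # prefer the one-to-one capital sharp s over the lossy ß -> SS.
    'de-capital-sharp-s': {'\u00DF': '\u1E9E'},   # ß -> ẞ
}

def upper(text: str, locale: str = 'und') -> str:
    """Uppercase text, consulting a per-locale tailoring before the default."""
    tailoring = TAILORED_UPPER.get(locale, {})
    out = []
    for ch in text:
        if ch in tailoring:
            out.append(tailoring[ch])
        elif ch in DEFAULT_FULL_UPPER:
            out.append(DEFAULT_FULL_UPPER[ch])
        else:
            out.append(ch.upper())
    return ''.join(out)

# upper('Straße')                       -> 'STRASSE'
# upper('Straße', 'de-capital-sharp-s') -> 'STRAẞE'
# upper('istanbul', 'tr')               -> 'İSTANBUL'

Whether ß/ẞ is treated as a one-to-one case pair or ß is expanded to SS is exactly the kind of orthography-level choice discussed above; the sketch only shows where such a tailoring would plug in, separately from the stable simple case mappings.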