From unicode at unicode.org Mon May 1 00:17:05 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 1 May 2017 07:17:05 +0200 Subject: Tibetan Paluta In-Reply-To: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode : > Just about the name paluta: > In Sanskrit, the length of vowels is measured in maatra (a cognate of the > word 'meter'). It is the spoken length of a short vowel. In Latin it is > termed mora. Usually, you have only single and double length vowels. A > paluta length is like when you call out somebody from a distance. Pluta is > a careless use of spelling. Virama and Halanta are two other terms loosely > used. > > Anyway, Unicode is only about DISPLAYING a script: There's a shape here; > Let's find how to get it by assembling other shapes or by creating a code > point for it. What is short, long or longer in speech is no concern for > Unicode. > Wrong. Unicode is absolutely not about how to "display" any script (except symbols and notational symbols). Unicode does not encode glyphs. Unicode encodes "abstract characters" according to their semantics, in order to assign them properties allowing meaningful transformations of text and in order to allow performing searches (with collation algorithms). What is important is their properties (something that ISO 10646 did not care about when it started the UCS as a separate project, ignoring how it would be used, focusing too much on apparent glyphs, and introducing a lot of "compatibility characters" that would not have been encoded otherwise, creating some havoc in logical processing). Anyway Unicode makes some exceptions to the logical model only for roundtrip compatibility with other standards that used another encoding model widely used, notably in Thai: these are the exceptions where there are "prepended" letters. There was some havoc also for some scripts in India because of roundtrip compatibility with an Indian standard (criticized by many users of Tamil and some other Southern Indic scripts that don't follow directly the paradigm created for getting some limited transliteration with Devanagari: that initial desire was abandoned but the legacy Indic scripts in India were imported as is to Unicode). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 07:14:18 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 1 May 2017 13:14:18 +0100 Subject: Unicode is more than shapes (was: Tibetan Paluta) In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: <20170501131418.665947ee@JRWUBU2> On Mon, 1 May 2017 07:17:05 +0200 Philippe Verdy via Unicode wrote: > 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode > : > > Anyway, Unicode is only about DISPLAYING a script: There's a shape > > here; Let's find how to get it by assembling other shapes or by > > creating a code point for it. What is short, long or longer in > > speech is no concern for Unicode. When there is considerable variation in shape, describing the function of a character can be of great help in determining the character code to enter for some relatively obscure character. > Wrong. Unicode is absolutely not about how to "display" any script > (except symbols and notational symbols). Unicode does not encode > glyphs.
Unicode encodes "abstract characters" according to their > semantics, in order to assign them properties allowing meaningful > transformations of text and in order to allow perfoirming searches > (with collation algorithms). Of course, display is a very important transformation process! However, for many applications, an important part of display is knowing when to split text between lines, and in easy cases that can be done using knowledge of character properties. In hard cases, the user has to insert line-breaking permissions and even prohibitions. There are special characters for these functions. It's somewhat misleading to say that searches use collation algorithms. What is true is that folding can use enough of the same computational processes that much of the code for collation may be re-used for search. Different data tables are frequently appropriate. > Anyway Unciode makes some exceptions to the logical model only for > roundtrip comptaibility with other standards that used another > encoding model widely used, notably in Thai: these are the exception > where there are "prepended" letters. What "logical" model? I don't think you know how Thai works. The key feature is that the Indic consonant stack has no delimiter in Thai, which makes the phonetic placement of preposed vowels ambiguous. In some of the other relevant features that I am aware of, Lao works quite differently. Tai Viet was encoded in visual order. You forget one other change. New Tai Lue switched from phonetic order to visual order because it hadn't been worth Microsoft's while to implement the simple rendering engine. The Universal Shaping Engine (USE) should prevent this happening again with straightforward complex scripts, but good intentions (namely, replacing the working renderer from HarfBuzz and thus Firefox, Chrome and LibreOffice with an emulation of the USE) may unintentionally repeat the process with 'Old Tai Lue'. Using phonetic order in Tai Tham distinguishes homographs (if I may use the term here) that would usually be collated differently. > There was some havoc also for > some scripts in India because of roundtrip compatiblity with an > Indian standard (criticized by many users of Tamil and some other > Southern Indic scripts that don't follow directly the paradigm > created for getting some limited transliteration with Devanagari: > that initial desire was abandoned but the legacy Indic scripts in > India were imported as is to Unicode) The havoc is because half-forms are a north Indian innovation, not an ancient Indic feature. Tamil suffered from the ISCII conflation of combining and merely having no vowel, the Unicode virama. Tibetan and Khmer led the way in splitting the concepts, and the Unicode virama in the Myanmar script was disunified into an invisible stacker and a pure killer. Many of the Tamil complaints arise because the implicit vowel is ill-suited to Tamil, but an attempt to move away from that system about two thousand years ago did not persist. Richard. From unicode at unicode.org Mon May 1 09:19:27 2017 From: unicode at unicode.org (Naena Guru via Unicode) Date: Mon, 1 May 2017 19:49:27 +0530 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: This whole attempt to make digitizing Indic script some esoteric, 'abstract', 'semantic representation' and so on seems to me is an attempt to make Unicode the realm of the some super humans. 
The purpose of writing is to represent speech. It is not some secret that demi-gods created, which we are trying to explain with 'modern' linguistic gymnastics. sound => letter that is the basis for writing. English writing was massacred when printing was brought in from Europe. A similar thing is happening to Indic by all this mumbo-jumbo. I call out to NATIVE users of Indic to explain what apparently Europeans or Americans are discussing here. On 5/1/2017 10:47 AM, Philippe Verdy wrote: > > > 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode >: > > Just about the name paluta: > In Sanskrit, the length of vowels is measured in maatra (a > cognate of the word 'meter'). It is the spoken length of a short > vowel. In Latin it is termed mora. Usually, you have only single > and double length vowels. A paluta length is like when you call > out somebody from a distance. Pluta is a careless use of spelling. > Virama and Halanta are two other terms loosely used. > > Anyway, Unicode is only about DISPLAYING a script: There's a shape > here; Let's find how to get it by assembling other shapes or by > creating a code > point for it. What is short, long or longer in > speech is no concern for > Unicode. > > > Wrong. Unicode is absolutely not about how to "display" any script > (except symbols and notational symbols). Unicode does not encode > glyphs. Unicode encodes "abstract characters" according to their > semantics, in order to assign them properties allowing meaningful > transformations of text and in order to allow performing searches > (with collation algorithms). What is important is their properties > (something that ISO 10646 did not care about when it started the UCS as a > separate project, ignoring how it would be used, focusing too much on > apparent glyphs, and introducing a lot of "compatibility characters" > that would not have been encoded otherwise, creating some havoc in > logical processing). > > Anyway Unicode makes some exceptions to the logical model only for > roundtrip compatibility with other standards that used another > encoding model widely used, notably in Thai: these are the exceptions > where there are "prepended" letters. There was some havoc also for > some scripts in India because of roundtrip compatibility with an Indian > standard (criticized by many users of Tamil and some other Southern > Indic scripts that don't follow directly the paradigm created for > getting some limited transliteration with Devanagari: that initial > desire was abandoned but the legacy Indic scripts in India were > imported as is to Unicode) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 10:25:28 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 1 May 2017 16:25:28 +0100 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: <20170501162528.171f631b@JRWUBU2> On Mon, 1 May 2017 19:49:27 +0530 Naena Guru via Unicode wrote: > The purpose of writing is to represent speech. It is not some secret > that demi-gods created Sarasvati and Thoth would be offended at being called mere demi-gods. > sound => letter that is the basis for writing. "=>" is not a particularly phonetic notation. It took quite a while for letters to become the primary part of writing anywhere, and they are not a universal phenomenon. Richard.
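A side note on the search point made earlier in this thread: what search code usually relies on is folding rather than a full collation. A minimal sketch in Python, standard library only (a real implementation would use proper collation data such as ICU's, and this crude fold does not preserve offsets into the original text):

import unicodedata

def fold(s: str) -> str:
    # Case-fold, decompose (NFKD), and drop combining marks, so that accented
    # and unaccented spellings match at roughly "primary strength".
    decomposed = unicodedata.normalize("NFKD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def folded_find(haystack: str, needle: str) -> bool:
    # Substring search on the folded forms.
    return fold(needle) in fold(haystack)

print(folded_find("Überlingen café", "CAFE"))  # True
print(folded_find("Überlingen café", "uber"))  # True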
From unicode at unicode.org Mon May 1 10:26:04 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Mon, 1 May 2017 16:26:04 +0100 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: On 1 May 2017, at 15:19, Naena Guru via Unicode wrote: > > This whole attempt to make digitizing Indic script some esoteric, 'abstract', 'semantic representation' and so on seems to me is an attempt to make Unicode the realm of the some super humans. No. It's important so that the standard Unicode algorithms function acceptably for Indic languages. The design of Unicode is such that, compatibility characters and some other special cases aside, it encodes semantics as opposed to graphic representations. > The purpose of writing is to represent speech. Yes, and Unicode is intended to give us a representation of speech *that is amenable to machine processing*. The other extreme is what used to happen on many Chinese and Japanese websites, namely “representing speech” by way of an image - if you want to process the text in one of those images, well, good luck with that (you'll want to start with some kind of OCR). Perhaps part of the problem here is that Unicode sits at the intersection between linguistics and software engineering; the discussion of both sides of this is likely to be quite technical, some of the vocabulary used might well seem like “mumbo jumbo”, just as some of the design decisions might not make sense if your expertise is mainly on one side or mainly on the other (or, for that matter, if you have little exposure to other languages or the challenges inherent in encoding or rendering them). However, for all that it might *sound* like “mumbo jumbo” to you, it is not. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Mon May 1 12:28:59 2017 From: unicode at unicode.org (Naena Guru via Unicode) Date: Mon, 1 May 2017 22:58:59 +0530 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: <20170501162528.171f631b@JRWUBU2> References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> <20170501162528.171f631b@JRWUBU2> Message-ID: <23ffaf0c-db97-2ebe-9b7c-aecf78023f90@gmail.com> A little humor is very good. sarasvati was a sweet girl, I am sure, so much so that when she died, I think, those who were imagining about her beyond practical, made her rise up, up and fly away. Now you watch what happens to Elizabeth when she dies. They narrowly failed making one such with Hillary Clinton as she is suspected of having Parkinson's which condition her daughter says has an anecdotal remedy with MaryJane. Hmmm... Who went to her daughter's house instead of to the doctor when they suddenly fell? As for Thoth, he is okay. Don't worry. Egyptian man => demi-god => god has not much of a consequence in the West dominated culture of this day. On 5/1/2017 8:55 PM, Richard Wordingham via Unicode wrote: > On Mon, 1 May 2017 19:49:27 +0530 > Naena Guru via Unicode wrote: > >> The purpose of writing is to represent speech. It is not some secret >> that demi-gods created > Sarasvati and Thoth would be offended at being called mere demi-gods. > >> sound => letter that is the basis for writing. > "=>" is not a particularly phonetic notation. It took quite a while > for letters to become the primary part of writing anywhere, and they > are not a universal phenomenon. > > Richard. Okay, Richard.
You probably have knowledge of how writing evolved in the whole world. Tell us how it was in South Asia. Was it like I said, sound => letter? I assume only to know about English and Indic in this respect. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 14:12:22 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Mon, 1 May 2017 19:12:22 +0000 Subject: How to Add Beams to Notes Message-ID: I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters in the unmapped positions of that block (e.g. U+1D127 inverts the stem of the previous note, U+1D1E9 is a ledger line, U+1D1EA is the "TAB" clef, U+1D1F0-U+1D1FC position the note along the staff, etc.) I've had no problem so far, but now I need to do beamed notes. The Unicode block has control characters for beginning and ending a series of beamed notes (U+1D173 and U+1D174, respectively), but I'm not really sure how to add beams to the notes while keeping the pitch intact. I know I'll obviously need OpenType for this. Slanted beams would be preferred, but straight beams are acceptable. It will need to support beams added on for longer notes. Can someone help me with this? I had asked this on a High Logic Font Creator forum (here), and someone said to subscribe to your mailing list and ask you guys. So here I am! Anyway, help, please? Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 15:04:29 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 1 May 2017 13:04:29 -0700 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 15:53:33 2017 From: unicode at unicode.org (=?iso-8859-1?Q?St=F6tzner_Signographie?= via Unicode) Date: Mon, 1 May 2017 22:53:33 +0200 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: Bad news, I'm afraid. What is the intended usage of your font? Music score applications? others? The overall problem with musical notation is that there is no comprehensive character encoding standard and no generally working text and layout composing method established. In the light of that fact it is hopeless to make fonts for this. The fonts are not the problem (yes they are, there is no solid encoding scheme available), but the lack of composing syntax is the crux you'll hardly overcome. If you need to cater for a specific usage scenario you'll end up with a complete hack anyway (however it may look, doesn't matter). Good luck! A. Stötzner (Musical notation project) On 01.05.2017 at 21:12, Michael Bear via Unicode wrote: > I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters in the unmapped positions of that block (e.g. U+1D127 inverts the stem of the previous note, U+1D1E9 is a ledger line, U+1D1EA is the "TAB" clef, U+1D1F0-U+1D1FC position the note along the staff, etc.) I've had no problem so far, but now I need to do beamed notes.
The Unicode block has control characters for beginning and ending a series of beamed notes (U+1D173 and U+1D174, respectively), but I'm not really sure how to add beams to the notes while keeping the pitch intact. I know I'll obviously need OpenType for this. Slanted beams would be preferred, but straight beams are acceptable. It will need to support beams added on for longer notes. Can someone help me with this? > > I had asked this on a High Logic Font Creator forum (here), and someone said to subscribe to your mailing list and ask you guys. So here I am! Anyway, help, please? > > Sent from Mail for Windows 10 > _______________________________________________________________________________ Andreas Stötzner Gestaltung Signographie Fontentwicklung Haus des Buches Gerichtsweg 28, Raum 434 04103 Leipzig 0176-86823396 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 18:03:53 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Mon, 1 May 2017 23:03:53 +0000 Subject: How to Add Beams to Notes In-Reply-To: References: , Message-ID: “Rather than using "unused code positions", I would always recommend to use some of the Private Use code points.” Consider it done. “What is the intended usage of your font? Music score applications? others?” I am simply going to make a series of full Unicode fonts (which, due to the 65,535-character limit in fonts, each of the 3 fonts covers different planes: The first font does the BMP, the second one does the SMP, and the third one is all the other planes, which are vacant enough to fit in one font) that will have the necessary OpenType features of every script. And I thought “Hey, maybe I should do full OT for the music block that no one has really done yet! How awesome would that be?” So I made a test font to work it out, but I ran into this one pothole. That's when I came here. Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 19:01:08 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 02 May 2017 00:01:08 +0000 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: On Mon, May 1, 2017 at 7:26 AM Naena Guru via Unicode wrote: > This whole attempt to make digitizing Indic script some esoteric, > 'abstract', 'semantic representation' and so on seems to me is an attempt > to make Unicode the realm of the some super humans. > Unicode is like writing. At its core, it is a hairy esoteric mess; mix these certain chemicals the right ways, and prepare a writing implement and writing surface in the right (non-trivial) ways, and then manipulate that implement carefully to make certain marks that have unclear delimitations between correct and incorrect. But in the end, as much of that is removed from the problem of the user as possible; in the case of modern word-processing systems, it's a matter of hitting the keys and then hitting print, in complete ignorance of all the silicon and printing magic going on between. Unicode is not the realm of everyone; it's the realm of people with a certain amount of linguistic knowledge and computer knowledge. There's only a problem if those people can't make it usable for the everyday programmer and therethrough to the average person. > The purpose of writing is to represent speech. > Meh.
The purpose of writing is to represent language, which may be unrelated to speech (like in the case of SignWriting and mathematics) or somewhat related to speech--very few forms of writing are direct transcriptions of speech. Even the closest tend to exchange a lot of intonation details for punctuation that reveals different information. > English writing was massacred when printing was brought in from Europe. > No, it wasn't. Printing made no difference to the fact that English has a dozen vowels with five letters to write them. The thorn has little impact on the ambiguity of English writing. The problem with printing is that it fossilizes the written language, and our spellings have stayed the same while the pronunciations have changed. And the dissociation of sound and writing sometimes helps English; even when two English speakers from different parts of the world would have trouble understanding each other, writing is usually not so impaired. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 19:03:26 2017 From: unicode at unicode.org (John W Kennedy via Unicode) Date: Mon, 1 May 2017 20:03:26 -0400 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: > On May 1, 2017, at 3:12 PM, Michael Bear via Unicode wrote: > > I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters in the unmapped positions of that block (e.g. U+1D127 inverts the stem of the previous note, U+1D1E9 is a ledger line, U+1D1EA is the "TAB" clef, U+1D1F0-U+1D1FC position the note along the staff, etc.) I've had no problem so far, but now I need to do beamed notes. The Unicode block has control characters for beginning and ending a series of beamed notes (U+1D173 and U+1D174, respectively), but I'm not really sure how to add beams to the notes while keeping the pitch intact. I know I'll obviously need OpenType for this. Slanted beams would be preferred, but straight beams are acceptable. It will need to support beams added on for longer notes. Can someone help me with this? > > I had asked this on a High Logic Font Creator forum (here), and someone said to subscribe to your mailing list and ask you guys. So here I am! Anyway, help, please? You might want to acquaint yourself with http://www.smufl.org From unicode at unicode.org Mon May 1 19:27:24 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 2 May 2017 01:27:24 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: <20170502012724.3f109a86@JRWUBU2> On Mon, 1 May 2017 23:03:53 +0000 Michael Bear via Unicode wrote: > “Rather than using "unused code positions", I would always recommend > to use some of the Private Use code points.” Consider it done. > > “What is the intended usage of your font? Music score > applications? others?” I am simply going to make a series of full > Unicode fonts (which, due to the 65,535-character limit in fonts, > each of the 3 fonts covers different planes: The first font does the > BMP, How much margin do you have for the BMP? There are a fair few variation sequences, on top of all the contextual forms and conjuncts. Richard.
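On the question of how much margin the BMP leaves for a single 65,535-glyph font, a rough count of assigned code points can be made with Python's unicodedata module. This is only a sketch: the figure depends on the Unicode version bundled with the Python build, it includes the 6,400 BMP private-use code points, and it says nothing about the extra glyphs needed for variation sequences, contextual forms and conjuncts, which are exactly what eats the remaining headroom.

import unicodedata

# Count BMP code points that are assigned (not Cn) and not surrogates (Cs).
assigned = sum(
    1
    for cp in range(0x10000)
    if unicodedata.category(chr(cp)) not in ("Cn", "Cs")
)
print("Assigned, non-surrogate BMP code points:", assigned)
print("Glyph IDs left under the 65,535 limit:", 65535 - assigned)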
From unicode at unicode.org Mon May 1 22:08:27 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 2 May 2017 05:08:27 +0200 Subject: How to Add Beams to Notes In-Reply-To: <20170502012724.3f109a86@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> Message-ID: Consider also that the BMP is almost full, the remaining few holes are kept for isolated characters that may be added to existing scripts, or permanently reserved to avoid clashes with legacy software using simple code remappings between distinct blocks, or to perform simple case conversions (e.g. in Greek) for internal purposes (these positions are not interoperable and may clash with future versions of the UCS and I18n tools/libraries like ICU). You should abstain from using any currently unassigned positions in the existing Unicode blocks: use PUA if you have nothing else; there is plenty of space available, in the BMP (most common usage in fonts that need to map additional glyphs) or in the two last planes. The PUA block in the BMP is large enough for most apps and almost all fonts that need private glyphs for internal purposes, or for still unencoded characters or for your own encoded variants such as slanted symbols, rotated symbols, inverted symbols, or symbols with multiple sizes, or at different positions on the musical score, or using distinct styles (e.g. between different players or singers, or various symbols for percussive instruments or specific instruments, or extra annotations). Many new symbols have been encoded first as PUAs in early fonts used to create proposals (then rendered to a PDF, or embedded fonts in a rich text document, or webfonts loaded from static versioned URLs on a repository like GitHub or on a public cloud). Later the proposal passed the early steps for reviewing the repertoire and choosing more relevant positions, then characters were encoded and standardized and these fonts were updated to map their glyphs to not just their existing PUAs but also the new standard positions (or encoded variants). 2017-05-02 2:27 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Mon, 1 May 2017 23:03:53 +0000 > Michael Bear via Unicode wrote: > > > “Rather than using "unused code positions", I would always recommend > > to use some of the Private Use code points.” Consider it done. > > > > “What is the intended usage of your font? Music score > > applications? others?” I am simply going to make a series of full > > Unicode fonts (which, due to the 65,535-character limit in fonts, > > each of the 3 fonts covers different planes: The first font does the > > BMP, > > How much margin do you have for the BMP? There are a fair few > variation sequences, on top of all the contextual forms and conjuncts. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 2 11:43:18 2017 From: unicode at unicode.org (Naena Guru via Unicode) Date: Tue, 2 May 2017 22:13:18 +0530 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: <5cfab138-51e2-9413-2d45-ad13cd8f333b@gmail.com> Thank you, professor. You wrote exactly what one would expect from a professor. It is a wonderful display of your prowess in the subject. Doctors and lawyers use Latin for concealment and self-preservation. Greenspan used Greenspanish. Unicode masters use Unicodish.
Indic is the name Unicode assigned to South Asian writing systems that are associated with Sanskrit vyaakaraNa. This is a result of what the good professor explains by "Unicode is not the realm of everyone; it's the realm of people with a certain amount of linguistic knowledge and computer knowledge". What is that 'certain amount' and which deity decides it? How do we unfortunate nincompoops decode it? Decode itself is beyond us, indeed. South Asians, especially Indians who already seem to have too many gods to deal with, do not need, though they might be tempted to add an image of the exalted Unicode god behind a colorful curtain to sing praise to with an alms box marked M$ besides to get favors each time the high priest scrubs off some of its 'hairy esoteric mess' while surreptitiously (or, ignorantly?) adding more. Brahmins were able to make any declaration because they were privileged. Similarly, Unicode experts can make declarations like, 'very few forms of writing are direct transcriptions of speech' and hide behind the 'in case' adjective 'direct' to avoid giving actual data. Of course, they can boldly count Sinhala as one that is not a direct transcription of speech. Speech getting transcribed into writing itself is a Unicodish. Hark! The professor declares. So, boys and girls, if you want to pass the test memorize this, even if it is obviously false: Printing made no difference to the fact that English has a dozen vowels with five letters to write them. The thorn has little impact on the ambiguity of English writing. The problem with printing is that it fossilizes the written language, and our spellings have stayed the same while the pronunciations have changed. And the dissociation of sound and writing sometimes helps English; even when two English speakers from different parts of the world would have trouble understanding each other, writing is usually not so impaired. It is printing with the dictionary industry that fossilized writing and as a result, forced speech to comply. The 'certain' level of knowledge above is now revealed. Language, dialect, creole, migration, intermixing of different peoples, accent...; where do these stand? Find ye by the foregoing what the fossil 'ye' actually was and what caused it to get fossilized in this form. On 5/2/2017 5:31 AM, David Starner wrote: > On Mon, May 1, 2017 at 7:26 AM Naena Guru via Unicode > > wrote: > > This whole attempt to make digitizing Indic script some esoteric, > 'abstract', 'semantic representation' and so on seems to me is an > attempt to make Unicode the realm of the some super humans. > > Unicode is like writing. At its core, it is a hairy esoteric mess; mix > these certain chemicals the right ways, and prepare a writing > implement and writing surface in the right (non-trivial) ways, and > then manipulate that implement carefully to make certain marks that > have unclear delimitations between correct and incorrect. But in the > end, as much of that is removed from the problem of the user as > possible; in the case of modern word-processing system, it's a matter > of hitting the keys and then hitting print, in complete ignorance of > all the silicon and printing magic going on between. > > Unicode is not the realm of everyone; it's the realm of people with a > certain amount of linguistic knowledge and computer knowledge. There's > only a problem if those people can't make it usable for the everyday > programmer and therethrough to the average person. > > The purpose of writing is to represent speech. > > Meh. 
The purpose of writing is to represent language, which may be > unrelated to speech (like in the case of SignWriting and mathematics) > or somewhat related to speech--very few forms of writing are direct > transcriptions of speech. Even the closest tend to exchange a lot of > intonation details for punctuation that reveals different information. > > English writing was massacred when printing was brought in from > Europe. > > No, it wasn't. Printing made no difference to the fact that English > has a dozen vowels with five letters to write them. The thorn has > little impact on the ambiguity of English writing. The problem with > printing is that it fossilizes the written language, and our spellings > have stayed the same while the pronunciations have changed. And the > dissociation of sound and writing sometimes helps English; even when > two English speakers from different parts of the world would have > trouble understanding each other, writing is usually not so impaired. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 2 22:17:10 2017 From: unicode at unicode.org (N. Ganesan via Unicode) Date: Tue, 2 May 2017 20:17:10 -0700 Subject: Internet unicode use of Indian languages Message-ID: In India, Tamil is the most used language on the internet. https://assets.kpmg.com/content/dam/kpmg/in/pdf/2017/04/Indian-languages-Defining-Indias-Internet.pdf http://www.vikatan.com/news/india/88214-tamil-is-the-most-used-indian-language-says-google.html N. Ganesan -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 3 02:49:49 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 3 May 2017 08:49:49 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> Message-ID: <20170503084949.1b61d689@JRWUBU2> On Tue, 2 May 2017 05:08:27 +0200 Philippe Verdy via Unicode wrote: > Consider also that the BMP is almost full, the remaining few holes > are kept for isolated characters that may be added to existing > scripts, or permanently reserved to avoid clashes with legacy > software using simple code remappings between distinct blocks, or to > perform simple case conversions (e.g. in Greek) for internal purposes > (these positions are not interoperable and may clash with future > versions of the UCS and I18n tools/libraries like ICU). > > You should abstain from using any currently unassigned positions in the > existing Unicode blocks: use PUA if you have nothing else; there is > plenty of space available, in the BMP (most common usage in fonts > that need to map additional glyphs) or in the two last planes. It isn't codepoints that are the constraint; one must consider the number of glyphs without dedicated one-character codes. For example, U+1000 MYANMAR LETTER KA needs glyphs for: 1000; 1000 FE00; 1039 1000 (and probably at two different widths); 1039 1000 FE00 (do.). There are a few CJK ideographs with similar needs: 537F; 537F FE00 (= CJK COMPATIBILITY IDEOGRAPH-2F831); 537F FE01 (= CJK COMPATIBILITY IDEOGRAPH-2F832); 537F FE02 (= CJK COMPATIBILITY IDEOGRAPH-2F833). There's also the Japanese ideographic variation sequence , which should probably have its own glyph even if it's the same as one of the above. The Arabic script (and other cursively connected scripts) has similar expansions, even if one goes for a typewritten style. Devanagari explodes when one considers just the conjuncts prescribed for Hindi.
I think it's also necessary to avoid splitting likely grapheme clusters between fonts. Which of the three fonts will support U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F (English flag) and which U+261D U+1F3FF (index pointing up: dark skin tone)? Now, the BMP has headroom provided by the surrogate characters and the PUA, which will not have mappings, but I'm not sure that it's enough. That's why I asked the question. Richard. From unicode at unicode.org Wed May 3 05:20:16 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 3 May 2017 11:20:16 +0100 (BST) Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: <20170503084949.1b61d689@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> Message-ID: <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Richard Wordingham wrote: .... U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F (English flag) .... I looked at that and I realized that although I had effectively seen that encoding in http://www.unicode.org/reports/tr51/tr51-11.html though expressed differently, it was only when I saw it expressed as above that I realized that there is something gone wrong with encoding policy. There are at present ten totally unused planes in the Unicode code point map and yet that seven character sequence is needed for encoding an English flag. Surely a single code point could be found. Single code points are being found for various emoji items on a continuing basis. Why pull up the ladder on encoding some flags each with a single code point? Yes, a single code point for an English flag please. And one for a Welsh flag too please. And one for a Scottish flag too please. And some others please, if that is what end users want. William Overington Wednesday 3 May 2017 From unicode at unicode.org Wed May 3 10:07:35 2017 From: unicode at unicode.org (David Faulks via Unicode) Date: Wed, 03 May 2017 11:07:35 -0400 Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 3 12:26:42 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 3 May 2017 10:26:42 -0700 Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Message-ID: On 5/3/2017 3:20 AM, William_J_G Overington via Unicode wrote: > Surely a single code point could be found. Single code points are being found for various emoji items on a continuing basis. Why pull up the ladder on encoding some flags each with a single code point? > > Yes, a single code point for an English flag please. And one for a Welsh flag too please. And one for a Scottish flag too please. And some others please, if that is what end users want. I suggest the following: 10BEDE for an English flag (reminding one of Bede the Venerable) 10CADF for a Welsh flag (harking to Cadfan ap Iago, King of Gwynedd) 10A1BA for a Scottish flag (for Alba, of course) Surely those would work for you! 
--Ken From unicode at unicode.org Wed May 3 15:31:10 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 3 May 2017 21:31:10 +0100 (BST) Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Message-ID: <22290163.68159.1493843470220.JavaMail.defaultUser@defaultHost> Ken Whistler wrote: > I suggest the following: > 10BEDE for an English flag (reminding one of Bede the Venerable) > 10CADF for a Welsh flag (harking to Cadfan ap Iago, King of Gwynedd) > 10A1BA for a Scottish flag (for Alba, of course) > Surely those would work for you! Thank you for your reply. Nicely! Those code points each have a helpful mnemonic. I had not known of Cadfan ap Iago until I read your post. I found the following. https://en.wikipedia.org/wiki/Cadfan_ap_Iago I opine that we need to make it clear, for the benefit of some people new to Unicode who may be reading this thread, that those code points are in one of the Private Use Areas, namely Supplementary Private Use Area-B, so there could be problems using them in some circumstances due to lack of uniqueness in the use of those code points. http://www.unicode.org/charts/PDF/U100000.pdf William Overington Wednesday 3 May 2017 From unicode at unicode.org Wed May 3 22:01:17 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 4 May 2017 05:01:17 +0200 Subject: How to Add Beams to Notes In-Reply-To: <20170503084949.1b61d689@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> Message-ID: 2017-05-03 9:49 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Tue, 2 May 2017 05:08:27 +0200 > Philippe Verdy via Unicode wrote: > > > Consider also that the BMP is almost full, the remaining few holes > > are kept for isolated characters that may be added to existing > > scripts, or permanently reserved to avoid clashes with legacy > > software using simple code remappings between distinct blocks, or to > > perform simple case conversions (e.g. in Greek) for internal purposes > > (these positions are not interoperable and may clash with future > > versions of the UCS and I18n tools/libraries like ICU). > > > > You should abstain from using any currently unassigned positions in the > > existing Unicode blocks: use PUA if you have nothing else; there is > > plenty of space available, in the BMP (most common usage in fonts > > that need to map additional glyphs) or in the two last planes. > > It isn't codepoints that are the constraint; one must consider the > number of glyphs without dedicated one-character codes. > Glyph processing requires internal glyph ids in fonts. The limit is on the total number of glyphs you can put in that font without exceeding the maximum size of glyph id's. Traditionally this is solved by creating coherent (but complete enough) subsets so that all glyphs within the same script can fit. The other solution, notably for sinograms, is to use font linking. The Arabic script (and other cursively connected scripts) has similar > expansions, even if one goes for a typewritten style. > > Devanagari explodes when one considers just the conjuncts prescribed for > Hindi. > Rendering Devanagari with OpenType does not require any PUA assignment in that font for variants. The sequences are mapped directly using subtables and the rules defined in OpenType for that script.
Fonts just use their own internal glyph ID's without having to assign them any Unicode mapping, using Glyph processing rules. Same remark about Arabic (though some encoded compatibility characters will map to some of these glyphs... without using any PUA). > > I think it's also necessary to avoid splitting likely grapheme > clusters between fonts. Which of the three fonts will support U+1F3F4 > U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F (English flag) and > which U+261D U+1F3FF (index pointing up: dark skin tone)? > > Now, the BMP has headroom provided by the surrogate characters and the > PUA, which will not have mappings, but I'm not sure that it's enough. > > For your question, the solution is to create coherent subsets of symbols and create fonts from this subset. For the case of country/region flags, they could all be separated in a specific font. As well you can create separate fonts for persons/animals/plants, and another one for inanimate objects (including planets, game pieces...) Traditional punctuation-like symbols used in typography and normally without any emoji style can fit in a generic symbols font (along with geometric shapes, line drawing symbols). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 4 02:26:37 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 4 May 2017 08:26:37 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> Message-ID: <20170504082637.40229878@JRWUBU2> On Thu, 4 May 2017 05:01:17 +0200 Philippe Verdy via Unicode wrote: > Rendering Devanagari with OpenType does not require any PUA > assignment in that font for variants. The sequences are mapped > directly using subtables and the rules defined in OpenType for that > script. Fonts just use their own internal glyph ID's without having > to assign them any Unicode mapping, using Glyph processing rules. > > Same remark about Arabic (though some encoded compatibility > characters will map to some of these glyphs... without using any PUA). The OP's plan is to use one font for the BMP, one font for the SMP, and one font for the rest. However, the BMP font Code2000, which only goes, incompletely, up to Unicode 5.2, uses 63,546 glyphs, which is very close to the limit of 65,535. There is the slight margin that it included a few small scripts with standardised (ConScript Unicode Registry) PUA allocations. Richard. From unicode at unicode.org Thu May 4 07:50:52 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 4 May 2017 14:50:52 +0200 Subject: How to Add Beams to Notes In-Reply-To: <20170504082637.40229878@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <20170504082637.40229878@JRWUBU2> Message-ID: You cannot cover a full plane with a single font. There are other factors such as total size that also severely limit their use. We have to live with the limitations of OpenType. In addition a giant font is hard to maintain, version and update without breaking usages. Font authors should focus their efforts on separating scripts within a collection of related fonts (like what the Noto project did): the rest will use font linking (which can be and already is used by renderers, and can also be parameterized by users for accessibility, or to use preferred variants in some domains).
Also not all scripts have the same kinds of style variants (serif/sans-serif, 2 or more distinctive weights, straight/italic/oblique, plain/hollow/shadowed), and trying to synthesize these styles will break the nature of the script (notably for many symbols): you'll need separate fonts for separate styles for specific scripts, other scripts may support synthetic styles or not alter their rendering at all. Code2000 is then just useful as a last resort font, but its glyphs are still very poor compared to other fonts, and the fact it uses the same font-wide strategy for hinting also creates lots of caveats: you cannot hint Sinograms like Latin or Greek and symbols have their separate requirements (notably geometric shapes and line drawing). Finally the bad thing about Code2000 is about font metrics, notably baselines: while you want to unify these baselines and line-heights, you'll reach the point where some scripts are ridiculously too small or improperly aligned: it's much easier to separate them and tune these metrics separately. Trying to fix these metrics for one script will break another one in that font, and finally you cannot create a comprehensive coverage test and get stable results because there are contradicting objectives for different uses: it's much easier to conciliate the possible choices by separating scripts, so that you can more easily create additional variants for a few of them, and then create a separate rendering engine which will use some parameterized rules for selecting the most appropriate fonts. And then it's much easier to update only one of these fonts when there are improvements, without breaking all the rest. 2017-05-04 9:26 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Thu, 4 May 2017 05:01:17 +0200 > Philippe Verdy via Unicode wrote: > > > Rendering Devanagari with OpenType does not require any PUA > > assignment in that font for variants. The sequences are mapped > > directly using subtables and the rules defined in OpenType for that > > script. Fonts just use their own internal glyph ID's without having > > to assign them any Unicode mapping, using Glyph processing rules. > > > Same remark about Arabic (though some encoded compatibility > > characters will map to some of these glyphs... without using any PUA). > > The OP's plan is to use one font for the BMP, one font for the SMP, and > one font for the rest. However, the BMP font Code2000, which only > goes, incompletely, up to Unicode 5.2, uses 63,546 glyphs, which is > very close to the limit of 65,535. There is the slight margin that it > included a few small scripts with standardised (ConScript > Unicode Registry) PUA allocations. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 4 18:13:08 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Thu, 4 May 2017 23:13:08 +0000 Subject: How to Add Beams to Notes In-Reply-To: References: , Message-ID: “How much margin do you have for the BMP? There are a fair few variation sequences, on top of all the contextual forms and conjuncts.” I plan to do everything in the plane EXCEPT for the surrogates, which you're not supposed to encode in fonts anyway, which leaves room for about 2,048 more glyphs for OpenType features. Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu May 4 19:54:41 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 5 May 2017 01:54:41 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: <20170505015441.3fd8585e@JRWUBU2> On Thu, 4 May 2017 23:13:08 +0000 Michael Bear via Unicode wrote: > I plan to do everything in the plane EXCEPT for the surrogates, which > you're not supposed to encode in fonts anyway, which leaves room for > about 2,048 more glyphs for OpenType features. There are, if I avoided double counting errors, 56,251 assigned characters in the BMP in Unicode 10.0.0. There are 1008 standardised variation sequences, all in the BMP. Indic scripts require more glyphs than they have characters - usually at least twice as many. You have read the chapter on Devanagari, haven't you? Richard. From unicode at unicode.org Fri May 5 13:46:17 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Fri, 5 May 2017 18:46:17 +0000 Subject: How to Add Beams to Notes In-Reply-To: <20170505015441.3fd8585e@JRWUBU2> References: , <20170505015441.3fd8585e@JRWUBU2> Message-ID: Additionally, I will only do OT features that are absolutely necessary for a certain script, not unnecessary (although stylish!) features, e.g. I will include things like mark positioning and init/medi/fina forms for Arabic, while leaving out small caps, swashes, and extensive ligatures. (In an earlier post, I might have said I'll do ALL of the possible OT features. If so, I misspoke.) But if the cry for space gets REALLY desperate, I'll merge identical glyphs into one glyph. Obviously, I won't do this for more problematic merges, only glyphs in similar scripts with similar features. (e.g. I would represent Latin small letter o, Greek small letter omicron, Cyrillic small letter o, Armenian letter oh, and Georgian labial sign with one glyph, while Hebrew letter samekh and Arabic letter ae, despite also being circular, would be two separate glyphs.) But I'll only do this if I really need to. Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri May 5 17:07:08 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sat, 6 May 2017 00:07:08 +0200 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: > On 1 May 2017, at 21:12, Michael Bear via Unicode wrote: > > I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters... SMuFL has a rather comprehensive set of musical symbols. http://www.smufl.org/ http://www.smufl.org/version/latest/ http://www.smufl.org/fonts/ From unicode at unicode.org Sat May 6 07:54:07 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Sat, 6 May 2017 12:54:07 +0000 Subject: Sutton SignWriting PDF Message-ID: If I open the Sutton SignWriting code chart in Mozilla Firefox, the glyphs in the tables are blank. I have no idea why. If I open it in Microsoft Edge, however, it works fine. Do you know why this is? Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Sat May 6 19:56:21 2017 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 6 May 2017 16:56:21 -0800 Subject: How to Add Beams to Notes In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <20170504082637.40229878@JRWUBU2> Message-ID: Philippe Verdy wrote, > Code2000 ... uses the same font-wide strategy for hinting also > creates lots of caveats: ... Code2000 does not have hinting instructions; that's the font-wide strategy. > Finally the bad thing about Code2000 is about font metrics, notably > baselines: while you want to unify these baselines and line-heights, > you'll reach the point where some scripts are ridiculously too small > or improperly aligned ... Do you have an example of either? Is it possible that any improper alignment or disproportionate glyphs in your display are being caused by something other than the font? > Trying to fix these metrics for one script will break another one > in that font ... Trying to fix something which isn't broken is generally a bad plan. I wonder if the bizarre behavior you're reporting might have been caused by some third party "fixing" something in the font. In a pan-Unicode font, the base of the CJK ideographs wouldn't be expected to match the baseline of alphabetic scripts. Likewise, the base of the stems used in Indic scripts shouldn't be expected to match the baseline of alphabetic scripts as Indic scripts don't use baselines. Rather, the glyphs in such a font might be designed so that, even with reasonable above and below marks/diacritics, there would be no excessive line gaps generated for the other scripts covered in the font. A font which made, for example, Tibetan base letters the same size as Latin letters would work just fine... as long as you don't mind that runs of Latin text displayed with the font would appear to have two or three line feeds inserted between each line. Best regards, James Kass From unicode at unicode.org Sun May 7 03:03:34 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 7 May 2017 09:03:34 +0100 Subject: Sutton SignWriting PDF In-Reply-To: References: Message-ID: <20170507090334.52230093@JRWUBU2> On Sat, 6 May 2017 12:54:07 +0000 Michael Bear via Unicode wrote: > If I open the Sutton SignWriting code chart in Mozilla Firefox, the > glyphs in the tables are blank. I have no idea why. If I open it in > Microsoft Edge, however, it works fine. Do you know why this is? It smacks of being a fault in Firefox. If I download the file on Linux, I can then read it using Adobe Reader 9 or evince 3.18.2, but still not with Firefox 53.0. Of course, it's possible that there's a fault in the file that doesn't affect other readers - Adobe Reader has had a spate of problems with embedded fonts, but it may have been the PDF generators that were at fault in that case. The short-term practical solution is to change the plug-in action for PDF's - short URL is about:preferences#applications. From the support page at https://support.mozilla.org/en-US/kb/view-pdf-files-firefox , it looks like an old or recurring problem with Firefox. Richard.
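On the recurring question in this thread of whether a single font can cover the BMP within the 65,535 glyph-ID limit, a quick empirical check of any existing font is possible with the third-party fontTools package. This is only a sketch, assuming fontTools is installed; "Code2000.ttf" is used purely as a placeholder path, and the cmap count ignores variation-selector subtables:

from fontTools.ttLib import TTFont

font = TTFont("Code2000.ttf")        # placeholder path to the font under test
num_glyphs = font["maxp"].numGlyphs  # total number of glyphs in the font
# Glyphs reachable directly from a character, as opposed to glyphs that exist
# only as targets of OpenType lookups (contextual forms, conjuncts, variants).
mapped = set(font["cmap"].getBestCmap().values())

print("Glyphs in font:", num_glyphs, "of a maximum 65535")
print("Glyphs mapped directly from characters:", len(mapped))
print("Glyphs only reachable via OpenType rules:", num_glyphs - len(mapped))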
From unicode at unicode.org Sun May 7 03:23:08 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 7 May 2017 09:23:08 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: <20170505015441.3fd8585e@JRWUBU2> Message-ID: <20170507092308.316a7a3a@JRWUBU2> On Fri, 5 May 2017 18:46:17 +0000 Michael Bear via Unicode wrote: > But > if the cry for space gets REALLY desperate, I?ll merge identical > glyphs into one glyph. Obviously, I won?t do this for more > problematic merges, only glyphs in similar scripts with similar > features. (e.g. I would represent Latin small letter o, Greek small > letter omicron, Cyrillic small letter o, Armenian letter oh, and > Georgian labial sign with one glyph, while Hebrew letter samekh and > Arabic letter ae, despite also being circular, would be two separate > glyphs.) But I?ll only do this if I really need to. That could cause problems with extracting text from PDFs generated using the font. My interest was in whether a pan-BMP font was still possible. As you haven't done the counting (which is ill-defined for scripts with conjuncts, and possibly even also for old Hangul support), you can't tell me yet. Richard. From unicode at unicode.org Tue May 9 09:09:57 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Tue, 9 May 2017 14:09:57 +0000 Subject: CSUR and UCSUR glyphs Message-ID: I need some help with the glyphs from the CSUR and UCSUR. Some of the glyphs were no problem, such as the Tengwar and Cirth ones, because their pages actually show the glyphs on their pages. Others do not, which poses a bit of a problem. Some of them have links to other sites that are intended to show the glyphs, but most of those links are outdated and lead to 404s. I could just pull up an archived version with the Wayback machine (web.archive.org), and for some of them, this works, but most of them don?t have any saved versions. I could just do a Google search to find out what the characters look like, but many of the scripts are too obscure to get anything reliable out of that Google search. I?m making a font with everything in the UCSUR, and this is a major obstacle I must overcome. So do you guys know where I could get glyph shapes for most of these scripts? Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 9 11:24:27 2017 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Tue, 9 May 2017 09:24:27 -0700 Subject: CSUR and UCSUR glyphs In-Reply-To: References: Message-ID: In addition to the sites linked to by the CSUR and UCSUR pages, there are the PDFs linked to by the UCSUR page, and the Constructium and Fairfax fonts. There is also a font called Nishiki-teki that includes a lot of CSUR and UCSUR scripts, and a version of GNU Unifont that does as well. Beyond that, I know as much as you do. -- Rebecca Bettencourt On Tue, May 9, 2017 at 7:09 AM, Michael Bear via Unicode < unicode at unicode.org> wrote: > I need some help with the glyphs from the CSUR > and UCSUR > . > > > > Some of the glyphs were no problem, such as the Tengwar and Cirth ones, > because their pages *actually show the glyphs on their pages*. > > Others do not, which poses a bit of a problem. Some of them have links to > other sites that are intended to show the glyphs, but most of those links > are outdated and lead to 404s. 
I could just pull up an archived version > with the Wayback machine (web.archive.org), and for some of them, this > works, but most of them don?t have any saved versions. > > I could just do a Google search to find out what the characters look like, > but many of the scripts are too obscure to get anything reliable out of > that Google search. I?m making a font with everything in the UCSUR, and > this is a major obstacle I must overcome. So do you guys know where I could > get glyph shapes for most of these scripts? > > > > Sent from Mail for > Windows 10 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 9 11:31:04 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 09 May 2017 09:31:04 -0700 Subject: CSUR and UCSUR glyphs Message-ID: <20170509093104.665a7a7059d7ee80bb4d670165c8327d.127813a82a.wbe@email03.godaddy.com> Michael Bear wrote: > Some of the glyphs were no problem, such as the Tengwar and Cirth > ones, because their pages actually show the glyphs on their pages. > > Others do not, which poses a bit of a problem. [...] As you probably read on both the CSUR and UCSUR sites, neither is sponsored or endorsed by Unicode. They are side projects embarked upon by individuals, some of whom also happen to be involved in the Consortium. Please keep this in mind. I was never able to find some of these scripts, such as Pikto, even in the '90s when CSUR activity was at its peak. Herman Miller's site, linked from CSUR, still has all or most of his early alphabets. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 9 12:30:47 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 09 May 2017 10:30:47 -0700 Subject: If at first... (was: RE: CSUR and UCSUR glyphs) Message-ID: <20170509103047.665a7a7059d7ee80bb4d670165c8327d.f0a5ac1b47.wbe@email03.godaddy.com> I wrote: > I was never able to find some of these scripts, such as Pikto http://unifoundry.com/pikto/index.html Never hurts to try again. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 9 17:44:47 2017 From: unicode at unicode.org (Mats Blakstad via Unicode) Date: Wed, 10 May 2017 00:44:47 +0200 Subject: Human Rights translations Message-ID: Hi Who is at the moment organizing the human rights translations in Unicode? How can we submit new translations? Best regards Mats Blakstad -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 10 09:30:30 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 10 May 2017 07:30:30 -0700 Subject: Human Rights translations Message-ID: <20170510073030.665a7a7059d7ee80bb4d670165c8327d.7d6c67158e.wbe@email03.godaddy.com> Mats Blakstad wrote: > Who is at the moment organizing the human rights translations in > Unicode? How can we submit new translations? http://www.unicode.org/udhr/contributing.html -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 10 12:22:53 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Wed, 10 May 2017 17:22:53 +0000 Subject: Join me in protecting net neutrality Message-ID: The FCC and their new Chairman, Ajit Paij, have a plan to destroy net neutrality as we know it. It?s up to us to stop it. I just signed onto Mozilla?s campaign to demand strong net neutrality protections. 
You can show your support here: http://advocacy.mozilla.org/net-neutrality?sp_ref=302214293.352.180765.e.575605.2&source=email Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 14 01:04:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 14 May 2017 07:04:31 +0100 Subject: Fighting Spell-Checking by Renderers Message-ID: <20170514070431.0292e34b@JRWUBU2> One of the early problems encountered with Unicode was that there can be multiple ways of representing the same text. For many scripts, the solution was canonical equivalence - the multiple ways were declared to be equivalent, and anything that thought they had different meanings and should *therefore* be treated differently was non-compliant with the Unicode standard. Where canonical equivalence actually leads to the wrong conclusion a method was subsequently found to make sequences canonically inequivalent, U+034F COMBINING GRAPHEME JOINER (CGJ). It generally takes extra effort to insert this character. However, canonical equivalence hit a severe problem with two-part Indic vowels, and the use of non-zero canonical combining classes in Indic scripts is generally low. A similar issue might arise with graphically non-interacting subordinated consonants, especially when encoded as virama/coeng plus base consonant. One solution to this problem is for renderers to produce a strange rendering if characters appear in a non-standard order. However, character strings are not just rendered and compared for identity. They are also be transliterated, sorted into alphabetical order, and may be input to automatic speech generation systems with limited capabilities for resolving homographs. This may require some way of tagging an apparently incorrectly ordered string, analogous to the use of 'sic' in English, to indicate that the text is intended not to accord with the 'standard' character order. What characters are available for such a r?le? CGJ is a possibility, but I am concerned that it may be being overworked. It is already suggested as a solution for dealing with sorting when a digraph is treated as a letter, but accidental sequences are not, as in the Welsh letter 'ng' (which comes between 'g' and 'h' in the alphabet) as opposed to an 'accidental' sequence such as in 'Bangor' and 'Llangollen'. Such characters probably don't work now, but it may be possible to persuade the suppliers to heed them. The ideal character would be disallowed in domain names, which should allay the greatest security worries about simply rendering the text as it stands. Some potential ambiguities arise from Sanskrit, and were raised long ago by Peter Constable on the Unicode Indic list on 28 August 2006 under the heading 'contrastive /Crv/ and /Cvr/ in Telugu, Malayalam'. The cases he gave were 'grva' v. 'gvra', 'drva' v. 'dvra' and 'srva' v. 'svra'. For the Khmer script, the KhmerOS font renders the pairs identically, which did surprise me, as I had got it into my head that one could tell from the depth of the where the RO came in the sequence of conjoined letters. Richard. 
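A minimal Python sketch of the CGJ mechanism described above, using only the standard unicodedata module (the marks chosen are arbitrary illustrations, not the Indic cases under discussion): two orders of marks with distinct non-zero combining classes are canonically equivalent, while inserting U+034F COMBINING GRAPHEME JOINER (combining class 0) blocks canonical reordering and keeps the two orders canonically inequivalent.

    import unicodedata

    DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, ccc 220
    DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE, ccc 230
    CGJ       = "\u034f"   # COMBINING GRAPHEME JOINER, ccc 0

    # Without CGJ, canonical reordering sorts the marks by combining class,
    # so both orders normalize to the same string: canonically equivalent.
    a = "q" + DOT_BELOW + DOT_ABOVE
    b = "q" + DOT_ABOVE + DOT_BELOW
    print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True

    # With CGJ between the marks, reordering is blocked, so the two orders
    # stay distinct under normalization: canonically inequivalent.
    c = "q" + DOT_BELOW + CGJ + DOT_ABOVE
    d = "q" + DOT_ABOVE + CGJ + DOT_BELOW
    print(unicodedata.normalize("NFD", c) == unicodedata.normalize("NFD", d))  # False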
From unicode at unicode.org Mon May 15 05:21:45 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Mon, 15 May 2017 13:21:45 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political reason why the proposal is a bad idea. First, the technical reason: ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't representative of implementation concerns of implementations that use UTF-8 as their in-memory Unicode representation. Even though there are notable systems (Win32, Java, C#, JavaScript, ICU, etc.) that are stuck with UTF-16 as their in-memory representation, which makes concerns of such implementation very relevant, I think the Unicode Consortium should acknowledge that UTF-16 was, in retrospect, a mistake (since Unicode grew past 16 bits anyway making UTF-16 both variable-width *and* ASCII-incompatible--i.e. widening the the code units to be ASCII-incompatible didn't buy a constant-width encoding after all) and that when the legacy constraints of Win32, Java, C#, JavaScript, ICU, etc. don't force UTF-16 as the internal Unicode representation, using UTF-8 as the internal Unicode representation is the technically superior design: Using UTF-8 as the internal Unicode representation is memory-efficient and cache-efficient when dealing with data formats whose syntax is mostly ASCII (e.g. HTML), forces developers to handle variable-width issues right away, makes input decode a matter of mere validation without copy when the input is conforming and makes output encode infinitely fast (no encode step needed). Therefore, despite UTF-16 being widely used as an in-memory representation of Unicode and in no way going away, I think the Unicode Consortium should be *very* sympathetic to technical considerations for implementations that use UTF-8 as the in-memory representation of Unicode. When looking this issue from the ICU perspective of using UTF-16 as the in-memory representation of Unicode, it's easy to consider the proposed change as the easier thing for implementation (after all, no change for the ICU implementation is involved!). However, when UTF-8 is the in-memory representation of Unicode and "decoding" UTF-8 input is a matter of *validating* UTF-8, a state machine that rejects a sequence as soon as it's impossible for the sequence to be valid UTF-8 (under the definition that excludes surrogate code points and code points beyond U+10FFFF) makes a whole lot of sense. If the proposed change was adopted, while Draconian decoders (that fail upon first error) could retain their current state machine, implementations that emit U+FFFD for errors and continue would have to add more state machine states (i.e. more complexity) to consolidate more input bytes into a single U+FFFD even after a valid sequence is obviously impossible. When the decision can easily go either way for implementations that use UTF-16 internally but the options are not equal when using UTF-8 internally, the "UTF-8 internally" case should be decisive. (Especially when spec-wise that decision involves no change. I further note the proposal PDF argues on the level of "feels right" without even discussing the impact on implementations that use UTF-8 internally.) 
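As a rough sketch of that point (in Python, and not the code of ICU, encoding_rs, or any browser), a replacement-emitting decoder built directly on the fail-fast well-formedness constraints (Table 3-7 of the standard) naturally yields the currently recommended behavior: it gives up on a sequence at the first byte that makes a valid completion impossible and emits one U+FFFD for the bytes accepted so far. Collapsing, say, all of the overlong sequence F0 80 80 80 into a single U+FFFD, as the proposal prefers, needs extra states on top of this.

    def decode_utf8_with_replacement(data: bytes) -> str:
        # Allowed range for the byte following each lead byte (per the UTF-8
        # well-formedness table); any later trail bytes must be 0x80..0xBF.
        def lead_spec(b):
            if 0xC2 <= b <= 0xDF: return (0x80, 0xBF, 1)
            if b == 0xE0:         return (0xA0, 0xBF, 2)
            if 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF: return (0x80, 0xBF, 2)
            if b == 0xED:         return (0x80, 0x9F, 2)
            if b == 0xF0:         return (0x90, 0xBF, 3)
            if 0xF1 <= b <= 0xF3: return (0x80, 0xBF, 3)
            if b == 0xF4:         return (0x80, 0x8F, 3)
            return None           # cannot be a lead byte of any valid sequence
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                out.append(chr(b)); i += 1; continue
            spec = lead_spec(b)
            if spec is None:                      # bogus byte: one U+FFFD, move on
                out.append("\uFFFD"); i += 1; continue
            lo, hi, ntrail = spec
            j, ok = i + 1, True
            for k in range(ntrail):
                good = j < len(data) and ((lo <= data[j] <= hi) if k == 0
                                          else (0x80 <= data[j] <= 0xBF))
                if not good:                      # fail fast: no completion can be valid
                    ok = False; break
                j += 1
            if ok:                                # well-formed: assemble the scalar value
                cp = b & (0xFF >> (ntrail + 2))
                for k in range(i + 1, j):
                    cp = (cp << 6) | (data[k] & 0x3F)
                out.append(chr(cp))
            else:                                 # one U+FFFD per maximal subsequence
                out.append("\uFFFD")
            i = j
        return "".join(out)

    # The 61 F1 80 80 E1 80 C2 62 example: a + three U+FFFDs + b.
    print(decode_utf8_with_replacement(b"\x61\xf1\x80\x80\xe1\x80\xc2\x62"))
    # Overlong F0 80 80 80: four U+FFFDs here; one under the proposed change.
    print(decode_utf8_with_replacement(b"\xf0\x80\x80\x80").count("\uFFFD"))   # 4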
As a matter of implementation experience, the implementation I've written (https://github.com/hsivonen/encoding_rs) supports both the UTF-16 as the in-memory Unicode representation and the UTF-8 as the in-memory Unicode representation scenarios, and the fail-fast requirement wasn't onerous in the UTF-16 as the in-memory representation scenario. Second, the political reason: Now that ICU is a Unicode Consortium project, I think the Unicode Consortium should be particular sensitive to biases arising from being both the source of the spec and the source of a popular implementation. It looks *really bad* both in terms of equal footing of ICU vs. other implementations for the purpose of how the standard is developed as well as the reliability of the standard text vs. ICU source code as the source of truth that other implementors need to pay attention to if the way the Unicode Consortium resolves a discrepancy between ICU behavior and a well-known spec provision (this isn't some ill-known corner case, after all) is by changing the spec instead of changing ICU *especially* when the change is not neutral for implementations that have made different but completely valid per then-existing spec and, in the absence of legacy constraints, superior architectural choices compared to ICU (i.e. UTF-8 internally instead of UTF-16 internally). I can see the irony of this viewpoint coming from a WHATWG-aligned browser developer, but I note that even browsers that use ICU for legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior isn't, in fact, the dominant browser UTF-8 behavior. That is, even Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the environment that's the most sensitive to how issues like this are handled, so it would be appropriate for the proposal to survey current browser behavior instead of just saying that ICU "feels right" or is "natural". -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Mon May 15 09:57:00 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 15 May 2017 15:57:00 +0100 (BST) Subject: Are Emoji ZWJ sequences characters? In-Reply-To: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> References: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> Message-ID: <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> I am concerned about emoji ZWJ sequences being encoded without going through the ISO process and whether Unicode will therefore lose synchronization with ISO/IEC 10646. I have raised this by email and a very helpful person has advised me that encoding emoji sequences does not mean that Unicode and ISO/IEC 10646 go out of being synchronized because ZWJ sequences are not *characters*, and they have no implications for ISO/IEC 10646, noting that ISO/IEC 10646 does not define ZWJ sequences. Now I have great respect for the person who advised me. However I am a researcher and I opine that I need evidence. Thus I am writing to the mailing list in the hope that there will be a discussion please. http://www.unicode.org/reports/tr51/tr51-11.html (A proposed update document) http://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt http://www.unicode.org/charts/PDF/U1F300.pdf http://www.unicode.org/charts/PDF/U1F680.pdf In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. 
end quote In emoji-zwj-sequences.txt there is the following line. 1F468 200D 1F680 ; Emoji_ZWJ_Sequence ; man astronaut >From U1F300.pdf, 1F468 is MAN 200D is ZWJ >From U1F680.pdf 1F680 is ROCKET The reasoning upon which I base my concern is as follows. 0063 is c 0070 is p 0074 is t If 0063 200D 0074 is used to specifically request a ct ligature in a display of some text, then the meaning of 0063 200D 0074 is the same as the meaning of 0063 0074 and indeed a font with an OpenType table could cause a ct ligature to be displayed even if the sequence is 0063 0074 rather than the sequence 0063 200D 0074 that is used where the ligature glyph is specifically requested. Thus the meaning of ct is not changed by using the ZWJ character. Now the use of the ct ligature is well-known and frequent. Suppose now that a fontmaker is making a font of his or her own and decides to include a glyph for a pp ligature, with a swash flourish joining and going beyond the lower ends of the descenders both to the left and to the right. The fontmaker could note that the ligature might be good in a word like copper but might look wrong in a word like happy due to the tail on the letter y clashing with the rightward side of the swash flourish. So the fontmaker encodes 0070 200D 0070 as a pp ligature but does not encode 0070 0070 as a pp ligature, so that the ligature glyph is only used when specifically requested using a ZWJ character. However, when the ZWJ character is used, the meaning of the pp sequence is not changed from the meaning when the pp sequence is not used. Yet when 1F468 200D 1F680 is used, the meaning of the sequence is different from the meaning of the sequence 1F468 1F680 such that the meaning of 1F468 200D 1F680 is listed in a file available from the Unicode website. >From where does the astronaut's spacesuit and helmet come? I am reminded that in chemistry if one mixes two chemicals, sometimes one just gets a mixture of two chemicals and sometimes one gets a chemical reaction such that another chemical is produced. Repeating the quote from earlier in this post. In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. end quote I am concerned that in the future a user of ISO/IEC 10646 will not be able to find from ISO/IEC 10646 the meaning of an emoji that he or she observes being displayed, even if he or she is able to discover what is the sequence of characters being used. So I ask that this matter be discussed please. William Overington Monday 15 May 2017 From unicode at unicode.org Mon May 15 10:37:13 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Mon, 15 May 2017 16:37:13 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: > > In reference to: > http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf > > I think Unicode should not adopt the proposed change. Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. > ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't > representative of implementation concerns of implementations that use > UTF-8 as their in-memory Unicode representation. > > Even though there are notable systems (Win32, Java, C#, JavaScript, > ICU, etc.) 
that are stuck with UTF-16 as their in-memory > representation, which makes concerns of such implementation very > relevant, I think the Unicode Consortium should acknowledge that > UTF-16 was, in retrospect, a mistake You may think that. There are those of us who do not. The fact is that UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway. > Therefore, despite UTF-16 being widely used as an in-memory > representation of Unicode and in no way going away, I think the > Unicode Consortium should be *very* sympathetic to technical > considerations for implementations that use UTF-8 as the in-memory > representation of Unicode. I don?t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don?t see what that has to do with either the original proposal or with your criticism of UTF-16. [snip] > If the proposed > change was adopted, while Draconian decoders (that fail upon first > error) could retain their current state machine, implementations that > emit U+FFFD for errors and continue would have to add more state > machine states (i.e. more complexity) to consolidate more input bytes > into a single U+FFFD even after a valid sequence is obviously > impossible. ?Impossible?? Why? You just need to add some error states (or *an* error state and a counter); it isn?t exactly difficult, and I?m sure ICU isn?t the only library that already did just that *because it?s clearly the right thing to do*. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Mon May 15 11:14:23 2017 From: unicode at unicode.org (Peter Constable via Unicode) Date: Mon, 15 May 2017 16:14:23 +0000 Subject: Are Emoji ZWJ sequences characters? In-Reply-To: <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> References: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> Message-ID: Emoji sequences are not _encoded_, per se, in either Unicode or ISO/IEC 10646. The act of "encoding" in either of these coding standards is to assign an encoded representation in the encoding method of the standards for a given entity. In this case, that means to assign a code point. Specifying ZWJ sequences for representation of text elements is not encoding in the standard; it is simply defining an encoded representation for those text elements. Unicode gives some attention to this kind of thing, but ISO/IEC 10646, not so much. For instance, you won't find anything in ISO/IEC 10646 specifying that the encoded representation for a rakaar is < VIRAMA, RA >. So, your helpful person was, indeed, helpful, giving you correct information: ZWJ sequences are not _characters_ and have no implications for ISO/IEC 10646. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington via Unicode Sent: Monday, May 15, 2017 7:57 AM To: unicode at unicode.org Subject: Are Emoji ZWJ sequences characters? I am concerned about emoji ZWJ sequences being encoded without going through the ISO process and whether Unicode will therefore lose synchronization with ISO/IEC 10646. 
I have raised this by email and a very helpful person has advised me that encoding emoji sequences does not mean that Unicode and ISO/IEC 10646 go out of being synchronized because ZWJ sequences are not *characters*, and they have no implications for ISO/IEC 10646, noting that ISO/IEC 10646 does not define ZWJ sequences. Now I have great respect for the person who advised me. However I am a researcher and I opine that I need evidence. Thus I am writing to the mailing list in the hope that there will be a discussion please. http://www.unicode.org/reports/tr51/tr51-11.html (A proposed update document) http://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt http://www.unicode.org/charts/PDF/U1F300.pdf http://www.unicode.org/charts/PDF/U1F680.pdf In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. end quote In emoji-zwj-sequences.txt there is the following line. 1F468 200D 1F680 ; Emoji_ZWJ_Sequence ; man astronaut From U1F300.pdf, 1F468 is MAN 200D is ZWJ From U1F680.pdf 1F680 is ROCKET The reasoning upon which I base my concern is as follows. 0063 is c 0070 is p 0074 is t If 0063 200D 0074 is used to specifically request a ct ligature in a display of some text, then the meaning of 0063 200D 0074 is the same as the meaning of 0063 0074 and indeed a font with an OpenType table could cause a ct ligature to be displayed even if the sequence is 0063 0074 rather than the sequence 0063 200D 0074 that is used where the ligature glyph is specifically requested. Thus the meaning of ct is not changed by using the ZWJ character. Now the use of the ct ligature is well-known and frequent. Suppose now that a fontmaker is making a font of his or her own and decides to include a glyph for a pp ligature, with a swash flourish joining and going beyond the lower ends of the descenders both to the left and to the right. The fontmaker could note that the ligature might be good in a word like copper but might look wrong in a word like happy due to the tail on the letter y clashing with the rightward side of the swash flourish. So the fontmaker encodes 0070 200D 0070 as a pp ligature but does not encode 0070 0070 as a pp ligature, so that the ligature glyph is only used when specifically requested using a ZWJ character.
However, when the ZWJ character is used, the meaning of the pp sequence is not changed from the meaning when the pp sequence is not used. Yet when 1F468 200D 1F680 is used, the meaning of the sequence is different from the meaning of the sequence 1F468 1F680 such that the meaning of 1F468 200D 1F680 is listed in a file available from the Unicode website. >From where does the astronaut's spacesuit and helmet come? I am reminded that in chemistry if one mixes two chemicals, sometimes one just gets a mixture of two chemicals and sometimes one gets a chemical reaction such that another chemical is produced. Repeating the quote from earlier in this post. In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. end quote I am concerned that in the future a user of ISO/IEC 10646 will not be able to find from ISO/IEC 10646 the meaning of an emoji that he or she observes being displayed, even if he or she is able to discover what is the sequence of characters being used. So I ask that this matter be discussed please. William Overington Monday 15 May 2017 From unicode at unicode.org Mon May 15 12:43:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 15 May 2017 18:43:53 +0100 Subject: Are Emoji ZWJ sequences characters? In-Reply-To: References: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> Message-ID: <20170515184353.47e68b81@JRWUBU2> On Mon, 15 May 2017 16:14:23 +0000 Peter Constable via Unicode wrote: > So, your helpful person was, indeed, helpful, giving you correct > information: ZWJ sequences are not _characters_ and have no > implications for ISO/IEC 10646. Except in so far as the claimed ligature changes the meaning of the ligated elements. For example, using <'a', ZWJ, 'e'> for an a-umlaut that was clearly not a-diaeresis would probably be on the edge of what is permissible. Returning to the example, shouldn't 1F468 200D 1F680 mean 'male rocket maker'? Richard. From unicode at unicode.org Mon May 15 12:52:25 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 15 May 2017 10:52:25 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote: > On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt the proposed change. > Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. Changing a specification as fundamental as this is something that should not be undertaken lightly. Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats. Even if it were true that all data is only stored in UTF-8, any data converted from UFT-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used. Implementations working in UTF-8 natively would potentially see three formats: 1) the original ill-formed data 2) data converted with single FFFD 3) data converted with multiple FFFD These forms cannot be compared for equality by binary matching. 
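For instance (a hypothetical ill-formed input, with byte values chosen purely for illustration), the same original data can end up in two incompatible converted forms, and only a comparison that folds runs of U+FFFD can relate them:

    import re

    raw  = b"a\xc0\xafb"       # form (1): original ill-formed bytes (overlong "/")
    one  = "a\ufffdb"          # form (2): a converter that emits a single U+FFFD for C0 AF
    many = "a\ufffd\ufffdb"    # form (3): a converter that emits one U+FFFD per byte

    print(one == many)                                   # False
    print(one.encode("utf-8") == many.encode("utf-8"))   # False: binary match fails

    # A search-style comparison has to fold runs of U+FFFD before comparing.
    fold = lambda s: re.sub("\ufffd+", "\ufffd", s)
    print(fold(one) == fold(many))                       # True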
The best that can be done is to convert (1) into one of the other forms and then compare treating any run of FFFD code points as equal to any other run, irrespective of length. (For security-critical applications, the presence of any FFFD should render the data invalid, so the comparisons we'd be talking about here would be for general purpose, like search). Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late. A./ > >> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't >> representative of implementation concerns of implementations that use >> UTF-8 as their in-memory Unicode representation. >> >> Even though there are notable systems (Win32, Java, C#, JavaScript, >> ICU, etc.) that are stuck with UTF-16 as their in-memory >> representation, which makes concerns of such implementation very >> relevant, I think the Unicode Consortium should acknowledge that >> UTF-16 was, in retrospect, a mistake > You may think that. There are those of us who do not. The fact is that UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway. > >> Therefore, despite UTF-16 being widely used as an in-memory >> representation of Unicode and in no way going away, I think the >> Unicode Consortium should be *very* sympathetic to technical >> considerations for implementations that use UTF-8 as the in-memory >> representation of Unicode. > I don?t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don?t see what that has to do with either the original proposal or with your criticism of UTF-16. > > [snip] > >> If the proposed >> change was adopted, while Draconian decoders (that fail upon first >> error) could retain their current state machine, implementations that >> emit U+FFFD for errors and continue would have to add more state >> machine states (i.e. more complexity) to consolidate more input bytes >> into a single U+FFFD even after a valid sequence is obviously >> impossible. > ?Impossible?? Why? You just need to add some error states (or *an* error state and a counter); it isn?t exactly difficult, and I?m sure ICU isn?t the only library that already did just that *because it?s clearly the right thing to do*. > > Kind regards, > > Alastair. > > -- > http://alastairs-place.net > > > From unicode at unicode.org Mon May 15 12:54:23 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 15 May 2017 10:54:23 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <37d24cde-96c3-5732-726f-79293d561b4e@ix.netcom.com> On 5/15/2017 3:21 AM, Henri Sivonen via Unicode wrote: > Second, the political reason: > > Now that ICU is a Unicode Consortium project, I think the Unicode > Consortium should be particular sensitive to biases arising from being > both the source of the spec and the source of a popular > implementation. It looks*really bad* both in terms of equal footing > of ICU vs. 
other implementations for the purpose of how the standard > is developed as well as the reliability of the standard text vs. ICU > source code as the source of truth that other implementors need to pay > attention to if the way the Unicode Consortium resolves a discrepancy > between ICU behavior and a well-known spec provision (this isn't some > ill-known corner case, after all) is by changing the spec instead of > changing ICU*especially* when the change is not neutral for > implementations that have made different but completely valid per > then-existing spec and, in the absence of legacy constraints, superior > architectural choices compared to ICU (i.e. UTF-8 internally instead > of UTF-16 internally). > > I can see the irony of this viewpoint coming from a WHATWG-aligned > browser developer, but I note that even browsers that use ICU for > legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior > isn't, in fact, the dominant browser UTF-8 behavior. That is, even > Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the > environment that's the most sensitive to how issues like this are > handled, so it would be appropriate for the proposal to survey current > browser behavior instead of just saying that ICU "feels right" or is > "natural". I think this political reason should be taken very seriously. There are already too many instances where ICU can be seen "driving" the development of property and algorithms. Those involved in the ICU project may not see the problem, but I agree with Henri that it requires a bit more sensitivity from the UTC. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 13:02:34 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Mon, 15 May 2017 19:02:34 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 15 May 2017, at 18:52, Asmus Freytag wrote: > > On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote: >> On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: >>> In reference to: >>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >>> >>> I think Unicode should not adopt the proposed change. >> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. > > Changing a specification as fundamental as this is something that should not be undertaken lightly. Agreed. > Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats. > > Even if it were true that all data is only stored in UTF-8, any data converted from UFT-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used. > > Implementations working in UTF-8 natively would potentially see three formats: > 1) the original ill-formed data > 2) data converted with single FFFD > 3) data converted with multiple FFFD > > These forms cannot be compared for equality by binary matching. But that was always true, if you were under the impression that only one of (2) and (3) existed, and indeed claiming equality between two instances of U+FFFD might be problematic itself in some circumstances (you don?t know why the U+FFFDs were inserted - they may not replace the same original data). 
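For instance (arbitrary byte values, purely for illustration), two different ill-formed inputs become indistinguishable once replaced, so equality of the U+FFFD-bearing strings says nothing about the original data:

    x = b"a\xffb".decode("utf-8", errors="replace")   # 0xFF never occurs in UTF-8
    y = b"a\xfeb".decode("utf-8", errors="replace")   # neither does 0xFE
    print(x == y)   # True: both are 'a\ufffdb', yet the original bytes differed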
> The best that can be done is to convert (1) into one of the other forms and then compare treating any run of FFFD code points as equal to any other run, irrespective of length. It?s probably safer, actually, to refuse to compare U+FFFD as equal to anything (even itself) unless a special flag is passed. For ?general purpose? applications, you could set that flag and then a single U+FFFD would compare equal to another single U+FFFD; no need for the complicated ?any string of U+FFFD? logic (which in any case makes little sense - it could just as easily generate erroneous comparisons as fix the case we?re worrying about here). > Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late. I don?t think so. Even if we acknowledge the possibility of data in the other form, I think it?s useful guidance to implementers, both now and in the future. One might even imagine that the other, non-favoured form, would eventually fall out of use. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Mon May 15 13:33:18 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Mon, 15 May 2017 21:33:18 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton wrote: > On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt the proposed change. > > Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. The currently-specced behavior makes perfect sense when you add error emission on top of a fail-fast UTF-8 validation state machine. >> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't >> representative of implementation concerns of implementations that use >> UTF-8 as their in-memory Unicode representation. >> >> Even though there are notable systems (Win32, Java, C#, JavaScript, >> ICU, etc.) that are stuck with UTF-16 as their in-memory >> representation, which makes concerns of such implementation very >> relevant, I think the Unicode Consortium should acknowledge that >> UTF-16 was, in retrospect, a mistake > > You may think that. There are those of us who do not. My point is: The proposal seems to arise from the "UTF-16 as the in-memory representation" mindset. While I don't expect that case in any way to go away, I think the Unicode Consortium should recognize the serious technical merit of the "UTF-8 as the in-memory representation" case as having significant enough merit that proposals like this should consider impact to both cases equally despite "UTF-8 as the in-memory representation" case at present appearing to be the minority case. That is, I think it's wrong to view things only or even primarily through the lens of the "UTF-16 as the in-memory representation" case that ICU represents. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Mon May 15 15:05:55 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Mon, 15 May 2017 20:05:55 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: >> Disagree. 
An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. > > Changing a specification as fundamental as this is something that should not be undertaken lightly. IMO, the only think that can be agreed upon is that "something's bad with this UTF-8 data". I think that whether it's treated as a single group of corrupt bytes or each individual byte is considered a problem should be up to the implementation. #1 - This data should "never happen". In a system behaving normally, this condition should never be encountered. * At this point the data is "bad" and all bets are off. * Some applications may have a clue how the bad data could have happened and want to do something in particular. * It seems odd to me to spend much effort standardizing a scenario that should be impossible. #2 - Depending on implementation, either behavior, or some combination, may be more efficient. I'd rather allow apps to optimize for the common case, not the case-that-shouldn't-ever-happen #3 - We have no clue if this "maximal" sequence was a single error, 2 errors, or even more. The lead byte says how many trail bytes should follow, and those should be in a certain range. Values outside of those conditions are illegal, so we shouldn't ever encounter them. So if we did, then something really weird happened. * Did a single character get misencoded? * Was an illegal sequence illegally encoded? * Perhaps a byte got corrupted in transmission? * Maybe we dropped a packet/block, so this is really the beginning of a valid sequence and the tail of another completely valid sequence? In practice, all that most apps would be able to do would be to say "You have bad data, how bad I have no clue, but it's not right". A single bit could've flipped, or you could have only 3 pages of a 4000 page document. No clue at all. At that point it doesn't really matter how many FFFD's the error(s) are replaced with, and no assumptions should be made about the severity of the error. -Shawn From unicode at unicode.org Mon May 15 15:49:05 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 15 May 2017 13:49:05 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote: > >>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't >>> representative of implementation concerns of implementations that use >>> UTF-8 as their in-memory Unicode representation. >>> >>> Even though there are notable systems (Win32, Java, C#, JavaScript, >>> ICU, etc.) that are stuck with UTF-16 as their in-memory >>> representation, which makes concerns of such implementation very >>> relevant, I think the Unicode Consortium should acknowledge that >>> UTF-16 was, in retrospect, a mistake >> You may think that. There are those of us who do not. > My point is: > The proposal seems to arise from the "UTF-16 as the in-memory > representation" mindset. While I don't expect that case in any way to > go away, I think the Unicode Consortium should recognize the serious > technical merit of the "UTF-8 as the in-memory representation" case as > having significant enough merit that proposals like this should > consider impact to both cases equally despite "UTF-8 as the in-memory > representation" case at present appearing to be the minority case. 
> That is, I think it's wrong to view things only or even primarily > through the lens of the "UTF-16 as the in-memory representation" case > that ICU represents. > UTF-16 has some nice properties and there's not need to brand it a "mistake". UTF-8 has different nice properties, but there's equally not reason to treat it as more special than UTF-16. The UTC should adopt a position of perfect neutrality when it comes to assuming in-memory representation, in other words, not make assumptions that optimizing for any encoding form will benefit implementers. UTC, where ICU is strongly represented, needs to guard against basing encoding/properties/algorithm decisions (edge cases mostly), solely or primarily on the needs of a particular implementation that happens to be chosen by the ICU project. A./ From unicode at unicode.org Mon May 15 16:38:26 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 15 May 2017 21:38:26 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode < unicode at unicode.org> wrote: > Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the > case for other situations UTF-8 is clearly more efficient space-wise that includes more ASCII characters than characters between U+0800 and U+FFFF. Given the prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic will pretty much always be smaller in UTF-8. Even for scripts that go from 2 bytes to 3, webpages can get much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a factor of 1.8). The max change in reverse is 1.5, as two bytes goes to three. > and the fact is that handling surrogates (which is what proponents of > UTF-8 or UCS-4 usually focus on) is no more complicated than handling > combining characters, which you have to do anyway. > Not necessarily; you can legally process Unicode text without worrying about combining characters, whereas you cannot process UTF-16 without handling surrogates. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 17:16:32 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Mon, 15 May 2017 22:16:32 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: I?m not sure how the discussion of ?which is better? relates to the discussion of ill-formed UTF-8 at all. And to the last, saying ?you cannot process UTF-16 without handling surrogates? seems to me to be the equivalent of saying ?you cannot process UTF-8 without handling lead & trail bytes?. That?s how the respective encodings work. One could look at it and think ?there are 128 unicode characters that have the same value in UTF-8 as UTF-32,? and ?there are xx thousand unicode characters that have the same value in UTF-16 and UTF-32.? 
-Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of David Starner via Unicode Sent: Monday, May 15, 2017 2:38 PM To: unicode at unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode > wrote: Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations UTF-8 is clearly more efficient space-wise that includes more ASCII characters than characters between U+0800 and U+FFFF. Given the prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic will pretty much always be smaller in UTF-8. Even for scripts that go from 2 bytes to 3, webpages can get much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a factor of 1.8). The max change in reverse is 1.5, as two bytes goes to three. and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway. Not necessarily; you can legally process Unicode text without worrying about combining characters, whereas you cannot process UTF-16 without handling surrogates. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 17:43:29 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 15 May 2017 23:43:29 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <20170515234329.10745518@JRWUBU2> On Mon, 15 May 2017 21:38:26 +0000 David Starner via Unicode wrote: > > and the fact is that handling surrogates (which is what proponents > > of UTF-8 or UCS-4 usually focus on) is no more complicated than > > handling combining characters, which you have to do anyway. > Not necessarily; you can legally process Unicode text without worrying > about combining characters, whereas you cannot process UTF-16 without > handling surrogates. The problem with surrogates is inadequate testing. They're sufficiently rare for many users that it may be a long time before an error is discovered. It's not always obvious that code is designed for UCS-2 rather than UTF-16. Richard. From unicode at unicode.org Mon May 15 17:53:13 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 00:53:13 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <37d24cde-96c3-5732-726f-79293d561b4e@ix.netcom.com> References: <37d24cde-96c3-5732-726f-79293d561b4e@ix.netcom.com> Message-ID: 2017-05-15 19:54 GMT+02:00 Asmus Freytag via Unicode : > I think this political reason should be taken very seriously. There are > already too many instances where ICU can be seen "driving" the development > of property and algorithms. > > Those involved in the ICU project may not see the problem, but I agree > with Henri that it requires a bit more sensitivity from the UTC. > I don't think that the fact that ICU was originately using UTF-16 internally has ANY effect on the decision to represent ill-formed sequences as single or multiple U+FFFD. The internal encoding has nothing in common with the external encoding used when processing input data (which may be UTf-8, UTF-16, UTF-32, and could in all case present ill-formed sequences). 
That internal encoding here will paly no role in how to convert the ill-formed input, or if it will be converted. So yes, independantly of the internal encoding, we'll still ahve to choose between: - not converting the input and return an error or throw an exception - converting the input using a single U+FFFD (in its internal representation, this does not matter) to replace the complete sequence of ill-formed code units in the input data, and preferably return an error status - converting the input using as many U+FFFD (in its internal representation, this does not matter) to replace every ocurence of ill-formed code units in the input data, and preferably return an error status. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 18:20:40 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 01:20:40 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170515234329.10745518@JRWUBU2> References: <20170515234329.10745518@JRWUBU2> Message-ID: Softwares designed with only UCS-2 and not real UTF-16 support are still used today For example MySQL with its broken "UTF-8" encoding which in fact encodes supplementary characters as two separate 16-bit code-units for surrogates, each one blindly encoded as 3-byte sequences which would be ill-formed in standard UTF-8, buit that also does not differentiate invalid pairs of surrogates, and offers no collation support for supplementary characters. In this case some other softwares will break silently on these sequences (for example Mediawiki when installed with a MySQL backend server whose datastore was created with its broken "UTF-8", will silently discard any text starting at the first supplementary character found in the wikitext. This is not a problem of Mediawiki but the fact the MediaWiki does NOT support such MySQL server isntalled with its "UTF-8" datastore, but only supports MySQL if the storage encoding declared for the database was "binary" (but in that case there's no support of collation in MySQL, texts are just containing any random sequences of bytes and internationalization is then made in the client software, here Mediawiki and its PHP, ICU, or Lua libraries, and other tools written in Perl and other languages) Note that this does not affect Wikimedia in its wikis because they were initially installed corectly with the binary encoding in MySQL, but now Wikimedia wikis use another database engine with native UTF-8 support and full coverage of the UCS. Other wikis using Mediawiki will need to upgrade their MySQL version if they want to keep it for adminsitrative reasons (and not convert again their datastore to the binary encoding). Softwares running with only UCS-2 are exposed to such risks similar to the one seen in MediaWiki on incorrect MySQL installations, where any user may edit a page to insert any supplementary character (supplementary sinograms, emojis, Gothic letters, supplementary symbols...) which will look correct when previewing, and correct when it is parsed, accepted silently by MySQL, but then silently truncated because of the encoding error: when reloading the data from MySQL, there will effectively be unexpectedly discarded data. How to react to the risks of data losses or truncation ? 
Throwing an exception or just returning an error is in fact more dangerous than just replacing the ill-formed sequences by one or more U+FFFD: we preserve as much as possible, but anyway softwares should be able to perform some tests in their datastore to see if they correctly handle the encoding: this could be done when starting the sofware and emitting log messages when the backend do not support the encoding: all that is needed is to send a single supplementary character to the remote datastore in a junk table or field and then retrieve it immediately in another transaction to make sure it is preserved. Similar tests can be done to see if the remote datastore also preserves the encoding form or "normalizes it, or alters it (this alteration could happen with a leading BOM and some other silent alterations could be made on NULL and trailing spaces if the datastore does not use text fields with varying length but fixed length instead). Similar tests could be done to check the maximum length accepted (a VARCHAR(256) on a binary-encoded database will not always store 256 Unciode characters, but in a database encoded with non borken UTF-8, it should store 256 codepoints independantly of theior values, even if their UTF-8 encoding would be up to 1024 bytes. 2017-05-16 0:43 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Mon, 15 May 2017 21:38:26 +0000 > David Starner via Unicode wrote: > > > > and the fact is that handling surrogates (which is what proponents > > > of UTF-8 or UCS-4 usually focus on) is no more complicated than > > > handling combining characters, which you have to do anyway. > > > Not necessarily; you can legally process Unicode text without worrying > > about combining characters, whereas you cannot process UTF-16 without > > handling surrogates. > > The problem with surrogates is inadequate testing. They're sufficiently > rare for many users that it may be a long time before an error is > discovered. It's not always obvious that code is designed for UCS-2 > rather than UTF-16. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 22:23:06 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Mon, 15 May 2017 21:23:06 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: > In reference to: > http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf > > I think Unicode should not adopt the proposed change. > > The proposal is to make ICU's spec violation conforming. I think there > is both a technical and a political reason why the proposal is a bad > idea. Henri's claim that "The proposal is to make ICU's spec violation conforming" is a false statement, and hence all further commentary based on this false premise is irrelevant. I believe that ICU is actually currently conforming to TUS. The proposal reads: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8..." There is nothing in here that is requiring any implementation to be changed. The word "recommend" does not mean the same as "require". Have you guys been so caught up in the current international political situation that you have lost the ability to read straight? TUS has certain requirements for UTF-8 handling, and it has certain other "Best Practices" as detailed in 3.9. 
The proposal involves changing those recommendations. It does not involve changing any requirements. From unicode at unicode.org Tue May 16 01:50:54 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 09:50:54 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote: > I?m not sure how the discussion of ?which is better? relates to the > discussion of ill-formed UTF-8 at all. Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I meant on that point and then move on: I acknowledge that UTF-16 as the internal memory representation is the dominant design. However, because UTF-8 as the internal memory representation is *such a good design* (when legacy constraits permit) that *despite it not being the current dominant design*, I think the Unicode Consortium should be fully supportive of UTF-8 as the internal memory representation and not treat UTF-16 as the internal representation as the one true way of doing things that gets considered when speccing stuff. I.e. I wasn't arguing against UTF-16 as the internal memory representation (for the purposes of this thread) but trying to motivate why the Consortium should consider "UTF-8 internally" equally despite it not being the dominant design. So: When a decision could go either way from the "UTF-16 internally" perspective, but one way clearly makes more sense from the "UTF-8 internally" perspective, the "UTF-8 internally" perspective should be decisive in *such a case*. (I think the matter at hand is such a case.) At the very least a proposal should discuss the impact on the "UTF-8 internally" case, which the proposal at hand doesn't do. (Moving on to a different point.) The matter at hand isn't, however, a new green-field (in terms of implementations) issue to be decided but a proposed change to a standard that has many widely-deployed implementations. Even when observing only "UTF-16 internally" implementations, I think it would be appropriate for the proposal to include a review of what existing implementations, beyond ICU, do. Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right". -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 02:01:03 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 10:01:03 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Tue, May 16, 2017 at 6:23 AM, Karl Williamson wrote: > On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt the proposed change. >> >> The proposal is to make ICU's spec violation conforming. I think there >> is both a technical and a political reason why the proposal is a bad >> idea. 
> > > > Henri's claim that "The proposal is to make ICU's spec violation conforming" > is a false statement, and hence all further commentary based on this false > premise is irrelevant. > > I believe that ICU is actually currently conforming to TUS. Do you mean that ICU's behavior differs from what the PDF claims (I didn't test and took the assertion in the PDF about behavior at face value) or do you mean that despite deviating from the currently-recommended best practice the behavior is conforming, because the relevant part of the spec is mere best practice and not a requirement? > TUS has certain requirements for UTF-8 handling, and it has certain other > "Best Practices" as detailed in 3.9. The proposal involves changing those > recommendations. It does not involve changing any requirements. Even so, I think even changing a recommendation of "best practice" needs way better rationale than "feels right" or "ICU already does it" when a) major browsers (which operate in the most prominent environment of broken and hostile UTF-8) agree with the currently-recommended best practice and b) the currently-recommended best practice makes more sense for implementations where "UTF-8 decoding" is actually mere "UTF-8 validation". -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 02:13:45 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 08:13:45 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <6259C523-20B6-4B76-AB2B-33020E1C863C@alastairs-place.net> On 15 May 2017, at 23:16, Shawn Steele via Unicode wrote: > > I?m not sure how the discussion of ?which is better? relates to the discussion of ill-formed UTF-8 at all. It doesn?t, which is a point I made in my original reply to Henry. The only reason I answered his anti-UTF-16 rant at all was to point out that some of us don?t think UTF-16 is a mistake, and in fact can see various benefits (*particularly* as an in-memory representation). > And to the last, saying ?you cannot process UTF-16 without handling surrogates? seems to me to be the equivalent of saying ?you cannot process UTF-8 without handling lead & trail bytes?. That?s how the respective encodings work. Quite. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 02:22:53 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 16 May 2017 00:22:53 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote: > On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode > wrote: >> I?m not sure how the discussion of ?which is better? relates to the >> discussion of ill-formed UTF-8 at all. > Clearly, the "which is better" issue is distracting from the > underlying issue. I'll clarify what I meant on that point and then > move on: > > I acknowledge that UTF-16 as the internal memory representation is the > dominant design. 
However, because UTF-8 as the internal memory > representation is *such a good design* (when legacy constraits permit) > that *despite it not being the current dominant design*, I think the > Unicode Consortium should be fully supportive of UTF-8 as the internal > memory representation and not treat UTF-16 as the internal > representation as the one true way of doing things that gets > considered when speccing stuff. There are cases where it is prohibitive to transcode external data from UTF-8 to any other format, as a precondition to doing any work. In these situations processing has to be done in UTF-8, effectively making that the in-memory representation. I've encountered this issue on separate occasions, both for my own code as well as code I reviewed for clients. I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic. > ....At the very least a proposal should discuss the impact on the "UTF-8 > internally" case, which the proposal at hand doesn't do. This is a key point. It may not be directly relevant to any other modifications to the standard, but the larger point is to not make assumption about how people implement the standard (or any of the algorithms). > (Moving on to a different point.) > > The matter at hand isn't, however, a new green-field (in terms of > implementations) issue to be decided but a proposed change to a > standard that has many widely-deployed implementations. Even when > observing only "UTF-16 internally" implementations, I think it would > be appropriate for the proposal to include a review of what existing > implementations, beyond ICU, do. I would like to second this as well. The level of documented review of existing implementation practices tends to be thin (at least thinner than should be required for changing long-established edge cases or recommendations, let alone core conformance requirements). > > Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge and Chrome) shows agreement on the current spec: there > is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, > 6 on the second, 4 on the third and 6 on the last line). Changing the > Unicode standard away from that kind of interop needs *way* better > rationale than "feels right". It would be good if the UTC could work out some minimal requirements for evaluating proposals for changes to properties and algorithms, much like the criteria for encoding new code points A./ From unicode at unicode.org Tue May 16 02:23:14 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 10:23:14 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen wrote: > Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge and Chrome) shows agreement on the current spec: there > is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, > 6 on the second, 4 on the third and 6 on the last line). 
Changing the > Unicode standard away from that kind of interop needs *way* better > rationale than "feels right". Testing with that file, Python 3 and OpenJDK 8 agree with the currently-specced best-practice, too. I expect there to be other well-known implementations that comply with the currently-specced best practice, so the rationale to change the stated best practice would have to be very strong (as in: security problem with currently-stated best practice) for a change to be appropriate. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 02:26:33 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 08:26:33 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170515234329.10745518@JRWUBU2> References: <20170515234329.10745518@JRWUBU2> Message-ID: <752AC650-694E-45F7-8854-A0DE9A8D5A77@alastairs-place.net> On 15 May 2017, at 23:43, Richard Wordingham via Unicode wrote: > > The problem with surrogates is inadequate testing. They're sufficiently > rare for many users that it may be a long time before an error is > discovered. It's not always obvious that code is designed for UCS-2 > rather than UTF-16. While I don?t think we should spend too long debating the relative merits of UTF-8 versus UTF-16, I?ll note that that argument applies equally to both combining characters and indeed the underlying UTF-8 encoding in the first place, and that mistakes in handling both are not exactly uncommon. There are advantages to UTF-8 and advantages to UTF-16. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 02:42:46 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 08:42:46 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: On 16 May 2017, at 08:22, Asmus Freytag via Unicode wrote: > I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic. That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. (The only time I can think of that the in-memory representation has a significant effect is where you?re talking about default binary ordering of string data, in which case, in the presence of non-BMP characters, UTF-8 and UCS-4 sort the same way, but because the surrogates are ?in the wrong place?, UTF-16 doesn?t. I think everyone is well aware of that, no?) >> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick >> test with three major browsers that use UTF-16 internally and have >> independent (of each other) implementations of UTF-8 decoding >> (Firefox, Edge and Chrome) shows agreement on the current spec: there >> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, >> 6 on the second, 4 on the third and 6 on the last line). Changing the >> Unicode standard away from that kind of interop needs *way* better >> rationale than "feels right?. In what sense is this ?interop?? Under what circumstance would it matter how many U+FFFDs you see? 
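(For concreteness, the counts at issue can be reproduced with Python 3, which, as noted elsewhere in this thread, follows the currently recommended practice. This is a sketch; the byte sequences are the ones used in PR-121 and in the Ruby examples later in the thread, not the bytes from Henri's test page.)

# Counting U+FFFD replacements under the currently recommended
# "maximal subpart" practice; assumes a Python 3.3+ UTF-8 codec.
tests = [
    b"\xe0\x80\x80",                        # overlong encoding of U+0000
    b"\xf4\x90\x80\x80",                    # value above U+10FFFF
    b"\x41\xc0\xaf\x41\xf4\x80\x80\x41",    # example from PR-121
]
for raw in tests:
    decoded = raw.decode("utf-8", errors="replace")
    print(raw, "->", decoded.count("\ufffd"), "U+FFFD")
# Prints 3, 4 and 3 replacements respectively, matching the Ruby
# results quoted later in this thread.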
If you?re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents. Would you advocate replacing e0 80 80 with U+FFFD U+FFFD U+FFFD (1) rather than U+FFFD (2) It?s pretty clear what the intent of the encoder was there, I?d say, and while we certainly don?t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don?t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input. This isn?t just a matter of ?feels nicer?. (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 02:50:27 2017 From: unicode at unicode.org (J Decker via Unicode) Date: Tue, 16 May 2017 00:50:27 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode < unicode at unicode.org> wrote: > On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode > wrote: > > I?m not sure how the discussion of ?which is better? relates to the > > discussion of ill-formed UTF-8 at all. > > Clearly, the "which is better" issue is distracting from the > underlying issue. I'll clarify what I meant on that point and then > move on: > > I acknowledge that UTF-16 as the internal memory representation is the > dominant design. However, because UTF-8 as the internal memory > representation is *such a good design* (when legacy constraits permit) > that *despite it not being the current dominant design*, I think the > Unicode Consortium should be fully supportive of UTF-8 as the internal > memory representation and not treat UTF-16 as the internal > representation as the one true way of doing things that gets > considered when speccing stuff. > > I.e. I wasn't arguing against UTF-16 as the internal memory > representation (for the purposes of this thread) but trying to > motivate why the Consortium should consider "UTF-8 internally" equally > despite it not being the dominant design. > > So: When a decision could go either way from the "UTF-16 internally" > perspective, but one way clearly makes more sense from the "UTF-8 > internally" perspective, the "UTF-8 internally" perspective should be > decisive in *such a case*. (I think the matter at hand is such a > case.) > > At the very least a proposal should discuss the impact on the "UTF-8 > internally" case, which the proposal at hand doesn't do. > > (Moving on to a different point.) > > The matter at hand isn't, however, a new green-field (in terms of > implementations) issue to be decided but a proposed change to a > standard that has many widely-deployed implementations. Even when > observing only "UTF-16 internally" implementations, I think it would > be appropriate for the proposal to include a review of what existing > implementations, beyond ICU, do. > > Consider https://hsivonen.com/test/moz/broken-utf-8.html . 
A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge and Chrome) Something I've learned through working with Node (V8 javascript engine from chrome) V8 stores strings either as UTF-16 OR UTF-8 interchangably and is not one OR the other... https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY and I wouldn't really assume UTF-16 is a 'majority'; Go is utf-8 for instance. > shows agreement on the current spec: there > is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, > 6 on the second, 4 on the third and 6 on the last line). Changing the > Unicode standard away from that kind of interop needs *way* better > rationale than "feels right". > > -- > Henri Sivonen > hsivonen at hsivonen.fi > https://hsivonen.fi/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 03:00:13 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 09:00:13 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <20170516090013.55793b87@JRWUBU2> On Tue, 16 May 2017 10:01:03 +0300 Henri Sivonen via Unicode wrote: > Even so, I think even changing a recommendation of "best practice" > needs way better rationale than "feels right" or "ICU already does it" > when a) major browsers (which operate in the most prominent > environment of broken and hostile UTF-8) agree with the > currently-recommended best practice and b) the currently-recommended > best practice makes more sense for implementations where "UTF-8 > decoding" is actually mere "UTF-8 validation". There was originally an attempt to prescribe rather than to recommend the interpretation of ill-formed 8-bit Unicode strings. It may even briefly have been an issued prescription, until common sense prevailed. I do remember a sinking feeling when I thought I would have to change my own handling of bogus UTF-8, only to be relieved later when it became mere best practice. However, it is not uncommon for coding standards to prescribe 'best practice'. Richard. From unicode at unicode.org Tue May 16 03:18:41 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 16 May 2017 08:18:41 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: On Tue, May 16, 2017 at 12:42 AM Alastair Houghton < alastair at alastairs-place.net> wrote: > If you?re about to mutter something about security, consider this: > security code *should* refuse to compare strings that contain U+FFFD (or at > least should never treat them as equal, even to themselves), because it has > no way to know what that code point represents. > Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers. 
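(A minimal sketch of both hazards, assuming Python 3; the byte strings are hypothetical names, not taken from any real system.)

# Two distinct ill-formed byte sequences can decode, with replacement,
# to the same Unicode string, so equality on the decoded form can no
# longer distinguish the underlying names.
a = b"\x80\x80".decode("utf-8", errors="replace")   # '\ufffd\ufffd'
b = b"\xe0\x80".decode("utf-8", errors="replace")   # also '\ufffd\ufffd'
print(a == b)            # True: different byte names collide after replacement

# And the IEEE 754 analogy: a value that never compares equal to itself.
nan = float("nan")
print(nan == nan)        # False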
> Would you advocate replacing > > e0 80 80 > > with > > U+FFFD U+FFFD U+FFFD (1) > > rather than > > U+FFFD (2) > > It?s pretty clear what the intent of the encoder was there, I?d say, and > while we certainly don?t want to decode it as a NUL (that was the source of > previous security bugs, as I recall), I also don?t see the logic in > insisting that it must be decoded to *three* code points when it clearly > only represented one in the input. > In this case, It's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data. 88 A0 8B D4 is UTF-16 Chinese, but I'm not going to insist that it get replaced with U+FFFD U+FFFD because it's clear (to me) it was meant as two characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 03:31:07 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 11:31:07 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote: > but I think the way he raises this point is needlessly antagonistic. I apologize. My level of dismay at the proposal's ICU-centricity overcame me. On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton wrote: > That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. If the internal representation is UTF-16 (or UTF-32), it is a likely design that there is a variable into which the scalar value of the current code point is accumulated during UTF-8 decoding. In such a scenario, it can be argued as "natural" to first operate according to the general structure of UTF-8 and then inspect what you got in the accumulation variable (ruling out non-shortest forms, values above the Unicode range and surrogate values after the fact). When the internal representation is UTF-8, only UTF-8 validation is needed, and it's natural to have a fail-fast validator, which *doesn't necessarily need such a scalar value accumulator at all*. The construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ when used as a UTF-8 validator is the best illustration of a UTF-8 validator not necessarily looking like a "natural" UTF-8 to UTF-16 converter at all. >>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick >>> test with three major browsers that use UTF-16 internally and have >>> independent (of each other) implementations of UTF-8 decoding >>> (Firefox, Edge and Chrome) shows agreement on the current spec: there >>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, >>> 6 on the second, 4 on the third and 6 on the last line). Changing the >>> Unicode standard away from that kind of interop needs *way* better >>> rationale than "feels right?. > > In what sense is this ?interop?? In the sense that prominent independent implementations do the same externally observable thing. > Under what circumstance would it matter how many U+FFFDs you see? Maybe it doesn't, but I don't think the burden of proof should be on the person advocating keeping the spec and major implementations as they are. 
If anything, I think those arguing for a change of the spec in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current spec should show why it's important to have a different number of U+FFFDs than the spec's "best practice" calls for now. > If you?re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents. In practice, e.g. the Web Platform doesn't allow for stopping operating on input that contains an U+FFFD, so the focus is mainly on making sure that U+FFFDs are placed well enough to prevent bad stuff under normal operations. At least typically, the number of U+FFFDs doesn't matter for that purpose, but when browsers agree on the number of U+FFFDs, changing that number should have an overwhelmingly strong rationale. A security reason could be a strong reason, but such a security motivation for fewer U+FFFDs has not been shown, to my knowledge. > Would you advocate replacing > > e0 80 80 > > with > > U+FFFD U+FFFD U+FFFD (1) > > rather than > > U+FFFD (2) I advocate (1), most simply because that's what Firefox, Edge and Chrome do *in accordance with the currently-recommended best practice* and, less simply, because it makes sense in the presence of a fail-fast UTF-8 validator. I think the burden of proof to show an overwhelmingly good reason to change should, at this point, be on whoever proposes doing it differently than what the current widely-implemented spec says. > It?s pretty clear what the intent of the encoder was there, I?d say, and while we certainly don?t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don?t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input. As noted previously, the logic is that you generate a U+FFFD whenever a fail-fast validator fails. > This isn?t just a matter of ?feels nicer?. (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation. Again, the current best practice makes perfect logical sense in the context of a fail-fast UTF-8 validator. Moreover, it doesn't look like both are "out there" equally when major browsers, OpenJDK and Python 3 agree. (I expect I could find more prominent implementations that implement the currently-stated best practice, but I feel I shouldn't have to.) From my experience from working on Web standards and implementing them, I think it's a bad idea to change something to be "more logical" when the change would move away from browser consensus. 
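(To make the "fail-fast validation" shape of implementation concrete, here is a minimal sketch in Python. It is not anyone's actual shipping code; it emits one U+FFFD per maximal subpart, as the currently recommended practice specifies, using only the well-formed byte ranges from the standard's UTF-8 table and no scalar-value accumulator.)

def decode_utf8_with_replacement(data: bytes) -> str:
    # One U+FFFD per maximal subpart of an ill-formed subsequence:
    # fail fast on the first byte that cannot continue a well-formed
    # sequence, then resume scanning at that very byte.
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                               # ASCII fast path
            out.append(chr(b0)); i += 1; continue
        # The allowed range of the second byte depends on the lead byte;
        # this is what rejects overlongs, surrogates and values above
        # U+10FFFF without accumulating a scalar value.
        if   0xC2 <= b0 <= 0xDF: length, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xE0:         length, lo, hi = 3, 0xA0, 0xBF
        elif 0xE1 <= b0 <= 0xEC: length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xED:         length, lo, hi = 3, 0x80, 0x9F
        elif 0xEE <= b0 <= 0xEF: length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF0:         length, lo, hi = 4, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3: length, lo, hi = 4, 0x80, 0xBF
        elif b0 == 0xF4:         length, lo, hi = 4, 0x80, 0x8F
        else:                                       # C0, C1, F5..FF, stray trail byte
            out.append("\ufffd"); i += 1; continue
        j, need = i + 1, length - 1
        while need and j < n:
            ok = (lo <= data[j] <= hi) if j == i + 1 else (0x80 <= data[j] <= 0xBF)
            if not ok:
                break
            j += 1; need -= 1
        if need == 0:
            out.append(data[i:j].decode("utf-8"))   # well-formed sequence
        else:
            out.append("\ufffd")                    # one U+FFFD for the maximal subpart data[i:j]
        i = j                                       # resume at the offending byte (or at the end)
    return "".join(out)

# E0 E0 C3 89 -> '\ufffd\ufffd\u00c9'; E0 80 80 -> three U+FFFDs;
# 41 C0 AF 41 F4 80 80 41 -> 'A\ufffd\ufffdA\ufffdA', matching PR-121.

The number of U+FFFDs simply falls out of where the check stops; merging several such subparts into a single replacement is what would require the extra states.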
-- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 03:45:48 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 09:45:48 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> > On 16 May 2017, at 09:18, David Starner wrote: > > On Tue, May 16, 2017 at 12:42 AM Alastair Houghton wrote: >> If you?re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents. >> > Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers. That?s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them? Clearly if you are holding Unicode code points that you know are validly encoded somehow, you may want to be able to match U+FFFDs, but that?s a special case where you have extra knowledge. > In this case, It's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data. I don?t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt. The proposal actually does cover things that aren?t structurally valid, like your e0 e0 e0 example, which it suggests should be a single U+FFFD because the initial e0 denotes a three byte sequence, and your 80 80 80 example, which it proposes should constitute three illegal subsequences (again, both reasonable). However, I?m not entirely certain about things like e0 e0 c3 89 which the proposal would appear to decode as U+FFFD U+FFFD U+FFFD U+FFFD (3) instead of a perhaps more reasonable U+FFFD U+FFFD U+00C9 (4) (the key part is the ?without ever restricting trail bytes to less than 80..BF?) and if Markus or others could explain why they chose (3) over (4) I?d be quite interested to hear the explanation. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 04:29:09 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 16 May 2017 09:29:09 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> Message-ID: On Tue, May 16, 2017 at 1:45 AM Alastair Houghton < alastair at alastairs-place.net> wrote: > That?s true anyway; imagine the database holds raw bytes, that just happen > to decode to U+FFFD. There might seem to be *two* names that both contain > U+FFFD in the same place. How do you distinguish between them? 
> If the database holds raw bytes, then the name is a byte string, not a Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule to make and enforce that a string in a database is a validly formatted string; I would hope that most SQL servers do in fact reject malformed UTF-8 strings. On the other hand, I'd expect that an SQL server would accept U+FFFD in a Unicode string. > I don?t see a problem; the point is that where a structurally valid UTF-8 > encoding has been used, albeit in an invalid manner (e.g. encoding a number > that is not a valid code point, or encoding a valid code point as an > over-long sequence), a single U+FFFD is appropriate. That seems a > perfectly sensible rule to adopt. > It seems like a perfectly arbitrary rule to adopt; I'd like to assume that the only source of such UTF-8 data is willful attempts to break security, and in that case, how is this a win? Nonattack sources of broken data are much more likely to be the result of mixing UTF-8 with other character encodings or raw binary data. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 04:55:34 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 10:55:34 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> Message-ID: <14C93F5D-1CFF-4999-B9F2-8BE604FA77B9@alastairs-place.net> > On 16 May 2017, at 10:29, David Starner wrote: > > On Tue, May 16, 2017 at 1:45 AM Alastair Houghton wrote: > That?s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them? > >> If the database holds raw bytes, then the name is a byte string, not a Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule to make and enforce that a string in a database is a validly formatted string; I would hope that most SQL servers do in fact reject malformed UTF-8 strings. On the other hand, I'd expect that an SQL server would accept U+FFFD in a Unicode string. Databases typically separate the encoding in which strings are stored from the encoding in which an application connected to the database is operating. A database might well hold data in (say) ISO Latin 1, EUC-JP, or indeed any other character set, while presenting it to a client application as UTF-8 or UTF-16. Hence my comment - application software could very well see two names that are apparently identical and that include U+FFFDs in the same places, even though the database back-end actually has different strings. As I said, this is a problem we already have. > I don?t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt. > >> It seems like a perfectly arbitrary rule to adopt; I'd like to assume that the only source of such UTF-8 data is willful attempts to break security, and in that case, how is this a win? Nonattack sources of broken data are much more likely to be the result of mixing UTF-8 with other character encodings or raw binary data. 
I?d say there are three sources of UTF-8 data of that ilk: (a) bugs, (b) ?Modified UTF-8? and ?CESU-8? implementations, (c) wilful attacks (b) in particular is quite common, and the result of the presently recommended approach doesn?t make much sense there ([c0 80] will get replaced with *two* U+FFFDs, while [ed a0 bd ed b8 80] will be replaced by *four* U+FFFDs - surrogates aren?t supposed to be valid in UTF-8, right?) Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 05:09:44 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 11:09:44 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote: > > On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton > wrote: >> That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. > > If the internal representation is UTF-16 (or UTF-32), it is a likely > design that there is a variable into which the scalar value of the > current code point is accumulated during UTF-8 decoding. That?s quite a likely design with a UTF-8 internal representation too; it?s just that you?d only decode during processing, as opposed to immediately at input. > When the internal representation is UTF-8, only UTF-8 validation is > needed, and it's natural to have a fail-fast validator, which *doesn't > necessarily need such a scalar value accumulator at all*. Sure. But a state machine can still contain appropriate error states without needing an accumulator. That the ones you care about currently don?t is readily apparent, but there?s nothing stopping them from doing so. I don?t see this as an argument about implementations, since it really makes very little difference to the implementation which approach is taken; in both internal representations, the question is whether you generate U+FFFD immediately on detection of the first incorrect *byte*, or whether you do so after reading a complete sequence. UTF-8 sequences are bounded anyway, so it isn?t as if failing early gives you any significant performance benefit. >> In what sense is this ?interop?? > > In the sense that prominent independent implementations do the same > externally observable thing. The argument is, I think, that in this case the thing they are doing is the *wrong* thing. That many of them do it would only be an argument if there was some reason that it was desirable that they did it. There doesn?t appear to be such a reason, unless you can think of something that hasn?t been mentioned thus far? The only reason you?ve given, to date, is that they currently do that, so that should be the recommended behaviour (which is little different from the argument - which nobody deployed - that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don?t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers). I?ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they?re actually processing CESU-8, not UTF-8 anyway). 
Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true ?fast fail? implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *four* U+FFFDs (I think), no? One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 05:40:37 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 13:40:37 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> Message-ID: On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton wrote: > On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote: >> >> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton >> wrote: >>> That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. >> >> If the internal representation is UTF-16 (or UTF-32), it is a likely >> design that there is a variable into which the scalar value of the >> current code point is accumulated during UTF-8 decoding. > > That?s quite a likely design with a UTF-8 internal representation too; it?s just that you?d only decode during processing, as opposed to immediately at input. The time to generate the U+FFFDs is at the input time which is what's at issue here. The later processing, which may then involve iterating by code point and involving computing the scalar values is a different step that should be able to assume valid UTF-8 and not be concerned with invalid UTF-8. (To what extent different programming languages and frameworks allow confident maintenance of the invariant that after input all in-RAM UTF-8 can be treated as valid varies.) >> When the internal representation is UTF-8, only UTF-8 validation is >> needed, and it's natural to have a fail-fast validator, which *doesn't >> necessarily need such a scalar value accumulator at all*. > > Sure. But a state machine can still contain appropriate error states without needing an accumulator. As I said upthread, it could, but it seems inappropriate to ask implementations to take on that extra complexity on as weak grounds as "ICU does it" or "feels right" when the current recommendation doesn't call for those extra states and the current spec is consistent with a number of prominent non-ICU implementations, including Web browsers. >>> In what sense is this ?interop?? >> >> In the sense that prominent independent implementations do the same >> externally observable thing. > > The argument is, I think, that in this case the thing they are doing is the *wrong* thing. It's seems weird to characterize following the currently-specced "best practice" as "wrong" without showing a compelling fundamental flaw (such as a genuine security problem) in the currently-specced "best practice". With implementations of the currently-specced "best practice" already shipped, I don't think aesthetic preferences should be considered enough of a reason to proclaim behavior adhering to the currently-specced "best practice" as "wrong". > That many of them do it would only be an argument if there was some reason that it was desirable that they did it. 
There doesn?t appear to be such a reason, unless you can think of something that hasn?t been mentioned thus far? I've already given a reason: UTF-8 validation code not needing to have extra states catering to aesthetic considerations of U+FFFD consolidation. > The only reason you?ve given, to date, is that they currently do that, so that should be the recommended behaviour (which is little different from the argument - which nobody deployed - that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don?t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers). Not just browsers. Also OpenJDK and Python 3. Do I really need to test the standard libraries of more languages/systems to more strongly make the case that the ICU behavior (according to the proposal PDF) is not the norm and what the spec currently says is? > I?ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they?re actually processing CESU-8, not UTF-8 anyway). Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true ?fast fail? implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *four* U+FFFDs (I think), no? I see that behavior in Python 2. Earlier, I said that Python 3 agrees with the current spec for my test case. The Python 2 behavior I see is not just against "best practice" but obviously incompliant. (For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.) > One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement. This is an odd argument in favor of changing it. If the argument is that it's just a recommendation that you don't need to adhere to, surely then the people who don't like the current recommendation should choose not to adhere to it instead of advocating changing it. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 05:44:00 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 12:44:00 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> Message-ID: > > The proposal actually does cover things that aren?t structurally valid, > like your e0 e0 e0 example, which it suggests should be a single U+FFFD > because the initial e0 denotes a three byte sequence, and your 80 80 80 > example, which it proposes should constitute three illegal subsequences > (again, both reasonable). However, I?m not entirely certain about things > like > > e0 e0 c3 89 > > which the proposal would appear to decode as > > U+FFFD U+FFFD U+FFFD U+FFFD (3) > > instead of a perhaps more reasonable > > U+FFFD U+FFFD U+00C9 (4) > > (the key part is the ?without ever restricting trail bytes to less than > 80..BF?) > I also agree with that, due to access in strings from random position: if you access it from byte 0x89, you can assume it's a trialing byte and you'll want to look backward, and will see 0xc3,0x89 which will decode correctly as U+00C9 without any error detected. 
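(A sketch of the backward scan being described, assuming Python 3; sync_back is a made-up helper name for illustration.)

def sync_back(data: bytes, pos: int) -> int:
    # From an arbitrary byte offset, step back over at most three
    # continuation bytes (0x80..0xBF) to reach the byte that starts
    # the enclosing sequence, if there is one.
    start = pos
    while start > 0 and pos - start < 3 and 0x80 <= data[start] <= 0xBF:
        start -= 1
    return start

data = bytes([0xE0, 0xE0, 0xC3, 0x89])
i = sync_back(data, 3)                  # land on 0xC3 when starting from 0x89
print(data[i:i + 2].decode("utf-8"))    # 'É' (U+00C9), decoded without error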
So the wrong bytes are only the two leading 0xE0 bytes, which are individually converted to U+FFFD. In summary: when you detect any ill-formed sequence, only replace the first code unit with U+FFFD and restart scanning from the next code unit, without skipping over multiple bytes. This means that emitting multiple occurrences of U+FFFD is not only the best practice, it also matches the intended design of UTF-8 to allow access from random positions.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue May 16 06:08:52 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Tue, 16 May 2017 20:08:52 +0900
Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com>
Message-ID: <73e5364f-08db-2f19-3498-4a23e56649ac@it.aoyama.ac.jp>

Hello everybody, [using this mail to in effect reply to different mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote:
>> Under what circumstance would it matter how many U+FFFDs you see?
>
> Maybe it doesn't, but I don't think the burden of proof should be on
> the person advocating keeping the spec and major implementations as
> they are. If anything, I think those arguing for a change of the spec
> in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
> with the current spec should show why it's important to have a
> different number of U+FFFDs than the spec's "best practice" calls for
> now.

I have just checked (the programming language) Ruby. Some background: As you might know, Ruby is (at least in theory) pretty encoding-independent, meaning you can run scripts in iso-8859-1, in Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, without any conversion. However, in practice, incl. Ruby on Rails, Ruby is very much using UTF-8 internally, and is optimized to work well that way. Character encoding conversion also works with UTF-8 as the pivot encoding.

As far as I understand, Ruby does the same as all of the above software, based (among other things) on the fact that we followed the recommendation in the standard. Here are a few examples (sorry for the linebreaks introduced by mail software):

$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD"
$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", invalid: :replace).inspect' #=> "A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html as noted at https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516&view=markup#l1507 (for those having a look at these tests, in Ruby's version of assert_equal, the expected value comes first (not sure whether this is called little-endian or big-endian :-), but this is a decision where the various test frameworks are virtually split 50/50 :-(. ))

Even if the above examples and the tests use conversion to UTF-16 (in particular the BE variant for better readability), what happens internally is that the input is analyzed byte-by-byte.
In this case, it is easiest to just stop as soon as something is found that is clearly invalid (be this a single byte or something longer). This makes a data-driven implementation (such as the Ruby transcoder) or one based on a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) more compact. In other words, because we never know whether the next byte is a valid one such as 0x41, it's easier to just handle one byte at a time if this way we can avoid lookahead (which is always a good idea when parsing). I agree with Henri and others that there is no need at all to change the recommendation in the standard that has been stable for so long (close to 9 years). Because the original was done on a PR (http://www.unicode.org/review/pr-121.html), I think this should at least also be handled as PR (if it's not dropped based on the discussion here). I think changing the current definition of "maximal subsequence" is a bad idea, because it would mean that one wouldn't know what one was speaking about over the years. If necessary, new definitions should be introduced for other variants. I agree with others that ICU should not be considered to have a special status, it should be just one implementation among others. [The next point is a side issue, please don't spend too much time on it.] I find it particularly strange that at a time when UTF-8 is firmly defined as up to 4 bytes, never including any bytes above 0xF4, the Unicode consortium would want to consider recommending that be converted to a single U+FFFD. I note with agreement that Markus seems to have thoughts in the same direction, because the proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes above F4 could be somewhat debatable.)". Regards, Martin. From unicode at unicode.org Tue May 16 06:15:33 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 13:15:33 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> Message-ID: 2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode : > > One additional note: the standard codifies this behaviour as a > *recommendation*, not a requirement. > > This is an odd argument in favor of changing it. If the argument is > that it's just a recommendation that you don't need to adhere to, > surely then the people who don't like the current recommendation > should choose not to adhere to it instead of advocating changing it. I also agree. The internet is full of RFC specifications that are also "best practices" and even in this case, changing them must be extensively documented, including discussing new compatibility/interoperability problems and new security risks. The case of random access in substrings is significant because what was once valid UTF-8 could become invalid if the best recommandation is not followed, and then could cause unexpected failures, uncaught exceptions causing software to suddenly fail and become subject to possible attacks due to this new failure (this is mostly a problem for implementations that do not use "safe" U+FFFD replacements but throw exceptions on ill-formed input: we should not change the cases where these exceptions may occur by adding new cases caused by a change of implementation based on a change of best practice). 
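(The two behaviours being contrasted here, sketched with Python 3's error handlers; "strict" raising is the exception case, "replace" is the safe-substitution case.)

bad = b"A\xe0\x80\x80B"          # an overlong sequence embedded in otherwise good text
try:
    bad.decode("utf-8")          # errors='strict' is the default and raises
except UnicodeDecodeError as err:
    print("strict decoding raises:", err.reason)
print(bad.decode("utf-8", errors="replace"))   # 'A\ufffd\ufffd\ufffdB': the loss is visible, nothing raises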
Considerations about trying to reduce the number of U+FFFDs are not really relevant; they are purely aesthetic, driven by a wish to compact the decoded result in memory. What really matters is not to silently ignore ill-formed sequences, and to properly track that there was some data loss. The number of U+FFFDs inserted (only one, or as many as there are invalid code units before the first resynchronization point) is not so important.

Likewise, whether implementations use an accumulator or just a single state (where each state knows how many code units have been parsed without emitting an output code point, so that those code units can be re-read by relative indexed accesses) is not relevant; it is a very minor optimization question. In my opinion, using an accumulator that can live in a CPU register is faster than using relative indexed accesses. All modern CPUs have enough registers to hold that accumulator plus the input and output pointers, and a finite-state number is not needed when the state can be tracked by the instruction position: you don't necessarily need to loop for each code unit, but can write the decoder so that each iteration processes a full code point or emits a single U+FFFD before adjusting the input pointer. UTF-8 and UTF-16 are simple enough that unwinding such loops to process full code points instead of single code units is easy. That code will still remain very small (fitting fully in the instruction cache), and it will be faster because it avoids several conditional branches and saves one register (for the finite-state number) that would otherwise have to be spilled to the stack: two pointer registers (or two access function/method addresses), two data registers and the program counter are enough.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue May 16 07:44:44 2017
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 16 May 2017 14:44:44 +0200
Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
In-Reply-To: References: Message-ID: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com>

> On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote:
...
> I think Unicode should not adopt the proposed change.

It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated into UTF-32 in a way that the original octet sequence can be restored.

From unicode at unicode.org Tue May 16 08:00:33 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 16 May 2017 15:00:33 +0200
Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
In-Reply-To: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com>
References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com>
Message-ID: 

2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :
>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, including non-valid sequences, is
> translated into UTF-32 in a way that the original octet sequence can be
> restored.

Why just UTF-32 ? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid UTF-8/UTF-16/UTF-32 ?
In all cases this would require extensions on the 3 standards (which MUST be interoperable), then you'll shoke on new validation rules for these 3 standards for these extensions, and new ill-formed sequences that you won't be able to convert interoperably. Given the most restrictive condition in UTF-16 (which is still the most widely used internal representation), such extensions would be very complex too manage. There's no solution, such extensions in any one of them are then undesirable and can only be used privately (but without interoperating with the other 2 representations), so it's impossible to make sure the original octet sequences can be restored. Any deviation of the UTF-8/16/32 will be bounded in the same UTF. It cannot be part of the 3 standard UTF, but may be part of a distinct encoding, not fully compatible with the 3 standards. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 08:10:05 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 14:10:05 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <73e5364f-08db-2f19-3498-4a23e56649ac@it.aoyama.ac.jp> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <73e5364f-08db-2f19-3498-4a23e56649ac@it.aoyama.ac.jp> Message-ID: <20170516141005.608e95e9@JRWUBU2> On Tue, 16 May 2017 20:08:52 +0900 "Martin J. D?rst via Unicode" wrote: > I agree with others that ICU should not be considered to have a > special status, it should be just one implementation among others. > [The next point is a side issue, please don't spend too much time on > it.] I find it particularly strange that at a time when UTF-8 is > firmly defined as up to 4 bytes, never including any bytes above > 0xF4, the Unicode consortium would want to consider recommending that > be converted to a single U+FFFD. I note with > agreement that Markus seems to have thoughts in the same direction, > because the proposal (17168-utf-8-recommend.pdf) says "(I suppose > that lead bytes above F4 could be somewhat debatable.)". The undesirable sidetrack, I suppose, is worrying about how many planes will be required for emoji. However, it does make for the point that, while some practices may be better than other, there isn't necessarily a best practice. The English of the proposal is unclear - the text would benefit from showing some maximal subsequences (poor terminology - some of us are used to non-contiguous subsequences). When he writes, "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF", I am pretty sure he means "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, with the only restriction on trailing bytes beyond the number of them being that they must be in the range 80..BF". Thus Philippe's example of "E0 E0 C3 89" would be converted with an error flagged to a sequence of scalar values FFFD FFFD C9. This may make a UTF-8 system usable if it tries to use something like non-characters as understood before CLDR was caught publishing them as an essential part of text files. Richard. 
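(A sketch of how Richard's reading of the proposal differs in U+FFFD counts from the current recommendation, assuming Python 3; the "structural" rule below is an interpretation of the proposal's wording, not code from the proposal itself. The counts for the current recommendation are the ones Martin's Ruby examples show above.)

def fffd_count_structural(data: bytes) -> int:
    # Proposed "structural" reading: a lead byte announces a length, and
    # any following bytes in 80..BF (up to that length) are consumed with
    # it; an ill-formed run of that shape becomes a single U+FFFD.
    count, i, n = 0, 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            i += 1; continue
        if 0x80 <= b <= 0xBF:            # stray continuation byte: one U+FFFD on its own
            count += 1; i += 1; continue
        length = 2 if b <= 0xDF else 3 if b <= 0xEF else 4 if b <= 0xF7 else 1
        j = i + 1
        while j < n and j < i + length and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            data[i:j].decode("utf-8")    # well-formed part: no replacement
        except UnicodeDecodeError:
            count += 1                   # whole structural run collapses to one U+FFFD
        i = j
    return count

for raw in (b"\xe0\x80\x80", b"\xf4\x90\x80\x80", b"\xe0\xe0\xc3\x89"):
    current = raw.decode("utf-8", errors="replace").count("\ufffd")
    print(raw, "current:", current, "proposal:", fffd_count_structural(raw))
# E0 80 80: current 3, proposal 1.  F4 90 80 80: current 4, proposal 1.
# E0 E0 C3 89: both give 2, i.e. FFFD FFFD U+00C9, as Richard says.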
From unicode at unicode.org Tue May 16 08:21:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 14:21:53 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: <20170516142153.3e146371@JRWUBU2> On Tue, 16 May 2017 14:44:44 +0200 Hans ?berg via Unicode wrote: > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > > wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode > codepoint markers that indicate how UTF-8, including non-valid > sequences, is translated into UTF-32 in a way that the original octet > sequence can be restored. Escape sequences for the inappropriate bytes is the natural technique. Your problem is smoothly transitioning so that the escape character is always escaped when it means itself. Strictly, it can't be done. Of course, some sequences of escaped characters should be prohibited. Checking could be fiddly. Richard. From unicode at unicode.org Tue May 16 08:23:55 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 15:23:55 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: > On 16 May 2017, at 15:00, Philippe Verdy wrote: > > 2017-05-16 14:44 GMT+02:00 Hans ?berg via Unicode : > > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated into UTF-32 in a way that the original octet sequence can be restored. > > Why just UTF-32 ? Synonym for codepoint numbers. It would suffice to add markers how it is translated. For example, codepoints meaning "overlong long length ", "byte", or whatever is useful. > How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid UTF-8/UTF-16/UTF-32 ? You don't. You have a filename, which is a octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in that is is not being reachable. It only matters that the correct octet sequence is handed back to the filesystem. All current filsystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above. From unicode at unicode.org Tue May 16 09:10:52 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 16:10:52 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: 2017-05-16 15:23 GMT+02:00 Hans ?berg : > All current filsystems, as far as experts could recall, use octet > sequences at the lowest level; whatever encoding is used is built in a > layer above > Not NTFS (on Windows) which uses sequences of 16bit units. 
Same about FAT32/exFAT within "Long File Names" (the legacy 8.3 short filenames are using legacy 8-bit codepages, but these are alternate filenames used when long filenames are not found, and working mostly like aliasing physical links on Unix filesystems, as if they were separate directory entries, except that they are hidden by default when their matching LFN are already shown) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 10:30:09 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 16:30:09 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> On 16 May 2017, at 14:23, Hans ?berg via Unicode wrote: > > You don't. You have a filename, which is a octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in that is is not being reachable. > > It only matters that the correct octet sequence is handed back to the filesystem. All current filsystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above. HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. FAT 8.3 names are also encoded, but the encoding isn?t specified (more specifically, MS-DOS and Windows assume an encoding based on your locale, which could cause all kinds of fun if you swapped disks with someone from a different country, and IIRC there are some shenanigans for Japan because of the use of 0xe5 as a deleted file marker). There are some less widely used filesystems that require a particular encoding also (BeOS? BFS used UTF-8, for instance). Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use whose names can?t be converted to UTF-8, the Darwin kernel uses a percent encoding scheme(!) It looks like Apple has changed its mind for APFS and is going with the ?bag of bytes? approach that?s typical of other systems; at least, that?s what it appears to have done on iOS. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 10:44:54 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 17:44:54 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> Message-ID: <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> > On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote: > > On 16 May 2017, at 14:23, Hans ?berg via Unicode wrote: >> >> You don't. You have a filename, which is a octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in that is is not being reachable. >> >> It only matters that the correct octet sequence is handed back to the filesystem. All current filsystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above. > > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... 
The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. From unicode at unicode.org Tue May 16 10:52:00 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 16:52:00 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> Message-ID: <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> On 16 May 2017, at 16:44, Hans ?berg wrote: > > On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote: >> >> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... > > The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). On Windows, the ?ANSI? APIs convert the filenames to or from the appropriate Windows code page, while the ?Wide? API works in UTF-16, which is the native encoding for VFAT long filenames and NTFS filenames. And, as I said, on Mac OS X and iOS, the kernel expects filenames to be encoded as UTF-8 at the BSD API, regardless of what encoding you might be using in your Terminal (this is different to traditional UNIX behaviour, where how you interpret your filenames is entirely up to you - usually you?d use the same encoding you were using on your tty). Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 11:07:34 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 18:07:34 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > On 16 May 2017, at 17:52, Alastair Houghton wrote: > > On 16 May 2017, at 16:44, Hans ?berg wrote: >> >> On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote: >>> >>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >> >> The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. > > No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). I am not speaking about what they store, but how the filesystem identifies files. 
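For the filename round-tripping Hans is after, one existing approach (not from this thread) is PEP 383's "surrogateescape" error handler, which CPython's os.fsdecode/os.fsencode use on POSIX: each undecodable byte 0xXX is mapped to the lone surrogate U+DCXX and mapped back on encoding, so the original octet sequence is always recoverable. A short sketch, assuming a Latin-1 filename that is not valid UTF-8:

    raw = b'caf\xe9.txt'                     # Latin-1 bytes, not valid UTF-8

    name = raw.decode('utf-8', 'surrogateescape')
    # name is 'caf\udce9.txt': the bad byte survives as a lone surrogate,
    # so the string is not valid for interchange, but nothing is lost.

    assert name.encode('utf-8', 'surrogateescape') == raw   # exact round trip

The cost is exactly the one discussed above: such strings contain lone surrogates, so they have to stay inside the process and cannot be emitted as well-formed UTF-8 or UTF-16 without further escaping.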
From unicode at unicode.org Tue May 16 11:13:33 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 17:13:33 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On 16 May 2017, at 17:07, Hans ?berg wrote: > >>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >>> >>> The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. >> >> No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). > > I am not speaking about what they store, but how the filesystem identifies files. Well, quite clearly none of those systems treat the UTF-16 strings as binary either - they?re case insensitive, so how could they? HFS+ even normalises strings using a variant of a frozen version of the normalisation spec. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 11:23:51 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 18:23:51 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > On 16 May 2017, at 18:13, Alastair Houghton wrote: > > On 16 May 2017, at 17:07, Hans ?berg wrote: >> >>>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >>>> >>>> The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. >>> >>> No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). >> >> I am not speaking about what they store, but how the filesystem identifies files. > > Well, quite clearly none of those systems treat the UTF-16 strings as binary either - they?re case insensitive, so how could they? HFS+ even normalises strings using a variant of a frozen version of the normalisation spec. HFS implements case insensitivity in a layer above the filesystem raw functions. So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already. From unicode at unicode.org Tue May 16 11:38:36 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 17:38:36 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On 16 May 2017, at 17:23, Hans ?berg wrote: > > HFS implements case insensitivity in a layer above the filesystem raw functions. 
So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already. You keep insisting on this, but it?s not true; I?m a disk utility developer, and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory data (a single one for the entire disk, not one per directory either), and that that tree is sorted by (CNID, filename) pairs. And since it?s case-preserving *and* case-insensitive, the comparisons it does to order its B+-Tree nodes *cannot* be raw. I should know - I?ve actually written the code for it! Even for legacy HFS, which didn?t store UTF-16, but stored a specified Mac legacy encoding (the encoding used is in the volume header), it?s case sensitive, so the encoding matters. I don?t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know how the filesystem works. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 11:52:03 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 18:52:03 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > On 16 May 2017, at 18:38, Alastair Houghton wrote: > > On 16 May 2017, at 17:23, Hans ?berg wrote: >> >> HFS implements case insensitivity in a layer above the filesystem raw functions. So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already. > > You keep insisting on this, but it?s not true; I?m a disk utility developer, and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory data (a single one for the entire disk, not one per directory either), and that that tree is sorted by (CNID, filename) pairs. And since it?s case-preserving *and* case-insensitive, the comparisons it does to order its B+-Tree nodes *cannot* be raw. I should know - I?ve actually written the code for it! > > Even for legacy HFS, which didn?t store UTF-16, but stored a specified Mac legacy encoding (the encoding used is in the volume header), it?s case sensitive, so the encoding matters. > > I don?t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know how the filesystem works. One could make files that differed by case in the same directory, and Mac OS 9 did not bother. Legacy HFS tended to slow down with many files in the same directory, so that gave an impression of a tree structure. The BSD filesystem at the time, perhaps the one that Mac OS X once supported, did not store files in a tree, but flat with redundancy. The other info I got on the Austin Group List a decade ago. 
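A tiny illustration of Alastair's point that a case-preserving, case-insensitive catalogue cannot order its keys by raw code units. Here str.casefold merely stands in for HFS+'s frozen case-folding table, which it only approximates:

    names = ['readme.txt', 'README.TXT', 'Resume.txt']
    print(sorted(names))                     # raw code-unit order: R < e < r
    print(sorted(names, key=str.casefold))   # folded order differs

    assert 'readme.txt'.casefold() == 'README.TXT'.casefold()
    # ...which is why a case-insensitive B-tree cannot compare raw bytes.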
From unicode at unicode.org Tue May 16 12:30:01 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 17:30:01 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: > Would you advocate replacing > e0 80 80 > with > U+FFFD U+FFFD U+FFFD (1) > rather than > U+FFFD (2) > It?s pretty clear what the intent of the encoder was there, I?d say, and while we certainly don?t > want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don?t > see the logic in insisting that it must be decoded to *three* code points when it clearly only > represented one in the input. It is not at all clear what the intent of the encoder was - or even if it's not just a problem with the data stream. E0 80 80 is not permitted, it's garbage. An encoder can't "intend" it. Either A) the "encoder" was attempting to be malicious, in which case the whole thing is suspect and garbage, and so the # of FFFD's doesn't matter, or B) the "encoder" is completely broken, in which case all bets are off, again, specifying the # of FFFD's is irrelevant. C) The data was corrupted by some other means. Perhaps bad concatenations, lost blocks during read/transmission, etc. If we lost 2 512 byte blocks, then maybe we should have a thousand FFFDs (but how would we known?) -Shawn From unicode at unicode.org Tue May 16 12:58:22 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 18:58:22 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <20170516185822.47a88df5@JRWUBU2> On Tue, 16 May 2017 17:30:01 +0000 Shawn Steele via Unicode wrote: > > Would you advocate replacing > > > e0 80 80 > > > with > > > U+FFFD U+FFFD U+FFFD (1) > > > rather than > > > U+FFFD (2) > > > It?s pretty clear what the intent of the encoder was there, I?d > > say, and while we certainly don?t want to decode it as a NUL (that > > was the source of previous security bugs, as I recall), I also > > don?t see the logic in insisting that it must be decoded to *three* > > code points when it clearly only represented one in the input. > > It is not at all clear what the intent of the encoder was - or even > if it's not just a problem with the data stream. E0 80 80 is not > permitted, it's garbage. An encoder can't "intend" it. It was once a legal way of encoding NUL, just like C0 E0, which is still in use, and seems to be the best way of storing NUL as character content in a *C string*. (Strictly speaking, one can't do it.) It could be lurking in old text or come from an old program that somehow doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8. Remember the conformance test for the Unicode Collation Algorithm has contained lone surrogates in the past, and the UAX on Unicode Regular Expressions used to require the ability to search for lone surrogates. Richard. 
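A small sketch of the "modified UTF-8" convention Richard alludes to (Java class files and JNI strings), where U+0000 is written as the overlong two-byte sequence C0 80 precisely so that no 0x00 byte appears inside a C string. This is a hand-rolled illustration, not Java's actual encoder, and the CESU-8 treatment of supplementary characters is omitted:

    def encode_modified_utf8(s):
        out = bytearray()
        for ch in s:
            if ch == '\x00':
                out += b'\xC0\x80'            # deliberate overlong NUL
            else:
                out += ch.encode('utf-8')
        return bytes(out)

    data = encode_modified_utf8('a\x00b')
    assert data == b'a\xC0\x80b'
    assert 0x00 not in data                   # safe to pass as a C string

A standard UTF-8 decoder must still treat C0 80 as ill-formed; under the current recommendation it becomes two U+FFFDs, under the proposed change a single one.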
From unicode at unicode.org Tue May 16 13:01:57 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 20:01:57 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> Message-ID: On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random sequences of 16-bit code units are not permitted. There's visibly a validation step that returns an error if you attempt to create files with invalid sequences (including other restrictions such as forbidding U+0000 and some other problematic controls). This occurs because the NTFS and FAT driver will also attempt to normalize the string in order to create compatibility 8.3 filenames using the system's native locale (not the current user locale which is used when searching files/enumerating directories or opening files - this could generate errors when the encodings for distinct locales do not match, but should not cause errors when filenames are **first** searched in their UTF-16 encoding specified in applications, but applications that still need to access files using their short name are deprecated). The kind of normalization taken for creating short 8.3 filenames uses OS-specific specific conversion tables built in the filesystem drivers. This generation however has a cost due to the uniqueness constraints (requiring to abbreviate the first part of the 8.3 name to add "~numbered" suffixes before the extension, whose value is unpredicatable if there are other existing "*~1.*" files: it requires the driver to retry with another number, looping if necessary). This also has a (very modest) storage cost but it is less critical than the enumeration step and the fact that these shortened name cannot be predicted by applications. This canonicalization is also required also because the filesystem is case-insensitive (and it's technically not possible to store all the multiple case variants for filenames as assigned aliases/physical links). In classic filesystems for Unix/Linux the only restrictions are on forbidding null bytes, and assigning "/" a role for hierarchic filesystems (unusable anywhere as directory entry name), plus the preservation of "." and ".." entries in directories, meaning that only 8-bit encodings based on 7-bit ASCII are possible, so Linux/Unix are not completely treating thes filenames as pure binary bags of bytes (however if this is not checked and such random names may occur, which will be difficult to handle with classic tools and shells). Some other filesystems for Linux/Unix are still enforcing restrictions (and there exists even versions of them that are supporting case insensitity, in addition to FAT12/FAT16/FAT32/exFAT/NTFS emulated filesystems: this also exists in NFS driver as an option, or in drivers for legacy filesystems initially coming from mainframes, or in filesystem drivers based on FTP, and even in the filesystem driver allowing to mount a Windows registry which is also case-insensitive). 
Technically in the core kernel of Linux/Unix there's no restriction on the effective encoding (except "/" and null), the actual restrictions are implemented within filesystem drivers, configured only when volumes are mounted: each mounted filesystem can then have its own internal encoding; there will be different behaviors when using a driver for any MacOS filesystem. Linux can perfectly work with NTFS filesystems, except that most of the time, short filenames will be completely ignored and not generated on the fly. This generation of short filenames in a legacy (unspecified) 8-bit codepage is not a requirement of NTFS and it can be disabled also in Windows. But FAT12/FAT16/FAT32 still require these legacy short names to be generated when only the LFN could be used, and the short 8.3 name left completely null in the main directory entry ; but legacy FAT drivers will shoke on these null entries, if they are not tagged by a custom attribute bit as "ignorable but not empty", or if the 8+3 characters do not use specific unique parterns such as "\" followed by 7 pseudo-random characters in the main part, plus 3 other pseudo-random characters in the extension (these 10 characters may use any non null value: they provide nearly 80 bits or more exactly 250^10 identifiers if we exclude the 6 characters "/", "\", ".", ":" NULL and SPACE that are reserved, which could be generated almost predictably simply by hashing the original unabbreviated name with 79 bits from SHA-128, or faster with simple MD5 hahsing, and very rare remaining collisions to handle). Some FAT repait tools will attempt to repair the legacy short filenames that are not unique or cannot be derived from the UTF-16 encoded LFN (this happens when "repairing" a FAT volume initially created on another system that used a different 8-bit OEM codepage, but this "CheckDisk" tools should have an option to not "repair" them, given that modern applications normally do not need these filenames if a LFN is present (even the Windows Explorer will not display these short names because trhey are hidden by default each time there's a LFN which overrides them). We must add however that on FAT filesystems, a LFN will not always be stored if the Unicode name already has the "8.3" form and all characters are from ASCII (which is the base of all supported 8-bit OEM charsets), but it will be created if the user edits the filename to use another prefered capitalization than the default one (the Explorer default is to render fully capitlized short filenames using a single leading capital letter, and all other characters, including the 1-to-3-characters file extension, befing displayed as lowercase (so the "Windows" LFN would be stored simply as the "WINDOWS" short filename without any LFN needed in the directory entries). To be complete, a few legacy filenames are also reserved and can't be used in Windows (short of LFN) filenames, such as "CON" (case-insensitive), reserved by another legacy non-filesystem driver before they are seeked in a specific current directory: to use them as filenames, you must prefix them with a drive letter or with the some ".\" prefix (relative to the current directory) or full path name. 2017-05-16 17:44 GMT+02:00 Hans ?berg : > > > On 16 May 2017, at 17:30, Alastair Houghton via Unicode < > unicode at unicode.org> wrote: > > > > On 16 May 2017, at 14:23, Hans ?berg via Unicode > wrote: > >> > >> You don't. You have a filename, which is a octet sequence of unknown > encoding, and want to deal with it. 
Therefore, valid Unicode > transformations of the filename may result in that is is not being > reachable. > >> > >> It only matters that the correct octet sequence is handed back to the > filesystem. All current filsystems, as far as experts could recall, use > octet sequences at the lowest level; whatever encoding is used is built in > a layer above. > > > > HFS(+), NTFS and VFAT long filenames are all encoded in some variation > on UCS-2/UTF-16. ... > > The filesystem directory is using octet sequences and does not bother > passing over an encoding, I am told. Someone could remember one that to > used UTF-16 directly, but I think it may not be current. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:09:32 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 18:09:32 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170516185822.47a88df5@JRWUBU2> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <20170516185822.47a88df5@JRWUBU2> Message-ID: Regardless, it's not legal and hasn't been legal for quite some time. Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to anything depending on that fake-null, so one or three isn't really going to matter. -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Tuesday, May 16, 2017 10:58 AM To: unicode at unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On Tue, 16 May 2017 17:30:01 +0000 Shawn Steele via Unicode wrote: > > Would you advocate replacing > > > e0 80 80 > > > with > > > U+FFFD U+FFFD U+FFFD (1) > > > rather than > > > U+FFFD (2) > > > It?s pretty clear what the intent of the encoder was there, I?d say, > > and while we certainly don?t want to decode it as a NUL (that was > > the source of previous security bugs, as I recall), I also don?t see > > the logic in insisting that it must be decoded to *three* code > > points when it clearly only represented one in the input. > > It is not at all clear what the intent of the encoder was - or even if > it's not just a problem with the data stream. E0 80 80 is not > permitted, it's garbage. An encoder can't "intend" it. It was once a legal way of encoding NUL, just like C0 E0, which is still in use, and seems to be the best way of storing NUL as character content in a *C string*. (Strictly speaking, one can't do it.) It could be lurking in old text or come from an old program that somehow doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8. Remember the conformance test for the Unicode Collation Algorithm has contained lone surrogates in the past, and the UAX on Unicode Regular Expressions used to require the ability to search for lone surrogates. Richard. From unicode at unicode.org Tue May 16 13:13:15 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 20:13:15 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: 2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode : > C) The data was corrupted by some other means. Perhaps bad > concatenations, lost blocks during read/transmission, etc. 
If we lost 2 > 512 byte blocks, then maybe we should have a thousand FFFDs (but how would > we known?) > Thousands of U+FFFD's is not a problem (independantly of the internal UTF encoding used): yes the 2512 byte block could then become 3 times larger (if using UTF-8 internal encoding) or 2 times larger (if using UTF-16 internal encoding) but every application should be prepared to support the size expansion with a completely know maximum factor, which could occur as well with any valid CJK-only text. So the size to allocate for the internal sorage is predictable from the size of the input, this is an important feature of all standard UTF's. Being able to handle the worst case of allowed expansion, militates largely for the adoption of UTF-16 as the internal encoding, instead of UTF-8 (where you'll need to allocate more space before decoding the input, if you want to avoid successive memory reallocations, which would impact the performance of your decoder): it's simple to accept input from 512 bytes (or 1KB) buffers, and allocate a 1KB (or 2KB) buffer for storing the intermediate results in the generic decoder, and simpler on the outer level to preallocate buffers with resonable sizes that will be reallocated once if needed to the maximum size, and then reduced to the effective size (if needed) at end of successful decoding (some implementations can use pools of preallocated buffers with small static sizes, allocating new buffers out side the pool only for rare cases where more space will be needed) . -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:13:53 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 16 May 2017 11:13:53 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <2a11e4fa-7e0c-3e68-e7d5-f5147051e9ce@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:20:00 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 20:20:00 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> Message-ID: <9D18C48B-F7E0-46FE-B8BF-5A963C79109A@telia.com> > On 16 May 2017, at 20:01, Philippe Verdy wrote: > > On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random sequences of 16-bit code units are not permitted. There's visibly a validation step that returns an error if you attempt to create files with invalid sequences (including other restrictions such as forbidding U+0000 and some other problematic controls). For it to work the way I suggested, there would be low level routines that handles the names raw, and then on top of that, interface routines doing what you describe. On the Austin Group List, they mentioned a filesystem doing it directly in UTF-16, and it could have been the one you describe. 
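The worst-case growth Philippe mentions above is easy to bound up front: neither the current nor the proposed recommendation emits more than one U+FFFD per ill-formed input byte, so a decoder can size its output buffer before looking at the data. A back-of-the-envelope sketch (the function names are mine):

    def max_utf8_output_bytes(n_input_bytes):
        # Worst case: every byte is ill-formed and becomes U+FFFD,
        # which re-encodes as 3 bytes in UTF-8.
        return 3 * n_input_bytes

    def max_utf16_output_units(n_input_bytes):
        # Worst case: one 16-bit code unit per input byte
        # (ASCII bytes and ill-formed bytes alike).
        return n_input_bytes

    block = 512
    print(max_utf8_output_bytes(block))    # 1536 bytes (3x growth)
    print(max_utf16_output_units(block))   # 512 units = 1024 bytes (2x growth)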
From unicode at unicode.org Tue May 16 13:36:39 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Tue, 16 May 2017 11:36:39 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section "Constraints on Conversion Processes". Otherwise, what you do with illegal sequences is a matter of what you think makes sense -- a matter of opinion and convenience. Nothing more. I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU team. At the time, I believe the ISO UTF-8 definition was not yet limited to U+10FFFF, and decoding overlong sequences and those yielding surrogate code points was regarded as a misdemeanor. The spec has been tightened up, but I am pretty sure that most people familiar with how UTF-8 came about would recognize and as single sequences. I believe that the discussion of how to handle illegal sequences came out of security issues a few years ago from some implementations including valid single and lead bytes with preceding illegal sequences. Beyond the "Constraints on Conversion Processes", there was evidently also a desire to recommend how to handle illegal sequences. I think that the current recommendation was an extrapolation of common practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, but "it feels like" (yes, that's the level of argument for stuff that doesn't really matter) not treating and as single sequences is "weird". Why do we care how we carve up an illegal sequence into subsequences? Only for debugging and visual inspection. Maybe some process is using illegal, overlong sequences to encode something special (? la Java string serialization, "modified UTF-8"), and for that it might be convenient too to treat overlong sequences as single errors. If you don't like some recommendation, then do something else. It does not matter. If you don't reject the whole input but instead choose to replace illegal sequences with something, then make sure the something is not nothing -- replacing with an empty string can cause security issues. Otherwise, what the something is, or how many of them you put in, is not very relevant. One or more U+FFFDs is customary. When the current recommendation came in, I thought it was reasonable but didn't like the edge cases. At the time, I didn't think it was important to twiddle with the text in the standard, and I didn't care that ICU didn't exactly implement that particular recommendation. I have seen implementations that clobber every byte in an illegal sequence with a space, because it's easier than writing an U+FFFD for each byte or for some subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long illegal subsequence; that's fine, too. Karl Williamson sent feedback to the UTC, "In short, I believe the best practices are wrong." I think "wrong" is far too strong, but I got an action item to propose a change in the text. I proposed a modified recommendation. 
Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" that was "right". None of this is motivated by which UTF is used internally. It is true that it takes a tiny bit more thought and work to recognize a wider set of sequences, but a capable implementer will optimize successfully for valid sequences, and maybe even for a subset of those for what might be expected high-frequency code point ranges. Error handling can go into a slow path. In a true state table implementation, it will require more states but should not affect the performance of valid sequences. Many years ago, I decided for ICU to add a small amount of slow-path error-handling code for more human-friendly illegal-sequence reporting. In other words, this was not done out of convenience; it was an inconvenience that seemed justified by nicer error reporting. If you don't like to do so, then don't. Which UTF is better? It depends. They all have advantages and problems. It's all Unicode, so it's all good. ICU largely uses UTF-16 but also UTF-8. It has data structures and code for charset conversion, property lookup, sets of characters (UnicodeSet), and collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly growing set of APIs working directly with UTF-8. So, please take a deep breath. No conformance requirement is being touched, no one is forced to do something they don't like, no special consideration is given for one UTF over another. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:45:01 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 19:45:01 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On 16 May 2017, at 19:36, Markus Scherer wrote: > > Let me try to address some of the issues raised here. Thanks for jumping in. The one thing I wanted to ask about was the ?without ever restricting trail bytes to less than 80..BF?. I think that could be misinterpreted; having thought about it some more, I think you mean ?considering any trailing byte in the range 80..BF as valid?. The ?less than? threw me the first few times I read it and I started thinking you meant allowing any byte as a trailing byte, which is clearly not right. Otherwise, I?m happy :-) Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 13:50:00 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 18:50:00 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: But why change a recommendation just because it ?feels like?. As you said, it?s just a recommendation, so if that really annoyed someone, they could do something else (eg: they could use a single FFFD). If the recommendation is truly that meaningless or arbitrary, then we just get into silly discussions of ?better? that nobody can really answer. 
Alternatively, how about ?one or more FFFDs?? for the recommendation? To me it feels very odd to perhaps require writing extra code to detect an illegal case. The ?best practice? here should maybe be ?one or more FFFDs, whatever makes your code faster?. Best practices may not be requirements, but people will still take time to file bugs that something isn?t following a ?best practice?. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Markus Scherer via Unicode Sent: Tuesday, May 16, 2017 11:37 AM To: Alastair Houghton Cc: Philippe Verdy ; Henri Sivonen ; unicode Unicode Discussion ; Hans ?berg Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section "Constraints on Conversion Processes". Otherwise, what you do with illegal sequences is a matter of what you think makes sense -- a matter of opinion and convenience. Nothing more. I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU team. At the time, I believe the ISO UTF-8 definition was not yet limited to U+10FFFF, and decoding overlong sequences and those yielding surrogate code points was regarded as a misdemeanor. The spec has been tightened up, but I am pretty sure that most people familiar with how UTF-8 came about would recognize and as single sequences. I believe that the discussion of how to handle illegal sequences came out of security issues a few years ago from some implementations including valid single and lead bytes with preceding illegal sequences. Beyond the "Constraints on Conversion Processes", there was evidently also a desire to recommend how to handle illegal sequences. I think that the current recommendation was an extrapolation of common practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, but "it feels like" (yes, that's the level of argument for stuff that doesn't really matter) not treating and as single sequences is "weird". Why do we care how we carve up an illegal sequence into subsequences? Only for debugging and visual inspection. Maybe some process is using illegal, overlong sequences to encode something special (? la Java string serialization, "modified UTF-8"), and for that it might be convenient too to treat overlong sequences as single errors. If you don't like some recommendation, then do something else. It does not matter. If you don't reject the whole input but instead choose to replace illegal sequences with something, then make sure the something is not nothing -- replacing with an empty string can cause security issues. Otherwise, what the something is, or how many of them you put in, is not very relevant. One or more U+FFFDs is customary. When the current recommendation came in, I thought it was reasonable but didn't like the edge cases. At the time, I didn't think it was important to twiddle with the text in the standard, and I didn't care that ICU didn't exactly implement that particular recommendation. I have seen implementations that clobber every byte in an illegal sequence with a space, because it's easier than writing an U+FFFD for each byte or for some subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long illegal subsequence; that's fine, too. 
Karl Williamson sent feedback to the UTC, "In short, I believe the best practices are wrong." I think "wrong" is far too strong, but I got an action item to propose a change in the text. I proposed a modified recommendation. Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" that was "right". None of this is motivated by which UTF is used internally. It is true that it takes a tiny bit more thought and work to recognize a wider set of sequences, but a capable implementer will optimize successfully for valid sequences, and maybe even for a subset of those for what might be expected high-frequency code point ranges. Error handling can go into a slow path. In a true state table implementation, it will require more states but should not affect the performance of valid sequences. Many years ago, I decided for ICU to add a small amount of slow-path error-handling code for more human-friendly illegal-sequence reporting. In other words, this was not done out of convenience; it was an inconvenience that seemed justified by nicer error reporting. If you don't like to do so, then don't. Which UTF is better? It depends. They all have advantages and problems. It's all Unicode, so it's all good. ICU largely uses UTF-16 but also UTF-8. It has data structures and code for charset conversion, property lookup, sets of characters (UnicodeSet), and collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly growing set of APIs working directly with UTF-8. So, please take a deep breath. No conformance requirement is being touched, no one is forced to do something they don't like, no special consideration is given for one UTF over another. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 14:43:58 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 20:43:58 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: <20170516204358.15f6656a@JRWUBU2> On Tue, 16 May 2017 11:36:39 -0700 Markus Scherer via Unicode wrote: > Why do we care how we carve up an illegal sequence into subsequences? > Only for debugging and visual inspection. Maybe some process is using > illegal, overlong sequences to encode something special (? la Java > string serialization, "modified UTF-8"), and for that it might be > convenient too to treat overlong sequences as single errors. I think that's not quite true. If we are moving back and forth through a buffer containing corrupt text, we need to make sure that moving three characters forward and then three characters back leaves us where we started. That requires internal consistency. One possible issue is with text input methods that access an application's backing store. They can issue updates in the form of 'delete 3 characters and insert ...'. However, if the input method is accessing characters it hasn't written, it's probably misbehaving anyway. Such commands do rather heavily assume that any relevant normalisation by the application will be taken into account by the input method. I once had a go at fixing an application that was misinterpreting 'delete x characters' as 'delete x UTF-16 code units'. 
It was a horrible mess, as the application's interface layer couldn't peek at the string being edited. Richard. From unicode at unicode.org Tue May 16 16:08:13 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 23:08:13 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: 2017-05-16 20:50 GMT+02:00 Shawn Steele : > But why change a recommendation just because it ?feels like?. As you > said, it?s just a recommendation, so if that really annoyed someone, they > could do something else (eg: they could use a single FFFD). > > > > If the recommendation is truly that meaningless or arbitrary, then we just > get into silly discussions of ?better? that nobody can really answer. > > > > Alternatively, how about ?one or more FFFDs?? for the recommendation? > > > > To me it feels very odd to perhaps require writing extra code to detect an > illegal case. The ?best practice? here should maybe be ?one or more FFFDs, > whatever makes your code faster?. > Faster ok, privided this does not break other uses, notably for random access within strings, where UTF-8 is designed to allow searching backward on a limited number of bytes (maximum 3) in order to find the leading byte, and then check its value: - if it's not found, return back to the initial position and amke the next access return U+FFFD to signal the error of position: this trailing byte is part of an ill-formed sequence, and for coherence, any further trailine bytes fouind after it will **also** return U+FFFD to be coherent (because these other trailing bytes may also be found bby random access to them. - it the leading byte is found backward ut does not match the expected number of trailing bytes after it, return back to the initial random position where you'll return also U+FFFD. This means that the initial leading byte (part of the ill-formed sequence) must also return a separate U+FFFD, given that each following trailing byte will return U+FFFD isolately when accessing to them. If we want coherent decoding with text handling promitives allowing random access with encoded sequences, there's no other choice than treating EACH byte part of the ill-formed sequence as individual errors mapped to the same replacement code point (U+FFFD if that is what is chosen, but these APIs could as well specify annother replacement character or could eventually return a non-codepoint if the API return value is not restricted to only valid codepoints (for example the replacement could be a negative value whose absolute value matches the invalid code unit, or some other invalid code unit outside the valid range for code points with scalar values: isolated surrogates in UTF-16 for example could be returned as is, or made negative either by returning its opposite or by setting (or'ing) the most significant bit of the return value). 
The problem will arise when you need to store the replacement values if the internal backing store is limited to 16-bit code units or 8-bit code units: this internal backing store may use its own internal extension of standard UTF's, including the possibility of encoding NULLs as C0,80 (like what Java does with its "modified UTF-8 internal encoding used in its compiled binary classes and serializations), or internally using isolated trailing surrogates to store illformed UTF-8 input by or'ing these bytes with 0xDC00 that will be returned as an code point with no valid scalar value. For internally representing illformed UTF-16 sequences, there's no need to change anything. For internally representing ill-formed UTF-32 sequences (in fact limited to one 32-bitcode unit), with a 16bit internal backing store you may need to store 3 16bit values (three isolated trailing surrogates). For internally representing ill formed UTF-32 in an 8 bit backing store, you could use 0xC1 followed by 5 five trailing bytes (each one storing 7 bits of the initial ill-formed code unit from the UTF-32 input). What you'll do in the internal backing store will not be exposed to your API which will just return either valide codepoints with valid scalar values, or values outside the two valid subranges (so it could possibly negative values, or isolated trailing surrogates). That backing store can also substitute some valid input causing problems (such as NULLs) using 0xC0 plus another byte, that sequence being unexposed by your API which will still be able to return the expected codepoints (but with the minor caveat that the total number of returned codepoints will not match the actual size allocated for the internal backing store (that applications using that API won't even need to know how it is internally represented). In other words: any private extensions are possible internally, but it is possible to isolate it within a blackboxing API which will still be able to chose how to represent the input text (it may as well use a zlib-compressed backing store, or some stateless Huffmann compression based on a static statistic table configured and stored elsewhere, intiialized when you first instantiate your API). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 16:15:53 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 21:15:53 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > Faster ok, privided this does not break other uses, notably for random access within strings? Either way, this is a ?recommendation?. I don?t see how that can provide for not-?breaking other uses.? If it?s internal, you can do what you will, so if you need the 1:1 seeming parity, then you can do that internally. But if you?re depending on other APIs/libraries/data source/whatever, it would seem like you couldn?t count on that. (And probably shouldn?t even if it was a requirement rather than a recommendation). I?m wary of the idea of attempting random access on a stream that is also manipulating the stream at the same time (decoding apparently). The U+FFFD emitted by this decoding could also require a different # of bytes to reencode. 
Which might disrupt the presumed parity, depending on how the data access was being handled. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 16:19:52 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 23:19:52 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: Another alternative for you API is to not return simple integer values, but return (read-only) instances of a Char32 class whose "scalar" property would normally be a valid codepoint with scalar value, or whose "string" property will be the actual character; but with another static property "isValidScalar" returning "true"; for other ill-formed sequences,"isValidScalar" will be false, the scalar value will be the initial code unit from the input (decoded from the internal representation in tyhe backing store) and the "string" property will be empty. You may also add a special "Char32" static instance representing end-of-file/end-of-string, whose property "isEOF" will be true, and property scalar will be typically -1, "isValid Scalar" will be false, and the "string" property will be the empty string. All this is possible independantly of the internal representation made in the backing store for its own code units (where it may use any extension of standard UTF's or any data compression scheme without exposing it) 2017-05-16 23:08 GMT+02:00 Philippe Verdy : > > > 2017-05-16 20:50 GMT+02:00 Shawn Steele : > >> But why change a recommendation just because it ?feels like?. As you >> said, it?s just a recommendation, so if that really annoyed someone, they >> could do something else (eg: they could use a single FFFD). >> >> >> >> If the recommendation is truly that meaningless or arbitrary, then we >> just get into silly discussions of ?better? that nobody can really answer. >> >> >> >> Alternatively, how about ?one or more FFFDs?? for the recommendation? >> >> >> >> To me it feels very odd to perhaps require writing extra code to detect >> an illegal case. The ?best practice? here should maybe be ?one or more >> FFFDs, whatever makes your code faster?. >> > > Faster ok, privided this does not break other uses, notably for random > access within strings, where UTF-8 is designed to allow searching backward > on a limited number of bytes (maximum 3) in order to find the leading byte, > and then check its value: > - if it's not found, return back to the initial position and amke the next > access return U+FFFD to signal the error of position: this trailing byte is > part of an ill-formed sequence, and for coherence, any further trailine > bytes fouind after it will **also** return U+FFFD to be coherent (because > these other trailing bytes may also be found bby random access to them. > - it the leading byte is found backward ut does not match the expected > number of trailing bytes after it, return back to the initial random > position where you'll return also U+FFFD. This means that the initial > leading byte (part of the ill-formed sequence) must also return a separate > U+FFFD, given that each following trailing byte will return U+FFFD > isolately when accessing to them. 
> > If we want coherent decoding with text handling promitives allowing random > access with encoded sequences, there's no other choice than treating EACH > byte part of the ill-formed sequence as individual errors mapped to the > same replacement code point (U+FFFD if that is what is chosen, but these > APIs could as well specify annother replacement character or could > eventually return a non-codepoint if the API return value is not restricted > to only valid codepoints (for example the replacement could be a negative > value whose absolute value matches the invalid code unit, or some other > invalid code unit outside the valid range for code points with scalar > values: isolated surrogates in UTF-16 for example could be returned as is, > or made negative either by returning its opposite or by setting (or'ing) > the most significant bit of the return value). > > The problem will arise when you need to store the replacement values if > the internal backing store is limited to 16-bit code units or 8-bit code > units: this internal backing store may use its own internal extension of > standard UTF's, including the possibility of encoding NULLs as C0,80 (like > what Java does with its "modified UTF-8 internal encoding used in its > compiled binary classes and serializations), or internally using isolated > trailing surrogates to store illformed UTF-8 input by or'ing these bytes > with 0xDC00 that will be returned as an code point with no valid scalar > value. For internally representing illformed UTF-16 sequences, there's no > need to change anything. For internally representing ill-formed UTF-32 > sequences (in fact limited to one 32-bitcode unit), with a 16bit internal > backing store you may need to store 3 16bit values (three isolated trailing > surrogates). For internally representing ill formed UTF-32 in an 8 bit > backing store, you could use 0xC1 followed by 5 five trailing bytes (each > one storing 7 bits of the initial ill-formed code unit from the UTF-32 > input). > > What you'll do in the internal backing store will not be exposed to your > API which will just return either valide codepoints with valid scalar > values, or values outside the two valid subranges (so it could possibly > negative values, or isolated trailing surrogates). That backing store can > also substitute some valid input causing problems (such as NULLs) using > 0xC0 plus another byte, that sequence being unexposed by your API which > will still be able to return the expected codepoints (but with the minor > caveat that the total number of returned codepoints will not match the > actual size allocated for the internal backing store (that applications > using that API won't even need to know how it is internally represented). > > In other words: any private extensions are possible internally, but it is > possible to isolate it within a blackboxing API which will still be able to > chose how to represent the input text (it may as well use a zlib-compressed > backing store, or some stateless Huffmann compression based on a static > statistic table configured and stored elsewhere, intiialized when you first > instantiate your API). > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed May 17 01:03:49 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Wed, 17 May 2017 09:03:49 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On Tue, May 16, 2017 at 9:36 PM, Markus Scherer wrote: > Let me try to address some of the issues raised here. Thank you. > The proposal changes a recommendation, not a requirement. This is a very bad reason in favor of the change. If anything, this should be a reason why there is no need to change the spec text. > Conformance > applies to finding and interpreting valid sequences properly. This includes > not consuming parts of valid sequences when dealing with illegal ones, as > explained in the section "Constraints on Conversion Processes". > > Otherwise, what you do with illegal sequences is a matter of what you think > makes sense -- a matter of opinion and convenience. Nothing more. This may be the Unicode-level view of error handling. It isn't the Web-level view of error handling. In the world of Web standards (i.e. standards that read on the behavior of browser engines), we've learned that implementation-defined behavior is bad, because someone makes a popular site that depends on the implementation-defined behavior of the browser they happened to test in. For this reason, the WHATWG has since 2004 written specs that are well-defined even in corner cases and for non-conforming input, and we've tried to extend this culture into the W3C, too. (Sometimes, exceptions are made when there's a very good reason to handle a corner case differently in a given implementation: a recent example is CSS allowing the non-preservation of lone surrogates entering the CSS Object Model via JavaScript strings, in order to enable CSS Object Model implementations that use UTF-8 [really UTF-8 and not some almost-UTF-8 variant] internally. But, yes, we really do sweat the details on that level.) Even if one could argue that implementation-defined behavior on the topic of the number of U+FFFDs for ill-formed sequences in UTF-8 decode doesn't matter, the WHATWG way of doing things isn't to debate whether implementation-defined behavior matters in this particular case but to require one particular behavior in order to have well-defined behavior even when input is non-conforming. It further seems that there are people who do care about what's a *requirement* on the WHATWG level matching what's "best practice" on the Unicode level: https://www.w3.org/Bugs/Public/show_bug.cgi?id=19938 Now that major browsers agree, knowing what I know about how the WHATWG operates, while I can't speak for Anne, I expect the WHATWG spec to stay as-is, because it now matches the browser consensus. So as a practical matter, if Unicode now changes its "best practice", when people check consistency with Unicode-level "best practice" and notice a discrepancy, the WHATWG and developers of implementations that took the previously stated "best practice" seriously (either directly or by means of another spec, like the WHATWG Encoding Standard, elevating it to a *requirement*) will need to explain why they don't follow the best practice.
It is really inappropriate to inflict that trouble onto pretty much everyone except ICU when the rationale for the change is as flimsy as "feels right". And, as noted earlier, politically it looks *really bad* for Unicode to change its own previous recommendation to side with ICU not following it, when a number of other prominent implementations do follow it. > I believe that the discussion of how to handle illegal sequences came out of > security issues a few years ago from some implementations including valid > single and lead bytes with preceding illegal sequences. ... > Why do we care how we carve up an illegal sequence into subsequences? Only > for debugging and visual inspection. ... > If you don't like some recommendation, then do something else. It does not > matter. If you don't reject the whole input but instead choose to replace > illegal sequences with something, then make sure the something is not > nothing -- replacing with an empty string can cause security issues. > Otherwise, what the something is, or how many of them you put in, is not > very relevant. One or more U+FFFDs is customary. Given that the recommendation came about for security reasons, it's a really bad idea to suggest that implementors should decide on their own what to do and trust that their decision deviates little enough from the suggestion to stay on the secure side. To be clear, I'm not, at this time, claiming that the number of U+FFFDs has a security consequence as long as the number is at least one, but there's an awfully short slippery slope to giving the caller of a converter API the option to "ignore errors", i.e. make the number zero, which *is*, as you note, a security problem. > When the current recommendation came in, I thought it was reasonable but > didn't like the edge cases. At the time, I didn't think it was important to > twiddle with the text in the standard, and I didn't care that ICU didn't > exactly implement that particular recommendation. If ICU doesn't care, then it should be ICU developers, and not the developers of other implementations, who respond to bug reports about not following the "best practice". > Karl Williamson sent feedback to the UTC, "In short, I believe the best > practices are wrong." I think "wrong" is far too strong, but I got an action > item to propose a change in the text. I proposed a modified recommendation. > Nothing gets elevated to "right" that wasn't, nothing gets demoted to > "wrong" that was "right". I find it shocking that the Unicode Consortium would change a widely-implemented part of the standard (regardless of whether Unicode itself officially designates it as a requirement or suggestion) on such flimsy grounds. I'd like to register my feedback that I believe changing the best practices is wrong.
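(For concreteness, the following table-style sketch shows what the two practices disagree about for a couple of ill-formed inputs; the "proposed" column reflects my reading of the proposal and is an illustration, not quoted text. U+FFFD is written here as its UTF-8 bytes EF BF BD.)

    // Each entry: ill-formed input, output under the TUS 5.2-9.0 best practice
    // (one U+FFFD per maximal subpart of a well-formed sequence), and output
    // under the proposed change (one U+FFFD per lead byte plus its trailing bytes).
    struct Example {
        const char *input;
        const char *currentBestPractice;
        const char *proposedPractice;
    };

    static const Example kExamples[] = {
        // E0 80 80: an over-long encoding of U+0000. 80 is not a valid
        // continuation after E0, so the current practice emits three U+FFFD;
        // the proposal treats the three bytes as one unit and emits one.
        { "\xE0\x80\x80",
          "\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD",
          "\xEF\xBF\xBD" },
        // F0 80 80 41: same idea, followed by a valid 'A'.
        { "\xF0\x80\x80\x41",
          "\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\x41",
          "\xEF\xBF\xBD\x41" },
    };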
> no one is forced to do something they don't like I don't believe this to be *practically* true when 1) other specs elevate into requirements what are mere suggestions on the Unicode level, 2) people who read specs carefully file bugs for discrepancies between implementations and best practice, and 3) test suites will test things a particular way, and the easy way for test suite authors to settle arguments is to let the "best practice" win. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Wed May 17 03:07:25 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 17 May 2017 09:07:25 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170516204358.15f6656a@JRWUBU2> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> <20170516204358.15f6656a@JRWUBU2> Message-ID: <055E6AF6-03EB-42AC-BC13-6732A0034B81@alastairs-place.net> > On 16 May 2017, at 20:43, Richard Wordingham via Unicode wrote: > > On Tue, 16 May 2017 11:36:39 -0700 > Markus Scherer via Unicode wrote: > >> Why do we care how we carve up an illegal sequence into subsequences? >> Only for debugging and visual inspection. Maybe some process is using >> illegal, overlong sequences to encode something special (à la Java >> string serialization, "modified UTF-8"), and for that it might be >> convenient too to treat overlong sequences as single errors. > > I think that's not quite true. If we are moving back and forth through > a buffer containing corrupt text, we need to make sure that moving three > characters forward and then three characters back leaves us where we > started. That requires internal consistency. That's very true. But the proposed change doesn't actually affect that; it's still the case that you can correctly identify boundaries in both directions. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Wed May 17 15:36:08 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 13:36:08 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517133608.665a7a7059d7ee80bb4d670165c8327d.a4f7627e87.wbe@email03.godaddy.com> Hans Åberg wrote: > It would be useful, for use with filesystems, to have Unicode > codepoint markers that indicate how UTF-8, including non-valid > sequences, is translated into UTF-32 in a way that the original > octet sequence can be restored. I have always argued strongly against this idea, and always will. Far from solving the stated problem, it would introduce a new one: conversion from the "bad data" Unicode code points, currently well-defined, would become ambiguous. Suppose the block U+EFFxx were assigned to invalid UTF-8 bytes <xx>. Then there would be two possible conversions from, for instance, U+EFF80: either the single invalid byte <80>, or the ordinary four-byte UTF-8 encoding of the code point U+EFF80 itself. Declaring the "special" code points to be excluded from straightforward UTF-* conversion would invalidate every existing UTF-* processor, and would be widely ignored. File systems cannot have it both ways: they must define file names either as unrestricted sequences of bytes, or as strings of characters in some defined encoding. If they choose the latter, they need to define conversion mechanisms with suitable fallback and adhere to them. They can use the PUA if they like.
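(For what it's worth, one deployed way of getting the byte-for-byte round trip asked for above, while narrowing -- though not eliminating -- the ambiguity just described, is the "surrogateescape" approach of Python's PEP 383: each undecodable byte 0x80..0xFF is mapped to a lone surrogate U+DC80..U+DCFF internally and mapped back on output. A rough sketch, with helper names of my own:)

    #include <cstdint>

    // Map a byte that could not be decoded as UTF-8 to an internal escape
    // code point, and reverse the mapping when writing the data back out.
    // Lone surrogates never occur in well-formed UTF-8, so valid input is
    // never mistaken for an escape; strings that already contain lone
    // surrogates remain the ambiguous corner case.
    inline char32_t     escapeByte(std::uint8_t b)    { return 0xDC00u + b; }
    inline bool         isEscapedByte(char32_t cp)    { return cp >= 0xDC80 && cp <= 0xDCFF; }
    inline std::uint8_t unescapeByte(char32_t cp)     { return static_cast<std::uint8_t>(cp & 0xFFu); }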
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 15:37:51 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 13:37:51 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> Richard Wordingham wrote: >> It is not at all clear what the intent of the encoder was - or even >> if it's not just a problem with the data stream. E0 80 80 is not >> permitted, it's garbage. An encoder can't "intend" it. > > It was once a legal way of encoding NUL, just like C0 80, which is > still in use, and seems to be the best way of storing NUL as character > content in a *C string*. I wish I had a penny for every time I'd seen this urban legend. At http://doc.cat-v.org/bell_labs/utf-8_history you can read the original definition of UTF-8, from Ken Thompson on 1992-09-08, so long ago that it was still called FSS-UTF: "When there are multiple ways to encode a value, for example UCS 0, only the shortest encoding is legal." Unicode once permitted implementations to *decode* non-shortest forms, but never allowed an implementation to *create* them (http://www.unicode.org/versions/corrigendum1.html): "For example, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 conformant process may map the code value sequence C0 80 (11000000₂ 10000000₂) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence -- it shall generate the sequence 00 (00000000₂) instead." This was the passage that was deleted as part of Corrigendum #1. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 15:41:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 13:41:56 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Henri Sivonen wrote: > I find it shocking that the Unicode Consortium would change a > widely-implemented part of the standard (regardless of whether Unicode > itself officially designates it as a requirement or suggestion) on > such flimsy grounds. > > I'd like to register my feedback that I believe changing the best > practices is wrong. Perhaps surprisingly, it's already too late. UTC approved this change the day after the proposal was written. http://www.unicode.org/L2/L2017/17103.htm#151-C19 -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 16:05:47 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 17 May 2017 23:05:47 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517133608.665a7a7059d7ee80bb4d670165c8327d.a4f7627e87.wbe@email03.godaddy.com> References: <20170517133608.665a7a7059d7ee80bb4d670165c8327d.a4f7627e87.wbe@email03.godaddy.com> Message-ID: <3264CAA2-8A83-414D-BDA7-D5D5FB5455CF@telia.com> > On 17 May 2017, at 22:36, Doug Ewell via Unicode wrote: > > Hans Åberg wrote: > >> It would be useful, for use with filesystems, to have Unicode >> codepoint markers that indicate how UTF-8, including non-valid >> sequences, is translated into UTF-32 in a way that the original >> octet sequence can be restored. > > I have always argued strongly against this idea, and always will.
> > Far from solving the stated problem, it would introduce a new one: > conversion from the "bad data" Unicode code points, currently > well-defined, would become ambiguous. Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32. No Unicode extensions are needed, as Unicode has no say about what happens to what it considers invalid. > File systems cannot have it both ways: they must define file names > either as unrestricted sequences of bytes, or as strings of characters > in some defined encoding. If they choose the latter, they need to define > conversion mechanisms with suitable fallback and adhere to them. They > can use the PUA if they like. The latter is complicated, so I am told that is not what one does, with some exceptions. Also, one may end up with a file in an unknown encoding, say imported remotely, and then the OS cannot deal with it. From unicode at unicode.org Wed May 17 16:18:15 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 14:18:15 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517141815.665a7a7059d7ee80bb4d670165c8327d.0df684298f.wbe@email03.godaddy.com> Hans Åberg wrote: >> Far from solving the stated problem, it would introduce a new one: >> conversion from the "bad data" Unicode code points, currently >> well-defined, would become ambiguous. > > Actually not: just translate the invalid UTF-8 sequences into invalid > UTF-32. Far from solving the stated problem, it would introduce TWO new ones... -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 16:21:54 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 17 May 2017 23:21:54 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517141815.665a7a7059d7ee80bb4d670165c8327d.0df684298f.wbe@email03.godaddy.com> References: <20170517141815.665a7a7059d7ee80bb4d670165c8327d.0df684298f.wbe@email03.godaddy.com> Message-ID: > On 17 May 2017, at 23:18, Doug Ewell wrote: > > Hans Åberg wrote: > >>> Far from solving the stated problem, it would introduce a new one: >>> conversion from the "bad data" Unicode code points, currently >>> well-defined, would become ambiguous. >> >> Actually not: just translate the invalid UTF-8 sequences into invalid >> UTF-32. > > Far from solving the stated problem, it would introduce TWO new ones... There is no good solution to the problem of illegal UTF-8 sequences, as the intent of those is not known. From unicode at unicode.org Wed May 17 16:31:42 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 17 May 2017 22:31:42 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Message-ID: <20170517223142.4b44687f@JRWUBU2> On Wed, 17 May 2017 13:41:56 -0700 Doug Ewell via Unicode wrote: > Perhaps surprisingly, it's already too late. UTC approved this change > the day after the proposal was written. > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 Approved for Unicode 11.0. Unicode 10.0 has yet to be released. The change may still be rescinded. There's some sort of rule that proposals should be made seven days in advance of the meeting.
I can't find it now, so I'm not sure whether the actual rule was followed, let alone what authority it has. Richard. From unicode at unicode.org Wed May 17 17:04:18 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 17 May 2017 23:04:18 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> References: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> Message-ID: <20170517230418.624585ad@JRWUBU2> On Wed, 17 May 2017 13:37:51 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > >> It is not at all clear what the intent of the encoder was - or even > >> if it's not just a problem with the data stream. E0 80 80 is not > >> permitted, it's garbage. An encoder can't "intend" it. > > > > It was once a legal way of encoding NUL, just like C0 80, which is > > still in use, and seems to be the best way of storing NUL as > > character content in a *C string*. > > I wish I had a penny for every time I'd seen this urban legend. > > At http://doc.cat-v.org/bell_labs/utf-8_history you can read the > original definition of UTF-8, from Ken Thompson on 1992-09-08, so long > ago that it was still called FSS-UTF: > > "When there are multiple ways to encode a value, for example > UCS 0, only the shortest encoding is legal." > > Unicode once permitted implementations to *decode* non-shortest forms, > but never allowed an implementation to *create* them > (http://www.unicode.org/versions/corrigendum1.html): > > "For example, UTF-8 allows nonshortest code value sequences to be > interpreted: a UTF-8 conformant process may map the code value sequence C0 80 > (11000000₂ 10000000₂) to the Unicode value U+0000, even though a > UTF-8 conformant process shall never generate that code value sequence > -- it shall generate the sequence 00 (00000000₂) instead." > > This was the passage that was deleted as part of Corrigendum #1. So it was still a legal way for a non-UTF-8-compliant process! Note for example that a compliant implementation of full upper-casing shall convert the canonically equivalent strings <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE> and <U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, U+0345 COMBINING GREEK YPOGEGRAMMENI> to the canonically inequivalent strings <U+0391 GREEK CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0313> and <U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI, U+0399 GREEK CAPITAL LETTER IOTA>. A compliant Unicode process may not assume that this is the right thing to do. (Or are some compliant Unicode processes required to incorrectly believe that they are doing something they mustn't do?) Richard. From unicode at unicode.org Wed May 17 17:31:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 15:31:56 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517153156.665a7a7059d7ee80bb4d670165c8327d.2cf6d49d41.wbe@email03.godaddy.com> Richard Wordingham wrote: > So it was still a legal way for a non-UTF-8-compliant process! Anything is possible if you are non-compliant. You can encode U+263A with 9,786 FF bytes followed by a terminating FE byte and call that "UTF-8," if you are willing to be non-compliant enough. > Note for example that a compliant implementation of full upper-casing > shall convert the canonically equivalent strings <U+1FB3 GREEK SMALL > LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE> and > <U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, U+0345 COMBINING GREEK > YPOGEGRAMMENI> to the canonically inequivalent strings <U+0391 GREEK > CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0313> and > <U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI, U+0399 GREEK CAPITAL > LETTER IOTA>. A compliant Unicode process may not assume that this is > the right thing to do.
(Or are some compliant Unicode processes > required to incorrectly believe that they are doing something they > mustn't do?) I'm afraid I don't get the analogy. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 18:11:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 May 2017 00:11:53 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517153156.665a7a7059d7ee80bb4d670165c8327d.2cf6d49d41.wbe@email03.godaddy.com> References: <20170517153156.665a7a7059d7ee80bb4d670165c8327d.2cf6d49d41.wbe@email03.godaddy.com> Message-ID: <20170518001153.14315d22@JRWUBU2> On Wed, 17 May 2017 15:31:56 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > > So it was still a legal way for a non-UTF-8-compliant process! > > Anything is possible if you are non-compliant. You can encode U+263A > with 9,786 FF bytes followed by a terminating FE byte and call that > "UTF-8," if you are willing to be non-compliant enough. > > > Note for example that a compliant implementation of full > > upper-casing shall convert the canonically equivalent strings > > <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 > > COMBINING COMMA ABOVE> and <U+1F00 GREEK SMALL LETTER ALPHA WITH > > PSILI, U+0345 COMBINING GREEK YPOGEGRAMMENI> to the canonically > > inequivalent strings <U+0391 GREEK CAPITAL LETTER ALPHA, U+0399 > > GREEK CAPITAL LETTER IOTA, U+0313> and <U+1F08 GREEK CAPITAL LETTER > > ALPHA WITH PSILI, U+0399 GREEK CAPITAL LETTER IOTA>. A compliant > > Unicode process may not assume that this is the right thing to do. > > (Or are some compliant Unicode processes required to incorrectly > > believe that they are doing something they mustn't do?) > > I'm afraid I don't get the analogy. You can't build a full Unicode system out of Unicode-compliant parts. However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8 (in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the critical wording, "When converting from UTF-8 to Unicode values, however, implementations do not need to check that the shortest encoding is being used,...". There was no prohibition on implementations performing the check, so whether C0 80 would be interpreted as U+0000 or as an error was unpredictable. Richard. From unicode at unicode.org Wed May 17 18:41:53 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 17 May 2017 16:41:53 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517223142.4b44687f@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> Message-ID: <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 17 19:04:55 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 18 May 2017 02:04:55 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: I find it intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with state machines that would NEVER need to parse more than 4 bytes.
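(A rough sketch of a forward decoder in the spirit of the proposed change, to show that lookahead stays bounded at four bytes either way: the lead byte announces how many trailing bytes may follow, and the scan stops at the first byte outside 80..BF. The treatment of C0, C1 and F5..FF lead bytes here is a simplification of mine, not the proposal's text.)

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Decode the code point starting at in[pos], advancing pos.
    // A lead byte plus the trailing bytes (80..BF) that follow it, up to the
    // count the lead byte announces, is consumed as a single U+FFFD error even
    // if the result is over-long, a surrogate, or above U+10FFFF.
    // At most 4 bytes are ever examined.
    char32_t decodeNext(const std::vector<std::uint8_t>& in, std::size_t& pos) {
        const std::uint8_t b0 = in[pos++];
        if (b0 < 0x80) return b0;                         // ASCII
        int expected;                                     // trailing bytes announced
        if      (b0 >= 0xC2 && b0 <= 0xDF) expected = 1;
        else if (b0 >= 0xE0 && b0 <= 0xEF) expected = 2;
        else if (b0 >= 0xF0 && b0 <= 0xF4) expected = 3;
        else return 0xFFFD;                               // 80..BF, C0, C1, F5..FF
        char32_t cp = b0 & (0x3F >> expected);            // payload bits of the lead byte
        int seen = 0;
        while (seen < expected && pos < in.size() && (in[pos] & 0xC0) == 0x80) {
            cp = (cp << 6) | (in[pos++] & 0x3F);
            ++seen;
        }
        if (seen < expected) return 0xFFFD;               // truncated: one error for the run
        static const char32_t minByLen[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < minByLen[expected] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0xFFFD;                                // over-long, surrogate, out of range
        return cp;
    }

A caller simply loops while pos < in.size(); scanning backward needs to inspect at most three preceding bytes to find the lead byte again.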
For me, as soon as the first byte encountered is invalid, the current sequence should be stopped there and treated as an error (replaced by U+FFFD if replacement is enabled, instead of returning an error or throwing an exception), and then any further trailing byte should be treated as an isolated error. The number of returned U+FFFD replacements would then be the same whether you scan the input forward or backward, without **ever** reading more than 4 bytes in either direction (this becomes a problem when the parsing reaches the end of a buffer, where you'll block on performing I/O to read the previous or next block; managing a cache of multiple blocks (more than 2) is a problem with this unexpected change, which will create new performance problems and add new memory constraints, in addition to new possible attacks if the parser needs to keep multiple buffers in memory instead of treating them individually, with a single overhead buffer, throwing the individual buffers away on the fly as soon as each is fully parsed). 2017-05-18 1:41 GMT+02:00 Asmus Freytag via Unicode : > On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: > > There's some sort of rule that proposals should be made seven days in > advance of the meeting. I can't find it now, so I'm not sure whether > the actual rule was followed, let alone what authority it has. > > Ideally, proposals that update algorithms or properties of some > significance should be required to be reviewed in more than one pass. The > procedures of the UTC are a bit weak in that respect, at least compared to > other standards organizations. The PRI process addresses that issue to some > extent. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 17 20:48:59 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 19:48:59 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517230418.624585ad@JRWUBU2> References: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> <20170517230418.624585ad@JRWUBU2> Message-ID: Richard Wordingham wrote: >> I'm afraid I don't get the analogy. > > You can't build a full Unicode system out of Unicode-compliant parts. Others will have to address Richard's point about canonical-equivalent sequences. > However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8 > (in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the > critical wording, "When converting from UTF-8 to Unicode values, > however, implementations do not need to check that the shortest > encoding is being used,...". There was no prohibition on > implementations performing the check, so whether C0 80 would be > interpreted as U+0000 or as an error was unpredictable. So it is as I said, and as TUS said before Corrigendum #1 was approved, more than 16 years ago: It was not legal to create overlong sequences, but implementations were allowed to interpret any that they came across. As someone who pays attention to the fine details, you will certainly appreciate the difference between "it was once legal to encode NUL as E0 80 80" and "it was once legal for a decoder to interpret the sequence E0 80 80 as NUL instead of rejecting it."
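(A small sketch of the check involved, mine and purely for illustration: after the payload bits of a two-, three- or four-byte sequence have been assembled, a modern decoder rejects the result if it is below the minimum value for that length, which is exactly what rules out C0 80 and E0 80 80 as encodings of U+0000 today.)

    // True if 'decoded' could have been encoded in fewer bytes than
    // 'sequenceLength', i.e. the sequence is a non-shortest (over-long) form
    // and must be rejected by a conformant UTF-8 decoder.
    inline bool isOverlong(char32_t decoded, int sequenceLength) {
        switch (sequenceLength) {
            case 2:  return decoded < 0x80;       // e.g. C0 80       -> U+0000
            case 3:  return decoded < 0x800;      // e.g. E0 80 80    -> U+0000
            case 4:  return decoded < 0x10000;    // e.g. F0 80 80 80 -> U+0000
            default: return false;                // one-byte sequences cannot be over-long
        }
    }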
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 18 00:01:49 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 May 2017 06:01:49 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <20170518060149.1710f1bb@JRWUBU2> On Thu, 18 May 2017 02:04:55 +0200 Philippe Verdy via Unicode wrote: > I find it intriguing that the update intends to enforce the decoding > of the **shortest** sequences, but now wants to treat **maximal > sequences** as a single unit with arbitrary length. UTF-8 was > designed to work only with state machines that would NEVER need > to parse more than 4 bytes. If you look at the sample code in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that it's working with 6-byte sequences. It's the Unicode, as opposed to ISO 10646, version that has always been restricted to 4 bytes. Richard. From unicode at unicode.org Thu May 18 01:18:48 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 18 May 2017 09:18:48 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: On Thu, May 18, 2017 at 2:41 AM, Asmus Freytag via Unicode wrote: > On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: > > There's some sort of rule that proposals should be made seven days in > advance of the meeting. I can't find it now, so I'm not sure whether > the actual rule was followed, let alone what authority it has. > > Ideally, proposals that update algorithms or properties of some significance > should be required to be reviewed in more than one pass. The procedures of > the UTC are a bit weak in that respect, at least compared to other standards > organizations. The PRI process addresses that issue to some extent. What action should I take to have proposals considered by the UTC? I'd like to make two: 1) Substantive: Reverse the decision to modify the U+FFFD best practice when decoding UTF-8. (I think the decision lacked a truly compelling reason to change something that has a number of prominent implementations, and the decision complicates U+FFFD generation when validating UTF-8 by state machine. Aesthetic considerations in error handling shouldn't outweigh multiple prominent implementations and shouldn't introduce implementation complexity.) 2) Procedural: To be considered in the future, proposals to change what the standard suggests or requires implementations to do should consider different implementation strategies and discuss the impact of the change in the light of those strategies (in the matter at hand, I think the proposal should have included a discussion of the impact on UTF-8 validation state machines), and should include a review of what prominent implementations, including major browser engines, operating system libraries, and standard libraries of well-known programming languages, already do.
(The more established the presently specced behavior is among prominent implementations, the more compelling the reason required to change the spec should be. An implementation hosted by the Consortium itself shouldn't have special weight compared to other prominent implementations.) -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Thu May 18 02:54:11 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 18 May 2017 08:54:11 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <44B15C6B-B06C-4625-9CF3-CE893BEB3101@alastairs-place.net> On 18 May 2017, at 01:04, Philippe Verdy via Unicode wrote: > > I find it intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with state machines that would NEVER need to parse more than 4 bytes. This won't change. You still don't need to parse more than four bytes. In fact, you don't need to do *anything*, even if your implementation doesn't match the proposal, because *it's only a recommendation*. But if you did choose to do something, you *still* don't need to scan arbitrary numbers of bytes. > For me, as soon as the first byte encountered is invalid, the current sequence should be stopped there and treated as an error (replaced by U+FFFD if replacement is enabled, instead of returning an error or throwing an exception), This is still essentially true under the proposal; the only difference is that instead of being a clever dick and taking account of the valid *code point* ranges while doing this, in order to ban certain trailing bytes given the values of their predecessors, you allow any trailing byte, and only worry about whether the complete sequence represents a valid code point or is over-long once you've finished reading it. You never need to read more than four bytes under the new proposal, because the lead byte tells you how many to expect, and you'd still stop and instantly replace with U+FFFD if you see a byte outside the 0x80-0xBF range, even if you hadn't scanned the number of bytes the lead byte says to expect. This also *does not* change the view of the underlying UTF-8 string based on iteration direction; you would still generate the exact same sequence of code points in both directions. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu May 18 02:55:49 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 18 May 2017 08:55:49 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170518060149.1710f1bb@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> <20170518060149.1710f1bb@JRWUBU2> Message-ID: On 18 May 2017, at 06:01, Richard Wordingham via Unicode wrote: > > On Thu, 18 May 2017 02:04:55 +0200 > Philippe Verdy via Unicode wrote: > >> I find it intriguing that the update intends to enforce the decoding >> of the **shortest** sequences, but now wants to treat **maximal >> sequences** as a single unit with arbitrary length.
>> UTF-8 was designed to work only with state machines that would NEVER need >> to parse more than 4 bytes. > > If you look at the sample code in > http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that > it's working with 6-byte sequences. It's the Unicode, as opposed to > ISO 10646, version that has always been restricted to 4 bytes. There are good reasons for restricting it to four-byte sequences, mind; doing so increases the number of invalid code units, which makes it easier to detect UTF-8 versus not-UTF-8. I don't think anyone is proposing allowing 5-byte or 6-byte sequences. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu May 18 03:30:24 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Thu, 18 May 2017 10:30:24 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170516142153.3e146371@JRWUBU2> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <20170516142153.3e146371@JRWUBU2> Message-ID: <98319FCB-60B3-4782-9670-9865D5DE0AEC@telia.com> > On 16 May 2017, at 15:21, Richard Wordingham via Unicode wrote: > > On Tue, 16 May 2017 14:44:44 +0200 > Hans Åberg via Unicode wrote: > >>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode >>> wrote: >> ... >>> I think Unicode should not adopt the proposed change. >> >> It would be useful, for use with filesystems, to have Unicode >> codepoint markers that indicate how UTF-8, including non-valid >> sequences, is translated into UTF-32 in a way that the original octet >> sequence can be restored. > > Escape sequences for the inappropriate bytes is the natural technique. > Your problem is smoothly transitioning so that the escape character is > always escaped when it means itself. Strictly, it can't be done. > > Of course, some sequences of escaped characters should be prohibited. > Checking could be fiddly. One could write the bytes using \xnn escape codes, sequences terminated using \& as in Haskell, translating '\' into "\\". It then becomes a C-encoded string, not plain text. From unicode at unicode.org Thu May 18 03:58:43 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 18 May 2017 09:58:43 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: On 18 May 2017, at 07:18, Henri Sivonen via Unicode wrote: > > the decision complicates U+FFFD generation when validating UTF-8 by state machine. It *really* doesn't. Even if you're hell-bent on using a pure state machine approach, you need to add maybe two additional error states (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on top of the states you already have. The implementation complexity argument is a *total* red herring. > 2) Procedural: To be considered in the future, proposals to change > what the standard suggests or requires implementations to do should > consider different implementation strategies and discuss the impact of > the change in the light of the different implementation strategies (in > the matter at hand, I think the proposal should have included a > discussion of the impact on UTF-8 validation state machines) Well, let's discuss that here and now (see above).
Do you, for some reason, think that it's more complicated than I suggest? Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu May 18 06:40:43 2017 From: unicode at unicode.org (zelpa via Unicode) Date: Thu, 18 May 2017 21:40:43 +1000 Subject: Petition to ban Google from designing emoji Message-ID: http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 09:06:04 2017 From: unicode at unicode.org (David Faulks via Unicode) Date: Thu, 18 May 2017 14:06:04 +0000 (UTC) Subject: Petition to ban Google from designing emoji References: <1560827334.707069.1495116364372.ref@mail.yahoo.com> Message-ID: <1560827334.707069.1495116364372@mail.yahoo.com> And what makes you think Unicode has any authority to "ban" Google from doing anything at all (hint: Unicode has zero ability to enforce anything). -------------------------------------------- On Thu, 5/18/17, zelpa via Unicode wrote: Subject: Petition to ban Google from designing emoji To: "Unicode Public" Received: Thursday, May 18, 2017, 7:40 AM http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. From unicode at unicode.org Thu May 18 09:07:06 2017 From: unicode at unicode.org (Rebecca T via Unicode) Date: Thu, 18 May 2017 10:07:06 -0400 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: Well, you're certainly not alone in your distaste for the new design. @eevee just today said "cool how we improved gender diversity by slowly changing from 'ambiguous/neutral' to 'explicit color-coded binary, default usually male'" On the other hand, quoting @zaccolley: "if you treat emoji like pictures: yay blobs, if you treat emoji like language: yay consistency" Ultimately, the new emoji designs will make our digital communication less ambiguous -- I'm just not sure if that's a good change or not, and I certainly don't enjoy Apple being the default (on principle and for their designs specifically). Quoting UTR #51: "General-purpose emoji for people and body parts should also not be given overly specific images: the general recommendation is to be as neutral as possible regarding race, ethnicity, and gender." Unambiguously, Apple has failed to meet these technical guidelines, in a blatant and unapologetic manner, and that's why I liked the blobs --
they bucked norms, refused to conform to trends, and made emoji more friendly to people who didn't want to attach a gender to their every expression. I think that's valuable and I'm sad to see it go. And a serious response to this joke letter: Given that Google pays $18,000 / annum to vote on new emoji, it seems unlikely that the Consortium will just kick them out. On Thu, May 18, 2017 at 7:40 AM, zelpa via Unicode wrote: > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji > design? First they bastardise the cute blob emoji, then they make their > emoji gendered, now they've literally just copied Apple's emoji. It's > sickening. Disgusting. I propose we hold a petition for the Unicode > Consortium to ban Google from designing emoji in the future, and make them > revert back to the Android 5 designs. Everyone in favour of this please > leave a response, anybody not in favour please rethink your opinion. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 09:26:53 2017 From: unicode at unicode.org (zelpa via Unicode) Date: Fri, 19 May 2017 00:26:53 +1000 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: >Unambiguously, Apple has failed to meet these technical guidelines, >in a blatant and unapologetic manner, and that's why I liked the blobs -- >they bucked norms, refused to conform to trends, and made emoji more >friendly to people who didn't want to attach a gender to their every >expression. I think that's valuable and I'm sad to see it go. At least someone realised it was a (half) joke. This is my real issue: Apple disregards guidelines, sets a de facto standard, and Google races to copy them. It's actually sad to see how far other vendors will go to copy Apple's designs. I honestly think the consortium should try harder to enforce the guidelines instead of letting Apple be the ruler and expecting others to obey. On Fri, May 19, 2017 at 12:07 AM, Rebecca T <637275 at gmail.com> wrote: > Well, you're certainly not alone in your distaste for the new design. > @eevee > just today said "cool how we improved gender diversity by slowly changing > from 'ambiguous/neutral' to 'explicit color-coded binary, default usually > male'" > > On the other hand, quoting @zaccolley: "if you treat emoji like pictures: > yay blobs, if you treat emoji like language: yay consistency" > > Ultimately, the new emoji designs will make our digital communication less > ambiguous -- I'm just not sure if that's a good change or not, and I > certainly don't enjoy Apple being the default (on principle and for their > designs specifically). > > Quoting UTR #51: "General-purpose emoji for people and body parts should > also not be given overly specific images: the general recommendation is to > be as neutral as possible regarding race, ethnicity, and gender." > > Unambiguously, Apple has failed to meet these technical guidelines, > in a blatant and unapologetic manner, and that's why I liked the blobs -- > they bucked norms, refused to conform to trends, and made emoji more > friendly to people who didn't want to attach a gender to their every > expression. I think that's valuable and I'm sad to see it go. > > And a serious response to this joke letter: Given that Google pays $18,000 / > annum to vote on new emoji, it seems unlikely that the Consortium will just > kick them out.
> On Thu, May 18, 2017 at 7:40 AM, zelpa via Unicode wrote: > >> http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ >> >> Is this some kind of joke? Have Google put ANY thought into their emoji >> design? First they bastardise the cute blob emoji, then they make their >> emoji gendered, now they've literally just copied Apple's emoji. It's >> sickening. Disgusting. I propose we hold a petition for the Unicode >> Consortium to ban Google from designing emoji in the future, and make them >> revert back to the Android 5 designs. Everyone in favour of this please >> leave a response, anybody not in favour please rethink your opinion. >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 09:41:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 18 May 2017 07:41:56 -0700 Subject: Petition to ban Google from designing emoji Message-ID: <20170518074156.665a7a7059d7ee80bb4d670165c8327d.333da282b2.wbe@email03.godaddy.com> zelpa wrote: > This is my real issue, Apple disregards guidelines, sets a de facto > standard, Google races to copy them. It's actually sad to see how far > other vendors will go to copy Apple's designs. I honestly think the > consortium should try harder to enforce the guidelines instead of > letting Apple be the ruler and expecting others to obey. Given that one co-chair of the Emoji Subcommittee is from Apple and the other is from Google, you may wish to rethink your expectations about all this. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 18 09:16:31 2017 From: unicode at unicode.org (Gabriel von Dehn via Unicode) Date: Thu, 18 May 2017 17:16:31 +0300 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> Hi, the Unicode Consortium does not and cannot "ban" vendors from designing emojis the way they wish. Unicode merely gives recommendations on how the characters should be displayed. Think of the different designs on different platforms like different fonts you can use (because that is actually what they *are*): They all look slightly different and no one would hold a petition for the design of characters in a font to change. As for the gendered Emojis, those are in the Unicode specification now: http://emojipedia.org/emoji-4.0/ If you do not like the upcoming Emoji design from Google (or anything about the upcoming version of Android), you can report to Google directly, but posting on this List won't help. > On 18 May 2017, at 14:40, zelpa via Unicode wrote: > > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu May 18 09:38:34 2017 From: unicode at unicode.org (Gabriel von Dehn via Unicode) Date: Thu, 18 May 2017 17:38:34 +0300 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: <460AC842-881C-40F3-9E12-A89936641007@gmail.com> As said, Unicode does not and cannot enforce anything. Unicode sets the recommendation, but has no power whatsoever to make every vendor meet these recommendations, nor does it expect vendors to follow Apple's designs. > On 18 May 2017, at 17:26, zelpa via Unicode wrote: > > >Unambiguously, Apple has failed to meet these technical guidelines, > >in a blatant and unapologetic manner, and that's why I liked the blobs -- > >they bucked norms, refused to conform to trends, and made emoji more > >friendly to people who didn't want to attach a gender to their every > >expression. I think that's valuable and I'm sad to see it go. > > At least someone realised it was a (half) joke. This is my real issue: Apple disregards guidelines, sets a de facto standard, and Google races to copy them. It's actually sad to see how far other vendors will go to copy Apple's designs. I honestly think the consortium should try harder to enforce the guidelines instead of letting Apple be the ruler and expecting others to obey. > > On Fri, May 19, 2017 at 12:07 AM, Rebecca T <637275 at gmail.com > wrote: > Well, you're certainly not alone in your distaste for the new design. @eevee > just today said "cool how we improved gender diversity by slowly changing > from 'ambiguous/neutral' to 'explicit color-coded binary, default usually > male'" > > On the other hand, quoting @zaccolley: "if you treat emoji like pictures: > yay blobs, if you treat emoji like language: yay consistency" > > Ultimately, the new emoji designs will make our digital communication less > ambiguous -- I'm just not sure if that's a good change or not, and I > certainly don't enjoy Apple being the default (on principle and for their > designs specifically). > > Quoting UTR #51: "General-purpose emoji for people and body parts should > also not be given overly specific images: the general recommendation is to > be as neutral as possible regarding race, ethnicity, and gender." > > Unambiguously, Apple has failed to meet these technical guidelines, > in a blatant and unapologetic manner, and that's why I liked the blobs -- > they bucked norms, refused to conform to trends, and made emoji more > friendly to people who didn't want to attach a gender to their every > expression. I think that's valuable and I'm sad to see it go. > > And a serious response to this joke letter: Given that Google pays $18,000 / > annum to vote on new emoji, it seems unlikely that the Consortium will just > kick them out. > > > On Thu, May 18, 2017 at 7:40 AM, zelpa via Unicode wrote: > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu May 18 10:35:10 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 May 2017 08:35:10 -0700 Subject: Petition to ban Google from designing emoji In-Reply-To: <20170518074156.665a7a7059d7ee80bb4d670165c8327d.333da282b2.wbe@email03.godaddy.com> References: <20170518074156.665a7a7059d7ee80bb4d670165c8327d.333da282b2.wbe@email03.godaddy.com> Message-ID: <0e556296-748a-d35f-113d-9a416977e304@ix.netcom.com> On 5/18/2017 7:41 AM, Doug Ewell via Unicode wrote: > zelpa wrote: > >> This is my real issue, Apple disregards guidelines, sets a de facto >> standard, Google races to copy them. It's actually sad to see how far >> other vendors will go to copy Apple's designs. I honestly think the >> consortium should try harder to enforce the guidelines instead of >> letting Apple be the ruler and expecting others to obey. > Given that one co-chair of the Emoji Subcommittee is from Apple and the > other is from Google, you may wish to rethink your expectations about > all this. > I'd expect "zelpa" to feel validated by this info in their concern, wouldn't you? A./ From unicode at unicode.org Thu May 18 11:48:19 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 18 May 2017 09:48:19 -0700 Subject: Petition to ban Google from designing emoji Message-ID: <20170518094819.665a7a7059d7ee80bb4d670165c8327d.b452a92739.wbe@email03.godaddy.com> Asmus Freytag wrote: >> Given that one co-chair of the Emoji Subcommittee is from Apple and >> the other is from Google, you may wish to rethink your expectations >> about all this. > > I'd expect "zelpa" to feel validated by this info in their concern, > wouldn't you? Well, it's public information: http://www.unicode.org/emoji/ The more important point is the one others have been making: Unicode does not and cannot attempt to dictate to any vendor how to design glyphs, either for normal characters like A and ? and ? or for emoji. Unicode does insist that the glyph design not misrepresent the meaning of the character, which I believe was Michael Everson's objection to vendors implementing U+1F3B1 BILLIARDS as a lone 8-ball. It's not clear to me that the Google redesign discussed here goes that far; this seems more like an objection on aesthetic grounds. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 18 11:53:12 2017 From: unicode at unicode.org (Shakil Anwar via Unicode) Date: Thu, 18 May 2017 17:53:12 +0100 Subject: Petition to ban Google from designing emoji In-Reply-To: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> References: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> Message-ID: A more democratic solution is to allow the global public to both submit and vote on emoji designs, rather than allow a small number of (probably) North American white males to dictate emojis in a 'colonial' process based on their own world view and personal views. The Unicode consortium can vote to change the process, and now that the proposal has been made, it will speak volumes if Google, Apple etc. choose not to democratise. ICANN chose to democratise their processes; so can Unicode. On 18 May 2017 at 15:16, Gabriel von Dehn via Unicode wrote: > Hi, > > the Unicode Consortium does not and cannot "ban" vendors from designing > emojis the way they wish. Unicode merely gives recommendations on how the > characters should be displayed.
Think of the different designs on different > platforms like different fonts you can use (because that is actually what > they are): They all look slightly different and no one would hold a > petition for the design of characters in a font to change. > > As for the gendered Emojis, those are in the Unicode specification now: > http://emojipedia.org/emoji-4.0/ > > If you do not like the upcoming Emoji design from Google (or anything > about the upcoming version of Android), you can report to Google directly, > but posting on this List won't help. > > > On 18 May 2017, at 14:40, zelpa via Unicode wrote: > > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 12:30:23 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 May 2017 10:30:23 -0700 Subject: Petition to ban Google from designing emoji In-Reply-To: <20170518094819.665a7a7059d7ee80bb4d670165c8327d.b452a92739.wbe@email03.godaddy.com> References: <20170518094819.665a7a7059d7ee80bb4d670165c8327d.b452a92739.wbe@email03.godaddy.com> Message-ID: <8da32bb0-1575-e7f9-a9ee-8d6873b2e51d@ix.netcom.com> On 5/18/2017 9:48 AM, Doug Ewell via Unicode wrote: > Asmus Freytag wrote: > >>> Given that one co-chair of the Emoji Subcommittee is from Apple and >>> the other is from Google, you may wish to rethink your expectations >>> about all this. >> I'd expect "zelpa" to feel validated by this info in their concern, >> wouldn't you? > Well, it's public information: http://www.unicode.org/emoji/ > > The more important point is the one others have been making: Unicode > does not and cannot attempt to dictate to any vendor how to design > glyphs, either for normal characters like A and ? and ? or for emoji. > > Unicode does insist that the glyph design not misrepresent the meaning > of the character, which I believe was Michael Everson's objection to > vendors implementing U+1F3B1 BILLIARDS as a lone 8-ball. It's not clear > to me that the Google redesign discussed here goes that far; this seems > more like objection on aesthetic grounds. > While this is all true, it seems to miss the point behind the whole complaint. Attempts to counter "tongue-in-cheek" complaints with literal facts aren't always effective. :) A./ From unicode at unicode.org Thu May 18 12:37:32 2017 From: unicode at unicode.org (Gabriel von Dehn via Unicode) Date: Thu, 18 May 2017 20:37:32 +0300 Subject: Petition to ban Google from designing emoji In-Reply-To: References: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> Message-ID: <8C4D1DA4-584D-4D74-9C88-4793177C642F@gmail.com> Again, Unicode is not intended to and cannot ban specific designs of characters, including emoji. Unicode is responsible for creating a list of characters that should be supported, with the goal of making textual communication online possible through a standardised encoding.
Unicode is not responsible for designing these characters, that is up to the vendors to decide. From Unicodes Website: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.?; "The Unicode Consortium was founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards.? (http://www.unicode.org/standard/WhatIsUnicode.html) If you wish that a certain vendor - like Google or Apple - democratise their process of designing characters you should make that clear to them. Posting on this list will do absolutely nothing. ? > On 18 May 2017, at 19:53, Shakil Anwar via Unicode wrote: > > A more democratic solution is to allow the global public to both submit and vote on emoji designs. Rather than allow a small number of (probably) north american white males to dictate emojis in a 'colonial' process based on their own world and personal view. > The Unicode consortium can vote to change the process and now the proposal has been made it will speak volumes if Google, Apple etc. choose not to democratise. > ICANN chose to democratise their processes ; so can Unicode. > > On 18 May 2017 at 15:16, Gabriel von Dehn via Unicode > wrote: > Hi, > > the Unicode Consortium does not and cannot ?ban? vendors from designing emojis the way they wish. Unicode merely gives recommendations on how the characters should be displayed. Think of the different designs on different platforms like different fonts you can use (because that is actually what they are): They all look slightly different and no one would hold a petition for the design of characters in a font to change. > > As for the gendered Emojis, those are in the Unicode specification now: http://emojipedia.org/emoji-4.0/ > > If you do not like the upcoming Emoji design from Google (or anything about the upcoming version of Android), you can report to Google directly, but posting on this List won?t help. > > >> On 18 May 2017, at 14:40, zelpa via Unicode > wrote: >> >> http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ >> >> Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 12:40:41 2017 From: unicode at unicode.org (David Faulks via Unicode) Date: Thu, 18 May 2017 13:40:41 -0400 Subject: Petition to ban Google from designing emoji In-Reply-To: <8da32bb0-1575-e7f9-a9ee-8d6873b2e51d@ix.netcom.com> Message-ID: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu May 18 13:03:09 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 May 2017 19:03:09 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <20170518190309.0fc4223b@JRWUBU2> On Thu, 18 May 2017 09:58:43 +0100 Alastair Houghton via Unicode wrote: > On 18 May 2017, at 07:18, Henri Sivonen via Unicode > wrote: > > > > the decision complicates U+FFFD generation when validating UTF-8 by > > state machine. > > It *really* doesn?t. Even if you?re hell bent on using a pure state > machine approach, you need to add maybe two additional error states > (two-trailing-bytes-to-eat-then-fffd and > one-trailing-byte-to-eat-then-fffd) on top of the states you already > have. The implementation complexity argument is a *total* red > herring. For big programs, yes. However, for a small program it can be attractive to have a small hand-coded routine so that the source code can sit in a single file. It can even allow a basically UTF-8 program to meet a requirement to be able to match lone surrogates in a regular expression, as was once required. Richard. From unicode at unicode.org Thu May 18 13:36:05 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 May 2017 11:36:05 -0700 Subject: Petition to ban Google from designing emoji In-Reply-To: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> References: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 13:53:49 2017 From: unicode at unicode.org (Phake Nick via Unicode) Date: Fri, 19 May 2017 02:53:49 +0800 Subject: Petition to ban Google from designing emoji In-Reply-To: References: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> Message-ID: Is it possible to introduce variation selector for emoji with large design variation among vendors so that when users send emoji with selectors their variation among vendors can be minimized by asking vendors to support both versions? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 20:21:35 2017 From: unicode at unicode.org (Rebecca T via Unicode) Date: Thu, 18 May 2017 21:21:35 -0400 Subject: Petition to ban Google from designing emoji In-Reply-To: References: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> Message-ID: Nick, Don?t even joke. Sincerely, ? Rebecca On Thu, May 18, 2017 at 2:53 PM, Phake Nick via Unicode wrote: > Is it possible to introduce variation selector for emoji with large design > variation among vendors so that when users send emoji with selectors their > variation among vendors can be minimized by asking vendors to support both > versions? > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri May 19 15:09:52 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 19 May 2017 13:09:52 -0700 Subject: Team Emoji Message-ID: <20170519130952.665a7a7059d7ee80bb4d670165c8327d.3a3224a989.wbe@email03.godaddy.com> http://www.cnn.com/2017/05/19/us/emoji-redhead-curly-black-hair-trnd/index.html "Team Emoji (aka the Unicode Consortium) has approved some well-recieved [sic] updates to the visual lexicon we've all come to love. One of the most recent updates included black hearts and a unicorn, and they also got rid of the gun emoji in favor of a much less threatening water gun version. And shockingly, it has only been two years since Unicode updated the icons with different skin tones." "Team Emoji (aka the Unicode Consortium)." What a legacy. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat May 20 18:01:59 2017 From: unicode at unicode.org (Oren Watson via Unicode) Date: Sat, 20 May 2017 19:01:59 -0400 Subject: Fwd: Team Emoji In-Reply-To: References: <20170519130952.665a7a7059d7ee80bb4d670165c8327d.3a3224a989.wbe@email03.godaddy.com> Message-ID: It's especially bad that they think that it was the Unicode consortium that changed the PISTOL emoji to a water gun. Does no-one at CNN use Android, Samsung or Windows? It's a pistol, specifically a revolver, on all those. On Fri, May 19, 2017 at 4:09 PM, Doug Ewell via Unicode wrote: > http://www.cnn.com/2017/05/19/us/emoji-redhead-curly-black-h > air-trnd/index.html > > "Team Emoji (aka the Unicode Consortium) has approved some well-recieved > [sic] updates to the visual lexicon we've all come to love. One of the > most recent updates included black hearts and a unicorn, and they also > got rid of the gun emoji in favor of a much less threatening water gun > version. And shockingly, it has only been two years since Unicode > updated the icons with different skin tones." > > "Team Emoji (aka the Unicode Consortium)." What a legacy. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 21 11:37:27 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 21 May 2017 18:37:27 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Message-ID: I actually didn't see any of this discussion until today. ( unicode at unicode.org mail was going into my spam folder...) I started reading the thread, but it looks like a lot of it is OT, so just scanned some of them. A few brief points: 1. There is plenty of time for public comment, since it was targeted at *Unicode 11*, the release for about a year from now, *not* *Unicode 10*, due this year. 2. When the UTC "approves a change", that change is subject to comment, and the UTC can always reverse or modify its approval up until the meeting before release date. *So there are ca. 9 months in which to comment.* 3. The modified text is a set of guidelines, not requirements. So no conformance clause is being changed. - If people really believed that the guidelines in that section should have been conformance clauses, they should have proposed that at some point. - And still can proposal that ? as I said, there is plenty of time. 
Mark On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode < unicode at unicode.org> wrote: > Henri Sivonen wrote: > > > I find it shocking that the Unicode Consortium would change a > > widely-implemented part of the standard (regardless of whether Unicode > > itself officially designates it as a requirement or suggestion) on > > such flimsy grounds. > > > > I'd like to register my feedback that I believe changing the best > > practices is wrong. > > Perhaps surprisingly, it's already too late. UTC approved this change > the day after the proposal was written. > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 12:35:34 2017 From: unicode at unicode.org (Don Osborn via Unicode) Date: Mon, 22 May 2017 13:35:34 -0400 Subject: Conference marking 40th anniversary of Niamey expert meeting? Message-ID: <01f901d2d321$d22e5900$768b0b00$@bisharat.net> Is there any interest in a conference on support for African languages, including issues at the character and script level? I'm looking at the upcoming 40th anniversary of the Niamey expert meeting on "Transcription and Harmonization of African Languages" with the thought that it might be an opportune occasion to take stock of a process that was prominent in the 1960s - 1970s, reflecting/shaping the Latin-based orthographies used today, and consider current issues with all scripts used in Africa. Such an event could also serve as a way to exchange skills and network among people doing applied work (localization, content development, language technology). I've just posted a short question to that effect at http://niamey.blogspot.com/2017/05/marking-40th-anniversary-of-niamey.html in the hopes of eliciting feedback. This post also references 2 earlier postings about the 50th anniversary of the landmark 1966 Bamako expert meeting, in which various possible issues for discussion were mentioned. The 1978 Niamey conference was a key meeting among a series of UNESCO-(co)sponsored expert meetings on harmonization of transcriptions (orthographies) in Latin script during the 1960s and 1970s. Among other things, this conference produced the African Reference Alphabet, which has been referred to in standardization of orthographies in several countries and in much later discussions relating to Unicode. Thanks in advance for any feedback, here or on the blog. Don Osborn, PhD -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 16:44:06 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 May 2017 22:44:06 +0100 Subject: Comparing Raw Values of the Age Property Message-ID: <20170522224406.43d2d2fa@JRWUBU2> Given two raw values of the Age property, defined in UCD file DerivedAge.txt, how is a computer program supposed to compare them? Apart from special handling for the value "Unassigned" and its short alias "NA", one used to be able to compare short values against short values and long values against long values by simple string comparison. However, now we are coming to Version 10.0 of Unicode, this no longer works - "1.1" < "10.0" < "2.0". There are some possibilities - the values appear in order in PropertyValueAliases.txt and in DerivedAge.txt. However, I can find no relevant guarantees in UAX#44. 
I am looking for a solution that can be driven by the data files, rather than requiring human thought at every version release. Can one rely on the FULL STOP being the field divider, and can one rely on there never being any grouping characters in the short values? Again, I could find no guarantees. Richard. From unicode at unicode.org Mon May 22 17:10:02 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Mon, 22 May 2017 15:10:02 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170522224406.43d2d2fa@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> Message-ID: On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Given two raw values of the Age property, defined in UCD file > DerivedAge.txt, how is a computer program supposed to compare them? > Apart from special handling for the value "Unassigned" and its short > alias "NA", one used to be able to compare short values against short > values and long values against long values by simple string > comparison. However, now we are coming to Version 10.0 of Unicode, > this no longer works - "1.1" < "10.0" < "2.0". > This is normal for numbers, and for multi-field version numbers. If you want numeric sorting, then you need to either use a collator with that option, or parse the versions into tuples of integers and sort those. There are some possibilities - the values appear in order in > PropertyValueAliases.txt and in DerivedAge.txt. You should not rely on the order of values in data files, unless the file explicitly states that order matters. Can one rely on the FULL STOP being the field > divider, I think so. Dots are extremely common for version numbers. I see no reason for Unicode to use something else. and can one rely on there never being any grouping characters > in the short values? I don't know what "grouping characters" you have in mind. I think the format is pretty self-evident. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 17:19:08 2017 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Mon, 22 May 2017 17:19:08 -0500 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170522224406.43d2d2fa@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> Message-ID: <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> I performed several operations on DerivedAge.txt a few months ago. One basic example here: https://pandey.github.io/posts/unicode-growth-UCD-python.html If you provide some more insight into your objective, I might be able to help. I would recommend against relying on the order of the data, and that you instead parse the individual entries to obtain the 'Age' property. All my best, Anshu > On May 22, 2017, at 4:44 PM, Richard Wordingham via Unicode wrote: > > Given two raw values of the Age property, defined in UCD file > DerivedAge.txt, how is a computer program supposed to compare them? > Apart from special handling for the value "Unassigned" and its short > alias "NA", one used to be able to compare short values against short > values and long values against long values by simple string > comparison. However, now we are coming to Version 10.0 of Unicode, > this no longer works - "1.1" < "10.0" < "2.0". > > There are some possibilities - the values appear in order in > PropertyValueAliases.txt and in DerivedAge.txt. However, I can find no > relevant guarantees in UAX#44. 
I am looking for a solution that can be > driven by the data files, rather than requiring human thought at every > version release. Can one rely on the FULL STOP being the field > divider, and can one rely on there never being any grouping characters > in the short values? Again, I could find no guarantees. > > Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 17:48:19 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 May 2017 23:48:19 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: References: <20170522224406.43d2d2fa@JRWUBU2> Message-ID: <20170522234819.0f44c619@JRWUBU2> On Mon, 22 May 2017 15:10:02 -0700 Markus Scherer via Unicode wrote: > On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > > Given two raw values of the Age property, defined in UCD file > > DerivedAge.txt, how is a computer program supposed to compare them? > > Apart from special handling for the value "Unassigned" and its short > > alias "NA", one used to be able to compare short values against > > short values and long values against long values by simple string > > comparison. However, now we are coming to Version 10.0 of Unicode, > > this no longer works - "1.1" < "10.0" < "2.0". > > > > This is normal for numbers, and for multi-field version numbers. > If you want numeric sorting, then you need to either use a collator > with that option, or parse the versions into tuples of integers and > sort those. Well, comparing "15.1" and "15.12" gives different answers depending on whether you view them as decimal numbers or a hierarchical sequence of numbers. > Can one rely on the FULL STOP being the field > > divider, > I think so. Dots are extremely common for version numbers. I see no > reason for Unicode to use something else. But where is that stated? > and can one rely on there never being any grouping characters > > in the short values? > I don't know what "grouping characters" you have in mind. Comma is the obvious one. Looking to the far future (I trust you've heard of the predicted Cobol crisis for the Y10k problem), will we have "1000.0" or "1,000.0"? Richard. From unicode at unicode.org Mon May 22 17:49:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 May 2017 23:49:31 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> Message-ID: <20170522234931.79f307b0@JRWUBU2> On Mon, 22 May 2017 17:19:08 -0500 Anshuman Pandey wrote: > I performed several operations on DerivedAge.txt a few months ago. > One basic example here: > > https://pandey.github.io/posts/unicode-growth-UCD-python.html So what happens if you apply it to Unicode Version 10.0? Are the versions sorted as strings, as real numbers, or just in the order of the data in DerivedAge.txt. > If you provide some more insight into your objective, I might be able > to help. One of the objectives is to use a current version of the UCD to determine, for example, which characters were in Version x.y. One needs that for a regular expression such as [:Age=3.0:], which also matches all characters that have survived since Version 1.1. Another is to record for which versions of the standard a character had some particular value of a property. Richard. 
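[A minimal sketch of the kind of data-file-driven check Richard describes, assuming a local copy of DerivedAge.txt in its current format and that Age short values keep the "major.minor" form (plus "NA" for Unassigned); the function names here are illustrative, not taken from any existing tool.]

# Parse DerivedAge.txt and answer queries such as "does this code point
# match \p{Age=3.0}?".  Age values are compared as (major, minor) integer
# tuples, so that "2.0" < "10.0" numerically even though not as strings.

def parse_age(value):
    """Map an Age value like '2.1' to a comparable tuple; 'NA'/'Unassigned'
    sorts after every numbered version."""
    if value in ('NA', 'Unassigned'):
        return (float('inf'), float('inf'))
    major, minor = value.split('.')
    return (int(major), int(minor))

def load_derived_age(path):
    """Return a list of (first, last, age) records from DerivedAge.txt."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()
            if not line:
                continue
            cps, version = (field.strip() for field in line.split(';'))
            first, _, last = cps.partition('..')
            records.append((int(first, 16), int(last or first, 16),
                            parse_age(version)))
    return records

def age_of(cp, records):
    """Age of a code point; code points absent from the file are unassigned."""
    for first, last, age in records:
        if first <= cp <= last:
            return age
    return parse_age('NA')

def matches_age(cp, value, records):
    """True if cp has Age equal to value or earlier."""
    return age_of(cp, records) <= parse_age(value)

[Loaded once, the same parse_age key also orders a list of Age values correctly, which is what the [:Age=3.0:] style of matching mentioned above needs.]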
From unicode at unicode.org Tue May 23 01:10:09 2017 From: unicode at unicode.org (Jonathan Coxhead via Unicode) Date: Mon, 22 May 2017 23:10:09 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote: > On 18 May 2017, at 07:18, Henri Sivonen via Unicode wrote: >> the decision complicates U+FFFD generation when validating UTF-8 by state machine. > It *really* doesn?t. Even if you?re hell bent on using a pure state machine approach, you need to add maybe two additional error states (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on top of the states you already have. The implementation complexity argument is a *total* red herring. Heh. A state machine with N+2 states is, /a fortiori/, more complex than one with N states. So I think your argument is self-contradictory. > Alastair. ?? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 23 01:43:42 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 22 May 2017 23:43:42 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170522234931.79f307b0@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> Message-ID: <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote: > One of the objectives is to use a current version of the UCD to > determine, for example, which characters were in Version x.y. One > needs that for a regular expression such as [:Age=3.0:], which > also matches all characters that have survived since Version 1.1. > Another is to record for which versions of the standard a character had > some particular value of a property. Richard, I would tend to side with those who claim that "version number" is something that's defined by common industry practice, and therefore not something that Unicode needs to define - but is allowed to use. Just like Unicode doesn't define what an integer is, or hexadecimal number system or a whole host of other concepts that are used in defining in turn what Unicode is. As Markus implied, version numbers are a positional number system where the positions in turn are integers in decimal notation, separated by dots. As it is neither a "string" nor a single number, neither of those common sorting methods give the right answer, but a multi-field sort will. If you have a multi-field sort algorithm that uses commas as the delimiter, just swap out the dots for commas. If not, then you have to implement your own multi-level sort. In any well-designed modern runtime library you can pass a comparison method to any of the sorting algorithms (or sorted data collections). A./ PS: somewhere in the standard, Unicode does define names for the fields: Major, Minor and Update. The use of the term "Update" may not be universal, but major and minor version numbers are a well established concept and do not need a definition. The naming also implies the order of precedence. 
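[The multi-field sort Asmus describes can be expressed as a sort key passed to the sorting routine; a tiny illustrative sketch, with a hypothetical list of values:]

# Sort version strings field by field as integers instead of as plain strings,
# so that "2.0" orders before "10.0".

def version_key(value):
    return tuple(int(field) for field in value.split('.'))

versions = ['10.0', '1.1', '2.0', '6.3', '3.0.1']
print(sorted(versions))                   # string order puts '10.0' after '1.1' but before '2.0'
print(sorted(versions, key=version_key))  # ['1.1', '2.0', '3.0.1', '6.3', '10.0']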
From unicode at unicode.org Tue May 23 03:24:31 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 23 May 2017 17:24:31 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Message-ID: <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> Hello Mark, On 2017/05/22 01:37, Mark Davis ?? via Unicode wrote: > I actually didn't see any of this discussion until today. Many thanks for chiming in. > ( > unicode at unicode.org mail was going into my spam folder...) I started > reading the thread, but it looks like a lot of it is OT, As is quite usual on mailing list :-(. > so just scanned > some of them. > > A few brief points: > > 1. There is plenty of time for public comment, since it was > targeted at *Unicode > 11*, the release for about a year from now, *not* *Unicode 10*, due this > year. > 2. When the UTC "approves a change", that change is subject to comment, > and the UTC can always reverse or modify its approval up until the meeting > before release date. *So there are ca. 9 months in which to comment.* This is good to hear. What's the best way to submit such comments? > 3. The modified text is a set of guidelines, not requirements. So no > conformance clause is being changed. > - If people really believed that the guidelines in that section should > have been conformance clauses, they should have proposed that at > some point. I may have missed something, but I think nobody actually proposed to change the recommendations into requirements. I think everybody understands that there are several ways to do things, and situations where one or the other is preferred. The only advantage of changing the current recommendations to requirements would be to make it more difficult for them to be changed. I think the situation at hand is somewhat special: Recommendations are okay. But there's a strong wish from downstream communities such asWeb browser implementers and programming language/library implementers to not change these recommendations. Some of these communities have stricter requirement for alignment, and some have followed longstanding recommendations in the absence of specific arguments for something different. Regards, Martin. > - And still can proposal that ? as I said, there is plenty of time. > > > Mark > > On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode < > unicode at unicode.org> wrote: > >> Henri Sivonen wrote: >> >>> I find it shocking that the Unicode Consortium would change a >>> widely-implemented part of the standard (regardless of whether Unicode >>> itself officially designates it as a requirement or suggestion) on >>> such flimsy grounds. >>> >>> I'd like to register my feedback that I believe changing the best >>> practices is wrong. >> >> Perhaps surprisingly, it's already too late. UTC approved this change >> the day after the proposal was written. >> >> http://www.unicode.org/L2/L2017/17103.htm#151-C19 >> >> -- >> Doug Ewell | Thornton, CO, US | ewellic.org >> >> >> > -- Prof. Dr.sc. Martin J. 
Dürst Department of Intelligent Information Technology College of Science and Engineering Aoyama Gakuin University Fuchinobe 5-1-10, Chuo-ku, Sagamihara 252-5258 Japan From unicode at unicode.org Tue May 23 04:17:06 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 23 May 2017 10:17:06 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <4D46DB3C-6134-4B44-A59C-F6A6DDE38B0C@alastairs-place.net> On 23 May 2017, at 07:10, Jonathan Coxhead via Unicode wrote: > > On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote: >> On 18 May 2017, at 07:18, Henri Sivonen via Unicode >> wrote: >> >>> the decision complicates U+FFFD generation when validating UTF-8 by state machine. >>> >> It *really* doesn't. Even if you're hell bent on using a pure state machine approach, you need to add maybe two additional error states (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on top of the states you already have. The implementation complexity argument is a *total* red herring. > > Heh. A state machine with N+2 states is, a fortiori, more complex than one with N states. So I think your argument is self-contradictory. You're being overly pedantic (and in this case, actually, the cyclomatic complexity of the state machine wouldn't increase). In any case, Henri is complaining that it's too difficult to implement; it isn't. You need two extra states, both of which are trivial. The point I was making was that this is not a strong argument against the proposed change, *even if* we were treating it as a requirement, which it isn't. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 23 04:33:24 2017 From: unicode at unicode.org (Manuel Strehl via Unicode) Date: Tue, 23 May 2017 11:33:24 +0200 Subject: Comparing Raw Values of the Age Property In-Reply-To: <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> Message-ID: The rising standard in the world of web development (and others) is called "Semantic Versioning" [1], which many projects adhere to or must sometimes actively explain why they don't. The structure of a "semantic version" string is a set of three integers, MAJOR.MINOR.PATCH, where the "semantics" part lies in a kind of contract between author and user about when to increment which part. I do _not_ suggest that Unicode embrace that standard; I am merely stating that this is what many frontend developers will simply assume when looking at a version string that matches this pattern. --Manuel [1] http://semver.org/ 2017-05-23 8:43 GMT+02:00 Asmus Freytag via Unicode : > On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote: >> One of the objectives is to use a current version of the UCD to >> determine, for example, which characters were in Version x.y. One >> needs that for a regular expression such as [:Age=3.0:], which >> also matches all characters that have survived since Version 1.1.
>> > > Richard, > > I would tend to side with those who claim that "version number" is > something that's defined by common industry practice, and therefore not > something that Unicode needs to define - but is allowed to use. Just like > Unicode doesn't define what an integer is, or hexadecimal number system or > a whole host of other concepts that are used in defining in turn what > Unicode is. > > As Markus implied, version numbers are a positional number system where > the positions in turn are integers in decimal notation, separated by dots. > > As it is neither a "string" nor a single number, neither of those common > sorting methods give the right answer, but a multi-field sort will. > > If you have a multi-field sort algorithm that uses commas as the > delimiter, just swap out the dots for commas. If not, then you have to > implement your own multi-level sort. > > In any well-designed modern runtime library you can pass a comparison > method to any of the sorting algorithms (or sorted data collections). > > A./ > > PS: somewhere in the standard, Unicode does define names for the fields: > Major, Minor and Update. The use of the term "Update" may not be universal, > but major and minor version numbers are a well established concept and do > not need a definition. The naming also implies the order of precedence. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 23 06:04:58 2017 From: unicode at unicode.org (Janusz S. Bien via Unicode) Date: Tue, 23 May 2017 13:04:58 +0200 Subject: Comparing Raw Values of the Age Property In-Reply-To: References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> Message-ID: <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> Quote/Cytat - Manuel Strehl via Unicode (Tue 23 May 2017 11:33:24 AM CEST): > The rising standard in the world of web development (and others) is called > ?Semantic Versioning? [1], that many projects adhere to or sometimes must > actively explain, why they don't. > > The structure of a ?semantic version? string is a set of three integers, > MAJOR.MINOR.PATCH, where the ?sematics? part lies in a kind of contract > between author and user, when to increment which part. > Perhaps I am missing something, but I don't understand this thread. Cf. http://unicode.org/versions/ Version numbers for the Unicode Standard consist of three fields, denoting the major version, the minor version, and the update version, respectively. The differences between major, minor, and update versions are as follows: [...] Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From unicode at unicode.org Tue May 23 07:29:33 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 23 May 2017 05:29:33 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> Message-ID: On 5/23/2017 4:04 AM, Janusz S. 
Bien via Unicode wrote: > Quote/Cytat - Manuel Strehl via Unicode (Tue 23 > May 2017 11:33:24 AM CEST): > >> The rising standard in the world of web development (and others) is >> called >> ?Semantic Versioning? [1], that many projects adhere to or sometimes >> must >> actively explain, why they don't. >> >> The structure of a ?semantic version? string is a set of three integers, >> MAJOR.MINOR.PATCH, where the ?sematics? part lies in a kind of contract >> between author and user, when to increment which part. >> > > Perhaps I am missing something, but I don't understand this thread. Cf. You are not missing anything, the OP is being obtuse. We just didn't want to run the search for him. :) A./ > > http://unicode.org/versions/ > > Version numbers for the Unicode Standard consist of three fields, > denoting the major version, the minor version, and the update version, > respectively. > > The differences between major, minor, and update versions are as follows: > > [...] > > Best regards > > Janusz > From unicode at unicode.org Tue May 23 08:27:51 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 23 May 2017 15:27:51 +0200 Subject: Comparing Raw Values of the Age Property In-Reply-To: <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> Message-ID: 2017-05-23 8:43 GMT+02:00 Asmus Freytag via Unicode : > On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote: > >> One of the objectives is to use a current version of the UCD to >> determine, for example, which characters were in Version x.y. One >> needs that for a regular expression such as [:Age=3.0:], which >> also matches all characters that have survived since Version 1.1. >> Another is to record for which versions of the standard a character had >> some particular value of a property. >> > > Richard, > > I would tend to side with those who claim that "version number" is > something that's defined by common industry practice, and therefore not > something that Unicode needs to define - but is allowed to use. Just like > Unicode doesn't define what an integer is, or hexadecimal number system or > a whole host of other concepts that are used in defining in turn what > Unicode is. > > As Markus implied, version numbers are a positional number system where > the positions in turn are integers in decimal notation, separated by dots. > Not all version numbers obey this scheme with dots and only integers. There are also version numbers using dates (separated by hyphens like in the ISO format), or additional letters (a,b,c...) or labels (alpha, beta, RC) sometimes in the middle of other fields (these labels are not always easy to compare), but they are generally made to be case-insensitive and tend to avoid non-latin letters, so Greek letters are named in Latin), and they cannot be always parsed and combined as a single integer. For comparing/sorting, it's best to use case-ensensitive and use only primary differences in UCA. But the UCA algorithm should be tweaked using preparsing to locate where there are numbers In rare cases you may find roman decimal numbers (I, II,III, IV, V, IX, X) which can't be strictly sorted like other Latin letters. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue May 23 09:05:04 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 23 May 2017 07:05:04 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> Message-ID: <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 23 11:20:33 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 23 May 2017 16:20:33 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> Message-ID: + the list, which somehow my reply seems to have lost. > I may have missed something, but I think nobody actually proposed to change the recommendations into requirements No thanks, that would be a breaking change for some implementations (like mine) and force them to become non-complying or potentially break customer behavior. I would prefer that both options be permitted, perhaps with a few words of advantages. -Shawn From unicode at unicode.org Tue May 23 12:45:46 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Tue, 23 May 2017 10:45:46 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode < unicode at unicode.org> wrote: > So, if the proposal for Unicode really was more of a "feels right" and not > a "deviate at your peril" situation (or necessary escape hatch), then we > are better off not making a RECOMMEDATION that goes against collective > practice. > I think the standard is quite clear about this: Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue May 23 13:09:33 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 23 May 2017 19:09:33 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: <169D8226-B6EC-4D4F-BD91-A3C4141ED9DB@alastairs-place.net> > On 23 May 2017, at 18:45, Markus Scherer via Unicode wrote: > > On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode wrote: >> So, if the proposal for Unicode really was more of a "feels right" and not a "deviate at your peril" situation (or necessary escape hatch), then we are better off not making a RECOMMEDATION that goes against collective practice. > > I think the standard is quite clear about this: > > Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. Agreed. That paragraph is entirely clear. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 23 13:20:23 2017 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Tue, 23 May 2017 11:20:23 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: On 5/23/2017 10:45 AM, Markus Scherer wrote: > On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode > > wrote: > > So, if the proposal for Unicode really was more of a "feels right" > and not a "deviate at your peril" situation (or necessary escape > hatch), then we are better off not making a RECOMMEDATION that > goes against collective practice. > > > I think the standard is quite clear about this: > > Although a UTF-8 conversion process is required to never consume > well-formed subsequences as part of its error handling for > ill-formed subsequences, such a process is not otherwise > constrained in how it deals with any ill-formed subsequence > itself. An ill-formed subsequence consisting of more than one code > unit could be treated as a single error or as multiple errors. > > And why add a recommendation that changes that from completely up to the implementation (or groups of implementations) to something where one way of doing it now has to justify itself? If the thread has made one thing clear is that there's no consensus in the wider community that one approach is obviously better. When it comes to ill-formed sequences, all bets are off. Simple as that. Adding a "recommendation" this late in the game is just bad standards policy. A./ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue May 23 14:31:49 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 23 May 2017 19:31:49 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: > If the thread has made one thing clear is that there's no consensus in the wider community > that one approach is obviously better. When it comes to ill-formed sequences, all bets are off. > Simple as that. > Adding a "recommendation" this late in the game is just bad standards policy. I agree. I'm not sure what value this provides. If someone thought it added value to discuss the pros and cons of implementing it one way and the other as MAY do this or MAY do that, I don't mind. But I think both should be permitted, and neither should be encouraged with anything stronger than a MAY. -Shawn From unicode at unicode.org Tue May 23 15:42:58 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 23 May 2017 13:42:58 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170523134258.665a7a7059d7ee80bb4d670165c8327d.7aa745091b.wbe@email03.godaddy.com> Asmus Freytag (c) wrote: > And why add a recommendation that changes that from completely up to > the implementation (or groups of implementations) to something where > one way of doing it now has to justify itself? A recommendation already exists, at the end of Section 3.9. The current proposal is to change it to recommend something else. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 23 15:48:40 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 23 May 2017 21:48:40 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> Message-ID: <20170523214840.6f40ffe7@JRWUBU2> On Tue, 23 May 2017 05:29:33 -0700 Asmus Freytag via Unicode wrote: > On 5/23/2017 4:04 AM, Janusz S. Bien via Unicode wrote: > > Quote/Cytat - Manuel Strehl via Unicode (Tue > > 23 May 2017 11:33:24 AM CEST): > > > >> The rising standard in the world of web development (and others) > >> is called > >> "Semantic Versioning" [1], that many projects adhere to or > >> sometimes must > >> actively explain, why they don't. > >> > >> The structure of a "semantic version" string is a set of three > >> integers, MAJOR.MINOR.PATCH, where the "semantics" part lies in a > >> kind of contract between author and user, when to increment which > >> part. > > > > Perhaps I am missing something, but I don't understand this thread. > > Cf. > > You are not missing anything, the OP is being obtuse. We just didn't > want to run the search for him. :) The object is to generate code *now* that, up to say Unicode Version 23.0, can work out, from the UCD files DerivedAge.txt and PropertyValueAliases.txt, whether an arbitrary code point was included by some Unicode version identified by a value of the property Age. One needs this capability to implement the regular expressions of the form \p{Age=xxx}.
This requires a scheme for determining which of two values of the property identifies the earlier version of Unicode. What TUS 9.0, its appendices and annexes is lacking is a clear statement such as, "The short values for the Age property are of the form "m.n", with the first field corresponding to the major version, and the second field corresponding to the minor version. There is no need for a third version field, because new characters are never assigned in update versions of the standard." Conveniently, this almost true statement is included in Section 5.14 of the proposed update to UAX#44 (in Draft 12 to be precise. It's not quite true, for there is also the short value NA for Unassigned. Is there any way of formally recording this oversight? With this proposed change, to compare two values, all one has to do is compare the short names of the values, for one knows what form they will be in. > > Version numbers for the Unicode Standard consist of three fields, > > denoting the major version, the minor version, and the update > > version, respectively. Yes, but 4.0.1 is not a value of the property Age; the last field is redundant. Oddly enough, ICU understands the regular expression \p{age=4.0.1}, but not \p{age=V2_1} (http://demo.icu-project.org/icu-bin/redemo). Ah well, it's only a recommendation that regular expression engines understand both short names and long names of values of properties. Richard. From unicode at unicode.org Tue May 23 15:57:24 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Tue, 23 May 2017 14:57:24 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: > On 5/23/2017 10:45 AM, Markus Scherer wrote: >> On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode >> > wrote: >> >> So, if the proposal for Unicode really was more of a "feels right" >> and not a "deviate at your peril" situation (or necessary escape >> hatch), then we are better off not making a RECOMMEDATION that >> goes against collective practice. >> >> >> I think the standard is quite clear about this: >> >> Although a UTF-8 conversion process is required to never consume >> well-formed subsequences as part of its error handling for >> ill-formed subsequences, such a process is not otherwise >> constrained in how it deals with any ill-formed subsequence >> itself. An ill-formed subsequence consisting of more than one code >> unit could be treated as a single error or as multiple errors. >> >> > And why add a recommendation that changes that from completely up to the > implementation (or groups of implementations) to something where one way > of doing it now has to justify itself? > > If the thread has made one thing clear is that there's no consensus in > the wider community that one approach is obviously better. When it comes > to ill-formed sequences, all bets are off. Simple as that. > > Adding a "recommendation" this late in the game is just bad standards > policy. > > A./ > > Unless I misunderstand, you are missing the point. There is already a recommendation listed in TUS, and that recommendation appears to have been added without much thought. 
There is no proposal to add a recommendation "this late in the game". From unicode at unicode.org Tue May 23 19:44:49 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 23 May 2017 17:44:49 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170523214840.6f40ffe7@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> <20170523214840.6f40ffe7@JRWUBU2> Message-ID: <92e234ce-3d29-dabc-3202-709a84d73619@att.net> Richard On 5/23/2017 1:48 PM, Richard Wordingham via Unicode wrote: > The object is to generate code*now* that, up to say Unicode Version 23.0, > can work out, from the UCD files DerivedAge.txt and > PropertyValueAliases.txt, whether an arbitrary code point was included > by some Unicode version identified by a Unicode version identified by a > value of the property Age. Ah, but keep in mind, if projecting out to Version 23.0 (in the year 2030, by our current schedule), there is a significant chance that particular UCD data files may have morphed into something entirely different. Recall how at one point Unihan.txt morphed into Unihan.zip with multiple subpart files. Even though the maintainers of the UCD data files do our best to maintain them to be as stable as possible, their content and sometimes their formats do morph gradually from release to release. Just don't expect *any* parser to be completely forward proofed against what *might* happen in the UCD in some future version. On the other hand, for the property Age, even in the absence of normative definitions of invariants for the property values, given recent practice, it is pretty damn safe to assume: A. Major versions will continue to have two digits, incremented by one for each subsequent version: 10, 11, 12, ... 99. B. Minor versions will mostly (if not entirely) consist of the value "0", and will never require two digits. Assumption A will get you through this century, which by my estimation should well exceed the lifetime of any code you might be writing now that depends on it. BTW, unlike many actual products, the version numbering of the Unicode Standard is not really driven by marketing concerns. So there is very little chance of some version sequence for Unicode that ends up fitting a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista, 7, 8, 8.1, 10 ... ;-) > What TUS 9.0, its appendices and annexes is lacking is a clear > statement such as, "The short values for the Age property are of the > form "m.n", with the first field corresponding to the major version, > and the second field corresponding to the minor version. There is no > need for a third version field, because new characters are never > assigned in update versions of the standard." I think the UTC and the editors had just been assuming that the pattern was so obvious that it needed no explaining. But the lack of a clear description of Age had become apparent, which is why I wrote that text to add to UAX #44 for the upcoming version. > Conveniently, this > almost true statement is included in Section 5.14 of the proposed > update to UAX#44 (in Draft 12 to be precise. It's not quite true, for > there is also the short value NA for Unassigned. Is there any way of > formally recording this oversight? Yes. You could always file another piece of feedback using the contact form. 
However, in this case, you already have the attention of the editors of UAX #44. So my advice would be to simply wait now for the publication of Version 10.0 of UAX #44 around the 3rd week of June. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 24 00:22:10 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 24 May 2017 06:22:10 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: <92e234ce-3d29-dabc-3202-709a84d73619@att.net> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> <20170523214840.6f40ffe7@JRWUBU2> <92e234ce-3d29-dabc-3202-709a84d73619@att.net> Message-ID: <20170524062210.6ecc7680@JRWUBU2> On Tue, 23 May 2017 17:44:49 -0700 Ken Whistler via Unicode wrote: > Ah, but keep in mind, if projecting out to Version 23.0 (in the year > 2030, by our current schedule), there is a significant chance that > particular UCD data files may have morphed into something entirely > different. Recall how at one point Unihan.txt morphed into Unihan.zip > with multiple subpart files. Even though the maintainers of the UCD > data files do our best to maintain them to be as stable as possible, > their content and sometimes their formats do morph gradually from > release to release. Just don't expect *any* parser to be completely > forward proofed against what *might* happen in the UCD in some future > version. So long as the parser chokes on the new input, that is not too bad for my programs, which rely on being directed to a local copy of the UCD. That issue would be nastier for any program that tries to keep abreast of Unicode additions by downloading the relevant parts of the UCD. > On the other hand, for the property Age, even in the absence of > normative definitions of invariants for the property values, given > recent practice, it is pretty damn safe to assume: > A. Major versions will continue to have two digits, incremented by > one for each subsequent version: 10, 11, 12, ... 99. > B. Minor versions will mostly (if not entirely) consist of the value > "0", and will never require two digits. > Assumption A will get you through this century, which by my > estimation should well exceed the lifetime of any code you might be > writing now that depends on it. Yes, but http://www.thejokeshop.org/2008/12/as-useful-as-a-cobol-programmer/ . > BTW, unlike many actual products, the version numbering of the > Unicode Standard is not really driven by marketing concerns. So there > is very little chance of some version sequence for Unicode that ends > up fitting a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista, > 7, 8, 8.1, 10 ... ;-) The risk I saw was that someone would decide to deprecate value names that look like floating point numbers, so that the relevant value for Version 17.0.0 would be named V17_0 and have no aliases. The new text in UAX#44 is also proof against the major version numbers suddenly becoming the year numbers, as has happened with several products. > Yes. You could always file another piece of feedback using the > contact form.
What deterred me was: (a) "The beta review period for Unicode 10.0 and related technical standards will close on May 1, 2017. This is the last opportunity for technical comments before version 10.0 is released in Q2 2017." - http://blog.unicode.org/2017/04/last-call-on-unicode-100-beta-review.html and (b) Proposed changes aren't yet part of the Unicode standard. Richard. From unicode at unicode.org Wed May 24 01:46:54 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Wed, 24 May 2017 15:46:54 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> Message-ID: <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> On 2017/05/24 05:57, Karl Williamson via Unicode wrote: > On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: >> Adding a "recommendation" this late in the game is just bad standards >> policy. > Unless I misunderstand, you are missing the point. There is already a > recommendation listed in TUS, That's indeed correct. > and that recommendation appears to have > been added without much thought. That's wrong. There was a public review issue with various options and with feedback, and the recommendation has been implemented and in use widely (among else, in major programming language and browsers) without problems for quite some time. > There is no proposal to add a > recommendation "this late in the game". True. The proposal isn't for an addition, it's for a change. The "late in the game" however, still applies. Regards, Martin. From unicode at unicode.org Wed May 24 17:56:39 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Wed, 24 May 2017 16:56:39 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> Message-ID: <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> On 05/24/2017 12:46 AM, Martin J. D?rst wrote: > On 2017/05/24 05:57, Karl Williamson via Unicode wrote: >> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: > >>> Adding a "recommendation" this late in the game is just bad standards >>> policy. > >> Unless I misunderstand, you are missing the point. There is already a >> recommendation listed in TUS, > > That's indeed correct. > > >> and that recommendation appears to have >> been added without much thought. > > That's wrong. There was a public review issue with various options and > with feedback, and the recommendation has been implemented and in use > widely (among else, in major programming language and browsers) without > problems for quite some time. Could you supply a reference to the PRI and its feedback? The recommendation in TUS 5.2 is "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD." And I agree with that. 
And I view an overlong sequence as a maximal ill-formed subsequence that should be replaced by a single FFFD. There's nothing in the text of 5.2 that immediately follows that recommendation that indicates to me that my view is incorrect. Perhaps my view is colored by the fact that I now maintain code that was written to parse UTF-8 back when overlongs were still considered legal input. An overlong was a single unit. When they became illegal, the code still considered them a single unit. I can understand how someone who comes along later could say C0 can't be followed by any continuation character that doesn't yield an overlong, therefore C0 is a maximal subsequence. But I assert that my interpretation is just as valid as that one. And perhaps more so, because of historical precedent. It appears to me that little thought was given to the fact that these changes would cause overlongs to now be at least two units instead of one, making long existing code no longer be best practice. You are effectively saying I'm wrong about this. I thought I had been paying attention to PRI's since the 5.x series, and I don't remember anything about this. If you have evidence to the contrary, please give it. However, I would have thought Markus would have dug any up and given it in his proposal. > > >> There is no proposal to add a >> recommendation "this late in the game". > > True. The proposal isn't for an addition, it's for a change. The "late > in the game" however, still applies. > > Regards, Martin. > From unicode at unicode.org Wed May 24 19:22:39 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Wed, 24 May 2017 17:22:39 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote: > On 05/24/2017 12:46 AM, Martin J. D?rst wrote: > >> That's wrong. There was a public review issue with various options and >> with feedback, and the recommendation has been implemented and in use >> widely (among else, in major programming language and browsers) without >> problems for quite some time. >> > > Could you supply a reference to the PRI and its feedback? > http://www.unicode.org/review/resolved-pri-100.html#pri121 The PRI did not discuss possible different versions of "maximal subpart", and the examples there yield the same results either way. (No non-shortest forms.) The recommendation in TUS 5.2 is "Replace each maximal subpart of an > ill-formed subsequence by a single U+FFFD." > You are right. http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly expanded example compared with the PRI. The text simply talked about a "conversion process" stopping as soon as it encounters something that does not fit, so these edge cases would depend on whether the conversion process treats original-UTF-8 sequences as single units. And I agree with that. And I view an overlong sequence as a maximal > ill-formed subsequence that should be replaced by a single FFFD. 
There's > nothing in the text of 5.2 that immediately follows that recommendation > that indicates to me that my view is incorrect. > > Perhaps my view is colored by the fact that I now maintain code that was > written to parse UTF-8 back when overlongs were still considered legal > input. An overlong was a single unit. When they became illegal, the code > still considered them a single unit. > Right. I can understand how someone who comes along later could say C0 can't be > followed by any continuation character that doesn't yield an overlong, > therefore C0 is a maximal subsequence. > Right. But I assert that my interpretation is just as valid as that one. And > perhaps more so, because of historical precedent. > I agree. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri May 26 05:28:36 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Fri, 26 May 2017 19:28:36 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On 2017/05/25 09:22, Markus Scherer wrote: > On Wed, May 24, 2017 at 3:56 PM, Karl Williamson > wrote: > >> On 05/24/2017 12:46 AM, Martin J. D?rst wrote: >> >>> That's wrong. There was a public review issue with various options and >>> with feedback, and the recommendation has been implemented and in use >>> widely (among else, in major programming language and browsers) without >>> problems for quite some time. >>> >> >> Could you supply a reference to the PRI and its feedback? >> > > http://www.unicode.org/review/resolved-pri-100.html#pri121 > > The PRI did not discuss possible different versions of "maximal subpart", > and the examples there yield the same results either way. (No non-shortest > forms.) It is correct that it didn't give any of the *examples* that are under discussion now. On the other hand, the PRI is very clear about what it means by "maximal subpart": Citing directly from the PRI: >>>> The term "maximal subpart of the ill-formed subsequence" refers to the longest potentially valid initial subsequence or, if none, then to the next single code unit. >>>> At the time of the PRI, so-called "overlongs" were already ill-formed. That change goes back to 2003 or earlier (RFC 3629 (https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect the tightening of the UTF-8 definition in Unicode/ISO 10646). > The recommendation in TUS 5.2 is "Replace each maximal subpart of an >> ill-formed subsequence by a single U+FFFD." >> > > You are right. > > http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly > expanded example compared with the PRI. > > The text simply talked about a "conversion process" stopping as soon as it > encounters something that does not fit, so these edge cases would depend on > whether the conversion process treats original-UTF-8 sequences as single > units. No, the text, both in the PRI and in Unicode 5.2, is quite clear. The "does not fit" (which I haven't found in either text) is clearly grounded by "ill-formed UTF-8". 
And there's no question about what "ill-formed UTF-8" means, in particular in Unicode 5.2, where you just have to go two pages back to find byte sequences such as , , and all called out explicitly as ill-formed. Any kind of claim, as in the L2/17-168 document, about there being an option 2a, are just not substantiated. It's true that there are no explicit examples in the PRI that would allow to distinguish between converting e.g. FC BF BF BF BF 80 to a single FFFD or to six of these. But there's no need to have examples for every corner case if the text is clear enough. In the above six-byte sequence, there's not a single potentially valid (initial) subsequence, so it's all single code units. >> And I agree with that. And I view an overlong sequence as a maximal >> ill-formed subsequence Can you point to any definition that would include or allow such an interpretation? I just haven't found any yet, neither in the PRI nor in Unicode 5.2. >> that should be replaced by a single FFFD. There's >> nothing in the text of 5.2 that immediately follows that recommendation >> that indicates to me that my view is incorrect. I have to agree that the text in Unicode 5.2 could be clearer. It's a hodgepodge of attempts at justifications and definitions. And the word "maximal" itself may also contribute to pushing the interpretation in one direction. But there's plenty in the text that makes it absolutely clear that some things cannot be included. In particular, it says >>>> The term ?maximal subpart of an ill-formed subsequence? refers to the code units that were collected in this manner. They could be the start of a well-formed sequence, except that the sequence lacks the proper continuation. Alternatively, the converter may have found an continuation code unit, which cannot be the start of a well-formed sequence. >>>> And the "in this manner" refers to: >>>> A sequence of code units will be processed up to the point where the sequence either can be unambiguously interpreted as a particular Unicode code point or where the converter recognizes that the code units collected so far constitute an ill-formed subsequence. >>>> So we have the same thing twice: Bail out as soon as something is ill-formed. >> Perhaps my view is colored by the fact that I now maintain code that was >> written to parse UTF-8 back when overlongs were still considered legal >> input. Thanks for providing this information. That's a lot more useful than "feels right", which was given as a reason on this list before. >> An overlong was a single unit. When they became illegal, the code >> still considered them a single unit. That's fine for your code. I might do the same (or not) if I were you, because one indeed never knows in which situation some code is used, and what repercussions a change might produce. But the PRI, and the wording in Unicode 5.2, was created when overlongs and 5-byte and 6-byte sequences and surrogate pairs,... were very clearly ill-formed already. If these texts had intended to make an exception for any of these cases, it would clearly have had to be written into the actual text. Saying something like "the text isn't clear because it says ill-formed, but maybe it doesn't mean ill-formed at the time it was written, but quite a few years before" just doesn't make sense to me at all. > I can understand how someone who comes along later could say C0 can't be >> followed by any continuation character that doesn't yield an overlong, >> therefore C0 is a maximal subsequence. 
Yes indeed, because maximal subsequences are defined by reference to well-formed/ill-formed subsequences, and what's ill-formed is defined in the same standard at the same time. There's nobody "coming along later". That kind of wording would be appropriate if the PRI and the recommendation in the standard had been made e.g. in the 1990s, before the tightening of the UTF-8 definition. Then somebody could say that Unicode overlooked that it implicitly changed the recommendation for how to generate U+FFFDs by changing the definition of well-formed UTF-8. But no such thing at all happened. The PRI was evaluated, and the recommendation included in the text of Unicode, in the context of the then-existing (and since then unchanged) definition of UTF-8. >> But I assert that my interpretation is just as valid as that one. Sorry, but it cannot be valid, because of the timing. The tightening of the UTF-8 definition happened years before the PRI. >> And perhaps more so, because of historical precedent. It's good to know that there are older implementations that behave differently. And I understand that in some cases, these might be reluctant to change. My comments, and Henri's, are very much motivated by the fact that we are reluctant to change our implementations. It may be worth thinking about whether the Unicode standard should mention implementations like yours. But there should be no doubt about the fact that the PRI and Unicode 5.2 (and the current version of Unicode) are clear about what they recommend, and that that recommendation is based on the definition of UTF-8 at that time (and still in force), not on a historical definition of UTF-8. Regards, Martin. From unicode at unicode.org Fri May 26 08:22:54 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 26 May 2017 15:22:54 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: > > Citing directly from the PRI: > > >>>> > The term "maximal subpart of the ill-formed subsequence" refers to the > longest potentially valid initial subsequence or, if none, then to the next > single code unit. > >>>> > The way I understand it is that C0 80 will have TWO maximal subparts, because there is no potentially valid initial subsequence, so only the next single code unit (C0) will be considered. After this, the following byte 80 likewise cannot begin a valid initial subsequence, so here again only the next single code unit (80) will be considered. You'll get U+FFFD replacements emitted twice. This covers all the "overlong" sequences that were allowed by the old UTF-8 definition in the first RFC. For E3 80 20, there will be only ONE maximal subpart, because E3 80 is a potentially valid initial subsequence (of a three-byte sequence) that the byte 20 cannot continue, so a single U+FFFD replacement will be emitted, followed by the valid UTF-8 code unit 20, which will correctly decode as U+0020. Good!
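For illustration only (this is not code from any of the implementations discussed), a minimal sketch of a decoder that applies the definition quoted above: it consumes the longest potentially valid initial subsequence, judged against the current post-RFC 3629 lead and trail byte constraints, and emits exactly one U+FFFD for it, or one U+FFFD for a single code unit that cannot start anything valid:

REPLACEMENT = "\uFFFD"

def lead_info(b):
    # (number of trail bytes, allowed range of the *first* trail byte),
    # following Table 3-7 of the core specification.
    if 0xC2 <= b <= 0xDF: return 1, 0x80, 0xBF
    if b == 0xE0:         return 2, 0xA0, 0xBF
    if 0xE1 <= b <= 0xEC: return 2, 0x80, 0xBF
    if b == 0xED:         return 2, 0x80, 0x9F
    if 0xEE <= b <= 0xEF: return 2, 0x80, 0xBF
    if b == 0xF0:         return 3, 0x90, 0xBF
    if 0xF1 <= b <= 0xF3: return 3, 0x80, 0xBF
    if b == 0xF4:         return 3, 0x80, 0x8F
    return None  # C0, C1, F5..FF, or a lone trail byte: can never start a valid sequence

def decode_by_maximal_subparts(data):
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                          # ASCII
            out.append(chr(b)); i += 1; continue
        info = lead_info(b)
        if info is None:                      # not even one byte of a valid start
            out.append(REPLACEMENT); i += 1; continue
        n_trail, lo, hi = info
        j = i + 1
        for k in range(n_trail):              # collect trail bytes while they still fit
            if j >= len(data):
                break
            ok = (lo <= data[j] <= hi) if k == 0 else (0x80 <= data[j] <= 0xBF)
            if not ok:
                break
            j += 1
        if j - i == n_trail + 1:              # complete, well-formed sequence
            cp = b & (0x7F >> (n_trail + 1))
            for t in data[i + 1:j]:
                cp = (cp << 6) | (t & 0x3F)
            out.append(chr(cp))
        else:                                 # the collected bytes are one maximal
            out.append(REPLACEMENT)           # subpart: exactly one U+FFFD
        i = j
    return "".join(out)

for hexes in ("C0 80", "E3 80 20", "ED B0 80"):
    decoded = decode_by_maximal_subparts(bytes.fromhex(hexes))
    print(hexes, "->", decoded.count(REPLACEMENT), "x U+FFFD")
# C0 80    -> 2 x U+FFFD  (neither C0 nor 80 can start a valid sequence)
# E3 80 20 -> 1 x U+FFFD  (E3 80 is a potentially valid start that 20 breaks), then ' '
# ED B0 80 -> 3 x U+FFFD  (ED B0 would be a surrogate, so ED stands alone)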
This means that this proposal makes sense and is compatible with random accesses within the encoded text whithout having to look backward for an indefinite number of code units and we never have to handle any case with possibly infinite number of code units mapped to the same U+FFFD replacement. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri May 26 10:41:32 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Fri, 26 May 2017 08:41:32 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On Fri, May 26, 2017 at 3:28 AM, Martin J. D?rst wrote: > But there's plenty in the text that makes it absolutely clear that some > things cannot be included. In particular, it says > > >>>> > The term ?maximal subpart of an ill-formed subsequence? refers to the code > units that were collected in this manner. They could be the start of a > well-formed sequence, except that the sequence lacks the proper > continuation. Alternatively, the converter may have found an continuation > code unit, which cannot be the start of a well-formed sequence. > >>>> > > And the "in this manner" refers to: > >>>> > A sequence of code units will be processed up to the point where the > sequence either can be unambiguously interpreted as a particular Unicode > code point or where the converter recognizes that the code units collected > so far constitute an ill-formed subsequence. > >>>> > > So we have the same thing twice: Bail out as soon as something is > ill-formed. The UTF-8 conversion code that I wrote for ICU, and apparently the code that various other people have written, collects sequences starting from lead bytes, according to the original spec, and at the end looks at whether the assembled code point is too low for the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a non-trail byte is quite natural, and reading the PRI text accordingly is quite natural too. Aside from UTF-8 history, there is a reason for preferring a more "structural" definition for UTF-8 over one purely along valid sequences. This applies to code that *works* on UTF-8 strings rather than just converting them. For UTF-8 *processing* you need to be able to iterate both forward and backward, and sometimes you need not collect code points while skipping over n units in either direction -- but your iteration needs to be consistent in all cases. This is easier to implement (especially in fast, short, inline code) if you have to look only at how many trail bytes follow a lead byte, without having to look whether the first trail byte is in a certain range for some specific lead bytes. (And don't say that everyone can validate all strings once and then all code can assume they are valid: That just does not work for library code, you cannot assume anything about your input strings, and you cannot crash when they are ill-formed.) markus -------------- next part -------------- An HTML attachment was scrubbed... 
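For contrast, here is one way (again an illustration only, not ICU's actual code) to realize the lead-byte-driven reading Markus describes: the number of trail bytes to collect is decided from the lead byte alone, and overlong, surrogate or out-of-range values are rejected only after the whole sequence has been assembled, so the same run of bytes becomes a single U+FFFD:

def decode_structurally(data):
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1; continue
        # How many trail bytes to collect is judged from the lead byte alone
        # (C0..DF, E0..EF, F0..F7), as in the original structural layout.
        if   0xC0 <= b <= 0xDF: n, cp, floor = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF: n, cp, floor = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7: n, cp, floor = 3, b & 0x07, 0x10000
        else:                                   # lone trail byte, or F8..FF
            out.append("\uFFFD"); i += 1; continue
        j = i + 1
        while j < len(data) and j - i <= n and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        ok = (j - i == n + 1 and cp >= floor
              and not 0xD800 <= cp <= 0xDFFF and cp <= 0x10FFFF)
        # The whole collected run (an overlong such as C0 80, or the surrogate
        # ED B0 80) is reported as one error: a single U+FFFD.
        out.append(chr(cp) if ok else "\uFFFD")
        i = j
    return "".join(out)

for hexes in ("C0 80", "E3 80 20", "ED B0 80"):
    print(hexes, "->", decode_structurally(bytes.fromhex(hexes)).count("\uFFFD"), "x U+FFFD")
# C0 80    -> 1 x U+FFFD  (collected as one two-byte sequence, rejected as overlong)
# E3 80 20 -> 1 x U+FFFD  (20 is not a trail byte, so collection stops), then ' '
# ED B0 80 -> 1 x U+FFFD  (collected in full, then rejected as a surrogate)

The difference between the two sketches is exactly the point under discussion: C0 80 and ED B0 80 come out here as one U+FFFD each rather than two and three.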
URL: From unicode at unicode.org Fri May 26 12:28:43 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Fri, 26 May 2017 11:28:43 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On 05/26/2017 04:28 AM, Martin J. D?rst wrote: > It may be worth to think about whether the Unicode standard should > mention implementations like yours. But there should be no doubt about > the fact that the PRI and Unicode 5.2 (and the current version of > Unicode) are clear about what they recommend, and that that > recommendation is based on the definition of UTF-8 at that time (and > still in force), and not at based on a historical definition of UTF-8. The link provided about the PRI doesn't lead to the comments. Is there any evidence that there was a realization that the language being adopted would lead to overlongs being split into multiple subparts? From unicode at unicode.org Fri May 26 13:22:37 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 26 May 2017 11:22:37 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: > The link provided about the PRI doesn't lead to the comments. > PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered directory with the name "feedback.html". But the comments were collected together at the time and are accessible here: http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 Also there was a separately submitted comment document: http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt And the minutes of the pertinent UTC meeting (UTC #116): http://www.unicode.org/L2/L2008/08253.htm The minutes simply capture the consensus to adopt Option #2 from PRI #121, and the relevant action items. I now return the floor to the distinguished disputants to continue litigating history. 
;-) --Ken From unicode at unicode.org Fri May 26 16:15:55 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Fri, 26 May 2017 15:15:55 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> On 05/26/2017 12:22 PM, Ken Whistler wrote: > > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >> The link provided about the PRI doesn't lead to the comments. >> > > PRI #121 (August, 2008) pre-dated the practice of keeping all the > feedback comments together with the PRI itself in a numbered directory > with the name "feedback.html". But the comments were collected together > at the time and are accessible here: > > http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 > > Also there was a separately submitted comment document: > > http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt > > And the minutes of the pertinent UTC meeting (UTC #116): > > http://www.unicode.org/L2/L2008/08253.htm > > The minutes simply capture the consensus to adopt Option #2 from PRI > #121, and the relevant action items. > > I now return the floor to the distinguished disputants to continue > litigating history. ;-) > > --Ken > > The reason this discussion got started was that in December, someone came to me and said the code I support does not follow Unicode best practices, and suggested I need to change, though no ticket (yet) has been filed. I was surprised, and posted a query to this list about what the advantages of the new approach are. There were a number of replies, but I did not see anything that seemed definitive. After a month, I created a ticket in Unicode and Markus was assigned to research it, and came up with the proposal currently being debated. Looking at the PRI, it seems to me that treating an overlong as a single maximal unit is in the spirit of the wording, if not the fine print. That seems to be borne out by Markus, even with his stake in ICU, supporting option #2. Looking at the comments, I don't see any discussion of the effect of this on overlong treatments. My guess is that the effect change was unintentional. So I have code that handled overlongs in the only correct way possible when they were acceptable, and in the obvious way after they became illegal, and now without apparent discussion (which is very much akin to "flimsy reasons"), it suddenly was no longer "best practice". And that change came "rather late in the game". That this escaped notice for years indicates that the specifics of REPLACEMENT CHAR handling don't matter all that much. To cut to the chase, I think Unicode should issue a Corrigendum to the effect that it was never the intent of this change to say that treating overlongs as a single unit isn't best practice. I'm not sure this warrants a full-fledge Corrigendum, though. But I believe the text of the best practices should indicate that treating overlongs as a single unit is just as acceptable as Martin's interpretation. I believe this is pretty much in line with Shawn's position. 
Certainly, a discussion of the reasons one might choose one interpretation over another should be included in TUS. That would likely have satisfied my original query, which hence would never have been posted. From unicode at unicode.org Fri May 26 16:41:49 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Fri, 26 May 2017 21:41:49 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> Message-ID: So basically this came about because code got bugged for not following the "recommendation." To fix that, the recommendation will be changed. However then that is going to lead to bugs for other existing code that does not follow the new recommendation. I totally get the forward/backward scanning in sync without decoding reasoning for some implementations, however I do not think that the practices that benefit those should extend to other applications that are happy with a different practice. In either case, the bad characters are garbage, so neither approach is "better" - except that one or the other may be more conducive to the requirements of the particular API/application. I really think the correct approach here is to allow any number of replacement characters without prejudice. Perhaps with suggestions for pros and cons of various approaches if people feel that is really necessary. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode Sent: Friday, May 26, 2017 2:16 PM To: Ken Whistler Cc: unicode at unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 05/26/2017 12:22 PM, Ken Whistler wrote: > > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >> The link provided about the PRI doesn't lead to the comments. >> > > PRI #121 (August, 2008) pre-dated the practice of keeping all the > feedback comments together with the PRI itself in a numbered directory > with the name "feedback.html". But the comments were collected > together at the time and are accessible here: > > http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 > > Also there was a separately submitted comment document: > > http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt > > And the minutes of the pertinent UTC meeting (UTC #116): > > http://www.unicode.org/L2/L2008/08253.htm > > The minutes simply capture the consensus to adopt Option #2 from PRI > #121, and the relevant action items. > > I now return the floor to the distinguished disputants to continue > litigating history. ;-) > > --Ken > > The reason this discussion got started was that in December, someone came to me and said the code I support does not follow Unicode best practices, and suggested I need to change, though no ticket (yet) has been filed. I was surprised, and posted a query to this list about what the advantages of the new approach are. There were a number of replies, but I did not see anything that seemed definitive. 
After a month, I created a ticket in Unicode and Markus was assigned to research it, and came up with the proposal currently being debated. Looking at the PRI, it seems to me that treating an overlong as a single maximal unit is in the spirit of the wording, if not the fine print. That seems to be borne out by Markus, even with his stake in ICU, supporting option #2. Looking at the comments, I don't see any discussion of the effect of this on overlong treatments. My guess is that the effect change was unintentional. So I have code that handled overlongs in the only correct way possible when they were acceptable, and in the obvious way after they became illegal, and now without apparent discussion (which is very much akin to "flimsy reasons"), it suddenly was no longer "best practice". And that change came "rather late in the game". That this escaped notice for years indicates that the specifics of REPLACEMENT CHAR handling don't matter all that much. To cut to the chase, I think Unicode should issue a Corrigendum to the effect that it was never the intent of this change to say that treating overlongs as a single unit isn't best practice. I'm not sure this warrants a full-fledge Corrigendum, though. But I believe the text of the best practices should indicate that treating overlongs as a single unit is just as acceptable as Martin's interpretation. I believe this is pretty much in line with Shawn's position. Certainly, a discussion of the reasons one might choose one interpretation over another should be included in TUS. That would likely have satisfied my original query, which hence would never have been posted. From unicode at unicode.org Tue May 30 05:55:47 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 30 May 2017 19:55:47 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: <999a4fbb-8a90-1afb-41e5-30d13ef2415f@it.aoyama.ac.jp> Hello Markus, others, On 2017/05/27 00:41, Markus Scherer wrote: > On Fri, May 26, 2017 at 3:28 AM, Martin J. D?rst > wrote: > >> But there's plenty in the text that makes it absolutely clear that some >> things cannot be included. In particular, it says >> >>>>>> >> The term ?maximal subpart of an ill-formed subsequence? refers to the code >> units that were collected in this manner. They could be the start of a >> well-formed sequence, except that the sequence lacks the proper >> continuation. Alternatively, the converter may have found an continuation >> code unit, which cannot be the start of a well-formed sequence. >>>>>> >> >> And the "in this manner" refers to: >>>>>> >> A sequence of code units will be processed up to the point where the >> sequence either can be unambiguously interpreted as a particular Unicode >> code point or where the converter recognizes that the code units collected >> so far constitute an ill-formed subsequence. >>>>>> >> >> So we have the same thing twice: Bail out as soon as something is >> ill-formed. 
> > > The UTF-8 conversion code that I wrote for ICU, and apparently the code > that various other people have written, collects sequences starting from > lead bytes, according to the original spec, and at the end looks at whether > the assembled code point is too low for the lead byte, or is a surrogate, > or is above 10FFFF. Stopping at a non-trail byte is quite natural, I think nobody is debating that this is *one way* to do things, and that some code does it. > and > reading the PRI text accordingly is quite natural too. So you are claiming that you're covered because you produce an FFFD "where the converter recognizes that the code units collected so far constitute an ill-formed subsequence", except that your converter is a bit slow in doing that recognition? Well, I guess I could come up with another converter that would be even slower at recognizing that the code units collected so far constitute an ill-formed subsequence. Would that still be okay in your view? And please note that your "just a bit slow" interpretation might somehow work for Unicode 5.2, but it doesn't work for Unicode 9.0, because over the years, things have been tightened up, and the standard now makes it perfectly clear that C0 by itself is a maximal subpart of an ill-formed subsequence. From Section 3.9 of http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf: >>>> Applying the definition of maximal subparts for these ill-formed subsequences, in the first case is a maximal subpart, because that byte value can never be the first byte of a well-formed UTF-8 sequence. >>>> > Aside from UTF-8 history, there is a reason for preferring a more > "structural" definition for UTF-8 over one purely along valid sequences. There may be all kinds of reasons for doing things one way or another. But there are good reasons why the current recommendation is in place, and there are even better reasons for not suddenly reversing it to something completely different. > This applies to code that *works* on UTF-8 strings rather than just > converting them. For UTF-8 *processing* you need to be able to iterate both > forward and backward, and sometimes you need not collect code points while > skipping over n units in either direction -- but your iteration needs to be > consistent in all cases. This is easier to implement (especially in fast, > short, inline code) if you have to look only at how many trail bytes follow > a lead byte, without having to look whether the first trail byte is in a > certain range for some specific lead bytes. > > (And don't say that everyone can validate all strings once and then all > code can assume they are valid: That just does not work for library code, > you cannot assume anything about your input strings, and you cannot crash > when they are ill-formed.) [rest of mail mostly OT] Well, different libraries may make different choices. As an example, the Ruby programming language does essentially that: Whenever it finds an invalid string, it raises an exception. Not all processing on all kinds of invalid strings immediately raises an exception (because of efficiency considerations). But there are quite strong expectations that this happens soon. As an example, when I extended case conversion from ASCII only to Unicode (see e.g. http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/, http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/), I had to go back and fix some things because there were explicit tests checking that invalid inputs would raise exceptions. 
At least for Ruby, this policy of catching problems early rather than allowing garbage-in-garbage-out has worked well. > markus Regards, Martin. From unicode at unicode.org Tue May 30 06:26:39 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 30 May 2017 20:26:39 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> Message-ID: <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> Hello Karl, others, On 2017/05/27 06:15, Karl Williamson via Unicode wrote: > On 05/26/2017 12:22 PM, Ken Whistler wrote: >> >> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >>> The link provided about the PRI doesn't lead to the comments. >>> >> >> PRI #121 (August, 2008) pre-dated the practice of keeping all the >> feedback comments together with the PRI itself in a numbered directory >> with the name "feedback.html". But the comments were collected >> together at the time and are accessible here: >> >> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 >> >> Also there was a separately submitted comment document: >> >> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt >> >> And the minutes of the pertinent UTC meeting (UTC #116): >> >> http://www.unicode.org/L2/L2008/08253.htm >> >> The minutes simply capture the consensus to adopt Option #2 from PRI >> #121, and the relevant action items. >> >> I now return the floor to the distinguished disputants to continue >> litigating history. ;-) >> >> --Ken >> >> > > The reason this discussion got started was that in December, someone > came to me and said the code I support does not follow Unicode best > practices, and suggested I need to change, though no ticket (yet) has > been filed. I was surprised, and posted a query to this list about what > the advantages of the new approach are. Can you provide a reference to that discussion? I might have missed it in December. > There were a number of replies, > but I did not see anything that seemed definitive. After a month, I > created a ticket in Unicode and Markus was assigned to research it, and > came up with the proposal currently being debated. Which is to completely reverse the current recommendation in Unicode 9.0. While I agree that this might help you fending off a bug report, it would create chances for bug reports for Ruby, Python3, many if not all Web browsers,... > Looking at the PRI, it seems to me that treating an overlong as a single > maximal unit is in the spirit of the wording, if not the fine print. In standards, the "fine print" matters. > That seems to be borne out by Markus, even with his stake in ICU, > supporting option #2. Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I also supported option 2, with code behind it. > Looking at the comments, I don't see any discussion of the effect of > this on overlong treatments. My guess is that the effect change was > unintentional. I agree that it was probably not considered explicitly. 
But overlongs were disallowed for security reasons, and once the definition of UTF-8 was tightened, "overlongs" essentially did not exist anymore. Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows what it means, but everybody knows they don't exist. [Just to be sure, by the above, I don't mean that a sequence such as C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by itself, and there is no need to see C0 B0 as a (ghost) sequence.] > So I have code that handled overlongs in the only correct way possible > when they were acceptable, No. As long as they were acceptable, they wouldn't have been replaced by an FFFD. > and in the obvious way after they became illegal, Why? A change was necessary from producing an actual character to producing some number of FFFDs. It may have been easier to produce just a single FFFD, but that depends on how the code was organized. > and now without apparent discussion (which is very much akin to > "flimsy reasons"), it suddenly was no longer "best practice". Not 'now', but almost 9 years ago. And not "without apparent discussion", but with an explicit PRI. > And that > change came "rather late in the game". That this escaped notice for > years indicates that the specifics of REPLACEMENT CHAR handling don't > matter all that much. I agree. You haven't even yet received a ticket yet. > To cut to the chase, I think Unicode should issue a Corrigendum to the > effect that it was never the intent of this change to say that treating > overlongs as a single unit isn't best practice. I'm not sure this > warrants a full-fledge Corrigendum, though. But I believe the text of > the best practices should indicate that treating overlongs as a single > unit is just as acceptable as Martin's interpretation. I'd essentially be fine with that, under the condition that the current recommendation is maintained as a clearly identified recommendation, so that Python3, Ruby, Web standards and browsers, and so on can easily refer to it. Regards, Martin. > I believe this is pretty much in line with Shawn's position. Certainly, > a discussion of the reasons one might choose one interpretation over > another should be included in TUS. That would likely have satisfied my > original query, which hence would never have been posted. > . > From unicode at unicode.org Tue May 30 10:07:05 2017 From: unicode at unicode.org (Tony Narlock via Unicode) Date: Tue, 30 May 2017 08:07:05 -0700 Subject: unihan-etl: create exports of UNIHAN db to csv, json and yaml Message-ID: I have created a tool in python to extract and transform UNIHAN database's information. It?s open source (MIT-licensed) and offers users customized outputs. It?s documented extensively at https://unihan-etl.git-pull.com. In addition, the project?s source code can be found at https://github.com/cihai/unihan-etl. I paired off this tool due to the time-effort of studying the fields and extracting the information correctly. The hope is that one day a traveller going down the same path can find this useful. 
It has been mentioned before on this list at least once, back in 2004: http://unicode.org/mail-arch/unicode-ml/y2004-m04/0255.html > I'm trying to pare Unihan.txt down to a less unwieldy size for my own use by eliminating properties that are of no interest to me and would like to be certain that eliminating the four properties containing the actual values for those dictionaries can be done safely because the information can be reconstituted if necessary from the kIRG* properties since I'm not certain if those properties are of interest to me. There are developers who may only want to extract a pre-determined set of fields. $ pip install ?user unihan-etl And create an export values into a CSV (UNIHAN downloads automatically): $ unihan-etl Only pull custom fields (once downloaded, Unihan.zip is cached for reuse): $ unihan-etl -f kMandarin kNelson kMorohashi Will only pull out those fields. Let?s get a structured output in JSON (empty values are pruned automatically): $ unihan-etl -f kMandarin kNelson kMorohashi -F json Also, with pyyaml you can use -F yaml, as well. $ pip install pyyaml $ unihan-etl -f kMandarin kNelson kMorohashi -F yaml To see all the command line options: http://unihan-etl.git-pull.com/en/latest/cli.html Container format: To keep that data exports as portable as possible, it follows the Data Packages standard ( http://frictionlessdata.io/data-packages/). This is a trickier data set since fields compact quite a bit of detail in them. Other data sets such as CEDict will also be made available as data packages. Backstory: I am trying to create a spiritual successor to cjklib ( https://pypi.python.org/pypi/cjklib). The project aims to pull in CJK datasets and make them accessible under one library. Datasets are also going to be available a la carte via a consistent data standard (Data Packages). I am opting to use UNIHAN database as a core of the CJK data sources. The project?s homepage is https://cihai.git-pull.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 30 10:50:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 May 2017 08:50:56 -0700 Subject: Looking for 8-bit computer designers Message-ID: <20170530085056.665a7a7059d7ee80bb4d670165c8327d.a153952c07.wbe@email03.godaddy.com> Not as OT as it might seem: If there are any engineers or designers on this list who worked on 8-bit and early 16-bit legacy computers (Apple II, Atari, Commodore, Tandy, etc.), and especially on character set design for these machines, please contact me privately at . Any desired degree of anonymity and confidentiality will be honored. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 30 11:20:12 2017 From: unicode at unicode.org (Rebecca T via Unicode) Date: Tue, 30 May 2017 16:20:12 +0000 Subject: unihan-etl: create exports of UNIHAN db to csv, json and yaml In-Reply-To: References: Message-ID: Oh, thank god. I?ve wanted something like this for ages, but I?ve been too lazy to invest the time to create a serious tool ? I?ve used a lot of messy one-time regular expressions. Will definitely be starring your repo! -------------- next part -------------- An HTML attachment was scrubbed... 
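As a small follow-on to the unihan-etl announcement above, a sketch of consuming such a JSON export with nothing but the standard library; the file name is a placeholder, and the assumption that the export is a flat list of per-character records is mine, so check the tool's documentation for the actual layout:

import json

# "unihan.json" stands in for wherever the -F json export was written.
with open("unihan.json", encoding="utf-8") as f:
    entries = json.load(f)

print(len(entries), "records")
# Empty values are pruned by the exporter, so look at which of the selected
# fields actually survived for a sample record.
print(sorted(entries[0].keys()))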
URL: From unicode at unicode.org Tue May 30 12:05:23 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 30 May 2017 17:05:23 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <999a4fbb-8a90-1afb-41e5-30d13ef2415f@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <999a4fbb-8a90-1afb-41e5-30d13ef2415f@it.aoyama.ac.jp> Message-ID: > I think nobody is debating that this is *one way* to do things, and that some code does it. Except that they sort of are. The premise is that the "old language was wrong", and the "new language is right." The reason we know the old language was wrong was that there was a bug filed against an implementation because it did not conform to the old language. The response to the application bug was to change the standard's recommendation. If this language is adopted, then the opposite is going to happen: Bugs will be filed against applications that conform to the old recommendation and not the new recommendation. They will say "your code could be better, it is not following the recommendation." Eventually that will escalate to some level that it will need to be considered, however, regardless of the improvements, it will be a "breaking change". Changing code from one recommendation to another will change behavior. For applications or SDKs with enough visibility, that will break *someone* because that's how these things work. For applications that choose not to change, in response to some RFP, someone's going to say "you don't fully conform to Unicode, we'll go with a different vendor." Not saying that these things make sense, that's just the way the world works. In some situations, one form is better, in some cases another form is better. If the intent is truly that there is not "one way to do things," then the language should reflect that. -Shawn From unicode at unicode.org Tue May 30 12:11:40 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 30 May 2017 17:11:40 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> Message-ID: > Which is to completely reverse the current recommendation in Unicode 9.0. While I agree that this might help you fending off a bug report, it would create chances for bug reports for Ruby, Python3, many if not all Web browsers,... & Windows & .Net Changing the behavior of the Windows / .Net SDK is a non-starter. > Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows what it means, but everybody knows they don't exist. Yes, this is trying to improve the language for a scenario that CANNOT HAPPEN. 
We're trying to optimize a case for data that implementations should never encounter. It is sort of exactly like optimizing for the case where your data input is actually a dragon and not UTF-8 text. Since it is illegal, then the "at least 1 FFFD but as many as you want to emit (or just fail)" is fine. -Shawn From unicode at unicode.org Tue May 30 15:30:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 May 2017 13:30:56 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0 was originally a lead byte for two-byte sequences." When was it ever true that C0 was a valid lead byte? And what does that have to do with (not) restricting trail bytes? -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 30 17:32:34 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Tue, 30 May 2017 16:32:34 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> References: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> Message-ID: On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote: > L2/17-168 says: > > "For UTF-8, recommend evaluating maximal subsequences based on the > original structural definition of UTF-8, without ever restricting trail > bytes to less than 80..BF. For example: is a single maximal > subsequence because C0 was originally a lead byte for two-byte > sequences."
URL: From unicode at unicode.org Wed May 31 00:34:06 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 06:34:06 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: <20170531063406.1fc54994@JRWUBU2> On Fri, 26 May 2017 11:22:37 -0700 Ken Whistler via Unicode wrote: > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: > > The link provided about the PRI doesn't lead to the comments. > > > > PRI #121 (August, 2008) pre-dated the practice of keeping all the > feedback comments together with the PRI itself in a numbered > directory with the name "feedback.html". But the comments were > collected together at the time and are accessible here: > > http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 > > Also there was a separately submitted comment document: > > http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt > > And the minutes of the pertinent UTC meeting (UTC #116): > > http://www.unicode.org/L2/L2008/08253.htm > > The minutes simply capture the consensus to adopt Option #2 from PRI > #121, and the relevant action items. For Unicode members, there is also the original Unicore thread, which starts at http://www.unicode.org/mail-arch/unicore-ml/y2008-m04/0091.html . (I couldn't find anything on the general list.) There were objections there to replacing non-shortest form sequences by multiple ocurrences of U+FFFD. They were rejected by those that mattered, and so the option of a single U+FFFD was not included in the PRI. Richard. From unicode at unicode.org Wed May 31 00:47:46 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 06:47:46 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9e6d3a06-8e04-21b6-a32a-a1d9bb06a812@khwilliamson.com> References: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> <9e6d3a06-8e04-21b6-a32a-a1d9bb06a812@khwilliamson.com> Message-ID: <20170531064746.1ae234f9@JRWUBU2> On Tue, 30 May 2017 16:38:45 -0600 Karl Williamson via Unicode wrote: > Under Best Practices, how many REPLACEMENT CHARACTERs should the > sequence generate? 0, 1, 2, 3, 4 ? > > In practice, how many do parsers generate? See Markus Kuhn's test page http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, test 5.1.5. Firefox generates three replacement characters. Richard. 
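[Editorial note: to make the U+FFFD counts above concrete, here is a minimal sketch of a counter that follows the Unicode 9.0 "maximal subsequence" recommendation, i.e. the restricted trail-byte ranges of the well-formed-sequences table. It is written for this thread, not taken from Firefox, ICU or any other decoder discussed, and the function names are illustrative. For a lone surrogate such as <ED A0 80> it yields three U+FFFDs, consistent with the three reported for Firefox above, and for the overlong <C0 80> it yields two.]

    /* count_fffd_maximal.c -- editorial sketch of the Unicode 9.0
     * "maximal subsequence" recommendation; not code from any shipping
     * decoder.  Counts how many U+FFFDs such a decoder would emit. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Is b a well-formed *second* byte after the given lead byte,
     * per the restricted ranges (no overlongs, surrogates, > U+10FFFF)? */
    static int valid_second(uint8_t lead, uint8_t b) {
        if (lead >= 0xC2 && lead <= 0xDF) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xE0)                 return b >= 0xA0 && b <= 0xBF;
        if (lead >= 0xE1 && lead <= 0xEC) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xED)                 return b >= 0x80 && b <= 0x9F;
        if (lead == 0xEE || lead == 0xEF) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xF0)                 return b >= 0x90 && b <= 0xBF;
        if (lead >= 0xF1 && lead <= 0xF3) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xF4)                 return b >= 0x80 && b <= 0x8F;
        return 0;  /* 80..C1 and F5..FF never start a well-formed sequence */
    }

    static size_t count_fffd_maximal(const uint8_t *s, size_t n) {
        size_t i = 0, fffd = 0;
        while (i < n) {
            uint8_t b = s[i];
            if (b < 0x80) { i++; continue; }          /* ASCII, no FFFD */
            int len = (b >= 0xC2 && b <= 0xDF) ? 2
                    : (b >= 0xE0 && b <= 0xEF) ? 3
                    : (b >= 0xF0 && b <= 0xF4) ? 4 : 0;
            if (len == 0 || i + 1 >= n || !valid_second(b, s[i + 1])) {
                fffd++; i++; continue;   /* maximal subpart is this one byte */
            }
            size_t j = i + 2;            /* 3rd/4th bytes are plain 80..BF trails */
            while (j < i + (size_t)len && j < n && (s[j] & 0xC0) == 0x80) j++;
            if (j < i + (size_t)len) fffd++;  /* truncated: one FFFD for the subpart */
            i = j;                       /* a complete sequence here is well-formed */
        }
        return fffd;
    }

    int main(void) {
        const uint8_t surrogate[] = { 0xED, 0xA0, 0x80 };  /* lone UTF-16 surrogate */
        const uint8_t overlong[]  = { 0xC0, 0x80 };        /* overlong NUL */
        printf("%zu\n", count_fffd_maximal(surrogate, sizeof surrogate)); /* 3 */
        printf("%zu\n", count_fffd_maximal(overlong, sizeof overlong));   /* 2 */
        return 0;
    }

Under the "structural" reading being proposed (collect everything that looks like a trail byte, then validate the value), each of these inputs would instead produce a single U+FFFD; a sketch of that style is given at the end of this digest.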
From unicode at unicode.org Wed May 31 01:08:37 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 07:08:37 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> Message-ID: <20170531070837.6aa2b590@JRWUBU2> On Fri, 26 May 2017 21:41:49 +0000 Shawn Steele via Unicode wrote: > I totally get the forward/backward scanning in sync without decoding > reasoning for some implementations, however I do not think that the > practices that benefit those should extend to other applications that > are happy with a different practice. > In either case, the bad characters are garbage, so neither approach > is "better" - except that one or the other may be more conducive to > the requirements of the particular API/application. There's a potential issue with input methods that indirectly edit the backing store. For example, GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can delete an amount of text specified in characters, not storage units. (Deletion by storage units is not available in this interface.) This might cause utter confusion or worse if the backing store starts out corrupt. A corrupt backing store is normally manually correctable if most of the text is ASCII. Richard. From unicode at unicode.org Wed May 31 07:12:12 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Wed, 31 May 2017 15:12:12 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170531070837.6aa2b590@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: I've researched this more. While the old advice dominates the handling of non-shortest forms, there is more variation than I previously thought when it comes to truncated sequences and CESU-8-style surrogates. Still, the ICU behavior is an outlier considering the set of implementations that I tested. I've written up my findings at https://hsivonen.fi/broken-utf-8/ The write-up mentions https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd like to draw everyone's attention to that bug, which is real-world evidence of a bug arising from two UTF-8 decoders within one product handling UTF-8 errors differently. On Sun, May 21, 2017 at 7:37 PM, Mark Davis ?? via Unicode wrote: > There is plenty of time for public comment, since it was targeted at Unicode > 11, the release for about a year from now, not Unicode 10, due this year. > When the UTC "approves a change", that change is subject to comment, and the > UTC can always reverse or modify its approval up until the meeting before > release date. So there are ca. 
9 months in which to comment. What should I read to learn how to formulate an appeal correctly? Does it matter if a proposal/appeal is submitted as a non-member implementor person, as an individual person member or as a liaison member? http://www.unicode.org/consortium/liaison-members.html list "the Mozilla Project" as a liaison member, but Mozilla-side conventions make submitting proposals like this "as Mozilla" problematic (we tend to avoid "as Mozilla" statements on technical standardization fora except when the W3C Process forces us to make them as part of charter or Proposed Recommendation review). > The modified text is a set of guidelines, not requirements. So no > conformance clause is being changed. I'm aware of this. > If people really believed that the guidelines in that section should have > been conformance clauses, they should have proposed that at some point. It seems to me that this thread does not support the conclusion that the Unicode Standard's expression of preference for the number of REPLACEMENT CHARACTERs should be made into a conformance requirement in the Unicode Standard. This thread could be taken to support a conclusion that the Unicode Standard should not express any preference beyond "at least one and at most as many as there were bytes". On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode wrote: > In any case, Henri is complaining that it?s too difficult to implement; it isn?t. You need two extra states, both of which are trivial. I am not claiming it's too difficult to implement. I think it inappropriate to ask implementations, even from-scratch ones, to take on added complexity in error handling on mere aesthetic grounds. Also, I think it's inappropriate to induce implementations already written according to the previous guidance to change (and risk bugs) or to make the developers who followed the previous guidance with precision be the ones who need to explain why they aren't following the new guidance. On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode wrote: > The UTF-8 conversion code that I wrote for ICU, and apparently the code that > various other people have written, collects sequences starting from lead > bytes, according to the original spec, and at the end looks at whether the > assembled code point is too low for the lead byte, or is a surrogate, or is > above 10FFFF. Stopping at a non-trail byte is quite natural, and reading the > PRI text accordingly is quite natural too. I don't doubt that other people have written code with the same concept as ICU, but as far as non-shortest form handling goes in the implementations I tested (see URL at the start of this email) ICU is the lone outlier. > Aside from UTF-8 history, there is a reason for preferring a more > "structural" definition for UTF-8 over one purely along valid sequences. > This applies to code that *works* on UTF-8 strings rather than just > converting them. For UTF-8 *processing* you need to be able to iterate both > forward and backward, and sometimes you need not collect code points while > skipping over n units in either direction -- but your iteration needs to be > consistent in all cases. This is easier to implement (especially in fast, > short, inline code) if you have to look only at how many trail bytes follow > a lead byte, without having to look whether the first trail byte is in a > certain range for some specific lead bytes. 
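[Editorial note: a minimal sketch of the structural iteration described in the quoted paragraph, stepping by lead-byte arithmetic only, without consulting the restricted trail-byte ranges. This is illustrative code written for this thread, not ICU's or encoding_rs's API, and the function names are made up. On well-formed (or already repaired) text u8_prev is the exact inverse of u8_next; on arbitrary ill-formed input the two directions can disagree (for example across a run of more than three stray trail bytes), which is part of why a consistent error convention matters.]

    /* Structural UTF-8 iteration: move by the lead byte's class only. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Advance past one "character": a lead byte plus its structural trails. */
    static size_t u8_next(const uint8_t *s, size_t n, size_t i) {
        if (i >= n) return n;
        uint8_t b = s[i++];
        int trails = (b & 0xE0) == 0xC0 ? 1
                   : (b & 0xF0) == 0xE0 ? 2
                   : (b & 0xF8) == 0xF0 ? 3 : 0;  /* ASCII and stray trails move by 1 */
        while (trails-- > 0 && i < n && (s[i] & 0xC0) == 0x80) i++;
        return i;
    }

    /* Step back to the start of the previous "character": skip at most
     * three trail bytes (10xxxxxx), then stop on whatever precedes them. */
    static size_t u8_prev(const uint8_t *s, size_t i) {
        if (i == 0) return 0;
        size_t j = i - 1;
        int back = 0;
        while (j > 0 && back < 3 && (s[j] & 0xC0) == 0x80) { j--; back++; }
        return j;
    }

    int main(void) {
        /* "A", U+00E9, U+1F4A9, "Z" */
        const uint8_t s[] = { 'A', 0xC3, 0xA9, 0xF0, 0x9F, 0x92, 0xA9, 'Z' };
        size_t n = sizeof s, i = 0;
        while (i < n) { printf("%zu ", i); i = u8_next(s, n, i); }  /* 0 1 3 7 */
        printf("\n");
        while (i > 0) { i = u8_prev(s, i); printf("%zu ", i); }     /* 7 3 1 0 */
        printf("\n");
        return 0;
    }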
But the matter at hand is decoding potentially-invalid UTF-8 input into a valid in-memory Unicode representation, so later processing is somewhat a red herring as being out of scope for this step. I do agree that if you already know that the data is valid UTF-8, it makes sense to work from the bit pattern definition only. (E.g. in encoding_rs, the implementation I've written and that's on track to replacing uconv in Firefox, UTF-8 decode works using the knowledge of which bytes can possibly follow which leads, but encode from UTF-8 to legacy encodings works using the bit pattern definition, because the Rust type system allows the encoder side to confidently assume that the input to the encoder is valid UTF-8.) On Sat, May 27, 2017 at 12:15 AM, Karl Williamson via Unicode wrote: > The reason this discussion got started was that in December, someone came to > me and said the code I support does not follow Unicode best practices, and > suggested I need to change, though no ticket (yet) has been filed. I think it's pretty uncool to inflict the problem you experienced onto everyone who followed the previous guidance instead. > I was > surprised, and posted a query to this list about what the advantages of the > new approach are. There were a number of replies, but I did not see > anything that seemed definitive. After a month, I created a ticket in > Unicode and Markus was assigned to research it, and came up with the > proposal currently being debated. I think the research I linked to at the start of this email shows that the proposal wasn't researched sufficiently before it was brought to the Unicode Technical Committee. If anything, I hope this thread results in the establishment of a requirement for proposals to come with proper research about what multiple prominent implementations to about the subject matter of a proposal concerning changes to text about implementation behavior. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Wed May 31 12:11:13 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 18:11:13 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <20170531181113.0fc7ea7a@JRWUBU2> On Wed, 31 May 2017 15:12:12 +0300 Henri Sivonen via Unicode wrote: > The write-up mentions > https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd > like to draw everyone's attention to that bug, which is real-world > evidence of a bug arising from two UTF-8 decoders within one product > handling UTF-8 errors differently. > Does it matter if a proposal/appeal is submitted as a non-member > implementor person, as an individual person member or as a liaison > member? 
http://www.unicode.org/consortium/liaison-members.html list > "the Mozilla Project" as a liaison member, but Mozilla-side > conventions make submitting proposals like this "as Mozilla" > problematic (we tend to avoid "as Mozilla" statements on technical > standardization fora except when the W3C Process forces us to make > them as part of charter or Proposed Recommendation review). There may well be an advantage to being able to answer any questions on the proposal at the meeting, especially if it isn't read until the meeting. > > The modified text is a set of guidelines, not requirements. So no > > conformance clause is being changed. > > I'm aware of this. > > > If people really believed that the guidelines in that section > > should have been conformance clauses, they should have proposed > > that at some point. > > It seems to me that this thread does not support the conclusion that > the Unicode Standard's expression of preference for the number of > REPLACEMENT CHARACTERs should be made into a conformance requirement > in the Unicode Standard. This thread could be taken to support a > conclusion that the Unicode Standard should not express any preference > beyond "at least one and at most as many as there were bytes". > > On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode > wrote: > > In any case, Henri is complaining that it?s too difficult to > > implement; it isn?t. You need two extra states, both of which are > > trivial. > > I am not claiming it's too difficult to implement. I think it > inappropriate to ask implementations, even from-scratch ones, to take > on added complexity in error handling on mere aesthetic grounds. Also, > I think it's inappropriate to induce implementations already written > according to the previous guidance to change (and risk bugs) or to > make the developers who followed the previous guidance with precision > be the ones who need to explain why they aren't following the new > guidance. How straightforward is the FSM for back-stepping? > On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode > wrote: > > The UTF-8 conversion code that I wrote for ICU, and apparently the > > code that various other people have written, collects sequences > > starting from lead bytes, according to the original spec, and at > > the end looks at whether the assembled code point is too low for > > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a > > non-trail byte is quite natural, and reading the PRI text > > accordingly is quite natural too. > > I don't doubt that other people have written code with the same > concept as ICU, but as far as non-shortest form handling goes in the > implementations I tested (see URL at the start of this email) ICU is > the lone outlier. You should have researched implementations as they were in 2007. My own code uses the same concept as Markus's ICU code - convert and check the resulting value is legal for the length. As a check, remember that for n > 1, n bytes could represent 2**(5n + 1) values if overlongs were permitted. > > Aside from UTF-8 history, there is a reason for preferring a more > > "structural" definition for UTF-8 over one purely along valid > > sequences. This applies to code that *works* on UTF-8 strings > > rather than just converting them. 
For UTF-8 *processing* you need > > to be able to iterate both forward and backward, and sometimes you > > need not collect code points while skipping over n units in either > > direction -- but your iteration needs to be consistent in all > > cases. This is easier to implement (especially in fast, short, > > inline code) if you have to look only at how many trail bytes > > follow a lead byte, without having to look whether the first trail > > byte is in a certain range for some specific lead bytes. > > But the matter at hand is decoding potentially-invalid UTF-8 input > into a valid in-memory Unicode representation, so later processing is > somewhat a red herring as being out of scope for this step. No. Both lossily converting a UTF-8-like string as a stream of bytes to scalar values and moving back and forth through the string 'character' by 'character' imply an ability to count the number of 'characters' in the string. The bug you mentioned arose from two different ways of counting the string length in 'characters'. Having two different 'character' counts for the same string is inviting trouble. Richard. From unicode at unicode.org Wed May 31 12:43:08 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 17:43:08 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170531070837.6aa2b590@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: > > In either case, the bad characters are garbage, so neither approach is > > "better" - except that one or the other may be more conducive to the > > requirements of the particular API/application. > There's a potential issue with input methods that indirectly edit the backing store. For example, > GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can delete an amount > of text specified in characters, not storage units. (Deletion by storage units is not available in this > interface.) This might cause utter confusion or worse if the backing store starts out corrupt. > A corrupt backing store is normally manually correctable if most of the text is ASCII. I think that's sort of what I said: some approaches might work better for some systems and another approach might work better for another system. This also presupposes a corrupt store. It is unclear to me what the expected behavior would be for this corruption if, for example, there were merely a half dozen 0x80 in the middle of ASCII text? Is that garbage a single "character"? Perhaps because it's a consecutive string of bad bytes? Or should it be 6 characters since they're nonsense? Or maybe 2 characters because the maximum # of trail bytes we can have is 3? What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes? I can see how different implementations might be able to come up with "rules" that would help them navigate (or clean up) those minefields, however it is not at all clear to me that there is a "best practice" for those situations. There also appears to be a special weight given to non-minimally-encoded sequences. 
It would seem to me that none of these illegal sequences should appear in practice, so we have either: * A bad encoder spewing out garbage (overlong sequences) * Flipped bit(s) due to storage/transmission/whatever errors * Lost byte(s) due to storage/transmission/coding/whatever errors * Extra byte(s) due to whatever errors * Bad string manipulation breaking/concatenating in the middle of sequences, causing garbage (perhaps one of the above 2 codeing errors). Only in the first case, of a bad encoder, are the overlong sequences actually "real". And that shouldn't happen (it's a bad encoder after all). The other scenarios seem just as likely, (or, IMO, much more likely) than a badly designed encoder creating overlong sequences that appear to fit the UTF-8 pattern but aren't actually UTF-8. The other cases are going to cause byte patterns that are less "obvious" about how they should be navigated for various applications. I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen. -Shawn From unicode at unicode.org Wed May 31 13:11:59 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 31 May 2017 19:11:59 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> Message-ID: <03634118-4070-409D-9D62-98488E9AB1E5@alastairs-place.net> > On 30 May 2017, at 18:11, Shawn Steele via Unicode wrote: > >> Which is to completely reverse the current recommendation in Unicode 9.0. While I agree that this might help you fending off a bug report, it would create chances for bug reports for Ruby, Python3, many if not all Web browsers,... > > & Windows & .Net > > Changing the behavior of the Windows / .Net SDK is a non-starter. > >> Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows what it means, but everybody knows they don't exist. > > Yes, this is trying to improve the language for a scenario that CANNOT HAPPEN. We're trying to optimize a case for data that implementations should never encounter. It is sort of exactly like optimizing for the case where your data input is actually a dragon and not UTF-8 text. > > Since it is illegal, then the "at least 1 FFFD but as many as you want to emit (or just fail)" is fine. And *that* is what the specification says. The whole problem here is that someone elevated one choice to the status of ?best practice?, and it?s a choice that some of us don?t think *should* be considered best practice. Perhaps ?best practice? should simply be altered to say that you *clearly document* your behaviour in the case of invalid UTF-8 sequences, and that code should not rely on the number of U+FFFDs generated, rather than suggesting a behaviour? Kind regards, Alastair. 
-- http://alastairs-place.net From unicode at unicode.org Wed May 31 13:34:15 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 31 May 2017 19:34:15 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <5ED57BFA-A51B-4E2E-8DF9-4F274EC12CCD@alastairs-place.net> On 31 May 2017, at 18:43, Shawn Steele via Unicode wrote: > > It is unclear to me what the expected behavior would be for this corruption if, for example, there were merely a half dozen 0x80 in the middle of ASCII text? Is that garbage a single "character"? Perhaps because it's a consecutive string of bad bytes? Or should it be 6 characters since they're nonsense? Or maybe 2 characters because the maximum # of trail bytes we can have is 3? It should be six U+FFFD characters, because 0x80 is not a lead byte. Basically, the new proposal is that we should decode bytes that structurally match UTF-8, and if the encoding is then illegal (because it?s over-long, because it?s a surrogate or because it?s over U+10FFFF) then the entire thing is replaced with U+FFFD. If, on the other hand, we get a sequence that isn?t structurally valid UTF-8, we replace the maximally *structurally* valid subpart with U+FFFD and continue. > What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes? Then you get two U+FFFDs. > I can see how different implementations might be able to come up with "rules" that would help them navigate (or clean up) those minefields, however it is not at all clear to me that there is a "best practice" for those situations. I?m not sure the whole ?best practice? thing has been a lot of help here. Perhaps we should change it to say ?Suggested Handling?, to make quite clear that filing a bug report against code that chooses some other option is not necessary? > There also appears to be a special weight given to non-minimally-encoded sequences. I don?t think that?s true, *although* it *is* true that UTF-8 decoders historically tended to allow such things, so one might assume that some software out there is generating them for whatever reason. There are also *deliberate* violations of the minimal length encoding specification in some cases (for instance to allow NUL to be encoded in such a way that it won?t terminate a C-style string). Yes, you may retort, that isn?t ?valid UTF-8?. Sure. It *is* useful, though, and it is *in use*. If a UTF-8 decoder encounters such a thing, it?s more meaningful for whoever sees the output to see a single U+FFFD representing the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid lead byte and then another for an ?unexpected? trailing byte. Likewise, there are encoders that generate surrogates in UTF-8, which is, of course, illegal, but *does* happen. 
Again, they can provide reasonable justifications for their behaviour (typically they want the default binary sort to work the same as for UTF-16 for some reason), and again, replacing a single surrogate with U+FFFD rather than multiple U+FFFDs is more helpful to whoever/whatever ends up seeing it. And, of course, there are encoders that are attempting to exploit security flaws, which will very definitely generate these kinds of things. > It would seem to me that none of these illegal sequences should appear in practice, so we have either: > > * A bad encoder spewing out garbage (overlong sequences) > * Flipped bit(s) due to storage/transmission/whatever errors > * Lost byte(s) due to storage/transmission/coding/whatever errors > * Extra byte(s) due to whatever errors > * Bad string manipulation breaking/concatenating in the middle of sequences, causing garbage (perhaps one of the above 2 codeing errors). I see no reason to suppose that the proposed behaviour would function any less well in those cases. > Only in the first case, of a bad encoder, are the overlong sequences actually "real". And that shouldn't happen (it's a bad encoder after all). Except some encoders *deliberately* use over-longs, and one would assume that since UTF-8 decoders historically allowed this, there will be data ?in the wild? that has this form. > The other scenarios seem just as likely, (or, IMO, much more likely) than a badly designed encoder creating overlong sequences that appear to fit the UTF-8 pattern but aren't actually UTF-8. I?m not sure I agree that flipped bits, lost bytes and extra bytes are more likely than a ?bad? encoder. Bad string manipulation is of course prevalent, though - there?s no way around that. > The other cases are going to cause byte patterns that are less "obvious" about how they should be navigated for various applications. This is true, *however* the new proposed behaviour is in no way inferior to the old proposed behaviour in those cases - it?s just different. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Wed May 31 14:04:41 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 31 May 2017 21:04:41 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: > I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen. I think Richard stated the most compelling reason: ? The bug you mentioned arose from two different ways of counting the string length in 'characters'. Having two different 'character' counts for the same string is inviting trouble. For implementations that emit FFFD while handling text conversion and repair (ie, converting ill-formed UTF-8 to well-formed), it is best for interoperability if they get the same results, so that indices within the resulting strings are consistent across implementations for all the *correct* characters thereafter. 
It would be preferable *not* to have the following:

source = %c0%80abc

Vendor 1:

  fixed = fix(source)
  fixed == "��abc"
  codepointAt(fixed, 3) == 'b'

Vendor 2:

  fixed = fix(source)
  fixed == "�abc"
  codepointAt(fixed, 3) == 'c'

In theory one could just throw an exception. In practice, nobody wants
their browser to belly up on a webpage with a component that has an
ill-formed bit of UTF-8. In theory one could document everyone's flavor
of the month for how many FFFD's to emit. In practice, that falls apart
immediately, since in today's interconnected world you can't tell which
processes get first crack at text repair.

Mark

On Wed, May 31, 2017 at 7:43 PM, Shawn Steele via Unicode <
unicode at unicode.org> wrote:

> > > In either case, the bad characters are garbage, so neither approach is
> > > "better" - except that one or the other may be more conducive to the
> > > requirements of the particular API/application.
>
> > There's a potential issue with input methods that indirectly edit the
> backing store. For example,
> > GTK input methods (e.g. function gtk_im_context_delete_surrounding())
> can delete an amount
> > of text specified in characters, not storage units. (Deletion by
> storage units is not available in this
> > interface.) This might cause utter confusion or worse if the backing
> store starts out corrupt.
> > A corrupt backing store is normally manually correctable if most of the
> text is ASCII.
>
> I think that's sort of what I said: some approaches might work better for
> some systems and another approach might work better for another system.
> This also presupposes a corrupt store.
>
> It is unclear to me what the expected behavior would be for this
> corruption if, for example, there were merely a half dozen 0x80 in the
> middle of ASCII text?  Is that garbage a single "character"?  Perhaps
> because it's a consecutive string of bad bytes?  Or should it be 6
> characters since they're nonsense?  Or maybe 2 characters because the
> maximum # of trail bytes we can have is 3?
>
> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?
>
> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.
>
> There also appears to be a special weight given to non-minimally-encoded
> sequences. It would seem to me that none of these illegal sequences should
> appear in practice, so we have either:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
> sequences, causing garbage (perhaps one of the above 2 codeing errors).
>
> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real".  And that shouldn't happen (it's a bad encoder after
> all). The other scenarios seem just as likely, (or, IMO, much more likely)
> than a badly designed encoder creating overlong sequences that appear to
> fit the UTF-8 pattern but aren't actually UTF-8.
>
> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.
> > I do not understand the energy being invested in a case that shouldn't > happen, especially in a case that is a subset of all the other bad cases > that could happen. > > -Shawn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 31 14:24:04 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 19:24:04 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: > For implementations that emit FFFD while handling text conversion and repair (ie, converting ill-formed > UTF-8 to well-formed), it is best for interoperability if they get the same results, so that indices within the > resulting strings are consistent across implementations for all the correct characters thereafter. That seems optimistic :) If interoperability is the goal, then it would seem to me that changing the recommendation would be contrary to that goal. There are systems that will not or cannot change to a new recommendation. If such systems are updated, then adoption of those systems will likely take some time. In other words, I cannot see where ?consistency across implementations? would be achievable anytime in the near future. It seems to me that being able to use a data stream of ambiguous quality in another application with predictable results, then that stream should be ?repaired? prior to being handed over. Then both endpoints would be using the same set of FFFDs, whether that was single or multiple forms. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 31 14:28:03 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 19:28:03 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <5ED57BFA-A51B-4E2E-8DF9-4F274EC12CCD@alastairs-place.net> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <5ED57BFA-A51B-4E2E-8DF9-4F274EC12CCD@alastairs-place.net> Message-ID: > it?s more meaningful for whoever sees the output to see a single U+FFFD representing > the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid lead byte and > then another for an ?unexpected? trailing byte. I disagree. It may be more meaningful for some applications to have a single U+FFFD representing an illegally encoded 2-byte NULL than to have 2 U+FFFDs. Of course then you don't know if it was an illegally encoded 2-byte NULL or an illegally encoded 3-byte NULL or whatever, so some information that other applications may be interested in is lost. 
Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the byte, and try again" approach. -Shawn From unicode at unicode.org Wed May 31 14:38:58 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 31 May 2017 12:38:58 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170531123858.665a7a7059d7ee80bb4d670165c8327d.173ddafba2.wbe@email03.godaddy.com> Henri Sivonen wrote: > If anything, I hope this thread results in the establishment of a > requirement for proposals to come with proper research about what > multiple prominent implementations to about the subject matter of a > proposal concerning changes to text about implementation behavior. Considering that several folks have objected that the U+FFFD recommendation is perceived as having the weight of a requirement, I think adding Henri's good advice above as a "requirement" seems heavy-handed. Who will judge how much research qualifies as "proper"? Who will determine that the judge doesn't have a conflict? An alternative would be to require that proposals, once received with whatever amount of research, are augmented with any necessary additional research *before* being approved. The identity or reputation of the requester should be irrelevant to approval. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 31 14:42:25 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 19:42:25 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <03634118-4070-409D-9D62-98488E9AB1E5@alastairs-place.net> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> <03634118-4070-409D-9D62-98488E9AB1E5@alastairs-place.net> Message-ID: > And *that* is what the specification says. The whole problem here is that someone elevated > one choice to the status of ?best practice?, and it?s a choice that some of us don?t think *should* > be considered best practice. > Perhaps ?best practice? should simply be altered to say that you *clearly document* your behavior > in the case of invalid UTF-8 sequences, and that code should not rely on the number of U+FFFDs > generated, rather than suggesting a behaviour? That's what I've been suggesting. I think we could maybe go a little further though: * Best practice is clearly not to depend on the # of U+FFFDs generated by another component/app. Clearly that can't be relied upon, so I think everyone can agree with that. * I think encouraging documentation of behavior is cool, though there are probably low priority bugs and people don't like to read the docs in that detail, so I wouldn't expect very much from that. * As far as I can tell, there are two (maybe three) sane approaches to this problem: * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. 
* And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). * I'd be happy if the best practice encouraged one of those two (or maybe three) approaches. I think an approach that called rand() to see how many U+FFFDs to emit when it encountered bad data is fair to discourage. -Shawn From unicode at unicode.org Wed May 31 15:06:29 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 21:06:29 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <20170531210629.2bb15b73@JRWUBU2> On Wed, 31 May 2017 17:43:08 +0000 Shawn Steele via Unicode wrote: > There also appears to be a special weight given to > non-minimally-encoded sequences. It would seem to me that none of > these illegal sequences should appear in practice, so we have either: > I do not understand the energy being invested in a case that > shouldn't happen, especially in a case that is a subset of all the > other bad cases that could happen. That's not the motivation for my using a structurally based approach. I want to expend as little energy as possible, both in thought (Keep It Simple, Stupid) and in machine cycles, in catering for these overlong/non-scalar value cases. I have to cater for indisputably illegal truncated sequences, but for the rest of it I optimise for the conformant case. If I'm extracting scalar values, I calculate the scalar value and then check that it's legal. If I'm advancing through a string, I just advance by the requisite number of trailing bytes. UTF-8 is simple in concept, and I try to follow that simplicity. A state machine overcomplicates it. Moroever, if I want to handle CESU-8 or U+0000 as opposed to a sentinel null, it is easy to add special case logic to a scalar value extractor. > > -Shawn > From unicode at unicode.org Wed May 31 15:20:02 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 21:20:02 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <20170531212002.72ab9ed3@JRWUBU2> On Wed, 31 May 2017 19:24:04 +0000 Shawn Steele via Unicode wrote: > It seems to me that being able to use a data stream of ambiguous > quality in another application with predictable results, then that > stream should be ?repaired? prior to being handed over. Then both > endpoints would be using the same set of FFFDs, whether that was > single or multiple forms. 
This of course depends on where the damage is being done. You're urging that applications check the strings they have generated as they export them. Richard.