From unicode at unicode.org Mon May 1 00:17:05 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 1 May 2017 07:17:05 +0200 Subject: Tibetan Paluta In-Reply-To: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode : > Just about the name paluta: > In Sanskrit, the length of vowels is measured in maatra (a cognate of the > word 'meter'). It is the spoken length of a short vowel. In Latin it is > termed mora. Usually, you have only single and double length vowels. A > paluta length is like when you call out somebody from a distance. Pluta is > a careless use of spelling. Virama and Halanta are two other terms loosely > used. > > Anyway, Unicode is only about DISPLAYING a script: There's a shape here; > Let's find how to get it by assembling other shapes or by creating a code > point for it. What is short, long or longer in speech is no concern for > Unicode. > Wrong. Unicode is absolutely not about how to "display" any script (except symbols and notational symbols). Unicode does not encode glyphs. Unicode encodes "abstract characters" according to their semantics, in order to assign them properties allowing meaningful transformations of text and in order to allow performing searches (with collation algorithms). What is important is their properties (something that ISO 10646 did not care about when it started the UCS as a separate project, ignoring how it would be used, focusing too much on apparent glyphs, and introducing a lot of "compatibility characters" that would not have been encoded otherwise, creating some havoc in logical processing). Anyway Unicode makes some exceptions to the logical model only for roundtrip compatibility with other standards that used another encoding model widely used, notably in Thai: these are the exceptions where there are "prepended" letters. There was some havoc also for some scripts in India because of roundtrip compatibility with an Indian standard (criticized by many users of Tamil and some other Southern Indic scripts that don't follow directly the paradigm created for getting some limited transliteration with Devanagari: that initial desire was abandoned but the legacy Indic scripts in India were imported as is to Unicode). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 07:14:18 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 1 May 2017 13:14:18 +0100 Subject: Unicode is more than shapes (was: Tibetan Paluta) In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: <20170501131418.665947ee@JRWUBU2> On Mon, 1 May 2017 07:17:05 +0200 Philippe Verdy via Unicode wrote: > 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode > : > > Anyway, Unicode is only about DISPLAYING a script: There's a shape > > here; Let's find how to get it by assembling other shapes or by > > creating a code point for it. What is short, long or longer in > > speech is no concern for Unicode. When there is considerable variation in shape, describing the function of a character can be of great help in determining the character code to enter for some relatively obscure character. > Wrong. Unicode is absolutely not about how to "display" any script > (except symbols and notational symbols). Unicode does not encode > glyphs.
Unicode encodes "abstract characters" according to their > semantics, in order to assign them properties allowing meaningful > transformations of text and in order to allow perfoirming searches > (with collation algorithms). Of course, display is a very important transformation process! However, for many applications, an important part of display is knowing when to split text between lines, and in easy cases that can be done using knowledge of character properties. In hard cases, the user has to insert line-breaking permissions and even prohibitions. There are special characters for these functions. It's somewhat misleading to say that searches use collation algorithms. What is true is that folding can use enough of the same computational processes that much of the code for collation may be re-used for search. Different data tables are frequently appropriate. > Anyway Unciode makes some exceptions to the logical model only for > roundtrip comptaibility with other standards that used another > encoding model widely used, notably in Thai: these are the exception > where there are "prepended" letters. What "logical" model? I don't think you know how Thai works. The key feature is that the Indic consonant stack has no delimiter in Thai, which makes the phonetic placement of preposed vowels ambiguous. In some of the other relevant features that I am aware of, Lao works quite differently. Tai Viet was encoded in visual order. You forget one other change. New Tai Lue switched from phonetic order to visual order because it hadn't been worth Microsoft's while to implement the simple rendering engine. The Universal Shaping Engine (USE) should prevent this happening again with straightforward complex scripts, but good intentions (namely, replacing the working renderer from HarfBuzz and thus Firefox, Chrome and LibreOffice with an emulation of the USE) may unintentionally repeat the process with 'Old Tai Lue'. Using phonetic order in Tai Tham distinguishes homographs (if I may use the term here) that would usually be collated differently. > There was some havoc also for > some scripts in India because of roundtrip compatiblity with an > Indian standard (criticized by many users of Tamil and some other > Southern Indic scripts that don't follow directly the paradigm > created for getting some limited transliteration with Devanagari: > that initial desire was abandoned but the legacy Indic scripts in > India were imported as is to Unicode) The havoc is because half-forms are a north Indian innovation, not an ancient Indic feature. Tamil suffered from the ISCII conflation of combining and merely having no vowel, the Unicode virama. Tibetan and Khmer led the way in splitting the concepts, and the Unicode virama in the Myanmar script was disunified into an invisible stacker and a pure killer. Many of the Tamil complaints arise because the implicit vowel is ill-suited to Tamil, but an attempt to move away from that system about two thousand years ago did not persist. Richard. From unicode at unicode.org Mon May 1 09:19:27 2017 From: unicode at unicode.org (Naena Guru via Unicode) Date: Mon, 1 May 2017 19:49:27 +0530 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: This whole attempt to make digitizing Indic script some esoteric, 'abstract', 'semantic representation' and so on seems to me is an attempt to make Unicode the realm of the some super humans. 
The purpose of writing is to represent speech. It is not some secret that demi-gods created, which we are trying to explain with 'modern' linguistic gymnastics. sound => letter that is the basis for writing. English writing was massacred when printing was brought in from Europe. A similar thing is happening to Indic by all this mumbo-jumbo. I call out to NATIVE users of Indic to explain what apparently Europeans or Americans are discussing here. On 5/1/2017 10:47 AM, Philippe Verdy wrote: > > > 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode >: > > Just about the name paluta: > In Sanskrit, the length of vowels is measured in maatra (a > cognate of the word 'meter'). It is the spoken length of a short > vowel. In Latin it is termed mora. Usually, you have only single > and double length vowels. A paluta length is like when you call > out somebody from a distance. Pluta is a careless use of spelling. > Virama and Halanta are two other terms loosely used. > > Anyway, Unicode is only about DISPLAYING a script: There's a shape > here; Let's find how to get it by assembling other shapes or by > creating a code > point for it. What is short, long or longer in > speech is no concern for > Unicode. > > > Wrong. Unicode is absolutely not about how to "display" any script > (except symbols and notational symbols). Unicode does not encode > glyphs. Unicode encodes "abstract characters" according to their > semantics, in order to assign them properties allowing meaningful > transformations of text and in order to allow performing searches > (with collation algorithms). What is important is their properties > (something that ISO 10646 did not care about when it started the UCS as a > separate project, ignoring how it would be used, focusing too much on > apparent glyphs, and introducing a lot of "compatibility characters" > that would not have been encoded otherwise, creating some havoc in > logical processing). > > Anyway Unicode makes some exceptions to the logical model only for > roundtrip compatibility with other standards that used another > encoding model widely used, notably in Thai: these are the exceptions > where there are "prepended" letters. There was some havoc also for > some scripts in India because of roundtrip compatibility with an Indian > standard (criticized by many users of Tamil and some other Southern > Indic scripts that don't follow directly the paradigm created for > getting some limited transliteration with Devanagari: that initial > desire was abandoned but the legacy Indic scripts in India were > imported as is to Unicode) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 10:25:28 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 1 May 2017 16:25:28 +0100 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: <20170501162528.171f631b@JRWUBU2> On Mon, 1 May 2017 19:49:27 +0530 Naena Guru via Unicode wrote: > The purpose of writing is to represent speech. It is not some secret > that demi-gods created Sarasvati and Thoth would be offended at being called mere demi-gods. > sound => letter that is the basis for writing. "=>" is not a particularly phonetic notation. It took quite a while for letters to become the primary part of writing anywhere, and they are not a universal phenomenon. Richard.
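A side note on the search point made earlier in this thread: what search code usually relies on is folding rather than a full collation. A minimal sketch in Python, standard library only (a real implementation would use proper collation data such as ICU's, and this crude fold does not preserve offsets into the original text):

import unicodedata

def fold(s: str) -> str:
    # Case-fold, decompose (NFKD), and drop combining marks, so that accented
    # and unaccented spellings match at roughly "primary strength".
    decomposed = unicodedata.normalize("NFKD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def folded_find(haystack: str, needle: str) -> bool:
    # Substring search on the folded forms.
    return fold(needle) in fold(haystack)

print(folded_find("Überlingen café", "CAFE"))  # True
print(folded_find("Überlingen café", "uber"))  # True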
From unicode at unicode.org Mon May 1 10:26:04 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Mon, 1 May 2017 16:26:04 +0100 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: On 1 May 2017, at 15:19, Naena Guru via Unicode wrote: > > This whole attempt to make digitizing Indic script some esoteric, 'abstract', 'semantic representation' and so on seems to me is an attempt to make Unicode the realm of the some super humans. No. It's important so that the standard Unicode algorithms function acceptably for Indic languages. The design of Unicode is such that, compatibility characters and some other special cases aside, it encodes semantics as opposed to graphic representations. > The purpose of writing is to represent speech. Yes, and Unicode is intended to give us a representation of speech *that is amenable to machine processing*. The other extreme is what used to happen on many Chinese and Japanese websites, namely “representing speech” by way of an image - if you want to process the text in one of those images, well, good luck with that (you'll want to start with some kind of OCR). Perhaps part of the problem here is that Unicode sits at the intersection between linguistics and software engineering; the discussion of both sides of this is likely to be quite technical, some of the vocabulary used might well seem like “mumbo jumbo”, just as some of the design decisions might not make sense if your expertise is mainly on one side or mainly on the other (or, for that matter, if you have little exposure to other languages or the challenges inherent in encoding or rendering them). However, for all that it might *sound* like “mumbo jumbo” to you, it is not. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Mon May 1 12:28:59 2017 From: unicode at unicode.org (Naena Guru via Unicode) Date: Mon, 1 May 2017 22:58:59 +0530 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: <20170501162528.171f631b@JRWUBU2> References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> <20170501162528.171f631b@JRWUBU2> Message-ID: <23ffaf0c-db97-2ebe-9b7c-aecf78023f90@gmail.com> A little humor is very good. sarasvati was a sweet girl, I am sure, so much so that when she died, I think, those who were imagining about her beyond practical, made her rise up, up and fly away. Now you watch what happens to Elizabeth when she dies. They narrowly failed making one such with Hillary Clinton as she is suspected of having Parkinson's which condition her daughter says has an anecdotal remedy with MaryJane. Hmmm... Who went to her daughter's house instead of to the doctor when they suddenly fell? As for Thoth, he is okay. Don't worry. Egyptian man => demi-god => god has not much of a consequence in the West dominated culture of this day. On 5/1/2017 8:55 PM, Richard Wordingham via Unicode wrote: > On Mon, 1 May 2017 19:49:27 +0530 > Naena Guru via Unicode wrote: > >> The purpose of writing is to represent speech. It is not some secret >> that demi-gods created > Sarasvati and Thoth would be offended at being called mere demi-gods. > >> sound => letter that is the basis for writing. > "=>" is not a particularly phonetic notation. It took quite a while > for letters to become the primary part of writing anywhere, and they > are not a universal phenomenon. > > Richard. Okay, Richard.
You probably have knowledge of how writing evolved in the whole world. Tell us how it was in South Asia. Was it like I said, sound => letter? I assume only to know about English and Indic in this respect. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 14:12:22 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Mon, 1 May 2017 19:12:22 +0000 Subject: How to Add Beams to Notes Message-ID: I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters in the unmapped positions of that block (e.g. U+1D127 inverts the stem of the previous note, U+1D1E9 is a ledger line, U+1D1EA is the "TAB" clef, U+1D1F0-U+1D1FC position the note along the staff, etc.) I've had no problem so far, but now I need to do beamed notes. The Unicode block has control characters for beginning and ending a series of beamed notes (U+1D173 and U+1D174, respectively), but I'm not really sure how to add beams to the notes while keeping the pitch intact. I know I'll obviously need OpenType for this. Slanted beams would be preferred, but straight beams are acceptable. It will need to support beams added on for longer notes. Can someone help me with this? I had asked this on a High Logic Font Creator forum (here), and someone said to subscribe to your mailing list and ask you guys. So here I am! Anyway, help, please? Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 15:04:29 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 1 May 2017 13:04:29 -0700 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 15:53:33 2017 From: unicode at unicode.org (=?iso-8859-1?Q?St=F6tzner_Signographie?= via Unicode) Date: Mon, 1 May 2017 22:53:33 +0200 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: Bad news, I'm afraid. What is the intended usage of your font? Music score applications? others? The overall problem with musical notation is that there is no comprehensive character encoding standard and no generally working text and layout composing method established. In the light of that fact it is hopeless to make fonts for this. The fonts are not the problem (yes they are, there is no solid encoding scheme available), but the lack of composing syntax is the crux you'll hardly overcome. If you need to cater for a specific usage scenario you'll end up with a complete hack anyway (however it may look, doesn't matter). Good luck! A. Stötzner (Musical notation project) On 01.05.2017 at 21:12, Michael Bear via Unicode wrote: > I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters in the unmapped positions of that block (e.g. U+1D127 inverts the stem of the previous note, U+1D1E9 is a ledger line, U+1D1EA is the "TAB" clef, U+1D1F0-U+1D1FC position the note along the staff, etc.) I've had no problem so far, but now I need to do beamed notes.
The Unicode block has control characters for beginning and ending a series of beamed notes (U+1D173 and U+1D174, respectively), but I'm not really sure how to add beams to the notes while keeping the pitch intact. I know I'll obviously need OpenType for this. Slanted beams would be preferred, but straight beams are acceptable. It will need to support beams added on for longer notes. Can someone help me with this? > > I had asked this on a High Logic Font Creator forum (here), and someone said to subscribe to your mailing list and ask you guys. So here I am! Anyway, help, please? > > Sent from Mail for Windows 10 > _______________________________________________________________________________ Andreas Stötzner Gestaltung Signographie Fontentwicklung Haus des Buches Gerichtsweg 28, Raum 434 04103 Leipzig 0176-86823396 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 18:03:53 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Mon, 1 May 2017 23:03:53 +0000 Subject: How to Add Beams to Notes In-Reply-To: References: , Message-ID: “Rather than using "unused code positions", I would always recommend to use some of the Private Use code points.” Consider it done. “What is the intended usage of your font? Music score applications? others?” I am simply going to make a series of full Unicode fonts (which, due to the 65,535-character limit in fonts, each of the 3 fonts covers different planes: The first font does the BMP, the second one does the SMP, and the third one is all the other planes, which are vacant enough to fit in one font) that will have the necessary OpenType features of every script. And I thought “Hey, maybe I should do full OT for the music block that no one has really done yet! How awesome would that be?” So I made a test font to work it out, but I ran into this one pothole. That's when I came here. Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 19:01:08 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 02 May 2017 00:01:08 +0000 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: On Mon, May 1, 2017 at 7:26 AM Naena Guru via Unicode wrote: > This whole attempt to make digitizing Indic script some esoteric, > 'abstract', 'semantic representation' and so on seems to me is an attempt > to make Unicode the realm of the some super humans. > Unicode is like writing. At its core, it is a hairy esoteric mess; mix these certain chemicals the right ways, and prepare a writing implement and writing surface in the right (non-trivial) ways, and then manipulate that implement carefully to make certain marks that have unclear delimitations between correct and incorrect. But in the end, as much of that is removed from the problem of the user as possible; in the case of modern word-processing systems, it's a matter of hitting the keys and then hitting print, in complete ignorance of all the silicon and printing magic going on between. Unicode is not the realm of everyone; it's the realm of people with a certain amount of linguistic knowledge and computer knowledge. There's only a problem if those people can't make it usable for the everyday programmer and therethrough to the average person. > The purpose of writing is to represent speech. > Meh.
The purpose of writing is to represent language, which may be unrelated to speech (like in the case of SignWriting and mathematics) or somewhat related to speech--very few forms of writing are direct transcriptions of speech. Even the closest tend to exchange a lot of intonation details for punctuation that reveals different information. > English writing was massacred when printing was brought in from Europe. > No, it wasn't. Printing made no difference to the fact that English has a dozen vowels with five letters to write them. The thorn has little impact on the ambiguity of English writing. The problem with printing is that it fossilizes the written language, and our spellings have stayed the same while the pronunciations have changed. And the dissociation of sound and writing sometimes helps English; even when two English speakers from different parts of the world would have trouble understanding each other, writing is usually not so impaired. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 1 19:03:26 2017 From: unicode at unicode.org (John W Kennedy via Unicode) Date: Mon, 1 May 2017 20:03:26 -0400 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: > On May 1, 2017, at 3:12 PM, Michael Bear via Unicode wrote: > > I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters in the unmapped positions of that block (e.g. U+1D127 inverts the stem of the previous note, U+1D1E9 is a ledger line, U+1D1EA is the "TAB" clef, U+1D1F0-U+1D1FC position the note along the staff, etc.) I've had no problem so far, but now I need to do beamed notes. The Unicode block has control characters for beginning and ending a series of beamed notes (U+1D173 and U+1D174, respectively), but I'm not really sure how to add beams to the notes while keeping the pitch intact. I know I'll obviously need OpenType for this. Slanted beams would be preferred, but straight beams are acceptable. It will need to support beams added on for longer notes. Can someone help me with this? > > I had asked this on a High Logic Font Creator forum (here), and someone said to subscribe to your mailing list and ask you guys. So here I am! Anyway, help, please? You might want to acquaint yourself with http://www.smufl.org From unicode at unicode.org Mon May 1 19:27:24 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 2 May 2017 01:27:24 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: <20170502012724.3f109a86@JRWUBU2> On Mon, 1 May 2017 23:03:53 +0000 Michael Bear via Unicode wrote: > “Rather than using "unused code positions", I would always recommend > to use some of the Private Use code points.” Consider it done. > > “What is the intended usage of your font? Music score > applications? others?” I am simply going to make a series of full > Unicode fonts (which, due to the 65,535-character limit in fonts, > each of the 3 fonts covers different planes: The first font does the > BMP, How much margin do you have for the BMP? There are a fair few variation sequences, on top of all the contextual forms and conjuncts. Richard.
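On the question of how much margin the BMP leaves for a single 65,535-glyph font, a rough count of assigned code points can be made with Python's unicodedata module. This is only a sketch: the figure depends on the Unicode version bundled with the Python build, it includes the 6,400 BMP private-use code points, and it says nothing about the extra glyphs needed for variation sequences, contextual forms and conjuncts, which are exactly what eats the remaining headroom.

import unicodedata

# Count BMP code points that are assigned (not Cn) and not surrogates (Cs).
assigned = sum(
    1
    for cp in range(0x10000)
    if unicodedata.category(chr(cp)) not in ("Cn", "Cs")
)
print("Assigned, non-surrogate BMP code points:", assigned)
print("Glyph IDs left under the 65,535 limit:", 65535 - assigned)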
From unicode at unicode.org Mon May 1 22:08:27 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 2 May 2017 05:08:27 +0200 Subject: How to Add Beams to Notes In-Reply-To: <20170502012724.3f109a86@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> Message-ID: Consider also that the BMP is almost full, the remaining few holes are kept for isolated characters that may be added to existing scripts, or permanently reserved to avoid clashes with legacy software using simple code remappings between distinct blocks, or to perform simple case conversions (e.g. in Greek) for internal purposes (these positions are not interoperable and may clash with future versions of the UCS and I18n tools/libraries like ICU). You should abstain from using any currently unassigned positions in the existing Unicode blocks: use PUA if you have nothing else; there is plenty of space available, in the BMP (most common usage in fonts that need to map additional glyphs) or in the two last planes. The PUA block in the BMP is large enough for most apps and almost all fonts that need private glyphs for internal purposes, or for still unencoded characters or for your own encoded variants such as slanted symbols, rotated symbols, inverted symbols, or symbols with multiple sizes, or at different positions on the musical score, or using distinct styles (e.g. between different players or singers, or various symbols for percussive instruments or specific instruments, or extra annotations). Many new symbols have been encoded first as PUAs in early fonts used to create proposals (then rendered to a PDF, or embedded fonts in a rich text document, or webfonts loaded from static versioned URLs on a repository like GitHub or on a public cloud). Later the proposal passed the early steps for reviewing the repertoire and choosing more relevant positions, then characters were encoded and standardized and these fonts were updated to map their glyphs to not just their existing PUAs but also the new standard positions (or encoded variants). 2017-05-02 2:27 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Mon, 1 May 2017 23:03:53 +0000 > Michael Bear via Unicode wrote: > > > “Rather than using "unused code positions", I would always recommend > > to use some of the Private Use code points.” Consider it done. > > > > “What is the intended usage of your font? Music score > > applications? others?” I am simply going to make a series of full > > Unicode fonts (which, due to the 65,535-character limit in fonts, > > each of the 3 fonts covers different planes: The first font does the > > BMP, > > How much margin do you have for the BMP? There are a fair few > variation sequences, on top of all the contextual forms and conjuncts. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 2 11:43:18 2017 From: unicode at unicode.org (Naena Guru via Unicode) Date: Tue, 2 May 2017 22:13:18 +0530 Subject: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta In-Reply-To: References: <1206d55a-0725-5488-bbe0-af0a0c3f3d10@gmail.com> Message-ID: <5cfab138-51e2-9413-2d45-ad13cd8f333b@gmail.com> Thank you, professor. You wrote exactly what one would expect from a professor. It is a wonderful display of your prowess in the subject. Doctors and lawyers use Latin for concealment and self-preservation. Greenspan used Greenspanish. Unicode masters use Unicodish.
Indic is the name Unicode assigned to South Asian writing systems that are associated with Sanskrit vyaakaraNa. This is a result of what the good professor explains by "Unicode is not the realm of everyone; it's the realm of people with a certain amount of linguistic knowledge and computer knowledge". What is that 'certain amount' and which deity decides it? How do we unfortunate nincompoops decode it? Decode itself is beyond us, indeed. South Asians, especially Indians who already seem to have too many gods to deal with, do not need, though they might be tempted to add an image of the exalted Unicode god behind a colorful curtain to sing praise to with an alms box marked M$ besides to get favors each time the high priest scrubs off some of its 'hairy esoteric mess' while surreptitiously (or, ignorantly?) adding more. Brahmins were able to make any declaration because they were privileged. Similarly, Unicode experts can make declarations like, 'very few forms of writing are direct transcriptions of speech' and hide behind the 'in case' adjective 'direct' to avoid giving actual data. Of course, they can boldly count Sinhala as one that is not a direct transcription of speech. Speech getting transcribed into writing itself is a Unicodish. Hark! The professor declares. So, boys and girls, if you want to pass the test memorize this, even if it is obviously false: Printing made no difference to the fact that English has a dozen vowels with five letters to write them. The thorn has little impact on the ambiguity of English writing. The problem with printing is that it fossilizes the written language, and our spellings have stayed the same while the pronunciations have changed. And the dissociation of sound and writing sometimes helps English; even when two English speakers from different parts of the world would have trouble understanding each other, writing is usually not so impaired. It is printing with the dictionary industry that fossilized writing and as a result, forced speech to comply. The 'certain' level of knowledge above is now revealed. Language, dialect, creole, migration, intermixing of different peoples, accent...; where do these stand? Find ye by the foregoing what the fossil 'ye' actually was and what caused it to get fossilized in this form. On 5/2/2017 5:31 AM, David Starner wrote: > On Mon, May 1, 2017 at 7:26 AM Naena Guru via Unicode > > wrote: > > This whole attempt to make digitizing Indic script some esoteric, > 'abstract', 'semantic representation' and so on seems to me is an > attempt to make Unicode the realm of the some super humans. > > Unicode is like writing. At its core, it is a hairy esoteric mess; mix > these certain chemicals the right ways, and prepare a writing > implement and writing surface in the right (non-trivial) ways, and > then manipulate that implement carefully to make certain marks that > have unclear delimitations between correct and incorrect. But in the > end, as much of that is removed from the problem of the user as > possible; in the case of modern word-processing system, it's a matter > of hitting the keys and then hitting print, in complete ignorance of > all the silicon and printing magic going on between. > > Unicode is not the realm of everyone; it's the realm of people with a > certain amount of linguistic knowledge and computer knowledge. There's > only a problem if those people can't make it usable for the everyday > programmer and therethrough to the average person. > > The purpose of writing is to represent speech. > > Meh. 
The purpose of writing is to represent language, which may be > unrelated to speech (like in the case of SignWriting and mathematics) > or somewhat related to speech--very few forms of writing are direct > transcriptions of speech. Even the closest tend to exchange a lot of > intonation details for punctuation that reveals different information. > > English writing was massacred when printing was brought in from > Europe. > > No, it wasn't. Printing made no difference to the fact that English > has a dozen vowels with five letters to write them. The thorn has > little impact on the ambiguity of English writing. The problem with > printing is that it fossilizes the written language, and our spellings > have stayed the same while the pronunciations have changed. And the > dissociation of sound and writing sometimes helps English; even when > two English speakers from different parts of the world would have > trouble understanding each other, writing is usually not so impaired. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 2 22:17:10 2017 From: unicode at unicode.org (N. Ganesan via Unicode) Date: Tue, 2 May 2017 20:17:10 -0700 Subject: Internet unicode use of Indian languages Message-ID: In India, Tamil is the most used language on the internet. https://assets.kpmg.com/content/dam/kpmg/in/pdf/2017/04/Indian-languages-Defining-Indias-Internet.pdf http://www.vikatan.com/news/india/88214-tamil-is-the-most-used-indian-language-says-google.html N. Ganesan -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 3 02:49:49 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 3 May 2017 08:49:49 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> Message-ID: <20170503084949.1b61d689@JRWUBU2> On Tue, 2 May 2017 05:08:27 +0200 Philippe Verdy via Unicode wrote: > Consider also that the BMP is almost full, the remaining few holes > are kept for isolated characters that may be added to existing > scripts, or permanently reserved to avoid clashes with legacy > software using simple code remappings between distinct blocks, or to > perform simple case conversions (e.g. in Greek) for internal purposes > (these positions are not interoperable and may clash with future > versions of the UCS and I18n tools/libraries like ICU). > > You should abstain from using any currently unassigned positions in the > existing Unicode blocks: use PUA if you have nothing else; there is > plenty of space available, in the BMP (most common usage in fonts > that need to map additional glyphs) or in the two last planes. It isn't codepoints that are the constraint; one must consider the number of glyphs without dedicated one-character codes. For example, U+1000 MYANMAR LETTER KA needs glyphs for: 1000; 1000 FE00; 1039 1000 (and probably at two different widths); 1039 1000 FE00 (do.). There are a few CJK ideographs with similar needs: 537F; 537F FE00 (= CJK COMPATIBILITY IDEOGRAPH-2F831); 537F FE01 (= CJK COMPATIBILITY IDEOGRAPH-2F832); 537F FE02 (= CJK COMPATIBILITY IDEOGRAPH-2F833). There's also the Japanese ideographic variation sequence , which should probably have its own glyph even if it's the same as one of the above. The Arabic script (and other cursively connected scripts) has similar expansions, even if one goes for a typewritten style. Devanagari explodes when one considers just the conjuncts prescribed for Hindi.
I think it's also necessary to avoid splitting likely grapheme clusters between fonts. Which of the three fonts will support U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F (English flag) and which U+261D U+1F3FF (index pointing up: dark skin tone)? Now, the BMP has headroom provided by the surrogate characters and the PUA, which will not have mappings, but I'm not sure that it's enough. That's why I asked the question. Richard. From unicode at unicode.org Wed May 3 05:20:16 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 3 May 2017 11:20:16 +0100 (BST) Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: <20170503084949.1b61d689@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> Message-ID: <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Richard Wordingham wrote: .... U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F (English flag) .... I looked at that and I realized that although I had effectively seen that encoding in http://www.unicode.org/reports/tr51/tr51-11.html though expressed differently, it was only when I saw it expressed as above that I realized that there is something gone wrong with encoding policy. There are at present ten totally unused planes in the Unicode code point map and yet that seven character sequence is needed for encoding an English flag. Surely a single code point could be found. Single code points are being found for various emoji items on a continuing basis. Why pull up the ladder on encoding some flags each with a single code point? Yes, a single code point for an English flag please. And one for a Welsh flag too please. And one for a Scottish flag too please. And some others please, if that is what end users want. William Overington Wednesday 3 May 2017 From unicode at unicode.org Wed May 3 10:07:35 2017 From: unicode at unicode.org (David Faulks via Unicode) Date: Wed, 03 May 2017 11:07:35 -0400 Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 3 12:26:42 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 3 May 2017 10:26:42 -0700 Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Message-ID: On 5/3/2017 3:20 AM, William_J_G Overington via Unicode wrote: > Surely a single code point could be found. Single code points are being found for various emoji items on a continuing basis. Why pull up the ladder on encoding some flags each with a single code point? > > Yes, a single code point for an English flag please. And one for a Welsh flag too please. And one for a Scottish flag too please. And some others please, if that is what end users want. I suggest the following: 10BEDE for an English flag (reminding one of Bede the Venerable) 10CADF for a Welsh flag (harking to Cadfan ap Iago, King of Gwynedd) 10A1BA for a Scottish flag (for Alba, of course) Surely those would work for you! 
--Ken From unicode at unicode.org Wed May 3 15:31:10 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 3 May 2017 21:31:10 +0100 (BST) Subject: English flag (from Re: How to Add Beams to Notes) In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <21667.19758.1493806816884.JavaMail.defaultUser@defaultHost> Message-ID: <22290163.68159.1493843470220.JavaMail.defaultUser@defaultHost> Ken Whistler wrote: > I suggest the following: > 10BEDE for an English flag (reminding one of Bede the Venerable) > 10CADF for a Welsh flag (harking to Cadfan ap Iago, King of Gwynedd) > 10A1BA for a Scottish flag (for Alba, of course) > Surely those would work for you! Thank you for your reply. Nicely! Those code points each have a helpful mnemonic. I had not known of Cadfan ap Iago until I read your post. I found the following. https://en.wikipedia.org/wiki/Cadfan_ap_Iago I opine that we need to make it clear, for the benefit of some people new to Unicode who may be reading this thread, that those code points are in one of the Private Use Areas, namely Supplementary Private Use Area-B, so there could be problems using them in some circumstances due to lack of uniqueness in the use of those code points. http://www.unicode.org/charts/PDF/U100000.pdf William Overington Wednesday 3 May 2017 From unicode at unicode.org Wed May 3 22:01:17 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 4 May 2017 05:01:17 +0200 Subject: How to Add Beams to Notes In-Reply-To: <20170503084949.1b61d689@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> Message-ID: 2017-05-03 9:49 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Tue, 2 May 2017 05:08:27 +0200 > Philippe Verdy via Unicode wrote: > > > Consider also that the BMP is almost full, the remaining few holes > > are kept for isolated characters that may be added to existing > > scripts, or permanently reserved to avoid clashes with legacy > > software using simple code remappings between distinct blocks, or to > > perform simple case conversions (e.g. in Greek) for internal purposes > > (these positions are not interoperable and may clash with future > > versions of the UCS and I18n tools/libraries like ICU). > > > > You should abstain from using any currently unassigned positions in the > > existing Unicode blocks: use PUA if you have nothing else; there is > > plenty of space available, in the BMP (most common usage in fonts > > that need to map additional glyphs) or in the two last planes. > > It isn't codepoints that are the constraint; one must consider the > number of glyphs without dedicated one-character codes. > Glyph processing requires internal glyph ids in fonts. The limit is on the total number of glyphs you can put in that font without exceeding the maximum size of glyph id's. Traditionally this is solved by creating coherent (but complete enough) subsets so that all glyphs within the same script can fit. The other solution, notably for sinograms, is to use font linking. The Arabic script (and other cursively connected scripts) has similar > expansions, even if one goes for a typewritten style. > > Devanagari explodes when one considers just the conjuncts prescribed for > Hindi. > Rendering Devanagari with OpenType does not require any PUA assignment in that font for variants. The sequences are mapped directly using subtables and the rules defined in OpenType for that script.
Fonts just use their own internal glyph ID's without having to assign them any Unicode mapping, using Glyph processing rules. Same remark about Arabic (though some encoded compatibility characters will map to some of these glyphs... without using any PUA). > > I think it's also necessary to avoid splitting likely grapheme > clusters between fonts. Which of the three fonts will support U+1F3F4 > U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F (English flag) and > which U+261D U+1F3FF (index pointing up: dark skin tone)? > > Now, the BMP has headroom provided by the surrogate characters and the > PUA, which will not have mappings, but I'm not sure that it's enough. > > For your question, the solution is to create coherent subsets of symbols and create fonts from this subset. For the case of country/region flags, they could all be separated in a specific font. As well you can create separate fonts for persons/animals/plants, and another one for inanimate objects (including planets, game pieces...) Traditional punctuation-like symbols used in typography and normally without any emoji style can fit in a generic symbols font (along with geometric shapes, line drawing symbols). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 4 02:26:37 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 4 May 2017 08:26:37 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> Message-ID: <20170504082637.40229878@JRWUBU2> On Thu, 4 May 2017 05:01:17 +0200 Philippe Verdy via Unicode wrote: > Rendering Devanagari with OpenType does not require any PUA > assignment in that font for variants. The sequences are mapped > directly using subtables and the rules defined in OpenType for that > script. Fonts just use their own internal glyph ID's without having > to assign them any Unicode mapping, using Glyph processing rules. > > Same remark about Arabic (though some encoded compatibility > characters will map to some of these glyphs... without using any PUA). The OP's plan is to use one font for the BMP, one font for the SMP, and one font for the rest. However, the BMP font Code2000, which only goes, incompletely, up to Unicode 5.2, uses 63,546 glyphs, which is very close to the limit of 65,535. There is the slight margin that it included a few small scripts with standardised (ConScript Unicode Registry) PUA allocations. Richard. From unicode at unicode.org Thu May 4 07:50:52 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 4 May 2017 14:50:52 +0200 Subject: How to Add Beams to Notes In-Reply-To: <20170504082637.40229878@JRWUBU2> References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <20170504082637.40229878@JRWUBU2> Message-ID: You cannot cover a full plane with a single font. There are other factors such as total size that also severely limit their use. We have to live with the limitations of OpenType. In addition a giant font is hard to maintain, version and update without breaking usages. Font authors should focus their efforts on separating scripts within a collection of related fonts (like what the Noto project did): the rest will use font linking (which can be and already is used by renderers, and can also be parameterized by users for accessibility, or to use preferred variants in some domains).
Also not all scripts have the same kinds of style variants (serif/sans-serif, 2 or more distinctive weights, straight/italic/oblique, plain/hollow/shadowed), and trying to synthesize these styles will break the nature of the script (notably for many symbols): you'll need separate fonts for separate styles for specific scripts, other scripts may support synthetic styles or not alter their rendering at all. Code2000 is then just useful as a last resort font, but its glyphs are still very poor compared to other fonts, and the fact it uses the same font-wide strategy for hinting also creates lots of caveats: you cannot hint Sinograms like Latin or Greek and symbols have their separate requirements (notably geometric shapes and line drawing). Finally the bad thing about Code2000 is about font metrics, notably baselines: while you want to unify these baselines and line-heights, you'll reach the point where some scripts are ridiculously too small or improperly aligned: it's much easier to separate them and tune these metrics separately. Trying to fix these metrics for one script will break another one in that font, and finally you cannot create a comprehensive coverage test and get stable results because there are contradicting objectives for different uses: it's much easier to conciliate the possible choices by separating scripts, so that you can more easily create additional variants for a few of them, and then create a separate rendering engine which will use some parameterized rules for selecting the most appropriate fonts. And then it's much easier to update only one of these fonts when there are improvements, without breaking all the rest. 2017-05-04 9:26 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Thu, 4 May 2017 05:01:17 +0200 > Philippe Verdy via Unicode wrote: > > > Rendering Devanagari with OpenType does not require any PUA > > assignment in that font for variants. The sequences are mapped > > directly using subtables and the rules defined in OpenType for that > > script. Fonts just use their own internal glyph ID's without having > > to assign them any Unicode mapping, using Glyph processing rules. > > > Same remark about Arabic (though some encoded compatibility > > characters will map to some of these glyphs... without using any PUA). > > The OP's plan is to use one font for the BMP, one font for the SMP, and > one font for the rest. However, the BMP font Code2000, which only > goes, incompletely, up to Unicode 5.2, uses 63,546 glyphs, which is > very close to the limit of 65,535. There is the slight margin that it > included a few small scripts with standardised (ConScript > Unicode Registry) PUA allocations. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 4 18:13:08 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Thu, 4 May 2017 23:13:08 +0000 Subject: How to Add Beams to Notes In-Reply-To: References: , Message-ID: “How much margin do you have for the BMP? There are a fair few variation sequences, on top of all the contextual forms and conjuncts.” I plan to do everything in the plane EXCEPT for the surrogates, which you're not supposed to encode in fonts anyway, which leaves room for about 2,048 more glyphs for OpenType features. Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu May 4 19:54:41 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 5 May 2017 01:54:41 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: <20170505015441.3fd8585e@JRWUBU2> On Thu, 4 May 2017 23:13:08 +0000 Michael Bear via Unicode wrote: > I plan to do everything in the plane EXCEPT for the surrogates, which > you're not supposed to encode in fonts anyway, which leaves room for > about 2,048 more glyphs for OpenType features. There are, if I avoided double counting errors, 56,251 assigned characters in the BMP in Unicode 10.0.0. There are 1008 standardised variation sequences, all in the BMP. Indic scripts require more glyphs than they have characters - usually at least twice as many. You have read the chapter on Devanagari, haven't you? Richard. From unicode at unicode.org Fri May 5 13:46:17 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Fri, 5 May 2017 18:46:17 +0000 Subject: How to Add Beams to Notes In-Reply-To: <20170505015441.3fd8585e@JRWUBU2> References: , <20170505015441.3fd8585e@JRWUBU2> Message-ID: Additionally, I will only do OT features that are absolutely necessary for a certain script, not unnecessary (although stylish!) features, e.g. I will include things like mark positioning and init/medi/fina forms for Arabic, while leaving out small caps, swashes, and extensive ligatures. (In an earlier post, I might have said I'll do ALL of the possible OT features. If so, I misspoke.) But if the cry for space gets REALLY desperate, I'll merge identical glyphs into one glyph. Obviously, I won't do this for more problematic merges, only glyphs in similar scripts with similar features. (e.g. I would represent Latin small letter o, Greek small letter omicron, Cyrillic small letter o, Armenian letter oh, and Georgian labial sign with one glyph, while Hebrew letter samekh and Arabic letter ae, despite also being circular, would be two separate glyphs.) But I'll only do this if I really need to. Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri May 5 17:07:08 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sat, 6 May 2017 00:07:08 +0200 Subject: How to Add Beams to Notes In-Reply-To: References: Message-ID: > On 1 May 2017, at 21:12, Michael Bear via Unicode wrote: > > I am trying to make a music notation font. It will use the Musical Symbols block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not being very complete, I added some extra characters... SMuFL has a rather comprehensive set of musical symbols. http://www.smufl.org/ http://www.smufl.org/version/latest/ http://www.smufl.org/fonts/ From unicode at unicode.org Sat May 6 07:54:07 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Sat, 6 May 2017 12:54:07 +0000 Subject: Sutton SignWriting PDF Message-ID: If I open the Sutton SignWriting code chart in Mozilla Firefox, the glyphs in the tables are blank. I have no idea why. If I open it in Microsoft Edge, however, it works fine. Do you know why this is? Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Sat May 6 19:56:21 2017 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 6 May 2017 16:56:21 -0800 Subject: How to Add Beams to Notes In-Reply-To: References: <20170502012724.3f109a86@JRWUBU2> <20170503084949.1b61d689@JRWUBU2> <20170504082637.40229878@JRWUBU2> Message-ID: Philippe Verdy wrote, > Code2000 ... uses the same font-wide strategy for hinting also > creates lots of caveats: ... Code2000 does not have hinting instructions; that's the font-wide strategy. > Finally the bad thing about Code2000 is about font metrics, notably > baselines: while you want to unify these baselines and line-heights, > you'll reach the point where some scripts are ridiculously too small > or improperly aligned ... Do you have an example of either? Is it possible that any improper alignment or disproportionate glyphs in your display are being caused by something other than the font? > Trying to fix these metrics for one script will break another one > in that font ... Trying to fix something which isn't broken is generally a bad plan. I wonder if the bizarre behavior you're reporting might have been caused by some third party "fixing" something in the font. In a pan-Unicode font, the base of the CJK ideographs wouldn't be expected to match the baseline of alphabetic scripts. Likewise, the base of the stems used in Indic scripts shouldn't be expected to match the baseline of alphabetic scripts as Indic scripts don't use baselines. Rather, the glyphs in such a font might be designed so that, even with reasonable above and below marks/diacritics, there would be no excessive line gaps generated for the other scripts covered in the font. A font which made, for example, Tibetan base letters the same size as Latin letters would work just fine... as long as you don't mind that runs of Latin text displayed with the font would appear to have two or three line feeds inserted between each line. Best regards, James Kass From unicode at unicode.org Sun May 7 03:03:34 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 7 May 2017 09:03:34 +0100 Subject: Sutton SignWriting PDF In-Reply-To: References: Message-ID: <20170507090334.52230093@JRWUBU2> On Sat, 6 May 2017 12:54:07 +0000 Michael Bear via Unicode wrote: > If I open the Sutton SignWriting code chart in Mozilla Firefox, the > glyphs in the tables are blank. I have no idea why. If I open it in > Microsoft Edge, however, it works fine. Do you know why this is? It smacks of being a fault in Firefox. If I download the file on Linux, I can then read it using Adobe Reader 9 or evince 3.18.2, but still not with Firefox 53.0. Of course, it's possible that there's a fault in the file that doesn't affect other readers - Adobe Reader has had a spate of problems with embedded fonts, but it may have been the PDF generators that were at fault in that case. The short-term practical solution is to change the plug-in action for PDF's - short URL is about:preferences#applications. From the support page at https://support.mozilla.org/en-US/kb/view-pdf-files-firefox , it looks like an old or recurring problem with Firefox. Richard.
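On the recurring question in this thread of whether a single font can cover the BMP within the 65,535 glyph-ID limit, a quick empirical check of any existing font is possible with the third-party fontTools package. This is only a sketch, assuming fontTools is installed; "Code2000.ttf" is used purely as a placeholder path, and the cmap count ignores variation-selector subtables:

from fontTools.ttLib import TTFont

font = TTFont("Code2000.ttf")        # placeholder path to the font under test
num_glyphs = font["maxp"].numGlyphs  # total number of glyphs in the font
# Glyphs reachable directly from a character, as opposed to glyphs that exist
# only as targets of OpenType lookups (contextual forms, conjuncts, variants).
mapped = set(font["cmap"].getBestCmap().values())

print("Glyphs in font:", num_glyphs, "of a maximum 65535")
print("Glyphs mapped directly from characters:", len(mapped))
print("Glyphs only reachable via OpenType rules:", num_glyphs - len(mapped))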
From unicode at unicode.org Sun May 7 03:23:08 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 7 May 2017 09:23:08 +0100 Subject: How to Add Beams to Notes In-Reply-To: References: <20170505015441.3fd8585e@JRWUBU2> Message-ID: <20170507092308.316a7a3a@JRWUBU2> On Fri, 5 May 2017 18:46:17 +0000 Michael Bear via Unicode wrote: > But > if the cry for space gets REALLY desperate, I?ll merge identical > glyphs into one glyph. Obviously, I won?t do this for more > problematic merges, only glyphs in similar scripts with similar > features. (e.g. I would represent Latin small letter o, Greek small > letter omicron, Cyrillic small letter o, Armenian letter oh, and > Georgian labial sign with one glyph, while Hebrew letter samekh and > Arabic letter ae, despite also being circular, would be two separate > glyphs.) But I?ll only do this if I really need to. That could cause problems with extracting text from PDFs generated using the font. My interest was in whether a pan-BMP font was still possible. As you haven't done the counting (which is ill-defined for scripts with conjuncts, and possibly even also for old Hangul support), you can't tell me yet. Richard. From unicode at unicode.org Tue May 9 09:09:57 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Tue, 9 May 2017 14:09:57 +0000 Subject: CSUR and UCSUR glyphs Message-ID: I need some help with the glyphs from the CSUR and UCSUR. Some of the glyphs were no problem, such as the Tengwar and Cirth ones, because their pages actually show the glyphs on their pages. Others do not, which poses a bit of a problem. Some of them have links to other sites that are intended to show the glyphs, but most of those links are outdated and lead to 404s. I could just pull up an archived version with the Wayback machine (web.archive.org), and for some of them, this works, but most of them don?t have any saved versions. I could just do a Google search to find out what the characters look like, but many of the scripts are too obscure to get anything reliable out of that Google search. I?m making a font with everything in the UCSUR, and this is a major obstacle I must overcome. So do you guys know where I could get glyph shapes for most of these scripts? Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 9 11:24:27 2017 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Tue, 9 May 2017 09:24:27 -0700 Subject: CSUR and UCSUR glyphs In-Reply-To: References: Message-ID: In addition to the sites linked to by the CSUR and UCSUR pages, there are the PDFs linked to by the UCSUR page, and the Constructium and Fairfax fonts. There is also a font called Nishiki-teki that includes a lot of CSUR and UCSUR scripts, and a version of GNU Unifont that does as well. Beyond that, I know as much as you do. -- Rebecca Bettencourt On Tue, May 9, 2017 at 7:09 AM, Michael Bear via Unicode < unicode at unicode.org> wrote: > I need some help with the glyphs from the CSUR > and UCSUR > . > > > > Some of the glyphs were no problem, such as the Tengwar and Cirth ones, > because their pages *actually show the glyphs on their pages*. > > Others do not, which poses a bit of a problem. Some of them have links to > other sites that are intended to show the glyphs, but most of those links > are outdated and lead to 404s. 
I could just pull up an archived version > with the Wayback machine (web.archive.org), and for some of them, this > works, but most of them don?t have any saved versions. > > I could just do a Google search to find out what the characters look like, > but many of the scripts are too obscure to get anything reliable out of > that Google search. I?m making a font with everything in the UCSUR, and > this is a major obstacle I must overcome. So do you guys know where I could > get glyph shapes for most of these scripts? > > > > Sent from Mail for > Windows 10 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 9 11:31:04 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 09 May 2017 09:31:04 -0700 Subject: CSUR and UCSUR glyphs Message-ID: <20170509093104.665a7a7059d7ee80bb4d670165c8327d.127813a82a.wbe@email03.godaddy.com> Michael Bear wrote: > Some of the glyphs were no problem, such as the Tengwar and Cirth > ones, because their pages actually show the glyphs on their pages. > > Others do not, which poses a bit of a problem. [...] As you probably read on both the CSUR and UCSUR sites, neither is sponsored or endorsed by Unicode. They are side projects embarked upon by individuals, some of whom also happen to be involved in the Consortium. Please keep this in mind. I was never able to find some of these scripts, such as Pikto, even in the '90s when CSUR activity was at its peak. Herman Miller's site, linked from CSUR, still has all or most of his early alphabets. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 9 12:30:47 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 09 May 2017 10:30:47 -0700 Subject: If at first... (was: RE: CSUR and UCSUR glyphs) Message-ID: <20170509103047.665a7a7059d7ee80bb4d670165c8327d.f0a5ac1b47.wbe@email03.godaddy.com> I wrote: > I was never able to find some of these scripts, such as Pikto http://unifoundry.com/pikto/index.html Never hurts to try again. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 9 17:44:47 2017 From: unicode at unicode.org (Mats Blakstad via Unicode) Date: Wed, 10 May 2017 00:44:47 +0200 Subject: Human Rights translations Message-ID: Hi Who is at the moment organizing the human rights translations in Unicode? How can we submit new translations? Best regards Mats Blakstad -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 10 09:30:30 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 10 May 2017 07:30:30 -0700 Subject: Human Rights translations Message-ID: <20170510073030.665a7a7059d7ee80bb4d670165c8327d.7d6c67158e.wbe@email03.godaddy.com> Mats Blakstad wrote: > Who is at the moment organizing the human rights translations in > Unicode? How can we submit new translations? http://www.unicode.org/udhr/contributing.html -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 10 12:22:53 2017 From: unicode at unicode.org (Michael Bear via Unicode) Date: Wed, 10 May 2017 17:22:53 +0000 Subject: Join me in protecting net neutrality Message-ID: The FCC and their new Chairman, Ajit Paij, have a plan to destroy net neutrality as we know it. It?s up to us to stop it. I just signed onto Mozilla?s campaign to demand strong net neutrality protections. 
You can show your support here: http://advocacy.mozilla.org/net-neutrality?sp_ref=302214293.352.180765.e.575605.2&source=email Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 14 01:04:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 14 May 2017 07:04:31 +0100 Subject: Fighting Spell-Checking by Renderers Message-ID: <20170514070431.0292e34b@JRWUBU2> One of the early problems encountered with Unicode was that there can be multiple ways of representing the same text. For many scripts, the solution was canonical equivalence - the multiple ways were declared to be equivalent, and anything that thought they had different meanings and should *therefore* be treated differently was non-compliant with the Unicode standard. Where canonical equivalence actually leads to the wrong conclusion a method was subsequently found to make sequences canonically inequivalent, U+034F COMBINING GRAPHEME JOINER (CGJ). It generally takes extra effort to insert this character. However, canonical equivalence hit a severe problem with two-part Indic vowels, and the use of non-zero canonical combining classes in Indic scripts is generally low. A similar issue might arise with graphically non-interacting subordinated consonants, especially when encoded as virama/coeng plus base consonant. One solution to this problem is for renderers to produce a strange rendering if characters appear in a non-standard order. However, character strings are not just rendered and compared for identity. They are also be transliterated, sorted into alphabetical order, and may be input to automatic speech generation systems with limited capabilities for resolving homographs. This may require some way of tagging an apparently incorrectly ordered string, analogous to the use of 'sic' in English, to indicate that the text is intended not to accord with the 'standard' character order. What characters are available for such a r?le? CGJ is a possibility, but I am concerned that it may be being overworked. It is already suggested as a solution for dealing with sorting when a digraph is treated as a letter, but accidental sequences are not, as in the Welsh letter 'ng' (which comes between 'g' and 'h' in the alphabet) as opposed to an 'accidental' sequence such as in 'Bangor' and 'Llangollen'. Such characters probably don't work now, but it may be possible to persuade the suppliers to heed them. The ideal character would be disallowed in domain names, which should allay the greatest security worries about simply rendering the text as it stands. Some potential ambiguities arise from Sanskrit, and were raised long ago by Peter Constable on the Unicode Indic list on 28 August 2006 under the heading 'contrastive /Crv/ and /Cvr/ in Telugu, Malayalam'. The cases he gave were 'grva' v. 'gvra', 'drva' v. 'dvra' and 'srva' v. 'svra'. For the Khmer script, the KhmerOS font renders the pairs identically, which did surprise me, as I had got it into my head that one could tell from the depth of the where the RO came in the sequence of conjoined letters. Richard. 
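A minimal Python sketch of the CGJ mechanism described above, using only the standard unicodedata module (the marks chosen are arbitrary illustrations, not the Indic cases under discussion): two orders of marks with distinct non-zero combining classes are canonically equivalent, while inserting U+034F COMBINING GRAPHEME JOINER (combining class 0) blocks canonical reordering and keeps the two orders canonically inequivalent.

    import unicodedata

    DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, ccc 220
    DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE, ccc 230
    CGJ       = "\u034f"   # COMBINING GRAPHEME JOINER, ccc 0

    # Without CGJ, canonical reordering sorts the marks by combining class,
    # so both orders normalize to the same string: canonically equivalent.
    a = "q" + DOT_BELOW + DOT_ABOVE
    b = "q" + DOT_ABOVE + DOT_BELOW
    print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True

    # With CGJ between the marks, reordering is blocked, so the two orders
    # stay distinct under normalization: canonically inequivalent.
    c = "q" + DOT_BELOW + CGJ + DOT_ABOVE
    d = "q" + DOT_ABOVE + CGJ + DOT_BELOW
    print(unicodedata.normalize("NFD", c) == unicodedata.normalize("NFD", d))  # False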
From unicode at unicode.org Mon May 15 05:21:45 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Mon, 15 May 2017 13:21:45 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political reason why the proposal is a bad idea. First, the technical reason: ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't representative of implementation concerns of implementations that use UTF-8 as their in-memory Unicode representation. Even though there are notable systems (Win32, Java, C#, JavaScript, ICU, etc.) that are stuck with UTF-16 as their in-memory representation, which makes concerns of such implementation very relevant, I think the Unicode Consortium should acknowledge that UTF-16 was, in retrospect, a mistake (since Unicode grew past 16 bits anyway making UTF-16 both variable-width *and* ASCII-incompatible--i.e. widening the the code units to be ASCII-incompatible didn't buy a constant-width encoding after all) and that when the legacy constraints of Win32, Java, C#, JavaScript, ICU, etc. don't force UTF-16 as the internal Unicode representation, using UTF-8 as the internal Unicode representation is the technically superior design: Using UTF-8 as the internal Unicode representation is memory-efficient and cache-efficient when dealing with data formats whose syntax is mostly ASCII (e.g. HTML), forces developers to handle variable-width issues right away, makes input decode a matter of mere validation without copy when the input is conforming and makes output encode infinitely fast (no encode step needed). Therefore, despite UTF-16 being widely used as an in-memory representation of Unicode and in no way going away, I think the Unicode Consortium should be *very* sympathetic to technical considerations for implementations that use UTF-8 as the in-memory representation of Unicode. When looking this issue from the ICU perspective of using UTF-16 as the in-memory representation of Unicode, it's easy to consider the proposed change as the easier thing for implementation (after all, no change for the ICU implementation is involved!). However, when UTF-8 is the in-memory representation of Unicode and "decoding" UTF-8 input is a matter of *validating* UTF-8, a state machine that rejects a sequence as soon as it's impossible for the sequence to be valid UTF-8 (under the definition that excludes surrogate code points and code points beyond U+10FFFF) makes a whole lot of sense. If the proposed change was adopted, while Draconian decoders (that fail upon first error) could retain their current state machine, implementations that emit U+FFFD for errors and continue would have to add more state machine states (i.e. more complexity) to consolidate more input bytes into a single U+FFFD even after a valid sequence is obviously impossible. When the decision can easily go either way for implementations that use UTF-16 internally but the options are not equal when using UTF-8 internally, the "UTF-8 internally" case should be decisive. (Especially when spec-wise that decision involves no change. I further note the proposal PDF argues on the level of "feels right" without even discussing the impact on implementations that use UTF-8 internally.) 
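As a rough sketch of that point (in Python, and not the code of ICU, encoding_rs, or any browser), a replacement-emitting decoder built directly on the fail-fast well-formedness constraints (Table 3-7 of the standard) naturally yields the currently recommended behavior: it gives up on a sequence at the first byte that makes a valid completion impossible and emits one U+FFFD for the bytes accepted so far. Collapsing, say, all of the overlong sequence F0 80 80 80 into a single U+FFFD, as the proposal prefers, needs extra states on top of this.

    def decode_utf8_with_replacement(data: bytes) -> str:
        # Allowed range for the byte following each lead byte (per the UTF-8
        # well-formedness table); any later trail bytes must be 0x80..0xBF.
        def lead_spec(b):
            if 0xC2 <= b <= 0xDF: return (0x80, 0xBF, 1)
            if b == 0xE0:         return (0xA0, 0xBF, 2)
            if 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF: return (0x80, 0xBF, 2)
            if b == 0xED:         return (0x80, 0x9F, 2)
            if b == 0xF0:         return (0x90, 0xBF, 3)
            if 0xF1 <= b <= 0xF3: return (0x80, 0xBF, 3)
            if b == 0xF4:         return (0x80, 0x8F, 3)
            return None           # cannot be a lead byte of any valid sequence
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                out.append(chr(b)); i += 1; continue
            spec = lead_spec(b)
            if spec is None:                      # bogus byte: one U+FFFD, move on
                out.append("\uFFFD"); i += 1; continue
            lo, hi, ntrail = spec
            j, ok = i + 1, True
            for k in range(ntrail):
                good = j < len(data) and ((lo <= data[j] <= hi) if k == 0
                                          else (0x80 <= data[j] <= 0xBF))
                if not good:                      # fail fast: no completion can be valid
                    ok = False; break
                j += 1
            if ok:                                # well-formed: assemble the scalar value
                cp = b & (0xFF >> (ntrail + 2))
                for k in range(i + 1, j):
                    cp = (cp << 6) | (data[k] & 0x3F)
                out.append(chr(cp))
            else:                                 # one U+FFFD per maximal subsequence
                out.append("\uFFFD")
            i = j
        return "".join(out)

    # The 61 F1 80 80 E1 80 C2 62 example: a + three U+FFFDs + b.
    print(decode_utf8_with_replacement(b"\x61\xf1\x80\x80\xe1\x80\xc2\x62"))
    # Overlong F0 80 80 80: four U+FFFDs here; one under the proposed change.
    print(decode_utf8_with_replacement(b"\xf0\x80\x80\x80").count("\uFFFD"))   # 4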
As a matter of implementation experience, the implementation I've written (https://github.com/hsivonen/encoding_rs) supports both the UTF-16 as the in-memory Unicode representation and the UTF-8 as the in-memory Unicode representation scenarios, and the fail-fast requirement wasn't onerous in the UTF-16 as the in-memory representation scenario. Second, the political reason: Now that ICU is a Unicode Consortium project, I think the Unicode Consortium should be particular sensitive to biases arising from being both the source of the spec and the source of a popular implementation. It looks *really bad* both in terms of equal footing of ICU vs. other implementations for the purpose of how the standard is developed as well as the reliability of the standard text vs. ICU source code as the source of truth that other implementors need to pay attention to if the way the Unicode Consortium resolves a discrepancy between ICU behavior and a well-known spec provision (this isn't some ill-known corner case, after all) is by changing the spec instead of changing ICU *especially* when the change is not neutral for implementations that have made different but completely valid per then-existing spec and, in the absence of legacy constraints, superior architectural choices compared to ICU (i.e. UTF-8 internally instead of UTF-16 internally). I can see the irony of this viewpoint coming from a WHATWG-aligned browser developer, but I note that even browsers that use ICU for legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior isn't, in fact, the dominant browser UTF-8 behavior. That is, even Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the environment that's the most sensitive to how issues like this are handled, so it would be appropriate for the proposal to survey current browser behavior instead of just saying that ICU "feels right" or is "natural". -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Mon May 15 09:57:00 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 15 May 2017 15:57:00 +0100 (BST) Subject: Are Emoji ZWJ sequences characters? In-Reply-To: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> References: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> Message-ID: <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> I am concerned about emoji ZWJ sequences being encoded without going through the ISO process and whether Unicode will therefore lose synchronization with ISO/IEC 10646. I have raised this by email and a very helpful person has advised me that encoding emoji sequences does not mean that Unicode and ISO/IEC 10646 go out of being synchronized because ZWJ sequences are not *characters*, and they have no implications for ISO/IEC 10646, noting that ISO/IEC 10646 does not define ZWJ sequences. Now I have great respect for the person who advised me. However I am a researcher and I opine that I need evidence. Thus I am writing to the mailing list in the hope that there will be a discussion please. http://www.unicode.org/reports/tr51/tr51-11.html (A proposed update document) http://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt http://www.unicode.org/charts/PDF/U1F300.pdf http://www.unicode.org/charts/PDF/U1F680.pdf In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. 
end quote In emoji-zwj-sequences.txt there is the following line. 1F468 200D 1F680 ; Emoji_ZWJ_Sequence ; man astronaut >From U1F300.pdf, 1F468 is MAN 200D is ZWJ >From U1F680.pdf 1F680 is ROCKET The reasoning upon which I base my concern is as follows. 0063 is c 0070 is p 0074 is t If 0063 200D 0074 is used to specifically request a ct ligature in a display of some text, then the meaning of 0063 200D 0074 is the same as the meaning of 0063 0074 and indeed a font with an OpenType table could cause a ct ligature to be displayed even if the sequence is 0063 0074 rather than the sequence 0063 200D 0074 that is used where the ligature glyph is specifically requested. Thus the meaning of ct is not changed by using the ZWJ character. Now the use of the ct ligature is well-known and frequent. Suppose now that a fontmaker is making a font of his or her own and decides to include a glyph for a pp ligature, with a swash flourish joining and going beyond the lower ends of the descenders both to the left and to the right. The fontmaker could note that the ligature might be good in a word like copper but might look wrong in a word like happy due to the tail on the letter y clashing with the rightward side of the swash flourish. So the fontmaker encodes 0070 200D 0070 as a pp ligature but does not encode 0070 0070 as a pp ligature, so that the ligature glyph is only used when specifically requested using a ZWJ character. However, when the ZWJ character is used, the meaning of the pp sequence is not changed from the meaning when the pp sequence is not used. Yet when 1F468 200D 1F680 is used, the meaning of the sequence is different from the meaning of the sequence 1F468 1F680 such that the meaning of 1F468 200D 1F680 is listed in a file available from the Unicode website. >From where does the astronaut's spacesuit and helmet come? I am reminded that in chemistry if one mixes two chemicals, sometimes one just gets a mixture of two chemicals and sometimes one gets a chemical reaction such that another chemical is produced. Repeating the quote from earlier in this post. In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. end quote I am concerned that in the future a user of ISO/IEC 10646 will not be able to find from ISO/IEC 10646 the meaning of an emoji that he or she observes being displayed, even if he or she is able to discover what is the sequence of characters being used. So I ask that this matter be discussed please. William Overington Monday 15 May 2017 From unicode at unicode.org Mon May 15 10:37:13 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Mon, 15 May 2017 16:37:13 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: > > In reference to: > http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf > > I think Unicode should not adopt the proposed change. Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. > ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't > representative of implementation concerns of implementations that use > UTF-8 as their in-memory Unicode representation. > > Even though there are notable systems (Win32, Java, C#, JavaScript, > ICU, etc.) 
that are stuck with UTF-16 as their in-memory > representation, which makes concerns of such implementation very > relevant, I think the Unicode Consortium should acknowledge that > UTF-16 was, in retrospect, a mistake You may think that. There are those of us who do not. The fact is that UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway. > Therefore, despite UTF-16 being widely used as an in-memory > representation of Unicode and in no way going away, I think the > Unicode Consortium should be *very* sympathetic to technical > considerations for implementations that use UTF-8 as the in-memory > representation of Unicode. I don?t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don?t see what that has to do with either the original proposal or with your criticism of UTF-16. [snip] > If the proposed > change was adopted, while Draconian decoders (that fail upon first > error) could retain their current state machine, implementations that > emit U+FFFD for errors and continue would have to add more state > machine states (i.e. more complexity) to consolidate more input bytes > into a single U+FFFD even after a valid sequence is obviously > impossible. ?Impossible?? Why? You just need to add some error states (or *an* error state and a counter); it isn?t exactly difficult, and I?m sure ICU isn?t the only library that already did just that *because it?s clearly the right thing to do*. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Mon May 15 11:14:23 2017 From: unicode at unicode.org (Peter Constable via Unicode) Date: Mon, 15 May 2017 16:14:23 +0000 Subject: Are Emoji ZWJ sequences characters? In-Reply-To: <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> References: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> Message-ID: Emoji sequences are not _encoded_, per se, in either Unicode or ISO/IEC 10646. The act of "encoding" in either of these coding standards is to assign an encoded representation in the encoding method of the standards for a given entity. In this case, that means to assign a code point. Specifying ZWJ sequences for representation of text elements is not encoding in the standard; it is simply defining an encoded representation for those text elements. Unicode gives some attention to this kind of thing, but ISO/IEC 10646, not so much. For instance, you won't find anything in ISO/IEC 10646 specifying that the encoded representation for a rakaar is < VIRAMA, RA >. So, your helpful person was, indeed, helpful, giving you correct information: ZWJ sequences are not _characters_ and have no implications for ISO/IEC 10646. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington via Unicode Sent: Monday, May 15, 2017 7:57 AM To: unicode at unicode.org Subject: Are Emoji ZWJ sequences characters? I am concerned about emoji ZWJ sequences being encoded without going through the ISO process and whether Unicode will therefore lose synchronization with ISO/IEC 10646. 
I have raised this by email and a very helpful person has advised me that encoding emoji sequences does not mean that Unicode and ISO/IEC 10646 go out of being synchronized because ZWJ sequences are not *characters*, and they have no implications for ISO/IEC 10646, noting that ISO/IEC 10646 does not define ZWJ sequences. Now I have great respect for the person who advised me. However I am a researcher and I opine that I need evidence. Thus I am writing to the mailing list in the hope that there will be a discussion please. http://www.unicode.org/reports/tr51/tr51-11.html (A proposed update document) http://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt http://www.unicode.org/charts/PDF/U1F300.pdf http://www.unicode.org/charts/PDF/U1F680.pdf In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. end quote In emoji-zwj-sequences.txt there is the following line. 1F468 200D 1F680 ; Emoji_ZWJ_Sequence ; man astronaut From U1F300.pdf, 1F468 is MAN 200D is ZWJ From U1F680.pdf 1F680 is ROCKET The reasoning upon which I base my concern is as follows. 0063 is c 0070 is p 0074 is t If 0063 200D 0074 is used to specifically request a ct ligature in a display of some text, then the meaning of 0063 200D 0074 is the same as the meaning of 0063 0074 and indeed a font with an OpenType table could cause a ct ligature to be displayed even if the sequence is 0063 0074 rather than the sequence 0063 200D 0074 that is used where the ligature glyph is specifically requested. Thus the meaning of ct is not changed by using the ZWJ character. Now the use of the ct ligature is well-known and frequent. Suppose now that a fontmaker is making a font of his or her own and decides to include a glyph for a pp ligature, with a swash flourish joining and going beyond the lower ends of the descenders both to the left and to the right. The fontmaker could note that the ligature might be good in a word like copper but might look wrong in a word like happy due to the tail on the letter y clashing with the rightward side of the swash flourish. So the fontmaker encodes 0070 200D 0070 as a pp ligature but does not encode 0070 0070 as a pp ligature, so that the ligature glyph is only used when specifically requested using a ZWJ character.
However, when the ZWJ character is used, the meaning of the pp sequence is not changed from the meaning when the pp sequence is not used. Yet when 1F468 200D 1F680 is used, the meaning of the sequence is different from the meaning of the sequence 1F468 1F680 such that the meaning of 1F468 200D 1F680 is listed in a file available from the Unicode website. >From where does the astronaut's spacesuit and helmet come? I am reminded that in chemistry if one mixes two chemicals, sometimes one just gets a mixture of two chemicals and sometimes one gets a chemical reaction such that another chemical is produced. Repeating the quote from earlier in this post. In tr51-11.html at 2.3 Emoji ZWJ Sequences quote To the user of such a system, these behave like single emoji characters, even though internally they are sequences. end quote I am concerned that in the future a user of ISO/IEC 10646 will not be able to find from ISO/IEC 10646 the meaning of an emoji that he or she observes being displayed, even if he or she is able to discover what is the sequence of characters being used. So I ask that this matter be discussed please. William Overington Monday 15 May 2017 From unicode at unicode.org Mon May 15 12:43:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 15 May 2017 18:43:53 +0100 Subject: Are Emoji ZWJ sequences characters? In-Reply-To: References: <30186454.40678.1494856125964.JavaMail.root@webmail14.bt.ext.cpcloud.co.uk> <17598869.46886.1494860220668.JavaMail.defaultUser@defaultHost> Message-ID: <20170515184353.47e68b81@JRWUBU2> On Mon, 15 May 2017 16:14:23 +0000 Peter Constable via Unicode wrote: > So, your helpful person was, indeed, helpful, giving you correct > information: ZWJ sequences are not _characters_ and have no > implications for ISO/IEC 10646. Except in so far as the claimed ligature changes the meaning of the ligated elements. For example, using <'a', ZWJ, 'e'> for an a-umlaut that was clearly not a-diaeresis would probably be on the edge of what is permissible. Returning to the example, shouldn't 1F468 200D 1F680 mean 'male rocket maker'? Richard. From unicode at unicode.org Mon May 15 12:52:25 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 15 May 2017 10:52:25 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote: > On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt the proposed change. > Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. Changing a specification as fundamental as this is something that should not be undertaken lightly. Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats. Even if it were true that all data is only stored in UTF-8, any data converted from UFT-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used. Implementations working in UTF-8 natively would potentially see three formats: 1) the original ill-formed data 2) data converted with single FFFD 3) data converted with multiple FFFD These forms cannot be compared for equality by binary matching. 
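For instance (a hypothetical ill-formed input, with byte values chosen purely for illustration), the same original data can end up in two incompatible converted forms, and only a comparison that folds runs of U+FFFD can relate them:

    import re

    raw  = b"a\xc0\xafb"       # form (1): original ill-formed bytes (overlong "/")
    one  = "a\ufffdb"          # form (2): a converter that emits a single U+FFFD for C0 AF
    many = "a\ufffd\ufffdb"    # form (3): a converter that emits one U+FFFD per byte

    print(one == many)                                   # False
    print(one.encode("utf-8") == many.encode("utf-8"))   # False: binary match fails

    # A search-style comparison has to fold runs of U+FFFD before comparing.
    fold = lambda s: re.sub("\ufffd+", "\ufffd", s)
    print(fold(one) == fold(many))                       # True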
The best that can be done is to convert (1) into one of the other forms and then compare treating any run of FFFD code points as equal to any other run, irrespective of length. (For security-critical applications, the presence of any FFFD should render the data invalid, so the comparisons we'd be talking about here would be for general purpose, like search). Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late. A./ > >> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't >> representative of implementation concerns of implementations that use >> UTF-8 as their in-memory Unicode representation. >> >> Even though there are notable systems (Win32, Java, C#, JavaScript, >> ICU, etc.) that are stuck with UTF-16 as their in-memory >> representation, which makes concerns of such implementation very >> relevant, I think the Unicode Consortium should acknowledge that >> UTF-16 was, in retrospect, a mistake > You may think that. There are those of us who do not. The fact is that UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway. > >> Therefore, despite UTF-16 being widely used as an in-memory >> representation of Unicode and in no way going away, I think the >> Unicode Consortium should be *very* sympathetic to technical >> considerations for implementations that use UTF-8 as the in-memory >> representation of Unicode. > I don?t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don?t see what that has to do with either the original proposal or with your criticism of UTF-16. > > [snip] > >> If the proposed >> change was adopted, while Draconian decoders (that fail upon first >> error) could retain their current state machine, implementations that >> emit U+FFFD for errors and continue would have to add more state >> machine states (i.e. more complexity) to consolidate more input bytes >> into a single U+FFFD even after a valid sequence is obviously >> impossible. > ?Impossible?? Why? You just need to add some error states (or *an* error state and a counter); it isn?t exactly difficult, and I?m sure ICU isn?t the only library that already did just that *because it?s clearly the right thing to do*. > > Kind regards, > > Alastair. > > -- > http://alastairs-place.net > > > From unicode at unicode.org Mon May 15 12:54:23 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 15 May 2017 10:54:23 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <37d24cde-96c3-5732-726f-79293d561b4e@ix.netcom.com> On 5/15/2017 3:21 AM, Henri Sivonen via Unicode wrote: > Second, the political reason: > > Now that ICU is a Unicode Consortium project, I think the Unicode > Consortium should be particular sensitive to biases arising from being > both the source of the spec and the source of a popular > implementation. It looks*really bad* both in terms of equal footing > of ICU vs. 
other implementations for the purpose of how the standard > is developed as well as the reliability of the standard text vs. ICU > source code as the source of truth that other implementors need to pay > attention to if the way the Unicode Consortium resolves a discrepancy > between ICU behavior and a well-known spec provision (this isn't some > ill-known corner case, after all) is by changing the spec instead of > changing ICU*especially* when the change is not neutral for > implementations that have made different but completely valid per > then-existing spec and, in the absence of legacy constraints, superior > architectural choices compared to ICU (i.e. UTF-8 internally instead > of UTF-16 internally). > > I can see the irony of this viewpoint coming from a WHATWG-aligned > browser developer, but I note that even browsers that use ICU for > legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior > isn't, in fact, the dominant browser UTF-8 behavior. That is, even > Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the > environment that's the most sensitive to how issues like this are > handled, so it would be appropriate for the proposal to survey current > browser behavior instead of just saying that ICU "feels right" or is > "natural". I think this political reason should be taken very seriously. There are already too many instances where ICU can be seen "driving" the development of property and algorithms. Those involved in the ICU project may not see the problem, but I agree with Henri that it requires a bit more sensitivity from the UTC. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 13:02:34 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Mon, 15 May 2017 19:02:34 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 15 May 2017, at 18:52, Asmus Freytag wrote: > > On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote: >> On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: >>> In reference to: >>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >>> >>> I think Unicode should not adopt the proposed change. >> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. > > Changing a specification as fundamental as this is something that should not be undertaken lightly. Agreed. > Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats. > > Even if it were true that all data is only stored in UTF-8, any data converted from UFT-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used. > > Implementations working in UTF-8 natively would potentially see three formats: > 1) the original ill-formed data > 2) data converted with single FFFD > 3) data converted with multiple FFFD > > These forms cannot be compared for equality by binary matching. But that was always true, if you were under the impression that only one of (2) and (3) existed, and indeed claiming equality between two instances of U+FFFD might be problematic itself in some circumstances (you don?t know why the U+FFFDs were inserted - they may not replace the same original data). 
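For instance (arbitrary byte values, purely for illustration), two different ill-formed inputs become indistinguishable once replaced, so equality of the U+FFFD-bearing strings says nothing about the original data:

    x = b"a\xffb".decode("utf-8", errors="replace")   # 0xFF never occurs in UTF-8
    y = b"a\xfeb".decode("utf-8", errors="replace")   # neither does 0xFE
    print(x == y)   # True: both are 'a\ufffdb', yet the original bytes differed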
> The best that can be done is to convert (1) into one of the other forms and then compare treating any run of FFFD code points as equal to any other run, irrespective of length. It?s probably safer, actually, to refuse to compare U+FFFD as equal to anything (even itself) unless a special flag is passed. For ?general purpose? applications, you could set that flag and then a single U+FFFD would compare equal to another single U+FFFD; no need for the complicated ?any string of U+FFFD? logic (which in any case makes little sense - it could just as easily generate erroneous comparisons as fix the case we?re worrying about here). > Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late. I don?t think so. Even if we acknowledge the possibility of data in the other form, I think it?s useful guidance to implementers, both now and in the future. One might even imagine that the other, non-favoured form, would eventually fall out of use. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Mon May 15 13:33:18 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Mon, 15 May 2017 21:33:18 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton wrote: > On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt the proposed change. > > Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. The currently-specced behavior makes perfect sense when you add error emission on top of a fail-fast UTF-8 validation state machine. >> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't >> representative of implementation concerns of implementations that use >> UTF-8 as their in-memory Unicode representation. >> >> Even though there are notable systems (Win32, Java, C#, JavaScript, >> ICU, etc.) that are stuck with UTF-16 as their in-memory >> representation, which makes concerns of such implementation very >> relevant, I think the Unicode Consortium should acknowledge that >> UTF-16 was, in retrospect, a mistake > > You may think that. There are those of us who do not. My point is: The proposal seems to arise from the "UTF-16 as the in-memory representation" mindset. While I don't expect that case in any way to go away, I think the Unicode Consortium should recognize the serious technical merit of the "UTF-8 as the in-memory representation" case as having significant enough merit that proposals like this should consider impact to both cases equally despite "UTF-8 as the in-memory representation" case at present appearing to be the minority case. That is, I think it's wrong to view things only or even primarily through the lens of the "UTF-16 as the in-memory representation" case that ICU represents. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Mon May 15 15:05:55 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Mon, 15 May 2017 20:05:55 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: >> Disagree. 
An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense. > > Changing a specification as fundamental as this is something that should not be undertaken lightly. IMO, the only think that can be agreed upon is that "something's bad with this UTF-8 data". I think that whether it's treated as a single group of corrupt bytes or each individual byte is considered a problem should be up to the implementation. #1 - This data should "never happen". In a system behaving normally, this condition should never be encountered. * At this point the data is "bad" and all bets are off. * Some applications may have a clue how the bad data could have happened and want to do something in particular. * It seems odd to me to spend much effort standardizing a scenario that should be impossible. #2 - Depending on implementation, either behavior, or some combination, may be more efficient. I'd rather allow apps to optimize for the common case, not the case-that-shouldn't-ever-happen #3 - We have no clue if this "maximal" sequence was a single error, 2 errors, or even more. The lead byte says how many trail bytes should follow, and those should be in a certain range. Values outside of those conditions are illegal, so we shouldn't ever encounter them. So if we did, then something really weird happened. * Did a single character get misencoded? * Was an illegal sequence illegally encoded? * Perhaps a byte got corrupted in transmission? * Maybe we dropped a packet/block, so this is really the beginning of a valid sequence and the tail of another completely valid sequence? In practice, all that most apps would be able to do would be to say "You have bad data, how bad I have no clue, but it's not right". A single bit could've flipped, or you could have only 3 pages of a 4000 page document. No clue at all. At that point it doesn't really matter how many FFFD's the error(s) are replaced with, and no assumptions should be made about the severity of the error. -Shawn From unicode at unicode.org Mon May 15 15:49:05 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 15 May 2017 13:49:05 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote: > >>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't >>> representative of implementation concerns of implementations that use >>> UTF-8 as their in-memory Unicode representation. >>> >>> Even though there are notable systems (Win32, Java, C#, JavaScript, >>> ICU, etc.) that are stuck with UTF-16 as their in-memory >>> representation, which makes concerns of such implementation very >>> relevant, I think the Unicode Consortium should acknowledge that >>> UTF-16 was, in retrospect, a mistake >> You may think that. There are those of us who do not. > My point is: > The proposal seems to arise from the "UTF-16 as the in-memory > representation" mindset. While I don't expect that case in any way to > go away, I think the Unicode Consortium should recognize the serious > technical merit of the "UTF-8 as the in-memory representation" case as > having significant enough merit that proposals like this should > consider impact to both cases equally despite "UTF-8 as the in-memory > representation" case at present appearing to be the minority case. 
> That is, I think it's wrong to view things only or even primarily > through the lens of the "UTF-16 as the in-memory representation" case > that ICU represents. > UTF-16 has some nice properties and there's not need to brand it a "mistake". UTF-8 has different nice properties, but there's equally not reason to treat it as more special than UTF-16. The UTC should adopt a position of perfect neutrality when it comes to assuming in-memory representation, in other words, not make assumptions that optimizing for any encoding form will benefit implementers. UTC, where ICU is strongly represented, needs to guard against basing encoding/properties/algorithm decisions (edge cases mostly), solely or primarily on the needs of a particular implementation that happens to be chosen by the ICU project. A./ From unicode at unicode.org Mon May 15 16:38:26 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 15 May 2017 21:38:26 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode < unicode at unicode.org> wrote: > Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the > case for other situations UTF-8 is clearly more efficient space-wise that includes more ASCII characters than characters between U+0800 and U+FFFF. Given the prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic will pretty much always be smaller in UTF-8. Even for scripts that go from 2 bytes to 3, webpages can get much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a factor of 1.8). The max change in reverse is 1.5, as two bytes goes to three. > and the fact is that handling surrogates (which is what proponents of > UTF-8 or UCS-4 usually focus on) is no more complicated than handling > combining characters, which you have to do anyway. > Not necessarily; you can legally process Unicode text without worrying about combining characters, whereas you cannot process UTF-16 without handling surrogates. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 17:16:32 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Mon, 15 May 2017 22:16:32 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: I?m not sure how the discussion of ?which is better? relates to the discussion of ill-formed UTF-8 at all. And to the last, saying ?you cannot process UTF-16 without handling surrogates? seems to me to be the equivalent of saying ?you cannot process UTF-8 without handling lead & trail bytes?. That?s how the respective encodings work. One could look at it and think ?there are 128 unicode characters that have the same value in UTF-8 as UTF-32,? and ?there are xx thousand unicode characters that have the same value in UTF-16 and UTF-32.? 
-Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of David Starner via Unicode Sent: Monday, May 15, 2017 2:38 PM To: unicode at unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode > wrote: Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations UTF-8 is clearly more efficient space-wise that includes more ASCII characters than characters between U+0800 and U+FFFF. Given the prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic will pretty much always be smaller in UTF-8. Even for scripts that go from 2 bytes to 3, webpages can get much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a factor of 1.8). The max change in reverse is 1.5, as two bytes goes to three. and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway. Not necessarily; you can legally process Unicode text without worrying about combining characters, whereas you cannot process UTF-16 without handling surrogates. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 17:43:29 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 15 May 2017 23:43:29 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <20170515234329.10745518@JRWUBU2> On Mon, 15 May 2017 21:38:26 +0000 David Starner via Unicode wrote: > > and the fact is that handling surrogates (which is what proponents > > of UTF-8 or UCS-4 usually focus on) is no more complicated than > > handling combining characters, which you have to do anyway. > Not necessarily; you can legally process Unicode text without worrying > about combining characters, whereas you cannot process UTF-16 without > handling surrogates. The problem with surrogates is inadequate testing. They're sufficiently rare for many users that it may be a long time before an error is discovered. It's not always obvious that code is designed for UCS-2 rather than UTF-16. Richard. From unicode at unicode.org Mon May 15 17:53:13 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 00:53:13 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <37d24cde-96c3-5732-726f-79293d561b4e@ix.netcom.com> References: <37d24cde-96c3-5732-726f-79293d561b4e@ix.netcom.com> Message-ID: 2017-05-15 19:54 GMT+02:00 Asmus Freytag via Unicode : > I think this political reason should be taken very seriously. There are > already too many instances where ICU can be seen "driving" the development > of property and algorithms. > > Those involved in the ICU project may not see the problem, but I agree > with Henri that it requires a bit more sensitivity from the UTC. > I don't think that the fact that ICU was originately using UTF-16 internally has ANY effect on the decision to represent ill-formed sequences as single or multiple U+FFFD. The internal encoding has nothing in common with the external encoding used when processing input data (which may be UTf-8, UTF-16, UTF-32, and could in all case present ill-formed sequences). 
That internal encoding here will paly no role in how to convert the ill-formed input, or if it will be converted. So yes, independantly of the internal encoding, we'll still ahve to choose between: - not converting the input and return an error or throw an exception - converting the input using a single U+FFFD (in its internal representation, this does not matter) to replace the complete sequence of ill-formed code units in the input data, and preferably return an error status - converting the input using as many U+FFFD (in its internal representation, this does not matter) to replace every ocurence of ill-formed code units in the input data, and preferably return an error status. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 18:20:40 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 01:20:40 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170515234329.10745518@JRWUBU2> References: <20170515234329.10745518@JRWUBU2> Message-ID: Softwares designed with only UCS-2 and not real UTF-16 support are still used today For example MySQL with its broken "UTF-8" encoding which in fact encodes supplementary characters as two separate 16-bit code-units for surrogates, each one blindly encoded as 3-byte sequences which would be ill-formed in standard UTF-8, buit that also does not differentiate invalid pairs of surrogates, and offers no collation support for supplementary characters. In this case some other softwares will break silently on these sequences (for example Mediawiki when installed with a MySQL backend server whose datastore was created with its broken "UTF-8", will silently discard any text starting at the first supplementary character found in the wikitext. This is not a problem of Mediawiki but the fact the MediaWiki does NOT support such MySQL server isntalled with its "UTF-8" datastore, but only supports MySQL if the storage encoding declared for the database was "binary" (but in that case there's no support of collation in MySQL, texts are just containing any random sequences of bytes and internationalization is then made in the client software, here Mediawiki and its PHP, ICU, or Lua libraries, and other tools written in Perl and other languages) Note that this does not affect Wikimedia in its wikis because they were initially installed corectly with the binary encoding in MySQL, but now Wikimedia wikis use another database engine with native UTF-8 support and full coverage of the UCS. Other wikis using Mediawiki will need to upgrade their MySQL version if they want to keep it for adminsitrative reasons (and not convert again their datastore to the binary encoding). Softwares running with only UCS-2 are exposed to such risks similar to the one seen in MediaWiki on incorrect MySQL installations, where any user may edit a page to insert any supplementary character (supplementary sinograms, emojis, Gothic letters, supplementary symbols...) which will look correct when previewing, and correct when it is parsed, accepted silently by MySQL, but then silently truncated because of the encoding error: when reloading the data from MySQL, there will effectively be unexpectedly discarded data. How to react to the risks of data losses or truncation ? 
Throwing an exception or just returning an error is in fact more dangerous than just replacing the ill-formed sequences by one or more U+FFFD: we preserve as much as possible, but anyway softwares should be able to perform some tests in their datastore to see if they correctly handle the encoding: this could be done when starting the sofware and emitting log messages when the backend do not support the encoding: all that is needed is to send a single supplementary character to the remote datastore in a junk table or field and then retrieve it immediately in another transaction to make sure it is preserved. Similar tests can be done to see if the remote datastore also preserves the encoding form or "normalizes it, or alters it (this alteration could happen with a leading BOM and some other silent alterations could be made on NULL and trailing spaces if the datastore does not use text fields with varying length but fixed length instead). Similar tests could be done to check the maximum length accepted (a VARCHAR(256) on a binary-encoded database will not always store 256 Unciode characters, but in a database encoded with non borken UTF-8, it should store 256 codepoints independantly of theior values, even if their UTF-8 encoding would be up to 1024 bytes. 2017-05-16 0:43 GMT+02:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Mon, 15 May 2017 21:38:26 +0000 > David Starner via Unicode wrote: > > > > and the fact is that handling surrogates (which is what proponents > > > of UTF-8 or UCS-4 usually focus on) is no more complicated than > > > handling combining characters, which you have to do anyway. > > > Not necessarily; you can legally process Unicode text without worrying > > about combining characters, whereas you cannot process UTF-16 without > > handling surrogates. > > The problem with surrogates is inadequate testing. They're sufficiently > rare for many users that it may be a long time before an error is > discovered. It's not always obvious that code is designed for UCS-2 > rather than UTF-16. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 15 22:23:06 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Mon, 15 May 2017 21:23:06 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: > In reference to: > http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf > > I think Unicode should not adopt the proposed change. > > The proposal is to make ICU's spec violation conforming. I think there > is both a technical and a political reason why the proposal is a bad > idea. Henri's claim that "The proposal is to make ICU's spec violation conforming" is a false statement, and hence all further commentary based on this false premise is irrelevant. I believe that ICU is actually currently conforming to TUS. The proposal reads: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8..." There is nothing in here that is requiring any implementation to be changed. The word "recommend" does not mean the same as "require". Have you guys been so caught up in the current international political situation that you have lost the ability to read straight? TUS has certain requirements for UTF-8 handling, and it has certain other "Best Practices" as detailed in 3.9. 
The proposal involves changing those recommendations. It does not involve changing any requirements. From unicode at unicode.org Tue May 16 01:50:54 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 09:50:54 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote: > I?m not sure how the discussion of ?which is better? relates to the > discussion of ill-formed UTF-8 at all. Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I meant on that point and then move on: I acknowledge that UTF-16 as the internal memory representation is the dominant design. However, because UTF-8 as the internal memory representation is *such a good design* (when legacy constraits permit) that *despite it not being the current dominant design*, I think the Unicode Consortium should be fully supportive of UTF-8 as the internal memory representation and not treat UTF-16 as the internal representation as the one true way of doing things that gets considered when speccing stuff. I.e. I wasn't arguing against UTF-16 as the internal memory representation (for the purposes of this thread) but trying to motivate why the Consortium should consider "UTF-8 internally" equally despite it not being the dominant design. So: When a decision could go either way from the "UTF-16 internally" perspective, but one way clearly makes more sense from the "UTF-8 internally" perspective, the "UTF-8 internally" perspective should be decisive in *such a case*. (I think the matter at hand is such a case.) At the very least a proposal should discuss the impact on the "UTF-8 internally" case, which the proposal at hand doesn't do. (Moving on to a different point.) The matter at hand isn't, however, a new green-field (in terms of implementations) issue to be decided but a proposed change to a standard that has many widely-deployed implementations. Even when observing only "UTF-16 internally" implementations, I think it would be appropriate for the proposal to include a review of what existing implementations, beyond ICU, do. Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right". -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 02:01:03 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 10:01:03 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Tue, May 16, 2017 at 6:23 AM, Karl Williamson wrote: > On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt the proposed change. >> >> The proposal is to make ICU's spec violation conforming. I think there >> is both a technical and a political reason why the proposal is a bad >> idea. 
> > > > Henri's claim that "The proposal is to make ICU's spec violation conforming" > is a false statement, and hence all further commentary based on this false > premise is irrelevant. > > I believe that ICU is actually currently conforming to TUS. Do you mean that ICU's behavior differs from what the PDF claims (I didn't test and took the assertion in the PDF about behavior at face value) or do you mean that despite deviating from the currently-recommended best practice the behavior is conforming, because the relevant part of the spec is mere best practice and not a requirement? > TUS has certain requirements for UTF-8 handling, and it has certain other > "Best Practices" as detailed in 3.9. The proposal involves changing those > recommendations. It does not involve changing any requirements. Even so, I think even changing a recommendation of "best practice" needs way better rationale than "feels right" or "ICU already does it" when a) major browsers (which operate in the most prominent environment of broken and hostile UTF-8) agree with the currently-recommended best practice and b) the currently-recommended best practice makes more sense for implementations where "UTF-8 decoding" is actually mere "UTF-8 validation". -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 02:13:45 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 08:13:45 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <6259C523-20B6-4B76-AB2B-33020E1C863C@alastairs-place.net> On 15 May 2017, at 23:16, Shawn Steele via Unicode wrote: > > I?m not sure how the discussion of ?which is better? relates to the discussion of ill-formed UTF-8 at all. It doesn?t, which is a point I made in my original reply to Henry. The only reason I answered his anti-UTF-16 rant at all was to point out that some of us don?t think UTF-16 is a mistake, and in fact can see various benefits (*particularly* as an in-memory representation). > And to the last, saying ?you cannot process UTF-16 without handling surrogates? seems to me to be the equivalent of saying ?you cannot process UTF-8 without handling lead & trail bytes?. That?s how the respective encodings work. Quite. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 02:22:53 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 16 May 2017 00:22:53 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote: > On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode > wrote: >> I?m not sure how the discussion of ?which is better? relates to the >> discussion of ill-formed UTF-8 at all. > Clearly, the "which is better" issue is distracting from the > underlying issue. I'll clarify what I meant on that point and then > move on: > > I acknowledge that UTF-16 as the internal memory representation is the > dominant design. 
However, because UTF-8 as the internal memory > representation is *such a good design* (when legacy constraits permit) > that *despite it not being the current dominant design*, I think the > Unicode Consortium should be fully supportive of UTF-8 as the internal > memory representation and not treat UTF-16 as the internal > representation as the one true way of doing things that gets > considered when speccing stuff. There are cases where it is prohibitive to transcode external data from UTF-8 to any other format, as a precondition to doing any work. In these situations processing has to be done in UTF-8, effectively making that the in-memory representation. I've encountered this issue on separate occasions, both for my own code as well as code I reviewed for clients. I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic. > ....At the very least a proposal should discuss the impact on the "UTF-8 > internally" case, which the proposal at hand doesn't do. This is a key point. It may not be directly relevant to any other modifications to the standard, but the larger point is to not make assumption about how people implement the standard (or any of the algorithms). > (Moving on to a different point.) > > The matter at hand isn't, however, a new green-field (in terms of > implementations) issue to be decided but a proposed change to a > standard that has many widely-deployed implementations. Even when > observing only "UTF-16 internally" implementations, I think it would > be appropriate for the proposal to include a review of what existing > implementations, beyond ICU, do. I would like to second this as well. The level of documented review of existing implementation practices tends to be thin (at least thinner than should be required for changing long-established edge cases or recommendations, let alone core conformance requirements). > > Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge and Chrome) shows agreement on the current spec: there > is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, > 6 on the second, 4 on the third and 6 on the last line). Changing the > Unicode standard away from that kind of interop needs *way* better > rationale than "feels right". It would be good if the UTC could work out some minimal requirements for evaluating proposals for changes to properties and algorithms, much like the criteria for encoding new code points A./ From unicode at unicode.org Tue May 16 02:23:14 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 10:23:14 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen wrote: > Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge and Chrome) shows agreement on the current spec: there > is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, > 6 on the second, 4 on the third and 6 on the last line). 
Changing the > Unicode standard away from that kind of interop needs *way* better > rationale than "feels right". Testing with that file, Python 3 and OpenJDK 8 agree with the currently-specced best-practice, too. I expect there to be other well-known implementations that comply with the currently-specced best practice, so the rationale to change the stated best practice would have to be very strong (as in: security problem with currently-stated best practice) for a change to be appropriate. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 02:26:33 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 08:26:33 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170515234329.10745518@JRWUBU2> References: <20170515234329.10745518@JRWUBU2> Message-ID: <752AC650-694E-45F7-8854-A0DE9A8D5A77@alastairs-place.net> On 15 May 2017, at 23:43, Richard Wordingham via Unicode wrote: > > The problem with surrogates is inadequate testing. They're sufficiently > rare for many users that it may be a long time before an error is > discovered. It's not always obvious that code is designed for UCS-2 > rather than UTF-16. While I don?t think we should spend too long debating the relative merits of UTF-8 versus UTF-16, I?ll note that that argument applies equally to both combining characters and indeed the underlying UTF-8 encoding in the first place, and that mistakes in handling both are not exactly uncommon. There are advantages to UTF-8 and advantages to UTF-16. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 02:42:46 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 08:42:46 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: On 16 May 2017, at 08:22, Asmus Freytag via Unicode wrote: > I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic. That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. (The only time I can think of that the in-memory representation has a significant effect is where you?re talking about default binary ordering of string data, in which case, in the presence of non-BMP characters, UTF-8 and UCS-4 sort the same way, but because the surrogates are ?in the wrong place?, UTF-16 doesn?t. I think everyone is well aware of that, no?) >> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick >> test with three major browsers that use UTF-16 internally and have >> independent (of each other) implementations of UTF-8 decoding >> (Firefox, Edge and Chrome) shows agreement on the current spec: there >> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, >> 6 on the second, 4 on the third and 6 on the last line). Changing the >> Unicode standard away from that kind of interop needs *way* better >> rationale than "feels right?. In what sense is this ?interop?? Under what circumstance would it matter how many U+FFFDs you see? 
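(For concreteness, the counts at issue can be reproduced with Python 3, which, as noted elsewhere in this thread, follows the currently recommended practice. This is a sketch; the byte sequences are the ones used in PR-121 and in the Ruby examples later in the thread, not the bytes from Henri's test page.)

# Counting U+FFFD replacements under the currently recommended
# "maximal subpart" practice; assumes a Python 3.3+ UTF-8 codec.
tests = [
    b"\xe0\x80\x80",                        # overlong encoding of U+0000
    b"\xf4\x90\x80\x80",                    # value above U+10FFFF
    b"\x41\xc0\xaf\x41\xf4\x80\x80\x41",    # example from PR-121
]
for raw in tests:
    decoded = raw.decode("utf-8", errors="replace")
    print(raw, "->", decoded.count("\ufffd"), "U+FFFD")
# Prints 3, 4 and 3 replacements respectively, matching the Ruby
# results quoted later in this thread.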
If you?re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents. Would you advocate replacing e0 80 80 with U+FFFD U+FFFD U+FFFD (1) rather than U+FFFD (2) It?s pretty clear what the intent of the encoder was there, I?d say, and while we certainly don?t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don?t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input. This isn?t just a matter of ?feels nicer?. (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 02:50:27 2017 From: unicode at unicode.org (J Decker via Unicode) Date: Tue, 16 May 2017 00:50:27 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode < unicode at unicode.org> wrote: > On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode > wrote: > > I?m not sure how the discussion of ?which is better? relates to the > > discussion of ill-formed UTF-8 at all. > > Clearly, the "which is better" issue is distracting from the > underlying issue. I'll clarify what I meant on that point and then > move on: > > I acknowledge that UTF-16 as the internal memory representation is the > dominant design. However, because UTF-8 as the internal memory > representation is *such a good design* (when legacy constraits permit) > that *despite it not being the current dominant design*, I think the > Unicode Consortium should be fully supportive of UTF-8 as the internal > memory representation and not treat UTF-16 as the internal > representation as the one true way of doing things that gets > considered when speccing stuff. > > I.e. I wasn't arguing against UTF-16 as the internal memory > representation (for the purposes of this thread) but trying to > motivate why the Consortium should consider "UTF-8 internally" equally > despite it not being the dominant design. > > So: When a decision could go either way from the "UTF-16 internally" > perspective, but one way clearly makes more sense from the "UTF-8 > internally" perspective, the "UTF-8 internally" perspective should be > decisive in *such a case*. (I think the matter at hand is such a > case.) > > At the very least a proposal should discuss the impact on the "UTF-8 > internally" case, which the proposal at hand doesn't do. > > (Moving on to a different point.) > > The matter at hand isn't, however, a new green-field (in terms of > implementations) issue to be decided but a proposed change to a > standard that has many widely-deployed implementations. Even when > observing only "UTF-16 internally" implementations, I think it would > be appropriate for the proposal to include a review of what existing > implementations, beyond ICU, do. > > Consider https://hsivonen.com/test/moz/broken-utf-8.html . 
A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge and Chrome) Something I've learned through working with Node (V8 javascript engine from chrome) V8 stores strings either as UTF-16 OR UTF-8 interchangably and is not one OR the other... https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY and I wouldn't really assume UTF-16 is a 'majority'; Go is utf-8 for instance. > shows agreement on the current spec: there > is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, > 6 on the second, 4 on the third and 6 on the last line). Changing the > Unicode standard away from that kind of interop needs *way* better > rationale than "feels right". > > -- > Henri Sivonen > hsivonen at hsivonen.fi > https://hsivonen.fi/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 03:00:13 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 09:00:13 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: Message-ID: <20170516090013.55793b87@JRWUBU2> On Tue, 16 May 2017 10:01:03 +0300 Henri Sivonen via Unicode wrote: > Even so, I think even changing a recommendation of "best practice" > needs way better rationale than "feels right" or "ICU already does it" > when a) major browsers (which operate in the most prominent > environment of broken and hostile UTF-8) agree with the > currently-recommended best practice and b) the currently-recommended > best practice makes more sense for implementations where "UTF-8 > decoding" is actually mere "UTF-8 validation". There was originally an attempt to prescribe rather than to recommend the interpretation of ill-formed 8-bit Unicode strings. It may even briefly have been an issued prescription, until common sense prevailed. I do remember a sinking feeling when I thought I would have to change my own handling of bogus UTF-8, only to be relieved later when it became mere best practice. However, it is not uncommon for coding standards to prescribe 'best practice'. Richard. From unicode at unicode.org Tue May 16 03:18:41 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 16 May 2017 08:18:41 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: On Tue, May 16, 2017 at 12:42 AM Alastair Houghton < alastair at alastairs-place.net> wrote: > If you?re about to mutter something about security, consider this: > security code *should* refuse to compare strings that contain U+FFFD (or at > least should never treat them as equal, even to themselves), because it has > no way to know what that code point represents. > Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers. 
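(A minimal sketch of both hazards, assuming Python 3; the byte strings are hypothetical names, not taken from any real system.)

# Two distinct ill-formed byte sequences can decode, with replacement,
# to the same Unicode string, so equality on the decoded form can no
# longer distinguish the underlying names.
a = b"\x80\x80".decode("utf-8", errors="replace")   # '\ufffd\ufffd'
b = b"\xe0\x80".decode("utf-8", errors="replace")   # also '\ufffd\ufffd'
print(a == b)            # True: different byte names collide after replacement

# And the IEEE 754 analogy: a value that never compares equal to itself.
nan = float("nan")
print(nan == nan)        # False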
> Would you advocate replacing > > e0 80 80 > > with > > U+FFFD U+FFFD U+FFFD (1) > > rather than > > U+FFFD (2) > > It?s pretty clear what the intent of the encoder was there, I?d say, and > while we certainly don?t want to decode it as a NUL (that was the source of > previous security bugs, as I recall), I also don?t see the logic in > insisting that it must be decoded to *three* code points when it clearly > only represented one in the input. > In this case, It's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data. 88 A0 8B D4 is UTF-16 Chinese, but I'm not going to insist that it get replaced with U+FFFD U+FFFD because it's clear (to me) it was meant as two characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 03:31:07 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 11:31:07 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote: > but I think the way he raises this point is needlessly antagonistic. I apologize. My level of dismay at the proposal's ICU-centricity overcame me. On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton wrote: > That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. If the internal representation is UTF-16 (or UTF-32), it is a likely design that there is a variable into which the scalar value of the current code point is accumulated during UTF-8 decoding. In such a scenario, it can be argued as "natural" to first operate according to the general structure of UTF-8 and then inspect what you got in the accumulation variable (ruling out non-shortest forms, values above the Unicode range and surrogate values after the fact). When the internal representation is UTF-8, only UTF-8 validation is needed, and it's natural to have a fail-fast validator, which *doesn't necessarily need such a scalar value accumulator at all*. The construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ when used as a UTF-8 validator is the best illustration of a UTF-8 validator not necessarily looking like a "natural" UTF-8 to UTF-16 converter at all. >>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick >>> test with three major browsers that use UTF-16 internally and have >>> independent (of each other) implementations of UTF-8 decoding >>> (Firefox, Edge and Chrome) shows agreement on the current spec: there >>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, >>> 6 on the second, 4 on the third and 6 on the last line). Changing the >>> Unicode standard away from that kind of interop needs *way* better >>> rationale than "feels right?. > > In what sense is this ?interop?? In the sense that prominent independent implementations do the same externally observable thing. > Under what circumstance would it matter how many U+FFFDs you see? Maybe it doesn't, but I don't think the burden of proof should be on the person advocating keeping the spec and major implementations as they are. 
If anything, I think those arguing for a change of the spec in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current spec should show why it's important to have a different number of U+FFFDs than the spec's "best practice" calls for now. > If you?re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents. In practice, e.g. the Web Platform doesn't allow for stopping operating on input that contains an U+FFFD, so the focus is mainly on making sure that U+FFFDs are placed well enough to prevent bad stuff under normal operations. At least typically, the number of U+FFFDs doesn't matter for that purpose, but when browsers agree on the number of U+FFFDs, changing that number should have an overwhelmingly strong rationale. A security reason could be a strong reason, but such a security motivation for fewer U+FFFDs has not been shown, to my knowledge. > Would you advocate replacing > > e0 80 80 > > with > > U+FFFD U+FFFD U+FFFD (1) > > rather than > > U+FFFD (2) I advocate (1), most simply because that's what Firefox, Edge and Chrome do *in accordance with the currently-recommended best practice* and, less simply, because it makes sense in the presence of a fail-fast UTF-8 validator. I think the burden of proof to show an overwhelmingly good reason to change should, at this point, be on whoever proposes doing it differently than what the current widely-implemented spec says. > It?s pretty clear what the intent of the encoder was there, I?d say, and while we certainly don?t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don?t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input. As noted previously, the logic is that you generate a U+FFFD whenever a fail-fast validator fails. > This isn?t just a matter of ?feels nicer?. (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation. Again, the current best practice makes perfect logical sense in the context of a fail-fast UTF-8 validator. Moreover, it doesn't look like both are "out there" equally when major browsers, OpenJDK and Python 3 agree. (I expect I could find more prominent implementations that implement the currently-stated best practice, but I feel I shouldn't have to.) From my experience from working on Web standards and implementing them, I think it's a bad idea to change something to be "more logical" when the change would move away from browser consensus. 
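(To make the "fail-fast validation" shape of implementation concrete, here is a minimal sketch in Python. It is not anyone's actual shipping code; it emits one U+FFFD per maximal subpart, as the currently recommended practice specifies, using only the well-formed byte ranges from the standard's UTF-8 table and no scalar-value accumulator.)

def decode_utf8_with_replacement(data: bytes) -> str:
    # One U+FFFD per maximal subpart of an ill-formed subsequence:
    # fail fast on the first byte that cannot continue a well-formed
    # sequence, then resume scanning at that very byte.
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                               # ASCII fast path
            out.append(chr(b0)); i += 1; continue
        # The allowed range of the second byte depends on the lead byte;
        # this is what rejects overlongs, surrogates and values above
        # U+10FFFF without accumulating a scalar value.
        if   0xC2 <= b0 <= 0xDF: length, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xE0:         length, lo, hi = 3, 0xA0, 0xBF
        elif 0xE1 <= b0 <= 0xEC: length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xED:         length, lo, hi = 3, 0x80, 0x9F
        elif 0xEE <= b0 <= 0xEF: length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF0:         length, lo, hi = 4, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3: length, lo, hi = 4, 0x80, 0xBF
        elif b0 == 0xF4:         length, lo, hi = 4, 0x80, 0x8F
        else:                                       # C0, C1, F5..FF, stray trail byte
            out.append("\ufffd"); i += 1; continue
        j, need = i + 1, length - 1
        while need and j < n:
            ok = (lo <= data[j] <= hi) if j == i + 1 else (0x80 <= data[j] <= 0xBF)
            if not ok:
                break
            j += 1; need -= 1
        if need == 0:
            out.append(data[i:j].decode("utf-8"))   # well-formed sequence
        else:
            out.append("\ufffd")                    # one U+FFFD for the maximal subpart data[i:j]
        i = j                                       # resume at the offending byte (or at the end)
    return "".join(out)

# E0 E0 C3 89 -> '\ufffd\ufffd\u00c9'; E0 80 80 -> three U+FFFDs;
# 41 C0 AF 41 F4 80 80 41 -> 'A\ufffd\ufffdA\ufffdA', matching PR-121.

The number of U+FFFDs simply falls out of where the check stops; merging several such subparts into a single replacement is what would require the extra states.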
-- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 03:45:48 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 09:45:48 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> > On 16 May 2017, at 09:18, David Starner wrote: > > On Tue, May 16, 2017 at 12:42 AM Alastair Houghton wrote: >> If you?re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents. >> > Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers. That?s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them? Clearly if you are holding Unicode code points that you know are validly encoded somehow, you may want to be able to match U+FFFDs, but that?s a special case where you have extra knowledge. > In this case, It's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data. I don?t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt. The proposal actually does cover things that aren?t structurally valid, like your e0 e0 e0 example, which it suggests should be a single U+FFFD because the initial e0 denotes a three byte sequence, and your 80 80 80 example, which it proposes should constitute three illegal subsequences (again, both reasonable). However, I?m not entirely certain about things like e0 e0 c3 89 which the proposal would appear to decode as U+FFFD U+FFFD U+FFFD U+FFFD (3) instead of a perhaps more reasonable U+FFFD U+FFFD U+00C9 (4) (the key part is the ?without ever restricting trail bytes to less than 80..BF?) and if Markus or others could explain why they chose (3) over (4) I?d be quite interested to hear the explanation. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 04:29:09 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 16 May 2017 09:29:09 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> Message-ID: On Tue, May 16, 2017 at 1:45 AM Alastair Houghton < alastair at alastairs-place.net> wrote: > That?s true anyway; imagine the database holds raw bytes, that just happen > to decode to U+FFFD. There might seem to be *two* names that both contain > U+FFFD in the same place. How do you distinguish between them? 
> If the database holds raw bytes, then the name is a byte string, not a Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule to make and enforce that a string in a database is a validly formatted string; I would hope that most SQL servers do in fact reject malformed UTF-8 strings. On the other hand, I'd expect that an SQL server would accept U+FFFD in a Unicode string. > I don?t see a problem; the point is that where a structurally valid UTF-8 > encoding has been used, albeit in an invalid manner (e.g. encoding a number > that is not a valid code point, or encoding a valid code point as an > over-long sequence), a single U+FFFD is appropriate. That seems a > perfectly sensible rule to adopt. > It seems like a perfectly arbitrary rule to adopt; I'd like to assume that the only source of such UTF-8 data is willful attempts to break security, and in that case, how is this a win? Nonattack sources of broken data are much more likely to be the result of mixing UTF-8 with other character encodings or raw binary data. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 04:55:34 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 10:55:34 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> Message-ID: <14C93F5D-1CFF-4999-B9F2-8BE604FA77B9@alastairs-place.net> > On 16 May 2017, at 10:29, David Starner wrote: > > On Tue, May 16, 2017 at 1:45 AM Alastair Houghton wrote: > That?s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them? > >> If the database holds raw bytes, then the name is a byte string, not a Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule to make and enforce that a string in a database is a validly formatted string; I would hope that most SQL servers do in fact reject malformed UTF-8 strings. On the other hand, I'd expect that an SQL server would accept U+FFFD in a Unicode string. Databases typically separate the encoding in which strings are stored from the encoding in which an application connected to the database is operating. A database might well hold data in (say) ISO Latin 1, EUC-JP, or indeed any other character set, while presenting it to a client application as UTF-8 or UTF-16. Hence my comment - application software could very well see two names that are apparently identical and that include U+FFFDs in the same places, even though the database back-end actually has different strings. As I said, this is a problem we already have. > I don?t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt. > >> It seems like a perfectly arbitrary rule to adopt; I'd like to assume that the only source of such UTF-8 data is willful attempts to break security, and in that case, how is this a win? Nonattack sources of broken data are much more likely to be the result of mixing UTF-8 with other character encodings or raw binary data. 
I?d say there are three sources of UTF-8 data of that ilk: (a) bugs, (b) ?Modified UTF-8? and ?CESU-8? implementations, (c) wilful attacks (b) in particular is quite common, and the result of the presently recommended approach doesn?t make much sense there ([c0 80] will get replaced with *two* U+FFFDs, while [ed a0 bd ed b8 80] will be replaced by *four* U+FFFDs - surrogates aren?t supposed to be valid in UTF-8, right?) Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 05:09:44 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 11:09:44 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote: > > On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton > wrote: >> That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. > > If the internal representation is UTF-16 (or UTF-32), it is a likely > design that there is a variable into which the scalar value of the > current code point is accumulated during UTF-8 decoding. That?s quite a likely design with a UTF-8 internal representation too; it?s just that you?d only decode during processing, as opposed to immediately at input. > When the internal representation is UTF-8, only UTF-8 validation is > needed, and it's natural to have a fail-fast validator, which *doesn't > necessarily need such a scalar value accumulator at all*. Sure. But a state machine can still contain appropriate error states without needing an accumulator. That the ones you care about currently don?t is readily apparent, but there?s nothing stopping them from doing so. I don?t see this as an argument about implementations, since it really makes very little difference to the implementation which approach is taken; in both internal representations, the question is whether you generate U+FFFD immediately on detection of the first incorrect *byte*, or whether you do so after reading a complete sequence. UTF-8 sequences are bounded anyway, so it isn?t as if failing early gives you any significant performance benefit. >> In what sense is this ?interop?? > > In the sense that prominent independent implementations do the same > externally observable thing. The argument is, I think, that in this case the thing they are doing is the *wrong* thing. That many of them do it would only be an argument if there was some reason that it was desirable that they did it. There doesn?t appear to be such a reason, unless you can think of something that hasn?t been mentioned thus far? The only reason you?ve given, to date, is that they currently do that, so that should be the recommended behaviour (which is little different from the argument - which nobody deployed - that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don?t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers). I?ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they?re actually processing CESU-8, not UTF-8 anyway). 
Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true ?fast fail? implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *four* U+FFFDs (I think), no? One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 05:40:37 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 16 May 2017 13:40:37 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> Message-ID: On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton wrote: > On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote: >> >> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton >> wrote: >>> That would be true if the in-memory representation had any effect on what we?re talking about, but it really doesn?t. >> >> If the internal representation is UTF-16 (or UTF-32), it is a likely >> design that there is a variable into which the scalar value of the >> current code point is accumulated during UTF-8 decoding. > > That?s quite a likely design with a UTF-8 internal representation too; it?s just that you?d only decode during processing, as opposed to immediately at input. The time to generate the U+FFFDs is at the input time which is what's at issue here. The later processing, which may then involve iterating by code point and involving computing the scalar values is a different step that should be able to assume valid UTF-8 and not be concerned with invalid UTF-8. (To what extent different programming languages and frameworks allow confident maintenance of the invariant that after input all in-RAM UTF-8 can be treated as valid varies.) >> When the internal representation is UTF-8, only UTF-8 validation is >> needed, and it's natural to have a fail-fast validator, which *doesn't >> necessarily need such a scalar value accumulator at all*. > > Sure. But a state machine can still contain appropriate error states without needing an accumulator. As I said upthread, it could, but it seems inappropriate to ask implementations to take on that extra complexity on as weak grounds as "ICU does it" or "feels right" when the current recommendation doesn't call for those extra states and the current spec is consistent with a number of prominent non-ICU implementations, including Web browsers. >>> In what sense is this ?interop?? >> >> In the sense that prominent independent implementations do the same >> externally observable thing. > > The argument is, I think, that in this case the thing they are doing is the *wrong* thing. It's seems weird to characterize following the currently-specced "best practice" as "wrong" without showing a compelling fundamental flaw (such as a genuine security problem) in the currently-specced "best practice". With implementations of the currently-specced "best practice" already shipped, I don't think aesthetic preferences should be considered enough of a reason to proclaim behavior adhering to the currently-specced "best practice" as "wrong". > That many of them do it would only be an argument if there was some reason that it was desirable that they did it. 
There doesn?t appear to be such a reason, unless you can think of something that hasn?t been mentioned thus far? I've already given a reason: UTF-8 validation code not needing to have extra states catering to aesthetic considerations of U+FFFD consolidation. > The only reason you?ve given, to date, is that they currently do that, so that should be the recommended behaviour (which is little different from the argument - which nobody deployed - that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don?t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers). Not just browsers. Also OpenJDK and Python 3. Do I really need to test the standard libraries of more languages/systems to more strongly make the case that the ICU behavior (according to the proposal PDF) is not the norm and what the spec currently says is? > I?ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they?re actually processing CESU-8, not UTF-8 anyway). Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true ?fast fail? implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *four* U+FFFDs (I think), no? I see that behavior in Python 2. Earlier, I said that Python 3 agrees with the current spec for my test case. The Python 2 behavior I see is not just against "best practice" but obviously incompliant. (For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.) > One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement. This is an odd argument in favor of changing it. If the argument is that it's just a recommendation that you don't need to adhere to, surely then the people who don't like the current recommendation should choose not to adhere to it instead of advocating changing it. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue May 16 05:44:00 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 12:44:00 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <89D51495-3FDE-4A8E-A58B-703F0045CD08@alastairs-place.net> Message-ID: > > The proposal actually does cover things that aren?t structurally valid, > like your e0 e0 e0 example, which it suggests should be a single U+FFFD > because the initial e0 denotes a three byte sequence, and your 80 80 80 > example, which it proposes should constitute three illegal subsequences > (again, both reasonable). However, I?m not entirely certain about things > like > > e0 e0 c3 89 > > which the proposal would appear to decode as > > U+FFFD U+FFFD U+FFFD U+FFFD (3) > > instead of a perhaps more reasonable > > U+FFFD U+FFFD U+00C9 (4) > > (the key part is the ?without ever restricting trail bytes to less than > 80..BF?) > I also agree with that, due to access in strings from random position: if you access it from byte 0x89, you can assume it's a trialing byte and you'll want to look backward, and will see 0xc3,0x89 which will decode correctly as U+00C9 without any error detected. 
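(A sketch of the backward scan being described, assuming Python 3; sync_back is a made-up helper name for illustration.)

def sync_back(data: bytes, pos: int) -> int:
    # From an arbitrary byte offset, step back over at most three
    # continuation bytes (0x80..0xBF) to reach the byte that starts
    # the enclosing sequence, if there is one.
    start = pos
    while start > 0 and pos - start < 3 and 0x80 <= data[start] <= 0xBF:
        start -= 1
    return start

data = bytes([0xE0, 0xE0, 0xC3, 0x89])
i = sync_back(data, 3)                  # land on 0xC3 when starting from 0x89
print(data[i:i + 2].decode("utf-8"))    # 'É' (U+00C9), decoded without error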
So the wrong bytes are only the two leading 0xE0 bytes, which are individually converted to U+FFFD. In summary: when you detect any ill-formed sequence, only replace the first code unit with U+FFFD and restart scanning from the next code unit, without skipping over multiple bytes. This means that emitting multiple occurrences of U+FFFD is not only the best practice, it also matches the intended design of UTF-8 to allow access from random positions.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue May 16 06:08:52 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Tue, 16 May 2017 20:08:52 +0900
Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com>
Message-ID: <73e5364f-08db-2f19-3498-4a23e56649ac@it.aoyama.ac.jp>

Hello everybody, [using this mail to in effect reply to different mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote:
>> Under what circumstance would it matter how many U+FFFDs you see?
>
> Maybe it doesn't, but I don't think the burden of proof should be on
> the person advocating keeping the spec and major implementations as
> they are. If anything, I think those arguing for a change of the spec
> in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
> with the current spec should show why it's important to have a
> different number of U+FFFDs than the spec's "best practice" calls for
> now.

I have just checked (the programming language) Ruby. Some background: As you might know, Ruby is (at least in theory) pretty encoding-independent, meaning you can run scripts in iso-8859-1, in Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, without any conversion. However, in practice, incl. Ruby on Rails, Ruby is very much using UTF-8 internally, and is optimized to work well that way. Character encoding conversion also works with UTF-8 as the pivot encoding.

As far as I understand, Ruby does the same as all of the above software, based (among other things) on the fact that we followed the recommendation in the standard. Here are a few examples (sorry for the linebreaks introduced by mail software):

$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD"
$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: :replace).inspect' #=> "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", invalid: :replace).inspect' #=> "A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html as noted at https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516&view=markup#l1507 (for those having a look at these tests, in Ruby's version of assert_equal, the expected value comes first (not sure whether this is called little-endian or big-endian :-), but this is a decision where the various test frameworks are virtually split 50/50 :-(. ))

Even if the above examples and the tests use conversion to UTF-16 (in particular the BE variant for better readability), what happens internally is that the input is analyzed byte-by-byte.
In this case, it is easiest to just stop as soon as something is found that is clearly invalid (be this a single byte or something longer). This makes a data-driven implementation (such as the Ruby transcoder) or one based on a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) more compact. In other words, because we never know whether the next byte is a valid one such as 0x41, it's easier to just handle one byte at a time if this way we can avoid lookahead (which is always a good idea when parsing). I agree with Henri and others that there is no need at all to change the recommendation in the standard that has been stable for so long (close to 9 years). Because the original was done on a PR (http://www.unicode.org/review/pr-121.html), I think this should at least also be handled as PR (if it's not dropped based on the discussion here). I think changing the current definition of "maximal subsequence" is a bad idea, because it would mean that one wouldn't know what one was speaking about over the years. If necessary, new definitions should be introduced for other variants. I agree with others that ICU should not be considered to have a special status, it should be just one implementation among others. [The next point is a side issue, please don't spend too much time on it.] I find it particularly strange that at a time when UTF-8 is firmly defined as up to 4 bytes, never including any bytes above 0xF4, the Unicode consortium would want to consider recommending that be converted to a single U+FFFD. I note with agreement that Markus seems to have thoughts in the same direction, because the proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes above F4 could be somewhat debatable.)". Regards, Martin. From unicode at unicode.org Tue May 16 06:15:33 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 13:15:33 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <9EFEA10F-535B-4D46-8637-B2288162FF45@alastairs-place.net> Message-ID: 2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode : > > One additional note: the standard codifies this behaviour as a > *recommendation*, not a requirement. > > This is an odd argument in favor of changing it. If the argument is > that it's just a recommendation that you don't need to adhere to, > surely then the people who don't like the current recommendation > should choose not to adhere to it instead of advocating changing it. I also agree. The internet is full of RFC specifications that are also "best practices" and even in this case, changing them must be extensively documented, including discussing new compatibility/interoperability problems and new security risks. The case of random access in substrings is significant because what was once valid UTF-8 could become invalid if the best recommandation is not followed, and then could cause unexpected failures, uncaught exceptions causing software to suddenly fail and become subject to possible attacks due to this new failure (this is mostly a problem for implementations that do not use "safe" U+FFFD replacements but throw exceptions on ill-formed input: we should not change the cases where these exceptions may occur by adding new cases caused by a change of implementation based on a change of best practice). 
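(The two behaviours being contrasted here, sketched with Python 3's error handlers; "strict" raising is the exception case, "replace" is the safe-substitution case.)

bad = b"A\xe0\x80\x80B"          # an overlong sequence embedded in otherwise good text
try:
    bad.decode("utf-8")          # errors='strict' is the default and raises
except UnicodeDecodeError as err:
    print("strict decoding raises:", err.reason)
print(bad.decode("utf-8", errors="replace"))   # 'A\ufffd\ufffd\ufffdB': the loss is visible, nothing raises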
Considerations about trying to reduce the number of U+FFFDs are not really relevant; they are purely aesthetic, driven by a wish to compact the decoded result in memory. What really matters is not to silently ignore ill-formed sequences, and to properly track that there was some data loss. The number of U+FFFDs inserted (only one, or as many as there are invalid code units before the first resynchronization point) is not so important.

Likewise, whether implementations use an accumulator or just a single state (where each state knows how many code units have been parsed without emitting an output code point, so that those code units can be re-read by relative indexed accesses) is not relevant; it is a very minor optimization question. In my opinion, using an accumulator that can live in a CPU register is faster than using relative indexed accesses. All modern CPUs have enough registers to hold that accumulator plus the input and output pointers, and a finite-state number is not needed when the state can be tracked by the instruction position: you don't necessarily need to loop for each code unit, but can write the decoder so that each iteration processes a full code point or emits a single U+FFFD before adjusting the input pointer. UTF-8 and UTF-16 are simple enough that unwinding such loops to process full code points instead of single code units is easy. That code will still remain very small (fitting fully in the instruction cache), and it will be faster because it avoids several conditional branches and saves one register (for the finite-state number) that would otherwise have to be spilled to the stack: two pointer registers (or two access function/method addresses), two data registers and the program counter are enough.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue May 16 07:44:44 2017
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 16 May 2017 14:44:44 +0200
Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
In-Reply-To: References: Message-ID: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com>

> On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote:
...
> I think Unicode should not adopt the proposed change.

It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated into UTF-32 in a way that the original octet sequence can be restored.

From unicode at unicode.org Tue May 16 08:00:33 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 16 May 2017 15:00:33 +0200
Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
In-Reply-To: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com>
References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com>
Message-ID: 

2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :
>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, including non-valid sequences, is
> translated into UTF-32 in a way that the original octet sequence can be
> restored.

Why just UTF-32 ? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid UTF-8/UTF-16/UTF-32 ?
In all cases this would require extensions on the 3 standards (which MUST be interoperable), then you'll shoke on new validation rules for these 3 standards for these extensions, and new ill-formed sequences that you won't be able to convert interoperably. Given the most restrictive condition in UTF-16 (which is still the most widely used internal representation), such extensions would be very complex too manage. There's no solution, such extensions in any one of them are then undesirable and can only be used privately (but without interoperating with the other 2 representations), so it's impossible to make sure the original octet sequences can be restored. Any deviation of the UTF-8/16/32 will be bounded in the same UTF. It cannot be part of the 3 standard UTF, but may be part of a distinct encoding, not fully compatible with the 3 standards. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 08:10:05 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 14:10:05 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <73e5364f-08db-2f19-3498-4a23e56649ac@it.aoyama.ac.jp> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <73e5364f-08db-2f19-3498-4a23e56649ac@it.aoyama.ac.jp> Message-ID: <20170516141005.608e95e9@JRWUBU2> On Tue, 16 May 2017 20:08:52 +0900 "Martin J. D?rst via Unicode" wrote: > I agree with others that ICU should not be considered to have a > special status, it should be just one implementation among others. > [The next point is a side issue, please don't spend too much time on > it.] I find it particularly strange that at a time when UTF-8 is > firmly defined as up to 4 bytes, never including any bytes above > 0xF4, the Unicode consortium would want to consider recommending that > be converted to a single U+FFFD. I note with > agreement that Markus seems to have thoughts in the same direction, > because the proposal (17168-utf-8-recommend.pdf) says "(I suppose > that lead bytes above F4 could be somewhat debatable.)". The undesirable sidetrack, I suppose, is worrying about how many planes will be required for emoji. However, it does make for the point that, while some practices may be better than other, there isn't necessarily a best practice. The English of the proposal is unclear - the text would benefit from showing some maximal subsequences (poor terminology - some of us are used to non-contiguous subsequences). When he writes, "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF", I am pretty sure he means "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, with the only restriction on trailing bytes beyond the number of them being that they must be in the range 80..BF". Thus Philippe's example of "E0 E0 C3 89" would be converted with an error flagged to a sequence of scalar values FFFD FFFD C9. This may make a UTF-8 system usable if it tries to use something like non-characters as understood before CLDR was caught publishing them as an essential part of text files. Richard. 
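(A sketch of how Richard's reading of the proposal differs in U+FFFD counts from the current recommendation, assuming Python 3; the "structural" rule below is an interpretation of the proposal's wording, not code from the proposal itself. The counts for the current recommendation are the ones Martin's Ruby examples show above.)

def fffd_count_structural(data: bytes) -> int:
    # Proposed "structural" reading: a lead byte announces a length, and
    # any following bytes in 80..BF (up to that length) are consumed with
    # it; an ill-formed run of that shape becomes a single U+FFFD.
    count, i, n = 0, 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            i += 1; continue
        if 0x80 <= b <= 0xBF:            # stray continuation byte: one U+FFFD on its own
            count += 1; i += 1; continue
        length = 2 if b <= 0xDF else 3 if b <= 0xEF else 4 if b <= 0xF7 else 1
        j = i + 1
        while j < n and j < i + length and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            data[i:j].decode("utf-8")    # well-formed part: no replacement
        except UnicodeDecodeError:
            count += 1                   # whole structural run collapses to one U+FFFD
        i = j
    return count

for raw in (b"\xe0\x80\x80", b"\xf4\x90\x80\x80", b"\xe0\xe0\xc3\x89"):
    current = raw.decode("utf-8", errors="replace").count("\ufffd")
    print(raw, "current:", current, "proposal:", fffd_count_structural(raw))
# E0 80 80: current 3, proposal 1.  F4 90 80 80: current 4, proposal 1.
# E0 E0 C3 89: both give 2, i.e. FFFD FFFD U+00C9, as Richard says.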
From unicode at unicode.org Tue May 16 08:21:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 14:21:53 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: <20170516142153.3e146371@JRWUBU2> On Tue, 16 May 2017 14:44:44 +0200 Hans ?berg via Unicode wrote: > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > > wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode > codepoint markers that indicate how UTF-8, including non-valid > sequences, is translated into UTF-32 in a way that the original octet > sequence can be restored. Escape sequences for the inappropriate bytes is the natural technique. Your problem is smoothly transitioning so that the escape character is always escaped when it means itself. Strictly, it can't be done. Of course, some sequences of escaped characters should be prohibited. Checking could be fiddly. Richard. From unicode at unicode.org Tue May 16 08:23:55 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 15:23:55 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: > On 16 May 2017, at 15:00, Philippe Verdy wrote: > > 2017-05-16 14:44 GMT+02:00 Hans ?berg via Unicode : > > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated into UTF-32 in a way that the original octet sequence can be restored. > > Why just UTF-32 ? Synonym for codepoint numbers. It would suffice to add markers how it is translated. For example, codepoints meaning "overlong long length ", "byte", or whatever is useful. > How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid UTF-8/UTF-16/UTF-32 ? You don't. You have a filename, which is a octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in that is is not being reachable. It only matters that the correct octet sequence is handed back to the filesystem. All current filsystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above. From unicode at unicode.org Tue May 16 09:10:52 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 16:10:52 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: 2017-05-16 15:23 GMT+02:00 Hans ?berg : > All current filsystems, as far as experts could recall, use octet > sequences at the lowest level; whatever encoding is used is built in a > layer above > Not NTFS (on Windows) which uses sequences of 16bit units. 
Same about FAT32/exFAT within "Long File Names" (the legacy 8.3 short filenames are using legacy 8-bit codepages, but these are alternate filenames used when long filenames are not found, and working mostly like aliasing physical links on Unix filesystems, as if they were separate directory entries, except that they are hidden by default when their matching LFN are already shown) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 10:30:09 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 16:30:09 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> Message-ID: <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> On 16 May 2017, at 14:23, Hans ?berg via Unicode wrote: > > You don't. You have a filename, which is a octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in that is is not being reachable. > > It only matters that the correct octet sequence is handed back to the filesystem. All current filsystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above. HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. FAT 8.3 names are also encoded, but the encoding isn?t specified (more specifically, MS-DOS and Windows assume an encoding based on your locale, which could cause all kinds of fun if you swapped disks with someone from a different country, and IIRC there are some shenanigans for Japan because of the use of 0xe5 as a deleted file marker). There are some less widely used filesystems that require a particular encoding also (BeOS? BFS used UTF-8, for instance). Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use whose names can?t be converted to UTF-8, the Darwin kernel uses a percent encoding scheme(!) It looks like Apple has changed its mind for APFS and is going with the ?bag of bytes? approach that?s typical of other systems; at least, that?s what it appears to have done on iOS. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 10:44:54 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 17:44:54 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> Message-ID: <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> > On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote: > > On 16 May 2017, at 14:23, Hans ?berg via Unicode wrote: >> >> You don't. You have a filename, which is a octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in that is is not being reachable. >> >> It only matters that the correct octet sequence is handed back to the filesystem. All current filsystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above. > > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... 
The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. From unicode at unicode.org Tue May 16 10:52:00 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 16:52:00 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> Message-ID: <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> On 16 May 2017, at 16:44, Hans ?berg wrote: > > On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote: >> >> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... > > The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). On Windows, the ?ANSI? APIs convert the filenames to or from the appropriate Windows code page, while the ?Wide? API works in UTF-16, which is the native encoding for VFAT long filenames and NTFS filenames. And, as I said, on Mac OS X and iOS, the kernel expects filenames to be encoded as UTF-8 at the BSD API, regardless of what encoding you might be using in your Terminal (this is different to traditional UNIX behaviour, where how you interpret your filenames is entirely up to you - usually you?d use the same encoding you were using on your tty). Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 11:07:34 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 18:07:34 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > On 16 May 2017, at 17:52, Alastair Houghton wrote: > > On 16 May 2017, at 16:44, Hans ?berg wrote: >> >> On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote: >>> >>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >> >> The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. > > No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). I am not speaking about what they store, but how the filesystem identifies files. 
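For the filename round-tripping Hans is after, one existing approach (not from this thread) is PEP 383's "surrogateescape" error handler, which CPython's os.fsdecode/os.fsencode use on POSIX: each undecodable byte 0xXX is mapped to the lone surrogate U+DCXX and mapped back on encoding, so the original octet sequence is always recoverable. A short sketch, assuming a Latin-1 filename that is not valid UTF-8:

    raw = b'caf\xe9.txt'                     # Latin-1 bytes, not valid UTF-8

    name = raw.decode('utf-8', 'surrogateescape')
    # name is 'caf\udce9.txt': the bad byte survives as a lone surrogate,
    # so the string is not valid for interchange, but nothing is lost.

    assert name.encode('utf-8', 'surrogateescape') == raw   # exact round trip

The cost is exactly the one discussed above: such strings contain lone surrogates, so they have to stay inside the process and cannot be emitted as well-formed UTF-8 or UTF-16 without further escaping.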
From unicode at unicode.org Tue May 16 11:13:33 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 17:13:33 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On 16 May 2017, at 17:07, Hans ?berg wrote: > >>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >>> >>> The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. >> >> No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). > > I am not speaking about what they store, but how the filesystem identifies files. Well, quite clearly none of those systems treat the UTF-16 strings as binary either - they?re case insensitive, so how could they? HFS+ even normalises strings using a variant of a frozen version of the normalisation spec. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 11:23:51 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 18:23:51 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > On 16 May 2017, at 18:13, Alastair Houghton wrote: > > On 16 May 2017, at 17:07, Hans ?berg wrote: >> >>>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >>>> >>>> The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that to used UTF-16 directly, but I think it may not be current. >>> >>> No, that?s not true. All three of those systems store UTF-16 on the disk (give or take). >> >> I am not speaking about what they store, but how the filesystem identifies files. > > Well, quite clearly none of those systems treat the UTF-16 strings as binary either - they?re case insensitive, so how could they? HFS+ even normalises strings using a variant of a frozen version of the normalisation spec. HFS implements case insensitivity in a layer above the filesystem raw functions. So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already. From unicode at unicode.org Tue May 16 11:38:36 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 17:38:36 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On 16 May 2017, at 17:23, Hans ?berg wrote: > > HFS implements case insensitivity in a layer above the filesystem raw functions. 
So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already. You keep insisting on this, but it?s not true; I?m a disk utility developer, and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory data (a single one for the entire disk, not one per directory either), and that that tree is sorted by (CNID, filename) pairs. And since it?s case-preserving *and* case-insensitive, the comparisons it does to order its B+-Tree nodes *cannot* be raw. I should know - I?ve actually written the code for it! Even for legacy HFS, which didn?t store UTF-16, but stored a specified Mac legacy encoding (the encoding used is in the volume header), it?s case sensitive, so the encoding matters. I don?t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know how the filesystem works. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 11:52:03 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 18:52:03 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > On 16 May 2017, at 18:38, Alastair Houghton wrote: > > On 16 May 2017, at 17:23, Hans ?berg wrote: >> >> HFS implements case insensitivity in a layer above the filesystem raw functions. So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already. > > You keep insisting on this, but it?s not true; I?m a disk utility developer, and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory data (a single one for the entire disk, not one per directory either), and that that tree is sorted by (CNID, filename) pairs. And since it?s case-preserving *and* case-insensitive, the comparisons it does to order its B+-Tree nodes *cannot* be raw. I should know - I?ve actually written the code for it! > > Even for legacy HFS, which didn?t store UTF-16, but stored a specified Mac legacy encoding (the encoding used is in the volume header), it?s case sensitive, so the encoding matters. > > I don?t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know how the filesystem works. One could make files that differed by case in the same directory, and Mac OS 9 did not bother. Legacy HFS tended to slow down with many files in the same directory, so that gave an impression of a tree structure. The BSD filesystem at the time, perhaps the one that Mac OS X once supported, did not store files in a tree, but flat with redundancy. The other info I got on the Austin Group List a decade ago. 
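A tiny illustration of Alastair's point that a case-preserving, case-insensitive catalogue cannot order its keys by raw code units. Here str.casefold merely stands in for HFS+'s frozen case-folding table, which it only approximates:

    names = ['readme.txt', 'README.TXT', 'Resume.txt']
    print(sorted(names))                     # raw code-unit order: R < e < r
    print(sorted(names, key=str.casefold))   # folded order differs

    assert 'readme.txt'.casefold() == 'README.TXT'.casefold()
    # ...which is why a case-insensitive B-tree cannot compare raw bytes.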
From unicode at unicode.org Tue May 16 12:30:01 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 17:30:01 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: > Would you advocate replacing > e0 80 80 > with > U+FFFD U+FFFD U+FFFD (1) > rather than > U+FFFD (2) > It?s pretty clear what the intent of the encoder was there, I?d say, and while we certainly don?t > want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don?t > see the logic in insisting that it must be decoded to *three* code points when it clearly only > represented one in the input. It is not at all clear what the intent of the encoder was - or even if it's not just a problem with the data stream. E0 80 80 is not permitted, it's garbage. An encoder can't "intend" it. Either A) the "encoder" was attempting to be malicious, in which case the whole thing is suspect and garbage, and so the # of FFFD's doesn't matter, or B) the "encoder" is completely broken, in which case all bets are off, again, specifying the # of FFFD's is irrelevant. C) The data was corrupted by some other means. Perhaps bad concatenations, lost blocks during read/transmission, etc. If we lost 2 512 byte blocks, then maybe we should have a thousand FFFDs (but how would we known?) -Shawn From unicode at unicode.org Tue May 16 12:58:22 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 18:58:22 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <20170516185822.47a88df5@JRWUBU2> On Tue, 16 May 2017 17:30:01 +0000 Shawn Steele via Unicode wrote: > > Would you advocate replacing > > > e0 80 80 > > > with > > > U+FFFD U+FFFD U+FFFD (1) > > > rather than > > > U+FFFD (2) > > > It?s pretty clear what the intent of the encoder was there, I?d > > say, and while we certainly don?t want to decode it as a NUL (that > > was the source of previous security bugs, as I recall), I also > > don?t see the logic in insisting that it must be decoded to *three* > > code points when it clearly only represented one in the input. > > It is not at all clear what the intent of the encoder was - or even > if it's not just a problem with the data stream. E0 80 80 is not > permitted, it's garbage. An encoder can't "intend" it. It was once a legal way of encoding NUL, just like C0 E0, which is still in use, and seems to be the best way of storing NUL as character content in a *C string*. (Strictly speaking, one can't do it.) It could be lurking in old text or come from an old program that somehow doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8. Remember the conformance test for the Unicode Collation Algorithm has contained lone surrogates in the past, and the UAX on Unicode Regular Expressions used to require the ability to search for lone surrogates. Richard. 
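A small sketch of the "modified UTF-8" convention Richard alludes to (Java class files and JNI strings), where U+0000 is written as the overlong two-byte sequence C0 80 precisely so that no 0x00 byte appears inside a C string. This is a hand-rolled illustration, not Java's actual encoder, and the CESU-8 treatment of supplementary characters is omitted:

    def encode_modified_utf8(s):
        out = bytearray()
        for ch in s:
            if ch == '\x00':
                out += b'\xC0\x80'            # deliberate overlong NUL
            else:
                out += ch.encode('utf-8')
        return bytes(out)

    data = encode_modified_utf8('a\x00b')
    assert data == b'a\xC0\x80b'
    assert 0x00 not in data                   # safe to pass as a C string

A standard UTF-8 decoder must still treat C0 80 as ill-formed; under the current recommendation it becomes two U+FFFDs, under the proposed change a single one.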
From unicode at unicode.org Tue May 16 13:01:57 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 20:01:57 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> Message-ID: On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random sequences of 16-bit code units are not permitted. There's visibly a validation step that returns an error if you attempt to create files with invalid sequences (including other restrictions such as forbidding U+0000 and some other problematic controls). This occurs because the NTFS and FAT driver will also attempt to normalize the string in order to create compatibility 8.3 filenames using the system's native locale (not the current user locale which is used when searching files/enumerating directories or opening files - this could generate errors when the encodings for distinct locales do not match, but should not cause errors when filenames are **first** searched in their UTF-16 encoding specified in applications, but applications that still need to access files using their short name are deprecated). The kind of normalization taken for creating short 8.3 filenames uses OS-specific specific conversion tables built in the filesystem drivers. This generation however has a cost due to the uniqueness constraints (requiring to abbreviate the first part of the 8.3 name to add "~numbered" suffixes before the extension, whose value is unpredicatable if there are other existing "*~1.*" files: it requires the driver to retry with another number, looping if necessary). This also has a (very modest) storage cost but it is less critical than the enumeration step and the fact that these shortened name cannot be predicted by applications. This canonicalization is also required also because the filesystem is case-insensitive (and it's technically not possible to store all the multiple case variants for filenames as assigned aliases/physical links). In classic filesystems for Unix/Linux the only restrictions are on forbidding null bytes, and assigning "/" a role for hierarchic filesystems (unusable anywhere as directory entry name), plus the preservation of "." and ".." entries in directories, meaning that only 8-bit encodings based on 7-bit ASCII are possible, so Linux/Unix are not completely treating thes filenames as pure binary bags of bytes (however if this is not checked and such random names may occur, which will be difficult to handle with classic tools and shells). Some other filesystems for Linux/Unix are still enforcing restrictions (and there exists even versions of them that are supporting case insensitity, in addition to FAT12/FAT16/FAT32/exFAT/NTFS emulated filesystems: this also exists in NFS driver as an option, or in drivers for legacy filesystems initially coming from mainframes, or in filesystem drivers based on FTP, and even in the filesystem driver allowing to mount a Windows registry which is also case-insensitive). 
Technically in the core kernel of Linux/Unix there's no restriction on the effective encoding (except "/" and null), the actual restrictions are implemented within filesystem drivers, configured only when volumes are mounted: each mounted filesystem can then have its own internal encoding; there will be different behaviors when using a driver for any MacOS filesystem. Linux can perfectly work with NTFS filesystems, except that most of the time, short filenames will be completely ignored and not generated on the fly. This generation of short filenames in a legacy (unspecified) 8-bit codepage is not a requirement of NTFS and it can be disabled also in Windows. But FAT12/FAT16/FAT32 still require these legacy short names to be generated when only the LFN could be used, and the short 8.3 name left completely null in the main directory entry ; but legacy FAT drivers will shoke on these null entries, if they are not tagged by a custom attribute bit as "ignorable but not empty", or if the 8+3 characters do not use specific unique parterns such as "\" followed by 7 pseudo-random characters in the main part, plus 3 other pseudo-random characters in the extension (these 10 characters may use any non null value: they provide nearly 80 bits or more exactly 250^10 identifiers if we exclude the 6 characters "/", "\", ".", ":" NULL and SPACE that are reserved, which could be generated almost predictably simply by hashing the original unabbreviated name with 79 bits from SHA-128, or faster with simple MD5 hahsing, and very rare remaining collisions to handle). Some FAT repait tools will attempt to repair the legacy short filenames that are not unique or cannot be derived from the UTF-16 encoded LFN (this happens when "repairing" a FAT volume initially created on another system that used a different 8-bit OEM codepage, but this "CheckDisk" tools should have an option to not "repair" them, given that modern applications normally do not need these filenames if a LFN is present (even the Windows Explorer will not display these short names because trhey are hidden by default each time there's a LFN which overrides them). We must add however that on FAT filesystems, a LFN will not always be stored if the Unicode name already has the "8.3" form and all characters are from ASCII (which is the base of all supported 8-bit OEM charsets), but it will be created if the user edits the filename to use another prefered capitalization than the default one (the Explorer default is to render fully capitlized short filenames using a single leading capital letter, and all other characters, including the 1-to-3-characters file extension, befing displayed as lowercase (so the "Windows" LFN would be stored simply as the "WINDOWS" short filename without any LFN needed in the directory entries). To be complete, a few legacy filenames are also reserved and can't be used in Windows (short of LFN) filenames, such as "CON" (case-insensitive), reserved by another legacy non-filesystem driver before they are seeked in a specific current directory: to use them as filenames, you must prefix them with a drive letter or with the some ".\" prefix (relative to the current directory) or full path name. 2017-05-16 17:44 GMT+02:00 Hans ?berg : > > > On 16 May 2017, at 17:30, Alastair Houghton via Unicode < > unicode at unicode.org> wrote: > > > > On 16 May 2017, at 14:23, Hans ?berg via Unicode > wrote: > >> > >> You don't. You have a filename, which is a octet sequence of unknown > encoding, and want to deal with it. 
Therefore, valid Unicode > transformations of the filename may result in that is is not being > reachable. > >> > >> It only matters that the correct octet sequence is handed back to the > filesystem. All current filsystems, as far as experts could recall, use > octet sequences at the lowest level; whatever encoding is used is built in > a layer above. > > > > HFS(+), NTFS and VFAT long filenames are all encoded in some variation > on UCS-2/UTF-16. ... > > The filesystem directory is using octet sequences and does not bother > passing over an encoding, I am told. Someone could remember one that to > used UTF-16 directly, but I think it may not be current. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:09:32 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 18:09:32 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170516185822.47a88df5@JRWUBU2> References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> <20170516185822.47a88df5@JRWUBU2> Message-ID: Regardless, it's not legal and hasn't been legal for quite some time. Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to anything depending on that fake-null, so one or three isn't really going to matter. -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Tuesday, May 16, 2017 10:58 AM To: unicode at unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On Tue, 16 May 2017 17:30:01 +0000 Shawn Steele via Unicode wrote: > > Would you advocate replacing > > > e0 80 80 > > > with > > > U+FFFD U+FFFD U+FFFD (1) > > > rather than > > > U+FFFD (2) > > > It?s pretty clear what the intent of the encoder was there, I?d say, > > and while we certainly don?t want to decode it as a NUL (that was > > the source of previous security bugs, as I recall), I also don?t see > > the logic in insisting that it must be decoded to *three* code > > points when it clearly only represented one in the input. > > It is not at all clear what the intent of the encoder was - or even if > it's not just a problem with the data stream. E0 80 80 is not > permitted, it's garbage. An encoder can't "intend" it. It was once a legal way of encoding NUL, just like C0 E0, which is still in use, and seems to be the best way of storing NUL as character content in a *C string*. (Strictly speaking, one can't do it.) It could be lurking in old text or come from an old program that somehow doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8. Remember the conformance test for the Unicode Collation Algorithm has contained lone surrogates in the past, and the UAX on Unicode Regular Expressions used to require the ability to search for lone surrogates. Richard. From unicode at unicode.org Tue May 16 13:13:15 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 20:13:15 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: 2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode : > C) The data was corrupted by some other means. Perhaps bad > concatenations, lost blocks during read/transmission, etc. 
If we lost 2 > 512 byte blocks, then maybe we should have a thousand FFFDs (but how would > we known?) > Thousands of U+FFFD's is not a problem (independantly of the internal UTF encoding used): yes the 2512 byte block could then become 3 times larger (if using UTF-8 internal encoding) or 2 times larger (if using UTF-16 internal encoding) but every application should be prepared to support the size expansion with a completely know maximum factor, which could occur as well with any valid CJK-only text. So the size to allocate for the internal sorage is predictable from the size of the input, this is an important feature of all standard UTF's. Being able to handle the worst case of allowed expansion, militates largely for the adoption of UTF-16 as the internal encoding, instead of UTF-8 (where you'll need to allocate more space before decoding the input, if you want to avoid successive memory reallocations, which would impact the performance of your decoder): it's simple to accept input from 512 bytes (or 1KB) buffers, and allocate a 1KB (or 2KB) buffer for storing the intermediate results in the generic decoder, and simpler on the outer level to preallocate buffers with resonable sizes that will be reallocated once if needed to the maximum size, and then reduced to the effective size (if needed) at end of successful decoding (some implementations can use pools of preallocated buffers with small static sizes, allocating new buffers out side the pool only for rare cases where more space will be needed) . -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:13:53 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 16 May 2017 11:13:53 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <717da35a-ad60-647c-a05e-7986650f7dec@ix.netcom.com> Message-ID: <2a11e4fa-7e0c-3e68-e7d5-f5147051e9ce@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:20:00 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 16 May 2017 20:20:00 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> Message-ID: <9D18C48B-F7E0-46FE-B8BF-5A963C79109A@telia.com> > On 16 May 2017, at 20:01, Philippe Verdy wrote: > > On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random sequences of 16-bit code units are not permitted. There's visibly a validation step that returns an error if you attempt to create files with invalid sequences (including other restrictions such as forbidding U+0000 and some other problematic controls). For it to work the way I suggested, there would be low level routines that handles the names raw, and then on top of that, interface routines doing what you describe. On the Austin Group List, they mentioned a filesystem doing it directly in UTF-16, and it could have been the one you describe. 
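The worst-case growth Philippe mentions above is easy to bound up front: neither the current nor the proposed recommendation emits more than one U+FFFD per ill-formed input byte, so a decoder can size its output buffer before looking at the data. A back-of-the-envelope sketch (the function names are mine):

    def max_utf8_output_bytes(n_input_bytes):
        # Worst case: every byte is ill-formed and becomes U+FFFD,
        # which re-encodes as 3 bytes in UTF-8.
        return 3 * n_input_bytes

    def max_utf16_output_units(n_input_bytes):
        # Worst case: one 16-bit code unit per input byte
        # (ASCII bytes and ill-formed bytes alike).
        return n_input_bytes

    block = 512
    print(max_utf8_output_bytes(block))    # 1536 bytes (3x growth)
    print(max_utf16_output_units(block))   # 512 units = 1024 bytes (2x growth)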
From unicode at unicode.org Tue May 16 13:36:39 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Tue, 16 May 2017 11:36:39 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section "Constraints on Conversion Processes". Otherwise, what you do with illegal sequences is a matter of what you think makes sense -- a matter of opinion and convenience. Nothing more. I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU team. At the time, I believe the ISO UTF-8 definition was not yet limited to U+10FFFF, and decoding overlong sequences and those yielding surrogate code points was regarded as a misdemeanor. The spec has been tightened up, but I am pretty sure that most people familiar with how UTF-8 came about would recognize and as single sequences. I believe that the discussion of how to handle illegal sequences came out of security issues a few years ago from some implementations including valid single and lead bytes with preceding illegal sequences. Beyond the "Constraints on Conversion Processes", there was evidently also a desire to recommend how to handle illegal sequences. I think that the current recommendation was an extrapolation of common practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, but "it feels like" (yes, that's the level of argument for stuff that doesn't really matter) not treating and as single sequences is "weird". Why do we care how we carve up an illegal sequence into subsequences? Only for debugging and visual inspection. Maybe some process is using illegal, overlong sequences to encode something special (? la Java string serialization, "modified UTF-8"), and for that it might be convenient too to treat overlong sequences as single errors. If you don't like some recommendation, then do something else. It does not matter. If you don't reject the whole input but instead choose to replace illegal sequences with something, then make sure the something is not nothing -- replacing with an empty string can cause security issues. Otherwise, what the something is, or how many of them you put in, is not very relevant. One or more U+FFFDs is customary. When the current recommendation came in, I thought it was reasonable but didn't like the edge cases. At the time, I didn't think it was important to twiddle with the text in the standard, and I didn't care that ICU didn't exactly implement that particular recommendation. I have seen implementations that clobber every byte in an illegal sequence with a space, because it's easier than writing an U+FFFD for each byte or for some subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long illegal subsequence; that's fine, too. Karl Williamson sent feedback to the UTC, "In short, I believe the best practices are wrong." I think "wrong" is far too strong, but I got an action item to propose a change in the text. I proposed a modified recommendation. 
Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" that was "right". None of this is motivated by which UTF is used internally. It is true that it takes a tiny bit more thought and work to recognize a wider set of sequences, but a capable implementer will optimize successfully for valid sequences, and maybe even for a subset of those for what might be expected high-frequency code point ranges. Error handling can go into a slow path. In a true state table implementation, it will require more states but should not affect the performance of valid sequences. Many years ago, I decided for ICU to add a small amount of slow-path error-handling code for more human-friendly illegal-sequence reporting. In other words, this was not done out of convenience; it was an inconvenience that seemed justified by nicer error reporting. If you don't like to do so, then don't. Which UTF is better? It depends. They all have advantages and problems. It's all Unicode, so it's all good. ICU largely uses UTF-16 but also UTF-8. It has data structures and code for charset conversion, property lookup, sets of characters (UnicodeSet), and collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly growing set of APIs working directly with UTF-8. So, please take a deep breath. No conformance requirement is being touched, no one is forced to do something they don't like, no special consideration is given for one UTF over another. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 13:45:01 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 16 May 2017 19:45:01 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On 16 May 2017, at 19:36, Markus Scherer wrote: > > Let me try to address some of the issues raised here. Thanks for jumping in. The one thing I wanted to ask about was the ?without ever restricting trail bytes to less than 80..BF?. I think that could be misinterpreted; having thought about it some more, I think you mean ?considering any trailing byte in the range 80..BF as valid?. The ?less than? threw me the first few times I read it and I started thinking you meant allowing any byte as a trailing byte, which is clearly not right. Otherwise, I?m happy :-) Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 16 13:50:00 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 18:50:00 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: But why change a recommendation just because it ?feels like?. As you said, it?s just a recommendation, so if that really annoyed someone, they could do something else (eg: they could use a single FFFD). If the recommendation is truly that meaningless or arbitrary, then we just get into silly discussions of ?better? that nobody can really answer. 
Alternatively, how about ?one or more FFFDs?? for the recommendation? To me it feels very odd to perhaps require writing extra code to detect an illegal case. The ?best practice? here should maybe be ?one or more FFFDs, whatever makes your code faster?. Best practices may not be requirements, but people will still take time to file bugs that something isn?t following a ?best practice?. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Markus Scherer via Unicode Sent: Tuesday, May 16, 2017 11:37 AM To: Alastair Houghton Cc: Philippe Verdy ; Henri Sivonen ; unicode Unicode Discussion ; Hans ?berg Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section "Constraints on Conversion Processes". Otherwise, what you do with illegal sequences is a matter of what you think makes sense -- a matter of opinion and convenience. Nothing more. I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU team. At the time, I believe the ISO UTF-8 definition was not yet limited to U+10FFFF, and decoding overlong sequences and those yielding surrogate code points was regarded as a misdemeanor. The spec has been tightened up, but I am pretty sure that most people familiar with how UTF-8 came about would recognize and as single sequences. I believe that the discussion of how to handle illegal sequences came out of security issues a few years ago from some implementations including valid single and lead bytes with preceding illegal sequences. Beyond the "Constraints on Conversion Processes", there was evidently also a desire to recommend how to handle illegal sequences. I think that the current recommendation was an extrapolation of common practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, but "it feels like" (yes, that's the level of argument for stuff that doesn't really matter) not treating and as single sequences is "weird". Why do we care how we carve up an illegal sequence into subsequences? Only for debugging and visual inspection. Maybe some process is using illegal, overlong sequences to encode something special (? la Java string serialization, "modified UTF-8"), and for that it might be convenient too to treat overlong sequences as single errors. If you don't like some recommendation, then do something else. It does not matter. If you don't reject the whole input but instead choose to replace illegal sequences with something, then make sure the something is not nothing -- replacing with an empty string can cause security issues. Otherwise, what the something is, or how many of them you put in, is not very relevant. One or more U+FFFDs is customary. When the current recommendation came in, I thought it was reasonable but didn't like the edge cases. At the time, I didn't think it was important to twiddle with the text in the standard, and I didn't care that ICU didn't exactly implement that particular recommendation. I have seen implementations that clobber every byte in an illegal sequence with a space, because it's easier than writing an U+FFFD for each byte or for some subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long illegal subsequence; that's fine, too. 
Karl Williamson sent feedback to the UTC, "In short, I believe the best practices are wrong." I think "wrong" is far too strong, but I got an action item to propose a change in the text. I proposed a modified recommendation. Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" that was "right". None of this is motivated by which UTF is used internally. It is true that it takes a tiny bit more thought and work to recognize a wider set of sequences, but a capable implementer will optimize successfully for valid sequences, and maybe even for a subset of those for what might be expected high-frequency code point ranges. Error handling can go into a slow path. In a true state table implementation, it will require more states but should not affect the performance of valid sequences. Many years ago, I decided for ICU to add a small amount of slow-path error-handling code for more human-friendly illegal-sequence reporting. In other words, this was not done out of convenience; it was an inconvenience that seemed justified by nicer error reporting. If you don't like to do so, then don't. Which UTF is better? It depends. They all have advantages and problems. It's all Unicode, so it's all good. ICU largely uses UTF-16 but also UTF-8. It has data structures and code for charset conversion, property lookup, sets of characters (UnicodeSet), and collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly growing set of APIs working directly with UTF-8. So, please take a deep breath. No conformance requirement is being touched, no one is forced to do something they don't like, no special consideration is given for one UTF over another. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 14:43:58 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 May 2017 20:43:58 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: <20170516204358.15f6656a@JRWUBU2> On Tue, 16 May 2017 11:36:39 -0700 Markus Scherer via Unicode wrote: > Why do we care how we carve up an illegal sequence into subsequences? > Only for debugging and visual inspection. Maybe some process is using > illegal, overlong sequences to encode something special (? la Java > string serialization, "modified UTF-8"), and for that it might be > convenient too to treat overlong sequences as single errors. I think that's not quite true. If we are moving back and forth through a buffer containing corrupt text, we need to make sure that moving three characters forward and then three characters back leaves us where we started. That requires internal consistency. One possible issue is with text input methods that access an application's backing store. They can issue updates in the form of 'delete 3 characters and insert ...'. However, if the input method is accessing characters it hasn't written, it's probably misbehaving anyway. Such commands do rather heavily assume that any relevant normalisation by the application will be taken into account by the input method. I once had a go at fixing an application that was misinterpreting 'delete x characters' as 'delete x UTF-16 code units'. 
It was a horrible mess, as the application's interface layer couldn't peek at the string being edited. Richard. From unicode at unicode.org Tue May 16 16:08:13 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 23:08:13 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: 2017-05-16 20:50 GMT+02:00 Shawn Steele : > But why change a recommendation just because it ?feels like?. As you > said, it?s just a recommendation, so if that really annoyed someone, they > could do something else (eg: they could use a single FFFD). > > > > If the recommendation is truly that meaningless or arbitrary, then we just > get into silly discussions of ?better? that nobody can really answer. > > > > Alternatively, how about ?one or more FFFDs?? for the recommendation? > > > > To me it feels very odd to perhaps require writing extra code to detect an > illegal case. The ?best practice? here should maybe be ?one or more FFFDs, > whatever makes your code faster?. > Faster ok, privided this does not break other uses, notably for random access within strings, where UTF-8 is designed to allow searching backward on a limited number of bytes (maximum 3) in order to find the leading byte, and then check its value: - if it's not found, return back to the initial position and amke the next access return U+FFFD to signal the error of position: this trailing byte is part of an ill-formed sequence, and for coherence, any further trailine bytes fouind after it will **also** return U+FFFD to be coherent (because these other trailing bytes may also be found bby random access to them. - it the leading byte is found backward ut does not match the expected number of trailing bytes after it, return back to the initial random position where you'll return also U+FFFD. This means that the initial leading byte (part of the ill-formed sequence) must also return a separate U+FFFD, given that each following trailing byte will return U+FFFD isolately when accessing to them. If we want coherent decoding with text handling promitives allowing random access with encoded sequences, there's no other choice than treating EACH byte part of the ill-formed sequence as individual errors mapped to the same replacement code point (U+FFFD if that is what is chosen, but these APIs could as well specify annother replacement character or could eventually return a non-codepoint if the API return value is not restricted to only valid codepoints (for example the replacement could be a negative value whose absolute value matches the invalid code unit, or some other invalid code unit outside the valid range for code points with scalar values: isolated surrogates in UTF-16 for example could be returned as is, or made negative either by returning its opposite or by setting (or'ing) the most significant bit of the return value). 
The problem will arise when you need to store the replacement values if the internal backing store is limited to 16-bit code units or 8-bit code units: this internal backing store may use its own internal extension of standard UTF's, including the possibility of encoding NULLs as C0,80 (like what Java does with its "modified UTF-8 internal encoding used in its compiled binary classes and serializations), or internally using isolated trailing surrogates to store illformed UTF-8 input by or'ing these bytes with 0xDC00 that will be returned as an code point with no valid scalar value. For internally representing illformed UTF-16 sequences, there's no need to change anything. For internally representing ill-formed UTF-32 sequences (in fact limited to one 32-bitcode unit), with a 16bit internal backing store you may need to store 3 16bit values (three isolated trailing surrogates). For internally representing ill formed UTF-32 in an 8 bit backing store, you could use 0xC1 followed by 5 five trailing bytes (each one storing 7 bits of the initial ill-formed code unit from the UTF-32 input). What you'll do in the internal backing store will not be exposed to your API which will just return either valide codepoints with valid scalar values, or values outside the two valid subranges (so it could possibly negative values, or isolated trailing surrogates). That backing store can also substitute some valid input causing problems (such as NULLs) using 0xC0 plus another byte, that sequence being unexposed by your API which will still be able to return the expected codepoints (but with the minor caveat that the total number of returned codepoints will not match the actual size allocated for the internal backing store (that applications using that API won't even need to know how it is internally represented). In other words: any private extensions are possible internally, but it is possible to isolate it within a blackboxing API which will still be able to chose how to represent the input text (it may as well use a zlib-compressed backing store, or some stateless Huffmann compression based on a static statistic table configured and stored elsewhere, intiialized when you first instantiate your API). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 16:15:53 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 16 May 2017 21:15:53 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: > Faster ok, privided this does not break other uses, notably for random access within strings? Either way, this is a ?recommendation?. I don?t see how that can provide for not-?breaking other uses.? If it?s internal, you can do what you will, so if you need the 1:1 seeming parity, then you can do that internally. But if you?re depending on other APIs/libraries/data source/whatever, it would seem like you couldn?t count on that. (And probably shouldn?t even if it was a requirement rather than a recommendation). I?m wary of the idea of attempting random access on a stream that is also manipulating the stream at the same time (decoding apparently). The U+FFFD emitted by this decoding could also require a different # of bytes to reencode. 
Which might disrupt the presumed parity, depending on how the data access was being handled. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 16 16:19:52 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 16 May 2017 23:19:52 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: Another alternative for you API is to not return simple integer values, but return (read-only) instances of a Char32 class whose "scalar" property would normally be a valid codepoint with scalar value, or whose "string" property will be the actual character; but with another static property "isValidScalar" returning "true"; for other ill-formed sequences,"isValidScalar" will be false, the scalar value will be the initial code unit from the input (decoded from the internal representation in tyhe backing store) and the "string" property will be empty. You may also add a special "Char32" static instance representing end-of-file/end-of-string, whose property "isEOF" will be true, and property scalar will be typically -1, "isValid Scalar" will be false, and the "string" property will be the empty string. All this is possible independantly of the internal representation made in the backing store for its own code units (where it may use any extension of standard UTF's or any data compression scheme without exposing it) 2017-05-16 23:08 GMT+02:00 Philippe Verdy : > > > 2017-05-16 20:50 GMT+02:00 Shawn Steele : > >> But why change a recommendation just because it ?feels like?. As you >> said, it?s just a recommendation, so if that really annoyed someone, they >> could do something else (eg: they could use a single FFFD). >> >> >> >> If the recommendation is truly that meaningless or arbitrary, then we >> just get into silly discussions of ?better? that nobody can really answer. >> >> >> >> Alternatively, how about ?one or more FFFDs?? for the recommendation? >> >> >> >> To me it feels very odd to perhaps require writing extra code to detect >> an illegal case. The ?best practice? here should maybe be ?one or more >> FFFDs, whatever makes your code faster?. >> > > Faster ok, privided this does not break other uses, notably for random > access within strings, where UTF-8 is designed to allow searching backward > on a limited number of bytes (maximum 3) in order to find the leading byte, > and then check its value: > - if it's not found, return back to the initial position and amke the next > access return U+FFFD to signal the error of position: this trailing byte is > part of an ill-formed sequence, and for coherence, any further trailine > bytes fouind after it will **also** return U+FFFD to be coherent (because > these other trailing bytes may also be found bby random access to them. > - it the leading byte is found backward ut does not match the expected > number of trailing bytes after it, return back to the initial random > position where you'll return also U+FFFD. This means that the initial > leading byte (part of the ill-formed sequence) must also return a separate > U+FFFD, given that each following trailing byte will return U+FFFD > isolately when accessing to them. 
> > If we want coherent decoding with text handling promitives allowing random > access with encoded sequences, there's no other choice than treating EACH > byte part of the ill-formed sequence as individual errors mapped to the > same replacement code point (U+FFFD if that is what is chosen, but these > APIs could as well specify annother replacement character or could > eventually return a non-codepoint if the API return value is not restricted > to only valid codepoints (for example the replacement could be a negative > value whose absolute value matches the invalid code unit, or some other > invalid code unit outside the valid range for code points with scalar > values: isolated surrogates in UTF-16 for example could be returned as is, > or made negative either by returning its opposite or by setting (or'ing) > the most significant bit of the return value). > > The problem will arise when you need to store the replacement values if > the internal backing store is limited to 16-bit code units or 8-bit code > units: this internal backing store may use its own internal extension of > standard UTF's, including the possibility of encoding NULLs as C0,80 (like > what Java does with its "modified UTF-8 internal encoding used in its > compiled binary classes and serializations), or internally using isolated > trailing surrogates to store illformed UTF-8 input by or'ing these bytes > with 0xDC00 that will be returned as an code point with no valid scalar > value. For internally representing illformed UTF-16 sequences, there's no > need to change anything. For internally representing ill-formed UTF-32 > sequences (in fact limited to one 32-bitcode unit), with a 16bit internal > backing store you may need to store 3 16bit values (three isolated trailing > surrogates). For internally representing ill formed UTF-32 in an 8 bit > backing store, you could use 0xC1 followed by 5 five trailing bytes (each > one storing 7 bits of the initial ill-formed code unit from the UTF-32 > input). > > What you'll do in the internal backing store will not be exposed to your > API which will just return either valide codepoints with valid scalar > values, or values outside the two valid subranges (so it could possibly > negative values, or isolated trailing surrogates). That backing store can > also substitute some valid input causing problems (such as NULLs) using > 0xC0 plus another byte, that sequence being unexposed by your API which > will still be able to return the expected codepoints (but with the minor > caveat that the total number of returned codepoints will not match the > actual size allocated for the internal backing store (that applications > using that API won't even need to know how it is internally represented). > > In other words: any private extensions are possible internally, but it is > possible to isolate it within a blackboxing API which will still be able to > chose how to represent the input text (it may as well use a zlib-compressed > backing store, or some stateless Huffmann compression based on a static > statistic table configured and stored elsewhere, intiialized when you first > instantiate your API). > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed May 17 01:03:49 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Wed, 17 May 2017 09:03:49 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> Message-ID: On Tue, May 16, 2017 at 9:36 PM, Markus Scherer wrote: > Let me try to address some of the issues raised here. Thank you. > The proposal changes a recommendation, not a requirement. This is a very bad reason in favor of the change. If anything, this should be a reason why there is no need to change the spec text. > Conformance > applies to finding and interpreting valid sequences properly. This includes > not consuming parts of valid sequences when dealing with illegal ones, as > explained in the section "Constraints on Conversion Processes". > > Otherwise, what you do with illegal sequences is a matter of what you think > makes sense -- a matter of opinion and convenience. Nothing more. This may be the Unicode-level view of error handling. It isn't the Web-level view of error handling. In the world of Web standards (i.e. standards that read on the behavior of browser engines), we've learned that implementation-defined behavior is bad, because someone makes a popular site that depends on the implementation-defined behavior of the browser they happened to test in. For this reason, the WHATWG has since 2004 written specs that are well-defined even in corner cases and for non-conforming input, and we've tried to extend this culture into the W3C, too. (Sometimes, exceptions are made when there's a very good reason to handle a corner case differently in a given implementation: a recent example is CSS allowing the non-preservation of lone surrogates entering the CSS Object Model via JavaScript strings, in order to enable CSS Object Model implementations that use UTF-8 [really UTF-8 and not some almost-UTF-8 variant] internally. But, yes, we really do sweat the details on that level.) Even if one could argue that implementation-defined behavior on the topic of the number of U+FFFDs for ill-formed sequences in UTF-8 decode doesn't matter, the WHATWG way of doing things isn't to debate whether implementation-defined behavior matters in this particular case but to require one particular behavior in order to have well-defined behavior even when input is non-conforming. It further seems that there are people who do care about what's a *requirement* on the WHATWG level matching what's "best practice" on the Unicode level: https://www.w3.org/Bugs/Public/show_bug.cgi?id=19938 Now that major browsers agree, knowing what I know about how the WHATWG operates, while I can't speak for Anne, I expect the WHATWG spec to stay as-is, because it now matches the browser consensus. So as a practical matter, if Unicode now changes its "best practice", when people check consistency with Unicode-level "best practice" and notice a discrepancy, the WHATWG and developers of implementations that took the previously stated "best practice" seriously (either directly or by means of another spec, like the WHATWG Encoding Standard, elevating it to a *requirement*) will need to explain why they don't follow the best practice.
It is really inappropriate to inflict that trouble onto pretty much everyone except ICU when the rationale for the change is as flimsy as "feels right". And, as noted earlier, politically it looks *really bad* for Unicode to change its own previous recommendation to side with ICU not following it, when a number of other prominent implementations do follow it. > I believe that the discussion of how to handle illegal sequences came out of > security issues a few years ago from some implementations including valid > single and lead bytes with preceding illegal sequences. ... > Why do we care how we carve up an illegal sequence into subsequences? Only > for debugging and visual inspection. ... > If you don't like some recommendation, then do something else. It does not > matter. If you don't reject the whole input but instead choose to replace > illegal sequences with something, then make sure the something is not > nothing -- replacing with an empty string can cause security issues. > Otherwise, what the something is, or how many of them you put in, is not > very relevant. One or more U+FFFDs is customary. Given that the recommendation came about for security reasons, it's a really bad idea to suggest that implementors should decide on their own what to do and trust that their decision deviates little enough from the suggestion to stay on the secure side. To be clear, I'm not, at this time, claiming that the number of U+FFFDs has a security consequence as long as the number is at least one, but there's an awfully short slippery slope to giving the caller of a converter API the option to "ignore errors", i.e. make the number zero, which *is*, as you note, a security problem. > When the current recommendation came in, I thought it was reasonable but > didn't like the edge cases. At the time, I didn't think it was important to > twiddle with the text in the standard, and I didn't care that ICU didn't > exactly implement that particular recommendation. If ICU doesn't care, then it should be ICU developers, and not the developers of other implementations, who respond to bug reports about not following the "best practice". > Karl Williamson sent feedback to the UTC, "In short, I believe the best > practices are wrong." I think "wrong" is far too strong, but I got an action > item to propose a change in the text. I proposed a modified recommendation. > Nothing gets elevated to "right" that wasn't, nothing gets demoted to > "wrong" that was "right". I find it shocking that the Unicode Consortium would change a widely-implemented part of the standard (regardless of whether Unicode itself officially designates it as a requirement or suggestion) on such flimsy grounds. I'd like to register my feedback that I believe changing the best practices is wrong.
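(For concreteness, the following table-style sketch shows what the two practices disagree about for a couple of ill-formed inputs; the "proposed" column reflects my reading of the proposal and is an illustration, not quoted text. U+FFFD is written here as its UTF-8 bytes EF BF BD.)

    // Each entry: ill-formed input, output under the TUS 5.2-9.0 best practice
    // (one U+FFFD per maximal subpart of a well-formed sequence), and output
    // under the proposed change (one U+FFFD per lead byte plus its trailing bytes).
    struct Example {
        const char *input;
        const char *currentBestPractice;
        const char *proposedPractice;
    };

    static const Example kExamples[] = {
        // E0 80 80: an over-long encoding of U+0000. 80 is not a valid
        // continuation after E0, so the current practice emits three U+FFFD;
        // the proposal treats the three bytes as one unit and emits one.
        { "\xE0\x80\x80",
          "\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD",
          "\xEF\xBF\xBD" },
        // F0 80 80 41: same idea, followed by a valid 'A'.
        { "\xF0\x80\x80\x41",
          "\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\x41",
          "\xEF\xBF\xBD\x41" },
    };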
> no one is forced to do something they don't like I don't believe this to be *practically* true when 1) other specs elevate into requirements what are mere suggestions on the Unicode level, 2) people who read specs carefully file bugs for discrepancies between implementations and best practice, and 3) test suites will test things a particular way, and the easy way for test suite authors to settle arguments is to let the "best practice" win. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Wed May 17 03:07:25 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 17 May 2017 09:07:25 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170516204358.15f6656a@JRWUBU2> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <2CAF7168-E373-4FE8-B1B5-54C2F8B46DA2@alastairs-place.net> <86DFE59E-E144-4B61-9DA4-CF7EF4855A03@telia.com> <9719D9DA-3EBE-42E5-B37C-FA3214A5E92D@alastairs-place.net> <20170516204358.15f6656a@JRWUBU2> Message-ID: <055E6AF6-03EB-42AC-BC13-6732A0034B81@alastairs-place.net> > On 16 May 2017, at 20:43, Richard Wordingham via Unicode wrote: > > On Tue, 16 May 2017 11:36:39 -0700 > Markus Scherer via Unicode wrote: > >> Why do we care how we carve up an illegal sequence into subsequences? >> Only for debugging and visual inspection. Maybe some process is using >> illegal, overlong sequences to encode something special (à la Java >> string serialization, "modified UTF-8"), and for that it might be >> convenient too to treat overlong sequences as single errors. > > I think that's not quite true. If we are moving back and forth through > a buffer containing corrupt text, we need to make sure that moving three > characters forward and then three characters back leaves us where we > started. That requires internal consistency. That's very true. But the proposed change doesn't actually affect that; it's still the case that you can correctly identify boundaries in both directions. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Wed May 17 15:36:08 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 13:36:08 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517133608.665a7a7059d7ee80bb4d670165c8327d.a4f7627e87.wbe@email03.godaddy.com> Hans Åberg wrote: > It would be useful, for use with filesystems, to have Unicode > codepoint markers that indicate how UTF-8, including non-valid > sequences, is translated into UTF-32 in a way that the original > octet sequence can be restored. I have always argued strongly against this idea, and always will. Far from solving the stated problem, it would introduce a new one: conversion from the "bad data" Unicode code points, currently well-defined, would become ambiguous. Suppose the block U+EFFxx were assigned to invalid UTF-8 bytes <xx>. Then there would be two possible conversions from, for instance, U+EFF80: either the single invalid byte <80>, or the ordinary four-byte UTF-8 encoding of the code point U+EFF80 itself. Declaring the "special" code points to be excluded from straightforward UTF-* conversion would invalidate every existing UTF-* processor, and would be widely ignored. File systems cannot have it both ways: they must define file names either as unrestricted sequences of bytes, or as strings of characters in some defined encoding. If they choose the latter, they need to define conversion mechanisms with suitable fallback and adhere to them. They can use the PUA if they like.
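(For what it's worth, one deployed way of getting the byte-for-byte round trip asked for above, while narrowing -- though not eliminating -- the ambiguity just described, is the "surrogateescape" approach of Python's PEP 383: each undecodable byte 0x80..0xFF is mapped to a lone surrogate U+DC80..U+DCFF internally and mapped back on output. A rough sketch, with helper names of my own:)

    #include <cstdint>

    // Map a byte that could not be decoded as UTF-8 to an internal escape
    // code point, and reverse the mapping when writing the data back out.
    // Lone surrogates never occur in well-formed UTF-8, so valid input is
    // never mistaken for an escape; strings that already contain lone
    // surrogates remain the ambiguous corner case.
    inline char32_t     escapeByte(std::uint8_t b)    { return 0xDC00u + b; }
    inline bool         isEscapedByte(char32_t cp)    { return cp >= 0xDC80 && cp <= 0xDCFF; }
    inline std::uint8_t unescapeByte(char32_t cp)     { return static_cast<std::uint8_t>(cp & 0xFFu); }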
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 15:37:51 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 13:37:51 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> Richard Wordingham wrote: >> It is not at all clear what the intent of the encoder was - or even >> if it's not just a problem with the data stream. E0 80 80 is not >> permitted, it's garbage. An encoder can't "intend" it. > > It was once a legal way of encoding NUL, just like C0 80, which is > still in use, and seems to be the best way of storing NUL as character > content in a *C string*. I wish I had a penny for every time I'd seen this urban legend. At http://doc.cat-v.org/bell_labs/utf-8_history you can read the original definition of UTF-8, from Ken Thompson on 1992-09-08, so long ago that it was still called FSS-UTF: "When there are multiple ways to encode a value, for example UCS 0, only the shortest encoding is legal." Unicode once permitted implementations to *decode* non-shortest forms, but never allowed an implementation to *create* them (http://www.unicode.org/versions/corrigendum1.html): "For example, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 conformant process may map the code value sequence C0 80 (11000000₂ 10000000₂) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence -- it shall generate the sequence 00 (00000000₂) instead." This was the passage that was deleted as part of Corrigendum #1. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 15:41:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 13:41:56 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Henri Sivonen wrote: > I find it shocking that the Unicode Consortium would change a > widely-implemented part of the standard (regardless of whether Unicode > itself officially designates it as a requirement or suggestion) on > such flimsy grounds. > > I'd like to register my feedback that I believe changing the best > practices is wrong. Perhaps surprisingly, it's already too late. UTC approved this change the day after the proposal was written. http://www.unicode.org/L2/L2017/17103.htm#151-C19 -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 16:05:47 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 17 May 2017 23:05:47 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517133608.665a7a7059d7ee80bb4d670165c8327d.a4f7627e87.wbe@email03.godaddy.com> References: <20170517133608.665a7a7059d7ee80bb4d670165c8327d.a4f7627e87.wbe@email03.godaddy.com> Message-ID: <3264CAA2-8A83-414D-BDA7-D5D5FB5455CF@telia.com> > On 17 May 2017, at 22:36, Doug Ewell via Unicode wrote: > > Hans Åberg wrote: > >> It would be useful, for use with filesystems, to have Unicode >> codepoint markers that indicate how UTF-8, including non-valid >> sequences, is translated into UTF-32 in a way that the original >> octet sequence can be restored. > > I have always argued strongly against this idea, and always will.
> > Far from solving the stated problem, it would introduce a new one: > conversion from the "bad data" Unicode code points, currently > well-defined, would become ambiguous. Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32. No Unicode extensions are needed, as Unicode has no say about what happens to what it considers invalid. > File systems cannot have it both ways: they must define file names > either as unrestricted sequences of bytes, or as strings of characters > in some defined encoding. If they choose the latter, they need to define > conversion mechanisms with suitable fallback and adhere to them. They > can use the PUA if they like. The latter is complicated, so I am told that is not what one does, with some exceptions. Also, one may end up with a file in an unknown encoding, say imported remotely, and then the OS cannot deal with it. From unicode at unicode.org Wed May 17 16:18:15 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 14:18:15 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517141815.665a7a7059d7ee80bb4d670165c8327d.0df684298f.wbe@email03.godaddy.com> Hans Åberg wrote: >> Far from solving the stated problem, it would introduce a new one: >> conversion from the "bad data" Unicode code points, currently >> well-defined, would become ambiguous. > > Actually not: just translate the invalid UTF-8 sequences into invalid > UTF-32. Far from solving the stated problem, it would introduce TWO new ones... -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 16:21:54 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 17 May 2017 23:21:54 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517141815.665a7a7059d7ee80bb4d670165c8327d.0df684298f.wbe@email03.godaddy.com> References: <20170517141815.665a7a7059d7ee80bb4d670165c8327d.0df684298f.wbe@email03.godaddy.com> Message-ID: > On 17 May 2017, at 23:18, Doug Ewell wrote: > > Hans Åberg wrote: > >>> Far from solving the stated problem, it would introduce a new one: >>> conversion from the "bad data" Unicode code points, currently >>> well-defined, would become ambiguous. >> >> Actually not: just translate the invalid UTF-8 sequences into invalid >> UTF-32. > > Far from solving the stated problem, it would introduce TWO new ones... There is no good solution to the problem of illegal UTF-8 sequences, as the intent of those is not known. From unicode at unicode.org Wed May 17 16:31:42 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 17 May 2017 22:31:42 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Message-ID: <20170517223142.4b44687f@JRWUBU2> On Wed, 17 May 2017 13:41:56 -0700 Doug Ewell via Unicode wrote: > Perhaps surprisingly, it's already too late. UTC approved this change > the day after the proposal was written. > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 Approved for Unicode 11.0. Unicode 10.0 has yet to be released. The change may still be rescinded. There's some sort of rule that proposals should be made seven days in advance of the meeting.
I can't find it now, so I'm not sure whether the actual rule was followed, let alone what authority it has. Richard. From unicode at unicode.org Wed May 17 17:04:18 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 17 May 2017 23:04:18 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> References: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> Message-ID: <20170517230418.624585ad@JRWUBU2> On Wed, 17 May 2017 13:37:51 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > >> It is not at all clear what the intent of the encoder was - or even > >> if it's not just a problem with the data stream. E0 80 80 is not > >> permitted, it's garbage. An encoder can't "intend" it. > > > > It was once a legal way of encoding NUL, just like C0 80, which is > > still in use, and seems to be the best way of storing NUL as > > character content in a *C string*. > > I wish I had a penny for every time I'd seen this urban legend. > > At http://doc.cat-v.org/bell_labs/utf-8_history you can read the > original definition of UTF-8, from Ken Thompson on 1992-09-08, so long > ago that it was still called FSS-UTF: > > "When there are multiple ways to encode a value, for example > UCS 0, only the shortest encoding is legal." > > Unicode once permitted implementations to *decode* non-shortest forms, > but never allowed an implementation to *create* them > (http://www.unicode.org/versions/corrigendum1.html): > > "For example, UTF-8 allows nonshortest code value sequences to be > interpreted: a UTF-8 conformant process may map the code value sequence C0 80 > (11000000₂ 10000000₂) to the Unicode value U+0000, even though a > UTF-8 conformant process shall never generate that code value sequence > -- it shall generate the sequence 00 (00000000₂) instead." > > This was the passage that was deleted as part of Corrigendum #1. So it was still a legal way for a non-UTF-8-compliant process! Note for example that a compliant implementation of full upper-casing shall convert the canonically equivalent strings <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE> and <U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, U+0345 COMBINING GREEK YPOGEGRAMMENI> to the canonically inequivalent strings <U+0391 GREEK CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0313> and <U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI, U+0399 GREEK CAPITAL LETTER IOTA>. A compliant Unicode process may not assume that this is the right thing to do. (Or are some compliant Unicode processes required to incorrectly believe that they are doing something they mustn't do?) Richard. From unicode at unicode.org Wed May 17 17:31:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 15:31:56 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170517153156.665a7a7059d7ee80bb4d670165c8327d.2cf6d49d41.wbe@email03.godaddy.com> Richard Wordingham wrote: > So it was still a legal way for a non-UTF-8-compliant process! Anything is possible if you are non-compliant. You can encode U+263A with 9,786 FF bytes followed by a terminating FE byte and call that "UTF-8," if you are willing to be non-compliant enough. > Note for example that a compliant implementation of full upper-casing > shall convert the canonically equivalent strings <U+1FB3 GREEK SMALL > LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE> and > <U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, U+0345 COMBINING GREEK > YPOGEGRAMMENI> to the canonically inequivalent strings <U+0391 GREEK > CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0313> and > <U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI, U+0399 GREEK CAPITAL > LETTER IOTA>. A compliant Unicode process may not assume that this is > the right thing to do.
(Or are some compliant Unicode processes > required to incorrectly believe that they are doing something they > mustn't do?) I'm afraid I don't get the analogy. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 17 18:11:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 May 2017 00:11:53 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517153156.665a7a7059d7ee80bb4d670165c8327d.2cf6d49d41.wbe@email03.godaddy.com> References: <20170517153156.665a7a7059d7ee80bb4d670165c8327d.2cf6d49d41.wbe@email03.godaddy.com> Message-ID: <20170518001153.14315d22@JRWUBU2> On Wed, 17 May 2017 15:31:56 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > > So it was still a legal way for a non-UTF-8-compliant process! > > Anything is possible if you are non-compliant. You can encode U+263A > with 9,786 FF bytes followed by a terminating FE byte and call that > "UTF-8," if you are willing to be non-compliant enough. > > > Note for example that a compliant implementation of full > > upper-casing shall convert the canonically equivalent strings > > <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 > > COMBINING COMMA ABOVE> and <U+1F00 GREEK SMALL LETTER ALPHA WITH > > PSILI, U+0345 COMBINING GREEK YPOGEGRAMMENI> to the canonically > > inequivalent strings <U+0391 GREEK CAPITAL LETTER ALPHA, U+0399 > > GREEK CAPITAL LETTER IOTA, U+0313> and <U+1F08 GREEK CAPITAL LETTER > > ALPHA WITH PSILI, U+0399 GREEK CAPITAL LETTER IOTA>. A compliant > > Unicode process may not assume that this is the right thing to do. > > (Or are some compliant Unicode processes required to incorrectly > > believe that they are doing something they mustn't do?) > > I'm afraid I don't get the analogy. You can't build a full Unicode system out of Unicode-compliant parts. However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8 (in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the critical wording, "When converting from UTF-8 to Unicode values, however, implementations do not need to check that the shortest encoding is being used,...". There was no prohibition on implementations performing the check, so whether C0 80 would be interpreted as U+0000 or as an error was unpredictable. Richard. From unicode at unicode.org Wed May 17 18:41:53 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 17 May 2017 16:41:53 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517223142.4b44687f@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> Message-ID: <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 17 19:04:55 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 18 May 2017 02:04:55 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: I find it intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with state machines that would NEVER need to parse more than 4 bytes.
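(A rough sketch of a forward decoder in the spirit of the proposed change, to show that lookahead stays bounded at four bytes either way: the lead byte announces how many trailing bytes may follow, and the scan stops at the first byte outside 80..BF. The treatment of C0, C1 and F5..FF lead bytes here is a simplification of mine, not the proposal's text.)

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Decode the code point starting at in[pos], advancing pos.
    // A lead byte plus the trailing bytes (80..BF) that follow it, up to the
    // count the lead byte announces, is consumed as a single U+FFFD error even
    // if the result is over-long, a surrogate, or above U+10FFFF.
    // At most 4 bytes are ever examined.
    char32_t decodeNext(const std::vector<std::uint8_t>& in, std::size_t& pos) {
        const std::uint8_t b0 = in[pos++];
        if (b0 < 0x80) return b0;                         // ASCII
        int expected;                                     // trailing bytes announced
        if      (b0 >= 0xC2 && b0 <= 0xDF) expected = 1;
        else if (b0 >= 0xE0 && b0 <= 0xEF) expected = 2;
        else if (b0 >= 0xF0 && b0 <= 0xF4) expected = 3;
        else return 0xFFFD;                               // 80..BF, C0, C1, F5..FF
        char32_t cp = b0 & (0x3F >> expected);            // payload bits of the lead byte
        int seen = 0;
        while (seen < expected && pos < in.size() && (in[pos] & 0xC0) == 0x80) {
            cp = (cp << 6) | (in[pos++] & 0x3F);
            ++seen;
        }
        if (seen < expected) return 0xFFFD;               // truncated: one error for the run
        static const char32_t minByLen[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < minByLen[expected] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0xFFFD;                                // over-long, surrogate, out of range
        return cp;
    }

A caller simply loops while pos < in.size(); scanning backward needs to inspect at most three preceding bytes to find the lead byte again.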
For me, as soon as the first byte encountered is invalid, the current sequence should be stopped there and treated as an error (replaced by U+FFFD if replacement is enabled, instead of returning an error or throwing an exception), and then any further trailing byte should be treated as an isolated error. The number of returned U+FFFD replacements would then be the same whether you scan the input forward or backward, without **ever** reading more than 4 bytes in either direction (this becomes a problem when the parsing reaches the end of a buffer, where you'll block on performing I/O to read the previous or next block; managing a cache of multiple blocks (more than 2) is a problem with this unexpected change, which will create new performance problems and add new memory constraints, in addition to new possible attacks if the parser needs to keep multiple buffers in memory instead of treating them individually, with a single overhead buffer, throwing the individual buffers away on the fly as soon as each is fully parsed). 2017-05-18 1:41 GMT+02:00 Asmus Freytag via Unicode : > On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: > > There's some sort of rule that proposals should be made seven days in > advance of the meeting. I can't find it now, so I'm not sure whether > the actual rule was followed, let alone what authority it has. > > Ideally, proposals that update algorithms or properties of some > significance should be required to be reviewed in more than one pass. The > procedures of the UTC are a bit weak in that respect, at least compared to > other standards organizations. The PRI process addresses that issue to some > extent. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 17 20:48:59 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 17 May 2017 19:48:59 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517230418.624585ad@JRWUBU2> References: <20170517133751.665a7a7059d7ee80bb4d670165c8327d.df8bc92afc.wbe@email03.godaddy.com> <20170517230418.624585ad@JRWUBU2> Message-ID: Richard Wordingham wrote: >> I'm afraid I don't get the analogy. > > You can't build a full Unicode system out of Unicode-compliant parts. Others will have to address Richard's point about canonical-equivalent sequences. > However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8 > (in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the > critical wording, "When converting from UTF-8 to Unicode values, > however, implementations do not need to check that the shortest > encoding is being used,...". There was no prohibition on > implementations performing the check, so whether C0 80 would be > interpreted as U+0000 or as an error was unpredictable. So it is as I said, and as TUS said before Corrigendum #1 was approved, more than 16 years ago: It was not legal to create overlong sequences, but implementations were allowed to interpret any that they came across. As someone who pays attention to the fine details, you will certainly appreciate the difference between "it was once legal to encode NUL as E0 80 80" and "it was once legal for a decoder to interpret the sequence E0 80 80 as NUL instead of rejecting it."
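(A small sketch of the check involved, mine and purely for illustration: after the payload bits of a two-, three- or four-byte sequence have been assembled, a modern decoder rejects the result if it is below the minimum value for that length, which is exactly what rules out C0 80 and E0 80 80 as encodings of U+0000 today.)

    // True if 'decoded' could have been encoded in fewer bytes than
    // 'sequenceLength', i.e. the sequence is a non-shortest (over-long) form
    // and must be rejected by a conformant UTF-8 decoder.
    inline bool isOverlong(char32_t decoded, int sequenceLength) {
        switch (sequenceLength) {
            case 2:  return decoded < 0x80;       // e.g. C0 80       -> U+0000
            case 3:  return decoded < 0x800;      // e.g. E0 80 80    -> U+0000
            case 4:  return decoded < 0x10000;    // e.g. F0 80 80 80 -> U+0000
            default: return false;                // one-byte sequences cannot be over-long
        }
    }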
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 18 00:01:49 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 May 2017 06:01:49 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <20170518060149.1710f1bb@JRWUBU2> On Thu, 18 May 2017 02:04:55 +0200 Philippe Verdy via Unicode wrote: > I find it intriguing that the update intends to enforce the decoding > of the **shortest** sequences, but now wants to treat **maximal > sequences** as a single unit with arbitrary length. UTF-8 was > designed to work only with state machines that would NEVER need > to parse more than 4 bytes. If you look at the sample code in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that it's working with 6-byte sequences. It's the Unicode, as opposed to ISO 10646, version that has always been restricted to 4 bytes. Richard. From unicode at unicode.org Thu May 18 01:18:48 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 18 May 2017 09:18:48 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: On Thu, May 18, 2017 at 2:41 AM, Asmus Freytag via Unicode wrote: > On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: > > There's some sort of rule that proposals should be made seven days in > advance of the meeting. I can't find it now, so I'm not sure whether > the actual rule was followed, let alone what authority it has. > > Ideally, proposals that update algorithms or properties of some significance > should be required to be reviewed in more than one pass. The procedures of > the UTC are a bit weak in that respect, at least compared to other standards > organizations. The PRI process addresses that issue to some extent. What action should I take to have proposals considered by the UTC? I'd like to make two: 1) Substantive: Reverse the decision to modify the U+FFFD best practice when decoding UTF-8. (I think the decision lacked a truly compelling reason to change something that has a number of prominent implementations, and the decision complicates U+FFFD generation when validating UTF-8 by state machine. Aesthetic considerations in error handling shouldn't outweigh multiple prominent implementations and shouldn't introduce implementation complexity.) 2) Procedural: To be considered in the future, proposals to change what the standard suggests or requires implementations to do should consider different implementation strategies and discuss the impact of the change in the light of those strategies (in the matter at hand, I think the proposal should have included a discussion of the impact on UTF-8 validation state machines), and should include a review of what prominent implementations, including major browser engines, operating system libraries, and standard libraries of well-known programming languages, already do.
(The more established the presently specced behavior is among prominent implementations, the more compelling the reason required to change the spec should be. An implementation hosted by the Consortium itself shouldn't have special weight compared to other prominent implementations.) -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Thu May 18 02:54:11 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 18 May 2017 08:54:11 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <44B15C6B-B06C-4625-9CF3-CE893BEB3101@alastairs-place.net> On 18 May 2017, at 01:04, Philippe Verdy via Unicode wrote: > > I find it intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with state machines that would NEVER need to parse more than 4 bytes. This won't change. You still don't need to parse more than four bytes. In fact, you don't need to do *anything*, even if your implementation doesn't match the proposal, because *it's only a recommendation*. But if you did choose to do something, you *still* don't need to scan arbitrary numbers of bytes. > For me, as soon as the first byte encountered is invalid, the current sequence should be stopped there and treated as an error (replaced by U+FFFD if replacement is enabled, instead of returning an error or throwing an exception), This is still essentially true under the proposal; the only difference is that instead of being a clever dick and taking account of the valid *code point* ranges while doing this, in order to ban certain trailing bytes given the values of their predecessors, you allow any trailing byte, and only worry about whether the complete sequence represents a valid code point or is over-long once you've finished reading it. You never need to read more than four bytes under the new proposal, because the lead byte tells you how many to expect, and you'd still stop and instantly replace with U+FFFD if you see a byte outside the 0x80-0xBF range, even if you hadn't scanned the number of bytes the lead byte says to expect. This also *does not* change the view of the underlying UTF-8 string based on iteration direction; you would still generate the exact same sequence of code points in both directions. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu May 18 02:55:49 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 18 May 2017 08:55:49 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170518060149.1710f1bb@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> <20170518060149.1710f1bb@JRWUBU2> Message-ID: On 18 May 2017, at 06:01, Richard Wordingham via Unicode wrote: > > On Thu, 18 May 2017 02:04:55 +0200 > Philippe Verdy via Unicode wrote: > >> I find it intriguing that the update intends to enforce the decoding >> of the **shortest** sequences, but now wants to treat **maximal >> sequences** as a single unit with arbitrary length.
>> UTF-8 was designed to work only with state machines that would NEVER need >> to parse more than 4 bytes. > > If you look at the sample code in > http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that > it's working with 6-byte sequences. It's the Unicode, as opposed to > ISO 10646, version that has always been restricted to 4 bytes. There are good reasons for restricting it to four-byte sequences, mind; doing so increases the number of invalid code units, which makes it easier to detect UTF-8 versus not-UTF-8. I don't think anyone is proposing allowing 5-byte or 6-byte sequences. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu May 18 03:30:24 2017 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Thu, 18 May 2017 10:30:24 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170516142153.3e146371@JRWUBU2> References: <71F0642E-69A0-4349-A842-AB62FA65BB4E@telia.com> <20170516142153.3e146371@JRWUBU2> Message-ID: <98319FCB-60B3-4782-9670-9865D5DE0AEC@telia.com> > On 16 May 2017, at 15:21, Richard Wordingham via Unicode wrote: > > On Tue, 16 May 2017 14:44:44 +0200 > Hans Åberg via Unicode wrote: > >>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode >>> wrote: >> ... >>> I think Unicode should not adopt the proposed change. >> >> It would be useful, for use with filesystems, to have Unicode >> codepoint markers that indicate how UTF-8, including non-valid >> sequences, is translated into UTF-32 in a way that the original octet >> sequence can be restored. > > Escape sequences for the inappropriate bytes is the natural technique. > Your problem is smoothly transitioning so that the escape character is > always escaped when it means itself. Strictly, it can't be done. > > Of course, some sequences of escaped characters should be prohibited. > Checking could be fiddly. One could write the bytes using \xnn escape codes, sequences terminated using \& as in Haskell, translating '\' into "\\". It then becomes a C-encoded string, not plain text. From unicode at unicode.org Thu May 18 03:58:43 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Thu, 18 May 2017 09:58:43 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: On 18 May 2017, at 07:18, Henri Sivonen via Unicode wrote: > > the decision complicates U+FFFD generation when validating UTF-8 by state machine. It *really* doesn't. Even if you're hell-bent on using a pure state machine approach, you need to add maybe two additional error states (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on top of the states you already have. The implementation complexity argument is a *total* red herring. > 2) Procedural: To be considered in the future, proposals to change > what the standard suggests or requires implementations to do should > consider different implementation strategies and discuss the impact of > the change in the light of the different implementation strategies (in > the matter at hand, I think the proposal should have included a > discussion of the impact on UTF-8 validation state machines) Well, let's discuss that here and now (see above).
Do you, for some reason, think that it's more complicated than I suggest? Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Thu May 18 06:40:43 2017 From: unicode at unicode.org (zelpa via Unicode) Date: Thu, 18 May 2017 21:40:43 +1000 Subject: Petition to ban Google from designing emoji Message-ID: http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 09:06:04 2017 From: unicode at unicode.org (David Faulks via Unicode) Date: Thu, 18 May 2017 14:06:04 +0000 (UTC) Subject: Petition to ban Google from designing emoji References: <1560827334.707069.1495116364372.ref@mail.yahoo.com> Message-ID: <1560827334.707069.1495116364372@mail.yahoo.com> And what makes you think Unicode has any authority to "ban" Google from doing anything at all (hint: Unicode has zero ability to enforce anything). -------------------------------------------- On Thu, 5/18/17, zelpa via Unicode wrote: Subject: Petition to ban Google from designing emoji To: "Unicode Public" Received: Thursday, May 18, 2017, 7:40 AM http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. From unicode at unicode.org Thu May 18 09:07:06 2017 From: unicode at unicode.org (Rebecca T via Unicode) Date: Thu, 18 May 2017 10:07:06 -0400 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: Well, you're certainly not alone in your distaste for the new design. @eevee just today said "cool how we improved gender diversity by slowly changing from 'ambiguous/neutral' to 'explicit color-coded binary, default usually male'" On the other hand, quoting @zaccolley: "if you treat emoji like pictures: yay blobs, if you treat emoji like language: yay consistency" Ultimately, the new emoji designs will make our digital communication less ambiguous -- I'm just not sure if that's a good change or not, and I certainly don't enjoy Apple being the default (on principle and for their designs specifically). Quoting UTR #51: "General-purpose emoji for people and body parts should also not be given overly specific images: the general recommendation is to be as neutral as possible regarding race, ethnicity, and gender." Unambiguously, Apple has failed to meet these technical guidelines, in a blatant and unapologetic manner, and that's why I liked the blobs --
they bucked norms, refused to conform to trends, and made emoji more friendly to people who didn't want to attach a gender to their every expression. I think that's valuable and I'm sad to see it go. And a serious response to this joke letter: Given that Google pays $18,000 / annum to vote on new emoji, it seems unlikely that the Consortium will just kick them out. On Thu, May 18, 2017 at 7:40 AM, zelpa via Unicode wrote: > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji > design? First they bastardise the cute blob emoji, then they make their > emoji gendered, now they've literally just copied Apple's emoji. It's > sickening. Disgusting. I propose we hold a petition for the Unicode > Consortium to ban Google from designing emoji in the future, and make them > revert back to the Android 5 designs. Everyone in favour of this please > leave a response, anybody not in favour please rethink your opinion. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 09:26:53 2017 From: unicode at unicode.org (zelpa via Unicode) Date: Fri, 19 May 2017 00:26:53 +1000 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: >Unambiguously, Apple has failed to meet these technical guidelines, >in a blatant and unapologetic manner, and that's why I liked the blobs -- >they bucked norms, refused to conform to trends, and made emoji more >friendly to people who didn't want to attach a gender to their every >expression. I think that's valuable and I'm sad to see it go. At least someone realised it was a (half) joke. This is my real issue: Apple disregards guidelines, sets a de facto standard, and Google races to copy them. It's actually sad to see how far other vendors will go to copy Apple's designs. I honestly think the consortium should try harder to enforce the guidelines instead of letting Apple be the ruler and expecting others to obey. On Fri, May 19, 2017 at 12:07 AM, Rebecca T <637275 at gmail.com> wrote: > Well, you're certainly not alone in your distaste for the new design. > @eevee > just today said "cool how we improved gender diversity by slowly changing > from 'ambiguous/neutral' to 'explicit color-coded binary, default usually > male'" > > On the other hand, quoting @zaccolley: "if you treat emoji like pictures: > yay blobs, if you treat emoji like language: yay consistency" > > Ultimately, the new emoji designs will make our digital communication less > ambiguous -- I'm just not sure if that's a good change or not, and I > certainly don't enjoy Apple being the default (on principle and for their > designs specifically). > > Quoting UTR #51: "General-purpose emoji for people and body parts should > also not be given overly specific images: the general recommendation is to > be as neutral as possible regarding race, ethnicity, and gender." > > Unambiguously, Apple has failed to meet these technical guidelines, > in a blatant and unapologetic manner, and that's why I liked the blobs -- > they bucked norms, refused to conform to trends, and made emoji more > friendly to people who didn't want to attach a gender to their every > expression. I think that's valuable and I'm sad to see it go. > > And a serious response to this joke letter: Given that Google pays $18,000 / > annum to vote on new emoji, it seems unlikely that the Consortium will just > kick them out.
> On Thu, May 18, 2017 at 7:40 AM, zelpa via Unicode wrote: > >> http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ >> >> Is this some kind of joke? Have Google put ANY thought into their emoji >> design? First they bastardise the cute blob emoji, then they make their >> emoji gendered, now they've literally just copied Apple's emoji. It's >> sickening. Disgusting. I propose we hold a petition for the Unicode >> Consortium to ban Google from designing emoji in the future, and make them >> revert back to the Android 5 designs. Everyone in favour of this please >> leave a response, anybody not in favour please rethink your opinion. >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 09:41:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 18 May 2017 07:41:56 -0700 Subject: Petition to ban Google from designing emoji Message-ID: <20170518074156.665a7a7059d7ee80bb4d670165c8327d.333da282b2.wbe@email03.godaddy.com> zelpa wrote: > This is my real issue, Apple disregards guidelines, sets a de facto > standard, Google races to copy them. It's actually sad to see how far > other vendors will go to copy Apple's designs. I honestly think the > consortium should try harder to enforce the guidelines instead of > letting Apple be the ruler and expecting others to obey. Given that one co-chair of the Emoji Subcommittee is from Apple and the other is from Google, you may wish to rethink your expectations about all this. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 18 09:16:31 2017 From: unicode at unicode.org (Gabriel von Dehn via Unicode) Date: Thu, 18 May 2017 17:16:31 +0300 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> Hi, the Unicode Consortium does not and cannot "ban" vendors from designing emojis the way they wish. Unicode merely gives recommendations on how the characters should be displayed. Think of the different designs on different platforms like different fonts you can use (because that is actually what they *are*): They all look slightly different and no one would hold a petition for the design of characters in a font to change. As for the gendered Emojis, those are in the Unicode specification now: http://emojipedia.org/emoji-4.0/ If you do not like the upcoming Emoji design from Google (or anything about the upcoming version of Android), you can report to Google directly, but posting on this List won't help. > On 18 May 2017, at 14:40, zelpa via Unicode wrote: > > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu May 18 09:38:34 2017 From: unicode at unicode.org (Gabriel von Dehn via Unicode) Date: Thu, 18 May 2017 17:38:34 +0300 Subject: Petition to ban Google from designing emoji In-Reply-To: References: Message-ID: <460AC842-881C-40F3-9E12-A89936641007@gmail.com> As said, Unicode does not and cannot enforce anything. Unicode sets the recommendation, but has no power whatsoever to make every vendor meet these recommendations, nor does it expect vendors to follow Apple's designs. > On 18 May 2017, at 17:26, zelpa via Unicode wrote: > > >Unambiguously, Apple has failed to meet these technical guidelines, > >in a blatant and unapologetic manner, and that's why I liked the blobs -- > >they bucked norms, refused to conform to trends, and made emoji more > >friendly to people who didn't want to attach a gender to their every > >expression. I think that's valuable and I'm sad to see it go. > > At least someone realised it was a (half) joke. This is my real issue: Apple disregards guidelines, sets a de facto standard, and Google races to copy them. It's actually sad to see how far other vendors will go to copy Apple's designs. I honestly think the consortium should try harder to enforce the guidelines instead of letting Apple be the ruler and expecting others to obey. > > On Fri, May 19, 2017 at 12:07 AM, Rebecca T <637275 at gmail.com > wrote: > Well, you're certainly not alone in your distaste for the new design. @eevee > just today said "cool how we improved gender diversity by slowly changing > from 'ambiguous/neutral' to 'explicit color-coded binary, default usually > male'" > > On the other hand, quoting @zaccolley: "if you treat emoji like pictures: > yay blobs, if you treat emoji like language: yay consistency" > > Ultimately, the new emoji designs will make our digital communication less > ambiguous -- I'm just not sure if that's a good change or not, and I > certainly don't enjoy Apple being the default (on principle and for their > designs specifically). > > Quoting UTR #51: "General-purpose emoji for people and body parts should > also not be given overly specific images: the general recommendation is to > be as neutral as possible regarding race, ethnicity, and gender." > > Unambiguously, Apple has failed to meet these technical guidelines, > in a blatant and unapologetic manner, and that's why I liked the blobs -- > they bucked norms, refused to conform to trends, and made emoji more > friendly to people who didn't want to attach a gender to their every > expression. I think that's valuable and I'm sad to see it go. > > And a serious response to this joke letter: Given that Google pays $18,000 / > annum to vote on new emoji, it seems unlikely that the Consortium will just > kick them out. > > > On Thu, May 18, 2017 at 7:40 AM, zelpa via Unicode wrote: > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu May 18 10:35:10 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 May 2017 08:35:10 -0700 Subject: Petition to ban Google from designing emoji In-Reply-To: <20170518074156.665a7a7059d7ee80bb4d670165c8327d.333da282b2.wbe@email03.godaddy.com> References: <20170518074156.665a7a7059d7ee80bb4d670165c8327d.333da282b2.wbe@email03.godaddy.com> Message-ID: <0e556296-748a-d35f-113d-9a416977e304@ix.netcom.com> On 5/18/2017 7:41 AM, Doug Ewell via Unicode wrote: > zelpa wrote: > >> This is my real issue, Apple disregards guidelines, sets a de facto >> standard, Google races to copy them. It's actually sad to see how far >> other vendors will go to copy Apple's designs. I honestly think the >> consortium should try harder to enforce the guidelines instead of >> letting Apple be the ruler and expecting others to obey. > Given that one co-chair of the Emoji Subcommittee is from Apple and the > other is from Google, you may wish to rethink your expectations about > all this. > I'd expect "zelpa" to feel validated by this info in their concern, wouldn't you? A./ From unicode at unicode.org Thu May 18 11:48:19 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 18 May 2017 09:48:19 -0700 Subject: Petition to ban Google from designing emoji Message-ID: <20170518094819.665a7a7059d7ee80bb4d670165c8327d.b452a92739.wbe@email03.godaddy.com> Asmus Freytag wrote: >> Given that one co-chair of the Emoji Subcommittee is from Apple and >> the other is from Google, you may wish to rethink your expectations >> about all this. > > I'd expect "zelpa" to feel validated by this info in their concern, > wouldn't you? Well, it's public information: http://www.unicode.org/emoji/ The more important point is the one others have been making: Unicode does not and cannot attempt to dictate to any vendor how to design glyphs, either for normal characters like A and ? and ? or for emoji. Unicode does insist that the glyph design not misrepresent the meaning of the character, which I believe was Michael Everson's objection to vendors implementing U+1F3B1 BILLIARDS as a lone 8-ball. It's not clear to me that the Google redesign discussed here goes that far; this seems more like an objection on aesthetic grounds. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 18 11:53:12 2017 From: unicode at unicode.org (Shakil Anwar via Unicode) Date: Thu, 18 May 2017 17:53:12 +0100 Subject: Petition to ban Google from designing emoji In-Reply-To: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> References: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> Message-ID: A more democratic solution is to allow the global public to both submit and vote on emoji designs, rather than allow a small number of (probably) North American white males to dictate emojis in a 'colonial' process based on their own world view and personal views. The Unicode consortium can vote to change the process, and now that the proposal has been made, it will speak volumes if Google, Apple etc. choose not to democratise. ICANN chose to democratise their processes; so can Unicode. On 18 May 2017 at 15:16, Gabriel von Dehn via Unicode wrote: > Hi, > > the Unicode Consortium does not and cannot "ban" vendors from designing > emojis the way they wish. Unicode merely gives recommendations on how the > characters should be displayed.
Think of the different designs on different > platforms like different fonts you can use (because that is actually what > they are): They all look slightly different and no one would hold a > petition for the design of characters in a font to change. > > As for the gendered Emojis, those are in the Unicode specification now: > http://emojipedia.org/emoji-4.0/ > > If you do not like the upcoming Emoji design from Google (or anything > about the upcoming version of Android), you can report to Google directly, > but posting on this List won't help. > > > On 18 May 2017, at 14:40, zelpa via Unicode wrote: > > http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ > > Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 12:30:23 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 May 2017 10:30:23 -0700 Subject: Petition to ban Google from designing emoji In-Reply-To: <20170518094819.665a7a7059d7ee80bb4d670165c8327d.b452a92739.wbe@email03.godaddy.com> References: <20170518094819.665a7a7059d7ee80bb4d670165c8327d.b452a92739.wbe@email03.godaddy.com> Message-ID: <8da32bb0-1575-e7f9-a9ee-8d6873b2e51d@ix.netcom.com> On 5/18/2017 9:48 AM, Doug Ewell via Unicode wrote: > Asmus Freytag wrote: > >>> Given that one co-chair of the Emoji Subcommittee is from Apple and >>> the other is from Google, you may wish to rethink your expectations >>> about all this. >> I'd expect "zelpa" to feel validated by this info in their concern, >> wouldn't you? > Well, it's public information: http://www.unicode.org/emoji/ > > The more important point is the one others have been making: Unicode > does not and cannot attempt to dictate to any vendor how to design > glyphs, either for normal characters like A and ? and ? or for emoji. > > Unicode does insist that the glyph design not misrepresent the meaning > of the character, which I believe was Michael Everson's objection to > vendors implementing U+1F3B1 BILLIARDS as a lone 8-ball. It's not clear > to me that the Google redesign discussed here goes that far; this seems > more like objection on aesthetic grounds. > While this is all true, it seems to miss the point behind the whole complaint. Attempts to counter "tongue-in-cheek" complaints with literal facts aren't always effective. :) A./ From unicode at unicode.org Thu May 18 12:37:32 2017 From: unicode at unicode.org (Gabriel von Dehn via Unicode) Date: Thu, 18 May 2017 20:37:32 +0300 Subject: Petition to ban Google from designing emoji In-Reply-To: References: <61678C1D-15A2-4507-9E3D-5E1849D57105@gmail.com> Message-ID: <8C4D1DA4-584D-4D74-9C88-4793177C642F@gmail.com> Again, Unicode is not intended to and cannot ban specific designs of characters, including emoji. Unicode is responsible for creating a list of characters that should be supported, with the goal of making textual communication online possible through a standardised encoding.
Unicode is not responsible for designing these characters, that is up to the vendors to decide. From Unicodes Website: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.?; "The Unicode Consortium was founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards.? (http://www.unicode.org/standard/WhatIsUnicode.html) If you wish that a certain vendor - like Google or Apple - democratise their process of designing characters you should make that clear to them. Posting on this list will do absolutely nothing. ? > On 18 May 2017, at 19:53, Shakil Anwar via Unicode wrote: > > A more democratic solution is to allow the global public to both submit and vote on emoji designs. Rather than allow a small number of (probably) north american white males to dictate emojis in a 'colonial' process based on their own world and personal view. > The Unicode consortium can vote to change the process and now the proposal has been made it will speak volumes if Google, Apple etc. choose not to democratise. > ICANN chose to democratise their processes ; so can Unicode. > > On 18 May 2017 at 15:16, Gabriel von Dehn via Unicode > wrote: > Hi, > > the Unicode Consortium does not and cannot ?ban? vendors from designing emojis the way they wish. Unicode merely gives recommendations on how the characters should be displayed. Think of the different designs on different platforms like different fonts you can use (because that is actually what they are): They all look slightly different and no one would hold a petition for the design of characters in a font to change. > > As for the gendered Emojis, those are in the Unicode specification now: http://emojipedia.org/emoji-4.0/ > > If you do not like the upcoming Emoji design from Google (or anything about the upcoming version of Android), you can report to Google directly, but posting on this List won?t help. > > >> On 18 May 2017, at 14:40, zelpa via Unicode > wrote: >> >> http://blog.emojipedia.org/rip-blobs-google-redesigns-emojis/ >> >> Is this some kind of joke? Have Google put ANY thought into their emoji design? First they bastardise the cute blob emoji, then they make their emoji gendered, now they've literally just copied Apple's emoji. It's sickening. Disgusting. I propose we hold a petition for the Unicode Consortium to ban Google from designing emoji in the future, and make them revert back to the Android 5 designs. Everyone in favour of this please leave a response, anybody not in favour please rethink your opinion. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 12:40:41 2017 From: unicode at unicode.org (David Faulks via Unicode) Date: Thu, 18 May 2017 13:40:41 -0400 Subject: Petition to ban Google from designing emoji In-Reply-To: <8da32bb0-1575-e7f9-a9ee-8d6873b2e51d@ix.netcom.com> Message-ID: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu May 18 13:03:09 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 May 2017 19:03:09 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <20170518190309.0fc4223b@JRWUBU2> On Thu, 18 May 2017 09:58:43 +0100 Alastair Houghton via Unicode wrote: > On 18 May 2017, at 07:18, Henri Sivonen via Unicode > wrote: > > > > the decision complicates U+FFFD generation when validating UTF-8 by > > state machine. > > It *really* doesn?t. Even if you?re hell bent on using a pure state > machine approach, you need to add maybe two additional error states > (two-trailing-bytes-to-eat-then-fffd and > one-trailing-byte-to-eat-then-fffd) on top of the states you already > have. The implementation complexity argument is a *total* red > herring. For big programs, yes. However, for a small program it can be attractive to have a small hand-coded routine so that the source code can sit in a single file. It can even allow a basically UTF-8 program to meet a requirement to be able to match lone surrogates in a regular expression, as was once required. Richard. From unicode at unicode.org Thu May 18 13:36:05 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 May 2017 11:36:05 -0700 Subject: Petition to ban Google from designing emoji In-Reply-To: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> References: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 13:53:49 2017 From: unicode at unicode.org (Phake Nick via Unicode) Date: Fri, 19 May 2017 02:53:49 +0800 Subject: Petition to ban Google from designing emoji In-Reply-To: References: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> Message-ID: Is it possible to introduce variation selector for emoji with large design variation among vendors so that when users send emoji with selectors their variation among vendors can be minimized by asking vendors to support both versions? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 18 20:21:35 2017 From: unicode at unicode.org (Rebecca T via Unicode) Date: Thu, 18 May 2017 21:21:35 -0400 Subject: Petition to ban Google from designing emoji In-Reply-To: References: <2f52bd2e-4ffe-401f-a7f8-faf17068c782@email.android.com> Message-ID: Nick, Don?t even joke. Sincerely, ? Rebecca On Thu, May 18, 2017 at 2:53 PM, Phake Nick via Unicode wrote: > Is it possible to introduce variation selector for emoji with large design > variation among vendors so that when users send emoji with selectors their > variation among vendors can be minimized by asking vendors to support both > versions? > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri May 19 15:09:52 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 19 May 2017 13:09:52 -0700 Subject: Team Emoji Message-ID: <20170519130952.665a7a7059d7ee80bb4d670165c8327d.3a3224a989.wbe@email03.godaddy.com> http://www.cnn.com/2017/05/19/us/emoji-redhead-curly-black-hair-trnd/index.html "Team Emoji (aka the Unicode Consortium) has approved some well-recieved [sic] updates to the visual lexicon we've all come to love. One of the most recent updates included black hearts and a unicorn, and they also got rid of the gun emoji in favor of a much less threatening water gun version. And shockingly, it has only been two years since Unicode updated the icons with different skin tones." "Team Emoji (aka the Unicode Consortium)." What a legacy. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat May 20 18:01:59 2017 From: unicode at unicode.org (Oren Watson via Unicode) Date: Sat, 20 May 2017 19:01:59 -0400 Subject: Fwd: Team Emoji In-Reply-To: References: <20170519130952.665a7a7059d7ee80bb4d670165c8327d.3a3224a989.wbe@email03.godaddy.com> Message-ID: It's especially bad that they think that it was the Unicode consortium that changed the PISTOL emoji to a water gun. Does no-one at CNN use Android, Samsung or Windows? It's a pistol, specifically a revolver, on all those. On Fri, May 19, 2017 at 4:09 PM, Doug Ewell via Unicode wrote: > http://www.cnn.com/2017/05/19/us/emoji-redhead-curly-black-h > air-trnd/index.html > > "Team Emoji (aka the Unicode Consortium) has approved some well-recieved > [sic] updates to the visual lexicon we've all come to love. One of the > most recent updates included black hearts and a unicorn, and they also > got rid of the gun emoji in favor of a much less threatening water gun > version. And shockingly, it has only been two years since Unicode > updated the icons with different skin tones." > > "Team Emoji (aka the Unicode Consortium)." What a legacy. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 21 11:37:27 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 21 May 2017 18:37:27 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Message-ID: I actually didn't see any of this discussion until today. ( unicode at unicode.org mail was going into my spam folder...) I started reading the thread, but it looks like a lot of it is OT, so just scanned some of them. A few brief points: 1. There is plenty of time for public comment, since it was targeted at *Unicode 11*, the release for about a year from now, *not* *Unicode 10*, due this year. 2. When the UTC "approves a change", that change is subject to comment, and the UTC can always reverse or modify its approval up until the meeting before release date. *So there are ca. 9 months in which to comment.* 3. The modified text is a set of guidelines, not requirements. So no conformance clause is being changed. - If people really believed that the guidelines in that section should have been conformance clauses, they should have proposed that at some point. - And still can proposal that ? as I said, there is plenty of time. 
Mark On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode < unicode at unicode.org> wrote: > Henri Sivonen wrote: > > > I find it shocking that the Unicode Consortium would change a > > widely-implemented part of the standard (regardless of whether Unicode > > itself officially designates it as a requirement or suggestion) on > > such flimsy grounds. > > > > I'd like to register my feedback that I believe changing the best > > practices is wrong. > > Perhaps surprisingly, it's already too late. UTC approved this change > the day after the proposal was written. > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 12:35:34 2017 From: unicode at unicode.org (Don Osborn via Unicode) Date: Mon, 22 May 2017 13:35:34 -0400 Subject: Conference marking 40th anniversary of Niamey expert meeting? Message-ID: <01f901d2d321$d22e5900$768b0b00$@bisharat.net> Is there any interest in a conference on support for African languages, including issues at the character and script level? I'm looking at the upcoming 40th anniversary of the Niamey expert meeting on "Transcription and Harmonization of African Languages" with the thought that it might be an opportune occasion to take stock of a process that was prominent in the 1960s - 1970s, reflecting/shaping the Latin-based orthographies used today, and consider current issues with all scripts used in Africa. Such an event could also serve as a way to exchange skills and network among people doing applied work (localization, content development, language technology). I've just posted a short question to that effect at http://niamey.blogspot.com/2017/05/marking-40th-anniversary-of-niamey.html in the hopes of eliciting feedback. This post also references 2 earlier postings about the 50th anniversary of the landmark 1966 Bamako expert meeting, in which various possible issues for discussion were mentioned. The 1978 Niamey conference was a key meeting among a series of UNESCO-(co)sponsored expert meetings on harmonization of transcriptions (orthographies) in Latin script during the 1960s and 1970s. Among other things, this conference produced the African Reference Alphabet, which has been referred to in standardization of orthographies in several countries and in much later discussions relating to Unicode. Thanks in advance for any feedback, here or on the blog. Don Osborn, PhD -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 16:44:06 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 May 2017 22:44:06 +0100 Subject: Comparing Raw Values of the Age Property Message-ID: <20170522224406.43d2d2fa@JRWUBU2> Given two raw values of the Age property, defined in UCD file DerivedAge.txt, how is a computer program supposed to compare them? Apart from special handling for the value "Unassigned" and its short alias "NA", one used to be able to compare short values against short values and long values against long values by simple string comparison. However, now we are coming to Version 10.0 of Unicode, this no longer works - "1.1" < "10.0" < "2.0". There are some possibilities - the values appear in order in PropertyValueAliases.txt and in DerivedAge.txt. However, I can find no relevant guarantees in UAX#44. 
I am looking for a solution that can be driven by the data files, rather than requiring human thought at every version release. Can one rely on the FULL STOP being the field divider, and can one rely on there never being any grouping characters in the short values? Again, I could find no guarantees. Richard. From unicode at unicode.org Mon May 22 17:10:02 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Mon, 22 May 2017 15:10:02 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170522224406.43d2d2fa@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> Message-ID: On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Given two raw values of the Age property, defined in UCD file > DerivedAge.txt, how is a computer program supposed to compare them? > Apart from special handling for the value "Unassigned" and its short > alias "NA", one used to be able to compare short values against short > values and long values against long values by simple string > comparison. However, now we are coming to Version 10.0 of Unicode, > this no longer works - "1.1" < "10.0" < "2.0". > This is normal for numbers, and for multi-field version numbers. If you want numeric sorting, then you need to either use a collator with that option, or parse the versions into tuples of integers and sort those. There are some possibilities - the values appear in order in > PropertyValueAliases.txt and in DerivedAge.txt. You should not rely on the order of values in data files, unless the file explicitly states that order matters. Can one rely on the FULL STOP being the field > divider, I think so. Dots are extremely common for version numbers. I see no reason for Unicode to use something else. and can one rely on there never being any grouping characters > in the short values? I don't know what "grouping characters" you have in mind. I think the format is pretty self-evident. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 17:19:08 2017 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Mon, 22 May 2017 17:19:08 -0500 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170522224406.43d2d2fa@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> Message-ID: <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> I performed several operations on DerivedAge.txt a few months ago. One basic example here: https://pandey.github.io/posts/unicode-growth-UCD-python.html If you provide some more insight into your objective, I might be able to help. I would recommend against relying on the order of the data, and that you instead parse the individual entries to obtain the 'Age' property. All my best, Anshu > On May 22, 2017, at 4:44 PM, Richard Wordingham via Unicode wrote: > > Given two raw values of the Age property, defined in UCD file > DerivedAge.txt, how is a computer program supposed to compare them? > Apart from special handling for the value "Unassigned" and its short > alias "NA", one used to be able to compare short values against short > values and long values against long values by simple string > comparison. However, now we are coming to Version 10.0 of Unicode, > this no longer works - "1.1" < "10.0" < "2.0". > > There are some possibilities - the values appear in order in > PropertyValueAliases.txt and in DerivedAge.txt. However, I can find no > relevant guarantees in UAX#44. 
I am looking for a solution that can be > driven by the data files, rather than requiring human thought at every > version release. Can one rely on the FULL STOP being the field > divider, and can one rely on there never being any grouping characters > in the short values? Again, I could find no guarantees. > > Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon May 22 17:48:19 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 May 2017 23:48:19 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: References: <20170522224406.43d2d2fa@JRWUBU2> Message-ID: <20170522234819.0f44c619@JRWUBU2> On Mon, 22 May 2017 15:10:02 -0700 Markus Scherer via Unicode wrote: > On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > > Given two raw values of the Age property, defined in UCD file > > DerivedAge.txt, how is a computer program supposed to compare them? > > Apart from special handling for the value "Unassigned" and its short > > alias "NA", one used to be able to compare short values against > > short values and long values against long values by simple string > > comparison. However, now we are coming to Version 10.0 of Unicode, > > this no longer works - "1.1" < "10.0" < "2.0". > > > > This is normal for numbers, and for multi-field version numbers. > If you want numeric sorting, then you need to either use a collator > with that option, or parse the versions into tuples of integers and > sort those. Well, comparing "15.1" and "15.12" gives different answers depending on whether you view them as decimal numbers or a hierarchical sequence of numbers. > Can one rely on the FULL STOP being the field > > divider, > I think so. Dots are extremely common for version numbers. I see no > reason for Unicode to use something else. But where is that stated? > and can one rely on there never being any grouping characters > > in the short values? > I don't know what "grouping characters" you have in mind. Comma is the obvious one. Looking to the far future (I trust you've heard of the predicted Cobol crisis for the Y10k problem), will we have "1000.0" or "1,000.0"? Richard. From unicode at unicode.org Mon May 22 17:49:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 May 2017 23:49:31 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> Message-ID: <20170522234931.79f307b0@JRWUBU2> On Mon, 22 May 2017 17:19:08 -0500 Anshuman Pandey wrote: > I performed several operations on DerivedAge.txt a few months ago. > One basic example here: > > https://pandey.github.io/posts/unicode-growth-UCD-python.html So what happens if you apply it to Unicode Version 10.0? Are the versions sorted as strings, as real numbers, or just in the order of the data in DerivedAge.txt. > If you provide some more insight into your objective, I might be able > to help. One of the objectives is to use a current version of the UCD to determine, for example, which characters were in Version x.y. One needs that for a regular expression such as [:Age=3.0:], which also matches all characters that have survived since Version 1.1. Another is to record for which versions of the standard a character had some particular value of a property. Richard. 
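[A minimal sketch of the kind of data-file-driven check Richard describes, assuming a local copy of DerivedAge.txt in its current format and that Age short values keep the "major.minor" form (plus "NA" for Unassigned); the function names here are illustrative, not taken from any existing tool.]

# Parse DerivedAge.txt and answer queries such as "does this code point
# match \p{Age=3.0}?".  Age values are compared as (major, minor) integer
# tuples, so that "2.0" < "10.0" numerically even though not as strings.

def parse_age(value):
    """Map an Age value like '2.1' to a comparable tuple; 'NA'/'Unassigned'
    sorts after every numbered version."""
    if value in ('NA', 'Unassigned'):
        return (float('inf'), float('inf'))
    major, minor = value.split('.')
    return (int(major), int(minor))

def load_derived_age(path):
    """Return a list of (first, last, age) records from DerivedAge.txt."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()
            if not line:
                continue
            cps, version = (field.strip() for field in line.split(';'))
            first, _, last = cps.partition('..')
            records.append((int(first, 16), int(last or first, 16),
                            parse_age(version)))
    return records

def age_of(cp, records):
    """Age of a code point; code points absent from the file are unassigned."""
    for first, last, age in records:
        if first <= cp <= last:
            return age
    return parse_age('NA')

def matches_age(cp, value, records):
    """True if cp has Age equal to value or earlier."""
    return age_of(cp, records) <= parse_age(value)

[Loaded once, the same parse_age key also orders a list of Age values correctly, which is what the [:Age=3.0:] style of matching mentioned above needs.]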
From unicode at unicode.org Tue May 23 01:10:09 2017 From: unicode at unicode.org (Jonathan Coxhead via Unicode) Date: Mon, 22 May 2017 23:10:09 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote: > On 18 May 2017, at 07:18, Henri Sivonen via Unicode wrote: >> the decision complicates U+FFFD generation when validating UTF-8 by state machine. > It *really* doesn?t. Even if you?re hell bent on using a pure state machine approach, you need to add maybe two additional error states (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on top of the states you already have. The implementation complexity argument is a *total* red herring. Heh. A state machine with N+2 states is, /a fortiori/, more complex than one with N states. So I think your argument is self-contradictory. > Alastair. ?? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 23 01:43:42 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 22 May 2017 23:43:42 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170522234931.79f307b0@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> Message-ID: <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote: > One of the objectives is to use a current version of the UCD to > determine, for example, which characters were in Version x.y. One > needs that for a regular expression such as [:Age=3.0:], which > also matches all characters that have survived since Version 1.1. > Another is to record for which versions of the standard a character had > some particular value of a property. Richard, I would tend to side with those who claim that "version number" is something that's defined by common industry practice, and therefore not something that Unicode needs to define - but is allowed to use. Just like Unicode doesn't define what an integer is, or hexadecimal number system or a whole host of other concepts that are used in defining in turn what Unicode is. As Markus implied, version numbers are a positional number system where the positions in turn are integers in decimal notation, separated by dots. As it is neither a "string" nor a single number, neither of those common sorting methods give the right answer, but a multi-field sort will. If you have a multi-field sort algorithm that uses commas as the delimiter, just swap out the dots for commas. If not, then you have to implement your own multi-level sort. In any well-designed modern runtime library you can pass a comparison method to any of the sorting algorithms (or sorted data collections). A./ PS: somewhere in the standard, Unicode does define names for the fields: Major, Minor and Update. The use of the term "Update" may not be universal, but major and minor version numbers are a well established concept and do not need a definition. The naming also implies the order of precedence. 
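[The multi-field sort Asmus describes can be expressed as a sort key passed to the sorting routine; a tiny illustrative sketch, with a hypothetical list of values:]

# Sort version strings field by field as integers instead of as plain strings,
# so that "2.0" orders before "10.0".

def version_key(value):
    return tuple(int(field) for field in value.split('.'))

versions = ['10.0', '1.1', '2.0', '6.3', '3.0.1']
print(sorted(versions))                   # string order puts '10.0' after '1.1' but before '2.0'
print(sorted(versions, key=version_key))  # ['1.1', '2.0', '3.0.1', '6.3', '10.0']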
From unicode at unicode.org Tue May 23 03:24:31 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 23 May 2017 17:24:31 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> Message-ID: <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> Hello Mark, On 2017/05/22 01:37, Mark Davis ?? via Unicode wrote: > I actually didn't see any of this discussion until today. Many thanks for chiming in. > ( > unicode at unicode.org mail was going into my spam folder...) I started > reading the thread, but it looks like a lot of it is OT, As is quite usual on mailing list :-(. > so just scanned > some of them. > > A few brief points: > > 1. There is plenty of time for public comment, since it was > targeted at *Unicode > 11*, the release for about a year from now, *not* *Unicode 10*, due this > year. > 2. When the UTC "approves a change", that change is subject to comment, > and the UTC can always reverse or modify its approval up until the meeting > before release date. *So there are ca. 9 months in which to comment.* This is good to hear. What's the best way to submit such comments? > 3. The modified text is a set of guidelines, not requirements. So no > conformance clause is being changed. > - If people really believed that the guidelines in that section should > have been conformance clauses, they should have proposed that at > some point. I may have missed something, but I think nobody actually proposed to change the recommendations into requirements. I think everybody understands that there are several ways to do things, and situations where one or the other is preferred. The only advantage of changing the current recommendations to requirements would be to make it more difficult for them to be changed. I think the situation at hand is somewhat special: Recommendations are okay. But there's a strong wish from downstream communities such asWeb browser implementers and programming language/library implementers to not change these recommendations. Some of these communities have stricter requirement for alignment, and some have followed longstanding recommendations in the absence of specific arguments for something different. Regards, Martin. > - And still can proposal that ? as I said, there is plenty of time. > > > Mark > > On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode < > unicode at unicode.org> wrote: > >> Henri Sivonen wrote: >> >>> I find it shocking that the Unicode Consortium would change a >>> widely-implemented part of the standard (regardless of whether Unicode >>> itself officially designates it as a requirement or suggestion) on >>> such flimsy grounds. >>> >>> I'd like to register my feedback that I believe changing the best >>> practices is wrong. >> >> Perhaps surprisingly, it's already too late. UTC approved this change >> the day after the proposal was written. >> >> http://www.unicode.org/L2/L2017/17103.htm#151-C19 >> >> -- >> Doug Ewell | Thornton, CO, US | ewellic.org >> >> >> > -- Prof. Dr.sc. Martin J. 
Dürst Department of Intelligent Information Technology College of Science and Engineering Aoyama Gakuin University Fuchinobe 5-1-10, Chuo-ku, Sagamihara 252-5258 Japan From unicode at unicode.org Tue May 23 04:17:06 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 23 May 2017 10:17:06 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <20170517223142.4b44687f@JRWUBU2> <9cdf9f3a-2922-0966-021c-a5dd293f2a65@ix.netcom.com> Message-ID: <4D46DB3C-6134-4B44-A59C-F6A6DDE38B0C@alastairs-place.net> On 23 May 2017, at 07:10, Jonathan Coxhead via Unicode wrote: > > On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote: >> On 18 May 2017, at 07:18, Henri Sivonen via Unicode >> wrote: >> >>> the decision complicates U+FFFD generation when validating UTF-8 by state machine. >>> >> It *really* doesn't. Even if you're hell bent on using a pure state machine approach, you need to add maybe two additional error states (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on top of the states you already have. The implementation complexity argument is a *total* red herring. > > Heh. A state machine with N+2 states is, a fortiori, more complex than one with N states. So I think your argument is self-contradictory. You're being overly pedantic (and in this case, actually, the cyclomatic complexity of the state machine wouldn't increase). In any case, Henri is complaining that it's too difficult to implement; it isn't. You need two extra states, both of which are trivial. The point I was making was that this is not a strong argument against the proposed change, *even if* we were treating it as a requirement, which it isn't. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 23 04:33:24 2017 From: unicode at unicode.org (Manuel Strehl via Unicode) Date: Tue, 23 May 2017 11:33:24 +0200 Subject: Comparing Raw Values of the Age Property In-Reply-To: <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> Message-ID: The rising standard in the world of web development (and others) is called "Semantic Versioning" [1], which many projects adhere to or must sometimes actively explain why they don't. The structure of a "semantic version" string is a set of three integers, MAJOR.MINOR.PATCH, where the "semantics" part lies in a kind of contract between author and user about when to increment which part. I do _not_ suggest that Unicode embrace that standard; I am merely stating that this is what many frontend developers will simply assume when looking at a version string that matches this pattern. --Manuel [1] http://semver.org/ 2017-05-23 8:43 GMT+02:00 Asmus Freytag via Unicode : > On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote: >> One of the objectives is to use a current version of the UCD to >> determine, for example, which characters were in Version x.y. One >> needs that for a regular expression such as [:Age=3.0:], which >> also matches all characters that have survived since Version 1.1.
>> > > Richard, > > I would tend to side with those who claim that "version number" is > something that's defined by common industry practice, and therefore not > something that Unicode needs to define - but is allowed to use. Just like > Unicode doesn't define what an integer is, or hexadecimal number system or > a whole host of other concepts that are used in defining in turn what > Unicode is. > > As Markus implied, version numbers are a positional number system where > the positions in turn are integers in decimal notation, separated by dots. > > As it is neither a "string" nor a single number, neither of those common > sorting methods give the right answer, but a multi-field sort will. > > If you have a multi-field sort algorithm that uses commas as the > delimiter, just swap out the dots for commas. If not, then you have to > implement your own multi-level sort. > > In any well-designed modern runtime library you can pass a comparison > method to any of the sorting algorithms (or sorted data collections). > > A./ > > PS: somewhere in the standard, Unicode does define names for the fields: > Major, Minor and Update. The use of the term "Update" may not be universal, > but major and minor version numbers are a well established concept and do > not need a definition. The naming also implies the order of precedence. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 23 06:04:58 2017 From: unicode at unicode.org (Janusz S. Bien via Unicode) Date: Tue, 23 May 2017 13:04:58 +0200 Subject: Comparing Raw Values of the Age Property In-Reply-To: References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> Message-ID: <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> Quote/Cytat - Manuel Strehl via Unicode (Tue 23 May 2017 11:33:24 AM CEST): > The rising standard in the world of web development (and others) is called > ?Semantic Versioning? [1], that many projects adhere to or sometimes must > actively explain, why they don't. > > The structure of a ?semantic version? string is a set of three integers, > MAJOR.MINOR.PATCH, where the ?sematics? part lies in a kind of contract > between author and user, when to increment which part. > Perhaps I am missing something, but I don't understand this thread. Cf. http://unicode.org/versions/ Version numbers for the Unicode Standard consist of three fields, denoting the major version, the minor version, and the update version, respectively. The differences between major, minor, and update versions are as follows: [...] Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From unicode at unicode.org Tue May 23 07:29:33 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 23 May 2017 05:29:33 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> Message-ID: On 5/23/2017 4:04 AM, Janusz S. 
Bien via Unicode wrote: > Quote/Cytat - Manuel Strehl via Unicode (Tue 23 > May 2017 11:33:24 AM CEST): > >> The rising standard in the world of web development (and others) is >> called >> ?Semantic Versioning? [1], that many projects adhere to or sometimes >> must >> actively explain, why they don't. >> >> The structure of a ?semantic version? string is a set of three integers, >> MAJOR.MINOR.PATCH, where the ?sematics? part lies in a kind of contract >> between author and user, when to increment which part. >> > > Perhaps I am missing something, but I don't understand this thread. Cf. You are not missing anything, the OP is being obtuse. We just didn't want to run the search for him. :) A./ > > http://unicode.org/versions/ > > Version numbers for the Unicode Standard consist of three fields, > denoting the major version, the minor version, and the update version, > respectively. > > The differences between major, minor, and update versions are as follows: > > [...] > > Best regards > > Janusz > From unicode at unicode.org Tue May 23 08:27:51 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 23 May 2017 15:27:51 +0200 Subject: Comparing Raw Values of the Age Property In-Reply-To: <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> Message-ID: 2017-05-23 8:43 GMT+02:00 Asmus Freytag via Unicode : > On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote: > >> One of the objectives is to use a current version of the UCD to >> determine, for example, which characters were in Version x.y. One >> needs that for a regular expression such as [:Age=3.0:], which >> also matches all characters that have survived since Version 1.1. >> Another is to record for which versions of the standard a character had >> some particular value of a property. >> > > Richard, > > I would tend to side with those who claim that "version number" is > something that's defined by common industry practice, and therefore not > something that Unicode needs to define - but is allowed to use. Just like > Unicode doesn't define what an integer is, or hexadecimal number system or > a whole host of other concepts that are used in defining in turn what > Unicode is. > > As Markus implied, version numbers are a positional number system where > the positions in turn are integers in decimal notation, separated by dots. > Not all version numbers obey this scheme with dots and only integers. There are also version numbers using dates (separated by hyphens like in the ISO format), or additional letters (a,b,c...) or labels (alpha, beta, RC) sometimes in the middle of other fields (these labels are not always easy to compare), but they are generally made to be case-insensitive and tend to avoid non-latin letters, so Greek letters are named in Latin), and they cannot be always parsed and combined as a single integer. For comparing/sorting, it's best to use case-ensensitive and use only primary differences in UCA. But the UCA algorithm should be tweaked using preparsing to locate where there are numbers In rare cases you may find roman decimal numbers (I, II,III, IV, V, IX, X) which can't be strictly sorted like other Latin letters. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue May 23 09:05:04 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 23 May 2017 07:05:04 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> Message-ID: <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 23 11:20:33 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 23 May 2017 16:20:33 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> Message-ID: + the list, which somehow my reply seems to have lost. > I may have missed something, but I think nobody actually proposed to change the recommendations into requirements No thanks, that would be a breaking change for some implementations (like mine) and force them to become non-complying or potentially break customer behavior. I would prefer that both options be permitted, perhaps with a few words of advantages. -Shawn From unicode at unicode.org Tue May 23 12:45:46 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Tue, 23 May 2017 10:45:46 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode < unicode at unicode.org> wrote: > So, if the proposal for Unicode really was more of a "feels right" and not > a "deviate at your peril" situation (or necessary escape hatch), then we > are better off not making a RECOMMEDATION that goes against collective > practice. > I think the standard is quite clear about this: Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue May 23 13:09:33 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 23 May 2017 19:09:33 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: <169D8226-B6EC-4D4F-BD91-A3C4141ED9DB@alastairs-place.net> > On 23 May 2017, at 18:45, Markus Scherer via Unicode wrote: > > On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode wrote: >> So, if the proposal for Unicode really was more of a "feels right" and not a "deviate at your peril" situation (or necessary escape hatch), then we are better off not making a RECOMMEDATION that goes against collective practice. > > I think the standard is quite clear about this: > > Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. Agreed. That paragraph is entirely clear. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue May 23 13:20:23 2017 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Tue, 23 May 2017 11:20:23 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: On 5/23/2017 10:45 AM, Markus Scherer wrote: > On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode > > wrote: > > So, if the proposal for Unicode really was more of a "feels right" > and not a "deviate at your peril" situation (or necessary escape > hatch), then we are better off not making a RECOMMEDATION that > goes against collective practice. > > > I think the standard is quite clear about this: > > Although a UTF-8 conversion process is required to never consume > well-formed subsequences as part of its error handling for > ill-formed subsequences, such a process is not otherwise > constrained in how it deals with any ill-formed subsequence > itself. An ill-formed subsequence consisting of more than one code > unit could be treated as a single error or as multiple errors. > > And why add a recommendation that changes that from completely up to the implementation (or groups of implementations) to something where one way of doing it now has to justify itself? If the thread has made one thing clear is that there's no consensus in the wider community that one approach is obviously better. When it comes to ill-formed sequences, all bets are off. Simple as that. Adding a "recommendation" this late in the game is just bad standards policy. A./ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue May 23 14:31:49 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 23 May 2017 19:31:49 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: > If the thread has made one thing clear is that there's no consensus in the wider community > that one approach is obviously better. When it comes to ill-formed sequences, all bets are off. > Simple as that. > Adding a "recommendation" this late in the game is just bad standards policy. I agree. I'm not sure what value this provides. If someone thought it added value to discuss the pros and cons of implementing it one way and the other as MAY do this or MAY do that, I don't mind. But I think both should be permitted, and neither should be encouraged with anything stronger than a MAY. -Shawn From unicode at unicode.org Tue May 23 15:42:58 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 23 May 2017 13:42:58 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170523134258.665a7a7059d7ee80bb4d670165c8327d.7aa745091b.wbe@email03.godaddy.com> Asmus Freytag (c) wrote: > And why add a recommendation that changes that from completely up to > the implementation (or groups of implementations) to something where > one way of doing it now has to justify itself? A recommendation already exists, at the end of Section 3.9. The current proposal is to change it to recommend something else. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 23 15:48:40 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 23 May 2017 21:48:40 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> Message-ID: <20170523214840.6f40ffe7@JRWUBU2> On Tue, 23 May 2017 05:29:33 -0700 Asmus Freytag via Unicode wrote: > On 5/23/2017 4:04 AM, Janusz S. Bien via Unicode wrote: > > Quote/Cytat - Manuel Strehl via Unicode (Tue > > 23 May 2017 11:33:24 AM CEST): > > > >> The rising standard in the world of web development (and others) > >> is called > >> "Semantic Versioning" [1], that many projects adhere to or > >> sometimes must > >> actively explain, why they don't. > >> > >> The structure of a "semantic version" string is a set of three > >> integers, MAJOR.MINOR.PATCH, where the "semantics" part lies in a > >> kind of contract between author and user, when to increment which > >> part. > > > > Perhaps I am missing something, but I don't understand this thread. > > Cf. > > You are not missing anything, the OP is being obtuse. We just didn't > want to run the search for him. :) The object is to generate code *now* that, up to say Unicode Version 23.0, can work out, from the UCD files DerivedAge.txt and PropertyValueAliases.txt, whether an arbitrary code point was included by some Unicode version identified by a value of the property Age. One needs this capability to implement the regular expressions of the form \p{Age=xxx}.
This requires a scheme for determining which of two values of the property identifies the earlier version of Unicode. What TUS 9.0, its appendices and annexes is lacking is a clear statement such as, "The short values for the Age property are of the form "m.n", with the first field corresponding to the major version, and the second field corresponding to the minor version. There is no need for a third version field, because new characters are never assigned in update versions of the standard." Conveniently, this almost true statement is included in Section 5.14 of the proposed update to UAX#44 (in Draft 12 to be precise. It's not quite true, for there is also the short value NA for Unassigned. Is there any way of formally recording this oversight? With this proposed change, to compare two values, all one has to do is compare the short names of the values, for one knows what form they will be in. > > Version numbers for the Unicode Standard consist of three fields, > > denoting the major version, the minor version, and the update > > version, respectively. Yes, but 4.0.1 is not a value of the property Age; the last field is redundant. Oddly enough, ICU understands the regular expression \p{age=4.0.1}, but not \p{age=V2_1} (http://demo.icu-project.org/icu-bin/redemo). Ah well, it's only a recommendation that regular expression engines understand both short names and long names of values of properties. Richard. From unicode at unicode.org Tue May 23 15:57:24 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Tue, 23 May 2017 14:57:24 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> Message-ID: <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: > On 5/23/2017 10:45 AM, Markus Scherer wrote: >> On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode >> > wrote: >> >> So, if the proposal for Unicode really was more of a "feels right" >> and not a "deviate at your peril" situation (or necessary escape >> hatch), then we are better off not making a RECOMMEDATION that >> goes against collective practice. >> >> >> I think the standard is quite clear about this: >> >> Although a UTF-8 conversion process is required to never consume >> well-formed subsequences as part of its error handling for >> ill-formed subsequences, such a process is not otherwise >> constrained in how it deals with any ill-formed subsequence >> itself. An ill-formed subsequence consisting of more than one code >> unit could be treated as a single error or as multiple errors. >> >> > And why add a recommendation that changes that from completely up to the > implementation (or groups of implementations) to something where one way > of doing it now has to justify itself? > > If the thread has made one thing clear is that there's no consensus in > the wider community that one approach is obviously better. When it comes > to ill-formed sequences, all bets are off. Simple as that. > > Adding a "recommendation" this late in the game is just bad standards > policy. > > A./ > > Unless I misunderstand, you are missing the point. There is already a recommendation listed in TUS, and that recommendation appears to have been added without much thought. 
There is no proposal to add a recommendation "this late in the game". From unicode at unicode.org Tue May 23 19:44:49 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 23 May 2017 17:44:49 -0700 Subject: Comparing Raw Values of the Age Property In-Reply-To: <20170523214840.6f40ffe7@JRWUBU2> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> <20170523214840.6f40ffe7@JRWUBU2> Message-ID: <92e234ce-3d29-dabc-3202-709a84d73619@att.net> Richard On 5/23/2017 1:48 PM, Richard Wordingham via Unicode wrote: > The object is to generate code*now* that, up to say Unicode Version 23.0, > can work out, from the UCD files DerivedAge.txt and > PropertyValueAliases.txt, whether an arbitrary code point was included > by some Unicode version identified by a Unicode version identified by a > value of the property Age. Ah, but keep in mind, if projecting out to Version 23.0 (in the year 2030, by our current schedule), there is a significant chance that particular UCD data files may have morphed into something entirely different. Recall how at one point Unihan.txt morphed into Unihan.zip with multiple subpart files. Even though the maintainers of the UCD data files do our best to maintain them to be as stable as possible, their content and sometimes their formats do morph gradually from release to release. Just don't expect *any* parser to be completely forward proofed against what *might* happen in the UCD in some future version. On the other hand, for the property Age, even in the absence of normative definitions of invariants for the property values, given recent practice, it is pretty damn safe to assume: A. Major versions will continue to have two digits, incremented by one for each subsequent version: 10, 11, 12, ... 99. B. Minor versions will mostly (if not entirely) consist of the value "0", and will never require two digits. Assumption A will get you through this century, which by my estimation should well exceed the lifetime of any code you might be writing now that depends on it. BTW, unlike many actual products, the version numbering of the Unicode Standard is not really driven by marketing concerns. So there is very little chance of some version sequence for Unicode that ends up fitting a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista, 7, 8, 8.1, 10 ... ;-) > What TUS 9.0, its appendices and annexes is lacking is a clear > statement such as, "The short values for the Age property are of the > form "m.n", with the first field corresponding to the major version, > and the second field corresponding to the minor version. There is no > need for a third version field, because new characters are never > assigned in update versions of the standard." I think the UTC and the editors had just been assuming that the pattern was so obvious that it needed no explaining. But the lack of a clear description of Age had become apparent, which is why I wrote that text to add to UAX #44 for the upcoming version. > Conveniently, this > almost true statement is included in Section 5.14 of the proposed > update to UAX#44 (in Draft 12 to be precise. It's not quite true, for > there is also the short value NA for Unassigned. Is there any way of > formally recording this oversight? Yes. You could always file another piece of feedback using the contact form. 
However, in this case, you already have the attention of the editors of UAX #44. So my advice would be to simply wait now for the publication of Version 10.0 of UAX #44 around the 3rd week of June. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 24 00:22:10 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 24 May 2017 06:22:10 +0100 Subject: Comparing Raw Values of the Age Property In-Reply-To: <92e234ce-3d29-dabc-3202-709a84d73619@att.net> References: <20170522224406.43d2d2fa@JRWUBU2> <9E1A71F0-5DCA-440C-A573-90B6BA4402CB@umich.edu> <20170522234931.79f307b0@JRWUBU2> <83e02aa3-11b0-505a-dd47-0adf6dc0489c@ix.netcom.com> <20170523130458.311188ckx2vumdei@mail.mimuw.edu.pl> <20170523214840.6f40ffe7@JRWUBU2> <92e234ce-3d29-dabc-3202-709a84d73619@att.net> Message-ID: <20170524062210.6ecc7680@JRWUBU2> On Tue, 23 May 2017 17:44:49 -0700 Ken Whistler via Unicode wrote: > Ah, but keep in mind, if projecting out to Version 23.0 (in the year > 2030, by our current schedule), there is a significant chance that > particular UCD data files may have morphed into something entirely > different. Recall how at one point Unihan.txt morphed into Unihan.zip > with multiple subpart files. Even though the maintainers of the UCD > data files do our best to maintain them to be as stable as possible, > their content and sometimes their formats do morph gradually from > release to release. Just don't expect *any* parser to be completely > forward proofed against what *might* happen in the UCD in some future > version. So long as the parser chokes on the new input, that is not too bad for my programs, which rely on being directed to a local copy of the UCD. That issue would be nastier for any program that tries to keep abreast of Unicode additions by downloading the relevant parts of the UCD. > On the other hand, for the property Age, even in the absence of > normative definitions of invariants for the property values, given > recent practice, it is pretty damn safe to assume: > A. Major versions will continue to have two digits, incremented by > one for each subsequent version: 10, 11, 12, ... 99. > B. Minor versions will mostly (if not entirely) consist of the value > "0", and will never require two digits. > Assumption A will get you through this century, which by my > estimation should well exceed the lifetime of any code you might be > writing now that depends on it. Yes, but http://www.thejokeshop.org/2008/12/as-useful-as-a-cobol-programmer/ . > BTW, unlike many actual products, the version numbering of the > Unicode Standard is not really driven by marketing concerns. So there > is very little chance of some version sequence for Unicode that ends > up fitting a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista, > 7, 8, 8.1, 10 ... ;-) The risk I saw was that someone would decide to deprecate value names that look like floating point numbers, so that the relevant value for Version 17.0.0 would be named V17_0 and have no aliases. The new text in UAX#44 is also proof against the major version numbers suddenly becoming the year numbers, as has happened with several products. > Yes. You could always file another piece of feedback using the > contact form.
What deterred me was: (a) "The beta review period for Unicode 10.0 and related technical standards will close on May 1, 2017. This is the last opportunity for technical comments before version 10.0 is released in Q2 2017." - http://blog.unicode.org/2017/04/last-call-on-unicode-100-beta-review.html and (b) Proposed changes aren't yet part of the Unicode standard. Richard. From unicode at unicode.org Wed May 24 01:46:54 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Wed, 24 May 2017 15:46:54 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> Message-ID: <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> On 2017/05/24 05:57, Karl Williamson via Unicode wrote: > On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: >> Adding a "recommendation" this late in the game is just bad standards >> policy. > Unless I misunderstand, you are missing the point. There is already a > recommendation listed in TUS, That's indeed correct. > and that recommendation appears to have > been added without much thought. That's wrong. There was a public review issue with various options and with feedback, and the recommendation has been implemented and in use widely (among else, in major programming language and browsers) without problems for quite some time. > There is no proposal to add a > recommendation "this late in the game". True. The proposal isn't for an addition, it's for a change. The "late in the game" however, still applies. Regards, Martin. From unicode at unicode.org Wed May 24 17:56:39 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Wed, 24 May 2017 16:56:39 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> Message-ID: <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> On 05/24/2017 12:46 AM, Martin J. D?rst wrote: > On 2017/05/24 05:57, Karl Williamson via Unicode wrote: >> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: > >>> Adding a "recommendation" this late in the game is just bad standards >>> policy. > >> Unless I misunderstand, you are missing the point. There is already a >> recommendation listed in TUS, > > That's indeed correct. > > >> and that recommendation appears to have >> been added without much thought. > > That's wrong. There was a public review issue with various options and > with feedback, and the recommendation has been implemented and in use > widely (among else, in major programming language and browsers) without > problems for quite some time. Could you supply a reference to the PRI and its feedback? The recommendation in TUS 5.2 is "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD." And I agree with that. 
And I view an overlong sequence as a maximal ill-formed subsequence that should be replaced by a single FFFD. There's nothing in the text of 5.2 that immediately follows that recommendation that indicates to me that my view is incorrect. Perhaps my view is colored by the fact that I now maintain code that was written to parse UTF-8 back when overlongs were still considered legal input. An overlong was a single unit. When they became illegal, the code still considered them a single unit. I can understand how someone who comes along later could say C0 can't be followed by any continuation character that doesn't yield an overlong, therefore C0 is a maximal subsequence. But I assert that my interpretation is just as valid as that one. And perhaps more so, because of historical precedent. It appears to me that little thought was given to the fact that these changes would cause overlongs to now be at least two units instead of one, making long existing code no longer be best practice. You are effectively saying I'm wrong about this. I thought I had been paying attention to PRI's since the 5.x series, and I don't remember anything about this. If you have evidence to the contrary, please give it. However, I would have thought Markus would have dug any up and given it in his proposal. > > >> There is no proposal to add a >> recommendation "this late in the game". > > True. The proposal isn't for an addition, it's for a change. The "late > in the game" however, still applies. > > Regards, Martin. > From unicode at unicode.org Wed May 24 19:22:39 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Wed, 24 May 2017 17:22:39 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote: > On 05/24/2017 12:46 AM, Martin J. D?rst wrote: > >> That's wrong. There was a public review issue with various options and >> with feedback, and the recommendation has been implemented and in use >> widely (among else, in major programming language and browsers) without >> problems for quite some time. >> > > Could you supply a reference to the PRI and its feedback? > http://www.unicode.org/review/resolved-pri-100.html#pri121 The PRI did not discuss possible different versions of "maximal subpart", and the examples there yield the same results either way. (No non-shortest forms.) The recommendation in TUS 5.2 is "Replace each maximal subpart of an > ill-formed subsequence by a single U+FFFD." > You are right. http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly expanded example compared with the PRI. The text simply talked about a "conversion process" stopping as soon as it encounters something that does not fit, so these edge cases would depend on whether the conversion process treats original-UTF-8 sequences as single units. And I agree with that. And I view an overlong sequence as a maximal > ill-formed subsequence that should be replaced by a single FFFD. 
There's > nothing in the text of 5.2 that immediately follows that recommendation > that indicates to me that my view is incorrect. > > Perhaps my view is colored by the fact that I now maintain code that was > written to parse UTF-8 back when overlongs were still considered legal > input. An overlong was a single unit. When they became illegal, the code > still considered them a single unit. > Right. I can understand how someone who comes along later could say C0 can't be > followed by any continuation character that doesn't yield an overlong, > therefore C0 is a maximal subsequence. > Right. But I assert that my interpretation is just as valid as that one. And > perhaps more so, because of historical precedent. > I agree. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri May 26 05:28:36 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Fri, 26 May 2017 19:28:36 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On 2017/05/25 09:22, Markus Scherer wrote: > On Wed, May 24, 2017 at 3:56 PM, Karl Williamson > wrote: > >> On 05/24/2017 12:46 AM, Martin J. D?rst wrote: >> >>> That's wrong. There was a public review issue with various options and >>> with feedback, and the recommendation has been implemented and in use >>> widely (among else, in major programming language and browsers) without >>> problems for quite some time. >>> >> >> Could you supply a reference to the PRI and its feedback? >> > > http://www.unicode.org/review/resolved-pri-100.html#pri121 > > The PRI did not discuss possible different versions of "maximal subpart", > and the examples there yield the same results either way. (No non-shortest > forms.) It is correct that it didn't give any of the *examples* that are under discussion now. On the other hand, the PRI is very clear about what it means by "maximal subpart": Citing directly from the PRI: >>>> The term "maximal subpart of the ill-formed subsequence" refers to the longest potentially valid initial subsequence or, if none, then to the next single code unit. >>>> At the time of the PRI, so-called "overlongs" were already ill-formed. That change goes back to 2003 or earlier (RFC 3629 (https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect the tightening of the UTF-8 definition in Unicode/ISO 10646). > The recommendation in TUS 5.2 is "Replace each maximal subpart of an >> ill-formed subsequence by a single U+FFFD." >> > > You are right. > > http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly > expanded example compared with the PRI. > > The text simply talked about a "conversion process" stopping as soon as it > encounters something that does not fit, so these edge cases would depend on > whether the conversion process treats original-UTF-8 sequences as single > units. No, the text, both in the PRI and in Unicode 5.2, is quite clear. The "does not fit" (which I haven't found in either text) is clearly grounded by "ill-formed UTF-8". 
And there's no question about what "ill-formed UTF-8" means, in particular in Unicode 5.2, where you just have to go two pages back to find byte sequences such as , , and all called out explicitly as ill-formed. Any kind of claim, as in the L2/17-168 document, about there being an option 2a, are just not substantiated. It's true that there are no explicit examples in the PRI that would allow to distinguish between converting e.g. FC BF BF BF BF 80 to a single FFFD or to six of these. But there's no need to have examples for every corner case if the text is clear enough. In the above six-byte sequence, there's not a single potentially valid (initial) subsequence, so it's all single code units. >> And I agree with that. And I view an overlong sequence as a maximal >> ill-formed subsequence Can you point to any definition that would include or allow such an interpretation? I just haven't found any yet, neither in the PRI nor in Unicode 5.2. >> that should be replaced by a single FFFD. There's >> nothing in the text of 5.2 that immediately follows that recommendation >> that indicates to me that my view is incorrect. I have to agree that the text in Unicode 5.2 could be clearer. It's a hodgepodge of attempts at justifications and definitions. And the word "maximal" itself may also contribute to pushing the interpretation in one direction. But there's plenty in the text that makes it absolutely clear that some things cannot be included. In particular, it says >>>> The term ?maximal subpart of an ill-formed subsequence? refers to the code units that were collected in this manner. They could be the start of a well-formed sequence, except that the sequence lacks the proper continuation. Alternatively, the converter may have found an continuation code unit, which cannot be the start of a well-formed sequence. >>>> And the "in this manner" refers to: >>>> A sequence of code units will be processed up to the point where the sequence either can be unambiguously interpreted as a particular Unicode code point or where the converter recognizes that the code units collected so far constitute an ill-formed subsequence. >>>> So we have the same thing twice: Bail out as soon as something is ill-formed. >> Perhaps my view is colored by the fact that I now maintain code that was >> written to parse UTF-8 back when overlongs were still considered legal >> input. Thanks for providing this information. That's a lot more useful than "feels right", which was given as a reason on this list before. >> An overlong was a single unit. When they became illegal, the code >> still considered them a single unit. That's fine for your code. I might do the same (or not) if I were you, because one indeed never knows in which situation some code is used, and what repercussions a change might produce. But the PRI, and the wording in Unicode 5.2, was created when overlongs and 5-byte and 6-byte sequences and surrogate pairs,... were very clearly ill-formed already. If these texts had intended to make an exception for any of these cases, it would clearly have had to be written into the actual text. Saying something like "the text isn't clear because it says ill-formed, but maybe it doesn't mean ill-formed at the time it was written, but quite a few years before" just doesn't make sense to me at all. > I can understand how someone who comes along later could say C0 can't be >> followed by any continuation character that doesn't yield an overlong, >> therefore C0 is a maximal subsequence. 
Yes indeed, because maximal subsequences are defined by reference to well-formed/ill-formed subsequences, and what's ill-formed is defined in the same standard at the same time. There's nobody "coming along later". That kind of wording would be appropriate if the PRI and the recommendation in the standard had been made e.g. in the 1990s, before the tightening of the UTF-8 definition. Then somebody could say that Unicode overlooked that it implicitly changed the recommendation for how to generate U+FFFDs by changing the definition of well-formed UTF-8. But no such thing at all happened. The PRI was evaluated, and the recommendation included in the text of Unicode, in the context of the then-existing (and since then unchanged) definition of UTF-8. >> But I assert that my interpretation is just as valid as that one. Sorry, but it cannot be valid, because of the timing. The tightening of the UTF-8 definition happened years before the PRI. >> And perhaps more so, because of historical precedent. It's good to know that there are older implementations that behave differently. And I understand that in some cases, these might be reluctant to change. My comments, and Henri's, are very much motivated by the fact that we are reluctant to change our implementations. It may be worth thinking about whether the Unicode standard should mention implementations like yours. But there should be no doubt about the fact that the PRI and Unicode 5.2 (and the current version of Unicode) are clear about what they recommend, and that that recommendation is based on the definition of UTF-8 at that time (and still in force), not on a historical definition of UTF-8. Regards, Martin. From unicode at unicode.org Fri May 26 08:22:54 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 26 May 2017 15:22:54 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: > > Citing directly from the PRI: > > >>>> > The term "maximal subpart of the ill-formed subsequence" refers to the > longest potentially valid initial subsequence or, if none, then to the next > single code unit. > >>>> > The way I understand it is that C0 80 will have TWO maximal subparts, because there is no potentially valid initial subsequence, so only the next single code unit (C0) will be considered. After this, the following byte 80 likewise cannot begin a valid initial subsequence, so here again only the next single code unit (80) will be considered. You'll get U+FFFD replacements emitted twice. This covers all the "overlong" sequences that were allowed by the old UTF-8 definition in the first RFC. For E3 80 20, there will be only ONE maximal subpart, because E3 80 is a potentially valid initial subsequence (of a three-byte sequence) that the byte 20 cannot continue, so a single U+FFFD replacement will be emitted, followed by the valid UTF-8 code unit 20, which will correctly decode as U+0020. Good!
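For illustration only (this is not code from any of the implementations discussed), a minimal sketch of a decoder that applies the definition quoted above: it consumes the longest potentially valid initial subsequence, judged against the current post-RFC 3629 lead and trail byte constraints, and emits exactly one U+FFFD for it, or one U+FFFD for a single code unit that cannot start anything valid:

REPLACEMENT = "\uFFFD"

def lead_info(b):
    # (number of trail bytes, allowed range of the *first* trail byte),
    # following Table 3-7 of the core specification.
    if 0xC2 <= b <= 0xDF: return 1, 0x80, 0xBF
    if b == 0xE0:         return 2, 0xA0, 0xBF
    if 0xE1 <= b <= 0xEC: return 2, 0x80, 0xBF
    if b == 0xED:         return 2, 0x80, 0x9F
    if 0xEE <= b <= 0xEF: return 2, 0x80, 0xBF
    if b == 0xF0:         return 3, 0x90, 0xBF
    if 0xF1 <= b <= 0xF3: return 3, 0x80, 0xBF
    if b == 0xF4:         return 3, 0x80, 0x8F
    return None  # C0, C1, F5..FF, or a lone trail byte: can never start a valid sequence

def decode_by_maximal_subparts(data):
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                          # ASCII
            out.append(chr(b)); i += 1; continue
        info = lead_info(b)
        if info is None:                      # not even one byte of a valid start
            out.append(REPLACEMENT); i += 1; continue
        n_trail, lo, hi = info
        j = i + 1
        for k in range(n_trail):              # collect trail bytes while they still fit
            if j >= len(data):
                break
            ok = (lo <= data[j] <= hi) if k == 0 else (0x80 <= data[j] <= 0xBF)
            if not ok:
                break
            j += 1
        if j - i == n_trail + 1:              # complete, well-formed sequence
            cp = b & (0x7F >> (n_trail + 1))
            for t in data[i + 1:j]:
                cp = (cp << 6) | (t & 0x3F)
            out.append(chr(cp))
        else:                                 # the collected bytes are one maximal
            out.append(REPLACEMENT)           # subpart: exactly one U+FFFD
        i = j
    return "".join(out)

for hexes in ("C0 80", "E3 80 20", "ED B0 80"):
    decoded = decode_by_maximal_subparts(bytes.fromhex(hexes))
    print(hexes, "->", decoded.count(REPLACEMENT), "x U+FFFD")
# C0 80    -> 2 x U+FFFD  (neither C0 nor 80 can start a valid sequence)
# E3 80 20 -> 1 x U+FFFD  (E3 80 is a potentially valid start that 20 breaks), then ' '
# ED B0 80 -> 3 x U+FFFD  (ED B0 would be a surrogate, so ED stands alone)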
This means that this proposal makes sense and is compatible with random accesses within the encoded text whithout having to look backward for an indefinite number of code units and we never have to handle any case with possibly infinite number of code units mapped to the same U+FFFD replacement. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri May 26 10:41:32 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Fri, 26 May 2017 08:41:32 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On Fri, May 26, 2017 at 3:28 AM, Martin J. D?rst wrote: > But there's plenty in the text that makes it absolutely clear that some > things cannot be included. In particular, it says > > >>>> > The term ?maximal subpart of an ill-formed subsequence? refers to the code > units that were collected in this manner. They could be the start of a > well-formed sequence, except that the sequence lacks the proper > continuation. Alternatively, the converter may have found an continuation > code unit, which cannot be the start of a well-formed sequence. > >>>> > > And the "in this manner" refers to: > >>>> > A sequence of code units will be processed up to the point where the > sequence either can be unambiguously interpreted as a particular Unicode > code point or where the converter recognizes that the code units collected > so far constitute an ill-formed subsequence. > >>>> > > So we have the same thing twice: Bail out as soon as something is > ill-formed. The UTF-8 conversion code that I wrote for ICU, and apparently the code that various other people have written, collects sequences starting from lead bytes, according to the original spec, and at the end looks at whether the assembled code point is too low for the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a non-trail byte is quite natural, and reading the PRI text accordingly is quite natural too. Aside from UTF-8 history, there is a reason for preferring a more "structural" definition for UTF-8 over one purely along valid sequences. This applies to code that *works* on UTF-8 strings rather than just converting them. For UTF-8 *processing* you need to be able to iterate both forward and backward, and sometimes you need not collect code points while skipping over n units in either direction -- but your iteration needs to be consistent in all cases. This is easier to implement (especially in fast, short, inline code) if you have to look only at how many trail bytes follow a lead byte, without having to look whether the first trail byte is in a certain range for some specific lead bytes. (And don't say that everyone can validate all strings once and then all code can assume they are valid: That just does not work for library code, you cannot assume anything about your input strings, and you cannot crash when they are ill-formed.) markus -------------- next part -------------- An HTML attachment was scrubbed... 
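For contrast, here is one way (again an illustration only, not ICU's actual code) to realize the lead-byte-driven reading Markus describes: the number of trail bytes to collect is decided from the lead byte alone, and overlong, surrogate or out-of-range values are rejected only after the whole sequence has been assembled, so the same run of bytes becomes a single U+FFFD:

def decode_structurally(data):
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1; continue
        # How many trail bytes to collect is judged from the lead byte alone
        # (C0..DF, E0..EF, F0..F7), as in the original structural layout.
        if   0xC0 <= b <= 0xDF: n, cp, floor = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF: n, cp, floor = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7: n, cp, floor = 3, b & 0x07, 0x10000
        else:                                   # lone trail byte, or F8..FF
            out.append("\uFFFD"); i += 1; continue
        j = i + 1
        while j < len(data) and j - i <= n and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        ok = (j - i == n + 1 and cp >= floor
              and not 0xD800 <= cp <= 0xDFFF and cp <= 0x10FFFF)
        # The whole collected run (an overlong such as C0 80, or the surrogate
        # ED B0 80) is reported as one error: a single U+FFFD.
        out.append(chr(cp) if ok else "\uFFFD")
        i = j
    return "".join(out)

for hexes in ("C0 80", "E3 80 20", "ED B0 80"):
    print(hexes, "->", decode_structurally(bytes.fromhex(hexes)).count("\uFFFD"), "x U+FFFD")
# C0 80    -> 1 x U+FFFD  (collected as one two-byte sequence, rejected as overlong)
# E3 80 20 -> 1 x U+FFFD  (20 is not a trail byte, so collection stops), then ' '
# ED B0 80 -> 1 x U+FFFD  (collected in full, then rejected as a surrogate)

The difference between the two sketches is exactly the point under discussion: C0 80 and ED B0 80 come out here as one U+FFFD each rather than two and three.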
URL: From unicode at unicode.org Fri May 26 12:28:43 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Fri, 26 May 2017 11:28:43 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On 05/26/2017 04:28 AM, Martin J. D?rst wrote: > It may be worth to think about whether the Unicode standard should > mention implementations like yours. But there should be no doubt about > the fact that the PRI and Unicode 5.2 (and the current version of > Unicode) are clear about what they recommend, and that that > recommendation is based on the definition of UTF-8 at that time (and > still in force), and not at based on a historical definition of UTF-8. The link provided about the PRI doesn't lead to the comments. Is there any evidence that there was a realization that the language being adopted would lead to overlongs being split into multiple subparts? From unicode at unicode.org Fri May 26 13:22:37 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 26 May 2017 11:22:37 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: > The link provided about the PRI doesn't lead to the comments. > PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered directory with the name "feedback.html". But the comments were collected together at the time and are accessible here: http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 Also there was a separately submitted comment document: http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt And the minutes of the pertinent UTC meeting (UTC #116): http://www.unicode.org/L2/L2008/08253.htm The minutes simply capture the consensus to adopt Option #2 from PRI #121, and the relevant action items. I now return the floor to the distinguished disputants to continue litigating history. 
;-) --Ken From unicode at unicode.org Fri May 26 16:15:55 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Fri, 26 May 2017 15:15:55 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> On 05/26/2017 12:22 PM, Ken Whistler wrote: > > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >> The link provided about the PRI doesn't lead to the comments. >> > > PRI #121 (August, 2008) pre-dated the practice of keeping all the > feedback comments together with the PRI itself in a numbered directory > with the name "feedback.html". But the comments were collected together > at the time and are accessible here: > > http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 > > Also there was a separately submitted comment document: > > http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt > > And the minutes of the pertinent UTC meeting (UTC #116): > > http://www.unicode.org/L2/L2008/08253.htm > > The minutes simply capture the consensus to adopt Option #2 from PRI > #121, and the relevant action items. > > I now return the floor to the distinguished disputants to continue > litigating history. ;-) > > --Ken > > The reason this discussion got started was that in December, someone came to me and said the code I support does not follow Unicode best practices, and suggested I need to change, though no ticket (yet) has been filed. I was surprised, and posted a query to this list about what the advantages of the new approach are. There were a number of replies, but I did not see anything that seemed definitive. After a month, I created a ticket in Unicode and Markus was assigned to research it, and came up with the proposal currently being debated. Looking at the PRI, it seems to me that treating an overlong as a single maximal unit is in the spirit of the wording, if not the fine print. That seems to be borne out by Markus, even with his stake in ICU, supporting option #2. Looking at the comments, I don't see any discussion of the effect of this on overlong treatments. My guess is that the effect change was unintentional. So I have code that handled overlongs in the only correct way possible when they were acceptable, and in the obvious way after they became illegal, and now without apparent discussion (which is very much akin to "flimsy reasons"), it suddenly was no longer "best practice". And that change came "rather late in the game". That this escaped notice for years indicates that the specifics of REPLACEMENT CHAR handling don't matter all that much. To cut to the chase, I think Unicode should issue a Corrigendum to the effect that it was never the intent of this change to say that treating overlongs as a single unit isn't best practice. I'm not sure this warrants a full-fledge Corrigendum, though. But I believe the text of the best practices should indicate that treating overlongs as a single unit is just as acceptable as Martin's interpretation. I believe this is pretty much in line with Shawn's position. 
Certainly, a discussion of the reasons one might choose one interpretation over another should be included in TUS. That would likely have satisfied my original query, which hence would never have been posted. From unicode at unicode.org Fri May 26 16:41:49 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Fri, 26 May 2017 21:41:49 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> Message-ID: So basically this came about because code got bugged for not following the "recommendation." To fix that, the recommendation will be changed. However then that is going to lead to bugs for other existing code that does not follow the new recommendation. I totally get the forward/backward scanning in sync without decoding reasoning for some implementations, however I do not think that the practices that benefit those should extend to other applications that are happy with a different practice. In either case, the bad characters are garbage, so neither approach is "better" - except that one or the other may be more conducive to the requirements of the particular API/application. I really think the correct approach here is to allow any number of replacement characters without prejudice. Perhaps with suggestions for pros and cons of various approaches if people feel that is really necessary. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode Sent: Friday, May 26, 2017 2:16 PM To: Ken Whistler Cc: unicode at unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 05/26/2017 12:22 PM, Ken Whistler wrote: > > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >> The link provided about the PRI doesn't lead to the comments. >> > > PRI #121 (August, 2008) pre-dated the practice of keeping all the > feedback comments together with the PRI itself in a numbered directory > with the name "feedback.html". But the comments were collected > together at the time and are accessible here: > > http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 > > Also there was a separately submitted comment document: > > http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt > > And the minutes of the pertinent UTC meeting (UTC #116): > > http://www.unicode.org/L2/L2008/08253.htm > > The minutes simply capture the consensus to adopt Option #2 from PRI > #121, and the relevant action items. > > I now return the floor to the distinguished disputants to continue > litigating history. ;-) > > --Ken > > The reason this discussion got started was that in December, someone came to me and said the code I support does not follow Unicode best practices, and suggested I need to change, though no ticket (yet) has been filed. I was surprised, and posted a query to this list about what the advantages of the new approach are. There were a number of replies, but I did not see anything that seemed definitive. 
After a month, I created a ticket in Unicode and Markus was assigned to research it, and came up with the proposal currently being debated. Looking at the PRI, it seems to me that treating an overlong as a single maximal unit is in the spirit of the wording, if not the fine print. That seems to be borne out by Markus, even with his stake in ICU, supporting option #2. Looking at the comments, I don't see any discussion of the effect of this on overlong treatments. My guess is that the effect change was unintentional. So I have code that handled overlongs in the only correct way possible when they were acceptable, and in the obvious way after they became illegal, and now without apparent discussion (which is very much akin to "flimsy reasons"), it suddenly was no longer "best practice". And that change came "rather late in the game". That this escaped notice for years indicates that the specifics of REPLACEMENT CHAR handling don't matter all that much. To cut to the chase, I think Unicode should issue a Corrigendum to the effect that it was never the intent of this change to say that treating overlongs as a single unit isn't best practice. I'm not sure this warrants a full-fledge Corrigendum, though. But I believe the text of the best practices should indicate that treating overlongs as a single unit is just as acceptable as Martin's interpretation. I believe this is pretty much in line with Shawn's position. Certainly, a discussion of the reasons one might choose one interpretation over another should be included in TUS. That would likely have satisfied my original query, which hence would never have been posted. From unicode at unicode.org Tue May 30 05:55:47 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 30 May 2017 19:55:47 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: <999a4fbb-8a90-1afb-41e5-30d13ef2415f@it.aoyama.ac.jp> Hello Markus, others, On 2017/05/27 00:41, Markus Scherer wrote: > On Fri, May 26, 2017 at 3:28 AM, Martin J. D?rst > wrote: > >> But there's plenty in the text that makes it absolutely clear that some >> things cannot be included. In particular, it says >> >>>>>> >> The term ?maximal subpart of an ill-formed subsequence? refers to the code >> units that were collected in this manner. They could be the start of a >> well-formed sequence, except that the sequence lacks the proper >> continuation. Alternatively, the converter may have found an continuation >> code unit, which cannot be the start of a well-formed sequence. >>>>>> >> >> And the "in this manner" refers to: >>>>>> >> A sequence of code units will be processed up to the point where the >> sequence either can be unambiguously interpreted as a particular Unicode >> code point or where the converter recognizes that the code units collected >> so far constitute an ill-formed subsequence. >>>>>> >> >> So we have the same thing twice: Bail out as soon as something is >> ill-formed. 
> > > The UTF-8 conversion code that I wrote for ICU, and apparently the code > that various other people have written, collects sequences starting from > lead bytes, according to the original spec, and at the end looks at whether > the assembled code point is too low for the lead byte, or is a surrogate, > or is above 10FFFF. Stopping at a non-trail byte is quite natural, I think nobody is debating that this is *one way* to do things, and that some code does it. > and > reading the PRI text accordingly is quite natural too. So you are claiming that you're covered because you produce an FFFD "where the converter recognizes that the code units collected so far constitute an ill-formed subsequence", except that your converter is a bit slow in doing that recognition? Well, I guess I could come up with another converter that would be even slower at recognizing that the code units collected so far constitute an ill-formed subsequence. Would that still be okay in your view? And please note that your "just a bit slow" interpretation might somehow work for Unicode 5.2, but it doesn't work for Unicode 9.0, because over the years, things have been tightened up, and the standard now makes it perfectly clear that C0 by itself is a maximal subpart of an ill-formed subsequence. From Section 3.9 of http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf: >>>> Applying the definition of maximal subparts for these ill-formed subsequences, in the first case is a maximal subpart, because that byte value can never be the first byte of a well-formed UTF-8 sequence. >>>> > Aside from UTF-8 history, there is a reason for preferring a more > "structural" definition for UTF-8 over one purely along valid sequences. There may be all kinds of reasons for doing things one way or another. But there are good reasons why the current recommendation is in place, and there are even better reasons for not suddenly reversing it to something completely different. > This applies to code that *works* on UTF-8 strings rather than just > converting them. For UTF-8 *processing* you need to be able to iterate both > forward and backward, and sometimes you need not collect code points while > skipping over n units in either direction -- but your iteration needs to be > consistent in all cases. This is easier to implement (especially in fast, > short, inline code) if you have to look only at how many trail bytes follow > a lead byte, without having to look whether the first trail byte is in a > certain range for some specific lead bytes. > > (And don't say that everyone can validate all strings once and then all > code can assume they are valid: That just does not work for library code, > you cannot assume anything about your input strings, and you cannot crash > when they are ill-formed.) [rest of mail mostly OT] Well, different libraries may make different choices. As an example, the Ruby programming language does essentially that: Whenever it finds an invalid string, it raises an exception. Not all processing on all kinds of invalid strings immediately raises an exception (because of efficiency considerations). But there are quite strong expectations that this happens soon. As an example, when I extended case conversion from ASCII only to Unicode (see e.g. http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/, http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/), I had to go back and fix some things because there were explicit tests checking that invalid inputs would raise exceptions. 
At least for Ruby, this policy of catching problems early rather than allowing garbage-in-garbage-out has worked well. > markus Regards, Martin. From unicode at unicode.org Tue May 30 06:26:39 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 30 May 2017 20:26:39 +0900 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> Message-ID: <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> Hello Karl, others, On 2017/05/27 06:15, Karl Williamson via Unicode wrote: > On 05/26/2017 12:22 PM, Ken Whistler wrote: >> >> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >>> The link provided about the PRI doesn't lead to the comments. >>> >> >> PRI #121 (August, 2008) pre-dated the practice of keeping all the >> feedback comments together with the PRI itself in a numbered directory >> with the name "feedback.html". But the comments were collected >> together at the time and are accessible here: >> >> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 >> >> Also there was a separately submitted comment document: >> >> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt >> >> And the minutes of the pertinent UTC meeting (UTC #116): >> >> http://www.unicode.org/L2/L2008/08253.htm >> >> The minutes simply capture the consensus to adopt Option #2 from PRI >> #121, and the relevant action items. >> >> I now return the floor to the distinguished disputants to continue >> litigating history. ;-) >> >> --Ken >> >> > > The reason this discussion got started was that in December, someone > came to me and said the code I support does not follow Unicode best > practices, and suggested I need to change, though no ticket (yet) has > been filed. I was surprised, and posted a query to this list about what > the advantages of the new approach are. Can you provide a reference to that discussion? I might have missed it in December. > There were a number of replies, > but I did not see anything that seemed definitive. After a month, I > created a ticket in Unicode and Markus was assigned to research it, and > came up with the proposal currently being debated. Which is to completely reverse the current recommendation in Unicode 9.0. While I agree that this might help you fending off a bug report, it would create chances for bug reports for Ruby, Python3, many if not all Web browsers,... > Looking at the PRI, it seems to me that treating an overlong as a single > maximal unit is in the spirit of the wording, if not the fine print. In standards, the "fine print" matters. > That seems to be borne out by Markus, even with his stake in ICU, > supporting option #2. Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I also supported option 2, with code behind it. > Looking at the comments, I don't see any discussion of the effect of > this on overlong treatments. My guess is that the effect change was > unintentional. I agree that it was probably not considered explicitly. 
But overlongs were disallowed for security reasons, and once the definition of UTF-8 was tightened, "overlongs" essentially did not exist anymore. Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows what it means, but everybody knows they don't exist. [Just to be sure, by the above, I don't mean that a sequence such as C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by itself, and there is no need to see C0 B0 as a (ghost) sequence.] > So I have code that handled overlongs in the only correct way possible > when they were acceptable, No. As long as they were acceptable, they wouldn't have been replaced by an FFFD. > and in the obvious way after they became illegal, Why? A change was necessary from producing an actual character to producing some number of FFFDs. It may have been easier to produce just a single FFFD, but that depends on how the code was organized. > and now without apparent discussion (which is very much akin to > "flimsy reasons"), it suddenly was no longer "best practice". Not 'now', but almost 9 years ago. And not "without apparent discussion", but with an explicit PRI. > And that > change came "rather late in the game". That this escaped notice for > years indicates that the specifics of REPLACEMENT CHAR handling don't > matter all that much. I agree. You haven't even yet received a ticket yet. > To cut to the chase, I think Unicode should issue a Corrigendum to the > effect that it was never the intent of this change to say that treating > overlongs as a single unit isn't best practice. I'm not sure this > warrants a full-fledge Corrigendum, though. But I believe the text of > the best practices should indicate that treating overlongs as a single > unit is just as acceptable as Martin's interpretation. I'd essentially be fine with that, under the condition that the current recommendation is maintained as a clearly identified recommendation, so that Python3, Ruby, Web standards and browsers, and so on can easily refer to it. Regards, Martin. > I believe this is pretty much in line with Shawn's position. Certainly, > a discussion of the reasons one might choose one interpretation over > another should be included in TUS. That would likely have satisfied my > original query, which hence would never have been posted. > . > From unicode at unicode.org Tue May 30 10:07:05 2017 From: unicode at unicode.org (Tony Narlock via Unicode) Date: Tue, 30 May 2017 08:07:05 -0700 Subject: unihan-etl: create exports of UNIHAN db to csv, json and yaml Message-ID: I have created a tool in python to extract and transform UNIHAN database's information. It?s open source (MIT-licensed) and offers users customized outputs. It?s documented extensively at https://unihan-etl.git-pull.com. In addition, the project?s source code can be found at https://github.com/cihai/unihan-etl. I paired off this tool due to the time-effort of studying the fields and extracting the information correctly. The hope is that one day a traveller going down the same path can find this useful. 
It has been mentioned before on this list at least once, back in 2004: http://unicode.org/mail-arch/unicode-ml/y2004-m04/0255.html > I'm trying to pare Unihan.txt down to a less unwieldy size for my own use by eliminating properties that are of no interest to me and would like to be certain that eliminating the four properties containing the actual values for those dictionaries can be done safely because the information can be reconstituted if necessary from the kIRG* properties since I'm not certain if those properties are of interest to me. There are developers who may only want to extract a pre-determined set of fields. $ pip install ?user unihan-etl And create an export values into a CSV (UNIHAN downloads automatically): $ unihan-etl Only pull custom fields (once downloaded, Unihan.zip is cached for reuse): $ unihan-etl -f kMandarin kNelson kMorohashi Will only pull out those fields. Let?s get a structured output in JSON (empty values are pruned automatically): $ unihan-etl -f kMandarin kNelson kMorohashi -F json Also, with pyyaml you can use -F yaml, as well. $ pip install pyyaml $ unihan-etl -f kMandarin kNelson kMorohashi -F yaml To see all the command line options: http://unihan-etl.git-pull.com/en/latest/cli.html Container format: To keep that data exports as portable as possible, it follows the Data Packages standard ( http://frictionlessdata.io/data-packages/). This is a trickier data set since fields compact quite a bit of detail in them. Other data sets such as CEDict will also be made available as data packages. Backstory: I am trying to create a spiritual successor to cjklib ( https://pypi.python.org/pypi/cjklib). The project aims to pull in CJK datasets and make them accessible under one library. Datasets are also going to be available a la carte via a consistent data standard (Data Packages). I am opting to use UNIHAN database as a core of the CJK data sources. The project?s homepage is https://cihai.git-pull.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 30 10:50:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 May 2017 08:50:56 -0700 Subject: Looking for 8-bit computer designers Message-ID: <20170530085056.665a7a7059d7ee80bb4d670165c8327d.a153952c07.wbe@email03.godaddy.com> Not as OT as it might seem: If there are any engineers or designers on this list who worked on 8-bit and early 16-bit legacy computers (Apple II, Atari, Commodore, Tandy, etc.), and especially on character set design for these machines, please contact me privately at . Any desired degree of anonymity and confidentiality will be honored. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 30 11:20:12 2017 From: unicode at unicode.org (Rebecca T via Unicode) Date: Tue, 30 May 2017 16:20:12 +0000 Subject: unihan-etl: create exports of UNIHAN db to csv, json and yaml In-Reply-To: References: Message-ID: Oh, thank god. I?ve wanted something like this for ages, but I?ve been too lazy to invest the time to create a serious tool ? I?ve used a lot of messy one-time regular expressions. Will definitely be starring your repo! -------------- next part -------------- An HTML attachment was scrubbed... 
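As a small follow-on to the unihan-etl announcement above, a sketch of consuming such a JSON export with nothing but the standard library; the file name is a placeholder, and the assumption that the export is a flat list of per-character records is mine, so check the tool's documentation for the actual layout:

import json

# "unihan.json" stands in for wherever the -F json export was written.
with open("unihan.json", encoding="utf-8") as f:
    entries = json.load(f)

print(len(entries), "records")
# Empty values are pruned by the exporter, so look at which of the selected
# fields actually survived for a sample record.
print(sorted(entries[0].keys()))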
URL: From unicode at unicode.org Tue May 30 12:05:23 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 30 May 2017 17:05:23 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <999a4fbb-8a90-1afb-41e5-30d13ef2415f@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <999a4fbb-8a90-1afb-41e5-30d13ef2415f@it.aoyama.ac.jp> Message-ID: > I think nobody is debating that this is *one way* to do things, and that some code does it. Except that they sort of are. The premise is that the "old language was wrong", and the "new language is right." The reason we know the old language was wrong was that there was a bug filed against an implementation because it did not conform to the old language. The response to the application bug was to change the standard's recommendation. If this language is adopted, then the opposite is going to happen: Bugs will be filed against applications that conform to the old recommendation and not the new recommendation. They will say "your code could be better, it is not following the recommendation." Eventually that will escalate to some level that it will need to be considered, however, regardless of the improvements, it will be a "breaking change". Changing code from one recommendation to another will change behavior. For applications or SDKs with enough visibility, that will break *someone* because that's how these things work. For applications that choose not to change, in response to some RFP, someone's going to say "you don't fully conform to Unicode, we'll go with a different vendor." Not saying that these things make sense, that's just the way the world works. In some situations, one form is better, in some cases another form is better. If the intent is truly that there is not "one way to do things," then the language should reflect that. -Shawn From unicode at unicode.org Tue May 30 12:11:40 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Tue, 30 May 2017 17:11:40 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> Message-ID: > Which is to completely reverse the current recommendation in Unicode 9.0. While I agree that this might help you fending off a bug report, it would create chances for bug reports for Ruby, Python3, many if not all Web browsers,... & Windows & .Net Changing the behavior of the Windows / .Net SDK is a non-starter. > Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows what it means, but everybody knows they don't exist. Yes, this is trying to improve the language for a scenario that CANNOT HAPPEN. 
We're trying to optimize a case for data that implementations should never encounter. It is sort of exactly like optimizing for the case where your data input is actually a dragon and not UTF-8 text. Since it is illegal, then the "at least 1 FFFD but as many as you want to emit (or just fail)" is fine. -Shawn From unicode at unicode.org Tue May 30 15:30:56 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 May 2017 13:30:56 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0 was originally a lead byte for two-byte sequences." When was it ever true that C0 was a valid lead byte? And what does that have to do with (not) restricting trail bytes? -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue May 30 17:32:34 2017 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Tue, 30 May 2017 16:32:34 -0600 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> References: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> Message-ID: On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote: > L2/17-168 says: > > "For UTF-8, recommend evaluating maximal subsequences based on the > original structural definition of UTF-8, without ever restricting trail > bytes to less than 80..BF. For example: is a single maximal > subsequence because C0 was originally a lead byte for two-byte > sequences."
URL: From unicode at unicode.org Wed May 31 00:34:06 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 06:34:06 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> Message-ID: <20170531063406.1fc54994@JRWUBU2> On Fri, 26 May 2017 11:22:37 -0700 Ken Whistler via Unicode wrote: > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: > > The link provided about the PRI doesn't lead to the comments. > > > > PRI #121 (August, 2008) pre-dated the practice of keeping all the > feedback comments together with the PRI itself in a numbered > directory with the name "feedback.html". But the comments were > collected together at the time and are accessible here: > > http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 > > Also there was a separately submitted comment document: > > http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt > > And the minutes of the pertinent UTC meeting (UTC #116): > > http://www.unicode.org/L2/L2008/08253.htm > > The minutes simply capture the consensus to adopt Option #2 from PRI > #121, and the relevant action items. For Unicode members, there is also the original Unicore thread, which starts at http://www.unicode.org/mail-arch/unicore-ml/y2008-m04/0091.html . (I couldn't find anything on the general list.) There were objections there to replacing non-shortest form sequences by multiple ocurrences of U+FFFD. They were rejected by those that mattered, and so the option of a single U+FFFD was not included in the PRI. Richard. From unicode at unicode.org Wed May 31 00:47:46 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 06:47:46 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <9e6d3a06-8e04-21b6-a32a-a1d9bb06a812@khwilliamson.com> References: <20170530133056.665a7a7059d7ee80bb4d670165c8327d.e41abb7e04.wbe@email03.godaddy.com> <9e6d3a06-8e04-21b6-a32a-a1d9bb06a812@khwilliamson.com> Message-ID: <20170531064746.1ae234f9@JRWUBU2> On Tue, 30 May 2017 16:38:45 -0600 Karl Williamson via Unicode wrote: > Under Best Practices, how many REPLACEMENT CHARACTERs should the > sequence generate? 0, 1, 2, 3, 4 ? > > In practice, how many do parsers generate? See Markus Kuhn's test page http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, test 5.1.5. Firefox generates three replacement characters. Richard. 
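[Editorial note: to make the U+FFFD counts above concrete, here is a minimal sketch of a counter that follows the Unicode 9.0 "maximal subsequence" recommendation, i.e. the restricted trail-byte ranges of the well-formed-sequences table. It is written for this thread, not taken from Firefox, ICU or any other decoder discussed, and the function names are illustrative. For a lone surrogate such as <ED A0 80> it yields three U+FFFDs, consistent with the three reported for Firefox above, and for the overlong <C0 80> it yields two.]

    /* count_fffd_maximal.c -- editorial sketch of the Unicode 9.0
     * "maximal subsequence" recommendation; not code from any shipping
     * decoder.  Counts how many U+FFFDs such a decoder would emit. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Is b a well-formed *second* byte after the given lead byte,
     * per the restricted ranges (no overlongs, surrogates, > U+10FFFF)? */
    static int valid_second(uint8_t lead, uint8_t b) {
        if (lead >= 0xC2 && lead <= 0xDF) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xE0)                 return b >= 0xA0 && b <= 0xBF;
        if (lead >= 0xE1 && lead <= 0xEC) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xED)                 return b >= 0x80 && b <= 0x9F;
        if (lead == 0xEE || lead == 0xEF) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xF0)                 return b >= 0x90 && b <= 0xBF;
        if (lead >= 0xF1 && lead <= 0xF3) return b >= 0x80 && b <= 0xBF;
        if (lead == 0xF4)                 return b >= 0x80 && b <= 0x8F;
        return 0;  /* 80..C1 and F5..FF never start a well-formed sequence */
    }

    static size_t count_fffd_maximal(const uint8_t *s, size_t n) {
        size_t i = 0, fffd = 0;
        while (i < n) {
            uint8_t b = s[i];
            if (b < 0x80) { i++; continue; }          /* ASCII, no FFFD */
            int len = (b >= 0xC2 && b <= 0xDF) ? 2
                    : (b >= 0xE0 && b <= 0xEF) ? 3
                    : (b >= 0xF0 && b <= 0xF4) ? 4 : 0;
            if (len == 0 || i + 1 >= n || !valid_second(b, s[i + 1])) {
                fffd++; i++; continue;   /* maximal subpart is this one byte */
            }
            size_t j = i + 2;            /* 3rd/4th bytes are plain 80..BF trails */
            while (j < i + (size_t)len && j < n && (s[j] & 0xC0) == 0x80) j++;
            if (j < i + (size_t)len) fffd++;  /* truncated: one FFFD for the subpart */
            i = j;                       /* a complete sequence here is well-formed */
        }
        return fffd;
    }

    int main(void) {
        const uint8_t surrogate[] = { 0xED, 0xA0, 0x80 };  /* lone UTF-16 surrogate */
        const uint8_t overlong[]  = { 0xC0, 0x80 };        /* overlong NUL */
        printf("%zu\n", count_fffd_maximal(surrogate, sizeof surrogate)); /* 3 */
        printf("%zu\n", count_fffd_maximal(overlong, sizeof overlong));   /* 2 */
        return 0;
    }

Under the "structural" reading being proposed (collect everything that looks like a trail byte, then validate the value), each of these inputs would instead produce a single U+FFFD; a sketch of that style is given at the end of this digest.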
From unicode at unicode.org Wed May 31 01:08:37 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 07:08:37 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> Message-ID: <20170531070837.6aa2b590@JRWUBU2> On Fri, 26 May 2017 21:41:49 +0000 Shawn Steele via Unicode wrote: > I totally get the forward/backward scanning in sync without decoding > reasoning for some implementations, however I do not think that the > practices that benefit those should extend to other applications that > are happy with a different practice. > In either case, the bad characters are garbage, so neither approach > is "better" - except that one or the other may be more conducive to > the requirements of the particular API/application. There's a potential issue with input methods that indirectly edit the backing store. For example, GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can delete an amount of text specified in characters, not storage units. (Deletion by storage units is not available in this interface.) This might cause utter confusion or worse if the backing store starts out corrupt. A corrupt backing store is normally manually correctable if most of the text is ASCII. Richard. From unicode at unicode.org Wed May 31 07:12:12 2017 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Wed, 31 May 2017 15:12:12 +0300 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170531070837.6aa2b590@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: I've researched this more. While the old advice dominates the handling of non-shortest forms, there is more variation than I previously thought when it comes to truncated sequences and CESU-8-style surrogates. Still, the ICU behavior is an outlier considering the set of implementations that I tested. I've written up my findings at https://hsivonen.fi/broken-utf-8/ The write-up mentions https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd like to draw everyone's attention to that bug, which is real-world evidence of a bug arising from two UTF-8 decoders within one product handling UTF-8 errors differently. On Sun, May 21, 2017 at 7:37 PM, Mark Davis ?? via Unicode wrote: > There is plenty of time for public comment, since it was targeted at Unicode > 11, the release for about a year from now, not Unicode 10, due this year. > When the UTC "approves a change", that change is subject to comment, and the > UTC can always reverse or modify its approval up until the meeting before > release date. So there are ca. 
9 months in which to comment. What should I read to learn how to formulate an appeal correctly? Does it matter if a proposal/appeal is submitted as a non-member implementor person, as an individual person member or as a liaison member? http://www.unicode.org/consortium/liaison-members.html list "the Mozilla Project" as a liaison member, but Mozilla-side conventions make submitting proposals like this "as Mozilla" problematic (we tend to avoid "as Mozilla" statements on technical standardization fora except when the W3C Process forces us to make them as part of charter or Proposed Recommendation review). > The modified text is a set of guidelines, not requirements. So no > conformance clause is being changed. I'm aware of this. > If people really believed that the guidelines in that section should have > been conformance clauses, they should have proposed that at some point. It seems to me that this thread does not support the conclusion that the Unicode Standard's expression of preference for the number of REPLACEMENT CHARACTERs should be made into a conformance requirement in the Unicode Standard. This thread could be taken to support a conclusion that the Unicode Standard should not express any preference beyond "at least one and at most as many as there were bytes". On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode wrote: > In any case, Henri is complaining that it?s too difficult to implement; it isn?t. You need two extra states, both of which are trivial. I am not claiming it's too difficult to implement. I think it inappropriate to ask implementations, even from-scratch ones, to take on added complexity in error handling on mere aesthetic grounds. Also, I think it's inappropriate to induce implementations already written according to the previous guidance to change (and risk bugs) or to make the developers who followed the previous guidance with precision be the ones who need to explain why they aren't following the new guidance. On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode wrote: > The UTF-8 conversion code that I wrote for ICU, and apparently the code that > various other people have written, collects sequences starting from lead > bytes, according to the original spec, and at the end looks at whether the > assembled code point is too low for the lead byte, or is a surrogate, or is > above 10FFFF. Stopping at a non-trail byte is quite natural, and reading the > PRI text accordingly is quite natural too. I don't doubt that other people have written code with the same concept as ICU, but as far as non-shortest form handling goes in the implementations I tested (see URL at the start of this email) ICU is the lone outlier. > Aside from UTF-8 history, there is a reason for preferring a more > "structural" definition for UTF-8 over one purely along valid sequences. > This applies to code that *works* on UTF-8 strings rather than just > converting them. For UTF-8 *processing* you need to be able to iterate both > forward and backward, and sometimes you need not collect code points while > skipping over n units in either direction -- but your iteration needs to be > consistent in all cases. This is easier to implement (especially in fast, > short, inline code) if you have to look only at how many trail bytes follow > a lead byte, without having to look whether the first trail byte is in a > certain range for some specific lead bytes. 
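[Editorial note: a minimal sketch of the structural iteration described in the quoted paragraph, stepping by lead-byte arithmetic only, without consulting the restricted trail-byte ranges. This is illustrative code written for this thread, not ICU's or encoding_rs's API, and the function names are made up. On well-formed (or already repaired) text u8_prev is the exact inverse of u8_next; on arbitrary ill-formed input the two directions can disagree (for example across a run of more than three stray trail bytes), which is part of why a consistent error convention matters.]

    /* Structural UTF-8 iteration: move by the lead byte's class only. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Advance past one "character": a lead byte plus its structural trails. */
    static size_t u8_next(const uint8_t *s, size_t n, size_t i) {
        if (i >= n) return n;
        uint8_t b = s[i++];
        int trails = (b & 0xE0) == 0xC0 ? 1
                   : (b & 0xF0) == 0xE0 ? 2
                   : (b & 0xF8) == 0xF0 ? 3 : 0;  /* ASCII and stray trails move by 1 */
        while (trails-- > 0 && i < n && (s[i] & 0xC0) == 0x80) i++;
        return i;
    }

    /* Step back to the start of the previous "character": skip at most
     * three trail bytes (10xxxxxx), then stop on whatever precedes them. */
    static size_t u8_prev(const uint8_t *s, size_t i) {
        if (i == 0) return 0;
        size_t j = i - 1;
        int back = 0;
        while (j > 0 && back < 3 && (s[j] & 0xC0) == 0x80) { j--; back++; }
        return j;
    }

    int main(void) {
        /* "A", U+00E9, U+1F4A9, "Z" */
        const uint8_t s[] = { 'A', 0xC3, 0xA9, 0xF0, 0x9F, 0x92, 0xA9, 'Z' };
        size_t n = sizeof s, i = 0;
        while (i < n) { printf("%zu ", i); i = u8_next(s, n, i); }  /* 0 1 3 7 */
        printf("\n");
        while (i > 0) { i = u8_prev(s, i); printf("%zu ", i); }     /* 7 3 1 0 */
        printf("\n");
        return 0;
    }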
But the matter at hand is decoding potentially-invalid UTF-8 input into a valid in-memory Unicode representation, so later processing is somewhat a red herring as being out of scope for this step. I do agree that if you already know that the data is valid UTF-8, it makes sense to work from the bit pattern definition only. (E.g. in encoding_rs, the implementation I've written and that's on track to replacing uconv in Firefox, UTF-8 decode works using the knowledge of which bytes can possibly follow which leads, but encode from UTF-8 to legacy encodings works using the bit pattern definition, because the Rust type system allows the encoder side to confidently assume that the input to the encoder is valid UTF-8.) On Sat, May 27, 2017 at 12:15 AM, Karl Williamson via Unicode wrote: > The reason this discussion got started was that in December, someone came to > me and said the code I support does not follow Unicode best practices, and > suggested I need to change, though no ticket (yet) has been filed. I think it's pretty uncool to inflict the problem you experienced onto everyone who followed the previous guidance instead. > I was > surprised, and posted a query to this list about what the advantages of the > new approach are. There were a number of replies, but I did not see > anything that seemed definitive. After a month, I created a ticket in > Unicode and Markus was assigned to research it, and came up with the > proposal currently being debated. I think the research I linked to at the start of this email shows that the proposal wasn't researched sufficiently before it was brought to the Unicode Technical Committee. If anything, I hope this thread results in the establishment of a requirement for proposals to come with proper research about what multiple prominent implementations to about the subject matter of a proposal concerning changes to text about implementation behavior. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Wed May 31 12:11:13 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 18:11:13 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <20170531181113.0fc7ea7a@JRWUBU2> On Wed, 31 May 2017 15:12:12 +0300 Henri Sivonen via Unicode wrote: > The write-up mentions > https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd > like to draw everyone's attention to that bug, which is real-world > evidence of a bug arising from two UTF-8 decoders within one product > handling UTF-8 errors differently. > Does it matter if a proposal/appeal is submitted as a non-member > implementor person, as an individual person member or as a liaison > member? 
http://www.unicode.org/consortium/liaison-members.html list > "the Mozilla Project" as a liaison member, but Mozilla-side > conventions make submitting proposals like this "as Mozilla" > problematic (we tend to avoid "as Mozilla" statements on technical > standardization fora except when the W3C Process forces us to make > them as part of charter or Proposed Recommendation review). There may well be an advantage to being able to answer any questions on the proposal at the meeting, especially if it isn't read until the meeting. > > The modified text is a set of guidelines, not requirements. So no > > conformance clause is being changed. > > I'm aware of this. > > > If people really believed that the guidelines in that section > > should have been conformance clauses, they should have proposed > > that at some point. > > It seems to me that this thread does not support the conclusion that > the Unicode Standard's expression of preference for the number of > REPLACEMENT CHARACTERs should be made into a conformance requirement > in the Unicode Standard. This thread could be taken to support a > conclusion that the Unicode Standard should not express any preference > beyond "at least one and at most as many as there were bytes". > > On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode > wrote: > > In any case, Henri is complaining that it?s too difficult to > > implement; it isn?t. You need two extra states, both of which are > > trivial. > > I am not claiming it's too difficult to implement. I think it > inappropriate to ask implementations, even from-scratch ones, to take > on added complexity in error handling on mere aesthetic grounds. Also, > I think it's inappropriate to induce implementations already written > according to the previous guidance to change (and risk bugs) or to > make the developers who followed the previous guidance with precision > be the ones who need to explain why they aren't following the new > guidance. How straightforward is the FSM for back-stepping? > On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode > wrote: > > The UTF-8 conversion code that I wrote for ICU, and apparently the > > code that various other people have written, collects sequences > > starting from lead bytes, according to the original spec, and at > > the end looks at whether the assembled code point is too low for > > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a > > non-trail byte is quite natural, and reading the PRI text > > accordingly is quite natural too. > > I don't doubt that other people have written code with the same > concept as ICU, but as far as non-shortest form handling goes in the > implementations I tested (see URL at the start of this email) ICU is > the lone outlier. You should have researched implementations as they were in 2007. My own code uses the same concept as Markus's ICU code - convert and check the resulting value is legal for the length. As a check, remember that for n > 1, n bytes could represent 2**(5n + 1) values if overlongs were permitted. > > Aside from UTF-8 history, there is a reason for preferring a more > > "structural" definition for UTF-8 over one purely along valid > > sequences. This applies to code that *works* on UTF-8 strings > > rather than just converting them. 
For UTF-8 *processing* you need > > to be able to iterate both forward and backward, and sometimes you > > need not collect code points while skipping over n units in either > > direction -- but your iteration needs to be consistent in all > > cases. This is easier to implement (especially in fast, short, > > inline code) if you have to look only at how many trail bytes > > follow a lead byte, without having to look whether the first trail > > byte is in a certain range for some specific lead bytes. > > But the matter at hand is decoding potentially-invalid UTF-8 input > into a valid in-memory Unicode representation, so later processing is > somewhat a red herring as being out of scope for this step. No. Both lossily converting a UTF-8-like string as a stream of bytes to scalar values and moving back and forth through the string 'character' by 'character' imply an ability to count the number of 'characters' in the string. The bug you mentioned arose from two different ways of counting the string length in 'characters'. Having two different 'character' counts for the same string is inviting trouble. Richard. From unicode at unicode.org Wed May 31 12:43:08 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 17:43:08 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <20170531070837.6aa2b590@JRWUBU2> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: > > In either case, the bad characters are garbage, so neither approach is > > "better" - except that one or the other may be more conducive to the > > requirements of the particular API/application. > There's a potential issue with input methods that indirectly edit the backing store. For example, > GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can delete an amount > of text specified in characters, not storage units. (Deletion by storage units is not available in this > interface.) This might cause utter confusion or worse if the backing store starts out corrupt. > A corrupt backing store is normally manually correctable if most of the text is ASCII. I think that's sort of what I said: some approaches might work better for some systems and another approach might work better for another system. This also presupposes a corrupt store. It is unclear to me what the expected behavior would be for this corruption if, for example, there were merely a half dozen 0x80 in the middle of ASCII text? Is that garbage a single "character"? Perhaps because it's a consecutive string of bad bytes? Or should it be 6 characters since they're nonsense? Or maybe 2 characters because the maximum # of trail bytes we can have is 3? What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes? I can see how different implementations might be able to come up with "rules" that would help them navigate (or clean up) those minefields, however it is not at all clear to me that there is a "best practice" for those situations. There also appears to be a special weight given to non-minimally-encoded sequences. 
It would seem to me that none of these illegal sequences should appear in practice, so we have either: * A bad encoder spewing out garbage (overlong sequences) * Flipped bit(s) due to storage/transmission/whatever errors * Lost byte(s) due to storage/transmission/coding/whatever errors * Extra byte(s) due to whatever errors * Bad string manipulation breaking/concatenating in the middle of sequences, causing garbage (perhaps one of the above 2 codeing errors). Only in the first case, of a bad encoder, are the overlong sequences actually "real". And that shouldn't happen (it's a bad encoder after all). The other scenarios seem just as likely, (or, IMO, much more likely) than a badly designed encoder creating overlong sequences that appear to fit the UTF-8 pattern but aren't actually UTF-8. The other cases are going to cause byte patterns that are less "obvious" about how they should be navigated for various applications. I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen. -Shawn From unicode at unicode.org Wed May 31 13:11:59 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 31 May 2017 19:11:59 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> Message-ID: <03634118-4070-409D-9D62-98488E9AB1E5@alastairs-place.net> > On 30 May 2017, at 18:11, Shawn Steele via Unicode wrote: > >> Which is to completely reverse the current recommendation in Unicode 9.0. While I agree that this might help you fending off a bug report, it would create chances for bug reports for Ruby, Python3, many if not all Web browsers,... > > & Windows & .Net > > Changing the behavior of the Windows / .Net SDK is a non-starter. > >> Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows what it means, but everybody knows they don't exist. > > Yes, this is trying to improve the language for a scenario that CANNOT HAPPEN. We're trying to optimize a case for data that implementations should never encounter. It is sort of exactly like optimizing for the case where your data input is actually a dragon and not UTF-8 text. > > Since it is illegal, then the "at least 1 FFFD but as many as you want to emit (or just fail)" is fine. And *that* is what the specification says. The whole problem here is that someone elevated one choice to the status of ?best practice?, and it?s a choice that some of us don?t think *should* be considered best practice. Perhaps ?best practice? should simply be altered to say that you *clearly document* your behaviour in the case of invalid UTF-8 sequences, and that code should not rely on the number of U+FFFDs generated, rather than suggesting a behaviour? Kind regards, Alastair. 
-- http://alastairs-place.net From unicode at unicode.org Wed May 31 13:34:15 2017 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 31 May 2017 19:34:15 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <5ED57BFA-A51B-4E2E-8DF9-4F274EC12CCD@alastairs-place.net> On 31 May 2017, at 18:43, Shawn Steele via Unicode wrote: > > It is unclear to me what the expected behavior would be for this corruption if, for example, there were merely a half dozen 0x80 in the middle of ASCII text? Is that garbage a single "character"? Perhaps because it's a consecutive string of bad bytes? Or should it be 6 characters since they're nonsense? Or maybe 2 characters because the maximum # of trail bytes we can have is 3? It should be six U+FFFD characters, because 0x80 is not a lead byte. Basically, the new proposal is that we should decode bytes that structurally match UTF-8, and if the encoding is then illegal (because it?s over-long, because it?s a surrogate or because it?s over U+10FFFF) then the entire thing is replaced with U+FFFD. If, on the other hand, we get a sequence that isn?t structurally valid UTF-8, we replace the maximally *structurally* valid subpart with U+FFFD and continue. > What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes? Then you get two U+FFFDs. > I can see how different implementations might be able to come up with "rules" that would help them navigate (or clean up) those minefields, however it is not at all clear to me that there is a "best practice" for those situations. I?m not sure the whole ?best practice? thing has been a lot of help here. Perhaps we should change it to say ?Suggested Handling?, to make quite clear that filing a bug report against code that chooses some other option is not necessary? > There also appears to be a special weight given to non-minimally-encoded sequences. I don?t think that?s true, *although* it *is* true that UTF-8 decoders historically tended to allow such things, so one might assume that some software out there is generating them for whatever reason. There are also *deliberate* violations of the minimal length encoding specification in some cases (for instance to allow NUL to be encoded in such a way that it won?t terminate a C-style string). Yes, you may retort, that isn?t ?valid UTF-8?. Sure. It *is* useful, though, and it is *in use*. If a UTF-8 decoder encounters such a thing, it?s more meaningful for whoever sees the output to see a single U+FFFD representing the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid lead byte and then another for an ?unexpected? trailing byte. Likewise, there are encoders that generate surrogates in UTF-8, which is, of course, illegal, but *does* happen. 
Again, they can provide reasonable justifications for their behaviour (typically they want the default binary sort to work the same as for UTF-16 for some reason), and again, replacing a single surrogate with U+FFFD rather than multiple U+FFFDs is more helpful to whoever/whatever ends up seeing it. And, of course, there are encoders that are attempting to exploit security flaws, which will very definitely generate these kinds of things. > It would seem to me that none of these illegal sequences should appear in practice, so we have either: > > * A bad encoder spewing out garbage (overlong sequences) > * Flipped bit(s) due to storage/transmission/whatever errors > * Lost byte(s) due to storage/transmission/coding/whatever errors > * Extra byte(s) due to whatever errors > * Bad string manipulation breaking/concatenating in the middle of sequences, causing garbage (perhaps one of the above 2 codeing errors). I see no reason to suppose that the proposed behaviour would function any less well in those cases. > Only in the first case, of a bad encoder, are the overlong sequences actually "real". And that shouldn't happen (it's a bad encoder after all). Except some encoders *deliberately* use over-longs, and one would assume that since UTF-8 decoders historically allowed this, there will be data ?in the wild? that has this form. > The other scenarios seem just as likely, (or, IMO, much more likely) than a badly designed encoder creating overlong sequences that appear to fit the UTF-8 pattern but aren't actually UTF-8. I?m not sure I agree that flipped bits, lost bytes and extra bytes are more likely than a ?bad? encoder. Bad string manipulation is of course prevalent, though - there?s no way around that. > The other cases are going to cause byte patterns that are less "obvious" about how they should be navigated for various applications. This is true, *however* the new proposed behaviour is in no way inferior to the old proposed behaviour in those cases - it?s just different. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Wed May 31 14:04:41 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 31 May 2017 21:04:41 +0200 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: > I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen. I think Richard stated the most compelling reason: ? The bug you mentioned arose from two different ways of counting the string length in 'characters'. Having two different 'character' counts for the same string is inviting trouble. For implementations that emit FFFD while handling text conversion and repair (ie, converting ill-formed UTF-8 to well-formed), it is best for interoperability if they get the same results, so that indices within the resulting strings are consistent across implementations for all the *correct* characters thereafter. 
It would be preferable *not* to have the following:

source = %c0%80abc

Vendor 1:

  fixed = fix(source)
  fixed == "��abc"
  codepointAt(fixed, 3) == 'b'

Vendor 2:

  fixed = fix(source)
  fixed == "�abc"
  codepointAt(fixed, 3) == 'c'

In theory one could just throw an exception. In practice, nobody wants
their browser to belly up on a webpage with a component that has an
ill-formed bit of UTF-8. In theory one could document everyone's flavor
of the month for how many FFFD's to emit. In practice, that falls apart
immediately, since in today's interconnected world you can't tell which
processes get first crack at text repair.

Mark

On Wed, May 31, 2017 at 7:43 PM, Shawn Steele via Unicode <
unicode at unicode.org> wrote:

> > > In either case, the bad characters are garbage, so neither approach is
> > > "better" - except that one or the other may be more conducive to the
> > > requirements of the particular API/application.
>
> > There's a potential issue with input methods that indirectly edit the
> backing store. For example,
> > GTK input methods (e.g. function gtk_im_context_delete_surrounding())
> can delete an amount
> > of text specified in characters, not storage units. (Deletion by
> storage units is not available in this
> > interface.) This might cause utter confusion or worse if the backing
> store starts out corrupt.
> > A corrupt backing store is normally manually correctable if most of the
> text is ASCII.
>
> I think that's sort of what I said: some approaches might work better for
> some systems and another approach might work better for another system.
> This also presupposes a corrupt store.
>
> It is unclear to me what the expected behavior would be for this
> corruption if, for example, there were merely a half dozen 0x80 in the
> middle of ASCII text?  Is that garbage a single "character"?  Perhaps
> because it's a consecutive string of bad bytes?  Or should it be 6
> characters since they're nonsense?  Or maybe 2 characters because the
> maximum # of trail bytes we can have is 3?
>
> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?
>
> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.
>
> There also appears to be a special weight given to non-minimally-encoded
> sequences. It would seem to me that none of these illegal sequences should
> appear in practice, so we have either:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
> sequences, causing garbage (perhaps one of the above 2 codeing errors).
>
> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real".  And that shouldn't happen (it's a bad encoder after
> all). The other scenarios seem just as likely, (or, IMO, much more likely)
> than a badly designed encoder creating overlong sequences that appear to
> fit the UTF-8 pattern but aren't actually UTF-8.
>
> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.
> > I do not understand the energy being invested in a case that shouldn't > happen, especially in a case that is a subset of all the other bad cases > that could happen. > > -Shawn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 31 14:24:04 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 19:24:04 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: > For implementations that emit FFFD while handling text conversion and repair (ie, converting ill-formed > UTF-8 to well-formed), it is best for interoperability if they get the same results, so that indices within the > resulting strings are consistent across implementations for all the correct characters thereafter. That seems optimistic :) If interoperability is the goal, then it would seem to me that changing the recommendation would be contrary to that goal. There are systems that will not or cannot change to a new recommendation. If such systems are updated, then adoption of those systems will likely take some time. In other words, I cannot see where ?consistency across implementations? would be achievable anytime in the near future. It seems to me that being able to use a data stream of ambiguous quality in another application with predictable results, then that stream should be ?repaired? prior to being handed over. Then both endpoints would be using the same set of FFFDs, whether that was single or multiple forms. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 31 14:28:03 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 19:28:03 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <5ED57BFA-A51B-4E2E-8DF9-4F274EC12CCD@alastairs-place.net> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> <5ED57BFA-A51B-4E2E-8DF9-4F274EC12CCD@alastairs-place.net> Message-ID: > it?s more meaningful for whoever sees the output to see a single U+FFFD representing > the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid lead byte and > then another for an ?unexpected? trailing byte. I disagree. It may be more meaningful for some applications to have a single U+FFFD representing an illegally encoded 2-byte NULL than to have 2 U+FFFDs. Of course then you don't know if it was an illegally encoded 2-byte NULL or an illegally encoded 3-byte NULL or whatever, so some information that other applications may be interested in is lost. 
Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the byte, and try again" approach. -Shawn From unicode at unicode.org Wed May 31 14:38:58 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 31 May 2017 12:38:58 -0700 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 Message-ID: <20170531123858.665a7a7059d7ee80bb4d670165c8327d.173ddafba2.wbe@email03.godaddy.com> Henri Sivonen wrote: > If anything, I hope this thread results in the establishment of a > requirement for proposals to come with proper research about what > multiple prominent implementations to about the subject matter of a > proposal concerning changes to text about implementation behavior. Considering that several folks have objected that the U+FFFD recommendation is perceived as having the weight of a requirement, I think adding Henri's good advice above as a "requirement" seems heavy-handed. Who will judge how much research qualifies as "proper"? Who will determine that the judge doesn't have a conflict? An alternative would be to require that proposals, once received with whatever amount of research, are augmented with any necessary additional research *before* being approved. The identity or reputation of the requester should be irrelevant to approval. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 31 14:42:25 2017 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 31 May 2017 19:42:25 +0000 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: <03634118-4070-409D-9D62-98488E9AB1E5@alastairs-place.net> References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <444e6dc2-a35a-0d2d-26e2-34f5a70af9e1@it.aoyama.ac.jp> <03634118-4070-409D-9D62-98488E9AB1E5@alastairs-place.net> Message-ID: > And *that* is what the specification says. The whole problem here is that someone elevated > one choice to the status of ?best practice?, and it?s a choice that some of us don?t think *should* > be considered best practice. > Perhaps ?best practice? should simply be altered to say that you *clearly document* your behavior > in the case of invalid UTF-8 sequences, and that code should not rely on the number of U+FFFDs > generated, rather than suggesting a behaviour? That's what I've been suggesting. I think we could maybe go a little further though: * Best practice is clearly not to depend on the # of U+FFFDs generated by another component/app. Clearly that can't be relied upon, so I think everyone can agree with that. * I think encouraging documentation of behavior is cool, though there are probably low priority bugs and people don't like to read the docs in that detail, so I wouldn't expect very much from that. * As far as I can tell, there are two (maybe three) sane approaches to this problem: * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid. In that case just use one U+FFFD. 
* And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again. (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group). * I'd be happy if the best practice encouraged one of those two (or maybe three) approaches. I think an approach that called rand() to see how many U+FFFDs to emit when it encountered bad data is fair to discourage. -Shawn From unicode at unicode.org Wed May 31 15:06:29 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 21:06:29 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp> <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <20170531210629.2bb15b73@JRWUBU2> On Wed, 31 May 2017 17:43:08 +0000 Shawn Steele via Unicode wrote: > There also appears to be a special weight given to > non-minimally-encoded sequences. It would seem to me that none of > these illegal sequences should appear in practice, so we have either: > I do not understand the energy being invested in a case that > shouldn't happen, especially in a case that is a subset of all the > other bad cases that could happen. That's not the motivation for my using a structurally based approach. I want to expend as little energy as possible, both in thought (Keep It Simple, Stupid) and in machine cycles, in catering for these overlong/non-scalar value cases. I have to cater for indisputably illegal truncated sequences, but for the rest of it I optimise for the conformant case. If I'm extracting scalar values, I calculate the scalar value and then check that it's legal. If I'm advancing through a string, I just advance by the requisite number of trailing bytes. UTF-8 is simple in concept, and I try to follow that simplicity. A state machine overcomplicates it. Moroever, if I want to handle CESU-8 or U+0000 as opposed to a sentinel null, it is easy to add special case logic to a scalar value extractor. > > -Shawn > From unicode at unicode.org Wed May 31 15:20:02 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 31 May 2017 21:20:02 +0100 Subject: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 In-Reply-To: References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com> <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com> <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp> <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com> <7b6c2184-4177-01fd-7e5f-e8f36b54f4be@khwilliamson.com> <20170531070837.6aa2b590@JRWUBU2> Message-ID: <20170531212002.72ab9ed3@JRWUBU2> On Wed, 31 May 2017 19:24:04 +0000 Shawn Steele via Unicode wrote: > It seems to me that being able to use a data stream of ambiguous > quality in another application with predictable results, then that > stream should be ?repaired? prior to being handed over. Then both > endpoints would be using the same set of FFFDs, whether that was > single or multiple forms. 
This of course depends on where the damage is being done. You're urging that applications check the strings they have generated as they export them. Richard.