From verdy_p at wanadoo.fr Sun Jun 1 00:06:34 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 1 Jun 2014 07:06:34 +0200 Subject: Corrigendum #9 In-Reply-To: <538A8CD8.4070905@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> <538A8CD8.4070905@ix.netcom.com> Message-ID: I've not proposed to move these characters elsewhere (or to re-encode them); why do you think that? I just challenge your statement that a block cannot be discontinuous, something that is unique among all Unicode properties and completely absent from ISO 10646, which does not define any real properties besides a name at a specific code point and an informative glyph, plus historic reference links documenting its intended usage. (Where is it written in the Unicode-only stability rules that a block is continuous, when allocation of code points within these blocks has always been discontinuous? Such a rule would be much more important than this legacy one, which, as you stated, has absolutely no use in regexps.) Even the set of non-characters is discontinuous, as are the blocks for the Arabic script, the blocks for presentation forms, and the blocks for compatibility characters. Every property in Unicode is fragmented over multiple ranges (whose lengths are also very frequently discontinuous within each block, or even within the same encoding column). In other words, IsInArabicPresentation(x) would still remain true for all assigned characters in that block; it would just be false for non-characters considered outside of it. But non-characters don't have any useful property except being non-characters (the block where they are allocated does not matter at all). The alternative is to not restrict these characters as being non-characters, allowing them to be present in files without enforcing any error, i.e. 
treat it like PUA, also with a few possible default properties (this keeps them somewhat interoperable under limited private agreements, possibly implicit in the transport interface or envelope format). 2014-06-01 4:15 GMT+02:00 Asmus Freytag : > More importantly, while a regex that uses an expression that is > equivalent to "IsInArabicPresentation(x)" may or may not be well-defined, > there is no reason to break it by splitting the block. > > As blocks cannot be discontiguous (unlike other properties), some Arabic > Presentation forms would have to be put into a new block (Arabic > Presentation Forms C). This is what would break such expressions - it has, > in fact, nothing to do with the status of the noncharacters. > > There's no reason to contemplate breaking changes of any kind at this > point. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Jun 1 01:20:16 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 31 May 2014 23:20:16 -0700 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> <538A8CD8.4070905@ix.netcom.com> Message-ID: <538AC620.7010208@ix.netcom.com> On 5/31/2014 10:06 PM, Philippe Verdy wrote: > I've not proposed to move these characters elsewhere (or to re-encode > them), why do you think that? > > I just challenge your statement that a block cannot be discontinuous, Well, go ahead and challenge that. As implemented in the current names list and the file Blocks.txt, a block would have this definition: "A block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16." 
Per Chapter 3, the definition of the property block is given in Section 17.1 (Code Charts) - which contains no actual definition; it only tells you how blocks are used in organizing the code charts. So, effectively, a block is what Blocks.txt (and therefore the names list) says it is. The way blocks are assigned has followed the empirically derived definition I gave above, and at this point, the production process for the code charts has some of these restrictions built in. Chapter 3 calls blocks an enumerated property, meaning that the names must be unique, and Blocks.txt associates a single range with a name, in concurrence with the glossary, which says blocks represent a range of characters (not a collection of ranges). Likewise, changing blocks to not start at, or contain, multiples of 16 code points (sometimes called a "column") is equally not in the cards - it would break the very production process for the code charts. The description of how blocks are used does not contemplate that they can be mutually overlapping, so that becomes part of their implicit definition as well. There's reason behind the madness of not providing an explicit definition of "block" in the standard. It has to do with discouraging people from relying on what is largely an editorial device (headers on charts). However, it does not mean that arbitrary redefinition of a block from a single to multiple ranges is something that can or should be contemplated. So, the chances that the UTC would agree to such changes, even if not formally guaranteed, are de facto nil. 
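The single-range behavior described above is directly observable in Java, whose Character.UnicodeBlock property is derived from Blocks.txt. A minimal sketch (the chosen code points are merely illustrative):

```java
// Sketch: querying the Block property as derived from Blocks.txt.
// Each code point belongs to at most one named, contiguous range.
public class BlockDemo {
    public static void main(String[] args) {
        // U+FB50 is an assigned character in Arabic Presentation Forms-A.
        System.out.println(Character.UnicodeBlock.of(0xFB50));
        // The noncharacters U+FDD0..U+FDEF lie inside that same block:
        // the range is one contiguous span, noncharacters included.
        System.out.println(Character.UnicodeBlock.of(0xFDD0));
        // A code point assigned to no block at all yields null.
        System.out.println(Character.UnicodeBlock.of(0xE0080));
    }
}
```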
A./ From verdy_p at wanadoo.fr Sun Jun 1 03:28:29 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 1 Jun 2014 10:28:29 +0200 Subject: Corrigendum #9 In-Reply-To: <538AC620.7010208@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> <538A8CD8.4070905@ix.netcom.com> <538AC620.7010208@ix.netcom.com> Message-ID: Ok then, the definitions still do not say that blocks cannot be split (in fact it has already occurred many times across versions, by reevaluating the need for new blocks and by densifying the BMP, up to the point that sometimes a single addition in the same script required allocating columns in multiple sub-blocks as small as a column of 16 code points). Blocks are in fact artefacts of the encoding process; they are provisional until the characters needed are effectively allocated. Later, any unused area may be reallocated to another block. On the BMP, for example, there remains a quite large area in a block initially described for supplemental arrows that could host a new full alphabetic script (most probably one of the remaining Indic or African modern scripts still to encode), or symbols used in common software or devices for their UI and its documentation (such as the window minimize/maximize/close buttons or resize corner, the refresh button, the microphone symbol used to initiate voice input, or the radio-wave symbol for accessing a wireless network), or conventional symbols for accessibility devices, or marks of dangers/hazards or restrictions/prohibitions that could be used as widely as currency symbols (often encoded in emergency but in isolation, unlike other symbols coming in small related groups; if these collections are large, like emoticons/emoji, they'll go directly in the SMP). 
Blocks are not immutable in size, even if they keep their initial position (because allocations in blocks start from the leading position, skipping only a few entries that were balloted for possible later allocation to the same script, or for former proposals of characters that were balloted in favor of unification with another character, or just to align the block with the layout of another legacy encoding chart, or because the initial beta fonts submitted to support the script allocated other characters that were not approved, and the fonts were not updated to use a new layout). Maybe in some future we will see a few more allocations made in the BMP using half columns (this is *already* the case at the end of the BMP, where a single column is split in two parts containing Armenian presentation forms and Hebrew presentation forms for Yiddish...), or filling some random holes for which it is definitively decided that the initial reservations in the roadmap will never be used for the initially intended purpose. 2014-06-01 8:20 GMT+02:00 Asmus Freytag : > On 5/31/2014 10:06 PM, Philippe Verdy wrote: > >> I've not proposed to move these characters elsewhere (or to re-encode >> them), why do you think that? >> >> I just challenge your statement that a block cannot be discontinuous, >> > > Well, go ahead and challenge that. > > As implemented in the current names list and file Blocks.txt a block would > have this definition. "A block is a uniquely named, continuous, > non-overlapping range of code points, containing a multiple of 16 code > points, and starting at a location that is a multiple of 16." > > Per chapter 3 the definition of the property block is given in Section > 17.1 (Code Charts) - which contains no actual definition, only tells you > how they are used in organizing the code charts, so, effectively, a block > is what blocks.txt (and therefore the names list) say it is. 
The way blocks > are assigned, has been following the empirically derived definition I gave > above, and at this point, the production process for the code charts has > some of these restrictions built in. > > Chapter 3 calls blocks an enumerated property, meaning that the names must > be unique, and blocks.txt associates a single range with a name, in > concurrence with the glossary, which says blocks represent a range of > characters (not a collection of ranges). Likewise, changing blocks to not > starting at or containing multiples of 16 code points (sometimes called a > "column") is equally not in the cards - it would break the very production > process for chart production. The description of how blocks are used does > not contemplate that they can be mutually overlapping, so that becomes part > of their implicit definition as well. > > There's reason behind the madness of not providing an explicit definition > of "block" in the standard. It has to do with discouraging people from > relying on what is largely an editorial device (headers on charts). > However, it does not mean that arbitrary redefinition of a block from a > single to multiple ranges is something that can or should be contemplated. > > So, the chances that UTC would agree to such changes, even if not formally > guaranteed, is de facto nil. > > A./ > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sun Jun 1 03:49:31 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 1 Jun 2014 09:49:31 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> Message-ID: <20140601094931.413857e2@JRWUBU2> On Sat, 31 May 2014 19:28:27 -0700 Markus Scherer wrote: > On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > > Bear in mind that a pattern \uD808 shall not match anything in a > > well-formed Unicode string. > > > Depends. See the definitions of Unicode strings vs. UTF strings. D80: Unicode string: A code unit sequence containing code units of a particular Unicode encoding form... D85: Well-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it does follow the specification of that Unicode encoding form. How does a Unicode string purport anything? >> \uD808\uDF45 specifies a sequence of two >> codepoints. > Implementations that use Unicode 16-bit strings will usually treat > this as one supplementary code point. > In Java, there is no other way to escape one. In which case, Java does *not* supply 'basic Unicode support' as defined by UTS#18 Version 17 - see just before Section 1.1.1 therein. An engine that matches code unit by code unit does not comply with RL1.7. This makes sense in so far as it provides for consistent results across UTF encodings for Unicode strings that could once have been reversibly converted. (A 32-bit Unicode string <D808, DF45> converted to a 16-bit Unicode string and back would become <12345>.) Now that that conversion should not preserve lone surrogates (both C10 together with D93, and TUS Section 5.22), it makes less sense. 
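The point about conversion not preserving lone surrogates can be illustrated in Java, whose default UTF-8 encoder substitutes its replacement byte for an unpaired surrogate (a sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

// Sketch: a lone surrogate is a legal code unit in a Unicode 16-bit
// string, but it cannot survive conversion to a UTF encoding form.
public class LoneSurrogateDemo {
    public static void main(String[] args) {
        String pair = "\uD808\uDF45"; // U+12345 as a surrogate pair
        String lone = "\uD808";       // unpaired lead surrogate
        byte[] ok  = pair.getBytes(StandardCharsets.UTF_8);
        byte[] bad = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(ok.length);                 // 4: a valid four-byte sequence
        System.out.println(bad.length + " " + bad[0]); // 1 63: replaced by '?' (0x3F)
    }
}
```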
However, I can think of one major objection to a regular expression engine using 16-bit Unicode strings treating every supplementary point as a sequence of two surrogate points. While it might be acceptable for a lone surrogate to match \P{L} (codepoints that are not letters), it would not be acceptable for every supplementary point to match \P{L}\P{L} or even \p{Any}\p{Any}. Richard. From richard.wordingham at ntlworld.com Sun Jun 1 05:42:39 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 1 Jun 2014 11:42:39 +0100 Subject: Long-Encoded Restricted Characters in High Frequency Modern Use In-Reply-To: References: <20140529233956.5db1ea5e@JRWUBU2> Message-ID: <20140601114239.24a2d02e@JRWUBU2> On Sat, 31 May 2014 21:27:55 +0200 Mark Davis wrote: > The structure of the data is based on the use of NFKC characters in > identifiers. So SARA AM and the Lao equivalent are both not NFKC > characters, and are categorized as such, and would need to be > represented by their NFKC forms. The process is in > http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection There's no absolute IETF prohibition on NFKC characters. > > Now, U+0E4D THAI > > CHARACTER NIKHAHIT is classified as 'allowed; recommended', although > > its main use is in writing Pali, which would suggest that it should > > be 'restricted; historic' or 'restricted; limited-use'. > For that, it would be best to submit via > http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a > feedback form at http://www.unicode.org/reporting.html, just to be > sure. I have no desire to restrict NIKHAHIT simply because of limited use. The problem is simply the confusion caused by the existence of SARA AM. Unicode support for the compatibility decomposition of SARA AM is incomplete, in part irremediably so. The problem is that has a different appearance to . In the former, the tone mark is the topmost glyph; in the latter, the nikkhahit is the topmost glyph. 
usually has the same appearance as , which is what Uniscribe effectively converts it to. There used to be filters in place to stop being typed. It's not unknown for to be mistyped as , and that too used to be blocked. DUCET has a contraction for to reduce the ill-effects, but of course the contraction doesn't work for the sequence . (Action on me: CLDR ticket on omission for th locale.) In short, the co-existence of NIKHAHIT with ccc=0 and SARA AM causes problems. The simplest solution is to restrict NIKHAHIT, which should be tolerable. Ideally, one would merely prohibit the sequence \p{Mn}*\u0E4D\p{Mn}*\u0E32. There is no virtue in making both NIKHAHIT and SARA AM 'restricted'. Indeed, one could argue that applying the compatibility decomposition to SARA AM brings NIKHAHIT into 'high frequency modern use' - it depends on the frequency of NFKC and NFKD conversions. However, the compatibility decomposition of SARA AM is simply *wrong* as Thai text. It would be good to hear from someone at Thailand's National Electronics and Computer Technology Center (NECTEC) on the matter of SARA AM in domain names. The sequence-prohibiting solution ought to extend to Lao, but there may be the additional problem of the tone mark being applied to the SARA AM. The m17n Lao keyboard on my computer actually comes with a single keystroke for the sequence ! (Action on me: File a bug report against the keyboard.) Richard. 
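The compatibility decomposition at issue can be checked with java.text.Normalizer: U+0E33 SARA AM decomposes under NFKD to U+0E4D NIKHAHIT plus U+0E32 SARA AA, while NFD leaves it alone (a sketch; the class name is mine):

```java
import java.text.Normalizer;

// Sketch: SARA AM has a compatibility (not canonical) decomposition,
// which is how NFKC/NFKD conversion reintroduces NIKHAHIT into
// otherwise modern Thai text.
public class SaraAmDemo {
    public static void main(String[] args) {
        String saraAm = "\u0E33"; // THAI CHARACTER SARA AM
        String nfkd = Normalizer.normalize(saraAm, Normalizer.Form.NFKD);
        System.out.println(nfkd.equals("\u0E4D\u0E32")); // true
        // No canonical decomposition: NFD leaves SARA AM unchanged.
        String nfd = Normalizer.normalize(saraAm, Normalizer.Form.NFD);
        System.out.println(nfd.equals(saraAm)); // true
    }
}
```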
From public at khwilliamson.com Sun Jun 1 09:49:47 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 01 Jun 2014 08:49:47 -0600 Subject: Corrigendum #9 In-Reply-To: <5388D29C.9040502@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> Message-ID: <538B3D8B.2070102@khwilliamson.com> On 05/30/2014 12:49 PM, Asmus Freytag wrote: > One of the concerns was that people felt that they had to have "data > pipeline" style implementations (tools) go and filter these out - even > if there was no intent for the implementation to use them internally in > any way. Making clear that the standard does not require filtering > allows for cleaner implementations of such ("pass-through") tools. Thanks, I had not thought about that. I'm thinking wording something like this is more appropriate: "Noncharacters may be openly interchanged, but it is inadvisable to do so without prior agreement, since at each stage any of them might be replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at the sole discretion of that stage's implementation." From markus.icu at gmail.com Sun Jun 1 10:58:26 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 1 Jun 2014 08:58:26 -0700 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140601094931.413857e2@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> Message-ID: On Sun, Jun 1, 2014 at 1:49 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > D80: Unicode string: > A code unit sequence containing code units of a particular Unicode > encoding form... > Right -- in a Unicode 16-bit string, you have a sequence of any 16-bit value in any order. Well-formedness applies to UTF-x encoding forms. It is common to not treat unpaired surrogates as errors because they behave like "boring" code points, that is, they are "harmless". 
However, that does not mean that they work like fully supported code points in all places, just that where it's easier to treat them like harmless code points that's often done. In ICU4C simple string functions, if you search for code point 0xd800 you will find it in a string if it occurs as an unpaired surrogate. In ICU collation of 16-bit strings, an unpaired surrogate sorts with an unassigned-implicit primary weight. (You can try this with the online collation demo. In ICU UTF-8 collation, ill-formed sequences sort like U+FFFD.) >> \uD808\uDF45 specifies a sequence of two > >> codepoints. > > > Implementations that use Unicode 16-bit strings will usually treat > > this as one supplementary code point. > > In Java, there is no other way to escape one. > > In which case, Java does *not* supply 'basic Unicode support' as defined > by UTS#18 Version 17 - see just before Section 1.1.1 therein. An > engine that matches code unit by code unit does not comply with RL1.7. > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a supplementary code point, but as long as you have a surrogate pair, it is treated as a code point in APIs that support them. Java 5 upgraded the regular expression code to match code points, not code units. I don't know what it does when the pattern contains an unpaired surrogate. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Sun Jun 1 11:07:40 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 1 Jun 2014 09:07:40 -0700 Subject: Corrigendum #9 In-Reply-To: <538B3D8B.2070102@khwilliamson.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538B3D8B.2070102@khwilliamson.com> Message-ID: On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson wrote: > Thanks, I had not thought about that. 
I'm thinking wording something like > this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable to do so > without prior agreement, since at each stage any of them might be replaced > by a REPLACEMENT CHARACTER or otherwise disposed of, at the sole discretion > of that stage's implementation." I think that would invite again the kinds of implementations that triggered Corrigendum #9, where you couldn't use CLDR files with Gnome-based tools (plain text editors, file diff tools, command-line terminal) if the files contained noncharacters. (CLDR data uses them for boundary mappings in collation data.) markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Jun 1 12:04:57 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 1 Jun 2014 18:04:57 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> Message-ID: <20140601180457.273ac6b9@JRWUBU2> On Sun, 1 Jun 2014 08:58:26 -0700 Markus Scherer wrote: > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a > supplementary code point, but as long as you have a surrogate pair, > it is treated as a code point in APIs that support them. Wasn't it obvious that in the following paragraph \uD808\uDF45 was a pattern? "Bear in mind that a pattern \uD808 shall not match anything in a well-formed Unicode string. \uD808\uDF45 specifies a sequence of two codepoints. This sequence can occur in an ill-formed UTF-32 Unicode string and before Unicode 5.2 could readily be taken to occur in an ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular expression engine, the codepoint sequence <U+D808, U+DF45> cannot occur in a UTF-16 Unicode string; instead, the code unit sequence <D808 DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES KI>." 
(It might have been clearer to you if I'd said '8-bit' and '16-bit' instead of UTF-8 and UTF-16. It does make me wonder what you'd call a 16-bit encoding of arbitrary *codepoint* sequences.) Richard. From public at khwilliamson.com Sun Jun 1 12:13:53 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 01 Jun 2014 11:13:53 -0600 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538B3D8B.2070102@khwilliamson.com> Message-ID: <538B5F51.60102@khwilliamson.com> On 06/01/2014 10:07 AM, Markus Scherer wrote: > On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson > wrote: > > Thanks, I had not thought about that. I'm thinking wording > something like this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable to > do so without prior agreement, since at each stage any of them might > be replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at > the sole discretion of that stage's implementation." > > > I think that would invite again the kinds of implementations that > triggered Corrigendum #9, where you couldn't use CLDR files with > Gnome-based tools (plain text editors, file diff tools, command-line > terminal) if the files contained noncharacters. (CLDR data uses them for > boundary mappings in collation data.) > > markus I don't understand your point. Are you saying that Gnome should not have the discretion to rid its inputs of noncharacters? If so, then noncharacters really are just Gc=Co ones. 
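For reference, the 66 noncharacters are easy to test for. The helper below is hypothetical (the Java standard library offers no direct predicate); it implements the standard's definition: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```java
// Sketch: a hypothetical isNoncharacter predicate. Note that the
// General_Category of a noncharacter is Cn (unassigned), not Co
// (private use), so gc alone cannot identify them.
public class NoncharDemo {
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)  // the contiguous run in Arabic Presentation Forms-A
            || (cp & 0xFFFE) == 0xFFFE;        // U+xxFFFE and U+xxFFFF in every plane
    }
    public static void main(String[] args) {
        int count = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isNoncharacter(cp)) count++;
        }
        System.out.println(count); // 66 = 32 + 17 * 2
    }
}
```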
From asmusf at ix.netcom.com Sun Jun 1 14:34:24 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 01 Jun 2014 12:34:24 -0700 Subject: Corrigendum #9 In-Reply-To: <538B3D8B.2070102@khwilliamson.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538B3D8B.2070102@khwilliamson.com> Message-ID: <538B8040.1060103@ix.netcom.com> On 6/1/2014 7:49 AM, Karl Williamson wrote: > On 05/30/2014 12:49 PM, Asmus Freytag wrote: >> One of the concerns was that people felt that they had to have "data >> pipeline" style implementations (tools) go and filter these out - even >> if there was no intent for the implementation to use them internally in >> any way. Making clear that the standard does not require filtering >> allows for cleaner implementations of such ("pass-through") tools. > > Thanks, I had not thought about that. I'm thinking wording something > like this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable to do > so without prior agreement, since at each stage any of them might be > replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at the > sole discretion of that stage's implementation." > Karl, I think you should address the pass-through style of implementation explicitly. "Noncharacters are designed to be used for special, implementation-internal purposes, which puts them outside the text content of the data. Some implementations, by necessity, use a distributed architecture, and rely on yet other implementations for services like transport, code conversion, and so on. For such "pass-through" implementations, it would be inadvisable to rely on, or replace, any noncharacter, and certainly not to reject or filter them. Doing so would make such an implementation a poor choice to serve as a "pass-through" in a distributed architecture that makes use of noncharacters for internal purposes. 
In other words such an implementation would make it impossible to bridge between the partners in a prior agreement on the use of noncharacters, which would severely undercut its utility." You might want to check whether some statement like this isn't already part of the FAQ. If it isn't, it would be the easiest to retrofit (and the easiest place to lay out usage guidelines). Alternatively, or in conjunction, you could propose that the text in the core specification be tweaked to help set better expectations. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Jun 1 14:40:35 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 01 Jun 2014 12:40:35 -0700 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com><5388D29C.9040502@ix.netcom.com><538B3D8B.2070102@khwilliamson.com> Message-ID: <538B81B3.2040900@ix.netcom.com> On 6/1/2014 9:07 AM, Markus Scherer wrote: > On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson > > wrote: > > Thanks, I had not thought about that. I'm thinking wording > something like this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable > to do so without prior agreement, since at each stage any of them > might be replaced by a REPLACEMENT CHARACTER or otherwise disposed > of, at the sole discretion of that stage's implementation." > > > I think that would invite again the kinds of implementations that > triggered Corrigendum #9, where you couldn't use CLDR files with > Gnome-based tools (plain text editors, file diff tools, command-line > terminal) if the files contained noncharacters. (CLDR data uses them > for boundary mappings in collation data.) > > The new text triggers some really unwarranted interpretations, which can invalidate the use of noncharacters for their stated purpose. Please see my suggested text that attempts to describe both intent and differences in use. 
A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Jun 1 20:36:14 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 2 Jun 2014 02:36:14 +0100 Subject: Wild Card Collation Matches Message-ID: <20140602023614.0b013b0a@JRWUBU2> In a fairly wild environment (http://www.thaivisa.com/forum/topic/730564-new-front-end-to-ri-dictionary-alpha), I encountered the following question: "If you search for ?* do you expect to return words such as ???? and ????" Now, as a regular expression, in UTS#18 'Unicode Regular Expressions' Version 13 (dated 2008, superseded in 2012), RL3.5 comes pretty close to this with ranges tailored for collation. The pattern [\u0E01-\u0E02]* would match both those words. To be precise, one would use a search for [?-??]*. RL3.5 has been withdrawn because of difficulties, though I can't say that I see it as a major difficulty that at least one of [A-z] and [a-Z] is empty. Even POSIX is aware of that little issue. Turning to a fully collation-based definition of searches, UTS#10 Unicode Collation Algorithm's definition DS2 comes closest to answering the question for the UTC. DS2 reads: DS2. The pattern string P has a match at Q[s,e] according to collation C if C generates the same sort key for P as for Q[s,e], and the offsets s and e meet the boundary condition B. One can also say P has a match in Q according to C. It's a simple job to create sequences of codepoints P starting with U+0E01 THAI CHARACTER KO KAI that are tertiary matches for ???? and ??? under both DUCET and the CLDR collations for Thai. Can I therefore say that the two strings match the pattern ?* according to these collations? (A pattern P for ??? is P = .) Disturbingly, another possible answer is that there is no match for in either string because it only occurs in the legacy/extended grapheme cluster . Richard. 
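DS2's notion of matching by sort key rather than by code point can be sketched with java.text.Collator. Since the exact Thai strings above did not survive the archive, this sketch uses an English collator and an accent difference as the stand-in example (class name is mine):

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: DS2-style matching compares collation keys. At PRIMARY
// strength, accent (secondary) differences are ignored, so a pattern
// can "match" text it is not codepoint-for-codepoint equal to.
public class SortKeyMatchDemo {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.US);
        c.setStrength(Collator.PRIMARY);
        boolean match = c.getCollationKey("resume")
                         .compareTo(c.getCollationKey("r\u00E9sum\u00E9")) == 0;
        System.out.println(match); // true at primary strength
    }
}
```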
From mark at macchiato.com Mon Jun 2 04:29:09 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 11:29:09 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140601180457.273ac6b9@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> Message-ID: > \uD808\uDF45 specifies a sequence of two codepoints. ?That is simply incorrect.? In Java (and similar environments), \uXXXX means a char (a UTF16 code unit), not a code point. Here is the difference. If you are not used to Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x with the replacement y in string. Backslashes in literals need escaping, so \x needs to be written in literals as \\x. String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", "?.?"}; String target = "one: ?\uD808\uDF45?\t\t" + "two: ?\uD808\uDF45\uD808\uDF45?\t\t" + "lead: ?\uD808?\t\t" + "trail: ?\uDF45?\t\t" + "one+: ?\uD808\uDF45\uD808?"; System.out.println("pattern" + "\t?\t" + target + "\n"); for (String test : tests) { System.out.println(test + "\t?\t" + target.replaceAll(test, "??")); } *?Output:* pattern ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? \x{12345} ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? \uD808\uDF45 ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? ?? ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? ?.? ? one: ?? two: ?????? lead: ?? trail: ?? one+: ????? The target has various combinations of code units, to see what happens. Notice that Java treats a pair of lead+trail as a single code point for matching (eg .), but also an isolated surrogate char as a single code point (last line of output). Note that Java's regex in addition allows \x{hex} for specifying a code point explicitly. 
It also has the syntax \uXXXX (in a literal the \ needs escaping) to specify a code unit; that is slightly different from the Java preprocessing. Thus the first two are equivalent, and replace "{" by "x". The last two are also equivalent, and fail, because a single "{" is a broken regex pattern. System.out.println("{".replaceAll("\\u007B", "x")); System.out.println("{".replaceAll("\\x{7B}", "x")); System.out.println("{".replaceAll("\u007B", "x")); System.out.println("{".replaceAll("{", "x")); Mark *Il meglio è l'inimico del bene* On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 1 Jun 2014 08:58:26 -0700 > Markus Scherer wrote: > > > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a > > supplementary code point, but as long as you have a surrogate pair, > > it is treated as a code point in APIs that support them. > > Wasn't it obvious that in the following paragraph \uD808\uDF45 was a > pattern? > > "Bear in mind that a pattern \uD808 shall not match anything in a > well-formed Unicode string. \uD808\uDF45 specifies a sequence of two > codepoints. This sequence can occur in an ill-formed UTF-32 Unicode > string and before Unicode 5.2 could readily be taken to occur in an > ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular > expression engine, the codepoint sequence <U+D808, U+DF45> cannot > occur in a UTF-16 Unicode string; instead, the code unit sequence <D808 > DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES > KI>." > > (It might have been clearer to you if I'd said '8-bit' and '16-bit' > instead of UTF-8 and UTF-16. It does make me wonder what you'd call a > 16-bit encoding of arbitrary *codepoint* sequences.) > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
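Mark's example above (whose supplementary characters were garbled by the list archiver) can be restated using escapes only: in java.util.regex, \x{hex} denotes a code point, and "." consumes a surrogate pair as a single code point (a sketch; class name is mine):

```java
// Sketch: java.util.regex matches by code point, not by code unit.
public class CodePointRegexDemo {
    public static void main(String[] args) {
        String target = "a\uD808\uDF45b"; // 'a', U+12345, 'b': three code points
        // \x{12345} matches the single supplementary code point.
        System.out.println(target.replaceAll("\\x{12345}", "*")); // a*b
        // "." also consumes the surrogate pair as one code point.
        System.out.println(target.replaceAll(".", "*")); // ***
    }
}
```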
URL: From verdy_p at wanadoo.fr Mon Jun 2 06:44:04 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 2 Jun 2014 13:44:04 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> Message-ID: Your example would have been better explained by just saying that in Java, the regexp represented in source code as "\\uD808\\uDF45" means matching two successive 16-bit code units, and "\\uD808" or "\\uDF45" just matches one. The "\\uNNNN" regexp notation (in source code, equivalent to "\uNNNN" in a string at runtime) does not necessarily designate a full code point, unlike the "\\x{NNNN}" and "." regexes, which will necessarily match a full code point in the target (even if it's an isolated surrogate). But there's no way in Java to represent a target string that can store arbitrary sequences of codepoints if you use the String type (this is not specific to Java but applies as well to any language or runtime library handling streams of 16-bit code units, including C, C++, Python, Javascript, PHP...). The problem is then not in the way you write regexps, but in the way the target string is encoded: it is not technically possible with 16-bit streams to represent arbitrary sequences of codepoints, only arbitrary sequences of 16-bit code units (even if they aren't valid UTF-16 text). But there's no problem at all in processing valid UTF-16 streams. Your "lead", "trail" and "one+" targets are representable in Java as arbitrary 16-bit streams, but they do not represent valid Unicode texts. Conversely, all your "tests[]" strings are valid Unicode texts, but their interpretations as regexps are not necessarily valid. 
Each time you use single backslashes in a Java source-code string, there's no guarantee it will be a valid Unicode text even though it will compile without problem as a valid 16-bit stream (and the same is true in other languages). If you want to represent arbitrary sequences of codepoints in a target text, you cannot use any UTF alone (it may be technically possible with UTF-8 or UTF-32, but such sequences are still invalid under those standard encodings) without using an escaping mechanism such as the double backslashes used in the notation of regexps. This escaping mechanism is then independent of the actual runtime encoding used to transport the escaped streams within valid Unicode texts. In summary, arbitrary sequences of codepoints in a valid Unicode text require an escaping mechanism on top of the actual text encoding used for storage or transport (there are other ways to escape arbitrary streams into valid texts, including the U+NNNNNN notation, Base64, hex or octal representations of UTF-32, Punycode, and many other techniques used to embed binary objects, such as UUCP or PostScript streams). In HTTP a few of them are supported as standard "transport syntaxes". 
Terminal protocols (like VT220 and related, or Videotext) have long used escape sequences (plus controls like SI/SO encapsulation and isolated DLE escapes for transporting 8-bit data over a 7-bit stream). Technically, Java strings at runtime are not plain text but binary objects, unless they are checked on input and the validity conditions are not broken by text transforms (such as extraction of substrings at arbitrary absolute positions, or error recovery with resynchronization after a failure or missing data; these errors are likely to occur because we have no guarantee that validity is preserved during the exchange by matching preconditions and postconditions). The same is true for C/C++ standard strings, PHP strings, and the content transported by an HTTP session or a terminal protocol (which also defines its own escaping mechanism where needed). If you develop a general-purpose library in any language that can be reused in arbitrary code, you cannot assume on input that all preconditions are satisfied, so you need to check the input. You also have to be careful about the design of your library to make sure that it respects the postconditions (some library APIs are technically unsafe, notably substring extraction and blocking I/O using fixed-size buffers, such as file I/O in filesystems that do not discriminate between text files and binary files; text files would need variable-length buffers broken only at codepoint positions and not at arbitrary code unit positions). As far as I know, no filesystem enforces code point positions, unless it uses non-space-efficient encodings with code units wider than 20 bits (storage devices are optimized for code units whose size in bytes is a power of 2, so you would end up using only files whose size in bytes is a multiple of 4, with all random-access file positions also a multiple of 4 bytes). 
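[The precondition check Verdy describes for a general-purpose library taking 16-bit input reduces, in Java terms, to verifying that the char sequence is well-formed UTF-16: every lead surrogate immediately paired with a trail, and no stray trail surrogates. A sketch; the class and method names are mine, not from the thread:]

```java
// Sketch of an input precondition check for a library accepting
// CharSequence: valid UTF-16 means no unpaired surrogate code units.
public class Utf16Check {
    public static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // lone lead surrogate
                }
                i++; // skip the trail unit of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // trail surrogate with no lead
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("\uD808\uDF45")); // true: valid pair
        System.out.println(isWellFormedUtf16("\uD808"));       // false: lone lead
        System.out.println(isWellFormedUtf16("\uDF45"));       // false: lone trail
    }
}
```

[Mark's "lead", "trail" and "one+" targets would all fail this check, which is exactly the sense in which they are 16-bit streams rather than Unicode text.]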
You could also use 24-bit storage code units with blocks limited to sectors of 256 bytes, with the extra byte used only as a filler or as a length indicator in that sector (255 bytes would store 85 arbitrary 24-bit code units, but you would still need to check the value range of these code units if you want to restrict to the U+0000..U+10FFFF codepoint space, unless your application code handles all of the extra code units like non-character code points). However, the filesystem could perform this check when writing text files, so that it could mark files that are valid Unicode strings by updating some metadata (that metadata could be stored in the spare byte of the first 256-byte sector, or in a separate indexing database of compatible files). You could do the same for in-memory temporary buffers by keeping this information (this would allow streams that are not plain text to be discriminated very early, without having to process them up to the end). A relational database could also perform this check when creating indexes on table keys, so that it would know that it can only return valid text for any subselection in a table. In all cases, this will still be more efficient for small storage than using any transport syntax or escaping; but for moderate and large volumes, the transport syntax or escaping mechanism often wins in terms of performance by minimizing the volume of I/O, notably if these I/Os are very costly compared to data in working memory or even in CPU data caches (but only if this data is very frequently reaccessed in that cache). However, if your I/O is very slow compared to the CPU and the data volume is sufficiently large, it is always better to use UTF-32 in memory but store that data in a compressed stream (you can safely use a generic binary compressor, which will generally work better with UTF-32 than with UTF-8 or UTF-16 for moderate and large volumes). 
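[The value-range check mentioned above is cheap to state precisely: a Unicode scalar value is any code point from 0 to 10FFFF excluding the surrogate range D800..DFFF, which UTF-32 (and UTF-8) must never contain. A minimal sketch of such a per-code-point check; class and method names are mine:]

```java
// Sketch: the range check a UTF-32 reader might apply per code unit.
// Accepts exactly the Unicode scalar values: 0..10FFFF minus surrogates.
public class RangeCheck {
    public static boolean isScalarValue(int cp) {
        return cp >= 0 && cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF);
    }

    public static void main(String[] args) {
        System.out.println(isScalarValue(0x12345));  // true
        System.out.println(isScalarValue(0xD808));   // false: surrogate code point
        System.out.println(isScalarValue(0x110000)); // false: beyond plane 16
    }
}
```

[Two comparisons per code point is the kind of branch-predictable fast path the following paragraph argues is affordable even in production code.]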
Simple algorithms like deflate or even basic Huffman encoding will deliver excellent throughput with very modest CPU cost compared to the huge I/O costs that such compression saves (and in that case, even a range check on input will have insignificant cost, thanks to branch prediction in your code using the fast pipelined path only for valid texts and the slower non-pipelined path only for exceptions and error handling; most processors and CPU caches now use excellent branch predictors, even if code compilers can help them). In summary, range checking should no longer be only a debugging option in code (even for production code), even for internal libraries; its cost rapidly becomes insignificant for large data volumes. Just design your algorithms to minimize the number of state variables and minimize table lookups in order to improve data locality, rather than minimizing local data sizes for just one or a few code points or code units: select runtime code units that can fit in a single CPU register (almost all processors today have registers at least 32 bits wide, so UTF-32 is not a problem for local processing in native code). 2014-06-02 11:29 GMT+02:00 Mark Davis ?? : > > \uD808\uDF45 specifies a sequence of two codepoints. > > ?That is simply incorrect.? > > In Java (and similar environments), \uXXXX means a char (a UTF16 code > unit), not a code point. Here is the difference. If you are not used to > Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x > with the replacement y in string. Backslashes in literals need escaping, so > \x needs to be written in literals as \\x. 
> > String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", > "?.?"}; > String target = > "one: ?\uD808\uDF45?\t\t" + > "two: ?\uD808\uDF45\uD808\uDF45?\t\t" + > "lead: ?\uD808?\t\t" + > "trail: ?\uDF45?\t\t" + > "one+: ?\uD808\uDF45\uD808?"; > System.out.println("pattern" + "\t?\t" + target + "\n"); > for (String test : tests) { > System.out.println(test + "\t?\t" + target.replaceAll(test, "??")); > } > > > *?Output:* > pattern ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > > \x{12345} ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > \uD808\uDF45 ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > ?? ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > ?.? ? one: ?? two: ?????? lead: ?? trail: ?? one+: ????? > > The target has various combinations of code units, to see what happens. > Notice that Java treats a pair of lead+trail as a single code point for > matching (eg .), but also an isolated surrogate char as a single code point > (last line of output). Note that Java's regex in addition allows \x{hex} > for specifying a code point explicitly. It also has the syntax \uXXXX (in a > literal the \ needs escaping) to specify a code unit; that is slightly > different than the Java preprocessing. Thus the first two are equivalent, > and replace "{" by "x". The last two are also equivalent?and fail?because a > single "{" is a broken regex pattern. > > System.out.println("{".replaceAll("\\u007B", "x")); > System.out.println("{".replaceAll("\\x{7B}", "x")); > > System.out.println("{".replaceAll("\u007B", "x")); > System.out.println("{".replaceAll("{", "x")); > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > > On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > >> On Sun, 1 Jun 2014 08:58:26 -0700 >> Markus Scherer wrote: >> >> > You misunderstand. 
In Java, \uD808\uDF45 is the only way to escape a >> > supplementary code point, but as long as you have a surrogate pair, >> > it is treated as a code point in APIs that support them. >> >> Wasn't obvious that in the following paragraph \uD808\uDF45 was a >> pattern? >> >> "Bear in mind that a pattern \uD808 shall not match anything in a >> well-formed Unicode string. \uD808\uDF45 specifies a sequence of two >> codepoints. This sequence can occur in an ill-formed UTF-32 Unicode >> string and before Unicode 5.2 could readily be taken to occur in an >> ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular >> expression engine, the codepoint sequence cannot >> occur in a UTF-16 Unicode string; instead, the code unit sequence > DF45> is the codepoint sequence > KI>." >> >> (It might have been clearer to you if I'd said '8-bit' and '16-bit' >> instead of UTF-8 and UTF-16. It does make me wonder what you'd call a >> 16-bit encoding of arbitrary *codepoint* sequences.) >> >> Richard. >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 2 10:27:22 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 02 Jun 2014 08:27:22 -0700 Subject: Corrigendum #9 Message-ID: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> It seems that the broadening of the term "interchange" in this corrigendum to mean "almost any type of processing imaginable," below, is what caused the trouble. This is the decision that would need to be reconsidered if the real intent of noncharacters is to be expressed. 
I suspect everyone can agree on the edge cases, that noncharacters are harmless in internal processing, but probably should not appear in random text shipped around on the web. > This is necessary for the effective use of noncharacters, because > anytime a Unicode string crosses an API boundary, it is in effect > being "interchanged". Furthermore, for distributed software, it is > often very difficult to determine what constitutes an "internal" > versus an "external" context for any particular software process. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From markus.icu at gmail.com Mon Jun 2 10:48:57 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 2 Jun 2014 08:48:57 -0700 Subject: Corrigendum #9 In-Reply-To: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell wrote: > I suspect everyone can agree on the edge cases, that noncharacters are > harmless in internal processing, but probably should not appear in > random text shipped around on the web. > Right, in principle. However, it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc. It seems that trying to define "interchange" and "public" in ways that satisfy everyone will not be successful. The FAQ already gives some examples of where noncharacters might be used, should be preserved, or could be stripped, starting with "Q: Are noncharacters intended for interchange? " In my view, those Q/A pairs explain noncharacters quite well. If there are further examples of where noncharacters might be used, should be preserved, or could be stripped, and that would be particularly useful to add to the examples already there, then we could add them. 
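[The set under discussion is fixed by the standard: the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The gatekeeping filter that apps importing foreign data would apply reduces to a small predicate. A sketch; the class and method names are mine, not from the thread:]

```java
// Sketch: identify the 66 noncharacter code points
// (U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF in every plane 0..16).
// Assumes cp is already a valid code point in 0..10FFFF.
public class NoncharCheck {
    public static boolean isNoncharacter(int cp) {
        // (cp & 0xFFFE) == 0xFFFE matches exactly the ...FFFE and ...FFFF values
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        System.out.println(isNoncharacter(0xFFFE));   // true
        System.out.println(isNoncharacter(0xFDD0));   // true
        System.out.println(isNoncharacter(0x10FFFF)); // true: last code point of plane 16
        System.out.println(isNoncharacter(0xFFFD));   // false: replacement character
    }
}
```

[Whether a tool should drop, replace, or pass through the code points this predicate flags is exactly the policy question the thread is debating; the predicate itself is uncontroversial.]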
markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Mon Jun 2 11:02:58 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 16:02:58 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: I also think that the verbiage swung too far the other way. Sure, I might need to save or transmit a file to talk to myself later, but apps should be strongly discouraged from using these for interchange with other apps. Interchange bugs are why nearly any news web site ends up with at least a few articles with mangled apostrophes or whatever (because of encoding differences). Should authors' tools or feeds or databases or whatever start emitting non-characters from internal use, then we're going to have ugly leaks into text "everywhere". So I'd prefer to see text that better permitted interchange with other components of an application's internal system or partner system, yet discouraged use for interchange with "foreign" apps. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Jun 2 11:08:14 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 18:08:14 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of "apps", where it is perfectly reasonable to interchange sentinel values (for example). I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) Mark *« Il meglio è l'inimico del bene »* On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele wrote: > I also think that the verbiage swung too far the other way. 
Sure, I > might need to save or transmit a file to talk to myself later, but apps > should be strongly discouraged from using these for interchange with other > apps. > > > > Interchange bugs are why nearly any news web site ends up with at least a > few articles with mangled apostrophes or whatever (because of encoding > differences). Should authors' tools or feeds or databases or whatever > start emitting non-characters from internal use, then we're going to have > ugly leaks into text "everywhere". > > > > So I'd prefer to see text that better permitted interchange with other > components of an application's internal system or partner system, yet > discouraged use for interchange with "foreign" apps. > > > > -Shawn > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Mon Jun 2 11:21:23 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 16:21:23 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> That's exactly what I think should be clarified. A cooperating system of apps should likely use some other markup; however, if they want to use FFFF to say "OK to insert ad here" (or whatever), that's up to them. I fear that the current wording says "Because you might have a cooperating system of apps that all agree FFFF is 'OK to insert ad here', you may as well emit FFFF all the time just in case some other app happens to use the same sentinel". The "problem" is now that previously these characters were illegal, so my application didn't have to explicitly remove them when importing external stuff because they weren't allowed to be there. 
With the wording of the corrigendum, the onus is on every app importing data to filter out these code points because they are "suddenly" legal in foreign data streams. That is a breaking change for applications, and, worse, it isn't in the control of the applications that take advantage of the newly laxer wording, but rather of all the other applications on the planet, which may have been stable for years. My interpretation of "interchanged" was "interchanged outside of a system that understood your private use of the noncharacters". I can see where that may not have been everyone's interpretation, and maybe it should be updated. My interpretation of what you're saying below is "sentinel values with a private meaning can be exchanged between apps", which is what the PUA's for. I don't mind at all if the definition is loosened somewhat, but if we're turning them into PUA characters we should just turn them into PUA characters. -Shawn From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis ☕️ Sent: Monday, June 2, 2014 9:08 AM To: Shawn Steele Cc: Markus Scherer; Doug Ewell; Unicode Mailing List Subject: Re: Corrigendum #9 The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of "apps", where it is perfectly reasonable to interchange sentinel values (for example). I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) Mark « Il meglio è l'inimico del bene » On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele > wrote: I also think that the verbiage swung too far the other way. Sure, I might need to save or transmit a file to talk to myself later, but apps should be strongly discouraged from using these for interchange with other apps. Interchange bugs are why nearly any news web site ends up with at least a few articles with mangled apostrophes or whatever (because of encoding differences). Should authors' 
tools or feeds or databases or whatever start emitting non-characters from internal use, then we?re going to have ugly leak into text ?everywhere?. So I?d prefer to see text that better permitted interchange with other components of an application?s internal system or partner system, yet discouraged use for interchange with ?foreign? apps. -Shawn _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Jun 2 11:27:54 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 18:27:54 +0200 Subject: Corrigendum #9 In-Reply-To: <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele wrote: > The ?problem? is now that previously these characters were illegal The problem was that we were inconsistent in standard and related material about just what the status was for these things. Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Jun 2 11:35:54 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 09:35:54 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <538CA7EA.5020906@ix.netcom.com> On 6/2/2014 9:08 AM, Mark Davis ?? wrote: > The problem is where to draw the line. In today's world, what's an > app? You may have a cooperating system of "apps", where it is > perfectly reasonable to interchange sentinel values (for example). 
The way to draw the line is to insist on there being an agreement between sender and ultimate receiver, and a pass-through agreement (if you will) for any intermediate stage, so that the coast is clear. What defines an "implementation" in this scenario is the existence of the agreement. What got us into trouble is that the negative case (pass-through) was not well-defined, and led to people assuming that they had to filter any incoming noncharacters. Because noncharacters can have any interpretation (not limited to interpretations as characters), it is much riskier to send them out oblivious to whether the intended recipient is party to the same agreement on their interpretation as the sender. In that sense, they are not mere PUA code points. The other aspect of their original design was to allow code points that recipients were free not to honor or preserve, if they were not part of the agreement (and hadn't made an explicit or implicit pass-through agreement). Otherwise, if anyone expects them to be preserved, no application, like Word, would be free to use these for purely internal use. Word thus would not be a tool to handle CLDR data, which may be disappointing to some, but should be fine. A./ > > I agree with Markus; I think the FAQ is pretty clear. (And if not, > that's where we should make it clearer.) > > > Mark > / > / > /« Il meglio è l'inimico del bene »/ > // > > > On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele > > wrote: > > I also think that the verbiage swung too far the other way. Sure, > I might need to save or transmit a file to talk to myself later, > but apps should be strongly discouraged from using these for > interchange with other apps. > > Interchange bugs are why nearly any news web site ends up with at > least a few articles with mangled apostrophes or whatever (because > of encoding differences). Should authors' 
tools or feeds or > databases or whatever start emitting non-characters from internal > use, then we?re going to have ugly leak into text ?everywhere?. > > So I?d prefer to see text that better permitted interchange with > other components of an application?s internal system or partner > system, yet discouraged use for interchange with ?foreign? apps. > > -Shawn > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 2 11:36:50 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 02 Jun 2014 09:36:50 -0700 Subject: Corrigendum #9 Message-ID: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> Shawn Steele wrote: > So I?d prefer to see text that better permitted interchange with other > components of an application?s internal system or partner system, yet > discouraged use for interchange with "foreign" apps. If any wording is to be revised, while we're at it, I'd also like to see a reaffirmation of the proper relationship between private-use characters and noncharacters. I still hear arguments that private-use characters are to be avoided in public interchange at all costs, as if lack of knowledge of the private agreement, or conflicting interpretations, will cause some kind of major security breach. At the same time, the Corrigendum seems to imply that noncharacters in public interchange are no big deal. That seems upside-down. Mark Davis ?? replied: > The problem is where to draw the line. In today's world, what's an > app? You may have a cooperating system of "apps", where it is > perfectly reasonable to interchange sentinel values (for example). Correct. 
Most people wouldn't consider a cooperating system like that quite the same as true public interchange, like throwing this ??? into a message on a public mailing list. Since the Corrigendum deals with recommendations rather than hard requirements, SHOULDs rather than MUSTs, it doesn't seem that a bright line is really needed. > I agree with Markus; I think the FAQ is pretty clear. (And if not, > that's where we should make it clearer.) But the formal wording of the standard should reflect that clarity, right? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From asmusf at ix.netcom.com Mon Jun 2 11:37:19 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 09:37:19 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <538CA83F.9090500@ix.netcom.com> On 6/2/2014 9:27 AM, Mark Davis ?? wrote: > > On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele > > wrote: > > The ?problem? is now that previously these characters were illegal > > > The problem was that we were inconsistent in standard and related > material about just what the status was for these things. > > And threw the baby out to fix it. A./ > > Mark > / > / > /? Il meglio ? l?inimico del bene ?/ > // > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Shawn.Steele at microsoft.com Mon Jun 2 11:38:28 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 16:38:28 +0000 Subject: Corrigendum #9 In-Reply-To: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> Message-ID: > > I agree with Markus; I think the FAQ is pretty clear. (And if not, > > that's where we should make it clearer.) > But the formal wording of the standard should reflect that clarity, right? I don't tend to read the FAQ :) From doug at ewellic.org Mon Jun 2 11:44:06 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 02 Jun 2014 09:44:06 -0700 Subject: Corrigendum #9 Message-ID: <20140602094406.665a7a7059d7ee80bb4d670165c8327d.edf1a109a4.wbe@email03.secureserver.net> I wrote, sort of: > Correct. Most people wouldn't consider a cooperating system like that > quite the same as true public interchange, like throwing this ??? > into a message on a public mailing list. Oh, look. My mail system converted those nice noncharacters into U+FFFD. Was that compliant? Did I deserve what I got? Are those two different questions? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From mark at macchiato.com Mon Jun 2 11:47:44 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 18:47:44 +0200 Subject: Corrigendum #9 In-Reply-To: <538CA83F.9090500@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> <538CA83F.9090500@ix.netcom.com> Message-ID: I disagree with that characterization, of course. The recommendation for libraries and low-level tools to pass them through rather than screw with them makes them usable. 
The recommendation to check for noncharacters from unknown sources and fix them was good advice then, and is good advice now. Any app where input of noncharacters causes security problems or crashes is, and was, not a very good app. Mark *« Il meglio è l'inimico del bene »* On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag wrote: > On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote: > > > On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele > wrote: > >> The "problem" is now that previously these characters were illegal > > > The problem was that we were inconsistent in standard and related > material about just what the status was for these things. > > > And threw the baby out to fix it. > > A./ > > > Mark > > *« Il meglio è l'inimico del bene »* > > > _______________________________________________ > Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Jun 2 11:49:29 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 09:49:29 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> Message-ID: <538CAB19.7020208@ix.netcom.com> On 6/2/2014 9:38 AM, Shawn Steele wrote: >>> I agree with Markus; I think the FAQ is pretty clear. (And if not, >>> that's where we should make it clearer.) >> But the formal wording of the standard should reflect that clarity, right? > I don't tend to read the FAQ :) FAQs are useful, but they are not binding. They are even less binding than the general explanations in the text of the Core Specification, which themselves don't rise to the level of conformance clauses and definitions... Doug's unease about the "upside-down" nature of the wording regarding PUA and noncharacters is something that should be addressed in revised text in the core specification. 
A./ > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From Shawn.Steele at microsoft.com Mon Jun 2 12:00:59 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 17:00:59 +0000 Subject: Corrigendum #9 In-Reply-To: <538CAB19.7020208@ix.netcom.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> Message-ID: <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like the old escape characters to get a dot matrix printer to shift modes, or old word processor internal formatting sequences. From Shawn.Steele at microsoft.com Mon Jun 2 12:08:38 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 17:08:38 +0000 Subject: Corrigendum #9 In-Reply-To: <20140602094406.665a7a7059d7ee80bb4d670165c8327d.edf1a109a4.wbe@email03.secureserver.net> References: <20140602094406.665a7a7059d7ee80bb4d670165c8327d.edf1a109a4.wbe@email03.secureserver.net> Message-ID: > Oh, look. My mail system converted those nice noncharacters into U+FFFD. > Was that compliant? Did I deserve what I got? Are those two different questions? I think I just got spaces. 
From markus.icu at gmail.com Mon Jun 2 12:17:04 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 2 Jun 2014 10:17:04 -0700 Subject: Corrigendum #9 In-Reply-To: <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele wrote: > To further my understanding, can someone provide examples of how these are > used in actual practice? > CLDR collation data defines special contraction mappings that start with a noncharacter, for http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers In CLDR 23 and before (when we were still using XML collation syntax), these were raw noncharacters in the .xml files. As I said earlier: it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Mon Jun 2 12:50:18 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 17:50:18 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <8ef7b3954b13479cad76585e628fb83b@BY2PR03MB491.namprd03.prod.outlook.com> Hmm, I find that disconcerting. I'd prefer a real Unicode character with special weights if that concept's needed. And I guess that goes a long ways to explaining the interchange problem since clearly the code editor's going to need these … 
From: Markus Scherer [mailto:markus.icu at gmail.com] Sent: Monday, June 2, 2014 10:17 AM To: Shawn Steele Cc: Asmus Freytag; Doug Ewell; Mark Davis ??; Unicode Mailing List Subject: Re: Corrigendum #9 On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele > wrote: To further my understanding, can someone provide examples of how these are used in actual practice? CLDR collation data defines special contraction mappings that start with a noncharacter, for http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers In CLDR 23 and before (when we were still using XML collation syntax), these were raw noncharacters in the .xml files. As I said earlier: it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Jun 2 13:05:11 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 2 Jun 2014 19:05:11 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <20140602190511.5f67ffd8@JRWUBU2> On Mon, 2 Jun 2014 10:17:04 -0700 Markus Scherer wrote: > CLDR collation data defines special contraction mappings that start > with a noncharacter, for > http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers > In CLDR 23 and before (when we were still using XML collation syntax), > these were raw noncharacters in the .xml files. > As I said earlier: > it should be ok to include noncharacters in CLDR data files for > processing by CLDR implementations, and it should be possible to edit > and diff and version-control and web-view those files etc. 
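Markus's point that noncharacter-bearing data files remain editable, diffable, and versionable rests on the fact that the standard UTFs encode noncharacters like any other scalar value. A quick Python check (the sample string is illustrative, not actual CLDR data):

```python
# U+FDD0 is the kind of index-marker noncharacter CLDR prepends to
# contractions; the CJK character here is just an illustrative suffix.
data = "\ufdd0\u4e00"
encoded = data.encode("utf-8")
assert encoded == b"\xef\xb7\x90\xe4\xb8\x80"   # well-formed UTF-8
assert encoded.decode("utf-8") == data          # round-trips losslessly
```

Contrast unpaired surrogates, which Python's strict UTF-8 codec refuses to encode at all; noncharacters are ordinary scalar values as far as the encoding forms are concerned.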
They come as a nasty shock when someone thinks XML files are marked-up text files. I'm still surprised that the published human-readable form of CLDR files should contain automatically applied non-Unicode copyright claims. Richard. From richard.wordingham at ntlworld.com Mon Jun 2 15:01:53 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 2 Jun 2014 21:01:53 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> Message-ID: <20140602210153.40a8bf08@JRWUBU2> On Mon, 2 Jun 2014 11:29:09 +0200 Mark Davis ☕️ wrote: > > \uD808\uDF45 specifies a sequence of two codepoints. > > "That is simply incorrect." The above is in the sample notation of UTS #18 Version 17 Section 1.1. From what I can make out, the corresponding Java notation would be \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in Java, or whether they are even acceptable. The only thing UTS #18 RL1.7 permits them to match in Java is lone surrogates, but I don't know if Java complies. All UTS #18 says for sure about regular expressions matching code units is that they don't satisfy RL1.1, though Section 1.7 appears to ban them when it says, "A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units". Perhaps it's a fundamental requirement of something other than UTS #18. I thought matching parts of characters in terms of their canonical equivalences was awkward enough, without having the additional option of matching some of the code units! Richard. 
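The code-point versus code-unit distinction Richard is probing shows up directly in a code-point-based language such as Python (this illustrates the semantics only; it says nothing about what Java or any particular regex engine does):

```python
import re

supp = "\U00012345"     # one supplementary-plane code point
pair = "\ud808\udf45"   # two lone surrogate code points: not the same thing

assert len(supp) == 1 and len(pair) == 2
assert supp != pair                              # code points, not UTF-16 code units
assert re.search(supp, supp) is not None         # matches the single code point
assert re.search(pair, supp) is None             # surrogate pair does not match it
```

In a UTF-16 code-unit-based engine the last two results would be reversed, which is precisely why UTS #18 insists on interpretation by code point.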
From prosfilaes at gmail.com Mon Jun 2 15:32:43 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 2 Jun 2014 13:32:43 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer wrote: > Right, in principle. However, it should be ok to include noncharacters in > CLDR data files for processing by CLDR implementations, and it should be > possible to edit and diff and version-control and web-view those files etc. Why? It seems you're changing the rules so some Unicode guys can get oversmart in using Unicode in their systems. You could do the same thing everyone else does and use special tags or symbols you have to escape. I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility. -- Kie ekzistas vivo, ekzistas espero. From markus.icu at gmail.com Mon Jun 2 16:53:08 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 2 Jun 2014 14:53:08 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 1:32 PM, David Starner wrote: > I would especially discourage any web browser from handling > these; they're noncharacters used for unknown purposes that are > undisplayable and if used carelessly for their stated purpose, can > probably trigger serious bugs in some lamebrained utility. > I don't expect "handling these" in web browsers and lamebrained utilities. I expect "treat like unassigned code points". markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Shawn.Steele at microsoft.com Mon Jun 2 17:07:03 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 22:07:03 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <81e121ab27544aeca6f23663850c32dd@BY2PR03MB491.namprd03.prod.outlook.com> Except that, particularly the max-weight ones, mean that developers can be expected to use this as sentinels in code using ICU, which would preclude their use for other things… Which makes them more like "reserved for use in CLDR" than "noncharacters"… -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Markus Scherer Sent: Monday, June 2, 2014 2:53 PM To: David Starner Cc: Unicode Mailing List Subject: Re: Corrigendum #9 On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility. I don't expect "handling these" in web browsers and lamebrained utilities. I expect "treat like unassigned code points". markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 2 17:06:21 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 00:06:21 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: We can still draw a line: interchange should be meant so that other non-Unicode standards should find their way to not mix up random data within plain text without defining a clear encapsulation and escaping mechanism that ensures that plain text remains isolatable. 
In other words, design separate layers of representation and processing, and be more imaginative when you design an application or protocol, with better modeling. If an application really internally needs some non-characters, this is not really for encoding text but for the application/protocol-specific system of encapsulation, which should be clearly identified:
- these protocols can use separate APIs for handling objects that are composite and contain some text but that are not text by themselves.
- they should isolate data types (or MIME types)
- they should use some "magic" identifiers in the headers of their data, including versioning in their protocol
- they should document internally their own encapsulation/escaping mechanisms
- they should test them to make sure they preserve the valid text content without breaking it
As the kind of data is not text, we fall within the design of binary data formats. These kinds of statements mean that protocols and APIs will be improved for better separation of layers, working more as separate black boxes. But it's not up to the Unicode standard to explain how they will do it. So for me non-characters are not Unicode text, they are not text at all, and we should not attempt to make them legal if we want to allow strong designs of isolation mechanisms that allow this separation of layers. The Unicode standard offers enough space for this separation, with non-characters (invalid in all standard UTFs), and with invalid code sequences in standard UTFs that allow building up specific encodings that must not be called "UTFs" (or "Unicode" or "UCS" or other terms defined in TUS) and are identified as such in API/protocol designs. Things would simply be better if TUS did not even define what a non-character is and if it did not even suggest that they are legal in "some" circumstance of text "interchange". 2014-06-02 18:08 GMT+02:00 Mark Davis ☕️ : > The problem is where to draw the line. In today's world, what's an app? 
> You may have a cooperating system of "apps", where it is perfectly > reasonable to interchange sentinel values (for example). > > I agree with Markus; I think the FAQ is pretty clear. (And if not, that's > where we should make it clearer.) > > > Mark > > *« Il meglio è l'inimico del bene »* > > > On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele > wrote: > >> I also think that the verbiage swung too far the other way. Sure, I >> might need to save or transmit a file to talk to myself later, but apps >> should be strongly discouraged from using these for interchange with other >> apps. >> >> >> >> Interchange bugs are why nearly any news web site ends up with at least a >> few articles with mangled apostrophes or whatever (because of encoding >> differences). Should authors' tools or feeds or databases or whatever >> start emitting non-characters from internal use, then we're going to have >> ugly leaks into text "everywhere". >> >> >> >> So I'd prefer to see text that better permitted interchange with other >> components of an application's internal system or partner system, yet >> discouraged use for interchange with "foreign" apps. >> >> >> >> -Shawn >> >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> >> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmusf at ix.netcom.com Mon Jun 2 17:08:27 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 15:08:27 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <538CF5DB.3070007@ix.netcom.com> On 6/2/2014 2:53 PM, Markus Scherer wrote: > On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: > > I would especially discourage any web browser from handling > these; they're noncharacters used for unknown purposes that are > undisplayable and if used carelessly for their stated purpose, can > probably trigger serious bugs in some lamebrained utility. > > > I don't expect "handling these" in web browsers and lamebrained > utilities. I expect "treat like unassigned code points". > I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Mon Jun 2 17:09:21 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 2 Jun 2014 15:09:21 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer wrote: > On Mon, Jun 2, 2014 at 1:32 PM, David Starner wrote: >> >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. > > > I don't expect "handling these" in web browsers and lamebrained utilities. I > expect "treat like unassigned code points". So certain programs can't use noncharacters internally because some people want to interchange them? 
That doesn't seem like what noncharacters should be used for. Unix utilities shouldn't usually go to the trouble of messing with them; limiting the number of changes needed for Unicode was the whole point of UTF-8. Any program transferring them across the Internet as text should filter them, IMO; either some lamebrained utility will open a security hole by using them and not filtering first, or something will filter them after security checks have been done, or something. Unless it's a completely trusted system, text files with these characters should be treated with extreme prejudice by the first thing that receives them over the net. -- Kie ekzistas vivo, ekzistas espero. From Shawn.Steele at microsoft.com Mon Jun 2 17:21:14 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 22:21:14 +0000 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: <216013a44c0845d09d6ae7034dc22468@BY2PR03MB491.namprd03.prod.outlook.com> > I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else I think we could generalize to other scenarios so it wasn't necessarily an insider scenario. For example, I could have a string manipulation library that used FFFE to indicate the beginning of an identifier for a localizable sentence, terminated by FFFF. Any system using FFFEid1234FFFF would likely expect to be able to read the tokens in their favorite code editor. But I'm concerned that these "conflict" with each other, and embedding the behavior in major programming languages doesn't smell to me like "internal" use. Clearly if I wanted to use that library in a CLDR-aware app, there is a potential risk for a conflict. 
In the CLDR case, there *IS* a special relationship with Unicode, and perhaps it would be warranted to explicitly encode character(s) with the necessary meaning(s) to handle edge-case collation scenarios. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 2 17:20:49 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 00:20:49 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: I better expect: "treat them as you like", there will never be any warranty of interoperability, everyone is allowed to use them as they want and even change it at any time. The behavior is not defined in TUS, and users cannot expect that TUS will define this behavior. There's no clear solution about what to do if you encounter them in data supposed to be text. For me they are not text, so the whole data could be rejected, or the text remaining after some filtering may be falsely interpreted. You need an external specification outside TUS. I certainly do not consider non-characters like unassigned valid code points, where applications are strongly encouraged to not apply any kind of filter if they want to remain compatible with evolutions of the standard that may assign them (the best you can do with unassigned code points is treat them as symbols, with the minimal properties defined in the standard (notably Bidi properties according to their range, where this direction is defined in some ranges, or treat them as symbols with weak direction), even if applications cannot still render them (renderers will find a way to show them, generally using a .notdef glyph like empty boxes). Normalizers will also not mix them (the default combining class should be 0). 
Only applications that want to ensure that the text conforms to a specific version of the standard are allowed to filter out or signal as errors the presence of unassigned code points. But all applications can do that kind of thing on non-characters (or any code unit whose value falls outside the valid range of a defined UTF). This is an important difference: non-characters are not like unassigned code points; they are assigned to be considered invalid and filterable by design by any Unicode-conforming process for handling text. 2014-06-02 23:53 GMT+02:00 Markus Scherer : > On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: > >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. >> > > I don't expect "handling these" in web browsers and lamebrained utilities. > I expect "treat like unassigned code points". > > markus > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Mon Jun 2 17:55:31 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 00:55:31 +0200 Subject: Corrigendum #9 In-Reply-To: <81e121ab27544aeca6f23663850c32dd@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <81e121ab27544aeca6f23663850c32dd@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: "reserved for CLDR" would be wrong in TUS, you have reached a borderline where you are no longer handling plain text (stream of scalar values assigned to code points), but binary data via a binary interface outside TUS (handling streams of collation elements, whose representation is not even bound to the ICU implementation of CLDR for its own definitions and syntax for its tailorings). CLDR data defines its own interface and protocol, it can reserve these code points only for itself but not in TUS and no other conforming plain-text application is expected to accept these reservations, so they can **freely** mark them in error, replace them, or filter them out, or interpret them differently for their own usage, using their own specification and encapsulation mechanisms and specific **non-plain-text** data types. CLDR data transmitted in binary form that would embed these code points are not transporting plain-text, this is still a binary datatype specific to this application. CLDR data must remain isolated in its scope without forcing other protocols or TUS to follow its practices. Other applications may develop "gateway" interfaces to convert them to be interoperable with ICU but they are not required to do that. If they do, they will follow the ICU specifications, not TUS and this should not influence their own way to handle what TUS describe as plain-text. 
To make it clear, it is preferable to just say in TUS that the behavior of applications with non-characters is completely undefined and unpredictable without an external specification, and these entities should not even be considered as encodable in any standard UTFs (which can freely be replaced by another one without causing any loss or modification of the represented plain text). It should be possible to define other (non-standard) conforming UTFs which are completely unable to represent these non-characters (as well as any unpaired surrogate). A conforming UTF just needs to be able to represent streams of scalar values in their full standard range (even without knowing if they are assigned or not, or without knowing their character properties). You can/should even design CLDR to completely avoid the use of non-characters: it's up to it to define an encapsulation/escaping mechanism that clearly separates what is standard plain text in the content and what is not and used for a specific purpose in CLDR or ICU implementations. 2014-06-03 0:07 GMT+02:00 Shawn Steele : > Except that, particularly the max-weight ones, mean that developers can > be expected to use this as sentinels in code using ICU, which would > preclude their use for other things… > > > > Which makes them more like "reserved for use in CLDR" than "noncharacters"… > > > > -Shawn > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Markus > Scherer > *Sent:* Monday, June 2, 2014 2:53 PM > *To:* David Starner > *Cc:* Unicode Mailing List > *Subject:* Re: Corrigendum #9 > > > > On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: > > I would especially discourage any web browser from handling > > these; they're noncharacters used for unknown purposes that are > undisplayable and if used carelessly for their stated purpose, can > probably trigger serious bugs in some lamebrained utility. > > > > I don't expect "handling these" in web browsers and lamebrained utilities. 
> I expect "treat like unassigned code points". > > > > markus > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lisam at us.ibm.com Mon Jun 2 18:32:31 2014 From: lisam at us.ibm.com (Lisa Moore) Date: Mon, 2 Jun 2014 16:32:31 -0700 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: I would like to point out to Asmus that this decision was reached unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC Berkeley, and Yahoo! One might disagree with the decision, but there were no special favors involved. Lisa > > > I can't shake the suspicion that Corrigendum #9 is not actually > solving a general problem, but is a special favor to CLDR as being > run by insiders, and in the process muddying the waters for everyone else. > > A./_______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Jun 2 18:33:58 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 00:33:58 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <20140603003358.1c8f4150@JRWUBU2> On Mon, 2 Jun 2014 15:09:21 -0700 David Starner wrote: > So certain programs can't use noncharacters internally because some > people want to interchange them? That doesn't seem like what > noncharacters should be used for. 
Much as I don't like their uninvited use, it is possible to pass them and other undesirables through most applications by a slight bit of recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:
32 × 64 pairs for lone surrogates
1 × 64 pairs to replace some of the PUA characters
1 × 35 pairs to replace the rest of the PUA characters
1 × 4 pairs for incoming FFFC to FFFF
1 × 32 pairs for the other BMP non-characters
1 × 32 pairs for the supplementary plane non-characters.
This then frees up non-characters for the application's use. Richard. From prosfilaes at gmail.com Tue Jun 3 01:21:38 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 2 Jun 2014 23:21:38 -0700 Subject: Corrigendum #9 In-Reply-To: <20140603003358.1c8f4150@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> Message-ID: On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham wrote: > Much as I don't like their uninvited use, it is possible to pass them > and other undesirables through most applications by a slight bit of > recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA > characters, one can ape UTF-16 surrogates and encode: What's the point? If we can use the PUA, then we don't need the noncharacters; we can just use the PUA directly. If we have to play around with remapping them, they're pointless; they're no easier to use in that case than ESC or '\' or PUA characters. -- Kie ekzistas vivo, ekzistas espero. From mark at macchiato.com Tue Jun 3 01:55:09 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 3 Jun 2014 08:55:09 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 10:32 PM, David Starner wrote: > Why? It seems you're changing the rules > ... 
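Richard's boundary-recoding idea can be sketched in simplified form. This version uses a single hypothetical PUA escape character plus a fixed-width hex payload instead of his 99-character surrogate-pair scheme, but the principle is the same: recode the undesirables on the way in, restore them on the way out, and the noncharacters are freed up for internal use in between.

```python
ESC = "\ue000"  # hypothetical PUA escape character (an assumption of this sketch)

def _needs_escape(ch: str) -> bool:
    cp = ord(ch)
    return ch == ESC or 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def recode_in(text: str) -> str:
    """At the boundary, replace ESC and noncharacters by ESC + 6 hex digits."""
    return "".join(f"{ESC}{ord(ch):06X}" if _needs_escape(ch) else ch
                   for ch in text)

def recode_out(text: str) -> str:
    """Invert recode_in on the way back out."""
    out, i = [], 0
    while i < len(text):
        if text[i] == ESC:
            out.append(chr(int(text[i + 1:i + 7], 16)))
            i += 7
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Escaping ESC itself is what makes the scheme lossless; as David notes, the same effect can be had with ESC or '\' directly, without touching the PUA at all.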
> > This isn't "are changing", it is "has changed". The Corrigendum was issued at the start of 2013, about 16 months ago; applicable to all relevant earlier versions. It was the result of fairly extensive debate inside the UTC; there hasn't been a single issue on this thread that wasn't considered during the discussions there. And as far back as 2001, the UTC made it clear that noncharacters *are* scalar values, and are to be converted by UTF converters. Eg, see http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance, one day before 9/11). > probably trigger serious bugs in some lamebrained utility. There were already plenty of programs that passed the noncharacters through; very few would filter them (some would delete them, which is horrible for security). Thinking that a utility would never encounter them in input text was a pipe-dream. If a utility or library is so fragile that it *breaks* on input of any valid UTF sequence, then it *is* a "lamebrained" utility. A good unit test for any production chain would be to check there is no crash on any input scalar value (and for that matter, any ill-formed UTF text). -------------- next part -------------- An HTML attachment was scrubbed... 
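Mark's suggested unit test, feeding every scalar value through the production chain and requiring no crash, is cheap to write exhaustively. A Python sketch (the `process` function here is a hypothetical stand-in; substitute the pipeline actually under test):

```python
def process(text: str) -> str:
    """Hypothetical stand-in for the production chain under test."""
    return text.casefold()

def test_no_crash_on_any_scalar_value() -> None:
    """No input scalar value may cause a crash (Mark's suggested unit test)."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values
        process(chr(cp))  # must not raise

test_no_crash_on_any_scalar_value()
```

A fuller harness would also feed ill-formed UTF byte sequences to the decoding layer, per the parenthetical in Mark's message.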
URL: From duerst at it.aoyama.ac.jp Tue Jun 3 02:09:27 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Tue, 03 Jun 2014 16:09:27 +0900 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: <538D74A7.5020605@it.aoyama.ac.jp> On 2014/06/03 07:08, Asmus Freytag wrote: > On 6/2/2014 2:53 PM, Markus Scherer wrote: >> On Mon, Jun 2, 2014 at 1:32 PM, David Starner > > wrote: >> >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. >> >> >> I don't expect "handling these" in web browsers and lamebrained >> utilities. I expect "treat like unassigned code points". Expecting them to be treated like unassigned code points shows that their use is a bad idea: Since when does the Unicode Consortium use unassigned code points (and the like) in plain sight? > I can't shake the suspicion that Corrigendum #9 is not actually solving > a general problem, but is a special favor to CLDR as being run by > insiders, and in the process muddying the waters for everyone else. I have to fully agree with Asmus, Richard, Shawn and others that the use of non-characters in CLDR is a very bad and dangerous example. However convenient the misuse of some of these codepoints in CLDR may be, it sets a very bad example for everybody else. Unicode itself should not just be twice as careful with the use of its own codepoints, but 10 times as careful. I'd strongly suggest that completely independent of when and how Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked out for how to get rid of these codepoints in CLDR data. The sooner, the better. Regards, Martin. 
From richard.wordingham at ntlworld.com Tue Jun 3 02:31:46 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 08:31:46 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> Message-ID: <20140603083146.3eda0c21@JRWUBU2> On Mon, 2 Jun 2014 23:21:38 -0700 David Starner wrote: > On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham > wrote: > > Using 99 = (3 + > > 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode: > What's the point? If we can use the PUA, then we don't need the > noncharacters; we can just use the PUA directly. If we have to play > around with remapping them, they're pointless; they're no easier to > use in that case than ESC or '\' or PUA characters. A search for the 2-character string '\n' would also find a substring of the 4-character string 'a\\n'. The PUA is in general not available for general utilities to make special use of. Richard. From prosfilaes at gmail.com Tue Jun 3 02:41:18 2014 From: prosfilaes at gmail.com (David Starner) Date: Tue, 3 Jun 2014 00:41:18 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 11:55 PM, Mark Davis ☕️ wrote: > Thinking that a utility would never encounter them in input text > was a pipe-dream. Thinking that a utility would never mangle them if encountered in input text was a pipe-dream. > If a utility or library is so fragile that it breaks on > input of any valid UTF sequence, then it is a "lamebrained" utility. And? The world is filled with lamebrained utilities, and being cautious about what you take in can prevent one of those lamebrained utilities from turning into an exploit. 
> A good > unit test for any production chain would be to check there is no crash on > any input scalar value (and for that matter, any ill-formed UTF text). Right; and if you filter out stuff at the frontend, like ill-formed UTF text and noncharacters, you don't have to worry about what the middle end will do with them. I don't get what the goal of these changes were. It seems you've taken these characters away from programmers to use them in programs and given them to CLDR and anyone else willing to make their "plain text files" skirt the limits. -- Kie ekzistas vivo, ekzistas espero. From prosfilaes at gmail.com Tue Jun 3 02:42:54 2014 From: prosfilaes at gmail.com (David Starner) Date: Tue, 3 Jun 2014 00:42:54 -0700 Subject: Corrigendum #9 In-Reply-To: <20140603083146.3eda0c21@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> <20140603083146.3eda0c21@JRWUBU2> Message-ID: On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham wrote: > On Mon, 2 Jun 2014 23:21:38 -0700 > David Starner wrote: > >> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham >> wrote: >> > Using 99 = (3 + >> > 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode: > > The PUA is in general not available for > general utilities to make special use of. No, the PUA is not. Then where are you getting the 99 PUA characters you suggested using? -- Kie ekzistas vivo, ekzistas espero. From richard.wordingham at ntlworld.com Tue Jun 3 02:46:44 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 08:46:44 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <20140603084644.2f01e910@JRWUBU2> On Tue, 3 Jun 2014 08:55:09 +0200 Mark Davis ?? wrote: > On Mon, Jun 2, 2014 at 10:32 PM, David Starner > wrote: > > > Why? It seems you're changing the rules > > ?... 
> > > > > This isn't "are changing", it is "has changed". The Corrigendum was > issued at the start of 2013, about 16 months ago; applicable to all > relevant earlier versions. It was the result of fairly extensive > debate inside the UTC; there hasn't been a single issue on this > thread that wasn't considered during the discussions there. And as > far back as 2001, the UTC made it clear that noncharacters *are* > scalar values, and are to be converted by UTF converters. Eg, see > http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by > chance, one day before 9/11). But that says U+FDD0 is not to be externally interchanged! Richard. From mark at macchiato.com Tue Jun 3 02:52:45 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 3 Jun 2014 09:52:45 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Tue, Jun 3, 2014 at 9:41 AM, David Starner wrote: > Thinking that a utility would never mangle them if encountered in > input text was a pipe-dream. > I didn't say "not mangle", I said "break", as in "crash". ?I don't think this thread is going anywhere productive, so? I'm signing off from it. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Tue Jun 3 03:02:32 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 09:02:32 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> <20140603083146.3eda0c21@JRWUBU2> Message-ID: <20140603090232.0f3cf06c@JRWUBU2> On Tue, 3 Jun 2014 00:42:54 -0700 David Starner wrote: > On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham > wrote: > > On Mon, 2 Jun 2014 23:21:38 -0700 > > David Starner wrote: > > > >> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham > >> wrote: > >> > Using 99 = (3 + > >> > 32 + 64) PUA characters, one can ape UTF-16 surrogates and > >> > encode: > > > > The PUA is in general not available for > > general utilities to make special use of. > > No, the PUA is not. Then where are you getting the 99 PUA characters > you suggested using? By escaping them as well. The point of the complex scheme is to keep searching simple. Using a general escape character doesn't work so well. Richard. From prosfilaes at gmail.com Tue Jun 3 04:46:29 2014 From: prosfilaes at gmail.com (David Starner) Date: Tue, 3 Jun 2014 02:46:29 -0700 Subject: Corrigendum #9 In-Reply-To: <20140603090232.0f3cf06c@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> <20140603083146.3eda0c21@JRWUBU2> <20140603090232.0f3cf06c@JRWUBU2> Message-ID: On Tue, Jun 3, 2014 at 1:02 AM, Richard Wordingham wrote: > On Tue, 3 Jun 2014 00:42:54 -0700 > David Starner wrote: > >> No, the PUA is not. Then where are you getting the 99 PUA characters >> you suggested using? > > By escaping them as well. The point of the complex scheme is to keep > searching simple. Using a general escape character doesn't work so > well. 
The point is, instead of escaping the PUA so you can use the noncharacters, why not just escape the PUA so you can use the PUA characters? The latter is simpler and more flexible. -- Kie ekzistas vivo, ekzistas espero. From verdy_p at wanadoo.fr Tue Jun 3 09:20:35 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 16:20:35 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> Message-ID: I think his point is that an application may want to encapsulate in a valid text any arbitrary stream of code points (including noncharacters, PUAs, or isolated surrogate code units found in 16-bit or 32-bit streams that are invalid UTF-16 or UTF-32 streams, or even invalid arbitrary 8-bit bytes in streams that are not valid UTF-8). For 8-bit streams, using ESC or \ is generally a good choice of escape to derive a valid UTF-8 text stream. But for 16-bit and 32-bit streams, PUAs are more economical (but PUA code units found in the stream still need to be escaped). If you think about the Java regexp "\\uD800", it does not designate a code point but only a code unit, which is not valid plain text alone as it violates UTF-16 encoding rules. 
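The distinction Philippe draws between code points and code units shows up directly in Java, where a String may hold an unpaired surrogate code unit that no conforming UTF converter will pass through. A small sketch of my own, using only standard-library calls:

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    public static void main(String[] args) {
        // An unpaired high surrogate: a 16-bit code unit, not a scalar value.
        String lone = "\uD800";
        // The UTF-8 encoder cannot represent it; String.getBytes uses the
        // REPLACE action, substituting the charset's replacement byte '?':
        byte[] utf8 = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " " + utf8[0]); // 1 63
    }
}
```

This is why any scheme that must carry isolated code units through a valid UTF stream has to escape them rather than emit them directly.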
Trying to match it in a valid UTF-16 stream can work only if you can represent isolated code units for a specific encoding like UTF-16, even if the target stream to look for this match uses any other valid UTF (not necessarily UTF-16: decode the target text, reencode it to UTF-16 to generate a 16-bit stream in which you'll look for isolated 16-bit code units with the regexp). So yes, the regexp "\\uXXXX" (in Java source) is not used to match a single valid character. 2014-06-03 8:21 GMT+02:00 David Starner : > On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham > wrote: > > Much as I don't like their uninvited use, it is possible to pass them > > and other undesirables through most applications by a slight bit of > > recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA > > characters, one can ape UTF-16 surrogates and encode: > > What's the point? If we can use the PUA, then we don't need the > noncharacters; we can just use the PUA directly. If we have to play > around with remapping them, they're pointless; they're no easier to > use in that case than ESC or '\' or PUA characters. > > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mpapendick at vermeer.com Tue Jun 3 09:25:46 2014 From: mpapendick at vermeer.com (Papendick, Michelle) Date: Tue, 3 Jun 2014 14:25:46 +0000 Subject: Use of Unicode Symbol 26A0 Message-ID: Good Day - Just wondering if Unicode provides for, or anyone knows of, documentation for standard usage around the following symbol: [cid:image001.png at 01CF7C48.A6D54D00] Noticed that it is used in many applications as a general warning or error symbol, but upon research it is also the symbol for personal injury so appears to be a conflict of meaning. 
Any information around standard usage of the symbol in software applications is appreciated. Thank you! Michelle -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8819 bytes Desc: image001.png URL: From verdy_p at wanadoo.fr Tue Jun 3 10:56:05 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 17:56:05 +0200 Subject: Use of Unicode Symbol 26A0 In-Reply-To: References: Message-ID: Warning, danger, caution, risk, hazard... All these things are related. Personal injury is just a particular case of this broad meaning, which is to ask people to be careful before going forward, and to read the notice. The symbol is also used as a street sign, for various dangers on roads when there's no other specific sign, or for temporary signs (e.g. to signal a nearby accident). In almost all cases, it does not come alone: there's a label or sentence explaining the kind of danger or risk to which one could be exposed (risks do not necessarily concern health or death; they may be virtual). It is commonly used in software in warning prompt dialogs that signal a problem for which something should be investigated, or before continuing with an action destroying data in an unrecoverable way (or only in a way that offers no warranty of success or reliability). The name of the symbol is descriptive enough: "WARNING SIGN". Adding extra info would incorrectly limit its broad usage. 2014-06-03 16:25 GMT+02:00 Papendick, Michelle : > Good Day - > > > > Just wondering if Unicode provides for or anyone know of documentation for > standard usage around the following symbol: > > > > [image: cid:image001.png at 01CF7C48.A6D54D00] > > > > Noticed that is it used in many applications as a general warning or error > symbol, but upon research it is also the symbol for personal injury so > appears to be a conflict of meaning. 
> > > > Any information around standard usage of the symbol in software > applications is appreciated. > > > > Thank you! > Michelle > > > > > > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8819 bytes Desc: not available URL: From asmusf at ix.netcom.com Tue Jun 3 11:05:11 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 09:05:11 -0700 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: <538DF237.5060906@ix.netcom.com> On 6/2/2014 3:08 PM, Asmus Freytag wrote: > On 6/2/2014 2:53 PM, Markus Scherer wrote: >> On Mon, Jun 2, 2014 at 1:32 PM, David Starner > > wrote: >> >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. >> >> >> I don't expect "handling these" in web browsers and lamebrained >> utilities. I expect "treat like unassigned code points". >> > > I can't shake the suspicion that Corrigendum #9 is not actually > solving a general problem, but is a special favor to CLDR as being run > by insiders, and in the process muddying the waters for everyone else. Clarifying: I still haven't heard from anyone that this solves a general problem that is widespread. The only actual example has always been CLDR, and its decision to ship these code points in XML. 
Shipping these code points in files was pretty far down the list of "what not to do" when they were originally adopted. My view continues to be that this was a questionable design decision by CLDR, given what was on the record. The reaction of several outside implementers during this discussion makes clear that viewing that design as problematic is not just my personal view. Usually, if there's a discrepancy between an implementation and Unicode, the reaction is not to retract conformance language. I think arriving at this decision was easier for the UTC, because CLDR is not a random, unrelated implementation. And, as in any group, it's perhaps easier to not be as keenly aware of the impact on external implementations. So, I'd like to clarify that this is the sense in which I meant "special favor", and which therefore is not the most felicitous expression to describe what I had in mind. A./ > > A./ > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Jun 3 11:13:17 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 09:13:17 -0700 Subject: Use of Unicode Symbol 26A0 In-Reply-To: References: Message-ID: <538DF41D.7030904@ix.netcom.com> Michelle, Unicode normally does not document all known usages of symbols. Occasionally, if a symbol is used in ways that might be unexpected from its name, the standard may add an alias or annotation. This is done in particular when there is a question of whether a given symbol is the correct choice for a given application - especially if Unicode contains multiple, similar symbols. Here, that does not seem to be the case. The symbol is used for a variety of purposes, from warning to error to alerting readers to important information. 
These all seem to fit in the same general usage as suggested by the name, and the symbol is distinct enough so that there is no other symbol in Unicode that might suggest itself as an alternate. The use to warn about risk of personal injury would not seem to demand additional clarification. A./ On 6/3/2014 7:25 AM, Papendick, Michelle wrote: > > Good Day - > > Just wondering if Unicode provides for or anyone know of documentation > for standard usage around the following symbol: > > cid:image001.png at 01CF7C48.A6D54D00 > > Noticed that is it used in many applications as a general warning or > error symbol, but upon research it is also the symbol for personal > injury so appears to be a conflict of meaning. > > Any information around standard usage of the symbol in software > applications is appreciated. > > Thank you! > Michelle > > > > __ > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 8819 bytes Desc: not available URL: From asmusf at ix.netcom.com Tue Jun 3 11:15:27 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 09:15:27 -0700 Subject: Corrigendum #9 In-Reply-To: <538D74A7.5020605@it.aoyama.ac.jp> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> Message-ID: <538DF49F.10605@ix.netcom.com> Nicely put. A./ On 6/3/2014 12:09 AM, "Martin J. 
D?rst" wrote: > On 2014/06/03 07:08, Asmus Freytag wrote: >> On 6/2/2014 2:53 PM, Markus Scherer wrote: >>> On Mon, Jun 2, 2014 at 1:32 PM, David Starner >> > wrote: >>> >>> I would especially discourage any web browser from handling >>> these; they're noncharacters used for unknown purposes that are >>> undisplayable and if used carelessly for their stated purpose, can >>> probably trigger serious bugs in some lamebrained utility. >>> >>> >>> I don't expect "handling these" in web browsers and lamebrained >>> utilities. I expect "treat like unassigned code points". > > Expecting them to be treated like unassigned code points shows that > their use is a bad idea: Since when does the Unicode Consortium use > unassigned code points (and the like) in plain sight? > >> I can't shake the suspicion that Corrigendum #9 is not actually solving >> a general problem, ... > > I have to fully agree with Asmus, Richard, Shawn and others that the > use of non-characters in CLDR is a very bad and dangerous example. > > However convenient the misuse of some of these codepoints in CLDR may > be, it sets a very bad example for everybody else. Unicode itself > should not just be twice as careful with the use of its own > codepoints, but 10 times as careful. > > I'd strongly suggest that completely independent of when and how > Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets > worked out for how to get rid of these codepoints in CLDR data. The > sooner, the better. > > Regards, Martin. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From jkorpela at cs.tut.fi Tue Jun 3 12:17:01 2014 From: jkorpela at cs.tut.fi (Jukka K. 
Korpela) Date: Tue, 03 Jun 2014 20:17:01 +0300 Subject: Use of Unicode Symbol 26A0 In-Reply-To: <538DF41D.7030904@ix.netcom.com> References: <538DF41D.7030904@ix.netcom.com> Message-ID: <538E030D.2070703@cs.tut.fi> 2014-06-03 19:13, Asmus Freytag wrote: > Unicode normally does not document all known usages of symbols. Not to mention unknown usages. Characters will be used in different ways, no matter what the Unicode Standard says, and it would be mostly pointless to put restrictions on it. In some cases, however, some types of usage are warned against, or better approaches are suggested. > The symbol is used for a > variety of purposes, from warning to error to alerting readers to > important information. These all seem to fit in the same general usage > as suggested by the name, and the symbol is distinct enough so that > there is no other symbol in Unicode that might suggest itself as an > alternate. Right, but if we consider the use of WARNING SIGN as a text character, or contexts where an image resembling WARNING SIGN is used and WARNING SIGN could well be used (with the usual caveats), then it seems to generally indicate a warning message as opposed to an error message, on one hand, and a purely informative note, on the other. The use of graphic symbols similar to WARNING SIGN e.g. in traffic signs is really a different issue and external to Unicode, as it is not about characters, though it might be tangentially related. > The use to warn about risk of personal injury would not seem to demand > additional clarification. On the practical side, it might be in order to warn against usage that relies on some particular interpretation like that. What I mean is that it is OK to use WARNING SIGN as warning about risk of personal injury, but questionable to expect that people will generally take it that way (and not more loosely as warning of some kind). 
Yucca From richard.wordingham at ntlworld.com Tue Jun 3 13:52:53 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 19:52:53 +0100 Subject: UTF-16 Encoding Scheme and U+FFFE Message-ID: <20140603195253.3c0df53f@JRWUBU2> How do I read definition D98 in TUS Version 6.3.0 Chapter 3 to prohibit a file in the UTF-16 encoding scheme from starting with U+FFFE? Or is U+FFFE actually allowed to start such a file? Is an implementation that deduces the encoding scheme of a plain text file from a leading BOM to be characterised as reckless? Richard. From richard.wordingham at ntlworld.com Tue Jun 3 13:59:23 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 19:59:23 +0100 Subject: Corrigendum #9 In-Reply-To: <538D74A7.5020605@it.aoyama.ac.jp> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> Message-ID: <20140603195923.6ec9c275@JRWUBU2> On Tue, 03 Jun 2014 16:09:27 +0900 "Martin J. D?rst" wrote: > I'd strongly suggest that completely independent of when and how > Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets > worked out for how to get rid of these codepoints in CLDR data. The > sooner, the better. I suspect this has already been done. I know of no CLDR text files still containing them. Richard. From petercon at microsoft.com Tue Jun 3 16:28:05 2014 From: petercon at microsoft.com (Peter Constable) Date: Tue, 3 Jun 2014 21:28:05 +0000 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: <20140603195253.3c0df53f@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: There's never been anything preventing a file from containing and beginning with U+FFFE. It's just not a very useful thing to do, hence not very likely. 
Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: June 3, 2014 11:53 AM To: unicode at unicode.org Subject: UTF-16 Encoding Scheme and U+FFFE How do I read definition D98 in TUS Version 6.3.0 Chapter 3 to prohibit a file in the UTF-16 encoding scheme from starting with U+FFFE? Or is U+FFFE actually allowed to start such a file? Is an implementation that deduces the encoding scheme of a plain text file from a leading BOM to be characterised as reckless? Richard. _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From xueming.shen at oracle.com Tue Jun 3 17:06:30 2014 From: xueming.shen at oracle.com (Xueming Shen) Date: Tue, 03 Jun 2014 15:06:30 -0700 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140602210153.40a8bf08@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> <20140602210153.40a8bf08@JRWUBU2> Message-ID: <538E46E6.9050406@oracle.com> On 06/02/2014 01:01 PM, Richard Wordingham wrote: > On Mon, 2 Jun 2014 11:29:09 +0200 > Mark Davis wrote: > >>> \uD808\uDF45 specifies a sequence of two codepoints. >> "That is simply incorrect." > The above is in the sample notation of UTS #18 Version 17 Section 1.1. > > From what I can make out, the corresponding Java notation would be > \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in > Java, or whether they are even acceptable. The only thing UTS #18 > RL1.7 permits them to match in Java is lone surrogates, but I don't > know if Java complies. The notation for "\uD808\uDF45" is interpreted as a supplementary codepoint and is represented internally as a pair of surrogates in String. 
Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find(); -> false Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find(); -> true Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find(); -> false Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find(); -> true -Sherman > All UTS #18 says for sure about regular expressions matching code units > is that they don't satisfy RL1.1, though Section 1.7 appears to ban > them when it says, "A fundamental requirement is that Unicode text be > interpreted semantically by code point, not code units". Perhaps it's > a fundamental requirement of something other than UTS #18. I thought > matching parts of characters in terms of their canonical equivalences > was awkward enough, without having the additional option of matching > some of the code units! > From richard.wordingham at ntlworld.com Tue Jun 3 18:40:50 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 00:40:50 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <538E46E6.9050406@oracle.com> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> <20140602210153.40a8bf08@JRWUBU2> <538E46E6.9050406@oracle.com> Message-ID: <20140604004050.566e54c9@JRWUBU2> On Tue, 03 Jun 2014 15:06:30 -0700 Xueming Shen wrote: > On 06/02/2014 01:01 PM, Richard Wordingham wrote: > > On Mon, 2 Jun 2014 11:29:09 +0200 > > Mark Davis wrote: > > > >>> \uD808\uDF45 specifies a sequence of two codepoints. > >> "That is simply incorrect." > > The above is in the sample notation of UTS #18 Version 17 Section > > 1.1. > > > > From what I can make out, the corresponding Java notation would be > > \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match > > in Java, or whether they are even acceptable. 
The only thing UTS > > #18 RL1.7 permits them to match in Java is lone surrogates, but I > > don't know if Java complies. > > The notation for "\uD808\uDF45" is interpreted as a supplementary > codepoint and is represent internally as a pair of surrogates in > String. > > Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find()); > -> false > Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find()); > -> true > Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find()); > -> false > Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find()); > -> true Thank you for providing examples confirming that what in the UTS #18 *sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45} in Java notation, matches nothing in any 16-bit Unicode string. Richard. From richard.wordingham at ntlworld.com Tue Jun 3 18:50:51 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 00:50:51 +0100 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: <20140604005051.1f2aee9a@JRWUBU2> On Tue, 3 Jun 2014 21:28:05 +0000 Peter Constable wrote: > There's never been anything preventing a file from containing and > beginning with U+FFFE. It's just not a very useful thing to do, hence > not very likely. Well, while U+FFFE was apparently prohibited from public interchange, one could be very confident of not finding it in an external file. As an internally generated file, it would then be much more likely to be in the UTF-16BE or UTF-16LE encoding scheme. Richard. 
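The byte-order hazard Richard and Peter are discussing can be demonstrated with Java's charset decoders (a sketch of my own; the bytes below are a contrived example): a UTF-16BE file whose text genuinely begins with U+FFFE is indistinguishable, at the byte level, from a little-endian file that begins with a BOM.

```java
import java.nio.charset.StandardCharsets;

public class BomAmbiguity {
    public static void main(String[] args) {
        // Bytes of <U+FFFE, U+0041> in the UTF-16BE encoding scheme:
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};
        // A BOM-sniffing decoder (the generic "UTF-16" charset) reads FF FE
        // as a little-endian BOM, consumes it, and decodes the remaining
        // 00 41 little-endian as U+4100:
        String sniffed = new String(data, StandardCharsets.UTF_16);
        System.out.printf("%x%n", (int) sniffed.charAt(0)); // 4100
        // With the byte order declared out of band, the content survives:
        String be = new String(data, StandardCharsets.UTF_16BE);
        System.out.printf("%x %x%n", (int) be.charAt(0), (int) be.charAt(1)); // fffe 41
    }
}
```

This is exactly why relying solely on an initial BOM to deduce the encoding scheme is risky once U+FFFE may legitimately appear in content.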
From ken.whistler at sap.com Tue Jun 3 19:23:53 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 4 Jun 2014 00:23:53 +0000 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: <20140604005051.1f2aee9a@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: You cannot even be "very confident" of not finding actual ill-formed UTF-16, like unpaired surrogates, in an external file, let alone noncharacters. As for the noncharacters, take a look at the collation test files that we distribute with each version of UCA. The test data includes test strings like the following, to verify that UCA implementations do the correct thing when faced with unusual edge cases: FFFE 0021 FFFE 003F FFFE 0061 FFFE 0041 FFFE 0062 1FFFE 0021 1FFFE 003F 1FFFE 0334 ... As well as test strings starting with unpaired surrogates: D800 0021 D800 003F D800 0061 D800 0041 D800 0062 And while it is true that the *file* CollationTest_SHIFTED.txt doesn't start with either a noncharacter or an unpaired surrogate -- because all of the test data in it is represented in ASCII hex strings instead of directly in UTF-16 -- the issue in any case isn't whether a *file* starts with a noncharacter, but whether a UTF-16 *string* starts with a noncharacter. Any one of those test strings could be trivially turned into a text file by piping out that one UTF-16 string to a file. And I could then write conformant test software that would read UTF-16 string input data from that file and run it through the UCA algorithm to construct sortkeys for it. As Peter said, the main thing that prevents running into these is that it isn't very *useful* to start off files (or strings) with U+FFFE. (And, additionally, in the case of UTF-16 text data files, it would be confusing and possibly lead to misinterpretation of byte order, if you were somehow depending solely on initial BOMs -- which I wouldn't advise, anyway.) 
Basically, the rules of standards (e.g., you shouldn't try to publicly interchange noncharacters) are not like laws of physics. Just because the standard says you shouldn't do it doesn't mean it doesn't happen. --Ken > On Tue, 3 Jun 2014 21:28:05 +0000 > Peter Constable wrote: > > > There's never been anything preventing a file from containing and > > beginning with U+FFFE. It's just not a very useful thing to do, hence > > not very likely. > > Well, while U+FFFE was apparently prohibited from public interchange, > one could be very confident of not finding it in an external file. As > an internally generated file, it would then be much more likely to be > in the UTF-16BE or UTF-16LE encoding scheme. > > Richard. From asmusf at ix.netcom.com Wed Jun 4 01:32:03 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 23:32:03 -0700 Subject: Use of Unicode Symbol 26A0 In-Reply-To: <538E030D.2070703@cs.tut.fi> References: <538DF41D.7030904@ix.netcom.com> <538E030D.2070703@cs.tut.fi> Message-ID: <538EBD63.8080004@ix.netcom.com> On 6/3/2014 10:17 AM, Jukka K. Korpela wrote: > On the practical side, it might be in order to warn against usage that > relies on some particular interpretation like that. What I mean is > that it is OK to use WARNING SIGN as warning about risk of personal > injury, but questionable to expect that people will generally take it > that way (and not more loosely as warning of some kind). > > Yucca It might be useful to note in the description of symbols that their names are commonly not limited to the semantics (instead, names are frequently based on appearance). The clarification could include statements to the effect that: In the case the name is based on semantics, the name chosen may reflect only one of many uses of the symbol, and, further, the symbol may not always be considered the "best" representative of that semantic by all users. 
Exceptions occur for example for mathematical symbols, many of which have conventional names outside Unicode, some of which (like integral sign) do directly name the standard use of that symbol. I'm not sure, but I imagine that a careful reading would show this is covered already (either in the chapters or in the FAQ). Should comparable language really be absent, that would be good to know. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 4 01:54:23 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 4 Jun 2014 08:54:23 +0200 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: <20140604005051.1f2aee9a@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: U+FFFE is prohibited in interchange because, if the interchange specifies a UTF-16 encoding (not UTF-16BE or UTF-16LE), it would be interpreted as a BOM where it occurs at the start of a stream (with the consequence of reparsing the stream as U+FEFF with bytes swapped); in all other positions it cannot be a BOM. BOMs are *normally* only authorized in interchange at the "start" of streams. But this is a problem for "live" streams that have no defined "start" and can be synced at random positions (such as at the next newline, or at the start of a network datagram; note that some network layers may fragment datagrams, so that BOMs could be repeated, and also reunite them, leaving multiple BOMs in the same datagram). So we can assume that U+FFFE anywhere in a UTF-16 "live" stream (not a UTF-16BE or UTF-16LE stream) is each time a byte-swapped BOM, and not a legacy ZWNBSP or a noncharacter. Streams that are known to be UTF-16BE or UTF-16LE are also not recommended for interchange if these files or live streams may be transmitted without metadata specifying the encoding explicitly (as many remote readers will interpret them instead as UTF-16, possibly with multiple BOMs in resynchronizable live streams). 
The problem of live streams is also a good reason why ZWNBSP (U+FEFF) has been strongly discouraged in interchange in favor of the word joiner. This also applies to all other conforming UTFs (including UTF-8, UTF-16BE, UTF-16LE, UTF-32, UTF-32LE, UTF-32BE), where it is strongly recommended not to use U+FEFF and U+FFFE except as BOMs (possibly repeated on live streams). You should note that conforming processes working on interchange (or storage) should always be allowed to switch from one standard UTF to another, and the same encoded streams may be consumed by various clients having different native byte orders. It has now become difficult to define what a "local" system is, when applications are converted to work in a cloud with more and more heterogeneous clients and more intermediate third parties (providing things like caching, archiving, proxying, backup of data and restoration on another system...). For long-term reusability of data, we are strongly encouraged not to use U+FFFE and U+FEFF except as BOMs, and we should be tolerant about the number of BOMs found (and in my opinion, UCA implementations should discard them on input, treating them as fully ignorable, except for delimiting combining character sequences for the purpose of normalisation, which conforming applications or intermediate filters should be allowed to perform as they want). And we should absolutely forget the legacy semantics of ZWNBSP. But this complexity and tolerance for one or more BOMs also means that all UTFs not based on 8-bit code units should also be discouraged in interchange. This means that UTF-16 and UTF-32 should be discouraged, leaving only UTF-16BE or UTF-16LE or UTF-32BE, not for storage or networking, but for temporary streams in memory used inside the "black box" internally implementing each conforming process. 
For all the rest, most applications now use UTF-8, possibly packaged within a generic compressed stream (binary compression of live streams remains possible, even if you cannot predict where in the text encoding the resynchronization points will occur: it's up to the protocol using this transport compression to properly define the resynchronization points). In UTF-8 streams we can completely omit U+FFFE and U+FEFF, whether as BOMs, ZWNBSP or non-characters (and we can also expect that many applications will just discard them silently, as they only have a "no-op" role as BOMs in 8-bit streams). If an application outputs an 8-bit stream that is not UTF-8, it will drop all U+FEFF and U+FFFE found in its input, and will often output U+FEFF in the non-UTF-8 encoding it generates, frequently as a "magic" signature of this encoding. Secure digital signatures of text streams should also ignore these code units silently, as these code units won't be relevant elsewhere in the chain of producers or consumers of this data (these secure digital signatures should be computed by dropping these discardable U+FEFF and U+FFFE, normalizing the data, for example to NFC or NFD, and producing a specific UTF; the easiest one, to avoid complications, being UTF-32BE or UTF-32LE with a predetermined byte order, as specified by the digital signature algorithm). Additionally, it will be very easy to use as many U+FEFF code units as needed as ignorable extra BOMs, for cases where a protocol needs a safe "padding filler" if it wants to use fixed-size block I/O with random access and easy resynchronization (in live streams), when the producer safely breaks data blocks at boundaries of combining sequences (allowing these blocks to be normalized separately and reunited later without creating problems). 
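A decoder along the lines described here (honor a leading BOM, then discard any repeated U+FEFF that a resynchronizable live stream may carry) can be sketched in a few lines. This is only an illustrative sketch in Python, not part of any standard API, and the fallback byte order for unmarked chunks is an assumption for the sketch:

```python
import codecs

def decode_utf16_chunk(chunk: bytes) -> str:
    """Decode one chunk of a UTF-16 stream, using a leading BOM if
    present and discarding any further U+FEFF code points (repeated
    BOMs on a resynchronized live stream)."""
    if chunk.startswith(codecs.BOM_UTF16_LE):
        text = chunk.decode("utf-16-le")  # BOM itself decodes to U+FEFF
    elif chunk.startswith(codecs.BOM_UTF16_BE):
        text = chunk.decode("utf-16-be")
    else:
        # No BOM: byte order must come from out-of-band metadata;
        # big-endian is assumed here purely for the sketch.
        text = chunk.decode("utf-16-be")
    # Treat every decoded U+FEFF as a discardable (possibly repeated) BOM.
    return text.replace("\ufeff", "")
```

Because the BOM itself selects the byte order, a chunk whose first code unit would read as U+FFFE in the wrong byte order is still decoded correctly.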
2014-06-04 1:50 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Tue, 3 Jun 2014 21:28:05 +0000 > Peter Constable wrote: > > > There's never been anything preventing a file from containing and > > beginning with U+FFFE. It's just not a very useful thing to do, hence > > not very likely. > > Well, while U+FFFE was apparently prohibited from public interchange, > one could be very confident of not finding it in an external file. As > an internally generated file, it would then be much more likely to be > in the UTF-16BE or UTF-16LE encoding scheme. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 4 03:10:52 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 4 Jun 2014 10:10:52 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140604004050.566e54c9@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> <20140602210153.40a8bf08@JRWUBU2> <538E46E6.9050406@oracle.com> <20140604004050.566e54c9@JRWUBU2> Message-ID: It does match in a 16-bit "Unicode" string, but this is not a "UTF-16" string: there's no such thing as a "16-bit string" in Unicode if you do not specify the exact UTF encoding form defined in the standard. - the Java regex "\\x{0020}" (here in Java-source literal String format, which requires escaping the backslash for that regex literal) is not contextual: it matches exactly one 16-bit char '\u0020' independently of its context. - the Java regex "\\x{DC00}" (here in Java-source literal String format) is contextual: it really matches one 16-bit char '\uDC00' either at the *start* of the String or NOT immediately preceded by a 16-bit char between '\uD800' and '\uDBFF'. 
- the Java regex "\\uDC00" (here in Java-source literal String format) is NOT contextual: it really matches one 16-bit char '\uDC00' in all contexts, so it is the same as the Java regex "\uDC00" (because this single surrogate char has no "special" meaning in regexes and is interpreted literally by the regex engine). - the Java regex "\\x{D808}" (here in Java-source literal String format) is contextual: it really matches one 16-bit char '\uD808' either at the *end* of the String or NOT immediately followed by a 16-bit char between '\uDC00' and '\uDFFF'. - the Java regex "\\uD808" (here in Java-source literal String format) is NOT contextual: it really matches one 16-bit char '\uD808' in all contexts, so it is the same as the Java regex "\uD808" (because this single surrogate char has no "special" meaning in regexes and is interpreted literally by the regex engine). In summary, the regex engine in Java does not really work with code points; it works directly at the code unit level. The \x notation is a convenient shortcut to specify contexts for literal code units, or to escape the special meaning of some regex operators. 
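For contrast with Java's code-unit-level engine, an engine that works on code points behaves differently. The same experiment can be run in Python, used here purely for illustration because Python's str is a sequence of code points rather than UTF-16 code units:

```python
import re

# U+12345 as one code point, and as the two lone surrogates that would
# form its UTF-16 encoding. In a code-point-based string model these
# are different strings, with different lengths.
astral = "\U00012345"
pair = "\ud808\udf45"

assert len(astral) == 1
assert len(pair) == 2

# A pattern naming the two surrogate code points never matches the
# single astral code point, mirroring Java's \x{D808}\x{DF45} result.
assert re.search(re.escape(pair), astral) is None

# A pattern naming the astral code point itself does match.
assert re.search(re.escape(astral), astral) is not None
```

The point is the same one made in this thread: whether "\uD808\uDF45" denotes one code point or two depends entirely on whether the string model is code units or code points.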
Another example: the Java regex "A*" is exactly identical to "\u0041\u002A"; in both cases this means 0 or more of the Latin capital letter A (the \u notation in Java source code does not escape the special meaning for regexes at runtime; it is a convenience only for the source code, for example to escape a literal double quote in a literal String). Note that Java source code files may be encoded in any text encoding supported by the internationalisation library accessible to the Java compiler; for example the source code could use only US-ASCII or Windows-1252, and there's no other way than the \u notation to compile a 16-bit char code unit into a String literal if the needed character is absent from the Java source code encoding. Java source code may also be encoded in UTF-8, in which case most uses of \u are not needed. In Java you can as well use the \u notation in identifiers, or in operators of the language! The \u notation in Java source code is in fact interpreted AFTER the text has been generated by the source code reader according to its specified source encoding. Then the decoded source string (internally represented in a Java 16-bit char[] array) is processed by the input stage of the lexer, which converts these \u notations prior to recognizing the lexical items. There are quite similar input stages in ANSI C/C++ compilers. For example, ANSI C has long supported the "??" 
trigraph prefix for noting some standard operators or delimiters of the language when the characters needed by its syntax are not supported in the source code encoding. This input stage also occurs prior to recognizing lexical entities of the language, and it was used when the input encoding did not support the full US-ASCII character set but only the invariant subset of ISO 646, such as old national 7-bit variants, or even older 5-bit or 6-bit encodings like Baudot. Very few C programmers know of the existence of this notation in ANSI C, because today they only write code in files stored in an encoding supporting at least the full US-ASCII repertoire (including the many 8-bit EBCDIC variants remaining on mainframes), except when working on source code via old "exotic" 7-bit terminals, or when their national keyboard doesn't define a way to enter the full US-ASCII graphic set, such as braces or backslashes... 2014-06-04 1:40 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Tue, 03 Jun 2014 15:06:30 -0700 > Xueming Shen wrote: > > > On 06/02/2014 01:01 PM, Richard Wordingham wrote: > > > On Mon, 2 Jun 2014 11:29:09 +0200 > > > Mark Davis ☕️ wrote: > > > > > >>> \uD808\uDF45 specifies a sequence of two codepoints. > > >> "That is simply incorrect." > > > The above is in the sample notation of UTS #18 Version 17 Section > > > 1.1. > > > > > > From what I can make out, the corresponding Java notation would be > > > \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match > > > in Java, or whether they are even acceptable. The only thing UTS > > > #18 RL1.7 permits them to match in Java is lone surrogates, but I > > > don't know if Java complies. > > > > The notation for "\uD808\uDF45" is interpreted as a supplementary > > codepoint and is represented internally as a pair of surrogates in > > String. 
> > > > Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find() > > -> false > > Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find() > > -> true > > Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find() > > -> false > > Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find() > > -> true > > Thank you for providing examples confirming that what in the UTS #18 > *sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45} > in Java notation, matches nothing in any 16-bit Unicode string. > > Richard. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From A.Schappo at lboro.ac.uk Wed Jun 4 04:28:49 2014 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 4 Jun 2014 09:28:49 +0000 Subject: Swift In-Reply-To: <20140603195253.3c0df53f@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Swift is Apple's new programming language. In Swift, variable and constant names can be constructed from Unicode characters. Here are a couple of examples from Apple's doc http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html let π = 3.14159 let 你好 = "你好世界" I think this is a huge step forward for i18n and Unicode. There are some restrictions on which Unicode chars can be used. From Apple's doc "Constant and variable names cannot contain mathematical symbols, arrows, private-use (or invalid) Unicode code points, or line- and box-drawing characters. Nor can they begin with a number, although numbers may be included elsewhere within the name." The restrictions seem a little like IDNA2008. 
Anyone have links to info giving a detailed explanation/tabulation of allowed and non-allowed Unicode chars for Swift Variable and Constant names? André Schappo From mark at macchiato.com Wed Jun 4 04:41:17 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 4 Jun 2014 11:41:17 +0200 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: Apparently you can use emoji in the identifiers. ?? ( http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/ ) Mark *« Il meglio è l'inimico del bene »* On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo wrote: > Swift is Apple's new programming language. In Swift, variable and constant > names can be constructed from Unicode characters. Here are a couple of > examples from Apple's doc > http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html > > let π = 3.14159 > let 你好 = "你好世界" > > I think this is a huge step forward for i18n and Unicode. > > There are some restrictions on which Unicode chars can be used. From > Apple's doc > > "Constant and variable names cannot contain mathematical symbols, arrows, > private-use (or invalid) Unicode code points, or line- and box-drawing > characters. Nor can they begin with a number, although numbers may be > included elsewhere within the name." > > The restrictions seem a little like IDNA2008. Anyone have links to info > giving a detailed explanation/tabulation of allowed and non-allowed Unicode > chars for Swift Variable and Constant names? > > André Schappo > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Wed Jun 4 05:01:57 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 04 Jun 2014 19:01:57 +0900 Subject: Corrigendum #9 In-Reply-To: <20140603195923.6ec9c275@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> <20140603195923.6ec9c275@JRWUBU2> Message-ID: <538EEE95.70100@it.aoyama.ac.jp> On 2014/06/04 03:59, Richard Wordingham wrote: > On Tue, 03 Jun 2014 16:09:27 +0900 > "Martin J. Dürst" wrote: > >> I'd strongly suggest that completely independent of when and how >> Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets >> worked out for how to get rid of these codepoints in CLDR data. The >> sooner, the better. > > I suspect this has already been done. I know of no CLDR text files > still containing them. Really great if that's true! Regards, Martin. From mark at macchiato.com Wed Jun 4 06:17:15 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 4 Jun 2014 13:17:15 +0200 Subject: Corrigendum #9 In-Reply-To: <538EEE95.70100@it.aoyama.ac.jp> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> <20140603195923.6ec9c275@JRWUBU2> <538EEE95.70100@it.aoyama.ac.jp> Message-ID: The characters are present, but are escaped in the source for readability. Here is a sample from collation/zh.xml: ... ... *« Il meglio è l'inimico del bene »* On Wed, Jun 4, 2014 at 12:01 PM, "Martin J. Dürst" wrote: > On 2014/06/04 03:59, Richard Wordingham wrote: > >> On Tue, 03 Jun 2014 16:09:27 +0900 >> "Martin J. Dürst" wrote: >> >> I'd strongly suggest that completely independent of when and how >>> Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets >>> worked out for how to get rid of these codepoints in CLDR data. 
The >>> sooner, the better. >>> >> >> I suspect this has already been done. I know of no CLDR text files >> still containing them. >> > > Really great if that's true! Regards, Martin. > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Jun 4 06:45:18 2014 From: prosfilaes at gmail.com (David Starner) Date: Wed, 4 Jun 2014 04:45:18 -0700 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: On Wed, Jun 4, 2014 at 2:28 AM, Andre Schappo wrote: > I think this is a huge step forward for i18n and Unicode. Could you not do that in Objective-C? If no, then it's a step forward for Apple, but the rest of us--Ada, C, C++, C#, Java, Python--have had this feature for years. 20 years in 2015 in the case of Ada. -- Kie ekzistas vivo, ekzistas espero. From leoboiko at namakajiri.net Wed Jun 4 06:58:22 2014 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Wed, 4 Jun 2014 08:58:22 -0300 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: Even Ruby could do it for years, despite having notoriously bad Unicode string support back then: irb> ??? = '????' => "????" irb> íslenska = 'fjólublár' => "fjólublár" irb> ??? + ' ' + íslenska => "???? fjólublár" I don't think this feature saw much use, since programmers in a global world can't assume that everyone will have easy access to their input methods, and so tend to restrict code tokens to the ASCII set to encourage participation. 2014-06-04 8:45 GMT-03:00 David Starner : > On Wed, Jun 4, 2014 at 2:28 AM, Andre Schappo > wrote: > > I think this is a huge step forward for i18n and Unicode. 
> > Could you not do that in Objective-C? If no, then it's a step forward > for Apple, but the rest of us--Ada, C, C++, C#, Java, Python--have had > this feature for years. 20 years in 2015 in the case of Ada. > > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Wed Jun 4 07:32:43 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Wed, 4 Jun 2014 14:32:43 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> On 4 Jun 2014, at 13:58, Leonardo Boiko wrote: > I don't think this feature saw much use, since programmers in a global world can't assume that everyone will have easy access to their input methods, and so tend to restrict code tokens to the ASCII set to encourage participation. Indeed, the lack of good input methods limits the usability of the math characters, which otherwise may be very useful in programming languages. One way is to add shortcut translations, like typing "real" translates into ℝ (U+211D), but they must be added by hand. From jkorpela at cs.tut.fi Wed Jun 4 08:00:20 2014 From: jkorpela at cs.tut.fi (Jukka K. 
Korpela) Date: Wed, 04 Jun 2014 16:00:20 +0300 Subject: Math input methods In-Reply-To: <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> Message-ID: <538F1864.9070603@cs.tut.fi> 2014-06-04 15:32, Hans Aberg wrote under Subject: Re: Swift: > On 4 Jun 2014, at 13:58, Leonardo Boiko > wrote: > >> I don't think this feature saw much use, since programmers in a >> global world can't assume that everyone will have easy access to >> their input methods, and so tend to restrict code tokens to the >> ASCII set to encourage participation. > > Indeed, the lack of good input methods limits the usability of the > math characters, which otherwise may be very useful in programming > languages. One way is to add shortcut translations, like typing > "real" translates into ℝ (U+211D), but they must be added by hand. If you are interested in math input methods, take a look at my design of math keyboard layout for use on normal US keyboard: http://www.cs.tut.fi/~jkorpela/math/kbd.html Input issues can be handled at many levels, including program-specific translations, but doing them at keyboard level has obvious advantages (and some problems). As an aside, the ISO 80000-2 standard on mathematical notations describes boldface letters such as boldface R as symbols for commonly known sets of numbers. The double-struck letters like ℝ are mentioned as an alternative way, whereas in the previous standard, these notations were presented the other way around. The change is logical in the sense that bold face is the more original notation, and double-struck letters as characters imitate the imitation of boldface letters when writing by hand (with a pen or piece of chalk). 
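The shortcut-translation idea raised in this thread (typing a short name that expands to a math character) can be sketched in a few lines. The table below is a made-up example for illustration, not any existing input method:

```python
# Hypothetical shortcut table: each name expands to one math character.
SHORTCUTS = {
    "\\real": "\u211D",      # ℝ DOUBLE-STRUCK CAPITAL R
    "\\integral": "\u222B",  # ∫ INTEGRAL
    "\\in": "\u2208",        # ∈ ELEMENT OF
}

def expand_shortcuts(text: str) -> str:
    """Replace each shortcut name by its character."""
    # Longest names first, so "\integral" is not eaten by "\in".
    for name in sorted(SHORTCUTS, key=len, reverse=True):
        text = text.replace(name, SHORTCUTS[name])
    return text
```

As the thread notes, the real cost is not the code but maintaining the table by hand, which is why keyboard-level solutions are attractive.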
Yucca From ian.clifton at chem.ox.ac.uk Wed Jun 4 09:42:26 2014 From: ian.clifton at chem.ox.ac.uk (Ian Clifton) Date: Wed, 04 Jun 2014 15:42:26 +0100 Subject: Math input methods In-Reply-To: <538F1864.9070603@cs.tut.fi> (Jukka K. Korpela's message of "Wed, 4 Jun 2014 16:00:20 +0300") References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: <4qioogka7h.fsf@chem-arachne.chem.ox.ac.uk> "Jukka K. Korpela" writes: > As an aside, the ISO 80000-2 standard on mathematical notations > describes boldface letters such as boldface R as symbols for commonly > known sets of numbers. The double-struck letters like ℝ are mentioned > as an alternative way, whereas in the previous standard, these > notations were presented the other way around. The change is logical > in the sense that bold face is a more original notation and > double-struck letters as characters imitate the imitation of boldface > letters when writing by hand (with a pen or piece of chalk). I'm not sure this is going to catch on with mathematicians, not least because bold letters are already heavily used, for vectors and matrices for instance. My guess is mathematicians are going to stick to their double-struck letters for these sets for as long as the year ? ?. -- Ian ? From Shawn.Steele at microsoft.com Wed Jun 4 10:53:59 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 4 Jun 2014 15:53:59 +0000 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> I'm sort of confused why Unicode would be a big deal. C# & other languages have allowed Unicode letters in identifiers for years, so readable strings should be possible in almost any language. It's a bit cute to include emoji, but I'm not sure how practical it is. 
It also makes me wonder how they came up with the list, I presume control codes aren't allowed? Or alternate whitespace? I assume they use some Unicode Categories to figure out the permitted set? I rarely see non-Latin code in practice though, but of course I'm a native English speaker. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark Davis ☕️ Sent: Wednesday, June 4, 2014 2:41 AM To: Andre Schappo Cc: unicode at unicode.org Subject: Re: Swift Apparently you can use emoji in the identifiers. ?? (http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/) Mark « Il meglio è l'inimico del bene » On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo > wrote: Swift is Apple's new programming language. In Swift, variable and constant names can be constructed from Unicode characters. Here are a couple of examples from Apple's doc http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html let π = 3.14159 let 你好 = "你好世界" I think this is a huge step forward for i18n and Unicode. There are some restrictions on which Unicode chars can be used. From Apple's doc "Constant and variable names cannot contain mathematical symbols, arrows, private-use (or invalid) Unicode code points, or line- and box-drawing characters. Nor can they begin with a number, although numbers may be included elsewhere within the name." The restrictions seem a little like IDNA2008. Anyone have links to info giving a detailed explanation/tabulation of allowed and non-allowed Unicode chars for Swift Variable and Constant names? André Schappo _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... 
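The category-based approach Shawn asks about is visible in languages that follow UAX #31's default identifier rules, which are built on the XID_Start and XID_Continue properties. Python's built-in str.isidentifier() is one easy way to probe such a rule set; Swift's exact set differs (notably by admitting emoji), so this only illustrates the general mechanism:

```python
# str.isidentifier() applies Python's identifier rules, derived from
# Unicode's XID_Start/XID_Continue properties (UAX #31).
examples = {
    "π": True,        # Greek letter, category Ll, valid XID_Start
    "名前": True,      # CJK letters (Lo) are fine
    "x2": True,       # digits allowed after the first character
    "2x": False,      # cannot begin with a number
    "a-b": False,     # hyphen is not an identifier character
    " a": False,      # whitespace and controls are excluded
}
for name, expected in examples.items():
    assert name.isidentifier() == expected, name
```

Note that "🐶".isidentifier() is False here, since emoji are category So and outside XID_Start, whereas Swift admits them; that divergence between per-language identifier sets is exactly what the thread is pointing at.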
URL: From petercon at microsoft.com Wed Jun 4 10:54:37 2014 From: petercon at microsoft.com (Peter Constable) Date: Wed, 4 Jun 2014 15:54:37 +0000 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: <16334a09559d4b9080135195e7fad164@BL2PR03MB450.namprd03.prod.outlook.com> How did the word "prohibited" enter this conversation? Peter From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy Sent: June 3, 2014 11:54 PM To: Richard Wordingham Cc: unicode at unicode.org Subject: Re: UTF-16 Encoding Scheme and U+FFFE _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Wed Jun 4 11:24:14 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Wed, 04 Jun 2014 19:24:14 +0300 Subject: Math input methods In-Reply-To: <4qioogka7h.fsf@chem-arachne.chem.ox.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> <4qioogka7h.fsf@chem-arachne.chem.ox.ac.uk> Message-ID: <538F482E.2030503@cs.tut.fi> 2014-06-04 17:42, Ian Clifton wrote: > "Jukka K. Korpela" writes: > >> As an aside, the ISO 80000-2 standard on mathematical notations >> describes boldface letters such as boldface R as symbols for commonly >> known sets of numbers. The double-struck letters like ℝ are mentioned >> as an alternative way, whereas in the previous standard, these >> notations were presented the other way around. The change is logical >> in the sense that bold face is a more original notation and >> double-struck letters as characters imitate the imitation of boldface >> letters when writing by hand (with a pen or piece of chalk). > > I'm not sure this is going to catch on with mathematicians, not least > because bold letters are already heavily used, for vectors and matrices > for instance. Vectors and matrices are denoted by italic boldface letters, so there is no confusion even in principle. 
> My guess is mathematicians are going to stick to their > double-struck letters for these sets for as long as the year ? ?. Mathematicians tend to be conservative in notations. They even use italic for the constants i, e, and π, rather illogically and against standards as well as common practices in natural sciences. But still they have changed their notations somewhat. They do not use the notations of Euclid and Archimedes any more. So maybe this will change, too. The interesting thing from the character code point of view is that we're now more or less expected to use rich text, at least bolding, rather than just the special characters. In most writing situations, it is easier to bold a letter than to enter the character ℝ, except when typing plain text, of course. This is one reason why boldface might become more common. On the other hand, when mathematicians write in AMS-TeX, both notations are equally easy to produce (once you know how to do that). In theory, we could use boldface in plain text, too, when writing mathematical notations, e.g. U+1D411 MATHEMATICAL BOLD CAPITAL R. That's just not very practical, partly because they are outside the BMP and may make programs choke, partly because font support is rather limited. My math layout has combinations for typing the double-struck letters but not for the math bold letters. It would of course be possible to create a layout specifically for bold or italic or bold italic etc. math symbols in Unicode, but their use seems to be too limited now. 
Yucca From A.Schappo at lboro.ac.uk Wed Jun 4 12:15:33 2014 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 4 Jun 2014 17:15:33 +0000 Subject: Swift In-Reply-To: <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <5BED9935-6BCE-4A8F-8BA4-4E23B45BA54B@lboro.ac.uk> Well because outside of groups like this there is still little awareness of Unicode, little understanding of Unicode, little willingness to use Unicode and little conscious usage of Unicode André On 4 Jun 2014, at 16:53, Shawn Steele wrote: I'm sort of confused why Unicode would be a big deal. C# & other languages have allowed Unicode letters in identifiers for years, so readable strings should be possible in almost any language. It's a bit cute to include emoji, but I'm not sure how practical it is. It also makes me wonder how they came up with the list, I presume control codes aren't allowed? Or alternate whitespace? I assume they use some Unicode Categories to figure out the permitted set? I rarely see non-Latin code in practice though, but of course I'm a native English speaker. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark Davis ?? Sent: Wednesday, June 4, 2014 2:41 AM To: Andre Schappo Cc: unicode at unicode.org Subject: Re: Swift Apparently you can use emoji in the identifiers. ?? (http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/) Mark ? Il meglio è l'inimico del bene ? On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo > wrote: Swift is Apple's new programming language. In Swift, variable and constant names can be constructed from Unicode characters. 
Here are a couple of examples from Apple's doc http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html let π = 3.14159 let 你好 = "你好世界" I think this is a huge step forward for i18n and Unicode. There are some restrictions on which Unicode chars can be used. From Apple's doc "Constant and variable names cannot contain mathematical symbols, arrows, private-use (or invalid) Unicode code points, or line- and box-drawing characters. Nor can they begin with a number, although numbers may be included elsewhere within the name." The restrictions seem a little like IDNA2008. Anyone have links to info giving a detailed explanation/tabulation of allowed and disallowed Unicode chars for Swift Variable and Constant names? André Schappo _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Wed Jun 4 12:36:43 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Wed, 04 Jun 2014 20:36:43 +0300 Subject: Swift In-Reply-To: <5BED9935-6BCE-4A8F-8BA4-4E23B45BA54B@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> <5BED9935-6BCE-4A8F-8BA4-4E23B45BA54B@lboro.ac.uk> Message-ID: <538F592B.204@cs.tut.fi> 2014-06-04 20:15, Andre Schappo wrote: > Well because outside of groups like this there is still little awareness > of Unicode, little understanding of Unicode, little willingness to use > Unicode and little conscious usage of Unicode That's very true. In the specific case of "using Unicode" (which so often means just "using characters outside the Ascii repertoire") in programming language identifiers, there are other contributing reasons, too. 
As alluded to here: > On 4 Jun 2014, at 16:53, Shawn Steele wrote: [...] >> I rarely see non-Latin code in practice though, but of course I'm a >> native English speaker. The point is that English is largely the de facto standard human language in programming: in documentation, comments, and hence also in forming identifiers, even though the data processed might be in different languages. There are good practical reasons for using English: programmers can be expected to understand it, and it is generally the only language you can expect them to understand. People also learn by example, and they often learn to stick to Ascii without even thinking why. Where I live, they learn to replace "ä" and "å" by "a" and "ö" by "o" rather automatically when they use words of national languages as identifiers. If you ask them, they probably say that the Scandinavian letters cannot be used reliably, which is often so true, even though it might not apply to the use in some programming languages. Personally, I often favor identifiers in the national language for clarity: this distinguishes user-defined identifiers from reserved words and from identifiers defined in libraries. But this is useful mostly in tutorial material, not that much in routine programming. Yucca From doug at ewellic.org Wed Jun 4 13:00:50 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 04 Jun 2014 11:00:50 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140604110050.665a7a7059d7ee80bb4d670165c8327d.58a433e6ae.wbe@email03.secureserver.net> How common is it to see any of the following in real-world Unicode text, as opposed to code charts and test suites and the like? 1. Unpaired surrogates 2. Noncharacters (besides CLDR data) 3. U+FEFF at the beginning of a stream (note: not "packet" or arbitrary cutoff point) I'm not asking whether any of these are recommended or "prohibited" or whether they are a good idea. I'm asking about actual usage. 
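Doug's three corner cases are mechanical to test for. A minimal sketch in Python (the function name and the reporting format are mine, not from the thread); running something like this over real-world corpora is one way to answer the "actual usage" question:

```python
def find_corner_cases(text):
    """Scan a string for the three corner cases listed above:
    unpaired surrogates, noncharacters, and a leading U+FEFF."""
    issues = []
    if text.startswith("\ufeff"):
        issues.append("U+FEFF at start of stream")
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # In a Python str these only show up via lossy decoding
            # (e.g. errors="surrogateescape"), i.e. unpaired surrogates.
            issues.append("surrogate U+%04X at index %d" % (cp, i))
        elif 0xFDD0 <= cp <= 0xFDEF or cp & 0xFFFE == 0xFFFE:
            # U+FDD0..U+FDEF plus the last two code points of each plane.
            issues.append("noncharacter U+%04X at index %d" % (cp, i))
    return issues

print(find_corner_cases("\ufeffabc\ufdd0"))
```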
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From Shawn.Steele at microsoft.com Wed Jun 4 13:10:05 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 4 Jun 2014 18:10:05 +0000 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604110050.665a7a7059d7ee80bb4d670165c8327d.58a433e6ae.wbe@email03.secureserver.net> References: <20140604110050.665a7a7059d7ee80bb4d670165c8327d.58a433e6ae.wbe@email03.secureserver.net> Message-ID: <84b38c9dcd304f87b85f034c1706f3b8@BY2PR03MB491.namprd03.prod.outlook.com> The BOM I've seen (not FFFE though); its prevalence depends on the system and other factors. The others I only see if there's corruption, bugs, or tests. The most common error I see that causes those is when some developer calls a binary blob a Unicode string and tries to shove it through a text transport or something. Usually that bites them sooner or later. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Wednesday, June 4, 2014 11:01 AM To: unicode at unicode.org Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) How common is it to see any of the following in real-world Unicode text, as opposed to code charts and test suites and the like? 1. Unpaired surrogates 2. Noncharacters (besides CLDR data) 3. U+FEFF at the beginning of a stream (note: not "packet" or arbitrary cutoff point) I'm not asking whether any of these are recommended or "prohibited" or whether they are a good idea. I'm asking about actual usage. 
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From doug at ewellic.org Wed Jun 4 13:26:01 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 04 Jun 2014 11:26:01 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> Sorry, I left out an important detail. I wrote: > 3. U+FEFF at the beginning of a stream (note: not "packet" or > arbitrary cutoff point) I meant U+FEFF as a zero-width no-break space. Obviously it is very common to see U+FEFF as a signature or BOM. My underlying question here is, how common is it that the producer of a stream actually intends this character *at the start of a stream* to be a ZWNBSP, not to be stripped lest the actual text content be altered? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From wjgo_10009 at btinternet.com Wed Jun 4 12:57:09 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 4 Jun 2014 18:57:09 +0100 (BST) Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: <1401904629.853.YahooMailNeo@web87805.mail.ir2.yahoo.com> An interesting use of the U+FEFF character as a BYTE ORDER MARK is in the file format described as Unicode Text Document, which one may choose when using Save As... in the Microsoft WordPad program. I have used that file format in my research. http://forum.high-logic.com/viewtopic.php?p=21048#p21048 It is interesting to produce such a file and then examine the contents at a byte-by-byte level, understanding the use of the BYTE ORDER MARK. 
William Overington 4 June 2014 From asmusf at ix.netcom.com Wed Jun 4 13:40:11 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 04 Jun 2014 11:40:11 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> References: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> Message-ID: <538F680B.9040101@ix.netcom.com> On 6/4/2014 11:26 AM, Doug Ewell wrote: > Sorry, I left out an important detail. > > I wrote: > >> 3. U+FEFF at the beginning of a stream (note: not "packet" or >> arbitrary cutoff point) > I meant U+FEFF as a zero-width no-break space. Obviously it is very > common to see U+FEFF as a signature or BOM. > > My underlying question here is, how common is it that the producer of a > stream actually intends this character *at the start of a stream* to be > a ZWNBSP, not to be stripped lest the actual text content be altered? The semantics of it were chosen at the time to make no sense at the start, and to make the character invisible in most situations. The remnant of its semantic was later taken up by Word Joiner, so that there is now NO use for this as part of text. The use as part of a convention has always been clear. If you stick this at the front, readers will byte-reverse your data; that should weed out accidental use pretty quickly :) Or prevent people from getting "cute" with it in other ways. So, I would think that for this particular code point, you can safely assume that it's buggy or test data. Buggy data you just byte reverse as requested and let the user take the consequence. 
:) A./ > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From richard.wordingham at ntlworld.com Wed Jun 4 14:01:48 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 20:01:48 +0100 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: <20140604200148.7132c3d3@JRWUBU2> On Wed, 4 Jun 2014 00:23:53 +0000 "Whistler, Ken" wrote: > You cannot even be "very confident" of not finding actual ill-formed > UTF-16, like unpaired surrogates, in an external file, let alone > noncharacters. I though unpaired surrogates were normally mojibake, broken characters, or sabotage attempts. > Any one of those test strings could be > trivially turned into a text file by piping out that one UTF-16 > string to a file. At that point, you should be in detailed control of the Unicode encoding scheme. Also, would not the system be using one of UTF16 with byte order marks, UTF-16BE and UTF-16LE? > And I could then write conformant test software > that would read UTF-16 string input data from that file and run it > through the UCA algorithm to construct sortkeys for it. Given the number of control characters in that file, I wouldn't be confident of getting the output back the same as it went out unless the input were controlled at a binary level. > As Peter said, the main thing that prevents running into these is > that it isn't very *useful* to start off files (or strings) with > U+FFFE. Actually, for sorting records using the CLDR collation algorithm, it may be very useful to use U+FFFE as a field separator. If the most significant field for sorting is sometimes empty (e.g. surname in a list of contacts), then the field separator could very easily be the first non-BOM character after sorting. 
I suppose one had better use something like as a field separator instead. > (And, additionally, in the case of UTF-16 text data files, it > would be confusing and possibly lead to misinterpretation of byte > order, if you were somehow depending solely on initial BOMs -- which > I wouldn't advise, anyway.) Interesting. Goodbye UTF-16 encoding scheme and hello automatic encoding detection. I'm not sure how automatic detection is supposed to work with a file consisting of just a test string from the collation test. > Basically, the rules of standards (e.g., you shouldn't try to > publicly interchange noncharacters) are not like laws of > physics. Just because the standard says you shouldn't do > it doesn't mean it doesn't happen. Just as theft happens. Richard. From richard.wordingham at ntlworld.com Wed Jun 4 14:21:03 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 20:21:03 +0100 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <538F680B.9040101@ix.netcom.com> References: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> <538F680B.9040101@ix.netcom.com> Message-ID: <20140604202103.7cbfcc49@JRWUBU2> On Wed, 04 Jun 2014 11:40:11 -0700 Asmus Freytag wrote: > On 6/4/2014 11:26 AM, Doug Ewell wrote: > > I meant U+FEFF as a zero-width no-break space. Obviously it is very > > common to see U+FEFF as a signature or BOM. > The semantics of it were chosen at the time to make no sense > at the start, and to make the character invisible in most situations. > The remnant of its semantic was later taken up by Word Joiner, so that > there is now NO use for this as part of text. > The use as part of a convention has always been clear. If you stick > this at the front, readers will byte-reverse your data; that should > weed out accidental use pretty quickly :) Or prevent people from > getting "cute" with it in other ways. Wrong! 
If you stick U+FEFF at the start of a file, expect it to be stripped. If you stick U+FFFE at the start of a file, then expect to see the rest of the text to be byte-reversed. > So, I would think that for this particular code point, you can safely > assume that it's buggy or test data. The example that's usually given is that of a text file sliced into segments to avoid file size limits. In these cases, there is the risk that U+FEFF as ZWNBSP will wind up at the start of a segment and be stripped. The solution using the Windows command window is to perform a *binary* concatenation of the segments; if one doesn't, newlines will be inserted between the segments, which is much severer damage. Richard. From asmusf at ix.netcom.com Wed Jun 4 14:52:02 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 04 Jun 2014 12:52:02 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604202103.7cbfcc49@JRWUBU2> References: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net><538F680B.9040101@ix.netcom.com> <20140604202103.7cbfcc49@JRWUBU2> Message-ID: <538F78E2.2030502@ix.netcom.com> On 6/4/2014 12:21 PM, Richard Wordingham wrote: > On Wed, 04 Jun 2014 11:40:11 -0700 > Asmus Freytag wrote: > >> On 6/4/2014 11:26 AM, Doug Ewell wrote: >>> I meant U+FEFF as a zero-width no-break space. Obviously it is very >>> common to see U+FEFF as a signature or BOM. >> The semantics of it were chosen at the time to make no sense >> at the start, and to make the character invisible in most situations. >> The remnant of its semantic was later taken up by Word Joiner, so that >> there is now NO use for this as part of text. > >> The use as part of a convention has always been clear. If you stick >> this at the front, readers will byte-reverse your data; that should >> weed out accidental use pretty quickly :) Or prevent people from >> getting "cute" with it in other ways. > Wrong! 
If you stick U+FEFF at the start of a file, expect it to be > stripped. If you stick U+FFFE at the start of a file, then expect to > see the rest of the text to be byte-reversed. Duh. (reminder, have coffee first) A./ > >> So, I would think that for this particular code point, you can safely >> assume that it's buggy or test data. > The example that's usually given is that of a text file sliced into > segments to avoid file size limits. In these cases, there is the risk > that U+FEFF as ZWNBSP will wind up at the start of a segment and be > stripped. The solution using the Windows command window is to perform a > *binary* concatenation of the segments; if one doesn't, newlines will > be inserted between the segments, which is much severer damage. > > Richard. > From npatch at shutterstock.com Wed Jun 4 16:04:07 2014 From: npatch at shutterstock.com (Nick Patch) Date: Wed, 4 Jun 2014 17:04:07 -0400 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: On Wed, Jun 4, 2014 at 7:45 AM, David Starner wrote: > Could you not do that in Objective-C? If no, then it's a step forward > for Apple, but the rest of us--Ada, C, C++, C#, Java, Python--have had > this feature for years. 20 years in 2015 in the case of Ada. Also, Perl has supported Unicode identifiers for 14 years. They were added in Perl v5.6, released in March 2000. The officially supported identifier characters are documented here: https://metacpan.org/pod/perldata#Identifier-parsing Here's a UTS #18 style regex to match a Perl identifier: [ [ \p{word} && \p{XID_Start} ] || _ ][ \p{word} && \p{XID_Continue} ]* And the equivalent Perl regex: (?[ ( \p{word} & \p{XID_Start} ) | [_] ])(?[ \p{word} & \p{XID_Continue} ])* This is basically the default XID identifier recommended in UAX #31 but excluding any non-"word" characters and also allowing a leading underscore. 
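For comparison with the Perl rule Nick quotes, Python's stdlib exposes the same UAX #31 classes (XID_Start/XID_Continue, with a leading underscore allowed) through `str.isidentifier()`, which makes it easy to experiment:

```python
# str.isidentifier() implements Python's identifier grammar, which is the
# UAX #31 default (XID_Start then XID_Continue*, underscore allowed),
# close to the Perl rule above minus Perl's \p{word} intersection.
for candidate in ["café", "_private", "π", "1abc", "no spaces", "a-b"]:
    print(candidate, candidate.isidentifier())
```

Note that Python additionally NFKC-normalizes identifiers at parse time, so visually distinct spellings can name the same variable.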
By the way, in the past I found that PHP even allows many different whitespace characters in identifiers! -- Nick Patch @nickpatch -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 4 17:48:02 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 04 Jun 2014 15:48:02 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> Richard Wordingham wrote: > The example that's usually given [of U+FEFF at the start of a stream] > is that of a text file sliced into segments to avoid file size limits. > In these cases, there is the risk that U+FEFF as ZWNBSP will wind up > at the start of a segment and be stripped. Nope, that's exactly the case I was excluding when I wrote: > 3. U+FEFF [as a zero-width no-break space] at the beginning of a > stream (note: not "packet" or arbitrary cutoff point) If you are processing arbitrary fragments of a stream, without knowledge of preceding fragments, as in this example, then you have no business making *any* changes to that fragment based on interpretation of that fragment as Unicode text. Your sole responsibilities at that point are to pass the fragments, intact, from one process to the next, or to disassemble and reassemble them. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From haberg-1 at telia.com Wed Jun 4 18:10:52 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 01:10:52 +0200 Subject: Math input methods In-Reply-To: <538F1864.9070603@cs.tut.fi> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: On 4 Jun 2014, at 15:00, Jukka K. 
Korpela wrote: > 2014-06-04 15:32, Hans Aberg wrote under Subject: Re: Swift: > >> On 4 Jun 2014, at 13:58, Leonardo Boiko >> wrote: >> >>> I don't think this feature saw much use, since programmers in a >>> global world can't assume that everyone will have easy access to >>> their input methods, and so tend to restrict code tokens to the >>> ASCII set to encourage participation. >> >> Indeed, the lack of good input methods limits the usability of the >> math characters, which otherwise may be very useful in programming >> languages. One way is to add shortcut translations, like typing >> "real" translates into ℝ (U+211D), but they must be added by hand. > > If you are interested in math input methods, take a look at my design of math keyboard layout for use on normal US keyboard: > http://www.cs.tut.fi/~jkorpela/math/kbd.html Unfortunately I use a different platform. > Input issues can be handled at many levels, including program-specific translations, but doing them at keyboard level has obvious advantages (and some problems). > > As an aside, the ISO 80000-2 standard on mathematical notations describes boldface letters such as boldface R as symbols for commonly known sets of numbers. The double-struck letters like ℝ are mentioned as an alternative way, whereas in the previous standard, these notations were presented the other way around. The change is logical in the sense that bold face is a more original notation and double-struck letters as characters imitate the imitation of boldface letters when writing by hand (with a pen or piece of chalk). The STIX fonts [1] have a lot of the "traditional" math characters, including the math styles. A discussion here revealed that mathematicians nowadays use a lot more. So a problem is that math uses a lot of characters. 1. 
http://www.stixfonts.org From joe at unicode.org Wed Jun 4 18:44:24 2014 From: joe at unicode.org (Joe Becker) Date: Wed, 04 Jun 2014 16:44:24 -0700 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <538FAF58.4080501@unicode.org> A bit of ancient history, from the System Development Division spinoff of Xerox PARC: Around 1979, Xerox adopted the multi-byte Xerox Character Code Standard (XCCS), an ancestor of Unicode. Around 1980, Larry Masinter and I converted the Xerox Lisp system to XCCS, including our phonetic-based Japanese input method, using Japanese fonts supplied by Fuji Xerox. The system was demo'ed at a trade show as "JLisp" ... of course the attendees showed no interest. Around 1985, Lori Nagata converted our product compiler (for a Pascal derivative called Mesa) to accept XCCS sourcecode, including fully multilingual identifiers, strings, and comments ... of course the Development Environment group showed no interest. Maybe now the world is ready for "????" ... I don't think I am ... Joe From prosfilaes at gmail.com Wed Jun 4 21:50:20 2014 From: prosfilaes at gmail.com (David Starner) Date: Wed, 4 Jun 2014 19:50:20 -0700 Subject: Math input methods In-Reply-To: <538F1864.9070603@cs.tut.fi> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: On Wed, Jun 4, 2014 at 6:00 AM, Jukka K. Korpela wrote: > The change is logical in the sense that bold face is a > more original notation and double-struck letters as characters imitate the > imitation of boldface letters when writing by hand (with a pen or piece of > chalk). On the other hand, bold face is a minor variation on normal types. 
Double-struck letters are more clearly distinct, which is probably why they moved from the chalkboard to printing in the first place. I don't see much advantage of 𝐍𝐂𝐑𝐙𝐐 over ℕℂℝℤℚ, especially when confusability with NCRZQ comes into play. -- Kie ekzistas vivo, ekzistas espero. From verdy_p at wanadoo.fr Thu Jun 5 02:41:07 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 5 Jun 2014 09:41:07 +0200 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> References: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> Message-ID: 2014-06-05 0:48 GMT+02:00 Doug Ewell : > If you are processing arbitrary fragments of a stream, without knowledge > of preceding fragments, as in this example, then you have no business > making *any* changes to that fragment based on interpretation of that > fragment as Unicode text. Your sole responsibilities at that point are > to pass the fragments, intact, from one process to the next, or to > disassemble and reassemble them. Not necessarily true. You can easily think about the debugging log coming from an OS or device and accumulating text data coming from various sources in the device. Then you can connect to a live stream at any time without necessarily following all that happened before. You'll probably want to sync on the first newline control and then proceed from that point. But now if you have those devices configured heterogeneously and generating their own output encoding, you won't necessarily know how it is encoded even if all of them use some Unicode UTF. So the stream will regularly repost an encoding mark, for example at the beginning of each dated log entry, and this could be just an encoded BOM (even with UTF-8, or some other UTF like UTF-16, which would be more likely if the content were essentially in an East Asian (CJK) language). 
These devices would emit these messages or logs with a very basic protocol, or no protocol at all (Telnet, serial link, ...) without any prior negotiation (these data feeds are unidirectional, meant to be used by any number of consumers that can connect or disconnect at any time; the log producer will never know how many clients there are, notably for passive debugging logs). You could then expect BOMs to occur many times in the stream (this is what I called a "live" stream: it has no start, no end, no defined total size; you don't know when new texts will be emitted, you don't even know at which rate, which could be very high). If the rate is too high, one can use a fast local proxy to filter the feed with patterns (e.g. a debug level reported at the start of each log entry, or some identifier of the real source, not controlled directly at the point where you connect to listen to the stream) and receive only the result that can be supported over a slower link to the client. But here also the proxy will not necessarily work continuously, only when there is some interested client providing a matching pattern. The resulting texts will then be highly fragmented. So your assumption is only true when you think about processes that have a prior agreement to use some specific convention. But in a heterogeneous world, where participants (producers and consumers) are maintained separately and can appear or disappear at any time, you cannot expect that they will all use the same encoding, or that disassembling/reassembling is as safe as you think. This is only true if they work in close cooperation under strict common standards. 
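A consumer joining such a feed mid-stream has to tolerate both byte chunks cut anywhere and the BOMs a producer re-posts at resynchronization points. A minimal sketch of the UTF-8 case (the function name is mine); it pairs an incremental decoder, which buffers bytes split mid-sequence, with stripping of interior U+FEFF:

```python
import codecs

def read_live_utf8(chunks):
    """Incrementally decode a byte stream that may be cut anywhere,
    dropping the U+FEFF marks a producer re-posts as sync points."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # decode() holds back an incomplete trailing sequence until
        # the next chunk completes it.
        yield decoder.decode(chunk).replace("\ufeff", "")

# A feed cut mid-character and carrying repeated UTF-8 BOMs:
chunks = [b"\xef\xbb\xbflog: ok\n\xef", b"\xbb\xbflog: caf\xc3", b"\xa9\n"]
print("".join(read_live_utf8(chunks)))  # log: ok / log: café
```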
Take the example of a service that would archive all received emails in a feed, or a list of SMS messages from a group of participants; do you need to archive not only the texts themselves but also all the protocol metadata from which they originated, when the application is creating a basic log which will not be used by SMS or emails due to the generated volume? Encoded texts in heterogeneous environments and over the web, where people could use various OSes and languages, are well-known examples where plain text is not always sufficient to determine how to divide it; you cannot just "guess" from the content when this content can change at any time. And these texts are not always safely convertible to the same encoding without data losses or alterations. If you don't insert in the live stream enough BOMs after some resynchronization points, the result that consumers will get will be full of mojibake. -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Thu Jun 5 03:57:13 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 10:57:13 +0200 Subject: Math input methods In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> On 5 Jun 2014, at 04:50, David Starner wrote: > On Wed, Jun 4, 2014 at 6:00 AM, Jukka K. Korpela wrote: >> The change is logical in the sense that bold face is a >> more original notation and double-struck letters as characters imitate the >> imitation of boldface letters when writing by hand (with a pen or piece of >> chalk). > > On the other hand, bold face is a minor variation on normal types. > Double-struck letters are more clearly distinct, which is probably why > they moved from the chalkboard to printing in the first place. I don't > see much advantage of 𝐍𝐂𝐑𝐙𝐐 
over ℕℂℝℤℚ, especially when > confusability with NCRZQ comes into play. The double-struck letters are useful in math, because they free other letter styles for other use. First, only a few were used, for the natural, rational, real and complex numbers, but they became popular, so that all letters, uppercase and lowercase, are now available in Unicode. From jlturriff at centurylink.net Thu Jun 5 05:04:11 2014 From: jlturriff at centurylink.net (J. Leslie Turriff) Date: Thu, 5 Jun 2014 05:04:11 -0500 Subject: Swift In-Reply-To: <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140603195253.3c0df53f@JRWUBU2> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <201406050504.11990.jlturriff@centurylink.net> On Wednesday 04 June 2014 10:53:59 Shawn Steele wrote: > I'm sort of confused why Unicode would be a big deal. C# & other languages > have allowed Unicode letters in identifiers for years, so readable strings > should be possible in almost any language. > > It's a bit cute to include emoji, but I'm not sure how practical it is. It > also makes me wonder how they came up with the list, I presume control > codes aren't allowed? Or alternate whitespace? I assume they use some > Unicode Categories to figure out the permitted set? > > I rarely see non-Latin code in practice though, but of course I'm a native > English speaker. > > -Shawn What I find interesting is that (with the possible exception of Ada) I don't think that any of the commonly used languages allow for the use of Unicode characters for non-user-defined tokens (i.e. reserved words, etc.). I'm working on a parser for the Rexx language that will allow all tokens to be recognized using the default (or a user-specified) locale, not just the user-defined tokens. It will also allow various single-character operators equivalent to the multiple-character ones defined in the current language standard (e.g. '≠' for '¬=', '<>' or '\=', '≤' for '<=', '≥' 
for '>=', etc.). Leslie -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From prosfilaes at gmail.com Thu Jun 5 05:52:13 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 5 Jun 2014 03:52:13 -0700 Subject: Swift In-Reply-To: <201406050504.11990.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> <201406050504.11990.jlturriff@centurylink.net> Message-ID: On Thu, Jun 5, 2014 at 3:04 AM, J. Leslie Turriff wrote: > What I find interesting is that (with the possible exception of Ada) I don't > think that any of the commonly used languages allow for the use of Unicode > characters for non- user-defined tokens (i.e. reserved words, etc.). There is one non-ASCII character in the library, for Pi, and that caused some fuss, along with some eye-rolling, as writing the Unicode characters as ["03C0"] is permitted. Ada is a conservative language, and there's no real drive to make changes like these. (I was mistaken on the 20 years for Unicode identifiers; it was the Ada 2005 standard that permitted it, not Ada 95.) Scala is not really a commonly used language, but does use some Unicode arrows: ? for =>, ?for <- and ? for ->. Most people don't bother. ALGOL 60 and ALGOL 68 used non-ASCII characters like ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? and ?, and had compiler defined spellings for keywords. -- Kie ekzistas vivo, ekzistas espero. From frederic.grosshans at gmail.com Thu Jun 5 06:10:42 2014 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Thu, 05 Jun 2014 13:10:42 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> <201406050504.11990.jlturriff@centurylink.net> Message-ID: <53905032.6010000@gmail.com> Le 05/06/2014 12:52, David Starner a ?crit : > On Thu, Jun 5, 2014 at 3:04 AM, J. 
Leslie Turriff > wrote: >> What I find interesting is that (with the possible exception of Ada) I don't >> think that any of the commonly used languages allow for the use of Unicode >> characters for non-user-defined tokens (i.e. reserved words, etc.). > There is one non-ASCII character in the library, for Pi, and that > caused some fuss, along with some eye-rolling, as writing the Unicode > characters as ["03C0"] is permitted. Ada is a conservative language, > and there's no real drive to make changes like these. (I was mistaken > on the 20 years for Unicode identifiers; it was the Ada 2005 standard > that permitted it, not Ada 95.) > > Scala is not really a commonly used language, but does use some > Unicode arrows: ⇒ for =>, ← for <- and → for ->. Most people don't > bother. > > ALGOL 60 and ALGOL 68 used non-ASCII characters like ?, ?, ?, ?, ?, ?, > ?, ?, ?, ?, ? and ?, and had compiler-defined spellings for keywords. > And, of course, there is APL ( https://en.wikipedia.org/wiki/APL_%28programming_language%29 ). Unicode has 70 characters specifically for its use (APL FUNCTIONAL SYMBOL ****), U+2336 to U+237A since Unicode 1.1 and U+2395 since Unicode 3.0 From martin at v.loewis.de Thu Jun 5 10:27:42 2014 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 05 Jun 2014 17:27:42 +0200 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <53908C6E.4@v.loewis.de> Am 04.06.14 11:28, schrieb Andre Schappo: > The restrictions seem a little like IDNA2008. Anyone have links to > info giving a detailed explanation/tabulation of allowed and non > allowed Unicode chars for Swift Variable and Constant names?
The language reference is at https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html

For reference, the definition of identifier-character is (read each line as an alternative)

identifier-character → Digit 0 through 9
identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or U+FE20–U+FE2F
identifier-character → identifier-head

where identifier-head is

identifier-head → Upper- or lowercase letter A through Z
identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or U+00B7–U+00BA
identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or U+00F8–U+00FF
identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or U+180F–U+1DBF
identifier-head → U+1E00–U+1FFF
identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, or U+2060–U+206F
identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or U+2776–U+2793
identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF
identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or U+3040–U+D7FF
identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or U+FE30–U+FE44
identifier-head → U+FE47–U+FFFD
identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or U+40000–U+4FFFD
identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or U+80000–U+8FFFD
identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or U+C0000–U+CFFFD
identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD

As the construction principle for this list, they say "Identifiers begin with an upper case or lower case letter A through Z, an underscore (_), a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn't in a Private Use Area. After the first character, digits and combining Unicode characters are also allowed."
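The range-based grammar above is easy to implement as a table lookup. The following Python sketch (not part of the original thread, and deliberately abridged: only a handful of the identifier-head alternatives are transcribed, a real checker would carry the full list) shows the idea:

```python
# Abridged subset of Swift's identifier-head ranges quoted above.
# A complete implementation would include every alternative from the grammar.
IDENTIFIER_HEAD_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),           # A-Z, a-z
    (0x5F, 0x5F),                          # underscore
    (0x00C0, 0x00D6), (0x00D8, 0x00F6),   # Latin-1 letters (excludes x00D7, x00F7)
    (0x0100, 0x02FF), (0x0370, 0x167F),   # large BMP letter ranges
    (0x1E00, 0x1FFF),
    (0x3040, 0xD7FF),
    (0x10000, 0x1FFFD),                    # first supplementary-plane range
]

def is_identifier_head(ch: str) -> bool:
    """True if ch may begin an identifier under the abridged ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in IDENTIFIER_HEAD_RANGES)
```

Note how the ranges carve letters out of Latin-1 while skipping U+00D7 (multiplication sign) and U+00F7 (division sign), so `is_identifier_head('é')` holds but `is_identifier_head('×')` does not; digits are excluded from the head but allowed as identifier-character.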
Regards, Martin From senn at maya.com Thu Jun 5 10:46:41 2014 From: senn at maya.com (Jeff Senn) Date: Thu, 5 Jun 2014 11:46:41 -0400 Subject: Swift In-Reply-To: <53908C6E.4@v.loewis.de> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: Has anyone figured out whether character sequences that are non-canonical (de)compositions but could be recomposed to the same result are the same identifier or not? That is: are identifiers merely sequences of characters or intended to be comparable as ?Unicode strings? (under some sort of compatibility rule)? On Jun 5, 2014, at 11:27 AM, Martin v. L?wis wrote: > Am 04.06.14 11:28, schrieb Andre Schappo: >> The restrictions seem a little like IDNA2008. Anyone have links to >> info giving a detailed explanation/tabulation of allowed and non >> allowed Unicode chars for Swift Variable and Constant names? > > The language reference is at > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > For reference, the definition of identifier-character is (read each > line as an alternative) > > identifier-character ? Digit 0 through 9 > identifier-character ? U+0300?U+036F, U+1DC0?U+1DFF, U+20D0?U+20FF, or > U+FE20?U+FE2F > identifier-character ? identifier-head? > > where identifier-head is > > identifier-head ? Upper- or lowercase letter A through Z > identifier-head ? U+00A8, U+00AA, U+00AD, U+00AF, U+00B2?U+00B5, or > U+00B7?U+00BA > identifier-head ? U+00BC?U+00BE, U+00C0?U+00D6, U+00D8?U+00F6, or > U+00F8?U+00FF > identifier-head ? U+0100?U+02FF, U+0370?U+167F, U+1681?U+180D, or > U+180F?U+1DBF > identifier-head ? U+1E00?U+1FFF > identifier-head ? U+200B?U+200D, U+202A?U+202E, U+203F?U+2040, U+2054, > or U+2060?U+206F > identifier-head ? U+2070?U+20CF, U+2100?U+218F, U+2460?U+24FF, or > U+2776?U+2793 > identifier-head ? U+2C00?U+2DFF or U+2E80?U+2FFF > identifier-head ? 
U+3004?U+3007, U+3021?U+302F, U+3031?U+303F, or > U+3040?U+D7FF > identifier-head ? U+F900?U+FD3D, U+FD40?U+FDCF, U+FDF0?U+FE1F, or > U+FE30?U+FE44 > identifier-head ? U+FE47?U+FFFD > identifier-head ? U+10000?U+1FFFD, U+20000?U+2FFFD, U+30000?U+3FFFD, or > U+40000?U+4FFFD > identifier-head ? U+50000?U+5FFFD, U+60000?U+6FFFD, U+70000?U+7FFFD, or > U+80000?U+8FFFD > identifier-head ? U+90000?U+9FFFD, U+A0000?U+AFFFD, U+B0000?U+BFFFD, or > U+C0000?U+CFFFD > identifier-head ? U+D0000?U+DFFFD or U+E0000?U+EFFFD > > As the construction principle for this list, they say > > "Identifiers begin with an upper case or lower case letter A through Z, > an underscore (_), a noncombining alphanumeric Unicode character in the > Basic Multilingual Plane, or a character outside the Basic Multilingual > Plan that isn?t in a Private Use Area. After the first character, digits > and combining Unicode characters are also allowed." > > Regards, > Martin > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From mark at macchiato.com Thu Jun 5 11:06:25 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 5 Jun 2014 18:06:25 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: I haven't done any analysis, but on first glance it looks like it is based on http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax Mark *? Il meglio ? l?inimico del bene ?* On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn wrote: > Has anyone figured out whether character sequences that are non-canonical > (de)compositions but could be recomposed to the same result > are the same identifier or not? > > That is: are identifiers merely sequences of characters or intended to be > comparable as ?Unicode strings? (under some sort of compatibility rule)? 
> > On Jun 5, 2014, at 11:27 AM, Martin v. L?wis wrote: > > > Am 04.06.14 11:28, schrieb Andre Schappo: > >> The restrictions seem a little like IDNA2008. Anyone have links to > >> info giving a detailed explanation/tabulation of allowed and non > >> allowed Unicode chars for Swift Variable and Constant names? > > > > The language reference is at > > > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > > > For reference, the definition of identifier-character is (read each > > line as an alternative) > > > > identifier-character ? Digit 0 through 9 > > identifier-character ? U+0300?U+036F, U+1DC0?U+1DFF, U+20D0?U+20FF, or > > U+FE20?U+FE2F > > identifier-character ? identifier-head? > > > > where identifier-head is > > > > identifier-head ? Upper- or lowercase letter A through Z > > identifier-head ? U+00A8, U+00AA, U+00AD, U+00AF, U+00B2?U+00B5, or > > U+00B7?U+00BA > > identifier-head ? U+00BC?U+00BE, U+00C0?U+00D6, U+00D8?U+00F6, or > > U+00F8?U+00FF > > identifier-head ? U+0100?U+02FF, U+0370?U+167F, U+1681?U+180D, or > > U+180F?U+1DBF > > identifier-head ? U+1E00?U+1FFF > > identifier-head ? U+200B?U+200D, U+202A?U+202E, U+203F?U+2040, U+2054, > > or U+2060?U+206F > > identifier-head ? U+2070?U+20CF, U+2100?U+218F, U+2460?U+24FF, or > > U+2776?U+2793 > > identifier-head ? U+2C00?U+2DFF or U+2E80?U+2FFF > > identifier-head ? U+3004?U+3007, U+3021?U+302F, U+3031?U+303F, or > > U+3040?U+D7FF > > identifier-head ? U+F900?U+FD3D, U+FD40?U+FDCF, U+FDF0?U+FE1F, or > > U+FE30?U+FE44 > > identifier-head ? U+FE47?U+FFFD > > identifier-head ? U+10000?U+1FFFD, U+20000?U+2FFFD, U+30000?U+3FFFD, or > > U+40000?U+4FFFD > > identifier-head ? U+50000?U+5FFFD, U+60000?U+6FFFD, U+70000?U+7FFFD, or > > U+80000?U+8FFFD > > identifier-head ? U+90000?U+9FFFD, U+A0000?U+AFFFD, U+B0000?U+BFFFD, or > > U+C0000?U+CFFFD > > identifier-head ? 
U+D0000?U+DFFFD or U+E0000?U+EFFFD > > > > As the construction principle for this list, they say > > > > "Identifiers begin with an upper case or lower case letter A through Z, > > an underscore (_), a noncombining alphanumeric Unicode character in the > > Basic Multilingual Plane, or a character outside the Basic Multilingual > > Plan that isn?t in a Private Use Area. After the first character, digits > > and combining Unicode characters are also allowed." > > > > Regards, > > Martin > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Thu Jun 5 11:41:17 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 18:41:17 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> On 5 Jun 2014, at 17:46, Jeff Senn wrote: > That is: are identifiers merely sequences of characters or intended to be comparable as ?Unicode strings? (under some sort of compatibility rule)? In computer languages, identifiers are normally compared only for equality, as it reduces lookup time complexity. 
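Hans Aberg's point, that identifiers are normally compared only for equality, is exactly where Jeff Senn's question bites: two canonically equivalent spellings of the same identifier are unequal as raw code-point sequences. A minimal Python illustration (not part of the original thread):

```python
import unicodedata

# Two source spellings of the visually identical identifier "café":
composed = "caf\u00e9"       # 4 code points, 'é' precomposed (U+00E9)
decomposed = "cafe\u0301"    # 5 code points, 'e' + U+0301 COMBINING ACUTE ACCENT

# Plain equality, as a hashed symbol table would do it, sees two identifiers:
print(composed == decomposed)                      # False

# Canonical normalization (NFC) maps both to the same sequence:
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```

A compiler that skips normalization therefore treats these as distinct symbols even though no editor or reader can tell them apart.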
From senn at maya.com Thu Jun 5 12:24:12 2014 From: senn at maya.com (Jeff Senn) Date: Thu, 5 Jun 2014 13:24:12 -0400 Subject: Swift In-Reply-To: <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> Message-ID: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > On 5 Jun 2014, at 17:46, Jeff Senn wrote: > >> That is: are identifiers merely sequences of characters or intended to be comparable as “Unicode strings” (under some sort of compatibility rule)? > > In computer languages, identifiers are normally compared only for equality, as it reduces lookup time complexity. Well in this case we are talking about parsing a source file and generating internal symbols, so the complexity of the comparison operation is a red herring. The real question is how the source identifier gets mapped into a (compiled) symbol. (e.g. in C++ this is not an obvious operation) If your implication is that there should be no canonicalization (the string from the source is used as a sequence of characters only, directly mapped to a symbol), then I predict sticky problems in the future. The most obvious of which is that in some cases I will be able to change the semantics of the compiled program by (accidentally) canonicalizing the source text (an operation, I will point out, that is invisible to the user in many (most?) Unicode aware editors).
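The hazard Senn describes, an editor silently canonicalizing the source on save, can be checked for mechanically. A small Python sketch (an illustration added here, not something from the thread) detects whether re-saving a file in NFC would change the code-point sequence, and hence every symbol a normalization-unaware compiler derives from it:

```python
import unicodedata

def nfc_sensitive(source: str) -> bool:
    """True if an editor that silently re-saves the text in NFC would
    change its code-point sequence (and so change the symbols a
    normalization-unaware compiler would generate from it)."""
    return unicodedata.normalize("NFC", source) != source

decomposed_src = "e\u0301 = 1"   # 'e' + COMBINING ACUTE, then ' = 1'
precomposed_src = "\u00e9 = 1"   # precomposed 'é', then ' = 1'
print(nfc_sensitive(decomposed_src))   # True: re-saving alters the identifier
print(nfc_sensitive(precomposed_src))  # False: already in NFC, stable
```

A build system could run such a check as a lint step and reject sources whose identifiers are not normalization-stable, sidestepping the compile-breaks-mysteriously scenario discussed below.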
From richard.wordingham at ntlworld.com Thu Jun 5 12:40:09 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 5 Jun 2014 18:40:09 +0100 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: References: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> Message-ID: <20140605184009.75217096@JRWUBU2> On Thu, 5 Jun 2014 09:41:07 +0200 Philippe Verdy wrote: > You'll probably want to sync on the first newline control and then > proceed from that point. But now if you have those devices configured > heterogenously and generating their own output encoding you won't > necessarily know how it is encoded even uf all of them use some UTF of > Unicode. So the stream will regularly repost an encoding mark, for > exampel at the begining of each dated logged entry, and this could be > just an encoded BOM (even with UTF-8, or some other UTF like UTF-16 > which would be more likely if the language contained essentially an > East-Asian (CJK) language. Of course, this is not an arbitrary fragment. In this location, ZWNBSP will have almost no effect. (The only mechanisms I can think of are character counts and the text being pasted immediately after another word.) This, and the early belief that U+FFFE would not occur in Unicode text, are why it was chosen. Richard. From jlturriff at centurylink.net Thu Jun 5 13:14:09 2014 From: jlturriff at centurylink.net (J. Leslie Turriff) Date: Thu, 5 Jun 2014 13:14:09 -0500 Subject: Swift In-Reply-To: <53905032.6010000@gmail.com> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> Message-ID: <201406051314.09438.jlturriff@centurylink.net> On Thursday 05 June 2014 06:10:42 Fr?d?ric Grosshans wrote: > Le 05/06/2014 12:52, David Starner a ?crit : > > On Thu, Jun 5, 2014 at 3:04 AM, J. 
Leslie Turriff > > > > wrote: > >> What I find interesting is that (with the possible exception of > >> Ada) I don't think that any of the commonly used languages allow for the > >> use of Unicode characters for non- user-defined tokens (i.e. reserved > >> words, etc.). > > > > There is one non-ASCII character in the library, for Pi, and that > > caused some fuss, along with some eye-rolling, as writing the Unicode > > characters as ["03C0"] is permitted. Ada is a conservative language, > > and there's no real drive to make changes like these. (I was mistaken > > on the 20 years for Unicode identifiers; it was the Ada 2005 standard > > that permitted it, not Ada 95.) > > > > Scala is not really a commonly used language, but does use some > > Unicode arrows: ? for =>, ?for <- and ? for ->. Most people don't > > bother. > > > > ALGOL 60 and ALGOL 68 used non-ASCII characters like ?, ?, ?, ?, ?, ?, > > ?, ?, ?, ?, ? and ?, and had compiler defined spellings for keywords. > > And, of course, there is APL ( > https://en.wikipedia.org/wiki/APL_%28programming_language%29 ). Unicode > has 70 characters specially for its use (APL FUNCTIONAL SYMBOL ****), > U+2336 to U+237A since Unicode 1.1 and U+2395 since Unicode 3.0 All true; but do any languages allow for keywords (if, then, else, do, while, until, end, iterate, leave, call return, exit,...) to be expressed in the programmer's locale? -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From jlturriff at centurylink.net Thu Jun 5 13:22:09 2014 From: jlturriff at centurylink.net (J. 
Leslie Turriff) Date: Thu, 5 Jun 2014 13:22:09 -0500 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: <201406051322.09586.jlturriff@centurylink.net> On Thursday 05 June 2014 12:24:12 Jeff Senn wrote: > On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > > On 5 Jun 2014, at 17:46, Jeff Senn wrote: > >> That is: are identifiers merely sequences of characters or intended to > >> be comparable as ?Unicode strings? (under some sort of compatibility > >> rule)? > > > > In computer languages, identifiers are normally compared only for > > equality, as it reduces lookup time complexity. > > Well in this case we are talking about parsing a source file and generating > internal symbols, so the complexity of the comparison operation is a red > herring. > > The real question is how does the source identifier get mapped into a > (compiled) symbol. (e.g. in C++ this is not an obvious operation) > > If your implication is that there should be no canonicalization (the string > from the source is used as a sequence of characters only directly mapped to > a symbol), then I predict sticky problems in the future. The most obvious > of which is that in some cases I will be able to change the semantics of > the complied program by (accidentally) canonicalizing the source text (an > operation, I will point out, that is invisible to the user in many (most?) > Unicode aware editors). So if programmer A uses editor X to write code, and programmer B uses editor Y to modify the code, suddenly the compiler might start generating multiple symbols for some identifiers, causing compiles to fail for no obvious reason. It seems to me that "the complexity of the comparison operation is a red herring" is perhaps a naive view; this would produce a really high astonishment factor. 
Leslie -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From senn at maya.com Thu Jun 5 13:47:28 2014 From: senn at maya.com (Jeff Senn) Date: Thu, 5 Jun 2014 14:47:28 -0400 Subject: Swift In-Reply-To: <201406051322.09586.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> <201406051322.09586.jlturriff@centurylink.net> Message-ID: <4010FAE1-C9C2-4BC6-9D6C-D4592B7BD87E@maya.com> On Jun 5, 2014, at 2:22 PM, J. Leslie Turriff wrote: > On Thursday 05 June 2014 12:24:12 Jeff Senn wrote: >> On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: >>> On 5 Jun 2014, at 17:46, Jeff Senn wrote: >>>> That is: are identifiers merely sequences of characters or intended to >>>> be comparable as ?Unicode strings? (under some sort of compatibility >>>> rule)? >>> >>> In computer languages, identifiers are normally compared only for >>> equality, as it reduces lookup time complexity. >> >> Well in this case we are talking about parsing a source file and generating >> internal symbols, so the complexity of the comparison operation is a red >> herring. >> >> The real question is how does the source identifier get mapped into a >> (compiled) symbol. (e.g. in C++ this is not an obvious operation) >> >> If your implication is that there should be no canonicalization (the string >> from the source is used as a sequence of characters only directly mapped to >> a symbol), then I predict sticky problems in the future. The most obvious >> of which is that in some cases I will be able to change the semantics of >> the complied program by (accidentally) canonicalizing the source text (an >> operation, I will point out, that is invisible to the user in many (most?) >> Unicode aware editors). 
> So if programmer A uses editor X to write code, and programmer B uses editor > Y to modify the code, suddenly the compiler might start generating multiple > symbols for some identifiers, causing compiles to fail for no obvious reason. > It seems to me that "the complexity of the comparison operation is a red > herring" is perhaps a naive view; this would produce a really high > astonishment factor. > > Leslie I think we are agreeing (and miscommunicating): the comparison operator ON SYMBOLS is incredibly important. Of course symbols must be unique! Comparing sequences of characters in the SOURCE for equality is almost a non-issue. (Consider macros, case-insensitivity in some languages, context in languages such as C++, etc.) You illustrate the problem in your example. If I write (4 characters of source) code (because my editor uses decomposed characters): á=1 ('a' '◌́' '=' '1') And you look at it and think you are going to write code to access that value (and your editor uses composed characters - so you have 3 characters): á=2 ('á' '=' '2') Then we have astonishment. > > -- > "Disobedience is the true foundation of liberty. The obedient must be
--Henry David Thoreau > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From haberg-1 at telia.com Thu Jun 5 14:11:59 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 21:11:59 +0200 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: <111DF5F2-31AF-41A9-81B5-AC7B412AFA95@telia.com> On 5 Jun 2014, at 19:24, Jeff Senn wrote: > On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > >> On 5 Jun 2014, at 17:46, Jeff Senn wrote: >> >>> That is: are identifiers merely sequences of characters or intended to be comparable as ?Unicode strings? (under some sort of compatibility rule)? >> >> In computer languages, identifiers are normally compared only for equality, as it reduces lookup time complexity. > > Well in this case we are talking about parsing a source file and generating internal symbols, so the complexity of the comparison operation is a red herring. > > The real question is how does the source identifier get mapped into a (compiled) symbol. (e.g. in C++ this is not an obvious operation) > > If your implication is that there should be no canonicalization (the string from the source is used as a sequence of characters only directly mapped to a symbol), then I predict sticky problems in the future. The most obvious of which is that in some cases I will be able to change the semantics of the complied program by (accidentally) canonicalizing the source text (an operation, I will point out, that is invisible to the user in many (most?) Unicode aware editors). It is not difficult to mangle any byte sequence into c/C++ identifiers, but Swift compiles directly into LLVM, so perhaps it is not needed. 
Xcode is very aggressive at combining characters, so it is hard to write non-normalized characters from it. The manual says that after the first character, combining characters are allowed, but does not seem to mention normalization. But it seems the compiler only needs to compare byte sequences for equality, which is what is traditional. From daniel.buenzli at erratique.ch Thu Jun 5 14:28:19 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Thu, 5 Jun 2014 20:28:19 +0100 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: Le jeudi, 5 juin 2014 ? 18:24, Jeff Senn a ?crit : > If your implication is that there should be no canonicalization (the string from the source is used as a sequence of characters only directly mapped to a symbol), then I predict sticky problems in the future. Note that this is actually the case in the XML specification, processors are not required to perform normalisation for matching tag names (see ?match' in this section [1] and this comment [2] of the annotated XML specification), I suspect this is rarely a problem in practice since XML vocabularies tend to stick to ASCII identifiers (and so should programmers in general IMHO). Daniel [1] http://www.w3.org/TR/REC-xml/#sec-terminology [2] http://www.xml.com/axml/notes/StringMatch.html From doug at ewellic.org Thu Jun 5 14:46:23 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 05 Jun 2014 12:46:23 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> Philippe Verdy wrote: > Not necessarily true. 
> > [602 words] This has nothing to do with the scenario I described, which involved removing a "BOM" from the start of an arbitrary fragment of data, thereby corrupting the data because the "BOM" was actually a ZWNBSP. If you have an arbitrary fragment of data, don't fiddle with it. If you know enough about the data to fiddle with it safely, it's not arbitrary. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From prosfilaes at gmail.com Thu Jun 5 18:03:10 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 5 Jun 2014 16:03:10 -0700 Subject: Swift In-Reply-To: <201406051314.09438.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: On Thu, Jun 5, 2014 at 11:14 AM, J. Leslie Turriff wrote: > All true; but do any languages allow for keywords (if, then, else, do, while, > until, end, iterate, leave, call return, exit,...) to be expressed in the > programmer's locale? Both ALGOL 60 and ALGOL 68 had compiler dependent source representations, so the Europeans could use their own words for keywords and use commas as decimal points. I'm pretty sure no one had invented the concept of a user's locale yet, but it would probably come configured for whatever local locale you wanted. (I assume for a machine that cost $14 million in 1966, such adjustments could be made for a single customer.) -- Kie ekzistas vivo, ekzistas espero. From verdy_p at wanadoo.fr Thu Jun 5 18:23:34 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 6 Jun 2014 01:23:34 +0200 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> References: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> Message-ID: 2014-06-05 21:46 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > Not necessarily true. 
> > > > [602 words] > > This has nothing to do with the scenario I described, which involved > removing a "BOM" from the start of an arbitrary fragment of data, > thereby corrupting the data because the "BOM" was actually a ZWNBSP. > > If you have an arbitrary fragment of data, don't fiddle with it. > This is your scenario. The simple concept of a unique "start" of text does not exist in live streams that can start anywhere. So you cannot always expect that U+FEFF or U+FFFE will only exist once in a stream, and necessarily at the position where you can start reading it, because you may already be past the initial creation of the stream without having any way to come back to the "start". Your assumption just assumes that you can always "rewind" your file; this is not always possible, and each user of that stream has its own start, different from the other one. And this is not because they are "fiddling" with it. Many applications internally use such one-way streams that have no random-access capability, so that they cannot be rewound to the "start". And the producer does not keep a complete log of everything that was emitted. Clients are just connecting to the stream at a position the producer has already reached, which is already past the start seen by the producer. In some cases there are even multiple producers contributing independently to the stream (debug log streams are typical examples, but this could also be a live text stream of subtitles in a live TV or radio channel, with a single producer for many consumers connecting to the never-ending stream at any time, without any possibility of rewinding back in time, possibly months or years, to get the full stream just in order to process thousands of gigabytes of audio or video in which the live text stream has been multiplexed). Now you will argue: this live stream is not plain text, it has a binary structure. Yes, but only if your consumer application wants to process the full multiplex. Typically clients will demultiplex the stream and pass it down to a simpler client that absolutely does not care about the transport multiplex format. If that downward client is just used to display the incoming text, it will just wait for text that is buffered line by line and displayed immediately where there's a newline separator. But even in this case, each line may have been fragmented so that each fragment will contain a leading BOM which will not necessarily be stripped (notably not if the transport is made with datagrams over a non-"reliable" protocol like UDP). You have also incorrectly assumed that a text stream is necessarily transported over a "reliable" protocol like TCP where there can be no data loss in the middle, i.e. you are still bound to classic storage on a file system (even if this file system is named "HTTP"; even in HTTP there also exist live streams without any defined start). Texts are inherently fragmentable. Initially they are transcripts of human communication, and nobody in real life is permanently connected to someone else and able to remember everything that was said by someone else. Fragmented texts are natural and have always existed, even before they were written on a material support. On a numeric network, text is dematerialized again and is materialized only by consumers; you don't transmit the bounded support. The concept of a "start" of text is in fact very artificial; this is not the way people interact with each other or in groups. -------------- next part -------------- An HTML attachment was scrubbed... URL: From billposer2 at gmail.com Thu Jun 5 18:34:11 2014 From: billposer2 at gmail.com (Bill Poser) Date: Thu, 5 Jun 2014 16:34:11 -0700 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: A few years ago there was a company in Australia that was developing a multilingual language called Protium Blue.
The lead was someone named Diarmuid Pigott. As far as I can tell, the project has come to an end, but one can still find bits about the project, e.g. this: http://www.qualitytesting.info/forum/topics/what-is-protium-project On Thu, Jun 5, 2014 at 4:03 PM, David Starner wrote: > On Thu, Jun 5, 2014 at 11:14 AM, J. Leslie Turriff > wrote: > > All true; but do any languages allow for keywords (if, then, > else, do, while, > > until, end, iterate, leave, call return, exit,...) to be expressed in the > > programmer's locale? > > Both ALGOL 60 and ALGOL 68 had compiler-dependent source > representations, so the Europeans could use their own words for > keywords and use commas as decimal points. I'm pretty sure no one had > invented the concept of a user's locale yet, but it would probably > come configured for whatever local locale you wanted. (I assume for a > machine that cost $14 million in 1966, such adjustments could be made > for a single customer.) > > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 5 18:55:57 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 6 Jun 2014 01:55:57 +0200 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: IMHO, a programming language that accepts non-ASCII identifiers should always normalize the identifiers it accepts, before entering them in its hashed symbol table. 
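A minimal sketch of that approach in Python (the class and names here are illustrative, not taken from any actual compiler): normalize each identifier to NFC before using it as a symbol-table key, so canonically equivalent spellings resolve to the same symbol.

```python
import unicodedata

class SymbolTable:
    """Toy symbol table that folds canonically equivalent identifier
    spellings together by normalizing to NFC before hashing."""

    def __init__(self):
        self._symbols = {}

    def define(self, identifier, value):
        self._symbols[unicodedata.normalize("NFC", identifier)] = value

    def lookup(self, identifier):
        return self._symbols[unicodedata.normalize("NFC", identifier)]

table = SymbolTable()
table.define("caf\u00e9", 1)       # "café" spelled with precomposed U+00E9
print(table.lookup("cafe\u0301"))  # same identifier spelled e + U+0301: prints 1
```

Without the normalize() calls, the second spelling would raise a KeyError, which is exactly the hazard being discussed.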
And for this type of usage, we strongly need normalization to be stable, but much more so than under the existing stability rules: normalization stability is not guaranteed if the language can accept unassigned code points that may be allocated later and will then normalize differently (the normalization of unassigned code points just assumes a default combining class 0, where reordering and recombining cannot occur, but once code points pass from unassigned to assigned, this may no longer be true). For this reason, a reasonable programming language should restrict itself to only the characters of a defined Unicode version and should not accept characters unassigned in that version. Alternatively, compiled programs should track the Unicode version to make sure that later reusers of compiled programs will link properly to the older compiled programs, by making sure that newer identifiers used in newer programs can never match an identifier defined by an older compiled program assuming a different normalization. Programming languages should follow the practices used in IDNA for security reasons. Then, extending the allowed subset should be done with care: this extension will be compatible *only* if the newly assigned characters added to the extended subset have combining class 0 and are not excluded from recomposition (the composition exclusions). Otherwise, the other characters added in the extension will not be compatible with older versions of the language (if the language cannot check the Unicode version, or does not want to be incompatible with past versions, it will not be able to extend its allowed subset for identifiers safely, and notably not with any combining characters with non-zero combining class). 2014-06-05 19:24 GMT+02:00 Jeff Senn : > > On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > > > On 5 Jun 2014, at 17:46, Jeff Senn wrote: > > > >> That is: are identifiers merely sequences of characters or intended to > be comparable as "Unicode strings" 
(under some sort of compatibility rule)? > > > > In computer languages, identifiers are normally compared only for > equality, as it reduces lookup time complexity. > > Well, in this case we are talking about parsing a source file and > generating internal symbols, so the complexity of the comparison operation > is a red herring. > > The real question is how does the source identifier get mapped into a > (compiled) symbol. (e.g. in C++ this is not an obvious operation) > > If your implication is that there should be no canonicalization (the > string from the source is used as a sequence of characters only, directly > mapped to a symbol), then I predict sticky problems in the future. The > most obvious of which is that in some cases I will be able to change the > semantics of the compiled program by (accidentally) canonicalizing the > source text (an operation, I will point out, that is invisible to the user > in many (most?) Unicode-aware editors). > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ken.whistler at sap.com Thu Jun 5 19:00:47 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Fri, 6 Jun 2014 00:00:47 +0000 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: Hmmm. Any programming language project that derives from someone who describes himself as a "polyhistor", which claims to be polymorphic and pasigraphic and multi-lingual and orthogonal and polysynthetic, which draws its inspiration from the theory of "Natural Language Metasemantics", and which name-drops "the great heritage started by Wilkins and Leibniz", might seem to be doomed from the start. ;-) --Ken P.S. 
For Leibniz and pasigraphy and its application to formal calculation, see: http://en.wikipedia.org/wiki/Characteristica_universalis then quickly run the other way! A few years ago there was a company in Australia that was developing a multilingual language called Protium Blue. The lead was someone named Diarmuid Pigott. As far as I can tell, the project has come to an end, but one can still find bits about the project, e.g. this: http://www.qualitytesting.info/forum/topics/what-is-protium-project -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 5 19:09:30 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 6 Jun 2014 02:09:30 +0200 Subject: Swift In-Reply-To: <53908C6E.4@v.loewis.de> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: Warning! This definition of allowed identifiers has severe security risks: it does not support any kind of normalization or canonical equivalence, and it's impossible to use normalization in the language lexer/parser while making sure that it will be stable over the set of unassigned characters that may be assigned later. This could cause unexpected bindings, initially impossible to enter, to collide later with new normalizations (notably if unassigned code points get assigned to combining characters with non-zero combining class, or to base characters with combining class 0 but forbidden from recombining, i.e. disallowed in standard normalization forms). 
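Detecting such unassigned code points is straightforward; a minimal sketch in Python (illustrative only; note that the answer depends on the Unicode version implemented by the interpreter's unicodedata module, reported by unicodedata.unidata_version, and that general category Cn also covers noncharacters):

```python
import unicodedata

def has_unassigned(text):
    """True if any code point in text has general category Cn (unassigned,
    or a noncharacter) in the Unicode version this Python build's
    unicodedata module implements."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)

print(has_unassigned("abc"))       # False
print(has_unassigned("a\u0378b"))  # True: U+0378 is unassigned
```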
No programming language should allow using unassigned characters; they should be checked and marked as invalid (note: this check can work in a compiled version of the language, but will not work in a repository of source code, where the only possible check is parsing all source files in the repository to make sure that there's no unassigned code point anywhere in their source text; the source repository should enforce this by clearly defining the UCS version it accepts for source files, but as far as I know, no usual source repositories perform this check, which can only be done by extracting all sources from the repository using some bot script that will detect unassigned code points in these sources). The alternative of not allowing any normalization of identifiers is not safe when source code editors may easily renormalize the identifiers, or when these sources may be edited by different users using different input methods. 2014-06-05 17:27 GMT+02:00 "Martin v. Löwis" : > On 04.06.14 11:28, Andre Schappo wrote: > > The restrictions seem a little like IDNA2008. Anyone have links to > > info giving a detailed explanation/tabulation of allowed and non > > allowed Unicode chars for Swift Variable and Constant names? > > The language reference is at > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > For reference, the definition of identifier-character is (read each > line as an alternative) > > identifier-character → Digit 0 through 9 > identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or > U+FE20–U+FE2F > identifier-character → identifier-head > > where identifier-head is > > identifier-head → Upper- or lowercase letter A through Z > identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or > U+00B7–U+00BA > identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or > U+00F8–U+00FF > identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or > U+180F–U+1DBF > identifier-head → U+1E00–U+1FFF > identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, > or U+2060–U+206F > identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or > U+2776–U+2793 > identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF > identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or > U+3040–U+D7FF > identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or > U+FE30–U+FE44 > identifier-head → U+FE47–U+FFFD > identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or > U+40000–U+4FFFD > identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or > U+80000–U+8FFFD > identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or > U+C0000–U+CFFFD > identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD > > As the construction principle for this list, they say > > "Identifiers begin with an upper case or lower case letter A through Z, > an underscore (_), a noncombining alphanumeric Unicode character in the > Basic Multilingual Plane, or a character outside the Basic Multilingual > Plane that isn't in a Private Use Area. After the first character, digits > and combining Unicode characters are also allowed." > > Regards, > Martin > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prosfilaes at gmail.com Thu Jun 5 22:30:50 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 5 Jun 2014 20:30:50 -0700 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: On Thu, Jun 5, 2014 at 5:00 PM, Whistler, Ken wrote: > Any programming language project that derives from someone who describes > himself as a "polyhistor", which claims to be polymorphic and pasigraphic > and > multi-lingual and orthogonal and polysynthetic, which draws its inspiration > from the > theory of "Natural Language Metasemantics", and which name drops "the > great heritage started by Wilkins and Leibniz", might seem to be doomed from > the start. ;-) It reminded me of the replacement for Unicode that was 40,000 times more efficient. It's sad how the major manufacturers keep suppressing all these brilliant new ideas that would revolutionize the world. -- Kie ekzistas vivo, ekzistas espero. From richard.wordingham at ntlworld.com Fri Jun 6 02:16:36 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 6 Jun 2014 08:16:36 +0100 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: <20140606081636.0c26f793@JRWUBU2> On Fri, 6 Jun 2014 01:55:57 +0200 Philippe Verdy wrote: > IMHO, a programming language that accepts non-ASCII identifiers should > always nrmalize the identifiers it accepts, before heeding it in its > hashed symbol table. Unfortunately, C and C++ don't normalise. Consequently, all a compiler can do is to warn about or reject identifiers not in the preferred normalisation. Richard. 
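Richard's point is easy to demonstrate; a small Python sketch (illustrative, not from the thread): two canonically equivalent spellings of the same identifier compare unequal code point by code point, which is how a non-normalizing compiler sees them.

```python
import unicodedata

composed   = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# Compared code point by code point, as C and C++ compilers effectively
# compare identifiers, these are two different identifiers:
print(composed == decomposed)  # False

# ...even though the two spellings are canonically equivalent:
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True
```

Since the compiler cannot merge the two without breaking its own rules, warning about (or rejecting) identifiers that are not already in the preferred normalization form is the only remaining defense.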
From sdaoden at yandex.com Fri Jun 6 06:14:47 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Fri, 06 Jun 2014 13:14:47 +0200 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> References: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> Message-ID: <20140606121447.hCM8g6WO%sdaoden@yandex.com> "Doug Ewell" wrote: |Philippe Verdy wrote: |> Not necessarily true. |> |> [602 words] | |This has nothing to do with the scenario I described, which involved |removing a "BOM" from the start of an arbitrary fragment of data, |thereby corrupting the data because the "BOM" was actually a ZWNBSP. | |If you have an arbitrary fragment of data, don't fiddle with it. | |If you know enough about the data to fiddle with it safely, it's not |arbitrary. Yeah! E.g., on the all-UTF-8 Plan9 research operating system: ?0[9front.update_bomb_git]$ git ls-files --with-tree=master --|wc -l 44983 ?0[9front.update_bomb_git]$ git grep -lI "`print '\ufeff'`" master|wc -l 12 ?0[9front.update_bomb_git]$ git grep -lI "`print '\ufeff'`" master master:9front.hg/lib/font/bit/MAP master:9front.hg/lib/glass master:9front.hg/sys/lib/troff/font/devutf/0100to25ff master:9front.hg/sys/lib/troff/font/devutf/C master:9front.hg/sys/lib/troff/font/devutf/CW master:9front.hg/sys/lib/troff/font/devutf/H master:9front.hg/sys/lib/troff/font/devutf/LucidaSans master:9front.hg/sys/lib/troff/font/devutf/PA master:9front.hg/sys/lib/troff/font/devutf/R master:9front.hg/sys/lib/troff/font/devutf/R.nomath master:9front.hg/sys/src/ape/lib/utf/runetype.c master:9front.hg/sys/src/libc/port/runetype.c --steffen From doug at ewellic.org Fri Jun 6 11:15:23 2014 From: doug at ewellic.org (Doug Ewell) Date: Fri, 06 Jun 2014 09:15:23 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: 
<20140606091523.665a7a7059d7ee80bb4d670165c8327d.bc50fd4323.wbe@email03.secureserver.net> Philippe Verdy wrote: >> If you have an arbitrary fragment of data, don't fiddle with it. > > Thisis your scenario. The simple concept of a unique "start" of text > does not exist in live streams that can start anywhere. So you cannot > always expect that U+FEFF or U+FFFE will only exist once in a strram > and necessaryly at the start of position where you can start reading > it because you may already be past the initial creation of the stream > without having any wya to come back to the "start". An "arbitrary fragment of data" -- I'm going to keep using the exact same phrase until it sinks in -- DOES have a start and an end. THAT is my scenario. > Your assumption just assumes that you can always "rewind" your file, My assumption assumes no such thing. > Now you will argue: this live stream is not plain text, it has a > binary structure. Well, yes. > Yes but only if your consumer application wants to process the full > multiplex. Typically clients will demultiplex the stream and pass it > down to a simpler client that absolutely does not care about the > transport multiplex format. If that downward client is just used to > display the incoming text, it will just wait for text that will be > buffered ine by line and displayed immediately where there's a newline > separator. But even in this case, each line may have been fragmented > so that each fragment will contain a leading BOM which will nto be > necessarily stripped Question: Why did the process that broke the stream into fragments add leading BOMs? > (you have also incorrectly asuumed that a text stream is necessaily > transported over a "reliable" protocol like TCP where there can be no > data loss in the middle Really. I think you have incorrectly asuumed my asuumption. > Texts are inhernetly fragmentable. 
Initially they are transcripts of > human communication and nobody in real life is permanently connected > to someone else and able to remember eveything that was said by > someone else. OK, I think we are far enough removed from Unicode to end this. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From lyratelle at gmx.de Sat Jun 7 12:07:29 2014 From: lyratelle at gmx.de (Dominikus Dittes Scherkl) Date: Sat, 07 Jun 2014 19:07:29 +0200 Subject: *** GMX Spamverdacht *** Re: Swift In-Reply-To: <201406051314.09438.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: <539346D1.6060006@gmx.de> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 05.06.2014 20:14, J. Leslie Turriff wrote: > All true; but do any languages allow for keywords (if, then, else, do, while, > until, end, iterate, leave, call return, exit,...) to be expressed in the > programmer's locale? Oh, in C++ with macros you can do about anything you want. I have a "deutsch.h" that allows you to use German keywords, even including umlauts like in "wähle" instead of "select". It simply replaces them with the English version before the code is compiled. 
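The effect of such a header can be sketched as a textual pre-pass (Python used here purely for illustration; only the "wähle"/"select" pair comes from the mail, the other table entries are invented, and the real deutsch.h uses C++ preprocessor macros rather than a separate rewriting step):

```python
import re

# Illustrative keyword table: only "wähle" -> "select" is mentioned in
# the mail; the remaining entries are made up for this example.
GERMAN_KEYWORDS = {
    "wähle": "select",
    "wenn": "if",
    "sonst": "else",
    "solange": "while",
}

def translate(source):
    """Replace whole-word German keywords with their English versions,
    leaving all other tokens and punctuation untouched."""
    return re.sub(r"\w+",
                  lambda m: GERMAN_KEYWORDS.get(m.group(), m.group()),
                  source)

print(translate("wenn (x < 0) sonst solange (y) wähle(z);"))
# prints: if (x < 0) else while (y) select(z);
```

A macro-based approach has the advantage that the replacement happens inside the compiler's own preprocessor, so line numbers in error messages still match the German source.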
- -- Best regards, Dominikus Dittes Scherkl -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJTk0bRAAoJELBWOtEemFJV09EIALp7W8m2aDwmAtI4xCtQ9tNv JR1bcyKNjvKkObYe/dQwVwM9VNTzLKRcxqx+aMw0tqu0GjxituSv144s4lMBmgIr ocFWFRVyD5qT3gotDEEaB+KS57Yijg1EY2NpDJoH8mAyFVi02Miv64gbDGBdVZWb hMVCDwOBgOo7CvA7hrhNv9kEI/V1hC0d30/mjbSgAHVaMZa/CiCgbL5X4546jfw2 WBrAOh2xTQexWg24ENWQREn987WKKmErinoo/v0oPtTB4uDQqhkQ+0n5KTzgm6V4 EBwSSR++AmEVp/PdhllonqirkXLU0mI/W5gS6ZSdHRdeFiYJHeNwsg9WtlZUK84= =6+fX -----END PGP SIGNATURE----- From verdy_p at wanadoo.fr Sat Jun 7 13:07:53 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 7 Jun 2014 20:07:53 +0200 Subject: *** GMX Spamverdacht *** Re: Swift In-Reply-To: <539346D1.6060006@gmx.de> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> <539346D1.6060006@gmx.de> Message-ID: Not really for localizing the punctuation operators... Plus there are directionality issues with RTL scripts within source text editors... Imagine this statement: "if (x < 0)" and then rename variable "x" and/or keyword "if" to Arabic; do you test for negative or positive values?... Do you want to include Bidi controls within all RTL variable names or keywords? Or make them ignorable in parsed source code? Or add some leading #pragma to set the directionality of source code? (This cannot be done with macros; in fact you need compiler options to set the direction from the first character of the source file, or use some custom "magic" value to guess it with a pre-parsing of the first few lines, to infer the correct interpretation, and probably the encoding too if it's not UTF-8, to process a "#pragma charset".) 2014-06-07 19:07 GMT+02:00 Dominikus Dittes Scherkl : > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Am 05.06.2014 20:14, schrieb J. 
Leslie Turriff: > > > All true; but do any languages allow for keywords (if, then, else, > do, while, > > until, end, iterate, leave, call return, exit,...) to be expressed in the > > programmer's locale? > > Oh, in C++ with macros you can do about anything you want. > I have a "deutsch.h" that allows you to use german keywords, even > including umlauts like in "w?hle" instead of "select". > It simply replaces them by the english version before the code is compiled. > > - -- > > Best regards, > Dominikus Dittes Scherkl > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.10 (MingW32) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQEcBAEBAgAGBQJTk0bRAAoJELBWOtEemFJV09EIALp7W8m2aDwmAtI4xCtQ9tNv > JR1bcyKNjvKkObYe/dQwVwM9VNTzLKRcxqx+aMw0tqu0GjxituSv144s4lMBmgIr > ocFWFRVyD5qT3gotDEEaB+KS57Yijg1EY2NpDJoH8mAyFVi02Miv64gbDGBdVZWb > hMVCDwOBgOo7CvA7hrhNv9kEI/V1hC0d30/mjbSgAHVaMZa/CiCgbL5X4546jfw2 > WBrAOh2xTQexWg24ENWQREn987WKKmErinoo/v0oPtTB4uDQqhkQ+0n5KTzgm6V4 > EBwSSR++AmEVp/PdhllonqirkXLU0mI/W5gS6ZSdHRdeFiYJHeNwsg9WtlZUK84= > =6+fX > -----END PGP SIGNATURE----- > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlturriff at centurylink.net Sat Jun 7 21:20:26 2014 From: jlturriff at centurylink.net (J. Leslie Turriff) Date: Sat, 7 Jun 2014 21:20:26 -0500 Subject: *** GMX Spamverdacht *** Re: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <539346D1.6060006@gmx.de> Message-ID: <201406072120.26074.jlturriff@centurylink.net> On Saturday 07 June 2014 13:07:53 Philippe Verdy wrote: > Note really for localizing the punctuation operators... > > Plus there are directionality issues with RTL scripts withing source text > editors... 
> Imagine this statement: "if (x < 0)" and then rename variable "x" and/or > keyword "if" to Arabic; do you test for negative or positive values?... > > Do you want to include Bidi controls within all RTL variable names or > keywords? or make them ignorable in parsed source code? > > Of add some leading #pragma to set the directionality of source code ? > (this cannot be made with macros, in fact you need compiler options to set > direction from the first character of source file, or use some custom > "magic" value to guess it with a pre-parsing of the first few lines, to > infer the correct interpretation, and probably the encoding too if it's not > UTF-8, to process a "#pgrama charset") Ah. This is getting to be pretty tricky. :-) -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From public at khwilliamson.com Sat Jun 7 23:19:51 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sat, 07 Jun 2014 22:19:51 -0600 Subject: Corrigendum #9 In-Reply-To: <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <5393E467.20707@khwilliamson.com> On 06/02/2014 11:00 AM, Shawn Steele wrote: > To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like the old escape characters to get a dot matrix printer to shift modes, or old word processor internal formatting sequences. > Here's an example of a possible use. 20 some years ago I wrote a front-end to the Unix diff utility. Showing the differences between files (usually 2 versions of the same program's code) is an extremely common programming activity. I do it many times a day. 
One reason is to try to find out why a bug has crept in. In doing so, there are some differences that are not relevant to the task at hand, and their being shown is a significant distraction. For example, in programming, one might have renamed a variable (identifier) because its purpose has changed somewhat and the name should accurately reflect its new function so the reader is not subconsciously misled. It would be nice to be able to suppress the variable name changes from the difference display. There could be thousands of them. By changing the name in each file version to the same noncharacter during the diff, these differences won't be displayed, and there would not be any possible conflict with the input files having that noncharacter in them. (For display the noncharacter is changed back to the original value in its respective file) Further, one might want to ignore the name changes of two variables. Just use a second noncharacter, up to 66. I wrote this long before noncharacters were available. What I do instead is scan the files for rarely used characters until I find enough ones that aren't in the files. For example U+9F is unlikely to appear. Scanning the files takes time. This step could be omitted for noncharacters that are known to be illegal in the input. From asmusf at ix.netcom.com Sat Jun 7 23:33:57 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 07 Jun 2014 21:33:57 -0700 Subject: Corrigendum #9 In-Reply-To: <5393E467.20707@khwilliamson.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> <5393E467.20707@khwilliamson.com> Message-ID: <5393E7B5.2050508@ix.netcom.com> On 6/7/2014 9:19 PM, Karl Williamson wrote: > On 06/02/2014 11:00 AM, Shawn Steele wrote: >> To further my understanding, can someone provide examples of how >> these are used in actual practice? 
I can't think of any offhand and >> the closest I get is like the old escape characters to get a dot >> matrix printer to shift modes, or old word processor internal >> formatting sequences. >> > > Here's an example of a possible use. 20 some years ago I wrote a > front-end to the Unix diff utility. Showing the differences between > files (usually 2 versions of the same program's code) is an extremely > common programming activity. I do it many times a day. One reason is > to try to find out why a bug has crept in. In doing so, there are > some differences that are not relevant to the task at hand, and their > being shown is a significant distraction. For example, in programming, > one might have renamed a variable (identifier) because its purpose has > changed somewhat and the name should accurately reflect its new > function so the reader is not subconsciously misled. It would be nice > to be able to suppress the variable name changes from the difference > display. There could be thousands of them. By changing the name in > each file version to the same noncharacter during the diff, these > differences won't be displayed, and there would not be any possible > conflict with the input files having that noncharacter in them. (For > display the noncharacter is changed back to the original value in its > respective file) Further, one might want to ignore the name changes > of two variables. Just use a second noncharacter, up to 66. > > I wrote this long before noncharacters were available. What I do > instead is scan the files for rarely used characters until I find > enough ones that aren't in the files. For example U+9F is unlikely to > appear. Scanning the files takes time. This step could be omitted > for noncharacters that are known to be illegal in the input. > > This "illegal in the input" so "I'm free to assume I can use them for my purposes" was definitely the primary(!) design goal discussed when the set of 32 were added to Unicode. 
Having UTC backpedal from that, many years after original design, based on a single meeting and without public review is really a breakdown of the process. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Sun Jun 8 10:47:16 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 08 Jun 2014 09:47:16 -0600 Subject: Corrigendum #9 In-Reply-To: <5393E7B5.2050508@ix.netcom.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> <5393E467.20707@khwilliamson.com> <5393E7B5.2050508@ix.netcom.com> Message-ID: <53948584.4080102@khwilliamson.com> On 06/07/2014 10:33 PM, Asmus Freytag wrote: > On 6/7/2014 9:19 PM, Karl Williamson wrote: >> On 06/02/2014 11:00 AM, Shawn Steele wrote: >>> To further my understanding, can someone provide examples of how >>> these are used in actual practice? I can't think of any offhand and >>> the closest I get is like the old escape characters to get a dot >>> matrix printer to shift modes, or old word processor internal >>> formatting sequences. >>> >> >> Here's an example of a possible use. 20 some years ago I wrote a >> front-end to the Unix diff utility. Showing the differences between >> files (usually 2 versions of the same program's code) is an extremely >> common programming activity. I do it many times a day. One reason is >> to try to find out why a bug has crept in. In doing so, there are >> some differences that are not relevant to the task at hand, and their >> being shown is a significant distraction. For example, in programming, >> one might have renamed a variable (identifier) because its purpose has >> changed somewhat and the name should accurately reflect its new >> function so the reader is not subconsciously misled. 
It would be nice >> to be able to suppress the variable name changes from the difference >> display. There could be thousands of them. By changing the name in >> each file version to the same noncharacter during the diff, these >> differences won't be displayed, and there would not be any possible >> conflict with the input files having that noncharacter in them. (For >> display the noncharacter is changed back to the original value in its >> respective file) Further, one might want to ignore the name changes >> of two variables. Just use a second noncharacter, up to 66. >> >> I wrote this long before noncharacters were available. What I do >> instead is scan the files for rarely used characters until I find >> enough ones that aren't in the files. For example U+9F is unlikely to >> appear. Scanning the files takes time. This step could be omitted >> for noncharacters that are known to be illegal in the input. >> >> > This "illegal in the input" so "I'm free to assume I can use them for my > purposes" was definitely the primary(!) design goal discussed when the > set of 32 were added to Unicode. Having UTC backpedal from that, many > years after original design, based on a single meeting and without > public review is really a breakdown of the process. > > A./ I should note that this front-end to 'diff' changes the input files, writes the modified versions out, and calls 'diff' with those modified files as its inputs. By using noncharacters, it would be depending on 'diff' to 1) not use them, and 2) to not filter them out, and 3) for the system to be able to store and retrieve them in files. I think a revision to the text was advisable to clarify that 2) and 3) were acceptable. I haven't heard anybody on this thread disagree with that. But item 1) shows how tricky this issue really is. My utility looks like a fancier 'diff' to those people who call it, so they would be justified in wanting it not to use noncharacters because they have their own purposes for them. 
If some of those callers were themselves utilities, their callers might want to use noncharacters for their own purposes. And so on and so on. I don't have a good answer, except to say that Asmus' characterization above looks reasonable. The purpose of public reviews is to try to get a broad range of ideas, and if none are forthcoming, then the fact that there was such a review should be an adequate defense of the ultimate decision. Not holding a review is an invitation to lingering suspicions on the part of the public about the motives behind any such decision. These can fester and the trust level is permanently diminished. There will always be people who won't like the decision, and who will assume that the deciders are malevolent. But the vast majority will accept a decision that seems to have been made in good faith after public input. This is just how things work, no matter what the venue or issue. It may be that the UTC thought this was minor enough to not require a review, but if so, time has shown that to have been an incorrect perception. From Shawn.Steele at microsoft.com Sun Jun 8 11:35:28 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 8 Jun 2014 16:35:28 +0000 Subject: Corrigendum #9 In-Reply-To: <53948584.4080102@khwilliamson.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> <5393E467.20707@khwilliamson.com> <5393E7B5.2050508@ix.netcom.com> <53948584.4080102@khwilliamson.com> Message-ID: > I should note that this front-end to 'diff' changes the input files, writes the modified versions out, and calls 'diff' with those modified files as its inputs. By using noncharacters, it would be depending on 'diff' to 1) not use them, and 2) to not filter them out, and 3) for the system to be able to store and retrieve them in files. 
In my view that is still "internal" to your app's use of these characters :) The original text doesn't say that my application cannot store & retrieve them from files for internal use. On the contrary, I'd expect proprietary formats for internal use to require that. I agree that the original text is a bit vague on the question of tools to inspect/modify/whatever your internal use. -Shawn From unicode at norbertlindenberg.com Sun Jun 8 21:46:36 2014 From: unicode at norbertlindenberg.com (Norbert Lindenberg) Date: Sun, 8 Jun 2014 19:46:36 -0700 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: <01CBD00E-0907-470D-93FC-846E3785555E@norbertlindenberg.com> It does allow some usage that may surprise code reviewers – for example, this is a valid Swift program: let s = "??" let s? = "??" let ? = "??" let all = s + s? + ? The value of the constant 'all' is "??????". Or at least it is as long as mail software doesn't harm the variation selectors… Norbert On Jun 5, 2014, at 9:06 , Mark Davis ?? wrote: > I haven't done any analysis, but on first glance it looks like it is based on > > http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax > > > Mark > > – Il meglio è l'inimico del bene – > > > On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn wrote: > Has anyone figured out whether character sequences that are non-canonical (de)compositions but could be recomposed to the same result > are the same identifier or not? > > That is: are identifiers merely sequences of characters or intended to be comparable as 'Unicode strings' (under some sort of compatibility rule)? > > On Jun 5, 2014, at 11:27 AM, Martin v. Löwis wrote: > > > On 04.06.14 11:28, Andre Schappo wrote: > >> The restrictions seem a little like IDNA2008.
Anyone have links to > >> info giving a detailed explanation/tabulation of allowed and non > >> allowed Unicode chars for Swift Variable and Constant names? > > > > The language reference is at > > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > > > For reference, the definition of identifier-character is (read each > > line as an alternative) > > > > identifier-character → Digit 0 through 9 > > identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or > > U+FE20–U+FE2F > > identifier-character → identifier-head > > > > where identifier-head is > > > > identifier-head → Upper- or lowercase letter A through Z > > identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or > > U+00B7–U+00BA > > identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or > > U+00F8–U+00FF > > identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or > > U+180F–U+1DBF > > identifier-head → U+1E00–U+1FFF > > identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, > > or U+2060–U+206F > > identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or > > U+2776–U+2793 > > identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF > > identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or > > U+3040–U+D7FF > > identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or > > U+FE30–U+FE44 > > identifier-head → U+FE47–U+FFFD > > identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or > > U+40000–U+4FFFD > > identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or > > U+80000–U+8FFFD > > identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or > > U+C0000–U+CFFFD > > identifier-head →
U+D0000–U+DFFFD or U+E0000–U+EFFFD > > > > As the construction principle for this list, they say > > > > "Identifiers begin with an upper case or lower case letter A through Z, > > an underscore (_), a noncombining alphanumeric Unicode character in the > > Basic Multilingual Plane, or a character outside the Basic Multilingual > > Plane that isn't in a Private Use Area. After the first character, digits > > and combining Unicode characters are also allowed." > > > > Regards, > > Martin > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From verdy_p at wanadoo.fr Tue Jun 10 02:03:35 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 10 Jun 2014 09:03:35 +0200 Subject: Swift In-Reply-To: <01CBD00E-0907-470D-93FC-846E3785555E@norbertlindenberg.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <01CBD00E-0907-470D-93FC-846E3785555E@norbertlindenberg.com> Message-ID: Variation selectors are within the subset of characters that should never be permitted in programming identifiers; they could cause surprising results, such as adding new APIs or backdoors that would not be detected by code reviewers looking at the code. But if you allow them in the language, the first thing you'll need to integrate in your project is a source code scanner that detects unsafe characters (including checking the list of confusables, and enforcing the normalization of the source code before compiling it, as text editors may break these normalizations unexpectedly).
Such a tool should run routinely, just like the tools that reindent code and enforce common presentation conventions in order to ease exploration and searches in the source code, simplify review, and facilitate the use of regexps in editors. There are also tools that inspect how well the code is documented, warn when documentation is missing for publicly exposed variables and APIs, and try to infer dependencies. Such tools fall in the same category as the old "lint" tool for C (largely obsolete in its original form, since most of its rules are now integrated into the language itself and enforced by compilers to ensure type safety). The risk, however, is higher in untyped or weakly typed languages like JavaScript/ECMAScript, where all objects can be overloaded freely and confusing identifiers could create unseen security risks. Note that identifiers do not exist only in programming languages; they appear in other kinds of APIs as well (notably web APIs and protocols transmitting data such as encoded web forms), even if those identifiers are ultimately used and exposed in a host language such as JSON, HTML, XML, or CSS, possibly with some escaping mechanism. "Identifiers" should also be interpreted broadly to include symbolic operators when the language or API allows their extension, overloading, or derivation. (Unicode identifiers, and identifiers in classic languages like HTML, XML, C/C++, Java, Cobol, Fortran, Ada, PHP, Python, or assembly languages, are more restricted in their allowed repertoire, and all other extensions require explicit escaping whose decoding should not weaken security.) Identifiers for data, by contrast, may be very liberal (e.g. if we want to allow toponyms, people's names, or trademarks), as they frequently need significant punctuation or symbols as well as spacing or word separation.
This is even more critical for work names/titles, pagenames and filenames in an open collection: these identifiers or names should resolve unambiguously to the document or data intended (and generally this implies developing naming conventions and some required classification system to get an accurate inventory of available data, making it possible to inspect this inventory and detect undesirable or conflicting items). I am convinced that for such open inventories or collections, normalization should never be optional: it should be enforced and automated as early as possible, even if we accept input in non-normalized forms. Any programming language or protocol that considers using a large repertoire from the UCS should look seriously at the security specifications in the Unicode standard and its annexes, and consider what has been done and discussed to maintain the security of the worldwide DNS within IDNA. The risks arising from instability of normalization, if you allow unassigned code points, are real, because they can easily be exploited by automated tools, and human reviewers will not detect such attacks without tools that check normalization. Code checkers should immediately warn about usage of code points not assigned in a known version of the UCS; and if they upgrade that version, they should make sure that all other tools in the chain check the same version.
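The kind of source scanner described here can be approximated with Python's `unicodedata` module. This is only a sketch; the `scan` function and its report format are invented for illustration. Note that it checks against whatever UCD version the interpreter ships, which is exactly the version-pinning problem raised above.

```python
import unicodedata

# Sketch of a source scanner: flag code points unassigned in the
# interpreter's copy of the UCD and warn when the text is not
# NFC-normalized (unicodedata.is_normalized needs Python 3.8+).
def scan(source):
    issues = []
    for i, ch in enumerate(source):
        # General_Category Cn covers unassigned code points
        # (and noncharacters, which also carry gc=Cn).
        if unicodedata.category(ch) == "Cn":
            issues.append((i, f"U+{ord(ch):04X} is unassigned or a noncharacter"))
    if not unicodedata.is_normalized("NFC", source):
        issues.append((-1, "source is not NFC-normalized"))
    return issues

# The UCD version these checks are implicitly pinned to:
ucd_version = unicodedata.unidata_version
```

A real checker would also consult the confusables data from UTS #39, which the standard library does not ship.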
But some identifiers are not found in source code at all and are instead generated at runtime using dynamic language features (dynamic binder libraries or reflection APIs should also perform their own checks, and will then need to embed a minimum database of the code points assigned in that UCS version; this could complicate compatibility maintenance, notably requiring a version negotiation mechanism and integration of the version property of assigned code points). 2014-06-09 4:46 GMT+02:00 Norbert Lindenberg : > It does allow some usage that may surprise code reviewers – for example, > this is a valid Swift program: > > let s = "??" > let s? = "??" > let ? = "??" > let all = s + s? + ? > > The value of the constant 'all' is "??????". Or at least it is as long as > mail software doesn't harm the variation selectors… > > Norbert > > > On Jun 5, 2014, at 9:06 , Mark Davis ?? wrote: > > > I haven't done any analysis, but on first glance it looks like it is > based on > > > > http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax > > > > > > Mark > > > > – Il meglio è l'inimico del bene – > > > > > > On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn wrote: > > Has anyone figured out whether character sequences that are > non-canonical (de)compositions but could be recomposed to the same result > > are the same identifier or not? > > > > That is: are identifiers merely sequences of characters or intended to > be comparable as 'Unicode strings' (under some sort of compatibility rule)? > > > > On Jun 5, 2014, at 11:27 AM, Martin v. Löwis wrote: > > > > > On 04.06.14 11:28, Andre Schappo wrote: > > >> The restrictions seem a little like IDNA2008. Anyone have links to > > >> info giving a detailed explanation/tabulation of allowed and non > > >> allowed Unicode chars for Swift Variable and Constant names?
> > > > > > The language reference is at > > > > > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > > > > > For reference, the definition of identifier-character is (read each > > > line as an alternative) > > > > > > identifier-character → Digit 0 through 9 > > > identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or > > > U+FE20–U+FE2F > > > identifier-character → identifier-head > > > > > > where identifier-head is > > > > > > identifier-head → Upper- or lowercase letter A through Z > > > identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or > > > U+00B7–U+00BA > > > identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or > > > U+00F8–U+00FF > > > identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or > > > U+180F–U+1DBF > > > identifier-head → U+1E00–U+1FFF > > > identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, > > > or U+2060–U+206F > > > identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or > > > U+2776–U+2793 > > > identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF > > > identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or > > > U+3040–U+D7FF > > > identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or > > > U+FE30–U+FE44 > > > identifier-head → U+FE47–U+FFFD > > > identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or > > > U+40000–U+4FFFD > > > identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or > > > U+80000–U+8FFFD > > > identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or > > > U+C0000–U+CFFFD > > > identifier-head →
U+D0000–U+DFFFD or U+E0000–U+EFFFD > > > > > > As the construction principle for this list, they say > > > > > > "Identifiers begin with an upper case or lower case letter A through Z, > > > an underscore (_), a noncombining alphanumeric Unicode character in the > > > Basic Multilingual Plane, or a character outside the Basic Multilingual > > > Plane that isn't in a Private Use Area. After the first character, > digits > > > and combining Unicode characters are also allowed." > > > > > > Regards, > > > Martin > > > _______________________________________________ > > > Unicode mailing list > > > Unicode at unicode.org > > > http://unicode.org/mailman/listinfo/unicode > > > > > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Tue Jun 10 06:51:44 2014 From: frederic.grosshans at gmail.com (=?windows-1252?Q?Fr=E9d=E9ric_Grosshans?=) Date: Tue, 10 Jun 2014 13:51:44 +0200 Subject: Quasiquotation marks Message-ID: <5396F150.4080905@gmail.com> This week's shady character introduces quasiquotation marks, used in fanzines since at least 1944 for "in substance" quotation. This mark is the superposition of " (or ') with -. http://www.shadycharacters.co.uk/2014/06/miscellany-49-quasiquote/ This looks like a good candidate for Unicode encoding, with many discussions in the linked blog posts and comments being about recreating it through rich text (word processor/CSS/TeX...).
Frédéric From verdy_p at wanadoo.fr Tue Jun 10 07:29:49 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 10 Jun 2014 14:29:49 +0200 Subject: Math input methods In-Reply-To: <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> Message-ID: ????????????? are without doubt more useful and more common in double-struck styles than in Fraktur styles. But there are cases where they will be distinctly replaced by bold letters (notably when working with homomorphic/dual sets correlated bijectively with them but having distinct projections/coordinates in a numeral set, provided that there's a defined pair of operations for compositing these coordinates with elementary base elements in the dual/homomorphic set). As soon as you start using derived styles for such notation of duals, you'll immediately want to use the same styles for deriving elements (numbers) of these numeral sets, so you'll get double-struck or bold or Fraktur variable names and digits (notably for zero, one and the i, j, k used by some other extended sets that drop some property like commutativity or distributivity of the basic operations). These styles then become functionally equivalent to diacritics, or to a prepended operator, or to subscripts/superscripts denoting the set into which they are projected or their numeral system, except that they are represented in a compact composite which is not easily decomposable. But it is still possible: ????????????? could be written as well: \SetN \in \SetZ \in \SetQ \in \SetR \in \SetC (TeX is commonly used for such notation when composing documents). The notation is functionally equivalent but it obscures notations that are already complex, so mathematicians invent various shortcuts or compact representations in their text.
But you can't simply treat these notations as preferred visual styles; these styles have important strict definitions that disambiguate the meaning. These formulas also have strict layout restrictions, much more than usual plain text. We are in a corner case where it is just safer to consider that maths notations are not text but binary objects that do not work very well with the Unicode character model, and that are also far from the weak definition of symbols. Even a basic variable name 'x' is not a letter x of the Latin script: its letter case cannot be changed, it cannot be freely transliterated, and side-by-side letters do not form "words". Their grammar is in a very specific language and is highly contextual (and frequently altered by document-specific prior definitions and conventions). No plain-text algorithm will work correctly with maths notations, notably at advanced levels (not the level taught in schools for children learning arithmetic for use in daily social life). In fact I have doubts that Unicode should make great efforts to encode them as more than an informal collection of independent symbols living in their own world, with their own script and their own "language(s)" (and there are many languages, as many as there are authors in fact, and frequently more, when authors invent specific languages for specific documents). Most people in the world can't understand the level of abstraction meant by these notations.
And it is already hard for them to accept the concept of "negative" numbers and understand that the values of almost all reals are not even representable, or what a complex number means; even the concept of multiplication of numbers is difficult to understand unless you bind it to a 2D Cartesian space, and immediately they wonder what happens in their visible 3D world; then let's not speak about zeroes, or infinities, or curved spaces, or fractal dimensions, or about infinitesimal quantities that are not absolutely comparable in our commonly perceived Cartesian space, of which we have a limited vision... Their vision is more pragmatic: as long as they have a solution (or a tool to compute it) and it gives satisfaction in most of the cases they can perceive in their life, they will not need to go further; their vision is bound to their "experience" (and experience is not bad in itself, it is a strong basis for science, propagation of knowledge and utility). Most people are not probabilists; they favor statistics for remembering their experience, guiding their choices of action immediately, and explaining their intents to others. 2014-06-05 10:57 GMT+02:00 Hans Aberg : > On 5 Jun 2014, at 04:50, David Starner wrote: > > > On Wed, Jun 4, 2014 at 6:00 AM, Jukka K. Korpela > wrote: > >> The change is logical in the sense that bold face is a > >> more original notation and double-struck letters as characters imitate > the > >> imitation of boldface letters when writing by hand (with a pen or piece > of > >> chalk). > > > > On the other hand, bold face is a minor variation on normal types. > > Double-struck letters are more clearly distinct, which is probably why > > they moved from the chalkboard to printing in the first place. I don't > > see much advantage of ?????????? over ?????, especially when > > confusability with NCRZQ comes into play. > > The double-struck letters are useful in math, because they free other > letter styles for other use.
First, only a few were used, as for the natural, > rational, real and complex numbers, but they became popular, so that all letters, > uppercase and lowercase, are now available in Unicode. > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gansmann at uni-bonn.de Tue Jun 10 07:35:13 2014 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Tue, 10 Jun 2014 14:35:13 +0200 Subject: Quasiquotation marks In-Reply-To: <5396F150.4080905@gmail.com> References: <5396F150.4080905@gmail.com> Message-ID: On Tue, 10 Jun 2014 13:51:44 +0200, Frédéric Grosshans wrote: > This week's shady character introduces quasiquotation marks, used in fanzines since at least 1944 for "in substance" quotation. This mark is the superposition of " (or ') with -. > > http://www.shadycharacters.co.uk/2014/06/miscellany-49-quasiquote/ > > This looks like a good candidate for Unicode encoding, with many discussions in the linked blog posts and comments being about recreating it through rich text (word processor/CSS/TeX...). This reminds me of the special quotation marks shown and discussed here (with no satisfying conclusion in my opinion): http://german.stackexchange.com/q/10055/2594 Gerrit Ansmann From verdy_p at wanadoo.fr Tue Jun 10 07:39:51 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 10 Jun 2014 14:39:51 +0200 Subject: Quasiquotation marks In-Reply-To: <5396F150.4080905@gmail.com> References: <5396F150.4080905@gmail.com> Message-ID: Aren't they just standard quotes with a basic style? (overstriking with <strike> or <del> in HTML) How are they different from quoting multiple personalities, each one with their own color (red, green, blue, black for the author, grey for side remarks...)
There are certainly lots of combinations to denote contexts of quotations or add intended emphasis from an author, including changing fonts (italics, bold, font size, character spacing, decorations, indentations, ...). Every possible style already working in documents can be used in such combinations. But even in this case, it is possible to extract a part of it that has a standard text meaning, even if its contextual usage is different and carries additional semantics with these styles. 2014-06-10 13:51 GMT+02:00 Frédéric Grosshans : > This week's shady character introduces quasiquotation marks, used in > fanzines since at least 1944 for "in substance" quotation. This mark is > the superposition of " (or ') with -. > > http://www.shadycharacters.co.uk/2014/06/miscellany-49-quasiquote/ > > This looks like a good candidate for Unicode encoding, with many > discussions in the linked blog posts and comments being about recreating it > through rich text (word processor/CSS/TeX...). > > Frédéric > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Tue Jun 10 07:57:22 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Tue, 10 Jun 2014 14:57:22 +0200 Subject: Math input methods In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> Message-ID: On 10 Jun 2014, at 14:29, Philippe Verdy wrote: > ????????????? are without doubt more useful and more common in double-struck styles than in Fraktur styles. Fraktur would normally be for Lie algebras. For sets, some other style or none. And logicians use their own notation.
From leoboiko at namakajiri.net Tue Jun 10 08:33:08 2014 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Tue, 10 Jun 2014 10:33:08 -0300 Subject: Quasiquotation marks In-Reply-To: References: <5396F150.4080905@gmail.com> Message-ID: What about using U+0331 "combining macron below" or U+0320 "combining minus below"? Here are some samples: U+0331 "?test"? ??test?? U+0320 "?test"? ??test?? 2014-06-10 9:39 GMT-03:00 Philippe Verdy : > (overstriking with <strike> or <del> in HTML) Modern HTML phased out <strike>, and <del> has semantic meanings inappropriate for this case. It would be better to use CSS "text-decoration: line-through". This point has been raised in the comments of the original post. > How are they different from quoting multiple personalities, each one with their own color (red, green, blue, black for the author, grey for side remarks...) That could be bad for people with color blindness (which may reach up to some 10% of the genetically male population).
<del> will not be phased out, for the same reason that the other presentational elements will be kept. My opinion is that it is even better to use these elements than to fix a dependency on style="" attributes spread everywhere in the document. These elements give useful placements where you can contextually apply the styles matching your presentation; they carry the semantics that a style does not carry at all (because they are not "cascading" even if they are styled with CSS). What makes the cascade in CSS is not what you put in styles; it is the structure of elements in the document, which you can contextually and semantically preserve in your "selectors". So <del> is another way; you could as well use a ... -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Tue Jun 10 11:22:22 2014 From: public at khwilliamson.com (Karl Williamson) Date: Tue, 10 Jun 2014 10:22:22 -0600 Subject: Apparent discrepancy between FAQ and Age.txt Message-ID: <539730BE.3030408@khwilliamson.com> The FAQ http://www.unicode.org/faq/private_use.html#sentinels says that the last 2 code points on the planes except the BMP were made noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. "The conformance wording about U+FFFE and U+FFFF changed somewhat in Unicode 2.0, but these were still the only two code points with this unique status" Unicode 3.1 [2001] was the watershed for the development of noncharacters in the standard. Unicode 3.1 was the first version to add supplementary characters to the standard.
As a result, it also had to come to grips with the fact that ISO/IEC 10646-2:2001 had reserved the last two code points of every plane as "not a character" From sdaoden at yandex.com Tue Jun 10 12:05:38 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Tue, 10 Jun 2014 19:05:38 +0200 Subject: Apparent discrepancy between FAQ and Age.txt In-Reply-To: <539730BE.3030408@khwilliamson.com> References: <539730BE.3030408@khwilliamson.com> Message-ID: <20140610180538.in5AqjGH%sdaoden@yandex.com> Hello, Karl Williamson wrote: |The FAQ http://www.unicode.org/faq/private_use.html#sentinels |says that the last 2 code points on the planes except BMP were made |noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. The (nothing but informational except for @missing lines) comments in DerivedAge.txt state very clearly: # - The supplementary private use code points and the non-character code points # were assigned in version 2.0, but not specifically listed in the UCD # until versions 3.0 and 3.1 respectively. |"The conformance wording about U+FFFE and U+FFFF changed somewhat in |Unicode 2.0, but these were still the only two code points with this |unique status" | |Unicode 3.1 [2001] was the watershed for the development of |noncharacters in the standard. Unicode 3.1 was the first version to add |supplementary characters to the standard. As a result, it also had to |come to grips with the fact that ISO/IEC 10646-2:2001 had reserved the |last two code points for every plane as "not a character" Less scattering of information would be a pretty cool thing nonetheless. I.e., I think it would be less academic but much nicer if no FAQ were necessary at all because the standard as such covered the background information, too.
I remember that one of the reasons I stopped any effort to go with (the roughly 120 German Mark book of) Unicode 3.0 was that I was incapable of wrapping my head around a combining Arabic example somewhere; you need access to technical reports to get it done. --steffen From frederic.grosshans at gmail.com Tue Jun 10 12:14:29 2014 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Tue, 10 Jun 2014 19:14:29 +0200 Subject: Quasiquotation marks In-Reply-To: References: <5396F150.4080905@gmail.com> Message-ID: <53973CF5.5020508@gmail.com> On 10/06/2014 15:33, Leonardo Boiko wrote: > What about using U+0331 "combining macron below" or U+0320 "combining > minus below"? That would be more similar to the underline hack discussed briefly here: http://fanac.org/Fannish_Reference_Works/Fan_terms/Fan_terms-07.html But I think it's the wrong character: typewriters had the underscore (_) character, which was used to underline, and which was sometimes used as a "combining macron below". But this was not the one chosen in the 1940s to create the quasiquote; it was the hyphen. Using U+0335 COMBINING SHORT STROKE OVERLAY seems closer to the original. The various posts linked in that thread say that these quasiquotation marks were practical, but everyone says "They are difficult to use with modern word processors!" Given the fact that it is possible to reproduce them with U+0335 COMBINING SHORT STROKE OVERLAY, what is the practice about encoding them as a new character? Would the case (given enough usage proof) be similar to the encoding of U+024F LATIN SMALL LETTER Y WITH STROKE, which, I guess, probably has a similar origin in overstruck typewriter characters? Or does the fact that the stroke doesn't touch the quotes make the situation closer to non-existing precomposed latin character + diacritic combinations (http://www.unicode.org/faq/char_combmark.html#13), so that a specific symbol would be against NFC stability and un-encodeable?
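The U+0335 suggestion above is easy to check against the normalization-stability worry: since no precomposed "quote with stroke" exists in the UCD, NFC and NFD both leave the combining sequence untouched. A quick sketch (variable names invented for illustration):

```python
import unicodedata

# Build a quasiquote as '"' plus U+0335 COMBINING SHORT STROKE OVERLAY
# and verify that normalization leaves the sequence alone: there is no
# precomposed quote-with-stroke for it to compose into.
quasi = '"' + "\u0335"
text = quasi + "in substance" + quasi + " quotation"

nfc_stable = unicodedata.normalize("NFC", text) == text
nfd_stable = unicodedata.normalize("NFD", text) == text
```

So representing the mark as a combining sequence is stable today; what NFC stability would forbid is a *new precomposed* character that canonically decomposes to this sequence.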
From ken.whistler at sap.com Tue Jun 10 14:04:57 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Tue, 10 Jun 2014 19:04:57 +0000 Subject: Apparent discrepancy between FAQ and Age.txt In-Reply-To: <539730BE.3030408@khwilliamson.com> References: <539730BE.3030408@khwilliamson.com> Message-ID: Karl Williamson noted: > The FAQ http://www.unicode.org/faq/private_use.html#sentinels > says that the last 2 code points on the planes except BMP were made > noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. > The *concept* of noncharacter was not invented until Unicode 3.1, so it could not have formally been applied to anything before then. Before Unicode 3.1, some code points had been referred to as "not a character", but it took a while for the UTC to rationalize the details systematically. Unicode 3.1 was the first version to formally introduce Noncharacter_Code_Point as a property and apply it to FFFE/FFFF (as well as the other noncharacters). Unicode 2.0 introduced the concept of Unicode scalar value and established the framework of definitions and conformance clauses now familiar in Chapter 3 (although it was pretty rough around the edges back then). It also documented UTF-8 (although at that point it was still in an annex), and that *required* a mapping between the UTF-16 and UTF-8 forms of 0xnFFFE and 0xnFFFF on each plane. The Age value derives from that. U+FFFE and U+FFFF themselves were given Age=1.1 because they were part of Unicode 1.1, before Unicode 2.0 formally documented the addition of the rest of the planes. Earlier still, when Unicode was still trying to be a pure 16-bit encoding, FFFE and FFFF were simply outside the codespace. Incidentally, the property Age wasn't introduced until Unicode 3.2, so technically speaking it didn't exist before then, either. However, Age values were derived retroactively back to Version 1.1 when parceling out the initial assignments as of Unicode 3.2.
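The Age values discussed here live in DerivedAge.txt, whose range syntax is simple to read mechanically. A minimal sketch: the `parse_ages`/`age_of` helpers are invented, the file is assumed to be a local copy from the UCD, and the two sample lines mirror the values cited in this thread.

```python
import re

# DerivedAge.txt data lines look like
#   "FDD0..FDEF ; 3.1 # ..."  (range)  or  "0041 ; 1.1 # ..."  (single).
AGE_LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*([\d.]+)")

def parse_ages(lines):
    ranges = []
    for line in lines:
        m = AGE_LINE.match(line)
        if m:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            ranges.append((start, end, m.group(3)))
    return ranges

def age_of(cp, ranges):
    for start, end, age in ranges:
        if start <= cp <= end:
            return age
    return None  # no Age: unassigned

# Sample lines mirroring the values discussed in this thread:
sample = [
    "FFFE..FFFF    ; 1.1 #   [2] <noncharacter-FFFE>..<noncharacter-FFFF>",
    "1FFFE..1FFFF  ; 2.0 #   [2] <noncharacter-1FFFE>..<noncharacter-1FFFF>",
]
ranges = parse_ages(sample)
```

A production parser would also honor the `@missing` lines that Steffen mentions; this sketch ignores comments entirely.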
Note also that although the majority of the repertoire in Unicode 1.1 actually was already assigned as of Unicode 1.0, no attempt was made to assign Age=1.0 to any characters, because of the churn and renaming that occurred as a result of the Unicode 1.0 and ISO 10646-1 merger effort back in the early 1990's. --Ken From public at khwilliamson.com Wed Jun 11 23:29:53 2014 From: public at khwilliamson.com (Karl Williamson) Date: Wed, 11 Jun 2014 22:29:53 -0600 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <53992CC1.3010101@khwilliamson.com> On 06/02/2014 09:48 AM, Markus Scherer wrote: > On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell > wrote: > > I suspect everyone can agree on the edge cases, that noncharacters are > harmless in internal processing, but probably should not appear in > random text shipped around on the web. > > > Right, in principle. However, it should be ok to include noncharacters > in CLDR data files for processing by CLDR implementations, and it should > be possible to edit and diff and version-control and web-view those > files etc. > > It seems that trying to define "interchange" and "public" in ways that > satisfy everyone will not be successful. > > The FAQ already gives some examples of where noncharacters might be > used, should be preserved, or could be stripped, starting with "Q: Are > noncharacters intended for interchange? > " > > In my view, those Q/A pairs explain noncharacters quite well. If there > are further examples of where noncharacters might be used, should be > preserved, or could be stripped, and that would be particularly useful > to add to the examples already there, then we could add them. > > markus > > I was unaware of this FAQ. Having read it and re-read this entire thread, I am still troubled. 
I have something like a library that was written a long time ago (not by me) assuming that noncharacters were illegal in open interchange. Programs that use the library were guaranteed that they would not receive noncharacters in their input. They thus are free to use any noncharacter internally as they wish. Now that Corrigendum #9 has come out, I'm getting requests to update the library to not reject noncharacters. The library itself does not use noncharacters. If I (or someone else) makes the requested change, it may silently cause security holes in those programs that were depending on it doing the rejection, and which upgrade to use the new version. Some of these programs may have been written many years ago. The original authors are now dead in some instances, or have turned the code over to someone else, or haven't thought about it in years. The current maintainers of those programs may be unaware of this dependence, and hence may upgrade without realizing the consequences. Further, the old versions of the library will soon be unsupported, so there is pressure to upgrade to get bug fixes and the promise of future support. This means there could be security holes that a hacker who gets hold of the source can exploit. I don't see anything in the FAQ that really addresses this situation. I think there should be an answer that addresses code written before the Corrigendum, and that goes into detail about the security issues. My guess is that the UTC did not really consider the potential for security holes when making this Corrigendum. I agree that CLDR should be able to use noncharacters for internal processing, and that they should be able to be stored in files and edited. But I believe that version control systems and editors have just as much right to use noncharacters for their internal purposes. I disagree with the FAQ where it seems to say that if you write a utility you should avoid using noncharacters in its implementation. 
It might be that competitive pressure, or just that the particular implementations don't need non-characters, would cause some such utilities to accept some or all non-characters as inputs. But if I were writing such code, I can see now how using noncharacters for my purposes would be quite convenient. CLDR could be considered to be a utility, and its users might want to use noncharacters for their purposes. Is CLDR constructed so there is no potential for conflicts here? That is, does it reserve certain noncharacters for its own use? The FAQ talks about how various now-noncharacter code points were touted as sentinel candidates in earlier Unicode versions, and that they are no longer so. But it really should emphasize that old code may very well want to continue to use them as sentinels. The answer "Well, the short answer is no, that is not true--at least, not entirely true." is misleading in this regard. The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not realize that that was considered representable in any UTF. Likewise -1. From markus.icu at gmail.com Thu Jun 12 03:37:49 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 12 Jun 2014 01:37:49 -0700 Subject: Corrigendum #9 In-Reply-To: <53992CC1.3010101@khwilliamson.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson wrote: > I have something like a library that was written a long time ago (not by > me) assuming that noncharacters were illegal in open interchange. Programs > that use the library were guaranteed that they would not receive > noncharacters in their input. They thus are free to use any noncharacter > internally as they wish. Now that Corrigendum #9 has come out, I'm getting > requests to update the library to not reject noncharacters. The library > itself does not use noncharacters. 
If I (or someone else) makes the > requested change, it may silently cause security holes in those programs > that were depending on it doing the rejection, and who upgrade to use the > new version. > If your library makes an explicit promise to remove noncharacters, then it should continue to do so. However, if your library is understood to pass through any strings, except for the advertised processing, then noncharacters should probably be preserved. I don't see anything in the FAQ that really addresses this situation. I > think there should be an answer that addresses code written before the > Corrigendum, and that goes into detail about the security issues. My guess > is that the UTC did not really consider the potential for security holes > when making this Corrigendum. > There is nothing really new in the corrigendum. The UTC felt that some implementers had misinterpreted inconsistent and misleading statements in and around the standard, and clarified the situation. Any process that requires certain characters or sequences to not occur in the input must explicitly check for those, regardless of whether they are noncharacters, private use characters, unassigned code points, control codes, deprecated language tag characters, discouraged stateful formatting controls, stacks of hundreds of diacritics, or whatever. In a sense, noncharacters are much like the old control codes. Some terminals say "beep" when they see U+0007, or go into strange modes when they see U+001B; on Windows, when you read a text file that contains U+001A, it is interpreted as an end-of-file marker. If your process depended on those things not happening, then you would have to strip those control codes on input. But a pass-through-style library will be universally expected not to do anything special with them. I agree that CLDR should be able to use noncharacters for internal > processing, and that they should be able to be stored in files and edited. 
> But I believe that version control systems and editors have just as much > right to use noncharacters for their internal purposes. I disagree. If svn or git choked on noncharacters or control codes or private use characters or unassigned code points etc., I would complain. Likewise, I expect to be able to use plain text or programming editors (gedit, kate, vi, emacs, Visual Studio) to handle files with such characters just fine. I do *not* necessarily expect Word, OpenOffice, or Google Docs to handle all of these. Is CLDR constructed so there is no potential for conflicts here? That is, > does it reserve certain noncharacters for its own use? > I believe that CLDR only uses noncharacters for special purposes in collation. In CLDR data files, there are at most contraction mappings that start with noncharacters for purposes of building alphabetic-index tables. (And those noncharacters are \u-escaped in CLDR XML files since CLDR 24.) There is no mechanism to remove them from any input, but the worst thing that would happen is that you get a sequence of code points to sort interestingly. The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not > realize that that was considered representable in any UTF. Likewise -1. > No, and that's the point of using those. Integer values that are not code points make for great sentinels in API functions, such as a next() iterator returning -1 when there is no next character. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prosfilaes at gmail.com Thu Jun 12 06:30:19 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 12 Jun 2014 04:30:19 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: On Thu, Jun 12, 2014 at 1:37 AM, Markus Scherer wrote: > If your library makes an explict promise to remove noncharacters, then it > should continue to do so. There is rarely so much frustration as when a library or utility changes behavior and the justification is that well-understood practice was not explicit. I suspect few groups could bring the world to a halt with work-to-rule as quick as programmers. > I disagree. If svn or git choked on noncharacters or control codes or > private use characters or unassigned code points etc., I would complain. > Likewise, I expect to be able to use plain text or programming editors > (gedit, kate, vi, emacs, Visual Studio) to handle files with such characters > just fine. I don't expect plain text editors to handle arbitrary control codes, much less noncharacters, unless they really handle whatever binary junk is shoved at them, which a generic plain text editor can not be relied upon to do. I believe that programming editors should scream bloody murder over noncharacters and unusual control codes; they have no place in source code at all. -- Kie ekzistas vivo, ekzistas espero. 
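[Archive editor's note: the input check David describes is straightforward to implement; a minimal sketch in Python, using the ranges from the standard's definition of Noncharacter_Code_Point (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes):]

```python
# Detect Unicode noncharacter code points, per the Noncharacter_Code_Point
# definition: U+FDD0..U+FDEF, and U+nFFFE/U+nFFFF on every plane
# (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
def is_noncharacter(cp: int) -> bool:
    # (cp & 0xFFFE) == 0xFFFE matches exactly the xxFFFE/xxFFFF pairs
    # within the valid codespace 0..0x10FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def find_noncharacters(text: str):
    """Yield (index, code point) for every noncharacter in text."""
    for i, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            yield i, ord(ch)

assert is_noncharacter(0xFFFE) and is_noncharacter(0x10FFFF)
assert not is_noncharacter(0x41)
```

A filtering or rejecting tool would then decide what to do with the hits; the point of the thread is that this policy decision belongs to the application, not to the encoding layer.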
From richard.wordingham at ntlworld.com Thu Jun 12 13:28:45 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 12 Jun 2014 19:28:45 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: <20140612192845.7e949779@JRWUBU2> On Thu, 12 Jun 2014 01:37:49 -0700 Markus Scherer wrote: > On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson > wrote: > > The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not > > realize that that was considered representable in any UTF. > > Likewise -1. > No, and that's the point of using those. Integer values that are not > code points make for great sentinels in API functions, such as a > next() iterator returning -1 when there is no next character. They work fine as alternatives to scalar values. They don't work so well in 8-bit and 16-bit Unicode strings. A general purpose routine extracting scalar values from Unicode strings is likely to treat them as errors rather than just returning the scalar value as it would for a non-character. The only way to use them directly in 8- and 16-bit Unicode strings is to deliberately create ill-formed Unicode strings. Thus, these 'sentinels' are not full blown sentinels like U+0000 in the C conventions for 'strings', as opposed to arrays of char. There is a get-out clause - just never accept that a Unicode string is purported to be in a Unicode character encoding form. Richard. 
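[Archive editor's note: Richard's distinction can be made concrete. An out-of-range sentinel such as -1 or 0x7FFFFFFF works at the API level, where scalar values are plain integers, but it cannot be stored inside a well-formed Unicode string. A sketch, with a hypothetical next()-style iterator of the kind Markus mentioned:]

```python
# An iterator over a string's scalar values that returns -1 when
# exhausted. -1 and 0x7FFFFFFF are safe API-level sentinels precisely
# because no UTF can encode them: code points stop at U+10FFFF.
class CodePointIterator:
    DONE = -1  # sentinel: not a Unicode code point

    def __init__(self, text: str):
        self._it = iter(text)

    def next(self) -> int:
        ch = next(self._it, None)
        return self.DONE if ch is None else ord(ch)

it = CodePointIterator("Ab")
assert [it.next(), it.next(), it.next()] == [0x41, 0x62, -1]

# By contrast, trying to *store* the sentinel in a string fails:
# chr(0x7FFFFFFF) raises ValueError, since it is beyond U+10FFFF.
```

This is exactly the "alternative to scalar values" use Richard allows for, as opposed to an in-band sentinel like U+0000 in C strings.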
From petercon at microsoft.com Fri Jun 13 00:14:30 2014 From: petercon at microsoft.com (Peter Constable) Date: Fri, 13 Jun 2014 05:14:30 +0000 Subject: Corrigendum #9 In-Reply-To: <53992CC1.3010101@khwilliamson.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson Sent: Wednesday, June 11, 2014 9:30 PM > I have a something like a library that was written a long time ago > (not by me) assuming that noncharacters were illegal in open interchange. > Programs that use the library were guaranteed that they would not receive > noncharacters in their input. I haven't read every post in the thread, so forgive me if I'm making incorrect inferences. I get the impression that you think that Unicode conformance requirements have historically provided that guarantee, and that Corrigendum #9 broke that. If so, then that is a mistaken understanding of Unicode conformance. Here is what has historically been said in the way of conformance requirements related to non-characters: TUS 1.0: There were no conformance requirements stated. This recommendation was given: "U+FFFF and U+FFFE are reserved and should not be transmitted or stored." This same recommendation was repeated in later versions. However, it must be recognized that "should" statements are never absolute requirements. Conformance requirements first appeared in TUS 2.0: TUS 2.0, TUS 3.0: "C5 A process shall not interpret either U+FFFE or U+FFFF as an abstract character." TUS 4.0: "C5 A process shall not interpret a noncharacter code point as an abstract character." 
"C10 When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points." Btw, note that C10 makes the assumption that a valid coded character sequence can include non-character code points. TUS 5.0 (trivially different from TUS4.0): C2 = TUS4.0, C5 "C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points." TUS 6.0: C2 = TUS5.0, C2 "C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences." Interestingly, the change to C7 does not permit non-characters to be replaced or removed at all while claiming not to have left the interpretation intact. So, there was a change in 6.0 that could impact conformance claims of existing implementations. But there has never been any guarantees made _by Unicode_ that non-character code points will never occur in open interchange. Interchange has always been discouraged, but never prohibited. Peter From daniel.buenzli at erratique.ch Mon Jun 23 20:31:52 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 24 Jun 2014 02:31:52 +0100 Subject: Default case algorithms Message-ID: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Hello, I don?t understand the rule specifications of default case conversion ?3.13 p.117 of Unicode 6.2.0 [1] (which is what 7.0.0 eventually points to at the moment). Specifically the sentence ? 
as well as the context-dependent mappings based on the casing context, as specified in Table 3-14 ?. This table just specifies casing contexts and there seem to be no normative property that specifies the context-dependent mappings (was apparently removed when UCD xml was created). So the question is, if I take a string an apply e.g. only the rule R1 as given, is that an implementation of default uppercase conversion ? or would that be a (context-independent) tailoring of the default uppercase conversion algorithm ? If that?s thte case it seems strange to have normative behaviours defined that have no supporting normative properties to implement them. Thanks for your answers, Daniel [1] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf From markus.icu at gmail.com Tue Jun 24 07:28:38 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 24 Jun 2014 14:28:38 +0200 Subject: Default case algorithms In-Reply-To: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: The context-sensitive and/or language-sensitive mappings are here: http://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From cewcathar at hotmail.com Tue Jun 24 08:16:00 2014 From: cewcathar at hotmail.com (CE Whitehead) Date: Tue, 24 Jun 2014 09:16:00 -0400 Subject: Corrigendum #9 Message-ID: Markus Scherer said what sounds right to me to recommend (maybe what he says should be said in Corrigendum 9): http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0148.html From: Markus Scherer Date: Thu, 12 Jun 2014 01:37:49 -0700 > If your library makes an explict promise to remove noncharacters, then it > should continue to do so. > However, if your library is understood to pass through any strings, except > for the advertised processing, then noncharacters should probably be > preserved. 
ME: Am I to believe from the above, that, regarding www.unicode.org/L2/L2013/13015-nonchars.pdf (which rejects the bold interpretation, but I don't think that's what Markus's email does) -- the "'bold interpretation' of internal exchange of noncharacters" may continue, where deletion of a noncharacter is never a good idea and should not happen, and unrecognized noncharacters should simply be silently ignored, with "all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points," thus "mapped to unique code unit sequences"; while, at the same time (albeit, as I understand things, only if the type of encoding is recognized), noncharacters may be replaced with the replacement character (U+FFFD)? In this latter case the non-character is no longer mapped one-to-one with a scalar, as all noncharacters will have been replaced with U+FFFD. So is that one-to-one mapping recommendation going to be changed or not? * * * I also have a question on Peter's notes on TUS 6.0 rule C7 (which followed the Unicode 4.0 correction apparently, if I understand correctly; maybe I should have sent this question as a separate email) http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0151.html From: Peter Constable Date: Fri, 13 Jun 2014 05:14:30 +0000 > TUS 6.0: > C2 = TUS5.0, C2 "C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences." > Interestingly, the change to C7 does not permit non-characters to be replaced or removed at all while claiming not to have left the interpretation intact. ME: if two sequences are canonically equivalent except that one has noncharacters in it, are these still canonically equivalent? 
(just a wild question; it would be nice to have an answer in the FAQ on noncharacters or somewhere; maybe I missed the answer and it was there). * * * Sentinels, Security Regarding the sentinels; I am an outsider but assume that with Corrigendum 9 U+FFFE will continue to be mentioned as having generally (not always?) standard use throughout; in Chapter 16.7 it is currently mentioned; I assume it will still be -- according to info. in the FAQ and elsewhere: http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode character value, and should be taken as a signal that Unicode characters should be byte-swapped before interpretation. U+FFFE should only be interpreted as an incorrectly byte-swapped version of U+FEFF" Yes, I agree it would be nice also to have info about the security effects of any other sentinels, particularly U+FFFF and U+10FFFF -- but I envision most security effects would be caused by removing without replacing one of these (is that right?) Hope these questions are helpful. Best, --C. E. Whitehead cewcathar at hotmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Tue Jun 24 09:56:59 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 24 Jun 2014 15:56:59 +0100 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: Thanks, so the context/language sensitive case maps are not available in the XML UCD. But that doesn't really answer my question, which is: Does an algorithm that simply applies R1 *regardless of context* constitute a default case algorithm or not? I.e. does simply mapping each character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the XML UCD) constitute a default case conversion as mandated by the standard? 
The wording of the standard is quite confusing since on p.115 many of the context-dependent data of SpecialCasing.txt are mentioned as being data to "assist in the implementation of certain tailorings", and there is no clear indication in the definition of default case algorithm which context-sensitive mappings should be applied (if any). Best, Daniel From markus.icu at gmail.com Tue Jun 24 10:07:27 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 24 Jun 2014 17:07:27 +0200 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: On Tue, Jun 24, 2014 at 4:56 PM, Daniel Bünzli wrote: > Does an algorithm that simply applies R1 *regardless of context* > constitute a default case algorithm or not ? I.e. does simply mapping each > character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the > XML UCD) constitute a default case conversion as mandated by the standard ? > It implements simple uppercasing but not full uppercasing. It misses simple, common things like ß -> SS (which is neither language-dependent nor context-sensitive). The wording of the standard is quite confusing since on p.115 many of the > context dependent data of SpecialCasing.txt are mentioned as being data to > "assist in the implementation of certain tailorings" and there is no clear > indication in the definition of default case algorithm which > context-sensitive mappings should be applied (if any). > http://www.unicode.org/reporting.html markus -------------- next part -------------- An HTML attachment was scrubbed... 
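[Archive editor's note: Markus's simple-vs-full distinction can be checked directly. Python's str.upper() and str.lower(), for example, implement the Unicode full case mappings, including the one conditional mapping in SpecialCasing.txt (final sigma); a small sketch:]

```python
# Full case mappings can change string length; simple mappings
# (one code point -> one code point) cannot.
assert "ß".upper() == "SS"   # unconditional full mapping from SpecialCasing.txt
assert "ﬁ".upper() == "FI"   # ligature expands under full uppercasing

# The one conditional mapping in SpecialCasing.txt: Greek capital sigma
# lowercases to final sigma (ς) at the end of a word, σ elsewhere.
assert "ΟΣ".lower() == "ος"
assert "ΣΟ".lower() == "σο"
```

A mapping built only from Simple_Uppercase_Mapping would leave ß and ﬁ unchanged, which is exactly the gap being discussed.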
URL: From verdy_p at wanadoo.fr Tue Jun 24 11:03:48 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 24 Jun 2014 18:03:48 +0200 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: 2014-06-24 17:07 GMT+02:00 Markus Scherer : > On Tue, Jun 24, 2014 at 4:56 PM, Daniel Bünzli < > daniel.buenzli at erratique.ch> wrote: >> Does an algorithm that simply applies R1 *regardless of context* >> constitute a default case algorithm or not ? I.e. does simply mapping each >> character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the >> XML UCD) constitute a default case conversion as mandated by the standard ? >> > > It implements simple uppercasing but not full uppercasing. > It misses simple, common things like ß -> SS (which is neither > language-dependent nor context-sensitive). > Not so simple; maybe it is SS for modern German, but Czech would map it to SZ, and historically that letter is a ligature of SZ (including in old German texts where that ligature was used), along with many other ligatures in medieval texts. If texts were printed in Fraktur style, you always have an ambiguity about whether you should even use ß as a single letter or whether you should rather encode separate letters (without even needing to encode any ligature hint, because ligatures are everywhere in the text in its original form; they are inherent in the script style; you would use hints only for variants of these ligatures or infrequent absences of a ligature). -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Tue Jun 24 11:46:10 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 24 Jun 2014 17:46:10 +0100 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> Le mardi, 24 juin 2014 à 
16:07, Markus Scherer a écrit : > > Does an algorithm that simply applies R1 *regardless of context* constitute a default case algorithm or not ? I.e. does simply mapping each character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the XML UCD) constitute a default case conversion as mandated by the standard ? > > It implements simple uppercasing but not full uppercasing. Not really; IIUC simple uppercasing would occur if I used the Simple_Uppercase_Mapping property. I'm using the Uppercase_Mapping property of the XML UCD. > It misses simple, common things like ß -> SS (which is neither language-dependent nor context-sensitive). This is actually included in the Uppercase_Mapping property of the XML UCD. Having a look at the data, it seems that the Uppercase_Mapping property of the UCD includes (using the terminology of SpecialCasing.txt): * All the unconditional mappings of SpecialCasing.txt (context independent) * None of the conditional mappings of SpecialCasing.txt (context dependent) * None of the language sensitive mappings (context and language dependent) So what am I implementing if I just map a string using the XML UCD's Uppercase_Mapping property? Is that Unicode's default uppercase mapping? (I did file a bug about that as you suggested, text below for those who are interested) Best, Daniel ---- The default casing algorithms of §3.13 don't really make it clear *if* or *which* context and language dependent case mappings have to be applied in order to implement default case mapping algorithms. Besides, the definitions seem to contradict themselves. 1. The Definitions section seems to imply that all case mappings of SpecialCasing.txt and UnicodeData.txt have to be used in order to get the full case mapping properties of a character C. 2. The Tailoring section indicates that the SpecialCasing.txt file contains data to assist implementation of certain *tailorings* of the default case algorithm, which contradicts 1. 3. 
To muddy things further, the XML UCD exposes full case mapping properties that, as far as I can tell, contain only the context *insensitive* mappings of SpecialCasing.txt. This makes it hard to understand what should be done for implementing proper Unicode default case conversion. From markus.icu at gmail.com Tue Jun 24 16:01:32 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 24 Jun 2014 23:01:32 +0200 Subject: Default case algorithms In-Reply-To: <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> Message-ID: On Tue, Jun 24, 2014 at 6:46 PM, Daniel Bünzli wrote: > Having a look at the data it seems that the Uppercase_Mapping property of > UCD includes (using the terminology of SpecialCasing.txt): > > * All the unconditional mappings of SpecialCasing.txt (context independent) > * None of the conditional mappings of SpecialCasing.txt (context dependent) > * None of the language sensitive mappings (context and language dependent) > > So what am I implementing if I just map a string using the XML UCD's > Uppercase_Mapping property ? Is that Unicode's default uppercase mapping ? > I don't think it's any standard mapping at all. Sorry, I don't use the UCD XML files, and it's been a while since I looked at them. (For my use for ICU, they were missing some things, I guess including these pieces, and they were including things I didn't need. So I kept improving my .txt file parser instead. YMMV) markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Wed Jun 25 03:10:08 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jun 2014 09:10:08 +0100 Subject: Default case algorithms In-Reply-To: <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> Message-ID: <20140625091008.5be39819@JRWUBU2> On Tue, 24 Jun 2014 17:46:10 +0100 Daniel Bünzli wrote: > So what am I implementing if I just map a string using the XML UCD's > Uppercase_Mapping property ? Is that Unicode's default uppercase > mapping ? Yes - with the caveat that the uppercase mapping of U+0345 is too complicated to be defined formally. On the other hand, the Lowercase_Mapping property seems to be inadequate for the default lowercase mapping - Greek final sigma is the complication. Richard. From daniel.buenzli at erratique.ch Wed Jun 25 03:52:18 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Wed, 25 Jun 2014 09:52:18 +0100 Subject: Default case algorithms In-Reply-To: <20140625091008.5be39819@JRWUBU2> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> <20140625091008.5be39819@JRWUBU2> Message-ID: Le mercredi, 25 juin 2014 à 09:10, Richard Wordingham a écrit : > Yes - with the caveat that the uppercase mapping of U+0345 is too > complicated to be defined formally. > > On the other hand, the Lowercase_Mapping property seems to be inadequate > for the default lowercase mapping - Greek final sigma is the > complication. So what you seem to imply is that Unicode's default full casing is defined by applying 1) The unconditional mappings of SpecialCasing.txt 2) The conditional mappings of SpecialCasing.txt (there's only one, the final sigma case). 
Best, Daniel From verdy_p at wanadoo.fr Wed Jun 25 07:37:39 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 25 Jun 2014 14:37:39 +0200 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> <20140625091008.5be39819@JRWUBU2> Message-ID: 2014-06-25 10:52 GMT+02:00 Daniel Bünzli : > Le mercredi, 25 juin 2014 à 09:10, Richard Wordingham a écrit : > > Yes - with the caveat that the uppercase mapping of U+0345 is too > > complicated to be defined formally. > > > > On the other hand, the Lowercase_Mapping property seems to be inadequate > > for the default lowercase mapping - Greek final sigma is the > > complication. > > So what you seem to imply is that Unicode's default full casing is > defined by applying > > 1) The unconditional mappings of SpecialCasing.txt > 2) The conditional mappings of SpecialCasing.txt (there's only one, the > final > sigma case). > There's also the Turkic i or j (problems related to letters that are usually soft-dotted in the Latin script except in Turkic languages, whose case mapping is context-dependent, requiring a look at the right-side context to see if we need to add a combining dot above). We could insist on having Turkish texts use an explicit combining dot above after the dotless i (or j), but most Turkish texts just use the plain ASCII letter, reinterpreting its soft dot as a hard dot that needs to be added when converting to uppercase and removed when converting to lowercase. Note also that the dotless i and dotless j are not part of any case pair. For Turkish readers, a dotless i followed by an explicit combining dot above (hard dot) is not recommended, and they use the standard ASCII letter directly, as if it were a precombined but decomposable letter. 
In Turkish texts, a dotless i without a diacritic pairs with the capital ASCII letter I directly (this mapping to uppercase is *not* contextual, but the reverse conversion to lowercase *is* contextual). -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Wed Jun 25 07:50:19 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Wed, 25 Jun 2014 13:50:19 +0100 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> <20140625091008.5be39819@JRWUBU2> Message-ID: Le mercredi, 25 juin 2014 à 13:37, Philippe Verdy a écrit : > There's also the Turkic i or j (problems related to letters that are usually soft-dotted in the Latin script except in Turkic languages, whose case mapping is context-dependent, looking at the right-hand side to see if we need to add a combining dot above). Yes I know there are also language-specific case mappings, but it's unclear (see my previous messages in this discussion) whether this is part of default casing algorithms. (As far as I'm concerned I think this should rather be part of language-specific tailorings). Best, Daniel From richard.wordingham at ntlworld.com Wed Jun 25 12:58:55 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jun 2014 18:58:55 +0100 Subject: Corrigendum #9 In-Reply-To: References: Message-ID: <20140625185855.58a095ad@JRWUBU2> On Tue, 24 Jun 2014 09:16:00 -0400 CE Whitehead wrote: > ME: if two sequences are canonically equivalent except that one has > noncharacters in it, are these still canonically equivalent? Canonical equivalences are defined for all sequences of scalar values; it is just that they change from version to version for most unassigned characters. 
Non-characters only decompose to themselves and do not occur in the canonical (or indeed compatibility) decomposition of anything else, so a sequence containing a non-character cannot be canonically equivalent to a sequence not containing a non-character. > Regarding the sentinels; I am an outsider but assume that with > Corrigendum 9 U+FFFE will continue to be mentioned as having > generally (not always?) standard use throughout; in Chapter 16.7 it > is currently mentioned; I assume it will still be -- according to > info. in the FAQ and elsewhere: > http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit > unsigned hexadecimal value U+FFFE is not a Unicode character value, > and should be taken as a signal that Unicode characters should be > byte-swapped before interpretation. U+FFFE should only be interpreted > as an incorrectly byte-swapped version of U+FEFF" There is a lot of untruth in that FAQ entry, alas. I think U+FFFE and possibly U+FFFF should be treated differently from the other 64 non-characters. At present there is no certainty as to whether an interchanged file in the UTF-16 encoding scheme that appears to contain a BOM contains a BOM or starts with U+FFFE. The only promise is that such a file contains an even number of data bytes. Any such sequence is valid! Will the UTF-16 encoding scheme be withdrawn? Richard. From cewcathar at hotmail.com Thu Jun 26 11:15:24 2014 From: cewcathar at hotmail.com (CE Whitehead) Date: Thu, 26 Jun 2014 12:15:24 -0400 Subject: Corrigendum #9 Message-ID: From: Richard Wordingham Date: Wed, 25 Jun 2014 18:58:55 +0100 On Tue, 24 Jun 2014 09:16:00 -0400 > CE Whitehead wrote: >> ME: if two sequences are canonically equivalent except that one has >> noncharacters in it, are these still canonically equivalent? > Canonical equivalences are defined for all sequences of scalar values; > it is just that they change from version to version for most unassigned > characters. 
> Non-characters only decompose to themselves and do not > occur in the canonical (or indeed compatibility) decomposition of > anything else, so a sequence containing a non-character cannot be > canonically equivalent to a sequence not containing a non-character. My mistake, it's not "canonical equivalence" that Peter was talking about but "conformance" to the standard, so that a process can claim a character sequence is the same character sequence as that which was passed to it. (Thus I assume that a process can treat these two sequences (containing canonically equivalent characters but one with noncharacters) as different character sequences but does not have to do so.) Best, --C. E. Whitehead cewcathar at hotmail.com --from Maria de Ventadorn, 12th century -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Thu Jun 26 12:08:45 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 26 Jun 2014 10:08:45 -0700 Subject: Corrigendum #9 Message-ID: <20140626100845.665a7a7059d7ee80bb4d670165c8327d.ca6acb5803.wbe@email03.secureserver.net> Richard Wordingham wrote: > At present there is no certainty as to whether > an interchanged file in the UTF-16 encoding scheme that appears to > contain a BOM contains a BOM or starts with U+FFFE. The only > promise is that such a file contains an even number of data bytes. > Any such sequence is valid! Will the UTF-16 encoding scheme be > withdrawn? One might wonder, given how frequently we hear that unpaired surrogates also occur in the wild and need to be tolerated. 
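[Editorial note: the ambiguity Richard describes in the quote above is easy to demonstrate with Python's codecs (an illustration, nothing normative): the same four bytes decode to a single U+FEFF when the leading FF FE is sniffed as a little-endian BOM, but to two U+FFFE noncharacters under explicit big-endian decoding with no BOM logic.]

```python
raw = b'\xff\xfe\xff\xfe'

# The 'utf-16' codec sniffs and consumes a BOM: FF FE selects
# little-endian, and the remaining FF FE pair decodes as U+FEFF.
with_bom_logic = raw.decode('utf-16')        # '\ufeff'

# The 'utf-16-be' codec applies no BOM logic: each FF FE pair is U+FFFE.
without_bom_logic = raw.decode('utf-16-be')  # '\ufffe\ufffe'
```

Nothing in the bytes themselves says which reading was intended; only the label of the encoding scheme decides.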
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From richard.wordingham at ntlworld.com Sat Jun 28 22:59:46 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 29 Jun 2014 04:59:46 +0100 Subject: Denoting Abstract Substrings Message-ID: <20140629045946.19fe6209@JRWUBU2> I believe it is fairly natural to think of physical sequences of code units as representatives of equivalence classes corresponding to abstract strings. One of the most important such equivalences is canonical equivalence, though one might want to use some tailoring - in which case one would not have canonical equivalence. Given this abstraction, it is natural to want to be able to reference substrings. To this end, one may define the substrings of an abstract string to be the equivalence classes containing a physical substring of a physical string in the original string. (There is probably no need to restrict oneself to vaguely contiguous substrings.) For example, I might want to express U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (canonical decomposition <U+0061, U+0302>) as a substring of a string canonically equivalent to U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW, whose NFD equivalent is <U+0061, U+0323, U+0302>. This corresponds to a decomposition into a Vietnamese vowel (U+00E2) plus a Vietnamese tone mark (U+0323). Now, if the canonical decomposition of U+1EAD in UTF-8 is held as x = <61, CC, A3, CC, 82>, I might, adapting the boundary-based notation, choose to specify the *abstract* substring as the abstract substring x[0:1,3:5]. (This specifies code units 0, 3 and 4.) If I specify that extracted substrings always contain entire scalar values, I might, confusingly, abbreviate this notation to x[0, 3]. However, if U+1EAD is held as a single scalar value in the UTF-8 string y = <E1, BA, AD>, I want a notation that says, 'Take the first and third components of the NFD equivalent of the scalar value held starting at offset 0', e.g. "y[0.1, 0.3]". 
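[Editorial note: the decompositions in the example can be checked with Python's unicodedata. This sketch only verifies the bookkeeping; the x[0:1,3:5] notation itself is the proposal under discussion, not an existing API.]

```python
import unicodedata

# U+1EAD fully decomposed: a, combining dot below, combining circumflex.
nfd = unicodedata.normalize('NFD', '\u1ead')
assert nfd == '\u0061\u0323\u0302'
assert nfd.encode('utf-8') == b'\x61\xcc\xa3\xcc\x82'  # x = <61, CC, A3, CC, 82>

# x[0:1,3:5] keeps code units 0, 3 and 4, i.e. U+0061 plus U+0302,
# which recomposes to U+00E2.
sub = nfd[0] + nfd[2]
assert unicodedata.normalize('NFC', sub) == '\u00e2'
```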
Using my own notation is likely to cause confusion - are there any shared, workable schemes in use? I'm putting together a demonstration regular expression engine that works on 'traces' (see http://en.wikipedia.org/wiki/Trace_monoid for the definition of a 'trace') rather than strings, but for it to be 'useful' I see no reason to restrict it to searching text in NFD. I'm currently working on capturing subexpressions. My hope is that we will ultimately have regular expression engines that fully grok canonical equivalence. When RL2.1 in UTS#18 "Unicode Regular Expressions" last looked usable as a specification (Version 13, 2008), the requirement that "an implementation shall provide a mechanism for ensuring that all canonically equivalent literal characters match" was too weak for my desire as a user. Concatenation of expressions is rather more complicated for traces than for strings, though still within the scope of mathematical regular expressions, where 'regular' means recognisable by a finite automaton. There are issues with Unicode sets - does "K" match "\p{ASCII} & \p{block=Letterlike Symbols}"? (The simplest solution seems to be to exclude codepoints with singleton decompositions, such as U+212A KELVIN SIGN, from the set of scalar values in Unicode sets.) As an aside, I'd have liked to have added fully decomposed Unicode strings under canonical equivalence to the Wikipedia article as an example of traces, but I couldn't find a source. Richard. From andrea.giammarchi at gmail.com Sat Jun 28 12:33:17 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sat, 28 Jun 2014 10:33:17 -0700 Subject: meaningful and meaningless FE0E Message-ID: Dear all, this is my first email in this channel, so apologies in advance if this has already been discussed. I am trying to understand the expected behavior when there is an "unexpected VS15" after emoji that have not been defined as VS15-sensitive, according to this file: http://www.unicode.org/Public/UNIDATA/NamesList.txt. 
My take on FE0E is that all emoji that are sensitive to this variant have an "emojified" counterpart that should be used when followed by FE0F and, vice versa, a textual counterpart when followed by FE0E, but all other emoji should not consider such a variant at all, since there is no textual counterpart to represent, let's say, a U+1F4A9 pile-of-poo "\ud83d\udca9\ufe0e" Can anyone please confirm my expectations are correct: that the above sequence in both Java and JavaScript will show the POO emoji regardless, with the trailing FE0E variant simply ignored, and that actually no device/OS/renderer/viewer/browser would ever create such a sequence, so what I am trying to solve is actually a non-problem? Thanks in advance and Best Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrea.giammarchi at gmail.com Sun Jun 29 01:47:11 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sat, 28 Jun 2014 23:47:11 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: ok, here is the simplified version of my question: would U+1F4A9 followed by U+FE0E be represented differently from how U+1F4A9 is normally? is such a sequence even a real concern or an intent specified anywhere? (no, I can't find it, asking just for confirmation) Thanks a lot for any outcome! Best Regards On Sat, Jun 28, 2014 at 10:33 AM, Andrea Giammarchi < andrea.giammarchi at gmail.com> wrote: > Dear all, > this is my first email in this channel so apologies in advance if > already discussed. > > I am trying to understand the expected behavior when there is an "unexpected > VS15" after emoji that have not been defined, according to this file > http://www.unicode.org/Public/UNIDATA/NamesList.txt, as VS15 sensitive. 
> > My take on FE0E is that all emoji that are sensible to this variant, have > an "emojified" counter part that should be used when followed by FE0F and > vice-versa a textual part when followed by FE0E, but all other emoji should > not consider such variant at all since there's no textual counter part to > represent, let's say, a 1F21A pile-of-poo > > "\ud83d\udca9\ufe0e" > > Can anyone please confirm my expectations are correct so that above > sequence in both Java or JavaScript will show the POP emoji regardless, > followed by FE0E variant that will be simply ignored and actually no > device/OS/render/viewer/browser would ever create such sequence so it's > actually a non problem, this one I am trying to solve? > > Thanks in advance and Best Regards > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sun Jun 29 02:00:21 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 29 Jun 2014 09:00:21 +0200 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: These variation selector characters only apply to specific characters, those listed in http://unicode.org/Public/UNIDATA/StandardizedVariants.html There is a machine-readable version at http://unicode.org/Public/UNIDATA/StandardizedVariants.txt Mark *« Il meglio è l'inimico del bene »* On Sun, Jun 29, 2014 at 8:47 AM, Andrea Giammarchi < andrea.giammarchi at gmail.com> wrote: > ok, here the simplified version of my question: > > would U+1F21A followed by U+FE0E be represented differently from what U+1F21A > is normally? > > is such sequence even a real concern or intent specified anywhere? (no, > can't find it, asking just confirmation) > > Thanks a lot for any outcome! > > Best Regards > > > On Sat, Jun 28, 2014 at 10:33 AM, Andrea Giammarchi < > andrea.giammarchi at gmail.com> wrote: > >> Dear all, >> this is my first email in this channel so apologies in advance if >> already discussed. 
>> >> I am trying to understand the expected behavior when there an "unexpected >> VS15" after emoji that have not been defined, accordingly with this file >> http://www.unicode.org/Public/UNIDATA/NamesList.txt, as VS15 sensitive. >> >> Thanks in advance and Best Regards >> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrea.giammarchi at gmail.com Sun Jun 29 02:13:04 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sun, 29 Jun 2014 00:13:04 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: Thank You! On Sun, Jun 29, 2014 at 12:00 AM, Mark Davis ☕️ wrote: > These variation selector characters only apply to specific characters, > those listed in > > http://unicode.org/Public/UNIDATA/StandardizedVariants.html > > There is a machine-readable version at > http://unicode.org/Public/UNIDATA/StandardizedVariants.txt > > > Mark > > *« Il meglio è l'inimico del bene »* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sun Jun 29 03:51:02 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 29 Jun 2014 09:51:02 +0100 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: <20140629095102.7671e2f3@JRWUBU2> On Sat, 28 Jun 2014 10:33:17 -0700 Andrea Giammarchi wrote: > I am trying to understand the expected behavior when there an > "unexpected VS15" after emoji that have not been defined, accordingly > with this file http://www.unicode.org/Public/UNIDATA/NamesList.txt, > as VS15 sensitive. Variation selectors are 'default ignorable' - if an implementation does not understand it, it should ignore it. In particular, Section 16.4 Version 6.3.0 of the Unicode Standard says that if the application does not understand the combination of base character and variation selector the variation selector should normally be ignored. This does not preclude the possibility that the renderer only has special modes, in all of which unknown variation selectors are displayed as flashing red question marks. > My take on FE0E is that all emoji that are sensible to this variant, > have an "emojified" counter part that should be used when followed by > FE0F and vice-versa a textual part when followed by FE0E, but all > other emoji should not consider such variant at all since there's no > textual counter part to represent, let's say, a 1F21A pile-of-poo > > "\ud83d\udca9\ufe0e" > > Can anyone please confirm my expectations are correct so that above > sequence in both Java or JavaScript will show the POP emoji > regardless, followed by FE0E variant that will be simply ignored and > actually no device/OS/render/viewer/browser would ever create such > sequence so it's actually a non problem, this one I am trying to > solve? There was nothing to stop me putting the sequence <U+1F4A9 PILE OF POO, U+FE0E VARIATION SELECTOR-15> in my reply. Moreover, there is nothing to stop the sequence becoming defined at some time in the future. Richard. 
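[Editorial note: the "ignore what you do not understand" behaviour Richard describes can be sketched as a pre-rendering filter. A sketch under assumptions: the `supported` set stands in for data a real implementation would derive from StandardizedVariants.txt or its emoji counterpart, and the Plane 14 selectors VS17..VS256 are left out for brevity.]

```python
VS = {chr(cp) for cp in range(0xFE00, 0xFE10)}   # VS1..VS16

def drop_unsupported_selectors(text, supported=frozenset()):
    """Remove variation selectors whose (base, selector) pair is not
    supported; unknown selectors are default-ignorable."""
    out = []
    for ch in text:
        if ch in VS and (not out or (out[-1], ch) not in supported):
            continue                             # ignore, do not render
        out.append(ch)
    return ''.join(out)
```

With this filter, U+1F4A9 followed by VS15 renders as bare U+1F4A9 unless the pair is in the supported set.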
From andrea.giammarchi at gmail.com Sun Jun 29 11:24:50 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sun, 29 Jun 2014 09:24:50 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: <20140629095102.7671e2f3@JRWUBU2> References: <20140629095102.7671e2f3@JRWUBU2> Message-ID: But today, where emoji are parsed correctly, that's not a couple of pointless empty squares but a POO emoji followed by an ignored FE0E, which matches my expectations according to today's standards. If tomorrow this changes for some reason, it's not a problem of today's parsers, and unless you intentionally create that sequence for your own purposes, no keyboard would automatically put such a sequence in a text field, since the sequence as it is is meaningless for today's standards. All good then, I've got my parser right :-) Thanks On Sun, Jun 29, 2014 at 1:51 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sat, 28 Jun 2014 10:33:17 -0700 > Andrea Giammarchi wrote: > > > I am trying to understand the expected behavior when there an > > "unexpected VS15" after emoji that have not been defined, accordingly > > with this file http://www.unicode.org/Public/UNIDATA/NamesList.txt, > > as VS15 sensitive. > > Variation selectors are 'default ignorable' - if an implementation > does not understand it, it should ignore it. In particular, > Section 16.4 Version 6.3.0 of the Unicode Standard says that if the > application does not understand the combination of base character and > variation selector the variation selector should normally be ignored. > This does not preclude the possibility that the renderer only has > special modes, in all of which unknown variation selectors are displayed > as flashing red question marks. 
> > > My take on FE0E is that all emoji that are sensible to this variant, > > have an "emojified" counter part that should be used when followed by > > FE0F and vice-versa a textual part when followed by FE0E, but all > > other emoji should not consider such variant at all since there's no > > textual counter part to represent, let's say, a 1F21A pile-of-poo > > > > "\ud83d\udca9\ufe0e" > > > > Can anyone please confirm my expectations are correct so that above > > sequence in both Java or JavaScript will show the POP emoji > > regardless, followed by FE0E variant that will be simply ignored and > > actually no device/OS/render/viewer/browser would ever create such > > sequence so it's actually a non problem, this one I am trying to > > solve? > > There was nothing to stop me putting the sequence <U+1F4A9 PILE > OF POO, U+FE0E VARIATION SELECTOR-15> in my reply. Moreover, there is > nothing to stop the sequence becoming defined at some time in the > future. > > Richard. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kojiishi at gluesoft.co.jp Sun Jun 29 13:44:05 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Sun, 29 Jun 2014 18:44:05 +0000 Subject: Characters that should be displayed? Message-ID: Hello Unicoders, I'm a co-editor of CSS Text Level 3[1], and I would appreciate your support in defining rendering behavior in CSS. The spec currently has the following text[2]: > Control characters (Unicode class Cc) other than tab (U+0009), line feed (U+000A), and carriage return (U+000D) are ignored for the purpose of rendering. (As required by [UNICODE], unsupported Default_ignorable characters must also be ignored for rendering.) and there's feedback saying that CSS should display visible glyphs for these control characters. 
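[Editorial note: the css-text-3 rule quoted above can be sketched as a character filter. Illustrative only: a real user agent operates on styled glyph runs and must also deal with unsupported default-ignorable code points, which this sketch leaves alone.]

```python
import unicodedata

def renderable(text):
    """Drop Cc controls other than tab, LF and CR, per the quoted rule."""
    keep = {'\t', '\n', '\r'}
    return ''.join(ch for ch in text
                   if ch in keep or unicodedata.category(ch) != 'Cc')
```

For example, a BEL (U+0007) embedded in a string is removed, while tabs and newlines survive.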
Since all major browsers do not display them today, this is a breaking change and the CSS WG needs to discuss this feedback. But the WG would appreciate understanding what Unicode recommends. I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring Characters in Processing"[3]: > Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph By looking at this, my questions are as follows: 1. Should control characters that browsers do not interpret be displayed in fallback rendering? 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? These two questions are probably yes from what I understand of the text quoted above, but things get harder the more I think: 3. When the above text says "surrogate code points", does that mean everything outside BMP? It reads so to me, but I'm surprised that characters in BMP and outside BMP have such differences, so I'm doubting my English skill. 4. Should every code point that is not given the Default_Ignorable_Code_Point property and that has neither an interpretation nor a glyph be displayed in fallback rendering? I could not find such a statement in the Unicode spec, but there are some people who believe so. 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? This must be RTFM, but pointing out where to read would be appreciated. Thank you for your support in advance. 
[1] http://dev.w3.org/csswg/css-text/ [2] http://dev.w3.org/csswg/css-text/#white-space-processing [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf /koji From richard.wordingham at ntlworld.com Sun Jun 29 13:44:31 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 29 Jun 2014 19:44:31 +0100 Subject: meaningful and meaningless FE0E In-Reply-To: References: <20140629095102.7671e2f3@JRWUBU2> Message-ID: <20140629194431.692a83ce@JRWUBU2> On Sun, 29 Jun 2014 09:24:50 -0700 Andrea Giammarchi wrote: > ...no keyboard would automatically put such sequence > in a text field since such sequence as it is is meaningless for today > standards. While perhaps no keyboard would map it to a single keystroke plus modifiers, direct hex input is sometimes the swiftest input method, as I found when transcribing some theorems a few days ago. Richard. From Shawn.Steele at microsoft.com Sun Jun 29 13:59:01 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 29 Jun 2014 18:59:01 +0000 Subject: Characters that should be displayed? In-Reply-To: References: Message-ID: <3fd544a9495b47c7a8273395b6b88532@BY2PR03MB491.namprd03.prod.outlook.com> If the concern is security, I cannot imagine why CSS would even want something like BELL to be legal at all. I'm not sure that replacement glyphs would help much. I mean would someone thing that ?Shawn was something spoofing Shawn, or just assume their browser/computer had a rendering glitch? I think most people would just ignore the unexpected character and assume something was quirky with the web page. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Koji Ishii Sent: Sunday, June 29, 2014 11:44 AM To: Unicode Mailing List Subject: Characters that should be displayed? Hello Unicoders, I'm a co-editor of CSS Text Level 3[1], and I would appreciate your support in defining rendering behavior in CSS. 
The spec currently has the following text[2]: > Control characters (Unicode class Cc) other than tab (U+0009), line feed (U+000A), and carriage return (U+000D) are ignored for the purpose of rendering. (As required by [UNICODE], unsupported Default_ignorable characters must also be ignored for rendering.) and there?s a feedback saying that CSS should display visible glyphs for these control characters. Since all major browsers do not display them today, this is a breaking-change and the CSS WG needs to discuss on this feedback. But the WG would appreciate to understand what Unicode recommends. I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring Characters in Processing?[3]: > Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph By looking at this, my questions are as follows: 1. Should control characters that browsers do not interpret be displayed in fallback rendering? 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? These two questions are probably yes from what I understand the text quoted above, but things get harder the more I think: 3. When the above text says ?surrogate code points?, does that mean everything outside BMP? It reads so to me, but I?m surprised that characters in BMP and outside BMP have such differences, so I?m doubting my English skill. 4. Should every code point that are not given the Default_Ignorable_Code_Point property and that without interpretations nor glyphs displayed in fallback rendering? I could not find such statement in Unicode spec, but there are some people who believe so. 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? 
This must be RTFM, but pointing out where to read would be appreciated. Thank you for your support in advance. [1] http://dev.w3.org/csswg/css-text/ [2] http://dev.w3.org/csswg/css-text/#white-space-processing [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf /koji _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From Shawn.Steele at microsoft.com Sun Jun 29 14:22:20 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 29 Jun 2014 19:22:20 +0000 Subject: Characters that should be displayed? In-Reply-To: <3fd544a9495b47c7a8273395b6b88532@BY2PR03MB491.namprd03.prod.outlook.com> References: <3fd544a9495b47c7a8273395b6b88532@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <33990cdde4094cc193cbbcad65612ae3@BY2PR03MB491.namprd03.prod.outlook.com> Corrected typo, sorry. (someone thing/someone think) -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele Sent: Sunday, June 29, 2014 11:59 AM To: Koji Ishii; Unicode Mailing List Subject: RE: Characters that should be displayed? If the concern is security, I cannot imagine why CSS would even want something like BELL to be legal at all. I'm not sure that replacement glyphs would help much. I mean would someone think that ?Shawn was something spoofing Shawn, or just assume their browser/computer had a rendering glitch? I think most people would just ignore the unexpected character and assume something was quirky with the web page. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Koji Ishii Sent: Sunday, June 29, 2014 11:44 AM To: Unicode Mailing List Subject: Characters that should be displayed? Hello Unicoders, I?m a co-editor of CSS Text Level 3[1], and I would appreciate your support in defining rendering behavior in CSS. 
The spec currently has the following text[2]: > Control characters (Unicode class Cc) other than tab (U+0009), line feed (U+000A), and carriage return (U+000D) are ignored for the purpose of rendering. (As required by [UNICODE], unsupported Default_ignorable characters must also be ignored for rendering.) and there?s a feedback saying that CSS should display visible glyphs for these control characters. Since all major browsers do not display them today, this is a breaking-change and the CSS WG needs to discuss on this feedback. But the WG would appreciate to understand what Unicode recommends. I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring Characters in Processing?[3]: > Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph By looking at this, my questions are as follows: 1. Should control characters that browsers do not interpret be displayed in fallback rendering? 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? These two questions are probably yes from what I understand the text quoted above, but things get harder the more I think: 3. When the above text says ?surrogate code points?, does that mean everything outside BMP? It reads so to me, but I?m surprised that characters in BMP and outside BMP have such differences, so I?m doubting my English skill. 4. Should every code point that are not given the Default_Ignorable_Code_Point property and that without interpretations nor glyphs displayed in fallback rendering? I could not find such statement in Unicode spec, but there are some people who believe so. 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? 
This must be RTFM, but pointing out where to read would be appreciated. Thank you for your support in advance. [1] http://dev.w3.org/csswg/css-text/ [2] http://dev.w3.org/csswg/css-text/#white-space-processing [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf /koji _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From asmusf at ix.netcom.com Sun Jun 29 14:24:05 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 29 Jun 2014 12:24:05 -0700 Subject: Characters that should be displayed? In-Reply-To: References: Message-ID: <53B067D5.6050102@ix.netcom.com> On 6/29/2014 11:44 AM, Koji Ishii wrote: >> Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph > By looking at this, my questions are as follows: > > 1. Should control characters that browsers do not interpret be displayed in fallback rendering? > 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? > > These two questions are probably yes from what I understand the text quoted above, By displaying a fall-back rendering the user is alerted that something is present, but normally not visible to the user. However, these are not the only invisible characters, and many should not (must not) be rendered, ever (except in diagnostic modes). So, it is a bit unclear to me what precisely this recommendation buys you, as it is incomplete. The recommendation is prefixed with "To avoid security problems,...".
If this is taken to mean that it should apply in contexts that require strict attention to security issues, then they probably define a minimum of what should be done, and other measures need to be taken in addition. > but things get harder the more I think: > > 3. When the above text says "surrogate code points", does that mean everything outside BMP? It reads so to me, but I'm surprised that characters in BMP and outside BMP have such differences, so I'm doubting my English skill. No, those would be supplementary code points. Surrogates are values that are intended to be used in pairs as code units in UTF-16. Ill-formed data may contain unpaired values, those are referred to as Surrogate code points. > 4. Should every code point that is not given the Default_Ignorable_Code_Point property and that has no interpretation or glyph be displayed in fallback rendering? I could not find such statement in Unicode spec, but there are some people who believe so. > 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? This must be RTFM, but pointing out where to read would be appreciated. From jkorpela at cs.tut.fi Sun Jun 29 16:02:59 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 30 Jun 2014 00:02:59 +0300 Subject: Characters that should be displayed? In-Reply-To: References: Message-ID: <53B07F03.5010105@cs.tut.fi> 2014-06-29 21:44, Koji Ishii wrote: > The spec currently has the following text[2]: > >> Control characters (Unicode class Cc) other than tab (U+0009), line >> feed (U+000A), and carriage return (U+000D) are ignored for the >> purpose of rendering. (As required by [UNICODE], unsupported >> Default_ignorable characters must also be ignored for rendering.) > > and there's feedback saying that CSS should display visible glyphs > for these control characters. That would change the identity of the characters. They are by definition "control characters", i.e.
they have no visible glyphs, but they may have control effects. However, it might be argued that rendering them somehow would not mean normal rendering but be a diagnostic indication of an error. Those characters are invalid in HTML and XML (except XML 1.1, but who uses it?). However, the tradition of web browsers is permissive in order to be user-friendly. E.g., a casual control character somewhere might be interesting to a *developer* or maintainer to notice, so that he could analyze and fix the problem that caused it, but to a *user* (visitor), it would mostly be just disturbing. He can't fix the problem, and it is mostly useless to him to see that the page has some control character in the source. So *developer tools* should indicate such problems or provide ways to detect them, but it seems correct to ignore them in normal rendering. > Since all major browsers do not display > them today, this is a breaking-change Well, I would not take that as a strong argument. This would be a change in error processing. But it would not be useful for other reasons. > I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring > Characters in Processing"[3]: > >> Surrogate code points, private-use characters, and control >> characters are not given the Default_Ignorable_Code_Point property. >> To avoid security problems, such characters or code points, when >> not interpreted and not displayable by normal rendering, should be >> displayed in fallback rendering with a fallback glyph > > By looking at this, my questions are as follows: > > 1. Should control characters that browsers do not interpret be > displayed in fallback rendering? It is reasonable to interpret that there are no such control characters, because all control characters except those with special handling are interpreted as being invalid data and therefore ignored. 2. Should private-use characters > (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be > displayed in fallback rendering?
They might be seen as "not displayable by normal rendering", so yes. On the practical side, although Private Use characters should not be used in public information interchange, they are increasingly popular in "icon font" tricks. Whatever we think of such tricks, users should not be punished for them. If the trick fails (usually because a page uses a downloadable font for icon glyphs allocated to Private Use codepoints but something prevents the use of such a font), it is relevant to the user to know that there is *some* data, which can be crucial (e.g., an item in a navigation menu). So some dull fallback rendering is probably better than simply ignoring the characters. > 3. When the above text says "surrogate code points", does that mean > everything outside BMP? No, it means code points that do not represent *any* characters due to being in certain special areas in the coding space. They are invalid in HTML and in XML. If they appear in data, the reason is usually that UTF-16 encoded data containing non-BMP characters is being processed in a wrong way. At the level of interpreting a byte stream as a stream of characters, surrogate code *units* in UTF-16 should be processed and interpreted in pairs so that one pair is taken as one character. And when CSS gets at it, it only sees the character in the DOM. It is adequate to ignore surrogate code points, since they are invalid and signalling them to users (as opposed to developers) would hardly do any good. > 4. Should every code point that is not > given the Default_Ignorable_Code_Point property and that has no > interpretation or glyph be displayed in fallback rendering? I could > not find such statement in Unicode spec, but there are some people > who believe so. > 5. Is there anything else Unicode recommends to > display in fallback rendering, or not to display? This must be RTFM, > but pointing out where to read would be appreciated.
From the Unicode point of view, an implementation may decide what characters it supports. What it does to characters that it does not support seems to be generally up to the implementation to decide as regards rendering. Here, too, I would consider the practical impact on users. If a page contains characters that have no glyphs in the fonts that are used, then the page has data that is probably valid but cannot be rendered in a particular situation. Showing some indication of this is relevant, because the user knows he is missing something real, and he might be able to fix the situation in various ways (e.g., changing browser settings, downloading and installing extra fonts, or just switching to a different browser; browsers are known to differ in their abilities to use the fonts installed in a system). Yucca From prosfilaes at gmail.com Sun Jun 29 16:48:34 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 29 Jun 2014 14:48:34 -0700 Subject: Characters that should be displayed? In-Reply-To: <53B07F03.5010105@cs.tut.fi> References: <53B07F03.5010105@cs.tut.fi> Message-ID: On Sun, Jun 29, 2014 at 2:02 PM, Jukka K. Korpela wrote: > They might be seen as "not displayable by normal rendering", so yes. On the > practical side, although Private Use characters should not be used in public > information interchange, they are increasingly popular in "icon font" > tricks. Since when is HTML necessarily public information interchange? I can't imagine where you would better use private use characters than in HTML, where a font can be named but you don't have enough control over the format to enter the data in some other format. -- Kie ekzistas vivo, ekzistas espero.
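The Private Use ranges under discussion (the three ranges listed in Koji's question 2) are easy to test directly; a minimal Python sketch, with the function name chosen for illustration:

```python
def is_private_use(cp: int) -> bool:
    """True for the Private Use code points listed in question 2:
    the BMP PUA plus the two supplementary PUA planes (15 and 16)."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

assert is_private_use(0xE001)        # a typical "icon font" code point
assert is_private_use(0x10FFFD)      # last PUA code point in plane 16
assert not is_private_use(ord("A"))  # ordinary assigned character
```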
From andrea.giammarchi at gmail.com Sun Jun 29 19:59:17 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sun, 29 Jun 2014 17:59:17 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: <20140629194431.692a83ce@JRWUBU2> References: <20140629095102.7671e2f3@JRWUBU2> <20140629194431.692a83ce@JRWUBU2> Message-ID: It does not matter, the example POP should be visible, followed by an ignored FE0E ... I think we are good here, nothing else to clarify from my side. Thanks :-) On Sun, Jun 29, 2014 at 11:44 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 29 Jun 2014 09:24:50 -0700 > Andrea Giammarchi wrote: > > > ...no keyboard would automatically put such sequence > > in a text field since such sequence as it is is meaningless for today > > standards. > > While perhaps no keyboard would map it to a single keystroke plus > modifiers, direct hex input is sometimes the swiftest input method, as > I found when transcribing some theorems a few days ago. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Mon Jun 30 00:00:53 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 30 Jun 2014 08:00:53 +0300 Subject: Characters that should be displayed? In-Reply-To: References: <53B07F03.5010105@cs.tut.fi> Message-ID: <53B0EF05.2000603@cs.tut.fi> 2014-06-30 0:48, David Starner wrote: > On Sun, Jun 29, 2014 at 2:02 PM, Jukka K. Korpela wrote: >> They might be seen as ?not displayable by normal rendering?, so yes. On the >> practical side, although Private Use characters should not be used in public >> information interchange, they are increasingly popular in ?icon font? >> tricks. > > Since when is HTML necessarily public information interchange? Since 1990. ? 
Seriously, HTML was designed for public information interchange, and this is still its dominant use and regularly implied when discussing HTML. Besides, even when the use is not public in a strict sense, it is generally based on client technologies that have no provisions for private agreements, in the sense of agreeing on meanings for Private Use codepoints. Web browsers and other HTML renderers have special interpretations for some characters (markup-significant characters, special treatment of some input characters, etc.) but no mechanism for adding rules that say something about Private Use characters. The reason why "icon font" tricks mostly work is that browsers treat most codepoints so that they try to render them using some fonts, under the influence of CSS, and in CSS you can nowadays pretty reliably, but not 100% reliably, use @font-face to specify a specific font to be used. The issue here, however, is what happens when the trick fails, for one reason or another. Private Use codepoints are mostly attempts at presenting some glyphs, rather than accidental occurrences of data that is best ignored (like control characters mostly are, e.g. NUL inserted by server-side software or an authoring tool). > I can't > imagine where you would better use private use characters than in HTML > where a font can be named but you don't have enough control over the > format to enter the data in some other format. Applications that operate on plain text and use one fixed but configurable font are a much better example. If you need to use, say, a currency symbol that has not yet been added to Unicode but can be included in the font, then a Private Use codepoint is the only good way (and the only other way is to put the glyph into a code position allocated for some defined character, like ????this would work in practice, but it's really not recommended).
In HTML, on the other hand, you can instead use images, and CSS lets you scale the images to the font size if desired. Yucca From prosfilaes at gmail.com Mon Jun 30 01:42:15 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 29 Jun 2014 23:42:15 -0700 Subject: Characters that should be displayed? In-Reply-To: <53B0EF05.2000603@cs.tut.fi> References: <53B07F03.5010105@cs.tut.fi> <53B0EF05.2000603@cs.tut.fi> Message-ID: On Sun, Jun 29, 2014 at 10:00 PM, Jukka K. Korpela wrote: > Applications that operate on plain text and use one fixed but configurable > font are a much better example. If you need to use, say, a currency symbol > that has not yet been added to Unicode but can be included in the font, then > a Private Use codepoint is the only good way Or record the character using some form of escape. I'm not thinking of many applications that operate on plain text that aren't processed before display to an end user, and there's a reason why currency is recorded by 3 ASCII characters. > In HTML, on the other hand, you can instead use images, and CSS lets you > scale the images to the font size if desired And that's problematic, for the exact same reasons using images of text is always problematic. It can't be copied and then searched for or pasted, and you practically have to write it in ASCII or PUA and transliterate it into references to images. PUA is never necessary if you have your own application, as you can transfer data in whatever format with your own application. It's most useful with standard formats, like HTML and email, where the PUA lets someone use letters or scripts almost like they were encoded. -- Kie ekzistas vivo, ekzistas espero. From jjc at jclark.com Mon Jun 30 08:37:23 2014 From: jjc at jclark.com (James Clark) Date: Mon, 30 Jun 2014 20:37:23 +0700 Subject: Characters that should be displayed?
In-Reply-To: References: Message-ID: A couple of your questions are addressed by: http://www.unicode.org/faq/unsup_char.html In particular: Q: Which characters should be displayed with a missing glyph, if not supported? A: All characters other than whitespace and default-ignorable characters. James On Mon, Jun 30, 2014 at 1:44 AM, Koji Ishii wrote: > Hello Unicoders, > > I'm a co-editor of CSS Text Level 3[1], and I would appreciate your > support in defining rendering behavior in CSS. > > The spec currently has the following text[2]: > > > Control characters (Unicode class Cc) other than tab (U+0009), line feed > (U+000A), and carriage return (U+000D) are ignored for the purpose of > rendering. (As required by [UNICODE], unsupported Default_ignorable > characters must also be ignored for rendering.) > > and there's feedback saying that CSS should display visible glyphs for > these control characters. Since all major browsers do not display them > today, this is a breaking-change and the CSS WG needs to discuss this > feedback. But the WG would appreciate understanding what Unicode recommends. > > I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring > Characters in Processing"[3]: > > > Surrogate code points, private-use characters, and control characters > are not given the Default_Ignorable_Code_Point property. To avoid security > problems, such characters or code points, when not interpreted and not > displayable by normal rendering, should be displayed in fallback rendering > with a fallback glyph > > By looking at this, my questions are as follows: > > 1. Should control characters that browsers do not interpret be displayed > in fallback rendering? > 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, > 100000-10FFFD) without glyphs be displayed in fallback rendering? > > These two questions are probably yes from what I understand the text > quoted above, but things get harder the more I think: > > 3.
When the above text says "surrogate code points", does that mean > everything outside BMP? It reads so to me, but I'm surprised that > characters in BMP and outside BMP have such differences, so I'm doubting my > English skill. > 4. Should every code point that is not given the > Default_Ignorable_Code_Point property and that has no interpretation or > glyph be displayed in fallback rendering? I could not find such statement in > Unicode spec, but there are some people who believe so. > 5. Is there anything else Unicode recommends to display in fallback > rendering, or not to display? This must be RTFM, but pointing out where to > read would be appreciated. > > Thank you for your support in advance. > > [1] http://dev.w3.org/csswg/css-text/ > [2] http://dev.w3.org/csswg/css-text/#white-space-processing > [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf > > /koji > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ritt.ks at gmail.com Mon Jun 30 10:59:54 2014 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Mon, 30 Jun 2014 18:59:54 +0300 Subject: Characters that should be displayed? In-Reply-To: <53B067D5.6050102@ix.netcom.com> References: <53B067D5.6050102@ix.netcom.com> Message-ID: 2014-06-29 22:24 GMT+03:00 Asmus Freytag : > but things get harder the more I think: >> >> 3. When the above text says "surrogate code points", does that mean >> everything outside BMP? It reads so to me, but I'm surprised that >> characters in BMP and outside BMP have such differences, so I'm doubting my >> English skill. >> > > No, those would be supplementary code points. Surrogates are values that > are intended to be used in pairs as code units in UTF-16. Ill-formed data > may contain unpaired values, those are referred to as Surrogate code points.
> > IIRC, after HTML parsing, validating and building DOM, no single surrogate code point could be met, since the presence of any ill-formed data in the Unicode text makes the whole text ill-formed. It's a security recommendation to decoders to replace any unpaired surrogate code point with U+FFFD instead, thus making the text well-formed. As a side effect, the unpaired surrogate code point becomes visible (usually as a square box fallback glyph). What is the consideration regarding U+FFFD in CSS? Konstantin -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 30 11:33:11 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 30 Jun 2014 18:33:11 +0200 Subject: Characters that should be displayed? In-Reply-To: References: <53B067D5.6050102@ix.netcom.com> Message-ID: I generally agree with your comment. For your question, U+FFFD is not special in CSS; it's just a standard character that will be mapped to some symbol (from any font, or synthesized from an internal font (or collection of glyphs) of the renderer according to other styles (there's no guarantee that styles like italic or bold will look different; in fact there's no good way to exhibit alternatives if the renderer does not look up a matching font, but at least the renderer should size it according to the computed "font-size:" setting). That symbol is often (but not necessarily) a "white" question mark in a "black" diamond; replace "white" in fact by the background color/image/shades, and "black" by the "color:" setting, just like in regular fonts mapping any other symbol).
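As an aside, the decoder behavior Konstantin describes (replacing an unpaired surrogate with U+FFFD rather than silently dropping it) is easy to observe; a minimal Python sketch, assuming UTF-16LE input:

```python
# A lone high surrogate (U+D800) as a single UTF-16LE code unit:
lone = b"\x00\xd8"

# A security-conscious decoder substitutes U+FFFD for the ill-formed
# unit instead of silently dropping it:
assert lone.decode("utf-16-le", errors="replace") == "\ufffd"

# A well-formed supplementary character occupies two code units
# (a surrogate pair):
pair = "\U00010348".encode("utf-16-le")   # U+10348 GOTHIC LETTER HWAIR
assert len(pair) == 4
```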
This symbol should also have an inherited direction, not a strong LTR direction: it should not alter the direction of text on either side (or break runs of text) for Bidi rendering, but it may eventually be mirrored in resolved RTL runs (if this is appropriate for the chosen glyph (not always easy to determine if the symbol is chosen from a matching font in context; but as the symbol to use is quite arbitrary, and should be distinctive enough from other characters, this mirroring is not really necessary, unless the symbol shows some explicit text in a specific style; something to avoid as the character is not specific to any script or language). 2014-06-30 17:59 GMT+02:00 Konstantin Ritt : > 2014-06-29 22:24 GMT+03:00 Asmus Freytag : > >> but things get harder the more I think: >>> >>> 3. When the above text says "surrogate code points", does that mean >>> everything outside BMP? It reads so to me, but I'm surprised that >>> characters in BMP and outside BMP have such differences, so I'm doubting my >>> English skill. >>> >> >> No, those would be supplementary code points. Surrogates are values that >> are intended to be used in pairs as code units in UTF-16. Ill-formed data >> may contain unpaired values, those are referred to as Surrogate code points. >> >> > IIRC, after HTML parsing, validating and building DOM, no single > surrogate code point could be met, since the presence of any ill-formed data > in the Unicode text makes the whole text ill-formed. > It's a security recommendation to decoders to replace any > unpaired surrogate code point with U+FFFD instead, thus making the text > well-formed. As a side effect, the unpaired surrogate code point becomes > visible (usually as a square box fallback glyph). > What is the consideration regarding U+FFFD in CSS?
> > > Konstantin > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kojiishi at gluesoft.co.jp Mon Jun 30 13:35:22 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Mon, 30 Jun 2014 18:35:22 +0000 Subject: Characters that should be displayed? In-Reply-To: References: <53B067D5.6050102@ix.netcom.com> Message-ID: <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> Thank you all for this great feedback. I learned a lot. I, however, still don't get one thing. In the spec text: Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph How could displaying a missing PUA glyph help security? I can imagine the address bar could have such security risks, but this is about rendering. I can imagine 0x00 could lead to buffer overflow attacks, but it looks to me that preventing such characters from being inserted into the DOM is safer, though I admit that I'm not a professional in security at all. I understand some here wants to display them to help users to identify broken characters, some consider it doesn't help users at all. I tend to agree with the latter, but either way, it's about helping users to fix their documents. Does anyone know what security risks the spec is talking about? /koji On Jul 1, 2014, at 1:33 AM, Philippe Verdy > wrote: I generally agree with your comment.
For your question, U+FFFD is not special in CSS; it's just a standard character that will be mapped to some symbol (from any font, or synthesized from an internal font (or collection of glyphs) of the renderer according to other styles (there's no guarantee that styles like italic or bold will look different; in fact there's no good way to exhibit alternatives if the renderer does not look up a matching font, but at least the renderer should size it according to the computed "font-size:" setting). That symbol is often (but not necessarily) a "white" question mark in a "black" diamond; replace "white" in fact by the background color/image/shades, and "black" by the "color:" setting, just like in regular fonts mapping any other symbol). This symbol should also have an inherited direction, not a strong LTR direction: it should not alter the direction of text on either side (or break runs of text) for Bidi rendering, but it may eventually be mirrored in resolved RTL runs (if this is appropriate for the chosen glyph (not always easy to determine if the symbol is chosen from a matching font in context; but as the symbol to use is quite arbitrary, and should be distinctive enough from other characters, this mirroring is not really necessary, unless the symbol shows some explicit text in a specific style; something to avoid as the character is not specific to any script or language). 2014-06-30 17:59 GMT+02:00 Konstantin Ritt >: 2014-06-29 22:24 GMT+03:00 Asmus Freytag >: but things get harder the more I think: 3. When the above text says "surrogate code points", does that mean everything outside BMP? It reads so to me, but I'm surprised that characters in BMP and outside BMP have such differences, so I'm doubting my English skill. No, those would be supplementary code points. Surrogates are values that are intended to be used in pairs as code units in UTF-16. Ill-formed data may contain unpaired values, those are referred to as Surrogate code points.
IIRC, after HTML parsing, validating and building DOM, no single surrogate code point could be met, since the presence of any ill-formed data in the Unicode text makes the whole text ill-formed. It's a security recommendation to decoders to replace any unpaired surrogate code point with U+FFFD instead, thus making the text well-formed. As a side effect, the unpaired surrogate code point becomes visible (usually as a square box fallback glyph). What is the consideration regarding U+FFFD in CSS? Konstantin _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.riefenstahl at turtle-trading.net Mon Jun 30 15:47:39 2014 From: b.riefenstahl at turtle-trading.net (Benjamin Riefenstahl) Date: Mon, 30 Jun 2014 22:47:39 +0200 Subject: Problem with Mandaic shaping, IT and IN switched Message-ID: Hi everybody, I am currently in the process of designing a simple OpenType font for Mandaic. As some of you are probably aware, shaping in OpenType as it is recommended by the OpenType standard requires that the application (i.e. the text rendering engine) knows the joining behaviour of the characters. It seems that there is an error in the joining data for Mandaic as defined by the Unicode standard (tables 14-5 and 14-6, chapter 14.12 in version 6.3) and by the file ArabicShaping.txt at http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt. The tables list the character IT as dual-joining and the character IN as right-joining. These two seem to be switched. In the table columns with the actual characters (columns Xn, Xr, Xm, Xl) the correct characters are given (compare the code chart at http://www.unicode.org/charts/PDF/U0840.pdf), but the names (and the relative positions in the tables) are wrong, and that error is then taken over into the file ArabicShaping.txt: 0847; MANDAIC IT; D; No_Joining_Group [...]
084F; MANDAIC IN; R; No_Joining_Group The correct characters in the table should be (in this order) * Dual-Joining: ATT, AK, AL, AM, AS, IN, AP, ASZ, AQ, AR, AT * Right-Joining: HALQA, AZ, IT, AKSA, ASH And the correct data in ArabicShaping.txt: 0847; MANDAIC IT; R; No_Joining_Group 084F; MANDAIC IN; D; No_Joining_Group Please advise what I can do to help correct this in some future version of the Unicode standard. Regards, Benjamin Riefenstahl From prosfilaes at gmail.com Mon Jun 30 19:14:40 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 30 Jun 2014 17:14:40 -0700 Subject: Characters that should be displayed? In-Reply-To: <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> Message-ID: On Mon, Jun 30, 2014 at 11:35 AM, Koji Ishii wrote: > I understand some here wants to display them to help users to identify > broken characters, some consider it doesn't help users at all. I tend to > agree with the latter, but either way, it's about helping users to fix their > documents. If your browser stops recognizing Japanese, would you prefer to see fallback characters or nothing? For me, I'd much rather see fallback characters because then the question is why aren't these characters displaying and not why is this webpage blank. -- Kie ekzistas vivo, ekzistas espero. From kojiishi at gluesoft.co.jp Mon Jun 30 23:12:56 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Tue, 1 Jul 2014 04:12:56 +0000 Subject: Characters that should be displayed? In-Reply-To: References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> Message-ID: <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp> On Jul 1, 2014, at 9:14 AM, David Starner wrote: On Mon, Jun 30, 2014 at 11:35 AM, Koji Ishii wrote: >> I understand some here wants to display them to help users to identify >> broken characters, some consider it doesn't help users at all.
I tend to >> agree with the latter, but either way, it's about helping users to fix their >> documents. > > If your browser stops recognizing Japanese, would you prefer to see > fallback characters or nothing? For me, I'd much rather see fallback > characters because then the question is why aren't these characters > displaying and not why is this webpage blank. Thanks for the reply. It's very likely that the page contains images, borders, background, etc., so I can recognize that all the text is missing. But neither text missing nor text garbled suggests to me how to fix it. I'd try another browser, then give up viewing the page. But the scenario is still not a security issue. Whether it's a security issue or a feature makes a big difference for us to discuss how important this change is, so I'm interested in knowing what kind of security aspects showing fallback glyphs can help, when considering browser rendering. Any thoughts? /koji From prosfilaes at gmail.com Mon Jun 30 23:49:57 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 30 Jun 2014 21:49:57 -0700 Subject: Characters that should be displayed? In-Reply-To: <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp> References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp> Message-ID: On Mon, Jun 30, 2014 at 9:12 PM, Koji Ishii wrote: > Thanks for the reply. It's very likely that the page contains images, borders, background, etc., so I can recognize that all the text is missing. But neither text missing nor text garbled suggests to me how to fix it. I'd try another browser, then give up viewing the page. If it didn't suggest how to fix it to you before today, it should suggest it to you today. If you get a bunch of fallback characters, your first guess should be font problems. Anyone using scripts with poor support, especially stuff stored in the PUA, will recognize right off when the text isn't displaying.
-- Kie ekzistas vivo, ekzistas espero.