From verdy_p at wanadoo.fr Sun Jun 1 00:06:34 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 1 Jun 2014 07:06:34 +0200 Subject: Corrigendum #9 In-Reply-To: <538A8CD8.4070905@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> <538A8CD8.4070905@ix.netcom.com> Message-ID: I've not proposed to move these characters elsewhere (or to re-encode them); why do you think that? I just challenge your statement that a block cannot be discontinuous, something that is unique among all Unicode properties and completely absent from ISO 10646, which does not define any real properties besides a name at a specific code point and an informative glyph, plus historic reference links documenting its intended usage. (Where is it written in the Unicode-only stability rules that a block is continuous, when allocation of code points within these blocks has always been discontinuous? Such a rule would be much more important than this legacy one, which, as you stated, has absolutely no use in regexps.) Even the set of non-characters is discontinuous, as are the blocks for the Arabic script, the blocks for presentation forms, and the blocks for compatibility characters. Every property in Unicode is fragmented over multiple ranges (whose lengths are also very frequently discontinuous within each block, or even within the same encoding column). In other words, IsInArabicPresentation(x) would still remain true for all assigned characters in that block; it would just be false for non-characters considered outside of it. But non-characters don't have any useful property except being non-characters (the block where they are allocated does not matter at all). The alternative is to not restrict these characters as being non-characters, allowing them to be present in files without enforcing any error, i.e. 
treat it like PUA, also with a few possible default properties (this keeps them somewhat interoperable under limited private agreements, possibly implicit in the transport interface or envelope format). 2014-06-01 4:15 GMT+02:00 Asmus Freytag : > More importantly, while a regex that uses an expression that is > equivalent to "IsInArabicPresentation(x)" may or may not be well-defined, > there is no reason to break it by splitting the block. > > As blocks cannot be discontiguous (unlike other properties), some Arabic > Presentation forms would have to be put into a new block (Arabic > Presentation Forms C). This is what would break such expressions - it has, > in fact, nothing to do with the status of the noncharacters. > > There's no reason to contemplate breaking changes of any kind at this > point. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Jun 1 01:20:16 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 31 May 2014 23:20:16 -0700 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> <538A8CD8.4070905@ix.netcom.com> Message-ID: <538AC620.7010208@ix.netcom.com> On 5/31/2014 10:06 PM, Philippe Verdy wrote: > I've not proposed to move these characters elsewhere (or to re-encode > them), why do you think that? > > I just challenge your statement that a block cannot be discontinuous, Well, go ahead and challenge that. As implemented in the current names list and the file Blocks.txt, a block would have this definition: "A block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16." 
Per Chapter 3, the definition of the property block is given in Section 17.1 (Code Charts) - which contains no actual definition; it only tells you how blocks are used in organizing the code charts. So, effectively, a block is what Blocks.txt (and therefore the names list) says it is. The way blocks are assigned has followed the empirically derived definition I gave above, and at this point, the production process for the code charts has some of these restrictions built in. Chapter 3 calls blocks an enumerated property, meaning that the names must be unique, and Blocks.txt associates a single range with a name, in concurrence with the glossary, which says blocks represent a range of characters (not a collection of ranges). Likewise, changing blocks to not start at, or contain, multiples of 16 code points (sometimes called a "column") is equally not in the cards - it would break the very production process for the code charts. The description of how blocks are used does not contemplate that they can be mutually overlapping, so that becomes part of their implicit definition as well. There's reason behind the madness of not providing an explicit definition of "block" in the standard. It has to do with discouraging people from relying on what is largely an editorial device (headers on charts). However, it does not mean that arbitrary redefinition of a block from a single to multiple ranges is something that can or should be contemplated. So, the chances that the UTC would agree to such changes, even if not formally guaranteed, are de facto nil. 
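The single-range behavior described above is directly observable in Java, whose Character.UnicodeBlock property is derived from Blocks.txt. A minimal sketch (the chosen code points are merely illustrative):

```java
// Sketch: querying the Block property as derived from Blocks.txt.
// Each code point belongs to at most one named, contiguous range.
public class BlockDemo {
    public static void main(String[] args) {
        // U+FB50 is an assigned character in Arabic Presentation Forms-A.
        System.out.println(Character.UnicodeBlock.of(0xFB50));
        // The noncharacters U+FDD0..U+FDEF lie inside that same block:
        // the range is one contiguous span, noncharacters included.
        System.out.println(Character.UnicodeBlock.of(0xFDD0));
        // A code point assigned to no block at all yields null.
        System.out.println(Character.UnicodeBlock.of(0xE0080));
    }
}
```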
A./ From verdy_p at wanadoo.fr Sun Jun 1 03:28:29 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 1 Jun 2014 10:28:29 +0200 Subject: Corrigendum #9 In-Reply-To: <538AC620.7010208@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> <538A8CD8.4070905@ix.netcom.com> <538AC620.7010208@ix.netcom.com> Message-ID: Ok then, the definitions still do not say that blocks cannot be split (in fact it has already occurred many times across versions, by reevaluating the need for new blocks and by densifying the BMP, up to the point that sometimes a single addition in the same script required allocating columns in multiple sub-blocks as small as a column of 16 code points). Blocks are in fact artefacts of the encoding process; they are provisional until the characters needed are effectively allocated. Later, any unused area may be reallocated to another block. On the BMP, for example, there remains a quite large area in a block initially described for supplemental arrows that could host a new full alphabetic script (most probably one of the remaining Indic or African modern scripts still to encode), or symbols used in common software or devices for their UI and its documentation (such as the window minimize/maximize/close buttons or resize corner, the refresh button, the microphone symbol used to initiate voice input, or the radio-wave symbol for accessing a wireless network), or conventional symbols for accessibility devices, or marks of dangers/hazards or restrictions/prohibitions that could be used as widely as currency symbols (often encoded in emergency but in isolation, unlike other symbols coming in small related groups; if these collections are large, like emoticons/emoji, they'll go directly in the SMP). 
Blocks are not immutable in size, even if they keep their initial position (because allocations in blocks start from the leading position, skipping only a few entries that were balloted for possible later allocation to the same script, or for former proposals of characters that were balloted in favor of unification with another character, or just to align the block with the layout of another legacy encoding chart, or because the initial beta fonts submitted to support the script allocated other characters that were not approved, and the fonts were not updated to use a new layout). Maybe in some future we will see a few more allocations made in the BMP using half columns (this is *already* the case at the end of the BMP, where a single column is split in two parts containing Armenian presentation forms and Hebrew presentation forms for Yiddish...), or filling some random holes for which it is definitively decided that the initial reservations in the roadmap will never be used for the initially intended purpose. 2014-06-01 8:20 GMT+02:00 Asmus Freytag : > On 5/31/2014 10:06 PM, Philippe Verdy wrote: > >> I've not proposed to move these characters elsewhere (or to re-encode >> them), why do you think that? >> >> I just challenge your statement that a block cannot be discontinuous, >> > > Well, go ahead and challenge that. > > As implemented in the current names list and file Blocks.txt a block would > have this definition. "A block is a uniquely named, continuous, > non-overlapping range of code points, containing a multiple of 16 code > points, and starting at a location that is a multiple of 16." > > Per chapter 3 the definition of the property block is given in Section > 17.1 (Code Charts) - which contains no actual definition, only tells you > how they are used in organizing the code charts, so, effectively, a block > is what blocks.txt (and therefore the names list) say it is. 
The way blocks > are assigned, has been following the empirically derived definition I gave > above, and at this point, the production process for the code charts has > some of these restrictions built in. > > Chapter 3 calls blocks an enumerated property, meaning that the names must > be unique, and blocks.txt associates a single range with a name, in > concurrence with the glossary, which says blocks represent a range of > characters (not a collection of ranges). Likewise, changing blocks to not > starting at or containing multiples of 16 code points (sometimes called a > "column") is equally not in the cards - it would break the very production > process for chart production. The description of how blocks are used does > not contemplate that they can be mutually overlapping, so that becomes part > of their implicit definition as well. > > There's reason behind the madness of not providing an explicit definition > of "block" in the standard. It has to do with discouraging people from > relying on what is largely an editorial device (headers on charts). > However, it does not mean that arbitrary redefinition of a block from a > single to multiple ranges is something that can or should be contemplated. > > So, the chances that UTC would agree to such changes, even if not formally > guaranteed, is de facto nil. > > A./ > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sun Jun 1 03:49:31 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 1 Jun 2014 09:49:31 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> Message-ID: <20140601094931.413857e2@JRWUBU2> On Sat, 31 May 2014 19:28:27 -0700 Markus Scherer wrote: > On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > > Bear in mind that a pattern \uD808 shall not match anything in a > > well-formed Unicode string. > > > Depends. See the definitions of Unicode strings vs. UTF strings. D80: Unicode string: A code unit sequence containing code units of a particular Unicode encoding form... D85: Well-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it does follow the specification of that Unicode encoding form. How does a Unicode string purport anything? >> \uD808\uDF45 specifies a sequence of two >> codepoints. > Implementations that use Unicode 16-bit strings will usually treat > this as one supplementary code point. > In Java, there is no other way to escape one. In which case, Java does *not* supply 'basic Unicode support' as defined by UTS#18 Version 17 - see just before Section 1.1.1 therein. An engine that matches code unit by code unit does not comply with RL1.7. This makes sense in so far as it provides for consistent results across UTF encodings for Unicode strings that could once have been reversibly converted. (A 32-bit Unicode string <D808, DF45> converted to a 16-bit Unicode string and back would become <12345>.) Now that that conversion should not preserve lone surrogates (both C10 together with D93, and TUS Section 5.22), it makes less sense. 
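The point about conversion not preserving lone surrogates can be illustrated in Java, whose default UTF-8 encoder substitutes its replacement byte for an unpaired surrogate (a sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

// Sketch: a lone surrogate is a legal code unit in a Unicode 16-bit
// string, but it cannot survive conversion to a UTF encoding form.
public class LoneSurrogateDemo {
    public static void main(String[] args) {
        String pair = "\uD808\uDF45"; // U+12345 as a surrogate pair
        String lone = "\uD808";       // unpaired lead surrogate
        byte[] ok  = pair.getBytes(StandardCharsets.UTF_8);
        byte[] bad = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(ok.length);                 // 4: a valid four-byte sequence
        System.out.println(bad.length + " " + bad[0]); // 1 63: replaced by '?' (0x3F)
    }
}
```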
However, I can think of one major objection to a regular expression engine using 16-bit Unicode strings treating every supplementary point as a sequence of two surrogate points. While it might be acceptable for a lone surrogate to match \P{L} (codepoints that are not letters), it would not be acceptable for every supplementary point to match \P{L}\P{L} or even \p{Any}\p{Any}. Richard. From richard.wordingham at ntlworld.com Sun Jun 1 05:42:39 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 1 Jun 2014 11:42:39 +0100 Subject: Long-Encoded Restricted Characters in High Frequency Modern Use In-Reply-To: References: <20140529233956.5db1ea5e@JRWUBU2> Message-ID: <20140601114239.24a2d02e@JRWUBU2> On Sat, 31 May 2014 21:27:55 +0200 Mark Davis wrote: > The structure of the data is based on the use of NFKC characters in > identifiers. So SARA AM and the Lao equivalent are both not NFKC > characters, and are categorized as such, and would need to be > represented by their NFKC forms. The process is in > http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection There's no absolute IETF prohibition on NFKC characters. > > Now, U+0E4D THAI > > CHARACTER NIKHAHIT is classified as 'allowed; recommended', although > > its main use is in writing Pali, which would suggest that it should > > be 'restricted; historic' or 'restricted; limited-use'. > For that, it would be best to submit via > http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a > feedback form at http://www.unicode.org/reporting.html, just to be > sure. I have no desire to restrict NIKHAHIT simply because of limited use. The problem is simply the confusion caused by the existence of SARA AM. Unicode support for the compatibility decomposition of SARA AM is incomplete, in part irremediably so. The problem is that has a different appearance to . In the former, the tone mark is the topmost glyph; in the latter, the nikkhahit is the topmost glyph. 
usually has the same appearance as , which is what Uniscribe effectively converts it to. There used to be filters in place to stop being typed. It's not unknown for to be mistyped as , and that too used to be blocked. DUCET has a contraction for to reduce the ill-effects, but of course the contraction doesn't work for the sequence . (Action on me: CLDR ticket on omission for th locale.) In short, the co-existence of NIKHAHIT with ccc=0 and SARA AM causes problems. The simplest solution is to restrict NIKHAHIT, which should be tolerable. Ideally, one would merely prohibit the sequence \p{Mn}*\u0E4D\p{Mn}*\u0E32. There is no virtue in making both NIKHAHIT and SARA AM 'restricted'. Indeed, one could argue that applying the compatibility decomposition to SARA AM brings NIKHAHIT into 'high frequency modern use' - it depends on the frequency of NFKC and NFKD conversions. However, the compatibility decomposition of SARA AM is simply *wrong* as Thai text. It would be good to hear from someone at Thailand's National Electronics and Computer Technology Center (NECTEC) on the matter of SARA AM in domain names. The sequence-prohibiting solution ought to extend to Lao, but there may be the additional problem of the tone mark being applied to the SARA AM. The m17n Lao keyboard on my computer actually comes with a single keystroke for the sequence ! (Action on me: File a bug report against the keyboard.) Richard. 
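The compatibility decomposition at issue can be checked with java.text.Normalizer: U+0E33 SARA AM decomposes under NFKD to U+0E4D NIKHAHIT plus U+0E32 SARA AA, while NFD leaves it alone (a sketch; the class name is mine):

```java
import java.text.Normalizer;

// Sketch: SARA AM has a compatibility (not canonical) decomposition,
// which is how NFKC/NFKD conversion reintroduces NIKHAHIT into
// otherwise modern Thai text.
public class SaraAmDemo {
    public static void main(String[] args) {
        String saraAm = "\u0E33"; // THAI CHARACTER SARA AM
        String nfkd = Normalizer.normalize(saraAm, Normalizer.Form.NFKD);
        System.out.println(nfkd.equals("\u0E4D\u0E32")); // true
        // No canonical decomposition: NFD leaves SARA AM unchanged.
        String nfd = Normalizer.normalize(saraAm, Normalizer.Form.NFD);
        System.out.println(nfd.equals(saraAm)); // true
    }
}
```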
From public at khwilliamson.com Sun Jun 1 09:49:47 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 01 Jun 2014 08:49:47 -0600 Subject: Corrigendum #9 In-Reply-To: <5388D29C.9040502@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> Message-ID: <538B3D8B.2070102@khwilliamson.com> On 05/30/2014 12:49 PM, Asmus Freytag wrote: > One of the concerns was that people felt that they had to have "data > pipeline" style implementations (tools) go and filter these out - even > if there was no intent for the implementation to use them internally in > any way. Making clear that the standard does not require filtering > allows for cleaner implementations of such ("pass-through") tools. Thanks, I had not thought about that. I'm thinking wording something like this is more appropriate: "Noncharacters may be openly interchanged, but it is inadvisable to do so without prior agreement, since at each stage any of them might be replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at the sole discretion of that stage's implementation." From markus.icu at gmail.com Sun Jun 1 10:58:26 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 1 Jun 2014 08:58:26 -0700 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140601094931.413857e2@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> Message-ID: On Sun, Jun 1, 2014 at 1:49 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > D80: Unicode string: > A code unit sequence containing code units of a particular Unicode > encoding form... > Right -- in a Unicode 16-bit string, you have a sequence of any 16-bit value in any order. Well-formedness applies to UTF-x encoding forms. It is common to not treat unpaired surrogates as errors because they behave like "boring" code points, that is, they are "harmless". 
However, that does not mean that they work like fully supported code points in all places, just that where it's easier to treat them like harmless code points that's often done. In ICU4C simple string functions, if you search for code point 0xd800 you will find it in a string if it occurs as an unpaired surrogate. In ICU collation of 16-bit strings, an unpaired surrogate sorts with an unassigned-implicit primary weight. (You can try this with the online collation demo. In ICU UTF-8 collation, ill-formed sequences sort like U+FFFD.) >> \uD808\uDF45 specifies a sequence of two > >> codepoints. > > > Implementations that use Unicode 16-bit strings will usually treat > > this as one supplementary code point. > > In Java, there is no other way to escape one. > > In which case, Java does *not* supply 'basic Unicode support' as defined > by UTS#18 Version 17 - see just before Section 1.1.1 therein. An > engine that matches code unit by code unit does not comply with RL1.7. > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a supplementary code point, but as long as you have a surrogate pair, it is treated as a code point in APIs that support them. Java 5 upgraded the regular expression code to match code points, not code units. I don't know what it does when the pattern contains an unpaired surrogate. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Sun Jun 1 11:07:40 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 1 Jun 2014 09:07:40 -0700 Subject: Corrigendum #9 In-Reply-To: <538B3D8B.2070102@khwilliamson.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538B3D8B.2070102@khwilliamson.com> Message-ID: On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson wrote: > Thanks, I had not thought about that. 
I'm thinking wording something like > this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable to do so > without prior agreement, since at each stage any of them might be replaced > by a REPLACEMENT CHARACTER or otherwise disposed of, at the sole discretion > of that stage's implementation." I think that would invite again the kinds of implementations that triggered Corrigendum #9, where you couldn't use CLDR files with Gnome-based tools (plain text editors, file diff tools, command-line terminal) if the files contained noncharacters. (CLDR data uses them for boundary mappings in collation data.) markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Jun 1 12:04:57 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 1 Jun 2014 18:04:57 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> Message-ID: <20140601180457.273ac6b9@JRWUBU2> On Sun, 1 Jun 2014 08:58:26 -0700 Markus Scherer wrote: > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a > supplementary code point, but as long as you have a surrogate pair, > it is treated as a code point in APIs that support them. Wasn't it obvious that in the following paragraph \uD808\uDF45 was a pattern? "Bear in mind that a pattern \uD808 shall not match anything in a well-formed Unicode string. \uD808\uDF45 specifies a sequence of two codepoints. This sequence can occur in an ill-formed UTF-32 Unicode string and before Unicode 5.2 could readily be taken to occur in an ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular expression engine, the codepoint sequence <U+D808, U+DF45> cannot occur in a UTF-16 Unicode string; instead, the code unit sequence <D808 DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES KI>." 
(It might have been clearer to you if I'd said '8-bit' and '16-bit' instead of UTF-8 and UTF-16. It does make me wonder what you'd call a 16-bit encoding of arbitrary *codepoint* sequences.) Richard. From public at khwilliamson.com Sun Jun 1 12:13:53 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 01 Jun 2014 11:13:53 -0600 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538B3D8B.2070102@khwilliamson.com> Message-ID: <538B5F51.60102@khwilliamson.com> On 06/01/2014 10:07 AM, Markus Scherer wrote: > On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson > wrote: > > Thanks, I had not thought about that. I'm thinking wording > something like this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable to > do so without prior agreement, since at each stage any of them might > be replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at > the sole discretion of that stage's implementation." > > > I think that would invite again the kinds of implementations that > triggered Corrigendum #9, where you couldn't use CLDR files with > Gnome-based tools (plain text editors, file diff tools, command-line > terminal) if the files contained noncharacters. (CLDR data uses them for > boundary mappings in collation data.) > > markus I don't understand your point. Are you saying that Gnome should not have the discretion to rid its inputs of noncharacters? If so, then noncharacters really are just Gc=Co ones. 
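For reference, the 66 noncharacters are easy to test for. The helper below is hypothetical (the Java standard library offers no direct predicate); it implements the standard's definition: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```java
// Sketch: a hypothetical isNoncharacter predicate. Note that the
// General_Category of a noncharacter is Cn (unassigned), not Co
// (private use), so gc alone cannot identify them.
public class NoncharDemo {
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)  // the contiguous run in Arabic Presentation Forms-A
            || (cp & 0xFFFE) == 0xFFFE;        // U+xxFFFE and U+xxFFFF in every plane
    }
    public static void main(String[] args) {
        int count = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isNoncharacter(cp)) count++;
        }
        System.out.println(count); // 66 = 32 + 17 * 2
    }
}
```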
From asmusf at ix.netcom.com Sun Jun 1 14:34:24 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 01 Jun 2014 12:34:24 -0700 Subject: Corrigendum #9 In-Reply-To: <538B3D8B.2070102@khwilliamson.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538B3D8B.2070102@khwilliamson.com> Message-ID: <538B8040.1060103@ix.netcom.com> On 6/1/2014 7:49 AM, Karl Williamson wrote: > On 05/30/2014 12:49 PM, Asmus Freytag wrote: >> One of the concerns was that people felt that they had to have "data >> pipeline" style implementations (tools) go and filter these out - even >> if there was no intent for the implementation to use them internally in >> any way. Making clear that the standard does not require filtering >> allows for cleaner implementations of such ("pass-through") tools. > > Thanks, I had not thought about that. I'm thinking wording something > like this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable to do > so without prior agreement, since at each stage any of them might be > replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at the > sole discretion of that stage's implementation." > Karl, I think you should address the pass-through style of implementation explicitly. "Noncharacters are designed to be used for special, implementation-internal purposes, which puts them outside the text content of the data. Some implementations, by necessity, use a distributed architecture, and rely on yet other implementations for services like transport, code conversion, and so on. For such "pass-through" implementations, it would be inadvisable to rely on, or replace, any noncharacter, and certainly not to reject or filter them. Doing so would make such an implementation a poor choice to serve as a "pass-through" in a distributed architecture that makes use of noncharacters for internal purposes. 
In other words such an implementation would make it impossible to bridge between the partners in a prior agreement on the use of noncharacters, which would severely undercut its utility." You might want to check whether some statement like this isn't already part of the FAQ. If it isn't, it would be the easiest to retrofit (and the easiest place to lay out usage guidelines). Alternatively, or in conjunction, you could propose that the text in the core specification be tweaked to help set better expectations. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Jun 1 14:40:35 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 01 Jun 2014 12:40:35 -0700 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com><5388D29C.9040502@ix.netcom.com><538B3D8B.2070102@khwilliamson.com> Message-ID: <538B81B3.2040900@ix.netcom.com> On 6/1/2014 9:07 AM, Markus Scherer wrote: > On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson > > wrote: > > Thanks, I had not thought about that. I'm thinking wording > something like this is more appropriate > > "Noncharacters may be openly interchanged, but it is inadvisable > to do so without prior agreement, since at each stage any of them > might be replaced by a REPLACEMENT CHARACTER or otherwise disposed > of, at the sole discretion of that stage's implementation." > > > I think that would invite again the kinds of implementations that > triggered Corrigendum #9, where you couldn't use CLDR files with > Gnome-based tools (plain text editors, file diff tools, command-line > terminal) if the files contained noncharacters. (CLDR data uses them > for boundary mappings in collation data.) > > The new text triggers some really unwarranted interpretations, which can invalidate the use of noncharacters for their stated purpose. Please see my suggested text that attempts to describe both intent and differences in use. 
A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Jun 1 20:36:14 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 2 Jun 2014 02:36:14 +0100 Subject: Wild Card Collation Matches Message-ID: <20140602023614.0b013b0a@JRWUBU2> In a fairly wild environment (http://www.thaivisa.com/forum/topic/730564-new-front-end-to-ri-dictionary-alpha), I encountered the following question: "If you search for ?* do you expect to return words such as ???? and ????" Now, as a regular expression, in UTS#18 'Unicode Regular Expressions' Version 13 (dated 2008, superseded in 2012), RL3.5 comes pretty close to this with ranges tailored for collation. The pattern [\u0E01-\u0E02]* would match both those words. To be precise, one would use a search for [?-??]*. RL3.5 has been withdrawn because of difficulties, though I can't say that I see it as a major difficulty that at least one of [A-z] and [a-Z] is empty. Even POSIX is aware of that little issue. Turning to a fully collation-based definition of searches, UTS#10 Unicode Collation Algorithm's definition DS2 comes closest to answering the question for the UTC. DS2 reads: DS2. The pattern string P has a match at Q[s,e] according to collation C if C generates the same sort key for P as for Q[s,e], and the offsets s and e meet the boundary condition B. One can also say P has a match in Q according to C. It's a simple job to create sequences of codepoints P starting with U+0E01 THAI CHARACTER KO KAI that are tertiary matches for ???? and ??? under both DUCET and the CLDR collations for Thai. Can I therefore say that the two strings match the pattern ?* according to these collations? (A pattern P for ??? is P = .) Disturbingly, another possible answer is that there is no match for in either string because it only occurs in the legacy/extended grapheme cluster . Richard. 
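DS2's notion of matching by sort key rather than by code point can be sketched with java.text.Collator. Since the exact Thai strings above did not survive the archive, this sketch uses an English collator and an accent difference as the stand-in example (class name is mine):

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: DS2-style matching compares collation keys. At PRIMARY
// strength, accent (secondary) differences are ignored, so a pattern
// can "match" text it is not codepoint-for-codepoint equal to.
public class SortKeyMatchDemo {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.US);
        c.setStrength(Collator.PRIMARY);
        boolean match = c.getCollationKey("resume")
                         .compareTo(c.getCollationKey("r\u00E9sum\u00E9")) == 0;
        System.out.println(match); // true at primary strength
    }
}
```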
From mark at macchiato.com Mon Jun 2 04:29:09 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 11:29:09 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140601180457.273ac6b9@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> Message-ID: > \uD808\uDF45 specifies a sequence of two codepoints. ?That is simply incorrect.? In Java (and similar environments), \uXXXX means a char (a UTF16 code unit), not a code point. Here is the difference. If you are not used to Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x with the replacement y in string. Backslashes in literals need escaping, so \x needs to be written in literals as \\x. String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", "?.?"}; String target = "one: ?\uD808\uDF45?\t\t" + "two: ?\uD808\uDF45\uD808\uDF45?\t\t" + "lead: ?\uD808?\t\t" + "trail: ?\uDF45?\t\t" + "one+: ?\uD808\uDF45\uD808?"; System.out.println("pattern" + "\t?\t" + target + "\n"); for (String test : tests) { System.out.println(test + "\t?\t" + target.replaceAll(test, "??")); } *?Output:* pattern ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? \x{12345} ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? \uD808\uDF45 ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? ?? ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? ?.? ? one: ?? two: ?????? lead: ?? trail: ?? one+: ????? The target has various combinations of code units, to see what happens. Notice that Java treats a pair of lead+trail as a single code point for matching (eg .), but also an isolated surrogate char as a single code point (last line of output). Note that Java's regex in addition allows \x{hex} for specifying a code point explicitly. 
It also has the syntax \uXXXX (in a literal the \ needs escaping) to specify a code unit; that is slightly different from the Java preprocessing. Thus the first two are equivalent, and replace "{" by "x". The last two are also equivalent, and fail, because a single "{" is a broken regex pattern. System.out.println("{".replaceAll("\\u007B", "x")); System.out.println("{".replaceAll("\\x{7B}", "x")); System.out.println("{".replaceAll("\u007B", "x")); System.out.println("{".replaceAll("{", "x")); Mark *Il meglio è l'inimico del bene* On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 1 Jun 2014 08:58:26 -0700 > Markus Scherer wrote: > > > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a > > supplementary code point, but as long as you have a surrogate pair, > > it is treated as a code point in APIs that support them. > > Wasn't it obvious that in the following paragraph \uD808\uDF45 was a > pattern? > > "Bear in mind that a pattern \uD808 shall not match anything in a > well-formed Unicode string. \uD808\uDF45 specifies a sequence of two > codepoints. This sequence can occur in an ill-formed UTF-32 Unicode > string and before Unicode 5.2 could readily be taken to occur in an > ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular > expression engine, the codepoint sequence <U+D808, U+DF45> cannot > occur in a UTF-16 Unicode string; instead, the code unit sequence <D808 > DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES > KI>." > > (It might have been clearer to you if I'd said '8-bit' and '16-bit' > instead of UTF-8 and UTF-16. It does make me wonder what you'd call a > 16-bit encoding of arbitrary *codepoint* sequences.) > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
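Mark's example above (whose supplementary characters were garbled by the list archiver) can be restated using escapes only: in java.util.regex, \x{hex} denotes a code point, and "." consumes a surrogate pair as a single code point (a sketch; class name is mine):

```java
// Sketch: java.util.regex matches by code point, not by code unit.
public class CodePointRegexDemo {
    public static void main(String[] args) {
        String target = "a\uD808\uDF45b"; // 'a', U+12345, 'b': three code points
        // \x{12345} matches the single supplementary code point.
        System.out.println(target.replaceAll("\\x{12345}", "*")); // a*b
        // "." also consumes the surrogate pair as one code point.
        System.out.println(target.replaceAll(".", "*")); // ***
    }
}
```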
URL: From verdy_p at wanadoo.fr Mon Jun 2 06:44:04 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 2 Jun 2014 13:44:04 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> Message-ID: Your example would have been better explained by just saying that in Java, the regexp represented in source code as "\\uD808\\uDF45" means matching two successive 16-bit code units, and "\\uD808" or "\\uDF45" just matches one. The "\\uNNNN" regexp notation (in source code, equivalent to "\uNNNN" in a string at runtime) does not necessarily designate a full code point, unlike the "\\x{NNNN}" and "." regexes, which will necessarily match a full code point in the target (even if it's an isolated surrogate). But there's no way in Java to represent a target string that can store arbitrary sequences of codepoints if you use the String type (this is not specific to Java but applies as well to any language or runtime library handling streams of 16-bit code units, including C, C++, Python, Javascript, PHP...). The problem is then not in the way you write regexps, but in the way the target string is encoded: it is not technically possible with 16-bit streams to represent arbitrary sequences of codepoints, only arbitrary sequences of 16-bit code units (even if they aren't valid UTF-16 text). But there's no problem at all in processing valid UTF-16 streams. Your "lead", "trail" and "one+" targets are representable in Java as arbitrary 16-bit streams, but they do not represent valid Unicode texts. Conversely, all your "tests[]" strings are valid Unicode texts, but their interpretations as regexps are not necessarily valid. 
Each time you use single backslashes in a Java source-code string, there's no guarantee it will be a valid Unicode text even though it will compile without problem as a valid 16-bit stream (and the same is true in other languages). If you want to represent arbitrary sequences of codepoints in a target text, you cannot use any UTF alone (it may be technically possible with UTF-8 or UTF-32, but such sequences are still invalid under those standard encodings) without using an escaping mechanism such as the double backslashes used in the notation of regexps. This escaping mechanism is then independent of the actual runtime encoding used to transport the escaped streams within valid Unicode texts. In summary, arbitrary sequences of codepoints in a valid Unicode text require an escaping mechanism on top of the actual text encoding used for storage or transport (there are other ways to escape arbitrary streams into valid texts, including the U+NNNNNN notation, Base64, hex or octal representations of UTF-32, Punycode, and many other techniques used to embed binary objects, such as UUCP or PostScript streams). In HTTP a few of them are supported as standard "transport syntaxes". 
Terminal protocols (like VT220 and related, or Videotext) have long used escape sequences (plus controls like SI/SO encapsulation and isolated DLE escapes for transporting 8-bit data over a 7-bit stream). Technically, Java strings at runtime are not plain text but binary objects, unless they are checked on input and the validity conditions are not broken by text transforms (such as extraction of substrings at arbitrary absolute positions, or error recovery with resynchronization after a failure or missing data; these errors are likely to occur because we have no guarantee that validity is preserved during the exchange by matching preconditions and postconditions). The same is true for C/C++ standard strings, PHP strings, and the content transported by an HTTP session or a terminal protocol (which also defines its own escaping mechanism where needed). If you develop a general-purpose library in any language that can be reused in arbitrary code, you cannot assume on input that all preconditions are satisfied, so you need to check the input. You also have to be careful about the design of your library to make sure that it respects the postconditions (some library APIs are technically unsafe, notably substring extraction and blocking I/O using fixed-size buffers, such as file I/O in filesystems that do not discriminate between text files and binary files; text files would need variable-length buffers broken only at codepoint positions and not at arbitrary code unit positions). As far as I know, no filesystem enforces code point positions, unless it uses non-space-efficient encodings with code units wider than 20 bits (storage devices are optimized for code units whose size in bytes is a power of 2, so you would end up using only files whose size in bytes is a multiple of 4, with all random-access file positions also a multiple of 4 bytes). 
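[The precondition check Verdy describes for a general-purpose library taking 16-bit input reduces, in Java terms, to verifying that the char sequence is well-formed UTF-16: every lead surrogate immediately paired with a trail, and no stray trail surrogates. A sketch; the class and method names are mine, not from the thread:]

```java
// Sketch of an input precondition check for a library accepting
// CharSequence: valid UTF-16 means no unpaired surrogate code units.
public class Utf16Check {
    public static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // lone lead surrogate
                }
                i++; // skip the trail unit of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // trail surrogate with no lead
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("\uD808\uDF45")); // true: valid pair
        System.out.println(isWellFormedUtf16("\uD808"));       // false: lone lead
        System.out.println(isWellFormedUtf16("\uDF45"));       // false: lone trail
    }
}
```

[Mark's "lead", "trail" and "one+" targets would all fail this check, which is exactly the sense in which they are 16-bit streams rather than Unicode text.]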
You could also use 24-bit storage code units with blocks limited to sectors of 256 bytes, with the extra byte used only as a filler or as a length indicator in that sector (255 bytes would store 85 arbitrary 24-bit code units, but you would still need to check the value range of these code units if you want to restrict to the U+0000..U+10FFFF codepoint space, unless your application code handles all of the extra code units like non-character code points). However, the filesystem could perform this check when writing text files, so that it could mark files that are valid Unicode strings by updating some metadata (that metadata could be stored in the spare byte of the first 256-byte sector, or in a separate indexing database of compatible files). You could do the same for in-memory temporary buffers by keeping this information (this would allow streams that are not plain text to be discriminated very early, without having to process them up to the end). A relational database could also perform this check when creating indexes on table keys, so that it would know that it can only return valid text for any subselection in a table. In all cases, this will still be more efficient for small storage than using any transport syntax or escaping; but for moderate and large volumes, the transport syntax or escaping mechanism often wins in terms of performance by minimizing the volume of I/O, notably if these I/Os are very costly compared to data in working memory or even in CPU data caches (but only if this data is very frequently reaccessed in that cache). However, if your I/O is very slow compared to the CPU and the data volume is sufficiently large, it is always better to use UTF-32 in memory but store that data in a compressed stream (you can safely use a generic binary compressor, which will generally work better with UTF-32 than with UTF-8 or UTF-16 for moderate and large volumes). 
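[The value-range check mentioned above is cheap to state precisely: a Unicode scalar value is any code point from 0 to 10FFFF excluding the surrogate range D800..DFFF, which UTF-32 (and UTF-8) must never contain. A minimal sketch of such a per-code-point check; class and method names are mine:]

```java
// Sketch: the range check a UTF-32 reader might apply per code unit.
// Accepts exactly the Unicode scalar values: 0..10FFFF minus surrogates.
public class RangeCheck {
    public static boolean isScalarValue(int cp) {
        return cp >= 0 && cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF);
    }

    public static void main(String[] args) {
        System.out.println(isScalarValue(0x12345));  // true
        System.out.println(isScalarValue(0xD808));   // false: surrogate code point
        System.out.println(isScalarValue(0x110000)); // false: beyond plane 16
    }
}
```

[Two comparisons per code point is the kind of branch-predictable fast path the following paragraph argues is affordable even in production code.]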
Simple algorithms like deflate or even basic Huffman encoding will deliver excellent throughput with very modest CPU cost compared to the huge I/O costs that such compression saves (and in that case, even a range check on input will have insignificant cost, thanks to branch prediction in your code using the fast pipelined path only for valid texts and the slower non-pipelined path only for exceptions and error handling; most processors and CPU caches now use excellent branch predictors, even if code compilers can help them). In summary, range checking should no longer be only a debugging option in code (even for production code), even for internal libraries; its cost rapidly becomes insignificant for large data volumes. Just design your algorithms to minimize the number of state variables and minimize table lookups in order to improve data locality, rather than minimizing local data sizes for just one or a few code points or code units: select runtime code units that can fit in a single CPU register (almost all processors today have registers at least 32 bits wide, so UTF-32 is not a problem for local processing in native code). 2014-06-02 11:29 GMT+02:00 Mark Davis ?? : > > \uD808\uDF45 specifies a sequence of two codepoints. > > ?That is simply incorrect.? > > In Java (and similar environments), \uXXXX means a char (a UTF16 code > unit), not a code point. Here is the difference. If you are not used to > Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x > with the replacement y in string. Backslashes in literals need escaping, so > \x needs to be written in literals as \\x. 
> > String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", > "?.?"}; > String target = > "one: ?\uD808\uDF45?\t\t" + > "two: ?\uD808\uDF45\uD808\uDF45?\t\t" + > "lead: ?\uD808?\t\t" + > "trail: ?\uDF45?\t\t" + > "one+: ?\uD808\uDF45\uD808?"; > System.out.println("pattern" + "\t?\t" + target + "\n"); > for (String test : tests) { > System.out.println(test + "\t?\t" + target.replaceAll(test, "??")); > } > > > *?Output:* > pattern ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > > \x{12345} ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > \uD808\uDF45 ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > ?? ? one: ???? two: ?????? lead: ??? trail: ??? one+: ????? > ?.? ? one: ?? two: ?????? lead: ?? trail: ?? one+: ????? > > The target has various combinations of code units, to see what happens. > Notice that Java treats a pair of lead+trail as a single code point for > matching (eg .), but also an isolated surrogate char as a single code point > (last line of output). Note that Java's regex in addition allows \x{hex} > for specifying a code point explicitly. It also has the syntax \uXXXX (in a > literal the \ needs escaping) to specify a code unit; that is slightly > different than the Java preprocessing. Thus the first two are equivalent, > and replace "{" by "x". The last two are also equivalent?and fail?because a > single "{" is a broken regex pattern. > > System.out.println("{".replaceAll("\\u007B", "x")); > System.out.println("{".replaceAll("\\x{7B}", "x")); > > System.out.println("{".replaceAll("\u007B", "x")); > System.out.println("{".replaceAll("{", "x")); > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > > On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > >> On Sun, 1 Jun 2014 08:58:26 -0700 >> Markus Scherer wrote: >> >> > You misunderstand. 
In Java, \uD808\uDF45 is the only way to escape a >> > supplementary code point, but as long as you have a surrogate pair, >> > it is treated as a code point in APIs that support them. >> >> Wasn't obvious that in the following paragraph \uD808\uDF45 was a >> pattern? >> >> "Bear in mind that a pattern \uD808 shall not match anything in a >> well-formed Unicode string. \uD808\uDF45 specifies a sequence of two >> codepoints. This sequence can occur in an ill-formed UTF-32 Unicode >> string and before Unicode 5.2 could readily be taken to occur in an >> ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular >> expression engine, the codepoint sequence cannot >> occur in a UTF-16 Unicode string; instead, the code unit sequence > DF45> is the codepoint sequence > KI>." >> >> (It might have been clearer to you if I'd said '8-bit' and '16-bit' >> instead of UTF-8 and UTF-16. It does make me wonder what you'd call a >> 16-bit encoding of arbitrary *codepoint* sequences.) >> >> Richard. >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 2 10:27:22 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 02 Jun 2014 08:27:22 -0700 Subject: Corrigendum #9 Message-ID: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> It seems that the broadening of the term "interchange" in this corrigendum to mean "almost any type of processing imaginable," below, is what caused the trouble. This is the decision that would need to be reconsidered if the real intent of noncharacters is to be expressed. 
I suspect everyone can agree on the edge cases, that noncharacters are harmless in internal processing, but probably should not appear in random text shipped around on the web. > This is necessary for the effective use of noncharacters, because > anytime a Unicode string crosses an API boundary, it is in effect > being "interchanged". Furthermore, for distributed software, it is > often very difficult to determine what constitutes an "internal" > versus an "external" context for any particular software process. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From markus.icu at gmail.com Mon Jun 2 10:48:57 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 2 Jun 2014 08:48:57 -0700 Subject: Corrigendum #9 In-Reply-To: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell wrote: > I suspect everyone can agree on the edge cases, that noncharacters are > harmless in internal processing, but probably should not appear in > random text shipped around on the web. > Right, in principle. However, it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc. It seems that trying to define "interchange" and "public" in ways that satisfy everyone will not be successful. The FAQ already gives some examples of where noncharacters might be used, should be preserved, or could be stripped, starting with "Q: Are noncharacters intended for interchange? " In my view, those Q/A pairs explain noncharacters quite well. If there are further examples of where noncharacters might be used, should be preserved, or could be stripped, and that would be particularly useful to add to the examples already there, then we could add them. 
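[The set under discussion is fixed by the standard: the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The gatekeeping filter that apps importing foreign data would apply reduces to a small predicate. A sketch; the class and method names are mine, not from the thread:]

```java
// Sketch: identify the 66 noncharacter code points
// (U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF in every plane 0..16).
// Assumes cp is already a valid code point in 0..10FFFF.
public class NoncharCheck {
    public static boolean isNoncharacter(int cp) {
        // (cp & 0xFFFE) == 0xFFFE matches exactly the ...FFFE and ...FFFF values
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        System.out.println(isNoncharacter(0xFFFE));   // true
        System.out.println(isNoncharacter(0xFDD0));   // true
        System.out.println(isNoncharacter(0x10FFFF)); // true: last code point of plane 16
        System.out.println(isNoncharacter(0xFFFD));   // false: replacement character
    }
}
```

[Whether a tool should drop, replace, or pass through the code points this predicate flags is exactly the policy question the thread is debating; the predicate itself is uncontroversial.]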
markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Mon Jun 2 11:02:58 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 16:02:58 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: I also think that the verbiage swung too far the other way. Sure, I might need to save or transmit a file to talk to myself later, but apps should be strongly discouraged from using these for interchange with other apps. Interchange bugs are why nearly any news web site ends up with at least a few articles with mangled apostrophes or whatever (because of encoding differences). Should authors' tools or feeds or databases or whatever start emitting non-characters from internal use, then we're going to have ugly leaks into text "everywhere". So I'd prefer to see text that better permitted interchange with other components of an application's internal system or partner system, yet discouraged use for interchange with "foreign" apps. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Jun 2 11:08:14 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 18:08:14 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of "apps", where it is perfectly reasonable to interchange sentinel values (for example). I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) Mark *« Il meglio è l'inimico del bene »* On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele wrote: > I also think that the verbiage swung too far the other way. 
Sure, I > might need to save or transmit a file to talk to myself later, but apps > should be strongly discouraged from using these for interchange with other > apps. > > > > Interchange bugs are why nearly any news web site ends up with at least a > few articles with mangled apostrophes or whatever (because of encoding > differences). Should authors' tools or feeds or databases or whatever > start emitting non-characters from internal use, then we're going to have > ugly leaks into text "everywhere". > > > > So I'd prefer to see text that better permitted interchange with other > components of an application's internal system or partner system, yet > discouraged use for interchange with "foreign" apps. > > > > -Shawn > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Mon Jun 2 11:21:23 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 16:21:23 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> That's exactly what I think should be clarified. A cooperating system of apps should likely use some other markup; however, if they want to use FFFF to say "OK to insert ad here" (or whatever), that's up to them. I fear that the current wording says "Because you might have a cooperating system of apps that all agree FFFF is 'OK to insert ad here', you may as well emit FFFF all the time just in case some other app happens to use the same sentinel". The "problem" is now that previously these characters were illegal, so my application didn't have to explicitly remove them when importing external stuff because they weren't allowed to be there. 
With the wording of the corrigendum, the onus is on every app importing data to filter out these code points because they are "suddenly" legal in foreign data streams. That is a breaking change for applications, and, worse, it isn't in the control of the applications that take advantage of the newly laxer wording, but rather of all the other applications on the planet, which may have been stable for years. My interpretation of "interchanged" was "interchanged outside of a system that understood your private use of the noncharacters". I can see where that may not have been everyone's interpretation, and maybe it should be updated. My interpretation of what you're saying below is "sentinel values with a private meaning can be exchanged between apps", which is what the PUA's for. I don't mind at all if the definition is loosened somewhat, but if we're turning them into PUA characters we should just turn them into PUA characters. -Shawn From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis ☕️ Sent: Monday, June 2, 2014 9:08 AM To: Shawn Steele Cc: Markus Scherer; Doug Ewell; Unicode Mailing List Subject: Re: Corrigendum #9 The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of "apps", where it is perfectly reasonable to interchange sentinel values (for example). I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) Mark « Il meglio è l'inimico del bene » On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele > wrote: I also think that the verbiage swung too far the other way. Sure, I might need to save or transmit a file to talk to myself later, but apps should be strongly discouraged from using these for interchange with other apps. Interchange bugs are why nearly any news web site ends up with at least a few articles with mangled apostrophes or whatever (because of encoding differences). Should authors' 
tools or feeds or databases or whatever start emitting non-characters from internal use, then we?re going to have ugly leak into text ?everywhere?. So I?d prefer to see text that better permitted interchange with other components of an application?s internal system or partner system, yet discouraged use for interchange with ?foreign? apps. -Shawn _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Jun 2 11:27:54 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 18:27:54 +0200 Subject: Corrigendum #9 In-Reply-To: <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele wrote: > The ?problem? is now that previously these characters were illegal The problem was that we were inconsistent in standard and related material about just what the status was for these things. Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Jun 2 11:35:54 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 09:35:54 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <538CA7EA.5020906@ix.netcom.com> On 6/2/2014 9:08 AM, Mark Davis ?? wrote: > The problem is where to draw the line. In today's world, what's an > app? You may have a cooperating system of "apps", where it is > perfectly reasonable to interchange sentinel values (for example). 
The way to draw the line is to insist on there being an agreement between sender and ultimate receiver, and a pass-through agreement (if you will) for any intermediate stage, so that the coast is clear. What defines an "implementation" in this scenario is the existence of the agreement. What got us into trouble is that the negative case (pass-through) was not well-defined, and led to people assuming that they had to filter any incoming noncharacters. Because noncharacters can have any interpretation (not limited to interpretations as characters), it is much riskier to send them out oblivious to whether the intended recipient is party to the same agreement on their interpretation as the sender. In that sense, they are not mere PUA code points. The other aspect of their original design was to allow code points that recipients were free not to honor or preserve, if they were not part of the agreement (and hadn't made an explicit or implicit pass-through agreement). Otherwise, if anyone expects them to be preserved, no application, like Word, would be free to use these for purely internal use. Word thus would not be a tool to handle CLDR data, which may be disappointing to some, but should be fine. A./ > > I agree with Markus; I think the FAQ is pretty clear. (And if not, > that's where we should make it clearer.) > > > Mark > / > / > /« Il meglio è l'inimico del bene »/ > // > > > On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele > > wrote: > > I also think that the verbiage swung too far the other way. Sure, > I might need to save or transmit a file to talk to myself later, > but apps should be strongly discouraged from using these for > interchange with other apps. > > Interchange bugs are why nearly any news web site ends up with at > least a few articles with mangled apostrophes or whatever (because > of encoding differences). Should authors' 
tools or feeds or > databases or whatever start emitting non-characters from internal > use, then we?re going to have ugly leak into text ?everywhere?. > > So I?d prefer to see text that better permitted interchange with > other components of an application?s internal system or partner > system, yet discouraged use for interchange with ?foreign? apps. > > -Shawn > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 2 11:36:50 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 02 Jun 2014 09:36:50 -0700 Subject: Corrigendum #9 Message-ID: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> Shawn Steele wrote: > So I?d prefer to see text that better permitted interchange with other > components of an application?s internal system or partner system, yet > discouraged use for interchange with "foreign" apps. If any wording is to be revised, while we're at it, I'd also like to see a reaffirmation of the proper relationship between private-use characters and noncharacters. I still hear arguments that private-use characters are to be avoided in public interchange at all costs, as if lack of knowledge of the private agreement, or conflicting interpretations, will cause some kind of major security breach. At the same time, the Corrigendum seems to imply that noncharacters in public interchange are no big deal. That seems upside-down. Mark Davis ?? replied: > The problem is where to draw the line. In today's world, what's an > app? You may have a cooperating system of "apps", where it is > perfectly reasonable to interchange sentinel values (for example). Correct. 
Most people wouldn't consider a cooperating system like that quite the same as true public interchange, like throwing this ??? into a message on a public mailing list. Since the Corrigendum deals with recommendations rather than hard requirements, SHOULDs rather than MUSTs, it doesn't seem that a bright line is really needed. > I agree with Markus; I think the FAQ is pretty clear. (And if not, > that's where we should make it clearer.) But the formal wording of the standard should reflect that clarity, right? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From asmusf at ix.netcom.com Mon Jun 2 11:37:19 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 09:37:19 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <538CA83F.9090500@ix.netcom.com> On 6/2/2014 9:27 AM, Mark Davis ?? wrote: > > On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele > > wrote: > > The ?problem? is now that previously these characters were illegal > > > The problem was that we were inconsistent in standard and related > material about just what the status was for these things. > > And threw the baby out to fix it. A./ > > Mark > / > / > /? Il meglio ? l?inimico del bene ?/ > // > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Shawn.Steele at microsoft.com Mon Jun 2 11:38:28 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 16:38:28 +0000 Subject: Corrigendum #9 In-Reply-To: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> Message-ID: > > I agree with Markus; I think the FAQ is pretty clear. (And if not, > > that's where we should make it clearer.) > But the formal wording of the standard should reflect that clarity, right? I don't tend to read the FAQ :) From doug at ewellic.org Mon Jun 2 11:44:06 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 02 Jun 2014 09:44:06 -0700 Subject: Corrigendum #9 Message-ID: <20140602094406.665a7a7059d7ee80bb4d670165c8327d.edf1a109a4.wbe@email03.secureserver.net> I wrote, sort of: > Correct. Most people wouldn't consider a cooperating system like that > quite the same as true public interchange, like throwing this ??? > into a message on a public mailing list. Oh, look. My mail system converted those nice noncharacters into U+FFFD. Was that compliant? Did I deserve what I got? Are those two different questions? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From mark at macchiato.com Mon Jun 2 11:47:44 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 2 Jun 2014 18:47:44 +0200 Subject: Corrigendum #9 In-Reply-To: <538CA83F.9090500@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <13aa00b784a74c55adb12d7eacede01c@BY2PR03MB491.namprd03.prod.outlook.com> <538CA83F.9090500@ix.netcom.com> Message-ID: I disagree with that characterization, of course. The recommendation for libraries and low-level tools to pass them through rather than screw with them makes them usable. 
The recommendation to check for noncharacters from unknown sources and fix them was good advice then, and is good advice now. Any app where input of noncharacters causes security problems or crashes is, and was, not a very good app. Mark *« Il meglio è l'inimico del bene »* On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag wrote: > On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote: > > > On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele > wrote: > >> The "problem" is now that previously these characters were illegal > > > The problem was that we were inconsistent in standard and related > material about just what the status was for these things. > > > And threw the baby out to fix it. > > A./ > > > Mark > > *« Il meglio è l'inimico del bene »* > > > _______________________________________________ > Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Jun 2 11:49:29 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 09:49:29 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> Message-ID: <538CAB19.7020208@ix.netcom.com> On 6/2/2014 9:38 AM, Shawn Steele wrote: >>> I agree with Markus; I think the FAQ is pretty clear. (And if not, >>> that's where we should make it clearer.) >> But the formal wording of the standard should reflect that clarity, right? > I don't tend to read the FAQ :) FAQs are useful, but they are not binding. They are even less binding than the general explanations in the text of the Core Specification, which themselves don't rise to the level of conformance clauses and definitions... Doug's unease about the "upside-down" nature of the wording regarding PUA and noncharacters is something that should be addressed in revised text in the core specification. 
A./ > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From Shawn.Steele at microsoft.com Mon Jun 2 12:00:59 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 17:00:59 +0000 Subject: Corrigendum #9 In-Reply-To: <538CAB19.7020208@ix.netcom.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> Message-ID: <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like the old escape characters to get a dot matrix printer to shift modes, or old word processor internal formatting sequences. From Shawn.Steele at microsoft.com Mon Jun 2 12:08:38 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 17:08:38 +0000 Subject: Corrigendum #9 In-Reply-To: <20140602094406.665a7a7059d7ee80bb4d670165c8327d.edf1a109a4.wbe@email03.secureserver.net> References: <20140602094406.665a7a7059d7ee80bb4d670165c8327d.edf1a109a4.wbe@email03.secureserver.net> Message-ID: > Oh, look. My mail system converted those nice noncharacters into U+FFFD. > Was that compliant? Did I deserve what I got? Are those two different questions? I think I just got spaces. 
From markus.icu at gmail.com Mon Jun 2 12:17:04 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 2 Jun 2014 10:17:04 -0700 Subject: Corrigendum #9 In-Reply-To: <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele wrote: > To further my understanding, can someone provide examples of how these are > used in actual practice? > CLDR collation data defines special contraction mappings that start with a noncharacter, for http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers In CLDR 23 and before (when we were still using XML collation syntax), these were raw noncharacters in the .xml files. As I said earlier: it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Mon Jun 2 12:50:18 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 17:50:18 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <8ef7b3954b13479cad76585e628fb83b@BY2PR03MB491.namprd03.prod.outlook.com> Hmm, I find that disconcerting. I'd prefer a real Unicode character with special weights if that concept's needed. And I guess that goes a long ways to explaining the interchange problem since clearly the code editor's going to need these … 
From: Markus Scherer [mailto:markus.icu at gmail.com] Sent: Monday, June 2, 2014 10:17 AM To: Shawn Steele Cc: Asmus Freytag; Doug Ewell; Mark Davis ??; Unicode Mailing List Subject: Re: Corrigendum #9 On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele > wrote: To further my understanding, can someone provide examples of how these are used in actual practice? CLDR collation data defines special contraction mappings that start with a noncharacter, for http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers In CLDR 23 and before (when we were still using XML collation syntax), these were raw noncharacters in the .xml files. As I said earlier: it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Jun 2 13:05:11 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 2 Jun 2014 19:05:11 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <20140602190511.5f67ffd8@JRWUBU2> On Mon, 2 Jun 2014 10:17:04 -0700 Markus Scherer wrote: > CLDR collation data defines special contraction mappings that start > with a noncharacter, for > http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers > In CLDR 23 and before (when we were still using XML collation syntax), > these were raw noncharacters in the .xml files. > As I said earlier: > it should be ok to include noncharacters in CLDR data files for > processing by CLDR implementations, and it should be possible to edit > and diff and version-control and web-view those files etc. 
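Markus's point that noncharacter-bearing data files remain editable, diffable, and versionable rests on the fact that the standard UTFs encode noncharacters like any other scalar value. A quick Python check (the sample string is illustrative, not actual CLDR data):

```python
# U+FDD0 is the kind of index-marker noncharacter CLDR prepends to
# contractions; the CJK character here is just an illustrative suffix.
data = "\ufdd0\u4e00"
encoded = data.encode("utf-8")
assert encoded == b"\xef\xb7\x90\xe4\xb8\x80"   # well-formed UTF-8
assert encoded.decode("utf-8") == data          # round-trips losslessly
```

Contrast unpaired surrogates, which Python's strict UTF-8 codec refuses to encode at all; noncharacters are ordinary scalar values as far as the encoding forms are concerned.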
They come as a nasty shock when someone thinks XML files are marked-up text files. I'm still surprised that the published human-readable form of CLDR files should contain automatically applied non-Unicode copyright claims. Richard. From richard.wordingham at ntlworld.com Mon Jun 2 15:01:53 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 2 Jun 2014 21:01:53 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> Message-ID: <20140602210153.40a8bf08@JRWUBU2> On Mon, 2 Jun 2014 11:29:09 +0200 Mark Davis ☕️ wrote: > > \uD808\uDF45 specifies a sequence of two codepoints. > > "That is simply incorrect." The above is in the sample notation of UTS #18 Version 17 Section 1.1. From what I can make out, the corresponding Java notation would be \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in Java, or whether they are even acceptable. The only thing UTS #18 RL1.7 permits them to match in Java is lone surrogates, but I don't know if Java complies. All UTS #18 says for sure about regular expressions matching code units is that they don't satisfy RL1.1, though Section 1.7 appears to ban them when it says, "A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units". Perhaps it's a fundamental requirement of something other than UTS #18. I thought matching parts of characters in terms of their canonical equivalences was awkward enough, without having the additional option of matching some of the code units! Richard. 
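The code-point versus code-unit distinction Richard is probing shows up directly in a code-point-based language such as Python (this illustrates the semantics only; it says nothing about what Java or any particular regex engine does):

```python
import re

supp = "\U00012345"     # one supplementary-plane code point
pair = "\ud808\udf45"   # two lone surrogate code points: not the same thing

assert len(supp) == 1 and len(pair) == 2
assert supp != pair                              # code points, not UTF-16 code units
assert re.search(supp, supp) is not None         # matches the single code point
assert re.search(pair, supp) is None             # surrogate pair does not match it
```

In a UTF-16 code-unit-based engine the last two results would be reversed, which is precisely why UTS #18 insists on interpretation by code point.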
From prosfilaes at gmail.com Mon Jun 2 15:32:43 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 2 Jun 2014 13:32:43 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer wrote: > Right, in principle. However, it should be ok to include noncharacters in > CLDR data files for processing by CLDR implementations, and it should be > possible to edit and diff and version-control and web-view those files etc. Why? It seems you're changing the rules so some Unicode guys can get oversmart in using Unicode in their systems. You could do the same thing everyone else does and use special tags or symbols you have to escape. I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility. -- Kie ekzistas vivo, ekzistas espero. From markus.icu at gmail.com Mon Jun 2 16:53:08 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 2 Jun 2014 14:53:08 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 1:32 PM, David Starner wrote: > I would especially discourage any web browser from handling > these; they're noncharacters used for unknown purposes that are > undisplayable and if used carelessly for their stated purpose, can > probably trigger serious bugs in some lamebrained utility. > I don't expect "handling these" in web browsers and lamebrained utilities. I expect "treat like unassigned code points". markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Shawn.Steele at microsoft.com Mon Jun 2 17:07:03 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 22:07:03 +0000 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <81e121ab27544aeca6f23663850c32dd@BY2PR03MB491.namprd03.prod.outlook.com> Except that, particularly the max-weight ones, mean that developers can be expected to use this as sentinels in code using ICU, which would preclude their use for other things… Which makes them more like "reserved for use in CLDR" than "noncharacters"… -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Markus Scherer Sent: Monday, June 2, 2014 2:53 PM To: David Starner Cc: Unicode Mailing List Subject: Re: Corrigendum #9 On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility. I don't expect "handling these" in web browsers and lamebrained utilities. I expect "treat like unassigned code points". markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 2 17:06:21 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 00:06:21 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: We can still draw a line: interchange should be meant so that other non-Unicode standards should find their way to not mix up random data within plain text without defining a clear encapsulation and escaping mechanism that ensures that plain text remains isolatable. 
In other words, design separate layers of representation and processing, and be more imaginative when you design an application or protocol, with better modeling. If an application really internally needs some non-characters, this is not really for encoding text but for the application/protocol-specific system of encapsulation, which should be clearly identified:
- these protocols can use separate APIs for handling objects that are composite and contain some text but that are not text by themselves.
- they should isolate data types (or MIME types)
- they should use some "magic" identifiers in the headers of their data, including versioning in their protocol
- they should document internally their own encapsulation/escaping mechanisms
- they should test them to make sure they preserve the valid text content without breaking it
As the kind of data is not text, we fall within the design of binary data formats. These kinds of statements mean that protocols and APIs will be improved for better separation of layers, working more as separate black boxes. But it's not up to the Unicode standard to explain how they will do it. So for me non-characters are not Unicode text, they are not text at all, and we should not attempt to make them legal if we want to allow strong designs of isolation mechanisms that allow this separation of layers. The Unicode standard offers enough space for this separation, with non-characters (invalid in all standard UTFs), and with invalid code sequences in standard UTFs that allow building up specific encodings that must not be called "UTFs" (or "Unicode" or "UCS" or other terms defined in TUS) and are identified as such in API/protocol designs. Things would simply be better if TUS did not even define what a non-character is and if it did not even suggest that they are legal in "some" circumstance of text "interchange". 2014-06-02 18:08 GMT+02:00 Mark Davis ☕️ : > The problem is where to draw the line. In today's world, what's an app? 
> You may have a cooperating system of "apps", where it is perfectly > reasonable to interchange sentinel values (for example). > > I agree with Markus; I think the FAQ is pretty clear. (And if not, that's > where we should make it clearer.) > > > Mark > > *« Il meglio è l'inimico del bene »* > > > On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele > wrote: > >> I also think that the verbiage swung too far the other way. Sure, I >> might need to save or transmit a file to talk to myself later, but apps >> should be strongly discouraged from using these for interchange with other >> apps. >> >> >> >> Interchange bugs are why nearly any news web site ends up with at least a >> few articles with mangled apostrophes or whatever (because of encoding >> differences). Should authors' tools or feeds or databases or whatever >> start emitting non-characters from internal use, then we're going to have >> ugly leaks into text "everywhere". >> >> >> >> So I'd prefer to see text that better permitted interchange with other >> components of an application's internal system or partner system, yet >> discouraged use for interchange with "foreign" apps. >> >> >> >> -Shawn >> >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> >> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmusf at ix.netcom.com Mon Jun 2 17:08:27 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 02 Jun 2014 15:08:27 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <538CF5DB.3070007@ix.netcom.com> On 6/2/2014 2:53 PM, Markus Scherer wrote: > On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: > > I would especially discourage any web browser from handling > these; they're noncharacters used for unknown purposes that are > undisplayable and if used carelessly for their stated purpose, can > probably trigger serious bugs in some lamebrained utility. > > > I don't expect "handling these" in web browsers and lamebrained > utilities. I expect "treat like unassigned code points". > I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Mon Jun 2 17:09:21 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 2 Jun 2014 15:09:21 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer wrote: > On Mon, Jun 2, 2014 at 1:32 PM, David Starner wrote: >> >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. > > > I don't expect "handling these" in web browsers and lamebrained utilities. I > expect "treat like unassigned code points". So certain programs can't use noncharacters internally because some people want to interchange them? 
That doesn't seem like what noncharacters should be used for. Unix utilities shouldn't usually go to the trouble of messing with them; limiting the number of changes needed for Unicode was the whole point of UTF-8. Any program transferring them across the Internet as text should filter them, IMO; either some lamebrained utility will open a security hole by using them and not filtering first, or something will filter them after security checks have been done, or something. Unless it's a completely trusted system, text files with these characters should be treated with extreme prejudice by the first thing that receives them over the net. -- Kie ekzistas vivo, ekzistas espero. From Shawn.Steele at microsoft.com Mon Jun 2 17:21:14 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 2 Jun 2014 22:21:14 +0000 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: <216013a44c0845d09d6ae7034dc22468@BY2PR03MB491.namprd03.prod.outlook.com> > I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else I think we could generalize to other scenarios so it wasn't necessarily an insider scenario. For example, I could have a string manipulation library that used FFFE to indicate the beginning of an identifier for a localizable sentence, terminated by FFFF. Any system using FFFEid1234FFFF would likely expect to be able to read the tokens in their favorite code editor. But I'm concerned that these "conflict" with each other, and embedding the behavior in major programming languages doesn't smell to me like "internal" use. Clearly if I wanted to use that library in a CLDR-aware app, there is a potential risk for a conflict. 
In the CLDR case, there *IS* a special relationship with Unicode, and perhaps it would be warranted to explicitly encode character(s) with the necessary meaning(s) to handle edge-case collation scenarios. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 2 17:20:49 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 00:20:49 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: I better expect: "treat them as you like", there will never be any warranty of interoperability, everyone is allowed to use them as they want and even change it at any time. The behavior is not defined in TUS, and users cannot expect that TUS will define this behavior. There's no clear solution about what to do if you encounter them in data supposed to be text. For me they are not text, so the whole data could be rejected, or the text remaining after some filtering may be falsely interpreted. You need an external specification outside TUS. I certainly do not consider non-characters like unassigned valid code points, where applications are strongly encouraged to not apply any kind of filter if they want to remain compatible with evolutions of the standard that may assign them (the best you can do with unassigned code points is treat them as symbols, with the minimal properties defined in the standard (notably Bidi properties according to their range, where this direction is defined in some ranges, or treat them as symbols with weak direction), even if applications cannot still render them (renderers will find a way to show them, generally using a .notdef glyph like empty boxes). Normalizers will also not mix them (the default combining class should be 0). 
Only applications that want to ensure that the text conforms to a specific version of the standard are allowed to filter out or signal as errors the presence of unassigned code points. But all applications can do that kind of thing on non-characters (or any code unit whose value falls outside the valid range of a defined UTF). This is an important difference: non-characters are not like unassigned code points; they are assigned to be considered invalid and filterable by design by any Unicode-conforming process for handling text. 2014-06-02 23:53 GMT+02:00 Markus Scherer : > On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: > >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. >> > > I don't expect "handling these" in web browsers and lamebrained utilities. > I expect "treat like unassigned code points". > > markus > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Mon Jun 2 17:55:31 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 00:55:31 +0200 Subject: Corrigendum #9 In-Reply-To: <81e121ab27544aeca6f23663850c32dd@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <81e121ab27544aeca6f23663850c32dd@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: "reserved for CLDR" would be wrong in TUS, you have reached a borderline where you are no longer handling plain text (stream of scalar values assigned to code points), but binary data via a binary interface outside TUS (handling streams of collation elements, whose representation is not even bound to the ICU implementation of CLDR for its own definitions and syntax for its tailorings). CLDR data defines its own interface and protocol, it can reserve these code points only for itself but not in TUS and no other conforming plain-text application is expected to accept these reservations, so they can **freely** mark them in error, replace them, or filter them out, or interpret them differently for their own usage, using their own specification and encapsulation mechanisms and specific **non-plain-text** data types. CLDR data transmitted in binary form that would embed these code points are not transporting plain-text, this is still a binary datatype specific to this application. CLDR data must remain isolated in its scope without forcing other protocols or TUS to follow its practices. Other applications may develop "gateway" interfaces to convert them to be interoperable with ICU but they are not required to do that. If they do, they will follow the ICU specifications, not TUS and this should not influence their own way to handle what TUS describe as plain-text. 
To make it clear, it is preferable to just say in TUS that the behavior of applications with non-characters is completely undefined and unpredictable without an external specification, and these entities should not even be considered as encodable in any standard UTFs (which can freely be replaced by another one without causing any loss or modification of the represented plain text). It should be possible to define other (non-standard) conforming UTFs which are completely unable to represent these non-characters (as well as any unpaired surrogate). A conforming UTF just needs to be able to represent streams of scalar values in their full standard range (even without knowing if they are assigned or not, or without knowing their character properties). You can/should even design CLDR to completely avoid the use of non-characters: it's up to it to define an encapsulation/escaping mechanism that clearly separates what is standard plain text in the content and what is not and used for a specific purpose in CLDR or ICU implementations. 2014-06-03 0:07 GMT+02:00 Shawn Steele : > Except that, particularly the max-weight ones, mean that developers can > be expected to use this as sentinels in code using ICU, which would > preclude their use for other things… > > > > Which makes them more like "reserved for use in CLDR" than "noncharacters"… > > > > -Shawn > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Markus > Scherer > *Sent:* Monday, June 2, 2014 2:53 PM > *To:* David Starner > *Cc:* Unicode Mailing List > *Subject:* Re: Corrigendum #9 > > > > On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote: > > I would especially discourage any web browser from handling > > these; they're noncharacters used for unknown purposes that are > undisplayable and if used carelessly for their stated purpose, can > probably trigger serious bugs in some lamebrained utility. > > > > I don't expect "handling these" in web browsers and lamebrained utilities. 
> I expect "treat like unassigned code points". > > > > markus > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lisam at us.ibm.com Mon Jun 2 18:32:31 2014 From: lisam at us.ibm.com (Lisa Moore) Date: Mon, 2 Jun 2014 16:32:31 -0700 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: I would like to point out to Asmus that this decision was reached unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC Berkeley, and Yahoo! One might disagree with the decision, but there were no special favors involved. Lisa > > > I can't shake the suspicion that Corrigendum #9 is not actually > solving a general problem, but is a special favor to CLDR as being > run by insiders, and in the process muddying the waters for everyone else. > > A./_______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Jun 2 18:33:58 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 00:33:58 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <20140603003358.1c8f4150@JRWUBU2> On Mon, 2 Jun 2014 15:09:21 -0700 David Starner wrote: > So certain programs can't use noncharacters internally because some > people want to interchange them? That doesn't seem like what > noncharacters should be used for. 
Much as I don't like their uninvited use, it is possible to pass them and other undesirables through most applications by a slight bit of recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:
32 × 64 pairs for lone surrogates
1 × 64 pairs to replace some of the PUA characters
1 × 35 pairs to replace the rest of the PUA characters
1 × 4 pairs for incoming FFFC to FFFF
1 × 32 pairs for the other BMP non-characters
1 × 32 pairs for the supplementary plane non-characters.
This then frees up non-characters for the application's use. Richard. From prosfilaes at gmail.com Tue Jun 3 01:21:38 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 2 Jun 2014 23:21:38 -0700 Subject: Corrigendum #9 In-Reply-To: <20140603003358.1c8f4150@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> Message-ID: On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham wrote: > Much as I don't like their uninvited use, it is possible to pass them > and other undesirables through most applications by a slight bit of > recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA > characters, one can ape UTF-16 surrogates and encode: What's the point? If we can use the PUA, then we don't need the noncharacters; we can just use the PUA directly. If we have to play around with remapping them, they're pointless; they're no easier to use in that case than ESC or '\' or PUA characters. -- Kie ekzistas vivo, ekzistas espero. From mark at macchiato.com Tue Jun 3 01:55:09 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 3 Jun 2014 08:55:09 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 10:32 PM, David Starner wrote: > Why? It seems you're changing the rules > ... 
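Richard's boundary-recoding idea can be sketched in simplified form. This version uses a single hypothetical PUA escape character plus a fixed-width hex payload instead of his 99-character surrogate-pair scheme, but the principle is the same: recode the undesirables on the way in, restore them on the way out, and the noncharacters are freed up for internal use in between.

```python
ESC = "\ue000"  # hypothetical PUA escape character (an assumption of this sketch)

def _needs_escape(ch: str) -> bool:
    cp = ord(ch)
    return ch == ESC or 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def recode_in(text: str) -> str:
    """At the boundary, replace ESC and noncharacters by ESC + 6 hex digits."""
    return "".join(f"{ESC}{ord(ch):06X}" if _needs_escape(ch) else ch
                   for ch in text)

def recode_out(text: str) -> str:
    """Invert recode_in on the way back out."""
    out, i = [], 0
    while i < len(text):
        if text[i] == ESC:
            out.append(chr(int(text[i + 1:i + 7], 16)))
            i += 7
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Escaping ESC itself is what makes the scheme lossless; as David notes, the same effect can be had with ESC or '\' directly, without touching the PUA at all.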
> > This isn't "are changing", it is "has changed". The Corrigendum was issued at the start of 2013, about 16 months ago; applicable to all relevant earlier versions. It was the result of fairly extensive debate inside the UTC; there hasn't been a single issue on this thread that wasn't considered during the discussions there. And as far back as 2001, the UTC made it clear that noncharacters *are* scalar values, and are to be converted by UTF converters. Eg, see http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance, one day before 9/11). > probably trigger serious bugs in some lamebrained utility. There were already plenty of programs that passed the noncharacters through; very few would filter them (some would delete them, which is horrible for security). Thinking that a utility would never encounter them in input text was a pipe-dream. If a utility or library is so fragile that it *breaks* on input of any valid UTF sequence, then it *is* a "lamebrained" utility. A good unit test for any production chain would be to check there is no crash on any input scalar value (and for that matter, any ill-formed UTF text). -------------- next part -------------- An HTML attachment was scrubbed... 
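Mark's suggested unit test, feeding every scalar value through the production chain and requiring no crash, is cheap to write exhaustively. A Python sketch (the `process` function here is a hypothetical stand-in; substitute the pipeline actually under test):

```python
def process(text: str) -> str:
    """Hypothetical stand-in for the production chain under test."""
    return text.casefold()

def test_no_crash_on_any_scalar_value() -> None:
    """No input scalar value may cause a crash (Mark's suggested unit test)."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values
        process(chr(cp))  # must not raise

test_no_crash_on_any_scalar_value()
```

A fuller harness would also feed ill-formed UTF byte sequences to the decoding layer, per the parenthetical in Mark's message.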
URL: From duerst at it.aoyama.ac.jp Tue Jun 3 02:09:27 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Tue, 03 Jun 2014 16:09:27 +0900 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: <538D74A7.5020605@it.aoyama.ac.jp> On 2014/06/03 07:08, Asmus Freytag wrote: > On 6/2/2014 2:53 PM, Markus Scherer wrote: >> On Mon, Jun 2, 2014 at 1:32 PM, David Starner > > wrote: >> >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. >> >> >> I don't expect "handling these" in web browsers and lamebrained >> utilities. I expect "treat like unassigned code points". Expecting them to be treated like unassigned code points shows that their use is a bad idea: Since when does the Unicode Consortium use unassigned code points (and the like) in plain sight? > I can't shake the suspicion that Corrigendum #9 is not actually solving > a general problem, but is a special favor to CLDR as being run by > insiders, and in the process muddying the waters for everyone else. I have to fully agree with Asmus, Richard, Shawn and others that the use of non-characters in CLDR is a very bad and dangerous example. However convenient the misuse of some of these codepoints in CLDR may be, it sets a very bad example for everybody else. Unicode itself should not just be twice as careful with the use of its own codepoints, but 10 times as careful. I'd strongly suggest that completely independent of when and how Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked out for how to get rid of these codepoints in CLDR data. The sooner, the better. Regards, Martin. 
From richard.wordingham at ntlworld.com Tue Jun 3 02:31:46 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 08:31:46 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> Message-ID: <20140603083146.3eda0c21@JRWUBU2> On Mon, 2 Jun 2014 23:21:38 -0700 David Starner wrote: > On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham > wrote: > > Using 99 = (3 + > > 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode: > What's the point? If we can use the PUA, then we don't need the > noncharacters; we can just use the PUA directly. If we have to play > around with remapping them, they're pointless; they're no easier to > use in that case than ESC or '\' or PUA characters. A search for the 2-character string '\n' would also find a substring of the 4-character string 'a\\n'. The PUA is in general not available for general utilities to make special use of. Richard. From prosfilaes at gmail.com Tue Jun 3 02:41:18 2014 From: prosfilaes at gmail.com (David Starner) Date: Tue, 3 Jun 2014 00:41:18 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Mon, Jun 2, 2014 at 11:55 PM, Mark Davis ☕️ wrote: > Thinking that a utility would never encounter them in input text > was a pipe-dream. Thinking that a utility would never mangle them if encountered in input text was a pipe-dream. > If a utility or library is so fragile that it breaks on > input of any valid UTF sequence, then it is a "lamebrained" utility. And? The world is filled with lamebrained utilities, and being cautious about what you take in can prevent one of those lamebrained utilities from turning into an exploit. 
> A good > unit test for any production chain would be to check there is no crash on > any input scalar value (and for that matter, any ill-formed UTF text). Right; and if you filter out stuff at the frontend, like ill-formed UTF text and noncharacters, you don't have to worry about what the middle end will do with them. I don't get what the goal of these changes were. It seems you've taken these characters away from programmers to use them in programs and given them to CLDR and anyone else willing to make their "plain text files" skirt the limits. -- Kie ekzistas vivo, ekzistas espero. From prosfilaes at gmail.com Tue Jun 3 02:42:54 2014 From: prosfilaes at gmail.com (David Starner) Date: Tue, 3 Jun 2014 00:42:54 -0700 Subject: Corrigendum #9 In-Reply-To: <20140603083146.3eda0c21@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> <20140603083146.3eda0c21@JRWUBU2> Message-ID: On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham wrote: > On Mon, 2 Jun 2014 23:21:38 -0700 > David Starner wrote: > >> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham >> wrote: >> > Using 99 = (3 + >> > 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode: > > The PUA is in general not available for > general utilities to make special use of. No, the PUA is not. Then where are you getting the 99 PUA characters you suggested using? -- Kie ekzistas vivo, ekzistas espero. From richard.wordingham at ntlworld.com Tue Jun 3 02:46:44 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 08:46:44 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <20140603084644.2f01e910@JRWUBU2> On Tue, 3 Jun 2014 08:55:09 +0200 Mark Davis ?? wrote: > On Mon, Jun 2, 2014 at 10:32 PM, David Starner > wrote: > > > Why? It seems you're changing the rules > > ?... 
> > > > > This isn't "are changing", it is "has changed". The Corrigendum was > issued at the start of 2013, about 16 months ago; applicable to all > relevant earlier versions. It was the result of fairly extensive > debate inside the UTC; there hasn't been a single issue on this > thread that wasn't considered during the discussions there. And as > far back as 2001, the UTC made it clear that noncharacters *are* > scalar values, and are to be converted by UTF converters. Eg, see > http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by > chance, one day before 9/11). But that says U+FDD0 is not to be externally interchanged! Richard. From mark at macchiato.com Tue Jun 3 02:52:45 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 3 Jun 2014 09:52:45 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: On Tue, Jun 3, 2014 at 9:41 AM, David Starner wrote: > Thinking that a utility would never mangle them if encountered in > input text was a pipe-dream. > I didn't say "not mangle", I said "break", as in "crash". ?I don't think this thread is going anywhere productive, so? I'm signing off from it. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Tue Jun 3 03:02:32 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 09:02:32 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> <20140603083146.3eda0c21@JRWUBU2> Message-ID: <20140603090232.0f3cf06c@JRWUBU2> On Tue, 3 Jun 2014 00:42:54 -0700 David Starner wrote: > On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham > wrote: > > On Mon, 2 Jun 2014 23:21:38 -0700 > > David Starner wrote: > > > >> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham > >> wrote: > >> > Using 99 = (3 + > >> > 32 + 64) PUA characters, one can ape UTF-16 surrogates and > >> > encode: > > > > The PUA is in general not available for > > general utilities to make special use of. > > No, the PUA is not. Then where are you getting the 99 PUA characters > you suggested using? By escaping them as well. The point of the complex scheme is to keep searching simple. Using a general escape character doesn't work so well. Richard. From prosfilaes at gmail.com Tue Jun 3 04:46:29 2014 From: prosfilaes at gmail.com (David Starner) Date: Tue, 3 Jun 2014 02:46:29 -0700 Subject: Corrigendum #9 In-Reply-To: <20140603090232.0f3cf06c@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> <20140603083146.3eda0c21@JRWUBU2> <20140603090232.0f3cf06c@JRWUBU2> Message-ID: On Tue, Jun 3, 2014 at 1:02 AM, Richard Wordingham wrote: > On Tue, 3 Jun 2014 00:42:54 -0700 > David Starner wrote: > >> No, the PUA is not. Then where are you getting the 99 PUA characters >> you suggested using? > > By escaping them as well. The point of the complex scheme is to keep > searching simple. Using a general escape character doesn't work so > well. 
The point is, instead of escaping the PUA so you can use the noncharacters, why not just escape the PUA so you can use the PUA characters? The latter is simpler and more flexible. -- Kie ekzistas vivo, ekzistas espero. From verdy_p at wanadoo.fr Tue Jun 3 09:20:35 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 16:20:35 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <20140603003358.1c8f4150@JRWUBU2> Message-ID: I think his point is that an application may want to encapsulate in a valid text any arbitrary stream of code points (including noncharacters, PUAs, or isolated surrogate code units found in 16-bit or 32-bit streams that are invalid UTF-16 or UTF-32 streams, or even invalid arbitrary 8-bit bytes in streams that are not valid UTF-8). For 8-bit streams, using ESC or \ is generally a good choice of escape to derive a valid UTF-8 text stream. But for 16-bit and 32-bit streams, PUAs are more economical (but PUA code units found in the stream still need to be escaped). If you think about the Java regexp "\\uD800", it does not designate a code point but only a code unit, which is not valid plain text alone as it violates UTF-16 encoding rules. 
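The distinction Philippe draws between code points and code units shows up directly in Java, where a String may hold an unpaired surrogate code unit that no conforming UTF converter will pass through. A small sketch of my own, using only standard-library calls:

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    public static void main(String[] args) {
        // An unpaired high surrogate: a 16-bit code unit, not a scalar value.
        String lone = "\uD800";
        // The UTF-8 encoder cannot represent it; String.getBytes uses the
        // REPLACE action, substituting the charset's replacement byte '?':
        byte[] utf8 = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " " + utf8[0]); // 1 63
    }
}
```

This is why any scheme that must carry isolated code units through a valid UTF stream has to escape them rather than emit them directly.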
Trying to match it in a valid UTF-16 stream can work only if you can represent isolated code units for a specific encoding like UTF-16, even if the target stream to look for this match uses any other valid UTF (not necessarily UTF-16: decode the target text, reencode it to UTF-16 to generate a 16-bit stream in which you'll look for isolated 16-bit code units with the regexp). So yes, the regexp "\\uXXXX" (in Java source) is not used to match a single valid character. 2014-06-03 8:21 GMT+02:00 David Starner : > On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham > wrote: > > Much as I don't like their uninvited use, it is possible to pass them > > and other undesirables through most applications by a slight bit of > > recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA > > characters, one can ape UTF-16 surrogates and encode: > > What's the point? If we can use the PUA, then we don't need the > noncharacters; we can just use the PUA directly. If we have to play > around with remapping them, they're pointless; they're no easier to > use in that case than ESC or '\' or PUA characters. > > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mpapendick at vermeer.com Tue Jun 3 09:25:46 2014 From: mpapendick at vermeer.com (Papendick, Michelle) Date: Tue, 3 Jun 2014 14:25:46 +0000 Subject: Use of Unicode Symbol 26A0 Message-ID: Good Day - Just wondering if Unicode provides for, or anyone knows of, documentation for standard usage around the following symbol: [cid:image001.png at 01CF7C48.A6D54D00] Noticed that it is used in many applications as a general warning or error symbol, but upon research it is also the symbol for personal injury so appears to be a conflict of meaning. 
Any information around standard usage of the symbol in software applications is appreciated. Thank you! Michelle -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8819 bytes Desc: image001.png URL: From verdy_p at wanadoo.fr Tue Jun 3 10:56:05 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 3 Jun 2014 17:56:05 +0200 Subject: Use of Unicode Symbol 26A0 In-Reply-To: References: Message-ID: Warning, danger, caution, risk, hazard... All these things are related. Personal injury is just a particular case of this broad meaning, which is to ask people to be careful before going forward, and to read the notice. The symbol is also used as a street sign, for various dangers on roads when there's no other specific sign, or for temporary signs (e.g. to signal a nearby accident). In almost all cases, it does not come alone: there's a label or sentence explaining the kind of danger or risk to which one could be exposed (risks do not necessarily concern health or death; they may be virtual). It is commonly used in software in warning prompt dialogs that signal a problem for which something should be investigated, or before continuing with an action destroying data in an unrecoverable way (or only in a way that offers no warranty of success or reliability). The name of the symbol is descriptive enough: "WARNING SIGN". Adding extra info would incorrectly limit its broad usage. 2014-06-03 16:25 GMT+02:00 Papendick, Michelle : > Good Day - > > > > Just wondering if Unicode provides for or anyone know of documentation for > standard usage around the following symbol: > > > > [image: cid:image001.png at 01CF7C48.A6D54D00] > > > > Noticed that is it used in many applications as a general warning or error > symbol, but upon research it is also the symbol for personal injury so > appears to be a conflict of meaning. 
> > > > Any information around standard usage of the symbol in software > applications is appreciated. > > > > Thank you! > Michelle > > > > > > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8819 bytes Desc: not available URL: From asmusf at ix.netcom.com Tue Jun 3 11:05:11 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 09:05:11 -0700 Subject: Corrigendum #9 In-Reply-To: <538CF5DB.3070007@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> Message-ID: <538DF237.5060906@ix.netcom.com> On 6/2/2014 3:08 PM, Asmus Freytag wrote: > On 6/2/2014 2:53 PM, Markus Scherer wrote: >> On Mon, Jun 2, 2014 at 1:32 PM, David Starner > > wrote: >> >> I would especially discourage any web browser from handling >> these; they're noncharacters used for unknown purposes that are >> undisplayable and if used carelessly for their stated purpose, can >> probably trigger serious bugs in some lamebrained utility. >> >> >> I don't expect "handling these" in web browsers and lamebrained >> utilities. I expect "treat like unassigned code points". >> > > I can't shake the suspicion that Corrigendum #9 is not actually > solving a general problem, but is a special favor to CLDR as being run > by insiders, and in the process muddying the waters for everyone else. Clarifying: I still haven't heard from anyone that this solves a general problem that is widespread. The only actual example has always been CLDR, and its decision to ship these code points in XML. 
Shipping these code points in files was pretty far down the list of "what not to do" when they were originally adopted. My view continues to be that this was a questionable design decision by CLDR, given what was on the record. The reaction of several outside implementers during this discussion makes clear that viewing that design as problematic is not just my personal view. Usually, if there's a discrepancy between an implementation and Unicode, the reaction is not to retract conformance language. I think arriving at this decision was easier for the UTC, because CLDR is not a random, unrelated implementation. And, as in any group, it's perhaps easier to not be as keenly aware of the impact on external implementations. So, I'd like to clarify that this is the sense in which I meant "special favor", and which therefore is not the most felicitous expression to describe what I had in mind. A./ > > A./ > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Jun 3 11:13:17 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 09:13:17 -0700 Subject: Use of Unicode Symbol 26A0 In-Reply-To: References: Message-ID: <538DF41D.7030904@ix.netcom.com> Michelle, Unicode normally does not document all known usages of symbols. Occasionally, if a symbol is used in ways that might be unexpected from its name, the standard may add an alias or annotation. This is done in particular when there is a question of whether a given symbol is the correct choice for a given application - especially if Unicode contains multiple, similar symbols. Here, that does not seem to be the case. The symbol is used for a variety of purposes, from warning to error to alerting readers to important information. 
These all seem to fit in the same general usage as suggested by the name, and the symbol is distinct enough so that there is no other symbol in Unicode that might suggest itself as an alternate. The use to warn about risk of personal injury would not seem to demand additional clarification. A./ On 6/3/2014 7:25 AM, Papendick, Michelle wrote: > > Good Day - > > Just wondering if Unicode provides for or anyone know of documentation > for standard usage around the following symbol: > > cid:image001.png at 01CF7C48.A6D54D00 > > Noticed that is it used in many applications as a general warning or > error symbol, but upon research it is also the symbol for personal > injury so appears to be a conflict of meaning. > > Any information around standard usage of the symbol in software > applications is appreciated. > > Thank you! > Michelle > > > > __ > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 8819 bytes Desc: not available URL: From asmusf at ix.netcom.com Tue Jun 3 11:15:27 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 09:15:27 -0700 Subject: Corrigendum #9 In-Reply-To: <538D74A7.5020605@it.aoyama.ac.jp> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> Message-ID: <538DF49F.10605@ix.netcom.com> Nicely put. A./ On 6/3/2014 12:09 AM, "Martin J. 
D?rst" wrote: > On 2014/06/03 07:08, Asmus Freytag wrote: >> On 6/2/2014 2:53 PM, Markus Scherer wrote: >>> On Mon, Jun 2, 2014 at 1:32 PM, David Starner >> > wrote: >>> >>> I would especially discourage any web browser from handling >>> these; they're noncharacters used for unknown purposes that are >>> undisplayable and if used carelessly for their stated purpose, can >>> probably trigger serious bugs in some lamebrained utility. >>> >>> >>> I don't expect "handling these" in web browsers and lamebrained >>> utilities. I expect "treat like unassigned code points". > > Expecting them to be treated like unassigned code points shows that > their use is a bad idea: Since when does the Unicode Consortium use > unassigned code points (and the like) in plain sight? > >> I can't shake the suspicion that Corrigendum #9 is not actually solving >> a general problem, ... > > I have to fully agree with Asmus, Richard, Shawn and others that the > use of non-characters in CLDR is a very bad and dangerous example. > > However convenient the misuse of some of these codepoints in CLDR may > be, it sets a very bad example for everybody else. Unicode itself > should not just be twice as careful with the use of its own > codepoints, but 10 times as careful. > > I'd strongly suggest that completely independent of when and how > Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets > worked out for how to get rid of these codepoints in CLDR data. The > sooner, the better. > > Regards, Martin. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From jkorpela at cs.tut.fi Tue Jun 3 12:17:01 2014 From: jkorpela at cs.tut.fi (Jukka K. 
Korpela) Date: Tue, 03 Jun 2014 20:17:01 +0300 Subject: Use of Unicode Symbol 26A0 In-Reply-To: <538DF41D.7030904@ix.netcom.com> References: <538DF41D.7030904@ix.netcom.com> Message-ID: <538E030D.2070703@cs.tut.fi> 2014-06-03 19:13, Asmus Freytag wrote: > Unicode normally does not document all known usages of symbols. Not to mention unknown usages. Characters will be used in different ways, no matter what the Unicode Standard says, and it would be mostly pointless to put restrictions on it. In some cases, however, some types of usage are warned against, or better approaches are suggested. > The symbol is used for a > variety of purposes, from warning to error to alerting readers to > important information. These all seem to fit in the same general usage > as suggested by the name, and the symbol is distinct enough so that > there is no other symbol in Unicode that might suggest itself as an > alternate. Right, but if we consider the use of WARNING SIGN as a text character, or contexts where an image resembling WARNING SIGN is used and WARNING SIGN could well be used (with the usual caveats), then it seems to generally indicate a warning message as opposed to an error message, on one hand, and a purely informative note, on the other. The use of graphic symbols similar to WARNING SIGN e.g. in traffic signs is really a different issue and external to Unicode, as it is not about characters, though it might be tangentially related. > The use to warn about risk of personal injury would not seem to demand > additional clarification. On the practical side, it might be in order to warn against usage that relies on some particular interpretation like that. What I mean is that it is OK to use WARNING SIGN as warning about risk of personal injury, but questionable to expect that people will generally take it that way (and not more loosely as warning of some kind). 
Yucca From richard.wordingham at ntlworld.com Tue Jun 3 13:52:53 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 19:52:53 +0100 Subject: UTF-16 Encoding Scheme and U+FFFE Message-ID: <20140603195253.3c0df53f@JRWUBU2> How do I read definition D98 in TUS Version 6.3.0 Chapter 3 to prohibit a file in the UTF-16 encoding scheme from starting with U+FFFE? Or is U+FFFE actually allowed to start such a file? Is an implementation that deduces the encoding scheme of a plain text file from a leading BOM to be characterised as reckless? Richard. From richard.wordingham at ntlworld.com Tue Jun 3 13:59:23 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 3 Jun 2014 19:59:23 +0100 Subject: Corrigendum #9 In-Reply-To: <538D74A7.5020605@it.aoyama.ac.jp> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> Message-ID: <20140603195923.6ec9c275@JRWUBU2> On Tue, 03 Jun 2014 16:09:27 +0900 "Martin J. D?rst" wrote: > I'd strongly suggest that completely independent of when and how > Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets > worked out for how to get rid of these codepoints in CLDR data. The > sooner, the better. I suspect this has already been done. I know of no CLDR text files still containing them. Richard. From petercon at microsoft.com Tue Jun 3 16:28:05 2014 From: petercon at microsoft.com (Peter Constable) Date: Tue, 3 Jun 2014 21:28:05 +0000 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: <20140603195253.3c0df53f@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: There's never been anything preventing a file from containing and beginning with U+FFFE. It's just not a very useful thing to do, hence not very likely. 
Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: June 3, 2014 11:53 AM To: unicode at unicode.org Subject: UTF-16 Encoding Scheme and U+FFFE How do I read definition D98 in TUS Version 6.3.0 Chapter 3 to prohibit a file in the UTF-16 encoding scheme from starting with U+FFFE? Or is U+FFFE actually allowed to start such a file? Is an implementation that deduces the encoding scheme of a plain text file from a leading BOM to be characterised as reckless? Richard. _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From xueming.shen at oracle.com Tue Jun 3 17:06:30 2014 From: xueming.shen at oracle.com (Xueming Shen) Date: Tue, 03 Jun 2014 15:06:30 -0700 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140602210153.40a8bf08@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> <20140602210153.40a8bf08@JRWUBU2> Message-ID: <538E46E6.9050406@oracle.com> On 06/02/2014 01:01 PM, Richard Wordingham wrote: > On Mon, 2 Jun 2014 11:29:09 +0200 > Mark Davis wrote: > >>> \uD808\uDF45 specifies a sequence of two codepoints. >> "That is simply incorrect." > The above is in the sample notation of UTS #18 Version 17 Section 1.1. > > From what I can make out, the corresponding Java notation would be > \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in > Java, or whether they are even acceptable. The only thing UTS #18 > RL1.7 permits them to match in Java is lone surrogates, but I don't > know if Java complies. The notation for "\uD808\uDF45" is interpreted as a supplementary codepoint and is represented internally as a pair of surrogates in String. 
Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find(); -> false Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find(); -> true Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find(); -> false Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find(); -> true -Sherman > All UTS #18 says for sure about regular expressions matching code units > is that they don't satisfy RL1.1, though Section 1.7 appears to ban > them when it says, "A fundamental requirement is that Unicode text be > interpreted semantically by code point, not code units". Perhaps it's > a fundamental requirement of something other than UTS #18. I thought > matching parts of characters in terms of their canonical equivalences > was awkward enough, without having the additional option of matching > some of the code units! > From richard.wordingham at ntlworld.com Tue Jun 3 18:40:50 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 00:40:50 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <538E46E6.9050406@oracle.com> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> <20140602210153.40a8bf08@JRWUBU2> <538E46E6.9050406@oracle.com> Message-ID: <20140604004050.566e54c9@JRWUBU2> On Tue, 03 Jun 2014 15:06:30 -0700 Xueming Shen wrote: > On 06/02/2014 01:01 PM, Richard Wordingham wrote: > > On Mon, 2 Jun 2014 11:29:09 +0200 > > Mark Davis wrote: > > > >>> \uD808\uDF45 specifies a sequence of two codepoints. > >> "That is simply incorrect." > > The above is in the sample notation of UTS #18 Version 17 Section > > 1.1. > > > > From what I can make out, the corresponding Java notation would be > > \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match > > in Java, or whether they are even acceptable. 
The only thing UTS > > #18 RL1.7 permits them to match in Java is lone surrogates, but I > > don't know if Java complies. > > The notation for "\uD808\uDF45" is interpreted as a supplementary > codepoint and is represent internally as a pair of surrogates in > String. > > Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find()); > -> false > Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find()); > -> true > Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find()); > -> false > Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find()); > -> true Thank you for providing examples confirming that what in the UTS #18 *sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45} in Java notation, matches nothing in any 16-bit Unicode string. Richard. From richard.wordingham at ntlworld.com Tue Jun 3 18:50:51 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 00:50:51 +0100 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: <20140604005051.1f2aee9a@JRWUBU2> On Tue, 3 Jun 2014 21:28:05 +0000 Peter Constable wrote: > There's never been anything preventing a file from containing and > beginning with U+FFFE. It's just not a very useful thing to do, hence > not very likely. Well, while U+FFFE was apparently prohibited from public interchange, one could be very confident of not finding it in an external file. As an internally generated file, it would then be much more likely to be in the UTF-16BE or UTF-16LE encoding scheme. Richard. 
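The byte-order hazard Richard and Peter are discussing can be demonstrated with Java's charset decoders (a sketch of my own; the bytes below are a contrived example): a UTF-16BE file whose text genuinely begins with U+FFFE is indistinguishable, at the byte level, from a little-endian file that begins with a BOM.

```java
import java.nio.charset.StandardCharsets;

public class BomAmbiguity {
    public static void main(String[] args) {
        // Bytes of <U+FFFE, U+0041> in the UTF-16BE encoding scheme:
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};
        // A BOM-sniffing decoder (the generic "UTF-16" charset) reads FF FE
        // as a little-endian BOM, consumes it, and decodes the remaining
        // 00 41 little-endian as U+4100:
        String sniffed = new String(data, StandardCharsets.UTF_16);
        System.out.printf("%x%n", (int) sniffed.charAt(0)); // 4100
        // With the byte order declared out of band, the content survives:
        String be = new String(data, StandardCharsets.UTF_16BE);
        System.out.printf("%x %x%n", (int) be.charAt(0), (int) be.charAt(1)); // fffe 41
    }
}
```

This is exactly why relying solely on an initial BOM to deduce the encoding scheme is risky once U+FFFE may legitimately appear in content.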
From ken.whistler at sap.com Tue Jun 3 19:23:53 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 4 Jun 2014 00:23:53 +0000 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: <20140604005051.1f2aee9a@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: You cannot even be "very confident" of not finding actual ill-formed UTF-16, like unpaired surrogates, in an external file, let alone noncharacters. As for the noncharacters, take a look at the collation test files that we distribute with each version of UCA. The test data includes test strings like the following, to verify that UCA implementations do the correct thing when faced with unusual edge cases: FFFE 0021 FFFE 003F FFFE 0061 FFFE 0041 FFFE 0062 1FFFE 0021 1FFFE 003F 1FFFE 0334 ... As well as test strings starting with unpaired surrogates: D800 0021 D800 003F D800 0061 D800 0041 D800 0062 And while it is true that the *file* CollationTest_SHIFTED.txt doesn't start with either a noncharacter or an unpaired surrogate -- because all of the test data in it is represented in ASCII hex strings instead of directly in UTF-16 -- the issue in any case isn't whether a *file* starts with a noncharacter, but whether a UTF-16 *string* starts with a noncharacter. Any one of those test strings could be trivially turned into a text file by piping out that one UTF-16 string to a file. And I could then write conformant test software that would read UTF-16 string input data from that file and run it through the UCA algorithm to construct sortkeys for it. As Peter said, the main thing that prevents running into these is that it isn't very *useful* to start off files (or strings) with U+FFFE. (And, additionally, in the case of UTF-16 text data files, it would be confusing and possibly lead to misinterpretation of byte order, if you were somehow depending solely on initial BOMs -- which I wouldn't advise, anyway.) 
Basically, the rules of standards (e.g., you shouldn't try to publicly interchange noncharacters) are not like laws of physics. Just because the standard says you shouldn't do it doesn't mean it doesn't happen. --Ken > On Tue, 3 Jun 2014 21:28:05 +0000 > Peter Constable wrote: > > > There's never been anything preventing a file from containing and > > beginning with U+FFFE. It's just not a very useful thing to do, hence > > not very likely. > > Well, while U+FFFE was apparently prohibited from public interchange, > one could be very confident of not finding it in an external file. As > an internally generated file, it would then be much more likely to be > in the UTF-16BE or UTF-16LE encoding scheme. > > Richard. From asmusf at ix.netcom.com Wed Jun 4 01:32:03 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 03 Jun 2014 23:32:03 -0700 Subject: Use of Unicode Symbol 26A0 In-Reply-To: <538E030D.2070703@cs.tut.fi> References: <538DF41D.7030904@ix.netcom.com> <538E030D.2070703@cs.tut.fi> Message-ID: <538EBD63.8080004@ix.netcom.com> On 6/3/2014 10:17 AM, Jukka K. Korpela wrote: > On the practical side, it might be in order to warn against usage that > relies on some particular interpretation like that. What I mean is > that it is OK to use WARNING SIGN as warning about risk of personal > injury, but questionable to expect that people will generally take it > that way (and not more loosely as warning of some kind). > > Yucca It might be useful to note in the description of symbols that their names are commonly not limited to the semantics (instead, names are frequently based on appearance). The clarification could include statements to the effect that: In the case the name is based on semantics, the name chosen may reflect only one of many uses of the symbol, and, further, the symbol may not always be considered the "best" representative of that semantic by all users. 
Exceptions occur for example for mathematical symbols, many of which have conventional names outside Unicode, some of which (like integral sign) do directly name the standard use of that symbol. I'm not sure, but I imagine that a careful reading would show this is covered already (either in the chapters or in the FAQ). Should comparable language really be absent, that would be good to know. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 4 01:54:23 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 4 Jun 2014 08:54:23 +0200 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: <20140604005051.1f2aee9a@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: U+FFFE is prohibited in interchange because, if the interchange specifies a UTF-16 encoding (not UTF-16BE or UTF-16LE), it would be interpreted as a BOM where it occurs at the start of a stream (with the consequence of reparsing the stream as U+FEFF with bytes swapped); in all other positions it cannot be a BOM. BOMs are *normally* only authorized in interchange at the "start" of streams. But this is a problem for "live" streams that have no defined "start" and can be synced at random positions (such as at the next newline, or at the start of a network datagram; note that some network layers may fragment datagrams, so that BOMs could be repeated, and also reunite them, leaving multiple BOMs in the same datagram). So we can assume that U+FFFE anywhere in a UTF-16 "live" stream (not a UTF-16BE or UTF-16LE stream) is each time a byte-swapped BOM, and not a legacy ZWNBSP or a noncharacter. Streams that are known to be UTF-16BE or UTF-16LE are also not recommended for interchange if these files or live streams may be transmitted without metadata specifying the encoding explicitly (as many remote readers will interpret them instead as UTF-16, possibly with multiple BOMs in resynchronizable live streams). 
The problem of live streams is also a good reason why ZWNBSP (U+FEFF) has been strongly discouraged in interchange in favor of the word joiner. This also applies to all other conforming UTFs (including UTF-8, UTF-16BE, UTF-16LE, UTF-32, UTF-32LE, UTF-32BE), where it is strongly recommended not to use U+FEFF and U+FFFE except as BOMs (possibly repeated on live streams). You should note that conforming processes working on interchange (or storage) should always be allowed to switch from one standard UTF to another, and the same encoded streams may be consumed by various clients having different native byte orders. It has now become difficult to define what a "local" system is, when applications are converted to work in a cloud with more and more heterogeneous clients and more intermediate third parties (providing things like caching, archiving, proxying, backup of data and restoration on another system...). For long-term reusability of data, we are strongly encouraged not to use U+FFFE and U+FEFF except as BOMs, and we should be tolerant about the number of BOMs found (and in my opinion, UCA implementations should discard them on input, treating them as fully ignorable, except for delimiting combining character sequences for the purpose of normalisation, which conforming applications or intermediate filters should be allowed to perform as they want). And we should absolutely forget the legacy semantics of ZWNBSP. But this complexity and tolerance for one or more BOMs also means that all UTFs not based on 8-bit code units should also be discouraged in interchange. This means that UTF-16 and UTF-32 should be discouraged, leaving only UTF-16BE or UTF-16LE or UTF-32BE, not for storage or networking, but for temporary streams in memory used inside the "black box" internally implementing each conforming process. 
For all the rest, most applications now use UTF-8, possibly packaged within a generic compressed stream (binary compression of live streams remains possible, even if you cannot predict where in the text encoding the resynchronization points will occur: it's up to the protocol using this transport compression to properly define the resynchronization points). In UTF-8 streams we can completely omit U+FFFE and U+FEFF, whether as BOMs, ZWNBSP or non-characters (and we can also expect that many applications will just discard them silently, as they only have a "no-op" role as BOMs in 8-bit streams). If an application outputs an 8-bit stream that is not UTF-8, it will drop all U+FEFF and U+FFFE found in its input, and will often output U+FEFF in the non-UTF-8 encoding it generates, frequently as a "magic" signature of this encoding. Secure digital signatures of text streams should also ignore these code units silently, as these code units won't be relevant elsewhere in the chain of producers or consumers of this data (these secure digital signatures should be computed by dropping these discardable U+FEFF and U+FFFE, normalizing the data, for example to NFC or NFD, and producing a specific UTF; the easiest one, to avoid complications, being UTF-32BE or UTF-32LE with a predetermined byte order, as specified by the digital signature algorithm). Additionally, it will be very easy to use as many U+FEFF code units as needed as ignorable extra BOMs, for cases where a protocol needs a safe "padding filler" if it wants to use fixed-size block I/O with random access and easy resynchronization (in live streams), when the producer safely breaks data blocks at boundaries of combining sequences (allowing these blocks to be normalized separately and reunited later without creating problems). 
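A decoder along the lines described here (honor a leading BOM, then discard any repeated U+FEFF that a resynchronizable live stream may carry) can be sketched in a few lines. This is only an illustrative sketch in Python, not part of any standard API, and the fallback byte order for unmarked chunks is an assumption for the sketch:

```python
import codecs

def decode_utf16_chunk(chunk: bytes) -> str:
    """Decode one chunk of a UTF-16 stream, using a leading BOM if
    present and discarding any further U+FEFF code points (repeated
    BOMs on a resynchronized live stream)."""
    if chunk.startswith(codecs.BOM_UTF16_LE):
        text = chunk.decode("utf-16-le")  # BOM itself decodes to U+FEFF
    elif chunk.startswith(codecs.BOM_UTF16_BE):
        text = chunk.decode("utf-16-be")
    else:
        # No BOM: byte order must come from out-of-band metadata;
        # big-endian is assumed here purely for the sketch.
        text = chunk.decode("utf-16-be")
    # Treat every decoded U+FEFF as a discardable (possibly repeated) BOM.
    return text.replace("\ufeff", "")
```

Because the BOM itself selects the byte order, a chunk whose first code unit would read as U+FFFE in the wrong byte order is still decoded correctly.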
2014-06-04 1:50 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Tue, 3 Jun 2014 21:28:05 +0000 > Peter Constable wrote: > > > There's never been anything preventing a file from containing and > > beginning with U+FFFE. It's just not a very useful thing to do, hence > > not very likely. > > Well, while U+FFFE was apparently prohibited from public interchange, > one could be very confident of not finding it in an external file. As > an internally generated file, it would then be much more likely to be > in the UTF-16BE or UTF-16LE encoding scheme. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 4 03:10:52 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 4 Jun 2014 10:10:52 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140604004050.566e54c9@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> <20140601094931.413857e2@JRWUBU2> <20140601180457.273ac6b9@JRWUBU2> <20140602210153.40a8bf08@JRWUBU2> <538E46E6.9050406@oracle.com> <20140604004050.566e54c9@JRWUBU2> Message-ID: It does match in a 16-bit "Unicode" string, but this is not a "UTF-16" string: there's no such thing as a "16-bit string" in Unicode if you do not specify the exact UTF encoding form defined in the standard. - the Java regex "\\x{0020}" (here in Java-source literal String format, which requires escaping the backslash for that regex literal) is not contextual: it matches exactly one 16-bit char '\u0020' independently of its context. - the Java regex "\\x{DC00}" (here in Java-source literal String format) is contextual: it really matches one 16-bit char '\uDC00' either at the *start* of the String or NOT immediately preceded by a 16-bit char between '\uD800' and '\uDBFF'. 
- the Java regex "\\uDC00" (here in Java-source literal String format) is NOT contextual: it really matches one 16-bit char '\uDC00' in all contexts, so it is the same as the Java regex "\uDC00" (because this single surrogate char has no "special" meaning in regexes and is interpreted literally by the regex engine). - the Java regex "\\x{D808}" (here in Java-source literal String format) is contextual: it really matches one 16-bit char '\uD808' either at the *end* of the String or NOT immediately followed by a 16-bit char between '\uDC00' and '\uDFFF'. - the Java regex "\\uD808" (here in Java-source literal String format) is NOT contextual: it really matches one 16-bit char '\uD808' in all contexts, so it is the same as the Java regex "\uD808" (because this single surrogate char has no "special" meaning in regexes and is interpreted literally by the regex engine). In summary, the regex engine in Java does not really work with code points; it works directly at the code unit level. The \x notation is a convenient shortcut to specify contexts for literal code units, or to escape the special meaning of some regex operators. 
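For contrast with Java's code-unit-level engine, an engine that works on code points behaves differently. The same experiment can be run in Python, used here purely for illustration because Python's str is a sequence of code points rather than UTF-16 code units:

```python
import re

# U+12345 as one code point, and as the two lone surrogates that would
# form its UTF-16 encoding. In a code-point-based string model these
# are different strings, with different lengths.
astral = "\U00012345"
pair = "\ud808\udf45"

assert len(astral) == 1
assert len(pair) == 2

# A pattern naming the two surrogate code points never matches the
# single astral code point, mirroring Java's \x{D808}\x{DF45} result.
assert re.search(re.escape(pair), astral) is None

# A pattern naming the astral code point itself does match.
assert re.search(re.escape(astral), astral) is not None
```

The point is the same one made in this thread: whether "\uD808\uDF45" denotes one code point or two depends entirely on whether the string model is code units or code points.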
Another example: the Java regex "A*" is exactly identical to "\u0041\u002A"; in both cases this means 0 or more of the Latin capital letter A (the \u notation in Java source code does not escape the special meaning for regexes at runtime; it is a convenience only for the source code, for example to escape a literal double quote in a literal String). Note that Java source code files may be encoded in any text encoding supported by the internationalisation library accessible to the Java compiler; for example the source code could use only US-ASCII or Windows-1252, and there's no other way than the \u notation to compile a 16-bit char code unit into a String literal if the needed character is absent from the Java source code encoding. Java source code may also be encoded in UTF-8, in which case most uses of \u are not needed. In Java you can as well use the \u notation in identifiers, or in operators of the language! The \u notation in Java source code is in fact interpreted AFTER the text has been generated by the source code reader according to its specified source encoding. Then the decoded source string (internally represented in a Java 16-bit char[] array) is processed by the input stage of the lexer, which converts these \u notations prior to recognizing the lexical items. There are quite similar input stages in ANSI C/C++ compilers. For example, ANSI C has long supported the "??" 
trigraph prefix for noting some standard operators or delimiters of the language when the characters needed by its syntax are not supported in the source code encoding. This input stage also occurs prior to recognizing lexical entities of the language, and it was used when the input encoding did not support the full US-ASCII character set but only the invariant subset of ISO 646, such as old national 7-bit variants, or even older 5-bit or 6-bit encodings like Baudot. Very few C programmers know of the existence of this notation in ANSI C, because today they only write code in files stored in an encoding supporting at least the full US-ASCII repertoire (including the many 8-bit EBCDIC variants remaining on mainframes), except when working on source code via old "exotic" 7-bit terminals, or when their national keyboard doesn't define a way to enter the full US-ASCII graphic set, such as braces or backslashes... 2014-06-04 1:40 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Tue, 03 Jun 2014 15:06:30 -0700 > Xueming Shen wrote: > > > On 06/02/2014 01:01 PM, Richard Wordingham wrote: > > > On Mon, 2 Jun 2014 11:29:09 +0200 > > > Mark Davis ☕️ wrote: > > > > > >>> \uD808\uDF45 specifies a sequence of two codepoints. > > >> "That is simply incorrect." > > > The above is in the sample notation of UTS #18 Version 17 Section > > > 1.1. > > > > > > From what I can make out, the corresponding Java notation would be > > > \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match > > > in Java, or whether they are even acceptable. The only thing UTS > > > #18 RL1.7 permits them to match in Java is lone surrogates, but I > > > don't know if Java complies. > > > > The notation for "\uD808\uDF45" is interpreted as a supplementary > > codepoint and is represented internally as a pair of surrogates in > > String. 
> > > > Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find() > > -> false > > Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find() > > -> true > > Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find() > > -> false > > Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find() > > -> true > > Thank you for providing examples confirming that what in the UTS #18 > *sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45} > in Java notation, matches nothing in any 16-bit Unicode string. > > Richard. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From A.Schappo at lboro.ac.uk Wed Jun 4 04:28:49 2014 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 4 Jun 2014 09:28:49 +0000 Subject: Swift In-Reply-To: <20140603195253.3c0df53f@JRWUBU2> References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Swift is Apple's new programming language. In Swift, variable and constant names can be constructed from Unicode characters. Here are a couple of examples from Apple's doc http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html let π = 3.14159 let 你好 = "你好世界" I think this is a huge step forward for i18n and Unicode. There are some restrictions on which Unicode chars can be used. From Apple's doc "Constant and variable names cannot contain mathematical symbols, arrows, private-use (or invalid) Unicode code points, or line- and box-drawing characters. Nor can they begin with a number, although numbers may be included elsewhere within the name." The restrictions seem a little like IDNA2008. 
Anyone have links to info giving a detailed explanation/tabulation of allowed and non-allowed Unicode chars for Swift Variable and Constant names? André Schappo From mark at macchiato.com Wed Jun 4 04:41:17 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 4 Jun 2014 11:41:17 +0200 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: Apparently you can use emoji in the identifiers. ?? ( http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/ ) Mark *« Il meglio è l'inimico del bene »* On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo wrote: > Swift is Apple's new programming language. In Swift, variable and constant > names can be constructed from Unicode characters. Here are a couple of > examples from Apple's doc > http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html > > let π = 3.14159 > let 你好 = "你好世界" > > I think this is a huge step forward for i18n and Unicode. > > There are some restrictions on which Unicode chars can be used. From > Apple's doc > > "Constant and variable names cannot contain mathematical symbols, arrows, > private-use (or invalid) Unicode code points, or line- and box-drawing > characters. Nor can they begin with a number, although numbers may be > included elsewhere within the name." > > The restrictions seem a little like IDNA2008. Anyone have links to info > giving a detailed explanation/tabulation of allowed and non-allowed Unicode > chars for Swift Variable and Constant names? > > André Schappo > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Wed Jun 4 05:01:57 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 04 Jun 2014 19:01:57 +0900 Subject: Corrigendum #9 In-Reply-To: <20140603195923.6ec9c275@JRWUBU2> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> <20140603195923.6ec9c275@JRWUBU2> Message-ID: <538EEE95.70100@it.aoyama.ac.jp> On 2014/06/04 03:59, Richard Wordingham wrote: > On Tue, 03 Jun 2014 16:09:27 +0900 > "Martin J. Dürst" wrote: > >> I'd strongly suggest that completely independent of when and how >> Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets >> worked out for how to get rid of these codepoints in CLDR data. The >> sooner, the better. > > I suspect this has already been done. I know of no CLDR text files > still containing them. Really great if that's true! Regards, Martin. From mark at macchiato.com Wed Jun 4 06:17:15 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 4 Jun 2014 13:17:15 +0200 Subject: Corrigendum #9 In-Reply-To: <538EEE95.70100@it.aoyama.ac.jp> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <538CF5DB.3070007@ix.netcom.com> <538D74A7.5020605@it.aoyama.ac.jp> <20140603195923.6ec9c275@JRWUBU2> <538EEE95.70100@it.aoyama.ac.jp> Message-ID: The characters are present, but are escaped in the source for readability. Here is a sample from collation/zh.xml: ... ... *« Il meglio è l'inimico del bene »* On Wed, Jun 4, 2014 at 12:01 PM, "Martin J. Dürst" wrote: > On 2014/06/04 03:59, Richard Wordingham wrote: > >> On Tue, 03 Jun 2014 16:09:27 +0900 >> "Martin J. Dürst" wrote: >> >> I'd strongly suggest that completely independent of when and how >>> Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets >>> worked out for how to get rid of these codepoints in CLDR data. 
The >>> sooner, the better. >>> >> >> I suspect this has already been done. I know of no CLDR text files >> still containing them. >> > > Really great if that's true! Regards, Martin. > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Jun 4 06:45:18 2014 From: prosfilaes at gmail.com (David Starner) Date: Wed, 4 Jun 2014 04:45:18 -0700 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: On Wed, Jun 4, 2014 at 2:28 AM, Andre Schappo wrote: > I think this is a huge step forward for i18n and Unicode. Could you not do that in Objective-C? If no, then it's a step forward for Apple, but the rest of us--Ada, C, C++, C#, Java, Python--have had this feature for years. 20 years in 2015 in the case of Ada. -- Kie ekzistas vivo, ekzistas espero. From leoboiko at namakajiri.net Wed Jun 4 06:58:22 2014 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Wed, 4 Jun 2014 08:58:22 -0300 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: Even Ruby could do it for years, despite having notoriously bad Unicode string support back then: irb> ??? = '????' => "????" irb> íslenska = 'fjólublár' => "fjólublár" irb> ??? + ' ' + íslenska => "???? fjólublár" I don't think this feature saw much use, since programmers in a global world can't assume that everyone will have easy access to their input methods, and so tend to restrict code tokens to the ASCII set to encourage participation. 2014-06-04 8:45 GMT-03:00 David Starner : > On Wed, Jun 4, 2014 at 2:28 AM, Andre Schappo > wrote: > > I think this is a huge step forward for i18n and Unicode. 
> > Could you not do that in Objective-C? If no, then it's a step forward > for Apple, but the rest of us--Ada, C, C++, C#, Java, Python--have had > this feature for years. 20 years in 2015 in the case of Ada. > > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Wed Jun 4 07:32:43 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Wed, 4 Jun 2014 14:32:43 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> On 4 Jun 2014, at 13:58, Leonardo Boiko wrote: > I don't think this feature saw much use, since programmers in a global world can't assume that everyone will have easy access to their input methods, and so tend to restrict code tokens to the ASCII set to encourage participation. Indeed, the lack of good input methods limits the usability of the math characters, which otherwise may be very useful in programming languages. One way is to add shortcut translations, like typing "real" translates into ℝ (U+211D), but they must be added by hand. From jkorpela at cs.tut.fi Wed Jun 4 08:00:20 2014 From: jkorpela at cs.tut.fi (Jukka K. 
Korpela) Date: Wed, 04 Jun 2014 16:00:20 +0300 Subject: Math input methods In-Reply-To: <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> Message-ID: <538F1864.9070603@cs.tut.fi> 2014-06-04 15:32, Hans Aberg wrote under Subject: Re: Swift: > On 4 Jun 2014, at 13:58, Leonardo Boiko > wrote: > >> I don't think this feature saw much use, since programmers in a >> global world can't assume that everyone will have easy access to >> their input methods, and so tend to restrict code tokens to the >> ASCII set to encourage participation. > > Indeed, the lack of good input methods limits the usability of the > math characters, which otherwise may be very useful in programming > languages. One way is to add shortcut translations, like typing > "real" translates into ℝ (U+211D), but they must be added by hand. If you are interested in math input methods, take a look at my design of math keyboard layout for use on normal US keyboard: http://www.cs.tut.fi/~jkorpela/math/kbd.html Input issues can be handled at many levels, including program-specific translations, but doing them at keyboard level has obvious advantages (and some problems). As an aside, the ISO 80000-2 standard on mathematical notations describes boldface letters such as boldface R as symbols for commonly known sets of numbers. The double-struck letters like ℝ are mentioned as an alternative way, whereas in the previous standard, these notations were presented the other way around. The change is logical in the sense that bold face is the more original notation, and double-struck letters as characters imitate the imitation of boldface letters when writing by hand (with a pen or piece of chalk). 
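The shortcut-translation idea raised in this thread (typing a short name that expands to a math character) can be sketched in a few lines. The table below is a made-up example for illustration, not any existing input method:

```python
# Hypothetical shortcut table: each name expands to one math character.
SHORTCUTS = {
    "\\real": "\u211D",      # ℝ DOUBLE-STRUCK CAPITAL R
    "\\integral": "\u222B",  # ∫ INTEGRAL
    "\\in": "\u2208",        # ∈ ELEMENT OF
}

def expand_shortcuts(text: str) -> str:
    """Replace each shortcut name by its character."""
    # Longest names first, so "\integral" is not eaten by "\in".
    for name in sorted(SHORTCUTS, key=len, reverse=True):
        text = text.replace(name, SHORTCUTS[name])
    return text
```

As the thread notes, the real cost is not the code but maintaining the table by hand, which is why keyboard-level solutions are attractive.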
Yucca From ian.clifton at chem.ox.ac.uk Wed Jun 4 09:42:26 2014 From: ian.clifton at chem.ox.ac.uk (Ian Clifton) Date: Wed, 04 Jun 2014 15:42:26 +0100 Subject: Math input methods In-Reply-To: <538F1864.9070603@cs.tut.fi> (Jukka K. Korpela's message of "Wed, 4 Jun 2014 16:00:20 +0300") References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: <4qioogka7h.fsf@chem-arachne.chem.ox.ac.uk> "Jukka K. Korpela" writes: > As an aside, the ISO 80000-2 standard on mathematical notations > describes boldface letters such as boldface R as symbols for commonly > known sets of numbers. The double-struck letters like ℝ are mentioned > as an alternative way, whereas in the previous standard, these > notations were presented the other way around. The change is logical > in the sense that bold face is a more original notation and > double-struck letters as characters imitate the imitation of boldface > letters when writing by hand (with a pen or piece of chalk). I'm not sure this is going to catch on with mathematicians, not least because bold letters are already heavily used, for vectors and matrices for instance. My guess is mathematicians are going to stick to their double-struck letters for these sets for as long as the year ? ?. -- Ian ? From Shawn.Steele at microsoft.com Wed Jun 4 10:53:59 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 4 Jun 2014 15:53:59 +0000 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> I'm sort of confused why Unicode would be a big deal. C# & other languages have allowed Unicode letters in identifiers for years, so readable strings should be possible in almost any language. It's a bit cute to include emoji, but I'm not sure how practical it is. 
It also makes me wonder how they came up with the list, I presume control codes aren't allowed? Or alternate whitespace? I assume they use some Unicode Categories to figure out the permitted set? I rarely see non-Latin code in practice though, but of course I'm a native English speaker. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark Davis ☕️ Sent: Wednesday, June 4, 2014 2:41 AM To: Andre Schappo Cc: unicode at unicode.org Subject: Re: Swift Apparently you can use emoji in the identifiers. ?? (http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/) Mark « Il meglio è l'inimico del bene » On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo > wrote: Swift is Apple's new programming language. In Swift, variable and constant names can be constructed from Unicode characters. Here are a couple of examples from Apple's doc http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html let π = 3.14159 let 你好 = "你好世界" I think this is a huge step forward for i18n and Unicode. There are some restrictions on which Unicode chars can be used. From Apple's doc "Constant and variable names cannot contain mathematical symbols, arrows, private-use (or invalid) Unicode code points, or line- and box-drawing characters. Nor can they begin with a number, although numbers may be included elsewhere within the name." The restrictions seem a little like IDNA2008. Anyone have links to info giving a detailed explanation/tabulation of allowed and non-allowed Unicode chars for Swift Variable and Constant names? André Schappo _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... 
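The category-based approach Shawn asks about is visible in languages that follow UAX #31's default identifier rules, which are built on the XID_Start and XID_Continue properties. Python's built-in str.isidentifier() is one easy way to probe such a rule set; Swift's exact set differs (notably by admitting emoji), so this only illustrates the general mechanism:

```python
# str.isidentifier() applies Python's identifier rules, derived from
# Unicode's XID_Start/XID_Continue properties (UAX #31).
examples = {
    "π": True,        # Greek letter, category Ll, valid XID_Start
    "名前": True,      # CJK letters (Lo) are fine
    "x2": True,       # digits allowed after the first character
    "2x": False,      # cannot begin with a number
    "a-b": False,     # hyphen is not an identifier character
    " a": False,      # whitespace and controls are excluded
}
for name, expected in examples.items():
    assert name.isidentifier() == expected, name
```

Note that "🐶".isidentifier() is False here, since emoji are category So and outside XID_Start, whereas Swift admits them; that divergence between per-language identifier sets is exactly what the thread is pointing at.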
URL: From petercon at microsoft.com Wed Jun 4 10:54:37 2014 From: petercon at microsoft.com (Peter Constable) Date: Wed, 4 Jun 2014 15:54:37 +0000 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: <16334a09559d4b9080135195e7fad164@BL2PR03MB450.namprd03.prod.outlook.com> How did the word "prohibited" enter this conversation? Peter From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy Sent: June 3, 2014 11:54 PM To: Richard Wordingham Cc: unicode at unicode.org Subject: Re: UTF-16 Encoding Scheme and U+FFFE _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Wed Jun 4 11:24:14 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Wed, 04 Jun 2014 19:24:14 +0300 Subject: Math input methods In-Reply-To: <4qioogka7h.fsf@chem-arachne.chem.ox.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> <4qioogka7h.fsf@chem-arachne.chem.ox.ac.uk> Message-ID: <538F482E.2030503@cs.tut.fi> 2014-06-04 17:42, Ian Clifton wrote: > "Jukka K. Korpela" writes: > >> As an aside, the ISO 80000-2 standard on mathematical notations >> describes boldface letters such as boldface R as symbols for commonly >> known sets of numbers. The double-struck letters like ℝ are mentioned >> as an alternative way, whereas in the previous standard, these >> notations were presented the other way around. The change is logical >> in the sense that bold face is a more original notation and >> double-struck letters as characters imitate the imitation of boldface >> letters when writing by hand (with a pen or piece of chalk). > > I'm not sure this is going to catch on with mathematicians, not least > because bold letters are already heavily used, for vectors and matrices > for instance. Vectors and matrices are denoted by italic boldface letters, so there is no confusion even in principle. 
> My guess is mathematicians are going to stick to their > double-struck letters for these sets for as long as the year ? ?. Mathematicians tend to be conservative in notations. They even use italic for the constants i, e, and π, rather illogically and against standards as well as common practices in natural sciences. But still they have changed their notations somewhat. They do not use the notations of Euclid and Archimedes any more. So maybe this will change, too. The interesting thing from the character code point of view is that we're now more or less expected to use rich text, at least bolding, rather than just the special characters. In most writing situations, it is easier to bold a letter than to enter the character ℝ, except when typing plain text, of course. This is one reason why boldface might become more common. On the other hand, when mathematicians write in AMS-TeX, both notations are equally easy to produce (once you know how to do that). In theory, we could use boldface in plain text, too, when writing mathematical notations, e.g. U+1D411 MATHEMATICAL BOLD CAPITAL R. That's just not very practical, partly because they are outside the BMP and may make programs choke, partly because font support is rather limited. My math layout has combinations for typing the double-struck letters but not for the math bold letters. It would of course be possible to create a layout specifically for bold or italic or bold italic etc. math symbols in Unicode, but their use seems to be too limited now. 
Yucca From A.Schappo at lboro.ac.uk Wed Jun 4 12:15:33 2014 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 4 Jun 2014 17:15:33 +0000 Subject: Swift In-Reply-To: <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <5BED9935-6BCE-4A8F-8BA4-4E23B45BA54B@lboro.ac.uk> Well because outside of groups like this there is still little awareness of Unicode, little understanding of Unicode, little willingness to use Unicode and little conscious usage of Unicode André On 4 Jun 2014, at 16:53, Shawn Steele wrote: I'm sort of confused why Unicode would be a big deal. C# & other languages have allowed Unicode letters in identifiers for years, so readable strings should be possible in almost any language. It's a bit cute to include emoji, but I'm not sure how practical it is. It also makes me wonder how they came up with the list, I presume control codes aren't allowed? Or alternate whitespace? I assume they use some Unicode Categories to figure out the permitted set? I rarely see non-Latin code in practice though, but of course I'm a native English speaker. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark Davis ?? Sent: Wednesday, June 4, 2014 2:41 AM To: Andre Schappo Cc: unicode at unicode.org Subject: Re: Swift Apparently you can use emoji in the identifiers. ?? (http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/) Mark ? Il meglio è l'inimico del bene ? On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo > wrote: Swift is Apple's new programming language. In Swift, variable and constant names can be constructed from Unicode characters. 
Here are a couple of examples from Apple's doc http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html let π = 3.14159 let 你好 = "你好世界" I think this is a huge step forward for i18n and Unicode. There are some restrictions on which Unicode chars can be used. From Apple's doc "Constant and variable names cannot contain mathematical symbols, arrows, private-use (or invalid) Unicode code points, or line- and box-drawing characters. Nor can they begin with a number, although numbers may be included elsewhere within the name." The restrictions seem a little like IDNA2008. Anyone have links to info giving a detailed explanation/tabulation of allowed and disallowed Unicode chars for Swift Variable and Constant names? André Schappo _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Wed Jun 4 12:36:43 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Wed, 04 Jun 2014 20:36:43 +0300 Subject: Swift In-Reply-To: <5BED9935-6BCE-4A8F-8BA4-4E23B45BA54B@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> <5BED9935-6BCE-4A8F-8BA4-4E23B45BA54B@lboro.ac.uk> Message-ID: <538F592B.204@cs.tut.fi> 2014-06-04 20:15, Andre Schappo wrote: > Well because outside of groups like this there is still little awareness > of Unicode, little understanding of Unicode, little willingness to use > Unicode and little conscious usage of Unicode That's very true. In the specific case of "using Unicode" (which so often means just "using characters outside the Ascii repertoire") in programming language identifiers, there are other contributing reasons, too. 
As alluded to here: > On 4 Jun 2014, at 16:53, Shawn Steele wrote: [...] >> I rarely see non-Latin code in practice though, but of course I'm a >> native English speaker. The point is that English is largely the de facto standard human language in programming: in documentation, comments, and hence also in forming identifiers, even though the data processed might be in different languages. There are good practical reasons for using English: programmers can be expected to understand it, and it is generally the only language you can expect them to understand. People also learn by example, and they often learn to stick to Ascii without even thinking why. Where I live, they learn to replace "ä" and "å" by "a" and "ö" by "o" rather automatically when they use words of national languages as identifiers. If you ask them, they probably say that the Scandinavian letters cannot be used reliably, which is often so true, even though it might not apply to the use in some programming languages. Personally, I often favor identifiers in the national language for clarity: this distinguishes user-defined identifiers from reserved words and from identifiers defined in libraries. But this is useful mostly in tutorial material, not that much in routine programming. Yucca From doug at ewellic.org Wed Jun 4 13:00:50 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 04 Jun 2014 11:00:50 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140604110050.665a7a7059d7ee80bb4d670165c8327d.58a433e6ae.wbe@email03.secureserver.net> How common is it to see any of the following in real-world Unicode text, as opposed to code charts and test suites and the like? 1. Unpaired surrogates 2. Noncharacters (besides CLDR data) 3. U+FEFF at the beginning of a stream (note: not "packet" or arbitrary cutoff point) I'm not asking whether any of these are recommended or "prohibited" or whether they are a good idea. I'm asking about actual usage. 
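Doug's three corner cases are mechanical to test for. A minimal sketch in Python (the function name and the reporting format are mine, not from the thread); running something like this over real-world corpora is one way to answer the "actual usage" question:

```python
def find_corner_cases(text):
    """Scan a string for the three corner cases listed above:
    unpaired surrogates, noncharacters, and a leading U+FEFF."""
    issues = []
    if text.startswith("\ufeff"):
        issues.append("U+FEFF at start of stream")
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # In a Python str these only show up via lossy decoding
            # (e.g. errors="surrogateescape"), i.e. unpaired surrogates.
            issues.append("surrogate U+%04X at index %d" % (cp, i))
        elif 0xFDD0 <= cp <= 0xFDEF or cp & 0xFFFE == 0xFFFE:
            # U+FDD0..U+FDEF plus the last two code points of each plane.
            issues.append("noncharacter U+%04X at index %d" % (cp, i))
    return issues

print(find_corner_cases("\ufeffabc\ufdd0"))
```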
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From Shawn.Steele at microsoft.com Wed Jun 4 13:10:05 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 4 Jun 2014 18:10:05 +0000 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604110050.665a7a7059d7ee80bb4d670165c8327d.58a433e6ae.wbe@email03.secureserver.net> References: <20140604110050.665a7a7059d7ee80bb4d670165c8327d.58a433e6ae.wbe@email03.secureserver.net> Message-ID: <84b38c9dcd304f87b85f034c1706f3b8@BY2PR03MB491.namprd03.prod.outlook.com> The BOM I've seen (not FFFE though); its prevalence depends on the system and other factors. The others I only see if there's corruption, bugs, or tests. The most common error I see that causes those is when some developer calls a binary blob a Unicode string and tries to shove it through a text transport or something. Usually that bites them sooner or later. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Wednesday, June 4, 2014 11:01 AM To: unicode at unicode.org Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) How common is it to see any of the following in real-world Unicode text, as opposed to code charts and test suites and the like? 1. Unpaired surrogates 2. Noncharacters (besides CLDR data) 3. U+FEFF at the beginning of a stream (note: not "packet" or arbitrary cutoff point) I'm not asking whether any of these are recommended or "prohibited" or whether they are a good idea. I'm asking about actual usage. 
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From doug at ewellic.org Wed Jun 4 13:26:01 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 04 Jun 2014 11:26:01 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> Sorry, I left out an important detail. I wrote: > 3. U+FEFF at the beginning of a stream (note: not "packet" or > arbitrary cutoff point) I meant U+FEFF as a zero-width no-break space. Obviously it is very common to see U+FEFF as a signature or BOM. My underlying question here is, how common is it that the producer of a stream actually intends this character *at the start of a stream* to be a ZWNBSP, not to be stripped lest the actual text content be altered? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From wjgo_10009 at btinternet.com Wed Jun 4 12:57:09 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 4 Jun 2014 18:57:09 +0100 (BST) Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> Message-ID: <1401904629.853.YahooMailNeo@web87805.mail.ir2.yahoo.com> An interesting use of the U+FEFF character as a BYTE ORDER MARK is in the file format described as Unicode Text Document, which one may choose when using Save As... in the Microsoft WordPad program. I have used that file format in my research. http://forum.high-logic.com/viewtopic.php?p=21048#p21048 It is interesting to produce such a file and then examine the contents at a byte-by-byte level, understanding the use of the BYTE ORDER MARK. 
William Overington 4 June 2014 From asmusf at ix.netcom.com Wed Jun 4 13:40:11 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 04 Jun 2014 11:40:11 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> References: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> Message-ID: <538F680B.9040101@ix.netcom.com> On 6/4/2014 11:26 AM, Doug Ewell wrote: > Sorry, I left out an important detail. > > I wrote: > >> 3. U+FEFF at the beginning of a stream (note: not "packet" or >> arbitrary cutoff point) > I meant U+FEFF as a zero-width no-break space. Obviously it is very > common to see U+FEFF as a signature or BOM. > > My underlying question here is, how common is it that the producer of a > stream actually intends this character *at the start of a stream* to be > a ZWNBSP, not to be stripped lest the actual text content be altered? The semantics of it were chosen at the time to make no sense at the start, and to make the character invisible in most situations. The remnant of its semantic was later taken up by Word Joiner, so that there is now NO use for this as part of text. The use as part of a convention has always been clear. If you stick this at the front, readers will byte-reverse your data; that should weed out accidental use pretty quickly :) Or prevent people from getting "cute" with it in other ways. So, I would think that for this particular code point, you can safely assume that it's buggy or test data. Buggy data you just byte reverse as requested and let the user take the consequence. 
:) A./ > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From richard.wordingham at ntlworld.com Wed Jun 4 14:01:48 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 20:01:48 +0100 Subject: UTF-16 Encoding Scheme and U+FFFE In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <20140604005051.1f2aee9a@JRWUBU2> Message-ID: <20140604200148.7132c3d3@JRWUBU2> On Wed, 4 Jun 2014 00:23:53 +0000 "Whistler, Ken" wrote: > You cannot even be "very confident" of not finding actual ill-formed > UTF-16, like unpaired surrogates, in an external file, let alone > noncharacters. I though unpaired surrogates were normally mojibake, broken characters, or sabotage attempts. > Any one of those test strings could be > trivially turned into a text file by piping out that one UTF-16 > string to a file. At that point, you should be in detailed control of the Unicode encoding scheme. Also, would not the system be using one of UTF16 with byte order marks, UTF-16BE and UTF-16LE? > And I could then write conformant test software > that would read UTF-16 string input data from that file and run it > through the UCA algorithm to construct sortkeys for it. Given the number of control characters in that file, I wouldn't be confident of getting the output back the same as it went out unless the input were controlled at a binary level. > As Peter said, the main thing that prevents running into these is > that it isn't very *useful* to start off files (or strings) with > U+FFFE. Actually, for sorting records using the CLDR collation algorithm, it may be very useful to use U+FFFE as a field separator. If the most significant field for sorting is sometimes empty (e.g. surname in a list of contacts), then the field separator could very easily be the first non-BOM character after sorting. 
I suppose one had better use something like as a field separator instead. > (And, additionally, in the case of UTF-16 text data files, it > would be confusing and possibly lead to misinterpretation of byte > order, if you were somehow depending solely on initial BOMs -- which > I wouldn't advise, anyway.) Interesting. Goodbye UTF-16 encoding scheme and hello automatic encoding detection. I'm not sure how automatic detection is supposed to work with a file consisting of just a test string from the collation test. > Basically, the rules of standards (e.g., you shouldn't try to > publicly interchange noncharacters) are not like laws of > physics. Just because the standard says you shouldn't do > it doesn't mean it doesn't happen. Just as theft happens. Richard. From richard.wordingham at ntlworld.com Wed Jun 4 14:21:03 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 4 Jun 2014 20:21:03 +0100 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <538F680B.9040101@ix.netcom.com> References: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net> <538F680B.9040101@ix.netcom.com> Message-ID: <20140604202103.7cbfcc49@JRWUBU2> On Wed, 04 Jun 2014 11:40:11 -0700 Asmus Freytag wrote: > On 6/4/2014 11:26 AM, Doug Ewell wrote: > > I meant U+FEFF as a zero-width no-break space. Obviously it is very > > common to see U+FEFF as a signature or BOM. > The semantics of it were chosen at the time to make no sense > at the start, and to make the character invisible in most situations. > The remnant of its semantic was later taken up by Word Joiner, so that > there is now NO use for this as part of text. > The use as part of a convention has always been clear. If you stick > this at the front, readers will byte-reverse your data; that should > weed out accidental use pretty quickly :) Or prevent people from > getting "cute" with it in other ways. Wrong! 
If you stick U+FEFF at the start of a file, expect it to be stripped. If you stick U+FFFE at the start of a file, then expect to see the rest of the text to be byte-reversed. > So, I would think that for this particular code point, you can safely > assume that it's buggy or test data. The example that's usually given is that of a text file sliced into segments to avoid file size limits. In these cases, there is the risk that U+FEFF as ZWNBSP will wind up at the start of a segment and be stripped. The solution using the Windows command window is to perform a *binary* concatenation of the segments; if one doesn't, newlines will be inserted between the segments, which is much severer damage. Richard. From asmusf at ix.netcom.com Wed Jun 4 14:52:02 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 04 Jun 2014 12:52:02 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604202103.7cbfcc49@JRWUBU2> References: <20140604112601.665a7a7059d7ee80bb4d670165c8327d.f6d97be3d0.wbe@email03.secureserver.net><538F680B.9040101@ix.netcom.com> <20140604202103.7cbfcc49@JRWUBU2> Message-ID: <538F78E2.2030502@ix.netcom.com> On 6/4/2014 12:21 PM, Richard Wordingham wrote: > On Wed, 04 Jun 2014 11:40:11 -0700 > Asmus Freytag wrote: > >> On 6/4/2014 11:26 AM, Doug Ewell wrote: >>> I meant U+FEFF as a zero-width no-break space. Obviously it is very >>> common to see U+FEFF as a signature or BOM. >> The semantics of it were chosen at the time to make no sense >> at the start, and to make the character invisible in most situations. >> The remnant of its semantic was later taken up by Word Joiner, so that >> there is now NO use for this as part of text. > >> The use as part of a convention has always been clear. If you stick >> this at the front, readers will byte-reverse your data; that should >> weed out accidental use pretty quickly :) Or prevent people from >> getting "cute" with it in other ways. > Wrong! 
If you stick U+FEFF at the start of a file, expect it to be > stripped. If you stick U+FFFE at the start of a file, then expect to > see the rest of the text to be byte-reversed. Duh. (reminder, have coffee first) A./ > >> So, I would think that for this particular code point, you can safely >> assume that it's buggy or test data. > The example that's usually given is that of a text file sliced into > segments to avoid file size limits. In these cases, there is the risk > that U+FEFF as ZWNBSP will wind up at the start of a segment and be > stripped. The solution using the Windows command window is to perform a > *binary* concatenation of the segments; if one doesn't, newlines will > be inserted between the segments, which is much severer damage. > > Richard. > From npatch at shutterstock.com Wed Jun 4 16:04:07 2014 From: npatch at shutterstock.com (Nick Patch) Date: Wed, 4 Jun 2014 17:04:07 -0400 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: On Wed, Jun 4, 2014 at 7:45 AM, David Starner wrote: > Could you not do that in Objective-C? If no, then it's a step forward > for Apple, but the rest of us--Ada, C, C++, C#, Java, Python--have had > this feature for years. 20 years in 2015 in the case of Ada. Also, Perl has supported Unicode identifiers for 14 years. They were added in Perl v5.6, released in March 2000. The officially supported identifier characters are documented here: https://metacpan.org/pod/perldata#Identifier-parsing Here's a UTS #18 style regex to match a Perl identifier: [ [ \p{word} && \p{XID_Start} ] || _ ][ \p{word} && \p{XID_Continue} ]* And the equivalent Perl regex: (?[ ( \p{word} & \p{XID_Start} ) | [_] ])(?[ \p{word} & \p{XID_Continue} ])* This is basically the default XID identifier recommended in UAX #31 but excluding any non-"word" characters and also allowing a leading underscore. 
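For comparison with the Perl rule Nick quotes, Python's stdlib exposes the same UAX #31 classes (XID_Start/XID_Continue, with a leading underscore allowed) through `str.isidentifier()`, which makes it easy to experiment:

```python
# str.isidentifier() implements Python's identifier grammar, which is the
# UAX #31 default (XID_Start then XID_Continue*, underscore allowed),
# close to the Perl rule above minus Perl's \p{word} intersection.
for candidate in ["café", "_private", "π", "1abc", "no spaces", "a-b"]:
    print(candidate, candidate.isidentifier())
```

Note that Python additionally NFKC-normalizes identifiers at parse time, so visually distinct spellings can name the same variable.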
By the way, in the past I found that PHP even allows many different whitespace characters in identifiers! -- Nick Patch @nickpatch -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 4 17:48:02 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 04 Jun 2014 15:48:02 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> Richard Wordingham wrote: > The example that's usually given [of U+FEFF at the start of a stream] > is that of a text file sliced into segments to avoid file size limits. > In these cases, there is the risk that U+FEFF as ZWNBSP will wind up > at the start of a segment and be stripped. Nope, that's exactly the case I was excluding when I wrote: > 3. U+FEFF [as a zero-width no-break space] at the beginning of a > stream (note: not "packet" or arbitrary cutoff point) If you are processing arbitrary fragments of a stream, without knowledge of preceding fragments, as in this example, then you have no business making *any* changes to that fragment based on interpretation of that fragment as Unicode text. Your sole responsibilities at that point are to pass the fragments, intact, from one process to the next, or to disassemble and reassemble them. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From haberg-1 at telia.com Wed Jun 4 18:10:52 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 01:10:52 +0200 Subject: Math input methods In-Reply-To: <538F1864.9070603@cs.tut.fi> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: On 4 Jun 2014, at 15:00, Jukka K. 
Korpela wrote: > 2014-06-04 15:32, Hans Aberg wrote under Subject: Re: Swift: > >> On 4 Jun 2014, at 13:58, Leonardo Boiko >> wrote: >> >>> I don't think this feature saw much use, since programmers in a >>> global world can't assume that everyone will have easy access to >>> their input methods, and so tend to restrict code tokens to the >>> ASCII set to encourage participation. >> >> Indeed, the lack of good input methods limits the usability of the >> math characters, which otherwise may be very useful in programming >> languages. One way is to add shortcut translations, like typing >> "real" translates into ℝ (U+211D), but they must be added by hand. > > If you are interested in math input methods, take a look at my design of math keyboard layout for use on normal US keyboard: > http://www.cs.tut.fi/~jkorpela/math/kbd.html Unfortunately I use a different platform. > Input issues can be handled at many levels, including program-specific translations, but doing them at keyboard level has obvious advantages (and some problems). > > As an aside, the ISO 80000-2 standard on mathematical notations describes boldface letters such as boldface R as symbols for commonly known sets of numbers. The double-struck letters like ℝ are mentioned as an alternative way, whereas in the previous standard, these notations were presented the other way around. The change is logical in the sense that bold face is a more original notation and double-struck letters as characters imitate the imitation of boldface letters when writing by hand (with a pen or piece of chalk). The STIX fonts [1] have a lot of the "traditional" math characters, including the math styles. A discussion here revealed that mathematicians nowadays use a lot more. So a problem is that math uses a lot of characters. 1. 
http://www.stixfonts.org From joe at unicode.org Wed Jun 4 18:44:24 2014 From: joe at unicode.org (Joe Becker) Date: Wed, 04 Jun 2014 16:44:24 -0700 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <538FAF58.4080501@unicode.org> A bit of ancient history, from the System Development Division spinoff of Xerox PARC: Around 1979, Xerox adopted the multi-byte Xerox Character Code Standard (XCCS), an ancestor of Unicode. Around 1980, Larry Masinter and I converted the Xerox Lisp system to XCCS, including our phonetic-based Japanese input method, using Japanese fonts supplied by Fuji Xerox. The system was demo'ed at a trade show as "JLisp" ... of course the attendees showed no interest. Around 1985, Lori Nagata converted our product compiler (for a Pascal derivative called Mesa) to accept XCCS sourcecode, including fully multilingual identifiers, strings, and comments ... of course the Development Environment group showed no interest. Maybe now the world is ready for "????" ... I don't think I am ... Joe From prosfilaes at gmail.com Wed Jun 4 21:50:20 2014 From: prosfilaes at gmail.com (David Starner) Date: Wed, 4 Jun 2014 19:50:20 -0700 Subject: Math input methods In-Reply-To: <538F1864.9070603@cs.tut.fi> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: On Wed, Jun 4, 2014 at 6:00 AM, Jukka K. Korpela wrote: > The change is logical in the sense that bold face is a > more original notation and double-struck letters as characters imitate the > imitation of boldface letters when writing by hand (with a pen or piece of > chalk). On the other hand, bold face is a minor variation on normal types. 
Double-struck letters are more clearly distinct, which is probably why they moved from the chalkboard to printing in the first place. I don't see much advantage of 𝐍𝐂𝐑𝐙𝐐 over ℕℂℝℤℚ, especially when confusability with NCRZQ comes into play. -- Kie ekzistas vivo, ekzistas espero. From verdy_p at wanadoo.fr Thu Jun 5 02:41:07 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 5 Jun 2014 09:41:07 +0200 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> References: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> Message-ID: 2014-06-05 0:48 GMT+02:00 Doug Ewell : > If you are processing arbitrary fragments of a stream, without knowledge > of preceding fragments, as in this example, then you have no business > making *any* changes to that fragment based on interpretation of that > fragment as Unicode text. Your sole responsibilities at that point are > to pass the fragments, intact, from one process to the next, or to > disassemble and reassemble them. Not necessarily true. You can easily think about the debugging log coming from an OS or device and accumulating text data coming from various sources in the device. Then you can connect to a live stream at any time without necessarily following all that happened before. You'll probably want to sync on the first newline control and then proceed from that point. But now if you have those devices configured heterogeneously and generating their own output encoding, you won't necessarily know how it is encoded even if all of them use some Unicode UTF. So the stream will regularly repost an encoding mark, for example at the beginning of each dated log entry, and this could be just an encoded BOM (even with UTF-8, or some other UTF like UTF-16, which would be more likely if the content were essentially in an East Asian (CJK) language). 
These devices would emit these messages or logs with a very basic protocol, or no protocol at all (Telnet, serial link, ...) without any prior negotiation (these data feeds are unidirectional, meant to be used by any number of consumers that can connect or disconnect at any time; the log producer will never know how many clients there are, notably for passive debugging logs). You could then expect BOMs to occur many times in the stream (this is what I called a "live" stream: it has no start, no end, no defined total size; you don't know when new texts will be emitted, you don't even know at which rate, which could be very high). If the rate is too high, one can use a fast local proxy to filter the feed with patterns (e.g. a debug level reported at the start of each log entry, or some identifier of the real source, not controlled directly at the point where you connect to listen to the stream) and receive only the result that can be supported over a slower link to the client. But here also the proxy will not necessarily work continuously, only when there is some interested client providing a matching pattern. The resulting texts will then be highly fragmented. So your assumption is only true when you think about processes that have a prior agreement to use some specific convention. But in a heterogeneous world, where participants (producers and consumers) are maintained separately and can appear or disappear at any time, you cannot expect that they will all use the same encoding, or that disassembling/reassembling is as safe as you think. This is only true if they work in close cooperation under strict common standards. 
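A consumer joining such a feed mid-stream has to tolerate both byte chunks cut anywhere and the BOMs a producer re-posts at resynchronization points. A minimal sketch of the UTF-8 case (the function name is mine); it pairs an incremental decoder, which buffers bytes split mid-sequence, with stripping of interior U+FEFF:

```python
import codecs

def read_live_utf8(chunks):
    """Incrementally decode a byte stream that may be cut anywhere,
    dropping the U+FEFF marks a producer re-posts as sync points."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # decode() holds back an incomplete trailing sequence until
        # the next chunk completes it.
        yield decoder.decode(chunk).replace("\ufeff", "")

# A feed cut mid-character and carrying repeated UTF-8 BOMs:
chunks = [b"\xef\xbb\xbflog: ok\n\xef", b"\xbb\xbflog: caf\xc3", b"\xa9\n"]
print("".join(read_live_utf8(chunks)))  # log: ok / log: café
```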
Take the example of a service that would archive all received emails in a feed, or a list of SMS messages from a group of participants; do you need to archive not only the texts themselves but also all the protocol metadata from which they originated, when the application is creating a basic log which will not be used by SMS or emails due to the generated volume? Encoded texts in heterogeneous environments and over the web, where people could use various OSes and languages, are well-known examples where plain text is not always sufficient to determine how to divide it; you cannot just "guess" from the content when this content can change at any time. And these texts are not always safely convertible to the same encoding without data losses or alterations. If you don't insert in the live stream enough BOMs after some resynchronization points, the result that consumers will get will be full of mojibake. -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Thu Jun 5 03:57:13 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 10:57:13 +0200 Subject: Math input methods In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> Message-ID: <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> On 5 Jun 2014, at 04:50, David Starner wrote: > On Wed, Jun 4, 2014 at 6:00 AM, Jukka K. Korpela wrote: >> The change is logical in the sense that bold face is a >> more original notation and double-struck letters as characters imitate the >> imitation of boldface letters when writing by hand (with a pen or piece of >> chalk). > > On the other hand, bold face is a minor variation on normal types. > Double-struck letters are more clearly distinct, which is probably why > they moved from the chalkboard to printing in the first place. I don't > see much advantage of 𝐍𝐂𝐑𝐙𝐐 
over ℕℂℝℤℚ, especially when > confusability with NCRZQ comes into play. The double-struck letters are useful in math, because they free other letter styles for other use. First, only a few were used, for the natural, rational, real and complex numbers, but they became popular, so that all letters, uppercase and lowercase, are now available in Unicode. From jlturriff at centurylink.net Thu Jun 5 05:04:11 2014 From: jlturriff at centurylink.net (J. Leslie Turriff) Date: Thu, 5 Jun 2014 05:04:11 -0500 Subject: Swift In-Reply-To: <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140603195253.3c0df53f@JRWUBU2> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <201406050504.11990.jlturriff@centurylink.net> On Wednesday 04 June 2014 10:53:59 Shawn Steele wrote: > I'm sort of confused why Unicode would be a big deal. C# & other languages > have allowed Unicode letters in identifiers for years, so readable strings > should be possible in almost any language. > > It's a bit cute to include emoji, but I'm not sure how practical it is. It > also makes me wonder how they came up with the list, I presume control > codes aren't allowed? Or alternate whitespace? I assume they use some > Unicode Categories to figure out the permitted set? > > I rarely see non-Latin code in practice though, but of course I'm a native > English speaker. > > -Shawn What I find interesting is that (with the possible exception of Ada) I don't think that any of the commonly used languages allow for the use of Unicode characters for non-user-defined tokens (i.e. reserved words, etc.). I'm working on a parser for the Rexx language that will allow all tokens to be recognized using the default (or a user-specified) locale, not just the user-defined tokens. It will also allow various single-character operators equivalent to the multiple-character ones defined in the current language standard (e.g. '≠' for '¬=', '<>' or '\=', '≤' for '<=', '≥' 
for '>=', etc.). Leslie -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From prosfilaes at gmail.com Thu Jun 5 05:52:13 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 5 Jun 2014 03:52:13 -0700 Subject: Swift In-Reply-To: <201406050504.11990.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> <201406050504.11990.jlturriff@centurylink.net> Message-ID: On Thu, Jun 5, 2014 at 3:04 AM, J. Leslie Turriff wrote: > What I find interesting is that (with the possible exception of Ada) I don't > think that any of the commonly used languages allow for the use of Unicode > characters for non- user-defined tokens (i.e. reserved words, etc.). There is one non-ASCII character in the library, for Pi, and that caused some fuss, along with some eye-rolling, as writing the Unicode characters as ["03C0"] is permitted. Ada is a conservative language, and there's no real drive to make changes like these. (I was mistaken on the 20 years for Unicode identifiers; it was the Ada 2005 standard that permitted it, not Ada 95.) Scala is not really a commonly used language, but does use some Unicode arrows: ? for =>, ?for <- and ? for ->. Most people don't bother. ALGOL 60 and ALGOL 68 used non-ASCII characters like ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? and ?, and had compiler defined spellings for keywords. -- Kie ekzistas vivo, ekzistas espero. From frederic.grosshans at gmail.com Thu Jun 5 06:10:42 2014 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Thu, 05 Jun 2014 13:10:42 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <1ff0f409444843e08149b57fd87e780c@BY2PR03MB491.namprd03.prod.outlook.com> <201406050504.11990.jlturriff@centurylink.net> Message-ID: <53905032.6010000@gmail.com> Le 05/06/2014 12:52, David Starner a ?crit : > On Thu, Jun 5, 2014 at 3:04 AM, J. 
Leslie Turriff > wrote: >> What I find interesting is that (with the possible exception of Ada) I don't >> think that any of the commonly used languages allow for the use of Unicode >> characters for non-user-defined tokens (i.e. reserved words, etc.). > There is one non-ASCII character in the library, for Pi, and that > caused some fuss, along with some eye-rolling, as writing the Unicode > characters as ["03C0"] is permitted. Ada is a conservative language, > and there's no real drive to make changes like these. (I was mistaken > on the 20 years for Unicode identifiers; it was the Ada 2005 standard > that permitted it, not Ada 95.) > > Scala is not really a commonly used language, but does use some > Unicode arrows: ⇒ for =>, ← for <- and → for ->. Most people don't > bother. > > ALGOL 60 and ALGOL 68 used non-ASCII characters like ?, ?, ?, ?, ?, ?, > ?, ?, ?, ?, ? and ?, and had compiler-defined spellings for keywords. > And, of course, there is APL ( https://en.wikipedia.org/wiki/APL_%28programming_language%29 ). Unicode has 70 characters specifically for its use (APL FUNCTIONAL SYMBOL ****), U+2336 to U+237A since Unicode 1.1 and U+2395 since Unicode 3.0 From martin at v.loewis.de Thu Jun 5 10:27:42 2014 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 05 Jun 2014 17:27:42 +0200 Subject: Swift In-Reply-To: <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> Message-ID: <53908C6E.4@v.loewis.de> Am 04.06.14 11:28, schrieb Andre Schappo: > The restrictions seem a little like IDNA2008. Anyone have links to > info giving a detailed explanation/tabulation of allowed and non > allowed Unicode chars for Swift Variable and Constant names?
The language reference is at https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html

For reference, the definition of identifier-character is (read each line as an alternative)

identifier-character → Digit 0 through 9
identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or U+FE20–U+FE2F
identifier-character → identifier-head

where identifier-head is

identifier-head → Upper- or lowercase letter A through Z
identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or U+00B7–U+00BA
identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or U+00F8–U+00FF
identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or U+180F–U+1DBF
identifier-head → U+1E00–U+1FFF
identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, or U+2060–U+206F
identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or U+2776–U+2793
identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF
identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or U+3040–U+D7FF
identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or U+FE30–U+FE44
identifier-head → U+FE47–U+FFFD
identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or U+40000–U+4FFFD
identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or U+80000–U+8FFFD
identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or U+C0000–U+CFFFD
identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD

As the construction principle for this list, they say "Identifiers begin with an upper case or lower case letter A through Z, an underscore (_), a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn't in a Private Use Area. After the first character, digits and combining Unicode characters are also allowed."
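The range-based grammar above is easy to implement as a table lookup. The following Python sketch (not part of the original thread, and deliberately abridged: only a handful of the identifier-head alternatives are transcribed, a real checker would carry the full list) shows the idea:

```python
# Abridged subset of Swift's identifier-head ranges quoted above.
# A complete implementation would include every alternative from the grammar.
IDENTIFIER_HEAD_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),           # A-Z, a-z
    (0x5F, 0x5F),                          # underscore
    (0x00C0, 0x00D6), (0x00D8, 0x00F6),   # Latin-1 letters (excludes x00D7, x00F7)
    (0x0100, 0x02FF), (0x0370, 0x167F),   # large BMP letter ranges
    (0x1E00, 0x1FFF),
    (0x3040, 0xD7FF),
    (0x10000, 0x1FFFD),                    # first supplementary-plane range
]

def is_identifier_head(ch: str) -> bool:
    """True if ch may begin an identifier under the abridged ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in IDENTIFIER_HEAD_RANGES)
```

Note how the ranges carve letters out of Latin-1 while skipping U+00D7 (multiplication sign) and U+00F7 (division sign), so `is_identifier_head('é')` holds but `is_identifier_head('×')` does not; digits are excluded from the head but allowed as identifier-character.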
Regards, Martin From senn at maya.com Thu Jun 5 10:46:41 2014 From: senn at maya.com (Jeff Senn) Date: Thu, 5 Jun 2014 11:46:41 -0400 Subject: Swift In-Reply-To: <53908C6E.4@v.loewis.de> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: Has anyone figured out whether character sequences that are non-canonical (de)compositions but could be recomposed to the same result are the same identifier or not? That is: are identifiers merely sequences of characters or intended to be comparable as ?Unicode strings? (under some sort of compatibility rule)? On Jun 5, 2014, at 11:27 AM, Martin v. L?wis wrote: > Am 04.06.14 11:28, schrieb Andre Schappo: >> The restrictions seem a little like IDNA2008. Anyone have links to >> info giving a detailed explanation/tabulation of allowed and non >> allowed Unicode chars for Swift Variable and Constant names? > > The language reference is at > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > For reference, the definition of identifier-character is (read each > line as an alternative) > > identifier-character ? Digit 0 through 9 > identifier-character ? U+0300?U+036F, U+1DC0?U+1DFF, U+20D0?U+20FF, or > U+FE20?U+FE2F > identifier-character ? identifier-head? > > where identifier-head is > > identifier-head ? Upper- or lowercase letter A through Z > identifier-head ? U+00A8, U+00AA, U+00AD, U+00AF, U+00B2?U+00B5, or > U+00B7?U+00BA > identifier-head ? U+00BC?U+00BE, U+00C0?U+00D6, U+00D8?U+00F6, or > U+00F8?U+00FF > identifier-head ? U+0100?U+02FF, U+0370?U+167F, U+1681?U+180D, or > U+180F?U+1DBF > identifier-head ? U+1E00?U+1FFF > identifier-head ? U+200B?U+200D, U+202A?U+202E, U+203F?U+2040, U+2054, > or U+2060?U+206F > identifier-head ? U+2070?U+20CF, U+2100?U+218F, U+2460?U+24FF, or > U+2776?U+2793 > identifier-head ? U+2C00?U+2DFF or U+2E80?U+2FFF > identifier-head ? 
U+3004?U+3007, U+3021?U+302F, U+3031?U+303F, or > U+3040?U+D7FF > identifier-head ? U+F900?U+FD3D, U+FD40?U+FDCF, U+FDF0?U+FE1F, or > U+FE30?U+FE44 > identifier-head ? U+FE47?U+FFFD > identifier-head ? U+10000?U+1FFFD, U+20000?U+2FFFD, U+30000?U+3FFFD, or > U+40000?U+4FFFD > identifier-head ? U+50000?U+5FFFD, U+60000?U+6FFFD, U+70000?U+7FFFD, or > U+80000?U+8FFFD > identifier-head ? U+90000?U+9FFFD, U+A0000?U+AFFFD, U+B0000?U+BFFFD, or > U+C0000?U+CFFFD > identifier-head ? U+D0000?U+DFFFD or U+E0000?U+EFFFD > > As the construction principle for this list, they say > > "Identifiers begin with an upper case or lower case letter A through Z, > an underscore (_), a noncombining alphanumeric Unicode character in the > Basic Multilingual Plane, or a character outside the Basic Multilingual > Plan that isn?t in a Private Use Area. After the first character, digits > and combining Unicode characters are also allowed." > > Regards, > Martin > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From mark at macchiato.com Thu Jun 5 11:06:25 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 5 Jun 2014 18:06:25 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: I haven't done any analysis, but on first glance it looks like it is based on http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax Mark *? Il meglio ? l?inimico del bene ?* On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn wrote: > Has anyone figured out whether character sequences that are non-canonical > (de)compositions but could be recomposed to the same result > are the same identifier or not? > > That is: are identifiers merely sequences of characters or intended to be > comparable as ?Unicode strings? (under some sort of compatibility rule)? 
> > On Jun 5, 2014, at 11:27 AM, Martin v. L?wis wrote: > > > Am 04.06.14 11:28, schrieb Andre Schappo: > >> The restrictions seem a little like IDNA2008. Anyone have links to > >> info giving a detailed explanation/tabulation of allowed and non > >> allowed Unicode chars for Swift Variable and Constant names? > > > > The language reference is at > > > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > > > For reference, the definition of identifier-character is (read each > > line as an alternative) > > > > identifier-character ? Digit 0 through 9 > > identifier-character ? U+0300?U+036F, U+1DC0?U+1DFF, U+20D0?U+20FF, or > > U+FE20?U+FE2F > > identifier-character ? identifier-head? > > > > where identifier-head is > > > > identifier-head ? Upper- or lowercase letter A through Z > > identifier-head ? U+00A8, U+00AA, U+00AD, U+00AF, U+00B2?U+00B5, or > > U+00B7?U+00BA > > identifier-head ? U+00BC?U+00BE, U+00C0?U+00D6, U+00D8?U+00F6, or > > U+00F8?U+00FF > > identifier-head ? U+0100?U+02FF, U+0370?U+167F, U+1681?U+180D, or > > U+180F?U+1DBF > > identifier-head ? U+1E00?U+1FFF > > identifier-head ? U+200B?U+200D, U+202A?U+202E, U+203F?U+2040, U+2054, > > or U+2060?U+206F > > identifier-head ? U+2070?U+20CF, U+2100?U+218F, U+2460?U+24FF, or > > U+2776?U+2793 > > identifier-head ? U+2C00?U+2DFF or U+2E80?U+2FFF > > identifier-head ? U+3004?U+3007, U+3021?U+302F, U+3031?U+303F, or > > U+3040?U+D7FF > > identifier-head ? U+F900?U+FD3D, U+FD40?U+FDCF, U+FDF0?U+FE1F, or > > U+FE30?U+FE44 > > identifier-head ? U+FE47?U+FFFD > > identifier-head ? U+10000?U+1FFFD, U+20000?U+2FFFD, U+30000?U+3FFFD, or > > U+40000?U+4FFFD > > identifier-head ? U+50000?U+5FFFD, U+60000?U+6FFFD, U+70000?U+7FFFD, or > > U+80000?U+8FFFD > > identifier-head ? U+90000?U+9FFFD, U+A0000?U+AFFFD, U+B0000?U+BFFFD, or > > U+C0000?U+CFFFD > > identifier-head ? 
U+D0000?U+DFFFD or U+E0000?U+EFFFD > > > > As the construction principle for this list, they say > > > > "Identifiers begin with an upper case or lower case letter A through Z, > > an underscore (_), a noncombining alphanumeric Unicode character in the > > Basic Multilingual Plane, or a character outside the Basic Multilingual > > Plan that isn?t in a Private Use Area. After the first character, digits > > and combining Unicode characters are also allowed." > > > > Regards, > > Martin > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Thu Jun 5 11:41:17 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 18:41:17 +0200 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> On 5 Jun 2014, at 17:46, Jeff Senn wrote: > That is: are identifiers merely sequences of characters or intended to be comparable as ?Unicode strings? (under some sort of compatibility rule)? In computer languages, identifiers are normally compared only for equality, as it reduces lookup time complexity. 
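Hans Aberg's point, that identifiers are normally compared only for equality, is exactly where Jeff Senn's question bites: two canonically equivalent spellings of the same identifier are unequal as raw code-point sequences. A minimal Python illustration (not part of the original thread):

```python
import unicodedata

# Two source spellings of the visually identical identifier "café":
composed = "caf\u00e9"       # 4 code points, 'é' precomposed (U+00E9)
decomposed = "cafe\u0301"    # 5 code points, 'e' + U+0301 COMBINING ACUTE ACCENT

# Plain equality, as a hashed symbol table would do it, sees two identifiers:
print(composed == decomposed)                      # False

# Canonical normalization (NFC) maps both to the same sequence:
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```

A compiler that skips normalization therefore treats these as distinct symbols even though no editor or reader can tell them apart.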
From senn at maya.com Thu Jun 5 12:24:12 2014 From: senn at maya.com (Jeff Senn) Date: Thu, 5 Jun 2014 13:24:12 -0400 Subject: Swift In-Reply-To: <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> Message-ID: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > On 5 Jun 2014, at 17:46, Jeff Senn wrote: > >> That is: are identifiers merely sequences of characters or intended to be comparable as “Unicode strings” (under some sort of compatibility rule)? > > In computer languages, identifiers are normally compared only for equality, as it reduces lookup time complexity. Well in this case we are talking about parsing a source file and generating internal symbols, so the complexity of the comparison operation is a red herring. The real question is how the source identifier gets mapped into a (compiled) symbol. (e.g. in C++ this is not an obvious operation) If your implication is that there should be no canonicalization (the string from the source is used as a sequence of characters only, directly mapped to a symbol), then I predict sticky problems in the future. The most obvious of which is that in some cases I will be able to change the semantics of the compiled program by (accidentally) canonicalizing the source text (an operation, I will point out, that is invisible to the user in many (most?) Unicode aware editors).
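The hazard Senn describes, an editor silently canonicalizing the source on save, can be checked for mechanically. A small Python sketch (an illustration added here, not something from the thread) detects whether re-saving a file in NFC would change the code-point sequence, and hence every symbol a normalization-unaware compiler derives from it:

```python
import unicodedata

def nfc_sensitive(source: str) -> bool:
    """True if an editor that silently re-saves the text in NFC would
    change its code-point sequence (and so change the symbols a
    normalization-unaware compiler would generate from it)."""
    return unicodedata.normalize("NFC", source) != source

decomposed_src = "e\u0301 = 1"   # 'e' + COMBINING ACUTE, then ' = 1'
precomposed_src = "\u00e9 = 1"   # precomposed 'é', then ' = 1'
print(nfc_sensitive(decomposed_src))   # True: re-saving alters the identifier
print(nfc_sensitive(precomposed_src))  # False: already in NFC, stable
```

A build system could run such a check as a lint step and reject sources whose identifiers are not normalization-stable, sidestepping the compile-breaks-mysteriously scenario discussed below.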
From richard.wordingham at ntlworld.com Thu Jun 5 12:40:09 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 5 Jun 2014 18:40:09 +0100 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: References: <20140604154802.665a7a7059d7ee80bb4d670165c8327d.9a8683c2d8.wbe@email03.secureserver.net> Message-ID: <20140605184009.75217096@JRWUBU2> On Thu, 5 Jun 2014 09:41:07 +0200 Philippe Verdy wrote: > You'll probably want to sync on the first newline control and then > proceed from that point. But now if you have those devices configured > heterogenously and generating their own output encoding you won't > necessarily know how it is encoded even uf all of them use some UTF of > Unicode. So the stream will regularly repost an encoding mark, for > exampel at the begining of each dated logged entry, and this could be > just an encoded BOM (even with UTF-8, or some other UTF like UTF-16 > which would be more likely if the language contained essentially an > East-Asian (CJK) language. Of course, this is not an arbitrary fragment. In this location, ZWNBSP will have almost no effect. (The only mechanisms I can think of are character counts and the text being pasted immediately after another word.) This, and the early belief that U+FFFE would not occur in Unicode text, are why it was chosen. Richard. From jlturriff at centurylink.net Thu Jun 5 13:14:09 2014 From: jlturriff at centurylink.net (J. Leslie Turriff) Date: Thu, 5 Jun 2014 13:14:09 -0500 Subject: Swift In-Reply-To: <53905032.6010000@gmail.com> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> Message-ID: <201406051314.09438.jlturriff@centurylink.net> On Thursday 05 June 2014 06:10:42 Fr?d?ric Grosshans wrote: > Le 05/06/2014 12:52, David Starner a ?crit : > > On Thu, Jun 5, 2014 at 3:04 AM, J. 
Leslie Turriff > > > > wrote: > >> What I find interesting is that (with the possible exception of > >> Ada) I don't think that any of the commonly used languages allow for the > >> use of Unicode characters for non- user-defined tokens (i.e. reserved > >> words, etc.). > > > > There is one non-ASCII character in the library, for Pi, and that > > caused some fuss, along with some eye-rolling, as writing the Unicode > > characters as ["03C0"] is permitted. Ada is a conservative language, > > and there's no real drive to make changes like these. (I was mistaken > > on the 20 years for Unicode identifiers; it was the Ada 2005 standard > > that permitted it, not Ada 95.) > > > > Scala is not really a commonly used language, but does use some > > Unicode arrows: ? for =>, ?for <- and ? for ->. Most people don't > > bother. > > > > ALGOL 60 and ALGOL 68 used non-ASCII characters like ?, ?, ?, ?, ?, ?, > > ?, ?, ?, ?, ? and ?, and had compiler defined spellings for keywords. > > And, of course, there is APL ( > https://en.wikipedia.org/wiki/APL_%28programming_language%29 ). Unicode > has 70 characters specially for its use (APL FUNCTIONAL SYMBOL ****), > U+2336 to U+237A since Unicode 1.1 and U+2395 since Unicode 3.0 All true; but do any languages allow for keywords (if, then, else, do, while, until, end, iterate, leave, call return, exit,...) to be expressed in the programmer's locale? -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From jlturriff at centurylink.net Thu Jun 5 13:22:09 2014 From: jlturriff at centurylink.net (J. 
Leslie Turriff) Date: Thu, 5 Jun 2014 13:22:09 -0500 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: <201406051322.09586.jlturriff@centurylink.net> On Thursday 05 June 2014 12:24:12 Jeff Senn wrote: > On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > > On 5 Jun 2014, at 17:46, Jeff Senn wrote: > >> That is: are identifiers merely sequences of characters or intended to > >> be comparable as ?Unicode strings? (under some sort of compatibility > >> rule)? > > > > In computer languages, identifiers are normally compared only for > > equality, as it reduces lookup time complexity. > > Well in this case we are talking about parsing a source file and generating > internal symbols, so the complexity of the comparison operation is a red > herring. > > The real question is how does the source identifier get mapped into a > (compiled) symbol. (e.g. in C++ this is not an obvious operation) > > If your implication is that there should be no canonicalization (the string > from the source is used as a sequence of characters only directly mapped to > a symbol), then I predict sticky problems in the future. The most obvious > of which is that in some cases I will be able to change the semantics of > the complied program by (accidentally) canonicalizing the source text (an > operation, I will point out, that is invisible to the user in many (most?) > Unicode aware editors). So if programmer A uses editor X to write code, and programmer B uses editor Y to modify the code, suddenly the compiler might start generating multiple symbols for some identifiers, causing compiles to fail for no obvious reason. It seems to me that "the complexity of the comparison operation is a red herring" is perhaps a naive view; this would produce a really high astonishment factor. 
Leslie -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From senn at maya.com Thu Jun 5 13:47:28 2014 From: senn at maya.com (Jeff Senn) Date: Thu, 5 Jun 2014 14:47:28 -0400 Subject: Swift In-Reply-To: <201406051322.09586.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> <201406051322.09586.jlturriff@centurylink.net> Message-ID: <4010FAE1-C9C2-4BC6-9D6C-D4592B7BD87E@maya.com> On Jun 5, 2014, at 2:22 PM, J. Leslie Turriff wrote: > On Thursday 05 June 2014 12:24:12 Jeff Senn wrote: >> On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: >>> On 5 Jun 2014, at 17:46, Jeff Senn wrote: >>>> That is: are identifiers merely sequences of characters or intended to >>>> be comparable as ?Unicode strings? (under some sort of compatibility >>>> rule)? >>> >>> In computer languages, identifiers are normally compared only for >>> equality, as it reduces lookup time complexity. >> >> Well in this case we are talking about parsing a source file and generating >> internal symbols, so the complexity of the comparison operation is a red >> herring. >> >> The real question is how does the source identifier get mapped into a >> (compiled) symbol. (e.g. in C++ this is not an obvious operation) >> >> If your implication is that there should be no canonicalization (the string >> from the source is used as a sequence of characters only directly mapped to >> a symbol), then I predict sticky problems in the future. The most obvious >> of which is that in some cases I will be able to change the semantics of >> the complied program by (accidentally) canonicalizing the source text (an >> operation, I will point out, that is invisible to the user in many (most?) >> Unicode aware editors). 
> So if programmer A uses editor X to write code, and programmer B uses editor > Y to modify the code, suddenly the compiler might start generating multiple > symbols for some identifiers, causing compiles to fail for no obvious reason. > It seems to me that "the complexity of the comparison operation is a red > herring" is perhaps a naive view; this would produce a really high > astonishment factor. > > Leslie I think we are agreeing (and miscommunicating): the comparison operator ON SYMBOLS is incredibly important. Of course symbols must be unique! Comparing sequences of characters in the SOURCE for equality is almost a non-issue. (Consider macros, case-insensitivity in some languages, context in languages such as C++, etc.) You illustrate the problem in your example. If I write (4 characters of source) code (because my editor uses decomposed characters): á=1 ('a' '◌́' '=' '1') And you look at it and think you are going to write code to access that value (and your editor uses composed characters - so you have 3 characters): á=2 ('á' '=' '2') Then we have astonishment. > > -- > "Disobedience is the true foundation of liberty. The obedient must be
--Henry David Thoreau > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From haberg-1 at telia.com Thu Jun 5 14:11:59 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Thu, 5 Jun 2014 21:11:59 +0200 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: <111DF5F2-31AF-41A9-81B5-AC7B412AFA95@telia.com> On 5 Jun 2014, at 19:24, Jeff Senn wrote: > On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > >> On 5 Jun 2014, at 17:46, Jeff Senn wrote: >> >>> That is: are identifiers merely sequences of characters or intended to be comparable as ?Unicode strings? (under some sort of compatibility rule)? >> >> In computer languages, identifiers are normally compared only for equality, as it reduces lookup time complexity. > > Well in this case we are talking about parsing a source file and generating internal symbols, so the complexity of the comparison operation is a red herring. > > The real question is how does the source identifier get mapped into a (compiled) symbol. (e.g. in C++ this is not an obvious operation) > > If your implication is that there should be no canonicalization (the string from the source is used as a sequence of characters only directly mapped to a symbol), then I predict sticky problems in the future. The most obvious of which is that in some cases I will be able to change the semantics of the complied program by (accidentally) canonicalizing the source text (an operation, I will point out, that is invisible to the user in many (most?) Unicode aware editors). It is not difficult to mangle any byte sequence into c/C++ identifiers, but Swift compiles directly into LLVM, so perhaps it is not needed. 
Xcode is very aggressive at combining characters, so it is hard to write non-normalized characters from it. The manual says that after the first character, combining characters are allowed, but does not seem to mention normalization. But it seems the compiler only needs to compare byte sequences for equality, which is what is traditional. From daniel.buenzli at erratique.ch Thu Jun 5 14:28:19 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Thu, 5 Jun 2014 20:28:19 +0100 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: Le jeudi, 5 juin 2014 ? 18:24, Jeff Senn a ?crit : > If your implication is that there should be no canonicalization (the string from the source is used as a sequence of characters only directly mapped to a symbol), then I predict sticky problems in the future. Note that this is actually the case in the XML specification, processors are not required to perform normalisation for matching tag names (see ?match' in this section [1] and this comment [2] of the annotated XML specification), I suspect this is rarely a problem in practice since XML vocabularies tend to stick to ASCII identifiers (and so should programmers in general IMHO). Daniel [1] http://www.w3.org/TR/REC-xml/#sec-terminology [2] http://www.xml.com/axml/notes/StringMatch.html From doug at ewellic.org Thu Jun 5 14:46:23 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 05 Jun 2014 12:46:23 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> Philippe Verdy wrote: > Not necessarily true. 
> > [602 words] This has nothing to do with the scenario I described, which involved removing a "BOM" from the start of an arbitrary fragment of data, thereby corrupting the data because the "BOM" was actually a ZWNBSP. If you have an arbitrary fragment of data, don't fiddle with it. If you know enough about the data to fiddle with it safely, it's not arbitrary. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From prosfilaes at gmail.com Thu Jun 5 18:03:10 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 5 Jun 2014 16:03:10 -0700 Subject: Swift In-Reply-To: <201406051314.09438.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: On Thu, Jun 5, 2014 at 11:14 AM, J. Leslie Turriff wrote: > All true; but do any languages allow for keywords (if, then, else, do, while, > until, end, iterate, leave, call return, exit,...) to be expressed in the > programmer's locale? Both ALGOL 60 and ALGOL 68 had compiler dependent source representations, so the Europeans could use their own words for keywords and use commas as decimal points. I'm pretty sure no one had invented the concept of a user's locale yet, but it would probably come configured for whatever local locale you wanted. (I assume for a machine that cost $14 million in 1966, such adjustments could be made for a single customer.) -- Kie ekzistas vivo, ekzistas espero. From verdy_p at wanadoo.fr Thu Jun 5 18:23:34 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 6 Jun 2014 01:23:34 +0200 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> References: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> Message-ID: 2014-06-05 21:46 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > Not necessarily true. 
> > > > [602 words] > > This has nothing to do with the scenario I described, which involved > removing a "BOM" from the start of an arbitrary fragment of data, > thereby corrupting the data because the "BOM" was actually a ZWNBSP. > > If you have an arbitrary fragment of data, don't fiddle with it. > This is your scenario. The simple concept of a unique "start" of text does not exist in live streams that can start anywhere. So you cannot always expect that U+FEFF or U+FFFE will only exist once in a stream, and necessarily at the position where you can start reading it, because you may already be past the initial creation of the stream without having any way to come back to the "start". Your assumption just assumes that you can always "rewind" your file; this is not always possible, and each user of that stream has its own start, different from the other one. And this is not because they are "fiddling" with it. Many applications internally use such one-way streams that have no random-access capability, so that they cannot be rewound to the "start". And the producer does not keep a complete log of everything that was emitted. Clients are just connecting to the stream at a position the producer has already reached, which is already past the start seen by the producer. In some cases there are even multiple producers contributing independently to the stream (debug log streams are typical examples, but this could also be a live text stream of subtitles in a live TV or radio channel, with a single producer for many consumers connecting to the never-ending stream at any time, without any possibility of rewinding back in time, possibly months or years, to get the full stream just in order to process thousands of gigabytes of audio or video in which the live text stream has been multiplexed). Now you will argue: this live stream is not plain text, it has a binary structure. Yes, but only if your consumer application wants to process the full multiplex. Typically clients will demultiplex the stream and pass it down to a simpler client that absolutely does not care about the transport multiplex format. If that downward client is just used to display the incoming text, it will just wait for text that is buffered line by line and displayed immediately where there's a newline separator. But even in this case, each line may have been fragmented so that each fragment will contain a leading BOM which will not necessarily be stripped (notably not if the transport is made with datagrams over a non-"reliable" protocol like UDP). You have also incorrectly assumed that a text stream is necessarily transported over a "reliable" protocol like TCP where there can be no data loss in the middle, i.e. you are still bound to classic storage on a file system (even if this file system is named "HTTP"; even in HTTP there also exist live streams without any defined start). Texts are inherently fragmentable. Initially they are transcripts of human communication, and nobody in real life is permanently connected to someone else and able to remember everything that was said by someone else. Fragmented texts are natural and have always existed, even before they were written on a material support. On a numeric network, text is dematerialized again and is materialized only by consumers; you don't transmit the bounded support. The concept of a "start" of text is in fact very artificial; this is not the way people interact with each other or in groups. -------------- next part -------------- An HTML attachment was scrubbed... URL: From billposer2 at gmail.com Thu Jun 5 18:34:11 2014 From: billposer2 at gmail.com (Bill Poser) Date: Thu, 5 Jun 2014 16:34:11 -0700 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: A few years ago there was a company in Australia that was developing a multilingual language called Protium Blue.
The lead was someone named Diarmuid Pigott. As far as I can tell, the project has come to an end, but one can still find bits about the project, e.g. this: http://www.qualitytesting.info/forum/topics/what-is-protium-project On Thu, Jun 5, 2014 at 4:03 PM, David Starner wrote: > On Thu, Jun 5, 2014 at 11:14 AM, J. Leslie Turriff > wrote: > > All true; but do any languages allow for keywords (if, then, > else, do, while, > > until, end, iterate, leave, call return, exit,...) to be expressed in the > > programmer's locale? > > Both ALGOL 60 and ALGOL 68 had compiler-dependent source > representations, so the Europeans could use their own words for > keywords and use commas as decimal points. I'm pretty sure no one had > invented the concept of a user's locale yet, but it would probably > come configured for whatever local locale you wanted. (I assume for a > machine that cost $14 million in 1966, such adjustments could be made > for a single customer.) > > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 5 18:55:57 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 6 Jun 2014 01:55:57 +0200 Subject: Swift In-Reply-To: <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: IMHO, a programming language that accepts non-ASCII identifiers should always normalize the identifiers it accepts, before entering them in its hashed symbol table. 
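A minimal sketch of that approach in Python (the class and names here are illustrative, not taken from any actual compiler): normalize each identifier to NFC before using it as a symbol-table key, so canonically equivalent spellings resolve to the same symbol.

```python
import unicodedata

class SymbolTable:
    """Toy symbol table that folds canonically equivalent identifier
    spellings together by normalizing to NFC before hashing."""

    def __init__(self):
        self._symbols = {}

    def define(self, identifier, value):
        self._symbols[unicodedata.normalize("NFC", identifier)] = value

    def lookup(self, identifier):
        return self._symbols[unicodedata.normalize("NFC", identifier)]

table = SymbolTable()
table.define("caf\u00e9", 1)       # "café" spelled with precomposed U+00E9
print(table.lookup("cafe\u0301"))  # same identifier spelled e + U+0301: prints 1
```

Without the normalize() calls, the second spelling would raise a KeyError, which is exactly the hazard being discussed.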
And for this type of usage, we strongly need normalization to be stable, but much more so than under the existing stability rules: normalization stability is not guaranteed if the language can accept unassigned code points that may be allocated later and will then normalize differently (the normalization of unassigned code points just assumes a default combining class 0, where reordering and recombining cannot occur, but once code points pass from unassigned to assigned, this may no longer be true). For this reason, a reasonable programming language should restrict itself to only the characters of a defined Unicode version and should not accept characters unassigned in that version. Alternatively, compiled programs should track the Unicode version to make sure that later reusers of compiled programs will link properly to the older compiled programs, by making sure that newer identifiers used in newer programs can never match an identifier defined by an older compiled program assuming a different normalization. Programming languages should follow the practices used in IDNA for security reasons. Then, extending the allowed subset should be done with care: this extension will be compatible *only* if the newly assigned characters added to the extended subset have combining class 0 and are not excluded from recomposition (the composition exclusions). Otherwise, the other characters added in the extension will not be compatible with older versions of the language (if the language cannot check the Unicode version, or does not want to be incompatible with past versions, it will not be able to extend its allowed subset for identifiers safely, and notably not with any combining characters with non-zero combining class). 2014-06-05 19:24 GMT+02:00 Jeff Senn : > > On Jun 5, 2014, at 12:41 PM, Hans Aberg wrote: > > > On 5 Jun 2014, at 17:46, Jeff Senn wrote: > > > >> That is: are identifiers merely sequences of characters or intended to > be comparable as "Unicode strings" 
(under some sort of compatibility rule)? > > > > In computer languages, identifiers are normally compared only for > equality, as it reduces lookup time complexity. > > Well, in this case we are talking about parsing a source file and > generating internal symbols, so the complexity of the comparison operation > is a red herring. > > The real question is how does the source identifier get mapped into a > (compiled) symbol. (e.g. in C++ this is not an obvious operation) > > If your implication is that there should be no canonicalization (the > string from the source is used as a sequence of characters only, directly > mapped to a symbol), then I predict sticky problems in the future. The > most obvious of which is that in some cases I will be able to change the > semantics of the compiled program by (accidentally) canonicalizing the > source text (an operation, I will point out, that is invisible to the user > in many (most?) Unicode-aware editors). > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ken.whistler at sap.com Thu Jun 5 19:00:47 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Fri, 6 Jun 2014 00:00:47 +0000 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: Hmmm. Any programming language project that derives from someone who describes himself as a "polyhistor", which claims to be polymorphic and pasigraphic and multi-lingual and orthogonal and polysynthetic, which draws its inspiration from the theory of "Natural Language Metasemantics", and which name-drops "the great heritage started by Wilkins and Leibniz", might seem to be doomed from the start. ;-) --Ken P.S. 
For Leibniz and pasigraphy and its application to formal calculation, see: http://en.wikipedia.org/wiki/Characteristica_universalis then quickly run the other way! A few years ago there was a company in Australia that was developing a multilingual language called Protium Blue. The lead was someone named Diarmuid Pigott. As far as I can tell, the project has come to an end, but one can still find bits about the project, e.g. this: http://www.qualitytesting.info/forum/topics/what-is-protium-project -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 5 19:09:30 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 6 Jun 2014 02:09:30 +0200 Subject: Swift In-Reply-To: <53908C6E.4@v.loewis.de> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: Warning! This definition of allowed identifiers has severe security risks: it does not support any kind of normalization or canonical equivalence, and it's impossible to use normalization in the language lexer/parser while making sure that it will be stable over the set of unassigned characters that may be assigned later. This could cause unexpected bindings, initially impossible to enter, to collide later with new normalizations (notably if unassigned code points get assigned to combining characters with non-zero combining class, or to base characters with combining class 0 but forbidden from recombining, i.e. disallowed in standard normalization forms). 
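Detecting such unassigned code points is straightforward; a minimal sketch in Python (illustrative only; note that the answer depends on the Unicode version implemented by the interpreter's unicodedata module, reported by unicodedata.unidata_version, and that general category Cn also covers noncharacters):

```python
import unicodedata

def has_unassigned(text):
    """True if any code point in text has general category Cn (unassigned,
    or a noncharacter) in the Unicode version this Python build's
    unicodedata module implements."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)

print(has_unassigned("abc"))       # False
print(has_unassigned("a\u0378b"))  # True: U+0378 is unassigned
```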
No programming language should allow using unassigned characters; they should be checked and marked as invalid (note: this check can work in a compiled version of the language, but will not work in a repository of source code, where the only possible check is parsing all source files in the repository to make sure that there's no unassigned code point anywhere in their source text; the source repository should enforce this by clearly defining the UCS version it accepts for source files, but as far as I know, no usual source repositories perform this check, which can only be done by extracting all sources from the repository using some bot script that will detect unassigned code points in these sources). The alternative of not allowing any normalization of identifiers is not safe when source code editors may easily renormalize the identifiers, or when these sources may be edited by different users using different input methods. 2014-06-05 17:27 GMT+02:00 "Martin v. Löwis" : > On 04.06.14 11:28, Andre Schappo wrote: > > The restrictions seem a little like IDNA2008. Anyone have links to > > info giving a detailed explanation/tabulation of allowed and non > > allowed Unicode chars for Swift Variable and Constant names? > > The language reference is at > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > For reference, the definition of identifier-character is (read each > line as an alternative) > > identifier-character → Digit 0 through 9 > identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or > U+FE20–U+FE2F > identifier-character → identifier-head > > where identifier-head is > > identifier-head → Upper- or lowercase letter A through Z > identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or > U+00B7–U+00BA > identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or > U+00F8–U+00FF > identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or > U+180F–U+1DBF > identifier-head → U+1E00–U+1FFF > identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, > or U+2060–U+206F > identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or > U+2776–U+2793 > identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF > identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or > U+3040–U+D7FF > identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or > U+FE30–U+FE44 > identifier-head → U+FE47–U+FFFD > identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or > U+40000–U+4FFFD > identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or > U+80000–U+8FFFD > identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or > U+C0000–U+CFFFD > identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD > > As the construction principle for this list, they say > > "Identifiers begin with an upper case or lower case letter A through Z, > an underscore (_), a noncombining alphanumeric Unicode character in the > Basic Multilingual Plane, or a character outside the Basic Multilingual > Plane that isn't in a Private Use Area. After the first character, digits > and combining Unicode characters are also allowed." > > Regards, > Martin > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prosfilaes at gmail.com Thu Jun 5 22:30:50 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 5 Jun 2014 20:30:50 -0700 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: On Thu, Jun 5, 2014 at 5:00 PM, Whistler, Ken wrote: > Any programming language project that derives from someone who describes > himself as a "polyhistor", which claims to be polymorphic and pasigraphic > and > multi-lingual and orthogonal and polysynthetic, which draws its inspiration > from the > theory of "Natural Language Metasemantics", and which name drops "the > great heritage started by Wilkins and Leibniz", might seem to be doomed from > the start. ;-) It reminded me of the replacement for Unicode that was 40,000 times more efficient. It's sad how the major manufacturers keep suppressing all these brilliant new ideas that would revolutionize the world. -- Kie ekzistas vivo, ekzistas espero. From richard.wordingham at ntlworld.com Fri Jun 6 02:16:36 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 6 Jun 2014 08:16:36 +0100 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <36473366-49D7-4F3F-94E4-89565632A9EE@telia.com> <5DC8ECF5-C8C6-4D6E-B4BE-7202F50C57DD@maya.com> Message-ID: <20140606081636.0c26f793@JRWUBU2> On Fri, 6 Jun 2014 01:55:57 +0200 Philippe Verdy wrote: > IMHO, a programming language that accepts non-ASCII identifiers should > always nrmalize the identifiers it accepts, before heeding it in its > hashed symbol table. Unfortunately, C and C++ don't normalise. Consequently, all a compiler can do is to warn about or reject identifiers not in the preferred normalisation. Richard. 
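Richard's point is easy to demonstrate; a small Python sketch (illustrative, not from the thread): two canonically equivalent spellings of the same identifier compare unequal code point by code point, which is how a non-normalizing compiler sees them.

```python
import unicodedata

composed   = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# Compared code point by code point, as C and C++ compilers effectively
# compare identifiers, these are two different identifiers:
print(composed == decomposed)  # False

# ...even though the two spellings are canonically equivalent:
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True
```

Since the compiler cannot merge the two without breaking its own rules, warning about (or rejecting) identifiers that are not already in the preferred normalization form is the only remaining defense.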
From sdaoden at yandex.com Fri Jun 6 06:14:47 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Fri, 06 Jun 2014 13:14:47 +0200 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) In-Reply-To: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> References: <20140605124623.665a7a7059d7ee80bb4d670165c8327d.cee69d544d.wbe@email03.secureserver.net> Message-ID: <20140606121447.hCM8g6WO%sdaoden@yandex.com> "Doug Ewell" wrote: |Philippe Verdy wrote: |> Not necessarily true. |> |> [602 words] | |This has nothing to do with the scenario I described, which involved |removing a "BOM" from the start of an arbitrary fragment of data, |thereby corrupting the data because the "BOM" was actually a ZWNBSP. | |If you have an arbitrary fragment of data, don't fiddle with it. | |If you know enough about the data to fiddle with it safely, it's not |arbitrary. Yeah! E.g., on the all-UTF-8 Plan9 research operating system: ?0[9front.update_bomb_git]$ git ls-files --with-tree=master --|wc -l 44983 ?0[9front.update_bomb_git]$ git grep -lI "`print '\ufeff'`" master|wc -l 12 ?0[9front.update_bomb_git]$ git grep -lI "`print '\ufeff'`" master master:9front.hg/lib/font/bit/MAP master:9front.hg/lib/glass master:9front.hg/sys/lib/troff/font/devutf/0100to25ff master:9front.hg/sys/lib/troff/font/devutf/C master:9front.hg/sys/lib/troff/font/devutf/CW master:9front.hg/sys/lib/troff/font/devutf/H master:9front.hg/sys/lib/troff/font/devutf/LucidaSans master:9front.hg/sys/lib/troff/font/devutf/PA master:9front.hg/sys/lib/troff/font/devutf/R master:9front.hg/sys/lib/troff/font/devutf/R.nomath master:9front.hg/sys/src/ape/lib/utf/runetype.c master:9front.hg/sys/src/libc/port/runetype.c --steffen From doug at ewellic.org Fri Jun 6 11:15:23 2014 From: doug at ewellic.org (Doug Ewell) Date: Fri, 06 Jun 2014 09:15:23 -0700 Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE) Message-ID: 
<20140606091523.665a7a7059d7ee80bb4d670165c8327d.bc50fd4323.wbe@email03.secureserver.net> Philippe Verdy wrote: >> If you have an arbitrary fragment of data, don't fiddle with it. > > Thisis your scenario. The simple concept of a unique "start" of text > does not exist in live streams that can start anywhere. So you cannot > always expect that U+FEFF or U+FFFE will only exist once in a strram > and necessaryly at the start of position where you can start reading > it because you may already be past the initial creation of the stream > without having any wya to come back to the "start". An "arbitrary fragment of data" -- I'm going to keep using the exact same phrase until it sinks in -- DOES have a start and an end. THAT is my scenario. > Your assumption just assumes that you can always "rewind" your file, My assumption assumes no such thing. > Now you will argue: this live stream is not plain text, it has a > binary structure. Well, yes. > Yes but only if your consumer application wants to process the full > multiplex. Typically clients will demultiplex the stream and pass it > down to a simpler client that absolutely does not care about the > transport multiplex format. If that downward client is just used to > display the incoming text, it will just wait for text that will be > buffered ine by line and displayed immediately where there's a newline > separator. But even in this case, each line may have been fragmented > so that each fragment will contain a leading BOM which will nto be > necessarily stripped Question: Why did the process that broke the stream into fragments add leading BOMs? > (you have also incorrectly asuumed that a text stream is necessaily > transported over a "reliable" protocol like TCP where there can be no > data loss in the middle Really. I think you have incorrectly asuumed my asuumption. > Texts are inhernetly fragmentable. 
Initially they are transcripts of > human communication and nobody in real life is permanently connected > to someone else and able to remember eveything that was said by > someone else. OK, I think we are far enough removed from Unicode to end this. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From lyratelle at gmx.de Sat Jun 7 12:07:29 2014 From: lyratelle at gmx.de (Dominikus Dittes Scherkl) Date: Sat, 07 Jun 2014 19:07:29 +0200 Subject: *** GMX Spamverdacht *** Re: Swift In-Reply-To: <201406051314.09438.jlturriff@centurylink.net> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> Message-ID: <539346D1.6060006@gmx.de> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 05.06.2014 20:14, J. Leslie Turriff wrote: > All true; but do any languages allow for keywords (if, then, else, do, while, > until, end, iterate, leave, call return, exit,...) to be expressed in the > programmer's locale? Oh, in C++ with macros you can do about anything you want. I have a "deutsch.h" that allows you to use German keywords, even including umlauts like in "wähle" instead of "select". It simply replaces them with the English version before the code is compiled. 
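The effect of such a header can be sketched as a textual pre-pass (Python used here purely for illustration; only the "wähle"/"select" pair comes from the mail, the other table entries are invented, and the real deutsch.h uses C++ preprocessor macros rather than a separate rewriting step):

```python
import re

# Illustrative keyword table: only "wähle" -> "select" is mentioned in
# the mail; the remaining entries are made up for this example.
GERMAN_KEYWORDS = {
    "wähle": "select",
    "wenn": "if",
    "sonst": "else",
    "solange": "while",
}

def translate(source):
    """Replace whole-word German keywords with their English versions,
    leaving all other tokens and punctuation untouched."""
    return re.sub(r"\w+",
                  lambda m: GERMAN_KEYWORDS.get(m.group(), m.group()),
                  source)

print(translate("wenn (x < 0) sonst solange (y) wähle(z);"))
# prints: if (x < 0) else while (y) select(z);
```

A macro-based approach has the advantage that the replacement happens inside the compiler's own preprocessor, so line numbers in error messages still match the German source.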
- -- Best regards, Dominikus Dittes Scherkl -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJTk0bRAAoJELBWOtEemFJV09EIALp7W8m2aDwmAtI4xCtQ9tNv JR1bcyKNjvKkObYe/dQwVwM9VNTzLKRcxqx+aMw0tqu0GjxituSv144s4lMBmgIr ocFWFRVyD5qT3gotDEEaB+KS57Yijg1EY2NpDJoH8mAyFVi02Miv64gbDGBdVZWb hMVCDwOBgOo7CvA7hrhNv9kEI/V1hC0d30/mjbSgAHVaMZa/CiCgbL5X4546jfw2 WBrAOh2xTQexWg24ENWQREn987WKKmErinoo/v0oPtTB4uDQqhkQ+0n5KTzgm6V4 EBwSSR++AmEVp/PdhllonqirkXLU0mI/W5gS6ZSdHRdeFiYJHeNwsg9WtlZUK84= =6+fX -----END PGP SIGNATURE----- From verdy_p at wanadoo.fr Sat Jun 7 13:07:53 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 7 Jun 2014 20:07:53 +0200 Subject: *** GMX Spamverdacht *** Re: Swift In-Reply-To: <539346D1.6060006@gmx.de> References: <20140603195253.3c0df53f@JRWUBU2> <53905032.6010000@gmail.com> <201406051314.09438.jlturriff@centurylink.net> <539346D1.6060006@gmx.de> Message-ID: Not really for localizing the punctuation operators... Plus there are directionality issues with RTL scripts within source text editors... Imagine this statement: "if (x < 0)" and then rename variable "x" and/or keyword "if" to Arabic; do you test for negative or positive values?... Do you want to include Bidi controls within all RTL variable names or keywords? Or make them ignorable in parsed source code? Or add some leading #pragma to set the directionality of source code? (This cannot be done with macros; in fact you need compiler options to set the direction from the first character of the source file, or use some custom "magic" value to guess it with a pre-parsing of the first few lines, to infer the correct interpretation, and probably the encoding too if it's not UTF-8, to process a "#pragma charset".) 2014-06-07 19:07 GMT+02:00 Dominikus Dittes Scherkl : > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Am 05.06.2014 20:14, schrieb J. 
Leslie Turriff: > > > All true; but do any languages allow for keywords (if, then, else, > do, while, > > until, end, iterate, leave, call return, exit,...) to be expressed in the > > programmer's locale? > > Oh, in C++ with macros you can do about anything you want. > I have a "deutsch.h" that allows you to use german keywords, even > including umlauts like in "w?hle" instead of "select". > It simply replaces them by the english version before the code is compiled. > > - -- > > Best regards, > Dominikus Dittes Scherkl > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.10 (MingW32) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQEcBAEBAgAGBQJTk0bRAAoJELBWOtEemFJV09EIALp7W8m2aDwmAtI4xCtQ9tNv > JR1bcyKNjvKkObYe/dQwVwM9VNTzLKRcxqx+aMw0tqu0GjxituSv144s4lMBmgIr > ocFWFRVyD5qT3gotDEEaB+KS57Yijg1EY2NpDJoH8mAyFVi02Miv64gbDGBdVZWb > hMVCDwOBgOo7CvA7hrhNv9kEI/V1hC0d30/mjbSgAHVaMZa/CiCgbL5X4546jfw2 > WBrAOh2xTQexWg24ENWQREn987WKKmErinoo/v0oPtTB4uDQqhkQ+0n5KTzgm6V4 > EBwSSR++AmEVp/PdhllonqirkXLU0mI/W5gS6ZSdHRdeFiYJHeNwsg9WtlZUK84= > =6+fX > -----END PGP SIGNATURE----- > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlturriff at centurylink.net Sat Jun 7 21:20:26 2014 From: jlturriff at centurylink.net (J. Leslie Turriff) Date: Sat, 7 Jun 2014 21:20:26 -0500 Subject: *** GMX Spamverdacht *** Re: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <539346D1.6060006@gmx.de> Message-ID: <201406072120.26074.jlturriff@centurylink.net> On Saturday 07 June 2014 13:07:53 Philippe Verdy wrote: > Note really for localizing the punctuation operators... > > Plus there are directionality issues with RTL scripts withing source text > editors... 
> Imagine this statement: "if (x < 0)" and then rename variable "x" and/or > keyword "if" to Arabic; do you test for negative or positive values?... > > Do you want to include Bidi controls within all RTL variable names or > keywords? or make them ignorable in parsed source code? > > Of add some leading #pragma to set the directionality of source code ? > (this cannot be made with macros, in fact you need compiler options to set > direction from the first character of source file, or use some custom > "magic" value to guess it with a pre-parsing of the first few lines, to > infer the correct interpretation, and probably the encoding too if it's not > UTF-8, to process a "#pgrama charset") Ah. This is getting to be pretty tricky. :-) -- "Disobedience is the true foundation of liberty. The obedient must be slaves." --Henry David Thoreau From public at khwilliamson.com Sat Jun 7 23:19:51 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sat, 07 Jun 2014 22:19:51 -0600 Subject: Corrigendum #9 In-Reply-To: <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <5393E467.20707@khwilliamson.com> On 06/02/2014 11:00 AM, Shawn Steele wrote: > To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like the old escape characters to get a dot matrix printer to shift modes, or old word processor internal formatting sequences. > Here's an example of a possible use. 20 some years ago I wrote a front-end to the Unix diff utility. Showing the differences between files (usually 2 versions of the same program's code) is an extremely common programming activity. I do it many times a day. 
One reason is to try to find out why a bug has crept in. In doing so, there are some differences that are not relevant to the task at hand, and their being shown is a significant distraction. For example, in programming, one might have renamed a variable (identifier) because its purpose has changed somewhat and the name should accurately reflect its new function so the reader is not subconsciously misled. It would be nice to be able to suppress the variable name changes from the difference display. There could be thousands of them. By changing the name in each file version to the same noncharacter during the diff, these differences won't be displayed, and there would not be any possible conflict with the input files having that noncharacter in them. (For display the noncharacter is changed back to the original value in its respective file) Further, one might want to ignore the name changes of two variables. Just use a second noncharacter, up to 66. I wrote this long before noncharacters were available. What I do instead is scan the files for rarely used characters until I find enough ones that aren't in the files. For example U+9F is unlikely to appear. Scanning the files takes time. This step could be omitted for noncharacters that are known to be illegal in the input. From asmusf at ix.netcom.com Sat Jun 7 23:33:57 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 07 Jun 2014 21:33:57 -0700 Subject: Corrigendum #9 In-Reply-To: <5393E467.20707@khwilliamson.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> <5393E467.20707@khwilliamson.com> Message-ID: <5393E7B5.2050508@ix.netcom.com> On 6/7/2014 9:19 PM, Karl Williamson wrote: > On 06/02/2014 11:00 AM, Shawn Steele wrote: >> To further my understanding, can someone provide examples of how >> these are used in actual practice? 
I can't think of any offhand and >> the closest I get is like the old escape characters to get a dot >> matrix printer to shift modes, or old word processor internal >> formatting sequences. >> > > Here's an example of a possible use. 20 some years ago I wrote a > front-end to the Unix diff utility. Showing the differences between > files (usually 2 versions of the same program's code) is an extremely > common programming activity. I do it many times a day. One reason is > to try to find out why a bug has crept in. In doing so, there are > some differences that are not relevant to the task at hand, and their > being shown is a significant distraction. For example, in programming, > one might have renamed a variable (identifier) because its purpose has > changed somewhat and the name should accurately reflect its new > function so the reader is not subconsciously misled. It would be nice > to be able to suppress the variable name changes from the difference > display. There could be thousands of them. By changing the name in > each file version to the same noncharacter during the diff, these > differences won't be displayed, and there would not be any possible > conflict with the input files having that noncharacter in them. (For > display the noncharacter is changed back to the original value in its > respective file) Further, one might want to ignore the name changes > of two variables. Just use a second noncharacter, up to 66. > > I wrote this long before noncharacters were available. What I do > instead is scan the files for rarely used characters until I find > enough ones that aren't in the files. For example U+9F is unlikely to > appear. Scanning the files takes time. This step could be omitted > for noncharacters that are known to be illegal in the input. > > This "illegal in the input" so "I'm free to assume I can use them for my purposes" was definitely the primary(!) design goal discussed when the set of 32 were added to Unicode. 
Having UTC backpedal from that, many years after original design, based on a single meeting and without public review is really a breakdown of the process. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Sun Jun 8 10:47:16 2014 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 08 Jun 2014 09:47:16 -0600 Subject: Corrigendum #9 In-Reply-To: <5393E7B5.2050508@ix.netcom.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> <5393E467.20707@khwilliamson.com> <5393E7B5.2050508@ix.netcom.com> Message-ID: <53948584.4080102@khwilliamson.com> On 06/07/2014 10:33 PM, Asmus Freytag wrote: > On 6/7/2014 9:19 PM, Karl Williamson wrote: >> On 06/02/2014 11:00 AM, Shawn Steele wrote: >>> To further my understanding, can someone provide examples of how >>> these are used in actual practice? I can't think of any offhand and >>> the closest I get is like the old escape characters to get a dot >>> matrix printer to shift modes, or old word processor internal >>> formatting sequences. >>> >> >> Here's an example of a possible use. 20 some years ago I wrote a >> front-end to the Unix diff utility. Showing the differences between >> files (usually 2 versions of the same program's code) is an extremely >> common programming activity. I do it many times a day. One reason is >> to try to find out why a bug has crept in. In doing so, there are >> some differences that are not relevant to the task at hand, and their >> being shown is a significant distraction. For example, in programming, >> one might have renamed a variable (identifier) because its purpose has >> changed somewhat and the name should accurately reflect its new >> function so the reader is not subconsciously misled. 
It would be nice >> to be able to suppress the variable name changes from the difference >> display. There could be thousands of them. By changing the name in >> each file version to the same noncharacter during the diff, these >> differences won't be displayed, and there would not be any possible >> conflict with the input files having that noncharacter in them. (For >> display the noncharacter is changed back to the original value in its >> respective file) Further, one might want to ignore the name changes >> of two variables. Just use a second noncharacter, up to 66. >> >> I wrote this long before noncharacters were available. What I do >> instead is scan the files for rarely used characters until I find >> enough ones that aren't in the files. For example U+9F is unlikely to >> appear. Scanning the files takes time. This step could be omitted >> for noncharacters that are known to be illegal in the input. >> >> > This "illegal in the input" so "I'm free to assume I can use them for my > purposes" was definitely the primary(!) design goal discussed when the > set of 32 were added to Unicode. Having UTC backpedal from that, many > years after original design, based on a single meeting and without > public review is really a breakdown of the process. > > A./ I should note that this front-end to 'diff' changes the input files, writes the modified versions out, and calls 'diff' with those modified files as its inputs. By using noncharacters, it would be depending on 'diff' to 1) not use them, and 2) to not filter them out, and 3) for the system to be able to store and retrieve them in files. I think a revision to the text was advisable to clarify that 2) and 3) were acceptable. I haven't heard anybody on this thread disagree with that. But item 1) shows how tricky this issue really is. My utility looks like a fancier 'diff' to those people who call it, so they would be justified in wanting it not to use noncharacters because they have their own purposes for them. 
If some of those callers were themselves utilities, their callers might want to use noncharacters for their own purposes. And so on and so on. I don't have a good answer, except to say that Asmus' characterization above looks reasonable. The purpose of public reviews is to try to get a broad range of ideas, and if none are forthcoming, then the fact that there was such a review should be an adequate defense of the ultimate decision. Not holding a review is an invitation to lingering suspicions on the part of the public about the motives behind any such decision. These can fester and the trust level is permanently diminished. There will always be people who won't like the decision, and who will assume that the deciders are malevolent. But the vast majority will accept a decision that seems to have been made in good faith after public input. This is just how things work, no matter what the venue or issue. It may be that the UTC thought this was minor enough to not require a review, but if so, time has shown that to have been an incorrect perception. From Shawn.Steele at microsoft.com Sun Jun 8 11:35:28 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 8 Jun 2014 16:35:28 +0000 Subject: Corrigendum #9 In-Reply-To: <53948584.4080102@khwilliamson.com> References: <20140602093650.665a7a7059d7ee80bb4d670165c8327d.558f77871a.wbe@email03.secureserver.net> <538CAB19.7020208@ix.netcom.com> <00c3fcd9d08f4e4eaf5cda05cec0a63f@BY2PR03MB491.namprd03.prod.outlook.com> <5393E467.20707@khwilliamson.com> <5393E7B5.2050508@ix.netcom.com> <53948584.4080102@khwilliamson.com> Message-ID: > I should note that this front-end to 'diff' changes the input files, writes the modified versions out, and calls 'diff' with those modified files as its inputs. By using noncharacters, it would be depending on 'diff' to 1) not use them, and 2) to not filter them out, and 3) for the system to be able to store and retrieve them in files. 
In my view that is still "internal" to your app's use of these characters :) The original text doesn't say that my application cannot store & retrieve them from files for internal use. On the contrary, I'd expect proprietary formats for internal use to require that. I agree that the original text is a bit vague on the question of tools to inspect/modify/whatever your internal use. -Shawn From unicode at norbertlindenberg.com Sun Jun 8 21:46:36 2014 From: unicode at norbertlindenberg.com (Norbert Lindenberg) Date: Sun, 8 Jun 2014 19:46:36 -0700 Subject: Swift In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> Message-ID: <01CBD00E-0907-470D-93FC-846E3785555E@norbertlindenberg.com> It does allow some usage that may surprise code reviewers – for example, this is a valid Swift program: let s = "??" let s? = "??" let ? = "??" let all = s + s? + ? The value of the constant 'all' is "??????". Or at least it is as long as mail software doesn't harm the variation selectors… Norbert On Jun 5, 2014, at 9:06 , Mark Davis ?? wrote: > I haven't done any analysis, but on first glance it looks like it is based on > > http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax > > > Mark > > – Il meglio è l'inimico del bene – > > > On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn wrote: > Has anyone figured out whether character sequences that are non-canonical (de)compositions but could be recomposed to the same result > are the same identifier or not? > > That is: are identifiers merely sequences of characters or intended to be comparable as 'Unicode strings' (under some sort of compatibility rule)? > > On Jun 5, 2014, at 11:27 AM, Martin v. Löwis wrote: > > > On 04.06.14 11:28, Andre Schappo wrote: > >> The restrictions seem a little like IDNA2008.
Anyone have links to > >> info giving a detailed explanation/tabulation of allowed and non > >> allowed Unicode chars for Swift Variable and Constant names? > > > > The language reference is at > > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > > > For reference, the definition of identifier-character is (read each > > line as an alternative) > > > > identifier-character → Digit 0 through 9 > > identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or > > U+FE20–U+FE2F > > identifier-character → identifier-head > > > > where identifier-head is > > > > identifier-head → Upper- or lowercase letter A through Z > > identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or > > U+00B7–U+00BA > > identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or > > U+00F8–U+00FF > > identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or > > U+180F–U+1DBF > > identifier-head → U+1E00–U+1FFF > > identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, > > or U+2060–U+206F > > identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or > > U+2776–U+2793 > > identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF > > identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or > > U+3040–U+D7FF > > identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or > > U+FE30–U+FE44 > > identifier-head → U+FE47–U+FFFD > > identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or > > U+40000–U+4FFFD > > identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or > > U+80000–U+8FFFD > > identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or > > U+C0000–U+CFFFD > > identifier-head →
U+D0000–U+DFFFD or U+E0000–U+EFFFD > > > > As the construction principle for this list, they say > > > > "Identifiers begin with an upper case or lower case letter A through Z, > > an underscore (_), a noncombining alphanumeric Unicode character in the > > Basic Multilingual Plane, or a character outside the Basic Multilingual > > Plane that isn't in a Private Use Area. After the first character, digits > > and combining Unicode characters are also allowed." > > > > Regards, > > Martin > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From verdy_p at wanadoo.fr Tue Jun 10 02:03:35 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 10 Jun 2014 09:03:35 +0200 Subject: Swift In-Reply-To: <01CBD00E-0907-470D-93FC-846E3785555E@norbertlindenberg.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <53908C6E.4@v.loewis.de> <01CBD00E-0907-470D-93FC-846E3785555E@norbertlindenberg.com> Message-ID: Variation selectors are within the subset of characters that should never be permitted in programming identifiers; they could cause surprising results, such as adding new APIs or backdoors that would not be detected by code reviewers looking at the code. But if you allow them in the language, the first thing you'll need to integrate in your project is a source code scanner that detects unsafe characters (including checking the list of confusables, and enforcing the normalization of the source code before compiling it, as text editors may break these normalizations unexpectedly).
Such a tool should run routinely, just like the tools that reindent code and enforce common presentation conventions in order to ease exploration and searches in the source code, simplify review, and facilitate the use of regexps in editors. There are also tools that inspect how well the code is documented, warn when documentation is missing for publicly exposed variables and APIs, and try to infer dependencies. Such tools fall in the same category as the old "lint" tool for C (largely obsolete in its original form, since most of its rules are now integrated into the language itself and enforced by compilers to ensure type safety). The risk, however, is higher in untyped or weakly typed languages like JavaScript/ECMAScript, where all objects can be overloaded freely and confusing identifiers could create unseen security risks. Note that identifiers do not exist only in programming languages; they appear in other kinds of APIs as well (notably web APIs and protocols transmitting data such as encoded web forms), even if those identifiers are ultimately used and exposed in a host language such as JSON, HTML, XML, or CSS, possibly with some escaping mechanism. "Identifiers" should also be interpreted broadly to include symbolic operators when the language or API allows their extension, overloading, or derivation. (Unicode identifiers, and identifiers in classic languages like HTML, XML, C/C++, Java, Cobol, Fortran, Ada, PHP, Python, or assembly languages, are more restricted in their allowed repertoire, and all other extensions require explicit escaping whose decoding should not weaken security.) Identifiers for data, by contrast, may be very liberal (e.g. if we want to allow toponyms, people's names, or trademarks), as they frequently need significant punctuation or symbols as well as spacing or word separation.
This is even more critical for work names/titles, pagenames and filenames in an open collection: these identifiers or names should resolve unambiguously to the document or data intended (and generally this implies developing naming conventions and some required classification system to get an accurate inventory of available data, making it possible to inspect this inventory and detect undesirable or conflicting items). I am convinced that for such open inventories or collections, normalization should never be optional: it should be enforced and automated as early as possible, even if we accept input in non-normalized forms. Any programming language or protocol that considers using a large repertoire from the UCS should look seriously at the security specifications in the Unicode standard and its annexes, and consider what has been done and discussed to maintain the security of the worldwide DNS within IDNA. The risks arising from instability of normalization, if you allow unassigned code points, are real, because they can easily be exploited by automated tools, and human reviewers will not detect such attacks without tools that check normalization. Code checkers should immediately warn about usage of code points not assigned in a known version of the UCS; and if they upgrade that version, they should make sure that all other tools in the chain check the same version.
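The kind of source scanner described here can be approximated with Python's `unicodedata` module. This is only a sketch; the `scan` function and its report format are invented for illustration. Note that it checks against whatever UCD version the interpreter ships, which is exactly the version-pinning problem raised above.

```python
import unicodedata

# Sketch of a source scanner: flag code points unassigned in the
# interpreter's copy of the UCD and warn when the text is not
# NFC-normalized (unicodedata.is_normalized needs Python 3.8+).
def scan(source):
    issues = []
    for i, ch in enumerate(source):
        # General_Category Cn covers unassigned code points
        # (and noncharacters, which also carry gc=Cn).
        if unicodedata.category(ch) == "Cn":
            issues.append((i, f"U+{ord(ch):04X} is unassigned or a noncharacter"))
    if not unicodedata.is_normalized("NFC", source):
        issues.append((-1, "source is not NFC-normalized"))
    return issues

# The UCD version these checks are implicitly pinned to:
ucd_version = unicodedata.unidata_version
```

A real checker would also consult the confusables data from UTS #39, which the standard library does not ship.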
But some identifiers are not found in source code at all and are instead generated at runtime using dynamic language features (dynamic binder libraries or reflection APIs should also perform their own checks, and will then need to embed a minimum database of the code points assigned in that UCS version; this could complicate compatibility maintenance, notably requiring a version negotiation mechanism and integration of the version property of assigned code points). 2014-06-09 4:46 GMT+02:00 Norbert Lindenberg : > It does allow some usage that may surprise code reviewers – for example, > this is a valid Swift program: > > let s = "??" > let s? = "??" > let ? = "??" > let all = s + s? + ? > > The value of the constant 'all' is "??????". Or at least it is as long as > mail software doesn't harm the variation selectors… > > Norbert > > > On Jun 5, 2014, at 9:06 , Mark Davis ?? wrote: > > > I haven't done any analysis, but on first glance it looks like it is > based on > > > > http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax > > > > > > Mark > > > > – Il meglio è l'inimico del bene – > > > > > > On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn wrote: > > Has anyone figured out whether character sequences that are > non-canonical (de)compositions but could be recomposed to the same result > > are the same identifier or not? > > > > That is: are identifiers merely sequences of characters or intended to > be comparable as 'Unicode strings' (under some sort of compatibility rule)? > > > > On Jun 5, 2014, at 11:27 AM, Martin v. Löwis wrote: > > > > > On 04.06.14 11:28, Andre Schappo wrote: > > >> The restrictions seem a little like IDNA2008. Anyone have links to > > >> info giving a detailed explanation/tabulation of allowed and non > > >> allowed Unicode chars for Swift Variable and Constant names?
> > > > > > The language reference is at > > > > > > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html > > > > > > For reference, the definition of identifier-character is (read each > > > line as an alternative) > > > > > > identifier-character → Digit 0 through 9 > > > identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or > > > U+FE20–U+FE2F > > > identifier-character → identifier-head > > > > > > where identifier-head is > > > > > > identifier-head → Upper- or lowercase letter A through Z > > > identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or > > > U+00B7–U+00BA > > > identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or > > > U+00F8–U+00FF > > > identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or > > > U+180F–U+1DBF > > > identifier-head → U+1E00–U+1FFF > > > identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, > > > or U+2060–U+206F > > > identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or > > > U+2776–U+2793 > > > identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF > > > identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or > > > U+3040–U+D7FF > > > identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or > > > U+FE30–U+FE44 > > > identifier-head → U+FE47–U+FFFD > > > identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or > > > U+40000–U+4FFFD > > > identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or > > > U+80000–U+8FFFD > > > identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or > > > U+C0000–U+CFFFD > > > identifier-head →
U+D0000–U+DFFFD or U+E0000–U+EFFFD > > > > > > As the construction principle for this list, they say > > > > > > "Identifiers begin with an upper case or lower case letter A through Z, > > > an underscore (_), a noncombining alphanumeric Unicode character in the > > > Basic Multilingual Plane, or a character outside the Basic Multilingual > > > Plane that isn't in a Private Use Area. After the first character, > digits > > > and combining Unicode characters are also allowed." > > > > > > Regards, > > > Martin > > > _______________________________________________ > > > Unicode mailing list > > > Unicode at unicode.org > > > http://unicode.org/mailman/listinfo/unicode > > > > > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Tue Jun 10 06:51:44 2014 From: frederic.grosshans at gmail.com (=?windows-1252?Q?Fr=E9d=E9ric_Grosshans?=) Date: Tue, 10 Jun 2014 13:51:44 +0200 Subject: Quasiquotation marks Message-ID: <5396F150.4080905@gmail.com> This week's shady character introduces quasiquotation marks, used in fanzines since at least 1944 for "in substance" quotation. This mark is the superposition of " (or ') with -. http://www.shadycharacters.co.uk/2014/06/miscellany-49-quasiquote/ This looks like a good candidate for Unicode encoding, with many discussions in the linked blog posts and comments being about recreating it through rich text (word processor/CSS/TeX...).
Frédéric From verdy_p at wanadoo.fr Tue Jun 10 07:29:49 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 10 Jun 2014 14:29:49 +0200 Subject: Math input methods In-Reply-To: <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> Message-ID: ????????????? are without doubt more useful and more common in double-struck styles than in Fraktur styles. But there are cases where they will be distinctly replaced by bold letters (notably when working with homomorphic/dual sets correlated bijectively with them but having distinct projections/coordinates in a numeral set, provided that there's a defined pair of operations for compositing these coordinates with elementary base elements in the dual/homomorphic set). As soon as you start using derived styles for such notation of duals, you'll immediately want to use the same styles for deriving elements (numbers) of these numeral sets, so you'll get double-struck or bold or Fraktur variable names and digits (notably for zero, one and the i, j, k used by some other extended sets that drop some property like commutativity or distributivity of the basic operations). These styles then become functionally equivalent to diacritics, or to a prepended operator, or to subscripts/superscripts denoting the set into which they are projected or their numeral system, except that they are represented in a compact composite which is not easily decomposable. But it is still possible: ????????????? could be written as well: \SetN \in \SetZ \in \SetQ \in \SetR \in \SetC (TeX is commonly used for such notation when composing documents). The notation is functionally equivalent but it obscures notations that are already complex, so mathematicians invent various shortcuts or compact representations in their text.
But you can't simply treat these notations as preferred visual styles; these styles have important strict definitions that disambiguate the meaning. These formulas also have strict layout restrictions, much more than usual plain text. We are in a corner case where it is just safer to consider that maths notations are not text but binary objects that do not work very well with the Unicode character model, and that are also far from the weak definition of symbols. Even a basic variable name 'x' is not a letter x of the Latin script: its letter case cannot be changed, it cannot be freely transliterated, and side-by-side letters do not form "words". Their grammar is in a very specific language and is highly contextual (and frequently altered by document-specific prior definitions and conventions). No plain-text algorithm will work correctly with maths notations, notably at advanced levels (not the level taught in schools for children learning arithmetic for use in daily social life). In fact I have doubts that Unicode should make great efforts to encode them as more than an informal collection of independent symbols living in their own world, with their own script and their own "language(s)" (and there are many languages, as many as there are authors in fact, and frequently more, when authors invent specific languages for specific documents). Most people in the world can't understand the level of abstraction meant by these notations.
And it is already hard for them to accept the concept of "negative" numbers and understand that the values of almost all reals are not even representable, or what a complex number means; even the concept of multiplication of numbers is difficult to understand unless you bind it to a 2D Cartesian space, and immediately they wonder what happens in their visible 3D world; then let's not speak about zeroes, or infinities, or curved spaces, or fractal dimensions, or about infinitesimal quantities that are not absolutely comparable in our commonly perceived Cartesian space, of which we have a limited vision... Their vision is more pragmatic: as long as they have a solution (or a tool to compute it) and it gives satisfaction in most of the cases they can perceive in their life, they will not need to go further; their vision is bound to their "experience" (and experience is not bad in itself, it is a strong basis for science, propagation of knowledge and utility). Most people are not probabilists; they favor statistics for remembering their experience, guiding their choices of action immediately, and explaining their intents to others. 2014-06-05 10:57 GMT+02:00 Hans Aberg : > On 5 Jun 2014, at 04:50, David Starner wrote: > > > On Wed, Jun 4, 2014 at 6:00 AM, Jukka K. Korpela > wrote: > >> The change is logical in the sense that bold face is a > >> more original notation and double-struck letters as characters imitate > the > >> imitation of boldface letters when writing by hand (with a pen or piece > of > >> chalk). > > > > On the other hand, bold face is a minor variation on normal types. > > Double-struck letters are more clearly distinct, which is probably why > > they moved from the chalkboard to printing in the first place. I don't > > see much advantage of ?????????? over ?????, especially when > > confusability with NCRZQ comes into play. > > The double-struck letters are useful in math, because they free other > letter styles for other use.
First, only a few were used, as for the natural, > rational, real and complex numbers, but they became popular, so that all letters, > uppercase and lowercase, are now available in Unicode. > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gansmann at uni-bonn.de Tue Jun 10 07:35:13 2014 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Tue, 10 Jun 2014 14:35:13 +0200 Subject: Quasiquotation marks In-Reply-To: <5396F150.4080905@gmail.com> References: <5396F150.4080905@gmail.com> Message-ID: On Tue, 10 Jun 2014 13:51:44 +0200, Frédéric Grosshans wrote: > This week's shady character introduces quasiquotation marks, used in fanzines since at least 1944 for "in substance" quotation. This mark is the superposition of " (or ') with -. > > http://www.shadycharacters.co.uk/2014/06/miscellany-49-quasiquote/ > > This looks like a good candidate for Unicode encoding, with many discussions in the linked blog posts and comments being about recreating it through rich text (word processor/CSS/TeX...). This reminds me of the special quotation marks shown and discussed here (with no satisfying conclusion in my opinion): http://german.stackexchange.com/q/10055/2594 Gerrit Ansmann From verdy_p at wanadoo.fr Tue Jun 10 07:39:51 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 10 Jun 2014 14:39:51 +0200 Subject: Quasiquotation marks In-Reply-To: <5396F150.4080905@gmail.com> References: <5396F150.4080905@gmail.com> Message-ID: Aren't they just standard quotes with a basic style? (overstriking with <strike> or <del> in HTML) How are they different from quoting multiple personalities, each one with their own color (red, green, blue, black for the author, grey for side remarks...)
There are certainly lots of combinations to denote contexts of quotations or add intended emphasis from an author, including changing fonts (italics, bold, font size, character spacing, decorations, indentations, ...). Every possible style already working in documents can be used in such combinations. But even in this case, it is possible to extract a part of it that has a standard text meaning, even if its contextual usage is different and carries additional semantics with these styles. 2014-06-10 13:51 GMT+02:00 Frédéric Grosshans : > This week's shady character introduces quasiquotation marks, used in > fanzines since at least 1944 for "in substance" quotation. This mark is > the superposition of " (or ') with -. > > http://www.shadycharacters.co.uk/2014/06/miscellany-49-quasiquote/ > > This looks like a good candidate for Unicode encoding, with many > discussions in the linked blog posts and comments being about recreating it > through rich text (word processor/CSS/TeX...). > > Frédéric > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Tue Jun 10 07:57:22 2014 From: haberg-1 at telia.com (Hans Aberg) Date: Tue, 10 Jun 2014 14:57:22 +0200 Subject: Math input methods In-Reply-To: References: <20140603195253.3c0df53f@JRWUBU2> <79B59F4E-FA21-4596-96F2-A2DE458602FA@lboro.ac.uk> <3DCD6A05-2483-4413-B978-D545F155DDCA@telia.com> <538F1864.9070603@cs.tut.fi> <8A91ABFB-7C83-4AC0-A5B2-584AF4CE438F@telia.com> Message-ID: On 10 Jun 2014, at 14:29, Philippe Verdy wrote: > ????????????? are without doubt more useful and more common in double-struck styles than in Fraktur styles. Fraktur would normally be for Lie algebras. For sets, some other style or none. And logicians use their own notation.
From leoboiko at namakajiri.net Tue Jun 10 08:33:08 2014 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Tue, 10 Jun 2014 10:33:08 -0300 Subject: Quasiquotation marks In-Reply-To: References: <5396F150.4080905@gmail.com> Message-ID: What about using U+0331 "combining macron below" or U+0320 "combining minus below"? Here are some samples: U+0331 "?test"? ??test?? U+0320 "?test"? ??test?? 2014-06-10 9:39 GMT-03:00 Philippe Verdy : > (overstriking with <strike> or <del> in HTML) Modern HTML phased out <strike>, and <del> has semantic meanings inappropriate for this case. It would be better to use CSS "text-decoration: line-through". This point has been raised in the comments of the original post. > How are they different from quoting multiple personalities, each one with their own color (red, green, blue, black for the author, grey for side remarks...) That could be bad for people with color blindness (which may reach up to some 10% of the genetically male population).
<del> will not be phased out, for the same reason that the other presentational elements will be kept. My opinion is that it is even better to use these elements than to fix a dependency on style="" attributes spread everywhere in the document. These elements give useful placements where you can contextually apply the styles matching your presentation; they carry the semantics that a style does not carry at all (because they are not "cascading" even if they are styled with CSS). What makes the cascade in CSS is not what you put in styles; it is the structure of elements in the document, which you can contextually and semantically preserve in your "selectors". So <del> is another way; you could as well use a ... -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Tue Jun 10 11:22:22 2014 From: public at khwilliamson.com (Karl Williamson) Date: Tue, 10 Jun 2014 10:22:22 -0600 Subject: Apparent discrepancy between FAQ and Age.txt Message-ID: <539730BE.3030408@khwilliamson.com> The FAQ http://www.unicode.org/faq/private_use.html#sentinels says that the last 2 code points on the planes except the BMP were made noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. "The conformance wording about U+FFFE and U+FFFF changed somewhat in Unicode 2.0, but these were still the only two code points with this unique status" Unicode 3.1 [2001] was the watershed for the development of noncharacters in the standard. Unicode 3.1 was the first version to add supplementary characters to the standard.
As a result, it also had to come to grips with the fact that ISO/IEC 10646-2:2001 had reserved the last two code points of every plane as "not a character" From sdaoden at yandex.com Tue Jun 10 12:05:38 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Tue, 10 Jun 2014 19:05:38 +0200 Subject: Apparent discrepancy between FAQ and Age.txt In-Reply-To: <539730BE.3030408@khwilliamson.com> References: <539730BE.3030408@khwilliamson.com> Message-ID: <20140610180538.in5AqjGH%sdaoden@yandex.com> Hello, Karl Williamson wrote: |The FAQ http://www.unicode.org/faq/private_use.html#sentinels |says that the last 2 code points on the planes except BMP were made |noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. The (nothing but informational except for @missing lines) comments in DerivedAge.txt state very clearly: # - The supplementary private use code points and the non-character code points # were assigned in version 2.0, but not specifically listed in the UCD # until versions 3.0 and 3.1 respectively. |"The conformance wording about U+FFFE and U+FFFF changed somewhat in |Unicode 2.0, but these were still the only two code points with this |unique status" | |Unicode 3.1 [2001] was the watershed for the development of |noncharacters in the standard. Unicode 3.1 was the first version to add |supplementary characters to the standard. As a result, it also had to |come to grips with the fact that ISO/IEC 10646-2:2001 had reserved the |last two code points for every plane as "not a character" Less scattering of information would be a pretty cool thing nonetheless. I.e., I think it would be less academic but much nicer if no FAQ were necessary at all because the standard as such covered the background information, too.
I remember that one of the reasons I stopped any effort to go with (the roughly 120 German Mark book of) Unicode 3.0 was that I was incapable of wrapping my head around a combining Arabic example somewhere; you need access to technical reports to get it done. --steffen From frederic.grosshans at gmail.com Tue Jun 10 12:14:29 2014 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Tue, 10 Jun 2014 19:14:29 +0200 Subject: Quasiquotation marks In-Reply-To: References: <5396F150.4080905@gmail.com> Message-ID: <53973CF5.5020508@gmail.com> On 10/06/2014 15:33, Leonardo Boiko wrote: > What about using U+0331 "combining macron below" or U+0320 "combining > minus below"? That would be more similar to the underline hack discussed briefly here: http://fanac.org/Fannish_Reference_Works/Fan_terms/Fan_terms-07.html But I think it's the wrong character: typewriters had the underscore (_) character, which was used to underline, and which was sometimes used as a "combining macron below". But this was not the one chosen in the 1940s to create the quasiquote; it was the hyphen. Using U+0335 COMBINING SHORT STROKE OVERLAY seems closer to the original. The various posts linked in that thread say that these quasiquotation marks were practical, but everyone says "They are difficult to use with modern word processors!" Given the fact that it is possible to reproduce them with U+0335 COMBINING SHORT STROKE OVERLAY, what is the practice about encoding them as a new character? Would the case (given enough usage proof) be similar to the encoding of U+024F LATIN SMALL LETTER Y WITH STROKE, which, I guess, probably has a similar origin in overstruck typewriter characters? Or does the fact that the stroke doesn't touch the quotes make the situation closer to non-existing precomposed latin character + diacritic combinations (http://www.unicode.org/faq/char_combmark.html#13), so that a specific symbol would be against NFC stability and un-encodeable?
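The U+0335 suggestion above is easy to check against the normalization-stability worry: since no precomposed "quote with stroke" exists in the UCD, NFC and NFD both leave the combining sequence untouched. A quick sketch (variable names invented for illustration):

```python
import unicodedata

# Build a quasiquote as '"' plus U+0335 COMBINING SHORT STROKE OVERLAY
# and verify that normalization leaves the sequence alone: there is no
# precomposed quote-with-stroke for it to compose into.
quasi = '"' + "\u0335"
text = quasi + "in substance" + quasi + " quotation"

nfc_stable = unicodedata.normalize("NFC", text) == text
nfd_stable = unicodedata.normalize("NFD", text) == text
```

So representing the mark as a combining sequence is stable today; what NFC stability would forbid is a *new precomposed* character that canonically decomposes to this sequence.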
From ken.whistler at sap.com Tue Jun 10 14:04:57 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Tue, 10 Jun 2014 19:04:57 +0000 Subject: Apparent discrepancy between FAQ and Age.txt In-Reply-To: <539730BE.3030408@khwilliamson.com> References: <539730BE.3030408@khwilliamson.com> Message-ID: Karl Williamson noted: > The FAQ http://www.unicode.org/faq/private_use.html#sentinels > says that the last 2 code points on the planes except BMP were made > noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. > The *concept* of noncharacter was not invented until Unicode 3.1, so it could not have formally been applied to anything before then. Before Unicode 3.1, some code points had been referred to as "not a character", but it took a while for the UTC to rationalize the details systematically. Unicode 3.1 was the first version to formally introduce Noncharacter_Code_Point as a property and apply it to FFFE/FFFF (as well as the other noncharacters). Unicode 2.0 introduced the concept of Unicode scalar value and established the framework of definitions and conformance clauses now familiar in Chapter 3 (although it was pretty rough around the edges back then). It also documented UTF-8 (although at that point it was still in an annex), and that *required* a mapping between the UTF-16 and UTF-8 forms of 0xnFFFE and 0xnFFFF on each plane. The Age value derives from that. U+FFFE and U+FFFF themselves were given Age=1.1 because they were part of Unicode 1.1, before Unicode 2.0 formally documented the addition of the rest of the planes. Earlier still, when Unicode was still trying to be a pure 16-bit encoding, FFFE and FFFF were simply outside the codespace. Incidentally, the property Age wasn't introduced until Unicode 3.2, so technically speaking it didn't exist before then, either. However, Age values were derived retroactively back to Version 1.1 when parceling out the initial assignments as of Unicode 3.2.
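The Age values discussed here live in DerivedAge.txt, whose range syntax is simple to read mechanically. A minimal sketch: the `parse_ages`/`age_of` helpers are invented, the file is assumed to be a local copy from the UCD, and the two sample lines mirror the values cited in this thread.

```python
import re

# DerivedAge.txt data lines look like
#   "FDD0..FDEF ; 3.1 # ..."  (range)  or  "0041 ; 1.1 # ..."  (single).
AGE_LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*([\d.]+)")

def parse_ages(lines):
    ranges = []
    for line in lines:
        m = AGE_LINE.match(line)
        if m:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            ranges.append((start, end, m.group(3)))
    return ranges

def age_of(cp, ranges):
    for start, end, age in ranges:
        if start <= cp <= end:
            return age
    return None  # no Age: unassigned

# Sample lines mirroring the values discussed in this thread:
sample = [
    "FFFE..FFFF    ; 1.1 #   [2] <noncharacter-FFFE>..<noncharacter-FFFF>",
    "1FFFE..1FFFF  ; 2.0 #   [2] <noncharacter-1FFFE>..<noncharacter-1FFFF>",
]
ranges = parse_ages(sample)
```

A production parser would also honor the `@missing` lines that Steffen mentions; this sketch ignores comments entirely.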
Note also that although the majority of the repertoire in Unicode 1.1 actually was already assigned as of Unicode 1.0, no attempt was made to assign Age=1.0 to any characters, because of the churn and renaming that occurred as a result of the Unicode 1.0 and ISO 10646-1 merger effort back in the early 1990's. --Ken From public at khwilliamson.com Wed Jun 11 23:29:53 2014 From: public at khwilliamson.com (Karl Williamson) Date: Wed, 11 Jun 2014 22:29:53 -0600 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> Message-ID: <53992CC1.3010101@khwilliamson.com> On 06/02/2014 09:48 AM, Markus Scherer wrote: > On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell > wrote: > > I suspect everyone can agree on the edge cases, that noncharacters are > harmless in internal processing, but probably should not appear in > random text shipped around on the web. > > > Right, in principle. However, it should be ok to include noncharacters > in CLDR data files for processing by CLDR implementations, and it should > be possible to edit and diff and version-control and web-view those > files etc. > > It seems that trying to define "interchange" and "public" in ways that > satisfy everyone will not be successful. > > The FAQ already gives some examples of where noncharacters might be > used, should be preserved, or could be stripped, starting with "Q: Are > noncharacters intended for interchange? > " > > In my view, those Q/A pairs explain noncharacters quite well. If there > are further examples of where noncharacters might be used, should be > preserved, or could be stripped, and that would be particularly useful > to add to the examples already there, then we could add them. > > markus > > I was unaware of this FAQ. Having read it and re-read this entire thread, I am still troubled. 
I have something like a library that was written a long time ago (not by me) assuming that noncharacters were illegal in open interchange. Programs that use the library were guaranteed that they would not receive noncharacters in their input. They thus are free to use any noncharacter internally as they wish. Now that Corrigendum #9 has come out, I'm getting requests to update the library to not reject noncharacters. The library itself does not use noncharacters. If I (or someone else) makes the requested change, it may silently cause security holes in those programs that were depending on it doing the rejection, and which upgrade to use the new version. Some of these programs may have been written many years ago. The original authors are now dead in some instances, or have turned the code over to someone else, or haven't thought about it in years. The current maintainers of those programs may be unaware of this dependence, and hence may upgrade without realizing the consequences. Further, the old versions of the library will soon be unsupported, so there is pressure to upgrade to get bug fixes and the promise of future support. This means there could be security holes that a hacker who gets hold of the source can exploit. I don't see anything in the FAQ that really addresses this situation. I think there should be an answer that addresses code written before the Corrigendum, and that goes into detail about the security issues. My guess is that the UTC did not really consider the potential for security holes when making this Corrigendum. I agree that CLDR should be able to use noncharacters for internal processing, and that they should be able to be stored in files and edited. But I believe that version control systems and editors have just as much right to use noncharacters for their internal purposes. I disagree with the FAQ where it seems to say that if you write a utility you should avoid using noncharacters in its implementation. 
It might be that competitive pressure, or just that the particular implementations don't need non-characters, would cause some such utilities to accept some or all non-characters as inputs. But if I were writing such code, I can see now how using noncharacters for my purposes would be quite convenient. CLDR could be considered to be a utility, and its users might want to use noncharacters for their purposes. Is CLDR constructed so there is no potential for conflicts here? That is, does it reserve certain noncharacters for its own use? The FAQ talks about how various now-noncharacter code points were touted as sentinel candidates in earlier Unicode versions, and that they are no longer so. But it really should emphasize that old code may very well want to continue to use them as sentinels. The answer "Well, the short answer is no, that is not true--at least, not entirely true." is misleading in this regard. The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not realize that that was considered representable in any UTF. Likewise -1. From markus.icu at gmail.com Thu Jun 12 03:37:49 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 12 Jun 2014 01:37:49 -0700 Subject: Corrigendum #9 In-Reply-To: <53992CC1.3010101@khwilliamson.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson wrote: > I have something like a library that was written a long time ago (not by > me) assuming that noncharacters were illegal in open interchange. Programs > that use the library were guaranteed that they would not receive > noncharacters in their input. They thus are free to use any noncharacter > internally as they wish. Now that Corrigendum #9 has come out, I'm getting > requests to update the library to not reject noncharacters. The library > itself does not use noncharacters. 
If I (or someone else) makes the > requested change, it may silently cause security holes in those programs > that were depending on it doing the rejection, and who upgrade to use the > new version. > If your library makes an explicit promise to remove noncharacters, then it should continue to do so. However, if your library is understood to pass through any strings, except for the advertised processing, then noncharacters should probably be preserved. I don't see anything in the FAQ that really addresses this situation. I > think there should be an answer that addresses code written before the > Corrigendum, and that goes into detail about the security issues. My guess > is that the UTC did not really consider the potential for security holes > when making this Corrigendum. > There is nothing really new in the corrigendum. The UTC felt that some implementers had misinterpreted inconsistent and misleading statements in and around the standard, and clarified the situation. Any process that requires certain characters or sequences to not occur in the input must explicitly check for those, regardless of whether they are noncharacters, private use characters, unassigned code points, control codes, deprecated language tag characters, discouraged stateful formatting controls, stacks of hundreds of diacritics, or whatever. In a sense, noncharacters are much like the old control codes. Some terminals say "beep" when they see U+0007, or go into strange modes when they see U+001B; on Windows, when you read a text file that contains U+001A, it is interpreted as an end-of-file marker. If your process depended on those things not happening, then you would have to strip those control codes on input. But a pass-through-style library will be universally expected not to do anything special with them. I agree that CLDR should be able to use noncharacters for internal > processing, and that they should be able to be stored in files and edited. 
> But I believe that version control systems and editors have just as much > right to use noncharacters for their internal purposes. I disagree. If svn or git choked on noncharacters or control codes or private use characters or unassigned code points etc., I would complain. Likewise, I expect to be able to use plain text or programming editors (gedit, kate, vi, emacs, Visual Studio) to handle files with such characters just fine. I do *not* necessarily expect Word, OpenOffice, or Google Docs to handle all of these. Is CLDR constructed so there is no potential for conflicts here? That is, > does it reserve certain noncharacters for its own use? > I believe that CLDR only uses noncharacters for special purposes in collation. In CLDR data files, there are at most contraction mappings that start with noncharacters for purposes of building alphabetic-index tables. (And those noncharacters are \u-escaped in CLDR XML files since CLDR 24.) There is no mechanism to remove them from any input, but the worst thing that would happen is that you get a sequence of code points to sort interestingly. The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not > realize that that was considered representable in any UTF. Likewise -1. > No, and that's the point of using those. Integer values that are not code points make for great sentinels in API functions, such as a next() iterator returning -1 when there is no next character. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prosfilaes at gmail.com Thu Jun 12 06:30:19 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 12 Jun 2014 04:30:19 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: On Thu, Jun 12, 2014 at 1:37 AM, Markus Scherer wrote: > If your library makes an explict promise to remove noncharacters, then it > should continue to do so. There is rarely so much frustration as when a library or utility changes behavior and the justification is that well-understood practice was not explicit. I suspect few groups could bring the world to a halt with work-to-rule as quick as programmers. > I disagree. If svn or git choked on noncharacters or control codes or > private use characters or unassigned code points etc., I would complain. > Likewise, I expect to be able to use plain text or programming editors > (gedit, kate, vi, emacs, Visual Studio) to handle files with such characters > just fine. I don't expect plain text editors to handle arbitrary control codes, much less noncharacters, unless they really handle whatever binary junk is shoved at them, which a generic plain text editor can not be relied upon to do. I believe that programming editors should scream bloody murder over noncharacters and unusual control codes; they have no place in source code at all. -- Kie ekzistas vivo, ekzistas espero. 
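[Archive editor's note: the input check David describes is straightforward to implement; a minimal sketch in Python, using the ranges from the standard's definition of Noncharacter_Code_Point (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes):]

```python
# Detect Unicode noncharacter code points, per the Noncharacter_Code_Point
# definition: U+FDD0..U+FDEF, and U+nFFFE/U+nFFFF on every plane
# (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
def is_noncharacter(cp: int) -> bool:
    # (cp & 0xFFFE) == 0xFFFE matches exactly the xxFFFE/xxFFFF pairs
    # within the valid codespace 0..0x10FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def find_noncharacters(text: str):
    """Yield (index, code point) for every noncharacter in text."""
    for i, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            yield i, ord(ch)

assert is_noncharacter(0xFFFE) and is_noncharacter(0x10FFFF)
assert not is_noncharacter(0x41)
```

A filtering or rejecting tool would then decide what to do with the hits; the point of the thread is that this policy decision belongs to the application, not to the encoding layer.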
From richard.wordingham at ntlworld.com Thu Jun 12 13:28:45 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 12 Jun 2014 19:28:45 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: <20140612192845.7e949779@JRWUBU2> On Thu, 12 Jun 2014 01:37:49 -0700 Markus Scherer wrote: > On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson > wrote: > > The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not > > realize that that was considered representable in any UTF. > > Likewise -1. > No, and that's the point of using those. Integer values that are not > code points make for great sentinels in API functions, such as a > next() iterator returning -1 when there is no next character. They work fine as alternatives to scalar values. They don't work so well in 8-bit and 16-bit Unicode strings. A general purpose routine extracting scalar values from Unicode strings is likely to treat them as errors rather than just returning the scalar value as it would for a non-character. The only way to use them directly in 8- and 16-bit Unicode strings is to deliberately create ill-formed Unicode strings. Thus, these 'sentinels' are not full blown sentinels like U+0000 in the C conventions for 'strings', as opposed to arrays of char. There is a get-out clause - just never accept that a Unicode string is purported to be in a Unicode character encoding form. Richard. 
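[Archive editor's note: Richard's distinction can be made concrete. An out-of-range sentinel such as -1 or 0x7FFFFFFF works at the API level, where scalar values are plain integers, but it cannot be stored inside a well-formed Unicode string. A sketch, with a hypothetical next()-style iterator of the kind Markus mentioned:]

```python
# An iterator over a string's scalar values that returns -1 when
# exhausted. -1 and 0x7FFFFFFF are safe API-level sentinels precisely
# because no UTF can encode them: code points stop at U+10FFFF.
class CodePointIterator:
    DONE = -1  # sentinel: not a Unicode code point

    def __init__(self, text: str):
        self._it = iter(text)

    def next(self) -> int:
        ch = next(self._it, None)
        return self.DONE if ch is None else ord(ch)

it = CodePointIterator("Ab")
assert [it.next(), it.next(), it.next()] == [0x41, 0x62, -1]

# By contrast, trying to *store* the sentinel in a string fails:
# chr(0x7FFFFFFF) raises ValueError, since it is beyond U+10FFFF.
```

This is exactly the "alternative to scalar values" use Richard allows for, as opposed to an in-band sentinel like U+0000 in C strings.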
From petercon at microsoft.com Fri Jun 13 00:14:30 2014 From: petercon at microsoft.com (Peter Constable) Date: Fri, 13 Jun 2014 05:14:30 +0000 Subject: Corrigendum #9 In-Reply-To: <53992CC1.3010101@khwilliamson.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> Message-ID: <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson Sent: Wednesday, June 11, 2014 9:30 PM > I have a something like a library that was written a long time ago > (not by me) assuming that noncharacters were illegal in open interchange. > Programs that use the library were guaranteed that they would not receive > noncharacters in their input. I haven't read every post in the thread, so forgive me if I'm making incorrect inferences. I get the impression that you think that Unicode conformance requirements have historically provided that guarantee, and that Corrigendum #9 broke that. If so, then that is a mistaken understanding of Unicode conformance. Here is what has historically been said in the way of conformance requirements related to non-characters: TUS 1.0: There were no conformance requirements stated. This recommendation was given: "U+FFFF and U+FFFE are reserved and should not be transmitted or stored." This same recommendation was repeated in later versions. However, it must be recognized that "should" statements are never absolute requirements. Conformance requirements first appeared in TUS 2.0: TUS 2.0, TUS 3.0: "C5 A process shall not interpret either U+FFFE or U+FFFF as an abstract character." TUS 4.0: "C5 A process shall not interpret a noncharacter code point as an abstract character." 
"C10 When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points." Btw, note that C10 makes the assumption that a valid coded character sequence can include non-character code points. TUS 5.0 (trivially different from TUS4.0): C2 = TUS4.0, C5 "C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points." TUS 6.0: C2 = TUS5.0, C2 "C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences." Interestingly, the change to C7 does not permit non-characters to be replaced or removed at all while claiming not to have left the interpretation intact. So, there was a change in 6.0 that could impact conformance claims of existing implementations. But there has never been any guarantees made _by Unicode_ that non-character code points will never occur in open interchange. Interchange has always been discouraged, but never prohibited. Peter From daniel.buenzli at erratique.ch Mon Jun 23 20:31:52 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 24 Jun 2014 02:31:52 +0100 Subject: Default case algorithms Message-ID: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Hello, I don?t understand the rule specifications of default case conversion ?3.13 p.117 of Unicode 6.2.0 [1] (which is what 7.0.0 eventually points to at the moment). Specifically the sentence ? 
as well as the context-dependent mappings based on the casing context, as specified in Table 3-14 ?. This table just specifies casing contexts and there seem to be no normative property that specifies the context-dependent mappings (was apparently removed when UCD xml was created). So the question is, if I take a string an apply e.g. only the rule R1 as given, is that an implementation of default uppercase conversion ? or would that be a (context-independent) tailoring of the default uppercase conversion algorithm ? If that?s thte case it seems strange to have normative behaviours defined that have no supporting normative properties to implement them. Thanks for your answers, Daniel [1] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf From markus.icu at gmail.com Tue Jun 24 07:28:38 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 24 Jun 2014 14:28:38 +0200 Subject: Default case algorithms In-Reply-To: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: The context-sensitive and/or language-sensitive mappings are here: http://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From cewcathar at hotmail.com Tue Jun 24 08:16:00 2014 From: cewcathar at hotmail.com (CE Whitehead) Date: Tue, 24 Jun 2014 09:16:00 -0400 Subject: Corrigendum #9 Message-ID: Markus Scherer said what sounds right to me to recommend (maybe what he says should be said in Corrigendum 9): http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0148.html From: Markus Scherer Date: Thu, 12 Jun 2014 01:37:49 -0700 > If your library makes an explict promise to remove noncharacters, then it > should continue to do so. > However, if your library is understood to pass through any strings, except > for the advertised processing, then noncharacters should probably be > preserved. 
ME: Am I to believe from the above, that, regarding www.unicode.org/L2/L2013/13015-nonchars.pdf (which rejects the bold interpretation, but I don't think that's what Markus's email does) -- the "'bold interpretation' of internal exchange of noncharacters" may continue, where deletion of a noncharacter is never a good idea and should not happen, and unrecognized noncharacters should simply be silently ignored, with "all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points," thus "mapped to unique code unit sequences"; while, at the same time (albeit, as I understand things, only if the type of encoding is recognized), noncharacters may be replaced with the replacement character (U+FFFD)? In this latter case the non-character is no longer mapped one-to-one with a scalar, as all noncharacters will have been replaced with U+FFFD. So is that one-to-one mapping recommendation going to be changed or not? * * * I also have a question on Peter's notes on TUS 6.0 rule C7 (which followed the Unicode 4.0 correction apparently, if I understand correctly; maybe I should have sent this question as a separate email) http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0151.html From: Peter Constable Date: Fri, 13 Jun 2014 05:14:30 +0000 > TUS 6.0: > C2 = TUS5.0, C2 "C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences." > Interestingly, the change to C7 does not permit non-characters to be replaced or removed at all while claiming not to have left the interpretation intact. ME: if two sequences are canonically equivalent except that one has noncharacters in it, are these still canonically equivalent? 
(just a wild question; it would be nice to have an answer in the FAQ on noncharacters or somewhere; maybe I missed the answer and it was there). * * * Sentinels, Security Regarding the sentinels; I am an outsider but assume that with Corrigendum 9 U+FFFE will continue to be mentioned as having generally (not always?) standard use throughout; in Chapter 16.7 it is currently mentioned; I assume it will still be -- according to info. in the FAQ and elsewhere: http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode character value, and should be taken as a signal that Unicode characters should be byte-swapped before interpretation. U+FFFE should only be interpreted as an incorrectly byte-swapped version of U+FEFF" Yes, I agree it would be nice also to have info about the security effects of any other sentinels, particularly U+FFFF and U+10FFFF -- but I envision most security effects would be caused by removing without replacing one of these (is that right?) Hope these questions are helpful. Best, --C. E. Whitehead cewcathar at hotmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Tue Jun 24 09:56:59 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 24 Jun 2014 15:56:59 +0100 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: Thanks, so the context/language sensitive case maps are not available in the XML UCD. But that doesn't really answer my question, which is: Does an algorithm that simply applies R1 *regardless of context* constitute a default case algorithm or not? I.e. does simply mapping each character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the XML UCD) constitute a default case conversion as mandated by the standard? 
The wording of the standard is quite confusing since on p.115 many of the context-dependent data of SpecialCasing.txt are mentioned as being data to "assist in the implementation of certain tailorings", and there is no clear indication in the definition of default case algorithm which context-sensitive mappings should be applied (if any). Best, Daniel From markus.icu at gmail.com Tue Jun 24 10:07:27 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 24 Jun 2014 17:07:27 +0200 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: On Tue, Jun 24, 2014 at 4:56 PM, Daniel Bünzli wrote: > Does an algorithm that simply applies R1 *regardless of context* > constitute a default case algorithm or not ? I.e. does simply mapping each > character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the > XML UCD) constitute a default case conversion as mandated by the standard ? > It implements simple uppercasing but not full uppercasing. It misses simple, common things like ß -> SS (which is neither language-dependent nor context-sensitive). The wording of the standard is quite confusing since on p.115 many of the > context dependent data of SpecialCasing.txt are mentioned as being data to > "assist in the implementation of certain tailorings" and there is no clear > indication in the definition of default case algorithm which > context-sensitive mappings should be applied (if any). > http://www.unicode.org/reporting.html markus -------------- next part -------------- An HTML attachment was scrubbed... 
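[Archive editor's note: Markus's simple-vs-full distinction can be checked directly. Python's str.upper() and str.lower(), for example, implement the Unicode full case mappings, including the one conditional mapping in SpecialCasing.txt (final sigma); a small sketch:]

```python
# Full case mappings can change string length; simple mappings
# (one code point -> one code point) cannot.
assert "ß".upper() == "SS"   # unconditional full mapping from SpecialCasing.txt
assert "ﬁ".upper() == "FI"   # ligature expands under full uppercasing

# The one conditional mapping in SpecialCasing.txt: Greek capital sigma
# lowercases to final sigma (ς) at the end of a word, σ elsewhere.
assert "ΟΣ".lower() == "ος"
assert "ΣΟ".lower() == "σο"
```

A mapping built only from Simple_Uppercase_Mapping would leave ß and ﬁ unchanged, which is exactly the gap being discussed.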
URL: From verdy_p at wanadoo.fr Tue Jun 24 11:03:48 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 24 Jun 2014 18:03:48 +0200 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: 2014-06-24 17:07 GMT+02:00 Markus Scherer : > On Tue, Jun 24, 2014 at 4:56 PM, Daniel Bünzli < > daniel.buenzli at erratique.ch> wrote: >> Does an algorithm that simply applies R1 *regardless of context* >> constitute a default case algorithm or not ? I.e. does simply mapping each >> character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the >> XML UCD) constitute a default case conversion as mandated by the standard ? >> > > It implements simple uppercasing but not full uppercasing. > It misses simple, common things like ß -> SS (which is neither > language-dependent nor context-sensitive). > Not so simple; maybe it is SS for modern German, but Czech would map it to SZ, and historically that letter is a ligature of SZ (including in old German texts where that ligature was used), along with many other ligatures in medieval texts. If texts were printed in Fraktur style, you always have an ambiguity about whether you should even use ß as a single letter or whether you should rather encode separate letters (without even needing to encode any ligature hint, because ligatures are everywhere in the text in its original form; they are inherent in the script style; you would use hints only for variants of these ligatures or infrequent absences of a ligature). -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Tue Jun 24 11:46:10 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 24 Jun 2014 17:46:10 +0100 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> Message-ID: <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> Le mardi, 24 juin 2014 à 
16:07, Markus Scherer a écrit : > > Does an algorithm that simply applies R1 *regardless of context* constitute a default case algorithm or not ? I.e. does simply mapping each character C in a string using Uppercase_Mapping (C) (e.g. as exposed by the XML UCD) constitute a default case conversion as mandated by the standard ? > > It implements simple uppercasing but not full uppercasing. Not really; IIUC simple uppercasing would occur if I used the Simple_Uppercase_Mapping property. I'm using the Uppercase_Mapping property of the XML UCD. > It misses simple, common things like ß -> SS (which is neither language-dependent nor context-sensitive). This is actually included in the Uppercase_Mapping property of the XML UCD. Having a look at the data, it seems that the Uppercase_Mapping property of the UCD includes (using the terminology of SpecialCasing.txt): * All the unconditional mappings of SpecialCasing.txt (context independent) * None of the conditional mappings of SpecialCasing.txt (context dependent) * None of the language sensitive mappings (context and language dependent) So what am I implementing if I just map a string using the XML UCD's Uppercase_Mapping property? Is that Unicode's default uppercase mapping? (I did file a bug about that as you suggested, text below for those who are interested) Best, Daniel ---- The default casing algorithms of §3.13 don't really make it clear *if* or *which* context and language dependent case mappings have to be applied in order to implement default case mapping algorithms. Besides, the definitions seem to contradict themselves. 1. The Definitions section seems to imply that all case mappings of SpecialCasing.txt and UnicodeData.txt have to be used in order to get the full case mapping properties of a character C. 2. The Tailoring section indicates that the SpecialCasing.txt file contains data to assist implementation of certain *tailorings* of the default case algorithm, which contradicts 1. 3. 
To muddy things further, the XML UCD exposes full case mapping properties that, as far as I can tell, contain only the context *insensitive* mappings of SpecialCasing.txt. This makes it hard to understand what should be done for implementing proper Unicode default case conversion. From markus.icu at gmail.com Tue Jun 24 16:01:32 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 24 Jun 2014 23:01:32 +0200 Subject: Default case algorithms In-Reply-To: <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> Message-ID: On Tue, Jun 24, 2014 at 6:46 PM, Daniel Bünzli wrote: > Having a look at the data it seems that the Uppercase_Mapping property of > UCD includes (using the terminology of SpecialCasing.txt): > > * All the unconditional mappings of SpecialCasing.txt (context independent) > * None of the conditional mappings of SpecialCasing.txt (context dependent) > * None of the language sensitive mappings (context and language dependent) > > So what am I implementing if I just map a string using the XML UCD's > Uppercase_Mapping property ? Is that Unicode's default uppercase mapping ? > I don't think it's any standard mapping at all. Sorry, I don't use the UCD XML files, and it's been a while since I looked at them. (For my use for ICU, they were missing some things, I guess including these pieces, and they were including things I didn't need. So I kept improving my .txt file parser instead. YMMV) markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Wed Jun 25 03:10:08 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jun 2014 09:10:08 +0100 Subject: Default case algorithms In-Reply-To: <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> Message-ID: <20140625091008.5be39819@JRWUBU2> On Tue, 24 Jun 2014 17:46:10 +0100 Daniel Bünzli wrote: > So what am I implementing if I just map a string using the XML UCD's > Uppercase_Mapping property ? Is that Unicode's default uppercase > mapping ? Yes - with the caveat that the uppercase mapping of U+0345 is too complicated to be defined formally. On the other hand, the Lowercase_Mapping property seems to be inadequate for the default lowercase mapping - Greek final sigma is the complication. Richard. From daniel.buenzli at erratique.ch Wed Jun 25 03:52:18 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Wed, 25 Jun 2014 09:52:18 +0100 Subject: Default case algorithms In-Reply-To: <20140625091008.5be39819@JRWUBU2> References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> <20140625091008.5be39819@JRWUBU2> Message-ID: Le mercredi, 25 juin 2014 à 09:10, Richard Wordingham a écrit : > Yes - with the caveat that the uppercase mapping of U+0345 is too > complicated to be defined formally. > > On the other hand, the Lowercase_Mapping property seems to be inadequate > for the default lowercase mapping - Greek final sigma is the > complication. So what you seem to imply is that Unicode's default full casing is defined by applying 1) The unconditional mappings of SpecialCasing.txt 2) The conditional mappings of SpecialCasing.txt (there's only one, the final sigma case). 
Best, Daniel From verdy_p at wanadoo.fr Wed Jun 25 07:37:39 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 25 Jun 2014 14:37:39 +0200 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> <20140625091008.5be39819@JRWUBU2> Message-ID: 2014-06-25 10:52 GMT+02:00 Daniel Bünzli : > Le mercredi, 25 juin 2014 à 09:10, Richard Wordingham a écrit : > > Yes - with the caveat that the uppercase mapping of U+0345 is too > > complicated to be defined formally. > > > > On the other hand, the Lowercase_Mapping property seems to be inadequate > > for the default lowercase mapping - Greek final sigma is the > > complication. > > So what you seem to imply is that Unicode's default full casing is > defined by applying > > 1) The unconditional mappings of SpecialCasing.txt > 2) The conditional mappings of SpecialCasing.txt (there's only one, the > final > sigma case). > There's also the Turkic i or j (problems related to letters that are usually soft-dotted in the Latin script except in Turkic languages, whose case mapping is context-dependent, requiring a look at the right-side context to see if we need to add a combining dot above). We could insist on having Turkish texts use an explicit combining dot above after the dotless i (or j), but most Turkish texts just use the plain ASCII letter, reinterpreting its soft dot as a hard dot that needs to be added when converting to uppercase and removed when converting to lowercase. Note also that the dotless i and dotless j are not part of any case pair. For Turkish readers, a dotless i followed by an explicit combining dot above (hard dot) is not recommended, and they use the standard ASCII letter directly, as if it were a precombined but decomposable letter. 
In Turkish texts, a dotless i without a diacritic pairs with the capital ASCII letter I directly (this mapping to uppercase is *not* contextual, but the reverse conversion to lowercase *is* contextual). -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Wed Jun 25 07:50:19 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Wed, 25 Jun 2014 13:50:19 +0100 Subject: Default case algorithms In-Reply-To: References: <4E23C2C4B91B4134964765A3700E7E76@erratique.ch> <66A33231F3B04D0AB2774CCB752B0935@erratique.ch> <20140625091008.5be39819@JRWUBU2> Message-ID: Le mercredi, 25 juin 2014 à 13:37, Philippe Verdy a écrit : > There's also the Turkic i or j (problems related to letters that are usually soft-dotted in the Latin script except in Turkic languages, whose case mapping is context-dependent, looking at the right-hand side to see if we need to add a combining dot above). Yes I know there are also language-specific case mappings, but it's unclear (see my previous messages in this discussion) whether this is part of default casing algorithms. (As far as I'm concerned I think this should rather be part of language-specific tailorings). Best, Daniel From richard.wordingham at ntlworld.com Wed Jun 25 12:58:55 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jun 2014 18:58:55 +0100 Subject: Corrigendum #9 In-Reply-To: References: Message-ID: <20140625185855.58a095ad@JRWUBU2> On Tue, 24 Jun 2014 09:16:00 -0400 CE Whitehead wrote: > ME: if two sequences are canonically equivalent except that one has > noncharacters in it, are these still canonically equivalent? Canonical equivalences are defined for all sequences of scalar values; it is just that they change from version to version for most unassigned characters. 
Non-characters only decompose to themselves and do not occur in the canonical (or indeed compatibility) decomposition of anything else, so a sequence containing a non-character cannot be canonically equivalent to a sequence not containing a non-character. > Regarding the sentinels; I am an outsider but assume that with > Corrigendum 9 U+FFFE will continue to be mentioned as having > generally (not always?) standard use throughout; in Chapter 16.7 it > is currently mentioned; I assume it will still be -- according to > info. in the FAQ and elsewhere: > http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit > unsigned hexadecimal value U+FFFE is not a Unicode character value, > and should be taken as a signal that Unicode characters should be > byte-swapped before interpretation. U+FFFE should only be interpreted > as an incorrectly byte-swapped version of U+FEFF" There is a lot of untruth in that FAQ entry, alas. I think U+FFFE and possibly U+FFFF should be treated differently from the other 64 non-characters. At present there is no certainty as to whether an interchanged file in the UTF-16 encoding scheme that appears to contain a BOM contains a BOM or starts with U+FFFE. The only promise is that such a file contains an even number of data bytes. Any such sequence is valid! Will the UTF-16 encoding scheme be withdrawn? Richard. From cewcathar at hotmail.com Thu Jun 26 11:15:24 2014 From: cewcathar at hotmail.com (CE Whitehead) Date: Thu, 26 Jun 2014 12:15:24 -0400 Subject: Corrigendum #9 Message-ID: From: Richard Wordingham Date: Wed, 25 Jun 2014 18:58:55 +0100 On Tue, 24 Jun 2014 09:16:00 -0400 > CE Whitehead wrote: >> ME: if two sequences are canonically equivalent except that one has >> noncharacters in it, are these still canonically equivalent? > Canonical equivalences are defined for all sequences of scalar values; > it is just that they change from version to version for most unassigned > characters. 
> Non-characters only decompose to themselves and do not > occur in the canonical (or indeed compatibility) decomposition of > anything else, so a sequence containing a non-character cannot be > canonically equivalent to a sequence not containing a non-character. My mistake, it's not "canonical equivalence" that Peter was talking about but "conformance" to the standard, so that a process can claim a character sequence is the same character sequence as that which was passed to it. (Thus I assume that a process can treat these two sequences (containing canonically equivalent characters but one with noncharacters) as different character sequences but does not have to do so.) Best, --C. E. Whitehead cewcathar at hotmail.com --from Maria de Ventadorn, 12th century -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Thu Jun 26 12:08:45 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 26 Jun 2014 10:08:45 -0700 Subject: Corrigendum #9 Message-ID: <20140626100845.665a7a7059d7ee80bb4d670165c8327d.ca6acb5803.wbe@email03.secureserver.net> Richard Wordingham wrote: > At present there is no certainty as to whether > an interchanged file in the UTF-16 encoding scheme that appears to > contain a BOM contains a BOM or starts with U+FFFE. The only > promise is that such a file contains an even number of data bytes. > Any such sequence is valid! Will the UTF-16 encoding scheme be > withdrawn? One might wonder, given how frequently we hear that unpaired surrogates also occur in the wild and need to be tolerated. 
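[Editorial note: the ambiguity Richard describes in the quote above is easy to demonstrate with Python's codecs (an illustration, nothing normative): the same four bytes decode to a single U+FEFF when the leading FF FE is sniffed as a little-endian BOM, but to two U+FFFE noncharacters under explicit big-endian decoding with no BOM logic.]

```python
raw = b'\xff\xfe\xff\xfe'

# The 'utf-16' codec sniffs and consumes a BOM: FF FE selects
# little-endian, and the remaining FF FE pair decodes as U+FEFF.
with_bom_logic = raw.decode('utf-16')        # '\ufeff'

# The 'utf-16-be' codec applies no BOM logic: each FF FE pair is U+FFFE.
without_bom_logic = raw.decode('utf-16-be')  # '\ufffe\ufffe'
```

Nothing in the bytes themselves says which reading was intended; only the label of the encoding scheme decides.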
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From richard.wordingham at ntlworld.com Sat Jun 28 22:59:46 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 29 Jun 2014 04:59:46 +0100 Subject: Denoting Abstract Substrings Message-ID: <20140629045946.19fe6209@JRWUBU2> I believe it is fairly natural to think of physical sequences of code units as representatives of equivalence classes corresponding to abstract strings. One of the most important such equivalences is canonical equivalence, though one might want to use some tailoring - in which case one would not have canonical equivalence. Given this abstraction, it is natural to want to be able to reference substrings. To this end, one may define the substrings of an abstract string to be the equivalence classes containing a physical substring of a physical string in the original string. (There is probably no need to restrict oneself to vaguely contiguous substrings.) For example, I might want to express U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (canonical decomposition <U+0061, U+0302>) as a substring of a string canonically equivalent to U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW, whose NFD equivalent is <U+0061, U+0323, U+0302>. This corresponds to a decomposition into a Vietnamese vowel (U+00E2) plus a Vietnamese tone mark (U+0323). Now, if the canonical decomposition of U+1EAD in UTF-8 is held as x = <61, CC, A3, CC, 82>, I might, adapting the boundary-based notation, choose to specify the *abstract* substring as the abstract substring x[0:1,3:5]. (This specifies code units 0, 3 and 4.) If I specify that extracted substrings always contain entire scalar values, I might, confusingly, abbreviate this notation to x[0, 3]. However, if U+1EAD is held as a single scalar value in the UTF-8 string y = <E1, BA, AD>, I want a notation that says, 'Take the first and third components of the NFD equivalent of the scalar value held starting at offset 0', e.g. "y[0.1, 0.3]". 
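[Editorial note: the decompositions in the example can be checked with Python's unicodedata. This sketch only verifies the bookkeeping; the x[0:1,3:5] notation itself is the proposal under discussion, not an existing API.]

```python
import unicodedata

# U+1EAD fully decomposed: a, combining dot below, combining circumflex.
nfd = unicodedata.normalize('NFD', '\u1ead')
assert nfd == '\u0061\u0323\u0302'
assert nfd.encode('utf-8') == b'\x61\xcc\xa3\xcc\x82'  # x = <61, CC, A3, CC, 82>

# x[0:1,3:5] keeps code units 0, 3 and 4, i.e. U+0061 plus U+0302,
# which recomposes to U+00E2.
sub = nfd[0] + nfd[2]
assert unicodedata.normalize('NFC', sub) == '\u00e2'
```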
Using my own notation is likely to cause confusion - are there any shared, workable schemes in use? I'm putting together a demonstration regular expression engine that works on 'traces' (see http://en.wikipedia.org/wiki/Trace_monoid for the definition of a 'trace') rather than strings, but for it to be 'useful' I see no reason to restrict it to searching text in NFD. I'm currently working on capturing subexpressions. My hope is that we will ultimately have regular expression engines that fully grok canonical equivalence. When RL2.1 in UTS#18 "Unicode Regular Expressions" last looked usable as a specification (Version 13, 2008), the requirement that "an implementation shall provide a mechanism for ensuring that all canonically equivalent literal characters match" was too weak for my desire as a user. Concatenation of expressions is rather more complicated for traces than for strings, though still within the scope of mathematical regular expressions, where 'regular' means recognisable by a finite automaton. There are issues with Unicode sets - does "K" match "\p{ASCII} & \p{block=Letterlike Symbols}"? (The simplest solution seems to be to exclude codepoints with singleton decompositions, such as U+212A KELVIN SIGN, from the set of scalar values in Unicode sets.) As an aside, I'd have liked to have added fully decomposed Unicode strings under canonical equivalence to the Wikipedia article as an example of traces, but I couldn't find a source. Richard. From andrea.giammarchi at gmail.com Sat Jun 28 12:33:17 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sat, 28 Jun 2014 10:33:17 -0700 Subject: meaningful and meaningless FE0E Message-ID: Dear all, this is my first email in this channel, so apologies in advance if this has already been discussed. I am trying to understand the expected behavior when there is an "unexpected VS15" after emoji that have not been defined as VS15-sensitive, according to this file: http://www.unicode.org/Public/UNIDATA/NamesList.txt. 
My take on FE0E is that all emoji that are sensitive to this variant have an "emojified" counterpart that should be used when followed by FE0F and, vice versa, a textual counterpart when followed by FE0E, but all other emoji should not consider such a variant at all, since there is no textual counterpart to represent, let's say, a U+1F4A9 pile-of-poo "\ud83d\udca9\ufe0e" Can anyone please confirm my expectations are correct: that the above sequence in both Java and JavaScript will show the POO emoji regardless, with the trailing FE0E variant simply ignored, and that actually no device/OS/renderer/viewer/browser would ever create such a sequence, so what I am trying to solve is actually a non-problem? Thanks in advance and Best Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrea.giammarchi at gmail.com Sun Jun 29 01:47:11 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sat, 28 Jun 2014 23:47:11 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: ok, here is the simplified version of my question: would U+1F4A9 followed by U+FE0E be represented differently from how U+1F4A9 is normally? is such a sequence even a real concern or an intent specified anywhere? (no, I can't find it, asking just for confirmation) Thanks a lot for any outcome! Best Regards On Sat, Jun 28, 2014 at 10:33 AM, Andrea Giammarchi < andrea.giammarchi at gmail.com> wrote: > Dear all, > this is my first email in this channel so apologies in advance if > already discussed. > > I am trying to understand the expected behavior when there is an "unexpected > VS15" after emoji that have not been defined, according to this file > http://www.unicode.org/Public/UNIDATA/NamesList.txt, as VS15 sensitive. 
> > My take on FE0E is that all emoji that are sensible to this variant, have > an "emojified" counter part that should be used when followed by FE0F and > vice-versa a textual part when followed by FE0E, but all other emoji should > not consider such variant at all since there's no textual counter part to > represent, let's say, a 1F21A pile-of-poo > > "\ud83d\udca9\ufe0e" > > Can anyone please confirm my expectations are correct so that above > sequence in both Java or JavaScript will show the POP emoji regardless, > followed by FE0E variant that will be simply ignored and actually no > device/OS/render/viewer/browser would ever create such sequence so it's > actually a non problem, this one I am trying to solve? > > Thanks in advance and Best Regards > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sun Jun 29 02:00:21 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 29 Jun 2014 09:00:21 +0200 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: These variation selector characters only apply to specific characters, those listed in http://unicode.org/Public/UNIDATA/StandardizedVariants.html There is a machine-readable version at http://unicode.org/Public/UNIDATA/StandardizedVariants.txt Mark *« Il meglio è l'inimico del bene »* On Sun, Jun 29, 2014 at 8:47 AM, Andrea Giammarchi < andrea.giammarchi at gmail.com> wrote: > ok, here the simplified version of my question: > > would U+1F21A followed by U+FE0E be represented differently from what U+1F21A > is normally? > > is such sequence even a real concern or intent specified anywhere? (no, > can't find it, asking just confirmation) > > Thanks a lot for any outcome! > > Best Regards > > > On Sat, Jun 28, 2014 at 10:33 AM, Andrea Giammarchi < > andrea.giammarchi at gmail.com> wrote: > >> Dear all, >> this is my first email in this channel so apologies in advance if >> already discussed. 
>> >> I am trying to understand the expected behavior when there an "unexpected >> VS15" after emoji that have not been defined, accordingly with this file >> http://www.unicode.org/Public/UNIDATA/NamesList.txt, as VS15 sensitive. >> >> Thanks in advance and Best Regards >> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrea.giammarchi at gmail.com Sun Jun 29 02:13:04 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sun, 29 Jun 2014 00:13:04 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: Thank You! On Sun, Jun 29, 2014 at 12:00 AM, Mark Davis ☕️ wrote: > These variation selector characters only apply to specific characters, > those listed in > > http://unicode.org/Public/UNIDATA/StandardizedVariants.html > > There is a machine-readable version at > http://unicode.org/Public/UNIDATA/StandardizedVariants.txt > > > Mark > > *« Il meglio è l'inimico del bene »* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sun Jun 29 03:51:02 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 29 Jun 2014 09:51:02 +0100 Subject: meaningful and meaningless FE0E In-Reply-To: References: Message-ID: <20140629095102.7671e2f3@JRWUBU2> On Sat, 28 Jun 2014 10:33:17 -0700 Andrea Giammarchi wrote: > I am trying to understand the expected behavior when there an > "unexpected VS15" after emoji that have not been defined, accordingly > with this file http://www.unicode.org/Public/UNIDATA/NamesList.txt, > as VS15 sensitive. Variation selectors are 'default ignorable' - if an implementation does not understand it, it should ignore it. In particular, Section 16.4 Version 6.3.0 of the Unicode Standard says that if the application does not understand the combination of base character and variation selector the variation selector should normally be ignored. This does not preclude the possibility that the renderer only has special modes, in all of which unknown variation selectors are displayed as flashing red question marks. > My take on FE0E is that all emoji that are sensible to this variant, > have an "emojified" counter part that should be used when followed by > FE0F and vice-versa a textual part when followed by FE0E, but all > other emoji should not consider such variant at all since there's no > textual counter part to represent, let's say, a 1F21A pile-of-poo > > "\ud83d\udca9\ufe0e" > > Can anyone please confirm my expectations are correct so that above > sequence in both Java or JavaScript will show the POP emoji > regardless, followed by FE0E variant that will be simply ignored and > actually no device/OS/render/viewer/browser would ever create such > sequence so it's actually a non problem, this one I am trying to > solve? There was nothing to stop me putting the sequence <U+1F4A9 PILE OF POO, U+FE0E VARIATION SELECTOR-15> in my reply. Moreover, there is nothing to stop the sequence becoming defined at some time in the future. Richard. 
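[Editorial note: the "ignore what you do not understand" behaviour Richard describes can be sketched as a pre-rendering filter. A sketch under assumptions: the `supported` set stands in for data a real implementation would derive from StandardizedVariants.txt or its emoji counterpart, and the Plane 14 selectors VS17..VS256 are left out for brevity.]

```python
VS = {chr(cp) for cp in range(0xFE00, 0xFE10)}   # VS1..VS16

def drop_unsupported_selectors(text, supported=frozenset()):
    """Remove variation selectors whose (base, selector) pair is not
    supported; unknown selectors are default-ignorable."""
    out = []
    for ch in text:
        if ch in VS and (not out or (out[-1], ch) not in supported):
            continue                             # ignore, do not render
        out.append(ch)
    return ''.join(out)
```

With this filter, U+1F4A9 followed by VS15 renders as bare U+1F4A9 unless the pair is in the supported set.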
From andrea.giammarchi at gmail.com Sun Jun 29 11:24:50 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sun, 29 Jun 2014 09:24:50 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: <20140629095102.7671e2f3@JRWUBU2> References: <20140629095102.7671e2f3@JRWUBU2> Message-ID: But today, where emoji are parsed correctly, that's not a couple of pointless empty squares but a POO emoji followed by an ignored FE0E, which matches my expectations according to today's standards. If tomorrow this changes for some reason, it's not a problem of today's parsers, and unless you intentionally create that sequence for your own purposes, no keyboard would automatically put such a sequence in a text field, since the sequence as it is is meaningless for today's standards. All good then, I've got my parser right :-) Thanks On Sun, Jun 29, 2014 at 1:51 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sat, 28 Jun 2014 10:33:17 -0700 > Andrea Giammarchi wrote: > > > I am trying to understand the expected behavior when there an > > "unexpected VS15" after emoji that have not been defined, accordingly > > with this file http://www.unicode.org/Public/UNIDATA/NamesList.txt, > > as VS15 sensitive. > > Variation selectors are 'default ignorable' - if an implementation > does not understand it, it should ignore it. In particular, > Section 16.4 Version 6.3.0 of the Unicode Standard says that if the > application does not understand the combination of base character and > variation selector the variation selector should normally be ignored. > This does not preclude the possibility that the renderer only has > special modes, in all of which unknown variation selectors are displayed > as flashing red question marks. 
> > > My take on FE0E is that all emoji that are sensible to this variant, > > have an "emojified" counter part that should be used when followed by > > FE0F and vice-versa a textual part when followed by FE0E, but all > > other emoji should not consider such variant at all since there's no > > textual counter part to represent, let's say, a 1F21A pile-of-poo > > > > "\ud83d\udca9\ufe0e" > > > > Can anyone please confirm my expectations are correct so that above > > sequence in both Java or JavaScript will show the POP emoji > > regardless, followed by FE0E variant that will be simply ignored and > > actually no device/OS/render/viewer/browser would ever create such > > sequence so it's actually a non problem, this one I am trying to > > solve? > > There was nothing to stop me putting the sequence <U+1F4A9 PILE > OF POO, U+FE0E VARIATION SELECTOR-15> in my reply. Moreover, there is > nothing to stop the sequence becoming defined at some time in the > future. > > Richard. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kojiishi at gluesoft.co.jp Sun Jun 29 13:44:05 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Sun, 29 Jun 2014 18:44:05 +0000 Subject: Characters that should be displayed? Message-ID: Hello Unicoders, I'm a co-editor of CSS Text Level 3[1], and I would appreciate your support in defining rendering behavior in CSS. The spec currently has the following text[2]: > Control characters (Unicode class Cc) other than tab (U+0009), line feed (U+000A), and carriage return (U+000D) are ignored for the purpose of rendering. (As required by [UNICODE], unsupported Default_ignorable characters must also be ignored for rendering.) and there's feedback saying that CSS should display visible glyphs for these control characters. 
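[Editorial note: the css-text-3 rule quoted above can be sketched as a character filter. Illustrative only: a real user agent operates on styled glyph runs and must also deal with unsupported default-ignorable code points, which this sketch leaves alone.]

```python
import unicodedata

def renderable(text):
    """Drop Cc controls other than tab, LF and CR, per the quoted rule."""
    keep = {'\t', '\n', '\r'}
    return ''.join(ch for ch in text
                   if ch in keep or unicodedata.category(ch) != 'Cc')
```

For example, a BEL (U+0007) embedded in a string is removed, while tabs and newlines survive.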
Since all major browsers do not display them today, this is a breaking change and the CSS WG needs to discuss this feedback. But the WG would appreciate understanding what Unicode recommends. I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring Characters in Processing"[3]: > Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph By looking at this, my questions are as follows: 1. Should control characters that browsers do not interpret be displayed in fallback rendering? 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? These two questions are probably yes from what I understand of the text quoted above, but things get harder the more I think: 3. When the above text says "surrogate code points", does that mean everything outside BMP? It reads so to me, but I'm surprised that characters in BMP and outside BMP have such differences, so I'm doubting my English skill. 4. Should every code point that is not given the Default_Ignorable_Code_Point property and that has neither an interpretation nor a glyph be displayed in fallback rendering? I could not find such a statement in the Unicode spec, but there are some people who believe so. 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? This must be RTFM, but pointing out where to read would be appreciated. Thank you for your support in advance. 
[1] http://dev.w3.org/csswg/css-text/ [2] http://dev.w3.org/csswg/css-text/#white-space-processing [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf /koji From richard.wordingham at ntlworld.com Sun Jun 29 13:44:31 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 29 Jun 2014 19:44:31 +0100 Subject: meaningful and meaningless FE0E In-Reply-To: References: <20140629095102.7671e2f3@JRWUBU2> Message-ID: <20140629194431.692a83ce@JRWUBU2> On Sun, 29 Jun 2014 09:24:50 -0700 Andrea Giammarchi wrote: > ...no keyboard would automatically put such sequence > in a text field since such sequence as it is is meaningless for today > standards. While perhaps no keyboard would map it to a single keystroke plus modifiers, direct hex input is sometimes the swiftest input method, as I found when transcribing some theorems a few days ago. Richard. From Shawn.Steele at microsoft.com Sun Jun 29 13:59:01 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 29 Jun 2014 18:59:01 +0000 Subject: Characters that should be displayed? In-Reply-To: References: Message-ID: <3fd544a9495b47c7a8273395b6b88532@BY2PR03MB491.namprd03.prod.outlook.com> If the concern is security, I cannot imagine why CSS would even want something like BELL to be legal at all. I'm not sure that replacement glyphs would help much. I mean would someone thing that ?Shawn was something spoofing Shawn, or just assume their browser/computer had a rendering glitch? I think most people would just ignore the unexpected character and assume something was quirky with the web page. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Koji Ishii Sent: Sunday, June 29, 2014 11:44 AM To: Unicode Mailing List Subject: Characters that should be displayed? Hello Unicoders, I'm a co-editor of CSS Text Level 3[1], and I would appreciate your support in defining rendering behavior in CSS. 
The spec currently has the following text[2]: > Control characters (Unicode class Cc) other than tab (U+0009), line feed (U+000A), and carriage return (U+000D) are ignored for the purpose of rendering. (As required by [UNICODE], unsupported Default_ignorable characters must also be ignored for rendering.) and there?s a feedback saying that CSS should display visible glyphs for these control characters. Since all major browsers do not display them today, this is a breaking-change and the CSS WG needs to discuss on this feedback. But the WG would appreciate to understand what Unicode recommends. I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring Characters in Processing?[3]: > Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph By looking at this, my questions are as follows: 1. Should control characters that browsers do not interpret be displayed in fallback rendering? 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? These two questions are probably yes from what I understand the text quoted above, but things get harder the more I think: 3. When the above text says ?surrogate code points?, does that mean everything outside BMP? It reads so to me, but I?m surprised that characters in BMP and outside BMP have such differences, so I?m doubting my English skill. 4. Should every code point that are not given the Default_Ignorable_Code_Point property and that without interpretations nor glyphs displayed in fallback rendering? I could not find such statement in Unicode spec, but there are some people who believe so. 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? 
This must be RTFM, but pointing out where to read would be appreciated. Thank you for your support in advance. [1] http://dev.w3.org/csswg/css-text/ [2] http://dev.w3.org/csswg/css-text/#white-space-processing [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf /koji _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From Shawn.Steele at microsoft.com Sun Jun 29 14:22:20 2014 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 29 Jun 2014 19:22:20 +0000 Subject: Characters that should be displayed? In-Reply-To: <3fd544a9495b47c7a8273395b6b88532@BY2PR03MB491.namprd03.prod.outlook.com> References: <3fd544a9495b47c7a8273395b6b88532@BY2PR03MB491.namprd03.prod.outlook.com> Message-ID: <33990cdde4094cc193cbbcad65612ae3@BY2PR03MB491.namprd03.prod.outlook.com> Corrected typo, sorry. (someone thing/someone think) -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele Sent: Sunday, June 29, 2014 11:59 AM To: Koji Ishii; Unicode Mailing List Subject: RE: Characters that should be displayed? If the concern is security, I cannot imagine why CSS would even want something like BELL to be legal at all. I'm not sure that replacement glyphs would help much. I mean would someone think that ?Shawn was something spoofing Shawn, or just assume their browser/computer had a rendering glitch? I think most people would just ignore the unexpected character and assume something was quirky with the web page. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Koji Ishii Sent: Sunday, June 29, 2014 11:44 AM To: Unicode Mailing List Subject: Characters that should be displayed? Hello Unicoders, I?m a co-editor of CSS Text Level 3[1], and I would appreciate your support in defining rendering behavior in CSS. 
The spec currently has the following text[2]: > Control characters (Unicode class Cc) other than tab (U+0009), line feed (U+000A), and carriage return (U+000D) are ignored for the purpose of rendering. (As required by [UNICODE], unsupported Default_ignorable characters must also be ignored for rendering.) and there?s a feedback saying that CSS should display visible glyphs for these control characters. Since all major browsers do not display them today, this is a breaking-change and the CSS WG needs to discuss on this feedback. But the WG would appreciate to understand what Unicode recommends. I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring Characters in Processing?[3]: > Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph By looking at this, my questions are as follows: 1. Should control characters that browsers do not interpret be displayed in fallback rendering? 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? These two questions are probably yes from what I understand the text quoted above, but things get harder the more I think: 3. When the above text says ?surrogate code points?, does that mean everything outside BMP? It reads so to me, but I?m surprised that characters in BMP and outside BMP have such differences, so I?m doubting my English skill. 4. Should every code point that are not given the Default_Ignorable_Code_Point property and that without interpretations nor glyphs displayed in fallback rendering? I could not find such statement in Unicode spec, but there are some people who believe so. 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? 
This must be RTFM, but pointing out where to read would be appreciated. Thank you for your support in advance. [1] http://dev.w3.org/csswg/css-text/ [2] http://dev.w3.org/csswg/css-text/#white-space-processing [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf /koji _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From asmusf at ix.netcom.com Sun Jun 29 14:24:05 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 29 Jun 2014 12:24:05 -0700 Subject: Characters that should be displayed? In-Reply-To: References: Message-ID: <53B067D5.6050102@ix.netcom.com> On 6/29/2014 11:44 AM, Koji Ishii wrote: >> Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph > By looking at this, my questions are as follows: > > 1. Should control characters that browsers do not interpret be displayed in fallback rendering? > 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be displayed in fallback rendering? > > These two questions are probably yes from what I understand the text quoted above, By displaying a fall-back rendering the user is alerted that something is present, but normally not visible to the user. However, these are not the only invisible characters, and many should not (must not) be rendered, ever (except in diagnostic modes). So, it is a bit unclear to me what precisely this recommendation buys you, as it is incomplete. The recommendation is prefixed with "To avoid security problems,...".
If this is taken to mean that it should apply in contexts that require strict attention to security issues, then they probably define a minimum of what should be done, and other measures need to be taken in addition. > but things get harder the more I think: > > 3. When the above text says "surrogate code points", does that mean everything outside BMP? It reads so to me, but I'm surprised that characters in BMP and outside BMP have such differences, so I'm doubting my English skill. No, those would be supplementary code points. Surrogates are values that are intended to be used in pairs as code units in UTF-16. Ill-formed data may contain unpaired values, those are referred to as Surrogate code points. > 4. Should every code point that is not given the Default_Ignorable_Code_Point property and that has no interpretation or glyph be displayed in fallback rendering? I could not find such statement in Unicode spec, but there are some people who believe so. > 5. Is there anything else Unicode recommends to display in fallback rendering, or not to display? This must be RTFM, but pointing out where to read would be appreciated. From jkorpela at cs.tut.fi Sun Jun 29 16:02:59 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 30 Jun 2014 00:02:59 +0300 Subject: Characters that should be displayed? In-Reply-To: References: Message-ID: <53B07F03.5010105@cs.tut.fi> 2014-06-29 21:44, Koji Ishii wrote: > The spec currently has the following text[2]: > >> Control characters (Unicode class Cc) other than tab (U+0009), line >> feed (U+000A), and carriage return (U+000D) are ignored for the >> purpose of rendering. (As required by [UNICODE], unsupported >> Default_ignorable characters must also be ignored for rendering.) > > and there's feedback saying that CSS should display visible glyphs > for these control characters. That would change the identity of the characters. They are by definition "control characters", i.e.
they have no visible glyphs, but they may have control effects. However, it might be argued that rendering them somehow would not mean normal rendering but be a diagnostic indication of an error. Those characters are invalid in HTML and XML (except XML 1.1, but who uses it?). However, the tradition of web browsers is permissive in order to be user-friendly. E.g., a casual control character somewhere might be interesting to a *developer* or maintainer to notice, so that he could analyze and fix the problem that caused it, but to a *user* (visitor), it would mostly be just disturbing. He can't fix the problem, and it is mostly useless to him to see that the page has some control character in the source. So *developer tools* should indicate such problems or provide ways to detect them, but it seems correct to ignore them in normal rendering. > Since all major browsers do not display > them today, this is a breaking-change Well, I would not take that as a strong argument. This would be a change in error processing. But it would not be useful for other reasons. > I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring > Characters in Processing"[3]: > >> Surrogate code points, private-use characters, and control >> characters are not given the Default_Ignorable_Code_Point property. >> To avoid security problems, such characters or code points, when >> not interpreted and not displayable by normal rendering, should be >> displayed in fallback rendering with a fallback glyph > > By looking at this, my questions are as follows: > > 1. Should control characters that browsers do not interpret be > displayed in fallback rendering? It is reasonable to interpret that there are no such control characters, because all control characters except those with special handling are interpreted as being invalid data and therefore ignored. 2. Should private-use characters > (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be > displayed in fallback rendering?
They might be seen as "not displayable by normal rendering", so yes. On the practical side, although Private Use characters should not be used in public information interchange, they are increasingly popular in "icon font" tricks. Whatever we think of such tricks, users should not be punished for them. If the trick fails (usually because a page uses a downloadable font for icon glyphs allocated to Private Use codepoints but something prevents the use of such a font), it is relevant to the user to know that there is *some* data, which can be crucial (e.g., an item in a navigation menu). So some dull fallback rendering is probably better than simply ignoring the characters. > 3. When the above text says "surrogate code points", does that mean > everything outside BMP? No, it means code points that do not represent *any* characters due to being in certain special areas in the coding space. They are invalid in HTML and in XML. If they appear in data, the reason is usually that UTF-16 encoded data containing non-BMP characters is being processed in a wrong way. At the level of interpreting a byte stream as a stream of characters, surrogate code *units* in UTF-16 should be processed and interpreted in pairs so that one pair is taken as one character. And when CSS gets at it, it only sees the character in the DOM. It is adequate to ignore surrogate code points, since they are invalid and signalling them to users (as opposed to developers) would hardly do any good. > 4. Should every code point that is not > given the Default_Ignorable_Code_Point property and that has no > interpretation or glyph be displayed in fallback rendering? I could > not find such statement in Unicode spec, but there are some people > who believe so. > 5. Is there anything else Unicode recommends to > display in fallback rendering, or not to display? This must be RTFM, > but pointing out where to read would be appreciated.
From the Unicode point of view, an implementation may decide what characters it supports. What it does to characters that it does not support seems to be generally up to the implementation to decide as regards rendering. Here, too, I would consider the practical impact on users. If a page contains characters that have no glyphs in the fonts that are used, then the page has data that is probably valid but cannot be rendered in a particular situation. Showing some indication of this is relevant, because the user knows he is missing something real, and he might be able to fix the situation in various ways (e.g., changing browser settings, downloading and installing extra fonts, or just switching to a different browser; browsers are known to differ in their abilities to use the fonts installed in a system). Yucca From prosfilaes at gmail.com Sun Jun 29 16:48:34 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 29 Jun 2014 14:48:34 -0700 Subject: Characters that should be displayed? In-Reply-To: <53B07F03.5010105@cs.tut.fi> References: <53B07F03.5010105@cs.tut.fi> Message-ID: On Sun, Jun 29, 2014 at 2:02 PM, Jukka K. Korpela wrote: > They might be seen as "not displayable by normal rendering", so yes. On the > practical side, although Private Use characters should not be used in public > information interchange, they are increasingly popular in "icon font" > tricks. Since when is HTML necessarily public information interchange? I can't imagine where you would better use private use characters than in HTML, where a font can be named but you don't have enough control over the format to enter the data in some other format. -- Kie ekzistas vivo, ekzistas espero.
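The Private Use ranges under discussion (the three ranges listed in Koji's question 2) are easy to test directly; a minimal Python sketch, with the function name chosen for illustration:

```python
def is_private_use(cp: int) -> bool:
    """True for the Private Use code points listed in question 2:
    the BMP PUA plus the two supplementary PUA planes (15 and 16)."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

assert is_private_use(0xE001)        # a typical "icon font" code point
assert is_private_use(0x10FFFD)      # last PUA code point in plane 16
assert not is_private_use(ord("A"))  # ordinary assigned character
```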
From andrea.giammarchi at gmail.com Sun Jun 29 19:59:17 2014 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Sun, 29 Jun 2014 17:59:17 -0700 Subject: meaningful and meaningless FE0E In-Reply-To: <20140629194431.692a83ce@JRWUBU2> References: <20140629095102.7671e2f3@JRWUBU2> <20140629194431.692a83ce@JRWUBU2> Message-ID: It does not matter, the example POP should be visible, followed by an ignored FE0E ... I think we are good here, nothing else to clarify from my side. Thanks :-) On Sun, Jun 29, 2014 at 11:44 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 29 Jun 2014 09:24:50 -0700 > Andrea Giammarchi wrote: > > > ...no keyboard would automatically put such sequence > > in a text field since such sequence as it is is meaningless for today > > standards. > > While perhaps no keyboard would map it to a single keystroke plus > modifiers, direct hex input is sometimes the swiftest input method, as > I found when transcribing some theorems a few days ago. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Mon Jun 30 00:00:53 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 30 Jun 2014 08:00:53 +0300 Subject: Characters that should be displayed? In-Reply-To: References: <53B07F03.5010105@cs.tut.fi> Message-ID: <53B0EF05.2000603@cs.tut.fi> 2014-06-30 0:48, David Starner wrote: > On Sun, Jun 29, 2014 at 2:02 PM, Jukka K. Korpela wrote: >> They might be seen as ?not displayable by normal rendering?, so yes. On the >> practical side, although Private Use characters should not be used in public >> information interchange, they are increasingly popular in ?icon font? >> tricks. > > Since when is HTML necessarily public information interchange? Since 1990. ? 
Seriously, HTML was designed for public information interchange, and this is still its dominant use and regularly implied when discussing HTML. Besides, even when the use is not public in a strict sense, it is generally based on client technologies that have no provisions for private agreements, in the sense of agreeing on meanings for Private Use codepoints. Web browsers and other HTML renderers have special interpretations for some characters (markup-significant characters, special treatment of some input characters, etc.) but no mechanism for adding rules that say something about Private Use characters. The reason why "icon font" tricks mostly work is that browsers treat most codepoints so that they try to render them using some fonts, under the influence of CSS, and in CSS you can nowadays pretty reliably, but not 100% reliably, use @font-face to specify a specific font to be used. The issue here, however, is what happens when the trick fails, for one reason or another. Private Use codepoints are mostly attempts at presenting some glyphs, rather than accidental occurrences of data that is best ignored (like control characters mostly are, e.g. NUL inserted by server-side software or an authoring tool). > I can't > imagine where you would better use private use characters than in HTML > where a font can be named but you don't have enough control over the > format to enter the data in some other format. Applications that operate on plain text and use one fixed but configurable font are a much better example. If you need to use, say, a currency symbol that has not yet been added to Unicode but can be included in the font, then a Private Use codepoint is the only good way (and the only other way is to put the glyph into a code position allocated for some defined character, like ????this would work in practice, but it's really not recommended).
In HTML, on the other hand, you can instead use images, and CSS lets you scale the images to the font size if desired. Yucca From prosfilaes at gmail.com Mon Jun 30 01:42:15 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 29 Jun 2014 23:42:15 -0700 Subject: Characters that should be displayed? In-Reply-To: <53B0EF05.2000603@cs.tut.fi> References: <53B07F03.5010105@cs.tut.fi> <53B0EF05.2000603@cs.tut.fi> Message-ID: On Sun, Jun 29, 2014 at 10:00 PM, Jukka K. Korpela wrote: > Applications that operate on plain text and use one fixed but configurable > font are a much better example. If you need to use, say, a currency symbol > that has not yet been added to Unicode but can be included in the font, then > a Private Use codepoint is the only good way Or record the character using some form of escape. I'm not thinking of many applications that operate on plain text that aren't processed before display to an end user, and there's a reason why currency is recorded by 3 ASCII characters. > In HTML, on the other hand, you can instead use images, and CSS lets you > scale the images to the font size if desired And that's problematic, for the exact same reasons using images of text is always problematic. It can't be copied and then searched for or pasted, and you practically have to write it in ASCII or PUA and transliterate it into references to images. PUA is never necessary if you have your own application, as you can transfer data in whatever format with your own application. It's most useful with standard formats, like HTML and email, where the PUA lets someone use letters or scripts almost like they were encoded. -- Kie ekzistas vivo, ekzistas espero. From jjc at jclark.com Mon Jun 30 08:37:23 2014 From: jjc at jclark.com (James Clark) Date: Mon, 30 Jun 2014 20:37:23 +0700 Subject: Characters that should be displayed?
In-Reply-To: References: Message-ID: A couple of your questions are addressed by: http://www.unicode.org/faq/unsup_char.html In particular: Q: Which characters should be displayed with a missing glyph, if not supported? A: All characters other than whitespace and default-ignorable characters. James On Mon, Jun 30, 2014 at 1:44 AM, Koji Ishii wrote: > Hello Unicoders, > > I'm a co-editor of CSS Text Level 3[1], and I would appreciate your > support in defining rendering behavior in CSS. > > The spec currently has the following text[2]: > > > Control characters (Unicode class Cc) other than tab (U+0009), line feed > (U+000A), and carriage return (U+000D) are ignored for the purpose of > rendering. (As required by [UNICODE], unsupported Default_ignorable > characters must also be ignored for rendering.) > > and there's feedback saying that CSS should display visible glyphs for > these control characters. Since all major browsers do not display them > today, this is a breaking-change and the CSS WG needs to discuss this > feedback. But the WG would appreciate understanding what Unicode recommends. > > I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring > Characters in Processing"[3]: > > > Surrogate code points, private-use characters, and control characters > are not given the Default_Ignorable_Code_Point property. To avoid security > problems, such characters or code points, when not interpreted and not > displayable by normal rendering, should be displayed in fallback rendering > with a fallback glyph > > By looking at this, my questions are as follows: > > 1. Should control characters that browsers do not interpret be displayed > in fallback rendering? > 2. Should private-use characters (U+E000-F8FF, 0F0000-0FFFFD, > 100000-10FFFD) without glyphs be displayed in fallback rendering? > > These two questions are probably yes from what I understand the text > quoted above, but things get harder the more I think: > > 3.
When the above text says "surrogate code points", does that mean > everything outside BMP? It reads so to me, but I'm surprised that > characters in BMP and outside BMP have such differences, so I'm doubting my > English skill. > 4. Should every code point that is not given the > Default_Ignorable_Code_Point property and that has no interpretation or > glyph be displayed in fallback rendering? I could not find such statement in > Unicode spec, but there are some people who believe so. > 5. Is there anything else Unicode recommends to display in fallback > rendering, or not to display? This must be RTFM, but pointing out where to > read would be appreciated. > > Thank you for your support in advance. > > [1] http://dev.w3.org/csswg/css-text/ > [2] http://dev.w3.org/csswg/css-text/#white-space-processing > [3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf > > /koji > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ritt.ks at gmail.com Mon Jun 30 10:59:54 2014 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Mon, 30 Jun 2014 18:59:54 +0300 Subject: Characters that should be displayed? In-Reply-To: <53B067D5.6050102@ix.netcom.com> References: <53B067D5.6050102@ix.netcom.com> Message-ID: 2014-06-29 22:24 GMT+03:00 Asmus Freytag : > but things get harder the more I think: >> >> 3. When the above text says "surrogate code points", does that mean >> everything outside BMP? It reads so to me, but I'm surprised that >> characters in BMP and outside BMP have such differences, so I'm doubting my >> English skill. >> > > No, those would be supplementary code points. Surrogates are values that > are intended to be used in pairs as code units in UTF-16. Ill-formed data > may contain unpaired values, those are referred to as Surrogate code points.
> > IIRC, after HTML parsing, validating and building DOM, no single surrogate code point could be met, since the presence of any ill-formed data in the Unicode text makes the whole text ill-formed. It's a security recommendation to decoders to replace any unpaired surrogate code point with U+FFFD instead, thus making the text well-formed. As a side effect, the unpaired surrogate code point becomes visible (usually as a square box fallback glyph). What is the consideration regarding U+FFFD in CSS? Konstantin -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 30 11:33:11 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 30 Jun 2014 18:33:11 +0200 Subject: Characters that should be displayed? In-Reply-To: References: <53B067D5.6050102@ix.netcom.com> Message-ID: I generally agree with your comment. For your question, U+FFFD is not special in CSS; it's just a standard character that will be mapped to some symbol (from any font, or synthesized from an internal font (or collection of glyphs) of the renderer according to other styles (there's no guarantee that styles like italic or bold will look different; in fact there's no good way to exhibit alternatives if the renderer does not look up a matching font, but at least the renderer should size it according to the computed "font-size:" setting). That symbol is often (but not necessarily) a "white" question mark in a "black" diamond; replace "white" in fact by the background color/image/shades, and "black" by the "color:" setting, just like in regular fonts mapping any other symbol).
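As an aside, the decoder behavior Konstantin describes (replacing an unpaired surrogate with U+FFFD rather than silently dropping it) is easy to observe; a minimal Python sketch, assuming UTF-16LE input:

```python
# A lone high surrogate (U+D800) as a single UTF-16LE code unit:
lone = b"\x00\xd8"

# A security-conscious decoder substitutes U+FFFD for the ill-formed
# unit instead of silently dropping it:
assert lone.decode("utf-16-le", errors="replace") == "\ufffd"

# A well-formed supplementary character occupies two code units
# (a surrogate pair):
pair = "\U00010348".encode("utf-16-le")   # U+10348 GOTHIC LETTER HWAIR
assert len(pair) == 4
```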
This symbol should also have an inherited direction, not a strong LTR direction: it should not alter the direction of text on either side (or break runs of text) for Bidi rendering, but it may eventually be mirrored in resolved RTL runs (if this is appropriate for the chosen glyph (not always easy to determine if the symbol is chosen from a matching font in context; but as the symbol to use is quite arbitrary, and should be distinctive enough from other characters, this mirroring is not really necessary, unless the symbol shows some explicit text in a specific style; something to avoid as the character is not specific to any script or language). 2014-06-30 17:59 GMT+02:00 Konstantin Ritt : > 2014-06-29 22:24 GMT+03:00 Asmus Freytag : > >> but things get harder the more I think: >>> >>> 3. When the above text says "surrogate code points", does that mean >>> everything outside BMP? It reads so to me, but I'm surprised that >>> characters in BMP and outside BMP have such differences, so I'm doubting my >>> English skill. >>> >> >> No, those would be supplementary code points. Surrogates are values that >> are intended to be used in pairs as code units in UTF-16. Ill-formed data >> may contain unpaired values, those are referred to as Surrogate code points. >> >> > IIRC, after HTML parsing, validating and building DOM, no single > surrogate code point could be met, since the presence of any ill-formed data > in the Unicode text makes the whole text ill-formed. > It's a security recommendation to decoders to replace any > unpaired surrogate code point with U+FFFD instead, thus making the text > well-formed. As a side effect, the unpaired surrogate code point becomes > visible (usually as a square box fallback glyph). > What is the consideration regarding U+FFFD in CSS?
> > > Konstantin > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kojiishi at gluesoft.co.jp Mon Jun 30 13:35:22 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Mon, 30 Jun 2014 18:35:22 +0000 Subject: Characters that should be displayed? In-Reply-To: References: <53B067D5.6050102@ix.netcom.com> Message-ID: <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> Thank you all for this great feedback. I learned a lot. I, however, still don't get one thing. In the spec text: Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph How could displaying a missing PUA glyph help security? I can imagine the address bar could have such security risks, but this is about rendering. I can imagine 0x00 could lead to buffer overflow attacks, but it looks to me that preventing such characters from being inserted into the DOM is safer, though I admit that I'm not a professional in security at all. I understand some here wants to display them to help users to identify broken characters, some consider it doesn't help users at all. I tend to agree with the latter, but either way, it's about helping users to fix their documents. Does anyone know what security risks the spec is talking about? /koji On Jul 1, 2014, at 1:33 AM, Philippe Verdy > wrote: I generally agree with your comment.
For your question, U+FFFD is not special in CSS; it's just a standard character that will be mapped to some symbol (from any font, or synthesized from an internal font (or collection of glyphs) of the renderer according to other styles (there's no guarantee that styles like italic or bold will look different; in fact there's no good way to exhibit alternatives if the renderer does not look up a matching font, but at least the renderer should size it according to the computed "font-size:" setting). That symbol is often (but not necessarily) a "white" question mark in a "black" diamond; replace "white" in fact by the background color/image/shades, and "black" by the "color:" setting, just like in regular fonts mapping any other symbol). This symbol should also have an inherited direction, not a strong LTR direction: it should not alter the direction of text on either side (or break runs of text) for Bidi rendering, but it may eventually be mirrored in resolved RTL runs (if this is appropriate for the chosen glyph (not always easy to determine if the symbol is chosen from a matching font in context; but as the symbol to use is quite arbitrary, and should be distinctive enough from other characters, this mirroring is not really necessary, unless the symbol shows some explicit text in a specific style; something to avoid as the character is not specific to any script or language). 2014-06-30 17:59 GMT+02:00 Konstantin Ritt >: 2014-06-29 22:24 GMT+03:00 Asmus Freytag >: but things get harder the more I think: 3. When the above text says "surrogate code points", does that mean everything outside BMP? It reads so to me, but I'm surprised that characters in BMP and outside BMP have such differences, so I'm doubting my English skill. No, those would be supplementary code points. Surrogates are values that are intended to be used in pairs as code units in UTF-16. Ill-formed data may contain unpaired values, those are referred to as Surrogate code points.
IIRC, after HTML parsing, validating and building DOM, no single surrogate code point could be met, since the presence of any ill-formed data in the Unicode text makes the whole text ill-formed. It's a security recommendation to decoders to replace any unpaired surrogate code point with U+FFFD instead, thus making the text well-formed. As a side effect, the unpaired surrogate code point becomes visible (usually as a square box fallback glyph). What is the consideration regarding U+FFFD in CSS? Konstantin _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.riefenstahl at turtle-trading.net Mon Jun 30 15:47:39 2014 From: b.riefenstahl at turtle-trading.net (Benjamin Riefenstahl) Date: Mon, 30 Jun 2014 22:47:39 +0200 Subject: Problem with Mandaic shaping, IT and IN switched Message-ID: Hi everybody, I am currently in the process of designing a simple OpenType font for Mandaic. As some of you are probably aware, shaping in OpenType as it is recommended by the OpenType standard requires that the application (i.e. the text rendering engine) knows the joining behaviour of the characters. It seems that there is an error in the joining data for Mandaic as defined by the Unicode standard (tables 14-5 and 14-6, chapter 14.12 in version 6.3) and by the file ArabicShaping.txt at http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt. The tables list the character IT as dual-joining and the character IN as right-joining. These two seem to be switched. In the table columns with the actual characters (columns Xn, Xr, Xm, Xl) the correct characters are given (compare the code chart at http://www.unicode.org/charts/PDF/U0840.pdf), but the names (and the relative positions in the tables) are wrong, and that error is then taken over into the file ArabicShaping.txt: 0847; MANDAIC IT; D; No_Joining_Group [...]
084F; MANDAIC IN; R; No_Joining_Group The correct characters in the table should be (in this order) * Dual-Joining: ATT, AK, AL, AM, AS, IN, AP, ASZ, AQ, AR, AT * Right-Joining: HALQA, AZ, IT, AKSA, ASH And the correct data in ArabicShaping.txt: 0847; MANDAIC IT; R; No_Joining_Group 084F; MANDAIC IN; D; No_Joining_Group Please advise what I can do to help correct this in some future version of the Unicode standard. Regards, Benjamin Riefenstahl From prosfilaes at gmail.com Mon Jun 30 19:14:40 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 30 Jun 2014 17:14:40 -0700 Subject: Characters that should be displayed? In-Reply-To: <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> Message-ID: On Mon, Jun 30, 2014 at 11:35 AM, Koji Ishii wrote: > I understand some here wants to display them to help users to identify > broken characters, some consider it doesn't help users at all. I tend to > agree with the latter, but either way, it's about helping users to fix their > documents. If your browser stops recognizing Japanese, would you prefer to see fallback characters or nothing? For me, I'd much rather see fallback characters because then the question is why aren't these characters displaying and not why is this webpage blank. -- Kie ekzistas vivo, ekzistas espero. From kojiishi at gluesoft.co.jp Mon Jun 30 23:12:56 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Tue, 1 Jul 2014 04:12:56 +0000 Subject: Characters that should be displayed? In-Reply-To: References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> Message-ID: <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp> On Jul 1, 2014, at 9:14 AM, David Starner wrote: On Mon, Jun 30, 2014 at 11:35 AM, Koji Ishii wrote: >> I understand some here wants to display them to help users to identify >> broken characters, some consider it doesn't help users at all.
I tend to >> agree with the latter, but either way, it's about helping users to fix their >> documents. > > If your browser stops recognizing Japanese, would you prefer to see > fallback characters or nothing? For me, I'd much rather see fallback > characters because then the question is why aren't these characters > displaying and not why is this webpage blank. Thanks for the reply. It's very likely that the page contains images, borders, background, etc., so I can recognize that all the text is missing. But neither text missing nor text garbled suggests to me how to fix it. I'd try another browser, then give up viewing the page. But the scenario is still not a security issue. Whether it's a security issue or a feature makes a big difference for us to discuss how important this change is, so I'm interested in knowing what kind of security aspects showing fallback glyphs can help, when considering browser rendering. Any thoughts? /koji From prosfilaes at gmail.com Mon Jun 30 23:49:57 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 30 Jun 2014 21:49:57 -0700 Subject: Characters that should be displayed? In-Reply-To: <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp> References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp> Message-ID: On Mon, Jun 30, 2014 at 9:12 PM, Koji Ishii wrote: > Thanks for the reply. It's very likely that the page contains images, borders, background, etc., so I can recognize that all the text is missing. But neither text missing nor text garbled suggests to me how to fix it. I'd try another browser, then give up viewing the page. If it didn't suggest how to fix it to you before today, it should suggest it to you today. If you get a bunch of fallback characters, your first guess should be font problems. Anyone using scripts with poor support, especially stuff stored in the PUA, will recognize right off when the text isn't displaying.
-- Kie ekzistas vivo, ekzistas espero.