From unicode at unicode.org Fri Jun 1 23:44:29 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 2 Jun 2018 05:44:29 +0100
Subject: Hyphenation Markup
Message-ID: <20180602054429.1ef142ab@JRWUBU2>

In Latin text, one can indicate permissible line break opportunities between grapheme clusters by inserting U+00AD SOFT HYPHEN. What low-end schemes, if any, exist for such mark-up within grapheme clusters?

The visual effect I wish to enable can be presented simply as: character-1 followed by a hyphen at the end of one line, with character-2 at the start of the next line. Character-1 is a base character, character-2 is a spacing combining mark. Without a line break, one would simply see the two characters rendered together as a single cluster.

Alternatively, how might I give general permission for such a break?

Richard.

From unicode at unicode.org Sat Jun 2 04:06:43 2018
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Sat, 2 Jun 2018 11:06:43 +0200
Subject: Hyphenation Markup
In-Reply-To: <20180602054429.1ef142ab@JRWUBU2>
References: <20180602054429.1ef142ab@JRWUBU2>
Message-ID:

On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:
> In Latin text, one can indicate permissible line break opportunities
> between grapheme clusters by inserting U+00AD SOFT HYPHEN. What
> low-end schemes, if any, exist for such mark-up within grapheme
> clusters?

What about U+200B ZWSP?

> this character is intended for invisible word
> separation and for line break control; it has no
> width, but its presence between two characters
> does not prevent increased letter spacing in
> justification

Best wishes,
Otto Stolz

From unicode at unicode.org Sat Jun 2 06:37:45 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 2 Jun 2018 12:37:45 +0100
Subject: Hyphenation Markup
In-Reply-To:
References: <20180602054429.1ef142ab@JRWUBU2>
Message-ID: <20180602123745.7a8b47b9@JRWUBU2>

On Sat, 2 Jun 2018 11:06:43 +0200 Otto Stolz via Unicode wrote:
> What about U+200B ZWSP?

Thanks for the suggestion, but it's not likely to work. Within a word and with a proper layout implementation, using ZWSP would be worse than using the plain backing store:

1) In the sequence <character-1, SOFT HYPHEN, character-2>, realisation of the break should definitely result in character-1 followed by a hyphen on one line and character-2 on the next line, whereas in visual order, character-2 should precede character-1.

2) Use of ZWSP will usually result in a dotted circle even when the break does not occur.

3) ZWSP will result in a mandatory word boundary. That will cause problems with the spell checker.

I've experimented (http://wrdingham.co.uk/lanna/renderer_test.htm#test_and_tell) with such combinations where there is a default grapheme cluster boundary between the two characters. I get generally better results with SHY than ZWSP. The downside was that the rendering systems I tried seemed to insist on inserting the glyph of U+002D or U+2010, rather than the glyph of U+00AD.

Incidentally, does CLDR define the rendering of soft hyphen, or is one entirely at the mercy of the application?

Richard.
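To make the two candidate characters concrete, here is a minimal Rust illustration of the code points under discussion; the word chosen is just an example:

    fn main() {
        // U+00AD SOFT HYPHEN: an invisible break opportunity that is
        // rendered as a hyphen only if a line break is actually taken there.
        let with_shy = "hy\u{00AD}phen\u{00AD}a\u{00AD}tion";

        // U+200B ZERO WIDTH SPACE: also an invisible break opportunity,
        // but no hyphen appears at the break, and it introduces a word
        // boundary (the spell-checker problem raised above).
        let with_zwsp = "hy\u{200B}phen\u{200B}a\u{200B}tion";

        // Neither character occupies visible width when no break occurs.
        println!("{} {}", with_shy, with_zwsp);
    }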
From unicode at unicode.org Sat Jun 2 15:33:01 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 2 Jun 2018 14:33:01 -0600
Subject: Hyphenation Markup
In-Reply-To:
References:
Message-ID: <7A7A217F82E64367837AC03EA5D56CC8@DougEwell>

Richard Wordingham wrote:

>> What about U+200B ZWSP?
>
> Thanks for the suggestion, but it's not likely to work:

Are you asking what schemes exist, or are you trying to call attention to some rendering engine and/or font that doesn't render a combination as it should?

> 1) In the sequence <character-1, SOFT HYPHEN, character-2>, realisation
> of the break should definitely result in character-1 followed by a
> hyphen on one line and character-2 on the next line, whereas in visual
> order, character-2 should precede character-1.

This is too general for me to parse. Can you replace these hypotheticals with actual characters, using code points, or at least with actual General Categories? For example, an 'Mc' followed by ZWSP followed by an 'Lo' displays like such-and-so. The code points would be best.

> Incidentally, does CLDR define the rendering of soft hyphen, or is one
> entirely at the mercy of the application?

Why would this be a CLDR thing?

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Jun 2 19:26:40 2018
From: unicode at unicode.org (Martin J. Dürst via Unicode)
Date: Sun, 3 Jun 2018 09:26:40 +0900
Subject: Hyphenation Markup
In-Reply-To: <20180602123745.7a8b47b9@JRWUBU2>
References: <20180602054429.1ef142ab@JRWUBU2> <20180602123745.7a8b47b9@JRWUBU2>
Message-ID:

Hello Richard,

On 2018/06/02 20:37, Richard Wordingham via Unicode wrote:
>> In Latin text, one can indicate permissible line break opportunities
>> between grapheme clusters by inserting U+00AD SOFT HYPHEN. What
>> low-end schemes, if any, exist for such mark-up within grapheme
>> clusters?
> 1) In the sequence <character-1, SOFT HYPHEN, character-2>, realisation
> of the break should definitely result in character-1 followed by a
> hyphen on one line and character-2 on the next line, whereas in visual
> order, character-2 should precede character-1.

My question goes a bit further than Doug's: Why would you want to do such a thing? Are there actual scripts/languages where line breaks within grapheme clusters occur? If yes, what are they? Can you show actual examples, e.g. scans of documents? In writing systems, there are almost always exceptions to simple rules, but in general, breaking a line *within* a grapheme cluster seems to be a bad idea.

Regards, Martin.

From unicode at unicode.org Sat Jun 2 22:31:32 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 3 Jun 2018 04:31:32 +0100
Subject: Hyphenation Markup
In-Reply-To: <7A7A217F82E64367837AC03EA5D56CC8@DougEwell>
References: <7A7A217F82E64367837AC03EA5D56CC8@DougEwell>
Message-ID: <20180603043132.68a36455@JRWUBU2>

On Sat, 2 Jun 2018 14:33:01 -0600 Doug Ewell via Unicode wrote:

> Are you asking what schemes exist, or are you trying to call
> attention to some rendering engine and/or font that doesn't render a
> combination as it should?

I'm asking what exists, or is reasonably supposed to exist.

> This is too general for me to parse. Can you replace these
> hypotheticals with actual characters, using code points, or at least
> with actual General Categories? For example, an 'Mc' followed by ZWSP
> followed by an 'Lo' displays like such-and-so. The code points would
> be best.
On Sun, 3 Jun 2018 09:26:40 +0900 "Martin J. Dürst via Unicode" wrote:

> My question goes a bit further than Doug's: Why would you want to
> do such a thing? Are there actual scripts/languages where line breaks
> within grapheme clusters occur? If yes, what are they? Can you show
> actual examples, e.g. scans of documents?

Three examples are given on p230 of the dissertation "Buddhist Monks and their Search for Knowledge: an examination of the personal collection of manuscripts of Phra Khamchan Virachitto (1920-2007), Abbot of Vat Saen Sukharam, Luang Prabang" by Bounleuth Sengsoulin, available at http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf . The text is in Lao in the Tham script. The transcriptions in the text are transliterated to the Lao script.

The first example, transliterated to Lao, is ???, which one could encode with an embedded SOFT HYPHEN, provided the soft hyphen had no visual representation beyond the line break. (Strictly, it's a break for a hole for a string.) The third example is likewise ???. (I can't make out the second example.)

However, the text is actually in the Tham script, and without any line-breaking controls, the first and third examples read, marking the grapheme cluster boundaries with '|', as ???? <... MA, U+1A60 TAI THAM SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E TAI THAM VOWEL SIGN E> and ???? <..., U+1A60 TAI THAM SIGN SAKOT | U+1A45 TAI THAM LETTER WA, U+1A71 TAI THAM VOWEL SIGN AI>. The internal grapheme cluster boundaries are purely stopping points for cursor movement; they correspond to nothing graphical and to nothing in user conception. The natural internal boundaries are just before the vowels, which are written on the left, and between the base and subscript characters, i.e. before U+1A60.

There seem to be Northern Thai Pali examples in the proposal L2/2007-007 at the end of https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf Figure 9a Page 2 Line 3, and at the end of Figure 9b Page 1 Line 2, but I can't read the Pali well enough to be sure that the apparent visually line-final instances of TAI THAM SIGN E are not just scribal blunders.

Reverting to Doug's reply:

>> Incidentally, does CLDR define the rendering of soft hyphen, or is
>> one entirely at the mercy of the application?
> Why would this be a CLDR thing?

Because the rendering is quite likely to depend on locale. I had always understood that Thai did not mark breaks in words - and then I discovered them in the Royal Institute Dictionary! The correct German rendering of soft hyphens has recently changed. There are also subtle effects when Dutch words are hyphenated. These rules are not the same as for English, but Unicode tends not to deal in dependencies finer than a script.

Richard.

From unicode at unicode.org Sun Jun 3 05:50:54 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 3 Jun 2018 11:50:54 +0100
Subject: Hyphenation Markup
In-Reply-To: <20180603043132.68a36455@JRWUBU2>
References: <7A7A217F82E64367837AC03EA5D56CC8@DougEwell> <20180603043132.68a36455@JRWUBU2>
Message-ID: <20180603115054.3023280d@JRWUBU2>

On Sun, 3 Jun 2018 04:31:32 +0100 Richard Wordingham via Unicode wrote:

> However, the text is actually in the Tham script, and without any
> line-breaking controls, the first and third examples read, marking the
> grapheme cluster boundaries with '|', as ???? <... MA, U+1A60 TAI THAM
> SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E TAI THAM VOWEL SIGN
> E> and ???? <..., U+1A60 TAI THAM SIGN SAKOT | U+1A45 TAI THAM LETTER
> WA, U+1A71 TAI THAM VOWEL SIGN AI>.

What I have marked is the *extended* grapheme cluster boundaries. There is a *legacy* grapheme cluster break before the vowel sign. This may make line-breaking after Indic re-ordering a bit easier.
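The extended/legacy distinction being drawn here can be inspected programmatically; a sketch in Rust, assuming the unicode-segmentation crate (my choice of tool, not one mentioned in the thread), using a Devanagari example rather than Tai Tham:

    // Cargo.toml (assumed): unicode-segmentation = "1"
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        // DEVANAGARI LETTER NA + VOWEL SIGN AA: one extended grapheme
        // cluster, but legacy clusters break before the spacing mark.
        let text = "\u{0928}\u{093E}";
        let extended: Vec<&str> = text.graphemes(true).collect();
        let legacy: Vec<&str> = text.graphemes(false).collect();
        assert_eq!(extended.len(), 1);
        assert_eq!(legacy.len(), 2);
        println!("extended: {:?}, legacy: {:?}", extended, legacy);
    }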
However, in the Lao language, we have sequences in Tham such as <...> ('|' = legacy grapheme break), and I now fully expect there to be renderings such as: <...>, break, <...>.

There seems to be an example about the string hole in the middle line of BAD-13-1-0100 in Figure 5.4 on p222 of Bounleuth's dissertation (http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf), but I'm not confident of my reading of the split word. Theppitak would be able to confirm or refute, but he doesn't often participate in this forum.

Richard.

From unicode at unicode.org Mon Jun 4 14:49:20 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Mon, 4 Jun 2018 12:49:20 -0700
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
Message-ID:

Hi,

The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent under NFKC.

Are there any cases where this will lead to inconsistencies? I.e. can the NFKC of a valid UAX 31 ident be invalid UAX 31?

(In general, are there other problems folks see with this proposal?)

Thanks,
-Manish

From unicode at unicode.org Mon Jun 4 14:57:16 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Mon, 4 Jun 2018 12:57:16 -0700
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID:

Oh, looks like UAX 31 has info on how to be closed under NFKC: http://www.unicode.org/reports/tr31/#NFKC_Modifications

-Manish

On Mon, Jun 4, 2018 at 12:49 PM Manish Goregaokar wrote:
> [...]

From unicode at unicode.org Mon Jun 4 19:37:47 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 5 Jun 2018 01:37:47 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID: <20180605013747.157d72f1@JRWUBU2>

On Mon, 4 Jun 2018 12:49:20 -0700 Manish Goregaokar via Unicode wrote:

> (In general, are there other problems folks see with this proposal?)

There's the usual lurking issue that the Thai word for water, น้ำ <U+0E19, U+0E49, U+0E33>, is unacceptable and often untypable and uncopiable when converted to NFKC น้ํา <U+0E19, U+0E49, U+0E4D, U+0E32>. The decomposed form that looks the same is นํ้า <U+0E19, U+0E4D, U+0E49, U+0E32>. The problem is that for sane results, <U+0E4D, U+0E32> needs special handling. This sequence is also often untypable - part of the protection against Thai homographs.

Richard.

From unicode at unicode.org Mon Jun 4 22:43:39 2018
From: unicode at unicode.org (Rebecca T via Unicode)
Date: Mon, 4 Jun 2018 23:43:39 -0400
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID:

I think that the benefits of inclusion from allowing non-ASCII identifiers far outweigh any corner cases this might cause. (Although ironing out and analyzing those is of course important, I don't think they should be obstacles for implementing this kind of thing.)

Something I'd love to see is translated keywords; shouldn't be hard with a line in the cargo.toml for a rudimentary lookup. Again, I'm of the opinion that an imperfect implementation is better than no attempt. I remember reading an article about a professor who translated the keywords in... maybe it was Python? And found their students were much more engaged with the material. Anecdotal, of course, but it's stuck with me.

On Mon, Jun 4, 2018 at 3:53 PM Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
> [...]

From unicode at unicode.org Tue Jun 5 01:09:45 2018
From: unicode at unicode.org (Martin J. Dürst via Unicode)
Date: Tue, 5 Jun 2018 15:09:45 +0900
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID:

Hello Rebecca,

On 2018/06/05 12:43, Rebecca T via Unicode wrote:

> Something I'd love to see is translated keywords; shouldn't be hard with a
> line in the cargo.toml for a rudimentary lookup. Again, I'm of the opinion
> that an imperfect implementation is better than no attempt. I remember
> reading an article about a professor who translated the keywords in...
> maybe it was Python? And found their students were much more engaged with
> the material. Anecdotal, of course, but it's stuck with me.

It would be good to have a reference for this. I can certainly see the point. But on the other hand, I have also heard that using keywords in a foreign language makes it clear that there may be a difference between the everyday use of the word and the specific formal meaning in the programming language. Then there's also the problem that just translating keywords may work for languages with the same sentence structure, but not for languages with a completely different sentence structure. On top of that, keywords are just a start; class/function/method names in libraries would have to be translated too, which would be much more work (especially if one wants to do a good job).

Regards, Martin.

From unicode at unicode.org Tue Jun 5 21:48:53 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Tue, 5 Jun 2018 19:48:53 -0700
Subject: Requiring typed text to be NFKC (was: Can NFKC turn valid UAX 31 identifiers into non-identifiers?)
Message-ID:

Following up from my previous email, one of the ideas that was brought up was that if we're going to consider NFKC forms equivalent, we should require things to be typed in NFKC.

I'm a bit wary of this. As Richard brought up in that thread, some Thai NFKC forms are untypable. I *suspect* there are Hangul keyboards (perhaps physical non-IME based ones) that have this problem.

Do folks have other examples?
Interested in both:

- Words (as in, real things people will want to type) where a keyboard/IME does not type the NFKC form
- Words where a keyboard/IME *can* type the NFKC form but users are not used to it
- Words where the NFKC form is *visually* distinct enough that it will look weird to native speakers

Thanks,
-Manish

From unicode at unicode.org Wed Jun 6 04:29:53 2018
From: unicode at unicode.org (Alastair Houghton via Unicode)
Date: Wed, 6 Jun 2018 10:29:53 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID:

On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode wrote:
>
> The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent under NFKC.
>
> Are there any cases where this will lead to inconsistencies? I.e. can the NFKC of a valid UAX 31 ident be invalid UAX 31?
>
> (In general, are there other problems folks see with this proposal?)

IMO the major issue with non-ASCII identifiers is not a technical one, but rather that it runs the risk of fragmenting the developer community. Everyone can *type* ASCII and everyone can read Latin characters (for reasonably wide values of "everyone", at any rate; most computer users aren't going to have a problem). Not everyone can type Hangul, Chinese or Arabic (for instance), and there is no good fix or workaround for this.

Note that this is orthogonal to issues such as which language identifiers or comments are written in (indeed, there's no problem with comments written in any script you please); the problem is that e.g. given a function

  func ?????(s : String)

it isn't obvious to a non-Arabic speaking user how to enter ????? in order to call it. This isn't true of e.g.

  func pituus(s : String)

Even though "pituus" is Finnish, it's still ASCII and everyone knows how to type that. Copy and paste is not always a good solution here, I might add; in bidi text in particular, copy and paste can have confusing results (and results that vary depending on the editor being used).

There is also the issue of additional confusions that might be introduced; even if you stick to Latin scripts, this could be a problem sometimes (e.g. at small sizes, it's hard to distinguish ? and ? or ? and ?), and of course there are Cyrillic and Greek characters that are indistinguishable from their Latin counterparts in most fonts. UAX #31 also manages (I suspect unintentionally) to give a good example of a pair of Farsi identifiers that might be awkward to tell apart in certain fonts, namely ?????? and ???????; I think those are OK in monospaced fonts, where the join is reasonably wide, but at small point sizes in proportional fonts the difference in appearance is very subtle, particularly for a non-Arabic speaker.

You could avoid *some* of these issues by restricting the allowable scripts somehow (e.g. requiring that an identifier that had Latin characters could not also contain Cyrillic and so on) or perhaps by establishing additional canonical equivalences between similar looking characters (so that e.g. while a and ? - or, more radically, ? and ? - might be different characters, you might nevertheless regard them as the same for symbol lookup).
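The script-restriction idea can be prototyped cheaply. A sketch in Rust; note that for brevity it classifies letters by raw code-point block rather than by the real Unicode Script property, so it only illustrates the shape of the check, not a production rule:

    #[derive(Clone, Copy, PartialEq)]
    enum Family { Latin, Greek, Cyrillic, Neutral }

    fn family(c: char) -> Family {
        match c {
            // ASCII letters plus Latin-1 Supplement..Latin Extended-B.
            'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Family::Latin,
            '\u{0370}'..='\u{03FF}' => Family::Greek,    // Greek and Coptic
            '\u{0400}'..='\u{04FF}' => Family::Cyrillic, // Cyrillic block
            _ => Family::Neutral,
        }
    }

    /// Reject identifiers that mix Latin, Greek and Cyrillic letters.
    fn is_script_consistent(ident: &str) -> bool {
        let mut seen = None;
        for c in ident.chars() {
            let f = family(c);
            if f == Family::Neutral {
                continue;
            }
            match seen {
                None => seen = Some(f),
                Some(prev) if prev == f => {}
                Some(_) => return false,
            }
        }
        true
    }

    fn main() {
        assert!(is_script_consistent("pituus"));
        // The middle letter below is U+0430 CYRILLIC SMALL LETTER A.
        assert!(!is_script_consistent("p\u{0430}y"));
    }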
It might be worth looking at UTR #36 and maybe UTR #39, not so much from a security standpoint, but more because those documents already have to deal with the problem of confusables. You could also recommend that people stick to ASCII unless there's a good reason to do otherwise (and note that using non-ASCII characters might impact on their ability to collaborate with teams in other countries).

None of this is necessarily a reason *not* to support non-ASCII identifiers, but it *is* something to be cautious about. Right now, most programming languages operate as a lingua franca, with code written by a wide range of people, not all of whom speak English, but all of whom can collaborate together to a greater or lesser degree by virtue of the fact that they all understand and can write code. Going down this particular rabbit hole risks changing that, and not for the better, and IMO it's important to understand that when considering whether the trade-off of being able to use non-ASCII characters in identifiers is genuinely worth it.

Kind regards,

Alastair.

--
http://alastairs-place.net

From unicode at unicode.org Wed Jun 6 04:49:01 2018
From: unicode at unicode.org (Alastair Houghton via Unicode)
Date: Wed, 6 Jun 2018 10:49:01 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID: <5A361BFB-CC46-4DE1-BB31-EADBDC16CCEF@alastairs-place.net>

On 5 Jun 2018, at 07:09, Martin J. Dürst via Unicode wrote:
>
> It would be good to have a reference for this. I can certainly see the point. [...] On top of that, keywords are just a start; class/function/method names in libraries would have to be translated too, which would be much more work (especially if one wants to do a good job).

ALGOL68 was apparently localised (the standard explicitly supported that; it wasn't an extension but rather something explicitly encouraged). AppleScript was also designed to be (French and Japanese syntaxes were defined), and I have an inkling that someone once told me that at least one translation had actually shipped, though the translated variants are now deprecated as far as I'm aware.

Translated keywords are in some ways better than allowing non-ASCII identifiers, because they're typically amenable to machine translation (indeed, in AppleScript, the scripts are not usually saved in ASCII anyway, but IIRC as a set of Apple Event Descriptors, so the "language" is just a matter for rendering to the user), which means that they don't suffer from the problem of community fragmentation that non-ASCII identifiers *could* cause.

Kind regards,

Alastair.

--
http://alastairs-place.net
From unicode at unicode.org Wed Jun 6 06:19:31 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 6 Jun 2018 13:19:31 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: <5A361BFB-CC46-4DE1-BB31-EADBDC16CCEF@alastairs-place.net>
References: <5A361BFB-CC46-4DE1-BB31-EADBDC16CCEF@alastairs-place.net>
Message-ID:

It could be argued that "modern" languages could use unique identifiers for their syntax or API independently of the name being rendered. The problem is that translated names may collide in non-obvious ways and become ambiguous. We've already seen the problems this caused in Excel with its translated function names in some spreadsheets - things being worse when the spreadsheet itself does not contain a language identifier to indicate in which language its identifiers are defined, so that English-only installations of Excel (without the MUI/LUI installed) cannot open or correctly process spreadsheets created in other languages.

In practice, ASCII-only or ISO 8859-1-only identifiers work relatively well, but entering these identifiers is always a problem for some users. A solution would be to allow identifiers to have an ASCII-only alias, even if such aliases are not so friendly for the original authors. But I've not seen any programming language or API that allows defining aliases for identifiers with exactly the same semantics as the translated ones that non-English users would prefer to see and use. In C/C++ you may have aliases, but this requires special support in the binary object or library format to allow equivalent bindings and resolution.

For programming languages that are very near the machine level (assembly, C, C++), or for common libraries intended to be used worldwide, in most cases these names are English-only or use "augmented English" with approximate transliterations when they use borrowed words (notably proper names) or invented words (company names, trademarks, custom neologisms specific to an app or service, and a lot of acronyms). These APIs and languages tend to create their own "jargon" with their own definitions (which may be translated in their documentation). Programmer comments, however, are very frequently written in any language or script, because they don't have to be restricted by uniqueness and name resolution or binding mechanisms.

But newer scripting languages are now very liberal (notably JavaScript/ECMAScript) and are somewhat easy to rebind to other names to generate an "equivalent" library, except where the library needs to work through reflection and introspection. Scripting languages designed for user personalisation should be user friendly, even if they are designed to work well only with the language of the initial author for his own usage; cooperation will be limited on the Internet, and anyone who wants to share his code will have to create some basic translation or transliteration.

Most system-level APIs (filesystem or I/O, multiprocessing/multithreading, networking) and data format options are specified using English terms only (or near-English). The various IDEs, however, can make this language more friendly by providing documentation searches, contextual helpers in the editor itself, hinting popups, or various "machine learning" tools (including "natural language" query wizards to help create and document the technical language using the English-like jargon).
Most programming languages do not define a lot of reserved keywords (in English), and there's rarely a need to translate them (though I've seen several programming languages translating them into a few well-known languages), notably languages designed to be used by children or to learn programming. Some of these languages do not use a plain-text syntax but use graphic diagrams with symbols, arrows and boxes, and programmers navigate in the graphic layout or rearrange it to fit new items or to remove or combine them. An "advanced" view can then present this layout in plain text using partly translated terms. This is easier if there's a clear syntactic separation between custom identifiers created by users (not translated) and the core keywords of the language; generally this separation uses quotation marks around custom identifiers, but it is not even needed everywhere for data-oriented syntaxes like JSON, which does not need any "reserved" identifier and reserves only some punctuation.

Anyway, all programming jobs require a basic proficiency in reading and writing basic English, and require acquiring a common English-like technical jargon. That jargon does not have to be perfect English; it is used as a de facto standard which evolves too fast to be correctly translated. This jargon is still NOT normal English, and using it means that documentation should still be adapted/translated into better English for native English readers.

If you look at some well-known projects in China, you'll see that many projects are documented and supported only in Chinese, by programmers that have a very limited knowledge of English, so their usage of English in the created technical jargon is linguistically incorrect, but still correct for the technical needs. To translate or adapt these programs to other languages, Chinese is the source of all translations and must be present in all translation files to map it to English or any other language: most people don't know how to type it, so what they do is copy-paste the existing Chinese-to-English translation files, fix the English target, and then use that to create other translations based on this English text; finally the resulting translation is tested in the final target language and slightly modified to get a more uniform and consistent terminology.

2018-06-06 11:49 GMT+02:00 Alastair Houghton via Unicode <unicode at unicode.org>:
> [...]
From unicode at unicode.org Wed Jun 6 06:55:07 2018
From: unicode at unicode.org (Henri Sivonen via Unicode)
Date: Wed, 6 Jun 2018 14:55:07 +0300
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID:

On Mon, Jun 4, 2018 at 10:49 PM, Manish Goregaokar via Unicode wrote:
> The Rust community is considering adding non-ascii identifiers, which follow
> UAX #31 (XID_Start XID_Continue*, with tweaks).

UAX #31 is rather light on documenting its rationale. I realize that XML is a different case from Rust, considering how the Rust compiler is something a programmer runs locally, whereas the coupling between XML documents and XML processors, especially over time, is significantly looser. Still, the experience from XML and HTML suggests that, if non-ASCII is to be allowed in identifiers at all, restricting the value space of identifiers a priori easily ends up restricting too much.

HTML went with the approach of collecting everything up to the next ASCII code point that's a delimiter in HTML (and a later check for names that are eligible for Custom Element treatment, which mainly achieves compatibility with XML, but no such check for what the parser can actually put in the document tree), while keeping the actual vocabulary to ASCII (except for Custom Elements, whose seemingly arbitrary restrictions are inherited from XML). XML 1.0 codified for element and attribute names what was then the understanding of the topic that UAX #31 now covers, and made other cases a hard failure. Later, it turned out that XML originally ruled out too much, and the whole mess that was XML 1.1 and XML 1.0 5th ed. resulted from trying to relax the rules.

Considering that ruling out too much can be a problem later, but just treating anything above ASCII as opaque hasn't caused trouble (that I know of) for HTML other than compatibility issues with XML's stricter stance, why should a programming language, if it opts to support non-ASCII identifiers in an otherwise ASCII core syntax, implement the complexity of UAX #31 instead of allowing everything above ASCII in identifiers? In other words, what problem does making a programming language conform to UAX #31 solve?

Allowing anything above ASCII will lead to some cases that obviously don't make sense, such as declaring a function whose name is a paragraph separator, but why is it important to prohibit that kind of thing when prohibiting things risks prohibiting too much, as happened with XML, and people just don't mint identifiers that aren't practical to them? Is there some important badness-prevention concern that applies to programming languages more than it applies to HTML? The key thing here, in terms of considering whether badness is *prevented*, isn't what's valid HTML but what the parser can actually put in the DOM, and the HTML parser can actually put any non-ASCII code point in the DOM as an element or attribute name (after the initial ASCII code point).

(The above question is orthogonal to normalization. I do see the value of normalizing identifiers to NFC or requiring them to be in NFC to begin with. I'm inclined to consider NFKC as a bug in the Rust proposal.)

--
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/
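Both policies contrasted here, plus the original closure question, fit in a few lines of Rust; a sketch assuming the unicode-xid and unicode-normalization crates (neither is named in the thread):

    // Cargo.toml (assumed): unicode-xid = "0.2", unicode-normalization = "0.1"
    use unicode_normalization::UnicodeNormalization;
    use unicode_xid::UnicodeXID;

    /// HTML-like permissive policy: ASCII is restricted, anything
    /// above ASCII is accepted opaquely.
    fn is_ident_permissive(s: &str) -> bool {
        !s.is_empty()
            && s.chars()
                .all(|c| !c.is_ascii() || c.is_ascii_alphanumeric() || c == '_')
    }

    /// UAX #31 policy: XID_Start XID_Continue*.
    fn is_ident_uax31(s: &str) -> bool {
        let mut chars = s.chars();
        match chars.next() {
            Some(c) if c.is_xid_start() => chars.all(|c| c.is_xid_continue()),
            _ => false,
        }
    }

    /// Manish's question as a testable property: does a valid
    /// identifier remain valid after NFKC?
    fn closed_under_nfkc(s: &str) -> bool {
        !is_ident_uax31(s) || is_ident_uax31(&s.nfkc().collect::<String>())
    }

    fn main() {
        assert!(is_ident_permissive("日本語"));
        assert!(is_ident_uax31("日本語"));
        assert!(closed_under_nfkc("Kelvin"));
    }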
From unicode at unicode.org Wed Jun 6 12:55:53 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 6 Jun 2018 18:55:53 +0100
Subject: Requiring typed text to be NFKC
In-Reply-To:
References:
Message-ID: <20180606185553.3516a5ea@JRWUBU2>

On Tue, 5 Jun 2018 19:48:53 -0700 Manish Goregaokar via Unicode wrote:

> Following up from my previous email, one of the ideas that was brought
> up was that if we're going to consider NFKC forms equivalent, we
> should require things to be typed in NFKC.
>
> I'm a bit wary of this. As Richard brought up in that thread, some
> Thai NFKC forms are untypable. I *suspect* there are Hangul keyboards
> (perhaps physical non-IME based ones) that have this problem.
>
> Do folks have other examples? Interested in both:

I don't know of any different problems for NFKC, but there are problems with getting people to enter normalised data.

> - Words (as in, real things people will want to type) where a
> keyboard/IME does not type the NFKC form

There are problems with insisting that users type normalised text. Vietnamese is probably a real issue here; the standard keyboard is set up to enter vowels (some of which are accented) and tone marks separately. Indeed, with the nặng tone (as in the vowel of its name), one is likely to find the codepoint sequence <U+0103, U+0323>, which is not NFC, not NFD and not even FCD.

> - Words where the NFKC form is *visually* distinct enough that it
> will look weird to native speakers

There may be issues with BMP CJK compatibility ideographs. I don't know how far they've been replaced by variation sequences requesting the same appearance.

> - Words where a keyboard/IME *can* type the NFKC form but users are
> not used to it

Well, typing Tai Khuen in normalised form is hideously counter-intuitive, but at present the USE makes displaying correctly spelt text into a struggle for a font. The problem there is that the usual way of typing a closed syllable with a tone mark gets normalised at the end to <...>; that normalisation broke early pre-USE OpenType-based fonts as databases caught up with Unicode 5.2. That problem was promptly cured by HarfBuzz tweaking its internal normalisation, until the USE unintentionally outlawed correct spelling.

A universal keyboard for entering large swathes of the Latin script is not a very big problem, but entering text with diacritics in form NFC is a real pain. This problem might arise when editing a Hungarian program without a Hungarian keyboard. The program development environment would have to provide a normalisation tool.

Richard.
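Such a normalisation tool is small in code terms; a sketch with the unicode-normalization crate (the crate choice and the Vietnamese example word are my assumptions):

    // Cargo.toml (assumed): unicode-normalization = "0.1"
    use unicode_normalization::UnicodeNormalization;

    /// Normalize an identifier to NFC before it is compared or stored,
    /// so that composed and decomposed keyboard input agree.
    fn to_nfc(ident: &str) -> String {
        ident.nfc().collect()
    }

    fn main() {
        let composed = "Tr\u{01B0}\u{01A1}ng";     // precomposed ư and ơ
        let decomposed = "Tru\u{031B}o\u{031B}ng"; // u/o + U+031B COMBINING HORN
        assert_eq!(to_nfc(composed), to_nfc(decomposed));
    }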
From unicode at unicode.org Wed Jun 6 16:25:32 2018
From: unicode at unicode.org (Hans Åberg via Unicode)
Date: Wed, 6 Jun 2018 23:25:32 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID: <66EEAD40-AB4C-47E2-A77F-13273729D9CB@telia.com>

> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode wrote:
>
> The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent under NFKC.

So, in this language, if one defines a projection function 𝜋 and the usual constant π, what is 𝜋(π) supposed to mean? - Just curious.

From unicode at unicode.org Wed Jun 6 20:56:37 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 6 Jun 2018 18:56:37 -0700
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: <66EEAD40-AB4C-47E2-A77F-13273729D9CB@telia.com>
References: <66EEAD40-AB4C-47E2-A77F-13273729D9CB@telia.com>
Message-ID: <58a6d57f-c797-6178-1b63-f0e9a31b8f12@ix.netcom.com>

On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote:
> So, in this language, if one defines a projection function 𝜋 and the usual constant π, what is 𝜋(π) supposed to mean? - Just curious.

In a language where one writes ASCII "pi" instead, what is pi(pi) supposed to mean?

From unicode at unicode.org Wed Jun 6 22:08:51 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 7 Jun 2018 04:08:51 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID: <20180607040851.2ad2957d@JRWUBU2>

On Mon, 4 Jun 2018 12:49:20 -0700 Manish Goregaokar via Unicode wrote:

> (In general, are there other problems folks see with this proposal?)

Confusable checking may need to be reviewed. There are several cases where, sometimes depending on the font, anagrams (differing even after normalisation) can render the same. The examples I know of are from SE Asia. The categories I know of are:

a) Swapping subscript letters - a big issue in the Myanmar script, but Sanskrit grv- and gvr- can easily be rendered the same. I don't know how easily confusion arises by 'finger trouble'.

b) Vowel-subscript consonant and subscript consonant-vowel often look the same in Khmer and Tai Tham. The former spelling was supposedly dropped in Khmer a century ago (the consonant ceasing to be subscript), but lingered on in a few words and is acknowledged by Unicode but not by the Microsoft font developer's guide.

c) Unresolved grammar. In Thai minority languages, U+0E3A THAI CHARACTER PHINTHU and a mark above (U+0E34 THAI CHARACTER SARA I, I believe) can and do occur in either order, with no difference in appearance or meaning. The obvious humane solution is a brutal folding of the sequences. (Using spell-checkers works wonders on normal text, but spell checking code is tricky.)

I actually suggested a character (U+1A54 TAI THAM LETTER GREAT SA) so that folding 'ses' to 'sse' would not result in the 'ss' conjunct being used; the conjunct is not used in 'ses'.

Richard.
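For case (c), the "brutal folding" could look like the following Rust sketch; the choice of which order wins is arbitrary here (my assumption, not a recommendation from the thread):

    /// U+0E34 THAI CHARACTER SARA I and U+0E3A THAI CHARACTER PHINTHU
    /// occur in either order with identical rendering, so fold one
    /// order onto the other before comparing strings.
    fn fold_thai_marks(s: &str) -> String {
        s.replace("\u{0E3A}\u{0E34}", "\u{0E34}\u{0E3A}")
    }

    fn main() {
        let a = "\u{0E1B}\u{0E34}\u{0E3A}"; // PO PLA + SARA I + PHINTHU
        let b = "\u{0E1B}\u{0E3A}\u{0E34}"; // PO PLA + PHINTHU + SARA I
        assert_eq!(fold_thai_marks(a), fold_thai_marks(b));
    }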
From unicode at unicode.org Thu Jun 7 02:36:06 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 7 Jun 2018 08:36:06 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: <20180605013747.157d72f1@JRWUBU2>
References: <20180605013747.157d72f1@JRWUBU2>
Message-ID: <20180607083606.60010e6e@JRWUBU2>

On Tue, 5 Jun 2018 01:37:47 +0100 Richard Wordingham via Unicode wrote:

> The decomposed form that looks the same is นํ้า <U+0E19, U+0E4D, U+0E49,
> U+0E32>. The problem is that for sane results, <U+0E4D, U+0E32> needs
> special handling. This sequence is also often untypable - part of the
> protection against Thai homographs.

I've been misquoted on the Rust discussion topic - or the behaviour is more diverse than I was aware of. In LibreOffice, with sequence checking not disabled, typing <U+0E19, U+0E4D> disables the input of U+0E49 or U+0E32 immediately afterwards. Another mechanism is for typing another vowel to replace the U+0E4D. The problem here is that in standard Thai, U+0E4D may not be followed by another vowel or tone mark, so Wing Thuk Thi (WTT) rules cut in. (They're also quite good at preventing one from typing Northern Khmer.) In LibreOffice, typing the NFKC form <U+0E19, U+0E49, U+0E4D, U+0E32> is stopped at the attempt to type U+0E4D, though one can get back to the original by typing U+0E33 instead. To the rule checker, that is mission accomplished!

Richard.

From unicode at unicode.org Thu Jun 7 03:10:54 2018
From: unicode at unicode.org (Alastair Houghton via Unicode)
Date: Thu, 7 Jun 2018 09:10:54 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID: <36822291-BD55-452D-B25A-EF3FEEF3668C@alastairs-place.net>

On 6 Jun 2018, at 17:50, Manish Goregaokar wrote:
>
> I think the recommendation to use ASCII as much as possible is implicit there.

It would be a very good idea to make it explicit. Even for English speakers, there may be a temptation to use characters that are hard to distinguish or hard to type on someone else's keyboard; some thought needs to be given before choosing non-ASCII identifiers. Sometimes you might even choose to support multiple spellings of an API to avoid any problems. And in other cases it's a good idea to remember that someone other than you might have to maintain your code in the future; that person might not speak the same language you do or use the same keyboard.

Kind regards,

Alastair.

--
http://alastairs-place.net

From unicode at unicode.org Thu Jun 7 03:26:32 2018
From: unicode at unicode.org (Hans Åberg via Unicode)
Date: Thu, 7 Jun 2018 10:26:32 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: <58a6d57f-c797-6178-1b63-f0e9a31b8f12@ix.netcom.com>
References: <66EEAD40-AB4C-47E2-A77F-13273729D9CB@telia.com> <58a6d57f-c797-6178-1b63-f0e9a31b8f12@ix.netcom.com>
Message-ID:

> On 7 Jun 2018, at 03:56, Asmus Freytag via Unicode wrote:
>
> On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote:
>> So, in this language, if one defines a projection function 𝜋 and the usual constant π, what is 𝜋(π) supposed to mean? - Just curious.
>
> In a language where one writes ASCII "pi" instead, what is pi(pi) supposed to mean?

Indeed.
From unicode at unicode.org Thu Jun 7 03:42:46 2018
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Thu, 7 Jun 2018 10:42:46 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: <20180607083606.60010e6e@JRWUBU2>
References: <20180605013747.157d72f1@JRWUBU2> <20180607083606.60010e6e@JRWUBU2>
Message-ID:

> The proposal also asks for identifiers to be treated as equivalent under NFKC.

The guidance in #31 may not be clear. It is not to replace identifiers as typed in by the user by their NFKC equivalent. It is rather to internally *identify* two identifiers (as typed in by the user) as being the same. For example, Pascal had case-insensitive identifiers. That means someone could type in

myIdentifier = 3;
MyIdentifier = 4;

And both of those would be references to the same internal entity. So cases like SARA AM don't necessarily play into this.

> IMO the major issue with non-ASCII identifiers is not a technical one, but rather that it runs the risk of fragmenting the developer community.

IMO, forcing everyone to stick to the limitations of ASCII for all identifiers is unnecessary and often counterproductive.

First, programmers tend to think of "identifiers" as being specifically "identifiers in programming languages" (and often "identifiers in programming languages that I think are important"). Identifiers may occur in much broader contexts, often being much closer to end users (e.g. spreadsheet formulae), or in scripting languages, user identifiers, and so on.

Secondly, even with programming languages that are restricted to ASCII, people can choose identifiers in code like the following, which would not be obvious to many people.

var Stellenwert = Verteidigungsministerium_Konto.verarbeite();
// Asmus könnte realistischere Beispiele vorschlagen

For a given project, and for programming languages (as opposed to more user-facing languages), the language to be used for variables, functions, comments, &c. will often be English, to allow for broader participation. But that should be a choice of the people involved. There are clearly many cases where that restriction is not optimal for a given project, where not all of the developers (and prospective developers) are fluent in English, but do share another common language. Think of all the in-house development in countries and organizations around the world.

And finally, it's not like you hear of huge problems from Java or Swift or other programming languages because they support non-ASCII identifiers.

Mark
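A sketch of this "identify, don't replace" reading in Rust: the spelling the user typed is kept, and only the internal lookup key is folded to NFKC (the unicode-normalization crate is an assumption on my part):

    // Cargo.toml (assumed): unicode-normalization = "0.1"
    use std::collections::HashMap;
    use unicode_normalization::UnicodeNormalization;

    struct SymbolTable {
        // NFKC form of the identifier -> value. The user's spelling is
        // never rewritten; only the lookup key is folded.
        symbols: HashMap<String, i32>,
    }

    impl SymbolTable {
        fn key(ident: &str) -> String {
            ident.nfkc().collect()
        }
        fn assign(&mut self, ident: &str, value: i32) {
            self.symbols.insert(Self::key(ident), value);
        }
        fn lookup(&self, ident: &str) -> Option<i32> {
            self.symbols.get(&Self::key(ident)).copied()
        }
    }

    fn main() {
        let mut table = SymbolTable { symbols: HashMap::new() };
        // U+212A KELVIN SIGN is NFKC-equivalent to the letter K, so the
        // two spellings identify the same entity, much as Pascal's
        // case-insensitivity identified myIdentifier with MyIdentifier.
        table.assign("\u{212A}elvin", 3);
        assert_eq!(table.lookup("Kelvin"), Some(3));
    }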
(They're also quite > good at preventing one from typing Northern Khmer.) In LibreOffice, > typing the NFKC form is stopped at > attempting to type U+0E4D, though one can get back to the original by > typing U+0E33 instead. To the rule checker, that is mission > accomplished! > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jun 7 05:05:20 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 7 Jun 2018 12:05:20 +0200 Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers? In-Reply-To: <66EEAD40-AB4C-47E2-A77F-13273729D9CB@telia.com> References: <66EEAD40-AB4C-47E2-A77F-13273729D9CB@telia.com> Message-ID: In my opinion the usual constant is most often shown as "??" (curly serifs, slightly slanted) in mathematical articles and books (and in TeX), but rarely as "?" (sans-serif). There's a tradition of using handwriting for this symbol on backboards (not always with serifs, but still often slanted). A notation with the "?" symbol uses a legacy troundtrip mapping for old OEM charsets on low-resolution text terminals where it was distinguished from the more common Greek letter which was enhanced for better readability once old low-resolution terminals were replaced. "?" looks too much like an Hangul letter or a legacy box-drawing character and in fact difficult to recognize as the pi constant, but it may still be found in some plain-text paragraphs of inline mathematical formulas on screens (for programmers), at low resolution or with small font sizes, where most text is in sans-serif Latin and not slanted/italicized and not using an handwritten style. If you think about writing a functional programming language using inline formulas, then the "?" symbol may be ok for the constant, and custom identifiers for a function would use standard Greek letters (or other standard scripts for human languages), or would use "pi" in Latin. You would then write "pi(?)" in that inline formula. For a classic 2D mathematical layout, you would use "pi(??)" with distinctive but homonegeous styles for custom variables/function names and for the classic mathematical constant. As much as possible you will avoid mixing confusive letters/symbols in that language. Confusion is still possible is you use old texts mixing old Greek letters for numerals: you would in that case avoid using the Greek letter pi for naming your custom function, and would reserve the pi letter for the wellknown constant. But applying distinctive styles will enhance your formulas for readability. 2018-06-06 23:25 GMT+02:00 Hans ?berg via Unicode : > > > On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode < > unicode at unicode.org> wrote: > > > > The Rust community is considering adding non-ascii identifiers, which > follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also > asks for identifiers to be treated as equivalent under NFKC. > > So, in this language, if one defines a projection function ?? and the > usual constant ?, what is ??(?) supposed to mean? - Just curious. > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
From unicode at unicode.org Thu Jun 7 05:22:13 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Thu, 7 Jun 2018 12:22:13 +0200 (CEST)
Subject: Unicode 11.0.0: BidiMirroring.txt
Message-ID: <125193735.7527.1528366933902.JavaMail.www@wwinf1c20>

In the wake of the new release, may we discuss the reason why UTC persisted in recommending that 3 pairs of mathematical symbols featuring tildes are mirrored in low-end support by glyph-exchange bidi-mirroring, with the result that the legibility of the tildes is challenged, as demonstrated for "Remedial 11" in: https://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html

(This was written up for meeting #153, while an outdated alias file with roughly the same content was discussed again at meeting #154; there were also some related items in the general feedback hopper.)

Thanks, Marcel

From unicode at unicode.org Thu Jun 7 05:41:55 2018
From: unicode at unicode.org (Hans Åberg via Unicode)
Date: Thu, 7 Jun 2018 12:41:55 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References: <66EEAD40-AB4C-47E2-A77F-13273729D9CB@telia.com>
Message-ID: <4861F342-D826-4BB7-96D0-4EAA979E3297@telia.com>

Now that the distinction is possible, it is recommended to do that. My original question was directed to the OP, whether it is deliberate. And they are confusables only to those not accustomed to it.

> On 7 Jun 2018, at 12:05, Philippe Verdy wrote:
>
> In my opinion the usual constant is most often shown as "𝜋" (curly serifs, slightly slanted) in mathematical articles and books (and in TeX), but rarely as "π" (sans-serif).
> [...]
From unicode at unicode.org Thu Jun 7 06:32:13 2018
From: unicode at unicode.org (Joan Montané via Unicode)
Date: Thu, 7 Jun 2018 13:32:13 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID:

2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <unicode at unicode.org>:

> Are there any cases where this will lead to inconsistencies? I.e. can the
> NFKC of a valid UAX 31 ident be invalid UAX 31?

Yes, such cases exist, for instance in the Latin alphabet and the Catalan language.

* Ŀ (U+013F), LATIN CAPITAL LETTER L WITH MIDDLE DOT, NFKC-decomposes to LATIN CAPITAL LETTER L (U+004C) + MIDDLE DOT (U+00B7)
* ŀ (U+0140), LATIN SMALL LETTER L WITH MIDDLE DOT, NFKC-decomposes to LATIN SMALL LETTER L (U+006C) + MIDDLE DOT (U+00B7)

Ŀ and ŀ are (were) used in Catalan for encoding the geminate L [1] when it is (was) encoded using 2 chars only. The preferred (and commonly used) encoding is currently that of 3 characters: <l, U+00B7, l>. So, some adjustments are needed if you want to support Catalan-language identifiers. [2]

Yours,
Joan Montané

[1] https://en.wikipedia.org/wiki/Interpunct#Catalan
[2] http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments

From unicode at unicode.org Thu Jun 7 06:25:22 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Thu, 7 Jun 2018 13:25:22 +0200 (CEST)
Subject: The Unicode Standard and ISO
Message-ID: <107041927.8437.1528370722530.JavaMail.www@wwinf1c20>

On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
>
> On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
>> Hello,
>>
>> There are several mentions of synchronization with related standards in
>> unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
>> https://www.unicode.org/faq/unicode_iso.html. However, all such mentions
>> never mention anything other than ISO 10646.
>
> Because that is the standard for which there is an explicit understanding by all involved
> relating to synchronization. There have been occasionally some challenging differences
> in the process and procedures, but generally the synchronization is being maintained,
> something that's helped by the fact that so many people are active in both arenas.

Perhaps the cause-effect relationship is somewhat unclear. I think that many people being active in both arenas is helped by the fact that there is a strong will to maintain synching. If there were similar policies, notably for ISO/IEC 14651 (collation) and ISO/IEC 15897 (locale data), ISO/IEC 10646 would be far from standing alone in the field of Unicode-ISO/IEC cooperation.

>> There are really no other standards where the same is true to the same extent.
>>
>> I was wondering which ISO standards other than ISO 10646 specify the
>> same things as the Unicode Standard, and of those, which ones are
>> actively kept in sync. This would be of importance for standardization
>> of Unicode facilities in the C++ language (ISO 14882), as reference to
>> ISO standards is generally preferred in ISO standards.
> One of the areas the Unicode Standard differs from ISO 10646 is that its
> conception of a character's identity implicitly contains that character's
> properties - and those are standardized as well and alongside of just
> name and serial number.

This is probably why, to date, ISO/IEC 10646 features character properties
by including normative references to the Unicode Standard, Standard
Annexes, and the UCD. Bidi-mirroring e.g. is part of ISO/IEC 10646, which
specifies in clause 15.1:

"[…] The list of these characters is determined by having the
'Bidi_Mirrored' property set to 'Y' in the Unicode Standard. These values
shall be determined according to the Unicode Standard Bidi Mirrored
property (see Clause 2)."

> Many of these properties have associated with them algorithms, e.g. the
> bidi algorithm, that are an essential element of data interchange: if you
> don't know which order in the backing store is expected by the recipient
> to produce a certain display order, you cannot correctly prepare your
> data.
>
> There is one area where standardization in ISO relates to work in Unicode
> that I can think of, and that is sorting.

Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the
bibliography). The reverse relationship is irrelevant and would be unfair,
given that the Consortium refused till now to synchronize UCA and
ISO/IEC 14651.

Here is a need for action.

> However, sorting, beyond the underlying framework, ultimately relates to
> languages, and language-specific data is now housed in CLDR.
>
> Early attempts by ISO to standardize a similar framework for locale data
> failed, in part because the framework alone isn't the interesting
> challenge for a repository, instead it is the collection, vetting and
> management of the data.

For another part it failed because the Consortium refused to cooperate,
despite repeated proposals for a merger of both instances.

> The reality is that the ISO model and its organizational structures are
> not well suited to the needs of many important areas where some form of
> standardization is needed. That's why we have organizations like IETF,
> W3C, Unicode etc.
>
> Duplicating all or even part of their effort inside ISO really serves
> nobody's purpose.

An undesirable side-effect of not merging Unicode with ISO/IEC 15897
(locale data) is to divert many competent contributors from monitoring CLDR
data, especially for French.

Here too is a huge need for action.

Thanks in advance.

Marcel

From unicode at unicode.org Thu Jun 7 07:00:02 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 7 Jun 2018 14:00:02 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: 
Message-ID: 

If you intend to allow all the standard orthographies of common languages,
you would also need to support apostrophes and regular hyphens in
identifiers, including those from ASCII!

The Catalan middle dot is just a compact variant of the hyphen. It would
better have been a diacritic, but the usage of upper diacritics on the
letter l/L, with its high ascenders, caused problems when rendering with
compact line-heights. Polish chose an overstriking slash to avoid that
problem; another diacritic could have been used, such as the cedilla below,
but the middle dot was easier to add between the two handwritten "ll"
(after composing the rest of the word) without having to release the
drawing pen from the surface.
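For concreteness, Joan's decomposition can be checked directly; a minimal
sketch (Python 3, whose identifiers follow UAX #31 with NFKC equivalence
per PEP 3131, assuming only the standard unicodedata module):

    import unicodedata

    # Ŀ and ŀ NFKC-decompose to L/l plus U+00B7 MIDDLE DOT:
    assert unicodedata.normalize("NFKC", "\u013F") == "L\u00B7"
    assert unicodedata.normalize("NFKC", "\u0140") == "l\u00B7"

    # U+00B7 has Other_ID_Continue, hence XID_Continue, so both the
    # precomposed and the decomposed spellings still lex as one identifier:
    assert "para\u0140lel".isidentifier()   # paraŀlel
    assert "paral\u00B7lel".isidentifier()  # paral·lel

So the decomposition by itself does not push a Catalan identifier outside
XID_Start XID_Continue*; the adjustments Joan points to appear to concern
where the middle dot may occur, not the validity of the result.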
The vertical placement of the "middle" dot is also largely variable when
handwritten. I have seen it drawn manually as a short stroke (horizontal or
slanted), which is easier to place by hand (the dot can easily fall on the
vertical strokes, and when "ll" is hand-drawn it frequently has the two
curls touching each other, so the dot may in fact fall in the middle of the
curl of the first l); in that case it looks very much like the Polish l
with a stroke bar, or like an l followed by an apostrophe before the
second l.

2018-06-07 13:32 GMT+02:00 Joan Montané via Unicode:

> 2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode
> <unicode at unicode.org>:
>
>> Hi,
>>
>> The Rust community is considering adding non-ASCII identifiers, which
>> follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also
>> asks for identifiers to be treated as equivalent under NFKC.
>>
>> Are there any cases where this will lead to inconsistencies? I.e. can
>> the NFKC of a valid UAX 31 ident be invalid UAX 31?
>
> Yes, such a case exists, for instance in the Latin alphabet and the
> Catalan language.
>
> * Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT, NFKC-decomposes to
>   LATIN CAPITAL LETTER L (U+004C) + MIDDLE DOT (U+00B7)
> * ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT, NFKC-decomposes to
>   LATIN SMALL LETTER L (U+006C) + MIDDLE DOT (U+00B7)
>
> Ŀ and ŀ are (were) used in the Catalan language for encoding the geminate
> L [1] when it is (was) encoded using 2 characters only. The preferred
> (and commonly used) encoding is currently that of 3 characters:
> <U+006C, U+00B7, U+006C>. So, some adjustments are needed if you want to
> support Catalan-language identifiers [2]
>
> Yours,
> Joan Montané
>
> [1] https://en.wikipedia.org/wiki/Interpunct#Catalan
> [2] http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments

From unicode at unicode.org Thu Jun 7 08:08:48 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 7 Jun 2018 14:08:48 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: 
Message-ID: <20180607140848.50223e25@JRWUBU2>

On Thu, 7 Jun 2018 13:32:13 +0200
Joan Montané via Unicode wrote:

> * Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT, NFKC-decomposes to
>   LATIN CAPITAL LETTER L (U+004C) + MIDDLE DOT (U+00B7)
> * ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT, NFKC-decomposes to
>   LATIN SMALL LETTER L (U+006C) + MIDDLE DOT (U+00B7)

This is only a problem if U+00B7 is part of Rust's syntax. U+00B7 has the
properties (X)ID_Continue, so there is no formal problem.

Richard.

From unicode at unicode.org Thu Jun 7 08:20:29 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Thu, 7 Jun 2018 15:20:29 +0200
Subject: The Unicode Standard and ISO
In-Reply-To: <107041927.8437.1528370722530.JavaMail.www@wwinf1c20>
References: <107041927.8437.1528370722530.JavaMail.www@wwinf1c20>
Message-ID: 

A few facts.

> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.

ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could
speak to the synchronization level in more detail, but the above statement
is inaccurate.

> ... For another part it [sync with ISO/IEC 15897] failed because the
> Consortium refused to cooperate, despite repeated proposals for a merger
> of both instances.

I recall no serious proposals for that.

(And in any event,
very unlike the synchrony with 10646 and 14651, ISO 15897 brought no value
to the table. Certainly nothing to outweigh the considerable costs of
maintaining synchrony. Completely inadequate structure for modern system
requirements, no particular industry support, and scant content: see
Wikipedia for "The registry has not been updated since December 2001".)

Mark

On Thu, Jun 7, 2018 at 1:25 PM, Marcel Schneider via Unicode
<unicode at unicode.org> wrote:

> On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
> >
> > On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> > > Hello,
> > >
> > > There are several mentions of synchronization with related standards
> > > in unicode.org, e.g. in https://www.unicode.org/versions/index.html,
> > > and https://www.unicode.org/faq/unicode_iso.html. However, all such
> > > mentions never mention anything other than ISO 10646.
> >
> > Because that is the standard for which there is an explicit
> > understanding by all involved relating to synchronization. There have
> > been occasionally some challenging differences in the process and
> > procedures, but generally the synchronization is being maintained,
> > something that's helped by the fact that so many people are active in
> > both arenas.
>
> Perhaps the cause-effect relationship is somewhat unclear. I think that
> many people being active in both arenas is helped by the fact that there
> is a strong will to maintain synching.
>
> If there were similar policies notably for ISO/IEC 14651 (collation) and
> ISO/IEC 15897 (locale data), ISO/IEC 10646 would be far from standing
> alone in the field of Unicode-ISO/IEC cooperation.
>
> > There are really no other standards where the same is true to the same
> > extent.
> >
> > > I was wondering which ISO standards other than ISO 10646 specify the
> > > same things as the Unicode Standard, and of those, which ones are
> > > actively kept in sync. This would be of importance for
> > > standardization of Unicode facilities in the C++ language
> > > (ISO 14882), as reference to ISO standards is generally preferred in
> > > ISO standards.
> >
> > One of the areas the Unicode Standard differs from ISO 10646 is that
> > its conception of a character's identity implicitly contains that
> > character's properties - and those are standardized as well and
> > alongside of just name and serial number.
>
> This is probably why, to date, ISO/IEC 10646 features character
> properties by including normative references to the Unicode Standard,
> Standard Annexes, and the UCD. Bidi-mirroring e.g. is part of
> ISO/IEC 10646, which specifies in clause 15.1:
>
> "[…] The list of these characters is determined by having the
> 'Bidi_Mirrored' property set to 'Y' in the Unicode Standard. These values
> shall be determined according to the Unicode Standard Bidi Mirrored
> property (see Clause 2)."
>
> > Many of these properties have associated with them algorithms, e.g.
> > the bidi algorithm, that are an essential element of data interchange:
> > if you don't know which order in the backing store is expected by the
> > recipient to produce a certain display order, you cannot correctly
> > prepare your data.
> >
> > There is one area where standardization in ISO relates to work in
> > Unicode that I can think of, and that is sorting.
>
> Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the
> bibliography).
> The reverse relationship is irrelevant and would be unfair, given that
> the Consortium refused till now to synchronize UCA and ISO/IEC 14651.
>
> Here is a need for action.
>
> > However, sorting, beyond the underlying framework, ultimately relates
> > to languages, and language-specific data is now housed in CLDR.
> >
> > Early attempts by ISO to standardize a similar framework for locale
> > data failed, in part because the framework alone isn't the interesting
> > challenge for a repository, instead it is the collection, vetting and
> > management of the data.
>
> For another part it failed because the Consortium refused to cooperate,
> despite repeated proposals for a merger of both instances.
>
> > The reality is that the ISO model and its organizational structures
> > are not well suited to the needs of many important areas where some
> > form of standardization is needed. That's why we have organizations
> > like IETF, W3C, Unicode etc.
> >
> > Duplicating all or even part of their effort inside ISO really serves
> > nobody's purpose.
>
> An undesirable side-effect of not merging Unicode with ISO/IEC 15897
> (locale data) is to divert many competent contributors from monitoring
> CLDR data, especially for French.
>
> Here too is a huge need for action.
>
> Thanks in advance.
>
> Marcel

From unicode at unicode.org Thu Jun 7 08:28:34 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Thu, 7 Jun 2018 14:28:34 +0100
Subject: The Unicode Standard and ISO
In-Reply-To: 
References: <107041927.8437.1528370722530.JavaMail.www@wwinf1c20>
Message-ID: 

On 7 Jun 2018, at 14:20, Mark Davis ☕️ via Unicode wrote:
>
> A few facts.
>
>> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
>
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler
> could speak to the synchronization level in more detail, but the above
> statement is inaccurate.

Mark is right.

>> ... For another part it [sync with ISO/IEC 15897] failed because the
>> Consortium refused to cooperate, despite repeated proposals for a
>> merger of both instances.
>
> I recall no serious proposals for that.

Nor do I.

> (And in any event, very unlike the synchrony with 10646 and 14651,
> ISO 15897 brought no value to the table. Certainly nothing to outweigh
> the considerable costs of maintaining synchrony. Completely inadequate
> structure for modern system requirements, no particular industry
> support, and scant content: see Wikipedia for "The registry has not been
> updated since December 2001".)

Mark is right.

Michael Everson

From unicode at unicode.org Thu Jun 7 08:29:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 7 Jun 2018 14:29:42 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: <20180605013747.157d72f1@JRWUBU2> <20180607083606.60010e6e@JRWUBU2>
Message-ID: <20180607142942.62e805d2@JRWUBU2>

On Thu, 7 Jun 2018 10:42:46 +0200
Mark Davis ☕️ via Unicode wrote:

> > The proposal also asks for identifiers to be treated as equivalent
> > under NFKC.
>
> The guidance in #31 may not be clear. It is not to replace identifiers
> as typed in by the user by their NFKC equivalent. It is rather to
> internally *identify* two identifiers (as typed in by the user) as being
> the same. For example, Pascal had case-insensitive identifiers.
> That means someone could type in
>
> myIdentifier = 3;
> MyIdentifier = 4;
>
> And both of those would be references to the same internal entity. So
> cases like SARA AM don't necessarily play into this.

There has been a suggestion to not just restrict identifiers to NFKC
equivalence classes (UAX31-R4), but to actually restrict them to NFKC form
(UAX31-R6). That is where the issue with SARA AM changes from a lurking
issue to an active problem. Others have realised that NFC makes more sense
than NFKC for Rust.

Richard.
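The SARA AM case is easy to make concrete, since U+0E33 THAI CHARACTER SARA
AM carries a compatibility decomposition; a minimal sketch (Python 3.8+ for
unicodedata.is_normalized, standard library only):

    import unicodedata

    nam = "\u0E19\u0E49\u0E33"  # Thai น้ำ 'water': NO NU + MAI THO + SARA AM

    # The natural spelling is not in NFKC form, because NFKC rewrites
    # U+0E33 as U+0E4D NIKHAHIT + U+0E32 SARA AA:
    assert not unicodedata.is_normalized("NFKC", nam)
    assert unicodedata.normalize("NFKC", nam) == "\u0E19\u0E49\u0E4D\u0E32"

    # Both spellings satisfy XID_Start XID_Continue*, so the UAX31-R4
    # equivalence-class approach is unaffected; only an R6-style "source
    # must already be in NFKC form" rule rejects the spelling as typed.
    assert nam.isidentifier()
    assert unicodedata.normalize("NFKC", nam).isidentifier()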
From unicode at unicode.org Thu Jun 7 08:47:21 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Thu, 7 Jun 2018 15:47:21 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: <20180607142942.62e805d2@JRWUBU2>
References: <20180605013747.157d72f1@JRWUBU2> <20180607083606.60010e6e@JRWUBU2> <20180607142942.62e805d2@JRWUBU2>
Message-ID: 

Got it, thanks.

Mark

On Thu, Jun 7, 2018 at 3:29 PM, Richard Wordingham via Unicode
<unicode at unicode.org> wrote:

> On Thu, 7 Jun 2018 10:42:46 +0200
> Mark Davis ☕️ via Unicode wrote:
>
> > > The proposal also asks for identifiers to be treated as equivalent
> > > under NFKC.
> >
> > The guidance in #31 may not be clear. It is not to replace identifiers
> > as typed in by the user by their NFKC equivalent. It is rather to
> > internally *identify* two identifiers (as typed in by the user) as
> > being the same. For example, Pascal had case-insensitive identifiers.
> > That means someone could type in
> >
> > myIdentifier = 3;
> > MyIdentifier = 4;
> >
> > And both of those would be references to the same internal entity. So
> > cases like SARA AM don't necessarily play into this.
>
> There has been a suggestion to not just restrict identifiers to NFKC
> equivalence classes (UAX31-R4), but to actually restrict them to NFKC
> form (UAX31-R6). That is where the issue with SARA AM changes from a
> lurking issue to an active problem. Others have realised that NFC makes
> more sense than NFKC for Rust.
>
> Richard.

From unicode at unicode.org Thu Jun 7 09:51:58 2018
From: unicode at unicode.org (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?= via Unicode)
Date: Thu, 7 Jun 2018 16:51:58 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: 
Message-ID: <133f6bf5-575e-dab0-843f-f409b30b0232@gmail.com>

On 06/06/2018 at 11:29, Alastair Houghton via Unicode wrote:
> On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode wrote:
>> The Rust community is considering adding non-ASCII identifiers, which
>> follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal
>> also asks for identifiers to be treated as equivalent under NFKC.
>>
>> Are there any cases where this will lead to inconsistencies? I.e. can
>> the NFKC of a valid UAX 31 ident be invalid UAX 31?
>>
>> (In general, are there other problems folks see with this proposal?)
> IMO the major issue with non-ASCII identifiers is not a technical one,
> but rather that it runs the risk of fragmenting the developer community.
> Everyone can *type* ASCII and everyone can read Latin characters (for
> reasonably wide values of "everyone", at any rate; most computer users
> aren't going to have a problem). Not everyone can type Hangul, Chinese
> or Arabic (for instance), and there is no good fix or workaround for
> this.

Well, your "reasonable" value of everyone excludes many kids, and puts
social barriers on the use of computers by non-native Latin writers.

If the programme has no reason to be read and written by foreign
programmers, why not use native-language and native-alphabet identifiers?
Of course, as long as you write a function named ?????, you consciously
restrict the developer community having access to this programme. But you
also make your programme clearer to your Arabic-speaking community. If
said community is e.g. school teachers (or students) in an Arabic-speaking
country, it may be a good choice. I don't see the difference with choosing
to write a book in one language or another.

> Note that this is orthogonal to issues such as which language
> identifiers [...] are written in [...];

It is indeed different, but not orthogonal.

> the problem is that e.g. given a function
>
> func ?????(s : String)
>
> it isn't obvious to a non-Arabic speaking user how to enter ????? in
> order to call it.

OK. Clearly, someone not knowing the Arabic alphabet will have difficulties
with this one, but if one has good reason to think the targeted developer
community is literate in Arabic, with a lower mastery of the Latin
alphabet, it still may be a good idea.

If I understand you correctly, an Arabic speaker should always
transliterate the function name to ASCII, and there are many different
ways to do it (see e.g.
https://en.wikipedia.org/wiki/Romanization_of_Arabic). Should they name
their function altawil, altwl, alt.wl? And when calling it later, they
should remember their ad-hoc ASCII Arabic orthography. I don't doubt many,
if not most, do it, but it can add an extra burden in programming. It's a
bit like remembering whether your name should be transliterated in Greek
as ???????? or ??????, and using that for every identifier you come
across.

A mitigation strategy is to name your identifiers x1, x2, x3 and so on.
The common knowledge is that this is a bad idea, and programming teachers
spend some time discouraging their students from using such a strategy.
However, many Chinese websites and email addresses are of this form,
because it is the only one clear enough for a big fraction of the
population.

> This isn't true of e.g.
>
> func pituus(s : String)
>
> Even though "pituus" is Finnish, it's still ASCII and everyone knows how
> to type that.

Avoiding "special characters" can be annoying in Latin-based languages,
especially for beginners, and kids among them. Unicode's (too slow)
adoption has already eased the difficulty of writing a "Hello world" and a
"What's your name?" programme, but avoiding non-ASCII characters in
identifiers can be a bit esoteric for kids with a native language full of
them. (And by the way, several big French companies regularly send me mail
with my first name mojibaked, while their software is presumably written
by adults.)

[...]

> UAX #31 also manages (I suspect unintentionally) to give a good example
> of a pair of Farsi identifiers that might be awkward to tell apart in
> certain fonts, namely ?????? and ???????; I think those are OK in
> monospaced fonts, where the join is reasonably wide, but at small point
> sizes in proportional fonts the difference in appearance is very subtle,
> particularly for a non-Arabic speaker.

In ASCII, identifiers with I, l, and 1 can be difficult to tell apart. And
it is not an artificial problem: I once had some difficulties with an
automatically generated login which was do11y but tried to type dolly,
despite my familiarity with ASCII.
So I guess this problem is not specific to the ASCII vs non-ASCII debate.

From unicode at unicode.org Thu Jun 7 11:01:08 2018
From: unicode at unicode.org (Alastair Houghton via Unicode)
Date: Thu, 7 Jun 2018 17:01:08 +0100
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: <133f6bf5-575e-dab0-843f-f409b30b0232@gmail.com>
References: <133f6bf5-575e-dab0-843f-f409b30b0232@gmail.com>
Message-ID: 

On 7 Jun 2018, at 15:51, Frédéric Grosshans via Unicode wrote:
>
>> IMO the major issue with non-ASCII identifiers is not a technical one,
>> but rather that it runs the risk of fragmenting the developer
>> community. Everyone can *type* ASCII and everyone can read Latin
>> characters (for reasonably wide values of "everyone", at any rate; most
>> computer users aren't going to have a problem). Not everyone can type
>> Hangul, Chinese or Arabic (for instance), and there is no good fix or
>> workaround for this.
> Well, your "reasonable" value of everyone excludes many kids,

Every keyboard I've ever seen, including Chinese ones, is marked with ASCII
characters as well. Typing ASCII on a machine in the Chinese locale might
not be entirely straightforward, but entering Chinese characters, even on
such a machine, takes significant training, and on a machine not set to a
Chinese locale it might even require the installation of additional
software. It isn't even the case, as I understand it, that all machines
set to Chinese locales use the same input method, so being able to enter
Chinese on one system doesn't necessarily mean you'll be able to do so on
another. (I imagine it makes it easier to learn, once you've done it once,
but still...)

I appreciate that the upshot of the Anglicised world of software
engineering is that native English speakers have an advantage, and those
for whom Latin isn't their usual script are at a particular disadvantage,
and I'm sure that seems unfair to many of us, but that doesn't mean that
allowing the use of other scripts everywhere, desirable as it is, is
entirely unproblematic.

>> it isn't obvious to a non-Arabic speaking user how to enter ????? in
>> order to call it.
> OK. Clearly, someone not knowing the Arabic alphabet will have
> difficulties with this one, but if one has good reason to think the
> targeted developer community is literate in Arabic, with a lower mastery
> of the Latin alphabet, it still may be a good idea.
> If I understand you correctly, an Arabic speaker should always
> transliterate the function name to ASCII,

That's one option; or they could write it in Arabic, but they need to be
aware of the consequences of doing so (and those they are working for or
with also need to understand that); or they could choose some other
language, perhaps one shared with other teams who are likely to work on
the code.

Imagine you outsourced development to a team that happened to be Arabic
speaking, and they developed (let's say) French-language software for you,
but later you wanted to bring development in house and found all the
identifiers were in Arabic script, which made the code very difficult for
your developers to work with. That isn't exactly going to make your day,
and if it isn't a problem that anyone has mentioned, it might not have
been obvious, when you originally outsourced your development, that you
needed to make sure people weren't going to do that.

>> UAX #31 also manages (I suspect unintentionally) to give a good example
>> of a pair of Farsi identifiers that might be awkward to tell apart in
>> certain fonts, namely ??????
>> and ???????; I think those are OK in monospaced fonts, where the join
>> is reasonably wide, but at small point sizes in proportional fonts the
>> difference in appearance is very subtle, particularly for a non-Arabic
>> speaker.
> In ASCII, identifiers with I, l, and 1 can be difficult to tell apart.
> And it is not an artificial problem: I once had some difficulties with
> an automatically generated login which was do11y but tried to type
> dolly, despite my familiarity with ASCII. So I guess this problem is not
> specific to the ASCII vs non-ASCII debate

It isn't, though fonts used by programmers typically emphasise the
differences between I, l and 1, as well as 0 and O, 5 and S and so on,
specifically to avoid this problem.

But please don't misunderstand; I am not, and have not been, arguing
against non-ASCII identifiers. We were asked whether there were any
problems. These are problems (or perhaps we might call them "trade-offs").
We can debate the severity of them, and whether, and what, it's worthwhile
doing anything to mitigate any of them. What we shouldn't do is sweep them
under the carpet.

Personally I think a combination of documentation to explain that it's
worth thinking carefully about which script(s) to use, and some steps to
consider certain characters to be equivalent even though they aren't the
same (and shouldn't be the same even when normalised) might be a good
idea. Is that really so controversial a position?

Kind regards,

Alastair.

--
http://alastairs-place.net

From unicode at unicode.org Thu Jun 7 12:31:00 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Thu, 7 Jun 2018 19:31:00 +0200 (CEST)
Subject: The Unicode Standard and ISO
Message-ID: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27>

On Thu, 7 Jun 2018 15:20:29 +0200, Mark Davis ☕️ via Unicode wrote:
>
> A few facts.
>
> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
>
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler
> could speak to the synchronization level in more detail, but the above
> statement is inaccurate.
>
> > ... For another part it [sync with ISO/IEC 15897] failed because the
> > Consortium refused to cooperate, despite repeated proposals for a
> > merger of both instances.
>
> I recall no serious proposals for that.
>
> (And in any event, very unlike the synchrony with 10646 and 14651,
> ISO 15897 brought no value to the table. Certainly nothing to outweigh
> the considerable costs of maintaining synchrony. Completely inadequate
> structure for modern system requirements, no particular industry
> support, and scant content: see Wikipedia for "The registry has not been
> updated since December 2001".)

Thank you for the correction regarding the Unicode-ISO/IEC 14651 synchrony;
indeed, while on http://www.unicode.org/reports/tr10/#Synch_ISO14651 we can
read that "This relationship between the two standards is similar to that
maintained between the Unicode Standard and ISO/IEC 10646[,]" confusingly
there seems to be no related FAQ. Even more confusingly, a straightforward
question like "I was wondering which ISO standards other than ISO 10646
specify the same things as the Unicode Standard" remains ultimately
unanswered. The reason might be that the "and of those, which ones are
actively kept in sync" part is really best answered by "none." In fact,
while UCA is synched with ISO/IEC 14651, the reverse statement is
reportedly false.
Hence, UCA would be what is called an implementation of ISO/IEC 14651.
Nevertheless, UAX #10 refers to "The synchronized version of
ISO/IEC 14651[,]" and mentions a "common tool[.]"

Hence one simple question: Why does the fact that the Unicode-ISO synchrony
encompasses *two* standards remain untold in the first place?

As for ISO/IEC 15897, it would certainly be a piece of good diplomacy for
Unicode to pick the usable data in the existing set; then ISO/IEC 15897
would be in a position to cite CLDR as a normative reference, so that all
potential contributors are redirected and may feel free to contribute to
CLDR. And it would be nice if Unicode didn't forget to order an additional
FAQ about the topic, please.

Thanks,

Marcel

From unicode at unicode.org Thu Jun 7 12:38:50 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 7 Jun 2018 10:38:50 -0700
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: <133f6bf5-575e-dab0-843f-f409b30b0232@gmail.com>
Message-ID: 

An HTML attachment was scrubbed...

From unicode at unicode.org Thu Jun 7 12:47:02 2018
From: unicode at unicode.org (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?= via Unicode)
Date: Thu, 7 Jun 2018 19:47:02 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: <133f6bf5-575e-dab0-843f-f409b30b0232@gmail.com>
Message-ID: <46894749-3726-71b1-b474-e0e03dff2c42@gmail.com>

An HTML attachment was scrubbed...

From unicode at unicode.org Thu Jun 7 14:13:14 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Thu, 7 Jun 2018 21:13:14 +0200 (CEST)
Subject: The Unicode Standard and ISO
In-Reply-To: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27>
References: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27>
Message-ID: <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18>

On Thu, 17 May 2018 22:26:15 +0000, Peter Constable via Unicode wrote:
[…]
> Hence, from an ISO perspective, ISO 10646 is the only standard for which
> on-going synchronization with Unicode is needed or relevant.

This point of view is fueled by the Unicode Standard being traditionally
thought of as a mere character set, regardless of all efforts, lastly by
first responder Asmus Freytag himself, to widen the conception.

On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
>
> It would be great if mutual synchronization were considered to be of
> benefit. Some of us in SC2 are not happy that the Unicode Consortium has
> published characters which are still under Technical ballot. And this
> did not happen only once.

I'm not happy catching up on this thread out of time, all the less as it
ultimately brings me back where I started in 2014/2015: to the wrong
character names that the ISO/IEC 10646 merger infiltrated into Unicode.
This is the very thing I did not vent in my first reply. From my point of
view, this misfortune would be reason enough for Unicode not to seek
further cooperation with ISO/IEC. But I remember the many voices raised on
this List to tell me that this is all over and forgiven.

Therefore I'm confident that the Consortium will have the mindfulness to
complete the ISO/IEC JTC 1 partnership by publicly assuming synchronization
with ISO/IEC 14651, and by achieving a full-scale merger with
ISO/IEC 15897, after which the valid data would stay hosted entirely in
CLDR, and ISO/IEC 15897 would be its ISO mirror. That is a matter of smart
diplomacy, at which Unicode may again prove to be great.

Please consider making this move.
Thanks,

Marcel

From unicode at unicode.org Thu Jun 7 14:46:12 2018
From: unicode at unicode.org (via Unicode)
Date: Thu, 7 Jun 2018 22:46:12 +0300
Subject: The Unicode Standard and ISO
Message-ID: <000b01d3fe98$30c29920$9247cb60$@iki.fi>

I cannot but fully agree with Mark and Michael.

Sincerely,

Erkki I. Kolehmainen
Mannerheimintie 75 B 37, 00270 Helsinki, Finland
Mob: +358 400 825 943

-----Original message-----
From: Unicode, on behalf of Michael Everson via Unicode
Sent: Thursday, 7 June 2018 16:29
To: unicode Unicode Discussion
Subject: Re: The Unicode Standard and ISO

On 7 Jun 2018, at 14:20, Mark Davis ☕️ via Unicode wrote:
>
> A few facts.
>
>> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
>
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler
> could speak to the synchronization level in more detail, but the above
> statement is inaccurate.

Mark is right.

>> ... For another part it [sync with ISO/IEC 15897] failed because the
>> Consortium refused to cooperate, despite repeated proposals for a
>> merger of both instances.
>
> I recall no serious proposals for that.

Nor do I.

> (And in any event, very unlike the synchrony with 10646 and 14651,
> ISO 15897 brought no value to the table. Certainly nothing to outweigh
> the considerable costs of maintaining synchrony. Completely inadequate
> structure for modern system requirements, no particular industry
> support, and scant content: see Wikipedia for "The registry has not been
> updated since December 2001".)

Mark is right.

Michael Everson

From unicode at unicode.org Thu Jun 7 17:43:04 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 8 Jun 2018 00:43:04 +0200
Subject: The Unicode Standard and ISO
In-Reply-To: <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18>
References: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27> <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18>
Message-ID: 

2018-06-07 21:13 GMT+02:00 Marcel Schneider via Unicode:

> On Thu, 17 May 2018 22:26:15 +0000, Peter Constable via Unicode wrote:
> […]
> > Hence, from an ISO perspective, ISO 10646 is the only standard for
> > which on-going synchronization with Unicode is needed or relevant.
>
> This point of view is fueled by the Unicode Standard being traditionally
> thought of as a mere character set, regardless of all efforts, lastly by
> first responder Asmus Freytag himself, to widen the conception.
>
> On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode
> responded:
> >
> > It would be great if mutual synchronization were considered to be of
> > benefit. Some of us in SC2 are not happy that the Unicode Consortium
> > has published characters which are still under Technical ballot. And
> > this did not happen only once.
>
> I'm not happy catching up on this thread out of time, all the less as it
> ultimately brings me back where I started in 2014/2015: to the wrong
> character names that the ISO/IEC 10646 merger infiltrated into Unicode.
> This is the very thing I did not vent in my first reply. From my point
> of view, this misfortune would be reason enough for Unicode not to seek
> further cooperation with ISO/IEC.

The "normative names" are in fact normative only as a forward reference to
the ISO/IEC repertoire (because it insists that these names are an
essential part of the stable encoding policy, which was then integrated
into the Unicode stability rules, so that the normative reference remains
stable as well). Besides this, Unicode has other, more useful properties.
People don't care at all about these names. The character properties, and
the related algorithms that use them (and even the representative glyph,
though it is not stabilized), are much more important - and ISO/IEC 10646
does not do anything to solve the real encoding issues and the properties
needed for correct processing. Unicode is based more on commonly used
practices, and allows experimentation and progressive enhancement without
having to break the agreed ISO/IEC normative properties. The position of
Unicode is more pragmatic, and is much more open to a lot of contributors
than the small ISO/IEC subcommittees with, in fact, very few active
members; but ISO is still an interesting counter-power that allows
governments to choose where it is more useful to contribute and have
influence, when the industry may have different needs and practices not
following the government recommendations adopted at ISO.

From unicode at unicode.org Thu Jun 7 19:22:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 8 Jun 2018 01:22:11 +0100
Subject: Hyphenation Markup
In-Reply-To: <20180602054429.1ef142ab@JRWUBU2>
References: <20180602054429.1ef142ab@JRWUBU2>
Message-ID: <20180608012211.2bd24320@JRWUBU2>

On Sat, 2 Jun 2018 05:44:29 +0100
Richard Wordingham via Unicode wrote:

> In Latin text, one can indicate permissible line break opportunities
> between grapheme clusters by inserting U+00AD SOFT HYPHEN. What low-end
> schemes, if any, exist for such mark-up within grapheme clusters?

It didn't come into existence, but I've found a proposed HTML markup
element HYPH that would almost have done the job, at
http://www.nada.kth.se/i18n/html/hyph.html . The one problem is the old one
of displaying a left matra in isolation. Of course, if one has total font
control, the PUA could have come to the rescue if HYPH had been adopted
and implemented.

Richard.

From unicode at unicode.org Thu Jun 7 22:32:51 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
Subject: The Unicode Standard and ISO
Message-ID: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>

On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
>
> I cannot but fully agree with Mark and Michael.
>
> Sincerely

Thank you for confirming. All witnesses concur to invalidate the statement
about the uniqueness of the ISO/IEC 10646 - Unicode synchrony. After being
invented in its current form, sorting was standardized simultaneously in
ISO/IEC 14651 and in the Unicode Collation Algorithm, the latter including
practice-oriented extra features. Since then, these two standards have been
kept in synchrony uninterruptedly.

Getting people to correct the overall response was not really my initial
concern, however. What bothered me, before I learned that Unicode refuses
to cooperate with ISO/IEC JTC1 SC22, is that the registration of the French
locale in CLDR is still surprisingly incomplete despite the meritorious
efforts made by the actual contributors; and then, after some
investigation, that the main part of the potential French contributors are
prevented from cooperating because Unicode refuses to cooperate with
ISO/IEC on locale data, while ISO/IEC 15897 predates CLDR - reportedly
after many attempts to merge both standards, which remained unsuccessful
without any striking exposure or friendly agreement that might have
avoided the impression of an unconcerned rebuff.
Best regards,

Marcel

From unicode at unicode.org Thu Jun 7 22:58:14 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 8 Jun 2018 05:58:14 +0200 (CEST)
Subject: The Unicode Standard and ISO
In-Reply-To: 
References: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27> <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18>
Message-ID: <1646538911.167.1528430294413.JavaMail.www@wwinf1m18>

On Fri, 8 Jun 2018 00:43:04 +0200, Philippe Verdy via Unicode wrote:
[cited mail]
>
> The "normative names" are in fact normative only as a forward reference
> to the ISO/IEC repertoire (because it insists that these names are an
> essential part of the stable encoding policy, which was then integrated
> into the Unicode stability rules, so that the normative reference
> remains stable as well). Besides this, Unicode has other, more useful
> properties. People don't care at all about these names.

Indeed, we have learned to live even with those that are uselessly
misleading and were pushed through against better proposals made on the
Unicode side, particularly the wrong left/right attributes. Unicode has
worked hard to palliate these misnomers by introducing the
Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type (Open, Close) properties,
and by specifying in TUS that, besides a few exceptions, LEFT and RIGHT in
names of paired punctuation are to be read as OPENING and CLOSING,
respectively.

> The character properties and the related algorithms that use them (and
> even the representative glyph, though it is not stabilized) are much
> more important - and ISO/IEC 10646 does not do anything to solve the
> real encoding issues and the properties needed for correct processing.
> Unicode is based more on commonly used practices, and allows
> experimentation and progressive enhancement without having to break the
> agreed ISO/IEC normative properties. The position of Unicode is more
> pragmatic, and is much more open to a lot of contributors than the small
> ISO/IEC subcommittees with, in fact, very few active members; but ISO is
> still an interesting counter-power that allows governments to choose
> where it is more useful to contribute and have influence, when the
> industry may have different needs and practices not following the
> government recommendations adopted at ISO.

Now it becomes clear to me that this opportunity of governmental action is
exactly what could be useful when it comes to fixing the textual appearance
of national user interfaces, and that is exactly why not federating
communities around CLDR, and not attempting to make efforts converge, is so
counter-productive. Thanks for getting this point out.

Best regards,

Marcel

From unicode at unicode.org Fri Jun 8 03:06:46 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 8 Jun 2018 09:06:46 +0100
Subject: The Unicode Standard and ISO
In-Reply-To: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
Message-ID: <20180608090646.03604ff1@JRWUBU2>

On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
Marcel Schneider via Unicode wrote:

> Thank you for confirming. All witnesses concur to invalidate the
> statement about the uniqueness of the ISO/IEC 10646 - Unicode synchrony.
> After being invented in its current form, sorting was standardized
> simultaneously in ISO/IEC 14651 and in the Unicode Collation Algorithm,
> the latter including practice-oriented extra features.

The UCA contains features essential for respecting canonical equivalence.
ICU works hard to avoid the extra effort involved, apparently even going to
the extreme of implicitly declaring that Vietnamese is not a human
language. (Some contractions are not supported by ICU!) The synchronisation
is manifest in the DUCET collation, which seems to make the effort to
ensure that some canonical equivalent will sort the same way under
ISO/IEC 14651.

> Since then, these two standards have been kept in synchrony
> uninterruptedly.

But the consortium has formally dropped the commitment to DUCET in CLDR.
Even when restricted to strings of assigned characters, CLDR and ICU no
longer make the effort to support the DUCET collation. Indeed, I'm not even
sure that the DUCET is a tailoring of the root CLDR collation, even when
restricted to assigned characters. Tailorings tend to have odd side
effects; fortunately, they rarely if ever matter. CLDR root is a rewrite of
DUCET with modifications; it has changes that are prohibited as
'tailorings'!

Richard.
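The canonical-equivalence requirement itself is easy to observe from
outside; a minimal sketch (Python with the PyICU binding, assumed to be
installed):

    from icu import Collator, Locale

    collator = Collator.createInstance(Locale("fr_FR"))

    composed   = "c\u00F4te"   # côte with precomposed U+00F4
    decomposed = "co\u0302te"  # côte with o + COMBINING CIRCUMFLEX ACCENT

    # A UCA-based collator must compare canonical equivalents as equal:
    assert collator.compare(composed, decomposed) == 0

Whether the contractions a particular language needs survive that machinery
is, as said above, a separate question.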
From unicode at unicode.org Fri Jun 8 04:07:48 2018
From: unicode at unicode.org (Henri Sivonen via Unicode)
Date: Fri, 8 Jun 2018 12:07:48 +0300
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: 
Message-ID: 

On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen wrote:
> Considering that ruling out too much can be a problem later, but just
> treating anything above ASCII as opaque hasn't caused trouble (that I
> know of) for HTML other than compatibility issues with XML's stricter
> stance, why should a programming language, if it opts to support
> non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> complexity of UAX #31 instead of allowing everything above ASCII in
> identifiers? In other words, what problem does making a programming
> language conform to UAX #31 solve?

After refreshing my memory of XML history, I realize that mentioning XML
does not helpfully illustrate my question, despite the mention of XML 1.0
5th ed. in UAX #31 itself. My apologies for that. Please ignore the XML
part.

Trying to rephrase my question more clearly:

Let's assume that we are designing a computer-parseable syntax where
tokens consisting of user-chosen characters can't occur next to each other
and, instead, always have some syntax-reserved characters between them.
That is, I'm talking about syntaxes that look like this (could be e.g.
Java):

ab.cd();

Here, ab and cd are tokens with user-chosen characters, whereas the space
(the indent), the period, the parentheses and the semicolon are
syntax-reserved. We know that ab and cd are distinct tokens, because there
is a period between them, and we know the opening parenthesis ends the cd
token.

To illustrate what I'm explicitly _not_ talking about, I'm not talking
about a syntax like this:

?????

Here ?? and ?? are user-named variable names and ? is a user-named
operator, and the distinction between different kinds of user-named tokens
has to be known somehow in order to be able to tell that there are three
distinct tokens: ??, ?, and ??.

My question is:

When designing a syntax where tokens with user-chosen characters can't
occur next to each other without some syntax-reserved characters between
them, what advantages are there from limiting the user-chosen characters
according to UAX #31, as opposed to treating any character that is not a
syntax-reserved character as a character that can occur in user-named
tokens?

I understand that taking the latter approach allows users to mint tokens
that on some aesthetic measure don't make sense (e.g. minting tokens that
consist of glyphless code points), but why is it important to prescribe
that this is prohibited, as opposed to just letting users choose not to
mint tokens that are inconvenient for them to work with, given the
behavior that their plain-text editor gives to various characters? That
is, why is conforming to UAX #31 worth the risk of prohibiting the use of
characters that some users might want to use? The introduction of XID
after ID, and the introduction of Extended Hashtag Identifiers after XID,
is indicative of over-restriction having been a problem.

Limiting user-minted tokens to UAX #31 does not appear to be necessary for
security purposes, considering that HTML and CSS exist in a particularly
adversarial environment and get away with taking the approach that any
character that isn't a syntax-reserved character is collected as part of a
user-minted identifier. (Informally, both treat non-ASCII characters the
same as an ASCII underscore. HTML even treats non-whitespace, non-U+0000
ASCII controls that way.)

--
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/
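To make the alternative concrete, a minimal sketch (Python; the toy grammar
and its reserved set are hypothetical, not any real language's) of the
HTML/CSS-style approach, which collects every non-reserved character into a
user-minted token with no UAX #31 tables at all:

    RESERVED = set(" \t\r\n.();")  # hypothetical syntax-reserved characters

    def tokens(src):
        i = 0
        while i < len(src):
            if src[i] in RESERVED:
                if not src[i].isspace():
                    yield ("punct", src[i])
                i += 1
            else:
                j = i
                while j < len(src) and src[j] not in RESERVED:
                    j += 1
                yield ("name", src[i:j])  # any non-reserved run, ASCII or not
                i = j

    assert list(tokens("ab.cd();")) == [
        ("name", "ab"), ("punct", "."), ("name", "cd"),
        ("punct", "("), ("punct", ")"), ("punct", ";"),
    ]

The question stands: what does layering UAX #31 on top of this buy, beyond
ruling out tokens that users could simply choose not to mint?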
From unicode at unicode.org Fri Jun 8 06:06:18 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Fri, 8 Jun 2018 13:06:18 +0200
Subject: The Unicode Standard and ISO
In-Reply-To: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
Message-ID: 

Where are you getting your "facts"? Among many unsubstantiated or ambiguous
claims in that very long sentence:

1. "French locale in CLDR is still surprisingly incomplete".
   1. For each release, the data collected for the French locale is
      complete to the bar we have set for Level=Modern.
   2. What you may mean is that CLDR doesn't support a structure that you
      think it should. For that, you have to make a compelling case that
      the structure you propose is worth it, worth diverting people from
      other priorities.
2. French contributors are not "prevented from cooperating". Where do you
   get this from? Who do you mean?
   1. We have had many French speakers contribute data over time. Now, it
      works better when people engage under the umbrella of an
      organization, but even there that doesn't have to be a company; we
      have liaison relationships with government agencies and NGOs.
3. There were not "many attempts" at a merger, and Unicode didn't "refuse"
   anything. Who do you think "attempted", and when?
   1. Albeit given the state of ISO/IEC 15897, there was nothing such a
      merger would have contributed anyway.
   2. BTW, your use of the term "refuse" might be a language issue. I
      don't "refuse" to respond to the widow of a Nigerian Prince who
      wants to give me $1M. Since I don't think it is worth my time, or am
      not willing to upfront the low, low fee of $10K, I might "ignore"
      the email, or "not respond" to it. Or I might "decline" it with a
      no-thanks or not-interested response. But none of that is to
      "refuse" it.

Mark

On Fri, Jun 8, 2018 at 5:32 AM, Marcel Schneider via Unicode
<unicode at unicode.org> wrote:

> On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode
> wrote:
> >
> > I cannot but fully agree with Mark and Michael.
> >
> > Sincerely
>
> Thank you for confirming. All witnesses concur to invalidate the
> statement about the uniqueness of the ISO/IEC 10646 - Unicode synchrony.
> After being invented in its current form, sorting was standardized
> simultaneously in ISO/IEC 14651 and in the Unicode Collation Algorithm,
> the latter including practice-oriented extra features. Since then, these
> two standards have been kept in synchrony uninterruptedly.
>
> Getting people to correct the overall response was not really my initial
> concern, however. What bothered me, before I learned that Unicode
> refuses to cooperate with ISO/IEC JTC1 SC22, is that the registration of
> the French locale in CLDR is still surprisingly incomplete despite the
> meritorious efforts made by the actual contributors; and then, after
> some investigation, that the main part of the potential French
> contributors are prevented from cooperating because Unicode refuses to
> cooperate with ISO/IEC on locale data, while ISO/IEC 15897 predates
> CLDR - reportedly after many attempts to merge both standards, which
> remained unsuccessful without any striking exposure or friendly
> agreement that might have avoided the impression of an unconcerned
> rebuff.
>
> Best regards,
>
> Marcel

From unicode at unicode.org Fri Jun 8 06:40:21 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Fri, 8 Jun 2018 13:40:21 +0200
Subject: The Unicode Standard and ISO
In-Reply-To: <20180608090646.03604ff1@JRWUBU2>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18> <20180608090646.03604ff1@JRWUBU2>
Message-ID: 

Mark

On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode
<unicode at unicode.org> wrote:

> On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> Marcel Schneider via Unicode wrote:
>
> > Thank you for confirming. All witnesses concur to invalidate the
> > statement about the uniqueness of the ISO/IEC 10646 - Unicode
> > synchrony. After being invented in its current form, sorting was
> > standardized simultaneously in ISO/IEC 14651 and in the Unicode
> > Collation Algorithm, the latter including practice-oriented extra
> > features.
>
> The UCA contains features essential for respecting canonical
> equivalence. ICU works hard to avoid the extra effort involved,
> apparently even going to the extreme of implicitly declaring that
> Vietnamese is not a human language.

A bit over the top, eh?

> (Some contractions are not supported by ICU!)

I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868, which
nicely outlines a proposal for dealing with a number of problems with
Vietnamese. We clearly don't support every sorting feature that various
dictionaries and agencies come up with. Sometimes it is because we can't
(yet) see a good way to do it:

1. it might be not determinant: many governmental standards or style
   sheets require "interesting" sorting, such as determining that "XI" is
   a roman numeral (not the president of China) and sorting it as 11, or
   when "St." is meant to be Street *and* when meant to be Saint
   (St. Stephen's St.)
2. the prospective cost in memory, code complexity, or performance, or
   the time necessary to figure out how to do complex requirements,
   doesn't seem to warrant adding it at this point.

Now, if you or others are interested in proposing specific patches to
address certain issues, then you can propose that. Best to make a proposal
(ticket) before doing the work, because if the solution is very intricate,
even the time necessary to evaluate the patch can be too much to fit into
the schedule. For that reason, it is best to break up such tickets into
small, tractable pieces.

> The synchronisation is manifest in the DUCET collation, which seems to
> make the effort to ensure that some canonical equivalent will sort the
> same way under ISO/IEC 14651.
> > Since then, these two standards have been kept in synchrony
> > uninterruptedly.
>
> But the consortium has formally dropped the commitment to DUCET in CLDR.
> Even when restricted to strings of assigned characters, CLDR and ICU no
> longer make the effort to support the DUCET collation. Indeed, I'm not
> even sure that the DUCET is a tailoring of the root CLDR collation, even
> when restricted to assigned characters. Tailorings tend to have odd side
> effects; fortunately, they rarely if ever matter. CLDR root is a rewrite
> of DUCET with modifications; it has changes that are prohibited as
> 'tailorings'!

CLDR does make some tailorings to the DUCET to create its root collation,
notably adding special contractions of private-use characters to allow for
tailoring support and indexes
[http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt]
plus the rearrangement of some characters (mostly punctuation and symbols)
to allow runtime parametric reordering of groups of characters (e.g. to
put numbers after letters)
[http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters].

- If there are other changes that are not well documented, or if you think
  those features are causing problems in some way, please file a ticket.
- If there is a particular change that you think is not conformant to UCA,
  please also file that.

> Richard.

From unicode at unicode.org Fri Jun 8 07:01:48 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Fri, 8 Jun 2018 13:01:48 +0100
Subject: The Unicode Standard and ISO
In-Reply-To: <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18>
References: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27> <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18>
Message-ID: <7CD0EF10-D783-4F34-A064-73BE78A42AA0@evertype.com>

On 7 Jun 2018, at 20:13, Marcel Schneider via Unicode wrote:
> On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode
> responded:
>>
>> It would be great if mutual synchronization were considered to be of
>> benefit. Some of us in SC2 are not happy that the Unicode Consortium
>> has published characters which are still under Technical ballot. And
>> this did not happen only once.
>
> I'm not happy catching up on this thread out of time, all the less as it
> ultimately brings me back where I started in 2014/2015: to the wrong
> character names that the ISO/IEC 10646 merger infiltrated into Unicode.

Many things have more than one name. The only truly bad misnomers from
that period were related to a mapping error, namely in the treatment of
the Latvian characters which are called CEDILLA rather than COMMA BELOW.

> This is the very thing I did not vent in my first reply. From my point
> of view, this misfortune would be reason enough for Unicode not to seek
> further cooperation with ISO/IEC.

This is absolutely NOT what we want. What we want is for the two parties
to remember that industrial concerns and public concerns work best
together.

> But I remember the many voices raised on this List to tell me that this
> is all over and forgiven.

I think you are digging up an old grudge that nobody thinks about any
longer.

> Therefore I'm confident that the Consortium will have the mindfulness to
> complete the ISO/IEC JTC 1 partnership by publicly assuming
> synchronization with ISO/IEC 14651,

There is no trouble with ISO/IEC 14651.
> and by achieving a full-scale merger with ISO/IEC 15897, after which the
> valid data would stay hosted entirely in CLDR, and ISO/IEC 15897 would
> be its ISO mirror.

I wonder if Mark Davis will be quick to agree with me when I say that
ISO/IEC 15897 has no use and should be withdrawn.

Michael Everson

From unicode at unicode.org Fri Jun 8 07:05:37 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Fri, 8 Jun 2018 13:05:37 +0100
Subject: The Unicode Standard and ISO
In-Reply-To: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
Message-ID: <0A930B7D-BE9C-445B-B0C0-15B5D38A3B12@evertype.com>

On 8 Jun 2018, at 04:32, Marcel Schneider via Unicode wrote:

> the registration of the French locale in CLDR is still surprisingly
> incomplete despite the meritorious efforts made by the actual
> contributors

Nothing prevents people from working to complete the French locale in
CLDR. Synchronization with an unused ISO standard is not necessary to do
that.

Michael Everson

From unicode at unicode.org Fri Jun 8 07:31:14 2018
From: unicode at unicode.org (Andrew West via Unicode)
Date: Fri, 8 Jun 2018 13:31:14 +0100
Subject: The Unicode Standard and ISO
In-Reply-To: <7CD0EF10-D783-4F34-A064-73BE78A42AA0@evertype.com>
References: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27> <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18> <7CD0EF10-D783-4F34-A064-73BE78A42AA0@evertype.com>
Message-ID: 

On 8 June 2018 at 13:01, Michael Everson via Unicode wrote:
>
> I wonder if Mark Davis will be quick to agree with me when I say that
> ISO/IEC 15897 has no use and should be withdrawn.

It was reviewed and confirmed in 2017, so the next systematic review won't
be until 2022. And as the standard is now under SC35, national committees
mirroring SC2 may well overlook (or be unable to provide feedback to) the
systematic review when it next comes around.

I agree that ISO/IEC 15897 has no use, and should be withdrawn.

Andrew

From unicode at unicode.org Fri Jun 8 07:39:09 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Fri, 8 Jun 2018 14:39:09 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To: 
References: 
Message-ID: <8DA274E4-C8CB-438B-951C-162243BACC34@telia.com>

> On 8 Jun 2018, at 11:07, Henri Sivonen via Unicode wrote:
>
> My question is:
>
> When designing a syntax where tokens with user-chosen characters can't
> occur next to each other without some syntax-reserved characters between
> them, what advantages are there from limiting the user-chosen characters
> according to UAX #31, as opposed to treating any character that is not a
> syntax-reserved character as a character that can occur in user-named
> tokens?

It seems best to stick to the canonical forms and add the sequences one
deems useful and safe, as treating inequivalent characters as equal is
likely to be confusing. But this requires more work; it seems that the use
of the compatibility forms is aimed at something simple to implement.

From unicode at unicode.org Fri Jun 8 07:50:28 2018
From: unicode at unicode.org (Tom Gewecke via Unicode)
Date: Fri, 8 Jun 2018 08:50:28 -0400
Subject: The Unicode Standard and ISO
In-Reply-To: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
Message-ID: 

> On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode wrote:
>
> What bothered me ... is that the registration of the French locale in
> CLDR is
From unicode at unicode.org Fri Jun 8 07:50:28 2018
From: unicode at unicode.org (Tom Gewecke via Unicode)
Date: Fri, 8 Jun 2018 08:50:28 -0400
Subject: The Unicode Standard and ISO
In-Reply-To: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
Message-ID:

> On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode wrote:
>
> What bothered me ... is that the registration of the French locale in CLDR is still surprisingly incomplete

Could you provide an example or two?

From unicode at unicode.org Fri Jun 8 08:52:50 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 8 Jun 2018 15:52:50 +0200 (CEST)
Subject: The Unicode Standard and ISO
Message-ID: <1516605476.11375.1528465970271.JavaMail.www@wwinf1m18>

On Fri, 8 Jun 2018 13:06:18 +0200, Mark Davis ☕️ via Unicode wrote:
>
> Where are you getting your "facts"? Among many unsubstantiated or ambiguous claims in that very long sentence:
>
> "French locale in CLDR is still surprisingly incomplete".
>
> For each release, the data collected for the French locale is complete to the bar we have set for Level=Modern.

What got me started is that before I even requested a submitter ID (and the reason why I requested one), "Characters | Category | Label | keycap" remained untranslated, i.e. its French translation was "keycap". When I proposed "cabochon", the present contributors kindly upvoted or proposed "touche" even before I launched a forum thread, and when I became aware, I changed my vote and posted the rationale on the forum, so the upvoting contributor kindly followed, so that now we stay united for "touche" rather than "keycap".

Please note that I acknowledge everybody and don't criticize anybody. It doesn't require much imagination to figure out that when CLDR was set up, there were so few or even no French contributors that translating "keycap" either fell outside the deadline or was overlooked or whatever, and later passed unnoticed. That is a tracer detecting that none of the people setting up the French translation of the Code Charts were ever on the CLDR project. Because if anybody of them had been active on CLDR, no English word would have been kept in use mistakenly for the French locale. Beyond what everybody on this List is able to decrypt on his or her own, I'm not in a position to disclose any further personal information, for witness protection's sake.

> What you may mean is that CLDR doesn't support a structure that you think it should. For that, you have to make a compelling case that the structure you propose is worth it, worth diverting people from other priorities.

Thank you, that is not a problem and may be resolved after filing a ticket, which would be done for a later release, given that top priority tasks require a potentially huge amount of work. First, NBSP and NNBSP need to be added to the French charset (see http://unicode.org/cldr/trac/ticket/11120 ). Adding centuries to Date&Time (with French short form "s.") is of interest for any locale, but irrelevant to everyday business practice.

> French contributors are not "prevented from cooperating". Where do you get this from? Who do you mean?

Historic French contributors are ethically prevented from contributing to CLDR, because of a strong commitment to involve ISO/IEC, a notion that is very meaningful to Unicode. People relevant to projects for the French locale trace the borderline of applicability wider than do those people who are more closely tied to Unicode-related projects.

> We have many French contribute data over time.

When finding the word "keycap" as a French translation of "keycap" in my copy of the CLDR data at home, I wanted to know who contributed that data. I was told that when the survey is open, I'll see who is contributing. I won't blame those who are helping resolve the issue now.

> Now, it works better when people engage under the umbrella of an organization, but even there that doesn't have to be a company; we have liaison relationships with government agencies and NGOs.

That's fine. But even as a guest I'm well received, and anyhow the point is to bring the arguments. My concern is that starting with a good translation from scratch is more efficient than attempting to correct the same error(s) across multiple instances via the survey tool, which seems to be designed to fix small errors rather than to redesign entire parts of the scheme.

> There were not "many attempts" at a merger, and Unicode didn't "refuse" anything. Who do you think "attempted", and when?

An influential person consistently campaigned for a merger of CLDR and ISO/IEC 15897, but that never succeeded. It's unlikely to be ignored.

> Albeit given the state of ISO/IEC 15897, there was nothing such a merger would have contributed anyway.

I've taken a glance at the data of ISO/IEC 15897 and cannot conclude that there is nothing to pick from. At least they won't be disposed to sell you "keycap" as a French term or as being in any use in that target locale. And anyhow, the gesture would be appreciated as a piece of good diplomacy. Hopefully a lightweight proceeding could end up in that data being transferred to CLDR, and this being cited as sole normative reference in ISO/IEC 15897. As a result, everybody's happy.

> BTW, your use of the term "refuse" might be a language issue. I don't "refuse" to respond to the widow of a Nigerian Prince who wants to give me $1M. Since I don't think it is worth my time, or am not willing to upfront the low, low fee of $10K, I might "ignore" the email, or "not respond" to it. Or I might "decline" it with a no-thanks or not-interested response. But none of that is to "refuse" it.

Thanks, I got it (the point, and the e-mail). More seriously, to ignore or not to respond to, or even to decline a suggestion made by a well-known high official is in my opinion as much as to refuse that proposition. Beyond that, I think I'd be unable to carve out any common denominator with an unsolicited bulk e-mail.

Marcel

> On Fri, Jun 8, 2018 at 5:32 AM, Marcel Schneider via Unicode wrote:
>
> > On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> >
> > > I cannot but fully agree with Mark and Michael.
> > >
> > > Sincerely
> >
> > Thank you for confirming. All witnesses concur to invalidate the statement about uniqueness of the ISO/IEC 10646 – Unicode synchrony. After being invented in its actual form, sorting was standardized simultaneously in ISO/IEC 14651 and in the Unicode Collation Algorithm, the latter including practice-oriented extra features. Since then, these two standards are kept in synchrony uninterruptedly.
> >
> > Getting people to correct the overall response was not really my initial concern, however. What bothered me before I learned that Unicode refuses to cooperate with ISO/IEC JTC1 SC22 is that the registration of the French locale in CLDR is still surprisingly incomplete despite the meritorious efforts made by the actual contributors, and then after some investigation, that the main part of the potential French contributors are prevented from cooperating because Unicode refuses to cooperate with ISO/IEC on locale data while ISO/IEC 15897 predates CLDR, reportedly after many attempts made to merge both standards, remaining unsuccessful without any striking exposure or friendly agreement to avoid kind of an impression of unconcerned rebuff.
> >
> > Best regards,
> >
> > Marcel

From unicode at unicode.org Fri Jun 8 09:04:35 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 8 Jun 2018 16:04:35 +0200 (CEST)
Subject: The Unicode Standard and ISO
In-Reply-To:
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18>
Message-ID: <478577706.11665.1528466675286.JavaMail.www@wwinf1m18>

On Fri, 8 Jun 2018 08:50:28 -0400, Tom Gewecke via Unicode wrote:
>
> > On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode wrote:
> >
> > What bothered me ... is that the registration of the French locale in CLDR is still surprisingly incomplete
>
> Could you provide an example or two?

What got me started is that "Characters | Category | Label | keycap" remained untranslated, i.e. its French translation was "keycap". A number of keyword translations are missing or wrong. I can tell that all actual contributors are working hard to fix the issues. I can imagine that it's for lack of time in front of the huge mass of data, or from feeling so alone (only three corporate contributors, no liaison or NGOs). No wonder if the official French translators are all sulking the job (reportedly, not me figuring it out).

Marcel

From unicode at unicode.org Fri Jun 8 11:20:09 2018
From: unicode at unicode.org (Steven R. Loomis via Unicode)
Date: Fri, 8 Jun 2018 09:20:09 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <1516605476.11375.1528465970271.JavaMail.www@wwinf1m18>
References: <1516605476.11375.1528465970271.JavaMail.www@wwinf1m18>
Message-ID:

Marcel,

On Fri, Jun 8, 2018 at 6:52 AM, Marcel Schneider via Unicode <unicode at unicode.org> wrote:

> What got me started is that before I even requested a submitter ID (and the reason why I requested one), "Characters | Category | Label | keycap" remained untranslated, i.e. its French translation was "keycap". When I proposed "cabochon", the present contributors kindly upvoted or proposed "touche" even before I launched a forum thread, and when I became aware, I changed my vote and posted the rationale on the forum, so the upvoting contributor kindly followed, so that now we stay united for "touche" rather than "keycap".

But, it sounds like the CLDR process was successful in this case. Thank you for contributing.

> Please note that I acknowledge everybody and don't criticize anybody. It doesn't require much imagination to figure out that when CLDR was set up, there were so few or even no French contributors that translating "keycap" either fell outside the deadline or was overlooked or whatever, and later passed unnoticed. That is a tracer detecting that none of the people setting up the French translation of the Code Charts were ever on the CLDR project. Because if anybody of them had been active on CLDR, no English word would have been kept in use mistakenly for the French locale.

Actually, I think the particular data item you found is relatively new. The first values entered for it in any language were May 18th of this year. Were there votes for "keycap" earlier?

Rather than a tracer finding evidence of neglect, you are at the forefront of progressing the translated data for French. Congratulations!

> > French contributors are not "prevented from cooperating". Where do you get this from? Who do you mean?
>
> Historic French contributors are ethically prevented from contributing to CLDR, because of a strong commitment to involve ISO/IEC, a notion that is very meaningful to Unicode. People relevant to projects for the French locale trace the borderline of applicability wider than do those people who are more closely tied to Unicode-related projects.

Which contributors specifically are prevented?

> > There were not "many attempts" at a merger, and Unicode didn't "refuse" anything. Who do you think "attempted", and when?
>
> An influential person consistently campaigned for a merger of CLDR and ISO/IEC 15897, but that never succeeded. It's unlikely to be ignored.

Which person?

> > Albeit given the state of ISO/IEC 15897, there was nothing such a merger would have contributed anyway.
>
> I've taken a glance at the data of ISO/IEC 15897 and cannot conclude that there is nothing to pick from. At least they won't be disposed to sell you "keycap" as a French term or as being in any use in that target locale. And anyhow, the gesture would be appreciated as a piece of good diplomacy. Hopefully a lightweight proceeding could end up in that data being transferred to CLDR, and this being cited as sole normative reference in ISO/IEC 15897. As a result, everybody's happy.

The registry for ISO/IEC 15897 has neither data for French, nor structure that would translate the term "Characters | Category | Label | keycap". So there would be nothing to merge with there.

So, historically, CLDR began not as a part of Unicode, but as part of Li18nx under the Free Standards Group. See the bottom of the page http://cldr.unicode.org/index/acknowledgments "The founding members of the workgroup were IBM, Sun and OpenOffice.org". What we were trying to do was to provide internationalized content for Linux, and also to resolve the then-disparity between locale data across platforms. Locale data was very divergent between platforms - spelling and word choice changes, etc. Comparisons were done and a Common locale data repository (with its attendant XML formats) emerged. That's the C in CLDR. Seed data came from IBM's ICIR, which dates many decades before 15897 (example: http://www.computinghistory.org.uk/det/13342/IBM-National-Language-Support-Reference-Manual-Volume-2/ - 4th edition published in 1994). We contributed 100 locales to glibc as well.

Where there is opportunity for productive sync and merging is glibc. We have had some discussions, but more needs to be done - especially a lot of tooling work. Currently many bug reports are duplicated between glibc and CLDR, a sort of manual synchronization. Help wanted here.

Steven

From unicode at unicode.org Fri Jun 8 12:41:20 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 8 Jun 2018 18:41:20 +0100
Subject: The Unicode Standard and ISO
In-Reply-To:
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18> <20180608090646.03604ff1@JRWUBU2>
Message-ID: <20180608184120.7e108ab0@JRWUBU2>

On Fri, 8 Jun 2018 13:40:21 +0200 Mark Davis ☕️ wrote:

> Mark
>
> On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>
> > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST) Marcel Schneider via Unicode wrote:
> >
> > > Thank you for confirming. All witnesses concur to invalidate the statement about uniqueness of the ISO/IEC 10646 – Unicode synchrony. After being invented in its actual form, sorting was standardized simultaneously in ISO/IEC 14651 and in the Unicode Collation Algorithm, the latter including practice-oriented extra features.
> >
> > The UCA contains features essential for respecting canonical equivalence. ICU works hard to avoid the extra effort involved, apparently even going to the extreme of implicitly declaring that Vietnamese is not a human language.
>
> A bit over the top, eh?

Then remove the "no known language" from the bug list, or declare that you don't know SE Asian languages.

The root problem is that the UCA cannot handle syllable-by-syllable comparisons; if the UCA could handle that, the correct collation of unambiguous true Lao would become simple. The CLDR algorithm provides just enough memory to make Lao collation possible; however, ICU isn't fast enough to load a collation from customisation - it takes hours! One could probably do better if one added suffix contractions, but adding that capability might be a nightmare.

> I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868, which nicely outlines a proposal for dealing with a number of problems with Vietnamese.

It still includes a brute force work-around.

> We clearly don't support every sorting feature that various dictionaries and agencies come up with. Sometimes it is because we can't (yet) see a good way to do it:
> 1. it might not be deterministic: many governmental standards or style sheets require "interesting" sorting, such as determining that "XI" is a roman numeral (not the president of China) and sorting it as 11, or when "St." is meant to be Street *and* when meant to be Saint (St. Stephen's St.)

I believe the first is a character identity issue. Some of us see the difference between U+0058 LATIN CAPITAL LETTER X and the discouraged U+2169 ROMAN NUMERAL TEN as more than just a round-tripping difference. For example, by hand, I write the 'V' in 'Henry V' with a regnal number quite differently to 'Henry V.' where 'V' is short for a name.

> > > Since then, these two standards are kept in synchrony uninterruptedly.
> >
> > But the consortium has formally dropped the commitment to DUCET in CLDR. Even when restricted to strings of assigned characters, CLDR and ICU no longer make the effort to support the DUCET collation. Indeed, I'm not even sure that the DUCET is a tailoring of the root CLDR collation, even when restricted to assigned characters. Tailorings tend to have odd side effects; fortunately, they rarely if ever matter. CLDR root is a rewrite with modifications of DUCET; it has changes that are prohibited as 'tailorings'!
>
> CLDR does make some tailorings to the DUCET to create its root collation, notably adding special contractions of private use characters to allow for tailoring support and indexes [ http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt ] plus the rearrangement of some characters (mostly punctuation and symbols) to allow runtime parametric reordering of groups of characters (e.g. to put numbers after letters) [ http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters ].

My main point is that for practical purposes (i.e. ICU), Unicode has moved away from ISO/IEC 14651. The difference is small. I didn't say that there weren't good reasons.

> - If there are other changes that are not well documented, or if you think those features are causing problems in some way, please file a ticket.

Well, I don't have to use DUCET, though I've found it easier for unmaintainable tailorings. I need to write code to apply non-parametric LDML tailorings - ICU is, alas, ridiculously slow. I hope that's just a matter of optimisation balance between compiling a tailoring and applying it. Are there any published compliance tests for non-parametric tailorings? I'm not sure how one would check that an alleged parametric reordering of numbers and letters applied to a tailoring of DUCET was in accordance with the LDML definition, but I don't think you want to expend money sorting that out.

> - If there is a particular change that you think is not conformant to UCA, please also file that.

Sorry, I must have scanned the conformance requirements too quickly. I had got it into my head that someone had recklessly required that tailorings be in accordance with LDML. That constraint only applies to parametric tailorings, so any properly structured, unambiguously defined, finite, complete set of weights (albeit some implicit) is a tailoring of UCA. Formally, the CLDR root collation uses prefix weights, but using the CLDR collation algorithm on the CLDR root collation is equivalent to using the UCA. (This isn't always so - my tailoring for Lao using the CLDR collation algorithm is not equivalent to using the UCA on a finite table of weights.)

Richard.
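For readers following along, the runtime parametric reordering discussed above (e.g. numbers after letters) can be exercised directly through ICU. A minimal ICU4J sketch, assuming ICU4J 4.8 or later on the classpath; it is an illustration, not any poster's code:

    import com.ibm.icu.lang.UScript;
    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;

    public class ReorderDemo {
        public static void main(String[] args) {
            Collator coll = Collator.getInstance(ULocale.ROOT);
            // Put the Latin script group first and the digit group right after it,
            // so digits sort after Latin letters instead of before them.
            coll.setReorderCodes(UScript.LATIN, Collator.ReorderCodes.DIGIT);
            System.out.println(coll.compare("7", "x")); // positive: "7" now follows "x"
        }
    }

The same request can also be spelled as a BCP 47 keyword, e.g. Collator.getInstance(ULocale.forLanguageTag("en-u-kr-latn-digit")).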
From unicode at unicode.org Fri Jun 8 13:45:26 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 8 Jun 2018 20:45:26 +0200
Subject: The Unicode Standard and ISO
In-Reply-To: <20180608184120.7e108ab0@JRWUBU2>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18> <20180608090646.03604ff1@JRWUBU2> <20180608184120.7e108ab0@JRWUBU2>
Message-ID:

2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode <unicode at unicode.org>:

> On Fri, 8 Jun 2018 13:40:21 +0200 Mark Davis ☕️ wrote:
> > [...]
> > > The UCA contains features essential for respecting canonical equivalence. ICU works hard to avoid the extra effort involved, apparently even going to the extreme of implicitly declaring that Vietnamese is not a human language.
> >
> > A bit over the top, eh?
>
> Then remove the "no known language" from the bug list, or declare that you don't know SE Asian languages.
>
> The root problem is that the UCA cannot handle syllable-by-syllable comparisons; if the UCA could handle that, the correct collation of unambiguous true Lao would become simple. The CLDR algorithm provides just enough memory to make Lao collation possible; however, ICU isn't fast enough to load a collation from customisation - it takes hours! One could probably do better if one added suffix contractions, but adding that capability might be a nightmare.

The way tailoring is designed in CLDR, using only data driven by a generic algorithm and not custom algorithms, is not the only way to collate Lao. You can perfectly well add new custom algorithm primitives that will use new collation data rules, inserted as "hooks" in UCA (which provides several points at which this is possible, but UCA just makes these hooks act as no-ops).

You can be much faster if you create a specific library for Lao that would still be able to process the basic collation rules and then make more advanced inferences based on larger cluster boundaries than just those considered in the standard basic UCA, so it is perfectly possible to extend it to cover more complex Lao syllables and various specific quirks (such as hyphenation in the middle of clusters, as seen in some Indic scripts using left matras). Not everything has to be specified by UCA itself, notably if it's specific to a script (or sometimes only a single locale, i.e. a specific combination of a script, language, orthographic convention, and stylistic convention for some kinds of documents or presentations).

From unicode at unicode.org Fri Jun 8 15:33:20 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 8 Jun 2018 13:33:20 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <7CD0EF10-D783-4F34-A064-73BE78A42AA0@evertype.com>
References: <1356275067.14615.1528392660527.JavaMail.www@wwinf1f27> <1563074572.18915.1528398795005.JavaMail.www@wwinf1m18> <7CD0EF10-D783-4F34-A064-73BE78A42AA0@evertype.com>
Message-ID: <2053fd36-6ec2-8923-323d-972068c0a20e@ix.netcom.com>

An HTML attachment was scrubbed...

From unicode at unicode.org Fri Jun 8 15:54:20 2018
From: unicode at unicode.org (Tom Gewecke via Unicode)
Date: Fri, 8 Jun 2018 16:54:20 -0400
Subject: The Unicode Standard and ISO
In-Reply-To: <1516605476.11375.1528465970271.JavaMail.www@wwinf1m18>
References: <1516605476.11375.1528465970271.JavaMail.www@wwinf1m18>
Message-ID:

> On Jun 8, 2018, at 9:52 AM, Marcel Schneider via Unicode wrote:
>
> People relevant to projects for the French locale trace the borderline of applicability wider than do those people who are more closely tied to Unicode-related projects.

Could you give a concrete example or two of what these people mean by "wider borderline of applicability" that might generate their ethical dilemma?

From unicode at unicode.org Fri Jun 8 16:14:51 2018
From: unicode at unicode.org (Steven R. Loomis via Unicode)
Date: Fri, 8 Jun 2018 14:14:51 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <20180608184120.7e108ab0@JRWUBU2>
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18> <20180608090646.03604ff1@JRWUBU2> <20180608184120.7e108ab0@JRWUBU2>
Message-ID:

Richard,

> But the consortium has formally dropped the commitment to DUCET in CLDR. Even when restricted to strings of assigned characters, the CLDR and ICU no longer make the effort to support the DUCET collation.

CLDR is not a collation implementation, it is a data repository with an associated specification. It was never required to 'support' DUCET. The contents of CLDR have no bearing on whether implementations support DUCET. CLDR ≠ ICU.

On Fri, Jun 8, 2018 at 10:41 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> On Fri, 8 Jun 2018 13:40:21 +0200 Mark Davis ☕️ wrote:
> > > The UCA contains features essential for respecting canonical equivalence. ICU works hard to avoid the extra effort involved, apparently even going to the extreme of implicitly declaring that Vietnamese is not a human language.
> >
> > A bit over the top, eh?
>
> Then remove the "no known language" from the bug list

What does this refer to?

> ICU isn't fast enough to load a collation from customisation - it takes hours!
> ICU is, alas, ridiculously slow

I'm also curious what this refers to, perhaps it should be a separate ICU bug?

From unicode at unicode.org Fri Jun 8 16:28:23 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 8 Jun 2018 23:28:23 +0200 (CEST)
Subject: The Unicode Standard and ISO
Message-ID: <1042661471.18801.1528493303738.JavaMail.www@wwinf1m18>

On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:
> [...]
> There's no value added in creating "mirrors" of something that is successfully being developed and maintained under a different umbrella.

Wouldn't the same be true for ISO/IEC 10646? It adds no value either, and WG2 meetings could be merged with UTC meetings. Unicode maintains the entire chain, from the roadmap to the production tool (which the Consortium ordered without paying a full license).

But the case is about part of the people who are eager to maintain an alternate forum, whereas the industry (i.e. the main users of the data) are interested in fast-tracking character batches, and thus tend to shortcut ISO/IEC JTC1 SC2 WG2. This is proof enough that, applying the same logic as to ISO/IEC 15897, WG2 would be eliminated. The reason why it was not is that Unicode was weaker and needed support from ISO/IEC to gain enough traction, despite the then-ISO/IEC 10646 being useless in practice, as it pursued an unrealistic encoding scheme. To overcome this, somebody in ISO started actively campaigning for the Unicode encoding model, encountering fierce resistance from fellow ISO people until he succeeded in teaching them real-life computing. He had already invented and standardized the sorting method later used to create UCA and ISO/IEC 14651. I don't believe that today everybody forgot about him.

Marcel

From unicode at unicode.org Fri Jun 8 17:24:28 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 9 Jun 2018 00:24:28 +0200 (CEST)
Subject: The Unicode Standard and ISO
In-Reply-To:
References: <1516605476.11375.1528465970271.JavaMail.www@wwinf1m18>
Message-ID: <1711377974.19065.1528496669068.JavaMail.www@wwinf1m18>

On Fri, 8 Jun 2018 16:54:20 -0400, Tom Gewecke via Unicode wrote:
>
> > On Jun 8, 2018, at 9:52 AM, Marcel Schneider via Unicode wrote:
> >
> > People relevant to projects for the French locale trace the borderline of applicability wider than do those people who are more closely tied to Unicode-related projects.
>
> Could you give a concrete example or two of what these people mean by "wider borderline of applicability" that might generate their ethical dilemma?

Drawing the borderline up to which ISO/IEC should be among the involved parties, as I put it, is about the Unicode policy on how ISO/IEC JTC1 SC2 WG2 is involved in the process, how it appears in public (FAQs, Mailing List responding practice, and so on), and how people in that WG2 feel with respect to Unicode. That may be different depending on the standard concerned (ISO/IEC 10646, ISO/IEC 14651), so that the former is put in the first place as vital to Unicode, while the latter is almost entirely hidden (except in appendix B of UTS #10). Then when it comes to locale data, Unicode people see the borderline below, while ISO people tend to see it above. This is why Unicode people do not want the twin-standards-bodies principle applied to locale data, and are ignoring or declining any attempt to equalize situations, arguing that ISO/IEC 15897 is useless.

As I've pointed out in my previous e-mail responding to Asmus Freytag, ISO/IEC 10646 was about as useless until Unicode came upon it and merged itself with that UCS embryo (not to say miscarriage on the way). The only things WG2 could insist upon were names and huge bunches of precomposed or preformatted characters that Unicode was designed to support in plain text by other means. The essential part was Unicode's, and without Unicode we wouldn't have any usable UCS. ISO/IEC 15897 appears to be in a similar position: not very useful, not very performative, not very complete. But an ISO/IEC standard. Logically, Unicode should feel committed to merge with it the same way it did with the other standard, maintaining the data and publishing periodical abstracts under ISO coverage. There is no problem in publishing a framework standard under the ISO/IEC umbrella, associated with a regular up-to-date snapshot of the data. That is what I mean when I say that Unicode arbitrarily draws borderlines of its own, regardless of how people at ISO feel about them.

Marcel

From unicode at unicode.org Fri Jun 8 17:41:58 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 8 Jun 2018 15:41:58 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <1042661471.18801.1528493303738.JavaMail.www@wwinf1m18>
References: <1042661471.18801.1528493303738.JavaMail.www@wwinf1m18>
Message-ID: <01af828b-4bf9-338f-f099-795d125d57ce@ix.netcom.com>

An HTML attachment was scrubbed...

From unicode at unicode.org Fri Jun 8 19:24:47 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 9 Jun 2018 01:24:47 +0100
Subject: The Unicode Standard and ISO
In-Reply-To:
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18> <20180608090646.03604ff1@JRWUBU2> <20180608184120.7e108ab0@JRWUBU2>
Message-ID: <20180609012447.03650653@JRWUBU2>

On Fri, 8 Jun 2018 14:14:51 -0700 "Steven R. Loomis via Unicode" wrote:

> > But the consortium has formally dropped the commitment to DUCET in CLDR. Even when restricted to strings of assigned characters, the CLDR and ICU no longer make the effort to support the DUCET collation.
>
> CLDR is not a collation implementation, it is a data repository with an associated specification. It was never required to 'support' DUCET. The contents of CLDR have no bearing on whether implementations support DUCET.

DUCET used to be the root collation of CLDR.

> CLDR ≠ ICU.

DUCET is a standard collation. Language-specific collations are stored in CLDR, so why not an international standard? Does ICU store collations not defined in CLDR? The formal snag is that the collations have to be LDML tailorings of the CLDR root collation, which is a formal problem for U+FDD0. I would expect you to argue that it is more useful for U+FDD0 to have the special behaviour defined in CLDR, and to restrict conformance with DUCET to characters other than non-characters.

> > Then remove the "no known language" from the bug list
>
> What does this refer to?

http://userguide.icu-project.org/collation/customization

Under the heading "Known Limitations" it says: "The following are known limitations of the ICU collation implementation. These are theoretical limitations, however, since there are no known languages for which these limitations are an issue. However, for completeness they should be fixed in a future version after 1.8.1. The examples given are designed for simplicity in testing, and do not match any real languages."

Then, the particular problem is listed under the heading "Contractions Spanning Normalization". The assumption is that FCD strings do not need to be decomposed. This comes unstuck when what is locally a secondary weight due to a diacritic on a vowel has to be promoted to a primary weight to support syllable-by-syllable collation in a system not set up for such a tiered comparison.

> > ICU isn't fast enough to load a collation from customisation - it takes hours!
> > ICU is, alas, ridiculously slow
>
> I'm also curious what this refers to, perhaps it should be a separate ICU bug?

There may be reproducibility issues. A proper bug report will take some work. There's also the argument that nearly 200,000 contractions is excessive. I had to disable certain checks that were treating "should not" as a prohibition - working round them either exceeded ICU's capacity because of the necessary increase in the number of contractions, or was incompatible with the design of the collation.

The weight customisation creates 45 new weights, with lines like "&\u0EA1 = \ufdd2\u0e96 < \ufdd2\u0e97 # MO for THO_H & THO_L". I use strings like \ufdd2\u0e96 to emulate ISO/IEC 14651 (primary) weights. I carefully reuse default Lao weights so as to keep collating elements' lists of collation elements short. There are a total of 187,174 non-comment lines, most being simple contractions like "&\u0ec8\ufdd2\u0e96\ufdd2AAW\ufdd3\u0e94 = \u0ec8\u0e96\u0ead\u0e94 # 1+K+AW+N N is mandatory!" and prefix contractions like "&\ufdd2AAW\ufdd3\u0e81\u0ec9 = \u0e96\u0ec9 | ?\u0e81 # K+1|?+N N is mandatory".

I strip the comments off as I convert the collation definition to UTF-16; if I remember correctly I also have to convert escape sequences to characters. That processing is a negligible part of the time. By comparison, the loading of 30,000 lines from allkeys.txt is barely discernible. The generation and loading of the collation was reasonably fast when I generated DUCET-style collation weights using bash. For my purposes, I would get better performance if ICU's collation just blindly converted strings to NFD, but then all I am using it for is to compare collation rules against a dictionary. I suspect it's just that I lose out massively as a result of ICU's tradeoffs.

Richard.
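To make the rule fragments Richard quotes a little more concrete: the same contraction and context ("prefix | string") notation from the LDML collation syntax can be fed to ICU4J's RuleBasedCollator. A toy sketch with invented rules, not Richard's actual Lao tailoring:

    import com.ibm.icu.text.RuleBasedCollator;

    public class TailoringDemo {
        public static void main(String[] args) throws Exception {
            // "&c < ch" makes "ch" a contraction sorting as a unit, primary-after "c";
            // "&a <<< a|'-'" is a context rule: '-' sorts tertiary-after "a",
            // but only when it immediately follows an "a".
            RuleBasedCollator coll = new RuleBasedCollator("&c < ch &a <<< a|'-'");
            System.out.println(coll.compare("cz", "ch")); // negative: the contraction outranks c+z
        }
    }

Building a collator from a couple of rules like this is effectively instantaneous; the load times of hours reported above concern rule sets with hundreds of thousands of contractions, a very different regime.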
From unicode at unicode.org Fri Jun 8 20:53:04 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 9 Jun 2018 02:53:04 +0100
Subject: The Unicode Standard and ISO
In-Reply-To:
References: <1360815095.111.1528428771583.JavaMail.www@wwinf1m18> <20180608090646.03604ff1@JRWUBU2> <20180608184120.7e108ab0@JRWUBU2>
Message-ID: <20180609025304.61b9dca2@JRWUBU2>

On Fri, 8 Jun 2018 20:45:26 +0200 Philippe Verdy via Unicode wrote:

> The way tailoring is designed in CLDR, using only data driven by a generic algorithm and not custom algorithms, is not the only way to collate Lao. You can perfectly well add new custom algorithm primitives that will use new collation data rules, inserted as "hooks" in UCA (which provides several points at which this is possible, but UCA just makes these hooks act as no-ops).

The ideal is to have a common library rather than add specific routines to support specific languages. Now, this can be done in a common library; ICU break iterators have dedicated routines for CJK and for Siamese. I wonder if this could be done for Lao and possibly Tai Lue. I've a vague recollection that UCA collation for Tai Lue in the New Tai Lue script only needs thousands of contractions, so it may work well enough in the main CLDR collation algorithm. Martin Hosken provided the numbers, probably on the Unicore list, when New Tai Lue formally switched from phonetic to visual order. Taking the definition of logical order literally, the change legitimised the logical order of New Tai Lue.

> You can be much faster if you create a specific library for Lao that would still be able to process the basic collation rules and then make more advanced inferences based on larger cluster boundaries than just those considered in the standard basic UCA, so it is perfectly possible to extend it to cover more complex Lao syllables and various specific quirks (such as hyphenation in the middle of clusters, as seen in some Indic scripts using left matras).

How is this hyphenation done? The answer probably belongs in the thread entitled 'Hyphenation Markup', unless it's restricted to the visual order scripts. If it's occurring in the visual order scripts, we may need to add contractions for ; U+00AD breaks contractions, and, indeed, may be used for exactly that purpose, as it is generally easier to type than CGJ. While I've seen line-breaking after a left matra in Thai, I've never *seen* a hyphen after a left matra.

Richard.

From unicode at unicode.org Fri Jun 8 21:54:57 2018
From: unicode at unicode.org (=?UTF-8?B?WWlmw6FuIFfDoW5n?= via Unicode)
Date: Sat, 9 Jun 2018 11:54:57 +0900
Subject: UTS#51 and emoji-sequences.txt
Message-ID:

When I'm looking at https://unicode.org/Public/emoji/11.0/emoji-sequences.txt it goes on line 16 that:
----------
# type_field: any of {Emoji_Combining_Sequence, Emoji_Flag_Sequence, Emoji_Modifier_Sequence}
# The type_field is a convenience for parsing the emoji sequence files, and is not intended to be maintained as a property.
----------
This field, however, actually contains "Emoji_Keycap_Sequence" and "Emoji_Tag_Sequence", instead of "Emoji_Combining_Sequence" (it was already so in 5.0).

And I go back to http://www.unicode.org/reports/tr51/ under section 1.4.6:
----------
ED-21. emoji keycap sequence set — The specific set of emoji sequences listed in the emoji-sequences.txt file [emoji-data] under the category Emoji_Keycap_Sequence.
ED-22. emoji modifier sequence set — The specific set of emoji sequences listed in the emoji-sequences.txt file [emoji-data] under the category Emoji_Modifier_Sequence.
ED-23. RGI emoji flag sequence set — The specific set of emoji sequences listed in the emoji-sequences.txt file [emoji-data] under the category Emoji_Flag_Sequence.
ED-24. RGI emoji tag sequence set — The specific set of emoji sequences listed in the emoji-sequences.txt file [emoji-data] under the category Emoji_Tag_Sequence.
----------
I'm not sure if the "category" means "type_field" or headings in the txt file, as the headings do not contain underscores. If it means "type_field", then the description of type_field above is wrong.

Also section 1.4.5:
----------
ED-14c. emoji keycap sequence — A sequence of the following form:

emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3}

- These characters are in the emoji-sequences.txt file listed under the category Emoji_Keycap_Sequence
----------
While in the previous version (rev. 12):
----------
ED-14c. emoji keycap sequence — An emoji combining sequence of the following form:

emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3}

- These characters are in the emoji-sequences.txt file listed under the category Emoji_Combining_Keycap_Sequence
----------
It seems there was some kind of confusion on terms, but anyway, isn't the last line of ED-14c redundant with the current revision? (Or is "Emoji_Combining_Sequence" intended?)

Thank you.

Wang Yifan
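The observation above is easy to verify mechanically, since each data line of emoji-sequences.txt has the shape "code_points ; type_field ; description # comment". A throwaway sketch that lists the distinct type_field values actually present (the file path is hypothetical):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Set;
    import java.util.TreeSet;

    public class TypeFieldScan {
        public static void main(String[] args) throws Exception {
            Set<String> types = new TreeSet<>();
            for (String line : Files.readAllLines(Paths.get("emoji-sequences.txt"))) {
                int hash = line.indexOf('#');                 // drop the trailing comment
                if (hash >= 0) line = line.substring(0, hash);
                String[] fields = line.split(";");
                if (fields.length >= 2) types.add(fields[1].trim());
            }
            System.out.println(types);
        }
    }

On the 11.0 file this should print the Emoji_Flag_Sequence, Emoji_Keycap_Sequence, Emoji_Modifier_Sequence and Emoji_Tag_Sequence values that the message above compares against the file's header comment.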
From unicode at unicode.org Sat Jun 9 01:23:33 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
Subject: The Unicode Standard and ISO
Message-ID: <1255922515.940.1528525413705.JavaMail.www@wwinf1m18>

On Fri, 8 Jun 2018 09:20:09 -0700, Steven R. Loomis via Unicode wrote:
[...]
> But, it sounds like the CLDR process was successful in this case. Thank you for contributing.

You are welcome, but thanks are due to the actual corporate contributors.

[...]
> Actually, I think the particular data item you found is relatively new. The first values entered for it in any language were May 18th of this year. Were there votes for "keycap" earlier?

The "keycap" category is found as early as v30 (released 2016-10-05).

> Rather than a tracer finding evidence of neglect, you are at the forefront of progressing the translated data for French. Congratulations!

The neglect is on my part, as I neglected to check the data history. Please note that I did not make accusations of neglect. Again: the historic Code Charts translators, partly still active, sulk CLDR because Unicode is perceived as sulking ISO/IEC 15897, so that minimal staff is actively translating CLDR for the French locale and can legitimately feel forsaken. I even made detailed suppositions as to how it could happen that "keycap" remained untranslated.

[...]
[Unanswered questions (please refer to my other e-mails in this thread)]

> The registry for ISO/IEC 15897 has neither data for French, nor structure that would translate the term "Characters | Category | Label | keycap". So there would be nothing to merge with there.

Correct. The only data for French is an ISO/IEC 646 charset: http://std.dkuug.dk/cultreg/registrations/number/156 As far as I can see, there are available data to merge for Danish, Faroese, Finnish, Greenlandic, Norwegian, and Swedish.

> So, historically, CLDR began not as a part of Unicode, but as part of Li18nx under the Free Standards Group. [...] We contributed 100 locales to glibc as well.

Thank you for the account and resources. The Linux Internationalization Initiative appears to have issued its last release on August 23, 2000: https://www.redhat.com/en/about/press-releases/83 - the year before ISO/IEC 15897 was last updated: http://std.dkuug.dk/cultreg/registrations/chreg.htm

> Where there is opportunity for productive sync and merging is glibc. We have had some discussions, but more needs to be done - especially a lot of tooling work. Currently many bug reports are duplicated between glibc and CLDR, a sort of manual synchronization. Help wanted here.

Noted. For my part, sadly, for C libraries I'm unlikely to be of any help.

Marcel

From unicode at unicode.org Sat Jun 9 03:47:01 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 9 Jun 2018 09:47:01 +0100
Subject: The Unicode Standard and ISO
In-Reply-To: <1255922515.940.1528525413705.JavaMail.www@wwinf1m18>
References: <1255922515.940.1528525413705.JavaMail.www@wwinf1m18>
Message-ID: <20180609094701.164b0442@JRWUBU2>

On Sat, 9 Jun 2018 08:23:33 +0200 (CEST) Marcel Schneider via Unicode wrote:

> > Where there is opportunity for productive sync and merging is glibc. We have had some discussions, but more needs to be done - especially a lot of tooling work. Currently many bug reports are duplicated between glibc and CLDR, a sort of manual synchronization. Help wanted here.
>
> Noted. For my part, sadly, for C libraries I'm unlikely to be of any help.

I wonder how much of that comes under the sad category of "better not translated". If an English speaker has to resort to search engines to understand, let alone fix, a reported problem, it may be better for a non-English speaker to search for the error message in English, and then with luck he may find a solution he can understand. In a related vein, one hears reports of people using English as the interface language, because they can't understand the messages allegedly in their native language.

Richard.
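One conventional compromise for the problem Richard describes is to show the message in the user's locale while keeping the English source text one call away, so an application can offer it on demand. A minimal JDK sketch; the bundle name, key and .properties files are hypothetical:

    import java.util.Locale;
    import java.util.ResourceBundle;

    public class EnglishFallbackDemo {
        public static void main(String[] args) {
            // Assumes Messages.properties (English) and Messages_fr.properties
            // are available on the classpath.
            ResourceBundle ui = ResourceBundle.getBundle("Messages", Locale.FRENCH);
            ResourceBundle en = ResourceBundle.getBundle("Messages", Locale.ROOT);
            String key = "disk.full";
            System.out.println(ui.getString(key)); // localized message shown by default
            System.out.println(en.getString(key)); // the searchable English original
        }
    }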
URL: From unicode at unicode.org Sat Jun 9 09:38:05 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 9 Jun 2018 16:38:05 +0200 Subject: The Unicode Standard and ISO In-Reply-To: <1042661471.18801.1528493303738.JavaMail.www@wwinf1m18> References: <1042661471.18801.1528493303738.JavaMail.www@wwinf1m18> Message-ID: I just see the WG2 as a subcomity where governements may just check their practices and make minimum recommendations. Most governements are in fact very late to adopt the industry standards that evolve fast, and they just want to reduce the frequency of necessary changes jsut to enterinate what seems to be stable enough and gives them long enough period to plan the transitions. So ISO 10646 has had in fact very few updates compared to Unicode (even if these Unicode changes were "synchronized", most of them remained for long within optional amendments that are then synchronized in ISO 10646 long after the inbdustry has started working on updating their code for Unicode and made checks to ensure that it is stable enough to be finally included in ISO 10646 later as the new minimal platform that governments can reasonnably ask to be provided by their providers in the industry at reasonnable (or no) additional cost. So I see now ISO 646 only as a small subset of the Unicode standard. The WG2 technical comity is jsut there to finally approve what can be endorsed as a standard whose usage is made mandatory in governments, when the UTS itself is still (and will remain) just optional (not a requirement). It takes months or years to have new TUS features being available on all platforms that governements use. WG2 probably does not focus really on technical merits, but just evaluating the implementation and deployment costs, and that's where the WG2 members decide what is reasonnable for them to adopt (let's also not forget that ISO standards are mapped to national standards that reference it normatively, and these national standards (or European standards in the EEA) are legal requirements: governements then no longer need to specify each time which requirement they want, they're just saying that the national standards within a certain class are required for all product/service offers, and failure to implement theses standards will require those providers to fix their products at no additional cost, and independantly of the contractual or subscribed period of support). 2018-06-08 23:28 GMT+02:00 Marcel Schneider via Unicode : > On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote: > > > [?] > > There's no value added in creating "mirrors" of something that is > successfully being developed and maintained under a different umbrella. > > Wouldn?t the same be true for ISO/IEC 10646? It has no value added > neither, and WG2 meetings could be merged with UTC meetings. > Unicode maintains the entire chain, from the roadmap to the production > tool (that the Consortium ordered without paying a full license). > > But the case is about part of the people who are eager to maintain an > alternate forum, whereas the industry (i.e. the main users of the data) > are interested in fast?tracking character batches, and thus tend to > shortcut the ISO/IEC JTC1 SC2 WG2. This is proof enough that applying > the same logic than to ISO/IEC 15897, WG2 would be eliminated. 
The reason > why it was not, is that Unicode was weaker and needed support > from ISO/IEC to gain enough traction, despite the then?ISO/IEC 10646 being > useless in practice, as it pursued an unrealistic encoding scheme. > To overcome this, somebody in ISO started actively campaigning for the > Unicode encoding model, encountering fierce resistance from fellow > ISO people until he succeeded in teaching them real?life computing. He had > already invented and standardized the sorting method later used > to create UCA and ISO/IEC 14651. I don?t believe that today everybody > forgot about him. > > Marcel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jun 9 10:22:53 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 9 Jun 2018 17:22:53 +0200 (CEST) Subject: The Unicode Standard and ISO Message-ID: <783156611.7027.1528557773571.JavaMail.www@wwinf1m18> On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote: > > On Sat, 9 Jun 2018 08:23:33 +0200 (CEST) > Marcel Schneider via Unicode wrote: > > > > Where there is opportunity for productive sync and merging with is > > > glibc. We have had some discussions, but more needs to be done- > > > especially a lot of tooling work. Currently many bug reports are > > > duplicated between glibc and cldr, a sort of manual > > > synchronization. Help wanted here.? > > > > Noted. For my part, sadly for C libraries I?m unlikely to be of any > > help. > > I wonder how much of that comes under the sad category of "better not > translated". If an English speaker has to resort to search engines to > understand, let alone fix, a reported problem, it may be better for a > non-English speaker to search for the error message in English, and then > with luck he may find a solution he can understand. Then adding a "Display in English" button in the message box is best practice. Still I?ve never encountered any yet, and I guess this is because such a facility would be understood as an admission that up to now, i18n is partly a failure. > In a related vein, > one hears reports of people using English as the interface language, > because they can't understand the messages allegedly in their native > language. If to date, automatic translation of technical English still does not work, then I?d suggest that CLDR feature a complete message library allowing to compose any localized piece of information. But such an attempt requires that all available human resources really focus on the project, instead of being diverted by interpersonal discordances. Sulking people around a project are an indicator of poor project management branding dissenters as enemies out of an inability to behave in a diplomatic way by lack of social skills. At least that?s what they?d teach you in any management school. The way Unicode behaves against William Overington is in my opinion a striking example of mismanagement. In one dimension I can see, the "localizable sentences" that William invented and that he actively promotes do fit exactly into the scheme of localizable information elements suggested in the preceding paragraph. I strongly recommend that instead of publicly blacklisting the author in the mailbox of the president and directing the List moderation to prohibit the topic as out of scope of Unicode, an extensible and flexible framework be designed in urgency under the Unicode?CLDR umbrella to put an end to the pseudo?localization that Richard pointed above. 
OK I?m lacking diplomatic skills too, and this e?mail is harsh, but I see it as a true echo. And I apologize for my last reply to William Overington, if I need to. http://www.unicode.org/mail-arch/unicode-ml/y2018-m03/0118.html Beside that, I?d suggest also to add a CLDR library of character name elements allowing to compose every existing Unicode character name in all supported locales, for use in system character pickers and special character dialogs. This library should then be updated at each major release of the UCS. Hopefully this library is then flexible enough to avoid any Standardese, be it in English, in French, or in any language aping English Standardese. E.g. when the ISO/IEC 10646 mirror of Unicode was published in an official French version, the official translators felt partly committed to ape English Standardese, of which we know that it isn?t due mainly to Unicode, but to the then?head of ISO/IEC JTC1 SC2 WG2. Not to warm up that old grudge, just to show how on?topic that is. Be it Standardese or pseudo? localization, the effect is always to worsen UX by missing the point. Best regards, Marcel From unicode at unicode.org Sat Jun 9 11:49:16 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 9 Jun 2018 18:49:16 +0200 Subject: The Unicode Standard and ISO In-Reply-To: <783156611.7027.1528557773571.JavaMail.www@wwinf1m18> References: <783156611.7027.1528557773571.JavaMail.www@wwinf1m18> Message-ID: 2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode : > On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote: > > > > On Sat, 9 Jun 2018 08:23:33 +0200 (CEST) > > Marcel Schneider via Unicode wrote: > > > > > > Where there is opportunity for productive sync and merging with is > > > > glibc. We have had some discussions, but more needs to be done- > > > > especially a lot of tooling work. Currently many bug reports are > > > > duplicated between glibc and cldr, a sort of manual > > > > synchronization. Help wanted here. > > > > > > Noted. For my part, sadly for C libraries I?m unlikely to be of any > > > help. > > > > I wonder how much of that comes under the sad category of "better not > > translated". If an English speaker has to resort to search engines to > > understand, let alone fix, a reported problem, it may be better for a > > non-English speaker to search for the error message in English, and then > > with luck he may find a solution he can understand. > > Then adding a "Display in English" button in the message box is best > practice. > Still I?ve never encountered any yet, and I guess this is because such a > facility > would be understood as an admission that up to now, i18n is partly a > failure. - Navigate any page on the web in another language than yours, with a Google Translate plugin enabled on your browser. you'll have the choice of seeing the automatic translation or the original. - Many websites that have pages proposed in multiple languages offers such buttons to select the language you want to see (and not necesarily falling back to English, becausse the original may as well be in another language and English is an approximate translation, notably for sites in Asia, Africa and south America). 
- Even the official websites of the European Union (or EEA) offer such a choice (but at least the available translations are correctly reviewed for European languages; not all pages are translated into all official languages of member countries, but this is the case for most pages intended to be read by the general public, while pages about ongoing work, technical reports for specialists, or recent legal decisions may not be translated except into a few "working languages", generally English, German, and French, sometimes Italian, the 4 languages spoken officially in multiple countries in the EEA including at least one in the European Union).

So it's not a "failure" but a feature to be able to select the language, and to know when a proposed translation is fully or partly automated.

From unicode at unicode.org Sat Jun 9 13:14:17 2018 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Sat, 9 Jun 2018 18:14:17 +0000 Subject: The Unicode Standard and ISO In-Reply-To: References: <783156611.7027.1528557773571.JavaMail.www@wwinf1m18> Message-ID:

Translated error messages are a horror story. Often I have to play around with my locale settings to avoid them. Using computer translation on programming error messages is nowhere near to being useful. Best Regards, Jonathan Rosenne

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode Sent: Saturday, June 09, 2018 7:49 PM To: Marcel Schneider Cc: UnicodeMailingList Subject: Re: The Unicode Standard and ISO [...]
From unicode at unicode.org Sat Jun 9 14:01:39 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 9 Jun 2018 21:01:39 +0200 (CEST) Subject: The Unicode Standard and ISO Message-ID: <1244966180.7608.1528570899306.JavaMail.www@wwinf1f27>

On the other hand, most end-users don't appreciate getting "a screenful of all-in-English" when "something happened." If even big companies still didn't succeed in getting automated computer translation to work for error messages, then best practice could eventually be to provide an internet link with every message. Given that web pages are generally less sibylline than error messages, they may be better translatable, and Philippe Verdy's hint is therefore a working solution for localized software end-user support. Still a computer should be understandable off-line, so CLDR providing a standard library of error messages could be appreciated by the industry. Best regards, Marcel

On Sat, 9 Jun 2018 18:14:17 +0000, Jonathan Rosenne via Unicode wrote: > Translated error messages are a horror story. Often I have to play around with my locale settings to avoid them. > Using computer translation on programming error messages is nowhere near to being useful. [...]
From unicode at unicode.org Sat Jun 9 17:41:19 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 10 Jun 2018 00:41:19 +0200 (CEST) Subject: The Unicode Standard and ISO In-Reply-To: References: <1244966180.7608.1528570899306.JavaMail.www@wwinf1f27> Message-ID: <1739600906.8497.1528584079440.JavaMail.www@wwinf1f27>

On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote: > > On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote: > > Still a computer should be understandable off-line, so CLDR providing a standard library of error messages could be > > appreciated by the industry. > > The kind of translations that CLDR accumulates, like day and month names, language and territory names, are a widely > applicable subset and one that is commonly required in machine generated or machine-assembled text (like displaying > the date, providing pick lists for configuration of locale settings, etc). > The universe of possible error messages is a completely different beast. > If you tried to standardize all error messages even in one language you would never arrive at something that would be > universally useful. While some simple applications may find that all their needs for communicating with their users are > covered, most would wish they had some other messages available.

Indeed, error messages although technical are like the world's books, a never-ending production of content. To account for this infinity, I was not proposing a closed set of messages to replace application libraries able to display message #123. In fact I wrote first: "If to date, automatic [automated] translation of technical English still does not work, then I'd suggest
that CLDR feature a complete message library allowing to compose any localized piece of information." Here the piece of information displayed by the application is like a Lego spacecraft, the CLDR messages like Lego bricks. I haven't played with Lego for a very long time, but as a boy I learned how it works. I even remember that when building a construct, it often happened that some bricks were "missing". A Lego box is complete with respect to one or several models, but my mom, showing me the boxes on the shelves, once explained that they're composed in a way that you'll always lack something [when trying to build further]. That doesn't prevent Lego from thriving, nor many people from enjoying it.

> To adopt your scheme, they would need to have a bifurcated approach, where some messages follow the standard, > while others do not (cannot). At that point, why bother? Determining whether some message can be rewritten to follow > the standard adds another level of complexity while you'd need to have translation resources for all the non-standard ones anyway.

When CLDR libraries allow generating 98 % well-translated info boxes, human translators may focus on the remaining 2 %. If for any reason they cannot, the vendor will still get far fewer support requests than with ill-translated messages.

> A middle ground is a shared terminology database that allows translators working on different products to arrive at the same translation > for the same things. Translators already know how to use such databases in their work flow, and integrating a shared one with > a product-specific one is much easier than trying to deal with a set of random error messages.

If the scheme you outline works well, where do the reported oddities come from? Obviously terminology is not all; it's like Lego bricks without studs: terms alone don't interlock, and therefore the user cannot make sense of them. This is where CLDR's hopefully upcoming localizable message bricks come into action, helping automated translation software compose understandable output, using patterns. Google Translate is unable to do that, as shown in the English and French translations of this sentence found in a page of the Finnish NB: https://www.sfs.fi/ajankohtaista/uutiset/nappaimistoon_tarjolla_lisayksia.4249.news

Finnish: Kielitoimiston ohjeen mukaan esimerkiksi vieraskielisissä nimissä on pyrittävä säilyttämään kaikki tarkkeet. Google English: According to the Language Office, for example, in the name of a foreign language, it is necessary to maintain all the checkpoints. Google French: Selon le Language Office, par exemple, au nom d'une langue étrangère, il est nécessaire de maintenir tous les points de contrôle.

> It's pushing this kind of impractical scheme that gives standardizers a bad name. > > Especially if it is immediately tied to governmental procurement, forcing people to adopt it (or live with it) whether it provides any actual benefit.

These statements make much sense to me.

> However, a high-quality terminology database recommends itself (and doesn't need any procurement standards). > Ultimately, it was its demonstrated usefulness that drove the adoption of CLDR.

This is why I'm so hopeful that CLDR will go much farther than date and time and other locale settings, and emoji names and keywords. Best regards, Marcel

From unicode at unicode.org Sat Jun 9 23:21:40 2018 From: unicode at unicode.org (Steven R.
Loomis via Unicode) Date: Sat, 9 Jun 2018 21:21:40 -0700 Subject: The Unicode Standard and ISO In-Reply-To: <1739600906.8497.1528584079440.JavaMail.www@wwinf1f27> References: <1244966180.7608.1528570899306.JavaMail.www@wwinf1f27> <1739600906.8497.1528584079440.JavaMail.www@wwinf1f27> Message-ID:

Marcel, The idea is not necessarily without merit. However, CLDR does not usually expand scope just because of a suggestion. I usually recommend creating a new project first - gathering data, looking at and talking to projects to ascertain the usefulness of common messages - one of the barriers to adding new content for CLDR is not just the design, but collecting initial data. When emoji or sub-territory names were added, many languages were included before it was added to CLDR. Also note CLDR does have some typographical terms for use in UI, such as 'bold' and 'italic'. Regards, Steven

On Sat, Jun 9, 2018 at 3:41 PM Marcel Schneider via Unicode < unicode at unicode.org> wrote: [...]

From unicode at unicode.org Sun Jun 10 01:35:39 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 10 Jun 2018 08:35:39 +0200 (CEST) Subject: The Unicode Standard and ISO Message-ID: <756806252.545.1528612539616.JavaMail.www@wwinf1f27>

On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote: [...] > It's pushing this kind of impractical scheme that gives standardizers a bad name. > > Especially if it is immediately tied to governmental procurement, forcing people to adopt it (or live with it) > whether it provides any actual benefit.

Or not. What I left untold is that governmental action does effectively work in both directions (examples following), but governments are not alone in that ambivalence, nor does it stem from unchecked discretion. When the French NB positioned itself against encoding œ in ISO/IEC 8859-1:1986, it wasn't the government but a manufacturer who wanted to get around adding support for this letter in printers. It's not fully clear to me why the same happened to Dutch ĳ.
Anyway as a result we had (and legacy doing the rest, still have) two digitally malfunctioning languages. Thanks to the work of Hugh McGregor Ross, Peter Fenwick, Bernard Marti and Loek Zeckendorf (ISO/IEC 6937:1983), and from 1987 on thanks to the work of Joe Becker, Lee Collins and Mark Davis from Apple and Xerox, things started working fine, and work better and better thanks to Mark Davis's on-going commitment.

Industrial and governmental action both are ambivalent by nature, simply because human action may happen to be short-sighted or far-sighted for a variety of reasons. When the French NB issued a QWERTY keyboard standard in 1973 and revised it in 1976, there were short-sighted industrial interests rather than governmental procurement. End-users never adopted it, there was no market, and it has recently been withdrawn. When governmental action, hard scientific work, human genius and an emerging industrial effort brought into existence a working keyboard for French that is usefully transposable to many other locales as well, it was enthusiastically adopted by the end-users and everybody urged the NB to standardize it. But the industry first asked for an international keyboard standard as a precondition (which ended up being an excellent idea as well). The rest of the story may be spared as the conclusion is already clear.

There is one impractical scheme that bothers me, and that is that we have two hyphens because the ASCII hyphen was duplicated as U+2010. Now since font designers (e.g. Lucida Sans Unicode) took the hyphen conundrum seriously to avoid spoofing, or for whatever reason, we're supposed to have keyboard layouts with two hyphens, both being Gc=Pd. That is where the related ISO WG2 could have been useful by positioning against U+2010, because disambiguating the minus sign U+2212 and keeping the hyphen-minus U+002D in use like e.g. the period would have been sufficient.

On the other hand, it is entirely Unicode's merit that we have two curly apostrophes, one that doesn't break hashtags (U+02BC, Gc=Lm), and one that does (U+2019, Gc=Pf), as has been shared on this List (thanks to André Schappo). But despite a language being in a position to make a distinct use of each one of them, depending on whether the apostrophe helps denote a particular sound or marks an elision (and despite having already a physical keyboard and driver that would make distinct entry very easy and straightforward), submitting feedback didn't help to raise concern so far. This is an example of how the industry and the governments united in the Unicode Consortium are saving end-users lots of trouble. Thank you. Marcel

From unicode at unicode.org Sun Jun 10 02:10:27 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 10 Jun 2018 09:10:27 +0200 (CEST) Subject: The Unicode Standard and ISO In-Reply-To: References: <1244966180.7608.1528570899306.JavaMail.www@wwinf1f27> <1739600906.8497.1528584079440.JavaMail.www@wwinf1f27> Message-ID: <386316561.945.1528614628110.JavaMail.www@wwinf1f27>

On Sat, 9 Jun 2018 21:21:40 -0700, Steven R. Loomis via Unicode wrote: > > Marcel, > The idea is not necessarily without merit. However, CLDR does not usually expand scope just because of a suggestion. > > I usually recommend creating a new project first - gathering data, looking at and talking to projects to ascertain the usefulness > of common messages - one of the barriers to adding new content for CLDR is not just the design, but collecting initial data.
> When emoji or sub-territory names were added, many languages were included before it was added to CLDR.

We know it took years to collect the subterritory names and make sure the list and translations are complete.

> Also note CLDR does have some typographical terms for use in UI, such as 'bold' and 'italic'

I figure that these are intended for tooltips on basic formatting facilities. High-end software like Microsoft Office has many more and adds tooltips showing instructions for use, out of a corporate strategy that aims at raising usability and overall quality. So I wonder whether there are limits for software vendors in cooperating with competitors to pool UI content? This point and others would be cleared in the preliminary stage that you drafted above but that I don't feel in a position to carry out, at least not now, as I'm focusing on our national data in CLDR and on keyboard layouts and standards. Anyhow, thank you for letting us know. Best regards, Marcel [...]

From unicode at unicode.org Sun Jun 10 10:11:48 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Sun, 10 Jun 2018 15:11:48 +0000 Subject: The Unicode Standard and ISO In-Reply-To: References: <107041927.8437.1528370722530.JavaMail.www@wwinf1c20> Message-ID:

> ... For another part it [sync with ISO/IEC 15897] failed because the Consortium refused to cooperate, despite of repeated proposals for a merger of both instances.

First, ISO/IEC 15897 is built on a data-format specification, ISO/IEC TR 14652, that never achieved the support needed to become an international standard, and has since been withdrawn. (TRs cannot remain TRs forever.) Now, JTC1/SC35 began work four or five years ago to create a data-format specification for this, Approved Work Item 30112. From the outset, Unicode and the US national body tried repeatedly to engage with SC35 and SC35/WG5, informing them of UTS #35 (LDML) and CLDR, but were ignored. SC35 didn't appear to be interested a pet project and not in what is actually being used in industry. After several failed attempts, Unicode and the USNB gave up trying.
So, any suggestion that Unicode has failed to cooperate or is dropping the ball with regard to locale data and ISO is simply uninformed. Peter

From: Unicode On Behalf Of Mark Davis via Unicode Sent: Thursday, June 7, 2018 6:20 AM To: Marcel Schneider Cc: UnicodeMailing Subject: Re: The Unicode Standard and ISO

A few facts.

> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.

ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could speak to the synchronization level in more detail, but the above statement is inaccurate.

> ... For another part it [sync with ISO/IEC 15897] failed because the Consortium refused to cooperate, despite of repeated proposals for a merger of both instances.

I recall no serious proposals for that. (And in any event, very unlike the synchrony with 10646 and 14651, ISO 15897 brought no value to the table. Certainly nothing to outweigh the considerable costs of maintaining synchrony. Completely inadequate structure for modern system requirements, no particular industry support, and scant content: see Wikipedia for "The registry has not been updated since December 2001".) Mark

On Thu, Jun 7, 2018 at 1:25 PM, Marcel Schneider via Unicode wrote: On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote: > > On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote: > > Hello, > > > > There are several mentions of synchronization with related standards in > > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and > > https://www.unicode.org/faq/unicode_iso.html. However, all such mentions > > never mention anything other than ISO 10646. > > Because that is the standard for which there is an explicit understanding by all involved > relating to synchronization. There have been occasionally some challenging differences > in the process and procedures, but generally the synchronization is being maintained, > something that's helped by the fact that so many people are active in both arenas.

Perhaps the cause-effect relationship is somewhat unclear. I think that many people being active in both arenas is helped by the fact that there is a strong will to maintain synching. If there were similar policies notably for ISO/IEC 14651 (collation) and ISO/IEC 15897 (locale data), ISO/IEC 10646 would be far from standing alone in the field of Unicode-ISO/IEC cooperation.

> > There are really no other standards where the same is true to the same extent. > > > > I was wondering which ISO standards other than ISO 10646 specify the > > same things as the Unicode Standard, and of those, which ones are > > actively kept in sync. This would be of importance for standardization > > of Unicode facilities in the C++ language (ISO 14882), as reference to > > ISO standards is generally preferred in ISO standards. > > > One of the areas the Unicode Standard differs from ISO 10646 is that its conception > of a character's identity implicitly contains that character's properties - and those are > standardized as well and alongside of just name and serial number.

This is probably why, to date, ISO/IEC 10646 features character properties by including normative references to the Unicode Standard, Standard Annexes, and the UCD. Bidi mirroring, e.g., is part of ISO/IEC 10646, which specifies in clause 15.1: "[...] The list of these characters is determined by having the 'Bidi_Mirrored' property set to 'Y' in the Unicode Standard. These values shall be determined according to the Unicode Standard Bidi Mirrored property (see Clause 2)."
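For illustration, a minimal sketch in Java of querying that property through the ICU4J character-property API, which is derived from the UCD:

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UProperty;

    public class BidiMirroredCheck {
        public static void main(String[] args) {
            // U+0028 and U+2208 are mirrored in bidirectional text; U+0041 is not.
            int[] samples = { 0x0028, 0x2208, 0x0041 };
            for (int cp : samples) {
                boolean mirrored =
                        UCharacter.hasBinaryProperty(cp, UProperty.BIDI_MIRRORED);
                System.out.printf("U+%04X %s Bidi_Mirrored=%b%n",
                        cp, UCharacter.getName(cp), mirrored);
            }
        }
    }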
> > Many of these properties have associated with them algorithms, e.g. the bidi algorithm, > that are an essential element of data interchange: if you don't know which order in > the backing store is expected by the recipient to produce a certain display order, you > cannot correctly prepare your data. > > There is one area where standardization in ISO relates to work in Unicode that I can > think of, and that is sorting.

Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the bibliography). The reverse relationship is irrelevant and would be unfair, given that the Consortium refused till now to synchronize UCA and ISO/IEC 14651. Here is a need for action.

> However, sorting, beyond the underlying framework, > ultimately relates to languages, and language-specific data is now housed in CLDR. > > Early attempts by ISO to standardize a similar framework for locale data failed, in > part because the framework alone isn't the interesting challenge for a repository, > instead it is the collection, vetting and management of the data.

For another part it failed because the Consortium refused to cooperate, despite of repeated proposals for a merger of both instances.

> > The reality is that the ISO model and its organizational structures are not well suited > to the needs of many important areas where some form of standardization is needed. > That's why we have organizations like IETF, W3C, Unicode etc.. > > Duplicating all or even part of their effort inside ISO really serves nobody's purpose.

An undesirable side-effect of not merging Unicode with ISO/IEC 15897 (locale data) is to divert many competent contributors from monitoring CLDR data, especially for French. Here too is a huge need for action. Thanks in advance. Marcel

From unicode at unicode.org Sun Jun 10 14:28:23 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 10 Jun 2018 21:28:23 +0200 (CEST) Subject: The Unicode Standard and ISO Message-ID: <1513989935.9050.1528658903203.JavaMail.www@wwinf1f27>

On Sun, 10 Jun 2018 15:11:48 +0000, Peter Constable via Unicode wrote: > > > ... For another part it [sync with ISO/IEC 15897] failed because the Consortium refused to cooperate, despite of > > repeated proposals for a merger of both instances. > > First, ISO/IEC 15897 is built on a data-format specification, ISO/IEC TR 14652, that never achieved the support > needed to become an international standard, and has since been withdrawn. (TRs cannot remain TRs forever.) > Now, JTC1/SC35 began work four or five years ago to create a data-format specification for this, Approved Work Item 30112. > From the outset, Unicode and the US national body tried repeatedly to engage with SC35 and SC35/WG5,

The involvement in this decade of ISO/IEC JTC1 SC35 WG5 adds a scary level of complexity unrelated to the core issues. Andrew West already hinted that the stuff was moved from SC22 to SC35, but it took me some extra investigation to get the point. As a reminder: the actual SC35 is entirely disconnected from the same SC35 as it was from the mid-eighties to mid-nineties and beyond.

> informing them of UTS #35 (LDML) and CLDR, but were ignored. SC35 didn't appear to be interested [, or appeared to be interested in ] > a pet project and not in what is actually being used in industry.
Sorry, I had some difficulty understanding, and filled in what I think could have been elided.

> After several failed attempts, Unicode and the USNB gave up trying.

Thank you for bringing up this key information.

> > So, any suggestion that Unicode has failed to cooperate or is dropping the ball with regard to locale data and ISO > is simply uninformed.

That is correct. So I think this thread has now led to a main response, and all concerned people on this List are welcome to take note of these new facts showing that Unicode is totally innocent in ISO/IEC locale data issues. If that doesn't suffice to convince missing people to cooperate in reviewing French data in CLDR, they may be pleased to know that I will keep helping as best I can. Thank you everyone. Best regards, Marcel [...]

From unicode at unicode.org Mon Jun 11 04:28:41 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 11 Jun 2018 11:28:41 +0200 (CEST) Subject: The Unicode Standard and ISO In-Reply-To: <1513989935.9050.1528658903203.JavaMail.www@wwinf1f27> References: <1513989935.9050.1528658903203.JavaMail.www@wwinf1f27> Message-ID: <1984732363.5482.1528709321471.JavaMail.www@wwinf1f27>

> > From the outset, Unicode and the US national body tried repeatedly to engage with SC35 and SC35/WG5, [...] > As a reminder: the actual SC35 is entirely disconnected from the same SC35 as it was from the mid-eighties to mid-nineties and beyond.

Edit: ISO/IEC JTC1 SC35 was founded in 1999. (In the mentioned timespan, there was SC18/WG9.)

> > informing them of UTS #35 (LDML) and CLDR, but were ignored. SC35 didn't appear to be interested > [, or appeared to be interested in ] > > a pet project and not in what is actually being used in industry.

It seems it isn't even a pet project; today it's just nothing but a deplorable mismanagement mess. In my opinion, at some point the inadvertent French NB will apologize to the US National Body and to the Unicode Consortium. As of now, I apologize for my part.
Best regards, Marcel

From unicode at unicode.org Mon Jun 11 10:42:38 2018 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Mon, 11 Jun 2018 15:42:38 +0000 Subject: The Unicode Standard and ISO In-Reply-To: <15636441.36248.1528731165076.JavaMail.defaultUser@defaultHost> References: <1244966180.7608.1528570899306.JavaMail.www@wwinf1f27> <1739600906.8497.1528584079440.JavaMail.www@wwinf1f27> <15636441.36248.1528731165076.JavaMail.defaultUser@defaultHost> Message-ID:

The scheme I have been using for years is a short message in the local language giving the main point of the error, together with a detailed message in English. One has to see it to believe what happens to messages translated mechanically from English to bidi languages when data is embedded in the text. Best Regards, Jonathan Rosenne

-----Original Message----- From: William_J_G Overington [mailto:wjgo_10009 at btinternet.com] Sent: Monday, June 11, 2018 6:33 PM To: verdy_p at wanadoo.fr; Jonathan Rosenne; asmusf at ix.netcom.com; Steven R. Loomis; jameskasskrv at gmail.com; charupdate at orange.fr; petercon at microsoft.com; richard.wordingham at ntlworld.com Cc: unicode at unicode.org Subject: Re: The Unicode Standard and ISO [...]
From unicode at unicode.org Mon Jun 11 10:32:45 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 11 Jun 2018 16:32:45 +0100 (BST) Subject: The Unicode Standard and ISO In-Reply-To: References: <1244966180.7608.1528570899306.JavaMail.www@wwinf1f27> <1739600906.8497.1528584079440.JavaMail.www@wwinf1f27> Message-ID: <15636441.36248.1528731165076.JavaMail.defaultUser@defaultHost>

Steven R. Loomis wrote: > Marcel, > The idea is not necessarily without merit. However, CLDR does not usually expand scope just because of a suggestion. I usually recommend creating a new project first - gathering data, looking at and talking to projects to ascertain the usefulness of common messages - one of the barriers to adding new content for CLDR is not just the design, but collecting initial data. When emoji or sub-territory names were added, many languages were included before it was added to CLDR.

Well, maybe usually, but perhaps not this time? I opine that if it is going to be done it needs to be done under the umbrella of Unicode Inc. and have lots of people contribute a bit: that way businesses may well use it because, it being part of Unicode Inc., they will have provenance over there being no possibility of later claims for payment. Not that any such claim would necessarily be made, but they need to know that. Also, having lots of people can help get the translations done, as there are a number of people who are bilingual who might like to pitch in. So, give the idea a sound chance of being implemented please.
Asmus Freytag wrote: > If you tried to standardize all error messages even in one language you would never arrive at something that would be universally useful.

Well, that is a big "If". One cannot standardize all pictures as emoji, but emoji still get encoded, some every year now.

I first learned to program back in the 1960s using the Algol 60 language on an Elliott 803 mainframe computer, five-track paper tape, teleprinters to prepare a program on white tape, results out on coloured tape, colours changed when the rolls changed. If I remember correctly, error messages, either at compile time or at run time, came out as a line number and an error number for compile time errors and a number for a run time error. One then looked up the number in the manual or on the enlarged version of the numbers and the corresponding error messages that was mounted on the wall.

> While some simple applications may find that all their needs for communicating with their users are covered, most would wish they had some other messages available.

Yes, but more messages could be added to the list much more often than emoji are added to The Unicode Standard, maybe every month or every fortnight or every week if needed.

> To adopt your scheme, they would need to have a bifurcated approach, where some messages follow the standard, while others do not (cannot).

Not necessarily. A developer would just need to send in a request to Unicode Inc. to add the needed extra sentences to the list and get a code number.

> It's pushing this kind of impractical scheme that gives standardizers a bad name.

It is not an impractical scheme. It can be implemented straightforwardly using the star space system that I have devised. http://www.users.globalnet.co.uk/~ngo/An_encoding_space_designed_for_application_in_encoding_localizable_sentences.pdf http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_019.pdf

Start off with space for 9999 error messages and number them from 4840001 through to 4849999 and allocate meanings as needed. Then a side view of a 4-8-4 locomotive facing to the left could be a logo for the project. Big 4-8-4 locomotives were built years ago. If people could do that then surely people can implement this project successfully now if they want to do so.

For example, one error message could be as follows: Data entry for the currency field must be either a whole positive number or a positive number to exactly two decimal places. Another could be as follows: Division by zero was attempted. Yet another could be as follows: The number of opening parentheses in the expression does not match the number of closing parentheses.

If some day more than 9999 error messages are needed, these can be provided within star space as it is vast. http://www.users.globalnet.co.uk/~ngo/a_completed_publication_about_localizable_sentences_research.pdf William Overington Monday 11 June 2018

From unicode at unicode.org Mon Jun 11 21:53:54 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 12 Jun 2018 04:53:54 +0200 (CEST) Subject: The Unicode Standard and ISO Message-ID: <235812787.183.1528772034025.JavaMail.www@wwinf1n14>

On Mon, 11 Jun 2018 16:32:45 +0100 (BST), William_J_G Overington via Unicode wrote: [...]
> > It's pushing this kind of impractical scheme that gives standardizers a bad name. > It is not an impractical scheme.

I don't fully disagree with Asmus, as I suggested to make available localizable (and effectively localized) libraries of message components, rather than of entire messages. The challenge as I see it is to get them translated to all locales. For this I'm hoping that the advantage of improving user support upstream instead of spending more time on support fora would be obvious.

By contrast I do disagree with the idea that industrial standards (as opposed to governmental procurement) are a safeguard against impractical schemes. Devising impractical specifications on industrial procurement hasn't even been a privilege of the French NB (referring to the examples in my e-mail: https://unicode.org/mail-arch/unicode-ml/y2018-m06/0082.html ), as demonstrated with the example of the hyphen conundrum, where Unicode pushes the use of keyboard layouts featuring two distinct hyphens with the same general category and the same behavior, but different glyphs in some fonts whose designers didn't think further than the original point of overly disambiguating hyphen semantics, while getting around similar traps with other punctuation marks.

And in this thread I wanted to demonstrate that by focusing on the wrong priorities, i.e. legacy character names instead of the practicability of on-going encoding and the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below, as Michael pointed out), ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission, and thus didn't inspire a desire of extensive cooperation (and damaged the reputation of the whole ISO/IEC).
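For illustration, a minimal sketch in Java of the cedilla/comma-below point, using the standard java.text.Normalizer API: the Latvian letter renders with a comma below, yet its canonical decomposition names the cedilla.

    import java.text.Normalizer;

    public class LatvianDecomposition {
        public static void main(String[] args) {
            // U+0123 LATIN SMALL LETTER G WITH CEDILLA, used in Latvian, is
            // rendered with a comma below (or a turned comma above), yet its
            // canonical decomposition uses U+0327 COMBINING CEDILLA.
            String nfd = Normalizer.normalize("\u0123", Normalizer.Form.NFD);
            for (int i = 0; i < nfd.length(); i++) {
                System.out.printf("U+%04X%n", (int) nfd.charAt(i));
            }
            // Prints U+0067 U+0327; U+0326 COMBINING COMMA BELOW never appears.
        }
    }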
Best regards, Marcel

From unicode at unicode.org Tue Jun 12 05:26:47 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 12 Jun 2018 11:26:47 +0100 (BST) Subject: The Unicode Standard and ISO In-Reply-To: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> References: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> Message-ID: <8922114.16140.1528799207793.JavaMail.defaultUser@defaultHost>

Hi Marcel

> I don't fully disagree with Asmus, as I suggested to make available localizable (and effectively localized) libraries of message components, rather than of entire messages.

Could you possibly give some examples of the message components to which you refer please?

Asmus wrote: > A middle ground is a shared terminology database that allows translators working on different products to arrive at the same translation for the same things. Translators already know how to use such databases in their work flow, and integrating a shared one with a product-specific one is much easier than trying to deal with a set of random error messages.

I am not a linguist. I am interested in languages but my knowledge of languages is little more than that of general education, though I have written a song in French. http://www.users.globalnet.co.uk/~ngo/une_chanson.pdf So when Asmus wrote "Translators already know how to use such databases in their work flow, ....", I do not know how to do that myself.

> The challenge as I see it is to get them translated to all locales.

Well, yes, that is a big challenge. It depends whether people want to get it done. In England, with its changeable weather, part of the culture is to talk about the weather. For example, at a bus stop talking about the weather with other people: it is sociable without being intrusive or controversial. Alas it did not occur to me that that might seem strange to some people who are not from England. http://www.english-at-home.com/speaking/talking-about-the-weather/ http://www.bbc.com/future/story/20151214-why-do-brits-talk-about-the-weather-so-much

I remember when I wrote about localizable sentences in this mailing list in mid-April 2009, using sentences about the weather, I hoped, in hindsight rather naively, that people on the mailing list would be interested and that translations into many languages would be posted and then things would get going. In the event, only one person, Magnus Bodin, provided translations. Magnus provided translations into Swedish and also provided a translation for an additional sentence as well. I knew no Swedish myself. These translations have been extremely helpful in my research project as they demonstrate communication through the language barrier using encoded localizable sentences.

Yesterday I provided three example error message sentences. https://www.unicode.org/mail-arch/unicode-ml/y2018-m06/0088.html Please consider one of them. It could be output as a code number, say ::4842357:;, by an application program if someone enters a letter of the alphabet into a currency field, and then displayed localized into a language by first decoding it using a sentence.dat UTF-16 text file for that language: that file includes a line that starts with ::4842357:;| and then has the localization into that particular language, the language being any language that can be displayed using Unicode. For English, the line in the sentence.dat file would be as follows.

::4842357:;|Data entry for the currency field must be either a whole positive number or a positive number to exactly two decimal places.
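For illustration, a minimal sketch in Java of the decoding step just described, assuming a hypothetical sentence.dat laid out exactly as above (one ::code:;|text entry per line, encoded in UTF-16):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class SentenceDat {
        // Loads a sentence.dat file where each line has the form
        // "::4842357:;|Localized sentence text."
        static Map<String, String> load(Path file) throws IOException {
            Map<String, String> table = new HashMap<>();
            // UTF-16 is the encoding William specifies for the file.
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_16)) {
                int sep = line.indexOf('|');
                if (line.startsWith("::") && sep > 0) {
                    table.put(line.substring(0, sep), line.substring(sep + 1));
                }
            }
            return table;
        }

        public static void main(String[] args) throws IOException {
            Map<String, String> sentences = load(Paths.get("sentence.dat"));
            // The application emits only the code number; the locale's file
            // supplies the human-readable sentence.
            System.out.println(
                    sentences.getOrDefault("::4842357:;", "(no localization found)"));
        }
    }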
It would be great if some bilingual readers of this mailing list were to post a translation of the above line of text into another language.

In my research I am using an integral sign as a base character and circled digit characters. If possible, a character such as U+FFF7 could be encoded to be the base character as that would provide a unique unambiguous link to star space from Unicode plain text. However, whether that happens at some future time will depend upon there being sufficient interest at that future time in using localizable sentences for communication through the language barrier. William Overington Tuesday 12 June 2018

From unicode at unicode.org Tue Jun 12 09:53:27 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 12 Jun 2018 16:53:27 +0200 (CEST) Subject: The Unicode Standard and ISO Message-ID: <1178550961.12964.1528815207637.JavaMail.www@wwinf1c20>

William, On 12/06/18 12:26, William_J_G Overington wrote: > > Hi Marcel > > > I don't fully disagree with Asmus, as I suggested to make available localizable (and effectively localized) libraries of message components, rather than of entire messages. > > Could you possibly give some examples of the message components to which you refer please?

Likewise I'd be interested in asking Jonathan Rosenne for an example or two of automated translation from English to bidi languages with data embedded, as on Mon, 11 Jun 2018 15:42:38 +0000, Jonathan Rosenne via Unicode wrote: [...] > > > One has to see it to believe what happens to messages translated mechanically from English to bidi languages when data is embedded in the text.

But both would require launching a new thread. Thinking hard enough, I'm even afraid that most subscribers wouldn't be interested, so we'd have to move off-list. One alternative I can think of is to use one of the CLDR mailing lists. I subscribed to CLDR-users when I was directed to move there some technical discussion about keyboard layouts from Unicode Public. But now, as international message components are not yet a part of CLDR, we'd need to ask for extra permission to do so. An additional drawback of launching a technical discussion right now is that significant parts of CLDR data are not yet correctly localized, so there is another bunch of priorities under the July 11 deadline. I guess that vendors wouldn't be glad to see us gathering data for new structures while level=Modern isn't complete. In the meantime, you are welcome to contribute and to motivate missing people to do the same. Best regards, Marcel

From unicode at unicode.org Tue Jun 12 09:58:09 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Tue, 12 Jun 2018 15:58:09 +0100 Subject: The Unicode Standard and ISO In-Reply-To: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> References: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> Message-ID: <9054B8BE-2689-442A-89CE-DD853D4D67AB@evertype.com>

Marcel, You have put words into my mouth. Please don't. Your description of what I said is NOT accurate.

> On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode wrote: > > And in this thread I wanted to demonstrate that by focusing on the wrong priorities, i.e.
legacy character names instead of the practicability of on-going encoding and the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below, as Michael pointed out), ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission, and thus didn't inspire a desire of extensive cooperation (and damaged the reputation of the whole ISO/IEC).

From unicode at unicode.org Tue Jun 12 10:20:29 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 12 Jun 2018 17:20:29 +0200 (CEST) Subject: The Unicode Standard and ISO In-Reply-To: <9054B8BE-2689-442A-89CE-DD853D4D67AB@evertype.com> References: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> <9054B8BE-2689-442A-89CE-DD853D4D67AB@evertype.com> Message-ID: <883035019.13497.1528816830070.JavaMail.www@wwinf1c20>

On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote: > > Marcel, > > You have put words into my mouth. Please don't. Your description of what I said is NOT accurate. > > > On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode wrote: > > > > And in this thread I wanted to demonstrate that by focusing on the wrong priorities, i.e. legacy character names instead of > > the practicability of on-going encoding and the accurateness of specified decompositions (so that in some instances cedilla > > was used instead of comma below, as Michael pointed out), ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission, > > and thus didn't inspire a desire of extensive cooperation (and damaged the reputation of the whole ISO/IEC).

Michael, I'd better quote your actual e-mail:

On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote: [...] > Many things have more than one name. The only truly bad misnomers from that period was related to a mapping error, > namely, in the treatment of Latvian characters which are called CEDILLA rather than COMMA BELOW.

Now I fail to understand why this mustn't be reworded to "the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below)[.]" If any correction can be made, I'd be eager to take note. Thanks for correcting.

Now let's append the e-mail that I was about to send:

Another ISO Standard that needs to be mentioned in this thread is ISO 15924 (script codes; not ISO/IEC). It has a particular status in that Unicode is the Registration Authority. I wonder whether people agree that it has a French version. Actually it does have a French version, but Michael Everson (Registrar) revealed on this List multiple issues with synching French script names in ISO 15924-fr and in Code Charts translations. Shouldn't this content be moved to CLDR? At least with respect to localized script names.
This does not mean that SC2 "failed to do its part" and it did not cause a lack of desire for cooperation, and it bloody well did not "damage the reputation of the whole ISO/IEC".

As to ISO 15924, it was developed bilingually, and there was consensus on the names that are there. Last year you suggested a massive number of name changes to the French translation of ISO/IEC 10646, and I criticized you for forgoing stability for your own preferences. When it came to the names in 15924, I told you that I do not trust your judgement, and that I would consider revisions to the French names when you came back with consensus on those changes with experts Alain LaBonté, Patrick Andries, Denis Jacquerye, and Marc Lodewijck. As I have not heard from them, I conclude that no such consensus exists.

ISO 15924 is an ISO standard. Aspects of its content may be mirrored in other places, but "moving its content" to CLDR makes no sense.

Michael Everson

> On 12 Jun 2018, at 16:20, Marcel Schneider via Unicode wrote:
> On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
>>
>> Marcel,
>> You have put words into my mouth. Please don't. Your description of what I said is NOT accurate.
>>
>>> On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode wrote:
>>> And in this thread I wanted to demonstrate that by focusing on the wrong priorities, i.e. legacy character names instead of the practicability of on-going encoding and the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below, Michael pointed out), ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission, and thus didn't inspire a desire of extensive cooperation (and damaged the reputation of the whole ISO/IEC).
>
> Michael, I'd better quote your actual e-mail:
>
> On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
> […]
>> Many things have more than one name. The only truly bad misnomers from that period were related to a mapping error,
>> namely, in the treatment of Latvian characters which are called CEDILLA rather than COMMA BELOW.
>
> Now I fail to understand why this mustn't be reworded to "the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below[.])" If any correction can be made, I'd be eager to take note. Thanks for correcting.
>
> Now let's append the e-mail that I was about to send:
>
> Another ISO Standard that needs to be mentioned in this thread is ISO 15924 (script codes; not ISO/IEC). It has a particular status in that Unicode is the Registration Authority.
>
> I wonder whether people agree that it has a French version. Actually it does have a French version, but Michael Everson (Registrar) revealed on this list multiple issues with synching French script names in ISO 15924-fr and in Code Charts translations.
>
> Shouldn't this content be moved to CLDR? At least with respect to localized script names.

From unicode at unicode.org Tue Jun 12 11:00:55 2018
From: unicode at unicode.org (Steven R. Loomis via Unicode)
Date: Tue, 12 Jun 2018 09:00:55 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <883035019.13497.1528816830070.JavaMail.www@wwinf1c20>
References: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> <9054B8BE-2689-442A-89CE-DD853D4D67AB@evertype.com> <883035019.13497.1528816830070.JavaMail.www@wwinf1c20>
Message-ID:

CLDR already has localized script names. The English is taken from ISO 15924.
https://cldr-ref.unicode.org/cldr-apps/v#/fr/Scripts/

On Tue, Jun 12, 2018 at 8:20 AM, Marcel Schneider via Unicode <unicode at unicode.org> wrote:

> On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
> >
> > Marcel,
> >
> > You have put words into my mouth. Please don't. Your description of what I said is NOT accurate.
> >
> > > On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode wrote:
> > >
> > > And in this thread I wanted to demonstrate that by focusing on the wrong priorities, i.e. legacy character names instead of
> > > the practicability of on-going encoding and the accurateness of specified decompositions (so that in some instances cedilla
> > > was used instead of comma below, Michael pointed out), ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission,
> > > and thus didn't inspire a desire of extensive cooperation (and damaged the reputation of the whole ISO/IEC).
>
> Michael, I'd better quote your actual e-mail:
>
> On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
> […]
> > Many things have more than one name. The only truly bad misnomers from that period were related to a mapping error,
> > namely, in the treatment of Latvian characters which are called CEDILLA rather than COMMA BELOW.
>
> Now I fail to understand why this mustn't be reworded to "the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below[.])" If any correction can be made, I'd be eager to take note. Thanks for correcting.
>
> Now let's append the e-mail that I was about to send:
>
> Another ISO Standard that needs to be mentioned in this thread is ISO 15924 (script codes; not ISO/IEC). It has a particular status in that Unicode is the Registration Authority.
>
> I wonder whether people agree that it has a French version. Actually it does have a French version, but Michael Everson (Registrar) revealed on this list multiple issues with synching French script names in ISO 15924-fr and in Code Charts translations.
>
> Shouldn't this content be moved to CLDR? At least with respect to localized script names.

From unicode at unicode.org Tue Jun 12 11:34:20 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Tue, 12 Jun 2018 09:34:20 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <9054B8BE-2689-442A-89CE-DD853D4D67AB@evertype.com>
References: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> <9054B8BE-2689-442A-89CE-DD853D4D67AB@evertype.com>
Message-ID: <2f49747e-5f6e-4413-9b5e-5bf2c1146e69@ix.netcom.com>

An HTML attachment was scrubbed...

From unicode at unicode.org Tue Jun 12 12:21:07 2018
From: unicode at unicode.org (Steven R. Loomis via Unicode)
Date: Tue, 12 Jun 2018 10:21:07 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <45FB2BD6-40B0-47F6-8639-E8A3258C04EE@evertype.com>
References: <235812787.183.1528772034025.JavaMail.www@wwinf1n14> <9054B8BE-2689-442A-89CE-DD853D4D67AB@evertype.com> <883035019.13497.1528816830070.JavaMail.www@wwinf1c20> <45FB2BD6-40B0-47F6-8639-E8A3258C04EE@evertype.com>
Message-ID:

> ISO 15924 is an ISO standard. Aspects of its content may be mirrored in other places, but "moving its content" to CLDR makes no sense.

Fully agreed.
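(A concrete illustration of that mirroring, as a minimal sketch: ICU builds bundle CLDR locale data, so the localized script names can be read back programmatically. This assumes the third-party PyICU package and that it exposes Locale.getDisplayScript as ICU4C does; it is illustrative only, not part of anything discussed in this thread.)

    import icu  # third-party PyICU binding; the ICU library bundles CLDR data

    # Display name of script code Latn, localized into French. The string
    # comes from the CLDR data compiled into whichever ICU build is installed.
    print(icu.Locale('und_Latn').getDisplayScript(icu.Locale('fr')))

If the binding works as assumed, this should print the French CLDR name for Latn (something like "latin"), which is exactly the data the survey-tool URL above lets vetters edit.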
For what it's worth, I reopened a bug of Roozbeh's, https://unicode.org/cldr/trac/ticket/827?#comment:9 , to make sure the ISO 15924 French content gets properly mirrored into CLDR; it looks like there is a French-specific bug there, which may be what you are seeing, Marcel.

On Tue, Jun 12, 2018 at 8:57 AM, Michael Everson via Unicode <unicode at unicode.org> wrote:

> All right, if you want a clear explanation.
>
> Yes, I think the ISO 8859-4 character names for the Latvian letters were mistaken. Yes, I think that mapping them to decompositions with CEDILLA rather than COMMA BELOW was a mistake. Evidently some felt that the normative mapping was important. This does not mean that SC2 "failed to do its part" and it did not cause a lack of desire for cooperation, and it bloody well did not "damage the reputation of the whole ISO/IEC".
>
> As to ISO 15924, it was developed bilingually, and there was consensus on the names that are there. Last year you suggested a massive number of name changes to the French translation of ISO/IEC 10646, and I criticized you for forgoing stability for your own preferences. When it came to the names in 15924, I told you that I do not trust your judgement, and that I would consider revisions to the French names when you came back with consensus on those changes with experts Alain LaBonté, Patrick Andries, Denis Jacquerye, and Marc Lodewijck. As I have not heard from them, I conclude that no such consensus exists.
>
> ISO 15924 is an ISO standard. Aspects of its content may be mirrored in other places, but "moving its content" to CLDR makes no sense.
>
> Michael Everson
>
> > On 12 Jun 2018, at 16:20, Marcel Schneider via Unicode <unicode at unicode.org> wrote:
> > On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
> >>
> >> Marcel,
> >> You have put words into my mouth. Please don't. Your description of what I said is NOT accurate.
> >>
> >>> On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode wrote:
> >>> And in this thread I wanted to demonstrate that by focusing on the wrong priorities, i.e. legacy character names instead of the practicability of on-going encoding and the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below, Michael pointed out), ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission, and thus didn't inspire a desire of extensive cooperation (and damaged the reputation of the whole ISO/IEC).
> >
> > Michael, I'd better quote your actual e-mail:
> >
> > On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
> > […]
> >> Many things have more than one name. The only truly bad misnomers from that period were related to a mapping error,
> >> namely, in the treatment of Latvian characters which are called CEDILLA rather than COMMA BELOW.
> >
> > Now I fail to understand why this mustn't be reworded to "the accurateness of specified decompositions (so that in some instances cedilla was used instead of comma below[.])" If any correction can be made, I'd be eager to take note. Thanks for correcting.
> >
> > Now let's append the e-mail that I was about to send:
> >
> > Another ISO Standard that needs to be mentioned in this thread is ISO 15924 (script codes; not ISO/IEC). It has a particular status in that Unicode is the Registration Authority.
> >
> > I wonder whether people agree that it has a French version.
> > Actually it does have a French version, but Michael Everson (Registrar) revealed on this list multiple issues with synching French script names in ISO 15924-fr and in Code Charts translations.
> >
> > Shouldn't this content be moved to CLDR? At least with respect to localized script names.

From unicode at unicode.org Tue Jun 12 12:49:10 2018
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Tue, 12 Jun 2018 19:49:10 +0200
Subject: The Unicode Standard and ISO
In-Reply-To: <1178550961.12964.1528815207637.JavaMail.www@wwinf1c20>
References: <1178550961.12964.1528815207637.JavaMail.www@wwinf1c20>
Message-ID:

Steven wrote:
> I usually recommend creating a new project first...

That is often a viable approach. But proponents shouldn't get the wrong impression. I think anything resembling the "localized sentences" / "international message components" has zero chance of being adopted by Unicode (including the encoding, CLDR, anything). It is a waste of many people's time discussing it further on this list.

Why? As discussed many times on this list, it would take a major effort, is not scoped properly (the translation of messages depends highly on context, including specific products), and would not meet the needs of practically anyone.

People interested in this topic should
(a) start up their own project somewhere else,
(b) take discussion of it off this list,
(c) never bring it up again on this list.

Mark

On Tue, Jun 12, 2018 at 4:53 PM, Marcel Schneider via Unicode <unicode at unicode.org> wrote:

> William,
>
> On 12/06/18 12:26, William_J_G Overington wrote:
> >
> > Hi Marcel
> >
> > > I don't fully disagree with Asmus, as I suggested to make available localizable (and effectively localized) libraries of message components, rather than of entire messages.
> >
> > Could you possibly give some examples of the message components to which you refer please?
> >
> Likewise I'd be interested in asking Jonathan Rosenne for an example or two of automated translation from English to bidi languages with data embedded, as on Mon, 11 Jun 2018 15:42:38 +0000, Jonathan Rosenne via Unicode wrote:
> […]
> > One has to see it to believe what happens to messages translated mechanically from English to bidi languages when data is embedded in the text.
>
> But both would require launching a new thread. Thinking hard enough, I'm even afraid that most subscribers wouldn't be interested, so we'd have to move off-list. One alternative I can think of is to use one of the CLDR mailing lists. I subscribed to CLDR-users when I was directed to move there some technical discussion about keyboard layouts from Unicode Public.
>
> But now, as international message components are not yet a part of CLDR, we'd need to ask for extra permission to do so.
>
> An additional drawback of launching a technical discussion right now is that significant parts of CLDR data are not yet correctly localized, so there is another bunch of priorities under the July 11 deadline. I guess that vendors wouldn't be glad to see us gathering data for new structures while level=Modern isn't complete.
>
> In the meantime, you are welcome to contribute and to motivate missing people to do the same.
>
> Best regards,
>
> Marcel
From unicode at unicode.org Tue Jun 12 14:00:13 2018
From: unicode at unicode.org (Steven R. Loomis via Unicode)
Date: Tue, 12 Jun 2018 12:00:13 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <15636441.36248.1528731165076.JavaMail.defaultUser@defaultHost>
References: <1244966180.7608.1528570899306.JavaMail.www@wwinf1f27> <1739600906.8497.1528584079440.JavaMail.www@wwinf1f27> <15636441.36248.1528731165076.JavaMail.defaultUser@defaultHost>
Message-ID:

On Mon, Jun 11, 2018 at 8:32 AM, William_J_G Overington <wjgo_10009 at btinternet.com> wrote:

> Steven R. Loomis wrote:
>
> > Marcel,
> > The idea is not necessarily without merit. However, CLDR does not usually expand scope just because of a suggestion.
> > I usually recommend creating a new project first - gathering data, looking at and talking to projects to ascertain the usefulness of common messages... one of the barriers to adding new content for CLDR is not just the design, but collecting initial data. When emoji or sub-territory names were added, many languages were included before it was added to CLDR.
>
> Well, maybe usually, but perhaps not this time?

Especially this time. To Mark's later point: Start a separate project. Don't assume it will ever merge with CLDR. If it succeeds, great.

From unicode at unicode.org Tue Jun 12 14:28:00 2018
From: unicode at unicode.org (Sarasvati via Unicode)
Date: Tue, 12 Jun 2018 14:28:00 -0500
Subject: The Unicode Standard and ISO [localizable sentences]
Message-ID: <201806121928.w5CJS0pD012762@sarasvati.unicode.org>

The topic of localizable sentences is now closed on this mail list. Please take that topic elsewhere.

Thank you.

On 6/12/2018 10:49 AM, Mark Davis ☕️ via Unicode wrote:
> That is often a viable approach. But proponents shouldn't get the wrong impression. I think anything resembling the "localized sentences" / "international message components" has zero chance of being adopted by Unicode (including the encoding, CLDR, anything). It is a waste of many people's time discussing it further on this list.
> Why? As discussed many times on this list, it would take a major effort, is not scoped properly (the translation of messages depends highly on context, including specific products), and would not meet the needs of practically anyone.
> People interested in this topic should
> (a) start up their own project somewhere else,
> (b) take discussion of it off this list,
> (c) never bring it up again on this list.

From unicode at unicode.org Wed Jun 13 06:49:56 2018
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Wed, 13 Jun 2018 13:49:56 +0200
Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers?
In-Reply-To:
References:
Message-ID:

> That is, why is conforming to UAX #31 worth the risk of prohibiting the use of characters that some users might want to use?

One could parse for certain sequences, putting characters into a number of broad categories. Very approximately:

- junk ~= [[:cn:][:cs:][:co:]]+
- whitespace ~= [[:z:][:c:]-junk]+
- syntax ~= [[:s:][:p:]] // broadly speaking, including both the language syntax & user-named operators
- identifiers ~= [all-else]+

UAX #31 specifies several different kinds of identifiers, and takes roughly that approach for http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the focus there is on immutability.
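(To make those four classes concrete, here is one possible reading in Python, using only the standard unicodedata module. A sketch: the class names and the coalescing of runs are mine, not anything UAX #31 specifies.)

    import unicodedata
    from itertools import groupby

    def broad_class(ch):
        # Broad split per the four approximate classes above.
        cat = unicodedata.category(ch)
        if cat in ('Cn', 'Cs', 'Co'):   # unassigned, surrogates, private use
            return 'junk'
        if cat[0] in 'ZC':              # separators plus remaining controls/format
            return 'whitespace'
        if cat[0] in 'SP':              # symbols and punctuation
            return 'syntax'
        return 'identifier'             # letters, marks, digits: everything else

    def lex(text):
        # Coalesce adjacent characters of the same class into runs.
        return [(k, ''.join(g)) for k, g in groupby(text, key=broad_class)]

    # lex("ab.cd();") yields identifier 'ab', syntax '.', identifier 'cd',
    # and a syntax run '();' -- a real lexer would split syntax runs further.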
So an implementation could choose to follow that course, rather than the more narrowly defined identifiers in http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively, one can conform to the Default Identifiers but declare a profile that expands the allowable characters. One could take a Swiftian approach, for example...

Mark

On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode <unicode at unicode.org> wrote:

> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen wrote:
> > Considering that ruling out too much can be a problem later, but just
> > treating anything above ASCII as opaque hasn't caused trouble (that I
> > know of) for HTML other than compatibility issues with XML's stricter
> > stance, why should a programming language, if it opts to support
> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> > complexity of UAX #31 instead of allowing everything above ASCII in
> > identifiers? In other words, what problem does making a programming
> > language conform to UAX #31 solve?
>
> After refreshing my memory of XML history, I realize that mentioning XML does not helpfully illustrate my question despite the mention of XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please ignore the XML part.
>
> Trying to rephrase my question more clearly:
>
> Let's assume that we are designing a computer-parseable syntax where tokens consisting of user-chosen characters can't occur next to each other and, instead, always have some syntax-reserved characters between them. That is, I'm talking about syntaxes that look like this (could be e.g. Java):
>
> ab.cd();
>
> Here, ab and cd are tokens with user-chosen characters whereas space (the indent), period, parenthesis and the semicolon are syntax-reserved. We know that ab and cd are distinct tokens, because there is a period between them, and we know the opening parenthesis ends the cd token.
>
> To illustrate what I'm explicitly _not_ talking about, I'm not talking about a syntax like this:
>
> αβ∘γδ
>
> Here αβ and γδ are user-named variable names and ∘ is a user-named operator, and the distinction between different kinds of user-named tokens has to be known somehow in order to be able to tell that there are three distinct tokens: αβ, ∘, and γδ.
>
> My question is:
>
> When designing a syntax where tokens with the user-chosen characters can't occur next to each other without some syntax-reserved characters between them, what advantages are there from limiting the user-chosen characters according to UAX #31 as opposed to treating any character that is not a syntax-reserved character as a character that can occur in user-named tokens?
>
> I understand that taking the latter approach allows users to mint tokens that on some aesthetic measure don't make sense (e.g. minting tokens that consist of glyphless code points), but why is it important to prescribe that this is prohibited as opposed to just letting users choose not to mint tokens that are inconvenient for them to work with given the behavior that their plain text editor gives to various characters? That is, why is conforming to UAX #31 worth the risk of prohibiting the use of characters that some users might want to use? The introduction of XID after ID and the introduction of Extended Hashtag Identifiers after XID is indicative of over-restriction having
> been a problem.
>
> Limiting user-minted tokens to UAX #31 does not appear to be necessary for security purposes considering that HTML and CSS exist in a particularly adversarial environment and get away with taking the approach that any character that isn't a syntax-reserved character is collected as part of a user-minted identifier. (Informally, both treat non-ASCII characters the same as an ASCII underscore. HTML even treats non-whitespace, non-U+0000 ASCII controls that way.)
>
> --
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/

From unicode at unicode.org Wed Jun 13 15:25:13 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 13 Jun 2018 22:25:13 +0200 (CEST)
Subject: The Unicode Standard and ISO
Message-ID: <1331468207.16012.1528921513396.JavaMail.www@wwinf1f27>

On Tue, 12 Jun 2018 19:49:10 +0200, Mark Davis ☕️ via Unicode wrote:
[…]
> People interested in this topic should
> (a) start up their own project somewhere else,
> (b) take discussion of it off this list,
> (c) never bring it up again on this list.

Thank you for letting us know. I apologize for my e-mailing. I didn't respond right away, for a variety of reasons, though I agreed fully from the start; mainly I had wondered why I got no feedback when I recently ended a thread that was going the same way, but that no longer matters. No problem: as far as it is up to me, this topic will never be read again, here or elsewhere.

Sorry again.

Best regards,

Marcel

From unicode at unicode.org Wed Jun 13 15:52:35 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 13 Jun 2018 22:52:35 +0200 (CEST)
Subject: Your message to Unicode awaits moderator approval
In-Reply-To:
References:
Message-ID: <1519975345.16208.1528923155917.JavaMail.www@wwinf1f27>

> Message du 13/06/18 22:25
> De : "via Unicode"
> A : charupdate at orange.fr
> Copie à :
> Objet : Your message to Unicode awaits moderator approval
>
> Your mail to 'Unicode' with the subject
>
> Re: The Unicode Standard and ISO
>
> Is being held until the list moderator can review it for approval.
>
> The reason it is being held:
>
> Post to moderated list
>
> Either the message will get posted to the list, or you will receive
> notification of the moderator's decision. If you would like to cancel
> this posting, please visit the following URL:
>
> http://unicode.org/mailman/confirm/unicode/07224dbb3f89488430be25c396d1590baa55c022
>

I'm unable to decide whether I should cancel this myself or do nothing. If there is no use in posting, so much the better. Anyway, I've nothing more to tell on any list, as UTC isn't interested in fixing the bidi legibility issue I've pointed out, and won't probably be interested in deprecating U+2010 so as not to mislead font designers. Additionally, some people post false allegations at my expense and are getting insulting, confusing Unicode Public with a WG2 meeting in the nineties, despite all of that having been discussed off-list last year.

On Tue, 12 Jun 2018 19:49:10 +0200, Mark Davis ☕️ via Unicode wrote:
[…]
> People interested in this topic should
> (a) start up their own project somewhere else,
> (b) take discussion of it off this list,
> (c) never bring it up again on this list.

Thank you for letting us know. I apologize for my e-mailing.
I didn't respond right away, for a variety of reasons, though I agreed fully from the start; mainly I had wondered why I got no feedback when I recently ended a thread that was going the same way, but that no longer matters. No problem: as far as it is up to me, this topic will never be read again, here or elsewhere.

Sorry again.

Best regards,

Marcel

From unicode at unicode.org Wed Jun 13 16:16:46 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 13 Jun 2018 23:16:46 +0200 (CEST)
Subject: Please disregard my mistaken e-mail
Message-ID: <62481283.16494.1528924606818.JavaMail.www@wwinf1f27>

My last e-mail, with subject "re: Your message to Unicode awaits moderator approval", was mistakenly sent to the mailing list, as I forgot to remove an address in the cc field (end hidden). Please disregard. My apologies.

Marcel

From unicode at unicode.org Thu Jun 14 09:27:06 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Thu, 14 Jun 2018 15:27:06 +0100 (BST)
Subject: Regarding document L2/18-203 Coded Hashes of Arbitrary Images (L2/16-105)
Message-ID: <6137238.31087.1528986426377.JavaMail.defaultUser@defaultHost>

I have been reading through the document. I am wondering if the way forward would be to use a different technique and instead to encode images directly using vector graphics. For example, as in the paper that starts at page 21 of IBA Technical Review 20. IBA was the Independent Broadcasting Authority of the United Kingdom. The document has been scanned and added to the web. The link is about half-way down the following web page.

http://www.ntlpa.org.uk/memorabilia

The file is 2.2 megabytes, as it is scanned images.

William Overington

Thursday 14 June 2018

From unicode at unicode.org Fri Jun 15 07:29:50 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Fri, 15 Jun 2018 13:29:50 +0100 (BST)
Subject: The Unicode Standard and ISO [localizable sentences]
Message-ID: <21424153.21554.1529065790660.JavaMail.defaultUser@defaultHost>

> The topic of localizable sentences is now closed on this mail list.
> Please take that topic elsewhere.
> Thank you.

May I please mention, with permission, that there is now a thread to discuss the issue of translations and their context that was mentioned?

https://community.serif.com/discussion/112261/a-discussion-about-translations-and-their-context-localizable-sentences-research-project-related

The thread is in the lounge section of the support forum of Serif, the English software company that produced the program that I use to produce PDF (Portable Document Format) documents.

William Overington

Friday 15 June 2018

From unicode at unicode.org Tue Jun 19 02:57:37 2018
From: unicode at unicode.org (Ivan Panchenko via Unicode)
Date: Tue, 19 Jun 2018 09:57:37 +0200
Subject: Italic mu in squared Latin abbreviations?
Message-ID: <837522a8-b98a-7146-6f03-857c53798993@gmail.com>

Is there a reason why the mu does not appear upright in the reference glyph for U+3382 ㎂, U+338C ㎌, U+338D ㎍, U+3395 ㎕, U+339B ㎛, U+33B2 ㎲, U+33B6 ㎶ and U+33BC ㎼ of the CJK Compatibility code chart? I also see it this way in fonts such as Junicode, Unifont and WenQuanYi Zen Hei, while U+00B5 µ is displayed upright there. "Prefix symbols are printed in roman (upright) type, as are unit symbols, regardless of the type used in the surrounding text, […]." (SI Brochure)

By the way, U+3396 ㎖
is displayed with a capital M instead of a small m in Droid Sans Fallback (as included in my Ubuntu system) and Arial Unicode MS, suggesting "megaliter" instead of "milliliter"! The latter font has not been developed further since version 1.01; does someone know about the former? Droid fonts can be purchased at fonts.com, but I cannot find the fallback font there (so maybe contacting Ascender would be of no use).

Best regards
Ivan

From unicode at unicode.org Tue Jun 19 11:31:21 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 19 Jun 2018 18:31:21 +0200
Subject: Italic mu in squared Latin abbreviations?
In-Reply-To: <837522a8-b98a-7146-6f03-857c53798993@gmail.com>
References: <837522a8-b98a-7146-6f03-857c53798993@gmail.com>
Message-ID:

CJK-specific letter forms for these abbreviations/units should be left as is. They are kept for compatibility reasons, and I don't see a reason to change them to upright, which would contradict their legacy usage. The SI brochure does not apply to these legacy square presentations (which would anyway be incomplete for many SI units and derived units). Still, it is possible to create a CJK font that maps them with an upright mu; I don't think it would cause major damage.

When using SI units, these characters are not used, and the standalone Latin characters should not use any ligature or special form. They may still be styled separately if one needs to distinguish SI units from abbreviations or actual words: it is possible to change the font family or font style specifically for the unit, or for the whole quantity with the number and unit symbol, or for a whole formula, to isolate them from the rest of the normal text. This is what CJK fonts already do for some of these units and some common abbreviations, so for me these CJK compatibility characters are already such explicit style modifications applied to Latin. These characters may still be partly restyled (color, weight, but probably not italic/oblique, which would leave them intact), just like the CJK wide and narrow variants of Latin letters, as they have to keep their CJK square or half-square metrics.

2018-06-19 9:57 GMT+02:00 Ivan Panchenko via Unicode <unicode at unicode.org>:

> Is there a reason why the mu does not appear upright in the reference glyph for U+3382 ㎂, U+338C ㎌, U+338D ㎍, U+3395 ㎕, U+339B ㎛, U+33B2 ㎲, U+33B6 ㎶ and U+33BC ㎼ of the CJK Compatibility code chart? I also see it this way in fonts such as Junicode, Unifont and WenQuanYi Zen Hei, while U+00B5 µ is displayed upright there. "Prefix symbols are printed in roman (upright) type, as are unit symbols, regardless of the type used in the surrounding text, […]." (SI Brochure)
>
> By the way, U+3396 ㎖ is displayed with a capital M instead of a small m in Droid Sans Fallback (as included in my Ubuntu system) and Arial Unicode MS, suggesting "megaliter" instead of "milliliter"! The latter font has not been developed further since version 1.01; does someone know about the former? Droid fonts can be purchased at fonts.com, but I cannot find the fallback font there (so maybe contacting Ascender would be of no use).
>
> Best regards
> Ivan
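(As an aside on the identity question above: the compatibility decompositions in the Unicode Character Database pin down what these squared units are meant to contain, and they are easy to inspect. A minimal sketch with Python's standard unicodedata module, output elided:)

    import unicodedata

    # Print the name and compatibility decomposition of each squared unit
    # mentioned in this thread, straight from the UCD tables bundled with Python.
    for cp in (0x3382, 0x338C, 0x338D, 0x3395, 0x3396, 0x339B, 0x33B2, 0x33B6, 0x33BC):
        ch = chr(cp)
        print('U+%04X %s %s -> %s' % (cp, ch, unicodedata.name(ch),
                                      unicodedata.decomposition(ch)))

The mu units decompose with U+03BC GREEK SMALL LETTER MU rather than U+00B5, and U+3396 decomposes with lowercase U+006D m, so a capital-M glyph there is a font bug, not a chart ambiguity.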
From unicode at unicode.org Wed Jun 20 16:17:54 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Wed, 20 Jun 2018 14:17:54 -0700
Subject: Italic mu in squared Latin abbreviations?
Message-ID: <20180620141754.665a7a7059d7ee80bb4d670165c8327d.926ef21daf.wbe@email03.godaddy.com>

Ivan Panchenko wrote:

> Is there a reason why the mu does not appear upright

It was probably italicized in the glyphs printed in the relevant Japanese standard, back in the 1990s. The glyphs in the Unicode charts are not normative, except for a very small handful of encoded characters like Dingbats where they are "kind of normative." Because of this, it's not necessary to worry about whether the µ in the CJK squared Latin abbreviations is italic or roman in any given font. Fonts will be fonts. Glyph variation happens.

Rendering ㎖ with a capital M does seem to be a violation of character identity, but Arial Unicode MS has not been updated since 2000 and this problem is likely to remain unsolved.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Wed Jun 20 16:53:07 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 20 Jun 2018 14:53:07 -0700
Subject: Italic mu in squared Latin abbreviations?
In-Reply-To: <20180620141754.665a7a7059d7ee80bb4d670165c8327d.926ef21daf.wbe@email03.godaddy.com>
References: <20180620141754.665a7a7059d7ee80bb4d670165c8327d.926ef21daf.wbe@email03.godaddy.com>
Message-ID:

An HTML attachment was scrubbed...

From unicode at unicode.org Thu Jun 21 18:46:41 2018
From: unicode at unicode.org (Daniel R. Tobias via Unicode)
Date: Thu, 21 Jun 2018 19:46:41 -0400
Subject: Reminder Ribbon U+1F397
Message-ID: <5B2C38E1.29458.28102C81@dan.tobias.name>

The Unicode standard for the Reminder Ribbon character (U+1F397) does not appear to specify or suggest a color for the ribbon (the glyph shown in the code chart is black, like other characters there). Platforms that support this character among the other emojis do, however, assign a color to it, as seen in character pick lists as well as where the character is shown in sent or received messages. This, however, is not done with any consistency; different platforms have used yellow, blue, and red ribbons, as shown here:

https://emojipedia.org/reminder-ribbon/

Different colors have different associations when used in various campaigns and movements; some are listed here:

https://en.wikipedia.org/wiki/List_of_awareness_ribbons

This can produce confusion when somebody uses the character (e.g., in a tweet or text message) in association with a campaign that uses the color that happens to match that used in the sender's platform (for instance, yellow ribbons have been in current use to call for release of Catalan prisoners held by Spain), but a reader of the message on a different platform sees it differently, with a color that might have different associations.

Perhaps a larger set of ribbon characters, with defined colors for each, is called for? Or is this better done by creating composite characters with the existing ribbon character combined with a color-specifying code point?
--
== Dan ==
Dan's Mail Format Site: http://mailformat.dan.info/
Dan's Web Tips: http://webtips.dan.info/
Dan's Domain Site: http://domains.dan.info/

From unicode at unicode.org Sat Jun 23 22:45:48 2018
From: unicode at unicode.org (Rebecca T via Unicode)
Date: Sat, 23 Jun 2018 23:45:48 -0400
Subject: Reminder Ribbon U+1F397
In-Reply-To: <5B2C38E1.29458.28102C81@dan.tobias.name>
References: <5B2C38E1.29458.28102C81@dan.tobias.name>
Message-ID:

> Perhaps a larger set of ribbon characters, with defined colors for
> each, is called for?

[image: crying pointing gun.jpg]

But uh, seriously, the ribbon was encoded in the Great Wingdings and Webdings Migration of 2011 (see L2/12-368, p. 21) and I would imagine that future ribbon characters would be rejected precisely *because* they "have different associations when used in various campaigns and movements."

On Fri, Jun 22, 2018 at 2:08 AM Daniel R. Tobias via Unicode <unicode at unicode.org> wrote:

> The Unicode standard for the Reminder Ribbon character (U+1F397) does not appear to specify or suggest a color for the ribbon (the glyph shown in the code chart is black, like other characters there). Platforms that support this character among the other emojis do, however, assign a color to it, as seen in character pick lists as well as where the character is shown in sent or received messages. This, however, is not done with any consistency; different platforms have used yellow, blue, and red ribbons, as shown here:
>
> https://emojipedia.org/reminder-ribbon/
>
> Different colors have different associations when used in various campaigns and movements; some are listed here:
>
> https://en.wikipedia.org/wiki/List_of_awareness_ribbons
>
> This can produce confusion when somebody uses the character (e.g., in a tweet or text message) in association with a campaign that uses the color that happens to match that used in the sender's platform (for instance, yellow ribbons have been in current use to call for release of Catalan prisoners held by Spain), but a reader of the message on a different platform sees it differently, with a color that might have different associations.
>
> Perhaps a larger set of ribbon characters, with defined colors for each, is called for? Or is this better done by creating composite characters with the existing ribbon character combined with a color-specifying code point?
>
> --
> == Dan ==
> Dan's Mail Format Site: http://mailformat.dan.info/
> Dan's Web Tips: http://webtips.dan.info/
> Dan's Domain Site: http://domains.dan.info/