From unicode at unicode.org Sat Dec 2 17:49:03 2017 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Sun, 3 Dec 2017 05:19:03 +0530 Subject: \b and Indic word boundaries? Message-ID: Hello. Yesterday I reported https://bugs.python.org/issue32198 but then was pointed to already existing https://bugs.python.org/issue1693050 and friends. From reading these I came to find \b under https://unicode.org/reports/tr18/#Compatibility_Properties. I confess I don't entirely grok all the intricacies. So my question: isn't \b the Unicode-recommended way of identifying full Unicode-aware word boundaries in regexes? If not, what is? -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Mon Dec 4 07:30:22 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Dec 2017 13:30:22 +0000 Subject: Minimal Implementation of Unicode Collation Algorithm Message-ID: <20171204133022.07571022@JRWUBU2> May a collation algorithm that always compares all strings as equal be a compliant implementation of the Unicode Collation Algorithm (UTS #10)? If not, by which clause is it not compliant? Formally, this algorithm would require that all weights be zero. Would an implementation that supported no characters be compliant? It used to be that for an implementation to be claimed as compliant, it also had to pass a specific conformance test. This requirement has now been abandoned, perhaps because the Default Unicode Collation Element Table (DUCET) is incompatible with the CLDR Collation Algorithm. The compatibility issues are that the DUCET weighting of U+FFFE is incompatible with the CLDR Collation algorithm, and it seems that the ICU implementation will not work if well-formedness condition WF5 is not met.
Meeting WF5 without changing the collation would require about a thousand extra entries in the table - the CLDR root collation just adds the six changes (plus a consequent four entries for FCD closure) desirable for natural language, and accepts the consequent changes for unlikely strings. Richard. From unicode at unicode.org Mon Dec 4 14:48:11 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Mon, 4 Dec 2017 12:48:11 -0800 Subject: Minimal Implementation of Unicode Collation Algorithm In-Reply-To: <20171204133022.07571022@JRWUBU2> References: <20171204133022.07571022@JRWUBU2> Message-ID: On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > May a collation algorithm that always compares all strings as equal be a > compliant implementation of the Unicode Collation Algorithm (UTS #10)? > If not, by which clause is it not compliant? Formally, this algorithm > would require that all weights be zero. > I think so. The algorithm would be equivalent to an implementation of the UCA with a degenerate CET that maps every character to a Completely Ignorable Collation Element. Would an implementation that supported no characters be compliant? > I guess so. I assume that would mean that the CET maps nothing, and that the implementation does implement the implicit weighting of Han characters and unassigned (here: unmapped) code points. It would also have to do NFD first. It used to be that for an implementation to be claimed as compliant, it > also had to pass a specific conformance test. This requirement has now > been abandoned, perhaps because the Default Unicode Collation Element > Table (DUCET) is incompatible with the CLDR Collation Algorithm. > The DUCET is missing some things that are needed by the CLDR Collation Algorithm, but that has nothing to do with UCA compliance. The simple fact is that tailorings are common, and it has to be possible to conform to the algorithm without forbidding tailorings. 
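[Editorial note: Markus's reading can be made concrete. Under a degenerate collation element table in which every character maps to a Completely Ignorable Collation Element (all weights zero), zero weights are omitted from sort keys, so every key is empty and all strings compare equal. A minimal sketch; the table, the key format, and the function names are invented for this example, not taken from any implementation:]

```python
# Sketch of the degenerate UCA implementation under discussion: every
# character maps to a Completely Ignorable Collation Element (all weights
# zero).  Zero weights are dropped during sort-key construction, so every
# sort key is empty and all strings compare equal.
import unicodedata

def collation_elements(ch):
    # Degenerate CET: [.0000.0000.0000] for every character.
    return [(0, 0, 0)]

def sort_key(s):
    s = unicodedata.normalize("NFD", s)   # the algorithm normalizes first
    key = []
    for level in range(3):                # levels L1..L3
        for ch in s:
            for ce in collation_elements(ch):
                if ce[level] != 0:        # drop zero weights
                    key.append(ce[level])
        key.append(0)                     # level separator
    return tuple(key)

# Any two strings compare equal under this table.
assert sort_key("apple") == sort_key("Zebra") == sort_key("")
```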
markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 4 19:02:22 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 5 Dec 2017 01:02:22 +0000 Subject: Minimal Implementation of Unicode Collation Algorithm In-Reply-To: References: <20171204133022.07571022@JRWUBU2> Message-ID: <20171205010222.408a2e96@JRWUBU2> On Mon, 4 Dec 2017 12:48:11 -0800 Markus Scherer via Unicode wrote: > On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > Would an implementation that supported no characters be compliant? > I guess so. I assume that would mean that the CET maps nothing, and > that the implementation does implement the implicit weighting of Han > characters and unassigned (here: unmapped) code points. It would also > have to do NFD first. I am extrapolating from the comment on UTS10-C1 in UTS#10, "In particular, a conformant implementation must be able to compare any two canonical-equivalent strings as being equal, for all Unicode characters supported by that implementation." There is now nothing that forces the implementation to support any Unicode characters! Possibly this results from an attempt to allow an implementation to conform to Version x.y.z of the UCA while supporting normalisation for some other set of characters, or choosing not to support characters with non-zero canonical combining class, which, while not eliminating the need to address canonical equivalence, goes a long way towards doing so. I am not aware of any general requirement that a CET be a tailoring of DUCET or of the CLDR root collation, so the implicit weights would be irrelevant in this case. The implicit weights are part of DUCET. If no characters are supported, performing NFD will be a rather obvious trivial transformation of the null string to itself.
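[Editorial note: for reference, the implicit weights Richard and Markus mention are computed arithmetically from the code point (UTS #10, "Implicit Weights"). A sketch of that computation; the block tests below are deliberately simplified and approximate, since a real implementation must consult the Unified_Ideograph property and the exact block ranges:]

```python
# Implicit weight computation, following UTS #10's "Implicit Weights".
# Simplified: only the core Han block and an approximate extension range
# are special-cased here; everything else falls into the FBC0 bucket that
# also covers unassigned code points.
def implicit_weights(cp):
    if 0x4E00 <= cp <= 0x9FFF:                                 # core CJK block (simplified test)
        base = 0xFB40
    elif 0x3400 <= cp <= 0x4DBF or 0x20000 <= cp <= 0x2EBEF:   # extensions (approximate ranges)
        base = 0xFB80
    else:                                                      # unassigned and other code points
        base = 0xFBC0
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    # Two collation elements: [.AAAA.0020.0002][.BBBB.0000.0000]
    return [(aaaa, 0x20, 0x02), (bbbb, 0, 0)]

# U+4E00 gets primary weights FB40 and CE00.
assert implicit_weights(0x4E00) == [(0xFB40, 0x20, 0x02), (0xCE00, 0, 0)]
```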
> > It used to be that for an implementation to be claimed as compliant, > it > > also had to pass a specific conformance test. This requirement has > > now been abandoned, perhaps because the Default Unicode Collation > > Element Table (DUCET) is incompatible with the CLDR Collation > > Algorithm. > > The DUCET is missing some things that are needed by the CLDR Collation > Algorithm, but that has nothing to do with UCA compliance. An implementation that only implements the CLDR collation algorithm cannot be tailored to support DUCET, because DUCET (at Version 10.0.0) has the ordering U+FFF8 < U+FFFE < U+1004E, which is incompatible with UTS#35 Part 5 Section 1.1.1 - "U+FFFE maps to a CE with a minimal, unique primary weight". Therefore one could only apply the published UCA conformance test if it deliberately avoided strings containing U+FFFE. > The simple fact is that tailorings are common, and it has to be > possible to conform to the algorithm without forbidding tailorings. It's the CLDR collation algorithm that prohibits DUCET. Thankfully, the CLDR root collation can be interpreted to be compatible with the UCA. (Tailorings may be incompatible, or at least, incompatible with the concept of a finite CET.) Richard. From unicode at unicode.org Tue Dec 5 11:44:05 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 18:44:05 +0100 Subject: Armenian Mijaket (Armenian colon) Message-ID: The Armenian script has its own distinctive punctuation (vertsaket) for the standard full stop at the end of a sentence (whose glyph looks very much like the Basic Latin/ASCII colon, though slightly bolder and more slanted, and whose dots are rectangular). It is encoded at U+0589, and is used in traditional texts instead of the "modern" full stop. But Armenian also has its own distinctive punctuation (mijaket) for the introductory colon between two phrases of the same sentence (whose glyph looks very much like the Basic Latin/ASCII full stop).
It is not encoded, and I don't like using the ASCII full stop where it causes confusion. Where is the Armenian distinctive mijaket? Shouldn't it be encoded at U+0588? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 13:28:10 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 20:28:10 +0100 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <20171205185925.w52f4rijsld7m6cy@number19> References: <20171205185925.w52f4rijsld7m6cy@number19> Message-ID: U+2024 is not supported in any fonts I have loaded. A websearch of mijaket gives nothing. U+2024 is used as a "leader dot", and does not match the expected metrics (it is certainly not a mijaket; it should be more like U+0589, i.e. a bold parallelogram, not a thin leader dot). Leader dots are NOT used as real punctuation; they are presentational, for example in a TOC (table of contents), where they are aligned in arbitrarily long rows. The note in http://www.unicode.org/charts/PDF/U2000.pdf is absolutely not normative, and in fact it is wrong in my opinion. The mijaket (Armenian colon) should be encoded (preferably at U+0588 in the Armenian block) as it also has to be distinguished from leader dots in an Armenian TOC, exactly like the vertsaket was distinguished at U+0589. 2017-12-05 19:59 GMT+01:00 S. Gilles : > On 2017-12-05T18:44:05+0100, Philippe Verdy via Unicode wrote: > > The Armenian script has its own distinctive punctuation (vertsaket) for > the > > standard full stop at end of sentence (whose glyph looks very much like > the > > Basic Latin/ASCII colon, however slightly more bold and slanted and whose > > dots are rectangular). It is encoded at U+0589. And used in traditional > > texts instead of the "modern" full stop.
> > > > But Armenian also has its own distinctive punctuation (mijaket) for the > > introductory colon between two phrases of the same sentence (whose glyph > > looks very much like the Basic Latin/ASCII full stop). It is not encoded > > and I don't like using the ASCII full stop where it causes confusion. > > > > Where is the Armenian distinctive mijaket? Shouldn't it be encoded at > > U+0588? > > Off-list because I generally don't know what I'm talking about, but > grepping NamesList.txt for "mijaket" gives U+2024. If this isn't > what you're looking for, my apologies. > > -- > S. Gilles > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 13:46:22 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 20:46:22 +0100 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: References: <20171205185925.w52f4rijsld7m6cy@number19> Message-ID: Note that "Noto Sans Armenian" does not even map U+2024 (I doubt it is accepted as a real replacement for the missing Armenian mijaket, which plays a role similar to a Latin semicolon or colon), though it does map the hyphen at U+2010. But U+0589 (Armenian "vertsaket", the Armenian full stop that looks like the Latin ":" colon) is mapped.
My opinion is that the one dot leader has only been used by some sources that don't need to render tabular data or TOCs: the sources needing these traditional distinctions are probably religious texts, and clearly they don't even look like what is in the Unicode PDF for the representative glyph. "Noto Sans Armenian" is designed for modern use on displays, and even there we would need a better distinction and better metrics than going with the possible "Noto Sans" mapping of the leader dot at U+2024 (which still does not exist). In fact, leaders are better represented another way than by repeating this character: leaders are essentially parsed in arbitrary lengths like tabulation whitespace, and so the leader dot is not semantically suitable at all as a mijaket. It is just as if we wanted to replace ASCII full stops or colons and semicolons in English by SPACE or TAB: in Armenian this just causes havoc. 2017-12-05 20:28 GMT+01:00 Philippe Verdy : > U+2024 is not supported in any fonts I have loaded. A websearch of mijaket > gives nothing. > U+2024 is used as a "leader dot", and does not match the expected metrics > (it is certainly not a mijaket, it should be more like U+0589, i.e. as a > bold parallelogram, and not a thin leader dot). > > Leader dots are NOT used as real punctuation, they are presentational, for > example in TOC (table of contents), where they are aligned in arbitrarily > long rows. > > The note in http://www.unicode.org/charts/PDF/U2000.pdf is absolutely not > normative and in fact it is wrong in my opinion. > > The mijaket (Armenian colon) should be encoded (preferably at U+0588 in > the Armenian block) as it also has to be distinguished from leader dots in > Armenian TOC, exactly like the vertsaket was distinguished at U+0589. > > > 2017-12-05 19:59 GMT+01:00 S.
Gilles : > >> On 2017-12-05T18:44:05+0100, Philippe Verdy via Unicode wrote: >> > The Armenian script has its own distinctive punctuation (vertsaket) for >> the >> > standard full stop at end of sentence (whose glyph looks very much like >> the >> > Basic Latin/ASCII colon, however slighly more bold and slanted and whose >> > dots are rectangular). It is encoded at U+0589. And used in traditional >> > texts instead of the "modern" full stop. >> > >> > But Armenian also has its own distinctive puctuation (mijaket) for the >> > introductory colon between two phrases of the same sentence (whose glyph >> > looks very much like the Basic Latin/ASCII full stop). It is not encoded >> > and I don't like using the ASCII full stop where it causes confusion. >> > >> > Where is the Armenian distinctive mijaket? Shouldn't it be encoded at >> > U+0588? >> >> Off-list because I generally don't know what I'm talking about, but >> grepping NamesList.txt for ?mijaket? gives U+2024. If this isn't >> what you're looking for, my apologies. >> >> -- >> S. Gilles >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 14:35:14 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 5 Dec 2017 12:35:14 -0800 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: References: <20171205185925.w52f4rijsld7m6cy@number19> Message-ID: <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Dec 5 15:08:39 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 22:08:39 +0100 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> References: <20171205185925.w52f4rijsld7m6cy@number19> <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> Message-ID: In fact I would also remove the suggested misleading (non-normative) note in NamesList.txt about the use of the ONE DOT LEADER. It is just one of the possible fallbacks, but it has the wrong properties for encoding plain text; it is only useful as a rendering fallback, and not even useful for that, because almost no font maps this character, as leader dots are preferably rendered another way, by drawing a dotted line. Some text renderers may use the leader dot only when they need to transform a leader space into a dotted line and need a glyph for that, but note that they will also need to control the spacing and margins, and will probably always put it on the baseline like regular full stops. A better fallback is the middle dot (but with additional thin space around it). Still, for the semantics, we should not have to use such rendering fallbacks for composing plain texts (imagine what we want to enter in a database of texts, or in translation engines that don't know and should not have to worry about fonts, font styles or metrics, when here we need a clear semantic distinction for the mijaket: the colon or semicolon articulating two phrases in the same sentence, or at the end of an introductory sentence followed by one value or a list of Armenian words, itself terminated by an Armenian full stop U+0589). You'll note that on Wikipedia, the ArmSCII table at the top of the page was composed and rendered (in LaTeX) with the middle dot, which is clearly distinguished from the ASCII full stop and the Armenian full stop. You will find no mention there of the ONE DOT LEADER.
This is especially important because today Armenian will be written using either "modern" (ASCII) punctuation (like in English, with colons, semicolons, and full stops) or traditional punctuation. And it cannot be predicted in which context the translated texts will be used (modern/ASCII or traditional), so we have an ambiguity about how to translate and represent colons/semicolons and full stops. The Armenian full stop is clearly encoded. The Armenian [semi]colon is not, and we only have fallbacks. So we need the "mijaket", and U+0588 (unallocated, just before the distinctive U+0589 Armenian full stop) is the best place. Even in the Unicode representative chart, you'll note that the characters are slanted, including the punctuation, and the dots become ovals. Various Armenian texts use square dots (apparently drawn as a small, nearly vertical stroke with a pencil or quill). This will leave the renderers choosing how to render the two Armenian punctuation marks (either traditional or modern) and will preserve the semantics of the text without conflicting with other rendering options (for the leaders in TOCs or tabular data, which may eventually use U+2024 with some rare fonts specific to the rendering engine and its own typographical engine, if it ever needs a font for its needed glyphs; but even in that case these internal fonts will not need to be Unicode-encoded, they will just be a collection of glyphs for the intended rendering effect and styles it wants to support). For now the immediate real need is for fully translating interfaces in applications and allowing them to support either a "modern" style (English/ASCII punctuation) or a "traditional" style. No fallback characters should be encoded in these texts, so that no confusion will arise if one ever uses both the real Armenian full stop (two dots) and a fallback for the distinctive missing mijaket (a single dot, to be distinguished also from leaders and from decimal separators in numbers or abbreviation dots).
The newly encoded mijaket may include a note suggesting the use of the MIDDLE DOT as a preferable fallback. 2017-12-05 21:35 GMT+01:00 Asmus Freytag via Unicode : > On 12/5/2017 11:28 AM, Philippe Verdy via Unicode wrote: > > U+2024 is not supported in any fonts I have loaded. A websearch of mijaket > gives nothing. > U+2024 is used as a "leader dot", and does not match the expected metrics > (it is certainly not a mijaket, it should be more like U+0589, i.e. as a > bold parallelogram, and not a thin leader dot). > > Leader dots are NOT used as real punctuation, they are presentational, for > example in TOC (table of contents), where they are aligned in arbitrarily > long rows. > > The note in http://www.unicode.org/charts/PDF/U2000.pdf is absolutely not > normative and in fact it is wrong in my opinion. > > The mijaket (Armenian colon) should be encoded (preferably at U+0588 in > the Armenian block) as it also has to be distinguished from leader dots in > Armenian TOC, exactly like the vertsaket was distinguished at U+0589. > > > Well, unless someone (you?) writes a proposal to that effect.... > > (I don't know the history of this particular "unification" but on the face > of it would share your concern that unifying something with a very specific > functionality and metrics, leader dots, with ordinary script-specific > punctuation is not helpful - unless it can be shown that this unification > is widely supported in practice. However, if your claim that 2024 is > unsupported is correct, that would strengthen the case for reconsidering > this; however the case would have to be made in a formal proposal first). > > A./ > > > > 2017-12-05 19:59 GMT+01:00 S.
Gilles : > >> On 2017-12-05T18:44:05+0100, Philippe Verdy via Unicode wrote: >> > The Armenian script has its own distinctive punctuation (vertsaket) for >> the >> > standard full stop at end of sentence (whose glyph looks very much like >> the >> > Basic Latin/ASCII colon, however slighly more bold and slanted and whose >> > dots are rectangular). It is encoded at U+0589. And used in traditional >> > texts instead of the "modern" full stop. >> > >> > But Armenian also has its own distinctive puctuation (mijaket) for the >> > introductory colon between two phrases of the same sentence (whose glyph >> > looks very much like the Basic Latin/ASCII full stop). It is not encoded >> > and I don't like using the ASCII full stop where it causes confusion. >> > >> > Where is the Armenian distinctive mijaket? Shouldn't it be encoded at >> > U+0588? >> >> Off-list because I generally don't know what I'm talking about, but >> grepping NamesList.txt for ?mijaket? gives U+2024. If this isn't >> what you're looking for, my apologies. >> >> -- >> S. Gilles >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 15:32:56 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 5 Dec 2017 13:32:56 -0800 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> References: <20171205185925.w52f4rijsld7m6cy@number19> <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> Message-ID: <0b5fcb11-b7de-c497-adac-3494656c8fde@att.net> Asmus, On 12/5/2017 12:35 PM, Asmus Freytag via Unicode wrote: > I don't know the history of this particular "unification" Here are some clues to guide further research on the history. The annotation in question was added to a draft of the NamesList.txt file for Unicode 4.1 on October 7, 2003. The annotation was not yet in the Unicode 4.0 charts, published in April, 2003. That should narrow down the search for everybody. 
I can't find specific mention of this in the UTC minutes from the relevant 2003 window. But I strongly suspect that the catalyst for the change was the discussion that took place regarding PRI #12 re terminal punctuation: http://www.unicode.org/review/pr-12.html That document, at least, does mention "Armenian" and U+2024, although not in the same breath. That PRI was discussed and closed at UTC #96, on August 25, 2003: http://www.unicode.org/L2/L2003/03240.htm I don't find any particular mention of U+2024 in my own notes from that meeting, so I suspect the proximal cause for the change to the annotation for U+2024 on October 7 will have to be dug out of an email archive at some point. --Ken From unicode at unicode.org Tue Dec 5 16:26:51 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 5 Dec 2017 14:26:51 -0800 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <0b5fcb11-b7de-c497-adac-3494656c8fde@att.net> References: <20171205185925.w52f4rijsld7m6cy@number19> <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> <0b5fcb11-b7de-c497-adac-3494656c8fde@att.net> Message-ID: <7bfa3bf5-7c92-7a99-3902-3a7ed9accb59@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Dec 8 12:13:11 2017 From: unicode at unicode.org (Dreiheller, Albrecht via Unicode) Date: Fri, 8 Dec 2017 18:13:11 +0000 Subject: Typo in FAQ-Indic ? Message-ID: <3E10480FE4510343914E4312AB46E74212D180E2@DEFTHW99EH5MSX.ww902.siemens.net> Is this a typo? >> Q: Is the keyboard arrangement in a Unicode system different form that of the regular "TTF" fonts? Maybe it should read "... different FROM that ..." Regards, Albrecht -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Dec 8 16:06:19 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 8 Dec 2017 22:06:19 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues Message-ID: <20171208220619.3eb2fcbe@JRWUBU2> Apart from the likely but unmandated consequence of making editing Indic text more difficult (possibly contrary to the UK's Equality Act 2010), there is another difficulty that will follow directly from the currently proposed expansion of grapheme clusters (https://www.unicode.org/reports/tr29/proposed.html). Unless I am missing something, text boundaries have hitherto been cunningly crafted so that they are not changed by normalisation. Have I missed something, or has there been a change in policy? For extended grapheme clusters, the relevant rules are proposed as: GB9: ? (Extend | ZWJ | Virama) GB9c: (Virama | ZWJ ) ? LinkingConsonant Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9). This would lead canonically equivalent text to have strikingly different divisions: (no break) but There are other variations on this theme. In Tai Tham, we have the following conflict: natural order, no break: but normalised, there would be a break: >From reading the text, it seems that it is expected that the presence or absence of a break should be fine-tuned by CLDR language-specific rules. How is this expected to work, e.g. for Saurashtra in Tamil script? (There's no Saurashtra data in Version 32 of CLDR.) Would the root locale now specify the default segmentation rule, rather than UAX#29 plus the Unicode Character Database? Richard. 
From unicode at unicode.org Sat Dec 9 08:28:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Dec 2017 14:28:31 +0000 Subject: Aquaφοβία Message-ID: <20171209142831.772d1f49@JRWUBU2> Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 implies that it might be considered desirable to have a word boundary in 'aquaφοβία' or a grapheme cluster break in a coding such as <006C, U+0901 DEVANAGARI SIGN CANDRABINDU> for el candrabindu (l̐), which should be <006C, U+0310 COMBINING CANDRABINDU> in accordance with the principle of script separation. Why are such breaks desirable? I can understand an argument that these should be tolerated, as an application could have been designed on the basis that script boundaries imply word boundaries (not true for Japanese) and that word boundaries imply grapheme cluster boundaries (not true for Sanskrit, where they don't even imply character boundaries.) There are some who claim that the Laotian consonant placeholder is the letter 'x' rather than the multiplication sign, U+00D7, which does have Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is suggesting that there should be a grapheme cluster boundary between U+00D7 with script=common and a non-spacing Lao vowel any more than there would be with a Lao consonant.) Richard. From unicode at unicode.org Sat Dec 9 09:08:22 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 9 Dec 2017 16:08:22 +0100 Subject: Re: Aquaφοβία In-Reply-To: <20171209142831.772d1f49@JRWUBU2> References: <20171209142831.772d1f49@JRWUBU2> Message-ID: 2017-12-09 15:28 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 > implies that it might be considered desirable to have a word boundary > in 'aquaφοβία'
or a grapheme cluster break in a coding such as <006C, > U+0901 DEVANAGARI SIGN CANDRABINDU> for el candrabindu (l̐), which > should be <006C, U+0310 COMBINING CANDRABINDU> in accordance with the > principle of script separation. Why are such breaks desirable? > I don't understand why one would encode a DEVANAGARI SIGN in the middle of a Greek word to mean it implies a word boundary in Greek !?! > There are some who > claim that the Laotian consonant place holder is the letter 'x' rather > than the multiplication sign, U+00D7, which does have > Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is > suggesting that there should be grapheme cluster boundary between > U+00D7 with script=common and a non-spacing Lao vowel any more than > there would be with a Lao consonant.) > Here again the multiplication sign has nothing to do with an Indic consonant. Maybe it has been used like this in some texts, but this looks more like a tweak. If one needs a consonant holder, propose encoding an "empty" letter (as in Hangul or in Arabic), possibly with variant forms (e.g. changing between a circle, dotted circle, cross, or horizontal joiner on the hanging baseline for Devanagari and similar scripts). The usual base letter placeholder for combining diacritics is a whitespace (preferably NBSP, not SPACE) or the dotted circle symbol, but not a mathematical symbol, which is also used within math formulas with variable names using common letters or even words. The multiplication sign used in the UTS standard was chosen because it normally does not occur within words, and only for defining the breaking rules (to indicate that NO break is allowed here, i.e.
the opposite of what you describe): it is notational only and is clearly not meant to combine with what follows: if you encode the multiplication sign then an Indic diacritic, we expect to see the separate multiplication sign (with break opportunities on both sides) then a dotted circle glyph used for defective grapheme clusters to hold the diacritic. So for me Indic_syllabic_category=Consonant_Placeholder is wrong: for such use of the cross, an Indic (or generic) consonant placeholder should better be encoded and used, and that property may be added to it and removed from the multiplication sign. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Dec 9 09:16:44 2017 From: unicode at unicode.org (Mark Davis ☕️ via Unicode) Date: Sat, 9 Dec 2017 16:16:44 +0100 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171208220619.3eb2fcbe@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: 1. You make a good point about the GB9c. It should probably instead be something like: GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant Extend is broader than necessary, and there are a few items that have ccc!=0 but not gcb=extend. But all of those look to be degenerate cases. https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory Mark On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Apart from the likely but unmandated consequence of making editing > Indic text more difficult (possibly contrary to the UK's Equality Act > 2010), there is another difficulty that will follow directly from the > currently proposed expansion of grapheme clusters > (https://www.unicode.org/reports/tr29/proposed.html).
> > Unless I am missing something, text boundaries have hitherto been > cunningly crafted so that they are not changed by normalisation. > Have I missed something, or has there been a change in policy? > > For extended grapheme clusters, the relevant rules are proposed as: > > GB9: × (Extend | ZWJ | Virama) > > GB9c: (Virama | ZWJ ) × LinkingConsonant > > Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9). > This would lead canonically equivalent text to have strikingly > different divisions: > > (no break) > > but > > > > There are other variations on this theme. In Tai Tham, we have the > following conflict: > > natural order, no break: > > > > but normalised, there would be a break: > > > > From reading the text, it seems that it is expected that the presence > or absence of a break should be fine-tuned by CLDR language-specific > rules. How is this expected to work, e.g. for Saurashtra in Tamil > script? (There's no Saurashtra data in Version 32 of CLDR.) Would the > root locale now specify the default segmentation rule, rather than > UAX#29 plus the Unicode Character Database? > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Dec 9 09:31:06 2017 From: unicode at unicode.org (Mark Davis ☕️ via Unicode) Date: Sat, 9 Dec 2017 16:31:06 +0100 Subject: Re: Aquaφοβία In-Reply-To: <20171209142831.772d1f49@JRWUBU2> References: <20171209142831.772d1f49@JRWUBU2> Message-ID: Some people have been confused by the previous wording, and thought that it wouldn't be legitimate to break on script boundaries. So we wanted to make it clear that that was possible, since: 1. Many implementations of rendering break text into script-runs before further processing, and 2.
There are certainly cases where users' expectations are better met with breaks on script boundaries* We thus wanted to make it clear to people that it *is* a legitimate customization to break on script boundaries. * Clearly such an approach can't be hard-nosed: an implementation would need at the very least to handle Common and Inherited specially: not impose a boundary *because of script* where the SCX value is one of those, either before or after a break point. Any suggestions for clarifying language are appreciated. Mark On Sat, Dec 9, 2017 at 3:28 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 > implies that it might be considered desirable to have a word boundary > in 'aquaφοβία' or a grapheme cluster break in a coding such as <006C, > U+0901 DEVANAGARI SIGN CANDRABINDU> for el candrabindu (l?), which > should be <006C, U+0310 COMBINING CANDRABINDU> in accordance with the > principle of script separation. Why are such breaks desirable? > > I can understand an argument that these should be tolerated, as an > application could have been designed on the basis that script > boundaries imply word boundaries (not true for Japanese) and that word > boundaries imply grapheme cluster boundaries (not true for Sanskrit, > where they don't even imply character boundaries.) There are some who > claim that the Laotian consonant place holder is the letter 'x' rather > than the multiplication sign, U+00D7, which does have > Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is > suggesting that there should be a grapheme cluster boundary between > U+00D7 with script=common and a non-spacing Lao vowel any more than > there would be with a Lao consonant.) > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Dec 9 10:22:47 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Dec 2017 16:22:47 +0000 Subject: =?UTF-8?B?QXF1Yc+Gzr/Oss6vzrE=?= In-Reply-To: References: <20171209142831.772d1f49@JRWUBU2> Message-ID: <20171209162247.58c60e3c@JRWUBU2> On Sat, 9 Dec 2017 16:08:22 +0100 Philippe Verdy wrote: > 2017-12-09 15:28 GMT+01:00 Richard Wordingham via Unicode < > unicode at unicode.org>: > > > Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 > > implies that it might be considered desirable to have a word > > boundary in 'aquaφοβία' or a grapheme cluster break in a coding > > such as <006C, U+0901 DEVANAGARI SIGN CANDRABINDU> for el > > candrabindu (l?), which should be <006C, U+0310 COMBINING > > CANDRABINDU> in accordance with the principle of script > > separation. Why are such breaks desirable? > > > > I don't understand why one would encode a DEVANAGARI SIGN in the > middle of a Greek word to mean it implies a word boundary in Greek!?! The two examples given are "aquaφοβία" and "A?". The first switches from Latin to Greek and the second is a Latin letter with a Devanagari mark. However, there is a pre-Unicode tradition of using el with candrabindu when writing Sanskrit in the Roman alphabet, which is why there is U+0310. > > There are some who > > claim that the Laotian consonant place holder is the letter 'x' > > rather than the multiplication sign, U+00D7, which does have > > Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is > > suggesting that there should be a grapheme cluster boundary between > > U+00D7 with script=common and a non-spacing Lao vowel any more than > > there would be with a Lao consonant.) > > > > Here again the multiplication sign has nothing to do with an Indic > consonant. Maybe it has been used like this in some texts but this > looks more like a tweak. 
Whatever its origin, it seems well established in Laos, and I've seen it used for the Tai Tham script as well as for the Lao script. Try searching for images of Lao vowels in French. Googling in English found plenty of examples, and the teaching book shown at http://www.bigbrothermouse.com/books/antknife16size-book.html supports the case nicely. I've also seen it used for Khmer, but not to the extent that I can argue that it is well-established in Cambodia. The Khmer example was produced using a typewriter and apparently a felt-tipped pen, so unsurprisingly the vowel bearer was clearly a typewritten letter 'x'. > If one needs a consonant holder, propose to > encode an "empty" letter (like in Hangul or in Arabic), possibly with > variant forms (e.g. changing between a circle, dotted circle, cross, > or horizontal joiner on the hanging baseline for Devanagari and > similar scripts). Propose a disunification if you like. The competing tradition is to use LAO LETTER KO, and a Lao-English dictionary from Thailand uses a grey LAO LETTER O, following the Thai tradition of using the Thai letter for /?/, which serves as the 'empty' letter for Pali and Sanskrit. Remember that a proposal for an invisible letter for Indic was rejected. > The usual base letter placeholder for combining diacritics is usually > a whitespace (preferably NBSP, not SPACE) or the dotted circle > symbol, but not a mathematical symbol which is used also within math > formulas with variable names using common letters or even words. > The multiplication sign used in the UTS standard was chosen because it > normally does not occur within words,... and has nothing to do with the usage of U+00D7 as a consonant placeholder. Richard. 
From unicode at unicode.org Sat Dec 9 14:30:17 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Dec 2017 20:30:17 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: <20171209203017.77dbcbf9@JRWUBU2> On Sat, 9 Dec 2017 16:16:44 +0100 Mark Davis ?? via Unicode wrote: > 1. You make a good point about the GB9c. It should probably instead be > something like: > > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant > > > Extend is broader than necessary, and there are a few items that > have ccc!=0 but not gcb=extend. But all of those look to be > degenerate cases. Something *like*. Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA. I believe these both prevent a preceding candrakkala from extending an akshara - see TUS Section 12.9 about Table 12-33. I think Extend will have to be split between starters and non-starters. I believe there is a problem with the first two examples in Table 12-33. If one suffixed <U+0D3E MALAYALAM VOWEL SIGN AA> to the first two examples, yielding *??????? and *????????, one would have three Malayalam aksharas, not two extended grapheme clusters as the proposed rules would say. This is different to Tai Tham, where there would indeed just be two aksharas in each word, albeit odd-looking - ??????? and ????????. Who's checking the impact of these changes on Malayalam? Richard. From unicode at unicode.org Sat Dec 9 14:56:19 2017 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Sat, 9 Dec 2017 20:56:19 +0000 Subject: =?utf-8?B?UkU6IEFxdWHPhs6/zrLOr86x?= In-Reply-To: <20171209162247.58c60e3c@JRWUBU2> References: <20171209142831.772d1f49@JRWUBU2> <20171209162247.58c60e3c@JRWUBU2> Message-ID: There exist several Judeo-Arabic texts, Arabic written in Hebrew script with Arabic vowels and other marks. One well-known example is The Guide to the Perplexed. 
See a modern transcript at https://he.wikisource.org/wiki/%D7%9E%D7%95%D7%A8%D7%94_%D7%A0%D7%91%D7%95%D7%9B%D7%99%D7%9D_(%D7%9E%D7%A7%D7%95%D7%A8)/%D7%9E%D7%91%D7%95%D7%90. A manuscript: http://web.nli.org.il/sites/NLI/Hebrew/digitallibrary/pages/viewer.aspx?presentorid=MANUSCRIPTS&docid=PNX_MANUSCRIPTS000043324-1#|FL36876376 Best Regards, Jonathan Rosenne From unicode at unicode.org Sun Dec 10 23:14:18 2017 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Sun, 10 Dec 2017 21:14:18 -0800 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant You can also explicitly request ligatureification with a ZWJ, so perhaps this rule should be something like (Virama ZWJ? | ZWJ) x Extend* LinkingConsonant -Manish On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ?? via Unicode < unicode at unicode.org> wrote: > 1. You make a good point about the GB9c. It should probably instead be > something like: > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > > Extend is a broader than necessary, and there are a few items that have > ccc!=0 but not gcb=extend. But all of those look to be degenerate cases. > > https://unicode.org/cldr/utility/list-unicodeset.jsp?a= > [\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory > > > > > Mark > > On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > >> Apart from the likely but unmandated consequence of making editing >> Indic text more difficult (possibly contrary to the UK's Equality Act >> 2010), there is another difficulty that will follow directly from the >> currently proposed expansion of grapheme clusters >> (https://www.unicode.org/reports/tr29/proposed.html). >> >> Unless I am missing something, text boundaries have hitherto been >> cunningly crafted so that they are not changed by normalisation. 
>> Have I missed something, or has there been a change in policy? >> >> For extended grapheme clusters, the relevant rules are proposed as: >> >> GB9: ? (Extend | ZWJ | Virama) >> >> GB9c: (Virama | ZWJ ) ? LinkingConsonant >> >> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9). >> This would lead canonically equivalent text to have strikingly >> different divisions: >> >> (no break) >> >> but >> >> >> >> There are other variations on this theme. In Tai Tham, we have the >> following conflict: >> >> natural order, no break: >> >> >> >> but normalised, there would be a break: >> >> >> >> From reading the text, it seems that it is expected that the presence >> or absence of a break should be fine-tuned by CLDR language-specific >> rules. How is this expected to work, e.g. for Saurashtra in Tamil >> script? (There's no Saurashtra data in Version 32 of CLDR.) Would the >> root locale now specify the default segmentation rule, rather than >> UAX#29 plus the Unicode Character Database? >> >> Richard. >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 01:59:20 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 11 Dec 2017 08:59:20 +0100 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171209203017.77dbcbf9@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: The proposed rules do not distinguish the different visual forms that a sequence of characters surrounding a virama can have, such as 1. an explicit virama, or 2. a half-form is visible, or 3. a ligature is created. That is following the requested structure in http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf. So with these rules a ZWNJ (see Figure 12-3. 
Preventing Conjunct Forms in Devanagari ) doesn't break a GC, nor do instances where a particular script always shows an explicit virama between two particular consonants. All the lines on Figure 12-7. Consonant Forms in Devanagari and Oriya having a virama would have single GCs (that is, all but the first line). [That, after correcting the rules as per Manish Goregaokar's feedback, thanks!] The examples in "Annexure B" of 17200-text-seg-rec.pdf clearly include #2 and #3, but don't have any examples of #1 (as far as I can tell from a quick scan). It would be very useful to have explicit examples that included #1, and included scripts other than Devanagari (+swaran, others). While the online tool at http://unicode.org/cldr/utility/breaks.jsp can't yet be used until the Unicode 11 UCD is further along, I have an implementation of the new rules such that I can take any particular list of words and generate the breaks. So if someone can supply examples from different scripts or with different combinations of virama, zwj, zwnj, etc..... I can push out the result to this list. And yes, we do need review of these for Malayalam (+cibu, others). If there are scripts for which the rules really don't work (or need more research before #29 is finalized in May), it is fairly straightforward to restrict the rule changes by modifying http://www.unicode.org/reports/tr29/proposed.html#Virama to either exclude particular scripts or include only particular scripts. Mark On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Sat, 9 Dec 2017 16:16:44 +0100 > Mark Davis ?? via Unicode wrote: > > > 1. You make a good point about the GB9c. It should probably instead be > > something like: > > > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > > > > > Extend is a broader than necessary, and there are a few items that > > have ccc!=0 but not gcb=extend. But all of those look to be > > degenerate cases. > > Something *like*. 
> > Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA. I believe > these both prevent a preceding candrakkala from extending an akshara - > see TUS Section 12.9 about Table 12-33. I think Extend will have to be > split between starters and non-starters. > > I believe there is a problem with the first two examples in Table > 12-33. If one suffixed <U+0D3E MALAYALAM VOWEL SIGN AA> to the first two examples, yielding *??????? and > *????????, one would have three Malayalam aksharas, not two extended > grapheme clusters as the proposed rules would say. This is different to > Tai Tham, where there would indeed just be two aksharas in each word, > albeit odd-looking - ??????? and ????????. Who's checking the impact of > these changes on Malayalam? > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 04:16:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 11 Dec 2017 10:16:31 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: <20171211101631.44155a27@JRWUBU2> On Sun, 10 Dec 2017 21:14:18 -0800 Manish Goregaokar via Unicode wrote: > > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant > > You can also explicitly request ligatureification with a ZWJ, so > perhaps this rule should be something like > > (Virama ZWJ? | ZWJ) × Extend* LinkingConsonant > > -Manish > > On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ?? via Unicode < > unicode at unicode.org> wrote: > > > 1. You make a good point about the GB9c. It should probably instead > > be something like: > > > > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant This change is unnecessary. If we start from Draft 1 where there are: GB9: × (Extend | ZWJ | Virama) GB9c: (Virama | ZWJ ) × 
LinkingConsonant If the classes used in the rules are to be disjoint, we then have to split Extend into something like ViramaExtend and OtherExtend to allow normalised (NFC/NFD) text, at which point we may as well continue to have rules that work without any normalisation. Informally, ViramaExtend = Extend and ccc ≠ 0. OtherExtend = Extend and ccc = 0. (We might need to put additional characters in ViramaExtend.) This gives us rules: GB9': × (OtherExtend | ViramaExtend | ZWJ | Virama) GB9c': (Virama | ZWJ ) ViramaExtend* × LinkingConsonant So, for a sequence , GB9' gives us virama × ZWJ × nukta LinkingConsonant and GB9c' gives us virama × ZWJ × nukta × LinkingConsonant --- In Rule GB9c, what examples justify including ZWJ? Are they just the C1 half-forms? My knowledge suggests that GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant might be more appropriate. Richard. From unicode at unicode.org Mon Dec 11 05:56:51 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 11 Dec 2017 11:56:51 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171211115651.58ee7ad9@JRWUBU2> On Mon, 11 Dec 2017 08:59:20 +0100 Mark Davis ?? via Unicode wrote: > The proposed rules do not distinguish the different visual forms that > a sequence of characters surrounding a virama can have, such as > > 1. an explicit virama, or > 2. a half-form is visible, or > 3. a ligature is created. Do you mean 'visible virama' by an 'explicit virama'? In the context of the Indic syllabic category of virama (which is what I think of as the Unicode virama), I would expect 'explicit virama' to refer to the sequence . (In several scripts, this is encoded as a separate character, and usually classified as a 'pure killer'.) > That is following the requested structure in > http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf. 
> > So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct > Forms in Devanagari > ) > doesn't break a GC, nor do instances where a particular script always > shows an explicit virama between two particular consonants. Actually, I don't see ZWJ or ZWNJ in this document. A literal reading of the document would see a syllable break after an explicit half-form! > All the > lines on Figure 12-7. Consonant Forms in Devanagari and Oriya > > having a virama would have single GCs (that is, all but the first > line). [That, after correcting the rules as per Manish Goregaokar's > feedback, thanks!] That looks like a change of intent. For NFD text in Indian Indic blocks plus control characters, in Version 11.0 Draft 1, ZWNJ does stop a gcb virama from including the next consonant in an extended grapheme cluster. > The examples in "Annexure B" of 17200-text-seg-rec.pdf > clearly > include #2 and #3, but don't have any examples of #1 (as far as I can > tell from a quick scan). It would be very useful to have explicit > examples that included #1, and included scripts other than Devanagari > (+swaran, others). There aren't any examples of explicitly encoded half-forms (C1 or C2) or explicitly encoded viramas, either. It would be good to have examples of visible viramas in conjunction with preposed vowels, such as U+093F DEVANAGARI VOWEL SIGN I. From Paul Nelson's remarks many years ago, I gather there are language-dependent variations in their placement when the halant appears. A bit of Sanskrit would be nice to see as well. Hindi and Sanskrit have different preferred shapes for several consonant clusters. Some Tamil script Sanskrit shlokas would be good, as well. Richard. 
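The normalisation hazard behind the Extend* amendment to GB9c - nukta (ccc=7) sorting before virama (ccc=9) under canonical reordering - can be demonstrated with a few lines of Python. This is a minimal illustrative sketch using only the standard library, with Devanagari KA as a convenient base consonant:

```python
import unicodedata

# Devanagari KA, SIGN VIRAMA (ccc=9) and SIGN NUKTA (ccc=7).
KA, VIRAMA, NUKTA = "\u0915", "\u094D", "\u093C"

assert unicodedata.combining(VIRAMA) == 9
assert unicodedata.combining(NUKTA) == 7

# Typed order: <KA, VIRAMA, NUKTA>.
typed = KA + VIRAMA + NUKTA

# Canonical reordering (applied by NFD, and by NFC before recomposition)
# sorts adjacent non-starters by combining class, so the nukta moves in
# front of the virama.
reordered = unicodedata.normalize("NFD", typed)
assert reordered == KA + NUKTA + VIRAMA
```

A GB9c that only inspects the character immediately before a LinkingConsonant therefore sees a virama in one canonically equivalent spelling and a nukta in the other; letting the rule skip intervening marks (Virama Extend* × LinkingConsonant) makes the two spellings segment alike.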
From unicode at unicode.org Mon Dec 11 10:07:05 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 11 Dec 2017 16:07:05 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171211160705.78828972@JRWUBU2> On Mon, 11 Dec 2017 08:59:20 +0100 Mark Davis ?? via Unicode wrote: > The proposed rules do not distinguish the different visual forms that > a sequence of characters surrounding a virama can have, such as > > 1. an explicit virama, or > 2. a half-form is visible, or > 3. a ligature is created. > > That is following the requested structure in > http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf. > > So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct > Forms in Devanagari > ) > doesn't break a GC, nor do instances where a particular script always > shows an explicit virama between two particular consonants. All the > lines on Figure 12-7. Consonant Forms in Devanagari and Oriya > > having a virama would have single GCs (that is, all but the first > line). [That, after correcting the rules as per Manish Goregaokar's > feedback, thanks!] > > The examples in "Annexure B" of 17200-text-seg-rec.pdf > clearly > include #2 and #3, but don't have any examples of #1 (as far as I can > tell from a quick scan). It would be very useful to have explicit > examples that included #1, and included scripts other than Devanagari > (+swaran, others). While > the online tool at http://unicode.org/cldr/utility/breaks.jsp can't > yet be used until the Unicode 11 UCD is further along, I have an > implementation of the new rules such that I can take any particular > list of words and generate the breaks. So if someone can supply > examples from different scripts or with different combinations of > virama, zwj, zwnj, etc..... I can push out the result to this list. 
Tai Tham oddities, which could cause issues with advanced typography: ??????? (Currently C-VN-C-VH-C, becoming C-VN-C-VHC) ????? (Currently C-CH-C-V, becoming C-CHC-V) More obvious versions of the above, with consonants other than U+1A36 TAI THAM LETTER NA: ???? (Currently C-VH-C, becoming C-VHC) ????? (Currently CMS-C-V, becoming CMSC-V) A clear case for tailoring is Pali ????? (CM-CVV, but in Laos and in much Northern Thai usage, U+1A58 TAI THAM SIGN MAI KANG LAI merits gcb=prepend. Northeastern Thailand has the same style as Laos, so pi_TH would be far too vague as a locale.) Compare with Myanmar script ??????? (currently C-CHH-CVV, becoming C-CHHCVV), with a pure killer followed by an invisible stacker. ??????? (currently CVV-CMH-C, becoming CVV-CHHC) will be a case of adjacent pure killer and invisible stacker that commute (to use the terminology of traces). The more typical commutation problem from Tai Tham is exemplified by ????? (currently CVTH-C, becoming CVTHC), where the tone mark and invisible stacker commute. I'd like to add the example of Northern Thai Tai Tham ????????? /n?ai/ 'to ache all over'. At present that akshara is split into three grapheme clusters, composed of 2, 6 and 1 characters. (Thai teaching splits it into four logically contiguous groups of 3, 3, 1 and 2 characters for onset, vowel, tone and final consonant. I find ??? in native abecedaries, and the other three all have names, namely mai kuea, mai yak and hang ya.) When the change goes through, this will be just one extended grapheme cluster of nine characters. Moving back to India, I suggest the Tamil example from https://github.com/w3c/ilreq/issues/31#issuecomment-349589752, namely ??????????? (yāvaṟṟaiyum), which currently has an extended grapheme cluster for each consonant. At a minimum, we need the Malayalam examples from the TUS. Finally, I would recommend the Nepali example from L2/11-370, ???????????, that I brought to the UTC's attention in L2/17-122. 
I hope someone else can deal with the other Devanagari issues. (Yep, even Devanagari needs more research!) > > And yes, we do need review of these for Malayalam (+cibu, others). > > If there are scripts for which the rules really don't work (or need > more research before #29 is finalized in May), it is fairly > straightforward to restrict the rule changes by modifying > http://www.unicode.org/reports/tr29/proposed.html#Virama to either > exclude particular scripts or include only particular scripts. > > Mark > > On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > > On Sat, 9 Dec 2017 16:16:44 +0100 > > Mark Davis ?? via Unicode wrote: > > > > > 1. You make a good point about the GB9c. It should probably > > > instead be something like: > > > > > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > > > > > > > > Extend is a broader than necessary, and there are a few items that > > > have ccc!=0 but not gcb=extend. But all of those look to be > > > degenerate cases. > > > > Something *like*. > > > > Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA. I > > believe these both prevent a preceding candrakkala from extending > > an akshara - see TUS Section 12.9 about Table 12-33. I think > > Extend will have to be split between starters and non-starters. > > > > I believe there is a problem with the first two examples in Table > > 12-33. If one suffixed > MALAYALAM VOWEL SIGN AA> to the first two examples, yielding > > *??????? and *????????, one would have three Malayalam aksharas, > > not two extended grapheme clusters as the proposed rules would say. > > This is different to Tai Tham, where there would indeed just be two > > aksharas in each word, albit odd-looking - ??????? and ????????. > > Who's checking the impact of these changes on Malayalam? > > > > Richard. 
> > > > From unicode at unicode.org Mon Dec 11 13:07:18 2017 From: unicode at unicode.org (Roozbeh Pournader via Unicode) Date: Mon, 11 Dec 2017 11:07:18 -0800 Subject: =?UTF-8?B?UmU6IEFxdWHPhs6/zrLOr86x?= In-Reply-To: References: <20171209142831.772d1f49@JRWUBU2> <20171209162247.58c60e3c@JRWUBU2> Message-ID: Jonathan, I've been trying to gather a list of the Arabic marks that actually happen in Hebrew for a while now, but don't have sources. I want to add them to ScriptExtensions data in Unicode. Do you know of a source that lists them? On Sat, Dec 9, 2017 at 12:56 PM, Jonathan Rosenne via Unicode < unicode at unicode.org> wrote: > There exist several Judeo-Arabic texts, Arabic written in Hebrew script > with Arabic vowels and other marks. One well known is The Guide to the > Perplexed. > > See a modern transcript at https://he.wikisource.org/ > wiki/%D7%9E%D7%95%D7%A8%D7%94_%D7%A0%D7%91%D7%95%D7%9B%D7% > 99%D7%9D_(%D7%9E%D7%A7%D7%95%D7%A8)/%D7%9E%D7%91%D7%95%D7%90. > > A manuscript: http://web.nli.org.il/sites/NLI/Hebrew/digitallibrary/ > pages/viewer.aspx?presentorid=MANUSCRIPTS&docid=PNX_ > MANUSCRIPTS000043324-1#|FL36876376 > > Best Regards, > > Jonathan Rosenne > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 16:13:48 2017 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Mon, 11 Dec 2017 22:13:48 +0000 Subject: =?utf-8?B?UkU6IEFxdWHPhs6/zrLOr86x?= In-Reply-To: References: <20171209142831.772d1f49@JRWUBU2> <20171209162247.58c60e3c@JRWUBU2> Message-ID: Roozbeh, You could look at the second link, but I am not at all sure they are new characters. One can easily see the three vowels and the shadda, and the i'jam dot which in Arabic is considered to be a part of the letter. I think that browsing the NLI will get you further manuscripts. 
And of course there is https://he.wikipedia.org/wiki/%D7%A2%D7%A8%D7%91%D7%99%D7%AA_%D7%99%D7%94%D7%95%D7%93%D7%99%D7%AA Best Regards, Jonathan Rosenne From: roozbeh at google.com [mailto:roozbeh at google.com] On Behalf Of Roozbeh Pournader Sent: Monday, December 11, 2017 9:07 PM To: Jonathan Rosenne Cc: unicode at unicode.org Subject: Re: Aqua????? Jonathan, I've been trying to gather a list of the Arabic marks that actually happen in Hebrew for a while now, but don't have sources. I want to add them to ScriptExtensions data in Unicode. Do you know of a source that lists them? On Sat, Dec 9, 2017 at 12:56 PM, Jonathan Rosenne via Unicode > wrote: There exist several Judeo-Arabic texts, Arabic written in Hebrew script with Arabic vowels and other marks. One well known is The Guide to the Perplexed. See a modern transcript at https://he.wikisource.org/wiki/%D7%9E%D7%95%D7%A8%D7%94_%D7%A0%D7%91%D7%95%D7%9B%D7%99%D7%9D_(%D7%9E%D7%A7%D7%95%D7%A8)/%D7%9E%D7%91%D7%95%D7%90. A manuscript: http://web.nli.org.il/sites/NLI/Hebrew/digitallibrary/pages/viewer.aspx?presentorid=MANUSCRIPTS&docid=PNX_MANUSCRIPTS000043324-1#|FL36876376 Best Regards, Jonathan Rosenne -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 19:25:04 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 12 Dec 2017 01:25:04 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171212012504.4fbe9d10@JRWUBU2> On Mon, 11 Dec 2017 21:45:23 +0000 Cibu Johny (????) wrote: > I am assuming the purpose of the grapheme cluster definition is to be > used line spacing, vertical writing or cursor movement. Without > defining the purpose, it is hard for me to say if a ruleset is valid > or not. That is a very fair point. 
Take the example of Thai, an Indic script which isn't affected by the proposal. There, the spacing vowel signs, whether before or after, may undergo greater separation when text is stretched to fill a space. I've seen great separation on hoardings. The spacing vowel signs are given gc=Lo. Vertical writing examples are fairly rare, but I've seen 'Yamaha' written vertically in three horizontal stretches - ?? ?? ??. Also, 'video' may be written vertically in three horizontal stretches, as V D O or as ?? ?? ??. I'm not absolutely sure I've seen the latter in Thai script, but Glenn Slayden reports it at http://www.thai-language.com/phpbb/viewtopic.php?f=11&t=2568&start=0. The striking thing is that four of these syllables have spacing vowels, which would be written on their own in writing stretched horizontally, but associate with the consonant in vertical writing. I haven't checked on the software-free behaviour of U+0E33 THAI CHARACTER SARA AM, which is historically a combination of a mark above and a mark to the right. The Royal Institute Dictionary of 1999 resolves it into NIKHAHIT and SARA AA for what is a very slight horizontal spacing (e.g. the entry for ??????), but I have seen the NIKHAHIT component still attached to the SARA AA component. However, I don't know how much control the RID had over the typesetting of the dictionary. I think making the proposed change and still saying that cursor motion should follow the extended grapheme cluster boundaries is contrary to the Equality Act 2010. It would be knowingly making text editing harder for the users of most Indic scripts. Those writing a Tai language in the Tai Tham script would be hit hardest, even if one mapped compound vowels to simple key stroke sequences. > Assuming that purpose driven definition, we probably need > language specific definitions - a pan-indic algorithm may not work. There is the intermediate level of script-specific definitions. 
We already have them - following spacing marks are generally excluded from the grapheme clusters in the Burmic scripts. > For instance, the proposed ruleset, may not hold good for Tamil. For > example, see the title in the following image: ??????? broken as > [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed > algorithm it would be: [ta-u, ka-virama-lla, ka-virama] > > [image: image.png] > http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg Thank you for the example. I think the rule for the Tamil script should be that pulli attaches a following consonant to its grapheme cluster only in the case of the sequences ??? and ????, but as I typed the latter, I was surprised to see the sequence ??? adopt a conjunct shape, so I don't know whether I'm seeing variation or a font error. > Malayalam could be a similar story. In case of Malayalam, it can be > font specific because of the existence of traditional and reformed > writing styles. A conjunct might be a ligature in traditional; and it > might get displayed with explicit virama in the reformed style. For > example see the poster with word ??????? broken as [u, sa-virama, > ta-aa, da-virama] - as it is written in the reformed style. As per > the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama]. > These breaks would be used by the traditional style of writing. It seems that the caveats of UAX#29 have been forgotten - "So tailorings for aksaras may need to be script-, language-, font-, or context-specific to be useful". The big problem is that virama leaves too much up to the font. Richard. 
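The effect of the draft rules on such sequences can be made concrete with a toy segmenter. The sketch below hard-codes a classification for a handful of Devanagari characters and implements only GB9 and the amended GB9c (with break-everywhere-else as the fallback); it is emphatically not a conformant UAX #29 implementation, just an illustration of how the virama rule links clusters:

```python
# Toy illustration of the draft grapheme-cluster rules under discussion.
# Property assignments are hard-coded for a few Devanagari characters;
# a real implementation needs the full UCD data and the remaining rules.
VIRAMA = {"\u094D"}                # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"
EXTEND = {"\u093C"}                # NUKTA stands in for the Extend class
CONSONANT = {"\u0915", "\u0932"}   # KA, LA stand in for LinkingConsonant

def no_break_before(s, i):
    """True if the draft rules forbid a break before s[i]."""
    # GB9: do not break before Extend, ZWJ or Virama.
    if s[i] in EXTEND or s[i] in VIRAMA or s[i] == ZWJ:
        return True
    # Amended GB9c: (Virama | ZWJ) x Extend* LinkingConsonant.
    if s[i] in CONSONANT:
        j = i - 1
        while j >= 0 and s[j] in EXTEND:
            j -= 1
        return j >= 0 and (s[j] in VIRAMA or s[j] == ZWJ)
    return False  # otherwise, break

def clusters(s):
    if not s:
        return []
    out, start = [], 0
    for i in range(1, len(s)):
        if not no_break_before(s, i):
            out.append(s[start:i])
            start = i
    out.append(s[start:])
    return out

# <KA, VIRAMA, KA>: one cluster under the draft (two under Unicode 10).
assert clusters("\u0915\u094D\u0915") == ["\u0915\u094D\u0915"]
# <KA, KA>: no virama, so two clusters.
assert clusters("\u0915\u0915") == ["\u0915", "\u0915"]
# <KA, VIRAMA, NUKTA, KA>: Extend* lets the rule see through the nukta.
assert clusters("\u0915\u094D\u093C\u0915") == ["\u0915\u094D\u093C\u0915"]
```

Dropping the Extend* from the rule makes the third example split after the nukta, which is exactly the normalisation instability discussed upthread; swapping in other scripts' viramas and pure killers gives a quick way to probe the tailoring questions raised here.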
From unicode at unicode.org Wed Dec 13 12:36:35 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 13 Dec 2017 18:36:35 +0000 Subject: Atomicity of Grapheme Clusters Message-ID: <20171213183635.6faf88e6@JRWUBU2> I have been reviewing UAX#29 Unicode Text Segmentation because I have a feeling we will be trying to do too much with the concept of grapheme clusters, even with tailoring, when we extend it to include whole aksharas. What is the meaning of "Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries"? In particular, whom is it directed to? Now, once quadrate support is added and we are able to write Ancient Egyptian in Unicode, we will probably have two very significant languages that regularly breach parts of that rule. (At least, I assume a whole Egyptian quadrate would be included in a dropped capital.) Sanskrit word boundaries frequently occur within *legacy* grapheme clusters, and sentence boundaries may occur within quadrates. I presume UAX#29 does not intend that we should use means other than Unicode to write samhita Sanskrit and Ancient Egyptian. Richard. From unicode at unicode.org Thu Dec 14 02:09:57 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 14 Dec 2017 08:09:57 +0000 Subject: Word_Break for Hieroglyphs Message-ID: <20171214080957.419a5668@JRWUBU2> Is there any valid reason for Egyptian hieroglyphs to have Word_Break=ALetter rather than Complex_Context? So far as I am aware, hieroglyphs lack visible word breaks in both inscriptions and in modern transcriptions. Richard. 
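Word_Break is not exposed by Python's standard library, but the General_Category of the Egyptian Hieroglyphs block can be inspected with unicodedata, which at least confirms the characters are classed as ordinary letters (a quick sketch; the code point chosen is simply the first of the block):

```python
import unicodedata

ch = "\U00013000"  # first code point of the Egyptian Hieroglyphs block
print(unicodedata.name(ch))       # EGYPTIAN HIEROGLYPH A001
print(unicodedata.category(ch))   # Lo (Other_Letter)

# Word_Break itself lives in an auxiliary data file (WordBreakProperty.txt),
# not in unicodedata; querying it directly needs a third-party library
# (e.g. the 'regex' module or ICU bindings).
```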
From unicode at unicode.org Thu Dec 14 07:12:10 2017 From: unicode at unicode.org (Michael Everson via Unicode) Date: Thu, 14 Dec 2017 13:12:10 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: <20171214080957.419a5668@JRWUBU2> References: <20171214080957.419a5668@JRWUBU2> Message-ID: <6D368253-EA4C-42E3-A938-9FD3EC324E83@evertype.com> On 14 Dec 2017, at 08:09, Richard Wordingham via Unicode wrote: > > Is there any valid reason for Egyptian hieroglyphs to have > Word_Break=ALetter rather than Complex_Context? So far as I am aware, > hieroglyphs lack visible word breaks in both inscriptions and in modern > transcriptions. Why should visibility matter here? Michael Everson From unicode at unicode.org Thu Dec 14 08:14:31 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 14 Dec 2017 15:14:31 +0100 Subject: Word_Break for Hieroglyphs In-Reply-To: <20171214080957.419a5668@JRWUBU2> References: <20171214080957.419a5668@JRWUBU2> Message-ID: The Word_Break property doesn't have a value Complex_Context, but I think that was just a typo in your message. The word break and line break properties for 1,057 [:Script=Egyp:] characters are currently Word_Break=ALetter Line_Break=Alphabetic Off the top of my head, I think the best course would be to make them both the same as for most of [:Script=Hani:] Word_Break=Other Line_Break=Ideographic We would only need to use Complex_Context [:lb=SA:] for scripts that keep some letters together and break others apart (typically needing dictionary lookup). I would suspect for modern use of Egyp, that is not the case; most people would expect the characters to just flow like ideographs, breaking between any pair: you wouldn't need to disallow breaks between a and a , for example. Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic have a linebreak and general category properties that seem odd and inconsistent to me. 
Line_Break=Close_Punctuation General_Category=Other_Letter items: 8 Egyptian Hieroglyphs ? *O. Buildings, parts of buildings, etc. * items: 6 ?? U+1325B EGYPTIAN HIEROGLYPH O006D ?? U+1325C EGYPTIAN HIEROGLYPH O006E ?? U+1325D EGYPTIAN HIEROGLYPH O006F ?? U+13282 EGYPTIAN HIEROGLYPH O033A ?? U+13287 EGYPTIAN HIEROGLYPH O036B ?? U+13289 EGYPTIAN HIEROGLYPH O036D Egyptian Hieroglyphs ? *V. Rope, fiber, baskets, bags, etc. * items: 2 ?? U+1337A EGYPTIAN HIEROGLYPH V011B ?? U+1337B EGYPTIAN HIEROGLYPH V011C Line_Break=Open_Punctuation General_Category=Other_Letter items: 6 Egyptian Hieroglyphs ? *O. Buildings, parts of buildings, etc. * items: 5 ?? U+13258 EGYPTIAN HIEROGLYPH O006A ?? U+13259 EGYPTIAN HIEROGLYPH O006B ?? U+1325A EGYPTIAN HIEROGLYPH O006C ?? U+13286 EGYPTIAN HIEROGLYPH O036A ?? U+13288 EGYPTIAN HIEROGLYPH O036C Egyptian Hieroglyphs ? *V. Rope, fiber, baskets, bags, etc. * items: 1 ?? U+13379 EGYPTIAN HIEROGLYPH V011A Mark On Thu, Dec 14, 2017 at 9:09 AM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Is there any valid reason for Egyptian hieroglyphs to have > Word_Break=ALetter rather than Complex_Context? So far as I am aware, > hieroglyphs lack visible word breaks in both inscriptions and in modern > transcriptions. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Dec 14 08:22:54 2017 From: unicode at unicode.org (Michael Everson via Unicode) Date: Thu, 14 Dec 2017 14:22:54 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> Message-ID: <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> On 14 Dec 2017, at 14:14, Mark Davis ?? via Unicode wrote: > The Word_Break property doesn't have a value Complex_Context, but I think that was just a typo in your message. 
> > The word break and line break properties for 1,057 [:Script=Egyp:] characters are currently > > Word_Break=ALetter > Line_Break=Alphabetic > > Off the top of my head, I think the best course would be to make them both the same as for most of [:Script=Hani:] > > Word_Break=Other > Line_Break=Ideographic Egyptian is not ideographic and is certainly not fixed-width. CJK does not cluster. Why should you want to make them the same? Moreover, these properties were defined at the beginning, were they not? Bob Richmond and others will certainly have a view on this. > We would only need to use Complex_Context [:lb=SA:] for scripts that keep some letters together and break others apart (typically needing dictionary lookup). I would suspect for modern use of Egyp, that is not the case; Please do not "suspect". It is not hard to ask experts. > most people would expect the characters to just flow like ideographs, breaking between any pair: NO. Clusters cannot be broken up just anywhere. > you wouldn't need to disallow breaks between a and a , for example. > > Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic have a linebreak and general category properties that seem odd and inconsistent to me. > > Line_Break=Close_Punctuation > General_Category=Other_Letter items: 8 > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 6 > > ?? U+1325B EGYPTIAN HIEROGLYPH O006D > ?? U+1325C EGYPTIAN HIEROGLYPH O006E > ?? U+1325D EGYPTIAN HIEROGLYPH O006F > ?? U+13282 EGYPTIAN HIEROGLYPH O033A > ?? U+13287 EGYPTIAN HIEROGLYPH O036B > ?? U+13289 EGYPTIAN HIEROGLYPH O036D > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 2 > > ?? U+1337A EGYPTIAN HIEROGLYPH V011B > ?? U+1337B EGYPTIAN HIEROGLYPH V011C > Line_Break=Open_Punctuation > General_Category=Other_Letter items: 6 > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 5 > > ?? U+13258 EGYPTIAN HIEROGLYPH O006A > ?? U+13259 EGYPTIAN HIEROGLYPH O006B > ?? 
U+1325A EGYPTIAN HIEROGLYPH O006C > ?? U+13286 EGYPTIAN HIEROGLYPH O036A > ?? U+13288 EGYPTIAN HIEROGLYPH O036C > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 1 > > ?? U+13379 EGYPTIAN HIEROGLYPH V011A These properties were chosen explicitly when Egyptian was first defined. Those are enclosing punctuation characters. Michael Everson. From unicode at unicode.org Thu Dec 14 08:53:13 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 14 Dec 2017 15:53:13 +0100 Subject: Word_Break for Hieroglyphs In-Reply-To: <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> Message-ID: Mark On Thu, Dec 14, 2017 at 3:22 PM, Michael Everson wrote: > On 14 Dec 2017, at 14:14, Mark Davis ?? via Unicode > wrote: > > > The Word_Break property doesn't have a value Complex_Context, but I > think that was just a typo in your message. > > > > The word break and line break properties for 1,057 [:Script=Egyp:] > characters are currently > > > > Word_Break=ALetter > > Line_Break=Alphabetic > > > > Off the top of my head, I think the best course would be to make them > both the same as for most of [:Script=Hani:] > > > > Word_Break=Other > > Line_Break=Ideographic > > Egyptian is not ideographic and is certainly not fixed-width. CJK does not > cluster. Why should you want to make them the same? Fixed-width has *nothing* to do with these properties. The issue is whether spaces are required between words. The impact of *these* properties with their current values is that - you would never break a word within a string of hieroglyphs (eg double-click) and - you would only break within a string of hieroglyphs if there are no spaces, etc. on the line. 
For example, if you have a string of 300 hieroglyphs in a paragraph, double clicking on one of them would select the entire string, because as far as Word_Break is concerned, the entire 300 characters form one word. For linebreak, you would only break when forced. So in a paragraph of passages of English + hieroglyphs (represented here by CAPS), you would only break at the spaces and when forced. For example, suppose we have: ... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER is constructed from 15 words with... It would not line break (with the current properties) as: ... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLAREN KQLNRKEWLQNFNNAKDFNFNQKLER is constructed from 15 words with... but rather as: ... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEW LQNFNNAKDFNFNQKLER is constructed from 15 words with... > Moreover, these properties were defined at the beginning, were they not? > Bob Richmond and others will certainly have a view on this. > If there is defined clustering behavior that affects line break, then the line break property value would need to be Complex_Context. But the *current* value is Alphabetic, which makes any length of hieroglyphs function as one (possibly very long) word. That appears clearly wrong, even if it was "defined at the beginning". Properties are not carved in stone (so to speak); we sometimes find out later, especially for seldom used scripts, that property values can be improved. > > We would only need to use Complex_Context [:lb=SA:] for scripts that > keep some letters together and break others apart (typically needing > dictionary lookup). I would suspect for modern use of Egyp, that is not the > case; > > Please do not "suspect". It is not hard to ask experts. > You misunderstand. When I say "I suspect" that means I'm not certain. Thus I would like people who are both knowledgeable about hieroglyphs *and* Unicode properties to weigh in. 
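The wrapping behaviour described above can be imitated with a naive greedy wrapper that, like the current Line_Break=Alphabetic values, treats any unbroken run of letters as one word (a simplified sketch, not a real UAX #14 implementation; the function name and width are mine):

```python
def wrap_at_spaces(text, width):
    """Greedy wrap that breaks only at spaces: a run containing no
    spaces lands on a line by itself, however far it overshoots."""
    lines, line, length = [], [], 0
    for word in text.split():
        extra = len(word) + (1 if line else 0)  # +1 for joining space
        if line and length + extra > width:
            lines.append(" ".join(line))
            line, length = [], 0
            extra = len(word)
        line.append(word)
        length += extra
    if line:
        lines.append(" ".join(line))
    return lines

# Stand-in for a 58-character hieroglyph run, as in Mark's example.
glyphs = "ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER"
for ln in wrap_at_spaces(f"the passage {glyphs} is constructed", 30):
    print(ln)  # the unbroken run overflows its 30-column line
```

With these property values, the whole run is one word, so a renderer's only alternatives are overflow or a forced break exactly at the margin.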
I know that people like Andrew Glass are on this list, who satisfy both criteria. > > most people would expect the characters to just flow like > ideographs, breaking between any pair: > > NO. Clusters cannot be broken up just anywhere. > A simple assertion without more information is useless. Does that mean that ancient inscriptions would leave gaps at the end of lines in order to not break a cluster, or that modern users would expect software to leave gaps at the end of lines in order to not break a cluster? And what constitutes a cluster? Is that semantically determined (eg like Thai), or is it based on algorithmic features of the hieroglyphs? > > you wouldn't need to disallow breaks between a with an axe> and a , for example. > > > > Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic > have a linebreak and general category properties that seem odd and > inconsistent to me. > > > > Line_Break=Close_Punctuation > > General_Category=Other_Letter items: 8 > > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 6 > > > > ?? U+1325B EGYPTIAN HIEROGLYPH O006D > > ?? U+1325C EGYPTIAN HIEROGLYPH O006E > > ?? U+1325D EGYPTIAN HIEROGLYPH O006F > > ?? U+13282 EGYPTIAN HIEROGLYPH O033A > > ?? U+13287 EGYPTIAN HIEROGLYPH O036B > > ?? U+13289 EGYPTIAN HIEROGLYPH O036D > > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 2 > > > > ?? U+1337A EGYPTIAN HIEROGLYPH V011B > > ?? U+1337B EGYPTIAN HIEROGLYPH V011C > > Line_Break=Open_Punctuation > > General_Category=Other_Letter items: 6 > > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 5 > > > > ?? U+13258 EGYPTIAN HIEROGLYPH O006A > > ?? U+13259 EGYPTIAN HIEROGLYPH O006B > > ?? U+1325A EGYPTIAN HIEROGLYPH O006C > > ?? U+13286 EGYPTIAN HIEROGLYPH O036A > > ?? U+13288 EGYPTIAN HIEROGLYPH O036C > > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 1 > > > > ?? 
U+13379 EGYPTIAN HIEROGLYPH V011A > > These properties were chosen explicitly when Egyptian was first defined. > Those are enclosing punctuation characters. > The issue is that the general category property values are *not* punctuation characters, so there appears to be an inconsistency (as I said). > > Michael Everson. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Dec 14 10:27:00 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 14 Dec 2017 08:27:00 -0800 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> Message-ID: <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> Gentlemen, On 12/14/2017 6:53 AM, Mark Davis ?? via Unicode wrote: > Thus I would like people who are both knowledgeable about hieroglyphs > /and/ Unicode properties to weigh in. I know that people like Andrew > Glass are on this list, who satisfy both criteria. > > And what constitutes a cluster? This entire discussion is premature. The model for Egyptian is in flux right now. What constitutes a "quadrat", which is significantly relevant to any determination of how other segmentation properties should work for Egyptian hieroglyphics, will depend on the details of the model and how quadrat formation interacts with the exact set of format controls eventually agreed upon. See: http://www.unicode.org/L2/L2017/17112r-quadrat-encoding.pdf (And please note that that has a reference list of 13 *other* documents. This is not simple stuff.) When we get closure on the Egyptian model, *then* will be the time to make suggestions for how Egyptian values for GCB, WB, and LB might be adjusted for possible better default behavior. --Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Dec 14 12:11:33 2017 From: unicode at unicode.org (Andrew Glass via Unicode) Date: Thu, 14 Dec 2017 18:11:33 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> Message-ID: We've made a lot of progress on Hieroglyphs this year with the addition of the quadrat forming controls (thanks again to everyone involved in that effort and in the preceding 13 documents). I like to think that that part of the model is no longer in flux. Certainly, there is more work to be done on correct breaking. At this point we know that quadrat breaks != word breaks, but quadrat boundaries must align with line breaks. We had some discussion on the sidelines of the August UTC meeting at which time it became clear that more work is needed as current property values are not entirely correct. Currently, my Hieroglyphic energies are focused on completing font documentation and a reference font. I think it will be most helpful to understand the properties when we have a font that fully supports the quadrat controls so we have specific examples we can look at and confer on with specialists. So I'm happy to take Ken's suggestion that we don't rush in here. Cheers, Andrew From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler via Unicode Sent: Thursday, December 14, 2017 8:27 AM To: mark Cc: unicode at unicode.org Subject: Re: Word_Break for Hieroglyphs Gentlemen, On 12/14/2017 6:53 AM, Mark Davis ?? via Unicode wrote: Thus I would like people who are both knowledgeable about hieroglyphs and Unicode properties to weigh in. I know that people like Andrew Glass are on this list, who satisfy both criteria. And what constitutes a cluster? This entire discussion is premature. The model for Egyptian is in flux right now. 
What constitutes a "quadrat", which is significantly relevant to any determination of how other segmentation properties should work for Egyptian hieroglyphics, will depend on the details of the model and how quadrat formation interacts with the exact set of format controls eventually agreed upon. See: http://www.unicode.org/L2/L2017/17112r-quadrat-encoding.pdf (And please note that that has a reference list of 13 *other* documents. This is not simple stuff.) When we get closure on the Egyptian model, *then* will be the time to make suggestions for how Egyptian values for GCB, WB, and LB might be adjusted for possible better default behavior. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Dec 14 14:13:21 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 14 Dec 2017 20:13:21 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> Message-ID: <20171214201321.30706b7b@JRWUBU2> On Thu, 14 Dec 2017 18:11:33 +0000 Andrew Glass via Unicode wrote: > We had some discussion on the sidelines of > the August UTC meeting at which time it became clear that more work > is needed as current property values are not entirely correct. > Currently, my Hieroglyphic energies are focused on completing font > documentation and a reference font. I think it will be most helpful > to understand the properties when we have a font that fully supports > the quadrat controls so we have specific examples we can look at and > confer on with specialists. So I'm happy to take Ken's suggestion > that we don't rush in here. I'll read that as saying there is no need to report a problem; that we already know that there will be a problem with real text of more than a few characters. 
(The current encoding was justified as primarily defining short strings marshalled by a layout language.) I was approaching hieroglyphs as a system where grapheme cluster breaks, line break opportunities and sentence boundaries have little connection, unlike the hierarchy seen in most writing systems. At least, quadrats seem to be strong candidates for the status of grapheme clusters. Richard. From unicode at unicode.org Thu Dec 14 15:13:22 2017 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 14 Dec 2017 22:13:22 +0100 (CET) Subject: Include emoticons in CLDR character annotation? Message-ID: <310424535.7599.1513286002504@ox.hosteurope.de> The CLDR Survey Tool is currently open to, among other things, collect improvements to (emoji) character names and keywords. I don't see it being done for any language yet, but wouldn't it make sense to add classic emoticons (like :-) for various smiling emojis), kaomoji (like o/ for Person Raising Hand) and ASCII art (like ><)))?> and similar for Fish) to the keywords of Face emoji? From unicode at unicode.org Thu Dec 14 16:40:23 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 14 Dec 2017 22:40:23 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171214224023.23e5723f@JRWUBU2> On Mon, 11 Dec 2017 21:45:23 +0000 Cibu Johny (????) wrote: > I am assuming the purpose of the grapheme cluster definition is to be > used for line spacing, vertical writing or cursor movement. Without > defining the purpose, it is hard for me to say if a ruleset is valid > or not. Assuming that purpose driven definition, we probably need > language specific definitions - a pan-indic algorithm may not work. > For instance, the proposed ruleset, may not hold good for Tamil. For > example, see the title in the following image: ??????? 
broken as > [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed > algorithm it would be: [ta-u, ka-virama-lla, ka-virama] > > http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg I think Tamil is actually rather straightforward. For native intuition, I would cite the Tamil letter-counting account at https://venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf. What the author counts is not spacing glyphs, but vowel letters and consonant characters, with two significant modifications. Firstly, K.SSA counts as just one consonant, and SH.R.II is also counted as containing a single consonant. In other words, the Tamil virama character works as a pure killer except in those two environments. This is also the story the TUNE protagonists tell us. It will be an inelegant rule for UAX#29, but, unfortunately, reality is messy. > Malayalam could be a similar story. In case of Malayalam, it can be > font specific because of the existence of traditional and reformed > writing styles. A conjunct might be a ligature in traditional; and it > might get displayed with explicit virama in the reformed style. For > example see the poster with word ??????? broken as [u, sa-virama, > ta-aa, da-virama] - as it is written in the reformed style. As per > the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama]. > These breaks would be used by the traditional style of writing. Working round that seems to be tricky. The best I can think of is to have two different locales, traditional and reformed, and hope that the right font is selected. It doesn't seem at all straightforward to work out what the font is doing even from a character to glyph map without knowing what the glyphs are. I'm not sure how one should have the difference designated - language variants, or two scripts? 
> > [image: image.png] > https://upload.wikimedia.org/wikipedia/en/6/64/Ustad_Hotel_%282012%29_-_Poster.jpg > BTW, there is an example with explicit virama in the proposal under > the Sanskrit section: The alleged grapheme cluster is the last cluster of the second word in the Sanskrit section of L2/17-200 Recommendations to UTC #152 on Text segmentation in Indian languages (https://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf). The rendering seems odd if there is no ZWNJ in the word. I read the word as ???????????? pprpadya with two pitch accents. However, I can't explain the visible virama under the DA - even a Hindi font should have a conjunct for D.YA. Richard. From unicode at unicode.org Sat Dec 16 16:06:03 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 16 Dec 2017 22:06:03 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> Message-ID: <20171216220603.7137899c@JRWUBU2> On Thu, 14 Dec 2017 15:53:13 +0100 Mark Davis ?? via Unicode wrote: > On Thu, Dec 14, 2017 at 3:22 PM, Michael Everson > wrote: > > NO. Clusters cannot be broken up just anywhere. > Does that mean that ancient inscriptions would leave gaps at the end > of lines in order to not break a cluster, or that modern users would > expect software to leave gaps at the end of lines in order ?to not > break a cluster? And what constitutes a cluster? Is that semantically > determined (eg like Thai), or is it based on algorithmic features of > the hieroglyphs? An absence of gaps in ancient inscriptions would not be revealing. One justification trick available to the engravers was variable spelling - spacing phonetic complements were optional. Original letters would offer the best evidence in this respect. We're going to have some algorithmic clusters - it will make no sense to break quadrats between lines. Also, it would be perverse to line-break a graphic transposition. 
Phonetic elements normally occur in phonetic order, but bird plus tall thin character is usually replaced by tall thin character plus bird. Thus splitting ?????? /wd?/ 'order' i.e. into wD on one line and w-Y1A on the next would be perverse. Unfortunately, I don't know whether it happens or not. Preventing this particular example ought to require a semantic analysis, but I couldn't find an example of word final V024 in the free, 2006 edition of Paul Dickson's "Dictionary of Middle Egyptian in Gardiner Classification Order", so perhaps a sequence wD-w will always be word-internal. Richard. From unicode at unicode.org Sun Dec 17 09:16:20 2017 From: unicode at unicode.org (David P. Kendal via Unicode) Date: Sun, 17 Dec 2017 16:16:20 +0100 Subject: Possible bug in formal grammar for extended grapheme cluster Message-ID: Hi, It's possible I'm missing something, but the formal grammar/regular expression given for extended grapheme clusters appears to have a bug in it. The bug is here: RI-Sequence := Regional_Indicator+ If the formal grammar is intended to exactly match the rules given in the "Grapheme Cluster Boundary Rules" section below it as-is, then this should be RI-Sequence := Regional_Indicator Regional_Indicator since as given it would cause any number of RI characters to coalesce into a single grapheme cluster, instead of pairs of characters. That is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one grapheme cluster instead of the correct two. -- dpk (David P. Kendal) ? Nassauische Str. 36, 10717 DE ? http://dpk.io/ we do these things not because they are easy, +49 159 03847809 but because we thought they were going to be easy ? ?The Programmers? 
Credo?, Maciej Ceg?owski From unicode at unicode.org Sun Dec 17 11:17:57 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 17 Dec 2017 18:17:57 +0100 Subject: Possible bug in formal grammar for extended grapheme cluster In-Reply-To: References: Message-ID: Thanks for the feedback. You're correct about this; that is a holdover from an earlier version of the document when there was a more basic treatment of RI sequences. There is already an action to modify these. There is a placeholder review note about that just above http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters (scroll up just a bit). Mark Mark On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode < unicode at unicode.org> wrote: > Hi, > > It?s possible I?m missing something, but the formal grammar/regular > expression given for extended grapheme clusters appears to have a bug > in it. > Sequences_and_Grapheme_Clusters> > > The bug is here: > > RI-Sequence := Regional_Indicator+ > > If the formal grammar is intended to exactly match the rules given the > the ?Grapheme Cluster Boundary Rules? section below it as-is, then > this should be > > RI-Sequence := Regional_Indicator Regional_Indicator > > since as given it would cause any number of RI characters to coalesce > into a single grapheme cluster, instead of pairs of characters. That > is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one > grapheme cluster instead of the correct two. > > -- > dpk (David P. Kendal) ? Nassauische Str. 36, 10717 DE ? http://dpk.io/ > we do these things not because they are easy, +49 159 03847809 > but because we thought they were going to be easy > ? ?The Programmers? Credo?, Maciej Ceg?owski > > > -------------- next part -------------- An HTML attachment was scrubbed... 
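The corrected pairing behaviour (rules GB12/GB13 in current UAX #29) can be illustrated with a small segmenter for runs of Regional Indicator symbols (a sketch for RI-only input; a real implementation must handle the full grapheme cluster rule set, and the function name is mine):

```python
RI = range(0x1F1E6, 0x1F200)  # REGIONAL INDICATOR SYMBOL LETTER A..Z

def ri_clusters(text):
    """Group Regional Indicators pairwise, as GB12/GB13 require:
    four RIs form two grapheme clusters, not one."""
    clusters, i = [], 0
    while i < len(text):
        if ord(text[i]) in RI and i + 1 < len(text) and ord(text[i + 1]) in RI:
            clusters.append(text[i:i + 2])
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters

# U+1F1EC U+1F1E7 U+1F1EA U+1F1FA: two flags (GB, EU), so two clusters.
flags = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"
print(len(ri_clusters(flags)))  # 2
```

Under the older `Regional_Indicator+` grammar the same four characters would have coalesced into a single cluster, which is exactly the bug reported above.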
URL: From unicode at unicode.org Mon Dec 18 03:59:06 2017 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Mon, 18 Dec 2017 09:59:06 +0000 Subject: Possible bug in formal grammar for extended grapheme cluster In-Reply-To: References: Message-ID: Ah! That explains why pcre2grep -u '^\X{1}$' matches with ???? ???????? ???????????? ???????????????????? ...etc... Andr? Schappo On 17 Dec 2017, at 17:17, Mark Davis ?? via Unicode > wrote: Thanks for the feedback. You're correct about this; that is a holdover from an earlier version of the document when there was a more basic treatment of RI sequences. There is already an action to modify these. There is a placeholder review note about that just above http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters (scroll up just a bit). Mark Mark On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode > wrote: Hi, It?s possible I?m missing something, but the formal grammar/regular expression given for extended grapheme clusters appears to have a bug in it. The bug is here: RI-Sequence := Regional_Indicator+ If the formal grammar is intended to exactly match the rules given the the ?Grapheme Cluster Boundary Rules? section below it as-is, then this should be RI-Sequence := Regional_Indicator Regional_Indicator since as given it would cause any number of RI characters to coalesce into a single grapheme cluster, instead of pairs of characters. That is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one grapheme cluster instead of the correct two. -- dpk (David P. Kendal) ? Nassauische Str. 36, 10717 DE ? http://dpk.io/ we do these things not because they are easy, +49 159 03847809 but because we thought they were going to be easy ? ?The Programmers? Credo?, Maciej Ceg?owski ?? ?? ?? Andr? 
Schappo https://schappo.blogspot.co.uk https://twitter.com/andreschappo https://weibo.com/andreschappo https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 18 08:15:11 2017 From: unicode at unicode.org (Serge Rosmorduc via Unicode) Date: Mon, 18 Dec 2017 15:15:11 +0100 Subject: Word_Break for Hieroglyphs In-Reply-To: <20171216220603.7137899c@JRWUBU2> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <20171216220603.7137899c@JRWUBU2> Message-ID: <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> Hello, Concerning word separation and clusters, there was a variety of different practices. At best, one could say that statistically, there is a positive correlation between word cuts and cluster limits. This being said, it depends widely on the era, the quality of the inscription, and the available space. At some times (for instance, XIIth dynasty), the scribes would work hard to avoid cutting a word between two lines. At other times, and in other circumstances (limited available space), word cutting could be extreme. For instance, in Stela Cairo CGC 34025 (AKA Israel Stela), Merenptah's text, reusing a stela by Amenophis III, lacks room. Hence, you have things like (lines 5-6): the word ?sy "small" is cut between the two lines. The phonetic part is on line 5, and the bird determinative is alone on line 6, above the preposition "m", which is itself above the consonant "m" which is the first consonant of the following word. I have written the three words in different colours to show how they intertwine. Best regards, Serge Rosmorduc -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Xsy.png Type: image/png Size: 2790 bytes Desc: not available URL: From unicode at unicode.org Mon Dec 18 08:17:04 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 18 Dec 2017 15:17:04 +0100 Subject: Possible bug in formal grammar for extended grapheme cluster In-Reply-To: References: Message-ID: If you look back at http://www.unicode.org/reports/tr29/tr29-27.html#GB8a (2015), the rule was simply not to break sequences of RI characters. We changed that in http://www.unicode.org/reports/tr29/tr29-29.html#GB12 (2016) to only group pairs. Unfortunately, the (informative) table http://www.unicode.org/reports/tr29/tr29-31.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters was not updated after 2015 to keep pace with the changes in rules. So that is still to do.... Mark On Mon, Dec 18, 2017 at 10:59 AM, Andre Schappo via Unicode < unicode at unicode.org> wrote: > Ah! That explains why > > pcre2grep -u '^\X{1}$' > > matches with > > ???? > ???????? > ???????????? > ???????????????????? > > ...etc... > > Andr? Schappo > > On 17 Dec 2017, at 17:17, Mark Davis ?? via Unicode > wrote: > > Thanks for the feedback. You're correct about this; that is a holdover > from an earlier version of the document when there was a more basic > treatment of RI sequences. > > There is already an action to modify these. There is a placeholder review > note about that just above > > http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_ > Sequences_and_Grapheme_Clusters > > (scroll up just a bit). > > Mark > > Mark > > On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode < > unicode at unicode.org> wrote: > >> Hi, >> >> It?s possible I?m missing something, but the formal grammar/regular >> expression given for extended grapheme clusters appears to have a bug >> in it. 
>> > ences_and_Grapheme_Clusters> >> >> The bug is here: >> >> RI-Sequence := Regional_Indicator+ >> >> If the formal grammar is intended to exactly match the rules given the >> the ?Grapheme Cluster Boundary Rules? section below it as-is, then >> this should be >> >> RI-Sequence := Regional_Indicator Regional_Indicator >> >> since as given it would cause any number of RI characters to coalesce >> into a single grapheme cluster, instead of pairs of characters. That >> is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one >> grapheme cluster instead of the correct two. >> >> -- >> dpk (David P. Kendal) ? Nassauische Str. 36, 10717 >> >> DE ? http://dpk.io/ >> we do these things not because they are easy, +49 159 03847809 >> but because we thought they were going to be easy >> ? ?The Programmers? Credo?, Maciej Ceg?owski >> >> >> > > ?? ?? ?? > Andr? Schappo > https://schappo.blogspot.co.uk > https://twitter.com/andreschappo > https://weibo.com/andreschappo > https://groups.google.com/forum/#!forum/computer-science-curriculum- > internationalization > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Dec 20 02:46:33 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 20 Dec 2017 08:46:33 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <20171216220603.7137899c@JRWUBU2> <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> Message-ID: <20171220084633.56725ae1@JRWUBU2> On Mon, 18 Dec 2017 15:15:11 +0100 Serge Rosmorduc via Unicode wrote: > Hence, you have things like (like 5-6) : : the word ?sy ? small ?, > is cut between the two lines. The phonetic part is line 5, and the > bird determinative is alone on line 5, above the preposition ? m ?, > which is itself above the consonnant ? m ? 
which is the first
> consonant of the following word. I have written the three words in
> different colours to show how they are intertwined.

In an implementation that offered genuine whole-word selection, and thus tackled the challenges of Chinese, Japanese, Korean and Vietnamese (both scripts, not just CJKV) as well as Thai, I would expect the selections to be bounded by word boundaries. Thus, if the cited line break (labelled by '6') were not in the text, I would expect double-clicking on the quadrat G37:Aa13:Aa13 to select all three words.

Looking at the rendering in https://mjn.host.cs.st-andrews.ac.uk/egyptian/texts/corpus/pdf/Merneptah.pdf, it is worth noting that the cartouche in Line 4 of the inscription is not broken between lines. I don't know whether this is to avoid breaking the cartouche or to avoid separating the facing figure therein.

Richard.

From unicode at unicode.org Wed Dec 20 03:06:28 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Wed, 20 Dec 2017 18:06:28 +0900
Subject: Word_Break for Hieroglyphs
In-Reply-To: <20171220084633.56725ae1@JRWUBU2>
References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <20171216220603.7137899c@JRWUBU2> <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> <20171220084633.56725ae1@JRWUBU2>
Message-ID: <0c4fb320-f1c5-4d02-b8d2-4c1ddd8531ca@it.aoyama.ac.jp>

On 2017/12/20 17:46, Richard Wordingham via Unicode wrote:
> In an implementation that offered genuine whole-word selection, and
> thus tackled the challenges of Chinese, Japanese, Korean and
> Vietnamese (both scripts, not just CJKV) as well as Thai, I would
> expect the selections to be bounded by word boundaries. Thus, if the
> cited line break (labelled by '6') were not in the text, I would expect
> double-clicking on the quadrat G37:Aa13:Aa13 to select all three words.
This may be common knowledge to some, but I just had a Japanese document open in MS Word, and tried double-clicking to see what happens. What it does is select same-script runs. This means that a run of kanji, a run of hiragana, or a run of katakana is selected (interestingly, the (kata)kana length mark is treated as a fourth script). This is of course not the same as words, but it can match, and it comes close in terms of offering something for editorial convenience while being easy to implement.

Regards, Martin.

From unicode at unicode.org Thu Dec 21 02:55:33 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Thu, 21 Dec 2017 17:55:33 +0900
Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues
In-Reply-To: <20171214224023.23e5723f@JRWUBU2>
References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2>
Message-ID: <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp>

On 2017/12/15 07:40, Richard Wordingham via Unicode wrote:
> On Mon, 11 Dec 2017 21:45:23 +0000
> Cibu Johny (????) wrote:
>> Malayalam could be a similar story. In case of Malayalam, it can be
>> font specific because of the existence of traditional and reformed
>> writing styles. A conjunct might be a ligature in traditional; and it
>> might get displayed with explicit virama in the reformed style. For
>> example see the poster with word ??????? broken as [u, sa-virama,
>> ta-aa, da-virama] - as it is written in the reformed style. As per
>> the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama].
>> These breaks would be used by the traditional style of writing.
>
> Working round that seems to be tricky. The best I can think of is to
> have two different locales, traditional and reformed, and hope that the
> right font is selected.
It doesn't seem at all straightforward to > work out what the font is doing even from a character to glyph map > without knowing what the glyphs are. I'm not sure how one should have > the difference designated - language variants, or two scripts? I'm not at all familiar with Malayalam, but from my experience with typing Japanese (where the average kana character requires two keystrokes for input, but only one for deleting) would lead to different advice. When typing, it is very helpful to know how many times one has to hit backspace when making an error. This kind of knowledge is usually assimilated into what one calls muscle memory, i.e. it is done without thinking about it. I would guess that would be very difficult to maintain two different kinds of muscle memory for typing Malayalam. (My assumption is that the populations typing traditional and reformed writing styles are not disjoint.) Regards, Martin. From unicode at unicode.org Thu Dec 21 15:44:49 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 21 Dec 2017 21:44:49 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> Message-ID: <20171221214449.7567d0ac@JRWUBU2> On Thu, 21 Dec 2017 17:55:33 +0900 "Martin J. D?rst via Unicode" wrote: > On 2017/12/15 07:40, Richard Wordingham via Unicode wrote: > > On Mon, 11 Dec 2017 21:45:23 +0000 > > Cibu Johny (????) wrote: > >> For example see the poster with word ??????? broken as [u, > >> sa-virama, ta-aa, da-virama] - as it is written in the reformed > >> style. As per the proposed algorithm, it would be [u, > >> sa-virama-ta-aa, da-virama]. These breaks would be used by the > >> traditional style of writing. 
> I'm not at all familiar with Malayalam, but from my experience with > typing Japanese (where the average kana character requires two > keystrokes for input, but only one for deleting) would lead to > different advice. When typing, it is very helpful to know how many > times one has to hit backspace when making an error. This kind of > knowledge is usually assimilated into what one calls muscle memory, > i.e. it is done without thinking about it. I would guess that would > be very difficult to maintain two different kinds of muscle memory > for typing Malayalam. (My assumption is that the populations typing > traditional and reformed writing styles are not disjoint.) When deleting by backspace, the usual practice is to delete one Unicode character for each key press. The proposed change to the definition of grapheme clusters will not affect this. What will change, for some systems, is stepping through Indic text in most scripts. (The visual order scripts will be unaffected.) In Linux applications, one can often step to the start of each grapheme cluster, i.e. to the breaks in |u|sa-virama|ta-aa|da-virama|. If the proposal to expand extended grapheme clusters to whole aksharas goes through, a likely effect for traditional Malayalam is that one will only be able to step to the positions marked as breaks in |u|sa-virama-ta-aa|da-virama|. Every major system will then be in the same position as Windows, where already only the reduced set of cursor positions is allowed. Thus if the 'sa' were mistyped, one would have to retype the entire 4-character akshara. I find this an unpleasant prospect, and some Indians already find it extremely annoying not to be able to edit the join between consonants, e.g. to replace by . Richard. 
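[Editor's note: the contrast discussed above, between backspace deleting one code point and cursor movement stepping over whole grapheme clusters, can be sketched in Python. This is an illustrative sketch, not code from the thread: the segmenter implements only a small fragment of UAX #29, attaching combining marks (general categories Mn, Mc, Me) and ZWJ sequences to the preceding base; production code should use a full implementation such as ICU, or the third-party regex module's \X.]

```python
import unicodedata

ZWJ = "\u200d"

def simple_clusters(s):
    """Toy grapheme-cluster segmentation: attach combining marks and
    ZWJ-joined characters to the preceding base character. Covers only
    a small fragment of the UAX #29 rules."""
    clusters = []
    for ch in s:
        extend = (unicodedata.category(ch) in ("Mn", "Mc", "Me")
                  or ch == ZWJ
                  or (clusters and clusters[-1].endswith(ZWJ)))
        if clusters and extend:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Devanagari "namaste": na, ma, sa, virama, ta, vowel sign e
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

# Backspace deleting one code point removes only the final vowel sign:
by_code_point = word[:-1]

# Deleting a whole grapheme cluster removes the final consonant
# together with its vowel sign:
by_cluster = "".join(simple_clusters(word)[:-1])
```

On this word the toy segmenter yields four clusters, |na|ma|sa-virama|ta-e|, matching the cluster-by-cluster stepping described above before any proposed expansion to whole aksharas.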
From unicode at unicode.org Thu Dec 21 18:18:34 2017
From: unicode at unicode.org (Karl Williamson via Unicode)
Date: Thu, 21 Dec 2017 17:18:34 -0700
Subject: Inconsistency between UTS 39 and 24
Message-ID: <7e516a02-14eb-1020-3561-864251346a34@khwilliamson.com>

In http://unicode.org/reports/tr39/#Mixed_Script_Detection it says, "For more information on the Script_Extensions property and Jpan, Kore, and Hanb, see UAX #24"

In http://www.unicode.org/reports/tr24/, there certainly is more information on scx; however, none of the terms Jpan, Kore, or Hanb is mentioned.

From unicode at unicode.org Thu Dec 21 20:11:37 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 22 Dec 2017 03:11:37 +0100
Subject: Inconsistency between UTS 39 and 24
In-Reply-To: <7e516a02-14eb-1020-3561-864251346a34@khwilliamson.com>
References: <7e516a02-14eb-1020-3561-864251346a34@khwilliamson.com>
Message-ID:

These are ISO 15924 script codes for script variants or groups of related scripts, not used in the Unicode classification of characters due to their unification (even if there are registered variants for them).

2017-12-22 1:18 GMT+01:00 Karl Williamson via Unicode:
> In http://unicode.org/reports/tr39/#Mixed_Script_Detection
> it says, "For more information on the Script_Extensions property and Jpan,
> Kore, and Hanb, see UAX #24"
>
> In http://www.unicode.org/reports/tr24/, there certainly is more
> information on scx; however, none of the terms Jpan, Kore, or Hanb is
> mentioned.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From unicode at unicode.org Fri Dec 22 00:04:37 2017 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Thu, 21 Dec 2017 22:04:37 -0800 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171221214449.7567d0ac@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> Message-ID: > When deleting by backspace, the usual practice is to delete one Unicode character for each key press. This seems to depend on the operating system and program involved. For example, on OSX any native text input field (Spotlight, TextEdit, etc) will delete by extended grapheme cluster. Chrome also deletes by extended grapheme cluster. However, Firefox deletes by code point. Or, more accurately, something codepoint-like. Backspace will delete flag emoji wholesale, but will delete the jamos in `????????` (a single EGC) one at a time. It also deletes the variation selector and the heart in `????????` in a single keystroke. There's probably a simple metric being used here, but I haven't looked into it yet. ----------- Overall it seems like there's a different preference for forming clusters in different scripts. Perhaps we should have a specific "cluster forming virama" category for viramas from scripts that almost always prefer clusters? (e.g. devanagari). IIRC some indic scripts prefer explicit virama rendering. -Manish On Thu, Dec 21, 2017 at 1:44 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Thu, 21 Dec 2017 17:55:33 +0900 > "Martin J. D?rst via Unicode" wrote: > > > On 2017/12/15 07:40, Richard Wordingham via Unicode wrote: > > > On Mon, 11 Dec 2017 21:45:23 +0000 > > > Cibu Johny (????) wrote: > > > >> For example see the poster with word ??????? 
broken as [u, > > >> sa-virama, ta-aa, da-virama] - as it is written in the reformed > > >> style. As per the proposed algorithm, it would be [u, > > >> sa-virama-ta-aa, da-virama]. These breaks would be used by the > > >> traditional style of writing. > > > I'm not at all familiar with Malayalam, but from my experience with > > typing Japanese (where the average kana character requires two > > keystrokes for input, but only one for deleting) would lead to > > different advice. When typing, it is very helpful to know how many > > times one has to hit backspace when making an error. This kind of > > knowledge is usually assimilated into what one calls muscle memory, > > i.e. it is done without thinking about it. I would guess that would > > be very difficult to maintain two different kinds of muscle memory > > for typing Malayalam. (My assumption is that the populations typing > > traditional and reformed writing styles are not disjoint.) > > When deleting by backspace, the usual practice is to delete one Unicode > character for each key press. The proposed change to the definition of > grapheme clusters will not affect this. > > What will change, for some systems, is stepping through Indic text in > most scripts. (The visual order scripts will be unaffected.) In Linux > applications, one can often step to the start of each grapheme cluster, > i.e. to the breaks in |u|sa-virama|ta-aa|da-virama|. If the proposal to > expand extended grapheme clusters to whole aksharas goes through, a > likely effect for traditional Malayalam is that one will only be able to > step to the positions marked as breaks in > |u|sa-virama-ta-aa|da-virama|. Every major system will then be in the > same position as Windows, where already only the reduced set of cursor > positions is allowed. Thus if the 'sa' were mistyped, one would have > to retype the entire 4-character akshara. 
I find this an unpleasant > prospect, and some Indians already find it extremely annoying not to be > able to edit the join between consonants, e.g. to replace by > . > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Dec 22 01:27:15 2017 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 22 Dec 2017 09:27:15 +0200 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: (message from Manish Goregaokar via Unicode on Thu, 21 Dec 2017 22:04:37 -0800) References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> Message-ID: <83mv2bmdd8.fsf@gnu.org> > Date: Thu, 21 Dec 2017 22:04:37 -0800 > Cc: Unicode Public > From: Manish Goregaokar via Unicode > > However, Firefox deletes by code point. As does Emacs, btw. From unicode at unicode.org Fri Dec 22 09:36:35 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 22 Dec 2017 15:36:35 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <83mv2bmdd8.fsf@gnu.org> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> Message-ID: <20171222153635.67628752@JRWUBU2> On Fri, 22 Dec 2017 09:27:15 +0200 Eli Zaretskii via Unicode wrote: > > Date: Thu, 21 Dec 2017 22:04:37 -0800 > > Cc: Unicode Public > > From: Manish Goregaokar via Unicode > > > > However, Firefox deletes by code point. > > As does Emacs, btw. And deleting in that fashion from the right is mentioned by UAX#29 with the implication that it is a sensible way of doing things. 
Emacs is civilised in that it allows one to delete character by character from either end. That may, however, require some intelligence on the part of the user so that they don't get confused or frightened when the text rearranges itself. However, it seems that one has to modify the source code of Emacs to be able to edit in the middle of a cluster (other than by substitution commands). Or am I overlooking some per-window 'reveal codes' mode that the cognoscenti can use? Richard. From unicode at unicode.org Fri Dec 22 09:44:39 2017 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 22 Dec 2017 17:44:39 +0200 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171222153635.67628752@JRWUBU2> (message from Richard Wordingham via Unicode on Fri, 22 Dec 2017 15:36:35 +0000) References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> <20171222153635.67628752@JRWUBU2> Message-ID: <83tvwilqc8.fsf@gnu.org> > Date: Fri, 22 Dec 2017 15:36:35 +0000 > From: Richard Wordingham via Unicode > > Emacs is civilised in that it allows one to delete character by > character from either end. That may, however, require some > intelligence on the part of the user so that they don't get confused > or frightened when the text rearranges itself. However, it seems that > one has to modify the source code of Emacs to be able to edit in the > middle of a cluster You can always delete a codepoint at a given position in Emacs, specifying the position by its number, but there are no user-level commands to conveniently allow doing that in the middle of a grapheme cluster. It was never requested nor deemed necessary to provide such a capability. 
Normally, replacing some portions of a grapheme cluster produces a radically different display, so it makes more sense to delete everything and start anew. Deleting individual codepoints by Backspace is useful for accents and diacritics, which generally are input after the base characters, so that is provided. From unicode at unicode.org Fri Dec 22 16:56:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 22 Dec 2017 22:56:53 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> Message-ID: <20171222225653.01cc4b8a@JRWUBU2> On Thu, 21 Dec 2017 22:04:37 -0800 Manish Goregaokar via Unicode wrote: > > When deleting by backspace, the usual practice is to delete one > > Unicode > character for each key press. > > This seems to depend on the operating system and program involved. For > example, on OSX any native text input field (Spotlight, TextEdit, > etc) will delete by extended grapheme cluster. Chrome also deletes by > extended grapheme cluster. That seems nasty, even for Thai with its consonant + vowel + tone legacy grapheme clusters. Or does Thai get special treatment? iPhone messages shows the normal (mandated?) Thai behaviour of deleting character by character. Do you not find this mass deletion annoying for Hindi aksharas with anusvara? > However, Firefox deletes by code point. Or, more accurately, something > codepoint-like. Backspace will delete flag emoji wholesale, but will > delete the jamos in `????????` (a single EGC) one at a time. It > also deletes the variation selector and the heart in `????????` in a > single keystroke. There's probably a simple metric being used here, > but I haven't looked into it yet. There are some odd behaviours around. 
Claws-mail, which I think uses straight GTK2, has been changing its treatment of Latin diacritics. Long ago, if I remember correctly, it treated 'e acute' differently depending on whether it was one or two codepoints, then it started converting text to NFC on input, and now it treats the NFC and NFD sequence as though it were a single codepoint. This might be using the property 'diacritic', but it isn't treating Thai tone marks that way, so I'm guessing. Presumably it's been implemented on the principle that the user should not receive any pleasant surprises. > ----------- > Overall it seems like there's a different preference for forming > clusters in different scripts. Perhaps we should have a specific > "cluster forming virama" category for viramas from scripts that > almost always prefer clusters? (e.g. devanagari). IIRC some indic > scripts prefer explicit virama rendering. The denial of "one size fits all" is appropriate within writing systems as well as across all systems. For example, using grapheme clusters as the unit of matching may generally work well, but is a total disaster in Indic if one needs to replace one vowel by another, as in Hariraama's plea for help on the Indic list on 6 December. Moving the editing position within a cluster is another issue. Sometimes one needs to adjust the type of joining of consonants in an akshara, e.g. to give a Devanagari text a Hindi look even if the text gets displayed with a Sanskrit font. This is where the font-dependent interpretation of a virama is a disaster. For Devanagari it might have been better if there had been three different characters for the two types of joining (half-forms on one hand and conjuncts or repha on the other) and one type of non-joining, the visible virama. Instead, it seems that people rely on the appropriate gaps in the font capability. 
The nettle was grasped for the Myanmar script, which now has an invisible stacker, a pure vowel killer, and a composite code for the repha-type combination.

I really do find it hard to believe that it is considered to be bad to correct a single consonant in the middle of an akshara. I am not persuaded that the users of languages with many multi-consonant aksharas think that each distinct akshara is a different character. The akshara lies in a hierarchy, between the grapheme cluster and the pada patha word. What is needed is an extra level of cursor motion, between the levels of word and grapheme cluster.

Thai also shows different levels of division. For horizontal spacing, the unit is indeed the grapheme cluster. However, looking at a dictionary published in 1971, I noticed that a few marks above are conditionally placed between grapheme clusters. The primary examples are MAI HAN-AKAT and MAI THO, neither of which the Thais considered a vowel back in 1892. (Michell's dictionary of 1892 is apologetic about treating the former as a vowel.) I haven't noticed this behaviour in 20th-century Thai with characters separated by several character widths, though both these marks tend to be placed in the rightmost part of the space allocated to the base consonant. Correct positioning of MAI THO seems to require a grammatical analysis, and the documentation of Uniscribe certainly used to suggest that this was not possible at the font level.

Vertical writing in Thai is extremely rare. There are Thai crosswords, and they do use grapheme clusters. Many, but not all, of the examples use an irregular pentagon to accommodate marks above and a different irregular pentagon to accommodate marks below. Thais seem better acquainted with Scrabble played in English. Commercial vertical signs follow a different grammar. I have two examples, segmented ??-??-?? 'Yamaha', and ??-??-?? 'video', the latter typically accompanied by V-D-O in Roman letters.
It is not clear whether these words are split into super-extended grapheme clusters or syllables.

When it comes to line-breaking, the Thai preference for emergency line-breaks (which are supposed to be beyond the scope of Unicode) seems to be for division into syllables. This seems to be the standard for Lao line-breaking, though it might be connected with the facts that syllable boundaries are easier to detect with modern Lao spelling and that there are far fewer users of Lao than of Thai.

When it comes to detecting aksharas in Tamil, the situation seems to be rather simple. In two environments, U+0BCD TAMIL SIGN VIRAMA behaves like an invisible stacker. Otherwise, it behaves like a pure killer.

For Malayalam, the two writing styles have different behaviours for the virama. Disunification of U+0D4D MALAYALAM SIGN VIRAMA is probably not an option. In theory, one could try demanding that an ambiguous, intentionally visible virama be spelt with ZWNJ, but I doubt that such a command would be heeded.

For Sinhalese, it may be that there is no ambiguity in the effect of the virama, provided one is aware that the current Unicode prescription is contrary to the rules laid down by the government of Sri Lanka.

Note that the recent W3C investigation of Indic layout requirements was restricted to *Indian* Indic scripts. Does anyone here know what a Burmese 'dropped capital' looks like? The investigation did not cover Insular Southeast Asia, where there are characters of Indic syllabic category Virama. The coding of mainland Southeast Asian Indic scripts has evolved beyond the virama, using an invisible stacker and a pure killer instead. The ISCII stage is, so far as I am aware, restricted to India and Sri Lanka.

Richard.
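[Editor's note: the akshara-sized clusters described above (|u|sa-virama-ta-aa|da-virama| rather than |u|sa-virama|ta-aa|da-virama|) can be sketched for Devanagari, where the thread suggests the virama almost always joins. This is an illustrative sketch, not code from the thread; it assumes U+094D uniformly glues the following consonant into the cluster, which is precisely the assumption that fails for Malayalam's two writing styles.]

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

def aksharas(s):
    """Toy akshara segmentation for Devanagari: a cluster absorbs any
    combining marks, and a virama additionally absorbs the following
    consonant, so a consonant conjunct stays in one unit."""
    out = []
    i = 0
    while i < len(s):
        cluster = s[i]
        i += 1
        while i < len(s) and unicodedata.category(s[i]) in ("Mn", "Mc", "Me"):
            mark = s[i]
            cluster += mark
            i += 1
            # the virama glues the next consonant into the same akshara
            if mark == VIRAMA and i < len(s) and unicodedata.category(s[i]) == "Lo":
                cluster += s[i]
                i += 1
        out.append(cluster)
    return out

# Devanagari "namaste": the conjunct sa+virama+ta+e becomes ONE akshara,
# whereas current extended grapheme clusters split it after the virama.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
```

Here aksharas(word) yields three units, |na|ma|sa-virama-ta-e|, where current extended grapheme clusters yield four; with whole-akshara cursor movement, mistyping the 'sa' would force retyping all four code points of the final unit, which is exactly the objection raised above.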
From unicode at unicode.org Fri Dec 22 19:39:40 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 23 Dec 2017 01:39:40 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <83tvwilqc8.fsf@gnu.org> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> <20171222153635.67628752@JRWUBU2> <83tvwilqc8.fsf@gnu.org> Message-ID: <20171223013940.7530f89b@JRWUBU2> On Fri, 22 Dec 2017 17:44:39 +0200 Eli Zaretskii via Unicode wrote: > You can always delete a codepoint at a given position in Emacs, > specifying the position by its number, but there are no user-level > commands to conveniently allow doing that in the middle of a grapheme > cluster. > > It was never requested nor deemed necessary to provide such a > capability. Kenichi Handa provided such a capability for Emacs, but it has not been accepted for the main line. I am using a version of his code which he kindly provided for my own editing. The discussion is available at https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20140 . The figure attached to it shows stepping through a 7-character (6 graphic and one 'invisible' stacker) akshara. > Normally, replacing some portions of a grapheme cluster > produces a radically different display, so it makes more sense to > delete everything and start anew. Non sequitur. > Deleting individual codepoints by > Backspace is useful for accents and diacritics, which generally are > input after the base characters, so that is provided. Don't forget that a cluster can be a large constellation of characters. Richard. 
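[Editor's note: returning to the regional-indicator thread earlier in this digest, the corrected rule (GB12/GB13 in UAX #29, pairing RI symbols instead of coalescing arbitrary runs) can be shown with a minimal sketch. This illustrates the pairing rule alone, not full UAX #29 segmentation.]

```python
def group_regional_indicators(s):
    """Group Regional Indicator symbols (U+1F1E6..U+1F1FF) into pairs,
    per UAX #29 rules GB12/GB13: an RI joins the preceding cluster only
    if that cluster is a single unpaired RI, so four RIs form two
    clusters, not one."""
    out = []
    pending = ""  # an unpaired RI waiting for its partner
    for ch in s:
        if 0x1F1E6 <= ord(ch) <= 0x1F1FF:
            if pending:
                out.append(pending + ch)
                pending = ""
            else:
                pending = ch
        else:
            if pending:
                out.append(pending)
                pending = ""
            out.append(ch)
    if pending:
        out.append(pending)
    return out

# David Kendal's example: U+1F1EC U+1F1E7 U+1F1EA U+1F1FA
flags = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"
# -> two clusters (two flags), not one
```

Under the uncorrected grammar (Regional_Indicator+), the same four code points would coalesce into a single cluster.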
From unicode at unicode.org Wed Dec 27 15:31:19 2017
From: unicode at unicode.org (Karl Williamson via Unicode)
Date: Wed, 27 Dec 2017 14:31:19 -0700
Subject: Traditional and Simplified Han in UTS 39
Message-ID: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>

In UTS 39, it says that, optionally,

"Mark Chinese strings as 'mixed script' if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].

"The criterion can only be applied if the language of the string is known to be Chinese."

What does it mean for the language to "be known to be Chinese"? Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD?

The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese. So in this example we can algorithmically rule out that it's Chinese.

And what does Chinese really mean here?

From unicode at unicode.org Wed Dec 27 16:20:10 2017
From: unicode at unicode.org (Phake Nick via Unicode)
Date: Thu, 28 Dec 2017 06:20:10 +0800
Subject: Traditional and Simplified Han in UTS 39
In-Reply-To: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
References: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
Message-ID:

On 28 Dec 2017 at 05:34, "Karl Williamson via Unicode" wrote:
>
> In UTS 39, it says that, optionally,
>
> "Mark Chinese strings as 'mixed script' if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].
>
> "The criterion can only be applied if the language of the string is known to be Chinese."
>
> What does it mean for the language to "be known to be Chinese"?

As in, the string is written in the Chinese language: not in Japanese, not in old Korean or Vietnamese text that uses Chinese characters, nor in any other language that uses Chinese characters.
To my knowledge, some Chinese dialects/variants also use both simplified and traditional characters together, with different etymologies; that probably shouldn't be considered mixed script either, although such usage isn't really common and isn't mentioned in the UTS.

> Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD?
>
> The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese. So in this example we can algorithmically rule out that it's Chinese.

Usually when there are Japanese kana in the mix, the text is Japanese rather than Chinese. However, the reverse is not necessarily true, especially for a single word or short phrase, older-style text, and the like, where a string consisting only of Chinese characters can still be Japanese text.

> And what does Chinese really mean here?

The written form of the (Mandarin) Chinese language?

From unicode at unicode.org Wed Dec 27 16:39:21 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 27 Dec 2017 23:39:21 +0100
Subject: Traditional and Simplified Han in UTS 39
In-Reply-To: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
References: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
Message-ID:

I bet it means the difference in terms of scripts, not in terms of languages. So it says to use "Hani" instead of "Hans" or "Hant" if the character forms cannot be determined, and this will apply equally if the language is Chinese/Mandarin, Cantonese/Yue, Taiwanese, Wu, or even Japanese. For the Japanese language there's an additional mixed-script code "Jpan" for when it uses a mix of sinograms, Katakana and Hiragana.
For the Chinese languages there should be a script code for sinograms+Bopomofo (Bopomofo is rarely used alone, but most often with Traditional sinograms; it occurs sometimes with Simplified sinograms as well). 2017-12-27 22:31 GMT+01:00 Karl Williamson via Unicode : > In UTS 39, it says that, optionally, > > "Mark Chinese strings as 'mixed script' if they contain both simplified > (S) and traditional (T) Chinese characters, using the Unihan data in the > Unicode Character Database [UCD]. > > "The criterion can only be applied if the language of the string is known > to be Chinese." > > What does it mean for the language to "be known to be Chinese"? Is this > something algorithmically determinable, or does it come from information > about the input text that comes from outside the UCD? > > The example given shows some Hiragana in the text. That clearly indicates > the language isn't Chinese. So in this example we can algorithmically rule > out that it's Chinese. > > And what does Chinese really mean here? > > From unicode at unicode.org Wed Dec 27 23:24:52 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 27 Dec 2017 21:24:52 -0800 Subject: Traditional and Simplified Han in UTS 39 In-Reply-To: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com> References: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com> Message-ID: <1c9f4bf5-7589-a5a6-ddd2-dd4de4e5d0a0@ix.netcom.com> The full excerpt from the UTS reads: > Mark Chinese strings as 'mixed script' if they contain both simplified > (S) and traditional (T) Chinese characters, using the Unihan data in > the Unicode Character Database [UCD]. > > 1. The criterion can only be applied if the language of the string is > known to be Chinese. So, for example, the string ????????? ? > is Japanese, and should not be marked as mixed script because of a > mixture of S and T characters. > 2.
Testing for whether a character is S or T needs to be based not on > whether the character /has/ an S or T variant, but whether the > character /is/ an S or T variant. > There are several issues with this. First and foremost, the definition of S and T variants is not something that is universally agreed upon. The .cn, .hk or .tw registries are using a definition of S and T variants that does not agree with the Unihan data in many particulars. Therefore, using the Unihan data would result in false positives. (And false negatives.) Second, there are many characters that are variants acceptable under both "S" and "T" labels. You only have to look at the published Label Generation Rulesets (or IDN tables) for these domains to see many examples. And, as mentioned above, you cannot reverse-engineer these tables from Unihan data. Third, the same domains mentioned have a policy of delegating up to three labels to the same applicant: a "traditional" label, a "simplified" label, and a mixed label matching the spelling of the label in the original application (for situations where a mixed label is appropriate). In other words, certain mixed labels are seen as appropriate. Fourth, the Chinese ccTLDs all have a robust policy of preventing any other mixed label that is a variant of the three from being allocated to an unrelated party. If you "know" that the language has to be Chinese, because the domain is a ccTLD, then at the same time the check is superfluous. Other registries are not known to have similar policies, so for them additional spoof detection may be useful --- however, it is precisely those cases where it's impossible to know whether a label is intended to be in the Chinese language. Fifth, generally the only thing that can be ascertained is that a label is *not* in Chinese: by virtue of having Kana or Hangul characters mixed in. However, the reverse is not true. You will find labels registered under .jp that do not contain Hiragana or Katakana.
Sixth, for zones that are shared by different CJK languages, the state of the art is to have a coordinated policy that prevents "random" variant labels from coexisting in the registry. An example of this kind of effort is being developed for the root zone. By definition, for the root zone, there is no implied information about the language context, unlike the case for the second level, where the presence of a ccTLD in the full domain name may give a clue. Seventh, attempting to determine whether a label is potentially valid based on variant data (of any kind) is doomed, because actual usage is not limited to "pure" labels. The variant mechanism works differently (in those registries that apply it): instead of looking at a single label, the registry can implement "mutual exclusion". Once one variant label from a given set has been delegated, all others are excluded (or in practice, all but three, which are limited to the same applicant). Without access to the registry data, you cannot predict which variants in a set are the "good ones", and with access to the data, spoof labels are rejected and cannot be registered. In conclusion, my recommendation would be to retract this particular passage. A./ On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote: > In UTS 39, it says that, optionally, > > "Mark Chinese strings as 'mixed script' if they contain both > simplified (S) and traditional (T) Chinese characters, using the > Unihan data in the Unicode Character Database [UCD]. > > "The criterion can only be applied if the language of the string is > known to be Chinese." > > What does it mean for the language to "be known to be Chinese"? Is > this something algorithmically determinable, or does it come from > information about the input text that comes from outside the UCD? > > The example given shows some Hiragana in the text. That clearly > indicates the language isn't Chinese.
So in this example we can > algorithmically rule out that it's Chinese. > > And what does Chinese really mean here? > > From unicode at unicode.org Fri Dec 29 19:08:00 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Sat, 30 Dec 2017 01:08:00 +0000 Subject: Linearized tilde? Message-ID: https://en.wikipedia.org/wiki/African_reference_alphabet says "The 1982 revision of the alphabet was made by Michael Mann and David Dalby, who had attended the Niamey conference. It has 60 letters; some are quite different from the 1978 version." and offers the linearized tilde, a tilde squeezed into the space and location of the normal lowercase 'x' or 'o'. (See https://commons.wikimedia.org/wiki/File:Latin_letter_Linearized_tilde_(Mann-Dalby_form).svg ) The German WP article specifically says "Der Buchstabe ist in keine aktuelle Orthografie übernommen und ist auch nicht in Unicode enthalten (Stand 2013, Unicode Version 6.3)." ("The letter has not been adopted into any current orthography and is not included in Unicode (as of 2013, Unicode version 6.3).") Should it be? From unicode at unicode.org Fri Dec 29 20:54:19 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 30 Dec 2017 03:54:19 +0100 Subject: Linearized tilde? In-Reply-To: References: Message-ID: Isn't it a rounded variant of Latin letter n? Then it could exist also in uppercase form (like "n" and "N"). It could also be used as a spacing version of the combining tilde diacritic, to be written after the letter instead of being combined above it (so "el Niño" would be written with it as "el Nino" with a normal letter, without using the encoded tilde symbol as in the ugly "el Nin~o"; or capitalized as "EL NINO" instead of the ugly "EL NIN~O"). I don't think that "LINEARIZED TILDE" is the correct name.
I think it's better named LATIN TILDE LETTER, to be sorted between LATIN LETTER N and LATIN LETTER O (unlike the ASCII tilde symbol, which sorts with other symbols after spaces but before all digits and letters). 2017-12-30 2:08 GMT+01:00 David Starner via Unicode : > https://en.wikipedia.org/wiki/African_reference_alphabet says "The 1982 > revision of the alphabet was made by Michael Mann and David Dalby, who had > attended the Niamey conference. It has 60 letters; some are quite different > from the 1978 version." and offers the linearized tilde, a tilde squeezed > into the space and location of the normal lowercase 'x' or 'o'. (See > https://commons.wikimedia.org/wiki/File:Latin_letter_Linearized_tilde_(Mann-Dalby_form).svg ) > The German WP article > specifically says "Der Buchstabe ist in keine aktuelle Orthografie > übernommen und ist auch nicht > in Unicode enthalten (Stand 2013, > Unicode Version 6.3)." "The letter is not included in any current > spelling and is not included in Unicode." Should it be? > From unicode at unicode.org Sat Dec 30 12:54:44 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sat, 30 Dec 2017 11:54:44 -0700 Subject: Linearized tilde? Message-ID: <343575ACDCA641938CD4C9EC6BA9C909@DougEwell> David Starner wrote: > "The letter is not included in any current spelling and is not > included in Unicode." Should it be? Did anyone ever use the 1982 alphabet, other than Mann and Dalby? If not, I wonder if this letter is a bit like the "proposed new punctuation marks" that show up in proposals from time to time, but have never been used except by their inventors and to talk about them. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Dec 30 12:59:36 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sat, 30 Dec 2017 11:59:36 -0700 Subject: Linearized tilde?
Message-ID: Philippe Verdy wrote: > Isn't it a rounded variant of Latin letter n? Then it could exist > also in uppercase form (like "n" and "N") A defining characteristic of the 1982 African Reference Alphabet was that it was lowercase-only. An uppercase form would be an invention with no basis in history or usage. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Dec 30 13:02:41 2017 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sat, 30 Dec 2017 19:02:41 +0000 Subject: Linearized tilde? In-Reply-To: References: Message-ID: <0DFBECBC-F43D-4F92-869D-678CFCCBDE19@evertype.com> On 30 Dec 2017, at 18:59, Doug Ewell via Unicode wrote: > A defining characteristic of the 1982 African Reference Alphabet was that it was lowercase-only. An uppercase form would be an invention with no basis in history or usage. Which is why it failed. Everybody who used anything like it or derived from it ended up devising capital letters. Doke's click letters are better candidates for encoding.
Michael Everson From unicode at unicode.org Sun Dec 31 12:09:59 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 31 Dec 2017 18:09:59 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <83tvwilqc8.fsf@gnu.org> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> <20171222153635.67628752@JRWUBU2> <83tvwilqc8.fsf@gnu.org> Message-ID: <20171231180959.649681bb@JRWUBU2> On Fri, 22 Dec 2017 17:44:39 +0200 Eli Zaretskii via Unicode wrote: > > Date: Fri, 22 Dec 2017 15:36:35 +0000 > > From: Richard Wordingham via Unicode > > However, it seems > > that one has to modify the source code of Emacs to be able to edit > > in the middle of a cluster > You can always delete a codepoint at a given position in Emacs, > specifying the position by its number, but there are no user-level > commands to conveniently allow doing that in the middle of a grapheme > cluster. > It was never requested nor deemed necessary to provide such a > capability. Whilst not the nicest of mechanisms, it turns out that Emacs does have a 'standard' command auto-composition-mode which will toggle automatic clustering. If one disables automatic clustering, one can then step through the clusters character by character. This is the sort of thing Hariraama has been asking for on the Indic list, though he would like the capability for Microsoft Word. Richard. 
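The thread above is about stepping through text cluster by cluster rather than character by character. Python's standard library has no UAX #29 segmenter, but the basic idea of a "legacy" grapheme cluster (a base character plus any following combining marks) can be sketched with unicodedata. This is a deliberate simplification of what Emacs and ICU do: it ignores Hangul jamo, ZWJ emoji sequences, and, pointedly for this thread, it splits Indic aksharas at the virama.

```python
# Rough "legacy grapheme cluster" segmentation: group each base
# character with the combining marks that follow it. Not a UAX #29
# implementation; an illustration of why character-level and
# cluster-level editing differ.
import unicodedata

def legacy_clusters(text: str):
    cluster = ""
    for ch in text:
        # A character with combining class 0 starts a new cluster,
        # unless it is the very first character seen.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster
```

For example, "e" + U+0301 COMBINING ACUTE comes out as one cluster, while Devanagari क + virama + ष splits into two, which is exactly the kind of behavior the proposed expansion to whole aksharas is meant to change.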
From unicode at unicode.org Sun Dec 31 20:14:36 2017 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Mon, 1 Jan 2018 07:44:36 +0530 Subject: Popular wordprocessors treating U+00A0 as fixed-width Message-ID: While http://unicode.org/reports/tr14/ clearly states that: When expanding or compressing interword space according to common typographical practice, only the spaces marked by U+0020 SPACE and U+00A0 NO-BREAK SPACE are subject to compression, and only spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces marked by U+2009 THIN SPACE are subject to expansion. All other space characters normally have fixed width. ... it is really sad to see the misunderstanding around U+00A0: https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-mso_windows8-mso_2016/nonbreakable-space-justification-in-word-2016/4fa1ad30-004c-454f-9775-a3beaa91c88b?auth=1 https://bugs.documentfoundation.org/show_bug.cgi?id=41652 -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Sun Dec 31 21:43:26 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 1 Jan 2018 04:43:26 +0100 Subject: Popular wordprocessors treating U+00A0 as fixed-width In-Reply-To: References: Message-ID: Well, it's unfortunate that Microsoft's own response (by its MVP) is completely wrong, suggesting to use NARROW NO-BREAK SPACE to get justification, which is exactly the reverse: NNBSP should NOT be justified and should keep its width. Microsoft's developers have absolutely misunderstood the standard, where both SPACE and NBSP should really behave the same for justification (differing only in the existence of the break opportunity). This Microsoft response is completely stupid, and it even breaks the classic typography for French use of NNBSP ("fine" in French) around some punctuation (before :;!?» or after «)
and as group separators in numbers (note that NNBSP was introduced into Unicode very late; before that, NBSP was used only because it was the only non-breaking space available, but it was much too large!). Still, many documents use NBSP instead of NNBSP around punctuation or as group separators (but in Word these contextual occurrences of NBSP, which are easy to detect, could have been auto-replaced when typesetting, or proposed as a correction by the integrated spell checker, at least for French). But the old behavior of old versions of Office (before NNBSP existed in Unicode) should have been cleaned up long ago. It's clear that MS Office developers don't know the standards and do what they want (they also don't know the correct standards for maths in Excel and use a lot of very stupid assumptions, as if they were smarter than their users, who have long suffered from these bugs!) and don't want to fix their past errors. 2018-01-01 3:14 GMT+01:00 Shriramana Sharma via Unicode : > While http://unicode.org/reports/tr14/ clearly states that: > > > When expanding or compressing interword space according to common > typographical practice, only the spaces marked by U+0020 SPACE and > U+00A0 NO-BREAK SPACE are subject to compression, and only spaces > marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces > marked by U+2009 THIN SPACE are subject to expansion. All other space > characters normally have fixed width. > > > ... it is really sad to see the misunderstanding around U+00A0: > > https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-mso_windows8-mso_2016/nonbreakable-space-justification-in-word-2016/4fa1ad30-004c-454f-9775-a3beaa91c88b?auth=1 > > https://bugs.documentfoundation.org/show_bug.cgi?id=41652 > > -- > Shriramana Sharma ???????????? ???????????? ????????????????????????
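The two points in this exchange can be sketched in code: the TR14 rule quoted by Shriramana about which spaces participate in justification, and the contextual NBSP-to-NNBSP replacement Verdy says would be easy to detect in French text. Both functions below are illustrations of those statements, not a description of what any word processor actually does; the French contexts covered (before :;!? and », after «, between digits) are the ones mentioned in the message.

```python
import re

NBSP, NNBSP = "\u00A0", "\u202F"

def justification_behavior(ch: str) -> str:
    """Classify a space per the TR14 passage quoted above."""
    if ch in ("\u0020", NBSP):        # SPACE and NO-BREAK SPACE justify
        return "compress/expand"
    if ch == "\u2009":                # THIN SPACE occasionally expands
        return "occasionally expand"
    return "fixed width"              # everything else, incl. U+202F NNBSP

def french_nbsp_to_nnbsp(text: str) -> str:
    """Replace contextual NBSPs in French text with NNBSP (a sketch)."""
    # NBSP immediately before : ; ! ? or a closing guillemet
    text = re.sub(NBSP + r"(?=[:;!?\u00BB])", NNBSP, text)
    # NBSP immediately after an opening guillemet
    text = re.sub(r"(?<=\u00AB)" + NBSP, NNBSP, text)
    # NBSP used as a digit-group separator, e.g. 1 000 000
    text = re.sub(r"(?<=\d)" + NBSP + r"(?=\d)", NNBSP, text)
    return text
```

The classifier makes the point of the whole thread explicit: NBSP is supposed to stretch under justification exactly like an ordinary space, while NNBSP is one of the fixed-width spaces.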