From irgendeinbenutzername at gmail.com Sun Oct 20 03:43:18 2024
From: irgendeinbenutzername at gmail.com (Charlotte Eiffel Lilith Buff)
Date: Sun, 20 Oct 2024 10:43:18 +0200
Subject: Script-Specific Composition Exclusions
Message-ID:

This question on Stack Overflow sent me on a wild Google spree yesterday trying to find the reason why certain characters are included in the Composition_Exclusion set, particularly the Devanagari, Bengali, Gurmukhi, and Oriya letters with nukta, but I wasn’t able to locate any relevant documents from back then.

As I understand it (and I believe this was even the wording used in previous versions of UAX #15), the script-specific exclusions exist because for a handful of characters the fully decomposed form is the preferred representation in regular usage. This makes sense to me for the precomposed Hebrew letters because with so many combining marks with unique CCC values, it just seems easier to deal exclusively with combining character sequences and not have some random marks “glue” themselves to the base letter. The two-part Tibetan subjoined letters are similar in this regard.
However, the Indic nuktas seem entirely unproblematic and in fact not all precomposed letters with nukta are composition-excluded: Devanagari has ऩ, ऱ, and ऴ for example.

Does anyone remember what led to these specific decisions or know where to find the relevant documents if they exist?

From egg.robin.leroy at gmail.com Sun Oct 20 09:14:00 2024
From: egg.robin.leroy at gmail.com (Robin Leroy)
Date: Sun, 20 Oct 2024 16:14:00 +0200
Subject: Script-Specific Composition Exclusions
In-Reply-To:
References:
Message-ID:

On Sun, 20 Oct 2024 at 10:48, Charlotte Eiffel Lilith Buff via Unicode <unicode at corp.unicode.org> wrote:

> As I understand it (and I believe this was even the wording used in
> previous versions of UAX #15), the script-specific exclusions exist because
> for a handful of characters the fully decomposed form is the preferred
> representation in regular usage. This makes sense to me for the precomposed
> Hebrew letters because with so many combining marks with unique CCC values,
> it just seems easier to deal exclusively with combining character sequences
> and not have some random marks “glue” themselves to the base letter. The
> two-part Tibetan subjoined letters are similar in this regard.
> However, the Indic nuktas seem entirely unproblematic and in fact not all
> precomposed letters with nukta are composition-excluded: Devanagari has ऩ,
> ऱ, and ऴ for example.
>
> Does anyone remember what led to these specific decisions or know where
> to find the relevant documents if they exist?

I certainly wasn’t involved in Unicode when the relevant documents were discussed, as I was busy learning the letters in the Basic Latin block¹, but I looked at some of them a couple of years ago.

- Revision 9 of then-DUTR² #15 https://www.unicode.org/reports/tr15/tr15-9.html, dated 1998-11-23, and entered into the registry as L2/98-404, does not mention composition exclusions.
- The first revision (10) that mentions characters *excluded from being primary composites* is https://www.unicode.org/reports/tr15/tr15-10.html#Definitions, dated 1998-12-16. The rationale is indeed that *This would be to match common practice for scripts that use fully decomposed forms.* The sole example given is FB31.
- The next revision (11) includes a list of composition exclusions: https://www.unicode.org/reports/tr15/tr15-11.html#Primary%20Exclusion%20List%20Table, dated 1999-02-25. This list includes 0958..095F.

Between revisions 9 and 10, we have UTC #78, whose minutes are L2/98-419.
See the discussion in the section titled “Normalization [Document L2/98-404]”, and in particular the last comment from Ken Whistler.
Between revisions 10 and 11, we have UTC #79, in whose minutes L2/99-054R, in the section “Proposed Draft UTR #15, Unicode Normalization”, we get a similar comment from Ken towards the end.
The minutes of UTC #80, L2/99-176, have some discussion of normalization, and motion 80-M25 letting the editorial committee change the composition exclusions table; but by that point 0958 is already in there, so digging there isn’t going to help.

However, some later documents provide relevant context:

- L2/01-304 (p. 17, in the section on Devanagari).
- L2/01-305 (section on Devanagari).

So there was clear feedback from India that U+0958 क़ and friends should be discouraged; presumably the UTC must have been aware of that in 1999. On the distinction between ? vs. ?, I guess this is related to ? being atomic in ISCII; in turn that is because while ? is decomposable, corresponding letters in other ISCII scripts (?, ?, ?) are not. See also point (viii) of L2/01-304; there still was a desire to make the encodings similar between the scripts.

I am sure Ken can provide more details.

Best regards,

Robin Leroy

¹ As well as a few from the Latin-1 Supplement and Latin Extended-A blocks.
² This predates L2/00-118 and UTC decision 83-C6 which gave us the terms UAX and UTS.

From irgendeinbenutzername at gmail.com Mon Oct 21 12:13:17 2024
From: irgendeinbenutzername at gmail.com (Charlotte Eiffel Lilith Buff)
Date: Mon, 21 Oct 2024 19:13:17 +0200
Subject: Script-Specific Composition Exclusions
In-Reply-To:
References:
Message-ID:

Excellent research! Thanks a lot!

From prosfilaes at gmail.com Sun Oct 27 23:38:11 2024
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 27 Oct 2024 23:38:11 -0500
Subject: Archaic Maltese characters
Message-ID:

https://babel.hathitrust.org/cgi/pt?id=hvd.hx5bpd&seq=11 is a scan of an 1831 Maltese reader, with an unusual alphabet.
Most of it is encoded, but a few letters have been missed. https://en.wikipedia.org/wiki/Maltese_alphabet has an article I'll be referring to; it shows different 1788 and 1845 alphabets.

Wikipedia says "/w/ was written as ⟨w⟩, ⟨u⟩ or as a modified u (not present in Unicode)." There's a line for U and apparently U for w, but the lower-case versions don't have a tail. I couldn't find it in Unicode.

There's a couple of characters that can be combined with existing h's with hooks and one that can be combined with Cyrillic ?. The mirrored gamma could be encoded as turned L, but the lowercase forms don't match.

Wikipedia says "Until the middle of the 19th century, two sounds which would merge into /??/ were differentiated in Maltese. These were variously represented as ⟨gh⟩, ⟨?h⟩, ⟨gh?⟩, ⟨gh?⟩ and with two letters not represented in Unicode (they resembled an upside down U)." These are the most functionally unencoded characters: turned U and turned U with hook.

I'm not going to push them through, but it seems like fertile ground for a proposal. We could have saved some time by encoding rotation operators, but that didn't happen, and there are good reasons for it not to have happened.

-- 
The standard is written in English. If you have trouble understanding
a particular section, read it again and again and again . . .
Sit up straight. Eat your vegetables. Do not mumble.
-- _Pascal_, ISO 7185 (1991)