Script-Specific Composition Exclusions
Robin Leroy
egg.robin.leroy at gmail.com
Sun Oct 20 09:14:00 CDT 2024
On Sun, 20 Oct 2024 at 10:48, Charlotte Eiffel Lilith Buff via Unicode <
unicode at corp.unicode.org> wrote:
> As I understand it (and I believe this was even the wording used in
> previous versions of UAX #15), the script-specific exclusions exist because
> for a handful of characters the fully decomposed form is the preferred
> representation in regular usage. This makes sense to me for the precomposed
> Hebrew letters because with so many combining marks with unique CCC values,
> it just seems easier to deal exclusively with combining character sequences
> and not have some random marks “glue” themselves to the base letter. The
> two-part Tibetan subjoined letters are similar in this regard.
>
> However, the Indic nuktas seem entirely unproblematic and in fact not all
> precomposed letters with nukta are composition-excluded: Devanagari has ऩ,
> ऱ, and ऴ for example.
>
> Does anyone remember what led to these specific decisions or know where
> to find the relevant documents if they exist?
>
I certainly wasn’t involved in Unicode when the relevant documents were
discussed, as I was busy learning the letters in the Basic Latin block¹,
but I looked at some of them a couple of years ago.
- Revision 9 of then-DUTR² #15
https://www.unicode.org/reports/tr15/tr15-9.html, dated 1998-11-23, and
entered into the registry
<https://www.unicode.org/L2/L1998/Register-1998.html> as L2/98-404, does
not mention composition exclusions.
- The first revision (10) that mentions characters *excluded from being
primary composites* is
https://www.unicode.org/reports/tr15/tr15-10.html#Definitions,
dated 1998-12-16. The rationale is indeed that *This would be to match
common practice for scripts that use fully decomposed forms.* The sole
example given is FB31.
- The next revision (11) includes a list of composition exclusions:
https://www.unicode.org/reports/tr15/tr15-11.html#Primary%20Exclusion%20List%20Table,
dated 1999-02-25. This list includes 0958..095F.
Between revisions 9 and 10, we have UTC #78, whose minutes are L2/98-419
<https://www.unicode.org/L2/L1998/98419.pdf>. See the discussion in the
section titled “Normalization [Document L2/98-404]”, and in particular the
last comment from Ken Whistler.
Between revisions 10 and 11, we have UTC #79. Its minutes, L2/99-054R
<https://www.unicode.org/L2/L1999/99054r.htm#79-0>, record a similar
comment from Ken towards the end of the section “Proposed Draft UTR #15,
Unicode Normalization”.
The minutes of UTC #80, L2/99-176
<https://www.unicode.org/L2/L1999/99176.htm>, have some discussion of
normalization, and motion 80-M25 letting the editorial committee change the
composition exclusions table; but by that point 0958 is already in there,
so digging there isn’t going to help.
However, some later documents provide relevant context:
- L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf> (p.
17, in the section on Devanagari).
- L2/01-305 <https://www.unicode.org/L2/L2001/01305-india-resp.txt>
(section on Devanagari).
So there was clear feedback from India that U+0958 क़ and friends should be
discouraged; presumably the UTC was already aware of that in 1999. On
the distinction between क़ and ऴ, I guess this is related to ऴ being atomic
in ISCII; in turn that is because while ऴ is decomposable, corresponding
letters in other ISCII scripts (ழ, ఴ, ഴ) are not. See also point (viii) of
L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf>; there
was still a desire to make the encodings similar between the scripts.
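
For anyone who wants to see the practical effect of the exclusions discussed above, here is a minimal sketch using Python's standard unicodedata module. The helper names (nfc, code_points) and the two lists are mine, chosen for illustration only; the results depend on the Unicode data shipped with your Python, though these particular characters have been stable for a long time.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize to NFC (canonical decomposition, then canonical composition)."""
    return unicodedata.normalize("NFC", s)


def code_points(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Composition-excluded precomposed characters mentioned in this thread:
# DEVANAGARI LETTER QA and HEBREW LETTER BET WITH DAGESH.
EXCLUDED = ["\u0958", "\uFB31"]

# Precomposed Devanagari nukta letters that are not composition-excluded:
# NNNA, RRA, LLLA.
NOT_EXCLUDED = ["\u0929", "\u0931", "\u0934"]

for ch in EXCLUDED + NOT_EXCLUDED:
    print(code_points(ch), "->", code_points(nfc(ch)))
```

The two excluded characters come out of NFC as a base letter followed by a combining mark, whereas the three non-excluded nukta letters come out unchanged as single code points.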
I am sure Ken can provide more details.
Best regards,
Robin Leroy
―
¹ As well as a few from the Latin-1 Supplement and Latin Extended-A blocks.
² This predates L2/00-118 <https://www.unicode.org/L2/L2000/00118-parts.txt>
and UTC decision 83-C6 <https://www.unicode.org/L2/L2000/00115.htm#83-C6> which
gave us the terms UAX and UTS.