Script-Specific Composition Exclusions

Mon Oct 21 12:13:17 CDT 2024

Excellent research! Thanks a lot!

On Sun, 20 Oct 2024 at 16:14, Robin Leroy <
egg.robin.leroy at gmail.com> wrote:

> On Sun, 20 Oct 2024 at 10:48, Charlotte Eiffel Lilith Buff via Unicode <
> unicode at corp.unicode.org> wrote:
>
>> As I understand it (and I believe this was even the wording used in
>> previous versions of UAX #15), the script-specific exclusions exist because
>> for a handful of characters the fully decomposed form is the preferred
>> representation in regular usage. This makes sense to me for the precomposed
>> Hebrew letters because with so many combining marks with unique CCC values,
>> it just seems easier to deal exclusively with combining character sequences
>> and not have some random marks “glue” themselves to the base letter. The
>> two-part Tibetan subjoined letters are similar in this regard.
>>
>> However, the Indic nuktas seem entirely unproblematic and in fact not all
>> precomposed letters with nukta are composition-excluded: Devanagari has ऩ,
>> ऱ, and ऴ for example.
>>
>> Does anyone remember what led to these specific decisions or knows where
>> to find the relevant documents if they exist?
>>
> I certainly wasn’t involved in Unicode when the relevant documents were
> discussed, as I was busy learning the letters in the Basic Latin block¹,
> but I looked at some of them a couple of years ago.
>
>    - Revision 9 of then-DUTR² #15
>    https://www.unicode.org/reports/tr15/tr15-9.html, dated 1998-11-23,
>    and entered into the registry
>    <https://www.unicode.org/L2/L1998/Register-1998.html> as L2/98-404,
>    does not mention composition exclusions.
>    - The first revision (10) that mentions characters *excluded from
>    being primary composites* is
>    https://www.unicode.org/reports/tr15/tr15-10.html#Definitions,
>    dated 1998-12-16. The rationale is indeed that *This would be to match
>    common practice for scripts that use fully decomposed forms.* The sole
>    example given is FB31.
>    - The next revision (11) includes a list of composition exclusions:
>    https://www.unicode.org/reports/tr15/tr15-11.html#Primary%20Exclusion%20List%20Table,
>    dated 1999-02-25. This list includes 0958..095F.
>
> Between revisions 9 and 10, we have UTC #78, whose minutes are L2/98-419
> <https://www.unicode.org/L2/L1998/98419.pdf>. See the discussion in the
> section titled “Normalization [Document L2/98-404]”, and in particular the
> last comment from Ken Whistler.
> Between revisions 10 and 11, we have UTC #79, whose minutes are L2/99-054R
> <https://www.unicode.org/L2/L1999/99054r.htm#79-0>. In the section
> “Proposed Draft UTR #15, Unicode Normalization”, there is a similar comment
> from Ken towards the end.
> The minutes of UTC #80, L2/99-176
> <https://www.unicode.org/L2/L1999/99176.htm>, have some discussion of
> normalization, and motion 80-M25 letting the editorial committee change the
> composition exclusions table; but by that point 0958 is already in there,
> so digging there isn’t going to help.
>
> However, some later documents provide relevant context:
>
>    - L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf> (p.
>    17, in the section on Devanagari).
>    - L2/01-305 <https://www.unicode.org/L2/L2001/01305-india-resp.txt>
>    (section on Devanagari).
>
> So there was clear feedback from India that U+0958 क़ and friends should be
> discouraged; presumably the UTC must have been aware of that in 1999. On
> the distinction between क़ vs. ऴ, I guess this is related to ऴ being atomic
> in ISCII; in turn that is because while ऴ is decomposable, corresponding
> letters in other ISCII scripts (ழ, ఴ, ഴ) are not. See also point (viii) of
> L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf>; there
> still was a desire to make the encodings similar between the scripts.
>
> I am sure Ken can provide more details.
>
> Best regards,
>
> Robin Leroy
>
> ―
> ¹ As well as a few from the Latin-1 Supplement and Latin Extended-A blocks.
> ² This predates L2/00-118
> <https://www.unicode.org/L2/L2000/00118-parts.txt> and UTC decision 83-C6
> <https://www.unicode.org/L2/L2000/00115.htm#83-C6> which gave us the
> terms UAX and UTS.
>
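
For anyone who wants to see the asymmetry discussed above in practice, here is a minimal sketch using Python's standard unicodedata module. The helper names (nfc_keeps_precomposed, codepoints) are made up for illustration, and the results depend on the Unicode data bundled with your Python build, although the exclusions in question have been stable since the early revisions of UTR #15.

```python
import unicodedata


def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def nfc_keeps_precomposed(ch: str) -> bool:
    """Return True if NFC leaves the precomposed character ch unchanged.

    A composition-excluded character has a canonical decomposition but is
    never recomposed, so NFC returns the decomposed sequence instead.
    """
    return unicodedata.normalize("NFC", ch) == ch


if __name__ == "__main__":
    # Characters mentioned in the thread (illustrative selection only).
    samples = ["\u0958", "\u0929", "\u0931", "\u0934", "\uFB31"]
    for ch in samples:
        name = unicodedata.name(ch)
        nfc = unicodedata.normalize("NFC", ch)
        status = "kept" if nfc_keeps_precomposed(ch) else "decomposed"
        print(f"{codepoints(ch)} {name}: NFC = {codepoints(nfc)} ({status})")
```

With the data I would expect, U+0929, U+0931 and U+0934 survive NFC as single code points, while U+0958 and U+FB31 come out as a base letter followed by a combining mark.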