From irgendeinbenutzername at gmail.com  Sun Oct 20 03:43:18 2024
From: irgendeinbenutzername at gmail.com (Charlotte Eiffel Lilith Buff)
Date: Sun, 20 Oct 2024 10:43:18 +0200
Subject: Script-Specific Composition Exclusions
Message-ID: <CAKLR3Ap_qmvnXx0ejAxbdP3NuaPcZ3k=GZ4m0W9=BVOmuRHLbA@mail.gmail.com>

This question on Stack Overflow
<https://stackoverflow.com/questions/79104685/in-unicode-why-%e0%a5%98-is-excluded-from-composition-whereas-%c3%85-is-not>
sent me on a wild Google spree yesterday trying to find the reason why
certain characters are included in the Composition_Exclusion set,
particularly the Devanagari, Bengali, Gurmukhi, and Oriya letters with
nukta, but I wasn’t able to locate any relevant documents from back then.

As I understand it (and I believe this was even the wording used in
previous versions of UAX #15), the script-specific exclusions exist because
for a handful of characters the fully decomposed form is the preferred
representation in regular usage. This makes sense to me for the precomposed
Hebrew letters because with so many combining marks with unique CCC values,
it just seems easier to deal exclusively with combining character sequences
and not have some random marks “glue” themselves to the base letter. The
two-part Tibetan subjoined letters are similar in this regard.
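
To illustrate the point about combining classes, here is a minimal sketch, assuming Python's standard-library unicodedata module. The helper name ccc and the sample sequence are only illustrative. It prints the canonical combining classes of two Hebrew points and shows that normalization reorders the marks and leaves the letter decomposed:

```python
import unicodedata

# HEBREW LETTER BET, HEBREW POINT DAGESH OR MAPIQ, HEBREW POINT PATAH
BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"


def ccc(ch: str) -> int:
    """Canonical_Combining_Class of a single character (illustrative helper)."""
    return unicodedata.combining(ch)


def code_points(s: str) -> str:
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# A sequence with the dagesh typed before the patah.
typed = BET + DAGESH + PATAH

# Canonical ordering sorts the marks by combining class.
nfd = unicodedata.normalize("NFD", typed)

# U+FB31 (bet with dagesh) is composition-excluded, so NFC does not
# recompose it and the result is the same as NFD.
nfc = unicodedata.normalize("NFC", typed)

print("ccc(dagesh) =", ccc(DAGESH), " ccc(patah) =", ccc(PATAH))
print("typed:", code_points(typed))
print("NFD:  ", code_points(nfd))
print("NFC:  ", code_points(nfc))
```

Without the exclusion, NFC would presumably attach the dagesh to the bet as U+FB31 and leave the patah trailing behind it.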

However, the Indic nuktas seem entirely unproblematic and in fact not all
precomposed letters with nukta are composition-excluded: Devanagari has ऩ,
ऱ, and ऴ for example.
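
The difference can be observed directly. The following is a minimal sketch, assuming Python's standard-library unicodedata module; the helper name nfc_keeps_precomposed and the choice of sample code points are mine, for illustration only:

```python
import unicodedata


def nfc_keeps_precomposed(ch: str) -> bool:
    """True if NFC leaves the precomposed character as a single code point."""
    return unicodedata.normalize("NFC", ch) == ch


samples = (
    0x0958,  # DEVANAGARI LETTER QA: composition-excluded
    0x0929,  # DEVANAGARI LETTER NNNA: decomposable, but not excluded
    0x00C5,  # LATIN CAPITAL LETTER A WITH RING ABOVE: not excluded
    0xFB31,  # HEBREW LETTER BET WITH DAGESH: composition-excluded
)

for cp in samples:
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    shown = " ".join(f"U+{ord(c):04X}" for c in nfc)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: NFC = {shown}")
```

The excluded characters come out of NFC as base letter plus mark, while the others survive as a single code point.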

Does anyone remember what led to these specific decisions or knows where
to find the relevant documents if they exist?

From egg.robin.leroy at gmail.com  Sun Oct 20 09:14:00 2024
From: egg.robin.leroy at gmail.com (Robin Leroy)
Date: Sun, 20 Oct 2024 16:14:00 +0200
Subject: Script-Specific Composition Exclusions
In-Reply-To: <CAKLR3Ap_qmvnXx0ejAxbdP3NuaPcZ3k=GZ4m0W9=BVOmuRHLbA@mail.gmail.com>
References: <CAKLR3Ap_qmvnXx0ejAxbdP3NuaPcZ3k=GZ4m0W9=BVOmuRHLbA@mail.gmail.com>
Message-ID: <CAK6dhvy2BQOBsvFbYgkXYefvLt2d0AFW0AdtT9fh9-Wf9nNzFQ@mail.gmail.com>

On Sun, 20 Oct 2024 at 10:48, Charlotte Eiffel Lilith Buff via Unicode <
unicode at corp.unicode.org> wrote:

> As I understand it (and I believe this was even the wording used in
> previous versions of UAX #15), the script-specific exclusions exist because
> for a handful of characters the fully decomposed form is the preferred
> representation in regular usage. This makes sense to me for the precomposed
> Hebrew letters because with so many combining marks with unique CCC values,
> it just seems easier to deal exclusively with combining character sequences
> and not have some random marks “glue” themselves to the base letter. The
> two-part Tibetan subjoined letters are similar in this regard.
>
> However, the Indic nuktas seem entirely unproblematic and in fact not all
> precomposed letters with nukta are composition-excluded: Devanagari has ऩ,
> ऱ, and ऴ for example.
>
> Does anyone remember what led to these specific decisions or knows where
> to find the relevant documents if they exist?
>
I certainly wasn’t involved in Unicode when the relevant documents were
discussed, as I was busy learning the letters in the Basic Latin block¹,
but I looked at some of them a couple of years ago.

   - Revision 9 of then-DUTR² #15
   https://www.unicode.org/reports/tr15/tr15-9.html, dated 1998-11-23, and
   entered into the registry
   <https://www.unicode.org/L2/L1998/Register-1998.html> as L2/98-404, does
   not mention composition exclusions.
   - The first revision (10) that mentions characters *excluded from being
   primary composites* is
   https://www.unicode.org/reports/tr15/tr15-10.html#Definitions,
   dated 1998-12-16. The rationale is indeed that *This would be to match
   common practice for scripts that use fully decomposed forms.* The sole
   example given is FB31.
   - The next revision (11) includes a list of composition exclusions:
   https://www.unicode.org/reports/tr15/tr15-11.html#Primary%20Exclusion%20List%20Table,
   dated 1999-02-25. This list includes 0958..095F.

Between revisions 9 and 10, we have UTC #78, whose minutes are L2/98-419
<https://www.unicode.org/L2/L1998/98419.pdf>. See the discussion in the
section titled “Normalization [Document L2/98-404]”, and in particular the
last comment from Ken Whistler.
Between revisions 10 and 11, we have UTC #79, in whose minutes L2/99-054R
<https://www.unicode.org/L2/L1999/99054r.htm#79-0>, in the section
“Proposed Draft UTR #15, Unicode Normalization”, we get a similar comment
from Ken towards the end.
The minutes of UTC #80, L2/99-176
<https://www.unicode.org/L2/L1999/99176.htm>, have some discussion of
normalization, and motion 80-M25 letting the editorial committee change the
composition exclusions table; but by that point 0958 is already in there,
so digging there isn’t going to help.

However, some later documents provide relevant context:

   - L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf> (p.
   17, in the section on Devanagari).
   - L2/01-305 <https://www.unicode.org/L2/L2001/01305-india-resp.txt>
   (section on Devanagari).

So there was clear feedback from India that U+0958 क़ and friends should be
discouraged; presumably the UTC must have been aware of that in 1999. On
the distinction between ? vs. ?, I guess this is related to ? being atomic
in ISCII; in turn that is because while ? is decomposable, corresponding
letters in other ISCII scripts (?, ?, ?) are not. See also point (viii) of
L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf>; there
still was a desire to make the encodings similar between the scripts.

I am sure Ken can provide more details.

Best regards,

Robin Leroy

¹ As well as a few from the Latin-1 Supplement and Latin Extended-A blocks.
² This predates L2/00-118 <https://www.unicode.org/L2/L2000/00118-parts.txt>
and UTC decision 83-C6 <https://www.unicode.org/L2/L2000/00115.htm#83-C6> which
gave us the terms UAX and UTS.

From irgendeinbenutzername at gmail.com  Mon Oct 21 12:13:17 2024
From: irgendeinbenutzername at gmail.com (Charlotte Eiffel Lilith Buff)
Date: Mon, 21 Oct 2024 19:13:17 +0200
Subject: Script-Specific Composition Exclusions
In-Reply-To: <CAK6dhvy2BQOBsvFbYgkXYefvLt2d0AFW0AdtT9fh9-Wf9nNzFQ@mail.gmail.com>
References: <CAKLR3Ap_qmvnXx0ejAxbdP3NuaPcZ3k=GZ4m0W9=BVOmuRHLbA@mail.gmail.com>
 <CAK6dhvy2BQOBsvFbYgkXYefvLt2d0AFW0AdtT9fh9-Wf9nNzFQ@mail.gmail.com>
Message-ID: <CAKLR3Ar5=BL4uiW8ZAw11LRaOBVdw0AbzTb_Dc2gUwmON339Ow@mail.gmail.com>

Excellent research! Thanks a lot!


From prosfilaes at gmail.com  Sun Oct 27 23:38:11 2024
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 27 Oct 2024 23:38:11 -0500
Subject: Archaic Maltese characters
Message-ID: <CAMZ=zj73G=HZesm33Hmsgp8TLq3Prq5XvQFQY0bnTiMFY-af1Q@mail.gmail.com>

https://babel.hathitrust.org/cgi/pt?id=hvd.hx5bpd&seq=11 is a scan of
an 1831 Maltese reader, with an unusual alphabet. Most of it is
encoded, but a few have been missed.
https://en.wikipedia.org/wiki/Maltese_alphabet has an article I'll be
referring to; it shows different 1788 and 1845 alphabets.

Wikipedia says "/w/ was written as ⟨w⟩, ⟨u⟩ or as a modified u (not
present in Unicode)." There's a line for U and apparently U for w, but
the lower-case versions don't have a tail. I couldn't find it in
Unicode.

There are a couple of characters that can be combined with existing h's
with hooks and one that can be combined with Cyrillic ?. The mirrored
gamma could be encoded as turned L, but the lowercase forms don't
match.

Wikipedia says "Until the middle of the 19th century, two sounds which
would merge into /??/ were differentiated in Maltese. These were
variously represented as ⟨gh⟩, ⟨?h⟩, ⟨gh?⟩, ⟨gh?⟩ and with two letters
not represented in Unicode (they resembled an upside down U). " These
are the most functionally unencoded characters; turned U and turned U
with hook.

I'm not going to push them through, but it seems like fertile ground
for a proposal. We could have saved some time by encoding rotation
operators, but that didn't happen, and there are good reasons for it not
to have happened.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)