From irgendeinbenutzername at gmail.com Thu May 1 01:56:01 2025
From: irgendeinbenutzername at gmail.com (Charlotte Eiffel Lilith Buff)
Date: Thu, 1 May 2025 08:56:01 +0200
Subject: Query Regarding NFC Normalization and Script-Specific Exclusions for Devanagari
In-Reply-To: 
References: 
Message-ID: 

Someone on Stack Overflow had the same question a while ago, and Robin
Leroy managed to find some old documents that provide a likely explanation:
https://stackoverflow.com/questions/79104685/in-unicode-why-%e0%a5%98-is-excluded-from-composition-whereas-%c3%85-is-not/79115293#79115293

The short version is that the decomposed forms of U+0958 and similar
letters were the preferred representations by experts at the time,
presumably because ISCII also encoded them using a combining nukta.
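You can see the asymmetry directly in any conformant normalizer. Here is a
minimal sketch, assuming only Python's standard unicodedata module: NFC
leaves <U+0915, U+093C> decomposed because U+0958 is a composition
exclusion, whereas <U+0041, U+030A> still recomposes to U+00C5.

    import unicodedata

    # KA + NUKTA: canonically equivalent to U+0958, but U+0958 is a
    # composition exclusion, so NFC leaves the sequence decomposed.
    qa_decomposed = "\u0915\u093C"
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", qa_decomposed)])
    # ['0x915', '0x93c']

    # A + COMBINING RING ABOVE is not excluded, so NFC recomposes it.
    a_ring = "A\u030A"
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", a_ring)])
    # ['0xc5']

    # The canonical decomposition mapping of U+0958 still exists, so NFD
    # (and the decomposition step of NFC) maps the precomposed letter away.
    print(unicodedata.decomposition("\u0958"))                    # 0915 093C
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u0958")])
    # ['0x915', '0x93c']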
On Thu, 1 May 2025 at 06:37, Neha Gupta via Unicode
<unicode at corp.unicode.org> wrote:

> Dear All,
>
> I have a question regarding Unicode normalization, specifically in the
> context of Indian languages and the Devanagari script. In the Unicode
> Devanagari block, characters such as U+0958 (क़), U+0915 (क), and
> U+093C (◌़) were introduced in Unicode version 1. U+0958 (क़), known as
> DEVANAGARI LETTER QA, visually and functionally represents the
> combination of U+0915 (क) and the NUKTA sign U+093C (◌़).
>
> According to Unicode Normalization Form C (NFC), normalization involves
> first fully decomposing a string, and then recomposing it, except in
> cases where canonical composition is blocked or the composition is
> explicitly excluded in the Unicode Character Database. In this context,
> U+0958 is listed as a script-specific composition exclusion, meaning the
> sequence <U+0915, U+093C> is not normalized (i.e., recomposed) into
> U+0958.
>
> I understand that some characters are deliberately excluded from
> canonical composition to preserve distinctions important in specific
> scripts or historical encoding practices. However, in this case, my
> confusion arises from the fact that U+0958 was introduced in Unicode
> version 1, along with its decomposable components. Given that it
> predates the formalization of many normalization rules, and that the
> canonical equivalence between <U+0915, U+093C> and U+0958 appears
> linguistically justified, I am curious about the rationale behind its
> exclusion from composition.
>
> Could you please help clarify why U+0958 is treated as a composition
> exclusion despite its early inclusion?
>
> Regards,
>
> Neha

From duerst at it.aoyama.ac.jp Thu May 1 04:27:07 2025
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 1 May 2025 18:27:07 +0900
Subject: Query Regarding NFC Normalization and Script-Specific Exclusions for Devanagari
In-Reply-To: 
References: 
Message-ID: <6cadb290-ca66-49c3-a7c2-a58cae9b3c57@it.aoyama.ac.jp>

Hello Neha, Charlotte,

On 2025-05-01 15:56, Charlotte Eiffel Lilith Buff via Unicode wrote:
> Someone on Stack Overflow had the same question a while ago, and Robin
> Leroy managed to find some old documents that provide a likely explanation:
> https://stackoverflow.com/questions/79104685/in-unicode-why-%e0%a5%98-is-excluded-from-composition-whereas-%c3%85-is-not/79115293#79115293
>
> The short version is that the decomposed forms of U+0958 and similar
> letters were the preferred representations by experts at the time,
> presumably because ISCII also encoded them using a combining nukta.

To understand the decisions, a bit more background may help. The goal for
NFC was not, e.g., to save memory by using composite characters. The main
goal was to match data in the wild. This is documented in the first
proposal for what later became labeled NFC, in
https://datatracker.ietf.org/doc/html/draft-duerst-i18n-norm-00. The
relevant paragraph reads:

>>>>
A key design goal of the algorithm was and is that for most identifiers
in current use, applying the algorithm results in the identity transform
(i.e. the identifier is already normalized). This allows to continue to
use existing identifiers and to start to use internationalized
identifiers in new settings even without all the details of the
normalization algorithm having been agreed upon.
>>>>

For Devanagari, that meant favoring the decomposed variants, according to
the experts. This was done by creating the concept of composition
exclusions.

Regards,   Martin.