Query Regarding NFC Normalization and Script-Specific Exclusions for Devanagari
Charlotte Eiffel Lilith Buff
irgendeinbenutzername at gmail.com
Thu May 1 01:56:01 CDT 2025
Someone on Stack Overflow had the same question a while ago, and Robin
Leroy managed to find some old documents that provide a likely explanation:
https://stackoverflow.com/questions/79104685/in-unicode-why-%e0%a5%98-is-excluded-from-composition-whereas-%c3%85-is-not/79115293#79115293
The short version is that experts at the time preferred the decomposed
forms of U+0958 and similar letters, presumably because ISCII also
encoded them using a combining nukta.
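
For anyone who wants to see the effect of the exclusion directly, here is a
minimal sketch using Python's standard unicodedata module. The constant and
helper names (QA, KA_NUKTA, nfc, nfd) are my own, chosen for illustration,
and the results depend on the Unicode data bundled with your Python build.
It contrasts U+0958, which stays decomposed under NFC, with U+00C5, which
is recomposed.

```python
import unicodedata

# Illustrative names only; they are not taken from any specification.
QA = "\u0958"                # DEVANAGARI LETTER QA
KA_NUKTA = "\u0915\u093C"    # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA
A_RING = "\u00C5"            # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_RING = "A\u030A"      # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def nfc(s: str) -> str:
    """Return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    """Return s in Normalization Form D."""
    return unicodedata.normalize("NFD", s)


# U+0958 has a canonical decomposition to <U+0915, U+093C> ...
print(unicodedata.decomposition(QA))
print(nfd(QA) == KA_NUKTA)

# ... but because it is a composition exclusion, NFC does not recompose it:
# both the precomposed letter and the sequence end up as <U+0915, U+093C>.
print(nfc(QA) == KA_NUKTA)
print(nfc(KA_NUKTA) == KA_NUKTA)

# By contrast, U+00C5 is not excluded, so NFC recomposes the sequence.
print(nfc(A_PLUS_RING) == A_RING)
```

So under NFC the precomposed U+0958 never survives, whereas the sequence
with the combining nukta is stable.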
On Thu, May 1, 2025 at 06:37, Neha Gupta via Unicode <
unicode at corp.unicode.org> wrote:
> Dear All,
>
> I have a question regarding Unicode normalization, specifically in the
> context of Indian languages and the Devanagari script. In the Unicode
> Devanagari block, characters such as U+0958 (क़), U+0915 (क), and U+093C (़)
> were introduced in Unicode version 1. U+0958 (क़), known as DEVANAGARI
> LETTER QA, visually and functionally represents the combination of U+0915
> (क) and the NUKTA sign U+093C (़).
>
> According to Unicode Normalization Form C (NFC), normalization involves
> first fully decomposing a string, and then recomposing it, except in cases
> where canonical composition is blocked or the composition is explicitly
> excluded in the Unicode Character Database. In this context, U+0958 is
> listed as a script-specific composition exclusion, meaning the sequence
> <U+0915, U+093C> is not normalized (i.e., recomposed) into U+0958.
>
> I understand that some characters are deliberately excluded from canonical
> composition to preserve distinctions important in specific scripts or
> historical encoding practices. However, in this case, my confusion arises
> from the fact that U+0958 was introduced in Unicode version 1, along with
> its decomposable components. Given that it predates the formalization of
> many normalization rules, and that the canonical equivalence between
> <U+0915, U+093C> and U+0958 appears linguistically justified, I am curious
> about the rationale behind its exclusion from composition.
>
> Could you please help clarify why U+0958 is treated as a composition
> exclusion despite its early inclusion?
>
> Regards,
>
> Neha
>