Query Regarding NFC Normalization and Script-Specific Exclusions for Devanagari
Neha Gupta
nehagupta2885 at gmail.com
Wed Apr 30 23:35:08 CDT 2025
Dear All,
I have a question regarding Unicode normalization, specifically in the
context of Indian languages and the Devanagari script. In the Unicode
Devanagari block, characters such as U+0958 (क़), U+0915 (क), and U+093C (़)
were introduced in Unicode version 1. U+0958 (क़), known as DEVANAGARI
LETTER QA, visually and functionally represents the combination of U+0915
(क) and the NUKTA sign U+093C (़).
According to Unicode Normalization Form C (NFC), normalization involves
first fully decomposing a string, and then recomposing it, except in cases
where canonical composition is blocked or the composition is explicitly
excluded in the Unicode Character Database. In this context, U+0958 is
listed as a script-specific composition exclusion, meaning that NFC does
not recompose the sequence <U+0915, U+093C> into U+0958, and that U+0958
itself normalizes to that two-character sequence.
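
For concreteness, here is a minimal sketch of the behaviour I am asking
about, assuming Python's standard unicodedata module (whose results depend
on the Unicode data bundled with the interpreter). The helper codepoints()
and the variable names are my own, for illustration only. The Latin pair
<U+0065, U+0301> is included only as a contrast, since it is not a
composition exclusion.

```python
import unicodedata

QA = "\u0958"               # DEVANAGARI LETTER QA (precomposed form)
KA_NUKTA = "\u0915\u093C"   # DEVANAGARI LETTER KA followed by SIGN NUKTA


def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# NFC of the precomposed letter: decomposed, and not recomposed afterwards.
nfc_of_qa = unicodedata.normalize("NFC", QA)

# NFC of the two-character sequence: left as it is.
nfc_of_seq = unicodedata.normalize("NFC", KA_NUKTA)

# Contrast: a pair that is not excluded does recompose under NFC.
nfc_of_e_acute = unicodedata.normalize("NFC", "e\u0301")

print(codepoints(nfc_of_qa))           # U+0915 U+093C
print(codepoints(nfc_of_seq))          # U+0915 U+093C
print(codepoints(nfc_of_e_acute))      # U+00E9
print(unicodedata.decomposition(QA))   # 0915 093C
```

In other words, the two spellings remain canonically equivalent, but the
NFC form of both is the decomposed sequence rather than U+0958.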
I understand that some characters are deliberately excluded from canonical
composition to preserve distinctions important in specific scripts or
historical encoding practices. However, in this case, my confusion arises
from the fact that U+0958 was introduced in Unicode version 1, along with
its decomposable components. Given that it predates the formalization of
many normalization rules, and that the canonical equivalence between
<U+0915, U+093C> and U+0958 appears linguistically justified, I am curious
about the rationale behind its exclusion from composition.
Could you please help clarify why U+0958 is treated as a composition
exclusion despite its early inclusion?
Regards,
Neha