Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Joan Montané via Unicode unicode at unicode.org
Thu Jun 7 06:32:13 CDT 2018


2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <unicode at unicode.org>:

> Hi,
>
> The Rust community is considering
> <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii
> identifiers, which follow UAX #31 <http://www.unicode.org/reports/tr31/>
> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
> identifiers to be treated as equivalent under NFKC.
>
> Are there any cases where this will lead to inconsistencies? I.e. can the
> NFKC of a valid UAX 31 ident be invalid UAX 31?
>

Yes, such a case exists, for instance in the Latin alphabet as used for Catalan.

* Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT (U+013F), decomposes under NFKC to
LATIN CAPITAL LETTER L (U+004C) + MIDDLE DOT (U+00B7): <L,·>
* ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT (U+0140), decomposes under NFKC to
LATIN SMALL LETTER L (U+006C) + MIDDLE DOT (U+00B7): <l,·>
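
The two decompositions above can be checked with a minimal Python sketch that
uses only the standard-library unicodedata module. The helper name
nfkc_codepoints is made up for illustration, and the sketch only shows the
normalization result; it does not test identifier validity.

```python
import unicodedata


def nfkc_codepoints(ch):
    """Return the NFKC form of `ch` as a list of 'U+XXXX' strings."""
    normalized = unicodedata.normalize("NFKC", ch)
    return [f"U+{ord(c):04X}" for c in normalized]


# U+013F and U+0140 are the two precomposed "L with middle dot" letters.
for ch in ("\u013F", "\u0140"):
    print(f"U+{ord(ch):04X} -> {' '.join(nfkc_codepoints(ch))}")
```

Each precomposed letter comes back as two code points, the base letter followed
by U+00B7, so an identifier containing Ŀ or ŀ changes length under NFKC.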

Ŀ and ŀ are (were) used in Catalan to encode the geminate L [1] when it is
(was) encoded with only 2 characters. The preferred (and commonly used)
encoding is currently the 3-character one: <L,·,L>. So, some adjustments
are needed if you want to support Catalan-language identifiers [2].

Yours,
Joan Montané


[1] https://en.wikipedia.org/wiki/Interpunct#Catalan
[2] http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments