Can NFKC turn valid UAX 31 identifiers into non-identifiers?
Joan Montané via Unicode
unicode at unicode.org
Thu Jun 7 06:32:13 CDT 2018
2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <unicode at unicode.org>:
> The Rust community is considering
> <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii
> identifiers, which follow UAX #31 <http://www.unicode.org/reports/tr31/>
> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
> identifiers to be treated as equivalent under NFKC.
> Are there any cases where this will lead to inconsistencies? I.e. can the
> NFKC of a valid UAX 31 ident be invalid UAX 31?
Yes, such a case exists, for instance in the Latin alphabet, for the Catalan language.
* Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT (U+013F), decomposes under NFKC to
LATIN CAPITAL LETTER L (U+004C) MIDDLE DOT (U+00B7): <L,·>
* ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT (U+0140), decomposes under NFKC to
LATIN SMALL LETTER L (U+006C) MIDDLE DOT (U+00B7): <l,·>
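The two decompositions above can be checked with a short script. This is only a minimal sketch using Python's standard unicodedata module; the helper name nfkc_codepoints is made up for illustration, and the script shows only what NFKC produces, not whether a given language accepts the result as an identifier.

```python
import unicodedata


def nfkc_codepoints(text: str) -> list[str]:
    """Return the code points of the NFKC form of `text` as U+XXXX strings."""
    normalized = unicodedata.normalize("NFKC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


# Hypothetical helper applied to the two characters discussed above.
for ch in ("\u013F", "\u0140"):
    print(unicodedata.name(ch), "->", " ".join(nfkc_codepoints(ch)))
```

Each single precomposed character comes back as two code points, the base letter followed by U+00B7.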
Ŀ and ŀ are (were) used in Catalan to encode the geminate L
when it is (was) encoded using only 2 characters. The preferred (and commonly
used) encoding is currently the 3-character one: <L,·,L>. So some adjustments
are needed if you want to support Catalan-language identifiers.