Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Thu Jun 7 07:00:02 CDT 2018

If you intend to allow all of the standard orthography of common languages,
you would also need to support apostrophes and regular hyphens in
identifiers, including the ASCII ones!
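
As a rough check of which of these characters an identifier lexer accepts
today, here is a minimal Python sketch. It assumes that Python's built-in
str.isidentifier(), which is based on XID_Start and XID_Continue, is an
acceptable stand-in for the default UAX #31 rules; the Rust proposal may
differ in detail, and the helper name continues_identifier is only
illustrative.

```python
# Sketch: which word-internal punctuation marks survive inside an identifier?
# Assumption: str.isidentifier() approximates the default UAX #31 definition
# (XID_Start XID_Continue*, with "_" also allowed as a start character).

def continues_identifier(ch: str) -> bool:
    """Return True if `ch` is accepted in a non-initial identifier position."""
    return ("a" + ch + "a").isidentifier()

candidates = {
    "APOSTROPHE": "\u0027",
    "HYPHEN-MINUS": "\u002d",
    "MIDDLE DOT": "\u00b7",
}

results = {name: continues_identifier(ch) for name, ch in candidates.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Under that assumption the middle dot is accepted, while the apostrophe and
the ASCII hyphen are not.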

The Catalan middle dot is just a compact variant of the hyphen. It would
have been better as a diacritic, but placing diacritics above the letter
l/L, with its high ascender, caused problems when rendering with compact
line heights.

Polish chose an overstriking slash to avoid that problem. Another
diacritic, such as a cedilla below, could have been used, but the middle
dot was easier to add between the two handwritten "ll" (after composing
the rest of the word, without having to lift the pen from the surface).

The vertical placement of the "middle" dot also varies widely in
handwriting. I have seen it drawn as a short stroke (horizontal or
slanted), which is easier to place by hand. The dot can easily fall on the
vertical strokes, and when "ll" is handwritten the two curls frequently
touch each other, so the dot may in fact fall in the middle of the curl of
the first l. In that case it looks very much like the Polish l with a
stroke, or like an l followed by an apostrophe before the second l.

2018-06-07 13:32 GMT+02:00 Joan Montané via Unicode <unicode at unicode.org>:

>
>
> 2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <
> unicode at unicode.org>:
>
>> Hi,
>>
>> The Rust community is considering
>> <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii
>> identifiers, which follow UAX #31 <http://www.unicode.org/reports/tr31/>
>> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
>> identifiers to be treated as equivalent under NFKC.
>>
>> Are there any cases where this will lead to inconsistencies? I.e. can the
>> NFKC of a valid UAX 31 ident be invalid UAX 31?
>>
>
> Yes, such a case exists, for instance in the Latin alphabet and the Catalan language.
>
> * Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT <U+013F> NFKC decomposes to
> LATIN CAPITAL LETTER L (U+004C) MIDDLE DOT (U+00B7): <L,·>
> * ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT <U+0140> NFKC decomposes to
> LATIN SMALL LETTER L (U+006C) MIDDLE DOT (U+00B7): <l,·>
>
> Ŀ and ŀ are (were) used in Catalan for encoding the geminate L [1]
> when it is (was) encoded using 2 characters only. The preferred (and
> commonly used) encoding is currently that of 3 characters: <L,·,L>. So,
> some adjustments are needed if you want to support Catalan identifiers [2]
>
> Yours,
> Joan Montané
>
>
> [1] https://en.wikipedia.org/wiki/Interpunct#Catalan
> [2] http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments
>
>
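
The two decompositions listed above can be reproduced with Python's
standard unicodedata module. This is only a sketch for inspecting the NFKC
forms; the helper name nfkc_report is hypothetical, and whether the result
is still a valid identifier depends on the UAX #31 profile in use.

```python
import unicodedata

def nfkc_report(s: str) -> list[str]:
    """Return the code points of the NFKC form of `s` as 'U+XXXX NAME' strings."""
    normalized = unicodedata.normalize("NFKC", s)
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in normalized]

# U+013F and U+0140 are the precomposed L/l WITH MIDDLE DOT characters.
for ch in ("\u013f", "\u0140"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    for line in nfkc_report(ch):
        print("   ->", line)
```

Each precomposed letter comes back as two code points, the plain letter
followed by U+00B7 MIDDLE DOT.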