Can NFKC turn valid UAX 31 identifiers into non-identifiers?
Richard Wordingham via Unicode
unicode at unicode.org
Wed Jun 6 22:08:51 CDT 2018
On Mon, 4 Jun 2018 12:49:20 -0700
Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
> The Rust community is considering
> <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii
> identifiers, which follow UAX #31
> <http://www.unicode.org/reports/tr31/> (XID_Start XID_Continue*, with
> tweaks). The proposal also asks for identifiers to be treated as
> equivalent under NFKC.
> Are there any cases where this will lead to inconsistencies? I.e. can
> the NFKC of a valid UAX 31 ident be invalid UAX 31?
> (In general, are there other problems folks see with this proposal?)
Confusable checking may need to be reviewed. There are several cases
where, sometimes depending on the font, anagrams (differing
even after normalisation) can render the same. The examples I
know of are from SE Asia. The categories I know of are:
a) Swapping subscript letters - a big issue in the Myanmar script, though
Sanskrit grv- and gvr- can also easily be rendered the same. I don't know
how easily confusion arises from 'finger trouble' (typing slips).
b) The sequences <vowel, subscript consonant> and <subscript consonant,
vowel> often look the same in Khmer and Tai Tham. The former spelling was
supposedly dropped in Khmer a century ago (the consonant ceasing to be
subscript), but lingered on in a few words and is acknowledged by Unicode
but not by the Microsoft font developer's guide.
c) Unresolved grammar. In Thai minority languages, U+0E3A THAI
CHARACTER PHINTHU and a mark above (U+0E34 THAI CHARACTER SARA I, I
believe) can and do occur in either order, with no difference in
appearance or meaning.
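
To illustrate category (c), here is a minimal Python sketch using only the
standard unicodedata module. The base consonant U+0E01 THAI CHARACTER KO KAI
is an arbitrary choice of mine, and the helper name same_after_nfkc is
hypothetical. It checks whether the two mark orders are unified by NFKC and
prints the canonical combining classes that decide whether normalisation may
reorder them.

```python
import unicodedata

KO_KAI = "\u0e01"   # THAI CHARACTER KO KAI (arbitrary base, my assumption)
SARA_I = "\u0e34"   # THAI CHARACTER SARA I
PHINTHU = "\u0e3a"  # THAI CHARACTER PHINTHU

def same_after_nfkc(a, b):
    """Return True if a and b are identical once both are NFKC-normalised."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

order_1 = KO_KAI + SARA_I + PHINTHU
order_2 = KO_KAI + PHINTHU + SARA_I

# Combining classes of the two marks, as known to this Python's Unicode data.
print(unicodedata.combining(SARA_I), unicodedata.combining(PHINTHU))
# Whether Python accepts each spelling as an identifier.
print(order_1.isidentifier(), order_2.isidentifier())
# Whether NFKC treats the two spellings as the same string.
print(same_after_nfkc(order_1, order_2))
```

If the last line prints False, the two spellings would be distinct
identifiers under an NFKC-only equivalence, even though they can render
identically.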
The obvious humane solution is a brutal folding of the sequences.
(Using spell-checkers works wonders on normal text, but
spell-checking code is tricky.)
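
A folding of this kind could be sketched as follows in Python. This is only
an illustration under my own assumptions: the table FOLDS, the function names
fold_identifier and same_identifier, and the choice of which order is
"preferred" are all hypothetical, and a real table would need an entry for
each confusable sequence in each script.

```python
import unicodedata

SARA_I = "\u0e34"   # THAI CHARACTER SARA I
PHINTHU = "\u0e3a"  # THAI CHARACTER PHINTHU

# Hypothetical fold table: each key sequence is rewritten to its value.
# The preferred order chosen here is arbitrary, for illustration only.
FOLDS = {
    PHINTHU + SARA_I: SARA_I + PHINTHU,
}

def fold_identifier(name):
    """Normalise to NFKC, then rewrite each listed variant sequence."""
    folded = unicodedata.normalize("NFKC", name)
    for variant, preferred in FOLDS.items():
        folded = folded.replace(variant, preferred)
    return folded

def same_identifier(a, b):
    """Compare two identifiers after NFKC plus the extra folding."""
    return fold_identifier(a) == fold_identifier(b)

# The two orders compare equal after folding; different marks do not.
print(same_identifier("\u0e01\u0e34\u0e3a", "\u0e01\u0e3a\u0e34"))
print(same_identifier("\u0e01\u0e34", "\u0e01\u0e3a"))
```

The fold is applied only for comparison; the source text itself is left as
the author typed it.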
I actually suggested a character (U+1A54 TAI THAM LETTER GREAT SA) so
that folding 'ses' to 'sse' would not result in the 'ss' conjunct being
used; the conjunct is not used in 'ses'.