Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Richard Wordingham via Unicode unicode at unicode.org
Mon Jun 4 19:37:47 CDT 2018


On Mon, 4 Jun 2018 12:49:20 -0700
Manish Goregaokar via Unicode <unicode at unicode.org> wrote:

> Hi,
> 
> The Rust community is considering
> <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii
> identifiers, which follow UAX #31
> <http://www.unicode.org/reports/tr31/> (XID_Start XID_Continue*, with
> tweaks). The proposal also asks for identifiers to be treated as
> equivalent under NFKC.

> (In general, are there other problems folks see with this proposal?)

There's the usual lurking issue with the Thai word for water, น้ำ
<U+0E19 THAI CHARACTER NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI
CHARACTER SARA AM>. Converted to NFKC it becomes น้ํา <U+0E19, U+0E49,
U+0E4D THAI CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER SARA AA>, which is
unacceptable and often untypable and uncopiable. The decomposed form
that looks the same as the original is นํ้า <U+0E19, U+0E4D, U+0E49,
U+0E32>. The problem is that for sane results, <tone mark, SARA AM>
needs special handling. This sequence is also often untypable - part of
the protection against Thai homographs.
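
For anyone who wants to reproduce the sequences above, here is a minimal
sketch using Python's standard unicodedata module. The helper name
codepoints() is made up for illustration, and the results depend on the
Unicode database bundled with the interpreter; this only demonstrates the
normalization behaviour, not how Rust or any other compiler would handle
identifiers.

```python
import unicodedata

def codepoints(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# The word as normally typed: NO NU, MAI THO, SARA AM.
water = "\u0e19\u0e49\u0e33"

# NFKC applies the compatibility decomposition of SARA AM
# (U+0E33 -> U+0E4D NIKHAHIT + U+0E32 SARA AA), leaving the tone mark
# in front of NIKHAHIT.
nfkc = unicodedata.normalize("NFKC", water)

# The sequence that renders like the original: NIKHAHIT before the tone mark.
lookalike = "\u0e19\u0e4d\u0e49\u0e32"

print(codepoints(water))      # U+0E19 U+0E49 U+0E33
print(codepoints(nfkc))       # U+0E19 U+0E49 U+0E4D U+0E32
print(codepoints(lookalike))  # U+0E19 U+0E4D U+0E49 U+0E32

# Canonical reordering cannot swap the two marks: NIKHAHIT has combining
# class 0, so it acts as a barrier for MAI THO (class 107).
print(unicodedata.combining("\u0e49"), unicodedata.combining("\u0e4d"))  # 107 0

# The lookalike is already stable under NFKC and stays distinct from the
# NFKC form of the original word.
print(unicodedata.normalize("NFKC", lookalike) == nfkc)  # False
```

NFC alone leaves the original word untouched, since the decomposition of
SARA AM is a compatibility mapping and not a canonical one.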

Richard.


