Can NFKC turn valid UAX 31 identifiers into non-identifiers?
Richard Wordingham via Unicode
unicode at unicode.org
Mon Jun 4 19:37:47 CDT 2018
On Mon, 4 Jun 2018 12:49:20 -0700
Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
> Hi,
>
> The Rust community is considering
> <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii
> identifiers, which follow UAX #31
> <http://www.unicode.org/reports/tr31/> (XID_Start XID_Continue*, with
> tweaks). The proposal also asks for identifiers to be treated as
> equivalent under NFKC.
> (In general, are there other problems folks see with this proposal?)
There's the usual lurking issue that the Thai word for water, น้ำ
<U+0E19 THAI CHARACTER NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI
CHARACTER SARA AM>, becomes unacceptable, and often untypable and
uncopiable, when converted to NFKC: น้ํา <U+0E19, U+0E49, U+0E4D THAI
CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER SARA AA>. The decomposed form
that looks the same as the original is นํ้า <U+0E19, U+0E4D, U+0E49,
U+0E32>. The problem is that, for sane results, <tone mark, SARA AM>
needs special handling.
This sequence is also often untypable, as part of the protection against
Thai homographs.
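
To make the code point sequences above concrete, here is a minimal Python
sketch using the standard unicodedata module. The helper
decompose_sara_am and the choice of U+0E48..U+0E4B as the tone marks are
my own illustrative assumptions, not something UAX #31 or the Rust RFC
specifies; it only shows one way <tone mark, SARA AM> could be given
special handling before plain NFKC is applied.

```python
import unicodedata

# Assumption for illustration: the four Thai tone marks
# MAI EK, MAI THO, MAI TRI, MAI CHATTAWA (U+0E48..U+0E4B).
THAI_TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"
SARA_AM = "\u0e33"
NIKHAHIT = "\u0e4d"
SARA_AA = "\u0e32"


def decompose_sara_am(text: str) -> str:
    """Hypothetical helper: expand SARA AM so that NIKHAHIT precedes
    a preceding tone mark, i.e. <tone, SARA AM> becomes
    <NIKHAHIT, tone, SARA AA>."""
    out = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
        elif out and out[-1] in THAI_TONE_MARKS:
            tone = out.pop()
            out.extend([NIKHAHIT, tone, SARA_AA])
        else:
            out.extend([NIKHAHIT, SARA_AA])
    return "".join(out)


def code_points(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


water = "\u0e19\u0e49\u0e33"  # NO NU, MAI THO, SARA AM

# Plain NFKC leaves the tone mark in front of NIKHAHIT.
plain_nfkc = unicodedata.normalize("NFKC", water)

# Special handling first, then NFKC, which leaves the result unchanged.
special = unicodedata.normalize("NFKC", decompose_sara_am(water))

print(code_points(plain_nfkc))
print(code_points(special))
```

The first line printed is the problematic NFKC form; the second is the
decomposed form that looks the same as the original. The two strings
differ, so an implementation that compares identifiers by plain NFKC
alone would not treat them as the same identifier.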
Richard.