Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Richard Wordingham via Unicode unicode at unicode.org
Thu Jun 7 02:36:06 CDT 2018


On Tue, 5 Jun 2018 01:37:47 +0100
Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> The decomposed
> form that looks the same is นํ้า <U+0E19, U+0E4D, U+0E49, U+0E32>.
> The problem is that for sane results, <tone mark, SARA AM> needs
> special handling. This sequence is also often untypable - part of the
> protection against Thai homographs.

I've been misquoted on the Rust discussion topic - or the behaviour is
more diverse than I was aware of.  In LibreOffice, with sequence
checking not disabled, once <U+0E19, U+0E4D> has been typed, typing
U+0E49 or U+0E32 immediately afterwards is blocked.  Another mechanism
is for typing another vowel to replace the U+0E4D.  The problem here is
that in standard Thai, U+0E4D may not be followed by another vowel or
tone mark, so Wing Thuk Thi (WTT) rules cut in.  (They're also quite
good at preventing one from typing Northern Khmer.)  In LibreOffice,
an attempt to type the NFKC form <U+0E19, U+0E49, U+0E4D, U+0E32> is
stopped at U+0E4D, though one can get back to the original by
typing U+0E33 instead.  To the rule checker, that is mission
accomplished!
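
For anyone wanting to check the sequences involved, here is a minimal
sketch using Python's standard unicodedata module.  The variable names
and the codepoints() helper are mine, purely for illustration.  It
only shows what NFKC does to the strings; it says nothing about what
any input method or sequence checker will accept.

```python
import unicodedata

def codepoints(s):
    """Format a string as <U+XXXX, U+XXXX, ...> (illustrative helper)."""
    return "<" + ", ".join(f"U+{ord(c):04X}" for c in s) + ">"

# Usual spelling with SARA AM: NO NU, MAI THO, SARA AM
precomposed = "\u0E19\u0E49\u0E33"
# Look-alike from the quoted message: NO NU, NIKHAHIT, MAI THO, SARA AA
lookalike = "\u0E19\u0E4D\u0E49\u0E32"

nfkc_precomposed = unicodedata.normalize("NFKC", precomposed)
nfkc_lookalike = unicodedata.normalize("NFKC", lookalike)

# SARA AM is split into NIKHAHIT + SARA AA; the tone mark stays in front.
print(codepoints(nfkc_precomposed))  # <U+0E19, U+0E49, U+0E4D, U+0E32>
# The look-alike is already in NFKC, so it comes back unchanged.
print(codepoints(nfkc_lookalike))    # <U+0E19, U+0E4D, U+0E49, U+0E32>
# NFKC does not unify the two sequences.
print(nfkc_precomposed == nfkc_lookalike)  # False
```

NIKHAHIT (U+0E4D) has canonical combining class 0, so normalisation
never moves the tone mark across it; that is why the two spellings
stay distinct after NFKC.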

Richard.


