Long-Encoded Restricted Characters in High Frequency Modern Use

Sun Jun 1 05:42:39 CDT 2014

On Sat, 31 May 2014 21:27:55 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:

> The structure of the data is based on the use of NFKC characters in
> identifiers. So SARA AM and the Lao equivalent are both not NFKC
> characters, and are categorized as such, and would need to be
> represented by their NFKC forms. The process is in
> http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection

There's no absolute IETF prohibition on NFKC characters.

> > Now, U+0E4D THAI
> > CHARACTER NIKHAHIT is classified as 'allowed; recommended', although
> > its main use is in writing Pali, which would suggest that it should
> > be 'restricted; historic' or 'restricted; limited-use'.

> For that, it would be best to submit via
> http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
> feedback form at http://www.unicode.org/reporting.html, just to be
> sure.

I have no desire to restrict NIKHAHIT simply because of limited use.
The problem is simply the confusion caused by the existence of SARA
AM.  Unicode support for the compatibility decomposition of SARA AM is
incomplete, in part irremediably so.  The problem is that <tonemark,
SARA AM> has a different appearance to <tonemark, NIKHAHIT, SARA AA>.
In the former, the tone mark is the topmost glyph; in the latter, the
nikhahit is the topmost glyph.  <tonemark, SARA AM> usually has the
same appearance as <NIKHAHIT, tonemark, SARA AA>, which is what
Uniscribe effectively converts it to.
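
The mismatch can be reproduced with Python's standard unicodedata
module.  This is only a sketch: MAI EK stands in for an arbitrary tone
mark, and the variable names are mine.

```python
import unicodedata

MAI_EK = "\u0E48"    # THAI CHARACTER MAI EK, used here as the tone mark
NIKHAHIT = "\u0E4D"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0E32"   # THAI CHARACTER SARA AA
SARA_AM = "\u0E33"   # THAI CHARACTER SARA AM

# <tonemark, SARA AM>, the usual spelling in Thai text.
typed = MAI_EK + SARA_AM

# NFKC applies the compatibility decomposition of SARA AM.  NIKHAHIT
# has ccc=0, so canonical reordering cannot move it before the tone mark.
nfkc = unicodedata.normalize("NFKC", typed)

# The sequence that usually renders like <tonemark, SARA AM>.
rendered_like = NIKHAHIT + MAI_EK + SARA_AA

for label, text in (("typed", typed), ("NFKC", nfkc),
                    ("rendered like", rendered_like)):
    print(label, [unicodedata.name(ch) for ch in text])
```

NFKC yields <tonemark, NIKHAHIT, SARA AA>, not the sequence that matches
the usual rendering.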

There used to be filters in place to stop <NIKHAHIT, SARA AA>
being typed.  It's not unknown for <tonemark, SARA AM> to be mistyped
as <NIKHAHIT, tonemark, SARA AA>, and that too used to be blocked.
DUCET has a contraction for <NIKHAHIT, SARA AA> to reduce the
ill-effects, but of course the contraction doesn't work for the
sequence <NIKHAHIT, tonemark, SARA AA>.  (Action on me: CLDR ticket on
omission for th locale.)

In short, the co-existence of SARA AM and NIKHAHIT, which has ccc=0, causes
problems.  The simplest solution is to restrict NIKHAHIT, which should
be tolerable. Ideally, one would merely prohibit the sequence
\p{Mn}*\u0E4D\p{Mn}*\u0E32.
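
A rough sketch of such a check follows.  Python's standard re module
has no \p{Mn}, so the sketch scans with unicodedata.category instead;
the function name and its parameters are hypothetical, not part of any
existing tool.  The leading \p{Mn}* in the pattern does not affect
whether a match exists, so the scan starts at NIKHAHIT.

```python
import unicodedata

NIKHAHIT = "\u0E4D"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0E32"   # THAI CHARACTER SARA AA

def has_prohibited_sequence(text, mark=NIKHAHIT, vowel=SARA_AA):
    """Return True if `mark` is followed, after zero or more
    nonspacing marks (General_Category Mn), by `vowel`."""
    for i, ch in enumerate(text):
        if ch != mark:
            continue
        j = i + 1
        # Skip any run of nonspacing marks, such as tone marks.
        while j < len(text) and unicodedata.category(text[j]) == "Mn":
            j += 1
        if j < len(text) and text[j] == vowel:
            return True
    return False

KO_KAI = "\u0E01"  # THAI CHARACTER KO KAI, an arbitrary base consonant
MAI_EK = "\u0E48"  # THAI CHARACTER MAI EK, a tone mark

print(has_prohibited_sequence(KO_KAI + NIKHAHIT + SARA_AA))
print(has_prohibited_sequence(KO_KAI + NIKHAHIT + MAI_EK + SARA_AA))
print(has_prohibited_sequence(KO_KAI + MAI_EK + "\u0E33"))  # with SARA AM
```

The first two calls report the prohibited sequence; the third, which
uses SARA AM itself, does not.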

There is no virtue in making both NIKHAHIT and SARA AM 'restricted'.
Indeed, one could argue that applying the compatibility decomposition
to SARA AM brings NIKHAHIT into 'high frequency modern use' - it
depends on the frequency of NFKC and NFKD conversions.  However, the
compatibility decomposition of SARA AM is simply *wrong* as Thai text.

It would be good to hear from someone at Thailand's National Electronics
and Computer Technology Center (NECTEC) on the matter of SARA AM in
domain names.

The sequence-prohibiting solution ought to extend to Lao, but there may
be the additional problem of the tone mark being applied to the SARA
AM.  The m17n Lao keyboard on my computer actually comes with a single
keystroke for the sequence <SARA AM, MAI THO>!  (Action on me: File a
bug report against the keyboard.)
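
For reference, a small sketch of what that keystroke's output looks like
once the compatibility decomposition is applied.  The variable names
are mine, and the keystroke's output is taken from the description
above rather than from the keyboard source.

```python
import unicodedata

LAO_AM = "\u0EB3"         # LAO VOWEL SIGN AM
LAO_MAI_THO = "\u0EC9"    # LAO TONE MAI THO
LAO_NIGGAHITA = "\u0ECD"  # LAO NIGGAHITA
LAO_AA = "\u0EB2"         # LAO VOWEL SIGN AA

# <SARA AM, MAI THO>, as described for the single keystroke.
keystroke_output = LAO_AM + LAO_MAI_THO

# NFKD replaces LAO VOWEL SIGN AM with <NIGGAHITA, AA>.  AA has ccc=0,
# so the tone mark stays after the vowel sign.
decomposed = unicodedata.normalize("NFKD", keystroke_output)

print([unicodedata.name(ch) for ch in decomposed])
```

Here the tone mark ends up after the vowel sign, so a check that only
looks for marks between the niggahita and the vowel sign would need to
consider this ordering separately.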

Richard.