Requiring typed text to be NFKC

Richard Wordingham via Unicode unicode at unicode.org
Wed Jun 6 12:55:53 CDT 2018


On Tue, 5 Jun 2018 19:48:53 -0700
Manish Goregaokar via Unicode <unicode at unicode.org> wrote:

> Following up from my previous email
> <https://www.unicode.org/mail-arch/unicode-ml/y2018-m06/0007.html>,
> one of the ideas that was brought up was that if we're going to
> consider NFKC forms equivalent, we should require things to be typed
> in NFKC.
> 
> 
> I'm a bit wary of this. As Richard brought up in that thread, some
> Thai NFKC forms are untypable. I *suspect* there are Hangul keyboards
> (perhaps physical non-IME based ones) that have this problem.
> 
> Do folks have other examples? Interested in both:

I don't know of any different problems for NFKC, but there are problems
with getting people to enter normalised data.

>  - Words (as in, real things people will want to type) where a
> keyboard/IME does not type the NFKC form

There are problems with insisting that users type normalised text.
Vietnamese is probably a real issue here; the standard keyboard is set
up to enter vowels (some of which are accented) and tone marks
separately.  Indeed, with the nặnɡ tone (as in the vowel of its name),
one is likely to find the codepoint sequence <U+0103 LATIN SMALL LETTER
A WITH BREVE, U+0323 COMBINING DOT BELOW> which is not NFC, not NFD and
not even FCD.
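
As an illustration, here is a minimal Python sketch that checks those
three claims with the standard unicodedata module.  The helper is_fcd
is a hypothetical name, not a library function; it assumes the usual
pairwise test on canonical combining classes is an adequate FCD check.

```python
import unicodedata


def is_fcd(text):
    """Hypothetical helper: True if text passes the pairwise FCD test.

    For each character, take its full canonical decomposition and compare
    the combining class of its first code point with the combining class
    of the last code point of the previous character's decomposition.
    """
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = unicodedata.combining(decomposed[-1])
    return True


def code_points(text):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in text]


# The sequence a Vietnamese keyboard is likely to produce.
typed = "\u0103\u0323"  # A WITH BREVE, COMBINING DOT BELOW

nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", typed)

print(code_points(nfc))   # ['U+1EB7']
print(code_points(nfd))   # ['U+0061', 'U+0323', 'U+0306']
print(typed == nfc, typed == nfd, is_fcd(typed))   # False False False
```

The sequence fails the FCD test because the breve inside U+0103 has
combining class 230 and the following dot below has class 220.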

>  - Words where the NFKC form is *visually* distinct enough that it
> will look weird to native speakers

There may be issues with BMP CJK compatibility ideographs.  I don't
know how far they've been replaced by variation sequences requesting
the same appearance.
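
For reference, CJK compatibility ideographs have singleton canonical
decompositions, so every normalisation form, not only NFKC, replaces
them with the unified ideograph.  A small Python sketch, using U+F900
purely as a sample character (the function name is illustrative):

```python
import unicodedata


def normalised_forms(text):
    """Return the text under each of the four normalisation forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


compat = "\uf900"  # CJK COMPATIBILITY IDEOGRAPH-F900

for form, result in normalised_forms(compat).items():
    # Each form yields the unified ideograph U+8C48.
    print(form, [f"U+{ord(c):04X}" for c in result])
```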

>  - Words where a keyboard/IME *can* type the NFKC form but users are
> not used to it

Well, typing Tai Khuen in normalised form is hideously
counter-intuitive, but at present the Universal Shaping Engine (USE)
makes displaying correctly spelt text a struggle for a font.  The
problem there is that the usual way of typing a closed syllable with a
tone mark gets normalised at the end to <SAKOT, tone mark,
final_consonant>; that normalisation broke early pre-USE OpenType-based
fonts as databases caught up with Unicode 5.2.  That problem was
promptly cured by HarfBuzz tweaking its internal normalisation, until
USE unintentionally outlawed correct spelling.

A universal keyboard for entering large swathes of the Latin script is
not a very big problem, but entering text with diacritics in form NFC on
such a keyboard is a real pain.  This problem might arise when editing a
Hungarian program without a Hungarian keyboard.  The program development
environment would have to provide a normalisation tool.
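
A minimal sketch of what such a tool might do, in Python.  The function
names and the sample line are hypothetical, and the sketch assumes that
normalising the whole source text to NFC is acceptable; a real tool
would have to decide whether string literals should be left alone.

```python
import unicodedata


def to_nfc(source_text):
    """Return the source text converted to NFC."""
    return unicodedata.normalize("NFC", source_text)


def non_nfc_lines(source_text):
    """Return the 1-based numbers of the lines that are not already NFC."""
    return [number
            for number, line in enumerate(source_text.splitlines(), start=1)
            if unicodedata.normalize("NFC", line) != line]


# "o" followed by COMBINING DOUBLE ACUTE ACCENT, as a keyboard without a
# dedicated key for U+0151 might produce it.
sample = 'nev = "Erdo\u030bs"\nx = 1\n'

print(non_nfc_lines(sample))                          # [1]
print(to_nfc(sample) == 'nev = "Erd\u0151s"\nx = 1\n')  # True
```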

Richard.


