Misspelling or Miscoding?
Asmus Freytag
asmusf at ix.netcom.com
Thu Jan 19 16:25:14 CST 2017
OK, I was first thinking you had something more in mind like ordering of
(e.g. Lao?) tone marks that normally do not render exactly the same, but
close, and where some font/rendering engine could go and make them
identical in an effort to be helpful. In those cases one can presume a
preferred ordering, and, in principle, that can be imposed upon a text,
whether via autocorrect or spell check.
Now I'm thinking your focus was more on cases like the two Khmer
subjoined consonant sequences:
U+17D2 U+178A ្ដ KHMER CONSONANT SIGN COENG DA
U+17D2 U+178F ្ត KHMER CONSONANT SIGN COENG TA
that apparently have identical appearance, even though one is a 'd' and
the other a 't'. (That's the only example that I'm personally familiar
with).
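To make the distinction visible, here is a minimal Python sketch that lists the code points in each sequence and checks whether any Unicode normalization form unifies them. The helper names (describe, same_after_normalization) are my own, made up for illustration; only the standard unicodedata module is used.

```python
import unicodedata

# The two sequences discussed above: COENG followed by DA or by TA.
COENG_DA = "\u17D2\u178A"
COENG_TA = "\u17D2\u178F"

def describe(seq):
    """Return 'U+XXXX NAME' for every code point in seq."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in seq]

def same_after_normalization(a, b):
    """True if any standard normalization form maps a and b to the same string."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(f, a) == unicodedata.normalize(f, b) for f in forms
    )

if __name__ == "__main__":
    for seq in (COENG_DA, COENG_TA):
        print(describe(seq))
    print("unified by normalization:", same_after_normalization(COENG_DA, COENG_TA))
```

Normalization leaves both sequences as they are, so any unification has to come from a layer above it (spell check, autocorrect, or registration rules).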
Unless some fonts do make a distinction, this seems to be a case where
"miscoding" might be an appropriate term. As far as the user is
concerned, the issue only arises because of the encoding scheme used. (A
hypothetical different scheme that had one of these precomposed with a
name containing something like DA OR TA would not have surfaced an
invisible distinction).
Are your examples likewise legitimate duplications, or merely the case
that one could type something else and have it look the same (accidentally)?
The Khmer example would seem fairly resistant to automated correction if
it is a free choice. If, instead, the immediately preceding consonant
comes from two disjoint sets, for example if TA COENG TA was possible,
but not TA COENG DA, then there's scope for spell check.
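As a rough sketch of what such a check could look like, assuming the supposition above holds: the rule table below is hypothetical and contains only the "TA COENG TA but not TA COENG DA" case. It is not a statement about actual Khmer orthography, and the names (PREFERRED_AFTER, correct_subjoined) are invented for illustration.

```python
COENG = "\u17D2"  # KHMER SIGN COENG
DA = "\u178A"     # KHMER LETTER DA
TA = "\u178F"     # KHMER LETTER TA

# Hypothetical rule table: base consonant -> the only member of the DA/TA
# pair accepted in subjoined position after it. A base that is absent from
# the table is treated as a free choice and left untouched.
PREFERRED_AFTER = {TA: TA}

def correct_subjoined(text):
    """Replace a subjoined DA/TA that the rule table disallows after its base."""
    out = list(text)
    for i in range(len(out) - 2):
        base, sign, sub = out[i], out[i + 1], out[i + 2]
        if sign == COENG and sub in (DA, TA):
            preferred = PREFERRED_AFTER.get(base)
            if preferred is not None and sub != preferred:
                out[i + 2] = preferred
    return "".join(out)
```

With this table, TA COENG DA is rewritten to TA COENG TA, while DA COENG DA passes through unchanged because no rule covers that base.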
In designing label generation rules for domain names, one clearly
doesn't want two labels that cannot be distinguished other than on the
encoding level. For Khmer, the decision was to allow both, but not
simultaneously (by allowing only one member of each minimal pair to be
registered, which one is decided by the order of application).
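The "first applicant wins" idea can be sketched as follows. This is only an illustration under my own assumptions, not the actual label generation rules: the skeleton function folds COENG DA into COENG TA purely as a lookup key, and the Registry class is hypothetical.

```python
COENG = "\u17D2"  # KHMER SIGN COENG
DA = "\u178A"     # KHMER LETTER DA
TA = "\u178F"     # KHMER LETTER TA
KA = "\u1780"     # KHMER LETTER KA, used only in the usage example

def skeleton(label):
    """Fold COENG DA into COENG TA so both members of a pair share one key."""
    return label.replace(COENG + DA, COENG + TA)

class Registry:
    """Toy registry: the first label of a minimal pair blocks the other."""

    def __init__(self):
        self._by_skeleton = {}

    def register(self, label):
        key = skeleton(label)
        if key in self._by_skeleton:
            return False  # the same label or its look-alike is already taken
        self._by_skeleton[key] = label
        return True

if __name__ == "__main__":
    registry = Registry()
    print(registry.register(KA + COENG + DA))  # first applicant
    print(registry.register(KA + COENG + TA))  # look-alike of the first label
```

Either spelling can be registered, but once one is in, the other is refused, which matches the "both, but not simultaneously" outcome.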
A./
On 1/19/2017 12:45 AM, Richard Wordingham wrote:
> On Wed, 18 Jan 2017 23:24:21 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>> The sequence of character codes isn't necessarily determined by the
>> typist's choice of keystrokes.
> Wow! ESP for input?
>
>> For example, autocorrection and similar support can result in a
>> substitution of character codes. For scripts with this issue, it
>> would be useful if such mechanisms were more widespread; effectively
>> normalizing to a preferred input order.
> That's not the problem I have in mind. Dotted circles can help, but
> for Northern Thai in the Lanna script, USE has accidentally (I hope)
> banned 17% of the vocabulary and demanded that a further 37% be
> misspelt. It will be much the same for Tai Khuen. Once USE is
> fixed, the problem is that the encodings of */hi:m/ and /mi:/ may be
> different but render identically; it so happens that words like the
> former are rare. Are you aware of predictive input causing havoc with
> intellectual content?
>
>> Arguing over whether this is called mistyping or miscoding or
>> misspelling is perhaps less helpful than trying to get the word out
>> that some scripts could strongly benefit from that additional
>> software layer.
> Enabling that may require some tools to update to Unicode 5.1.
> (Hunspell, I'm looking at you.)
>
> One thing that would be helpful is some way of showing the difference
> between distinctly encoded homographs if a spell-checker can help. (I
> fear it may not be quite the right tool - different suggestion logic is
> needed.) Coloured fonts may help once support for them has spread, but
> we're probably still looking at bespoke tools to switch such hints on
> and off. In the past I've used transliteration fonts to check what I've
> actually typed.
>
> One problem with getting the message out is choosing the right words.
> That's why I came here for advice on the terminology for such issues.
>
> Richard.
>