Misspelling or Miscoding?

Asmus Freytag asmusf at ix.netcom.com
Thu Jan 19 20:41:07 CST 2017


On 1/19/2017 5:04 PM, Richard Wordingham wrote:
> On Thu, 19 Jan 2017 14:25:14 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>> Now I'm thinking your focus was more on cases like the two Khmer
>> subjoined consonant sequences:
>> U+17D2 U+178A     ្ដ         KHMER CONSONANT SIGN COENG DA
>> U+17D2 U+178F     ្ត         KHMER CONSONANT SIGN COENG TA
>> that apparently have identical appearance, even though one is a 'd'
>> and the other a 't'. (That's the only example that I'm personally
>> familiar with).
>> Unless some fonts ever make a distinction, this seems to be a case
>> where "miscoding" might be an appropriate term. As far as the user is
>> concerned, the issue only arises because of the encoding scheme used.
>> (A hypothetical different scheme that had one of these precomposed
>> with a name containing something like DA OR TA would have not
>> surfaced an invisible distinction).
> Such a font might be KHOM2004 mentioned by Michel Antelme in his paper
> aefek.free.fr/iso_album/antelme_bis.pdf.  On p25 he makes the point
> that a distinct COENG DA was still on its last legs in Cambodia in the
> 1920's; it's still distinct in the Khom variety of the script.  This
> situation makes a good case for the Tibetan model.  We might end up
> making the Khmer script a mixed system like Tai Tham by adding a
> character KHMER CONSONANT SIGN ARCHAIC COENG DA.
>
> There seem to be some Arabic script analogues, where only one or two
> forms differ between a pair of letters.
Yes, and these are treated similarly to the Khmer case in label 
generation rulesets for domain names.
>
> This is not the situation I was interested in, but it's clearly related.
Funny thing is, not actually knowing Khmer, I hadn't thought of the 
COENG DA as a "form of DA", but had considered the sequence its own entity.

In Latin you have two characters that look like a reversed e but have 
different upper cases, which is why they are encoded separately. (You 
could argue that picking the wrong member of a disunified set is a 
miscoding, but I think "misspelling" works fine. In another context we 
limit the term "misspelling" to phono-something or typo/grapho-something 
*possible* spellings, and try not to restrict them for that purpose. The 
"impossible" ones are ones that we expect some font or renderer not to 
support on the basis that they are not needed, and those we do restrict; 
we wouldn't use the name "miscoding" for those, as "invalid" does nicely 
for us in that context.)

The case where something (=member of or associated with an alphabet) is 
simply and fully identical in appearance in all contexts (and I regard 
script as a context) is fortunately quite rare in Unicode. Your examples 
may be the closest thing.
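
To make the Khmer case concrete, here is a minimal Python sketch. It
shows that the two subjoined sequences are distinct code point
sequences that no Unicode normalization form merges, and how a
comparison could fold one into the other, roughly in the spirit of the
variant handling in label generation rulesets. The VARIANT_FOLD table,
the choice of COENG TA as the representative, and the helper names are
assumptions made for illustration; they are not taken from any actual
ruleset.

```python
import unicodedata

COENG_DA = "\u17D2\u178A"  # KHMER SIGN COENG + KHMER LETTER DA
COENG_TA = "\u17D2\u178F"  # KHMER SIGN COENG + KHMER LETTER TA


def describe(seq):
    """List each code point in seq with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


def same_after_normalization(a, b):
    """True if any standard normalization form makes a and b equal."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in forms
    )


# Hypothetical variant table: map one sequence onto a chosen
# representative so that look-alike strings compare equal.
# Which member is the representative is an arbitrary choice here.
VARIANT_FOLD = {COENG_DA: COENG_TA}


def fold_variants(text):
    """Replace each listed variant sequence with its representative."""
    for variant, representative in VARIANT_FOLD.items():
        text = text.replace(variant, representative)
    return text


if __name__ == "__main__":
    print(describe(COENG_DA))
    print(describe(COENG_TA))
    print(same_after_normalization(COENG_DA, COENG_TA))
```

Folding of this kind only helps with comparison (for example, blocking
a second label that looks the same as an existing one); it does not say
which of the two spellings the writer intended.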
>
>> Are your examples likewise legitimate duplications or merely the case
>> that one could type something else and have it look the same
>> (accidentally).
> They're mostly legitimate duplications, though some may stretch
> phonological credulity.  For example, in Tai Tham, <NA, SAKOT, HIGH TA,
> SIGN I> is part of a common Pali verb inflection and <NA, SIGN I, SAKOT,
> HIGH TA> is a valid Northern Thai word (apparently not a Pali loan,
> despite its spelling), but <MA, SAKOT, HIGH TA, SIGN I> would probably
> be a miscoding of <MA, SIGN I, SAKOT, HIGH TA> (an attested final
> syllable) if the language were Northern Thai.  I suppose
> it's just conceivable that the former might be the name of a fruit, but
> I'm not aware of the syllabic nasal being written that way.
>
> A spell checker would pick up most such errors, though getting the
> underlying problem explained to the user might be difficult.
>
>> The Khmer example would seem fairly resistant to automated correction
>> if it is a free choice. If, instead, the immediately preceding
>> consonant comes from two disjoint sets, for example if TA COENG TA
>> was possible, but not TA COENG DA, then there's scope for spell check.
> It's supposed to be based on the phonetics, so a spell check could be
> used, but not a grammar rule.  However, I can imagine someone writing
> in accordance with a rule restricting them to certain bases.
Your last sentence reads as if you might equally well have meant "can't" 
instead of "can" (?)

Having agreement in consonants or vowels across syllables or words isn't 
necessarily unheard of; spell checkers tend to go on the basis of 
existing lexical items, not necessarily purely productive rules. At 
least the ones I use for European languages have this annoying habit of 
not having a productive rule for compounds - even for languages that do 
allow arbitrary compound formation.

Anyway, digressing from your point.

A./

