Character Boundaries - Who is to choose?
Richard Wordingham via Unicode
unicode at unicode.org
Thu May 31 11:59:30 CDT 2018
This has nothing to do with grapheme boundaries.
A few days ago, I remarked that deciding whether two character usages
were of the same character was akin to deciding whether two populations
were of the same species.
It can also be difficult to decide where the boundary between two
species lies. Is it the job of Unicode to prescribe the boundary
between two characters, or should it prefer to describe the boundary
that users largely follow? A good example of an unobvious boundary is
U+02BC MODIFIER LETTER APOSTROPHE v. U+2019 RIGHT SINGLE QUOTATION MARK.
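
To make that example concrete, here is a minimal sketch, assuming Python's standard unicodedata module, that prints the name and general category of the two characters. The helper name describe is my own choice for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The two characters usually share a glyph but differ in their properties.
for ch in ("\u02BC", "\u2019"):
    print(describe(ch))
```

The first is categorised as a modifier letter (Lm) and the second as final punctuation (Pf), so which side of the boundary a usage falls on has consequences beyond appearance.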
I am seeing a boundary issue between U+1A7A TAI THAM SIGN RA HAAM and
U+1A7C TAI THAM SIGN KHUEN-LUE KARAN. Between them, they have two
different functions, namely as the superscript final consonant form of
RA and as a killer. My understanding of the difference was that it was
based on the glyph shape. The function of final consonant would always
be performed by U+1A7A, and U+1A7C would always have the function of
killer. The 'HAAM' in 'RA HAAM' means 'to prohibit'. KARAN seems to
be a loanword from Siamese, where it seems originally to have meant just
'final letter', which is the only meaning I could find for it in Pali
(as _kāranta_). Nowadays, in Siamese, it means 'a letter bearing the
mark U+0E4C THAI CHARACTER THANTHAKHAT'; that Siamese mark is known as
_mai wanchakan_ when it just kills the vowel.
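
Unlike the apostrophe case, the character properties give little help in drawing this boundary. A rough sketch, again assuming Python's unicodedata module (the names MARKS and mark_properties are mine), lists the marks mentioned above:

```python
import unicodedata

# The two Tai Tham marks under discussion, plus the Thai mark cited for comparison.
MARKS = ("\u1A7A", "\u1A7C", "\u0E4C")

def mark_properties(ch):
    """Return the character name and general category for one mark."""
    return {"name": unicodedata.name(ch), "category": unicodedata.category(ch)}

for ch in MARKS:
    props = mark_properties(ch)
    print(f"U+{ord(ch):04X} {props['name']} ({props['category']})")
```

All three are reported as nonspacing marks (Mn), so the choice between U+1A7A and U+1A7C has to rest on glyph shape or function rather than on any property difference.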
In older Tai Khuen (1930s), both functions are performed by the RA
HAAM glyph. The glyph used is relatively large.
What I have been seeing a lot of recently is Northern Thai text where
the killer function is encoded as U+1A7C. This does not strike me as
unreasonable; the usage expresses the view that the difference between
U+1A7C, which typically has a small glyph, and the Northern Thai glyph
for the killer function, which also tends to be small, is simply glyph
variation. (I have no evidence of Northern Thai using superscript
The idea of encoding the two functions differently was abandoned
because of the principle that combining marks are encoded on the basis
of form; encoding them separately would, on the face of the evidence,
have been like encoding diaeresis and the mark of umlaut separately.