Specification of Encoding of Plain Text

Richard Wordingham richard.wordingham at ntlworld.com
Tue Jan 10 16:54:57 CST 2017


On Tue, 10 Jan 2017 13:12:47 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> Unicode clearly doesn't forbid most sequences in complex scripts,
> even if they cannot be expected to render properly and otherwise
> would stump the native reader.

Is this expectation based on sequence enforcement in the renderer?  The
main problem with getting text to render reasonably (not necessarily as
desired) is now anti-phishing.  The Unicode standard does define what
short sequences of characters mean.  The problem is that then, outside
the Apple world, it seems to be left to Microsoft to decide what longer
sequences they will allow.

> The advantage of the text I brought to your attention is the way it
> is formalized and that it was created with local expertise. The 
> disadvantage from your perspective is that the scope does not match
> with your intended use case.

Perhaps ICANN will be the industry-wide definer.  However, to stay with
Indic rendering, one may have cases where CVC and CCV orthographic
syllables have little to no visible difference.  The Khmer writing
system once made much greater use of CVC syllables.  For reproducing
older texts, one might be forced to encode phonetic CVC as though it
were CCV.

This is already the case, through error rather than design,
with the Thai script in Tai Tham.  This affects about 30% of the
Northern Thai lexicon*, and I believe an even higher proportion when
adjusted for word frequency.  Now, to fight phishing, I have always
believed that some brutal folding would be required for Tai Tham, which
is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM
LETTER GREAT SA).

*I've sampled the MFL dictionary.  I suspect a bias to untruncated forms
in loans from Pali, such as _kathina_ rather than _kathin_.  If my
suspicion is correct, the proportion would be even higher.

However, I believe there is some advantage in distinguishing CVC and
CCV at the code level, even where there is no visual difference.  To
display small visual differences, perhaps we will be forced to beg for
mark-up to make the distinction visible.
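
As a rough illustration of keeping a distinction in the encoding while
still folding for anti-phishing comparison, here is a minimal Python
sketch.  The fold table is purely an assumption for the example: it
treats the sequence HIGH SA, SAKOT, HIGH SA as confusable with U+1A54,
and is not taken from any published confusables data or registry rules.
The names FOLDS and fold_key are likewise hypothetical.

```python
import unicodedata

HIGH_SA = "\u1A48"   # TAI THAM LETTER HIGH SA
SAKOT = "\u1A60"     # TAI THAM SIGN SAKOT
GREAT_SA = "\u1A54"  # TAI THAM LETTER GREAT SA

# Assumed fold table, for illustration only: one sequence that a
# registry might choose to treat as equivalent to the ligature.
FOLDS = {HIGH_SA + SAKOT + HIGH_SA: GREAT_SA}


def fold_key(text: str) -> str:
    """Return a comparison key; the stored text itself is left unchanged."""
    key = unicodedata.normalize("NFC", text)
    for sequence, replacement in FOLDS.items():
        key = key.replace(sequence, replacement)
    return key


stacked = HIGH_SA + SAKOT + HIGH_SA
ligature = GREAT_SA

# The two spellings stay distinct at the code level...
print(stacked == ligature)                      # False
# ...but collide once folded, so a registry could reject one of them.
print(fold_key(stacked) == fold_key(ligature))  # True
```

The point of the design is that the fold is applied only to a derived
key, so the distinction survives in the stored text.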

In Tai Tham, there are very few CCV-CVC visual homographs in native
words because of the phonological structure of Northern Thai, and one
can usually guess whether the word is CCV or CVC.  

Richard.

