Specification of Encoding of Plain Text

Asmus Freytag asmusf at ix.netcom.com
Tue Jan 10 19:25:06 CST 2017

On 1/10/2017 2:54 PM, Richard Wordingham wrote:
> On Tue, 10 Jan 2017 13:12:47 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>> Unicode clearly doesn't forbid most sequences in complex scripts,
>> even if they cannot be expected to render properly and otherwise
>> would stump the native reader.
> Is this expectation based on sequence enforcement in the renderer?  The
> main problem with getting text to render reasonably (not necessarily as
> desired) is now anti-phishing.

You mean anti-spoofing. There are many types of phishing attempts that 
do not rely on spoofing identifiers.

There are many different tacks that can be taken to make spoofing more 
difficult.

Among them, for critical identifiers:
1) allow only a restricted repertoire
2) disallow certain sequences
3) use a registry and
    3a) define sets of labels that overlap (variant sets)
    3b) restrict actual labels to be in disjoint sets
           (one label blocks all others in the same variant set)
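
To make the interplay of these three strategies concrete, here is a minimal 
Python sketch. Everything in it is a toy assumption: the repertoire, the 
banned sequence, the variant table and the names (Registry, index_key) are 
invented for illustration and are not taken from any actual ruleset.

```python
# Toy sketch only: the repertoire, banned sequence and variant table are
# invented for illustration, not taken from any real ruleset.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789")  # 1) restricted repertoire
BANNED_SEQUENCES = ["00"]                              # 2) disallowed sequences
VARIANTS = {"0": "o", "1": "l"}                        # 3a) code points treated as "the same"


def index_key(label):
    """Replace every code point by the representative of its variant set."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)


class Registry:
    """Hypothetical registry in which one label blocks its whole variant set."""

    def __init__(self):
        self._taken = {}  # index key -> label that was registered first

    def register(self, label):
        if not set(label) <= ALLOWED:
            return "rejected: outside repertoire"
        if any(seq in label for seq in BANNED_SEQUENCES):
            return "rejected: disallowed sequence"
        key = index_key(label)
        if key in self._taken:
            # 3b) one label blocks all others in the same variant set
            return "blocked by " + self._taken[key]
        self._taken[key] = label
        return "registered"
```

With this table, registering "top" first means a later attempt to register 
"t0p" is blocked, because both labels share the index key "top".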

The ICANN work on creating label generation rules attempts to implement
these strategies (currently for 28 scripts in the Root Zone of the DNS). 
Work on the first half dozen scripts is basically completed.

> The Unicode standard does define what
> short sequences of characters mean.  The problem is that then, outside
> the Apple world, it seems to be left to Microsoft to decide what longer
> sequences they will allow.

MS and Apple are not the only ones writing renderers.

>> The advantage of the text I brought to your attention is the way it
>> is formalized and that it was created with local expertise. The
>> disadvantage from your perspective is that the scope does not match
>> with your intended use case.
> Perhaps ICANN will be the industry-wide definer.  However, to stay with
> Indic rendering, one may have cases where CVC and CCV orthographic
> syllables have little to no visible difference.  The Khmer writing
> system once made much greater use of CVC syllables.  For reproducing
> older texts, one might be forced to encode phonetic CVC as though it
> were CCV.

The restrictions on sequences that are appropriate as an anti-spoofing 
measure are not appropriate for general encoded text! For one, the Root 
Zone explicitly disallows anything that's not in "widespread everyday" 
use. That exclusion covers most transcriptions of "historic" texts, as 
well as religious or technical (phonetic) notations and transcriptions.

But restriction of repertoire and sequences goes only so far. You will 
always have a residual set of labels that overlap to a degree that users 
do not reliably distinguish them (actually, many disjoint sets of 
overlapping labels). The hard core of these are labels that appear 
(practically) identical. Around that core there's a further aura of 
more or less confusable labels.

Mathematically these two behave differently: the relation among 
(practically) identical labels is symmetric and transitive, while mere 
similarity between labels may be symmetric, but is not transitive. If A 
is equivalent to B and B to C then A is equivalent to C (transitivity). 
However, for merely similar labels there's a non-zero "similarity 
distance", if you will. If you try to chain similarity together via 
transitivity then you might exceed a similarity threshold and your end 
points (e.g. A and C above) may both be similar to B but not 
(sufficiently) to each other.
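
A small Python sketch of the difference, under stated assumptions: Hamming 
distance is used here only as a crude stand-in for a real visual similarity 
measure, and the threshold, the folding table and the function names are 
all hypothetical.

```python
FOLD = {"0": "o", "1": "l"}  # invented table of "practically identical" code points
THRESHOLD = 1                # hypothetical cut-off for "similar"


def fold(label):
    """Map each code point to a fixed representative."""
    return "".join(FOLD.get(ch, ch) for ch in label)


def equivalent(a, b):
    """Equality of folded forms: symmetric and transitive by construction."""
    return fold(a) == fold(b)


def hamming(a, b):
    """Count differing positions; a crude stand-in for a similarity distance."""
    if len(a) != len(b):
        raise ValueError("labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def similar(a, b):
    """Symmetric, but not transitive: distances can add up along a chain."""
    return hamming(a, b) <= THRESHOLD


A, B, C = "abc", "abd", "aed"
chain_holds = similar(A, B) and similar(B, C)  # each step is within the threshold
ends_similar = similar(A, C)                   # the end points are 2 apart
```

Here A and C are each within the threshold of B, yet their distance to each 
other is 2, so chaining similarity overshoots the threshold. No such drift 
can occur for equivalent(), because it compares folded forms for equality.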

The project I'm involved in tackles only transitive forms of equivalence 
(whether visual or semantic).

Collisions based on these equivalences can be handled with label 
generation rulesets defined per RFC 7940, which allow registration 
policies to be automated.

The further "halo" of "merely" similar labels needs to be handled with 
additional technology that can handle concepts like similarity distance.

From a Unicode perspective, there's a virtue in not over-specifying 
sequences, because you don't want to be caught having to re-encode 
entire scripts should the conventions for the use of the elements making 
up the script change in an orthography reform!

That does not mean that Unicode (at all times) endorses all permutations 
of free-form sequences as equally valid.

> This is already the case, through error rather than design,
> with the Thai script in Tai Tham.  This affects about 30% of the
> Northern Thai lexicon*, and I believe even a higher proportion when
> adjusted for word frequency. Now, to fight phishing, I have always
> believed that some brutal folding would be required for Tai Tham, which
> is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM
> *I've sampled the MFL dictionary.  I suspect a bias to untruncated forms
> in loans from Pali, such as _kathina_ rather than _kathin_.  If my
> suspicion is correct, the proportion would be even higher.
> However, I believe there is some advantage in distinguishing CVC and
> CCV at the code level, even where there is no visual difference.  To
> display small visual differences, perhaps we will be forced to beg for
> mark-up to make the distinction visible.
> In Tai Tham, there are very few CCV-CVC visual homographs in native
> words because of the phonological structure of Northern Thai, and one
> can usually guess whether the word is CCV or CVC.
> Richard.
