Specification of Encoding of Plain Text

Wed Jan 11 20:56:19 CST 2017

On Tue, 10 Jan 2017 17:25:06 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> On 1/10/2017 2:54 PM, Richard Wordingham wrote:

> There are many different tacks that can be taken to make spoofing
> more difficult.
> 
> Among them, for critical identifiers:
> 1)  allow only a restricted repertoire
> 2)  disallow certain sequences
> 3) use a registry and
>     3a) define sets of labels that overlap (variant sets)
>     3b) restrict actual labels to be in disjoint sets
>            (one label blocks all others in the same variant set)
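
Strategy 3 above can be sketched in a few lines of Python. This is only a toy illustration, not the ICANN label generation rules themselves: the class name, the method names and the two-entry variant map (Cyrillic а and о folded to Latin a and o) are all assumptions made for the example.

```python
class VariantRegistry:
    """Toy registry: one registered label blocks all others in its variant set."""

    def __init__(self, variant_map):
        # variant_map (hypothetical) sends each code point to the chosen
        # representative of its variant set; unmapped code points stand
        # for themselves.
        self.variant_map = variant_map
        self.taken = {}  # variant-set key -> label that claimed the set

    def key(self, label):
        # Two labels share a key exactly when they are variants of each other.
        return "".join(self.variant_map.get(ch, ch) for ch in label)

    def register(self, label):
        k = self.key(label)
        if k in self.taken:
            return False  # blocked by a label in the same variant set
        self.taken[k] = label
        return True


# Assumed variant map: Cyrillic U+0430 and U+043E treated as variants
# of Latin "a" and "o".
registry = VariantRegistry({"\u0430": "a", "\u043e": "o"})
first = registry.register("paypal")
spoof = registry.register("p\u0430yp\u0430l")
other = registry.register("example")
```

Mapping every code point to one representative makes the variant relation symmetric and transitive by construction, which is what lets the variant sets be disjoint.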
> 
> The ICANN work on creating label generation rules attempts to
> implement these strategies (currently for 28 scripts in the Root Zone
> of the DNS). The work on the first half dozen scripts is basically
> completed.
> 
> > The Unicode standard does define what
> > short sequences of characters mean.  The problem is that then,
> > outside the Apple world, it seems to be left to Microsoft to decide
> > what longer sequences they will allow.  
> 
> MS and Apple are not the only ones writing renderers.

HarfBuzz OpenType rendering tries to follow MS.  That includes dotted
circles.  However, it will challenge the MS lead when it is blatantly
wrong.  In particular, it has a policy of rendering canonically
equivalent text the same, though that is a challenge when emulating USE.
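
For readers unsure what "canonically equivalent" covers, here is a minimal sketch using Python's standard unicodedata module. It says nothing about how HarfBuzz itself works; the helper name and the sample strings are mine.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# U+0307 COMBINING DOT ABOVE has combining class 230 and
# U+0323 COMBINING DOT BELOW has combining class 220, so the two
# orders below are different code point sequences for the same text.
dot_above_first = "q\u0307\u0323"
dot_below_first = "q\u0323\u0307"

same_text = canonically_equivalent(dot_above_first, dot_below_first)
different_text = canonically_equivalent("q\u0307", "q\u0323")
```

A renderer following the policy described above would be expected to draw both orderings identically, even though they compare unequal as raw strings.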

So far as I am aware, M17n is not in wide use.  It's tolerant, but
one's text won't go far if it relies on M17n.

Text can travel with a Graphite font, but that is limiting.  Sooner or
later, one will want most text to work with different fonts.

I'm having trouble digging up hard facts about InDesign's rendering, so
I don't know how willing it is to differ from Microsoft's.

> > Perhaps ICANN will be the industry-wide definer.  However, to stay
> > with Indic rendering, one may have cases where CVC and CCV
> > orthographic syllables have little to no visible difference.  The
> > Khmer writing system once made much greater use of CVC syllables.
> > For reproducing older texts, one might be forced to encode phonetic
> > CVC as though it were CCV.  

> The restrictions on sequences appropriate as an anti-spoofing measure
> are not appropriate for general encoded text!

So ICANN will at best serve to indicate sequences that should be
renderable.

> The project I'm involved in tackles only transitive forms of
> equivalence (whether visual or semantic).

> Collisions based on these equivalences can be handled with label 
> generation rulesets defined per RFC 7940, which allow registration 
> policies that are automated.

> The further "halo" of "merely" similar labels needs to be handled
> with additional technology that can handle concepts like similarity
> distance.

'Merely' similar CCV and CVC tend to differ when the vowel is
above the consonant and the subscript consonant is spacing, e.g. because
it rises to the hanging baseline. The difference, which is in vowel
placement, is comparable to the variation within one person's
handwriting.  However, the difference in mean position seems to be
statistically significant.  The inequivalence issue starts to arise with
spacing vowels, which is when one may find marks being applied to
syllables rather than to individual glyphs.

> From a Unicode perspective, there's a virtue in not over-specifying
> sequences, because you don't want to be caught having to re-encode 
> entire scripts should the conventions for the use of the elements
> making up the script change in an orthography reform!

This seems to run counter to Mark's idea of regexes defining scripts'
words.
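
To make the tension concrete, here is a toy regex of the kind I have in mind, for a deliberately simplified Khmer orthographic syllable. It is my own sketch, not Mark's proposal, and it ignores robat, shifters, independent vowels and much else; the pattern and the names are assumptions for illustration only.

```python
import re

CONSONANT = "[\u1780-\u17A2]"  # KHMER LETTER KA .. KHMER LETTER QA
COENG = "\u17D2"               # KHMER SIGN COENG, introduces a subscript consonant
VOWEL = "[\u17B6-\u17C5]"      # dependent vowel signs AA .. AU

# Simplified syllable: base consonant, any subscript consonants, optional vowel.
SYLLABLE = re.compile(f"{CONSONANT}(?:{COENG}{CONSONANT})*{VOWEL}?")


def is_well_formed(text: str) -> bool:
    # True only if the whole string is one syllable under the toy pattern.
    return SYLLABLE.fullmatch(text) is not None


ccv = "\u1780\u17D2\u179A\u17B6"  # KA, COENG, RO, AA: subscript before vowel
cvc = "\u1780\u17B6\u17D2\u179A"  # KA, AA, COENG, RO: vowel before subscript

ccv_ok = is_well_formed(ccv)
cvc_ok = is_well_formed(cvc)
```

Under such a pattern the CCV order is accepted and the CVC order is rejected, so anyone reproducing an older text with phonetic CVC would have to encode it as CCV, and any change of convention would mean changing the regex.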

> That does not mean that Unicode (at all times) endorses all
> permutations of free-form sequences as equally valid.

Just as well, as such freedom runs counter to the principle of avoiding
inequivalent encodings of the same thing.

Richard.