Standardised Variation Sequences with Toggles

Richard Wordingham richard.wordingham at ntlworld.com
Sun Aug 16 17:50:08 CDT 2015


On Sun, 16 Aug 2015 12:08:34 -0700
Ken Whistler <kenwhistler at att.net> wrote:

> Some editorial oversight or a typo in the text of the core
> specification cannot be taken as legalistically somehow trumping the
> data file, just because somebody finds it "written in the standard".
> 
> Capiche?

No.  What about oversights and typos in the UCD?

Indeed, two variation sequences were removed because it was found that
their bases were decomposable, which contradicts the core
specification.  In this case, the UCD did not trump the rules for
variation sequences.
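
To make the check concrete, here is a minimal sketch using Python's
standard unicodedata module.  The function name decomposition_kind is
mine, not anything in the UCD, and the result reflects whichever UCD
version the Python build ships with.  It only classifies a character;
which kinds of decomposition ought to disqualify a base is an
assumption left to the reader.

```python
import unicodedata

def decomposition_kind(ch: str) -> str:
    """Classify a character's decomposition mapping in the bundled UCD.

    Returns 'none', 'canonical' or 'compatibility'.
    (Hypothetical helper name, for illustration only.)
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as "<noBreak>";
    # canonical mappings are a bare list of code points.
    return "compatibility" if mapping.startswith("<") else "canonical"

# LATIN SMALL LETTER E WITH ACUTE decomposes canonically to e + U+0301.
print(decomposition_kind("\u00e9"))
# NARROW NO-BREAK SPACE has only a compatibility mapping to U+0020.
print(decomposition_kind("\u202f"))
# A plain ASCII letter has no decomposition at all.
print(decomposition_kind("a"))
```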

When there is a contradiction, it needs to be investigated and
resolved, with awareness that different people may be relying on
different parts of the specification.

> (In most
> cases, the core specification is simply underspecified because the
> research, writing and editing for it is under-resourced.)

That is also true of much of the UCD.  I suspect that much of it relies
on intelligent guesswork.  Some properties may simply be ignored
because nothing readily testable uses them (e.g. line and word-break
properties relevant for scriptio continua writing systems), and others
appear to be arbitrary. (Is the allocation of digits to L, AN or EN
actually anything but an encoding decision?)  Fortunately, most errors
in the UCD can be corrected when the settings don't work; casing
pairs, names, decompositions and canonical combining classes are the
main problems. I believe problems arising from codepoint assignments
could be fixed by creating singleton decompositions, e.g. to change mere
numbers into decimal digits.
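
For anyone wanting to see the spread of bidirectional classes among
decimal digits, here is a small sketch using Python's unicodedata
module.  The choice of sample digits and the helper name
digit_bidi_classes are mine; the values come from whatever UCD version
the Python build carries.

```python
import unicodedata

# Digit zero from a few scripts; all have general category Nd.
SAMPLE_ZEROS = {
    "ASCII": "\u0030",
    "Arabic-Indic": "\u0660",
    "Extended Arabic-Indic": "\u06f0",
    "Devanagari": "\u0966",
    "Thai": "\u0e50",
}

def digit_bidi_classes(samples=SAMPLE_ZEROS):
    """Map each label to (general category, bidi class) of its digit zero."""
    return {
        label: (unicodedata.category(ch), unicodedata.bidirectional(ch))
        for label, ch in samples.items()
    }

for label, (gc, bidi) in digit_bidi_classes().items():
    print(f"{label:24} gc={gc} bidi={bidi}")
```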

As an example of an effectively ignored line break property, I offer
the line-break property of the Thai repetition character U+0E46 THAI
CHARACTER MAIYAMOK. It is currently of general category Lm, and has the
line-break property SA 'South-East Asian line-breaking'. This means
that the Unicode line-breaking algorithm calls upon a non-standard
algorithm to assign each instance of the character a line-break
property.

Now I believe that it should have line break property EX.  I can find
a grammatical description that says it should be separated from the
preceding word by a space, and I have found no example in books of
U+0E46 starting a line.  Giving it line break property EX would
prevent a line break between the space and the repetition mark.
However, there is little point in trying to have it assigned line break
property EX, for the Unicode assignment is irrefutable.  My argument
has to be addressed to the specifications of the algorithms doing Thai
line-breaking.

A historical example of errors in the UCD is U+200B ZERO WIDTH SPACE
(ZWSP). Its primary use is as a word separator in scripts that don't
have visible word separators, though I'm currently finding it useful in
Word 2010 to split up excessively long path names without visible
hyphens being added.  When its general category was changed from Zs to
Cf, its Unicode word-break property became 'Format'; it no longer had
any effect on word-breaking. Its line-breaking behaviour was preserved,
so the control of text layout was unaffected.  For SE Asian languages,
the change had no direct effect, for their word-breaking rules are
largely outside the scope of the Unicode text segmentation algorithms.
All went well until someone decided that TUS text describing it as a
word-breaker was an 'editorial oversight'. A corrigendum removed this
word-breaking behaviour, and SE Asian word processors started to
misbehave as software maintainers caught up with the corrigendum.
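
The effect of the general category change is easy to observe in any
library that derives 'whitespace' from the UCD.  The sketch below uses
Python purely as an illustration; str.split() is Python's own
whitespace splitting, not the Unicode word-break algorithm, and the
helper name zwsp_report is mine.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def zwsp_report(text: str):
    """Return ZWSP's general category, whether Python treats it as
    whitespace, and how str.split() handles the given text."""
    return {
        "category": unicodedata.category(ZWSP),
        "isspace": ZWSP.isspace(),
        "split": text.split(),
    }

report = zwsp_report("foo" + ZWSP + "bar baz")
# With ZWSP classed as a format character, only the ordinary space
# separates tokens; "foo" and "bar" stay glued together.
print(report["category"], report["isspace"], len(report["split"]))
```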

For details see an email from Javier Solá:
http://unicode.org/mail-arch/unicode-ml/y2009-m01/0604.html .  The
referenced proposal gives the text of the erratum, dated May 2008.
Presumably corrigenda did not then have numbers, for there is no trace
of its former existence in
http://www.unicode.org/versions/corrigenda.html .

A similar process is now in progress for U+2060 WORD JOINER (WJ), which
is the opposite of ZWSP.  It is intended that WJ will cease to indicate
the absence of word boundaries.  In scripts that have visible
word-boundaries, the absence of an effect on word-breaking is of no
consequence for sequences of letters, for the mere juxtaposition of
letters prevents a word-break between them.  By contrast, SE Asian
word-boundary detectors largely rely on recognising words, and they can
make mistakes, or be given an impossible task.  The English analogue is
detecting the word boundary in 'humanevents' - is the last word
'events' or 'vents'?  A notable challenge is to persuade a Thai
spell-checker that a transliteration of 'Hemingway' is actually a
single word.  Delimiting the boundaries does not work - one has to
join the fragments into which the automatic word-breaker splits it.
The language proposed for ISO 10646, in
http://www.unicode.org/L2/L2015/15211-word-joiner.pdf , does not
actually state that it does not prevent a word break, though stronger
text denying that it suppresses word breaks has been proposed for
Unicode.
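
The 'humanevents' problem can be sketched in a few lines of Python.
This is a toy dictionary-based segmenter of my own, not any real Thai
word-breaker; the four-word lexicon and the function names are
hypothetical.  The second function assumes that WJ forbids a word
boundary at its position, which is exactly the behaviour under
discussion, not something the Unicode algorithms are claimed to do.

```python
WJ = "\u2060"  # WORD JOINER
LEXICON = {"human", "humane", "events", "vents"}  # toy lexicon

def segmentations(text, lexicon=LEXICON):
    """Return every way of splitting text into words from the lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results

def segment_with_joiners(text, lexicon=LEXICON):
    """As above, but discard splits that place a boundary where the
    input carries a WJ (assumed meaning: 'no word boundary here')."""
    forbidden, plain = set(), []
    for ch in text:
        if ch == WJ:
            forbidden.add(len(plain))  # offset in the WJ-free text
        else:
            plain.append(ch)
    kept = []
    for words in segmentations("".join(plain), lexicon):
        pos, boundaries = 0, set()
        for word in words[:-1]:
            pos += len(word)
            boundaries.add(pos)
        if not boundaries & forbidden:
            kept.append(words)
    return kept

print(segmentations("humanevents"))
print(segment_with_joiners("human" + WJ + "events"))
```

Without a joiner the segmenter cannot choose between the two readings;
with WJ honoured between 'n' and 'e', only the reading that keeps those
letters in one word survives.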

By contrast, U+202F NARROW NO-BREAK SPACE (NNBSP) looks set to regain
its originally intended purpose, that of a narrow space that does not
break words.  The script for which it was intended, Mongolian, will be
able to use the Unicode word-boundary detection algorithm once NNBSP is
allowed as part of a word.  However, the fact remains that NNBSP should
never have been allowed to break words.  The core text has long stated
that NNBSP does not break Mongolian words.  There remains, however, a
possibility that European usage of NNBSP will prevent it from
recovering its intended functionality.

> Yes, a notice at the top:
> 
> @+ For details about the implementation of variation sequences in
> Phags-pa, please refer to the Phags-pa section of the core
> specification.

a) This is likely to be ignored by someone who is just looking for the
*specification*.  I think replacing 'implementation' by 'rendering'
would be better.  I would be inclined to add, 'These sequences are more
complicated than they appear at first reading'.  Otherwise, someone
will just add them to the character to glyph conversion section of a
font and think, "Job done".

b) This won't work where the effort has not been expended on the core
text.

As to StandardizedVariants.txt, Section 23.4 needs to refer to the
Phags-pa section in the core text.  As that file points to Section
23.4 of TUS, this should then at least suggest that the descriptions in
the file do not override the core specification.

Richard.


