Standardised Variation Sequences with Toggles
richard.wordingham at ntlworld.com
Sun Aug 16 05:20:24 CDT 2015
The view of the Unicode Technical Committee appears to be that the
Unicode Character Database (UCD) takes priority over the core text of
the Unicode Standard in case of conflict. (Please advise if I have
misunderstood; I only have the core text and samples of past behaviour
to go on, neither of which appears to be binding.)
I am worried that this view may come to cause a redefinition of
sequences in which the variation selector is intended to toggle between
what are normally contextually determined forms. The clearest example is
<U+A856, U+FE00> 'phags-pa letter reversed shaping small a'. Phags-pa
is a 'cursive' script, and this letter is dual-joining. From just the
text of StandardizedVariants.txt and the text and pictures of
StandardizedVariants.html (the latter are in the process of migrating
to the code charts, which will replace the HTML file in Unicode 9.0.0),
one could easily imagine that the usual forms of <U+A856> and <U+A856,
U+FE00> in authentic continuous text were different. In fact,
*careful* reading of the core text shows that the commonest forms of
these two sequences in authentic text are identical!
The paradox arises because U+A856 PHAGS-PA LETTER SMALL A and several
other characters may be mirrored about the reading axis after certain
letters or flipped letters, and to avoid complications, the rule is
that by default they are mirrored in these extremely rare environments. I
believe this mirroring is what is meant by the word 'shaping' in the
description of the variant; it is not a reflection of the 'cursive'
nature of the script. U+FE00 toggles the mirroring state, and this is
what is meant by the word 'reversed', not that the letter is the other
way round to the form in the code chart. Unlike the other contextually
mirrored characters, it so happens that, more often than not, U+A856 is
not actually mirrored in the authentic extant text where the Unicode
rules call for mirroring.
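
To make the toggle reading concrete, here is a minimal sketch in Python of how I understand it. It is not taken from the core text or from any rendering engine. The names is_mirrored and resolve are made up for illustration, and the contextual rule itself is passed in as a hypothetical callback (context_mirrors_at), because the real conditions are the ones given in the core text and I do not restate them here.

```python
VS1 = "\uFE00"      # VARIATION SELECTOR-1
SMALL_A = "\uA856"  # PHAGS-PA LETTER SMALL A


def is_mirrored(context_mirrors, has_vs1):
    """Toggle reading: VS1 inverts whatever state the context would select."""
    return context_mirrors ^ has_vs1


def resolve(text, context_mirrors_at):
    """Return (character, mirrored) pairs for each base character in text.

    context_mirrors_at(text, i) is a hypothetical callback standing in for
    the contextual mirroring rule of the core text; it returns True when
    the character at index i would be mirrored by default.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == VS1:
            # A VS1 with no preceding base character; ignored in this sketch.
            i += 1
            continue
        has_vs1 = i + 1 < len(text) and text[i + 1] == VS1
        out.append((ch, is_mirrored(context_mirrors_at(text, i), has_vs1)))
        i += 2 if has_vs1 else 1
    return out


if __name__ == "__main__":
    # Show all four combinations of context and selector.
    for context in (False, True):
        for vs1 in (False, True):
            print(context, vs1, is_mirrored(context, vs1))
```

On this reading, <U+A856, U+FE00> does not name a fixed glyph: it comes out mirrored where the context would leave U+A856 unmirrored, and unmirrored where the context would mirror it.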
I believe the Phags-pa code chart should have a normative statement that
U+FE00 is acting as a toggle, and refer back to the core text. Now
Phags-pa is a relatively clean case - all standardised variants in
the block have the same behaviour, so a single sentence in the block's
code chart might suffice. However, I do not believe this is always the
case. One possibility would be to change the text from
~ A856 FE00 <U+A856, FE00> phags-pa letter reversed shaping small a
to
~ A856 FE00 phags-pa letter reversed shaping small a
• Toggles between <U+A856> and <U+A856, FE00>; see core text for
where text in '<...>' is rendered as a string, not echoed as ASCII.
However, that reads clumsily. Can people suggest improvements?
Similar text would be needed for StandardizedVariants.txt in the UCD.
The relevant line currently reads
"A856 FE00; phags-pa letter reversed shaping small a; # PHAGS-PA LETTER
Obviously this potential problem needs to be formally reported, but I
would first like to see other people's views.
There are other cases where variation selectors were intended as
toggles, but the ones I know of are not so clearly documented.