Choosing the Set of Renderable Strings

Richard Wordingham via Unicode unicode at unicode.org
Tue May 15 15:29:49 CDT 2018


On Tue, 15 May 2018 02:18:11 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham replied,
> 
> >> ...Private Use Area...
> > That's what the Xishuangbanna News does for final consonants.
> 
> I failed to find a link for their web site, but only spent about an
> hour and a half searching for it.  There is a web site for
> "Xishuangbanna Daily", but the pages I saw there were all in Chinese.

There's a sample at
http://www.dw12.com/DigitalNewspaper/xsbnbold/content/20180325/ArticelA04001DK.htm .
I'd have added a link, but the sample page wasn't working.  The page is
currently suffering an attack of dittography (seen on both IE on
Windows 7 and Firefox on Ubuntu).

> If Xishuangbanna News is publishing using PUA, then they probably
> offer a font for download.  I was just curious to see what their web
> pages looked like, and wondered how pervasive the PUA use is.  If
> their site only resorts to PUA for final consonants, then a
> presumption would be that the USE supports all other shaping
> requirements for the script.
> 
> > My issues are generally not with producing the right image,
> > but rather with enabling the semantically correct sequence
> > of characters.
> 
> Because you started out with all the Tai Tham glyphs mapped to the
> PUA, and are now trying to produce a working font using the standard
> encoding?

No.  The problem is a grammar Nazi of a rendering engine.  I have been
working from a set of characters, and what has happened is that some
glyphs (in the ISO sense, not in the sense used for fonts) that
looked as though they may have needed variation sequences have been
split off as formally unrelated characters - MEDIAL LA, MEDIAL RA,
SIGN SA and SIGN LOW PA OR HIGH RATHA.

What do you mean by 'standard encoding'?  It is agreed that there is a
standard encoding for *characters*.  I have been using the encoding
proposal accepted by the Unicode Technical Committee, interpreted in
the light of the subsequent changes to the encoding of characters, as
the definition of the encoding of text.  A problem is that Unicode does
not seem to specify the encoding of text.
HarfBuzz used to more or less implement the rules in the proposal, and
rendering generally worked. Then HarfBuzz switched to USE.

For example, what prompted my question was the encoding of the
words /tɛːn tɔː/ and /tɔː tɛːn/, both meaning 'hornet'.  If the
subscript consonant representing /n/ and the vowel /ɔː/ form a
ligature which is ambiguous as to the order of the phonemes, or the
vowel truly falls through below the consonant, then the contracted form
is the same for both words, and will be rendered if I type it as
ᨲ᩠ᨶᩯᩬᩴ᩵ <HIGH TA, SAKOT, NA, SIGN AE, SIGN OA BELOW, MAI KANG,
TONE-1>.  However, the logical reading of that spelling is /tănɛː
tănɔː/, which sounds like a slightly unusual intensifier.   If we
follow the principle of using phonetic order, then /tɛːn tɔː/ will be
encoded ᨲᩯ᩠ᨶᩬᩴ᩵ <HIGH TA, SIGN AE, SAKOT, NA, SIGN OA BELOW, MAI KANG,
TONE-1> and /tɔː tɛːn/ will be encoded ᨲᩬᩴ᩵ᩯ᩠ᨶ <HIGH TA, SIGN OA BELOW,
MAI KANG, TONE-1, SIGN AE, SAKOT, NA>.  Both get a dotted circle
because of the sequence <dependent vowel, SAKOT>.  The second one also
gets a dotted circle because of tone before vowel; if one misapplies
the single subsyllable rule from the proposal, the offence is having a
tone mark before a vowel that is not on the right.  Without the tone
mark or MAI KANG,
the offence would be having a below matra (SIGN OA BELOW) before a left
matra (SIGN AE). When MAI KANG was a vowel, back in Unicode 9.0, a USE
implementation would detect two different offences:

(a) Having a top matra (MAI KANG) before a left matra (SIGN AE) and
(b) Following the accepted proposal for Tai Tham and having a bottom
matra (SIGN OA BELOW) before a top matra (MAI KANG).
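
To make the Unicode 9.0 case concrete, here is a minimal Python sketch
of an order check.  It is a toy model and not the USE: the ranks (left,
top, bottom, tone), the short character labels and the function
find_offences are assumptions inferred only from the offences listed
above.  A real shaper would stop at the first break and insert a dotted
circle rather than list every out-of-order pair.

```python
# Toy model only: ranks are assumed from the offences described in the
# text, with MAI KANG treated as a top matra as in Unicode 9.0.
LEFT, TOP, BOTTOM, TONE = range(4)

MARK_RANK = {
    "SIGN AE": LEFT,
    "MAI KANG": TOP,
    "SIGN OA BELOW": BOTTOM,
    "TONE-1": TONE,
}
VOWEL_RANKS = {LEFT, TOP, BOTTOM}


def find_offences(chars, mark_rank=MARK_RANK):
    """Return (earlier, later) pairs that break the assumed order."""
    offences = []
    marks = []   # marks seen so far in the current cluster
    prev = None  # the immediately preceding character
    for ch in chars:
        if ch == "SAKOT":
            # <dependent vowel, SAKOT> is rejected.
            if prev in mark_rank and mark_rank[prev] in VOWEL_RANKS:
                offences.append((prev, "SAKOT"))
        elif ch in mark_rank:
            # A mark may not follow a mark of higher rank.
            for earlier in marks:
                if mark_rank[earlier] > mark_rank[ch]:
                    offences.append((earlier, ch))
            marks.append(ch)
        elif prev != "SAKOT":
            # A consonant not subjoined by SAKOT starts a new cluster.
            marks = []
        prev = ch
    return offences


# The two phonetic-order spellings discussed above.
TAEN_TOA = ["HIGH TA", "SIGN AE", "SAKOT", "NA",
            "SIGN OA BELOW", "MAI KANG", "TONE-1"]
TOA_TAEN = ["HIGH TA", "SIGN OA BELOW", "MAI KANG", "TONE-1",
            "SIGN AE", "SAKOT", "NA"]

for label, seq in (("taen toa", TAEN_TOA), ("toa taen", TOA_TAEN)):
    print(label, find_offences(seq))
```

In this model the second spelling trips both (a) and (b), while the
first trips only (b) plus the <dependent vowel, SAKOT> check.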

A fastidious writer would separate the two subsyllables with MAI SAM,
which is a visible mark.  My specific question was whether, in the
absence of MAI SAM, it was in order to use CGJ to separate the two
subsyllables, so that a grammar checker would know where the boundary
between the subsyllables lay. The issue is that the TUS says that CGJ
does not affect rendering, just after an example of it affecting
rendering in Hebrew.  Now, a possible argument is that it may affect
whether rendering occurs; the insertion of a dotted circle is to be
interpreted as meaning that the renderer has refused to render the
string.
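
For concreteness, the CGJ variant asked about could be built as in the
Python sketch below.  It only shows the candidate encoding for
/tɔː tɛːn/; whether that encoding is in order is precisely the open
question, and the placement of CGJ (U+034F) at the subsyllable boundary
is an assumption.  Characters are looked up by their full Unicode names
through unicodedata, so a wrong name fails loudly instead of silently
producing the wrong string.

```python
import unicodedata


def from_names(*names):
    """Build a string from full Unicode character names."""
    return "".join(unicodedata.lookup(name) for name in names)


CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# First subsyllable: <HIGH TA, SIGN OA BELOW, MAI KANG, TONE-1>
first = from_names(
    "TAI THAM LETTER HIGH TA",
    "TAI THAM VOWEL SIGN OA BELOW",
    "TAI THAM SIGN MAI KANG",
    "TAI THAM SIGN TONE-1",
)

# Second subsyllable: <SIGN AE, SAKOT, NA>
second = from_names(
    "TAI THAM VOWEL SIGN AE",
    "TAI THAM SIGN SAKOT",
    "TAI THAM LETTER NA",
)

# Candidate spelling with CGJ marking the subsyllable boundary.
with_cgj = first + CGJ + second

print(" ".join(f"U+{ord(ch):04X}" for ch in with_cgj))
print(unicodedata.name(CGJ), "ccc =", unicodedata.combining(CGJ))
```

A grammar checker could then split on U+034F to recover the two
subsyllables; what a renderer does with the string is a separate
matter.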

Richard.


