What is the time frame for USE shapers to provide support for CV+C ?

Richard Wordingham via Unicode unicode at unicode.org
Thu Aug 8 03:06:47 CDT 2019


On Wed, 7 Aug 2019 14:19:26 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> What about text that must exist normalized for other purposes?
> 
> Domain names must be normalized to NFC, for example. Will such
> strings display correctly if passed to USE?

One solution, of course, is to minimise the use of Microsoft products.
(The trick is to apply the normalisation algorithm using a permutation
of the positive ccc values.)  The latest version of HarfBuzz renders
subscripted final consonants; it's slowly recovering its pre-USE
rendering capabilities.

> On 8/7/2019 1:39 PM, Andrew Glass via Unicode wrote:
> That's correct, the Microsoft implementation of USE spec does not
> normalize as part of the shaping process. Why? Because the ccc system
> for non-Latin scripts is not a good mechanism for handling complex
> requirements for these writing systems and the effects of ccc-based
> normalization can disrupt authors intent. Unfortunately, because we
> cannot fix ccc values, shaping engines at Microsoft have ignored
> them. Therefore, recommendation for passing text to USE is to not
> normalize.

HarfBuzz solved the problem of <tone, sakot> by choosing a
suitable normalisation; it uses the same technique for Hebrew, where
the normalisation classes are also unfriendly to renderers.  
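
A minimal sketch of that technique, for illustration only: canonical
reordering is run as usual, but the combining class of selected marks is
looked up in an override table first.  The table below, which moves
SAKOT (U+1A60, ccc=9) above the tone marks (ccc=230) by giving it class
254, and the names MODIFIED_CCC, modified_ccc and reorder_for_shaping are
assumptions made for this example; this is not HarfBuzz's actual code.

```python
import unicodedata

# Hypothetical override table (an assumption for this sketch): treat
# TAI THAM SIGN SAKOT as if its combining class were 254, so that it
# sorts after the tone marks (ccc=230) instead of before them.
MODIFIED_CCC = {0x1A60: 254}


def modified_ccc(ch, overrides=MODIFIED_CCC):
    """Return the overridden combining class of ch, else the Unicode one."""
    return overrides.get(ord(ch), unicodedata.combining(ch))


def reorder_for_shaping(text, overrides=MODIFIED_CCC):
    """Decompose text, then canonically reorder with modified classes.

    Each run of marks with a non-zero (modified) class is stably sorted
    by that class; characters of class 0 are never moved.
    """
    out = []
    run = []  # pending marks with non-zero modified class
    for ch in unicodedata.normalize("NFD", text):
        if modified_ccc(ch, overrides) == 0:
            # A starter ends the current run; sorted() is stable.
            out.extend(sorted(run, key=lambda c: modified_ccc(c, overrides)))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=lambda c: modified_ccc(c, overrides)))
    return "".join(out)


# HIGH KA, TONE-1, SAKOT, HIGH KHA: the order <tone, sakot> as typed.
typed = "\u1A20\u1A75\u1A60\u1A21"
# Standard normalisation moves SAKOT in front of the tone mark.
nfc = unicodedata.normalize("NFC", typed)
# The modified ordering gives <tone, sakot> for both spellings.
shaped_from_typed = reorder_for_shaping(typed)
shaped_from_nfc = reorder_for_shaping(nfc)
```

Because the override only relabels positive classes and keeps distinct
classes distinct, canonically equivalent inputs still yield one and the
same output, so the renderer does not treat them differently.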

> By the way, at the current time, I do not have a final consensus from
> Tai Tham experts and community on the changes required to support Tai
> Tham in USE. Therefore, I've not been able to make the changes
> proposed in this thread.

Grammatical denazification is one solution.  Another one is to delegate
matters to the font.  Give us a script type that will implement a GSUB
feature by default, and font writers can take it from there. At present
I have a conundrum on how to render the accusative singular of the
cruciform form of the word for enlightenment without using chained
syllables, _bodhiṃ_.  The obvious visual encoding is <LOW PA, LOW THA,
SIGN E, SIGN I, MAI KANG, SIGN AA>.  This combination is very
unusual, perhaps unique to this word.  (Pali 'o' is <SIGN E, SIGN
AA>). However, a very common combination, because the UTC refused Tai
Tham the character SIGN AM, is SIGN AA, MAI KANG, so for the USE, SIGN
AA and MAI KANG have to be in the same character class.  (Alternatively,
we split the syllable before SIGN AA.)  MAI KANG has InSc=bindu, while
SIGN AA is a right matra. Unfortunately, there is a strong temptation
for many to write what would have been 'SIGN AM' as MAI KANG, SIGN AA,
which is to be rendered quite differently from 'SIGN AM' outside
Northern Thailand, e.g. in NE Thailand.  (Northern Thailand has both
styles; it is quite diverse.)  If I understand the principles of USE,
allowing both '... MAI KANG, SIGN AA...' and '... SIGN AA, MAI
KANG ....', which immediately after a consonant have the same rendering
in some fonts and very confusable renderings in many others, is
considered highly undesirable.

For Microsoft applications, another solution is for fonts to delete
dotted circles between Tai Tham characters.  (I try to be more
selective, but this results in a complicated set of lookups to
ensure that deletion only occurs when the renderer has inserted
inappropriate dotted circles.)  This is not compliant with Unicode, but
neither is deliberately treating canonically equivalent forms
differently.

Richard.


