Normalizing Syriac

Richard Wordingham richard.wordingham at ntlworld.com
Mon Apr 26 17:48:40 CDT 2021


On Mon, 26 Apr 2021 16:50:40 -0500
Lorna Evans via Unicode <unicode at unicode.org> wrote:

> You can see why they are reordering it when you see 0308 is 230 and 
> U+0738 or U+0739 are 220.
> 
> 0308;COMBINING DIAERESIS;Mn;*230*;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
> 0738;SYRIAC DOTTED ZLAMA HORIZONTAL;Mn;*220*;NSM;;;;;N;;;;;
> 0739;SYRIAC DOTTED ZLAMA ANGULAR;Mn;*220*;NSM;;;;;N;;;;;
> 
> All of the Syriac fonts that I see only handle this sequence *U+072A 
> U+0308 U+0739* and not the reordered *U+072A U+0739 U+0308*
> 
> Are the fonts wrong, should they be able to handle U+072A U+0739
> U+0308?
> 
> Or, is there a special normalization rule for this?
> 
> How should /rish seyame/ followed by a below mark like U+0738 or
> U+0739 be handled?

It depends on your technology.  In an OpenType font, I would combine
RISH with COMBINING DIAERESIS using a substitution lookup that ignores
marks below.  Am I missing something?  In a combination of base, mark
above and mark below, the order of the marks shouldn't matter if they
don't interact - one just sets up the mark 'attachment' classes so that
the marks are in different classes.  In later versions of OpenType, one
can even ignore a set of marks peculiar to that lookup.
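
As a rough illustration of why the order of the marks need not matter, here
is a toy Python model of a ligature substitution that skips marks below.  It
is not OpenType code and not any shaper's API: the glyph name
"<rish_seyame>", the function names, and the use of ccc=220 as a stand-in for
a font's 'marks below' class are all assumptions made for the sketch.

```python
import unicodedata

RISH = "\u072A"               # SYRIAC LETTER RISH
DIAERESIS = "\u0308"          # COMBINING DIAERESIS
RISH_SEYAME = "<rish_seyame>" # hypothetical ligature glyph name


def is_below_mark(ch):
    # Assumption: ccc=220 stands in for the font's "marks below" class.
    return unicodedata.combining(ch) == 220


def apply_ligature(text):
    """Replace RISH + DIAERESIS with one glyph, skipping marks below."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == RISH:
            j = i + 1
            skipped = []
            # Step over any marks below, remembering them.
            while j < len(text) and is_below_mark(text[j]):
                skipped.append(text[j])
                j += 1
            if j < len(text) and text[j] == DIAERESIS:
                out.append(RISH_SEYAME)
                # Choice made by this toy model only: skipped marks
                # are placed after the ligature glyph.
                out.extend(skipped)
                i = j + 1
                continue
        out.append(text[i])
        i += 1
    return out


as_typed = apply_ligature("\u072A\u0308\u0739")
as_normalized = apply_ligature("\u072A\u0739\u0308")
print(as_typed == as_normalized)
```

Both input orders give the same glyph sequence in this model, but only
because the model decides for itself where the skipped mark goes.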

Of course, the OpenType (syntax) specification doesn't state what the
subsequent sequence of glyphs is after a ligature lookup if an
intervening mark has been skipped.  John Hudson has publicly complained
that the semantics of OpenType ought to be defined.  Perhaps some
Syriac shaper exploits this gap to go spectacularly wrong - one would
hope it doesn't.

It has struck me as odd that there is very little guidance anywhere on
what sequences of marks fonts have to handle.  Back when HarfBuzz was
beginning to handle Tai Tham, Behdad kindly did a normalisation on the
fly so that tone marks (ccc=230) would come before COENG (ccc=9) so
that COENG would remain adjacent to its following consonant.  There is
a similar issue with Hebrew.  (Like a good boy, I'd elaborated my fonts
to handle normalised sequences.)
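
The Tai Tham case can be sketched in Python as follows.  This is only a
sketch of the idea and not HarfBuzz's code: the override value 254, the
function names, and the sample string are assumptions chosen for
illustration, with U+1A60 TAI THAM SIGN SAKOT playing the role of the COENG
and U+1A75 TAI THAM SIGN TONE-1 as the tone mark.

```python
import unicodedata

SAKOT = "\u1A60"  # TAI THAM SIGN SAKOT, ccc=9


def shaping_class(ch):
    # Hypothetical override: treat SAKOT as if it had a very high
    # combining class, so that it sorts after tone marks (ccc=230).
    if ch == SAKOT:
        return 254
    return unicodedata.combining(ch)


def reorder_for_shaping(text):
    """Stable-sort each run of combining marks by the overridden class."""
    chars = list(text)
    i = 0
    while i < len(chars):
        if shaping_class(chars[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(chars) and shaping_class(chars[j]) != 0:
            j += 1
        # sorted() is stable, so marks of equal class keep their order.
        chars[i:j] = sorted(chars[i:j], key=shaping_class)
        i = j
    return "".join(chars)


# Illustrative string only: consonant, tone mark, SAKOT, consonant.
typed = "\u1A20\u1A75\u1A60\u1A20"
normalized = unicodedata.normalize("NFC", typed)
reshaped = reorder_for_shaping(normalized)
print([f"U+{ord(c):04X}" for c in normalized])
print([f"U+{ord(c):04X}" for c in reshaped])
```

Normalisation moves SAKOT in front of the tone mark; the reordering step
puts it back next to the consonant that follows it.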

It is well known that the set of character sequences supported by
Uniscribe is not closed under canonical equivalence - apparently this is
allowed by the conformance clauses of TUS.
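
For anyone wanting to check the Syriac case, a few lines of Python with the
standard unicodedata module show the reordering and confirm that the two
sequences are canonically equivalent.  The helper name
canonically_equivalent is mine, and comparing NFD forms is just one
convenient way to make the test.

```python
import unicodedata


def canonically_equivalent(a, b):
    # Two strings are canonically equivalent if their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


typed = "\u072A\u0308\u0739"  # RISH, COMBINING DIAERESIS, DOTTED ZLAMA ANGULAR
normalized = unicodedata.normalize("NFC", typed)

for ch in typed:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

print([f"U+{ord(c):04X}" for c in normalized])
print(canonically_equivalent(typed, normalized))
```

A renderer whose supported sequences were closed under canonical
equivalence would have to display both strings identically.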

Richard.

