Normalizing Syriac

Mon Apr 26 17:58:23 CDT 2021

Here is what I gather as background (which you may well already know, but just in case):

① Normalisation rules are set in stone per stability policy (software has to be able to rely on any input that normalises to a certain output continuing to normalise like that, so it can use a normalised form as e.g. a database key, input for a password hash, etc.—even if a better behaviour theoretically exists).

② A cluster of a base character and combining characters can be interrupted with one or more of the confusingly named Combining Grapheme Joiner, which is typically used to split what is one grapheme cluster for display purposes into multiple grapheme clusters for normalisation and/or collation purposes. This can be used to inhibit diacritic reörderings that pose an issue in practice.
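As a rough illustration of point ②, here is a minimal Python sketch using only the standard `unicodedata` module. It assumes the rish + seyame + dotted zlama sequence from the original question and places U+034F COMBINING GRAPHEME JOINER between the two marks. The variable names are just for illustration. Whether a given font still forms the rish seyame ligature with the CGJ present is a separate question that this sketch does not test.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, canonical combining class 0

# rish, seyame (U+0308), dotted zlama angular (U+0739)
plain = "\u072A\u0308\u0739"
# Same sequence with a CGJ between the above mark and the below mark
guarded = "\u072A\u0308" + CGJ + "\u0739"

# Without the CGJ, canonical ordering moves the class-220 mark
# ahead of the class-230 mark.
reordered = unicodedata.normalize("NFC", plain)

# With the CGJ (class 0) in between, the two marks are no longer
# adjacent in one reorderable run, so their order is preserved.
kept = unicodedata.normalize("NFC", guarded)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in kept])
```

The trade-off is that the CGJ becomes part of the stored text, so searching and comparison against CGJ-free text need to account for it.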

—Har.


________________________________
From: Unicode <unicode-bounces at unicode.org> on behalf of Lorna Evans via Unicode <unicode at unicode.org>
Sent: Monday, April 26, 2021 10:50:40 PM
To: Unicode Mailing List <unicode at unicode.org>
Subject: Normalizing Syriac

I've got a situation that I'm not sure how to handle, or even whether Unicode or the rendering engines need an update.

In a language written in Syriac script, there is a rish seyame which can be followed by U+0739 or U+0738.

rish = 072A

seyame = 0308

In TUS, chapter 9, it says:

In Modern Syriac usage, when a word contains a rish and a seyame, the dot of
the rish and the seyame are replaced by a rish with two dots above it.

Then, there's a table which indicates this ligature is obligatory:

Table 9-17. Syriac Ligatures

Ligature Classes. As in other scripts, ligatures in Syriac vary depending on the font style.
Table 9-17 identifies the principal valid ligatures for each font style. When applicable, these
ligatures are obligatory, unless denoted with an asterisk (*).

rish seyame Right-joining Right-joining Right-joining BFBS (no asterisk, so it is obligatory)

Finally, the "Glossary" section of "Developing OpenType Fonts for Syriac Script" (https://docs.microsoft.com/en-us/typography/script-development/syriac) says:

Ligature - A combination of glyphs that join to form a single glyph. For example, the 'rish seyame' (U072a + U0308) combinations of glyphs are mandatory ligatures for Syriac. Other ligatures are optional.

So, it seems clear that U+072A U+0308 is a mandatory ligature. The problem I'm seeing is that when this ligature is followed by U+0739 or U+0738 and an application normalizes the text, the sequence becomes U+072A U+0739 U+0308 (or U+072A U+0738 U+0308), and that breaks the ligature.

You can see why normalization reorders it when you look at the canonical combining classes: U+0308 is 230, while U+0738 and U+0739 are 220, and canonical ordering sorts adjacent marks by ascending class.

0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
0738;SYRIAC DOTTED ZLAMA HORIZONTAL;Mn;220;NSM;;;;;N;;;;;
0739;SYRIAC DOTTED ZLAMA ANGULAR;Mn;220;NSM;;;;;N;;;;;
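The reordering can be reproduced with a short Python sketch using the standard `unicodedata` module. This is only an illustration of the behaviour described above. The helper `codepoints` and the constant names are made up for this example.

```python
import unicodedata

RISH = "\u072A"
SEYAME = "\u0308"         # COMBINING DIAERESIS, class 230
ZLAMA_ANGULAR = "\u0739"  # SYRIAC DOTTED ZLAMA ANGULAR, class 220


def codepoints(text):
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Canonical combining classes, as listed in UnicodeData.txt
for ch in (SEYAME, "\u0738", ZLAMA_ANGULAR):
    print(codepoints(ch), unicodedata.name(ch), unicodedata.combining(ch))

# The order the fonts expect: rish, seyame, zlama
original = RISH + SEYAME + ZLAMA_ANGULAR

# Normalization sorts the marks by ascending combining class,
# so the class-220 zlama moves in front of the class-230 seyame.
normalized = unicodedata.normalize("NFC", original)

print(codepoints(original), "->", codepoints(normalized))
```

NFD gives the same order here, since the canonical ordering step is shared by both forms and none of these characters compose or decompose.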

All of the Syriac fonts that I see only handle the sequence U+072A U+0308 U+0739, and not the reordered U+072A U+0739 U+0308.

Are the fonts wrong? Should they be able to handle U+072A U+0739 U+0308?

Or, is there a special normalization rule for this?

How should rish seyame followed by a below mark like U+0738 or U+0739 be handled?

Lorna
