From tom at honermann.net  Sun Apr  4 17:07:58 2021
From: tom at honermann.net (Tom Honermann)
Date: Sun, 4 Apr 2021 18:07:58 -0400
Subject: White spaces for the purpose of programming languages
In-Reply-To: <CAN49p6r-M1Ho9QfWeXiVpHCruwu2=mbRM=dEbrdyXTKPEEuT9Q@mail.gmail.com>
References: <CA+Om+ShNpJv-O1YsxjdNXsxDgJDH2NnbUB59iczXY=DZSojpYA@mail.gmail.com>
 <CAN49p6r-M1Ho9QfWeXiVpHCruwu2=mbRM=dEbrdyXTKPEEuT9Q@mail.gmail.com>
Message-ID: <5c071b21-0fa9-b35c-f9eb-226c13ced32c@honermann.net>

On 3/31/21 11:10 PM, Markus Scherer via Unicode wrote:
>
>       o I can't tell if the EBCDIC platforms are "alive". Elsewhere I
>         have tried to find out if there is a competent C++11 compiler
>         available.
>
Yes, EBCDIC platforms continue to roam the earth. IBM's traditional xlC
for z/OS compiler is effectively on life support and stuck at a
pre-C++11 language level, but there are multiple options for recent C++
language standards available today. IBM has other EBCDIC-based OSs as
well, but I'm not as familiar with them.

IBM started distributing a Clang-based compiler (xlclang) with XL C/C++ 
V2.3.1 
<https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/fang-lu2/2020/03/24/xl-cc-v231-for-zos-v23-web-deliverable-is-available-today-on-march-29-2019> 
for z/OS two years ago and has started posting patches to LLVM Clang to 
add z/OS support. One such patch to enable -fexec-charset to support
IBM-1047 (an EBCDIC encoding of the ISO-8859-1 character repertoire) is
currently in review here <https://reviews.llvm.org/D93031>.
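
As a small illustration of what an EBCDIC execution character set changes, here is a Python sketch. It uses Python's built-in cp037 codec as a stand-in, on the assumption that an IBM-1047 codec may not be available; cp037 is a different EBCDIC code page, so treat the byte values as illustrative of EBCDIC in general rather than of IBM-1047 specifically. The sample string is made up for the example.

```python
# Hypothetical source fragment, used only to compare encodings.
text = "if (x) return 0;"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")   # cp037: an EBCDIC code page (stand-in)

# The same characters get different byte values in the two encodings.
a_ascii = "A".encode("ascii")         # b"\x41"
a_ebcdic = "A".encode("cp037")        # b"\xc1"
space_ebcdic = " ".encode("cp037")    # b"\x40", not b"\x20"

# Decoding with the matching codec recovers the original text.
round_trip = ebcdic_bytes.decode("cp037")
```

Code that compares characters against hard-coded ASCII byte values, including tests for white space, would therefore presumably need attention on such a platform.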

Dignus offers Systems/C++ <http://www.dignus.com/dcxx/>, an LLVM-based 
C++ compiler that, as of version 2.25 released last year 
<http://www.dignus.com/press_releases/200728.html>, supports C++17.

Tom.


From wjgo_10009 at btinternet.com  Mon Apr  5 12:00:02 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 5 Apr 2021 18:00:02 +0100 (BST)
Subject: A new era of encoding
Message-ID: <c7de308.5699.178a2fa1794.Webtop.93@btinternet.com>

One week to go for the Public Review on QID Emoji.

Then a report from the Emoji Subcommittee about what they propose.

Then consideration by the Unicode Technical Committee.

In my opinion this has implications far beyond just emoji, because what 
is under consideration is the introduction of an auxiliary encoding 
space, with its own map of code points. So if one auxiliary encoding 
space is introduced, then there is the possibility of other such 
auxiliary encoding spaces becoming introduced.

I consider that the introduction of such auxiliary encoding spaces will 
be a good thing, with great benefits.

It will allow new concepts to become implemented in a rigorous 
interoperable format.

William Overington

Monday 5 April 2021


From public at khwilliamson.com  Mon Apr  5 16:44:56 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 5 Apr 2021 15:44:56 -0600
Subject: White spaces for the purpose of programming languages
In-Reply-To: <5c071b21-0fa9-b35c-f9eb-226c13ced32c@honermann.net>
References: <CA+Om+ShNpJv-O1YsxjdNXsxDgJDH2NnbUB59iczXY=DZSojpYA@mail.gmail.com>
 <CAN49p6r-M1Ho9QfWeXiVpHCruwu2=mbRM=dEbrdyXTKPEEuT9Q@mail.gmail.com>
 <5c071b21-0fa9-b35c-f9eb-226c13ced32c@honermann.net>
Message-ID: <9fbc35cd-4e7e-2eab-45eb-6085cc9bf44a@khwilliamson.com>

On 4/4/21 4:07 PM, Tom Honermann via Unicode wrote:
> On 3/31/21 11:10 PM, Markus Scherer via Unicode wrote:
>>
>>       o I can't tell if the EBCDIC platforms are "alive". Elsewhere I
>>         have tried to find out if there is a competent C++11 compiler
>>         available.
>>
> Yes, EBCDIC platforms continue to roam the earth. IBM's traditional xlC
> for z/OS compiler is effectively on life support and stuck at a
> pre-C++11 language level, but there are multiple options for recent C++
> language standards available today. IBM has other EBCDIC-based OSs as
> well, but I'm not as familiar with them.
> 
> IBM started distributing a Clang-based compiler (xlclang) with XL C/C++ 
> V2.3.1 
> <https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/fang-lu2/2020/03/24/xl-cc-v231-for-zos-v23-web-deliverable-is-available-today-on-march-29-2019> 
> for z/OS two years ago and has started posting patches to LLVM Clang to 
> add z/OS support. One such patch to enable -fexec-charset to support
> IBM-1047 (an EBCDIC encoding of the ISO-8859-1 character repertoire) is
> currently in review here <https://reviews.llvm.org/D93031>.
> 
> Dignus offers Systems/C++ <http://www.dignus.com/dcxx/>, an LLVM-based 
> C++ compiler that, as of version 2.25 released last year 
> <http://www.dignus.com/press_releases/200728.html>, supports C++17.
> 
> Tom.
> 

Both modern Python and Perl run on z/OS.  Perl offers full support of 
UTF-EBCDIC; I don't know about Python.


From wjgo_10009 at btinternet.com  Tue Apr 13 11:49:25 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 13 Apr 2021 17:49:25 +0100 (BST)
Subject: Haiku Poetry Day Saturday 17 April 2021
Message-ID: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>

https://www.daysoftheyear.com/days/haiku-poetry-day/

Are haiku poems written in any languages other than Japanese and 
English?

It occurs to me that the 5-7-5 structure in one language may well not 
have a 5-7-5 structure if translated from one language to another.

Also, meaning could be lost. For example, if the word 'Elles' in French 
is translated to 'They' in English, the translation may lose the 
meaning.

Although not a haiku poem, my song lyrics about colourful fonts would 
lose some meaning in translation if 'Elles' were translated to 'They'.

http://www.users.globalnet.co.uk/~ngo/une_chanson.pdf

I remember that one year this mailing list featured a haiku contest. I 
entered but did not win a prize. Perhaps there could be another haiku 
contest this year, though different in that all entries must be posted 
to this mailing list, then they would be archived and publicly 
available. Maybe it could be more of a festival than a contest, with 
original haiku in many languages that can be represented in Unicode.

Actually, I sent in fourteen entries and afterwards I placed them in the 
mailing list archives. I have searched but cannot find them yet.

Can one have haiku expressed in emoji?

William Overington

Tuesday 13 April 2021


From jameskass at code2001.com  Tue Apr 13 13:49:15 2021
From: jameskass at code2001.com (James Kass)
Date: Tue, 13 Apr 2021 18:49:15 +0000
Subject: Haiku Poetry Day Saturday 17 April 2021
In-Reply-To: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
References: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
Message-ID: <16a24830-de21-d86e-9bf9-fc540e3f3c71@code2001.com>


On 2021-04-13 4:49 PM, William_J_G Overington via Unicode wrote:
> I remember that one year this mailing list featured a haiku contest. 
http://blog.unicode.org/2009/09/unicode-announcement-unicode-haiku.html

From junicode at jcbradfield.org  Tue Apr 13 14:19:17 2021
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Tue, 13 Apr 2021 20:19:17 +0100 (BST)
Subject: Haiku Poetry Day Saturday 17 April 2021
References: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
Message-ID: <slrns7brll.7tf.jcb@home.stevens-bradfield.com>

On 2021-04-13, William_J_G Overington via Unicode <unicode at unicode.org> wrote:
> https://www.daysoftheyear.com/days/haiku-poetry-day/
>
> Are haiku poems written in any languages other than Japanese and 
> English?
>
> It occurs to me that the 5-7-5 structure in one language may well not 
> have a 5-7-5 structure if translated from one language to another.
>
> Also, meaning could be lost. For example, if the word 'Elles' in French 
> is translated to 'They' in English, the translation may lose the 
> meaning.

I have a friend who sometimes writes quadrilingual haikus: the same
sentiment expressed in haiku form in each of French, German, English
and Quenya.
Like all translation, there are choices that can be made.

From john.w.kennedy at gmail.com  Tue Apr 13 14:45:08 2021
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Tue, 13 Apr 2021 15:45:08 -0400
Subject: Haiku Poetry Day Saturday 17 April 2021
In-Reply-To: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
References: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
Message-ID: <F441FA24-4723-4EB9-A4A4-AEC6841AC661@gmail.com>

On Apr 13, 2021, at 12:49 PM, William_J_G Overington via Unicode <unicode at unicode.org> wrote:

> https://www.daysoftheyear.com/days/haiku-poetry-day/
> 
> Are haiku poems written in any languages other than Japanese and English?

I don't know of any other languages in which haiku /are/ written; I dare say French would serve better than English.

English hates haiku;
Syllables, like April snow,
Melt and flow away.

Of course, one might always question the meaning of "mora",
Returning to the springtime of English and the forge of the ancient English scops,
But I fear that a seven-footed line's too long.

> It occurs to me that the 5-7-5 structure in one language may well not have a 5-7-5 structure if translated from one language to another.

"Traduttore traditore." (And the more exquisite, the worse.)

-- 
John W Kennedy
"...when you're trying to build a house of cards, the last thing you should do is blow hard and wave your hands like a madman."
  --  Rupert Goodwins


From eik at iki.fi  Tue Apr 13 23:55:26 2021
From: eik at iki.fi (eik at iki.fi)
Date: Wed, 14 Apr 2021 07:55:26 +0300
Subject: Haiku Poetry Day Saturday 17 April 2021
Message-ID: <001001d730ea$63b9a3f0$2b2cebd0$@iki.fi>

Several haiku have also been written and published in Finnish, at least.

Erkki I. Kolehmainen
Mannerheimintie 75 B 37, 00270 Helsinki, Finland
Mob: +358 400 825 943 

-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of John W Kennedy via Unicode
Sent: Tuesday, 13 April 2021 22:45
To: Unicode Discussion <unicode at unicode.org>
Subject: Re: Haiku Poetry Day Saturday 17 April 2021

On Apr 13, 2021, at 12:49 PM, William_J_G Overington via Unicode <unicode at unicode.org> wrote:

> https://www.daysoftheyear.com/days/haiku-poetry-day/
> 
> Are haiku poems written in any languages other than Japanese and English?

I don't know of any other languages in which haiku /are/ written; I dare say French would serve better than English.

English hates haiku;
Syllables, like April snow,
Melt and flow away.

Of course, one might always question the meaning of "mora", Returning to the springtime of English and the forge of the ancient English scops, But I fear that a seven-footed line's too long.

> It occurs to me that the 5-7-5 structure in one language may well not have a 5-7-5 structure if translated from one language to another.

"Traduttore traditore." (And the more exquisite, the worse.)

--
John W Kennedy
"...when you're trying to build a house of cards, the last thing you should do is blow hard and wave your hands like a madman."
  --  Rupert Goodwins


From marius.spix at web.de  Wed Apr 14 02:11:33 2021
From: marius.spix at web.de (Marius Spix)
Date: Wed, 14 Apr 2021 09:11:33 +0200
Subject: Aw: RE: Haiku Poetry Day Saturday 17 April 2021
In-Reply-To: <001001d730ea$63b9a3f0$2b2cebd0$@iki.fi>
References: <001001d730ea$63b9a3f0$2b2cebd0$@iki.fi>
Message-ID: <trinity-48fdd347-6527-45f2-8427-a8bfa700db57-1618384293776@3c-app-webde-bs33>

In German, there is a similar poem form called Elfchen (diminutive of "Elf", which means "eleven"), which has the structure 1 word - 2 words - 3 words - 4 words - 1 word.

> Sent: Wednesday, 14 April 2021 at 06:55
> From: "eik--- via Unicode" <unicode at unicode.org>
> To: "'John W Kennedy'" <john.w.kennedy at gmail.com>, unicode at unicode.org
> Subject: RE: Haiku Poetry Day Saturday 17 April 2021
>
> There are several Haikus written and published also at least in Finnish.
> 
> Erkki I. Kolehmainen
> Mannerheimintie 75 B 37, 00270 Helsinki, Finland
> Mob: +358 400 825 943 
> 
> -----Original Message-----
> From: Unicode <unicode-bounces at unicode.org> On Behalf Of John W Kennedy via Unicode
> Sent: Tuesday, 13 April 2021 22:45
> To: Unicode Discussion <unicode at unicode.org>
> Subject: Re: Haiku Poetry Day Saturday 17 April 2021
> 
> On Apr 13, 2021, at 12:49 PM, William_J_G Overington via Unicode <unicode at unicode.org> wrote:
> 
> > https://www.daysoftheyear.com/days/haiku-poetry-day/
> > 
> > Are haiku poems written in any languages other than Japanese and English?
> 
> I don't know of any other languages in which haiku /are/ written; I dare say French would serve better than English.
> 
> English hates haiku;
> Syllables, like April snow,
> Melt and flow away.
> 
> Of course, one might always question the meaning of "mora", Returning to the springtime of English and the forge of the ancient English scops, But I fear that a seven-footed line's too long.
> 
> > It occurs to me that the 5-7-5 structure in one language may well not have a 5-7-5 structure if translated from one language to another.
> 
> "Traduttore traditore." (And the more exquisite, the worse.)
> 
> --
> John W Kennedy
> "...when you're trying to build a house of cards, the last thing you should do is blow hard and wave your hands like a madman."
>   --  Rupert Goodwins


From doug at ewellic.org  Wed Apr 14 11:41:28 2021
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 14 Apr 2021 10:41:28 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
Message-ID: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>

Is anyone aware of an existing RFC or other specification that includes complete, correct, and clear ABNF for Unicode escape sequences using the UTF-16 encoding scheme?

Examples:
\u0041
\u3042
\uD801\uDC02  (NOT: \U00010402)
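
The third example is a surrogate pair: two 16-bit escapes that together name one supplementary code point. A rough Python sketch of the pairing rule that a correct grammar has to enforce is given here. The function names are made up for illustration, and the input is assumed to consist of nothing but consecutive \uXXXX escapes.

```python
import re

# Matches one \uXXXX escape and captures its four hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def code_units(text):
    # Hypothetical helper: list the 16-bit code units named by the escapes.
    if not re.fullmatch(r"(?:\\u[0-9A-Fa-f]{4})*", text):
        raise ValueError("expected only \\uXXXX escapes")
    return [int(digits, 16) for digits in ESCAPE.findall(text)]

def is_well_formed(units):
    # True if every surrogate code unit is part of a lead/trail pair.
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:                      # lead surrogate
            has_trail = (i + 1 < len(units)
                         and 0xDC00 <= units[i + 1] <= 0xDFFF)
            if not has_trail:
                return False                              # lead without trail
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            return False                                  # trail without lead
        else:
            i += 1
    return True
```

With this sketch, all three examples above pass the check, while a lone \uD801 or a reversed pair does not.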

This type of sequence is described in Section 6.3 of RFC 5137, but that RFC does not recommend this syntax and does not include ABNF for it.

"Correct" implies, for instance, that the ABNF excludes unpaired surrogates.

To be clear, I'm NOT looking for someone on this list to contribute their own code, but rather a pointer to code that is already published, and easy for another document, such as an I-D, to reference.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From markus.icu at gmail.com  Wed Apr 14 12:43:51 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 14 Apr 2021 07:43:51 -1000
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
Message-ID: <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>

Hi Doug,

On Wed, Apr 14, 2021 at 6:45 AM Doug Ewell via Unicode <unicode at unicode.org>
wrote:

> Is anyone aware of an existing RFC or other specification that includes
> complete, correct, and clear ABNF for Unicode escape sequences using the
> UTF-16 encoding scheme?
> ...
> "Correct" implies, for instance, that the ABNF excludes unpaired
> surrogates.
>

I was looking for something, but all I can find is either loose about
surrogates (e.g., Java
<https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html>), or deals
in code points rather than UTF-16 code units.

Can you say why you want/need strict 16-bit escapes for well-formed UTF-16
code units, rather than what others are doing?

markus

From wjgo_10009 at btinternet.com  Wed Apr 14 13:55:18 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 14 Apr 2021 19:55:18 +0100 (BST)
Subject: Colours
Message-ID: <24f9813c.115dc.178d1bcda63.Webtop.107@btinternet.com>

A very interesting document was added yesterday to the Current Document 
Register.

Examining Emoji Color Spaces:
A Strategy for Improving the Coverage of Heart Emoji

https://www.unicode.org/L2/L2021/21075-heart-emoji-coverage.pdf

I have been studying this and have started a thread in the Affinity 
forum, which some readers might perhaps find of interest.

https://forum.affinity.serif.com/index.php?/topic/140122-colours/

William Overington

Wednesday 14 April 2021


From duerst at it.aoyama.ac.jp  Wed Apr 14 18:50:43 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 15 Apr 2021 08:50:43 +0900
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
Message-ID: <cdb4c35b-5c64-430e-32c4-c1a79dee541f@it.aoyama.ac.jp>

Hello Doug,

On 2021-04-15 01:41, Doug Ewell via Unicode wrote:
> Is anyone aware of an existing RFC or other specification that includes complete, correct, and clear ABNF for Unicode escape sequences using the UTF-16 encoding scheme?
> 
> Examples:
> \u0041
> \u3042
> \uD801\uDC02  (NOT: \U00010402)
> 
> This type of sequence is described in Section 6.3 of RFC 5137, but that RFC does not recommend this syntax and does not include ABNF for it.
> 
> "Correct" implies, for instance, that the ABNF excludes unpaired surrogates.
> 
> To be clear, I'm NOT looking for someone on this list to contribute their own code, but rather a pointer to code that is already published, and easy for another document, such as an I-D, to reference.

So I guess you are looking for something like the regular expression on
https://www.w3.org/International/questions/qa-forms-utf-8, but for the 
above syntax (rather than byte sequences in UTF-8) and in ABNF.

The closest I was able to come up with from memory may be 
https://tools.ietf.org/html/rfc5137, but it's not exactly what you want. 
I'd guess it might be quicker for you to put something together on your 
own (and then maybe run it by this list).

Regards,   Martin.


> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
> 

From doug at ewellic.org  Wed Apr 14 20:52:11 2021
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 14 Apr 2021 19:52:11 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
Message-ID: <002701d73199$f3e94f70$dbbbee50$@ewellic.org>

Markus Scherer wrote:

> I was looking for something, but all I can find is either loose about
> surrogates (e.g.,
> https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html), or
> deals in code points rather than UTF-16 code units.

Yes, the text of the Java spec knows about concatenating a high surrogate and a low surrogate, but doesn't know about excluding unpaired surrogates. So the syntax on that page is really just pre-1993 UCS-2.

> Can you say why you want/need strict 16-bit escapes for well-formed
> UTF-16 code units, rather than what others are doing?

It's for an update to RFC 8610, which defines CDDL, a metalanguage for expressing CBOR data structures. The syntax is already defined and out in the field, so it's too late to change it, but the ABNF describing it was incorrect and someone filed an erratum.

The discussion was on how to fix the ABNF, and I thought it would be better to find and validate a rule already published than to create an all-new, probably slightly different, and possibly buggy one.

In the end, some time after I wrote my message, the decision was made to create a new rule (see below). Fortunately it has a lot of eyes on it, and seems to be correct.

Martin J. Dürst wrote:

> So I guess you are looking for something like the regular expression
> on https://www.w3.org/International/questions/qa-forms-utf-8, but for
> the above syntax (rather than byte sequences in UTF-8) and in ABNF.

Yes.

> The closest I was able to come up from memory may be
> https://tools.ietf.org/html/rfc5137, but it's not exactly what you
> want.

No, that just repeats the Java spec's UCS-2 definition, in real ABNF instead of whatever the Java spec is using. I did mention RFC 5137 in my post; when I said it didn't include ABNF for the syntax we're talking about, I meant surrogate-aware.

> I'd guess it might be quicker for you to put something together on
> your own (and then maybe run it by this list).

What Carsten Bormann came up with was this:

hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)

non-surrogate = ((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
                ("D" %x30-37 2HEXDIG )

high-surrogate = "D" ("8" / "9" / "A" / "B") 2HEXDIG

low-surrogate = "D" ("C" / "D" / "E" / "F") 2HEXDIG

(My contribution was to define non-surrogate, high-surrogate, and low-surrogate separately instead of making this one behemoth rule.)
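
For anyone who wants to try the rules out, here is a rough transliteration into a Python regular expression. It is a sketch, not part of the erratum. It assumes the usual ABNF conventions: quoted literals such as "D" are case-insensitive, HEXDIG therefore accepts lowercase digits too, and %x75 matches only a lowercase "u". The constant names are made up for the example.

```python
import re

HEX = "[0-9A-Fa-f]"

# non-surrogate: 0000-D7FF and E000-FFFF
NON_SURROGATE = rf"(?:[0-9A-CEFa-cef]{HEX}{{3}}|[Dd][0-7]{HEX}{{2}})"
# high-surrogate: D800-DBFF
HIGH_SURROGATE = rf"[Dd][89ABab]{HEX}{{2}}"
# low-surrogate: DC00-DFFF
LOW_SURROGATE = rf"[Dd][C-Fc-f]{HEX}{{2}}"

# hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
HEXCHAR = re.compile(
    rf"(?:{NON_SURROGATE}|{HIGH_SURROGATE}\\u{LOW_SURROGATE})"
)
```

Note that hexchar describes what follows the first "\u", so the strings to test are of the form 0041 or D801\uDC02 rather than \u0041.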

J Decker wrote:

> There's also long encode in JS using \u{NNNNN}  where the N digits
> aren't required, because there's a framing of {}.... this allows one
> to specify A character without surrogate encoding.

Thanks, but the goal was not to find a better encoding for CDDL, but to find good ABNF for the encoding that CDDL already uses.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From markus.icu at gmail.com  Thu Apr 15 00:09:04 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 14 Apr 2021 19:09:04 -1000
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
Message-ID: <CAN49p6oa4A9t9r9kzxw+kY3Na93adeg1rdwC=2GgB7n42EWcyQ@mail.gmail.com>

On Wed, Apr 14, 2021 at 3:55 PM Doug Ewell via Unicode <unicode at unicode.org>
wrote:

> (My contribution was to define non-surrogate, high-surrogate, and
> low-surrogate separately instead of making this one behemoth rule.)
>

lgtm

I personally like to use "lead surrogate" and "trail surrogate" because I
find "high" vs. "low" confusing.

markus

From duerst at it.aoyama.ac.jp  Fri Apr 16 00:33:34 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Fri, 16 Apr 2021 14:33:34 +0900
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
Message-ID: <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>

Hello Doug,

(Carsten cc'ed as a shortcut.)

On 2021-04-15 10:52, Doug Ewell via Unicode wrote:

> Martin J. Dürst wrote:
> 
>> So I guess you are looking for something like the regular expression
>> on https://www.w3.org/International/questions/qa-forms-utf-8, but for
>> the above syntax (rather than byte sequences in UTF-8) and in ABNF.
> 
> Yes.
> 
>> The closest I was able to come up from memory may be
>> https://tools.ietf.org/html/rfc5137, but it's not exactly what you
>> want.
> 
> No, that just repeats the Java spec's UCS-2 definition, in real ABNF instead of whatever the Java spec is using. I did mention RFC 5137 in my post;

Sorry, I shouldn't have missed that.

> when I said it didn't include ABNF for the syntax we're talking about, I meant surrogate-aware.
> 
>> I'd guess it might be quicker for you to put something together on
>> your own (and then maybe run it by this list).
> 
> What Carsten Bormann came up with was this:
> 
> hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
> 
> non-surrogate = ((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
>                  ("D" %x30-37 2HEXDIG )
> 
> high-surrogate = "D" ("8" / "9" / "A" / "B") 2HEXDIG
> 
> low-surrogate = "D" ("C" / "D" / "E" / "F") 2HEXDIG
> 
> (My contribution was to define non-surrogate, high-surrogate, and low-surrogate separately instead of making this one behemoth rule.)

What bothers me in this grammar is that the first "\u" isn't anywhere in 
sight, but the second one is there. It would be much clearer if either 
the first "\u" is at the start of hexchar, i.e.

hexchar = "\" %x75 (non-surrogate / (high-surrogate "\" %x75 low-surrogate))

or the various "\u" parts are integrated with the various parts, as follows:

hexchar = non-surrogate / (high-surrogate low-surrogate)

non-surrogate = "\" %x75 (((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
                  ("D" %x30-37 2HEXDIG))

high-surrogate = "\" %x75 "D" ("8" / "9" / "A" / "B") 2HEXDIG

low-surrogate = "\" %x75 "D" ("C" / "D" / "E" / "F") 2HEXDIG

The way it is written, it looks like the convenience of ABNF details
(such as maybe line length) are dominating the expression of a clear 
structure.
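
The two arrangements suggested above should accept exactly the same strings. A quick Python sketch to spot-check that follows, reading the "\u" prefix in the second arrangement as applying to both alternatives of non-surrogate. The regular expressions are my own transliteration and the sample list is made up, so this is a sanity check rather than a proof.

```python
import re

H = "[0-9A-Fa-f]"
NON = rf"(?:[0-9A-CEFa-cef]{H}{{3}}|[Dd][0-7]{H}{{2}})"
HIGH = rf"[Dd][89ABab]{H}{{2}}"
LOW = rf"[Dd][C-Fc-f]{H}{{2}}"
U = r"\\u"   # regex for a literal backslash followed by lowercase u

# Arrangement 1: a single "\u" prefix at the start of hexchar.
prefixed = re.compile(rf"{U}(?:{NON}|{HIGH}{U}{LOW})")

# Arrangement 2: "\u" folded into each of the three sub-rules.
integrated = re.compile(rf"(?:{U}{NON}|{U}{HIGH}{U}{LOW})")

samples = [r"\u0041", r"\uD7FF", r"\uD801\uDC02",
           r"\uD801", r"\uDC02", r"\uD801\u0041", "0041"]

# True when both patterns give the same verdict on every sample.
agree = all(
    bool(prefixed.fullmatch(s)) == bool(integrated.fullmatch(s))
    for s in samples
)
```

The difference between the two is then purely one of readability, which is the point being made here.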

Regards,   Martin.

From doug at ewellic.org  Fri Apr 16 10:25:38 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 16 Apr 2021 09:25:38 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
 <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
Message-ID: <000001d732d4$c1e1b850$45a528f0$@ewellic.org>

Martin J. Dürst wrote:

> What bothers me in this grammar is that the first "\u" isn't anywhere
> in sight, but the second one is there. It would be much clearer if
> either the first "\u" is at the start of hexchar, i.e.

Sorry, I neglected to include this line, which precedes everything I did quote:

SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
             (%x75 hexchar) )

SESC incorporates all the other backslash-escaped characters. It, not hexchar, is the real entity referenced by everything else.
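
To show how SESC and hexchar fit together, here is a small Python sketch that decodes a single escape. The meanings given to \b, \f, \n, \r and \t are the familiar JSON ones, which is an assumption on my part and is not stated in the rule itself. The names SIMPLE, SESC_RE and decode_sesc are made up for the example.

```python
import re

# Assumed meanings of the single-character escapes (JSON conventions).
SIMPLE = {'"': '"', "/": "/", "\\": "\\",
          "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}

H = "[0-9A-Fa-f]"
SESC_RE = re.compile(
    r"\\(?:"
    r'(["/\\bfnrt])'                                        # simple escape
    rf"|u([0-9A-CEFa-cef]{H}{{3}}|[Dd][0-7]{H}{{2}})"       # non-surrogate
    rf"|u([Dd][89ABab]{H}{{2}})\\u([Dd][C-Fc-f]{H}{{2}})"   # surrogate pair
    r")"
)

def decode_sesc(escape):
    # Return the character denoted by one SESC, or raise ValueError.
    match = SESC_RE.fullmatch(escape)
    if match is None:
        raise ValueError(f"not a valid SESC: {escape!r}")
    simple, bmp, high, low = match.groups()
    if simple is not None:
        return SIMPLE[simple]
    if bmp is not None:
        return chr(int(bmp, 16))
    # Combine the surrogate pair into one supplementary code point.
    hi, lo = int(high, 16), int(low, 16)
    return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
```

For instance, decode_sesc(r"\uD801\uDC02") yields the single character U+10402, while decode_sesc(r"\uD801") raises ValueError because the lead surrogate has no partner.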

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From beckiergb at gmail.com  Fri Apr 16 14:00:40 2021
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Fri, 16 Apr 2021 12:00:40 -0700
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
 <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
 <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
Message-ID: <CAH=y87bGx22TazM8-DQ_ffXwnAQyqngCCvuVtSwSy77Gg6Nb-Q@mail.gmail.com>

Is %x2F supposed to be %x27?

-- Rebecca Bettencourt


On Fri, Apr 16, 2021 at 8:29 AM Doug Ewell via Unicode <unicode at unicode.org>
wrote:

> Martin J. Dürst wrote:
>
> > What bothers me in this grammar is that the first "\u" isn't anywhere
> > in sight, but the second one is there. It would be much clearer if
> > either the first "\u" is at the start of hexchar, i.e.
>
> Sorry, I neglected to include this line, which precedes everything I did
> quote:
>
> SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>              (%x75 hexchar) )
>
> SESC incorporates all the other backslash-escaped characters. It, not
> hexchar, is the real entity referenced by everything else.
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>
>
>
>

From kent.b.karlsson at bahnhof.se  Fri Apr 16 15:09:27 2021
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Fri, 16 Apr 2021 22:09:27 +0200
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
 <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
 <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
Message-ID: <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>


> On 16 Apr 2021, at 17:25, Doug Ewell via Unicode <unicode at unicode.org> wrote:
> 
> Martin J. Dürst wrote:
> 
>> What bothers me in this grammar is that the first "\u" isn't anywhere
>> in sight, but the second one is there. It would be much clearer if
>> either the first "\u" is at the start of hexchar, i.e.
> 
> Sorry, I neglected to include this line, which precedes everything I did quote:
> 
> SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>             (%x75 hexchar) )

1) Why are some "very plain letters in ASCII" given as hex escapes here? Esp. since the not so plain (it is used as an escape, which is the point here...) "\" has not warranted a hex escape. (The grammar even uses it to escape ", which is a bit ironic.)

2) Apart from the second line there, these have nothing to do with "\u" escapes, and in addition the set of these other escapes varies (a bit) by programming language (or other context), and technically aren't needed when \u escapes are allowed (though still practical).

/Kent K


> SESC incorporates all the other backslash-escaped characters. It, not hexchar, is the real entity referenced by everything else.
> 
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
> 
> 
> 


From doug at ewellic.org  Fri Apr 16 15:38:11 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 16 Apr 2021 14:38:11 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
 <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
 <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
 <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>
Message-ID: <000001d73300$6b1788c0$41469a40$@ewellic.org>

Again, the object of this exercise was not to redefine the CDDL syntax, but to find good, debugged ABNF to describe the existing syntax.

It does, however, seem reasonable that backslash (%x5C) should be included in the list. Also, as Rebecca pointed out, solidus (%x2F) should apparently be changed to single-quote (%x27). These are helpful corrections, but orthogonal to the question of how best to represent the \u syntax in ABNF.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


-----Original Message-----
From: Kent Karlsson <kent.b.karlsson at bahnhof.se> 

> On 16 Apr 2021, at 17:25, Doug Ewell via Unicode <unicode at unicode.org> wrote:
> 
> Martin J. Dürst wrote:
> 
>> What bothers me in this grammar is that the first "\u" isn't anywhere 
>> in sight, but the second one is there. It would be much clearer if 
>> either the first "\u" is at the start of hexchar, i.e.
> 
> Sorry, I neglected to include this line, which precedes everything I did quote:
> 
> SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>             (%x75 hexchar) )

1) Why are some "very plain letters in ASCII" given as hex escapes here? Esp. since the not so plain (it is used as an escape, which is the point here...) "\" has not warranted a hex escape. (The grammar even uses it to escape ", which is a bit ironic.)

2) Apart from the second line there, these have nothing to do with "\u" escapes, and in addition the set of these other escapes varies (a bit) by programming language (or other context), and technically aren't needed when \u escapes are allowed (though still practical).

/Kent K


From doug at ewellic.org  Fri Apr 16 15:49:46 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 16 Apr 2021 14:49:46 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <ADCF8BE3-D7BE-4150-A617-239B8E866B3D@tzi.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
 <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
 <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
 <CAH=y87bGx22TazM8-DQ_ffXwnAQyqngCCvuVtSwSy77Gg6Nb-Q@mail.gmail.com>
 <ADCF8BE3-D7BE-4150-A617-239B8E866B3D@tzi.org>
Message-ID: <000201d73302$0937c500$1ba74f00$@ewellic.org>

Carsten Bormann wrote:

>> Is %x2F supposed to be %x27?
>
> No, it's really %x2F.

OK, I take back what I wrote a few minutes ago.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From kent.b.karlsson at bahnhof.se  Sat Apr 17 08:13:21 2021
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sat, 17 Apr 2021 15:13:21 +0200
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <8BAFBF3F-79E8-4E22-88C4-58FBF4041C50@tzi.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
 <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
 <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
 <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>
 <8BAFBF3F-79E8-4E22-88C4-58FBF4041C50@tzi.org>
Message-ID: <0334017C-1263-4FFE-BE01-D67810BAD8A3@bahnhof.se>

(Going a bit further off the original topic of this thread...)

> 16 apr. 2021 kl. 22:54 skrev Carsten Bormann <cabo at tzi.org>:
> 
> 
>> On 16. Apr 2021, at 22:09, Kent Karlsson <kent.b.karlsson at bahnhof.se> wrote:
>> 
>>> SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>>>           (%x75 hexchar) )
>> 
>> 1) Why are some "very plain letters in ASCII" given as hex escapes here? Esp. since the not so plain (it is used as an escape, which is the point here...) "\" has not warranted a hex escape. (The grammar even uses it to escape ", which is a bit ironic).
> 
> Because RFC 8259 does.
> 
> This is ABNF, so there are some peculiarities to be taken care of.
> Long form: %x2F and %x5C really should be "/" and "\", as I wrote before.
> %x22 is a convenient form to put a double quote into ABNF (there is no escaping in ABNF, which was invented around 1977, for RFC 733).
> 
> %x62 / %x66 / %x6E / %x72 / %x74 are of course "b"/"f"/"n"/"t", which prefixed by "\" are popular white space escapes

\b is usually used for backspace (going back decades in tradition...). But backspace is NOT a "whitespace" character at all. Whether it was used to create bold (on typewriter-type terminals), to overtype to create combined characters (long since deprecated), or as a command to erase the character preceding the current position, it has never been a whitespace character.

However, vertical tab (sometimes representable as a \v escape, as in C/C++, JavaScript, GoLang, PHP), nowadays used more for an ASCII representation of LINE SEPARATOR than for vertical tabulation, is usually regarded as a whitespace character.

/Kent K

> so you don't have to use \uXXXX for them.
> Unfortunately, writing "b"/"f"/"n"/"t" in ABNF would invoke the default case-insensitivity of ABNF (think 1977 again).
> This could be written %s"b"/%s"f"/%s"n"/%s"t" with the ABNF extension documented in RFC 7405, but RFC 7159 (that became RFC 8259 later) was written before RFC 7405 (obviously).  Also, not using the extension slightly widens the set of tools that can be used with this ABNF.
> 
> I apologise for polluting this list with arcane details of JavaScript and ABNF, but those are the reasons this grammar looks like it does.
> 
> Grüße, Carsten
> 
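
To make the case-sensitivity point in the quoted explanation concrete, here is a minimal Python sketch of how the three ways of writing a literal in ABNF differ. The function names are purely illustrative and are not taken from any ABNF tool; the only facts assumed are that plain quoted strings are case-insensitive (RFC 5234) and that %s"..." is the case-sensitive form added by RFC 7405.

```python
def match_quoted(literal: str, text: str) -> bool:
    """Plain ABNF "..." literal: ASCII case-insensitive (RFC 5234)."""
    return literal.lower() == text.lower()


def match_numeric(code_point: int, text: str) -> bool:
    """ABNF %xNN value: matches exactly that one code point."""
    return text == chr(code_point)


def match_case_sensitive(literal: str, text: str) -> bool:
    """ABNF %s"..." literal (RFC 7405 extension): exact match."""
    return literal == text


# Writing "b" in the grammar would also accept an uppercase B after the
# backslash, which is not wanted for the \b escape.
print(match_quoted("b", "B"))          # True
print(match_numeric(0x62, "B"))        # False
print(match_case_sensitive("b", "B"))  # False
```

So %x62 and %s"b" describe the same language; the hex form simply avoids depending on the RFC 7405 extension.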


From doug at ewellic.org  Sat Apr 17 10:50:42 2021
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 17 Apr 2021 09:50:42 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <0334017C-1263-4FFE-BE01-D67810BAD8A3@bahnhof.se>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
 <CAN49p6r0F9P+V6HrDDbovYsafn+3D2tH8ptqMdz-fTR0QiK0uQ@mail.gmail.com>
 <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
 <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
 <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
 <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>
 <8BAFBF3F-79E8-4E22-88C4-58FBF4041C50@tzi.org>
 <0334017C-1263-4FFE-BE01-D67810BAD8A3@bahnhof.se>
Message-ID: <001301d733a1$6cc99360$465cba20$@ewellic.org>

I can see it would have been best if I had not posted the "SESC = ..." line at all, but simply replied with something like "hexchar is referenced by a previous rule that includes the '\u' prefix."
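
For anyone following along, here is a rough Python sketch of how the escapes covered by the SESC rule might be decoded. It is only a sketch under stated assumptions: it assumes hexchar boils down to four hex digits, with a supplementary character written as a high-surrogate escape followed by a second \u and a low surrogate, as in JSON (RFC 8259). The names SIMPLE, ESCAPE and unescape are made up for illustration.

```python
import re

# Single-character escapes listed in the SESC rule: " / \ b f n r t
SIMPLE = {'"': '"', '/': '/', '\\': '\\', 'b': '\b',
          'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}

ESCAPE = re.compile(
    # \uD8xx-\uDBxx followed by \uDCxx-\uDFxx: a surrogate pair
    r'\\u([Dd][89ABab][0-9A-Fa-f]{2})\\u([Dd][C-Fc-f][0-9A-Fa-f]{2})'
    r'|\\u([0-9A-Fa-f]{4})'   # any other \uXXXX
    r'|\\(.)',                # backslash plus one character
    re.DOTALL)


def _replace(match):
    high, low, single, simple = match.groups()
    if high is not None:
        # Combine the two surrogates into one supplementary code point.
        hi, lo = int(high, 16), int(low, 16)
        return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
    if single is not None:
        cp = int(single, 16)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("unpaired surrogate escape")
        return chr(cp)
    if simple in SIMPLE:
        return SIMPLE[simple]
    raise ValueError(f"unknown escape: \\{simple}")


def unescape(text):
    """Decode the backslash escapes; other text passes through unchanged."""
    return ESCAPE.sub(_replace, text)


print(unescape(r'caf\u00e9') == 'caf\u00e9')            # True
print(unescape(r'\ud83d\ude00') == '\U0001F600')        # True
```

A lone trailing backslash is left untouched by this sketch; a real parser would reject it.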

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From lorna_evans at sil.org  Mon Apr 26 16:50:40 2021
From: lorna_evans at sil.org (Lorna Evans)
Date: Mon, 26 Apr 2021 16:50:40 -0500
Subject: Normalizing Syriac
Message-ID: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>

I've got a situation that I'm not sure how to handle...or even whether 
Unicode or the rendering engines need an update.

In a language using Syriac there is a /rish seyame/ which can be 
followed by U+0739 or U+0738

/rish /= 072A

/seyame /= 0308

In TUS, chapter 9, it says:

> In Modern Syriac usage, when a word contains a /rish /and a /seyame/, 
> the dot of
> the /rish /and the /seyame /are replaced by a /rish /with two dots 
> above it.
Then, there's a table which indicates this ligature is obligatory:

> Table 9-17. Syriac Ligatures
>
> Ligature Classes. As in other scripts, ligatures in Syriac vary 
> depending on the font style.
> Table 9-17 identifies the principal valid ligatures for each font 
> style. When applicable, these
> ligatures are obligatory, unless denoted with an asterisk (*).
>
> rish seyame Right-joining Right-joining Right-joining BFBS (no 
> asterisk, so it is obligatory)
>

Finally, in "Developing OpenType Fonts for Syriac Script" 
https://docs.microsoft.com/en-us/typography/script-development/syriac

In the "Glossary section" it says:

> *Ligature* - A combination of glyphs that join to form a single glyph. 
> For example, the 'rish seyame' (U072a + U0308) combinations of glyphs 
> are mandatory ligatures for Syriac. Other ligatures are optional.
>
So, it seems clear that 072a+0308 is a mandatory ligature. The problem 
I'm seeing is that when this ligature is followed by U+0739 or U+0738 
AND an application does normalization, it changes the sequence to U+072A 
U+0739 U+0308 and that breaks the ligature.

You can see why they are reordering it when you see 0308 is 230 and 
U+0738 or U+0739 are 220.

0308;COMBINING DIAERESIS;Mn;*230*;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
0738;SYRIAC DOTTED ZLAMA HORIZONTAL;Mn;*220*;NSM;;;;;N;;;;;
0739;SYRIAC DOTTED ZLAMA ANGULAR;Mn;*220*;NSM;;;;;N;;;;;
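
The reordering can be reproduced with Python's standard unicodedata module. This is only a small illustration of the canonical ordering step; the helper name code_points is made up for the example.

```python
import unicodedata

RISH, SEYAME, ZLAMA_ANGULAR = "\u072A", "\u0308", "\u0739"


def code_points(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Canonical combining classes: 230 (above) for U+0308, 220 (below) for the zlamas.
for ch in (SEYAME, "\u0738", ZLAMA_ANGULAR):
    print(code_points(ch), unicodedata.combining(ch))

# Normalization sorts adjacent marks by combining class, so the below mark
# (220) moves in front of the seyame (230).
typed = RISH + SEYAME + ZLAMA_ANGULAR
print(code_points(unicodedata.normalize("NFC", typed)))
# prints "U+072A U+0739 U+0308"
```

NFD gives the same order, since neither mark composes with the rish.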

All of the Syriac fonts that I see only handle this sequence *U+072A 
U+0308 U+0739* and not the reordered *U+072A U+0739 U+0308*

Are the fonts wrong, should they be able to handle U+072A U+0739 U+0308?

Or, is there a special normalization rule for this?

How should /rish seyame/ followed by a below mark like U+0738 or U+0739 
be handled?

Lorna



From richard.wordingham at ntlworld.com  Mon Apr 26 17:48:40 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 26 Apr 2021 23:48:40 +0100
Subject: Normalizing Syriac
In-Reply-To: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
Message-ID: <20210426234840.76732e6c@JRWUBU2>

On Mon, 26 Apr 2021 16:50:40 -0500
Lorna Evans via Unicode <unicode at unicode.org> wrote:

> You can see why they are reordering it when you see 0308 is 230 and 
> U+0738 or U+0739 are 220.
> 
> 0308;COMBINING DIAERESIS;Mn;*230*;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
> 0738;SYRIAC DOTTED ZLAMA HORIZONTAL;Mn;*220*;NSM;;;;;N;;;;;
> 0739;SYRIAC DOTTED ZLAMA ANGULAR;Mn;*220*;NSM;;;;;N;;;;;
> 
> All of the Syriac fonts that I see only handle this sequence *U+072A 
> U+0308 U+0739* and not the reordered *U+072A U+0739 U+0308*
> 
> Are the fonts wrong, should they be able to handle U+072A U+0739
> U+0308?
> 
> Or, is there a special normalization rule for this?
> 
> How should /rish seyame/ followed by a below mark like U+0738 or
> U+0739 be handled?

It depends on your technology.  In an OpenType font, I would combine
RISH with COMBINING DIAERESIS using a substitution lookup that ignores
marks below.  Am I missing something?  In a combination of base, mark
above and mark below, the order of the marks shouldn't matter if they
don't interact - one just sets up the mark 'attachment' classes so that
the marks are in different classes.  In later versions of OpenType, one
can even ignore a set of marks peculiar to that lookup.
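
As a toy model of such a lookup (not real OpenType, and not how any particular shaper is implemented), the following Python sketch forms the ligature while skipping any marks below that sit between RISH and COMBINING DIAERESIS. The glyph name and the function names are hypothetical, and placing the skipped marks after the ligature is simply an assumption made for the sketch.

```python
import unicodedata

RISH, SEYAME = "\u072A", "\u0308"
LIGATURE = "<rish_seyame>"  # hypothetical glyph name, for illustration only


def is_below_mark(ch):
    """Treat combining class 220 as 'mark below' for this sketch."""
    return unicodedata.combining(ch) == 220


def shape(text):
    """Return a glyph list, ligating RISH + SEYAME across skipped below marks."""
    glyphs = []
    i = 0
    while i < len(text):
        if text[i] == RISH:
            j = i + 1
            skipped = []
            # Skip over below marks, as a lookup that ignores them would.
            while j < len(text) and is_below_mark(text[j]):
                skipped.append(text[j])
                j += 1
            if j < len(text) and text[j] == SEYAME:
                glyphs.append(LIGATURE)
                glyphs.extend(skipped)  # assumption: skipped marks follow the ligature
                i = j + 1
                continue
        glyphs.append(text[i])
        i += 1
    return glyphs


# Both orders of the marks give the same glyph sequence.
print(shape("\u072A\u0308\u0739") == shape("\u072A\u0739\u0308"))  # True
```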

Of course, the OpenType (syntax) specification doesn't state what the
subsequent sequence of glyphs is after a ligature lookup if an
intervening mark has been skipped.  John Hudson has publicly complained
that the semantics of OpenType ought to be defined.  Perhaps some
Syriac shaper exploits this gap to go spectacularly wrong - one would
hope it doesn't.

It has struck me as odd that there is very little hint around of what
sequences of marks fonts have to handle.  Back when Harfbuzz was
beginning to handle Tai Tham, Behdad kindly did a normalisation on the
fly so that tone marks (ccc=230) would come before COENG (ccc=9) so
that COENG would remain adjacent to its following consonant.  There is
a similar issue with Hebrew.  (Like a good boy, I'd elaborated my fonts
to handle normalised sequences.)

It is well known that the set of character sequences supported by
Uniscribe is not closed under canonical equivalence - apparently this is
allowed by the conformance clauses of TUS.

Richard.

From harjitmoe at outlook.com  Mon Apr 26 17:58:23 2021
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 26 Apr 2021 22:58:23 +0000
Subject: Normalizing Syriac
In-Reply-To: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
Message-ID: <VI1PR07MB57124255A861B9439EDB1655B7429@VI1PR07MB5712.eurprd07.prod.outlook.com>

What I gather for background information (which you may well already be aware of, but just in case) is that:

• Normalisation rules are set in stone per stability policy (software has to be able to rely on any input that normalises to a certain output continuing to normalise like that, so it can use a normalised form as e.g. a database key, input for a password hash, etc., even if a better behaviour theoretically exists).

• A cluster of a base character and combining characters can be interrupted with one or more instances of the confusingly named Combining Grapheme Joiner (U+034F), which is typically used to split what is one grapheme cluster for display purposes into multiple grapheme clusters for normalisation and/or collation purposes. This can be used to inhibit diacritic reorderings that pose an issue in practice.
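
A quick check of the second point with Python's unicodedata module, offered as a sketch only. It shows that the sequence survives normalisation; it says nothing about whether a given font or shaper still forms the ligature or positions the mark correctly once the joiner is present.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# rish + seyame, then CGJ, then the below mark U+0739
typed = "\u072A\u0308" + CGJ + "\u0739"
normalized = unicodedata.normalize("NFC", typed)

# CGJ has combining class 0, so marks cannot be reordered across it.
print(unicodedata.combining(CGJ))  # 0
print(normalized == typed)         # True
```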

--Har.



From doug at ewellic.org  Mon Apr 26 23:21:08 2021
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 26 Apr 2021 22:21:08 -0600
Subject: Normalizing Syriac
In-Reply-To: <20210426234840.76732e6c@JRWUBU2>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
 <20210426234840.76732e6c@JRWUBU2>
Message-ID: <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>

Richard Wordingham wrote:

> In an OpenType font, I would combine RISH with COMBINING DIAERESIS
> using a substitution lookup that ignores marks below.

I thought the number one goal of Unicode was to make text encoding independent of font selection.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From richard.wordingham at ntlworld.com  Tue Apr 27 13:47:02 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 27 Apr 2021 19:47:02 +0100
Subject: Normalizing Syriac
In-Reply-To: <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
 <20210426234840.76732e6c@JRWUBU2>
 <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>
Message-ID: <20210427194702.5470e918@JRWUBU2>

On Mon, 26 Apr 2021 22:21:08 -0600
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham wrote:
> 
> > In an OpenType font, I would combine RISH with COMBINING DIAERESIS
> > using a substitution lookup that ignores marks below.  
> 
> I thought the number one goal of Unicode was to make text encoding
> independent of font selection.

Which is why I would do it that way.  Those who wrote the fonts seem to
have assumed that these two characters would wind up adjacent.

It isn't always trivial to ensure that canonically equivalent sequences
render the same.  The rendering engine can make a big difference to how
much work has to be done by the tables in the font.

Richard.

From asmusf at ix.netcom.com  Tue Apr 27 17:18:07 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 27 Apr 2021 15:18:07 -0700
Subject: Normalizing Syriac
In-Reply-To: <20210427194702.5470e918@JRWUBU2>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
 <20210426234840.76732e6c@JRWUBU2>
 <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>
 <20210427194702.5470e918@JRWUBU2>
Message-ID: <72d59eaf-7ba1-5f3f-afe5-629f444a6da0@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20210427/f580293c/attachment-0001.htm>

From richard.wordingham at ntlworld.com  Wed Apr 28 03:12:37 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 28 Apr 2021 09:12:37 +0100
Subject: Normalizing Syriac
In-Reply-To: <72d59eaf-7ba1-5f3f-afe5-629f444a6da0@ix.netcom.com>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
 <20210426234840.76732e6c@JRWUBU2>
 <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>
 <20210427194702.5470e918@JRWUBU2>
 <72d59eaf-7ba1-5f3f-afe5-629f444a6da0@ix.netcom.com>
Message-ID: <20210428091237.414f91cf@JRWUBU2>

On Tue, 27 Apr 2021 15:18:07 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> Doug, what Richard is saying is that Normalization being predictable
> and fixed, it is up to each font (all of them) to correctly render
> all forms of normalized text (and not to make assumptions on some
> unnormalized ordering).

I don't know if he has repented, but someone proposed that fonts
should not be allowed to remove dotted circles so as, for instance, to
render normalised Tibetan contractions with vowels above and below in
the same akshara.

Richard.

From rick at unicode.org  Wed Apr 28 16:25:02 2021
From: rick at unicode.org (Rick McGowan)
Date: Wed, 28 Apr 2021 14:25:02 -0700
Subject: Unicode.org mail system maintenance
Message-ID: <6089D2AE.30209@unicode.org>

Hello everyone,

On May 3 the Unicode Consortium will be doing maintenance and 
configuration updates to our e-mail service infrastructure. This could 
entail some mail delays, mail list outage, and/or temporary delivery 
issues. The maintenance is expected to begin mid-morning US Pacific Time 
and be complete by afternoon of the same day. I will be back in touch 
with further news after that.

Regards,