From abrahamgross at disroot.org  Tue Jun  2 22:18:35 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 03 Jun 2020 03:18:35 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Message-ID:

Why are there precomposed Hebrew characters in Unicode (Alphabetic Presentation Forms block)? It says in the FAQ that "a substantial number of presentation forms were encoded in Unicode as compatibility characters, because legacy software or data included them." (https://www.unicode.org/faq/ligature_digraph.html#PForms)

I can't find any character set other than Unicode that has separate codepoints for all Hebrew letters with a dagesh/mapiq, or for any of the other precomposed letters other than the Yiddish ligatures (e.g. Code page 862, ISO/IEC 8859-8, Windows-1255). Does anyone know where I can find the legacy software or character sets that had these presentation forms?

I also want to see the documents/proposals that got these characters accepted as part of Unicode. Does anyone know where I can find them? The closest I got was when I figured out that the proposal to add HEBREW LETTER YOD WITH HIRIQ is proposal N1364, but I can't find it in the document register.

From asmusf at ix.netcom.com  Wed Jun  3 00:33:09 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 2 Jun 2020 22:33:09 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References:
Message-ID: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>

An HTML attachment was scrubbed...

From abrahamgross at disroot.org  Wed Jun  3 00:37:19 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 3 Jun 2020 05:37:19 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
Message-ID:

How successful might I be in adding an additional Hebrew character to the Alphabetic Presentation Forms block?

From jk at koremail.com  Wed Jun  3 02:34:43 2020
From: jk at koremail.com (jk at koremail.com)
Date: Wed, 03 Jun 2020 15:34:43 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
Message-ID:

Dear Abraham,

Adding such characters as these, whatever the language, is a thing of the past, so a submission would not be successful.

Yours sincerely,
John Knightley

On 2020-06-03 13:37, abrahamgross--- via Unicode wrote:
> How successful might I be in adding an additional Hebrew character to
> the Alphabetic Presentation Forms block?

From asmusf at ix.netcom.com  Wed Jun  3 12:51:38 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 3 Jun 2020 10:51:38 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...
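A quick way to see what the original question is pointing at is to dump the block's decomposition mappings with Python's unicodedata module. This is a minimal sketch, assuming nothing beyond the standard library and its bundled Unicode character database:

    import unicodedata

    # Hebrew portion of the Alphabetic Presentation Forms block: U+FB1D..U+FB4F.
    for cp in range(0xFB1D, 0xFB50):
        ch = chr(cp)
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # a few code points in this range are unassigned
        # decomposition() returns e.g. '05D9 05B4' (canonical), or
        # '<compat> 05D0 05DC', or '' when there is no mapping at all.
        print(f"U+{cp:04X} {name}: {unicodedata.decomposition(ch) or '(none)'}")

Nearly every assigned character in the range maps back to a base letter plus marks in the main Hebrew block, which matches the FAQ's framing: these characters exist for round-tripping legacy data, not for authoring new text.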
From mark at kli.org  Wed Jun  3 16:42:52 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 17:42:52 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References:
Message-ID: <9afe4a17-3b80-87b9-697e-3e14f9eab536@kli.org>

On 6/3/20 1:37 AM, abrahamgross--- via Unicode wrote:
> How successful might I be in adding an additional Hebrew character to
> the Alphabetic Presentation Forms block?

It is unlikely that such characters would be considered, but the way to add an additional character to anyplace is pretty much the same: submit a proposal, as described on https://www.unicode.org/pending/proposals.html

What were you thinking of adding? Things like a bent-neck LAMED or more "wide" letters are pretty much guaranteed non-starters, I would guess. A "closed" QOF or the broken VAV (as is traditionally written in Numbers 25:12) would at best be a very tough sell.

~mark

From mark at kli.org  Wed Jun  3 16:51:04 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 17:51:04 -0400
Subject: QID Emoji (was: Re: Wireless Connection Symbol)
In-Reply-To:
References:
Message-ID: <3a0e8cf4-353e-ac6c-b039-7503ce53ad75@kli.org>

{Sorry this is out of date; I discovered my email to the unicode list wasn't going through.}

I'm not sure how much I could add to the points that have already been made, but just to stand up and be counted, I also think QID emoji are an awful idea and I can barely believe they are even being considered seriously. The possibilities are just too broad, etc... what everyone else said. We'd do better with a highly-compressed (vector?) image format that could somehow squeeze decent pictures into a few dozen characters.

On 5/27/20 12:18 PM, Sławomir Osipiuk via Unicode wrote:
> The issue to be resolved here lies in the process for adding emojis.
> The current process is too onerous and slow. I can imagine a new
> process, that isn't bound to a regular schedule, and that allows
> eminently useful and needed emojis to be fast-tracked to approval in
> days, not months. Perhaps an entire plane could be reserved for such
> emojis - 65K should be enough for anyone, right? ;) Perhaps there
> could be a provisional or probationary approval granted to certain
> emojis, or at least a "reservation" system for code points. A vendor
> could reserve spaces with emojis they plan to add (with reasonable
> limits, of course). There could be a public voting system to add or
> approve emojis in near-real-time based on thresholds for approval.
> It's 2020; we have the technology. Provisional emojis or code point
> reservations that don't see use/support after some amount of time are
> rejected and code points are allowed to be reused. Those that see use
> or public support are given final approval and become bound by
> stability requirements. The Unicode Consortium is still involved, but
> less so, relying more on automated metrics than meetings, though they
> would still have veto power if there is some valid subjective factor
> to consider.

This is fairly well-said. The problem is obviously real, or real enough to bug people: it takes too darn long to get emoji into Unicode. It takes a long time to get anything into Unicode, but most of the things we're putting in at this stage of the game are rare characters, small-userbase scripts, etc., and even the people who would use them have been doing okay without them for a while. Emoji have a different type of demand.
Emoji become popular, and even "necessary," _after_ they are in the standard; lots of people are itching to use each incoming proposal, and their userbase is a very large and outspoken segment of computer-users. A provisional something-or-other? Not entirely a bad idea. Lots of details perhaps to work out, to avoid assorted "horror" situations (reusing a code-point?! so my serious document about pokémon might in later years appear with emoji of Linux distributions?! oh, won't someone think of the stability!) while still making it all work out. No, I don't know how to solve all those issues. But the idea bears consideration, more than QID emoji do, IMO.

~mark

From abrahamgross at disroot.org  Wed Jun  3 19:21:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 00:21:39 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
Message-ID: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>

What about a folded lamed? How do you think a proposal for that would go? I have plenty of proof of it being used in the same sentence (even in the same word) as a regular lamed, so it's not just an alternate form of the same character like a and ɑ.

Here are some examples: https://imgur.com/a/xw9Kb8Z

From mark at kli.org  Wed Jun  3 20:43:34 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 21:43:34 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
Message-ID: <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>

On 6/3/20 8:21 PM, abrahamgross--- via Unicode wrote:
> What about a folded lamed? How do you think a proposal for that would
> go? I have plenty of proof of it being used in the same sentence (even
> in the same word) as a regular lamed, so it's not just an alternate
> form of the same character like a and ɑ.
>
> Here are some examples: https://imgur.com/a/xw9Kb8Z

I think it would be a very hard sell. Just because they're used in the same sentence doesn't mean they aren't alternate forms of the same character. Sometimes there were scribal preferences, etc. There's no *meaning* that's different between the two LAMEDs. There isn't any text where it matters which one you use where, except for trying to replicate the exact *appearance* of a document, and that is exactly the realm of more sophisticated systems. Unicode isn't publishing software; it isn't supposed to replace Word. A LAMED is a LAMED.

The example in your picture is actually quite interesting, because it looks like they either ran out of bent LAMEDs or made a mistake or something. The bent LAMED was invented for reasons of typesetting: LAMED is the only letter with an ascender, and it tended to get in the way of things with Hebrew text being set with little or no leading and letter-height filling almost the entire line-height. You can see where there are straight LAMEDs on your page, that their ascenders reach into places in the line above that happen to be open enough not to cause problems, like spaces between words or letters with no baseline. Otherwise, the bent LAMED was pressed into service, because that's what it's for. Except... for the one you show inside a blue box.
That should have been a bent LAMED, because a straight one would have been bumping or almost bumping into the TSERE above it. But for whatever reason, they didn't use a bent LAMED, and made do by taking a straight LAMED and cutting off its head!

Here's another way to look at it. If you (or the original typesetter) had set this same text in the same font slightly differently, maybe a little wider or narrower, or maybe with an additional word or even a footnote-mark inserted or something, would the bent LAMEDs still be bent and the straight LAMEDs still be straight? No! The text would flow differently, and some of the straight LAMEDs would have to be bent, because they no longer had space above them, while some of the bent LAMEDs could be straight, because in this layout there's space for them.

So there isn't anything about the LAMED in the word you have highlighted in red that makes it "straight." That isn't a feature of the letter in the plain text. It's a feature of the typeset page. Just like there's nothing special about an "i" following an "f" (in many fonts) that makes it have no dot; it's just a thing that happens to i following f in those fonts, that they join into an ﬁ ligature. It isn't a feature of the i, it's a feature of the typesetting. (OK, that's a bad example, because of course ﬁ *is* encoded, but that was due to round-tripping considerations and other stuff that we don't like to apply anymore. But the idea is still useful.)

~mark

From asmusf at ix.netcom.com  Wed Jun  3 21:12:45 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 3 Jun 2020 19:12:45 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
Message-ID:

An HTML attachment was scrubbed...

From abrahamgross at disroot.org  Wed Jun  3 21:21:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 02:21:39 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
Message-ID: <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org>

This is exactly why I want the folded lamed.

(I also want the headless lamed because I've also seen it used a lot and I really like it. It's especially useful when you need to put a RAFE or other trop/accent marks on a folded lamed.)

2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
> The bent LAMED was invented for reasons of typesetting: LAMED is the
> only letter with an ascender, and it tended to get in the way of
> things with Hebrew text being set with little or no leading and
> letter-height filling almost the entire line-height. You can see where
> there are straight LAMEDs on your page, that their ascenders reach
> into places in the line above that happen to be open enough not to
> cause problems, like spaces between words or letters with no baseline.
> Otherwise, the bent LAMED was pressed into service, because that's
> what it's for.
From abrahamgross at disroot.org  Wed Jun  3 21:52:19 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 02:52:19 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
Message-ID:

Finally a good explanation! I still want the folded lamed though... but at least I get it now.

2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
> If you (or the original typesetter) had set this same text in the same
> font slightly differently, maybe a little wider or narrower, or maybe
> with an additional word or even a footnote-mark inserted or something,
> would the bent LAMEDs still be bent and the straight LAMEDs still be
> straight? No! The text would flow differently, and some of the
> straight LAMEDs would have to be bent, because they no longer had
> space above them, while some of the bent LAMEDs could be straight,
> because in this layout there's space for them. So there isn't anything
> about the LAMED in the word you have highlighted in red that makes it
> "straight." That isn't a feature of the letter in the plain text. It's
> a feature of the typeset page. Just like there's nothing special about
> an "i" following an "f" (in many fonts) that makes it have no dot;
> it's just a thing that happens to i following f in those fonts, that
> they join into an ﬁ ligature. It isn't a feature of the i, it's a
> feature of the typesetting. (OK, that's a bad example, because of
> course ﬁ *is* encoded, but that was due to round-tripping
> considerations and other stuff that we don't like to apply anymore.
> But the idea is still useful.)

From mark at kli.org  Wed Jun  3 22:02:44 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 23:02:44 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org>
Message-ID: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>

On 6/3/20 10:21 PM, abrahamgross--- via Unicode wrote:
> This is exactly why I want the folded lamed.
>
> (I also want the headless lamed because I've also seen it used a lot
> and I really like it. It's especially useful when you need to put a
> RAFE or other trop/accent marks on a folded lamed.)

Aha! So you need it for typesetting reasons! And that's exactly how you should obtain it. This is *precisely* why God created OpenType tables in modern fonts! So that when you have an AYIN with a vowel underneath it, the shape changes so it's flat and not descending as much (yes, I know, that's U+FB20 HEBREW LETTER ALTERNATIVE AYIN, but that, too, was added for reasons we don't like to admit anymore, and it would never be accepted today). I know for certain that John Hudson's "SBL Hebrew" font does exactly that; see the attached image. Nothing was done between the right frame and the left frame aside from typing a QAMATS. The letter changed automatically, because John Hudson has killer typography skillz[sic].
In fact, if I had used a PATAH, the letter would _not_ have changed UNTIL I typed a following letter, because a PATAH under an AYIN at the end of a word is a patah genuvah, which some prefer to set shifted over to the right a little.

I don't know of any font machinery that can actually change things based on what's present on the previous *line*; that may not be supported. But you can bet that such a thing won't be reason enough to encode a new character.

As for wanting other funky shapes, why, there's nothing to stop you. Just because they're all glyphic variants of the same letter doesn't mean you can't use them both. You can have stylistic alternatives in a font, so THIS "a" is two-story while THAT "a" is one-story, in the same font, by using your (higher-level!) formatting software to turn options on and off in setting the font. Look 'em up.

(A more brute-force method would be to make two copies of the font, FontA and FontB, the same except that one has a bent LAMED and one has a straight LAMED. Then you could change the LAMEDs you want to be this way to FontA and the ones you want that way to FontB.)

(I hope the picture came through.)

Bottom line: it's not bad to want these things, but this is not the way to get them. There are other tools especially for situations like this.

~mark

> 2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
>> The bent LAMED was invented for reasons of typesetting: LAMED is the
>> only letter with an ascender, and it tended to get in the way of
>> things with Hebrew text being set with little or no leading and
>> letter-height filling almost the entire line-height. You can see
>> where there are straight LAMEDs on your page, that their ascenders
>> reach into places in the line above that happen to be open enough not
>> to cause problems, like spaces between words or letters with no
>> baseline. Otherwise, the bent LAMED was pressed into service, because
>> that's what it's for.

From mark at kli.org  Wed Jun  3 22:10:04 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 23:10:04 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID:

On 6/3/20 11:02 PM, Mark E. Shoulson via Unicode wrote:
> Nothing was done between the right frame and the left frame aside from
> typing a QAMATS. The letter changed automatically, because John
> Hudson has killer typography skillz[sic].

And to be clear, that means that the *characters* in the document are U+05E2 HEBREW LETTER AYIN followed by U+05B8 HEBREW POINT QAMATS, and the "alternative ayin" U+FB20 is nowhere to be seen and did not in fact need to exist for this to work. It's just an alternate glyph for the character U+05E2. Unicode encodes characters, not glyphs.

~mark
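The behavior described here lives entirely in the font's OpenType tables, and it can be poked at from script. Below is a rough sketch using the uharfbuzz bindings to HarfBuzz; the font path and the ss01 feature tag are placeholders, since which tag (ss##, cv##, salt), if any, a given font uses for alternate shapes is the font designer's choice:

    import uharfbuzz as hb

    def glyph_ids(text, features):
        blob = hb.Blob.from_file_path("SBLHebrew.ttf")  # placeholder font path
        font = hb.Font(hb.Face(blob))
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()  # infers script=Hebrew, direction=RTL
        hb.shape(font, buf, features)
        return [info.codepoint for info in buf.glyph_infos]  # glyph IDs, not code points

    ayin_qamats = "\u05E2\u05B8"  # AYIN + QAMATS, the sequence discussed above
    print(glyph_ids(ayin_qamats, {}))              # default shaping
    print(glyph_ids(ayin_qamats, {"ss01": True}))  # with a hypothetical stylistic set

If the two runs print different glyph IDs, the feature fired; either way the underlying characters never change, which is the whole point.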
From 747.neutron at gmail.com  Wed Jun  3 22:15:57 2020
From: 747.neutron at gmail.com (Wáng Yifán)
Date: Thu, 4 Jun 2020 12:15:57 +0900
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
Message-ID:

I can't say that I am knowledgeable in the Hebrew script at all, but at first glance at your examples, I feel that it would be more appropriate for this to be either put in the main block or realized with a variation selector, if it is of some significance and its usage is not algorithmically inferable.

Compatibility characters are for compatibility, which means coping with standards preceding Unicode that don't go along with the Unicode model. If no prior standard manifested the need for that character/glyph, it would rather be called a "new character", and it would have no reason to be stuffed into that block.

On Thu, 4 Jun 2020 at 10:38, abrahamgross--- via Unicode wrote:
>
> What about a folded lamed? How do you think a proposal for that would
> go? I have plenty of proof of it being used in the same sentence (even
> in the same word) as a regular lamed, so it's not just an alternate
> form of the same character like a and ɑ.
>
> Here are some examples: https://imgur.com/a/xw9Kb8Z

From abrahamgross at disroot.org  Wed Jun  3 22:23:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 03:23:39 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org>

> I don't know of any font machinery that can actually change things
> based on what's present on the previous *line*; that may not be
> supported. But you can bet that such a thing won't be reason enough
> to encode a new character.

Not even with a variation selector?

Do you know which of the standards that existed before Unicode had the Hebrew characters from the presentation forms block? If it had the alternative ayin, then chances are that it had an "alternative lamed" too.

From abrahamgross at disroot.org  Wed Jun  3 22:26:30 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 03:26:30 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID:

Why do the final forms of the Hebrew letters (ךםןףץ) exist as separate codepoints from their regular counterparts (כמנפצ), when Arabic - which has up to 4 forms for each letter - only got a single codepoint per letter?
From sosipiuk at gmail.com  Wed Jun  3 22:44:28 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Wed, 3 Jun 2020 23:44:28 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org>
Message-ID: <002701d63a22$737f25e0$5a7d71a0$@gmail.com>

A variation selector seems like a good choice here. There should be a way to request from the rendering engine a specific variant of the "same" character. There is precedent for that in many other characters/languages.

From jameskasskrv at gmail.com  Thu Jun  4 01:26:34 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Thu, 4 Jun 2020 06:26:34 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com>

On 2020-06-04 3:26 AM, abrahamgross--- via Unicode wrote:
> Why do the final forms of the Hebrew letters (ךםןףץ) exist as separate
> codepoints from their regular counterparts (כמנפצ), when Arabic -
> which has up to 4 forms for each letter - only got a single codepoint
> per letter?

Because they were in a legacy character set. Windows 1255: https://en.wikipedia.org/wiki/Windows-1255

From jameskasskrv at gmail.com  Thu Jun  4 01:30:53 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Thu, 4 Jun 2020 06:30:53 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com>
Message-ID: <4fde4f17-0168-90da-2c10-02edbc6d8764@gmail.com>

On 2020-06-04 6:26 AM, James Kass wrote:
>
> On 2020-06-04 3:26 AM, abrahamgross--- via Unicode wrote:
>> Why do the final forms of the Hebrew letters (ךםןףץ) exist as
>> separate codepoints from their regular counterparts (כמנפצ), when
>> Arabic - which has up to 4 forms for each letter - only got a single
>> codepoint per letter?
>
> Because they were in a legacy character set. Windows 1255:
> https://en.wikipedia.org/wiki/Windows-1255

P.S. - The Arabic positional variants from legacy character sets were encoded as presentation forms.
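The legacy constraint is easy to verify, since Python ships the Windows-1255 codec. Nothing in this check is hypothetical except the choice of letters: final kaf and ordinary kaf occupy distinct byte values in cp1255, so a single shared code point could never have round-tripped existing data:

    # U+05DA HEBREW LETTER FINAL KAF vs. U+05DB HEBREW LETTER KAF
    for ch in ("\u05DA", "\u05DB"):
        b = ch.encode("cp1255")    # each letter is its own single byte in Windows-1255
        back = b.decode("cp1255")  # and each byte decodes back losslessly
        print(f"U+{ord(ch):04X} -> 0x{b[0]:02X} -> U+{ord(back):04X}")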
From jr at qsm.co.il  Thu Jun  4 02:02:36 2020
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Thu, 4 Jun 2020 07:02:36 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <4fde4f17-0168-90da-2c10-02edbc6d8764@gmail.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com> <4fde4f17-0168-90da-2c10-02edbc6d8764@gmail.com>
Message-ID:

In modern Hebrew it is not possible, in general, to determine by means of a simple rule whether to use the final form or the non-final form. For example, in the word ?????? the non-final ? is used at the final position of the word, or in the Hebrew transliteration of Arabic words, such as ??????. In Arabic, on the other hand, according to the Arab representatives to the ISO committees, the choice of the form is strictly dependent on its position in the word and on the surrounding letters.

Anecdotally, the first draft of Windows-1255 did use the same algorithm as for 1256, and it failed miserably on first demonstration, as the name of the Microsoft Israel manager was ????.

Best Regards,

Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of James Kass via Unicode
Sent: Thursday, June 4, 2020 9:31 AM
To: unicode at unicode.org
Subject: Re: Why do the Hebrew Alphabetic Presentation Forms Exist

On 2020-06-04 6:26 AM, James Kass wrote:
>
> On 2020-06-04 3:26 AM, abrahamgross--- via Unicode wrote:
>> Why do the final forms of the Hebrew letters (ךםןףץ) exist as
>> separate codepoints from their regular counterparts (כמנפצ), when
>> Arabic - which has up to 4 forms for each letter - only got a single
>> codepoint per letter?
>
> Because they were in a legacy character set. Windows 1255:
> https://en.wikipedia.org/wiki/Windows-1255

P.S. - The Arabic positional variants from legacy character sets were encoded as presentation forms.

From marius.spix at web.de  Thu Jun  4 02:31:51 2020
From: marius.spix at web.de (Marius Spix)
Date: Thu, 4 Jun 2020 09:31:51 +0200
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <20200604093151.227437a0@spixxi>

Unicode also has German s (U+0073) and ſ (U+017F), which are equivalent but were both used in typesetting for a long time. If you want to precisely reproduce a historic text, it is required to have separate ways to encode different glyphs. In plaintext documents you have no influence on OpenType presentation.

But you can use variation selectors, which can be registered in the IVD database. This would probably be the best way. Technically, using variation selectors has the same effect as different code points, as sequences with different variation selectors would encode different shapes of the same character.

It also appears that there are more variants of lamed with special meanings in the Bible:
https://www.hebrew4christians.com/Grammar/Unit_One/Aleph-Bet/Lamed/lamed.html

Can someone confirm that all variants of lamed have the same numeric value of 30?
If it is different between the variants, that would qualify for different characters.

We also have special glyph variants of the same character for special purposes, like an open-tail g for IPA (ɡ, U+0261) or an alternative phi for math (ϕ, U+03D5), but these are completely optional and have no different meaning from the closed-tail g and the curled phi. As far as I know, linguists and mathematicians accept both glyph variants as mutually interchangeable. I guess they are only in Unicode for historic reasons.

Regards,

Marius

On Wed, 3 Jun 2020 23:02:44 -0400 "Mark E. Shoulson via Unicode" wrote:

> On 6/3/20 10:21 PM, abrahamgross--- via Unicode wrote:
> > This is exactly why I want the folded lamed.
> >
> > (I also want the headless lamed because I've also seen it used a lot
> > and I really like it. It's especially useful when you need to put a
> > RAFE or other trop/accent marks on a folded lamed.)
>
> Aha! So you need it for typesetting reasons! And that's exactly how
> you should obtain it. This is *precisely* why God created OpenType
> tables in modern fonts! So that when you have an AYIN with a vowel
> underneath it, the shape changes so it's flat and not descending as
> much (yes, I know, that's U+FB20 HEBREW LETTER ALTERNATIVE AYIN, but
> that, too, was added for reasons we don't like to admit anymore, and
> it would never be accepted today). I know for certain that John
> Hudson's "SBL Hebrew" font does exactly that; see the attached
> image. Nothing was done between the right frame and the left frame
> aside from typing a QAMATS. The letter changed automatically,
> because John Hudson has killer typography skillz[sic]. In fact, if I
> had used a PATAH, the letter would _not_ have changed UNTIL I typed
> a following letter, because a PATAH under an AYIN at the end of a
> word is a patah genuvah, which some prefer to set shifted over to
> the right a little.
>
> I don't know of any font machinery that can actually change things
> based on what's present on the previous *line*; that may not be
> supported. But you can bet that such a thing won't be reason enough
> to encode a new character.
>
> As for wanting other funky shapes, why, there's nothing to stop you.
> Just because they're all glyphic variants of the same letter doesn't
> mean you can't use them both. You can have stylistic alternatives in
> a font, so THIS "a" is two-story while THAT "a" is one-story, in the
> same font, by using your (higher-level!) formatting software to turn
> options on and off in setting the font. Look 'em up.
>
> (A more brute-force method would be to make two copies of the font,
> FontA and FontB, the same except that one has a bent LAMED and one
> has a straight LAMED. Then you could change the LAMEDs you want to
> be this way to FontA and the ones you want that way to FontB.)
>
> (I hope the picture came through.)
>
> Bottom line: it's not bad to want these things, but this is not the
> way to get them. There are other tools especially for situations
> like this.
>
> ~mark
>
> > 2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
> >> The bent LAMED was invented for reasons of typesetting: LAMED is
> >> the only letter with an ascender, and it tended to get in the way
> >> of things with Hebrew text being set with little or no leading and
> >> letter-height filling almost the entire line-height. You can see
> >> where there are straight LAMEDs on your page, that their ascenders
> >> reach into places in the line above that happen to be open enough
> >> not to cause problems, like spaces between words or letters with
> >> no baseline. Otherwise, the bent LAMED was pressed into service,
> >> because that's what it's for.
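Mechanically, a variation sequence of the kind Marius describes is nothing more than plain text with a selector appended. The sketch below is purely illustrative: no lamed variation sequence is registered in the IVD or in StandardizedVariants.txt today, so a conformant renderer would simply ignore the selector and draw an ordinary lamed:

    import unicodedata

    LAMED = "\u05DC"
    VS1 = "\uFE00"  # VARIATION SELECTOR-1

    seq = LAMED + VS1  # a hypothetical, unregistered variation sequence
    for c in seq:
        print(f"U+{ord(c):04X} {unicodedata.name(c)}")
    # Searching, sorting, and the letter's numeric value (30) would still
    # see a single lamed; only a renderer that knows the registered
    # sequence would ever change the glyph.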
From abrahamgross at disroot.org  Thu Jun  4 02:32:54 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 07:32:54 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200604093151.227437a0@spixxi>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604093151.227437a0@spixxi>
Message-ID: <55a36a11-1463-46bf-be57-7282b61b6b68@disroot.org>

They all share the value of 30.

2020/06/04 3:28:51 AM, Marius Spix via Unicode:
> It also appears that there are more variants of lamed with special
> meanings in the Bible:
> https://www.hebrew4christians.com/Grammar/Unit_One/Aleph-Bet/Lamed/lamed.html
>
> Can someone confirm that all variants of lamed have the same numeric
> value of 30? If it is different between the variants, that would
> qualify for different characters.

From richard.wordingham at ntlworld.com  Thu Jun  4 02:59:37 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 4 Jun 2020 08:59:37 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <20200604085937.5c3135d9@JRWUBU2>

On Wed, 3 Jun 2020 23:02:44 -0400 "Mark E. Shoulson via Unicode" wrote:

> I don't know of any font machinery that can actually change things
> based on what's present on the previous *line*; that may not be
> supported. But you can bet that such a thing won't be reason enough
> to encode a new character.

And that leads to a problem that would not be solved by the encoding of a new character. If the position of line breaks may change, or the text may be reset in a font with different relative character widths, then which lamedhs are bent would change.

Arguably, the right place for standardisation is probably OpenType and AAT features - and it might even be addressed already.

Richard.
TUS gives an explanation for the separate encoding of those final forms, in the section on Hebrew. Devising rules for automatic selection would be too difficult, and would probably need an override mechanism anyway. There are similar cases scattered through Unicode. Off the top of my head I can think of: U+017F LATIN SMALL LETTER LONG S U+03C2 GREEK SMALL LETTER FINAL SIGMA U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA Richard. From abrahamgross at disroot.org Thu Jun 4 03:37:41 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Thu, 4 Jun 2020 08:37:41 +0000 (UTC) Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604092857.32b2cf60@JRWUBU2> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> Message-ID: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Whats TUS? From mark at kli.org Thu Jun 4 07:28:08 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 4 Jun 2020 08:28:08 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <002701d63a22$737f25e0$5a7d71a0$@gmail.com> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org> <002701d63a22$737f25e0$5a7d71a0$@gmail.com> Message-ID: <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> On 6/3/20 11:44 PM, S?awomir Osipiuk via Unicode wrote: > > A variation selector seems like a good choice here. There should be a > way to request from the rendering engine a specific variant of the > ?same? character. There is precedent for that in many other > characters/languages. > This isn't a matter for a variation selector.? This is purely a *scribal* or *presentation* alternation.? It has as much relevance to the content of the text as choice of font.? This is a matter for a stylistic alternate in the font tables.? This is *exactly* what those are for! ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Thu Jun 4 07:58:23 2020 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Thu, 4 Jun 2020 14:58:23 +0200 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Message-ID: <2f972d97-620c-c5fd-c224-1c0fc5636cad@gmail.com> Le 04/06/2020 ? 10:37, abrahamgross--- via Unicode a ?crit?: > Whats TUS? > The Unicode Standard, I guess. It is available here www.unicode.org/versions/latest/.? 
The part on Hebrew in https://www.unicode.org/versions/Unicode13.0.0/ch09.pdf , indeed contains the following paragraph: Because final form usage is a matter of spelling convention, software should not automatically substitute final forms for nominal forms at the end of words. The positional variants should be coded directly and rendered one-to-one via their own glyphs?that is, without contextual analysis. Fr?d?ric From sosipiuk at gmail.com Thu Jun 4 08:00:28 2020 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Thu, 4 Jun 2020 09:00:28 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Message-ID: On Thu, Jun 4, 2020 at 4:42 AM abrahamgross--- via Unicode wrote: > > Whats TUS? > I believe that means "The Unicode Standard" and the section Richard Wordingham was referring to is in Chapter 9: Final (Contextual Variant) Letterforms. Variant forms of five Hebrew letters are encoded as separate characters in this block, as in Hebrew standards including ISO/IEC 8859-8. These variant forms are generally used in place of the nominal letterforms at the end of words. Certain words, however, are spelled with nominal rather than final forms, particu- larly names and foreign borrowings in Hebrew and some words in Yiddish. Because final form usage is a matter of spelling convention, software should not automatically substitute final forms for nominal forms at the end of words. The positional variants should be coded directly and rendered one-to-one via their own glyphs?that is, without contextual analy- sis. From mark at kli.org Thu Jun 4 08:02:40 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 4 Jun 2020 09:02:40 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604085937.5c3135d9@JRWUBU2> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604085937.5c3135d9@JRWUBU2> Message-ID: On 6/4/20 3:59 AM, Richard Wordingham via Unicode wrote: > On Wed, 3 Jun 2020 23:02:44 -0400 > "Mark E. Shoulson via Unicode" wrote: > >> I don't know of any font machinery that can actually change things >> based on what's present on the previous *line*; that may not be >> supported. But you can bet that such a thing won't be reason enough >> to encode a new character. > And that leads to a problem that would not be solved by the encoding of > a new character. If the position of line breaks may change, or the > text may be reset in a font with different relative character widths, > then which lamedhs are bent would change. > > Arguably, the right place for standardisation is probably OpenType and > AAT features - and it might even be addressed already. Yes, exactly.? An author (or typesetting program, higher level than a font) would have to choose the right variant for each LAMED... which is what 'salt' tables are for, isn't it? 
~mark From richard.wordingham at ntlworld.com Thu Jun 4 08:30:38 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2020 14:30:38 +0100 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Message-ID: <20200604143038.25fa9e05@JRWUBU2> On Thu, 4 Jun 2020 08:37:41 +0000 (UTC) abrahamgross--- via Unicode wrote: > Whats TUS? > The Unicode Standard. Richard. From richard.wordingham at ntlworld.com Thu Jun 4 11:15:39 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2020 17:15:39 +0100 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org> <002701d63a22$737f25e0$5a7d71a0$@gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> Message-ID: <20200604171539.7bfb71cb@JRWUBU2> On Thu, 4 Jun 2020 08:28:08 -0400 "Mark E. Shoulson via Unicode" wrote: > On 6/3/20 11:44 PM, S?awomir Osipiuk via Unicode wrote: > > > > A variation selector seems like a good choice here. There should be > > a way to request from the rendering engine a specific variant of > > the ?same? character. There is precedent for that in many other > > characters/languages. > This isn't a matter for a variation selector.? This is purely a > *scribal* or *presentation* alternation.? It has as much relevance to > the content of the text as choice of font.? This is a matter for a > stylistic alternate in the font tables.? This is *exactly* what those > are for! That wasn't obvious to whoever first implemented them in MS Word. The feature settings for a font applied throughout the document! There's also a problem that application writers think one needs a friendly interface expressed in layman's terms, whereas a fix like this is quite likely to be described in the documentation as 'Set feature cv05 to 6 for lamedh to be bent'. It took ages to get OpenType features supported in LibreOffice, even though they'd already implemented Graphite features. Now, it has been pointed out elsewhere that for best effects, shaping should apply to whole paragraphs. Fortunately, applying to whole words is usually good enough. However, what if a word has two lamedhs, and only is to be bent? Are mere word-processors now up to handling that and processing the whole word as a whole, even though different parts have different feature settings? Richard. 
From richard.wordingham at ntlworld.com Thu Jun 4 11:27:49 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2020 17:27:49 +0100 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604085937.5c3135d9@JRWUBU2> Message-ID: <20200604172749.357309a1@JRWUBU2> On Thu, 4 Jun 2020 09:02:40 -0400 "Mark E. Shoulson via Unicode" wrote: > > Arguably, the right place for standardisation is probably OpenType > > and AAT features - and it might even be addressed already. > Yes, exactly.? An author (or typesetting program, higher level than a > font) would have to choose the right variant for each LAMED... which > is what 'salt' tables are for, isn't it? I was thinking more along the lines of something like tnum, which gets digits to have the same advance width so that numbers in rows of digits can more easily align. You then don't have to refer to the font documentation; if you want that behaviour, either the font doesn't support it, or you just specify that feature tnum be applied. Richard. From abrahamgross at disroot.org Thu Jun 4 11:29:50 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Thu, 4 Jun 2020 16:29:50 +0000 (UTC) Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604171539.7bfb71cb@JRWUBU2> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org> <002701d63a22$737f25e0$5a7d71a0$@gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> <20200604171539.7bfb71cb@JRWUBU2> Message-ID: What? I don't understand what you're saying here 2020/06/04 ??0:17:21 Richard Wordingham via Unicode : > However, what if a word has two lamedhs, and > only is to be bent?? Are mere word-processors now up to handling that > and processing the whole word as a whole, even though different parts > have different feature settings? > From prosfilaes at gmail.com Thu Jun 4 12:05:00 2020 From: prosfilaes at gmail.com (David Starner) Date: Thu, 4 Jun 2020 10:05:00 -0700 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> Message-ID: On Wed, Jun 3, 2020 at 10:51 PM abrahamgross--- via Unicode wrote: > > Why do the final forms of the hebrew letters (?????) exist as separate codepoints from their regular counterparts (?????), when arabic - which has up to 4 forms for each letter - only got a single codepoint per letter? Because encoding is full of somewhat arbitrary choices. Alphabets with a handful of variant forms, like Latin, Greek, and Hebrew, it's easier and more expected to encode those separately, instead of complicating systems with one exception. Keyboard entry can go directly into a buffer with minimal massaging. 
Scripts like Arabic, where each letter takes four forms, would be harder to deal with under that model; you can't expect keyboard users to type each form separately, so either you add a heavy input manager, or you encode each letter and let the font deal with the different forms. (Which has its problems; I suspect if Persian script had been encoded separately/Persian was the main user of the Arabic script, that it would have been encoded slightly differently, as Persian uses ZWJ and ZWNJ more frequently to force forms. But the current encoding still works for Persian; it's just a matter of tradeoffs.) -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991) From mark at kli.org Thu Jun 4 15:30:20 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 4 Jun 2020 16:30:20 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604093151.227437a0@spixxi> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604093151.227437a0@spixxi> Message-ID: <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4@kli.org> {Sent this morning, but it bounced due to size.? Re-sending, with attachments, using jpg for smaller file-sizes.} On 6/4/20 3:31 AM, Marius Spix via Unicode wrote: > Unicode also has German s (U+0073) and ? (U+017F) which are > equivalent, but were used in typesetting for a long time. If you want > to precisely reproduce a historic text, it is required to have > separate ways to encode different glyps. In plaintext documents you > have no influence on OpenType presentation. Long-s also existed in earlier standards, and so had to be preserved. > But you can use variant selectors, which can be registered > in the IVD database. This would be propably the best way. Technically, > using variant selectors has the same effect as different code points, > as and would encode different shapes of the > same character. I don't think this rises even to the level of variation selectors.? This is a scribal alternation, like deciding to put some extra swash into a letter in this word but not that one. It's the whole purpose of OpenType tables. > It also appears that there are more variants of lamed with special > meanings in the bible: > https://www.hebrew4christians.com/Grammar/Unit_One/Aleph-Bet/Lamed/lamed.html > > Can someone confirm that all variants of lamed have the same numeric > value of 30? If it is different between the variants, that would > qualify for different characters. They are all 30, and more importantly they are all LAMEDs. Every one of those examples, the spelling of the word includes simply LAMED.? That's what's in the text.? What's on the paper (or parchment) can't be considered "plain text" since written or printed text is by definition formatted somehow, to fit on the page. You don't want to go down the rabbit hole of letters written in certain old Torahs with anomalous tags, extra tags, curled and looped heads, etc (these exist, I have sources if you want.)? Those are specialized cases and not even accepted (halachically) as significant in writing a Torah. (You'd have better luck with the broken VAV in Numbers 25:12, which is at least still done in modern Torahs.)? 
I think these are too specialized a case to be considered actual variant letters.? Attached are some pictures from an old Torah I saw on display.? The first shows a "looped" or "wrapped" PEH.? In the second one, note extra tags on the SAMECHs in the second line and on the FINAL KAF in the last.? The medial closed MEM in ????? in Isaiah 9:6 is at least codified in the Mesorah as well. Unlike (I think) Arabic positional variants, the Hebrew final forms have had more of an independent life as letters, considered as symbols of their own, so even if it weren't for the legacy encodings, they probably would have been rightly encoded separately.? After all, you can adjust what kind of joining an Arabic letter shows with proper use of ZWJ and ZWNJ, so the use of non-final PEH at the end of a word, *from a purely typographic perspective*, would not have been a barrier to encoding only a single PEH and choosing the form only by context. But there are other considerations in the case of Hebrew as it actually is.? The fact is that a straight (final) PEH and a bent (non-final) PEH are *distinct* and different letters in Modern Hebrew, at least in the context of the end of a word.? As was mentioned already, if you spell the word ???? as ????, you have spelled it *wrong*, and it would be pronounced differently.? And that usage has been in place for a long time; I think it's in Yiddish as well (but not Biblical Hebrew, witness Proverbs 30:6, with the word ?????????, a final -P spelled with straight-PEH-dagesh).? There are some forms of gematria (numerology) which consider the final letters to have different numerical values than the non-final letters.? So there's some reasonable history to consider them distinct, and encoding them separately would have been the right move even without the legacy considerations.? I think Arabic traditions don't have such distinctions. > We also have special glyph variants of the same character for special > purposes, like an open tail g for IPA (?, U+0261? or an alternative phi > for math (?, U+03D5), but these are completely optional and have no > different meaning from the closed tail g and the curled phi. As far as > I know linguists and mathematicians accept both glyph variants mutually > interchangeable. I guess, they are only in Unicode for historic reasons. Not so!? Contrariwise, in fact, at least for the IPA ?.? The reason it is encoded is because IPA stipulates that the symbol for the voiced velar stop be written ? with an open loop, and it is incorrect to write it with a binocular g.? Linguists do not consider these to be mutually interchangeable.? Same with the IPA ?, which is wrong if written two-storey.? I'm not sure about mathematics usage, but I think that there may be situations in math wherein ? and ? were used with distinct meanings (and not just by an isolated author.) ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pastedpic.jpg Type: image/jpeg Size: 40116 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pastedpic2.jpg Type: image/jpeg Size: 7026 bytes Desc: not available URL: From mark at kli.org Thu Jun 4 16:01:34 2020 From: mark at kli.org (Mark E. 
Shoulson)
Date: Thu, 4 Jun 2020 17:01:34 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200604171539.7bfb71cb at JRWUBU2>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4 at disroot.org> <002701d63a22$737f25e0$5a7d71a0$ at gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7 at kli.org> <20200604171539.7bfb71cb at JRWUBU2>
Message-ID:

On 6/4/20 12:15 PM, Richard Wordingham via Unicode wrote:
> On Thu, 4 Jun 2020 08:28:08 -0400 "Mark E. Shoulson via Unicode" wrote:
>> On 6/3/20 11:44 PM, Sławomir Osipiuk via Unicode wrote:
>>> A variation selector seems like a good choice here. There should be a way to request from the rendering engine a specific variant of the "same" character. There is precedent for that in many other characters/languages.
>> This isn't a matter for a variation selector. This is purely a *scribal* or *presentation* alternation. It has as much relevance to the content of the text as choice of font. This is a matter for a stylistic alternate in the font tables. This is *exactly* what those are for!
> That wasn't obvious to whoever first implemented them in MS Word. The feature settings for a font applied throughout the document!

Ah. I'd been seeing it in LibreOffice and other places, where you can twiddle the settings on individual spans, and didn't realize that originally these things were expected to be document-wide. Thank you for correcting me. Would you say, though, that while it may not be what they were originally meant for, this use fits very well into how they can be and are used today?

> There's also a problem that application writers think one needs a friendly interface expressed in layman's terms, whereas a fix like this is quite likely to be described in the documentation as 'Set feature cv05 to 6 for lamedh to be bent'. It took ages to get OpenType features supported in LibreOffice, even though they'd already implemented Graphite features.

Yeah, user interface is a hassle at all levels, and complicated things are going to have complicated interfaces.

> Now, it has been pointed out elsewhere that for best effects, shaping should apply to whole paragraphs. Fortunately, applying to whole words is usually good enough. However, what if a word has two lamedhs, and only one is to be bent? Are mere word-processors now up to handling that and processing the whole word as a whole, even though different parts have different feature settings?

Yes, what I had been envisioning would indeed involve setting the use of font features on small (one-character) spans in the middle of words, and I didn't consider how well word-processors can handle such a thing, and I don't really know. What about things like 'swsh' tables for swash effects? Are those applied to a whole word (paragraph?) at a time, but the table itself only affects the final letters of words? Or do you have to apply it to each individual letter that you would see swashed? If the latter, it's a lot like what I'm thinking about in this case.

~mark

From mark at kli.org Thu Jun 4 16:08:57 2020
From: mark at kli.org (Mark E.
Shoulson)
Date: Thu, 4 Jun 2020 17:08:57 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200604172749.357309a1 at JRWUBU2>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604085937.5c3135d9 at JRWUBU2> <20200604172749.357309a1 at JRWUBU2>
Message-ID: <561d3072-dce7-afa9-1c15-3281f4e51520 at kli.org>

On 6/4/20 12:27 PM, Richard Wordingham via Unicode wrote:
> On Thu, 4 Jun 2020 09:02:40 -0400 "Mark E. Shoulson via Unicode" wrote:
>>> Arguably, the right place for standardisation is probably OpenType and AAT features - and it might even be addressed already.
>> Yes, exactly. An author (or typesetting program, higher level than a font) would have to choose the right variant for each LAMED... which is what 'salt' tables are for, isn't it?
> I was thinking more along the lines of something like tnum, which gets digits to have the same advance width so that numbers in rows of digits can more easily align. You then don't have to refer to the font documentation; if you want that behaviour, either the font doesn't support it, or you just specify that feature tnum be applied.

And this, as you mentioned before, affecting the entire document, or at least a whole paragraph or table. But of course, the intent isn't to make the user choose between all straight LAMEDs and all bent ones, but to allow some to be one and some the other. I was thinking 'salt' tables could be used kind of like formatting instructions, to apply to _this_ span and not _that_ one, like you can highlight a single letter and italicize it. Even if they can't be used that way, then maybe it isn't a font thing; maybe the higher typesetting system has to make these decisions. After all, it's something that depends on the entire text-block and how the typesetter saw fit to lay it out. It's like hyphenation in that way, if you think about it. A hyphen character can't "know" that it is in a situation where it must break the line and become visible; that decision is made by the word processor. (Just turning visible at the end of a line can, of course, be handled at the font level.)

~mark

From wjgo_10009 at btinternet.com Thu Jun 4 16:14:20 2020
From: wjgo_10009 at btinternet.com (wjgo_10009 at btinternet.com)
Date: Thu, 4 Jun 2020 22:14:20 +0100 (BST)
Subject: QID Emoji
Message-ID: <28e02754.18ab.172812f4f7a.Webtop.43 at btinternet.com>

QID Emoji

The Public Review on the QID Emoji proposal is open for comment until 9 July 2020. https://www.unicode.org/review/

Discussion of QID Emoji in this mailing list, which is interesting and useful, does not however automatically form part of what is considered by the Unicode Technical Committee. https://www.unicode.org/review/pri408/

So, whatever your opinion on the QID Emoji proposal, you might like to consider sending it in on the contact form. Maybe a good compromise solution to the issue can be found.

William Overington

Thursday 4 June 2020

From mark at kli.org Thu Jun 4 17:14:59 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 4 Jun 2020 18:14:59 -0400
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
Message-ID:

An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Thu Jun 4 17:49:58 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 4 Jun 2020 23:49:58 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4 at disroot.org> <002701d63a22$737f25e0$5a7d71a0$ at gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7 at kli.org> <20200604171539.7bfb71cb at JRWUBU2>
Message-ID: <20200604234958.518a3a76 at JRWUBU2>

On Thu, 4 Jun 2020 16:29:50 +0000 (UTC) abrahamgross--- via Unicode wrote:
> What? I don't understand what you're saying here
>
> 2020/06/04 午後0:17:21 Richard Wordingham via Unicode:
>> However, what if a word has two lamedhs, and only one is to be bent? Are mere word-processors now up to handling that and processing the whole word as a whole, even though different parts have different feature settings?

Enabling and disabling features changes the set of rules a renderer uses to convert a sequence of characters to a sequence of coloured glyphs with defined relative positions. Even in simple scripts, they can control, amongst many other things, the horizontal spacing of letters, and even adjustments to white space, even handling things that were handled on typewriters by rules such as two spaces after commas and three spaces after full stops. (I'm told these were the RAF rules.)

A set of rules is easiest to implement if the rules are the same for the whole of the string being shaped. One solution is to chop a string up into runs with the same rules to be applied. However, the font then loses control over how the two parts line up. Now, if you have two lamedhs in a word, you may wish to bend one and not bend the other. Any adjustment between letters becomes difficult if the letters are subject to different rules. An obvious question would be, "Which rules apply to their interaction?"

Much as I dislike the idea of using variation sequences to control 'stylistic' effects, it does avoid these problems. It does come with the cost of increasing the number of glyphs whose interaction must be considered, though there are tricks to reduce the amount of thought required.

Richard.

From richard.wordingham at ntlworld.com Thu Jun 4 18:22:05 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 5 Jun 2020 00:22:05 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org>
Message-ID: <20200605002205.696251ba at JRWUBU2>

On Thu, 4 Jun 2020 16:30:20 -0400 "Mark E. Shoulson via Unicode" wrote:
> On 6/4/20 3:31 AM, Marius Spix via Unicode wrote:
>> We also have special glyph variants of the same character for special purposes, like an open tail g for IPA (ɡ, U+0261) or an alternative phi for math (ϕ, U+03D5), but these are completely optional and have no different meaning from the closed tail g and the curled phi.
>> As far as I know, linguists and mathematicians accept both glyph variants as mutually interchangeable. I guess they are only in Unicode for historic reasons.
> Not so! Contrariwise, in fact, at least for the IPA ɡ. The reason it is encoded is because IPA stipulates that the symbol for the voiced velar stop be written ɡ with an open loop, and it is incorrect to write it with a binocular g.

The IPA threw the towel in on that one, and now allow either.

> Linguists do not consider these to be mutually interchangeable. Same with the IPA ɑ, which is wrong if written two-storey.

That's different. [a] and [ɑ] are two different sounds. Of course, it all gets horribly confused when typefaces for children's books use single-storey 'a' and open-loop 'g'.

> I'm not sure about mathematics usage, but I think that there may be situations in math wherein φ and ϕ were used with distinct meanings (and not just by an isolated author.)

I suspect that's the difference between curly phi and straight phi. I must say, though, that I need a soft-stroked phi that drops the part above the circle when one applies a superscript. I'm British and I find the fluxion notation useful. (And no, differentiation was introduced to me with the 'd' notation.)

Richard.

From richard.wordingham at ntlworld.com Thu Jun 4 20:11:39 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 5 Jun 2020 02:11:39 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4 at disroot.org> <002701d63a22$737f25e0$5a7d71a0$ at gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7 at kli.org> <20200604171539.7bfb71cb at JRWUBU2>
Message-ID: <20200605021139.43c54eee at JRWUBU2>

On Thu, 4 Jun 2020 17:01:34 -0400 "Mark E. Shoulson via Unicode" wrote:
> On 6/4/20 12:15 PM, Richard Wordingham via Unicode wrote:
>> On Thu, 4 Jun 2020 08:28:08 -0400 "Mark E. Shoulson via Unicode" wrote:
>>> On 6/3/20 11:44 PM, Sławomir Osipiuk via Unicode wrote:
>>> This isn't a matter for a variation selector. This is purely a *scribal* or *presentation* alternation. It has as much relevance to the content of the text as choice of font. This is a matter for a stylistic alternate in the font tables. This is *exactly* what those are for!
>> That wasn't obvious to whoever first implemented them in MS Word. The feature settings for a font applied throughout the document!
> Ah. I'd been seeing it in LibreOffice and other places, where you can twiddle the settings on individual spans, and didn't realize that originally these things were expected to be document-wide. Thank you for correcting me. Would you say, though, that while it may not be what they were originally meant for, this use fits very well into how they can be and are used today?

Features were around long before MS Word implemented user control of them. The description of some of the features implies interaction between the author and the rendering machine. Cascading Style Sheets (CSS) brought optional features to the masses, and they're not designed for interactive layout.
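For anyone who wants to poke at this concretely, features can be toggled directly at the shaping level. Here is a minimal sketch using the uharfbuzz HarfBuzz bindings; the font file name and the 'cv05' feature tag are assumptions (a real font may expose a different tag, or none at all):

    # Shape the same string twice, toggling one OpenType feature, and
    # compare the resulting glyph sequences. "Example.ttf" is hypothetical.
    import uharfbuzz as hb

    with open("Example.ttf", "rb") as f:
        font = hb.Font(hb.Face(hb.Blob(f.read())))

    def shape(text, features):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()  # infers script/direction, e.g. Hebrew/RTL
        hb.shape(font, buf, features)
        return [info.codepoint for info in buf.glyph_infos]  # glyph IDs

    print(shape("\u05DC\u05DC", {}))              # two default lameds
    print(shape("\u05DC\u05DC", {"cv05": True}))  # variant lamed, if the font has one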
The CSS approach effectively provides a set of font customisations, so switching from one set to another seems to be like switching fonts, which you suggested as one approach. If it is implemented that way, then one loses font control across changes of options. A lot of Indic script engines appear not to have allowed interaction between clusters, so there would have been no loss of control by applying different fonts to different clusters.

Now, I've seen Windows interfaces that allow the application of features to be limited to parts of a string. I don't know how that works. I can imagine it becoming more sophisticated over time.

Reviewers of features were horrified that Word required the settings to apply across the document. That was widely seen as a design fault, and I trust it has now been fixed. My point was simply that what you saw as an obvious use of features was not obvious to everyone.

> Yes, what I had been envisioning would indeed involve setting the use of font features on small (one-character) spans in the middle of words, and I didn't consider how well word-processors can handle such a thing, and I don't really know. What about things like 'swsh' tables for swash effects? Are those applied to a whole word (paragraph?) at a time, but the table itself only affects the final letters of words? Or do you have to apply it to each individual letter that you would see swashed? If the latter, it's a lot like what I'm thinking about in this case.

I haven't used sophisticated layout systems, so I don't know how they work. I could well imagine that they didn't work with automatic kerning.

Richard.

From richard.wordingham at ntlworld.com Thu Jun 4 20:29:31 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 5 Jun 2020 02:29:31 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <561d3072-dce7-afa9-1c15-3281f4e51520 at kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604085937.5c3135d9 at JRWUBU2> <20200604172749.357309a1 at JRWUBU2> <561d3072-dce7-afa9-1c15-3281f4e51520 at kli.org>
Message-ID: <20200605022931.2bacd68a at JRWUBU2>

On Thu, 4 Jun 2020 17:08:57 -0400 "Mark E. Shoulson via Unicode" wrote:
> On 6/4/20 12:27 PM, Richard Wordingham via Unicode wrote:
>> On Thu, 4 Jun 2020 09:02:40 -0400 "Mark E. Shoulson via Unicode" wrote:
>>>> Arguably, the right place for standardisation is probably OpenType and AAT features - and it might even be addressed already.
>>> Yes, exactly. An author (or typesetting program, higher level than a font) would have to choose the right variant for each LAMED... which is what 'salt' tables are for, isn't it?
>> I was thinking more along the lines of something like tnum, which gets digits to have the same advance width so that numbers in rows of digits can more easily align. You then don't have to refer to the font documentation; if you want that behaviour, either the font doesn't support it, or you just specify that feature tnum be applied.
> And this, as you mentioned before, affecting the entire document, or at least a whole paragraph or table. But of course, the intent isn't to make the user choose between all straight LAMEDs and all bent ones, but to allow some to be one and some the other.
> I was thinking 'salt' tables could be used kind of like formatting instructions, to apply to _this_ span and not _that_ one, like you can highlight a single letter and italicize it.

Well, there's the rub. One loses layout control between italicised and unitalicised portions. This is how one would apply them using CSS. Of course, the system might be clever enough to work round breaks.

The wording of the salt feature at https://docs.microsoft.com/en-us/typography/opentype/spec/features_pt#-tag-salt suggests that the conception was that the salt feature would enable substitutions of a single glyph by another glyph, with a user interface allowing the user to choose the replacement glyph from a menu. There's no whiff of the notion that the choice presented might be context dependent. A clever enough system could have context-sensitive substitutions before and after that straddle the change in options selected. Other systems might not allow interactions between spans with different options chosen.

Richard.

From otto.stolz at uni-konstanz.de Fri Jun 5 06:26:50 2020
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Fri, 5 Jun 2020 13:26:50 +0200
Subject: German long S (was: Why do the Hebrew Alphabetic Presentation Forms Exist)
In-Reply-To: <20200604093151.227437a0 at spixxi>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi>
Message-ID: <487cc47f-df5b-0311-1e6b-f165ca8ee946 at uni-konstanz.de>

Hello,

on 2020-06-04 at 9:31, Marius Spix via Unicode wrote:
> Unicode also has German s (U+0073) and ſ (U+017F) which are equivalent,

No, they are not equivalent. In any orthography using ⟨ſ⟩ at all, ⟨s⟩ marks the end of a word, or of a constituent of a compound. Thus, e. g.
- „Wachstube“ [ˈvakstuːbə] = „Wachs-Tube“, a tube containing wax
- „Wachſtube“ [ˈvaxʃtuːbə] = „Wach-Stube“, guard room

Just a reminder, we have discussed this earlier in this list.

Best wishes,
  Otto

From asmusf at ix.netcom.com Fri Jun 5 14:41:01 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 5 Jun 2020 12:41:01 -0700
Subject: German long S
In-Reply-To: <487cc47f-df5b-0311-1e6b-f165ca8ee946 at uni-konstanz.de>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <487cc47f-df5b-0311-1e6b-f165ca8ee946 at uni-konstanz.de>
Message-ID:

An HTML attachment was scrubbed...
URL:

From tom at honermann.net Fri Jun 5 15:10:19 2020
From: tom at honermann.net (Tom Honermann)
Date: Fri, 5 Jun 2020 16:10:19 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Message-ID: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>

Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine):

> ... *Use of a BOM is neither required nor recommended for UTF-8*, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the "Byte Order Mark" subsection in Section 23.8, Specials, for more information.

The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend both to the presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of a BOM as an encoding signature is to be avoided in favor of some other mechanism?

The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature.

Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage.

So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? Or is it only intended to discourage BOM use in cases where the encoding is known by other means?

Tom.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 15:14:27 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 13:14:27 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

Emoji style seems wrong here. You would want this to look like the CC logo, not cute and colorful.

It sounds like the default assumption is for choosing a font with a math-like glyph vs. a CC-like glyph. If this does not work, then a standardized variation sequence might be useful. https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt

...
2295 FE00; with white rim; # CIRCLED PLUS
2297 FE00; with white rim; # CIRCLED TIMES
229C FE00; with equal sign touching the circle; # CIRCLED EQUALS
22DA FE00; with slanted equal; # LESS-THAN EQUAL TO OR GREATER-THAN
22DB FE00; with slanted equal; # GREATER-THAN EQUAL TO OR LESS-THAN
...

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 15:20:49 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 13:20:49 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

Actually, I should have looked at the proposal doc first: http://www.unicode.org/L2/L2017/17242r2-n4934r-creative-commons.pdf

> ... Their primary designs are exactly specified, while in current text forms may be used resembling the font design.
> ... In these examples, you see some variation in size, font, and placement, which is common for the © symbol as well. ...

In other words, the glyphs for these symbols are not as fixed as you might think, and the use of ⊜ likely fits right in.

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Fri Jun 5 16:47:49 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Fri, 5 Jun 2020 21:47:49 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
Message-ID:

The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with: you should always be using Unicode encodings, probably UTF-8 or UTF-16.) I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.

Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?

Anecdotally, if you can decode data without error in UTF-8, then it's probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short examples that can confuse it.

-Shawn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Fri Jun 5 17:00:23 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Fri, 5 Jun 2020 22:00:23 +0000
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

I don't really like the proposal at all. Is there prior context that I'm missing?

They don't want a "circled cc" character. They want a Creative Commons license symbol. They don't want the equivalent of ⓒ, they want ©. In plain text, the Creative Commons symbol has an explicit meaning; it's not a random emoji. It is unclear to me why this is being proposed as "circled characters" rather than "CC license symbols". My preference would be to see these encoded as "licensing symbols".

If I was designing a font that included the CC licensing symbols and the circled math symbols, I might choose to match the CC symbols published by them EXACTLY. However, the math symbols may have a slightly different style. As © and ⓒ likely do. Not to mention, if I have a © in my text, then it's clearly intended as an abbreviation for "copyright" and not a c that I thought looked prettier in a circle. That intent is not lost if I change fonts or whatever.

-Shawn

From: Unicode On Behalf Of Markus Scherer via Unicode
Sent: Freitag, 5. Juni 2020 13:21
To: Mark E. Shoulson
Cc: unicode at unicode.org
Subject: Re: Alternate presentation for U+229C CIRCLED EQUALS?

> Actually, I should have looked at the proposal doc first: http://www.unicode.org/L2/L2017/17242r2-n4934r-creative-commons.pdf ... In other words, the glyphs for these symbols are not as fixed as you might think, and the use of ⊜ likely fits right in.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tom at honermann.net Fri Jun 5 17:15:08 2020
From: tom at honermann.net (Tom Honermann)
Date: Fri, 5 Jun 2020 18:15:08 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References:
Message-ID: <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>

On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
> The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with: you should always be using Unicode encodings, probably UTF-8 or UTF-16.) I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
>
> Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?

The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project.

Tom.
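Since the thread keeps returning to sniffing source files, here is a minimal sketch of the BOM-as-signature check under discussion; the CP1252 fallback is an assumption for illustration, not anything the standard prescribes:

    # Detect a UTF-8 signature (EF BB BF); otherwise try strict UTF-8 and
    # fall back to an assumed legacy code page. Purely illustrative.
    import codecs

    def guess_source_encoding(path: str) -> str:
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"   # this codec strips the signature on decode
        try:
            data.decode("utf-8", errors="strict")
            return "utf-8"
        except UnicodeDecodeError:
            return "cp1252"      # assumed project-specific legacy fallback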
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From asmusf at ix.netcom.com Fri Jun 5 17:22:54 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 5 Jun 2020 15:22:54 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
Message-ID: <187adced-82bb-99c7-1d59-a82ee74f5d87 at ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From asmusf at ix.netcom.com Fri Jun 5 17:23:43 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 5 Jun 2020 15:23:43 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Fri Jun 5 17:33:23 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Fri, 5 Jun 2020 22:33:23 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
Message-ID:
I did find some DBCS CJK text that could look like valid UTF-8, so my ?one nine per byte of input? isn?t quite as high there, however for meaningful runs of text it is still reasonably hard to make sensible text in a double byte codepage look like UTF-8. Note that this ?works? partially because the ASCII range of the SBCS/DBCS code pages typically looks like ASCII, as does UTF-8. If you had a 7 bit codepage data with stateful shift sequences, of course that wouldn?t err in UTF-8. Fortunately for your scenario source code in 7 bit encodings is very rare nowadays. Hope that helps, -Shawn From: Tom Honermann Sent: Freitag, 5. Juni 2020 15:15 To: Shawn Steele Cc: Alisdair Meredith ; Unicode Mail List Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote: The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with you should always be using Unicode encodings, probably UTF-8 or UTF-16). I?d recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise. Are you asking because you?re interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding? The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project. Tom. Anecdotally, if you can decode data without error in UTF-8, then it?s probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short examples that can confuse it. -Shawn From: Unicode On Behalf Of Tom Honermann via Unicode Sent: Freitag, 5. Juni 2020 13:10 To: unicode at unicode.org Cc: Alisdair Meredith Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine): ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the ?Byte Order Mark? subsection in Section 23.8, Specials, for more information. The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend to both presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of UTF-8 as an encoding signature is to be avoided in favor of some other mechanism? The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature. Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage. So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From abrahamgross at disroot.org Fri Jun 5 17:48:05 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 5 Jun 2020 22:48:05 +0000 (UTC)
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com>
References: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com>
Message-ID: <60808396-077e-4df8-bc34-4793120dc91d at disroot.org>

Yes, thank you!

I vote for a separate codepoint for the CIRCLED EQUALS SIGN, since it has a different meaning and would also visually be displayed differently.

2020/06/05 午後6:27:43 Asmus Freytag via Unicode:
> Overloading this mathematical symbol with anything that needs different styling is *wrong*.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 18:04:12 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 16:04:12 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
Message-ID:

The BOM -- or for UTF-8, where "byte order" is meaningless, the Unicode signature byte sequence -- was popular when Unicode was gaining ground but legacy charsets were still widely used. Especially on Windows, which had settled on UTF-16 much earlier, lots of tools and editors started writing or expecting UTF-8 signatures. Other tools (especially in the Linux/Unix world) were never modified to expect or even cope with the signature, so they ignored it or choked on it. There has never been uniform practice on this.

For the most part, all new and recent text is now UTF-8, and the signature byte sequence has fallen out of favor again even where it had been used.

Having said that, I think the statement is right: "neither required nor recommended for UTF-8".

We might want to review chapter 23 and the FAQ and see if they should be updated.

Thanks,
markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 18:17:30 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 16:17:30 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

On Fri, Jun 5, 2020 at 3:00 PM Shawn Steele wrote:
> I don't really like the proposal at all.

The proposal is from 2017/2018. These characters were added in Unicode 13.

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at kli.org Fri Jun 5 18:30:51 2020
From: mark at kli.org (Mark E.
Shoulson)
Date: Fri, 5 Jun 2020 19:30:51 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200605002205.696251ba at JRWUBU2>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2>
Message-ID: <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>

On 6/4/20 7:22 PM, Richard Wordingham via Unicode wrote:
> On Thu, 4 Jun 2020 16:30:20 -0400 "Mark E. Shoulson via Unicode" wrote:
>> Not so! Contrariwise, in fact, at least for the IPA ɡ. The reason it is encoded is because IPA stipulates that the symbol for the voiced velar stop be written ɡ with an open loop, and it is incorrect to write it with a binocular g.
> The IPA threw the towel in on that one, and now allow either.

Bah! Cowards. I suppose it doesn't matter from Unicode's perspective, since Unicode is also concerned with historical usage, and there was a time when it mattered. (That's oversimplifying, I know.)

>> Linguists do not consider these to be mutually interchangeable. Same with the IPA ɑ, which is wrong if written two-storey.
> That's different. [a] and [ɑ] are two different sounds. Of course, it all gets horribly confused when typefaces for children's books use single-storey 'a' and open-loop 'g'.

Well, it's "different" only because binocular g didn't have another meaning, as two-storey a does. Though to be honest, if IPA has to have ɑ because it uses two-storey a and one-storey ɑ contrastively, then by rights there ought to be a character (or variation sequence or something) like LATIN SMALL LETTER TWO STOREY A, since after all, some fonts don't draw U+0061 the way that IPA stipulates is needed for the open front vowel. I've wondered about that from time to time.

~mark

From mark at kli.org Fri Jun 5 18:38:52 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Fri, 5 Jun 2020 19:38:52 -0400
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

On 6/5/20 4:20 PM, Markus Scherer wrote:
> Actually, I should have looked at the proposal doc first: http://www.unicode.org/L2/L2017/17242r2-n4934r-creative-commons.pdf
>
>> ... Their primary designs are exactly specified, while in current text forms may be used resembling the font design. ... In these examples, you see some variation in size, font, and placement, which is common for the © symbol as well. ...
>
> In other words, the glyphs for these symbols are not as fixed as you might think, and the use of ⊜ likely fits right in.

Not certain I buy that. I'm a font designer; I'm going to be designing the Creative Commons symbols, even if not in some "standard" way, at least in the style I envision for them, but CIRCLED EQUALS, a mathematics operator, will have different needs and I'll be designing it to comport well with mathematics, and that is very likely to be different from the CC symbols.

~mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at kli.org Fri Jun 5 18:42:50 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Fri, 5 Jun 2020 19:42:50 -0400
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To: <60808396-077e-4df8-bc34-4793120dc91d at disroot.org>
References: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com> <60808396-077e-4df8-bc34-4793120dc91d at disroot.org>
Message-ID: <68cdeef6-471a-3e4e-2f91-ffb8ddec7993 at kli.org>

On 6/5/20 6:48 PM, abrahamgross--- via Unicode wrote:
> Yes, thank you!
>
> I vote for a separate codepoint for the CIRCLED EQUALS SIGN, since it has a different meaning and would also visually be displayed differently.
>
> 2020/06/05 午後6:27:43 Asmus Freytag via Unicode:
>> Overloading this mathematical symbol with anything that needs different styling is *wrong*.

We don't get to "vote" here, but I think my preference, too, would be to encode a new character, as opposed to a variant of U+229C (or doing nothing, which of course is another alternative). The "ND" license symbol just seems to be a different creature to the math operator.

~mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From abrahamgross at disroot.org Fri Jun 5 18:49:47 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 5 Jun 2020 23:49:47 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
Message-ID:

YES, THIS!

I've been thinking about writing a proposal for the double-storey "a" so that I can send an unambiguous IPA transcription - even to ppl with devices that have the U+0061 "a" as a single-storey "a" - but I don't want to spend a ton of time on something that'll get rejected...

2020/06/05 午後7:31:32 Mark E. Shoulson via Unicode:
> Though to be honest, if IPA has to have ɑ because it uses two-storey a and one-storey ɑ contrastively, then by rights there ought to be a character (or variation sequence or something) like LATIN SMALL LETTER TWO STOREY A, since after all, some fonts don't draw U+0061 the way that IPA stipulates is needed for the open front vowel.

From Shawn.Steele at microsoft.com Fri Jun 5 19:01:07 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sat, 6 Jun 2020 00:01:07 +0000
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

I guess I'm a little late ...

From: Markus Scherer
Sent: Friday, June 5, 2020 4:18 PM
To: Shawn Steele
Cc: Mark E. Shoulson; Unicode Mail List
Subject: Re: Alternate presentation for U+229C CIRCLED EQUALS?

> On Fri, Jun 5, 2020 at 3:00 PM Shawn Steele wrote:
>> I don't really like the proposal at all.
> The proposal is from 2017/2018. These characters were added in Unicode 13.
> markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From abrahamgross at disroot.org Fri Jun 5 19:15:26 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sat, 6 Jun 2020 00:15:26 +0000 (UTC)
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

You can still make a proposal to add it to U+1F10D

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jk at koremail.com Fri Jun 5 20:32:21 2020
From: jk at koremail.com (jk at koremail.com)
Date: Sat, 06 Jun 2020 09:32:21 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
Message-ID: <731899901a3dee4a5dbfef61f14779b1 at koremail.com>

No, that some fonts display a character in a certain way would not be sufficient justification for a new character, but rather justification for not using those fonts in documents that contain IPA. Such a proposal would most certainly be rejected.

On 2020-06-06 07:49, abrahamgross--- via Unicode wrote:
> YES, THIS!
>
> I've been thinking about writing a proposal for the double-storey "a" so that I can send an unambiguous IPA transcription - even to ppl with devices that have the U+0061 "a" as a single-storey "a" - but I don't want to spend a ton of time on something that'll get rejected...
>
> 2020/06/05 午後7:31:32 Mark E. Shoulson via Unicode:
>> Though to be honest, if IPA has to have ɑ because it uses two-storey a and one-storey ɑ contrastively, then by rights there ought to be a character (or variation sequence or something) like LATIN SMALL LETTER TWO STOREY A, since after all, some fonts don't draw U+0061 the way that IPA stipulates is needed for the open front vowel.

From markus.icu at gmail.com Fri Jun 5 22:25:59 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 20:25:59 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
Message-ID:

On Fri, Jun 5, 2020 at 5:36 PM Tom Honermann via Unicode <unicode at unicode.org> wrote:
> On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>> Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
> The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project.

I would not use a BOM/signature on source code files. It will confuse or break various tools. I would take any non-ASCII/non-UTF-8 source code file and convert it to UTF-8, and be done with it.

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From public at khwilliamson.com Fri Jun 5 22:28:52 2020
From: public at khwilliamson.com (Karl Williamson)
Date: Fri, 5 Jun 2020 21:28:52 -0600
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
Message-ID: <499dd515-f61d-8114-aae7-52da51d92e58 at khwilliamson.com>

On 6/5/20 4:33 PM, Shawn Steele via Unicode wrote:
> I've been recommending that people assume documents are UTF-8.
> If the UTF-8 decoding fails, then consider falling back to some other codepage. Pretty much all the other code pages would contain text that would look like unexpected trail bytes, or lead bytes without trail bytes, etc. One can anecdotally find single-word Latin examples that break the pattern (Nestlé, IIRC), but if you want to think of accuracy in terms of "9s", then that pretty much has as many nines as you have bytes of input data.

I have code that attempts to distinguish between UTF-8 and CP1252 inputs. It now does a pretty good job; no one has complained in several years. To do this, I resort to some "semantic" analysis of the input. If it is syntactically valid UTF-8, but not a script run, it's not UTF-8. Likewise, the texts it will be subjected to are going to be in modern commercially-valuable scripts, so not IPA, for example. And it will be important characters, ones whose Age property is 1.1; the text won't contain C1 controls. CP1252 is harder than plain ASCII/Latin-1/C1 because many of the C1 controls are co-opted for graphic characters. Someone sent me the following example, scraped from some dictionaries, that it successfully gets right:

Muvrar\xE1\x9A\x9Aa is a mountain in Norway

This is legal 1252, and syntactically legal UTF-8, but the "semantic" tests say it isn't UTF-8.

I also have code that tries to distinguish between a UTF-8 POSIX locale and a non-UTF-8 one, and which needs to work on systems without certain C library functions that would make it foolproof. That is less successful, primarily because of insufficient text available to make a determination. One might think that the operating system error messages would be fruitful, but it turns out that many are in English; no one bothered to translate them. The locale's currency symbol is always translated, though the dollar sign is commonly used in other languages as part of the symbol. The time and date names are usually translated, and I use them.
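Karl's "script run" test can be approximated in a few lines. This sketch is ours, not his code, and it stands in for the real Unicode Script property by using the first word of each character's name:

    # Decode as UTF-8, then reject words that mix scripts; the Ogham letter
    # that E1 9A 9A decodes to should not appear inside a Latin word.
    import unicodedata

    def script_of(ch: str) -> str:
        if ch.isascii():
            return "LATIN"                          # crude, but fine here
        try:
            return unicodedata.name(ch).split()[0]  # e.g. 'OGHAM', 'HEBREW'
        except ValueError:
            return "UNASSIGNED"                     # suspicious by itself

    def looks_like_utf8(data: bytes) -> bool:
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return all(
            len({script_of(c) for c in word if c.isalpha()}) <= 1
            for word in text.split()
        )

    sample = b"Muvrar\xE1\x9A\x9Aa is a mountain in Norway"
    print(looks_like_utf8(sample))  # False: Latin letters mixed with Ogham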
From abrahamgross at disroot.org Fri Jun 5 22:50:21 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sat, 6 Jun 2020 03:50:21 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <731899901a3dee4a5dbfef61f14779b1 at koremail.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org> <731899901a3dee4a5dbfef61f14779b1 at koremail.com>
Message-ID: <712ae449-475d-425d-bfc3-934f45535b4d at disroot.org>

Even though it has completely different meanings?

2020/06/05 午後9:32:57 John Knightley via Unicode:
> No, that some fonts display a character in a certain way would not be sufficient justification for a new character, but rather justification for not using those fonts in documents that contain IPA. Such a proposal would most certainly be rejected.
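For reference, the two vowels at issue are already separate code points; only the glyph chosen for U+0061 varies by font. A tiny check:

    # The IPA contrast under discussion, shown by character name:
    import unicodedata
    for ch in "a\u0251":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+0061 LATIN SMALL LETTER A
    # U+0251 LATIN SMALL LETTER ALPHA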
From jr at qsm.co.il  Fri Jun  5 22:53:29 2020
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Sat, 6 Jun 2020 03:53:29 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <499dd515-f61d-8114-aae7-52da51d92e58@khwilliamson.com>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <499dd515-f61d-8114-aae7-52da51d92e58@khwilliamson.com>
Message-ID:

I am curious about how your code would work with CP1255 or CP1256?

Best Regards,

Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode
Sent: Saturday, June 6, 2020 6:29 AM
To: Shawn Steele; Tom Honermann
Cc: Alisdair Meredith; Unicode Mail List
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

I have code that attempts to distinguish between UTF-8 and CP1252 inputs. It now does a pretty good job; no one has complained in several years. To do this, I resort to some "semantic" analysis of the input. If it is syntactically valid UTF-8, but not a script run, it's not UTF-8.

From prosfilaes at gmail.com  Fri Jun  5 22:59:27 2020
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 5 Jun 2020 20:59:27 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
Message-ID:

On Fri, Jun 5, 2020 at 7:21 PM abrahamgross--- via Unicode wrote:
>
> YES, THIS!
>
> I've been thinking about writing a proposal for the double story "a" so that I can send an unambiguous IPA transcription - even to ppl with devices that have the U+0061 "a" as a single storey "a" - but I don't want to spend a ton of time on something that'll get rejected…

I understand the argument, but it's been over a quarter century since IPA was encoded with U+0061 standing for the IPA a, and it seems long past changing.

--
The standard is written in English. If you have trouble understanding a particular section, read it again and again and again... Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991)

From eliz at gnu.org  Sat Jun  6 01:39:44 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 09:39:44 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Shawn Steele via Unicode on Fri, 5 Jun 2020 22:33:23 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net>
Message-ID: <83h7voaea7.fsf@gnu.org>

> CC: Alisdair Meredith , Unicode Mail List
> Date: Fri, 5 Jun 2020 22:33:23 +0000
> From: Shawn Steele via Unicode
>
> I've been recommending that people assume documents are UTF-8.  If the
> UTF-8 decoding fails, then consider falling back to some other codepage.

That strategy would fail with 7-bit ISO 2022 based encodings, no? They look like plain 7-bit ASCII (which will not fail UTF-8), but actually represent non-ASCII text.

From Shawn.Steele at microsoft.com  Sat Jun  6 01:58:55 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sat, 6 Jun 2020 06:58:55 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83h7voaea7.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org>
Message-ID:

I mentioned that later.... But there is a lot of content for interchange that is single/double byte (8 bit) rather than requiring escape sequences. The 2022 encodings seem rarer, though it may depend on your data source.

-----Original Message-----
From: Eli Zaretskii
Sent: Friday, June 5, 2020 11:40 PM
To: Shawn Steele
Cc: tom at honermann.net; alisdairm at me.com; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

That strategy would fail with 7-bit ISO 2022 based encodings, no?
They look like plain 7-bit ASCII (which will not fail UTF-8), but actually represent non-ASCII text.

From eliz at gnu.org  Sat Jun  6 02:12:50 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 10:12:50 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Shawn Steele via Unicode on Sat, 6 Jun 2020 06:58:55 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org>
Message-ID: <83a71gacr1.fsf@gnu.org>

> CC: "tom at honermann.net" , "alisdairm at me.com" , "unicode at unicode.org"
> Date: Sat, 6 Jun 2020 06:58:55 +0000
> From: Shawn Steele via Unicode
>
> I mentioned that later.... But there is a lot of content for interchange that is single/double byte (8 bit) rather than requiring escape sequences. The 2022 encodings seem rarer, though it may depend on your data source.

I agree that ISO 2022 is rare these days, but rarity doesn't help when you need to be accurate in decoding, because mistaking one encoding for another produces horribly incorrect results, and users complain vociferously when that happens.

From junicode at jcbradfield.org  Sat Jun  6 02:45:10 2020
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Sat, 6 Jun 2020 08:45:10 +0100 (BST)
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID:

Just to digress a little, I get quite a lot of mail which has BOM/ZWNBSP scattered through it, sometimes at the beginning of the mail, sometimes at the beginning of the quoted mail to which it is a reply. Occasionally at the start of every line. Mostly it emanates from known useless webmail providers such as Yahoo, but some slightly more reputable providers do it as well. (I don't easily have a list, as I now filter it out before it hits my mailbox.)

Does anybody have an idea why they do this? Some accident of legacy coding?

From eliz at gnu.org  Sat Jun  6 05:57:57 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 13:57:57 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Harriet Riddle on Sat, 6 Jun 2020 10:05:49 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org>
Message-ID: <83sgf88nre.fsf@gnu.org>

> From: Harriet Riddle
> Date: Sat, 6 Jun 2020 10:05:49 +0000
>
> In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and the Gx-set designating ESC sequences make no sense in UTF-8.

What do you mean by "make no sense"? A general-purpose editor is presented with a byte stream and needs to decide how to interpret and display it. It usually has no meta-data about the byte stream to help it decide what does and doesn't make sense. It doesn't even know whether the byte stream is human-readable text or just raw binary bytes.

I understand that, given enough of the byte stream, one can analyze it and see whether interpreting it as one encoding or another will make more sense.
But these decisions are sometimes required after only a small portion of the material has arrived (a case in point: a process or a network connection that outputs text in relatively small chunks).

In any case, I was responding to a proposal to treat any text as UTF-8 "unless proven otherwise". My point is that with ISO 2022 encoding, and perhaps also others, such a proof is not really at hand.

> If that isn't feasible, then, more moderate measures might include trying 7-bit ISO 2022 and if it runs into a set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.

Treating ESC sequences as telltale signs of ISO 2022 is not foolproof, either. For example, you may be looking at UTF-8 text interspersed with terminal control sequences, like SGR or somesuch.

Bottom line: the real world out there is not as clean as we might think, and those rare corner cases keep breaking any simple-minded decision rules such as "assume UTF-8 by default".

From jr at qsm.co.il  Sat Jun  6 06:17:48 2020
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Sat, 6 Jun 2020 11:17:48 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83sgf88nre.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID:

Frequency analysis of bigrams and trigrams, provided the text is not too short, can reveal the encoding and even the language. But this is not normally the province of text editors and word processing software.

Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Eli Zaretskii via Unicode
Sent: Saturday, June 6, 2020 1:58 PM
To: Harriet Riddle
Cc: Shawn.Steele at microsoft.com; tom at honermann.net; alisdairm at me.com; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

What do you mean by "make no sense"? A general-purpose editor is presented with a byte stream and needs to decide how to interpret and display it.
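As a toy illustration of the bigram idea, in Python; the reference set below is a made-up stand-in, where a real detector would use frequency tables trained on a corpus for each candidate language:

    def bigram_score(text: str, common_bigrams) -> float:
        """Fraction of the text's bigrams found in a set of frequent
        bigrams for some language; higher means a more plausible decoding."""
        bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
        if not bigrams:
            return 0.0
        return sum(1 for b in bigrams if b in common_bigrams) / len(bigrams)

    def pick_encoding(data: bytes, candidates, common_bigrams) -> str:
        """Decode under each candidate encoding; keep the decoding that
        looks most like natural language (failed decodings score -1)."""
        scores = {}
        for enc in candidates:
            try:
                scores[enc] = bigram_score(data.decode(enc), common_bigrams)
            except UnicodeDecodeError:
                scores[enc] = -1.0
        return max(scores, key=scores.get)

    # Made-up sample of frequent English bigrams, purely for the demo.
    COMMON_EN = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}
    # \xE9 followed by a space is not valid UTF-8, so CP1252 wins here.
    print(pick_encoding(b"caf\xe9 in the garden", ["utf-8", "cp1252"], COMMON_EN))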
From eliz at gnu.org  Sat Jun  6 07:53:30 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 15:53:30 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Harriet Riddle on Sat, 6 Jun 2020 12:20:40 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID: <83mu5g8iet.fsf@gnu.org>

> From: Harriet Riddle
> Date: Sat, 6 Jun 2020 12:20:40 +0000
>
> So it is true that detecting ESC on its own will not identify 7-bit ISO 2022, but the specific sequence ESC $ B (ESC 0x24 0x42) has only one ANSI/ISO compliant meaning, which is to switch the G0 set to JIS X 0208. In UTF-8, there is no such thing as a G0 set (due to it not being fully ISO 2022 based), so it is meaningless.

If you are saying that "ESC $ B" or similar sequences can be considered as evidence that the text is not in UTF-8, then I might concur. Whether that's the "proof" that should reject UTF-8, I'm not sure.

From sosipiuk at gmail.com  Sat Jun  6 08:56:23 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Sat, 6 Jun 2020 09:56:23 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83sgf88nre.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID:

On Sat, Jun 6, 2020 at 7:04 AM Eli Zaretskii via Unicode wrote:
>
> What do you mean by "make no sense"? A general-purpose editor is presented with a byte stream and needs to decide how to interpret and display it. It usually has no meta-data about the byte stream to help it decide what does and doesn't make sense. It doesn't even know whether the byte stream is human-readable text or just raw binary bytes.

Escape sequences may be present in UTF-8, but SI and SO cannot be, nor can most designation sequences (a special subset of escape sequences), not only because they make no sense, but because ISO 10646 explicitly forbids them:

"Code extension control functions for the ISO/IEC 2022 code extension techniques (such as designation escape sequences, single shift, and locking shift) shall not be used with this coded character set."

The presence of these in a UTF-8 stream indicates an error of some kind. It's not completely impossible for them to appear in something that is otherwise valid UTF-8, but they should be treated, in my opinion, the same as overlong sequences or surrogates; i.e. the UTF-8 math works, but the code point isn't valid.
This can occur due to faulty conversion from another encoding, giving something that is close to UTF-8 but not quite right. This brings up the question of how error-tolerant Karl's algorithm is. 7-bit ISO 2022 encodings would clearly show such errors.

Also: I did not receive the email from Harriet Riddle that Eli is replying to. Is there a problem with the mailing list? I may be missing other messages.

Sławomir Osipiuk

From eliz at gnu.org  Sat Jun  6 09:27:33 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 17:27:33 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Sławomir Osipiuk on Sat, 6 Jun 2020 09:56:23 -0400)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID: <83h7vo8e22.fsf@gnu.org>

> From: Sławomir Osipiuk
> Date: Sat, 6 Jun 2020 09:56:23 -0400
>
> Escape sequences may be present in UTF-8, but SI and SO cannot be, nor
> can most designation sequences (a special subset of escape sequences),
> not only because they make no sense, but because ISO 10646 explicitly
> forbids them:
>
> "Code extension control functions for the ISO/IEC 2022 code extension
> techniques (such as designation escape sequences, single shift, and
> locking shift) shall not be used with this coded character set."

Alas, the stuff one bumps into out there doesn't always follow written standards, let alone recent enough standards.

> The presence of these in a UTF-8 stream indicates an error of some
> kind. It's not completely impossible for them to appear in something
> that is otherwise valid UTF-8, but they should be treated, in my
> opinion, the same as overlong sequences or surrogates; i.e. the UTF-8
> math works, but the code point isn't valid.

What to do when these irregularities are found is a separate (though very important) issue. The issue discussed here is whether assuming UTF-8 "until proven otherwise" is sufficient in practice. I don't think it is, and I provided a few examples why.
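To make the SI/SO and designation-sequence point concrete, a small Python sketch; the designator list is deliberately incomplete, and the fallback codec name is just one possibility chosen for the example:

    SI, SO = 0x0F, 0x0E
    # A few ISO/IEC 2022 designation sequences, e.g. ESC $ B (G0 := JIS X 0208).
    ISO2022_DESIGNATORS = (b"\x1b$B", b"\x1b$@", b"\x1b(J", b"\x1b-F")

    def looks_like_iso2022(data: bytes) -> bool:
        """True if the stream uses SI/SO or a designating escape sequence,
        none of which may appear in conforming UTF-8 per ISO 10646."""
        if SI in data or SO in data:
            return True
        return any(seq in data for seq in ISO2022_DESIGNATORS)

    def decode_guess(data: bytes) -> str:
        if looks_like_iso2022(data):
            return data.decode("iso2022_jp_2")   # one possible fallback
        return data.decode("utf-8")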
From harjitmoe at outlook.com  Sat Jun  6 05:05:49 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sat, 6 Jun 2020 10:05:49 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83a71gacr1.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org>
Message-ID:

In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and the Gx-set designating ESC sequences make no sense in UTF-8. So, handling the left-hand side (those with the high bit unset) as (say) ISO-2022-JP-2 and the right-hand side (with the high bit set) as UTF-8 could work, with no ambiguity in practice.

I do not recommend this for general use, since allowing this sort of mixed encoding at the receiving end can allow data to bypass upstream XSS sanitisers et cetera, but you presumably know how relevant this concern is to your work. It also probably doesn't make sense to write a decoder from scratch for this, unless you were doing that anyway.

If that isn't feasible, then, more moderate measures might include trying 7-bit ISO 2022 and if it runs into a set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.

________________________________
From: Unicode on behalf of Eli Zaretskii via Unicode
Sent: 06 June 2020 09:12
To: Shawn Steele
Cc: tom at honermann.net ; alisdairm at me.com ; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

I agree that ISO 2022 is rare these days, but rarity doesn't help when you need to be accurate in decoding, because mistaking one encoding for another produces horribly incorrect results, and users complain vociferously when that happens.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From harjitmoe at outlook.com  Sat Jun  6 07:20:40 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sat, 6 Jun 2020 12:20:40 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83sgf88nre.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID:

Point taken about it not necessarily being human readable text. I was mainly considering the case of distinguishing between a collection of files, the older ones being in ISO-2022-JP and the newer ones in UTF-8.

In response to the comment about SGR sequences: ISO/IEC 2022 (ECMA-35, JIS X 0202), specifically section 13 (referencing the ECMA version), ultimately defines the format of all ANSI/ISO compliant escape sequences, whether in an actual ISO/IEC 2022 encoding (including both 7-bit code versions, and also 8-bit code versions such as ISO-8859-1) or in ISO 10646 / Unicode. The main difference is that ISO/IEC 10646 adds the requirement that they be padded to the code unit width, which is only relevant in the context of UTF-16 or UTF-32.

However, "type Fe" escape sequences, i.e. ESC 0x40 (ESC @) through ESC 0x5F (ESC _) with no intervening bytes, are delegated to the C1 control code set in use, normally ISO/IEC 6429 (ECMA-48, JIS X 0211). The escape sequence ESC 0x5B (ESC [), which is the CSI control in turn used at the start of SGR, CUP etc. sequences, is one of these. The sequence ESC $ B (ESC 0x24 0x42), on the other hand, is a "type 4F" escape sequence, with a function defined by ISO/IEC 2022 itself.

And yes, some of the code-switching sequences are supported by e.g. xterm, but this is mainly for their ISO 2022 code-switching purposes, e.g. using ESC - F to switch from ISO-8859-1 to ISO-8859-7, or ESC % G to switch from an ISO 2022 code version (such as ISO 8859) to UTF-8.

So it is true that detecting ESC on its own will not identify 7-bit ISO 2022, but the specific sequence ESC $ B (ESC 0x24 0x42) has only one ANSI/ISO compliant meaning, which is to switch the G0 set to JIS X 0208. In UTF-8, there is no such thing as a G0 set (due to it not being fully ISO 2022 based), so it is meaningless.
If you're dealing with non-ISO-compliant escape sequences used by some terminal, then fair enough.

________________________________
From: Eli Zaretskii
Sent: 06 June 2020 12:57
To: Harriet Riddle
Cc: Shawn.Steele at microsoft.com ; tom at honermann.net ; alisdairm at me.com ; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Bottom line: the real world out there is not as clean as we might think, and those rare corner cases keep breaking any simple-minded decision rules such as "assume UTF-8 by default".
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From pgcon6 at msn.com  Sat Jun  6 10:06:23 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Sat, 6 Jun 2020 15:06:23 +0000
Subject: reminder about this list
Message-ID:

I'd just like to remind people (or point out): the Unicode Technical Committee does not monitor or act on anything discussed in this list. It's here for discussion: to seek answers to questions, bounce ideas off others...

If you bring up an idea that gets some support and think it would be worth UTC considering, there are two channels to provide input that UTC will consider:

Submit comments via the contact form: https://corp.unicode.org/reporting.html

Submit a document with a specific proposal and rationale: https://www.unicode.org/pending/docsubmit.html

For general info about UTC see https://www.unicode.org/consortium/utc.html

Cheers!
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org  Sat Jun  6 10:19:48 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Jun 2020 09:19:48 -0600
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Message-ID: <000d01d63c15$eb40ca80$c1c25f80$@ewellic.org>

Shawn Steele wrote:

> I've been recommending that people assume documents are UTF-8.  If
> the UTF-8 decoding fails, then consider falling back to some other
> codepage.  Pretty much all the other code pages would contain text
> that would look like unexpected trail bytes, or lead bytes without
> trail bytes, etc.  One can anecdotally find single-word Latin examples
> that break the pattern (Nestlé® IIRC),

That's traditionally been my example. You have to spell it in all caps (NESTLÉ®), which Nestlé seldom does, in order to get an ISO 8859-1 sequence that can be mistaken for UTF-8:

4E 45 53 54 4C C9 AE

where the last two bytes could be UTF-8 for ɮ, U+026E LATIN SMALL LETTER LEZH. If the É is lowercase, you get:

4E 45 53 54 4C E9 AE

which is not valid UTF-8 (only one trail byte), and the heuristic that UTF-8 can be reliably auto-detected is reinforced.

--
Doug Ewell | Thornton, CO, US | ewellic.org
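The byte sequences above are easy to verify; a quick check in Python:

    caps  = b"NESTL\xC9\xAE"   # ISO 8859-1 bytes for "NESTLÉ®"
    lower = b"Nestl\xE9\xAE"   # ISO 8859-1 bytes for "Nestlé®"

    print(caps.decode("utf-8"))      # 'NESTLɮ': C9 AE is a valid two-byte
                                     # sequence for U+026E, so UTF-8 "succeeds"
    try:
        lower.decode("utf-8")
    except UnicodeDecodeError:
        print("not UTF-8")           # E9 needs two trail bytes; only one follows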
From doug at ewellic.org  Sat Jun  6 10:43:34 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Jun 2020 09:43:34 -0600
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Message-ID: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>

Eli Zaretskii wrote:

>>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
>>> They look like plain 7-bit ASCII (which will not fail UTF-8), but
>>> actually represent non-ASCII text.
>>
>> I mentioned that later.... But there is a lot of content for
>> interchange that is single/double byte (8 bit) rather than requiring
>> escape sequences. The 2022 encodings seem rarer, though it may
>> depend on your data source.
>
> I agree that ISO 2022 is rare these days, but rarity doesn't help when
> you need to be accurate in decoding, because mistaking one encoding
> for another produces horribly incorrect results, and users complain
> vociferously when that happens.

If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.

Long ago I wrote some code that detected Russian text in any of six popular (at the time) Cyrillic encodings, and it seldom got it wrong, but I have no idea how it would do for other, especially non-Slavic, languages written in Cyrillic. I bet it would fail spectacularly for Mongolian, for example.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From eliz at gnu.org  Sat Jun  6 10:47:00 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 18:47:00 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org> (message from Doug Ewell via Unicode on Sat, 6 Jun 2020 09:43:34 -0600)
References: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>
Message-ID: <83ftb88adn.fsf@gnu.org>

> Date: Sat, 6 Jun 2020 09:43:34 -0600
> From: Doug Ewell via Unicode
>
> If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.

Yes, that's my experience as well.

From doug at ewellic.org  Sat Jun  6 10:58:31 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Jun 2020 09:58:31 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Message-ID: <000f01d63c1b$53a627f0$faf277d0$@ewellic.org>

abrahamgross at disroot.org wrote:

> I've been thinking about writing a proposal for the double story "a"
> so that I can send an unambiguous IPA transcription - even to ppl with
> devices that have the U+0061 "a" as a single storey "a" - but I don't
> want to spend a ton of time on something that'll get rejected…

IMHO the major beneficiary of such a character would be the shadowy authors of those annoying "I know your password, now send me bitcoin or I'll send incriminating videos to all your contacts" messages, who love to sprinkle in lookalike characters (like, well, ?) for whatever reason.

In this whole thread, I have yet to see why Alphabetic Presentation Forms would be considered a good place to encode a variant form of a Hebrew letter. If that's no longer being considered, or never was, perhaps a change in Subject line would help readers.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From asmusf at ix.netcom.com  Sat Jun  6 17:01:13 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 6 Jun 2020 15:01:13 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83ftb88adn.fsf@gnu.org>
References: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org> <83ftb88adn.fsf@gnu.org>
Message-ID: <3d396692-4400-86ce-eb15-d5088800b81d@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From duerst at it.aoyama.ac.jp  Sat Jun  6 19:40:08 2020
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 7 Jun 2020 09:40:08 +0900
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>
References: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>
Message-ID: <414518f9-adc3-af6f-1c68-b0c0df80a54d@it.aoyama.ac.jp>

On 07/06/2020 00:43, Doug Ewell via Unicode wrote:
> Eli Zaretskii wrote:
>
>>>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
>>>> They look like plain 7-bit ASCII (which will not fail UTF-8), but
>>>> actually represent non-ASCII text.

Well, yes, but if you exploit the fact that 7-bit ISO 2022 encodings contain ESC characters with specific character sequences thereafter, whereas UTF-8 text doesn't, that case should be easy to handle, too.

> If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.
>
> Long ago I wrote some code that detected Russian text in any of six popular (at the time) Cyrillic encodings, and it seldom got it wrong, but I have no idea how it would do for other, especially non-Slavic, languages written in Cyrillic.
> I bet it would fail spectacularly for Mongolian, for example.

I agree. What's difficult is distinguishing the various non-UTF-8 encodings among themselves. Compared to that, identifying something as UTF-8 is much easier. It's not 100% foolproof, in particular not for very short pieces of non-ASCII text (just a word or so), but it gets better very, very fast the more non-ASCII text you have.

Regards,   Martin.

From duerst at it.aoyama.ac.jp  Sat Jun  6 19:48:37 2020
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 7 Jun 2020 09:48:37 +0900
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID:

On 06/06/2020 08:04, Markus Scherer via Unicode wrote:
> The BOM -- or for UTF-8 where "byte order" is meaningless, the Unicode
> signature byte sequence -- was popular when Unicode was gaining ground but
> legacy charsets were still widely used.
> For the most part, all new and recent text is now UTF-8, and the signature
> byte sequence has fallen out of favor again even where it had been used.

I'm really glad to hear this, and I very much hope it is true. But I know of a case where the BOM on UTF-8 is necessary. It's to get Excel to recognize a CSV file as UTF-8.

Regards,   Martin.

> Having said that, I think the statement is right: "neither required nor
> recommended for UTF-8"
>
> We might want to review chapter 23 and the FAQ and see if they should be
> updated.
>
> Thanks,
> markus
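For what it's worth, the Excel case Martin mentions is easy to reproduce from code; in Python, the "utf-8-sig" codec writes exactly this signature (the file name and fields below are made up for the example):

    import csv

    # "utf-8-sig" prepends the U+FEFF signature that tells Excel the
    # CSV file is UTF-8 rather than some legacy code page.
    with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "city"])
        writer.writerow(["Dürst", "Zürich"])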
From abrahamgross at disroot.org  Sun Jun  7 00:53:23 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sun, 7 Jun 2020 05:53:23 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <000f01d63c1b$53a627f0$faf277d0$@ewellic.org>
References: <000f01d63c1b$53a627f0$faf277d0$@ewellic.org>
Message-ID: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org>

I just gave the Alphabetic Presentation Forms as a suggestion of where it can possibly be encoded. Everyone here disagreed, so the regular Hebrew block it is.

2020/06/06 ??11:59:07 Doug Ewell via Unicode :

> In this whole thread, I have yet to see why Alphabetic Presentation Forms would be considered a good place to encode a variant form of a Hebrew letter. If that's no longer being considered, or never was, perhaps a change in Subject line would help readers.

From pandey at umich.edu  Sun Jun  7 00:58:42 2020
From: pandey at umich.edu (Anshuman Pandey)
Date: Sat, 6 Jun 2020 23:58:42 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org>
Message-ID: <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu>

Hi Abraham,

If you're seriously thinking of submitting a proposal for a new Hebrew character, please consider getting in touch with Deborah Anderson, Michael Everson, or me. We'd be happy to help you figure out the suitability of encoding the character in question or figuring out ways to represent it in plain text, if need be.

All my best,
Anshu

> On Jun 6, 2020, at 11:53 PM, abrahamgross--- via Unicode wrote:
>
> I just gave the Alphabetic Presentation Forms as a suggestion of where it can possibly be encoded. Everyone here disagreed, so the regular Hebrew block it is.

From asmusf at ix.netcom.com  Sun Jun  7 01:19:32 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 6 Jun 2020 23:19:32 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID:

An HTML attachment was scrubbed...
URL:

From tom at honermann.net  Sun Jun  7 02:47:12 2020
From: tom at honermann.net (Tom Honermann)
Date: Sun, 7 Jun 2020 03:47:12 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID: <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net>

Thank you to everyone that responded to this thread. The responses have indicated that I need to be more clear about my motivation for asking. More details below.

On 6/5/20 7:04 PM, Markus Scherer via Unicode wrote:
> The BOM -- or for UTF-8 where "byte order" is meaningless, the Unicode
> signature byte sequence -- was popular when Unicode was gaining ground
> but legacy charsets were still widely used.
> For the most part, all new and recent text is now UTF-8, and the
> signature byte sequence has fallen out of favor again even where it
> had been used.

Thank you, this is helpful historical perspective.

> Having said that, I think the statement is right: "neither required
> nor recommended for UTF-8"

I think different audiences could interpret that guidance in different ways.

As a software tool provider, I can interpret the guidance as meaning that I should not require a BOM to be present on text that is consumed, nor produce a BOM in text that is produced. But what is the recommendation for honoring a BOM that is present in consumed text? Pragmatically, it seems to me that tools should honor the presence of a BOM by either treating the data following it as UTF-8 encoded or issuing a diagnostic if the BOM presents a conflict with other indications of expected encoding.

As a protocol developer, I can interpret the guidance as meaning that a new protocol should either mandate a particular encoding or use some mechanism other than a BOM to negotiate encoding.

As a text author, I can interpret the guidance as meaning that I should not place a BOM in text that I author without strong motivation, nor should I expect a tool to require one.

Back to my motivation for asking the question... I'm researching support for UTF-8 encoded source files in various C++ compilers. Here is a brief snapshot of existing practice:

* Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is recognized and discarded.
* GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option. If GCC is expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM is *not* honored and its presence is likely to result in a compilation error. GCC has no support for compiling a translation unit consisting of differently encoded source files.

* Microsoft Visual C++, by default, interprets source files as encoded according to the Windows' Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM. The default encoding can be overridden with a command line option.

* IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent). z/OS is an EBCDIC based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files. Many EBCDIC code pages exist and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.

The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding. The following are acknowledged (at least by me) as accepted constraints:

* Existing compilers are not going to change their default mode of operation due to backward compatibility constraints.

* Non-UTF-8 encoded source files are still in use, particularly by commercial software providers.

* Converting source files to UTF-8 is not necessarily an easy task. It isn't necessarily a simple matter of running the source files through 'iconv' and committing the results.

* Transition to UTF-8 for source files will be aided by the possibility of incremental adoption; e.g., use of UTF-8 encoded header files by a project that has non-UTF-8 encoded source files.

Various methods are being explored for how to support collections of mixed encoding source files. The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.

> We might want to review chapter 23 and the FAQ and see if they should
> be updated.

I think that would be useful. In particular, per other comments above, if the standard or FAQ is to continue offering statements regarding recommendations or guidance, it may be helpful to tailor the guidance for different audiences. For example, "Software providers are encouraged to honor the presence of a BOM signifying that a text is UTF-8 encoded in text that is consumed, and are discouraged from inserting a BOM in text that is produced. Text authors are discouraged from inserting a BOM in their UTF-8 encoded documents [unless it is known to be needed; because UTF-8 should be considered a default, because some tools won't honor it, etc...]".

Tom.

> Thanks,
> markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
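A minimal Python sketch of the policy suggested above, honor a BOM on input and never emit one on output; illustrative only, with hypothetical function names:

    import codecs

    def read_source(path: str) -> str:
        """Read a source file, honoring an optional UTF-8 signature."""
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]   # consume the signature
        return data.decode("utf-8")              # (or fall back, or diagnose)

    def write_source(path: str, text: str) -> None:
        """Write UTF-8 without a signature, per "neither required nor
        recommended"; note encoding="utf-8", not "utf-8-sig"."""
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)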
From richard.wordingham at ntlworld.com  Sun Jun  7 06:46:27 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 7 Jun 2020 12:46:27 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu>
Message-ID: <20200607124627.12db8f87@JRWUBU2>

On Sat, 6 Jun 2020 23:58:42 -0600
Anshuman Pandey via Unicode wrote:

> Hi Abraham,
>
> If you're seriously thinking of submitting a proposal for a new
> Hebrew character, please consider getting in touch with Deborah
> Anderson, Michael Everson, or me. We'd be happy to help you figure
> out the suitability of encoding the character in question or figuring
> out ways to represent it in plain text, if need be.

It doesn't belong in plain text. It only becomes useful once line breaks and character spacing are known.

Richard.

From everson at evertype.com  Sun Jun  7 08:45:00 2020
From: everson at evertype.com (Michael Everson)
Date: Sun, 7 Jun 2020 14:45:00 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200607124627.12db8f87@JRWUBU2>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2>
Message-ID: <9C2A4C94-BAA2-4A33-A34C-57F18049079E@evertype.com>

I've often helped encode Hebrew characters. :-)

M

> On 7 Jun 2020, at 12:46, Richard Wordingham via Unicode wrote:
>
> It doesn't belong in plain text. It only becomes useful once line
> breaks and character spacing are known.
>
> Richard.

From mark at kli.org  Sun Jun  7 09:27:17 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Sun, 7 Jun 2020 10:27:17 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200607124627.12db8f87@JRWUBU2>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2>
Message-ID:
A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel.? In spelling it out, you could call one a holam mal?, but not the other.? A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it.? What you're talking about is a LAMED and a LAMED.? They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?) ~mark From abrahamgross at disroot.org Sun Jun 7 13:45:05 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Sun, 07 Jun 2020 18:45:05 +0000 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> Message-ID: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> If this is the case, then why do the CJK blocks have tons of alternatives for the same character? (not counting the compatibility ideographs that were just added for compatibility with other encodings) If you look at old dictionaries, these alternatives get listed as alternatives of the same character you might see some fonts use. The meaning is exactly the same. Some examples (theres tons and tons more): ?????? ??? ???? ?? ??? ??? ??? ?????? ?? ?????? 2020?6?7? 10:27, "Mark E. Shoulson via Unicode" wrote: > On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote: > >> On Sat, 6 Jun 2020 23:58:42 -0600 >> Anshuman Pandey via Unicode wrote: >> >>> Hi Abraham, >>> >>> If you?re seriously thinking of submitting a proposal for a new >>> Hebrew character, please consider getting in touch with Deborah >>> Anderson, Michael Everson, or me. We?d be happy to help you figure >>> out the suitability of encoding the character in question or figuring >>> out ways to represent it in plain text, if need be. >> >> I[t] doesn't belong in plain text. It only becomes useful once line >> breaks and character spacing are known. >> >> Richard. > > I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at > best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any > combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each > one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not > things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like > U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus > HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal > VAV followed by a vowel. In spelling it out, you could call one a holam mal?, but not the other. > A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct > character, and moreover one that cannot be deduced algorithmically by looking at the letters around > it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same > character, and Unicode doesn't encode glyphs (anymore?) 
> > ~mark From abrahamgross at disroot.org Sun Jun 7 13:50:01 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Sun, 07 Jun 2020 18:50:01 +0000 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> References: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> Message-ID: This doesnt display properly on my android device, so I hope ya'll received this intact. 2020?6?7? 14:45, "Abraham Gross via Unicode" wrote: > Some examples (theres tons and tons more): > ?????? > ??? > ???? > ?? > ??? > ??? > ??? > ?????? > ?? > ?????? From public at khwilliamson.com Sun Jun 7 14:29:50 2020 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 7 Jun 2020 13:29:50 -0600 Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? In-Reply-To: References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <499dd515-f61d-8114-aae7-52da51d92e58@khwilliamson.com> Message-ID: On 6/5/20 9:53 PM, Jonathan Rosenne via Unicode wrote: > I am curious about how your code would work with CP1255 or CP1256? > > Best Regards, > > Jonathan Rosenne Send me a few problematic strings, and I'll check them out > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode > Sent: Saturday, June 6, 2020 6:29 AM > To: Shawn Steele; Tom Honermann > Cc: Alisdair Meredith; Unicode Mail List > Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? > > On 6/5/20 4:33 PM, Shawn Steele via Unicode wrote: >> I?ve been recommending that people assume documents are UTF-8. ?If the >> UTF-8 decoding fails, then consider falling back to some other >> codepage.? ?Pretty much all the other code pages would contain text that >> would look like unexpected trail bytes, or lead bytes without trail >> bytes, etc.? One can anecdotally find single-word Latin examples that >> break the pattern (Nestl?? IIRC), but if you want to think of accuracy >> in terms of ?9s?, then that pretty much has as many nines as you have >> bytes of input data. > > I have code that attempts to distinguish between UTF-8 and CP1252 > inputs. It now does a pretty good job; no one has complained in several > years. To do this, I resort to some "semantic" analysis of the input. > If it is syntactically valid UTF-8, but not a script run, it's not > UTF-8. Likewise, the texts it will be subjected to are going to be in > modern commercially-valuable scripts, so not IPA, for example. And it > will be important characters, ones whose Age property is 1.1; text won't > contain C1 controls. CP1252 is harder than plain ASCII/Latin1/C1 > because manyh of the C1 controls are co-opted for graphic characters. > Someone sent me the following example, scraped from some dictionaries, > that it successfully gets right: > > Muvrar\xE1\x9A\x9Aa is a mountain in Norway > > is legal 1252, and syntactically legal UTF-8, but the "semantic" tests > say it isn't UTF-8. > > I also have code that tries to distinguish between a UTF-8 POSIX locale > and a non-UTF-8, and which needs to work on systems without certain C > library functions that would make it foolproof. That is less successful > primarily because of insufficient text available to make a > determination. 
> One might think that the operating system error messages would be fruitful, but it turns out that many are in English; no one bothered to translate them. The locale's currency symbol is always translated, though the dollar sign is commonly used in other languages as part of the symbol. The time and date names are usually translated, and I use them.
>
>> I did find some DBCS CJK text that could look like valid UTF-8, so my "one nine per byte of input" isn't quite as high there, however for meaningful runs of text it is still reasonably hard to make sensible text in a double byte codepage look like UTF-8. Note that this "works" partially because the ASCII range of the SBCS/DBCS code pages typically looks like ASCII, as does UTF-8. If you had 7 bit codepage data with stateful shift sequences, of course that wouldn't err in UTF-8. Fortunately for your scenario source code in 7 bit encodings is very rare nowadays.
>>
>> Hope that helps,
>>
>> -Shawn
>>
>> *From:* Tom Honermann
>> *Sent:* Freitag, 5. Juni 2020 15:15
>> *To:* Shawn Steele
>> *Cc:* Alisdair Meredith ; Unicode Mail List
>> *Subject:* Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
>>
>> On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>>
>> The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with: you should always be using Unicode encodings, probably UTF-8 or UTF-16.) I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
>>
>> Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
>>
>> The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project.
>>
>> Tom.
>>
>> Anecdotally, if you can decode data without error in UTF-8, then it's probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short examples that can confuse it.
>>
>> -Shawn
>>
>> *From:* Unicode *On Behalf Of* Tom Honermann via Unicode
>> *Sent:* Freitag, 5. Juni 2020 13:10
>> *To:* unicode at unicode.org
>> *Cc:* Alisdair Meredith
>> *Subject:* What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
>>
>> Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine):
>>
>> ... *Use of a BOM is neither required nor recommended for UTF-8*, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the "Byte Order Mark" subsection in Section 23.8, Specials, for more information.
>>
>> The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend to both presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of UTF-8 as an encoding signature is to be avoided in favor of some other mechanism?
>> The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature.
>>
>> Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage.
>>
>> So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? Or is it only intended to discourage BOM use in cases where the encoding is known by other means?
>>
>> Tom.

From asmusf at ix.netcom.com Sun Jun 7 16:31:33 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 7 Jun 2020 14:31:33 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
Message-ID: <5324b0cf-ff48-d6d5-768d-d41cba4155ee@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From 747.neutron at gmail.com Mon Jun 8 02:23:41 2020
From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=)
Date: Mon, 8 Jun 2020 16:23:41 +0900
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
Message-ID: 

As CJK ideographs were mentioned... They are different from most other Unicode code points in several ways, namely:

- Most of the substantial discussion is going on under the supervision of ISO, rather than UTC. It's one of the few fields whose description in ISO/IEC 10646 is more informative than that in the Unicode Standard. For practical knowledge see especially the ISO standard's Annex P and S.
- Whether to separately encode two characters is mainly decided by difference in structure, i.e. sub-character formation, besides the semantics, because Han characters are compositional by nature, unlike most phonetic scripts where each letter only means what it means as a whole shape (? is not a ? with hyphen, is it?).
- The questionable quality of CJK Extension B characters is an open secret.

2020年6月8日(月) 4:57 Abraham Gross via Unicode :
>
> If this is the case, then why do the CJK blocks have tons of alternatives for the same character? (not counting the compatibility ideographs that were just added for compatibility with other encodings) If you look at old dictionaries, these alternatives get listed as alternatives of the same character that you might see some fonts use. The meaning is exactly the same.
>
> Some examples (there's tons and tons more):
> ??????
> ???
> ????
> ??
> ???
> ???
> ???
> ??????
> ??
> ??????
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
Shoulson via Unicode" wrote: > > > On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote: > > > >> On Sat, 6 Jun 2020 23:58:42 -0600 > >> Anshuman Pandey via Unicode wrote: > >> > >>> Hi Abraham, > >>> > >>> If you?re seriously thinking of submitting a proposal for a new > >>> Hebrew character, please consider getting in touch with Deborah > >>> Anderson, Michael Everson, or me. We?d be happy to help you figure > >>> out the suitability of encoding the character in question or figuring > >>> out ways to represent it in plain text, if need be. > >> > >> I[t] doesn't belong in plain text. It only becomes useful once line > >> breaks and character spacing are known. > >> > >> Richard. > > > > I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at > > best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any > > combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each > > one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not > > things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like > > U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus > > HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal > > VAV followed by a vowel. In spelling it out, you could call one a holam mal?, but not the other. > > A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct > > character, and moreover one that cannot be deduced algorithmically by looking at the letters around > > it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same > > character, and Unicode doesn't encode glyphs (anymore?) > > > > ~mark > From abrahamgross at disroot.org Mon Jun 8 02:41:56 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Mon, 08 Jun 2020 07:41:56 +0000 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> Message-ID: <453b7f69c101676fe1815355b5e41d22@disroot.org> ??? The way I understand it, a lot of ext b (including the crazy/cursive ones like ????????????????) come from dictionaries that had cursive entries (for some unknown reason). Example dictionary: https://sociorocketnewsen.files.wordpress.com/2016/09/broken-11.jpg https://sociorocketnewsen.files.wordpress.com/2016/09/wrong-11-e1473684067982.jpg https://sociorocketnewsen.files.wordpress.com/2016/09/curve-11.jpg https://sociorocketnewsen.files.wordpress.com/2016/09/curve-21-e1473684039617.jpg 2020/06/08 ??3:25:01 W?ng Yif?n via Unicode : > - The questionable quality of CJK Extension B characters is an open secret. From marius.spix at web.de Mon Jun 8 04:46:58 2020 From: marius.spix at web.de (Marius Spix) Date: Mon, 8 Jun 2020 11:46:58 +0200 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> Message-ID: <20200608114350.364eba2e@spixxi> ? is a ligature of ? and ?. ? is a ? with breve. They are considered to be seperate letters for historic reasons. ? 
? and ? have nothing in common but part of the shape. They derived from completely different characters.

On Mon, 8 Jun 2020 16:23:41 +0900 Wáng Yifán wrote:
> - Whether to separately encode two characters is mainly decided by difference in structure, i.e. sub-character formation, besides the semantics, because Han characters are compositional by nature, unlike most phonetic scripts where each letter only means what it means as a whole shape (? is not a ? with hyphen, is it?).

From 747.neutron at gmail.com Mon Jun 8 07:43:48 2020
From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=)
Date: Mon, 8 Jun 2020 21:43:48 +0900
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <453b7f69c101676fe1815355b5e41d22@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> <453b7f69c101676fe1815355b5e41d22@disroot.org>
Message-ID: 

Those are not exactly what we call failures in Ext B (some of the ones you listed are actually not from Ext B), because they sort of "had to" be included rather than slipping in by carelessness. It's definitely one of the headaches, just in another dimension.

Maybe you can take a glance at what we had discussed recently, if you're really into it:
https://www.unicode.org/L2/L2019/19346-gongche-policy.pdf
https://appsrv.cse.cuhk.edu.hk/~irg/irg/irg53/IRGN2413_IDS_issues.pdf
https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

2020年6月8日(月) 17:57 Abraham Gross via Unicode :
>
> The way I understand it, a lot of Ext B (including the crazy/cursive ones like ????????????????) come from dictionaries that had cursive entries (for some unknown reason).
>
> Example dictionary:
> https://sociorocketnewsen.files.wordpress.com/2016/09/broken-11.jpg
> https://sociorocketnewsen.files.wordpress.com/2016/09/wrong-11-e1473684067982.jpg
> https://sociorocketnewsen.files.wordpress.com/2016/09/curve-11.jpg
> https://sociorocketnewsen.files.wordpress.com/2016/09/curve-21-e1473684039617.jpg
>
> 2020/06/08 午前3:25:01 Wáng Yifán via Unicode :
> > - The questionable quality of CJK Extension B characters is an open secret.

From abrahamgross at disroot.org Mon Jun 8 12:45:02 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Mon, 08 Jun 2020 17:45:02 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2>
Message-ID: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>

Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?

Here are 2 character sets with a folded lamed:
https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.

2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:

> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>
> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.)
> You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>
> ~mark

From john_h_jenkins at apple.com Mon Jun 8 13:09:38 2020
From: john_h_jenkins at apple.com (jenkins)
Date: Mon, 08 Jun 2020 12:09:38 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
Message-ID: 

Unicode *encoded* characters that other character sets have, even though it normally wouldn't. That's really not done anymore.

It's also a matter of what the character set in question is. The two mentioned here are too obscure IMHO to have ever been covered by round-trip compatibility.

> On Jun 8, 2020, at 11:45 AM, Abraham Gross via Unicode wrote:
>
> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?
>
> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
>
>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>>
>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED.
>> They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>>
>> ~mark

From mark at kli.org Mon Jun 8 16:02:37 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 8 Jun 2020 17:02:37 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
Message-ID: <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>

Look, think of it this way: what exactly is the content of, say, Exodus 6:10, for a nice short and common verse? What are the letters, vowels, and cantillations that make up that verse? The answer is pretty well-agreed-upon by most sources. Tell me: is the LAMED in that verse bent or straight? Can you find a list of LAMEDs in the Torah that are bent? Not "which ones are bent in this particular book." That's like finding me a list of YODs that are at the end of a line: it has nothing to do with the actual TEXT. Which LAMEDs in the Torah are bent? None of them. Nor are any of them straight. Nor are any of them written in Frank-Ruehl, or Hadassah, or David. Those are not properties of the text. The consonantal text of the Torah uses exactly 22 letters plus final forms, plus the NUN HAFUKHA and a few instances of UPPER DOT.

Now, there *are* some letters in the Torah which are written unusually large or small, like the BET at the very beginning, or the small ALEPH in Leviticus 1:1. But Unicode rightly considers those to be glyphic variants, to be handled at a higher level. There's actually a better case for encoding these, because there IS a list of large BETs or small ALEPHs in the Torah, which "everyone" (who accepts Masoretic tradition) agrees are in these and those places in the text. (But don't try to encode these, either.)

Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.

~mark

On 6/8/20 1:45 PM, Abraham Gross via Unicode wrote:
> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?
>
> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
>
>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>>
>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different.
>> A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>>
>> ~mark

From kenwhistler at sonic.net Mon Jun 8 18:58:19 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 8 Jun 2020 16:58:19 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To: References: 
Message-ID: <16312c19-dd47-fcdf-f775-b868e49b72fc@sonic.net>

Actually, no you can't, because U+1F10D is already standardized as CIRCLED ZERO WITH SLASH, published in Unicode 13.0.

https://www.unicode.org/charts/PDF/U1F100.pdf

People discussing this should make sure they are referring to actual, published code charts, and not to proposals from 2 to 4 years ago.

--Ken

On 6/5/2020 5:15 PM, abrahamgross--- via Unicode wrote:
> You can still make a proposal to add it to U+1F10D
>

From asmusf at ix.netcom.com Mon Jun 8 21:57:05 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 8 Jun 2020 19:57:05 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From abrahamgross at disroot.org Mon Jun 8 23:47:24 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 9 Jun 2020 04:47:24 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
Message-ID: 

Does anyone know which national standard the alternative ayin came from? I can't find it anywhere, and I want to look through it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pandey at umich.edu Tue Jun 9 01:10:34 2020
From: pandey at umich.edu (Anshuman Pandey)
Date: Tue, 9 Jun 2020 01:10:34 -0500
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
References: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
Message-ID: <52B1F014-F225-4934-A830-C0BFF7381710@umich.edu>

> On Jun 8, 2020, at 9:57 PM, Asmus Freytag via Unicode wrote:
>
> On 6/8/2020 2:02 PM, Mark E. Shoulson via Unicode wrote:
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
> The meta issue: how to ensure that texts that have such features (i.e. layout-specific or scribe-specific choice of shapes) can be widely represented in interchangeable digital representations - even if that representation isn't plain text.
>
> A./

You hit the nail on the head: I'm dealing with this issue for alternate terminals for Old Uyghur letters. Scribes made a choice to use either a vertical or a curved stroke. The shape of the terminal itself doesn't change the semantic value of the letter, but it carries pragmatic value.

Anshu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at kli.org Tue Jun 9 07:29:42 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 9 Jun 2020 08:29:42 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
Message-ID: <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>

On 6/8/20 10:57 PM, Asmus Freytag via Unicode wrote:
> On 6/8/2020 2:02 PM, Mark E. Shoulson via Unicode wrote:
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
>
> The meta issue: how to ensure that texts that have such features (i.e. layout-specific or scribe-specific choice of shapes) can be widely represented in interchangeable digital representations - even if that representation isn't plain text.
>
> A./

I guess that's what it comes down to. Unicode is classically concerned only with plain text. Aside from disputes about where "plain text" ends, what's to be done with "non-plain" text? Some aspects of this non-plain text, like these scribal choices, obviously feel more connected to the abstract text than others, like page layout. Are these part of Unicode's mission? Should they be? If not, then what? You *can* represent and reproduce these details by kludges, be they as ham-fisted as having two fonts with different LAMEDs and formatting some in one font and some in another. Is that good enough? Does it mess up other things? And even if it is good enough, does that count as an "interchangeable digital representation," that I can send a .odt file around? Things to ponder.

~mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mhd at yv.org Mon Jun 8 23:45:45 2020
From: mhd at yv.org (Mark H. David)
Date: Mon, 08 Jun 2020 21:45:45 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: 
Message-ID: <593a318d-4662-4410-b768-705c198e8eea@www.fastmail.com>

Hi, sorry for the late response, but regarding other character sets *besides* Unicode with Hebrew characters that ended up in Alphabetic Presentation Forms, several were from Apple: Mac OS Hebrew. See this mapping table:

http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/HEBREW.TXT

----- Original message -----
From: Abraham Gross via Unicode
To: unicode at unicode.org
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Date: Tuesday, June 02, 2020 8:18 PM

Why are there precomposed Hebrew characters in Unicode (Alphabetic Presentation Forms block)?
It says in the FAQ that "a substantial number of presentation forms were encoded in Unicode as compatibility characters, because legacy software or data included them." (https://www.unicode.org/faq/ligature_digraph.html#PForms)

I can't find any character set other than Unicode that has separate codepoints for all Hebrew letters with a dagesh/mapiq or any of the other precomposed letters other than the Yiddish ligatures. (ex: Code page 862, ISO/IEC 8859-8, Windows-1255) Does anyone know where I can find the legacy software or character sets that had these presentation forms?

I also want to see the documents/proposals that got these characters accepted as part of Unicode. Does anyone know where I can find them? The closest I got was when I figured out the proposal to add HEBREW LETTER YOD WITH HIRIQ is in proposal N1364, but I can't find it in the document register?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jk at koremail.com Tue Jun 9 10:00:59 2020
From: jk at koremail.com (jk at koremail.com)
Date: Tue, 09 Jun 2020 23:00:59 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com> <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
Message-ID: 

On 2020-06-09 20:29, Mark E. Shoulson via Unicode wrote:
> On 6/8/20 10:57 PM, Asmus Freytag via Unicode wrote:
>
>> On 6/8/2020 2:02 PM, Mark E. Shoulson via Unicode wrote:
>>
>>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
>>
>> The meta issue: how to ensure that texts that have such features (i.e. layout-specific or scribe-specific choice of shapes) can be widely represented in interchangeable digital representations - even if that representation isn't plain text.
>>
>> A./
>
> I guess that's what it comes down to. Unicode is classically concerned only with plain text. Aside from disputes about where "plain text" ends, what's to be done with "non-plain" text? Some aspects of this non-plain text, like these scribal choices, obviously feel more connected to the abstract text than others, like page layout. Are these part of Unicode's mission? Should they be? If not, then what? You *can* represent and reproduce these details by kludges, be they as ham-fisted as having two fonts with different LAMEDs and formatting some in one font and some in another. Is that good enough? Does it mess up other things? And even if it is good enough, does that count as an "interchangeable digital representation," that I can send a .odt file around? Things to ponder.

Unicode is concerned with information exchange, as you say "interchangeable digital representation", of which two common examples are emails and text messages. So an .odt file does not count.
John

> ~mark

From everson at evertype.com Tue Jun 9 11:44:01 2020
From: everson at evertype.com (Michael Everson)
Date: Tue, 9 Jun 2020 17:44:01 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
Message-ID: <519CF2BF-4F79-4054-BE09-FE7F3F9F711A@evertype.com>

To respond to Mark, I'd say that these examples here certainly show a fairly obvious glyph distinction that is not really a "hard sell".

> On 8 Jun 2020, at 18:45, Abraham Gross via Unicode wrote:
>
> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?
>
> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
>
>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>>
>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>>
>> ~mark

From everson at evertype.com Tue Jun 9 14:53:31 2020
From: everson at evertype.com (Michael Everson)
Date: Tue, 9 Jun 2020 20:53:31 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: 

Doesn't it matter _why_ they are bent?

> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>
> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
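A concrete footnote to the thread: the status of the existing Hebrew presentation forms is machine-readable, and a few lines of Python (an illustrative sketch, not code from any message above) show the two different kinds that have come up here, the canonically-decomposing-but-composition-excluded YOD WITH HIRIQ and the <font>-compatibility ALTERNATIVE AYIN:

    import unicodedata

    # U+FB1D HEBREW LETTER YOD WITH HIRIQ decomposes canonically to
    # U+05D9 YOD + U+05B4 HIRIQ, but it is on the composition-exclusion
    # list, so NFC never reassembles it once it has been decomposed.
    yod_hiriq = "\uFB1D"
    nfd = unicodedata.normalize("NFD", yod_hiriq)
    print([f"U+{ord(c):04X}" for c in nfd])                # ['U+05D9', 'U+05B4']
    print(unicodedata.normalize("NFC", nfd) == yod_hiriq)  # False

    # U+FB20 HEBREW LETTER ALTERNATIVE AYIN carries only a <font>
    # compatibility decomposition: NFD leaves it alone, while NFKD
    # folds it to the ordinary U+05E2 AYIN.
    alt_ayin = "\uFB20"
    print(unicodedata.decomposition(alt_ayin))                  # '<font> 05E2'
    print(unicodedata.normalize("NFKD", alt_ayin) == "\u05E2")  # True

Either way, normalization folds these forms toward the ordinary Hebrew letters, which is the practical sense in which they are discouraged for new text.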
From asmusf at ix.netcom.com Tue Jun 9 14:59:47 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 9 Jun 2020 12:59:47 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com> <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
Message-ID: <354d6a41-4306-cb10-d5de-5ff5d7294c60@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org Tue Jun 9 17:33:20 2020
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 9 Jun 2020 16:33:20 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Message-ID: <002301d63ead$faaaf320$f000d960$@ewellic.org>

abrahamgross at disroot.org wrote:

> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?

To elaborate a little on John's comment that "that's really not done anymore":

Unicode more or less promised to encode everything that was present in existing, contemporary coded character sets. So if it was in ISO 8859-8, MS-DOS CP862, Windows CP1255, MARC-8 for Hebrew, etc., then it would be in Unicode as well. That's where the presentation forms came from, as mentioned earlier.

This did not mean Unicode was obligated to conform retroactively to every coded character set introduced or updated *after* Unicode was published. It has certainly done so for some widely used character sets, particularly in East Asia, but there is no obligation for Unicode to add EWELLIC LETTER A just because I publish an 8-bit character set that contains that letter.

And this promise always applied to "coded character sets," a collection of mappings between a code point (single-byte, double-byte, or multi-byte) and a character, used to represent plain text in computers. It didn't apply to glyph collections for typesetting, as in the TeX example below, and definitely not to charts of letters found in a book, with no corresponding code points, as in the JPEG image below.

> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From abrahamgross at disroot.org Tue Jun 9 17:51:34 2020
From: abrahamgross at disroot.org (abraham gross)
Date: Tue, 9 Jun 2020 18:51:34 -0400
Subject: OverStrike control character
Message-ID: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>

What do y'all think about adding an OverStrike control character?

There's historical precedent for having such a control character. The famous Symbolics Space Cadet keyboard had such a key, and many typewriters relied on its functionality (e.g. in order to make a "!" you had to type "'." on most typewriters up until the mid-1900s).

The programming language APL also heavily relied on the overstrike control character, so many systems in the 80s had the character, including Lisp machines.
Here's a quote from the Lisp Machine manual: "OVER STRIKE: Moves the cursor back so that you can superpose (overlay) two characters, should you really want to. The key called BS will do the same thing on the old keyboards." (The BackSpace key on Lisp Machines worked like it did on typewriters, where it just went back a character. The Rub Out key actually deleted the last character.)

Unicode/ASCII currently has at ASCII 8 the character "BS" that's supposed to go back a character without deleting it, and "DEL" at ASCII 127 that does delete the character. But nowadays BS just deletes the previous character. In fact, it's prohibited in ISO/IEC 8859 for BS to not delete the previous character. Wikipedia says: "[The cancel control character is] A control character ("CCH", "Cancel Character", U+0094, or ESC T) used to erase the previous character. This character was created as an unambiguous alternative to the much more common backspace character ("BS", U+0008), which has a now mostly obsolete alternative function of causing the following character to be superimposed on the preceding one."

Modern Usage:
Since no modern systems have such a control character, you won't find it being used anywhere, but I can guarantee that it will receive wide adoption, especially in East Asia, because the kaomoji community will have a field-day with it. People looking to add diacritics that aren't encoded as a combining character yet will also now have the option to do ?? like ?g?? would come out looking like a "g" with a sideways "Z" on top.

Another use of the OverStrike key will be combining shapes in new creative ways for custom orthographies or for custom symbols that can be sent over plain text without the need for special fonts. (Since Unicode will never encode anyone's random conscript/symbols, this would be a great way for people to get this usage with only the addition of a single character.)

From richard.wordingham at ntlworld.com Tue Jun 9 18:31:09 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 10 Jun 2020 00:31:09 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: <20200610003109.0297da84@JRWUBU2>

On Tue, 9 Jun 2020 20:53:31 +0100
Michael Everson via Unicode wrote:

> Doesn't it matter _why_ they are bent?
>
>> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>>
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.

Yes, it does. It seems that they are bent so that they don't clash with the line above. Changing the line breaks or even changing the relative widths of the characters will change which ones get bent. Being bent is an attribute of glyphs in laid-out text, rather than an attribute of characters in a sequence of characters.

That is why mention of ODT files is relevant. I'm not sure what one has to do to stop an ODT file reflowing. I suspect one may have to freeze a lot of the rendering chain to stop reflowing.

Richard.
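Richard's argument lends itself to a toy sketch. The following Python is purely illustrative (the function and glyph names are invented, and no real renderer is this simple); it only shows where the bent/straight decision would live, namely in layout code that runs after line breaking, with the stored character stream unchanged:

    # The character stream stores only U+05DC HEBREW LETTER LAMED.
    stored_text = "\u05DC"

    def choose_lamed_glyph(line_above_intrudes: bool) -> str:
        # A layout-time decision, made from information (the line
        # above) that plain text does not contain.
        return "lamed.bent" if line_above_intrudes else "lamed"

    first_layout = choose_lamed_glyph(True)    # 'lamed.bent'
    after_reflow = choose_lamed_glyph(False)   # 'lamed'
    assert first_layout != after_reflow        # same text, different glyphs

Add a word upstream and the line breaks change, the function's input changes, and the glyph changes, while stored_text never does.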
From kenwhistler at sonic.net Tue Jun 9 18:37:07 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Tue, 9 Jun 2020 16:37:07 -0700
Subject: OverStrike control character
In-Reply-To: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

On 6/9/2020 3:51 PM, abraham gross via Unicode wrote:
> What do y'all think about adding an OverStrike control character?

Not gonna happen.

> There's historical precedent for having such a control character. The famous Symbolics Space Cadet keyboard had such a key, and many typewriters relied on its functionality (e.g. in order to make a "!" you had to type "'." on most typewriters up until the mid-1900s).

And actually U+0008 BACKSPACE (i.e. BS) has been in the Unicode Standard for 30 years now. If people were going to implement characters a la the 1980's (and earlier) backspace and overstrike mode, they had the character they needed for that already. It is a clue about how implementations work with characters and fonts these days that nobody is rushing out to implement overstriking with U+0008, even though it is there for the taking.

> The programming language APL also heavily relied on the overstrike control character, so many systems in the 80s had the character, including Lisp machines.

Another telling example. Unicode 1.0 in 1990 included U+2300 APL COMPOSE OPERATOR, which was intended precisely for the APL overstrike functionality. It was *removed* in the big merger that resulted in Unicode 1.1, in part because the APL community itself was more interested in getting the actual composed operators into the encoding, rather than depending on archaic sequences that reflected the limitations of 7- and 8-bit character encodings and associated keyboards. Hence, all the combined APL operators now seen in Unicode at U+2336 .. U+2379.

There is some (very) limited use of the concept of an overstruck compositor in the Unicode Standard, but the concept is limited to specific scripts and is very constrained. The obvious example is U+13436 EGYPTIAN HIEROGLYPH OVERLAY MIDDLE. That is used as part of a syntax for constructing complete Egyptian hieroglyph quadrats. But the critical distinction is that that format control is part of a complex syntax used by a modern font technology that maps sequences of hieroglyphs into ligatures and/or using complex positioning and resizing rules.

Nobody implements fonts these days that will just "back up" "one space" and "overstrike" a new character. Well, possibly outside the context of Societies for Deliberate Anachronism busy implementing emulations of long dead technology, I suppose.

--Ken

From harjitmoe at outlook.com Tue Jun 9 18:44:26 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Tue, 9 Jun 2020 23:44:26 +0000
Subject: OverStrike control character
In-Reply-To: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

> The programming language APL also heavily relied on the overstrike control character, so many systems in the 80s had the character, including Lisp machines.

The current way of handling APL overstamping sequences is to include the entire sequences in the mapping file: https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/APL-ISO-IR-68.TXT

The interpreter/compiler would presumably have a hardcoded list of sequences it recognises anyway?
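(To make that concrete, here is a toy Python sketch of the kind of folding such a list implies. The pair table below is invented for illustration, built from two well-known APL compositions; it is not a copy of the APL-ISO-IR-68 mapping file.)

    # Fold typewriter-era "X BACKSPACE Y" overstrikes into the
    # precomposed APL operators encoded at U+2336..U+2379.
    OVERSTRIKES = {
        frozenset(("\u2395", "\u00F7")): "\u2339",  # quad + divide -> QUAD DIVIDE
        frozenset(("\u25CB", "*")): "\u235F",       # circle + star -> CIRCLE STAR
    }

    def fold_overstrikes(s: str) -> str:
        out, i = [], 0
        while i < len(s):
            if i + 2 < len(s) and s[i + 1] == "\x08":  # an "X BS Y" triple
                composed = OVERSTRIKES.get(frozenset((s[i], s[i + 2])))
                if composed is not None:
                    out.append(composed)
                    i += 3
                    continue
            out.append(s[i])
            i += 1
        return "".join(out)

    print(fold_overstrikes("\u2395\b\u00F7"))  # U+2339, the APL "domino"

Because each pair is stored as a frozenset, the folding is order-insensitive, which is exactly the property an interpreter wants from overstrikes.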
> Unicode/ASCII currently has at ASCII 8 the character "BS" that's supposed to go back a character without deleting it, and "DEL" at ASCII 127 that does delete the character. But nowadays BS just deletes the previous character.

Unicode itself is fairly hands-off about how higher level protocols can interpret C0 and C1 control codes (general category Cc). Indeed, ISO 10646:2017 section 12.4, while giving the designation sequences of the ISO 6429 (ECMA 48) controls as the default, does go on to (on the next page) permit the use of ISO 2022 designations of other control code sets with UCS/Unicode (by contrast, ISO 2022 designation of graphical sets is not permitted inside UCS, and has no compatible semantic).

That being said, TUS chapter 23.1 names a limited subset of them (HT, LF, VT, FF, CR, FS, GS, RS, US, NEL), so that they can be given custom behaviours for line breaking, bidirectional processing and classification as whitespace. BS is not amongst these. In practice, BS is not supported at all (i.e. has neither behaviour) outside of terminal emulators, in my experience.

> In fact, it's prohibited in ISO/IEC 8859 for BS to not delete the previous character.

ISO 8859 defines profiles of ISO 4873 (ECMA 43) Level 1. Both ISO 8859 and ISO 4873 stipulate fixed character repertoires, and so prohibit creating new characters by overstamping existing ones by any means (including using BS or CR to seek back over them). I don't read this as limiting how BS itself might be implemented, just that it is invalid ISO 8859 for a text to use it to stamp characters on top of other characters to create a character with a different meaning to the two one after the other.

They do permit using the GCC control sequence defined by ISO 6429 (ECMA 48), though, since it doesn't overstamp anything but merely renders the characters in one em-square (if that function is supported, and it usually isn't so far as I can tell; the most extreme example I can think of is that the byte sequence 9B 31 20 5F D5 E4 E9 20 C7 E4 E4 E7 20 D9 E4 EA E7 20 E8 D3 E4 E5 9B 32 20 5F in ISO-8859-6 might be shown with a U+FDFA glyph).

From sosipiuk at gmail.com Tue Jun 9 19:01:32 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Tue, 9 Jun 2020 20:01:32 -0400
Subject: OverStrike control character
In-Reply-To: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

On Tue, Jun 9, 2020 at 6:57 PM abraham gross via Unicode wrote:
>
> What do y'all think about adding an OverStrike control character?

I don't think it's a goer. There are two things that immediately stand out:

1. Unicode doesn't seem eager to define control characters at all. In fact, aside from a handful of format effectors which were so universal and obvious that it made no sense to exclude them, Unicode is very passive on the topic of even the well-defined controls of ISO 6429/ECMA 48. An interesting exception to this is the pair of U+2028 and U+2029 (line and paragraph separators). Any control character is going to be a "hard sell".

2. Overstriking arbitrary characters is a qualitatively different process than using combining characters. In the latter case, the set of characters is restricted, and certain algorithms can be applied to make the presentation look sane (to varying degrees of success). Overstriking implies the need for the rendering engine to be able to combine any two characters, regardless of elements that interfere or clash.
It seems simple in principle to just render the characters separately and overlay the pixels, but I'm very skeptical of what the results would actually look like in real life, with users making unpredictable font and formatting choices.

> Unicode/ASCII currently has at ASCII 8 the character "BS" that's supposed to go back a character without deleting it, and "DEL" at ASCII 127 that does delete the character. But nowadays BS just deletes the previous character. In fact, it's prohibited in ISO/IEC 8859 for BS to not delete the previous character.

Is it? I know that's the behaviour in all modern software, but I can't find that prohibition. Can you point out the section?

Speaking of old standards, though, ISO 6429/ECMA 48 has the GCC (GRAPHIC CHARACTER COMBINATION) control, which seems to be its recommendation for overstriking (though it also waffles about how combined characters may simply be made half-width and inserted into the horizontal space of a single character, leaving the ultimate decision of "relative sizes and placements" to the implementation.)

GCC looks like a mess. Because of the way it's built up from a CSI (control sequence introducer) and uses parameters, the way to combine two characters is to precede them both with the sequence [0x1B 0x5B 0x30 0x20 0x5F], and to combine more than two characters, enclose them with an initial [0x1B 0x5B 0x31 0x20 0x5F] and a final [0x1B 0x5B 0x32 0x20 0x5F]. How fun.

Sławomir Osipiuk

From sosipiuk at gmail.com Tue Jun 9 19:09:22 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Tue, 9 Jun 2020 20:09:22 -0400
Subject: OverStrike control character
In-Reply-To: References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

On Tue, Jun 9, 2020 at 8:01 PM Sławomir Osipiuk wrote:
>
> Is it? I know that's the behaviour in all modern software, but I can't find that prohibition. Can you point out the section?

D'oh. It's right in the very first section ("Scope").

"The use of control functions, such as BACKSPACE or CARRIAGE RETURN for the coded representation of composite characters is prohibited by this Standard."

From jk at koremail.com Tue Jun 9 20:13:30 2020
From: jk at koremail.com (jk at koremail.com)
Date: Wed, 10 Jun 2020 09:13:30 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200610003109.0297da84@JRWUBU2>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <20200610003109.0297da84@JRWUBU2>
Message-ID: <1f63ec03c58bf1bfac3ee7f46f72f475@koremail.com>

On 2020-06-10 07:31, Richard Wordingham via Unicode wrote:
> On Tue, 9 Jun 2020 20:53:31 +0100
> Michael Everson via Unicode wrote:
>
>> Doesn't it matter _why_ they are bent?
>>
>>> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>>>
>>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
>
> Yes, it does. It seems that they are bent so that they don't clash with the line above. Changing the line breaks or even changing the relative widths of the characters will change which ones get bent. Being bent is an attribute of glyphs in laid-out text, rather than an attribute of characters in a sequence of characters.
>
> That is why mention of ODT files is relevant.
> I'm not sure what one has to do to stop an ODT file reflowing. I suspect one may have to freeze a lot of the rendering chain to stop reflowing.
>
> Richard.

If whether or not the lamed is bent depends on the line above, then it is clearly not a suitable candidate for encoding.

John K

From mark at kli.org Tue Jun 9 20:41:15 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 9 Jun 2020 21:41:15 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: <8c1fe062-d31a-7c64-8ca0-4695e6a9f0a1@kli.org>

On 6/9/20 3:53 PM, Michael Everson via Unicode wrote:
> Doesn't it matter _why_ they are bent?
>
>> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>>
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
An old-time italic font might have 3 different "s"s or "n"s, depending on how much swoop and swash the typesetter felt like using at that particular spot, and the type sample pages would show them all, but that doesn't make them all distinct characters.? I think https://www.oldfonts.com/antiquepenman/wp-content/uploads/2017/03/libraryprimer.jpg is for handwriting, but one could easily imagine a typeface imitating that, with all those different forms of f and g and so on.? They're still just f's.? (I can see if I have some actual type samples for better examples if needed...) (I read Haralambous'(*) article on "Tiqwah" years ago; he definitely did some very careful work and study in Biblical typesetting.? But note, _typesetting_.? The art of laying out glyphs on paper.? That's not the same thing as characters.) ~mark (*) I think this may be the first time I noticed that his name isn't "Harambolous", which for some reason I thought it was.? Apologies, professor... On 6/9/20 12:44 PM, Michael Everson via Unicode wrote: > To respond to Mark, I?d say that these examples here certainly show a fairly obvious glyph distinction that is not really a ?hard sell?. > >> On 8 Jun 2020, at 18:45, Abraham Gross via Unicode wrote: >> >> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it? >> >> Here are 2 character sets with a folded lamed: >> https://i.imgur.com/iq8awBe.jpg ? an ??? ???? with the standing and folded lameds as separate letters. >> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 ? A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work. >> >> 2020?6?7? 10:27, "Mark E. Shoulson via Unicode" wrote: >> >>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote: >>> >>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at >>> best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any >>> combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each >>> one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not >>> things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like >>> U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus >>> HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal >>> VAV followed by a vowel. In spelling it out, you could call one a holam mal?, but not the other. >>> A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct >>> character, and moreover one that cannot be deduced algorithmically by looking at the letters around >>> it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same >>> character, and Unicode doesn't encode glyphs (anymore?) 
>>>
>>> ~mark

From abrahamgross at disroot.org  Tue Jun  9 21:50:07 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 02:50:07 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>

It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.

It shouldn't do any fancy processing by default (unless a font actually cares enough to mess with it). Most systems have just about the same font, so I wouldn't worry about the results of overstriking not coming out perfect. Even if it doesn't come out perfect, I'd take almost-exact representation over no representation any day.

2020/06/09 午後8:02:39 Sławomir Osipiuk via Unicode :

> 2. Overstriking arbitrary characters is a qualitatively different
> process than using combining characters. In the latter case, the set
> of characters is restricted, and certain algorithms can be applied to
> make the presentation look sane (to varying degrees of success).
> Overstriking implies the need for the rendering engine to be able to
> combine any two characters, regardless of elements that interfere or
> clash. It seems simple in principle to just render the characters
> separately and overlay the pixels, but I'm very skeptical of what the
> results would actually look like in real life, with users making
> unpredictable font and formatting choices.
>

From prosfilaes at gmail.com  Tue Jun  9 22:14:14 2020
From: prosfilaes at gmail.com (David Starner)
Date: Tue, 9 Jun 2020 20:14:14 -0700
Subject: OverStrike control character
In-Reply-To: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: 

On Tue, Jun 9, 2020 at 7:55 PM abrahamgross--- via Unicode wrote:
>
> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.
>
> It shouldn't do any fancy processing by default (unless a font actually cares enough to mess with it). Most systems have just about the same font, so I wouldn't worry about the results of overstriking not coming out perfect. Even if it doesn't come out perfect, I'd take almost-exact representation over no representation any day.

There is a character to do that: BS. Like many other control characters, it's now generally considered obsolete. There's no need to provide a new character to do something that an old character already does just fine, even if that old character is unsupported because it doesn't fit with current design choices. Making a new character won't change that.

-- 
The standard is written in English. If you have trouble understanding a particular section, read it again and again and again... Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991)
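(For concreteness: the classic surviving producer of BS overstriking is nroff/man output, where "X BS X" is rendered as bold and "_ BS X" as underlined by pagers such as less(1), and stripped by col -b. A minimal Python 3 sketch of that legacy convention; the helper names are illustrative, not from any standard:)

    BS = "\b"  # U+0008 BACKSPACE, the legacy overstrike operator

    def nroff_bold(s):
        # Strike each glyph over itself: less(1) shows "X\bX" as a bold X.
        return "".join(c + BS + c for c in s)

    def nroff_underline(s):
        # Overstrike an underscore with the glyph: "_\bX" reads as underlined X.
        return "".join("_" + BS + c for c in s)

    # A bare terminal, as the echo example below shows, just backs up and
    # overwrites, so pipe this output through `less` to see the effect:
    print(nroff_bold("bold") + " " + nroff_underline("under"))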
From abrahamgross at disroot.org  Tue Jun  9 22:25:10 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 03:25:10 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>

BS doesn't do that, though. Even in the beginning of ASCII it was only supported on some devices.

Doing `echo -e 'xy\x08z'` results in `xz` and not in https://imgur.com/B0020Xb

From gwalla at gmail.com  Wed Jun 10 00:44:41 2020
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 9 Jun 2020 22:44:41 -0700
Subject: OverStrike control character
In-Reply-To: <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>
Message-ID: 

Would x OVERSTRIKE z look the same as z OVERSTRIKE x? If yes, would they be considered identical for string matching purposes? Would they have to be reordered for normalization? What would be the repercussions for collation?

Display is not the only thing text is for.

On Tue, Jun 9, 2020 at 8:30 PM abrahamgross--- via Unicode <unicode at unicode.org> wrote:

> BS doesn't do that, though. Even in the beginning of ASCII it was only supported on some devices.
>
> Doing `echo -e 'xy\x08z'` results in `xz` and not in https://imgur.com/B0020Xb
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From abrahamgross at disroot.org  Wed Jun 10 01:05:57 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 06:05:57 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>
Message-ID: <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>

2020/06/10 午前1:45:32 Garth Wallace via Unicode :

> Would x OVERSTRIKE z look the same as z OVERSTRIKE x? If yes, would they be considered identical for string matching purposes?

They would look the same. In a perfect world they would be identical for string matching, but since it's a new control character I would understand if people don't want to put in the effort to adopt it properly.

> Would they have to be reordered for normalization?

Not sure what this means, but if I understand it correctly, then this might actually be a good idea for collation. But it might also be too much effort to implement, so it's not necessary. Like the Japanese saying goes, 「シンプルイズベスト」 (https://www.weblio.jp/content/Simple+is+Best).

> What would be the repercussions for collation?

I would say just take the first character in the sequence of overstruck characters and use that as the basis of collation. If this doesn't work, then I'm always open to suggestions.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
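(Garth's normalization question has a concrete analogue in existing combining marks: marks with different canonical combining classes commute under normalization, so either typing order matches, while an arbitrary overstrike operator would come with no such per-character data. A minimal Python 3 sketch using only the standard library:)

    import unicodedata

    # Dot below (ccc 220) and dot above (ccc 230) typed in either order:
    a = "q\u0323\u0307"  # q + COMBINING DOT BELOW + COMBINING DOT ABOVE
    b = "q\u0307\u0323"  # q + COMBINING DOT ABOVE + COMBINING DOT BELOW

    # Canonical reordering makes the two sequences equivalent...
    assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    # ...because each mark carries a combining class in the character data:
    print(unicodedata.combining("\u0323"), unicodedata.combining("\u0307"))  # 220 230

    # A hypothetical "x OVERSTRIKE z" carries no combining-class data, so
    # normalization would have no basis for equating it with "z OVERSTRIKE x".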
From jameskasskrv at gmail.com  Wed Jun 10 02:36:31 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 10 Jun 2020 07:36:31 +0000
Subject: OverStrike control character
In-Reply-To: <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
Message-ID: <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>

On 2020-06-10 6:05 AM, abrahamgross--- via Unicode wrote:
> Like the Japanese saying goes, 「シンプルイズベスト」
That's English written phonetically in katakana.

From richard.wordingham at ntlworld.com  Wed Jun 10 04:27:04 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 10 Jun 2020 10:27:04 +0100
Subject: OverStrike control character
In-Reply-To: <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
Message-ID: <20200610102704.3462a896@JRWUBU2>

On Wed, 10 Jun 2020 06:05:57 +0000 (UTC) abrahamgross--- via Unicode wrote:

> 2020/06/10 午前1:45:32 Garth Wallace via Unicode :
>
> > Would x OVERSTRIKE z look the same as z OVERSTRIKE x? If yes, would
> > they be considered identical for string matching purposes?
>
> They would look the same. In a perfect world they would be identical for string matching, but since it's a new control character I would understand if people don't want to put in the effort to adopt it properly.
>
> > Would they have to be reordered for normalization?
>
> Not sure what this means, but if I understand it correctly, then this might actually be a good idea for collation. But it might also be too much effort to implement, so it's not necessary. Like the Japanese saying goes, 「シンプルイズベスト」 (https://www.weblio.jp/content/Simple+is+Best).
>
> > What would be the repercussions for collation?
>
> I would say just take the first character in the sequence of overstruck characters and use that as the basis of collation. If this doesn't work, then I'm always open to suggestions.

But if they're identical for character matching, then they should collate identically, so this last bit is inherently broken.

Consider and in a proportional width font. Are you expecting the rendering system to position the 'l' using the knowledge that it will be overstruck? Overstriking is designed for a teletype with fixed-width characters. It takes special effort to get the overstriking effects on a video terminal or its emulation.

Richard.

From mark at kli.org  Wed Jun 10 07:59:16 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 10 Jun 2020 08:59:16 -0400
Subject: OverStrike control character
In-Reply-To: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>

On 6/9/20 10:50 PM, abrahamgross--- via Unicode wrote:
> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.

What are these "pixels" to which you refer? Fonts these days are defined in terms of strokes, not pixels. And Richard Wordingham points out the flaw in your notion of how it would be rendered, your claim that x OS z would look the same as z OS x:

> Consider and in a proportional
> width font. Are you expecting the rendering system to position the 'l'
> using the knowledge that it will be overstruck? Overstriking is
> designed for a teletype with fixed-width characters.

Besides, even if it worked as you said, with the narrow character centered, how long would it take before you found some examples that didn't really quite work out right?
Like overlaying a HEBREW LETTER YOD on a LATIN CAPITAL LETTER L, but what you really wanted was the YOD centered in the negative space of the L and not between the side-bearings, so next you'll want to be able to add some control over the exact positioning. And of course that won't work right in general, because it all depends on the font(s) involved.

And when it comes to matching, you say of x OS z and z OS x,

> In a perfect world they would be identical for string matching, but since it's a new control character I would understand if people don't want to put in the effort to adopt it properly.

But we're talking about making the rules here, the "perfect world." What should the *rule* be about string-matching? You can't have an optional rule, so a pair of strings will match on one system and not the other and both are right. Are we to understand that you think the rule should be that overstruck characters are considered to match in either order? Your gracious forgiveness of laxity in the rules doesn't really enter into the picture. And what about larger considerations? Can I have "ab⊕⊕xy" (using ⊕ for the overstrike) to overstrike a&x and b&y? What about "a⊕b⊕c⊕d⊕e⊕f⊕g⊕h"? What about "abc⊕d⊕⊕fg"? The f&b are overstruck and so are the c&d&g? Is that combination of c⊕d overstruck with g different from c⊕d⊕g or the same? What about other combinations? These are all things that need answers. What about overstriking a LTR character with a RTL one, or vice versa? Which way does the text go after that?

But I think what you're really missing is the crucial point that Garth Wallace pointed out:

> Display is not the only thing text is for.

You're focussing a lot on how characters *look*, can we get this letter to look a little different, can we layer characters to make other weird symbols (which will look radically different depending on the font)... You're looking at how to _draw_ stuff, how to make things look this way or that on paper or when rendered. But that's not what Unicode encodes. You need to think more about the distinction Unicode makes between characters and glyphs. "Plain text" isn't about display; it's about representing what's in a document, the characters which encode (in a different sense) the spoken language (usually) that is being communicated. All the things you're talking about are firmly in the realm of fonts and higher-level protocols. You surely could work out this overstriking display with a sufficiently-advanced font (you could make zero-advance-width overlaying characters and ligatures that would replace X⊕ with a zero-width equivalent of X, for example, in a monospace font), and you are welcome to do so, but that's where it belongs.

> It shouldn't do any fancy processing by default

Figuring out how much to backspace in order to center a glyph on another one, in a proportional-spaced font, is pretty fancy processing.

> Most systems have just about the same font, so I wouldn't worry about the results of overstriking not coming out perfect.

What a bland world you live in, wherein most fonts are the same! It's not about working with the default font on your favorite system; we're dealing with _characters_ here, which could be represented in ANY font.

~mark
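(One detail worth making concrete here: for the overlays that do carry meaning, Unicode already has dedicated combining overlay marks, with matching behavior defined by per-character data rather than by a generic overstrike rule. A minimal Python 3 sketch using only the standard library:)

    import unicodedata

    # "=" followed by COMBINING LONG SOLIDUS OVERLAY is canonically
    # equivalent to the precomposed NOT EQUAL TO, so NFC composes it:
    composed = unicodedata.normalize("NFC", "=\u0338")
    assert composed == "\u2260"
    print(unicodedata.name(composed))  # NOT EQUAL TO

    # That equivalence is one-off character data for this pair, not a
    # general rule: the same overlay on "x" stays a two-character sequence.
    assert unicodedata.normalize("NFC", "x\u0338") == "x\u0338"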
From sosipiuk at gmail.com  Wed Jun 10 09:39:08 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Wed, 10 Jun 2020 10:39:08 -0400
Subject: OverStrike control character
In-Reply-To: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: 

On Tue, Jun 9, 2020 at 10:55 PM abrahamgross--- via Unicode wrote:
>
> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.
>
> Even if it doesn't come out perfect, I'd take almost-exact representation over no representation any day.

To be sure, it would be a cool feature to have. However, as Raymond Chen (a Windows developer) succinctly put it, "every feature starts out at minus 100 points". It's not just a matter of having some value; that value must outweigh the effort it would take to fully implement it, and it must compete with other features that that effort could go to instead.

The proposal here is to add a completely new feature to Unicode, that in turn demands an updated font rendering process, and to convince vendors to support it. It's not just a new character which can be added to a font. It's new behaviour for characters. That's big; that's a lot of effort.

The idea of just overlaying pixels is a tunnel view of what's involved. Character combination (as it's currently done) doesn't occur at that level. It's done earlier. Pixels are at the final level of display. You'd need a new set of routines to enable combination there. And it's not trivial then, either. What about anti-aliasing, sub-pixel rendering? Who will want to do all that? You'd need not just font support but an OS-level change in the rendering process, in all major OSes. All for a feature that's a bit of fun but isn't guaranteed to produce elegant results.

There is also the question of backward compatibility. Even if this change is included in new releases, there will be plenty of old systems out there that won't have any idea of what an overstrike character is. You won't just get an "unknown character" glyph, you'll get an "unknown character" glyph between the two characters you're trying to combine.

I'm not saying it can't be done, or that it wouldn't be a nice-to-have. Where there's a will, there's a way. Realistically, though, I don't predict a lot of will to get something like this done.

Sławomir Osipiuk

From abrahamgross at disroot.org  Wed Jun 10 10:45:49 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 15:45:49 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org> <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>
Message-ID: 

2020/06/10 午前3:37:37 James Kass via Unicode :

>> On 2020-06-10 6:05 AM, abrahamgross--- via Unicode wrote:
>> Like the Japanese saying goes, 「シンプルイズベスト」
> That's English written phonetically in katakana.
>

It is, but "simple is best" doesn't mean anything in English, while it does mean something in Japanese. In English it would be something like "simplicity is always better".
From kent.b.karlsson at bahnhof.se  Wed Jun 10 11:12:04 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 10 Jun 2020 18:12:04 +0200
Subject: OverStrike control character
In-Reply-To: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
Message-ID: <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>

(You (all) apparently mean "overtype" rather than "overstrike"...; at least I read the latter as the same as crossed-out or strike-through.)

Well, however the overtyping, or overlapping, is achieved (BS, GCC (is there any implementation of that at all? I would strongly recommend against it), pinching the glyph spacing (there is a control sequence for that in ECMA-48) too much, or simply via how the font's glyphs are designed and spaced), there is no telling what the displayed result will be.

Doing such things really passes from being text display (styled or not) into the realm of graphics. And sure, you can do lots of things in graphic design, also with overlapping "graphic elements" (including glyphs for letters/digits/...). But as (possibly styled) TEXT, the displayed/printed result of overlaps would be "implementation defined". Please use a graphics editing program for controlling how overlapping graphic elements look; for overlapping, you may want to use different layers (graphics editing programs often support "layers") for the graphic elements that overlap, even if there are no layers when converting to (say) PNG.

(And for graphics, sorting, searching, and other text operations do not apply...; in HTML, images/graphics can have an "alt" text, which may or may not indicate what is in the image/graphics.)

/Kent Karlsson

> 10 juni 2020 kl. 14:59 skrev Mark E. Shoulson via Unicode:
>
> On 6/9/20 10:50 PM, abrahamgross--- via Unicode wrote:
>> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.
>
> […]
From hsivonen at hsivonen.fi  Wed Jun 10 11:47:47 2020
From: hsivonen at hsivonen.fi (Henri Sivonen)
Date: Wed, 10 Jun 2020 19:47:47 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net>
Message-ID: 

Tom Honermann wrote:
> I'm researching support for UTF-8 encoded source files in various C++ compilers. Here is a brief snapshot of existing practice:
>
> Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is recognized and discarded.
> GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option. If GCC is expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM is *not* honored and its presence is likely to result in compilation error.
> GCC has no support for compiling a translation unit consisting of differently encoded source files.
> Microsoft Visual C++, by default, interprets source files as encoded according to Windows' Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM. The default encoding can be overridden with a command line option.
> IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent). z/OS is an EBCDIC-based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files. Many EBCDIC code pages exist and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.
>
> The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding.
...
> Various methods are being explored for how to support collections of mixed-encoding source files. The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.

Given your description of existing compiler behavior, I recommend making the C++ standard say that if a file (substitute the right ISO term for "file") starts with a UTF-8 BOM, the file must be interpreted as UTF-8 and the BOM be discarded before further processing. This already fits what you say GCC, clang, and MSVC do by default and would not be a compatibility-breaking change for IBM compilers (though I understood the IBM compilers are being superseded by clang anyway as far as implementing C++ versions later than C++11 goes). This would also facilitate migration to UTF-8 on Windows and z/OS.

Shawn Steele wrote:
> The modern viewpoint is that the BOM should be discouraged in all contexts.

If you are writing an HTML serializer that 1) is a component distinct from the HTTP layer and, therefore, cannot control the HTTP headers and 2) mustn't impose restrictions on the shape of the DOM and, therefore, mustn't inject a meta element on its own, the best approach is to use the UTF-8 BOM.

> I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.

This is problematic in contexts where there is non-UTF-8 legacy, the input arrives over time, and streaming processing of the input is expected. See https://hsivonen.fi/utf-8-detection/

Eli Zaretskii wrote:
> > From: Shawn Steele via Unicode
> >
> > I've been recommending that people assume documents are UTF-8. If the UTF-8 decoding fails, then
> > consider falling back to some other codepage.
>
> That strategy would fail with 7-bit ISO 2022 based encodings, no?

Yes. When HTML is labeled as UTF-8 and is valid UTF-8, Firefox disables the character encoding menu to prevent self-XSS and to prevent the user from introducing data corruption to forms. This is a bit of a problem with e.g. university servers that have acquired a server-wide HTTP-level UTF-8 declaration but that carry occasional ancient ISO-2022-JP content. So far, I've decided not to do anything about this.

Fortunately, the ISO 2022 series isn't really relevant (as a good approximation) to C++.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/
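(For illustration, a minimal Python 3 sketch of the rule recommended above, "a leading UTF-8 BOM wins and is stripped", combined with the decode-then-fall-back heuristic discussed after it. The cp1252 fallback is an illustrative assumption, and, as noted in the thread, no such heuristic can recognize 7-bit ISO 2022 data, since it is plain ASCII bytes:)

    import codecs

    def read_source(raw: bytes) -> str:
        # BOM wins: interpret as UTF-8 and drop the signature bytes.
        if raw.startswith(codecs.BOM_UTF8):
            return raw[len(codecs.BOM_UTF8):].decode("utf-8")
        # Otherwise presume UTF-8...
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # ...and fall back to a legacy code page (illustrative choice).
            # Caveat: ISO 2022 text decodes "successfully" as UTF-8 above,
            # so this heuristic never reaches the fallback for it.
            return raw.decode("cp1252")

    assert read_source(b"\xef\xbb\xbfint x;") == "int x;"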
From junicode at jcbradfield.org  Wed Jun 10 12:04:30 2020
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Wed, 10 Jun 2020 18:04:30 +0100 (BST)
Subject: OverStrike control character
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org> <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>
Message-ID: 

On 2020-06-10, abrahamgross--- via Unicode wrote:
> 2020/06/10 午前3:37:37 James Kass via Unicode :
>
>>> On 2020-06-10 6:05 AM, abrahamgross--- via Unicode wrote:
>>> Like the Japanese saying goes, 「シンプルイズベスト」
>> That's English written phonetically in katakana.
>
> It is, but "simple is best" doesn't mean anything in English, while it does mean something in Japanese. In English it would be something like "simplicity is always better".

I can only conclude you're not a native English speaker. We can noun adjectives if we want, just as we can verb nouns. But in compsci, we say it more forcefully: KISS!

From harjitmoe at outlook.com  Wed Jun 10 12:27:07 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Wed, 10 Jun 2020 17:27:07 +0000
Subject: OverStrike control character
In-Reply-To: <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>, <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
Message-ID: 

> From: Unicode on behalf of Kent Karlsson via Unicode
> Sent: Wednesday, June 10, 2020 5:12:04 PM
> […]
> Well, however the overtyping, or overlapping, is achieved (BS, GCC (is there any implementation of that at all? I would strongly recommend against it), pinching the glyph spacing (there is a control sequence for that in ECMA-48) too much, or simply via how the font's glyphs are designed and spaced), there is no telling what the displayed result will be.
> […]

Looking at Annex C of ECMA-43 (ECMA's designation for ISO 4873, in turn referenced from ISO 8859), GCC is only permitted because it is not supposed to create an effectively new character, but rather to "juxtapose" the characters in one position (i.e. force a ligature, which if unsupported could just be shown as a sequence of individual characters). Similarly, BS is prohibited precisely because it overstamps to create a new character that the target system cannot be expected to support properly. Which you rightly mention, and which makes sense even for rendering; and let us not forget filename handling, narrator software for the visually impaired, _et cetera_.

The example it gives is using GCC on the sequence Pts to represent a ligature form (i.e. U+20A7).

ECMA-48 (ISO 6429) defines GCC's coded representation and parameters as a CSI sequence, and gives as a mere example the simplest case of triggering display of two characters side by side in one kanji width, i.e. what the Japanese era name ligatures, the CJK Compatibility block unit symbols, _et cetera_ do.

So apparently, GCC was (from what I can tell from the standards themselves) an attempt at defining a general mechanism for coding arbitrary ligatures and arbitrary CJK squared forms. Not character overstamping.

As a final note, I should probably mention that the best existing way to create an overstamped character cluster in HTML5 is probably to use embedded SVG.
But for the reasons mentioned, this would inherently not be very good for accessibility.
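(For concreteness, a minimal Python 3 sketch of that embedded-SVG approach: two glyphs stamped at the same baseline point. The coordinates and sizing are illustrative assumptions, since a real implementation would need actual font metrics, and, as noted, the result is effectively graphics rather than accessible text:)

    # Build a tiny SVG that draws two characters at the same origin,
    # so they overstamp each other when the image is rendered.
    SVG = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="40" height="40">'
        '<text x="20" y="30" text-anchor="middle" font-size="32">{a}</text>'
        '<text x="20" y="30" text-anchor="middle" font-size="32">{b}</text>'
        "</svg>"
    )

    def overstamp_svg(a: str, b: str) -> str:
        return SVG.format(a=a, b=b)

    with open("overstamp.svg", "w", encoding="utf-8") as f:
        f.write(overstamp_svg("x", "z"))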
From richard.wordingham at ntlworld.com  Wed Jun 10 14:13:37 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 10 Jun 2020 20:13:37 +0100
Subject: OverStrike control character
In-Reply-To: <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
Message-ID: <20200610201337.582a0775@JRWUBU2>

On Wed, 10 Jun 2020 18:12:04 +0200 Kent Karlsson via Unicode wrote:

> (You (all) apparently mean "overtype" rather than "overstrike"...; at least I read the latter as the same as crossed-out or strike-through.)

It is not the same, though crossing out can be implemented by overstriking.

Richard.
From kent.b.karlsson at bahnhof.se  Wed Jun 10 16:18:23 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 10 Jun 2020 23:18:23 +0200
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
Message-ID: <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>

> Looking at Annex C of ECMA-43 (ECMA's designation for ISO 4873, in turn referenced from ISO 8859), GCC is only permitted because it is not supposed to create an effectively new character, but rather to "juxtapose" the characters in one position (i.e. force a ligature, which if unsupported could just be shown as a sequence of individual characters).

Annex C says nothing about GCC. And the Pts example you mention below I don't find either...

I dread to make long quotes from the ECMA-48 text here, but I'll make a short one (from Annex A):

--------
Such a device may, however, process the sequence:
= BS /
in such a way that it is preserved and can be forwarded to a device which can indeed produce the intended composite symbol.

This example serves only the purpose of illustrating the difference between the effects of editor and formator functions. Where two or more graphic characters are to be imaged by a single graphic symbol, this should be done by using the control function GRAPHIC CHARACTER COMBINATION (GCC).
-------

So yes, GCC is (or rather: was) intended for overtyping (among other things...), exemplified by composing a "not equal to" symbol. Unicode does that particular example in a different way, of course. While I do think much of ECMA-48 does have a future (ever used a terminal emulator?), I don't think GCC has a future... Nor the interpretation of BS exemplified above... Nor an "OVERSTRIKE" control character...

/Kent K

> Similarly, BS is prohibited precisely because it overstamps to create a new character that the target system cannot be expected to support properly.
>
> […]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From harjitmoe at outlook.com  Wed Jun 10 16:21:41 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Wed, 10 Jun 2020 21:21:41 +0000
Subject: OverStrike control character
In-Reply-To: <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>, <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
Message-ID: 

I was referring to Annex C of ECMA-43, not ECMA-48.

Get Outlook for Android
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kent.b.karlsson at bahnhof.se  Wed Jun 10 16:40:26 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 10 Jun 2020 23:40:26 +0200
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
Message-ID: 

Ok, "3" not "8". The text you do refer to then seems to partially contradict ECMA-48, then. It is also technically wrong in the "Pts" example: it cannot be "GCC P t s" (the intent here was that the result is the single symbol U+20A7, not "Pts"), but must be "CSI 1 _ P t s CSI 2 _" (the GCC control sequence has a SPACE before the final "_"), since there are three characters composed.

/Kent K
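(To make that byte sequence concrete: per ECMA-48, GCC is a CSI-introduced control with an intermediate SPACE byte and final byte "_", where parameter 1 opens and 2 closes the run to be imaged as one symbol. A minimal Python 3 sketch; as the thread notes, essentially no modern terminal implements GCC, so this only shows the encoding:)

    ESC = b"\x1b"
    CSI = ESC + b"["

    def gcc(chars: bytes) -> bytes:
        # GRAPHIC CHARACTER COMBINATION: CSI Ps SP _ (final byte 0x5F,
        # preceded by the intermediate byte SPACE, 0x20).
        start = CSI + b"1 _"  # Ps = 1: start of the combined string
        end = CSI + b"2 _"    # Ps = 2: end of the combined string
        return start + chars + end

    # "P", "t", "s" to be imaged as a single symbol (the peseta sign):
    print(gcc(b"Pts"))  # b'\x1b[1 _Pts\x1b[2 _'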
control character... > > /Kent K > > >> Similarly, BS is prohibited precisely because it overstamps to create a new character that the target system cannot be expected to support properly. Which you rightly mention, and which makes sense even for rendering, and let us not forget filename handling, narrator software for the visually impaired, _et cetera_? >> >> The example it gives is using GCC on the sequence Pts to represent a ligature form (i.e. U+20A7). >> >> ECMA-48 (ISO 6429) defines GCC's coded representation and parameters as a CSI sequence, and gives as a mere example the simplest case of triggering display of two characters side-by-side in one kanji width, i.e. what the Japanese era name ligatures, the CJK Compatibility block unit symbols, _et cetera_ do. >> >> So apparently, GCC was (from what I can tell from the standards themselves) an attempt at defining a general mechanism for coding arbitrary ligatures and arbitrary CJK squared forms. Not character overstamping. >> >> As a final note, I should probably mention that the best existing way to create an overstamped character cluster in HTML5 is probably to use embedded SVG. But for the reasons mentioned, this would inherently not be very good for accessibility. >> From: Unicode > on behalf of Kent Karlsson via Unicode > >> Sent: Wednesday, June 10, 2020 5:12:04 PM >> To: Mark E. Shoulson > >> Cc: unicode at unicode.org > >> Subject: Re: OverStrike control character >> >> (You (all) apparently mean ?overtype? rather than ?overstrike??; at least I read the latter as the same as crossed-out or strike-through.) >> >> Well, however the overtyping, or overlapping, is achieved (BS, GCC (is there any implementation of that at all? I would strongly recommend against it), pinching the glyph spacing (there is a control sequence for that in ECMA-48) too much, or simply via how the font?s glyphs are designed and spaced), there is no telling what the displayed result will be. >> >> Doing such things really passes from being text display (styled or not) into the realm of graphics. And sure, you can do lots of things in graphic design, also with overlapping ?graphic elements? (including glyphs for letters/digits/...). But as (possibly styled) TEXT, the displayed/printed result of overlaps would be ?implementation defined?. Please use a graphics editing program for controlling how overlapping graphic elements look like; for overlapping, you may want to use different layers, graphics editing programs often support ?layers?, for the graphic elements that overlap even if there are no layers when converting to (say) PNG. >> >> (And for graphics, sorting, searching, and other text operations do not apply?; in HTML for images/graphics, you can have an ?alt? text, which may or may not, indicate what is in the image/graphics.) >> >> /Kent Karlsson >> >> > 10 juni 2020 kl. 14:59 skrev Mark E. Shoulson via Unicode >: >> > >> > On 6/9/20 10:50 PM, abrahamgross--- via Unicode wrote: >> >> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character. >> > >> > What are these "pixels" to which you refer? Fonts these days are defined in terms of strokes, not pixels. And Richard Wordingham points out the flaw in your notion of how it would be rendered, your claim that x OS z would look the same as z OS x: >> > >> >> Consider and in a proportional >> >> width font. Are you expecting the rendering system to position the 'l' >> >> using the knowledge that it will be overstruck? 
Overstriking is >> >> designed for a teletype with fixed width characters. >> > Besides, even if it worked as you said, with the narrow character centered, how long would it take before you found some examples that didn't really quite work out right? Like overlaying a HEBREW LETTER YOD on a LATIN CAPITAL LETTER L, but what you really wanted was the YOD centered in the negative space of the L and not between the side-bearings, so next you'll want to be able to add some control over the exact positioning. And of course that won't work right in general, because it all depends on the font(s) involved. >> > >> > And when it comes to matching, you say of x OS z and z OS x, >> > >> >> In a perfect world they would be identical for string matching, but since its a new control character I would understand if ppl don't want to put in the effort to adopt it properly. >> > >> > But we're talking about making the rules here, the "perfect world." What should the *rule* be about string-matching? You can't have an optional rule, so a pair of strings will match on one system and not the other and both are right. Are we to understand that you think the rule should be that overstruck characters are considered to match in either order? Your gracious forgiveness of laxity in the rules doesn't really enter into the picture. And what about larger considerations? Can I have "ab??xy" (using ? for the overstrike) to overstrike a&x and b&y? What about "a?b?c?d?e?f?g?h"? What about "abc?d??fg"? The f&b are overstruck and so are the c&d&g? Is that combination of c?d overstruck with g different from c?d?g or the same? What about other combinations? These are all things that need answers. What about overstriking a LTR character with a RTL one, or vice-versa? Which way does the text go after that? >> > >> > But I think what you're really missing is the crucial point that Garth Wallace pointed out: >> > >> >> Display is not the only thing text is for. >> > >> > You're focussing a lot on how characters *look*, can we get this letter to look a little different, can we layer characters to make other weird symbols (which will look radically different depending on the font)... You're looking at how to _draw_ stuff, how to make things look this way or that on paper or when rendered. But that's not what Unicode encodes. You need to think more about the distinction Unicode makes between characters and glyphs. "Plain text" isn't about display, it's about representing what's in a document, the characters which encode (in a different sense) the spoken language (usually) that is being communicated. All the things you're talking about are firmly in the realm of fonts and higher-level protocols. You surely could work out this overstriking display with a sufficiently-advanced font (you could make zero-advance-width overlaying characters and ligatures that would replace X? with a zero-width equivalent of X, for example, in a monospace fo! >> nt), and you are welcome to do so, but that's where it belongs. >> >> It shouldnt do any fancy processing by default >> > Figuring out how much to backspace in order to center a glyph on another one, in a proportional-spaced font, is pretty fancy processing. >> >> Most systems have just about the same font so I wouldn't worry about the results of overstriking not coming out perfect. >> > >> > What a bland world you live in, wherein most fonts are the same! 
It's not about working with the default font on your favorite system; we're dealing with _characters_ here, which could be represented in ANY font. >> > >> > ~mark >> > >> > >> > >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kent.b.karlsson at bahnhof.se Wed Jun 10 16:53:06 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 10 Jun 2020 23:53:06 +0200 Subject: OverStrike control character In-Reply-To: <20200610201337.582a0775@JRWUBU2> References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <20200610201337.582a0775@JRWUBU2> Message-ID: > 10 juni 2020 kl. 21:13 skrev Richard Wordingham via Unicode : > > On Wed, 10 Jun 2020 18:12:04 +0200 > Kent Karlsson via Unicode wrote: > >> (You (all) apparently mean ?overtype? rather than ?overstrike??; at >> least I read the latter as the same as crossed-out or strike-through.) > > It is not the same, though crossing out can be implemented by > overstriking. Whichever is the best term, I first thought (just seeing the subject line) the suggestion was about what ECMA-48 does via CSI 9m? And is supported also in HTML, MS Word, and surely other formats. Some ?mark-downs? (a bit surprisingly) use -crossedout text-; try pasting the typical output from the Unix/Linux command ?ls -l? and paste that into some place using that mark-down? Or try it with any text that uses hyphens (as HYPHEN-MINUS). (I did not like the result?) /K > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at honermann.net Thu Jun 11 00:00:49 2020 From: tom at honermann.net (Tom Honermann) Date: Thu, 11 Jun 2020 01:00:49 -0400 Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? In-Reply-To: References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net> Message-ID: <014fe622-011e-b225-b769-4753a8af7883@honermann.net> On 6/10/20 12:47 PM, Henri Sivonen via Unicode wrote: > Tom Honermann wrote: >> I'm researching support for UTF-8 encoded source files in various C++ compilers. Here is a brief snapshot of existing practice: >> >> Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is recognized and discarded. >> GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option. If GCC is expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM is *not* honored and its presence is likely to result in compilation error. GCC has no support for compiling a translation unit consisting of differently encoded source files. >> Microsoft Visual C++, by default, interprets source files as encoded according to the Windows' Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM. The default encoding can be overridden with a command line option. >> IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent). z/OS is an EBCDIC based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files. 
>> Many EBCDIC code pages exist, and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.
>>
>> The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding.
> ...
>> Various methods are being explored for how to support collections of mixed encoding source files. The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.
> Given your description of existing compiler behavior, I recommend making the C++ standard say that if a file (substitute the right ISO term for "file") starts with a UTF-8 BOM, the file must be interpreted as UTF-8 and the BOM be discarded before further processing. This already fits what you say GCC, clang, and MSVC do by default and would not be a compatibility-breaking change for IBM compilers (though I understood the IBM compilers are being superseded by clang anyway as far as implementing C++ versions later than C++11 goes). This would also facilitate migration to UTF-8 on Windows and z/OS.

Thank you, Henri; this matches my inclination. If anyone else has dissenting opinions, please share them.

(The Clang ports to z/OS support distinct EBCDIC and ASCII modes, so they don't escape these concerns.)

> Shawn Steele wrote:
>> The modern viewpoint is that the BOM should be discouraged in all contexts.
> If you are writing an HTML serializer that 1) is a component distinct from the HTTP layer and, therefore, cannot control the HTTP headers, and 2) mustn't impose restrictions on the shape of the DOM and, therefore, mustn't inject a meta element on its own, the best approach is to use the UTF-8 BOM.
>
>> I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
> This is problematic in contexts where there is non-UTF-8 legacy, the input arrives over time, and streaming processing of the input is expected. See https://hsivonen.fi/utf-8-detection/
>
> Eli Zaretskii wrote:
>>> From: Shawn Steele via Unicode
>>>
>>> I've been recommending that people assume documents are UTF-8. If the UTF-8 decoding fails, then consider falling back to some other codepage.
>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
> Yes. When HTML is labeled as UTF-8 and is valid UTF-8, Firefox disables the character encoding menu to prevent self-XSS and to prevent the user from introducing data corruption to forms. This is a bit of a problem with e.g. university servers that have acquired a server-wide HTTP-level UTF-8 declaration but that carry occasional ancient ISO-2022-JP content. So far, I've decided not to do anything about this.
>
> Fortunately, the ISO 2022 series isn't really relevant (as a good approximation) to C++.

Additionally, falling back to another code page is not appropriate in contexts where proper diagnosis of ill-formed UTF-8 text is desired. For source code in particular, a fallback to ISO-8859-1 triggered by ill-formed UTF-8 in a string literal would result in silent miscompilation. The performance overhead of fallback would also not be acceptable for C++ compilation (where compilation performance is already a challenge).

Tom.
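To make the recommended behavior concrete: below is a minimal sketch, assuming a front end that has already read the source file into a byte buffer; strip_utf8_bom is a made-up name for illustration, not a function of any of the compilers discussed above.

```
// Minimal sketch: recognize and discard a UTF-8 BOM (EF BB BF).
// If the signature is present, the buffer should be treated as UTF-8.
#include <cstdio>
#include <string>

// Returns true and removes the signature if buf starts with a UTF-8 BOM.
// (Hypothetical helper, for illustration only.)
bool strip_utf8_bom(std::string& buf) {
    if (buf.size() >= 3 &&
        static_cast<unsigned char>(buf[0]) == 0xEF &&
        static_cast<unsigned char>(buf[1]) == 0xBB &&
        static_cast<unsigned char>(buf[2]) == 0xBF) {
        buf.erase(0, 3);  // drop the signature before lexing
        return true;
    }
    return false;
}

int main() {
    std::string src = "\xEF\xBB\xBFint main() {}";
    bool had_bom = strip_utf8_bom(src);
    std::printf("had BOM: %d, remainder: %s\n", had_bom, src.c_str());
}
```

Under the rule Henri proposes, a buffer that starts with the signature is unconditionally UTF-8; a buffer without it falls under whatever default the implementation documents.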
From richard.wordingham at ntlworld.com Thu Jun 11 03:18:55 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 11 Jun 2020 09:18:55 +0100
Subject: OverStrike control character
In-Reply-To: <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
Message-ID: <20200611091855.6e25a559@JRWUBU2>

On Wed, 10 Jun 2020 23:18:23 +0200 Kent Karlsson via Unicode wrote:

> So yes, GCC is (or rather: was) intended for overtyping (among other things...), exemplified by composing a “not equal to” symbol. Unicode does that particular example in a different way of course. While I do think much of ECMA-48 does have a future (ever used a terminal emulator?), I don't think GCC has a future… Nor the interpretation of BS exemplified above… Nor an “OVERSTRIKE” control character...

I suspect terminal emulators still need to handle GCC. Years ago, underlining in man pages used to be presented by overstriking letters with an underscore. It was a major pain when copying a man page to another medium, or even to a video terminal that didn't understand the convention. Nowadays, the output is tailored to the destination, so piping through od -c doesn't reveal how underlining is achieved in more modern systems. (However, 20-year-old Unix boxes are still being used with their original operating systems, and rlogin and its analogues are still in use.) I could be wrong. The emacs shell window (27.0.50) doesn't handle man page underlining, and I assume it's not enough of an irritant to get fixed.

GCC is seriously inconsistent with context-sensitive layout.

Richard.

From sosipiuk at gmail.com Thu Jun 11 09:52:29 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 11 Jun 2020 10:52:29 -0400
Subject: OverStrike control character
In-Reply-To: <20200611091855.6e25a559@JRWUBU2>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se> <20200611091855.6e25a559@JRWUBU2>
Message-ID:

On Thu, Jun 11, 2020 at 4:26 AM Richard Wordingham via Unicode wrote:
>
> I suspect terminal emulators still need to handle GCC. Years ago, underlining in man pages used to be presented by overstriking letters with an underscore.

Of course, the proper way to underline is by using SGR (select graphic rendition), not GCC.
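For anyone who still meets the teletype convention Richard describes (underscore, BS, letter for underlining; letter, BS, same letter for bold), the classic cleanup is what col -b does: let each backspace consume the byte before it. The following is a rough sketch of that idea, not the actual col implementation, and byte-oriented, so it assumes the ASCII-era data the convention was designed for:

```
// Sketch of col -b style processing: in "X BS Y" sequences keep only Y,
// the character struck last, dropping teletype underline/bold overstrikes.
#include <iostream>
#include <string>

std::string drop_overstrikes(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '\b') {
            if (!out.empty()) out.pop_back();  // backspace erases the previous byte
        } else {
            out.push_back(c);
        }
    }
    return out;
}

int main() {
    // "_\bN_\ba_\bm_\be" is how old nroff output underlined "Name".
    std::cout << drop_overstrikes("_\bN_\ba_\bm_\be") << '\n';  // prints "Name"
}
```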
From doug at ewellic.org Fri Jun 12 10:12:31 2020
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 12 Jun 2020 09:12:31 -0600
Subject: OverStrike control character
Message-ID: <001901d640cb$e545e470$afd1ad50$@ewellic.org>

If we're going to get all ECMA-48 about this, there is also CUB (CSI D), which moves the "active presentation position" back one character, or HPB (CSI j), which moves the "active data position" back one character. (Notice that even ECMA-48 understands the difference between presentation and data.) Both of these take an optional numeric parameter, so you can back up more than one position if you want.

So you have those. And if they don't work for you, well, then they don't work. It's no different from adding an overstrike mechanism to Unicode and expecting it to work for everyone, in all editing and displaying contexts, with all fonts and rendering engines, on all platforms.

This is a very non-Unicode concept, and I would suggest re-reading what Ken Whistler and others had to say about it.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From kent.b.karlsson at bahnhof.se Sun Jun 14 17:33:29 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Mon, 15 Jun 2020 00:33:29 +0200
Subject: OverStrike control character
In-Reply-To: <001901d640cb$e545e470$afd1ad50$@ewellic.org>
References: <001901d640cb$e545e470$afd1ad50$@ewellic.org>
Message-ID: <6174958D-2780-4FD1-8890-51EECC3DCF20@bahnhof.se>

> On 12 June 2020, at 17:12, Doug Ewell via Unicode wrote:
>
> If we're going to get all ECMA-48 about this, there is also CUB (CSI D), which moves the "active presentation position" back one character, or HPB (CSI j), which moves the "active data position" back one character. (Notice that even ECMA-48 understands the difference between presentation and data.)

Note that CUB (and related: CUD, CUF, CUU) moves the cursor position (“active position”) AFTER GCC (if anything anywhere supported that one), line breaking, bidi (ECMA-48 had its own approach to that) and line/character directions have been applied. Note that CUB moves to the left even for text with vertical lines; it's a purely “visual” move.

HPB (and related: HPR) moves the cursor position BEFORE all that (in what is nowadays referred to as the “backing store”); it's a “logical” move. (They need to be synched up, moving one will move the other, but ECMA-48 does not give exact details.) Such cursor movement control sequences are not suitable to store in the “backing store”, but ECMA-48 does not say so explicitly.

And CUB (with the absent parameter defaulted to 1) is what your favorite terminal emulator sends (to the program reading what you type on the keyboard) when you press the left arrow key on the keyboard. Maybe arrow keys in terminal emulators should have sent HPB/CNL/CPL/HPR (and have them interpreted as specified) instead of CUB/CUD/CUU/CUF. Sometimes it seems like CUB/CUF are interpreted as if they were HPB/HPR.

However, this is for terminal emulators only. Otherwise, “window” programs use keystroke events, not bothering with these cursor movement control sequences. But terminal emulators will be with us for quite some time yet!

/Kent K

> [...]

From corentin.jabot at gmail.com Sun Jun 14 17:17:41 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Mon, 15 Jun 2020 00:17:41 +0200
Subject: What constitutes an abstract character?
Message-ID:

Hello

While trying to define suitable semantics for the lexing of C++, we seem to fail to agree on the definition of abstract characters.

Notably:
- Would diacritic marks considered in isolation be considered abstract characters?
- What about Hangul Jamos and other marks that are not found in isolation in their respective scripts, variation selectors, etc.?

I guess another way to phrase my question is: does every assigned code point represent on its own an abstract character?

My understanding is that it is not the case, but I am eager to be enlightened.

Thanks,

Corentin

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xfq.free at gmail.com Sun Jun 14 19:47:37 2020
From: xfq.free at gmail.com (Fuqiao Xue)
Date: Mon, 15 Jun 2020 08:47:37 +0800
Subject: What constitutes an abstract character?
In-Reply-To:
References:
Message-ID:

Hi Corentin,

The term "abstract character" is ambiguous and can have multiple definitions. Depending on what you need, it can refer to the visual (i.e., grapheme), logical (i.e., code point), or byte-level (i.e., code unit) representation of a given piece of text.

FYI - W3C developed a Character Model document, which includes some guidelines on "characters" and may be useful to you: https://www.w3.org/TR/charmod/

Cheers,

Fuqiao

> On Mon, 15 Jun 2020 at 8:01, Corentin via Unicode wrote:
> [...]

From asmusf at ix.netcom.com Mon Jun 15 01:42:07 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 14 Jun 2020 23:42:07 -0700
Subject: What constitutes an abstract character?
In-Reply-To:
References:
Message-ID: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From xfq.free at gmail.com Mon Jun 15 09:22:26 2020
From: xfq.free at gmail.com (Fuqiao Xue)
Date: Mon, 15 Jun 2020 22:22:26 +0800
Subject: What constitutes an abstract character?
In-Reply-To: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
References: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
Message-ID:

Thanks for the hint Asmus - what you said makes sense and is very useful information in addition to the definitions in The Unicode Standard Section 3.4. I forgot that the term "abstract character" is defined in TUS, sorry.

Fuqiao

On Mon, 15 Jun 2020 at 14:44, Asmus Freytag via Unicode wrote:
> On 6/14/2020 5:47 PM, Fuqiao Xue via Unicode wrote:
>> [...]
>
> An abstract character is related to a code point by the character encoding. See definitions D7 and D10-D12 in Section 3.4, Characters and Encoding. (http://www.unicode.org/versions/latest/ch03.pdf#G2212)
>
> It is never a "code unit" or a "byte-level" thing. It is also not the code point.
>
> It is the thing that is being assigned a code point. (D11: "Encoded character: An association (or mapping) between an abstract character and a code point." -- the definition should really have an added "or code point sequence".
> Unicode finesses that by saying that sequences never encode an abstract character directly, but they can be used to "represent" it; see the comment on D7. That formally makes encoding a 1:1 process, but muddies the waters a bit on what we should consider an 'abstract character'. For example, it means that all "building blocks" of any sequences must be seen as abstract characters themselves.)
>
> Now the abstract character A-diaeresis (ä) is encoded by a single code point and also has a canonically equivalent representation by a combining sequence. In effect, the whole sequence "encodes" a single abstract character, but that is formally not how Unicode defines it.
>
> A diaeresis is a recognizable item of the writing system; if used as an umlaut, it tends to act as a decoration of a character that is more-or-less seen as a new entity (particularly in Swedish), and less as a modified letter A. If used as a diaeresis, it acts more like a punctuation mark that has a function of its own (forcing separate pronunciation). Even though it's graphically applied to a vowel, it can be understood as its own abstract character.
>
> Treating the diaeresis as its own independent abstract character makes logical and not just formal sense. That may not be the case equally for all types of diacritical marks. However, since they can all be named, and thus arguably exist as their own concepts at least on a descriptive level, it becomes effectively a non-problem.
>
> The way combining marks are treated in other scripts, they can all be on different points of the scale as logically independent entities, and some are even on different points of the scale in terms of graphically combining (they may be graphically indistinguishable from regular spacing letters).
>
> To recap, an "abstract" character is a conceptual character, something that forms the atom of a writing system (smallest divisible particle) as viewed from the process of encoding, which associates with it a single code point. "Abstract" characters may exist that are not encoded; and some of them can be analyzed as series of smaller abstract characters, and thus be represented as code point sequences.
>
> Some abstract characters are more like small molecules; they can be encoded as such, or they can also have a more atomic sequence that represents them. The rationale for allowing this dual nature is historical compatibility, not logical necessity, hence the model is in some ways not "pure" (just practical).
>
> A./
>
> PS: while the character model document tries to unravel the implications of the Unicode encoding model for W3C standards, it's not a substitute for the original definitions of how the Unicode Standard understands and defines the encoding process.
>
> [...]
From sosipiuk at gmail.com Mon Jun 15 09:33:04 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 15 Jun 2020 10:33:04 -0400
Subject: What constitutes an abstract character?
In-Reply-To:
References:
Message-ID: <000901d64321$e17f5ad0$a47e1070$@gmail.com>

I believe the underlying question is:

How does one programmatically identify and/or count the abstract characters in a Unicode text?

Sławomir Osipiuk

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kenwhistler at sonic.net Mon Jun 15 10:03:10 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 15 Jun 2020 08:03:10 -0700
Subject: What constitutes an abstract character?
In-Reply-To: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
Message-ID: <302647d5-c5ec-0623-b714-1804f0319ca9@sonic.net>

Not an interesting question, actually.

The units relevant to count in text, depending on what you are doing, are:

code units

code points

user-perceived characters (and other higher-level constructs which may be orthography-specific)

"Abstract characters" are an artifact of the formal encoding process. They are really only "counted" by character encoding committees, not by software processing text strings.

--Ken

On 6/15/2020 7:33 AM, Sławomir Osipiuk via Unicode wrote:
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From pgcon6 at msn.com Mon Jun 15 10:04:59 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Mon, 15 Jun 2020 15:04:59 +0000
Subject: What constitutes an abstract character?
In-Reply-To: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
Message-ID:

Unicode doesn't give one answer, since there's more than one way that might be appropriate to answer it.

You might want a count of Unicode code points. If a buffer contained a UTF-32 sequence, that would be the same as the sequence length divided by 4. (Counting them in UTF-16 or UTF-8 requires walking the sequence, obviously.) It would also mean that the text element _a-diaeresis_ could have a count of 1 in some cases but a count of 2 in other cases. An Old Hangul syllable might have a count of 1, 2 or 3, depending on the syllable.

You might want a count of NFC-composable entities. In that case, _a-diaeresis_ would always have a count of 1, but an Old Hangul syllable would have a count of 1, 2 or 3 depending on the syllable.

You might want a count of grapheme clusters, as defined in UAX #29. In that case, _a-diaeresis_ or any Old Hangul syllable would always have a count of 1.

Which way to count depends on one's purpose for counting.

Peter

From: Unicode On Behalf Of Sławomir Osipiuk via Unicode
Sent: Monday, June 15, 2020 7:33 AM
To: 'Corentin' ; 'unicode Unicode Discussion'
Subject: RE: What constitutes an abstract character?

I believe the underlying question is: [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
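To make Peter's distinctions concrete in code: the sketch below, assuming UTF-8 input, shows the first two counts; count_code_points_utf8 is an illustrative name, not an established API. A grapheme cluster count is deliberately omitted, since it requires the UAX #29 segmentation rules (available, for example, through ICU's BreakIterator).

```
// Sketch: counting code units vs. code points for a UTF-8 buffer.
// Code units are bytes; a code point starts at every byte that is not
// a UTF-8 continuation byte (10xxxxxx).
#include <cstdio>
#include <string>

std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80)  // skip continuation bytes
            ++n;
    return n;
}

int main() {
    std::string s = "a\xCC\x88";  // 'a' + U+0308 COMBINING DIAERESIS
    std::printf("code units:  %zu\n", s.size());                   // 3
    std::printf("code points: %zu\n", count_code_points_utf8(s));  // 2
    // Grapheme clusters: 1, but computing that needs UAX #29 rules.
}
```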
From corentin.jabot at gmail.com Mon Jun 15 11:34:25 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Mon, 15 Jun 2020 18:34:25 +0200
Subject: What constitutes an abstract character?
In-Reply-To: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
References: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
Message-ID:

On Mon, 15 Jun 2020 at 08:44, Asmus Freytag via Unicode wrote:
> [...]
>
> Some abstract characters are more like small molecules; they can be encoded as such, or they can also have a more atomic sequence that represents them.
> The rationale for allowing this dual nature is historical compatibility, not logical necessity, hence the model is in some ways not "pure" (just practical).

Thanks for this detailed reply, this is exactly the answer I was looking for!

> A./
>
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Mon Jun 15 12:01:46 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 15 Jun 2020 18:01:46 +0100
Subject: What constitutes an abstract character?
In-Reply-To: <302647d5-c5ec-0623-b714-1804f0319ca9@sonic.net>
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com> <302647d5-c5ec-0623-b714-1804f0319ca9@sonic.net>
Message-ID: <20200615180146.6bb55149@JRWUBU2>

On Mon, 15 Jun 2020 08:03:10 -0700 Ken Whistler via Unicode wrote:

> Not an interesting question, actually.
>
> The units relevant to count in text, depending on what you are doing, are:
>
> code units
>
> code points
>
> user-perceived characters (and other higher-level constructs which may be orthography-specific)

Which in general have a user- and time-specific definition, as at least hinted at in Peter Constable's comment that the way to count depends on the purpose of counting.

Richard.

From abrahamgross at disroot.org Tue Jun 16 11:43:16 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 16 Jun 2020 16:43:16 +0000
Subject: OverStrike control character
In-Reply-To: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
References: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>

> What are these "pixels" to which you refer? Fonts these days are defined in terms of strokes, not pixels.

Even though fonts are vectors, they still get rendered onto a raster screen. But the point was that they get overlaid and centered horizontally.

> Consider <l, OS, m> and <m, OS, l> in a proportional width font.
> Are you expecting the rendering system to position the 'l' using the knowledge that it will be overstruck? Overstriking is designed for a teletype with fixed width characters.

You can think of the knowledge of being overstruck like the knowledge fonts have of characters being combined with diacritics: fonts can specify an anchor point where the diacritics will go. Except with overstrike, the anchor will always be the center. The overstrike character is sorta like a ZWJ (zero width joiner) that turns the next character into a "diacritic". (Hope this explanation makes sense.)

> Besides, even if it worked as you said, with the narrow character centered, how long would it take before you found some examples that didn't really quite work out right? Like overlaying a HEBREW LETTER YOD on a LATIN CAPITAL LETTER L, but what you really wanted was the YOD centered in the negative space of the L and not between the side-bearings, so next you'll want to be able to add some control over the exact positioning. And of course that won't work right in general, because it all depends on the font(s) involved.

I will never make a proposal for the addition of a control character to control positioning. If it doesn't come out quite right, then either live with it, or find another character that fits better. After all, since fonts are different, you can't expect it to come out the same on every device (I'm agreeing with you on this). But I still think that an almost perfect rendition of the overstruck characters is way better than having none at all.

> Can I have "ab??xy" (using ? for the overstrike) to overstrike a&x and b&y? What about "a?b?c?d?e?f?g?h"? What about "abc?d??fg"? The f&b are overstruck and so are the c&d&g? Is that combination of c?d overstruck with g different from c?d?g, or the same? What about other combinations? These are all things that need answers.

You can only put one overstrike character in a row. If you type a second one, it gets ignored. So ab??xy will render as "a[bx]y", where [bracketed] characters are rendered overlaid. a?b?c?d?e?f?g?h will look like [abcdefgh], all on top of each other. abc?d??fg will be ab[cdf]g. To get a[bf][cdg] you need to type ab?fc?d?g. The exact rendering will of course depend on the font of the device you're using, but again, I still think that an almost perfect rendition of the overstruck characters is way better than having none at all.

> What about overstriking a LTR character with a RTL one, or vice-versa? Which way does the text go after that?

The text after that goes in the direction of the text that follows. So for ?L??????? it's gonna look like ?[L???]??????, and for ?L??ab? it's gonna look like ?[L?]ab?. Meaning that only the very next letter gets overstruck, and anything afterwards continues on like it would normally.

Going back to the L?י example, here's what it would look like: https://imgur.com/a/N9QApwh

Here's a short command to generate the images, and it works for any 2-letter combination. Just replace what comes after `label:` with the letters whose overstriking you want to test. (You need to install ImageMagick before using this, though.)

```
magick -background none -pointsize 100 \
  label:L label:י \
  \( -clone 0 \) -delete "%[fx:u.w>v.w?2:0]" \
  -gravity center -compose Over -composite \
  -background white -flatten \
  out.png
```

How it would look in a serif font:

```
magick -background none -font "FreeSerif" -pointsize 100 \
  label:L label:י \
  \( -clone 0 \) -delete "%[fx:u.w>v.w?2:0]" \
  -gravity center -compose Over -composite \
  -background white -flatten \
  out.png
```

For Windows users:

```
magick ^
  in0.png in1.png ( -clone 0 ) ^
  -delete "%%[fx:u.w>v.w?2:0]" ^
  -compose Over -composite ^
  -background white -flatten ^
  out.png
```

Here's an example of a symbol that isn't widespread enough for its own codepoint, but which can be easily implemented through the overstrike character:

The symbol in usage: https://imgur.com/AMAVrZT
The symbol made by overstriking 𝄞?♥: https://imgur.com/46ReTNu

The ImageMagick command to generate it:

```
magick -background none -font "FreeSerif" -pointsize 100 \
  label:𝄞 label:♥ \
  \( -clone 0 \) -delete "%[fx:u.w>v.w?2:0]" \
  -gravity center -compose Over -composite \
  -background white -flatten \
  clefheart.png
```

As for arrow keys, a press should pass over an overstruck combination in a single key press. However, the backspace key should remove only one of the overstruck characters, and not both/all at once. It should work like combining diacritics do: it takes 3 presses of the arrow keys to go past "xỳz", but 4 Backspace key presses to remove all of it, because the grave gets deleted separately from the "y".

From richard.wordingham at ntlworld.com Tue Jun 16 12:05:22 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 16 Jun 2020 18:05:22 +0100
Subject: OverStrike control character
In-Reply-To: <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
References: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <20200616180522.3a2bbcae@JRWUBU2>

On Tue, 16 Jun 2020 16:43:16 +0000 Abraham Gross via Unicode wrote:

> You can think of the knowledge of being overstruck like the knowledge fonts have of characters being combined with diacritics: fonts can specify an anchor point where the diacritics will go. Except with overstrike, the anchor will always be the center. The overstrike character is sorta like a ZWJ (zero width joiner) that turns the next character into a "diacritic". (Hope this explanation makes sense.)

You miss the problem. There is an issue of advance width. Font writers by and large don't seem very fond of making 'i' with a circumflex wider than one without. Some bite the bullet: there is at least one Arabic font where adding vowel marks changes the consonant skeleton. Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.

Richard.
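To put numbers on Richard's objection: if the overstruck pair is centered, the cluster's advance has to be the larger of the two glyph advances, in either order. Below is a toy sketch with invented metrics; a real implementation would take the advances from the font.

```
// Toy model of centering one glyph over another, as the proposed OS
// control would require. The advance widths are made-up numbers.
#include <algorithm>
#include <cstdio>

struct Placement {
    double advance;  // advance width of the whole overstruck cluster
    double x1, x2;   // x offsets that center each glyph in the cluster
};

Placement overstrike(double w1, double w2) {
    double cell = std::max(w1, w2);  // cluster advance = the wider glyph
    return { cell, (cell - w1) / 2, (cell - w2) / 2 };
}

int main() {
    // 'l' narrow, 'm' wide: <l, OS, m> and <m, OS, l> end up with the
    // same advance, which proportional fonts are not designed to give.
    Placement a = overstrike(2.0, 6.0);
    Placement b = overstrike(6.0, 2.0);
    std::printf("advance %.1f vs %.1f\n", a.advance, b.advance);  // 6.0 vs 6.0
}
```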
From marius.spix at web.de Tue Jun 16 12:16:24 2020
From: marius.spix at web.de (Marius Spix)
Date: Tue, 16 Jun 2020 19:16:24 +0200
Subject: Aw: Re: OverStrike control character
In-Reply-To: <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
References: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

> Even though fonts are vectors, they still get rendered onto a raster screen. But the point was that they get overlaid and centered horizontally.

It is possible to draw vector graphics on CRT screens or to draw documents with a plotter. Unicode does not specify how characters are rendered.

From abrahamgross at disroot.org Tue Jun 16 12:28:22 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 16 Jun 2020 17:28:22 +0000
Subject: OverStrike control character
In-Reply-To: <20200616180522.3a2bbcae@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

On 16 Jun 2020 at 13:06, "Richard Wordingham via Unicode" wrote:

> Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.

Right, exactly. Why's that a problem?

From richard.wordingham at ntlworld.com Tue Jun 16 15:39:25 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 16 Jun 2020 21:39:25 +0100
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <20200616213925.46980570@JRWUBU2>

On Tue, 16 Jun 2020 17:28:22 +0000 Abraham Gross via Unicode wrote:

> On 16 Jun 2020 at 13:06, "Richard Wordingham via Unicode" wrote:
>> Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.
>
> Right, exactly. Why's that a problem?

Have you written a proportional width font that can do that? I'm not saying it's impossible, just that it's a lot of work.

Richard.

From harjitmoe at outlook.com Tue Jun 16 15:43:15 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Tue, 16 Jun 2020 20:43:15 +0000
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

>> Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.
>
> Right, exactly. Why's that a problem?

Because with how things usually work currently (in something like Roman or Greek, at any rate), the text renderer will make space for the first character first, and then position any following combining diacritics in that space. That is to say, the anchor point is a point inside the space allocated for the base character, and the diacritics are positioned so their anchors are at that point. The combining diacritics themselves have zero advance width, and no space is allocated for them; in the absence of anchors, they just poke over the previous character and (somewhat optimistically) hope for the best.
So if, say, m and l were treated just as postfix combining diacritics are today, the m in lm would significantly poke out of both sides of the space allocated for the l, whereas ml would not do that (since the space is allocated for the m, which is the wider of the two), and hence they wouldn't display the same way.

In terms of the use of BS for this in e.g. 7-bit ASCII, this works only because the output device is using a fixed width font such as Courier, and so doesn't have to worry about this sort of thing.

Obviously, there are some existing exceptions to this being how combining characters work (e.g. some Tamil vowel marks actually display in-line before the base character, and so shove it forward in the line despite being encoded after it). But these exceptions pose an implementation burden, requiring the layout engine to actively support these scripts.

From pgcon6 at msn.com Tue Jun 16 18:01:18 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Tue, 16 Jun 2020 23:01:18 +0000
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

From: Unicode On Behalf Of Harriet Riddle via Unicode
Sent: Tuesday, June 16, 2020 1:43 PM

> Because with how things usually work currently (in something like Roman or Greek, at any rate), the text renderer will make space for the first character first, and then position any following combining diacritics in that space.

Well, maybe some legacy rendering engines are like that. But that is not how any text rendering engine capable of supporting any significant portion of Unicode is going to work. For most scripts (and for Latin or Greek script with any support for typographic features), the engine needs to resolve what all the glyph IDs in a run are before it can start determining advance widths / positioning. And to do the latter, it will start with default advance widths and positions for all the glyphs, but then apply positioning actions that could revise any advance width or position.

Peter

From abrahamgross at disroot.org Wed Jun 17 09:44:52 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 17 Jun 2020 14:44:52 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

Then there's no problem with the advance width changes that overstriking will do (e.g. l?m).

On 2020/06/16 at 7:02 PM, Peter Constable via Unicode wrote:

> For most scripts (and for Latin or Greek script with any support for typographic features), the engine needs to resolve what all the glyph IDs in a run are before it can start determining advance widths / positioning. [...]
From pgcon6 at msn.com Wed Jun 17 18:54:57 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Wed, 17 Jun 2020 23:54:57 +0000
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

Except that BS is not a graphic character that will get a glyph with default metrics and potential interaction with other glyphs.

Peter

-----Original Message-----
From: Unicode On Behalf Of abrahamgross--- via Unicode
Sent: Wednesday, June 17, 2020 7:45 AM
To: unicode at unicode.org
Subject: RE: OverStrike control character

Then there's no problem with the advance width changes that overstriking will do (e.g. l?m). [...]

From abrahamgross at disroot.org Wed Jun 17 19:45:32 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 18 Jun 2020 00:45:32 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

Which is why I'm advocating for an OverStrike control character.

On 2020/06/17 at 7:55 PM, Peter Constable via Unicode wrote:

> Except that BS is not a graphic character that will get a glyph with default metrics and potential interaction with other glyphs.

From corentin.jabot at gmail.com Thu Jun 18 10:54:16 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Thu, 18 Jun 2020 17:54:16 +0200
Subject: EBCDIC control characters
Message-ID:

Dear Unicode people.

The C0 and C1 control blocks seem to have no intrinsic semantics, but the control characters of multiple character sets (such as some of the ISO encodings, and the EBCDIC control characters) map to the same block of code points (for EBCDIC, a mapping is described in the UTF-EBCDIC UAX - not sure if this mapping is described anywhere else), such that a distinction between the different provenances is not possible, despite these control characters having potentially different semantics in their original character sets.

Has this ever been an issue? Was it discussed at any point in history? Is there a recommended way of dealing with that?

I realize the scenario in which this might be relevant is a bit far-fetched, but as I try to push the C++ committee into the modern age, these questions, unfortunately, arose.

Thanks a lot,

Corentin

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kenwhistler at sonic.net Thu Jun 18 13:00:12 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Thu, 18 Jun 2020 11:00:12 -0700
Subject: EBCDIC control characters
In-Reply-To:
References:
Message-ID:

On 6/18/2020 8:54 AM, Corentin via Unicode wrote:

> Dear Unicode people.
> The C0 and C1 control blocks seem to have no intrinsic semantics, but the control characters of multiple character sets (such as some of the ISO encodings, and the EBCDIC control characters) map to the same block of code points (for EBCDIC, a mapping is described in the UTF-EBCDIC UAX

UTR, actually, not a UAX:

https://www.unicode.org/reports/tr16/tr16-8.html

> - not sure if this mapping is described anywhere else)

Yes, in excruciating detail in the IBM Character Data Representation Architecture:

https://www.ibm.com/downloads/cas/G01BQVRV

> such that a distinction between the different provenances is not possible, despite these control characters having potentially different semantics in their original character sets.

It isn't really a "character set" issue. Either ASCII graphic character sets or EBCDIC graphic character sets could be used, in principle, with different sets of control functions mapped onto the control code positions in each overall scheme. That is typically how character sets worked in terminal environments.

What the IBM CDRA establishes is a reliable mapping between all the code points used, so that it was possible to set up reliable interchange between EBCDIC systems and ASCII-based systems. There is one gotcha to watch out for, because there are two possible ways to map newlines back and forth.

> Has this ever been an issue? Was it discussed at any point in history? Is there a recommended way of dealing with that?
>
> I realize the scenario in which this might be relevant is a bit far-fetched, but as I try to push the C++ committee into the modern age, these questions, unfortunately, arose.

There really is no way for a C or C++ compiler to interpret arbitrary control functions associated with control codes, in any case, other than the specific control functions baked into the languages (which are basically the same ones that the Unicode Standard insists should be nailed down to particular code points: CR, LF, TAB, etc.). Other control code points should be allowed (and not be messed with) in string literals, and the compiler should otherwise barf if they occur in program text where the language syntax doesn't allow them. And then compilers supporting EBCDIC should just use the IBM standard for mapping back and forth to ASCII-based values.

--Ken

From corentin.jabot at gmail.com Thu Jun 18 14:22:18 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Thu, 18 Jun 2020 21:22:18 +0200
Subject: EBCDIC control characters
In-Reply-To:
References:
Message-ID:

On Thu, 18 Jun 2020 at 20:00, Ken Whistler wrote:

> UTR, actually, not a UAX:
>
> https://www.unicode.org/reports/tr16/tr16-8.html
>
> Yes, in excruciating detail in the IBM Character Data Representation Architecture:
>
> https://www.ibm.com/downloads/cas/G01BQVRV

Thanks, I will have to read that!

> It isn't really a "character set" issue.
> Either ASCII graphic character sets or EBCDIC graphic character sets
> could be used, in principle, with different sets of control functions,
> mapped onto the control code positions in each overall scheme. That is
> typically how character sets worked in terminal environments.

That makes sense!

> What the IBM CDRA establishes is a reliable mapping between all the code
> points used, so that it was possible to set up reliable interchange
> between EBCDIC systems and ASCII-based systems.
> There is one gotcha to watch out for, because there are two possible
> ways to map newlines back and forth.
>
> > Has this ever been an issue? Was it discussed at any point in history?
> > Is there a recommended way of dealing with that?
> >
> > I realize the scenario in which this might be relevant is a bit
> > far-fetched, but as I try to push the C++ committee into the modern age,
> > these questions, unfortunately, arose.
>
> There really is no way for a C or C++ compiler to interpret arbitrary
> control functions associated with control codes, in any case, other than
> the specific control functions baked into the languages (which are
> basically the same that the Unicode Standard insists should be nailed
> down to particular code points: CR, LF, TAB, etc.). Other control code
> points should be allowed (and not be messed with) in string literals,
> and the compiler should otherwise barf if they occur in program text
> where the language syntax doesn't allow it. And then compilers
> supporting EBCDIC should just use the IBM standard for mapping back and
> forth to ASCII-based values.

The specific case that people are talking about is indeed string literals such as "\x06\u0086", where the hexadecimal escape is meant to be an EBCDIC character and the \uxxxx is meant to be a Unicode character such that the hexadecimal sequence would map to that character, and whether, in that very odd scenario, they are or are not the same character, and whether they should be distinguishable.

Our current model is source encoding -> Unicode -> literal encoding, all three encodings being potentially distinct, so we do in fact "mess with" string literals, and the question is whether or not going through Unicode should ever be considered destructive, and my argument is that it is never destructive because it is semantically preserving in all the relevant use cases.

The question was in particular whether we should use "a superset of Unicode" instead of "Unicode" in that intermediate step.

Again thanks a lot for your reply!

>
> --Ken
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kenwhistler at sonic.net Thu Jun 18 18:14:19 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Thu, 18 Jun 2020 16:14:19 -0700
Subject: EBCDIC control characters
In-Reply-To: 
References: 
Message-ID: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>

On 6/18/2020 12:22 PM, Corentin wrote:
> The specific case that people are talking about is indeed string literals
> such as "\x06\u0086", where the hexadecimal escape is meant to be an
> EBCDIC character and the \uxxxx is meant to be a Unicode character
> such that the hexadecimal sequence would map to that character, and
> whether, in that very odd scenario, they are or are not the same
> character, and whether they should be distinguishable

Well, with the caveat that I am not a formal language designer -- I just use them on T.V.... ;-)

My opinion is that such constructs should simply be illegal and/or non-syntactical.
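
To make the construct under discussion concrete - an editorial sketch, not any particular compiler's documented behavior, and assuming a UTF-8 literal encoding for the output shown in the comments:

    #include <cstdio>

    int main() {
        // "\x86" is a raw code-unit value: the byte 0x86 is copied into
        // the literal unchanged, with no character conversion involved.
        // "\u0086" names the abstract character U+0086 (a C1 control);
        // the compiler converts it to the literal encoding, so it becomes
        // C2 86 under UTF-8, or whatever single byte an EBCDIC code page
        // assigns to that control.
        const char s[] = "\x86\u0086";
        for (unsigned char c : s)
            std::printf("%02X ", c);   // prints "86 C2 86 00 " under UTF-8
        std::printf("\n");
    }

So whether the two escapes denote "the same character" is not answerable at the byte level; it depends entirely on the literal encoding in effect.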
The whole idea of letting people import the complexity of character set conversion (particularly extended to the incompatibility between EBCDIC and ASCII-based representation) into string literals strikes me as just daft.

If program text is to be interpreted and compiled in an EBCDIC environment, any string literals contained in that source text should be constrained to EBCDIC, period, full stop. (0x4B, 0x4B) And if they contain more than the very restricted EBCDIC set of A..Z, a..z, 0..9 and a few common punctuation marks, then it better all be in one well-supported EBCDIC extended code page such as CP 500.

If program text is to be interpreted and compiled in a Unicode environment, any string literals contained in that source text should be constrained to Unicode, period, full stop (U+002E, U+002E). And for basic, 8-bit char strings, it better all be UTF-8 these days. UTF-16 and UTF-32 also work, of course, but IMO, support for those is best handled by depending on libraries such as ICU, rather than expecting that the programming language and runtime libraries are going to support them as well as char* UTF-8 strings.

If program source text has to be cross-compiled in both an EBCDIC and a Unicode environment, the only sane approach is to extract all but the bare minimum of string literals to various kinds of resource files which can then be independently manipulated and pushed through character conversions, as needed -- not expecting that the *compiler* is going to suddenly get smart and do the right thing every time it encounters some otherwise untagged string literal sitting in program text. That's a whole lot cleaner than doing a whole bunch of conditional compilation and working with string literals in program text that are always going to be half-gibberish on whichever platform you view it for maintenance. I had to do some EBCDIC/ASCII cross-compiled code development once -- although admittedly 20 years ago. It wasn't pretty.

> Our current model is source encoding -> Unicode -> literal encoding,
> all three encodings being potentially distinct, so we do in fact "mess
> with" string literals, and the question is whether or not going through
> Unicode should ever be considered destructive,

Answer, no. If somebody these days is trying to do software development work in a one-off, niche character encoding that cannot be fully converted to Unicode, then *they* are daft.

> and my argument is that it is never destructive because it is
> semantically preserving in all the relevant use cases.
>
> The question was in particular whether we should use "a superset of
> Unicode" instead of "Unicode" in that intermediate step.

Answer no. That will cause you nothing but trouble going forward.

All my opinions, of course. YMMV. But probably not by a lot. ;-)

--Ken

From asmusf at ix.netcom.com Thu Jun 18 18:55:45 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 18 Jun 2020 16:55:45 -0700
Subject: EBCDIC control characters
In-Reply-To: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
Message-ID: <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From kenwhistler at sonic.net Thu Jun 18 19:24:35 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Thu, 18 Jun 2020 17:24:35 -0700
Subject: EBCDIC control characters
In-Reply-To: <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com>
Message-ID: <8d7f54ff-0f62-2aef-c7d7-8f1d6f3202e0@sonic.net>

Asmus,

On 6/18/2020 4:55 PM, Asmus Freytag via Unicode wrote:
> The problem with the C/C++ compilers in this regard has always been
> that they attempted to implement the character-set insensitive model,
> which doesn't play well with Unicode, so if you want to compile a
> program where string literals are in Unicode (and not just any 16-bit
> character set) then you can't simply zero-extend. (And if you are
> trying to create a UTF-8 literal, then all bets are off unless you
> have a real conversion).

As I said, daft. ;-)

Anybody who depends on zero-sign extension for embedding Unicode character literals in an 8859-1 (or any other 8-bit character set) program text ought to have their head examined. Just because you *can* do it, and the compilers will cheerily do what the spec says they should in such cases doesn't mean that anybody *should* use it. (There is lots of stuff in C++ that no sane programmer should use. )

--Ken

From asmusf at ix.netcom.com Thu Jun 18 20:16:05 2020
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Thu, 18 Jun 2020 18:16:05 -0700
Subject: EBCDIC control characters
In-Reply-To: <8d7f54ff-0f62-2aef-c7d7-8f1d6f3202e0@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com> <8d7f54ff-0f62-2aef-c7d7-8f1d6f3202e0@sonic.net>
Message-ID: <9a97ea72-0084-a78a-ade0-c5d72eac2b8c@ix.netcom.com>

On 6/18/2020 5:24 PM, Ken Whistler wrote:
> Asmus,
>
> On 6/18/2020 4:55 PM, Asmus Freytag via Unicode wrote:
>> The problem with the C/C++ compilers in this regard has always been
>> that they attempted to implement the character-set insensitive model,
>> which doesn't play well with Unicode, so if you want to compile a
>> program where string literals are in Unicode (and not just any 16-bit
>> character set) then you can't simply zero-extend. (And if you are
>> trying to create a UTF-8 literal, then all bets are off unless you
>> have a real conversion).
>
> As I said, daft. ;-)

Ken,

An argument can certainly be made that trying to be "character set independent" is daft - and back in the '90s I walked away from a job interview at a place that told me that they had "figured it all out" and were going to use "character set independence" as their i18n strategy and "only" needed someone to implement it. Easiest decision on my part. (They got creamed by their Unicode-based competitor in short order).

My experience with C/C++ is perhaps colored a bit by the fact that I've always used compilers that were targeting Unicode-based systems and had special extensions; not sure where things stand right now, for a purely generic implementation.

A./

> Anybody who depends on zero-sign extension for embedding Unicode
> character literals in an 8859-1 (or any other 8-bit character set)
> program text ought to have their head examined. Just because you *can*
> do it, and the compilers will cheerily do what the spec says they
> should in such cases doesn't mean that anybody *should* use it. (There
> is lots of stuff in C++ that no sane programmer should use. )
>
> --Ken
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pgcon6 at msn.com Thu Jun 18 22:59:29 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Fri, 19 Jun 2020 03:59:29 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: 

And your new control character would have the same limitation: control characters are default ignorable and don't get rendered.

Peter

-----Original Message-----
From: Unicode On Behalf Of abrahamgross--- via Unicode
Sent: Wednesday, June 17, 2020 5:46 PM
To: unicode at unicode.org
Subject: RE: OverStrike control character

Which is why I'm advocating for an OverStrike control character

2020/06/17 7:55:36 PM Peter Constable via Unicode :
> Except that BS is not a graphic character that will get a glyph with default metrics and potential interaction with other glyphs.

From abrahamgross at disroot.org Thu Jun 18 23:37:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 19 Jun 2020 04:37:39 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: 

Then how does the ZWJ (zero width joiner) work?

2020/06/19 0:00:26 AM Peter Constable via Unicode :
> And your new control character would have the same limitation: control characters are default ignorable and don't get rendered.
>
> Peter

From jameskasskrv at gmail.com Fri Jun 19 00:20:33 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 19 Jun 2020 05:20:33 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <54c36eae-5589-069b-1791-8dabea30b3e6@gmail.com>

On 2020-06-19 4:37 AM, abrahamgross--- via Unicode wrote:
> Then how does the ZWJ (zero width joiner) work?

It's considered punctuation even though it has control and format aspects. ZWJ is default ignorable. ZWJ requests a more joined form of a character string for the display if a more joined form is available in the font/rendering system. If a more joined form is not available, the display will be the same as if the ZWJ was not part of the character stream, and no harm done. The point being that the author requested a more joined form and this authorial intent is preserved in the text/data.

The ZWJ might be a good way to achieve over-striking. For example, the string "Respec<ZWJ>tfully" has a ZWJ inserted between the "c" and the "t". If you had a font which substituted a "c-t" over-strike for that string, your display could show it. If you had a font which substituted a "c-t" ligature for that string, that ligature would be displayed. Otherwise the string "Respec<ZWJ>tfully" would look the same as the string "Respectfully".

(Heh, the Thunderbird spell-checker chokes on the two instances with ZWJ.)
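
A small sketch of that behavior in today's C++ - assuming a UTF-8 literal encoding, which is an assumption rather than something the standard guarantees for plain literals:

    #include <iostream>
    #include <string>

    int main() {
        // U+200D ZERO WIDTH JOINER between 'c' and 't' requests a more
        // joined rendering (e.g. a c-t ligature) where the font offers
        // one; otherwise the text displays as plain "Respectfully".
        std::string s = "Respec\u200Dtfully";
        std::cout << s.size() << "\n";  // 15: the ZWJ occupies 3 UTF-8 bytes
    }

The request travels with the plain text, and renderers that do not understand it lose nothing - which is exactly the property described above.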
From richard.wordingham at ntlworld.com Fri Jun 19 05:32:50 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 11:32:50 +0100
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <20200619113250.66bb7f47@JRWUBU2>

On Fri, 19 Jun 2020 03:59:29 +0000
Peter Constable via Unicode wrote:

> And your new control character would have the same limitation:
> control characters are default ignorable and don't get rendered.

As a systematic rule, that's incorrect behaviour. They should only be ignored if the system doesn't 'understand' them. So, if the font selected doesn't support it, it can be ignored, but if it does, it should be honoured as part of the text.

Of course, there are misunderstandings; I've seen USE implementations complain about ZWJ.

Richard.

From richard.wordingham at ntlworld.com Fri Jun 19 05:48:27 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 11:48:27 +0100
Subject: EBCDIC control characters
In-Reply-To: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
Message-ID: <20200619114827.07df1f21@JRWUBU2>

On Thu, 18 Jun 2020 16:14:19 -0700
Ken Whistler via Unicode wrote:

> On 6/18/2020 12:22 PM, Corentin wrote:
> > The specific case that people are talking about is indeed string
> > literals such as "\x06\u0086", where the hexadecimal escape is meant
> > to be an EBCDIC character and the \uxxxx is meant to be a Unicode
> > character such that the hexadecimal sequence would map to that
> > character, and whether, in that very odd scenario, they are or are
> > not the same character, and whether they should be distinguishable

> > The question was in particular whether we should use "a superset
> > of Unicode" instead of "Unicode" in that intermediate step.

> Answer no. That will cause you nothing but trouble going forward.

Isn't there still the issue of supporting U+0000 in C-type strings?

Richard.

From jameskasskrv at gmail.com Fri Jun 19 06:40:01 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 19 Jun 2020 11:40:01 +0000
Subject: OverStrike control character
In-Reply-To: <20200619113250.66bb7f47@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2>
Message-ID: 

A font could be designed to make appropriate glyph substitutions for strings which include the control picture for backspace, U+2408 (␈). So a font could substitute an overstruck l-m glyph for the string 'l' + '␈' + 'm'. If the font didn't support that string, the default display would still show authorial intent. In this way users desiring to exchange data in plain-text which included over-strikes could do so without any additions to TUS.

Unicode, excluding emoji, eschews encoding items just because they sound cool and somebody might use them. But if users want to band together and establish conventions, there's nothing holding them back.
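
For what it's worth, such a convention would also be easy to process mechanically. A sketch - the convention itself ('X' + U+2408 + 'Y' meaning "Y struck over X") is hypothetical, as is the helper:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Find each U+2408 SYMBOL FOR BACKSPACE (UTF-8: E2 90 88) used as an
    // overstrike marker; a font's substitution rules, or a renderer,
    // could match the same triples.
    std::vector<std::size_t> find_overstrike_markers(const std::string& utf8) {
        std::vector<std::size_t> offsets;
        for (std::size_t i = 0; i + 2 < utf8.size(); ++i)
            if (static_cast<unsigned char>(utf8[i])     == 0xE2 &&
                static_cast<unsigned char>(utf8[i + 1]) == 0x90 &&
                static_cast<unsigned char>(utf8[i + 2]) == 0x88)
                offsets.push_back(i);
        return offsets;
    }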
From richard.wordingham at ntlworld.com Fri Jun 19 09:42:16 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 15:42:16 +0100
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2>
Message-ID: <20200619154216.6f85f1d5@JRWUBU2>

On Fri, 19 Jun 2020 11:40:01 +0000
James Kass via Unicode wrote:

> A font could be designed to make appropriate glyph substitutions for
> strings which include the control picture for backspace, U+2408
> (␈). So a font could substitute an overstruck l-m glyph for the
> string 'l' + '␈' + 'm'. If the font didn't support that string, the
> default display would still show authorial intent. In this way users
> desiring to exchange data in plain-text which included over-strikes
> could do so without any additions to TUS.

Wouldn't this violate the character identity of U+2408?

The proper mechanism would be to use a PUA character. The question is whether the font would be enough, or whether one would have to change its invoker.

Richard.

From gwidion at gmail.com Fri Jun 19 10:11:58 2020
From: gwidion at gmail.com (Joao S. O. Bueno)
Date: Fri, 19 Jun 2020 12:11:58 -0300
Subject: OverStrike control character
In-Reply-To: <20200619154216.6f85f1d5@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2>
Message-ID: 

Since this discussion has come this far, I will drop my 0.02:

I am currently authoring a library/framework to create character art ("ASCII art") - as a free software project, including drawing APIs using block characters, and helpers for using emoji. In this position, such a character combination would be a "nice to have" - and if it would not disturb other aspects of text-communication, I am all for it.

Fact is one would still need a terminal app to support it properly, but my project can also work with other backends for rendering. Currently it supports ANSI-sequence texts aimed at terminal emulators and an HTML output based on monospaced fonts and CSS styling. But pixel-based backends are on the roadmap, and easy to do.

I see this library and similar projects as major users of the "overstrike" features. In the case of my project, even as an enabler for other people to use it. However, as is obvious, I have to count on higher level protocols to specify in-string text attributes, and I can make use of those for overriding character positioning with no need of a special character for overstrike. So, although my project could support this, and having some people using the overstrike character for some simplified output, it will certainly also integrate in-string markup for other positioning control (by coincidence I was coding exactly this part last night).

On the other hand, if an overstrike character is ever implemented and supported in terminals and other text APIs in popular toolkits such as Qt/GTK, I can get more character artistic effects on those backends as well, instead of limiting them to pixel-based backends.
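
For terminal backends there is also prior art worth noting: the old teletype/nroff convention, which pagers such as less still honor, renders "X BS X" as bold and "_ BS X" as underlined. A backend could emit overstrikes the same way today, with no new character - a tiny sketch:

    #include <string>

    // Classic teletype-style overstrike: first glyph, backspace (0x08),
    // second glyph. How (and whether) it renders is up to the consumer.
    std::string overstrike(char under, char over) {
        return std::string{under, '\b', over};
    }
    // e.g. overstrike('l', 'm') yields "l\bm", and overstrike('_', 'x')
    // is the sequence nroff used for an underlined 'x'.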
(If anyone is curious about it, the project url is https://github.com/jsbueno/terminedia - and I can get help with having more Unicode-compliant internal names and APIs, as well as help other people for whom the project tools might be useful)

Regards,

js
-><-

On Fri, 19 Jun 2020 at 11:48, Richard Wordingham via Unicode < unicode at unicode.org> wrote:

> On Fri, 19 Jun 2020 11:40:01 +0000
> James Kass via Unicode wrote:
>
> > A font could be designed to make appropriate glyph substitutions for
> > strings which include the control picture for backspace, U+2408
> > (␈). So a font could substitute an overstruck l-m glyph for the
> > string 'l' + '␈' + 'm'. If the font didn't support that string, the
> > default display would still show authorial intent. In this way users
> > desiring to exchange data in plain-text which included over-strikes
> > could do so without any additions to TUS.
>
> Wouldn't this violate the character identity of U+2408?
>
> The proper mechanism would be to use a PUA character. The question is
> whether the font would be enough, or whether one would have to change
> its invoker.
>
> Richard.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From abrahamgross at disroot.org Fri Jun 19 10:27:35 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 19 Jun 2020 15:27:35 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <20200619154216.6f85f1d5@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2>
Message-ID: 

I think James used U+2408 as an example, and not as a real proposal for what overstrike should look like where it's not supported. (?m?l? or ?m?l? might be a good alternative)

2020/06/19 10:43:03 AM Richard Wordingham via Unicode :
> Wouldn't this violate the character identity of U+2408?

From kenwhistler at sonic.net Fri Jun 19 15:24:41 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Fri, 19 Jun 2020 13:24:41 -0700
Subject: EBCDIC control characters
In-Reply-To: <20200619114827.07df1f21@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2>
Message-ID: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>

On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> Isn't there still the issue of supporting U+0000 in C-type strings?

I don't see why. And it has nothing to do with Unicode per se, anyway.

That is just a transform of the question of "the issue of supporting 0x00 in C-type strings restricted to ASCII."

The issue is precisely the same, and the solutions are precisely the same -- by design.

--Ken

From markus.icu at gmail.com Fri Jun 19 16:00:21 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 19 Jun 2020 14:00:21 -0700
Subject: EBCDIC control characters
In-Reply-To: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

I would soften a bit what Ken and Asmus have said.

Of course C++ compilers have to deal with a variety of charsets/codepages. There is (or used to be) a lot of code in various Windows/Mac/Linux/... codepages, including variations of Shift-JIS, EUC-KR, etc.

My mental model of how compilers work (which might be outdated) is that they work within a charset family (usually ASCII, but EBCDIC on certain platforms) and mostly parse ASCII characters as is (and for the "basic character set" in EBCDIC, mostly assume the byte values of cp37 or 1047 depending on platform). For regular string literals, I expect it's mostly a pass-through from the source code (and \xhh bytes) to the output binary.

But of course C++ has syntax for Unicode string literals. I think compilers basically call a system function to convert from the source bytes to Unicode, either with the process default charset or with an explicit one if specified on the command line.

And then there are \uhhhh and \U00HHHHHH escapes even in non-Unicode string literals, as Corentin said. What I would expect to happen is that the compiler copies all of the literal bytes, and when it reads a Unicode escape it converts that one code point to the byte sequence in the default or execution-charset.

It would get more interesting if a compiler had options for different source and execution charsets. I don't know if they would convert regular string literals directly from one to the other, or if they convert everything to Unicode (like a Java compiler) and then to the execution charset. (In Java, the execution charset is UTF-16, so the problem space there is simpler.)

Of course, in many cases a conversion from A to B will pivot through Unicode anyway (so that you only need 2n tables not n^2.)

About character conversion in general I would caution that there are basically two types of mappings: Round-trip mappings for what's really the same character on both sides, and fallbacks where you map to a different but more or less similar/related character because that may be more readable than a question mark or a replacement character. In a compiler, I would hope that both unmappable characters and fallback mappings lead to compiler errors, to avoid hidden surprises in runtime behavior.

This probably constrains what the compiler can and should do. As a programmer, I want to be able to put any old byte sequence into a string literal, including NUL, controls, and non-character-encoding bytes. (We use string literals for more things than "text".) For example, when we didn't yet have syntax for UTF-8 string literals, we could write unmarked literals with \xhh sequences and pass them into functions that explicitly operated on UTF-8, regardless of whether those byte sequences were well-formed according to the source or execution charsets. This pretty much works only if there is no conversion that puts limits on the contents.

I believe that EBCDIC platforms have dealt with this, where necessary, by using single-byte conversion mappings between EBCDIC-based and ASCII-based codepages that were strict permutations. Thus, control codes and other byte values would round-trip through any number of conversions back and forth.

PS: I know that this really goes beyond string literals: C++ identifiers can include non-ASCII characters. I expect these to work much like regular string literals, minus escape sequences. I guess that the execution charset still plays a role for the linker symbol table.

Best regards,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sdowney at gmail.com Fri Jun 19 16:56:33 2020
From: sdowney at gmail.com (Steve Downey)
Date: Fri, 19 Jun 2020 17:56:33 -0400
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

On Fri, Jun 19, 2020 at 5:08 PM Markus Scherer via Unicode wrote:
>
> I would soften a bit what Ken and Asmus have said.
>
> Of course C++ compilers have to deal with a variety of charsets/codepages. There is (or used to be) a lot of code in various Windows/Mac/Linux/... codepages, including variations of Shift-JIS, EUC-KR, etc.
>
> My mental model of how compilers work (which might be outdated) is that they work within a charset family (usually ASCII, but EBCDIC on certain platforms) and mostly parse ASCII characters as is (and for the "basic character set" in EBCDIC, mostly assume the byte values of cp37 or 1047 depending on platform). For regular string literals, I expect it's mostly a pass-through from the source code (and \xhh bytes) to the output binary.

What you described is the standard model for C compilers. For better or worse, the C++ model is much more complicated. Note that what I'm about to describe isn't how actual compilers work, but is what is described in the C++ standard.

When translating a source file, all of the characters outside the 'basic source character set' (ASCII letters, numbers, some necessary punctuation) are converted to universal character names of the form \unnnn or \Unnnnnnnn, where the ns are the short name of the code point, and surrogate pairs are excluded, so really scalar values. Later in translation, the universal character names, and the basic source character set elements, are mapped to the execution character set, where the values are determined by locale. Which is terribly vague, and we'd like to clean that up.

There are wide literals to deal with, as well as the newer Unicode literals, where we've mandated the encoding to be UTF of the appropriate code unit width, with distinct types of char8_t, char16_t, and char32_t.

> But of course C++ has syntax for Unicode string literals. I think compilers basically call a system function to convert from the source bytes to Unicode, either with the process default charset or with an explicit one if specified on the command line.
>
> It would get more interesting if a compiler had options for different source and execution charsets. I don't know if they would convert regular string literals directly from one to the other, or if they convert everything to Unicode (like a Java compiler) and then to the execution charset. (In Java, the execution charset is UTF-16, so the problem space there is simpler.)

In practice, compilers behave sensibly and will map from the source to the destination encodings. In theory they triangulate via code points. This difference, of course, can be made visible by chosen text where there are multiple possible destinations for a code point. In practice, users do not care because they get the results they expect. It's more a problem in specification.

> PS: I know that this really goes beyond string literals: C++ identifiers can include non-ASCII characters. I expect these to work much like regular string literals, minus escape sequences. I guess that the execution charset still plays a role for the linker symbol table.

Identifiers work substantially the same way, although with additional restrictions.
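
A concrete illustration of identifiers going through that same machinery - a sketch that a sufficiently recent compiler should accept, assuming UTF-8, NFC source text:

    // A universal-character-name and the character itself spell the same
    // identifier: \u00E9 and 'é' are identical after translation phase 1,
    // provided the 'é' is the single code point U+00E9 (NFC).
    int caf\u00E9 = 1;

    int main() {
        return café;  // refers to the caf\u00E9 above
    }

(That the two spellings are one identifier is what the standard says; how reliably compilers have historically accepted the non-ASCII spelling is another matter, as the gcc history mentioned later in this thread shows.)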

I'm currently working on a proposal to apply the current UAX 31 to C++ to clean up the historical allow and block list. (http://wg21.link/p1949 : C++ Identifier Syntax using Unicode Standard Annex 31) I'll be posting some questions soon about that.

-SMD

From richard.wordingham at ntlworld.com Fri Jun 19 17:58:12 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 23:58:12 +0100
Subject: EBCDIC control characters
In-Reply-To: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: <20200619235812.405e74d0@JRWUBU2>

On Fri, 19 Jun 2020 13:24:41 -0700
Ken Whistler via Unicode wrote:

> On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > Isn't there still the issue of supporting U+0000 in C-type
> > strings?
>
> I don't see why. And it has nothing to do with Unicode per se, anyway.
>
> That is just a transform of the question of "the issue of supporting
> 0x00 in C-type strings restricted to ASCII."
>
> The issue is precisely the same, and the solutions are precisely the
> same -- by design.

There is a solution, but it's not nice. The solution is to work with UTF-8 plus one other character code - <0xC0, 0x80> for U+0000. In the absence of policemen, it works.

While Ken and Asmus both live (I can't remember whose lifetime it is), one can use scalar values beyond 0x10FFFF for character-like non-character entities, such as byte values with bit 7 or higher set (a widespread possibility for file names), or some enormous CJK glyph sets. I understand Emacs does that sort of thing, storing them using an extension of UTF-8, and seems to get away with it. I believe they're also used for Bucky-bitted 'characters' from keyboards. Outside Emacs, such things also provide reliable, private non-characters. Again, one has to watch out for policemen, which can make life fraught in complicated environments.

Richard.

From kent.b.karlsson at bahnhof.se Fri Jun 19 18:06:34 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sat, 20 Jun 2020 01:06:34 +0200
Subject: What constitutes an abstract character?
In-Reply-To: 
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
Message-ID: <5D51B47F-D1E9-45BD-8891-F9F374B9F97A@bahnhof.se>

> On 15 June 2020, at 17:04, Peter Constable via Unicode wrote:
>
> Unicode doesn't give one answer since there's more than one way that might be appropriate to answer it.
>
> [...] An Old Hangul syllable might have a count of 1, 2 or 3, depending on the syllable.

A bit peripheral to this thread, but:

1) No need to limit that to Old Hangul. It is equally valid for Modern Hangul. It's just that for SOME old Hangul syllables there is no (canonically equivalent) single character form. This is for encoding historical reasons, nothing deep. Just that hindsight is (now) not at all a sufficient reason to radically change the encoding. (It was sufficient reason long ago, resulting in the "Hangul mess" in Unicode...)

2) For practical (I guess) reasons one considers clusters of consonants and clusters of vowels as singular indivisible entities. However, since Hangul is an alphabetic script (and the letter basis has no consonant or vowel "clusters"; the clusters consist of one to three letters), also the (canonical) decomposition into at most three components is an artifact of the encoding. A Hangul syllable can often consist of more than three Hangul letters.
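
(For reference, the "at most three components" arithmetic is the standard Hangul syllable decomposition from the Unicode core specification; a minimal sketch, valid only for the precomposed range U+AC00..U+D7A3:)

    #include <array>

    // Decompose a precomposed Hangul syllable into its L, V and optional T
    // Jamo. A returned T of 0 means "no trailing consonant": the syllable
    // decomposes into just two Jamo. Splitting cluster Jamo into individual
    // letters would be a separate, non-canonical step.
    constexpr char32_t SBase = 0xAC00, LBase = 0x1100,
                       VBase = 0x1161, TBase = 0x11A7;
    constexpr int VCount = 21, TCount = 28;

    std::array<char32_t, 3> decomposeHangul(char32_t s) {
        char32_t index = s - SBase;
        char32_t L = LBase + index / (VCount * TCount);
        char32_t V = VBase + (index % (VCount * TCount)) / TCount;
        char32_t T = index % TCount;
        return {L, V, T != 0 ? TBase + T : char32_t{0}};
    }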
And no, the compatibility decomposition of the Hangul Jamo is of no help; basically they are wrong for Hangul. DO NOT USE! Completely different decompositions are needed to decompose into the letters originally designed for the script. Furthermore, the consonants are (basically) double encoded, but that is for encoding technical reasons, not that there are really two different ones each, just two different positions in a syllable.

This just shows that the mapping from "abstract characters" (in this example, the letters of the Hangul alphabet) to encoded characters sometimes can be non-trivial.

/Kent Karlsson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kent.b.karlsson at bahnhof.se Fri Jun 19 18:06:45 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sat, 20 Jun 2020 01:06:45 +0200
Subject: EBCDIC control characters
In-Reply-To: 
References: 
Message-ID: <0E7BDFB6-1F2E-40A3-B9AE-D45443DF288C@bahnhof.se>

> On 18 June 2020, at 20:00, Ken Whistler via Unicode wrote:
> [...]
> It isn't really a "character set" issue. Either ASCII graphic character sets or EBCDIC graphic character sets could be used, in principle, with different sets of control functions, mapped onto the control code positions in each overall scheme.

That does not seem to be a very good idea at all. Especially since we do not have any good way of telling which set of control codes is used in such cases; in particular it would be a very very bad idea for Unicode encodings. It would be even worse than the situation that led up to the construction of Unicode. So let's assume "normal" control code allocation in the C0 and C1 areas when using the U+nnnn notation (or \unnnn). (Here, not saying anything about the contents of C0/C1 for other encodings.)

I don't usually need to worry about EBCDIC-based encodings... But it seems that at least earlier (UTF-EBCDIC not so much) EBCDIC-based encodings had some control codes that have no direct correspondence in the "normal" C0/C1. Several are listed in the Wikipedia page about EBCDIC.

Even though there is no direct correspondence for them, there is a way to represent them, provided one agrees on a mapping: ISO/IEC 6429/ECMA-48 comes to the rescue. There are very many unused, but syntactically correct, escape sequences and control sequences. A few of them are designated as private use. So for (old?) EBCDIC control codes that do not have a representation in the "normal" C0/C1: if it is a parameterless one, "allocate" an escape sequence (compare: each C1 control code has an alternative as an escape sequence, like HTJ can be designated ESC I and NEL as ESC E), and for the ones that take a parameter, "allocate" a control sequence (in the ECMA-48 sense) that takes a parameter (you will need a parameter value mapping as well).

I'm not saying that these (old?) control codes unique to EBCDIC are well-designed and all worthy of implementation and perpetual use. Not at all. But if you do need to keep some of them in some contexts (and otherwise ignore them), allocating escape sequences and control sequences is the way to go. No need to allocate new characters in Unicode... And no need to interpret the C0/C1 space in Unicode "strangely" in some contexts. That way you can represent "odd" control codes from e.g. (old?) EBCDIC-based encodings also in Unicode and \unnnn notation (ok, for some (old?) EBCDIC-based encodings one needs an extra conversion step to convert the escape/control sequence to the (old?) control codes if the string targets such an encoding).
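
As a sketch of that suggestion - with an entirely made-up assignment: the EBCDIC byte 0x04 and the ESC 1 sequence below are illustrative assumptions, not any agreed or registered mapping:

    #include <string>

    // Represent a parameterless EBCDIC-only control function as an
    // ECMA-48 private-use escape sequence (ESC Fp, final byte 0x30..0x3F),
    // so it can travel through Unicode text without a new code point.
    std::string representEbcdicControl(unsigned char ebcdicByte) {
        switch (ebcdicByte) {
            case 0x04: return "\x1B\x31";  // hypothetical: ESC 1 (private use)
            default:   return {};          // others have direct C0/C1 mappings
        }
    }

Agreeing on such a mapping (and on parameter value mappings for the parameterized ones) would of course be the hard part.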

Happy summer (northern hemisphere...) solstice

/Kent Karlsson

From sdowney at gmail.com Fri Jun 19 21:16:29 2020
From: sdowney at gmail.com (Steve Downey)
Date: Fri, 19 Jun 2020 22:16:29 -0400
Subject: UAX 31 for C++ Identifiers
Message-ID: 

I'm the lead author for a proposal to rework C++ identifiers in line with the current recommendation of UAX 31. The current version is available at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1949r4.html. The core of the proposal is to replace the current allowlist with one based on XID_Start and XID_Continue, with the addition of LOW LINE in the start set.

The summary:

The allowed Unicode code points in identifiers include many that are unassigned or unnecessary, and others that are actually counter-productive. By adopting the recommendations of UAX #31, Unicode Identifier and Pattern Syntax, C++ will be easier to work with in international environments and less prone to accidental problems.

This proposal does not address some potential security concerns - so-called homoglyph attacks, where letters that appear the same may be treated as distinct. Methods of defense against such attacks are complex and evolving, and requiring mitigation strategies would impose substantial implementation burden.

This proposal also recommends adoption of Unicode normalization form C (NFC) for identifiers to ensure that when compared, identifiers intended to be the same will compare as equal. Legacy encodings are generally naturally in NFC when converted to Unicode. Most tools will, by default, produce NFC text. Some unusual scripts require the use of characters as joiners that are not allowed by UAX #31; these will no longer be available as identifiers in C++.

As a side-effect of adopting the identifier characters from UAX #31, using emoji in or as identifiers becomes ill-formed.

The most important open question is what are we losing by using the basic XID_Start XID_Continue* pattern. There are apparently natural languages that require code points outside that set in order to write some words. How much of a problem is that, and are there solutions without complex script analysis on potential identifiers?

Secondarily, what would an excellent conformance statement look like? I'm proposing an annex to the C++ standard discussing the conformance points and how we are or are not meeting them, so as to have clarity on how and why.

There are also open questions about emoji. There are currently a large number that are allowed, but it seems mostly due to the open listing of unassigned code points. Has there been discussion of a standard profile that would allow emoji in identifiers? I realize this has substantial overlap with script checking and the security paper. C++ identifiers are sort of half over the fence. ZWJ are allowed, but gender modifiers aren't, and neither were intentional with respect to emoji.

The feedback I've got is that we, the C++ committee, would really like not to own this problem, even if members participate in solving the problem.

Thanks!
-SMD (wg21/sg16)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmusf at ix.netcom.com Fri Jun 19 21:35:40 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 19 Jun 2020 19:35:40 -0700
Subject: UAX 31 for C++ Identifiers
In-Reply-To: 
References: 
Message-ID: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From sdowney at gmail.com Fri Jun 19 22:22:35 2020
From: sdowney at gmail.com (Steve Downey)
Date: Fri, 19 Jun 2020 23:22:35 -0400
Subject: UAX 31 for C++ Identifiers
In-Reply-To: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
Message-ID: 

On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode wrote:
>
> In source code, having ambiguous identifiers may not be worse than C-style obfuscation.
>

Until recently (the last release, 10.1), gcc rejected much of the allowed Unicode in UTF-8 input, even in places it would allow \u universal-character-names. So this all becomes easier now. As a Standard, we should have handled this better earlier, but the second best time is now. The XID_ properties make this a lot more palatable w.r.t. stability, though, and I'm not going to second guess people 10 or 20 or more years ago, too much. Ambiguity in external identifiers is already ill-formed, no diagnostic required, which means broken, but in ways that compilers can't treat as undefined.

> But with module names, etc. you may run into security issues if naming allows / facilitates spoofing.
>

I, and other people doing tools, both won and lost this battle already. Module names in source do not correspond with anything physical. `import some.module` connects you to whatever exported `some.module` by magic as far as the standard is concerned. We're working on the actual mechanics as a Technical Report, and compiler vendors are participating and aren't, as far as I can tell, more insane than the average infrastructure engineer. So I have hope.

Mapping anything to file paths is fraught beyond belief, and there are many experienced engineers providing war stories and parades of horribles, although I'd personally like to have more stories to tell.

The entire disconnect between logical and physical actually is hopeful, in a way that `#include ` isn't. Even though we have a lot of understanding of how that maps to filesystem searches.

Province of wg21/sg15, which I also participate in.

I suspect that trying to fix up anything with #include is infeasible since it's currently the wild west, changes will break, and C++ depends in practice on system provided headers that at best conform to old C standards.

Thanks!
-SMD

From jameskasskrv at gmail.com Fri Jun 19 23:48:00 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 04:48:00 +0000
Subject: OverStrike control character
In-Reply-To: <20200619154216.6f85f1d5@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2>
Message-ID: <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>

Richard Wordingham wrote,

> Wouldn't this violate the character identity of U+2408?

U+2408 SYMBOL FOR BACKSPACE

Using a symbol for backspace in running text as a symbol for backspace to illustrate a notational convention for overstriking shouldn't violate its character identity. It was offered in response to the objections of using the ASCII backspace or other control characters because they are not graphic characters. U+2408 is a graphic character.

> The proper mechanism would be to use a PUA character.

This would only be true if the data wasn't intended to be interchangeable.

Abraham Gross wrote,

> (?m?l? or ?m?l? might be a good alternative)
Since they're graphic characters, either should be workable. As long as our hypothetical user community agrees on a notational convention, acceptable display should be possible with existing technology. It might be interesting to see if people with a demonstrable need to exchange overstruck material in plain-text, such as epigraphers, already have an established convention.

In numismatics, Yeoman's catalogs simply spell it out for overstruck dates, such as "1918D, 8 over 7".

From abrahamgross at disroot.org Sat Jun 20 01:30:02 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sat, 20 Jun 2020 06:30:02 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2> <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

If epigraphers and numismatists have the need for overstriking in plain text, isn't that reason enough to encode it? Unicode encoded many completely extinct scripts* and extinct characters in existing scripts, so adding the overstrike doesn't seem like a stretch at all.

Does Yeoman show the "8 over 7" visually too, or does it just say "8 over 7" and you're supposed to imagine it yourself?

*Extinct scripts in Unicode:
Georgian capitals
Ogham
Runes
Glagolitic
Linear B
Phaistos disc
Lycian
Carian
Old (RTL) Italic
Gothic
Old Permic
Cuneiform
Deseret (conscript)
Shavian (conscript)
Linear A
Cypriot
Imperial Aramaic
Palmyrene
Nabatean
Hatran
Phoenician
Lydian
Meroitic
Old South Arabian
Old North Arabian
Avestan
Inscriptional Parthian
Inscriptional Pahlavi
Psalter Pahlavi
Old Turkic
Old Hungarian
Brahmi

Then there are many extinct scripts in the proposal stage, like Oracle Bone and Classical Yi.

2020/06/20 0:48:47 AM James Kass via Unicode :
> It might be interesting to see if people with a demonstrable need to exchange overstruck material in plain-text, such as epigraphers, already have an established convention.
>
> In numismatics, Yeoman's catalogs simply spell it out for overstruck dates, such as "1918D, 8 over 7".

From asmusf at ix.netcom.com Sat Jun 20 01:44:59 2020
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Fri, 19 Jun 2020 23:44:59 -0700
Subject: UAX 31 for C++ Identifiers
In-Reply-To: 
References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
Message-ID: <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com>

My meta point had been about possibly different levels of security issues between compile time and runtime.

A./

On 6/19/2020 8:22 PM, Steve Downey wrote:
> On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode
> wrote:
>> In source code, having ambiguous identifiers may not be worse than C-style obfuscation.
>>
> Until recently (the last release, 10.1), gcc rejected much of the allowed
> Unicode in UTF-8 input, even in places it would allow \u
> universal-character-names. So this all becomes easier now. As a
> Standard, we should have handled this better earlier, but the second
> best time is now. The XID_ properties make this a lot more palatable
> w.r.t. stability, though, and I'm not going to second guess people 10
> or 20 or more years ago, too much.
> Ambiguity in external identifiers is already ill-formed, no diagnostic
> required, which means broken, but in ways that compilers can't treat
> as undefined.
>
>> But with module names, etc. you may run into security issues if
>> naming allows / facilitates spoofing.
>>
> I, and other people doing tools, both won and lost this battle
> already. Module names in source do not correspond with anything
> physical. `import some.module` connects you to whatever exported
> `some.module` by magic as far as the standard is concerned. We're
> working on the actual mechanics as a Technical Report, and compiler
> vendors are participating and aren't, as far as I can tell, more
> insane than the average infrastructure engineer. So I have hope.
>
> Mapping anything to file paths is fraught beyond belief, and there are
> many experienced engineers providing war stories and parades of
> horribles, although I'd personally like to have more stories to tell.
>
> The entire disconnect between logical and physical actually is
> hopeful, in a way that `#include ` isn't. Even though
> we have a lot of understanding of how that maps to filesystem
> searches.
>
> Province of wg21/sg15, which I also participate in.
>
> I suspect that trying to fix up anything with #include is infeasible
> since it's currently the wild west, changes will break, and C++
> depends in practice on system provided headers that at best conform to
> old C standards.
>
> Thanks!
>
> -SMD
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From corentin.jabot at gmail.com Sat Jun 20 03:50:28 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 10:50:28 +0200
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

On Fri, 19 Jun 2020 at 23:00, Markus Scherer wrote:

> I would soften a bit what Ken and Asmus have said.
>
> Of course C++ compilers have to deal with a variety of charsets/codepages.
> There is (or used to be) a lot of code in various Windows/Mac/Linux/...
> codepages, including variations of Shift-JIS, EUC-KR, etc.
>
> My mental model of how compilers work (which might be outdated) is that
> they work within a charset family (usually ASCII, but EBCDIC on certain
> platforms) and mostly parse ASCII characters as is (and for the "basic
> character set" in EBCDIC, mostly assume the byte values of cp37 or 1047
> depending on platform). For regular string literals, I expect it's mostly a
> pass-through from the source code (and \xhh bytes) to the output binary.
>
> But of course C++ has syntax for Unicode string literals. I think
> compilers basically call a system function to convert from the source bytes
> to Unicode, either with the process default charset or with an explicit one
> if specified on the command line.
>
> It would get more interesting if a compiler had options for different
> source and execution charsets. I don't know if they would convert regular
> string literals directly from one to the other, or if they convert
> everything to Unicode (like a Java compiler) and then to the execution
> charset.
> (In Java, the execution charset is UTF-16, so the problem space
> there is simpler.)

Yes, and actually people are talking about that for legacy projects' sake, and in that case using Unicode internally makes even more sense.

> Of course, in many cases a conversion from A to B will pivot through
> Unicode anyway (so that you only need 2n tables not n^2.)
>
> About character conversion in general I would caution that there are
> basically two types of mappings: Round-trip mappings for what's really the
> same character on both sides, and fallbacks where you map to a different
> but more or less similar/related character because that may be more
> readable than a question mark or a replacement character. In a compiler, I
> would hope that both unmappable characters and fallback mappings lead to
> compiler errors, to avoid hidden surprises in runtime behavior.

I am hoping to make conversions that do not preserve semantics invalid; right now compilers will behave differently, some will not compile, some will insert question marks, leading to the runtime issues you describe.

Now, my argument is that going through Unicode (and keep in mind that we are describing a specification, not compiler implementations) lets us simplify the spec without preventing (nor mandating) round-tripping if the source and literal encodings happen to be the same. If there is a way through Unicode, transitively there is a direct way.

> This probably constrains what the compiler can and should do. As a
> programmer, I want to be able to put any old byte sequence into a string
> literal, including NUL, controls, and non-character-encoding bytes. (We use
> string literals for more things than "text".) For example, when we didn't
> yet have syntax for UTF-8 string literals, we could write unmarked literals
> with \xhh sequences and pass them into functions that explicitly operated
> on UTF-8, regardless of whether those byte sequences were well-formed
> according to the source or execution charsets. This pretty much works only
> if there is no conversion that puts limits on the contents.

Okay, we are really in C++ territory now. For the sake of people who are not aware: the \0 and \x escape sequences are really integer values and will never be semantically characters or involve conversion.

> I believe that EBCDIC platforms have dealt with this, where necessary, by
> using single-byte conversion mappings between EBCDIC-based and ASCII-based
> codepages that were strict permutations. Thus, control codes and other byte
> values would round-trip through any number of conversions back and forth.
>
> PS: I know that this really goes beyond string literals: C++ identifiers
> can include non-ASCII characters. I expect these to work much like regular
> string literals, minus escape sequences. I guess that the execution charset
> still plays a role for the linker symbol table.
>
> Best regards,
> markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From corentin.jabot at gmail.com Sat Jun 20 03:57:10 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 10:57:10 +0200
Subject: EBCDIC control characters
In-Reply-To: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode wrote:
>
> On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > Isn't there still the issue of supporting U+0000 in C-type strings?
>
> I don't see why. And it has nothing to do with Unicode per se, anyway.
>
> That is just a transform of the question of "the issue of supporting 0x00 in
> C-type strings restricted to ASCII."
>
> The issue is precisely the same, and the solutions are precisely the same
> -- by design.

I'm not sure I understand that issue, could you clarify? In both C and C++, U+0000 is interpreted as the null character (which marks the end of the string, depending on context), which is the same behavior as the equivalent ASCII character.

>
> --Ken
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From prosfilaes at gmail.com Sat Jun 20 05:30:22 2020
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 20 Jun 2020 03:30:22 -0700
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2> <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

You can use BS; you can use GCC (the ECMA-48 GRAPHIC CHARACTER COMBINATION control function); there are apparently ESC sequences that will do it, and you could implement it in any number of ways in rich text. They don't work right now; if you need this functionality, you're going to need to implement it. It doesn't feel that you want a way to store and display overtyped text, it feels that you want Unicode to officially support it. It's a very complex and expensive expansion, but if a bunch of people were using overtyped text, this discussion might be going differently.

This seems to be the quintessential example of a feature that has very marginal use and would be rather complex to implement. Back in the days of daisy-wheel printers, this was used, often to get characters not otherwise supported, like the cent sign; when the daisy-wheel printer disappeared, so did any support for such a thing. It's like playing games with character cell fonts to support stuff like a mouse cursor. It's history, and history not particularly easy to support in Unicode. There were just recently a bunch of characters encoded to support old 8-bit machines, because that was easy. But the associated inverted characters were rejected, and the submitters told to use some higher level protocol to support them. That seems to be a comparable reaction to what you're getting.

--
The standard is written in English. If you have trouble understanding a particular section, read it again and again and again... Sit up straight. Eat your vegetables. Do not mumble.
-- _Pascal_, ISO 7185 (1991)

From richard.wordingham at ntlworld.com Sat Jun 20 06:09:04 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 12:09:04 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: <20200620120904.0c6f7630@JRWUBU2>

On Sat, 20 Jun 2020 10:57:10 +0200
Corentin via Unicode wrote:
-- _Pascal_, ISO 7185 (1991)

From richard.wordingham at ntlworld.com  Sat Jun 20 06:09:04 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 12:09:04 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: <20200620120904.0c6f7630@JRWUBU2>

On Sat, 20 Jun 2020 10:57:10 +0200
Corentin via Unicode wrote:

> On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> wrote:
>
> > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > > Isn't there still the issue of supporting U+0000 in C-type
> > > strings?
> >
> > I don't see why. And it has nothing to do with Unicode per se,
> > anyway.
> >
> > That is just a transform of the question of "the issue of
> > supporting 0x00 in C-type strings restricted to ASCII."
> >
> > The issue is precisely the same, and the solutions are precisely
> > the same -- by design.
>
> I'm not sure I understand that issue; could you clarify?
> In both C and C++, U+0000 is interpreted as the null character
> (which marks the end of the string, depending on context), which is
> the same behavior as the equivalent ASCII character.

One immediate consequence of that assertion is that one cannot in
general store a line of Unicode text in a 'string'. There have been
Unicode test cases that deliberately include a null in the middle of
the text, and if the program thinks it has stored the line in a
'string', it will fail the test, because the null character and beyond
are not part of the text being interpreted.

One of the early tricks to store general character sequences in
strings was to use non-shortest form UTF-8 encodings to avoid
characters being interpreted as control characters with undesired
characteristics. This form of UTF-8 is now invalid. Java was
especially noted for using the encoding to store zero bytes in
byte code in UTF-8.

My guess is that Ken is alluding to not storing arbitrary text in
strings, but rather in arrays of code units along with appropriate
length parameters.

Richard.

From richard.wordingham at ntlworld.com  Sat Jun 20 06:24:52 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 12:24:52 +0100
Subject: OverStrike control character
In-Reply-To: <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
References: <20200616180522.3a2bbcae@JRWUBU2>
 <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
 <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: <20200620122452.7c9b709b@JRWUBU2>

On Sat, 20 Jun 2020 04:48:00 +0000
James Kass via Unicode wrote:

> Richard Wordingham wrote,
> >> The proper mechanism would be to use a PUA character.
>
> This would only be true if the data wasn't intended to be
> interchangeable.

PUA-encoded material is interchangeable - you just need to agree to the
convention. Emoji started out in the PUA. Remember the Conscript
Registry?

Richard.
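Richard's mention above of Java storing zero bytes via non-shortest-form
UTF-8 can be made concrete. The following C++ sketch of that "modified
UTF-8" trick is illustrative only: the function name is invented, and the
overlong 0xC0 0x80 form it emits is ill-formed as standard UTF-8, so a
conforming decoder must reject it. Only formats that define their own
variant, such as Java class files and JNI, accept it:

    #include <string>

    // Sketch only: encodes BMP scalar values, with U+0000 mapped to the
    // overlong pair 0xC0 0x80 so the output never contains a zero byte.
    std::string to_modified_utf8(const std::u32string& text) {
        std::string out;
        for (char32_t cp : text) {
            if (cp == 0x0000) {
                out += static_cast<char>(0xC0);  // overlong NUL:
                out += static_cast<char>(0x80);  // invalid standard UTF-8
            } else if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            // Supplementary characters omitted; Java encodes them as
            // CESU-8-style surrogate pairs, another non-standard form.
        }
        return out;
    }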
From corentin.jabot at gmail.com  Sat Jun 20 07:11:15 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 14:11:15 +0200
Subject: EBCDIC control characters
In-Reply-To: <20200620120904.0c6f7630@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
Message-ID: 

On Sat, 20 Jun 2020 at 13:14, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Sat, 20 Jun 2020 10:57:10 +0200
> Corentin via Unicode wrote:
>
> > On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> > wrote:
> > >
> > > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > > > Isn't there still the issue of supporting U+0000 in C-type
> > > > strings?
> > >
> > > I don't see why. And it has nothing to do with Unicode per se,
> > > anyway.
> > >
> > > That is just a transform of the question of "the issue of
> > > supporting 0x00 in C-type strings restricted to ASCII."
> > >
> > > The issue is precisely the same, and the solutions are precisely
> > > the same -- by design.
> >
> > I'm not sure I understand that issue; could you clarify?
> > In both C and C++, U+0000 is interpreted as the null character
> > (which marks the end of the string, depending on context), which is
> > the same behavior as the equivalent ASCII character.
>
> One immediate consequence of that assertion is that one cannot in
> general store a line of Unicode text in a 'string'. There have been
> Unicode test cases that deliberately include a null in the middle of
> the text, and if the program thinks it has stored the line in a
> 'string', it will fail the test, because the null character and
> beyond are not part of the text being interpreted.
>
> One of the early tricks to store general character sequences in
> strings was to use non-shortest form UTF-8 encodings to avoid
> characters being interpreted as control characters with undesired
> characteristics. This form of UTF-8 is now invalid. Java was
> especially noted for using the encoding to store zero
> bytes in byte code in UTF-8.
>
> My guess is that Ken is alluding to not storing arbitrary text in
> strings, but rather in arrays of code units along with appropriate
> length parameters.

Oh, yes, I see, thanks.
It's a special case of "null-terminated strings were a mistake".
But U+0000 has no other use or alternative semantics, right? The main
use case would be test cases?

> Richard.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jameskasskrv at gmail.com  Sat Jun 20 07:57:43 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 12:57:43 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

On 2020-06-20 10:30 AM, David Starner via Unicode wrote:
> It doesn't feel that you want a
> way to store and display overtyped text, it feels that you want
> Unicode to officially support it. It's a very complex and expensive
> expansion, but if a bunch of people were using overtyped text, this
> discussion might be going differently.

Yes.
And even if we suppose that some groups like coin collectors or
epigraphers might find that kind of feature helpful, there's a couple of
points to consider. One is that groups like that are probably already
cheerfully exchanging information using either some kind of rich text
scheme or some kind of plain-text convention. The other is that if their
needs were not being met they would be lobbying to get some kind of
support, which doesn't seem to be happening.

In the case of coin catalogs, the 'spell-it-out' convention predates the
computer era. It wouldn't surprise me if epigraphic conventions also
predate computers. People tend to stick with their conventions. If the
overstrike feature became available in plain-text, I'd expect the coin
catalogs to keep spelling things out for both clarity and consistency.

From jameskasskrv at gmail.com  Sat Jun 20 08:08:36 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 13:08:36 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: <0d6610c1-9f8f-0f9f-efba-677eb6700bc2@gmail.com>

On 2020-06-20 6:30 AM, abrahamgross--- via Unicode wrote:
> Does Yeoman show the "8 over 7" visually too, or does it just say
> "8 over 7" and you're supposed to imagine it urself?

Sometimes the catalogs include close-up photographs of the more popular
variations in the graphics section of a page, but in the text listings
it's just spelled out. The Yeoman consulted earlier was a 1984 print
version. But I also looked at an on-line catalog, "numista", which
likewise spelled out this particular overstrike (U.S.A. 1918D nickel
five cent piece).

From andrewcwest at gmail.com  Sat Jun 20 08:26:28 2020
From: andrewcwest at gmail.com (Andrew West)
Date: Sat, 20 Jun 2020 14:26:28 +0100
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

On Sat, 20 Jun 2020 at 14:03, James Kass via Unicode wrote:
>
> computer era. It wouldn't surprise me if epigraphic conventions also
> predate computers. People tend to stick with their conventions. If the
> overstrike feature became available in plain-text, I'd expect the coin
> catalogs to keep spelling things out for both clarity and consistency.

No-one would ever typographically overstrike one character with another
in a coin catalog because it would be difficult to read or even
illegible, and impossible to know if it meant "7 overstruck with 8" or
"8 overstruck with 7".
Andrew

From richard.wordingham at ntlworld.com  Sat Jun 20 09:32:19 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 15:32:19 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
Message-ID: <20200620153219.4d62c00f@JRWUBU2>

On Sat, 20 Jun 2020 14:11:15 +0200
Corentin via Unicode wrote:

> On Sat, 20 Jun 2020 at 13:14, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
>
> > On Sat, 20 Jun 2020 10:57:10 +0200
> > Corentin via Unicode wrote:
> >
> > > On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> > > wrote:
> > > >
> > > > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > > > > Isn't there still the issue of supporting U+0000 in C-type
> > > > > strings?
> > > >
> > > > I don't see why. And it has nothing to do with Unicode per se,
> > > > anyway.
> > > >
> > > > That is just a transform of the question of "the issue of
> > > > supporting 0x00 in C-type strings restricted to ASCII."
> > > >
> > > > The issue is precisely the same, and the solutions are precisely
> > > > the same -- by design.
> > >
> > > I'm not sure I understand that issue; could you clarify?
> > > In both C and C++, U+0000 is interpreted as the null character
> > > (which marks the end of the string, depending on context), which
> > > is the same behavior as the equivalent ASCII character.
> >
> > One immediate consequence of that assertion is that one cannot in
> > general store a line of Unicode text in a 'string'. There have been
> > Unicode test cases that deliberately include a null in the middle of
> > the text, and if the program thinks it has stored the line in a
> > 'string', it will fail the test, because the null character and
> > beyond are not part of the text being interpreted.
> >
> > One of the early tricks to store general character sequences in
> > strings was to use non-shortest form UTF-8 encodings to avoid
> > characters being interpreted as control characters with undesired
> > characteristics. This form of UTF-8 is now invalid. Java was
> > especially noted for using the encoding to store zero
> > bytes in byte code in UTF-8.
> >
> > My guess is that Ken is alluding to not storing arbitrary text in
> > strings, but rather in arrays of code units along with appropriate
> > length parameters.
>
> Oh, yes, I see, thanks.
> It's a special case of "null-terminated strings were a mistake".
> But U+0000 has no other use or alternative semantics, right? The main
> use case would be test cases?

I believe Unicode doesn't define its semantics, but rather defers by
default to ECMA-48. Of NUL it says, "NUL is used for media-fill or
time-fill. NUL characters may be inserted into, or removed from, a data
stream without affecting the information content of that stream, but
such action may affect the information layout and/or the control of
equipment."

I have used it for easy composition of Fortran output lines from
CHARACTER variables; the NULs in the resulting lines were simply ignored
when the output was displayed on a terminal or line printer. The Fortran
90 intrinsic function TRIM provided an easier and more reliable way of
doing the same job; embedded NULs don't play well with C.

Richard.
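Richard's closing remark is easy to demonstrate. In the sketch below
(the buffer contents are invented), the C string functions see only the
text up to the first NUL, which is exactly why an explicit length has to
be carried alongside the data:

    #include <cstdio>
    #include <cstring>

    int main() {
        // A 12-unit text with an embedded NUL, as in the Unicode test
        // files mentioned above.
        const char buf[] = "Hello\0world!";      // sizeof buf == 13
        std::printf("%zu\n", std::strlen(buf));  // prints 5
        std::printf("%zu\n", sizeof buf - 1);    // prints 12, the real length

        char copy[sizeof buf];
        std::memcpy(copy, buf, sizeof buf);      // preserves "\0world!"
        // std::strcpy(copy, buf) would have stopped after "Hello".
    }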
From kenwhistler at sonic.net  Sat Jun 20 09:45:45 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Sat, 20 Jun 2020 07:45:45 -0700
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
Message-ID: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>

On 6/20/2020 5:11 AM, Corentin via Unicode wrote:
>
> My guess is that Ken is alluding to not storing arbitrary text in
> strings, but rather in arrays of code units along with appropriate
> length parameters.
>
> Oh, yes, I see, thanks.
> It's a special case of "null-terminated strings were a mistake".
> But U+0000 has no other use or alternative semantics, right? The main
> use case would be test cases?

Yes, that was basically what I was alluding to.

Richard is making the purist point that U+0000 is a Unicode character,
and therefore should be transmissible as part of any Unicode plain text
stream.

But the C string is not actually "plain text" -- it is a convention for
representing a string which makes use of 0x00 as a "syntactic" character
to terminate the string without counting for its length. And that was
already true back in 7-bit ASCII days, of course. People's workaround,
if they need to represent NULL *in* character data in a "string" in a C
program, was to simply use char arrays, manage length external to the
"string" stored in the array, and then avoid the regular C string
runtime library calls when manipulating them, because those depend on
0x00 as a signal of string termination.

Such cases need not be limited to test cases. One can envision real
cases, as for example, packing a data store full of null-terminated
strings and then wanting to manipulate that entire data store as a
chunk. It is, of course, full of NULL bytes for the null-terminated
strings. But the answer, of course, is to just keep track of the size of
the entire data store and use memcpy() instead of strcpy(). I've had to
deal with precisely such cases in real production code.

Now fast forward to Unicode and UTF-8. U+0000 is a Unicode character,
but in UTF-8 it is, of course, represented as a single 0x00 code unit.
And for the ASCII subset of Unicode, you cannot even tell the difference
-- it is precisely identical, as far as C strings and their manipulation
are concerned. Which was precisely my point:

7-bit ASCII: One cannot represent NULL (0x00) as part of the content of
a C string. Resort to char arrays.

Unicode UTF-8: One cannot represent U+0000 NULL (0x00) as part of the
content of a C string. Resort to char arrays.

The convention of using non-shortest UTF-8 to represent embedded NULLs
in C strings was simply a non-interoperable hack that people tried
because they fervently believed that NULLs *should* be embeddable in C
strings, after all. The UTC put a spike in that one by ruling that
non-shortest UTF-8 was ill-formed for any purpose.

This whole issue has been a permanent confusion for C programmers, I
think, largely because C is so loosey goosey about pointers, where a
pointer is really just an index register wolf in sheep's clothing. With
a char* pointer in hand, one cannot really tell whether it is referring
to an actual C string following the null-termination convention, or a
char array full of characters interpreted as a "string", but without
null termination, or a char array full of arbitrary byte values meaning
anything.
And from that source flow thousands upon thousands of C program bugs. :(

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com  Sat Jun 20 10:53:26 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 16:53:26 +0100
Subject: EBCDIC control characters
In-Reply-To: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: <20200620165326.67dfbee5@JRWUBU2>

On Sat, 20 Jun 2020 07:45:45 -0700
Ken Whistler via Unicode wrote:

> Richard is making the purist point that U+0000 is a Unicode
> character, and therefore should be transmissible as part of any
> Unicode plain text stream.

Prompted by the pain of Unicode test files with embedded nulls and
even embedded end of file. I could never work out why isolated UTF-16
code units should be handled, but there was no need to handle isolated
UTF-8 code units.

> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
> of a C string. Resort to char arrays.

Actually, you can. As the size of char is at least 8 bits, you have
128 spare codes. :-)

Richard.

From harjitmoe at outlook.com  Sat Jun 20 11:43:26 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sat, 20 Jun 2020 17:43:26 +0100
Subject: EBCDIC control characters
In-Reply-To: <20200620165326.67dfbee5@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
 <20200620165326.67dfbee5@JRWUBU2>
Message-ID: 

Richard Wordingham via Unicode wrote:
> Prompted by the pain of Unicode test files with embedded nulls and
> even embedded end of file.

Embedded nulls might indeed be used disruptively in user-submitted
content (to induce truncation or, if the nulls are removed or ignored by
something downstream of a sanitisation step, even to mask malicious
sequences). In such applications, there may be a need to deal with them
somehow (even if that is simply replacing U+0000 instances with U+FFFD,
as stipulated in the spec for e.g. CommonMark). But so long as it can
accurately output the string and its length in code units, it's not
really the decoder's job to sort this out.

> I could never work out why isolated UTF-16 code units should be
> handled, but there was no need to handle isolated UTF-8 code units.

Depends on the context you are working in. Python's PEP 383
( https://www.python.org/dev/peps/pep-0383/ ) does define a scheme for
passing isolated 8-bit code units through a decoder and encoder
unchanged, actually in much the same way as tends to be done for UTF-16,
i.e. passing around isolated surrogate codes. This is not the default
behaviour, but it arose as a solution to the problem of handling
potentially invalid data in Unix filenames (similar to the issue of
potentially invalid UTF-16 data in Windows filenames).

-- Har

>> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
>> of a C string. Resort to char arrays.
> Actually, you can. As the size of char is at least 8 bits, you have
> 128 spare codes. :-)
>
> Richard.
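For readers unfamiliar with PEP 383, here is the gist of its
"surrogateescape" error handler transplanted into C++ terms. This is a
deliberately simplified sketch with invented function names -- it treats
every byte above 0x7F as invalid instead of first attempting a real
UTF-8 decode, which real code would do:

    #include <string>

    std::u32string decode_with_escape(const std::string& bytes) {
        std::u32string out;
        for (unsigned char b : bytes) {
            if (b < 0x80)
                out += static_cast<char32_t>(b);           // plain ASCII
            else
                out += static_cast<char32_t>(0xDC00 + b);  // smuggle the raw
                                                           // byte as a lone
                                                           // low surrogate
        }
        return out;
    }

    std::string encode_with_escape(const std::u32string& text) {
        std::string out;
        for (char32_t cp : text) {
            if (cp >= 0xDC80 && cp <= 0xDCFF)
                out += static_cast<char>(cp - 0xDC00);     // restore the byte
            else if (cp < 0x80)
                out += static_cast<char>(cp);
            // non-ASCII scalar values would be UTF-8-encoded here
        }
        return out;
    }

Because U+DC80..U+DCFF are unpaired low surrogates, which cannot occur
in well-formed Unicode text, the smuggled bytes can always be told apart
from genuinely decoded characters and restored on encoding.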
From corentin.jabot at gmail.com  Sat Jun 20 12:00:51 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 19:00:51 +0200
Subject: EBCDIC control characters
In-Reply-To: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: 

On Sat, 20 Jun 2020 at 16:45, Ken Whistler wrote:

> On 6/20/2020 5:11 AM, Corentin via Unicode wrote:
>
> > My guess is that Ken is alluding to not storing arbitrary text in
> > strings, but rather in arrays of code units along with appropriate
> > length parameters.
> >
> > Oh, yes, I see, thanks.
> > It's a special case of "null-terminated strings were a mistake".
> > But U+0000 has no other use or alternative semantics, right? The
> > main use case would be test cases?
>
> Yes, that was basically what I was alluding to.
>
> Richard is making the purist point that U+0000 is a Unicode character,
> and therefore should be transmissible as part of any Unicode plain
> text stream.
>
> But the C string is not actually "plain text" -- it is a convention
> for representing a string which makes use of 0x00 as a "syntactic"
> character to terminate the string without counting for its length. And
> that was already true back in 7-bit ASCII days, of course. People's
> workaround, if they need to represent NULL *in* character data in a
> "string" in a C program, was to simply use char arrays, manage length
> external to the "string" stored in the array, and then avoid the
> regular C string runtime library calls when manipulating them, because
> those depend on 0x00 as a signal of string termination.
>
> Such cases need not be limited to test cases. One can envision real
> cases, as for example, packing a data store full of null-terminated
> strings and then wanting to manipulate that entire data store as a
> chunk. It is, of course, full of NULL bytes for the null-terminated
> strings. But the answer, of course, is to just keep track of the size
> of the entire data store and use memcpy() instead of strcpy(). I've
> had to deal with precisely such cases in real production code.
>
> Now fast forward to Unicode and UTF-8. U+0000 is a Unicode character,
> but in UTF-8 it is, of course, represented as a single 0x00 code unit.
> And for the ASCII subset of Unicode, you cannot even tell the
> difference -- it is precisely identical, as far as C strings and their
> manipulation are concerned. Which was precisely my point:
>
> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
> of a C string. Resort to char arrays.
>
> Unicode UTF-8: One cannot represent U+0000 NULL (0x00) as part of the
> content of a C string. Resort to char arrays.
>
> The convention of using non-shortest UTF-8 to represent embedded NULLs
> in C strings was simply a non-interoperable hack that people tried
> because they fervently believed that NULLs *should* be embeddable in C
> strings, after all. The UTC put a spike in that one by ruling that
> non-shortest UTF-8 was ill-formed for any purpose.
>
> This whole issue has been a permanent confusion for C programmers, I
> think, largely because C is so loosey goosey about pointers, where a
> pointer is really just an index register wolf in sheep's clothing.
> With a char* pointer in hand, one cannot really tell whether it is
> referring to an actual C string following the null-termination
> convention, or a char array full of characters interpreted as a
> "string", but without null termination, or a char array full of
> arbitrary byte values meaning anything. And from that source flow
> thousands upon thousands of C program bugs. :(

To be super pedantic, strings *are* arrays, but they decay to pointers
really easily, at which point the only way to know their size is to look
for 0x0, which made sense at one point in 1964 - if you never use strlen
you are fine. In fact, it is common for people to use multiple nulls as
string delimiters within a larger array.

> --Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com  Sat Jun 20 12:34:33 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 18:34:33 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: <20200620183433.1c42f6ef@JRWUBU2>

On Sat, 20 Jun 2020 19:00:51 +0200
Corentin via Unicode wrote:

> To be super pedantic, strings *are* arrays, but they decay to pointers
> really easily, at which point the only way to know their size is to
> look for 0x0, which made sense at one point in 1964 - if you never
> use strlen you are fine. In fact, it is common for people to use
> multiple nulls as string delimiters within a larger array.

I think almost all the functions in string.h go wrong if you want to
treat NUL as an ordinary character. strncpy(), strncmp() and strncat()
certainly do. Inserting NUL into a C string chops it up into multiple
C strings.

Richard.

From haberg-1 at telia.com  Sat Jun 20 14:34:38 2020
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Sat, 20 Jun 2020 21:34:38 +0200
Subject: EBCDIC control characters
In-Reply-To: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: 

> On 20 Jun 2020, at 16:45, Ken Whistler via Unicode wrote:
>
> This whole issue has been a permanent confusion for C programmers, I
> think, largely because C is so loosey goosey about pointers, where a
> pointer is really just an index register wolf in sheep's clothing.
> With a char* pointer in hand, one cannot really tell whether it is
> referring to an actual C string following the null-termination
> convention, or a char array full of characters interpreted as a
> "string", but without null termination, or a char array full of
> arbitrary byte values meaning anything. And from that source flow
> thousands upon thousands of C program bugs. :(

The distinction can conveniently be indicated in C++, as in the example
at the bottom of [1]: "abc\0\0def" converts as a C string which
truncates at the first \0, whereas "abc\0\0def"s converts to a
std::string that keeps track of the full length.

1. https://en.cppreference.com/w/cpp/string/basic_string/operator%22%22s
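Hans's ""s literals, and the ""sv literals Markus Scherer mentions below,
can be seen side by side in a short sketch. The literal "abc\0\0def" is
the example from the cppreference page cited above; this requires C++17:

    #include <cassert>
    #include <string>
    #include <string_view>

    int main() {
        using namespace std::literals;   // brings in both ""s and ""sv

        auto p  = "abc\0\0def";          // const char*: length is a lie
        auto s  = "abc\0\0def"s;         // std::string (C++14)
        auto sv = "abc\0\0def"sv;        // std::string_view (C++17)

        assert(std::char_traits<char>::length(p) == 3);  // stops at NUL
        assert(s.size() == 8 && sv.size() == 8);         // all 8 code units
    }

The literal operators take a (pointer, length) pair determined at compile
time, which is why the embedded NULs survive.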
From tom at honermann.net  Sat Jun 20 15:36:01 2020
From: tom at honermann.net (Tom Honermann)
Date: Sat, 20 Jun 2020 16:36:01 -0400
Subject: UAX 31 for C++ Identifiers
In-Reply-To: <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com>
References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
 <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com>
Message-ID: 

On 6/20/20 2:44 AM, Asmus Freytag (c) via Unicode wrote:
> My meta point had been about possibly different levels of security
> issues between compile time and runtime.
> A./

When you mentioned "modules", were you referring to C++20 modules? If
so, there may be some confusion; C++20 modules is a compile-time feature
with no run-time component.

Tom.

>
> On 6/19/2020 8:22 PM, Steve Downey wrote:
>> On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode
>> wrote:
>>> In source code, having ambiguous identifiers may not be worse than
>>> C-style obfuscation.
>>>
>> Until recently (the last release 10.1), gcc rejected much of allowed
>> unicode in UTF-8 input, even in places it would allow \u
>> universal-character-names. So this all becomes easier now. As a
>> Standard, we should have handled this better earlier, but the second
>> best time is now. The XID_ properties make this a lot more palatable
>> w.r.t. stability, though, and I'm not going to second guess people 10
>> or 20 or more years ago, too much. Ambiguity in external identifiers
>> is already ill-formed no diagnostic required, which means broken but
>> in ways that compilers can't treat as undefined.
>>
>>> But with module names, etc. you may run into security issues if
>>> naming allows / facilitates spoofing.
>>>
>> I, and other people doing tools, both won and lost this battle
>> already. Module names in source do not correspond with anything
>> physical. `import some.module` connects you to whatever exported
>> `some.module` by magic as far as the standard is concerned. We're
>> working on the actual mechanics as a Technical Report, and compiler
>> vendors are participating and aren't, as far as I can tell, more
>> insane than the average infrastructure engineer. So I have hope.
>>
>> Mapping anything to file paths is fraught beyond belief, and there are
>> many experienced engineers providing war stories and parades of
>> horribles, although I'd personally like to have more stories to tell.
>>
>> The entire disconnect between logical and physical actually is
>> hopeful, in a way that `#include ` isn't. Even though
>> we have a lot of understanding of how that maps to filesystem
>> searches.
>>
>> Province of wg21/sg15, which I also participate in.
>>
>> I suspect that trying to fix up anything with #include is infeasible
>> since it's currently the wild west, changes will break, and C++
>> depends in practice on system provided headers that at best conform to
>> old C standards.
>>
>> Thanks!
>>
>> -SMD

From markus.icu at gmail.com  Sat Jun 20 15:47:59 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 20 Jun 2020 13:47:59 -0700
Subject: EBCDIC control characters
In-Reply-To: <20200620183433.1c42f6ef@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
 <20200620183433.1c42f6ef@JRWUBU2>
Message-ID: 

On Sat, Jun 20, 2020 at 10:40 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> I think almost all the functions in string.h go wrong if you want to
> treat NUL as an ordinary character. strncpy(), strncmp() and strncat()
> certainly do. Inserting NUL into a C string chops it up into multiple
> C strings.

Right, you need to carry ptr+length and use memcpy() etc. Or use
std::string in C++.

Since C++17 we can use std::string_view as *the* type for input strings
that are not to be modified. Inspired by the email from Hans, I looked,
and there is "syntax"sv for string_view literals.
https://en.cppreference.com/w/cpp/string/basic_string_view
https://en.cppreference.com/w/cpp/string/basic_string_view/operator%22%22sv

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jameskasskrv at gmail.com  Sat Jun 20 18:26:19 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 23:26:19 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: <61e66815-dfea-d64e-7dbe-7989255970eb@gmail.com>

On 2020-06-20 6:30 AM, abrahamgross--- via Unicode wrote:
> If epigraphers and numismaticians have the need for overstriking in
> plain text, isn't that reason enough to encode it? Unicode encoded
> many completely extinct scripts* and extinct characters in existing
> scripts, so adding the overstrike doesn't seem like a stretch at all.

Epigraphers and numismatists indeed preserve and exchange information
about overstriking. But they have existing conventions for doing so
which apparently serve them well. Similar arguments were made against
the encoding of ancient scripts. The scholars could just go on happily
transliterating and transcribing their ancient texts. When contact was
established with various user groups, some folks said they would
continue using transliteration. But other folks said they would welcome
and embrace the ability to store and exchange data in the actual
original scripts. Encoding the ancient scripts did no harm; the scholars
preferring transliteration could keep on transliterating. Ancient script
encoding opened up new vistas for those who welcomed it, and I think
this is especially true for undeciphered scripts.

So anyone seriously considering floating a proposal for an overstrike
mechanism in Unicode would be well advised to establish contact with
potential users to determine whether such a mechanism would see any
actual use.
From doug at ewellic.org  Sat Jun 20 19:44:39 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 20 Jun 2020 18:44:39 -0600
Subject: OverStrike control character
Message-ID: <006701d64765$25b2c470$71184d50$@ewellic.org>

James Kass wrote:

> So anyone seriously considering floating a proposal for an overstrike
> mechanism in Unicode would be well advised to establish contact with
> potential users to determine whether such a mechanism would see any
> actual use.

When we proposed the "bunch of characters encoded to support old 8-bit
machines" that David Starner referred to, being able to cite assurance
from actual end users that they would use these characters was not just
a good idea; it was essential to getting the characters encoded. (Since
then, we have learned of many times more users who plan to use them, or
are already using them, than we knew about at the time.)

Neither Yeoman nor any other coin catalog would ever intentionally
print, say, an 8 over a 7 in a listing for an overdate. They might do so
in rich text (i.e. a published book) to illustrate, for novice
collectors, what is meant by an overdate; but that could be done just as
well, and usually is, with a greatly magnified picture of the coin in
question. So I don't think it can be said that numismatists have a
"need" for overstriking in plain text.

It seems the only serious use case for this character (as opposed to "it
would be fun" or "it would be possible" or "Unicode has lots of empty
code points, and look at the stuff they've already encoded") is that
people could make up their own characters, so long as they consisted of
two or more existing glyphs, one overstruck on the other, and they would
have a non-PUA Unicode representation. Is that about the size of it?

-- 
Doug Ewell | Thornton, CO, US | ewellic.org

From abrahamgross at disroot.org  Sat Jun 20 20:31:59 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sun, 21 Jun 2020 01:31:59 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <006701d64765$25b2c470$71184d50$@ewellic.org>
References: <006701d64765$25b2c470$71184d50$@ewellic.org>
Message-ID: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org>

Basically, yes.
Unicode has plenty of basic geometric shapes throughout that can be
utilized to build interchangeable (and non-PUA) characters. (If
Classical Yi ever gets accepted, then you'll be able to use just about
any shape out there for your overstriking needs (the proposal lists over
88k new chars!))

2020/06/20 ??8:45:18 Doug Ewell via Unicode :

> It seems the only serious use case for this character (as opposed to
> "it would be fun" or "it would be possible" or "Unicode has lots of
> empty code points, and look at the stuff they've already encoded") is
> that people could make up their own characters, so long as they
> consisted of two or more existing glyphs, one overstruck on the other,
> and they would have a non-PUA Unicode representation. Is that about
> the size of it?

From mark at kli.org  Sat Jun 20 20:37:32 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Sat, 20 Jun 2020 21:37:32 -0400
Subject: OverStrike control character
In-Reply-To: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org>
References: <006701d64765$25b2c470$71184d50$@ewellic.org>
 <8025f806-835e-4422-8c63-60be10a4059b@disroot.org>
Message-ID: <59e1f2b3-17e7-99a9-ccc2-5eb1ab75beef@kli.org>

On 6/20/20 9:31 PM, abrahamgross--- via Unicode wrote:
> Basically, yes.
From asmusf at ix.netcom.com Sat Jun 20 23:38:49 2020 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Sat, 20 Jun 2020 21:38:49 -0700 Subject: UAX 31 for C++ Identifiers In-Reply-To: References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com> <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com> Message-ID: <84ee1ede-30bf-f474-0cbc-03469043bea4@ix.netcom.com> On 6/20/2020 1:36 PM, Tom Honermann wrote: > On 6/20/20 2:44 AM, Asmus Freytag (c) via Unicode wrote: >> My meta point had been about possibly different levels security >> issues between compile time and runtime. >> A./ > > When you mentioned "modules", were you referring to C++20 modules?? If > so, there may be some confusion; C++20 modules is a compile-time > feature with no run-time component. > > Tom. I had been thinking of interfaces to the various OSs, like dynamically linked libraries, etc. that are usually named with identifiers of some sort. Although to the language proper, these may just be strings, of course. A./ > >> >> On 6/19/2020 8:22 PM, Steve Downey wrote: >>> On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode >>> ? wrote: >>>> In source code, having ambiguous identifiers may not be worse than >>>> C-style obfuscation. >>>> >>> Until recently (the last release 10.1), gcc rejected much of allowed >>> unicode in UTF-8 input, even in places it would allow \u >>> universal-character-names. So this all becomes easier now. As a >>> Standard, we should have handled this better earlier, but the second >>> best time is now. The XID_ properties make this a lot more palatable >>> w.r.t. stability, though, and I'm not going to second guess people 10 >>> or 20 or more years ago, too much. Ambiguity in external identifiers >>> is already ill-formed no diagnostic required, which means broken but >>> in ways that compilers can't treat as undefined. >>> >>>> But with module names, etc. you may run into security issues if >>>> naming allows / facilitates spoofing. >>>> >>> I, and other people doing tools, both won and lost this battle >>> already. Module names in source do not correspond with anything >>> physical. `import some.module` connects you to whatever exported >>> `some.module` by magic as far as the standard is concerned. We're >>> working on the actual mechanics as a Technical Report, and compiler >>> vendors are participating and aren't, as far as I can tell, more >>> insane than the average infrastructure engineer. So I have hope. >>> >>> Mapping anything to file paths is fraught beyond belief, and there are >>> many experienced engineers providing war stories and parades of >>> horribles, although I'd personally like to have more stories to tell. >>> >>> The entire disconnect between logical and physical actually is >>> hopeful, in a way that `#include ` isn't. Even though >>> we have a lot of understanding of how that maps to filesystem >>> searches. >>> >>> Province of wg21/sg15 , which I also participate in. >>> >>> I suspect that trying to fix up anything with #include is infeasible >>> since it's currently the wild west, changes will break, and C++ >>> depends in practice on system provided headers that at best conform to >>> old C standards. >>> >>> Thanks! >>> >>> -SMD >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrewcwest at gmail.com Sun Jun 21 05:36:43 2020 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 21 Jun 2020 11:36:43 +0100 Subject: OverStrike control character In-Reply-To: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> Message-ID: On Sun, 21 Jun 2020 at 02:33, abrahamgross--- via Unicode wrote: > > Basically, yes. unicode has plenty of basic geometric shapes throughout that can be utilized to build interchangeable (and non-PUA) characters. (if Classical Yi ever get accepted, then youll be able to use just about any shape out there for your overstriking needs (the proposal lists over 88k new chars!)) There is no such thing as "Classical Yi" (i.e. a single language/script with a well-known and well-studied corpus of literary texts) -- if there was we would have encoded it ten years ago. The proposal you refer to just lists an unorganized and unified list of glyph forms used in multiple different Yi script traditions and manuscript sources. It is not even a starting point for a proper encoding proposal. In fact there are likely to be several separate proposals for additional Yi scripts representing separate regional traditions, each comprising about a thousand characters or so. See for example my listing of 1,389 characters used in the Sani Yi script: https://www.babelstone.co.uk/Yi/Sani_list.html Andrew From mark at kli.org Sun Jun 21 18:47:34 2020 From: mark at kli.org (Mark E. Shoulson) Date: Sun, 21 Jun 2020 19:47:34 -0400 Subject: OverStrike control character In-Reply-To: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> Message-ID: On 6/20/20 9:31 PM, abrahamgross--- via Unicode wrote: > Basically, yes. unicode has plenty of basic geometric shapes throughout that can be utilized to build interchangeable (and non-PUA) characters. (if Classical Yi ever get accepted, then youll be able to use just about any shape out there for your overstriking needs (the proposal lists over 88k new chars!)) Essentially "painting" with characters.? Which wouldn't work in a consistent fashion (you point out yourself it won't render things the same if you use different fonts) and would be MUCH more complicated to use than just encoding some vector drawing language with Unicode code-points (which has been suggested, and has its own raft of issues).? Which set of Yi characters will paint just the picture of George Washington that I want...?? Easier to paint it! You had previously said >> What about overstriking a LTR character with a RTL one, or vice-versa? Which way does the text go after that? >> > The text after that goes in the direction of the text afterwards. So for ?L??????? its gonna look like ?[L???]?????? and for ?L??ab? its gonna look like ?[L?]ab?. Meaning that only the very next letter gets overstruck, and anything afterwards continues on like it would normally. It's not about what happens if you put a strong LTR or RTL character afterwards.? Those always carry their own directionality!? Read up on the Unicode Bidi algorithm.? The direction of a stream of text is stateful, and some characters adapt themselves to what the current directionality is.? If I have A??, what state does that leave things in?? Is it the same or different from ??A? ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kent.b.karlsson at bahnhof.se Sun Jun 21 19:15:22 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Mon, 22 Jun 2020 02:15:22 +0200 Subject: OverStrike control character In-Reply-To: <26abf033-3f9c-dabd-ed99-19337db77a69@gmail.com> References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> <59e1f2b3-17e7-99a9-ccc2-5eb1ab75beef@kli.org> <230194d6-3a7d-4f5c-8907-ade211f8929c@disroot.org> <26abf033-3f9c-dabd-ed99-19337db77a69@gmail.com> Message-ID: <2871D945-971F-4D8B-AA00-DD4ECE57215F@bahnhof.se> > 21 juni 2020 kl. 05:34 skrev James Kass via Unicode : > [?] the recent addition to Unicode mentioned by David Starner and Doug Ewell, "214 graphic characters that provide compatibility with various home computers from the mid-1970s to the mid-1980s and with early teletext broadcasting standards?. Note, however, that Teletext is not something obsolete in any way. It does still use charsets that have (otherwise) grown obsolete, e.g. several ?national variants? of ISO 646, but also cover Greek, Arabic and Hebrew, with the charset used communicated in the Teletext protocol. But Teletext is still supported in every TV set (and ?TV box?) sold the last few decades in at least Europe and likely other parts of the world as well. Using Teletext for news service is very much on the decline, that is true. But Teletext is still often used for optional subtitling. See https://www.etsi.org/deliver/etsi_en/300700_300799/300706/01.02.01_60/en_300706v010201p.pdf (from 2003) for the standard describing it. There are (now) apps for mobile phones showing Teletext pages from certain TV channels (some still offer news services via Teletext); I find apps for SVT (Swedish public television) and Danish TV, and one that offers Teletext pages from several countries. These apps must convert to Unicode (at some point, since that is what is used for mobile phone apps?), or use graphics... As well as web pages showing current Teletext data, e.g. https://www.svt.se/svttext/web/pages/100.html , https://www.nrk.no/tekst-tv/100/ , https://www.dr.dk/cgi-bin/fttx1.exe/100 , https://www.rtve.es/television/teletexto/100/ . (These, and a few more, still have news services via Teletext.) /Kent Karlsson -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Jun 21 22:42:37 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 22 Jun 2020 04:42:37 +0100 Subject: OverStrike control character In-Reply-To: References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> Message-ID: <20200622044237.422ccfa6@JRWUBU2> On Sun, 21 Jun 2020 19:47:34 -0400 "Mark E. Shoulson via Unicode" wrote: > It's not about what happens if you put a strong LTR or RTL character > afterwards.? Those always carry their own directionality!? Read up on > the Unicode Bidi algorithm.? The direction of a stream of text is > stateful, and some characters adapt themselves to what the current > directionality is.? If I have A??, what state does that leave things > in?? Is it the same or different from ??A? OTL can't handled mixed script text. Interracting characters have to wind up in the same script run. Richard. From doug at ewellic.org Mon Jun 22 13:34:58 2020 From: doug at ewellic.org (Doug Ewell) Date: Mon, 22 Jun 2020 12:34:58 -0600 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? 
In-Reply-To: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> Message-ID: <000001d648c3$d5bcc870$81365950$@ewellic.org> So, does that mean you don't think L2/18-206 will fly? ? ? -- Doug Ewell | Thornton, CO, US | ewellic.org ? ? -----Original Message----- From: Unicore On Behalf Of John Hudson via Unicore Sent: Monday, June 22, 2020 10:48 To: unicore at unicode.org Subject: Re: What is the current Unicode stance on subscripts and superscripts for mathematical use? ? Math layout and display needs to be able to handle essentially arbitrary super- and subcript characters, and to do so at multiple levels of ?script embedding, e.g. subscripts of superscripts. This requires specialised fonts as well as specialised layout engines. The method we use in OpenType math fonts, i.e. fonts containing a MATH table with extensive scaling and alignment data to be used by math layout engines, is to have variant full-size glyphs that are then scaled down to the superscript, subscript, superscriptscript, etc. sizes and positioned according to MATH table data and tolerances within the layout engines (e.g. some environments may allow for more vertically compressed positioning for inline equations). ? The set of variant glyphs provided for scaling to ?script and ?scriptscript size will vary depending on the font. If a font does not contain such variants for a given character, the layout engine will apply scaling (as defined in the MATH table) to the default glyph for that character. ? We're in the process of extending the STIX Two Math font with a large number of additional variants for ?script and ?scriptscript use, based on frequency analysis from the American Mathematical Society and other members of the STI Pub consortium. Latin and Greek letters are definitely among the most frequent superscript and subscript typeforms, but the overall list is very much larger, includes multiple styles of letters, as well as a variety of symbols and operators. ? J. ? -- ? John Hudson Tiro Typeworks Ltd www.tiro.com Salish Sea, BC tiro at tiro.com ? NOTE: In the interests of productivity, I am currently dealing with email on only two days per week, usually Monday and Thursday unless this schedule is disrupted by travel. If you need to contact me urgently, please use some other method of communication. Thank you. ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 22 13:45:16 2020 From: doug at ewellic.org (Doug Ewell) Date: Mon, 22 Jun 2020 12:45:16 -0600 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> Message-ID: <000501d648c5$46100320$d2300960$@ewellic.org> Sorry, sent to wrong list. -- Doug Ewell | Thornton, CO, US | ewellic.org From john at tiro.ca Mon Jun 22 16:10:32 2020 From: john at tiro.ca (John Hudson) Date: Mon, 22 Jun 2020 14:10:32 -0700 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <000001d648c3$d5bcc870$81365950$@ewellic.org> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000001d648c3$d5bcc870$81365950$@ewellic.org> Message-ID: On 22062020 11:34 am, Doug Ewell wrote: > So, does that mean you don't think L2/18-206 will fly? Has it shown any signs of flying in the past two years? or am I being trolled? 
:) I'll bite: That document is targeting issues in general typographic display variants and muddies the character/glyph distinction. Most of what it calls for are clear cases of typographic glyph processing, e.g. smallcaps as variants of uppercase characters. In that respect, it at once goes too far in calling for smallcap encoding for a large number of existing uppercase characters and not nearly far enough in ignoring vast numbers of existing characters outside the small European subset identified in the document. The author seems also not to understand that existing 'small capitals' in Unicode are not typographic smallcap variants but distinct letters in some phonetic notation systems. The author is not wrong to point out that the existence of some super- and subscript characters in Unicode doesn't always play well with font and algorithmic display of additional characters with super- and subscript styling: size, weight, and alignments can vary, depending on the path from the encoded characters to the styled display, how well the font has been made, and what algorithms are used. But these problems are not solved by encoding a bunch of additional super- and subscript characters. The problems may be pushed further out ? at least for European users of the Latin script ? but not solved. Mathematical notation is a different case: a specialised writing system in which style, size, and relative position all have semantic meaning. It needs a different model for both encoding and layout than typical language text and typography. J. -- John Hudson Tiro Typeworks Ltd www.tiro.com Salish Sea, BC tiro at tiro.com NOTE: In the interests of productivity, I am currently dealing with email on only two days per week, usually Monday and Thursday unless this schedule is disrupted by travel. If you need to contact me urgently, please use some other method of communication. Thank you. From haberg-1 at telia.com Mon Jun 22 16:22:46 2020 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 22 Jun 2020 23:22:46 +0200 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <000501d648c5$46100320$d2300960$@ewellic.org> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> Message-ID: <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> > On 22 Jun 2020, at 20:45, Doug Ewell via Unicode wrote: > > Sorry, sent to wrong list. I use, for text file input (plain UTF-8), Unicode subscript and superscript parentheses as subscript and superscript delimiters, like in ????, ????. It would be nice to have such subscript and superscript delimiters that provide the corresponding rendering in the text editor. From marius.spix at web.de Mon Jun 22 18:09:39 2020 From: marius.spix at web.de (Marius Spix) Date: Tue, 23 Jun 2020 01:09:39 +0200 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> Message-ID: <20200623010928.3a0d00fb@spixxi> This can already be done by rich text. Unicode includes some superscript characters like ?, ? or ? (the degree sign is a superscript version of the white circle U+25CB) for compatibility with legacy character sets and phonetic transcriptions (in some languages the tone is important). 
Unicode?s superscript characters are very limited and nested superscript is not supported at all. For some units and common chemical terms which often appear in plain text in non-scientific contexts (like m?, kg?m/s?, ?C, CO? or Na?) the Unicode superscript characters are sufficient, however. On Mon, 22 Jun 2020 23:22:46 +0200 Hans ?berg via Unicode wrote: > > > On 22 Jun 2020, at 20:45, Doug Ewell via Unicode > > wrote: > > > > Sorry, sent to wrong list. > > I use, for text file input (plain UTF-8), Unicode subscript and > superscript parentheses as subscript and superscript delimiters, like > in ????, ????. It would be nice to have such subscript and > superscript delimiters that provide the corresponding rendering in > the text editor. > > > From richard.wordingham at ntlworld.com Mon Jun 22 19:44:56 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 23 Jun 2020 01:44:56 +0100 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <20200623010928.3a0d00fb@spixxi> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> <20200623010928.3a0d00fb@spixxi> Message-ID: <20200623014456.6413cee6@JRWUBU2> On Tue, 23 Jun 2020 01:09:39 +0200 Marius Spix via Unicode wrote: > This can already be done by rich text. Unicode includes some > superscript characters like ?, ? or ? (the degree sign is a > superscript version of the white circle U+25CB) for compatibility > with legacy character sets and phonetic transcriptions (in some > languages the tone is important). But still there can be annoying gaps. Tai-Kadai is reconstructed to have four tones, conventionally called A, B, C and D, and it is very convenient to use the corresponding superscript letters to label the tone on the syllables. However, one of those capitals is missing, and so in Wiktionary they have to make do with a superscript lower letter for that one! I don't think there's confidence in the reconstruction of the original tones - and its possible that the tone inducers, not the tones themselves, go back to the proto-language. Richard. From haberg-1 at telia.com Tue Jun 23 03:36:53 2020 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Tue, 23 Jun 2020 10:36:53 +0200 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <20200623010928.3a0d00fb@spixxi> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> <20200623010928.3a0d00fb@spixxi> Message-ID: <5C2214FF-A9A4-44A9-8E73-02BA8080CC0C@telia.com> Indeed, and in the days of ASCII, one felt that all characters could be encoded with 7-bit bytes. But that is not really legible without some post-processing. I use it in a program that uses plain text as input, which turns out to be very convenient, apart from that superscripts and subscripts might look better. > On 23 Jun 2020, at 01:09, Marius Spix wrote: > > This can already be done by rich text. Unicode includes some > superscript characters like ?, ? or ? (the degree sign is a superscript > version of the white circle U+25CB) for compatibility with legacy > character sets and phonetic transcriptions (in some languages the tone > is important). Unicode?s superscript characters are very limited and > nested superscript is not supported at all. 
From richard.wordingham at ntlworld.com  Tue Jun 23 06:54:57 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 23 Jun 2020 12:54:57 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
Message-ID: <20200623125457.421435ce@JRWUBU2>

The modern Khmer language does not make use of a COENG DA distinct from COENG TA. The normal practice is to render them the same, with a recommendation from Unicode that the choice be based on the sound the subscript represents. At least, there was such a recommendation; I tried to find it again, but failed. The visual distinction faded out in the 1920s according to Antelme.

Now, the Khmer script is not just used for modern languages of Cambodia. It is used for transcribing Old Khmer (for words, at least), and was the religious script of most of Thailand until the 19th century, and was also the secular script in southern Thailand. In these usages, COENG TA and COENG DA are distinct, or at least, TA and DA have distinct subscripts that are clearly associated with them.

Is it legitimate for a font to deliberately render the corresponding named sequences differently while claiming to respect characters' character identities? I thought it obviously was, but I received a demurral when I asked about the best way to request an arbitrary OpenType font to make the distinction. (I expect the overwhelming majority would refuse to make the distinction.) I am therefore asking here for advice on the legitimacy of such a request. Conceivably we need a new character to make the distinction.

Richard.
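For concreteness: the distinction at issue lives entirely in the code point sequences; whether they look different is up to the font. A small Python sketch of that, standard library only:

    import unicodedata

    COENG, TA, DA = "\u17d2", "\u178f", "\u178a"  # KHMER SIGN COENG, LETTER TA, LETTER DA
    coeng_ta = COENG + TA
    coeng_da = COENG + DA

    print(coeng_ta == coeng_da)                                # False: distinct in plain text
    print(unicodedata.normalize("NFC", coeng_da) == coeng_da)  # True: normalization keeps them apart
    print([unicodedata.name(c) for c in coeng_da])

So searching, collation, and diffing all see two spellings; only rendering merges them, which is what makes the font question contentious.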
From asmusf at ix.netcom.com  Tue Jun 23 17:50:27 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 23 Jun 2020 15:50:27 -0700
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200623125457.421435ce@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2>
Message-ID:

An HTML attachment was scrubbed...
URL:

From duerst at it.aoyama.ac.jp  Tue Jun 23 18:29:38 2020
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=)
Date: Wed, 24 Jun 2020 08:29:38 +0900
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200623125457.421435ce@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2>
Message-ID:

Hello Richard,

I'm not an expert on OpenType or Khmer (except for having been on the side of separately encoding subscript letters in Unicode list discussions in the 1990s), but a few comments and questions below.

On 23/06/2020 20:54, Richard Wordingham via Unicode wrote:
> The modern Khmer language does not make use of a COENG DA distinct from
> COENG TA. The normal practice is to render them the same, with a
> recommendation from Unicode that the choice be based on the sound the
> subscript represents. At least, there was such a recommendation; I
> tried to find it again, but failed. The visual distinction faded out
> in the 1920s according to Antelme.
>
> Now, the Khmer script is not just used for modern languages of
> Cambodia. It is used for transcribing Old Khmer (for words, at least),
> and was the religious script of most of Thailand until the 19th
> century, and was also the secular script in southern Thailand. In
> these usages, COENG TA and COENG DA are distinct, or at least, TA and DA
> have distinct subscripts that are clearly associated with them.
>
> Is it legitimate for a font to deliberately render the corresponding
> named sequences differently while claiming to respect characters'
> character identities?

A font for Old Khmer, ... would do that, wouldn't it? I couldn't see anything wrong with that.

> I thought it obviously was, but I received a
> demurral when I asked about the best way to request an arbitrary
> OpenType font to make the distinction.

A truly arbitrary (i.e. arbitrarily chosen) OpenType font probably wouldn't cover Khmer anyway, so it would be unable to even start to make this distinction.

> (I expect the overwhelming
> majority would refuse to make the distinction.)

The majority of fonts that actually cover modern Khmer might not include the relevant glyphs.

> I am therefore asking
> here for advice on the legitimacy of such a request.

I'm guessing that your request was either "How can I coerce a font covering modern Khmer to show different glyphs for COENG TA and COENG DA?" or "How can I create a font that will allow to show different glyphs for COENG TA and COENG DA?"

The reply to the former question is probably "you can't, because the font doesn't contain the necessary glyph". For the latter question, I think it should be possible, unless there's some OpenType stuff for Khmer that gets in the way.

> Conceivably we need
> a new character to make the distinction.

Do you mean you want to make the distinction in modern Khmer fonts? Would that be e.g. for words of Old Khmer that are cited in modern Khmer, or something similar?

Regards, Martin.

> Richard.

From richard.wordingham at ntlworld.com  Tue Jun 23 19:03:17 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 24 Jun 2020 01:03:17 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2>
Message-ID: <20200624010317.5f2f5310@JRWUBU2>

On Tue, 23 Jun 2020 15:50:27 -0700 Asmus Freytag via Unicode wrote:

> On 6/23/2020 4:54 AM, Richard Wordingham via Unicode wrote:
> The modern Khmer language does not make use of a COENG DA distinct
> from COENG TA.
> The normal practice is to render them the same, with a
> recommendation from Unicode that the choice be based on the sound the
> subscript represents. At least, there was such a recommendation; I
> tried to find it again, but failed. The visual distinction faded out
> in the 1920s according to Antelme.
>
> Now, the Khmer script is not just used for modern languages of
> Cambodia. It is used for transcribing Old Khmer (for words, at least),
> and was the religious script of most of Thailand until the 19th
> century, and was also the secular script in southern Thailand. In
> these usages, COENG TA and COENG DA are distinct, or at least, TA and
> DA have distinct subscripts that are clearly associated with them.
>
> Is it legitimate for a font to deliberately render the corresponding
> named sequences differently while claiming to respect characters'
> character identities? I thought it obviously was, but I received a
> demurral when I asked about the best way to request an arbitrary
> OpenType font to make the distinction. (I expect the overwhelming
> majority would refuse to make the distinction.) I am therefore asking
> here for advice on the legitimacy of such a request. Conceivably we
> need a new character to make the distinction.
>
> Richard.
>
> The recommendation you cite is a bit "common sense". I believe,
> without actual knowledge, that there are no "dt" or "td" combinations,
> only "dd" and "tt". In that case, a spell checker can help you
> pick the correct code for the subscript form.

That's a grammar rule; I'm not sure that spell checkers can exploit it. While Series one ('a') normally has /nt/ for the base consonant, there are or were (my source is Huffman) a few words with /nd/, and there are a few words that can be said either way (Durdin 2018, I think).

My immediate concern was the alternative Old Khmer spellings ???? and ????, which look identical in most fonts. However, I am told the Windows UI font Leelawadee UI distinguishes them, which could make it difficult to outlaw the deliberate distinction.

> Now, the identity of the characters is DA and TA (the COENG forms a
> sequence). Therefore, you don't violate the identity of DA and TA if
> you render their subscript forms distinct.

Don't multi-code-point named sequences get the same protection? COENG DA and COENG TA are named sequences.

> If you have a font that works that way, it may not be usable for
> modern Khmer (unless there's a language tag to select the
> behavior). That's a font issue.

I'm not sure that we can get an OpenType language tag for 19th-century Khmer. However, it seems that the feature tag 'hist' would be appropriate. One could try tagging as Southern Thai (ISO 639-3 sou), but that's another can of worms. Tagging as Sanskrit might work; I don't know enough about modern Khmer script Sanskrit.

Richard.
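Whether a given font even exposes such hooks can be checked mechanically. A sketch with the fontTools library (the font path is a placeholder, and which script, language, and feature tags turn up varies entirely by font):

    from fontTools.ttLib import TTFont

    font = TTFont("SomeKhmerFont.ttf")  # placeholder path, not a real font name
    if "GSUB" in font:
        gsub = font["GSUB"].table
        # Script systems and any registered language systems the font declares.
        for rec in gsub.ScriptList.ScriptRecord:
            langs = [ls.LangSysTag for ls in rec.Script.LangSysRecord]
            print(rec.ScriptTag, "language systems:", langs or "(default only)")
        # Feature tags; 'hist' would have to appear here to be requestable.
        print(sorted({f.FeatureTag for f in gsub.FeatureList.FeatureRecord}))

An application can only usefully request 'hist', or a particular language system, if the font advertises it in this table.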
From richard.wordingham at ntlworld.com  Tue Jun 23 19:20:29 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 24 Jun 2020 01:20:29 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2>
Message-ID: <20200624012029.45ddfe28@JRWUBU2>

On Wed, 24 Jun 2020 08:29:38 +0900 Martin J. Dürst wrote:

> > ... In these usages, COENG TA and COENG DA are distinct, or
> > at least, TA and DA have distinct subscripts that are clearly
> > associated with them.
> >
> > Is it legitimate for a font to deliberately render the corresponding
> > named sequences differently while claiming to respect characters'
> > character identities?
> > I am therefore asking
> > here for advice on the legitimacy of such a request.
>
> I'm guessing that your request was either "How can I coerce a font
> covering modern Khmer to show different glyphs for COENG TA and COENG
> DA?" or "How can I create a font that will allow to show different
> glyphs for COENG TA and COENG DA?"

The request would be made to the font by a combination of language and a setting of OpenType features.

> The reply to the former question is probably "you can't, because the
> font doesn't contain the necessary glyph". For the latter question, I
> think it should be possible, unless there's some OpenType stuff for
> Khmer that gets in the way.

The OpenType question was closer to: how do we make it easy to advise people how to use co-operative fonts, if they exist?

> > Conceivably we need
> > a new character to make the distinction.
>
> Do you mean you want to make the distinction in modern Khmer fonts?
> Would that be e.g. for words of Old Khmer that are cited in modern
> Khmer, or something similar?

Something similar. The application domain was Wiktionary. I suspect most people would be happier to see the words in a Modern Khmer style, but not necessarily a modern Modern Khmer style. The Angkorian styles are quite different from the modern styles, unreadably so without practice.

My Unicode question is also relevant for fonts displaying Unicode text in an Angkorian style. It seems that they do exist, but complying with TUS was probably low down on the authors' list of priorities.

Richard.

From asmusf at ix.netcom.com  Tue Jun 23 22:04:13 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 23 Jun 2020 20:04:13 -0700
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200624010317.5f2f5310@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2>
Message-ID: <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From jameskasskrv at gmail.com  Tue Jun 23 23:17:44 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 24 Jun 2020 04:17:44 +0000
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200624010317.5f2f5310@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2>
Message-ID:

On 2020-06-24 12:03 AM, Richard Wordingham via Unicode wrote:
> My immediate concern was the alternative Old Khmer spellings ???? and
> ????, which look identical in most fonts. However, I am told the
> Windows UI font Leelawadee UI distinguishes them, which could make it
> difficult to outlaw the deliberate distinction.

The Leelawadee font with Windows 7 covers Thai but not Khmer.

From richard.wordingham at ntlworld.com  Wed Jun 24 03:19:41 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 24 Jun 2020 09:19:41 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200624010317.5f2f5310@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2>
Message-ID: <20200624091941.7507cd03@JRWUBU2>

On Wed, 24 Jun 2020 01:03:17 +0100 Richard Wordingham via Unicode wrote:

> On Tue, 23 Jun 2020 15:50:27 -0700
> Asmus Freytag via Unicode wrote:
> > The recommendation you cite is a bit "common sense". I believe,
> > without actual knowledge, that there are no "dt" or "td"
> > combinations, only "dd" and "tt". In that case, a spell checker can
> > help you pick the correct code for the subscript form.
>
> That's a grammar rule; I'm not sure that spell checkers can exploit it.
> While Series one ('a') normally has /nt/ for the base consonant, there
> are or were (my source is Huffman) a few words with /nd/, and there are
> a few words that can be said either way (Durdin 2018, I think).
> That's a grammar rule - I'm not that spell checkers can exploit it. > While Series one ('a') normally has /nt/ for base consonant, there are > or were (my source is Huffman) a few words with /nd/, and there are a > few words that can be said either way (Durdin 2018, I think). I forgot to add the condition that the /n/ be written with NO. Series one with NNO is usually /nd/; I don't know whether series one /nt/ written with NNO exists. It might exist in Pali, but be a matter of sect. Richard. From richard.wordingham at ntlworld.com Wed Jun 24 03:43:45 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 24 Jun 2020 09:43:45 +0100 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> Message-ID: <20200624094345.5b41e198@JRWUBU2> On Wed, 24 Jun 2020 04:17:44 +0000 James Kass via Unicode wrote: > On 2020-06-24 12:03 AM, Richard Wordingham via Unicode wrote: > > My immediate concern was the alternative Old Khmer spellings ???? > > and ????, which look identical in most fonts. However, I am told > > the Windows UI font Leelawadee UI distinguishes them, which could > > make it difficult to outlaw deliberate distinction. > The Leelawadee font with Windows 7 covers Thai but not Khmer. Still true on Windows 10. But Leelawadee UI also includes Khmer, and in response to this post I verified that it makes the distinction, albeit it in an innovative way. Richard. From richard.wordingham at ntlworld.com Wed Jun 24 05:39:01 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 24 Jun 2020 11:39:01 +0100 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> Message-ID: <20200624113901.7cc75c76@JRWUBU2> On Tue, 23 Jun 2020 15:50:27 -0700 Asmus Freytag via Unicode wrote: > The recommendation you cite is a bit "common sense". I believe, > without actual knowledge, that there are no "dt" or "td" combinations > only "dd" and "tt". In that case, a spell checker can help you > pick the correct code for the subscript form. Some SIL notes record ????? /?ut.??m/ ?excellent?; the Pali/Sanskrit word is _uttama_. It makes me wonder if "td" is much commoner than "tt" for series 1. Richard. From kent.b.karlsson at bahnhof.se Wed Jun 24 14:22:25 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 24 Jun 2020 21:22:25 +0200 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> Message-ID: (Picking a quote slightly arbitrarily here.) > They are supposed to represent subscript DA and TA, and for the > old-Khmer style those look different. The fact that they look identical > does not mean that you should only use the subscript TA and expect > it to work where subscript DA is intended. I know it is very late to say this but? To me this seem very much like there has been an ORTHOGRAPHIC change over time (preferring TA over DA when subscript), NOT a commonisation of glyphs. Indeed, one can well argue that giving COENG TA and COENG DA the same glyph violates the character identity for these characters/ character sequences. 
/Kent Karlsson

From richard.wordingham at ntlworld.com  Wed Jun 24 18:29:30 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 25 Jun 2020 00:29:30 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com>
Message-ID: <20200625002930.5067dbc8@JRWUBU2>

On Wed, 24 Jun 2020 21:22:25 +0200 Kent Karlsson via Unicode wrote:

> (Picking a quote slightly arbitrarily here.)
>
> > They are supposed to represent subscript DA and TA, and for the
> > old-Khmer style those look different. The fact that they look
> > identical does not mean that you should only use the subscript TA
> > and expect it to work where subscript DA is intended.
>
> I know it is very late to say this, but… To me this seems very much like
> there has been an ORTHOGRAPHIC change over time (preferring
> TA over DA when subscript), NOT a commonisation of glyphs.
>
> Indeed, one can well argue that giving COENG TA and COENG DA
> the same glyph violates the character identity for these characters/
> character sequences.

The identities of the subscript consonants do seem tied to the base consonants; there have been some drastic changes as current shapes become too confusable.

Now, the usage of the base characters isn't, or wasn't, as sharply defined as one might hope. There are, or were (my source is pre-Khmer Rouge), some words written with a base consonant TA pronounced as though it were DA. According to Huffman, there was free variation between what are encoded and . By that correspondence, simply abandoning the concept of COENG DA probably wasn't an option. Deciding to make COENG DA identical to COENG TA was an option.

Richard.

From asmusf at ix.netcom.com  Wed Jun 24 22:28:41 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 24 Jun 2020 20:28:41 -0700
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com>
Message-ID: <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com  Thu Jun 25 08:57:52 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 25 Jun 2020 14:57:52 +0100
Subject: Khmer tally-style numbers
Message-ID: <20200625145752.63de24f6@JRWUBU2>

I have come across some tally-style numbers in old Khmer inscriptions, and I'm wondering how they should be encoded. I am assuming that the alphabetic script is Khmer. What characters might be being used here? If I use a sans-serif font, the Roman numerals I, II and III work well. Other possibilities I have considered are U+1D369 COUNTING ROD TENS DIGIT ONE to U+1D36B COUNTING ROD TENS DIGIT THREE, and up to three instances of U+1D377 TALLY MARK ONE, which the chart calls a 'western tally mark'. Are any of these formally appropriate?

For high numbers, e.g. '6', the texts seem to use ordinary Khmer decimal digits. It feels massively inappropriate to use the single vertical stroke character U+17F2 KHMER SYMBOL LEK ATTAK PII for a pair of vertical strokes together meaning '2'.

Richard.
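The candidates can at least be lined up from the character-database side (a small sketch; it assumes a Python whose Unicode data is version 11 or later, since the tally marks are that recent):

    import unicodedata

    # Inspect the candidate characters' names and general categories.
    for cp in [0x1D369, 0x1D36A, 0x1D36B, 0x1D377, 0x17F2]:
        ch = chr(cp)
        print("U+%04X %s (category %s)" % (
            cp, unicodedata.name(ch, "(unknown)"), unicodedata.category(ch)))

That settles nothing about appropriateness, of course; it only shows what the standard says each candidate is.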
From vp88.mobile at gmail.com  Thu Jun 25 04:02:38 2020
From: vp88.mobile at gmail.com (Vova Mobile)
Date: Thu, 25 Jun 2020 12:02:38 +0300
Subject: Ukrainian names for some specific Unicode characters
Message-ID:

Hi dear developers and mailing-list members.

Please tell me, where can I get the Ukrainian names of some specific Unicode characters? I am interested in the Ukrainian names of the following symbols: ? ? ? ? ? ? ? ? ? ? ?

Please tell me the Ukrainian names of these characters, or tell me where I can find them.

From kent.b.karlsson at bahnhof.se  Sun Jun 28 15:05:35 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sun, 28 Jun 2020 22:05:35 +0200
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com>
Message-ID:

> 25 juni 2020 kl. 05:28 skrev Asmus Freytag via Unicode :
>
> On 6/24/2020 12:22 PM, Kent Karlsson via Unicode wrote:
>> (Picking a quote slightly arbitrarily here.)
>>
>>> They are supposed to represent subscript DA and TA, and for the
>>> old-Khmer style those look different. The fact that they look identical
>>> does not mean that you should only use the subscript TA and expect
>>> it to work where subscript DA is intended.
>>
>> I know it is very late to say this, but… To me this seems very much like
>> there has been an ORTHOGRAPHIC change over time (preferring
>> TA over DA when subscript), NOT a commonisation of glyphs.
>>
>> Indeed, one can well argue that giving COENG TA and COENG DA
>> the same glyph violates the character identity for these characters/
>> character sequences.
>>
>> /Kent Karlsson
>>
> What fact about the Khmer writing system leads you to that conclusion?
>
> A./
>
> PS: at some point, looking merely at the printed shapes, an issue like this is not decidable -- to decide it you need to know how people using the script conceptualize it.

As far as I can gather, it seems like DA and TA both span 't'-like and 'd'-like pronunciations. So from a pronunciation point of view, their degree of interchangeability seems high. Trying to make rules for when to use one or the other is then very tenuous and prone to change both over space (dialects) and time (spelling changes, formal or informal). Thus one cannot make a reliable rule as to whether a TA-looking subjoined (Khmer) letter should be seen as COENG TA or COENG DA.
This is especially important here, since historically COENG DA had its own separate rendering not conflated with COENG TA rendering. /Kent Karlsson -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Jun 29 09:56:40 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 29 Jun 2020 15:56:40 +0100 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> Message-ID: <20200629155640.5d1c30f3@JRWUBU2> On Sun, 28 Jun 2020 22:05:35 +0200 Kent Karlsson via Unicode wrote: > And indeed, if COENG DA and COENG TA are rendered the same by many > but not all fonts supporting the Khmer script, it is impossible to > reliably communicate things like ?the current spelling of the word is > but the traditional spelling is > ? in plain or formatted text (even > in formatted text the font selection is not very hard, there can be > font substitutions) without resorting to images or extraneous > explanations of which letters were actually used. That seems like a > pity. The different rendering need not be such that (e.g., as here, > COENG DA) it is the old one, but needs to be distinguishable by a > reasonable reader at reasonable font size/resolution. It could be a > ?modernized? rendering of COENG DA, or a more traditional one, but > sufficiently clearly distinct from the rendering of COENG TA (and > distinct from other Khmer subscript letters); THAT would be a font > difference. But at the point where ?original? COENG DA is rendered > exactly the same as COENG TA, it is a spelling change, and should be > treated as such. One of the fonts that comes with Windows 10, Leelawadee UI, actually makes a distinction. (Marc Durdin pointed that out to me in response to this thread.) Its COENG DA leaves a wider gap with the base consonant, and is vertically more compressed to compensate. This is a modern innovation, and to me seems similar in intent to the barely perceptible difference between an open loop U+0067 LATIN SMALL LETTER G and U+0261 LATIN SMALL LETTER SCRIPT G that another font makes. That font looks like part of a move to change the encoding of modern Khmer by replacing COENG DA by COENG TA. That promises to be another complication in transliterating Pali and Sanskrit between Indic scripts. The visible spelling change seems to have been complete in Khmer by 1930, at least as far as printed material was concerned. Does anyone know if COENG TA and COENG DA have been distinguished if subscripts were encoded separately? Still, we are where we are. Richard. From kent.b.karlsson at bahnhof.se Mon Jun 29 10:47:34 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Mon, 29 Jun 2020 17:47:34 +0200 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: <20200629155640.5d1c30f3@JRWUBU2> References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> <20200629155640.5d1c30f3@JRWUBU2> Message-ID: > 29 juni 2020 kl. 16:56 skrev Richard Wordingham via Unicode : > The visible spelling change seems to have been complete in Khmer by > 1930, at least as far as printed material was concerned. 
To make this a little bit less abstract: what did (what is now) COENG DA look like well before 1930?

/Kent Karlsson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com  Mon Jun 29 12:56:16 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 29 Jun 2020 18:56:16 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> <20200629155640.5d1c30f3@JRWUBU2>
Message-ID: <20200629185616.5622f60c@JRWUBU2>

On Mon, 29 Jun 2020 17:47:34 +0200 Kent Karlsson via Unicode wrote:

> To make this a little bit less abstract: what did (what is now) COENG
> DA look like well before 1930?

See pp. 25 and 26 of 'Inventaire provisoire des caractères et divers signes des écritures khmères pré-modernes et modernes employés pour la notation du khmer, du siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli' by Michel Antelme, and read Note 5 on p. 25. He gives the transliteration of TA and DA as 'ta' and 'ṭa'. The tables treat the merger as a change of spelling. A copy of the paper is available at http://aefek.free.fr/iso_album/antelme_bis.pdf.

The font Khmer2004 mentioned by Antelme, which is targeted at the Khom (Antelme writes this word 'kh?ma') variant of the script, is put through its paces at http://www.khmerfonts.info/fontinfo.php?font=1507 . The display starts with the alphabet, with each letter displayed on itself as a subscript.

Richard.

From kent.b.karlsson at bahnhof.se  Mon Jun 29 13:34:33 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Mon, 29 Jun 2020 20:34:33 +0200
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200629185616.5622f60c@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> <20200629155640.5d1c30f3@JRWUBU2> <20200629185616.5622f60c@JRWUBU2>
Message-ID: <864B1991-CB30-4107-A822-125CC95985AE@bahnhof.se>

> 29 juni 2020 kl. 19:56 skrev Richard Wordingham via Unicode :
>
> On Mon, 29 Jun 2020 17:47:34 +0200
> Kent Karlsson via Unicode wrote:
>
>> To make this a little bit less abstract: what did (what is now) COENG
>> DA look like well before 1930?
>
> See pp. 25 and 26 of 'Inventaire provisoire des caractères et divers
> signes des écritures khmères pré-modernes et modernes employés pour la
> notation du khmer, du siamois, des dialectes thaïs méridionaux, du
> sanskrit et du pāli' by Michel Antelme, and read Note 5 on p. 25. He
> gives the transliteration of TA and DA as 'ta' and 'ṭa'. The tables
> treat the merger as a change of spelling.

"The tables treat the merger as a change of spelling." I think that is key.

> A copy of the paper is
> available at http://aefek.free.fr/iso_album/antelme_bis.pdf.
>
> The font Khmer2004 mentioned by Antelme, which is targeted at the Khom
> (Antelme writes this word 'kh?ma') variant of the script, is put
> through its paces at http://www.khmerfonts.info/fontinfo.php?font=1507 .
> The display starts with the alphabet, with each letter displayed on
> itself as a subscript.
>
> Richard.

I note that both the references here give a 'DA-like' shape to 'COENG DA'.

/Kent Karlsson

-------------- next part --------------
An HTML attachment was scrubbed...
URL: