1 2 3 0x

Wonderful!

From asmusf at ix.netcom.com  Thu Mar 17 17:10:46 2022
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 17 Mar 2022 15:10:46 -0700
Subject: Why is it called case "folding"?
In-Reply-To: <4E354E16-342F-42B3-88C0-88CF03D353DC@freenet.de>
References: 
 <16c2e18e-c49c-37b5-6fe7-4e02af783d70@ix.netcom.com>
 <4E354E16-342F-42B3-88C0-88CF03D353DC@freenet.de>
Message-ID: <205689e3-0e3c-c744-de6d-7f181281cc07@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org  Thu Mar 17 18:40:09 2022
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 17 Mar 2022 17:40:09 -0600
Subject: Why is it called case "folding"?
In-Reply-To: 
References: 
Message-ID: <001f01d83a58$571b0030$05510090$@ewellic.org>

Roger L Costello wrote:

> Why is it called case "folding"?

If you think of a piece of paper with the uppercase alphabet written at the top of the page, and the lowercase alphabet at the bottom, and then folding the page in half so that the uppercase letters are on top of the lowercase letters (or vice versa), that's kind of the image.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From aprilop at freenet.de  Fri Mar 18 02:48:14 2022
From: aprilop at freenet.de (Andreas Prilop)
Date: Fri, 18 Mar 2022 07:48:14 +0000
Subject: Why is it called case "folding"?
In-Reply-To: <205689e3-0e3c-c744-de6d-7f181281cc07@ix.netcom.com>
References: 
 <16c2e18e-c49c-37b5-6fe7-4e02af783d70@ix.netcom.com>
 <4E354E16-342F-42B3-88C0-88CF03D353DC@freenet.de>
 <205689e3-0e3c-c744-de6d-7f181281cc07@ix.netcom.com>
Message-ID: <9496AF4E-7A91-48B8-9538-23D91AAF7508@freenet.de>

On 17 March 2022, Asmus Freytag wrote:

>>> Content-Type: text/html
>>>  
>>
>>  Wonderful! 
> 
> And your point being?

You (or your user-agent) send messages as "text/html" with ??,
which prevents wrapping of lines. The reader has to scroll horizontally
or sees the message in a tiny font size.
See your own message at:

https://corp.unicode.org/pipermail/unicode/2022-March/010065.html

https://corp.unicode.org/pipermail/unicode/attachments/20220317/a8b80bc4/attachment.htm

Why are you sending text/html in the first place?
You do not have any formatting at all.

From textexin at xencraft.com  Fri Mar 18 04:04:48 2022
From: textexin at xencraft.com (Tex)
Date: Fri, 18 Mar 2022 02:04:48 -0700
Subject: Why is it called case "folding"?
In-Reply-To: <001f01d83a58$571b0030$05510090$@ewellic.org>
References: 
 <001f01d83a58$571b0030$05510090$@ewellic.org>
Message-ID: <000c01d83aa7$38b49da0$aa1dd8e0$@xencraft.com>

Doug, so it isn't the case (no pun intended) that without the extra characters (cards) you can't win, like in poker, and so you fold.

-----Original Message-----
From: Unicode [mailto:unicode-bounces at corp.unicode.org] On Behalf Of Doug Ewell via Unicode
Sent: Thursday, March 17, 2022 4:40 PM
To: 'Roger L Costello'; 'unicode at unicode.org'
Subject: RE: Why is it called case "folding"?

Roger L Costello wrote:

> Why is it called case "folding"?

If you think of a piece of paper with the uppercase alphabet written at the top of the page, and the lowercase alphabet at the bottom, and then folding the page in half so that the uppercase letters are on top of the lowercase letters (or vice versa), that's kind of the image.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From lyratelle at gmx.de  Sat Mar 19 04:59:13 2022
From: lyratelle at gmx.de (Dominikus Dittes Scherkl)
Date: Sat, 19 Mar 2022 10:59:13 +0100
Subject: Why is it called case "folding"?
In-Reply-To: <000c01d83aa7$38b49da0$aa1dd8e0$@xencraft.com>
References: 
 <001f01d83a58$571b0030$05510090$@ewellic.org>
 <000c01d83aa7$38b49da0$aa1dd8e0$@xencraft.com>
Message-ID: <2c5f0d63-8c08-e1b8-ebdf-2955781fa98e@gmx.de>

Am 18.03.22 um 10:04 schrieb Tex via Unicode:
> Doug, So it isn't the case (no pun intended)  without the extra characters (cards) you can't win, like in poker, and so you fold.

In card-games the term comes from "folding" the spread cards in your
hand to one block, because the distinction doesn't matter to you anymore
(as you have given up).

--
                                          Dominikus Dittes Scherkl

From richard.wordingham at ntlworld.com  Sun Mar 20 12:58:26 2022
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 20 Mar 2022 17:58:26 +0000
Subject: Fault in Bidi Algorithm at BD16
Message-ID: <20220320175826.5ee0b834@JRWUBU2>

There is a fault in BD16, at least at Unicode 14.0:

The problem lies in this part of the algorithm:

"If an opening paired bracket is found and there is room in the stack,
push its Bidi_Paired_Bracket property value and its text position onto
the stack.

If an opening paired bracket is found and there is no room
in the stack, stop processing BD16 for the remainder of the isolating
run sequence.

If a closing paired bracket is found, do the following:

1.  Declare a variable that holds a reference to the current stack
    element and initialize it with the top element of the stack.

2.  Compare the closing paired bracket being inspected or its
    canonical equivalent to the bracket in the current stack element."

It was picked up by line 312 of BidiCharacterTests.txt:

0061 0020 2329 0062 002E 0031 3009;1;1;2 2 2 2 2 2 2;0 1 2 3 4 5 6

This line primarily checks that U+2329 and U+3009 are identified as a
'bracket pair'.  bpb(U+2329) is U+232A, whose canonical decomposition
is U+3009.  However, the step *numbered* '2' is non-deterministic; it
contains the word 'or'.  The simple, robust solution is to change 'or
its canonical equivalent' to 'and its canonical equivalents'.  That
also avoids the risk of 'its canonical equivalent' being interpreted as
the result of the function to_NFC or to_NFD.
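
To make the test line concrete, here is a minimal Python sketch that splits it into its fields and checks the decomposition mentioned above. The field names are an assumption based on the header comments of BidiCharacterTest.txt, and `parse_test_line` is a hypothetical helper, not part of any reference implementation.

```python
import unicodedata

LINE = "0061 0020 2329 0062 002E 0031 3009;1;1;2 2 2 2 2 2 2;0 1 2 3 4 5 6"


def parse_test_line(line):
    """Split one data line into its five semicolon-separated fields."""
    cps, para_dir, para_level, levels, order = line.split(";")
    return {
        "text": "".join(chr(int(cp, 16)) for cp in cps.split()),
        "paragraph_direction": int(para_dir),
        "paragraph_level": int(para_level),
        # Assumed convention: 'x' marks a character removed by rule X9.
        "levels": [None if lv == "x" else int(lv) for lv in levels.split()],
        "visual_order": [int(i) for i in order.split()],
    }


case = parse_test_line(LINE)
opener, closer = case["text"][2], case["text"][6]

# The opening bracket in the text is U+2329, the closing one is U+3009.
print(f"opener U+{ord(opener):04X}, closer U+{ord(closer):04X}")

# U+232A, the paired bracket of U+2329, decomposes canonically to U+3009.
print("decomposition of U+232A:", unicodedata.decomposition("\u232A"))
```

All seven resolved levels in the line are 2, which is only possible if the two brackets are treated as a pair.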

It feels simpler to work with the NFC or NFD equivalents of the
candidate opening and closing brackets at both the first and last of
the quoted steps.
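
One way to picture comparing normalized brackets is the sketch below. It is only an illustration under stated assumptions: `PAIRED_CLOSER` is a tiny hand-written excerpt standing in for BidiBrackets.txt, and `brackets_match` is a hypothetical helper. For brevity it normalizes both sides at the comparison step rather than when the opening bracket is pushed.

```python
import unicodedata

# Hypothetical excerpt of the opening -> closing bracket mapping.
PAIRED_CLOSER = {
    "(": ")",
    "[": "]",
    "\u2329": "\u232A",  # LEFT-/RIGHT-POINTING ANGLE BRACKET
    "\u3008": "\u3009",  # LEFT/RIGHT ANGLE BRACKET
}


def nfd(ch):
    """Return the NFD form of a single character."""
    return unicodedata.normalize("NFD", ch)


def brackets_match(opener, closer):
    """Compare the expected closer and the candidate closer after NFD."""
    expected = PAIRED_CLOSER.get(opener)
    return expected is not None and nfd(expected) == nfd(closer)


print(brackets_match("\u2329", "\u3009"))  # opener from the test line
print(brackets_match("(", "\u3009"))       # unrelated brackets
```

Because both sides are normalized, the comparison does not depend on which member of a canonically equivalent pair appears in the text.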

I admit that part of the problem was that I was using a tool that
assumed that canonically equivalent characters had the same Unicode
properties.

Richard.

From kenwhistler at sonic.net  Sun Mar 20 14:49:19 2022
From: kenwhistler at sonic.net (Ken Whistler)
Date: Sun, 20 Mar 2022 12:49:19 -0700
Subject: Fault in Bidi Algorithm at BD16
In-Reply-To: <20220320175826.5ee0b834@JRWUBU2>
References: <20220320175826.5ee0b834@JRWUBU2>
Message-ID: <1040dbb3-ce80-caf1-c31d-b6974bec007a@sonic.net>

Richard,

On 3/20/2022 10:58 AM, Richard Wordingham via Unicode wrote:
> 2.  Compare the closing paired bracket being inspected or its
>      canonical equivalent to the bracket in the current stack element."
>
> It was picked up by line 312 of BidiCharacterTests.txt:
>
> 0061 0020 2329 0062 002E 0031 3009;1;1;2 2 2 2 2 2 2;0 1 2 3 4 5 6
>
> This line primarily checks that U+2329 and U+3009 are identified as a
> 'bracket pair'.  bpb(U+2329) is U+232A, whose canonical decomposition
> is U+3009.  However, the step *numbered* '2' is non-deterministic; it
> contains the word 'or'.

I'm not seeing it. The inclusion of an "or" there does not make this 
non-deterministic.

Yes, the text is not pedantically precise, I suppose, but most people 
have not had trouble interpreting what is intended. If your candidate 
closing bracket (or the canonical equivalent of your candidate closing 
bracket) matches the closing bracket match mapping detailed in 
BidiBrackets.txt for the opening bracket candidate on the stack, then 
you have a bracket match.

This affects precisely 2329 and 232A because those are the *only* 
brackets listed in BidiBrackets.txt that have canonical decomposition 
mappings. And it is vanishingly unlikely that the UTC is ever going to 
add more such paired brackets with canonical decomposition mappings.
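
A rough way to explore this claim without BidiBrackets.txt at hand is to scan Python's `unicodedata` tables. This is a sketch under loud assumptions: general category Ps/Pe plus Bidi_Mirrored is used as a proxy for "listed in BidiBrackets.txt", which is not the same set, and the result depends on the Unicode version bundled with the Python interpreter. `brackets_with_canonical_decomposition` is a hypothetical helper.

```python
import sys
import unicodedata


def brackets_with_canonical_decomposition():
    """List mirrored Ps/Pe characters that have a canonical decomposition."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) not in ("Ps", "Pe"):
            continue
        if not unicodedata.mirrored(ch):
            continue
        decomp = unicodedata.decomposition(ch)
        # Compatibility mappings start with a tag such as "<small>";
        # canonical mappings have no tag.
        if decomp and not decomp.startswith("<"):
            found.append((cp, decomp))
    return found


result = brackets_with_canonical_decomposition()
for cp, decomp in result:
    print(f"U+{cp:04X} -> {decomp}")
```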

>   The simple, robust solution is to change 'or
> its canonical equivalent' to 'and its canonical equivalents'.
I don't think that actually would clarify the text. And we shouldn't 
imply more of a requirement to import normalization into UBA than is 
actually needed.
>   That
> also avoids the risk of 'its canonical equivalent' being interpreted as
> the result of the function to_NFC or to_NFD.

I don't see the distinction here. The NFC *and* NFD form of 2329 are 
both 3008. The NFC *and* NFD form of 232A are both 3009. You could use 
either of those and still end up with the right result for the bracket 
match. But why bother?

The BidiReference code just does a hard-coded additional test (and 
explains why). For this particular edge case, that works just as well, 
is just as robust (see above assertion that UTC isn't going to add more 
exceptions to be dealt with), and would be *faster* than introducing a 
step to normalize the brackets:

        if ( ( bracketData.bracket == closingcp ) ||
             ( ( bracketData.bracket == 0x232A ) && ( closingcp == 0x3009 ) ) ||
             ( ( bracketData.bracket == 0x3009 ) && ( closingcp == 0x232A ) ) )

Note the logical OR's there. If condition_a OR condition_b OR 
condition_c then you have a match. That is completely deterministic in 
this case.
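
Transliterated into Python, the quoted condition might look like the sketch below. The function name `closer_matches` and its parameter names are hypothetical; only the three-way condition comes from the quoted code.

```python
def closer_matches(stack_bracket, closing_cp):
    """Return True if the closing code point matches the stacked bracket.

    stack_bracket is the paired closing bracket stored for the opener on
    the stack; closing_cp is the closing bracket found in the text.
    """
    return (
        stack_bracket == closing_cp
        # Hard-coded handling of the one canonically equivalent pair.
        or (stack_bracket == 0x232A and closing_cp == 0x3009)
        or (stack_bracket == 0x3009 and closing_cp == 0x232A)
    )


print(closer_matches(0x232A, 0x3009))  # equivalent pair
print(closer_matches(0x0029, 0x005D))  # ')' against ']'
```

The function returns the same answer for the same inputs regardless of which of the three conditions is evaluated first.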

--Ken

>
> It feels simpler to work with the NFC or NFD equivalents of the
> candidate opening and closing brackets at both the first and last of
> the quoted steps.

From richard.wordingham at ntlworld.com  Sun Mar 20 15:43:38 2022
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 20 Mar 2022 20:43:38 +0000
Subject: Fault in Bidi Algorithm at BD16
In-Reply-To: <1040dbb3-ce80-caf1-c31d-b6974bec007a@sonic.net>
References: <20220320175826.5ee0b834@JRWUBU2>
 <1040dbb3-ce80-caf1-c31d-b6974bec007a@sonic.net>
Message-ID: <20220320204338.4915e71f@JRWUBU2>

On Sun, 20 Mar 2022 12:49:19 -0700
Ken Whistler via Unicode  wrote:

> Richard,
> 
> On 3/20/2022 10:58 AM, Richard Wordingham via Unicode wrote:
> > 2.  Compare the closing paired bracket being inspected or its
> >      canonical equivalent to the bracket in the current stack
> > element."
> >
> > It was picked up by line 312 of BidiCharacterTests.txt:
> >
> > 0061 0020 2329 0062 002E 0031 3009;1;1;2 2 2 2 2 2 2;0 1 2 3 4 5 6
> >
> > This line primarily checks that U+2329 and U+3009 are identified as
> > a 'bracket pair'.  bpb(U+2329) is U+232A, whose canonical
> > decomposition is U+3009.  However, the step *numbered* '2' is
> > non-deterministic; it contains the word 'or'.  
> 
> I'm not seeing it. The inclusion of an "or" there does not make this 
> non-deterministic.

"Do A or B" is not deterministic.  In general, there may be several
different ways of achieving the same effect.

> Yes, the text is not pedantically precise, I suppose, but most people 
> have not had trouble interpreting what is intended. If your candidate 
> closing bracket (or the canonical equivalent of your candidate
> closing bracket) matches the closing bracket match mapping detailed
> in BidiBrackets.txt for the opening bracket candidate on the stack,
> then you have a bracket match.

How do you collect the statistics?  I would have thought you would have
been unlikely to know about such matters, for the errors should get
caught by the conformance tests.  At that point the penny drops.  And
with English, one needs to be careful with quantifiers like 'or'; it
seems clear to me that not even all native speakers interpret
combinations the same.

By the time one gets to N0, the intelligibility of the UBA is rapidly
falling off.  (I'm not confident that that's curable.)  And we know that
people do code up Unicode algorithms without understanding them.  The
UBA is one of the more complex algorithms, which is probably why it has
such a large set of tests.  The complexity has led to at least one
author leaving a curse in his public code.

> This affects precisely 2329 and 232A because those are the *only* 
> brackets listed in BidiBrackets.txt that have canonical decomposition 
> mappings. And it is vanishingly unlikely that the UTC is ever going
> to add more such paired brackets with canonical decomposition
> mappings.
> 
> >   The simple, robust solution is to change 'or
> > its canonical equivalent' to 'and its canonical equivalents'.  
> I don't think that actually would clarify the text. And we shouldn't 
> imply more of a requirement to import normalization into UBA than is 
> actually needed.
> >   That
> > also avoids the risk of 'its canonical equivalent' being
> > interpreted as the result of the function to_NFC or to_NFD.  
> 
> I don't see the distinction here. The NFC *and* NFD form of 2329 are 
> both 3008. The NFC *and* NFD form of 232A are both 3009. You could
> use either of those and still end up with the right result for the
> bracket match. But why bother?

U+232A is canonically equivalent to U+3009, but is neither
to_NFC(U+3009) nor to_NFD(U+3009).  Thus, it's not immediately obvious
that the 'canonical equivalent of U+3009' means U+232A.
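
The asymmetry can be checked with Python's `unicodedata` module. In this small sketch, `normal_forms` is a hypothetical helper and the results reflect whatever Unicode version the interpreter ships with.

```python
import unicodedata


def normal_forms(cp):
    """Return the NFC and NFD forms of one code point, as code points."""
    ch = chr(cp)
    return {
        form: [ord(c) for c in unicodedata.normalize(form, ch)]
        for form in ("NFC", "NFD")
    }


for cp in (0x232A, 0x3009):
    forms = normal_forms(cp)
    shown = ", ".join(
        f"to_{form} = " + " ".join(f"U+{c:04X}" for c in cps)
        for form, cps in forms.items()
    )
    print(f"U+{cp:04X}: {shown}")
```

Normalization maps U+232A to U+3009 in both forms, but never maps U+3009 to U+232A, so a reader who equates "canonical equivalent" with a normalization function gets a one-way relation.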

> The BidiReference code just does a hard-coded additional test (and 
> explains why). For this particular edge case, that works just as
> well, is just as robust (see above assertion that UTC isn't going to
> add more exceptions to be dealt with), and would be *faster* than
> introducing a step to normalize the brackets:
> 
>         if ( ( bracketData.bracket == closingcp ) ||
>              ( ( bracketData.bracket == 0x232A ) && ( closingcp == 0x3009 ) ) ||
>              ( ( bracketData.bracket == 0x3009 ) && ( closingcp == 0x232A ) ) )
> 
> Note the logical OR's there. If condition_a OR condition_b OR 
> condition_c then you have a match. That is completely deterministic
> in this case.

The reference code is now in a place widely considered a threat to
networks!

Richard.

From public at khwilliamson.com  Wed Mar 23 11:01:24 2022
From: public at khwilliamson.com (Karl Williamson)
Date: Wed, 23 Mar 2022 10:01:24 -0600
Subject: Unicode in the news
Message-ID: <7d91cdec-eae4-6fd0-933a-8df5ebaa6b45@khwilliamson.com>

https://www.cbc.ca/news/canada/british-columbia/dakelh-indigenous-language-standard-syllabics-1.6392552

From sosipiuk at gmail.com  Thu Mar 24 13:09:23 2022
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 24 Mar 2022 14:09:23 -0400
Subject: Use of CANCEL TAG in emoji flags
Message-ID: <005e01d83faa$4b4da6c0$e1e8f440$@gmail.com>

Alexei Chimendez submitted a report last year about the problematic use of
CANCEL TAG for flag emojis:
https://www.unicode.org/L2/L2021/21127-edcom-rept-utc168.html

This was turned into an action item for Markus Scherer and the Properties
and Algorithms Group:
https://www.unicode.org/L2/L2021/21123.htm#168-A30

Is there any further information about this issue or the progress on it?

Thanks,
Sławomir Osipiuk

From markus.icu at gmail.com  Thu Mar 24 13:33:58 2022
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 24 Mar 2022 11:33:58 -0700
Subject: Use of CANCEL TAG in emoji flags
In-Reply-To: <005e01d83faa$4b4da6c0$e1e8f440$@gmail.com>
References: <005e01d83faa$4b4da6c0$e1e8f440$@gmail.com>
Message-ID: 

On Thu, Mar 24, 2022 at 11:13 AM Sławomir Osipiuk via Unicode <
unicode at corp.unicode.org> wrote:

> Alexei Chimendez submitted a report last year about the problematic use of
> CANCEL TAG for flag emojis:
> https://www.unicode.org/L2/L2021/21127-edcom-rept-utc168.html
>
> This was turned into an action item for Markus Scherer and the Properties
> and Algorithms Group:
> https://www.unicode.org/L2/L2021/21123.htm#168-A30
>
> Is there any further information about this issue or the progress on it?
>

Thanks for the nudge :-}
I have added it to the agenda now...

markus

From don.hosek at gmail.com  Wed Mar 30 22:16:23 2022
From: don.hosek at gmail.com (Don Hosek)
Date: Wed, 30 Mar 2022 21:16:23 -0600
Subject: Clarification on Annex 29, GB12–13
Message-ID: 

Annex 29 says:
> Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.
> GB12	sot (RI RI)* RI	×	RI
> GB13	[^RI] (RI RI)* RI	×	RI

This would seem to indicate that any even number of RI characters should be treated as a single grapheme, so that, e.g., ?????? would be a single grapheme rather than the expected three. However, there is no test in https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakTest.txt that would enforce this. Or is this just a case of my misreading the spec, and there is an implicit ÷ after each pair of RI characters? (If the latter, it might be helpful for future implementors to have a note to that effect.)
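
Read literally, the quoted rule suppresses a break only when an odd number of RI characters precede the break point. With an even number, neither GB12 nor GB13 matches and the default rule GB999 (break everywhere else) applies. Below is a minimal Python sketch of just that behaviour, ignoring every other grapheme cluster rule. `split_ri_clusters` is a hypothetical helper, and the three flags in `FLAGS` are arbitrary examples, not the ones from the message above.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_ri(ch):
    """True if ch is a regional indicator symbol."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_ri_clusters(text):
    """Split text using only GB12/GB13 plus the default 'break' rule."""
    clusters = []
    ri_run = 0  # consecutive RI characters immediately before this position
    for ch in text:
        if is_ri(ch) and ri_run % 2 == 1:
            # Odd number of RI before the break point: do not break.
            clusters[-1] += ch
        else:
            # No rule forbids a break here.
            clusters.append(ch)
        ri_run = ri_run + 1 if is_ri(ch) else 0
    return clusters


# Six RI characters: the pairs D+E, F+R, J+P.
FLAGS = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7\U0001F1EF\U0001F1F5"
print(len(split_ri_clusters(FLAGS)))
```

Under this reading, six RI characters come out as three clusters of two, and a trailing unpaired RI forms a cluster of its own.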

-dh
