From richard.wordingham at ntlworld.com  Tue May  5 08:23:46 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 5 May 2020 14:23:46 +0100
Subject: Is Devanagari =?UTF-8?B?4KSy4KWN4KSy4KS+4KSB?= ambiguous?
Message-ID: <20200505142346.2c414ceb@JRWUBU2>

Is this Devanagari akshara ambiguous between "l̐lā" (with a nasalised
first consonant, as in Sanskrit) and "llā̃" (with a nasalised vowel, as
in Hindi)?  If I understand correctly, ISO 15919 transliterates the
first reading as "m̐llā", or "m̐l lā" if one is splitting words
combined by sandhi.

Richard.


From richard.wordingham at ntlworld.com  Tue May  5 15:59:20 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 5 May 2020 21:59:20 +0100
Subject: Is Devanagari =?UTF-8?B?4KSy4KWN4KSy4KS+4KSB?= ambiguous?
Message-ID: <20200505215920.6221496b@JRWUBU2>

(Sorry if this is a duplicate - I'm experimenting with domain names as
the Unicode list via unicode at unicode.org is taking over 5 hours to
respond and may in fact be out of action.)

Richard.


From richard.wordingham at ntlworld.com  Tue May  5 19:19:27 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 6 May 2020 01:19:27 +0100
Subject: [EXTERNAL] Is Devanagari =?UTF-8?B?4KSy4KWN4KSy4KS+4KSB?=
 ambiguous?
In-Reply-To: <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
References: <20200505142346.2c414ceb@JRWUBU2>
 <BL0PR00MB033983FCF7EE5CCD50FC66E28EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
Message-ID: <20200506011927.1dbf0453@JRWUBU2>

On Tue, 5 May 2020 22:57:14 +0000
Andrew Glass <Andrew.Glass at microsoft.com> wrote:

 
> Here is an excerpt from Whitney's Sanskrit Grammar page 69:
> 
> [image not preserved]
> 
> Whitney, William Dwight. 1889. A Sanskrit grammar, including both the
> classical language, and the older dialects, of Veda and Brahmana.
> Bibliothek indogermanischer Grammatiken, Band II. Leipzig: Breitkopf
> and Härtel.

And MacDonnell's 1886 revision of Max Mueller's 'A Sanskrit Grammar for
Beginners' gives on p21 an example of y?y? with the candrabindu on the
far left. (The conjunct uses a half form.)  Unfortunately, they don't
actually answer the question of whether the placement of candrabindu is
significant, though they support my feeling that it is.
 
> The version with explicit virama is nice because it shows how the
> ambiguity can be avoided and gives us a clue to the better encoding.
> So I would encode these as follows:
> As a single cluster with candrabindu applied to the first l:
> 
> 0932 094D 0901 0932 094B
> 
> ल्ँलो
> 
> This cluster is supported in Nirmala UI: [image not preserved]

According to
https://docs.microsoft.com/en-us/typography/script-development/devanagari ,
that's two syllables, and that's how HarfBuzz is currently rendering
it.  It seems I'll have to raise a bug report against HarfBuzz - unless
it's changed fairly recently.

If I treat the candrabindu as a consonant modifier (i.e. as a type of
nukta), which is what the grammarians say it is, and encoded it before
the virama, I get a dotted circle out of HarfBuzz.

> With explicit virama and candrabindu applied to the first l:
> 
> 0932 094D 0901 0020 0932 094B
> 
> ल्ँ लो
> 

> Which leaves the vowel marked form as you have given it:
> 
> 0932 094D 0932 093E 0901
> 
> ल्लाँ

And this last one is the only encoding allowed by TUS 13 Section 12.1
R10.
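
For anyone wanting to compare the three quoted sequences character by
character, here is a minimal Python sketch that turns each hex listing
into a string and prints the Unicode character names.  The helper name
from_hex and the informal labels are assumptions made for illustration;
only the code points come from the message.

```python
import unicodedata


def from_hex(listing: str) -> str:
    """Turn a listing such as '0932 094D' into the corresponding string."""
    return "".join(chr(int(cp, 16)) for cp in listing.split())


# The three encodings quoted above; the labels are informal descriptions.
sequences = {
    "candrabindu on first l, one cluster": "0932 094D 0901 0932 094B",
    "candrabindu on first l, explicit virama": "0932 094D 0901 0020 0932 094B",
    "candrabindu after the vowel sign": "0932 094D 0932 093E 0901",
}

for label, listing in sequences.items():
    text = from_hex(listing)
    print(f"{label}: {text}")
    for ch in text:
        # Print each code point with its formal Unicode name.
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The only difference between the first and third sequences, apart from
the vowel sign, is where U+0901 sits relative to the second LA.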

My Unicode feedback and HarfBuzz bug report should make reference to the
thread 'Sanskrit nasalised L' including
https://www.unicode.org/mail-arch/unicode-ml/y2011-m06/0144.html .
Thank you for your help.

One thing I have established is that there are rendering systems that
support candrabindu within the consonant stack - but not where I
expected it!  (This relates to the issue of where U+0D81 SINHALA
SIGN CANDRABINDU appears in the encoding of a word, which I've raised
elsewhere.)

Richard.


From richard.wordingham at ntlworld.com  Wed May  6 04:46:29 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 6 May 2020 10:46:29 +0100
Subject: Is Devanagari =?UTF-8?B?4KSy4KWN4KSy4KS+4KSB?= ambiguous?
In-Reply-To: <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
References: <20200505142346.2c414ceb@JRWUBU2>
 <BL0PR00MB033983FCF7EE5CCD50FC66E28EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
Message-ID: <20200506104629.06c8ee3d@JRWUBU2>

On Tue, 5 May 2020 22:57:14 +0000
Andrew Glass <Andrew.Glass at microsoft.com> wrote:

> As a single cluster with candrabindu applied to the first l:
> 
> 0932 094D 0901 0932 094B
> 
> ल्ँलो
> 
> This cluster is supported in Nirmala UI: [image not preserved]

Just to check, is the formation of the conjunct done within the cluster
shaping or after the dissolution of the cluster boundaries?  This font
seems to have been carefully designed so that a candrabindu within the
consonant stack forces half-forms.  That approach rules out the
vertical stacking seen in the example from Whitney, but it also keeps
candrabindu on a consonant from being rendered the same as
candrabindu on a vowel.  (I tested the behaviour with the
consonant stacks t.ra and t.ta, where Sanskrit doesn't permit
candrabindu on the first character.)

Richard.


From cibucj at gmail.com  Wed May  6 04:53:48 2020
From: cibucj at gmail.com (Cibu)
Date: Wed, 6 May 2020 10:53:48 +0100
Subject: =?UTF-8?B?UmU6IElzIERldmFuYWdhcmkg4KSy4KWN4KSy4KS+4KSBIGFtYmlndW91cz8=?=
In-Reply-To: <20200505142346.2c414ceb@JRWUBU2>
References: <20200505142346.2c414ceb@JRWUBU2>
Message-ID: <CAD8TiP6RHznJfkGQhxP+H02+8mcXCZpcakyb7JXTQds=CqsQWw@mail.gmail.com>

I thought one would transliterate this as 'llām̐'. That is, with the
candrabindu occurring last.

On Wed, May 6, 2020 at 9:19 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> Is this Devanagari akshara ambiguous between "l̐lā" (with a nasalised
> first consonant, as in Sanskrit) and "llā̃" (with a nasalised vowel, as
> in Hindi)?  If I understand correctly, ISO 15919 transliterates the
> first reading as "m̐llā", or "m̐l lā" if one is splitting words
> combined by sandhi.
>
> Richard.
>
>

From richard.wordingham at ntlworld.com  Wed May  6 06:58:21 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 6 May 2020 12:58:21 +0100
Subject: Is Devanagari =?UTF-8?B?4KSy4KWN4KSy4KS+4KSB?= ambiguous?
In-Reply-To: <CAD8TiP6RHznJfkGQhxP+H02+8mcXCZpcakyb7JXTQds=CqsQWw@mail.gmail.com>
References: <20200505142346.2c414ceb@JRWUBU2>
 <CAD8TiP6RHznJfkGQhxP+H02+8mcXCZpcakyb7JXTQds=CqsQWw@mail.gmail.com>
Message-ID: <20200506125821.4dc05785@JRWUBU2>

On Wed, 6 May 2020 10:53:48 +0100
Cibu <cibucj at gmail.com> wrote:

> I thought one would transliterate this as 'llām̐'. That is, the
> candrabindu occurring as the last.

My question is whether the visual placement of the candrabindu affects
the meaning.  The Unicode standard says (Section 12.1 R10) that it
should be the last character in an akshara with these components.  Your
answer confirms that TUS is wrong when the candrabindu is modifying a
consonant; the position matters.  Thank you for the
information.  Microsoft still implements a solution* for Devanagari that
uses assigned characters with their correct individual significations.

Richard.

*Nirmala UI has a typographical problem with this solution for
'l̐li' (ल्ँलि); possibly this is the price of disambiguating it from
'llim̐' (ल्लिँ).


From everson at evertype.com  Wed May  6 08:01:26 2020
From: everson at evertype.com (Michael Everson)
Date: Wed, 6 May 2020 14:01:26 +0100
Subject: =?utf-8?B?UmU6IElzIERldmFuYWdhcmkg4KSy4KWN4KSy4KS+4KSBIGFtYmln?=
 =?utf-8?B?dW91cz8=?=
In-Reply-To: <20200505215920.6221496b@JRWUBU2>
References: <20200505215920.6221496b@JRWUBU2>
Message-ID: <AA160BA7-04FE-451F-A652-6D02590DDE5E@evertype.com>

It is not ambiguous in encoding. Whether one interprets it as l̐lā or llā̃ is a reading rule. But the encoding is LA + VIRAMA + LA + -AA + CANDRABINDU either way.
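
The sequence named here can be checked against the akshara carried in this thread's Subject header, which is stored as an RFC 2047 encoded word. The following is a small Python sketch of that check; the variable names are illustrative, and the encoded word is copied from the Subject line of the original question.

```python
from email.header import decode_header
import unicodedata

# Encoded word copied from the thread's Subject header.
encoded_word = "=?UTF-8?B?4KSy4KWN4KSy4KS+4KSB?="

# decode_header returns (bytes, charset) pairs for encoded words.
raw, charset = decode_header(encoded_word)[0]
akshara = raw.decode(charset)

for ch in akshara:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The listing comes out as LA, VIRAMA, LA, VOWEL SIGN AA, CANDRABINDU, which is the order described in this message.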

> Is this Devanagari akshara ambiguous between "l̐lā" (with a nasalised
> first consonant, as in Sanskrit) and "llā̃" (with a nasalised vowel, as
> in Hindi)?  If I understand correctly, ISO 15919 transliterates the
> first reading as "m̐llā", or "m̐l lā" if one is splitting words
> combined by sandhi.
> 
> Richard.
> 


From samjnaa at gmail.com  Thu May  7 10:31:31 2020
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Thu, 7 May 2020 21:01:31 +0530
Subject: =?UTF-8?B?UmU6IFtFWFRFUk5BTF0gSXMgRGV2YW5hZ2FyaSDgpLLgpY3gpLLgpL7gpIEgYW1iaWd1bw==?=
 =?UTF-8?B?dXM/?=
In-Reply-To: <20200506011927.1dbf0453@JRWUBU2>
References: <20200505142346.2c414ceb@JRWUBU2>
 <BL0PR00MB033983FCF7EE5CCD50FC66E28EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <20200506011927.1dbf0453@JRWUBU2>
Message-ID: <CAH-HCWWxJpCc4XywdhD3FOvSPFuavB__KrTGy=JZJc-sFHwDug@mail.gmail.com>

The only linguistically valid Sanskrit sequences are nasal-Y/V/L followed
by the same in non-nasal form. This may then be followed by a vowel or
another consonant.

In N Indic scripts the first nasal consonant is written as a half form
carrying a chandrabindu. The rest is as usual.

In S Indic scripts the stack of the consonants carries a chandrabindu on
top. See L2/09-372 p 41.

I would expect to encode this linguistic sequence in either type of script
as:

Y/V/L + VIRAMA + CANDRABINDU + ...
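
As a rough illustration of that pattern for Devanagari, here is a Python
sketch that looks for YA, VA or LA followed by VIRAMA, CANDRABINDU and
then the same consonant in non-nasal form. The function name and the
regular expression are assumptions made for this example, not part of
any proposal or shaping specification.

```python
import re

# YA (U+092F), VA (U+0935) or LA (U+0932), then VIRAMA (U+094D),
# CANDRABINDU (U+0901), then the same consonant again.
NASALISED_GEMINATE = re.compile("([\u092f\u0935\u0932])\u094d\u0901\\1")


def has_nasalised_first_consonant(text: str) -> bool:
    """Return True if text contains the consonant + virama + candrabindu pattern."""
    return NASALISED_GEMINATE.search(text) is not None


# Sequence with candrabindu inside the consonant stack.
print(has_nasalised_first_consonant("\u0932\u094d\u0901\u0932\u094b"))
# Sequence with candrabindu at the end of the akshara.
print(has_nasalised_first_consonant("\u0932\u094d\u0932\u093e\u0901"))
```

A check like this only distinguishes the two encodings; it says nothing
about whether a given shaping engine will render the first one as a
single cluster.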

This is what I have said in my Grantha proposal but that probably got lost
among so many other issues. I had been meaning to submit a separate doc on
this but haven't been able to get around to it sadly.

From Andrew.Glass at microsoft.com  Thu May  7 17:38:22 2020
From: Andrew.Glass at microsoft.com (Andrew Glass)
Date: Thu, 7 May 2020 22:38:22 +0000
Subject: =?utf-8?B?UkU6IFtFWFRFUk5BTF0gUmU6IElzIERldmFuYWdhcmkg4KSy4KWN4KSy4KS+?=
 =?utf-8?B?4KSBIGFtYmlndW91cz8=?=
In-Reply-To: <20200506104629.06c8ee3d@JRWUBU2>
References: <20200505142346.2c414ceb@JRWUBU2>
 <BL0PR00MB033983FCF7EE5CCD50FC66E28EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <20200506104629.06c8ee3d@JRWUBU2>
Message-ID: <SN6PR00MB03516C2777A480BFFCDC372F8EA50@SN6PR00MB0351.namprd00.prod.outlook.com>

Good question. We support Devanagari with our Indic engine. A peculiarity of this engine is that it doesn't have a per-run feature application stage; all features are applied at the cluster stage. Addressing this is an outstanding issue. Therefore, the example cluster is a permitted single cluster in our Indic engine. It would certainly be possible to have a ligature more like Whitney's example, but that wasn't included in the plan for the Nirmala font.

Cheers,

Andrew



From richard.wordingham at ntlworld.com  Fri May  8 04:43:36 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 8 May 2020 10:43:36 +0100
Subject: Is Devanagari =?UTF-8?B?4KSy4KWN4KSy4KS+4KSB?= ambiguous?
In-Reply-To: <SN6PR00MB03516C2777A480BFFCDC372F8EA50@SN6PR00MB0351.namprd00.prod.outlook.com>
References: <20200505142346.2c414ceb@JRWUBU2>
 <BL0PR00MB033983FCF7EE5CCD50FC66E28EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <20200506104629.06c8ee3d@JRWUBU2>
 <SN6PR00MB03516C2777A480BFFCDC372F8EA50@SN6PR00MB0351.namprd00.prod.outlook.com>
Message-ID: <20200508104336.4b1d34b3@JRWUBU2>

On Thu, 7 May 2020 22:38:22 +0000
Andrew Glass <Andrew.Glass at microsoft.com> wrote:

> Good question. We support Devanagari with our Indic engine, a
> peculiarity of this engine is that it doesn't have a per-run feature
> application stage and all features are applied at the cluster stage.
> Addressing this is an outstanding issue. Therefore, the example
> cluster is a permitted single cluster in our Indic engine.

Thanks for the information.  I have now raised the HarfBuzz bug as
Issue 2392 (https://github.com/harfbuzz/harfbuzz/issues/2392) and have
made comment number 416 for the Microsoft typography Devanagari
specification, currently at
https://docs.microsoft.com/en-us/typography/script-development/devanagari .

Richard.


From Andrew.Glass at microsoft.com  Fri May  8 15:18:38 2020
From: Andrew.Glass at microsoft.com (Andrew Glass)
Date: Fri, 8 May 2020 20:18:38 +0000
Subject: =?utf-8?B?UmU6IFtFWFRFUk5BTF0gUmU6IElzIERldmFuYWdhcmkg4KSy4KWN4KSy4KS+?=
 =?utf-8?B?4KSBIGFtYmlndW91cz8=?=
In-Reply-To: <20200508104336.4b1d34b3@JRWUBU2>
References: <20200505142346.2c414ceb@JRWUBU2>
 <BL0PR00MB033983FCF7EE5CCD50FC66E28EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <BL0PR00MB0339A3E99FB3A05FAD3EC5528EA70@BL0PR00MB0339.namprd00.prod.outlook.com>
 <20200506104629.06c8ee3d@JRWUBU2>
 <SN6PR00MB03516C2777A480BFFCDC372F8EA50@SN6PR00MB0351.namprd00.prod.outlook.com>,
 <20200508104336.4b1d34b3@JRWUBU2>
Message-ID: <SN6PR00MB035131F532DC9777DAD008398EA20@SN6PR00MB0351.namprd00.prod.outlook.com>

Thank you Richard,

I'll update our Indic engine documentation using the issue you created.

Andrew



From wjgo_10009 at btinternet.com  Mon May 11 10:44:05 2020
From: wjgo_10009 at btinternet.com (wjgo_10009 at btinternet.com)
Date: Mon, 11 May 2020 16:44:05 +0100 (BST)
Subject: Symbols for Disasters
Message-ID: <329d62ec.9e5.1720468535a.Webtop.49@btinternet.com>

Symbols for Disasters

Hi

I saw some time ago the following.

https://www.unicode.org/L2/L2020/20078-n4710-liaison-stmt.pdf

More recently I saw the following.

https://www.unicode.org/L2/L2020/20136-sc2-response.pdf

I have been trying to design some symbols and I have today produced and 
published an experimental font as a suggestion.

https://forum.high-logic.com/viewtopic.php?f=10&t=8406

William Overington

Monday 11 May 2020


From wjgo_10009 at btinternet.com  Sat May 16 08:06:53 2020
From: wjgo_10009 at btinternet.com (wjgo_10009 at btinternet.com)
Date: Sat, 16 May 2020 14:06:53 +0100 (BST)
Subject: Abstract emoji
Message-ID: <86907e.f2e.1721d9833df.Webtop.216@btinternet.com>

Abstract emoji

I notice that Public Review 408 QID Emoji has been reopened with a new 
closing date of 9 July 2020.

I wonder whether we could have a good mailing list discussion of whether 
abstract emoji should be implemented, either as part of QID emoji or 
however the QID emoji idea becomes adapted, or directly in regular 
Unicode.

http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm

If abstract emoji are allowed then it would greatly increase the 
expressive capability of emoji, including for communication through the 
language barrier; however, the meaning of each of at least some abstract 
emoji would need to be learned by end users. So some quantity of 
abstract emoji might be good, too many could be confusing. So what would 
be best please?

William Overington

Saturday 16 May 2020


From marius.spix at web.de  Sat May 16 18:43:17 2020
From: marius.spix at web.de (Marius Spix)
Date: Sun, 17 May 2020 01:43:17 +0200
Subject: Security consideration: math symbols in an exotic IP address format
 in a phishing mail
Message-ID: <20200517014230.329b11b5@spixxi>

Today I received an interesting phishing mail which had a URL
containing mathematical bold numbers. Interestingly the address
𝟎𝟓𝟔𝟕𝟏𝟑𝟔𝟎𝟑𝟎𝟐 was interpreted as an octal number 05671360302, which is
another spelling for 46.229.224.194. This worked for both Firefox and
Chrome. I don't know why such an address is accepted in the authority
part of an HTTPS URI of current browsers. Section 7.4 in RFC 3986 states
that additional IP address formats can become a security concern, but
it also says that literals should be converted to numeric form.
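
The arithmetic behind this can be reproduced with a short Python sketch.
It assumes that the bold digits are first folded to ASCII by a
compatibility normalization (NFKC is used here as a stand-in for
whatever mapping the browser applies) and that the result is then read
as a single octal number. The function name is made up for the example.

```python
import ipaddress
import unicodedata


def bold_octal_to_ipv4(host: str) -> str:
    """Fold compatibility digits to ASCII, read as octal, format as dotted quad."""
    ascii_digits = unicodedata.normalize("NFKC", host)
    number = int(ascii_digits, 8)
    return str(ipaddress.IPv4Address(number))


# Build the MATHEMATICAL BOLD DIGIT string (U+1D7CE..U+1D7D7) from plain digits.
bold = "".join(chr(0x1D7CE + int(d)) for d in "05671360302")
print(bold_octal_to_ipv4(bold))  # 46.229.224.194
```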

I wonder if this case should be added to UTR #36.

Regards

Marius


From bortzmeyer at nic.fr  Sun May 17 01:24:09 2020
From: bortzmeyer at nic.fr (Stephane Bortzmeyer)
Date: Sun, 17 May 2020 08:24:09 +0200
Subject: Security consideration: math symbols in an exotic IP address
 format  in a phishing mail
In-Reply-To: <20200517014230.329b11b5@spixxi>
References: <20200517014230.329b11b5@spixxi>
Message-ID: <20200517062409.GA10656@nic.fr>

On Sun, May 17, 2020 at 01:43:17AM +0200,
 Marius Spix via Unicode <unicode at unicode.org> wrote 
 a message of 15 lines which said:

> This worked for both Firefox and Chrome.

Also in a terminal with ping, so I suspect this is handled by lower
name resolution libraries.

From ratmice at gmail.com  Sun May 17 03:04:52 2020
From: ratmice at gmail.com (Matt Rice)
Date: Sun, 17 May 2020 08:04:52 +0000
Subject: characters for edge crossing/edge casing
Message-ID: <CACTLOFqU-VVBeq1WMxuBZXgEhG+-AqFfUUhC1vL1tO4563cNLg@mail.gmail.com>

I had looked, but couldn't find any characters suitable for edge crossing/casing
such as tunnels, bridges in the following paper, suitable for
orthogonal graph layouts.
e.g. vertical/horizontal crossing, somewhat similar to the characters
⤫, ⤬ U+292B-U+292C, but at 90 degrees.

https://arxiv.org/pdf/0705.0413.pdf

Or another style of crossing which I forget the name of,
https://www.yworks.com/assets/images/features/bridges.ff977b3c.svg

Have characters for this purpose been proposed before?


From doug at ewellic.org  Sun May 17 14:23:06 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 17 May 2020 13:23:06 -0600
Subject: characters for edge crossing/edge casing
Message-ID: <001901d62c80$97d60530$c7820f90$@ewellic.org>

Matt Rice wrote:

> I had looked, but couldn't find any characters suitable for edge
> crossing/casing such as tunnels, bridges in the following paper,
> suitable for orthogonal graph layouts.
> e.g. vertical/horizontal crossing, somewhat similar to the characters
> ⤫, ⤬ U+292B-U+292C, but at 90 degrees.
>
> https://arxiv.org/pdf/0705.0413.pdf
>
> Or another style of crossing which I forget the name of,
> https://www.yworks.com/assets/images/features/bridges.ff977b3c.svg
>
> Have characters for this purpose been proposed before?

Shapecatcher couldn't find them either, so I suppose they don't exist and could be reasonably proposed.

Keep in mind that even at 90 degrees, it should be possible to show examples of them in plain text, not just in diagrams, and that arbitrary angles such as those shown in the paper should be inadmissible.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From duerst at it.aoyama.ac.jp  Sun May 17 18:42:58 2020
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=)
Date: Mon, 18 May 2020 08:42:58 +0900
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <20200517014230.329b11b5@spixxi>
References: <20200517014230.329b11b5@spixxi>
Message-ID: <f6d2d185-6262-150e-14ab-c4cd4d5df6ca@it.aoyama.ac.jp>

Hello Marius, others,

On 17/05/2020 08:43, Marius Spix via Unicode wrote:
> Today I received an interesting phishing mail which had a URL
> containing mathematical bold numbers. Interestingly the address
> 𝟎𝟓𝟔𝟕𝟏𝟑𝟔𝟎𝟑𝟎𝟐 was interpreted as an octal number 05671360302, which is
> another spelling for 46.229.224.194. This worked for both Firefox and
> Chrome. I don't know why such an address is accepted in the authority
> part of an HTTPS URI of current browsers. Section 7.4 in RFC 3986 states
> that additional IP address formats can become a security concern, but
> it also says that literals should be converted to numeric form.

I'm somehow wondering what the *Unicode* phishing story is here. The 
user saw 𝟎𝟓𝟔𝟕𝟏𝟑𝟔𝟎𝟑𝟎𝟐, which was interpreted as 05671360302, 
which shouldn't be too surprising unless somebody is familiar with 
mathematical bold numbers.

The average user wouldn't know what 05671360302 is (unless it's e.g. a 
familiar telephone number). That should lead the user to reject this 
URL, and the phishing to fail. A similar outcome might be expected for 
46.229.224.194. Of course, the URL could be designed so as to make these 
numbers appear natural. And the user may click anyway.

There's a Unicode issue if we assume that a) phishing checkers check 
for cases such as 05671360302, or b) browsers etc. don't resolve 
05671360302 if it's in ASCII, but 𝟎𝟓𝟔𝟕𝟏𝟑𝟔𝟎𝟑𝟎𝟐 gets through. 
Otherwise, there may be a security issue, but it's not a Unicode one.


> I wonder if this case should be added to UTR #36.

Security considerations are always additive, so I'd guess yes.

Regards,   Martin.

> Regards
> 
> Marius
> 

From magnus at bodin.org  Sun May 17 23:39:37 2020
From: magnus at bodin.org (=?UTF-8?Q?Magnus_Bodin_=E2=98=80?=)
Date: Mon, 18 May 2020 06:39:37 +0200
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <20200517062409.GA10656@nic.fr>
References: <20200517014230.329b11b5@spixxi> <20200517062409.GA10656@nic.fr>
Message-ID: <CAMCs_Hq=0gNemkHbNe1hipNCfcB9U-Mb+Rv6shVeP3gt_dNoQQ@mail.gmail.com>

On Sun, May 17, 2020 at 12:06 PM Stephane Bortzmeyer via Unicode
<unicode at unicode.org> wrote:
>
> On Sun, May 17, 2020 at 01:43:17AM +0200,
>  Marius Spix via Unicode <unicode at unicode.org> wrote
>  a message of 15 lines which said:
>
> > This worked for both Firefox and Chrome.
>
> Also in a terminal with ping, so I suspect this is handled by lower
> name resolution libraries.

Yes, it is an old legacy from BSD libraries that has been inherited.
Actually, it accepts various formats (previously even more than 32 bits).

I illustrated this with a website here in 1998:

https://x42.com/active/ip32.mpl?host=archive.org

Nowadays the 40, 48 and 56-bit ones are blocked at least in Chrome.
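
To make the "various formats" concrete, here is a rough Python sketch of
the classic inet_aton-style rules as I understand them: one to four
dot-separated parts, each in decimal, octal (leading 0) or hex (leading
0x), with the last part filling all remaining bytes. The function
parse_legacy_ipv4 is written for this example and is not any library's
actual implementation; it deliberately rejects values wider than 32 bits.

```python
import ipaddress


def parse_legacy_ipv4(text: str) -> ipaddress.IPv4Address:
    """Parse a, a.b, a.b.c or a.b.c.d with decimal, octal or hex parts."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected 1 to 4 dot-separated parts")
    values = []
    for part in parts:
        if not (part.isascii() and part.isalnum()):
            raise ValueError(f"unsupported part: {part!r}")
        if part.lower().startswith("0x"):
            values.append(int(part[2:], 16))   # hexadecimal
        elif part.startswith("0") and len(part) > 1:
            values.append(int(part, 8))        # octal
        else:
            values.append(int(part, 10))       # decimal
    # Every part except the last is one byte; the last fills what is left.
    *head, last = values
    remaining_bytes = 4 - len(head)
    if any(v > 0xFF for v in head) or last >= 256 ** remaining_bytes:
        raise ValueError("value out of range for a 32-bit address")
    number = 0
    for v in head:
        number = (number << 8) | v
    number = (number << (8 * remaining_bytes)) | last
    return ipaddress.IPv4Address(number)


for spelling in ("05671360302", "0x2ee5e0c2", "786817218", "46.229.57538"):
    print(spelling, "->", parse_legacy_ipv4(spelling))
```

All four spellings in the loop denote the address mentioned earlier in
the thread, 46.229.224.194.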

-- magnus

From c933103 at gmail.com  Mon May 18 09:04:14 2020
From: c933103 at gmail.com (Phake Nick)
Date: Mon, 18 May 2020 22:04:14 +0800
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <20200517014230.329b11b5@spixxi>
References: <20200517014230.329b11b5@spixxi>
Message-ID: <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>

Somewhat relevant: I have previously observed that if you type or produce a
link such as http://www.abc.def/ghi?jk=lm , and then replace symbol characters
in the link with other confusable symbols, such as fullwidth punctuation,
the link will still take you to the intended address. Different
browsers accept different characters. When such a link is posted to
internet communities that restrict link sharing, links formed with these
alternative Unicode characters can bypass the link restrictions in
those communities and potentially take unsuspecting netizens to harmful
websites.
I don't understand why browsers would normalize clicked or typed links
in a way that exposes users to such risk.

On Sun, 17 May 2020 at 13:56, Marius Spix via Unicode <unicode at unicode.org> wrote:

> Today I received an interesting phishing mail which had a URL
> containing mathematical bold numbers. Interestingly the address
> 𝟎𝟓𝟔𝟕𝟏𝟑𝟔𝟎𝟑𝟎𝟐 was interpreted as an octal number 05671360302, which is
> another spelling for 46.229.224.194. This worked for both Firefox and
> Chrome. I don't know why such an address is accepted in the authority
> part of an HTTPS URI of current browsers. Section 7.4 in RFC 3986 states
> that additional IP address formats can become a security concern, but
> it also says that literals should be converted to numeric form.
>
> I wonder if this case should be added to UTR #36.
>
> Regards
>
> Marius
>
>
>

From cloos at jhcloos.com  Tue May 19 01:29:41 2020
From: cloos at jhcloos.com (James Cloos)
Date: Tue, 19 May 2020 02:29:41 -0400
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <f6d2d185-6262-150e-14ab-c4cd4d5df6ca@it.aoyama.ac.jp> ("Martin
 J. =?iso-8859-1?Q?D=FCrst?= via Unicode"'s message of "Mon, 18 May 2020
 08:42:58 +0900")
References: <20200517014230.329b11b5@spixxi>
 <f6d2d185-6262-150e-14ab-c4cd4d5df6ca@it.aoyama.ac.jp>
Message-ID: <m3367wl9nu.fsf@carbon.jhcloos.org>

> I don't know why such an address is accepted in the authority
> part of a HTTPS URI of current browsers.

simple.

it isn't.

at least not here.

for me, pasting 𝟎𝟓𝟔𝟕𝟏𝟑𝟔𝟎𝟑𝟎𝟐 into a browser's address bar leads to an
http GET, not to an https one.

(curiosity won.)

(which comic was that where the kids were intentionally using integers
instead of dns to access web sites to confuse anyone watching them?)

-JimC
-- 
James Cloos <cloos at jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6

From marius.spix at web.de  Tue May 19 13:56:06 2020
From: marius.spix at web.de (Marius Spix)
Date: Tue, 19 May 2020 20:56:06 +0200
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <m3367wl9nu.fsf@carbon.jhcloos.org>
References: <20200517014230.329b11b5@spixxi>
 <f6d2d185-6262-150e-14ab-c4cd4d5df6ca@it.aoyama.ac.jp>
 <m3367wl9nu.fsf@carbon.jhcloos.org>
Message-ID: <20200519205602.224b225d@spixxi>

> simple.
> 
> it isn't.
> 
> at least not here.
> 
> for me, pasting 𝟎𝟓𝟔𝟕𝟏𝟑𝟔𝟎𝟑𝟎𝟐 into a browser's address bar leads to an
> http GET, not to an https one.
> 
> (curiosity won.)
> 
> (which comic was that where the kids were intentionally using integers
> instead of dns to access web sites to confuse anyone watching them?)
> 
> -JimC

I deliberately did not send the complete URI. But I see a problem with
spam filters, because they have to recognize a lot more variants of IP
addresses.

Marius


From markus.icu at gmail.com  Tue May 19 14:33:58 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 19 May 2020 12:33:58 -0700
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>
References: <20200517014230.329b11b5@spixxi>
 <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>
Message-ID: <CAN49p6qugyxVgnrk0ea=srSeek-F9uvud-KH0-+SYWdi6cF=_Q@mail.gmail.com>

On Tue, May 19, 2020 at 12:24 PM Phake Nick via Unicode <unicode at unicode.org>
wrote:

> Somewhat relevant, I have previously observed that, if you type/produce a
> link of http://www.abc.def/ghi?jk=lm , and then replace symbol characters
> in the link with some other confusable symbols, like full width punctuation
> and such, that link will still take you to the intended address. Different
> browsers accept different characters. Sometimes when such a link format is
> being posted onto internet communities that restrict link sharing, such
> alternative unicode characters formed links can bypass link restrictions in
> those communities and potentially take unsuspecting netizens to harmful
> websites.
> I don't understand why browsers would normalize links being clicked/typed
> in such way which would expose users to such risk.
>

IDNA implementations process domain names using a "mapping" step which is
like a variant of NFKC_Casefold. That's why you can use uppercase as well
as other canonical and compatibility equivalents, and out-of-order
combining marks.
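
A rough feel for that mapping step can be had with Python's standard
library. This sketch is only an approximation: it chains NFKC and
casefold, which is not the actual IDNA mapping table, and the function
name is invented for the example.

```python
import unicodedata


def rough_idna_style_mapping(host: str) -> str:
    """Approximate an NFKC_Casefold-like mapping with stdlib pieces."""
    folded = unicodedata.normalize("NFKC", host).casefold()
    # Normalize again, since case folding can produce unnormalized text.
    return unicodedata.normalize("NFKC", folded)


# Fullwidth letters and fullwidth full stops, written with escapes.
fullwidth = "\uff37\uff37\uff37\uff0e\uff21\uff22\uff23\uff0e\uff44\uff45\uff46"
print(rough_idna_style_mapping(fullwidth))  # www.abc.def
```

The fullwidth input and plain "WWW.ABC.def" both come out as
"www.abc.def" under this approximation, which is the kind of
equivalence described above.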

markus

From lokedhs at gmail.com  Tue May 19 20:13:42 2020
From: lokedhs at gmail.com (=?UTF-8?Q?Elias_M=C3=A5rtenson?=)
Date: Wed, 20 May 2020 09:13:42 +0800
Subject: characters for edge crossing/edge casing
In-Reply-To: <001901d62c80$97d60530$c7820f90$@ewellic.org>
References: <001901d62c80$97d60530$c7820f90$@ewellic.org>
Message-ID: <CADtN0WJQiHDQhjCdTbH7AwyLCpDudEgqbvE+pPAv4P8_zTuB=Q@mail.gmail.com>

On Mon, 18 May 2020, 07:12 Doug Ewell via Unicode, <unicode at unicode.org>
wrote:

> Matt Rice wrote:
>
> > I had looked, but couldn't find any characters suitable for edge
> > crossing/casing such as tunnels, bridges in the following paper,
> > suitable for orthogonal graph layouts.
> > e.g. vertical/horizontal crossing, somewhat similar to the characters
> > ⤫, ⤬ U+292B-U+292C, but at 90 degrees.
> >
> > https://arxiv.org/pdf/0705.0413.pdf
> >
> > Or another style of crossing which I forget the name of,
> > https://www.yworks.com/assets/images/features/bridges.ff977b3c.svg
> >
> > Have characters for this purpose been proposed before?
>
> Shapecatcher couldn't find them either, so I suppose they don't exist and
> could be reasonably proposed.
>
> Keep in mind that even at 90 degrees, it should be possible to show
> examples of them in plain text, not just in diagrams, and that arbitrary
> angles such as those shown in the paper should be inadmissible.
>

What about U+2573 BOX DRAWINGS LIGHT DIAGONAL CROSS?

>

From richard.wordingham at ntlworld.com  Wed May 20 03:03:12 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 20 May 2020 09:03:12 +0100
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>
References: <20200517014230.329b11b5@spixxi>
 <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>
Message-ID: <20200520090312.47483edc@JRWUBU2>

On Mon, 18 May 2020 22:04:14 +0800
Phake Nick via Unicode <unicode at unicode.org> wrote:

> Somewhat relevant, I have previously observed that, if you
> type/produce a link of http://www.abc.def/ghi?jk=lm , and then
> replace symbol characters in the link with some other confusable
> symbols, like full width punctuation and such, that link will still
> take you to the intended address. Different browsers accept different
> characters. Sometimes when such a link format is being posted onto
> internet communities that restrict link sharing, such alternative
> unicode characters formed links can bypass link restrictions in those
> communities and potentially take unsuspecting netizens to harmful
> websites. I don't understand why browsers would normalize links being
> clicked/typed in such way which would expose users to such risk.

Possibly because it hasn't occurred to them to ban users of CJK
scripts?  Seriously, forcing users to explicitly type narrow
punctuation may be one hurdle too far for usability by some.  Not all
user input of URLs is mere copy and paste.  Sometimes one has to
manually convert '%2F' to '/'.

Calling these characters confusables misses the point that they are
variants of ASCII characters.

Richard.


From cloos at jhcloos.com  Wed May 20 05:53:54 2020
From: cloos at jhcloos.com (James Cloos)
Date: Wed, 20 May 2020 06:53:54 -0400
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <20200519205602.224b225d@spixxi> (Marius Spix's message of "Tue, 
 19 May 2020 20:56:06 +0200")
References: <20200517014230.329b11b5@spixxi>
 <f6d2d185-6262-150e-14ab-c4cd4d5df6ca@it.aoyama.ac.jp>
 <m3367wl9nu.fsf@carbon.jhcloos.org> <20200519205602.224b225d@spixxi>
Message-ID: <m3tv0aj2rh.fsf@carbon.jhcloos.org>

>>>>> "MS" == Marius Spix <marius.spix at web.de> writes:

MS> I deliberately did not send the complete URI. But I see a problem with
MS> spam filters, because they have to recognize a lot more variants of IP
MS> addresses.

ah.  of course.

looking at the https url with that string as the host and no local part,
my browsers do choose to block it.  seamonkey says the cert is self
signed and also that it is not for 46.229.224.194.

so at least some platforms/browsers do the right thing.

-JimC
-- 
James Cloos <cloos at jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6

From c933103 at gmail.com  Wed May 20 10:14:21 2020
From: c933103 at gmail.com (Phake Nick)
Date: Wed, 20 May 2020 23:14:21 +0800
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <20200520090312.47483edc@JRWUBU2>
References: <20200517014230.329b11b5@spixxi>
 <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>
 <20200520090312.47483edc@JRWUBU2>
Message-ID: <CAGHjPP+Ybk6siAiAKZL9=c1GOqeeZKpX9Dy9wf2zNpByfp1LvA@mail.gmail.com>

On Wed, 20 May 2020 at 22:30, Richard Wordingham via Unicode <unicode at unicode.org>
wrote:

> On Mon, 18 May 2020 22:04:14 +0800
> Phake Nick via Unicode <unicode at unicode.org> wrote:
>
> > Somewhat relevant, I have previously observed that, if you
> > type/produce a link of http://www.abc.def/ghi?jk=lm , and then
> > replace symbol characters in the link with some other confusable
> > symbols, like full width punctuation and such, that link will still
> > take you to the intended address. Different browsers accept different
> > characters. Sometimes when such a link format is being posted onto
> > internet communities that restrict link sharing, such alternative
> > unicode characters formed links can bypass link restrictions in those
> > communities and potentially take unsuspecting netizens to harmful
> > websites. I don't understand why browsers would normalize links being
> > clicked/typed in such way which would expose users to such risk.
>
> Possibly because it hasn't occurred to them to ban users of CJK
> scripts?  Seriously, forcing users to explicitly type narrow
> punctuation may be one hurdle too far for usability by some.  Not all
> user input of URLs is mere copy and paste.  Sometimes one has to
> manually convert '%2F' to '/'.
>
> Calling these characters confusables misses the point that they are
> variants of ASCII characters.
>
> Richard.


As a native Chinese speaker I have never seen anyone typing URL punctuation
in full width, other than a.) to confuse URL filtering systems, or b.) on a
few archaic printed documents that are not intended to be circulated in
digital format.

Also, sometimes browsers accept not just the exact fullwidth version of the
character but also other similar characters, for example a URL like
https://? <https://i>?????????/ would also work in Chrome and take you to
imgur's site.

These characters are described as "confusable" in UTR #36, and I
followed that usage of the term in my email.

From Shawn.Steele at microsoft.com  Wed May 20 11:27:32 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Wed, 20 May 2020 16:27:32 +0000
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <20200520090312.47483edc@JRWUBU2>
References: <20200517014230.329b11b5@spixxi>
 <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>
 <20200520090312.47483edc@JRWUBU2>
Message-ID: <MWHPR21MB0847559329795481D7A6C6FE82B60@MWHPR21MB0847.namprd21.prod.outlook.com>

Anyone validating links as supposed below should make sure that IDN style normalization happens first...

It's kind of a "common" security problem that folks try to check for "security" of data prior to that data undergoing a transformation of some kind, at which point the previous security check may no longer be valid.

Note that "Full width" isn't exactly "confusable" in the way IDN thinks of it, since they're mapped directly to their corresponding character.  Normally "confusable" is used to refer to characters that may appear similar yet end up resolving to something different.
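
To make the ordering point concrete, here is a minimal Python sketch. The blocklist, the host name `blocked.example`, and the helper names are hypothetical, and NFKC plus case folding merely stands in for real IDN-style normalization.

```python
import unicodedata

BLOCKED_HOSTS = {"blocked.example"}  # hypothetical blocklist


def to_fullwidth(text: str) -> str:
    """Replace printable ASCII with the corresponding fullwidth forms."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if "!" <= ch <= "~" else ch for ch in text
    )


def normalize_host(host: str) -> str:
    """Stand-in for IDN-style mapping (assumption: NFKC + case folding)."""
    return unicodedata.normalize("NFKC", host).casefold()


def is_blocked_naive(host: str) -> bool:
    # Checks the raw string, before the transformation a browser would apply.
    return host in BLOCKED_HOSTS


def is_blocked(host: str) -> bool:
    # Normalizes first, then checks.
    return normalize_host(host) in BLOCKED_HOSTS


disguised = to_fullwidth("blocked.example")
print(is_blocked_naive(disguised), is_blocked(disguised))  # prints False True
```

The naive check passes the disguised host because it compares the data before the transformation that a browser would later apply.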

-Shawn

-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Richard Wordingham via Unicode
Sent: Wednesday, May 20, 2020 1:03 AM
To: unicode at unicode.org
Subject: Re: Security consideration: math symbols in an exotic IP address format in a phishing mail

On Mon, 18 May 2020 22:04:14 +0800
Phake Nick via Unicode <unicode at unicode.org> wrote:

> Somewhat relevant, I have previously observed that, if you 
> type/produce a link of http://www.abc.def/ghi?jk=lm , and then replace 
> symbol characters in the link with some other confusable symbols, like 
> full width punctuation and such, that link will still take you to the 
> intended address. Different browsers accept different characters. 
> Sometimes when such a link format is being posted onto internet 
> communities that restrict link sharing, such alternative unicode 
> characters formed links can bypass link restrictions in those 
> communities and potentially take unsuspecting netizens to harmful 
> websites. I don't understand why browsers would normalize links being 
> clicked/typed in such way which would expose users to such risk.

Possibly because it hasn't occurred to them to ban users of CJK scripts?  Seriously, forcing users to explicitly type narrow punctuation may be one hurdle too far for usability by some.  Not all user input of URLs is mere copy and paste.  Sometimes one has to manually convert '%2F' to '/'.

Calling these characters confusables misses the point that they are variants of ASCII characters.

Richard.


From doug at ewellic.org  Wed May 20 16:19:33 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 20 May 2020 15:19:33 -0600
Subject: characters for edge crossing/edge casing
In-Reply-To: <CADtN0WJQiHDQhjCdTbH7AwyLCpDudEgqbvE+pPAv4P8_zTuB=Q@mail.gmail.com>
References: <001901d62c80$97d60530$c7820f90$@ewellic.org>
 <CADtN0WJQiHDQhjCdTbH7AwyLCpDudEgqbvE+pPAv4P8_zTuB=Q@mail.gmail.com>
Message-ID: <001401d62eec$5bbde720$1339b560$@ewellic.org>

I'm not sure what this question is asking. U+2573 doesn't depict the bottom line being visually broken to show the top line crossing over it, as U+292B and U+292C do.

Matt appears to be looking for characters like U+292B and U+292C, but tilted 45 degrees so that the lines point north-south and east-west.

--

Doug Ewell | Thornton, CO, US | ewellic.org

From: Elias Mårtenson <lokedhs at gmail.com>
Sent: Tuesday, May 19, 2020 19:14
To: Doug Ewell <doug at ewellic.org>
Cc: unicode <unicode at unicode.org>
Subject: Re: characters for edge crossing/edge casing

On Mon, 18 May 2020, 07:12 Doug Ewell via Unicode, <unicode at unicode.org <mailto:unicode at unicode.org> > wrote:

Matt Rice wrote:

> I had looked, but couldn't find any characters suitable for edge
> crossing/casing such as tunnels and bridges in the following paper,
> suitable for orthogonal graph layouts.
> e.g. vertical/horizontal crossing, somewhat similar to the characters
> ⤫, ⤬ U+292B-U+292C, but at 90 degrees.
>
> https://arxiv.org/pdf/0705.0413.pdf
>
> Or another style of crossing which I forget the name of,
> https://www.yworks.com/assets/images/features/bridges.ff977b3c.svg
>
> Have characters for this purpose been proposed before?

Shapecatcher couldn't find them either, so I suppose they don't exist and could be reasonably proposed.

Keep in mind that even at 90 degrees, it should be possible to show examples of them in plain text, not just in diagrams, and that arbitrary angles such as those shown in the paper should be inadmissible.

What about U+2573 BOX DRAWINGS LIGHT DIAGONAL CROSS?


From duerst at it.aoyama.ac.jp  Wed May 20 19:59:13 2020
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=)
Date: Thu, 21 May 2020 09:59:13 +0900
Subject: Security consideration: math symbols in an exotic IP address
 format in a phishing mail
In-Reply-To: <CAN49p6qugyxVgnrk0ea=srSeek-F9uvud-KH0-+SYWdi6cF=_Q@mail.gmail.com>
References: <20200517014230.329b11b5@spixxi>
 <CAGHjPPLfd_DwdE75z3UAXysyNRL9ZMqxYqJGrk8DGUbjxrOCOw@mail.gmail.com>
 <CAN49p6qugyxVgnrk0ea=srSeek-F9uvud-KH0-+SYWdi6cF=_Q@mail.gmail.com>
Message-ID: <f7192fc5-08fd-f353-fc1b-048b759254ab@it.aoyama.ac.jp>

Hello Markus, others,

On 20/05/2020 04:33, Markus Scherer via Unicode wrote:
> On Tue, May 19, 2020 at 12:24 PM Phake Nick via Unicode <unicode at unicode.org>
> wrote:
> 
>> Somewhat relevant, I have previously observed that, if you type/produce a
>> link of http://www.abc.def/ghi?jk=lm , and then replace symbol characters
>> in the link with some other confusable symbols, like full width punctuation
>> and such, that link will still take you to the intended address. Different
>> browsers accept different characters. Sometimes when such a link format is
>> being posted onto internet communities that restrict link sharing, such
>> alternative unicode characters formed links can bypass link restrictions in
>> those communities and potentially take unsuspecting netizens to harmful
>> websites.
>> I don't understand why browsers would normalize links being clicked/typed
>> in such way which would expose users to such risk.
>>
> 
> IDNA implementations process domain names using a "mapping" step which is
> like a variant of NFKC_Casefold.

That in itself isn't a problem, but it depends on the details.

> That's why you can use uppercase

Good.

> as well
> as other canonical

Good.

> and compatibility equivalents,

Good up to a point. As discussed already, mapping full-width characters 
to their half-width equivalents can make a lot of sense for users in 
China, Japan,... But mapping other compatibility equivalents doesn't 
make sense at all. Definitely not for Math Bold like in the example at 
hand, and definitely not for circled characters and the like.

> and out-of-order
> combining marks.

That's just part of canonical equivalence, isn't it?
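
A small Python check of that point, as a sketch: the two strings below carry the same two combining marks in different orders, and canonical reordering makes them identical after normalization. The choice of marks (U+0301 and U+0323) is just an example.

```python
import unicodedata

# "a" + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW (ccc 220)
acute_first = "a\u0301\u0323"
# Same marks in the other order.
dot_first = "a\u0323\u0301"

print(unicodedata.combining("\u0301"), unicodedata.combining("\u0323"))  # prints 230 220

# The raw strings differ, but their normalized forms are identical.
print(acute_first == dot_first)  # prints False
print(unicodedata.normalize("NFC", acute_first)
      == unicodedata.normalize("NFC", dot_first))  # prints True
```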

The other very important point is of course that IP addresses are not 
domain names, and therefore are not covered by IDNA, and shouldn't be 
mapped in any way. But what happens inside browsers is probably the 
following:

(1) Check if the authority part is an IP address or a domain name.
(2) It doesn't look like an (ASCII) IP address, so it's handled
     as a domain name.
(3) Apply IDNA mapping (see above). Produces ASCII numbers.
(4) Apply IDNA toASCII conversion (no-op in the case at hand)
(5) Feed this to a generic resolver, which includes octal->regular
     IP address conversion.

Narrowing the IDNA mapping as discussed above would fix this case, 
because the toASCII operation would reject Math Bold as invalid 
characters. For security checks, rejecting Math Bold (and the like) 
would also work. But that would have to be restricted to the authority 
part of a Web address, because such numbers can of course occur in other 
parts.
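
The steps above can be sketched in Python as follows. This is only an illustration of the suspected flow, not what any browser actually runs: NFKC stands in for the IDNA mapping, `parse_legacy_ipv4` is a hypothetical stand-in for a resolver that accepts octal and hex parts, and the disguised host is constructed here rather than taken from the phishing mail.

```python
import unicodedata

MATH_BOLD_DIGIT_ZERO = 0x1D7CE  # U+1D7CE MATHEMATICAL BOLD DIGIT ZERO


def to_math_bold_digits(text: str) -> str:
    """Disguise ASCII digits as Math Bold digits (demonstration only)."""
    return "".join(
        chr(MATH_BOLD_DIGIT_ZERO + int(ch)) if ch in "0123456789" else ch
        for ch in text
    )


def parse_legacy_ipv4(host: str) -> str:
    """Four-part IPv4 parsing where a leading 0 means octal and 0x means hex."""
    parts = host.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated parts")
    octets = []
    for part in parts:
        if part.lower().startswith("0x"):
            value = int(part, 16)
        elif len(part) > 1 and part.startswith("0"):
            value = int(part, 8)
        else:
            value = int(part, 10)
        if not 0 <= value <= 255:
            raise ValueError(f"part out of range: {part!r}")
        octets.append(str(value))
    return ".".join(octets)


disguised = to_math_bold_digits("056.0345.0340.0302")
print(disguised.isascii())                         # step (2): prints False
mapped = unicodedata.normalize("NFKC", disguised)  # step (3): ASCII digits
print(mapped)                                      # prints 056.0345.0340.0302
print(parse_legacy_ipv4(mapped))                   # step (5): prints 46.229.224.194
```

A filter that looks for dotted-decimal addresses in the raw text sees neither ASCII digits nor a decimal address, which is the spam-filter problem raised earlier in the thread.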

Regards,   Martin.

> markus
> 


From markus at gyger.org  Thu May 21 05:21:51 2020
From: markus at gyger.org (Markus Gyger)
Date: Thu, 21 May 2020 12:21:51 +0200
Subject: Wireless Connection Symbol
In-Reply-To: <CAGdV86k5_wQCTeRLdg0d2uAQBs3Kf3p39YkUQx3cjLaw50nD0Q@mail.gmail.com>
References: <CAGdV86k5_wQCTeRLdg0d2uAQBs3Kf3p39YkUQx3cjLaw50nD0Q@mail.gmail.com>
Message-ID: <CAGdV86k0hPYUo2JY6m=NQ_HvSq0SP=pNaSQkqX9UHFK4OEW7iA@mail.gmail.com>

Is there a recommended code point (sequence) for the *S01863 Wireless
Connection* symbol <http://archive.is/d4Yr1> of IEC 60617
<https://webstore.iec.ch/publication/2723>?

The suggested <https://archive.is/X1tbC> character *U+1F4F6 📶 ANTENNA WITH
BARS (cellular reception)* seems to have a less general meaning and looks
quite different.

Some older IEC mappings are in N2032
<https://unicode.org/wg2/docs/n2032.pdf>.


Markus

From rick at unicode.org  Fri May 22 15:49:11 2020
From: rick at unicode.org (Rick McGowan)
Date: Fri, 22 May 2020 13:49:11 -0700
Subject: Unicode server, planned maintenance
Message-ID: <5EC83AC7.9020507@unicode.org>

Hello.

This is to let everyone know there is an upcoming maintenance downtime 
for the Unicode.org servers scheduled in a window from 12am - 6am 
Pacific time, on May 23, 2020.


From doug at ewellic.org  Mon May 25 13:34:36 2020
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 25 May 2020 12:34:36 -0600
Subject: Wireless Connection Symbol
Message-ID: <000001d632c3$249f75d0$6dde6170$@ewellic.org>

Markus Gyger wrote:

> Is there a recommended code point (sequence) for the *S01863 Wireless
> Connection* symbol <http://archive.is/d4Yr1> of IEC 60617
> <https://webstore.iec.ch/publication/2723>?

U+1F50A SPEAKER WITH THREE SOUND WAVES 🔊 might be as close as you'll get.

Not all symbols for use in diagrams are necessarily candidates for plain-text encoding, although some certainly are and the bar is moving.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From everson at evertype.com  Tue May 26 21:10:29 2020
From: everson at evertype.com (Michael Everson)
Date: Wed, 27 May 2020 03:10:29 +0100
Subject: Wireless Connection Symbol
In-Reply-To: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
Message-ID: <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>

On 25 May 2020, at 19:34, Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Not all symbols for use in diagrams are necessarily candidates for plain-text encoding, although some certainly are and the bar is moving.

No, and despite the utility of some symbols for scholarship or tech use, we get to have chipmunk-squirrel hybrids and an incomplete set of dinosaurs.

Michael Everson

From jameskasskrv at gmail.com  Tue May 26 22:57:11 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 27 May 2020 03:57:11 +0000
Subject: Wireless Connection Symbol
In-Reply-To: <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
Message-ID: <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>


On 2020-05-27 2:10 AM, Michael Everson via Unicode wrote:
> On 25 May 2020, at 19:34, Doug Ewell via Unicode <unicode at unicode.org> wrote:
>
>> Not all symbols for use in diagrams are necessarily candidates for plain-text encoding, although some certainly are and the bar is moving.
> No, and despite the utility of some symbols for scholarship or tech use, we get to have chipmunk-squirrel hybrids and an incomplete set of dinosaurs.
>
> Michael Everson
A dilemma which QID Emoji Tag Sequences would resolve.
https://www.unicode.org/review/pri408/

From asmusf at ix.netcom.com  Wed May 27 00:25:47 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 26 May 2020 22:25:47 -0700
Subject: Wireless Connection Symbol
In-Reply-To: <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
Message-ID: <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/mailman/private/unicode/attachments/20200526/d2d83077/attachment.htm>

From markus at gyger.org  Wed May 27 02:09:21 2020
From: markus at gyger.org (Markus Gyger)
Date: Wed, 27 May 2020 09:09:21 +0200
Subject: Wireless Connection Symbol
In-Reply-To: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
Message-ID: <CAGdV86mjW45jPkk6Gq-ooY4BhHue+Bw1Abz3LiFvt-fxg+w_MQ@mail.gmail.com>

On Wed, May 27, 2020 at 1:06 AM Doug Ewell via Unicode <unicode at unicode.org>
wrote:

> U+1F50A SPEAKER WITH THREE SOUND WAVES 🔊 might be as close as you'll get.
>

Thanks, looks visually close. U+1F4F6 📶 is probably still closer to some
WLAN (or Wi-Fi) emoji though...


Markus

From prosfilaes at gmail.com  Wed May 27 02:32:08 2020
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 27 May 2020 00:32:08 -0700
Subject: Wireless Connection Symbol
In-Reply-To: <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
Message-ID: <CAMZ=zj7C1fzrrRzDUoz5GZO1F0wXk3oNaWd=egaU8tdQMAi55A@mail.gmail.com>

On Tue, May 26, 2020 at 11:20 PM James Kass via Unicode
<unicode at unicode.org> wrote:
>
> On 2020-05-27 2:10 AM, Michael Everson via Unicode wrote:
> > On 25 May 2020, at 19:34, Doug Ewell via Unicode <unicode at unicode.org> wrote:
> >
> >> Not all symbols for use in diagrams are necessarily candidates for plain-text encoding, although some certainly are and the bar is moving.
> > No, and despite the utility of some symbols for scholarship or tech use, we get to have chipmunk-squirrel hybrids and an incomplete set of dinosaurs.
> >
> > Michael Everson
> A dilemma which QID Emoji Tag Sequences would resolve.
> https://www.unicode.org/review/pri408/

In theory, but in practice, there are 700-some dinosaur species, and
encoding them alongside a hundred thousand other emoji is just going
to mean that nobody supports any of them.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)

From jameskasskrv at gmail.com  Wed May 27 04:21:39 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 27 May 2020 09:21:39 +0000
Subject: Wireless Connection Symbol
In-Reply-To: <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
Message-ID: <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>


On 2020-05-27 5:25 AM, Asmus Freytag via Unicode wrote:
> I've already said this on the previous PRI, but it bears repeating: QID
> sequences are fundamentally unworkable because they destroy the concept of
> character identity. I firmly believe the UTC is considerably underestimating
> the implications of providing a mechanism that can encode exactly the same
> information in several, mutually incompatible ways. ...

As opposed to the current mechanism in which users cannot encode their 
desired information at all?

Unicode already provides a method for encoding the same information 
incompatibly, the PUA. The QID emoji proposal seeks to standardize the 
plain-text interchange of any desired unencoded image, which would avoid 
the PUA issues.

There's more than one way to encode LATIN CAPITAL LETTER E WITH ACUTE 
compatibly. If there were more than one way to encode an image of an 
eohippus in Unicode, they could be considered compatible.
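
For reference, the two encodings of LATIN CAPITAL LETTER E WITH ACUTE can be compared with a short Python sketch; it relies only on the standard `unicodedata` module, and the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00c9"  # LATIN CAPITAL LETTER E WITH ACUTE
decomposed = "E\u0301"  # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT

# Different code point sequences...
print(precomposed == decomposed)  # prints False

# ...but canonically equivalent: both normalize to the same string.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # prints True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # prints True
```

The two spellings are interchangeable because normalization maps one onto the other.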

If the concern wrt compatibility is that "image of an eohippus" might 
some day become an atomic Unicode character, somehow conflicting with 
the QID Emoji encoding, then it's suggested that with an existing 
plain-text interchange mechanism nobody would need to propose "image of 
an eohippus" as an atomic character.

It's also suggested that concerns about character identity needn't apply 
to "image of an eohippus" and the like because they haven't any. "Image 
of an eohippus" is exactly that, nothing more and nothing less. 
Interpreting the image as a meaningful symbol is up to the organic 
intelligence reading the text. Meanwhile, the intention of the author 
is discoverable by any artificial process, namely that the author 
intended to send an image of an eohippus.


From jameskasskrv at gmail.com  Wed May 27 05:30:38 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 27 May 2020 10:30:38 +0000
Subject: QID Emoji (was Re: Wireless Connection Symbol)
In-Reply-To: <CAMZ=zj7C1fzrrRzDUoz5GZO1F0wXk3oNaWd=egaU8tdQMAi55A@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <CAMZ=zj7C1fzrrRzDUoz5GZO1F0wXk3oNaWd=egaU8tdQMAi55A@mail.gmail.com>
Message-ID: <0f60b7aa-7818-2599-55dc-48541c65c89c@gmail.com>


On 2020-05-27 7:32 AM, David Starner wrote:
> On Tue, May 26, 2020 at 11:20 PM James Kass via Unicode
> <unicode at unicode.org> wrote:
>> On 2020-05-27 2:10 AM, Michael Everson via Unicode wrote:
>>> On 25 May 2020, at 19:34, Doug Ewell via Unicode <unicode at unicode.org> wrote:
>>>
>>>> Not all symbols for use in diagrams are necessarily candidates for plain-text encoding, although some certainly are and the bar is moving.
>>> No, and despite the utility of some symbols for scholarship or tech use, we get to have chipmunk-squirrel hybrids and an incomplete set of dinosaurs.
>>>
>>> Michael Everson
>> A dilemma which QID Emoji Tag Sequences would resolve.
>> https://www.unicode.org/review/pri408/
> In theory, but in practice, there are 700-some dinosaur species, and
> encoding them alongside a hundred thousand other emoji is just going
> to mean that nobody supports any of them.
>

In the short term, yes. In the long term support will be driven by demand.

If enough users demand the ability to exchange an image of a 
basenji-doberman hybrid in plain-text, display support will be 
forthcoming. If the demand isn't enough to stimulate the large 
corporate players, then third-party support will step in. If there's no 
demand, it's moot. But it's discoverable under the QID Emoji proposal 
regardless of demand level.

If approved, it's already supported as far as Unicode is concerned, 
because it uses strings of already encoded characters which are already 
interchangeable. Display issues and input methods have traditionally 
been considered outside the scope of The Standard.


From abrahamgross at disroot.org  Wed May 27 03:13:59 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 27 May 2020 08:13:59 +0000 (UTC)
Subject: Wireless Connection Symbol
In-Reply-To: <CAGdV86mjW45jPkk6Gq-ooY4BhHue+Bw1Abz3LiFvt-fxg+w_MQ@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <CAGdV86mjW45jPkk6Gq-ooY4BhHue+Bw1Abz3LiFvt-fxg+w_MQ@mail.gmail.com>
Message-ID: <9036c9f9-0fa9-4895-a8c3-ccc86c174e44@disroot.org>

Maybe try writing a combination of characters like
??)
So that it looks like the symbol (but make it look good though)

2020/05/27 ??3:10:21 Markus Gyger via Unicode <unicode at unicode.org>:
> On Wed, May 27, 2020 at 1:06 AM Doug Ewell via Unicode <unicode at unicode.org> wrote:
>> U+1F50A SPEAKER WITH THREE SOUND WAVES 🔊 might be as close as you'll get.
> Thanks, looks visually close. U+1F4F6 📶 is probably still closer to some WLAN (or Wi-Fi) emoji though...
> 
> Markus
> 

From wjgo_10009 at btinternet.com  Wed May 27 08:40:22 2020
From: wjgo_10009 at btinternet.com (wjgo_10009 at btinternet.com)
Date: Wed, 27 May 2020 14:40:22 +0100 (BST)
Subject: QID Emoji
Message-ID: <410f7ff6.4c3.172565cd027.Webtop.53@btinternet.com>

QID Emoji can be discussed in this mailing list.

One of my ideas is banned from even being discussed in this mailing list 
and deemed to be out of scope!

Yet the research continues.

But I am not going to use QID Emoji to encode the items.

Either encoding of the concept gets done excellently with its own 
structured tagspace or it can go round again and again until it is 
encoded excellently in Unicode.

The documents are all deposited for conservation with the British 
Library.

Yet if QID Emoji are encoded, then maybe my idea will become encoded too 
on a "sauce for pasta is sauce for rice" basis as my idea is at least as 
rigorous as the QID Emoji proposal.

William Overington

Wednesday 27 May 2020


From sosipiuk at gmail.com  Wed May 27 11:18:01 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Wed, 27 May 2020 12:18:01 -0400
Subject: QID Emoij (was: Re: Wireless Connection Symbol)
Message-ID: <001e01d63442$659619b0$30c24d10$@gmail.com>

Users can encode their desired information using the PUA. That is precisely what it is for.

What QID seems to be proposing is a way to assign an interchange-suitable, (semi-)stable ID to a meaningful symbol. I don't see how that is anything other than what Unicode itself is meant to provide: a standardized ID number for a character. The QID process seems to be merely a way to "assign emojis faster" because the current process isn't responsive enough for vendors' liking. It's another layer of Unicode sitting atop Unicode, motivated by a desire for less oversight from the Unicode Consortium.

I agree with the "nay" comments. Besides all the practical downsides, I think QID emojis invite a free-for-all that goes against the very spirit of Unicode as being a standardized database of character/emoji IDs.

The issue to be resolved here lies in the process for adding emojis. The current process is too onerous and slow. I can imagine a new process, that isn't bound to a regular schedule, and that allows eminently useful and needed emojis to be fast-tracked to approval in days, not months. Perhaps an entire plane could be reserved for such emojis - 65K should be enough for anyone, right? ;) Perhaps there could be a provisional or probationary approval granted to certain emojis, or at least a "reservation" system for code points. A vendor could reserve spaces with emojis they plan to add (with reasonable limits, of course). There could be a public voting system to add or approve emojis in near-real-time based on thresholds for approval. It's 2020; we have the technology. Provisional emojis or code points reservations that don't see use/support after some amount of time are rejected and code points are allowed to be reused. Those that see use or public support are given final approval and become bound by stability requirements. The Unicode Consortium is still involved, but less so, relying more on automated metrics than meetings, though they would still have veto power if there is some valid subjective factor to consider.

The details are something to be worked out. The main point is that there is a desire for a quicker, more responsive way to add emojis. That can be done without essentially reconstructing Unicode on top of itself.

Sławomir Osipiuk


From markus at gyger.org  Wed May 27 14:51:45 2020
From: markus at gyger.org (Markus Gyger)
Date: Wed, 27 May 2020 21:51:45 +0200
Subject: Wireless Connection Symbol
In-Reply-To: <9036c9f9-0fa9-4895-a8c3-ccc86c174e44@disroot.org>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <CAGdV86mjW45jPkk6Gq-ooY4BhHue+Bw1Abz3LiFvt-fxg+w_MQ@mail.gmail.com>
 <9036c9f9-0fa9-4895-a8c3-ccc86c174e44@disroot.org>
Message-ID: <CAGdV86=Ai-xTpwcH3XUcUH7LrvC=E1TozMiq8P4e2en7P2K2nQ@mail.gmail.com>

On Wed, May 27, 2020 at 4:26 PM abrahamgross--- via Unicode <
unicode at unicode.org> wrote:

> Maybe try writing a combination of characters like
> ??)
> So that it looks like the symbol (but make it look good though)
>

Great idea, thanks!  ??)  even looks good in e.g. Source Sans Pro
<https://github.com/adobe-fonts/source-sans-pro>.
I'll probably just encode it as a "ligature" of two characters:  ?)


Markus

From kent.b.karlsson at bahnhof.se  Wed May 27 17:50:10 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 28 May 2020 00:50:10 +0200
Subject: Wireless Connection Symbol
In-Reply-To: <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
Message-ID: <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>


Embedding images (whichever one you like and "you" have access to) in text can already be done.

In HTML markup it is called "img" (e.g. <img src="./eohippus.png"/>). And there is no real question about which images will be "supported".

Granted, it is not plain text. But emoji are already pushing "out of" plain text as we knew it. And... I recall an argument (years ago) saying essentially
"these will be the only emoji encoded, the recommendation for expansion is to use images instead". That seems to have been forgotten...

/Kent Karlsson

PS
I agree with Asmus that the "QID emoji" is a really bad idea.


> 27 maj 2020 kl. 11:21 skrev James Kass via Unicode <unicode at unicode.org>:
> 
> 
> On 2020-05-27 5:25 AM, Asmus Freytag via Unicode wrote:
>> I've already said this on the previous PRI, but it bears repeating: QID
>> sequences are fundamentally unworkable because they destroy the concept of
>> character identity. I firmly believe the UTC is considerably underestimating
>> the implications of providing a mechanism that can encode exactly the same
>> information in several, mutually incompatible ways. ...
> 
> As opposed to the current mechanism in which users cannot encode their desired information at all?
> 
> Unicode already provides a method for encoding the same information incompatibly, the PUA.  The QID emoji proposal seeks to standardize the plain-text interchange of any desired unencoded image, which would avoid the PUA issues.
> 
> There's more than one way to encode LATIN CAPITAL LETTER E WITH ACUTE compatibly.  If there were more than one way to encode an image of an eohippus in Unicode, they could be considered compatible.
> 
> If the concern wrt compatibility is that "image of an eohippus" might some day become an atomic Unicode character, somehow conflicting with the QID Emoji encoding, then it's suggested that with an existing plain-text interchange mechanism nobody would need to propose "image of an eohippus" as an atomic character.
> 
> It's also suggested that concerns about character identity needn't apply to "image of an eohippus" and the like because they haven't any.  "Image of an eohippus" is exactly that, nothing more and nothing less.  Interpreting the image as a meaningful symbol is up to the organic intelligence reading the text.  Meanwhile, the intention of the author is discoverable by any artificial process, namely that the author intended to send an image of an eohippus.
> 
> 


From markus.icu at gmail.com  Wed May 27 21:53:04 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 27 May 2020 19:53:04 -0700
Subject: Wireless Connection Symbol
In-Reply-To: <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
Message-ID: <CAN49p6qjUx9JzRWYk8TO0AVZOW=1OQsm8v-xFe8Hh1rcbpVM-w@mail.gmail.com>

On Wed, May 27, 2020 at 6:20 PM Kent Karlsson via Unicode <
unicode at unicode.org> wrote:

> Granted, it is not plain text. But emoji are already pushing "out of"
> plain text as we knew it. And... I recall an argument (years ago) saying
> essentially
> "these will be the only emoji encoded, the recommendation for expansion is
> to use images instead". That seems to have been forgotten...
>

Not entirely forgotten...

http://www.unicode.org/reports/tr51/#Longer_Term

markus

From nospam-abuse at ilyaz.org  Thu May 28 03:42:25 2020
From: nospam-abuse at ilyaz.org (Ilya Zakharevich)
Date: Thu, 28 May 2020 01:42:25 -0700
Subject: ISO 14651/14652 vs Unicode sorting
Message-ID: <20200528084225.mglngkxuhyc77vnf@math.berkeley.edu>

I have been informed that according to the tables distributed with ISO
14651/14652, the following strings should be sorted in this order:

>   foobar
>   foo baz

Moreover, this is how glibc (and, as a corollary, all utilities) do
this in European locales on contemporary Linuxes.

I checked COBUILD, American Heritage, and Le Petit Robert II, and it
seems that they do indeed use this (brain damaged?) order.  (Although
not, apparently, Le Petit Robert I, which SEEMS TO HAVE compound
words tacked on at the end of the main record.)

However, this definitely contradicts what
  https://icu4c-demos-7hxm2n5zgq-uc.a.run.app/icu-bin/collation.html
does with the default locale, and with `en'.

So what is the intended behavior: of ICU, or of ISO?!

Thanks,
Ilya

From kenwhistler at sonic.net  Thu May 28 09:52:03 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Thu, 28 May 2020 07:52:03 -0700
Subject: ISO 14651/14652 vs Unicode sorting
In-Reply-To: <20200528084225.mglngkxuhyc77vnf@math.berkeley.edu>
References: <20200528084225.mglngkxuhyc77vnf@math.berkeley.edu>
Message-ID: <3c23b16c-27cd-3780-f63e-4dcfd81f33f6@sonic.net>

Ilya,

On this topic, see the extended discussion of variable weighting in UTS #10:

https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples

On 5/28/2020 1:42 AM, Ilya Zakharevich via Unicode wrote:
> I have been informed that according to the tables distributed with ISO
> 14651/14652, the following strings should be sorted in this order:
>
>>    foobar
>>    foo baz
> Moreover, this is how glibc (and, as a corollary, all utilities) do
> this in European locales on contemporary Linuxes.
ISO 14651 recommends the "Shifted" handling of variables. In this 
particular case, your concern is with the handling of U+0020 SPACE, but 
that choice also affects all punctuation and symbols, unless otherwise 
tailored.
>
> I checked COBUILD, American Heritage, and Le Petit Robert II, and it
> seems that they do indeed use this (brain damaged?) order.  (Although
> not, apparently, Le Petit Robert I, which SEEMS TO HAVE compound
> words tacked on at the end of the main record.)
Precisely what happens in various dictionaries is a bit beside the 
point, because they often follow somewhat special rules that may not 
always directly match the results of just taking all the headwords and 
sorting the strings according to a particular collation setting. They 
may require special tailoring.
>
> However, this definitely contradicts what
>    https://icu4c-demos-7hxm2n5zgq-uc.a.run.app/icu-bin/collation.html
> does with the default locale, and with `en'.

In that demo, the collation *defaults* to "Non-ignorable". Again, see 
the discussion of variable weighting cited above. In a "Non-ignorable" 
collation, the primary weights of the variables (space included) *are* 
used at the primary level of sortkey construction, instead of being 
shifted to only make a difference following any tertiary weight 
differences. So you get the results in the demo you see where the space 
character "makes a difference" -- namely, that it is weighted as 
significantly as other full letters.

However, if you switch options in that demo to "Shifted" -- see the 
seventh line of the radio buttons, labeled "alternate" -- then you get the 
Shifted weighting, which will then mirror the results you see for glibc.
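
As a rough illustration of the two settings, here is a toy sketch in Python. It is not ICU or glibc code and does not use the real DUCET weights: the VARIABLE set, the primary() weights and the sort_key() helper are all invented for the example, and the secondary and tertiary levels are left out entirely. It only shows why the space decides the order under "Non-ignorable" and stops deciding it under "Shifted".

```python
# Toy model of UCA variable weighting. All weights here are made up;
# a real implementation would take them from DUCET or the CLDR root collation.
VARIABLE = {" ", "-", "_"}  # hypothetical set of "variable" characters


def primary(ch):
    """Invented primary weight: variables sort before all letters."""
    return 1 if ch in VARIABLE else 100 + ord(ch.lower())


def sort_key(s, alternate="non-ignorable"):
    """Build a simplified sort key (primary level plus a fourth level)."""
    if alternate == "non-ignorable":
        # Variables keep their primary weight, so a space competes with letters.
        return ([primary(c) for c in s], [])
    if alternate == "shifted":
        # Variables are dropped from the primary level and only break ties
        # at a later (fourth) level.
        prim = [primary(c) for c in s if c not in VARIABLE]
        fourth = [primary(c) if c in VARIABLE else 0xFFFF for c in s]
        return (prim, fourth)
    raise ValueError("unknown alternate setting: " + alternate)


words = ["foobar", "foo baz"]
non_ignorable = sorted(words, key=lambda w: sort_key(w, "non-ignorable"))
shifted = sorted(words, key=lambda w: sort_key(w, "shifted"))
print(non_ignorable)  # ['foo baz', 'foobar']
print(shifted)        # ['foobar', 'foo baz']
```

With "shifted" the two strings are compared as "foobar" and "foobaz" at the primary level, which matches the glibc order quoted at the top of the thread.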

>
> So what is the intended behavior: of ICU, or of ISO?!

There is no "right answer" here. The Unicode Collation Algorithm comes 
with built-in alternative parametric settings, and, of course, the 
option to tailor the collation rules indefinitely, to meet the 
requirements of particular languages and/or particular dictionary 
orderings or other special purposes. ISO 14651 also allows different 
settings (although not as completely spelled out as in UCA) and 
tailorings. What glibc has done is pick the default, out-of-the-box 
shifted handling of variables implied by ISO 14651, but that is simply 
an implementation choice.

--Ken

>
> Thanks,
> Ilya
>

From wjgo_10009 at btinternet.com  Thu May 28 13:10:57 2020
From: wjgo_10009 at btinternet.com (wjgo_10009 at btinternet.com)
Date: Thu, 28 May 2020 19:10:57 +0100 (BST)
Subject: QID Emoji (from Re: Wireless Connection Symbol)
Message-ID: <4bddc04c.14fe.1725c7ae4cd.Webtop.41@btinternet.com>

QID Emoji (from Re: Wireless Connection Symbol)

Kent Karlsson wrote as follows.

> I agree with Asmus that the "QID emoji" is a really bad idea.

I opine that when considering a new idea it is important to be prepared 
to suspend disbelief and consider if any parts of the idea are good, 
rather than just the total idea.

I find the QID Emoji proposal has some very good aspects but is somewhat 
unstable as a whole.

So, if those in favour of the proposal and those against are each 
willing to be like the strongest trees and sway in the breeze then the 
good parts of the proposal could become available in a stable manner.

For example, maybe registration in a Unicode Inc. database, with the 
option of a cross-reference link to QID, would mean that only those QID 
where someone wants an emoji for that QID would be in the Unicode Inc. 
database, and a gentle moderation policy could be used to stop ambiguity 
and duplication. So maybe shorter codes.

What if U+FFF0 is defined, mutatis mutandis, as effectively what would 
be a ligature of the ID emoji and tag Q in the original proposal, U+FFF8 
is defined as the corresponding CANCEL and circled digits are used. All 
part of the basic plane, so fewer bytes for each such character and a 
graceful indicative fallback facility built in.
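
The "fewer bytes" point can be checked with a small Python sketch. Everything specific in it is an assumption made for illustration: the QID digits "1234" are invented, the shape of the original sequence (ID emoji, tag Q, tag digits, CANCEL TAG) is my reading of the proposal, and U+FFF0 and U+FFF8 are currently unassigned code points used here only as stand-ins for the suggested characters.

```python
# Compare encoded sizes of a supplementary-plane tag sequence and a
# hypothetical basic-plane alternative. Illustrative only.

def tag(ch):
    """Map an ASCII character to its tag counterpart (U+E0020..U+E007E)."""
    return chr(0xE0000 + ord(ch))


def circled(digit):
    """Map '0'..'9' to a circled digit; U+24EA is zero, U+2460 is one."""
    return "\u24EA" if digit == "0" else chr(0x2460 + int(digit) - 1)


qid = "1234"  # invented example digits, not a real registration

# Assumed shape of the original proposal: ID emoji + tag Q + tag digits + CANCEL TAG.
original = "\U0001F194" + tag("Q") + "".join(tag(d) for d in qid) + "\U000E007F"

# Assumed shape of the suggestion: U+FFF0 + circled digits + U+FFF8.
suggested = "\uFFF0" + "".join(circled(d) for d in qid) + "\uFFF8"

for name, s in (("original", original), ("suggested", suggested)):
    print(name, len(s), "code points,",
          len(s.encode("utf-8")), "bytes UTF-8,",
          len(s.encode("utf-16-le")), "bytes UTF-16")
```

Under these assumptions every character of the original sequence needs 4 bytes in both UTF-8 and UTF-16, while each basic-plane character needs 3 bytes in UTF-8 and 2 in UTF-16.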

I realize that the original proposal can be implemented with existing 
technology, and that the changes I suggest would require changes to The 
Unicode Standard and software packages, but that could be done in time 
if there is the will to do so, yet whatever solution is implemented is 
there for a very long time.

Would those two changes both go a long way towards making a solution 
that is acceptable to everybody?

I may not have solved every objection and what I suggest does change the 
original. Yet this is research for the future. So, if people agree, 
please say so, if not then please say what I have missed or got wrong 
and what needs fixing and then, as a group effort, maybe we can iterate 
in a constructive way and achieve a good solution acceptable to 
everybody.

William Overington

Thursday 28 May 2020


From richard.wordingham at ntlworld.com  Thu May 28 16:34:06 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 28 May 2020 22:34:06 +0100
Subject: ISO 14651/14652 vs Unicode sorting
In-Reply-To: <20200528084225.mglngkxuhyc77vnf@math.berkeley.edu>
References: <20200528084225.mglngkxuhyc77vnf@math.berkeley.edu>
Message-ID: <20200528223406.1efd0d36@JRWUBU2>

On Thu, 28 May 2020 01:42:25 -0700
Ilya Zakharevich via Unicode <unicode at unicode.org> wrote:

> So what is the intended behavior: of ICU, or of ISO?!

Does ICU now support the ISO 14651 default for NFD strings of
assigned characters?  I thought supporting DUCET was abandoned several
years ago.

Richard.

From kent.b.karlsson at bahnhof.se  Thu May 28 17:19:28 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Fri, 29 May 2020 00:19:28 +0200
Subject: Wireless Connection Symbol
In-Reply-To: <CAN49p6qjUx9JzRWYk8TO0AVZOW=1OQsm8v-xFe8Hh1rcbpVM-w@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <CAN49p6qjUx9JzRWYk8TO0AVZOW=1OQsm8v-xFe8Hh1rcbpVM-w@mail.gmail.com>
Message-ID: <3A410044-B4F7-4489-8CC4-3562D4E84ADE@bahnhof.se>


> 28 maj 2020 kl. 04:53 skrev Markus Scherer via Unicode <unicode at unicode.org>:
> 
> On Wed, May 27, 2020 at 6:20 PM Kent Karlsson via Unicode <unicode at unicode.org <mailto:unicode at unicode.org>> wrote:
> Granted, it is not plain text. But emoji are already pushing "out of" plain text as we knew it. And... I recall an argument (years ago) saying essentially
> "these will be the only emoji encoded, the recommendation for expansion is to use images instead". That seems to have been forgotten...
> 
> Not entirely forgotten...
> 
> http://www.unicode.org/reports/tr51/#Longer_Term <http://www.unicode.org/reports/tr51/#Longer_Term>
> 
> markus

Ok. Thanks for pointing that out. Glad it is not entirely forgotten.

One little nit:
"Other features required to make embedded graphics work well include the ability of images to scale with font size"

That sounds a little bit like one was requiring a small revolution in image rendering. But of course it is not (ok, HTML again):

     <img src="???" alt="??" style="width:1em;height:1em;"/>

(1em being the typical height and width of emoji glyphs.)

/Kent K


From markus.icu at gmail.com  Thu May 28 21:14:05 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 28 May 2020 19:14:05 -0700
Subject: ISO 14651/14652 vs Unicode sorting
In-Reply-To: <20200528223406.1efd0d36@JRWUBU2>
References: <20200528084225.mglngkxuhyc77vnf@math.berkeley.edu>
 <20200528223406.1efd0d36@JRWUBU2>
Message-ID: <CAN49p6qMxudpcqJB79pDR_iv_aBWodBMaCa3oN8zud3EHb0RXg@mail.gmail.com>

On Thu, May 28, 2020 at 5:05 PM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> Does ICU now support the ISO 14651 default for NFD strings of
> assigned characters?  I thought supporting DUCET was abandoned several
> years ago.
>

ICU uses the CLDR sort order which is a mild tailoring of the DUCET.
markus

From abrahamgross at disroot.org  Fri May 29 01:19:52 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 29 May 2020 06:19:52 +0000 (UTC)
Subject: QID Emoji (from Re: Wireless Connection Symbol)
In-Reply-To: <4bddc04c.14fe.1725c7ae4cd.Webtop.41@btinternet.com>
References: <4bddc04c.14fe.1725c7ae4cd.Webtop.41@btinternet.com>
Message-ID: <4da8599e-8e5d-49d7-9701-be8d227bfbfd@disroot.org>

What if, instead of using a QID, which on its own is just a meaningless number, we do tag_start, then write out a word in tag letters, then close it off with a tag_end? This way, if your device doesn't have the font, it can fall back to rendering the tags as regular characters (with an indicator that an emoji is supposed to be there).

Example:
Suppose I type: "I love ?triceratops?" (with ? being tag_start; ? being tag_end; and everything between being tag ascii characters E0020-E007E),
My phone, that has the correct font installed, will render it as
 [cid:eu.faircode.email.987]

While for someone else without the font, it would render as something like

 [cid:eu.faircode.email.988]

This way the intent of the original message will carry across to the reader correctly whether they have the correct font or not. This makes it better than PUA because the information is not completely lost, and as another bonus, screen readers would be able to read this just fine (i.e. "I love ?triceratops?", where ? represents a tone.)

I understand that this uses more space than QID sequences, but I think the payoff of not having to worry about ?/?/?/? everywhere is worth it.
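
A minimal Python sketch of the idea, under stated assumptions: the characters meant for tag_start and tag_end did not survive in this message, so the sketch uses U+1F194 (the ID emoji) as a stand-in tag_start and U+E007F CANCEL TAG as tag_end, and the square brackets in the fallback are just one possible "indicator". The function names are made up for the example.

```python
# Spell a word in tag characters and render a readable fallback.
# TAG_START is an assumption; TAG_END is U+E007F CANCEL TAG.
TAG_OFFSET = 0xE0000
TAG_START = "\U0001F194"
TAG_END = "\U000E007F"


def encode_tag_word(word):
    """Wrap a printable-ASCII word in tag_start ... tag_end."""
    if not all(0x20 <= ord(c) <= 0x7E for c in word):
        raise ValueError("only printable ASCII can be mapped to tag characters")
    return TAG_START + "".join(chr(TAG_OFFSET + ord(c)) for c in word) + TAG_END


def fallback_render(text):
    """What a device without the font might show: the word in brackets."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == TAG_START:
            out.append("[")
        elif ch == TAG_END:
            out.append("]")
        elif 0xE0020 <= cp <= 0xE007E:
            out.append(chr(cp - TAG_OFFSET))  # map back to plain ASCII
        else:
            out.append(ch)
    return "".join(out)


message = "I love " + encode_tag_word("triceratops")
print(fallback_render(message))  # I love [triceratops]
```

A screen reader could use the same mapping to read the word aloud instead of skipping it.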

2020/05/28 ??2:32:28 wjgo_10009--- via Unicode <unicode at unicode.org>:
> QID Emoji (from Re: Wireless Connection Symbol)
> 
> Kent Karlsson wrote as follows.
> 
>> I agree with Asmus that the "QID emoji" is a really bad idea.
>> 
> I opine that when considering a new idea it is important to be prepared to suspend disbelief and consider if any parts of the idea are good, rather than just the total idea.
> 
> I find the QID Emoji proposal has some very good aspects but is somewhat unstable as a whole.
> 
> So, if those in favour of the proposal and those against are each willing to be like the strongest trees and sway in the breeze then the good parts of the proposal could become available in a stable manner.
> 
> For example, maybe registration in a Unicode Inc. database, with the option of a cross-reference link to QID, would mean that only those QID where someone wants an emoji for that QID would be in the Unicode Inc. database, and a gentle moderation policy could be used to stop ambiguity and duplication. So maybe shorter codes.
> 
> What if U+FFF0 is defined, mutatis mutandis, as effectively what would be a ligature of the ID emoji and tag Q in the original proposal, U+FFF8 is defined as the corresponding CANCEL and circled digits are used. All part of the basic plane, so fewer bytes for each such character and a graceful indicative fallback facility built in.
> 
> I realize that the original proposal can be implemented with existing technology, and that the changes I suggest would require changes to The Unicode Standard and software packages, but that could be done in time if there is the will to do so, yet whatever solution is implemented is there for a very long time.
> 
> Would those two changes both go a long way towards making a solution that is acceptable to everybody?
> 
> I may not have solved every objection and what I suggest does change the original. Yet this is research for the future. So, if people agree, please say so, if not then please say what I have missed or got wrong and what needs fixing and then, as a group effort, maybe we can iterate in a constructive way and achieve a good solution acceptable to everybody.
> 
> William Overington
> 
> Thursday 28 May 2020
> 

From jameskasskrv at gmail.com  Fri May 29 02:14:15 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 29 May 2020 07:14:15 +0000
Subject: QID Emoij (was: Re: Wireless Connection Symbol)
In-Reply-To: <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
Message-ID: <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>


On 2020-05-27 10:50 PM, Kent Karlsson via Unicode wrote:
> Embedding images (whichever one you like and "you" have access to...) in text can already be done.
>
> In HTML markup it is called "img" (e.g. '<img src="./eohippus.png"/>'). And there is no real question about which images will be "supported".
>

The point Kent Karlsson makes here is just as valid today as it was when 
it was used as an argument against encoding the first set of emoji in 
Unicode.

It's true that there are proper ways of sticking random images in 
running text.  It's too bad that the emoji user community doesn't seem 
interested in that type of solution.

An underlying question is whether it's better for profit driven 
corporate interests to determine the emoji repertoire, or to let the set 
evolve naturally based on user community desires.

The QID Emoji proposal enables the latter, which is one of the reasons 
I'm in favor of it.

Naysayers will point out obstacles in any such approach.  In the English 
language, anyone who considers any and all obstacles insurmountable is 
referred to as a quitter.


From prosfilaes at gmail.com  Fri May 29 02:40:40 2020
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 29 May 2020 00:40:40 -0700
Subject: QID Emoij (was: Re: Wireless Connection Symbol)
In-Reply-To: <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
Message-ID: <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>

On Fri, May 29, 2020 at 12:19 AM James Kass via Unicode
<unicode at unicode.org> wrote:
> An underlying question is whether it's better for profit driven
> corporate interests to determine the emoji repertoire, or to let the set
> evolve naturally based on user community desires.
>
> The QID Emoji proposal enables the latter, which is one of the reasons
> I'm in favor of it.

I don't see it. Profit driven corporate interests may or may not
support QID Emoji; if they don't, it's practically dead in the water.
If they do, Google is going to make a list of corporately supported
emoji, just like what started this, and that's going to be the list of
supported QID emoji. Outside that corporate line, there's going to be
about zero chance anyone tries to use QID emoji, because at most one
in a million QID emoji are going to be supported, so even if you do
want to use an emoji from the Palmer's Chipmunk, what's the right QID
for that? It seems to have three, one for each scientific name, plus
maybe using the genus, Tamias, will be better supported, or Marmotini,
or Sciuridae, but if you're getting that vague, why not use the
existing emoji? So six alternatives for QID emoji, and none of them
will probably work, so why should the user community bother? I guess
instead of bothering Unicode, they could bother Google to add it to
their list and provide fonts, but there's that "profit driven
corporate interests" again.

> Naysayers will point out obstacles in any such approach.  In the English
> language, anyone who considers any and all obstacles insurmountable is
> referred to as a quitter.

People who use the term quitter in such a sense often get a ride from
a police car or ambulance later that night. It's a bizarre word to use
when you've decided Unicode should be a quitter and stop even trying
to manage emoji.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)

From abrahamgross at disroot.org  Fri May 29 02:48:47 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 29 May 2020 07:48:47 +0000 (UTC)
Subject: QID Emoij (was: Re: Wireless Connection Symbol)
In-Reply-To: <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
 <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
Message-ID: <b536eff7-f6dd-4666-a9c6-762cb8645abc@disroot.org>

What if, instead of using a QID, which on its own is just a meaningless number, we do tag_start, then write out a word in tag letters, then close it off with a tag_end? This way, if your device doesn't have the font, it can fall back to rendering the tags as regular characters (with an indicator that an emoji is supposed to be there).

Example:
Suppose I type: "I love ?triceratops?" (with ? being tag_start; ? being tag_end; and everything between being tag ascii characters E0020-E007E),
My phone, that has the correct font installed, will render it as

 [cid:eu.faircode.email.996]

While for someone else without the font, it would render as something like ?I love ?????????????

 [cid:eu.faircode.email.998]

This way the intent of the original message will carry across to the reader correctly whether they have the correct font or not. This makes it better than PUA because the information is not completely lost, and as another bonus, screen readers would be able to read this just fine (i.e. "I love ?triceratops?", where ? represents a tone.)

I understand that this uses more space than QID sequences, but I think the payoff of not having to worry about ?/?/?/? everywhere is worth it.


From marius.spix at web.de  Fri May 29 03:34:45 2020
From: marius.spix at web.de (Marius Spix)
Date: Fri, 29 May 2020 10:34:45 +0200
Subject: Aw: Re: Wireless Connection Symbol
In-Reply-To: <3A410044-B4F7-4489-8CC4-3562D4E84ADE@bahnhof.se>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <CAN49p6qjUx9JzRWYk8TO0AVZOW=1OQsm8v-xFe8Hh1rcbpVM-w@mail.gmail.com>
 <3A410044-B4F7-4489-8CC4-3562D4E84ADE@bahnhof.se>
Message-ID: <trinity-6d2462e9-1225-4e70-99e4-fcce9dbf2b20-1590741285382@3c-app-webde-bs56>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/mailman/private/unicode/attachments/20200529/f1c8da82/attachment.htm>

From richard.wordingham at ntlworld.com  Fri May 29 03:38:43 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 29 May 2020 09:38:43 +0100
Subject: ISO 14651/14652 vs Unicode sorting
In-Reply-To: <CAN49p6qMxudpcqJB79pDR_iv_aBWodBMaCa3oN8zud3EHb0RXg@mail.gmail.com>
References: <20200528084225.mglngkxuhyc77vnf@math.berkeley.edu>
 <20200528223406.1efd0d36@JRWUBU2>
 <CAN49p6qMxudpcqJB79pDR_iv_aBWodBMaCa3oN8zud3EHb0RXg@mail.gmail.com>
Message-ID: <20200529093843.36ccfa3a@JRWUBU2>

On Thu, 28 May 2020 19:14:05 -0700
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> On Thu, May 28, 2020 at 5:05 PM Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:  
> 
> > Does ICU now support the ISO 14651 default for NFD strings of
> > assigned characters?  I thought supporting DUCET was abandoned
> > several years ago.

> ICU uses the CLDR sort order which is a mild tailoring of the DUCET.

Depending on what you mean by tailoring.  By definition, it's not a
tailoring in the CLDR sense!

I'll take that answer as 'no'.  ICU used to claim to support DUCET, and
I suspect that there is a tailoring that will get through the UCA
compliance testing if restricted to assigned characters.  However, it
seems that the ICU/CLDR teams decided that that goal wasn't worth the
effort.

Richard.

From jameskasskrv at gmail.com  Fri May 29 04:14:59 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 29 May 2020 09:14:59 +0000
Subject: QID Emoij
In-Reply-To: <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
 <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
Message-ID: <28b102fc-8b7f-9f36-fd9a-942e3eb868f5@gmail.com>


On 2020-05-29 7:40 AM, David Starner wrote:
> I don't see it. Profit driven corporate interests may or may not
> support QID Emoji; if they don't, it's practically dead in the water.

This discounts the probability that third-partiers would step up to the 
plate.
> If they do, Google is going to make a list of corporately supported
> emoji, just like what started this, and that's going to be the list of
> supported QID emoji. Outside that corporate line, there's going to be
> about zero chance anyone tries to use QID emoji, because at most one
> in a million QID emoji are going to be supported, so even if you do
> want to use an emoji from the Palmer's Chipmunk, what's the right QID
> for that? It seems to have three, one for each scientific name, plus
> maybe using the genus, Tamias, will be better supported, or Marmotini,
> or Sciuridae, but if you're getting that vague, ...
Aren't there far more than three ways to express the concept "Hello" 
using valid Unicode strings?  If *that* had been deemed an 
insurmountable obstacle, we'd still be limited to ASCII-English.
>> Naysayers will point out obstacles in any such approach.  In the English
>> language, anyone who considers any and all obstacles insurmountable is
>> referred to as a quitter.
> People who use the term quitter in such a sense often get a ride from
> a police car or ambulance later that night. It's a bizarre word to use
> when you've decided Unicode should be a quitter and stop even trying
> to manage emoji.
>
I've decided that Unicode has no business limiting an evolving set of 
symbols.


From prosfilaes at gmail.com  Fri May 29 05:32:31 2020
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 29 May 2020 03:32:31 -0700
Subject: QID Emoij
In-Reply-To: <28b102fc-8b7f-9f36-fd9a-942e3eb868f5@gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
 <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
 <28b102fc-8b7f-9f36-fd9a-942e3eb868f5@gmail.com>
Message-ID: <CAMZ=zj6NV0P7TSJd4aQFXsyPGNvHjUhef0Xq14fFWDN2H22t+w@mail.gmail.com>

On Fri, May 29, 2020 at 2:15 AM James Kass <jameskasskrv at gmail.com> wrote:
> On 2020-05-29 7:40 AM, David Starner wrote:
> > I don't see it. Profit driven corporate interests may or may not
> > support QID Emoji; if they don't, it's practically dead in the water.
>
> This discounts the probability that third-partiers would step up to the
> plate.

How? If you can't send it in email to arbitrary systems or in text
messages in arbitrary systems and have it show up right, who is going to
use it?

> > If they do, Google is going to make a list of corporately supported
> > emoji, just like what started this, and that's going to be the list of
> > supported QID emoji. Outside that corporate line, there's going to be
> > about zero chance anyone tries to use QID emoji, because at most one
> > in a million QID emoji are going to be supported, so even if you do
> > want to use an emoji from the Palmer's Chipmunk, what's the right QID
> > for that? It seems to have three, one for each scientific name, plus
> > maybe using the genus, Tamias, will be better supported, or Marmotini,
> > or Sciuridae, but if you're getting that vague, ...
>
> Aren't there far more than three ways to express the concept "Hello"
> using valid Unicode strings?  If *that* had been deemed an
> insurmountable obstacle, we'd still be limited to ASCII-English.

That's not exactly comparable. I'm looking for a way to pass an image
of a Palmer's Chipmunk, and am willing to accept fallbacks. With QID
emoji, there's no way for me to know what will work, nor any way for an
implementer to know which one I will use. On the contrary, there is
one correct way to express "Hello" in Unicode, as a series of five
codepoints encoded in the Basic Latin block. Redundant encoding, where
it's ambiguous which character to use, is frowned upon in Unicode, and
most codepoints in Unicode are generally supported unless they're for
a poorly supported script, in which case the problems can be
anticipated.
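
As a small illustration of the "five codepoints in the Basic Latin
block" point, the following sketch lists the code points of the string
and checks that each falls in U+0000..U+007F. The variable names are
only illustrative.

```python
# The word under discussion, as an ordinary Python string.
word = "Hello"

# One code point per character, formatted in the usual U+XXXX notation.
code_points = [f"U+{ord(ch):04X}" for ch in word]

# The Basic Latin block covers U+0000 through U+007F.
in_basic_latin = all(ord(ch) <= 0x7F for ch in word)

print(code_points)
print(in_basic_latin)
```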

> >> Naysayers will point out obstacles in any such approach.  In the English
> >> language, anyone who considers any and all obstacles insurmountable is
> >> referred to as a quitter.
> > People who use the term quitter in such a sense often get a ride from
> > a police car or ambulance later that night. It's a bizarre word to use
> > when you've decided Unicode should be a quitter and stop even trying
> > to manage emoji.
> >
> I've decided that Unicode has no business limiting an evolving set of
> symbols.

Why don't you do this yourself? You could have QID emoji codepoints in
the PUA, and everyone would flock to supporting them. Any obstacles
you point out in that just show that you're a quitter.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)

From jameskasskrv at gmail.com  Fri May 29 06:56:07 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 29 May 2020 11:56:07 +0000
Subject: QID Emoij
In-Reply-To: <CAMZ=zj6NV0P7TSJd4aQFXsyPGNvHjUhef0Xq14fFWDN2H22t+w@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
 <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
 <28b102fc-8b7f-9f36-fd9a-942e3eb868f5@gmail.com>
 <CAMZ=zj6NV0P7TSJd4aQFXsyPGNvHjUhef0Xq14fFWDN2H22t+w@mail.gmail.com>
Message-ID: <13fd69f3-20ec-f36b-0806-30dad8f782f5@gmail.com>


On 2020-05-29 10:32 AM, David Starner wrote:
> On Fri, May 29, 2020 at 2:15 AM James Kass <jameskasskrv at gmail.com> wrote:
>> On 2020-05-29 7:40 AM, David Starner wrote:
>>> I don't see it. Profit driven corporate interests may or may not
>>> support QID Emoji; if they don't, it's practically dead in the water.
>> This discounts the probability that third-partiers would step up to the
>> plate.
> How? If you can't send it in email to arbitrary systems or in text
> messages in arbitrary systems and have it show up right, who is going
> to use it?
>
The same kind of people who used Unicode Indic when it wasn't supported; 
trailblazers, pioneers, and enthusiasts.

Third-party folks offered scripts to convert to-and-from Indic Unicode
and various pre-Unicode Indic font mappings.  Freely downloadable ones,
at that.  An example:
http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML014/0753.html

>> Aren't there far more than three ways to express the concept "Hello"
>> using valid Unicode strings?  If *that* had been deemed an
>> insurmountable obstacle, we'd still be limited to ASCII-English.
> That's not exactly comparable. I'm looking for a way to pass an image
> of a Palmer's Chipmunk, and am willing to accept fallbacks.
Always glad to help a brother out.  It can be done in-line, if
desired.  Here are the instructions I found on-line:
Insert an Image Inline in an Email with Mozilla Thunderbird

1. Create a new message in Mozilla Thunderbird.
2. Put the cursor where you want the image to appear in the body of the
   email.
3. Select Insert > Image from the menu.
4. Use the Choose File... ...
5. Type a short textual description of the image under Alternate text: ...
6. Click OK.


>   With QID
> emoji, there's no way for me to know what will work, nor any way for an
> implementer to know which one I will use. On the contrary, there is
> one correct way to express "Hello" in Unicode, as a series of five
> codepoints encoded in the Basic Latin block.

I was going for the concept of "hello" rather than the spelling of the
English word for it.  But even in Basic Latin English, there's more than
one way to write the word.

hello... Hello... HELLO!

Even in Basic Latin English there's more than that for the concept.

Hello, good day, good morning/evening/afternoon, howdy...

Beyond English there's a myriad of ways.

Bonjour, guten tag, ????????????, &c.
>> I've decided that Unicode has no business limiting an evolving set of
>> symbols.
> Why don't you do this yourself?
Because I don't have any business limiting an evolving set of symbols,
either.  I'd rather go for a ride in a paddy-wagon.


From richard.wordingham at ntlworld.com  Fri May 29 09:39:17 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 29 May 2020 15:39:17 +0100
Subject: QID Emoij
In-Reply-To: <CAMZ=zj6NV0P7TSJd4aQFXsyPGNvHjUhef0Xq14fFWDN2H22t+w@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
 <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
 <28b102fc-8b7f-9f36-fd9a-942e3eb868f5@gmail.com>
 <CAMZ=zj6NV0P7TSJd4aQFXsyPGNvHjUhef0Xq14fFWDN2H22t+w@mail.gmail.com>
Message-ID: <20200529153917.6e7a94fb@JRWUBU2>

On Fri, 29 May 2020 03:32:31 -0700
David Starner via Unicode <unicode at unicode.org> wrote:

> Why don't you do this yourself? You could have QID emoji codepoints in
> the PUA, and everyone would flock to supporting them. Any obstacles
> you point out in that just show that you're a quitter.

But there are only a couple of planes in the PUA!  There's one database
(the Barcode of Life Data Systems) with over 190,000 species.
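
The arithmetic behind that objection can be sketched as follows. The
three private-use ranges are the ones defined by the Unicode Standard;
the figure of 190,000 is simply the lower bound quoted above, and the
variable names are illustrative.

```python
# Private-use ranges defined by the Unicode Standard (inclusive bounds).
PUA_RANGES = {
    "BMP Private Use Area": (0xE000, 0xF8FF),
    "Plane 15 (Supplementary Private Use Area-A)": (0xF0000, 0xFFFFD),
    "Plane 16 (Supplementary Private Use Area-B)": (0x100000, 0x10FFFD),
}

# Lower bound on the species count quoted in the message ("over 190,000").
SPECIES_NEEDED = 190_000


def range_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1


# Total private-use code points across all three ranges.
pua_total = sum(range_size(first, last) for first, last in PUA_RANGES.values())

# How many species would be left without a code point of their own.
shortfall = SPECIES_NEEDED - pua_total

print(pua_total, shortfall)
```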

Richard.

From markus.icu at gmail.com  Fri May 29 10:45:15 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 29 May 2020 08:45:15 -0700
Subject: QID Emoji (from Re: Wireless Connection Symbol)
In-Reply-To: <4da8599e-8e5d-49d7-9701-be8d227bfbfd@disroot.org>
References: <4bddc04c.14fe.1725c7ae4cd.Webtop.41@btinternet.com>
 <4da8599e-8e5d-49d7-9701-be8d227bfbfd@disroot.org>
Message-ID: <CAN49p6ppKVoacgwqiCdE922C=2PRL73Q4bfZiAxwh6_7fyNSbA@mail.gmail.com>

On Fri, May 29, 2020 at 1:50 AM abrahamgross--- via Unicode <
unicode at unicode.org> wrote:

> What if instead of using a QID which on its own is just meaningless
> numbers, we do tag_start, then write out a word in the tag letters, then
> close it off with a tag_end. this way, if your device doesn't have the
> font, it can fall back by rendering the tags through regular character
> (with an indicator that an emoji is supposed to be there)
>
> Example:
> Suppose I type: "I love ?triceratops?" (with ? being tag_start; ? being
> tag_end; and everything between being tag ascii characters E0020-E007E),
>

I think many people would not like the limitation to the ASCII repertoire
which requires that the word is in English or a few other languages where
all or some words can be reasonably spelled that way.
Part of the attraction of symbols is also a lesser dependence on language.
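
A minimal sketch of the quoted tag-letter idea, and of the ASCII
limitation noted above. Assumptions: the quoted message does not say
which code points tag_start and tag_end would be, so TAG_START below is
a hypothetical private-use placeholder and TAG_END borrows U+E007F
CANCEL TAG; the function names and the "[emoji: ...]" fallback format
are likewise made up for illustration.

```python
TAG_OFFSET = 0xE0000        # tag characters mirror ASCII at U+E0020..U+E007E
TAG_START = "\U000F0000"    # hypothetical placeholder, not a real tag_start
TAG_END = "\U000E007F"      # CANCEL TAG, assumed here as the terminator


def to_tag_sequence(word):
    """Spell a word in tag characters between TAG_START and TAG_END."""
    tagged = []
    for ch in word:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            # Only printable ASCII has a tag-character counterpart.
            raise ValueError(f"no tag character for U+{cp:04X}")
        tagged.append(chr(TAG_OFFSET + cp))
    return TAG_START + "".join(tagged) + TAG_END


def fallback_text(sequence):
    """Render a tag sequence as visible text when no glyph is available."""
    if not (sequence.startswith(TAG_START) and sequence.endswith(TAG_END)):
        raise ValueError("not a tag sequence")
    inner = sequence[len(TAG_START):-len(TAG_END)]
    letters = []
    for ch in inner:
        cp = ord(ch) - TAG_OFFSET
        if not 0x20 <= cp <= 0x7E:
            raise ValueError("unexpected character inside tag sequence")
        letters.append(chr(cp))
    return "[emoji: " + "".join(letters) + "]"


print(fallback_text(to_tag_sequence("triceratops")))
```

A word such as "smörgåsbord" makes to_tag_sequence raise ValueError,
which is the repertoire limitation in question.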

markus

From wjgo_10009 at btinternet.com  Fri May 29 11:32:08 2020
From: wjgo_10009 at btinternet.com (wjgo_10009 at btinternet.com)
Date: Fri, 29 May 2020 17:32:08 +0100 (BST)
Subject: QID Emoji (from Re: Wireless Connection Symbol)
Message-ID: <4a8a55.7fb.1726146cb6d.Webtop.229@btinternet.com>

Re: QID Emoji (from Re: Wireless Connection Symbol)

!123

Markus Scherer wrote as follows.

> I think many people would not like the limitation to the ASCII 
> repertoire which requires that the word is in English or a few other 
> languages where all or some words can be reasonably spelled that way.

> Part of the attraction of symbols is also a lesser dependence on 
> language.

http://www.users.globalnet.co.uk/~ngo/locse027.pdf

In another thread, (Re: QID Emoij), James Kass wrote as follows.

> Even in Basic Latin English there's more than that for the concept.

> Hello, good day, good morning/evening/afternoon, howdy...

> Beyond English there's a myriad of ways.

> Bonjour, guten tag, ????????????, &c.

There is also, in my research project,

!123

http://www.users.globalnet.co.uk/~ngo/A_List_of_Code_Numbers_and_English_Localizations_for_use_in_Research_on_Communication_through_the_Language_Barrier_using_encoded_Localizable_Sentences.pdf

There are also other research documents and two novels, one novel 
completed in 2019, and a sequel, being published as chapters as each 
chapter is completed.

They are available online, free to read, no registration needed, and the 
webspace is hosted on a PlusNet server, not on my computer. I upload 
over the internet. All of the documents and the novels are deposited for 
Legal Deposit with The British Library, and Legal Deposit receipted by 
The British Library.

http://www.users.globalnet.co.uk/~ngo/

!987

William Overington

Friday 29 May 2020


From asmusf at ix.netcom.com  Fri May 29 12:52:46 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 29 May 2020 10:52:46 -0700
Subject: QID Emoij
In-Reply-To: <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
Message-ID: <2ff76225-1311-4133-f293-5c06807e28cb@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/mailman/private/unicode/attachments/20200529/44bc9744/attachment.htm>

From asmusf at ix.netcom.com  Fri May 29 12:53:46 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 29 May 2020 10:53:46 -0700
Subject: QID Emoij
In-Reply-To: <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <5c735b52-3979-5d53-48a3-62f7db29cc17@gmail.com>
 <CAMZ=zj4+1w6_Zcqj3+mOfUMm01KwcoB+fDc2-UzXMdWhAu2GeQ@mail.gmail.com>
Message-ID: <810a5e1a-6bcb-7b4f-4d6a-1e7e452a8e0e@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/mailman/private/unicode/attachments/20200529/7cf3aaa9/attachment.htm>

From kent.b.karlsson at bahnhof.se  Sat May 30 15:30:46 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sat, 30 May 2020 22:30:46 +0200
Subject: Wireless Connection Symbol
In-Reply-To: <trinity-6d2462e9-1225-4e70-99e4-fcce9dbf2b20-1590741285382@3c-app-webde-bs56>
References: <000001d632c3$249f75d0$6dde6170$@ewellic.org>
 <0B8B641E-A5A1-467B-BF19-C6FDF3FD5F0F@evertype.com>
 <29d231d7-6d5f-f840-e3d9-6b178e871feb@gmail.com>
 <74c6facd-1703-337d-2105-f959af00cd34@ix.netcom.com>
 <3fccb199-a5da-83ef-de23-62406f5cec60@gmail.com>
 <0840AF94-25F2-4682-B66C-803B741625B7@bahnhof.se>
 <CAN49p6qjUx9JzRWYk8TO0AVZOW=1OQsm8v-xFe8Hh1rcbpVM-w@mail.gmail.com>
 <3A410044-B4F7-4489-8CC4-3562D4E84ADE@bahnhof.se>
 <trinity-6d2462e9-1225-4e70-99e4-fcce9dbf2b20-1590741285382@3c-app-webde-bs56>
Message-ID: <2C36D52C-77BA-488F-86A1-18C57088D12D@bahnhof.se>


> On 29 May 2020, at 10:34, Marius Spix via Unicode <unicode at unicode.org> wrote:
> 
> What about using the icon URI scheme to represent arbitrary emoji?
>  
> https://tools.ietf.org/id/draft-lafayette-icon-uri-scheme-00.html
>  
> This would allow stuff like <img src="icon:animals:dinosaurs:archaeopteryx" alt="Archaeopteryx"/>

Thanks for the reference. At a quick look, it seems to be a fair idea. But that proposal was for file type icons only (mostly based on file suffix). So "icon:animals:dinosaurs:archaeopteryx" is not covered by that proposal (though "icon:.pdf" is). It also has a number of quirks. And the proposal was "dead in the water" according to the (presumed) author. But let's assume a similar "emoji icon" URI type. One would then need to have some reasonable, and agreed upon, "universal" way of referring to "emoji icons", to tell which one is referenced.

The displayed size of the "emoji icon" (not the pixel sizes, of which there are at least two: the original and the final display size, the latter depending on various resize operations, including zoom) must be overridable by a style="width:1em;height:1em;" (or similar) style (perhaps that too abbreviated; even implicitly referring to a user preference for emoji size). Emoji are (sort of) text, so their glyphs should follow the size of the text. Nit: 1em by 1em may be a bit small, esp. for running text. 1.5em by 1.5em or even bigger can make it much easier to see what is in the image. (Some (chat) applications already display emoji larger when alone in a line?)
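
A rough sketch of the sizing idea: a helper that emits an img element
whose width and height follow the surrounding font size. Assumptions:
the function name emoji_img is made up, and the "icon:" reference in
the usage line is the hypothetical one from the quoted message, not
something the icon URI draft covers.

```python
from html import escape


def emoji_img(src, alt, size_em=1.0):
    """Build an <img> element sized in em units so it scales with the text."""
    # ":g" drops a trailing ".0", so 1.0 becomes "1" and 1.5 stays "1.5".
    style = f"width:{size_em:g}em;height:{size_em:g}em;"
    # Escape attribute values so quotes or ampersands cannot break the markup.
    safe_src = escape(src, quote=True)
    safe_alt = escape(alt, quote=True)
    return f'<img src="{safe_src}" alt="{safe_alt}" style="{style}"/>'


print(emoji_img("icon:animals:dinosaurs:archaeopteryx", "Archaeopteryx", 1.5))
```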

Note that though I used HTML in the examples, embedding images (whether it is "whatever image", "icons" or even "generalized emoji") is not limited to HTML of course.

/Kent Karlsson


> Regards,
>  
> Marius
>  
> Sent: Friday, 29 May 2020 at 00:19
> From: "Kent Karlsson via Unicode" <unicode at unicode.org>
> To: "Markus Scherer" <markus.icu at gmail.com>
> Cc: "Unicode" <unicode at unicode.org>
> Subject: Re: Wireless Connection Symbol
>  
>  
> On 28 May 2020, at 04:53, Markus Scherer via Unicode <unicode at unicode.org> wrote:
>  
> On Wed, May 27, 2020 at 6:20 PM Kent Karlsson via Unicode <unicode at unicode.org> wrote:
> Granted, it is not plain text. But emoji are already pushing "out of" plain text as we knew it. And... I recall an argument (years ago) saying essentially
> "these will be the only emoji encoded, the recommendation for expansion is to use images instead". That seems to have been forgotten...
>  
> Not entirely forgotten...
>  
> http://www.unicode.org/reports/tr51/#Longer_Term
>  
> markus
>  
> Ok. Thanks for pointing that out. Glad it is not entirely forgotten.
>  
> One little nit:
> "Other features required to make embedded graphics work well include the ability of images to scale with font size"
>  
> That sounds a little bit like one was requiring a small revolution in image rendering. But of course it is not (ok, HTML again):
>  
>      <img src="???" alt="??" style="width:1em;height:1em;"/>
>  
> (1em being the typical height and width of emoji glyphs.)
>  
> /Kent K
>  


From richard.wordingham at ntlworld.com  Sun May 31 20:52:21 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 1 Jun 2020 02:52:21 +0100
Subject: Bengali script repha
Message-ID: <20200601025221.50a434ce@JRWUBU2>

Which consonants in the Bengali script may form repha?  In particular,
is the encoding just <U+09B0 BENGALI LETTER RA, U+09CD BENGALI SIGN
VIRAMA>, or can it legitimately also be encoded as <U+09F0 BENGALI
LETTER RA WITH MIDDLE DIAGONAL, U+09CD>?

The problem we are having is that some people are encoding Pali 'v' as
U+09F0 BENGALI LETTER RA WITH MIDDLE DIAGONAL, rather than U+09F1
BENGALI LETTER RA WITH LOWER DIAGONAL, and in some fonts (renderers?)
the 'vy' cluster is being rendered with a repha as though it were
<U+09B0, U+09CD, U+09AF BENGALI LETTER YA>.

Should using the joining sequence <U+200D, U+09CD> prevent repha
formation when the preceding character is U+09B0?  TUS appears to only
define the behaviour between RA and YA, not between RA and other
letters.
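
For anyone wanting to reproduce the test strings, the sequences under
discussion can be built and listed as below. This only constructs the
strings and prints their code points with their Unicode character
names; whether a given font or renderer forms repha for each of them is
exactly the open question. The dictionary keys are informal labels.

```python
import unicodedata

RA = "\u09B0"          # BENGALI LETTER RA
RA_MIDDLE = "\u09F0"   # BENGALI LETTER RA WITH MIDDLE DIAGONAL
RA_LOWER = "\u09F1"    # BENGALI LETTER RA WITH LOWER DIAGONAL
VIRAMA = "\u09CD"      # BENGALI SIGN VIRAMA
YA = "\u09AF"          # BENGALI LETTER YA
ZWJ = "\u200D"         # ZERO WIDTH JOINER

# Informal labels for the sequences mentioned in the message.
SEQUENCES = {
    "RA, VIRAMA, YA": RA + VIRAMA + YA,
    "RA WITH MIDDLE DIAGONAL, VIRAMA, YA": RA_MIDDLE + VIRAMA + YA,
    "RA WITH LOWER DIAGONAL, VIRAMA, YA": RA_LOWER + VIRAMA + YA,
    "RA, ZWJ, VIRAMA, YA": RA + ZWJ + VIRAMA + YA,
}


def describe(sequence):
    """List each code point of a string with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]


for label, sequence in SEQUENCES.items():
    print(label, "->", sequence)
    for line in describe(sequence):
        print("   ", line)
```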

Richard.