From jameskass at code2001.com  Mon Nov  1 10:11:57 2021
From: jameskass at code2001.com (James Kass)
Date: Mon, 1 Nov 2021 15:11:57 +0000
Subject: Tales from the archives
Message-ID: <45361202-4125-d8d4-f9b8-542dc34a7467@code2001.com>


Recently someone mentioned how a public list thread can generate nuggets 
of insight even when the topic being discussed may be controversial 
and/or the thread might have a tendency to veer off topic.? Reviewing 
threads spanning April and May of 2004 in this list?s archives affirms 
the accuracy of that observation.

While the threads being reviewed were ongoing, there were other 
conversations related to Unicode, such as why UTF-8 worked for Plane Two 
in a certain browser, but didn?t work for Plane One.? Additional 
discussions covered planned extensions to existing blocks as well as 
scripts which might be encoded in the future (as most of them were).

But the discussion I examined was related to a script proposal by 
Michael Everson.

Real and imaginary characters were brought into the discussion, such as 
George Custer, Zaphod Beeblebrox, Ezra the Scribe (and Ezra the font), 
Martin Bormann, Potter Stewart, Cerberus the three-headed dog, Popeye 
the Sailor Man, Alexander the Great, and Hannibal (not Lecter) ? some of 
whom might be considered off topic.? Even Chang and Eng popped up.? A 
neologism was coined which never gained currency.? The thread and its 
spawn became so popular that the topic itself was banned from further 
discussion.

During the threads, Michael Everson shared information about how he, Ken 
Whistler, and Rick McGowan had set up the Roadmaps guided by the history 
and long-established studies of the world?s writing systems.? Various 
posters provided insight to UTC deliberations and considerations as well 
as procedural information about other standards bodies.? Definitions of 
some words as used in Unicode jargon were compared to how those same 
words were defined elsewhere, and some of the Unicode usages were 
further clarified.? Some list members offered their backgrounds and 
fields of interests, revealing considerable diversity among members.

At the time, standardizing ancient scripts was fairly novel, so 
precedent and procedure were nascent.

Ken Whistler made a post about determining whether a script should be 
considered which is well worth revisiting:
https://www.unicode.org/mail-arch/unicode-ml/y2004-m05/1138.html

Ken expressed the concepts clearly using language and phrasing 
understandable even to the casual list visitor.? Not only do those 
principles Ken outlined remain germane today, they are expected to 
continue to guide Unicode into the future.

The Unicode public list archives are a treasure trove of information 
about Unicode and the history of the project.? We should all be thankful 
that they are available and well maintained.

Best regards,

James Kass


From abrahamgross at disroot.org  Tue Nov  2 20:03:08 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 03 Nov 2021 01:03:08 +0000
Subject: New CJK characters
Message-ID: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>

I have a proposal regarding the future of encoding new Unihan characters into Unicode that I'd like to float by this group to see if it makes any sense.

New CJK characters keep on being encoded and it doesn't seem to be slowing down. This is to the point where there are now in unicode 92,856 CJK characters!

I think that going forward, it would make a lot of sense if instead of encoding each new character as a separate codepoint, we adopt a paradigm like that of Sutton SIgnwriting  (https://en.wikipedia.org/wiki/Sutton_SignWriting_(Unicode_block))- where Unicode would provide a set of all radicals and position/sizing modifiers - and anyone that wants to use any arbitrary non-encoded character would be able to just combine the radicals the right way (by using a gui designed for this, ? la glyphwiki.org's or Wenlin's editor), and then be able to use the character right away. This would work because the font would have to support for all the basic strokes, and since all CJK characters are comprised of the basic strokes, the font will be able to put the character together without the need for a font maker to specifically create that character.

This method of "encoding" would solve many problems we have now:
	* Non encoded characters can be used without the need to wait years for the character to be accepted into Unicode, and then a couple more years until the major OSes update their fonts to support the new characters.
	* This is in my opinion a really neat solution to the gaiji problem (described here (https://en.wikipedia.org/wiki/OpenType#SING_gaiji_solution)).
	* This would also give way to much more rapid font development time, since you'd only need to create the basic strokes and some radicals to get a working version of the font, then all other characters would just be refining the exact stroke size/positioning.
	* Most CJK fonts only have a small subset of all available characters. This will allow for all fonts to support any character you wish - including ones you dream up.
	* People have been coming up with new CJK characters for thousands of years, including nowadays (here's a new-kanji competition for example (https://sousaku-kanji.com/archive.html)), but any new characters created nowadays would be extremely hard to get into Unicode since Unicode requires proof of use before they accept a proposal, but how are people supposed to use a character if they can't type it.

I still think that Unicode should keep track of new characters in a Nameslist of sorts so that font makers have a base to go off of.

Q: My (city) name has a character that isn't encoded. How can I type it quickly without needing to open up an editor and creating it each time?
A: Adding them to your IME's dictionary would allow you to just create the character once.
- This can be extended in such a way where an IME can be fully formed entirely out of preconstructed characters instead of codepoints.
Q: What would the specifics of such a system look like behind the scenes?
A: I'm not sure yet, but I think Wenlin's CDL (http://guide.wenlininstitute.org/wenlin4.3/Character_Description_Language) would be a good place to start.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/aa61e293/attachment.htm>

From abrahamgross at disroot.org  Tue Nov  2 20:07:59 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 03 Nov 2021 01:07:59 +0000
Subject: New CJK characters
In-Reply-To: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
Message-ID: <af9a713bf09176f70a661b05ac28c488@disroot.org>

I sent this by mistake while writing it up (before finishing), but you can tell the basic gist of what I was trying to say.

2021?11?2? 21:03, "Abraham Gross via Unicode" <unicode at corp.unicode.org (mailto:unicode at corp.unicode.org?to=%22Abraham%20Gross%20via%20Unicode%22%20<unicode at corp.unicode.org>)> wrote:
I have a proposal regarding the future of encoding new Unihan characters into Unicode that I'd like to float by this group to see if it makes any sense.

New CJK characters keep on being encoded and it doesn't seem to be slowing down. This is to the point where there are now in unicode 92,856 CJK characters!

I think that going forward, it would make a lot of sense if instead of encoding each new character as a separate codepoint, we adopt a paradigm like that of Sutton SIgnwriting  (https://en.wikipedia.org/wiki/Sutton_SignWriting_(Unicode_block))- where Unicode would provide a set of all radicals and position/sizing modifiers - and anyone that wants to use any arbitrary non-encoded character would be able to just combine the radicals the right way (by using a gui designed for this, ? la glyphwiki.org's or Wenlin's editor), and then be able to use the character right away. This would work because the font would have to support for all the basic strokes, and since all CJK characters are comprised of the basic strokes, the font will be able to put the character together without the need for a font maker to specifically create that character.

This method of "encoding" would solve many problems we have now:
	* Non encoded characters can be used without the need to wait years for the character to be accepted into Unicode, and then a couple more years until the major OSes update their fonts to support the new characters.
	* This is in my opinion a really neat solution to the gaiji problem (described here (https://en.wikipedia.org/wiki/OpenType#SING_gaiji_solution)).
	* This would also give way to much more rapid font development time, since you'd only need to create the basic strokes and some radicals to get a working version of the font, then all other characters would just be refining the exact stroke size/positioning.
	* Most CJK fonts only have a small subset of all available characters. This will allow for all fonts to support any character you wish - including ones you dream up.
	* People have been coming up with new CJK characters for thousands of years, including nowadays (here's a new-kanji competition for example (https://sousaku-kanji.com/archive.html)), but any new characters created nowadays would be extremely hard to get into Unicode since Unicode requires proof of use before they accept a proposal, but how are people supposed to use a character if they can't type it.

I still think that Unicode should keep track of new characters in a Nameslist of sorts so that font makers have a base to go off of.

Q: My (city) name has a character that isn't encoded. How can I type it quickly without needing to open up an editor and creating it each time?
A: Adding them to your IME's dictionary would allow you to just create the character once.
- This can be extended in such a way where an IME can be fully formed entirely out of preconstructed characters instead of codepoints.
Q: What would the specifics of such a system look like behind the scenes?
A: I'm not sure yet, but I think Wenlin's CDL (http://guide.wenlininstitute.org/wenlin4.3/Character_Description_Language) would be a good place to start.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/5139326b/attachment.htm>

From abrahamgross at disroot.org  Tue Nov  2 20:09:07 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 03 Nov 2021 01:09:07 +0000
Subject: New CJK characters
In-Reply-To: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
Message-ID: <2fac5213041386c0a733a48843f9b280@disroot.org>

I sent this by mistake while writing it up (before finishing), but you can tell the basic gist of what I was trying to say.

2021?11?2? 21:03, "Abraham Gross via Unicode" <unicode at corp.unicode.org (mailto:unicode at corp.unicode.org?to=%22Abraham%20Gross%20via%20Unicode%22%20<unicode at corp.unicode.org>)> wrote:
I have a proposal regarding the future of encoding new Unihan characters into Unicode that I'd like to float by this group to see if it makes any sense.

New CJK characters keep on being encoded and it doesn't seem to be slowing down. This is to the point where there are now in unicode 92,856 CJK characters!

I think that going forward, it would make a lot of sense if instead of encoding each new character as a separate codepoint, we adopt a paradigm like that of Sutton SIgnwriting  (https://en.wikipedia.org/wiki/Sutton_SignWriting_(Unicode_block))- where Unicode would provide a set of all radicals and position/sizing modifiers - and anyone that wants to use any arbitrary non-encoded character would be able to just combine the radicals the right way (by using a gui designed for this, ? la glyphwiki.org's or Wenlin's editor), and then be able to use the character right away. This would work because the font would have to support for all the basic strokes, and since all CJK characters are comprised of the basic strokes, the font will be able to put the character together without the need for a font maker to specifically create that character.

This method of "encoding" would solve many problems we have now:
	* Non encoded characters can be used without the need to wait years for the character to be accepted into Unicode, and then a couple more years until the major OSes update their fonts to support the new characters.
	* This is in my opinion a really neat solution to the gaiji problem (described here (https://en.wikipedia.org/wiki/OpenType#SING_gaiji_solution)).
	* This would also give way to much more rapid font development time, since you'd only need to create the basic strokes and some radicals to get a working version of the font, then all other characters would just be refining the exact stroke size/positioning.
	* Most CJK fonts only have a small subset of all available characters. This will allow for all fonts to support any character you wish - including ones you dream up.
	* People have been coming up with new CJK characters for thousands of years, including nowadays (here's a new-kanji competition for example (https://sousaku-kanji.com/archive.html)), but any new characters created nowadays would be extremely hard to get into Unicode since Unicode requires proof of use before they accept a proposal, but how are people supposed to use a character if they can't type it.

I still think that Unicode should keep track of new characters in a Nameslist of sorts so that font makers have a base to go off of.

Q: My (city) name has a character that isn't encoded. How can I type it quickly without needing to open up an editor and creating it each time?
A: Adding them to your IME's dictionary would allow you to just create the character once.
- This can be extended in such a way where an IME can be fully formed entirely out of preconstructed characters instead of codepoints.
Q: What would the specifics of such a system look like behind the scenes?
A: I'm not sure yet, but I think Wenlin's CDL (http://guide.wenlininstitute.org/wenlin4.3/Character_Description_Language) would be a good place to start.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/c2aebf88/attachment.htm>

From jameskass at code2001.com  Tue Nov  2 22:34:47 2021
From: jameskass at code2001.com (James Kass)
Date: Wed, 3 Nov 2021 03:34:47 +0000
Subject: New CJK characters
In-Reply-To: <2fac5213041386c0a733a48843f9b280@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <2fac5213041386c0a733a48843f9b280@disroot.org>
Message-ID: <edd64fb4-9f76-143b-ac9b-1d6450c2bdbf@code2001.com>


On 2021-11-03 1:09 AM, Abraham Gross via Unicode wrote:
> Q: What would the specifics of such a system look like behind the scenes?
> A: I'm not sure yet, but I think Wenlin's CDL (http://guide.wenlininstitute.org/wenlin4.3/Character_Description_Language) would be a good place to start.

This web page gives an overview of some of the approaches:
https://everything.explained.today/Chinese_character_description_languages/

Wenlin's approach is quite sophisticated and has been around for a 
while.? A quick web search didn't turn up any previous proposals for 
getting Wenlin's CDL enshrined in Unicode, although Richard Cook has 
submitted various encoding proposals over the years.? If Wenlin 
personnel never floated any CDL-related proposal, it may be that they 
themselves consider such an approach to be out of scope for plain text.

As many of us know, Andrew West maintains a list of IDS for encoded Han 
characters, available here:
https://www.babelstone.co.uk/CJK/index.html
Using IDS to generate glyphs on the fly might be workable, although such 
an approach might well be relegated to a higher level protocol.? 
Meanwhile an IDS can already be stored and exchanged in a standard 
fashion.? Counting how many of any IDS for an as yet unencoded ideograph 
exist in plain text might help to establish usage for future encoding 
consideration.

Ken Whistler crunched some numbers about CJK additions here:
https://www.unicode.org/mail-arch/unicode-ml/y2018-m03/0023.html

Additional information about CJK proliferation can be found here:
https://www.babelstone.co.uk/Blog/2007/07/cjk-unified-ideographs-to-infinity-and.html


From 747.neutron at gmail.com  Tue Nov  2 23:14:24 2021
From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=)
Date: Wed, 3 Nov 2021 13:14:24 +0900
Subject: Fwd: New CJK characters
In-Reply-To: <CAF5KyEww=gNiUf0uEkYsRVrPbd6DR8eAfe9yKBNB6+rbhtDBJg@mail.gmail.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <2fac5213041386c0a733a48843f9b280@disroot.org>
 <CAF5KyEww=gNiUf0uEkYsRVrPbd6DR8eAfe9yKBNB6+rbhtDBJg@mail.gmail.com>
Message-ID: <CAF5KyEwDvi8BcU5bquvYXxxvq42TrZ64idh9_5hYk1-N46xPnQ@mail.gmail.com>

I just noticed that my message wasn't sent to the mail list.

---------- Forwarded message ---------
From: W?ng Yif?n <747.neutron at gmail.com>
Date: 2021?11?3?(?) 10:56
Subject: Re: New CJK characters
To: <abrahamgross at disroot.org>


FWIW I was told that BabelStone utilizes a mechanic that glues each
element of IDS with WJ just like composite emoji (not dynamic though).
It may be useful if such kind of notation gets any official recognition.
Also see https://zi.tools/?secondary=ids

> Q: What would the specifics of such a system look like behind the scenes?
> A: I'm not sure yet, but I think Wenlin's CDL would be a good place to start.

I think we need a separation of concerns here. CDL looks more of a
font-level technology to me. Whether it will be adoptable or not, a
more plain text format in Unicode sequence, if not IDS, will be surely
required separately as the input to fonts.


2021?11?3?(?) 10:11 Abraham Gross via Unicode <unicode at corp.unicode.org>:

>
> I sent this by mistake while writing it up (before finishing), but you can tell the basic gist of what I was trying to say.
>
> 2021?11?2? 21:03, "Abraham Gross via Unicode" <unicode at corp.unicode.org> wrote:
>
> I have a proposal regarding the future of encoding new Unihan characters into Unicode that I'd like to float by this group to see if it makes any sense.
>
> New CJK characters keep on being encoded and it doesn't seem to be slowing down. This is to the point where there are now in unicode 92,856 CJK characters!
>
> I think that going forward, it would make a lot of sense if instead of encoding each new character as a separate codepoint, we adopt a paradigm like that of Sutton SIgnwriting - where Unicode would provide a set of all radicals and position/sizing modifiers - and anyone that wants to use any arbitrary non-encoded character would be able to just combine the radicals the right way (by using a gui designed for this, ? la glyphwiki.org's or Wenlin's editor), and then be able to use the character right away. This would work because the font would have to support for all the basic strokes, and since all CJK characters are comprised of the basic strokes, the font will be able to put the character together without the need for a font maker to specifically create that character.
>
> This method of "encoding" would solve many problems we have now:
>
> Non encoded characters can be used without the need to wait years for the character to be accepted into Unicode, and then a couple more years until the major OSes update their fonts to support the new characters.
> This is in my opinion a really neat solution to the gaiji problem (described here).
> This would also give way to much more rapid font development time, since you'd only need to create the basic strokes and some radicals to get a working version of the font, then all other characters would just be refining the exact stroke size/positioning.
> Most CJK fonts only have a small subset of all available characters. This will allow for all fonts to support any character you wish - including ones you dream up.
> People have been coming up with new CJK characters for thousands of years, including nowadays (here's a new-kanji competition for example), but any new characters created nowadays would be extremely hard to get into Unicode since Unicode requires proof of use before they accept a proposal, but how are people supposed to use a character if they can't type it.
>
> I still think that Unicode should keep track of new characters in a Nameslist of sorts so that font makers have a base to go off of.
>
> Q: My (city) name has a character that isn't encoded. How can I type it quickly without needing to open up an editor and creating it each time?
> A: Adding them to your IME's dictionary would allow you to just create the character once.
> - This can be extended in such a way where an IME can be fully formed entirely out of preconstructed characters instead of codepoints.
>
> Q: What would the specifics of such a system look like behind the scenes?
> A: I'm not sure yet, but I think Wenlin's CDL would be a good place to start.
>
>
>


From abrahamgross at disroot.org  Tue Nov  2 23:40:59 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 3 Nov 2021 04:40:59 +0000 (UTC)
Subject: Fwd: New CJK characters
In-Reply-To: <CAF5KyEwDvi8BcU5bquvYXxxvq42TrZ64idh9_5hYk1-N46xPnQ@mail.gmail.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <2fac5213041386c0a733a48843f9b280@disroot.org>
 <CAF5KyEww=gNiUf0uEkYsRVrPbd6DR8eAfe9yKBNB6+rbhtDBJg@mail.gmail.com>
 <CAF5KyEwDvi8BcU5bquvYXxxvq42TrZ64idh9_5hYk1-N46xPnQ@mail.gmail.com>
Message-ID: <ddf29a63-3ca0-452c-8619-180248a60509@disroot.org>

Wow I'm really impressed by this tool!
https://zi.tools/?secondary=ids

Examples I tried to test the limits of what it can do:

https://imgur.com/9EMGqvM
https://imgur.com/lkgSGeq

From jameskass at code2001.com  Wed Nov  3 00:27:01 2021
From: jameskass at code2001.com (James Kass)
Date: Wed, 3 Nov 2021 05:27:01 +0000
Subject: Fwd: New CJK characters
In-Reply-To: <ddf29a63-3ca0-452c-8619-180248a60509@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <2fac5213041386c0a733a48843f9b280@disroot.org>
 <CAF5KyEww=gNiUf0uEkYsRVrPbd6DR8eAfe9yKBNB6+rbhtDBJg@mail.gmail.com>
 <CAF5KyEwDvi8BcU5bquvYXxxvq42TrZ64idh9_5hYk1-N46xPnQ@mail.gmail.com>
 <ddf29a63-3ca0-452c-8619-180248a60509@disroot.org>
Message-ID: <f25576ae-94fd-9d5d-e599-f062f02b0355@code2001.com>


On 2021-11-03 4:40 AM, abrahamgross--- via Unicode wrote:
> Wow I'm really impressed by this tool!
> https://zi.tools/?secondary=ids
>
> Examples I tried to test the limits of what it can do:
>
> https://imgur.com/9EMGqvM
> https://imgur.com/lkgSGeq

It is very impressive.? I input an IDS for an as yet unencoded character 
slated for Extension H (???) and was immediately rewarded with a 
beautiful ideograph.

The combos Abraham Gross tried are more complex than that.? I'd say the 
tool passes the tests!

(I would guess that W?ng Yif?n uses component stroke counts in order to 
algorithmically determine the relative heights and widths of the 
components and may well have also assigned "classes" for each 
component's base, top, and so forth to determine how those components 
could be kerned or adjusted for the optimal fit.)

Maybe in the future there will be a conversion feature in a plain text 
editor which would automatically generate ideographs based on IDSs for 
the display.


From A.Schappo at lboro.ac.uk  Wed Nov  3 08:18:41 2021
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Wed, 3 Nov 2021 13:18:41 +0000
Subject: New CJK characters
In-Reply-To: <ddf29a63-3ca0-452c-8619-180248a60509@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <2fac5213041386c0a733a48843f9b280@disroot.org>
 <CAF5KyEww=gNiUf0uEkYsRVrPbd6DR8eAfe9yKBNB6+rbhtDBJg@mail.gmail.com>
 <CAF5KyEwDvi8BcU5bquvYXxxvq42TrZ64idh9_5hYk1-N46xPnQ@mail.gmail.com>
 <ddf29a63-3ca0-452c-8619-180248a60509@disroot.org>
Message-ID: <AS8PR04MB75256861368A3D979A829F57A38C9@AS8PR04MB7525.eurprd04.prod.outlook.com>


I am totally impressed as well. I have just used it to generate a png for a character I created some time ago which I call ???I created it for a friend who has 3 children ?

http://zu.zi.tools/???.png & https://?.??/hao3

Andr? Schappo

________________________________
From: Unicode <unicode-bounces at corp.unicode.org> on behalf of abrahamgross--- via Unicode <unicode at corp.unicode.org>
Sent: 03 November 2021 04:40
To: unicode at corp.unicode.org <unicode at corp.unicode.org>
Subject: Fwd: New CJK characters

** THIS MESSAGE ORIGINATED OUTSIDE LOUGHBOROUGH UNIVERSITY **

Be wary of links or attachments, especially if the email is unsolicited or you don't recognise the sender's email address.

Wow I'm really impressed by this tool!
https://zi.tools/?secondary=ids

Examples I tried to test the limits of what it can do:

https://imgur.com/9EMGqvM
https://imgur.com/lkgSGeq
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/3115597d/attachment.htm>

From 747.neutron at gmail.com  Wed Nov  3 09:27:33 2021
From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=)
Date: Wed, 3 Nov 2021 23:27:33 +0900
Subject: New CJK characters
In-Reply-To: <AS8PR04MB75256861368A3D979A829F57A38C9@AS8PR04MB7525.eurprd04.prod.outlook.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <2fac5213041386c0a733a48843f9b280@disroot.org>
 <CAF5KyEww=gNiUf0uEkYsRVrPbd6DR8eAfe9yKBNB6+rbhtDBJg@mail.gmail.com>
 <CAF5KyEwDvi8BcU5bquvYXxxvq42TrZ64idh9_5hYk1-N46xPnQ@mail.gmail.com>
 <ddf29a63-3ca0-452c-8619-180248a60509@disroot.org>
 <AS8PR04MB75256861368A3D979A829F57A38C9@AS8PR04MB7525.eurprd04.prod.outlook.com>
Message-ID: <CAF5KyExKQHdZZWZnn4QxgfsRAOEtLJXar4Dk_zwSZPG-XMJqMQ@mail.gmail.com>

As you might have noticed, I'm not the developer of the website I
mentioned. It is a great service run by an IRG contributor, I think
you can just join the Telegram to contact the community.

2021?11?3?(?) 22:24 Andre Schappo via Unicode <unicode at corp.unicode.org>:
>
>
> I am totally impressed as well. I have just used it to generate a png for a character I created some time ago which I call ???I created it for a friend who has 3 children ?
>
> http://zu.zi.tools/???.png & https://?.??/hao3
>
> Andr? Schappo
>
> ________________________________
> From: Unicode <unicode-bounces at corp.unicode.org> on behalf of abrahamgross--- via Unicode <unicode at corp.unicode.org>
> Sent: 03 November 2021 04:40
> To: unicode at corp.unicode.org <unicode at corp.unicode.org>
> Subject: Fwd: New CJK characters
>
> ** THIS MESSAGE ORIGINATED OUTSIDE LOUGHBOROUGH UNIVERSITY **
>
> Be wary of links or attachments, especially if the email is unsolicited or you don't recognise the sender's email address.
>
> Wow I'm really impressed by this tool!
> https://zi.tools/?secondary=ids
>
> Examples I tried to test the limits of what it can do:
>
> https://imgur.com/9EMGqvM
> https://imgur.com/lkgSGeq


From pgcon6 at msn.com  Wed Nov  3 12:40:37 2021
From: pgcon6 at msn.com (Peter Constable)
Date: Wed, 3 Nov 2021 17:40:37 +0000
Subject: New CJK characters
In-Reply-To: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
Message-ID: <MWHPR1301MB21126B8DAB4AA28374F3E088868C9@MWHPR1301MB2112.namprd13.prod.outlook.com>

Something to consider: While highlighting potential benefits in relation to characters that are used only very rarely (in general-there might be local exceptions for some place names), you don't mention the problems that would be created for the vast majority of much-more-frequently used ideographs, as well as the down-sides for those rare characters. For example, the IDS scheme would never be supported in IDNA, so that town name could never be used in a domain name.


Peter

From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Abraham Gross via Unicode
Sent: Tuesday, November 2, 2021 6:03 PM
To: unicode at corp.unicode.org
Subject: New CJK characters

I have a proposal regarding the future of encoding new Unihan characters into Unicode that I'd like to float by this group to see if it makes any sense.

New CJK characters keep on being encoded and it doesn't seem to be slowing down. This is to the point where there are now in unicode 92,856 CJK characters!

I think that going forward, it would make a lot of sense if instead of encoding each new character as a separate codepoint, we adopt a paradigm like that of Sutton SIgnwriting <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSutton_SignWriting_(Unicode_block)&data=04%7C01%7C%7C56e383ef29d04433831808d99e66be5c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637714986385950571%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Mn2%2Bhkgt9CeUxH7jLPCp%2F6mU6LbPdqaQdm1JBT1EDBI%3D&reserved=0> - where Unicode would provide a set of all radicals and position/sizing modifiers - and anyone that wants to use any arbitrary non-encoded character would be able to just combine the radicals the right way (by using a gui designed for this, ? la glyphwiki.org's or Wenlin's editor), and then be able to use the character right away. This would work because the font would have to support for all the basic strokes, and since all CJK characters are comprised of the basic strokes, the font will be able to put the character together without the need for a font maker to specifically create that character.

This method of "encoding" would solve many problems we have now:

  1.  Non encoded characters can be used without the need to wait years for the character to be accepted into Unicode, and then a couple more years until the major OSes update their fonts to support the new characters.
  2.  This is in my opinion a really neat solution to the gaiji problem (described here<https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FOpenType%23SING_gaiji_solution&data=04%7C01%7C%7C56e383ef29d04433831808d99e66be5c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637714986385960565%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=x2pfzaVjYj4uEO772mWLtK7C124YlzfPcJXJzGt%2BMCc%3D&reserved=0>).
  3.  This would also give way to much more rapid font development time, since you'd only need to create the basic strokes and some radicals to get a working version of the font, then all other characters would just be refining the exact stroke size/positioning.
  4.  Most CJK fonts only have a small subset of all available characters. This will allow for all fonts to support any character you wish - including ones you dream up.
  5.  People have been coming up with new CJK characters for thousands of years, including nowadays (here's a new-kanji competition for example<https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsousaku-kanji.com%2Farchive.html&data=04%7C01%7C%7C56e383ef29d04433831808d99e66be5c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637714986385960565%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=qq5EI%2FRGpX7IkdBTIRotux%2Bn1AFaltlQKvLxi1%2Buv3I%3D&reserved=0>), but any new characters created nowadays would be extremely hard to get into Unicode since Unicode requires proof of use before they accept a proposal, but how are people supposed to use a character if they can't type it.
I still think that Unicode should keep track of new characters in a Nameslist of sorts so that font makers have a base to go off of.

Q: My (city) name has a character that isn't encoded. How can I type it quickly without needing to open up an editor and creating it each time?
A: Adding them to your IME's dictionary would allow you to just create the character once.
- This can be extended in such a way where an IME can be fully formed entirely out of preconstructed characters instead of codepoints.

Q: What would the specifics of such a system look like behind the scenes?
A: I'm not sure yet, but I think Wenlin's CDL<https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fguide.wenlininstitute.org%2Fwenlin4.3%2FCharacter_Description_Language&data=04%7C01%7C%7C56e383ef29d04433831808d99e66be5c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637714986385970559%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=W8fUMF0C3NRVHwCTnMESXkd3p5CylkTCIdg5PZweO90%3D&reserved=0> would be a good place to start.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/4012532e/attachment-0001.htm>

From jsbien at mimuw.edu.pl  Wed Nov  3 13:13:07 2021
From: jsbien at mimuw.edu.pl (=?utf-8?Q?Janusz_S=2E_Bie=C5=84?=)
Date: Wed, 03 Nov 2021 19:13:07 +0100
Subject: "DOS fonts" (was RE: Breaking barriers)
In-Reply-To: <baa8885e-c292-0436-26d9-6ae999603716@code2001.com> (James Kass's
 message of "Mon, 25 Oct 2021 20:03:50 +0000")
References: <MWHPR1301MB2112ED11ACDB030DE41726BC86839@MWHPR1301MB2112.namprd13.prod.outlook.com>
 <001201d7c9c5$042bd240$0c8376c0$@ewellic.org>
 <baa8885e-c292-0436-26d9-6ae999603716@code2001.com>
Message-ID: <87pmrhf6q4.fsf@mimuw.edu.pl>

On Mon, Oct 25 2021 at 20:03 GMT, James Kass wrote:
> On 2021-10-25 5:23 PM, Doug Ewell via Unicode wrote:
>> Peter Constable wrote:
>>
>>>> A DOS command then enabled users to swap the font-in-use.
>>> As I recall, DOS had no such command. Rather, one needed a utility
>>> that would load the font data into specific memory.
>> I suspect James was thinking of the MODE CON CP SELECT=x command, where 'x' was the code page ID of the desired character set.
> My post was poorly phrased.? "A command entered at the DOS prompt"
> would have been better.? It wasn't a native DOS command.? An internet
> search revealed that typical extensions for the modified/newly created
> fonts included "*.F11" or "*.F12".? I couldn't locate the "*.COM" file
> which swapped the font-in-use in my archives, I can't remember the
> file name.? I did find "8859-5.f16" in a directory, which appears to
> be one I made back in the day.

Switching the font started to be possible with EGA, I used to switch
from CP852 to ISO Latin-2 just for fun. Earlier you had to change the
ROM in your graphic card. For Polish letters you had to to "burn in" (?)
your font into your custom ROM (with UHV light if I remember well).

Regards

JSB

P.S. I read the list in a digest form, so my post may cross with other
relevant posting.

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From abrahamgross at disroot.org  Wed Nov  3 15:44:49 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 3 Nov 2021 20:44:49 +0000 (UTC)
Subject: New CJK characters
In-Reply-To: <MWHPR1301MB21126B8DAB4AA28374F3E088868C9@MWHPR1301MB2112.namprd13.prod.outlook.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <MWHPR1301MB21126B8DAB4AA28374F3E088868C9@MWHPR1301MB2112.namprd13.prod.outlook.com>
Message-ID: <56309b47-32aa-4251-812b-0574da750313@disroot.org>

If a new/rare character is composed of an IDS sequence (ex: ?????) like many emojis, then it should be able to be represented on URLs just fine
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/0da955a1/attachment.htm>

From markus.icu at gmail.com  Wed Nov  3 15:51:15 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 3 Nov 2021 13:51:15 -0700
Subject: New CJK characters
In-Reply-To: <56309b47-32aa-4251-812b-0574da750313@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <MWHPR1301MB21126B8DAB4AA28374F3E088868C9@MWHPR1301MB2112.namprd13.prod.outlook.com>
 <56309b47-32aa-4251-812b-0574da750313@disroot.org>
Message-ID: <CAN49p6pNBzjX8RoRKdH1UWPHwXAPzHQ6tYCu-5qvMRFE2PSJpA@mail.gmail.com>

On Wed, Nov 3, 2021 at 1:48 PM abrahamgross--- via Unicode <
unicode at corp.unicode.org> wrote:

> If a new/rare character is composed of an IDS sequence (ex: ?????) like
> many emojis, then it should be able to be represented on URLs just fine
>

Peter's reference to IDNA points out that such sequences are not
allowed in *domain
names*.
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/2f3eb200/attachment.htm>

From mark at kli.org  Wed Nov  3 16:22:58 2021
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Nov 2021 17:22:58 -0400
Subject: New CJK characters
In-Reply-To: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
Message-ID: <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>

I'm waiting for some of the old-timers here to give a proper answer, 
Unicode history-wise.

As I understood it, the idea of using IDS or something similar for CJK 
characters was considered (probably more than once) and it was decided 
to do things this way, and so that's the way we're doing them.

A font wouldn't necessarily have to be able to generate new hanzi 
dynamically from IDS descriptions; it could have all the 100,000 or 
however many glyphs already there, and just render the known ones like 
ligatures or something.? It means it's still up to font-designers to add 
characters when they're needed, but the list of characters is then 
open-ended and it's up to font-designers to decide what they want to 
support.

OTOH, as is well known, IDS descriptions are not unique.? There's 
frequently more than one way to slice a character up.? Should *all* be 
supported?? Should there be some way to decide the "canonical" 
decomposition?? I guess if we're leaving it up to fonts, it's then up to 
the font designers again, but that would break all the non-font uses of 
Unicode (searching, comparing, etc) unless there is some canonical 
representation.

I don't know if IDS sequences can really represent "all" han characters; 
I'd guess probably not, but there are probably more sophisticated 
systems that can do better.? There'll probably always be corner cases, 
though.

But at any rate, it's my understanding that that particular ship has 
already sailed, and atomic CJK characters is how Unicode does stuff.? 
Changing that now would be rather more disrupting than just saying "no 
more precomposed accented letters."

On 11/2/21 21:03, Abraham Gross via Unicode wrote:
> I have a proposal regarding the future of encoding new Unihan 
> characters into Unicode that I'd like to float by this group to see if 
> it makes any sense. ....
~mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/5fd33afd/attachment.htm>

From jameskass at code2001.com  Wed Nov  3 16:59:57 2021
From: jameskass at code2001.com (James Kass)
Date: Wed, 3 Nov 2021 21:59:57 +0000
Subject: New CJK characters
In-Reply-To: <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
Message-ID: <a5675503-45b8-2e8d-ebe2-a9fd53954fd6@code2001.com>


On 2021-11-03 9:22 PM, Mark E. Shoulson via Unicode wrote:
> There's frequently more than one way to slice a character up.? Should 
> *all* be supported?? Should there be some way to decide the 
> "canonical" decomposition?

Take U+68DA "?", which can be given IDS of "???" or "?????". 
Entering either into the Zi tool gets the character.? Entering the 
latter results in the tool showing a "normalized IDS" which is the 
former.? It appears that the tool is, of necessity, performing its own 
"roll up" of the sequences in order to perform look-ups.

Then there's unification issues.? For example, this recently added 
Extension G character:
U+31310??? ???? ^???$(G)??? ^???$(Z)
...the tool generates fine ideographs for both IDS.? But only the first 
IDS is being recognized by the tool as a valid Unicode character.

Then there's regional preferences of component glyph shapes to consider, 
and I don't know how or if that would be addressed.

IDS are useful for expressing unencoded ideographs in plain-text, not 
only for those rare older characters, but also for newly invented ones.

(Sorry for my earlier misperception about the identity of the tool's 
developer.)


From john_h_jenkins at apple.com  Wed Nov  3 17:18:53 2021
From: john_h_jenkins at apple.com (john_h_jenkins)
Date: Wed, 03 Nov 2021 16:18:53 -0600
Subject: New CJK characters
In-Reply-To: <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
Message-ID: <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>


> On Nov 3, 2021, at 3:22 PM, Mark E. Shoulson via Unicode <unicode at corp.unicode.org> wrote:
> I don't know if IDS sequences can really represent "all" han characters; I'd guess probably not, but there are probably more sophisticated systems that can do better.  There'll probably always be corner cases, though.
> 
> 

They do not. Even more sophisticated systems like CDL don?t. (See L2/21-118 <https://www.unicode.org/L2/L2021/21118r-kids-preliminary.pdf>.)

I should point out that even sophisticated systems that draw characters based on their IDS (or CDL) are not going to match the quality of a commercial CJK font. 
> But at any rate, it's my understanding that that particular ship has already sailed, and atomic CJK characters is how Unicode does stuff.  Changing that now would be rather more disrupting than just saying "no more precomposed accented letters.?
> 
This is actually touched on in TUS (? 18.2) and the FAQ (Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points? <http://www.unicode.org/faq/han_cjk.html#16>). Outside of the momentum issue mentioned, compositional methods don?t work because of ?spelling? ambiguity and failure to address issues such as collation, text-to-speech, searching, semantic analysis?basically, everything you want to use text for *other* than rendering. Even in rendering, you aren?t covering the region-specific shapes, at least not with IDS.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/ec712e42/attachment.htm>

From jameskass at code2001.com  Wed Nov  3 17:35:13 2021
From: jameskass at code2001.com (James Kass)
Date: Wed, 3 Nov 2021 22:35:13 +0000
Subject: New CJK characters
In-Reply-To: <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
Message-ID: <e6ca8b12-8b6c-216e-8741-53910c89b78a@code2001.com>


On 2021-11-03 10:18 PM, john_h_jenkins via Unicode wrote:
> I should point out that even sophisticated systems that draw characters based on their IDS (or CDL) are not going to match the quality of a commercial CJK font.
Any reasonable glyph is better than the "missing glyph".

From abrahamgross at disroot.org  Wed Nov  3 18:22:49 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 3 Nov 2021 23:22:49 +0000 (UTC)
Subject: New CJK characters
In-Reply-To: <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
Message-ID: <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>

Sutton Signwriting is completely compositional, and yet it was encoded despite all the downfalls

2021/11/03 ??6:19:32 john_h_jenkins via Unicode <unicode at corp.unicode.org>:

> 
> This is actually touched on in TUS (? 18.2) and the FAQ (Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?). Outside of the momentum issue mentioned, compositional methods don?t work because of ?spelling? ambiguity and failure to address issues such as collation, text-to-speech, searching, semantic analysis?basically, everything you want to use text for *other* than rendering. Even in rendering, you aren?t covering the region-specific shapes, at least not with IDS.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211103/c62c8ce2/attachment.htm>

From jameskass at code2001.com  Wed Nov  3 21:20:13 2021
From: jameskass at code2001.com (James Kass)
Date: Thu, 4 Nov 2021 02:20:13 +0000
Subject: New CJK characters
In-Reply-To: <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
Message-ID: <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>


Take a Han character already encoded and call it ???.? Since ? is 
encoded, it can be entered in plain-text and The Standard serves us 
well.? Rendering (higher level protocol) checks available fonts for 
coverage.? If ? is covered, that?s the end of it.? But if ? isn?t 
covered, the application /could/ query the IDS database and construct a 
glyph on the fly.

If there?s an unencoded character, ???, it can?t be entered in 
plain-text directly.? IDCs/IDSs are a notational system which can serve 
as placeholders in plain-text.? Maybe ? will be encoded someday, maybe 
not.? Meanwhile The Standard serves us well because this notational 
system is encoded.? Rendering /could/ construct an /ad hoc/ glyph for ? 
which would be exo-Unicode.? The underlying data wouldn?t be altered.

Any application sophisticated enough to generate reasonable glyphs on 
the fly based on IDSs should be sophisticated enough to check any opened 
files for IDSs which have since become encoded and offer the user the 
option of replacing IDSs with Unicode characters as appropriate.

The document linked by John H. Jenkins earlier, L2/21-118, shows that 
efforts are underway to enhance the IDSs by adding missing IDCs as well 
as presently unencoded components.? The current level of support already 
covers the vast majority of encoded characters. When the enhancements 
are accomplished, only the most bizarre edge cases will remain 
unexpressable as IDSs, AFAICT.

We shouldn?t expect Unicode to say that any conformant application must 
substitute glyphs on the fly for IDSs.? But many users would probably 
welcome sophisticated applications which can do it.


From abrahamgross at disroot.org  Wed Nov  3 21:38:17 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Nov 2021 02:38:17 +0000 (UTC)
Subject: New CJK characters
In-Reply-To: <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
Message-ID: <7340a6c6-7ff0-4ca1-b912-ccab2dba0677@disroot.org>

Id say making an update to HarfBuzz (most popular text-shaping engine) so that it includes IDS shaping would solve this problem very nicely.

maybe we should require a special character somewhere in the IDS when we want it to combine

2021/11/03 ??10:21:00 James Kass via Unicode <unicode at corp.unicode.org>:

> We shouldn?t expect Unicode to say that any conformant application must substitute glyphs on the fly for IDSs.? But many users would probably welcome sophisticated applications which can do it.


From john_h_jenkins at apple.com  Thu Nov  4 11:38:39 2021
From: john_h_jenkins at apple.com (john_h_jenkins)
Date: Thu, 04 Nov 2021 10:38:39 -0600
Subject: New CJK characters
In-Reply-To: <e6ca8b12-8b6c-216e-8741-53910c89b78a@code2001.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <e6ca8b12-8b6c-216e-8741-53910c89b78a@code2001.com>
Message-ID: <94086C46-BAC3-4BF6-AC5A-8CDAF6B7C1B2@apple.com>


> On Nov 3, 2021, at 4:35 PM, James Kass via Unicode <unicode at corp.unicode.org> wrote:
> 
> 
> On 2021-11-03 10:18 PM, john_h_jenkins via Unicode wrote:
>> I should point out that even sophisticated systems that draw characters based on their IDS (or CDL) are not going to match the quality of a commercial CJK font.
> Any reasonable glyph is better than the "missing glyph?.

Oh, this is true, and I should have been clearer. IDSs as a way of *representing* unencoded characters is fine. It?s what they were invented for. And any rendering engine that can turn these into visually-pleasing glyphs is welcome to do so (see TUS pp. 750?751).

IDSs are not, however, a workable alternative to *encoding* Han ideographs as singletons. Even simpler ideas that would allow some ideographs to be implicitly encoded have been rejected by IRG members. 


From john_h_jenkins at apple.com  Thu Nov  4 11:51:09 2021
From: john_h_jenkins at apple.com (john_h_jenkins)
Date: Thu, 04 Nov 2021 10:51:09 -0600
Subject: New CJK characters
In-Reply-To: <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
Message-ID: <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com>

As I understand it, the encoded repertoire for Sutton SignWriting is inadequate for actual display of text because Unicode doesn?t provide a mechanism for the two-dimensional layout SignWriting uses (TUS p. 831). In this, it?s like music and mathematics. 

> On Nov 3, 2021, at 5:22 PM, abrahamgross--- via Unicode <unicode at corp.unicode.org> wrote:
> 
> Sutton Signwriting is completely compositional, and yet it was encoded despite all the downfalls 
> 
> 2021/11/03 ??6:19:32 john_h_jenkins via Unicode <unicode at corp.unicode.org>:
> 
> 
> This is actually touched on in TUS (? 18.2) and the FAQ (Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?). Outside of the momentum issue mentioned, compositional methods don?t work because of ?spelling? ambiguity and failure to address issues such as collation, text-to-speech, searching, semantic analysis?basically, everything you want to use text for *other* than rendering. Even in rendering, you aren?t covering the region-specific shapes, at least not with IDS. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211104/c6d74f2b/attachment.htm>

From jameskass at code2001.com  Thu Nov  4 18:55:42 2021
From: jameskass at code2001.com (James Kass)
Date: Thu, 4 Nov 2021 23:55:42 +0000
Subject: SignWriting (was Re: New CJK characters)
In-Reply-To: <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com>
Message-ID: <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com>


On 2021-11-04 4:51 PM, john_h_jenkins via Unicode wrote:
> As I understand it, the encoded repertoire for Sutton SignWriting is inadequate for actual display of text because Unicode doesn?t provide a mechanism for the two-dimensional layout SignWriting uses (TUS p. 831). In this, it?s like music and mathematics.

This is correct as far as the currently encoded repertoire.? My 
understanding is that the current repertoire represents the characters 
without any layout mechanism, but that the mechanism was considered 
essential (as 'spelling') and would be proposed separately.? (Maybe it 
was proposed separately and rejected, IDK.)

Quoting from:
https://www.unicode.org/L2/L2012/12321-n4342-signwriting.pdf

"In terms of UCS encoding, two main stages will be required. The first 
stage (represented in this proposal) is simpler: the encoding of the 
basic characters. These are simply graphic characters, proposed to be 
encoded in Plane 1. The second stage will deal with the spatial 
organization of SignWriting characters. The latter are anticipated to be 
encoded as control characters specific to SignWriting, probably in Plane 
14."


From jameskass at code2001.com  Thu Nov  4 20:32:33 2021
From: jameskass at code2001.com (James Kass)
Date: Fri, 5 Nov 2021 01:32:33 +0000
Subject: SignWriting (was Re: New CJK characters)
In-Reply-To: <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com>
 <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com>
Message-ID: <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com>


On 2021-11-04 11:55 PM, James Kass via Unicode wrote:
> (Maybe it was proposed separately and rejected, IDK.) 

Sorry, my bad.? Apparently this is the case, and John H. Jenkins had 
already provided the relevant page number from the current Standard PDF:

"The spatial arrangement of the symbols is an essential part of the 
writing system, but constitutes a higher-level protocol beyond the scope 
of the Unicode Standard."

From c933103 at gmail.com  Fri Nov  5 08:14:15 2021
From: c933103 at gmail.com (Phake Nick)
Date: Fri, 5 Nov 2021 21:14:15 +0800
Subject: New CJK characters
In-Reply-To: <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
Message-ID: <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>

I have briefly mentioned the issue, in my previous mail to the mailing
list, which I jave received some reply that worth consideration and still
haven't get around writing reply to those mails I rsceoved, but yes such
sort of character encoding system have been conceptualized since the 20th
century before wide adaption of Unicode, and due to the current encoding
system being too convenient people just opt to use this instead of any
other possibly incrementally better but would be incompatible with existing
system.

Recently I have came across some proposed solutions to develop CJK fonts
for array of characters by using deep learning to put radicals together
with different components of different characters nocely according to their
proportion through machine learning, that's also something we didn't have
back in the pre-Unicode era.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211105/f9ac9116/attachment.htm>

From abrahamgross at disroot.org  Fri Nov  5 08:25:27 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 5 Nov 2021 13:25:27 +0000 (UTC)
Subject: New CJK characters
In-Reply-To: <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
 <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
Message-ID: <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org>

Looking at TUS ?11.4 Egyptian Hieroglyphs you can see that there they decided to yes use control characters to shape complex characters.
Anyone know why that is?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211105/82469e02/attachment.htm>

From Andrew.Glass at microsoft.com  Fri Nov  5 12:02:20 2021
From: Andrew.Glass at microsoft.com (Andrew Glass)
Date: Fri, 5 Nov 2021 17:02:20 +0000
Subject: [EXTERNAL] Re: New CJK characters
In-Reply-To: <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
 <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
 <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org>
Message-ID: <BYAPR00MB0599497D692C0C15F06D35EC8E8E9@BYAPR00MB0599.namprd00.prod.outlook.com>

We use control characters for Egyptian because it is possible and preferable to do so. The elements of the writing system are the encoded logographic and phonetic signs. The signs are arranged spatially to take advantage of available space. The blocks of writing can represent polysyllabic sequences or even multiple words. Thus, these blocks are quite different from CJK. Cataloguing attested blocks to encode them atomically would never be complete and would result in a massive number of combinations. It is important to the user community (mainly scholars) to be able to enter texts that are newly discovered, and therefore, would contain previously unattested blocks. So, rendering of arbitrary blocks is a requirement, hence the use of control characters to define the spatial relationships.

Cheers,

Andrew
________________________________
From: Unicode <unicode-bounces at corp.unicode.org> on behalf of abrahamgross--- via Unicode <unicode at corp.unicode.org>
Sent: Friday, November 5, 2021 1:25 PM
To: unicode at corp.unicode.org <unicode at corp.unicode.org>
Subject: [EXTERNAL] Re: New CJK characters

Looking at TUS ?11.4 Egyptian Hieroglyphs you can see that there they decided to yes use control characters to shape complex characters.
Anyone know why that is?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211105/84c6fbe6/attachment.htm>

From kenwhistler at sonic.net  Fri Nov  5 12:06:38 2021
From: kenwhistler at sonic.net (Ken Whistler)
Date: Fri, 5 Nov 2021 10:06:38 -0700
Subject: New CJK characters
In-Reply-To: <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
 <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
 <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org>
Message-ID: <c53aee8a-8861-7d30-2e1f-7590cb45722a@sonic.net>

Because *quadrats* are sequences of independent signs organized into 
square boxes for presentation. They are conceived of that way by modern 
day Egyptologists, and presumably also by the people who wrote them 
millennia ago.

Although both hieroglyphics and Han characters are graphically complex 
and both have concepts of dynamic (and somewhat recursive) principles 
for construction of more complex forms, when examined in detail the 
systems are quite distinct. And the way the writing systems map onto the 
languages involved is quite distinct as well.

And then there is the simple fact of precedent, which weighs heavily on 
encoding decisions for complex scripts.

For Han, we started with the existing fact of implemented JIS and GB 
systems and all their cousins, which encoded Han characters atomically 
(by necessity), and treated the dynamic structure of Han characters the 
same way almost all CJK dictionaries do: by enumerated list.

For Egyptian hieroglyphs we started with the Gardiner list of *signs* 
(fundamental to Egyptian study). Gardiner and Egyptologists (and the 
implementations) subsequently assumed that quadrats are built up from 
the signs dynamically. The atomic unit is not the quadrat.

--Ken

On 11/5/2021 6:25 AM, abrahamgross--- via Unicode wrote:
> Looking at TUS ?11.4 Egyptian Hieroglyphs you can see that there they 
> decided to yes use control characters to shape complex characters.
> Anyone know why that is?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211105/ac8495ac/attachment.htm>

From wjgo_10009 at btinternet.com  Fri Nov  5 11:58:27 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 5 Nov 2021 16:58:27 +0000 (GMT)
Subject: SignWriting (was Re: New CJK characters)
In-Reply-To: <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com>
 <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com>
 <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com>
Message-ID: <7ab3caf9.332c7.17cf1098e26.Webtop.101@btinternet.com>


The following video about the history of Sutton SignWriting is 
wonderful.

https://www.youtube.com/watch?v=sYQn6crcBno 
<https://www.youtube.com/watch?v=sYQn6crcBno>

Only 40 views at the time of the writing of this note.

It is one of a number of videos available from Ms Valerie Sutton.

William Overington

Friday 5 November 2021

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211105/dbfb272d/attachment.htm>

From abrahamgross at disroot.org  Fri Nov  5 12:25:03 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 5 Nov 2021 17:25:03 +0000 (UTC)
Subject: SignWriting (was Re: New CJK characters)
In-Reply-To: <7ab3caf9.332c7.17cf1098e26.Webtop.101@btinternet.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com>
 <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com>
 <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com>
 <7ab3caf9.332c7.17cf1098e26.Webtop.101@btinternet.com>
Message-ID: <d95a6280-da7c-47f8-a7de-8e5db865f9de@disroot.org>

These videos were indeed very interesting. seldom do we get to hear the thoughts of people who created a widely used writing system
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211105/78c150ce/attachment.htm>

From duerst at it.aoyama.ac.jp  Fri Nov  5 19:31:43 2021
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=)
Date: Sat, 6 Nov 2021 09:31:43 +0900
Subject: New CJK characters
In-Reply-To: <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
 <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
Message-ID: <fbaa9eb8-a11a-8c71-9c94-7acba6002516@it.aoyama.ac.jp>

On 2021-11-05 22:14, Phake Nick via Unicode wrote:

> Recently I have came across some proposed solutions to develop CJK fonts
> for array of characters by using deep learning to put radicals together
> with different components of different characters nocely according to their
> proportion through machine learning, that's also something we didn't have
> back in the pre-Unicode era.

I would be very interested in any pointers, either on or off list.

Regards,   Martin.

From xfq.free at gmail.com  Fri Nov  5 20:43:59 2021
From: xfq.free at gmail.com (Fuqiao Xue)
Date: Sat, 6 Nov 2021 09:43:59 +0800
Subject: New CJK characters
In-Reply-To: <fbaa9eb8-a11a-8c71-9c94-7acba6002516@it.aoyama.ac.jp>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
 <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
 <fbaa9eb8-a11a-8c71-9c94-7acba6002516@it.aoyama.ac.jp>
Message-ID: <CAAF+z6H=KJt+BboC=GUg5_quXCRHC69NuoqhSNxpcgJ3qU_01g@mail.gmail.com>

Hi Martin,

2021?11?6?(?) 8:35 Martin J. D?rst via Unicode <unicode at corp.unicode.org>:
>
> On 2021-11-05 22:14, Phake Nick via Unicode wrote:
>
> > Recently I have came across some proposed solutions to develop CJK fonts
> > for array of characters by using deep learning to put radicals together
> > with different components of different characters nocely according to their
> > proportion through machine learning, that's also something we didn't have
> > back in the pre-Unicode era.
>
> I would be very interested in any pointers, either on or off list.

Although Phake may not be talking about this project, here is a
project that uses a neural network to create Chinese fonts, and it has
sparked a lot of discussions:

   https://github.com/kaonashi-tyc/Rewrite#motivation

~xfq

> Regards,   Martin.


From Jens.Maurer at gmx.net  Sat Nov  6 08:00:29 2021
From: Jens.Maurer at gmx.net (Jens Maurer)
Date: Sat, 6 Nov 2021 14:00:29 +0100
Subject: Aliases for control characters; BELL in particular
Message-ID: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net>

Hi,

I'm involved in extending the C++ programming language so that
character names can be used to represent a Unicode character in
source code, in addition to code point hex numbers.

There are a number of obstacles here; I'll start with a rather
specific concern.

I'm looking at Unicode 14.0.0.

In section 24.1 it says

Normative Aliases
[...]

Normative aliases which provide information about corrections to defective character
names or which provide alternate names in wide use for a Unicode format character are
printed in the character names list, preceded by a special symbol [...]. Normative aliases
serving other purposes, if listed, are shown by convention in all caps, following an ?=?.
Normative aliases of type ?figment? for control codes are not listed. Normative aliases
which represent commonly used abbreviations for control codes or format characters are
shown in all caps, enclosed in parentheses. In contrast, informative aliases are shown in
lowercase. For the definitive list of normative aliases, also including their type and suitable
for machine parsing, see NameAliases.txt in the UCD.


https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt

says, in particular,

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0007;ALERT;control
0007;BEL;abbreviation


Yet, https://www.unicode.org/Public/14.0.0/charts/CodeCharts.pdf says

0007 <control>
= BELL

and about a thousand pages later

1F514 BELL
? 0FC4 tibetan symbol dril bu
? 2407 symbol for bell
? 1F56D ringing bell


So, given the explanation in section 24.1, CodeCharts.pdf defines a normative
alias "BELL" for U+0007 (it's all-caps and follows "="), despite the utterance
in NameAliases.txt that this is not desired.
It feels that CodeCharts.pdf ought to say "0007 <control> = ALERT" to avoid
the naming conflict described in the comment in NameAliases.txt.

(It would be good if NameAliases.txt would not use the phrase "formal name alias",
but one of the category phrases from section 24.1.)


A slightly related question is for these aliases from NameAliases.txt:

000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control

This seems to indicate that all three aliases are on the same level.

Yet, CodeCharts.pdf says

000A <control>
= LINE FEED (LF)
= new line (NL)
= end of line (EOL)

which, according to the explanation in section 24.1, means that only LINE FEED
is a normative alias, but "new line" and "end of line" are merely informative
aliases. The data in NameAliases.txt does not support this interpretation.
Is it the intention that all three aliases for U+000A are normative aliases?

Thanks for your help!

Jens


From markus.icu at gmail.com  Sat Nov  6 12:07:52 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 6 Nov 2021 10:07:52 -0700
Subject: Aliases for control characters; BELL in particular
In-Reply-To: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net>
References: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net>
Message-ID: <CAN49p6qsXH4urEqw690=s2pdDs0pUHoMzWOai9cPU6xYB30VYA@mail.gmail.com>

Hallo Jens,

On Sat, Nov 6, 2021 at 8:50 AM Jens Maurer via Unicode <
unicode at corp.unicode.org> wrote:

> So, given the explanation in section 24.1, CodeCharts.pdf defines a
> normative
> alias "BELL" for U+0007 (it's all-caps and follows "="), despite the
> utterance
> in NameAliases.txt that this is not desired.
>

Here is the disconnect. The code charts, with their annotations driven by
https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt , are a
presentation of glyphs, names and useful additional information.
But the normative data is in NameAliases.txt.

It would be best if you could report the discrepancy via
https://www.unicode.org/reporting.html

The data in NameAliases.txt does not support this interpretation.
> Is it the intention that all three aliases for U+000A are normative
> aliases?
>

Please use only the data in NameAliases.txt.

https://www.unicode.org/reports/tr44/#NameAliases.txt
vs.
https://www.unicode.org/reports/tr44/#NamesList

Viele Gr??e,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211106/ee60828a/attachment.htm>

From Jens.Maurer at gmx.net  Sat Nov  6 14:59:36 2021
From: Jens.Maurer at gmx.net (Jens Maurer)
Date: Sat, 6 Nov 2021 20:59:36 +0100
Subject: Aliases for control characters; BELL in particular
In-Reply-To: <CAN49p6qsXH4urEqw690=s2pdDs0pUHoMzWOai9cPU6xYB30VYA@mail.gmail.com>
References: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net>
 <CAN49p6qsXH4urEqw690=s2pdDs0pUHoMzWOai9cPU6xYB30VYA@mail.gmail.com>
Message-ID: <b73daf75-c0e6-4658-b486-d9d157b12e81@gmx.net>

On 06/11/2021 18.07, Markus Scherer via Unicode wrote:
> Hallo Jens,
>
> On Sat, Nov 6, 2021 at 8:50 AM Jens Maurer via Unicode <unicode at corp.unicode.org <mailto:unicode at corp.unicode.org>> wrote:
>
>     So, given the explanation in section 24.1, CodeCharts.pdf defines a normative
>     alias "BELL" for U+0007 (it's all-caps and follows "="), despite the utterance
>     in NameAliases.txt that this is not desired.
>
>
> Here is the disconnect. The code charts, with their annotations driven by?https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt> , are a presentation of glyphs, names and useful additional information.
> But the normative data is in NameAliases.txt.
>
> It would be best if you could report the discrepancy via?https://www.unicode.org/reporting.html <https://www.unicode.org/reporting.html>

I've posted two bug reports, one against the use of BELL for U+0007 and
one against the presentation of aliases for U+000A (and other control
characters with more than one "control" alias).

> Please use only the data in NameAliases.txt.

The sad part here is that C++ is an ISO standard, which really likes to refer to
another ISO standard for these matters.  But the code charts in ISO 10646:2020
have these bugs in them, and it seems those charts are normative in ISO 10646.

Beyond that, according to ISO 10646 section 34.3, only the "correction" aliases
are normative, the others are informative, which differs from the viewpoint
of Unicode 14.  And which means that the control characters are not nameable
at all via ISO 10646 normative names/aliases, which makes me sad.

Jens


From c933103 at gmail.com  Sun Nov  7 01:18:12 2021
From: c933103 at gmail.com (Phake Nick)
Date: Sun, 7 Nov 2021 14:18:12 +0800
Subject: New CJK characters
In-Reply-To: <fbaa9eb8-a11a-8c71-9c94-7acba6002516@it.aoyama.ac.jp>
References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org>
 <a04e6e9a-1d6a-4482-80e3-722b3986ddfe@shoulson.com>
 <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com>
 <c84beb97-0eae-403a-b54e-4f02ca0da1c1@disroot.org>
 <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com>
 <CAGHjPPL-ETmLkL2ciGXp0k+NLGdOdRAZai8GRb3ps=-uN7Rgdg@mail.gmail.com>
 <fbaa9eb8-a11a-8c71-9c94-7acba6002516@it.aoyama.ac.jp>
Message-ID: <CAGHjPPJm14M7Hs0xdECpiCk-=CMm+6-LDbxSJyZn9Bx0SPWBFg@mail.gmail.com>

I'm fairly certain it was introduced from a fontmaker's facebook post,
however I cannot use Facebook account nowadays, and thus have
difficulty finding the relevant posts, as Facebook now blocked many
pages from non-logged-in users.

But, while searching, I found the following links which might be
interesting to anyone who want to look into this topic:

https://www.astar.com.tw/astar_auto02.htm A Taiwan company's software,
Astar Auto, which will dynamically generate characters in gif image
format and serve it to client browser according to request. This is
from ~2000s or so thus no fancy technology involved

https://github.com/ButTaiwan/GlyphsTools/tree/main/TaiwanKit An
opensource font making tool from Taiwan, which include the feature of
auto generating symbols like Roman numerals or Full width Latin
characters, based on glyphs that have already been created, and it can
use mirroring and rotation to automatically make glyphs for symbols
like tabulation symbols and arrows, as well as adding circle and such
around numbers to form enclosed characters. But it doesn't appear to
support auto-generating Chinese characters. It can also auto update
the resultant design if the source glyph is modified.

https://www.cjkfonts.io/blog/cjkfonts_allseto A Traditional Chinese
font maker used machine learning to generate Simplified Chinese
characters of the same style as an open source Japanese font and
released it to the public.

https://aihub.org.tw/ai_case/fd0c8ff03157edb37926475ef674873a Arphic,
a famous Traditional Chinese font maker, is reportedly using their own
AI module to automatically adjust structure and thickness of glyphs,
and font designers will only need to do final quality check before
releasing the product. Currently their AI can create 5000 characters
from 5000 handmade characters, and they want to increase the rate to
90% glyphs being auto generated into the future. It is said that the
introduction of such a tool has already improved their revenue, and in
the next stage they want to open up the platform for public use, such
that everyone can create Chinese fonts with their own personal style.

Martin J. D?rst <duerst at it.aoyama.ac.jp> ? 2021?11?6??? ??8:31???
>
> On 2021-11-05 22:14, Phake Nick via Unicode wrote:
>
> > Recently I have came across some proposed solutions to develop CJK fonts
> > for array of characters by using deep learning to put radicals together
> > with different components of different characters nocely according to their
> > proportion through machine learning, that's also something we didn't have
> > back in the pre-Unicode era.
>
> I would be very interested in any pointers, either on or off list.
>
> Regards,   Martin.


From tom at honermann.net  Mon Nov 15 18:20:14 2021
From: tom at honermann.net (Tom Honermann)
Date: Mon, 15 Nov 2021 18:20:14 -0600
Subject: ICU encoding name alias conflicts
Message-ID: <5611ea3b-6e6e-3472-0417-cb959ad89808@honermann.net>

I conducted an audit of all of the encoding names recognized by ICU with 
the goal of identifying any cases where comparison under the COMP_NAME 
loose matching algorithm specified in P1885 <https://wg21.link/p1885> 
would lead to a conflict in selecting an ICU converter. The good news is 
that no conflicts were identified that can be attributed to the loose 
matching algorithm. However, I found that the same alias is used for 
different encodings in multiple cases as described in the table below. 
These can be verified with ICU Converter Explorer 
<https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>.

I did not scrape the ICU Converter Explorer page to perform the audit. 
The data I worked off of was produced with ICU 70.1 by running uconv -l 
--canon and then massaging the output.

Each row of the table describes a conflict between two ICU encodings, 
each of which is named in the left most and right most columns 
respectively. The inner columns list the specific aliases that conflict 
and which provider they correspond to.

For at least some of these, one has to wonder if the ICU data is simply 
incorrect. Cases that only involve a conflict with an untagged alias are 
illustrated in gray so that the others stand out.

Can anyone offer an explanation for these conflicts? Do these reflect 
defects in ICU (particularly for the cases where the untagged aliases 
disagree with)?

*ICU encoding**
* 	*Encoding alias****(provider)**
* 	*Encoding alias****(provider)* 	*ICU encoding**
*
ibm-943_P15A-2003
	cp932 (Windows)
	cp932 (Untagged)
	ibm-942_P12A-1999
ibm-943_P130-1999
	ibm-943 (IBM)
ibm-943 (Java) 	ibm-943 (Untagged)
	ibm-943_P15A-2003
ibm-943_P130-1999
	Shift_JIS (Untagged)
	Shift_JIS (Windows)
Shift_JIS (Java)
Shift_JIS (IANA)
Shift_JIS (MIME)
	ibm-943_P15A-2003
ibm-33722_P120-1999
	ibm-33722 (IBM)
ibm-33722 (Java) 	ibm-33722 (Untagged)
	ibm-33722_P12A_P12A-2009_U2
ibm-33722_P120-1999
	ibm-5050 (IBM)
	ibm-5050 (Untagged)
	ibm-33722_P12A_P12A-2009_U2
windows-950-2000
	windows-950 (Windows)
	windows-950 (Untagged)
	ibm-1373_P100-2002
ibm-5471_P100-2006
	Big5-HKSCS (Untagged)
	Big5-HKSCS (Java)
Big5-HKSCS (IANA)
	ibm-1375_P100-2008
windows-936-2000
	windows-936 (Windows)
windows-936 (Java)
windows-936 (IANA)
	windows-936 (Untagged)
	ibm-1386_P100-2001
ibm-949_P11A-1999
	ibm-949 (Untagged)
	ibm-949 (IBM)
ibm-949 (Java)
	ibm-949_P110-1999
ibm-1363_P11B-1998
	KS_C_5601-1987 (IANA)
	KS_C_5601-1987 (Java)
	ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998
	KSC_5601 (IANA)
	KSC_5601 (Java)
	ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998
	5601 (Untagged)
	5601 (Java)
	ibm-970_P110_P110-2006_U2
ibm-1363_P110-1997
	ibm-1363 (IBM)
	ibm-1363 (Untagged)
	ibm-1363_P11B-1998
windows-949-2000
	windows-949 (Windows)
windows-949 (Java)
	windows-949 (Untagged)
	ibm-1363_P11B-1998
windows-949-2000
	KS_C_5601-1987 (Windows)
	KS_C_5601-1987 (Java)
	ibm-970_P110_P110-2006_U2
windows-949-2000
	KS_C_5601-1989 (Windows)
	KS_C_5601-1989 (IANA)
	ibm-1363_P11B-1998
windows-949-2000
	KSC_5601 (Windows)
KSC_5601 (MIME)
	KSC_5601 (Java)
	ibm-970_P110_P110-2006_U2
windows-949-2000
	csKSC56011987 (Windows)
	csKSC56011987 (IANA)
	ibm-1363_P11B-1998
windows-949-2000
	korean (Windows)
	korean (IANA)
	ibm-1363_P11B-1998
windows-949-2000
	iso-ir-149 (Windows)
	iso-ir-149 (IANA)
	ibm-1363_P11B-1998
ibm-874_P100-1995
	TIS-620 (Java)
TIS-620 (IANA)
	TIS-620 (Windows)
	windows-874-2000
ibm-1250_P100-1995
	windows-1250 (Untagged)
	windows-1250 (Windows)
windows-1250 (Java)
windows-1250 (IANA)
	ibm-5346_P100-1998
ibm-1251_P100-1995
	windows-1251 (Untagged)
	windows-1251 (Windows)
windows-1251 (Java)
windows-1251 (IANA) 	ibm-5347_P100-1998
ibm-1252_P100-2000
	windows-1252 (Untagged)
	windows-1252 (Windows)
windows-1252 (Java)
windows-1252 (IANA) 	ibm-5348_P100-1997
ibm-1253_P100-1995
	windows-1253 (Untagged)
	windows-1253 (Windows)
windows-1253 (Java)
windows-1253 (IANA) 	ibm-5349_P100-1998
ibm-1254_P100-1995
	windows-1254 (Untagged)
	windows-1254 (Windows)
windows-1254 (Java)
windows-1254 (IANA) 	ibm-5350_P100-1998
ibm-5351_P100-1998
	windows-1255 (Untagged)
	windows-1255 (Windows)
windows-1255 (Java)
windows-1255 (IANA) 	ibm-9447_P100-2002
ibm-5352_P100-1998
	windows-1256 (Untagged)
	windows-1256 (Windows)
windows-1256 (Java)
windows-1256 (IANA) 	ibm-9448_X100-2005
ibm-5353_P100-1998
	windows-1257 (Untagged)
	windows-1257 (Windows)
windows-1257 (Java)
windows-1257 (IANA) 	ibm-9449_P100-2002
ibm-1258_P100-1997
	windows-1258 (Untagged)
	windows-1258 (Windows)
windows-1258 (Java)
windows-1258 (IANA) 	ibm-5354_P100-1998

Tom.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211115/21b361e0/attachment.htm>

From harjitmoe at outlook.com  Sun Nov 21 17:03:34 2021
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sun, 21 Nov 2021 23:03:34 +0000
Subject: ICU encoding name alias conflicts
In-Reply-To: <5611ea3b-6e6e-3472-0417-cb959ad89808@honermann.net>
References: <5611ea3b-6e6e-3472-0417-cb959ad89808@honermann.net>
Message-ID: <VI1PR07MB5712E0A3DD7A7311B116E7E3B79E9@VI1PR07MB5712.eurprd07.prod.outlook.com>

Hello.

Long infodump ahead, but there are several things going on here.

? Some of these are different mappings for the same encoding, e.g. ibm-33722_P120-1999 versus ibm-33722_P12A_P12A-2009_U2.?This is because the mapping of legacy character sets, JIS X 0208 and a subset of JIS X 0212 in this case, isn't always universally agreed upon between vendors (MINUS SIGN versus FULLWIDTH HYPHEN-MINUS, EM DASH versus HORIZONTAL BAR, WAVE DASH versus FULLWIDTH TILDE versus TILDE OPERATOR, et cetera), to say nothing of the REVERSE SOLIDUS / YEN SIGN / WON SIGN brouhaha.

As a sidenote, IBM-33722 is the subset of IBM-954 (IBM's version of EUC-JP) that can be converted to IBM-942, similarly to how IBM-5050 is the subset of IBM-954 that can be converted to IBM-932, which is a subset of IBM-942 without the single-byte extensions (hence IBM-5050 is aliased to its superset IBM-33722).?Why both aren't just aliased to IBM-954 is beyond me.

Further sidenote: both IBM-954 and the OSF/TUG eucJP-open encode the subset of the IBM Extensions section from IBM-932 that doesn't have standard codepoints in JIS X 0212 to an extension range in empty space in JIS X 0212; however, these schemes collide with one another.?In practice, it is NEC's scheme (which encodes the subset of the IBM Extensions section that doesn't have standard codepoints in NEC Row 13 to empty space in JIS X 0208) that gets used more often, in both EUC-JP and Shift_JIS, even when the IBM Extensions themselves are also included (as in Windows code page 932).

? A pervasive problem with legacy character encoding names is that Microsoft and IBM often use different definitions for a given code page number.?For instance, code page 932 was modified by Microsoft to use a newer JIS X 0208 edition and add NEC extensions as well as the existing IBM extensions (IBM-932 was also updated with the newer JIS X 0208 repertoire, but without the codepoint swaps of kyuujitai with corresponding extended shinjitai between levels 1 and 2 that JIS X 0208 made in 1983, and excluding additions which duplicated the existing IBM extensions).?Microsoft's code page 932 was later adopted by IBM as code page 943.?Hence some labels are inherently ambiguous.

Likewise: IBM code page 949 and Windows code page 949 are both supersets of EUC-KR, but the similarities end there (Windows's one is Unified Hangul Code, IBM's adds its own extensions outside of the EUC range to fully support the repertoires of IBM-933 and IBM-934).?IBM's 1363 is Windows-949, although IBM and Microsoft don't entirely agree on mapping.

IBM's code page 950 and Windows code page 950 are both subsets of Big5-ETEN, but IBM includes only the part of the ETEN extensions that Microsoft doesn't, both treating the other range as user-defined; IBM-1373 corresponds to Windows-950.

Code page 936 is the most egregious, referring to formerly EUC-CN and latterly GBK on Windows, but seemingly referring to Shift_GB (or something very similar) by IBM's definition (though IBM-936 is heavily deprecated and is omitted by ICU).

IBM-874 and Windows-874 are also different, otherwise-unrelated, extensions of TIS-620, the national standard which would, with a minor revision, become ISO-8859-11.

? IBM makes a distinction between CPGIDs and CCSIDs, both of which essentially occupy the same namespace, but CPGIDs identify a fixed-width plane with a potentially growing repertoire (unless the plane is full), while CCSIDs specify a repertoire (they can have a growing repertoire, but have to specify it explicitly) and can be variable-width by combining multiple planes within a higher-level scheme (such as ISO-2022-JP, general EUC, stateful EBCDIC, lead-byte-masked variable-width).?Microsoft does not, calling both code page numbers.

Hence, IBM-5348 (CCSID 5348) is the current version of Windows-1252, with a larger specified repertoire than IBM-1252 (CCSID 1252), which is the version of Windows-1252 before the Euro Sign Update (which also added a few characters besides the Euro sign)?but CPGID 1252 refers to the whole thing (with the maximal CCSID of 5348).

Similarly, IBM-5471 is Big5-HKSCS (2001) and IBM-1375 is Big5-HKSCS Growing, in practice meaning Big5-HKSCS (2008) as seen from its inclusion of 0x877A through 0x87DF?both are variable-width so neither is a CPGID (the pure double-byte CPGID for HKSCS is 1374).

Often updates or extensions to, or conversely subsets of, an existing CCSID get assigned CCSIDs amounting to an increment of the existing one by a multiple of 4096 (hence 1257 versus 5353 versus 9449).

I think those three explanations cover everything.
?Har.

________________________________
From: Unicode <unicode-bounces at corp.unicode.org> on behalf of Tom Honermann via Unicode <unicode at corp.unicode.org>
Sent: 16 November 2021 00:20
To: SG16 <sg16 at lists.isocpp.org>; UnicoDe List <unicode at corp.unicode.org>; icu-support at lists.sourceforge.net <icu-support at lists.sourceforge.net>
Subject: ICU encoding name alias conflicts


I conducted an audit of all of the encoding names recognized by ICU with the goal of identifying any cases where comparison under the COMP_NAME loose matching algorithm specified in P1885<https://wg21.link/p1885> would lead to a conflict in selecting an ICU converter. The good news is that no conflicts were identified that can be attributed to the loose matching algorithm. However, I found that the same alias is used for different encodings in multiple cases as described in the table below. These can be verified with ICU Converter Explorer<https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>.

I did not scrape the ICU Converter Explorer page to perform the audit. The data I worked off of was produced with ICU 70.1 by running uconv -l --canon and then massaging the output.

Each row of the table describes a conflict between two ICU encodings, each of which is named in the left most and right most columns respectively. The inner columns list the specific aliases that conflict and which provider they correspond to.

For at least some of these, one has to wonder if the ICU data is simply incorrect. Cases that only involve a conflict with an untagged alias are illustrated in gray so that the others stand out.

Can anyone offer an explanation for these conflicts? Do these reflect defects in ICU (particularly for the cases where the untagged aliases disagree with)?

ICU encoding
Encoding alias (provider)
Encoding alias (provider)       ICU encoding
ibm-943_P15A-2003
cp932 (Windows)
cp932 (Untagged)
ibm-942_P12A-1999
ibm-943_P130-1999
ibm-943 (IBM)
ibm-943 (Java)  ibm-943 (Untagged)
ibm-943_P15A-2003
ibm-943_P130-1999
Shift_JIS (Untagged)
Shift_JIS (Windows)
Shift_JIS (Java)
Shift_JIS (IANA)
Shift_JIS (MIME)
ibm-943_P15A-2003
ibm-33722_P120-1999
ibm-33722 (IBM)
ibm-33722 (Java)        ibm-33722 (Untagged)
ibm-33722_P12A_P12A-2009_U2
ibm-33722_P120-1999
ibm-5050 (IBM)
ibm-5050 (Untagged)
ibm-33722_P12A_P12A-2009_U2
windows-950-2000
windows-950 (Windows)
windows-950 (Untagged)
ibm-1373_P100-2002
ibm-5471_P100-2006
Big5-HKSCS (Untagged)
Big5-HKSCS (Java)
Big5-HKSCS (IANA)
ibm-1375_P100-2008
windows-936-2000
windows-936 (Windows)
windows-936 (Java)
windows-936 (IANA)
windows-936 (Untagged)
ibm-1386_P100-2001
ibm-949_P11A-1999
ibm-949 (Untagged)
ibm-949 (IBM)
ibm-949 (Java)
ibm-949_P110-1999
ibm-1363_P11B-1998
KS_C_5601-1987 (IANA)
KS_C_5601-1987 (Java)
ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998
KSC_5601 (IANA)
KSC_5601 (Java)
ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998
5601 (Untagged)
5601 (Java)
ibm-970_P110_P110-2006_U2
ibm-1363_P110-1997
ibm-1363 (IBM)
ibm-1363 (Untagged)
ibm-1363_P11B-1998
windows-949-2000
windows-949 (Windows)
windows-949 (Java)
windows-949 (Untagged)
ibm-1363_P11B-1998
windows-949-2000
KS_C_5601-1987 (Windows)
KS_C_5601-1987 (Java)
ibm-970_P110_P110-2006_U2
windows-949-2000
KS_C_5601-1989 (Windows)
KS_C_5601-1989 (IANA)
ibm-1363_P11B-1998
windows-949-2000
KSC_5601 (Windows)
KSC_5601 (MIME)
KSC_5601 (Java)
ibm-970_P110_P110-2006_U2
windows-949-2000
csKSC56011987 (Windows)
csKSC56011987 (IANA)
ibm-1363_P11B-1998
windows-949-2000
korean (Windows)
korean (IANA)
ibm-1363_P11B-1998
windows-949-2000
iso-ir-149 (Windows)
iso-ir-149 (IANA)
ibm-1363_P11B-1998
ibm-874_P100-1995
TIS-620 (Java)
TIS-620 (IANA)
TIS-620 (Windows)
windows-874-2000
ibm-1250_P100-1995
windows-1250 (Untagged)
windows-1250 (Windows)
windows-1250 (Java)
windows-1250 (IANA)
ibm-5346_P100-1998
ibm-1251_P100-1995
windows-1251 (Untagged)
windows-1251 (Windows)
windows-1251 (Java)
windows-1251 (IANA)     ibm-5347_P100-1998
ibm-1252_P100-2000
windows-1252 (Untagged)
windows-1252 (Windows)
windows-1252 (Java)
windows-1252 (IANA)     ibm-5348_P100-1997
ibm-1253_P100-1995
windows-1253 (Untagged)
windows-1253 (Windows)
windows-1253 (Java)
windows-1253 (IANA)     ibm-5349_P100-1998
ibm-1254_P100-1995
windows-1254 (Untagged)
windows-1254 (Windows)
windows-1254 (Java)
windows-1254 (IANA)     ibm-5350_P100-1998
ibm-5351_P100-1998
windows-1255 (Untagged)
windows-1255 (Windows)
windows-1255 (Java)
windows-1255 (IANA)     ibm-9447_P100-2002
ibm-5352_P100-1998
windows-1256 (Untagged)
windows-1256 (Windows)
windows-1256 (Java)
windows-1256 (IANA)     ibm-9448_X100-2005
ibm-5353_P100-1998
windows-1257 (Untagged)
windows-1257 (Windows)
windows-1257 (Java)
windows-1257 (IANA)     ibm-9449_P100-2002
ibm-1258_P100-1997
windows-1258 (Untagged)
windows-1258 (Windows)
windows-1258 (Java)
windows-1258 (IANA)     ibm-5354_P100-1998

Tom.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211121/56ce083d/attachment-0001.htm>

From pgcon6 at msn.com  Tue Nov 30 04:45:07 2021
From: pgcon6 at msn.com (Peter Constable)
Date: Tue, 30 Nov 2021 10:45:07 +0000
Subject: Agreement for Paramount
In-Reply-To: <MWHPR1301MB21126B13FC010384120EE96A86679@MWHPR1301MB2112.namprd13.prod.outlook.com>
References: <27888448-ee13-fb22-5a7b-7f78146fc27a@shoulson.com>
 <afc9fc95-0753-c15e-021c-8c326546846a@shoulson.com>
 <4009f05e-7858-bfc2-529f-322bdab2f0d5@ix.netcom.com>
 <6fa95c81-5cba-12fb-da62-a80ce2eaf250@shoulson.com>
 <3555bed2-fa40-6efa-1dbf-e92cf85d88ed@ix.netcom.com>
 <52061097-692c-7578-e7d7-9460654e3835@shoulson.com>
 <CAJ2xs_GaXYyPxPYoy01qD_Qh=mUGLVFiNQTMjMOiYM6PnWJB4w@mail.gmail.com>
 <0ebce4be-19c8-8e0f-4012-8b15bf641413@shoulson.com>
 <CAJ2xs_HHPHokebrL2udgv1DdQpGuL=ur-Lu8GMnDC5a8F93HNw@mail.gmail.com>
 <f29b2e8e-5363-454c-7fb4-20f68d560fe0@shoulson.com>
 <MWHPR1301MB21126B13FC010384120EE96A86679@MWHPR1301MB2112.namprd13.prod.outlook.com>
Message-ID: <MWHPR1301MB2112F1FDF8916A051B8F192086679@MWHPR1301MB2112.namprd13.prod.outlook.com>

Forgot to include the list.

From: Peter Constable
Sent: November 29, 2021 11:55 PM
To: Mark E. Shoulson <mark at kli.org>
Subject: RE: Agreement for Paramount

From: Unicore <unicore-bounces at corp.unicode.org<mailto:unicore-bounces at corp.unicode.org>> On Behalf Of Mark E. Shoulson via Unicore
Sent: November 28, 2021 2:29 PM
Subject: Re: Agreement for Paramount

[snip]

> If Unicode is willing to do the negotiations, why are we still arguing about this?

The Unicode Consortium isn?t prepared to take the lead in establishing engagement from 3rd-party IP holders. That initiative needs to come from the proposers championing the encoding of a given script. With all of the many scripts that are candidates for encoding, Unicode doesn?t have the capacity to take the lead in preparing proposals for individual scripts, or even to take the lead in resolving questions of IP rights in cases in which there are potential concerns. It?s enough for the volunteers (whose time has been donated, in most cases, by their employers) to vet proposals and work through the technical details that often need to be sorted out.


Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211130/dcbd27f4/attachment.htm>

From mark at kli.org  Tue Nov 30 11:54:10 2021
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 30 Nov 2021 12:54:10 -0500
Subject: Agreement for Paramount
In-Reply-To: <MWHPR1301MB2112F1FDF8916A051B8F192086679@MWHPR1301MB2112.namprd13.prod.outlook.com>
References: <27888448-ee13-fb22-5a7b-7f78146fc27a@shoulson.com>
 <afc9fc95-0753-c15e-021c-8c326546846a@shoulson.com>
 <4009f05e-7858-bfc2-529f-322bdab2f0d5@ix.netcom.com>
 <6fa95c81-5cba-12fb-da62-a80ce2eaf250@shoulson.com>
 <3555bed2-fa40-6efa-1dbf-e92cf85d88ed@ix.netcom.com>
 <52061097-692c-7578-e7d7-9460654e3835@shoulson.com>
 <CAJ2xs_GaXYyPxPYoy01qD_Qh=mUGLVFiNQTMjMOiYM6PnWJB4w@mail.gmail.com>
 <0ebce4be-19c8-8e0f-4012-8b15bf641413@shoulson.com>
 <CAJ2xs_HHPHokebrL2udgv1DdQpGuL=ur-Lu8GMnDC5a8F93HNw@mail.gmail.com>
 <f29b2e8e-5363-454c-7fb4-20f68d560fe0@shoulson.com>
 <MWHPR1301MB21126B13FC010384120EE96A86679@MWHPR1301MB2112.namprd13.prod.outlook.com>
 <MWHPR1301MB2112F1FDF8916A051B8F192086679@MWHPR1301MB2112.namprd13.prod.outlook.com>
Message-ID: <4fa2195b-b802-7732-f300-6dfd58ca174c@shoulson.com>

"Initiative has to come from the proposers... Unicode doesn't have the 
capacity to take the lead in preparing proposals..."? I thought that's 
what I was doing (with my own volunteered time). I'm working on finding 
out if, after not-taking the lead, Unicode is at least willing to follow 
up, which apparently they have to do (or rather, nobody else can do it, 
but they don't have to; I can only hope they will.)? Working on it.? 
These exchanges have given me an idea of another route to explore.

~mark

On 11/30/21 05:45, Peter Constable via Unicode wrote:
>
> Forgot to include the list.
>
> *From:* Peter Constable
> *Sent:* November 29, 2021 11:55 PM
> *To:* Mark E. Shoulson <mark at kli.org>
> *Subject:* RE: Agreement for Paramount
>
> *From:* Unicore <unicore-bounces at corp.unicode.org> *On Behalf Of *Mark 
> E. Shoulson via Unicore
> *Sent:* November 28, 2021 2:29 PM
> *Subject:* Re: Agreement for Paramount
>
> [snip]
>
> > If Unicode is willing to do the negotiations, why are we still 
> arguing about this?
>
> The Unicode Consortium isn?t prepared to take the lead in establishing 
> engagement from 3rd-party IP holders. That initiative needs to come 
> from the proposers championing the encoding of a given script. With 
> all of the many scripts that are candidates for encoding, Unicode 
> doesn?t have the capacity to take the lead in preparing proposals for 
> individual scripts, or even to take the lead in resolving questions of 
> IP rights in cases in which there are potential concerns. It?s enough 
> for the volunteers (whose time has been donated, in most cases, by 
> their employers) to vet proposals and work through the technical 
> details that often need to be sorted out.
>
> Peter
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211130/3a7c401f/attachment.htm>

From pgcon6 at msn.com  Tue Nov 30 12:08:59 2021
From: pgcon6 at msn.com (Peter Constable)
Date: Tue, 30 Nov 2021 18:08:59 +0000
Subject: Agreement for Paramount
In-Reply-To: <4fa2195b-b802-7732-f300-6dfd58ca174c@shoulson.com>
References: <27888448-ee13-fb22-5a7b-7f78146fc27a@shoulson.com>
 <afc9fc95-0753-c15e-021c-8c326546846a@shoulson.com>
 <4009f05e-7858-bfc2-529f-322bdab2f0d5@ix.netcom.com>
 <6fa95c81-5cba-12fb-da62-a80ce2eaf250@shoulson.com>
 <3555bed2-fa40-6efa-1dbf-e92cf85d88ed@ix.netcom.com>
 <52061097-692c-7578-e7d7-9460654e3835@shoulson.com>
 <CAJ2xs_GaXYyPxPYoy01qD_Qh=mUGLVFiNQTMjMOiYM6PnWJB4w@mail.gmail.com>
 <0ebce4be-19c8-8e0f-4012-8b15bf641413@shoulson.com>
 <CAJ2xs_HHPHokebrL2udgv1DdQpGuL=ur-Lu8GMnDC5a8F93HNw@mail.gmail.com>
 <f29b2e8e-5363-454c-7fb4-20f68d560fe0@shoulson.com>
 <MWHPR1301MB21126B13FC010384120EE96A86679@MWHPR1301MB2112.namprd13.prod.outlook.com>
 <MWHPR1301MB2112F1FDF8916A051B8F192086679@MWHPR1301MB2112.namprd13.prod.outlook.com>
 <4fa2195b-b802-7732-f300-6dfd58ca174c@shoulson.com>
Message-ID: <MWHPR1301MB2112BB35C85F9DB85557788D86679@MWHPR1301MB2112.namprd13.prod.outlook.com>

>I'm working on finding out if, after not-taking the lead, Unicode is at least willing to follow up?

Follow up how?

There?s been follow up in this thread; e.g., Mark has provided some pretty detailed information, much more for this kind of issue than I?ve seen done before. If Paramount were to reach out to Unicode wanting to discuss licensing considerations related to a proposal, I?m pretty sure Unicode would be willing to engage with them. Just don?t expect Unicode to be initiating communication with Paramount.


Peter

From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Mark E. Shoulson via Unicode
Sent: November 30, 2021 9:54 AM
To: unicode at corp.unicode.org
Subject: Re: Agreement for Paramount


"Initiative has to come from the proposers... Unicode doesn't have the capacity to take the lead in preparing proposals..."  I thought that's what I was doing (with my own volunteered time).  I'm working on finding out if, after not-taking the lead, Unicode is at least willing to follow up, which apparently they have to do (or rather, nobody else can do it, but they don't have to; I can only hope they will.)  Working on it.  These exchanges have given me an idea of another route to explore.

~mark
On 11/30/21 05:45, Peter Constable via Unicode wrote:
Forgot to include the list.

From: Peter Constable
Sent: November 29, 2021 11:55 PM
To: Mark E. Shoulson <mark at kli.org><mailto:mark at kli.org>
Subject: RE: Agreement for Paramount

From: Unicore <unicore-bounces at corp.unicode.org<mailto:unicore-bounces at corp.unicode.org>> On Behalf Of Mark E. Shoulson via Unicore
Sent: November 28, 2021 2:29 PM
Subject: Re: Agreement for Paramount

[snip]

> If Unicode is willing to do the negotiations, why are we still arguing about this?

The Unicode Consortium isn?t prepared to take the lead in establishing engagement from 3rd-party IP holders. That initiative needs to come from the proposers championing the encoding of a given script. With all of the many scripts that are candidates for encoding, Unicode doesn?t have the capacity to take the lead in preparing proposals for individual scripts, or even to take the lead in resolving questions of IP rights in cases in which there are potential concerns. It?s enough for the volunteers (whose time has been donated, in most cases, by their employers) to vet proposals and work through the technical details that often need to be sorted out.


Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20211130/2ce749f8/attachment.htm>

From public at khwilliamson.com  Tue Nov 30 12:38:48 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 30 Nov 2021 11:38:48 -0700
Subject: Directionality controls for malicious code
Message-ID: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com>

It is possible to make text appear to be other than what it really is by 
using BiDi controls.

Such text may be be in the form of computer code, which could allow a 
trojan horse attack by sneaking stuff past human code reviewers.

I have not studied the BiDi algorithm, so this may be naive.

Is there any legitimate use of BiDi controls in text that doesn't have a 
mixture of LtoR and RtoL strings?

If not, and since there are relatively few scripts of RtoL characters, 
is there any legitimate use of BiDi controls outside of script runs of 
those scripts.

If not, then could the Bidi control characters be made to have their scx 
property value be all the RtoL scripts, and software such as git could 
warn or forbid text of mixed scripts?

Or could a new property be created that allowed for machine detection of 
malicious use?

Karl Williamson

From eliz at gnu.org  Tue Nov 30 12:59:13 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Tue, 30 Nov 2021 20:59:13 +0200
Subject: Directionality controls for malicious code
In-Reply-To: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> (message
 from Karl Williamson via Unicode on Tue,
 30 Nov 2021 11:38:48 -0700)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com>
Message-ID: <83h7bttqpq.fsf@gnu.org>

> Date: Tue, 30 Nov 2021 11:38:48 -0700
> From: Karl Williamson via Unicode <unicode at corp.unicode.org>
> 
> Is there any legitimate use of BiDi controls in text that doesn't have a 
> mixture of LtoR and RtoL strings?

Yes, although it's rare.  For example, there could be text that is
used to explain the effect of these format controls on LTR characters.
Another legitimate use would be a string of LTR characters that is
enclosed in these formatting controls so that it could be later placed
in RTL context without risking to get a jumbled text due to characters
with weak directionality.

Moreover, in real-life applications it could be quite hard to even
know whether a given chunk of text contains mixed LTR and RTL
characters, because the region could be very large and the application
doesn't necessarily consider all of it.

> If not, and since there are relatively few scripts of RtoL characters, 
> is there any legitimate use of BiDi controls outside of script runs of 
> those scripts.

Of course.  A typical use is for LTR characters embedded inside
otherwise RTL text.  There are examples of that in UAX#9, I think.

> Or could a new property be created that allowed for machine detection of 
> malicious use?

"Malicious use" is hard to define precisely in this case, IME.  We,
humans, know it when we see it, but the malicious intent is many times
extremely context-dependent and semantically-loaded, so it's hard to
detect it algorithmically, because most algorithms don't understand
the semantics of the text.