From jameskass at code2001.com Mon Nov 1 10:11:57 2021 From: jameskass at code2001.com (James Kass) Date: Mon, 1 Nov 2021 15:11:57 +0000 Subject: Tales from the archives Message-ID: <45361202-4125-d8d4-f9b8-542dc34a7467@code2001.com> Recently someone mentioned how a public list thread can generate nuggets of insight even when the topic being discussed may be controversial and/or the thread might have a tendency to veer off topic. Reviewing threads spanning April and May of 2004 in this list's archives affirms the accuracy of that observation. While the threads being reviewed were ongoing, there were other conversations related to Unicode, such as why UTF-8 worked for Plane Two in a certain browser but didn't work for Plane One. Additional discussions covered planned extensions to existing blocks as well as scripts which might be encoded in the future (as most of them were). But the discussion I examined was related to a script proposal by Michael Everson. Real and imaginary characters were brought into the discussion, such as George Custer, Zaphod Beeblebrox, Ezra the Scribe (and Ezra the font), Martin Bormann, Potter Stewart, Cerberus the three-headed dog, Popeye the Sailor Man, Alexander the Great, and Hannibal (not Lecter) – some of whom might be considered off topic. Even Chang and Eng popped up. A neologism was coined which never gained currency. The thread and its spawn became so popular that the topic itself was banned from further discussion. During the threads, Michael Everson shared information about how he, Ken Whistler, and Rick McGowan had set up the Roadmaps, guided by the history and long-established studies of the world's writing systems. Various posters provided insight into UTC deliberations and considerations, as well as procedural information about other standards bodies. Definitions of some words as used in Unicode jargon were compared to how those same words were defined elsewhere, and some of the Unicode usages were further clarified.
Some list members offered their backgrounds and fields of interest, revealing considerable diversity among members. At the time, standardizing ancient scripts was fairly novel, so precedent and procedure were nascent. Ken Whistler made a post about determining whether a script should be considered which is well worth revisiting: https://www.unicode.org/mail-arch/unicode-ml/y2004-m05/1138.html Ken expressed the concepts clearly, using language and phrasing understandable even to the casual list visitor. Not only do those principles Ken outlined remain germane today, they are expected to continue to guide Unicode into the future. The Unicode public list archives are a treasure trove of information about Unicode and the history of the project. We should all be thankful that they are available and well maintained. Best regards, James Kass From abrahamgross at disroot.org Tue Nov 2 20:03:08 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Wed, 03 Nov 2021 01:03:08 +0000 Subject: New CJK characters Message-ID: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> I have a proposal regarding the future of encoding new Unihan characters into Unicode that I'd like to float by this group to see if it makes any sense. New CJK characters keep being encoded, and the pace doesn't seem to be slowing down, to the point where there are now 92,856 CJK characters in Unicode! I think that going forward, it would make a lot of sense if, instead of encoding each new character as a separate codepoint, we adopted a paradigm like that of Sutton SignWriting (https://en.wikipedia.org/wiki/Sutton_SignWriting_(Unicode_block)) – where Unicode would provide a set of all radicals and position/sizing modifiers – and anyone who wants to use an arbitrary non-encoded character would be able to just combine the radicals the right way (using a GUI designed for this, à la glyphwiki.org's or Wenlin's editor), and then be able to use the character right away.
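As a rough sketch of the combining idea, here is what an unencoded character expressed as a plain sequence of encoded components could look like, using the Ideographic Description Characters already in Unicode (the character and component choices are illustrative assumptions, not part of the proposal):

```python
# Illustrative sketch only: an ideograph described as a plain sequence of
# an Ideographic Description Character plus encoded components.

IDC_LEFT_TO_RIGHT = "\u2FF0"  # ⿰ IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

# Describe a character made of 木 beside 朋 (this happens to be 棚, U+68DA,
# but the same mechanism works for characters with no codepoint at all).
ids = IDC_LEFT_TO_RIGHT + "\u6728" + "\u670B"

# The description is ordinary plain text: three code points that can be
# stored, searched, and exchanged like any other Unicode string.
assert ids == "⿰木朋"
print(ids)  # ⿰木朋
```

Such a sequence already survives copy/paste and plain-text storage; what is being proposed on top of that is the expectation that fonts render it as a single composed glyph.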
This would work because the font would only have to support the basic strokes, and since all CJK characters are composed of the basic strokes, the font would be able to put the character together without the need for a font maker to specifically create that character. This method of "encoding" would solve many problems we have now: * Non-encoded characters can be used without the need to wait years for the character to be accepted into Unicode, and then a couple more years until the major OSes update their fonts to support the new characters. * This is, in my opinion, a really neat solution to the gaiji problem (described here (https://en.wikipedia.org/wiki/OpenType#SING_gaiji_solution)). * This would also allow much more rapid font development, since you'd only need to create the basic strokes and some radicals to get a working version of the font; all other characters would then just be a matter of refining the exact stroke size/positioning. * Most CJK fonts only have a small subset of all available characters. This would allow any font to support any character you wish - including ones you dream up. * People have been coming up with new CJK characters for thousands of years, including nowadays (here's a new-kanji competition, for example (https://sousaku-kanji.com/archive.html)), but any new characters created nowadays would be extremely hard to get into Unicode, since Unicode requires proof of use before accepting a proposal - but how are people supposed to use a character if they can't type it? I still think that Unicode should keep track of new characters in a Nameslist of sorts so that font makers have a base to go off of. Q: My (city) name has a character that isn't encoded. How can I type it quickly without needing to open up an editor and create it each time? A: Adding it to your IME's dictionary would allow you to create the character just once.
- This can be extended in such a way that an IME can be formed entirely out of preconstructed characters instead of codepoints. Q: What would the specifics of such a system look like behind the scenes? A: I'm not sure yet, but I think Wenlin's CDL (http://guide.wenlininstitute.org/wenlin4.3/Character_Description_Language) would be a good place to start. -------------- next part -------------- An HTML attachment was scrubbed... URL: From abrahamgross at disroot.org Tue Nov 2 20:09:07 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Wed, 03 Nov 2021 01:09:07 +0000 Subject: New CJK characters In-Reply-To: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> Message-ID: <2fac5213041386c0a733a48843f9b280@disroot.org> I sent this by mistake while writing it up (before finishing), but you can tell the basic gist of what I was trying to say. From jameskass at code2001.com Tue Nov 2 22:34:47 2021 From: jameskass at code2001.com (James Kass) Date: Wed, 3 Nov 2021 03:34:47 +0000 Subject: New CJK characters In-Reply-To: <2fac5213041386c0a733a48843f9b280@disroot.org> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <2fac5213041386c0a733a48843f9b280@disroot.org> Message-ID: On 2021-11-03 1:09 AM, Abraham Gross via Unicode wrote: > Q: What would the specifics of such a system look like behind the scenes? > A: I'm not sure yet, but I think Wenlin's CDL (http://guide.wenlininstitute.org/wenlin4.3/Character_Description_Language) would be a good place to start. This web page gives an overview of some of the approaches: https://everything.explained.today/Chinese_character_description_languages/ Wenlin's approach is quite sophisticated and has been around for a while. A quick web search didn't turn up any previous proposals for getting Wenlin's CDL enshrined in Unicode, although Richard Cook has submitted various encoding proposals over the years. If Wenlin personnel never floated any CDL-related proposal, it may be that they themselves consider such an approach to be out of scope for plain text. As many of us know, Andrew West maintains a list of IDS for encoded Han characters, available here: https://www.babelstone.co.uk/CJK/index.html Using IDS to generate glyphs on the fly might be workable, although such an approach might well be relegated to a higher level protocol. Meanwhile, an IDS can already be stored and exchanged in a standard fashion.
Counting how many of any IDS for an as-yet-unencoded ideograph exist in plain text might help to establish usage for future encoding consideration. Ken Whistler crunched some numbers about CJK additions here: https://www.unicode.org/mail-arch/unicode-ml/y2018-m03/0023.html Additional information about CJK proliferation can be found here: https://www.babelstone.co.uk/Blog/2007/07/cjk-unified-ideographs-to-infinity-and.html From 747.neutron at gmail.com Tue Nov 2 23:14:24 2021 From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=) Date: Wed, 3 Nov 2021 13:14:24 +0900 Subject: Fwd: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <2fac5213041386c0a733a48843f9b280@disroot.org> Message-ID: I just noticed that my message wasn't sent to the mailing list. ---------- Forwarded message --------- From: Wáng Yifán <747.neutron at gmail.com> Date: 2021年11月3日(水) 10:56 Subject: Re: New CJK characters To: FWIW, I was told that BabelStone utilizes a mechanism that glues each element of an IDS with WJ, just like composite emoji (not dynamic, though). It may be useful if that kind of notation gets any official recognition. Also see https://zi.tools/?secondary=ids > Q: What would the specifics of such a system look like behind the scenes? > A: I'm not sure yet, but I think Wenlin's CDL would be a good place to start. I think we need a separation of concerns here. CDL looks more like a font-level technology to me. Whether or not it is adoptable, a plainer text format in Unicode sequence, if not IDS, will surely be required separately as the input to fonts. 2021年11月3日(水) 10:11 Abraham Gross via Unicode : > > I sent this by mistake while writing it up (before finishing), but you can tell the basic gist of what I was trying to say. From abrahamgross at disroot.org Tue Nov 2 23:40:59 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Wed, 3 Nov 2021 04:40:59 +0000 (UTC) Subject: Fwd: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <2fac5213041386c0a733a48843f9b280@disroot.org> Message-ID: Wow I'm really impressed by this tool! https://zi.tools/?secondary=ids Examples I tried to test the limits of what it can do: https://imgur.com/9EMGqvM https://imgur.com/lkgSGeq From jameskass at code2001.com Wed Nov 3 00:27:01 2021 From: jameskass at code2001.com (James Kass) Date: Wed, 3 Nov 2021 05:27:01 +0000 Subject: Fwd: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <2fac5213041386c0a733a48843f9b280@disroot.org> Message-ID: On 2021-11-03 4:40 AM, abrahamgross--- via Unicode wrote: > Wow I'm really impressed by this tool!
> https://zi.tools/?secondary=ids > > Examples I tried to test the limits of what it can do: > > https://imgur.com/9EMGqvM > https://imgur.com/lkgSGeq It is very impressive. I input an IDS for an as-yet-unencoded character slated for Extension H (???) and was immediately rewarded with a beautiful ideograph. The combos Abraham Gross tried are more complex than that. I'd say the tool passes the tests! (I would guess that Wáng Yifán uses component stroke counts in order to algorithmically determine the relative heights and widths of the components, and may well have also assigned "classes" for each component's base, top, and so forth to determine how those components could be kerned or adjusted for the optimal fit.) Maybe in the future there will be a conversion feature in a plain text editor which would automatically generate ideographs based on IDSs for the display. From A.Schappo at lboro.ac.uk Wed Nov 3 08:18:41 2021 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 3 Nov 2021 13:18:41 +0000 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <2fac5213041386c0a733a48843f9b280@disroot.org> Message-ID: I am totally impressed as well. I have just used it to generate a png for a character I created some time ago, which I call ??? I created it for a friend who has 3 children ? http://zu.zi.tools/???.png & https://?.??/hao3 André Schappo ________________________________ From: Unicode on behalf of abrahamgross--- via Unicode Sent: 03 November 2021 04:40 To: unicode at corp.unicode.org Subject: Fwd: New CJK characters ** THIS MESSAGE ORIGINATED OUTSIDE LOUGHBOROUGH UNIVERSITY ** Be wary of links or attachments, especially if the email is unsolicited or you don't recognise the sender's email address. Wow I'm really impressed by this tool!
https://zi.tools/?secondary=ids Examples I tried to test the limits of what it can do: https://imgur.com/9EMGqvM https://imgur.com/lkgSGeq -------------- next part -------------- An HTML attachment was scrubbed... URL: From 747.neutron at gmail.com Wed Nov 3 09:27:33 2021 From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=) Date: Wed, 3 Nov 2021 23:27:33 +0900 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <2fac5213041386c0a733a48843f9b280@disroot.org> Message-ID: As you might have noticed, I'm not the developer of the website I mentioned. It is a great service run by an IRG contributor; I think you can just join the Telegram to contact the community. 2021年11月3日(水) 22:24 Andre Schappo via Unicode : > > > I am totally impressed as well. I have just used it to generate a png for a character I created some time ago, which I call ??? I created it for a friend who has 3 children ? > > http://zu.zi.tools/???.png & https://?.??/hao3 > > André Schappo > > ________________________________ > From: Unicode on behalf of abrahamgross--- via Unicode > Sent: 03 November 2021 04:40 > To: unicode at corp.unicode.org > Subject: Fwd: New CJK characters > > Wow I'm really impressed by this tool! > https://zi.tools/?secondary=ids > > Examples I tried to test the limits of what it can do: > > https://imgur.com/9EMGqvM > https://imgur.com/lkgSGeq From pgcon6 at msn.com Wed Nov 3 12:40:37 2021 From: pgcon6 at msn.com (Peter Constable) Date: Wed, 3 Nov 2021 17:40:37 +0000 Subject: New CJK characters In-Reply-To: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> Message-ID: Something to consider: while highlighting potential benefits in relation to characters that are used only very rarely (in general – there might be local exceptions for some place names), you don't mention the problems that would be created for the vast majority of much-more-frequently used ideographs, as well as the downsides for those rare characters. For example, the IDS scheme would never be supported in IDNA, so that town name could never be used in a domain name. Peter From: Unicode On Behalf Of Abraham Gross via Unicode Sent: Tuesday, November 2, 2021 6:03 PM To: unicode at corp.unicode.org Subject: New CJK characters I have a proposal regarding the future of encoding new Unihan characters into Unicode that I'd like to float by this group to see if it makes any sense. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Wed Nov 3 13:13:07 2021 From: jsbien at mimuw.edu.pl (=?utf-8?Q?Janusz_S=2E_Bie=C5=84?=) Date: Wed, 03 Nov 2021 19:13:07 +0100 Subject: "DOS fonts" (was RE: Breaking barriers) In-Reply-To: (James Kass's message of "Mon, 25 Oct 2021 20:03:50 +0000") References: <001201d7c9c5$042bd240$0c8376c0$@ewellic.org> Message-ID: <87pmrhf6q4.fsf@mimuw.edu.pl> On Mon, Oct 25 2021 at 20:03 GMT, James Kass wrote: > On 2021-10-25 5:23 PM, Doug Ewell via Unicode wrote: >> Peter Constable wrote: >> >>>> A DOS command then enabled users to swap the font-in-use. >>> As I recall, DOS had no such command. Rather, one needed a utility >>> that would load the font data into specific memory. >> I suspect James was thinking of the MODE CON CP SELECT=x command, where 'x' was the code page ID of the desired character set. > My post was poorly phrased. "A command entered at the DOS prompt" > would have been better. It wasn't a native DOS command. An internet > search revealed that typical extensions for the modified/newly created > fonts included "*.F11" or "*.F12". I couldn't locate the "*.COM" file > which swapped the font-in-use in my archives; I can't remember the > file name. I did find "8859-5.f16" in a directory, which appears to > be one I made back in the day. Switching the font started to be possible with the EGA; I used to switch from CP852 to ISO Latin-2 just for fun. Earlier you had to change the ROM in your graphics card. For Polish letters you had to "burn in" (?) your font into your custom ROM (with UV light, if I remember correctly). Regards JSB P.S. I read the list in digest form, so my post may cross with other relevant postings. -- Janusz S.
Bień emeryt (emeritus) https://sites.google.com/view/jsbien From abrahamgross at disroot.org Wed Nov 3 15:44:49 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Wed, 3 Nov 2021 20:44:49 +0000 (UTC) Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> Message-ID: <56309b47-32aa-4251-812b-0574da750313@disroot.org> If a new/rare character is composed of an IDS sequence (ex: ?????) like many emojis, then it should be able to be represented in URLs just fine -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Wed Nov 3 15:51:15 2021 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 3 Nov 2021 13:51:15 -0700 Subject: New CJK characters In-Reply-To: <56309b47-32aa-4251-812b-0574da750313@disroot.org> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <56309b47-32aa-4251-812b-0574da750313@disroot.org> Message-ID: On Wed, Nov 3, 2021 at 1:48 PM abrahamgross--- via Unicode < unicode at corp.unicode.org> wrote: > If a new/rare character is composed of an IDS sequence (ex: ?????) like > many emojis, then it should be able to be represented in URLs just fine > Peter's reference to IDNA points out that such sequences are not allowed in *domain names*. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Wed Nov 3 16:22:58 2021 From: mark at kli.org (Mark E. Shoulson) Date: Wed, 3 Nov 2021 17:22:58 -0400 Subject: New CJK characters In-Reply-To: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> Message-ID: I'm waiting for some of the old-timers here to give a proper answer, Unicode history-wise. As I understood it, the idea of using IDS or something similar for CJK characters was considered (probably more than once), and it was decided to do things this way, and so that's the way we're doing them.
A font wouldn't necessarily have to be able to generate new hanzi dynamically from IDS descriptions; it could have all the 100,000 or however many glyphs already there, and just render the known ones like ligatures or something. It means it's still up to font-designers to add characters when they're needed, but the list of characters is then open-ended and it's up to font-designers to decide what they want to support. OTOH, as is well known, IDS descriptions are not unique. There's frequently more than one way to slice a character up. Should *all* be supported? Should there be some way to decide the "canonical" decomposition? I guess if we're leaving it up to fonts, it's then up to the font designers again, but that would break all the non-font uses of Unicode (searching, comparing, etc.) unless there is some canonical representation. I don't know if IDS sequences can really represent "all" han characters; I'd guess probably not, but there are probably more sophisticated systems that can do better. There'll probably always be corner cases, though. But at any rate, it's my understanding that that particular ship has already sailed, and atomic CJK characters is how Unicode does stuff. Changing that now would be rather more disruptive than just saying "no more precomposed accented letters." On 11/2/21 21:03, Abraham Gross via Unicode wrote: > I have a proposal regarding the future of encoding new Unihan > characters into Unicode that I'd like to float by this group to see if > it makes any sense. .... ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskass at code2001.com Wed Nov 3 16:59:57 2021 From: jameskass at code2001.com (James Kass) Date: Wed, 3 Nov 2021 21:59:57 +0000 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> Message-ID: On 2021-11-03 9:22 PM, Mark E. Shoulson via Unicode wrote: > There's frequently more than one way to slice a character up.
Should > *all* be supported? Should there be some way to decide the > "canonical" decomposition? Take U+68DA "棚", which can be given an IDS of "⿰木朋" or "⿰木⿰月月". Entering either into the Zi tool gets the character. Entering the latter results in the tool showing a "normalized IDS", which is the former. It appears that the tool is, of necessity, performing its own "roll up" of the sequences in order to perform look-ups. Then there are unification issues. For example, this recently added Extension G character: U+31310??? ???? ^???$(G)??? ^???$(Z) ...the tool generates fine ideographs for both IDS. But only the first IDS is being recognized by the tool as a valid Unicode character. Then there are regional preferences of component glyph shapes to consider, and I don't know how or if that would be addressed. IDSs are useful for expressing unencoded ideographs in plain text, not only for those rare older characters but also for newly invented ones. (Sorry for my earlier misperception about the identity of the tool's developer.) From john_h_jenkins at apple.com Wed Nov 3 17:18:53 2021 From: john_h_jenkins at apple.com (john_h_jenkins) Date: Wed, 03 Nov 2021 16:18:53 -0600 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> Message-ID: <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> > On Nov 3, 2021, at 3:22 PM, Mark E. Shoulson via Unicode wrote: > I don't know if IDS sequences can really represent "all" han characters; I'd guess probably not, but there are probably more sophisticated systems that can do better. There'll probably always be corner cases, though. > > They do not. Even more sophisticated systems like CDL don't. (See L2/21-118.) I should point out that even sophisticated systems that draw characters based on their IDS (or CDL) are not going to match the quality of a commercial CJK font.
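The "normalized IDS" roll-up that the Zi tool performs can be approximated by normalizing in the opposite direction: fully decompose every component via a lookup table, so alternative spellings of the same character compare equal. A minimal sketch; the two-entry decomposition table (for U+68DA and U+670B) is illustrative only, not real UCD or tool data:

```python
# Sketch of IDS normalization: recursively expand components until no
# further decomposition applies, so every spelling of a character
# bottoms out at the same fully-decomposed string.

DECOMP = {
    "\u68da": "\u2ff0\u6728\u670b",  # 棚 -> ⿰木朋
    "\u670b": "\u2ff0\u6708\u6708",  # 朋 -> ⿰月月
}

def canonical_ids(s: str) -> str:
    """Expand every decomposable component, recursively."""
    return "".join(canonical_ids(DECOMP[ch]) if ch in DECOMP else ch for ch in s)

# ⿰木朋 and ⿰木⿰月月 normalize to the same string:
assert canonical_ids("\u2ff0\u6728\u670b") == canonical_ids("\u2ff0\u6728\u2ff0\u6708\u6708")
```

A tool's "roll up" is then just the inverse lookup: mapping a fully decomposed string back to its shortest encoded spelling.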
> But at any rate, it's my understanding that that particular ship has already sailed, and atomic CJK characters is how Unicode does stuff. Changing that now would be rather more disruptive than just saying "no more precomposed accented letters." > This is actually touched on in TUS (§18.2) and the FAQ (Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?). Outside of the momentum issue mentioned, compositional methods don't work because of "spelling" ambiguity and failure to address issues such as collation, text-to-speech, searching, semantic analysis: basically, everything you want to use text for *other* than rendering. Even in rendering, you aren't covering the region-specific shapes, at least not with IDS. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskass at code2001.com Wed Nov 3 17:35:13 2021 From: jameskass at code2001.com (James Kass) Date: Wed, 3 Nov 2021 22:35:13 +0000 Subject: New CJK characters In-Reply-To: <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> Message-ID: On 2021-11-03 10:18 PM, john_h_jenkins via Unicode wrote: > I should point out that even sophisticated systems that draw characters based on their IDS (or CDL) are not going to match the quality of a commercial CJK font. Any reasonable glyph is better than the "missing glyph".
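The fallback being argued for here might look like the following sketch. The font coverage set and the IDS database entry are stand-ins, and U+E000 is just a private-use placeholder for an unencoded character, not real data:

```python
# Hedged sketch: prefer the font's own glyph; fall back to an ad-hoc
# glyph composed from an IDS; only then give up with .notdef.

FONT_CMAP = {"\u68da"}                      # codepoints the font covers
IDS_DB = {"\ue000": "\u2ff0\u6728\u6728"}   # hypothetical unencoded character

def glyph_for(ch: str) -> str:
    if ch in FONT_CMAP:
        return f"font glyph for U+{ord(ch):04X}"
    if ch in IDS_DB:
        return f"ad-hoc glyph composed from IDS {IDS_DB[ch]}"
    return ".notdef"  # the missing glyph of last resort

print(glyph_for("\u68da"))  # -> font glyph for U+68DA
```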
From abrahamgross at disroot.org Wed Nov 3 18:22:49 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Wed, 3 Nov 2021 23:22:49 +0000 (UTC) Subject: New CJK characters In-Reply-To: <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> Message-ID: Sutton SignWriting is completely compositional, and yet it was encoded despite all the drawbacks 2021/11/03 6:19:32 PM john_h_jenkins via Unicode : > > This is actually touched on in TUS (§18.2) and the FAQ (Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?). Outside of the momentum issue mentioned, compositional methods don't work because of "spelling" ambiguity and failure to address issues such as collation, text-to-speech, searching, semantic analysis: basically, everything you want to use text for *other* than rendering. Even in rendering, you aren't covering the region-specific shapes, at least not with IDS. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskass at code2001.com Wed Nov 3 21:20:13 2021 From: jameskass at code2001.com (James Kass) Date: Thu, 4 Nov 2021 02:20:13 +0000 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> Message-ID: <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> Take a Han character already encoded and call it "?". Since ? is encoded, it can be entered in plain-text and The Standard serves us well. Rendering (higher level protocol) checks available fonts for coverage. If ? is covered, that's the end of it. But if ? isn't covered, the application /could/ query the IDS database and construct a glyph on the fly. If there's an unencoded character, "?", it can't be entered in plain-text directly.
IDCs/IDSs are a notational system which can serve as placeholders in plain-text. Maybe ? will be encoded someday, maybe not. Meanwhile The Standard serves us well because this notational system is encoded. Rendering /could/ construct an /ad hoc/ glyph for ? which would be exo-Unicode. The underlying data wouldn't be altered. Any application sophisticated enough to generate reasonable glyphs on the fly based on IDSs should be sophisticated enough to check any opened files for IDSs which have since become encoded and offer the user the option of replacing IDSs with Unicode characters as appropriate. The document linked by John H. Jenkins earlier, L2/21-118, shows that efforts are underway to enhance the IDSs by adding missing IDCs as well as presently unencoded components. The current level of support already covers the vast majority of encoded characters. When the enhancements are accomplished, only the most bizarre edge cases will remain inexpressible as IDSs, AFAICT. We shouldn't expect Unicode to say that any conformant application must substitute glyphs on the fly for IDSs. But many users would probably welcome sophisticated applications which can do it. From abrahamgross at disroot.org Wed Nov 3 21:38:17 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Thu, 4 Nov 2021 02:38:17 +0000 (UTC) Subject: New CJK characters In-Reply-To: <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> Message-ID: <7340a6c6-7ff0-4ca1-b912-ccab2dba0677@disroot.org> I'd say making an update to HarfBuzz (the most popular text-shaping engine) so that it includes IDS shaping would solve this problem very nicely.
Maybe we should require a special character somewhere in the IDS when we want it to combine 2021/11/03 10:21:00 PM James Kass via Unicode : > We shouldn't expect Unicode to say that any conformant application must substitute glyphs on the fly for IDSs. But many users would probably welcome sophisticated applications which can do it. From john_h_jenkins at apple.com Thu Nov 4 11:38:39 2021 From: john_h_jenkins at apple.com (john_h_jenkins) Date: Thu, 04 Nov 2021 10:38:39 -0600 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> Message-ID: <94086C46-BAC3-4BF6-AC5A-8CDAF6B7C1B2@apple.com> > On Nov 3, 2021, at 4:35 PM, James Kass via Unicode wrote: > > > On 2021-11-03 10:18 PM, john_h_jenkins via Unicode wrote: >> I should point out that even sophisticated systems that draw characters based on their IDS (or CDL) are not going to match the quality of a commercial CJK font. > Any reasonable glyph is better than the "missing glyph". Oh, this is true, and I should have been clearer. IDSs as a way of *representing* unencoded characters is fine. It's what they were invented for. And any rendering engine that can turn these into visually-pleasing glyphs is welcome to do so (see TUS pp. 750–751). IDSs are not, however, a workable alternative to *encoding* Han ideographs as singletons. Even simpler ideas that would allow some ideographs to be implicitly encoded have been rejected by IRG members.
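James Kass suggested earlier in the thread that a sophisticated application could scan opened files for IDS runs that have since been encoded and offer to replace them. A minimal sketch; the one-entry ENCODED map is illustrative, not real UCD data, and the pattern is a deliberately crude approximation of the recursive IDS grammar:

```python
import re

# Pretend ⿰木朋 has since been encoded as 棚 (U+68DA); real data would
# come from tracking additions to the UCD across versions.
ENCODED = {"\u2ff0\u6728\u670b": "\u68da"}

# Crude IDS run: an Ideographic Description Character (U+2FF0..U+2FFB)
# followed by two or more IDC/CJK components. Real IDS parsing is recursive.
IDS_RUN = re.compile(r"[\u2ff0-\u2ffb][\u2ff0-\u2ffb\u3400-\u9fff]{2,}")

def upgrade(text: str) -> str:
    """Replace IDS runs that now have an encoded equivalent; leave the rest."""
    return IDS_RUN.sub(lambda m: ENCODED.get(m.group(), m.group()), text)

print(upgrade("shelf: \u2ff0\u6728\u670b"))  # -> shelf: 棚
```

An interactive application would of course prompt the user per occurrence rather than substitute silently.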
From john_h_jenkins at apple.com Thu Nov 4 11:51:09 2021 From: john_h_jenkins at apple.com (john_h_jenkins) Date: Thu, 04 Nov 2021 10:51:09 -0600 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> Message-ID: <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com> As I understand it, the encoded repertoire for Sutton SignWriting is inadequate for actual display of text because Unicode doesn't provide a mechanism for the two-dimensional layout SignWriting uses (TUS p. 831). In this, it's like music and mathematics. > On Nov 3, 2021, at 5:22 PM, abrahamgross--- via Unicode wrote: > > Sutton SignWriting is completely compositional, and yet it was encoded despite all the drawbacks > > 2021/11/03 6:19:32 PM john_h_jenkins via Unicode : > > > This is actually touched on in TUS (§18.2) and the FAQ (Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?). Outside of the momentum issue mentioned, compositional methods don't work because of "spelling" ambiguity and failure to address issues such as collation, text-to-speech, searching, semantic analysis: basically, everything you want to use text for *other* than rendering. Even in rendering, you aren't covering the region-specific shapes, at least not with IDS. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jameskass at code2001.com Thu Nov 4 18:55:42 2021 From: jameskass at code2001.com (James Kass) Date: Thu, 4 Nov 2021 23:55:42 +0000 Subject: SignWriting (was Re: New CJK characters) In-Reply-To: <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com> Message-ID: <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com> On 2021-11-04 4:51 PM, john_h_jenkins via Unicode wrote: > As I understand it, the encoded repertoire for Sutton SignWriting is inadequate for actual display of text because Unicode doesn't provide a mechanism for the two-dimensional layout SignWriting uses (TUS p. 831). In this, it's like music and mathematics. This is correct as far as the currently encoded repertoire goes. My understanding is that the current repertoire represents the characters without any layout mechanism, but that the mechanism was considered essential (as 'spelling') and would be proposed separately. (Maybe it was proposed separately and rejected, IDK.) Quoting from: https://www.unicode.org/L2/L2012/12321-n4342-signwriting.pdf "In terms of UCS encoding, two main stages will be required. The first stage (represented in this proposal) is simpler: the encoding of the basic characters. These are simply graphic characters, proposed to be encoded in Plane 1. The second stage will deal with the spatial organization of SignWriting characters. The latter are anticipated to be encoded as control characters specific to SignWriting, probably in Plane 14."
From jameskass at code2001.com Thu Nov 4 20:32:33 2021 From: jameskass at code2001.com (James Kass) Date: Fri, 5 Nov 2021 01:32:33 +0000 Subject: SignWriting (was Re: New CJK characters) In-Reply-To: <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com> <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com> Message-ID: <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com> On 2021-11-04 11:55 PM, James Kass via Unicode wrote: > (Maybe it was proposed separately and rejected, IDK.) Sorry, my bad. Apparently this is the case, and John H. Jenkins had already provided the relevant page number from the current Standard PDF: "The spatial arrangement of the symbols is an essential part of the writing system, but constitutes a higher-level protocol beyond the scope of the Unicode Standard." From c933103 at gmail.com Fri Nov 5 08:14:15 2021 From: c933103 at gmail.com (Phake Nick) Date: Fri, 5 Nov 2021 21:14:15 +0800 Subject: New CJK characters In-Reply-To: <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> Message-ID: I briefly mentioned the issue in my previous mail to the mailing list, to which I received some replies worth consideration, and I still haven't gotten around to writing replies to those mails. But yes, such character encoding systems had been conceptualized in the 20th century, before the wide adoption of Unicode; because the current encoding system is so convenient, people simply opt to use it instead of alternatives that might be incrementally better but would be incompatible with existing systems.
Recently I came across some proposed solutions for developing CJK fonts for arrays of characters by using deep learning to put radicals together with the different components of different characters nicely, in proportion, through machine learning; that's also something we didn't have back in the pre-Unicode era. -------------- next part -------------- An HTML attachment was scrubbed... URL: From abrahamgross at disroot.org Fri Nov 5 08:25:27 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Fri, 5 Nov 2021 13:25:27 +0000 (UTC) Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> Message-ID: <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org> Looking at TUS §11.4 Egyptian Hieroglyphs, you can see that there they did decide to use control characters to shape complex characters. Anyone know why that is? -------------- next part -------------- An HTML attachment was scrubbed... URL: From Andrew.Glass at microsoft.com Fri Nov 5 12:02:20 2021 From: Andrew.Glass at microsoft.com (Andrew Glass) Date: Fri, 5 Nov 2021 17:02:20 +0000 Subject: [EXTERNAL] Re: New CJK characters In-Reply-To: <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org> Message-ID: We use control characters for Egyptian because it is possible and preferable to do so. The elements of the writing system are the encoded logographic and phonetic signs. The signs are arranged spatially to take advantage of available space. The blocks of writing can represent polysyllabic sequences or even multiple words. Thus, these blocks are quite different from CJK.
Cataloguing attested blocks to encode them atomically would never be complete and would result in a massive number of combinations. It is important to the user community (mainly scholars) to be able to enter texts that are newly discovered, and that therefore would contain previously unattested blocks. So, rendering of arbitrary blocks is a requirement, hence the use of control characters to define the spatial relationships. Cheers, Andrew ________________________________ From: Unicode on behalf of abrahamgross--- via Unicode Sent: Friday, November 5, 2021 1:25 PM To: unicode at corp.unicode.org Subject: [EXTERNAL] Re: New CJK characters Looking at TUS §11.4 Egyptian Hieroglyphs, you can see that there they did decide to use control characters to shape complex characters. Anyone know why that is? -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at sonic.net Fri Nov 5 12:06:38 2021 From: kenwhistler at sonic.net (Ken Whistler) Date: Fri, 5 Nov 2021 10:06:38 -0700 Subject: New CJK characters In-Reply-To: <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> <65ed13f8-cb0e-41b7-9a14-765a9bb45c6c@disroot.org> Message-ID: Because *quadrats* are sequences of independent signs organized into square boxes for presentation. They are conceived of that way by modern-day Egyptologists, and presumably also by the people who wrote them millennia ago. Although both hieroglyphics and Han characters are graphically complex and both have concepts of dynamic (and somewhat recursive) principles for construction of more complex forms, when examined in detail the systems are quite distinct. And the way the writing systems map onto the languages involved is quite distinct as well. And then there is the simple fact of precedent, which weighs heavily on encoding decisions for complex scripts.
For Han, we started with the existing fact of implemented JIS and GB systems and all their cousins, which encoded Han characters atomically (by necessity), and treated the dynamic structure of Han characters the same way almost all CJK dictionaries do: by enumerated list. For Egyptian hieroglyphs we started with the Gardiner list of *signs* (fundamental to Egyptian study). Gardiner and Egyptologists (and the implementations) subsequently assumed that quadrats are built up from the signs dynamically. The atomic unit is not the quadrat. --Ken On 11/5/2021 6:25 AM, abrahamgross--- via Unicode wrote: > Looking at TUS §11.4 Egyptian Hieroglyphs, you can see that there they > did decide to use control characters to shape complex characters. > Anyone know why that is? -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Nov 5 11:58:27 2021 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 5 Nov 2021 16:58:27 +0000 (GMT) Subject: SignWriting (was Re: New CJK characters) In-Reply-To: <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com> <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com> <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com> Message-ID: <7ab3caf9.332c7.17cf1098e26.Webtop.101@btinternet.com> The following video about the history of Sutton SignWriting is wonderful. https://www.youtube.com/watch?v=sYQn6crcBno Only 40 views at the time of the writing of this note. It is one of a number of videos available from Ms Valerie Sutton. William Overington Friday 5 November 2021 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From abrahamgross at disroot.org Fri Nov 5 12:25:03 2021 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Fri, 5 Nov 2021 17:25:03 +0000 (UTC) Subject: SignWriting (was Re: New CJK characters) In-Reply-To: <7ab3caf9.332c7.17cf1098e26.Webtop.101@btinternet.com> References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <7FF2A172-E795-47FE-AF3A-9A4FB2C3FCCC@apple.com> <5b25599e-2a14-98dc-97ad-2307e728528b@code2001.com> <0c2d5222-b445-6220-f65d-9a4865808b57@code2001.com> <7ab3caf9.332c7.17cf1098e26.Webtop.101@btinternet.com> Message-ID: These videos were indeed very interesting. Seldom do we get to hear the thoughts of the people who created a widely used writing system. -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Fri Nov 5 19:31:43 2021 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=) Date: Sat, 6 Nov 2021 09:31:43 +0900 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> Message-ID: On 2021-11-05 22:14, Phake Nick via Unicode wrote: > Recently I came across some proposed solutions for developing CJK fonts > for arrays of characters by using deep learning to put radicals together > with the different components of different characters nicely according to their > proportion, through machine learning; that's also something we didn't have > back in the pre-Unicode era. I would be very interested in any pointers, either on or off list. Regards, Martin.
From xfq.free at gmail.com Fri Nov 5 20:43:59 2021 From: xfq.free at gmail.com (Fuqiao Xue) Date: Sat, 6 Nov 2021 09:43:59 +0800 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> Message-ID: Hi Martin, On Sat, Nov 6, 2021 at 8:35, Martin J. Dürst via Unicode wrote: > > On 2021-11-05 22:14, Phake Nick via Unicode wrote: > > > Recently I came across some proposed solutions for developing CJK fonts > > for arrays of characters by using deep learning to put radicals together > > with the different components of different characters nicely according to their > > proportion, through machine learning; that's also something we didn't have > > back in the pre-Unicode era. > > I would be very interested in any pointers, either on or off list. Although Phake may not be talking about this project, here is a project that uses a neural network to create Chinese fonts, and it has sparked a lot of discussion: https://github.com/kaonashi-tyc/Rewrite#motivation ~xfq > Regards, Martin. From Jens.Maurer at gmx.net Sat Nov 6 08:00:29 2021 From: Jens.Maurer at gmx.net (Jens Maurer) Date: Sat, 6 Nov 2021 14:00:29 +0100 Subject: Aliases for control characters; BELL in particular Message-ID: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net> Hi, I'm involved in extending the C++ programming language so that character names can be used to represent a Unicode character in source code, in addition to code point hex numbers. There are a number of obstacles here; I'll start with a rather specific concern. I'm looking at Unicode 14.0.0. In section 24.1 it says Normative Aliases [...] Normative aliases which provide information about corrections to defective character names or which provide alternate names in wide use for a Unicode format character are printed in the character names list, preceded by a special symbol [...].
Normative aliases serving other purposes, if listed, are shown by convention in all caps, following an "=". Normative aliases of type "figment" for control codes are not listed. Normative aliases which represent commonly used abbreviations for control codes or format characters are shown in all caps, enclosed in parentheses. In contrast, informative aliases are shown in lowercase. For the definitive list of normative aliases, also including their type and suitable for machine parsing, see NameAliases.txt in the UCD. https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt says, in particular, # Note that no formal name alias for the ISO 6429 "BELL" is # provided for U+0007, because of the existing name collision # with U+1F514 BELL. 0007;ALERT;control 0007;BEL;abbreviation Yet, https://www.unicode.org/Public/14.0.0/charts/CodeCharts.pdf says 0007 = BELL and about a thousand pages later 1F514 BELL → 0FC4 tibetan symbol dril bu → 2407 symbol for bell → 1F56D ringing bell So, given the explanation in section 24.1, CodeCharts.pdf defines a normative alias "BELL" for U+0007 (it's all-caps and follows "="), despite the statement in NameAliases.txt that this is not desired. It feels like CodeCharts.pdf ought to say "0007 = ALERT" to avoid the naming conflict described in the comment in NameAliases.txt. (It would be good if NameAliases.txt would not use the phrase "formal name alias", but one of the category phrases from section 24.1.) A slightly related question is for these aliases from NameAliases.txt: 000A;LINE FEED;control 000A;NEW LINE;control 000A;END OF LINE;control This seems to indicate that all three aliases are on the same level. Yet, CodeCharts.pdf says 000A = LINE FEED (LF) = new line (NL) = end of line (EOL) which, according to the explanation in section 24.1, means that only LINE FEED is a normative alias, but "new line" and "end of line" are merely informative aliases. The data in NameAliases.txt does not support this interpretation.
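For what it's worth, the NameAliases.txt records quoted above are straightforward to consume mechanically; a minimal sketch, with the sample data inlined rather than read from the UCD file:

```python
# NameAliases.txt format: codepoint;alias;type, with '#' comment lines.
SAMPLE = """\
# Records quoted in this thread:
0007;ALERT;control
0007;BEL;abbreviation
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
"""

def parse_name_aliases(text):
    aliases = {}  # codepoint -> list of (alias, type), in file order
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        cp, alias, kind = line.split(";")
        aliases.setdefault(int(cp, 16), []).append((alias, kind))
    return aliases

aliases = parse_name_aliases(SAMPLE)
# All three U+000A aliases carry the same type, "control":
assert {kind for _, kind in aliases[0x000A]} == {"control"}
```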
Is it the intention that all three aliases for U+000A are normative aliases? Thanks for your help! Jens From markus.icu at gmail.com Sat Nov 6 12:07:52 2021 From: markus.icu at gmail.com (Markus Scherer) Date: Sat, 6 Nov 2021 10:07:52 -0700 Subject: Aliases for control characters; BELL in particular In-Reply-To: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net> References: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net> Message-ID: Hallo Jens, On Sat, Nov 6, 2021 at 8:50 AM Jens Maurer via Unicode < unicode at corp.unicode.org> wrote: > So, given the explanation in section 24.1, CodeCharts.pdf defines a > normative > alias "BELL" for U+0007 (it's all-caps and follows "="), despite the > statement > in NameAliases.txt that this is not desired. > Here is the disconnect. The code charts, with their annotations driven by https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt , are a presentation of glyphs, names and useful additional information. But the normative data is in NameAliases.txt. It would be best if you could report the discrepancy via https://www.unicode.org/reporting.html The data in NameAliases.txt does not support this interpretation. > Is it the intention that all three aliases for U+000A are normative > aliases? > Please use only the data in NameAliases.txt. https://www.unicode.org/reports/tr44/#NameAliases.txt vs. https://www.unicode.org/reports/tr44/#NamesList Viele Grüße, markus -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Jens.Maurer at gmx.net Sat Nov 6 14:59:36 2021 From: Jens.Maurer at gmx.net (Jens Maurer) Date: Sat, 6 Nov 2021 20:59:36 +0100 Subject: Aliases for control characters; BELL in particular In-Reply-To: References: <8a962aad-9afc-705a-2a18-3785785c2478@gmx.net> Message-ID: On 06/11/2021 18.07, Markus Scherer via Unicode wrote: > Hallo Jens, > > On Sat, Nov 6, 2021 at 8:50 AM Jens Maurer via Unicode > wrote: > > So, given the explanation in section 24.1, CodeCharts.pdf defines a normative > alias "BELL" for U+0007 (it's all-caps and follows "="), despite the statement > in NameAliases.txt that this is not desired. > > > Here is the disconnect. The code charts, with their annotations driven by https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt , are a presentation of glyphs, names and useful additional information. > But the normative data is in NameAliases.txt. > > It would be best if you could report the discrepancy via https://www.unicode.org/reporting.html I've posted two bug reports, one against the use of BELL for U+0007 and one against the presentation of aliases for U+000A (and other control characters with more than one "control" alias). > Please use only the data in NameAliases.txt. The sad part here is that C++ is an ISO standard, which really likes to refer to another ISO standard for these matters. But the code charts in ISO 10646:2020 have these bugs in them, and it seems those charts are normative in ISO 10646. Beyond that, according to ISO 10646 section 34.3, only the "correction" aliases are normative, the others are informative, which differs from the viewpoint of Unicode 14. And which means that the control characters are not nameable at all via ISO 10646 normative names/aliases, which makes me sad.
Jens From c933103 at gmail.com Sun Nov 7 01:18:12 2021 From: c933103 at gmail.com (Phake Nick) Date: Sun, 7 Nov 2021 14:18:12 +0800 Subject: New CJK characters In-Reply-To: References: <06a21f1247e942ea71dec7178a8ebe22@disroot.org> <6E671439-0D1D-4BCB-95E0-0A15043A2638@apple.com> <1714bb40-1fce-c553-a673-4e8f22314ed3@code2001.com> Message-ID: I'm fairly certain it was introduced in a fontmaker's Facebook post; however, I cannot use my Facebook account nowadays, and thus have difficulty finding the relevant posts, as Facebook now blocks many pages for non-logged-in users. But while searching, I found the following links, which might be interesting to anyone who wants to look into this topic: https://www.astar.com.tw/astar_auto02.htm A Taiwan company's software, Astar Auto, which dynamically generates characters in GIF image format and serves them to the client browser on request. This is from the ~2000s or so, thus no fancy technology is involved. https://github.com/ButTaiwan/GlyphsTools/tree/main/TaiwanKit An open-source font-making tool from Taiwan, which includes the feature of auto-generating symbols like Roman numerals or full-width Latin characters based on glyphs that have already been created; it can use mirroring and rotation to automatically make glyphs for symbols like tabulation symbols and arrows, as well as add circles and such around numbers to form enclosed characters. But it doesn't appear to support auto-generating Chinese characters. It can also auto-update the resulting design if the source glyph is modified. https://www.cjkfonts.io/blog/cjkfonts_allseto A Traditional Chinese font maker used machine learning to generate Simplified Chinese characters in the same style as an open-source Japanese font and released the result to the public.
https://aihub.org.tw/ai_case/fd0c8ff03157edb37926475ef674873a Arphic, a famous Traditional Chinese font maker, is reportedly using their own AI module to automatically adjust the structure and thickness of glyphs, so that font designers only need to do a final quality check before releasing the product. Currently their AI can create 5000 characters from 5000 handmade characters, and they want to increase the rate to 90% of glyphs being auto-generated in the future. It is said that the introduction of such a tool has already improved their revenue, and in the next stage they want to open up the platform for public use, such that everyone can create Chinese fonts in their own personal style. Martin J. Dürst wrote on Sat, Nov 6, 2021 at 8:31 AM: > > On 2021-11-05 22:14, Phake Nick via Unicode wrote: > > > Recently I came across some proposed solutions for developing CJK fonts > > for arrays of characters by using deep learning to put radicals together > > with the different components of different characters nicely according to their > > proportion, through machine learning; that's also something we didn't have > > back in the pre-Unicode era. > > I would be very interested in any pointers, either on or off list. > > Regards, Martin. From tom at honermann.net Mon Nov 15 18:20:14 2021 From: tom at honermann.net (Tom Honermann) Date: Mon, 15 Nov 2021 18:20:14 -0600 Subject: ICU encoding name alias conflicts Message-ID: <5611ea3b-6e6e-3472-0417-cb959ad89808@honermann.net> I conducted an audit of all of the encoding names recognized by ICU with the goal of identifying any cases where comparison under the COMP_NAME loose matching algorithm specified in P1885 would lead to a conflict in selecting an ICU converter. The good news is that no conflicts were identified that can be attributed to the loose matching algorithm. However, I found that the same alias is used for different encodings in multiple cases, as described in the table below. These can be verified with the ICU Converter Explorer.
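The loose matching the audit above is concerned with (the UTS #22 style comparison adopted for P1885's COMP_NAME) amounts roughly to: case-fold, ignore "-", "_" and spaces, and ignore zeros not preceded by a digit. A rough sketch of that rule; this is my reading of it, not ICU's actual implementation:

```python
import re

def loose_key(name: str) -> str:
    """Reduce an encoding name to a loose-matching comparison key."""
    s = name.lower().replace("-", "").replace("_", "").replace(" ", "")
    # Drop zeros that start a digit run ("cp037" -> "cp37", "100" unchanged).
    return re.sub(r"(?<!\d)0+(?=\d)", "", s)

# Names that should compare equal under loose matching:
assert loose_key("UTF-8") == loose_key("utf8")
assert loose_key("ISO-8859-1") == loose_key("iso8859_1")
assert loose_key("cp037") == loose_key("CP-37")
```

Two encodings conflict in the sense of the audit when distinct canonical converters end up with aliases that share the same loose key.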
I did not scrape the ICU Converter Explorer page to perform the audit. The data I worked from was produced with ICU 70.1 by running uconv -l --canon and then massaging the output. Each row of the table describes a conflict between two ICU encodings, named in the leftmost and rightmost columns respectively; the middle column lists the specific aliases that conflict and the provider each corresponds to. For at least some of these, one has to wonder whether the ICU data is simply incorrect. Cases that only involve a conflict with an untagged alias are shown in gray so that the others stand out. Can anyone offer an explanation for these conflicts? Do they reflect defects in ICU (particularly in the cases where the untagged aliases disagree)?

ICU encoding | Conflicting aliases (provider) | ICU encoding
ibm-943_P15A-2003 | cp932 (Windows); cp932 (Untagged) | ibm-942_P12A-1999
ibm-943_P130-1999 | ibm-943 (IBM); ibm-943 (Java); ibm-943 (Untagged) | ibm-943_P15A-2003
ibm-943_P130-1999 | Shift_JIS (Untagged); Shift_JIS (Windows); Shift_JIS (Java); Shift_JIS (IANA); Shift_JIS (MIME) | ibm-943_P15A-2003
ibm-33722_P120-1999 | ibm-33722 (IBM); ibm-33722 (Java); ibm-33722 (Untagged) | ibm-33722_P12A_P12A-2009_U2
ibm-33722_P120-1999 | ibm-5050 (IBM); ibm-5050 (Untagged) | ibm-33722_P12A_P12A-2009_U2
windows-950-2000 | windows-950 (Windows); windows-950 (Untagged) | ibm-1373_P100-2002
ibm-5471_P100-2006 | Big5-HKSCS (Untagged); Big5-HKSCS (Java); Big5-HKSCS (IANA) | ibm-1375_P100-2008
windows-936-2000 | windows-936 (Windows); windows-936 (Java); windows-936 (IANA); windows-936 (Untagged) | ibm-1386_P100-2001
ibm-949_P11A-1999 | ibm-949 (Untagged); ibm-949 (IBM); ibm-949 (Java) | ibm-949_P110-1999
ibm-1363_P11B-1998 | KS_C_5601-1987 (IANA); KS_C_5601-1987 (Java) | ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998 | KSC_5601 (IANA); KSC_5601 (Java) | ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998 | 5601 (Untagged); 5601 (Java) | ibm-970_P110_P110-2006_U2
ibm-1363_P110-1997 | ibm-1363 (IBM); ibm-1363 (Untagged) | ibm-1363_P11B-1998
windows-949-2000 | windows-949 (Windows); windows-949 (Java); windows-949 (Untagged) | ibm-1363_P11B-1998
windows-949-2000 | KS_C_5601-1987 (Windows); KS_C_5601-1987 (Java) | ibm-970_P110_P110-2006_U2
windows-949-2000 | KS_C_5601-1989 (Windows); KS_C_5601-1989 (IANA) | ibm-1363_P11B-1998
windows-949-2000 | KSC_5601 (Windows); KSC_5601 (MIME); KSC_5601 (Java) | ibm-970_P110_P110-2006_U2
windows-949-2000 | csKSC56011987 (Windows); csKSC56011987 (IANA) | ibm-1363_P11B-1998
windows-949-2000 | korean (Windows); korean (IANA) | ibm-1363_P11B-1998
windows-949-2000 | iso-ir-149 (Windows); iso-ir-149 (IANA) | ibm-1363_P11B-1998
ibm-874_P100-1995 | TIS-620 (Java); TIS-620 (IANA); TIS-620 (Windows) | windows-874-2000
ibm-1250_P100-1995 | windows-1250 (Untagged); windows-1250 (Windows); windows-1250 (Java); windows-1250 (IANA) | ibm-5346_P100-1998
ibm-1251_P100-1995 | windows-1251 (Untagged); windows-1251 (Windows); windows-1251 (Java); windows-1251 (IANA) | ibm-5347_P100-1998
ibm-1252_P100-2000 | windows-1252 (Untagged); windows-1252 (Windows); windows-1252 (Java); windows-1252 (IANA) | ibm-5348_P100-1997
ibm-1253_P100-1995 | windows-1253 (Untagged); windows-1253 (Windows); windows-1253 (Java); windows-1253 (IANA) | ibm-5349_P100-1998
ibm-1254_P100-1995 | windows-1254 (Untagged); windows-1254 (Windows); windows-1254 (Java); windows-1254 (IANA) | ibm-5350_P100-1998
ibm-5351_P100-1998 | windows-1255 (Untagged); windows-1255 (Windows); windows-1255 (Java); windows-1255 (IANA) | ibm-9447_P100-2002
ibm-5352_P100-1998 | windows-1256 (Untagged); windows-1256 (Windows); windows-1256 (Java); windows-1256 (IANA) | ibm-9448_X100-2005
ibm-5353_P100-1998 | windows-1257 (Untagged); windows-1257 (Windows); windows-1257 (Java); windows-1257 (IANA) | ibm-9449_P100-2002
ibm-1258_P100-1997 | windows-1258 (Untagged); windows-1258 (Windows); windows-1258 (Java); windows-1258 (IANA) | ibm-5354_P100-1998

Tom. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From harjitmoe at outlook.com Sun Nov 21 17:03:34 2021 From: harjitmoe at outlook.com (Harriet Riddle) Date: Sun, 21 Nov 2021 23:03:34 +0000 Subject: ICU encoding name alias conflicts In-Reply-To: <5611ea3b-6e6e-3472-0417-cb959ad89808@honermann.net> References: <5611ea3b-6e6e-3472-0417-cb959ad89808@honermann.net> Message-ID: Hello. Long infodump ahead, but there are several things going on here. - Some of these are different mappings for the same encoding, e.g. ibm-33722_P120-1999 versus ibm-33722_P12A_P12A-2009_U2. This is because the mapping of legacy character sets, JIS X 0208 and a subset of JIS X 0212 in this case, isn't always universally agreed upon between vendors (MINUS SIGN versus FULLWIDTH HYPHEN-MINUS, EM DASH versus HORIZONTAL BAR, WAVE DASH versus FULLWIDTH TILDE versus TILDE OPERATOR, et cetera), to say nothing of the REVERSE SOLIDUS / YEN SIGN / WON SIGN brouhaha. As a sidenote, IBM-33722 is the subset of IBM-954 (IBM's version of EUC-JP) that can be converted to IBM-942, similarly to how IBM-5050 is the subset of IBM-954 that can be converted to IBM-932, which is a subset of IBM-942 without the single-byte extensions (hence IBM-5050 is aliased to its superset IBM-33722). Why both aren't just aliased to IBM-954 is beyond me. A further sidenote: both IBM-954 and the OSF/TUG eucJP-open encode the subset of the IBM Extensions section from IBM-932 that doesn't have standard codepoints in JIS X 0212 to an extension range in empty space in JIS X 0212; however, these schemes collide with one another. In practice, it is NEC's scheme (which encodes the subset of the IBM Extensions section that doesn't have standard codepoints in NEC Row 13 to empty space in JIS X 0208) that gets used more often, in both EUC-JP and Shift_JIS, even when the IBM Extensions themselves are also included (as in Windows code page 932). - 
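The vendor mapping disagreements listed above can be observed directly with Python's bundled codecs, which implement the standard JIS X 0208 mapping under the name shift_jis and Microsoft's mapping under cp932:

```python
# The same Shift_JIS byte sequence decodes differently depending on
# whose mapping table the converter uses.
probes = {
    b"\x81\x60": "JIS X 0208 cell 1-33",  # the classic wave-dash cell
    b"\x81\x7c": "JIS X 0208 cell 1-61",  # the minus-sign cell
}
for raw, label in probes.items():
    jis = raw.decode("shift_jis")  # standard JIS mapping
    ms = raw.decode("cp932")       # Microsoft's mapping
    print(f"{label}: shift_jis -> U+{ord(jis):04X}, cp932 -> U+{ord(ms):04X}")

assert b"\x81\x60".decode("shift_jis") == "\u301c"  # WAVE DASH
assert b"\x81\x60".decode("cp932") == "\uff5e"      # FULLWIDTH TILDE
assert b"\x81\x7c".decode("shift_jis") == "\u2212"  # MINUS SIGN
assert b"\x81\x7c".decode("cp932") == "\uff0d"      # FULLWIDTH HYPHEN-MINUS
```

Round-tripping text through converters that disagree like this is exactly why "the same encoding" can need multiple mapping tables, and hence multiple converter entries, in ICU.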
A pervasive problem with legacy character encoding names is that Microsoft and IBM often use different definitions for a given code page number. For instance, code page 932 was modified by Microsoft to use a newer JIS X 0208 edition and to add NEC extensions alongside the existing IBM extensions (IBM-932 was also updated with the newer JIS X 0208 repertoire, but without the codepoint swaps of kyuujitai with corresponding extended shinjitai between levels 1 and 2 that JIS X 0208 made in 1983, and excluding additions which duplicated the existing IBM extensions). Microsoft's code page 932 was later adopted by IBM as code page 943. Hence some labels are inherently ambiguous. Likewise: IBM code page 949 and Windows code page 949 are both supersets of EUC-KR, but the similarities end there (Windows's is Unified Hangul Code, while IBM's adds its own extensions outside the EUC range to fully support the repertoires of IBM-933 and IBM-934). IBM's 1363 is Windows-949, although IBM and Microsoft don't entirely agree on the mapping. IBM's code page 950 and Windows code page 950 are both subsets of Big5-ETEN, but IBM includes only the part of the ETEN extensions that Microsoft doesn't, each treating the other range as user-defined; IBM-1373 corresponds to Windows-950. Code page 936 is the most egregious, referring formerly to EUC-CN and latterly to GBK on Windows, but seemingly referring to Shift_GB (or something very similar) in IBM's definition (though IBM-936 is heavily deprecated and is omitted by ICU). IBM-874 and Windows-874 are also different, otherwise-unrelated extensions of TIS-620, the national standard which would, with a minor revision, become ISO-8859-11. - 
IBM makes a distinction between CPGIDs and CCSIDs, both of which essentially occupy the same namespace, but CPGIDs identify a fixed-width plane with a potentially growing repertoire (unless the plane is full), while CCSIDs specify a repertoire (they can have a growing repertoire, but have to specify it explicitly) and can be variable-width by combining multiple planes within a higher-level scheme (such as ISO-2022-JP, general EUC, stateful EBCDIC, or lead-byte-masked variable-width). Microsoft makes no such distinction, calling both code page numbers. Hence, IBM-5348 (CCSID 5348) is the current version of Windows-1252, with a larger specified repertoire than IBM-1252 (CCSID 1252), which is the version of Windows-1252 before the Euro Sign Update (which also added a few characters besides the euro sign); CPGID 1252, though, refers to the whole thing (with the maximal CCSID of 5348). Similarly, IBM-5471 is Big5-HKSCS (2001) and IBM-1375 is Big5-HKSCS Growing, in practice meaning Big5-HKSCS (2008), as seen from its inclusion of 0x877A through 0x87DF; both are variable-width, so neither is a CPGID (the pure double-byte CPGID for HKSCS is 1374). Often updates or extensions to, or conversely subsets of, an existing CCSID get assigned CCSIDs amounting to an increment of the existing one by a multiple of 4096 (hence 1257 versus 5353 versus 9449). I think those three explanations cover everything. --Har. ________________________________ From: Unicode on behalf of Tom Honermann via Unicode Sent: 16 November 2021 00:20 To: SG16 ; UnicoDe List ; icu-support at lists.sourceforge.net Subject: ICU encoding name alias conflicts I conducted an audit of all of the encoding names recognized by ICU with the goal of identifying any cases where comparison under the COMP_NAME loose matching algorithm specified in P1885 would lead to a conflict in selecting an ICU converter. The good news is that no conflicts were identified that can be attributed to the loose matching algorithm. 
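The CCSID numbering pattern Harriet describes (updated or extended repertoires assigned the old CCSID plus a multiple of 4096) can be checked mechanically against the encoding pairs in Tom's table; the pair list below is transcribed from that table.

```python
# Windows code page CCSIDs paired with their later (e.g. euro-update)
# variants, transcribed from the conflict table in this thread.
pairs = [
    (1250, 5346), (1251, 5347), (1252, 5348), (1253, 5349),
    (1254, 5350), (1255, 5351), (1256, 5352), (1257, 5353),
    (1258, 5354), (5351, 9447), (5352, 9448), (5353, 9449),
]
for old, new in pairs:
    delta = new - old
    # Every updated CCSID differs from its predecessor by k * 4096.
    assert delta % 4096 == 0, (old, new)
    print(f"CCSID {old} -> {new}: +{delta // 4096} * 4096")
```

So 1257 (Windows-1257 pre-euro) + 4096 = 5353 (post-euro), and 5353 + 4096 = 9449 (a further revision), exactly as described.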
-------------- next part -------------- An HTML attachment was scrubbed... URL: From pgcon6 at msn.com Tue Nov 30 04:45:07 2021 From: pgcon6 at msn.com (Peter Constable) Date: Tue, 30 Nov 2021 10:45:07 +0000 Subject: Agreement for Paramount In-Reply-To: References: <27888448-ee13-fb22-5a7b-7f78146fc27a@shoulson.com> <4009f05e-7858-bfc2-529f-322bdab2f0d5@ix.netcom.com> <6fa95c81-5cba-12fb-da62-a80ce2eaf250@shoulson.com> <3555bed2-fa40-6efa-1dbf-e92cf85d88ed@ix.netcom.com> <52061097-692c-7578-e7d7-9460654e3835@shoulson.com> <0ebce4be-19c8-8e0f-4012-8b15bf641413@shoulson.com> Message-ID: Forgot to include the list. From: Peter Constable Sent: November 29, 2021 11:55 PM To: Mark E. Shoulson Subject: RE: Agreement for Paramount From: Unicore > On Behalf Of Mark E. Shoulson via Unicore Sent: November 28, 2021 2:29 PM Subject: Re: Agreement for Paramount [snip] > If Unicode is willing to do the negotiations, why are we still arguing about this? The Unicode Consortium isn?t prepared to take the lead in establishing engagement from 3rd-party IP holders. That initiative needs to come from the proposers championing the encoding of a given script. With all of the many scripts that are candidates for encoding, Unicode doesn?t have the capacity to take the lead in preparing proposals for individual scripts, or even to take the lead in resolving questions of IP rights in cases in which there are potential concerns. It?s enough for the volunteers (whose time has been donated, in most cases, by their employers) to vet proposals and work through the technical details that often need to be sorted out. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue Nov 30 11:54:10 2021 From: mark at kli.org (Mark E. 
Shoulson) Date: Tue, 30 Nov 2021 12:54:10 -0500 Subject: Agreement for Paramount In-Reply-To: References: <27888448-ee13-fb22-5a7b-7f78146fc27a@shoulson.com> <4009f05e-7858-bfc2-529f-322bdab2f0d5@ix.netcom.com> <6fa95c81-5cba-12fb-da62-a80ce2eaf250@shoulson.com> <3555bed2-fa40-6efa-1dbf-e92cf85d88ed@ix.netcom.com> <52061097-692c-7578-e7d7-9460654e3835@shoulson.com> <0ebce4be-19c8-8e0f-4012-8b15bf641413@shoulson.com> Message-ID: <4fa2195b-b802-7732-f300-6dfd58ca174c@shoulson.com> "Initiative has to come from the proposers... Unicode doesn't have the capacity to take the lead in preparing proposals..." I thought that's what I was doing (with my own volunteered time). I'm working on finding out if, after not-taking the lead, Unicode is at least willing to follow up, which apparently they have to do (or rather, nobody else can do it, but they don't have to; I can only hope they will). Working on it. These exchanges have given me an idea of another route to explore. ~mark On 11/30/21 05:45, Peter Constable via Unicode wrote: > > Forgot to include the list. > > *From:* Peter Constable > *Sent:* November 29, 2021 11:55 PM > *To:* Mark E. Shoulson > *Subject:* RE: Agreement for Paramount > > *From:* Unicore *On Behalf Of *Mark > E. Shoulson via Unicore > *Sent:* November 28, 2021 2:29 PM > *Subject:* Re: Agreement for Paramount > > [snip] > > > If Unicode is willing to do the negotiations, why are we still > arguing about this? > > The Unicode Consortium isn't prepared to take the lead in establishing > engagement from 3rd-party IP holders. That initiative needs to come > from the proposers championing the encoding of a given script. With > all of the many scripts that are candidates for encoding, Unicode > doesn't have the capacity to take the lead in preparing proposals for > individual scripts, or even to take the lead in resolving questions of > IP rights in cases in which there are potential concerns. 
It's enough > for the volunteers (whose time has been donated, in most cases, by > their employers) to vet proposals and work through the technical > details that often need to be sorted out. > > Peter > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgcon6 at msn.com Tue Nov 30 12:08:59 2021 From: pgcon6 at msn.com (Peter Constable) Date: Tue, 30 Nov 2021 18:08:59 +0000 Subject: Agreement for Paramount In-Reply-To: <4fa2195b-b802-7732-f300-6dfd58ca174c@shoulson.com> References: <27888448-ee13-fb22-5a7b-7f78146fc27a@shoulson.com> <4009f05e-7858-bfc2-529f-322bdab2f0d5@ix.netcom.com> <6fa95c81-5cba-12fb-da62-a80ce2eaf250@shoulson.com> <3555bed2-fa40-6efa-1dbf-e92cf85d88ed@ix.netcom.com> <52061097-692c-7578-e7d7-9460654e3835@shoulson.com> <0ebce4be-19c8-8e0f-4012-8b15bf641413@shoulson.com> <4fa2195b-b802-7732-f300-6dfd58ca174c@shoulson.com> Message-ID: >I'm working on finding out if, after not-taking the lead, Unicode is at least willing to follow up... Follow up how? There's been follow-up in this thread; e.g., Mark has provided some pretty detailed information, much more for this kind of issue than I've seen done before. If Paramount were to reach out to Unicode wanting to discuss licensing considerations related to a proposal, I'm pretty sure Unicode would be willing to engage with them. Just don't expect Unicode to be initiating communication with Paramount. Peter From: Unicode On Behalf Of Mark E. Shoulson via Unicode Sent: November 30, 2021 9:54 AM To: unicode at corp.unicode.org Subject: Re: Agreement for Paramount "Initiative has to come from the proposers... Unicode doesn't have the capacity to take the lead in preparing proposals..." I thought that's what I was doing (with my own volunteered time). I'm working on finding out if, after not-taking the lead, Unicode is at least willing to follow up, which apparently they have to do (or rather, nobody else can do it, but they don't have to; I can only hope they will.) 
Working on it. These exchanges have given me an idea of another route to explore. ~mark On 11/30/21 05:45, Peter Constable via Unicode wrote: Forgot to include the list. -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Tue Nov 30 12:38:48 2021 From: public at khwilliamson.com (Karl Williamson) Date: Tue, 30 Nov 2021 11:38:48 -0700 Subject: Directionality controls for malicious code Message-ID: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> It is possible to make text appear to be other than what it really is by using BiDi controls. Such text may be in the form of computer code, which could allow a trojan-horse attack by sneaking stuff past human code reviewers. I have not studied the BiDi algorithm, so this may be naive. Is there any legitimate use of BiDi controls in text that doesn't have a mixture of LtoR and RtoL strings? 
If not, and since there are relatively few scripts of RtoL characters, is there any legitimate use of BiDi controls outside of script runs of those scripts? If not, then could the BiDi control characters be made to have their scx property value be all the RtoL scripts, so that software such as git could warn about or forbid mixed-script text? Or could a new property be created that allowed for machine detection of malicious use? Karl Williamson From eliz at gnu.org Tue Nov 30 12:59:13 2021 From: eliz at gnu.org (Eli Zaretskii) Date: Tue, 30 Nov 2021 20:59:13 +0200 Subject: Directionality controls for malicious code In-Reply-To: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> (message from Karl Williamson via Unicode on Tue, 30 Nov 2021 11:38:48 -0700) References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> Message-ID: <83h7bttqpq.fsf@gnu.org> > Date: Tue, 30 Nov 2021 11:38:48 -0700 > From: Karl Williamson via Unicode > > Is there any legitimate use of BiDi controls in text that doesn't have a > mixture of LtoR and RtoL strings? Yes, although it's rare. For example, there could be text that is used to explain the effect of these format controls on LTR characters. Another legitimate use would be a string of LTR characters enclosed in these formatting controls so that it could later be placed in RTL context without the risk of getting jumbled text due to characters with weak directionality. Moreover, in real-life applications it can be quite hard even to know whether a given chunk of text contains mixed LTR and RTL characters, because the region could be very large and the application doesn't necessarily consider all of it. > If not, and since there are relatively few scripts of RtoL characters, > is there any legitimate use of BiDi controls outside of script runs of > those scripts? Of course. A typical use is for LTR characters embedded inside otherwise RTL text. There are examples of that in UAX#9, I think. 
> Or could a new property be created that allowed for machine detection of > malicious use? "Malicious use" is hard to define precisely in this case, IME. We humans know it when we see it, but the malicious intent is often extremely context-dependent and semantically loaded, so it's hard to detect it algorithmically, because most algorithms don't understand the semantics of the text.
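Karl's suggested heuristic, flagging BiDi controls that appear in text containing no RTL characters at all, can be sketched in a few lines. Per Eli's examples, legitimate LTR-only uses exist, so a hit justifies at most a warning, never a hard error; the function name and the exact set of flagged characters are this sketch's own choices.

```python
import unicodedata

# Explicit directional controls from UAX #9 (LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, PDI) plus the directional marks LRM, RLM and ALM.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e"
                    "\u2066\u2067\u2068\u2069"
                    "\u200e\u200f\u061c")

def suspicious_bidi_use(text: str) -> bool:
    """Return True if text contains BiDi controls but no RTL
    characters at all, per the heuristic proposed above."""
    if not any(c in BIDI_CONTROLS for c in text):
        return False
    rtl_classes = {"R", "AL", "AN"}  # strong RTL plus Arabic numbers
    return not any(unicodedata.bidirectional(c) in rtl_classes
                   for c in text if c not in BIDI_CONTROLS)

# An RLO in pure-LTR source is suspicious (the trojan-source pattern):
assert suspicious_bidi_use('x = "user\u202e tracking"')
# Controls wrapping genuine Hebrew text are not flagged:
assert not suspicious_bidi_use("\u202b\u05e9\u05dc\u05d5\u05dd\u202c and hello")
# Plain text without any controls is never flagged:
assert not suspicious_bidi_use("plain ASCII source")
```

A tool like git could run such a check per hunk and downgrade it to informational output whenever the surrounding file already contains RTL text.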