From beebe at math.utah.edu Wed May 1 15:13:46 2024
From: beebe at math.utah.edu (Nelson H. F. Beebe)
Date: Wed, 1 May 2024 14:13:46 -0600
Subject: New history paper on Unicode, OpenType fonts, and Indic scripts
Message-ID:

The latest issue of a computer history journal has an article that may
be of interest to Unicode list readers:

    Anushah Hossain
    Text Standards for the Rest of World: The Making of the Unicode
    Standard and the OpenType Format
    IEEE Annals of the History of Computing 46(1) 20--33 Jan/Mar 2024
    https://doi.org/10.1109/MAHC.2024.3351948

More details are in entry Hossain:2024:TSR at

    https://www.math.utah.edu/pub/tex/bib/unicode.bib
    https://www.math.utah.edu/pub/tex/bib/unicode.html

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                                                          -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe at math.utah.edu -
- 155 S 1400 E RM 233                   beebe at acm.org  beebe at computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------

From pgcon6 at msn.com Wed May 1 19:47:05 2024
From: pgcon6 at msn.com (Peter Constable)
Date: Thu, 2 May 2024 00:47:05 +0000
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
Message-ID:

A "private agreement" can be as simple as one party saying, "Use [such-and-such] font to view this content," and another party using that font to view the content. There doesn't even need to be any direct interaction between the two parties.

Peter

From: Unicode on behalf of Erik Carvalhal Miller via Unicode
Date: Tuesday, April 30, 2024 at 9:29 PM
To: William_J_G Overington
Cc: unicode at corp.unicode.org
Subject: Re: Use of tag characters in a private encoding - is it valid please?

On Mon, Apr 29, 2024 at 2:13 PM William_J_G Overington via Unicode wrote:

> I consider that the phrase "private agreement" in The Unicode Standard is, well, not the whole situation, as it is perfectly possible for one person to produce and publish a document declaring some meanings and/or glyphs. So while for anyone else to apply those meanings and/or glyphs does imply at least a tacit, temporary, like watching a science fiction movie suspension of disbelief, sort of agreement, it is not the almost formal contractual situation that The Unicode Standard could be reasonably thought to be writing about.
>
> https://www.unicode.org/versions/Unicode15.0.0/ch23.pdf page 23 of the PDF document

The section you cite does not support the obligation of an "almost formal contractual situation". One of Unicode's online FAQ pages (https://www.unicode.org/faq/private_use.html) has this to say:

>> Q: What does "private agreement among cooperating parties" mean?
>>
>> A "private agreement" simply refers to the fact that agreement about the interpretation of some set of private-use characters is done privately, outside the context of the standard. The Unicode Standard does not specify any particular interpretation for any private-use character. There is no implication that a private agreement necessarily has any contractual or other legal status -- it is simply an agreement between two or more parties about how a particular set of private-use characters should be interpreted.
>>
>> Q: How would I define a private agreement?
>>
>> One can share, or even publish, documentation containing particular assignments for private-use characters, their glyphs, and other relevant information about their interpretation. One can then ask others to use those private-use characters as documented. One can create appropriate fonts and IMEs, or request that others do so.

On Mon, Apr 29, 2024 at 2:13 PM William_J_G Overington via Unicode wrote this too:

> A font with visible glyphs for tag characters will be helpful for composing sequences and could also be useful for finding the meaning of sequences that are not supported by any font available to the particular end user.
>
> > since in this case it's not likely that the PUA character would even be recognized as an emoji, the fallback you saw is the best-case scenario one can expect in the absence of a private-use agreement.
>
> Well, I was not restricting myself to emoji in applying the technique of using U+10FFFD followed by a sequence of tag characters of which the final one is a CANCEL TAG. Emoji sometimes, yet other things too.

That same chapter you linked to, in §23.9 ("Tag Characters"), specifies two usages for tag characters: (1) the now-deprecated language tagging that was their original purpose and (2) emoji tag sequences, as further specified in UTS #51 (as I brought up earlier). You began this thread by asking about validity; my reading is no, a non-emoji private-use tag sequence is not valid according to the Standard. (Nevertheless, you might get it to function anyway.)

It's not clear why you would want to use tag sequences (emoji or otherwise). The 137,468 private-use code points available are well suited for specialty characters. The fallback of having your specialty font(s) visibly display the tag characters of a (private-use) well-formed but unrecognized tag sequence, though possibly useful, not only perverts the notion that tag characters are supposed to be invisible in normal rendering but also sets up a needlessly inconsistent system. If it's important and appropriate for end users to see a fallback display resembling the Basic Latin repertoire, then why not use the Basic Latin characters, so that end users without the benefit of a special font can see them?
If it's not appropriate or important, then why make the sequence characters visible in fallback at all (outside special modes such as composition or "show hidden")? And if the sequence pieces aren't to be seen, why use a sequence at all (especially an invalid one), instead of individual private-use code points? The tag characters seem like a needless complication.

From asmusf at ix.netcom.com Thu May 2 16:05:31 2024
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 2 May 2024 14:05:31 -0700
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
Message-ID: <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>

On 4/29/2024 11:06 AM, William_J_G Overington via Unicode wrote:
> the phrase "private agreement" in The Unicode Standard...

...is deliberately unconstrained.

If you tell a buddy that you included some unusual private use characters in a document and where he can get the font to display them, and he does so, then you and he have exercised a "private agreement".

If another friend tries to view the document without the font, or with a different font, and doesn't see what she expects, it does not in any way affect the claim of her font, software or platform to be conformant to the Unicode Standard.

That's all this ever means.

A./

PS: you are free to solicit other parties to join such private agreements and you may even choose to write them down. However, it's up to you to resolve any issues due to non-compliance with your private agreements. Unicode doesn't care -- as long as you don't agree to things that conflict with conformance to the Standard. In which case, any such conformance by participants in your agreement may no longer be valid.

PPS: the mathematical single angle brackets had to be added to correct a mistaken canonical unification. The name without "mathematical" was already used for characters that are now deprecated.

From jameskass at code2001.com Thu May 2 18:25:59 2024
From: jameskass at code2001.com (James Kass)
Date: Thu, 2 May 2024 23:25:59 +0000
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
 <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
Message-ID:

On 2024-05-02 9:05 PM, Asmus Freytag via Unicode wrote:
> PS: you are free to solicit other parties to join such private agreements and you may even choose to write them down. However, it's up to you to resolve any issues due to non-compliance with your private agreements. Unicode doesn't care -- as long as you don't agree to things that conflict with conformance to the Standard. In which case, any such conformance by participants in your agreement may no longer be valid.

Wouldn't this kind of private use agreement be considered a higher level protocol?

[HTML]
Yadda yadda et cetera.

[tags shown using encircled alphanumerics]
Yadda yadda ??????? et cetera.

There's nothing stopping folks from putting out fonts with glyphs covering large sets of images using QID numbers expressed as tag characters (or even as enclosed alphanumerics) and treating them as ligature substitutions. The same goes for any non-QID strings, as well.

Yet both of the examples above can be considered mark-up languages which use elements of text. Which may explain why "Unicode doesn't care" about such private agreements. Because they are beyond the realm of plain-text.
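
For readers following the thread: each tag character is its printable ASCII counterpart offset by U+E0000, so a string such as a QID can be mapped into the tag block and back without loss. The sketch below illustrates only that mapping. The function names `to_tags` and `from_tags` and the sample string "Q42" are illustrative assumptions, not anything defined by Unicode or proposed in this thread.

```python
TAG_OFFSET = 0xE0000  # tag characters mirror printable ASCII at U+E0000 + code


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII (U+0020..U+007E) to tag characters (U+E0020..U+E007E)."""
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
    return "".join(chr(ord(ch) + TAG_OFFSET) for ch in ascii_text)


def from_tags(tag_text: str) -> str:
    """Map tag characters (U+E0020..U+E007E) back to printable ASCII."""
    for ch in tag_text:
        if not 0xE0020 <= ord(ch) <= 0xE007E:
            raise ValueError(f"not a tag character: U+{ord(ch):04X}")
    return "".join(chr(ord(ch) - TAG_OFFSET) for ch in tag_text)


tags = to_tags("Q42")  # hypothetical QID-like string, for illustration only
print([f"U+{ord(c):04X}" for c in tags])  # ['U+E0051', 'U+E0034', 'U+E0032']
print(from_tags(tags))                    # Q42
```

The mapping itself is mechanical; whether a font or application then treats the resulting run as text, as a ligature trigger, or as markup is the question the thread is debating.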

From asmusf at ix.netcom.com Thu May 2 19:29:36 2024
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 2 May 2024 17:29:36 -0700
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
 <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
Message-ID:

On 5/2/2024 4:25 PM, James Kass via Unicode wrote:
>
> On 2024-05-02 9:05 PM, Asmus Freytag via Unicode wrote:
>> PS: you are free to solicit other parties to join such private agreements and you may even choose to write them down. However, it's up to you to resolve any issues due to non-compliance with your private agreements. Unicode doesn't care -- as long as you don't agree to things that conflict with conformance to the Standard. In which case, any such conformance by participants in your agreement may no longer be valid.
>
> Wouldn't this kind of private use agreement be considered a higher level protocol?

No. You can agree to use a font that displays a certain glyph at a certain PUA position. That's a private agreement, but not a "higher level protocol". The way I like to think about it, PUA characters, in contrast to images inserted into the flowed text, constitute plain text (as long as you don't append the font selection instructions via some private tag, e.g. .

>
> [HTML]
> Yadda yadda et cetera.
>
> [tags shown using encircled alphanumerics]
> Yadda yadda ??????? et cetera.

The minute you agree to show different glyphs for non-PUA characters, you are no longer simply conforming to Unicode. At least, as long as those glyphs aren't already associated as alternate glyphs to the given character by ordinary practice. Using Fraktur glyphs for Latin characters is very much conformant for that reason.
>
> There's nothing stopping folks from putting out fonts with glyphs covering large sets of images using QID numbers expressed as tag characters (or even as enclosed alphanumerics) and treating them as ligature substitutions. The same goes for any non-QID strings, as well.
>
> Yet both of the examples above can be considered mark-up languages which use elements of text. Which may explain why "Unicode doesn't care" about such private agreements. Because they are beyond the realm of plain-text.
>

If you create elaborate conventions for the use of tag characters you are creating a markup language. It's no different from re-using ASCII characters for syntax in addition to text.

The same is true for repurposing the control codes. Especially, if your syntax allows parameters that are using non-control code characters.

They are not SGML style markup, but they constitute markup in a most general sense.

The way markup languages are conformant with Unicode is that they identify those text runs that are plain text Unicode and those text runs where code points have syntactic functions.

A./

From wjgo_10009 at btinternet.com Fri May 3 04:17:04 2024
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 3 May 2024 10:17:04 +0100 (BST)
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
References: <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
Message-ID: <1bc89339.b5f9.18f3dbe2ce2.Webtop.101@btinternet.com>

A glyph of The Welsh Flag is encoded in Unicode as a sequence of a base character followed by some tag characters and a CANCEL TAG.

As far as I am aware, this is regarded as a plain text encoding. I am not aware of that encoding ever having been referred to as markup.

U+1F91F is listed in

https://www.unicode.org/charts/PDF/U1F900.pdf

as

I LOVE YOU HAND SIGN

and in the same document,

U+1F98B is listed as BUTTERFLY

If those two characters are in a block of text, not necessarily next to each other, and the text is to be transcribed as all alphanumeric text, how should those two characters be transcribed? Or used in a text to speech system? What if the text in the original document is in French, how should those two pictographs be transcribed? Or spoken?

Please consider an encoding of a glyph that has been designed and assigned a meaning by an artist. (Yes, a hobbyist artist, but artists and novelists are not expected to be representing an organization, so their output is "recognized" rather than being discriminated against.)

There is such a glyph that has been assigned the following meaning, intended for use in seeking information about relatives and friends after a disaster

Is there any information about the following person please?

Suppose that that glyph is encoded as follows.

U+10FFFD followed by the tag versions of !313125 and a CANCEL TAG.

Then it seems to me that that is a plain text encoding, based on the precedents of the encoding of the glyph of The Welsh Flag and of the encoding of a glyph with a meaning not obvious from its appearance.

The glyph is displayed in Chapter 42 of my first novel, on page 2.

http://www.users.globalnet.co.uk/~ngo/novel_plus.htm

That novel was completed in 2019 and there have been some developments since then, but the chapter contains lots of symbols that may be of interest as to how indications of the assigned meanings information is packed into the various glyphs.

The analysis in this post shows that the encoding that I am using for the glyph that I designed is plain text and therefore in principle the encoding could, if the Unicode Technical Committee so decides, be encoded as plain text in The Unicode Standard as a sequence of a base character, some tag characters, and a CANCEL TAG.

William Overington

Friday 3 May 2024
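
The two sequences discussed in this message can be written out code point by code point. The sketch below builds the Welsh flag sequence (U+1F3F4 WAVING BLACK FLAG, the tag characters for "gbwls", then U+E007F CANCEL TAG) and the private sequence described above, and checks only the general shape: one base code point, one or more tag characters from U+E0020..U+E007E, and a final CANCEL TAG. The helper names `tag_sequence` and `has_tag_sequence_shape` are illustrative assumptions, and the check says nothing about whether a sequence is valid or recommended under UTS #51.

```python
CANCEL_TAG = "\U000E007F"  # U+E007F CANCEL TAG


def tag_sequence(base: str, spec: str) -> str:
    """Return base + tag characters for `spec` + CANCEL TAG.

    Assumes `base` is a single code point and `spec` is printable ASCII.
    """
    if len(base) != 1:
        raise ValueError("base must be a single code point")
    if not all(0x20 <= ord(ch) <= 0x7E for ch in spec):
        raise ValueError("spec must be printable ASCII")
    return base + "".join(chr(0xE0000 + ord(ch)) for ch in spec) + CANCEL_TAG


def has_tag_sequence_shape(seq: str) -> bool:
    """True if seq is: a non-tag base, 1+ tags in U+E0020..U+E007E, CANCEL TAG."""
    if len(seq) < 3 or seq[-1] != CANCEL_TAG:
        return False
    if 0xE0000 <= ord(seq[0]) <= 0xE007F:
        return False  # the base must not itself be a tag character
    return all(0xE0020 <= ord(ch) <= 0xE007E for ch in seq[1:-1])


welsh_flag = tag_sequence("\U0001F3F4", "gbwls")
private_seq = tag_sequence("\U0010FFFD", "!313125")

print(len(welsh_flag), len(private_seq))    # 7 9
print(has_tag_sequence_shape(private_seq))  # True
```

In Python 3, `len` counts code points, so the private sequence comes to nine code points: the base, seven tag characters, and the CANCEL TAG.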

From jameskass at code2001.com Fri May 3 14:59:28 2024
From: jameskass at code2001.com (James Kass)
Date: Fri, 3 May 2024 19:59:28 +0000
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
 <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
Message-ID:

On 2024-05-03 12:29 AM, Asmus Freytag via Unicode wrote:
> On 5/2/2024 4:25 PM, James Kass via Unicode wrote:
>> Wouldn't this kind of private use agreement be considered a higher level protocol?
>
> No. You can agree to use a font that displays a certain glyph at a certain PUA position. That's a private agreement, but not a "higher level protocol". The way I like to think about it, PUA characters, in contrast to images inserted into the flowed text, constitute plain text (as long as you don't append the font selection instructions via some private tag, e.g. .

Maybe we're talking about different things. Of course PUA characters are plain-text by definition. Even when people map all kinds of non-textual items to the PUA. But I'm referring to the substitution of a glyph/image for a string of plain-text characters. This sort of thing is very common in fonts.

Any private agreement is an alternate protocol regardless of its altitude. I consider this kind of agreement (substitution of a text string with something different) to be "higher level" because it's over-and-above.

>>
>> [HTML]
>> Yadda yadda et cetera.
>>
>> [tags shown using encircled alphanumerics]
>> Yadda yadda ??????? et cetera.

> The minute you agree to show different glyphs for non-PUA characters, you are no longer simply conforming to Unicode.

Sorry for not understanding this. Both examples above involve the computer system substituting an image/glyph for a string of text. Both examples should be considered conformant.
In either case, the underlying encoded text does not get changed. The higher level protocol only affects how that text is displayed.

> If you create elaborate conventions for the use of tag characters you are creating a markup language. It's no different from re-using ASCII characters for syntax in addition to text.

It's also true when re-using any text characters, public or private, for the same purpose.

From pgcon6 at msn.com Mon May 6 14:29:01 2024
From: pgcon6 at msn.com (Peter Constable)
Date: Mon, 6 May 2024 19:29:01 +0000
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
 <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
Message-ID:

In general (in my understanding at least), "protocol" means a documented specification for data representation or process interaction (APIs, file formats, structured message content...) that different parties can use for interoperability. (For example, see https://learn.microsoft.com/en-us/openspecs/windows_protocols). In that sense, for example, SIL's documentation of their use of PUA (https://scripts.sil.org/cms/scripts/page.php?id=pua_home&site_id=nrsi) would be considered protocol documentation.

Perhaps what Asmus was reacting to was the mention of "higher-level". I understand you to mean _defined externally to Unicode_. But I think more common use of that term would be in relation to some _application of Unicode text encoding_ involving more than plain text. So, in relation to Unicode PUA, a private agreement on semantics of PUA code points would comprise a protocol, but not a _higher-level_ protocol.

Peter

From wjgo_10009 at btinternet.com Mon May 6 16:09:00 2024
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 6 May 2024 22:09:00 +0100 (BST)
Subject: : Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References:
Message-ID: <6f605ee7.e0e8.18f4fbd0c25.Webtop.101@btinternet.com>

Peter Constable wrote as follows.

> In that sense, for example, SIL's documentation of their use of PUA
> (https://scripts.sil.org/cms/scripts/page.php?id=pua_home&site_id=nrsi)
> would be considered protocol documentation.

That brought back memories of my producing of The Golden Ligatures collection and some of the codes being applied in a font.

The golden ligatures collection of Private Use Area code points for ligatures.

http://www.users.globalnet.co.uk/~ngo/golden.htm

As a result of Peter's post I have been having a look through it.

I also found the following page.

A Private Use Area code point for a character for use in chromatic font research.

http://www.users.globalnet.co.uk/~ngo/holly.htm

William Overington

Monday 6 May 2024

From jameskass at code2001.com Mon May 6 18:21:54 2024
From: jameskass at code2001.com (James Kass)
Date: Mon, 6 May 2024 23:21:54 +0000
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
 <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
Message-ID: <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>

On 2024-05-06 7:29 PM, Peter Constable via Unicode wrote:
> Perhaps what Asmus was reacting to was the mention of "higher-level". I understand you to mean _defined externally to Unicode_. But I think more common use of that term would be in relation to some _application of Unicode text encoding_ involving more than plain text.
> So, in relation to Unicode PUA, a private agreement on semantics of PUA code points would comprise a protocol, but not a _higher-level_ protocol.
>

My phrasing may have been inept. For single PUA characters, or even strings of PUA characters, private agreements are not higher level because PUA characters are supposed to be defined by private agreement.

It's when PUA (or even non-PUA) characters are modified by tag characters as part of a private agreement that the scheme becomes higher level. As Asmus pointed out, this is essentially a private agreement for mark-up.

Asmus wrote, "If you create elaborate conventions for the use of tag characters you are creating a markup language. It's no different from re-using ASCII characters for syntax in addition to text."

The question posed in the thread subject seems to have been answered by Asmus Freytag.

PUA(1) + ZWJ + PUA(2) = a ligature glyph combining PUA(1) with PUA(2)
- that's legit. Not higher level.

PUA(1) + a string of tag characters = something completely different.
- higher level. Even though this can be handled at the font/font engine level.

So, if we're on the same page,
1) U+10FFFD followed by the tag versions of !313125 and a CANCEL TAG.
2) COMET plus CIRCUMFLEX followed by the ASCII string "!313125"
... both examples represent a private agreement mark-up, and Unicode shouldn't care.

From pgcon6 at msn.com Mon May 6 21:01:56 2024
From: pgcon6 at msn.com (Peter Constable)
Date: Tue, 7 May 2024 02:01:56 +0000
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
 <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
 <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID:

> It's when PUA (or even non-PUA) characters are modified by tag characters as part of a private agreement that the scheme becomes higher level.

If it's a tag sequence scheme defined by Unicode, then not higher-level. But if it's a scheme defined elsewhere, not by Unicode, then I agree that would become higher-level.

Peter

From ecm.unicode at gmail.com Mon May 6 21:23:32 2024
From: ecm.unicode at gmail.com (Erik Carvalhal Miller)
Date: Mon, 6 May 2024 22:23:32 -0400
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com>
 <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com>
 <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com>
 <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID:

On Mon, May 6, 2024 at 7:26 PM James Kass via Unicode <unicode at corp.unicode.org> wrote:

> Asmus wrote, "If you create elaborate conventions for the use of tag characters you are creating a markup language. It's no different from re-using ASCII characters for syntax in addition to text."
>
> The question posed in the thread subject seems to have been answered by Asmus Freytag.
>
> PUA(1) + ZWJ + PUA(2) = a ligature glyph combining PUA(1) with PUA(2)
> - that's legit. Not higher level.
>
> PUA(1) + a string of tag characters = something completely different.
> - higher level. Even though this can be handled at the font/font engine level.
>
> So, if we're on the same page,
> 1) U+10FFFD followed by the tag versions of !313125 and a CANCEL TAG.
> 2) COMET plus CIRCUMFLEX followed by the ASCII string "!313125"
> ...
> both examples represent a private agreement mark-up, and Unicode shouldn't care.
>

If emoji tag sequences, including the existing flag emoji tag sequences, categorically constitute markup, then this markup format is one which Unicode has paradoxically defined as part of its plain-text standard. If Unicode's RGI sequence for a Welsh-flag emoji is plain text, then an emoji tag sequence headed by a private-use emoji can be too, as per TUS §23.5 and UTS #51. Why could it not be?

(I did raise objections to William Overington's hypothetical constructions on the bases that

  * he additionally discussed non-emoji tag sequences, for which the Standard makes no provision (outside the deprecated language tagging);
  * the suggested tag sequences appeared to be an overly complicated way to encode private-use characters, with no apparent benefit; and
  * the notion of making the tag characters conditionally visible as a fallback in standard reading mode is nonconformant.

But those issues do not impact on the conformance of the basic idea of an emoji tag sequence headed by a PUA emoji.)

There are some significant distinctions among the hypothetical examples of peculiar character sequences which you (James Kass) have been examining:

  * "??" followed by a sequence of six tag characters from the range U+E0020-U+E007E is almost a well-formed emoji tag sequence -- it needs U+E007F CANCEL TAG appended to be well-formed. But even with that addition it's currently invalid, as per UTS #51.
  * As I have argued, U+10FFFD followed by the tag analogues of "!313125" and then a CANCEL TAG appears to be valid, if U+10FFFD is agreed to be an emoji and the entire sequence is meant to be interpreted as an emoji.
  * "☄^!313125" is valid Unicode, such as it is.
    If in normal reading mode it's meant to be replaced by a different comet or an aardvark or a Klingon symbol for empire or anything other than a representation of the characters "☄^!313125", then the intended interpretation is not valid as Unicode plain text, though it may be perfectly valid markup of some sort or another beyond Unicode's concern. If the idea is for a font to make one of those substitutions, then such a font is not Unicode-conformant.
  * "" is similar to "☄^!313125": Though not Unicode-conformant if to be normally interpreted as something other than that very sequence of characters, in the HTML context you cited it can serve as perfectly good markup. HTML is not intended to be processed at the font level -- we're likely to see this sequence rendered as an aardvark image only when it's run through a Web browser or similar application.

Plain text, on the other hand, is portable: Generally speaking, with the proper font support, plain text can be used anywhere with a consistent interpretation -- yes, in various contexts you may run into issues such as markup interpretation, restrictions on allowable characters, and text-length limits. But broadly, in plain text a "rose" is a "rose" is a "rose", wherever you go; whereas if you're finding that a "🌹" is a "?", you're probably dealing with not-so-plain text, even if the source code is plain text.

From ecm.unicode at gmail.com Mon May 6 22:00:39 2024
From: ecm.unicode at gmail.com (Erik Carvalhal Miller)
Date: Mon, 6 May 2024 23:00:39 -0400
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <1bc89339.b5f9.18f3dbe2ce2.Webtop.101@btinternet.com>
References: <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <1bc89339.b5f9.18f3dbe2ce2.Webtop.101@btinternet.com>
Message-ID:

On Fri, May 3, 2024 at 5:22 AM William_J_G Overington via Unicode <unicode at corp.unicode.org> wrote:

> The analysis in this post shows that the encoding that I am using for the
> glyph that I designed is plain text and therefore in principle the encoding
> could, if the Unicode Technical Committee so decides, be encoded as plain
> text in The Unicode Standard as a sequence of a base character, some tag
> characters, and a CANCEL TAG.
>

"Could" and "should" are very different animals. Assuming the UTC does end up deciding to accept your symbol (presumably a distinct symbol character, not merely a glyph, for Unicode encodes characters, not glyphs) for encoding (after considering a proposal fulfilling the usual applicable criteria, submitted in the prescribed manner), why should it choose the elaborate encoding you describe instead of a single code point?

Currently there is only one variety of valid tag sequences, that of the regional (subnational) flags such as the Welsh flag you cited. I don't know much about the decision process that was involved, but I take it that the encoding is a compromise born partly of the desire to keep the UTC out of some rather political and potentially never-ending business by taking advantage of an existing international standard that's beyond Unicode's purview. The encoding has some advantages and some disadvantages, the latter including length.

There are other cases in which Unicode has chosen a code-point sequence, rather than a single code point, to represent a single character; but single code points are by far the norm. What would be the rationale for a nine-point sequence for your single character - and an unusually arbitrary-looking sequence at that?

From asmusf at ix.netcom.com Mon May 6 22:05:40 2024
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 6 May 2024 20:05:40 -0700
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID: <95679058-12d0-4988-95ec-9b1486d9b864@ix.netcom.com>

On 5/6/2024 7:23 PM, Erik Carvalhal Miller via Unicode wrote:
> If emoji tag sequences, including the existing flag emoji tag
> sequences, categorically constitute markup, then this markup format is
> one which Unicode has paradoxically defined as part of its plain-text
> standard.

And that is the key.

If you agree that Unicode is a plain text standard then anything that Unicode defines is ipso facto plain text.

You may choose to go "plainer" by not supporting some features of Unicode (and the conformance clauses make provisions for that). But you may not extend Unicode with additional features without leaving the definition of plain text.

That may seem paradoxical, but that's what it is.

A./

From kenwhistler at sonic.net Mon May 6 22:28:15 2024
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 6 May 2024 20:28:15 -0700
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID: <71e3bb8c-1921-4bca-8a37-ff04ba88d7f0@sonic.net>

I'm not going to pile on about what constitutes "higher-level", but ...

On 5/6/2024 7:23 PM, Erik Carvalhal Miller via Unicode wrote:
> If emoji tag sequences, including the existing flag emoji tag
> sequences, categorically constitute markup, then this markup format is
> one which Unicode has paradoxically defined as part of its plain-text
> standard.

This is erroneous. Emoji tag sequences are not "defined as part of [the Unicode Consortium's] plain-text standard", i.e. the Unicode Standard. Emoji tag sequences are defined in and by UTS #51, which is a *separate* specification defined on top of the Unicode Standard. Emoji tag sequences make use of the tag characters defined in the Unicode Standard, but UTS #51 is defining a protocol for their use which is built on top of the Unicode Standard, and not formally a part of it.

> If Unicode's RGI sequence for a Welsh-flag emoji is plain text, then
> an emoji tag sequence headed by a private-use emoji can be too, as per
> TUS §23.5 and UTS #51. Why could it not be?

Well, because private use is private use. The formal definition of an emoji_tag_sequence depends on the definition of a tag_base, which can either be an emoji_character or an emoji_modifier_sequence or an emoji_presentation_sequence. The problem, for extending any of those to PUA, is that all of those entity sets are very clearly and precisely defined by enumerations in data files associated with each version of the publication of UTS #51. PUA characters are not included in any of those lists. Therefore, a PUA character cannot be a tag_base, per UTS #51.

It doesn't suffice to say, well, I've decided that U+F0000 is going to be an emoji character, so I can use it in an emoji_tag_sequence, per UTS #51. Rather, what one would have to do is build out a private agreement that 1) I am going to treat U+F0000 as an emoji, and 2) I am going to be using a private extension of the concept of an emoji_tag_sequence which allows my "emoji" U+F0000 as a tag_base.
I can document that, and if I can get somebody else to buy into that private agreement, then by all means, interchange all you want. But anybody else who happens to sample some of that text is under no obligation whatsoever to interpret any of that, or even to recognize your private extension of the concept of an emoji_tag_sequence as syntactically correct, let alone interpretable.

--Ken

From wjgo_10009 at btinternet.com Tue May 7 05:25:27 2024
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 7 May 2024 11:25:27 +0100 (BST)
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References:
Message-ID: <62da2cff.e791.18f529638fe.Webtop.101@btinternet.com>

In the post

https://corp.unicode.org/pipermail/unicode/2024-May/010886.html

James Kass wrote as follows.

> COMET plus CIRCUMFLEX ...

For the benefit of newer readers of this mailing list, I mention that this is a reference to the following.

http://www.users.globalnet.co.uk/~ngo/c_c00000.htm

There is a short thread about this in the archive of this mailing list for October 2002. However, most of the discussion about it is in a thread with the title Keys that is in the archive for September 2002.

There is also a mention in the following document.

https://www.unicode.org/review/pri408/

I have not researched the Comet Circumflex system since. My present research is on a structurally simpler system with no parameters in the sentences. Since that time I have learned how to make fonts (at a hobbyist level, not at expert level) and I have been able to design symbols and make and apply a font that includes those symbols.

William Overington

Tuesday 7 May 2024

From wjgo_10009 at btinternet.com Tue May 7 08:33:39 2024
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 7 May 2024 14:33:39 +0100 (BST)
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <1bc89339.b5f9.18f3dbe2ce2.Webtop.101@btinternet.com>
References: <1bc89339.b5f9.18f3dbe2ce2.Webtop.101@btinternet.com>
Message-ID: <21799a1b.ec0f.18f5342859a.Webtop.101@btinternet.com>

In the post

https://corp.unicode.org/pipermail/unicode/2024-May/010889.html

Erik Carvalhal Miller asked as follows.

> What would be the rationale for a nine-point sequence for your single
> character - and an unusually arbitrary-looking sequence at that?

The rationale is that it is part of a larger encoding, that in normal use it would be entered into an email by selecting the meaning from a cascading menu, and decoded automatically at the receiving end. Encoding is from a menu in one language, and decoding and display are in a different language, thereby enabling, in some particular circumstances, communication through the language barrier.

The encoding is explained in Chapter 2 and Chapter 6 of my second novel.

http://www.users.globalnet.co.uk/~ngo/locse_novel2.htm

Yes, I know that it is a novel and that that is an unusual way to do things, but back in 2016 I could not make progress with my invention, and as I could not start a research organization to develop my ideas I decided to imagine one and write about it. I completed what is now the first novel in February 2019, it having been intended to be a stand-alone novel, but I missed writing it so I started writing a sequel, namely the second novel: the second novel is not yet complete.

> ... (after considering a proposal fulfilling the usual applicable
> criteria, submitted in the prescribed manner), ...

The big problem here is that to get a document before the Unicode Technical Committee it must be accepted as in scope by the person or persons who act as gatekeeper(s) to the Current Document Register. Yet even if my proposal document is allowed to go before the Committee, what should be the usual applicable criteria for considering it? Should it be the same criteria as for things published long ago? Should I need to show the system already widely in use by many people with a Private Use encoding? Or should it be the same "looks good for the future" consideration used for emoji? Why not, on a "sauce for pasta is sauce for rice" basis?

If the policy is that I need to show the system already widely in use by many people with a Private Use encoding, then I have all but zero chance of that happening.

Yet if the Unicode Technical Committee were to decide to consider the invention on the basis of "would this be of benefit to consumers in the future, and let us have a go at testing it out and finding out if it will be good to encode it", and some of the Full Members each have some people at their research centres work on implementing it as a multi-business project, then great progress could be made.

William Overington

Tuesday 7 May 2024

From eik at iki.fi Tue May 7 09:58:49 2024
From: eik at iki.fi (eik at iki.fi)
Date: Tue, 7 May 2024 17:58:49 +0300
Subject: VS: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <21799a1b.ec0f.18f5342859a.Webtop.101@btinternet.com>
References: <1bc89339.b5f9.18f3dbe2ce2.Webtop.101@btinternet.com> <21799a1b.ec0f.18f5342859a.Webtop.101@btinternet.com>
Message-ID: <003201daa08f$12599840$370cc8c0$@iki.fi>

Mr Overington,

In my opinion the Unicode Technical Committee has more than enough work to do in trying to solve technical issues related to recognized needs of existing user communities.

Sincerely,

Erkki I.
Kolehmainen
Snellmaninkatu 3 D 42, 53100 Lappeenranta, Finland
Mob: +358 400 825 943

From wjgo_10009 at btinternet.com Tue May 7 11:16:11 2024
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 7 May 2024 17:16:11 +0100 (BST)
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <21799a1b.ec0f.18f5342859a.Webtop.101@btinternet.com>
References: <21799a1b.ec0f.18f5342859a.Webtop.101@btinternet.com>
Message-ID: <5caffc56.f069.18f53d75160.Webtop.101@btinternet.com>

Erkki I. Kolehmainen wrote as follows.

> In my opinion the Unicode Technical Committee has more than enough
> work to do in trying to solve technical issues related to recognized
> needs of existing user communities.

Well, there is always lots to do in making progress.

I hope that whether the Unicode Technical Committee decides to push information technology and its applications forward by encoding my invention will be decided entirely on the merits of the invention and its potential for helpfulness in the future.

William Overington

Tuesday 7 May 2024

From ecm.unicode at gmail.com Tue May 7 13:16:19 2024
From: ecm.unicode at gmail.com (Erik Carvalhal Miller)
Date: Tue, 7 May 2024 14:16:19 -0400
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To: <71e3bb8c-1921-4bca-8a37-ff04ba88d7f0@sonic.net>
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com> <71e3bb8c-1921-4bca-8a37-ff04ba88d7f0@sonic.net>
Message-ID:

On Mon, May 6, 2024 at 11:34 PM Ken Whistler via Unicode <unicode at corp.unicode.org> wrote:

> This is erroneous. Emoji tag sequences are not "defined as part of [the
> Unicode Consortium's] plain-text standard", i.e. the Unicode Standard.
> Emoji tag sequences are defined in and by UTS #51, which is a *separate*
> specification defined on top of the Unicode Standard. Emoji tag
> sequences make use of the tag characters defined in the Unicode
> Standard, but UTS #51 is defining a protocol for their use which is
> built on top of the Unicode Standard, and not formally a part of it.

You're of course correct about emoji tag sequences being defined in and by UTS #51.

There are actually three things we call the Unicode Standard: the nowadays epic-length book also known as the core specification, a collection of documents that includes that book (and excludes UTS #51), and the intangible and utterly complex concept which that collection defines. The Unicode Standard (the book -
hence also the collection), in chapter 23, §23.9, says, "The current conformant use of the undeprecated 96 tag characters is specified in Unicode Technical Standard #51, 'Unicode Emoji.' See ED-14a. emoji tag sequence (ETS) and Annex C, Valid Emoji Tag Sequences in that specification." No, the Standard itself (book or collection) does not define what emoji tag sequences are or which ones are valid; but that same Standard points to UTS #51 as the definitive specification of ETSs for "conformant use" of tag characters. I think it's quite reasonable to read that passage as acknowledging/specifying/defining ETSs' place as part of the Standard (the concept), even if it's outsourcing the details.

Perhaps that separation is a useful fiction, as fictions sometimes are ("Unicode, Inc. is a person!"), and of course an abstraction such as the Unicode Standard (the concept, of course) is a malleable fabrication - so, I won't begrudge you the useful fiction.

> The formal definition of an
> emoji_tag_sequence depends on the definition of a tag_base, which can
> either be an emoji_character or an emoji_modifier_sequence or an
> emoji_presentation_sequence. The problem, for extending any of those to
> PUA, is that all of those entity sets are very clearly and precisely
> defined by enumerations in data files associated with each version of
> the publication of UTS #51. PUA characters are not included in any of
> those lists. Therefore, a PUA character cannot be a tag_base, per UTS #51.

This is erroneous. The Standard (book/collection) tells us quite clearly (most extensively in chapter 23, §23.5) that private-use characters' use may be determined by agreement and nearly all properties of such characters may be changed or overridden as per agreement. There is nothing in the Standard forbidding PUA characters from being treated as emoji under a private agreement and therefore as viable candidates for tag_base.
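
The disagreement above can be made concrete with a small sketch. It checks only the syntax of an emoji tag sequence (a base, one or more tag characters from U+E0020–U+E007E, then U+E007F CANCEL TAG) and takes the disputed question, namely what counts as a tag_base, as a caller-supplied predicate. The function names, the one-entry stand-in for the UTS #51 data files, and the simplification that a tag_base is a single code point are all assumptions made for illustration; nothing here is taken from UTS #51 tooling.

```python
from typing import Callable

TAG_SPEC_FIRST = 0xE0020  # TAG SPACE
TAG_SPEC_LAST = 0xE007E   # TAG TILDE
CANCEL_TAG = 0xE007F      # CANCEL TAG, which terminates the sequence


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII to the corresponding Plane 14 tag characters."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)


def is_private_use(ch: str) -> bool:
    """True for code points in the BMP PUA or in Planes 15 and 16."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def is_well_formed(seq: str, is_tag_base: Callable[[str], bool]) -> bool:
    """Syntax check only: base, one or more tag characters, CANCEL TAG.

    Simplification (assumption): the base is a single code point, so
    modifier and presentation sequences are not handled.
    """
    if len(seq) < 3:
        return False
    base, spec, end = seq[0], seq[1:-1], seq[-1]
    return (is_tag_base(base)
            and all(TAG_SPEC_FIRST <= ord(c) <= TAG_SPEC_LAST for c in spec)
            and ord(end) == CANCEL_TAG)


def standard_base(ch: str) -> bool:
    """Hypothetical stand-in for the enumerations in the UTS #51 data files."""
    return ch == "\U0001F3F4"  # WAVING BLACK FLAG


def private_base(ch: str) -> bool:
    """The same list, privately extended to admit PUA code points."""
    return standard_base(ch) or is_private_use(ch)


welsh_flag = "\U0001F3F4" + to_tags("gbwls") + chr(CANCEL_TAG)
pua_sequence = "\U0010FFFD" + to_tags("!313125") + chr(CANCEL_TAG)
```

Under `standard_base` the U+10FFFD sequence is rejected; under `private_base` it is accepted. Which predicate a conformant process may use is exactly what the thread is arguing about, so the sketch takes no position on it.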

> It doesn't suffice to say, well, I've decided that U+F0000 is going to
> be an emoji character, so I can use it in an emoji_tag_sequence, per UTS
> #51. Rather, what one would have to do is build out a private agreement
> that 1) I am going to be treat U+F0000 as an emoji, and 2) I am going to
> be using a private extension of the concept of an emoji_tag_sequence
> which allows my "emoji" U+F0000 as a tag_base. I can document that, and
> if I can get somebody else to buy into that private agreement, then by
> all means, interchange all you want.

In other words, an emoji tag sequence headed by a private-use emoji can indeed be plain text. Good, that's what I thought!

> But anybody else who happens to
> sample some of that text is under no obligation whatsoever to interpret
> any of that, or to even recognize your private extension of the concept
> of an emoji_tag_sequence to even be syntactically correct, let alone
> interpretable.

Agreed, the risk of interpretation problems outside the context of the private agreement exists, just as with all other PUA usage. This particular usage does add some risk with its exotic syntax possibly upsetting some conformance gatekeeper. But how great is that risk in practical terms? I hope my example of "???????????????????" (for which I didn't create a private agreement, not even in my own mind) isn't crashing anyone's computer?

From asmusf at ix.netcom.com Tue May 7 13:42:55 2024
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 7 May 2024 11:42:55 -0700
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com> <71e3bb8c-1921-4bca-8a37-ff04ba88d7f0@sonic.net>
Message-ID:

On 5/7/2024 11:16 AM, Erik Carvalhal Miller via Unicode wrote:
> "The current conformant use of the undeprecated 96 tag characters is
> specified in Unicode Technical Standard #51, 'Unicode Emoji.' See
> ED-14a. emoji tag sequence (ETS) and Annex C, Valid Emoji Tag
> Sequences in that specification." No, the Standard itself (book or
> collection) does not define what emoji tag sequences are or which ones
> are valid; but that same Standard points to UTS #51 as the definitive
> specification of ETSs for "conformant use" of tag characters.

Contrary to your reading, the correct interpretation of that passage is one that considers the tag characters as quasi reserved for use with a specific external protocol. The word "current" allows Unicode to later designate other protocols, if desired.

It very clearly does not contemplate any other uses of tag characters, so no, their use with PUA characters (or any other characters) is not conformant in the same way as assigning a PUA code point a private character.

In order to use the tag characters conformantly, you must claim conformance to both TUS and UTS#51.

A./

From mark at kli.org Tue May 7 16:27:05 2024
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 7 May 2024 17:27:05 -0400
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com> <71e3bb8c-1921-4bca-8a37-ff04ba88d7f0@sonic.net>
Message-ID:

I guess one could always define one's own set of "tag characters" in the PUA and use them exactly as suggested for the regular tag characters (or any other way, for that matter). Use of PUA characters between consenting adults is nothing Unicode is concerned with, right? Does that make it a "higher-level protocol"? Don't know, don't entirely care. If users agree to treat data a certain way, that's sort of the definition of a protocol, and even plain text is a protocol by that reasoning.

~mark

On 5/7/24 14:42, Asmus Freytag via Unicode wrote:
> On 5/7/2024 11:16 AM, Erik Carvalhal Miller via Unicode wrote:
>> "The current conformant use of the undeprecated 96 tag characters is
>> specified in Unicode Technical Standard #51, 'Unicode Emoji.' See
>> ED-14a. emoji tag sequence (ETS) and Annex C, Valid Emoji Tag
>> Sequences in that specification." No, the Standard itself (book or
>> collection) does not define what emoji tag sequences are or which
>> ones are valid; but that same Standard points to UTS #51 as the
>> definitive specification of ETSs for "conformant use" of tag characters.
>
> Contrary to your reading, the correct interpretation of that passage
> is one that considers the tag characters as quasi reserved for use
> with a specific external protocol. The word "current" allows Unicode
> to later designate other protocols, if desired.
>
> It very clearly does not contemplate any other uses of tag characters,
> so no, their use with PUA characters (or any other characters) is not
> conformant in the same way as assigning a PUA code point a private
> character.
>
> In order to use the tag characters conformantly, you must claim
> conformance to both TUS and UTS#51.
>
> A./
>

From jameskass at code2001.com Tue May 7 18:04:39 2024
From: jameskass at code2001.com (James Kass)
Date: Tue, 7 May 2024 23:04:39 +0000
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com> <71e3bb8c-1921-4bca-8a37-ff04ba88d7f0@sonic.net>
Message-ID: <0791c71a-17c4-4eb5-a8a2-9533fedb0415@code2001.com>

On 2024-05-07 9:27 PM, Mark E. Shoulson via Unicode wrote:
> I guess one could always define one's own set of "tag characters" in
> the PUA and use them exactly as suggested for the regular tag
> characters (or any other way, for that matter.) Use of PUA characters
> between consenting adults is nothing Unicode is concerned with, right?

Sounds good to me. Seems like the only viable path forward (to making progress) WRT novel inventions and so forth. Enabling experimentation is one of the functions of the PUA. Publish the scheme, provide a mechanism for using it, and see if anybody wants it. Actual usage would determine its merits, if any.

From wjgo_10009 at btinternet.com Wed May 8 05:22:17 2024
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 8 May 2024 11:22:17 +0100 (BST)
Subject: Use of tag characters in a private encoding - is it valid please?
In-Reply-To:
References:
Message-ID: <5ccfcd75.63b.18f57b9ad4a.Webtop.101@btinternet.com>

For this research let us use U+10FFFD as the base character, and based on

https://www.unicode.org/charts/PDF/UE0000.pdf

for this research let us define U+F7020 through to U+F707F as RESEARCH TAG characters, such that, as examples,

U+F7020 RESEARCH TAG SPACE

U+F7021 RESEARCH TAG EXCLAMATION MARK

and

U+F707F RESEARCH CANCEL TAG

Placing the research tags in Plane 15 is so that any software written will be more easily converted to using Plane 14 tag code points if the research is successful and leads to a regular Unicode encoding.

The choice of the F7 range rather than the F0 range is so that the code point for a RESEARCH TAG is more than 1 bit different from the code point of a TAG.

William Overington

Wednesday 8 May 2024

From jameskass at code2001.com Wed May 8 16:46:05 2024
From: jameskass at code2001.com (James Kass)
Date: Wed, 8 May 2024 21:46:05 +0000
Subject: Fonts and Unicode conformance (was Re: Use of tag ,,,)
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID:

On 2024-05-07 2:23 AM, Erik Carvalhal Miller via Unicode wrote:
> "☄^!313125" is valid Unicode, such as it is. If in normal
> reading mode it's meant to be replaced by a different comet or an
> aardvark or a Klingon symbol for empire or anything other than a
> representation of the characters "☄^!313125", then the intended
> interpretation is not valid as Unicode plain text, though it may be
> perfectly valid markup of some sort or another beyond Unicode's
> concern. If the idea is for a font to make one of those
> substitutions, then such a font is not Unicode-conformant.

TTF/OTF fonts are essentially programs.

Years ago, IIRC, John Hudson postulated on an OpenType forum that an OpenType font could be designed to substitute innocuous words for swear words.
So, for example, if a dog lover developed a font that would replace the string "cat " with the string "dog " in the display, would that be considered non-conformant? (Keeping in mind that the font display doesn't alter the underlying encoded text and cannot affect interchange and storage.)

Or suppose a font developer named Zebediah Waldo Jablonsky set up an OpenType font to display a monogram any time his initials appeared in all-caps. Or if a business set up an OpenType font to display its logo whenever a string like COMET plus CIRCUMFLEX appeared in the text. Would either of those fonts be viewed as non-conformant?

An OpenType font could theoretically be set up to display an aardvark glyph for any text string, even for the string <img src="aardvark.jpg">. Why would a browser program displaying an image for that string be conformant, yet a font program doing the same thing be non-conformant? (I'm not saying it wouldn't be silly to do so.)

From mark at kli.org Wed May 8 16:58:14 2024
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 8 May 2024 17:58:14 -0400
Subject: Fonts and Unicode conformance (was Re: Use of tag ,,,)
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID:

On 5/8/24 17:46, James Kass via Unicode wrote:
>
> Years ago, IIRC, John Hudson postulated on an OpenType forum that an
> OpenType font could be designed to substitute innocuous words for
> swear words. So, for example, if a dog lover developed a font that
> would replace the string "cat " with the string "dog " in the display,
> would that be considered non-conformant? (Keeping in mind that the
> font display doesn't alter the underlying encoded text and cannot
> affect interchange and storage.)

https://www.thepolitetype.com/

> Or suppose a font developer named Zebediah Waldo Jablonsky set up an
> OpenType font to display a monogram any time his initials appeared in
> all-caps. Or if a business set up an OpenType font to display its
> logo whenever a string like COMET plus CIRCUMFLEX appeared in the
> text. Would either of those fonts be viewed as non-conformant?
>
> An OpenType font could theoretically be set up to display an aardvark
> glyph for any text string, even for the string <img
> src="aardvark.jpg">. Why would a browser program displaying an image
> for that string be conformant, yet a font program doing the same thing
> be non-conformant? (I'm not saying it wouldn't be silly to do so.)

I regularly (amuse myself and) make fonts render "www" as a ligature, etc.

~mark

From doug at ewellic.org Wed May 8 23:26:49 2024
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 9 May 2024 04:26:49 +0000
Subject: Fonts and Unicode conformance (was Re: Use of tag ,,,)
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID:

Mark E. Shoulson wrote:

> I regularly (amuse myself and) make fonts render "www" as a ligature,
> etc.

Microsoft's Cascadia Code does this sort of thing on the regular, which to me is a great reason to use Cascadia Mono instead:

https://github.com/microsoft/cascadia-code?tab=readme-ov-file#font-features

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From cate at cateee.net Thu May 9 00:33:15 2024
From: cate at cateee.net (Giacomo Catenazzi)
Date: Thu, 9 May 2024 07:33:15 +0200
Subject: Fonts and Unicode conformance (was Re: Use of tag ,,,)
In-Reply-To:
References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com>
Message-ID: <4aaec420-39b1-4369-ab8f-b7fc7cb60a2e@cateee.net>

On 09.05.2024 06:26, Doug Ewell via Unicode wrote:
> Mark E. Shoulson wrote:
>
>> I regularly (amuse myself and) make fonts render "www" as a ligature,
>> etc.
>
> Microsoft's Cascadia Code does this sort of thing on the regular, which to me is a great reason to use Cascadia Mono instead:
>
> https://github.com/microsoft/cascadia-code?tab=readme-ov-file#font-features

Fira is similar, but I think they changed the default. And IDEs should have options to turn it off (IMHO the default should be off).

But that hints at an additional feature of fonts: they may include the same character twice or more, with different glyphs or spacing, selectable with font options (which are often not easily accessible to writers). Very common are "normal digits" and "tabular digits". In some cases Unicode provides some control through variation selectors, e.g. for digit 0 (but why only one way? I can force the *slash*, but I cannot force its absence).

If there were ever a "Unifont", I assume that, just for Latin scripts, a printed version would take the space of many multi-volume encyclopedias.
In any case, the most funny/annoying part is Turkish support: same character (and same script: Latin) but different glyph (compared other languages), an also the contrary: same glyph (Turkish vs. most of rest of languages using Latin scripts) but different Unicode character (and also read as different character). The scope of Unicode is interchange and semantic (and it is already a huge task). Note: Compiler community uses such method: they define semantic, the rest is left as "quality of implementation": you cannot rule on everything (and in every details): users should choose sensible compilers (e.g. no compiler will allows a infinite long source file, but is a compiler conformant if they accept only file shorter then 10 Unicode codepoints?). Or if the implementation of dynamic memory is just "fail with no-memory left". giacomo From mark at kli.org Thu May 9 07:48:22 2024 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 9 May 2024 08:48:22 -0400 Subject: Fonts and Unicode conformance (was Re: Use of tag ,,,) In-Reply-To: References: <4087e65c.51b1.18f20480164.Webtop.101@btinternet.com> <75c724d3.70f1.18f2b09a95f.Webtop.101@btinternet.com> <02dbc1bd-ed4e-4e77-8403-5381e16c4c56@ix.netcom.com> <0a531b2e-30b6-425f-825c-5bef583c3deb@code2001.com> Message-ID: Yes, there are not a few fonts out there that do this, along with various "programming ligatures."? I've committed worse atrocities as well, all the while keeping fonts ostensibly "monospaced." There was the time I developed a bunch of monospace blackletter(!) fonts for use in terminals (modifying regular fraktur fonts or the few existing monospace frakturs I found), with those ridiculous ligatures fraktur seems like for "ch" and "ck" and stuff, that look way too smooshed together and too much spacing around them, and ligatures for "mm" which always looks cramped, and r-rotunda contextual alternates... Anyway, bottom line is, there's lots of fonts out there doing all kinds of creative ligaturing.? 
The assorted "code fonts" with ligatures for multi-character operations are examples (Fira, Cascadia, Iosevka, Nerd Fonts, etc. Pragmata Pro is probably the grand-daddy of them all, with ligatures essentially for *styling* of words like "BUG" and "TODO") ~mark On 5/9/24 00:26, Doug Ewell via Unicode wrote: > Mark E. Shoulson wrote: > >> I regularly (amuse myself and) make fonts render "www" as a ligature, >> etc. > Microsoft's Cascadia Code does this sort of thing on the regular, which to me is a great reason to use Cascadia Mono instead: > > https://github.com/microsoft/cascadia-code?tab=readme-ov-file#font-features > > -- > Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org > From wjgo_10009 at btinternet.com Thu May 9 07:59:38 2024 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 9 May 2024 13:59:38 +0100 (BST) Subject: Use of tag characters in a private encoding - is it valid please? In-Reply-To: References: Message-ID: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> Readers who would like to try the invention, whether by a thought experiment, or with pen and paper, or by writing software, either privately or by also posting in this thread may like the following research scenario. A man in Portugal is using a cascading menu system to select preset sentences listed in Portuguese and including them in an email to send to an Information Management Centre in Slovenia where the lady there is using a computer where the preset sentences are listed in Slovenian and preset sentences in an incoming email are localized into Slovenian and displayed in Slovenian. Using the Private Use Area code points specified in the post https://corp.unicode.org/pipermail/unicode/2024-May/010900.html please explore the process of the man in Portugal sending a message enquiring about his sister whose train has been diverted because of an avalanche and the lady in Slovenia sending a reply that she is safe.
Here are some codes with the meanings expressed in English. !123 Good day. !987 Best regards, !128 The following question has been asked. !129 My answer is as follows. !313125 Is there any information about the following person please? !313672 The enquirer is the brother of the first person that was named. !313987 The person is safe. William Overington Thursday 9 May 2024 -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at sonic.net Thu May 9 09:00:49 2024 From: kenwhistler at sonic.net (Ken Whistler) Date: Thu, 9 May 2024 07:00:49 -0700 Subject: Use of tag characters in a private encoding - is it valid please? In-Reply-To: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> References: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> Message-ID: <7b1db70e-2bda-49a0-ae1d-24db21910816@sonic.net> Hmmm, On 5/9/2024 5:59 AM, William_J_G Overington via Unicode wrote: > > please explore the process of the man in Portugal sending a message > enquiring about his sister whose train has been diverted because of an > avalanche and the lady in Slovenia sending a reply that she is safe. > Well, most likely, the man in Portugal sends a text on his phone inquiring: A minha irmã Catarina estava no comboio que foi desviado por uma avalanche. Você sabe se ela está segura? And the woman in Slovenia uses the translation app on her phone to read: Moja sestra Catarina je bila na vlaku, ki ga je preusmeril snežni plaz. Ali veste, ali je varna? She happens to know everyone on the train is safe, because the train is outside her window, and all the passengers have gathered in a cafe on her street. She sends a text back, saying: Tvoja sestra je varna. And the man in Portugal uses the translation app on his phone to read: Sua irmã está segura. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From tom at bluesky.org Thu May 9 09:34:08 2024 From: tom at bluesky.org (Tom Gewecke) Date: Thu, 9 May 2024 07:34:08 -0700 Subject: Use of tag characters in a private encoding - is it valid please? In-Reply-To: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> References: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> Message-ID: <4B129D24-9950-47A6-BD07-77284A9C11C0@bluesky.org> > On May 9, 2024, at 5:59 AM, William_J_G Overington via Unicode wrote: > Here are some codes with the meanings expressed in English. > You may not need to re-invent the codes, as this was worked out already during the telegraph era a century ago: https://archive.org/details/acmecommodityphr00acme/page/n15/mode/2up?ref=ol&view=theater? Acme commodity and phrase code : Acme Code Company : Free Download, Borrow, and Streaming : Internet Archive archive.org -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: acmecommodityphr00acme.jpeg Type: image/jpeg Size: 8345 bytes Desc: not available URL: From asmusf at ix.netcom.com Thu May 9 10:43:38 2024 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 9 May 2024 08:43:38 -0700 Subject: Use of tag characters in a private encoding - is it valid please? In-Reply-To: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> References: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> Message-ID: On 5/9/2024 5:59 AM, William_J_G Overington via Unicode wrote: > Readers who would like to try the invention, whether by a thought > experiment, or with pen and paper, or by writing software, either > privately or by also posting in this thread may like the following > research scenario. I think this suggestion clearly violates the spirit and policy of this list as well as past precedent. This is not the place to solicit others to collectively "try inventions" and then report on that. 
In particular, the topic of encoding phrases has been (repeatedly) ruled out of scope. Time for the list manager to close this thread. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From egg.robin.leroy at gmail.com Thu May 9 11:16:21 2024 From: egg.robin.leroy at gmail.com (Robin Leroy) Date: Thu, 9 May 2024 18:16:21 +0200 Subject: Use of tag characters in a private encoding - is it valid please? In-Reply-To: <4B129D24-9950-47A6-BD07-77284A9C11C0@bluesky.org> References: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> <4B129D24-9950-47A6-BD07-77284A9C11C0@bluesky.org> Message-ID: On Thu, 9 May 2024 at 16:40, Tom Gewecke via Unicode < unicode at corp.unicode.org> wrote: > You may not need to re-invent the codes, as this was worked out already > during the telegraph era a century ago: > Or indeed slightly earlier, in "Unicode." (1886), whose sixth edition (1889) may be found at https://archive.org/details/unicodeuniversa00unkngoog/page/n3/mode/2up. That book also reserves some cypher words for private use, https://archive.org/details/unicodeuniversa00unkngoog/page/n109/mode/2up; this will no doubt spark an exciting discussion as to what use of those words is valid? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Thu May 9 11:47:54 2024 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 9 May 2024 12:47:54 -0400 Subject: Use of tag characters in a private encoding - is it valid please? In-Reply-To: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> References: <40de72a1.a32.18f5d701853.Webtop.101@btinternet.com> Message-ID: <1ec9144c-3689-42b9-8f43-98db9be6abf7@kli.org> On 5/9/24 08:59, William_J_G Overington via Unicode wrote: > > Readers who would like to try the invention, whether by a thought > experiment, or with pen and paper, or by writing software, either > privately or by also posting in this thread may like the following > research scenario. > No.
That is not the purpose of this list. You want this to happen? You want people to use this? Then DO IT. YOU do the work, YOU do the research, YOU assemble a group of people who are also interested in and believe in your idea. When it's a standard other people use, THEN you can come to W3C or whoever and ask for it to be canonized as some international standard (but not Unicode, since it's already been determined to be out of scope.) Unicode is not your personal incubator for pursuing your own research and making it happen. We don't do it for anyone else either. You complain about how your ideas are never followed up here, but YOU are the one who isn't following up on them -- not following up in the right place, that is. You insist on trying to get Unicode to do the work for you, but YOU need to go and develop these ideas and GET THEM IN USE. Yes, there's the chicken-and-egg problem, and I whined and bitched about that an awful lot in trying to get Klingon encoded. But even then, even back at the beginning when there was "too little usage" because it wasn't encoded, there WAS some usage, and people were working on it and people were using it, even at the small scale, and that community grew and people contributed and now usage is definitely not lacking. The chicken-and-egg problem is annoying and possibly a bit unfair, but not insurmountable. You've developed these ideas on your website. Go find some like-minded people and raise a community that wants to use it, people who will write software to make it happen, etc. You can use the PUA, that's what it's for, if that's how you want to do it. Experience should tell you that you are unlikely to find many such like-minded people on this list, and this list isn't here for you to recruit for your own research anyway. You have good ideas? Then put them to use. It isn't Unicode's fault that they aren't getting used. ~mark -------------- next part -------------- An HTML attachment was scrubbed...
URL: From root at corp.unicode.org Thu May 9 12:16:52 2024 From: root at corp.unicode.org (root at corp.unicode.org) Date: Thu, 09 May 2024 12:16:52 -0500 Subject: Use of tag characters in a private encoding - is it valid please? Message-ID: <663d0504.fnyf1WbYjzeU5jx1%root@corp.unicode.org> Please consider this thread closed. Thank you. From mark at kli.org Thu May 9 17:27:39 2024 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 9 May 2024 18:27:39 -0400 Subject: HEBREW HE-WITH-ADNY-INSIDE In-Reply-To: References: <1b2771dd-39b7-4f59-a9f2-4a81bacec565@kli.org> <47445d1d-654f-49eb-b9e1-3171e4462140@smontagu.org> Message-ID: <41883694-836d-4a6e-b249-569fb4f484b6@kli.org> On 4/20/24 21:22, Mark E. Shoulson via Unicode wrote: > (I have never seen this usage in an instance of the Tetragrammaton > that is meant to be pronounced ELOHIM.? I don't know if it's done or > how.) OK, so I have found an example of it.? It was handled precisely the same as normal, or at least one of the normal handlings: YOD-HATAF-SEGOL, HE-HOLAM, VAV-HIRIQ, HE(-WITH-ADNY-INSIDE), with usual disputation as to where exactly the HOLAM went.? So I guess it's just handled normally.? (Another place in the same prayer-book (I think) did without the special final HE, but also went without it in a "normal" name in the same paragraph.? I guess they aren't 100% consistent.) Vocalization of the Tetragrammaton is not so perfectly unchanging as one might think, even without the cabbalistic pointings with repeated vowels.? Usually, it's SHEVA-HOLAM-QAMATS.? But if it's preceded by certain grammatical prefixes, the ALEF in "adonay" becomes silent (not a glottal stop anymore, it's as if it isn't there.? A similar thing happens in Arabic with the name of Allah, I believe, and also with the article.)? And then, because "adonay" would no longer have a vowel under the ALEF, there is similarly no SHEVA in the Tetragrammaton spelling.? 
When it's pronounced "elohim", I often see HATAF-SEGOL-HOLAM-HIRIQ, but the more classical spelling I believe is SHEVA-HOLAM-HIRIQ, which makes sense because of reasons, but probably people want to be as clear as possible about this unusual case.? That initial SHEVA or HATAF-SEGOL also drops out in the same circumstances for similar reasons. The strange cabbalistic pointing in Sefardi prayer books seems to be *usually* four copies of the same vowel (and counting ?? as a vowel, making for the eight-letter Tetragrammaton), but there seem to be some places where it isn't the same vowel, nor is it the "normal" pointing as described above.? I think the link below describes examples. Think I should write this up as a proposal?? Throw it at the wall and see what sticks? ~mark > > On 4/20/24 14:18, Simon Montagu via Unicode wrote: >> Is there any use case for this glyph except as the last letter of the >> Tetragrammaton? Does it make sense to encode it separately rather >> than the whole combination HEBREW TETRAGRAMMATON WITH ADNY INSIDE THE >> HE? >> >> On 18/04/2024 04:20, Mark E. Shoulson via Unicode wrote: >>> Wow, not a peep about this?? Surely a group this opinionated would >>> have something to say.? I guess I should propose this, since it's in >>> use? Probably would have a compatibility equivalence to just plain >>> HEBREW LETTER HE. >>> >>> ~mark >>> >>> On 4/1/24 17:39, Mark E. Shoulson via Unicode wrote: >>>> Looking waaaay back to my opus (with Michael Everson) of 1998, >>>> http://std.dkuug.dk/jtc1/sc2/wg2/docs/n1740/n1740.htm, I call to >>>> attention one particular case mentioned there: the case where the >>>> second HEBREW LETTER HE of the Tetragrammaton is made very wide and >>>> another Holy Name (Adonay, ALEF-DALET-NUN-YOD) is printed in >>>> smaller letters inside it. As mentioned last century, this is even >>>> now (well, then) commonly met with, especially in Sephardic prayer >>>> books. 
>>>> >>>> I mention it because I've found a bunch of professional Hebrew >>>> fonts which have a glyph for this special character. Take a look at >>>> any one of many (but not all) of the offerings of the Samtype >>>> Foundry at https://www.myfonts.com/collections/samtype-foundry and >>>> you'll see what I mean.? Sometimes it's visible in the sample >>>> image, sometimes it isn't even though it's in the font.? They seem >>>> to be placing the glyph at codepoint U+FB50, which is ARABIC LETTER >>>> ALEF WASLA ISOLATED FORM, probably because it's the next character >>>> after the extended Hebrew code-block that ends at U+FB4F HEBREW >>>> LIGATURE ALEF LAMED and because, being in an Arabic codeblock, it >>>> has RTL directionality (while the PUA I think has LTR >>>> directionality, which is most inconvenient.) >>>> >>>> So it seems that this really is a thing being used by typefounders >>>> even now.? Probably should be encoded, yes?? My rationale from 1998 >>>> of encoding the Tetragrammaton as a glyph in itself was apparently >>>> not accepted, though after a later paper, >>>> https://unicode.org/L2/L2015/15092-hebew-nomina-sacra.pdf and some >>>> discussion, the YOD TRIANGLE U+05EF was encoded. Perhaps this >>>> should be too?? I guess as a variant of HE perhaps?? (the name in >>>> the subject-header is not meant as a serious proposal for the >>>> glyph-name, though this letter is actually serious, despite the date.) >>>> >>>> ~mark >>> From duerst at it.aoyama.ac.jp Fri May 10 03:11:57 2024 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2E_D=C3=BCrst?=) Date: Fri, 10 May 2024 17:11:57 +0900 Subject: Name Property in Regular Expressions Message-ID: <50d7140c-0be1-4630-b3b5-7d39d08280b1@it.aoyama.ac.jp> Dear Unicoders, I hope this more on-topic than the most recent discussions. I have some questions regarding name properties in regular expressions, i.e. 
about https://www.unicode.org/reports/tr18/#Name_Properties 1) When matching (see also https://www.unicode.org/reports/tr44/#Matching_Rules), it's clear that "zero-width space" is equivalent to "ZERO WIDTH SPACE" or "zerowidthspace", but should something like "Ze-rowi-dThsp ace" (hyphens or spaces in the wrong places) also be equivalent? 2) TR 18 suggests wildcards such as \p{name=/ALIEN/}. This looks very convenient, but I have doubts that implementation was really considered when writing this down. In essence, this would have to run a regular expression over close to one megabyte of name data (+some additional processing for the algorithmically defined names), just to compile the regular expression. (It's possible to speed that up with some clever indexing, but this would only add additional complexity and space.) So my question is whether anybody actually knows about some implementation of this name wildcard feature. Regards, Martin. From asmusf at ix.netcom.com Fri May 10 03:25:54 2024 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 10 May 2024 01:25:54 -0700 Subject: Name Property in Regular Expressions In-Reply-To: <50d7140c-0be1-4630-b3b5-7d39d08280b1@it.aoyama.ac.jp> References: <50d7140c-0be1-4630-b3b5-7d39d08280b1@it.aoyama.ac.jp> Message-ID: <72232199-ce87-4e40-a7b7-234673d0769f@ix.netcom.com> On 5/10/2024 1:11 AM, Martin J. D?rst via Unicode wrote: > Dear Unicoders, > > I hope this more on-topic than the most recent discussions. > > I have some questions regarding name properties in regular > expressions, i.e. about > https://www.unicode.org/reports/tr18/#Name_Properties > > 1) When matching (see also > https://www.unicode.org/reports/tr44/#Matching_Rules), it's clear that > "zero-width space" is equivalent to "ZERO WIDTH SPACE" or > "zerowidthspace", but should something like > "Ze-rowi-dThsp ace" (hyphens or spaces in the wrong places) also be > equivalent? YES. > > 2) TR 18 suggests wildcards such as \p{name=/ALIEN/}. 
This looks very > convenient, but I have doubts that implementation was really > considered when writing this down. In essence, this would have to run > a regular expression over close to one megabyte of name data (+some > additional processing for the algorithmically defined names), just to > compile the regular expression. (It's possible to speed that up with > some clever indexing, but this would only add additional complexity > and space.) > So my question is whether anybody actually knows about some > implementation of this name wildcard feature. > > Regards,?? Martin. From cate at cateee.net Fri May 10 03:54:02 2024 From: cate at cateee.net (Giacomo Catenazzi) Date: Fri, 10 May 2024 10:54:02 +0200 Subject: Name Property in Regular Expressions In-Reply-To: <50d7140c-0be1-4630-b3b5-7d39d08280b1@it.aoyama.ac.jp> References: <50d7140c-0be1-4630-b3b5-7d39d08280b1@it.aoyama.ac.jp> Message-ID: On 10.05.2024 10:11, Martin J. D?rst via Unicode wrote: > 2) TR 18 suggests wildcards such as \p{name=/ALIEN/}. This looks very > convenient, but I have doubts that implementation was really considered > when writing this down. In essence, this would have to run a regular > expression over close to one megabyte of name data (+some additional > processing for the algorithmically defined names), just to compile the > regular expression. (It's possible to speed that up with some clever > indexing, but this would only add additional complexity and space.) > So my question is whether anybody actually knows about some > implementation of this name wildcard feature. You write *very convenient*. Could you give some example? I think such extension go on the gray area between semantic and rendering. Unicode is working also on such area (text segmentation), but I'm not so sure regexp can handle it (with metadata and complexities). 
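[Editorial illustration, not part of the original message.] A minimal Python sketch of the two points raised in this thread, under stated assumptions: `loose_key` approximates the loose matching rule for character names (ignore case, spaces, underscores, and medial hyphens), and `names_matching` is the brute-force form of a name wildcard such as \p{name=/ALIEN/}, scanning every code point name with a regular expression. Both function names are hypothetical, the name data is whatever the running Python's `unicodedata` module ships, and neither is taken from any regular-expression engine.

```python
import re
import sys
import unicodedata


def loose_key(name: str) -> str:
    """Approximate loose name matching: ignore case, spaces, underscores, medial hyphens."""
    # Assumption: a hyphen is "medial" when it has a letter or digit on both sides.
    # The documented exception for U+1180 HANGUL JUNGSEONG O-E is NOT handled here.
    no_medial = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    return re.sub(r"[ _]", "", no_medial).upper()


def names_matching(pattern: str) -> dict:
    """Brute force: test the name of every code point against a regular expression."""
    rx = re.compile(pattern)
    hits = {}
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")  # "" when the code point has no name
        if name and rx.search(name):
            hits[cp] = name
    return hits


# The misplaced hyphens and space from question 1 collapse to the same key.
print(loose_key("Ze-rowi-dThsp ace") == loose_key("ZERO WIDTH SPACE"))

# The wildcard from question 2, evaluated once over the whole name list.
aliens = names_matching(r"ALIEN")
for cp, name in sorted(aliens.items()):
    print(f"U+{cp:04X} {name}")
```

The full scan is the cost described in question 2: it walks the entire name list once per wildcard, so an engine would presumably want to do it at pattern-compile time and cache the resulting set of code points.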
On rendering site I do not see much problem on having all data, but on a server/database, I'm not sure it is so useful, where regexp may be used for search, security and tokenization. So my question: how do you find it useful (but for us unicode standard lovers)? giacomo From wjgo_10009 at btinternet.com Fri May 10 08:58:13 2024 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 10 May 2024 14:58:13 +0100 (BST) Subject: Is the scope of Unicode unchangeable for ever? Message-ID: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> Is the scope of Unicode unchangeable for ever? As in being guaranteed never to be changed as part of a stability guarantee or something like that? Or could the scope be changed if the Unicode Technical Committee wanted to allow some items not presently regarded as being in scope to become regarded as being in scope? Was the scope enlarged when emoji were encoded? This question is being asked as a general question. Clearly this has arisen now because I have been informed that some items that I would like encoded into Unicode are out of scope, but I am wondering what is the general answer on whether the scope of what items can be encoded is locked into what was decided many years ago or could it become widened if the Unicode Technical Committee wanted to do that. William Overington Friday 10 May 2024 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgcon6 at msn.com Fri May 10 11:58:11 2024 From: pgcon6 at msn.com (Peter Constable) Date: Fri, 10 May 2024 16:58:11 +0000 Subject: Is the scope of Unicode unchangeable for ever? In-Reply-To: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> References: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> Message-ID: The scope of what was considered _encodable character_ was enlarged when emoji were added since emoji are not simply text symbols but are full-blown graphic elements.
But this was done _very reluctantly_ mainly out of a sense of necessity due to the need to operate with what mobile characters had already implemented. At the time, there was a broad consensus that a better architecture was for a higher-level protocol to support inline images. You've been told multiple times over many years that certain things you'd like encoded are considered out of scope. Nothing has changed to suggest that UTC might want to consider proposals to expand scope. Peter From: Unicode On Behalf Of William_J_G Overington via Unicode Sent: Friday, May 10, 2024 6:58 AM To: unicode at corp.unicode.org Subject: Is the scope of Unicode unchangeable for ever? Is the scope of Unicode unchangeable for ever? As in being guaranteed never to be changed as part of a stability guarantee or something like that? Or could the scope be changed if the Unicode Technical Committee wanted to allow some items not presently regarded as being in scope to become regarded as being in scope? Was the scope enlarged when emoji were encoded? This question is being asked as a general question. Clearly this has arisen now because I have been informed that some items that I would like encoded into Unicode are out of scope, but I am wondering what is the general answer on whether the scope of what items can be encoded is locked into what was decided many years ago or could it become widened if the Unicode Technical Committee wanted to do that. William Overington Friday 10 May 2024 -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Fri May 10 12:47:01 2024 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 10 May 2024 10:47:01 -0700 Subject: Is the scope of Unicode unchangeable for ever?
In-Reply-To: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> References: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> Message-ID: <11be72d1-10a6-4da6-b058-09730dc2d4f0@ix.netcom.com> Unicode has the concept of rejecting certain proposals with prejudice (even if these actual words are perhaps not used). That means, such proposals cannot be re-submitted unless a supermajority first decides to allow reconsidering the issue (again, that's the gist of it, for precise details go look it up on the website). Such rejections with prejudice will limit the potential ways the scope of the Unicode standard can be fine tuned or "evolve" over time if you like that term. This list isn't a committee meeting. Mentioning things on this list will not lead to a ruling from the technical committee. Instead, you might get a reasonable prediction of how the committee would react to a given proposal -- based on a shared understanding about the role of and aims for development of the Unicode Standard. That kind of reaction is the best you can expect here. If people here suggest that something you are enamored of is out of scope for the Unicode Standard, it's useless to continue to pester the group with it. If you believe that you have a better insight into the committee's thinking than people who attend the meetings, your recourse is to submit a formal proposal, in which case there's a good probability that it will be rejected with prejudice. It's not only useless and impolite to continue to beat a dead horse on this list, it's animal cruelty. And that can get you flicked. A./ From mark at kli.org Fri May 10 14:49:51 2024 From: mark at kli.org (Mark E. Shoulson) Date: Fri, 10 May 2024 15:49:51 -0400 Subject: Is the scope of Unicode unchangeable for ever? 
In-Reply-To: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> References: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> Message-ID: On 5/10/24 09:58, William_J_G Overington via Unicode wrote: > Is the scope of Unicode unchangeable for ever? > Forever is a long time, so who's to say? But you can be pretty sure that if the scope does change, it will change _in response_ to usage, and not _to promote_ it. That's what happened with emoji: they were in wide use already, the concept of them being "plain text" was already in the community and in use among vendors, and Unicode followed along. If you hope for the scope to expand, YOU have to expand it first. The only time you can reasonably ask "can we expand the scope?" is when Unicode really has no choice but to say yes. Go make the usage happen first. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskass at code2001.com Fri May 10 20:50:55 2024 From: jameskass at code2001.com (James Kass) Date: Sat, 11 May 2024 01:50:55 +0000 Subject: Is the scope of Unicode unchangeable for ever? In-Reply-To: References: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> Message-ID: On 2024-05-10 4:58 PM, Peter Constable via Unicode wrote: > You've been told multiple times over many years that certain things > you'd like encoded are considered out of scope. Examples of this are legion. Here's one from 2010 on the High-Logic forum in response to a post expressing hope that some stuff will be encoded in regular Unicode. https://forum.high-logic.com/viewtopic.php?p=12429#p12429 Here's an excerpt from that response: "I'm not sure what you base that hope on. Unicode is not a plaything for hobbyists. It is a serious, industry driven effort to allow the entire world to electronically store text records from both the present and the past.
The fact that no one has responded to this thread since the middle of April, and the only responses you have ever received on the Unicode list are along the lines of "let's create an off-topic list, so the main message board isn't clogged by this stuff" should indicate to you that this is not germaine, and is generally thought to be at best uninteresting, and more probably annoying." Many good people have offered sound advice over the years. To no avail. From junicode at jcbradfield.org Sat May 11 05:05:11 2024 From: junicode at jcbradfield.org (Julian Bradfield) Date: Sat, 11 May 2024 11:05:11 +0100 (BST) Subject: Is the scope of Unicode unchangeable for ever? References: <57efda9b.1da1.18f62cc191f.Webtop.101@btinternet.com> Message-ID: Why do people keep feeding the troll? (Don't answer, just think about it the next time you feel an urge to follow up to William.) From sosipiuk at gmail.com Sat May 11 11:54:43 2024 From: sosipiuk at gmail.com (Sławomir Osipiuk) Date: Sat, 11 May 2024 16:54:43 +0000 Subject: Is the scope of Unicode unchangeable for ever? In-Reply-To: References: Message-ID: <1715445715603.2167280539.4065268480@gmail.com> On Friday, 10 May 2024, 21:50:55 (-04:00), James Kass via Unicode wrote: > Examples of this are legion. Here's one from 2010 on the High-Logic forum in response to a post expressing hope that some stuff will be encoded in regular Unicode. > https://forum.high-logic.com/viewtopic.php?p=12429#p12429 > Here's an excerpt from that response: > "(...) It is a serious, industry driven effort (...)" I think this is key to understanding Unicode. It's primarily composed of INDUSTRY i.e. tech corporations (and... AirBnB for some reason). The members of the consortium have the biases (and they ARE biased) and follow the trends of the tech industry. They want things that are either directly useful to their purposes, or that at least generate reliable public goodwill and engagement.
They don't want things that are unproven, useless, or silly. Once you understand WHO you're dealing with, it becomes more obvious why certain decisions are made. (e.g. emoji) Sławomir Osipiuk From ecm.unicode at gmail.com Sat May 11 13:33:23 2024 From: ecm.unicode at gmail.com (Erik Carvalhal Miller) Date: Sat, 11 May 2024 14:33:23 -0400 Subject: Is the scope of Unicode unchangeable for ever? In-Reply-To: <1715445715603.2167280539.4065268480@gmail.com> References: <1715445715603.2167280539.4065268480@gmail.com> Message-ID: On Saturday, May 11, 2024, Sławomir Osipiuk via Unicode < unicode at corp.unicode.org> wrote: > > "(...) It is a serious, industry driven effort (...)" > They don't want things that are unproven, useless, or silly. ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskass at code2001.com Wed May 15 16:28:41 2024 From: jameskass at code2001.com (James Kass) Date: Wed, 15 May 2024 21:28:41 +0000 Subject: Questions about Indic Conjuct Clusters In-Reply-To: References: Message-ID: <2d762f3e-b9ca-4e03-b972-ef1ac26a6780@code2001.com> On 2024-04-17 6:46 PM, Don Hosek via Unicode wrote: > It's not immediately clear from the specification what the correct > implementation would be for a few pathological cases of the Indic > Conjuct Cluster specification in the Unicode 15.1.0 specification. > > For convenience's sake, let's use the following shorthand: > > C = \p{InCB=Consonant} > E = \p{InCB=Extend} > L = \p{InCB=Linker} > M = \p{M} > > 1. It appears that both E and L are subsets of M and I think E ∪ L = M > . Is this correct? If so, is GB9c equivalent to saying that CM+C > should be considered a single cluster iff that sequence of > characters M+ contains at least one character from L? (Having > written this question and looking at the statement of the rule > from > https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html, > my restatement seems to correspond to 9.3 in that list). > 2.
Should a sequence like, e.g., CLCLC be considered a single cluster
> or would it be two clusters, CLCL ? C?
>
>
> I would note also that the chart at
> https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html seems
> to be not quite correct.
>
> -dh

One of the binary properties of U+094D DEVANAGARI SIGN VIRAMA (halant) is "Grapheme Link". So, IIUC, "CLCLC" is like Consonant + Virama + Consonant + Virama + Consonant, and it should be considered a single grapheme cluster.

Although I know a little bit about Indic conjuncts, I don't have a working understanding of the syntax of the page linked above. So I'm "bumping" this post in the hope that someone more knowledgeable will respond to the questions.

Meanwhile, here's a link to a Microsoft typography spec page which illustrates how the shaping engine determines cluster boundaries (of course using OpenType terminology):
https://learn.microsoft.com/en-us/typography/script-development/devanagari

One of the examples on that page is (Ra + halant + Da + halant + Ma + I-matra), which is treated as a cluster: र्द्मि

Hope this is helpful.

From julesbertholet at quoi.xyz Sun May 19 13:31:13 2024
From: julesbertholet at quoi.xyz (Jules Bertholet)
Date: Sun, 19 May 2024 18:31:13 +0000 (UTC)
Subject: Can 0023 FE0F be an emoji?
Message-ID:

The following sequences are listed in emoji-variation-sequences.txt
(https://www.unicode.org/Public/UNIDATA/emoji/emoji-variation-sequences.txt):

> 0023 FE0F ; emoji style; # (1.1) NUMBER SIGN
> 002A FE0F ; emoji style; # (1.1) ASTERISK
> 0030 FE0F ; emoji style; # (1.1) DIGIT ZERO
> 0031 FE0F ; emoji style; # (1.1) DIGIT ONE
> 0032 FE0F ; emoji style; # (1.1) DIGIT TWO
> 0033 FE0F ; emoji style; # (1.1) DIGIT THREE
> 0034 FE0F ; emoji style; # (1.1) DIGIT FOUR
> 0035 FE0F ; emoji style; # (1.1) DIGIT FIVE
> 0036 FE0F ; emoji style; # (1.1) DIGIT SIX
> 0037 FE0F ; emoji style; # (1.1) DIGIT SEVEN
> 0038 FE0F ; emoji style; # (1.1) DIGIT EIGHT
> 0039 FE0F ; emoji style; # (1.1) DIGIT NINE

However, these sequences do not appear in emoji-sequences.txt
(https://unicode.org/Public/emoji/15.1/emoji-sequences.txt),
except in combination with U+20E3 as part of an Emoji_Keycap_Sequence.

Is it permissible to treat the sequence 0023 FE0F (without trailing U+20E3) as an emoji? Is it recommended? Is it required?

Jules Bertholet

From boldewyn at gmail.com Sun May 19 14:34:07 2024
From: boldewyn at gmail.com (Manuel Strehl)
Date: Sun, 19 May 2024 21:34:07 +0200
Subject: Can 0023 FE0F be an emoji?
In-Reply-To:
References:
Message-ID: <10b3f98d-4bf7-4910-9281-2f792631dfb7@gmail.com>

Hi,

according to TR 51 and the emoji test file,
https://www.unicode.org/Public/emoji/15.1/emoji-test.txt
these are not fully qualified emojis. That means that, as far as Unicode is concerned, rendering them as emojis is not recommended for general interchange (“RGI”).
And in fact, some vendors (Google, Samsung, MS) do, some don’t (Apple, Facebook) render them as emojis:
https://emojipedia.org/digit-one#designs

Cheers,
Manuel

On 19.05.24 at 20:31, Jules Bertholet via Unicode wrote:
> The following sequences are listed in emoji-variation-sequences.txt
> (https://www.unicode.org/Public/UNIDATA/emoji/emoji-variation-sequences.txt):
>
> > 0023 FE0F ; emoji style; # (1.1) NUMBER SIGN
> > 002A FE0F ; emoji style; # (1.1) ASTERISK
> > 0030 FE0F ; emoji style; # (1.1) DIGIT ZERO
> > 0031 FE0F ; emoji style; # (1.1) DIGIT ONE
> > 0032 FE0F ; emoji style; # (1.1) DIGIT TWO
> > 0033 FE0F ; emoji style; # (1.1) DIGIT THREE
> > 0034 FE0F ; emoji style; # (1.1) DIGIT FOUR
> > 0035 FE0F ; emoji style; # (1.1) DIGIT FIVE
> > 0036 FE0F ; emoji style; # (1.1) DIGIT SIX
> > 0037 FE0F ; emoji style; # (1.1) DIGIT SEVEN
> > 0038 FE0F ; emoji style; # (1.1) DIGIT EIGHT
> > 0039 FE0F ; emoji style; # (1.1) DIGIT NINE
>
> However, these sequences do not appear in emoji-sequences.txt
> (https://unicode.org/Public/emoji/15.1/emoji-sequences.txt),
> except in combination with U+20E3 as part of an Emoji_Keycap_Sequence.
>
> Is it permissible to treat the sequence 0023 FE0F (without trailing
> U+20E3)
> as an emoji? Is it recommended? Is it required?
>
> Jules Bertholet
>
>

From rick at corp.unicode.org Fri May 31 15:55:52 2024
From: rick at corp.unicode.org (Rick McGowan)
Date: Fri, 31 May 2024 13:55:52 -0700
Subject: New Event on June 25 - Webinar on Bidirectional Text (Part 1): The Basics of Bidi
Message-ID: <4d8f2970-c89f-eeae-2f25-ff6a2995b169@unicode.org>

Registration is Now Open!

A number of scripts, such as Hebrew, Arabic and Urdu, write their letters horizontally on a page or screen, running right to left. A complication for these scripts is that other characters, such as digits, flow left-to-right, and can occur on the same line, or even alongside other left-to-right text, such as Latin.
Text that handles both right-to-left and left-to-right text is called “bidirectional” text (“bidi” in short). How to handle bidi text on browsers and in other software is challenging for both general users and implementers. This webinar will describe the basics with examples. It will be followed by a live question-and-answer period. A more in-depth question and answer session will take place August 13, 2024.

*Who?* If you are a translator or localizer, localization tooling maker, I18n infrastructure developer, linguist or language researcher, application developer, or content author, you will want to join us for this webinar. Bring your questions to the people involved for the live Q&A.

*When?* Tuesday, 25 June 2024 starting at 8am (San Francisco), 11am (New York), and 5pm (Berlin).

Registration is Open Now!

Please note this session will also be recorded and available via the Unicode YouTube channel.

------------------------------------------------------------------------

Getting Started with Bidirectional Text (Part 1): The Basics of Bidi

Frequently Asked Questions: https://unicode.org/faq/bidi.html

Articles:
* https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
* https://www.w3.org/International/questions/qa-scripts
* https://www.w3.org/International/techniques/authoring-html#direction
* https://w3c.github.io/i18n-drafts/techniques/authoring-html.en#gsdirection

Additional Articles from W3C:
* https://www.w3.org/International/articlelist#direction

About the Unicode Consortium

The Unicode Consortium is the premier non-profit open source, open standards body for the internationalization of all software and services. For more than 30 years, the Unicode Consortium has coordinated the efforts of a worldwide team of volunteer programmers and linguists to standardize, evolve, and maintain a global software foundation that allows virtually every computer system and service to help people connect using their native language.
For additional information about Unicode, visit home.unicode.org.

Unicode Resources
* Unicode Technical Quick Start Guide: https://home.unicode.org/technical-quick-start-guide/
* Unicode YouTube Playlist - Overview of Internationalization and Unicode Projects: https://www.youtube.com/playlist?list=PLMc927ywQmTNQrscw7yvaJbAbMJDIjeBh
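
As a rough illustration of the mixed-direction text described in the webinar announcement above (right-to-left letters sharing a line with digits and Latin text), the following Python sketch looks up the Bidi_Class of each character with the standard library's `unicodedata.bidirectional`. The helper names `bidi_classes` and `has_mixed_direction` are made up for this example, and the sketch only inspects per-character classes; it does not implement the Unicode Bidirectional Algorithm itself.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, Bidi_Class) pairs for each code point in text.

    Hypothetical helper for illustration only.
    """
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def has_mixed_direction(text):
    """True if text holds both strong right-to-left (R or AL) and
    strong left-to-right (L) characters.

    This is a simple class check, not the full bidi algorithm.
    """
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & {"R", "AL"}) and "L" in classes


# Hebrew letters, then European digits, then Latin letters on one line.
sample = "\u05e9\u05dc\u05d5\u05dd 123 abc"

for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")

print(has_mixed_direction(sample))
```

Hebrew letters report class R, Arabic letters AL, ASCII digits EN, and Latin letters L; how a line containing all of them is actually laid out is what the bidi algorithm (and the webinar) covers.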