From abrahamgross at disroot.org  Tue Jun  2 22:18:35 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 03 Jun 2020 03:18:35 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Message-ID:

Why are there precomposed Hebrew characters in Unicode (Alphabetic Presentation Forms block)? It says in the FAQ that "a substantial number of presentation forms were encoded in Unicode as compatibility characters, because legacy software or data included them." (https://www.unicode.org/faq/ligature_digraph.html#PForms)

I can't find any character set other than Unicode that has separate codepoints for all Hebrew letters with a dagesh/mapiq, or for any of the other precomposed letters other than the Yiddish ligatures (e.g. Code page 862, ISO/IEC 8859-8, Windows-1255). Does anyone know where I can find the legacy software or character sets that had these presentation forms?

I also want to see the documents/proposals that got these characters accepted as part of Unicode. Does anyone know where I can find them? The closest I got was when I figured out that the proposal to add HEBREW LETTER YOD WITH HIRIQ is proposal N1364, but I can't find it in the document register.

From asmusf at ix.netcom.com  Wed Jun  3 00:33:09 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 2 Jun 2020 22:33:09 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References:
Message-ID: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>

An HTML attachment was scrubbed...

From abrahamgross at disroot.org  Wed Jun  3 00:37:19 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 3 Jun 2020 05:37:19 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
Message-ID:

How successful might I be in adding an additional Hebrew character to the Alphabetic Presentation Forms block?

From jk at koremail.com  Wed Jun  3 02:34:43 2020
From: jk at koremail.com (jk at koremail.com)
Date: Wed, 03 Jun 2020 15:34:43 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
Message-ID:

Dear Abraham,

Adding such characters as these, whatever the language, is a thing of the past, so a submission would not be successful.

Yours sincerely,
John Knightley

On 2020-06-03 13:37, abrahamgross--- via Unicode wrote:
> How successful might I be in adding an additional Hebrew character to
> the Alphabetic Presentation Forms block?

From asmusf at ix.netcom.com  Wed Jun  3 12:51:38 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 3 Jun 2020 10:51:38 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...
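A quick way to see what the original question is pointing at is to dump the block's decomposition mappings with Python's unicodedata module. This is a minimal sketch, assuming nothing beyond the standard library and its bundled Unicode character database:

    import unicodedata

    # Hebrew portion of the Alphabetic Presentation Forms block: U+FB1D..U+FB4F.
    for cp in range(0xFB1D, 0xFB50):
        ch = chr(cp)
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # a few code points in this range are unassigned
        # decomposition() returns e.g. '05D9 05B4' (canonical), or
        # '<compat> 05D0 05DC', or '' when there is no mapping at all.
        print(f"U+{cp:04X} {name}: {unicodedata.decomposition(ch) or '(none)'}")

Nearly every assigned character in the range maps back to a base letter plus marks in the main Hebrew block, which matches the FAQ's framing: these characters exist for round-tripping legacy data, not for authoring new text.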
From mark at kli.org  Wed Jun  3 16:42:52 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 17:42:52 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References:
Message-ID: <9afe4a17-3b80-87b9-697e-3e14f9eab536@kli.org>

On 6/3/20 1:37 AM, abrahamgross--- via Unicode wrote:
> How successful might I be in adding an additional Hebrew character to
> the Alphabetic Presentation Forms block?

It is unlikely that such characters would be considered, but the way to add an additional character to anyplace is pretty much the same: submit a proposal, as described on https://www.unicode.org/pending/proposals.html

What were you thinking of adding? Things like a bent-neck LAMED or more "wide" letters are pretty much guaranteed non-starters, I would guess. A "closed" QOF or the broken VAV (as is traditionally written in Numbers 25:12) would at best be a very tough sell.

~mark

From mark at kli.org  Wed Jun  3 16:51:04 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 17:51:04 -0400
Subject: QID Emoji (was: Re: Wireless Connection Symbol)
In-Reply-To:
References:
Message-ID: <3a0e8cf4-353e-ac6c-b039-7503ce53ad75@kli.org>

{Sorry this is out of date; I discovered my email to the unicode list wasn't going through.}

I'm not sure how much I could add to the points that have already been made, but just to stand up and be counted, I also think QID emoji are an awful idea and I can barely believe they are even being considered seriously. The possibilities are just too broad, etc... what everyone else said. We'd do better with a highly-compressed (vector?) image format that could somehow squeeze decent pictures into a few dozen characters.

On 5/27/20 12:18 PM, Sławomir Osipiuk via Unicode wrote:
> The issue to be resolved here lies in the process for adding emojis.
> The current process is too onerous and slow. I can imagine a new
> process, that isn't bound to a regular schedule, and that allows
> eminently useful and needed emojis to be fast-tracked to approval in
> days, not months. Perhaps an entire plane could be reserved for such
> emojis - 65K should be enough for anyone, right? ;) Perhaps there
> could be a provisional or probationary approval granted to certain
> emojis, or at least a "reservation" system for code points. A vendor
> could reserve spaces with emojis they plan to add (with reasonable
> limits, of course). There could be a public voting system to add or
> approve emojis in near-real-time based on thresholds for approval.
> It's 2020; we have the technology. Provisional emojis or code point
> reservations that don't see use/support after some amount of time are
> rejected and code points are allowed to be reused. Those that see use
> or public support are given final approval and become bound by
> stability requirements. The Unicode Consortium is still involved, but
> less so, relying more on automated metrics than meetings, though they
> would still have veto power if there is some valid subjective factor
> to consider.

This is fairly well-said. The problem is obviously real, or real enough to bug people: it takes too darn long to get emoji into Unicode. It takes a long time to get anything into Unicode, but most of the things we're putting in at this stage of the game are rare characters, small-userbase scripts, etc., and even the people who would use them have been doing okay without them for a while. Emoji have a different type of demand.
Emoji become popular, and even "necessary," _after_ they are in the standard; lots of people are itching to use each incoming proposal, and their userbase is a very large and outspoken segment of computer-users. A provisional something-or-other? Not entirely a bad idea. Lots of details perhaps to work out, to avoid assorted "horror" situations (reusing a code-point?! so my serious document about pokémon might in later years appear with emoji of Linux distributions?! oh, won't someone think of the stability!) while still making it all work out. No, I don't know how to solve all those issues. But the idea bears consideration, more than QID emoji do, IMO.

~mark

From abrahamgross at disroot.org  Wed Jun  3 19:21:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 00:21:39 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com>
Message-ID: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>

What about a folded lamed? How do you think a proposal for that would go? I have plenty of proof of it being used in the same sentence (even in the same word) as a regular lamed, so it's not just an alternate form of the same character like a and ɑ.

Here are some examples: https://imgur.com/a/xw9Kb8Z

From mark at kli.org  Wed Jun  3 20:43:34 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 21:43:34 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
Message-ID: <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>

On 6/3/20 8:21 PM, abrahamgross--- via Unicode wrote:
> What about a folded lamed? How do you think a proposal for that would
> go? I have plenty of proof of it being used in the same sentence (even
> in the same word) as a regular lamed, so it's not just an alternate
> form of the same character like a and ɑ.
>
> Here are some examples: https://imgur.com/a/xw9Kb8Z

I think it would be a very hard sell. Just because they're used in the same sentence doesn't mean they aren't alternate forms of the same character. Sometimes there were scribal preferences, etc. There's no *meaning* that's different between the two LAMEDs. There isn't any text where it matters which one you use where, except for trying to replicate the exact *appearance* of a document, and that is exactly the realm of more sophisticated systems. Unicode isn't publishing software; it isn't supposed to replace Word. A LAMED is a LAMED.

The example in your picture is actually quite interesting, because it looks like they either ran out of bent LAMEDs or made a mistake or something. The bent LAMED was invented for reasons of typesetting: LAMED is the only letter with an ascender, and it tended to get in the way of things with Hebrew text being set with little or no leading and letter-height filling almost the entire line-height. You can see where there are straight LAMEDs on your page, that their ascenders reach into places in the line above that happen to be open enough not to cause problems, like spaces between words or letters with no baseline. Otherwise, the bent LAMED was pressed into service, because that's what it's for. Except... for the one you show inside a blue box.
That should have been a bent LAMED, because a straight one would have been bumping or almost bumping into the TSERE above it. But for whatever reason, they didn't use a bent LAMED, and made do by taking a straight LAMED and cutting off its head!

Here's another way to look at it. If you (or the original typesetter) had set this same text in the same font slightly differently, maybe a little wider or narrower, or maybe with an additional word or even a footnote-mark inserted or something, would the bent LAMEDs still be bent and the straight LAMEDs still be straight? No! The text would flow differently, and some of the straight LAMEDs would have to be bent, because they no longer had space above them, while some of the bent LAMEDs could be straight, because in this layout there's space for them.

So there isn't anything about the LAMED in the word you have highlighted in red that makes it "straight." That isn't a feature of the letter in the plain text. It's a feature of the typeset page. Just like there's nothing special about an "i" following an "f" (in many fonts) that makes it have no dot; it's just a thing that happens to i following f in those fonts, that they join into an ﬁ ligature. It isn't a feature of the i, it's a feature of the typesetting. (OK, that's a bad example, because of course ﬁ *is* encoded, but that was due to round-tripping considerations and other stuff that we don't like to apply anymore. But the idea is still useful.)

~mark

From asmusf at ix.netcom.com  Wed Jun  3 21:12:45 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 3 Jun 2020 19:12:45 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
Message-ID:

An HTML attachment was scrubbed...

From abrahamgross at disroot.org  Wed Jun  3 21:21:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 02:21:39 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
Message-ID: <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org>

This is exactly why I want the folded lamed.

(I also want the headless lamed because I've also seen it used a lot and I really like it. It's especially useful when you need to put a RAFE or other trop/accent marks on a folded lamed.)

2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
> The bent LAMED was invented for reasons of typesetting: LAMED is the
> only letter with an ascender, and it tended to get in the way of
> things with Hebrew text being set with little or no leading and
> letter-height filling almost the entire line-height. You can see where
> there are straight LAMEDs on your page, that their ascenders reach
> into places in the line above that happen to be open enough not to
> cause problems, like spaces between words or letters with no baseline.
> Otherwise, the bent LAMED was pressed into service, because that's
> what it's for.
From abrahamgross at disroot.org  Wed Jun  3 21:52:19 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 02:52:19 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org>
Message-ID:

Finally a good explanation! I still want the folded lamed though... but at least I get it now.

2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
> If you (or the original typesetter) had set this same text in the same
> font slightly differently, maybe a little wider or narrower, or maybe
> with an additional word or even a footnote-mark inserted or something,
> would the bent LAMEDs still be bent and the straight LAMEDs still be
> straight? No! The text would flow differently, and some of the
> straight LAMEDs would have to be bent, because they no longer had
> space above them, while some of the bent LAMEDs could be straight,
> because in this layout there's space for them. So there isn't anything
> about the LAMED in the word you have highlighted in red that makes it
> "straight." That isn't a feature of the letter in the plain text. It's
> a feature of the typeset page. Just like there's nothing special about
> an "i" following an "f" (in many fonts) that makes it have no dot;
> it's just a thing that happens to i following f in those fonts, that
> they join into an ﬁ ligature. It isn't a feature of the i, it's a
> feature of the typesetting. (OK, that's a bad example, because of
> course ﬁ *is* encoded, but that was due to round-tripping
> considerations and other stuff that we don't like to apply anymore.
> But the idea is still useful.)

From mark at kli.org  Wed Jun  3 22:02:44 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 23:02:44 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org>
Message-ID: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>

On 6/3/20 10:21 PM, abrahamgross--- via Unicode wrote:
> This is exactly why I want the folded lamed.
>
> (I also want the headless lamed because I've also seen it used a lot
> and I really like it. It's especially useful when you need to put a
> RAFE or other trop/accent marks on a folded lamed.)

Aha! So you need it for typesetting reasons! And that's exactly how you should obtain it. This is *precisely* why God created OpenType tables in modern fonts! So that when you have an AYIN with a vowel underneath it, the shape changes so it's flat and not descending as much (yes, I know, that's U+FB20 HEBREW LETTER ALTERNATIVE AYIN, but that, too, was added for reasons we don't like to admit anymore, and it would never be accepted today). I know for certain that John Hudson's "SBL Hebrew" font does exactly that; see the attached image. Nothing was done between the right frame and the left frame aside from typing a QAMATS. The letter changed automatically, because John Hudson has killer typography skillz[sic].
In fact, if I had used a PATAH, the letter would _not_ have changed UNTIL I typed a following letter, because a PATAH under an AYIN at the end of a word is a patah genuvah, which some prefer to set shifted over to the right a little.

I don't know of any font machinery that can actually change things based on what's present on the previous *line*; that may not be supported. But you can bet that such a thing won't be reason enough to encode a new character.

As for wanting other funky shapes, why, there's nothing to stop you. Just because they're all glyphic variants of the same letter doesn't mean you can't use them both. You can have stylistic alternatives in a font, so THIS "a" is two-story while THAT "a" is one-story, in the same font, by using your (higher-level!) formatting software to turn options on and off in setting the font. Look 'em up.

(A more brute-force method would be to make two copies of the font, FontA and FontB, the same except that one has a bent LAMED and one has a straight LAMED. Then you could change the LAMEDs you want to be this way to FontA and the ones you want that way to FontB.)

(I hope the picture came through.)

Bottom line: it's not bad to want these things, but this is not the way to get them. There are other tools especially for situations like this.

~mark

> 2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
>> The bent LAMED was invented for reasons of typesetting: LAMED is the
>> only letter with an ascender, and it tended to get in the way of
>> things with Hebrew text being set with little or no leading and
>> letter-height filling almost the entire line-height. You can see
>> where there are straight LAMEDs on your page, that their ascenders
>> reach into places in the line above that happen to be open enough not
>> to cause problems, like spaces between words or letters with no
>> baseline. Otherwise, the bent LAMED was pressed into service, because
>> that's what it's for.

From mark at kli.org  Wed Jun  3 22:10:04 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 3 Jun 2020 23:10:04 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID:

On 6/3/20 11:02 PM, Mark E. Shoulson via Unicode wrote:
> Nothing was done between the right frame and the left frame aside from
> typing a QAMATS. The letter changed automatically, because John
> Hudson has killer typography skillz[sic].

And to be clear, that means that the *characters* in the document are U+05E2 HEBREW LETTER AYIN followed by U+05B8 HEBREW POINT QAMATS, and the "alternative ayin" U+FB20 is nowhere to be seen and did not in fact need to exist for this to work. It's just an alternate glyph for the character U+05E2. Unicode encodes characters, not glyphs.

~mark
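The behavior described here lives entirely in the font's OpenType tables, and it can be poked at from script. Below is a rough sketch using the uharfbuzz bindings to HarfBuzz; the font path and the ss01 feature tag are placeholders, since which tag (ss##, cv##, salt), if any, a given font uses for alternate shapes is the font designer's choice:

    import uharfbuzz as hb

    def glyph_ids(text, features):
        blob = hb.Blob.from_file_path("SBLHebrew.ttf")  # placeholder font path
        font = hb.Font(hb.Face(blob))
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()  # infers script=Hebrew, direction=RTL
        hb.shape(font, buf, features)
        return [info.codepoint for info in buf.glyph_infos]  # glyph IDs, not code points

    ayin_qamats = "\u05E2\u05B8"  # AYIN + QAMATS, the sequence discussed above
    print(glyph_ids(ayin_qamats, {}))              # default shaping
    print(glyph_ids(ayin_qamats, {"ss01": True}))  # with a hypothetical stylistic set

If the two runs print different glyph IDs, the feature fired; either way the underlying characters never change, which is the whole point.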
From 747.neutron at gmail.com  Wed Jun  3 22:15:57 2020
From: 747.neutron at gmail.com (Wáng Yifán)
Date: Thu, 4 Jun 2020 12:15:57 +0900
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org>
Message-ID:

I can't say that I am knowledgeable in the Hebrew script at all, but at first glance at your examples, I feel that it would be more appropriate for this to be either put in the main block or realized with a variation selector, if it is of some significance and its usage is not algorithmically inferable.

Compatibility characters are for compatibility, which means coping with standards preceding Unicode that don't go along with the Unicode model. If no prior standard manifested the need for that character/glyph, it would rather be called a "new character", and it would have no reason to be stuffed into that block.

On Thu, 4 Jun 2020 at 10:38, abrahamgross--- via Unicode wrote:
>
> What about a folded lamed? How do you think a proposal for that would
> go? I have plenty of proof of it being used in the same sentence (even
> in the same word) as a regular lamed, so it's not just an alternate
> form of the same character like a and ɑ.
>
> Here are some examples: https://imgur.com/a/xw9Kb8Z

From abrahamgross at disroot.org  Wed Jun  3 22:23:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 03:23:39 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org>

> I don't know of any font machinery that can actually change things
> based on what's present on the previous *line*; that may not be
> supported. But you can bet that such a thing won't be reason enough
> to encode a new character.

Not even with a variation selector?

Do you know which of the standards that existed before Unicode had the Hebrew characters from the presentation forms block? If it had the alternative ayin, then chances are that it had an "alternative lamed" too.

From abrahamgross at disroot.org  Wed Jun  3 22:26:30 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 03:26:30 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID:

Why do the final forms of the Hebrew letters (ךםןףץ) exist as separate codepoints from their regular counterparts (כמנפצ), when Arabic - which has up to 4 forms for each letter - only got a single codepoint per letter?
From sosipiuk at gmail.com  Wed Jun  3 22:44:28 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Wed, 3 Jun 2020 23:44:28 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org>
Message-ID: <002701d63a22$737f25e0$5a7d71a0$@gmail.com>

A variation selector seems like a good choice here. There should be a way to request from the rendering engine a specific variant of the "same" character. There is precedent for that in many other characters/languages.

From jameskasskrv at gmail.com  Thu Jun  4 01:26:34 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Thu, 4 Jun 2020 06:26:34 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com>

On 2020-06-04 3:26 AM, abrahamgross--- via Unicode wrote:
> Why do the final forms of the Hebrew letters (ךםןףץ) exist as separate
> codepoints from their regular counterparts (כמנפצ), when Arabic -
> which has up to 4 forms for each letter - only got a single codepoint
> per letter?

Because they were in a legacy character set. Windows 1255: https://en.wikipedia.org/wiki/Windows-1255

From jameskasskrv at gmail.com  Thu Jun  4 01:30:53 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Thu, 4 Jun 2020 06:30:53 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com>
Message-ID: <4fde4f17-0168-90da-2c10-02edbc6d8764@gmail.com>

On 2020-06-04 6:26 AM, James Kass wrote:
>
> On 2020-06-04 3:26 AM, abrahamgross--- via Unicode wrote:
>> Why do the final forms of the Hebrew letters (ךםןףץ) exist as
>> separate codepoints from their regular counterparts (כמנפצ), when
>> Arabic - which has up to 4 forms for each letter - only got a single
>> codepoint per letter?
>
> Because they were in a legacy character set. Windows 1255:
> https://en.wikipedia.org/wiki/Windows-1255

P.S. - The Arabic positional variants from legacy character sets were encoded as presentation forms.
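The legacy constraint is easy to verify, since Python ships the Windows-1255 codec. Nothing in this check is hypothetical except the choice of letters: final kaf and ordinary kaf occupy distinct byte values in cp1255, so a single shared code point could never have round-tripped existing data:

    # U+05DA HEBREW LETTER FINAL KAF vs. U+05DB HEBREW LETTER KAF
    for ch in ("\u05DA", "\u05DB"):
        b = ch.encode("cp1255")    # each letter is its own single byte in Windows-1255
        back = b.decode("cp1255")  # and each byte decodes back losslessly
        print(f"U+{ord(ch):04X} -> 0x{b[0]:02X} -> U+{ord(back):04X}")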
From jr at qsm.co.il  Thu Jun  4 02:02:36 2020
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Thu, 4 Jun 2020 07:02:36 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <4fde4f17-0168-90da-2c10-02edbc6d8764@gmail.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <706837b6-5200-97dd-305f-6bac0adb27b1@gmail.com> <4fde4f17-0168-90da-2c10-02edbc6d8764@gmail.com>
Message-ID:

In modern Hebrew it is not possible, in general, to determine by means of a simple rule whether to use the final form or the non-final form. For example, in the word ?????? the non-final ? is used at the final position of the word, or in the Hebrew transliteration of Arabic words, such as ??????. In Arabic, on the other hand, according to the Arab representatives to the ISO committees, the choice of the form is strictly dependent on its position in the word and on the surrounding letters.

Anecdotally, the first draft of Windows-1255 did use the same algorithm as for 1256, and it failed miserably on first demonstration, as the name of the Microsoft Israel manager was ????.

Best Regards,

Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of James Kass via Unicode
Sent: Thursday, June 4, 2020 9:31 AM
To: unicode at unicode.org
Subject: Re: Why do the Hebrew Alphabetic Presentation Forms Exist

On 2020-06-04 6:26 AM, James Kass wrote:
>
> On 2020-06-04 3:26 AM, abrahamgross--- via Unicode wrote:
>> Why do the final forms of the Hebrew letters (ךםןףץ) exist as
>> separate codepoints from their regular counterparts (כמנפצ), when
>> Arabic - which has up to 4 forms for each letter - only got a single
>> codepoint per letter?
>
> Because they were in a legacy character set. Windows 1255:
> https://en.wikipedia.org/wiki/Windows-1255

P.S. - The Arabic positional variants from legacy character sets were encoded as presentation forms.

From marius.spix at web.de  Thu Jun  4 02:31:51 2020
From: marius.spix at web.de (Marius Spix)
Date: Thu, 4 Jun 2020 09:31:51 +0200
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <20200604093151.227437a0@spixxi>

Unicode also has German s (U+0073) and ſ (U+017F), which are equivalent but were both used in typesetting for a long time. If you want to precisely reproduce a historic text, it is required to have separate ways to encode different glyphs. In plaintext documents you have no influence on OpenType presentation.

But you can use variation selectors, which can be registered in the IVD database. This would probably be the best way. Technically, using variation selectors has the same effect as different code points, as sequences with different variation selectors would encode different shapes of the same character.

It also appears that there are more variants of lamed with special meanings in the Bible:
https://www.hebrew4christians.com/Grammar/Unit_One/Aleph-Bet/Lamed/lamed.html

Can someone confirm that all variants of lamed have the same numeric value of 30?
If it is different between the variants, that would qualify for different characters.

We also have special glyph variants of the same character for special purposes, like an open-tail g for IPA (ɡ, U+0261) or an alternative phi for math (ϕ, U+03D5), but these are completely optional and have no different meaning from the closed-tail g and the curled phi. As far as I know, linguists and mathematicians accept both glyph variants as mutually interchangeable. I guess they are only in Unicode for historic reasons.

Regards,

Marius

On Wed, 3 Jun 2020 23:02:44 -0400 "Mark E. Shoulson via Unicode" wrote:

> On 6/3/20 10:21 PM, abrahamgross--- via Unicode wrote:
> > This is exactly why I want the folded lamed.
> >
> > (I also want the headless lamed because I've also seen it used a lot
> > and I really like it. It's especially useful when you need to put a
> > RAFE or other trop/accent marks on a folded lamed.)
>
> Aha! So you need it for typesetting reasons! And that's exactly how
> you should obtain it. This is *precisely* why God created OpenType
> tables in modern fonts! So that when you have an AYIN with a vowel
> underneath it, the shape changes so it's flat and not descending as
> much (yes, I know, that's U+FB20 HEBREW LETTER ALTERNATIVE AYIN, but
> that, too, was added for reasons we don't like to admit anymore, and
> it would never be accepted today). I know for certain that John
> Hudson's "SBL Hebrew" font does exactly that; see the attached
> image. Nothing was done between the right frame and the left frame
> aside from typing a QAMATS. The letter changed automatically,
> because John Hudson has killer typography skillz[sic]. In fact, if I
> had used a PATAH, the letter would _not_ have changed UNTIL I typed
> a following letter, because a PATAH under an AYIN at the end of a
> word is a patah genuvah, which some prefer to set shifted over to
> the right a little.
>
> I don't know of any font machinery that can actually change things
> based on what's present on the previous *line*; that may not be
> supported. But you can bet that such a thing won't be reason enough
> to encode a new character.
>
> As for wanting other funky shapes, why, there's nothing to stop you.
> Just because they're all glyphic variants of the same letter doesn't
> mean you can't use them both. You can have stylistic alternatives in
> a font, so THIS "a" is two-story while THAT "a" is one-story, in the
> same font, by using your (higher-level!) formatting software to turn
> options on and off in setting the font. Look 'em up.
>
> (A more brute-force method would be to make two copies of the font,
> FontA and FontB, the same except that one has a bent LAMED and one
> has a straight LAMED. Then you could change the LAMEDs you want to
> be this way to FontA and the ones you want that way to FontB.)
>
> (I hope the picture came through.)
>
> Bottom line: it's not bad to want these things, but this is not the
> way to get them. There are other tools especially for situations
> like this.
>
> ~mark
>
> > 2020/06/03 9:44:18 PM, Mark E. Shoulson via Unicode:
> >> The bent LAMED was invented for reasons of typesetting: LAMED is
> >> the only letter with an ascender, and it tended to get in the way
> >> of things with Hebrew text being set with little or no leading and
> >> letter-height filling almost the entire line-height. You can see
> >> where there are straight LAMEDs on your page, that their ascenders
> >> reach into places in the line above that happen to be open enough
> >> not to cause problems, like spaces between words or letters with
> >> no baseline. Otherwise, the bent LAMED was pressed into service,
> >> because that's what it's for.
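Mechanically, a variation sequence of the kind Marius describes is nothing more than plain text with a selector appended. The sketch below is purely illustrative: no lamed variation sequence is registered in the IVD or in StandardizedVariants.txt today, so a conformant renderer would simply ignore the selector and draw an ordinary lamed:

    import unicodedata

    LAMED = "\u05DC"
    VS1 = "\uFE00"  # VARIATION SELECTOR-1

    seq = LAMED + VS1  # a hypothetical, unregistered variation sequence
    for c in seq:
        print(f"U+{ord(c):04X} {unicodedata.name(c)}")
    # Searching, sorting, and the letter's numeric value (30) would still
    # see a single lamed; only a renderer that knows the registered
    # sequence would ever change the glyph.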
From abrahamgross at disroot.org  Thu Jun  4 02:32:54 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 4 Jun 2020 07:32:54 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200604093151.227437a0@spixxi>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604093151.227437a0@spixxi>
Message-ID: <55a36a11-1463-46bf-be57-7282b61b6b68@disroot.org>

They all share the value of 30.

2020/06/04 3:28:51 AM, Marius Spix via Unicode:
> It also appears that there are more variants of lamed with special
> meanings in the Bible:
> https://www.hebrew4christians.com/Grammar/Unit_One/Aleph-Bet/Lamed/lamed.html
>
> Can someone confirm that all variants of lamed have the same numeric
> value of 30? If it is different between the variants, that would
> qualify for different characters.

From richard.wordingham at ntlworld.com  Thu Jun  4 02:59:37 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 4 Jun 2020 08:59:37 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org>
Message-ID: <20200604085937.5c3135d9@JRWUBU2>

On Wed, 3 Jun 2020 23:02:44 -0400 "Mark E. Shoulson via Unicode" wrote:

> I don't know of any font machinery that can actually change things
> based on what's present on the previous *line*; that may not be
> supported. But you can bet that such a thing won't be reason enough
> to encode a new character.

And that leads to a problem that would not be solved by the encoding of a new character. If the position of line breaks may change, or the text may be reset in a font with different relative character widths, then which lamedhs are bent would change.

Arguably, the right place for standardisation is probably OpenType and AAT features - and it might even be addressed already.

Richard.
TUS gives an explanation for the separate encoding of those final forms, in the section on Hebrew. Devising rules for automatic selection would be too difficult, and would probably need an override mechanism anyway. There are similar cases scattered through Unicode. Off the top of my head I can think of: U+017F LATIN SMALL LETTER LONG S U+03C2 GREEK SMALL LETTER FINAL SIGMA U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA Richard. From abrahamgross at disroot.org Thu Jun 4 03:37:41 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Thu, 4 Jun 2020 08:37:41 +0000 (UTC) Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604092857.32b2cf60@JRWUBU2> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> Message-ID: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Whats TUS? From mark at kli.org Thu Jun 4 07:28:08 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 4 Jun 2020 08:28:08 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <002701d63a22$737f25e0$5a7d71a0$@gmail.com> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org> <002701d63a22$737f25e0$5a7d71a0$@gmail.com> Message-ID: <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> On 6/3/20 11:44 PM, S?awomir Osipiuk via Unicode wrote: > > A variation selector seems like a good choice here. There should be a > way to request from the rendering engine a specific variant of the > ?same? character. There is precedent for that in many other > characters/languages. > This isn't a matter for a variation selector.? This is purely a *scribal* or *presentation* alternation.? It has as much relevance to the content of the text as choice of font.? This is a matter for a stylistic alternate in the font tables.? This is *exactly* what those are for! ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Thu Jun 4 07:58:23 2020 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Thu, 4 Jun 2020 14:58:23 +0200 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Message-ID: <2f972d97-620c-c5fd-c224-1c0fc5636cad@gmail.com> Le 04/06/2020 ? 10:37, abrahamgross--- via Unicode a ?crit?: > Whats TUS? > The Unicode Standard, I guess. It is available here www.unicode.org/versions/latest/.? 
The part on Hebrew in https://www.unicode.org/versions/Unicode13.0.0/ch09.pdf , indeed contains the following paragraph: Because final form usage is a matter of spelling convention, software should not automatically substitute final forms for nominal forms at the end of words. The positional variants should be coded directly and rendered one-to-one via their own glyphs?that is, without contextual analysis. Fr?d?ric From sosipiuk at gmail.com Thu Jun 4 08:00:28 2020 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Thu, 4 Jun 2020 09:00:28 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Message-ID: On Thu, Jun 4, 2020 at 4:42 AM abrahamgross--- via Unicode wrote: > > Whats TUS? > I believe that means "The Unicode Standard" and the section Richard Wordingham was referring to is in Chapter 9: Final (Contextual Variant) Letterforms. Variant forms of five Hebrew letters are encoded as separate characters in this block, as in Hebrew standards including ISO/IEC 8859-8. These variant forms are generally used in place of the nominal letterforms at the end of words. Certain words, however, are spelled with nominal rather than final forms, particu- larly names and foreign borrowings in Hebrew and some words in Yiddish. Because final form usage is a matter of spelling convention, software should not automatically substitute final forms for nominal forms at the end of words. The positional variants should be coded directly and rendered one-to-one via their own glyphs?that is, without contextual analy- sis. From mark at kli.org Thu Jun 4 08:02:40 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 4 Jun 2020 09:02:40 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604085937.5c3135d9@JRWUBU2> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604085937.5c3135d9@JRWUBU2> Message-ID: On 6/4/20 3:59 AM, Richard Wordingham via Unicode wrote: > On Wed, 3 Jun 2020 23:02:44 -0400 > "Mark E. Shoulson via Unicode" wrote: > >> I don't know of any font machinery that can actually change things >> based on what's present on the previous *line*; that may not be >> supported. But you can bet that such a thing won't be reason enough >> to encode a new character. > And that leads to a problem that would not be solved by the encoding of > a new character. If the position of line breaks may change, or the > text may be reset in a font with different relative character widths, > then which lamedhs are bent would change. > > Arguably, the right place for standardisation is probably OpenType and > AAT features - and it might even be addressed already. Yes, exactly.? An author (or typesetting program, higher level than a font) would have to choose the right variant for each LAMED... which is what 'salt' tables are for, isn't it? 
~mark From richard.wordingham at ntlworld.com Thu Jun 4 08:30:38 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2020 14:30:38 +0100 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604092857.32b2cf60@JRWUBU2> <8307c659-5901-4f37-b761-ac6c56990fd1@disroot.org> Message-ID: <20200604143038.25fa9e05@JRWUBU2> On Thu, 4 Jun 2020 08:37:41 +0000 (UTC) abrahamgross--- via Unicode wrote: > Whats TUS? > The Unicode Standard. Richard. From richard.wordingham at ntlworld.com Thu Jun 4 11:15:39 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2020 17:15:39 +0100 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org> <002701d63a22$737f25e0$5a7d71a0$@gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> Message-ID: <20200604171539.7bfb71cb@JRWUBU2> On Thu, 4 Jun 2020 08:28:08 -0400 "Mark E. Shoulson via Unicode" wrote: > On 6/3/20 11:44 PM, S?awomir Osipiuk via Unicode wrote: > > > > A variation selector seems like a good choice here. There should be > > a way to request from the rendering engine a specific variant of > > the ?same? character. There is precedent for that in many other > > characters/languages. > This isn't a matter for a variation selector.? This is purely a > *scribal* or *presentation* alternation.? It has as much relevance to > the content of the text as choice of font.? This is a matter for a > stylistic alternate in the font tables.? This is *exactly* what those > are for! That wasn't obvious to whoever first implemented them in MS Word. The feature settings for a font applied throughout the document! There's also a problem that application writers think one needs a friendly interface expressed in layman's terms, whereas a fix like this is quite likely to be described in the documentation as 'Set feature cv05 to 6 for lamedh to be bent'. It took ages to get OpenType features supported in LibreOffice, even though they'd already implemented Graphite features. Now, it has been pointed out elsewhere that for best effects, shaping should apply to whole paragraphs. Fortunately, applying to whole words is usually good enough. However, what if a word has two lamedhs, and only is to be bent? Are mere word-processors now up to handling that and processing the whole word as a whole, even though different parts have different feature settings? Richard. 
From richard.wordingham at ntlworld.com Thu Jun 4 11:27:49 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2020 17:27:49 +0100 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604085937.5c3135d9@JRWUBU2> Message-ID: <20200604172749.357309a1@JRWUBU2> On Thu, 4 Jun 2020 09:02:40 -0400 "Mark E. Shoulson via Unicode" wrote: > > Arguably, the right place for standardisation is probably OpenType > > and AAT features - and it might even be addressed already. > Yes, exactly.? An author (or typesetting program, higher level than a > font) would have to choose the right variant for each LAMED... which > is what 'salt' tables are for, isn't it? I was thinking more along the lines of something like tnum, which gets digits to have the same advance width so that numbers in rows of digits can more easily align. You then don't have to refer to the font documentation; if you want that behaviour, either the font doesn't support it, or you just specify that feature tnum be applied. Richard. From abrahamgross at disroot.org Thu Jun 4 11:29:50 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Thu, 4 Jun 2020 16:29:50 +0000 (UTC) Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604171539.7bfb71cb@JRWUBU2> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4@disroot.org> <002701d63a22$737f25e0$5a7d71a0$@gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7@kli.org> <20200604171539.7bfb71cb@JRWUBU2> Message-ID: What? I don't understand what you're saying here 2020/06/04 ??0:17:21 Richard Wordingham via Unicode : > However, what if a word has two lamedhs, and > only is to be bent?? Are mere word-processors now up to handling that > and processing the whole word as a whole, even though different parts > have different feature settings? > From prosfilaes at gmail.com Thu Jun 4 12:05:00 2020 From: prosfilaes at gmail.com (David Starner) Date: Thu, 4 Jun 2020 10:05:00 -0700 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> Message-ID: On Wed, Jun 3, 2020 at 10:51 PM abrahamgross--- via Unicode wrote: > > Why do the final forms of the hebrew letters (?????) exist as separate codepoints from their regular counterparts (?????), when arabic - which has up to 4 forms for each letter - only got a single codepoint per letter? Because encoding is full of somewhat arbitrary choices. Alphabets with a handful of variant forms, like Latin, Greek, and Hebrew, it's easier and more expected to encode those separately, instead of complicating systems with one exception. Keyboard entry can go directly into a buffer with minimal massaging. 
Scripts like Arabic, where each letter takes four forms, would be harder to deal with under that model; you can't expect keyboard users to type each form separately, so either you add a heavy input manager, or you encode each letter and let the font deal with the different forms. (Which has its problems; I suspect if Persian script had been encoded separately/Persian was the main user of the Arabic script, that it would have been encoded slightly differently, as Persian uses ZWJ and ZWNJ more frequently to force forms. But the current encoding still works for Persian; it's just a matter of tradeoffs.) -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991) From mark at kli.org Thu Jun 4 15:30:20 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 4 Jun 2020 16:30:20 -0400 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <20200604093151.227437a0@spixxi> References: <7fd1336c-75d4-df85-7678-bb74405d18b3@ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54@disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685@kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea@disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d@kli.org> <20200604093151.227437a0@spixxi> Message-ID: <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4@kli.org> {Sent this morning, but it bounced due to size.? Re-sending, with attachments, using jpg for smaller file-sizes.} On 6/4/20 3:31 AM, Marius Spix via Unicode wrote: > Unicode also has German s (U+0073) and ? (U+017F) which are > equivalent, but were used in typesetting for a long time. If you want > to precisely reproduce a historic text, it is required to have > separate ways to encode different glyps. In plaintext documents you > have no influence on OpenType presentation. Long-s also existed in earlier standards, and so had to be preserved. > But you can use variant selectors, which can be registered > in the IVD database. This would be propably the best way. Technically, > using variant selectors has the same effect as different code points, > as and would encode different shapes of the > same character. I don't think this rises even to the level of variation selectors.? This is a scribal alternation, like deciding to put some extra swash into a letter in this word but not that one. It's the whole purpose of OpenType tables. > It also appears that there are more variants of lamed with special > meanings in the bible: > https://www.hebrew4christians.com/Grammar/Unit_One/Aleph-Bet/Lamed/lamed.html > > Can someone confirm that all variants of lamed have the same numeric > value of 30? If it is different between the variants, that would > qualify for different characters. They are all 30, and more importantly they are all LAMEDs. Every one of those examples, the spelling of the word includes simply LAMED.? That's what's in the text.? What's on the paper (or parchment) can't be considered "plain text" since written or printed text is by definition formatted somehow, to fit on the page. You don't want to go down the rabbit hole of letters written in certain old Torahs with anomalous tags, extra tags, curled and looped heads, etc (these exist, I have sources if you want.)? Those are specialized cases and not even accepted (halachically) as significant in writing a Torah. (You'd have better luck with the broken VAV in Numbers 25:12, which is at least still done in modern Torahs.)? 
I think these are too specialized a case to be considered actual variant letters.? Attached are some pictures from an old Torah I saw on display.? The first shows a "looped" or "wrapped" PEH.? In the second one, note extra tags on the SAMECHs in the second line and on the FINAL KAF in the last.? The medial closed MEM in ????? in Isaiah 9:6 is at least codified in the Mesorah as well. Unlike (I think) Arabic positional variants, the Hebrew final forms have had more of an independent life as letters, considered as symbols of their own, so even if it weren't for the legacy encodings, they probably would have been rightly encoded separately.? After all, you can adjust what kind of joining an Arabic letter shows with proper use of ZWJ and ZWNJ, so the use of non-final PEH at the end of a word, *from a purely typographic perspective*, would not have been a barrier to encoding only a single PEH and choosing the form only by context. But there are other considerations in the case of Hebrew as it actually is.? The fact is that a straight (final) PEH and a bent (non-final) PEH are *distinct* and different letters in Modern Hebrew, at least in the context of the end of a word.? As was mentioned already, if you spell the word ???? as ????, you have spelled it *wrong*, and it would be pronounced differently.? And that usage has been in place for a long time; I think it's in Yiddish as well (but not Biblical Hebrew, witness Proverbs 30:6, with the word ?????????, a final -P spelled with straight-PEH-dagesh).? There are some forms of gematria (numerology) which consider the final letters to have different numerical values than the non-final letters.? So there's some reasonable history to consider them distinct, and encoding them separately would have been the right move even without the legacy considerations.? I think Arabic traditions don't have such distinctions. > We also have special glyph variants of the same character for special > purposes, like an open tail g for IPA (?, U+0261? or an alternative phi > for math (?, U+03D5), but these are completely optional and have no > different meaning from the closed tail g and the curled phi. As far as > I know linguists and mathematicians accept both glyph variants mutually > interchangeable. I guess, they are only in Unicode for historic reasons. Not so!? Contrariwise, in fact, at least for the IPA ?.? The reason it is encoded is because IPA stipulates that the symbol for the voiced velar stop be written ? with an open loop, and it is incorrect to write it with a binocular g.? Linguists do not consider these to be mutually interchangeable.? Same with the IPA ?, which is wrong if written two-storey.? I'm not sure about mathematics usage, but I think that there may be situations in math wherein ? and ? were used with distinct meanings (and not just by an isolated author.) ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pastedpic.jpg Type: image/jpeg Size: 40116 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pastedpic2.jpg Type: image/jpeg Size: 7026 bytes Desc: not available URL: From mark at kli.org Thu Jun 4 16:01:34 2020 From: mark at kli.org (Mark E. 
Shoulson)
Date: Thu, 4 Jun 2020 17:01:34 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200604171539.7bfb71cb at JRWUBU2>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4 at disroot.org> <002701d63a22$737f25e0$5a7d71a0$ at gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7 at kli.org> <20200604171539.7bfb71cb at JRWUBU2>
Message-ID:

On 6/4/20 12:15 PM, Richard Wordingham via Unicode wrote:
> On Thu, 4 Jun 2020 08:28:08 -0400 "Mark E. Shoulson via Unicode" wrote:
>> On 6/3/20 11:44 PM, Sławomir Osipiuk via Unicode wrote:
>>> A variation selector seems like a good choice here. There should be a way to request from the rendering engine a specific variant of the "same" character. There is precedent for that in many other characters/languages.
>> This isn't a matter for a variation selector. This is purely a *scribal* or *presentation* alternation. It has as much relevance to the content of the text as choice of font. This is a matter for a stylistic alternate in the font tables. This is *exactly* what those are for!
> That wasn't obvious to whoever first implemented them in MS Word. The feature settings for a font applied throughout the document!

Ah. I'd been seeing it in LibreOffice and other places, where you can twiddle the settings on individual spans, and didn't realize that originally these things were expected to be document-wide. Thank you for correcting me. Would you say, though, that while it may not be what they were originally meant for, this use fits very well into how they can be and are used today?

> There's also a problem that application writers think one needs a friendly interface expressed in layman's terms, whereas a fix like this is quite likely to be described in the documentation as 'Set feature cv05 to 6 for lamedh to be bent'. It took ages to get OpenType features supported in LibreOffice, even though they'd already implemented Graphite features.

Yeah, user interface is a hassle at all levels, and complicated things are going to have complicated interfaces.

> Now, it has been pointed out elsewhere that for best effects, shaping should apply to whole paragraphs. Fortunately, applying to whole words is usually good enough. However, what if a word has two lamedhs, and only one is to be bent? Are mere word-processors now up to handling that and processing the whole word as a whole, even though different parts have different feature settings?

Yes, what I had been envisioning would indeed involve setting the use of font features on small (one-character) spans in the middle of words, and I didn't consider how well word-processors can handle such a thing, and I don't really know. What about things like 'swsh' tables for swash effects? Are those applied to a whole word (paragraph?) at a time, but the table itself only affects the final letters of words? Or do you have to apply it to each individual letter that you would see swashed? If the latter, it's a lot like what I'm thinking about in this case.

~mark

From mark at kli.org Thu Jun 4 16:08:57 2020
From: mark at kli.org (Mark E.
Shoulson)
Date: Thu, 4 Jun 2020 17:08:57 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200604172749.357309a1 at JRWUBU2>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604085937.5c3135d9 at JRWUBU2> <20200604172749.357309a1 at JRWUBU2>
Message-ID: <561d3072-dce7-afa9-1c15-3281f4e51520 at kli.org>

On 6/4/20 12:27 PM, Richard Wordingham via Unicode wrote:
> On Thu, 4 Jun 2020 09:02:40 -0400 "Mark E. Shoulson via Unicode" wrote:
>>> Arguably, the right place for standardisation is probably OpenType and AAT features - and it might even be addressed already.
>> Yes, exactly. An author (or typesetting program, higher level than a font) would have to choose the right variant for each LAMED... which is what 'salt' tables are for, isn't it?
> I was thinking more along the lines of something like tnum, which gets digits to have the same advance width so that numbers in rows of digits can more easily align. You then don't have to refer to the font documentation; if you want that behaviour, either the font doesn't support it, or you just specify that feature tnum be applied.

And this, as you mentioned before, affecting the entire document, or at least a whole paragraph or table. But of course, the intent isn't to make the user choose between all straight LAMEDs and all bent ones, but to allow some to be one and some the other. I was thinking 'salt' tables could be used kind of like formatting instructions, to apply to _this_ span and not _that_ one, like you can highlight a single letter and italicize it. Even if they can't be used that way, then maybe it isn't a font thing; maybe the higher typesetting system has to make these decisions. After all, it's something that depends on the entire text-block and how the typesetter saw fit to lay it out. It's like hyphenation in that way, if you think about it. A hyphen character can't "know" that it is in a situation where it must break the line and become visible; that decision is made by the word processor. (Just turning visible at the end of a line can, of course, be handled at the font level.)

~mark

From wjgo_10009 at btinternet.com Thu Jun 4 16:14:20 2020
From: wjgo_10009 at btinternet.com (wjgo_10009 at btinternet.com)
Date: Thu, 4 Jun 2020 22:14:20 +0100 (BST)
Subject: QID Emoji
Message-ID: <28e02754.18ab.172812f4f7a.Webtop.43 at btinternet.com>

QID Emoji

The Public Review on the QID Emoji proposal is open for comment until 9 July 2020. https://www.unicode.org/review/

Discussion of QID Emoji in this mailing list, which is interesting and useful, does not however automatically form part of what is considered by the Unicode Technical Committee. https://www.unicode.org/review/pri408/

So, whatever your opinion on the QID Emoji proposal, you might like to consider sending it in on the contact form. Maybe a good compromise solution to the issue can be found.

William Overington

Thursday 4 June 2020

From mark at kli.org Thu Jun 4 17:14:59 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 4 Jun 2020 18:14:59 -0400
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
Message-ID:

An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Thu Jun 4 17:49:58 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 4 Jun 2020 23:49:58 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4 at disroot.org> <002701d63a22$737f25e0$5a7d71a0$ at gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7 at kli.org> <20200604171539.7bfb71cb at JRWUBU2>
Message-ID: <20200604234958.518a3a76 at JRWUBU2>

On Thu, 4 Jun 2020 16:29:50 +0000 (UTC) abrahamgross--- via Unicode wrote:
> What? I don't understand what you're saying here
>
> 2020/06/04 午後0:17:21 Richard Wordingham via Unicode:
>> However, what if a word has two lamedhs, and only one is to be bent? Are mere word-processors now up to handling that and processing the whole word as a whole, even though different parts have different feature settings?

Enabling and disabling features changes the set of rules a renderer uses to convert a sequence of characters to a sequence of coloured glyphs with defined relative positions. Even in simple scripts, they can control, amongst many other things, the horizontal spacing of letters, and even adjustments to white space, even handling things that were handled on typewriters by rules such as two spaces after commas and three spaces after full stops. (I'm told these were the RAF rules.)

A set of rules is easiest to implement if the rules are the same for the whole of the string being shaped. One solution is to chop a string up into runs with the same rules to be applied. However, the font then loses control over how the two parts line up. Now, if you have two lamedhs in a word, you may wish to bend one and not bend the other. Any adjustment between letters becomes difficult if the letters are subject to different rules. An obvious question would be, "Which rules apply to their interaction?"

Much as I dislike the idea of using variation sequences to control 'stylistic' effects, it does avoid these problems. It does come with the cost of increasing the number of glyphs whose interaction must be considered, though there are tricks to reduce the amount of thought required.

Richard.

From richard.wordingham at ntlworld.com Thu Jun 4 18:22:05 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 5 Jun 2020 00:22:05 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org>
Message-ID: <20200605002205.696251ba at JRWUBU2>

On Thu, 4 Jun 2020 16:30:20 -0400 "Mark E. Shoulson via Unicode" wrote:
> On 6/4/20 3:31 AM, Marius Spix via Unicode wrote:
>> We also have special glyph variants of the same character for special purposes, like an open tail g for IPA (ɡ, U+0261) or an alternative phi for math (ϕ, U+03D5), but these are completely optional and have no different meaning from the closed tail g and the curled phi.
>> As far as I know, linguists and mathematicians accept both glyph variants as mutually interchangeable. I guess they are only in Unicode for historic reasons.
> Not so! Contrariwise, in fact, at least for the IPA ɡ. The reason it is encoded is because IPA stipulates that the symbol for the voiced velar stop be written ɡ with an open loop, and it is incorrect to write it with a binocular g.

The IPA threw the towel in on that one, and now allow either.

> Linguists do not consider these to be mutually interchangeable. Same with the IPA ɑ, which is wrong if written two-storey.

That's different. [a] and [ɑ] are two different sounds. Of course, it all gets horribly confused when typefaces for children's books use single-storey 'a' and open-loop 'g'.

> I'm not sure about mathematics usage, but I think that there may be situations in math wherein φ and ϕ were used with distinct meanings (and not just by an isolated author.)

I suspect that's the difference between curly phi and straight phi. I must say, though, that I need a soft-stroked phi that drops the part above the circle when one applies a superscript. I'm British and I find the fluxion notation useful. (And no, differentiation was introduced to me with the 'd' notation.)

Richard.

From richard.wordingham at ntlworld.com Thu Jun 4 20:11:39 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 5 Jun 2020 02:11:39 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <30d1773a-c162-4d3e-8790-6d63440e4ca4 at disroot.org> <002701d63a22$737f25e0$5a7d71a0$ at gmail.com> <2706c719-9c76-b50b-7179-0aac9c38ece7 at kli.org> <20200604171539.7bfb71cb at JRWUBU2>
Message-ID: <20200605021139.43c54eee at JRWUBU2>

On Thu, 4 Jun 2020 17:01:34 -0400 "Mark E. Shoulson via Unicode" wrote:
> On 6/4/20 12:15 PM, Richard Wordingham via Unicode wrote:
>> On Thu, 4 Jun 2020 08:28:08 -0400 "Mark E. Shoulson via Unicode" wrote:
>>> On 6/3/20 11:44 PM, Sławomir Osipiuk via Unicode wrote:
>>> This isn't a matter for a variation selector. This is purely a *scribal* or *presentation* alternation. It has as much relevance to the content of the text as choice of font. This is a matter for a stylistic alternate in the font tables. This is *exactly* what those are for!
>> That wasn't obvious to whoever first implemented them in MS Word. The feature settings for a font applied throughout the document!
> Ah. I'd been seeing it in LibreOffice and other places, where you can twiddle the settings on individual spans, and didn't realize that originally these things were expected to be document-wide. Thank you for correcting me. Would you say, though, that while it may not be what they were originally meant for, this use fits very well into how they can be and are used today?

Features were around long before MS Word implemented user control of them. The description of some of the features implies interaction between the author and the rendering machine. Cascading Style Sheets (CSS) brought optional features to the masses, and they're not designed for interactive layout.
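For anyone who wants to poke at this concretely, features can be toggled directly at the shaping level. Here is a minimal sketch using the uharfbuzz HarfBuzz bindings; the font file name and the 'cv05' feature tag are assumptions (a real font may expose a different tag, or none at all):

    # Shape the same string twice, toggling one OpenType feature, and
    # compare the resulting glyph sequences. "Example.ttf" is hypothetical.
    import uharfbuzz as hb

    with open("Example.ttf", "rb") as f:
        font = hb.Font(hb.Face(hb.Blob(f.read())))

    def shape(text, features):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()  # infers script/direction, e.g. Hebrew/RTL
        hb.shape(font, buf, features)
        return [info.codepoint for info in buf.glyph_infos]  # glyph IDs

    print(shape("\u05DC\u05DC", {}))              # two default lameds
    print(shape("\u05DC\u05DC", {"cv05": True}))  # variant lamed, if the font has one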
The CSS approach effectively provides a set of font customisations, so switching from one set to another seems to be like switching fonts, which you suggested as one approach. If it is implemented that way, then one loses font control across changes of options. A lot of Indic script engines appear not to have allowed interaction between clusters, so there would have been no loss of control by applying different fonts to different clusters.

Now, I've seen Windows interfaces that allow the application of features to be limited to parts of a string. I don't know how that works. I can imagine it becoming more sophisticated over time.

Reviewers of features were horrified that Word required the settings to apply across the document. That was widely seen as a design fault, and I trust it has now been fixed. My point was simply that what you saw as an obvious use of features was not obvious to everyone.

> Yes, what I had been envisioning would indeed involve setting the use of font features on small (one-character) spans in the middle of words, and I didn't consider how well word-processors can handle such a thing, and I don't really know. What about things like 'swsh' tables for swash effects? Are those applied to a whole word (paragraph?) at a time, but the table itself only affects the final letters of words? Or do you have to apply it to each individual letter that you would see swashed? If the latter, it's a lot like what I'm thinking about in this case.

I haven't used sophisticated layout systems, so I don't know how they work. I could well imagine that they didn't work with automatic kerning.

Richard.

From richard.wordingham at ntlworld.com Thu Jun 4 20:29:31 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 5 Jun 2020 02:29:31 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <561d3072-dce7-afa9-1c15-3281f4e51520 at kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604085937.5c3135d9 at JRWUBU2> <20200604172749.357309a1 at JRWUBU2> <561d3072-dce7-afa9-1c15-3281f4e51520 at kli.org>
Message-ID: <20200605022931.2bacd68a at JRWUBU2>

On Thu, 4 Jun 2020 17:08:57 -0400 "Mark E. Shoulson via Unicode" wrote:
> On 6/4/20 12:27 PM, Richard Wordingham via Unicode wrote:
>> On Thu, 4 Jun 2020 09:02:40 -0400 "Mark E. Shoulson via Unicode" wrote:
>>>> Arguably, the right place for standardisation is probably OpenType and AAT features - and it might even be addressed already.
>>> Yes, exactly. An author (or typesetting program, higher level than a font) would have to choose the right variant for each LAMED... which is what 'salt' tables are for, isn't it?
>> I was thinking more along the lines of something like tnum, which gets digits to have the same advance width so that numbers in rows of digits can more easily align. You then don't have to refer to the font documentation; if you want that behaviour, either the font doesn't support it, or you just specify that feature tnum be applied.
> And this, as you mentioned before, affecting the entire document, or at least a whole paragraph or table. But of course, the intent isn't to make the user choose between all straight LAMEDs and all bent ones, but to allow some to be one and some the other.
> I was thinking 'salt' tables could be used kind of like formatting instructions, to apply to _this_ span and not _that_ one, like you can highlight a single letter and italicize it.

Well, there's the rub. One loses layout control between italicised and unitalicised portions. This is how one would apply them using CSS. Of course, the system might be clever enough to work round breaks.

The wording of the salt feature at https://docs.microsoft.com/en-us/typography/opentype/spec/features_pt#-tag-salt suggests that the conception was that the salt feature would enable substitutions of a single glyph by another glyph, with a user interface allowing the user to choose the replacement glyph from a menu. There's no whiff of the notion that the choice presented might be context dependent. A clever enough system could have context-sensitive substitutions before and after that straddle the change in options selected. Other systems might not allow interactions between spans with different options chosen.

Richard.

From otto.stolz at uni-konstanz.de Fri Jun 5 06:26:50 2020
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Fri, 5 Jun 2020 13:26:50 +0200
Subject: German long S (was: Why do the Hebrew Alphabetic Presentation Forms Exist)
In-Reply-To: <20200604093151.227437a0 at spixxi>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi>
Message-ID: <487cc47f-df5b-0311-1e6b-f165ca8ee946 at uni-konstanz.de>

Hello,

on 2020-06-04 at 9:31, Marius Spix via Unicode wrote:
> Unicode also has German s (U+0073) and ſ (U+017F) which are equivalent,

No, they are not equivalent. In any orthography using ⟨ſ⟩ at all, ⟨s⟩ marks the end of a word, or of a constituent of a compound. Thus, e. g.
- „Wachstube“ [ˈvakstuːbə] = „Wachs-Tube“, a tube containing wax
- „Wachſtube“ [ˈvaxʃtuːbə] = „Wach-Stube“, guard room

Just a reminder, we have discussed this earlier in this list.

Best wishes,
  Otto

From asmusf at ix.netcom.com Fri Jun 5 14:41:01 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 5 Jun 2020 12:41:01 -0700
Subject: German long S
In-Reply-To: <487cc47f-df5b-0311-1e6b-f165ca8ee946 at uni-konstanz.de>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <487cc47f-df5b-0311-1e6b-f165ca8ee946 at uni-konstanz.de>
Message-ID:

An HTML attachment was scrubbed...
URL:

From tom at honermann.net Fri Jun 5 15:10:19 2020
From: tom at honermann.net (Tom Honermann)
Date: Fri, 5 Jun 2020 16:10:19 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Message-ID: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>

Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine):

> ... *Use of a BOM is neither required nor recommended for UTF-8*, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the "Byte Order Mark" subsection in Section 23.8, Specials, for more information.

The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend both to the presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of a BOM as an encoding signature is to be avoided in favor of some other mechanism?

The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature.

Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage.

So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? Or is it only intended to discourage BOM use in cases where the encoding is known by other means?

Tom.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 15:14:27 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 13:14:27 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

Emoji style seems wrong here. You would want this to look like the CC logo, not cute and colorful.

It sounds like the default assumption is for choosing a font with a math-like glyph vs. a CC-like glyph. If this does not work, then a standardized variation sequence might be useful. https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt

...
2295 FE00; with white rim; # CIRCLED PLUS
2297 FE00; with white rim; # CIRCLED TIMES
229C FE00; with equal sign touching the circle; # CIRCLED EQUALS
22DA FE00; with slanted equal; # LESS-THAN EQUAL TO OR GREATER-THAN
22DB FE00; with slanted equal; # GREATER-THAN EQUAL TO OR LESS-THAN
...

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 15:20:49 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 13:20:49 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

Actually, I should have looked at the proposal doc first: http://www.unicode.org/L2/L2017/17242r2-n4934r-creative-commons.pdf

> ... Their primary designs are exactly specified, while in current text forms may be used resembling the font design.
> ... In these examples, you see some variation in size, font, and placement, which is common for the © symbol as well. ...

In other words, the glyphs for these symbols are not as fixed as you might think, and the use of ⊜ likely fits right in.

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Fri Jun 5 16:47:49 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Fri, 5 Jun 2020 21:47:49 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
Message-ID:

The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with: you should always be using Unicode encodings, probably UTF-8 or UTF-16.) I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.

Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?

Anecdotally, if you can decode data without error in UTF-8, then it's probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short examples that can confuse it.

-Shawn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Fri Jun 5 17:00:23 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Fri, 5 Jun 2020 22:00:23 +0000
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

I don't really like the proposal at all. Is there prior context that I'm missing?

They don't want a "circled cc" character. They want a Creative Commons license symbol. They don't want the equivalent of ⓒ, they want ©. In plain text, the Creative Commons symbol has an explicit meaning; it's not a random emoji. It is unclear to me why this is being proposed as "circled characters" rather than "CC license symbols". My preference would be to see these encoded as "licensing symbols".

If I was designing a font that included the CC licensing symbols and the circled math symbols, I might choose to match the CC symbols published by them EXACTLY. However, the math symbols may have a slightly different style. As © and ⓒ likely do. Not to mention, if I have a © in my text, then it's clearly intended as an abbreviation for "copyright" and not a c that I thought looked prettier in a circle. That intent is not lost if I change fonts or whatever.

-Shawn

From: Unicode On Behalf Of Markus Scherer via Unicode
Sent: Freitag, 5. Juni 2020 13:21
To: Mark E. Shoulson
Cc: unicode at unicode.org
Subject: Re: Alternate presentation for U+229C CIRCLED EQUALS?

> Actually, I should have looked at the proposal doc first: http://www.unicode.org/L2/L2017/17242r2-n4934r-creative-commons.pdf ... In other words, the glyphs for these symbols are not as fixed as you might think, and the use of ⊜ likely fits right in.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tom at honermann.net Fri Jun 5 17:15:08 2020
From: tom at honermann.net (Tom Honermann)
Date: Fri, 5 Jun 2020 18:15:08 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References:
Message-ID: <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>

On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
> The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with: you should always be using Unicode encodings, probably UTF-8 or UTF-16.) I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
>
> Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?

The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project.

Tom.
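Since the thread keeps returning to sniffing source files, here is a minimal sketch of the BOM-as-signature check under discussion; the CP1252 fallback is an assumption for illustration, not anything the standard prescribes:

    # Detect a UTF-8 signature (EF BB BF); otherwise try strict UTF-8 and
    # fall back to an assumed legacy code page. Purely illustrative.
    import codecs

    def guess_source_encoding(path: str) -> str:
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"   # this codec strips the signature on decode
        try:
            data.decode("utf-8", errors="strict")
            return "utf-8"
        except UnicodeDecodeError:
            return "cp1252"      # assumed project-specific legacy fallback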
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From asmusf at ix.netcom.com Fri Jun 5 17:22:54 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 5 Jun 2020 15:22:54 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
Message-ID: <187adced-82bb-99c7-1d59-a82ee74f5d87 at ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From asmusf at ix.netcom.com Fri Jun 5 17:23:43 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 5 Jun 2020 15:23:43 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Fri Jun 5 17:33:23 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Fri, 5 Jun 2020 22:33:23 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
Message-ID:
I did find some DBCS CJK text that could look like valid UTF-8, so my ?one nine per byte of input? isn?t quite as high there, however for meaningful runs of text it is still reasonably hard to make sensible text in a double byte codepage look like UTF-8. Note that this ?works? partially because the ASCII range of the SBCS/DBCS code pages typically looks like ASCII, as does UTF-8. If you had a 7 bit codepage data with stateful shift sequences, of course that wouldn?t err in UTF-8. Fortunately for your scenario source code in 7 bit encodings is very rare nowadays. Hope that helps, -Shawn From: Tom Honermann Sent: Freitag, 5. Juni 2020 15:15 To: Shawn Steele Cc: Alisdair Meredith ; Unicode Mail List Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote: The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with you should always be using Unicode encodings, probably UTF-8 or UTF-16). I?d recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise. Are you asking because you?re interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding? The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project. Tom. Anecdotally, if you can decode data without error in UTF-8, then it?s probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short examples that can confuse it. -Shawn From: Unicode On Behalf Of Tom Honermann via Unicode Sent: Freitag, 5. Juni 2020 13:10 To: unicode at unicode.org Cc: Alisdair Meredith Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine): ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the ?Byte Order Mark? subsection in Section 23.8, Specials, for more information. The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend to both presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of UTF-8 as an encoding signature is to be avoided in favor of some other mechanism? The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature. Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage. So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From abrahamgross at disroot.org Fri Jun 5 17:48:05 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 5 Jun 2020 22:48:05 +0000 (UTC)
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com>
References: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com>
Message-ID: <60808396-077e-4df8-bc34-4793120dc91d at disroot.org>

Yes, thank you!

I vote for a separate codepoint for the CIRCLED EQUALS SIGN, since it has a different meaning and would also visually be displayed differently.

2020/06/05 午後6:27:43 Asmus Freytag via Unicode:
> Overloading this mathematical symbol with anything that needs different styling is *wrong*.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 18:04:12 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 16:04:12 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net>
Message-ID:

The BOM -- or for UTF-8, where "byte order" is meaningless, the Unicode signature byte sequence -- was popular when Unicode was gaining ground but legacy charsets were still widely used. Especially on Windows, which had settled on UTF-16 much earlier, lots of tools and editors started writing or expecting UTF-8 signatures. Other tools (especially in the Linux/Unix world) were never modified to expect or even cope with the signature, so they ignored it or choked on it. There has never been uniform practice on this.

For the most part, all new and recent text is now UTF-8, and the signature byte sequence has fallen out of favor again even where it had been used.

Having said that, I think the statement is right: "neither required nor recommended for UTF-8".

We might want to review chapter 23 and the FAQ and see if they should be updated.

Thanks,
markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Jun 5 18:17:30 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 16:17:30 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

On Fri, Jun 5, 2020 at 3:00 PM Shawn Steele wrote:
> I don't really like the proposal at all.

The proposal is from 2017/2018. These characters were added in Unicode 13.

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at kli.org Fri Jun 5 18:30:51 2020
From: mark at kli.org (Mark E.
Shoulson)
Date: Fri, 5 Jun 2020 19:30:51 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200605002205.696251ba at JRWUBU2>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2>
Message-ID: <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>

On 6/4/20 7:22 PM, Richard Wordingham via Unicode wrote:
> On Thu, 4 Jun 2020 16:30:20 -0400 "Mark E. Shoulson via Unicode" wrote:
>> Not so! Contrariwise, in fact, at least for the IPA ɡ. The reason it is encoded is because IPA stipulates that the symbol for the voiced velar stop be written ɡ with an open loop, and it is incorrect to write it with a binocular g.
> The IPA threw the towel in on that one, and now allow either.

Bah! Cowards. I suppose it doesn't matter from Unicode's perspective, since Unicode is also concerned with historical usage, and there was a time when it mattered. (That's oversimplifying, I know.)

>> Linguists do not consider these to be mutually interchangeable. Same with the IPA ɑ, which is wrong if written two-storey.
> That's different. [a] and [ɑ] are two different sounds. Of course, it all gets horribly confused when typefaces for children's books use single-storey 'a' and open-loop 'g'.

Well, it's "different" only because binocular g didn't have another meaning, as two-storey a does. Though to be honest, if IPA has to have ɑ because it uses two-storey a and one-storey ɑ contrastively, then by rights there ought to be a character (or variation sequence or something) like LATIN SMALL LETTER TWO STOREY A, since after all, some fonts don't draw U+0061 the way that IPA stipulates is needed for the open front vowel. I've wondered about that from time to time.

~mark

From mark at kli.org Fri Jun 5 18:38:52 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Fri, 5 Jun 2020 19:38:52 -0400
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

On 6/5/20 4:20 PM, Markus Scherer wrote:
> Actually, I should have looked at the proposal doc first: http://www.unicode.org/L2/L2017/17242r2-n4934r-creative-commons.pdf
>
>> ... Their primary designs are exactly specified, while in current text forms may be used resembling the font design. ... In these examples, you see some variation in size, font, and placement, which is common for the © symbol as well. ...
>
> In other words, the glyphs for these symbols are not as fixed as you might think, and the use of ⊜ likely fits right in.

Not certain I buy that. I'm a font designer; I'm going to be designing the Creative Commons symbols, even if not in some "standard" way, at least in the style I envision for them, but CIRCLED EQUALS, a mathematics operator, will have different needs and I'll be designing it to comport well with mathematics, and that is very likely to be different from the CC symbols.

~mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at kli.org Fri Jun 5 18:42:50 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Fri, 5 Jun 2020 19:42:50 -0400
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To: <60808396-077e-4df8-bc34-4793120dc91d at disroot.org>
References: <3d70accb-93a3-18ec-e1e1-a3a84005bf2f at ix.netcom.com> <60808396-077e-4df8-bc34-4793120dc91d at disroot.org>
Message-ID: <68cdeef6-471a-3e4e-2f91-ffb8ddec7993 at kli.org>

On 6/5/20 6:48 PM, abrahamgross--- via Unicode wrote:
> Yes, thank you!
>
> I vote for a separate codepoint for the CIRCLED EQUALS SIGN, since it has a different meaning and would also visually be displayed differently.
>
> 2020/06/05 午後6:27:43 Asmus Freytag via Unicode:
>> Overloading this mathematical symbol with anything that needs different styling is *wrong*.

We don't get to "vote" here, but I think my preference, too, would be to encode a new character, as opposed to a variant of U+229C (or doing nothing, which of course is another alternative). The "ND" license symbol just seems to be a different creature to the math operator.

~mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From abrahamgross at disroot.org Fri Jun 5 18:49:47 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 5 Jun 2020 23:49:47 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
Message-ID:

YES, THIS!

I've been thinking about writing a proposal for the double-storey "a" so that I can send an unambiguous IPA transcription - even to ppl with devices that have the U+0061 "a" as a single-storey "a" - but I don't want to spend a ton of time on something that'll get rejected...

2020/06/05 午後7:31:32 Mark E. Shoulson via Unicode:
> Though to be honest, if IPA has to have ɑ because it uses two-storey a and one-storey ɑ contrastively, then by rights there ought to be a character (or variation sequence or something) like LATIN SMALL LETTER TWO STOREY A, since after all, some fonts don't draw U+0061 the way that IPA stipulates is needed for the open front vowel.

From Shawn.Steele at microsoft.com Fri Jun 5 19:01:07 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sat, 6 Jun 2020 00:01:07 +0000
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

I guess I'm a little late ...

From: Markus Scherer
Sent: Friday, June 5, 2020 4:18 PM
To: Shawn Steele
Cc: Mark E. Shoulson; Unicode Mail List
Subject: Re: Alternate presentation for U+229C CIRCLED EQUALS?

> On Fri, Jun 5, 2020 at 3:00 PM Shawn Steele wrote:
>> I don't really like the proposal at all.
> The proposal is from 2017/2018. These characters were added in Unicode 13.
> markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From abrahamgross at disroot.org Fri Jun 5 19:15:26 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sat, 6 Jun 2020 00:15:26 +0000 (UTC)
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To:
References:
Message-ID:

You can still make a proposal to add it to U+1F10D

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jk at koremail.com Fri Jun 5 20:32:21 2020
From: jk at koremail.com (jk at koremail.com)
Date: Sat, 06 Jun 2020 09:32:21 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
Message-ID: <731899901a3dee4a5dbfef61f14779b1 at koremail.com>

No, that some fonts display a character in a certain way would not be sufficient justification for a new character, but rather justification for not using those fonts in documents that contain IPA. Such a proposal would most certainly be rejected.

On 2020-06-06 07:49, abrahamgross--- via Unicode wrote:
> YES, THIS!
>
> I've been thinking about writing a proposal for the double-storey "a" so that I can send an unambiguous IPA transcription - even to ppl with devices that have the U+0061 "a" as a single-storey "a" - but I don't want to spend a ton of time on something that'll get rejected...
>
> 2020/06/05 午後7:31:32 Mark E. Shoulson via Unicode:
>> Though to be honest, if IPA has to have ɑ because it uses two-storey a and one-storey ɑ contrastively, then by rights there ought to be a character (or variation sequence or something) like LATIN SMALL LETTER TWO STOREY A, since after all, some fonts don't draw U+0061 the way that IPA stipulates is needed for the open front vowel.

From markus.icu at gmail.com Fri Jun 5 22:25:59 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Jun 2020 20:25:59 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
Message-ID:

On Fri, Jun 5, 2020 at 5:36 PM Tom Honermann via Unicode <unicode at unicode.org> wrote:
> On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>> Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
> The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project.

I would not use a BOM/signature on source code files. It will confuse or break various tools. I would take any non-ASCII/non-UTF-8 source code file and convert it to UTF-8, and be done with it.

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From public at khwilliamson.com Fri Jun 5 22:28:52 2020
From: public at khwilliamson.com (Karl Williamson)
Date: Fri, 5 Jun 2020 21:28:52 -0600
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c at honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95 at honermann.net>
Message-ID: <499dd515-f61d-8114-aae7-52da51d92e58 at khwilliamson.com>

On 6/5/20 4:33 PM, Shawn Steele via Unicode wrote:
> I've been recommending that people assume documents are UTF-8.
> If the UTF-8 decoding fails, then consider falling back to some other codepage. Pretty much all the other code pages would contain text that would look like unexpected trail bytes, or lead bytes without trail bytes, etc. One can anecdotally find single-word Latin examples that break the pattern (Nestlé, IIRC), but if you want to think of accuracy in terms of "9s", then that pretty much has as many nines as you have bytes of input data.

I have code that attempts to distinguish between UTF-8 and CP1252 inputs. It now does a pretty good job; no one has complained in several years. To do this, I resort to some "semantic" analysis of the input. If it is syntactically valid UTF-8, but not a script run, it's not UTF-8. Likewise, the texts it will be subjected to are going to be in modern commercially-valuable scripts, so not IPA, for example. And it will be important characters, ones whose Age property is 1.1; the text won't contain C1 controls. CP1252 is harder than plain ASCII/Latin-1/C1 because many of the C1 controls are co-opted for graphic characters. Someone sent me the following example, scraped from some dictionaries, that it successfully gets right:

Muvrar\xE1\x9A\x9Aa is a mountain in Norway

This is legal 1252, and syntactically legal UTF-8, but the "semantic" tests say it isn't UTF-8.

I also have code that tries to distinguish between a UTF-8 POSIX locale and a non-UTF-8 one, and which needs to work on systems without certain C library functions that would make it foolproof. That is less successful, primarily because of insufficient text available to make a determination. One might think that the operating system error messages would be fruitful, but it turns out that many are in English; no one bothered to translate them. The locale's currency symbol is always translated, though the dollar sign is commonly used in other languages as part of the symbol. The time and date names are usually translated, and I use them.
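Karl's "script run" test can be approximated in a few lines. This sketch is ours, not his code, and it stands in for the real Unicode Script property by using the first word of each character's name:

    # Decode as UTF-8, then reject words that mix scripts; the Ogham letter
    # that E1 9A 9A decodes to should not appear inside a Latin word.
    import unicodedata

    def script_of(ch: str) -> str:
        if ch.isascii():
            return "LATIN"                          # crude, but fine here
        try:
            return unicodedata.name(ch).split()[0]  # e.g. 'OGHAM', 'HEBREW'
        except ValueError:
            return "UNASSIGNED"                     # suspicious by itself

    def looks_like_utf8(data: bytes) -> bool:
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return all(
            len({script_of(c) for c in word if c.isalpha()}) <= 1
            for word in text.split()
        )

    sample = b"Muvrar\xE1\x9A\x9Aa is a mountain in Norway"
    print(looks_like_utf8(sample))  # False: Latin letters mixed with Ogham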
From abrahamgross at disroot.org Fri Jun 5 22:50:21 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sat, 6 Jun 2020 03:50:21 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <731899901a3dee4a5dbfef61f14779b1 at koremail.com>
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org> <731899901a3dee4a5dbfef61f14779b1 at koremail.com>
Message-ID: <712ae449-475d-425d-bfc3-934f45535b4d at disroot.org>

Even though it has completely different meanings?

2020/06/05 午後9:32:57 John Knightley via Unicode:
> No, that some fonts display a character in a certain way would not be sufficient justification for a new character, but rather justification for not using those fonts in documents that contain IPA. Such a proposal would most certainly be rejected.
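For reference, the two vowels at issue are already separate code points; only the glyph chosen for U+0061 varies by font. A tiny check:

    # The IPA contrast under discussion, shown by character name:
    import unicodedata
    for ch in "a\u0251":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+0061 LATIN SMALL LETTER A
    # U+0251 LATIN SMALL LETTER ALPHA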
From jr at qsm.co.il  Fri Jun  5 22:53:29 2020
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Sat, 6 Jun 2020 03:53:29 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <499dd515-f61d-8114-aae7-52da51d92e58@khwilliamson.com>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <499dd515-f61d-8114-aae7-52da51d92e58@khwilliamson.com>
Message-ID:

I am curious about how your code would work with CP1255 or CP1256?

Best Regards,

Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode
Sent: Saturday, June 6, 2020 6:29 AM
To: Shawn Steele; Tom Honermann
Cc: Alisdair Meredith; Unicode Mail List
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

I have code that attempts to distinguish between UTF-8 and CP1252 inputs. It now does a pretty good job; no one has complained in several years. To do this, I resort to some "semantic" analysis of the input. If it is syntactically valid UTF-8, but not a script run, it's not UTF-8.

From prosfilaes at gmail.com  Fri Jun  5 22:59:27 2020
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 5 Jun 2020 20:59:27 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To:
References: <7fd1336c-75d4-df85-7678-bb74405d18b3 at ix.netcom.com> <555af216-fdc3-4dc8-9f93-ec2bb82dfb54 at disroot.org> <2f95e5fd-109a-4644-c4fb-97891fb1e685 at kli.org> <53fc6ed2-0bc5-4c2e-829f-83934a20edea at disroot.org> <44ca9112-a0d6-b793-c8aa-efcfb1b82d5d at kli.org> <20200604093151.227437a0 at spixxi> <0ea68b66-9d9a-e1ef-cb60-b89747c04ee4 at kli.org> <20200605002205.696251ba at JRWUBU2> <7d4f417d-ea94-1497-6386-a5a1d5f4b5a6 at kli.org>
Message-ID:

On Fri, Jun 5, 2020 at 7:21 PM abrahamgross--- via Unicode wrote:
>
> YES, THIS!
>
> I've been thinking about writing a proposal for the double story "a" so that I can send an unambiguous IPA transcription - even to ppl with devices that have the U+0061 "a" as a single storey "a" - but I don't want to spend a ton of time on something that'll get rejected…

I understand the argument, but it's been over a quarter century since IPA was encoded with U+0061 standing for the IPA a, and it seems long past changing.

--
The standard is written in English. If you have trouble understanding a particular section, read it again and again and again... Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991)

From eliz at gnu.org  Sat Jun  6 01:39:44 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 09:39:44 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Shawn Steele via Unicode on Fri, 5 Jun 2020 22:33:23 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net>
Message-ID: <83h7voaea7.fsf@gnu.org>

> CC: Alisdair Meredith , Unicode Mail List
> Date: Fri, 5 Jun 2020 22:33:23 +0000
> From: Shawn Steele via Unicode
>
> I've been recommending that people assume documents are UTF-8.  If the
> UTF-8 decoding fails, then consider falling back to some other codepage.

That strategy would fail with 7-bit ISO 2022 based encodings, no? They look like plain 7-bit ASCII (which will not fail UTF-8), but actually represent non-ASCII text.

From Shawn.Steele at microsoft.com  Sat Jun  6 01:58:55 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sat, 6 Jun 2020 06:58:55 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83h7voaea7.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org>
Message-ID:

I mentioned that later.... But there is a lot of content for interchange that is single/double byte (8 bit) rather than requiring escape sequences. The 2022 encodings seem rarer, though it may depend on your data source.

-----Original Message-----
From: Eli Zaretskii
Sent: Friday, June 5, 2020 11:40 PM
To: Shawn Steele
Cc: tom at honermann.net; alisdairm at me.com; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

That strategy would fail with 7-bit ISO 2022 based encodings, no?
They look like plain 7-bit ASCII (which will not fail UTF-8), but actually represent non-ASCII text.

From eliz at gnu.org  Sat Jun  6 02:12:50 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 10:12:50 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Shawn Steele via Unicode on Sat, 6 Jun 2020 06:58:55 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org>
Message-ID: <83a71gacr1.fsf@gnu.org>

> CC: "tom at honermann.net" , "alisdairm at me.com" , "unicode at unicode.org"
> Date: Sat, 6 Jun 2020 06:58:55 +0000
> From: Shawn Steele via Unicode
>
> I mentioned that later.... But there is a lot of content for interchange that is single/double byte (8 bit) rather than requiring escape sequences. The 2022 encodings seem rarer, though it may depend on your data source.

I agree that ISO 2022 is rare these days, but rarity doesn't help when you need to be accurate in decoding, because mistaking one encoding for another produces horribly incorrect results, and users complain vociferously when that happens.

From junicode at jcbradfield.org  Sat Jun  6 02:45:10 2020
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Sat, 6 Jun 2020 08:45:10 +0100 (BST)
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID:

Just to digress a little, I get quite a lot of mail which has BOM/ZWNBSP scattered through it, sometimes at the beginning of the mail, sometimes at the beginning of the quoted mail to which it is a reply. Occasionally at the start of every line. Mostly it emanates from known useless webmail providers such as Yahoo, but some slightly more reputable providers do it as well. (I don't easily have a list, as I now filter it out before it hits my mailbox.)

Does anybody have an idea why they do this? Some accident of legacy coding?

From eliz at gnu.org  Sat Jun  6 05:57:57 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 13:57:57 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Harriet Riddle on Sat, 6 Jun 2020 10:05:49 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org>
Message-ID: <83sgf88nre.fsf@gnu.org>

> From: Harriet Riddle
> Date: Sat, 6 Jun 2020 10:05:49 +0000
>
> In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and the Gx-set designating ESC sequences make no sense in UTF-8.

What do you mean by "make no sense"? A general-purpose editor is presented with a byte stream and needs to decide how to interpret and display it. It usually has no meta-data about the byte stream to help it decide what does and doesn't make sense. It doesn't even know whether the byte stream is human-readable text or just raw binary bytes.

I understand that, given enough of the byte stream, one can analyze it and see whether interpreting it as one encoding or another will make more sense.
But these decisions are sometimes required after only a small portion of the material has arrived (a case in point: a process or a network connection that outputs text in relatively small chunks).

In any case, I was responding to a proposal to treat any text as UTF-8 "unless proven otherwise". My point is that with ISO 2022 encoding, and perhaps also others, such a proof is not really at hand.

> If that isn't feasible, then, more moderate measures might include trying 7-bit ISO 2022 and if it runs into a set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.

Treating ESC sequences as telltale signs of ISO 2022 is not foolproof, either. For example, you may be looking at UTF-8 text interspersed with terminal control sequences, like SGR or somesuch.

Bottom line: the real world out there is not as clean as we might think, and those rare corner cases keep breaking any simple-minded decision rules such as "assume UTF-8 by default".

From jr at qsm.co.il  Sat Jun  6 06:17:48 2020
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Sat, 6 Jun 2020 11:17:48 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83sgf88nre.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID:

Frequency analysis of bigrams and trigrams, provided the text is not too short, can reveal the encoding and even the language. But this is not normally the province of text editors and word processing software.

Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Eli Zaretskii via Unicode
Sent: Saturday, June 6, 2020 1:58 PM
To: Harriet Riddle
Cc: Shawn.Steele at microsoft.com; tom at honermann.net; alisdairm at me.com; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

What do you mean by "make no sense"? A general-purpose editor is presented with a byte stream and needs to decide how to interpret and display it.
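As a toy illustration of the bigram idea, in Python; the reference set below is a made-up stand-in, where a real detector would use frequency tables trained on a corpus for each candidate language:

    def bigram_score(text: str, common_bigrams) -> float:
        """Fraction of the text's bigrams found in a set of frequent
        bigrams for some language; higher means a more plausible decoding."""
        bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
        if not bigrams:
            return 0.0
        return sum(1 for b in bigrams if b in common_bigrams) / len(bigrams)

    def pick_encoding(data: bytes, candidates, common_bigrams) -> str:
        """Decode under each candidate encoding; keep the decoding that
        looks most like natural language (failed decodings score -1)."""
        scores = {}
        for enc in candidates:
            try:
                scores[enc] = bigram_score(data.decode(enc), common_bigrams)
            except UnicodeDecodeError:
                scores[enc] = -1.0
        return max(scores, key=scores.get)

    # Made-up sample of frequent English bigrams, purely for the demo.
    COMMON_EN = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}
    # \xE9 followed by a space is not valid UTF-8, so CP1252 wins here.
    print(pick_encoding(b"caf\xe9 in the garden", ["utf-8", "cp1252"], COMMON_EN))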
From eliz at gnu.org  Sat Jun  6 07:53:30 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 15:53:30 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Harriet Riddle on Sat, 6 Jun 2020 12:20:40 +0000)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID: <83mu5g8iet.fsf@gnu.org>

> From: Harriet Riddle
> Date: Sat, 6 Jun 2020 12:20:40 +0000
>
> So it is true that detecting ESC on its own will not identify 7-bit ISO 2022, but the specific sequence ESC $ B (ESC 0x24 0x42) has only one ANSI/ISO compliant meaning, which is to switch the G0 set to JIS X 0208. In UTF-8, there is no such thing as a G0 set (due to it not being fully ISO 2022 based), so it is meaningless.

If you are saying that "ESC $ B" or similar sequences can be considered as evidence that the text is not in UTF-8, then I might concur. Whether that's the "proof" that should reject UTF-8, I'm not sure.

From sosipiuk at gmail.com  Sat Jun  6 08:56:23 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Sat, 6 Jun 2020 09:56:23 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83sgf88nre.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID:

On Sat, Jun 6, 2020 at 7:04 AM Eli Zaretskii via Unicode wrote:
>
> What do you mean by "make no sense"? A general-purpose editor is presented with a byte stream and needs to decide how to interpret and display it. It usually has no meta-data about the byte stream to help it decide what does and doesn't make sense. It doesn't even know whether the byte stream is human-readable text or just raw binary bytes.

Escape sequences may be present in UTF-8, but SI and SO cannot be, nor can most designation sequences (a special subset of escape sequences), not only because they make no sense, but because ISO 10646 explicitly forbids them:

"Code extension control functions for the ISO/IEC 2022 code extension techniques (such as designation escape sequences, single shift, and locking shift) shall not be used with this coded character set."

The presence of these in a UTF-8 stream indicates an error of some kind. It's not completely impossible for them to appear in something that is otherwise valid UTF-8, but they should be treated, in my opinion, the same as overlong sequences or surrogates; i.e. the UTF-8 math works, but the code point isn't valid.
This can occur due to faulty conversion from another encoding, giving something that is close to UTF-8 but not quite right. This brings up the question of how error-tolerant Karl's algorithm is. 7-bit ISO 2022 encodings would clearly show such errors.

Also: I did not receive the email from Harriet Riddle that Eli is replying to. Is there a problem with the mailing list? I may be missing other messages.

Sławomir Osipiuk

From eliz at gnu.org  Sat Jun  6 09:27:33 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 17:27:33 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: (message from Sławomir Osipiuk on Sat, 6 Jun 2020 09:56:23 -0400)
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID: <83h7vo8e22.fsf@gnu.org>

> From: Sławomir Osipiuk
> Date: Sat, 6 Jun 2020 09:56:23 -0400
>
> Escape sequences may be present in UTF-8, but SI and SO cannot be, nor
> can most designation sequences (a special subset of escape sequences),
> not only because they make no sense, but because ISO 10646 explicitly
> forbids them:
>
> "Code extension control functions for the ISO/IEC 2022 code extension
> techniques (such as designation escape sequences, single shift, and
> locking shift) shall not be used with this coded character set."

Alas, the stuff one bumps into out there doesn't always follow written standards, let alone recent enough standards.

> The presence of these in a UTF-8 stream indicates an error of some
> kind. It's not completely impossible for them to appear in something
> that is otherwise valid UTF-8, but they should be treated, in my
> opinion, the same as overlong sequences or surrogates; i.e. the UTF-8
> math works, but the code point isn't valid.

What to do when these irregularities are found is a separate (though very important) issue. The issue discussed here is whether assuming UTF-8 "until proven otherwise" is sufficient in practice. I don't think it is, and I provided a few examples why.
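To make the SI/SO and designation-sequence point concrete, a small Python sketch; the designator list is deliberately incomplete, and the fallback codec name is just one possibility chosen for the example:

    SI, SO = 0x0F, 0x0E
    # A few ISO/IEC 2022 designation sequences, e.g. ESC $ B (G0 := JIS X 0208).
    ISO2022_DESIGNATORS = (b"\x1b$B", b"\x1b$@", b"\x1b(J", b"\x1b-F")

    def looks_like_iso2022(data: bytes) -> bool:
        """True if the stream uses SI/SO or a designating escape sequence,
        none of which may appear in conforming UTF-8 per ISO 10646."""
        if SI in data or SO in data:
            return True
        return any(seq in data for seq in ISO2022_DESIGNATORS)

    def decode_guess(data: bytes) -> str:
        if looks_like_iso2022(data):
            return data.decode("iso2022_jp_2")   # one possible fallback
        return data.decode("utf-8")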
From harjitmoe at outlook.com  Sat Jun  6 05:05:49 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sat, 6 Jun 2020 10:05:49 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83a71gacr1.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org>
Message-ID:

In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and the Gx-set designating ESC sequences make no sense in UTF-8. So, handling the left-hand side (those with the high bit unset) as (say) ISO-2022-JP-2 and the right-hand side (with the high bit set) as UTF-8 could work, with no ambiguity in practice.

I do not recommend this for general use, since allowing this sort of mixed encoding at the receiving end can allow data to bypass upstream XSS sanitisers et cetera, but you presumably know how relevant this concern is to your work. It also probably doesn't make sense to write a decoder from scratch for this, unless you were doing that anyway.

If that isn't feasible, then, more moderate measures might include trying 7-bit ISO 2022 and if it runs into a set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.

________________________________
From: Unicode on behalf of Eli Zaretskii via Unicode
Sent: 06 June 2020 09:12
To: Shawn Steele
Cc: tom at honermann.net ; alisdairm at me.com ; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

I agree that ISO 2022 is rare these days, but rarity doesn't help when you need to be accurate in decoding, because mistaking one encoding for another produces horribly incorrect results, and users complain vociferously when that happens.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From harjitmoe at outlook.com  Sat Jun  6 07:20:40 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sat, 6 Jun 2020 12:20:40 +0000
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83sgf88nre.fsf@gnu.org>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <83h7voaea7.fsf@gnu.org> <83a71gacr1.fsf@gnu.org> <83sgf88nre.fsf@gnu.org>
Message-ID:

Point taken about it not necessarily being human readable text. I was mainly considering the case of distinguishing between a collection of files, the older ones being in ISO-2022-JP and the newer ones in UTF-8.

In response to the comment about SGR sequences: ISO/IEC 2022 (ECMA-35, JIS X 0202), specifically section 13 (referencing the ECMA version), ultimately defines the format of all ANSI/ISO compliant escape sequences, whether in an actual ISO/IEC 2022 encoding (including both 7-bit code versions, and also 8-bit code versions such as ISO-8859-1) or in ISO 10646 / Unicode. The main difference is that ISO/IEC 10646 adds the requirement that they be padded to the code unit width, which is only relevant in the context of UTF-16 or UTF-32.

However, "type Fe" escape sequences, i.e. ESC 0x40 (ESC @) through ESC 0x5F (ESC _) with no intervening bytes, are delegated to the C1 control code set in use, normally ISO/IEC 6429 (ECMA-48, JIS X 0211). The escape sequence ESC 0x5B (ESC [), which is the CSI control in turn used at the start of SGR, CUP etc. sequences, is one of these. The sequence ESC $ B (ESC 0x24 0x42), on the other hand, is a "type 4F" escape sequence, with a function defined by ISO/IEC 2022 itself.

And yes, some of the code-switching sequences are supported by e.g. xterm, but this is mainly for their ISO 2022 code-switching purposes, e.g. using ESC - F to switch from ISO-8859-1 to ISO-8859-7, or ESC % G to switch from an ISO 2022 code version (such as ISO 8859) to UTF-8.

So it is true that detecting ESC on its own will not identify 7-bit ISO 2022, but the specific sequence ESC $ B (ESC 0x24 0x42) has only one ANSI/ISO compliant meaning, which is to switch the G0 set to JIS X 0208. In UTF-8, there is no such thing as a G0 set (due to it not being fully ISO 2022 based), so it is meaningless.
If you're dealing with non-ISO-compliant escape sequences used by some terminal, then fair enough.

________________________________
From: Eli Zaretskii
Sent: 06 June 2020 12:57
To: Harriet Riddle
Cc: Shawn.Steele at microsoft.com ; tom at honermann.net ; alisdairm at me.com ; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Bottom line: the real world out there is not as clean as we might think, and those rare corner cases keep breaking any simple-minded decision rules such as "assume UTF-8 by default".
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From pgcon6 at msn.com  Sat Jun  6 10:06:23 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Sat, 6 Jun 2020 15:06:23 +0000
Subject: reminder about this list
Message-ID:

I'd just like to remind people (or point out): the Unicode Technical Committee does not monitor or act on anything discussed in this list. It's here for discussion: to seek answers to questions, bounce ideas off others...

If you bring up an idea that gets some support and think it would be worth UTC considering, there are two channels to provide input that UTC will consider:

Submit comments via the contact form: https://corp.unicode.org/reporting.html

Submit a document with a specific proposal and rationale: https://www.unicode.org/pending/docsubmit.html

For general info about UTC see https://www.unicode.org/consortium/utc.html

Cheers!
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org  Sat Jun  6 10:19:48 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Jun 2020 09:19:48 -0600
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Message-ID: <000d01d63c15$eb40ca80$c1c25f80$@ewellic.org>

Shawn Steele wrote:

> I've been recommending that people assume documents are UTF-8.  If
> the UTF-8 decoding fails, then consider falling back to some other
> codepage.  Pretty much all the other code pages would contain text
> that would look like unexpected trail bytes, or lead bytes without
> trail bytes, etc.  One can anecdotally find single-word Latin examples
> that break the pattern (Nestlé® IIRC),

That's traditionally been my example. You have to spell it in all caps (NESTLÉ®), which Nestlé seldom does, in order to get an ISO 8859-1 sequence that can be mistaken for UTF-8:

4E 45 53 54 4C C9 AE

where the last two bytes could be UTF-8 for ɮ, U+026E LATIN SMALL LETTER LEZH. If the É is lowercase, you get:

4E 45 53 54 4C E9 AE

which is not valid UTF-8 (only one trail byte), and the heuristic that UTF-8 can be reliably auto-detected is reinforced.

--
Doug Ewell | Thornton, CO, US | ewellic.org
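The byte sequences above are easy to verify; a quick check in Python:

    caps  = b"NESTL\xC9\xAE"   # ISO 8859-1 bytes for "NESTLÉ®"
    lower = b"Nestl\xE9\xAE"   # ISO 8859-1 bytes for "Nestlé®"

    print(caps.decode("utf-8"))      # 'NESTLɮ': C9 AE is a valid two-byte
                                     # sequence for U+026E, so UTF-8 "succeeds"
    try:
        lower.decode("utf-8")
    except UnicodeDecodeError:
        print("not UTF-8")           # E9 needs two trail bytes; only one follows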
From doug at ewellic.org  Sat Jun  6 10:43:34 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Jun 2020 09:43:34 -0600
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Message-ID: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>

Eli Zaretskii wrote:

>>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
>>> They look like plain 7-bit ASCII (which will not fail UTF-8), but
>>> actually represent non-ASCII text.
>>
>> I mentioned that later.... But there is a lot of content for
>> interchange that is single/double byte (8 bit) rather than requiring
>> escape sequences. The 2022 encodings seem rarer, though it may
>> depend on your data source.
>
> I agree that ISO 2022 is rare these days, but rarity doesn't help when
> you need to be accurate in decoding, because mistaking one encoding
> for another produces horribly incorrect results, and users complain
> vociferously when that happens.

If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.

Long ago I wrote some code that detected Russian text in any of six popular (at the time) Cyrillic encodings, and it seldom got it wrong, but I have no idea how it would do for other, especially non-Slavic, languages written in Cyrillic. I bet it would fail spectacularly for Mongolian, for example.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From eliz at gnu.org  Sat Jun  6 10:47:00 2020
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 06 Jun 2020 18:47:00 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org> (message from Doug Ewell via Unicode on Sat, 6 Jun 2020 09:43:34 -0600)
References: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>
Message-ID: <83ftb88adn.fsf@gnu.org>

> Date: Sat, 6 Jun 2020 09:43:34 -0600
> From: Doug Ewell via Unicode
>
> If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.

Yes, that's my experience as well.

From doug at ewellic.org  Sat Jun  6 10:58:31 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Jun 2020 09:58:31 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Message-ID: <000f01d63c1b$53a627f0$faf277d0$@ewellic.org>

abrahamgross at disroot.org wrote:

> I've been thinking about writing a proposal for the double story "a"
> so that I can send an unambiguous IPA transcription - even to ppl with
> devices that have the U+0061 "a" as a single storey "a" - but I don't
> want to spend a ton of time on something that'll get rejected…

IMHO the major beneficiary of such a character would be the shadowy authors of those annoying "I know your password, now send me bitcoin or I'll send incriminating videos to all your contacts" messages, who love to sprinkle in lookalike characters (like, well, ?) for whatever reason.

In this whole thread, I have yet to see why Alphabetic Presentation Forms would be considered a good place to encode a variant form of a Hebrew letter. If that's no longer being considered, or never was, perhaps a change in Subject line would help readers.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From asmusf at ix.netcom.com  Sat Jun  6 17:01:13 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 6 Jun 2020 15:01:13 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <83ftb88adn.fsf@gnu.org>
References: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org> <83ftb88adn.fsf@gnu.org>
Message-ID: <3d396692-4400-86ce-eb15-d5088800b81d@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From duerst at it.aoyama.ac.jp  Sat Jun  6 19:40:08 2020
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 7 Jun 2020 09:40:08 +0900
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>
References: <000e01d63c19$3d070ed0$b7152c70$@ewellic.org>
Message-ID: <414518f9-adc3-af6f-1c68-b0c0df80a54d@it.aoyama.ac.jp>

On 07/06/2020 00:43, Doug Ewell via Unicode wrote:
> Eli Zaretskii wrote:
>
>>>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
>>>> They look like plain 7-bit ASCII (which will not fail UTF-8), but
>>>> actually represent non-ASCII text.

Well, yes, but if you exploit the fact that 7-bit ISO 2022 encodings contain ESC characters with specific character sequences thereafter, whereas UTF-8 text doesn't, that case should be easy to handle, too.

> If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.
>
> Long ago I wrote some code that detected Russian text in any of six popular (at the time) Cyrillic encodings, and it seldom got it wrong, but I have no idea how it would do for other, especially non-Slavic, languages written in Cyrillic.
> I bet it would fail spectacularly for Mongolian, for example.

I agree. What's difficult is distinguishing the various non-UTF-8 encodings among themselves. Compared to that, identifying something as UTF-8 is much easier. It's not 100% foolproof, in particular not for very short pieces of non-ASCII text (just a word or so), but it gets better very, very fast the more non-ASCII text you have.

Regards,   Martin.

From duerst at it.aoyama.ac.jp  Sat Jun  6 19:48:37 2020
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 7 Jun 2020 09:48:37 +0900
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID:

On 06/06/2020 08:04, Markus Scherer via Unicode wrote:
> The BOM -- or for UTF-8 where "byte order" is meaningless, the Unicode
> signature byte sequence -- was popular when Unicode was gaining ground but
> legacy charsets were still widely used.
> For the most part, all new and recent text is now UTF-8, and the signature
> byte sequence has fallen out of favor again even where it had been used.

I'm really glad to hear this, and I very much hope it is true. But I know of a case where the BOM on UTF-8 is necessary. It's to get Excel to recognize a CSV file as UTF-8.

Regards,   Martin.

> Having said that, I think the statement is right: "neither required nor
> recommended for UTF-8"
>
> We might want to review chapter 23 and the FAQ and see if they should be
> updated.
>
> Thanks,
> markus
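For what it's worth, the Excel case Martin mentions is easy to reproduce from code; in Python, the "utf-8-sig" codec writes exactly this signature (the file name and fields below are made up for the example):

    import csv

    # "utf-8-sig" prepends the U+FEFF signature that tells Excel the
    # CSV file is UTF-8 rather than some legacy code page.
    with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "city"])
        writer.writerow(["Dürst", "Zürich"])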
From abrahamgross at disroot.org  Sun Jun  7 00:53:23 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sun, 7 Jun 2020 05:53:23 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <000f01d63c1b$53a627f0$faf277d0$@ewellic.org>
References: <000f01d63c1b$53a627f0$faf277d0$@ewellic.org>
Message-ID: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org>

I just gave the Alphabetic Presentation Forms as a suggestion of where it can possibly be encoded. Everyone here disagreed, so the regular Hebrew block it is.

2020/06/06 ??11:59:07 Doug Ewell via Unicode :

> In this whole thread, I have yet to see why Alphabetic Presentation Forms would be considered a good place to encode a variant form of a Hebrew letter. If that's no longer being considered, or never was, perhaps a change in Subject line would help readers.

From pandey at umich.edu  Sun Jun  7 00:58:42 2020
From: pandey at umich.edu (Anshuman Pandey)
Date: Sat, 6 Jun 2020 23:58:42 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org>
Message-ID: <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu>

Hi Abraham,

If you're seriously thinking of submitting a proposal for a new Hebrew character, please consider getting in touch with Deborah Anderson, Michael Everson, or me. We'd be happy to help you figure out the suitability of encoding the character in question or figuring out ways to represent it in plain text, if need be.

All my best,
Anshu

> On Jun 6, 2020, at 11:53 PM, abrahamgross--- via Unicode wrote:
>
> I just gave the Alphabetic Presentation Forms as a suggestion of where it can possibly be encoded. Everyone here disagreed, so the regular Hebrew block it is.

From asmusf at ix.netcom.com  Sun Jun  7 01:19:32 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 6 Jun 2020 23:19:32 -0700
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID:

An HTML attachment was scrubbed...
URL:

From tom at honermann.net  Sun Jun  7 02:47:12 2020
From: tom at honermann.net (Tom Honermann)
Date: Sun, 7 Jun 2020 03:47:12 -0400
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To:
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net>
Message-ID: <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net>

Thank you to everyone that responded to this thread. The responses have indicated that I need to be more clear about my motivation for asking. More details below.

On 6/5/20 7:04 PM, Markus Scherer via Unicode wrote:
> The BOM -- or for UTF-8 where "byte order" is meaningless, the Unicode
> signature byte sequence -- was popular when Unicode was gaining ground
> but legacy charsets were still widely used.
> For the most part, all new and recent text is now UTF-8, and the
> signature byte sequence has fallen out of favor again even where it
> had been used.

Thank you, this is helpful historical perspective.

> Having said that, I think the statement is right: "neither required
> nor recommended for UTF-8"

I think different audiences could interpret that guidance in different ways.

As a software tool provider, I can interpret the guidance as meaning that I should not require a BOM to be present on text that is consumed, nor produce a BOM in text that is produced. But what is the recommendation for honoring a BOM that is present in consumed text? Pragmatically, it seems to me that tools should honor the presence of a BOM by either treating the data following it as UTF-8 encoded or issuing a diagnostic if the BOM presents a conflict with other indications of expected encoding.

As a protocol developer, I can interpret the guidance as meaning that a new protocol should either mandate a particular encoding or use some mechanism other than a BOM to negotiate encoding.

As a text author, I can interpret the guidance as meaning that I should not place a BOM in text that I author without strong motivation, nor should I expect a tool to require one.

Back to my motivation for asking the question... I'm researching support for UTF-8 encoded source files in various C++ compilers. Here is a brief snapshot of existing practice:

* Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is recognized and discarded.
* GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option. If GCC is expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM is *not* honored and its presence is likely to result in a compilation error. GCC has no support for compiling a translation unit consisting of differently encoded source files.

* Microsoft Visual C++, by default, interprets source files as encoded according to the Windows' Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM. The default encoding can be overridden with a command line option.

* IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent). z/OS is an EBCDIC based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files. Many EBCDIC code pages exist and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.

The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding. The following are acknowledged (at least by me) as accepted constraints:

* Existing compilers are not going to change their default mode of operation due to backward compatibility constraints.

* Non-UTF-8 encoded source files are still in use, particularly by commercial software providers.

* Converting source files to UTF-8 is not necessarily an easy task. It isn't necessarily a simple matter of running the source files through 'iconv' and committing the results.

* Transition to UTF-8 for source files will be aided by the possibility of incremental adoption; e.g., use of UTF-8 encoded header files by a project that has non-UTF-8 encoded source files.

Various methods are being explored for how to support collections of mixed encoding source files. The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.

> We might want to review chapter 23 and the FAQ and see if they should
> be updated.

I think that would be useful. In particular, per other comments above, if the standard or FAQ is to continue offering statements regarding recommendations or guidance, it may be helpful to tailor the guidance for different audiences. For example, "Software providers are encouraged to honor the presence of a BOM signifying that a text is UTF-8 encoded in text that is consumed, and are discouraged from inserting a BOM in text that is produced. Text authors are discouraged from inserting a BOM in their UTF-8 encoded documents [unless it is known to be needed; because UTF-8 should be considered a default, because some tools won't honor it, etc...]".

Tom.

> Thanks,
> markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
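A minimal Python sketch of the policy suggested above, honor a BOM on input and never emit one on output; illustrative only, with hypothetical function names:

    import codecs

    def read_source(path: str) -> str:
        """Read a source file, honoring an optional UTF-8 signature."""
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]   # consume the signature
        return data.decode("utf-8")              # (or fall back, or diagnose)

    def write_source(path: str, text: str) -> None:
        """Write UTF-8 without a signature, per "neither required nor
        recommended"; note encoding="utf-8", not "utf-8-sig"."""
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)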
From richard.wordingham at ntlworld.com  Sun Jun  7 06:46:27 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 7 Jun 2020 12:46:27 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu>
Message-ID: <20200607124627.12db8f87@JRWUBU2>

On Sat, 6 Jun 2020 23:58:42 -0600
Anshuman Pandey via Unicode wrote:

> Hi Abraham,
>
> If you're seriously thinking of submitting a proposal for a new
> Hebrew character, please consider getting in touch with Deborah
> Anderson, Michael Everson, or me. We'd be happy to help you figure
> out the suitability of encoding the character in question or figuring
> out ways to represent it in plain text, if need be.

It doesn't belong in plain text. It only becomes useful once line breaks and character spacing are known.

Richard.

From everson at evertype.com  Sun Jun  7 08:45:00 2020
From: everson at evertype.com (Michael Everson)
Date: Sun, 7 Jun 2020 14:45:00 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200607124627.12db8f87@JRWUBU2>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2>
Message-ID: <9C2A4C94-BAA2-4A33-A34C-57F18049079E@evertype.com>

I've often helped encode Hebrew characters. :-)

M

> On 7 Jun 2020, at 12:46, Richard Wordingham via Unicode wrote:
>
> It doesn't belong in plain text. It only becomes useful once line
> breaks and character spacing are known.
>
> Richard.

From mark at kli.org  Sun Jun  7 09:27:17 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Sun, 7 Jun 2020 10:27:17 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200607124627.12db8f87@JRWUBU2>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2>
Message-ID:
A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel.? In spelling it out, you could call one a holam mal?, but not the other.? A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it.? What you're talking about is a LAMED and a LAMED.? They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?) ~mark From abrahamgross at disroot.org Sun Jun 7 13:45:05 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Sun, 07 Jun 2020 18:45:05 +0000 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> Message-ID: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> If this is the case, then why do the CJK blocks have tons of alternatives for the same character? (not counting the compatibility ideographs that were just added for compatibility with other encodings) If you look at old dictionaries, these alternatives get listed as alternatives of the same character you might see some fonts use. The meaning is exactly the same. Some examples (theres tons and tons more): ?????? ??? ???? ?? ??? ??? ??? ?????? ?? ?????? 2020?6?7? 10:27, "Mark E. Shoulson via Unicode" wrote: > On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote: > >> On Sat, 6 Jun 2020 23:58:42 -0600 >> Anshuman Pandey via Unicode wrote: >> >>> Hi Abraham, >>> >>> If you?re seriously thinking of submitting a proposal for a new >>> Hebrew character, please consider getting in touch with Deborah >>> Anderson, Michael Everson, or me. We?d be happy to help you figure >>> out the suitability of encoding the character in question or figuring >>> out ways to represent it in plain text, if need be. >> >> I[t] doesn't belong in plain text. It only becomes useful once line >> breaks and character spacing are known. >> >> Richard. > > I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at > best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any > combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each > one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not > things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like > U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus > HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal > VAV followed by a vowel. In spelling it out, you could call one a holam mal?, but not the other. > A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct > character, and moreover one that cannot be deduced algorithmically by looking at the letters around > it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same > character, and Unicode doesn't encode glyphs (anymore?) 
> > ~mark From abrahamgross at disroot.org Sun Jun 7 13:50:01 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Sun, 07 Jun 2020 18:50:01 +0000 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> References: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> Message-ID: This doesnt display properly on my android device, so I hope ya'll received this intact. 2020?6?7? 14:45, "Abraham Gross via Unicode" wrote: > Some examples (theres tons and tons more): > ?????? > ??? > ???? > ?? > ??? > ??? > ??? > ?????? > ?? > ?????? From public at khwilliamson.com Sun Jun 7 14:29:50 2020 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 7 Jun 2020 13:29:50 -0600 Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? In-Reply-To: References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <0d746f3e-80f4-1ecd-e334-0c84ba68aa95@honermann.net> <499dd515-f61d-8114-aae7-52da51d92e58@khwilliamson.com> Message-ID: On 6/5/20 9:53 PM, Jonathan Rosenne via Unicode wrote: > I am curious about how your code would work with CP1255 or CP1256? > > Best Regards, > > Jonathan Rosenne Send me a few problematic strings, and I'll check them out > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode > Sent: Saturday, June 6, 2020 6:29 AM > To: Shawn Steele; Tom Honermann > Cc: Alisdair Meredith; Unicode Mail List > Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? > > On 6/5/20 4:33 PM, Shawn Steele via Unicode wrote: >> I?ve been recommending that people assume documents are UTF-8. ?If the >> UTF-8 decoding fails, then consider falling back to some other >> codepage.? ?Pretty much all the other code pages would contain text that >> would look like unexpected trail bytes, or lead bytes without trail >> bytes, etc.? One can anecdotally find single-word Latin examples that >> break the pattern (Nestl?? IIRC), but if you want to think of accuracy >> in terms of ?9s?, then that pretty much has as many nines as you have >> bytes of input data. > > I have code that attempts to distinguish between UTF-8 and CP1252 > inputs. It now does a pretty good job; no one has complained in several > years. To do this, I resort to some "semantic" analysis of the input. > If it is syntactically valid UTF-8, but not a script run, it's not > UTF-8. Likewise, the texts it will be subjected to are going to be in > modern commercially-valuable scripts, so not IPA, for example. And it > will be important characters, ones whose Age property is 1.1; text won't > contain C1 controls. CP1252 is harder than plain ASCII/Latin1/C1 > because manyh of the C1 controls are co-opted for graphic characters. > Someone sent me the following example, scraped from some dictionaries, > that it successfully gets right: > > Muvrar\xE1\x9A\x9Aa is a mountain in Norway > > is legal 1252, and syntactically legal UTF-8, but the "semantic" tests > say it isn't UTF-8. > > I also have code that tries to distinguish between a UTF-8 POSIX locale > and a non-UTF-8, and which needs to work on systems without certain C > library functions that would make it foolproof. That is less successful > primarily because of insufficient text available to make a > determination. 
> One might think that the operating system error messages would be fruitful, but it turns out that many are in English; no one bothered to translate them. The locale's currency symbol is always translated, though the dollar sign is commonly used in other languages as part of the symbol. The time and date names are usually translated, and I use them.
>
>> I did find some DBCS CJK text that could look like valid UTF-8, so my "one nine per byte of input" isn't quite as high there, however for meaningful runs of text it is still reasonably hard to make sensible text in a double byte codepage look like UTF-8. Note that this "works" partially because the ASCII range of the SBCS/DBCS code pages typically looks like ASCII, as does UTF-8. If you had 7 bit codepage data with stateful shift sequences, of course that wouldn't err in UTF-8. Fortunately for your scenario source code in 7 bit encodings is very rare nowadays.
>>
>> Hope that helps,
>>
>> -Shawn
>>
>> *From:* Tom Honermann
>> *Sent:* Freitag, 5. Juni 2020 15:15
>> *To:* Shawn Steele
>> *Cc:* Alisdair Meredith ; Unicode Mail List
>> *Subject:* Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
>>
>> On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>>
>> The modern viewpoint is that the BOM should be discouraged in all contexts. (Along with: you should always be using Unicode encodings, probably UTF-8 or UTF-16.) I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
>>
>> Are you asking because you're interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
>>
>> The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project.
>>
>> Tom.
>>
>> Anecdotally, if you can decode data without error in UTF-8, then it's probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short examples that can confuse it.
>>
>> -Shawn
>>
>> *From:* Unicode *On Behalf Of* Tom Honermann via Unicode
>> *Sent:* Freitag, 5. Juni 2020 13:10
>> *To:* unicode at unicode.org
>> *Cc:* Alisdair Meredith
>> *Subject:* What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
>>
>> Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine):
>>
>> ... *Use of a BOM is neither required nor recommended for UTF-8*, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the "Byte Order Mark" subsection in Section 23.8, Specials, for more information.
>>
>> The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend to both presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of UTF-8 as an encoding signature is to be avoided in favor of some other mechanism?
>> The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature.
>>
>> Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage.
>>
>> So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? Or is it only intended to discourage BOM use in cases where the encoding is known by other means?
>>
>> Tom.

From asmusf at ix.netcom.com Sun Jun 7 16:31:33 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 7 Jun 2020 14:31:33 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
Message-ID: <5324b0cf-ff48-d6d5-768d-d41cba4155ee@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From 747.neutron at gmail.com Mon Jun 8 02:23:41 2020
From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=)
Date: Mon, 8 Jun 2020 16:23:41 +0900
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org>
Message-ID: 

As CJK ideographs were mentioned... They are different from most other Unicode code points in several ways, namely:

- Most of the substantial discussion is going on under the supervision of ISO, rather than UTC. It's one of the few fields whose description in ISO/IEC 10646 is more informative than that in the Unicode Standard. For practical knowledge see especially the ISO standard's Annex P and S.
- Whether to separately encode two characters is mainly decided by difference in structure, i.e. sub-character formation, besides the semantics, because Han characters are compositional by nature, unlike most phonetic scripts where each letter only means what it means as a whole shape (? is not a ? with hyphen, is it?).
- The questionable quality of CJK Extension B characters is an open secret.

2020年6月8日(月) 4:57 Abraham Gross via Unicode :
>
> If this is the case, then why do the CJK blocks have tons of alternatives for the same character? (not counting the compatibility ideographs that were just added for compatibility with other encodings) If you look at old dictionaries, these alternatives get listed as alternatives of the same character that you might see some fonts use. The meaning is exactly the same.
>
> Some examples (there's tons and tons more):
> ??????
> ???
> ????
> ??
> ???
> ???
> ???
> ??????
> ??
> ??????
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
Shoulson via Unicode" wrote: > > > On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote: > > > >> On Sat, 6 Jun 2020 23:58:42 -0600 > >> Anshuman Pandey via Unicode wrote: > >> > >>> Hi Abraham, > >>> > >>> If you?re seriously thinking of submitting a proposal for a new > >>> Hebrew character, please consider getting in touch with Deborah > >>> Anderson, Michael Everson, or me. We?d be happy to help you figure > >>> out the suitability of encoding the character in question or figuring > >>> out ways to represent it in plain text, if need be. > >> > >> I[t] doesn't belong in plain text. It only becomes useful once line > >> breaks and character spacing are known. > >> > >> Richard. > > > > I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at > > best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any > > combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each > > one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not > > things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like > > U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus > > HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal > > VAV followed by a vowel. In spelling it out, you could call one a holam mal?, but not the other. > > A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct > > character, and moreover one that cannot be deduced algorithmically by looking at the letters around > > it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same > > character, and Unicode doesn't encode glyphs (anymore?) > > > > ~mark > From abrahamgross at disroot.org Mon Jun 8 02:41:56 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Mon, 08 Jun 2020 07:41:56 +0000 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> Message-ID: <453b7f69c101676fe1815355b5e41d22@disroot.org> ??? The way I understand it, a lot of ext b (including the crazy/cursive ones like ????????????????) come from dictionaries that had cursive entries (for some unknown reason). Example dictionary: https://sociorocketnewsen.files.wordpress.com/2016/09/broken-11.jpg https://sociorocketnewsen.files.wordpress.com/2016/09/wrong-11-e1473684067982.jpg https://sociorocketnewsen.files.wordpress.com/2016/09/curve-11.jpg https://sociorocketnewsen.files.wordpress.com/2016/09/curve-21-e1473684039617.jpg 2020/06/08 ??3:25:01 W?ng Yif?n via Unicode : > - The questionable quality of CJK Extension B characters is an open secret. From marius.spix at web.de Mon Jun 8 04:46:58 2020 From: marius.spix at web.de (Marius Spix) Date: Mon, 8 Jun 2020 11:46:58 +0200 Subject: Why do the Hebrew Alphabetic Presentation Forms Exist In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> Message-ID: <20200608114350.364eba2e@spixxi> ? is a ligature of ? and ?. ? is a ? with breve. They are considered to be seperate letters for historic reasons. ? 
? and ? have nothing in common but part of the shape. They derived from completely different characters.

On Mon, 8 Jun 2020 16:23:41 +0900 Wáng Yifán wrote:
> - Whether to separately encode two characters is mainly decided by difference in structure, i.e. sub-character formation, besides the semantics, because Han characters are compositional by nature, unlike most phonetic scripts where each letter only means what it means as a whole shape (? is not a ? with hyphen, is it?).

From 747.neutron at gmail.com Mon Jun 8 07:43:48 2020
From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=)
Date: Mon, 8 Jun 2020 21:43:48 +0900
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <453b7f69c101676fe1815355b5e41d22@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <1b3f13c54cc6102907fbb3d0043178ca@disroot.org> <453b7f69c101676fe1815355b5e41d22@disroot.org>
Message-ID: 

Those are not exactly what we call failures in Ext B (some of the ones you listed are actually not from Ext B), because they sort of "had to" be included rather than slipping in by carelessness. It's definitely one of the headaches, just in another dimension.

Maybe you can take a glance at what we had discussed recently, if you're really into it:
https://www.unicode.org/L2/L2019/19346-gongche-policy.pdf
https://appsrv.cse.cuhk.edu.hk/~irg/irg/irg53/IRGN2413_IDS_issues.pdf
https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

2020年6月8日(月) 17:57 Abraham Gross via Unicode :
>
> The way I understand it, a lot of Ext B (including the crazy/cursive ones like ????????????????) come from dictionaries that had cursive entries (for some unknown reason).
>
> Example dictionary:
> https://sociorocketnewsen.files.wordpress.com/2016/09/broken-11.jpg
> https://sociorocketnewsen.files.wordpress.com/2016/09/wrong-11-e1473684067982.jpg
> https://sociorocketnewsen.files.wordpress.com/2016/09/curve-11.jpg
> https://sociorocketnewsen.files.wordpress.com/2016/09/curve-21-e1473684039617.jpg
>
> 2020/06/08 午前3:25:01 Wáng Yifán via Unicode :
> > - The questionable quality of CJK Extension B characters is an open secret.

From abrahamgross at disroot.org Mon Jun 8 12:45:02 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Mon, 08 Jun 2020 17:45:02 +0000
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2>
Message-ID: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>

Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?

Here are 2 character sets with a folded lamed:
https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.

2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:

> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>
> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.)
> You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>
> ~mark

From john_h_jenkins at apple.com Mon Jun 8 13:09:38 2020
From: john_h_jenkins at apple.com (jenkins)
Date: Mon, 08 Jun 2020 12:09:38 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
Message-ID: 

Unicode *encoded* characters that other character sets have, even though it normally wouldn't. That's really not done anymore.

It's also a matter of what the character set in question is. The two mentioned here are too obscure IMHO to have ever been covered by round-trip compatibility.

> On Jun 8, 2020, at 11:45 AM, Abraham Gross via Unicode wrote:
>
> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?
>
> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
>
>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>>
>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED.
>> They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>>
>> ~mark

From mark at kli.org Mon Jun 8 16:02:37 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 8 Jun 2020 17:02:37 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
Message-ID: <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>

Look, think of it this way: what exactly is the content of, say, Exodus 6:10, for a nice short and common verse? What are the letters, vowels, and cantillations that make up that verse? The answer is pretty well-agreed-upon by most sources. Tell me: is the LAMED in that verse bent or straight? Can you find a list of LAMEDs in the Torah that are bent? Not "which ones are bent in this particular book." That's like finding me a list of YODs that are at the end of a line: it has nothing to do with the actual TEXT. Which LAMEDs in the Torah are bent? None of them. Nor are any of them straight. Nor are any of them written in Frank-Ruehl, or Hadassah, or David. Those are not properties of the text. The consonantal text of the Torah uses exactly 22 letters plus final forms, plus the NUN HAFUKHA and a few instances of UPPER DOT.

Now, there *are* some letters in the Torah which are written unusually large or small, like the BET at the very beginning, or the small ALEPH in Leviticus 1:1. But Unicode rightly considers those to be glyphic variants, to be handled at a higher level. There's actually a better case for encoding these, because there IS a list of large BETs or small ALEPHs in the Torah, which "everyone" (who accepts Masoretic tradition) agrees are in these and those places in the text. (But don't try to encode these, either.)

Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.

~mark

On 6/8/20 1:45 PM, Abraham Gross via Unicode wrote:
> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?
>
> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
>
>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>>
>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different.
>> A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>>
>> ~mark

From kenwhistler at sonic.net Mon Jun 8 18:58:19 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 8 Jun 2020 16:58:19 -0700
Subject: Alternate presentation for U+229C CIRCLED EQUALS?
In-Reply-To: References: 
Message-ID: <16312c19-dd47-fcdf-f775-b868e49b72fc@sonic.net>

Actually, no you can't, because U+1F10D is already standardized as CIRCLED ZERO WITH SLASH, published in Unicode 13.0.

https://www.unicode.org/charts/PDF/U1F100.pdf

People discussing this should make sure they are referring to actual, published code charts, and not to proposals from 2 to 4 years ago.

--Ken

On 6/5/2020 5:15 PM, abrahamgross--- via Unicode wrote:
> You can still make a proposal to add it to U+1F10D
>

From asmusf at ix.netcom.com Mon Jun 8 21:57:05 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 8 Jun 2020 19:57:05 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From abrahamgross at disroot.org Mon Jun 8 23:47:24 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 9 Jun 2020 04:47:24 +0000 (UTC)
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
Message-ID: 

Does anyone know which national standard the alternative ayin came from? I can't find it anywhere, and I want to look through it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pandey at umich.edu Tue Jun 9 01:10:34 2020
From: pandey at umich.edu (Anshuman Pandey)
Date: Tue, 9 Jun 2020 01:10:34 -0500
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
References: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
Message-ID: <52B1F014-F225-4934-A830-C0BFF7381710@umich.edu>

> On Jun 8, 2020, at 9:57 PM, Asmus Freytag via Unicode wrote:
>
> On 6/8/2020 2:02 PM, Mark E. Shoulson via Unicode wrote:
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
> The meta issue: how to ensure that texts that have such features (i.e. layout-specific or scribe-specific choice of shapes) can be widely represented in interchangeable digital representations - even if that representation isn't plain text.
>
> A./

You hit the nail on the head: I'm dealing with this issue for alternate terminals for Old Uyghur letters. Scribes made a choice to use either a vertical or a curved stroke. The shape of the terminal itself doesn't change the semantic value of the letter, but it carries pragmatic value.

Anshu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at kli.org Tue Jun 9 07:29:42 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 9 Jun 2020 08:29:42 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com>
Message-ID: <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>

On 6/8/20 10:57 PM, Asmus Freytag via Unicode wrote:
> On 6/8/2020 2:02 PM, Mark E. Shoulson via Unicode wrote:
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
>
> The meta issue: how to ensure that texts that have such features (i.e. layout-specific or scribe-specific choice of shapes) can be widely represented in interchangeable digital representations - even if that representation isn't plain text.
>
> A./

I guess that's what it comes down to. Unicode is classically concerned only with plain text. Aside from disputes about where "plain text" ends, what's to be done with "non-plain" text? Some aspects of this non-plain text, like these scribal choices, obviously feel more connected to the abstract text than others, like page layout. Are these part of Unicode's mission? Should they be? If not, then what? You *can* represent and reproduce these details by kludges, be they as ham-fisted as having two fonts with different LAMEDs and formatting some in one font and some in another. Is that good enough? Does it mess up other things? And even if it is good enough, does that count as an "interchangeable digital representation," that I can send a .odt file around? Things to ponder.

~mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mhd at yv.org Mon Jun 8 23:45:45 2020
From: mhd at yv.org (Mark H. David)
Date: Mon, 08 Jun 2020 21:45:45 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: 
Message-ID: <593a318d-4662-4410-b768-705c198e8eea@www.fastmail.com>

Hi, sorry for the late response, but regarding other character sets *besides* Unicode with Hebrew characters that ended up in Alphabetic Presentation Forms, several were from Apple: Mac OS Hebrew. See this mapping table:

http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/HEBREW.TXT

----- Original message -----
From: Abraham Gross via Unicode
To: unicode at unicode.org
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Date: Tuesday, June 02, 2020 8:18 PM

Why are there precomposed Hebrew characters in Unicode (Alphabetic Presentation Forms block)?
It says in the FAQ that "a substantial number of presentation forms were encoded in Unicode as compatibility characters, because legacy software or data included them." (https://www.unicode.org/faq/ligature_digraph.html#PForms)

I can't find any character set other than Unicode that has separate codepoints for all Hebrew letters with a dagesh/mapiq or any of the other precomposed letters other than the Yiddish ligatures. (ex: Code page 862, ISO/IEC 8859-8, Windows-1255) Does anyone know where I can find the legacy software or character sets that had these presentation forms?

I also want to see the documents/proposals that got these characters accepted as part of Unicode. Does anyone know where I can find them? The closest I got was when I figured out the proposal to add HEBREW LETTER YOD WITH HIRIQ is in proposal N1364, but I can't find it in the document register?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jk at koremail.com Tue Jun 9 10:00:59 2020
From: jk at koremail.com (jk at koremail.com)
Date: Tue, 09 Jun 2020 23:00:59 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com> <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
Message-ID: 

On 2020-06-09 20:29, Mark E. Shoulson via Unicode wrote:
> On 6/8/20 10:57 PM, Asmus Freytag via Unicode wrote:
>
>> On 6/8/2020 2:02 PM, Mark E. Shoulson via Unicode wrote:
>>
>>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
>>
>> The meta issue: how to ensure that texts that have such features (i.e. layout-specific or scribe-specific choice of shapes) can be widely represented in interchangeable digital representations - even if that representation isn't plain text.
>>
>> A./
>
> I guess that's what it comes down to. Unicode is classically concerned only with plain text. Aside from disputes about where "plain text" ends, what's to be done with "non-plain" text? Some aspects of this non-plain text, like these scribal choices, obviously feel more connected to the abstract text than others, like page layout. Are these part of Unicode's mission? Should they be? If not, then what? You *can* represent and reproduce these details by kludges, be they as ham-fisted as having two fonts with different LAMEDs and formatting some in one font and some in another. Is that good enough? Does it mess up other things? And even if it is good enough, does that count as an "interchangeable digital representation," that I can send a .odt file around? Things to ponder.

Unicode is concerned with information exchange, as you say "interchangeable digital representation", of which two common examples are emails and text messages. So an .odt file does not count.
John

> ~mark

From everson at evertype.com Tue Jun 9 11:44:01 2020
From: everson at evertype.com (Michael Everson)
Date: Tue, 9 Jun 2020 17:44:01 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org>
Message-ID: <519CF2BF-4F79-4054-BE09-FE7F3F9F711A@evertype.com>

To respond to Mark, I'd say that these examples here certainly show a fairly obvious glyph distinction that is not really a "hard sell".

> On 8 Jun 2020, at 18:45, Abraham Gross via Unicode wrote:
>
> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?
>
> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.
>
> 2020年6月7日 10:27, "Mark E. Shoulson via Unicode" wrote:
>
>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote:
>>
>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal VAV followed by a vowel. In spelling it out, you could call one a holam malé, but not the other. A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct character, and moreover one that cannot be deduced algorithmically by looking at the letters around it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same character, and Unicode doesn't encode glyphs (anymore?)
>>
>> ~mark

From everson at evertype.com Tue Jun 9 14:53:31 2020
From: everson at evertype.com (Michael Everson)
Date: Tue, 9 Jun 2020 20:53:31 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: 

Doesn't it matter _why_ they are bent?

> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>
> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
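A concrete footnote to the thread: the status of the existing Hebrew presentation forms is machine-readable, and a few lines of Python (an illustrative sketch, not code from any message above) show the two different kinds that have come up here, the canonically-decomposing-but-composition-excluded YOD WITH HIRIQ and the <font>-compatibility ALTERNATIVE AYIN:

    import unicodedata

    # U+FB1D HEBREW LETTER YOD WITH HIRIQ decomposes canonically to
    # U+05D9 YOD + U+05B4 HIRIQ, but it is on the composition-exclusion
    # list, so NFC never reassembles it once it has been decomposed.
    yod_hiriq = "\uFB1D"
    nfd = unicodedata.normalize("NFD", yod_hiriq)
    print([f"U+{ord(c):04X}" for c in nfd])                # ['U+05D9', 'U+05B4']
    print(unicodedata.normalize("NFC", nfd) == yod_hiriq)  # False

    # U+FB20 HEBREW LETTER ALTERNATIVE AYIN carries only a <font>
    # compatibility decomposition: NFD leaves it alone, while NFKD
    # folds it to the ordinary U+05E2 AYIN.
    alt_ayin = "\uFB20"
    print(unicodedata.decomposition(alt_ayin))                  # '<font> 05E2'
    print(unicodedata.normalize("NFKD", alt_ayin) == "\u05E2")  # True

Either way, normalization folds these forms toward the ordinary Hebrew letters, which is the practical sense in which they are discouraged for new text.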
From asmusf at ix.netcom.com Tue Jun 9 14:59:47 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 9 Jun 2020 12:59:47 -0700
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <2e7e2783-fb3a-ee4f-5859-850edf03dbac@ix.netcom.com> <669714ba-c500-61b6-64ca-791bcf24ccae@kli.org>
Message-ID: <354d6a41-4306-cb10-d5de-5ff5d7294c60@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org Tue Jun 9 17:33:20 2020
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 9 Jun 2020 16:33:20 -0600
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
Message-ID: <002301d63ead$faaaf320$f000d960$@ewellic.org>

abrahamgross at disroot.org wrote:

> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it?

To elaborate a little on John's comment that "that's really not done anymore":

Unicode more or less promised to encode everything that was present in existing, contemporary coded character sets. So if it was in ISO 8859-8, MS-DOS CP862, Windows CP1255, MARC-8 for Hebrew, etc., then it would be in Unicode as well. That's where the presentation forms came from, as mentioned earlier.

This did not mean Unicode was obligated to conform retroactively to every coded character set introduced or updated *after* Unicode was published. It has certainly done so for some widely used character sets, particularly in East Asia, but there is no obligation for Unicode to add EWELLIC LETTER A just because I publish an 8-bit character set that contains that letter.

And this promise always applied to "coded character sets," a collection of mappings between a code point (single-byte, double-byte, or multi-byte) and a character, used to represent plain text in computers. It didn't apply to glyph collections for typesetting, as in the TeX example below, and definitely not to charts of letters found in a book, with no corresponding code points, as in the JPEG image below.

> Here are 2 character sets with a folded lamed:
> https://i.imgur.com/iq8awBe.jpg - an ??? ???? with the standing and folded lameds as separate letters.
> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 - A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From abrahamgross at disroot.org Tue Jun 9 17:51:34 2020
From: abrahamgross at disroot.org (abraham gross)
Date: Tue, 9 Jun 2020 18:51:34 -0400
Subject: OverStrike control character
Message-ID: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>

What do y'all think about adding an OverStrike control character?

There's historical precedent for having such a control character. The famous Symbolics Space Cadet keyboard had such a key, and many typewriters relied on its functionality (e.g. in order to make a "!" you had to type "'." on most typewriters up until the mid-1900s).

The programming language APL also heavily relied on the overstrike control character, so many systems in the 80s had the character, including Lisp machines.
Here's a quote from the Lisp Machine manual: "OVER STRIKE: Moves the cursor back so that you can superpose (overlay) two characters, should you really want to. The key called BS will do the same thing on the old keyboards." (The BackSpace key on Lisp Machines worked like it did on typewriters, where it just went back a character. The Rub Out key actually deleted the last character.)

Unicode/ASCII currently has at ASCII 8 the character "BS" that's supposed to go back a character without deleting it, and "DEL" at ASCII 127 that does delete the character. But nowadays BS just deletes the previous character. In fact, it's prohibited in ISO/IEC 8859 for BS to not delete the previous character. Wikipedia says: "[The cancel control character is] A control character ("CCH", "Cancel Character", U+0094, or ESC T) used to erase the previous character. This character was created as an unambiguous alternative to the much more common backspace character ("BS", U+0008), which has a now mostly obsolete alternative function of causing the following character to be superimposed on the preceding one."

Modern Usage:
Since no modern systems have such a control character, you won't find it being used anywhere, but I can guarantee that it will receive wide adoption, especially in East Asia, because the kaomoji community will have a field-day with it. People looking to add diacritics that aren't encoded as a combining character yet will also now have the option to do ?? like ?g?? would come out looking like a "g" with a sideways "Z" on top.

Another use of the OverStrike key will be combining shapes in new creative ways for custom orthographies or for custom symbols that can be sent over plain text without the need for special fonts. (Since Unicode will never encode anyone's random conscript/symbols, this would be a great way for people to get this usage with only the addition of a single character.)

From richard.wordingham at ntlworld.com Tue Jun 9 18:31:09 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 10 Jun 2020 00:31:09 +0100
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: <20200610003109.0297da84@JRWUBU2>

On Tue, 9 Jun 2020 20:53:31 +0100
Michael Everson via Unicode wrote:

> Doesn't it matter _why_ they are bent?
>
>> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>>
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.

Yes, it does. It seems that they are bent so that they don't clash with the line above. Changing the line breaks or even changing the relative widths of the characters will change which ones get bent. Being bent is an attribute of glyphs in laid-out text, rather than an attribute of characters in a sequence of characters.

That is why mention of ODT files is relevant. I'm not sure what one has to do to stop an ODT file reflowing. I suspect one may have to freeze a lot of the rendering chain to stop reflowing.

Richard.
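Richard's argument lends itself to a toy sketch. The following Python is purely illustrative (the function and glyph names are invented, and no real renderer is this simple); it only shows where the bent/straight decision would live, namely in layout code that runs after line breaking, with the stored character stream unchanged:

    # The character stream stores only U+05DC HEBREW LETTER LAMED.
    stored_text = "\u05DC"

    def choose_lamed_glyph(line_above_intrudes: bool) -> str:
        # A layout-time decision, made from information (the line
        # above) that plain text does not contain.
        return "lamed.bent" if line_above_intrudes else "lamed"

    first_layout = choose_lamed_glyph(True)    # 'lamed.bent'
    after_reflow = choose_lamed_glyph(False)   # 'lamed'
    assert first_layout != after_reflow        # same text, different glyphs

Add a word upstream and the line breaks change, the function's input changes, and the glyph changes, while stored_text never does.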
From kenwhistler at sonic.net Tue Jun 9 18:37:07 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Tue, 9 Jun 2020 16:37:07 -0700
Subject: OverStrike control character
In-Reply-To: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

On 6/9/2020 3:51 PM, abraham gross via Unicode wrote:
> What do y'all think about adding an OverStrike control character?

Not gonna happen.

> There's historical precedent for having such a control character. The famous Symbolics Space Cadet keyboard had such a key, and many typewriters relied on its functionality (e.g. in order to make a "!" you had to type "'." on most typewriters up until the mid-1900s).

And actually U+0008 BACKSPACE (i.e. BS) has been in the Unicode Standard for 30 years now. If people were going to implement characters a la the 1980's (and earlier) backspace and overstrike mode, they had the character they needed for that already. It is a clue about how implementations work with characters and fonts these days that nobody is rushing out to implement overstriking with U+0008, even though it is there for the taking.

> The programming language APL also heavily relied on the overstrike control character, so many systems in the 80s had the character, including Lisp machines.

Another telling example. Unicode 1.0 in 1990 included U+2300 APL COMPOSE OPERATOR, which was intended precisely for the APL overstrike functionality. It was *removed* in the big merger that resulted in Unicode 1.1, in part because the APL community itself was more interested in getting the actual composed operators into the encoding, rather than depending on archaic sequences that reflected the limitations of 7- and 8-bit character encodings and associated keyboards. Hence, all the combined APL operators now seen in Unicode at U+2336 .. U+2379.

There is some (very) limited use of the concept of an overstruck compositor in the Unicode Standard, but the concept is limited to specific scripts and is very constrained. The obvious example is U+13436 EGYPTIAN HIEROGLYPH OVERLAY MIDDLE. That is used as part of a syntax for constructing complete Egyptian hieroglyph quadrats. But the critical distinction is that that format control is part of a complex syntax used by a modern font technology that maps sequences of hieroglyphs into ligatures and/or using complex positioning and resizing rules.

Nobody implements fonts these days that will just "back up" "one space" and "overstrike" a new character. Well, possibly outside the context of Societies for Deliberate Anachronism busy implementing emulations of long dead technology, I suppose.

--Ken

From harjitmoe at outlook.com Tue Jun 9 18:44:26 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Tue, 9 Jun 2020 23:44:26 +0000
Subject: OverStrike control character
In-Reply-To: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

> The programming language APL also heavily relied on the overstrike control character, so many systems in the 80s had the character, including Lisp machines.

The current way of handling APL overstamping sequences is to include the entire sequences in the mapping file: https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/APL-ISO-IR-68.TXT

The interpreter/compiler would presumably have a hardcoded list of sequences it recognises anyway?
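(To make that concrete, here is a toy Python sketch of the kind of folding such a list implies. The pair table below is invented for illustration, built from two well-known APL compositions; it is not a copy of the APL-ISO-IR-68 mapping file.)

    # Fold typewriter-era "X BACKSPACE Y" overstrikes into the
    # precomposed APL operators encoded at U+2336..U+2379.
    OVERSTRIKES = {
        frozenset(("\u2395", "\u00F7")): "\u2339",  # quad + divide -> QUAD DIVIDE
        frozenset(("\u25CB", "*")): "\u235F",       # circle + star -> CIRCLE STAR
    }

    def fold_overstrikes(s: str) -> str:
        out, i = [], 0
        while i < len(s):
            if i + 2 < len(s) and s[i + 1] == "\x08":  # an "X BS Y" triple
                composed = OVERSTRIKES.get(frozenset((s[i], s[i + 2])))
                if composed is not None:
                    out.append(composed)
                    i += 3
                    continue
            out.append(s[i])
            i += 1
        return "".join(out)

    print(fold_overstrikes("\u2395\b\u00F7"))  # U+2339, the APL "domino"

Because each pair is stored as a frozenset, the folding is order-insensitive, which is exactly the property an interpreter wants from overstrikes.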
> Unicode/ASCII currently has at ASCII 8 the character "BS" that's supposed to go back a character without deleting it, and "DEL" at ASCII 127 that does delete the character. But nowadays BS just deletes the previous character.

Unicode itself is fairly hands-off about how higher level protocols can interpret C0 and C1 control codes (general category Cc). Indeed, ISO 10646:2017 section 12.4, while giving the designation sequences of the ISO 6429 (ECMA 48) controls as the default, does go on to (on the next page) permit the use of ISO 2022 designations of other control code sets with UCS/Unicode (by contrast, ISO 2022 designation of graphical sets is not permitted inside UCS, and has no compatible semantic).

That being said, TUS chapter 23.1 names a limited subset of them (HT, LF, VT, FF, CR, FS, GS, RS, US, NEL), so that they can be given custom behaviours for line breaking, bidirectional processing and classification as whitespace. BS is not amongst these. In practice, BS is not supported at all (i.e. has neither behaviour) outside of terminal emulators, in my experience.

> In fact, it's prohibited in ISO/IEC 8859 for BS to not delete the previous character.

ISO 8859 defines profiles of ISO 4873 (ECMA 43) Level 1. Both ISO 8859 and ISO 4873 stipulate fixed character repertoires, and so prohibit creating new characters by overstamping existing ones by any means (including using BS or CR to seek back over them). I don't read this as limiting how BS itself might be implemented, just that it is invalid ISO 8859 for a text to use it to stamp characters on top of other characters to create a character with a different meaning to the two one after the other.

They do permit using the GCC control sequence defined by ISO 6429 (ECMA 48), though, since it doesn't overstamp anything but merely renders the characters in one em-square (if that function is supported, and it usually isn't so far as I can tell; the most extreme example I can think of is that the byte sequence 9B 31 20 5F D5 E4 E9 20 C7 E4 E4 E7 20 D9 E4 EA E7 20 E8 D3 E4 E5 9B 32 20 5F in ISO-8859-6 might be shown with a U+FDFA glyph).

From sosipiuk at gmail.com Tue Jun 9 19:01:32 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Tue, 9 Jun 2020 20:01:32 -0400
Subject: OverStrike control character
In-Reply-To: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

On Tue, Jun 9, 2020 at 6:57 PM abraham gross via Unicode wrote:
>
> What do y'all think about adding an OverStrike control character?

I don't think it's a goer. There are two things that immediately stand out:

1. Unicode doesn't seem eager to define control characters at all. In fact, aside from a handful of format effectors which were so universal and obvious that it made no sense to exclude them, Unicode is very passive on the topic of even the well-defined controls of ISO 6429/ECMA 48. An interesting exception to this is the pair of U+2028 and U+2029 (line and paragraph separators). Any control character is going to be a "hard sell".

2. Overstriking arbitrary characters is a qualitatively different process than using combining characters. In the latter case, the set of characters is restricted, and certain algorithms can be applied to make the presentation look sane (to varying degrees of success). Overstriking implies the need for the rendering engine to be able to combine any two characters, regardless of elements that interfere or clash.
It seems simple in principle to just render the characters separately and overlay the pixels, but I'm very skeptical of what the results would actually look like in real life, with users making unpredictable font and formatting choices.

> Unicode/ASCII currently has at ASCII 8 the character "BS" that's supposed to go back a character without deleting it, and "DEL" at ASCII 127 that does delete the character. But nowadays BS just deletes the previous character. In fact, it's prohibited in ISO/IEC 8859 for BS to not delete the previous character.

Is it? I know that's the behaviour in all modern software, but I can't find that prohibition. Can you point out the section?

Speaking of old standards, though, ISO 6429/ECMA 48 has the GCC (GRAPHIC CHARACTER COMBINATION) control, which seems to be its recommendation for overstriking (though it also waffles about how combined characters may simply be made half-width and inserted into the horizontal space of a single character, leaving the ultimate decision of "relative sizes and placements" to the implementation.)

GCC looks like a mess. Because of the way it's built up from a CSI (control sequence introducer) and uses parameters, the way to combine two characters is to precede them both with the sequence [0x1B 0x5B 0x30 0x20 0x5F], and to combine more than two characters, enclose them with an initial [0x1B 0x5B 0x31 0x20 0x5F] and a final [0x1B 0x5B 0x32 0x20 0x5F]. How fun.

Sławomir Osipiuk

From sosipiuk at gmail.com Tue Jun 9 19:09:22 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Tue, 9 Jun 2020 20:09:22 -0400
Subject: OverStrike control character
In-Reply-To: References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: 

On Tue, Jun 9, 2020 at 8:01 PM Sławomir Osipiuk wrote:
>
> Is it? I know that's the behaviour in all modern software, but I can't find that prohibition. Can you point out the section?

D'oh. It's right in the very first section ("Scope").

"The use of control functions, such as BACKSPACE or CARRIAGE RETURN for the coded representation of composite characters is prohibited by this Standard."

From jk at koremail.com Tue Jun 9 20:13:30 2020
From: jk at koremail.com (jk at koremail.com)
Date: Wed, 10 Jun 2020 09:13:30 +0800
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: <20200610003109.0297da84@JRWUBU2>
References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org> <20200610003109.0297da84@JRWUBU2>
Message-ID: <1f63ec03c58bf1bfac3ee7f46f72f475@koremail.com>

On 2020-06-10 07:31, Richard Wordingham via Unicode wrote:
> On Tue, 9 Jun 2020 20:53:31 +0100
> Michael Everson via Unicode wrote:
>
>> Doesn't it matter _why_ they are bent?
>>
>>> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>>>
>>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
>
> Yes, it does. It seems that they are bent so that they don't clash with the line above. Changing the line breaks or even changing the relative widths of the characters will change which ones get bent. Being bent is an attribute of glyphs in laid-out text, rather than an attribute of characters in a sequence of characters.
>
> That is why mention of ODT files is relevant.
> I'm not sure what one has to do to stop an ODT file reflowing. I suspect one may have to freeze a lot of the rendering chain to stop reflowing.
>
> Richard.

If whether or not the lamed is bent depends on the line above, then it is clearly not a suitable candidate for encoding.

John K

From mark at kli.org Tue Jun 9 20:41:15 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 9 Jun 2020 21:41:15 -0400
Subject: Why do the Hebrew Alphabetic Presentation Forms Exist
In-Reply-To: References: <1ff571c5-990b-4c60-9689-952440040d05@disroot.org> <988EB832-83E6-433B-A41F-DC018D73E026@umich.edu> <20200607124627.12db8f87@JRWUBU2> <79c24533c52e5c2fbfa0c5dbcc438658@disroot.org> <1fa3a52f-7aec-a96a-d8ee-641d014d8486@kli.org>
Message-ID: <8c1fe062-d31a-7c64-8ca0-4695e6a9f0a1@kli.org>

On 6/9/20 3:53 PM, Michael Everson via Unicode wrote:
> Doesn't it matter _why_ they are bent?
>
>> On 8 Jun 2020, at 22:02, Mark E. Shoulson via Unicode wrote:
>>
>> Down to one sentence: until you can talk about which LAMEDs in the Torah are bent and which are straight, I would expect this to be a non-starter.
An old-time italic font might have 3 different "s"s or "n"s, depending on how much swoop and swash the typesetter felt like using at that particular spot, and the type sample pages would show them all, but that doesn't make them all distinct characters.? I think https://www.oldfonts.com/antiquepenman/wp-content/uploads/2017/03/libraryprimer.jpg is for handwriting, but one could easily imagine a typeface imitating that, with all those different forms of f and g and so on.? They're still just f's.? (I can see if I have some actual type samples for better examples if needed...) (I read Haralambous'(*) article on "Tiqwah" years ago; he definitely did some very careful work and study in Biblical typesetting.? But note, _typesetting_.? The art of laying out glyphs on paper.? That's not the same thing as characters.) ~mark (*) I think this may be the first time I noticed that his name isn't "Harambolous", which for some reason I thought it was.? Apologies, professor... On 6/9/20 12:44 PM, Michael Everson via Unicode wrote: > To respond to Mark, I?d say that these examples here certainly show a fairly obvious glyph distinction that is not really a ?hard sell?. > >> On 8 Jun 2020, at 18:45, Abraham Gross via Unicode wrote: >> >> Unicode encodes characters that other character sets have even though it normally wouldn't. So if I find a character set with a folded lamed they'd add it? >> >> Here are 2 character sets with a folded lamed: >> https://i.imgur.com/iq8awBe.jpg ? an ??? ???? with the standing and folded lameds as separate letters. >> https://www.tug.org/TUGboat/tb15-3/tb44haralambous-hebrew.pdf#page=12 ? A TeX typesetting module with the standing and folded lameds as separate characters for fine-grain control when the automatic system doesn't work. >> >> 2020?6?7? 10:27, "Mark E. Shoulson via Unicode" wrote: >> >>> On 6/7/20 7:46 AM, Richard Wordingham via Unicode wrote: >>> >>> I agree. Sorry, pretty typography is nice and everything, but if bent LAMED is anything, it's at >>> best a presentation form (and even that is a hard sell.) You show ANYONE a word spelled with any >>> combination of bent and straight LAMEDs and ask how it's spelled, they'll just say "LAMED" for each >>> one. Unicode encodes different *characters*, symbols that have a different *meaning* in text, not >>> things that happen to look different. A U+05BA HOLAM HASER FOR VAV means not just "a dot like >>> U+05B9 only shifted over a little," it means that there is something *different* going on: VAV plus >>> HOLAM usually means one thing (a VAV as mater lectionis for an /o/ vowel), this is a consonantal >>> VAV followed by a vowel. In spelling it out, you could call one a holam mal?, but not the other. >>> A QAMATS QATAN is not just a qamats that looks a little different, it is a grammatically distinct >>> character, and moreover one that cannot be deduced algorithmically by looking at the letters around >>> it. What you're talking about is a LAMED and a LAMED. They are two *glyphs* for the same >>> character, and Unicode doesn't encode glyphs (anymore?) 
>>>
>>> ~mark

From abrahamgross at disroot.org  Tue Jun  9 21:50:07 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 02:50:07 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
Message-ID: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>

It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.

It shouldn't do any fancy processing by default (unless a font actually cares enough to mess with it). Most systems have just about the same font, so I wouldn't worry about the results of overstriking not coming out perfect. Even if it doesn't come out perfect, I'd take almost-exact representation over no representation any day.

2020/06/09 午後8:02:39 Sławomir Osipiuk via Unicode :

> 2. Overstriking arbitrary characters is a qualitatively different
> process than using combining characters. In the latter case, the set
> of characters is restricted, and certain algorithms can be applied to
> make the presentation look sane (to varying degrees of success).
> Overstriking implies the need for the rendering engine to be able to
> combine any two characters, regardless of elements that interfere or
> clash. It seems simple in principle to just render the characters
> separately and overlay the pixels, but I'm very skeptical of what the
> results would actually look like in real life, with users making
> unpredictable font and formatting choices.
>

From prosfilaes at gmail.com  Tue Jun  9 22:14:14 2020
From: prosfilaes at gmail.com (David Starner)
Date: Tue, 9 Jun 2020 20:14:14 -0700
Subject: OverStrike control character
In-Reply-To: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: 

On Tue, Jun 9, 2020 at 7:55 PM abrahamgross--- via Unicode wrote:
>
> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.
>
> It shouldn't do any fancy processing by default (unless a font actually cares enough to mess with it). Most systems have just about the same font, so I wouldn't worry about the results of overstriking not coming out perfect. Even if it doesn't come out perfect, I'd take almost-exact representation over no representation any day.

There is a character to do that: BS. Like many other control characters, it's now generally considered obsolete. There's no need to provide a new character to do something that an old character already does just fine, even if that old character is unsupported because it doesn't fit with current design choices. Making a new character won't change that.

-- 
The standard is written in English. If you have trouble understanding a particular section, read it again and again and again... Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991)
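(For concreteness: the classic surviving producer of BS overstriking is nroff/man output, where "X BS X" is rendered as bold and "_ BS X" as underlined by pagers such as less(1), and stripped by col -b. A minimal Python 3 sketch of that legacy convention; the helper names are illustrative, not from any standard:)

    BS = "\b"  # U+0008 BACKSPACE, the legacy overstrike operator

    def nroff_bold(s):
        # Strike each glyph over itself: less(1) shows "X\bX" as a bold X.
        return "".join(c + BS + c for c in s)

    def nroff_underline(s):
        # Overstrike an underscore with the glyph: "_\bX" reads as underlined X.
        return "".join("_" + BS + c for c in s)

    # A bare terminal, as the echo example below shows, just backs up and
    # overwrites, so pipe this output through `less` to see the effect:
    print(nroff_bold("bold") + " " + nroff_underline("under"))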
From abrahamgross at disroot.org  Tue Jun  9 22:25:10 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 03:25:10 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>

BS doesn't do that, though. Even in the beginning of ASCII it was only supported on some devices.

Doing `echo -e 'xy\x08z'` results in `xz` and not in https://imgur.com/B0020Xb

From gwalla at gmail.com  Wed Jun 10 00:44:41 2020
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 9 Jun 2020 22:44:41 -0700
Subject: OverStrike control character
In-Reply-To: <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>
Message-ID: 

Would x OVERSTRIKE z look the same as z OVERSTRIKE x? If yes, would they be considered identical for string matching purposes? Would they have to be reordered for normalization? What would be the repercussions for collation?

Display is not the only thing text is for.

On Tue, Jun 9, 2020 at 8:30 PM abrahamgross--- via Unicode <unicode at unicode.org> wrote:

> BS doesn't do that, though. Even in the beginning of ASCII it was only supported on some devices.
>
> Doing `echo -e 'xy\x08z'` results in `xz` and not in https://imgur.com/B0020Xb
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From abrahamgross at disroot.org  Wed Jun 10 01:05:57 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 06:05:57 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org>
Message-ID: <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>

2020/06/10 午前1:45:32 Garth Wallace via Unicode :

> Would x OVERSTRIKE z look the same as z OVERSTRIKE x? If yes, would they be considered identical for string matching purposes?

They would look the same. In a perfect world they would be identical for string matching, but since it's a new control character I would understand if people don't want to put in the effort to adopt it properly.

> Would they have to be reordered for normalization?

Not sure what this means, but if I understand it correctly, then this might actually be a good idea for collation. But it might also be too much effort to implement, so it's not necessary. Like the Japanese saying goes, 「シンプルイズベスト」 (https://www.weblio.jp/content/Simple+is+Best).

> What would be the repercussions for collation?

I would say just take the first character in the sequence of overstruck characters and use that as the basis of collation. If this doesn't work, then I'm always open to suggestions.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
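(Garth's normalization question has a concrete analogue in existing combining marks: marks with different canonical combining classes commute under normalization, so either typing order matches, while an arbitrary overstrike operator would come with no such per-character data. A minimal Python 3 sketch using only the standard library:)

    import unicodedata

    # Dot below (ccc 220) and dot above (ccc 230) typed in either order:
    a = "q\u0323\u0307"  # q + COMBINING DOT BELOW + COMBINING DOT ABOVE
    b = "q\u0307\u0323"  # q + COMBINING DOT ABOVE + COMBINING DOT BELOW

    # Canonical reordering makes the two sequences equivalent...
    assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    # ...because each mark carries a combining class in the character data:
    print(unicodedata.combining("\u0323"), unicodedata.combining("\u0307"))  # 220 230

    # A hypothetical "x OVERSTRIKE z" carries no combining-class data, so
    # normalization would have no basis for equating it with "z OVERSTRIKE x".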
From jameskasskrv at gmail.com  Wed Jun 10 02:36:31 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 10 Jun 2020 07:36:31 +0000
Subject: OverStrike control character
In-Reply-To: <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
Message-ID: <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>

On 2020-06-10 6:05 AM, abrahamgross--- via Unicode wrote:
> Like the Japanese saying goes, 「シンプルイズベスト」
That's English written phonetically in katakana.

From richard.wordingham at ntlworld.com  Wed Jun 10 04:27:04 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 10 Jun 2020 10:27:04 +0100
Subject: OverStrike control character
In-Reply-To: <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org>
Message-ID: <20200610102704.3462a896@JRWUBU2>

On Wed, 10 Jun 2020 06:05:57 +0000 (UTC) abrahamgross--- via Unicode wrote:

> 2020/06/10 午前1:45:32 Garth Wallace via Unicode :
>
> > Would x OVERSTRIKE z look the same as z OVERSTRIKE x? If yes, would
> > they be considered identical for string matching purposes?
>
> They would look the same. In a perfect world they would be identical for string matching, but since it's a new control character I would understand if people don't want to put in the effort to adopt it properly.
>
> > Would they have to be reordered for normalization?
>
> Not sure what this means, but if I understand it correctly, then this might actually be a good idea for collation. But it might also be too much effort to implement, so it's not necessary. Like the Japanese saying goes, 「シンプルイズベスト」 (https://www.weblio.jp/content/Simple+is+Best).
>
> > What would be the repercussions for collation?
>
> I would say just take the first character in the sequence of overstruck characters and use that as the basis of collation. If this doesn't work, then I'm always open to suggestions.

But if they're identical for character matching, then they should collate identically, so this last bit is inherently broken.

Consider and in a proportional width font. Are you expecting the rendering system to position the 'l' using the knowledge that it will be overstruck? Overstriking is designed for a teletype with fixed-width characters. It takes special effort to get the overstriking effects on a video terminal or its emulation.

Richard.

From mark at kli.org  Wed Jun 10 07:59:16 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 10 Jun 2020 08:59:16 -0400
Subject: OverStrike control character
In-Reply-To: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>

On 6/9/20 10:50 PM, abrahamgross--- via Unicode wrote:
> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.

What are these "pixels" to which you refer? Fonts these days are defined in terms of strokes, not pixels. And Richard Wordingham points out the flaw in your notion of how it would be rendered, your claim that x OS z would look the same as z OS x:

> Consider and in a proportional
> width font. Are you expecting the rendering system to position the 'l'
> using the knowledge that it will be overstruck? Overstriking is
> designed for a teletype with fixed-width characters.

Besides, even if it worked as you said, with the narrow character centered, how long would it take before you found some examples that didn't really quite work out right?
Like overlaying a HEBREW LETTER YOD on a LATIN CAPITAL LETTER L, but what you really wanted was the YOD centered in the negative space of the L and not between the side-bearings, so next you'll want to be able to add some control over the exact positioning. And of course that won't work right in general, because it all depends on the font(s) involved.

And when it comes to matching, you say of x OS z and z OS x,

> In a perfect world they would be identical for string matching, but since it's a new control character I would understand if people don't want to put in the effort to adopt it properly.

But we're talking about making the rules here, the "perfect world." What should the *rule* be about string-matching? You can't have an optional rule, so a pair of strings will match on one system and not the other and both are right. Are we to understand that you think the rule should be that overstruck characters are considered to match in either order? Your gracious forgiveness of laxity in the rules doesn't really enter into the picture. And what about larger considerations? Can I have "ab⊕⊕xy" (using ⊕ for the overstrike) to overstrike a&x and b&y? What about "a⊕b⊕c⊕d⊕e⊕f⊕g⊕h"? What about "abc⊕d⊕⊕fg"? The f&b are overstruck and so are the c&d&g? Is that combination of c⊕d overstruck with g different from c⊕d⊕g or the same? What about other combinations? These are all things that need answers. What about overstriking a LTR character with a RTL one, or vice versa? Which way does the text go after that?

But I think what you're really missing is the crucial point that Garth Wallace pointed out:

> Display is not the only thing text is for.

You're focussing a lot on how characters *look*, can we get this letter to look a little different, can we layer characters to make other weird symbols (which will look radically different depending on the font)... You're looking at how to _draw_ stuff, how to make things look this way or that on paper or when rendered. But that's not what Unicode encodes. You need to think more about the distinction Unicode makes between characters and glyphs. "Plain text" isn't about display; it's about representing what's in a document, the characters which encode (in a different sense) the spoken language (usually) that is being communicated. All the things you're talking about are firmly in the realm of fonts and higher-level protocols. You surely could work out this overstriking display with a sufficiently-advanced font (you could make zero-advance-width overlaying characters and ligatures that would replace X⊕ with a zero-width equivalent of X, for example, in a monospace font), and you are welcome to do so, but that's where it belongs.

> It shouldn't do any fancy processing by default

Figuring out how much to backspace in order to center a glyph on another one, in a proportional-spaced font, is pretty fancy processing.

> Most systems have just about the same font, so I wouldn't worry about the results of overstriking not coming out perfect.

What a bland world you live in, wherein most fonts are the same! It's not about working with the default font on your favorite system; we're dealing with _characters_ here, which could be represented in ANY font.

~mark
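(One detail worth making concrete here: for the overlays that do carry meaning, Unicode already has dedicated combining overlay marks, with matching behavior defined by per-character data rather than by a generic overstrike rule. A minimal Python 3 sketch using only the standard library:)

    import unicodedata

    # "=" followed by COMBINING LONG SOLIDUS OVERLAY is canonically
    # equivalent to the precomposed NOT EQUAL TO, so NFC composes it:
    composed = unicodedata.normalize("NFC", "=\u0338")
    assert composed == "\u2260"
    print(unicodedata.name(composed))  # NOT EQUAL TO

    # That equivalence is one-off character data for this pair, not a
    # general rule: the same overlay on "x" stays a two-character sequence.
    assert unicodedata.normalize("NFC", "x\u0338") == "x\u0338"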
From sosipiuk at gmail.com  Wed Jun 10 09:39:08 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Wed, 10 Jun 2020 10:39:08 -0400
Subject: OverStrike control character
In-Reply-To: <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: 

On Tue, Jun 9, 2020 at 10:55 PM abrahamgross--- via Unicode wrote:
>
> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.
>
> Even if it doesn't come out perfect, I'd take almost-exact representation over no representation any day.

To be sure, it would be a cool feature to have. However, as Raymond Chen (a Windows developer) succinctly put it, "every feature starts out at minus 100 points". It's not just a matter of having some value; that value must outweigh the effort it would take to fully implement it, and it must compete with other features that that effort could go to instead.

The proposal here is to add a completely new feature to Unicode, that in turn demands an updated font rendering process, and to convince vendors to support it. It's not just a new character which can be added to a font. It's new behaviour for characters. That's big; that's a lot of effort.

The idea of just overlaying pixels is a tunnel view of what's involved. Character combination (as it's currently done) doesn't occur at that level. It's done earlier. Pixels are at the final level of display. You'd need a new set of routines to enable combination there. And it's not trivial then, either. What about anti-aliasing, sub-pixel rendering? Who will want to do all that? You'd need not just font support but an OS-level change in the rendering process, in all major OSes. All for a feature that's a bit of fun but isn't guaranteed to produce elegant results.

There is also the question of backward compatibility. Even if this change is included in new releases, there will be plenty of old systems out there that won't have any idea of what an overstrike character is. You won't just get an "unknown character" glyph, you'll get an "unknown character" glyph between the two characters you're trying to combine.

I'm not saying it can't be done, or that it wouldn't be a nice-to-have. Where there's a will, there's a way. Realistically, though, I don't predict a lot of will to get something like this done.

Sławomir Osipiuk

From abrahamgross at disroot.org  Wed Jun 10 10:45:49 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 10 Jun 2020 15:45:49 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org> <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>
Message-ID: 

2020/06/10 午前3:37:37 James Kass via Unicode :

>> On 2020-06-10 6:05 AM, abrahamgross--- via Unicode wrote:
>> Like the Japanese saying goes, 「シンプルイズベスト」
> That's English written phonetically in katakana.
>

It is, but "simple is best" doesn't mean anything in English, while it does mean something in Japanese. In English it would be something like "simplicity is always better".
From kent.b.karlsson at bahnhof.se  Wed Jun 10 11:12:04 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 10 Jun 2020 18:12:04 +0200
Subject: OverStrike control character
In-Reply-To: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
Message-ID: <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>

(You (all) apparently mean "overtype" rather than "overstrike"...; at least I read the latter as the same as crossed-out or strike-through.)

Well, however the overtyping, or overlapping, is achieved (BS, GCC (is there any implementation of that at all? I would strongly recommend against it), pinching the glyph spacing (there is a control sequence for that in ECMA-48) too much, or simply via how the font's glyphs are designed and spaced), there is no telling what the displayed result will be.

Doing such things really passes from being text display (styled or not) into the realm of graphics. And sure, you can do lots of things in graphic design, also with overlapping "graphic elements" (including glyphs for letters/digits/...). But as (possibly styled) TEXT, the displayed/printed result of overlaps would be "implementation defined". Please use a graphics editing program for controlling how overlapping graphic elements look; for overlapping, you may want to use different layers (graphics editing programs often support "layers") for the graphic elements that overlap, even if there are no layers when converting to (say) PNG.

(And for graphics, sorting, searching, and other text operations do not apply...; in HTML, images/graphics can have an "alt" text, which may or may not indicate what is in the image/graphics.)

/Kent Karlsson

> 10 juni 2020 kl. 14:59 skrev Mark E. Shoulson via Unicode:
>
> On 6/9/20 10:50 PM, abrahamgross--- via Unicode wrote:
>> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character.
>
> […]
From hsivonen at hsivonen.fi  Wed Jun 10 11:47:47 2020
From: hsivonen at hsivonen.fi (Henri Sivonen)
Date: Wed, 10 Jun 2020 19:47:47 +0300
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
In-Reply-To: <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net>
References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net>
Message-ID: 

Tom Honermann wrote:
> I'm researching support for UTF-8 encoded source files in various C++ compilers. Here is a brief snapshot of existing practice:
>
> Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is recognized and discarded.
> GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option. If GCC is expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM is *not* honored and its presence is likely to result in compilation error.
> GCC has no support for compiling a translation unit consisting of differently encoded source files.
> Microsoft Visual C++, by default, interprets source files as encoded according to Windows' Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM. The default encoding can be overridden with a command line option.
> IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent). z/OS is an EBCDIC-based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files. Many EBCDIC code pages exist and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.
>
> The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding.
...
> Various methods are being explored for how to support collections of mixed-encoding source files. The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.

Given your description of existing compiler behavior, I recommend making the C++ standard say that if a file (substitute the right ISO term for "file") starts with a UTF-8 BOM, the file must be interpreted as UTF-8 and the BOM be discarded before further processing. This already fits what you say GCC, clang, and MSVC do by default and would not be a compatibility-breaking change for IBM compilers (though I understood the IBM compilers are being superseded by clang anyway as far as implementing C++ versions later than C++11 goes). This would also facilitate migration to UTF-8 on Windows and z/OS.

Shawn Steele wrote:
> The modern viewpoint is that the BOM should be discouraged in all contexts.

If you are writing an HTML serializer that 1) is a component distinct from the HTTP layer and, therefore, cannot control the HTTP headers and 2) mustn't impose restrictions on the shape of the DOM and, therefore, mustn't inject a meta element on its own, the best approach is to use the UTF-8 BOM.

> I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.

This is problematic in contexts where there is non-UTF-8 legacy, the input arrives over time, and streaming processing of the input is expected. See https://hsivonen.fi/utf-8-detection/

Eli Zaretskii wrote:
> > From: Shawn Steele via Unicode
> >
> > I've been recommending that people assume documents are UTF-8. If the UTF-8 decoding fails, then
> > consider falling back to some other codepage.
>
> That strategy would fail with 7-bit ISO 2022 based encodings, no?

Yes. When HTML is labeled as UTF-8 and is valid UTF-8, Firefox disables the character encoding menu to prevent self-XSS and to prevent the user from introducing data corruption to forms. This is a bit of a problem with e.g. university servers that have acquired a server-wide HTTP-level UTF-8 declaration but that carry occasional ancient ISO-2022-JP content. So far, I've decided not to do anything about this.

Fortunately, the ISO 2022 series isn't really relevant (as a good approximation) to C++.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/
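(For illustration, a minimal Python 3 sketch of the rule recommended above, "a leading UTF-8 BOM wins and is stripped", combined with the decode-then-fall-back heuristic discussed after it. The cp1252 fallback is an illustrative assumption, and, as noted in the thread, no such heuristic can recognize 7-bit ISO 2022 data, since it is plain ASCII bytes:)

    import codecs

    def read_source(raw: bytes) -> str:
        # BOM wins: interpret as UTF-8 and drop the signature bytes.
        if raw.startswith(codecs.BOM_UTF8):
            return raw[len(codecs.BOM_UTF8):].decode("utf-8")
        # Otherwise presume UTF-8...
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # ...and fall back to a legacy code page (illustrative choice).
            # Caveat: ISO 2022 text decodes "successfully" as UTF-8 above,
            # so this heuristic never reaches the fallback for it.
            return raw.decode("cp1252")

    assert read_source(b"\xef\xbb\xbfint x;") == "int x;"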
From junicode at jcbradfield.org  Wed Jun 10 12:04:30 2020
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Wed, 10 Jun 2020 18:04:30 +0100 (BST)
Subject: OverStrike control character
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <2f501bae-9f2a-4bad-be3a-f1e48646db78@disroot.org> <9caa937c-f3d7-4132-a528-b435398785c0@disroot.org> <60bd95ba-42c1-fb3b-aaae-d6099f6a848d@gmail.com>
Message-ID: 

On 2020-06-10, abrahamgross--- via Unicode wrote:
> 2020/06/10 午前3:37:37 James Kass via Unicode :
>
>>> On 2020-06-10 6:05 AM, abrahamgross--- via Unicode wrote:
>>> Like the Japanese saying goes, 「シンプルイズベスト」
>> That's English written phonetically in katakana.
>
> It is, but "simple is best" doesn't mean anything in English, while it does mean something in Japanese. In English it would be something like "simplicity is always better".

I can only conclude you're not a native English speaker. We can noun adjectives if we want, just as we can verb nouns. But in compsci, we say it more forcefully: KISS!

From harjitmoe at outlook.com  Wed Jun 10 12:27:07 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Wed, 10 Jun 2020 17:27:07 +0000
Subject: OverStrike control character
In-Reply-To: <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>, <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
Message-ID: 

> From: Unicode on behalf of Kent Karlsson via Unicode
> Sent: Wednesday, June 10, 2020 5:12:04 PM
> […]
> Well, however the overtyping, or overlapping, is achieved (BS, GCC (is there any implementation of that at all? I would strongly recommend against it), pinching the glyph spacing (there is a control sequence for that in ECMA-48) too much, or simply via how the font's glyphs are designed and spaced), there is no telling what the displayed result will be.
> […]

Looking at Annex C of ECMA-43 (ECMA's designation for ISO 4873, in turn referenced from ISO 8859), GCC is only permitted because it is not supposed to create an effectively new character, but rather to "juxtapose" the characters in one position (i.e. force a ligature, which if unsupported could just be shown as a sequence of individual characters). Similarly, BS is prohibited precisely because it overstamps to create a new character that the target system cannot be expected to support properly. Which you rightly mention, and which makes sense even for rendering; and let us not forget filename handling, narrator software for the visually impaired, _et cetera_.

The example it gives is using GCC on the sequence Pts to represent a ligature form (i.e. U+20A7).

ECMA-48 (ISO 6429) defines GCC's coded representation and parameters as a CSI sequence, and gives as a mere example the simplest case of triggering display of two characters side by side in one kanji width, i.e. what the Japanese era name ligatures, the CJK Compatibility block unit symbols, _et cetera_ do.

So apparently, GCC was (from what I can tell from the standards themselves) an attempt at defining a general mechanism for coding arbitrary ligatures and arbitrary CJK squared forms. Not character overstamping.

As a final note, I should probably mention that the best existing way to create an overstamped character cluster in HTML5 is probably to use embedded SVG.
But for the reasons mentioned, this would inherently not be very good for accessibility.
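(For concreteness, a minimal Python 3 sketch of that embedded-SVG approach: two glyphs stamped at the same baseline point. The coordinates and sizing are illustrative assumptions, since a real implementation would need actual font metrics, and, as noted, the result is effectively graphics rather than accessible text:)

    # Build a tiny SVG that draws two characters at the same origin,
    # so they overstamp each other when the image is rendered.
    SVG = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="40" height="40">'
        '<text x="20" y="30" text-anchor="middle" font-size="32">{a}</text>'
        '<text x="20" y="30" text-anchor="middle" font-size="32">{b}</text>'
        "</svg>"
    )

    def overstamp_svg(a: str, b: str) -> str:
        return SVG.format(a=a, b=b)

    with open("overstamp.svg", "w", encoding="utf-8") as f:
        f.write(overstamp_svg("x", "z"))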
From richard.wordingham at ntlworld.com  Wed Jun 10 14:13:37 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 10 Jun 2020 20:13:37 +0100
Subject: OverStrike control character
In-Reply-To: <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
Message-ID: <20200610201337.582a0775@JRWUBU2>

On Wed, 10 Jun 2020 18:12:04 +0200 Kent Karlsson via Unicode wrote:

> (You (all) apparently mean "overtype" rather than "overstrike"...; at least I read the latter as the same as crossed-out or strike-through.)

It is not the same, though crossing out can be implemented by overstriking.

Richard.
From kent.b.karlsson at bahnhof.se  Wed Jun 10 16:18:23 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 10 Jun 2020 23:18:23 +0200
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>
Message-ID: <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>

> Looking at Annex C of ECMA-43 (ECMA's designation for ISO 4873, in turn referenced from ISO 8859), GCC is only permitted because it is not supposed to create an effectively new character, but rather to "juxtapose" the characters in one position (i.e. force a ligature, which if unsupported could just be shown as a sequence of individual characters).

Annex C says nothing about GCC. And the Pts example you mention below I don't find either...

I dread to make long quotes from the ECMA-48 text here, but I'll make a short one (from Annex A):

--------
Such a device may, however, process the sequence:
= BS /
in such a way that it is preserved and can be forwarded to a device which can indeed produce the intended composite symbol.

This example serves only the purpose of illustrating the difference between the effects of editor and formator functions. Where two or more graphic characters are to be imaged by a single graphic symbol, this should be done by using the control function GRAPHIC CHARACTER COMBINATION (GCC).
-------

So yes, GCC is (or rather: was) intended for overtyping (among other things...), exemplified by composing a "not equal to" symbol. Unicode does that particular example in a different way, of course. While I do think much of ECMA-48 does have a future (ever used a terminal emulator?), I don't think GCC has a future... Nor the interpretation of BS exemplified above... Nor an "OVERSTRIKE" control character...

/Kent K

> Similarly, BS is prohibited precisely because it overstamps to create a new character that the target system cannot be expected to support properly.
>
> […]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From harjitmoe at outlook.com  Wed Jun 10 16:21:41 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Wed, 10 Jun 2020 21:21:41 +0000
Subject: OverStrike control character
In-Reply-To: <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se>, <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
Message-ID: 

I was referring to Annex C of ECMA-43, not ECMA-48.

Get Outlook for Android
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kent.b.karlsson at bahnhof.se  Wed Jun 10 16:40:26 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 10 Jun 2020 23:40:26 +0200
Subject: OverStrike control character
In-Reply-To: 
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
Message-ID: 

Ok, "3" not "8". The text you do refer to then seems to partially contradict ECMA-48, then. It is also technically wrong in the "Pts" example: it cannot be "GCC P t s" (the intent here was that the result is the single symbol U+20A7, not "Pts"), but must be "CSI 1 _ P t s CSI 2 _" (the GCC control sequence has a SPACE before the final "_"), since there are three characters composed.

/Kent K
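(To make that byte sequence concrete: per ECMA-48, GCC is a CSI-introduced control with an intermediate SPACE byte and final byte "_", where parameter 1 opens and 2 closes the run to be imaged as one symbol. A minimal Python 3 sketch; as the thread notes, essentially no modern terminal implements GCC, so this only shows the encoding:)

    ESC = b"\x1b"
    CSI = ESC + b"["

    def gcc(chars: bytes) -> bytes:
        # GRAPHIC CHARACTER COMBINATION: CSI Ps SP _ (final byte 0x5F,
        # preceded by the intermediate byte SPACE, 0x20).
        start = CSI + b"1 _"  # Ps = 1: start of the combined string
        end = CSI + b"2 _"    # Ps = 2: end of the combined string
        return start + chars + end

    # "P", "t", "s" to be imaged as a single symbol (the peseta sign):
    print(gcc(b"Pts"))  # b'\x1b[1 _Pts\x1b[2 _'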
control character... > > /Kent K > > >> Similarly, BS is prohibited precisely because it overstamps to create a new character that the target system cannot be expected to support properly. Which you rightly mention, and which makes sense even for rendering, and let us not forget filename handling, narrator software for the visually impaired, _et cetera_? >> >> The example it gives is using GCC on the sequence Pts to represent a ligature form (i.e. U+20A7). >> >> ECMA-48 (ISO 6429) defines GCC's coded representation and parameters as a CSI sequence, and gives as a mere example the simplest case of triggering display of two characters side-by-side in one kanji width, i.e. what the Japanese era name ligatures, the CJK Compatibility block unit symbols, _et cetera_ do. >> >> So apparently, GCC was (from what I can tell from the standards themselves) an attempt at defining a general mechanism for coding arbitrary ligatures and arbitrary CJK squared forms. Not character overstamping. >> >> As a final note, I should probably mention that the best existing way to create an overstamped character cluster in HTML5 is probably to use embedded SVG. But for the reasons mentioned, this would inherently not be very good for accessibility. >> From: Unicode > on behalf of Kent Karlsson via Unicode > >> Sent: Wednesday, June 10, 2020 5:12:04 PM >> To: Mark E. Shoulson > >> Cc: unicode at unicode.org > >> Subject: Re: OverStrike control character >> >> (You (all) apparently mean ?overtype? rather than ?overstrike??; at least I read the latter as the same as crossed-out or strike-through.) >> >> Well, however the overtyping, or overlapping, is achieved (BS, GCC (is there any implementation of that at all? I would strongly recommend against it), pinching the glyph spacing (there is a control sequence for that in ECMA-48) too much, or simply via how the font?s glyphs are designed and spaced), there is no telling what the displayed result will be. >> >> Doing such things really passes from being text display (styled or not) into the realm of graphics. And sure, you can do lots of things in graphic design, also with overlapping ?graphic elements? (including glyphs for letters/digits/...). But as (possibly styled) TEXT, the displayed/printed result of overlaps would be ?implementation defined?. Please use a graphics editing program for controlling how overlapping graphic elements look like; for overlapping, you may want to use different layers, graphics editing programs often support ?layers?, for the graphic elements that overlap even if there are no layers when converting to (say) PNG. >> >> (And for graphics, sorting, searching, and other text operations do not apply?; in HTML for images/graphics, you can have an ?alt? text, which may or may not, indicate what is in the image/graphics.) >> >> /Kent Karlsson >> >> > 10 juni 2020 kl. 14:59 skrev Mark E. Shoulson via Unicode >: >> > >> > On 6/9/20 10:50 PM, abrahamgross--- via Unicode wrote: >> >> It should just simply overlay the pixels of the two characters, with a thin character going in the center of a wider character. >> > >> > What are these "pixels" to which you refer? Fonts these days are defined in terms of strokes, not pixels. And Richard Wordingham points out the flaw in your notion of how it would be rendered, your claim that x OS z would look the same as z OS x: >> > >> >> Consider and in a proportional >> >> width font. Are you expecting the rendering system to position the 'l' >> >> using the knowledge that it will be overstruck? 
Overstriking is >> >> designed for a teletype with fixed width characters. >> > Besides, even if it worked as you said, with the narrow character centered, how long would it take before you found some examples that didn't really quite work out right? Like overlaying a HEBREW LETTER YOD on a LATIN CAPITAL LETTER L, but what you really wanted was the YOD centered in the negative space of the L and not between the side-bearings, so next you'll want to be able to add some control over the exact positioning. And of course that won't work right in general, because it all depends on the font(s) involved. >> > >> > And when it comes to matching, you say of x OS z and z OS x, >> > >> >> In a perfect world they would be identical for string matching, but since its a new control character I would understand if ppl don't want to put in the effort to adopt it properly. >> > >> > But we're talking about making the rules here, the "perfect world." What should the *rule* be about string-matching? You can't have an optional rule, so a pair of strings will match on one system and not the other and both are right. Are we to understand that you think the rule should be that overstruck characters are considered to match in either order? Your gracious forgiveness of laxity in the rules doesn't really enter into the picture. And what about larger considerations? Can I have "ab??xy" (using ? for the overstrike) to overstrike a&x and b&y? What about "a?b?c?d?e?f?g?h"? What about "abc?d??fg"? The f&b are overstruck and so are the c&d&g? Is that combination of c?d overstruck with g different from c?d?g or the same? What about other combinations? These are all things that need answers. What about overstriking a LTR character with a RTL one, or vice-versa? Which way does the text go after that? >> > >> > But I think what you're really missing is the crucial point that Garth Wallace pointed out: >> > >> >> Display is not the only thing text is for. >> > >> > You're focussing a lot on how characters *look*, can we get this letter to look a little different, can we layer characters to make other weird symbols (which will look radically different depending on the font)... You're looking at how to _draw_ stuff, how to make things look this way or that on paper or when rendered. But that's not what Unicode encodes. You need to think more about the distinction Unicode makes between characters and glyphs. "Plain text" isn't about display, it's about representing what's in a document, the characters which encode (in a different sense) the spoken language (usually) that is being communicated. All the things you're talking about are firmly in the realm of fonts and higher-level protocols. You surely could work out this overstriking display with a sufficiently-advanced font (you could make zero-advance-width overlaying characters and ligatures that would replace X? with a zero-width equivalent of X, for example, in a monospace fo! >> nt), and you are welcome to do so, but that's where it belongs. >> >> It shouldnt do any fancy processing by default >> > Figuring out how much to backspace in order to center a glyph on another one, in a proportional-spaced font, is pretty fancy processing. >> >> Most systems have just about the same font so I wouldn't worry about the results of overstriking not coming out perfect. >> > >> > What a bland world you live in, wherein most fonts are the same! 
It's not about working with the default font on your favorite system; we're dealing with _characters_ here, which could be represented in ANY font. >> > >> > ~mark >> > >> > >> > >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kent.b.karlsson at bahnhof.se Wed Jun 10 16:53:06 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 10 Jun 2020 23:53:06 +0200 Subject: OverStrike control character In-Reply-To: <20200610201337.582a0775@JRWUBU2> References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <20200610201337.582a0775@JRWUBU2> Message-ID: > 10 juni 2020 kl. 21:13 skrev Richard Wordingham via Unicode : > > On Wed, 10 Jun 2020 18:12:04 +0200 > Kent Karlsson via Unicode wrote: > >> (You (all) apparently mean ?overtype? rather than ?overstrike??; at >> least I read the latter as the same as crossed-out or strike-through.) > > It is not the same, though crossing out can be implemented by > overstriking. Whichever is the best term, I first thought (just seeing the subject line) the suggestion was about what ECMA-48 does via CSI 9m? And is supported also in HTML, MS Word, and surely other formats. Some ?mark-downs? (a bit surprisingly) use -crossedout text-; try pasting the typical output from the Unix/Linux command ?ls -l? and paste that into some place using that mark-down? Or try it with any text that uses hyphens (as HYPHEN-MINUS). (I did not like the result?) /K > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at honermann.net Thu Jun 11 00:00:49 2020 From: tom at honermann.net (Tom Honermann) Date: Thu, 11 Jun 2020 01:00:49 -0400 Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature? In-Reply-To: References: <04b320a8-6948-65ce-2561-4e6a81de892c@honermann.net> <4a173391-8436-1f3f-c311-f4a9960d288e@honermann.net> Message-ID: <014fe622-011e-b225-b769-4753a8af7883@honermann.net> On 6/10/20 12:47 PM, Henri Sivonen via Unicode wrote: > Tom Honermann wrote: >> I'm researching support for UTF-8 encoded source files in various C++ compilers. Here is a brief snapshot of existing practice: >> >> Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is recognized and discarded. >> GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option. If GCC is expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM is *not* honored and its presence is likely to result in compilation error. GCC has no support for compiling a translation unit consisting of differently encoded source files. >> Microsoft Visual C++, by default, interprets source files as encoded according to the Windows' Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM. The default encoding can be overridden with a command line option. >> IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent). z/OS is an EBCDIC based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files. 
>> Many EBCDIC code pages exist, and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.
>>
>> The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding.
> ...
>> Various methods are being explored for how to support collections of mixed encoding source files. The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.
> Given your description of existing compiler behavior, I recommend making the C++ standard say that if a file (substitute the right ISO term for "file") starts with a UTF-8 BOM, the file must be interpreted as UTF-8 and the BOM be discarded before further processing. This already fits what you say GCC, clang, and MSVC do by default and would not be a compatibility-breaking change for IBM compilers (though I understood the IBM compilers are being superseded by clang anyway as far as implementing C++ versions later than C++11 goes). This would also facilitate migration to UTF-8 on Windows and z/OS.

Thank you, Henri; this matches my inclination. If anyone else has dissenting opinions, please share them.

(The Clang ports to z/OS support distinct EBCDIC and ASCII modes, so they don't escape these concerns.)

> Shawn Steele wrote:
>> The modern viewpoint is that the BOM should be discouraged in all contexts.
> If you are writing an HTML serializer that 1) is a component distinct from the HTTP layer and, therefore, cannot control the HTTP headers, and 2) mustn't impose restrictions on the shape of the DOM and, therefore, mustn't inject a meta element on its own, the best approach is to use the UTF-8 BOM.
>
>> I'd recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
> This is problematic in contexts where there is non-UTF-8 legacy, the input arrives over time, and streaming processing of the input is expected. See https://hsivonen.fi/utf-8-detection/
>
> Eli Zaretskii wrote:
>>> From: Shawn Steele via Unicode
>>>
>>> I've been recommending that people assume documents are UTF-8. If the UTF-8 decoding fails, then consider falling back to some other codepage.
>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
> Yes. When HTML is labeled as UTF-8 and is valid UTF-8, Firefox disables the character encoding menu to prevent self-XSS and to prevent the user from introducing data corruption to forms. This is a bit of a problem with e.g. university servers that have acquired a server-wide HTTP-level UTF-8 declaration but that carry occasional ancient ISO-2022-JP content. So far, I've decided not to do anything about this.
>
> Fortunately, the ISO 2022 series isn't really relevant (as a good approximation) to C++.

Additionally, falling back to another code page is not appropriate in contexts where proper diagnosis of ill-formed UTF-8 text is desired. For source code in particular, a fallback to ISO-8859-1 triggered by ill-formed UTF-8 in a string literal would result in silent miscompilation. The performance overhead of fallback would also not be acceptable for C++ compilation (where compilation performance is already a challenge).

Tom.
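To make the recommended behavior concrete: below is a minimal sketch, assuming a front end that has already read the source file into a byte buffer; strip_utf8_bom is a made-up name for illustration, not a function of any of the compilers discussed above.

```
// Minimal sketch: recognize and discard a UTF-8 BOM (EF BB BF).
// If the signature is present, the buffer should be treated as UTF-8.
#include <cstdio>
#include <string>

// Returns true and removes the signature if buf starts with a UTF-8 BOM.
// (Hypothetical helper, for illustration only.)
bool strip_utf8_bom(std::string& buf) {
    if (buf.size() >= 3 &&
        static_cast<unsigned char>(buf[0]) == 0xEF &&
        static_cast<unsigned char>(buf[1]) == 0xBB &&
        static_cast<unsigned char>(buf[2]) == 0xBF) {
        buf.erase(0, 3);  // drop the signature before lexing
        return true;
    }
    return false;
}

int main() {
    std::string src = "\xEF\xBB\xBFint main() {}";
    bool had_bom = strip_utf8_bom(src);
    std::printf("had BOM: %d, remainder: %s\n", had_bom, src.c_str());
}
```

Under the rule Henri proposes, a buffer that starts with the signature is unconditionally UTF-8; a buffer without it falls under whatever default the implementation documents.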
From richard.wordingham at ntlworld.com Thu Jun 11 03:18:55 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 11 Jun 2020 09:18:55 +0100
Subject: OverStrike control character
In-Reply-To: <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se>
Message-ID: <20200611091855.6e25a559@JRWUBU2>

On Wed, 10 Jun 2020 23:18:23 +0200 Kent Karlsson via Unicode wrote:

> So yes, GCC is (or rather: was) intended for overtyping (among other things...), exemplified by composing a “not equal to” symbol. Unicode does that particular example in a different way of course. While I do think much of ECMA-48 does have a future (ever used a terminal emulator?), I don't think GCC has a future… Nor the interpretation of BS exemplified above… Nor an “OVERSTRIKE” control character...

I suspect terminal emulators still need to handle GCC. Years ago, underlining in man pages used to be presented by overstriking letters with an underscore. It was a major pain when copying a man page to another medium, or even to a video terminal that didn't understand the convention. Nowadays, the output is tailored to the destination, so piping through od -c doesn't reveal how underlining is achieved in more modern systems. (However, 20-year-old Unix boxes are still being used with their original operating systems, and rlogin and its analogues are still in use.) I could be wrong. The emacs shell window (27.0.50) doesn't handle man page underlining, and I assume it's not enough of an irritant to get fixed.

GCC is seriously inconsistent with context-sensitive layout.

Richard.

From sosipiuk at gmail.com Thu Jun 11 09:52:29 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 11 Jun 2020 10:52:29 -0400
Subject: OverStrike control character
In-Reply-To: <20200611091855.6e25a559@JRWUBU2>
References: <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <3A04B8DD-527D-4DDC-99C6-C8CC5F430B46@bahnhof.se> <494AEDF4-3BC9-4CFA-ADFA-EE874FED34CE@bahnhof.se> <20200611091855.6e25a559@JRWUBU2>
Message-ID:

On Thu, Jun 11, 2020 at 4:26 AM Richard Wordingham via Unicode wrote:
>
> I suspect terminal emulators still need to handle GCC. Years ago, underlining in man pages used to be presented by overstriking letters with an underscore.

Of course, the proper way to underline is by using SGR (select graphic rendition), not GCC.
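For anyone who still meets the teletype convention Richard describes (underscore, BS, letter for underlining; letter, BS, same letter for bold), the classic cleanup is what col -b does: let each backspace consume the byte before it. The following is a rough sketch of that idea, not the actual col implementation, and byte-oriented, so it assumes the ASCII-era data the convention was designed for:

```
// Sketch of col -b style processing: in "X BS Y" sequences keep only Y,
// the character struck last, dropping teletype underline/bold overstrikes.
#include <iostream>
#include <string>

std::string drop_overstrikes(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '\b') {
            if (!out.empty()) out.pop_back();  // backspace erases the previous byte
        } else {
            out.push_back(c);
        }
    }
    return out;
}

int main() {
    // "_\bN_\ba_\bm_\be" is how old nroff output underlined "Name".
    std::cout << drop_overstrikes("_\bN_\ba_\bm_\be") << '\n';  // prints "Name"
}
```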
From doug at ewellic.org Fri Jun 12 10:12:31 2020
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 12 Jun 2020 09:12:31 -0600
Subject: OverStrike control character
Message-ID: <001901d640cb$e545e470$afd1ad50$@ewellic.org>

If we're going to get all ECMA-48 about this, there is also CUB (CSI D), which moves the "active presentation position" back one character, or HPB (CSI j), which moves the "active data position" back one character. (Notice that even ECMA-48 understands the difference between presentation and data.) Both of these take an optional numeric parameter, so you can back up more than one position if you want.

So you have those. And if they don't work for you, well, then they don't work. It's no different from adding an overstrike mechanism to Unicode and expecting it to work for everyone, in all editing and displaying contexts, with all fonts and rendering engines, on all platforms.

This is a very non-Unicode concept, and I would suggest re-reading what Ken Whistler and others had to say about it.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From kent.b.karlsson at bahnhof.se Sun Jun 14 17:33:29 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Mon, 15 Jun 2020 00:33:29 +0200
Subject: OverStrike control character
In-Reply-To: <001901d640cb$e545e470$afd1ad50$@ewellic.org>
References: <001901d640cb$e545e470$afd1ad50$@ewellic.org>
Message-ID: <6174958D-2780-4FD1-8890-51EECC3DCF20@bahnhof.se>

> On 12 June 2020, at 17:12, Doug Ewell via Unicode wrote:
>
> If we're going to get all ECMA-48 about this, there is also CUB (CSI D), which moves the "active presentation position" back one character, or HPB (CSI j), which moves the "active data position" back one character. (Notice that even ECMA-48 understands the difference between presentation and data.)

Note that CUB (and related: CUD, CUF, CUU) moves the cursor position (“active position”) AFTER GCC (if anything anywhere supported that one), line breaking, bidi (ECMA-48 had its own approach to that) and line/character directions have been applied. Note that CUB moves to the left even for text with vertical lines; it's a purely “visual” move.

HPB (and related: HPR) moves the cursor position BEFORE all that (in what is nowadays referred to as the “backing store”); it's a “logical” move. (They need to be synched up, moving one will move the other, but ECMA-48 does not give exact details.) Such cursor movement control sequences are not suitable to store in the “backing store”, but ECMA-48 does not say so explicitly.

And CUB (with the absent parameter defaulted to 1) is what your favorite terminal emulator sends (to the program reading what you type on the keyboard) when you press the left arrow key on the keyboard. Maybe arrow keys in terminal emulators should have sent HPB/CNL/CPL/HPR (and have them interpreted as specified) instead of CUB/CUD/CUU/CUF. Sometimes it seems like CUB/CUF are interpreted as if they were HPB/HPR.

However, this is for terminal emulators only. Otherwise, “window” programs use keystroke events, not bothering with these cursor movement control sequences. But terminal emulators will be with us for quite some time yet!

/Kent K

> [...]

From corentin.jabot at gmail.com Sun Jun 14 17:17:41 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Mon, 15 Jun 2020 00:17:41 +0200
Subject: What constitutes an abstract character?
Message-ID:

Hello

While trying to define suitable semantics for the lexing of C++, we seem to fail to agree on the definition of abstract characters.

Notably:
- Would diacritic marks considered in isolation be considered abstract characters?
- What about Hangul Jamos and other marks that are not found in isolation in their respective scripts, variation selectors, etc.?

I guess another way to phrase my question is: does every assigned code point represent on its own an abstract character?

My understanding is that it is not the case, but I am eager to be enlightened.

Thanks,

Corentin

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xfq.free at gmail.com Sun Jun 14 19:47:37 2020
From: xfq.free at gmail.com (Fuqiao Xue)
Date: Mon, 15 Jun 2020 08:47:37 +0800
Subject: What constitutes an abstract character?
In-Reply-To:
References:
Message-ID:

Hi Corentin,

The term "abstract character" is ambiguous and can have multiple definitions. Depending on what you need, it can refer to the visual (i.e., grapheme), logical (i.e., code point), or byte-level (i.e., code unit) representation of a given piece of text.

FYI - W3C developed a Character Model document, which includes some guidelines on "characters" and may be useful to you: https://www.w3.org/TR/charmod/

Cheers,

Fuqiao

> On Mon, 15 Jun 2020 at 8:01, Corentin via Unicode wrote:
> [...]

From asmusf at ix.netcom.com Mon Jun 15 01:42:07 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 14 Jun 2020 23:42:07 -0700
Subject: What constitutes an abstract character?
In-Reply-To:
References:
Message-ID: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From xfq.free at gmail.com Mon Jun 15 09:22:26 2020
From: xfq.free at gmail.com (Fuqiao Xue)
Date: Mon, 15 Jun 2020 22:22:26 +0800
Subject: What constitutes an abstract character?
In-Reply-To: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
References: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
Message-ID:

Thanks for the hint Asmus - what you said makes sense and is very useful information in addition to the definitions in The Unicode Standard Section 3.4. I forgot that the term "abstract character" is defined in TUS, sorry.

Fuqiao

On Mon, 15 Jun 2020 at 14:44, Asmus Freytag via Unicode wrote:
> On 6/14/2020 5:47 PM, Fuqiao Xue via Unicode wrote:
>> [...]
>
> An abstract character is related to a code point by the character encoding. See definitions D7 and D10-D12 in Section 3.4, Characters and Encoding. (http://www.unicode.org/versions/latest/ch03.pdf#G2212)
>
> It is never a "code unit" or a "byte-level" thing. It is also not the code point.
>
> It is the thing that is being assigned a code point. (D11: "Encoded character: An association (or mapping) between an abstract character and a code point." -- the definition should really have an added "or code point sequence".
> Unicode finesses that by saying that sequences never encode an abstract character directly, but they can be used to "represent" it; see the comment on D7. That formally makes encoding a 1:1 process, but muddies the waters a bit on what we should consider an 'abstract character'. For example, it means that all "building blocks" of any sequences must be seen as abstract characters themselves.)
>
> Now the abstract character A-diaeresis (ä) is encoded by a single code point and also has a canonically equivalent representation by a combining sequence. In effect, the whole sequence "encodes" a single abstract character, but that is formally not how Unicode defines it.
>
> A diaeresis is a recognizable item of the writing system; if used as an umlaut, it tends to act as a decoration of a character that is more-or-less seen as a new entity (particularly in Swedish), and less as a modified letter A. If used as a diaeresis, it acts more like a punctuation mark that has a function of its own (forcing separate pronunciation). Even though it's graphically applied to a vowel, it can be understood as its own abstract character.
>
> Treating the diaeresis as its own independent abstract character makes logical and not just formal sense. That may not be the case equally for all types of diacritical marks. However, since they can all be named, and thus arguably exist as their own concepts at least on a descriptive level, it becomes effectively a non-problem.
>
> The way combining marks are treated in other scripts, they can all be on different points of the scale as logically independent entities, and some are even on different points of the scale in terms of graphically combining (they may be graphically indistinguishable from regular spacing letters).
>
> To recap, an "abstract" character is a conceptual character, something that forms the atom of a writing system (smallest divisible particle) as viewed from the process of encoding, which associates with it a single code point. "Abstract" characters may exist that are not encoded; and some of them can be analyzed as series of smaller abstract characters, and thus be represented as code point sequences.
>
> Some abstract characters are more like small molecules; they can be encoded as such, or they can also have a more atomic sequence that represents them. The rationale for allowing this dual nature is historical compatibility, not logical necessity, hence the model is in some ways not "pure" (just practical).
>
> A./
>
> PS: while the character model document tries to unravel the implications of the Unicode encoding model for W3C standards, it's not a substitute for the original definitions of how the Unicode Standard understands and defines the encoding process.
>
> [...]
From sosipiuk at gmail.com Mon Jun 15 09:33:04 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 15 Jun 2020 10:33:04 -0400
Subject: What constitutes an abstract character?
In-Reply-To:
References:
Message-ID: <000901d64321$e17f5ad0$a47e1070$@gmail.com>

I believe the underlying question is:

How does one programmatically identify and/or count the abstract characters in a Unicode text?

Sławomir Osipiuk

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kenwhistler at sonic.net Mon Jun 15 10:03:10 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 15 Jun 2020 08:03:10 -0700
Subject: What constitutes an abstract character?
In-Reply-To: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
Message-ID: <302647d5-c5ec-0623-b714-1804f0319ca9@sonic.net>

Not an interesting question, actually.

The units relevant to count in text, depending on what you are doing, are:

code units

code points

user-perceived characters (and other higher-level constructs which may be orthography-specific)

"Abstract characters" are an artifact of the formal encoding process. They are really only "counted" by character encoding committees, not by software processing text strings.

--Ken

On 6/15/2020 7:33 AM, Sławomir Osipiuk via Unicode wrote:
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From pgcon6 at msn.com Mon Jun 15 10:04:59 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Mon, 15 Jun 2020 15:04:59 +0000
Subject: What constitutes an abstract character?
In-Reply-To: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
Message-ID:

Unicode doesn't give one answer, since there's more than one way that might be appropriate to answer it.

You might want a count of Unicode code points. If a buffer contained a UTF-32 sequence, that would be the same as the sequence length divided by 4. (Counting them in UTF-16 or UTF-8 requires walking the sequence, obviously.) It would also mean that the text element _a-diaeresis_ could have a count of 1 in some cases but a count of 2 in other cases. An Old Hangul syllable might have a count of 1, 2 or 3, depending on the syllable.

You might want a count of NFC-composable entities. In that case, _a-diaeresis_ would always have a count of 1, but an Old Hangul syllable would have a count of 1, 2 or 3 depending on the syllable.

You might want a count of grapheme clusters, as defined in UAX #29. In that case, _a-diaeresis_ or any Old Hangul syllable would always have a count of 1.

Which way to count depends on one's purpose for counting.

Peter

From: Unicode On Behalf Of Sławomir Osipiuk via Unicode
Sent: Monday, June 15, 2020 7:33 AM
To: 'Corentin' ; 'unicode Unicode Discussion'
Subject: RE: What constitutes an abstract character?

I believe the underlying question is: [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
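To make Peter's distinctions concrete in code: the sketch below, assuming UTF-8 input, shows the first two counts; count_code_points_utf8 is an illustrative name, not an established API. A grapheme cluster count is deliberately omitted, since it requires the UAX #29 segmentation rules (available, for example, through ICU's BreakIterator).

```
// Sketch: counting code units vs. code points for a UTF-8 buffer.
// Code units are bytes; a code point starts at every byte that is not
// a UTF-8 continuation byte (10xxxxxx).
#include <cstdio>
#include <string>

std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80)  // skip continuation bytes
            ++n;
    return n;
}

int main() {
    std::string s = "a\xCC\x88";  // 'a' + U+0308 COMBINING DIAERESIS
    std::printf("code units:  %zu\n", s.size());                   // 3
    std::printf("code points: %zu\n", count_code_points_utf8(s));  // 2
    // Grapheme clusters: 1, but computing that needs UAX #29 rules.
}
```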
From corentin.jabot at gmail.com Mon Jun 15 11:34:25 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Mon, 15 Jun 2020 18:34:25 +0200
Subject: What constitutes an abstract character?
In-Reply-To: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
References: <93aa538b-a0f6-ed81-bc20-73261786c1b4@ix.netcom.com>
Message-ID:

On Mon, 15 Jun 2020 at 08:44, Asmus Freytag via Unicode wrote:
> [...]
>
> Some abstract characters are more like small molecules; they can be encoded as such, or they can also have a more atomic sequence that represents them.
> The rationale for allowing this dual nature is historical compatibility, not logical necessity, hence the model is in some ways not "pure" (just practical).

Thanks for this detailed reply, this is exactly the answer I was looking for!

> A./
>
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Mon Jun 15 12:01:46 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 15 Jun 2020 18:01:46 +0100
Subject: What constitutes an abstract character?
In-Reply-To: <302647d5-c5ec-0623-b714-1804f0319ca9@sonic.net>
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com> <302647d5-c5ec-0623-b714-1804f0319ca9@sonic.net>
Message-ID: <20200615180146.6bb55149@JRWUBU2>

On Mon, 15 Jun 2020 08:03:10 -0700 Ken Whistler via Unicode wrote:

> Not an interesting question, actually.
>
> The units relevant to count in text, depending on what you are doing, are:
>
> code units
>
> code points
>
> user-perceived characters (and other higher-level constructs which may be orthography-specific)

Which in general have a user- and time-specific definition, as at least hinted at in Peter Constable's comment that the way to count depends on the purpose of counting.

Richard.

From abrahamgross at disroot.org Tue Jun 16 11:43:16 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 16 Jun 2020 16:43:16 +0000
Subject: OverStrike control character
In-Reply-To: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
References: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
Message-ID: <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>

> What are these "pixels" to which you refer? Fonts these days are defined in terms of strokes, not pixels.

Even though fonts are vectors, they still get rendered onto a raster screen. But the point was that they get overlaid and centered horizontally.

> Consider <l, OS, m> and <m, OS, l> in a proportional width font.
> Are you expecting the rendering system to position the 'l' using the knowledge that it will be overstruck? Overstriking is designed for a teletype with fixed width characters.

You can think of the knowledge of being overstruck like the knowledge fonts have of characters being combined with diacritics: fonts can specify an anchor point where the diacritics will go. Except with overstrike, the anchor will always be the center. The overstrike character is sorta like a ZWJ (zero width joiner) that turns the next character into a "diacritic". (Hope this explanation makes sense.)

> Besides, even if it worked as you said, with the narrow character centered, how long would it take before you found some examples that didn't really quite work out right? Like overlaying a HEBREW LETTER YOD on a LATIN CAPITAL LETTER L, but what you really wanted was the YOD centered in the negative space of the L and not between the side-bearings, so next you'll want to be able to add some control over the exact positioning. And of course that won't work right in general, because it all depends on the font(s) involved.

I will never make a proposal for the addition of a control character to control positioning. If it doesn't come out quite right, then either live with it, or find another character that fits better. After all, since fonts are different, you can't expect it to come out the same on every device (I'm agreeing with you on this). But I still think that an almost perfect rendition of the overstruck characters is way better than having none at all.

> Can I have "ab??xy" (using ? for the overstrike) to overstrike a&x and b&y? What about "a?b?c?d?e?f?g?h"? What about "abc?d??fg"? The f&b are overstruck and so are the c&d&g? Is that combination of c?d overstruck with g different from c?d?g, or the same? What about other combinations? These are all things that need answers.

You can only put one overstrike character in a row. If you type a second one, it gets ignored. So ab??xy will render as "a[bx]y", where [bracketed] characters are rendered overlaid. a?b?c?d?e?f?g?h will look like [abcdefgh], all on top of each other. abc?d??fg will be ab[cdf]g. To get a[bf][cdg] you need to type ab?fc?d?g. The exact rendering will of course depend on the font of the device you're using, but again, I still think that an almost perfect rendition of the overstruck characters is way better than having none at all.

> What about overstriking a LTR character with a RTL one, or vice-versa? Which way does the text go after that?

The text after that goes in the direction of the text that follows. So for ?L??????? it's gonna look like ?[L???]??????, and for ?L??ab? it's gonna look like ?[L?]ab?. Meaning that only the very next letter gets overstruck, and anything afterwards continues on like it would normally.

Going back to the L?י example, here's what it would look like: https://imgur.com/a/N9QApwh

Here's a short command to generate the images, and it works for any 2-letter combination. Just replace what comes after `label:` with the letters whose overstriking you want to test. (You need to install ImageMagick before using this, though.)

```
magick -background none -pointsize 100 \
  label:L label:י \
  \( -clone 0 \) -delete "%[fx:u.w>v.w?2:0]" \
  -gravity center -compose Over -composite \
  -background white -flatten \
  out.png
```

How it would look in a serif font:

```
magick -background none -font "FreeSerif" -pointsize 100 \
  label:L label:י \
  \( -clone 0 \) -delete "%[fx:u.w>v.w?2:0]" \
  -gravity center -compose Over -composite \
  -background white -flatten \
  out.png
```

For Windows users:

```
magick ^
  in0.png in1.png ( -clone 0 ) ^
  -delete "%%[fx:u.w>v.w?2:0]" ^
  -compose Over -composite ^
  -background white -flatten ^
  out.png
```

Here's an example of a symbol that isn't widespread enough for its own codepoint, but which can be easily implemented through the overstrike character:

The symbol in usage: https://imgur.com/AMAVrZT
The symbol made by overstriking 𝄞?♥: https://imgur.com/46ReTNu

The ImageMagick command to generate it:

```
magick -background none -font "FreeSerif" -pointsize 100 \
  label:𝄞 label:♥ \
  \( -clone 0 \) -delete "%[fx:u.w>v.w?2:0]" \
  -gravity center -compose Over -composite \
  -background white -flatten \
  clefheart.png
```

As for arrow keys, a press should pass over an overstruck combination in a single key press. However, the backspace key should remove only one of the overstruck characters, and not both/all at once. It should work like combining diacritics do: it takes 3 presses of the arrow keys to go past "xỳz", but 4 Backspace key presses to remove all of it, because the grave gets deleted separately from the "y".

From richard.wordingham at ntlworld.com Tue Jun 16 12:05:22 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 16 Jun 2020 18:05:22 +0100
Subject: OverStrike control character
In-Reply-To: <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
References: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <20200616180522.3a2bbcae@JRWUBU2>

On Tue, 16 Jun 2020 16:43:16 +0000 Abraham Gross via Unicode wrote:

> You can think of the knowledge of being overstruck like the knowledge fonts have of characters being combined with diacritics: fonts can specify an anchor point where the diacritics will go. Except with overstrike, the anchor will always be the center. The overstrike character is sorta like a ZWJ (zero width joiner) that turns the next character into a "diacritic". (Hope this explanation makes sense.)

You miss the problem. There is an issue of advance width. Font writers by and large don't seem very fond of making 'i' with a circumflex wider than one without. Some bite the bullet: there is at least one Arabic font where adding vowel marks changes the consonant skeleton. Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.

Richard.
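To put numbers on Richard's objection: if the overstruck pair is centered, the cluster's advance has to be the larger of the two glyph advances, in either order. Below is a toy sketch with invented metrics; a real implementation would take the advances from the font.

```
// Toy model of centering one glyph over another, as the proposed OS
// control would require. The advance widths are made-up numbers.
#include <algorithm>
#include <cstdio>

struct Placement {
    double advance;  // advance width of the whole overstruck cluster
    double x1, x2;   // x offsets that center each glyph in the cluster
};

Placement overstrike(double w1, double w2) {
    double cell = std::max(w1, w2);  // cluster advance = the wider glyph
    return { cell, (cell - w1) / 2, (cell - w2) / 2 };
}

int main() {
    // 'l' narrow, 'm' wide: <l, OS, m> and <m, OS, l> end up with the
    // same advance, which proportional fonts are not designed to give.
    Placement a = overstrike(2.0, 6.0);
    Placement b = overstrike(6.0, 2.0);
    std::printf("advance %.1f vs %.1f\n", a.advance, b.advance);  // 6.0 vs 6.0
}
```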
From marius.spix at web.de Tue Jun 16 12:16:24 2020
From: marius.spix at web.de (Marius Spix)
Date: Tue, 16 Jun 2020 19:16:24 +0200
Subject: Aw: Re: OverStrike control character
In-Reply-To: <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
References: <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

> Even though fonts are vectors, they still get rendered onto a raster screen. But the point was that they get overlaid and centered horizontally.

It is possible to draw vector graphics on CRT screens or to draw documents with a plotter. Unicode does not specify how characters are rendered.

From abrahamgross at disroot.org Tue Jun 16 12:28:22 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 16 Jun 2020 17:28:22 +0000
Subject: OverStrike control character
In-Reply-To: <20200616180522.3a2bbcae@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

On 16 Jun 2020 at 13:06, "Richard Wordingham via Unicode" wrote:

> Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.

Right, exactly. Why's that a problem?

From richard.wordingham at ntlworld.com Tue Jun 16 15:39:25 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 16 Jun 2020 21:39:25 +0100
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <20200616213925.46980570@JRWUBU2>

On Tue, 16 Jun 2020 17:28:22 +0000 Abraham Gross via Unicode wrote:

> On 16 Jun 2020 at 13:06, "Richard Wordingham via Unicode" wrote:
>> Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.
>
> Right, exactly. Why's that a problem?

Have you written a proportional width font that can do that? I'm not saying it's impossible, just that it's a lot of work.

Richard.

From harjitmoe at outlook.com Tue Jun 16 15:43:15 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Tue, 16 Jun 2020 20:43:15 +0000
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

>> Your equivalence calls for <l, OS, m> and <m, OS, l> to have the same advance width.
>
> Right, exactly. Why's that a problem?

Because with how things usually work currently (in something like Roman or Greek, at any rate), the text renderer will make space for the first character first, and then position any following combining diacritics in that space. That is to say, the anchor point is a point inside the space allocated for the base character, and the diacritics are positioned so their anchors are at that point. The combining diacritics themselves have zero advance width, and no space is allocated for them; in the absence of anchors, they just poke over the previous character and (somewhat optimistically) hope for the best.
So if, say, m and l were treated just as postfix combining diacritics are today, the m in lm would significantly poke out of both sides of the space allocated for the l, whereas ml would not do that (since the space is allocated for the m, which is the wider of the two), and hence they wouldn't display the same way.

In terms of the use of BS for this in e.g. 7-bit ASCII, this works only because the output device is using a fixed width font such as Courier, and so doesn't have to worry about this sort of thing.

Obviously, there are some existing exceptions to this being how combining characters work (e.g. some Tamil vowel marks actually display in-line before the base character, and so shove it forward in the line despite being encoded after it). But these exceptions pose an implementation burden, requiring the layout engine to actively support these scripts.

From pgcon6 at msn.com Tue Jun 16 18:01:18 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Tue, 16 Jun 2020 23:01:18 +0000
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

From: Unicode On Behalf Of Harriet Riddle via Unicode
Sent: Tuesday, June 16, 2020 1:43 PM

> Because with how things usually work currently (in something like Roman or Greek, at any rate), the text renderer will make space for the first character first, and then position any following combining diacritics in that space.

Well, maybe some legacy rendering engines are like that. But that is not how any text rendering engine capable of supporting any significant portion of Unicode is going to work. For most scripts (and for Latin or Greek script with any support for typographic features), the engine needs to resolve what all the glyph IDs in a run are before it can start determining advance widths / positioning. And to do the latter, it will start with default advance widths and positions for all the glyphs, but then apply positioning actions that could revise any advance width or position.

Peter

From abrahamgross at disroot.org Wed Jun 17 09:44:52 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 17 Jun 2020 14:44:52 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

Then there's no problem with the advance width changes that overstriking will do (e.g. l?m).

On 2020/06/16 at 7:02 PM, Peter Constable via Unicode wrote:

> For most scripts (and for Latin or Greek script with any support for typographic features), the engine needs to resolve what all the glyph IDs in a run are before it can start determining advance widths / positioning. [...]
From pgcon6 at msn.com Wed Jun 17 18:54:57 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Wed, 17 Jun 2020 23:54:57 +0000
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

Except that BS is not a graphic character that will get a glyph with default metrics and potential interaction with other glyphs.

Peter

-----Original Message-----
From: Unicode On Behalf Of abrahamgross--- via Unicode
Sent: Wednesday, June 17, 2020 7:45 AM
To: unicode at unicode.org
Subject: RE: OverStrike control character

Then there's no problem with the advance width changes that overstriking will do (e.g. l?m). [...]

From abrahamgross at disroot.org Wed Jun 17 19:45:32 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 18 Jun 2020 00:45:32 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To:
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID:

Which is why I'm advocating for an OverStrike control character.

On 2020/06/17 at 7:55 PM, Peter Constable via Unicode wrote:

> Except that BS is not a graphic character that will get a glyph with default metrics and potential interaction with other glyphs.

From corentin.jabot at gmail.com Thu Jun 18 10:54:16 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Thu, 18 Jun 2020 17:54:16 +0200
Subject: EBCDIC control characters
Message-ID:

Dear Unicode people.

The C0 and C1 control blocks seem to have no intrinsic semantics, but the control characters of multiple character sets (such as some of the ISO encodings, and the EBCDIC control characters) map to the same block of code points (for EBCDIC, a mapping is described in the UTF-EBCDIC UAX - not sure if this mapping is described anywhere else), such that a distinction between the different provenances is not possible, despite these control characters having potentially different semantics in their original character sets.

Has this ever been an issue? Was it discussed at any point in history? Is there a recommended way of dealing with that?

I realize the scenario in which this might be relevant is a bit far-fetched, but as I try to push the C++ committee into the modern age, these questions, unfortunately, arose.

Thanks a lot,

Corentin

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kenwhistler at sonic.net Thu Jun 18 13:00:12 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Thu, 18 Jun 2020 11:00:12 -0700
Subject: EBCDIC control characters
In-Reply-To:
References:
Message-ID:

On 6/18/2020 8:54 AM, Corentin via Unicode wrote:

> Dear Unicode people.
> The C0 and C1 control blocks seem to have no intrinsic semantics, but the control characters of multiple character sets (such as some of the ISO encodings, and the EBCDIC control characters) map to the same block of code points (for EBCDIC, a mapping is described in the UTF-EBCDIC UAX

UTR, actually, not a UAX:

https://www.unicode.org/reports/tr16/tr16-8.html

> - not sure if this mapping is described anywhere else)

Yes, in excruciating detail in the IBM Character Data Representation Architecture:

https://www.ibm.com/downloads/cas/G01BQVRV

> such that a distinction between the different provenances is not possible, despite these control characters having potentially different semantics in their original character sets.

It isn't really a "character set" issue. Either ASCII graphic character sets or EBCDIC graphic character sets could be used, in principle, with different sets of control functions mapped onto the control code positions in each overall scheme. That is typically how character sets worked in terminal environments.

What the IBM CDRA establishes is a reliable mapping between all the code points used, so that it was possible to set up reliable interchange between EBCDIC systems and ASCII-based systems. There is one gotcha to watch out for, because there are two possible ways to map newlines back and forth.

> Has this ever been an issue? Was it discussed at any point in history? Is there a recommended way of dealing with that?
>
> I realize the scenario in which this might be relevant is a bit far-fetched, but as I try to push the C++ committee into the modern age, these questions, unfortunately, arose.

There really is no way for a C or C++ compiler to interpret arbitrary control functions associated with control codes, in any case, other than the specific control functions baked into the languages (which are basically the same ones that the Unicode Standard insists should be nailed down to particular code points: CR, LF, TAB, etc.). Other control code points should be allowed (and not be messed with) in string literals, and the compiler should otherwise barf if they occur in program text where the language syntax doesn't allow them. And then compilers supporting EBCDIC should just use the IBM standard for mapping back and forth to ASCII-based values.

--Ken

From corentin.jabot at gmail.com Thu Jun 18 14:22:18 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Thu, 18 Jun 2020 21:22:18 +0200
Subject: EBCDIC control characters
In-Reply-To:
References:
Message-ID:

On Thu, 18 Jun 2020 at 20:00, Ken Whistler wrote:

> UTR, actually, not a UAX:
>
> https://www.unicode.org/reports/tr16/tr16-8.html
>
> Yes, in excruciating detail in the IBM Character Data Representation Architecture:
>
> https://www.ibm.com/downloads/cas/G01BQVRV

Thanks, I will have to read that!

> It isn't really a "character set" issue.
> Either ASCII graphic character sets or EBCDIC graphic character sets
> could be used, in principle, with different sets of control functions,
> mapped onto the control code positions in each overall scheme. That is
> typically how character sets worked in terminal environments.

That makes sense!

> What the IBM CDRA establishes is a reliable mapping between all the code
> points used, so that it was possible to set up reliable interchange
> between EBCDIC systems and ASCII-based systems.
> There is one gotcha to watch out for, because there are two possible
> ways to map newlines back and forth.
>
> > Has this ever been an issue? Was it discussed at any point in history?
> > Is there a recommended way of dealing with that?
> >
> > I realize the scenario in which this might be relevant is a bit
> > far-fetched, but as I try to push the C++ committee into the modern age,
> > these questions, unfortunately, arose.
>
> There really is no way for a C or C++ compiler to interpret arbitrary
> control functions associated with control codes, in any case, other than
> the specific control functions baked into the languages (which are
> basically the same that the Unicode Standard insists should be nailed
> down to particular code points: CR, LF, TAB, etc.). Other control code
> points should be allowed (and not be messed with) in string literals,
> and the compiler should otherwise barf if they occur in program text
> where the language syntax doesn't allow it. And then compilers
> supporting EBCDIC should just use the IBM standard for mapping back and
> forth to ASCII-based values.

The specific case that people are talking about is indeed string literals such as "\x06\u0086", where the hexadecimal escape is meant to be an EBCDIC character and the \uxxxx is meant to be a Unicode character such that the hexadecimal sequence would map to that character, and whether, in that very odd scenario, they are or are not the same character, and whether they should be distinguishable.

Our current model is source encoding -> Unicode -> literal encoding, all three encodings being potentially distinct, so we do in fact "mess with" string literals, and the question is whether or not going through Unicode should ever be considered destructive, and my argument is that it is never destructive because it is semantically preserving in all the relevant use cases.

The question was in particular whether we should use "a superset of Unicode" instead of "Unicode" in that intermediate step.

Again thanks a lot for your reply!

>
> --Ken
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kenwhistler at sonic.net Thu Jun 18 18:14:19 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Thu, 18 Jun 2020 16:14:19 -0700
Subject: EBCDIC control characters
In-Reply-To: 
References: 
Message-ID: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>

On 6/18/2020 12:22 PM, Corentin wrote:
> The specific case that people are talking about is indeed string literals
> such as "\x06\u0086", where the hexadecimal escape is meant to be an
> EBCDIC character and the \uxxxx is meant to be a Unicode character
> such that the hexadecimal sequence would map to that character, and
> whether, in that very odd scenario, they are or are not the same
> character, and whether they should be distinguishable

Well, with the caveat that I am not a formal language designer -- I just use them on T.V.... ;-)

My opinion is that such constructs should simply be illegal and/or non-syntactical.
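
To make the construct under discussion concrete - an editorial sketch, not any particular compiler's documented behavior, and assuming a UTF-8 literal encoding for the output shown in the comments:

    #include <cstdio>

    int main() {
        // "\x86" is a raw code-unit value: the byte 0x86 is copied into
        // the literal unchanged, with no character conversion involved.
        // "\u0086" names the abstract character U+0086 (a C1 control);
        // the compiler converts it to the literal encoding, so it becomes
        // C2 86 under UTF-8, or whatever single byte an EBCDIC code page
        // assigns to that control.
        const char s[] = "\x86\u0086";
        for (unsigned char c : s)
            std::printf("%02X ", c);   // prints "86 C2 86 00 " under UTF-8
        std::printf("\n");
    }

So whether the two escapes denote "the same character" is not answerable at the byte level; it depends entirely on the literal encoding in effect.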
The whole idea of letting people import the complexity of character set conversion (particularly extended to the incompatibility between EBCDIC and ASCII-based representation) into string literals strikes me as just daft.

If program text is to be interpreted and compiled in an EBCDIC environment, any string literals contained in that source text should be constrained to EBCDIC, period, full stop. (0x4B, 0x4B) And if they contain more than the very restricted EBCDIC set of A..Z, a..z, 0..9 and a few common punctuation marks, then it better all be in one well-supported EBCDIC extended code page such as CP 500.

If program text is to be interpreted and compiled in a Unicode environment, any string literals contained in that source text should be constrained to Unicode, period, full stop (U+002E, U+002E). And for basic, 8-bit char strings, it better all be UTF-8 these days. UTF-16 and UTF-32 also work, of course, but IMO, support for those is best handled by depending on libraries such as ICU, rather than expecting that the programming language and runtime libraries are going to support them as well as char* UTF-8 strings.

If program source text has to be cross-compiled in both an EBCDIC and a Unicode environment, the only sane approach is to extract all but the bare minimum of string literals to various kinds of resource files which can then be independently manipulated and pushed through character conversions, as needed -- not expecting that the *compiler* is going to suddenly get smart and do the right thing every time it encounters some otherwise untagged string literal sitting in program text. That's a whole lot cleaner than doing a whole bunch of conditional compilation and working with string literals in program text that are always going to be half-gibberish on whichever platform you view it for maintenance. I had to do some EBCDIC/ASCII cross-compiled code development once -- although admittedly 20 years ago. It wasn't pretty.

> Our current model is source encoding -> Unicode -> literal encoding,
> all three encodings being potentially distinct, so we do in fact "mess
> with" string literals, and the question is whether or not going through
> Unicode should ever be considered destructive,

Answer, no. If somebody these days is trying to do software development work in a one-off, niche character encoding that cannot be fully converted to Unicode, then *they* are daft.

> and my argument is that it is never destructive because it is
> semantically preserving in all the relevant use cases.
>
> The question was in particular whether we should use "a superset of
> Unicode" instead of "Unicode" in that intermediate step.

Answer no. That will cause you nothing but trouble going forward.

All my opinions, of course. YMMV. But probably not by a lot. ;-)

--Ken

From asmusf at ix.netcom.com Thu Jun 18 18:55:45 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 18 Jun 2020 16:55:45 -0700
Subject: EBCDIC control characters
In-Reply-To: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
Message-ID: <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From kenwhistler at sonic.net Thu Jun 18 19:24:35 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Thu, 18 Jun 2020 17:24:35 -0700
Subject: EBCDIC control characters
In-Reply-To: <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com>
Message-ID: <8d7f54ff-0f62-2aef-c7d7-8f1d6f3202e0@sonic.net>

Asmus,

On 6/18/2020 4:55 PM, Asmus Freytag via Unicode wrote:
> The problem with the C/C++ compilers in this regard has always been
> that they attempted to implement the character-set insensitive model,
> which doesn't play well with Unicode, so if you want to compile a
> program where string literals are in Unicode (and not just any 16-bit
> character set) then you can't simply zero-extend. (And if you are
> trying to create a UTF-8 literal, then all bets are off unless you
> have a real conversion).

As I said, daft. ;-)

Anybody who depends on zero-sign extension for embedding Unicode character literals in an 8859-1 (or any other 8-bit character set) program text ought to have their head examined. Just because you *can* do it, and the compilers will cheerily do what the spec says they should in such cases doesn't mean that anybody *should* use it. (There is lots of stuff in C++ that no sane programmer should use. )

--Ken

From asmusf at ix.netcom.com Thu Jun 18 20:16:05 2020
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Thu, 18 Jun 2020 18:16:05 -0700
Subject: EBCDIC control characters
In-Reply-To: <8d7f54ff-0f62-2aef-c7d7-8f1d6f3202e0@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <08f8d673-e707-6d95-056b-5d8487ff56d9@ix.netcom.com> <8d7f54ff-0f62-2aef-c7d7-8f1d6f3202e0@sonic.net>
Message-ID: <9a97ea72-0084-a78a-ade0-c5d72eac2b8c@ix.netcom.com>

On 6/18/2020 5:24 PM, Ken Whistler wrote:
> Asmus,
>
> On 6/18/2020 4:55 PM, Asmus Freytag via Unicode wrote:
>> The problem with the C/C++ compilers in this regard has always been
>> that they attempted to implement the character-set insensitive model,
>> which doesn't play well with Unicode, so if you want to compile a
>> program where string literals are in Unicode (and not just any 16-bit
>> character set) then you can't simply zero-extend. (And if you are
>> trying to create a UTF-8 literal, then all bets are off unless you
>> have a real conversion).
>
> As I said, daft. ;-)

Ken,

An argument can certainly be made that trying to be "character set independent" is daft - and back in the '90s I walked away from a job interview at a place that told me that they had "figured it all out" and were going to use "character set independence" as their i18n strategy and "only" needed someone to implement it. Easiest decision on my part. (They got creamed by their Unicode-based competitor in short order).

My experience with C/C++ is perhaps colored a bit by the fact that I've always used compilers that were targeting Unicode-based systems and had special extensions; not sure where things stand right now, for a purely generic implementation.

A./

> Anybody who depends on zero-sign extension for embedding Unicode
> character literals in an 8859-1 (or any other 8-bit character set)
> program text ought to have their head examined. Just because you *can*
> do it, and the compilers will cheerily do what the spec says they
> should in such cases doesn't mean that anybody *should* use it. (There
> is lots of stuff in C++ that no sane programmer should use. )
>
> --Ken
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pgcon6 at msn.com Thu Jun 18 22:59:29 2020
From: pgcon6 at msn.com (Peter Constable)
Date: Fri, 19 Jun 2020 03:59:29 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: 

And your new control character would have the same limitation: control characters are default ignorable and don't get rendered.

Peter

-----Original Message-----
From: Unicode On Behalf Of abrahamgross--- via Unicode
Sent: Wednesday, June 17, 2020 5:46 PM
To: unicode at unicode.org
Subject: RE: OverStrike control character

Which is why I'm advocating for an OverStrike control character

2020/06/17 7:55:36 PM Peter Constable via Unicode :
> Except that BS is not a graphic character that will get a glyph with default metrics and potential interaction with other glyphs.

From abrahamgross at disroot.org Thu Jun 18 23:37:39 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 19 Jun 2020 04:37:39 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: 

Then how does the ZWJ (zero width joiner) work?

2020/06/19 0:00:26 AM Peter Constable via Unicode :
> And your new control character would have the same limitation: control characters are default ignorable and don't get rendered.
>
> Peter

From jameskasskrv at gmail.com Fri Jun 19 00:20:33 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 19 Jun 2020 05:20:33 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <54c36eae-5589-069b-1791-8dabea30b3e6@gmail.com>

On 2020-06-19 4:37 AM, abrahamgross--- via Unicode wrote:
> Then how does the ZWJ (zero width joiner) work?

It's considered punctuation even though it has control and format aspects. ZWJ is default ignorable. ZWJ requests a more joined form of a character string for the display if a more joined form is available in the font/rendering system. If a more joined form is not available, the display will be the same as if the ZWJ was not part of the character stream, and no harm done. The point being that the author requested a more joined form and this authorial intent is preserved in the text/data.

The ZWJ might be a good way to achieve over-striking. For example, the string "Respec<ZWJ>tfully" has a ZWJ inserted between the "c" and the "t". If you had a font which substituted a "c-t" over-strike for that string, your display could show it. If you had a font which substituted a "c-t" ligature for that string, that ligature would be displayed. Otherwise the string "Respec<ZWJ>tfully" would look the same as the string "Respectfully".

(Heh, the Thunderbird spell-checker chokes on the two instances with ZWJ.)
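
A small sketch of that behavior in today's C++ - assuming a UTF-8 literal encoding, which is an assumption rather than something the standard guarantees for plain literals:

    #include <iostream>
    #include <string>

    int main() {
        // U+200D ZERO WIDTH JOINER between 'c' and 't' requests a more
        // joined rendering (e.g. a c-t ligature) where the font offers
        // one; otherwise the text displays as plain "Respectfully".
        std::string s = "Respec\u200Dtfully";
        std::cout << s.size() << "\n";  // 15: the ZWJ occupies 3 UTF-8 bytes
    }

The request travels with the plain text, and renderers that do not understand it lose nothing - which is exactly the property described above.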
From richard.wordingham at ntlworld.com Fri Jun 19 05:32:50 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 11:32:50 +0100
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
Message-ID: <20200619113250.66bb7f47@JRWUBU2>

On Fri, 19 Jun 2020 03:59:29 +0000
Peter Constable via Unicode wrote:

> And your new control character would have the same limitation:
> control characters are default ignorable and don't get rendered.

As a systematic rule, that's incorrect behaviour. They should only be ignored if the system doesn't 'understand' them. So, if the font selected doesn't support it, it can be ignored, but if it does, it should be honoured as part of the text.

Of course, there are misunderstandings; I've seen USE implementations complain about ZWJ.

Richard.

From richard.wordingham at ntlworld.com Fri Jun 19 05:48:27 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 11:48:27 +0100
Subject: EBCDIC control characters
In-Reply-To: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
Message-ID: <20200619114827.07df1f21@JRWUBU2>

On Thu, 18 Jun 2020 16:14:19 -0700
Ken Whistler via Unicode wrote:

> On 6/18/2020 12:22 PM, Corentin wrote:
> > The specific case that people are talking about is indeed string
> > literals such as "\x06\u0086", where the hexadecimal escape is meant
> > to be an EBCDIC character and the \uxxxx is meant to be a Unicode
> > character such that the hexadecimal sequence would map to that
> > character, and whether, in that very odd scenario, they are or are
> > not the same character, and whether they should be distinguishable

> > The question was in particular whether we should use "a superset
> > of Unicode" instead of "Unicode" in that intermediate step.

> Answer no. That will cause you nothing but trouble going forward.

Isn't there still the issue of supporting U+0000 in C-type strings?

Richard.

From jameskasskrv at gmail.com Fri Jun 19 06:40:01 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 19 Jun 2020 11:40:01 +0000
Subject: OverStrike control character
In-Reply-To: <20200619113250.66bb7f47@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2>
Message-ID: 

A font could be designed to make appropriate glyph substitutions for strings which include the control picture for backspace, U+2408 (␈). So a font could substitute an overstruck l-m glyph for the string 'l' + '␈' + 'm'. If the font didn't support that string, the default display would still show authorial intent. In this way users desiring to exchange data in plain-text which included over-strikes could do so without any additions to TUS.

Unicode, excluding emoji, eschews encoding items just because they sound cool and somebody might use them. But if users want to band together and establish conventions, there's nothing holding them back.
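
For what it's worth, such a convention would also be easy to process mechanically. A sketch - the convention itself ('X' + U+2408 + 'Y' meaning "Y struck over X") is hypothetical, as is the helper:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Find each U+2408 SYMBOL FOR BACKSPACE (UTF-8: E2 90 88) used as an
    // overstrike marker; a font's substitution rules, or a renderer,
    // could match the same triples.
    std::vector<std::size_t> find_overstrike_markers(const std::string& utf8) {
        std::vector<std::size_t> offsets;
        for (std::size_t i = 0; i + 2 < utf8.size(); ++i)
            if (static_cast<unsigned char>(utf8[i])     == 0xE2 &&
                static_cast<unsigned char>(utf8[i + 1]) == 0x90 &&
                static_cast<unsigned char>(utf8[i + 2]) == 0x88)
                offsets.push_back(i);
        return offsets;
    }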
From richard.wordingham at ntlworld.com Fri Jun 19 09:42:16 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 15:42:16 +0100
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2>
Message-ID: <20200619154216.6f85f1d5@JRWUBU2>

On Fri, 19 Jun 2020 11:40:01 +0000
James Kass via Unicode wrote:

> A font could be designed to make appropriate glyph substitutions for
> strings which include the control picture for backspace, U+2408
> (␈). So a font could substitute an overstruck l-m glyph for the
> string 'l' + '␈' + 'm'. If the font didn't support that string, the
> default display would still show authorial intent. In this way users
> desiring to exchange data in plain-text which included over-strikes
> could do so without any additions to TUS.

Wouldn't this violate the character identity of U+2408?

The proper mechanism would be to use a PUA character. The question is whether the font would be enough, or whether one would have to change its invoker.

Richard.

From gwidion at gmail.com Fri Jun 19 10:11:58 2020
From: gwidion at gmail.com (Joao S. O. Bueno)
Date: Fri, 19 Jun 2020 12:11:58 -0300
Subject: OverStrike control character
In-Reply-To: <20200619154216.6f85f1d5@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2>
Message-ID: 

Since this discussion has come this far, I will drop my 0.02:

I am currently authoring a library/framework to create character art ("ASCII art") - as a free software project, including drawing APIs using block characters, and helpers for using emoji. In this position, such a character combination would be a "nice to have" - and if it would not disturb other aspects of text-communication, I am all for it.

Fact is one would still need a terminal app to support it properly, but my project can also work with other backends for rendering. Currently it supports ANSI-sequence texts aimed at terminal emulators and an HTML output based on monospaced fonts and CSS styling. But pixel-based backends are on the roadmap, and easy to do.

I see this library and similar projects as major users of the "overstrike" features. In the case of my project, even as an enabler for other people to use it. However, as is obvious, I have to count on higher level protocols to specify in-string text attributes, and I can make use of those for overriding character positioning with no need of a special character for overstrike. So, although my project could support this, and having some people using the overstrike character for some simplified output, it will certainly also integrate in-string markup for other positioning control (by coincidence I was coding exactly this part last night).

On the other hand, if an overstrike character is ever implemented and supported in terminals and other text APIs in popular toolkits such as Qt/GTK, I can get more character artistic effects on those backends as well, instead of limiting them to pixel-based backends.
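
For terminal backends there is also prior art worth noting: the old teletype/nroff convention, which pagers such as less still honor, renders "X BS X" as bold and "_ BS X" as underlined. A backend could emit overstrikes the same way today, with no new character - a tiny sketch:

    #include <string>

    // Classic teletype-style overstrike: first glyph, backspace (0x08),
    // second glyph. How (and whether) it renders is up to the consumer.
    std::string overstrike(char under, char over) {
        return std::string{under, '\b', over};
    }
    // e.g. overstrike('l', 'm') yields "l\bm", and overstrike('_', 'x')
    // is the sequence nroff used for an underlined 'x'.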
(If anyone is curious about it, the project url is https://github.com/jsbueno/terminedia - and I can get help with having more Unicode-compliant internal names and APIs, as well as help other people for whom the project tools might be useful)

Regards,

js
-><-

On Fri, 19 Jun 2020 at 11:48, Richard Wordingham via Unicode < unicode at unicode.org> wrote:

> On Fri, 19 Jun 2020 11:40:01 +0000
> James Kass via Unicode wrote:
>
> > A font could be designed to make appropriate glyph substitutions for
> > strings which include the control picture for backspace, U+2408
> > (␈). So a font could substitute an overstruck l-m glyph for the
> > string 'l' + '␈' + 'm'. If the font didn't support that string, the
> > default display would still show authorial intent. In this way users
> > desiring to exchange data in plain-text which included over-strikes
> > could do so without any additions to TUS.
>
> Wouldn't this violate the character identity of U+2408?
>
> The proper mechanism would be to use a PUA character. The question is
> whether the font would be enough, or whether one would have to change
> its invoker.
>
> Richard.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From abrahamgross at disroot.org Fri Jun 19 10:27:35 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Fri, 19 Jun 2020 15:27:35 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <20200619154216.6f85f1d5@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2>
Message-ID: 

I think James used U+2408 as an example, and not as a real proposal for what overstrike should look like where it's not supported. (?m?l? or ?m?l? might be a good alternative)

2020/06/19 10:43:03 AM Richard Wordingham via Unicode :
> Wouldn't this violate the character identity of U+2408?

From kenwhistler at sonic.net Fri Jun 19 15:24:41 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Fri, 19 Jun 2020 13:24:41 -0700
Subject: EBCDIC control characters
In-Reply-To: <20200619114827.07df1f21@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2>
Message-ID: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>

On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> Isn't there still the issue of supporting U+0000 in C-type strings?

I don't see why. And it has nothing to do with Unicode per se, anyway.

That is just a transform of the question of "the issue of supporting 0x00 in C-type strings restricted to ASCII."

The issue is precisely the same, and the solutions are precisely the same -- by design.

--Ken

From markus.icu at gmail.com Fri Jun 19 16:00:21 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 19 Jun 2020 14:00:21 -0700
Subject: EBCDIC control characters
In-Reply-To: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

I would soften a bit what Ken and Asmus have said.

Of course C++ compilers have to deal with a variety of charsets/codepages. There is (or used to be) a lot of code in various Windows/Mac/Linux/... codepages, including variations of Shift-JIS, EUC-KR, etc.

My mental model of how compilers work (which might be outdated) is that they work within a charset family (usually ASCII, but EBCDIC on certain platforms) and mostly parse ASCII characters as is (and for the "basic character set" in EBCDIC, mostly assume the byte values of cp37 or 1047 depending on platform). For regular string literals, I expect it's mostly a pass-through from the source code (and \xhh bytes) to the output binary.

But of course C++ has syntax for Unicode string literals. I think compilers basically call a system function to convert from the source bytes to Unicode, either with the process default charset or with an explicit one if specified on the command line.

And then there are \uhhhh and \U00HHHHHH escapes even in non-Unicode string literals, as Corentin said. What I would expect to happen is that the compiler copies all of the literal bytes, and when it reads a Unicode escape it converts that one code point to the byte sequence in the default or execution-charset.

It would get more interesting if a compiler had options for different source and execution charsets. I don't know if they would convert regular string literals directly from one to the other, or if they convert everything to Unicode (like a Java compiler) and then to the execution charset. (In Java, the execution charset is UTF-16, so the problem space there is simpler.)

Of course, in many cases a conversion from A to B will pivot through Unicode anyway (so that you only need 2n tables not n^2.)

About character conversion in general I would caution that there are basically two types of mappings: Round-trip mappings for what's really the same character on both sides, and fallbacks where you map to a different but more or less similar/related character because that may be more readable than a question mark or a replacement character. In a compiler, I would hope that both unmappable characters and fallback mappings lead to compiler errors, to avoid hidden surprises in runtime behavior.

This probably constrains what the compiler can and should do. As a programmer, I want to be able to put any old byte sequence into a string literal, including NUL, controls, and non-character-encoding bytes. (We use string literals for more things than "text".) For example, when we didn't yet have syntax for UTF-8 string literals, we could write unmarked literals with \xhh sequences and pass them into functions that explicitly operated on UTF-8, regardless of whether those byte sequences were well-formed according to the source or execution charsets. This pretty much works only if there is no conversion that puts limits on the contents.

I believe that EBCDIC platforms have dealt with this, where necessary, by using single-byte conversion mappings between EBCDIC-based and ASCII-based codepages that were strict permutations. Thus, control codes and other byte values would round-trip through any number of conversions back and forth.

PS: I know that this really goes beyond string literals: C++ identifiers can include non-ASCII characters. I expect these to work much like regular string literals, minus escape sequences. I guess that the execution charset still plays a role for the linker symbol table.

Best regards,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sdowney at gmail.com Fri Jun 19 16:56:33 2020
From: sdowney at gmail.com (Steve Downey)
Date: Fri, 19 Jun 2020 17:56:33 -0400
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

On Fri, Jun 19, 2020 at 5:08 PM Markus Scherer via Unicode wrote:
>
> I would soften a bit what Ken and Asmus have said.
>
> Of course C++ compilers have to deal with a variety of charsets/codepages. There is (or used to be) a lot of code in various Windows/Mac/Linux/... codepages, including variations of Shift-JIS, EUC-KR, etc.
>
> My mental model of how compilers work (which might be outdated) is that they work within a charset family (usually ASCII, but EBCDIC on certain platforms) and mostly parse ASCII characters as is (and for the "basic character set" in EBCDIC, mostly assume the byte values of cp37 or 1047 depending on platform). For regular string literals, I expect it's mostly a pass-through from the source code (and \xhh bytes) to the output binary.

What you described is the standard model for C compilers. For better or worse, the C++ model is much more complicated. Note that what I'm about to describe isn't how actual compilers work, but is what is described in the C++ standard.

When translating a source file, all of the characters outside the 'basic source character set' (ASCII letters, numbers, some necessary punctuation) are converted to universal character names of the form \unnnn or \Unnnnnnnn, where the ns are the short name of the code point, and surrogate pairs are excluded, so really scalar values. Later in translation, the universal character names, and the basic source character set elements, are mapped to the execution character set, where the values are determined by locale. Which is terribly vague, and we'd like to clean that up.

There are wide literals to deal with, as well as the newer Unicode literals, where we've mandated the encoding to be UTF of the appropriate code unit width, with distinct types of char8_t, char16_t, and char32_t.

> But of course C++ has syntax for Unicode string literals. I think compilers basically call a system function to convert from the source bytes to Unicode, either with the process default charset or with an explicit one if specified on the command line.
>
> It would get more interesting if a compiler had options for different source and execution charsets. I don't know if they would convert regular string literals directly from one to the other, or if they convert everything to Unicode (like a Java compiler) and then to the execution charset. (In Java, the execution charset is UTF-16, so the problem space there is simpler.)

In practice, compilers behave sensibly and will map from the source to the destination encodings. In theory they triangulate via code points. This difference, of course, can be made visible by chosen text where there are multiple possible destinations for a code point. In practice, users do not care because they get the results they expect. It's more a problem in specification.

> PS: I know that this really goes beyond string literals: C++ identifiers can include non-ASCII characters. I expect these to work much like regular string literals, minus escape sequences. I guess that the execution charset still plays a role for the linker symbol table.

Identifiers work substantially the same way, although with additional restrictions.
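
A concrete illustration of identifiers going through that same machinery - a sketch that a sufficiently recent compiler should accept, assuming UTF-8, NFC source text:

    // A universal-character-name and the character itself spell the same
    // identifier: \u00E9 and 'é' are identical after translation phase 1,
    // provided the 'é' is the single code point U+00E9 (NFC).
    int caf\u00E9 = 1;

    int main() {
        return café;  // refers to the caf\u00E9 above
    }

(That the two spellings are one identifier is what the standard says; how reliably compilers have historically accepted the non-ASCII spelling is another matter, as the gcc history mentioned later in this thread shows.)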

I'm currently working on a proposal to apply the current UAX 31 to C++ to clean up the historical allow and block list. (http://wg21.link/p1949 : C++ Identifier Syntax using Unicode Standard Annex 31) I'll be posting some questions soon about that.

-SMD

From richard.wordingham at ntlworld.com Fri Jun 19 17:58:12 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 19 Jun 2020 23:58:12 +0100
Subject: EBCDIC control characters
In-Reply-To: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: <20200619235812.405e74d0@JRWUBU2>

On Fri, 19 Jun 2020 13:24:41 -0700
Ken Whistler via Unicode wrote:

> On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > Isn't there still the issue of supporting U+0000 in C-type
> > strings?
>
> I don't see why. And it has nothing to do with Unicode per se, anyway.
>
> That is just a transform of the question of "the issue of supporting
> 0x00 in C-type strings restricted to ASCII."
>
> The issue is precisely the same, and the solutions are precisely the
> same -- by design.

There is a solution, but it's not nice. The solution is to work with UTF-8 plus one other character code - <0xC0, 0x80> for U+0000. In the absence of policemen, it works.

While Ken and Asmus both live (I can't remember whose lifetime it is), one can use scalar values beyond 0x10FFFF for character-like non-character entities, such as byte values with bit 7 or higher set (a widespread possibility for file names), or some enormous CJK glyph sets. I understand Emacs does that sort of thing, storing them using an extension of UTF-8, and seems to get away with it. I believe they're also used for Bucky-bitted 'characters' from keyboards. Outside Emacs, such things also provide reliable, private non-characters. Again, one has to watch out for policemen, which can make life fraught in complicated environments.

Richard.

From kent.b.karlsson at bahnhof.se Fri Jun 19 18:06:34 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sat, 20 Jun 2020 01:06:34 +0200
Subject: What constitutes an abstract character?
In-Reply-To: 
References: <000901d64321$e17f5ad0$a47e1070$@gmail.com>
Message-ID: <5D51B47F-D1E9-45BD-8891-F9F374B9F97A@bahnhof.se>

> On 15 June 2020, at 17:04, Peter Constable via Unicode wrote:
>
> Unicode doesn't give one answer since there's more than one way that might be appropriate to answer it.
>
> [...] An Old Hangul syllable might have a count of 1, 2 or 3, depending on the syllable.

A bit peripheral to this thread, but:

1) No need to limit that to Old Hangul. It is equally valid for Modern Hangul. It's just that for SOME old Hangul syllables there is no (canonically equivalent) single character form. This is for encoding historical reasons, nothing deep. Just that hindsight is (now) not at all a sufficient reason to radically change the encoding. (It was sufficient reason long ago, resulting in the "Hangul mess" in Unicode...)

2) For practical (I guess) reasons one considers clusters of consonants and clusters of vowels as singular indivisible entities. However, since Hangul is an alphabetic script (and the letter basis has no consonant or vowel "clusters"; the clusters consist of one to three letters), also the (canonical) decomposition into at most three components is an artifact of the encoding. A Hangul syllable can often consist of more than three Hangul letters.
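
(For reference, the "at most three components" arithmetic is the standard Hangul syllable decomposition from the Unicode core specification; a minimal sketch, valid only for the precomposed range U+AC00..U+D7A3:)

    #include <array>

    // Decompose a precomposed Hangul syllable into its L, V and optional T
    // Jamo. A returned T of 0 means "no trailing consonant": the syllable
    // decomposes into just two Jamo. Splitting cluster Jamo into individual
    // letters would be a separate, non-canonical step.
    constexpr char32_t SBase = 0xAC00, LBase = 0x1100,
                       VBase = 0x1161, TBase = 0x11A7;
    constexpr int VCount = 21, TCount = 28;

    std::array<char32_t, 3> decomposeHangul(char32_t s) {
        char32_t index = s - SBase;
        char32_t L = LBase + index / (VCount * TCount);
        char32_t V = VBase + (index % (VCount * TCount)) / TCount;
        char32_t T = index % TCount;
        return {L, V, T != 0 ? TBase + T : char32_t{0}};
    }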
And no, the compatibility decomposition of the Hangul Jamo is of no help; basically they are wrong for Hangul. DO NOT USE! Completely different decompositions are needed to decompose into the letters originally designed for the script. Furthermore, the consonants are (basically) double encoded, but that is for encoding technical reasons, not that there are really two different ones each, just two different positions in a syllable.

This just shows that the mapping from "abstract characters" (in this example, the letters of the Hangul alphabet) to encoded characters sometimes can be non-trivial.

/Kent Karlsson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kent.b.karlsson at bahnhof.se Fri Jun 19 18:06:45 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sat, 20 Jun 2020 01:06:45 +0200
Subject: EBCDIC control characters
In-Reply-To: 
References: 
Message-ID: <0E7BDFB6-1F2E-40A3-B9AE-D45443DF288C@bahnhof.se>

> On 18 June 2020, at 20:00, Ken Whistler via Unicode wrote:
> [...]
> It isn't really a "character set" issue. Either ASCII graphic character sets or EBCDIC graphic character sets could be used, in principle, with different sets of control functions, mapped onto the control code positions in each overall scheme.

That does not seem to be a very good idea at all. Especially since we do not have any good way of telling which set of control codes is used in such cases; in particular it would be a very very bad idea for Unicode encodings. It would be even worse than the situation that led up to the construction of Unicode. So let's assume "normal" control code allocation in the C0 and C1 areas when using the U+nnnn notation (or \unnnn). (Here, not saying anything about the contents of C0/C1 for other encodings.)

I don't usually need to worry about EBCDIC-based encodings... But it seems that at least earlier (UTF-EBCDIC not so much) EBCDIC-based encodings had some control codes that have no direct correspondence in the "normal" C0/C1. Several are listed in the Wikipedia page about EBCDIC.

Even though there is no direct correspondence for them, there is a way to represent them, provided one agrees on a mapping: ISO/IEC 6429/ECMA-48 comes to the rescue. There are very many unused, but syntactically correct, escape sequences and control sequences. A few of them are designated as private use. So for (old?) EBCDIC control codes that do not have a representation in the "normal" C0/C1: if it is a parameterless one, "allocate" an escape sequence (compare: each C1 control code has an alternative as an escape sequence, like HTJ can be designated ESC I and NEL as ESC E), and for the ones that take a parameter, "allocate" a control sequence (in the ECMA-48 sense) that takes a parameter (you will need a parameter value mapping as well).

I'm not saying that these (old?) control codes unique to EBCDIC are well-designed and all worthy of implementation and perpetual use. Not at all. But if you do need to keep some of them in some contexts (and otherwise ignore them), allocating escape sequences and control sequences is the way to go. No need to allocate new characters in Unicode... And no need to interpret the C0/C1 space in Unicode "strangely" in some contexts. That way you can represent "odd" control codes from e.g. (old?) EBCDIC-based encodings also in Unicode and \unnnn notation (ok, for some (old?) EBCDIC-based encodings one needs an extra conversion step to convert the escape/control sequence to the (old?) control codes if the string targets such an encoding).
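
As a sketch of that suggestion - with an entirely made-up assignment: the EBCDIC byte 0x04 and the ESC 1 sequence below are illustrative assumptions, not any agreed or registered mapping:

    #include <string>

    // Represent a parameterless EBCDIC-only control function as an
    // ECMA-48 private-use escape sequence (ESC Fp, final byte 0x30..0x3F),
    // so it can travel through Unicode text without a new code point.
    std::string representEbcdicControl(unsigned char ebcdicByte) {
        switch (ebcdicByte) {
            case 0x04: return "\x1B\x31";  // hypothetical: ESC 1 (private use)
            default:   return {};          // others have direct C0/C1 mappings
        }
    }

Agreeing on such a mapping (and on parameter value mappings for the parameterized ones) would of course be the hard part.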

Happy summer (northern hemisphere...) solstice

/Kent Karlsson

From sdowney at gmail.com Fri Jun 19 21:16:29 2020
From: sdowney at gmail.com (Steve Downey)
Date: Fri, 19 Jun 2020 22:16:29 -0400
Subject: UAX 31 for C++ Identifiers
Message-ID: 

I'm the lead author for a proposal to rework C++ identifiers in line with the current recommendation of UAX 31. The current version is available at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1949r4.html. The core of the proposal is to replace the current allowlist with one based on XID_Start and XID_Continue, with the addition of LOW LINE in the start set.

The summary:

The allowed Unicode code points in identifiers include many that are unassigned or unnecessary, and others that are actually counter-productive. By adopting the recommendations of UAX #31, Unicode Identifier and Pattern Syntax, C++ will be easier to work with in international environments and less prone to accidental problems.

This proposal does not address some potential security concerns - so-called homoglyph attacks, where letters that appear the same may be treated as distinct. Methods of defense against such attacks are complex and evolving, and requiring mitigation strategies would impose substantial implementation burden.

This proposal also recommends adoption of Unicode normalization form C (NFC) for identifiers to ensure that when compared, identifiers intended to be the same will compare as equal. Legacy encodings are generally naturally in NFC when converted to Unicode. Most tools will, by default, produce NFC text. Some unusual scripts require the use of characters as joiners that are not allowed by UAX #31; these will no longer be available as identifiers in C++.

As a side-effect of adopting the identifier characters from UAX #31, using emoji in or as identifiers becomes ill-formed.

The most important open question is what are we losing by using the basic XID_Start XID_Continue* pattern. There are apparently natural languages that require code points outside that set in order to write some words. How much of a problem is that, and are there solutions without complex script analysis on potential identifiers?

Secondarily, what would an excellent conformance statement look like? I'm proposing an annex to the C++ standard discussing the conformance points and how we are or are not meeting them, so as to have clarity on how and why.

There are also open questions about emoji. There are currently a large number that are allowed, but it seems mostly due to the open listing of unassigned code points. Has there been discussion of a standard profile that would allow emoji in identifiers? I realize this has substantial overlap with script checking and the security paper. C++ identifiers are sort of half over the fence. ZWJ are allowed, but gender modifiers aren't, and neither were intentional with respect to emoji.

The feedback I've got is that we, the C++ committee, would really like not to own this problem, even if members participate in solving the problem.

Thanks!
-SMD (wg21/sg16)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmusf at ix.netcom.com Fri Jun 19 21:35:40 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 19 Jun 2020 19:35:40 -0700
Subject: UAX 31 for C++ Identifiers
In-Reply-To: 
References: 
Message-ID: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From sdowney at gmail.com Fri Jun 19 22:22:35 2020
From: sdowney at gmail.com (Steve Downey)
Date: Fri, 19 Jun 2020 23:22:35 -0400
Subject: UAX 31 for C++ Identifiers
In-Reply-To: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
Message-ID: 

On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode wrote:
>
> In source code, having ambiguous identifiers may not be worse than C-style obfuscation.
>

Until recently (the last release, 10.1), gcc rejected much of the allowed Unicode in UTF-8 input, even in places it would allow \u universal-character-names. So this all becomes easier now. As a Standard, we should have handled this better earlier, but the second best time is now. The XID_ properties make this a lot more palatable w.r.t. stability, though, and I'm not going to second guess people 10 or 20 or more years ago, too much. Ambiguity in external identifiers is already ill-formed, no diagnostic required, which means broken, but in ways that compilers can't treat as undefined.

> But with module names, etc. you may run into security issues if naming allows / facilitates spoofing.
>

I, and other people doing tools, both won and lost this battle already. Module names in source do not correspond with anything physical. `import some.module` connects you to whatever exported `some.module` by magic as far as the standard is concerned. We're working on the actual mechanics as a Technical Report, and compiler vendors are participating and aren't, as far as I can tell, more insane than the average infrastructure engineer. So I have hope.

Mapping anything to file paths is fraught beyond belief, and there are many experienced engineers providing war stories and parades of horribles, although I'd personally like to have more stories to tell.

The entire disconnect between logical and physical actually is hopeful, in a way that `#include ` isn't. Even though we have a lot of understanding of how that maps to filesystem searches.

Province of wg21/sg15, which I also participate in.

I suspect that trying to fix up anything with #include is infeasible since it's currently the wild west, changes will break, and C++ depends in practice on system provided headers that at best conform to old C standards.

Thanks!
-SMD

From jameskasskrv at gmail.com Fri Jun 19 23:48:00 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 04:48:00 +0000
Subject: OverStrike control character
In-Reply-To: <20200619154216.6f85f1d5@JRWUBU2>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2>
Message-ID: <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>

Richard Wordingham wrote,

> Wouldn't this violate the character identity of U+2408?

U+2408 SYMBOL FOR BACKSPACE

Using a symbol for backspace in running text as a symbol for backspace to illustrate a notational convention for overstriking shouldn't violate its character identity. It was offered in response to the objections of using the ASCII backspace or other control characters because they are not graphic characters. U+2408 is a graphic character.

> The proper mechanism would be to use a PUA character.

This would only be true if the data wasn't intended to be interchangeable.

Abraham Gross wrote,

> (?m?l? or ?m?l? might be a good alternative)
Since they're graphic characters, either should be workable. As long as our hypothetical user community agrees on a notational convention, acceptable display should be possible with existing technology. It might be interesting to see if people with a demonstrable need to exchange overstruck material in plain-text, such as epigraphers, already have an established convention.

In numismatics, Yeoman's catalogs simply spell it out for overstruck dates, such as "1918D, 8 over 7".

From abrahamgross at disroot.org Sat Jun 20 01:30:02 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sat, 20 Jun 2020 06:30:02 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2> <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

If epigraphers and numismatists have the need for overstriking in plain text, isn't that reason enough to encode it? Unicode encoded many completely extinct scripts* and extinct characters in existing scripts, so adding the overstrike doesn't seem like a stretch at all.

Does Yeoman show the "8 over 7" visually too, or does it just say "8 over 7" and you're supposed to imagine it yourself?

*Extinct scripts in Unicode:
Georgian capitals
Ogham
Runes
Glagolitic
Linear B
Phaistos disc
Lycian
Carian
Old (RTL) Italic
Gothic
Old Permic
Cuneiform
Deseret (conscript)
Shavian (conscript)
Linear A
Cypriot
Imperial Aramaic
Palmyrene
Nabatean
Hatran
Phoenician
Lydian
Meroitic
Old South Arabian
Old North Arabian
Avestan
Inscriptional Parthian
Inscriptional Pahlavi
Psalter Pahlavi
Old Turkic
Old Hungarian
Brahmi

Then there are many extinct scripts in the proposal stage, like Oracle Bone and Classical Yi.

2020/06/20 0:48:47 AM James Kass via Unicode :
> It might be interesting to see if people with a demonstrable need to exchange overstruck material in plain-text, such as epigraphers, already have an established convention.
>
> In numismatics, Yeoman's catalogs simply spell it out for overstruck dates, such as "1918D, 8 over 7".

From asmusf at ix.netcom.com Sat Jun 20 01:44:59 2020
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Fri, 19 Jun 2020 23:44:59 -0700
Subject: UAX 31 for C++ Identifiers
In-Reply-To: 
References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
Message-ID: <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com>

My meta point had been about possibly different levels of security issues between compile time and runtime.

A./

On 6/19/2020 8:22 PM, Steve Downey wrote:
> On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode
> wrote:
>> In source code, having ambiguous identifiers may not be worse than C-style obfuscation.
>>
> Until recently (the last release, 10.1), gcc rejected much of the allowed
> Unicode in UTF-8 input, even in places it would allow \u
> universal-character-names. So this all becomes easier now. As a
> Standard, we should have handled this better earlier, but the second
> best time is now. The XID_ properties make this a lot more palatable
> w.r.t. stability, though, and I'm not going to second guess people 10
> or 20 or more years ago, too much.
> Ambiguity in external identifiers is already ill-formed, no diagnostic
> required, which means broken, but in ways that compilers can't treat
> as undefined.
>
>> But with module names, etc. you may run into security issues if
>> naming allows / facilitates spoofing.
>>
> I, and other people doing tools, both won and lost this battle
> already. Module names in source do not correspond with anything
> physical. `import some.module` connects you to whatever exported
> `some.module` by magic as far as the standard is concerned. We're
> working on the actual mechanics as a Technical Report, and compiler
> vendors are participating and aren't, as far as I can tell, more
> insane than the average infrastructure engineer. So I have hope.
>
> Mapping anything to file paths is fraught beyond belief, and there are
> many experienced engineers providing war stories and parades of
> horribles, although I'd personally like to have more stories to tell.
>
> The entire disconnect between logical and physical actually is
> hopeful, in a way that `#include ` isn't. Even though
> we have a lot of understanding of how that maps to filesystem
> searches.
>
> Province of wg21/sg15, which I also participate in.
>
> I suspect that trying to fix up anything with #include is infeasible
> since it's currently the wild west, changes will break, and C++
> depends in practice on system provided headers that at best conform to
> old C standards.
>
> Thanks!
>
> -SMD
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From corentin.jabot at gmail.com Sat Jun 20 03:50:28 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 10:50:28 +0200
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

On Fri, 19 Jun 2020 at 23:00, Markus Scherer wrote:

> I would soften a bit what Ken and Asmus have said.
>
> Of course C++ compilers have to deal with a variety of charsets/codepages.
> There is (or used to be) a lot of code in various Windows/Mac/Linux/...
> codepages, including variations of Shift-JIS, EUC-KR, etc.
>
> My mental model of how compilers work (which might be outdated) is that
> they work within a charset family (usually ASCII, but EBCDIC on certain
> platforms) and mostly parse ASCII characters as is (and for the "basic
> character set" in EBCDIC, mostly assume the byte values of cp37 or 1047
> depending on platform). For regular string literals, I expect it's mostly a
> pass-through from the source code (and \xhh bytes) to the output binary.
>
> But of course C++ has syntax for Unicode string literals. I think
> compilers basically call a system function to convert from the source bytes
> to Unicode, either with the process default charset or with an explicit one
> if specified on the command line.
>
> It would get more interesting if a compiler had options for different
> source and execution charsets. I don't know if they would convert regular
> string literals directly from one to the other, or if they convert
> everything to Unicode (like a Java compiler) and then to the execution
> charset.
> (In Java, the execution charset is UTF-16, so the problem space
> there is simpler.)

Yes, and actually people are talking about that for legacy projects' sake, and in that case using Unicode internally makes even more sense.

> Of course, in many cases a conversion from A to B will pivot through
> Unicode anyway (so that you only need 2n tables not n^2.)
>
> About character conversion in general I would caution that there are
> basically two types of mappings: Round-trip mappings for what's really the
> same character on both sides, and fallbacks where you map to a different
> but more or less similar/related character because that may be more
> readable than a question mark or a replacement character. In a compiler, I
> would hope that both unmappable characters and fallback mappings lead to
> compiler errors, to avoid hidden surprises in runtime behavior.

I am hoping to make conversions that do not preserve semantics invalid; right now compilers will behave differently, some will not compile, some will insert question marks, leading to the runtime issues you describe.

Now, my argument is that going through Unicode (and keep in mind that we are describing a specification, not compiler implementations) lets us simplify the spec without preventing (nor mandating) round-tripping if the source and literal encodings happen to be the same. If there is a way through Unicode, transitively there is a direct way.

> This probably constrains what the compiler can and should do. As a
> programmer, I want to be able to put any old byte sequence into a string
> literal, including NUL, controls, and non-character-encoding bytes. (We use
> string literals for more things than "text".) For example, when we didn't
> yet have syntax for UTF-8 string literals, we could write unmarked literals
> with \xhh sequences and pass them into functions that explicitly operated
> on UTF-8, regardless of whether those byte sequences were well-formed
> according to the source or execution charsets. This pretty much works only
> if there is no conversion that puts limits on the contents.

Okay, we are really in C++ territory now. For the sake of people who are not aware: the \0 and \x escape sequences are really integer values and will never be semantically characters or involve conversion.

> I believe that EBCDIC platforms have dealt with this, where necessary, by
> using single-byte conversion mappings between EBCDIC-based and ASCII-based
> codepages that were strict permutations. Thus, control codes and other byte
> values would round-trip through any number of conversions back and forth.
>
> PS: I know that this really goes beyond string literals: C++ identifiers
> can include non-ASCII characters. I expect these to work much like regular
> string literals, minus escape sequences. I guess that the execution charset
> still plays a role for the linker symbol table.
>
> Best regards,
> markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From corentin.jabot at gmail.com Sat Jun 20 03:57:10 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 10:57:10 +0200
Subject: EBCDIC control characters
In-Reply-To: <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: 

On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode wrote:
>
> On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > Isn't there still the issue of supporting U+0000 in C-type strings?
>
> I don't see why. And it has nothing to do with Unicode per se, anyway.
>
> That is just a transform of the question of "the issue of supporting 0x00 in
> C-type strings restricted to ASCII."
>
> The issue is precisely the same, and the solutions are precisely the same
> -- by design.

I'm not sure I understand that issue, could you clarify? In both C and C++, U+0000 is interpreted as the null character (which marks the end of the string, depending on context), which is the same behavior as the equivalent ASCII character.

>
> --Ken
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From prosfilaes at gmail.com Sat Jun 20 05:30:22 2020
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 20 Jun 2020 03:30:22 -0700
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2> <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org> <20200609225134.oxkhssxtlj5t664t@nitsipc.home> <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org> <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org> <20200619113250.66bb7f47@JRWUBU2> <20200619154216.6f85f1d5@JRWUBU2> <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

You can use BS; you can use GCC (the ECMA-48 GRAPHIC CHARACTER COMBINATION control function); there are apparently ESC sequences that will do it, and you could implement it in any number of ways in rich text. They don't work right now; if you need this functionality, you're going to need to implement it. It doesn't feel that you want a way to store and display overtyped text, it feels that you want Unicode to officially support it. It's a very complex and expensive expansion, but if a bunch of people were using overtyped text, this discussion might be going differently.

This seems to be the quintessential example of a feature that has very marginal use and would be rather complex to implement. Back in the days of daisy-wheel printers, this was used, often to get characters not otherwise supported, like the cent sign; when the daisy-wheel printer disappeared, so did any support for such a thing. It's like playing games with character cell fonts to support stuff like a mouse cursor. It's history, and history not particularly easy to support in Unicode. There were just recently a bunch of characters encoded to support old 8-bit machines, because that was easy. But the associated inverted characters were rejected, and the submitters told to use some higher level protocol to support them. That seems to be a comparable reaction to what you're getting.

--
The standard is written in English. If you have trouble understanding a particular section, read it again and again and again... Sit up straight. Eat your vegetables. Do not mumble.
-- _Pascal_, ISO 7185 (1991)

From richard.wordingham at ntlworld.com Sat Jun 20 06:09:04 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 12:09:04 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net> <20200619114827.07df1f21@JRWUBU2> <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: <20200620120904.0c6f7630@JRWUBU2>

On Sat, 20 Jun 2020 10:57:10 +0200
Corentin via Unicode wrote:
-- _Pascal_, ISO 7185 (1991)

From richard.wordingham at ntlworld.com  Sat Jun 20 06:09:04 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 12:09:04 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
Message-ID: <20200620120904.0c6f7630@JRWUBU2>

On Sat, 20 Jun 2020 10:57:10 +0200
Corentin via Unicode wrote:

> On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> wrote:
>
> > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > > Isn't there still the issue of supporting U+0000 in C-type
> > > strings?
> >
> > I don't see why. And it has nothing to do with Unicode per se,
> > anyway.
> >
> > That is just a transform of the question of "the issue of
> > supporting 0x00 in C-type strings restricted to ASCII."
> >
> > The issue is precisely the same, and the solutions are precisely
> > the same -- by design.
>
> I'm not sure I understand that issue; could you clarify?
> In both C and C++, U+0000 is interpreted as the null character
> (which marks the end of the string, depending on context), which is
> the same behavior as the equivalent ASCII character.

One immediate consequence of that assertion is that one cannot in
general store a line of Unicode text in a 'string'. There have been
Unicode test cases that deliberately include a null in the middle of
the text, and if the program thinks it has stored the line in a
'string', it will fail the test, because the null character and beyond
are not part of the text being interpreted.

One of the early tricks to store general character sequences in
strings was to use non-shortest form UTF-8 encodings to avoid
characters being interpreted as control characters with undesired
characteristics. This form of UTF-8 is now invalid. Java was
especially noted for using the encoding to store zero bytes in
byte code in UTF-8.

My guess is that Ken is alluding to not storing arbitrary text in
strings, but rather in arrays of code units along with appropriate
length parameters.

Richard.

From richard.wordingham at ntlworld.com  Sat Jun 20 06:24:52 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 12:24:52 +0100
Subject: OverStrike control character
In-Reply-To: <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
References: <20200616180522.3a2bbcae@JRWUBU2>
 <5df3e8fd-5035-f0ce-2c5c-313ea69868d9@kli.org>
 <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: <20200620122452.7c9b709b@JRWUBU2>

On Sat, 20 Jun 2020 04:48:00 +0000
James Kass via Unicode wrote:

> Richard Wordingham wrote,
> >> The proper mechanism would be to use a PUA character.
>
> This would only be true if the data wasn't intended to be
> interchangeable.

PUA-encoded material is interchangeable - you just need to agree to the
convention. Emoji started out in the PUA. Remember the Conscript
Registry?

Richard.
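Richard's mention above of Java storing zero bytes via non-shortest-form
UTF-8 can be made concrete. The following C++ sketch of that "modified
UTF-8" trick is illustrative only: the function name is invented, and the
overlong 0xC0 0x80 form it emits is ill-formed as standard UTF-8, so a
conforming decoder must reject it. Only formats that define their own
variant, such as Java class files and JNI, accept it:

    #include <string>

    // Sketch only: encodes BMP scalar values, with U+0000 mapped to the
    // overlong pair 0xC0 0x80 so the output never contains a zero byte.
    std::string to_modified_utf8(const std::u32string& text) {
        std::string out;
        for (char32_t cp : text) {
            if (cp == 0x0000) {
                out += static_cast<char>(0xC0);  // overlong NUL:
                out += static_cast<char>(0x80);  // invalid standard UTF-8
            } else if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            // Supplementary characters omitted; Java encodes them as
            // CESU-8-style surrogate pairs, another non-standard form.
        }
        return out;
    }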
From corentin.jabot at gmail.com  Sat Jun 20 07:11:15 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 14:11:15 +0200
Subject: EBCDIC control characters
In-Reply-To: <20200620120904.0c6f7630@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
Message-ID: 

On Sat, 20 Jun 2020 at 13:14, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Sat, 20 Jun 2020 10:57:10 +0200
> Corentin via Unicode wrote:
>
> > On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> > wrote:
> > >
> > > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > > > Isn't there still the issue of supporting U+0000 in C-type
> > > > strings?
> > >
> > > I don't see why. And it has nothing to do with Unicode per se,
> > > anyway.
> > >
> > > That is just a transform of the question of "the issue of
> > > supporting 0x00 in C-type strings restricted to ASCII."
> > >
> > > The issue is precisely the same, and the solutions are precisely
> > > the same -- by design.
> >
> > I'm not sure I understand that issue; could you clarify?
> > In both C and C++, U+0000 is interpreted as the null character
> > (which marks the end of the string, depending on context), which is
> > the same behavior as the equivalent ASCII character.
>
> One immediate consequence of that assertion is that one cannot in
> general store a line of Unicode text in a 'string'. There have been
> Unicode test cases that deliberately include a null in the middle of
> the text, and if the program thinks it has stored the line in a
> 'string', it will fail the test, because the null character and
> beyond are not part of the text being interpreted.
>
> One of the early tricks to store general character sequences in
> strings was to use non-shortest form UTF-8 encodings to avoid
> characters being interpreted as control characters with undesired
> characteristics. This form of UTF-8 is now invalid. Java was
> especially noted for using the encoding to store zero
> bytes in byte code in UTF-8.
>
> My guess is that Ken is alluding to not storing arbitrary text in
> strings, but rather in arrays of code units along with appropriate
> length parameters.

Oh, yes, I see, thanks.
It's a special case of "null-terminated strings were a mistake".
But U+0000 has no other use or alternative semantics, right? The main
use case would be test cases?

> Richard.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jameskasskrv at gmail.com  Sat Jun 20 07:57:43 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 12:57:43 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

On 2020-06-20 10:30 AM, David Starner via Unicode wrote:
> It doesn't feel that you want a
> way to store and display overtyped text, it feels that you want
> Unicode to officially support it. It's a very complex and expensive
> expansion, but if a bunch of people were using overtyped text, this
> discussion might be going differently.

Yes.
And even if we suppose that some groups like coin collectors or
epigraphers might find that kind of feature helpful, there's a couple of
points to consider. One is that groups like that are probably already
cheerfully exchanging information using either some kind of rich text
scheme or some kind of plain-text convention. The other is that if their
needs were not being met they would be lobbying to get some kind of
support, which doesn't seem to be happening.

In the case of coin catalogs, the 'spell-it-out' convention predates the
computer era. It wouldn't surprise me if epigraphic conventions also
predate computers. People tend to stick with their conventions. If the
overstrike feature became available in plain-text, I'd expect the coin
catalogs to keep spelling things out for both clarity and consistency.

From jameskasskrv at gmail.com  Sat Jun 20 08:08:36 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 13:08:36 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: <0d6610c1-9f8f-0f9f-efba-677eb6700bc2@gmail.com>

On 2020-06-20 6:30 AM, abrahamgross--- via Unicode wrote:
> Does Yeoman show the "8 over 7" visually too, or does it just say
> "8 over 7" and you're supposed to imagine it urself?

Sometimes the catalogs include close-up photographs of the more popular
variations in the graphics section of a page, but in the text listings
it's just spelled out. The Yeoman consulted earlier was a 1984 print
version. But I also looked at an on-line catalog, "numista", which
likewise spelled out this particular overstrike (U.S.A. 1918D nickel
five cent piece).

From andrewcwest at gmail.com  Sat Jun 20 08:26:28 2020
From: andrewcwest at gmail.com (Andrew West)
Date: Sat, 20 Jun 2020 14:26:28 +0100
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: 

On Sat, 20 Jun 2020 at 14:03, James Kass via Unicode wrote:
>
> computer era. It wouldn't surprise me if epigraphic conventions also
> predate computers. People tend to stick with their conventions. If the
> overstrike feature became available in plain-text, I'd expect the coin
> catalogs to keep spelling things out for both clarity and consistency.

No-one would ever typographically overstrike one character with another
in a coin catalog because it would be difficult to read or even
illegible, and impossible to know if it meant "7 overstruck with 8" or
"8 overstruck with 7".
Andrew

From richard.wordingham at ntlworld.com  Sat Jun 20 09:32:19 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 15:32:19 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
Message-ID: <20200620153219.4d62c00f@JRWUBU2>

On Sat, 20 Jun 2020 14:11:15 +0200
Corentin via Unicode wrote:

> On Sat, 20 Jun 2020 at 13:14, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
>
> > On Sat, 20 Jun 2020 10:57:10 +0200
> > Corentin via Unicode wrote:
> >
> > > On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> > > wrote:
> > > >
> > > > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > > > > Isn't there still the issue of supporting U+0000 in C-type
> > > > > strings?
> > > >
> > > > I don't see why. And it has nothing to do with Unicode per se,
> > > > anyway.
> > > >
> > > > That is just a transform of the question of "the issue of
> > > > supporting 0x00 in C-type strings restricted to ASCII."
> > > >
> > > > The issue is precisely the same, and the solutions are precisely
> > > > the same -- by design.
> > >
> > > I'm not sure I understand that issue; could you clarify?
> > > In both C and C++, U+0000 is interpreted as the null character
> > > (which marks the end of the string, depending on context), which
> > > is the same behavior as the equivalent ASCII character.
> >
> > One immediate consequence of that assertion is that one cannot in
> > general store a line of Unicode text in a 'string'. There have been
> > Unicode test cases that deliberately include a null in the middle of
> > the text, and if the program thinks it has stored the line in a
> > 'string', it will fail the test, because the null character and
> > beyond are not part of the text being interpreted.
> >
> > One of the early tricks to store general character sequences in
> > strings was to use non-shortest form UTF-8 encodings to avoid
> > characters being interpreted as control characters with undesired
> > characteristics. This form of UTF-8 is now invalid. Java was
> > especially noted for using the encoding to store zero
> > bytes in byte code in UTF-8.
> >
> > My guess is that Ken is alluding to not storing arbitrary text in
> > strings, but rather in arrays of code units along with appropriate
> > length parameters.
>
> Oh, yes, I see, thanks.
> It's a special case of "null-terminated strings were a mistake".
> But U+0000 has no other use or alternative semantics, right? The main
> use case would be test cases?

I believe Unicode doesn't define its semantics, but rather defers by
default to ECMA-48. Of NUL it says, "NUL is used for media-fill or
time-fill. NUL characters may be inserted into, or removed from, a data
stream without affecting the information content of that stream, but
such action may affect the information layout and/or the control of
equipment."

I have used it for easy composition of Fortran output lines from
CHARACTER variables; the NULs in the resulting lines were simply ignored
when the output was displayed on a terminal or line printer. The Fortran
90 intrinsic function TRIM provided an easier and more reliable way of
doing the same job; embedded NULs don't play well with C.

Richard.
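Richard's closing remark is easy to demonstrate. In the sketch below
(the buffer contents are invented), the C string functions see only the
text up to the first NUL, which is exactly why an explicit length has to
be carried alongside the data:

    #include <cstdio>
    #include <cstring>

    int main() {
        // A 12-unit text with an embedded NUL, as in the Unicode test
        // files mentioned above.
        const char buf[] = "Hello\0world!";      // sizeof buf == 13
        std::printf("%zu\n", std::strlen(buf));  // prints 5
        std::printf("%zu\n", sizeof buf - 1);    // prints 12, the real length

        char copy[sizeof buf];
        std::memcpy(copy, buf, sizeof buf);      // preserves "\0world!"
        // std::strcpy(copy, buf) would have stopped after "Hello".
    }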
From kenwhistler at sonic.net  Sat Jun 20 09:45:45 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Sat, 20 Jun 2020 07:45:45 -0700
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
Message-ID: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>

On 6/20/2020 5:11 AM, Corentin via Unicode wrote:
>
> My guess is that Ken is alluding to not storing arbitrary text in
> strings, but rather in arrays of code units along with appropriate
> length parameters.
>
> Oh, yes, I see, thanks.
> It's a special case of "null-terminated strings were a mistake".
> But U+0000 has no other use or alternative semantics, right? The main
> use case would be test cases?

Yes, that was basically what I was alluding to.

Richard is making the purist point that U+0000 is a Unicode character,
and therefore should be transmissible as part of any Unicode plain text
stream.

But the C string is not actually "plain text" -- it is a convention for
representing a string which makes use of 0x00 as a "syntactic" character
to terminate the string without counting for its length. And that was
already true back in 7-bit ASCII days, of course. People's workaround,
if they need to represent NULL *in* character data in a "string" in a C
program, was to simply use char arrays, manage length external to the
"string" stored in the array, and then avoid the regular C string
runtime library calls when manipulating them, because those depend on
0x00 as a signal of string termination.

Such cases need not be limited to test cases. One can envision real
cases, as for example, packing a data store full of null-terminated
strings and then wanting to manipulate that entire data store as a
chunk. It is, of course, full of NULL bytes for the null-terminated
strings. But the answer, of course, is to just keep track of the size of
the entire data store and use memcpy() instead of strcpy(). I've had to
deal with precisely such cases in real production code.

Now fast forward to Unicode and UTF-8. U+0000 is a Unicode character,
but in UTF-8 it is, of course, represented as a single 0x00 code unit.
And for the ASCII subset of Unicode, you cannot even tell the difference
-- it is precisely identical, as far as C strings and their manipulation
are concerned. Which was precisely my point:

7-bit ASCII: One cannot represent NULL (0x00) as part of the content of
a C string. Resort to char arrays.

Unicode UTF-8: One cannot represent U+0000 NULL (0x00) as part of the
content of a C string. Resort to char arrays.

The convention of using non-shortest UTF-8 to represent embedded NULLs
in C strings was simply a non-interoperable hack that people tried
because they fervently believed that NULLs *should* be embeddable in C
strings, after all. The UTC put a spike in that one by ruling that
non-shortest UTF-8 was ill-formed for any purpose.

This whole issue has been a permanent confusion for C programmers, I
think, largely because C is so loosey goosey about pointers, where a
pointer is really just an index register wolf in sheep's clothing. With
a char* pointer in hand, one cannot really tell whether it is referring
to an actual C string following the null-termination convention, or a
char array full of characters interpreted as a "string", but without
null termination, or a char array full of arbitrary byte values meaning
anything.
And from that source flow thousands upon thousands of C program bugs. :(

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com  Sat Jun 20 10:53:26 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 16:53:26 +0100
Subject: EBCDIC control characters
In-Reply-To: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: <20200620165326.67dfbee5@JRWUBU2>

On Sat, 20 Jun 2020 07:45:45 -0700
Ken Whistler via Unicode wrote:

> Richard is making the purist point that U+0000 is a Unicode
> character, and therefore should be transmissible as part of any
> Unicode plain text stream.

Prompted by the pain of Unicode test files with embedded nulls and
even embedded end of file. I could never work out why isolated UTF-16
code units should be handled, but there was no need to handle isolated
UTF-8 code units.

> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
> of a C string. Resort to char arrays.

Actually, you can. As the size of char is at least 8 bits, you have
128 spare codes. :-)

Richard.

From harjitmoe at outlook.com  Sat Jun 20 11:43:26 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sat, 20 Jun 2020 17:43:26 +0100
Subject: EBCDIC control characters
In-Reply-To: <20200620165326.67dfbee5@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
 <20200620165326.67dfbee5@JRWUBU2>
Message-ID: 

Richard Wordingham via Unicode wrote:
> Prompted by the pain of Unicode test files with embedded nulls and
> even embedded end of file.

Embedded nulls might indeed be used disruptively in user-submitted
content (to induce truncation or, if the nulls are removed or ignored by
something downstream of a sanitisation step, even to mask malicious
sequences). In such applications, there may be a need to deal with them
somehow (even if that is simply replacing U+0000 instances with U+FFFD,
as stipulated in the spec for e.g. CommonMark). But so long as it can
accurately output the string and its length in code units, it's not
really the decoder's job to sort this out.

> I could never work out why isolated UTF-16 code units should be
> handled, but there was no need to handle isolated UTF-8 code units.

Depends on the context you are working in. Python's PEP 383
( https://www.python.org/dev/peps/pep-0383/ ) does define a scheme for
passing isolated 8-bit code units through a decoder and encoder
unchanged, actually in much the same way as tends to be done for UTF-16,
i.e. passing around isolated surrogate codes. This is not the default
behaviour, but it arose as a solution to the problem of handling
potentially invalid data in Unix filenames (similar to the issue of
potentially invalid UTF-16 data in Windows filenames).

-- Har

>> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
>> of a C string. Resort to char arrays.
> Actually, you can. As the size of char is at least 8 bits, you have
> 128 spare codes. :-)
>
> Richard.
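For readers unfamiliar with PEP 383, here is the gist of its
"surrogateescape" error handler transplanted into C++ terms. This is a
deliberately simplified sketch with invented function names -- it treats
every byte above 0x7F as invalid instead of first attempting a real
UTF-8 decode, which real code would do:

    #include <string>

    std::u32string decode_with_escape(const std::string& bytes) {
        std::u32string out;
        for (unsigned char b : bytes) {
            if (b < 0x80)
                out += static_cast<char32_t>(b);           // plain ASCII
            else
                out += static_cast<char32_t>(0xDC00 + b);  // smuggle the raw
                                                           // byte as a lone
                                                           // low surrogate
        }
        return out;
    }

    std::string encode_with_escape(const std::u32string& text) {
        std::string out;
        for (char32_t cp : text) {
            if (cp >= 0xDC80 && cp <= 0xDCFF)
                out += static_cast<char>(cp - 0xDC00);     // restore the byte
            else if (cp < 0x80)
                out += static_cast<char>(cp);
            // non-ASCII scalar values would be UTF-8-encoded here
        }
        return out;
    }

Because U+DC80..U+DCFF are unpaired low surrogates, which cannot occur
in well-formed Unicode text, the smuggled bytes can always be told apart
from genuinely decoded characters and restored on encoding.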
From corentin.jabot at gmail.com  Sat Jun 20 12:00:51 2020
From: corentin.jabot at gmail.com (Corentin)
Date: Sat, 20 Jun 2020 19:00:51 +0200
Subject: EBCDIC control characters
In-Reply-To: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: 

On Sat, 20 Jun 2020 at 16:45, Ken Whistler wrote:

> On 6/20/2020 5:11 AM, Corentin via Unicode wrote:
>
> > My guess is that Ken is alluding to not storing arbitrary text in
> > strings, but rather in arrays of code units along with appropriate
> > length parameters.
> >
> > Oh, yes, I see, thanks.
> > It's a special case of "null-terminated strings were a mistake".
> > But U+0000 has no other use or alternative semantics, right? The
> > main use case would be test cases?
>
> Yes, that was basically what I was alluding to.
>
> Richard is making the purist point that U+0000 is a Unicode character,
> and therefore should be transmissible as part of any Unicode plain
> text stream.
>
> But the C string is not actually "plain text" -- it is a convention
> for representing a string which makes use of 0x00 as a "syntactic"
> character to terminate the string without counting for its length. And
> that was already true back in 7-bit ASCII days, of course. People's
> workaround, if they need to represent NULL *in* character data in a
> "string" in a C program, was to simply use char arrays, manage length
> external to the "string" stored in the array, and then avoid the
> regular C string runtime library calls when manipulating them, because
> those depend on 0x00 as a signal of string termination.
>
> Such cases need not be limited to test cases. One can envision real
> cases, as for example, packing a data store full of null-terminated
> strings and then wanting to manipulate that entire data store as a
> chunk. It is, of course, full of NULL bytes for the null-terminated
> strings. But the answer, of course, is to just keep track of the size
> of the entire data store and use memcpy() instead of strcpy(). I've
> had to deal with precisely such cases in real production code.
>
> Now fast forward to Unicode and UTF-8. U+0000 is a Unicode character,
> but in UTF-8 it is, of course, represented as a single 0x00 code unit.
> And for the ASCII subset of Unicode, you cannot even tell the
> difference -- it is precisely identical, as far as C strings and their
> manipulation are concerned. Which was precisely my point:
>
> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
> of a C string. Resort to char arrays.
>
> Unicode UTF-8: One cannot represent U+0000 NULL (0x00) as part of the
> content of a C string. Resort to char arrays.
>
> The convention of using non-shortest UTF-8 to represent embedded NULLs
> in C strings was simply a non-interoperable hack that people tried
> because they fervently believed that NULLs *should* be embeddable in C
> strings, after all. The UTC put a spike in that one by ruling that
> non-shortest UTF-8 was ill-formed for any purpose.
>
> This whole issue has been a permanent confusion for C programmers, I
> think, largely because C is so loosey goosey about pointers, where a
> pointer is really just an index register wolf in sheep's clothing.
> With a char* pointer in hand, one cannot really tell whether it is
> referring to an actual C string following the null-termination
> convention, or a char array full of characters interpreted as a
> "string", but without null termination, or a char array full of
> arbitrary byte values meaning anything. And from that source flow
> thousands upon thousands of C program bugs. :(

To be super pedantic, strings *are* arrays, but they decay to pointers
really easily, at which point the only way to know their size is to look
for 0x0, which made sense at one point in 1964 - if you never use strlen
you are fine. In fact, it is common for people to use multiple nulls as
string delimiters within a larger array.

> --Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com  Sat Jun 20 12:34:33 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 20 Jun 2020 18:34:33 +0100
Subject: EBCDIC control characters
In-Reply-To: 
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: <20200620183433.1c42f6ef@JRWUBU2>

On Sat, 20 Jun 2020 19:00:51 +0200
Corentin via Unicode wrote:

> To be super pedantic, strings *are* arrays, but they decay to pointers
> really easily, at which point the only way to know their size is to
> look for 0x0, which made sense at one point in 1964 - if you never
> use strlen you are fine. In fact, it is common for people to use
> multiple nulls as string delimiters within a larger array.

I think almost all the functions in string.h go wrong if you want to
treat NUL as an ordinary character. strncpy(), strncmp() and strncat()
certainly do. Inserting NUL into a C string chops it up into multiple
C strings.

Richard.

From haberg-1 at telia.com  Sat Jun 20 14:34:38 2020
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Sat, 20 Jun 2020 21:34:38 +0200
Subject: EBCDIC control characters
In-Reply-To: <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
Message-ID: 

> On 20 Jun 2020, at 16:45, Ken Whistler via Unicode wrote:
>
> This whole issue has been a permanent confusion for C programmers, I
> think, largely because C is so loosey goosey about pointers, where a
> pointer is really just an index register wolf in sheep's clothing.
> With a char* pointer in hand, one cannot really tell whether it is
> referring to an actual C string following the null-termination
> convention, or a char array full of characters interpreted as a
> "string", but without null termination, or a char array full of
> arbitrary byte values meaning anything. And from that source flow
> thousands upon thousands of C program bugs. :(

The distinction can conveniently be indicated in C++, as in the example
at the bottom of [1]: "abc\0\0def" converts as a C string which
truncates at the first \0, whereas "abc\0\0def"s converts to a
std::string that keeps track of the full length.

1. https://en.cppreference.com/w/cpp/string/basic_string/operator%22%22s
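Hans's ""s literals, and the ""sv literals Markus Scherer mentions below,
can be seen side by side in a short sketch. The literal "abc\0\0def" is
the example from the cppreference page cited above; this requires C++17:

    #include <cassert>
    #include <string>
    #include <string_view>

    int main() {
        using namespace std::literals;   // brings in both ""s and ""sv

        auto p  = "abc\0\0def";          // const char*: length is a lie
        auto s  = "abc\0\0def"s;         // std::string (C++14)
        auto sv = "abc\0\0def"sv;        // std::string_view (C++17)

        assert(std::char_traits<char>::length(p) == 3);  // stops at NUL
        assert(s.size() == 8 && sv.size() == 8);         // all 8 code units
    }

The literal operators take a (pointer, length) pair determined at compile
time, which is why the embedded NULs survive.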
From tom at honermann.net  Sat Jun 20 15:36:01 2020
From: tom at honermann.net (Tom Honermann)
Date: Sat, 20 Jun 2020 16:36:01 -0400
Subject: UAX 31 for C++ Identifiers
In-Reply-To: <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com>
References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com>
 <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com>
Message-ID: 

On 6/20/20 2:44 AM, Asmus Freytag (c) via Unicode wrote:
> My meta point had been about possibly different levels of security
> issues between compile time and runtime.
> A./

When you mentioned "modules", were you referring to C++20 modules? If
so, there may be some confusion; C++20 modules is a compile-time feature
with no run-time component.

Tom.

>
> On 6/19/2020 8:22 PM, Steve Downey wrote:
>> On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode
>> wrote:
>>> In source code, having ambiguous identifiers may not be worse than
>>> C-style obfuscation.
>>>
>> Until recently (the last release 10.1), gcc rejected much of allowed
>> unicode in UTF-8 input, even in places it would allow \u
>> universal-character-names. So this all becomes easier now. As a
>> Standard, we should have handled this better earlier, but the second
>> best time is now. The XID_ properties make this a lot more palatable
>> w.r.t. stability, though, and I'm not going to second guess people 10
>> or 20 or more years ago, too much. Ambiguity in external identifiers
>> is already ill-formed no diagnostic required, which means broken but
>> in ways that compilers can't treat as undefined.
>>
>>> But with module names, etc. you may run into security issues if
>>> naming allows / facilitates spoofing.
>>>
>> I, and other people doing tools, both won and lost this battle
>> already. Module names in source do not correspond with anything
>> physical. `import some.module` connects you to whatever exported
>> `some.module` by magic as far as the standard is concerned. We're
>> working on the actual mechanics as a Technical Report, and compiler
>> vendors are participating and aren't, as far as I can tell, more
>> insane than the average infrastructure engineer. So I have hope.
>>
>> Mapping anything to file paths is fraught beyond belief, and there are
>> many experienced engineers providing war stories and parades of
>> horribles, although I'd personally like to have more stories to tell.
>>
>> The entire disconnect between logical and physical actually is
>> hopeful, in a way that `#include ` isn't. Even though
>> we have a lot of understanding of how that maps to filesystem
>> searches.
>>
>> Province of wg21/sg15, which I also participate in.
>>
>> I suspect that trying to fix up anything with #include is infeasible
>> since it's currently the wild west, changes will break, and C++
>> depends in practice on system provided headers that at best conform to
>> old C standards.
>>
>> Thanks!
>>
>> -SMD

From markus.icu at gmail.com  Sat Jun 20 15:47:59 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 20 Jun 2020 13:47:59 -0700
Subject: EBCDIC control characters
In-Reply-To: <20200620183433.1c42f6ef@JRWUBU2>
References: <9a1ac547-0db2-4ab8-3dab-89012a377347@sonic.net>
 <20200619114827.07df1f21@JRWUBU2>
 <9b231c17-d21e-60ad-3f2f-95439ccb6356@sonic.net>
 <20200620120904.0c6f7630@JRWUBU2>
 <42525c51-20be-9443-4420-ac5e59b32319@sonic.net>
 <20200620183433.1c42f6ef@JRWUBU2>
Message-ID: 

On Sat, Jun 20, 2020 at 10:40 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> I think almost all the functions in string.h go wrong if you want to
> treat NUL as an ordinary character. strncpy(), strncmp() and strncat()
> certainly do. Inserting NUL into a C string chops it up into multiple
> C strings.

Right, you need to carry ptr+length and use memcpy() etc. Or use
std::string in C++.

Since C++17 we can use std::string_view as *the* type for input strings
that are not to be modified. Inspired by the email from Hans, I looked,
and there is "syntax"sv for string_view literals.
https://en.cppreference.com/w/cpp/string/basic_string_view
https://en.cppreference.com/w/cpp/string/basic_string_view/operator%22%22sv

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jameskasskrv at gmail.com  Sat Jun 20 18:26:19 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 20 Jun 2020 23:26:19 +0000
Subject: OverStrike control character
In-Reply-To: 
References: <20200616180522.3a2bbcae@JRWUBU2>
 <20200609225134.oxkhssxtlj5t664t@nitsipc.home>
 <1bdad0a3-2aca-4350-97c8-f0616253eadd@disroot.org>
 <83cf7fb2b38bf1634d845d1569a6bbf3@disroot.org>
 <20200619113250.66bb7f47@JRWUBU2>
 <20200619154216.6f85f1d5@JRWUBU2>
 <651033e7-5554-259d-929b-ce4cc3615fdb@gmail.com>
Message-ID: <61e66815-dfea-d64e-7dbe-7989255970eb@gmail.com>

On 2020-06-20 6:30 AM, abrahamgross--- via Unicode wrote:
> If epigraphers and numismaticians have the need for overstriking in
> plain text, isn't that reason enough to encode it? Unicode encoded
> many completely extinct scripts* and extinct characters in existing
> scripts, so adding the overstrike doesn't seem like a stretch at all.

Epigraphers and numismatists indeed preserve and exchange information
about overstriking. But they have existing conventions for doing so
which apparently serve them well. Similar arguments were made against
the encoding of ancient scripts. The scholars could just go on happily
transliterating and transcribing their ancient texts. When contact was
established with various user groups, some folks said they would
continue using transliteration. But other folks said they would welcome
and embrace the ability to store and exchange data in the actual
original scripts. Encoding the ancient scripts did no harm; the scholars
preferring transliteration could keep on transliterating. Ancient script
encoding opened up new vistas for those who welcomed it, and I think
this is especially true for undeciphered scripts.

So anyone seriously considering floating a proposal for an overstrike
mechanism in Unicode would be well advised to establish contact with
potential users to determine whether such a mechanism would see any
actual use.
From doug at ewellic.org  Sat Jun 20 19:44:39 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 20 Jun 2020 18:44:39 -0600
Subject: OverStrike control character
Message-ID: <006701d64765$25b2c470$71184d50$@ewellic.org>

James Kass wrote:

> So anyone seriously considering floating a proposal for an overstrike
> mechanism in Unicode would be well advised to establish contact with
> potential users to determine whether such a mechanism would see any
> actual use.

When we proposed the "bunch of characters encoded to support old 8-bit
machines" that David Starner referred to, being able to cite assurance
from actual end users that they would use these characters was not just
a good idea; it was essential to getting the characters encoded. (Since
then, we have learned of many times more users who plan to use them, or
are already using them, than we knew about at the time.)

Neither Yeoman nor any other coin catalog would ever intentionally
print, say, an 8 over a 7 in a listing for an overdate. They might do so
in rich text (i.e. a published book) to illustrate, for novice
collectors, what is meant by an overdate; but that could be done just as
well, and usually is, with a greatly magnified picture of the coin in
question. So I don't think it can be said that numismatists have a
"need" for overstriking in plain text.

It seems the only serious use case for this character (as opposed to "it
would be fun" or "it would be possible" or "Unicode has lots of empty
code points, and look at the stuff they've already encoded") is that
people could make up their own characters, so long as they consisted of
two or more existing glyphs, one overstruck on the other, and they would
have a non-PUA Unicode representation. Is that about the size of it?

-- 
Doug Ewell | Thornton, CO, US | ewellic.org

From abrahamgross at disroot.org  Sat Jun 20 20:31:59 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Sun, 21 Jun 2020 01:31:59 +0000 (UTC)
Subject: OverStrike control character
In-Reply-To: <006701d64765$25b2c470$71184d50$@ewellic.org>
References: <006701d64765$25b2c470$71184d50$@ewellic.org>
Message-ID: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org>

Basically, yes.
Unicode has plenty of basic geometric shapes throughout that can be
utilized to build interchangeable (and non-PUA) characters. (If
Classical Yi ever gets accepted, then you'll be able to use just about
any shape out there for your overstriking needs (the proposal lists over
88k new chars!))

2020/06/20 ??8:45:18 Doug Ewell via Unicode :

> It seems the only serious use case for this character (as opposed to
> "it would be fun" or "it would be possible" or "Unicode has lots of
> empty code points, and look at the stuff they've already encoded") is
> that people could make up their own characters, so long as they
> consisted of two or more existing glyphs, one overstruck on the other,
> and they would have a non-PUA Unicode representation. Is that about
> the size of it?

From mark at kli.org  Sat Jun 20 20:37:32 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Sat, 20 Jun 2020 21:37:32 -0400
Subject: OverStrike control character
In-Reply-To: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org>
References: <006701d64765$25b2c470$71184d50$@ewellic.org>
 <8025f806-835e-4422-8c63-60be10a4059b@disroot.org>
Message-ID: <59e1f2b3-17e7-99a9-ccc2-5eb1ab75beef@kli.org>

On 6/20/20 9:31 PM, abrahamgross--- via Unicode wrote:
> Basically, yes.
From asmusf at ix.netcom.com Sat Jun 20 23:38:49 2020 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Sat, 20 Jun 2020 21:38:49 -0700 Subject: UAX 31 for C++ Identifiers In-Reply-To: References: <9f9e7e04-e035-2740-419c-b9667b44bd80@ix.netcom.com> <8f744f47-5464-1fe0-7b56-4815a3a28ce3@ix.netcom.com> Message-ID: <84ee1ede-30bf-f474-0cbc-03469043bea4@ix.netcom.com> On 6/20/2020 1:36 PM, Tom Honermann wrote: > On 6/20/20 2:44 AM, Asmus Freytag (c) via Unicode wrote: >> My meta point had been about possibly different levels security >> issues between compile time and runtime. >> A./ > > When you mentioned "modules", were you referring to C++20 modules?? If > so, there may be some confusion; C++20 modules is a compile-time > feature with no run-time component. > > Tom. I had been thinking of interfaces to the various OSs, like dynamically linked libraries, etc. that are usually named with identifiers of some sort. Although to the language proper, these may just be strings, of course. A./ > >> >> On 6/19/2020 8:22 PM, Steve Downey wrote: >>> On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode >>> ? wrote: >>>> In source code, having ambiguous identifiers may not be worse than >>>> C-style obfuscation. >>>> >>> Until recently (the last release 10.1), gcc rejected much of allowed >>> unicode in UTF-8 input, even in places it would allow \u >>> universal-character-names. So this all becomes easier now. As a >>> Standard, we should have handled this better earlier, but the second >>> best time is now. The XID_ properties make this a lot more palatable >>> w.r.t. stability, though, and I'm not going to second guess people 10 >>> or 20 or more years ago, too much. Ambiguity in external identifiers >>> is already ill-formed no diagnostic required, which means broken but >>> in ways that compilers can't treat as undefined. >>> >>>> But with module names, etc. you may run into security issues if >>>> naming allows / facilitates spoofing. >>>> >>> I, and other people doing tools, both won and lost this battle >>> already. Module names in source do not correspond with anything >>> physical. `import some.module` connects you to whatever exported >>> `some.module` by magic as far as the standard is concerned. We're >>> working on the actual mechanics as a Technical Report, and compiler >>> vendors are participating and aren't, as far as I can tell, more >>> insane than the average infrastructure engineer. So I have hope. >>> >>> Mapping anything to file paths is fraught beyond belief, and there are >>> many experienced engineers providing war stories and parades of >>> horribles, although I'd personally like to have more stories to tell. >>> >>> The entire disconnect between logical and physical actually is >>> hopeful, in a way that `#include ` isn't. Even though >>> we have a lot of understanding of how that maps to filesystem >>> searches. >>> >>> Province of wg21/sg15 , which I also participate in. >>> >>> I suspect that trying to fix up anything with #include is infeasible >>> since it's currently the wild west, changes will break, and C++ >>> depends in practice on system provided headers that at best conform to >>> old C standards. >>> >>> Thanks! >>> >>> -SMD >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrewcwest at gmail.com Sun Jun 21 05:36:43 2020 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 21 Jun 2020 11:36:43 +0100 Subject: OverStrike control character In-Reply-To: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> Message-ID: On Sun, 21 Jun 2020 at 02:33, abrahamgross--- via Unicode wrote: > > Basically, yes. unicode has plenty of basic geometric shapes throughout that can be utilized to build interchangeable (and non-PUA) characters. (if Classical Yi ever get accepted, then youll be able to use just about any shape out there for your overstriking needs (the proposal lists over 88k new chars!)) There is no such thing as "Classical Yi" (i.e. a single language/script with a well-known and well-studied corpus of literary texts) -- if there was we would have encoded it ten years ago. The proposal you refer to just lists an unorganized and unified list of glyph forms used in multiple different Yi script traditions and manuscript sources. It is not even a starting point for a proper encoding proposal. In fact there are likely to be several separate proposals for additional Yi scripts representing separate regional traditions, each comprising about a thousand characters or so. See for example my listing of 1,389 characters used in the Sani Yi script: https://www.babelstone.co.uk/Yi/Sani_list.html Andrew From mark at kli.org Sun Jun 21 18:47:34 2020 From: mark at kli.org (Mark E. Shoulson) Date: Sun, 21 Jun 2020 19:47:34 -0400 Subject: OverStrike control character In-Reply-To: <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> Message-ID: On 6/20/20 9:31 PM, abrahamgross--- via Unicode wrote: > Basically, yes. unicode has plenty of basic geometric shapes throughout that can be utilized to build interchangeable (and non-PUA) characters. (if Classical Yi ever get accepted, then youll be able to use just about any shape out there for your overstriking needs (the proposal lists over 88k new chars!)) Essentially "painting" with characters.? Which wouldn't work in a consistent fashion (you point out yourself it won't render things the same if you use different fonts) and would be MUCH more complicated to use than just encoding some vector drawing language with Unicode code-points (which has been suggested, and has its own raft of issues).? Which set of Yi characters will paint just the picture of George Washington that I want...?? Easier to paint it! You had previously said >> What about overstriking a LTR character with a RTL one, or vice-versa? Which way does the text go after that? >> > The text after that goes in the direction of the text afterwards. So for ?L??????? its gonna look like ?[L???]?????? and for ?L??ab? its gonna look like ?[L?]ab?. Meaning that only the very next letter gets overstruck, and anything afterwards continues on like it would normally. It's not about what happens if you put a strong LTR or RTL character afterwards.? Those always carry their own directionality!? Read up on the Unicode Bidi algorithm.? The direction of a stream of text is stateful, and some characters adapt themselves to what the current directionality is.? If I have A??, what state does that leave things in?? Is it the same or different from ??A? ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kent.b.karlsson at bahnhof.se Sun Jun 21 19:15:22 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Mon, 22 Jun 2020 02:15:22 +0200 Subject: OverStrike control character In-Reply-To: <26abf033-3f9c-dabd-ed99-19337db77a69@gmail.com> References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> <59e1f2b3-17e7-99a9-ccc2-5eb1ab75beef@kli.org> <230194d6-3a7d-4f5c-8907-ade211f8929c@disroot.org> <26abf033-3f9c-dabd-ed99-19337db77a69@gmail.com> Message-ID: <2871D945-971F-4D8B-AA00-DD4ECE57215F@bahnhof.se> > 21 juni 2020 kl. 05:34 skrev James Kass via Unicode : > [?] the recent addition to Unicode mentioned by David Starner and Doug Ewell, "214 graphic characters that provide compatibility with various home computers from the mid-1970s to the mid-1980s and with early teletext broadcasting standards?. Note, however, that Teletext is not something obsolete in any way. It does still use charsets that have (otherwise) grown obsolete, e.g. several ?national variants? of ISO 646, but also cover Greek, Arabic and Hebrew, with the charset used communicated in the Teletext protocol. But Teletext is still supported in every TV set (and ?TV box?) sold the last few decades in at least Europe and likely other parts of the world as well. Using Teletext for news service is very much on the decline, that is true. But Teletext is still often used for optional subtitling. See https://www.etsi.org/deliver/etsi_en/300700_300799/300706/01.02.01_60/en_300706v010201p.pdf (from 2003) for the standard describing it. There are (now) apps for mobile phones showing Teletext pages from certain TV channels (some still offer news services via Teletext); I find apps for SVT (Swedish public television) and Danish TV, and one that offers Teletext pages from several countries. These apps must convert to Unicode (at some point, since that is what is used for mobile phone apps?), or use graphics... As well as web pages showing current Teletext data, e.g. https://www.svt.se/svttext/web/pages/100.html , https://www.nrk.no/tekst-tv/100/ , https://www.dr.dk/cgi-bin/fttx1.exe/100 , https://www.rtve.es/television/teletexto/100/ . (These, and a few more, still have news services via Teletext.) /Kent Karlsson -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Jun 21 22:42:37 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 22 Jun 2020 04:42:37 +0100 Subject: OverStrike control character In-Reply-To: References: <006701d64765$25b2c470$71184d50$@ewellic.org> <8025f806-835e-4422-8c63-60be10a4059b@disroot.org> Message-ID: <20200622044237.422ccfa6@JRWUBU2> On Sun, 21 Jun 2020 19:47:34 -0400 "Mark E. Shoulson via Unicode" wrote: > It's not about what happens if you put a strong LTR or RTL character > afterwards.? Those always carry their own directionality!? Read up on > the Unicode Bidi algorithm.? The direction of a stream of text is > stateful, and some characters adapt themselves to what the current > directionality is.? If I have A??, what state does that leave things > in?? Is it the same or different from ??A? OTL can't handled mixed script text. Interracting characters have to wind up in the same script run. Richard. From doug at ewellic.org Mon Jun 22 13:34:58 2020 From: doug at ewellic.org (Doug Ewell) Date: Mon, 22 Jun 2020 12:34:58 -0600 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? 
In-Reply-To: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> Message-ID: <000001d648c3$d5bcc870$81365950$@ewellic.org> So, does that mean you don't think L2/18-206 will fly? ? ? -- Doug Ewell | Thornton, CO, US | ewellic.org ? ? -----Original Message----- From: Unicore On Behalf Of John Hudson via Unicore Sent: Monday, June 22, 2020 10:48 To: unicore at unicode.org Subject: Re: What is the current Unicode stance on subscripts and superscripts for mathematical use? ? Math layout and display needs to be able to handle essentially arbitrary super- and subcript characters, and to do so at multiple levels of ?script embedding, e.g. subscripts of superscripts. This requires specialised fonts as well as specialised layout engines. The method we use in OpenType math fonts, i.e. fonts containing a MATH table with extensive scaling and alignment data to be used by math layout engines, is to have variant full-size glyphs that are then scaled down to the superscript, subscript, superscriptscript, etc. sizes and positioned according to MATH table data and tolerances within the layout engines (e.g. some environments may allow for more vertically compressed positioning for inline equations). ? The set of variant glyphs provided for scaling to ?script and ?scriptscript size will vary depending on the font. If a font does not contain such variants for a given character, the layout engine will apply scaling (as defined in the MATH table) to the default glyph for that character. ? We're in the process of extending the STIX Two Math font with a large number of additional variants for ?script and ?scriptscript use, based on frequency analysis from the American Mathematical Society and other members of the STI Pub consortium. Latin and Greek letters are definitely among the most frequent superscript and subscript typeforms, but the overall list is very much larger, includes multiple styles of letters, as well as a variety of symbols and operators. ? J. ? -- ? John Hudson Tiro Typeworks Ltd www.tiro.com Salish Sea, BC tiro at tiro.com ? NOTE: In the interests of productivity, I am currently dealing with email on only two days per week, usually Monday and Thursday unless this schedule is disrupted by travel. If you need to contact me urgently, please use some other method of communication. Thank you. ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 22 13:45:16 2020 From: doug at ewellic.org (Doug Ewell) Date: Mon, 22 Jun 2020 12:45:16 -0600 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> Message-ID: <000501d648c5$46100320$d2300960$@ewellic.org> Sorry, sent to wrong list. -- Doug Ewell | Thornton, CO, US | ewellic.org From john at tiro.ca Mon Jun 22 16:10:32 2020 From: john at tiro.ca (John Hudson) Date: Mon, 22 Jun 2020 14:10:32 -0700 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <000001d648c3$d5bcc870$81365950$@ewellic.org> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000001d648c3$d5bcc870$81365950$@ewellic.org> Message-ID: On 22062020 11:34 am, Doug Ewell wrote: > So, does that mean you don't think L2/18-206 will fly? Has it shown any signs of flying in the past two years? or am I being trolled? 
:) I'll bite: That document is targeting issues in general typographic display variants and muddies the character/glyph distinction. Most of what it calls for are clear cases of typographic glyph processing, e.g. smallcaps as variants of uppercase characters. In that respect, it at once goes too far in calling for smallcap encoding for a large number of existing uppercase characters and not nearly far enough in ignoring vast numbers of existing characters outside the small European subset identified in the document. The author seems also not to understand that existing 'small capitals' in Unicode are not typographic smallcap variants but distinct letters in some phonetic notation systems. The author is not wrong to point out that the existence of some super- and subscript characters in Unicode doesn't always play well with font and algorithmic display of additional characters with super- and subscript styling: size, weight, and alignments can vary, depending on the path from the encoded characters to the styled display, how well the font has been made, and what algorithms are used. But these problems are not solved by encoding a bunch of additional super- and subscript characters. The problems may be pushed further out ? at least for European users of the Latin script ? but not solved. Mathematical notation is a different case: a specialised writing system in which style, size, and relative position all have semantic meaning. It needs a different model for both encoding and layout than typical language text and typography. J. -- John Hudson Tiro Typeworks Ltd www.tiro.com Salish Sea, BC tiro at tiro.com NOTE: In the interests of productivity, I am currently dealing with email on only two days per week, usually Monday and Thursday unless this schedule is disrupted by travel. If you need to contact me urgently, please use some other method of communication. Thank you. From haberg-1 at telia.com Mon Jun 22 16:22:46 2020 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 22 Jun 2020 23:22:46 +0200 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <000501d648c5$46100320$d2300960$@ewellic.org> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> Message-ID: <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> > On 22 Jun 2020, at 20:45, Doug Ewell via Unicode wrote: > > Sorry, sent to wrong list. I use, for text file input (plain UTF-8), Unicode subscript and superscript parentheses as subscript and superscript delimiters, like in ????, ????. It would be nice to have such subscript and superscript delimiters that provide the corresponding rendering in the text editor. From marius.spix at web.de Mon Jun 22 18:09:39 2020 From: marius.spix at web.de (Marius Spix) Date: Tue, 23 Jun 2020 01:09:39 +0200 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> Message-ID: <20200623010928.3a0d00fb@spixxi> This can already be done by rich text. Unicode includes some superscript characters like ?, ? or ? (the degree sign is a superscript version of the white circle U+25CB) for compatibility with legacy character sets and phonetic transcriptions (in some languages the tone is important). 
Unicode?s superscript characters are very limited and nested superscript is not supported at all. For some units and common chemical terms which often appear in plain text in non-scientific contexts (like m?, kg?m/s?, ?C, CO? or Na?) the Unicode superscript characters are sufficient, however. On Mon, 22 Jun 2020 23:22:46 +0200 Hans ?berg via Unicode wrote: > > > On 22 Jun 2020, at 20:45, Doug Ewell via Unicode > > wrote: > > > > Sorry, sent to wrong list. > > I use, for text file input (plain UTF-8), Unicode subscript and > superscript parentheses as subscript and superscript delimiters, like > in ????, ????. It would be nice to have such subscript and > superscript delimiters that provide the corresponding rendering in > the text editor. > > > From richard.wordingham at ntlworld.com Mon Jun 22 19:44:56 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 23 Jun 2020 01:44:56 +0100 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <20200623010928.3a0d00fb@spixxi> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> <20200623010928.3a0d00fb@spixxi> Message-ID: <20200623014456.6413cee6@JRWUBU2> On Tue, 23 Jun 2020 01:09:39 +0200 Marius Spix via Unicode wrote: > This can already be done by rich text. Unicode includes some > superscript characters like ?, ? or ? (the degree sign is a > superscript version of the white circle U+25CB) for compatibility > with legacy character sets and phonetic transcriptions (in some > languages the tone is important). But still there can be annoying gaps. Tai-Kadai is reconstructed to have four tones, conventionally called A, B, C and D, and it is very convenient to use the corresponding superscript letters to label the tone on the syllables. However, one of those capitals is missing, and so in Wiktionary they have to make do with a superscript lower letter for that one! I don't think there's confidence in the reconstruction of the original tones - and its possible that the tone inducers, not the tones themselves, go back to the proto-language. Richard. From haberg-1 at telia.com Tue Jun 23 03:36:53 2020 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Tue, 23 Jun 2020 10:36:53 +0200 Subject: What is the current Unicode stance on subscripts and superscripts for mathematical use? In-Reply-To: <20200623010928.3a0d00fb@spixxi> References: <838b1603-fb45-2e83-8467-2b835e1a3421@tiro.ca> <000501d648c5$46100320$d2300960$@ewellic.org> <3A90FBA1-10BF-40F0-A0BB-59376C7603F4@telia.com> <20200623010928.3a0d00fb@spixxi> Message-ID: <5C2214FF-A9A4-44A9-8E73-02BA8080CC0C@telia.com> Indeed, and in the days of ASCII, one felt that all characters could be encoded with 7-bit bytes. But that is not really legible without some post-processing. I use it in a program that uses plain text as input, which turns out to be very convenient, apart from that superscripts and subscripts might look better. > On 23 Jun 2020, at 01:09, Marius Spix wrote: > > This can already be done by rich text. Unicode includes some > superscript characters like ?, ? or ? (the degree sign is a superscript > version of the white circle U+25CB) for compatibility with legacy > character sets and phonetic transcriptions (in some languages the tone > is important). Unicode?s superscript characters are very limited and > nested superscript is not supported at all. 
From richard.wordingham at ntlworld.com  Tue Jun 23 06:54:57 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 23 Jun 2020 12:54:57 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
Message-ID: <20200623125457.421435ce@JRWUBU2>

The modern Khmer language does not make use of a COENG DA distinct from COENG TA. The normal practice is to render them the same, with a recommendation from Unicode that the choice be based on the sound the subscript represents. At least, there was such a recommendation; I tried to find it again, but failed. The visual distinction faded out in the 1920s according to Antelme.

Now, the Khmer script is not just used for modern languages of Cambodia. It is used for transcribing Old Khmer (for words, at least), and was the religious script of most of Thailand until the 19th century, and was also the secular script in southern Thailand. In these usages, COENG TA and COENG DA are distinct, or at least, TA and DA have distinct subscripts that are clearly associated with them.

Is it legitimate for a font to deliberately render the corresponding named sequences differently while claiming to respect characters' character identities? I thought it obviously was, but I received a demurral when I asked about the best way to request an arbitrary OpenType font to make the distinction. (I expect the overwhelming majority would refuse to make the distinction.) I am therefore asking here for advice on the legitimacy of such a request. Conceivably we need a new character to make the distinction.

Richard.
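For concreteness: the distinction at issue lives entirely in the code point sequences; whether they look different is up to the font. A small Python sketch of that, standard library only:

    import unicodedata

    COENG, TA, DA = "\u17d2", "\u178f", "\u178a"  # KHMER SIGN COENG, LETTER TA, LETTER DA
    coeng_ta = COENG + TA
    coeng_da = COENG + DA

    print(coeng_ta == coeng_da)                                # False: distinct in plain text
    print(unicodedata.normalize("NFC", coeng_da) == coeng_da)  # True: normalization keeps them apart
    print([unicodedata.name(c) for c in coeng_da])

So searching, collation, and diffing all see two spellings; only rendering merges them, which is what makes the font question contentious.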
From asmusf at ix.netcom.com  Tue Jun 23 17:50:27 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 23 Jun 2020 15:50:27 -0700
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200623125457.421435ce@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2>
Message-ID:

An HTML attachment was scrubbed...
URL:

From duerst at it.aoyama.ac.jp  Tue Jun 23 18:29:38 2020
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=)
Date: Wed, 24 Jun 2020 08:29:38 +0900
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200623125457.421435ce@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2>
Message-ID:

Hello Richard,

I'm not an expert on OpenType or Khmer (except for having been on the side of separately encoding subscript letters in Unicode list discussions in the 1990s), but a few comments and questions below.

On 23/06/2020 20:54, Richard Wordingham via Unicode wrote:
> The modern Khmer language does not make use of a COENG DA distinct from
> COENG TA. The normal practice is to render them the same, with a
> recommendation from Unicode that the choice be based on the sound the
> subscript represents. At least, there was such a recommendation; I
> tried to find it again, but failed. The visual distinction faded out
> in the 1920s according to Antelme.
>
> Now, the Khmer script is not just used for modern languages of
> Cambodia. It is used for transcribing Old Khmer (for words, at least),
> and was the religious script of most of Thailand until the 19th
> century, and was also the secular script in southern Thailand. In
> these usages, COENG TA and COENG DA are distinct, or at least, TA and DA
> have distinct subscripts that are clearly associated with them.
>
> Is it legitimate for a font to deliberately render the corresponding
> named sequences differently while claiming to respect characters'
> character identities?

A font for Old Khmer, ... would do that, wouldn't it? I couldn't see anything wrong with that.

> I thought it obviously was, but I received a
> demurral when I asked about the best way to request an arbitrary
> OpenType font to make the distinction.

A truly arbitrary (i.e. arbitrarily chosen) OpenType font probably wouldn't cover Khmer anyway, so it would be unable to even start to make this distinction.

> (I expect the overwhelming
> majority would refuse to make the distinction.)

The majority of fonts that actually cover modern Khmer might not include the relevant glyphs.

> I am therefore asking
> here for advice on the legitimacy of such a request.

I'm guessing that your request was either "How can I coerce a font covering modern Khmer to show different glyphs for COENG TA and COENG DA?" or "How can I create a font that will allow to show different glyphs for COENG TA and COENG DA?"

The reply to the former question is probably "you can't, because the font doesn't contain the necessary glyph". For the latter question, I think it should be possible, unless there's some OpenType stuff for Khmer that gets in the way.

> Conceivably we need
> a new character to make the distinction.

Do you mean you want to make the distinction in modern Khmer fonts? Would that be e.g. for words of Old Khmer that are cited in modern Khmer, or something similar?

Regards, Martin.

> Richard.

From richard.wordingham at ntlworld.com  Tue Jun 23 19:03:17 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 24 Jun 2020 01:03:17 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2>
Message-ID: <20200624010317.5f2f5310@JRWUBU2>

On Tue, 23 Jun 2020 15:50:27 -0700 Asmus Freytag via Unicode wrote:

> On 6/23/2020 4:54 AM, Richard Wordingham via Unicode wrote:
> The modern Khmer language does not make use of a COENG DA distinct
> from COENG TA.
> The normal practice is to render them the same, with a
> recommendation from Unicode that the choice be based on the sound the
> subscript represents. At least, there was such a recommendation; I
> tried to find it again, but failed. The visual distinction faded out
> in the 1920s according to Antelme.
>
> Now, the Khmer script is not just used for modern languages of
> Cambodia. It is used for transcribing Old Khmer (for words, at least),
> and was the religious script of most of Thailand until the 19th
> century, and was also the secular script in southern Thailand. In
> these usages, COENG TA and COENG DA are distinct, or at least, TA and
> DA have distinct subscripts that are clearly associated with them.
>
> Is it legitimate for a font to deliberately render the corresponding
> named sequences differently while claiming to respect characters'
> character identities? I thought it obviously was, but I received a
> demurral when I asked about the best way to request an arbitrary
> OpenType font to make the distinction. (I expect the overwhelming
> majority would refuse to make the distinction.) I am therefore asking
> here for advice on the legitimacy of such a request. Conceivably we
> need a new character to make the distinction.
>
> Richard.
>
> The recommendation you cite is a bit "common sense". I believe,
> without actual knowledge, that there are no "dt" or "td" combinations,
> only "dd" and "tt". In that case, a spell checker can help you
> pick the correct code for the subscript form.

That's a grammar rule; I'm not sure that spell checkers can exploit it. While Series one ('a') normally has /nt/ for the base consonant, there are or were (my source is Huffman) a few words with /nd/, and there are a few words that can be said either way (Durdin 2018, I think).

My immediate concern was the alternative Old Khmer spellings ???? and ????, which look identical in most fonts. However, I am told the Windows UI font Leelawadee UI distinguishes them, which could make it difficult to outlaw the deliberate distinction.

> Now, the identity of the characters is DA and TA (the COENG forms a
> sequence). Therefore, you don't violate the identity of DA and TA if
> you render their subscript forms distinct.

Don't multi-code-point named sequences get the same protection? COENG DA and COENG TA are named sequences.

> If you have a font that works that way, it may not be usable for
> modern Khmer (unless there's a language tag to select the
> behavior). That's a font issue.

I'm not sure that we can get an OpenType language tag for 19th-century Khmer. However, it seems that the feature tag 'hist' would be appropriate. One could try tagging as Southern Thai (ISO 639-3 sou), but that's another can of worms. Tagging as Sanskrit might work; I don't know enough about modern Khmer script Sanskrit.

Richard.
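Whether a given font even exposes such hooks can be checked mechanically. A sketch with the fontTools library (the font path is a placeholder, and which script, language, and feature tags turn up varies entirely by font):

    from fontTools.ttLib import TTFont

    font = TTFont("SomeKhmerFont.ttf")  # placeholder path, not a real font name
    if "GSUB" in font:
        gsub = font["GSUB"].table
        # Script systems and any registered language systems the font declares.
        for rec in gsub.ScriptList.ScriptRecord:
            langs = [ls.LangSysTag for ls in rec.Script.LangSysRecord]
            print(rec.ScriptTag, "language systems:", langs or "(default only)")
        # Feature tags; 'hist' would have to appear here to be requestable.
        print(sorted({f.FeatureTag for f in gsub.FeatureList.FeatureRecord}))

An application can only usefully request 'hist', or a particular language system, if the font advertises it in this table.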
From richard.wordingham at ntlworld.com  Tue Jun 23 19:20:29 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 24 Jun 2020 01:20:29 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2>
Message-ID: <20200624012029.45ddfe28@JRWUBU2>

On Wed, 24 Jun 2020 08:29:38 +0900 Martin J. Dürst wrote:

> > ... In these usages, COENG TA and COENG DA are distinct, or
> > at least, TA and DA have distinct subscripts that are clearly
> > associated with them.
> >
> > Is it legitimate for a font to deliberately render the corresponding
> > named sequences differently while claiming to respect characters'
> > character identities?
> > I am therefore asking
> > here for advice on the legitimacy of such a request.
>
> I'm guessing that your request was either "How can I coerce a font
> covering modern Khmer to show different glyphs for COENG TA and COENG
> DA?" or "How can I create a font that will allow to show different
> glyphs for COENG TA and COENG DA?"

The request would be made to the font by a combination of language and a setting of OpenType features.

> The reply to the former question is probably "you can't, because the
> font doesn't contain the necessary glyph". For the latter question, I
> think it should be possible, unless there's some OpenType stuff for
> Khmer that gets in the way.

The OpenType question was closer to: how do we make it easy to advise people how to use co-operative fonts, if they exist?

> > Conceivably we need
> > a new character to make the distinction.
>
> Do you mean you want to make the distinction in modern Khmer fonts?
> Would that be e.g. for words of Old Khmer that are cited in modern
> Khmer, or something similar?

Something similar. The application domain was Wiktionary. I suspect most people would be happier to see the words in a Modern Khmer style, but not necessarily a modern Modern Khmer style. The Angkorian styles are quite different from the modern styles, unreadably so without practice.

My Unicode question is also relevant for fonts displaying Unicode text in an Angkorian style. It seems that they do exist, but complying with TUS was probably low down on the authors' list of priorities.

Richard.

From asmusf at ix.netcom.com  Tue Jun 23 22:04:13 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 23 Jun 2020 20:04:13 -0700
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200624010317.5f2f5310@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2>
Message-ID: <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From jameskasskrv at gmail.com  Tue Jun 23 23:17:44 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 24 Jun 2020 04:17:44 +0000
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200624010317.5f2f5310@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2>
Message-ID:

On 2020-06-24 12:03 AM, Richard Wordingham via Unicode wrote:
> My immediate concern was the alternative Old Khmer spellings ???? and
> ????, which look identical in most fonts. However, I am told the
> Windows UI font Leelawadee UI distinguishes them, which could make it
> difficult to outlaw the deliberate distinction.

The Leelawadee font with Windows 7 covers Thai but not Khmer.

From richard.wordingham at ntlworld.com  Wed Jun 24 03:19:41 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 24 Jun 2020 09:19:41 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200624010317.5f2f5310@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2>
Message-ID: <20200624091941.7507cd03@JRWUBU2>

On Wed, 24 Jun 2020 01:03:17 +0100 Richard Wordingham via Unicode wrote:

> On Tue, 23 Jun 2020 15:50:27 -0700
> Asmus Freytag via Unicode wrote:
> > The recommendation you cite is a bit "common sense". I believe,
> > without actual knowledge, that there are no "dt" or "td"
> > combinations, only "dd" and "tt". In that case, a spell checker can
> > help you pick the correct code for the subscript form.
>
> That's a grammar rule; I'm not sure that spell checkers can exploit it.
> While Series one ('a') normally has /nt/ for the base consonant, there
> are or were (my source is Huffman) a few words with /nd/, and there are
> a few words that can be said either way (Durdin 2018, I think).
> That's a grammar rule - I'm not that spell checkers can exploit it. > While Series one ('a') normally has /nt/ for base consonant, there are > or were (my source is Huffman) a few words with /nd/, and there are a > few words that can be said either way (Durdin 2018, I think). I forgot to add the condition that the /n/ be written with NO. Series one with NNO is usually /nd/; I don't know whether series one /nt/ written with NNO exists. It might exist in Pali, but be a matter of sect. Richard. From richard.wordingham at ntlworld.com Wed Jun 24 03:43:45 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 24 Jun 2020 09:43:45 +0100 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> Message-ID: <20200624094345.5b41e198@JRWUBU2> On Wed, 24 Jun 2020 04:17:44 +0000 James Kass via Unicode wrote: > On 2020-06-24 12:03 AM, Richard Wordingham via Unicode wrote: > > My immediate concern was the alternative Old Khmer spellings ???? > > and ????, which look identical in most fonts. However, I am told > > the Windows UI font Leelawadee UI distinguishes them, which could > > make it difficult to outlaw deliberate distinction. > The Leelawadee font with Windows 7 covers Thai but not Khmer. Still true on Windows 10. But Leelawadee UI also includes Khmer, and in response to this post I verified that it makes the distinction, albeit it in an innovative way. Richard. From richard.wordingham at ntlworld.com Wed Jun 24 05:39:01 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 24 Jun 2020 11:39:01 +0100 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> Message-ID: <20200624113901.7cc75c76@JRWUBU2> On Tue, 23 Jun 2020 15:50:27 -0700 Asmus Freytag via Unicode wrote: > The recommendation you cite is a bit "common sense". I believe, > without actual knowledge, that there are no "dt" or "td" combinations > only "dd" and "tt". In that case, a spell checker can help you > pick the correct code for the subscript form. Some SIL notes record ????? /?ut.??m/ ?excellent?; the Pali/Sanskrit word is _uttama_. It makes me wonder if "td" is much commoner than "tt" for series 1. Richard. From kent.b.karlsson at bahnhof.se Wed Jun 24 14:22:25 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 24 Jun 2020 21:22:25 +0200 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> Message-ID: (Picking a quote slightly arbitrarily here.) > They are supposed to represent subscript DA and TA, and for the > old-Khmer style those look different. The fact that they look identical > does not mean that you should only use the subscript TA and expect > it to work where subscript DA is intended. I know it is very late to say this but? To me this seem very much like there has been an ORTHOGRAPHIC change over time (preferring TA over DA when subscript), NOT a commonisation of glyphs. Indeed, one can well argue that giving COENG TA and COENG DA the same glyph violates the character identity for these characters/ character sequences. 
/Kent Karlsson

From richard.wordingham at ntlworld.com  Wed Jun 24 18:29:30 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 25 Jun 2020 00:29:30 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com>
Message-ID: <20200625002930.5067dbc8@JRWUBU2>

On Wed, 24 Jun 2020 21:22:25 +0200 Kent Karlsson via Unicode wrote:

> (Picking a quote slightly arbitrarily here.)
>
> > They are supposed to represent subscript DA and TA, and for the
> > old-Khmer style those look different. The fact that they look
> > identical does not mean that you should only use the subscript TA
> > and expect it to work where subscript DA is intended.
>
> I know it is very late to say this, but… To me this seems very much like
> there has been an ORTHOGRAPHIC change over time (preferring
> TA over DA when subscript), NOT a commonisation of glyphs.
>
> Indeed, one can well argue that giving COENG TA and COENG DA
> the same glyph violates the character identity for these characters/
> character sequences.

The identities of the subscript consonants do seem tied to the base consonants; there have been some drastic changes as current shapes become too confusable.

Now, the usage of the base characters isn't, or wasn't, as sharply defined as one might hope. There are, or were (my source is pre-Khmer Rouge), some words written with a base consonant TA pronounced as though it were DA. According to Huffman, there was free variation between what are encoded and . By that correspondence, simply abandoning the concept of COENG DA probably wasn't an option. Deciding to make COENG DA identical to COENG TA was an option.

Richard.

From asmusf at ix.netcom.com  Wed Jun 24 22:28:41 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 24 Jun 2020 20:28:41 -0700
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com>
Message-ID: <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com  Thu Jun 25 08:57:52 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 25 Jun 2020 14:57:52 +0100
Subject: Khmer tally-style numbers
Message-ID: <20200625145752.63de24f6@JRWUBU2>

I have come across some tally-style numbers in old Khmer inscriptions, and I'm wondering how they should be encoded. I am assuming that the alphabetic script is Khmer. What characters might be being used here? If I use a sans-serif font, the Roman numerals I, II and III work well. Other possibilities I have considered are U+1D369 COUNTING ROD TENS DIGIT ONE to U+1D36B COUNTING ROD TENS DIGIT THREE, and up to three instances of U+1D377 TALLY MARK ONE, which the chart calls a 'western tally mark'. Are any of these formally appropriate?

For high numbers, e.g. '6', the texts seem to use ordinary Khmer decimal digits. It feels massively inappropriate to use the single vertical stroke character U+17F2 KHMER SYMBOL LEK ATTAK PII for a pair of vertical strokes together meaning '2'.

Richard.
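The candidates can at least be lined up from the character-database side (a small sketch; it assumes a Python whose Unicode data is version 11 or later, since the tally marks are that recent):

    import unicodedata

    # Inspect the candidate characters' names and general categories.
    for cp in [0x1D369, 0x1D36A, 0x1D36B, 0x1D377, 0x17F2]:
        ch = chr(cp)
        print("U+%04X %s (category %s)" % (
            cp, unicodedata.name(ch, "(unknown)"), unicodedata.category(ch)))

That settles nothing about appropriateness, of course; it only shows what the standard says each candidate is.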
From vp88.mobile at gmail.com  Thu Jun 25 04:02:38 2020
From: vp88.mobile at gmail.com (Vova Mobile)
Date: Thu, 25 Jun 2020 12:02:38 +0300
Subject: Ukrainian names for some specific Unicode characters
Message-ID:

Hi dear developers and mailing-list members.

Please tell me, where can I get the Ukrainian names of some specific Unicode characters? I am interested in the Ukrainian names of the following symbols: ? ? ? ? ? ? ? ? ? ? ?

Please tell me the Ukrainian names of these characters, or tell me where I can find them.

From kent.b.karlsson at bahnhof.se  Sun Jun 28 15:05:35 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sun, 28 Jun 2020 22:05:35 +0200
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com>
Message-ID:

> 25 juni 2020 kl. 05:28 skrev Asmus Freytag via Unicode :
>
> On 6/24/2020 12:22 PM, Kent Karlsson via Unicode wrote:
>> (Picking a quote slightly arbitrarily here.)
>>
>>> They are supposed to represent subscript DA and TA, and for the
>>> old-Khmer style those look different. The fact that they look identical
>>> does not mean that you should only use the subscript TA and expect
>>> it to work where subscript DA is intended.
>>
>> I know it is very late to say this, but… To me this seems very much like
>> there has been an ORTHOGRAPHIC change over time (preferring
>> TA over DA when subscript), NOT a commonisation of glyphs.
>>
>> Indeed, one can well argue that giving COENG TA and COENG DA
>> the same glyph violates the character identity for these characters/
>> character sequences.
>>
>> /Kent Karlsson
>>
> What fact about the Khmer writing system leads you to that conclusion?
>
> A./
>
> PS: at some point, looking merely at the printed shapes, an issue like this is not decidable -- to decide it you need to know how people using the script conceptualize it.

As far as I can gather, it seems like DA and TA both span 't'-like and 'd'-like pronunciations. So from a pronunciation point of view, their degree of interchangeability seems high. Trying to make rules for when to use one or the other is then very tenuous and prone to change both over space (dialects) and time (spelling changes, formal or informal). Thus one cannot make a reliable rule as to whether a TA-looking subjoined (Khmer) letter should be seen as COENG TA or COENG DA.
This is especially important here, since historically COENG DA had its own separate rendering not conflated with COENG TA rendering. /Kent Karlsson -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Jun 29 09:56:40 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 29 Jun 2020 15:56:40 +0100 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> Message-ID: <20200629155640.5d1c30f3@JRWUBU2> On Sun, 28 Jun 2020 22:05:35 +0200 Kent Karlsson via Unicode wrote: > And indeed, if COENG DA and COENG TA are rendered the same by many > but not all fonts supporting the Khmer script, it is impossible to > reliably communicate things like ?the current spelling of the word is > but the traditional spelling is > ? in plain or formatted text (even > in formatted text the font selection is not very hard, there can be > font substitutions) without resorting to images or extraneous > explanations of which letters were actually used. That seems like a > pity. The different rendering need not be such that (e.g., as here, > COENG DA) it is the old one, but needs to be distinguishable by a > reasonable reader at reasonable font size/resolution. It could be a > ?modernized? rendering of COENG DA, or a more traditional one, but > sufficiently clearly distinct from the rendering of COENG TA (and > distinct from other Khmer subscript letters); THAT would be a font > difference. But at the point where ?original? COENG DA is rendered > exactly the same as COENG TA, it is a spelling change, and should be > treated as such. One of the fonts that comes with Windows 10, Leelawadee UI, actually makes a distinction. (Marc Durdin pointed that out to me in response to this thread.) Its COENG DA leaves a wider gap with the base consonant, and is vertically more compressed to compensate. This is a modern innovation, and to me seems similar in intent to the barely perceptible difference between an open loop U+0067 LATIN SMALL LETTER G and U+0261 LATIN SMALL LETTER SCRIPT G that another font makes. That font looks like part of a move to change the encoding of modern Khmer by replacing COENG DA by COENG TA. That promises to be another complication in transliterating Pali and Sanskrit between Indic scripts. The visible spelling change seems to have been complete in Khmer by 1930, at least as far as printed material was concerned. Does anyone know if COENG TA and COENG DA have been distinguished if subscripts were encoded separately? Still, we are where we are. Richard. From kent.b.karlsson at bahnhof.se Mon Jun 29 10:47:34 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Mon, 29 Jun 2020 17:47:34 +0200 Subject: Distinguishing COENG TA from COENG DA in Khmer script In-Reply-To: <20200629155640.5d1c30f3@JRWUBU2> References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> <20200629155640.5d1c30f3@JRWUBU2> Message-ID: > 29 juni 2020 kl. 16:56 skrev Richard Wordingham via Unicode : > The visible spelling change seems to have been complete in Khmer by > 1930, at least as far as printed material was concerned. 
To make this a little bit less abstract: what did (what is now) COENG DA look like well before 1930?

/Kent Karlsson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com  Mon Jun 29 12:56:16 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 29 Jun 2020 18:56:16 +0100
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> <20200629155640.5d1c30f3@JRWUBU2>
Message-ID: <20200629185616.5622f60c@JRWUBU2>

On Mon, 29 Jun 2020 17:47:34 +0200 Kent Karlsson via Unicode wrote:

> To make this a little bit less abstract: what did (what is now) COENG
> DA look like well before 1930?

See pp. 25 and 26 of 'Inventaire provisoire des caractères et divers signes des écritures khmères pré-modernes et modernes employés pour la notation du khmer, du siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli' by Michel Antelme, and read Note 5 on p. 25. He gives the transliteration of TA and DA as 'ta' and 'ṭa'. The tables treat the merger as a change of spelling. A copy of the paper is available at http://aefek.free.fr/iso_album/antelme_bis.pdf.

The font Khmer2004 mentioned by Antelme, which is targeted at the Khom (Antelme writes this word 'kh?ma') variant of the script, is put through its paces at http://www.khmerfonts.info/fontinfo.php?font=1507 . The display starts with the alphabet, with each letter displayed on itself as a subscript.

Richard.

From kent.b.karlsson at bahnhof.se  Mon Jun 29 13:34:33 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Mon, 29 Jun 2020 20:34:33 +0200
Subject: Distinguishing COENG TA from COENG DA in Khmer script
In-Reply-To: <20200629185616.5622f60c@JRWUBU2>
References: <20200623125457.421435ce@JRWUBU2> <20200624010317.5f2f5310@JRWUBU2> <85da7a2a-5ad9-2332-fe43-291dd0062f4c@ix.netcom.com> <6e390a35-d08e-8c92-ead8-f2e8fa4a2697@ix.netcom.com> <20200629155640.5d1c30f3@JRWUBU2> <20200629185616.5622f60c@JRWUBU2>
Message-ID: <864B1991-CB30-4107-A822-125CC95985AE@bahnhof.se>

> 29 juni 2020 kl. 19:56 skrev Richard Wordingham via Unicode :
>
> On Mon, 29 Jun 2020 17:47:34 +0200
> Kent Karlsson via Unicode wrote:
>
>> To make this a little bit less abstract: what did (what is now) COENG
>> DA look like well before 1930?
>
> See pp. 25 and 26 of 'Inventaire provisoire des caractères et divers
> signes des écritures khmères pré-modernes et modernes employés pour la
> notation du khmer, du siamois, des dialectes thaïs méridionaux, du
> sanskrit et du pāli' by Michel Antelme, and read Note 5 on p. 25. He
> gives the transliteration of TA and DA as 'ta' and 'ṭa'. The tables
> treat the merger as a change of spelling.

"The tables treat the merger as a change of spelling." I think that is key.

> A copy of the paper is
> available at http://aefek.free.fr/iso_album/antelme_bis.pdf.
>
> The font Khmer2004 mentioned by Antelme, which is targeted at the Khom
> (Antelme writes this word 'kh?ma') variant of the script, is put
> through its paces at http://www.khmerfonts.info/fontinfo.php?font=1507 .
> The display starts with the alphabet, with each letter displayed on
> itself as a subscript.
>
> Richard.

I note that both the references here give a 'DA-like' shape to 'COENG DA'.

/Kent Karlsson

-------------- next part --------------
An HTML attachment was scrubbed...
URL: