From unicode at unicode.org Tue Mar 3 15:53:20 2020
From: unicode at unicode.org (Daniel Bünzli via Unicode)
Date: Tue, 3 Mar 2020 22:53:20 +0100
Subject: UAX #14 for 13.0.0: LB27 first's line is obsolete
Message-ID:

Hello,

I think (more precisely my compiler thinks [1]) the first line of LB27 is already handled by the new LB22 rule and can be removed.

Best,

Daniel

[1]
File "uuseg_line_break.ml", line 206, characters 38-40:
206 |    | (* LB27 *) _, (JL|JV|JT|H2|H3), (IN|PO) -> no_boundary s
                                            ^^
Warning 12: this sub-pattern is unused.

From unicode at unicode.org Tue Mar 3 17:22:20 2020
From: unicode at unicode.org (Andy Heninger via Unicode)
Date: Tue, 3 Mar 2020 15:22:20 -0800
Subject: UAX #14 for 13.0.0: LB27 first's line is obsolete
In-Reply-To:
References:
Message-ID:

I agree. The first part of rule LB27, (JL | JV | JT | H2 | H3) × IN, appears to be redundant. Good catch.

-- Andy

From unicode at unicode.org Wed Mar 4 11:01:25 2020
From: unicode at unicode.org (Daniel Bünzli via Unicode)
Date: Wed, 4 Mar 2020 18:01:25 +0100
Subject: UAX #29 and WB4
Message-ID:

Hello,

My implementation of word break chokes only on the following test case from the file [1]:

  ÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]

I find:

  ÷ 0020 × 0308 × 0020 ÷
Basically my implementation uses WB4 to rewrite the first two characters to WSegSpace and then applies WB3d, resulting in the non-break between 0308 and 0020.

Re-reading the text I suspect I should not restart the rules from the first one when a WB4 rewrite occurs but only apply the subsequent rules. Is that correct?

Best,

Daniel

[1]: https://unicode.org/Public/13.0.0/ucd/auxiliary/WordBreakTest.txt

From unicode at unicode.org Wed Mar 4 11:48:09 2020
From: unicode at unicode.org (Daniel Bünzli via Unicode)
Date: Wed, 4 Mar 2020 18:48:09 +0100
Subject: UAX #29 and WB4
In-Reply-To:
References:
Message-ID:

On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:

> Re-reading the text I suspect I should not restart the rules from the first one when a WB4
> rewrite occurs but only apply the subsequent rules. Is that correct ?

However even if that's correct I don't understand how this test case works:

  ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]

Here the first two chars get rewritten with WB4 to ExtPict; then, if only subsequent rules are applied, we end up in WB999 and a break between 200D and 1F6D1. The justification in the comment indicates to use WB3c on the ZWJ, but that one should have been rewritten to ExtPict by WB4.

Best,

Daniel

From unicode at unicode.org Wed Mar 4 13:26:42 2020
From: unicode at unicode.org (Daniel Bünzli via Unicode)
Date: Wed, 4 Mar 2020 20:26:42 +0100
Subject: UAX #29 and WB4
In-Reply-To:
References:
Message-ID:

On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:

> On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:
>
> > Re-reading the text I suspect I should not restart the rules from the first one when a WB4
> > rewrite occurs but only apply the subsequent rules. Is that correct ?
>
> However even if that's correct I don't understand how this test case works:
>
> ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ_FE)
> × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>
> Here the first two chars get rewritten with WB4 to ExtPict then if only subsequent rules
> are applied we end up in WB999 and a break between 200D and 1F6D1.

That's nonsense and not the operational model of the algorithm, which IIRC was once clearly stated on this list by Mark Davis (sorry, I failed to dig out the message): take each boundary position candidate, apply the rules in sequence, take the first one that matches, and then start over with the next candidate.

In that case applying the rules between 1F6D1 and 200D leads to WB4, but that implicitly adds a non-boundary condition (this is not really evident from the formalism, but see the comment above WB4), which settles that boundary position as a non-boundary. Then we start again applying the rules between 200D and the last 1F6D1, and WB3c matches before WB4 kicks in.

I think the behaviour of → rules should be clarified: it's not clear on which data you apply them w.r.t. the boundary position candidate. If I understand correctly, if the match spans over the boundary position candidate, that simply turns it into a non-boundary. Otherwise you apply the rule on the left of the boundary position candidate.

Regarding the question of my original message it seems at a certain point I knew better:

  https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html

Sorry for the noise.

Daniel

P.S. I still think the UAX29 and UAX14 could benefit from clarifying the operational model of the rules a bit (I also have the impression that the formalism to express all that may not be the right one, but then I don't have something better to propose at the time). Also it would be nicer for implementers if they didn't have to factorize rules themselves (e.g.
like in the new LB30 rules of UAX14) so that correctness of implemented rules is easier to assert.

From unicode at unicode.org Wed Mar 4 17:58:57 2020
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Wed, 4 Mar 2020 15:58:57 -0800
Subject: UAX #29 and WB4
In-Reply-To:
References:
Message-ID:

One thing we have considered for a while is whether to do a rewrite of the rules to simplify the processing (and avoid the "treat as" rules), but it would take a fair amount of design work that we haven't had time to do. If you (or others) are interested in getting involved, please let us know.

Mark

From unicode at unicode.org Fri Mar 6 21:36:31 2020
From: unicode at unicode.org (Zack Newman via Unicode)
Date: Fri, 6 Mar 2020 20:36:31 -0700
Subject: UAX #29 6.2
Message-ID:

According to 6.2, "thus ignoring Extend is sufficient to disallow breaking within a grapheme cluster." However the sequence of Unicode scalar values (U+0600, U+0020) is considered a single grapheme cluster due to rule GB9b, but the sequence is parsed into two words according to 4.1.1.
While it would be ideal not to have sequences of Unicode scalar values that can be parsed into more words than grapheme clusters, I think it would be more understandable if section 6.2 didn't explicitly state that this isn't possible.

From unicode at unicode.org Sat Mar 7 13:36:56 2020
From: unicode at unicode.org (Rick McGowan via Unicode)
Date: Sat, 07 Mar 2020 11:36:56 -0800
Subject: Reminder about reporting bugs, errors, and other feedback
Message-ID: <5E63F7D8.4050309@unicode.org>

Hello everyone...

This is just a little public service reminder that discussions on the Unicode mail list are not considered official feedback, and are not reviewed by UTC members or staff as a source for bug reports.

If you want to make sure your feedback and/or report gets into the UTC process, it is best to submit it through our reporting form, which can be found here:

  https://www.unicode.org/reporting.html

Cheers,

From unicode at unicode.org Tue Mar 10 00:00:57 2020
From: unicode at unicode.org (Andy Heninger via Unicode)
Date: Mon, 9 Mar 2020 22:00:57 -0700
Subject: UAX #29 and WB4
In-Reply-To:
References:
Message-ID:

daniel.buenzli wrote:

> I think the behaviour of → rules should be clarified

I wholeheartedly agree.

> If I understand correctly if the match [or a "treat-as" rule] spans over
> the [candidate] boundary position candidate that simply turns it into a
> non-boundary. Otherwise you apply the rule on the left of the boundary
> position candidate.

I have considered the extent of a left-side treat-as match to not continue beyond the candidate boundary position. This comes into play following a ZWJ, where it may be absorbed into a "treat as" on the left (WB4), while some other rule triggers on the right side (WB3c).

At any rate, this is what I do in ICU. It gets very confusing, and is tricky to implement.
Reconsidering how ZWJ rules work could also be a help, if we could figure out how to keep them out of the "treat as" rules, but use explicit no-break rules on both sides instead.

-- Andy
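
[Editorial sketch] The operational model described in this thread (visit each candidate boundary position, apply the rules in order, let the first matching rule decide, then move to the next position) can be illustrated roughly as below. This is only a minimal sketch under loud assumptions: the property table PROPS is hand-written for the few code points used in the thread, only WB3c, WB3d, WB4, WB5 and WB999 are modelled, the sot/CR/LF exceptions of WB4 are ignored, and the names PROPS, skip_back, is_break and word_breaks are hypothetical. It is not how ICU or uuseg implement word breaking.

```python
# Hypothetical, hand-written property table for a handful of code points only.
PROPS = {
    0x0020: "WSegSpace",
    0x0061: "ALetter",
    0x0062: "ALetter",
    0x0308: "Extend",
    0x200D: "ZWJ",
    0x1F6D1: "ExtPict",
}
IGNORABLE = {"Extend", "Format", "ZWJ"}


def skip_back(props, i):
    """WB4 'treat as': property of the base character at or before index i."""
    while i > 0 and props[i] in IGNORABLE:
        i -= 1
    return props[i]


def is_break(props, pos):
    """Decide the candidate boundary between props[pos - 1] and props[pos]."""
    left, right = props[pos - 1], props[pos]
    # WB3c: ZWJ x ExtPict, tested on the raw properties, before WB4.
    if left == "ZWJ" and right == "ExtPict":
        return False
    # WB3d: WSegSpace x WSegSpace, also on the raw properties.
    if left == "WSegSpace" and right == "WSegSpace":
        return False
    # WB4: a candidate position directly before Extend, Format or ZWJ
    # is settled as a non-boundary.
    if right in IGNORABLE:
        return False
    # Rules after WB4 see the left side with trailing ignorables skipped.
    left = skip_back(props, pos - 1)
    # WB5: ALetter x ALetter.
    if left == "ALetter" and right == "ALetter":
        return False
    # WB999: otherwise, break.
    return True


def word_breaks(cps):
    """Render the result in the style of WordBreakTest.txt."""
    props = [PROPS[cp] for cp in cps]
    marks = ["÷"]  # WB1: break at start of text
    for pos in range(1, len(cps)):
        marks.append("÷" if is_break(props, pos) else "×")
    out = []
    for mark, cp in zip(marks, cps):
        out.append(mark)
        out.append(f"{cp:04X}")
    out.append("÷")  # WB2: break at end of text
    return " ".join(out)
```

With this sketch, word_breaks([0x0020, 0x0308, 0x0020]) gives "÷ 0020 × 0308 ÷ 0020 ÷" and word_breaks([0x1F6D1, 0x200D, 0x1F6D1]) gives "÷ 1F6D1 × 200D × 1F6D1 ÷", matching the two test lines discussed above. The point is that WB3c and WB3d look at the raw neighbours of the candidate position, and the "treat as" view only applies to rules tried after WB4.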
From unicode at unicode.org Wed Mar 11 12:29:06 2020
From: unicode at unicode.org (Karl Williamson via Unicode)
Date: Wed, 11 Mar 2020 11:29:06 -0600
Subject: EGYPTIAN HIEROGLYPH MAN WITH A ROLL OF TOILET PAPER
In-Reply-To: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com>
Message-ID:

On 2/12/20 11:12 AM, Frédéric Grosshans via Unicode wrote:
> Dear Unicode list members (CC Michel Suignard),
>
> the Unicode proposal L2/20-068,
> “Revised draft for the encoding of an extended Egyptian Hieroglyphs
> repertoire, Groups A to N” (
> https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by
> Michel Suignard contains a very interesting hieroglyph at position
> *U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man
> with a laptop, as can be obvious in the attached image.
>

Someone suggested today that this would be the more up-to-date character

From unicode at unicode.org Fri Mar 20 07:21:26 2020
From: unicode at unicode.org (Costello, Roger L. via Unicode)
Date: Fri, 20 Mar 2020 12:21:26 +0000
Subject: Is the binaryness/textness of a data format a property?
Message-ID:

Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property?

Question #2: If the answer to Question #1 is yes, then what is the name of this binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the following blanks with the property name (both blanks should be filled with the same thing):

For the JPEG data format: _____ = binary.
For the CSV data format: _____ = text.

/Roger

From unicode at unicode.org Fri Mar 20 07:36:34 2020
From: unicode at unicode.org (Dreiheller, Albrecht via Unicode)
Date: Fri, 20 Mar 2020 12:36:34 +0000
Subject: AW: Is the binaryness/textness of a data format a property?
In-Reply-To:
References:
Message-ID:

#1: Yes.
#2: [ my suggestion ] File type category

A.D.

From unicode at unicode.org Fri Mar 20 07:46:25 2020
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Fri, 20 Mar 2020 13:46:25 +0100
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To:
References:
Message-ID: <20200320124625.GC32403@angband.pl>

On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via Unicode wrote:
> [Definition] Property: an attribute, quality, or characteristic of something.
>
> JPEG is a binary data format.
> CSV is a text data format.
>
> Question #1: Is the binaryness/textness of a data format a property?
>
> Question #2: If the answer to Question #1 is yes, then what is the name of
> this binaryness/textness property?

I'm afraid this question is too fuzzy to have a proper answer.

For example, most Unix-heads will tell you that UTF16LE is a binary rather than text format. Microsoft employees and some members of this list will disagree.

Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable for a (sane) human.
If you want _my_ definition of a file being _technically_ text, it's:
* no bytes 0..31 other than newlines and tabs (even form feeds are out nowadays)
* correctly encoded for the expected charset (and nowadays, if that's not UTF-8 Unicode, you're doing it wrong)
* no invalid characters

But besides this narrow technical meaning -- is a Word document "text"? And if it is, why not Powerpoint? This all falls apart.

Meow!
--
in the beginning was the boot and root floppies and they were good.
  -- on #linux-sunxi

From unicode at unicode.org Fri Mar 20 09:22:45 2020
From: unicode at unicode.org (J Decker via Unicode)
Date: Fri, 20 Mar 2020 07:22:45 -0700
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <20200320124625.GC32403@angband.pl>
References: <20200320124625.GC32403@angband.pl>
Message-ID:

On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <unicode at unicode.org> wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the name of
> > this binaryness/textness property?
>
> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary rather
> than text format. Microsoft employees and some members of this list will
> disagree.
>
> Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable
> for a (sane) human.
>
> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
> nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's not
> UTF-8 Unicode, you're doing it wrong)
> * no invalid characters
>

Just a minor note...
In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every valid utf8 codeunit has at least 1 bit off.

I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b (ANSI codes) are all quite usable...

> But besides this narrow technical meaning -- is a Word document "text"?
> And if it is, why not Powerpoint? This all falls apart.
>
> Meow!
> --
> in the beginning was the boot and root floppies and they were good.
>   -- on #linux-sunxi

From unicode at unicode.org Fri Mar 20 09:41:23 2020
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Fri, 20 Mar 2020 15:41:23 +0100
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To:
References: <20200320124625.GC32403@angband.pl>
Message-ID: <20200320144123.GA6554@angband.pl>

On Fri, Mar 20, 2020 at 07:22:45AM -0700, J Decker via Unicode wrote:
> On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <
> > For example, most Unix-heads will tell you that UTF16LE is a binary rather
> > than text format. Microsoft employees and some members of this list will
> > disagree.
[...]
> > If you want _my_ definition of a file being _technically_ text, it's:
> > * no bytes 0..31 other than newlines and tabs (even form feeds are out
> > nowadays)
> > * correctly encoded for the expected charset (and nowadays, if that's not
> > UTF-8 Unicode, you're doing it wrong)
> > * no invalid characters
>
> Just a minor note...
> In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every
> valid utf8 codeunit has at least 1 bit off.
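
[Editorial sketch] The three-part definition quoted above, together with the note about bytes 0xF8-0xFF, can be turned into a small check. This is a minimal sketch under stated assumptions: the expected charset is taken to be UTF-8, "no invalid characters" is read as "decodes under a strict UTF-8 decoder", carriage return is counted as part of a newline, and the names ALLOWED_CONTROLS and is_technically_text are hypothetical. It follows the stricter reading, so ESC (\x1b) is rejected.

```python
# Assumption: tab, line feed and carriage return are the only control bytes
# allowed; the quoted definition only says "newlines and tabs".
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_technically_text(data: bytes) -> bool:
    """Return True if data is 'technically text' in the sense quoted above."""
    # Rule 1: no bytes 0..31 other than newlines and tabs.
    if any(b < 0x20 and b not in ALLOWED_CONTROLS for b in data):
        return False
    # Rules 2 and 3, assuming UTF-8: the strict decoder rejects surrogates,
    # 5- and 6-byte forms, and any of the bytes 0xF8..0xFF.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Under these assumptions a UTF-16LE file fails on its zero bytes, ANSI colour codes fail on ESC, and byte sequences for surrogates or for values beyond U+10FFFF fail at the decoding step.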
Yeah, but I allowed for ancient encodings, some of which do use these bytes. (I do discriminate against UTF16 and shift-state ones, they're too broken.)

Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses but is not well-formed Unicode.

> I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI
> codes) are all quite usable...

\t is tab, \n a newline (blah blah blah \r). As for \e (\x1b), that's higher-level markup. I do use it -- hey, you can "apt/dnf install colorized-logs" for my tools -- but that's beyond plain text.

--
in the beginning was the boot and root floppies and they were good.
  -- on #linux-sunxi

From unicode at unicode.org Fri Mar 20 09:49:24 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 20 Mar 2020 14:49:24 +0000
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <20200320124625.GC32403@angband.pl>
References: <20200320124625.GC32403@angband.pl>
Message-ID: <20200320144924.6dfab15a@JRWUBU2>

On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a
> > property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format. Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system.
Digital's old RMS formats included, as basic text files, files in which each record (roughly a line) started with a binary 2-byte length field. Text records on magnetic tape typically started with an ASCII length count!

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable. Indeed, are you not aware of the concept of a write-only programming language?

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
> nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
> not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file. Surely you aren't saying that a text file using the characters new to Unicode 13.0 should, at present, usually be regarded as a binary file?

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint? This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is binary. Indeed, as Word files naturally include pictures, it very much isn't a text file. A .doc file is more like an image dump of a file system. A .rtf file, on the other hand, probably is a text file - though I've a feeling there are variants that aren't *A*SCII.

Richard.

From unicode at unicode.org Fri Mar 20 20:43:50 2020
From: unicode at unicode.org (Martin J. Dürst via Unicode)
Date: Sat, 21 Mar 2020 01:43:50 +0000
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <20200320144123.GA6554@angband.pl>
References: <20200320124625.GC32403@angband.pl> <20200320144123.GA6554@angband.pl>
Message-ID: <10828d7b-80a5-6282-4ef4-7ed075fde75a@it.aoyama.ac.jp>

On 20/03/2020 23:41, Adam Borowski via Unicode wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
> U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
> but is not well-formed Unicode.

This would definitely no longer be UTF-8!

Martin.

From unicode at unicode.org Sat Mar 21 12:13:40 2020
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 21 Mar 2020 11:13:40 -0600
Subject: Is the binaryness/textness of a data format a property?
Message-ID: <000001d5ffa4$11d30860$35791920$@ewellic.org>

Adam Borowski wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
> its uses but is not well-formed Unicode.

I'd be interested in your elaboration on what these uses are.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Mar 21 14:23:45 2020
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 21 Mar 2020 21:23:45 +0200
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <000001d5ffa4$11d30860$35791920$@ewellic.org> (message from Doug Ewell via Unicode on Sat, 21 Mar 2020 11:13:40 -0600)
References: <000001d5ffa4$11d30860$35791920$@ewellic.org>
Message-ID: <8336a1ecla.fsf@gnu.org>

> Date: Sat, 21 Mar 2020 11:13:40 -0600
> From: Doug Ewell via Unicode
>
> Adam Borowski wrote:
>
> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
> > its uses but is not well-formed Unicode.
>
> I'd be interested in your elaboration on what these uses are.

Emacs uses some of that for supporting charsets that cannot be mapped into Unicode. GB18030 is one example of such charsets.
The internal representation of characters in Emacs is UTF-8, so it uses 5-byte UTF-8-like sequences to represent such characters.

From unicode at unicode.org Sat Mar 21 14:33:18 2020
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 21 Mar 2020 13:33:18 -0600
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <8336a1ecla.fsf@gnu.org>
References: <000001d5ffa4$11d30860$35791920$@ewellic.org> <8336a1ecla.fsf@gnu.org>
Message-ID: <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org>

Eli Zaretskii wrote:

>>> Also, UTF-8 can carry more than Unicode -- for example,
>>> U+D800..U+DFFF or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or
>>> 2⁴²), which has its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode. GB18030 is one example of such charsets. The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

When 137,468 private-use characters aren't enough?

I thought the whole premise of GB18030 was that it was Unicode mapped into a GB2312 framework. What characters exist in GB18030 that don't exist in Unicode, and have they been proposed for Unicode yet, and why was none of the PUA space considered appropriate for that in the meantime?

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Mar 21 15:26:24 2020
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 21 Mar 2020 22:26:24 +0200
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org> (doug@ewellic.org)
References: <000001d5ffa4$11d30860$35791920$@ewellic.org> <8336a1ecla.fsf@gnu.org> <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org>
Message-ID: <831rple9ov.fsf@gnu.org>

> From: "Doug Ewell"
> Cc:
> Date: Sat, 21 Mar 2020 13:33:18 -0600
>
> > Emacs uses some of that for supporting charsets that cannot be mapped
> > into Unicode. GB18030 is one example of such charsets. The internal
> > representation of characters in Emacs is UTF-8, so it uses 5-byte
> > UTF-8 like sequences to represent such characters.
>
> When 137,468 private-use characters aren't enough?

Why is that relevant to the issue at hand?

> I thought the whole premise of GB18030 was that it was Unicode mapped into a GB2312 framework. What characters exist in GB18030 that don't exist in Unicode, and have they been proposed for Unicode yet

I don't remember off hand, but last time I looked at GB18030, there were a lot of them not in Unicode.

> and why was none of the PUA space considered appropriate for that in the meantime?

Because many fonts already use them? I don't really know why it was decided to use codepoints above 0x1FFFFF; it's just that this is how Emacs has worked for quite some time. You asked for examples of usage, and I provided one.

From unicode at unicode.org Sat Mar 21 15:38:24 2020
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Sat, 21 Mar 2020 20:38:24 +0000 (GMT)
Subject: Is the binaryness/textness of a data format a property?
References: <000001d5ffa4$11d30860$35791920$@ewellic.org> <8336a1ecla.fsf@gnu.org>
Message-ID:

On 2020-03-21, Eli Zaretskii via Unicode wrote:
>> Date: Sat, 21 Mar 2020 11:13:40 -0600
>> From: Doug Ewell via Unicode
>>
>> Adam Borowski wrote:
>>
>> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
>> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
>> > its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode. GB18030 is one example of such charsets. The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

My own (now >10 year old) Unicode adaptation of XEmacs does the same, even for charsets that can be mapped into Unicode. To ensure complete backward compatibility, it distinguishes "legacy" charsets from Unicode, and only does conversion when requested.

From unicode at unicode.org Sat Mar 21 15:57:42 2020
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 21 Mar 2020 14:57:42 -0600
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <831rple9ov.fsf@gnu.org>
References: <000001d5ffa4$11d30860$35791920$@ewellic.org> <8336a1ecla.fsf@gnu.org> <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org> <831rple9ov.fsf@gnu.org>
Message-ID: <000a01d5ffc3$5dfd3ac0$19f7b040$@ewellic.org>

Eli Zaretskii wrote:

>> When 137,468 private-use characters aren't enough?
>
> Why is that relevant to the issue at hand?

You're right. I did ask what the uses of non-standard UTF-8 were, and you gave me an example.

> I don't remember off hand, but last time I looked at GB18030, there
> were a lot of them not in Unicode.

I'd forgotten that there were still about two dozen GB18030 characters mapped, more or less officially, into the Unicode PUA. But again, I changed the subject. Sorry about that.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Mar 21 19:31:31 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 22 Mar 2020 00:31:31 +0000
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org>
References: <000001d5ffa4$11d30860$35791920$@ewellic.org> <8336a1ecla.fsf@gnu.org> <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org>
Message-ID: <20200322003131.657f1f23@JRWUBU2>

On Sat, 21 Mar 2020 13:33:18 -0600 Doug Ewell via Unicode wrote:

> Eli Zaretskii wrote:
> > Emacs uses some of that for supporting charsets that cannot be
> > mapped into Unicode. GB18030 is one example of such charsets. The
> > internal representation of characters in Emacs is UTF-8, so it uses
> > 5-byte UTF-8 like sequences to represent such characters.
> When 137,468 private-use characters aren't enough?

But they aren't private use! I haven't made any agreement with anyone about using them.

Additionally, just as some people seem to think that stray UTF-16 code units should be supported (and occasionally declaring UTF-8 implementations of Unicode standard algorithms to be automatically non-compliant), there is a case for supporting stray UTF-8 code units. Emacs supports the full range of 8-bit byte values - 128 unified with ASCII and the other 128 with high bit set.

> What characters exist in GB18030 that don't
> exist in Unicode, and have they been proposed for Unicode yet, and
> why was none of the PUA space considered appropriate for that in the
> meantime?

Doesn't GB18030 appropriate some of the PUA for Tibetan (and quite possibly other complex scripts)? I haven't looked up how Emacs handles this.

Richard.

From unicode at unicode.org Sun Mar 22 13:56:52 2020
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Sun, 22 Mar 2020 11:56:52 -0700
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To: <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org>
References: <000001d5ffa4$11d30860$35791920$@ewellic.org> <8336a1ecla.fsf@gnu.org> <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org>
Message-ID:

On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode wrote:

> I thought the whole premise of GB18030 was that it was Unicode mapped into
> a GB2312 framework. What characters exist in GB18030 that don't exist in
> Unicode, and have they been proposed for Unicode yet, and why was none of
> the PUA space considered appropriate for that in the meantime?
>

My memory of GB18030 is that its code space has 1.6M code points, of which 1.1M are a permutation of Unicode. For the rest you would have to go beyond the Unicode code space for 1:1 round-trip mappings.

Just please don't call it UTF-8.

markus

From unicode at unicode.org Sun Mar 22 18:29:03 2020
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Sun, 22 Mar 2020 23:29:03 +0000
Subject: Is the binaryness/textness of a data format a property?
In-Reply-To:
References: <000001d5ffa4$11d30860$35791920$@ewellic.org> <8336a1ecla.fsf@gnu.org> <000701d5ffb7$93544850$b9fcd8f0$@ewellic.org>
Message-ID: <3eb6a9a0-ee7d-2650-157c-9ed02835edd8@it.aoyama.ac.jp>

On 23/03/2020 03:56, Markus Scherer via Unicode wrote:
> On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode
> wrote:
>
>> I thought the whole premise of GB18030 was that it was Unicode mapped into
>> a GB2312 framework. What characters exist in GB18030 that don't exist in
>> Unicode, and have they been proposed for Unicode yet, and why was none of
>> the PUA space considered appropriate for that in the meantime?
>>
>
> My memory of GB18030 is that its code space has 1.6M code points, of which
> 1.1M are a permutation of Unicode. For the rest you would have to go beyond
> the Unicode code space for 1:1 round-trip mappings.

This matches my recollection.
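
As a rough check on the 1.6M and 1.1M figures quoted above, the size of the GB18030 code space can be worked out from its byte ranges. The sketch below assumes the commonly documented ranges (one byte 0x00-0x7F; two bytes with lead 0x81-0xFE and trail 0x40-0x7E or 0x80-0xFE; four bytes alternating 0x81-0xFE and 0x30-0x39). Those ranges are not stated anywhere in this thread, and the names `gb18030_code_space` and `UNICODE_CODE_POINTS` are purely illustrative.

```python
# Byte ranges as commonly documented for GB18030.
# These are an assumption of this sketch, not something stated in the thread.
SINGLE = range(0x00, 0x80)                           # one-byte form (ASCII)
LEAD = range(0x81, 0xFF)                             # 1st byte of 2- and 4-byte forms, also 3rd byte of 4-byte form
TRAIL2 = [*range(0x40, 0x7F), *range(0x80, 0xFF)]    # 2nd byte of the two-byte form (0x7F excluded)
DIGIT = range(0x30, 0x3A)                            # 2nd and 4th byte of the four-byte form

# Unicode code space U+0000..U+10FFFF, surrogates included.
UNICODE_CODE_POINTS = 0x110000


def gb18030_code_space() -> dict[str, int]:
    """Count the positions available in each GB18030 form and in total."""
    one = len(SINGLE)
    two = len(LEAD) * len(TRAIL2)
    four = len(LEAD) * len(DIGIT) * len(LEAD) * len(DIGIT)
    return {"one_byte": one, "two_byte": two, "four_byte": four,
            "total": one + two + four}


sizes = gb18030_code_space()
beyond_unicode = sizes["total"] - UNICODE_CODE_POINTS

print(f"GB18030 positions:      {sizes['total']:>9,}")      # 1,611,668
print(f"Unicode code points:    {UNICODE_CODE_POINTS:>9,}")  # 1,114,112
print(f"Left over beyond those: {beyond_unicode:>9,}")       #   497,556
```

Under those assumptions the total is about 1.6M positions, of which roughly 0.5M remain once the 0x110000 Unicode code points are accounted for. That is consistent with the recollection that a 1:1 round-trip mapping of the remainder would need something beyond the Unicode code space.
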
What's more, there are no characters allocated in the parts of the GB 18030 codespace that don't map to Unicode, and there is as far as I understand no plan to use that space. It's just there because that was the most straightforward way to extend GB 2312/GBK.

Regards,
Martin.

From unicode at unicode.org Mon Mar 23 17:29:57 2020
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Mon, 23 Mar 2020 22:29:57 +0000 (GMT)
Subject: Base character plus tag sequences (from RE: Is the binaryness/textness of a data format a property?)
Message-ID: <59f9f4cc.1054.17109844d3a.Webtop.71@btinternet.com>

Doug Ewell wrote:

> When 137,468 private-use characters aren't enough?

In my opinion, a base character plus tag sequence has the potential to be used for many large scale applications for the future.

A base character plus tag sequence encoding has the advantage over a Private Use Area encoding (except for a prompt experimental use or for some applications) that the encoding can be unique and thus interoperability is possible amongst people generally.

QID emoji is just the very start of applications, some not even dreamed of yet, for which a base character sequence encoding could be used. Once restrictions of the result of a specific encoding of being only allowed to be a fixed image are removed, then new information technology applications will be possible within text streams.

There is the QID Emoji Public Review and issues like this can be explored there so that they will be before the Unicode Technical Committee when it assesses the responses to the public review.

In my response of Monday 2 March 2020 I put forward an idea that could allow the idea of QID emoji to proceed yet without the disadvantages. No comment after that has been published as of the time of sending this post.
https://www.unicode.org/review/pri408/

Whatever your view on whether such ideas should be allowed to flourish and become mainstream in the future I opine that it would be good for there to be more responses to the public review so that as wide a range of views as possible are before the Unicode Technical Committee when it assesses the responses to the public review, not on just QID emoji as such but on whether the underlying method of encoding of a base character and tag character sequence for large sets of items should be encouraged.

William Overington

Monday 23 March 2020

From unicode at unicode.org Tue Mar 31 13:22:37 2020
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Tue, 31 Mar 2020 19:22:37 +0100 (BST)
Subject: How is meaning changed by context and typgraphy - in art, emoji and language
Message-ID: <2ccfdccc.1f4b.17131d4bc09.Webtop.73@btinternet.com>

I received a circulated email from MoMA, the Museum of Modern Art in New York. I am, at my request, on their mailing list.

There is a link to a web page.

https://www.moma.org/magazine/articles/257

There is a video embedded in the web page, 8 minutes. I watched the video and found it interesting.

There is one part where two identical images each have a different title. I noticed that both titles were in English.

With typography today, it has become almost obligatory, when a new emoji character is proposed for encoding, for the emoji character to be suggested as having multiple possible meanings, possibly linked to context, or maybe just anyway.

The beginnings of this phenomenon and the problems of ambiguity of meaning of emoji characters were discussed in a talk at the Unicode conference in 2015.

https://www.youtube.com/watch?v=9ldSVbXbjl4

There was mention of the possibility of "precise emoji". Yet these days imprecision of emoji meaning has become widespread.
Yet has the possibility of QID emoji brought back the possibility of precise emoji? Decoding could be to an image, or to language-localized speech or language-localized text, or even all three at once. Yet only if QID emoji are allowed to flourish, perhaps after a few careful modifications to the original proposal so as to minimize, or at least limit, the possibility of encoding chaos.

I have long been fascinated by what I regard as subtle changes of meaning that setting a piece of text in different fonts produces, though some other people opine that the meaning is unchanged, regardless of the font.

Also, can some meanings not be expressed from one language to another? If so, is that due to the nature of the languages or the culture where the original text was produced, or some of each?

Does the general shape of the way that a particular script has developed reflect, or influence, the original literature written in that script?

Do words that rhyme in one language produce imagery that does not arise in a language where their translations do not rhyme? For example, boaco and erinaco rhyme in Esperanto, yet their translations in English, reindeer and hedgehog, do not rhyme.

The art works in the MoMA video also reminded me of something that was in this mailing list probably in the early 2000s. The post was about translations linked to an art project. It was an art project about some orange blocks and people were taking photographs of art works where one of the orange blocks was presented in some context. Maybe it was a student project, I don't know. I have looked on the web and thus far found nothing about it, not even the original post in this mailing list.

Since then technology has changed a lot, much more is now possible for more people. There are now widespread emoji, there is Google street view, and so on. New art possibilities.

Does anyone else remember the orange blocks please? Maybe an interesting stepping stone in the history of art.
William Overington

Tuesday 31 March 2020