From unicode at unicode.org Tue Jul 3 01:25:06 2018
From: unicode at unicode.org (Yifán Wáng via Unicode)
Date: Tue, 3 Jul 2018 15:25:06 +0900
Subject: UTS#51 and emoji-sequences.txt
In-Reply-To:
References:

Message-ID: Hi, Sorry for delayed reply. The reason I posted here was that I wasn't sure how it was intended to become. Do you have an idea whether I can trust actual type_field value of each row over the description in the current txt file? Thanks, Wang Yifan 2018-06-09 20:10 GMT+09:00 Mark Davis ?? : > Thanks, it definitely looks like there are some mismatches in terminology > there. Can you please file this with the reporting form on the unicode site? > > {phone} > > On Sat, Jun 9, 2018, 05:00 Yif?n W?ng via Unicode > wrote: >> >> When I'm looking at >> https://unicode.org/Public/emoji/11.0/emoji-sequences.txt >> >> It goes on line 16 that: >> ---------- >> # type_field: any of {Emoji_Combining_Sequence, Emoji_Flag_Sequence, >> Emoji_Modifier_Sequence} >> # The type_field is a convenience for parsing the emoji sequence >> files, and is not intended to be maintained as a property. >> ---------- >> >> This field, however, actually contains "Emoji_Keycap_Sequence" and >> "Emoji_Tag_Sequence", instead of "Emoji_Combining_Sequence" (it was >> already so in 5.0). >> >> And I go back to >> http://www.unicode.org/reports/tr51/ >> >> Under the section 1.4.6: >> ---------- >> ED-21. emoji keycap sequence set ? The specific set of emoji sequences >> listed in the emoji-sequences.txt file [emoji-data] under the category >> Emoji_Keycap_Sequence. >> ED-22. emoji modifier sequence set ? The specific set of emoji >> sequences listed in the emoji-sequences.txt file [emoji-data] under >> the category Emoji_Modifier_Sequence. >> ED-23. RGI emoji flag sequence set ? The specific set of emoji >> sequences listed in the emoji-sequences.txt file [emoji-data] under >> the category Emoji_Flag_Sequence. >> ED-24. RGI emoji tag sequence set ? The specific set of emoji >> sequences listed in the emoji-sequences.txt file [emoji-data] under >> the category Emoji_Tag_Sequence. 
>> ---------- >> >> I'm not sure if the "category" means "type_field" or headings in the >> txt file, as the headings do not contain underscores. If it means >> "type_field", then the description of type_field above is wrong. >> >> Also the section 1.4.5: >> ---------- >> ED-14c. emoji keycap sequence ? A sequence of the following form: >> >> emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3} >> >> - These characters are in the emoji-sequences.txt file listed under >> the category Emoji_Keycap_Sequence >> ---------- >> While in the previous version (rev. 12): >> ---------- >> ED-14c. emoji keycap sequence ? An emoji combining sequence of the >> following form: >> >> emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3} >> >> - These characters are in the emoji-sequences.txt file listed under >> the category Emoji_Combining_Keycap_Sequence >> ---------- >> >> It seems there was some kind of confusion on terms, but anyway, isn't >> the last line of ED-14c redundant with the current revision? (Or >> "Emoji_Combining_Sequence" is intended?) >> >> Thank you. >> >> Wang Yifan >> > From unicode at unicode.org Sat Jul 7 23:52:28 2018 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Sat, 7 Jul 2018 22:52:28 -0600 Subject: Missing UAX#31 tests? Message-ID: I am working on upgrading from Unicode 10 to Unicode 11. I used all the new files. The algorithms for some of the boundaries, like GCB and WB, have changed so that some of the property values no longer have code points associated with them. I ran the tests furnished in 11.0 for these boundaries, without having changed the algorithms from earlier releases. All passed 100%. Unless I'm missing something, that indicates that the tests furnished in 11.0 do not contain instances that exercise these changes. My guess is that the 10.0 tests were also deficient. 
I have been relying on the UCD to furnish tests that have enough coverage to sufficiently exercise the algorithms that are specified in UAX 31, but that appears to have been naive on my part.

From unicode at unicode.org Sun Jul 8 04:21:59 2018
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Sun, 8 Jul 2018 11:21:59 +0200
Subject: Missing UAX#31 tests?
In-Reply-To:
References:
Message-ID:

I'm surprised that the tests for 11.0 passed for a 10.0 implementation, because the following should have triggered a difference for WB. Can you check on this particular case?

÷ 0020 × 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷ [0.3]

About the testing:

The tests are generated so that they go through all the combinations of pairs, and some combinations of triples. The generated test cases use a sample from each partition of characters, to cut the file size down to a reasonable level. That also means that some changes in the rules don't cause changes in the test results. Because it is not possible to test every combination, there is also provision for additional test cases, such as those at the end of the files, e.g.:

https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html

We should extend those each time to make sure we cover combinations that aren't covered by pairs. There were some additions to that end; if they didn't cover enough cases, then we can look at your experience to add more.

I can suggest some strategies for further testing:

1. To do a full test, for each row check every combination obtained by replacing each sample character by every other character in its partition. E.g., for the above line that would mean testing every <WSegSpace, WSegSpace> sequence.

2. Use a monkey test against ICU. That is, generate random combinations of characters from different partitions and check that ICU and your implementation are in sync.

3. During the beta period, test your previous version with the new test files. If there are no failures, yet there are changes in the rules, then raise that issue during the beta period so we can add tests.

4. If possible, during the beta period upgrade your implementation and test against the new and old test files.

Anyone else have other suggestions for testing?

Mark

On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode <unicode at unicode.org> wrote:

> I am working on upgrading from Unicode 10 to Unicode 11.
>
> I used all the new files.
>
> The algorithms for some of the boundaries, like GCB and WB, have changed so that some of the property values no longer have code points associated with them.
>
> I ran the tests furnished in 11.0 for these boundaries, without having changed the algorithms from earlier releases. All passed 100%.
>
> Unless I'm missing something, that indicates that the tests furnished in 11.0 do not contain instances that exercise these changes. My guess is that the 10.0 tests were also deficient.
>
> I have been relying on the UCD to furnish tests that have enough coverage to sufficiently exercise the algorithms that are specified in UAX 31, but that appears to have been naive on my part

From unicode at unicode.org Sun Jul 8 04:23:15 2018
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Sun, 8 Jul 2018 11:23:15 +0200
Subject: Missing UAX#31 tests?
In-Reply-To:
References:

Message-ID: PS, although the title was "Missing UAX#31 tests?", I assumed you were talking about http://unicode.org/reports/tr29/ Mark On Sun, Jul 8, 2018 at 11:21 AM, Mark Davis ?? wrote: > I'm surprised that the tests for 11.0 passed for a 10.0 implementation, > because the following should have triggered a difference for WB. Can you > check on this particular case? > > ? 0020 ? 0020 ? # ? [0.2] SPACE (WSegSpace) ? [3.4] SPACE (WSegSpace) ? > [0.3] > > About the testing: > > The tests are generated so that they go all the combinations of pairs, and > some combinations of triples. The generated test cases use a sample from > each partition of characters, to cut down on the file size to a reasonable > level. That also means that some changes in the rules don't cause changes > in the test results. Because it is not possible to test every > combination, so there is also provision for additional test cases, such as > those at the end of the files, eg: > > https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html > https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html > > We should extend those each time to make sure we cover combinations that > aren't covered by pairs. There were some additions to that end; if they > didn't cover enough cases, then we can look at your experience to add more. > > I can suggest two strategies for further testing: > > 1. To do a full test, for each row check every combinations obtained by > replacing each sample character by every other character in its > partition. Eg for the above line that would mean testing every WSegSpace> sequence. > > 2. Use a monkey test against ICU. That is, generate random combinations of > characters from different partitions and check that ICU and your > implementation are in sync. > > 3. During the beta period, test your previous-version with the new test > files. 
If there are no failures, yet there are changes in the rules, then > raise that issue during the beta period so we can add tests. > > 4. If possible, during the beta period upgrade your implementation and > test against the new and old test files. > > Anyone else have other suggestions for testing? > > Mark > > > > > Mark > > On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode < > unicode at unicode.org> wrote: > >> I am working on upgrading from Unicode 10 to Unicode 11. >> >> I used all the new files. >> >> The algorithms for some of the boundaries, like GCB and WB, have changed >> so that some of the property values no longer have code points associated >> with them. >> >> I ran the tests furnished in 11.0 for these boundaries, without having >> changed the algorithms from earlier releases. All passed 100%. >> >> Unless I'm missing something, that indicates that the tests furnished in >> 11.0 do not contain instances that exercise these changes. My guess is >> that the 10.0 tests were also deficient. >> >> I have been relying on the UCD to furnish tests that have enough coverage >> to sufficiently exercise the algorithms that are specified in UAX 31, but >> that appears to have been naive on my part >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jul 8 09:02:21 2018 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Sun, 8 Jul 2018 08:02:21 -0600 Subject: Missing UAX#31 tests? In-Reply-To: References:

Message-ID: <3f3c8bc9-31a9-399d-1a16-caa213b3cb58@khwilliamson.com>

On 07/08/2018 03:23 AM, Mark Davis ☕️ wrote:
> PS, although the title was "Missing UAX#31 tests?", I assumed you were talking about http://unicode.org/reports/tr29/

Yes, sorry.

From unicode at unicode.org Mon Jul 9 05:18:06 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Mon, 9 Jul 2018 11:18:06 +0100 (BST)
Subject: Memoji
Message-ID: <29097513.14840.1531131486525.JavaMail.defaultUser@defaultHost>

I have seen the following video.

https://www.youtube.com/watch?v=CjqERCCD4iM

How will memoji be communicated from one device to another?

What happens if a message containing a memoji gets into a web page, such as in the archives of this mailing list?

So, I am wondering whether memoji will become encoded into Unicode.

Will Unicode also have animation features?

This could be done with characters such as

ANIMATION START MARKER

ANIMATION FRAME SEPARATOR

ANIMATION FINISH MARKER

together with some more characters so as to specify frame duration individually for each frame in milliseconds, if a duration other than a default 2000 milliseconds is wanted for a particular frame.

Could a message using memoji then be streamed using a plain text link?

William Overington

Monday 9 July 2018

From unicode at unicode.org Mon Jul 9 14:22:00 2018
From: unicode at unicode.org (John H. Jenkins via Unicode)
Date: Mon, 09 Jul 2018 13:22:00 -0600
Subject: Memoji
In-Reply-To: <29097513.14840.1531131486525.JavaMail.defaultUser@defaultHost>
References: <29097513.14840.1531131486525.JavaMail.defaultUser@defaultHost>
Message-ID: <7D81C4E2-53DC-4404-8FE0-F5D83A576854@apple.com>

Memoji are not merely animated emoji; they are personalized avatars.

As for animated emoji, I expect that the UTC would consider them out-of-scope for plain text. Note that web pages can already contain animated or moving elements which cannot be represented in plain text.
> On Jul 9, 2018, at 4:18 AM, William_J_G Overington via Unicode wrote: > > I have seen the following video. > > https://www.youtube.com/watch?v=CjqERCCD4iM > > How will memoji be communicated from one device to another? > > What happens if a message containing a memoji gets into a web page, such as in the archives of this mailing list? > > So, I am wondering whether memoji will become encoded into Unicode? > > Will Unicode also have animation features? > > This could be done with characters such as > > ANIMATION START MARKER > > ANIMATION FRAME SEPARATOR > > ANIMATION FINISH MARKER > > together with some more characters so as to specify frame duration individually for each frame in milliseconds if other then a default 2000 milliseconds is wanted for a particular frame. > > Could a message using memoji then be streamed using a plain text link? > > William Overington > > Monday 9 July 2018 > From unicode at unicode.org Mon Jul 9 15:11:02 2018 From: unicode at unicode.org (Karl Williamson via Unicode) Date: Mon, 9 Jul 2018 14:11:02 -0600 Subject: Missing UAX#31 tests? In-Reply-To: References:

Message-ID: On 07/08/2018 03:21 AM, Mark Davis ?? wrote: > I'm surprised that the tests for 11.0 passed for a 10.0 implementation, > because the following should have triggered a difference for WB. Can you > check on this particular case? > > ? 0020 ? 0020 ?#? [0.2] SPACE (WSegSpace) ? [3.4] SPACE (WSegSpace) ? [0.3] I'm one of the people who advocated for this change, and I had already tailored our implementation of 10.0 to not break between horizontal white space, so it's actually not surprising that this rule didn't break > > > About the testing: > > The tests are generated so that they go all the combinations of pairs, > and some combinations of triples. The generated test cases use a sample > from each partition of characters, to cut down on the file size to a > reasonable level. That also means that some changes in the rules don't > cause changes in the test results. Because it is not possible to test > every combination, so there is also provision for additional test cases, > such as those at the end of the files, eg: > > https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html > https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html > > We should extend those each time to make sure we cover combinations that > aren't covered by pairs. There were some additions to that end; if they > didn't cover enough cases, then we can look at your experience to add more. > > I can suggest two strategies for further testing: > > 1. To do a full test, for each row check every combinations obtained by > replacing each sample character by every other character in its > partition.?Eg for the above line that would mean testing every > sequence. > > 2. Use a monkey test against ICU. That is, generate random combinations > of characters from different partitions and check that ICU and your > implementation are in sync. > > 3. During the beta period, test your previous-version with the new test > files. 
If there are no failures, yet there are changes in the rules, > then raise that issue during the beta period so we can add tests. I actually did this, and as I recall, did find some test failures. In retrospect, I must have screwed up somehow back then. I was under tight deadline pressure, and as a result, did more cursory beta testing than normal. > > 4. If possible, during the beta period upgrade your implementation and > test against the new and old test files. > > Anyone else have other suggestions for testing? > > Mark > As an aside, a release or two ago, I implemented SB, and someone immediately found a bug, and accused me of releasing software that had not been tested at all. He had looked through the test suite and not found anything that looked like it was testing that. But he failed to find the test file which bundled up all your tests, in a manner he was not accustomed to, so it was easy for him to overlook. The bug only manifested itself in longer runs of characters than your pairs and triples tested. I looked at it, and your SB tests still seemed reasonable, and I should not expect a more complete series than you furnished. > > > Mark > ////// > > On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode > > wrote: > > I am working on upgrading from Unicode 10 to Unicode 11. > > I used all the new files. > > The algorithms for some of the boundaries, like GCB and WB, have > changed so that some of the property values no longer have code > points associated with them. > > I ran the tests furnished in 11.0 for these boundaries, without > having changed the algorithms from earlier releases.? All passed 100%. > > Unless I'm missing something, that indicates that the tests > furnished in 11.0 do not contain instances that exercise these > changes.? My guess is that the 10.0 tests were also deficient. 
> > I have been relying on the UCD to furnish tests that have enough > coverage to sufficiently exercise the algorithms that are specified > in UAX 31, but that appears to have been naive on my part > > From unicode at unicode.org Mon Jul 9 17:33:28 2018 From: unicode at unicode.org (Shai Berger via Unicode) Date: Tue, 10 Jul 2018 01:33:28 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext Message-ID: <20180710013328.17b7807f.shai@platonix.com> Hello all, About two and a half years ago, I suggested adding a FAQ about the applicability of higher-level protocols for bidirectional plaintext, as specified by http://www.unicode.org/reports/tr9/ -- my suggestion was to clarify that higher-level protocols can only be applied upon agreement between all producers and consumers, and that such agreements effectively mean that the text is "special text" -- no longer plain. In the time since then, I have been mostly removed from this issue, but I came back to it recently, to find that my suggested text was rejected, and instead, two FAQs were added to http://www.unicode.org/faq/bidi.html: The first, which is marked by the HTML anchor bidi7, goes with my understanding and defines a higher-level protocol as an agreement; but the second, marked as bidi8, goes the other way, and explains that actually, agreement is not necessary -- a program is at liberty to "implicitly define an overall directional context for display, and that implicit definition of direction is itself an example of application of a higher-level protocol for the purposes of the UBA". One result of this is the following scenario: I open my standard-compliant text editor, and write a line of text (to make things accessible to a wider audience, I use capitals for right-to-left English and small letters for normal, left-to-right English; note this sentence starts from the right): SESU RETHO DNA email ROF plaintext REFERP I I save this line in a text file. 
Then I display it using my standards-compliant text viewer, but now it looks like this:

REFERP I plaintext ROF email SESU RETHO DNA

And this is because my standard-compliant text-viewer chooses to apply its higher-level protocol and treat the line as a LTR paragraph.

Since bidi8 is a little abstract on this point, and focuses on terminal windows rather than editors and viewers, I would like to ask: Does this concrete result represent the intents of the UTC?

Thanks for your attention,

Shai.

From unicode at unicode.org Mon Jul 9 22:24:06 2018
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Tue, 10 Jul 2018 05:24:06 +0200
Subject: Missing UAX#31 tests?
In-Reply-To:
References:

Message-ID: Thanks, Karl. Mark On Mon, Jul 9, 2018 at 10:11 PM, Karl Williamson wrote: > On 07/08/2018 03:21 AM, Mark Davis ?? wrote: > >> I'm surprised that the tests for 11.0 passed for a 10.0 implementation, >> because the following should have triggered a difference for WB. Can you >> check on this particular case? >> >> ? 0020 ? 0020 ?#? [0.2] SPACE (WSegSpace) ? [3.4] SPACE (WSegSpace) ? >> [0.3] >> > > I'm one of the people who advocated for this change, and I had already > tailored our implementation of 10.0 to not break between horizontal white > space, so it's actually not surprising that this rule didn't break > >> >> >> About the testing: >> >> The tests are generated so that they go all the combinations of pairs, >> and some combinations of triples. The generated test cases use a sample >> from each partition of characters, to cut down on the file size to a >> reasonable level. That also means that some changes in the rules don't >> cause changes in the test results. Because it is not possible to test every >> combination, so there is also provision for additional test cases, such as >> those at the end of the files, eg: >> >> https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html >> https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html >> >> We should extend those each time to make sure we cover combinations that >> aren't covered by pairs. There were some additions to that end; if they >> didn't cover enough cases, then we can look at your experience to add more. >> >> I can suggest two strategies for further testing: >> >> 1. To do a full test, for each row check every combinations obtained by >> replacing each sample character by every other character in its >> partition. Eg for the above line that would mean testing every > WSegSpace> sequence. >> >> 2. Use a monkey test against ICU. That is, generate random combinations >> of characters from different partitions and check that ICU and your >> implementation are in sync. 
>> >> 3. During the beta period, test your previous-version with the new test >> files. If there are no failures, yet there are changes in the rules, then >> raise that issue during the beta period so we can add tests. >> > > I actually did this, and as I recall, did find some test failures. In > retrospect, I must have screwed up somehow back then. I was under tight > deadline pressure, and as a result, did more cursory beta testing than > normal. > >> >> 4. If possible, during the beta period upgrade your implementation and >> test against the new and old test files. >> > > >> Anyone else have other suggestions for testing? >> >> Mark >> >> > As an aside, a release or two ago, I implemented SB, and someone > immediately found a bug, and accused me of releasing software that had not > been tested at all. He had looked through the test suite and not found > anything that looked like it was testing that. But he failed to find the > test file which bundled up all your tests, in a manner he was not > accustomed to, so it was easy for him to overlook. The bug only manifested > itself in longer runs of characters than your pairs and triples tested. I > looked at it, and your SB tests still seemed reasonable, and I should not > expect a more complete series than you furnished. > >> >> >> Mark >> ////// >> >> On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode < >> unicode at unicode.org > wrote: >> >> I am working on upgrading from Unicode 10 to Unicode 11. >> >> I used all the new files. >> >> The algorithms for some of the boundaries, like GCB and WB, have >> changed so that some of the property values no longer have code >> points associated with them. >> >> I ran the tests furnished in 11.0 for these boundaries, without >> having changed the algorithms from earlier releases. All passed 100%. >> >> Unless I'm missing something, that indicates that the tests >> furnished in 11.0 do not contain instances that exercise these >> changes. 
My guess is that the 10.0 tests were also deficient.
>>
>> I have been relying on the UCD to furnish tests that have enough coverage to sufficiently exercise the algorithms that are specified in UAX 31, but that appears to have been naive on my part

From unicode at unicode.org Tue Jul 10 06:37:56 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 10 Jul 2018 13:37:56 +0200
Subject: UAX #9: applicability of higher-level protocols to bidi plaintext
In-Reply-To: <20180710013328.17b7807f.shai@platonix.com>
References: <20180710013328.17b7807f.shai@platonix.com>
Message-ID:

Your "standard compliant" plain text editor just forces a LTR default for the whole document, and does not tolerate that individual paragraphs may start with an undetermined direction (which should then be determined by the first character on the line that defines a direction).

In my opinion, even if your text editor still does not enforce the default left margin side for aligning the text, it should still treat individual paragraphs in isolation and determine the direction to use (each paragraph break should cancel the direction inheritance).

A plain text editor should not have a strong LTR default; it should have a weak, undetermined direction, independently of the fact that it will align the paragraph to the left or right margin according to the resolved direction of the first character.

That's what web browsers are doing, for example, in input fields (where the automatic side of the start margin does not change when you start typing some text in the input field and there's no "text-align:left" or "text-align:right" to force it, just "text-align:justify" or "text-align:normal"). Note that CSS "text-align:justify" positions the start margin according to the CSS direction of the container element; this makes a difference for the last line of the paragraph. But with automatic determination of an unspecified direction, a justified paragraph may look ugly if this does not also properly set the start margin of the paragraph according to the resolved direction of the first character of the paragraph or block element.

Note also that images or other inline objects embedded in paragraphs/blocks also don't have a defined strong direction for themselves; they act like Unicode "isolates", but you may want to style them to set their outer direction, independently of the inner direction of the isolate. I'm not sure, however, whether images, e.g. in SVG, may inherit their direction from the outer context of the isolate. I doubt they can, but if they do, then they are acting more like the old-fashioned Unicode "embeds" rather than "isolates", except that what is after the image should not depend on the last direction used inside the SVG. Images should be completely isolated from their context of use and completely define their expected rendering. SVG images also contain their own upper-layer protocol, as they can embed multiple texts, but in the context of the SVG document. Now, with SVG elements directly in the HTML5 DOM as plain elements, the situation may have changed, because they can inherit many things from the HTML5 doc, including shared stylesheets...

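A minimal sketch of the per-paragraph, first-strong behaviour discussed above, assuming Python and its standard unicodedata module. It follows the idea of UBA rules P2 and P3 (skip isolated runs, then take the first character of bidi class L, R, or AL). The function names first_strong_direction and paragraph_directions are illustrative only, and returning "neutral" rather than a fixed default is a choice made here so that the caller decides the fallback direction.

```python
import unicodedata

# Bidi classes that open an isolate; text up to the matching PDI is skipped.
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def first_strong_direction(paragraph: str) -> str:
    """Return 'ltr', 'rtl', or 'neutral' for a single paragraph.

    Scans for the first strong character (L, R, or AL), ignoring anything
    between an isolate initiator and its matching PDI. An unmatched
    initiator hides the rest of the paragraph.
    """
    isolate_depth = 0
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_INITIATORS:
            isolate_depth += 1
        elif bidi_class == "PDI":
            if isolate_depth > 0:
                isolate_depth -= 1
        elif isolate_depth == 0:
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return "neutral"


def paragraph_directions(text: str) -> list:
    """Resolve each line of a plain-text document independently."""
    return [first_strong_direction(line) for line in text.splitlines()]


if __name__ == "__main__":
    sample = "plain English line\n\u05e9\u05dc\u05d5\u05dd email\n1234"
    for number, direction in enumerate(paragraph_directions(sample), start=1):
        print(number, direction)
```

Treating every line as its own paragraph is a simplification; a real editor or viewer would also have to decide what to do with "neutral" paragraphs, which is exactly where a higher-level protocol might step in.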
2018-07-10 0:33 GMT+02:00 Shai Berger via Unicode : > Hello all, > > About two and a half years ago, I suggested adding a FAQ about the > applicability of higher-level protocols for bidirectional plaintext, as > specified by http://www.unicode.org/reports/tr9/ -- my suggestion was > to clarify that higher-level protocols can only be applied upon > agreement between all producers and consumers, and that such agreements > effectively mean that the text is "special text" -- no longer plain. > > In the time since then, I have been mostly removed from this issue, but > I came back to it recently, to find that my suggested text was > rejected, and instead, two FAQs were added to > http://www.unicode.org/faq/bidi.html: The first, which is marked by the > HTML anchor bidi7, goes with my understanding and defines a > higher-level protocol as an agreement; but the second, marked as bidi8, > goes the other way, and explains that actually, agreement is not > necessary -- a program is at liberty to "implicitly define an overall > directional context for display, and that implicit definition of > direction is itself an example of application of a higher-level > protocol for the purposes of the UBA". > > One result of this is the following scenario: I open my > standard-compliant text editor, and write a line of text (to make > things accessible to a wider audience, I use capitals for right-to-left > English and small letters for normal, left-to-right English; note this > sentence starts from the right): > > SESU RETHO DNA email ROF plaintext REFERP I > > I save this line in a text file. Then I display it using my > standards-compliant text viewer, but now it looks like this: > > REFERP I plaintext ROF email SESU RETHO DNA > > And this is because my standard-compliant text-viewer chooses to apply > its higher-level protocol and treat the line as a LTR paragraph. 
> > Since bidi8 is a little abstract on this point, and focuses on terminal > windows rather than editors and viewers, I would like to ask: > Does this concrete result represent the intents of the UTC? > > Thanks for your attention, > > Shai. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 10 10:50:03 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 10 Jul 2018 18:50:03 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: (message from Philippe Verdy via Unicode on Tue, 10 Jul 2018 13:37:56 +0200) References: <20180710013328.17b7807f.shai@platonix.com> Message-ID: <83601ngkes.fsf@gnu.org> > Date: Tue, 10 Jul 2018 13:37:56 +0200 > Cc: unicode Unicode Discussion > From: Philippe Verdy via Unicode > > Your "standard compliant" plain text editor just forces a LTR default for the whole document, and does not > tolerate that individual paragraphs may start with an undetermined direction (which should then be determined > by the first character on the line that defines a direction.) > In my opinion, even if your text editor still does not enforce the default left margin side for aligning the text, it > should still treat individual paragraphs isolately and determine the direction to use (each paragraph break > should cancel the direction inheritance). > > A plain text editor should not have a default strong LTR default, it should have a weak undetermined direction, > independantly of the fact that it will align the pagraph to the left of right margin according to the resolved > direction of the first character. I think you may be missing the point. The issue raised by Shai is not what should be the default, the issue is whether each program can have its own rules for overriding the default paragraph direction by applying "higher-level" protocols private to the program, and not shared by other programs when they present the exact same text. 
There's no argument about the default -- it should indeed behave as described in the UBA, i.e. look for the first strong directional character in each isolate run.

From unicode at unicode.org Tue Jul 10 11:40:59 2018
From: unicode at unicode.org (Shai Berger via Unicode)
Date: Tue, 10 Jul 2018 19:40:59 +0300
Subject: UAX #9: applicability of higher-level protocols to bidi plaintext
In-Reply-To:
References: <20180710013328.17b7807f.shai@platonix.com>
Message-ID: <20180710194059.24f0a54a.shai@platonix.com>

Hello Philippe,

On Tue, 10 Jul 2018 13:37:56 +0200 Philippe Verdy via Unicode wrote:

> A plain text editor should not have a default strong LTR default, it
> should have a weak undetermined direction,

I agree -- but the UTC does not, according to the last entry in http://www.unicode.org/faq/bidi.html. I would like to convince them otherwise, or to be shown why my position is wrong.

Thanks,

Shai.

From unicode at unicode.org Tue Jul 10 15:43:55 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Tue, 10 Jul 2018 21:43:55 +0100 (BST)
Subject: Memoji
Message-ID: <13373389.51091.1531255435640.JavaMail.defaultUser@defaultHost>

Thank you for your reply.

John H. Jenkins wrote:

> Memoji are not merely animated emoji; they are personalized avatars.

Some more information about that would be appreciated please. In particular I am wondering how they are transmitted from one end user to another end user. For example, is each frame sent as a sequence of Private Use Area code points with one as a base character and the rest as modifiers, for example, for hair style, hair colour, style of glasses and so on? If that is so then both sender device and receiver device would need to have the same font installed.

Would it be possible please for someone to post an email with a memoji in it to this mailing list so that readers who so choose can analyse the coding in both a circulated email and in the web page archive of the email?

If I remember correctly the rationale for encoding any emoji at all into Unicode was, at the time, that emoji encoded using Private Use Area code points were getting into mailing lists and the mailing lists were becoming archived in web page archives and thus the archives included Private Use Area characters and there was a desire to remove the potential for ambiguity in the archives.

So, if memoji with all of those base characters and modifiers get into databases then maybe the same rationale will be needed and that they will all become encoded into Unicode.

However, maybe they are encoded in a different way, maybe using a base character in the Private Use Area followed by a sequence of tag characters. Or some other way.

> As for animated emoji, I expect that the UTC would consider them out-of-scope for plain text.

Well, my experience is that if a proposal is regarded as being out-of-scope for Unicode then it is screened out and not included in the document register and UTC (Unicode Technical Committee) does not consider the matter and no reason is supplied as to why the proposal is regarded as out-of-scope. So, as far as I am aware, UTC does not consider whether the scope of Unicode should be extended.

Yesterday I suggested that there could be an ANIMATION FRAME SEPARATOR character.

Thinking about this in relation to memoji I have since thought that there could also be an ANIMATION FRAME SEPARATOR YET KEEP THE LETTERING character, then the expression of the memoji could change from frame to frame yet the lettering containing a message could persist over many frames, without needing to be repeated in every frame, until an ANIMATION FRAME SEPARATOR character is received with the lettering for the next part of the message.

> Note that web pages can already contain animated or moving elements which cannot be represented in plain text.

Well, the implication of the name seems to be that memoji are used in text. Whether that is plain text is of interest.
The moji in emoji means letter, character.

https://en.oxforddictionaries.com/definition/emoji

William Overington

Tuesday 10 July 2018

From unicode at unicode.org Tue Jul 10 16:24:16 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Tue, 10 Jul 2018 14:24:16 -0700
Subject: Memoji
In-Reply-To: <13373389.51091.1531255435640.JavaMail.defaultUser@defaultHost>
References: <13373389.51091.1531255435640.JavaMail.defaultUser@defaultHost>
Message-ID:

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Wed Jul 11 18:47:49 2018
From: unicode at unicode.org (Yuhong Bao via Unicode)
Date: Wed, 11 Jul 2018 23:47:49 +0000
Subject: Unicode, emoji and Sundar Pichai
Message-ID:

I wonder how much Sundar Pichai (CEO of Google) participates in Unicode (especially the emoji part)?
Would he be interested in Unicode UTC meetings for example?

Yuhong Bao

From unicode at unicode.org Wed Jul 11 19:22:18 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 11 Jul 2018 16:22:18 -0800
Subject: Unicode, emoji and Sundar Pichai
In-Reply-To:
References:
Message-ID:

According to information found on-line, Sundar Pichai's official e-mail ids are:
sundar at google.com
sundar at gmail.com

Although speculation can be fun, Sundar Pichai would be the best source for answers to your questions.

From unicode at unicode.org Fri Jul 13 02:57:25 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 13 Jul 2018 08:57:25 +0100
Subject: UAX #9: applicability of higher-level protocols to bidi plaintext
In-Reply-To: <20180710194059.24f0a54a.shai@platonix.com>
References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com>
Message-ID: <20180713085725.75923890@JRWUBU2>

On Tue, 10 Jul 2018 19:40:59 +0300
Shai Berger via Unicode wrote:

An agreement can take the form of a hidden condition that if one uses someone else's software, one accepts what that software chooses to do.
(Most notoriously, this applies to the PUA.) There seem to be no applicable rules against 'unfair terms'.

> On Tue, 10 Jul 2018 13:37:56 +0200
> Philippe Verdy via Unicode wrote:
> > A plain text editor should not have a default strong LTR default, it
> > should have a weak undetermined direction,

> I agree -- but the UTC does not, according to the last entry in
> http://www.unicode.org/faq/bidi.html. I would like to convince them
> otherwise, or to be shown why my position is wrong.

Even just for horizontal text, one problem is the shape of the canvas. If it has a left and a right-hand margin, then having an undetermined direction by default can work, given enough memory. The rendering system then has to have enough memory to store the entire paragraph - the strongly directional character may be the last one in the paragraph. I'm not sure that a protocol is allowed to be based on analysing the first 100 characters of a paragraph.

However, it is common for displays to provide a window into a canvas that is unbounded both downwards and either rightwards or leftwards. If it is unbounded rightwards, one needs an LTR paragraph direction: if it is unbounded leftwards, one needs an RTL paragraph direction.

I believe that having a mix of paragraphs unbounded on the left and paragraphs unbounded on the right would feel distinctly odd; it could also be a challenge to manage panning the window. It also raises the question of where the LTR and RTL paragraphs would overlap.

Richard.
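
[Editorial illustration, not part of any message.] The default rule debated in this thread (UBA rules P2 and P3: take the direction of the first strong character, skipping anything inside isolates) might be sketched roughly as below. This is only a minimal sketch under stated assumptions: the function name `first_strong_direction` and its `limit` parameter are hypothetical, and the bounded lookahead merely illustrates the memory trade-off Richard raises; it is not something the UBA defines.

```python
import unicodedata

# Bidi classes that count as "strong" for paragraph-level detection (rule P2).
STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def first_strong_direction(paragraph, limit=None):
    """Return "ltr", "rtl", or None if no strong character is found.

    `limit` (hypothetical, not part of the UBA) caps how many characters
    are examined, modelling a renderer that refuses to buffer a whole
    paragraph before choosing a direction.
    """
    depth = 0  # nesting depth of isolates still open
    for index, ch in enumerate(paragraph):
        if limit is not None and index >= limit:
            break
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_INITIATORS:
            depth += 1
        elif bidi_class == "PDI":
            if depth > 0:
                depth -= 1
        elif depth == 0 and bidi_class in STRONG:
            return STRONG[bidi_class]
    # Rule P3 would fall back to LTR here; returning None lets the caller
    # (i.e. a higher-level protocol) decide instead.
    return None


if __name__ == "__main__":
    print(first_strong_direction("hello"))
    print(first_strong_direction("123 \u05e9\u05dc\u05d5\u05dd"))
    print(first_strong_direction("1" * 200 + "\u05d0", limit=100))
```

A log line consisting of a long run of digits and punctuation followed by a single Hebrew letter shows the cost: without `limit` the whole run has to be scanned before the direction is known, and with a limit the answer can differ from the unbounded one.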
From unicode at unicode.org Fri Jul 13 03:22:51 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 13 Jul 2018 11:22:51 +0300
Subject: UAX #9: applicability of higher-level protocols to bidi plaintext
In-Reply-To: <20180713085725.75923890@JRWUBU2> (message from Richard Wordingham via Unicode on Fri, 13 Jul 2018 08:57:25 +0100)
References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2>
Message-ID: <831sc7ee90.fsf@gnu.org>

> Date: Fri, 13 Jul 2018 08:57:25 +0100
> From: Richard Wordingham via Unicode
>
> Even just for horizontal text, one problem is the shape of the canvas.
> If it has a left and a right-hand margin, then having an undetermined
> direction by default can work, given enough memory. The rendering
> system then has to have enough memory to store the entire paragraph -
> the strongly directional character may be the last one in the
> paragraph. I'm not sure that a protocol is allowed to be based on
> analysing the first 100 characters of a paragraph.

Indeed. We've discovered this problem in Emacs when the UBA was implemented: some buffers, like those visiting log files, have very long stretches of weak characters (digits and punctuation), which require the automatic paragraph direction detection to search very far ahead, potentially slowing down the display engine.

> However, it is common for displays to provide a window into a canvas
> that is unbounded both downwards and either rightwards or leftwards.
> If it is unbounded rightwards, one needs an LTR paragraph direction: if
> it is unbounded leftwards, one needs an RTL paragraph direction.

Yes. In Emacs, there are commands that display text derived from standardized templates. In these cases, we cannot rely on the default determination of the paragraph direction, because the first strong directional character might be unpredictable. We must force a certain paragraph direction in those cases.
> I believe that having a mix of paragraphs unbounded on the left and > paragraphs unbounded on the right would feel distinctly odd; it > could also be a challenge to manage panning the window. It also > raises the question of where the LTR and RTL paragraphs would > overlap. Different applications will have different needs here, so there's definitely a need to provide applications and users with some control of paragraph direction, and the way to do this is define high-level protocols controlled by some optional variables. A well-known example of that is the paragraph-direction buttons in Word and similar processors (although they don't produce plain text, so the analogy is limited). From unicode at unicode.org Fri Jul 13 13:33:29 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 13 Jul 2018 20:33:29 +0200 Subject: Handling emoji Message-ID: Put together a doc about this; suggestions for improvement are welcome. https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jul 13 17:09:04 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 13 Jul 2018 15:09:04 -0700 Subject: Unicode, emoji and Sundar Pichai Message-ID: <20180713150904.665a7a7059d7ee80bb4d670165c8327d.f0e4936b16.wbe@email03.godaddy.com> Yuhong Bao wrote: > I wonder how much Sundar Pichai (CEO of Google) participate in Unicode > (especially the emoji part)? > Would he be interested in Unicode UTC meetings for example? Google currently has a representative on the Unicode Board of Directors (Bob Jung), the Unicode Consortium President, CLDR Technical Committee chair, and Emoji Subcommittee chair (Mark Davis), and the ICU Technical Committee chair (Markus Scherer). With apologies to James Kass, I would speculate that Mr. 
Pichai is a busy man and is quite satisfied with Google's representation within the Unicode organization.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Jul 14 05:06:29 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 14 Jul 2018 12:06:29 +0200
Subject: Handling emoji
In-Reply-To:
References:
Message-ID:

Hello Mark,

In your document (https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview), the last code segment has bugs:

    ULocale danishLocale = ULocale.forLanguageTag("da");
    Collator danishAndEmoji = new RuleBasedCollator(
        ((RuleBasedCollator) Collator.getInstance(locale1)).getRules() +
        ((RuleBasedCollator) Collator.getInstance(locale2)).getRules());

where locale1 and locale2 are undefined. I suppose they are danishLocale, defined here, and emojiLocale defined previously as:

    ULocale emojiLocale = ULocale.forLanguageTag("und-u-co-emoji");

But I'm not sure of their order (which one of the two defined (named) locales is locale1 or locale2).

Philippe.

2018-07-13 20:33 GMT+02:00 Mark Davis ☕️ via Unicode :

> Put together a doc about this; suggestions for improvement are welcome.
>
> https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview
>
> Mark
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Jul 14 05:09:11 2018
From: unicode at unicode.org (Shai Berger via Unicode)
Date: Sat, 14 Jul 2018 13:09:11 +0300
Subject: UAX #9: applicability of higher-level protocols to bidi plaintext
In-Reply-To: <831sc7ee90.fsf@gnu.org>
References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org>
Message-ID: <20180714130911.5d92dd35.shai@platonix.com>

On Fri, 13 Jul 2018 11:22:51 +0300
Eli Zaretskii via Unicode wrote:

>
> Different applications will have different needs here, so there's
> definitely a need to provide applications and users with some control
> of paragraph direction, and the way to do this is define high-level
> protocols controlled by some optional variables. A well-known example
> of that is the paragraph-direction buttons in Word and similar
> processors (although they don't produce plain text, so the analogy is
> limited).

I have no argument with this, but I do think that in such cases it is wrong for the app to pretend that it is still treating the text as plain. If it is an email client, it should use a mime-type such as (just inventing something here) "text/plain:ltr" instead of "text/plain". Emacs should have an LTR-defaulting "Log mode" for log files, while keeping the UBA default for Text mode.

With the current definitions and FAQ, plain text is simply not a viable option for interchange whenever BiDi is involved. The only upside I see for them is, essentially, what Eli and Richard noted: The possibility for improved performance in fringe use-cases. I repeat/rephrase my original question:

The preference expressed by the Bidi FAQ, allowing programs to apply higher-level protocols to plain-text with no limitation, affords performance improvements in fringe cases, for the price of giving up "Plain text must contain enough information to permit the text to be rendered legibly"[1] where BiDi is involved.
Are there other upsides? And whether there are or not -- Does the trade-off reflect the intentions of the UTC? Do they realize how deeply BiDi plaintext is broken?

Thanks for your attention and consideration,

Shai.

[1] http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf page 19

From unicode at unicode.org Sat Jul 14 06:07:50 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 14 Jul 2018 14:07:50 +0300
Subject: UAX #9: applicability of higher-level protocols to bidi plaintext
In-Reply-To: <20180714130911.5d92dd35.shai@platonix.com> (message from Shai Berger on Sat, 14 Jul 2018 13:09:11 +0300)
References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com>
Message-ID: <83h8l2axdl.fsf@gnu.org>

> Date: Sat, 14 Jul 2018 13:09:11 +0300
> From: Shai Berger
> Cc: Eli Zaretskii
>
> I have no argument with this, but I do think that in such cases it is
> wrong for the app to pretend that it is still treating the text as
> plain.

What is "plain text" in this context? Does, for example, text with bidi formatting controls count as "plain"?

> If it is an email client, it should use a mime-type such as
> (just inventing something here) "text/plain:ltr" instead of
> "text/plain".

As long as such mime-types don't exist, we cannot use them, right?

> Emacs should have an LTR-defaulting "Log mode" for log
> files, while keeping the UBA default for Text mode.

That's exactly what Emacs does.

> With the current definitions and FAQ, plain text is simply not a
> viable option for interchange whenever BiDi is involved. The only
> upside I see for them is, essentially, what Eli and Richard noted: The
> possibility for improved performance in fringe use-cases. I
> repeat/rephrase my original question:

I don't think those use cases are fringe, FWIW.
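
[Editorial illustration, not part of any message.] Eli's question about bidi formatting controls can be made concrete: a paragraph's base direction can be carried in the text itself by prefixing a directional mark (U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK), so that first-strong detection gives the intended result without any program-private protocol. The sketch below is only an illustration; the helper name `pin_direction` is hypothetical, and whether text marked this way still counts as "plain" is exactly the open question in the thread.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def pin_direction(paragraph, direction):
    """Prefix a paragraph with an invisible strong mark.

    With the mark in place, the default first-strong rule resolves the
    paragraph to `direction` even if the visible text starts with digits,
    punctuation, or a word in the other direction.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    mark = LRM if direction == "ltr" else RLM
    return mark + paragraph


if __name__ == "__main__":
    line = pin_direction("2018-07-14 12:00 \u05e9\u05dc\u05d5\u05dd", "ltr")
    # The first character is now a strong L character.
    print(unicodedata.bidirectional(line[0]))
```

The design trade-off is that the mark travels with the text, so every conforming viewer resolves the same direction, at the price of adding an invisible character to the data.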
From unicode at unicode.org Sat Jul 14 07:51:16 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sat, 14 Jul 2018 14:51:16 +0200 Subject: Handling emoji In-Reply-To: References:

Message-ID:

Thanks for the feedback, Philippe.

I haven't fixed that one yet, but added some more text (thanks to Ben Hamilton!) and an acknowledgments section.

Mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Jul 14 07:53:33 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Sat, 14 Jul 2018 14:53:33 +0200
Subject: Handling emoji
In-Reply-To:
References:

Message-ID:

Just fixed the one you found, Philippe...

Mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Jul 14 08:15:37 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 14 Jul 2018 14:15:37 +0100
Subject: UAX #9: applicability of higher-level protocols to bidi plaintext
In-Reply-To: <20180714130911.5d92dd35.shai@platonix.com>
References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com>
Message-ID: <20180714141537.5a462282@JRWUBU2>

On Sat, 14 Jul 2018 13:09:11 +0300
Shai Berger via Unicode wrote:

> On Fri, 13 Jul 2018 11:22:51 +0300
> Eli Zaretskii via Unicode wrote:
>
> >
> > Different applications will have different needs here, so there's
> > definitely a need to provide applications and users with some
> > control of paragraph direction, and the way to do this is define
> > high-level protocols controlled by some optional variables. A
> > well-known example of that is the paragraph-direction buttons in
> > Word and similar processors (although they don't produce plain
> > text, so the analogy is limited).
>
> I have no argument with this, but I do think that in such cases it is
> wrong for the app to pretend that it is still treating the text as
> plain.

The problem with your concept of 'plain text' is that there is almost no such thing. To display text, one has to choose a basic writing direction - direction within lines (LTR, RTL, TTB or BTT) and direction from line to line (TTB, BTT, LTR or RTL) - and that's ignoring boustrophedon variants and specialised cases such as 'round robin' or the spiral of the Phaistos disc.

If the display concept is to treat lines as being of unbounded length, one needs a left margin, a right margin, or perhaps one centres each line. Centred text does not strike me as 'plain'. Centred text is the only one that can handle paragraphs of different directionality well in this concept.
Lines of unbounded length are the natural choice for editors for programming languages - lines are often syntactically significant. They are also syntactically relevant for emails in point by point discussions.

The default BiDi rule for the basic directionality of paragraphs usually works when there is a left margin and a right margin, though buffering makes it impossible to bound the amount of memory required. Note that several key utilities limit the number of combining marks or the length of Indic syllables.

Richard.

From unicode at unicode.org Sat Jul 14 10:51:16 2018
From: unicode at unicode.org (Karl Williamson via Unicode)
Date: Sat, 14 Jul 2018 09:51:16 -0600
Subject: Missing UAX#31 tests?
In-Reply-To:
References:

Message-ID:

On 07/09/2018 02:11 PM, Karl Williamson via Unicode wrote:
> On 07/08/2018 03:21 AM, Mark Davis ☕️ wrote:
>> I'm surprised that the tests for 11.0 passed for a 10.0
>> implementation, because the following should have triggered a
>> difference for WB. Can you check on this particular case?
>>
>> ÷ 0020 × 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷ [0.3]
>
> I'm one of the people who advocated for this change, and I had already
> tailored our implementation of 10.0 to not break between horizontal
> white space, so it's actually not surprising that this rule didn't break
>>

It turns out that the fault was all mine; the Unicode 11.0 tests were failing on a 10.0 implementation. I'm sorry for starting this red herring thread.

If you care to know the details, read on.

The code that runs the tests knows what version of the UCD it is using, and it knows what version of the UAX boundary algorithms it is using. If these differ, it emits a warning about the discrepancy, and expects that there are going to be many test failures, so it marks all failing ones as 'To do' which suppresses their output, so as to not distract from any other failures that have been introduced by using the new UCD version. (Updating the algorithm comes last.)

The solution for the future is to change the warning about the discrepancy to note that the failing boundary algorithm tests are suppressed. This will clue me (or whoever) in that all is not necessarily well.

>>
>> About the testing:
>>
>> The tests are generated so that they go through all the combinations
>> of pairs, and some combinations of triples. The generated test cases
>> use a sample from each partition of characters, to cut down on the
>> file size to a reasonable level. That also means that some changes in
>> the rules don't cause changes in the test results.
>> Because it is not possible to test every combination, there is also
>> provision for additional test cases, such as those at the end of the
>> files, eg:
>>
>> https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
>> https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html
>>
>> We should extend those each time to make sure we cover combinations
>> that aren't covered by pairs. There were some additions to that end;
>> if they didn't cover enough cases, then we can look at your experience
>> to add more.
>>
>> I can suggest two strategies for further testing:
>>
>> 1. To do a full test, for each row check every combination obtained
>> by replacing each sample character by every other character in its
>> partition. Eg for the above line that would mean testing every
>> <WSegSpace, WSegSpace> sequence.
>>
>> 2. Use a monkey test against ICU. That is, generate random
>> combinations of characters from different partitions and check that
>> ICU and your implementation are in sync.
>>
>> 3. During the beta period, test your previous-version with the new
>> test files. If there are no failures, yet there are changes in the
>> rules, then raise that issue during the beta period so we can add tests.
>
> I actually did this, and as I recall, did find some test failures. In
> retrospect, I must have screwed up somehow back then. I was under tight
> deadline pressure, and as a result, did more cursory beta testing than
> normal.
>>
>> 4. If possible, during the beta period upgrade your implementation and
>> test against the new and old test files.
>
>>
>> Anyone else have other suggestions for testing?
>>
>> Mark
>>
>
> As an aside, a release or two ago, I implemented SB, and someone
> immediately found a bug, and accused me of releasing software that had
> not been tested at all. He had looked through the test suite and not
> found anything that looked like it was testing that.
> But he failed to find the test file which bundled up all your tests,
> in a manner he was not accustomed to, so it was easy for him to
> overlook. The bug only manifested itself in longer runs of characters
> than your pairs and triples tested. I looked at it, and your SB tests
> still seemed reasonable, and I should not expect a more complete series
> than you furnished.
>>
>>
>> Mark
>> //////
>>
>> On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode wrote:
>>
>>     I am working on upgrading from Unicode 10 to Unicode 11.
>>
>>     I used all the new files.
>>
>>     The algorithms for some of the boundaries, like GCB and WB, have
>>     changed so that some of the property values no longer have code
>>     points associated with them.
>>
>>     I ran the tests furnished in 11.0 for these boundaries, without
>>     having changed the algorithms from earlier releases. All passed
>>     100%.
>>
>>     Unless I'm missing something, that indicates that the tests
>>     furnished in 11.0 do not contain instances that exercise these
>>     changes. My guess is that the 10.0 tests were also deficient.
>>
>>     I have been relying on the UCD to furnish tests that have enough
>>     coverage to sufficiently exercise the algorithms that are specified
>>     in UAX 31, but that appears to have been naive on my part
>>
>>
>
>
>

From unicode at unicode.org Sat Jul 14 11:57:19 2018
From: unicode at unicode.org (Rick McGowan via Unicode)
Date: Sat, 14 Jul 2018 09:57:19 -0700
Subject: Server move notice, Unicode
Message-ID: <5B4A2B6F.6000402@unicode.org>

Hello everyone,

Over this weekend on Sunday (US time) July 15 the www.unicode.org server will be undergoing a migration. Downtime should be minimal, but there is some possibility of brief periods off-line from Sunday morning through evening. The Unicode mail list may be unavailable for parts of the day on Sunday. We expect to be complete and functioning normally before Monday morning.
Only the "www" server is affected. CLDR, Survey Tool, and ICU facilities are not affected at this time.

Rick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Jul 14 12:50:07 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Sat, 14 Jul 2018 19:50:07 +0200
Subject: Missing UAX#31 tests?
In-Reply-To:
References:

Message-ID:

Not to worry, these things happen to the best of us. Just glad the root of the problem was found.

Mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From unicode at unicode.org Sat Jul 14 14:14:35 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 14 Jul 2018 12:14:35 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180714141537.5a462282@JRWUBU2> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <20180714141537.5a462282@JRWUBU2> Message-ID: <9daddafe-4b15-be59-4b2b-616786e59e2b@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jul 15 02:50:30 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 15 Jul 2018 08:50:30 +0100 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <9daddafe-4b15-be59-4b2b-616786e59e2b@ix.netcom.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <20180714141537.5a462282@JRWUBU2> <9daddafe-4b15-be59-4b2b-616786e59e2b@ix.netcom.com> Message-ID: <20180715085030.1f4dd06e@JRWUBU2> On Sat, 14 Jul 2018 12:14:35 -0700 Asmus Freytag via Unicode wrote: > The bidi case is just another such case where you cannot expect any > fidelity in presentation whatsoever. (And certainly not in the case > of degenerate files containing all but one weak character). It's going a bit far to call an ASCII histogram degenerate. Richard. From unicode at unicode.org Mon Jul 16 00:07:58 2018 From: unicode at unicode.org (Janusz S. 
Bień via Unicode) Date: Mon, 16 Jul 2018 07:07:58 +0200 Subject: Variation Sequences (and L2-11/059) Message-ID: <868t6b4vkh.fsf@mimuw.edu.pl> FAQ (http://unicode.org/faq/vs.html) states: For historic scripts, the variation sequence provides a useful tool, because it can show mistaken or nonce glyphs and relate them to the base character. It can also be used to reflect the views of scholars, who may see the relation between the glyphs and base characters differently. Also, new variation sequences can be added for new variant appearances (and their relation to the base characters) as more evidence is discovered. It states also: What variation sequences are valid? Only those listed in StandardizedVariants.txt... However the file in question contains only sections for mathematics and some rather exotic scripts. To the best of my knowledge, the only attempt to introduce additional variation sequences was Karl Pentzlin's strongly criticised proposal L2-11/059 http://www.unicode.org/L2/L2011/11059-latin-cyr-var.pdf What has happened to it? I don't remember any information about it on the list. However my primary question is: Are variation sequences *really* recommended for historical scripts? I ask the question because there are now several historical corpora of Polish under development, which use at present a kind of fall-back or some other ad hoc solutions for "nonce glyphs", as they are called in the FAQ. Best regards Janusz -- , Janusz S. 
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Mon Jul 16 02:53:03 2018 From: unicode at unicode.org (Shai Berger via Unicode) Date: Mon, 16 Jul 2018 10:53:03 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <9daddafe-4b15-be59-4b2b-616786e59e2b@ix.netcom.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <20180714141537.5a462282@JRWUBU2> <9daddafe-4b15-be59-4b2b-616786e59e2b@ix.netcom.com> Message-ID: <20180716104439.4ca7e64b.shai@platonix.com> On Sat, 14 Jul 2018 12:14:35 -0700 Asmus Freytag via Unicode wrote: > I would say the problem lies in the attempt to exchange arbitrary raw > data and expect perfectly compatible rendering [...] Editors for > plain text will wrap or not wrap lines on presentation [...] The bidi > case is just another such case This is not about "perfectly compatible rendering", it is about legible rendering. As specified in the Unicode standard. Another example below. On Sat, 14 Jul 2018 14:15:37 +0100 Richard Wordingham via Unicode wrote: > If the display concept is to treat lines as being of unbounded length, > one needs a left margin, a right margin, or perhaps one centres each > line. Centred text does not strike me as 'plain'. You seem to be confounding directionality with alignment. While, for plaintext, I would find it preferable if the two always matched, this is not what I'm asking for, and not what I see as a requirement for making plain text usable. To be clear: If I write a file containing a single line (this is all English, no special use of capitals), the iconic: Hello, World! then, when I open this file in a standard-compliant editor, I'm ok with seeing (centered) Hello, World! or (right aligned) Hello, World! or even (wrapped at a surprisingly short line length) Hello, World! 
Indeed, these are presentation issues, where fidelity is not expected, almost on a level with font and color. What I'm not OK with is: !Hello, World Which is what you'll see if your editor decides to use RTL directionality for this file, as the FAQ says it may. What I'm asking is that we stop calling this behavior "standard compliant"; and I refer you back to my first message in this thread[1] for an example of the mess that this creates with true BiDi text. Thanks, Shai. [1] http://unicode.org/pipermail/unicode/2018-July/006702.html From unicode at unicode.org Mon Jul 16 03:08:54 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 16 Jul 2018 01:08:54 -0700 Subject: Variation Sequences (and L2-11/059) In-Reply-To: <868t6b4vkh.fsf@mimuw.edu.pl> References: <868t6b4vkh.fsf@mimuw.edu.pl> Message-ID: <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 16 13:00:29 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 16 Jul 2018 19:00:29 +0100 (BST) Subject: Variation Sequences (and L2-11/059) In-Reply-To: <868t6b4vkh.fsf@mimuw.edu.pl> References: <868t6b4vkh.fsf@mimuw.edu.pl> Message-ID: <26004755.45820.1531764029934.JavaMail.defaultUser@defaultHost> Hi > I ask the question because there are now several historical corpora of Polish under development, which use at present a kind of fall-back or some other ad hoc solutions for "nonce glyphs", as they are called in the FAQ. I wonder if you could say please what are the "kind of fall-back or some other ad hoc solutions" please. The reason I ask is because I have thought of a possible solution to the problem that has graceful fall-back and uses only plane 0 characters, no Private Use Area characters at all: I am wondering whether my suggestion will be of use or if it is just another method that could just be added to a collection of "kind of fall-back or some other ad hoc solutions". 
My suggestion is to use for each desired glyph a sequence consisting of three characters, and then have an OpenType font decode them so that the glyph can be displayed. Each such sequence being of the form. Base character ZERO WIDTH JOINER then a circled digit character or a circled number character. http://www.unicode.org/charts/PDF/U2460.pdf Thus there being up to twenty specific glyphs for each base character. The list of glyphs could be gradually extended as needed and if an attempt to display a newly added glyph is made using a font implemented from an earlier list then there would be graceful fall-back to the base character followed by a circled digit. It would be helpful for entering text into documents if the ZERO WIDTH JOINER character has a visible glyph within the font. Then entering text with OpenType glyph substitution turned off could be easier to carry out. I am wondering quite how acceptable such a solution would be for standardization: the list of ways that something can be encoded using a ZWJ (ZERO WIDTH JOINER) character seems to have recently been de facto extended for use with generating emoji sequences - not with circled digits but use of ZWJ to change meaning which is a far bigger extension than needed for this suggestion as meaning would often be unaltered when using this suggestion. William Overington Monday 16 July 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/07/16 - 06:07 (GMTDT) To : unicode at unicode.org Subject : Variation Sequences (and L2-11/059) FAQ (http://unicode.org/faq/vs.html) states: For historic scripts, the variation sequence provides a useful tool, because it can show mistaken or nonce glyphs and relate them to the base character. It can also be used to reflect the views of scholars, who may see the relation between the glyphs and base characters differently. 
Also, new variation sequences can be added for new variant appearances (and their relation to the base characters) as more evidence is discovered. It states also: What variation sequences are valid? Only those listed in StandardizedVariants.txt... However the file in question contains only sections for mathematics and some rather exotic scripts. To the best of my knowledge, the only attempt to introduce additional variation sequences was Karl Pentzlin's strongly criticised proposal L2-11/059 http://www.unicode.org/L2/L2011/11059-latin-cyr-var.pdf What has happened to it? I don't remember any information about it on the list. However my primary question is: Are variation sequences *really* recommended for historical scripts? I ask the question because there are now several historical corpora of Polish under development, which use at present a kind of fall-back or some other ad hoc solutions for "nonce glyphs", as they are called in the FAQ. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Mon Jul 16 17:51:12 2018 From: unicode at unicode.org (Shai Berger via Unicode) Date: Tue, 17 Jul 2018 01:51:12 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <83h8l2axdl.fsf@gnu.org> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> Message-ID: <20180717015112.162cd939.shai@platonix.com> Hi Eli and all, On Sat, 14 Jul 2018 14:07:50 +0300 Eli Zaretskii via Unicode wrote: > From: Shai Berger > > > > I have no argument with this, but I do think that in such cases it > > is wrong for the app to pretend that it is still treating the text > > as plain. > > What is "plain text" in this context? 
> Plain text here is the thing described in subsection "Plain Text" in the core Unicode Standard, Chapter 2 Section 2 "General Structure: Unicode Design Principles". In terms of composition, it is "a pure sequence of character codes"; in terms of function, it is "public, standardized, and universally readable". > Does, for example, text with bidi formatting controls count as > "plain"? So long as the bidi controls are Unicode characters, I'd say "yes" -- according to the definitions above. The one thing I would disagree with is calling them "formatting controls" -- as I believe they encode semantics, not appearance. And I should add, in response to the other points raised in this thread, from the same page in the core standard: "If the same plain text sequence is given to disparate rendering processes, there is no expectation that rendered text in each instance should have the same appearance. Instead, the disparate rendering processes are simply required to make the text legible according to the intended reading." That paragraph ends with the following summary, emphasized in the source: Plain text must contain enough information to permit the text to be rendered legibly, and nothing more. The last answer in http://www.unicode.org/faq/bidi.html violates this dictum, as I have shown here with different examples. As long as it stands, the Unicode Standard fails its own criteria. Thanks, Shai. 
From unicode at unicode.org Mon Jul 16 19:40:50 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 16 Jul 2018 17:40:50 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180717015112.162cd939.shai@platonix.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> Message-ID: On 7/16/2018 3:51 PM, Shai Berger via Unicode wrote: > And I should add, in response to the other points raised in this > thread, from the same page in the core standard: "If the same plain text > sequence is given to disparate rendering processes, there is no > expectation that rendered text in each instance should have the same > appearance. Instead, the disparate rendering processes are simply > required to make the text legible according to the intended reading." > That paragraph ends with the following summary, emphasized in the > source: > > Plain text must contain enough information to permit the text > to be rendered legibly, and nothing more. > > The last answer in http://www.unicode.org/faq/bidi.html violates this > dictum, as I have showed here with different examples. As long as it > stands, the Unicode standard fails its own criteria. I've been trying to follow your reasoning in this long thread, but am still not finding much to convince me that there is anything wrong in the #bidi8 FAQ entry that you keep claiming is wrong. First, for your "Hello, world!" example, in a rendering that imposes a RTL directional context, the correct, conformant display of that string is: !Hello, world as you cited in your earlier example. To do otherwise would represent a *non*-conformant implementation of the UBA. So your complaint seems to boil down to the claim that if you transmit "Hello, world!" 
to a process which then renders it conformantly according to the Unicode Standard (including UBA), then that process must somehow know *and honor* your intent that it display in a LTR directional context. That information, however, is explicitly *not* contained in the plain text string there, and has to be conveyed by means of a higher-level protocol. (E.g. HTML markup as dir="ltr", etc.) If the receiving process, by whatever means, has raised its hand and says, effectively, "I assume a RTL context for all text display", that is its right. You can't complain if it displays your "Hello, world!" as shown above. Well, you *can* complain, but you wouldn't be correct. Basically, you and the receiving process do not share the same assumptions about the higher-level protocol involved which specifies paragraph direction. So as I see it, you are either wanting the plain text to somehow contain and enforce upon the renderer your assumption about the directional context that it should be displayed in, OR, you are just unhappy about the bidirectional rendering conundrums of some edge cases for the UBA. In either case, the remedy is the application of LTR characters to provide context (or directional isolate controls, or explicit higher-level markup). 
--Ken From unicode at unicode.org Mon Jul 16 22:30:39 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 17 Jul 2018 04:30:39 +0100 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180716104439.4ca7e64b.shai@platonix.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <20180714141537.5a462282@JRWUBU2> <9daddafe-4b15-be59-4b2b-616786e59e2b@ix.netcom.com> <20180716104439.4ca7e64b.shai@platonix.com> Message-ID: <20180717043039.73934571@JRWUBU2> On Mon, 16 Jul 2018 10:53:03 +0300 Shai Berger via Unicode wrote: > What I'm not OK with is: > > !Hello, World > > Which is what you'll see if your editor decides to use RTL > directionality for this file, as the FAQ says it may. Using 'left aligned' for RTL and 'right aligned' for LTR are 'marked' styles; they are not appropriate for uninterpreted plain text. Thus if text is to be displayed as left aligned, LTR defaults are appropriate. With RTL default and right alignment, what looks like !Hello, World is much more acceptable for "Hello, World!". An interesting ambiguity is "!True" v. "True!". "!True" can be read as "Not true". The solution may be to encourage the determination of the (default) paragraph direction from the first paragraph for implementations with only one margin. I am not sure if this behaviour is 'standard compliant'. Richard. 
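For readers following the argument, the default paragraph-direction rule under discussion (rules P2 and P3 of UAX #9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a conformant UBA implementation: it ignores the requirement to skip text inside directional isolates, it does no reordering, and the name first_strong_direction is made up for this example.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK, a strong L character


def first_strong_direction(paragraph: str) -> str:
    """Simplified P2/P3: take the direction of the first strong character.

    Assumption: isolate initiators and their contents are NOT skipped,
    which a real UBA implementation is required to do.
    """
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # P3: no strong character found, so the paragraph level defaults to 0 (LTR)
    return "ltr"


if __name__ == "__main__":
    hebrew = "\u05E9\u05DC\u05D5\u05DD"  # a Hebrew word
    for sample in ("Hello, World!", "!True", hebrew + ", World!", LRM + hebrew):
        print(repr(sample), first_strong_direction(sample))
```

Under this rule "Hello, World!" resolves to LTR from the text alone; it displays as "!Hello, World" only when something outside the text overrides the paragraph direction. The last sample shows the remedy Ken mentions: a leading LRM supplies a strong LTR character, so the same heuristic yields LTR even for a paragraph that begins with Hebrew.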
From unicode at unicode.org Mon Jul 16 23:51:32 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 16 Jul 2018 21:51:32 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180717043039.73934571@JRWUBU2> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <20180714141537.5a462282@JRWUBU2> <9daddafe-4b15-be59-4b2b-616786e59e2b@ix.netcom.com> <20180716104439.4ca7e64b.shai@platonix.com> <20180717043039.73934571@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 17 00:04:28 2018 From: unicode at unicode.org (Janusz S. Bień via Unicode) Date: Tue, 17 Jul 2018 07:04:28 +0200 Subject: Variation Sequences (and L2-11/059) References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> Message-ID: <86601e4fmr.fsf@mimuw.edu.pl> On Mon, Jul 16 2018 at 1:08 -0700, unicode at unicode.org writes: > The use case would seem to be more properly served by some form of > registration mechanism, like the one IVD represents for ideographs. I agree. > > The use of "standardized" variation sequences with the understanding > that those would be (fairly) widely implemented would, in contrast, be > best reserved to cases where the encoding in the Standard resulted > in deliberately unifying some variations for which there is > nevertheless a common (!) use case of requiring each alternate to be > selected. I agree. [...] > On 7/15/2018 10:07 PM, Janusz S. Bień via Unicode wrote: > > > FAQ (http://unicode.org/faq/vs.html) states: > > For historic scripts, the variation sequence provides a useful tool, > because it can show mistaken or nonce glyphs and relate them to the > base character. 
It can also be used to reflect the views of > scholars, who may see the relation between the glyphs and base > characters differently. Also, new variation sequences can be added > for new variant appearances (and their relation to the base > characters) as more evidence is discovered. > It states also: > > What variation sequences are valid? > Only those listed in StandardizedVariants.txt... The full answer is: Only those listed in StandardizedVariants.txt, emoji-variation-sequences.txt, or the registered sequences listed in the Ideographic Variation Database (IVD). Do we agree that the statements are not consistent, at least with your view, which I share? I understand there is no sufficient demand for the Unicode Consortium maintaining a supplementary non-ideographic variation database. Hence for the time being a kind of Private Use variation database seems to be the only solution - am I right? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Jul 17 00:06:44 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 17 Jul 2018 07:06:44 +0200 Subject: Variation Sequences (and L2-11/059) References: <868t6b4vkh.fsf@mimuw.edu.pl> <26004755.45820.1531764029934.JavaMail.defaultUser@defaultHost> Message-ID: <86y3ea30yj.fsf@mimuw.edu.pl> On Mon, Jul 16 2018 at 19:00 +0100, wjgo_10009 at btinternet.com writes: > Hi > >> I ask the question because there are now several historical corpora >> of Polish under development, which use at present a kind of fall-back >> or some other ad hoc solutions for "nonce glyphs", as they are called >> in the FAQ. > > I wonder if you could say please what are the "kind of fall-back or > some other ad hoc solutions" please. I would prefer not to go into details. I think some of those "solutions" are simply wrong but the list is not the right place to criticize them. 
> The reason I ask is because I have thought of a possible solution to >the problem that has graceful fall-back and uses only plane 0 >characters, no Private Use Area characters at all: I am wondering >whether my suggestion will be of use or if it is just another method >that could just be added to a collection of "kind of fall-back or some >other ad hoc solutions". > My suggestion is to use for each desired glyph a sequence consisting > of three characters, and then have an OpenType font decode them so > that the glyph can be displayed. This is a prohibitive requirement, because for years there is the lack of font creators interested in old Polish. > Each such sequence being of the form. > > Base character ZERO WIDTH JOINER then a circled digit character or a circled number character. > > http://www.unicode.org/charts/PDF/U2460.pdf > > Thus there being up to twenty specific glyphs for each base character. > > The list of glyphs could be gradually extended as needed and if an > attempt to display a newly added glyph is made using a font > implemented from an earlier list then there would be graceful > fall-back to the base character followed by a circled digit. > > It would be helpful for entering text into documents if the ZERO WIDTH > JOINER character has a visible glyph within the font. Then entering > text with OpenType glyph substitution turned off could be easier to > carry out. I perceive your proposal as "visible variant selectors for private variation sequences", as a text encoded this way can be easily converted into a text using real variant selectors. I think it might be a reasonable temporary solution, but not the ultimate one. 
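The conversion mentioned above, from the "visible selector" form to real variation selectors, could look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the function name, and above all the mapping of circled number n to variation selector VSn, which is neither standardized nor registered anywhere, so the output would only be meaningful under a private agreement.

```python
import re

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# base character + ZWJ + CIRCLED DIGIT ONE (U+2460) .. CIRCLED NUMBER TWENTY (U+2473)
_VISIBLE_FORM = re.compile("(.)" + ZWJ + "([\u2460-\u2473])", re.DOTALL)


def _selector(n: int) -> str:
    """Return variation selector VSn for 1 <= n <= 20 (hypothetical mapping)."""
    # VS1..VS16 are U+FE00..U+FE0F; VS17..VS256 are U+E0100..U+E01EF (plane 14)
    return chr(0xFE00 + n - 1) if n <= 16 else chr(0xE0100 + n - 17)


def to_variation_selectors(text: str) -> str:
    """Rewrite <base, ZWJ, circled number n> as <base, VSn>."""
    def replace(match: re.Match) -> str:
        n = ord(match.group(2)) - 0x2460 + 1
        return match.group(1) + _selector(n)

    return _VISIBLE_FORM.sub(replace, text)


if __name__ == "__main__":
    sample = "a" + ZWJ + "\u2460" + "b"
    print([hex(ord(c)) for c in to_variation_selectors(sample)])
```

The sketch treats a single code point as the base character, which is a simplification (a base followed by combining marks would need more care). Note also that variants 17 to 20 necessarily map to supplementary-plane selectors, since only sixteen variation selectors exist in plane 0.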
> I am wondering quite how acceptable such a solution would be for > standardization: the list of ways that something can be encoded using > a ZWJ (ZERO WIDTH JOINER) character seems to have recently been de > facto extended for use with generating emoji sequences - not with > circled digits but use of ZWJ to change meaning which is a far bigger > extension than needed for this suggestion as meaning would often be > unaltered when using this suggestion. I would expect arguments that it has no obvious advantage over variation sequences. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Jul 17 07:45:43 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 17 Jul 2018 13:45:43 +0100 (BST) Subject: Variation Sequences (and L2-11/059) In-Reply-To: <86601e4fmr.fsf@mimuw.edu.pl> References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> Message-ID: <16224306.24341.1531831543363.JavaMail.defaultUser@defaultHost> Janusz S. Bien wrote: > I understand there is no sufficient demand for the Unicode Consortium maintaining a supplementary non-ideographic variation database. Hence for the time being a kind of Private Use variation database seems to be the only solution - am I right? Well, with the greatest respect, in my opinion, no. You could use my suggestion and send a copy of your encoding to the Unicode Technical Committee (UTC) and maybe they will endorse it. There is a precedent in the astronaut emoji, where in glyph substitution the rocket was lost and a space suit was obtained from somewhere. For my suggestion the circled digit would be lost and an alternate glyph introduced. Whether the Unicode Technical Committee would endorse such an encoding would need to wait for a meeting of the UTC. 
Too often in relation to Unicode matters things get done by people saying what they consider the UTC will say and ideas get screened out and the UTC never gets the opportunity to consider them. William Overington Tuesday 17 July 2018 From unicode at unicode.org Tue Jul 17 08:07:39 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 17 Jul 2018 14:07:39 +0100 (BST) Subject: Variation Sequences (and L2-11/059) In-Reply-To: <86y3ea30yj.fsf@mimuw.edu.pl> References: <868t6b4vkh.fsf@mimuw.edu.pl> <26004755.45820.1531764029934.JavaMail.defaultUser@defaultHost> <86y3ea30yj.fsf@mimuw.edu.pl> Message-ID: <28368285.25592.1531832859579.JavaMail.defaultUser@defaultHost> WJGO >> My suggestion is to use for each desired glyph a sequence consisting of three characters, and then have an OpenType font decode them so that the glyph can be displayed. JSB >This is a prohibitive requirement, because for years there is the lack of font creators interested in old Polish. Well, I have not been aware of any call for participation. It seems an interesting project. I make OpenType fonts using the FontCreator program. There is an active forum with helpful people participating. So you could if you wish try to make your own font and receive help or you could if you wish ask if people might like to join in the research project and make fonts. https://forum.high-logic.com/ JSB > I perceive your proposal as "visible variant selectors for private variation sequences", as a text encoded this way can be easily converted into a text using real variant selectors. JSB > I think it might be a reasonable temporary solution, but not the ultimate one. Well, based on the present practice of the way that ZERO WIDTH JOINER is being used for encoding emoji, I opine that it has the potential to be a permanent formally-encoded solution. JSB > I would expect arguments that it has no obvious advantage over variation sequences. 
Well, when I looked at the IVD database information it seems to use plane 14 characters. As a practical consideration, my suggestion has the advantage that it only uses plane 0 characters. William Overington Tuesday 17 July 2018 From unicode at unicode.org Tue Jul 17 10:34:04 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 17 Jul 2018 08:34:04 -0700 Subject: Variation Sequences (and L2-11/059) In-Reply-To: <86601e4fmr.fsf@mimuw.edu.pl> References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 17 22:56:48 2018 From: unicode at unicode.org (Janusz S. Bień via Unicode) Date: Wed, 18 Jul 2018 05:56:48 +0200 Subject: Variation Sequences (and L2-11/059) In-Reply-To: (Asmus Freytag via Unicode's message of "Tue, 17 Jul 2018 08:34:04 -0700") References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> Message-ID: <86efg16vsv.fsf@mimuw.edu.pl> On Tue, Jul 17 2018 at 8:34 -0700, Asmus Freytag writes: > On 7/16/2018 10:04 PM, Janusz S. Bień via Unicode wrote: > > I understand there is no sufficient demand for the Unicode Consortium > maintaining a supplementary non-ideographic variation database. Hence > for the time being a kind of Private Use variation database seems to be > the only solution - am I right? > > The question comes down to resources, among other things. As well as to whether > there are actual users / implementers waiting for and ready to adopt such a database > as solution to their problems. I hope the resources are sufficient to improve the wording of the variation sequence FAQ. Do we agree that at present users/implementers are rather misled by it? 
> A strawman proposal could identify these issues and some ways that they might be > addressed and then ask for criteria of what the UTC might deem sufficient. 
Perhaps this statement should be put into FAQ, instead of "you should propose your addition as a variation sequence"? On Tue, Jul 17 2018 at 13:45 +0100, William_J_G Overington writes: > Janusz S. Bien wrote: > >> I understand there is no sufficient demand for the Unicode >> Consortium maintaining a supplementary non-ideographic variation >> database. Hence for the time being a kind of Private Use variation >> database seems to be the only solution - am I right? > > Well, with the greatest respect, in my opinion, no. > > You could use my suggestion and send a copy of your encoding to the > Unicode Technical Committee (UTC) and maybe they will endorse it. Difficult to do as there is no "my encoding". > > There is precedence over the astronaut emoji where in glyph > substitution the rocket was lost and a space suit was obtained from > somewhere. > > For my suggestion the circled digit would be lost and an alternate > glyph introduced. You seem to assume that my concern is only rendering. [...] On Tue, Jul 17 2018 at 14:07 +0100, William_J_G Overington writes: > WJGO >> My suggestion is to use for each desired glyph a sequence > consisting of three characters, and then have an OpenType font decode > them so that the glyph can be displayed. > > JSB >This is a prohibitive requirement, because for years there is the lack of font creators interested in old Polish. > > Well, I have not been aware of any call for participation. It seems an interesting project. > > I make OpenType fonts using the FontCreator program. > > There is an active forum with helpful people participating. > > So you could if you wish try to make your own font Actually I tried: https://bitbucket.org/jsbien/parkosz-font/ Best regards Janusz -- , Janusz S. 
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Jul 18 02:33:17 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 18 Jul 2018 00:33:17 -0700 Subject: Variation Sequences (and L2-11/059) In-Reply-To: <86efg16vsv.fsf@mimuw.edu.pl> References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> <86efg16vsv.fsf@mimuw.edu.pl> Message-ID: On 7/17/2018 8:56 PM, Janusz S. "Bień" wrote: > On Tue, Jul 17 2018 at 8:34 -0700, Asmus Freytag writes: >> On 7/16/2018 10:04 PM, Janusz S. Bień via Unicode wrote: >> >> I understand there is no sufficient demand for the Unicode Consortium >> maintaining a supplementary non-ideographic variation database. Hence >> for the time being a kind of Private Use variation database seems to be >> the only solution - am I right? >> >> The question comes down to resources, among other things. As well as to whether >> there are actual users / implementers waiting for and ready to adopt such a database >> as solution to their problems. > I hope the resources are sufficient to improve wording of the variation > sequence FAQ. Do we agree that at present users/implementers are rather > misled by it? Sure, we can go either of two ways: we can state that Unicode has no, and will not have any, solution to the issue of such variants for non-ideographic scripts. That part is easy. Or, alternatively, we could figure out what the solution space might be (in the right circumstances), including some external resources for maintaining a database on an ongoing basis, and a larger well-identified community of scholars or archivists that sign up to use and support it. If a non-zero solution space exists, simply saying that there will never be any solution would be as wrong as the current wording, which points at something that is no longer part of the solution space . . . (although at one point, people thought it might be). 
> >> A strawman proposal could identify these issues and some ways that they might be >> addressed and then ask for criteria of what the UTC might deem sufficient. > Perhaps this statement should be put into FAQ, instead of "you should > propose your addition as a variation sequence"? There are some additions that should be proposed for standardization, but the bar is relatively high. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 18 03:51:36 2018 From: unicode at unicode.org (Shai Berger via Unicode) Date: Wed, 18 Jul 2018 11:51:36 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> Message-ID: <20180718102606.37c75ca9.shai@platonix.com> On Mon, 16 Jul 2018 17:40:50 -0700 Ken Whistler via Unicode wrote: > > So your complaint seems to boil down to the claim that if you > transmit "Hello, world!" to a process which then renders it > conformantly according to the Unicode Standard (including > UBA), then that process must somehow know *and honor* > your intent that it display in a LTR directional context. That > information, however, is explicitly *not* contained in > the plain text string there, and has to be conveyed by means of a > higher-level protocol. > (E.g. HTML markup as dir="ltr", etc.) > I believe this is an inaccurate description, but indeed the discrepancy is at the root of the issue here. The UBA defines a default algorithm for determining the directionality of plain text paragraphs. My claim is that in the absence of an agreed or conveyed higher-level protocol, this default must be respected. 
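For readers following the argument: the default referred to here is the paragraph-level rule of UAX #9 (rules P2 and P3), under which, absent a higher-level protocol, the first strong character of a paragraph decides its direction. A minimal Python sketch of that rule follows, assuming only the standard `unicodedata` module. The function name `default_paragraph_level` is made up for illustration, and this is a simplification, not the reference implementation.

```python
import unicodedata


def default_paragraph_level(text: str) -> int:
    """Return 0 (LTR) or 1 (RTL) for one paragraph, after UAX #9 rules P2/P3.

    Hypothetical helper for illustration: it scans for the first strong
    character (L, R or AL), skipping anything between an isolate
    initiator and its matching PDI, and stops at a paragraph separator.
    """
    isolate_depth = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "B":
            # Paragraph separator: the paragraph ends here.
            break
        if bidi_class in ("LRI", "RLI", "FSI"):
            isolate_depth += 1
        elif bidi_class == "PDI":
            # An unmatched PDI is simply ignored.
            if isolate_depth > 0:
                isolate_depth -= 1
        elif isolate_depth == 0:
            if bidi_class == "L":
                return 0
            if bidi_class in ("R", "AL"):
                return 1
    # No strong character found: P3 defaults to left-to-right.
    return 0


if __name__ == "__main__":
    for sample in ("Hello, world!", "\u05d0bc", "!abc"):
        print(repr(sample), default_paragraph_level(sample))
```

Whether a receiving application is obliged to apply this rule, or may override it with its own paragraph direction, is exactly the point under dispute in this thread; the sketch only shows what the default computes.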
> If the receiving process, by whatever means, has raised its hand and > says, effectively, "I assume a RTL context for all text display", > that is its right. You can't complain if it displays your "Hello, > world!" as shown above. Well, you *can* complain, but you wouldn't be > correct. Basically, you and the receiving process do not share the > same assumptions about the higher-level protocol involved which > specifies paragraph direction. > This, essentially, boils down to a claim that the default is not really a default, but itself must be the subject of agreement between sides. My view is that expressed by FAQ #bidi7 -- a higher-level protocol is an agreement. It can be explicit (e.g. HTML) or implicit (e.g. the convention that log files are to be read LTR), but it cannot be applied in a void, or else interoperability is lost. > OR, you are just unhappy about the bidirectional > rendering conundrums > of some edge cases for the UBA. I wish they were -- while the "Hello, World!" example is a bit of a contrivance, the "SESU RETHO DNA email ROF plaintext REFERP I" example is quite central to the UBA, and represents an extremely common case; Hebrew paragraphs with embedded English words account for at least a few percent of all paragraphs written in Hebrew about technology, for example. On Mon, 16 Jul 2018 21:51:32 -0700 Asmus Freytag via Unicode wrote: > [The Unicode Standard's] conformance clause is written to allow > implementations to solve real-world issues without becoming formally > non-conformant. I accept that this was the intention; I claim that, as things are currently written, they cause more real-world issues than they solve. The only example given here of a real-world issue served by abolishing the UBA defaults is performance degradation on some special files -- which are just as easy to treat specially, as Eli described in the case of Emacs and logs. 
One other consideration raised boils down to, "it's better to make some texts completely unreadable, then to present some other texts readably, but with the wrong alignment". The trade-off you seem to prefer is to make the "plain text is universally readable" idea from the core Unicode definition, not applicable to BiDi text. Why? Thanks, Shai From unicode at unicode.org Wed Jul 18 04:50:12 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 18 Jul 2018 02:50:12 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180718102606.37c75ca9.shai@platonix.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> <20180718102606.37c75ca9.shai@platonix.com> Message-ID: <32328d0b-80c1-180f-acd6-b646c9a793b5@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 18 04:55:17 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 18 Jul 2018 02:55:17 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180718102606.37c75ca9.shai@platonix.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> <20180718102606.37c75ca9.shai@platonix.com> Message-ID: <86c57a6e-4377-2493-7c3e-87acb5de311f@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jul 18 08:43:36 2018 From: unicode at unicode.org (philip chastney via Unicode) Date: Wed, 18 Jul 2018 13:43:36 +0000 (UTC) Subject: UAX #9: applicability of higher-level protocols to bidi plaintext References: <1730253206.6016468.1531921416625.ref@mail.yahoo.com> Message-ID: <1730253206.6016468.1531921416625@mail.yahoo.com> -------------------------------------------- On Tue, 17/7/18, Richard Wordingham via Unicode wrote: > Subject: Re: UAX #9: applicability of higher-level protocols to bidi plaintext > To: unicode at unicode.org > Date: Tuesday, 17 July, 2018, 3:30 AM > An interesting ambiguity is "!True" v. "True!".? > "!True" can be read as "Not true". true - there are contexts where "!True" can be read as "Not true". it's unclear from the short sample given whether "True" is a variable name, or a Boolean constant, but there are other contexts where "True!" can be read as "the factorial value of True" and yet others where where "!True" can be similarly interpreted there are also contexts where "Hello World!" 
can be read as the function "Hello", applied to the factorial value of "World" even though such a move wouldn't necessarily remove all ambiguity, the easiest solution is to declare that formal notations cannot be "plain" text use a higher-level protocol to identify what formal notation is being used, perhaps, except that I remember a conference where one of the paricipants noted that fully one-third of the time allocated to each presentation was taken up explaining the presenter's notation /phil From unicode at unicode.org Wed Jul 18 09:36:48 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 18 Jul 2018 07:36:48 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <1730253206.6016468.1531921416625@mail.yahoo.com> References: <1730253206.6016468.1531921416625.ref@mail.yahoo.com> <1730253206.6016468.1531921416625@mail.yahoo.com> Message-ID: <764186fb-e09a-a56d-2396-932167c5211f@att.net> On 7/18/2018 6:43 AM, philip chastney via Unicode wrote: > there are also contexts where "Hello World!" can be read as > the function "Hello", applied to the factorial value of "World" > > even though such a move wouldn't necessarily remove all ambiguity, > the easiest solution is to declare that formal notations cannot be "plain" text > Of course they can -- and (usually) should be, as they are designed that way. To state otherwise would just create headaches for designing parsers for formal notations. I think you are confusing ambiguity of *interpretation* of bits of formal notation, taken out of context, with ambiguity of *display* of formal notations in contexts where one does not know and control the paragraph directionality. 
The easiest (and correct) solution, when displaying formal notation for visual interpretation by human readers, is to use tools where one knows and can rely on the paragraph directionality explicitly, so that Unicode bidi doesn't add an out-of-left-field set of display conundrums, as it were, for bidi edge cases that can result in *mis*interpretation by the reader. In other words, if I am trying to read C program text or regex expressions, I expect that my tooling is not going to silently assume a RTL paragraph directional context and present me with visual garbage to interpret, forcing me to reverse engineer the bidi algorithm in my head, just to read the text. Why would I put up with that? --Ken From unicode at unicode.org Wed Jul 18 12:21:03 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 18 Jul 2018 10:21:03 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <1730253206.6016468.1531921416625@mail.yahoo.com> References: <1730253206.6016468.1531921416625.ref@mail.yahoo.com> <1730253206.6016468.1531921416625@mail.yahoo.com> Message-ID: <4003413c-db6e-0bc9-5337-cec6020e8890@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jul 18 13:45:42 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 18 Jul 2018 19:45:42 +0100 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <1730253206.6016468.1531921416625@mail.yahoo.com> References: <1730253206.6016468.1531921416625.ref@mail.yahoo.com> <1730253206.6016468.1531921416625@mail.yahoo.com> Message-ID: <20180718194542.795b8d0b@JRWUBU2> On Wed, 18 Jul 2018 13:43:36 +0000 (UTC) philip chastney via Unicode wrote: > -------------------------------------------- > On Tue, 17/7/18, Richard Wordingham via Unicode > wrote: > > > Subject: Re: UAX #9: applicability of higher-level protocols to > > bidi plaintext To: unicode at unicode.org > > Date: Tuesday, 17 July, 2018, 3:30 AM > > > An interesting ambiguity is "!True" v. "True!".? > > "!True" can be read as "Not true". > > true - there are contexts where "!True" can be read as "Not true". The context I had in mind was terse exchanges between those who have recently programmed in C. Thus, 'true' would be read as 'true' rather than as '1', and '!true' as 'not true'. A longer context would usually eliminate the ambiguity. Richard. From unicode at unicode.org Wed Jul 18 16:46:15 2018 From: unicode at unicode.org (Rick McGowan via Unicode) Date: Wed, 18 Jul 2018 14:46:15 -0700 Subject: Server move notice, Unicode Message-ID: <5B4FB527.50800@unicode.org> Hello everyone, On Wednesday evening (US time) July 18 the www.unicode.org server will again be undergoing migration. Downtime / off-line period is expected to be a few hours at most, beginning shortly after 17:00 Pacific time. We apologize for the inconvenience. 
Rick From unicode at unicode.org Thu Jul 19 02:38:18 2018 From: unicode at unicode.org (Shai Berger via Unicode) Date: Thu, 19 Jul 2018 10:38:18 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <86c57a6e-4377-2493-7c3e-87acb5de311f@ix.netcom.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> <20180718102606.37c75ca9.shai@platonix.com> <86c57a6e-4377-2493-7c3e-87acb5de311f@ix.netcom.com> Message-ID: <20180719103818.0087cdb5.shai@platonix.com> On Wed, 18 Jul 2018 02:55:17 -0700 Asmus Freytag wrote: > On 7/18/2018 1:51 AM, Shai Berger via Unicode wrote: > > The trade-off you seem to prefer is to make the "plain text > > is universally readable" idea from the core Unicode definition, not > > applicable to BiDi text. > > Your idea would simply outlaw being able to view text with a > reader-defined stylesheet imposed on it. Such a stylesheet should be > perfectly able to impose a paragraph direction. > This argument is essentially circular: My point is exactly that such a stylesheet should not be able to impose paragraph directionality, just like it should not be able to impose any other random word reordering. Again, text directionality is an issue of content, not presentation. (I am well aware that the W3C's CSS definition include directionality controls -- I'm arguing that they're appropriate for HTML, but not for plain text) > Just as you might make sure that your application gives you a choice > of using a stylesheet that "imposes" default paragraph direction. > And again -- the point is interoperability. If I cannot trust that people I communicate with make the same choices I make, plain text cannot be used. 
If the Unicode standard does not impose a universal default, it does not define interchangeable plain text. My main point, whose rejection baffles me to no end, is that it should. Shai. From unicode at unicode.org Thu Jul 19 08:05:28 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 19 Jul 2018 16:05:28 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180719103818.0087cdb5.shai@platonix.com> (message from Shai Berger via Unicode on Thu, 19 Jul 2018 10:38:18 +0300) References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> <20180718102606.37c75ca9.shai@platonix.com> <86c57a6e-4377-2493-7c3e-87acb5de311f@ix.netcom.com> <20180719103818.0087cdb5.shai@platonix.com> Message-ID: <83o9f32x5z.fsf@gnu.org> > Date: Thu, 19 Jul 2018 10:38:18 +0300 > Cc: Asmus Freytag > From: Shai Berger via Unicode > > And again -- the point is interoperability. If I cannot trust that > people I communicate with make the same choices I make, plain text > cannot be used. This conclusion is too extreme. In Real Life?, every reasonable application that supports bidirectional text will have a knob that allows the user to force a particular paragraph direction on a region of text. So if you display some text you received from outside the application, and the display looks juggled, let alone illegible, you force the other paragraph direction, and the problem will usually be solved. At least IME, and I do have experience not only with Emacs. 
From unicode at unicode.org Thu Jul 19 11:47:42 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Thu, 19 Jul 2018 17:47:42 +0100 (BST) Subject: Variation Sequences (and L2-11/059) In-Reply-To: <86efg16vsv.fsf@mimuw.edu.pl> References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> <86efg16vsv.fsf@mimuw.edu.pl> Message-ID: <22913690.37041.1532018862390.JavaMail.defaultUser@defaultHost> Janusz S. Bien wrote: > You seem to assume that my concern is only rendering. Well my thinking is that what you are wanting is a way to accurately transcribe documents and maybe printed books from Old Polish into a Unicode-based electronic format so that the information can be more readily studied, while retaining glyph information that is not presently representable using Unicode characters. I found the following. https://en.wikipedia.org/wiki/Old_Polish_language WJGO >> So you could if you wish try to make your own font JSB >Actually I tried: JSB > https://bitbucket.org/jsbien/parkosz-font/ Thank you for the link to the font. I have studied the font in the FontCreator program (version 8). I remember that I produced an OpenType font using Variation Selectors and OpenType Glyph Substitution back in April 2017. I wrote about it and provided a link to the font and a link to a typecase document. https://forum.high-logic.com/viewtopic.php?f=10&t=7033 Although that font is about chess, I am thinking that that is the sort of font that is needed for what you are wanting to do. This could use variation selectors or could use circled digits as desired. I am a researcher and I am looking for a worthwhile project related to typography in which to participate from time to time - no money charged, no money to pay - and I am interested in printed books of the incunabula period and the early sixteenth century. 
I do not know any Polish, but I do not need to be involved in choosing which glyphs are needed, so my not knowing any Polish would not seem to be a problem. William Overington Thursday 19 July 2018 From unicode at unicode.org Thu Jul 19 20:10:49 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 19 Jul 2018 18:10:49 -0700 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <20180719103818.0087cdb5.shai@platonix.com> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> <20180718102606.37c75ca9.shai@platonix.com> <86c57a6e-4377-2493-7c3e-87acb5de311f@ix.netcom.com> <20180719103818.0087cdb5.shai@platonix.com> Message-ID: <72a06159-60b0-d264-f696-7a8c29300244@att.net> On 7/19/2018 12:38 AM, Shai Berger via Unicode wrote: > If I cannot trust that > people I communicate with make the same choices I make, plain text > cannot be used. Here is a counterexample. The following is a chunk of plain text output from the bidi reference implementation:

Trace: Entering br_UBA_IdentifyIsolatingRunSequences [X10]
Current State: 6
  Position:       0    1    2    3    4    5    6    7    8    9   10   11   12
  Text:        05D0 2067 0061 2066 0061 202B 0061 202C 0061 2069 0061 2069 0061
  Bidi_Class:     R  RLI    L  LRI    L  RLE    L  PDF    L  PDI    L  PDI    L
  Levels:         1    1    3    3    4    x    5    x    4    3    3    1    1
  Runs:
  Seqs (L= 1):
  Seqs (L= 3):
  Seqs (L= 4):
  Seqs (L= 5):

If I just let that default to browser output choices (and assuming you read your email with a proportional display font), it becomes almost incomprehensible for casual reading, because the output has an underlying assumption that there is column alignment across lines, which in turn depends on a user choice of a fixed-width font for display. Rectifying that, the reader would then see:

Trace: Entering br_UBA_IdentifyIsolatingRunSequences [X10]
Current State: 6
  Position:       0    1    2    3    4    5    6    7    8    9   10   11   12
  Text:        05D0 2067 0061 2066 0061 202B 0061 202C 0061 2069 0061 2069 0061
  Bidi_Class:     R  RLI    L  LRI    L  RLE    L  PDF    L  PDI    L  PDI    L
  Levels:         1    1    3    3    4    x    5    x    4    3    3    1    1
  Runs:
  Seqs (L= 1):
  Seqs (L= 3):
  Seqs (L= 4):
  Seqs (L= 5):

where now everything makes sense. (Well, at least if the UBA internals are your thing!) It isn't that "plain text cannot be used" to convey this content. The content is certainly "legible" in the minimal sense required by the Unicode Standard, and it is interchangeable without data corruption. The problem is that for optimal display and interpretation as intended, I also need to convey (and/or have the reader guess) the higher-level protocol requirement that this particular plain text needs to be displayed with a monowidth font. > If the Unicode standard does not impose a > universal default, it does not define interchangeable plain text. And that is simply not the case. If your text is (), that will display as {abc!} in a LTR paragraph directional context and as {!abc} in a RTL paragraph directional context. Reliably. It isn't that we don't have interchangeable plain text. We do. What you cannot do is predict exactly how that text will *display*, if you haven't agreed with your interlocutor about paragraph direction. But substantively, that is no different than the proportional versus monowidth font example I just gave. So I think this still really boils down to the putative requirement that for something like "Hello, world!", bidi is just too weird, and that somehow plain text shouldn't be allowed to behave that way. In other words, if plain text doesn't forcefully carry with it and require how it must be displayed, well, then it isn't really interchangeable. But that isn't what the Unicode Standard means by plain text. And isn't what it requires for interchangeability of plain text. (And yes, bidi is weird!) > > My main point, whose rejection baffles me to no end, is that it should. Well, I'm not expecting that I can make you feel good about the situation. ;-) But perhaps the UTC position will seem a little less baffling. 
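The {abc!} versus {!abc} behaviour described above can be reproduced with a deliberately tiny Python sketch. It is not a UBA implementation: it assumes the input contains only left-to-right letters (class L) plus unmirrored neutrals of class ON, WS or CS, and it rejects everything else. The function name `display_order` and the boolean `rtl_paragraph` are invented for this illustration. Under those assumptions, the neutral-resolution and reordering rules collapse to a simple result: in a RTL paragraph the stretch from the first to the last L character keeps its order, trailing neutrals move to the left edge, and leading neutrals move to the right edge.

```python
import unicodedata

# Neutral-like classes this toy accepts; with no digits in the text,
# CS characters such as "," end up behaving like other neutrals.
NEUTRAL_CLASSES = {"ON", "WS", "CS"}


def display_order(text: str, rtl_paragraph: bool) -> str:
    """Return the left-to-right visual order of a very restricted string.

    Hypothetical helper: handles only class-L letters and unmirrored
    neutrals, which is just enough for the "abc!" example.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class != "L" and bidi_class not in NEUTRAL_CLASSES:
            raise ValueError(f"unsupported bidi class {bidi_class!r}")
        if unicodedata.mirrored(ch):
            raise ValueError("mirrored characters are not handled")

    if not rtl_paragraph:
        # Everything resolves to the paragraph level; order is unchanged.
        return text

    strong = [i for i, ch in enumerate(text)
              if unicodedata.bidirectional(ch) == "L"]
    if not strong:
        # Only neutrals: they all take the RTL paragraph direction.
        return text[::-1]

    first, last = strong[0], strong[-1]
    leading = text[:first]
    core = text[first:last + 1]   # neutrals between L and L resolve to L
    trailing = text[last + 1:]
    # Edge neutrals take the paragraph direction, so they swap sides.
    return trailing[::-1] + core + leading[::-1]


if __name__ == "__main__":
    print(display_order("abc!", rtl_paragraph=False))
    print(display_order("abc!", rtl_paragraph=True))
```

The same stored string yields two different visual orders, which is the distinction being drawn here between interchange of the text and prediction of its display.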
--Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jul 20 00:04:48 2018 From: unicode at unicode.org (Janusz S. Bień via Unicode) Date: Fri, 20 Jul 2018 07:04:48 +0200 Subject: Variation Sequences (and L2-11/059) In-Reply-To: <22913690.37041.1532018862390.JavaMail.defaultUser@defaultHost> (William J. G. Overington's message of "Thu, 19 Jul 2018 17:47:42 +0100 (BST)") References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> <86efg16vsv.fsf@mimuw.edu.pl> <22913690.37041.1532018862390.JavaMail.defaultUser@defaultHost> Message-ID: <86in5aa45r.fsf@mimuw.edu.pl> On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10009 at btinternet.com writes: > Janusz S. Bien wrote: > >> You seem to assume that my concern is only rendering. > > Well my thinking is that what you are wanting is a way to accurately > transcribe documents and maybe printed books from Old Polish into a > Unicode-based electronic format so that the information can be more > readily studied, while retaining glyph information that is not > presently representable using Unicode characters. That's right. As long as we have no corpus tools able to handle variation sequences, both variation sequences and your proposal can be considered just a form of transcription, and your proposal may perhaps have a little advantage. However, if somebody has the time and/or money to implement new corpus software, it makes more sense in my opinion to implement standard variation sequences. Of course, sticking to the standard makes sense only if the standard is reasonable. In my opinion Unicode was designed with only one application in mind: some text is input on the keyboard and has to be rendered after some processing. 
However, due to mass digitization we quite often have the reverse situation: we have scans with graphical objects which might be difficult to identify, we have to analyse the text somehow, and identifying the Unicode characters is the final part of the research. To be more specific, I will quote my response to David Perry on the MUFI list: On Fri, Jul 20 2018 at 6:54 +0200, jsbien at mimuw.edu.pl writes: > On Wed, Jul 18 2018 at 13:33 -0700, [...] writes: [...] >> If you are working to digitize the Polish dictionary you mentioned, >> the first step would be to determine whether there is any difference >> in meaning between the two versions of the section sign. If not, just >> encode them all with U+00A7. > > I beg to disagree. > > The difference should be encoded in some way (at the moment I plan to > use a simple transcription like ?? for SECTION SIGN mirrored), then their > occurrences analysed with some corpus tools (concordances etc.) and > finally the opinion formulated about the function of the distinction or > the lack of it. On the other hand, I was just surprised by the information from David Perry, who said on the MUFI list: > Note, however, that most applications check whether VSs have been > registered for the script in use and, if not, they will not display > the variants even if a font maker has put them in. (I tried > . . . ) If the consortium is reluctant to register new sequences and the software strictly adheres to the standard, then there will be a problem. > I found the following. > > https://en.wikipedia.org/wiki/Old_Polish_language Thank you for your interest in the Polish language. I will answer the rest of your post a little later. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Jul 20 02:21:33 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Fri, 20 Jul 2018 07:21:33 +0000 Subject: Unicode 11 Georgian uppercase vs. 
fonts In-Reply-To: References: Message-ID: IMO, the correct answer is 2, except that "all common fonts" is more sweeping than necessary: it's sufficient to have fonts used for fallback in platforms and browsers, and the related fallback logic, to get updated. Of course, that takes some time, and it's not even two months since Unicode 11 was released. The Georgian community understood that it would take time to get implementations in place, and that they would need to take measures to smooth over that transition -- which can include having Web sites for Georgian businesses and institutions using fonts to match the requirements of the content. Peter From: Unicore On Behalf Of Markus Scherer via Unicore Sent: Wednesday, July 18, 2018 3:05 PM To: unicore UnicoRe Discussion Cc: mark Subject: Unicode 11 Georgian uppercase vs. fonts Dear fellow Unicoders, We've run into some significant problems with the Georgian capital letters added in Unicode 11. If you have run into them yourselves, or have feedback on our brainstormed solutions below, we'd love to hear your thoughts. Here's the problem. The vast majority of Georgian fonts do not yet have the new uppercase characters. So when any system uses case mapping to uppercase text (e.g. browsers interpreting CSS's text-transform: capitalize), then the users of Georgian will see boxes ("tofu") if the font they are using does not have the glyphs. For example, a program constructs a web page with buttons. It uses a CSS style to uppercase text in buttons, as a house style. Unless the user has a very up-to-date font, they see tofu (boxes). If a server does backend rendering, its font has to be very up-to-date. We also saw this problem in a program that was doing titlecasing, but on the first character it used the uppercase mappings rather than titlecase mappings. Not the right thing to do, of course, but code that accidentally works (most of the time) doesn't get fixed if nobody reports a bug about it. 
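To make the problem described above concrete, here is a small Python sketch. It relies only on built-in string methods; the helper name `upper_keeping_mkhedruli` is invented for this illustration and is not part of ICU, CLDR or any browser. On an interpreter whose Unicode data is version 11.0 or later, `str.upper()` maps Mkhedruli letters such as U+10D0 to the new Mtavruli letters in the Georgian Extended block (U+1C90..U+1CBF), while the titlecase mapping leaves them unchanged. The helper shows, under that assumption, the kind of call-site change that avoids emitting the new capitals.

```python
# Georgian Extended block (Mtavruli capitals), added in Unicode 11.
MTAVRULI = range(0x1C90, 0x1CC0)


def upper_keeping_mkhedruli(text: str) -> str:
    """Uppercase text, but leave Georgian Mkhedruli letters as they are.

    Hypothetical workaround sketch: any character whose uppercase form
    would land in the Mtavruli block is copied through unchanged.
    """
    out = []
    for ch in text:
        up = ch.upper()
        if len(up) == 1 and ord(up) in MTAVRULI and ord(ch) not in MTAVRULI:
            out.append(ch)   # keep the lowercase-only Georgian letter
        else:
            out.append(up)   # ordinary uppercasing for everything else
    return "".join(out)


if __name__ == "__main__":
    an = "\u10d0"  # GEORGIAN LETTER AN (Mkhedruli)
    print("upper():", f"U+{ord(an.upper()):04X}")
    print("title():", f"U+{ord(an.title()):04X}")
    print("workaround:", upper_keeping_mkhedruli("ok " + an))
```

A call site that uppercases button labels could route its text through such a helper until fonts catch up; as noted in the message, the cost is that the change eventually has to be retracted.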
All of these will result in bad bugs in the UI, in software that formerly worked fine. We brainstormed some options to fix this:
1. Get all call sites to change their code to not uppercase Georgian (and fix titlecasing to use the titlecase mappings, not the uppercase mappings). Since we have no control over call sites and release cycles of affected software, this would not help Georgian users for a long time, if ever. We'd eventually want to retract these changes, creating even more work.
2. Change all common fonts with Georgian characters to add the U11.0 ones. This should eventually happen but would probably take a couple of years at least, which does not help users in the short term.
3. Hack font CMAPs to just map the new characters to the glyphs of the old ones. Works but only when a programmer can control the fonts used, such as with server-side rendering or downloadable fonts.
4. Remove the uppercase mappings for Georgian, until the fonts catch up.
   * Would at least have to be done in all browsers, otherwise web apps will still break for Georgian.
   * A broader alternative is to do it in ICU. Because that is used by the majority of the browser implementations, it would solve the short-term problem for the browsers -- and many other programs. Drawback: Non-conformant, and uppercasing will be inconsistent depending on who has which variant of ICU (with vs. without hack, on top of: with Unicode 11 vs. before Unicode 11).
   * One precedent is that in CLDR we deliberately hold back from using new currency characters until the font support is sufficiently widespread. (Wishing we'd held back the uppercase mappings in Unicode 11.0 too!)
Mark & Markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jul 20 02:25:23 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 20 Jul 2018 09:25:23 +0200 Subject: Variation Sequences (and L2-11/059) In-Reply-To: <22913690.37041.1532018862390.JavaMail.defaultUser@defaultHost> (William J. G. Overington's message of "Thu, 19 Jul 2018 17:47:42 +0100 (BST)") References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> <86efg16vsv.fsf@mimuw.edu.pl> <22913690.37041.1532018862390.JavaMail.defaultUser@defaultHost> Message-ID: <864lgu9xng.fsf@mimuw.edu.pl> On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10009 at btinternet.com writes: > Janusz S. Bien wrote: > >> You seem to assume that my concern is only rendering. > > Well my thinking is that what you are wanting is a way to accurately > transcribe documents and maybe printed books from Old Polish into a > Unicode-based electronic format so that the information can be more > readily studied, while retaining glyph information that is not > presently representable using Unicode characters. > > I found the following. > > https://en.wikipedia.org/wiki/Old_Polish_language > > WJGO >> So you could if you wish try to make your own font > > JSB >Actually I tried: > > JSB > https://bitbucket.org/jsbien/parkosz-font/ > > Thank you for the link to the font. I have studied the font in the FontCreator program (version 8). > > I remember that I produced an OpenType font using Variation Selectors > and OpenType Glyph Substitution back in April 2017. I wrote about it > and provided a link to the font and a link to a typecase document. > > https://forum.high-logic.com/viewtopic.php?f=10&t=7033 > > Although that font is about chess, I am thinking that that is the sort > of font that is needed for what you are wanting to do. This could use > variation selectors or could use circled digits as desired. 
> > I am a researcher and I am looking for a worthwhile project related to > typography in which to participate from time to time - no money > charged, no money to pay - and I am interested in printed books of the > incunabula period and the early sixteenth century. > > I do not know any Polish, but I do not need to be involved in choosing > which glyphs are needed, so my not knowing any Polish would not seem > to be a problem. > > William Overington > > Thursday 19 July 2018 > -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Jul 20 04:45:17 2018 From: unicode at unicode.org (Shai Berger via Unicode) Date: Fri, 20 Jul 2018 12:45:17 +0300 Subject: UAX #9: applicability of higher-level protocols to bidi plaintext In-Reply-To: <72a06159-60b0-d264-f696-7a8c29300244@att.net> References: <20180710013328.17b7807f.shai@platonix.com> <20180710194059.24f0a54a.shai@platonix.com> <20180713085725.75923890@JRWUBU2> <831sc7ee90.fsf@gnu.org> <20180714130911.5d92dd35.shai@platonix.com> <83h8l2axdl.fsf@gnu.org> <20180717015112.162cd939.shai@platonix.com> <20180718102606.37c75ca9.shai@platonix.com> <86c57a6e-4377-2493-7c3e-87acb5de311f@ix.netcom.com> <20180719103818.0087cdb5.shai@platonix.com> <72a06159-60b0-d264-f696-7a8c29300244@att.net> Message-ID: <20180720124517.3ed7afbd.shai@platonix.com> Hi Ken (and all), Thanks for your time and patience with this. On Thu, 19 Jul 2018 18:10:49 -0700 Ken Whistler via Unicode wrote: > On 7/19/2018 12:38 AM, Shai Berger via Unicode wrote: > > If I cannot trust that > > people I communicate with make the same choices I make, plain text > > cannot be used. > > Here is a counterexample [a table rendered in plain text, which is > only truly legible using a fixed-width font]. > > It isn't that "plain text cannot be used" to convey this content. The > content is certainly "legible" in the minimal sense required by the > Unicode Standard, and it is interchangeable without data corruption. 
> The problem is that for optimal display and interpretation as > intended, I also need to convey (and/or have the reader guess) the > higher-level protocol requirement that this particular plain text > needs to be displayed with a monowidth font. > If I understand correctly, you are rejecting my claim that directionality is an issue of content, and claiming that, just like the crumbling-down of your table, it is an issue of display. But that argument is clearly disproved by the mere presence of the directionality-setting characters (RLM, LRE, etc) in the Unicode character set; in other words, your example would be convincing if Unicode included characters like "start table row" and "close table cell", AND there was an annex saying that your lines (for whatever reason) are to be treated as table rows unless a higher-level-protocol said otherwise. I believe this is not the case. > > If the Unicode standard does not impose a > > universal default, it does not define interchangeable plain text. > > And that is simply not the case. If your text is ( L, > ON>), that will display as {abc!} in a LTR paragraph directional > ON>context and as {!abc} in a RTL paragraph directional context. > [...] if plain text doesn't forcefully carry with it and > require how it must be displayed, well, then it isn't really > interchangeable. > > But that isn't what the Unicode Standard means by plain text. And > isn't what it requires for interchangeability of plain text. If I understood your argument correctly, it amounts to a claim that Unicode defines plain text as a component in a data format, but not to be used as a full document. If that is correct, then there is much to fix -- I think that quite a lot of existing technology assumes the opposite (e.g. the use of "Content-Type: text/plain; charset=UTF-8" in MIME should be strongly discouraged, if the people who designed Unicode and UTF-8 think it is not appropriate for full documents). If I misunderstood, please correct me. 
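For readers who want to check the character classes under discussion, here is a minimal Python sketch using only the standard `unicodedata` module. It prints the Bidi_Class of each character in Ken's `abc!` example and of the directional characters mentioned above (RLM, LRE and their counterparts). The helper name `bidi_classes` is made up for illustration, and the sketch only looks up properties; it does not implement the UAX #9 algorithm.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class value of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Ken's example: three strong left-to-right letters followed by one neutral.
print(bidi_classes("abc!"))

# Directional characters that are themselves encoded as code points.
for cp in (0x200E, 0x200F, 0x202A, 0x202B):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.bidirectional(ch)}")
```

The neutral "!" is the character whose position depends on the paragraph direction, which is why the same four code points can come out as {abc!} or {!abc}.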
> > > > My main point, whose rejection baffles me to no end, is that it > > should. > > Well, I'm not expecting that I can make you feel good about the > situation. ;-) But perhaps the UTC position will seem a little less > baffling. As I hope I've shown above, there's plenty of reason for bafflement. The UTC defines code points to encode directionality, but then refuses to treat directionality as content when it comes to paragraph directionality; it defines a higher-level-protocol as an agreement, and then turns around and says the word "agreement" actually means "decision". I can guess reasons for why the things are the way they are, but not justifications. I stay baffled. Thanks, Shai. From unicode at unicode.org Fri Jul 20 05:41:01 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 20 Jul 2018 12:41:01 +0200 Subject: old Polish and Unicode (was: Variation Sequences (and L2-11/059)) In-Reply-To: <22913690.37041.1532018862390.JavaMail.defaultUser@defaultHost> (William J. G. Overington's message of "Thu, 19 Jul 2018 17:47:42 +0100 (BST)") References: <868t6b4vkh.fsf@mimuw.edu.pl> <580e30bb-7fad-eee5-5077-8c6d121be7da@ix.netcom.com> <86601e4fmr.fsf@mimuw.edu.pl> <86efg16vsv.fsf@mimuw.edu.pl> <22913690.37041.1532018862390.JavaMail.defaultUser@defaultHost> Message-ID: <86k1pq196q.fsf_-_@mimuw.edu.pl> I apologize for sending by mistake the previous post with no new content. On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10009 at btinternet.com writes: [...] > I found the following. > > https://en.wikipedia.org/wiki/Old_Polish_language Thanks again for your interest in Polish language. There is also https://en.wikipedia.org/wiki/History_of_Polish https://en.wikipedia.org/wiki/Middle_Polish_language https://en.wikipedia.org/wiki/Polish_orthography https://en.wikipedia.org/wiki/History_of_Polish_orthography To make a long story short, this is just a mess. 
Looking for a good link to recommend, I just found

https://culture.pl/en/article/a-foreigners-guide-to-the-polish-alphabet

which seems worth looking at (but the multimedia version doesn't work for me). I used to recommend the paper

http://wbl.klf.uw.edu.pl/45/

which unfortunately seems to be no longer available on the Internet.

>
> WJGO >> So you could if you wish try to make your own font
>
> JSB >Actually I tried:
>
> JSB > https://bitbucket.org/jsbien/parkosz-font/
>
> Thank you for the link to the font. I have studied the font in the FontCreator program (version 8).

Please revisit the site, I just added some links and comments. This project is now orphaned.

>
> I remember that I produced an OpenType font using Variation Selectors
> and OpenType Glyph Substitution back in April 2017. I wrote about it
> and provided a link to the font and a link to a typecase document.
>
> https://forum.high-logic.com/viewtopic.php?f=10&t=7033
>
> Although that font is about chess, I am thinking that that is the sort
> of font that is needed for what you are wanting to do. This could use
> variation selectors or could use circled digits as desired.

Thanks for the link. I think I will do some tests with XeLaTeX.

>
> I am a researcher and I am looking for a worthwhile project related to
> typography in which to participate from time to time - no money
> charged, no money to pay - and I am interested in printed books of the
> incunabula period and the early sixteenth century.
>
> I do not know any Polish, but I do not need to be involved in choosing
> which glyphs are needed, so my not knowing any Polish would not seem
> to be a problem.

Please feel free to take over the font for Parkosz's treatise, if you wish to.

I think another interesting challenge is "Nowy Karakter Polski", a 16th century treatise comparing several proposals of Polish spelling, which uses various strange characters. You can find the scan in various places and in various formats, e.g.
https://books.google.pl/books?id=Z3ojAAAAMAAJ
http://www.dbc.wroc.pl/publication/4239

The treatise is one of the important sources used by the dictionary of the 16th century Polish language:

http://spxvi.edu.pl/

The only English language presentation of the dictionary seems to be

Luto-Kamińska, A. (2017). Several words on the dictionary of the 16th century Polish language.

unfortunately behind a paywall:

http://www.dbpia.co.kr/Journal/ArticleList/VOIS00297995#

The history of the dictionary is long and sad. The work started in 1949 (!) and after the initial enthusiasm and generous funding the team had to struggle with various difficulties; as a consequence the dictionary is still unfinished, but the work continues, although rather slowly.

In my unpublished presentation

http://bc.klf.uw.edu.pl/179/

I show how the editors managed quoting "Nowy Karakter" (slides 26-35). It looks like in the days of hot type the strange letters were written in by hand, and there was a regression when the dictionary started to be typeset on a computer.

In my presentation I made some suggestions how to use Unicode for "Nowy Karakter" (slides 40-69). Unfortunately the dictionary editors were not interested in the proposal (they had at the time much more important problems).

Not long ago the team received a long-awaited grant for computerizing the work on the dictionary, in particular for creating a corpus of 16th century texts. It looks like the corpus was prepared rather in a hurry and there was no time or money to develop a faithful rendering of "Nowy Karakter". The work exists in the corpus in two forms:

PDF: http://rcin.org.pl/publication/82568
HTML: http://spxvi.edu.pl/korpus/teksty/JanNKar/

I must say that for a typical user of the dictionary the solution applied is probably a good one.
The spelling has been modernized, but the occurrences of strange characters have been marked with color in PDF, and in HTML additionally with some information displayed when you hover over the appropriate fragment of the text. This solution is however not applicable to e.g. quotations in a research paper when color is for some reason not allowed.

So encoding "Nowy Karakter Polski" in Unicode and providing a font for it is still in my opinion an interesting open problem.

Cf. also the thread

http://www.unicode.org/mail-arch/unicode-ml/y2010-m04/0024.html

BTW, I was definitely too optimistic...

Best regards

Janusz

--
,
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

From unicode at unicode.org Fri Jul 20 18:17:03 2018
From: unicode at unicode.org (Norbert Lindenberg via Unicode)
Date: Fri, 20 Jul 2018 16:17:03 -0700
Subject: Consonant shifters and ZWNJ in Khmer
Message-ID: <2932F8ED-D077-42A3-B9FC-94C80F79BECA@lindenbergsoftware.com>

The section on consonant shifters in the Khmer section of the Unicode standard (page 647 of Unicode 11 [1]) isn't entirely clear on where the zero width non-joiner should be placed to prevent a consonant shifter that's followed by an above-base vowel from being changed to a below-base glyph.

First, it says "U+200C zero width non-joiner should be inserted before the consonant shifter" to prevent the change. Then it continues "in such cases, U+200C zero width non-joiner is inserted before the vowel sign", which could be interpreted as "after the consonant shifter". Finally, the examples show ZWNJ inserted before the consonant shifter.

The OpenType Khmer shaping description [2], on the other hand, expects ZWNJ to be inserted between the consonant shifter (here called RegShift) and the above-base vowel.

Questions to the people here who have dealt with Khmer: How is this handled in real life?
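To make the two competing placements concrete, here is a small Python sketch that builds both code point sequences. The particular consonant, shifter and vowel (U+1794, U+17CA, U+17B7) are picked only for illustration and are not taken from the standard's examples; the function names are likewise made up. Which of the two orders a given shaping engine honours is exactly the open question, so the sketch makes no claim about rendering.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def describe(seq):
    """List each character of seq as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]

def zwnj_before_shifter(consonant, shifter, vowel):
    """Order shown in the Unicode chapter's examples: ZWNJ precedes the shifter."""
    return consonant + ZWNJ + shifter + vowel

def zwnj_after_shifter(consonant, shifter, vowel):
    """Order expected by the OpenType description: ZWNJ between shifter and vowel."""
    return consonant + shifter + ZWNJ + vowel

# Illustrative characters only (an assumption made for this sketch).
BA, TRIISAP, VOWEL_I = "\u1794", "\u17CA", "\u17B7"

for build in (zwnj_before_shifter, zwnj_after_shifter):
    print(build.__name__)
    for line in describe(build(BA, TRIISAP, VOWEL_I)):
        print("  ", line)
```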
Thanks, Norbert [1] https://www.unicode.org/versions/Unicode11.0.0/ch16.pdf [2] https://docs.microsoft.com/en-us/typography/script-development/khmer From unicode at unicode.org Fri Jul 20 20:01:31 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 21 Jul 2018 02:01:31 +0100 Subject: Tamil Brahmi Short Mid Vowels Message-ID: <20180721020131.4b22887b@JRWUBU2> A problem has been spotted with the rendering of Tamil Brahmi vowels - in particular the sequence does not conform to the grammar of the Universal Shaping Engine (USE); a dotted circle may be inserted between the vowel and the pulli. When considering font-level remedies, I realised that there may be a problem with a following consonant - is a correct encoding of what may be transliterated as _k?ta_? The nearest to a convincing justification I can find for it to require U+200C ZWNJ after the virama is the text in TUS Section 12.1 for *Explicit Virama*, but that merely says that ZWNJ is required to produce explicit virama rather than a _conjunct_. As I understand it, a subscript final consonant would be encoded as consonant+virama rather than virama+consonant, so there is no ambiguity in Brahmi text. (If we try to make a rule out of two conflicting mechanisms, the difference might be that one is used for viramas and the other is used for invisible stackers, though that would require changing U+10A3F KHAROSHTHI VIRAMA back to being a virama.) The problem is that a font that tries to recover the situation might interpret as having TA subscripted to the dotted circle. If ZWNJ is required for _k?ta_, what text if any in TUS requires it? Richard. 
From unicode at unicode.org Fri Jul 20 21:25:51 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Sat, 21 Jul 2018 07:55:51 +0530 Subject: Tamil Brahmi Short Mid Vowels In-Reply-To: <20180721020131.4b22887b@JRWUBU2> References: <20180721020131.4b22887b@JRWUBU2> Message-ID: This is a unique problem because this is probably the only case where the same script produces conjuncts for one language and not for another. I had asked for a separate Tamil Brahmi virama to be encoded which would obviate this problem but that was shot down. Maybe that case should be reopened? On Sat 21 Jul, 2018, 06:33 Richard Wordingham via Unicode, < unicode at unicode.org> wrote: > A problem has been spotted with the rendering of Tamil Brahmi vowels - > in particular the sequence VOWEL SIGN O, U+11046 BRAHMI VIRAMA> does not conform to the grammar > of the Universal Shaping Engine (USE); a dotted circle may be inserted > between the vowel and the pulli. > > When considering font-level remedies, I realised that there may be a > problem with a following consonant - is U+11022 BRAHMI LETTER TA> a correct encoding of what may be > transliterated as _k?ta_? > > The nearest to a convincing justification I can find for it to require > U+200C ZWNJ after the virama is the text in TUS Section 12.1 for > *Explicit Virama*, but that merely says that ZWNJ is required to > produce explicit virama rather than a _conjunct_. As I understand > it, a subscript final consonant would be encoded as consonant+virama > rather than virama+consonant, so there is no ambiguity in Brahmi text. > (If we try to make a rule out of two conflicting mechanisms, the > difference might be that one is used for viramas and the other is used > for invisible stackers, though that would require changing U+10A3F > KHAROSHTHI VIRAMA back to being a virama.) 
The problem is that a font > that tries to recover the situation might interpret U+11044, U+25CC DOTTED CIRCLE, U+11046, U+11022> as having TA > subscripted to the dotted circle. If ZWNJ is required for _k?ta_, what > text if any in TUS requires it? > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jul 21 02:50:26 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 21 Jul 2018 08:50:26 +0100 Subject: Tamil Brahmi Short Mid Vowels In-Reply-To: References: <20180721020131.4b22887b@JRWUBU2> Message-ID: <20180721085026.6aa07876@JRWUBU2> On Sat, 21 Jul 2018 07:55:51 +0530 Shriramana Sharma via Unicode wrote: > This is a unique problem because this is probably the only case where > the same script produces conjuncts for one language and not for > another. There are and have been similar cases. Reformed (a.k.a. 'typewriter') Malayalam v. traditional Malayalam comes immediately to mind. Pre-5.0 Myanamar script was similar, with Pali stacking and Burmese mostly not, though that gives you the precedent of disunifying the invisible stacker and the vowel killer, which I've always considered a bad unification inherited from ISCII. 'Pure' Tai and Pali use stacking quite differently in the Tai Tham script, but some Tai languages use a lot of Pali-style spellings. > I had asked for a separate Tamil Brahmi virama to be encoded > which would obviate this problem but that was shot down. Maybe that > case should be reopened? Could be messy. Are you saying that people are relying on fonts being free of conjuncts? One could use a keyboard with a 'pulli' key that produced - I don't know if people do. Richard. 
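As an aid to following the code points quoted in this thread, here is a minimal Python sketch that names them and shows where a ZWNJ would go if it were required after the virama. It uses only the fragment of the sequence that survives in the archive (U+11044, U+25CC, U+11046, U+11022); the leading consonant is lost and is deliberately not guessed. The helper names are hypothetical, and nothing here says whether TUS actually requires the ZWNJ.

```python
import unicodedata

ZWNJ = 0x200C    # ZERO WIDTH NON-JOINER
VIRAMA = 0x11046 # BRAHMI VIRAMA

def names_of(code_points):
    """Return the Unicode character name for each code point."""
    return [unicodedata.name(chr(cp)) for cp in code_points]

def zwnj_after_virama(code_points):
    """Copy the sequence, inserting ZWNJ after every BRAHMI VIRAMA."""
    out = []
    for cp in code_points:
        out.append(cp)
        if cp == VIRAMA:
            out.append(ZWNJ)
    return out

# Surviving fragment of the sequence with the inserted dotted circle.
fragment = [0x11044, 0x25CC, 0x11046, 0x11022]
for cp, name in zip(fragment, names_of(fragment)):
    print(f"U+{cp:04X} {name}")

# The same fragment without the dotted circle, with ZWNJ after the virama.
print([f"U+{cp:04X}" for cp in zwnj_after_virama([0x11044, 0x11046, 0x11022])])
```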
From unicode at unicode.org Thu Jul 26 05:40:18 2018
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Thu, 26 Jul 2018 12:40:18 +0200 (CEST)
Subject: Diacritic marks in parentheses
Message-ID: <104663606.43684.1532601618204@ox.hosteurope.de>

German umlauts often occur when a noun is plural or an agent noun is female, e.g. _Arzt_ '(male) physician', _Ärzte_ 'physicians' and _Ärztin_ 'female physician'. There are several cases where a short notation for both singular and plural or, more frequently, male and female singular is desired. A number of notations are commonly encountered, e.g. (not showing number pairs) _Doktor(in)_, _Doktor/-in_, _Doktor/in_, _DoktorIn_, _Doktor_in_, _Doktor*in_.

These only[^1] work well if there is no umlaut difference, i.e. neither _Ärzt/-in_ nor _Arzt/-in_ would be appropriate. A way to show the umlaut dots are conditional would be required but is not available in plain text systems and complicated to achieve in most rich text systems. Unicode has '⸚' HYPHEN WITH DIAERESIS (U+2E1A) to offer, i.e. _Arzt⸚in_ or _Arzt/⸚in_. This is also very uncommon, but may be used in some linguistic texts.

I believe the most intuitive solution would be tiny parentheses before and after the two dots. This has no established usage as far as I am aware, so would probably not qualify for encoding in the Unicode Standard. However, if it would qualify nevertheless, should this be a new atomic diacritic mark, e.g. COMBINING PARENTHESIZED DIAERESIS ABOVE, or two characters, e.g. COMBINING OPEN PARENTHESES ABOVE and COMBINING CLOSE PARENTHESES ABOVE to be used with COMBINING DIAERESIS (U+0308)?

[^1] Yes, there are other cases where the stem changes in different ways, but that is irrelevant here.
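To illustrate what "conditional umlaut dots" means at the code point level, here is a short Python sketch using the standard `unicodedata` module. It decomposes the female form (A-umlaut followed by "rztin") to show that the umlaut is a base letter plus COMBINING DIAERESIS (U+0308), and that dropping the mark gives the stem shared with _Arzt_. The helper names are invented for this sketch; it shows the existing encoding only and does not propose how a parenthesized diaeresis should be represented.

```python
import unicodedata

def code_points(word):
    """Canonically decompose word (NFD) and list its code points."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", word)]

def strip_marks(word):
    """Remove all combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

female = "\u00C4rztin"  # A WITH DIAERESIS + "rztin"

print(code_points(female))       # base letter A followed by U+0308, then the rest
print(strip_marks(female))       # the form without the umlaut dots
print(unicodedata.name("\u2E1A"))  # the existing character mentioned above
```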
From unicode at unicode.org Thu Jul 26 06:12:43 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Thu, 26 Jul 2018 13:12:43 +0200 (CEST)
Subject: Diacritic marks in parentheses
In-Reply-To: <104663606.43684.1532601618204@ox.hosteurope.de>
References: <104663606.43684.1532601618204@ox.hosteurope.de>
Message-ID: <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18>

We do have this already, in combining marks extended:

@@ 1AB0 Combining Diacritical Marks Extended 1AFF
@ Used for German dialectology
[...]
1ABB COMBINING PARENTHESES ABOVE
    * intended to surround a diacritic above
1ABC COMBINING DOUBLE PARENTHESES ABOVE
1ABD COMBINING PARENTHESES BELOW
    * intended to surround a diacritic below
1ABE COMBINING PARENTHESES OVERLAY
    * intended to surround a base letter
    * exact placement is font dependent

Best regards,

Marcel

On 26/07/18 12:49 Christoph Päper via Unicode wrote:
>
> German umlauts often occur when a noun is plural or an agent noun is female, e.g. _Arzt_ '(male) physician', _Ärzte_ 'physicians' and _Ärztin_ 'female physician'. There are several cases where a short notation for both singular and plural or, more frequently, male and female singular is desired. A number of notations are commonly encountered, e.g. (not showing number pairs) _Doktor(in)_, _Doktor/-in_, _Doktor/in_, _DoktorIn_, _Doktor_in_, _Doktor*in_.
>
> These only[^1] work well if there is no umlaut difference, i.e. neither _Ärzt/-in_ nor _Arzt/-in_ would be appropriate. A way to show the umlaut dots are conditional would be required but is not available in plain text systems and complicated to achieve in most rich text systems. Unicode has '⸚' HYPHEN WITH DIAERESIS (U+2E1A) to offer, i.e. _Arzt⸚in_ or _Arzt/⸚in_. This is also very uncommon, but may be used in some linguistic texts.
>
> I believe the most intuitive solution would be tiny parentheses before and after the two dots.
This has no established usage as far as I am aware, so would probably not qualify for encoding in the Unicode Standard. However, if it would qualify nevertheless, should this be a new atomic diacritic mark, e.g. COMBINING PARENTHESIZED DIAERESIS ABOVE, or two characters, e.g. COMBINING OPEN PARENTHESES ABOVE and COMBINING CLOSE PARENTHESES ABOVE to be used with COMBINING DIAERESIS (U+0308)?
>
> [^1] Yes, there are other cases where the stem changes in different ways, but that is irrelevant here.
>
>

From unicode at unicode.org Thu Jul 26 11:27:10 2018
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Thu, 26 Jul 2018 09:27:10 -0700
Subject: Diacritic marks in parentheses
In-Reply-To: <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18>
References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18>
Message-ID:

I would not expect for Ä+combining () above = Ä᪻ to look right except with specialized fonts.
http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB&s=&uv=0

Even if it worked widely, I think it would be confusing.
I think you are best off writing Arzt/Ärztin.

Viele Grüße,
markus

From unicode at unicode.org Thu Jul 26 14:27:08 2018
From: unicode at unicode.org (Alexey Ostrovsky via Unicode)
Date: Thu, 26 Jul 2018 23:27:08 +0400
Subject: Unicode 11 Georgian uppercase vs. fonts
In-Reply-To:
References:
Message-ID:

Hi there!

"The Georgian community understood" - sorry, but here "the Georgian community" means a small group of Georgian font designers who promote upper-case for effectively caseless Georgian. Many Georgian scientists working with script and language are not fans of "uppercase" font styles. Option #2, as well as any other forced upper-case option for Georgian, is an error (it can be compared with a forced black-letter option for, say, Cyrillic through a CSS attribute).
Well, doesn't matter, what about options. Actually, the problem must be split in two issues: a) Whether to capitalize a Georgian text in the same case when we capitalize a Latin one. b) How to handle cases when we transform the text and there is no capital characters for Georgian. Before answering, we must mention the caseless nature of the Georgian script. It "capital" letters do not exists as letters, they are letter variants used exactly the same way as the Latin title case. Therefore, Georgian "uppercase" = Georgian title case = Georgian "capital letters" in Unicode 11, it is far from Latin uppercase by its behavior and its features. Here are some examples for Georgian (I use English, but semantics and casing mean to reflect Georgian) to understand where we are: -- "mr. john smith" is unconditionally OK; -- "MR JOHN SMITH" or "mr JOHN SMITH" can be OK or wrong depending on situation, usually it is OK; -- "Mr John Smith" is unconditionally wrong (except some marginal cases, similar to English "mR jOHN sMITH"). Therefore, easiest answer is (b): leave it "minuscule", as it is an excellent and fully readable default solution. An answer to (a) is not that easy, as it depends on designer's mood etc. I would say the designer has to have an option to control it (say, through "important" CSS option), and the default behavior must to be to ignore uppercase transformations for Georgian. (If one accepts it by default, there are cases like [mr john smith]). Based on above, the answers to the initial questions are: *1) Get all call sites to change their code to not uppercase Georgian (and fix titlecasing to use the titlecase mappings, not the uppercase mappings). * This requires John Smith to have a special knowledge How to deal with Georgian. But, anyway, it is a good behavior. *> 2) Change all common fonts with Georgian characters to add the U11.0 ones. * This does not address the issues like 'John Smith" and appropriate usage of Georgian fonts. 
Capitalization rules can vary and some options may be inappropriate for a caseless script, as Georgian. *> 3) Hack font CMAPs to just map the new characters to the glyphs of the old ones. * This is the best behavior, but the solution is not that good. The best solution would be a special treatment of Georgian uppercase in CSS and on OS level (I know that is bad, but Unicode 11 is already released and it was already approved). Sincerely, Alex. P.S. Adding uppercase for Georgian was a mistake (in my opinion, of course), as it violates the Unicode principle to encode characters. On Fri, Jul 20, 2018 at 11:21 AM, Peter Constable via Unicode < unicode at unicode.org> wrote: > IMO, the correct answer is 2, except that ?all common fonts? is more > sweeping that necessary: it?s sufficient to have fonts used for fallback in > platforms and browsers, and the related fallback logic, to get updated. Of > course, that takes some time, and it?s not even two months since Unicode 11 > was released. The Georgian community understood that it would take time to > get implementations in place, and that they would need to take measures to > smooth over that transition ? which can include having Web sites for > Georgian businesses and institutions using fonts to match the requirements > of the content. > > > > > > Peter > > > > *From:* Unicore *On Behalf Of *Markus > Scherer via Unicore > *Sent:* Wednesday, July 18, 2018 3:05 PM > *To:* unicore UnicoRe Discussion > *Cc:* mark > *Subject:* Unicode 11 Georgian uppercase vs. fonts > > > > Dear fellow Unicoders, > > > > We?ve run into some significant problems with the Georgian capital letters > added in Unicode 11. If you have run into them yourselves, or have feedback > on our brainstormed solutions below, we?d love to hear your thoughts. > > > > Here's the problem. The vast majority of Georgian fonts do not yet have > the new uppercase characters. So when any system uses case mapping to > uppercase text (e.g. 
browsers interpreting CSS?s text-transform: > capitalize), then the users of Georgian will see boxes (?tofu?) if the font > they are using does not have the glyphs. > > > > For example, a program constructs a web page with buttons. It uses a CSS > style to uppercase text in buttons, as a house style. Unless the user has a > very up-to-date font, they see tofu (boxes). If a server does backend > rendering, its font has to be very up-to-date. We also saw this problem in > a program that was doing titlecasing, but on the first character it used > the *uppercase* mappings rather than *titlecase* mappings. Not the right > thing to do, of course, but code that accidentally works (most of the time) > doesn't get fixed if nobody reports a bug about it. > > > > All of these will result in bad bugs in the UI, in software that formerly > worked fine. > > > > We brainstormed some options to fix this: > > > > 1. Get all call sites to change their code to *not* uppercase Georgian > (and fix titlecasing to use the titlecase mappings, not the uppercase > mappings). Since we have no control over call sites and release cycles of > affected software, this would not help Georgian users for a long time, if > ever. We?d eventually want to retract these changes, creating even more > work. > 2. Change all common fonts with Georgian characters to add the U11.0 > ones. This should eventually happen but would probably take a couple of > years at least, which does not help users in the short term. > 3. Hack font CMAPs to just map the new characters to the glyphs of the > old ones. Works but only when a programmer can control the fonts used, such > as with server-side rendering or downloadable fonts. > 4. Remove the uppercase mappings for Georgian, until the fonts catch > up. > > > 1. Would at least have to be done in all browsers, otherwise web apps > will still break for Georgian. > 2. A broader alternative is to do it in ICU. 
Because that is used > by the majority of the browser implementations, it would solve the > short-term problem for the browsers ? and many other programs. Drawback: > Non-conformant, and uppercasing will be inconsistent depending on who has > which variant of ICU (with vs. without hack, on top of: with Unicode 11 vs. > before Unicode 11). > > > 1. One precedent is that in CLDR we deliberately hold back from using > new currency characters until the font support is sufficiently widespread. > (Wishing we'd held back the uppercase mappings in Unicode 11.0 too!) > > > > Mark & Markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jul 26 14:46:52 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 26 Jul 2018 20:46:52 +0100 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: References:

Message-ID: <20180726204652.39387370@JRWUBU2> On Thu, 26 Jul 2018 23:27:08 +0400 Alexey Ostrovsky via Unicode wrote: > Before answering, we must mention the caseless nature of the Georgian > script. It "capital" letters do not exists as letters, they are letter > variants used exactly the same way as the Latin title case. Therefore, > Georgian "uppercase" = Georgian title case = Georgian "capital > letters" in Unicode 11, it is far from Latin uppercase by its > behavior and its features. Here are some examples for Georgian (I use > English, but semantics and casing mean to reflect Georgian) to > understand where we are: -- "mr. john smith" is unconditionally OK; > -- "MR JOHN SMITH" or "mr JOHN SMITH" can be OK or wrong depending on > situation, usually it is OK; > -- "Mr John Smith" is unconditionally wrong (except some marginal > cases, similar to English "mR jOHN sMITH"). > Therefore, easiest answer is (b): leave it "minuscule", as it is an > excellent and fully readable default solution. An answer to (a) is > not that easy, as it depends on designer's mood etc. I would say the > designer has to have an option to control it (say, through > "important" CSS option), and the default behavior must to be to > ignore uppercase transformations for Georgian. (If one accepts it by > default, there are cases like [mr class="x">john smith]). >From what you say, the new letter characters don't sound like title case letters. Title case is what one uses when words normally start with a capital and continue in small letters, but some letters act like ligatures of two letters and the appropriate form for an initial letter is like a ligature of a capital letter and a small letter. Richard. From unicode at unicode.org Thu Jul 26 15:11:40 2018 From: unicode at unicode.org (Alexey Ostrovsky via Unicode) Date: Fri, 27 Jul 2018 00:11:40 +0400 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: <20180726204652.39387370@JRWUBU2> References:

<20180726204652.39387370@JRWUBU2> Message-ID: > > From what you say, the new letter characters don't sound like title > case letters. Title case is what one uses when words normally start with > a capital and continue in small letters, but some letters act like > ligatures of two letters and the appropriate form for an initial letter > is like a ligature of a capital letter and a small letter. Yes, you are right, this is my mistake. Sorry, I didn?t mean to confuse with a typography terminology here (which is not excuse, of course). What I mean is a kind of optional text transformation like small-caps or uppercase, optionally(!!) used in the Latin script to render a title (just for distinguishing from the rest of the text). The situation with the Georgian script differs from the English or Cyrillic essentially, mixing cases is not allowed (as far as we can state anything for a writing system, which is usually very flexible). > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jul 26 15:12:46 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Thu, 26 Jul 2018 21:12:46 +0100 (BST) Subject: Diacritic marks in parentheses In-Reply-To: <16032972.34295.1532618256304.JavaMail.defaultUser@defaultHost> References: <3498149.33619.1532617791939.JavaMail.root@webmail13.bt.ext.cpcloud.co.uk> <16032972.34295.1532618256304.JavaMail.defaultUser@defaultHost> Message-ID: <22997716.49286.1532635966994.JavaMail.defaultUser@defaultHost> Hi Markus > I would not expect for ?+combining () above = ?? to look right except with specialized fonts. I had a go making a font this afternoon (that is, afternoon United Kingdom time) and I am pleased with the result. 
As well as with a capital A the two combining characters also work well together with a capital O and with a capital U, though I do not know any German words where a test could be done - I only know a very very small amount of German, quite literally. Here is a forwarding of a post that I sent around to a few people, including an image. You are welcome to download the font and use it if you so choose. Best regards, William Overington Thursday 26 July 2018 ---- Hi I have enjoyed this thread and I have now made a font that implements the original request. http://www.users.globalnet.co.uk/~ngo/questtext20180726.otf Please find attached a graphic produced in the PagePlus X5 desktop publishing package using the font. Here is the sequence of characters used to produce the graphic. A???rzt/in That is nine characters, with the plain A having two combining characters combining with it. I have enjoyed making the font. Best regards, William Overington Thursday 26 July 2018 -------------- next part -------------- A non-text attachment was scrubbed... Name: questtext20180726test.png Type: image/png Size: 6278 bytes Desc: not available URL: From unicode at unicode.org Thu Jul 26 15:57:53 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 26 Jul 2018 13:57:53 -0700 Subject: Diacritic marks in parentheses In-Reply-To: References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> Message-ID: <0b3f7895-1428-0363-4ce5-da82cce00032@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Jul 26 16:15:36 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 26 Jul 2018 14:15:36 -0700 Subject: Diacritic marks in parentheses In-Reply-To: <0b3f7895-1428-0363-4ce5-da82cce00032@ix.netcom.com> References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> <0b3f7895-1428-0363-4ce5-da82cce00032@ix.netcom.com> Message-ID: ?? Mark On Thu, Jul 26, 2018 at 1:57 PM, Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 7/26/2018 9:27 AM, Markus Scherer via Unicode wrote: > > I would not expect for Ä+combining () above = Ä᪻ to look right except with > specialized fonts. > http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB&s=&uv=0 > > Even if it worked widely, I think it would be confusing. > I think you are best off writing Arzt/Ärztin. > > > Why do something simple and unambiguous, when you can to something that's > technologically complex, looks unfamiliar to readers and is likely to be > misunderstood? > > :) > > A./ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jul 26 16:33:03 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 26 Jul 2018 23:33:03 +0200 (CEST) Subject: Diacritic marks in parentheses In-Reply-To: References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> Message-ID: <1837092124.14473.1532640783697.JavaMail.www@wwinf1m18> Indeed when target use is general, dialectological diacritics are visibly not an option, as despite being in Unicode since v7.0 (2014), they are still unsupported by mainstream. Writing “der Arzt oder die Ärztin” or, depending on context, “einen Arzt oder eine Ärztin”, which I remember being common on package leaflets, is best practice. 
Mit freundlichen Grüßen, Marcel On 26/07/18 18:27, Markus Scherer wrote: > > I would not expect for Ä+combining () above = Ä᪻ to look right except with specialized fonts. > http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB&s=&uv=0 > > Even if it worked widely, I think it would be confusing. > I think you are best off writing Arzt/Ärztin. > > > Viele Grüße, > markus From unicode at unicode.org Thu Jul 26 17:41:47 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 26 Jul 2018 15:41:47 -0700 Subject: Diacritic marks in parentheses In-Reply-To: References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> <0b3f7895-1428-0363-4ce5-da82cce00032@ix.netcom.com> Message-ID: But Asmus, think of how easy it would be to read: Ein??? A???rzt???? hat eine??? Studenti???n gesehen. Mark On Thu, Jul 26, 2018 at 2:15 PM, Mark Davis ☕️ wrote: > ?? > > Mark > > On Thu, Jul 26, 2018 at 1:57 PM, Asmus Freytag via Unicode < > unicode at unicode.org> wrote: > >> On 7/26/2018 9:27 AM, Markus Scherer via Unicode wrote: >> >> I would not expect for Ä+combining () above = Ä᪻ to look right except >> with specialized fonts. >> http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB&s=&uv=0 >> >> Even if it worked widely, I think it would be confusing. >> I think you are best off writing Arzt/Ärztin. >> >> >> Why do something simple and unambiguous, when you can to something that's >> technologically complex, looks unfamiliar to readers and is likely to be >> misunderstood? >> >> :) >> >> A./ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jul 26 23:54:05 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 26 Jul 2018 20:54:05 -0800 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: References:

<20180726204652.39387370@JRWUBU2> Message-ID: Alexey Ostrovsky wrote, > "The Georgian community understood" – sorry, but > here "the Georgian community" means a small group > of Georgian font designers who promote upper-case > for effectively caseless Georgian. https://unicode.org/wg2/docs/n4712-georgian.pdf The revised proposal to change the Georgian encoding model from caseless to casing was convincing and compelling. (It's bilingual, too, English and Georgian.) From unicode at unicode.org Fri Jul 27 01:16:40 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Fri, 27 Jul 2018 15:16:40 +0900 Subject: Diacritic marks in parentheses In-Reply-To: References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> Message-ID: <16038fdf-4ad5-a7bb-a6fd-8d0acbd3d828@it.aoyama.ac.jp> On 2018/07/27 01:27, Markus Scherer via Unicode wrote: > I would not expect for Ä+combining () above = Ä᪻ to look right except with > specialized fonts. > http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB&s=&uv=0 > > Even if it worked widely, I think it would be confusing. Yes, for the moment. We don't know how this will develop. (Famous German (grammatically incorrect) saying: Man gewöhnt sich an allem, auch am Dativ.) > I think you are best off writing Arzt/Ärztin. Regards, Martin. From unicode at unicode.org Fri Jul 27 02:22:47 2018 From: unicode at unicode.org (Arthur Reutenauer via Unicode) Date: Fri, 27 Jul 2018 09:22:47 +0200 Subject: Diacritic marks in parentheses In-Reply-To: References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> <0b3f7895-1428-0363-4ce5-da82cce00032@ix.netcom.com> Message-ID: <20180727072247.GA1728455@phare.normalesup.org> On Thu, Jul 26, 2018 at 03:41:47PM -0700, Mark Davis ☕️ via Unicode wrote: > Ein??? A???rzt???? hat eine??? Studenti???n gesehen. It doesn’t actually look that bad! 
Although I’d like to amend the end to “eine??? Student?????? gesehen”. Best, Arthur From unicode at unicode.org Fri Jul 27 03:35:10 2018 From: unicode at unicode.org (Alexey Ostrovsky via Unicode) Date: Fri, 27 Jul 2018 12:35:10 +0400 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: References:

<20180726204652.39387370@JRWUBU2> Message-ID: On Fri, Jul 27, 2018 at 8:54 AM, James Kass via Unicode wrote: > https://unicode.org/wg2/docs/n4712-georgian.pdf > > The revised proposal to change the Georgian encoding model from > caseless to casing was convincing and compelling. (It's bilingual, > too, English and Georgian.) > It may look so, but my statement is still correct. This is not the first time, when the consortium mistreats Georgian (one can remember a story of encoding the ecclesiastic minuscule). Just two points: 1) "compelling" (less important). The supporters are either font designers or non-specialists organizations. There are several institutions in Georgia that had to be involved IMHO (like Institute of Georgian Language, Institute of Manuscripts and Academy of Sciences; Ministry of Economy is not an institution competent in the script issues). 2) "convincing". I will not discuss all the controversies here, but will only cite §1.1 and §8: §1.1, on "*Mkhedruli? is caseless, and no casing behaviour is expected or permitted by Georgian users. The mtavruli titling style of Mkhedruli? is not case; it is a style analogous to small caps or bold or italic. <...> Mtavruli-style letters are never used as “capitals”; a word is always entirely presented in mtavruli or not. Mtavruli-style is used in titles, newspaper headlines, and other kinds of headings.*" of the original encoding (N2608R2): ? "*This statement was not correct.*" At the same time, §8 on successful implementation of the proposal in question: "*Within a sentence a given word might be written IN ALL CAPS (MTAVRULI) for emphasis. An entire sentence or header may also be written in Mtavruli.*" And all the sample photos of the modern books and journals demonstrate exactly the same behavior as described in N2608R2: " *Mtavruli-style is used in titles, newspaper headlines, and other kinds of headings*". 
(I can provide more information if needed) The key question is whether Georgian is caseless or not in plain text encoding, and N2608R2 does not provide any evidence for casing in modern Georgian. Basically, the issues addressed are the low level of technical support for implementing small caps in Georgian typesetting (but this must not be Unicode issue) and incorrect idea that small caps must be preserved in plain text encoding (just because someone loves it), it is obvious from §1.1 (right after the text I cited). Sincerely, a. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jul 27 05:42:07 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Fri, 27 Jul 2018 11:42:07 +0100 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: References:

<20180726204652.39387370@JRWUBU2> Message-ID: Yes and it explains clearly that “effectively caseless Georgian” is incorrect. Georgian has case. Georgian uses case differently from other scripts. This is an orthographic distinction, not a structural one. In fact as it is also stated in the proposal, there are 19th-century texts which do titlecase. It’s just that that orthography is no longer in use and that behaviour no longer desirable. Michael Everson > On 27 Jul 2018, at 05:54, James Kass via Unicode wrote: > > Alexey Ostrovsky wrote, > >> "The Georgian community understood" – sorry, but >> here "the Georgian community" means a small group >> of Georgian font designers who promote upper-case >> for effectively caseless Georgian. > > https://unicode.org/wg2/docs/n4712-georgian.pdf > > The revised proposal to change the Georgian encoding model from > caseless to casing was convincing and compelling. (It's bilingual, > too, English and Georgian.) > From unicode at unicode.org Fri Jul 27 06:22:15 2018 From: unicode at unicode.org (Alexey Ostrovsky via Unicode) Date: Fri, 27 Jul 2018 15:22:15 +0400 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: References: