From mark at macchiato.com Tue Apr 1 02:01:39 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 1 Apr 2014 09:01:39 +0200
Subject: FYI: More emoji from Chrome

More emoji from Chrome:

http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html

with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y

From verdy_p at wanadoo.fr Tue Apr 1 02:13:39 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 1 Apr 2014 09:13:39 +0200
Subject: FYI: More emoji from Chrome

April 1st joke...

2014-04-01 9:01 GMT+02:00 Mark Davis ☕️:

> More emoji from Chrome:
>
> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>
> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y

From mark at macchiato.com Tue Apr 1 02:20:59 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 1 Apr 2014 09:20:59 +0200
Subject: FYI: More emoji from Chrome

Yup!

Mark

*Il meglio è l'inimico del bene*

On 1 April 2014 09:13, Philippe Verdy wrote:

> April 1st joke...

From mathias at qiwi.be Tue Apr 1 02:25:29 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Tue, 1 Apr 2014 09:25:29 +0200
Subject: FYI: More emoji from Chrome
Message-ID: <945ED05C-7DB3-40C4-8622-F5EDAD63D04E@qiwi.be>

On 1 Apr 2014, at 09:13, Philippe Verdy wrote:

> April 1st joke...

Sure, it really works, though. Try it out. Kinda cool :)

I would've preferred if Google had finally implemented support for proper
emoji in OS X, though:
https://code.google.com/p/chromium/issues/detail?id=62435

From jjc at jclark.com Tue Apr 1 00:51:11 2014
From: jjc at jclark.com (James Clark)
Date: Tue, 1 Apr 2014 12:51:11 +0700
Subject: Bidi reordering of soft hyphen

Suppose I have a paragraph (uppercase = RTL):

CARROT IS car\u00ADrot IN ENGLISH

and the paragraph gets broken at the soft hyphen.

Is the correct ordering for the first line

car- SI TORRAC

or

-car SI TORRAC

? I did not succeed in deducing the answer from UAX#9. Soft hyphen has
bidi class BN, which means it gets removed in stage X9, and so, if I have
understood correctly, doesn't have a defined embedding level.

I'm guessing the correct ordering is the first one, but I don't trust my
instincts here. (In particular, I wondered whether this was analogous to
the case where rule L1 resets embedding levels so that trailing whitespace
is at the visual end of the line.)

More generally, suppose you have a markup language which has a construct
for discretionary breaks, as in TeX, with pre-break, post-break and
no-break text. Soft hyphen is a special case of this (where the pre-break
text consists of a hyphen, and the post-break and no-break texts are
empty); you can also regard space as a kind of discretionary break
(post-break text empty, no-break text contains the space, pre-break text
either contains the space or is empty, depending on how you want to think
about it).
Obviously the embedding level for the no-break text should be resolved as
if the discretionary break was replaced by the no-break text (which is
consistent with a bidi class of BN for soft hyphen). However, for the pre-
and post-break text, it is not clear to me what the right way is to resolve
embedding levels (or how their content should be restricted so that there
is a sensible way to resolve the embedding levels). I would be grateful for
any suggestions.

James

From nospam-abuse at ilyaz.org Tue Apr 1 11:43:43 2014
From: nospam-abuse at ilyaz.org (Ilya Zakharevich)
Date: Tue, 1 Apr 2014 09:43:43 -0700
Subject: FYI: More emoji from Chrome
Message-ID: <20140401164343.GA5003@powdermilk>

On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ☕️ wrote:

> More emoji from Chrome:
>
> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>
> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y

I do not know... The demos leave me completely unimpressed: emoji, by
their nature, require higher resolution than text, so an emoji for "pie"
does not save any space compared to the word itself. So the impact of
this on everyday English-language communication would not be in any way
beneficial.

However, this MAY be the beginning of a revolution in scientific
communication. Science-and-about publications contain very long words in
abundance, and it is HERE where the impact of emojification should be felt
the most! So I think the task of emojification of scientific terms, be it
"secularization", "gamma-globulin", or "derived ∞-category", should be at
elevated priority in the Unicode committees.

The general public often considers scientific publications too dense, and
does not bother to read many scientific journals. What Google did is the
beginning of a major step forward in making contemporary science (finally!)
accessible to the general public.

Ilya

From richard.wordingham at ntlworld.com Tue Apr 1 15:10:23 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 1 Apr 2014 21:10:23 +0100
Subject: Bidi reordering of soft hyphen
Message-ID: <20140401211023.6042e0a5@JRWUBU2>

On Tue, 1 Apr 2014 12:51:11 +0700 James Clark wrote:

> Suppose I have a paragraph (uppercase = RTL):
>
> CARROT IS car\u00ADrot IN ENGLISH
>
> and the paragraph gets broken at the soft hyphen.
>
> Is the correct ordering for the first line
>
> car- SI TORRAC
>
> or
>
> -car SI TORRAC
>
> ? I did not succeed in deducing the answer from UAX#9. Soft hyphen
> has bidi class BN, which means it gets removed in stage X9, and so,
> if I have understood correctly, doesn't have a defined embedding
> level.
>
> I'm guessing the correct ordering is the first one, but I don't trust
> my instincts here. (In particular, I wondered whether this was
> analogous to the case where rule L1 resets embedding levels so that
> trailing whitespace is at the visual end of the line.)

There is no conformance requirement on the location of the soft hyphen.
Indeed, there is no requirement on whether it is rendered at all (TUS
Section 16.2). As the treatment of the soft hyphen is language dependent
even in unidirectional text, I am afraid the treatment is down to good
taste and the language(s) involved. (E.g., is this Arabic text effectively
embedding English text within an overall Thai context?)

As U+2010 HYPHEN would result in text like 'car-', in an English
influenced context I would also go with 'car-'.

Richard.

From ken.whistler at sap.com Tue Apr 1 15:20:13 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 1 Apr 2014 20:20:13 +0000
Subject: Bidi reordering of soft hyphen

I don't think the answer is directly deduced from UAX #9, because
it involves deciding where to insert a visible hyphen for display.
However, I think the correct answer here is your number two guess,
i.e. (in a RTL paragraph context):

-car SI TORRAC

A way to think about this, rather than starting from the BN nature
of U+00AD, is to ask what would happen if there was an *explicit*
hyphen-minus at the same position. Shortening your example
line "CARROT IS car\u00AD" to just the equivalent of "ABC car-",
the outcome of the bidiref processing for a RTL paragraph context is:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:  R    R    R    R    L    L    L    R
  Levels:      1    1    1    1    2    2    2    1
  Runs:
  Order:       [7 4 5 6 3 2 1 0]

In other words, on display:

-car CBA
<---------

with the hyphen-minus at the *end* of the reordered line, as
expected.

If you run the same example, but substituting U+00AD for U+002D, you get:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 00AD
  Bidi_Class:  R    R    R    R    L    L    L    BN
  Levels:      1    1    1    1    2    2    2    x
  Runs:
  Order:       [4 5 6 3 2 1 0]

And the display for that would be:

car CBA

But *then* your hyphenation algorithm would presumably kick in and decide
that the U+00AD is at the end of the line and should display as a visible
hyphen glyph. But "end of the line" here means the same as it would for
the explicit hyphen-minus, so when you insert the visible hyphen glyph, you
end up with the same result:

-car CBA

Another way of looking at this is that in order to line break your text in
the first place, you need to be able to calculate the resolved display
width to fit in the line. That would have to include the visual display of
the inserted hyphen glyph. So once you have *decided* to break the line at
the soft hyphen, in effect, you substitute a visual display symbol U+002D
(or the actual hyphen U+2010, etc.) for U+00AD. *Then* run the UBA on the
results to get the resolved order of all the elements on the line. The net
effect should be the same.

Maybe folks with full implementations of bidi rendering would have more to
contribute on this, but that would be my own take on the problem.

--Ken

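The orders shown in the bidiref traces in this thread follow from rule L2 of UAX #9 once the embedding levels are known. The following is a minimal Python sketch of that one rule only, under the assumption that the resolved levels are copied by hand from the traces rather than computed; the names `reorder_l2` and `display` are illustrative and are not part of bidiref.

```python
def reorder_l2(levels):
    """Return logical indices in visual order for one line (UAX #9 rule L2).

    `levels` holds one resolved embedding level per character, with any
    characters removed by X9 (such as BN) already left out.
    """
    order = list(range(len(levels)))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return order  # nothing at an odd level: the line stays in logical order
    highest = max(levels)
    lowest_odd = min(odd_levels)
    # From the highest level down to the lowest odd level, reverse every
    # contiguous run of characters at that level or higher.
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return order


def display(text, levels):
    """Arrange `text` left to right on screen; uppercase stands in for RTL letters."""
    return "".join(text[i] for i in reorder_l2(levels))


# Levels copied from the traces (an assumption of this sketch, not computed here).
rtl_with_hyphen = [1, 1, 1, 1, 2, 2, 2, 1]   # RTL paragraph, explicit U+002D
rtl_shy_removed = [1, 1, 1, 1, 2, 2, 2]      # RTL paragraph, U+00AD removed by X9
ltr_with_hyphen = [1, 1, 1, 0, 0, 0, 0, 0]   # LTR paragraph, explicit U+002D

print(reorder_l2(rtl_with_hyphen))           # [7, 4, 5, 6, 3, 2, 1, 0]
print(display("ABC car-", rtl_with_hyphen))  # -car CBA
print(display("ABC car", rtl_shy_removed))   # car CBA
print(display("ABC car-", ltr_with_hyphen))  # CBA car-
```

The sketch deliberately stops at reordering: deciding which level the inserted hyphen receives is exactly the open question of the thread, so the levels stay an input.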
From ken.whistler at sap.com Tue Apr 1 15:31:08 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 1 Apr 2014 20:31:08 +0000
Subject: Bidi reordering of soft hyphen
In-Reply-To: <20140401211023.6042e0a5@JRWUBU2>
References: <20140401211023.6042e0a5@JRWUBU2>

Richard Wordingham noted:

> As U+2010 HYPHEN would result in text like 'car-', in an English
> influenced context I would also go with 'car-'.

That's always a possibility, I suppose, but I'm not sure what
"English influenced context" means here.

The examples I just gave were for a RTL paragraph context.
In a LTR paragraph context, the same input would end up in
a very different order:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:  R    R    R    L    L    L    L    L
  Levels:      1    1    1    0    0    0    0    0
  Runs:
  Order:       [2 1 0 3 4 5 6 7]

And you get the display:

CBA car-
--------->

As opposed to:

-car CBA
<---------

In either case, the hyphen-minus (or hyphen) ends up at the *end of the
line*.

My take is that *if* I am going to insert a visible glyph at the point of
the SHY, it would probably be best to insert it at the actual line break at
the end of the line, to be in the same position as an explicit hyphen-minus
with the same line break.

--Ken

From roozbeh at unicode.org Tue Apr 1 16:00:25 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Tue, 1 Apr 2014 14:00:25 -0700
Subject: Bidi reordering of soft hyphen

Adding Behdad for his insight on the rendering stack.

But as for user requirements and expectations, the first option, with the
hyphen on the right side of "car" as "car-" is what a good publisher would
want to print in his magazine or book. The second option is harder to
decipher for an RTL reader.

(Note that breaking opposite-direction phrases across lines in bidi
paragraphs is also avoided as much as possible in good typography, as the
output is weird to some readers anyway.)

From asmusf at ix.netcom.com Tue Apr 1 16:43:38 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 01 Apr 2014 14:43:38 -0700
Subject: Bidi reordering of soft hyphen
References: <20140401211023.6042e0a5@JRWUBU2>
Message-ID: <533B330A.8030401@ix.netcom.com>

I think this calls for an implementation note on UAX#9 along these lines.

-------------------------
During line breaking, if a line is broken at the location of a SHY, the
text around the line break may change. A common case is the replacement of
the invisible SHY by a visible HYPHEN, but see Section x.x in the Unicode
Standard.

For the purposes of the Bidi Algorithm, apply steps .. to .. after any
substitutions have been made, using the directional classes for the
substituted characters, instead of a single BN for the SHY character.

Note, no special action need be taken for a SHY character in the middle of
a line, unless it is rendered as a visible glyph in a "show hidden
character" mode. In the latter case, the recommendation would be to treat
the visible symbol substituted for the SHY as having bidi class ON.
------------------------

I am not sure whether

-car CBA

or

car- CBA

is the right answer, nor whether the substitution will always be limited to
the preceding line. (Old orthography German had Bäcker turning into
Bäk-|ker, where I've used | to show the line ending.)

Those are details that the UBA should be ignorant about. The important
thing is that the array of bidi directional classes is not constrained to
contain a single entry for BN at the location of the original SHY.

If "car- CBA" is the right answer then the substitution would have to be
HYPHEN plus LRM to get this to come out right, but that would be under the
control of the line-breaking conventions, and not legislated by the UBA.

A./

From nikiselken at gmail.com Tue Apr 1 16:50:01 2014
From: nikiselken at gmail.com (Nicole Selken)
Date: Tue, 1 Apr 2014 17:50:01 -0400
Subject: Unicode Digest, Vol 4, Issue 1

I think Emoji is totally beneficial as a communication form. Yes, it takes
up some UTF space and such, but they literally affect different parts of
the brain than written words. In this way they change the kind of
communication possible. Also, so many people (especially the young) are
using them that to ignore them or dismiss them would be a mistake.

I like that they were included in the character set for Unicode, and I
would love to talk with someone who was a decision maker on that panel for
my Emoji Project. If anyone who has worked on it has some time, drop me a
line!

https://niki-selken.squarespace.com/#/world-translation-foundation/

Thanks,

Niki Selken
Working on: www.nikiselken.com

On Tue, Apr 1, 2014 at 1:00 PM, wrote:

> Today's Topics:
>
>    1. Call for the experts of U+3013 (suzuki toshiya)
>    2. FYI: More emoji from Chrome (Mark Davis ☕️)
>    3. Re: FYI: More emoji from Chrome (Philippe Verdy)
>    4. Re: FYI: More emoji from Chrome (Mark Davis ☕️)
>    5. Re: FYI: More emoji from Chrome (Mathias Bynens)
>    6. Bidi reordering of soft hyphen (James Clark)
>    7. Re: FYI: More emoji from Chrome (Ilya Zakharevich)
>
> ---------- Forwarded message ----------
> From: suzuki toshiya
> To: Unicode Discussion
> Date: Tue, 01 Apr 2014 09:28:26 +0900
> Subject: Call for the experts of U+3013
>
> Dear all,
>
> Today I submitted a preliminary proposal to standardize
> Variation Selectors for U+3013, so-called "GETA" mark.
>
> ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/n4572.pdf
>
> The geta mark was introduced from JIS X 0208:1990 and
> GB 2312-1980. When I check the original documents
> including the geta mark, some of the representative glyphs
> in these regional standards are different from original
> geta mark. I investigated theoretically possible visual
> shapes of the geta mark, and concluded the registry-based
> standardization of the geta mark is a considerable option.
>
> Unfortunately, the officially printed matters including
> the geta mark is not popular (I found only a few books
> in Japanese national diet library), so I want to hear the
> comments from the geta expert for the official proposal.
>
> Regards,
> mpsuzuki

From smontagu at smontagu.org Tue Apr 1 17:40:57 2014
From: smontagu at smontagu.org (Simon Montagu)
Date: Wed, 02 Apr 2014 01:40:57 +0300
Subject: Bidi reordering of soft hyphen
Message-ID: <533B4079.2030402@smontagu.org>

On 04/02/2014 12:00 AM, Roozbeh Pournader wrote:

> Adding Behdad for his insight on the rendering stack.
>
> But as for user requirements and expectations, the first option, with
> the hyphen on the right side of "car" as "car-" is what a good publisher
> would want to print in his magazine or book.
> The second option is harder to decipher for an RTL reader.

I agree with Roozbeh here. Since the hyphen marks a break in the middle of
the word, I think the most natural user expectation is that it should
appear after the last character in the word, where "after" and "last" both
refer to the reading direction of the word.

I have seen examples of this in published Hebrew books, and this is also
the way it's rendered in Chrome, Firefox and Opera (but in the case of
Firefox, since I wrote the code for it I can testify that it isn't this way
by design: as far as I remember I only took into account the direction of
the text run containing the soft hyphen and didn't even think about the
opposite-direction case).

From richard.wordingham at ntlworld.com Tue Apr 1 18:02:57 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 2 Apr 2014 00:02:57 +0100
Subject: Bidi reordering of soft hyphen
Message-ID: <20140402000257.32dd544d@JRWUBU2>

On Tue, 1 Apr 2014 20:20:13 +0000 "Whistler, Ken" wrote:

> I don't think the answer is directly deduced from UAX #9, because
> it involves deciding where to insert a visible hyphen for display.
> However, I think the correct answer here is your number two guess,
> i.e. (in a RTL paragraph context):
>
> -car SI TORRAC
>
> A way to think about this, rather than starting from the BN nature
> of U+00AD, is to ask what would happen if there was an *explicit*
> hyphen-minus at the same position.

Is it legitimate to truncate the context to a single line? The BiDi
algorithm is attempting to interpret unlabelled text as embedded text (it's
not an arbitrary dance), and in just one line there is no indicator of
whether the hyphen is part of the LTR text embedded in RTL text. However,
the very next character is 'r', which tells us that the left-to-right run
contains the hyphen.
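
The bidi classes of the hyphen-like characters that come up in this thread can be checked directly against the Unicode Character Database. A minimal Python sketch using the standard `unicodedata` module follows; the helper name `bidi_classes` and the choice of characters are illustrative assumptions, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Characters discussed in the thread as the soft hyphen itself, as possible
# visible substitutes for it, or as a companion to such a substitute.
CANDIDATES = ["\u00ad", "\u002d", "\u2010", "\u2212", "\u05be", "\u200e"]


def bidi_classes(chars):
    """Map each character's Unicode name to its Bidi_Class property value."""
    return {unicodedata.name(ch): unicodedata.bidirectional(ch) for ch in chars}


for name, bidi_class in bidi_classes(CANDIDATES).items():
    print(f"{name:28} {bidi_class}")
```

Which class the substituted glyph carries (BN, ES, ON, R, or L via an added LRM) is what decides where it lands after reordering, which is why the choice of substitute matters for the answers proposed here.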
I also think the HYPHEN-MINUS is the wrong character to consider - the analogy should be with U+2010 HYPHEN (class ON) rather than with U+2212 MINUS SIGN (class ES), let alone the ambiguous HYPHEN-MINUS, for which ES is merely the interpretation most likely to work.

I found a similar example, but with Hebrew embedded in the Latin script, in the introduction to the Stuttgart Bible. The corresponding character was U+05BE HEBREW PUNCTUATION MAQAF, though in this case the class is R (because one doesn't expect MAQAF to be used with left-to-right scripts), and therefore not as good an example as I would have hoped for. The BiDi algorithm then happily places the MAQAF internally, making the analogy 'car- SI TORRAC'. (I metaphorically embedded the quote, so I don't get 'SI TORRAC car-', which is plain wrong.)

Now, a valid opposing view is that the graphical representation of soft hyphens says, "When written out as one very long line, there is no space between successive lines", as opposed to "This apparent word is actually continued by text on the next line". If you take the interpretation of the marks operating at the level of lines, then '-car SI TORRAC' is reasonable. As English has the hyphen as a half-way house between one word and two words, English very naturally works at the word level. I am not sure about other languages.

Richard.

From jonathan.rosenne at gmail.com Tue Apr 1 18:12:35 2014
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Wed, 2 Apr 2014 02:12:35 +0300
Subject: Bidi reordering of soft hyphen
In-Reply-To: <533B4079.2030402@smontagu.org>
References: <533B4079.2030402@smontagu.org>
Message-ID: <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>

The use of soft hyphen is a cultural matter. In Hebrew, Classic and Israeli, soft hyphens are not used.
Best Regards, Jonathan Rosenne 054-4246522 -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Simon Montagu Sent: Wednesday, April 02, 2014 1:41 AM To: Roozbeh Pournader; Ken Whistler, (ken.whistler at sap.com) Cc: Behdad Esfahbod; unicode at unicode.org; James Clark Subject: Re: Bidi reordering of soft hyphen On 04/02/2014 12:00 AM, Roozbeh Pournader wrote: > Adding Behdad for his insight on the rendering stack. > > But as for user requirements and expectations, the first option, with > the hyphen on the right side of "car" as "car-" is what a good > publisher would want to print in his magazine or book. The second > option is harder to decipher for an RTL reader. I agree with Roozbeh here. Since the hyphen marks a break in the middle of the word, I think the most natural user expectation is that it should appear after the last character in the word, where "after" and "last" both refer to the reading direction of the word. I have seen examples of this in published Hebrew books, and this is also the way it's rendered in Chrome, Firefox and Opera (but in the case of Firefox, since I wrote the code for it I can testify that it isn't this way by design: as far as I remember I only took into account the direction of the text run containing the soft hyphen and didn't even think about the opposite-direction case). _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From asmusf at ix.netcom.com Tue Apr 1 18:39:13 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 01 Apr 2014 16:39:13 -0700 Subject: Bidi reordering of soft hyphen In-Reply-To: <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com> References: <533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com> Message-ID: <533B4E21.2020504@ix.netcom.com> On 4/1/2014 4:12 PM, Jonathan Rosenne wrote: > The use of soft hyphen is a cultural matter. 
In Hebrew, Classic and Israeli, > soft hyphens are not used. More to the point, how does software render a soft hyphen included in inserted LTR text, when the outer text is Hebrew? Would it always be ignored? Would it be rendered? How? Mind you, I don't think that the bidi algorithm as such needs to care about these details, but the Unicode Standard does mumble about different conventions. Might be useful to add some examples to such mumbling. A./ > > Best Regards, > > Jonathan Rosenne > > 054-4246522 > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Simon > Montagu > Sent: Wednesday, April 02, 2014 1:41 AM > To: Roozbeh Pournader; Ken Whistler, (ken.whistler at sap.com) > Cc: Behdad Esfahbod; unicode at unicode.org; James Clark > Subject: Re: Bidi reordering of soft hyphen > > On 04/02/2014 12:00 AM, Roozbeh Pournader wrote: >> Adding Behdad for his insight on the rendering stack. >> >> But as for user requirements and expectations, the first option, with >> the hyphen on the right side of "car" as "car-" is what a good >> publisher would want to print in his magazine or book. The second >> option is harder to decipher for an RTL reader. > I agree with Roozbeh here. Since the hyphen marks a break in the middle of > the word, I think the most natural user expectation is that it should appear > after the last character in the word, where "after" and "last" > both refer to the reading direction of the word. > > I have seen examples of this in published Hebrew books, and this is also the > way it's rendered in Chrome, Firefox and Opera (but in the case of Firefox, > since I wrote the code for it I can testify that it isn't this way by > design: as far as I remember I only took into account the direction of the > text run containing the soft hyphen and didn't even think about the > opposite-direction case). 

From ken.whistler at sap.com Tue Apr 1 18:41:48 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 1 Apr 2014 23:41:48 +0000
Subject: Bidi reordering of soft hyphen
In-Reply-To: <20140402000257.32dd544d@JRWUBU2>
References: <20140402000257.32dd544d@JRWUBU2>
Message-ID:

> Is it legitimate to truncate the context to a single line? The BiDi
> algorithm is attempting to interpret unlabelled text as embedded text
> (it's not an arbitrary dance), and in just one line there is no
> indicator of whether the hyphen is part of the LTR text embedded in RTL
> text.

For this discussion, I think yes. See Section 3.4 of UAX #9:

    The following rules describe the logical process of finding the
    correct display order. As opposed to resolution phases, these rules
    act on a per-line basis and are applied after any line wrapping is
    applied to the paragraph.

The main collection of UBA rules applies on a per-paragraph basis, but you cannot actually do reordering of the resolved levels until you have specified the line breaks. Effectively, the hyphenation decision has to be taken first. And *then* you can reorder the results line-by-line. So once we have the decision where we are breaking "car-/rot", we can then talk just about where the "car-" ends up on the single line.

But I agree that there are many conundrums for trying to hyphenate individual words in mixed-direction bidi text, so I am not surprised that there would be special typographical conventions which might, as Asmus suggested, require dropping in LRMs or the like, if you wanted the visual placement of hyphens to override the basic behavior of the algorithm.
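
To make the per-line step concrete, here is a minimal Python sketch of rule L2 from UAX #9: from the highest level down to the lowest odd level, reverse every contiguous run of characters at that level or higher. It assumes the embedding levels are already resolved. The names `reorder_line` and `render` are illustrative, and the level arrays are assumptions worked out by hand for the "ABC car-" example (uppercase standing in for RTL letters), not output from any reference implementation.

```python
def reorder_line(levels):
    """Visual order (as logical indices) for one line, following UAX #9 rule L2."""
    order = list(range(len(levels)))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return order  # purely left-to-right line: nothing to reverse
    # From the highest level down to the lowest odd level...
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] < level:
                i += 1
                continue
            # ...find a contiguous run at this level or higher and reverse it.
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = j
    return order


def render(text, levels):
    """Return the characters of `text` in visual (left-to-right) order."""
    return "".join(text[i] for i in reorder_line(levels))


LINE = "ABC car-"  # logical order; A, B, C stand for RTL letters

# Assumed resolved levels, hyphen taking the paragraph direction:
RTL_PARAGRAPH_LEVELS = [1, 1, 1, 1, 2, 2, 2, 1]
LTR_PARAGRAPH_LEVELS = [1, 1, 1, 0, 0, 0, 0, 0]
# Assumed levels if the hyphen is instead pulled into the LTR run:
HYPHEN_IN_LTR_RUN = [1, 1, 1, 1, 2, 2, 2, 2]

print(render(LINE, RTL_PARAGRAPH_LEVELS))  # -car CBA
print(render(LINE, LTR_PARAGRAPH_LEVELS))  # CBA car-
print(render(LINE, HYPHEN_IN_LTR_RUN))     # car- CBA
```

Under these assumed levels, the hyphen ends up at the end of the line whenever it takes the paragraph direction, and stays attached as "car-" only if something (such as a following LRM) gives it the level of the LTR run.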
> However, the very next character is 'r', which tells us that the
> left-to-right run contains the hyphen. I also think the HYPHEN-MINUS
> is the wrong character to consider - the analogy should be with U+2010
> HYPHEN (class ON) rather than with U+2212 MINUS SIGN (class ES), let
> alone the ambiguous HYPHEN-MINUS, for which ES is merely the
> interpretation most likely to work.

Well, sure, but for the purposes of *this* particular discussion, it makes no difference whatsoever whether we are using U+002D or U+2010, despite the difference in Bidi_Class, since there is no question of numerical formatting here. Rule W6 will convert the bc=ES to bc=ON, and thereafter the processing is identical:

Trace: Entering br_UBA_ResolveTerminators [W5]
Current State: 11
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:     R    R    R   WS    L    L    L   ES
  Levels:         1    1    1    1    1    1    1    1
  Runs:

Trace: Entering br_UBA_ResolveESCSET [W6]
Current State: 12
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:     R    R    R   WS    L    L    L   ON
  Levels:         1    1    1    1    1    1    1    1
  Runs:

--Ken

From verdy_p at wanadoo.fr Tue Apr 1 20:31:43 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 2 Apr 2014 03:31:43 +0200
Subject: FYI: More emoji from Chrome
In-Reply-To: <20140401164343.GA5003@powdermilk>
References: <20140401164343.GA5003@powdermilk>
Message-ID:

2014-04-01 18:43 GMT+02:00 Ilya Zakharevich:

> However, this MAY be the beginning of a revolution in scientific
> communication. Science-and-about publications contain very long
> words in abundance, and it is HERE where the impact of emojification
> should be felt the most! So I think the task of emojification of
> scientific terms, be it "secularization", "gamma-globulin", or
> "derived ?-category", should be at elevated priority in the Unicode
> committees.
>
> The general public often considers scientific publications too
> dense, and does not bother to read many scientific journals.

The density of scientific publications is not so much about word length (the words are actually not much longer than in general text) as about the precision added by each word, and the associated information that requires frequent use of qualifiers and subqualifiers. Frequently it is difficult to give names to the concepts, so scientists start using notations and many abbreviations defined specifically for a document or topic, which can only be understood in their specific context (outside this context, or without prior knowledge of commonly used conventions, the text will look extremely confusing).

Note also that the common use of synonyms in general speech does not apply here, because scientists tend to draw stronger distinctions between terms that most of the public would not really discriminate. This is all about terminology, and even this list frequently has problems discussing concepts because of terms that now carry a more precise meaning (an example on this list is all the discussions related to "character", "codes", "code points", "collation element" vs. "collating element": the general public cannot see the differences, and the specifications then look very confusing or obscure to them).

Reading a scientific paper therefore requires much more attention and prior knowledge of specific conventions.

> What
> Google did is a beginning of a major step forward in making
> contemporary science (finally!) accessible to general public.
>

Not at all. Emojis are certainly not what scientists use for the conventions they need, simply because their representation is too permissive (they carry similar "emotions", and their glyphs are frequently modified with lots of variants, different colors, styles). In fact scientists do not use emojis. When they need to summarize concepts, they create conventional abbreviations/acronyms, or symbols with precise glyphs (and the glyph appearance is semantically important, e.g.
in maths, chemical formulas, electronics, physics, building engineering...), or specific terminologies (legal texts...). These conventions are not freely translatable into emojis. Even a cookbook cannot easily use emojis. If words are not precise enough, it will use photos. But cuisine or gardening also has its own terminology.

From duerst at it.aoyama.ac.jp Tue Apr 1 20:39:13 2014
From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=)
Date: Wed, 02 Apr 2014 10:39:13 +0900
Subject: FYI: More emoji from Chrome
In-Reply-To: <20140401164343.GA5003@powdermilk>
References: <20140401164343.GA5003@powdermilk>
Message-ID: <533B6A41.1080508@it.aoyama.ac.jp>

Now that it's no longer April 1st (at least not here in Japan), I can add a (moderately) serious comment.

On 2014/04/02 01:43, Ilya Zakharevich wrote:
> On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ☕ wrote:
>> More emoji from Chrome:
>>
>> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>>
>> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y
>
> I do not know... The demos leave me completely unimpressed: emoji, by
> their nature, require higher resolution than text, so an emoji for
> "pie" does not save any space compared to the word itself. So the
> impact of this on everyday English-language communication would not be
> in any way beneficial.

This is somewhat different for Japanese (and languages with similar writing systems) because they have higher line height.

Regards, Martin.
From jonathan.rosenne at gmail.com Tue Apr 1 23:25:26 2014 From: jonathan.rosenne at gmail.com (Jonathan Rosenne) Date: Wed, 2 Apr 2014 07:25:26 +0300 Subject: Bidi reordering of soft hyphen In-Reply-To: <533B4E21.2020504@ix.netcom.com> References: <533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com> <533B4E21.2020504@ix.netcom.com> Message-ID: <001c01cf4e2b$94462920$bcd27b60$@gmail.com> I don't think it matters very much what would the software do were there to be a soft hyphen in the text, firstly because it is not very likely for a soft hyphen to have been in the text intentionally and secondly because the software would more likely that not have been developed in a cultural environment that cares about soft hyphens. Best Regards, Jonathan Rosenne -----Original Message----- From: Asmus Freytag [mailto:asmusf at ix.netcom.com] Sent: Wednesday, April 02, 2014 2:39 AM To: Jonathan Rosenne; unicode at unicode.org Subject: Re: Bidi reordering of soft hyphen On 4/1/2014 4:12 PM, Jonathan Rosenne wrote: > The use of soft hyphen is a cultural matter. In Hebrew, Classic and > Israeli, soft hyphens are not used. More to the point, how does software render a soft hyphen included in inserted LTR text, when the outer text is Hebrew? Would it always be ignored? Would it be rendered? How? Mind you, I don't think that the bidi algorithm as such needs to care about these details, but the Unicode Standard does mumble about different conventions. Might be useful to add some examples to such mumbling. 
A./ > > Best Regards, > > Jonathan Rosenne > > 054-4246522 > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Simon > Montagu > Sent: Wednesday, April 02, 2014 1:41 AM > To: Roozbeh Pournader; Ken Whistler, (ken.whistler at sap.com) > Cc: Behdad Esfahbod; unicode at unicode.org; James Clark > Subject: Re: Bidi reordering of soft hyphen > > On 04/02/2014 12:00 AM, Roozbeh Pournader wrote: >> Adding Behdad for his insight on the rendering stack. >> >> But as for user requirements and expectations, the first option, with >> the hyphen on the right side of "car" as "car-" is what a good >> publisher would want to print in his magazine or book. The second >> option is harder to decipher for an RTL reader. > I agree with Roozbeh here. Since the hyphen marks a break in the > middle of the word, I think the most natural user expectation is that > it should appear after the last character in the word, where "after" and "last" > both refer to the reading direction of the word. > > I have seen examples of this in published Hebrew books, and this is > also the way it's rendered in Chrome, Firefox and Opera (but in the > case of Firefox, since I wrote the code for it I can testify that it > isn't this way by > design: as far as I remember I only took into account the direction of > the text run containing the soft hyphen and didn't even think about > the opposite-direction case). 

From smontagu at smontagu.org Wed Apr 2 01:02:08 2014
From: smontagu at smontagu.org (Simon Montagu)
Date: Wed, 02 Apr 2014 09:02:08 +0300
Subject: Bidi reordering of soft hyphen
In-Reply-To: <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>
References: <533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>
Message-ID: <533BA7E0.2050004@smontagu.org>

On 04/02/2014 02:12 AM, Jonathan Rosenne wrote:
> The use of soft hyphen is a cultural matter. In Hebrew, Classic and Israeli,
> soft hyphens are not used.

I don't understand this statement. Classic, yes, but in Israeli Hebrew soft hyphens typically _are_ used in texts printed in relatively narrow justified columns -- common examples are newspapers and encyclopædias.

(Or are we using terms differently? In any case, with respect to the original question about where to position a soft hyphen in a line break in the middle of a word in an opposite-direction run in bidirectional text, I believe that it doesn't make a difference whether we are referring to U+00AD SOFT HYPHEN, a hyphen automatically inserted by typesetting software, or a hyphen inserted manually).

From chris.fynn at gmail.com Wed Apr 2 01:04:54 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Apr 2014 12:04:54 +0600
Subject: Emoji
Message-ID:

On 02/04/2014, Nicole Selken wrote:
> I think Emoji is totally beneficial as a communication form.

A reversion to a crude form of Hieroglyphics?
From jonathan.rosenne at gmail.com Wed Apr 2 01:15:55 2014 From: jonathan.rosenne at gmail.com (Jonathan Rosenne) Date: Wed, 2 Apr 2014 09:15:55 +0300 Subject: Bidi reordering of soft hyphen In-Reply-To: <533BA7E0.2050004@smontagu.org> References: <533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com> <533BA7E0.2050004@smontagu.org> Message-ID: <004001cf4e3b$03512400$09f36c00$@gmail.com> Some papers are indeed doing this sporadically. It looks like it is up to the individual writer. The samples I see are barely readable, incorrect and unprofessional any way you look at them, and seem to derive from the use of inappropriate software. Best Regards, Jonathan Rosenne -----Original Message----- From: Simon Montagu [mailto:smontagu at smontagu.org] Sent: Wednesday, April 02, 2014 9:02 AM To: Jonathan Rosenne; unicode at unicode.org Subject: Re: Bidi reordering of soft hyphen On 04/02/2014 02:12 AM, Jonathan Rosenne wrote: > The use of soft hyphen is a cultural matter. In Hebrew, Classic and > Israeli, soft hyphens are not used. I don't understand this statement. Classic, yes, but in Israeli Hebrew soft hyphens typically _are_ used in texts printed in relatively narrow justified columns -- common examples are newspapers and encyclop?dias. (Or are we using terms differently? In any case, with respect to the original question about where to position a soft hyphen in a line break in the middle of a word in an opposite-direction run in bidirectional text, I believe that it doesn't make a difference whether we are referring to U+00AD SOFT HYPHEN, a hyphen automatically inserted by typesetting software, or a hyphen inserted manually). 
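
Several Bidi_Class values are cited from memory in this thread (BN for the soft hyphen, ES versus ON for the candidate visible hyphens, R for MAQAF). As a rough check, the short Python sketch below looks them up with the standard `unicodedata` module. The dictionary name `CANDIDATES` and the helper `bidi_classes` are illustrative only, and the values reported depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# Characters discussed in the thread as the hyphen at a line break,
# plus LRM, which was suggested as a companion to a substituted HYPHEN.
CANDIDATES = {
    "U+00AD SOFT HYPHEN": "\u00ad",
    "U+002D HYPHEN-MINUS": "\u002d",
    "U+2010 HYPHEN": "\u2010",
    "U+2212 MINUS SIGN": "\u2212",
    "U+05BE HEBREW PUNCTUATION MAQAF": "\u05be",
    "U+200E LEFT-TO-RIGHT MARK": "\u200e",
}


def bidi_classes(chars=CANDIDATES):
    """Map each label to the Bidi_Class reported by unicodedata."""
    return {label: unicodedata.bidirectional(ch) for label, ch in chars.items()}


for label, bidi_class in bidi_classes().items():
    print(f"{label:35} {bidi_class}")
```

This only reports the starting classes; what each character resolves to in a given line still depends on the weak and neutral rules and on the paragraph direction.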
From mark at macchiato.com Wed Apr 2 01:27:23 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Wed, 2 Apr 2014 08:27:23 +0200
Subject: Bidi reordering of soft hyphen
In-Reply-To: <533B330A.8030401@ix.netcom.com>
References: <20140401211023.6042e0a5@JRWUBU2> <533B330A.8030401@ix.netcom.com>
Message-ID:

I tend to agree with Roozbeh and Behdad. I would expect to find the visible appearance of the hyphen "replacing" the letters that were broken off from the last word. That is, if the word was "beekeeper", I'd expect to see:

.... bee- .....

That would be no matter where the word occurred, and no matter what the direction of the paragraph or surrounding text. (If the SHY occurred at a directional boundary, I'd also say we don't care much...)

In any event, once we come up with an agreed recommendation, I'd suggest an implementation note like Asmus describes, but rather than talk about algorithmic steps, just point out the desired visual behavior (since there are many ways to do it).

Mark

*Il meglio è l'inimico del bene*

On 1 April 2014 23:43, Asmus Freytag wrote:
> I think this calls for an implementation note on UAX#9 along these lines.
> -------------------------
> During line breaking, if a line is broken at the location of a SHY, the
> text around the line break may change. A common case is the replacement of
> the invisible SHY by a visible HYPHEN, but see Section x.x in the Unicode
> Standard.
>
> For the purposes of the Bidi Algorithm, apply steps .. to .. after any
> substitutions have been made, using the directional classes for the
> substituted characters, instead of a single BN for the SHY character.
>
> Note, no special action need be taken for a SHY character in the middle of
> a line, unless they are rendered as visible glyphs in a "show hidden
> character" mode. In the latter case, the recommendation would be to treat
> the visible symbol substituted for the SHY as having bidi class ON.
> ------------------------
>
> I am not sure whether -car CBA or car- CBA is the right answer, nor
> whether the substitution will always be limited to the preceding line. (Old
> orthography German had Bäcker turning into Bäk-|ker, where I've used
> | to show the line ending.) Those are details that the UBA should be
> ignorant about. The important thing is that the array of bidi directional
> classes is not constrained to contain a single entry for BN at the location
> of the original SHY.
>
> If "car- CBA" is the right answer then the substitution would have to be
> HYPHEN plus LRM to get this to come out right, but that would be under the
> control of the line-breaking conventions, and not legislated by the UBA.
>
> A./
>
> On 4/1/2014 1:31 PM, Whistler, Ken wrote:
> > Richard Wordingham noted:
> >
> > > As U+2010 HYPHEN would result in text like 'car-', in an English
> > > influenced context I would also go with 'car-'.
> >
> > That's always a possibility, I suppose, but I'm not sure what
> > "English influenced context" means here.
> >
> > The examples I just gave were for a RTL paragraph context.
> > In a LTR paragraph context, the same input would end up in
> > a very different order:
> >
> > Trace: Entering br_UBA_ReverseLevels [L2]
> > Current State: 19
> >   Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
> >   Bidi_Class:     R    R    R    L    L    L    L    L
> >   Levels:         1    1    1    0    0    0    0    0
> >   Runs:
> >
> >   Order: [2 1 0 3 4 5 6 7]
> >
> > And you get the display:
> >
> > CBA car-
> > --------->
> >
> > As opposed to:
> >
> > -car CBA
> > <---------
> >
> > In either case, the hyphen-minus (or hyphen), ends up at the *end of the
> > line*.
> >
> > My take is that *if* I am going to insert a visible glyph at the point of
> > the SHY, it would probably be best to insert it at the actual line break
> > at the end of the line, to be in the same position as an explicit
> > hyphen-minus with the same line break.
> >
> > --Ken

From wjgo_10009 at btinternet.com Wed Apr 2 01:29:15 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 2 Apr 2014 07:29:15 +0100 (BST)
Subject: Emoji
In-Reply-To:
References:
Message-ID: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com>

For me, an important aspect of emoji is that they are independent of language.

They can localize in the mind of the reader.

How can they express verbs such as need and must; and pronouns?

How can they express thanks?

William Overington

2 April 2014

----- Original Message -----
From: Christopher Fynn
To: Unicode List
Cc: Nicole Selken
Sent: Wednesday, 2 April 2014, 7:04
Subject: Re: Emoji

On 02/04/2014, Nicole Selken wrote:
> I think Emoji is totally beneficial as a communication form.

A reversion to a crude form of Hieroglyphics?

From chris.fynn at gmail.com Wed Apr 2 01:46:11 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Apr 2014 12:46:11 +0600
Subject: Emoji
In-Reply-To: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID:

On 02/04/2014, William_J_G Overington wrote:
> For me, an important aspect of emoji is that they are independent of
> language.

Emoji seem fairly culturally specific. (Maybe the mobile-phone messaging culture.) Kind of shorthand expressions which may be used with several languages - but not independent of language.
I suspect some of them already convey one thing to a Japanese teenager and quite another to an American. And if you showed these symbols to many people in other countries, they wouldn't have a clue as to what they are supposed to mean.

From richard.wordingham at ntlworld.com Wed Apr 2 02:36:24 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 2 Apr 2014 08:36:24 +0100
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References: <20140402000257.32dd544d@JRWUBU2>
Message-ID: <20140402083624.24b8da63@JRWUBU2>

On Tue, 1 Apr 2014 23:41:48 +0000 "Whistler, Ken" wrote:

> > Is it legitimate to truncate the context to a single line? The BiDi
> > algorithm is attempting to interpret unlabelled text as embedded
> > text (it's not an arbitrary dance), and in just one line there is no
> > indicator of whether the hyphen is part of the LTR text embedded in
> > RTL text.

> For this discussion, I think yes. See Section 3.4 of UAX #9:

> The following rules describe the logical process of finding the
> correct display order. As opposed to resolution phases, these rules
> act on a per-line basis and are applied after any line wrapping is
> applied to the paragraph.

But it is a *resolution* rule that converts the true hyphen or minus sign to Bidi Class L; these apply before the scope reduces from paragraph to line.

Richard.

From chris.fynn at gmail.com Wed Apr 2 03:42:29 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Apr 2014 14:42:29 +0600
Subject: FYI: More emoji from Chrome
In-Reply-To: <533B6A41.1080508@it.aoyama.ac.jp>
References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp>
Message-ID:

Rather than Emoji it might be better if people learnt Han ideographs which are also compact (and a far more developed system of communication than emoji). One CJK character can also easily replace dozens of Latin characters - which is what is being claimed for emoji.

On 02/04/2014, "Martin J.
D?rst" wrote: > Now that it's no longer April 1st (at least not here in Japan), I can > add a (moderately) serious comment. > > On 2014/04/02 01:43, Ilya Zakharevich wrote: >> On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ?? wrote: >>> More emoji from Chrome: >>> >>> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html >>> >>> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y >> >> I do not know? The demos leave me completely unimpressed: emoji???by >> their nature???require higher resolution than text, so an emoji for >> ?pie? does not save any place comparing to the word itself. So the >> impact of this on everyday English-languare communication would not be >> in any way beneficial. > > This is somewhat different for Japanese (and languages with similar > writing systems) because they have higher line height. > > Regards, Martin. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From chris.fynn at gmail.com Wed Apr 2 04:07:44 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Wed, 2 Apr 2014 15:07:44 +0600 Subject: FYI: More emoji from Chrome In-Reply-To: <533B6A41.1080508@it.aoyama.ac.jp> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> Message-ID: On 02/04/2014, "Martin J. D?rst" wrote: > Now that it's no longer April 1st (at least not here in Japan), I can > add a (moderately) serious comment. Long past April 1 here too - I'd already forgotten. ;-) >>> More emoji from Chrome: >>> >>> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html >>> >>> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y >> >> I do not know? The demos leave me completely unimpressed: emoji???by >> their nature???require higher resolution than text, so an emoji for >> ?pie? does not save any place comparing to the word itself. 
So the >> impact of this on everyday English-languare communication would not be >> in any way beneficial. > > This is somewhat different for Japanese (and languages with similar > writing systems) because they have higher line height. > > Regards, Martin. So CJK glyphs take up similar space to that needed to display an emoji character. - Presumably the individual Han ideographs for "pie", "dumpling" or "turd" would save as much screen space as using the corresponding emoji pictographs. Once there were enough emoji to carry on a conversation above the level of a 4 year old, they would also require an IME as complex as that needed for entering CJK text. From asmusf at ix.netcom.com Wed Apr 2 05:17:35 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 02 Apr 2014 03:17:35 -0700 Subject: Bidi reordering of soft hyphen In-Reply-To: <20140402083624.24b8da63@JRWUBU2> References: <20140402000257.32dd544d@JRWUBU2> <20140402083624.24b8da63@JRWUBU2> Message-ID: <533BE3BF.2010502@ix.netcom.com> On 4/2/2014 12:36 AM, Richard Wordingham wrote: > On Tue, 1 Apr 2014 23:41:48 +0000 > "Whistler, Ken" wrote: > >>> Is it legitimate to truncate the context to a single line? The BiDi >>> algorithm is attempting to interpret unlabelled text as embedded >>> text >>> (it's not an arbitrary dance), and in just one line there is no >>> indicator of whether the hyphen is part of the LTR text embedded in >>> RTL text. > >> For this discussion, I think yes. See Section 3.4 of UAX #9: > >> The following rules describe the logical process of finding the >> correct display order. As opposed to resolution phases, these rules >> act on a per-line basis and are applied after any line wrapping is >> applied to the paragraph. > But it is a *resolution* rule that converts the true hyphen or minus > sign to Bidi Class L; these apply before the scope reduces from > paragraph to line. 
When breaking a line at a soft hyphen, one is essentially modifying the text around the line break for display, because the SHY is not specific as to what should happen (as was the case with German old orthography, the changes go beyond simple substitution of a hyphen). When you change the text, you have to fix up the resolution. A./ > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From asmusf at ix.netcom.com Wed Apr 2 05:19:08 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 02 Apr 2014 03:19:08 -0700 Subject: FYI: More emoji from Chrome In-Reply-To: References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> Message-ID: <533BE41C.1050903@ix.netcom.com> On 4/2/2014 1:42 AM, Christopher Fynn wrote: > Rather than Emoji it might be better if people learnt Han ideographs > which are also compact (and a far more developed system of > communication than emoji). One CJK character can also easily replace > dozens of Latin characters - which is what is being claimed for emoji. One wonders why the Japanese, who already know Han ideographs, took to emoji as they did.... A./ > > On 02/04/2014, "Martin J. D?rst" wrote: >> Now that it's no longer April 1st (at least not here in Japan), I can >> add a (moderately) serious comment. >> >> On 2014/04/02 01:43, Ilya Zakharevich wrote: >>> On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ?? wrote: >>>> More emoji from Chrome: >>>> >>>> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html >>>> >>>> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y >>> I do not know? The demos leave me completely unimpressed: emoji???by >>> their nature???require higher resolution than text, so an emoji for >>> ?pie? does not save any place comparing to the word itself. So the >>> impact of this on everyday English-languare communication would not be >>> in any way beneficial. 
>> This is somewhat different for Japanese (and languages with similar >> writing systems) because they have higher line height. >> >> Regards, Martin. >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From kojiishi at gluesoft.co.jp Wed Apr 2 06:05:22 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Wed, 2 Apr 2014 11:05:22 +0000 Subject: FYI: More emoji from Chrome In-Reply-To: <533BE41C.1050903@ix.netcom.com> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> Message-ID: <14FF721E-8589-498C-BA77-AB05F852048F@gluesoft.co.jp> On Apr 2, 2014, at 7:19 PM, Asmus Freytag wrote: > On 4/2/2014 1:42 AM, Christopher Fynn wrote: >> Rather than Emoji it might be better if people learnt Han ideographs >> which are also compact (and a far more developed system of >> communication than emoji). One CJK character can also easily replace >> dozens of Latin characters - which is what is being claimed for emoji. > > One wonders why the Japanese, who already know Han ideographs, took to emoji as they did.... 
All the ancient emoji characters we inherited from our ancestors were already turned into Han ideographs like this[1][2], so we needed new ones to add more Han ideographs in next centuries ;) [1] http://ameblo.jp/happy2525tkg/entry-11541848940.html [2] http://ameblo.jp/happy2525tkg/entry-11578197418.html /koji From chris.fynn at gmail.com Wed Apr 2 06:08:02 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Wed, 2 Apr 2014 17:08:02 +0600 Subject: FYI: More emoji from Chrome In-Reply-To: <533BE41C.1050903@ix.netcom.com> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> Message-ID: On 02/04/2014, Asmus Freytag wrote: > On 4/2/2014 1:42 AM, Christopher Fynn wrote: >> Rather than Emoji it might be better if people learnt Han ideographs >> which are also compact (and a far more developed system of >> communication than emoji). One CJK character can also easily replace >> dozens of Latin characters - which is what is being claimed for emoji. > > One wonders why the Japanese, who already know Han ideographs, took to > emoji as they did.... Perhaps because emoji are a sort of playful version of a means of communication they are already used to From duerst at it.aoyama.ac.jp Wed Apr 2 06:26:19 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 02 Apr 2014 20:26:19 +0900 Subject: FYI: More emoji from Chrome In-Reply-To: References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> Message-ID: <533BF3DB.1010103@it.aoyama.ac.jp> On 2014/04/02 20:08, Christopher Fynn wrote: > On 02/04/2014, Asmus Freytag wrote: >> On 4/2/2014 1:42 AM, Christopher Fynn wrote: >>> Rather than Emoji it might be better if people learnt Han ideographs >>> which are also compact (and a far more developed system of >>> communication than emoji). 
One CJK character can also easily replace >>> dozens of Latin characters - which is what is being claimed for emoji. >> >> One wonders why the Japanese, who already know Han ideographs, took to >> emoji as they did.... > > Perhaps because emoji are a sort of playful version of a means of > communication they are already used to Yes. Already used to the concept that a character can represent (more or less) a concept. Already used to the concept that there are lots of characters, and a few more won't make such a difference. Already used to the concept that character entry means keying a word or phrase and the selecting what you actually want. But I think the main reason for their spread was that the mobile phone companies introduced them and young people found them cute. In a followup, Line (http://line.me/en/), the most popular Japanese mobile message app (similar to WhatsApp) got popular mostly because of their gorgeous collection of 'stickers' (over 10,000), fortunately after realizing that the technically correct way to deal with them was not squeezing them into the PUA, but treating them as inline images, avoiding headaches down the line for the Unicode Consortium :-). Regards, Martin. From mpsuzuki at hiroshima-u.ac.jp Wed Apr 2 07:00:42 2014 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Wed, 02 Apr 2014 21:00:42 +0900 Subject: ["Unicode"] Re: FYI: More emoji from Chrome In-Reply-To: <533BF3DB.1010103@it.aoyama.ac.jp> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> <533BF3DB.1010103@it.aoyama.ac.jp> Message-ID: <533BFBEA.5040003@hiroshima-u.ac.jp> On 04/02/2014 08:26 PM, "Martin J. 
D?rst" wrote: > On 2014/04/02 20:08, Christopher Fynn wrote: >> On 02/04/2014, Asmus Freytag wrote: >>> On 4/2/2014 1:42 AM, Christopher Fynn wrote: >>>> Rather than Emoji it might be better if people learnt Han ideographs >>>> which are also compact (and a far more developed system of >>>> communication than emoji). One CJK character can also easily replace >>>> dozens of Latin characters - which is what is being claimed for emoji. >>> >>> One wonders why the Japanese, who already know Han ideographs, took to >>> emoji as they did.... >> >> Perhaps because emoji are a sort of playful version of a means of >> communication they are already used to > > Yes. Already used to the concept that a character can represent (more or less) a concept. Already used to the concept that there are lots of characters, and a few more won't make such a difference. Already used to the concept that character entry means keying a word or phrase and the selecting what you actually want. > > But I think the main reason for their spread was that the mobile phone companies introduced them and young people found them cute. > > In a followup, Line (http://line.me/en/), the most popular Japanese mobile message app (similar to WhatsApp) got popular mostly because of their gorgeous collection of 'stickers' (over 10,000), fortunately after realizing that the technically correct way to deal with them was not squeezing them into the PUA, but treating them as inline images, avoiding headaches down the line for the Unicode Consortium :-). Utilization of the words including rarely-used Han ideograph requests the deep knowledge about Chinese classics (except of the cases like "what is the most complex kanji?"). It is too hard for modern Japanese people who prefers video media than text media. I think the wide acceptance of new emojis and stickers (Japanese LINE users call as "stamp") by Japanese young people does not mean that they have something hard to express by existing characters or emoticons. 
Collecting them is something like an ambition to encode all comedy skits. Regards, mpsuzuki From asmusf at ix.netcom.com Wed Apr 2 07:02:54 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 02 Apr 2014 05:02:54 -0700 Subject: FYI: More emoji from Chrome In-Reply-To: <14FF721E-8589-498C-BA77-AB05F852048F@gluesoft.co.jp> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> <14FF721E-8589-498C-BA77-AB05F852048F@gluesoft.co.jp> Message-ID: <533BFC6E.3060308@ix.netcom.com> On 4/2/2014 4:05 AM, Koji Ishii wrote: > On Apr 2, 2014, at 7:19 PM, Asmus Freytag wrote: > >> On 4/2/2014 1:42 AM, Christopher Fynn wrote: >>> Rather than Emoji it might be better if people learnt Han ideographs >>> which are also compact (and a far more developed system of >>> communication than emoji). One CJK character can also easily replace >>> dozens of Latin characters - which is what is being claimed for emoji. >> One wonders why the Japanese, who already know Han ideographs, took to emoji as they did.... 
> All the ancient emoji characters we inherited from our ancestors were already turned into Han ideographs like this[1][2], so we needed new ones to add more Han ideographs in next centuries ;) You may be on to something :) > > [1] http://ameblo.jp/happy2525tkg/entry-11541848940.html > [2] http://ameblo.jp/happy2525tkg/entry-11578197418.html > > /koji > From theppitak at gmail.com Wed Apr 2 07:16:32 2014 From: theppitak at gmail.com (Theppitak Karoonboonyanan) Date: Wed, 2 Apr 2014 19:16:32 +0700 Subject: Unencoded Lao Characters In-Reply-To: <20140329223559.0965a007@JRWUBU2> References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <20140328091547.64b97a4f@JRWUBU2> <20140329223559.0965a007@JRWUBU2> Message-ID: On Sun, Mar 30, 2014 at 5:35 AM, Richard Wordingham wrote: > On Sat, 29 Mar 2014 11:10:52 +0700 > Theppitak Karoonboonyanan wrote, > under topic 'Pali in Thai Script': > >> On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham >> wrote: > >> > An older form of the Lao script is called the Thai Noi script. That >> > script has many of the characters needed. It has the characters, to >> > give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA, >> > DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA >> > and Vocalic R. The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be >> > due to their rarity, as with the lack of Vocalic L. >> >> I don't think so. From my studies so far, Tai Noi script (aka. Lao >> Buhan) writing system was not so different from that of contemporary >> Lao script. Some characters are just obsolete. >> >> In fact, I have been drafting a summarized proposal to encode Tai Noi >> script here: >> >> http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html > > That seems to be based on the analysis that the Tai Noi script is a > form of the Lao script. 
In that case, it ought to address GHA, NYA, > TTHA, NNA, DHA and BHA as seen in inscriptions, recorded for example in > the 1979 MA thesis of Thawaj Poonotoke (???? ???????) at > http://www.khamkoo.com/uploads/9/0/0/4/9004485/thai_noi_palaeography.pdf . I see. As said in the thesis, these Thai-borrowed characters were mostly used by the elites who were influenced by foreign states. That's why I don't find them in palm leaf documents which were inscribed by ordinary people, where the characters were simply borrowed from Tham script, not from (archaic) Thai when in use. And, as also said in the thesis, the official letters (Bai Jum) are not as abundant as palm leaves, and the author himself suggested that studying the writing system used in palm leaves were more useful. That's why most next-generation scholars, including those I consulted, do not mention the one used by the elites in their books at all. At least, they don't suggest it for contemporary use when the script is revivied. Anyway, I think we should take the elite's writing system into account when we encode it. > The Buddhist Institute 'additions' should also be handled. There are > several fonts around that make presumptions about their encoding in > Unicode. I'm not convinced that the old Tai Noi and Buddhist Institute > forms of each of NYA and NNA are the same character - I suspect we may > have four characters here. The two versions of NYA are particularly > difficult to reconcile. Don't you think it's a matter of style, in the same manner that Lao Tham share the same block with Lanna and Khun? > My though on the subscript consonants are: > > 1) The Lao block already has two subscript consonants, U+0EBC LAO > SEMIVOWEL SIGN LO and U+0EBD LAO SEMIVOWEL SIGN NYO, though perhaps > the various forms of the latter need to disunified. How does the > latter's J-shaped glyph kern? I'd rather leave the kerning to fonts (i.e. fonts for contemporary Lao and those for Tai Noi would kern differently). 
For the variations, I'm afraid it's a matter of style again. In case one insists to use different forms in the same document, I'm not sure how Variation Selectors fit? > 2) If we allow the Lao script to be split between planes, subscript > forms could be accommodated in an 'Archaic Lao' block in the SMP. This > would have the advantages that: > > (a) In UTF-8, a subscript consonant would only take 4 bytes, whereas > using a coeng in the BMP would require 6 bytes, 3 for the coeng and and > 3 for the consonant identity. The memory requirement is 4 bytes for > both schemes in UTF-16. > > (b) Distinct subscripts for the same letter can easily be encoded > distinctly. For example, the Lao letters LO, DO and NO can easily be > taken to have two distinct subscript forms, and in the related Thai > Nithet script (?????????????), formerly used in Northern Thailand, one > can argue for four forms of the cluster HO MO - the ligature HO MO (as > LAO HO MO), and HO plus (i) a purely subscript MO (gc=Mn), (ii) > subscript MO with an ascender (gc=Mc), and (iii) a borrowing of Tai > Tham (gc=Mn if treated as a single character). What's the difference between HO plus (i) and HO plus (ii)? I think I haven't seen the former case yet. Yes, the supplement block can be a good alternative, as it can address different forms of subscripts more flexibly. Regards, -- Theppitak Karoonboonyanan http://linux.thai.net/~thep/ From wjgo_10009 at btinternet.com Wed Apr 2 08:01:40 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 2 Apr 2014 14:01:40 +0100 (BST) Subject: Transmission of emoji within plain text messages In-Reply-To: <533BF3DB.1010103@it.aoyama.ac.jp> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> <533BF3DB.1010103@it.aoyama.ac.jp> Message-ID: <1396443700.49390.YahooMailNeo@web87801.mail.ir2.yahoo.com> ""Martin J. 
D?rst"" wrote: > In a followup, Line (http://line.me/en/), the most popular Japanese mobile message app (similar to WhatsApp) got popular mostly because of their gorgeous collection of 'stickers' (over 10,000), fortunately after realizing that the technically correct way to deal with them was not squeezing them into the PUA, but treating them as inline images, avoiding headaches down the line for the Unicode Consortium :-). There is another possible way to proceed, namely to use markup bubbles for transmission and to decode them with a local OpenType colour font, where the glyphs of the decoded items are unmapped. I have successfully tested such a markup bubble technique in monochrome for nine-character markup bubbles, for a different project, using a font made using High-Logic FontCreator 7 and applied using Serif PagePlus X5. William Overington 2 April 2014 From James_Lin at symantec.com Wed Apr 2 12:00:08 2014 From: James_Lin at symantec.com (James Lin) Date: Wed, 2 Apr 2014 10:00:08 -0700 Subject: Emoji In-Reply-To: References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: Emoji or ???, literally means Face word or Face Characters, essentially, provides an emotional state in the context of words. Emoji is very popular in APJ, and specially in Japan where most of your text will contain at least half dozen Emoji characters. Remember, people in Japan spend more than half of their commute in the train, and no talk on the cellphone in the train, so most people text instead. Everyone can guess what are the following emoji that used frequently in Japan: ?(???;)? - worried ?(??????? - happy ?(#`??)? - angry ??_???- confused there is a lot more... On 4/1/14, 11:46 PM, "Christopher Fynn" wrote: >On 02/04/2014, William_J_G Overington wrote: >> For me, an important aspect of emoji is that they are independent of >> language. > >Emoji seem fairly culturally specific. (Maybe the mobile-phone >messaging culture.) 
Kind of shorthand expressions which may be used >with several languages - but not independent of language. I suspect >some of them already convey one thing to a Japanese teenager and quite >another to an American. And if you showed these symbols many people >in other countries they wouldn't have a clue as to what they are >supposed to mean. >_______________________________________________ >Unicode mailing list >Unicode at unicode.org >http://unicode.org/mailman/listinfo/unicode From nospam-abuse at ilyaz.org Wed Apr 2 13:27:23 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 11:27:23 -0700 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? Message-ID: <20140402182723.GA8350@powdermilk> Current (and 7.0.0-tobe) versions do not say much: 23AF HORIZONTAL LINE EXTENSION * used for extension of arrows x (vertical line extension - 23D0) If it is intended to be a variation selector (possibly prepended instead of appended!), then using it with ? should give longer double arrow, and using it with ? should give a longer variant of ?. If it is a glyph, then what is the difference with U+2500 ? ? It looks like then any vertical-positioning distinction is trivially understood from context? Any thoughts? Thanks, Ilya From nospam-abuse at ilyaz.org Wed Apr 2 13:36:45 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 11:36:45 -0700 Subject: HORIZONTAL SCAN LINEs Message-ID: <20140402183645.GA8366@powdermilk> The current version (and 7.0.0-tobe) describe them as: @ Scan lines for terminal graphics @+ The scan line numbers here refer to old, low-resolution technology for terminals, with only 9 scan lines per fixed-size character glyph. Even-numbered scan lines are unified with box-drawing graphics. 23BA HORIZONTAL SCAN LINE-1 23BB HORIZONTAL SCAN LINE-3 23BC HORIZONTAL SCAN LINE-7 23BD HORIZONTAL SCAN LINE-9 Is not it a complete BS? 
Was not this intended to be written similar to: Glyphs for even-numbered scan lines were never defined. The 5th scan line is unified with U+2500 (from box-drawing graphic). Please note that even well-researched fonts (like Symbola) treat them wrongly. Should not another comment (or two) be better added: Line-1 was the top line of the character box, and Line-9 was at the bottom. These characters were intended to connect on the sides, and with 23B8 LEFT VERTICAL BOX LINE 23B9 RIGHT VERTICAL BOX LINE Thanks, Ilya P.S. The references I found are http://lists.freedesktop.org/archives/wayland-devel/2012-May/003687.html http://invisible-island.net/vttest/images/VTTEST-VT100%20character%20sets.png illustrating http://invisible-island.net/vttest/ http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0037.html 5-9 in ftp://kermit.columbia.edu/kermit/ucsterminal/ucsterminal.txt http://paulbourke.net/dataformats/ascii/ (search for: control-backspace) From nospam-abuse at ilyaz.org Wed Apr 2 13:52:53 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 11:52:53 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <20140402185253.GA8489@powdermilk> On Wed, Apr 02, 2014 at 10:00:08AM -0700, James Lin wrote: > Everyone can guess what are the following emoji that used frequently in > Japan: What makes you think so? I would not have a slightest clue what the intended meaning is? > ?(???;)? - worried [I removed the rest since they crash the Web interface to the list anyway: http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0039.html ] Ilya From ken.whistler at sap.com Wed Apr 2 13:56:54 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 2 Apr 2014 18:56:54 +0000 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? 
In-Reply-To: <20140402182723.GA8350@powdermilk> References: <20140402182723.GA8350@powdermilk> Message-ID: Ilya, U+23AF is *definitely* not a variation selector at all. It is part of a set of bracket pieces (and other graphic pieces) in the range U+239B..U+23B1. See discussion of the topic at: http://www.unicode.org/forum/viewtopic.php?f=35&t=206 See also Section 2.13 of UTR #25: http://www.unicode.org/reports/tr25/ which discusses the use of these symbol pieces. It does not specifically talk about the arrow extender pieces, focusing instead on the bracket pieces, but the principles are the same. These glyphic pieces of symbols are only relevant and useful in the context of mathematical typesetting programs like TeX. The set of box drawing characters in the U+2500 block were encoded for compatibility with old character sets that did character cell graphics. So the two are different, but neither set is of much current relevance for general text currently using arrows. --Ken > Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? > > Current (and 7.0.0-tobe) versions do not say much: > > 23AF HORIZONTAL LINE EXTENSION > * used for extension of arrows > x (vertical line extension - 23D0) > > If it is intended to be a variation selector (possibly prepended > instead of appended!), then using it with ? should give longer double > arrow, and using it with ? should give a longer variant of ?. > > If it is a glyph, then what is the difference with U+2500 ? ? It > looks like then any vertical-positioning distinction is trivially > understood from context? > > Any thoughts? Thanks, > Ilya From nospam-abuse at ilyaz.org Wed Apr 2 14:35:28 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 12:35:28 -0700 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? 
In-Reply-To: References: <20140402182723.GA8350@powdermilk> Message-ID: <20140402193528.GA8861@powdermilk> On Wed, Apr 02, 2014 at 06:56:54PM +0000, Whistler, Ken wrote: > Ilya, > > U+23AF is *definitely* not a variation selector at all. > > It is part of a set of bracket pieces (and other graphic pieces) > in the range U+239B..U+23B1. Evidence does not support this (see below). > See discussion of the topic at: > > http://www.unicode.org/forum/viewtopic.php?f=35&t=206 Apparently, this has no relationship to U+23af at all? > See also Section 2.13 of UTR #25: > > http://www.unicode.org/reports/tr25/ Likewise? > which discusses the use of these symbol pieces. It does not > specifically talk about the arrow extender pieces, focusing > instead on the bracket pieces, but the principles are the same. AFAIU, they are not. > These glyphic pieces of symbols are only relevant and useful > in the context of mathematical typesetting programs like TeX. Are you sure? Did you look at Figure 6 in Appendix F of TeXBook? • There are no horizontal extension pieces; • The vertical extension pieces consist of very short chunks (to make size tunable in small increments). The horizontal extension of arrows in TeX is done by macros like \longleftarrow and \Longleftarrow (Appendix B, or just look in plain.tex). However, these macros (again!) have no relation to U+23af, since they need DIFFERENT extension pieces for single/double/triple/etc arrows. ======================================================= In short: if U+23AF were a part of extensible set, it would be short (never saw it like this) AND would have double/triple/etc counterparts. > The set of box drawing characters in the U+2500 block were > encoded for compatibility with old character sets that did > character cell graphics. > > So the two are different, but neither set is of much current > relevance for general text currently using arrows. I won't be so sure. I use extended arrows in "general text"
mode; they are not shown reliably in ANY environment I know. I'd love to know a solution. Thanks, Ilya From jkorpela at cs.tut.fi Wed Apr 2 14:39:10 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Wed, 02 Apr 2014 22:39:10 +0300 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? In-Reply-To: References: <20140402182723.GA8350@powdermilk> Message-ID: <533C675E.6040805@cs.tut.fi> 2014-04-02 21:56, Whistler, Ken wrote: > U+23AF is *definitely* not a variation selector at all. > > It is part of a set of bracket pieces (and other graphic pieces) > in the range U+239B..U+23B1. […] > These glyphic pieces of symbols are only relevant and useful > in the context of mathematical typesetting programs like TeX. I'm not sure whether TeX uses such characters at all. TeX is oriented towards typesetting glyphs, often not caring that much about abstract characters. When I use, say, $$\begin{pmatrix}…\end{pmatrix} in LaTeX to get a nicely formatted array with large parentheses around, I don't think LaTeX internally uses characters like U+239B. On the other hand, such characters can be used in very primitive "typesetting" in a plain text environment under some conditions. For example, to create a largish left parenthesis I could use U+239B U+239C … U+239C U+239D each at the start of a new line: ⎛ ⎜ ⎜ ⎝ This won't work on everyone's email reader, of course. It works in Notepad, for example. On a web page, it works when you set the text solid, with line-height: 1. Of course, there would be the issue of font coverage, but I don't see any particular reason why such characters could not be used in plain text, in word processors, in HTML documents, apart from the practical point that there are usually better alternatives. U+23AF is a simpler building block, but it has its problems, too. Despite the purpose mentioned in a comment in the standard, there is no guarantee that it joins smoothly with adjacent simple arrows.
But of course it is a graphic character, and one that can be expected to have a rather specific shape. It?s not something abstract that says that some arrow should be extended; rather, it can be used as an extension. Yucca From doug at ewellic.org Wed Apr 2 14:44:46 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 02 Apr 2014 12:44:46 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> Ilya Zakharevich wrote: > [I removed the rest since they crash the Web interface to the list > anyway: > http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0039.html > ] I didn't have any trouble viewing James's examples from the Web interface, although of course the private-use characters showed up as dots instead of whatever they were supposed to be. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From nospam-abuse at ilyaz.org Wed Apr 2 15:00:45 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 13:00:45 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> References: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> Message-ID: <20140402200045.GA9154@powdermilk> On Wed, Apr 02, 2014 at 12:44:46PM -0700, Doug Ewell wrote: > Ilya Zakharevich wrote: > > > [I removed the rest since they crash the Web interface to the list > > anyway: > > http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0039.html > > ] > > I didn't have any trouble viewing James's examples from the Web > interface, Yes, I double-checked with wget, and it can retrieve the page fine. So the problem is in Firefox (it shows even the HTML source truncated?). > although of course the private-use characters showed up as > dots instead of whatever they were supposed to be. 
Private-use?! The page is in iso-2022-jp! Anyway, why would private-use show as dots in browsers? If it is defined somewhere in the places the browser looks at, it will be shown THAT way; otherwise the browser?s last-resort would hit (which I never saw to be dots; HEX in Firefox; ?non-squares? ;-] in Chrome). Ilya From rick at unicode.org Wed Apr 2 15:24:26 2014 From: rick at unicode.org (Rick McGowan) Date: Wed, 02 Apr 2014 13:24:26 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> References: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> Message-ID: <533C71FA.2080700@unicode.org> Also, fwiw, the new Mailman archives using Pipermail seem to do better than the "legacy" archives: http://unicode.org/pipermail/unicode/2014-April/000382.html On 4/2/2014 12:44 PM, Doug Ewell wrote: > I didn't have any trouble viewing James's examples from the Web > interface, although of course the private-use characters showed up as > dots instead of whatever they were supposed to be. From richard.wordingham at ntlworld.com Wed Apr 2 15:29:37 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Apr 2014 21:29:37 +0100 Subject: Unencoded Lao Characters In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <20140328091547.64b97a4f@JRWUBU2> <20140329223559.0965a007@JRWUBU2> Message-ID: <20140402212937.1543e9e2@JRWUBU2> On Wed, 2 Apr 2014 19:16:32 +0700 Theppitak Karoonboonyanan wrote: > On Sun, Mar 30, 2014 at 5:35 AM, Richard Wordingham > wrote: > > On Sat, 29 Mar 2014 11:10:52 +0700 > > In that case, it ought to address GHA, NYA, > > TTHA, NNA, DHA and BHA as seen in inscriptions, recorded for > > example in the 1979 MA thesis of Thawaj Poonotoke (???? ???????) at > > http://www.khamkoo.com/uploads/9/0/0/4/9004485/thai_noi_palaeography.pdf . > I see. 
As said in the thesis, these Thai-borrowed characters were > mostly used by the elites who were influenced by foreign states. Are they any more borrowed than the rest of the alphabet? > > I'm not convinced that the old Tai Noi and > > Buddhist Institute forms of each of NYA and NNA are the same > > character - I suspect we may have four characters here. The two > > versions of NYA are particularly difficult to reconcile. > Don't you think it's a matter of style, in the same manner that Lao > Tham share the same block with Lanna and Khun? Perhaps it will work. It's tidier if it does. > > 1) The Lao block already has two subscript consonants, U+0EBC LAO > > SEMIVOWEL SIGN LO and U+0EBD LAO SEMIVOWEL SIGN NYO, though perhaps > > the various forms of the latter need to disunified. How does the > > latter's J-shaped glyph kern? > I'd rather leave the kerning to fonts (i.e. fonts for contemporary > Lao and those for Tai Noi would kern differently). For the > variations, I'm afraid it's a matter of style again. My worry here is with the Khmu usage of the J-shaped glyph. Khmu uses U+0EBD as an initial consonant. If it is kerned in Khmu usage, then there is not a problem. > > ... in the > > related Thai Nithet script (?????????????), formerly used in > > Northern Thailand, one can argue for four forms of the cluster HO > > MO - the ligature HO MO (as LAO HO MO), and HO plus (i) a purely > > subscript MO (gc=Mn), (ii) subscript MO with an ascender (gc=Mc), > > and (iii) a borrowing of Tai Tham (gc=Mn if treated as > > a single character). > > What's the difference between HO plus (i) and HO plus (ii)? > I think I haven't seen the former case yet. It's the same as the difference between U+1A5E TAI THAM CONSONANT SIGN SA and or between U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA and . Richard. 
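The byte counts quoted earlier in this thread (4 bytes in UTF-8 for a single SMP character against 6 bytes for a BMP coeng plus consonant, and 4 bytes in UTF-16 under either scheme) can be checked with a short Python sketch. The code points are stand-ins only, since no 'Archaic Lao' block exists: the Khmer coeng U+17D2 followed by U+1780 plays the part of a BMP coeng sequence, and U+11013 from the Brahmi block plays the part of a hypothetical SMP subscript consonant. The helper name `encoded_sizes` is likewise just an illustration.

```python
# Stand-ins (assumptions, not real Lao encodings):
#   BMP scheme: Khmer coeng U+17D2 followed by the consonant U+1780
#   SMP scheme: one code point from the Brahmi block, U+11013
bmp_coeng_sequence = "\u17D2\u1780"
smp_single_char = "\U00011013"


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    # utf-16-le is used so that no byte order mark is counted.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in (("BMP coeng + consonant", bmp_coeng_sequence),
                    ("single SMP character", smp_single_char)):
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Any BMP code point at or above U+0800 takes 3 bytes in UTF-8 and any supplementary-plane code point takes 4, so the comparison does not depend on which particular stand-ins are chosen.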
From doug at ewellic.org Wed Apr 2 15:31:04 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 02 Apr 2014 13:31:04 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <20140402133104.665a7a7059d7ee80bb4d670165c8327d.08a5508c35.wbe@email03.secureserver.net> Ilya Zakharevich wrote: >> although of course the private-use characters showed up as >> dots instead of whatever they were supposed to be. > > Private-use?! The page is in iso-2022-jp! Anyway, why would > private-use show as dots in browsers? If it is defined somewhere in > the places the browser looks at, it will be shown THAT way; otherwise > the browser?s last-resort would hit (which I never saw to be dots; HEX > in Firefox; ?non-squares? ;-] in Chrome). Sorry, this was my mistake. IE8 on Windows 7 displayed James's "angry" line like this: ?(????????????????????????????(#`??)? - angry The only "private-use" character was something that got transcoded to U+E559, and IE8 displayed that as a space, not a dot. But a quick look at the ISO-2022-JP source shows this isn't right at all. So I guess I did have trouble viewing it, maybe not a crash, but severe mojibake. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From richard.wordingham at ntlworld.com Wed Apr 2 15:46:51 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Apr 2014 21:46:51 +0100 Subject: Bidi reordering of soft hyphen In-Reply-To: <533BE3BF.2010502@ix.netcom.com> References: <20140402000257.32dd544d@JRWUBU2> <20140402083624.24b8da63@JRWUBU2> <533BE3BF.2010502@ix.netcom.com> Message-ID: <20140402214651.5eef9b17@JRWUBU2> On Wed, 02 Apr 2014 03:17:35 -0700 Asmus Freytag wrote: > On 4/2/2014 12:36 AM, Richard Wordingham wrote: > > But it is a *resolution* rule that converts the true hyphen or minus > > sign to Bidi Class L; these apply before the scope reduces from > > paragraph to line. 
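For reference, the Bidi_Class values this thread turns on can be looked up with Python's standard `unicodedata` module. This is only a property lookup, not an implementation of UAX #9, and it says nothing about what a renderer should do with a soft hyphen at a line break, which is the open question here. The dictionary `samples` and the helper `bidi_class` are names made up for this sketch.

```python
import unicodedata

# Characters discussed in the thread.
samples = {
    "SOFT HYPHEN": "\u00AD",
    "HYPHEN-MINUS": "\u002D",
    "LATIN SMALL LETTER C": "c",
}


def bidi_class(ch):
    """Return the Bidi_Class short name for one character."""
    return unicodedata.bidirectional(ch)


for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {bidi_class(ch)}")
```

The soft hyphen reports BN and HYPHEN-MINUS reports ES, so neither carries a strong direction of its own; any direction a visible hyphen ends up with comes from the resolution rules acting on its neighbours.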
> When breaking a line at a soft hyphen, one is essentially modifying > the text around the line break for display, because the SHY is not > specific as to what should happen (as was the case with German old > orthography, the changes go beyond simple substitution of a hyphen). > > When you change the text, you have to fix up the resolution. The argument was based on what happened to U+002D HYPHEN-MINUS. The change to the text then is to replace what is, in code order, 'CARROT IS carrot...' by 'CARROT IS carrot...'. One can even argue that this replacement would result from SHY under the rules of English typography. Reapplying the resolution rules, the left-to-right run now includes, even after truncation, 'car'. Richard. From ken.whistler at sap.com Wed Apr 2 16:08:44 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 2 Apr 2014 21:08:44 +0000 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? In-Reply-To: <533C675E.6040805@cs.tut.fi> References: <20140402182723.GA8350@powdermilk> <533C675E.6040805@cs.tut.fi> Message-ID: Yucca noted: > > These glyphic pieces of symbols are only relevant and useful > > in the context of mathematical typesetting programs like TeX. > > I'm not sure whether TeX uses such characters at all. TeX is oriented > towards typesetting glyphs, often not caring that much about abstract > characters. Yes, I'm not claiming there is any actual use of U+23AF in TeX per se. The original source of U+23AF, by the way, was as a compatibility mapping character for the PostScript "arrowhorizex". See octal 276 in the symbol font encoding for PostScript. The original proposal for the set of these can be seen in L2/99-346 in the UTC document register. The whole set of these pieces was included into the repertoire of mathematical symbols under discussion at the time, and was eventually published as part of Unicode 3.2. --Ken From mark at kli.org Wed Apr 2 18:39:39 2014 From: mark at kli.org (Mark E.
Shoulson) Date: Wed, 02 Apr 2014 19:39:39 -0400 Subject: Emoji In-Reply-To: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <533C9FBB.2020401@kli.org> Relatedly, see https://www.kwikpoint.com/ Basically, little (laminated) booklets of pictures of hopefully understandable items and concepts you can point at in foreign countries. ~mark On 04/02/2014 02:29 AM, William_J_G Overington wrote: > For me, an important aspect of emoji is that they are independent of language. > > They can localize in the mind of the reader. > > How can they express verbs such as need and must; and pronouns? > > How can they express thanks? > > William Overington > > 2 April 2014 > > > > ----- Original Message ----- > From: Christopher Fynn > To: Unicode List > Cc: Nicole Selken > Sent: Wednesday, 2 April 2014, 7:04 > Subject: Re: Emoji > > On 02/04/2014, Nicole Selken wrote: > >> I think Emoji is totally beneficial as a communication form. > A reversion to a crude form of Hieroglyphics? > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From duerst at it.aoyama.ac.jp Wed Apr 2 20:13:51 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Thu, 03 Apr 2014 10:13:51 +0900 Subject: Emoji In-Reply-To: References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <533CB5CF.9010109@it.aoyama.ac.jp> On 2014/04/03 02:00, James Lin wrote: > Emoji or ???, literally means Face word or Face Characters, essentially, Emoji is 絵文字 (picture character); 顔文字 is kaomoji (face character). Regards, Martin. > provides an emotional state in the context of words.
Emoji is very > popular in APJ, and specially in Japan where most of your text will > contain at least half dozen Emoji characters. Remember, people in Japan > spend more than half of their commute in the train, and no talk on the > cellphone in the train, so most people text instead. > > Everyone can guess what are the following emoji that used frequently in > Japan: > > ?(???;)? - worried > > > > ?(??????? - happy > > ?(#`??)? - angry > > ??_???- confused > > > there is a lot more... From verdy_p at wanadoo.fr Wed Apr 2 21:51:26 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 3 Apr 2014 04:51:26 +0200 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <20140402133104.665a7a7059d7ee80bb4d670165c8327d.08a5508c35.wbe@email03.secureserver.net> References: <20140402133104.665a7a7059d7ee80bb4d670165c8327d.08a5508c35.wbe@email03.secureserver.net> Message-ID: There was no such browser bug in my Chrome install (in Gmail), which rendered the full string correctly (no dots, all characters displayed properly). So this looks like a Firefox bug. There's a rendering problem in IE, but no such critical bug that breaks the rest of the page. It looks like a problem of transcoding of emails by browsers (in Gmail, the transcoding to UTF-8 is apparently performed by the web server, so Firefox does not break when rendering the Gmail page). It is possible that what is really broken is in fact another webmail interface that incorrectly transcodes the email to UTF-8. I cannot know if the broken rendering occurred on everyone's client, or whether he uses webmail or a standalone email agent. Many webmail services are broken in how they transcode the mails received in order to embed their content in a UTF-8 web page. From doug at ewellic.org Wed Apr 2 22:18:21 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 2 Apr 2014 21:18:21 -0600 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <201404030319.s333IgoV022247@unicode.org> It's really quite simple: Sending e-mails in ISO-2022-JP to the Unicode mailing list causes problems. ?? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From verdy_p at wanadoo.fr Thu Apr 3 02:25:15 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 3 Apr 2014 09:25:15 +0200 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net> References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net> Message-ID: But then why did I receive these emoticons (using "ASCII art" with additional CJK characters) correctly from the same mailing list? I did not see any dots or squares for PUA, but the stars, triangle and some kana. I don't think that the mailing list itself is broken (or maybe Gmail corrected things afterwards). You cannot always predict the encoding used by the effective sending mail agent (or the first relaying agent). Even Gmail will try to fit the "best" (popular) legacy 8-bit encoding according to content and the recipient (where it will try to geolocate the target domain name, or use its own knowledge of the languages and encodings most often used by senders in that domain), instead of always sending with UTF-8, when the default user settings are for using a "default" encoding if the user had not specified that the mail would be forced to UTF-8.
From doug at ewellic.org Thu Apr 3 14:26:24 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 03 Apr 2014 12:26:24 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <20140403122624.665a7a7059d7ee80bb4d670165c8327d.b5256a3b70.wbe@email03.secureserver.net> Philippe Verdy wrote: >> It's really quite simple: Sending e-mails in ISO-2022-JP to the >> Unicode mailing list causes problems. >> >> ?? > > But then why did I receive these emoticons (using "ASCII art" with additional > CJK characters) correctly from the same mailing list? > I did not see any dots or squares for PUA, but the stars, triangle and > some kana. It was meant as a small joke. It's the *Unicode* mailing list. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From verdy_p at wanadoo.fr Thu Apr 3 15:53:37 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 3 Apr 2014 22:53:37 +0200 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <20140403122624.665a7a7059d7ee80bb4d670165c8327d.b5256a3b70.wbe@email03.secureserver.net> References: <20140403122624.665a7a7059d7ee80bb4d670165c8327d.b5256a3b70.wbe@email03.secureserver.net> Message-ID: Yes, I know, and I noticed it immediately (look at my first message), but we are already discussing something other than Google's joke. From buck at yelp.com Thu Apr 3 18:16:39 2014 From: buck at yelp.com (Buck Golemon) Date: Thu, 3 Apr 2014 16:16:39 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net> Message-ID: I too received the intended emoji via direct email but I see the garbled characters in the web interface: ?(???;)? - worried ?(??????? - happy ?(#`??)? - angry ??_???- confused I believe there is an encoding issue somewhere in the unicode.org/mail-arch toolchain.
From naenaguru at gmail.com Thu Apr 3 21:52:48 2014 From: naenaguru at gmail.com (Naena Guru) Date: Thu, 3 Apr 2014 21:52:48 -0500 Subject: Singhala scirpt ill defined by OpenType standard Message-ID: Here is the proof that OpenType standard defined the Singhala script wrongly. Also find a BNF grammar that describes it. http://ahangama.com/unicode/index.htm Thanks. From kojiishi at gluesoft.co.jp Thu Apr 3 23:48:52 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Fri, 4 Apr 2014 04:48:52 +0000 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net> Message-ID: <76A14357-762B-4A06-BD1E-68CB3C8C452D@gluesoft.co.jp> Go to Encoding menu and choose UTF-8 to fix the garbled characters. It looks like the page is served in UTF-8, but it declares itself as us-ascii: and /koji On Apr 4, 2014, at 8:16 AM, Buck Golemon wrote: I too received the intended emoji via direct email but I see the garbled characters in the web interface: ?(???;)? - worried ?(??????? - happy ?(#`??)? - angry ??_???- confused I believe there is an encoding issue somewhere in the unicode.org/mail-arch toolchain.
From verdy_p at wanadoo.fr Fri Apr 4 00:45:08 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Apr 2014 07:45:08 +0200 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <76A14357-762B-4A06-BD1E-68CB3C8C452D@gluesoft.co.jp> References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net> <76A14357-762B-4A06-BD1E-68CB3C8C452D@gluesoft.co.jp> Message-ID: The content is transferred as UTF-8 at the MIME level for both the plain-text and HTML parts attached: --_000_76A14357762B4A06BD1E68CB3C8C452Dgluesoftcojp_ Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 ... --_000_76A14357762B4A06BD1E68CB3C8C452Dgluesoftcojp_ Content-Type: text/html; charset="utf-8" Content-Transfer-Encoding: base64 ... Normally the XML declaration or meta tag in the (X)HTML headers should be ignored; mail agents will not transform the attachments except possibly changing the content-transfer-encoding (here base64 in both parts). If your mail agent does not process the HTML part, it will render the plain-text version, which has no declaration at all; the MIME content-type will then be the only indication. I don't know how the sending email agent could generate "us-ascii" in XHTML headers, but in fact it should simply be discarded in all cases (in HTML5, us-ascii or iso-8859-1 and their aliases are normally all treated like windows-1252, and "us-ascii" is simply ignored; it is bugged by itself in almost all cases). But here we are not in an HTML5 case; so once the HTML headers are discarded, the next candidate is the MIME part declaration (the transporting layer). As it specifies UTF-8, this should work without forcing the reading email agent to start using its "encoding guessing" magic.
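To illustrate the point about the MIME part declaration: a minimal sketch, assuming Python 3 and its standard `email` package. The sample text and variable names are hypothetical and are not taken from the message under discussion; the message built here is a single text/plain part rather than the multipart message quoted above.

```python
import base64
from email import message_from_string

# Hypothetical sample text; not the content of the message discussed here.
SAMPLE_TEXT = "絵文字 and 顔文字"

# A single-part message declaring charset="utf-8" with a base64 body.
raw_message = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(SAMPLE_TEXT.encode("utf-8")).decode("ascii")
    + "\n"
)

msg = message_from_string(raw_message)
declared = msg.get_content_charset()    # charset parameter of Content-Type
payload = msg.get_payload(decode=True)  # bytes after undoing base64

# Decoding with the declared charset recovers the text.
good = payload.decode(declared)

# Ignoring the declaration and assuming a legacy 8-bit charset gives mojibake.
bad = payload.decode("cp1252", errors="replace")

print(declared)
print(good == SAMPLE_TEXT)
print(bad == SAMPLE_TEXT)
```

The sketch only shows that the bytes decode cleanly when the declared charset is honoured; it does not reproduce what any particular mail agent or archive tool actually did.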
But even in this case, UTF-8 is certainly a better guess than legacy Japanese charsets (using various settings in the reading mail agent, such as specifying a preferred default encoding to use one of these legacy charsets, has no effect: UTF-8 is always used to process the message). So those who see bugs are affected if: - the sender is using an outdated and buggy email agent for composing and sending the mail (one which inserts completely incorrect XHTML headers) - recipients themselves use an outdated and buggy email agent that does not perform the most reasonable processing and guessing steps (or this behavior can only be reproduced by those using an email agent whose user localisation is Japanese). The encoding guesser here is most probably buggy, but it is also affected by the fact that there is not enough contextual content to guess the encoding with good confidence (only a few isolated characters whose use here was discretionary and extremely rare in English text content; those few characters have a near-zero confidence value in English as long as no other East Asian language is used). It looks like the reading email agent does not reach a minimum threshold level of confidence for the guessed encoding, so it seems that the result of the guesser is simply discarded, and then the reading email agent only uses the default user setting of the encoding to use to process messages with unknown/unspecified encodings. I'm not sure it is valid to discard the explicit UTF-8 MIME declaration, which does not come from the encoding guesser, as UTF-8 is now a solid default to use (a default for almost all new IETF standards for a long time now, with a wide majority of software installations effectively using it as their default), notably when it is specified as here. We know that UTF-8 is now the best guess for content at the *worldwide* level. But is UTF-8 still a minority encoding for content exchanged in Japan?
ISO-2022-JP seems very unlikely to be used instead of UTF-8, and I would possibly have expected a Shift-JIS variant instead, if Unicode is still not the best choice for Japan. But if the email agent is on a now antique OS (Windows XP or 2000, installed in its Japanese localisation), maybe that user never updated his agent for that old OS (and it is quite surprising for Japan, which likes to use the newest technology products, except if the reading user is using a tricky installation with lots of personal system settings for "geek" tools that have never been ported to newer OSes). In my opinion we are in an extremely user-specific situation. But I do not see where the mailing list was acting incorrectly (it won't change its settings only for a few "geeks" with tricky installations and using antique software). From chris.fynn at gmail.com Fri Apr 4 02:15:33 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Fri, 4 Apr 2014 13:15:33 +0600 Subject: Singhala scirpt ill defined by OpenType standard In-Reply-To: References: Message-ID: If you think there is a problem with OpenType and Singala, the place to bring that up is on the OpenType list - not the Unicode list.
There are often several different ways of accomplishing things with OpenType - and how you do them also depends on how you design the font. "Creating and Supporting OpenType Fonts for Sinhala Script" is not part of the OpenType specification; it is just a guideline for creating OpenType Singhala fonts based on how Microsoft made their own Singhala font - but some other font maker could do things a little differently - as long as the text renders correctly with the font. If you have problems with the document itself and the use of terms - you should take that up with Microsoft typography and give them suggestions how to fix it. It was probably written by someone who makes the OpenType tables for fonts for many different scripts with no particular knowledge of the Singhala language or of Sanskrit. On 04/04/2014, Naena Guru wrote: > Here is the proof that OpenType standard defined the Singhala script > wrongly. Also find a BNF grammar that describes it. > http://ahangama.com/unicode/index.htm > > Thanks. > From wjgo_10009 at btinternet.com Fri Apr 4 05:09:26 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 4 Apr 2014 11:09:26 +0100 (BST) Subject: Singhala scirpt ill defined by OpenType standard In-Reply-To: References: Message-ID: <1396606166.45507.YahooMailNeo@web87804.mail.ir2.yahoo.com> > If you think there is a problem with OpenType and Singala, the place to bring that up is on the OpenType list ... subscribe: opentype-subscribe at indx.co.uk I seem to remember that joining is by asking the gentleman who runs it, rather than there being an automated system. It is a friendly mailing list with many experts on OpenType amongst the members.
William Overington 4 April 2014 From doug at ewellic.org Fri Apr 4 11:12:59 2014 From: doug at ewellic.org (Doug Ewell) Date: Fri, 04 Apr 2014 09:12:59 -0700 Subject: Singhala scirpt ill defined by OpenType standard Message-ID: <20140404091259.665a7a7059d7ee80bb4d670165c8327d.f174ce1446.wbe@email03.secureserver.net> Christopher Fynn wrote: >> Here is the proof that OpenType standard defined the Singhala script >> wrongly. Also find a BNF grammar that describes it. >> http://ahangama.com/unicode/index.htm > > If you think there is a problem with OpenType and Singala, the place > to bring that up is on the OpenType list - not the Unicode list. If you look at the page Naena cited, you'll see that he conflates Unicode and OpenType -- the page is titled "Unicode misunderstands Singhala script" -- and also that he considers the Unicode encoding of Sinhala to have been motivated by evil intentions: "However, this is threatened by the unscrupulous implementation of Unicode Sinhala code page specification closing door to objective criticism. A nearly decade long intransigence seems to be the willingness among the technocracy in the country to value personal wellbeing over obtaining a successful solution for digitizing Singhala." -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From rick at unicode.org Fri Apr 4 20:49:58 2014 From: rick at unicode.org (Rick McGowan) Date: Fri, 04 Apr 2014 18:49:58 -0700 Subject: Unicode.org scheduled maintenance time on April 5 Message-ID: <533F6146.7060505@unicode.org> On April 5, at 11:00 pm US Central time (0400 GMT on April 6), the Unicode.org server may experience downtime due to some maintenance in the data center. The outage is expected to last less than two hours. 
From mark at macchiato.com Thu Apr 10 02:09:13 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 10 Apr 2014 09:09:13 +0200 Subject: Updated emoji working draft Message-ID: We updated the working draft for emoji info, http://unicode.org/draft/reports/tr51/tr51.html. It now has a more comprehensive list of characters, with images from the various systems (thanks to those who supplied them!). This is still just a working draft, without any UTC status, and may change at any time. Please let me know of any comments before the May UTC. Mark *Il meglio è l'inimico del bene* From wjgo_10009 at btinternet.com Thu Apr 10 08:43:38 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 10 Apr 2014 14:43:38 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: Message-ID: <1397137418.12792.YahooMailNeo@web87803.mail.ir2.yahoo.com> May I mention Chloe and Phil please? Chloe and Phil originated as part of my creative writing in the late 1990s and feature in various animations and songs. http://www.users.globalnet.co.uk/~ngo/cw000000.htm http://www.users.globalnet.co.uk/~ngo/de000000.htm http://www.users.globalnet.co.uk/~ngo/cpjs0001.htm http://www.users.globalnet.co.uk/~ngo/song1008.htm http://www.users.globalnet.co.uk/~ngo/song1011.htm http://www.users.globalnet.co.uk/~ngo/ast02400.htm The whole website has now been archived by the British Library in accordance with the 2013 regulations. http://www.bl.uk/aboutus/legaldeposit/index.html William Overington 10 April 2014
From wjgo_10009 at btinternet.com Sat Apr 12 02:15:11 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 12 Apr 2014 08:15:11 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: Message-ID: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> A multi-colour "multi-do-not" glyph displayed in Venice. https://maps.google.com/?ll=45.435077,12.333736&spn=0.00071,0.001124&t=m&z=19&layer=c&cbll=45.435113,12.333845&panoid=D0xVyae3dpu1Z5dA8nkXyA&cbp=12,292.04,,0,13.68 Please zoom in 3 times and go full screen. William Overington 12 April 2014 From verdy_p at wanadoo.fr Sat Apr 12 03:30:45 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 12 Apr 2014 10:30:45 +0200 Subject: Updated emoji working draft In-Reply-To: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> References: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> Message-ID: Clearly not a glyph but a free logographic composition with icons. Such a composition pattern is in fact very common, not exclusive to this place, and the various sub-icons will change in all aspects: number of objects, placement, color, relative sizes, and drawing styles (photos may be used as well). The types of restricted objects here are just foods and drinks (including cans), but there are other items commonly found: smoking cigarettes, pet dogs (except those specifically trained and equipped for guiding blind/handicapped people, as stated by law in some countries), umbrellas, shoes (near pools), trousers/shorts (only trunks allowed in pools), rollers, skis, radios/audio devices, mobile phones, fire (in natural areas), person shouting (keep silent), children... From wjgo_10009 at btinternet.com Sat Apr 12 03:45:09 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 12 Apr 2014 09:45:09 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> Message-ID: <1397292309.25459.YahooMailNeo@web87806.mail.ir2.yahoo.com> I found this rather fine example of some "do not" signs in use in the Uffizi Gallery in Florence. It is zoomed-in, so one needs to zoom-out, three times, in order to be able to move around in the simulation. https://maps.google.com/?ll=43.768808,11.256574&spn=0.000715,0.001124&t=m&z=19&layer=c&cbll=43.768802,11.256064&panoid=qjjyp6ul4nnS4zBqcRzSJQ&cbp=12,273.36,,3,12.92 Are such symbols emoji? In the future, perhaps there will be a colour font useful for making such signs. William Overington 12 April 2014 From wjgo_10009 at btinternet.com Sat Apr 12 04:46:43 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 12 Apr 2014 10:46:43 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: Message-ID: <1397296003.47902.YahooMailNeo@web87801.mail.ir2.yahoo.com> The document http://unicode.org/draft/reports/tr51/tr51.html at present includes the following. quote There is one further kind of label, called a "read-out", for text-to-speech. For accessibility when reading text, it is useful to have a semi-unique name for an emoji character. The Unicode character name can often serve as a basis for this, but its requirements for uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ⏭. Note that the labels need to be in each user's language to useful. They cannot simply be a translation of an English label, since different words, or even different categorizations, may be what is expected in different languages. The terms given in the data files here have been collected from different sources.
They are only initial suggestions, not expected to be complete, and only in English.

end quote

In March 2014 I published the attached document, depositing a copy with the British Library.

The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf

Is this format suitable to become standardized for use in producing localized text-to-speech from emoji to the chosen local language?

William Overington

12 April 2014

-------------- next part --------------
A non-text attachment was scrubbed...
Name: The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf
Type: application/pdf
Size: 32815 bytes
Desc: not available
URL:

From jf at colson.eu Sat Apr 12 09:36:29 2014
From: jf at colson.eu (Jean-François Colson)
Date: Sat, 12 Apr 2014 16:36:29 +0200
Subject: IPA and unofficial extensions
Message-ID: <53494F6D.8070501@colson.eu>

Hello

I have a few questions about the IPA and about its unofficial extensions.

In the consonant charts at https://upload.wikimedia.org/wikipedia/commons/f/f5/IPA_chart_2005_png.svg there are a few grey symbols which are already in the IPA: ????.

There are also three symbols I didn't find:
? palatal lateral fricative (Latin small letter turned y with belt)
? velar lateral fricative (Latin letter small capital l with belt)
? retroflex lateral flap (Latin small letter turned r with long leg and retroflex hook)
Is there a proposal including those three letters?

In the suprasegmentals, I found an "extra stress" character. It looks like a double primary stress ˈˈ. Is that the right way to write it? Would a new character be required?

What about the strident diacritic in the diacritics table? Is it right to use the tilde below twice (n̰̰ a̰̰) or would a new diacritic (combining double tilde below) be proposed?

Voiced bilabial fricative.
Presently, for this letter, the Greek letter β is used. The Latin letter ꞵ (U+A7B5 Latin small letter beta) is about to be accepted.
Would it be used instead of the Greek letter in IPA?

Voiceless uvular fricative.
Presently, for this letter, the Greek letter χ is used. Phonetic letters for German dialectology are about to be accepted. I've seen in several proposals that it includes a Latin small letter stretched x, but several code points were proposed for it in several proposals and I don't know where is the last one.
Would it be used instead of the Greek letter in IPA? The Greek chi has a wavy line while the stretched x is very similar to and confusable with the standard x.

From mark at macchiato.com Sat Apr 12 09:39:52 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sat, 12 Apr 2014 16:39:52 +0200
Subject: Updated emoji working draft
In-Reply-To: <1397296003.47902.YahooMailNeo@web87801.mail.ir2.yahoo.com>
References: <1397296003.47902.YahooMailNeo@web87801.mail.ir2.yahoo.com>
Message-ID:

On 12 April 2014 11:46, William_J_G Overington wrote:

...

> In March 2014 I published the attached document, depositing a copy with the
> British Library.
>
> The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf
>
> Is this format suitable to become standardized for use in producing
> localized text-to-speech from emoji to the chosen local language?

no, not particularly

Mark

*Il meglio è l'inimico del bene*

From wjgo_10009 at btinternet.com Sat Apr 12 09:54:54 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 12 Apr 2014 15:54:54 +0100 (BST)
Subject: Updated emoji working draft
Message-ID: <1397314494.29013.YahooMailNeo@web87805.mail.ir2.yahoo.com>

The document http://unicode.org/draft/reports/tr51/tr51.html at present includes the following.

quote

The longer-term goal for implementations should be to support embedded graphics. That would allow arbitrary emoji characters, and not be dependent on additional Unicode encoding.
end quote Would it be good, for an emoji that is not encoded in regular Unicode, to include mention of the possibility of transmission by markup bubble, rendered upon reception as an unmapped glyph by an OpenType colour font? For example, as nine Unicode characters. COLON COLON U1 U2 U3 U4 U5 COLON SEMICOLON This would perhaps not always allow new emoji to be added as quickly as with embedded graphics, yet with this technique, the message could be archived as plain text and would be searchable and text-to-speech would be possible at the receiving end. William Overington 12 April 2014 From mark at macchiato.com Sat Apr 12 11:07:42 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 12 Apr 2014 18:07:42 +0200 Subject: Updated emoji working draft In-Reply-To: <1397314494.29013.YahooMailNeo@web87805.mail.ir2.yahoo.com> References: <1397314494.29013.YahooMailNeo@web87805.mail.ir2.yahoo.com> Message-ID: On 12 April 2014 16:54, William_J_G Overington wrote: > Would it be good, for an emoji that is not encoded in regular Unicode, to > include mention of the possibility of transmission by markup bubble, > rendered upon reception as an unmapped glyph by an OpenType colour font? > > For example, as nine Unicode characters. > > COLON COLON U1 U2 U3 U4 U5 COLON SEMICOLON > > This would perhaps not always allow new emoji to be added as quickly as > with embedded graphics, yet with this technique, the message could be > archived as plain text and would be searchable and text-to-speech would be > possible at the receiving end. > ?I don't think anything like what you suggest would be feasible, or desirable. Longer term, I think the most feasible approach is the interchange of embedded graphics, which can always have alt values (at least in html) for readings. Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Mon Apr 14 02:27:17 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 14 Apr 2014 08:27:17 +0100 (BST) Subject: Updated emoji working draft Message-ID: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> >> Is this format suitable to become standardized for use in producing localized text-to-speech from emoji to the chosen local language? > no, not particularly Thank you for replying. Well, I feel that it would be good if a format, whatever it may be, were decided at the May 2014 UTC (Unicode Technical Committee) meeting. This would enable application implementers to use a standardized format with a standardized file name; and enable advocates for localization to a particular language to produce a localization file for that particular language confident that the file produced would be widely compatible with various applications, such as browsers and email clients and ebook readers, from various manufacturers. The particular file format that I mentioned is a simplified variant of an earlier format that I produced for my research. The original format contains a facility for organizing a cascading menu system for use in generating messages as well. Yet, in general, what features are needed for such a format and can such a format become specified in good time for discussion before the May 2014 UTC meeting? William Overington 14 April 2014 From wjgo_10009 at btinternet.com Mon Apr 14 03:54:26 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 14 Apr 2014 09:54:26 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> Message-ID: <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> > I don't think anything like what you suggest would be feasible, or desirable. 
Decoding a nine-character markup bubble to an unmapped monochrome glyph using an OpenType font is known to be possible as I have done that using a font made using High-Logic FontCreator 7 with the font tested using Serif PagePlus X5. Colour fonts are now available and OpenType COLR/CPAL colour fonts can be made using FontCreator 7.5. I do not at present have access to an application that is both OpenType-aware and that can also support colour fonts. Some people might have access to such an application: I do not know at present. William Overington 14 April 2014 From wjgo_10009 at btinternet.com Mon Apr 14 09:01:11 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 14 Apr 2014 15:01:11 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> Message-ID: <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com> Here are two examples each of a symbol together with accompanying text in Venice. The symbol is global and the text is local. https://maps.google.com/maps?q=Venice,+Italy&hl=en&ll=45.432399,12.337928&spn=0.000702,0.001124&sll=37.0625,-95.677068&sspn=26.039016,36.826172&oq=venice&hnear=Venice,+Veneto,+Italy&t=m&layer=c&cbll=45.432473,12.337638&panoid=YazHmOmqVm1q5CZ2H7klMQ&cbp=12,16.36,,0,8.23&z=19 Going full screen and zooming-in is helpful. 
William Overington 14 April 2014 From mark at macchiato.com Mon Apr 14 11:08:23 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 14 Apr 2014 18:08:23 +0200 Subject: Updated emoji working draft In-Reply-To: <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com> References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com> Message-ID: This is really off topic. If you want to start up a thread about this, please use a different subject. Mark *? Il meglio ? l?inimico del bene ?* On 14 April 2014 16:01, William_J_G Overington wrote: > Here are two examples each of a symbol together with accompanying text in > Venice. > > The symbol is global and the text is local. > > > https://maps.google.com/maps?q=Venice,+Italy&hl=en&ll=45.432399,12.337928&spn=0.000702,0.001124&sll=37.0625,-95.677068&sspn=26.039016,36.826172&oq=venice&hnear=Venice,+Veneto,+Italy&t=m&layer=c&cbll=45.432473,12.337638&panoid=YazHmOmqVm1q5CZ2H7klMQ&cbp=12,16.36,,0,8.23&z=19 > > Going full screen and zooming-in is helpful. > > William Overington > > 14 April 2014 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Mon Apr 14 11:51:38 2014 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Mon, 14 Apr 2014 18:51:38 +0200 Subject: IPA and unofficial extensions In-Reply-To: <53494F6D.8070501@colson.eu> References: <53494F6D.8070501@colson.eu> Message-ID: <534C121A.6030903@gmail.com> Le 12/04/2014 16:36, Jean-Fran?ois Colson a ?crit : > > I have a few questions about the IPA and about its unofficial extensions. > > In the consonant charts at > https://upload.wikimedia.org/wikipedia/commons/f/f5/IPA_chart_2005_png.svg > there are a few grey symbols which are already in the IPA: ????. 
You probably meant in Unicode

> There are also three symbols I didn't find:
> ? palatal lateral fricative (Latin small letter turned y with belt)
> ? velar lateral fricative (Latin letter small capital l with belt)
> ? retroflex lateral flap (Latin small letter turned r with long leg
> and retroflex hook)
> Is there a proposal including those three letters?

They are in the SIL PUA at positions F267, F268 and F269. You have more details at http://scripts.sil.org/cms/scripts/page.php?item_id=SILPUAassignments , but the 6.2a version of it ( http://scripts.sil.org/cms/scripts/render_download.php?format=file&media_id=SILCorpPUAAssign20130215&filename=SILCorpPUAAssign20130215_6.2a.zip ) does not list them as proposed to Unicode. I could not find any proposal for them, but the related LATIN CAPITAL LETTER L WITH BELT, used in the Alabama (Alibamu) language (aks), has been proposed in L2/12-080 by Joshua M Jensen and Karl Pentzlin ( http://www.unicode.org/L2/L2012/12080-l-with-belt.pdf ) and has been accepted for Unicode 7.0 as U+A7AD ( http://www.unicode.org/charts/PDF/Unicode-7.0/U70-A720.pdf )

> In the suprasegmentals, I found an "extra stress" character.
> It looks like a double primary stress ˈˈ. Is that the right way to
> write it? Would a new character be required?

That is the way used to write it on Wikipedia at least (e.g. here https://en.wikipedia.org/wiki/Stress_%28linguistics%29 and here https://en.wikipedia.org/wiki/Obsolete_and_nonstandard_symbols_in_the_International_Phonetic_Alphabet ). A new character does not seem to be required.

> What about the strident diacritic in the diacritics table? Is it right
> to use the tilde below twice (n̰̰ a̰̰) or would a new diacritic (combining
> double tilde below) be proposed?

I think the correct encoding is indeed two uses of the tilde below and no new character is needed

> Voiced bilabial fricative.
> Presently, for this letter, the Greek letter β is used. The Latin
> letter ꞵ
(U+A7B5 Latin small letter beta) is about to be accepted.
> Would it be used instead of the Greek letter in IPA?
>
> Voiceless uvular fricative.
> Presently, for this letter, the Greek letter χ is used. Phonetic
> letters for German dialectology are about to be accepted. I've seen in
> several proposals that it includes a Latin small letter stretched x,
> but several code points were proposed for it in several proposals and
> I don't know where is the last one.

The last encoding is U+AB53 LATIN SMALL LETTER CHI, as you can see here http://www.unicode.org/charts/PDF/Unicode-7.0/U70-AB30.pdf . I suppose its final name follows the reasoning of this proposal http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4296.pdf which, I guess, was following this thread of the mailing list: http://www.unicode.org/mail-arch/unicode-ml/y2012-m06/0048.html

> Would it be used instead of the Greek letter in IPA? The Greek chi has
> a wavy line while the stretched x is very similar to and confusable
> with the standard x.

The wavy line of the Greek chi is optional. In some sans-serif fonts x and χ (chi) look exactly the same, so that Greek and Latin text blend better together. This is, for example, the design choice taken by the Ubuntu font (http://font.ubuntu.com/), and it makes this font difficult to use for mathematical and phonetic texts. I feel that the stretched x form is then more distinct, because of its different height, and having it not contrast with x simply makes no sense to any sane typographer.

The encoding of these two "Greek-Latin" letters used in IPA (together with theta) is a subject of discussion which comes up every few years on this list. The three following blog posts from Michael Everson and John Wells in 2010 (as well as the comments) discuss the unfortunate effect of their unification with Latin.
http://evertype.com/blog/blog/2010/07/23/latin-and-greek-a-problem-for-the-ipa/
http://phonetic-blog.blogspot.fr/2010/07/disunification-1.html
http://phonetic-blog.blogspot.fr/2010/07/disunification-2.html

Only the Latin theta is currently missing, but since it is used in some Native American, Unifon and Rromani orthographies, it will be integrated into Unicode at some future date. And it is indeed proposed here http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4262.pdf

So some people will voluntarily use the Latin version of these letters for IPA, and I'm almost sure Michael Everson will tell you that it is the right thing to do...

Frédéric Grosshans

From wjgo_10009 at btinternet.com Tue Apr 15 06:14:48 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 15 Apr 2014 12:14:48 +0100 (BST)
Subject: Updated emoji working draft
In-Reply-To: <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com>
Message-ID: <1397560488.5735.YahooMailNeo@web87803.mail.ir2.yahoo.com>

> This is really off topic. If you want to start up a thread about this, please use a different subject.

Well, perhaps I may explain why I consider the post to be on topic.

The document http://unicode.org/draft/reports/tr51/tr51.html at present includes the following.

quote

There is one further kind of label, called a "read-out", for text-to-speech. For accessibility when reading text, it is useful to have a semi-unique name for an emoji character. The Unicode character name can often serve as a basis for this, but its requirements for uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ⏭. Note that the labels need to be in each user's language to useful.
They cannot simply be a translation of an English label, since different words, or even different categorizations, may be what is expected in different languages. The terms given in the data files here have been collected from different sources. They are only initial suggestions, not expected to be complete, and only in English. end quote If the UTC (Unicode Technical Committee) accepts the introduction of read-out labels, each read-out label both linked to a pictograph character and also linked to a language-localization text string, then that will be a far-reaching enhancement to Unicode which may have enormous implications for facilitating communication through the language barrier. Although suggested in the draft as for use in text-to-speech, a read-out label could also be displayed as text, either in addition to the pictograph or instead of the pictograph. The linked picture in my post contains two examples, each of which may, in the present context be regarded as a pictograph and a read-out label text string displayed together. Imagine please if museum and art gallery websites each were to have an international webpage in its on-line shop. If there were on the webpage emoji for Surname, Forename, Delivery address, Card number, Card expiry date and so on and the end-user could display text in his or her own language by displaying the appropriate read-out label next to the pictograph of the emoji, then that could be very helpful. 
William Overington 15 April 2014 From mark at macchiato.com Tue Apr 15 07:57:52 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 15 Apr 2014 14:57:52 +0200 Subject: Updated emoji working draft In-Reply-To: <1397560488.5735.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com> <1397560488.5735.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: On 15 April 2014 13:14, William_J_G Overington wrote: > If the UTC (Unicode Technical Committee) accepts the introduction of > read-out labels, each read-out label both linked to a pictograph character > and also linked to a language-localization text string, then that will be a > far-reaching enhancement to Unicode which may have enormous implications > for facilitating communication through the language barrier. > > If the UTC (Unicode Technical Committee) accepts the introduction of read-out labels The passage just points out that those can exist, the document does not provide any data for that. > If there were on the webpage emoji for Surname, Forename, Delivery address, Card number I can't see any possible future in which emoji like that are encoded. As I said before, please move this discussion to another email subject. Otherwise, I'll take a step I should have long ago, and simply filter out all email coming from you. Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From petercon at microsoft.com Tue Apr 15 11:29:23 2014 From: petercon at microsoft.com (Peter Constable) Date: Tue, 15 Apr 2014 16:29:23 +0000 Subject: Updated emoji working draft In-Reply-To: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> Message-ID: <21684ddd2077444f8c9cd2fc30b40bee@BL2PR03MB450.namprd03.prod.outlook.com> William, the UTC is not in the business of creating file formats for localization data. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington Sent: April 14, 2014 12:27 AM To: Unicode Public; Mark Davis ?? Cc: wjgo_10009 at btinternet.com Subject: Re: Updated emoji working draft >> Is this format suitable to become standardized for use in producing localized text-to-speech from emoji to the chosen local language? > no, not particularly Thank you for replying. Well, I feel that it would be good if a format, whatever it may be, were decided at the May 2014 UTC (Unicode Technical Committee) meeting. This would enable application implementers to use a standardized format with a standardized file name; and enable advocates for localization to a particular language to produce a localization file for that particular language confident that the file produced would be widely compatible with various applications, such as browsers and email clients and ebook readers, from various manufacturers. The particular file format that I mentioned is a simplified variant of an earlier format that I produced for my research. The original format contains a facility for organizing a cascading menu system for use in generating messages as well. Yet, in general, what features are needed for such a format and can such a format become specified in good time for discussion before the May 2014 UTC meeting? 
William Overington 14 April 2014 _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From wjgo_10009 at btinternet.com Wed Apr 16 06:04:07 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 16 Apr 2014 12:04:07 +0100 (BST) Subject: The application of localized read-out labels Message-ID: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> > William, the UTC is not in the business of creating file formats for localization data. > Peter Thank you for replying. Feeling that a format for the particular application is important I have now produced a format myself and published it. Please find a copy attached. Posting the publication as an attachment here will also hopefully place it in the mailing list archives for long-term availability. I have also sent a copy to the British Library for Legal Deposit. The publication has the following title. The format of the readouts.dat file suggested for possible use in the application of localized read-out labels The file has the following file name. The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf William Overington 16 April 2014 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf Type: application/pdf Size: 34283 bytes Desc: not available URL: From chris.fynn at gmail.com Wed Apr 16 14:41:30 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Thu, 17 Apr 2014 01:41:30 +0600 Subject: Updated emoji working draft In-Reply-To: <21684ddd2077444f8c9cd2fc30b40bee@BL2PR03MB450.namprd03.prod.outlook.com> References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <21684ddd2077444f8c9cd2fc30b40bee@BL2PR03MB450.namprd03.prod.outlook.com> Message-ID: On 15/04/2014, Peter Constable wrote: > William, the UTC is not in the business of creating file formats for > localization data. > > Peter Yes a proper understanding of what is the scope of Unicode - and what is not within that scope - might help. From chris.fynn at gmail.com Wed Apr 16 15:07:52 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Thu, 17 Apr 2014 02:07:52 +0600 Subject: The application of localized read-out labels In-Reply-To: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> References: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> Message-ID: On 16/04/2014, William_J_G Overington wrote: > >> William, the UTC is not in the business of creating file formats for >> localization data. > >> Peter > > Thank you for replying. > > Feeling that a format for the particular application is important I have now > produced a format myself and published it. Whether or not it is important, it is clearly beyond the defined scope of Unicode so off-topic here. 
From wjgo_10009 at btinternet.com Thu Apr 17 01:14:55 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 17 Apr 2014 07:14:55 +0100 (BST) Subject: The application of localized read-out labels In-Reply-To: References: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> Message-ID: <1397715295.40851.YahooMailNeo@web87802.mail.ir2.yahoo.com> Christopher Fynn wrote: > Whether or not it is important, it is clearly beyond the defined scope of Unicode so off-topic here. Well, whether it is beyond the scope of Unicode is not the issue here when considering whether it was reasonable for me to have made the post. The issue is whether it is beyond the scope of the Unicode Public Email List. Please consider the following web page. http://www.unicode.org/consortium/distlist-unicode.html That page includes the following. quote Discussion list for Unicode and general internationalization issues About 700 members world-wide, discuss such subjects as: implementing the Unicode Standard, discussion of new proposals, etc. end quote This mailing list is not only about producing the Unicode Standard, it is also for matters considering implementing end-user projects that use Unicode characters. Going back to the issue of whether the post is relevant to Unicode as such, the fact of the matter is, that at the time of posting, and indeed now, the document at the following webpage is implicitly due to be put before the UTC (Unicode Technical Committee) meeting in May 2014. http://unicode.org/draft/reports/tr51/tr51.html That page includes the following. quote There is one further kind of label, called a "read-out", for text-to-speech. For accessibility when reading text, it is useful to have a semi-unique name for an emoji character. The Unicode character name can often serve as a basis for this, but its requirements for uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ?. 
Note that the labels need to be in each user?s language to useful. They cannot simply be a translation of an English label, since different words, or even different categorizations, may be what is expected in different languages. The terms given in the data files here have been collected from different sources. They are only initial suggestions, not expected to be complete, and only in English. end quote So as the matter is raised in the Unicode Public Email List and is due to go before the UTC in May 2014, then I opine that it is both reasonable and within the rules to discuss the implications of the practical application of read-out labels in the Unicode Public Email List. In fact, I did not know of this concept of a read-out label in relation to emoji characters before I read that text. I feel that it is an important matter. It remains until the meeting for it to become clear what the UTC decides is relevant to its scope. Until recently, character colour was not in scope, everything was monochrome only. Now character colour is in scope. So until the Chair of the meeting reaches that topic, who can say what the UTC will decide to be in scope? I am pleased that the pdf document has been circulated around the world and I hope that it will be of practical use in relation to accessibility. If the format is used by software manufacturers and by people producing specific localization files so that interoperability is achieved, then that will be a good result. It would be of great help if the UTC chooses to participate and I hope that it does, yet if that is not possible the format can still be applied by end-users of the Unicode Standard. Here is a link to another item about accessibility that I produced some years ago. 
http://www.users.globalnet.co.uk/~ngo/spec0001.htm

William Overington

17 April 2014

From corbett.dav at husky.neu.edu Thu Apr 17 08:39:20 2014
From: corbett.dav at husky.neu.edu (David Corbett)
Date: Thu, 17 Apr 2014 09:39:20 -0400
Subject: IPA and unofficial extensions
Message-ID:

> > What about the strident diacritic in the diacritics table? Is it right
> > to use the tilde below twice (n̰̰ a̰̰) or would a new diacritic (combining
> > double tilde below) be proposed?
> I think the correct encoding is indeed two uses of the tilde below and
> no new character is needed

The strident diacritic is U+1DFD COMBINING ALMOST EQUAL TO BELOW. L2/07-334R explains:

COMBINING ALMOST EQUAL TO BELOW could possibly be considered COMBINING TILDE BELOW + COMBINING TILDE BELOW. However, COMBINING ALMOST EQUAL TO BELOW is one character representing strident vowels. It does not represent creaky voiced which is what tilde below represents in the IPA. COMBINING ALMOST EQUAL TO ABOVE exists in Unicode and we believe the COMBINING ALMOST EQUAL TO BELOW should be added for linguistic usage as well.

From doug at ewellic.org Thu Apr 17 13:43:47 2014
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 17 Apr 2014 11:43:47 -0700
Subject: The application of localized read-out labels
Message-ID: <20140417114347.665a7a7059d7ee80bb4d670165c8327d.a0ad58431a.wbe@email03.secureserver.net>

Christopher Fynn wrote:

>> Feeling that a format for the particular application is important I
>> have now produced a format myself and published it.
>
> Whether or not it is important, it is clearly beyond the defined scope
> of Unicode so off-topic here.

It's labeled prominently as a "thought experiment," which means there is no expectation that anyone will implement the format or software which reads it, only think about what would happen if it were implemented.

I actually read through the document, 18-point body type and all, before noticing this key point.
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From eliz at gnu.org Sun Apr 20 05:24:22 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 20 Apr 2014 13:24:22 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 Message-ID: <83ha5ofgdl.fsf@gnu.org> Would someone please help understand the following subtleties and obscure language in the UBA document found at http://www.unicode.org/reports/tr9/? Thanks in advance. 1. In paragraph 3.1.2, near its very end, we have this sentence (with my emphasis): As rule X10 will specify, an isolating run sequence is the unit to which the rules following it are applied, and the last character of ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ one level run in the sequence is considered to be immediately followed by the first character of the next level run in the sequence during this phase of the algorithm. What does it mean here by "the rules following it"? Following what? 2. In BD16 (paragraph 3.1.3), the 1st bullet says: . Create a stack for elements each consisting of a bracket character and a text position. Initialize it to empty. But then 1st sub-bullet of the 3rd bullet says: . If an opening paired bracket is found, push its Bidi_Paired_Bracket property value and its text position onto the stack. But the stack does not hold values of Bidi_Paired_Bracket property, it holds characters. Items 2 and 3 below that say: 2. Compare the closing paired bracket being inspected or its canonical equivalent to the bracket in the current stack element. 3. If the values match, meaning the two characters form a bracket pair, then [...] So I guess the 1st bullet is correct, but the 3rd bullet should say "... push the opening paired bracket character and its text position onto the stack". Is this the correct interpretation? 3. Paragraph 3.3.2 says, under "Non-formatting characters": X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, FSI, and PDI: . 
Set the current character’s embedding level to the embedding level of the last entry on the directional status stack. [...] Note that the current embedding level is not changed by this rule. What does this last sentence mean by "the current embedding level"? The first bullet of X6 mandates that "the current character’s embedding level" _is_ changed by this rule, so what other "current embedding level" is alluded to here? 4. Rule X10 says in its last bullet: Apply rules W1–W7, N0–N2, and I1–I2, in the order in which they appear below, to each of the isolating run sequences, applying one rule to all the characters in the sequence in the order in which they occur in the sequence before applying another rule to any part of the sequence. The order that one isolating run sequence is treated relative to another does not matter. Does the last sentence mean that it is OK to apply W1 to the 1st isolating sequence, then apply W1 to the second isolating sequence, then apply W2 to the 1st isolating sequence, followed by W2 application to the 2nd isolating sequence, etc.? IOW, the last sentence refers to the order of processing between the isolating run sequences, but says nothing about the order of applying rules between the sequences. 5. Rule N0 says: . For each bracket-pair element in the list of pairs of text positions a. Inspect the bidirectional types of the characters enclosed within the bracket pair. b. If any strong type (either L or R) matching the embedding direction is found, set the type for both brackets in the pair to match the embedding direction. First, what is meant here by "strong type [...] matching the embedding direction"? Does the "match" here consider only the odd/even value of the current embedding level vs R/L type, in the sense that odd levels "match" R and even levels "match" L? Or does this mean some other kind of matching?
Table 3, which the only place that seems to refer to the issue, is not entirely clear, either: e The text ordering type (L or R) that matches the embedding level direction (even or odd). Again, the sense of the "match" here is not clear. Next, what is meant here by "the characters enclosed within the bracket pair"? If the bracket pair encloses another bracket pair, which is inner to it, do the characters inside the inner pair count for the purposes of resolving the level of the outer pair? Lastly, I presume that by "the bidirectional types of the enclosed characters" the text means the resolved types as modified by the preceding phases, not the original types. Is that correct? Again, thanks in advance for any help. From asmusf at ix.netcom.com Sun Apr 20 14:58:23 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 20 Apr 2014 12:58:23 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83ha5ofgdl.fsf@gnu.org> References: <83ha5ofgdl.fsf@gnu.org> Message-ID: <535426DF.9020308@ix.netcom.com> On 4/20/2014 3:24 AM, Eli Zaretskii wrote: > Would someone please help understand the following subtleties and > obscure language in the UBA document found at > http://www.unicode.org/reports/tr9/? Thanks in advance. Eli, I've tried to give you some explanations - in some places, I concur with you that the wording could be improved and that such improved wording should be proposed to the UTC (or its editorial committee) for incorporation into a future update. For details, see below. > > 1. In paragraph 3.1.2, near its very end, we have this sentence (with > my emphasis): > > As rule X10 will specify, an isolating run sequence is the unit to > which the rules following it are applied, and the last character of > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > one level run in the sequence is considered to be immediately > followed by the first character of the next level run in the > sequence during this phase of the algorithm. 
> > What does it mean here by "the rules following it"? Following what? That looks like a bad referent, but from context, this "it" must be X10. > > 2. In BD16 (paragraph 3.1.3), the 1st bullet says: > > . Create a stack for elements each consisting of a bracket character > and a text position. Initialize it to empty. > > But then 1st sub-bullet of the 3rd bullet says: > > . If an opening paired bracket is found, push its > Bidi_Paired_Bracket property value and its text position onto > the stack. > > But the stack does not hold values of Bidi_Paired_Bracket property, it > holds characters. The Bidi_Paired_Bracket property is a character code (it is the character code of the other partner in the pair). > Items 2 and 3 below that say: > > 2. Compare the closing paired bracket being inspected or its > canonical equivalent to the bracket in the current stack > element. > 3. If the values match, meaning the two characters > form a bracket pair, then [...] > > So I guess the 1st bullet is correct, but the 3rd bullet should say > "... push the opening paired bracket character and its text position > onto the stack". Is this the correct interpretation? What's really required is that the stack contain a unique identifier for each bracket pair, so that, given a function that maps either opening or closing brackets (or their canonical equivalents) to this id, one can determine that both characters belong to the same pair. This unique id could be the opening or the closing bracket (or its canonical equivalent); it makes no practical difference. However, it looks like UAX#9 is written in terms of the code point for the closing bracket. Bullet 1 could be changed to . Create a stack for elements each consisting of a *code point* (Bidi_Paired_Bracket property value) and a text position. Initialize it to empty. to make things more clear. And a slight wording change might help the reader with item 2: 2.
Compare the *code point for the* closing paired bracket being inspected or its canonical equivalent to the *code point* (Bidi_Paired_Bracket property value) in the current stack element. And, to continue 3. If the values match, meaning *the character being inspected and the character at the text position in the stack* form a bracket pair, then [...] > > 3. Paragraph 3.3.2 says, under "Non-formatting characters": > > X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, > FSI, and PDI: > > . Set the current character’s embedding level to the embedding > level of the last entry on the directional status stack. > > [...] > > Note that the current embedding level is not changed by this rule. > > What does this last sentence mean by "the current embedding level"? > The first bullet of X6 mandates that "the current character’s > embedding level" _is_ changed by this rule, so what other "current > embedding level" is alluded to here? I'm punting on that one - can someone else answer this? > > 4. Rule X10 says in its last bullet: > > Apply rules W1–W7, N0–N2, and I1–I2, in the order in which they > appear below, to each of the isolating run sequences, applying one > rule to all the characters in the sequence in the order in which > they occur in the sequence before applying another rule to any part > of the sequence. The order that one isolating run sequence is > treated relative to another does not matter. > > Does the last sentence mean that it is OK to apply W1 to the 1st > isolating sequence, then apply W1 to the second isolating sequence, > then apply W2 to the 1st isolating sequence, followed by W2 > application to the 2nd isolating sequence, etc.? IOW, the last > sentence refers to the order of processing between the isolating run > sequences, but says nothing about the order of applying rules between > the sequences. Apply rules W1–W7, N0–N2, and I1–I2 to each of the isolating run sequences.
For each sequence, [completely] apply each rule in the order in which they appear below. The order that one isolating run sequence is treated relative to another does not matter. I believe the above restatement expresses the same thing in fewer words. The "completely" may be unnecessary. The text about applying the rules to "all characters" seems to be unnecessary, unless there is, in any of the rules, an option to not apply it to some characters. Unless incomplete application is envisaged, calling out the "all characters" here just confuses. > > 5. Rule N0 says: > > . For each bracket-pair element in the list of pairs of text positions > > a. Inspect the bidirectional types of the characters enclosed > within the bracket pair. > b. If any strong type (either L or R) matching the embedding > direction is found, set the type for both brackets in the pair > to match the embedding direction. > > First, what is meant here by "strong type [...] matching the embedding > direction"? Does the "match" here consider only the odd/even value of > the current embedding level vs R/L type, in the sense that odd levels > "match" R and even levels "match" L? Or does this mean some other > kind of matching? Table 3, which the only place that seems to refer > to the issue, is not entirely clear, either: > > e The text ordering type (L or R) that matches the embedding level > direction (even or odd). > > Again, the sense of the "match" here is not clear. even/odd --- R/L match, might be made more explicit > > Next, what is meant here by "the characters enclosed within the > bracket pair"? If the bracket pair encloses another bracket pair, > which is inner to it, do the characters inside the inner pair count > for the purposes of resolving the level of the outer pair? They do, so there's no need to change the text. 
> > Lastly, I presume that by "the bidirectional types of the enclosed > characters" the text means the resolved types as modified by the > preceding phases, not the original types. Is that correct? It's the strong type assigned by rule N0. A./ > > Again, thanks in advance for any help. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From jjc at jclark.com Sun Apr 20 20:54:34 2014 From: jjc at jclark.com (James Clark) Date: Mon, 21 Apr 2014 08:54:34 +0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <535426DF.9020308@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: On Mon, Apr 21, 2014 at 2:58 AM, Asmus Freytag wrote: > On 4/20/2014 3:24 AM, Eli Zaretskii wrote: > > Would someone please help understand the following subtleties and > obscure language in the UBA document found athttp://www.unicode.org/reports/tr9/? Thanks in advance. > > 3. Paragraph 3.3.2 says, under "Non-formatting characters": > > X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, > FSI, and PDI: > > . Set the current character?s embedding level to the embedding > level of the last entry on the directional status stack. > > [...] > > Note that the current embedding level is not changed by this rule. > > What does this last sentence mean by "the current embedding level"? > The first bullet of X6 mandates that "the current character?s > embedding level" _is_ changed by this rule, so what other "current > embedding level" is alluded to here? > > I'm punting on that one - can someone else answer this? > I assume "current embedding level" here meant "the embedding level of the last entry on the directional status stack". 
(This is a natural slip to make if you think in terms of an optimized implementation that stores each component of the top of the directional status stack in a variable, as suggested in 3.3.2.) James -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Apr 21 01:03:20 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 20 Apr 2014 23:03:20 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: <5354B4A8.4030201@ix.netcom.com> On 4/20/2014 6:54 PM, James Clark wrote: > On Mon, Apr 21, 2014 at 2:58 AM, Asmus Freytag > wrote: > > On 4/20/2014 3:24 AM, Eli Zaretskii wrote: >> Would someone please help understand the following subtleties and >> obscure language in the UBA document found at >> http://www.unicode.org/reports/tr9/? Thanks in advance. >> 3. Paragraph 3.3.2 says, under "Non-formatting characters": >> >> X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, >> FSI, and PDI: >> >> . Set the current character?s embedding level to the embedding >> level of the last entry on the directional status stack. >> >> [...] >> >> Note that the current embedding level is not changed by this rule. >> >> What does this last sentence mean by "the current embedding level"? >> The first bullet of X6 mandates that "the current character?s >> embedding level" _is_ changed by this rule, so what other "current >> embedding level" is alluded to here? > I'm punting on that one - can someone else answer this? > > > I assume "current embedding level" here meant "the embedding level of > the last entry on the directional status stack". (This is a natural > slip to make if you think in terms of an optimized implementation that > stores each component of the top of the directional status stack in a > variable, as suggested in 3.3.2.) 
> > James > In general, I heartily dislike "specifications" that just narrate a particular implementation... A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Apr 21 02:02:40 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 08:02:40 +0100 (BST) Subject: The application of localized read-out labels In-Reply-To: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> References: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> Message-ID: <1398063760.79331.YahooMailNeo@web87801.mail.ir2.yahoo.com> The text of the first post in this thread was not recorded in the archive of the Unicode Public Email List. Maybe because there was an attachment to the post? This post is so as to include a transcript of the text of that post in the archive of the Unicode Public Email List. William Overington 21 April 2014 Transcript: > William, the UTC is not in the business of creating file formats for localization data. > Peter Thank you for replying. Feeling that a format for the particular application is important I have now produced a format myself and published it. Please find a copy attached. Posting the publication as an attachment here will also hopefully place it in the mailing list archives for long-term availability. I have also sent a copy to the British Library for Legal Deposit. The publication has the following title. The format of the readouts.dat file suggested for possible use in the application of localized read-out labels The file has the following file name. 
The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf William Overington 16 April 2014 From eliz at gnu.org Mon Apr 21 02:55:59 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 21 Apr 2014 10:55:59 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <535426DF.9020308@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: <83zjjfce0g.fsf@gnu.org> > Date: Sun, 20 Apr 2014 12:58:23 -0700 > From: Asmus Freytag > > On 4/20/2014 3:24 AM, Eli Zaretskii wrote: > > Would someone please help understand the following subtleties and > > obscure language in the UBA document found at > > http://www.unicode.org/reports/tr9/? Thanks in advance. > > Eli, > > I've tried to give you some explanations Thanks! > in some places, I concur with you that the wording could be improved > and that such improved wording should be proposed to the UTC (or its > editorial committee) for incorporation into a future update. How do we do that? > For details, see below. > > > > 1. In paragraph 3.1.2, near its very end, we have this sentence (with > > my emphasis): > > > > As rule X10 will specify, an isolating run sequence is the unit to > > which the rules following it are applied, and the last character of > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > one level run in the sequence is considered to be immediately > > followed by the first character of the next level run in the > > sequence during this phase of the algorithm. > > > > What does it mean here by "the rules following it"? Following what? > > That looks like a bad referent, but from context, this "it" must be X10 Ah, so simply saying "the following rules" or "rules following X10" would be enough. > Bullet 1 could be changed to > > . Create a stack for elements each consisting of a*code point* (Bidi_Paired_Bracket property value) > and a text position. Initialize it to empty. > > to make things more clear. 
And a slight wording change might help the > reader with item 2: > > 2. Compare the*code point for the*closing paired bracket being inspected or its > canonical equivalent to the*code poin*t (Bidi_Paired_Bracket property value) in the current stack > element. > > > And, to continue > > 3. If the values match, meaning*the character being inspected and the character** > ** at the text position in the stack* form a bracket pair, then [...] Right, this makes the description a whole lot more clear. > Apply rules W1?W7, N0?N2, and I1?I2 to each of the isolating run sequences. > For each sequence, [completely] apply each rule in the order in which they appear below. > The order that one isolating run sequence is treated relative to another does not matter. > > I believe the above restatement expresses the same thing in fewer words. It does, thanks. > > 5. Rule N0 says: > > > > . For each bracket-pair element in the list of pairs of text positions > > > > a. Inspect the bidirectional types of the characters enclosed > > within the bracket pair. > > b. If any strong type (either L or R) matching the embedding > > direction is found, set the type for both brackets in the pair > > to match the embedding direction. > > > > First, what is meant here by "strong type [...] matching the embedding > > direction"? Does the "match" here consider only the odd/even value of > > the current embedding level vs R/L type, in the sense that odd levels > > "match" R and even levels "match" L? Or does this mean some other > > kind of matching? Table 3, which the only place that seems to refer > > to the issue, is not entirely clear, either: > > > > e The text ordering type (L or R) that matches the embedding level > > direction (even or odd). > > > > Again, the sense of the "match" here is not clear. > > even/odd --- R/L match, might be made more explicit I agree this should be made more explicit, as this is a somewhat subtle issue that might trip the reader. 
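[Editorial illustration, not part of the original message.] The even/odd versus L/R "match" discussed above can be written down in a few lines. This is only a sketch of the reading agreed on in this thread: an even embedding level has embedding direction L, an odd level has embedding direction R, and a strong type "matches" when it equals that direction. The function names are hypothetical and do not come from UAX#9.

```python
def embedding_direction(level: int) -> str:
    """Return the embedding direction for a level: 'L' if even, 'R' if odd."""
    return "R" if level % 2 else "L"


def matches_embedding_direction(bidi_type: str, level: int) -> bool:
    """True if a strong type ('L' or 'R') equals the direction implied by the level.

    Illustration only: any other bidi type is simply reported as not matching.
    """
    return bidi_type in ("L", "R") and bidi_type == embedding_direction(level)


if __name__ == "__main__":
    for level in (0, 1, 2, 3):
        print(level, embedding_direction(level))
```

Under this reading, an R character inside a bracket pair at level 1 matches the embedding direction, while an L character at the same level does not.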
> > Next, what is meant here by "the characters enclosed within the > > bracket pair"? If the bracket pair encloses another bracket pair, > > which is inner to it, do the characters inside the inner pair count > > for the purposes of resolving the level of the outer pair? > They do, so there's no need to change the text. It might be a good idea to say that explicitly, e.g. as a note, or at least provide another example where the strong characters are only inside an inner bracket pair, which will send the same message to the reader. Thanks again for the clarifications. From eliz at gnu.org Mon Apr 21 03:01:13 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 21 Apr 2014 11:01:13 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: <83y4yzcdrq.fsf@gnu.org> > From: James Clark > Date: Mon, 21 Apr 2014 08:54:34 +0700 > Cc: Eli Zaretskii , unicode at unicode.org, Kenneth Whistler > > > X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, > > FSI, and PDI: > > > > . Set the current character?s embedding level to the embedding > > level of the last entry on the directional status stack. > > > > [...] > > > > Note that the current embedding level is not changed by this rule. > > > > What does this last sentence mean by "the current embedding level"? > > The first bullet of X6 mandates that "the current character?s > > embedding level" _is_ changed by this rule, so what other "current > > embedding level" is alluded to here? > > > > I'm punting on that one - can someone else answer this? > > > > I assume "current embedding level" here meant "the embedding level of the > last entry on the directional status stack". Thanks, that was my guess as well, but I wanted to be sure. IMO, the unfortunate wording here is that the same phrase ("current embedding level") was used just before the problematic sentence to mean something completely different. 
Having identical phrases close to one another always tricks readers into thinking they are describing the same thing; when they aren't, confusion settles in. So I would suggest to reword one or both of these references to the "current embedding level". Btw, why is that note, about the current embedding level not being changed by X6, important? Why would someone mistakenly think the contrary? From eliz at gnu.org Mon Apr 21 03:33:39 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 21 Apr 2014 11:33:39 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <5354B4A8.4030201@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> Message-ID: <83wqejcc9o.fsf@gnu.org> > Date: Sun, 20 Apr 2014 23:03:20 -0700 > From: Asmus Freytag > CC: Eli Zaretskii , unicode at unicode.org, > Kenneth Whistler > > >> Note that the current embedding level is not changed by this rule. > >> > >> What does this last sentence mean by "the current embedding level"? > >> The first bullet of X6 mandates that "the current character?s > >> embedding level" _is_ changed by this rule, so what other "current > >> embedding level" is alluded to here? > > I'm punting on that one - can someone else answer this? > > > > > > I assume "current embedding level" here meant "the embedding level of > > the last entry on the directional status stack". (This is a natural > > slip to make if you think in terms of an optimized implementation that > > stores each component of the top of the directional status stack in a > > variable, as suggested in 3.3.2.) > > > > James > > > In general, I heartily dislike "specifications" that just narrate a > particular implementation... I cannot agree more. In fact, my main gripe about the UBA additions in 6.3 are that some of their crucial parts are not formally defined, except by an algorithm that narrates a specific implementation. 
The two worst examples of that are the "definitions" of the isolating run sequence and of the bracket pair. I didn't ask about those because I succeeded to figure them out, but it took many readings of the corresponding parts of the document. It is IMO a pity that the two main features added in 6.3 are based on definitions that are so hard to penetrate, and which actually all but force you to use the specific implementation described by the document. My working definition that replaces BD13 is this: An isolating run sequence is the maximal sequence of level runs of the same embedding level that can be obtained by removing all the characters between an isolate initiator and its matching PDI (or paragraph end, if there is no matching PDI) within those level runs. As for bracket pair (BD16), I'm really amazed that a concept as easy and widely known/used as this would need such an obscure definition that must have an algorithm as its necessary part. How about this instead: A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters within the same isolating run sequence, such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent, and all the opening and closing bracket characters in between these two are balanced. Then we could use the algorithm to explain what it means for brackets to be balanced (for those readers who somehow don't already know that). Again, thanks for clarifying these subtle issues. I can now proceed to updating the Emacs bidirectional display with the changes in Unicode 6.3. 
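[Editorial illustration, not part of the original message.] The stack procedure that BD16 uses to identify bracket pairs, as discussed in this thread, might be sketched as follows. This is a simplified sketch under stated assumptions, not the normative algorithm: the `pairs` mapping is a hypothetical stand-in for the Bidi_Paired_Bracket data, the input is assumed to be the text of a single isolating run sequence, and canonical equivalents and any limit on stack depth are deliberately left out.

```python
def find_bracket_pairs(text, pairs):
    """Return sorted (open_pos, close_pos) tuples for balanced bracket pairs.

    `pairs` maps each opening bracket to its closing partner (hypothetical
    stand-in for the Bidi_Paired_Bracket property).
    """
    closers = set(pairs.values())
    stack = []   # entries: (expected closing character, text position)
    found = []
    for pos, ch in enumerate(text):
        if ch in pairs:
            # Opening bracket: remember which closer it expects and where it is.
            stack.append((pairs[ch], pos))
        elif ch in closers:
            # Closing bracket: search from the top of the stack downwards.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    found.append((stack[i][1], pos))
                    # Pop the matched entry and everything above it.
                    del stack[i:]
                    break
            # No match: the closer is ignored and the stack is left unchanged.
    return sorted(found)


if __name__ == "__main__":
    table = {"(": ")", "[": "]"}
    print(find_bracket_pairs("a(b[c)d]", table))
```

The design point worth noticing is that the stack stores the expected closing character rather than the opening one, which is the code-point reading suggested earlier in the thread; an unmatched opener that sits above a matched one is discarded together with it.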
From wjgo_10009 at btinternet.com Mon Apr 21 04:29:36 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 10:29:36 +0100 (BST) Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries Message-ID: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries Imagine please if museum and art gallery websites each were to have an international webpage in its on-line shop. If there were on the webpage colourful symbols, one each for Surname, Forename, Card number and so on and the end-user could display text in his or her own language by displaying the appropriate read-out label next to each symbol, thus localizing the web page, then that could be very helpful. I have produced designs for nine symbols. There are two glyphs for each symbol, one colourful and one monochrome. The symbols are octagonal, using not quite a regular octagon. In the monochrome glyphs there is a border around the edge, yet in the colourful glyphs there is no border. The colourful glyphs are displayed in blue and orange, the idea being that the effect to the viewer is of blue upon an orange background. The designs are influenced by heraldry to some extent. This is because I consider Surname to be the most important, so I used a heraldic chief. Then for Forename I used a pale as Forename is different from Surname yet accompanies to Surname to form a name. A bar is used for Address. Name as on card may be different from Forename concatenated with a space and Surname, due to use of Mr Mrs Miss Ms etc and initials, hence the reason for the design not being a union of a chief and one pale. Two bars are used for Card number. Card start date and Card expiry date seemed liked brackets, so that inspired the designs. 
Card security code is just a design so as to be different from the other designs yet not use any diagonal shapes. Delivery address is included to allow for the possibility of sending a gift directly to someone who lives at another address. I am hoping to attach images showing the designs to other posts in this thread. William Overington 21 April 2014 From wjgo_10009 at btinternet.com Mon Apr 21 04:47:24 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 10:47:24 +0100 (BST) Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries In-Reply-To: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com> > I am hoping to attach images showing the designs to other posts in this thread. Please find attached an image of the designs of the colourful glyphs. William Overington 21 April 2014 -------------- next part -------------- A non-text attachment was scrubbed... Name: colourful_glyphs.png Type: image/png Size: 21679 bytes Desc: not available URL: From wjgo_10009 at btinternet.com Mon Apr 21 04:49:00 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 10:49:00 +0100 (BST) Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries In-Reply-To: <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com> References: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com> Message-ID: <1398073740.83541.YahooMailNeo@web87801.mail.ir2.yahoo.com> > I am hoping to attach images showing the designs to other posts in this thread. Please find attached an image of the designs of the monochrome glyphs. 
William Overington 21 April 2014 -------------- next part -------------- A non-text attachment was scrubbed... Name: monochrome_glyphs.png Type: image/png Size: 21415 bytes Desc: not available URL: From ruland at luckymail.com Mon Apr 21 05:18:25 2014 From: ruland at luckymail.com (Charlie Ruland ☘) Date: Mon, 21 Apr 2014 12:18:25 +0200 Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries In-Reply-To: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <5354F071.1060008@luckymail.com> I am sorry, but this doesn’t look like internationalization. Rather it seems like another attempt by the British to force their culture upon the rest of the world. The richness of world-wide naming conventions for people is simply ignored, Putin Vladimir Vladimirovič won’t be able to use his full name (let alone in the order required), and this will lead to World War III. William J. G. Overington, please admit that others know so much more about internationalization than you do, and stop these imperialist off-topic activities. Charlie Ruland ☘ William_J_G Overington wrote: > Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries > > Imagine please if museum and art gallery websites each were to have an international webpage in its on-line shop. > > If there were on the webpage colourful symbols, one each for Surname, Forename, Card number and so on and the end-user could display text in his or her own language by displaying the appropriate read-out label next to each symbol, thus localizing the web page, then that could be very helpful. > > I have produced designs for nine symbols. There are two glyphs for each symbol, one colourful and one monochrome. The symbols are octagonal, using not quite a regular octagon.
In the monochrome glyphs there is a border around the edge, yet in the colourful glyphs there is no border. The colourful glyphs are displayed in blue and orange, the idea being that the effect to the viewer is of blue upon an orange background. > > The designs are influenced by heraldry to some extent. > > This is because I consider Surname to be the most important, so I used a heraldic chief. > > Then for Forename I used a pale as Forename is different from Surname yet accompanies to Surname to form a name. > > A bar is used for Address. > > Name as on card may be different from Forename concatenated with a space and Surname, due to use of Mr Mrs Miss Ms etc and initials, hence the reason for the design not being a union of a chief and one pale. > > Two bars are used for Card number. > > Card start date and Card expiry date seemed liked brackets, so that inspired the designs. > > Card security code is just a design so as to be different from the other designs yet not use any diagonal shapes. > > Delivery address is included to allow for the possibility of sending a gift directly to someone who lives at another address. > > I am hoping to attach images showing the designs to other posts in this thread. 
> > William Overington > > 21 April 2014 > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From wjgo_10009 at btinternet.com Mon Apr 21 07:34:38 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 13:34:38 +0100 (BST) Subject: The application of localized read-out labels In-Reply-To: <20140417114347.665a7a7059d7ee80bb4d670165c8327d.a0ad58431a.wbe@email03.secureserver.net> References: <20140417114347.665a7a7059d7ee80bb4d670165c8327d.a0ad58431a.wbe@email03.secureserver.net> Message-ID: <1398083678.81771.YahooMailNeo@web87803.mail.ir2.yahoo.com> Doug Ewell wrote: > It's labeled prominently as a "thought experiment," which means there is no expectation that anyone will implement the format or software which reads it, only think about what would happen if it were implemented. Well, it states as follows. quote This is a thought experiment at present. Automated localization would be by having a file readouts.dat available. In the thought experiment the file is a UTF-16 text file, such as can be saved from the WordPad program by selecting saving as a Unicode Text Document. end quote My reason for putting "This is a thought experiment at present." was that the format has not been tested by me in practical application and is only theoretically based at the present time, yet I am hoping that the situation may change and that the format might become implemented in practice by someone and become widely used; or maybe that the publication of the format will act as a catalyst to someone publishing a format that is accepted, so that the end result of a standardized format is achieved. > I actually read through the document, 18-point body type and all, before noticing this key point. Thank you for reading through the document. 
http://en.wikipedia.org/wiki/Thought_experiment http://en.wikipedia.org/wiki/John_Searle http://en.wikipedia.org/wiki/Philosophy_of_language ---- http://en.wikipedia.org/wiki/Thought_experiment http://en.wikipedia.org/wiki/Backcasting William Overington 21 April 2014 From asmusf at ix.netcom.com Mon Apr 21 09:32:15 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Apr 2014 07:32:15 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83wqejcc9o.fsf@gnu.org> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> Message-ID: <53552BEF.90308@ix.netcom.com> On 4/21/2014 1:33 AM, Eli Zaretskii wrote: >> Date: Sun, 20 Apr 2014 23:03:20 -0700 >> From: Asmus Freytag >> CC: Eli Zaretskii , unicode at unicode.org, >> Kenneth Whistler >> >>>> Note that the current embedding level is not changed by this rule. >>>> >>>> What does this last sentence mean by "the current embedding level"? >>>> The first bullet of X6 mandates that "the current character's >>>> embedding level" _is_ changed by this rule, so what other "current >>>> embedding level" is alluded to here? >>> I'm punting on that one - can someone else answer this? >>> >>> >>> I assume "current embedding level" here meant "the embedding level of >>> the last entry on the directional status stack". (This is a natural >>> slip to make if you think in terms of an optimized implementation that >>> stores each component of the top of the directional status stack in a >>> variable, as suggested in 3.3.2.) >>> >>> James >>> >> In general, I heartily dislike "specifications" that just narrate a >> particular implementation... > I cannot agree more. > > In fact, my main gripe about the UBA additions in 6.3 are that some of > their crucial parts are not formally defined, except by an algorithm > that narrates a specific implementation.
The two worst examples of > that are the "definitions" of the isolating run sequence and of the > bracket pair. I didn't ask about those because I succeeded to figure > them out, but it took many readings of the corresponding parts of the > document. It is IMO a pity that the two main features added in 6.3 > are based on definitions that are so hard to penetrate, and which > actually all but force you to use the specific implementation > described by the document. > > My working definition that replaces BD13 is this: > > An isolating run sequence is the maximal sequence of level runs of > the same embedding level that can be obtained by removing all the > characters between an isolate initiator and its matching PDI (or > paragraph end, if there is no matching PDI) within those level runs. > > As for bracket pair (BD16), I'm really amazed that a concept as easy > and widely known/used as this would need such an obscure definition > that must have an algorithm as its necessary part. How about this > instead: > > A bracket pair is a pair of an opening paired bracket and a closing > paired bracket characters within the same isolating run sequence, > such that the Bidi_Paired_Bracket property value of the former > character or its canonical equivalent equals the latter character or > its canonical equivalent, and all the opening and closing bracket > characters in between these two are balanced. > > Then we could use the algorithm to explain what it means for brackets > to be balanced (for those readers who somehow don't already know > that). > > Again, thanks for clarifying these subtle issues. I can now proceed > to updating the Emacs bidirectional display with the changes in > Unicode 6.3. > > FWIW here is the restatement of BD16 that I used for myself (and that I put into the source comments of the sample Java implementation): // The following is a restatement of BD 16 using non-algorithmic language. 
// // A bracket pair is a pair of characters consisting of an opening // paired bracket and a closing paired bracket such that the // Bidi_Paired_Bracket property value of the former equals the latter, // subject to the following constraints. // - both characters of a pair occur in the same isolating run sequence // - the closing character of a pair follows the opening character // - any bracket character can belong at most to one pair, the earliest possible one // - any bracket character not part of a pair is treated like an ordinary character // - pairs may nest properly, but their spans may not overlap otherwise // Bracket characters with canonical decompositions are supposed to be treated // as if they had been normalized, to allow normalized and non-normalized text // to give the same result. Your language is more concise, but you may compare for differences. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Apr 21 09:35:57 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Apr 2014 07:35:57 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83zjjfce0g.fsf@gnu.org> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <83zjjfce0g.fsf@gnu.org> Message-ID: <53552CCD.1010703@ix.netcom.com> On 4/21/2014 12:55 AM, Eli Zaretskii wrote: >> in some places, I concur with you that the wording could be improved >> >and that such improved wording should be proposed to the UTC (or its >> >editorial committee) for incorporation into a future update. > How do we do that? > You file a problem report using the "contact form". A./ -------------- next part -------------- An HTML attachment was scrubbed... 
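For readers who want to experiment with the non-algorithmic restatement of BD16 given above, here is a minimal Python sketch of a stack-based matcher that appears to satisfy the listed constraints. It is an illustration under stated assumptions, not the UAX#9 sample implementation: the whole input string is treated as a single isolating run sequence, the table PAIRS is a hypothetical three-entry stand-in for the real Bidi_Paired_Bracket data, canonical equivalence is ignored, and the name bracket_pairs is invented for this example.

```python
# Assumption: a tiny stand-in for the Bidi_Paired_Bracket property,
# mapping each opening bracket to its closing counterpart.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def bracket_pairs(text):
    """Return sorted (open_index, close_index) pairs found in text.

    Assumption: text is one isolating run sequence, so no level or
    isolate handling is needed here.
    """
    stack = []  # entries are (expected closing bracket, index of opener)
    pairs = []
    for index, char in enumerate(text):
        if char in PAIRS:
            # Opening bracket: remember which closer would complete it.
            stack.append((PAIRS[char], index))
        elif char in CLOSERS:
            # Closing bracket: search from the top of the stack downwards
            # for the nearest opener that expects this closer.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == char:
                    pairs.append((stack[depth][1], index))
                    # Discard the matched opener and everything above it,
                    # so that spans can nest but never overlap.
                    del stack[depth:]
                    break
            # With no match, the closer is left as an ordinary character.
    return sorted(pairs)


if __name__ == "__main__":
    print(bracket_pairs("a[b(c)d]e"))  # → [(1, 7), (3, 5)]
```

Under these assumptions, properly nested brackets pair up as expected, while in the overlapping input "([)]" only the round brackets are paired and both square brackets are left as ordinary characters, which is one way to read the "may not overlap otherwise" constraint.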
URL: From doug at ewellic.org Mon Apr 21 09:40:15 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 21 Apr 2014 07:40:15 -0700 Subject: The application of localized read-out labels Message-ID: <20140421074015.665a7a7059d7ee80bb4d670165c8327d.302f747b25.wbe@email03.secureserver.net> William_J_G Overington wrote: > My reason for putting "This is a thought experiment at present." was > that the format has not been tested by me in practical application and > is only theoretically based at the present time, It's not, of course. It's specified in enough detail that conformant files could be created, and consumed by an application. > yet I am hoping that the situation may change and that the format > might become implemented in practice by someone and become widely > used; or maybe that the publication of the format will act as a > catalyst to someone publishing a format that is accepted, so that the > end result of a standardized format is achieved. It could be argued that this is at least part of the hypothesis for the "experiment." The expected result, not quite stated, is that the format will in fact be used, or will in fact stimulate the creation of a similar format. Because, of course, if there is no hypothesis, then this is neither a Gedankenexperiment nor any other kind of experiment, just an exercise in creating a file format, which is engineering. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From verdy_p at wanadoo.fr Mon Apr 21 11:18:09 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Apr 2014 18:18:09 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <53552BEF.90308@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> Message-ID: There are some cases where these rules will not be clear enough. 
Look at the following, where overlaps do occur but directionality still matters: "This is an [«] example [»] for demonstration only." There are two parsings possible if you just consider a hierarchic layout where overlaps are disabled: 1. "This is an [...] for demonstration only.", embedding "«...»", itself embedding "] example [" (here the square brackets match externally) 2. "This is an [...] example [...] for demonstration only.", embedding two spans for "«" and "»" separately (they also pair externally) Now suppose that the term "example" is translated into Arabic: it is not very clear how the UBA will work while preserving the correct pairing direction of the 3 pairs (one pair is "«...»", there are two pairs for "[...]"). Still, all 3 pairs have a coherent direction that Bidi reordering or glyph mirroring should not mix. I see only one solution to tag such text so that it will behave correctly: either the two pairs of square brackets or the pair of guillemets should be encoded with isolated Bidi overrides. But then what happens to the ordering of the surrounding text? There should be a stable way to encode this case so that the UBA will still preserve the correct reading order, the expected semantics and orientation of the pairs, and the fact that the guillemets are effectively not embedding the brackets, but the translated word "example". There are several ways to use Bidi-override or Bidi-embedding controls; I don't know which one is better, but all of them should still work with the UBA. I just hope that the complex cases of the brackets in the middle ("]...[") can be handled gracefully.
In my opinion this would require embedding and isolating each square bracket, so that they no longer match each other (externally they are treated as symbols with transparent direction). But how do we then ensure that the sequence "[«]" still occurs before the RTL (Arabic) word for "example", followed by the sequence "[»]", and that the rest of the sentence ("for demonstration only") still occurs in the correct order? We would also have to embed or isolate the word "example", or the whole sequence "[«] example [»]", so that the main sentence "This is an ... for demonstration only" still has a coherent reading direction. Such cases are not so exceptional, because they arise when representing two distinct parallel readings of the same text, where the reading for one kind of pair simply treats the other pairs as transparent and ignores them. It should be an interesting case to investigate for validating UBA algorithms in a conformance test case.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Apr 21 12:48:43 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Apr 2014 10:48:43 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> Message-ID: <535559FB.60100@ix.netcom.com> Philippe, I fail to understand how your post contributes to the topic. The issue was unclear wording of the specification, not deficiencies in the UBA or the PBA in general.
Let's keep this discussion limited to issues of wording for the *existing* specification. Feel free to start a new discussion about something else under a new subject. A./
-------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Apr 21 13:14:35 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 21 Apr 2014 11:14:35 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 Message-ID: <20140421111435.665a7a7059d7ee80bb4d670165c8327d.24acdc5211.wbe@email03.secureserver.net> From: Asmus Freytag wrote: > In general, I heartily dislike "specifications" that just narrate a > particular implementation... I agree completely.
I see this with CLDR as well; there is a more or less implicit assumption that I will be using ICU to implement whatever is being described. I don't care how robust and well-tested a wheel is; as a developer, I should be able to use the specification to reinvent it if I like. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From verdy_p at wanadoo.fr Mon Apr 21 13:23:57 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Apr 2014 20:23:57 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <535559FB.60100@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> Message-ID: It is on topic because the proposed description attempts to explain how paired brackets should match and how this will then affect the rendering in bidirectional contexts. This is exactly the kind of things that are difficult because the proposed description assumes that paired brackets are organized hierarchically. Quote: "both characters of a pair occur in the same isolating run sequence" (does not work here: sequences are not fully isolated) Quote: "any bracket character can belong at most to one pair, the earliest possible one" (does not work here, this is not the earliest possible)
-------------- next part -------------- An HTML attachment was scrubbed...
From asmusf at ix.netcom.com Mon Apr 21 13:56:50 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 21 Apr 2014 11:56:50 -0700
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To:
References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com>
Message-ID: <535569F2.5080601@ix.netcom.com>

On 4/21/2014 11:23 AM, Philippe Verdy wrote:
> It is on topic because the proposed description attempts to explain
> how paired brackets should match and how this will then affect the
> rendering in bidirectional contexts. This is exactly the kind of
> things that are difficult because the proposed description assumes
> that paired brackets are organized hierarchically.
>
> Quote: "both characters of a pair occur in the same isolating run
> sequence" (does not work here, sequences are not fully isolated)
> Quote: "any bracket character can belong at most to one pair, the
> earliest possible one" (does not work here, this is not the earliest
> possible)

That's OK, it's a limitation of the algorithm, not the description.

In other words, the algorithm can help set a better directionality of
paired (!) brackets, and those are the ones that nest properly.

What Eli brought to our attention is that the description of this
algorithm is suboptimal; whether the algorithm could or should be
improved is a separate matter.

A./

PS: I think it is unlikely that the UTC will be interested in
substantial changes to the algorithm, but it should be interested in
allowing the specification to be less dependent on the sample
implementation.

>
>
> 2014-04-21 19:48 GMT+02:00 Asmus Freytag:
>
> Philippe,
>
> I fail to understand how your post contributes to the topic.
>
> The issue was unclear wording of the specification, not
> deficiencies in the UBA or the PBA in general.
>
> Let's keep this discussion limited to issues of wording for the
> *existing* specification. Feel free to start a new discussion
> about something else under a new subject.
>
> A./
>
>
> On 4/21/2014 9:18 AM, Philippe Verdy wrote:
>> There are some cases where these rules will not be clear enough.
>> Look at the following, where overlaps do occur but directionality
>> still matters:
>>
>> "This is an [«] example [»] for demonstration only."
>>
>> There are two parsings possible if you just consider a hierarchic
>> layout where overlaps are disabled:
>>
>> 1. "This is an [...] for demonstration only.", embedding "«...»",
>> itself embedding "] example [" (here the square brackets match
>> externally)
>>
>> 2. "This is an [...] example [...] for demonstration only.",
>> embedding two spans for "«" and "»" separately (they also pair
>> externally)
>>
>> Now suppose that the term "example" is translated into Arabic: it
>> is not very clear how the UBA will work while preserving the
>> correct pairing direction of the 3 pairs (one pair is "«...»",
>> there are two pairs for "[...]"). Still, all 3 pairs have a
>> coherent direction that Bidi-reordering or glyph mirroring should
>> not mix.
>>
>> I see only one solution to tag such text so that it will behave
>> correctly: either the two pairs of square brackets or the pair of
>> guillemets should be encoded with isolated Bidi overrides. But
>> then what is happening to the ordering of the surrounding text?
>>
>> There should be a stable way to encode this case so that UBA will
>> still work in preserving the correct reading order, and the
>> expected semantics and orientation of pairs and the fact that the
>> guillemets are effectively not really embedding the brackets, but
>> the translated word "example".
>>
>> There are several ways to use Bidi-override or Bidi-embedding
>> controls; I don't know which one is better but all of them should
>> still work with UBA. I just hope that the complex cases of the
>> brackets in the middle ("]...[") can be handled gracefully.
>>
>> My opinion would require embedding and isolating each square
>> bracket; they will then no longer match together (externally they
>> are treated as symbols with transparent direction). But how do we
>> ensure that the sequence "[«]" will still occur before the RTL
>> (Arabic) "example" word, followed by the sequence "[»]", and that
>> the rest of the sentence ("for demonstration only") will still
>> occur in the correct order? We also have to embed/isolate the
>> "example", or the whole sequence "[«] example [»]", so that the
>> main sentence "This is an ... for demonstration only" will still
>> have a coherent reading direction.
>>
>> Such cases are not so exceptional, because they occur when
>> representing two distinct parallel readings of the same text,
>> where one reading for one kind of pairs will simply treat the
>> other pairs as ignored "transparently".
>>
>> It should be an interesting case to investigate for validating
>> UBA algorithms in a conformance test case.
>>
>>
>> 2014-04-21 16:32 GMT+02:00 Asmus Freytag:
>>
>> On 4/21/2014 1:33 AM, Eli Zaretskii wrote:
>>>> Date: Sun, 20 Apr 2014 23:03:20 -0700
>>>> From: Asmus Freytag
>>>> CC: Eli Zaretskii, unicode at unicode.org,
>>>> Kenneth Whistler
>>>>
>>>>>> Note that the current embedding level is not changed by this rule.
>>>>>>
>>>>>> What does this last sentence mean by "the current embedding level"?
>>>>>> The first bullet of X6 mandates that "the current character's
>>>>>> embedding level" _is_ changed by this rule, so what other "current
>>>>>> embedding level" is alluded to here?
>>>>> I'm punting on that one - can someone else answer this?
>>>>>
>>>>>
>>>>> I assume "current embedding level" here meant "the embedding level of
>>>>> the last entry on the directional status stack". (This is a natural
>>>>> slip to make if you think in terms of an optimized implementation that
>>>>> stores each component of the top of the directional status stack in a
>>>>> variable, as suggested in 3.3.2.)
>>>>>
>>>>> James
>>>>>
>>>> In general, I heartily dislike "specifications" that just narrate a
>>>> particular implementation...
>>> I cannot agree more.
>>>
>>> In fact, my main gripe about the UBA additions in 6.3 is that some of
>>> their crucial parts are not formally defined, except by an algorithm
>>> that narrates a specific implementation. The two worst examples of
>>> that are the "definitions" of the isolating run sequence and of the
>>> bracket pair. I didn't ask about those because I managed to figure
>>> them out, but it took many readings of the corresponding parts of the
>>> document. It is IMO a pity that the two main features added in 6.3
>>> are based on definitions that are so hard to penetrate, and which
>>> actually all but force you to use the specific implementation
>>> described by the document.
>>>
>>> My working definition that replaces BD13 is this:
>>>
>>> An isolating run sequence is the maximal sequence of level runs of
>>> the same embedding level that can be obtained by removing all the
>>> characters between an isolate initiator and its matching PDI (or
>>> paragraph end, if there is no matching PDI) within those level runs.
>>>
>>> As for bracket pair (BD16), I'm really amazed that a concept as easy
>>> and widely known/used as this would need such an obscure definition
>>> that must have an algorithm as its necessary part. How about this
>>> instead:
>>>
>>> A bracket pair is a pair of an opening paired bracket and a closing
>>> paired bracket character within the same isolating run sequence,
>>> such that the Bidi_Paired_Bracket property value of the former
>>> character or its canonical equivalent equals the latter character or
>>> its canonical equivalent, and all the opening and closing bracket
>>> characters in between these two are balanced.
>>>
>>> Then we could use the algorithm to explain what it means for brackets
>>> to be balanced (for those readers who somehow don't already know
>>> that).
>>>
>>> Again, thanks for clarifying these subtle issues. I can now proceed
>>> to updating the Emacs bidirectional display with the changes in
>>> Unicode 6.3.
>>>
>>>
>> FWIW here is the restatement of BD16 that I used for myself
>> (and that I put into the source comments of the sample Java
>> implementation):
>>
>> // The following is a restatement of BD 16 using non-algorithmic language.
>> //
>> // A bracket pair is a pair of characters consisting of an opening
>> // paired bracket and a closing paired bracket such that the
>> // Bidi_Paired_Bracket property value of the former equals the latter,
>> // subject to the following constraints.
>> // - both characters of a pair occur in the same isolating run sequence
>> // - the closing character of a pair follows the opening character
>> // - any bracket character can belong at most to one pair, the earliest possible one
>> // - any bracket character not part of a pair is treated like an ordinary character
>> // - pairs may nest properly, but their spans may not overlap otherwise
>>
>> // Bracket characters with canonical decompositions are supposed to be treated
>> // as if they had been normalized, to allow normalized and non-normalized text
>> // to give the same result.
>>
>> Your language is more concise, but you may compare for
>> differences.
>>
>> A./

From asmusf at ix.netcom.com Mon Apr 21 13:59:18 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 21 Apr 2014 11:59:18 -0700
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <20140421111435.665a7a7059d7ee80bb4d670165c8327d.24acdc5211.wbe@email03.secureserver.net>
References: <20140421111435.665a7a7059d7ee80bb4d670165c8327d.24acdc5211.wbe@email03.secureserver.net>
Message-ID: <53556A86.8000202@ix.netcom.com>

On 4/21/2014 11:14 AM, Doug Ewell wrote:
> From: Asmus Freytag wrote:
>
>> In general, I heartily dislike "specifications" that just narrate a
>> particular implementation...
> I agree completely. I see this with CLDR as well; there is a more or
> less implicit assumption that I will be using ICU to implement whatever
> is being described. I don't care how robust and well-tested a wheel is;
> as a developer, I should be able to use the specification to reinvent it
> if I like.
>
>

Well put.

Also, by simply narrating an implementation the UTC deprives the reader
of a clear higher-level description of the concept and the intended
result. The original part of the bidi specification does a much better
job in that regard. It's time to revisit the language for the additions
and bring them up to snuff.
A./

From asmusf at ix.netcom.com Mon Apr 21 14:00:54 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 21 Apr 2014 12:00:54 -0700
Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries
In-Reply-To: <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com>
References: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com>
Message-ID: <53556AE6.5030101@ix.netcom.com>

On 4/21/2014 2:47 AM, William_J_G Overington wrote:
>> I am hoping to attach images showing the designs to other posts in this thread.
> Please find attached an image of the designs of the colourful glyphs.

The language I would use for my reaction to this is just too colorful
to reproduce here :)

A./

From verdy_p at wanadoo.fr Mon Apr 21 15:54:16 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 21 Apr 2014 22:54:16 +0200
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <535569F2.5080601@ix.netcom.com>
References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com>
Message-ID:

My intent was not to demonstrate a bug in the algorithm (I have not even
claimed that), but to make sure that (less common) usages of paired
brackets that do not obey a pure hierarchy (because these notations use
different types of brackets, they are not ambiguous) still preserve
their left vs. right (or open vs. close) semantics.

However, due to the way the algorithm is currently designed, distinct
pairs of brackets still need to be nested hierarchically, and this is
not always the case. To allow such usages in bidirectional texts (they
do not cause big problems in unidirectional texts, i.e. texts using
characters with the same strong direction, or characters with neutral
and weak directions), we'll necessarily need to use bidi controls.
However, these controls cannot be so strong that they also break the
necessary embedding levels intended for each type of bracket, even when
the brackets do not match in pairs under the algorithm. As they will
then be treated in isolation (unpaired for the hierarchic algorithm),
they should still retain their intended RTL or LTR semantics (notably
their relative placement with the things they surround in non-nested
ways), without also being affected by inconsistent mirroring.

The UBA test cases currently do not cover such uncommon cases, but only
cases with single isolated/unpaired brackets. I want to make sure that
it will remain possible to write notations without pure hierarchical
nesting (for now they still don't work at all; the result is already
unpredictable, even with bidi controls).

Also, I'm not limiting this to punctuation pairs: it applies to any kind
of textual pairs, including XML element tags for example, or quotation
marks delimiting strings in programming languages, or "begin end"
keywords in Pascal or Lua programs, or descriptive expressions in human
languages (e.g. "[start singing] ... [end of song]"), even if they are
not concerned by punctuation mirroring.

You could see these non-nested usages as interlinear or unstructured,
but in fact they do have a structure which should be preserved and not
mixed randomly by an algorithm unable to decipher their meaning, unless
there's some markup or controls saying how to treat these items. We
should not even have to use specific parsers for specific notations
(like XML); this is a more generic abstract problem for texts whose
content and semantics are not nested in a pure hierarchical tree but in
subtrees with parallel branches, and whose rendering will then need to
preserve these structures.

My initial message contained a very minimal example of what is needed.
I'd like this sample case to be clearly supported in some way without
ambiguity. It will be important for things like songs, poetry, legal
texts containing citations, discussions about another text, threaded
discussions, annotating documents created collaboratively, versioning
and showing diffs, and more exceptionally for interlinear notations
(including the inclusion of translator notes, or notes started on one
page and continued elsewhere, possibly on another page, and containing
their own sets of bracket pairs)... In all these usages, the UBA (and
the inferred effect on mirroring) could cause havoc.

And of course I do not want to define a new technical syntax using
references and identifiers, as in XML or JSON, to make these structures
explicit. For the UBA it will be enough if it preserves the intended
direction and mirroring type without having to make explicit which
bracket pairs with which other one (it should just preserve the
start/end or open/close semantics, leaving the rest to an upper-layer
syntax if needed for more ambiguous cases; a renderer will use any trick
it wants to exhibit this supplementary structure, such as font styles,
colors, decorations, or custom 2D layouts, as provided by a rich text
format which is out of scope of Unicode and the UBA).

Only hierarchical structure is supported in XML or JSON, but SGML (and
HTML) already shows that non-hierarchical structures are also possible
and are effectively used in their supported "content models".


2014-04-21 20:56 GMT+02:00 Asmus Freytag :

> On 4/21/2014 11:23 AM, Philippe Verdy wrote:
>
> It is on topic because the proposed description attempts to explain how
> paired brackets should match and how this will then affect the rendering
> in bidirectional contexts. This is exactly the kind of things that are
> difficult because the proposed description assumes that paired brackets are
> organized hierarchically.
From asmusf at ix.netcom.com Mon Apr 21 16:44:14 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 21 Apr 2014 14:44:14 -0700
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To:
References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com>
Message-ID: <5355912E.80101@ix.netcom.com>

On 4/21/2014 1:54 PM, Philippe Verdy wrote:
> My intent was not to demonstrate a bug in the algorithm, I have not
> even claimed that, but to make sure that (less common) usages of
> paired brackets that do not obey to a pure hierarchy (because these
> notations use different type of brackets, they are not ambiguous) but
> still preserve their left vs. right (or open vs. close) semantic.

OK, so this has nothing to do with "unclear text".

A./

From nospam-abuse at ilyaz.org Mon Apr 21 18:41:40 2014
From: nospam-abuse at ilyaz.org (Ilya Zakharevich)
Date: Mon, 21 Apr 2014 16:41:40 -0700
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <5355912E.80101@ix.netcom.com>
References: <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com>
Message-ID: <20140421234140.GA5269@powdermilk>

On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote:
> On 4/21/2014 1:54 PM, Philippe Verdy wrote:
> >My intent was not to demonstrate a bug in the algorithm, I have
> >not even claimed that, but to make sure that (less common) usages
> >of paired brackets that do not obey to a pure hierarchy (because
> >these notations use different type of brackets, they are not
> >ambiguous) but still preserve their left vs. right (or open vs.
> >close) semantic.
> OK, so this has nothing to do with "unclear text".

Asmus, I cannot agree with this. I think Philippe's message is on topic.

[Below, I completely ignore the BIDI part of the specification, and
concentrate ONLY on the parens match. I do not understand why this
question is interlaced with BIDI determination; I trust that it is.]

I suspect Philippe was motivated by a kinda-cowboy attitude which both
Eli and you show to the problem of "parentheses match" (and I suspect
this because THAT is my feeling ;-). You give two (IMO, informal)
interpretations of what the algorithm-based description says. These two
interpretations are obviously non-compatible (or at least not
necessarily clearly stated).

As Eli said it: "bracket pair ... a concept as easy and widely
known/used as this would need such an obscure definition ...". Just for
background: the first theorem in the "Applied Algebra" class taught by
Yu.I.Manin was about parentheses match (it stated that the proper match
is unique as far as it exists). This statement is a (tiny) mess to
prove, but at least it should look very plausible to unwashed masses.
(One corollary is that "the earliest possible one" from your
interpretation is not actually needed.)

The problems appear when one wants to allow non-matching parentheses as
well as matched pairs. [If one fixes Eli's description so that "a pair"
and "matched" are complete synonyms, then] what Eli conveys is that all
non-matching parentheses MUST appear "on top level" only. This is
workable (meaning the match is still unique).

Your approach gives a circular definition: to define which paren chars
match one must know which ones DO NOT match, and the recursion is not
terminated. This is exactly what Philippe's example shows.

========================

My understanding is that what Unicode is trying to do is to collect the
best practical ways to treat multi-Language texts (without knowing fine
details about the languages actually used in the text). It may be that
what is "well understood" today IS only the case where non-matched
parens appear on top-level only.
So one may ask: what will be the result of the CURRENT UNICODE parsing applied to Philippe's example?

  This is an [«] example [»] for demonstration only.

By Eli's interpretation, it contains no matched parens. In one reading of your interpretation, the external-[] and guillemets would match, and internal-][ would be non-matching ones.

If one could "show" that in the majority of cases that is what the writer's intent was, THEN your interpretation would be "the best practical ways to treat multi-language texts", and it may be preferred to the current algorithmic description. THIS is why I think the message was on topic.

But this is all very shaky ground...

Yours,
Ilya

From ken.whistler at sap.com Mon Apr 21 19:44:12 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 22 Apr 2014 00:44:12 +0000
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <20140421234140.GA5269@powdermilk>
References: <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk>
Message-ID:

Ilya noted:

> [Below, I completely ignore BIDI part of the specification, and
> concentrate ONLY on the parens match. I do not understand why this
> question is interlaced with BIDI determination; I trust that it is.]

Actually, it is, because the bracket-matching is really only interesting in the cases where the boundaries of the isolating runs are in question, and there are some directional differences in the runs. The whole point of introducing the paired bracket complication was to deal with edge cases for that, but...

> So one may ask: what will be the result of the CURRENT UNICODE parsing
> applied
> to Philippe's example?
>
> This is an [«] example [»] for demonstration only.

That is easily answered. Let's crank up the bidi reference code with a shorter example that contains the relevant units:

  a [«] b [»] c

Turn up the trace output to see what rule N0 is actually doing, and you get the following. (Set your display wide enough to not wrap the output lines, for best interpretation.)

  Trace: Entering br_UBA_ResolvePairedBrackets [N0]
  Trace: br_PushBracketStack, bracket=005D, pos=2
  Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
  Trace: br_PeekBracketStack, bracket=005D, pos=2
  Appended pair: opening pos 2, closing pos 4
  Trace: br_PopBracketStack, #elements=1
  Matched bracket
  Trace: br_PushBracketStack, bracket=005D, pos=8
  Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
  Trace: br_PeekBracketStack, bracket=005D, pos=8
  Appended pair: opening pos 8, closing pos 10
  Trace: br_PopBracketStack, #elements=1
  Matched bracket
  Trace: Entering br_SortPairList
  Pair list: {2,4} {8,10}
  Append at end
  Trace: Exiting br_SortPairList
  Pair list: {2,4} {8,10}
  Debug: No strong direction between brackets
  Debug: No strong direction between brackets
  Current State: 14
  Text:        0061 0020 005B 00AB 005D 0020 0062 0020 005B 00BB 005D 0020 0063
  Bidi_Class:     L   WS   ON   ON   ON   WS    L   WS   ON   ON   ON   WS    L
  Levels:         0    0    0    0    0    0    0    0    0    0    0    0    0
  Runs:

Because of the way the stack processing is defined, the first bracket pair is [«] and the second bracket pair is [»]. The algorithm does not push down potential matches while seeking a largest outer pair to match.

One could (particularly if one is mathematically inclined) argue that that is not the right way to do the matching, but it *is* the way the algorithm is currently defined. And it is the way both of the bidi reference implementations, all of the BidiCharacterTest.txt data, the ICU implementation, the Microsoft implementation, and the HarfBuzz implementation are defined, to the best of my knowledge. Other implementations would have to be doing the same, or they would be failing the conformance tests in BidiCharacterTest.txt.
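For illustration, the stack behaviour visible in the trace can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the bidi reference code: the function name `bracket_pairs` and the three-entry bracket table are made up for the example, positions are 0-based character offsets as in the trace, and the stack-depth limit, canonical equivalence, and isolating run sequences are all left out.

```python
# Hypothetical bracket table for this sketch; the real data lives in BidiBrackets.txt.
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPEN_TO_CLOSE.values())


def bracket_pairs(text):
    """Return (opening_pos, closing_pos) pairs found by a BD16-style stack scan."""
    stack = []   # entries: (closing bracket being waited for, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in OPEN_TO_CLOSE:
            stack.append((OPEN_TO_CLOSE[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for the nearest waiting opener.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    pairs.append((stack[depth][1], pos))
                    del stack[depth:]   # drop the match and every opener above it
                    break
            # No match found: the closer stays unpaired and the stack is untouched.
    return sorted(pairs)


# Same text as in the trace: a [«] b [»] c
print(bracket_pairs("a [\u00ab] b [\u00bb] c"))   # [(2, 4), (8, 10)]
```

The guillemets are not in the bracket table, so they never reach the stack; each closing bracket simply pairs with the nearest waiting opener, which is why no outer [ ... ] pair is formed.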
Note that for an all left-to-right run of text like this, with no isolating runs and no embeddings, the implications of rule N0 are trivial and non-interesting. The bracket matches don?t end up *doing* anything relevant to the text reordering for bidi in this example. But once you start mixing directions of text and adding embeddings and isolating runs, then things get complicated in non-trivial ways for the output. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Apr 21 20:08:12 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Apr 2014 18:08:12 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140421234140.GA5269@powdermilk> References: <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> Message-ID: <5355C0FC.8060409@ix.netcom.com> Ilya, I appreciate your taking the time to take apart Philippe's message. That aspect of it was not obvious to me. A./ PS: more comments below On 4/21/2014 4:41 PM, Ilya Zakharevich wrote: > On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote: >> On 4/21/2014 1:54 PM, Philippe Verdy wrote: >>> My intent was not to demonstrate a bug in the algorithm, I have >>> not even claimed that, but to make sure that (less common) usages >>> of paired brackets that do not obey to a pure hierarchy (because >>> these notations use different type of brackets, they are not >>> ambiguous) but still preserve their left vs. right (or open vs. >>> close) semantic. >> OK, so this has nothing to do with "unclear text". > Asmus, I cannot agree with this. I think Philippe?s message is on topic. > > [Below, I completely ignore BIDI part of the specification, and > concentrate ONLY on the parens match. 
I do not understand why this > question is interlaced with BIDI determination; I trust that it is.] It really isn't. The result of detecting pairs allows one to improve on assigning directionality to the members of the pair, so that they would match (as expected). This works only for a (hopefully common) subset of all possible uses. Like the overall bidi algorithm (UBA) the paired bracket algorithm (PBA) is intended as a heuristic that frees the author from having to explicitly declare directionality for every bit of text, by providing a default directionality that should work with most text. Exceptional cases then, and ideally only those, would need overrides and similar mechanisms. > > I suspect Philippe was motivated by a kinda-cowboy attitude which both Eli > and you show to the problem of ?parentheses match? (and I suspect this > because THAT is my feeling ;-). You give two (IMO, informal) interpretations > of what the algorithm-based description says. These two interpretations > are obviously non-compatible (or at least not necessarily clearly stated). Eli and I both believe that a non-algorithmic definition should be possible, and that it is preferred to the current algorithmic definition. Not least, because with the algorithmic definition, it is not possible for anyone, by inspection, to be sure that they understand what the outcome would be. This is unacceptable, because authors of text (and not only implementers of the PBA) need to be able to predict where the heuristic fails and the text needs additional markup. This is not a trivial point - not everybody creates text at an editor where they can observe the results immediately and take corrective actions. Text is also edited in environments that do not do bidi processing (e.g. certain kinds of source format editing) or created as result of program action. 
Knowing when to insert (and when not to insert) bidi controls under program action would benefit from a definition that can be read independently of the implementation of the PBA.

>
> As Eli said it: "bracket pair ... a concept as easy and widely known/used as
> this would need such an obscure definition ...". Just for background: the first
> theorem on the "Applied Algebra" class taught by Yu.I.Manin was about
> parentheses match (it stated that the proper match is unique as far as it
> exists). This statement is a (tiny) mess to prove, but at least it should
> look very plausible to unwashed masses. (One corollary is that "the
> earliest possible one" from your interpretation is not actually needed.)

  ( a [ b ) c ] ?

The PBA matches the () but not the []. Some statement about "earliest" is needed to select between () and [], but my language contains a mistake.

>
> The problems appear when one wants to allow non-matching parentheses as
> as well as matched pairs. [If one fixes Eli's description so that "a pair"
> and "matched" are complete synonyms, then] what Eli conveys is that
> all non-matching parentheses MUST appear "on top level" only. This is
> workable (meaning the match is still unique).

Eli's definition was:

  A bracket pair is a pair of an opening paired bracket and a closing
  paired bracket characters within the same isolating run sequence,
  such that the Bidi_Paired_Bracket property value of the former
  character or its canonical equivalent equals the latter character or
  its canonical equivalent, and all the opening and closing bracket
  characters in between these two are balanced.

Given

  ( a [ b ) c ]

his definition contains no bracket pair, but the example in UAX#9 says that the () should form a pair.

The purpose of providing my wording was to do precisely the comparison you have been attempting here, so we end up with language that is an actual (and not merely an attempted) restatement of the algorithmic definition.
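To make that comparison concrete, here is a minimal Python sketch of one literal reading of the "balanced" wording quoted above. It rests on assumptions: "balanced" is taken to mean conventionally well-nested across all bracket types, the names `is_balanced` and `balanced_pairs` and the two-entry bracket table are hypothetical, and isolating run sequences and canonical equivalence are ignored.

```python
# Hypothetical two-type bracket table, enough for the example in this thread.
OPEN_TO_CLOSE = {"(": ")", "[": "]"}
CLOSERS = set(OPEN_TO_CLOSE.values())


def is_balanced(chars):
    """True if every bracket in chars is properly nested and closed."""
    expected = []
    for ch in chars:
        if ch in OPEN_TO_CLOSE:
            expected.append(OPEN_TO_CLOSE[ch])
        elif ch in CLOSERS:
            if not expected or expected.pop() != ch:
                return False
    return not expected


def balanced_pairs(text):
    """Pairs (i, j) whose brackets correspond and whose enclosed brackets are balanced."""
    pairs = []
    for i, opener in enumerate(text):
        if opener not in OPEN_TO_CLOSE:
            continue
        for j in range(i + 1, len(text)):
            if text[j] == OPEN_TO_CLOSE[opener] and is_balanced(text[i + 1:j]):
                pairs.append((i, j))
    return pairs


print(balanced_pairs("( a [ b ) c ]"))   # []
print(balanced_pairs("( a [ b ] c )"))   # [(0, 12), (4, 8)]
```

Under this reading the overlapping example yields no pair at all, because each candidate pair encloses an unbalanced bracket of the other type, while the properly nested variant yields both pairs.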
> Your approach gives a circular definition: to define which paren chars match > one must know which ones DO NOT match, and the recursion is not terminated. > This is exactly what Philippe?s example shows. Here's the text I supplied, with numbers added for discussion. It definitely needs some editing, but the point of the exercise would be to see what: 1. A bracket pair is a pair of characters consisting of an opening paired bracket and a closing paired bracket such that the Bidi_Paired_Bracket property value of the former equals the latter, subject to the following constraints. a - both characters of a pair occur in the same isolating run sequence b - the closing character of a pair follows the opening character c - any bracket character can belong at most to one pair, the earliest possible one d - any bracket character not part of a pair is treated like an ordinary character e - pairs may nest properly, but their spans may not overlap otherwise 2. Bracket characters with canonical decompositions are supposed to be treated as if they had been normalized, to allow normalized and non-normalized text to give the same result. c) needs rewording, because it is not correct The BD16 examples show a ( b ) c ) d 2-4 a ( b ( c ) d 4-6 From that, it follows that it's not the earliest but the one with the smallest span. What was intended was to cover the example: a ( b [ c ) d ] this would become (something like) d) brackets are resolved at the earliest opportunity, starting from the beginning of the text. f) unpaired bracket characters remaining inside a resolved bracket pair are treated as ordinary characters (get ignored for bracket matching purposes). Now, I do not see the recursion that you claim. > > ======================== > > My understanding is that Unicode is trying to do is to collect the best > practical ways to treat multi-Language texts (without knowing fine details > about the languages actually used in the text). It may be that what is > ?well understood? 
today IS only the case where non-matched parens appear on > top-level only. I don't know about "top level" - brackets improperly nested (and therefore unmatched within the enclosing bracket pair) are simply get ignored for resolving bracket pairs. > > So one may ask: what will be the result of the CURRENT UNICODE parsing applied > to Phillipe?s example? > > This is an [?] example [?] for demonstration only. > > By Eli?s interpretation, it contains no matched parens. In one reading of > your interpretation, the external-[] and guillemets would match, and > internal-][ would be non-matching ones. Neither. The PBA would match the two pairs of [ ]. Note - in order to make that claim, I'm using the examples from UAX#9, because I cannot run the algorithm in my head -- so I'm implicitly trusting the examples. > > If one could ?show? that in majority of cases that is what the writer?s > intent was, THEN your interpretation would be ?the best > practical ways to treat multi-Language texts?, and it may be prefered to > current-algorithmic-description. THIS is why I think the message was on topic. This part, that is "whether this is Best" is the one that's off topic. Eli and I are *exclusively* interested in getting the best possible statement for a definition that matches that algorithm and can actually be parsed by humans. Throwing in any discussion of whether the goal of the algorithm is the right one is confusing the issue and I would prefer if he went and started his own discussion. > > But this is all a very shaky ground? Why, because you can't follow the algorithm? :) A./ > > Yours, > Ilya > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From asmusf at ix.netcom.com Mon Apr 21 20:16:46 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 21 Apr 2014 18:16:46 -0700
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To:
References: <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk>
Message-ID: <5355C2FE.80207@ix.netcom.com>

On 4/21/2014 5:44 PM, Whistler, Ken wrote:
> > > So one may ask: what will be the result of the CURRENT UNICODE parsing
> > > applied
> > > to Philippe's example?
> > >
> > > This is an [«] example [»] for demonstration only.
> > That is easily answered. Let's crank up the bidi reference code with
> > a shorter example that contains the relevant units: a [«] b [»] c
>

I find it telling that this dispute can only be settled by showing trace output, and not, as is normal, by looking at the wording of the definition. It really makes Eli's and my point that the cop-out of using an algorithm to "define" the matching results in it being "unpredictable" to anyone not running sample text through an implementation.

> > Because of the way the stack processing is defined, the first bracket
> pair is [«]
> > and the second bracket pair is [»]. The algorithm does not push down
> potential
> > matches while seeking for a largest outer pair to match.
>

Rather than hiding this in the "stack processing" it would be possible to express this approach in non-algorithmic language, as you have done here. This is something that should be done.

A./
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From nospam-abuse at ilyaz.org Mon Apr 21 22:32:15 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Mon, 21 Apr 2014 20:32:15 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <5355C0FC.8060409@ix.netcom.com> References: <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> Message-ID: <20140422033215.GA5778@powdermilk> On Mon, Apr 21, 2014 at 06:08:12PM -0700, Asmus Freytag wrote: > Here's the text I supplied, with numbers added for discussion. It > definitely needs some > editing, but the point of the exercise would be to see what: > > 1. A bracket pair is a pair of characters consisting of an opening > paired bracket and a closing paired bracket such that the > Bidi_Paired_Bracket property value of the former equals the > latter, > subject to the following constraints. > > a - both characters of a pair occur in the same isolating run > sequence > b - the closing character of a pair follows the opening character > c - any bracket character can belong at most to one pair, the > earliest possible one > d - any bracket character not part of a pair is treated like an > ordinary character > e - pairs may nest properly, but their spans may not overlap > otherwise > > > 2. Bracket characters with canonical decompositions are > supposed to be treated > as if they had been normalized, to allow normalized and > non-normalized text > to give the same result. > > > c) needs rewording, because it is not correct > > The BD16 examples show > > a ( b ) c ) d 2-4 > a ( b ( c ) d 4-6 > > From that, it follows that it's not the earliest but the one with the smallest span. Sorry, I do not see any definition here. Just a collection of words which looks like a definition, but only locally? 
And I think I can even invent an example which I cannot parse using your definition: 1( 2[ 3( 4] 5) 6) Is looking-at-1 forcing match of 3-and-5? Or what? Thanks, Ilya From asmusf at ix.netcom.com Tue Apr 22 01:25:05 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Apr 2014 23:25:05 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140422033215.GA5778@powdermilk> References: <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> Message-ID: <53560B41.70108@ix.netcom.com> On 4/21/2014 8:32 PM, Ilya Zakharevich wrote: > On Mon, Apr 21, 2014 at 06:08:12PM -0700, Asmus Freytag wrote: >> Here's the text I supplied, with numbers added for discussion. It >> definitely needs some >> editing, but the point of the exercise would be to see what: >> >> 1. A bracket pair is a pair of characters consisting of an opening >> paired bracket and a closing paired bracket such that the >> Bidi_Paired_Bracket property value of the former equals the >> latter, >> subject to the following constraints. >> >> a - both characters of a pair occur in the same isolating run >> sequence >> b - the closing character of a pair follows the opening character >> c - any bracket character can belong at most to one pair, the >> earliest possible one >> d - any bracket character not part of a pair is treated like an >> ordinary character >> e - pairs may nest properly, but their spans may not overlap >> otherwise >> >> >> 2. Bracket characters with canonical decompositions are >> supposed to be treated >> as if they had been normalized, to allow normalized and >> non-normalized text >> to give the same result. 
>> >> >> c) needs rewording, because it is not correct >> >> The BD16 examples show >> >> a ( b ) c ) d 2-4 >> a ( b ( c ) d 4-6 >> >> From that, it follows that it's not the earliest but the one with the smallest span. > Sorry, I do not see any definition here. Just a collection of words > which looks like a definition, but only locally? Thank you for the high praise. :? Now you deleted language which I will restore here, put into a reasonable order and complete the suggested edit on "c" d) brackets are resolved at the earliest opportunity, starting from the beginning of the text. c) if there are two possible ways to resolve a pair, the one spanning less text is used. f) unpaired bracket characters remaining inside a resolved bracket pair are treated as ordinary characters (get ignored for bracket matching purposes). > > And I think I can even invent an example which I cannot parse using > your definition: > > 1( 2[ 3( 4] 5) 6) > > Is looking-at-1 forcing match of 3-and-5? Or what? Let's see what the text gives (before we improve it further). 1. - 1( or 3( could match 5) or 6) , 2[ could only match 4] a. - we have only one isolating run, so this is a no-op b. - all opening characters follow their putative closing characters, so this is a no-op d. - at location 5 is the earliest opportunity to match a pair (before we get to 5 we don't have a opening and closing) c. - we could match 1( or 3( but we use 3, because it spans less text e. , f. - can probably combine these, but 4] is now inside a resolved pair and is ignored. Now, when we reach 6) we have another pair, and per d, it's the earliest possible moment we can resolve it, so we match 1) and 6). Now I add something to your example 1( 2[ 3( 4] 5) 6) 7] even though 2[ and 7] properly surround 3( and 5), they can't match, because 1( and 6) surround only 2[, which makes it unpaired and ignored (per f.). 
If the example had been 1( 2[ 3( 4] 5) 6] 7) then, on reaching 6] we could have matched it with 2[ and 7) with 1( Eli's definition starts A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters within the same isolating run sequence, such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent, .... and continues: ....and all the opening and closing bracket characters in between these two are balanced. That continuation we found out was incorrect, so we would need to fix it. Here's an attempt: ... subject to the following conditions: a. a match is attempted at the left-most closing bracket character unmatched at this point b. the closest earlier matching opening bracket, that is unmatched at this point is used to form the pair c. any unmatched bracket character enclosed in a pair is ignored for further matching d. matching ends when no more pairs can be formed I believe with this, you can parse the examples in UAX#9 and the examples we discussed here. If not, I'd appreciate if you could help identify and remedy any gaps. A./ From mark at macchiato.com Tue Apr 22 01:57:41 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 22 Apr 2014 09:57:41 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140421111435.665a7a7059d7ee80bb4d670165c8327d.24acdc5211.wbe@email03.secureserver.net> References: <20140421111435.665a7a7059d7ee80bb4d670165c8327d.24acdc5211.wbe@email03.secureserver.net> Message-ID: We try not to do that. There are some known holes, like RBNF. if you know of others please file a ticket. {phone} On Apr 21, 2014 9:18 PM, "Doug Ewell" wrote: > From: Asmus Freytag wrote: > > > In general, I heartily dislike "specifications" that just narrate a > > particular implementation... > > I agree completely. 
I see this with CLDR as well; there is a more or > less implicit assumption that I will be using ICU to implement whatever > is being described. I don't care how robust and well-tested a wheel is; > as a developer, I should be able to use the specification to reinvent it > if I like. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nospam-abuse at ilyaz.org Tue Apr 22 04:19:37 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Tue, 22 Apr 2014 02:19:37 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <53560B41.70108@ix.netcom.com> References: <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> Message-ID: <20140422091937.GA6603@powdermilk> On Mon, Apr 21, 2014 at 11:25:05PM -0700, Asmus Freytag wrote: > On 4/21/2014 8:32 PM, Ilya Zakharevich wrote: > >On Mon, Apr 21, 2014 at 06:08:12PM -0700, Asmus Freytag wrote: > >>Here's the text I supplied, with numbers added for discussion. It > >>definitely needs some > >>editing, but the point of the exercise would be to see what: > >> > >> 1. A bracket pair is a pair of characters consisting of an opening > >> paired bracket and a closing paired bracket such that the > >> Bidi_Paired_Bracket property value of the former equals the > >>latter, > >> subject to the following constraints. 
> >> > >> a - both characters of a pair occur in the same isolating run > >> sequence > >> b - the closing character of a pair follows the opening character > >> c - any bracket character can belong at most to one pair, the > >> earliest possible one > >> d - any bracket character not part of a pair is treated like an > >> ordinary character > >> e - pairs may nest properly, but their spans may not overlap > >> otherwise > >> > >> > >> 2. Bracket characters with canonical decompositions are > >>supposed to be treated > >> as if they had been normalized, to allow normalized and > >>non-normalized text > >> to give the same result. > >> > >> > >>c) needs rewording, because it is not correct > >> > >>The BD16 examples show > >> > >> a ( b ) c ) d 2-4 > >> a ( b ( c ) d 4-6 > >> > >> From that, it follows that it's not the earliest but the one with the smallest span. > >Sorry, I do not see any definition here. Just a collection of words > >which looks like a definition, but only locally? > Thank you for the high praise. :? > > Now you deleted language which I will restore here, put into a > reasonable order and complete the suggested > edit on "c" > > d) brackets are resolved at the earliest opportunity, starting from the beginning of the text. > > c) if there are two possible ways to resolve a pair, the one spanning less text is used. > > f) unpaired bracket characters remaining inside a resolved bracket pair are treated as > ordinary characters (get ignored for bracket matching purposes). As I said, to me it is just a combination of words, and I have no idea how to assign meaning to them. > >And I think I can even invent an example which I cannot parse using > >your definition: > > > > 1( 2[ 3( 4] 5) 6) > > > >Is looking-at-1 forcing match of 3-and-5? Or what? > Let's see what the text gives (before we improve it further). > > 1. - 1( or 3( could match 5) or 6) , 2[ could only match 4] > > a. - we have only one isolating run, so this is a no-op > b. 
- all opening characters follow their putative closing
> characters, so this is a no-op
> d. - at location 5 is the earliest opportunity to match a pair
> (before we get to 5 we don't have a opening and closing)

Why not match at location 4 then?! And with

  1( 2[ 3( 4] 5) 6) 1a[ 2a( 3a[ 4a) 5a] 6a]

would you match 2a with 4a on this step?

I think the crucial problem is with

  1( 2[ 3( 4] 5) 5b] 6)

I have two possible interpretations: one matches 2 with 5b, another leaves 2 unmatched.

=======================================================

Anyway, here is a writeup of one of the possible interpretations:

========= Part I

As I said before, for a string consisting of "(" and ")" only, there is a notion of (Eli's match): depth-match with unmatched "(" and ")" at top-level only. In case this is unclear:

  (A) the string is broken into pieces ")", "(", and depth-matched pieces;
  (B) every piece "(" is after every piece ")".

Such a match is unique (unless I'm mistaken), and for every matched guy we know where the other (matching) character of the pair is.

========= Part II

Now allow also secondary parens, "[" and "]", in the string.

  (-1) Match "(" and ")" as above (ignoring "[" and "]").
  (0) Match "[" and "]" as above at toplevel (remove matched pairs "(" and ")" and everything between them).
  (1) Do the same inside toplevel pairs of matching "(" and ")" (remove matched pairs "(" and ")" of 2nd level and everything between them).
  (...) Etc.

========== Part III

Now allow ternary parens, "{" "}"; etc.

This way, if we have a HIERARCHY of paired Unicode characters, there is a unique notion of depth-match with unmatched delimiters at RELATIVE top-level only. (Here RELATIVE means: w.r.t. delimiters of higher precedence).
=========== Part IV

Now, define hierarchy DYNAMICALLY separately for every position in the string: if match-with-what/non-match is already decided, given a position in a string with delimiter:

  (a) write delimiters which enclose the position in order of deeper=later;
  (b) add the delimiter-at-the-position at the end;
  (c) remove duplicates;
  (d) other types of delimiter are later in the hierarchy (in arbitrary order).

This is the dynamically-defined hierarchy at the position.

=========== Part V

The decision of match-with-what/non-match is called OK if every position with delimiter is matched/non-matched according to Part III w.r.t. the hierarchy defined in Part IV.

Conjecture:
===========

For every string of delimiters, the OK decision of Part V is unique.

Ilya

From eliz at gnu.org Tue Apr 22 11:02:00 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Tue, 22 Apr 2014 19:02:00 +0300
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <53560B41.70108@ix.netcom.com>
References: <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com>
Message-ID: <83oaztbbev.fsf@gnu.org>

> Date: Mon, 21 Apr 2014 23:25:05 -0700
> From: Asmus Freytag
> Cc: verdy_p at wanadoo.fr, ken at unicode.org, Eli Zaretskii ,
> James Clark ,
> unicode Unicode Discussion
>
> > And I think I can even invent an example which I cannot parse using
> > your definition:
> >
> > 1( 2[ 3( 4] 5) 6)
> >
> > Is looking-at-1 forcing match of 3-and-5? Or what?
>
>
> Let's see what the text gives (before we improve it further).
>
> 1. - 1( or 3( could match 5) or 6) , 2[ could only match 4]
>
> a. - we have only one isolating run, so this is a no-op
> b. - all opening characters follow their putative closing characters, so
> this is a no-op
> d.
- at location 5 is the earliest opportunity to match a pair > (before we get to 5 we don't have a opening and closing) > c. - we could match 1( or 3( but we use 3, because it spans less text > e. , f. - can probably combine these, but 4] is now inside a resolved > pair and is ignored. > > Now, when we reach 6) we have another pair, and per d, it's the earliest > possible moment > we can resolve it, so we match 1) and 6). But that's wrong, isn't it? If I follow the algorithm in BD16 (which is really our only reference at this point), I get this: input results 1( push 1) 2[ push 2] 3( push 3) 4] produce a pair 2[ 4] and pop through and including 2] 5) produce 1( 5) and pop the entire stack 6) nothing (remains unmatched) The reference implementation (after I managed to understand how to invoke it for this case) agrees with me. This once again underlines the problem with the original "definition" in BD16, which does not lend itself to a useful and yet intuitive notion of what is "right". > Eli's definition starts > > A bracket pair is a pair of an opening paired bracket and a closing > paired bracket characters within the same isolating run sequence, > such that the Bidi_Paired_Bracket property value of the former > character or its canonical equivalent equals the latter character or > its canonical equivalent, .... > > and continues: > > ....and all the opening and closing bracket > characters in between these two are balanced. > > That continuation we found out was incorrect, so we would need to fix it. Indeed. > Here's an attempt: > > ... subject to the following conditions: > > > a. a match is attempted at the left-most closing bracket character > unmatched at this point > b. the closest earlier matching opening bracket, that is unmatched > at this point is used to form the pair > c. any unmatched bracket character enclosed in a pair is ignored > for further matching > d. 
matching ends when no more pairs can be formed

I agree, but let me try to say the same more concisely:

  A bracket pair is a pair of an opening paired bracket and a closing
  paired bracket characters within the same isolating run sequence,
  such that the Bidi_Paired_Bracket property value of the former
  character or its canonical equivalent equals the latter character
  or its canonical equivalent, and provided that a closing bracket is
  matched to the closest match candidate, disregarding any candidates
  that either already have a closer match, or are enclosed in a
  matched pair of other 2 bracket characters.

From asmusf at ix.netcom.com Tue Apr 22 11:06:27 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 22 Apr 2014 09:06:27 -0700
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <20140422091937.GA6603@powdermilk>
References: <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk>
Message-ID: <53569383.4020502@ix.netcom.com>

On 4/22/2014 2:19 AM, Ilya Zakharevich wrote:
> I think the crucial problem is with
>
> 1( 2[ 3( 4] 5) 5b] 6)
>
> I have two possible interpretations: one matches 2 with 5b, another
> leaves 2 unmatched.

Ilya,

if you read UAX#9, the way the algorithm works is by pushing openers on a stack, then, on finding the first closer, going down the stack and attempting to locate a match, then, on finding a match, discarding any enclosed openers, on not finding a match, discarding the closer. (discard = ignore for further matching, don't treat as bracket any longer).

So, when we reach 4] we have 3( 2[ 1( on the stack. The match is with 2[ and 3 is ignored. 1( remains and can be matched later to 5). Ultimately 5b] and 6) are ignored.
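The walk just described can be checked with a short Python sketch. It is written with per-bracket bookkeeping instead of an explicit stack, on the assumption that "discard" means exactly "never consider again"; the function name `resolve`, the labelled-token input format, and the two-entry bracket table are hypothetical, and the stack-depth limit and canonical equivalence of UAX#9 are not modelled.

```python
OPEN_TO_CLOSE = {"(": ")", "[": "]"}
CLOSE_TO_OPEN = {close: opener for opener, close in OPEN_TO_CLOSE.items()}


def resolve(tokens):
    """tokens: list of (label, bracket). Returns {label: partner label, or None if ignored}."""
    partner = {}   # a label appears here once it is paired or discarded
    for j, (label, ch) in enumerate(tokens):
        if ch not in CLOSE_TO_OPEN:
            continue                       # openers just wait for a closer
        wanted = CLOSE_TO_OPEN[ch]
        match = None
        # Walk back to the nearest opener of the right kind that is still waiting.
        for i in range(j - 1, -1, -1):
            cand_label, cand_ch = tokens[i]
            if cand_ch == wanted and cand_label not in partner:
                match = i
                break
        if match is None:
            partner[label] = None          # no opener available: discard the closer
            continue
        opener_label = tokens[match][0]
        partner[opener_label] = label
        partner[label] = opener_label
        # Discard every opener still waiting inside the new pair.
        for i in range(match + 1, j):
            inner_label, inner_ch = tokens[i]
            if inner_ch in OPEN_TO_CLOSE and inner_label not in partner:
                partner[inner_label] = None
    for label, _ in tokens:
        partner.setdefault(label, None)    # openers that never found a closer
    return partner


example = [("1", "("), ("2", "["), ("3", "("), ("4", "]"),
           ("5", ")"), ("5b", "]"), ("6", ")")]
print(resolve(example))
```

For this input the sketch pairs 2 with 4 and 1 with 5, and leaves 3, 5b and 6 unpaired, which agrees with the step-by-step description above.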
I believe that your scheme does not match the PBA in that it assumes that brackets are hierarchical and attempts to preserve the best hierarchy, whereas PBA assumes that pairs that are closer together are more likely to be correct matches (for non-mathematical texts hierarchies are not the norm (and they are shallow at best)).

What the PBA actually does can now be put into a definition plus a rule, neither of which uses "stack" or other implementation details, such as "variables" or "lists".

D  A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters within the same isolating run sequence, such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent.

R  Characters are resolved into resolved bracket pairs as follows: Starting at the beginning of the text, when a closing bracket character is encountered, find the nearest preceding opening character that is not part of a resolved pair, is not ignored for pair resolution, and can form a bracket pair. If one exists, resolve the pair, and mark any enclosed opening brackets of any kind as ignored. Otherwise, if no pair can be resolved, mark the closing bracket as ignored.

What this shows is that the text in BD16 of UAX#9 tries to cover both a definition and a rule, which makes it so difficult to follow.

I think what should be proposed is such a breakdown into a smaller definition that speaks to the matching of properties (modulo canonical equivalence), separate from the strategy for resolving actual pairs, which is better stated as a rule.

The rule does not need to use implementation language to be definite. A "resolved" bracket pair is simply the actual pair resolved by rule "R", and the rest of the PBA acts on "resolved" pairs.

A./
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From eliz at gnu.org Tue Apr 22 11:08:56 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Tue, 22 Apr 2014 19:08:56 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140422033215.GA5778@powdermilk> References: <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> Message-ID: <83mwfdbb3b.fsf@gnu.org> > Date: Mon, 21 Apr 2014 20:32:15 -0700 > From: Ilya Zakharevich > Cc: verdy_p at wanadoo.fr, ken at unicode.org, Eli Zaretskii , > unicode Unicode Discussion , James Clark > > > Sorry, I do not see any definition here. Just a collection of words > which looks like a definition, but only locally? Any definition is just a collection of words, of course. Can you tell what is missing from this collection to make it eligible? > 1( 2[ 3( 4] 5) 6) > > Is looking-at-1 forcing match of 3-and-5? Or what? No. From asmusf at ix.netcom.com Tue Apr 22 11:52:43 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 22 Apr 2014 09:52:43 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83oaztbbev.fsf@gnu.org> References: <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <83oaztbbev.fsf@gnu.org> Message-ID: <53569E5B.6040006@ix.netcom.com> On 4/22/2014 9:02 AM, Eli Zaretskii wrote: >> an resolve it, so we match 1) and 6). > But that's wrong, isn't it? Yes, brain fart. 
> I agree, but let me try to say the same more concisely: > > A bracket pair is a pair of an opening paired bracket and a closing > paired bracket characters within the same isolating run sequence, > such that the Bidi_Paired_Bracket property value of the former > character or its canonical equivalent equals the latter character > or its canonical equivalent, and provided that a closing bracket is > matched to the closest match candidate, disregarding any candidates > that either already have a closer match, or are enclosed in a > matched pair of other 2 bracket characters. > > I think that this (or something like this might work), but that we are better off splitting this into a definition and a rule as I have proposed in my previous message. In the rest of the bidi algorithm, rules are used to describe actions taken on scanning text, and "resolving" bracket pairs is such a scan. A./ From eliz at gnu.org Tue Apr 22 12:02:18 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Tue, 22 Apr 2014 20:02:18 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <53569383.4020502@ix.netcom.com> References: <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> Message-ID: <83eh0pb8md.fsf@gnu.org> > Date: Tue, 22 Apr 2014 09:06:27 -0700 > From: Asmus Freytag > CC: Eli Zaretskii , ken at unicode.org, > unicode Unicode Discussion , > James Clark > > I believe that your scheme does not match the PBA in that it assumes > that brackets are hierarchical and attempts to preserve the best > hierarchy, whereas PBA assumes that pairs that are closer together are > more likely to be correct matches (for non-mathematical texts > hierarchies are not the norm (and they are shallow at best)). 
Indeed, that's the somewhat counter-intuitive part of the PBA, one that IMO should be explicitly pointed out in the text (as a note), because many readers will not expect that. > D A bracket pair is a pair of an opening paired bracket and a closing > paired bracket characters within the same isolating run sequence, > such that the Bidi_Paired_Bracket property value of the former > character or its canonical equivalent equals the latter character or > its canonical equivalent. > > R Characters are resolved into resolved bracket pairs as follows: > Starting at the beginning of the text, when the a closing bracket > character > is encountered, find the nearest preceding opening character that is > not part > of a resolved pair, and not ignored for pair resolution and that can > form a > bracket pair. If one exists, resolve the pair, and mark any enclosed > opening > brackets of any kind as ignored. Otherwise, if no pair can be > resolved, mark > the closing bracket as ignored. Please compare this with my latest suggestion. I think I say the same thing. 
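The property-matching half of the quoted definition D (matching "modulo canonical equivalence") can be illustrated separately from the resolution rule. The sketch below is hedged: the `BPB` table is a hypothetical four-entry excerpt standing in for the Bidi_Paired_Bracket property, and using NFD to obtain the canonical equivalent is an assumption that happens to work for single bracket characters such as U+2329, which decomposes canonically to U+3008.

```python
import unicodedata

# Hypothetical excerpt of Bidi_Paired_Bracket (opening -> closing bracket).
BPB = {
    "(": ")",
    "[": "]",
    "\u2329": "\u232A",  # LEFT/RIGHT-POINTING ANGLE BRACKET
    "\u3008": "\u3009",  # LEFT/RIGHT ANGLE BRACKET
}


def canon(ch):
    # NFD replaces a character by its canonical decomposition, if it has one.
    return unicodedata.normalize("NFD", ch)


def forms_bracket_pair(opener, closer):
    """True if the opener's paired bracket equals the closer, up to canonical equivalence."""
    if opener not in BPB:
        return False
    return canon(BPB[opener]) == canon(closer)


print(forms_bracket_pair("\u2329", "\u3009"))
print(forms_bracket_pair("(", "]"))
```

Keeping this predicate separate from the pair-resolution scan mirrors the split into a definition and a rule that is being discussed in the thread.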
From eliz at gnu.org Tue Apr 22 12:11:34 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Tue, 22 Apr 2014 20:11:34 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <53569E5B.6040006@ix.netcom.com> References: <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <83oaztbbev.fsf@gnu.org> <53569E5B.6040006@ix.netcom.com> Message-ID: <83bnvtb86x.fsf@gnu.org> > Date: Tue, 22 Apr 2014 09:52:43 -0700 > From: Asmus Freytag > CC: nospam-abuse at ilyaz.org, verdy_p at wanadoo.fr, ken at unicode.org, > jjc at jclark.com, unicode at unicode.org > > > I agree, but let me try to say the same more concisely: > > > > A bracket pair is a pair of an opening paired bracket and a closing > > paired bracket characters within the same isolating run sequence, > > such that the Bidi_Paired_Bracket property value of the former > > character or its canonical equivalent equals the latter character > > or its canonical equivalent, and provided that a closing bracket is > > matched to the closest match candidate, disregarding any candidates > > that either already have a closer match, or are enclosed in a > > matched pair of other 2 bracket characters. > > > > > I think that this (or something like this) might work, but that we are > better off > splitting this into a definition and a rule as I have proposed in my > previous message. Why not have the above _and_ a rule? The rule should be worded so as to help understanding the definition. But IMO it is not a good idea to have a rule as an integral part of the definition, because the two serve different purposes. And I think we should also point out explicitly that the brackets match non-hierarchically, as many readers will expect that they are, and will be confused. 
> In the rest of the bidi algorithm, rules are used to describe actions > taken on scanning text, and "resolving" bracket pairs is such a scan. Yes, but other definitions don't use rules as their integral parts. Why should this one be an exception? From asmusf at ix.netcom.com Tue Apr 22 13:41:39 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 22 Apr 2014 11:41:39 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83bnvtb86x.fsf@gnu.org> References: <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <83oaztbbev.fsf@gnu.org> <53569E5B.6040006@ix.netcom.com> <83bnvtb86x.fsf@gnu.org> Message-ID: <5356B7E3.6020500@ix.netcom.com> On 4/22/2014 10:11 AM, Eli Zaretskii wrote: >> Date: Tue, 22 Apr 2014 09:52:43 -0700 >> From: Asmus Freytag >> CC: nospam-abuse at ilyaz.org, verdy_p at wanadoo.fr, ken at unicode.org, >> jjc at jclark.com, unicode at unicode.org >> >>> I agree, but let me try to say the same more concisely: >>> >>> A bracket pair is a pair of an opening paired bracket and a closing >>> paired bracket characters within the same isolating run sequence, >>> such that the Bidi_Paired_Bracket property value of the former >>> character or its canonical equivalent equals the latter character >>> or its canonical equivalent, and provided that a closing bracket is >>> matched to the closest match candidate, disregarding any candidates >>> that either already have a closer match, or are enclosed in a >>> matched pair of other 2 bracket characters. >>> >>> >> I think that this (or something like this) might work, but that we are >> better off >> splitting this into a definition and a rule as I have proposed in my >> previous message. > Why not have the above _and_ a rule? 
The rule should be worded so as > to help understanding the definition. But IMO it is not a good idea > to have a rule as an integral part of the definition, because the two > serve different purposes. Not everything needs to be in a single definition. The specification, to uplevel the discussion at this point, is composed of definitions and rules. What I am proposing is that the natural unit for definition is the paired bracket as defined by the match in properties, in other words ( with ) and not ( with ]. The part that picks out of the possible pairs in a span of text is really better handled as a rule - it describes an action to be performed. We really have two concepts and an action here. 1) matching bracket characters (a pair, or if you want, a possible or putative pair) 2) specific bracket characters in a given piece of text that match (resolved pair) 3) the act of resolving pairs given a specific sequence of characters (defined by a rule) So if you wanted to BD 16 could be split into BD16a defining the (putative) pair based on matching properties and 16b defining the term "resolved pair". A rule, Rx, could then be specified to describe the resolution process, and relying on the definition of the (putative) pair only. After Rx has been applied, all identified pairs are 'resolved pairs', and the remainder of the algorithm can be stated using that term. That structure matches the rest of the specification. > > And I think we should also point out explicitly that the brackets > match non-hierarchically, as many readers will expect that they are, > and will be confused. That's a good note, as we have seen that some people make that assumption (and it's a natural one from a mathematical point). > >> In the rest of the bidi algorithm, rules are used to describe actions >> taken on scanning text, and "resolving" bracket pairs is such a scan. > Yes, but other definitions don't use rules as their integral parts. > Why should this one be an exception? 
>

I agree - I would not put what I call "Rx" inside any definitions. It goes into the rules section.

Now, BD16b seems to depend on performing Rx, but not really. A "resolved" pair is simply the result of any kind of resolution process. That the PBA uses Rx to do the resolution isn't part of the definition of the term - Rx serves to identify which pairs are resolved pairs, not that a resolved pair represents a choice among possible pairings.

The whole mess in UAX#9 is based on the fact that the authors thought that they had to mix these two levels.

As you asked me to compare my statement with yours, I'm adding it here again. I do believe both would lead to the same implementation, but my preference is to move the description of how to resolve into a rule.

This would be the kind of text I'd suggest adding to UAX#9.

//---------

BD16a  A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent.

BD16b  A resolved bracket pair is a bracket pair that has been selected from among possible bracket pairs in an isolating run sequence.

Note: for the PBA this selection is performed according to Rx (below).

Rx  For each isolating run sequence, bracket characters are selected into resolved bracket pairs as follows: Starting at the beginning of the run sequence, when a closing bracket character is encountered, find the nearest preceding opening character that forms a bracket pair, but is not already part of a resolved bracket pair, and not ignored for bracket pair selection. If one exists, resolve the pair, and mark any enclosed opening brackets of any kind as not part of a bracket pair and ignored for further bracket pair selection. Otherwise, if no pair can be selected, mark the closing bracket as not part of a pair and ignored for further pair selection.
Note: the outcome of Rx is a list of resolved pairs and their locations. Selected pairs can nest, but can't otherwise overlap. The rule prefers the closest pair for matching as opposed to attempting to select for the most hierarchical set of nested pairs. (See examples).

------------

What I have called Rx here would become N0a, with the part of N0 that is the second bullet numbered N0b.

I would move the existing examples into the rules section, not leave them in the definitions as they are today.

A./

From nospam-abuse at ilyaz.org Tue Apr 22 16:17:44 2014
From: nospam-abuse at ilyaz.org (Ilya Zakharevich)
Date: Tue, 22 Apr 2014 14:17:44 -0700
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <83mwfdbb3b.fsf@gnu.org>
References: <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <83mwfdbb3b.fsf@gnu.org>
Message-ID: <20140422211744.GA9869@powdermilk>

On Tue, Apr 22, 2014 at 07:08:56PM +0300, Eli Zaretskii wrote:
> > Sorry, I do not see any definition here. Just a collection of words
> > which looks like a definition, but only locally?
>
> Any definition is just a collection of words, of course. Can you tell
> what is missing from this collection to make it eligible?

This is a very delicate question, of course. And it is very personal: every definition assumes a certain target population. But let me try:

A) It should be immediately clear which of the possible meanings of every word/phrase was intended by the author;

B) It should have a unique non-self-contradictory interpretation;

C) The reader should immediately get a feeling that given enough effort, one will be able to understand what is the interpretation in (B).

Now, (A) avoids exponential growth of possible "local interpretations".
The need for (B) is self-obvious (although what is self-contradictory would also depend on the reader?s abilities). And (C) is a major psychological help: usually, (A) cannot stop the exponential growth of possible ?global interpretations? (?how the pieces are intended to joint together?). Essentially, one gets a tree of possible choices, and it is crucial that when searching along the tree, one can cut off ?wrong? branches as early as possible. Ilya From asmusf at ix.netcom.com Tue Apr 22 16:58:43 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 22 Apr 2014 14:58:43 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140422211744.GA9869@powdermilk> References: <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <83mwfdbb3b.fsf@gnu.org> <20140422211744.GA9869@powdermilk> Message-ID: <5356E613.5060906@ix.netcom.com> On 4/22/2014 2:17 PM, Ilya Zakharevich wrote: > On Tue, Apr 22, 2014 at 07:08:56PM +0300, Eli Zaretskii wrote: >>> Sorry, I do not see any definition here. Just a collection of words >>> which looks like a definition, but only locally? >> Any definition is just a collection of words, of course. Can you tell >> what is missing from this collection to make it eligible? > This is a very delicate question, of course. And it is very personal: > every definition assumes a certain target population. But let me try: > > A) It should be immediately clear which of the possible meanings of > every word/phrase was intended by the author; > > B) It should have a unique non-self-contradictory interpretation; > > C) The reader should immediately get a feeling that given enough > effort, one will be able to understand what is the interpretation > in (B). > > Now, (A) avoids exponential growth of possible ?local > interpretations?. 
The need for (B) is self-obvious (although what is > self-contradictory would also depend on the reader?s abilities). > > And (C) is a major psychological help: usually, (A) cannot stop the > exponential growth of possible ?global interpretations? (?how the > pieces are intended to joint together?). Essentially, one gets a tree > of possible choices, and it is crucial that when searching along the > tree, one can cut off ?wrong? branches as early as possible. Thanks for the "theory" - we've taken the discussion off-list where we can work collaboratively on improving the language rather than theoretizing. A./ From nospam-abuse at ilyaz.org Wed Apr 23 02:35:02 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 23 Apr 2014 00:35:02 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <53569383.4020502@ix.netcom.com> References: <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> Message-ID: <20140423073502.GA11904@powdermilk> On Tue, Apr 22, 2014 at 09:06:27AM -0700, Asmus Freytag wrote: > if you read UAX#9, the way the algorithm works is by pushing openers > on a stack, then, on finding the first closer, going down the stack > and attempting to locate a match, then, on finding a match, > discarding any enclosed openers, on not finding a match, discarding > the closer. I think I LOVE this definition. Simple, beautiful, and IMO following people?s expectations very closely. Here is what ?theoretizing? gives: a parsing is good if it satisfies all conditions below: 0) Some delimiters in the string are marked as ?non-matching?; the rest is broken into disjoint ?matched? pairs; MATCH) A ?matched? pair consists of an open-delimiter and matching close- delimiter (in this order in the string). NEST) ?Matched? 
pairs are properly nested (meaning that 2 pairs cannot be positioned as Open1 Open2 Close1 Close2 in the string order).

MINLEN) "Inside" a "matched" pair, every delimiter which could match elements of the pair but is marked as "non-matching" must nest inside some deeper-nested "matched" pair.

(I hope that the meaning of the word "inside" in MINLEN is clear.)

GREED) Given any close-delimiter marked as "non-matching", its pre-context does not contain any open-delimiter which could match it.

Here pre-context of a position is a concatenation of substrings of the initial string:
  * Take the most deeply nested "matched pair" containing the position (if none, the whole string);
  * take the part of the string inside this pair AND before the position;
  * remove all "matched" pairs completely contained inside this substring together with what they enclose.

Ilya

P.S. Judging by another message of yours, for you "theoretizing" is a 4-letter word? Oh well...

From eliz at gnu.org Wed Apr 23 09:54:01 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Wed, 23 Apr 2014 17:54:01 +0300
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
In-Reply-To: <20140422211744.GA9869@powdermilk>
References: <535559FB.60100@ix.netcom.com> <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <83mwfdbb3b.fsf@gnu.org> <20140422211744.GA9869@powdermilk>
Message-ID: <83ha5k9jw6.fsf@gnu.org>

> Date: Tue, 22 Apr 2014 14:17:44 -0700
> From: Ilya Zakharevich
> Cc: asmusf at ix.netcom.com, verdy_p at wanadoo.fr, ken at unicode.org,
> unicode at unicode.org, jjc at jclark.com
>
> On Tue, Apr 22, 2014 at 07:08:56PM +0300, Eli Zaretskii wrote:
> > > Sorry, I do not see any definition here. Just a collection of words
> > > which looks like a definition, but only locally?
> >
> > Any definition is just a collection of words, of course.
Can you tell > > what is missing from this collection to make it eligible? > > This is a very delicate question, of course. And it is very personal: > every definition assumes a certain target population. But let me try: > > A) It should be immediately clear which of the possible meanings of > every word/phrase was intended by the author; > > B) It should have a unique non-self-contradictory interpretation; > > C) The reader should immediately get a feeling that given enough > effort, one will be able to understand what is the interpretation > in (B). I agree with all of the above (except that "immediately" might not be practically achievable in complex cases). I tried to achieve all of these goals with my attempted definitions, and I believe I succeeded, as did Asmus. From eliz at gnu.org Wed Apr 23 10:25:53 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Wed, 23 Apr 2014 18:25:53 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140423073502.GA11904@powdermilk> References: <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> Message-ID: <8361m09if2.fsf@gnu.org> > Date: Wed, 23 Apr 2014 00:35:02 -0700 > From: Ilya Zakharevich > Cc: Eli Zaretskii , ken at unicode.org, unicode Unicode Discussion > , James Clark > > On Tue, Apr 22, 2014 at 09:06:27AM -0700, Asmus Freytag wrote: > > if you read UAX#9, the way the algorithm works is by pushing openers > > on a stack, then, on finding the first closer, going down the stack > > and attempting to locate a match, then, on finding a match, > > discarding any enclosed openers, on not finding a match, discarding > > the closer. > > I think I LOVE this definition. 
Probably because you yourself wrote it ;-) > Simple, beautiful, and IMO following people?s expectations very > closely. I see nothing in your definition that is significantly different from our attempts. It does feel more complex, mainly because you have much more conditions, combining which in one's mind might not be easy at first reading. From asmusf at ix.netcom.com Wed Apr 23 11:21:04 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 23 Apr 2014 09:21:04 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140423073502.GA11904@powdermilk> References: <535569F2.5080601@ix.netcom.com> <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> Message-ID: <5357E870.6060603@ix.netcom.com> On 4/23/2014 12:35 AM, Ilya Zakharevich wrote: > On Tue, Apr 22, 2014 at 09:06:27AM -0700, Asmus Freytag wrote: >> if you read UAX#9, the way the algorithm works is by pushing openers >> on a stack, then, on finding the first closer, going down the stack >> and attempting to locate a match, then, on finding a match, >> discarding any enclosed openers, on not finding a match, discarding >> the closer. > I think I LOVE this definition. Simple, beautiful, and IMO following > people?s expectations very closely. I hadn't intended it as a definition, but let's see how it would work as one. The "stack" isn't necessary: The algorithm works by finding the first closer, looking back and attempting to locate a match, then, on finding the nearest match, discarding any enclosed openers, otherwise on not finding a match, discarding the closer. That an implementation uses a "stack" to avoid having to "look back" is a detail that has no place in a well-crafted definition. 
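The stack-free restatement can be made concrete as a sketch that looks back from each closer instead of maintaining a stack. As before, the `PAIRS` table is a hypothetical stand-in for the Bidi_Paired_Bracket data, the input is assumed to be a single isolating run sequence, and the name `resolve_pairs_lookback` is invented for the illustration.

```python
# Hypothetical pairing table (opener -> closer), standing in for Bidi_Paired_Bracket.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def resolve_pairs_lookback(text):
    """Resolve bracket pairs by looking back from each closer; no stack is kept."""
    used = [False] * len(text)      # character already belongs to a resolved pair
    ignored = [False] * len(text)   # character discarded for further matching
    pairs = []
    for i, ch in enumerate(text):
        if ch not in CLOSERS:
            continue
        # Look back for the nearest opener that is still eligible and matches.
        for j in range(i - 1, -1, -1):
            if used[j] or ignored[j]:
                continue
            if PAIRS.get(text[j]) == ch:
                used[j] = used[i] = True
                pairs.append((j, i))
                # Discard every still-unmatched opener enclosed by the new pair.
                for k in range(j + 1, i):
                    if text[k] in PAIRS and not used[k]:
                        ignored[k] = True
                break
        else:
            # No eligible opener found: discard the closer.
            ignored[i] = True
    return sorted(pairs)


print(resolve_pairs_lookback("([(])])"))
```

The look-back version costs more time per closer than a stack would, which is presumably why an implementation prefers the stack; the set of resolved pairs is intended to be the same either way.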
It leaves a few things unstated - that all has to take place inside the same isolating run, and, that, on reaching the end, only the matched pairs are remembered (all unmatched openers are discarded). (And we need to say what "discarded" means). While it's a statement of an algorithm, it's not obfuscatory, and it's relatively easy to consider alternate implementation strategies. > > Here is what ?theoretizing? gives: > > a parsing is good if it satisfies all conditions below: > > 0) Some delimiters in the string are marked as ?non-matching?; the rest > is broken into disjoint ?matched? pairs; > > MATCH) A ?matched? pair consists of an open-delimiter and matching close- > delimiter (in this order in the string). > > NEST) ?Matched? pairs are properly nested (meaning that 2 pairs cannot be > positioned as Open1 Open2 Close1 Close2 in the string order). > > MINLEN) ?Inside? a ?matched? pair, every delimiter which could match elements > of the pair but is marked as ?non-matching? must nest inside > some deeper-nested ?matched? pair. > > (I hope that the meaning of the word ?inside? in MINLEN is clear.) > > GREED) Given any close-delimiter marked as ?non-matching?, its > pre-context does not contain any open-delimiter which could > match it. > > Here pre-context of a position is a concatenation of substrings of the > initial string: > ? Take the most deeply nested ?matched pair? containing the position > (if none, the whole string); > ? take the part of the string inside this pair AND before the position; > ? remove all ?matched? pairs completely contained insidde this substring > together with what they enclose. This is a very nice formal definition. I'm surprised that your "GREED" statement needs such a complex auxiliary concept (pre-context). Can you explain why, if you make "pre-context" simply the part of the whole string that precedes the unmatched close-delimiter, the words "which could match it" are insufficient? 
Any opener that's inside a matched pair entirely in the pre-context (my def) would be ineligible because of NEST, so you don't have to "remove" it from the pre-context (your def).

Is there something I'm missing?

>
> Ilya
>
> P.S. Judging by another message of yours, for you "theoretizing" is a
> 4-letter word? Oh well...

It can be - not in the sense you used it in this post.

A./

From mathias at qiwi.be Wed Apr 23 12:18:48 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Wed, 23 Apr 2014 19:18:48 +0200
Subject: ID_Start, ID_Continue, and stability extensions
Message-ID: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be>

http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax defines ID_Start as:

> Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), minus Pattern_Syntax and Pattern_White_Space code points, plus stability extensions. Note that "other letters" includes ideographs.

What are the "stability extensions" this document refers to?

I noticed that parsing `DerivedCoreProperties.txt` for `ID_Start` leads to slightly different results, than parsing `UnicodeData.txt` for category names and then adding the categories together, minus `Pattern_Syntax` and `Pattern_White_Space` which you can get by parsing `PropList.txt`.

For example, U+2118 SCRIPT CAPITAL P is included in `ID_Start` as per `DerivedCoreProperties.txt`, but it doesn't match any of the above categories. Is this an example of such a "stability extension", or was this an oversight?
Regards,
Mathias

From mathias at qiwi.be Wed Apr 23 12:29:27 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Wed, 23 Apr 2014 19:29:27 +0200
Subject: ID_Start, ID_Continue, and stability extensions
In-Reply-To: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be>
References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be>
Message-ID: <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be>

On 23 Apr 2014, at 19:18, Mathias Bynens wrote:

> http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax defines ID_Start as:
>
>> Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), minus Pattern_Syntax and Pattern_White_Space code points, plus stability extensions. Note that "other letters" includes ideographs.
>
> What are the "stability extensions" this document refers to?
>
> I noticed that parsing `DerivedCoreProperties.txt` for `ID_Start` leads to slightly different results, than parsing `UnicodeData.txt` for category names and then adding the categories together, minus `Pattern_Syntax` and `Pattern_White_Space` which you can get by parsing `PropList.txt`.
>
> For example, U+2118 SCRIPT CAPITAL P is included in `ID_Start` as per `DerivedCoreProperties.txt`, but it doesn't match any of the above categories. Is this an example of such a "stability extension", or was this an oversight?

Here are the code points that match the respective property according to `DerivedCoreProperties.txt`, yet don't match these properties if you're adding/removing the categories manually based on the property definition in TR31.

`ID_Start`:

* U+2118
* U+212E
* U+309B
* U+309C

`ID_Continue`:

* U+00B7
* U+0387
* U+1369
* U+1370
* U+1371
* U+19DA

Why these differences?
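The observation about the four `ID_Start` code points can be checked with Python's standard `unicodedata` module. This is a rough sketch: the results depend on the Unicode version bundled with the interpreter, and the helper name `category_alone_gives_id_start` is made up for the illustration.

```python
import unicodedata

# General_Category values named in the TR31 derivation of ID_Start.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# The four code points reported as ID_Start in DerivedCoreProperties.txt.
REPORTED = [0x2118, 0x212E, 0x309B, 0x309C]


def category_alone_gives_id_start(cp):
    """True if the General_Category by itself would put cp into ID_Start."""
    return unicodedata.category(chr(cp)) in ID_START_CATEGORIES


for cp in REPORTED:
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch),
          category_alone_gives_id_start(cp))
```

None of the four is expected to pass the category test, which is consistent with the discrepancy described in the message.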
From ken.whistler at sap.com Wed Apr 23 12:48:03 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 23 Apr 2014 17:48:03 +0000 Subject: ID_Start, ID_Continue, and stability extensions In-Reply-To: <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> Message-ID: Mathias, > > What are the ?stability extensions? this document refers to? > > > Here are the code points that match the respective property according to > `DerivedCoreProperties.txt`, yet don?t match these properties if you?re > adding/removing the categories manually based on the property definition in > TR31. > > `ID_Start`: > > * U+2118 > * U+212E > * U+309B > * U+309C > > `ID_Continue`: > > * U+00B7 > * U+0387 > * U+1369 > * U+1370 > * U+1371 > * U+19DA > > Why these differences? See the listings for Other_ID_Start and Other_ID_Continue in PropList.txt. Those are your "stability extensions" for the derivation of the identifier-related derived properties. 
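Entries in `PropList.txt` follow a simple `code point or range ; property # comment` layout, so collecting `Other_ID_Start` and `Other_ID_Continue` takes only a few lines. The sketch below parses an in-memory sample (the lines are the ones quoted in this thread) rather than a downloaded file, and `parse_property` is a hypothetical helper name, not part of any library.

```python
SAMPLE = """\
2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
# Total code points: 4
00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
"""


def parse_property(lines, wanted):
    """Collect the code points assigned to the property `wanted`."""
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not data:
            continue
        code_range, prop = (field.strip() for field in data.split(";"))
        if prop != wanted:
            continue
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


other_id_start = parse_property(SAMPLE.splitlines(), "Other_ID_Start")
other_id_continue = parse_property(SAMPLE.splitlines(), "Other_ID_Continue")
print(len(other_id_start), len(other_id_continue))
```

The same function should work on the real file when it is read line by line, assuming the file keeps this two-field layout.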
--Ken

# ================================================

2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

# Total code points: 4

# ================================================

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE

# Total code points: 12

From mathias at qiwi.be Wed Apr 23 13:14:53 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Wed, 23 Apr 2014 20:14:53 +0200
Subject: ID_Start, ID_Continue, and stability extensions
In-Reply-To:
References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be>
Message-ID: <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be>

On 23 Apr 2014, at 19:48, Whistler, Ken wrote:

> See the listings for Other_ID_Start and Other_ID_Continue in PropList.txt.
> Those are your "stability extensions" for the derivation of the identifier-related derived properties.

This answered all my questions :) Thanks!

From markus.icu at gmail.com Wed Apr 23 13:18:39 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 23 Apr 2014 11:18:39 -0700
Subject: ID_Start, ID_Continue, and stability extensions
In-Reply-To: <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be>
References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be>
Message-ID:

I strongly recommend you parse the derived properties rather than trying to follow the derivation formula, because that can change over time.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mathias at qiwi.be Wed Apr 23 15:16:26 2014 From: mathias at qiwi.be (Mathias Bynens) Date: Wed, 23 Apr 2014 22:16:26 +0200 Subject: Do `Grapheme_Extend` characters only apply to `Grapheme_Extend`? Message-ID: <8058BF88-8BB6-4A47-A7A2-32A1AC99E7A9@qiwi.be> Let's say I'm writing a program that strips combining characters and grapheme extenders from an input string. For combining marks, I'm looking for any non-combining mark (e.g. `a`) followed by one or more combining marks (e.g. `?`), and then I remove everything but the non-combining mark (e.g. leaving only `a`). Is this a correct approach? What should the approach be for grapheme extenders? Should the program only look for `Grapheme_Base` characters followed by `Grapheme_Extend` characters (which includes the code points in `Other_Grapheme_Extend`)? From nospam-abuse at ilyaz.org Wed Apr 23 18:41:15 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 23 Apr 2014 16:41:15 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <5357E870.6060603@ix.netcom.com> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> Message-ID: <20140423234115.GA15271@powdermilk> On Wed, Apr 23, 2014 at 09:21:04AM -0700, Asmus Freytag wrote: > > a parsing is good if it satisfies all conditions below: > > > > 0) Some delimiters in the string are marked as "non-matching"; the rest > > is broken into disjoint "matched" pairs; > > > > MATCH) A "matched" pair consists of an open-delimiter and matching close- > > delimiter (in this order in the string). > > > > NEST) "Matched" pairs are properly nested (meaning that 2 pairs cannot be > > positioned as Open1 Open2 Close1 Close2 in the string order). > > > > MINLEN) "Inside" a "matched"
pair, every delimiter which could match elements > > of the pair but is marked as "non-matching" must nest inside > > some deeper-nested "matched" pair. > > > >(I hope that the meaning of the word "inside" in MINLEN is clear.) > > > > GREED) Given any close-delimiter marked as "non-matching", its > > pre-context does not contain any open-delimiter which could > > match it. > > > > Here pre-context of a position is a concatenation of substrings of the > > initial string: > > * Take the most deeply nested "matched pair" containing the position > > (if none, the whole string); > > * take the part of the string inside this pair AND before the position; > > * remove all "matched" pairs completely contained inside this substring > > together with what they enclose. > > This is a very nice formal definition. I'm surprised that your "GREED" > statement needs such a complex auxiliary concept (pre-context). > > Can you explain why, if you make "pre-context" simply the part of the > whole string that precedes the unmatched close-delimiter, the words > "which could match it" are insufficient? Aha, this means that my description is INCOMPLETE: you got the wrong impression of what "match" means! Everywhere, this word means exactly the same as in the MATCH rule: that Unicode codepoints match according to their Unicode properties. This is a non-recursive definition. All rules are independent. Without the complicated notion of pre-context, matching [] in ( [ ) ] would be an acceptable match.
Thanks for your corrections, Ilya From nospam-abuse at ilyaz.org Wed Apr 23 19:04:49 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 23 Apr 2014 17:04:49 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <8361m09if2.fsf@gnu.org> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <8361m09if2.fsf@gnu.org> Message-ID: <20140424000449.GA15531@powdermilk> On Wed, Apr 23, 2014 at 06:25:53PM +0300, Eli Zaretskii wrote: > I see nothing in your definition that is significantly different from > our attempts. It does feel more complex, mainly because you have much > more conditions, combining which in one's mind might not be easy at > first reading. AFAICS, mine has exactly one condition (GREED) which is not explicitly contained in your approaches (as far as I recall them). How can 1 be more than something? ;-) Note that the other approaches define a PROCESS (data depending on time, or on "scan" position in string). My definition is completely static: given a matching, it says whether it is good or not (without any recursion, or interdependence of conditions).
Yours, Ilya From asmusf at ix.netcom.com Wed Apr 23 20:15:44 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 23 Apr 2014 18:15:44 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140423234115.GA15271@powdermilk> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> Message-ID: <535865C0.7060609@ix.netcom.com> On 4/23/2014 4:41 PM, Ilya Zakharevich wrote: > On Wed, Apr 23, 2014 at 09:21:04AM -0700, Asmus Freytag wrote: >>> a parsing is good if it satisfies all conditions below: >>> >>> 0) Some delimiters in the string are marked as ?non-matching?; the rest >>> is broken into disjoint ?matched? pairs; >>> >>> MATCH) A ?matched? pair consists of an open-delimiter and matching close- >>> delimiter (in this order in the string). >>> >>> NEST) ?Matched? pairs are properly nested (meaning that 2 pairs cannot be >>> positioned as Open1 Open2 Close1 Close2 in the string order). >>> >>> MINLEN) ?Inside? a ?matched? pair, every delimiter which could match elements >>> of the pair but is marked as ?non-matching? must nest inside >>> some deeper-nested ?matched? pair. >>> >>> (I hope that the meaning of the word ?inside? in MINLEN is clear.) >>> >>> GREED) Given any close-delimiter marked as ?non-matching?, its >>> pre-context does not contain any open-delimiter which could >>> match it. >>> >>> Here pre-context of a position is a concatenation of substrings of the >>> initial string: >>> ? Take the most deeply nested ?matched pair? containing the position >>> (if none, the whole string); >>> ? take the part of the string inside this pair AND before the position; >>> ? remove all ?matched? 
pairs completely contained insidde this substring >>> together with what they enclose. >> This is a very nice formal definition. I'm surprised that your "GREED" >> statement needs such a complex auxiliary concept (pre-context). >> >> Can you explain why, if you make "pre-context" simply the part of the >> whole string that precedes the unmatched close-delimiter, the words >> "which could match it" are insufficient? > Aha, this means that my description is INCOMPLETE: you got a wrong > impression what ?match? means! Everywhere, this word means exactly > the same as in the MATCH rule: that Unicode codepoints match following > Unicode properties. > > This is non-recursive definition. All rules are independent. That explains why you repeat most of the other constraints in your pre-context. > Without > complicated notion of pre-context, matching [] in > > ( [ ) ] > > would be an acceptable match. > > Thanks for your corrections, > Ilya > For a static definition, would it have been simpler to break the definition into two - say a "tentative parsing" (all conditions but greed) and "selected parsing", which the could be defined as the parsing that starts closest to the left. (I don't have the time as I write this to work out whether that's the correct condition, as I am about to board a ride, but just as a trigger to thought what a split definition might achieve). A./ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Wed Apr 23 21:37:11 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 24 Apr 2014 04:37:11 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140423234115.GA15271@powdermilk> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> Message-ID: Thanks for the clear reply, now I know that my example in a prior message would work appropriately with UBA: This is an [«] ARABIC EXAMPLE [»] for demonstration only. Because: - the opening guillemet is not stripped out of the context stack when the first closing bracket is matched with the first opening bracket, - later the closing guillemet matches the opening guillemet remaining on the stack, even if the second opening bracket was pushed on top of it: the pair of guillemets is matched, the opening guillemet is dropped from the stack, but the second bracket on top of it remains there and can now also match the following closing bracket. So bracket pairs can effectively overlap non-hierarchically. But still there's a problem: - The first pair of brackets starts immediately after an LTR Latin context, so its direction is LTR too and consistent: these brackets won't be mirrored. - The pair of guillemets also starts after the first opening bracket has been resolved as LTR, so both guillemets will be LTR and won't be mirrored. - However, the second pair of brackets starts just after an ARABIC context: these brackets will BOTH be mirrored.
But we get: - "This is an " : strong LTR at start, the last weak space inherits from the last letter "n", no mirroring anywhere - "[": resolved as LTR by inheritance from the previous resolved space, no mirroring - "«": ditto - "]": found match, is LTR like the matching opening bracket. - " ARABIC EXAMPLE ": the first weak space inherits from the bracket, but then the direction switches to RTL up to the last space. - "[": resolved as RTL, mirrored - "»": resolved as LTR due to pair matching, no mirroring - "]": resolved as RTL due to pair matching, mirrored - " for demonstration only": the first weak space inherits from the previous RTL bracket, but then the direction switches to LTR from the first Latin letter up to the end of the string. And we then have the following runs with directions resolved: - "This is an [«] " : LTR - "ARABIC EXAMPLE [" : RTL - "»" : LTR - "] " : RTL - "for demonstration only." : LTR Now we can apply the possible Bidi linewraps (where appropriate, if this does not fit on a single line) and then the reordering of each line. Assuming that everything fits on the same line I get (without mirroring applied): "This is an [«] [ ELPMAXE CIBARA»[for demonstration only." And with mirroring applied: "This is an [«] ] ELPMAXE CIBARA»[for demonstration only." Ugly, isn't it? I still don't see any solution without using Bidi controls in the string. But if we use Bidi controls to force a change of direction, this only applies distinct directions to the elements of a pair. In my opinion such a pair should NOT match (this is the case here for the second pair of brackets). What I would like to see (with mirroring applied where needed) is: "This is an [«] ELPMAXE CIBARA [»] for demonstration only." There are still some tweaks to do to the algorithm. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From asmusf at ix.netcom.com Thu Apr 24 02:28:50 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 24 Apr 2014 00:28:50 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> Message-ID: <5358BD32.3080302@ix.netcom.com> On 4/23/2014 7:37 PM, Philippe Verdy wrote: > Thanks for the clear reply, now I know that my example in a prior > message would work appropriately with UBA: > > This is an [?] ARABIC EXAMPLE [?] for demonstration only. > > Because: > - the opening guillemet is not stripped out of the context stack when > the first closing bracket is matched with the first opening bracket, This is _*incorrect*_, see the text in blue/bold in the definition copied below. The second bullet in item 3 of the second second-level bullet of the third top-level bullet of BD16 clearly says that all elements that are above the matched element are popped together with it. > - later the closing guillemet matches the opening guillemet remaining > on the stack, No, this is_*incorrect*_, because the stack has been popped. The problem with the "stack" in this algorithm is that it isn't a stack. A stack is a data structure that allows you to manipulate the top element. This data structure is simply a list, to which elements are appended, as opening brackets are found, and which then is scanned (from the tail) for a match, and, on meeting a match, the tail is trimmed. Item "4" is the one that does the iteration in scanning the tail. After one or more iterations, item "3" no longer operates on what would have been the "top" element of a "stack", but deep in the tail of a list. 
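The append, scan-from-the-tail, and trim behaviour described above can be sketched in Python. This is only an illustration of the list-like "stack", not a conformant UBA implementation: the `PAIRS` table is a hypothetical three-entry stand-in for the Bidi_Paired_Bracket data in BidiBrackets.txt, canonical equivalents are ignored, and the input is treated as a single isolating run sequence.

```python
# Hypothetical stand-in for Bidi_Paired_Bracket; real data is in BidiBrackets.txt.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def find_bracket_pairs(text):
    """Return (open_pos, close_pos) tuples, sorted by opening position."""
    stack = []  # elements: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            # Opening bracket: append to the tail of the list.
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Closing bracket: scan from the tail toward the bottom.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    # Trim the tail: drop the matched element and
                    # everything that was appended after it.
                    del stack[i:]
                    break
            # No match found: the list is left untouched.
    pairs.sort()
    return pairs


print(find_bracket_pairs("( [ ) ]"))
```

In "( [ ) ]" the ")" matches the "(" below the "[", so the "[" is discarded along with it and the final "]" finds nothing left to match.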
When the items are "popped" it's equivalent to dropping the tail. (Unlike your interpretation, which would remove individual elements, this language clearly refers to multiple elements.) > even if the second opening bracket was pushed on top of it : pair of > guillemets is matched, the opening guillement is dropped from the > stack but the second bracket on top of it remains there and can also > match now the following closing bracket. > > So brackets pairs can effectively overlap non hierarchically. BD16. A /bracket pair/ is a pair of characters consisting of an /opening paired bracket/ and a /closing paired bracket/ such that the Bidi_Paired_Bracket property value of the former or its canonical equivalent equals the latter or its canonical equivalent and which are algorithmically identified at specific text positions within an /isolating run sequence/. The following algorithm identifies all of the /bracket pairs/ in a given /isolating run sequence/: * Create a stack for elements each consisting of a bracket character and a text position. Initialize it to empty. * Create a list for elements each consisting of two text positions, one for an opening paired bracket and the other for a corresponding closing paired bracket. Initialize it to empty. * Inspect each character in the isolating run sequence in logical order. o If an opening paired bracket is found, push its Bidi_Paired_Bracket property value and its text position onto the stack. o If a closing paired bracket is found, do the following: 1. Declare a variable that holds a reference to the current stack element and initialize it with the top element of the stack. 2. Compare the closing paired bracket being inspected or its canonical equivalent to the bracket in the current stack element. 3. If the values match, meaning the two characters form a bracket pair, then + Append the text position in the current stack element together with the text position of the closing paired bracket to the list. 
+ **Pop the stack _through the current stack element inclusively_.** 4. Else, if the current stack element is not at the bottom of the stack, advance it to the next element deeper in the stack and go back to step 2. 5. Else, continue with inspecting the next character without popping the stack. * Sort the list of pairs of text positions in ascending order based on the text position of the /opening paired bracket/. > > But still there's a problem: The remainder of the problems can't be discussed, because the premise is wrong (see above). A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathias at qiwi.be Thu Apr 24 02:58:31 2014 From: mathias at qiwi.be (Mathias Bynens) Date: Thu, 24 Apr 2014 09:58:31 +0200 Subject: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`? In-Reply-To: <8058BF88-8BB6-4A47-A7A2-32A1AC99E7A9@qiwi.be> References: <8058BF88-8BB6-4A47-A7A2-32A1AC99E7A9@qiwi.be> Message-ID: <6AC1046C-DDAC-498B-A2EE-AE02BF91AACD@qiwi.be> On 23 Apr 2014, at 22:16, Mathias Bynens wrote: > Let?s say I?m writing a program that strips combining characters and grapheme extenders from an input string. > > For combining marks, I?m looking for any non-combining marks (e.g. `a`) followed by one or more combining marks (e.g. `?`), and then I remove everything but the non-combining mark (e.g. leaving only `a`). Is this a correct approach? > > What should the approach be for grapheme extenders? Should the program only look for `Grapheme_Base` characters followed by `Grapheme_Extend` characters (which includes the code points in `Other_Grapheme_Extend`)? The email subject should have been ?Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?? ? sorry for the confusion. Does anyone know the answer? 
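For reference, the combining-mark half of the question can be sketched with Python's standard `unicodedata` module. This is a minimal sketch under two assumptions that are not from the thread: "combining mark" is taken to mean General_Category Mn, Mc or Me, and the input is NFD-normalized first so that precomposed letters are covered too. It does not consult `Grapheme_Extend` at all, and the helper name is made up for illustration.

```python
import unicodedata


def strip_combining_marks(text):
    """Drop every character whose General_Category is Mn, Mc or Me."""
    # Decompose first so that e.g. U+00E9 becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(strip_combining_marks("re\u0301sume\u0301"))
```

The same function also removes marks that are not accents in any sense: applied to Devanagari MA followed by VOWEL SIGN II (U+092E U+0940) it returns the bare consonant, because the vowel sign has General_Category Mc.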
From eliz at gnu.org Thu Apr 24 09:39:42 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 24 Apr 2014 17:39:42 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <5358BD32.3080302@ix.netcom.com> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> <5358BD32.3080302@ix.netcom.com> Message-ID: <83ha5i94gh.fsf@gnu.org> > Date: Thu, 24 Apr 2014 00:28:50 -0700 > From: Asmus Freytag > CC: ken at unicode.org, Eli Zaretskii , > James Clark , > unicode Unicode Discussion > > On 4/23/2014 7:37 PM, Philippe Verdy wrote: > > Thanks for the clear reply, now I know that my example in a prior > > message would work appropriately with UBA: > > > > This is an [?] ARABIC EXAMPLE [?] for demonstration only. > > > > Because: > > - the opening guillemet is not stripped out of the context stack when > > the first closing bracket is matched with the first opening bracket, > This is _*incorrect*_, see the text in blue/bold in the definition > copied below. > The second bullet in item 3 of the second second-level bullet of the > third top-level bullet of BD16 clearly says that all elements that are > above the matched element are popped together with it. > > - later the closing guillemet matches the opening guillemet remaining > > on the stack, > No, this is_*incorrect*_, because the stack has been popped. Indeed. In addition, assuming that by "guillemets" Philippe means U+00AB and U+00BB, they cannot possibly form a bracketed pair, because their General Category is not Ps and Pe. For that reason, you will never find them in BidiBrackets.txt. 
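The General_Category claim is easy to check with Python's standard `unicodedata` module. The sketch below only reports properties that the module exposes (General_Category, Bidi_Class and Bidi_Mirrored); it does not read BidiBrackets.txt, and the helper name is made up for illustration. The values returned reflect whatever Unicode version the running Python ships with.

```python
import unicodedata


def bracket_relevant_properties(ch):
    """Return (General_Category, Bidi_Class, Bidi_Mirrored) for one character."""
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        bool(unicodedata.mirrored(ch)),
    )


for ch in "\u00ab\u00bb()":
    print(f"U+{ord(ch):04X}", bracket_relevant_properties(ch))
```

The guillemets report Pi and Pf (initial and final quote punctuation) where the parentheses report Ps and Pe.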
From verdy_p at wanadoo.fr Thu Apr 24 10:11:23 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 24 Apr 2014 17:11:23 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83ha5i94gh.fsf@gnu.org> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> <5358BD32.3080302@ix.netcom.com> <83ha5i94gh.fsf@gnu.org> Message-ID: 2014-04-24 16:39 GMT+02:00 Eli Zaretskii : > In addition, assuming that by "guillemets" Philippe means U+00AB and > U+00BB, "guillemet" is THE correct name, even in English. "guillemot" comes from an old typo. If you don't want this term in English you can still use "double angle bracket", which is unnecessarily long. > they cannot possibly form a bracketed pair, because their > General Category is not Ps and Pe. For that reason, you will never > find them in BidiBrackets.txt. > Forget the general category, we know that it does not solve any internationalization issue correctly. All past versions of Unicode algorithms that initially attempted to use it now use it only in informative rules (which are not stabilized) to help generate new "derived" properties (which should be used verbatim from the content of the UCD, because new exceptions are rapidly added to the rules). The guillemets evidently form a pair even if their use depends on languages which may swap their roles (and this is the main reason why they are not assigned Ps and Pe, because Ps and Pe would be swapped). They are still a pair which works even better than """ that can be paired in 3 different ways and not just two (meaning that you don't know which one to look for).
Also read my example for what it is explicitly: a demonstration of the problem, just an example (there are many other similar examples of such cases where nesting is not hierarchical but still maintains pairs). So nothing (at least not the GC, which is just an intermediate but incomplete helper) forbids the guillemets from being listed in BidiBrackets.txt. From eliz at gnu.org Thu Apr 24 10:20:31 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 24 Apr 2014 18:20:31 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> <5358BD32.3080302@ix.netcom.com> <83ha5i94gh.fsf@gnu.org> Message-ID: <83d2g692kg.fsf@gnu.org> > From: Philippe Verdy > Date: Thu, 24 Apr 2014 17:11:23 +0200 > Cc: Asmus Freytag , Ilya Zakharevich , ken at unicode.org, > James Clark , unicode Unicode Discussion > > > In addition, assuming that by "guillemets" Philippe means U+00AB and > > U+00BB, > > > "guillemet" is THE correct name, even in English. "guillemot" comes from an > old typo error. I didn't mean to say "guillemet" was a typo, I just wasn't sure which Unicode codepoint you had in mind, since you didn't show its full official name or its codepoint. And at least your original message used "<<" and ">>" transliterations, not the actual characters. > > they cannot possibly form a bracketed pair, because their > > General Category is not Ps and Pe. For that reason, you will never > > find them in BidiBrackets.txt. > > > > Forget the general category, we know that it does not solve any > internationalization issue correctly.
All past versions of Unicode > algorthms that initially attempted to use them now use them only as > informative rules (which are not stabilized) to help generate new "derived" > properties (which should be used verbatim from the content of the UCD, > because rapidly new exceptions are added to the rules). > > The guillemet evidently form a pair even if their use depends on languages > which may swap their role (and this is the main reason why they are not > assigned Ps and Pe because Ps and Pe will be swapped. They are still a pair > which works even better than """ that can be paired in 3 different ways and > not just two (meaning that you don't know which one to look for. They are not a pair for the purposes of the PBA, which is the subject of this discussion. Your message, viz.: > - later the closing guillemet matches the opening guillemet remaining on > the stack, even if the second opening bracket was pushed on top of it : > pair of guillemets is matched, the opening guillement is dropped from the > stack but the second bracket on top of it remains there and can also match > now the following closing bracket. indicated that you thought the guillemets could form a bracket pair, which they cannot, according to the UBA. > So nothing (at least not the reason of the GC which is just an intermediate > but incomplete helper) forbids the guillemets to be listed in > BidiBrackets.txt. They don't satisfy the conditions for that. From BidiBrackets.txt: # This file lists the set of code points with Bidi_Paired_Bracket_Type # property values Open and Close. The set is derived from the character # properties General_Category (gc), Bidi_Class (bc), Bidi_Mirrored (Bidi_M), # and Bidi_Mirroring_Glyph (bmg), as follows: two characters, A and B, # form a bracket pair if A has gc=Ps and B has gc=Pe, both have bc=ON and # Bidi_M=Y, and bmg of A is B. 
Bidi_Paired_Bracket (bpb) maps A to B and # vice versa, and their Bidi_Paired_Bracket_Type (bpt) property values are # Open (o) and Close (c), respectively. As you see, Ps and Pe are explicitly required. From asmusf at ix.netcom.com Thu Apr 24 11:10:42 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 24 Apr 2014 09:10:42 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83d2g692kg.fsf@gnu.org> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> <5358BD32.3080302@ix.netcom.com> <83ha5i94gh.fsf@gnu.org> <83d2g692kg.fsf@gnu.org> Message-ID: <53593782.6040507@ix.netcom.com> On 4/24/2014 8:20 AM, Eli Zaretskii wrote: >> So nothing (at least not the reason of the GC which is just an intermediate >> >but incomplete helper) forbids the guillemets to be listed in >> >BidiBrackets.txt. > They don't satisfy the conditions for that. From BidiBrackets.txt: Philippe is incorrect once again, as Eli notices. The underlying reason for the exclusion of the guillemets is probably that they are quotation marks and not brackets. Quotation marks, in some languages, are not paired (using the same mark for opening and closing), or are paired the opposite way or form different pairs. That makes the kind of algorithm like the PBA impossible -- at least in a language neutral way. (Even without being language neutral, the conventions for guillemets in particular may not necessarily be universal *within* a given language -- I seem to recall from our investigation into quotes that usage varied between samples). A./ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmusf at ix.netcom.com Thu Apr 24 11:12:53 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 24 Apr 2014 09:12:53 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83ha5i94gh.fsf@gnu.org> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> <5358BD32.3080302@ix.netcom.com> <83ha5i94gh.fsf@gnu.org> Message-ID: <53593805.10708@ix.netcom.com> On 4/24/2014 7:39 AM, Eli Zaretskii wrote: > This is _*incorrect*_, see the text in blue/bold in the definition > >copied below. > >The second bullet in item 3 of the second second-level bullet of the > >third top-level bullet of BD16 clearly says that all elements that are > >above the matched element are popped together with it. Isn't it lovely how difficult it is to cite any of the steps of that algorithmic definition? There are levels of nested bullets, with some numbers about halfway down the hierarchy. Another brick in the wall of proof that BD16 is just very poorly written and presented. A./ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Thu Apr 24 13:28:46 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 24 Apr 2014 20:28:46 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83d2g692kg.fsf@gnu.org> References: <5355912E.80101@ix.netcom.com> <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> <5358BD32.3080302@ix.netcom.com> <83ha5i94gh.fsf@gnu.org> <83d2g692kg.fsf@gnu.org> Message-ID: 2014-04-24 17:20 GMT+02:00 Eli Zaretskii : > > From: Philippe Verdy > > Date: Thu, 24 Apr 2014 17:11:23 +0200 > > Cc: Asmus Freytag , Ilya Zakharevich < > nospam-abuse at ilyaz.org>, ken at unicode.org, > > James Clark , unicode Unicode Discussion < > unicode at unicode.org> > > > > > In addition, assuming that by "guillemets" Philippe means U+00AB and > > > U+00BB, > > > > > > "guillemet" is THE correct name, even in English. "guillemot" comes from > an > > old typo error. > > I didn't mean to say "guillemet" was typo, I just wasn't sure which > Unicode codepoint you had in mind, since you didn't show its full > official name or its codepoint. And at least your original message > used "<<" and ">>" transliterations, not the actual characters. > No I used the ?? characters exacvtly like here. I absolutely never use the ASCII trick with << >> (especially in email where >> is used by citations. But may be I'll use " in English contexts (I have used it as string delimiters in later discussions, to surround the oriented brackets and guillemets. I think this is your mail agent that transformed the guillemets, -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Thu Apr 24 14:25:42 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 24 Apr 2014 12:25:42 -0700 Subject: Do =?UTF-8?Q?=27Grapheme=5FExtend=27=20characters=20only=20apply=20to=20?= =?UTF-8?Q?=27Grapheme=5FBase=27=3F?= Message-ID: <20140424122542.665a7a7059d7ee80bb4d670165c8327d.d9a1509fa9.wbe@email03.secureserver.net> Mathias Bynens wrote: > Let's say I'm writing a program that strips combining characters and > grapheme extenders from an input string. > > For combining marks, I'm looking for any non-combining marks (e.g. > 'a') followed by one or more combining marks (e.g. '??'), and then I > remove everything but the non-combining mark (e.g. leaving only 'a'). > Is this a correct approach? It's entirely up to you. This is a rather unusual thing to want to do with text. Fr mn lnggs, t wld b qvlnt t strppng ll vwls t f th txt. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From ken.whistler at sap.com Thu Apr 24 14:38:54 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Thu, 24 Apr 2014 19:38:54 +0000 Subject: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`? In-Reply-To: <6AC1046C-DDAC-498B-A2EE-AE02BF91AACD@qiwi.be> References: <8058BF88-8BB6-4A47-A7A2-32A1AC99E7A9@qiwi.be> <6AC1046C-DDAC-498B-A2EE-AE02BF91AACD@qiwi.be> Message-ID: > On 23 Apr 2014, at 22:16, Mathias Bynens wrote: > > > Let?s say I?m writing a program that strips combining characters and > grapheme extenders from an input string. > > > > For combining marks, I?m looking for any non-combining marks (e.g. `a`) > followed by one or more combining marks (e.g. `?`), and then I remove > everything but the non-combining mark (e.g. leaving only `a`). Is this a > correct approach? > > > > What should the approach be for grapheme extenders? Should the > program only look for `Grapheme_Base` characters followed by > `Grapheme_Extend` characters (which includes the code points in > `Other_Grapheme_Extend`)? 
> > The email subject should have been ?Do `Grapheme_Extend` characters only > apply to `Grapheme_Base`?? ? sorry for the confusion. > > Does anyone know the answer? Yes. Grapheme_Extend characters per se do not "apply" to anything. They are a mixture of different General_Category types -- mostly combining marks, but not all. The concept of applying to a base only refers to combining marks proper. The proper use of the Grapheme_Extend property is in the context of the text segmentation algorithms defined in UAX #29, and in particular: http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table See that document for the proper use. They are relevant to the determination of grapheme cluster boundaries. And by the way, it is a very bad idea to be writing a program to just unilaterally strip away grapheme extenders from input strings. In particular, many dependent vowels in Indic scripts are defined as grapheme extenders. If you strip them away, the input string will just end up as random trash. That is very, very different from something which is trying to strip diacritics and accent marks off of Latin letters. --Ken From doug at ewellic.org Thu Apr 24 14:41:42 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 24 Apr 2014 12:41:42 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 Message-ID: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> Re: Unclear text in the UBA (UAX#9) of Unicode 6.3 Philippe Verdy wrote: >> [...] And at least your original message >> used "<<" and ">>" transliterations, not the actual characters. > > No I used the ?? characters exacvtly like here. > I absolutely never use the ASCII trick with << >> (especially in email > where >> is used by citations. > But may be I'll use " in English contexts (I have used it as string > delimiters in later discussions, to surround the oriented brackets and > guillemets. 
> > I think this is your mail agent that transformed the guillemets, http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0108.html -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From sdaoden at yandex.com Thu Apr 24 14:56:31 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Thu, 24 Apr 2014 21:56:31 +0200 Subject: =?US-ASCII?Q?ID=5FStart,?= =?US-ASCII?Q?_ID=5FContinue,?= and stability extensions In-Reply-To: References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be> Message-ID: <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> Markus Scherer wrote: |I strongly recommend you parse the derived properties rather than trying to |follow the derivation formula, because that can change over time. ..this file includes only those core properties that have themselves a derivation-may-change property? (I long hesitated to write this though.) --steffen From markus.icu at gmail.com Thu Apr 24 15:56:22 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 24 Apr 2014 13:56:22 -0700 Subject: ID_Start, ID_Continue, and stability extensions In-Reply-To: <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be> <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> Message-ID: On Thu, Apr 24, 2014 at 12:56 PM, Steffen Nurpmeso wrote: > Markus Scherer wrote: > |I strongly recommend you parse the derived properties rather than trying > to > |follow the derivation formula, because that can change over time. > > ..this file includes only those core properties that have > themselves a derivation-may-change property? > I don't know what that means. What I tried to say is, if you need ID_Start, then parse ID_Start from DerivedCoreProperties.txt. 
That's more stable (and easier) than parsing the pieces and deriving

# Lu + Ll + Lt + Lm + Lo + Nl
# + Other_ID_Start
# - Pattern_Syntax
# - Pattern_White_Space

yourself.

For example, at least one of the derivation formulas (for Alphabetic) is changing from 6.3 to 7.0.

markus

From mathias at qiwi.be Thu Apr 24 16:07:58 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Thu, 24 Apr 2014 23:07:58 +0200
Subject: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?
In-Reply-To: 
References: <8058BF88-8BB6-4A47-A7A2-32A1AC99E7A9@qiwi.be> <6AC1046C-DDAC-498B-A2EE-AE02BF91AACD@qiwi.be>
Message-ID: <2EE8228C-9CE6-4747-9005-D14B62807F81@qiwi.be>

On 24 Apr 2014, at 21:38, Whistler, Ken wrote:

> Grapheme_Extend characters per se do not "apply" to anything.
> They are a mixture of different General_Category types -- mostly combining
> marks, but not all. The concept of applying to a base only refers to
> combining marks proper.
>
> The proper use of the Grapheme_Extend property is in the context of the
> text segmentation algorithms defined in UAX #29, and in particular:
>
> http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
>
> See that document for the proper use. They are relevant to the determination of grapheme cluster boundaries.
>
> And by the way, it is a very bad idea to be writing a program to just unilaterally strip away grapheme extenders from input strings. In particular, many dependent vowels in Indic scripts are defined as grapheme extenders. If you strip them away, the input string will just end up as random trash. That is very, very different from something which is trying to strip diacritics and accent marks off of Latin letters.

I agree. Don't worry, I am not actually writing such a program; it was just an example to simplify my question.

The real program attempts to reverse a string while accounting for combining marks and grapheme extenders.
Before reversing the code points one by one, some things need to happen:

* For combining marks, I use a regular expression that looks for non-combining marks followed by any number of combining marks, and then I swap the combining marks with the preceding character.
* Now I'm trying to figure out what to do about grapheme extenders (if anything). I was thinking: look for any non-grapheme-extender symbol (or should it be only `Grapheme_Base` characters? Your reply suggested it shouldn't) followed by a single grapheme extender (or should it be several, like with combining marks?), and then swap them. Would that be a correct approach?

I realize reversing a string has nothing to do with text segmentation, but ignoring grapheme extenders leads to unexpected results (since after reversing the code points, the grapheme extender might extend the wrong character): https://github.com/mathiasbynens/esrever/issues/5

From ken.whistler at sap.com Thu Apr 24 16:16:38 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Thu, 24 Apr 2014 21:16:38 +0000
Subject: Bidi Brackets for Dummies
Message-ID: 

Given the incredible level of interest shown on this list during the last week, I am glad that I can finally announce the publication of Bidi Brackets for Dummies:

http://www.unicode.org/notes/tr39/

I had wanted to publish that several weeks ago, but unfortunately, publication was held up for more than three weeks while I struggled to get the document past the Unicode Censors!

Enjoy.

--Ken

From markus.icu at gmail.com Thu Apr 24 16:19:18 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 24 Apr 2014 14:19:18 -0700
Subject: Bidi Brackets for Dummies
In-Reply-To: 
References: 
Message-ID: 

tn not tr

http://www.unicode.org/notes/tn39/

markus
-------------- next part --------------
An HTML attachment was scrubbed...
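
[Editorial sketch] The "swap, then reverse" idea described in the reversal question above can also be expressed as "group, then reverse the groups". What follows is only a minimal sketch, not the method used by esrever: it assumes Python, and because the standard `unicodedata` module does not expose `Grapheme_Extend`, it approximates "extender" by General_Category M* (Mn, Mc, Me). The function names are invented for illustration. It deliberately ignores Hangul jamo, ZWJ sequences and the rest of UAX #29.

```python
import unicodedata


def is_mark(ch):
    # Approximation only: General_Category Mn/Mc/Me, not the real
    # Grapheme_Extend property (which Python's stdlib does not provide).
    return unicodedata.category(ch).startswith("M")


def reverse_keeping_marks(text):
    # Collect each non-mark character together with the marks that follow it.
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    # Reverse the order of the groups, not of the individual code points.
    return "".join(reversed(clusters))
```

With the input `"man\u0303ana"` the tilde stays on the `n` after reversal, whereas a naive code point reversal would move it onto the following `a`. For input that does not begin with a mark, applying the function twice gives back the original string.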
URL: From richard.wordingham at ntlworld.com Thu Apr 24 16:22:35 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 24 Apr 2014 22:22:35 +0100 Subject: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`? In-Reply-To: References: <8058BF88-8BB6-4A47-A7A2-32A1AC99E7A9@qiwi.be> <6AC1046C-DDAC-498B-A2EE-AE02BF91AACD@qiwi.be> Message-ID: <20140424222235.479ca136@JRWUBU2> On Thu, 24 Apr 2014 19:38:54 +0000 "Whistler, Ken" wrote: > Yes. Grapheme_Extend characters per se do not "apply" to anything. > They are a mixture of different General_Category types -- mostly > combining marks, but not all. The concept of applying to a base only > refers to combining marks proper. > The proper use of the Grapheme_Extend property is in the context of > the text segmentation algorithms defined in UAX #29, A watertight definition of a grapheme cluster is probably impossible. The precise definition of the legacy grapheme cluster is crafted so that the process of splitting a string of characters into legacy grapheme clusters is invariant under canonical equivalence. The various Indic AA vowels that are other_grapheme_extend are there because they are also the second parts of canonical decompositions of multipart Indic vowels, most typically OO. However, diametrically opposite approaches were taken in the 'Myanmar' and Khmer scripts. In the Myanmar script, the two-part vowel symbol must be encoded as two separate characters, as in the various Tai scripts. In the Khmer script, the two parts are encoded as a single vowel. Most of the scripts of India allow both approaches; Devanagari is the most notable exception, and the multipart vowels there are primarily used for an archaic style. Thus U+09BE BENGALI VOWEL SIGN AA is intended to 'apply to' U+09C7 BENGALI VOWEL SIGN E, and it is only in the interests of simplicity and consistency that is a grapheme cluster but is not. 
Richard Ishida points out in one of his web pages that the practical definition of a grapheme cluster may actually depend on the font. > > http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table > > See that document for the proper use. They are relevant to the > determination of grapheme cluster boundaries. > > And by the way, it is a very bad idea to be writing a program to just > unilaterally strip away grapheme extenders from input strings. Thank you, Ken and Doug, for making that point. Richard. From rscook at unicode.org Thu Apr 24 17:22:44 2014 From: rscook at unicode.org (Richard COOK) Date: Thu, 24 Apr 2014 15:22:44 -0700 Subject: Bidi Brackets for Dummies In-Reply-To: References: Message-ID: <28697A94-F57B-4473-804A-47C93EC51331@unicode.org> On Apr 24, 2014, at 2:16 PM, Whistler, Ken wrote: > Given the incredible level of interest shown on this list during > the last week, I am glad that I can finally announce the publication > of Bidi Brackets for Dummies: > Dear Dr. Ken, Thanks ever so much for that enlightening course in BFD. It is rather long, and I dozed off in the middle, but as soon as I woke up I thought to give you some important feedback. I'd like to make one small suggestion (in addition to Markus' that you change the "r" to an "n" in the URL, which I have taken the liberty of doing above). You are using "pair" not only as a noun in several places, but as a *singular* noun. For example: >> And that tells you that U+005D is the pair for U+005B LEFT SQUARE BRACKET. >> And that tells you that U+005B is the pair for U+005D RIGHT SQUARE BRACKET. Maybe that pair of nominal singular "pair" usages is all of them. And maybe that's like "maths" in an earlier sentence. >> It's probably something to do with maths, but it's a "bracket", anyway. But, I'd think if you are going to use "pair" as a nominal singular, you might at least add a chapter on the subject. (And one on "maths" too, for that matter.) 
However, it might be easier just to use the word "mate" instead. For example, "And that tells you that U+005D is the *mate* of U+005B LEFT SQUARE BRACKET." But then, that may be a bit racy for the Unicode Censors ... > I had wanted to publish that several weeks ago, but unfortunately, > publication was held up for more than three weeks while I > struggled to get the document past the Unicode Censors! ... so perhaps just say ... "And that tells you that U+005D matches U+005B LEFT SQUARE BRACKET." ... or something like that, to avoid suggestively suggesting that these BFD things are actually, ahem, re-productive. > Enjoy. Yes, thanks! -Richard > --Ken > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From richard.wordingham at ntlworld.com Thu Apr 24 18:12:22 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 25 Apr 2014 00:12:22 +0100 Subject: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`? In-Reply-To: <2EE8228C-9CE6-4747-9005-D14B62807F81@qiwi.be> References: <8058BF88-8BB6-4A47-A7A2-32A1AC99E7A9@qiwi.be> <6AC1046C-DDAC-498B-A2EE-AE02BF91AACD@qiwi.be> <2EE8228C-9CE6-4747-9005-D14B62807F81@qiwi.be> Message-ID: <20140425001222.412a39bd@JRWUBU2> On Thu, 24 Apr 2014 23:07:58 +0200 Mathias Bynens wrote: > I realize reversing a string has nothing to do with text segmentation > ? but ignoring grapheme extenders leads to unexpected results (since > after reversing the code points, the grapheme extender might extend > the wrong character): > https://github.com/mathiasbynens/esrever/issues/5 Actually, it has a lot to do with text segmentation - you need to work out what are really thought of as the characters. ??????? is a nice illustration of the problems. Is reversing twice to yield the string you first started with? Is reversing three times to give the same result as reversing once? What does reversing a Hangul syllable do? 
Canonically equivalence should be preserved! Should renderability be preserved? What does Thai ????? /kr???/ reverse to? /???rk/ is unpronounceable in Thai, and if it were it would be written ????? . Thai ???? is the spelling of two unrelated words, pronounced /p?law/ and /phe? la?/ respectively. Richard. From asmusf at ix.netcom.com Thu Apr 24 19:19:57 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 24 Apr 2014 17:19:57 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> Message-ID: <5359AA2D.4030306@ix.netcom.com> On this side show, Philippe finally is correct, because I received his message without ASCII-i-fication; he cc'd me directly, and I never saw the mangled text. It's a bit embarassing for a Unicode mail list to not even be able to let guillemets through unmolested. But this shall not distract us from the fact that all other claims made by Philippe in conjunction with these characters were unfounded, because in contradiction to the specification of both properties and algorithms. One would wish for his sake that he would take as much time and effort to get these right as he takes on tracking this side issue. A./ On 4/24/2014 12:41 PM, Doug Ewell wrote: > Re: Unclear text in the UBA (UAX#9) of Unicode 6.3 > > Philippe Verdy wrote: > >>> [...] And at least your original message >>> used "<<" and ">>" transliterations, not the actual characters. >> No I used the ?? characters exacvtly like here. >> I absolutely never use the ASCII trick with << >> (especially in email >> where >> is used by citations. >> But may be I'll use " in English contexts (I have used it as string >> delimiters in later discussions, to surround the oriented brackets and >> guillemets. 
>> >> I think this is your mail agent that transformed the guillemets, > http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0108.html > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From nospam-abuse at ilyaz.org Fri Apr 25 03:11:26 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Fri, 25 Apr 2014 01:11:26 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <535865C0.7060609@ix.netcom.com> References: <20140421234140.GA5269@powdermilk> <5355C0FC.8060409@ix.netcom.com> <20140422033215.GA5778@powdermilk> <53560B41.70108@ix.netcom.com> <20140422091937.GA6603@powdermilk> <53569383.4020502@ix.netcom.com> <20140423073502.GA11904@powdermilk> <5357E870.6060603@ix.netcom.com> <20140423234115.GA15271@powdermilk> <535865C0.7060609@ix.netcom.com> Message-ID: <20140425081126.GA18924@powdermilk> On Wed, Apr 23, 2014 at 06:15:44PM -0700, Asmus Freytag wrote: > On 4/23/2014 4:41 PM, Ilya Zakharevich wrote: > >>> GREED) Given any close-delimiter marked as ?non-matching?, its > >>> pre-context does not contain any open-delimiter which could > >>> match it. > >>> > >>> Here pre-context of a position is a concatenation of substrings of the > >>> initial string: > >>> ? Take the most deeply nested ?matched pair? containing the position > >>> (if none, the whole string); > >>> ? take the part of the string inside this pair AND before the position; > >>> ? remove all ?matched? pairs completely contained insidde this substring > >>> together with what they enclose. > >>Can you explain why, if you make "pre-context" simply the part of the > >>whole string that precedes the unmatched close-delimiter, the words > >>"which could match it" are insufficient? > >Aha, this means that my description is INCOMPLETE: you got a wrong > >impression what ?match? means! 
Everywhere, this word means exactly
> >the same as in the MATCH rule: that Unicode codepoints match following
> >Unicode properties.
> >This is non-recursive definition. All rules are independent.

> That explains why you repeat most of the other constraints in your
> pre-context.

Frankly speaking, I do not see any such repetition.

> For a static definition, would it have been simpler to break the
> definition into
> two - say a "tentative parsing" (all conditions but greed) and
> "selected parsing",
> which the could be defined as the parsing that starts closest to the left.

I do not see how: to know whether a closing delimiter may be matched or not, it is not enough to know the "tentative" parsing of what precedes it; one must know the **actual** parsing. Eventually, you would end up with either a recursive definition, or a definition of a "process" of parsing.

Anyway, I've written my portion of definitions which combine "tentative" stuff with a "best choice" among the tentative variants. One ends up with monsters like

http://perldoc.perl.org/perlre.html#Combining-RE-Pieces

(and, Eli, the fact that I wrote it does not imply that I must like it :-[ ). In the case of Perl RExes, there is no alternative.

IMO, if there IS a way to define what a "standalone" GOOD THING is, it is __much__ better than the "best of many" way. Defining it as "the best of potentially good things" requires the reader to imagine first ALL the potentially good things; only when this (otherwise not very useful) universe has settled down in the reader's mind would they be able to pick the best guy.

Ilya

From eliz at gnu.org Fri Apr 25 03:54:32 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 25 Apr 2014 11:54:32 +0300
Subject: Bidi Brackets for Dummies
In-Reply-To: 
References: 
Message-ID: <83lhutpz5j.fsf@gnu.org>

> From: "Whistler, Ken"
> Date: Thu, 24 Apr 2014 21:16:38 +0000
> Cc: "Whistler, Ken"
>
> Given the incredible level of interest shown on this list during
> the last week, I am glad that I can finally announce the publication
> of Bidi Brackets for Dummies:
>
> http://www.unicode.org/notes/tr39/

Thanks. I found one typo:

  Now that we have comes to grips with the fact
                   ^^^^^

I also have a couple of questions about matching the canonical equivalents of the opening bracket:

1. Some characters have the decomposition mapping that starts with a tag, such as "" or "". Since (according to UAX#44, paragraph 5.7.3) these indicate a compatibility mapping, not a canonical mapping, I understand that they are not relevant for the purposes of the BPA. IOW, U+3008 and U+FE3F _cannot_ form a bracket pair, even though U+FE3F has "<vertical> 3008" as its decomposition mapping. Is that understanding correct?

2. Why aren't pairs with canonically equivalent characters, such as these:

  2329; 3009; o # LEFT-POINTING ANGLE BRACKET
  232A; 3008; c # RIGHT-POINTING ANGLE BRACKET
  3008; 232A; o # LEFT ANGLE BRACKET
  3009; 2329; c # RIGHT ANGLE BRACKET

in BidiBrackets.txt? If they were, that file could serve as a single source of information for deciding on bracket pairing; as things are, the implementation of the BPA must access other Unicode properties to do its job (unless the text representation is already decomposed such that each character is represented by its canonical equivalent -- which is, of course, a complication for text editors).

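
[Editorial sketch] The two questions above turn on the difference between a canonical and a compatibility decomposition, which can be checked directly. This is a rough illustration in Python, not an implementation of the BPA: `PAIRS` is a hand-written excerpt standing in for BidiBrackets.txt, and `canon` and `brackets_match` are names invented here. It assumes that comparing single-character NFD forms is an acceptable stand-in for "canonically equivalent".

```python
import unicodedata

# Hand-written excerpt of opening -> closing pairs; the real data lives in
# BidiBrackets.txt. This table is an assumption made for the example.
PAIRS = {"(": ")", "[": "]", "\u2329": "\u232A", "\u3008": "\u3009"}


def canon(ch):
    # NFD applies canonical decompositions only; compatibility mappings
    # (those whose decomposition field starts with a <tag>) are left alone.
    return unicodedata.normalize("NFD", ch)


CANON_PAIRS = {canon(o): canon(c) for o, c in PAIRS.items()}


def brackets_match(opener, closer):
    # Compare the brackets after folding each to its canonical equivalent.
    return CANON_PAIRS.get(canon(opener)) == canon(closer)


def has_compatibility_mapping(ch):
    # unicodedata.decomposition returns the raw field, e.g. "<vertical> 3008".
    return unicodedata.decomposition(ch).startswith("<")
```

Under this sketch, U+2329 pairs with U+3009 because U+2329 decomposes canonically to U+3008, while U+FE3F is left untouched by NFD and so never enters the table through its `<vertical>` mapping.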
From sdaoden at yandex.com Fri Apr 25 08:05:05 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Fri, 25 Apr 2014 15:05:05 +0200 Subject: =?US-ASCII?Q?ID=5FStart,?= =?US-ASCII?Q?_ID=5FContinue,?= and stability extensions In-Reply-To: References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be> <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> Message-ID: <20140425140505.pDgVh4jMEt/QYe7Pb5mvhFBE@dietcurd.local> Hello, Markus Scherer wrote: |On Thu, Apr 24, 2014 at 12:56 PM, Steffen Nurpmeso wrote: |> Markus Scherer wrote: |>|I strongly recommend you parse the derived properties rather than trying |> to |>|follow the derivation formula, because that can change over time. |> |> ..this file includes only those core properties that have |> themselves a derivation-may-change property? | |I don't know what that means. |What I tried to say is, if you need ID_Start, then parse ID_Start from |DerivedCoreProperties.txt. That's more stable (and easier than parsing the |pieces and deriving | |# Lu + Ll + Lt + Lm + Lo + Nl |# + Other_ID_Start |# - Pattern_Syntax |# - Pattern_White_Space | |yourself. But i *do* need to parse several many pieces (since i'm hardly interested in ID_Start only)! Unicode has DerivedAge.txt (i don't know where that is derived from) and i need to parse PropList.txt anyway (to get the full list of whitespace characters, for example). So imho it's a bit like ?Kraut und R?ben? (?higgledy-piggledy? sayy ). |For example, at least one of the derivation formulas (for Alphabetic) is |changing from 6.3 to 7.0. That is interesting or frightening, i don't know yet. Wouldn't it make sense to introduce a single PropListsJoined.txt that does it all. Or, for the sake of small and possibly space-constrained projects.. ?0[steffen at sherwood ]$ (cd ~/arena/docs.coding/unicode/data; > ll DerivedCore* PropList*) 100 [.] 99531 25 Sep 2013 PropList.txt 820 [.] 
836985 25 Sep 2013 DerivedCoreProperties.txt ..and this is what i would do: offer a new file, say, Formula.txt, which defines exactly the necessary formula, e.g., to quote your example Alphabetic < UnicodeData.txt < PropList.txt + Lu + Ll + Lt + Lm + Lo + Nl + Other_ID_Start - Pattern_Syntax - Pattern_White_Space = That concept seems to be scalable at first glance. Old parsers will not generate correct data in the future anymore if i understood correctly? At least there should be a formular-compatibility version tag added somewhere, so that parsers can prevent themselves from generating incorrect data and automatically. I don't know why there need to be megabytes of duplicated data. Ach; and i'm not gonna start to dream of better support for ISO C / POSIX character classes. (Oh. ...It's surely sapless.) Ciao, --steffen From markus.icu at gmail.com Fri Apr 25 10:56:17 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 25 Apr 2014 08:56:17 -0700 Subject: ID_Start, ID_Continue, and stability extensions In-Reply-To: <20140425140505.pDgVh4jMEt/QYe7Pb5mvhFBE@dietcurd.local> References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be> <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> <20140425140505.pDgVh4jMEt/QYe7Pb5mvhFBE@dietcurd.local> Message-ID: On Fri, Apr 25, 2014 at 6:05 AM, Steffen Nurpmeso wrote: > |What I tried to say is, if you need ID_Start, then parse ID_Start from > |DerivedCoreProperties.txt. That's more stable (and easier than parsing > the > |pieces and deriving > | > |# Lu + Ll + Lt + Lm + Lo + Nl > |# + Other_ID_Start > |# - Pattern_Syntax > |# - Pattern_White_Space > | > |yourself. > > But i *do* need to parse several many pieces (since i'm hardly > interested in ID_Start only)! > That's ok. Wherever there is a choice, parse the derived property rather than the pieces and doing your own derivation. So imho it's a bit like ?Kraut und R?ben? 
(?higgledy-piggledy? > sayy ). > Ich wei? was das bedeutet :-) Wouldn't it make sense to introduce a single PropListsJoined.txt > that does it all. Depends. You could just parse the files you need. They don't have to be combined. I parse most of the UCD .txt files with a Python script and munge them into one combined file. Then I have C++ code that parses that. (Years ago I did parse the pieces and derive at runtime but found it tedious to follow the formula changes, and if the data structure eliminates redundancy, then the data size is about the same.) Unicode also publishes XML versions of the data, with most or all properties in a single file. (It's just not as convenient for me to parse XML in my tools, and the XML files were missing some pieces when I looked at them.) You could also just use a library that provides these properties, rather than roll your own. Shameless plug for ICU here which has most of the low-level properties in source code (from a generator), so no data loading for those. Ask the icu-support list for help if needed. ..and this is what i would do: offer a new file, say, Formula.txt, > which defines exactly the necessary formula, e.g., to quote your > example > It's not "my example". I copied that straight out of DerivedCoreProperties.txt. It's not worth writing a parser that handles all formulas (they are meant for human consumption) and derive their properties when you can just parse the derived property values. I don't know why there need to be megabytes of duplicated data. > It's easier to maintain the data in pieces, although we have to check the derived results as well. For implementers, the derived properties are the way to go. Ach; and i'm not gonna start to dream of better support for ISO > C / POSIX character classes. (Oh. ...It's surely sapless.) 
> http://www.unicode.org/reports/tr18/#Compatibility_Properties Viele Gr??e, markus -- Google Internationalization Engineering -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Fri Apr 25 11:03:39 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 25 Apr 2014 09:03:39 -0700 Subject: Bidi Brackets for Dummies In-Reply-To: <83lhutpz5j.fsf@gnu.org> References: <83lhutpz5j.fsf@gnu.org> Message-ID: On Fri, Apr 25, 2014 at 1:54 AM, Eli Zaretskii wrote: > I also have a couple of questions about matching the canonical > equivalents of the opening bracket: > Please take a look at the date of the tech note. I suggest you start a new thread with a new subject for serious discussion. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Fri Apr 25 13:53:27 2014 From: public at khwilliamson.com (Karl Williamson) Date: Fri, 25 Apr 2014 12:53:27 -0600 Subject: ID_Start, ID_Continue, and stability extensions In-Reply-To: <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be> <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> Message-ID: <535AAF27.2090703@khwilliamson.com> On 04/24/2014 01:56 PM, Steffen Nurpmeso wrote: > Markus Scherer wrote: > |I strongly recommend you parse the derived properties rather than trying to > |follow the derivation formula, because that can change over time. > > ..this file includes only those core properties that have > themselves a derivation-may-change property? > (I long hesitated to write this though.) > > --steffen > _______________________________________________ Somewhere it says that the derived property files are subservient to the other files. And in fact in some Unicode releases, they contained errors. 
I therefor changed my parser to populate my internal db first with the derived files, and then to populate using the non-derived files. Any conflicts were thus automatically resolved in favor of the non-derived. But if the derived files contained things not in the non-derived ones, they would be used. I think that Unicode is doing a better job of making their files consistent and accurate these days, but I haven't had to worry since I made that change. (I no longer remember any details of what the problems were.) If I were starting from scratch, I would try the xml version first. From sdaoden at yandex.com Fri Apr 25 18:24:11 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Sat, 26 Apr 2014 01:24:11 +0200 Subject: =?US-ASCII?Q?ID=5FStart,?= =?US-ASCII?Q?_ID=5FContinue,?= and stability extensions In-Reply-To: References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be> <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> <20140425140505.pDgVh4jMEt/QYe7Pb5mvhFBE@dietcurd.local> Message-ID: <20140426002411.J8iCviqjsVvPPf5bnYfI2vxZ@dietcurd.local> Markus Scherer wrote: |On Fri, Apr 25, 2014 at 6:05 AM, Steffen Nurpmeso wrote: |So imho it's a bit like ?Kraut und R?ben? (?higgledy-piggledy? |> sayy ). | |Ich wei? was das bedeutet :-) hmmm, possibly a bit of a strong wording. In no way a personal attack against a real person. Unicode grew over two decades, only logical that this results in loose tissue here and there. |I parse most of the UCD .txt files with a Python script and munge them into Ugh this sounds terrible! Programmers should have the option to choose the right tools for the right tasks, i mean, payment and everything is nice, but in the end it is our own life time... |Unicode also publishes XML versions of the data, with most or all Yes, sorry, but i'm not taking a soapy bath in a privately owned ocean but instead am dealing with a washtub. 
150 MB of shock-headed data that yet machines have troubles with! Even in the end the text files i need will be a tenth of that, and i'm working with them (especially UnicodeData.txt) uncountable times, i.e., direct human <-> text interaction. |You could also just use a library that provides these properties, rather |than roll your own. |Shameless plug for ICU here which has most of the low-level properties in |source code (from a generator), so no data loading for those. Ask the |icu-support |list for help if needed. But there still *are* products their creators can be prowd of, so no need for pudency of any kind, imho. It is of course not as common as in other cultures, say, Turkish goldsmiths, African silversmiths or Japanese swordsmiths and ceramists et cetera, but, so all the more remarkable. |http://www.unicode.org/reports/tr18/#Compatibility_Properties Maybe i turn to use a two-pass thing for my own little project, in order to use the final category. Right now i'm single-pass and am thus required to use ugly things like, e.g., .. {.name="Other_Alphabetic", .props=sct_ALPHA, .addprint=true}, {.name="Ideographic", .props=sct_IDEOGRAPH, .addprint=true}, .. /* Control characters, including the Zl and Zp separators (imho misplaced * and should go C) are not PRINTable */ if (pp->addprint && !(p & (sct_Cc | sct_Cs | sct_Co | sct_Zl | sct_Zp))) { p |= sct_PRINT; /* And whitespace is not GRAPHical */ if (!(p & sct_Zs)) p |= sct_GRAPH; } .. |Viele Gr??e, Oh. No mention of this brilliant idea of mine, PropRecipe.txt? Have a nice weekend. :) Ciao, --steffen P.S.: |Google Internationalization Engineering Oh Google, cute little thing you. 
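
[Editorial sketch] The advice repeated in this thread, to read a derived property such as ID_Start straight from DerivedCoreProperties.txt instead of re-deriving it from the formula, takes very little code. The following is a minimal sketch in Python; `parse_property` is a name invented here, and `SAMPLE` is a three-line illustrative excerpt in the file's `range ; property # comment` layout, not the real file.

```python
def parse_property(lines, wanted):
    """Return (start, end) code point ranges listed for one property."""
    ranges = []
    for line in lines:
        # Everything after '#' is a comment; blank results are skipped.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        # The first field is either "XXXX" or "XXXX..YYYY" in hexadecimal.
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end))
    return ranges


# Illustrative excerpt only (an assumption for the example).
SAMPLE = """\
# excerpt in the DerivedCoreProperties.txt layout
0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; ID_Start # Lo       FEMININE ORDINAL INDICATOR
0030..0039    ; ID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE
"""

id_start = parse_property(SAMPLE.splitlines(), "ID_Start")
```

The same function could be pointed at an open file object for the real data; nothing in it depends on how ID_Start is derived, which is the point of parsing the derived file.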
From mathias at qiwi.be Sat Apr 26 01:06:19 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Sat, 26 Apr 2014 08:06:19 +0200
Subject: ID_Start, ID_Continue, and stability extensions
In-Reply-To:
References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be>
Message-ID:

On 23 Apr 2014, at 20:18, Markus Scherer wrote:

> I strongly recommend you parse the derived properties rather than trying to follow the derivation formula, because that can change over time.

No argument there! My initial question can be rephrased as the following remark/change request:

http://unicode.org/reports/tr31/#Default_Identifier_Syntax could make it more clear that “stability extensions” means `Other_ID_Start` and `Other_ID_Continue`, respectively. At the moment it lists an incomplete formula: it’s explicit about all the categories and properties to include to form `ID_Start` and `ID_Continue` _except for those_, for seemingly no good reason.

Regards,
Mathias

From markus.icu at gmail.com Sat Apr 26 10:06:20 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 26 Apr 2014 08:06:20 -0700
Subject: ID_Start, ID_Continue, and stability extensions
In-Reply-To:
References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be>
Message-ID:

On Fri, Apr 25, 2014 at 11:06 PM, Mathias Bynens wrote:

> My initial question can be rephrased as the following remark/change
> request:
>
> http://unicode.org/reports/tr31/#Default_Identifier_Syntax could make it
> more clear that “stability extensions” means `Other_ID_Start` and
> `Other_ID_Continue`, respectively. At the moment it lists an incomplete
> formula: it’s explicit about all the categories and properties to include
> to form `ID_Start` and `ID_Continue` _except for those_, for seemingly no
> good reason.
>

I suggest you report it here: http://www.unicode.org/reporting.html

Best regards,
markus

From mathias at qiwi.be Sat Apr 26 10:08:53 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Sat, 26 Apr 2014 17:08:53 +0200
Subject: ID_Start, ID_Continue, and stability extensions
In-Reply-To:
References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be>
Message-ID: <2607084C-3AB5-4C9F-BE96-219EE09DDB70@qiwi.be>

On 26 Apr 2014, at 17:06, Markus Scherer wrote:

> I suggest you report it here: http://www.unicode.org/reporting.html

Done. Thank you, Markus!

From richard.wordingham at ntlworld.com Sun Apr 27 17:46:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 27 Apr 2014 23:46:09 +0100
Subject: Soft Hyphens in Complex and East Asian Scripts
Message-ID: <20140427234609.7ae6b6b9@JRWUBU2>

I'm trying to assess the impact of what I regard as a word-processing bug, and this forum seems to be the best source of information.

What writing systems using 'complex' or 'East Asian' scripts use U+00AD SOFT HYPHEN in a manner that is potentially visually distinct from U+200B ZERO WIDTH SPACE? The only good example I have is Thai, and it seems remiss that most of the 8-bit encodings for Thai don't support invisible line-breaking opportunities at all.

I do have two probable examples from a book in Tai Khuen (Tai Tham script) published in Thailand, but they may result from poor editing or, possibly, be plain hyphens. Both words appear to be proper nouns. The book has several examples of clear words broken across lines without any hyphenation.

Are there any 'complex' or 'East Asian' scripts where U+00AD and U+200B have the same visual effect but are used for different semantics?
An obvious example would be for U+200B to mark word boundaries but for U+00AD to mark line break opportunities within a word.

Richard.

From mark at macchiato.com Tue Apr 29 00:04:11 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 29 Apr 2014 07:04:11 +0200
Subject: ID_Start, ID_Continue, and stability extensions
In-Reply-To: <535AAF27.2090703@khwilliamson.com>
References: <9F608E6A-79E1-46D2-AD96-2E614A17AAC0@qiwi.be> <30F1ADAC-2D20-4234-BA1F-23653BAE24E6@qiwi.be> <6683947A-744E-4838-82F8-BB8E243B32AE@qiwi.be> <20140424205631.8jIlSGFgrqwwwAHXq7zujsl+@dietcurd.local> <535AAF27.2090703@khwilliamson.com>
Message-ID:

On 25 April 2014 20:53, Karl Williamson wrote:

> And in fact in some Unicode releases, they contained errors.

I think you know this, but for others.

A derived property value in the UCD is defined by the value in the derived data file, NOT by the derivation.

Of course, the value might not follow the intent, just as with any other property, and there are fixes to properties, whether derived or not, in each release. And sometimes the statement of the derivation is changed, and sometimes property values are changed.

And the regex recommendations in http://www.unicode.org/reports/tr18/#Compatibility_Properties are different, so you may be referring to them.

Mark

*Il meglio è l’inimico del bene*

From jjc at jclark.com Tue Apr 29 01:09:19 2014
From: jjc at jclark.com (James Clark)
Date: Tue, 29 Apr 2014 13:09:19 +0700
Subject: Soft Hyphens in Complex and East Asian Scripts
In-Reply-To: <20140427234609.7ae6b6b9@JRWUBU2>
References: <20140427234609.7ae6b6b9@JRWUBU2>
Message-ID:

On Mon, Apr 28, 2014 at 5:46 AM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> it seems remiss that most of
> the 8-bit encodings for Thai don't support invisible line-breaking
> opportunities at all.
>

WTT 2.0 draft standard developed by the Thai API Consortium encoded THAI WORD BREAK at 0xDC, but I don't think this ever got any substantial adoption.

James

From kojiishi at gluesoft.co.jp Wed Apr 30 12:46:57 2014
From: kojiishi at gluesoft.co.jp (Koji Ishii)
Date: Wed, 30 Apr 2014 17:46:57 +0000
Subject: Soft Hyphens in Complex and East Asian Scripts
In-Reply-To:
References: <20140427234609.7ae6b6b9@JRWUBU2>
Message-ID: <966E2899-45EF-4276-860C-24D21119D645@gluesoft.co.jp>

> Are there any 'complex' or 'East Asian' scripts where U+00AD and U+200B
> have the same visual effect but are used for different semantics? An
> obvious example would be for U+200B to mark word boundaries but for
> U+00AD to mark line break opportunities within a word.

Since Japanese and Chinese can break between any two characters (with some exceptions), these scripts do not need either in their native text. Both are sometimes used in Latin text for the same purposes and visuals as they’re used in Latin.

Korean has U+00AD encoded in their legacy encoding, so they may have typographic rules for it, but I’m not very familiar with Korean. As far as I searched for KLREQ[1], I could not get a hit.

[1] http://www.w3.org/TR/klreq/

From richard.wordingham at ntlworld.com Wed Apr 30 15:43:03 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 30 Apr 2014 21:43:03 +0100
Subject: Soft Hyphens in Complex and East Asian Scripts
In-Reply-To: <966E2899-45EF-4276-860C-24D21119D645@gluesoft.co.jp>
References: <20140427234609.7ae6b6b9@JRWUBU2> <966E2899-45EF-4276-860C-24D21119D645@gluesoft.co.jp>
Message-ID: <20140430214303.6ae5f9f8@JRWUBU2>

On Wed, 30 Apr 2014 17:46:57 +0000 Koji Ishii wrote:

> Korean has U+00AD encoded in their legacy encoding, so they may have
> typographic rules for it, but I’m not very familiar with Korean. As
> far as I searched for KLREQ[1], I could not get a hit.
> [1] http://www.w3.org/TR/klreq/

Thanks for the link. Reading it leaves me uncertain as to whether one should expect to encounter U+00AD within a Korean word, but part of the issue may be how it is to be rendered.

I found some very relevant reading in the Cascading Style Sheets literature. http://www.w3.org/TR/css3-text/#hyphenate appears to reveal the existence of soft hyphens in Arabic text. Santhosh Thottingal has been doing some well-received work on hyphenation in Indian scripts (see e.g. http://thottingal.in/blog/2013/03/17/hyphenation-in-web ), and the only criticism I could see was in the rendering of the active soft hyphens. There is a suggested solution at http://dev.w3.org/csswg/css-text-4/#hyphenate-character , though I'm not sure that there will always be a character with the right glyph.

On the basis of this information, I'm happy to contend that U+00AD can be found in words in many non-'Western' scripts. I can't even be beaten by a claim that ZWSP is the character for an invisible soft hyphen.

Richard.
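
For anyone wanting to check how a given text actually uses the two characters discussed in this thread, a small sketch with Python's standard unicodedata module may help. The helper names describe and break_marks are invented for illustration and are not part of any library. In the standard-library data both characters share general category Cf and bidi class BN, so telling them apart means looking at the code point itself.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_SPACE = "\u200b"


def describe(ch):
    """Return (name, general category, bidi class) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


def break_marks(text):
    """List (offset, character name) for each invisible break mark in text.

    Hypothetical helper: it only locates U+00AD and U+200B and says
    nothing about how a renderer should display them.
    """
    return [
        (offset, unicodedata.name(ch))
        for offset, ch in enumerate(text)
        if ch in (SOFT_HYPHEN, ZERO_WIDTH_SPACE)
    ]


if __name__ == "__main__":
    for ch in (SOFT_HYPHEN, ZERO_WIDTH_SPACE):
        print("U+%04X" % ord(ch), *describe(ch))
    print(break_marks("ab\u00adcd\u200bef"))
```

A survey of a corpus could run break_marks over each word and tally which of the two marks appears inside words of a given script.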