From mark at macchiato.com Tue Apr 1 02:01:39 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 1 Apr 2014 09:01:39 +0200
Subject: FYI: More emoji from Chrome
Message-ID:

More emoji from Chrome:

http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html

with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y

From verdy_p at wanadoo.fr Tue Apr 1 02:13:39 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 1 Apr 2014 09:13:39 +0200
Subject: FYI: More emoji from Chrome
In-Reply-To:
References:
Message-ID:

April 1st joke...

2014-04-01 9:01 GMT+02:00 Mark Davis ☕️ :

> More emoji from Chrome:
>
> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>
> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

From mark at macchiato.com Tue Apr 1 02:20:59 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 1 Apr 2014 09:20:59 +0200
Subject: FYI: More emoji from Chrome
In-Reply-To:
References:
Message-ID:

Yup!

Mark

*Il meglio è l'inimico del bene*

On 1 April 2014 09:13, Philippe Verdy wrote:

> April 1st joke...
>
> 2014-04-01 9:01 GMT+02:00 Mark Davis ☕️ :
>
>> More emoji from Chrome:
>>
>> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>>
>> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y

From mathias at qiwi.be Tue Apr 1 02:25:29 2014
From: mathias at qiwi.be (Mathias Bynens)
Date: Tue, 1 Apr 2014 09:25:29 +0200
Subject: FYI: More emoji from Chrome
In-Reply-To:
References:
Message-ID: <945ED05C-7DB3-40C4-8622-F5EDAD63D04E@qiwi.be>

On 1 Apr 2014, at 09:13, Philippe Verdy wrote:

> April 1st joke...

Sure - it really works, though. Try it out. Kinda cool :)

I would've preferred if Google had finally implemented support for
proper emoji in OS X, though:
https://code.google.com/p/chromium/issues/detail?id=62435

From jjc at jclark.com Tue Apr 1 00:51:11 2014
From: jjc at jclark.com (James Clark)
Date: Tue, 1 Apr 2014 12:51:11 +0700
Subject: Bidi reordering of soft hyphen
Message-ID:

Suppose I have a paragraph (uppercase = RTL):

CARROT IS car\u00ADrot IN ENGLISH

and the paragraph gets broken at the soft hyphen.

Is the correct ordering for the first line

car- SI TORRAC

or

-car SI TORRAC

? I did not succeed in deducing the answer from UAX#9. Soft hyphen has
bidi class BN, which means it gets removed in stage X9, and so, if I
have understood correctly, doesn't have a defined embedding level.

I'm guessing the correct ordering is the first one, but I don't trust
my instincts here. (In particular, I wondered whether this was
analogous to the case where rule L1 resets embedding levels so that
trailing whitespace is at the visual end of the line.)

More generally, suppose you have a markup language which has a
construct for discretionary breaks, as in TeX, with pre-break,
post-break and no-break text. Soft hyphen is a special case of this
(where the pre-break text consists of a hyphen, and the post- and
no-break texts are empty); you can also regard space as a kind of
discretionary break (post-break text empty, no-break text contains the
space, pre-break text either contains the space or is empty, depending
on how you want to think about it). Obviously the embedding level for
the no-break text should be resolved as if the discretionary break was
replaced by the no-break text (which is consistent with a bidi class of
BN for soft hyphen).
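The discretionary-break model described in this message can be sketched
as a small data structure. This is only an illustration in Python: the
names Discretionary and realize are made up for the sketch, U+2010 is
assumed as the visible hyphen, and the pre-break text of a space is
assumed to be empty. The code builds the logical text of each line and
does not attempt to resolve embedding levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discretionary:
    """Hypothetical model of a TeX-style discretionary break."""
    pre_break: str = ""   # text at the end of the line if the break is taken
    post_break: str = ""  # text at the start of the next line if it is taken
    no_break: str = ""    # text used in place if the break is not taken


# Assumption: U+2010 HYPHEN is the visible hyphen; a renderer may pick another.
SOFT_HYPHEN = Discretionary(pre_break="\u2010")
# Assumption: the pre-break text of a space is taken to be empty.
SPACE = Discretionary(no_break=" ")


def realize(items, break_at=None):
    """Return the logical text of each line.

    `items` mixes plain strings and Discretionary objects; `break_at` is the
    index of the discretionary where the line is broken, or None.
    """
    lines = [""]
    for index, item in enumerate(items):
        if isinstance(item, str):
            lines[-1] += item
        elif index == break_at:
            lines[-1] += item.pre_break
            lines.append(item.post_break)
        else:
            lines[-1] += item.no_break
    return lines


paragraph = ["CARROT", SPACE, "IS", SPACE, "car", SOFT_HYPHEN, "rot"]
print(realize(paragraph))              # one line: 'CARROT IS carrot'
print(realize(paragraph, break_at=5))  # 'CARROT IS car' + U+2010, then 'rot'
```

Keeping the three texts separate makes it explicit which characters
exist only when a break is taken, which is exactly the part whose bidi
treatment is in question.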
However, for the pre- and post-break text, it is not clear to me what
the right way is to resolve embedding levels (or how their content
should be restricted so that there is a sensible way to resolve the
embedding levels). I would be grateful for any suggestions.

James

From nospam-abuse at ilyaz.org Tue Apr 1 11:43:43 2014
From: nospam-abuse at ilyaz.org (Ilya Zakharevich)
Date: Tue, 1 Apr 2014 09:43:43 -0700
Subject: FYI: More emoji from Chrome
In-Reply-To:
References:
Message-ID: <20140401164343.GA5003@powdermilk>

On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ☕️ wrote:
> More emoji from Chrome:
>
> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>
> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y

I do not know... The demos leave me completely unimpressed: emoji, by
their nature, require higher resolution than text, so an emoji for
"pie" does not save any space compared to the word itself. So the
impact of this on everyday English-language communication would not be
in any way beneficial.

However, this MAY be the beginning of a revolution in scientific
communication. Science-and-about publications contain very long words
in abundance, and it is HERE where the impact of emojification should
be felt the most! So I think the task of emojification of scientific
terms, be it "secularization", "gamma-globulin", or "derived
∞-category", should be at elevated priority in the Unicode committees.

The general public often considers scientific publications too dense,
and does not bother to read many scientific journals. What Google did
is the beginning of a major step forward in making contemporary science
(finally!) accessible to the general public.
Ilya

From richard.wordingham at ntlworld.com Tue Apr 1 15:10:23 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 1 Apr 2014 21:10:23 +0100
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References:
Message-ID: <20140401211023.6042e0a5@JRWUBU2>

On Tue, 1 Apr 2014 12:51:11 +0700
James Clark wrote:

> Suppose I have a paragraph (uppercase = RTL):
>
> CARROT IS car\u00ADrot IN ENGLISH
>
> and the paragraph gets broken at the soft hyphen.
>
> Is the correct ordering for the first line
>
> car- SI TORRAC
>
> or
>
> -car SI TORRAC
>
> ? I did not succeed in deducing the answer from UAX#9. Soft hyphen
> has bidi class BN, which means it gets removed in stage X9, and so,
> if I have understood correctly, doesn't have a defined embedding
> level.
>
> I'm guessing the correct ordering is the first one, but I don't trust
> my instincts here. (In particular, I wondered whether this was
> analogous to the case where rule L1 resets embedding levels so that
> trailing whitespace is at the visual end of the line.)

There is no conformance requirement on the location of the soft hyphen.
Indeed, there is no requirement on whether it is rendered at all (TUS
Section 16.2). As the treatment of the soft hyphen is language
dependent even in unidirectional text, I am afraid the treatment is
down to good taste and the language(s) involved. (E.g., is this Arabic
text effectively embedding English text within an overall Thai
context?)

As U+2010 HYPHEN would result in text like 'car-', in an English
influenced context I would also go with 'car-'.

Richard.

From ken.whistler at sap.com Tue Apr 1 15:20:13 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 1 Apr 2014 20:20:13 +0000
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References:
Message-ID:

I don't think the answer is directly deduced from UAX #9, because it
involves deciding where to insert a visible hyphen for display.
However, I think the correct answer here is your number two guess,
i.e. (in a RTL paragraph context):

-car SI TORRAC

A way to think about this, rather than starting from the BN nature of
U+00AD, is to ask what would happen if there was an *explicit*
hyphen-minus at the same position. Shortening your example line
"CARROT IS car\u00AD" to just the equivalent of "ABC car-", the outcome
of the bidiref processing for a RTL paragraph context is:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:     R    R    R    R    L    L    L    R
  Levels:         1    1    1    1    2    2    2    1
  Runs:
  Order:       [7 4 5 6 3 2 1 0]

In other words, on display:

-car CBA
<---------

with the hyphen-minus at the *end* of the reordered line, as expected.

If you run the same example, but substituting U+00AD for U+002D, you
get:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 00AD
  Bidi_Class:     R    R    R    R    L    L    L   BN
  Levels:         1    1    1    1    2    2    2    x
  Runs:
  Order:       [4 5 6 3 2 1 0]

And the display for that would be:

car CBA

But *then* your hyphenation algorithm would presumably kick in and
decide that the U+00AD is at the end of the line and should display as
a visible hyphen glyph. But "end of the line" here means the same as it
would for the explicit hyphen-minus, so when you insert the visible
hyphen glyph, you end up with the same result:

-car CBA

Another way of looking at this is that in order to line break your text
in the first place, you need to be able to calculate the resolved
display width to fit in the line. That would have to include the visual
display of the inserted hyphen glyph. So once you have *decided* to
break the line at the soft hyphen, in effect, you substitute a visual
display symbol U+002D (or the actual hyphen U+2010, etc.) for U+00AD.
*Then* run the UBA on the results to get the resolved order of all the
elements on the line. The net effect should be the same.
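As a rough cross-check of the two "Order:" rows in the traces above,
here is a minimal Python sketch of the reversal step of rule L2. It
assumes the resolved levels are already known and does not compute
them. The function name reorder_l2 and the use of None for the level
shown as "x" are choices made for this sketch, not part of bidiref.

```python
def reorder_l2(levels):
    """Return the visual order (logical indices, left to right) for one line.

    `levels` holds the resolved embedding level of each character, or None
    for a character removed by rule X9 (an assumption of this sketch for
    the "x" entry in the trace).
    """
    # Keep only characters that still have a level.
    order = [i for i, level in enumerate(levels) if level is not None]
    if not order:
        return order
    kept = [levels[i] for i in order]
    highest = max(kept)
    lowest_odd = min((lv for lv in kept if lv % 2 == 1), default=highest + 1)
    # Rule L2: from the highest level down to the lowest odd level, reverse
    # every maximal run of characters at that level or higher.
    for level in range(highest, lowest_odd - 1, -1):
        start = None
        for pos in range(len(order) + 1):
            in_run = pos < len(order) and levels[order[pos]] >= level
            if in_run and start is None:
                start = pos
            elif not in_run and start is not None:
                order[start:pos] = order[start:pos][::-1]
                start = None
    return order


def visual(text, levels):
    """Lay out `text` in the order given by reorder_l2."""
    return "".join(text[i] for i in reorder_l2(levels))


# Uppercase stands in for the Hebrew letters, as elsewhere in the thread.
print(reorder_l2([1, 1, 1, 1, 2, 2, 2, 1]))              # [7, 4, 5, 6, 3, 2, 1, 0]
print(visual("ABC car-", [1, 1, 1, 1, 2, 2, 2, 1]))      # -car CBA
print(reorder_l2([1, 1, 1, 1, 2, 2, 2, None]))           # [4, 5, 6, 3, 2, 1, 0]
print(visual("ABC car\u00AD", [1, 1, 1, 1, 2, 2, 2, None]))  # car CBA
```

The sketch only covers the final reordering, so the position of the
hyphen in its output depends entirely on the level the hyphen was given
beforehand.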
Maybe folks with full implementations of bidi rendering would have more
to contribute on this, but that would be my own take on the problem.

--Ken

From ken.whistler at sap.com Tue Apr 1 15:31:08 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 1 Apr 2014 20:31:08 +0000
Subject: Bidi reordering of soft hyphen
In-Reply-To: <20140401211023.6042e0a5@JRWUBU2>
References: <20140401211023.6042e0a5@JRWUBU2>
Message-ID:

Richard Wordingham noted:

> As U+2010 HYPHEN would result in text like 'car-', in an English
> influenced context I would also go with 'car-'.

That's always a possibility, I suppose, but I'm not sure what "English
influenced context" means here.

The examples I just gave were for a RTL paragraph context. In a LTR
paragraph context, the same input would end up in a very different
order:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:     R    R    R    L    L    L    L    L
  Levels:         1    1    1    0    0    0    0    0
  Runs:
  Order:       [2 1 0 3 4 5 6 7]

And you get the display:

CBA car-
--------->

As opposed to:

-car CBA
<---------

In either case, the hyphen-minus (or hyphen), ends up at the *end of
the line*.

My take is that *if* I am going to insert a visible glyph at the point
of the SHY, it would probably be best to insert it at the actual line
break at the end of the line, to be in the same position as an explicit
hyphen-minus with the same line break.

--Ken

From roozbeh at unicode.org Tue Apr 1 16:00:25 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Tue, 1 Apr 2014 14:00:25 -0700
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References:
Message-ID:

Adding Behdad for his insight on the rendering stack.

But as for user requirements and expectations, the first option, with
the hyphen on the right side of "car" as "car-" is what a good
publisher would want to print in his magazine or book. The second
option is harder to decipher for an RTL reader.
(Note that breaking opposite-direction phrases across lines in bidi
paragraphs is also avoided as much as possible in good typography, as
the output is weird to some readers anyway.)

From asmusf at ix.netcom.com Tue Apr 1 16:43:38 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 01 Apr 2014 14:43:38 -0700
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References: <20140401211023.6042e0a5@JRWUBU2>
Message-ID: <533B330A.8030401@ix.netcom.com>

I think this calls for an implementation note on UAX#9 along these
lines.

-------------------------
During line breaking, if a line is broken at the location of a SHY, the
text around the line break may change. A common case is the replacement
of the invisible SHY by a visible HYPHEN, but see Section x.x in the
Unicode Standard.

For the purposes of the Bidi Algorithm, apply steps .. to .. after any
substitutions have been made, using the directional classes for the
substituted characters, instead of a single BN for the SHY character.
Note, no special action need be taken for a SHY character in the middle
of a line, unless it is rendered as a visible glyph in a "show hidden
character" mode. In the latter case, the recommendation would be to
treat the visible symbol substituted for the SHY as having bidi class
ON.
------------------------

I am not sure whether

-car CBA

or

car- CBA

is the right answer, nor whether the substitution will always be
limited to the preceding line. (Old orthography German had Bäcker
turning into Bäk-|ker, where I've used | to show the line ending.)

Those are details that the UBA should be ignorant about. The important
thing is that the array of bidi directional classes is not constrained
to contain a single entry for BN at the location of the original SHY.

If "car- CBA" is the right answer then the substitution would have to
be HYPHEN plus LRM to get this to come out right, but that would be
under the control of the line-breaking conventions, and not legislated
by the UBA.

A./
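One way to read the note proposed above: replace a line-final SHY with
whatever the line breaker substitutes, then take the bidi classes from
the substituted characters. A minimal Python sketch of that reading
follows, using the standard unicodedata module for class lookup. The
helper classes_for_line is hypothetical, real Hebrew letters stand in
for "ABC", and the bidi algorithm itself is not run here.

```python
import unicodedata

SHY = "\u00AD"     # SOFT HYPHEN
HYPHEN = "\u2010"  # HYPHEN
LRM = "\u200E"     # LEFT-TO-RIGHT MARK


def classes_for_line(line, substitution=HYPHEN):
    """Return (text, bidi classes) for one line after SHY substitution.

    Assumption of this sketch: only a SHY at the very end of the line is
    replaced; what to substitute is left to the caller.
    """
    if line.endswith(SHY):
        line = line[:-1] + substitution
    return line, [unicodedata.bidirectional(ch) for ch in line]


line = "\u05D0\u05D1\u05D2 car" + SHY

# Classes of the raw text: the SHY contributes a single BN.
print([unicodedata.bidirectional(ch) for ch in line])
# ['R', 'R', 'R', 'WS', 'L', 'L', 'L', 'BN']

# Substituting HYPHEN: the last entry becomes ON.
print(classes_for_line(line)[1])
# ['R', 'R', 'R', 'WS', 'L', 'L', 'L', 'ON']

# Substituting HYPHEN plus LRM: two entries replace the single BN.
print(classes_for_line(line, HYPHEN + LRM)[1])
# ['R', 'R', 'R', 'WS', 'L', 'L', 'L', 'ON', 'L']
```

The last case shows why the class array cannot be assumed to keep one
entry per original character once substitution is allowed.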
From nikiselken at gmail.com Tue Apr 1 16:50:01 2014
From: nikiselken at gmail.com (Nicole Selken)
Date: Tue, 1 Apr 2014 17:50:01 -0400
Subject: Unicode Digest, Vol 4, Issue 1
In-Reply-To:
References:
Message-ID:

I think Emoji is totally beneficial as a communication form. Yea, it
takes up some UTF space and such but they literally affect different
parts of the brain than written words. In this way they change the kind
of communication possible. Also, so many people (especially the young)
are using them, to ignore them or dismiss them would be a mistake.

I like that they were included into the character set for Unicode and I
would love to talk with someone who was a decision maker on that panel
for my Emoji Project. If anyone who has worked on it has some time,
drop me a line!

https://niki-selken.squarespace.com/#/world-translation-foundation/

Thanks,
Niki Selken

Working on: www.nikiselken.com

On Tue, Apr 1, 2014 at 1:00 PM, wrote:

> Today's Topics:
>
> 1. Call for the experts of U+3013 (suzuki toshiya)
> 2. FYI: More emoji from Chrome (Mark Davis ☕️)
> 3. Re: FYI: More emoji from Chrome (Philippe Verdy)
> 4. Re: FYI: More emoji from Chrome (Mark Davis ☕️)
> 5. Re: FYI: More emoji from Chrome (Mathias Bynens)
> 6. Bidi reordering of soft hyphen (James Clark)
> 7. Re: FYI: More emoji from Chrome (Ilya Zakharevich)
>
> ---------- Forwarded message ----------
> From: suzuki toshiya
> To: Unicode Discussion
> Cc:
> Date: Tue, 01 Apr 2014 09:28:26 +0900
> Subject: Call for the experts of U+3013
> Dear all,
>
> Today I submitted a preliminary proposal to standardize
> Variation Selectors for U+3013, so-called "GETA" mark.
>
> ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/n4572.pdf
>
> The geta mark was introduced from JIS X 0208:1990 and
> GB 2312-1980. When I check the original documents
> including the geta mark, some of the representative glyphs
> in these regional standards are different from original
> geta mark. I investigated theoretically possible visual
> shapes of the geta mark, and concluded the registry-based
> standardization of the geta mark is a considerable option.
>
> Unfortunately, the officially printed matters including
> the geta mark is not popular (I found only a few books
> in Japanese national diet library), so I want to hear the
> comments from the geta expert for the official proposal.
>
> Regards,
> mpsuzuki

From smontagu at smontagu.org Tue Apr 1 17:40:57 2014
From: smontagu at smontagu.org (Simon Montagu)
Date: Wed, 02 Apr 2014 01:40:57 +0300
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References:
Message-ID: <533B4079.2030402@smontagu.org>

On 04/02/2014 12:00 AM, Roozbeh Pournader wrote:
> Adding Behdad for his insight on the rendering stack.
>
> But as for user requirements and expectations, the first option, with
> the hyphen on the right side of "car" as "car-" is what a good publisher
> would want to print in his magazine or book. The second option is
> harder to decipher for an RTL reader.

I agree with Roozbeh here. Since the hyphen marks a break in the middle
of the word, I think the most natural user expectation is that it
should appear after the last character in the word, where "after" and
"last" both refer to the reading direction of the word.

I have seen examples of this in published Hebrew books, and this is
also the way it's rendered in Chrome, Firefox and Opera (but in the
case of Firefox, since I wrote the code for it I can testify that it
isn't this way by design: as far as I remember I only took into account
the direction of the text run containing the soft hyphen and didn't
even think about the opposite-direction case).

From richard.wordingham at ntlworld.com Tue Apr 1 18:02:57 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 2 Apr 2014 00:02:57 +0100
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References:
Message-ID: <20140402000257.32dd544d@JRWUBU2>

On Tue, 1 Apr 2014 20:20:13 +0000
"Whistler, Ken" wrote:

> I don't think the answer is directly deduced from UAX #9, because
> it involves deciding where to insert a visible hyphen for display.
> However, I think the correct answer here is your number two guess,
> i.e. (in a RTL paragraph context):
>
> -car SI TORRAC
>
> A way to think about this, rather than starting from the BN nature
> of U+00AD, is to ask what would happen if there was an *explicit*
> hyphen-minus at the same position.

Is it legitimate to truncate the context to a single line?
The BiDi algorithm is attempting to interpret unlabelled text as embedded text (it's not an arbitrary dance), and in just one line there is no indicator of whether the hyphen is part of the LTR text embedded in RTL text. However, the very next character is 'r', which tells us that the left-to-right run contains the hyphen. I also think the HYPHEN-MINUS is the wrong character to consider - the analogy should be with U+2010 HYPHEN (class ON) rather than with U+2212 MINUS SIGN (class ES), let alone the ambiguous HPYHEN-MINUS, for which ES is merely the interpretation most likely to work. I found a similar example, but with Hebrew embedded in the Latin script, in the introduction to the Stuttgart Bible. The corresponding character was U+05BE HEBREW PUNCTUATION MAQAF, though in this case the class is R (because one doesn't expect MAQAF to be used with left-to right scripts), and therefore not as good an example as I would have hoped for. The BiDi algorith then happily places the MAQAF internally, making the analogy 'car- SI TORRAC'. (I metaphorically embedded the quote, so I don't get 'SI TORRAC car-', which is plain wrong.) Now, a valid opposing view is that the graphical representation of soft hyphens says, "When written out as one very long line, there is no space between successive lines", as opposed to "This apparent word is actually continued by text on the next line". If you take the interpretation of the marks operating at the level of lines, then '-car SI TORRAC' is reasonable. As English has the hyphen as a half-way house between one word and two words, English very naturally works at the word level. I am not sure about other languages. Richard. From jonathan.rosenne at gmail.com Tue Apr 1 18:12:35 2014 From: jonathan.rosenne at gmail.com (Jonathan Rosenne) Date: Wed, 2 Apr 2014 02:12:35 +0300 Subject: Bidi reordering of soft hyphen In-Reply-To: <533B4079.2030402@smontagu.org> References:

<533B4079.2030402@smontagu.org>
Message-ID: <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>

The use of soft hyphen is a cultural matter. In Hebrew, Classic and Israeli, soft hyphens are not used.

Best Regards,

Jonathan Rosenne
054-4246522

From asmusf at ix.netcom.com Tue Apr 1 18:39:13 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 01 Apr 2014 16:39:13 -0700
Subject: Bidi reordering of soft hyphen
In-Reply-To: <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>
References:

<533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>
Message-ID: <533B4E21.2020504@ix.netcom.com>

On 4/1/2014 4:12 PM, Jonathan Rosenne wrote:
> The use of soft hyphen is a cultural matter. In Hebrew, Classic and Israeli,
> soft hyphens are not used.

More to the point, how does software render a soft hyphen included in inserted LTR text, when the outer text is Hebrew? Would it always be ignored? Would it be rendered? How?

Mind you, I don't think that the bidi algorithm as such needs to care about these details, but the Unicode Standard does mumble about different conventions. Might be useful to add some examples to such mumbling.

A./

From ken.whistler at sap.com Tue Apr 1 18:41:48 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 1 Apr 2014 23:41:48 +0000
Subject: Bidi reordering of soft hyphen
In-Reply-To: <20140402000257.32dd544d@JRWUBU2>
References: <20140402000257.32dd544d@JRWUBU2>
Message-ID:

> Is it legitimate to truncate the context to a single line? The BiDi
> algorithm is attempting to interpret unlabelled text as embedded text
> (it's not an arbitrary dance), and in just one line there is no
> indicator of whether the hyphen is part of the LTR text embedded in RTL
> text.

For this discussion, I think yes. See Section 3.4 of UAX #9:

The following rules describe the logical process of finding the correct display order. As opposed to resolution phases, these rules act on a per-line basis and are applied after any line wrapping is applied to the paragraph.

The main collection of UBA rules apply on a per-paragraph basis, but you cannot actually do reordering of the resolved levels until you have specified the line breaks. Effectively, the hyphenation decision has to be taken first. And *then* you can reorder the results line-by-line.

So once we have the decision where we are breaking "car-/rot", we can then talk just about where the "car-" ends up on the single line.
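
The per-line step described above can be illustrated with a small sketch of rule L2 (reversal by level) on its own. This is a minimal sketch under stated assumptions, not the reference implementation: `reorder_line` and `display` are hypothetical helper names, uppercase Latin letters stand in for the Hebrew letters, and the embedding levels are supplied by hand rather than resolved. The LTR-paragraph levels are the ones shown in the [L2] trace quoted elsewhere in this thread; the two RTL-paragraph level arrays are assumptions chosen to match the displays discussed here.

```python
def reorder_line(levels):
    """Return the visual order (left to right) of logical indices, per rule L2.

    `levels` holds one resolved embedding level per character of a single
    line. Line breaking is assumed to have happened already.
    """
    order = list(range(len(levels)))
    odd_levels = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd_levels:
        return order  # purely left-to-right line, nothing to reverse
    # From the highest level down to the lowest odd level, reverse every
    # maximal contiguous run of characters at that level or higher.
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return order


def display(logical, levels):
    """Apply reorder_line to a logical string (hypothetical helper)."""
    return "".join(logical[i] for i in reorder_line(levels))


line = "ABC car-"  # ABC stands for the three Hebrew letters, in logical order

ltr_levels = [1, 1, 1, 0, 0, 0, 0, 0]      # from the quoted [L2] trace
rtl_levels = [1, 1, 1, 1, 2, 2, 2, 1]      # assumed: hyphen at paragraph level
rtl_hyphen_l2 = [1, 1, 1, 1, 2, 2, 2, 2]   # assumed: hyphen forced into the LTR run

print(display(line, ltr_levels))      # CBA car-
print(display(line, rtl_levels))      # -car CBA
print(display(line, rtl_hyphen_l2))   # car- CBA
```

The third case corresponds to the hyphen resolving to the same level as "car", which is what a following LRM (not shown in the string) would be expected to achieve; without it the hyphen lands at the end of the line in the paragraph direction.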
But I agree that there are many conundrums for trying to hyphenate individual words in mixed-direction bidi text, so I am not surprised that there would be special typographical conventions which might, as Asmus suggested, require dropping in LRM's or the like, if you wanted the visual placement of hyphens to override the basic behavior of the algorithm.

> However, the very next character is 'r', which tells us that the
> left-to-right run contains the hyphen. I also think the HYPHEN-MINUS
> is the wrong character to consider - the analogy should be with U+2010
> HYPHEN (class ON) rather than with U+2212 MINUS SIGN (class ES), let
> alone the ambiguous HYPHEN-MINUS, for which ES is merely the
> interpretation most likely to work.

Well, sure, but for the purposes of *this* particular discussion, it makes no difference whatsoever whether we are using U+002D or U+2010, despite the difference in Bidi_Class, since there is no question of numerical formatting here. Rule W6 will convert the bc=ES to bc=ON, and thereafter the processing is identical:

Trace: Entering br_UBA_ResolveTerminators [W5]
Current State: 11
Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
Bidi_Class:  R    R    R    WS   L    L    L    ES
Levels:      1    1    1    1    1    1    1    1
Runs:

Trace: Entering br_UBA_ResolveESCSET [W6]
Current State: 12
Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
Bidi_Class:  R    R    R    WS   L    L    L    ON
Levels:      1    1    1    1    1    1    1    1
Runs:

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From verdy_p at wanadoo.fr Tue Apr 1 20:31:43 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 2 Apr 2014 03:31:43 +0200
Subject: FYI: More emoji from Chrome
In-Reply-To: <20140401164343.GA5003@powdermilk>
References: <20140401164343.GA5003@powdermilk>
Message-ID:

2014-04-01 18:43 GMT+02:00 Ilya Zakharevich :

> However, this MAY be a beginning of revolution in scientific
> communication. Science-and-about publications contains very long
> words in abundance, and it is HERE where impact of emojification
> should be felt the most! So I think the task of emojification of
> scientific terms -- be it "secularization", "gamma-globulin", or
> "derived ?-category" -- should be at elevated priority in the Unicode
> committees. The general public often considers scientific publications are too
> dense, and does not bother to read many scientific journals.

The density of scientific publications is not so much about word length (the words are actually not much longer than in general text) as about the precision added by each word and the associated information, which requires frequent use of qualifiers and subqualifiers. Frequently it is difficult to give names to the concepts, so scientists will start using notations and many abbreviations defined specifically for a document or topic, which can only be understood in their specific context (outside this context, or without prior knowledge of commonly used conventions, the text will look extremely confusing). Note also that the common use of synonyms in general speech does not apply here, because scientists tend to create stronger distinctions between terms that most of the public would not really distinguish.

This is all about terminology, and even this list frequently has problems discussing concepts due to terms that now carry a more precise meaning (an example on this list is all the discussions related to "character", "codes", "code points", "collation element" vs. "collating element": the general public cannot see the differences, and the specifications then look very confusing or obscure to them).

Reading a scientific paper then requires much more attention and prior knowledge of specific conventions.

> What
> Google did is a beginning of a major step forward in making
> contemporary science (finally!) accessible to general public.
>

Not at all. Emojis are certainly not what scientists are using for their needed conventions, simply because their representation is far too permissive (they carry similar "emotions", their glyphs are frequently modified with lots of variants, different colors, styles).

In fact scientists do not use emojis. When they need to summarize concepts, they create conventional abbreviations/acronyms, or symbols with precise glyphs (and the glyph appearance is semantically important, e.g. in maths, chemical formulas, electronics, physics, building engineering...), or specific terminologies (legal texts...). These conventions are not freely translatable with emojis.

Even a cookbook for meals cannot easily use emojis. If words are not precise enough, they'll use photos. But cuisine or gardening also has its own terminology.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From duerst at it.aoyama.ac.jp Tue Apr 1 20:39:13 2014
From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=)
Date: Wed, 02 Apr 2014 10:39:13 +0900
Subject: FYI: More emoji from Chrome
In-Reply-To: <20140401164343.GA5003@powdermilk>
References: <20140401164343.GA5003@powdermilk>
Message-ID: <533B6A41.1080508@it.aoyama.ac.jp>

Now that it's no longer April 1st (at least not here in Japan), I can add a (moderately) serious comment.

On 2014/04/02 01:43, Ilya Zakharevich wrote:
> On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ☕ wrote:
>> More emoji from Chrome:
>>
>> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>>
>> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y
>
> I do not know... The demos leave me completely unimpressed: emoji -- by
> their nature -- require higher resolution than text, so an emoji for
> "pie" does not save any place comparing to the word itself. So the
> impact of this on everyday English-language communication would not be
> in any way beneficial.

This is somewhat different for Japanese (and languages with similar writing systems) because they have higher line height.

Regards, Martin.

From jonathan.rosenne at gmail.com Tue Apr 1 23:25:26 2014
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Wed, 2 Apr 2014 07:25:26 +0300
Subject: Bidi reordering of soft hyphen
In-Reply-To: <533B4E21.2020504@ix.netcom.com>
References:

<533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com> <533B4E21.2020504@ix.netcom.com>
Message-ID: <001c01cf4e2b$94462920$bcd27b60$@gmail.com>

I don't think it matters very much what the software would do were there to be a soft hyphen in the text, firstly because it is not very likely for a soft hyphen to have been in the text intentionally and secondly because the software would more likely than not have been developed in a cultural environment that cares about soft hyphens.

Best Regards,

Jonathan Rosenne

From smontagu at smontagu.org Wed Apr 2 01:02:08 2014
From: smontagu at smontagu.org (Simon Montagu)
Date: Wed, 02 Apr 2014 09:02:08 +0300
Subject: Bidi reordering of soft hyphen
In-Reply-To: <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>
References:

<533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com>
Message-ID: <533BA7E0.2050004@smontagu.org>

On 04/02/2014 02:12 AM, Jonathan Rosenne wrote:
> The use of soft hyphen is a cultural matter. In Hebrew, Classic and Israeli,
> soft hyphens are not used.

I don't understand this statement. Classic, yes, but in Israeli Hebrew soft hyphens typically _are_ used in texts printed in relatively narrow justified columns -- common examples are newspapers and encyclopædias.

(Or are we using terms differently? In any case, with respect to the original question about where to position a soft hyphen in a line break in the middle of a word in an opposite-direction run in bidirectional text, I believe that it doesn't make a difference whether we are referring to U+00AD SOFT HYPHEN, a hyphen automatically inserted by typesetting software, or a hyphen inserted manually).

From chris.fynn at gmail.com Wed Apr 2 01:04:54 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Apr 2014 12:04:54 +0600
Subject: Emoji
Message-ID:

On 02/04/2014, Nicole Selken wrote:
> I think Emoji is totally beneficial as a communication form.

A reversion to a crude form of Hieroglyphics?

From jonathan.rosenne at gmail.com Wed Apr 2 01:15:55 2014
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Wed, 2 Apr 2014 09:15:55 +0300
Subject: Bidi reordering of soft hyphen
In-Reply-To: <533BA7E0.2050004@smontagu.org>
References:

<533B4079.2030402@smontagu.org> <002b01cf4dff$dfb53df0$9f1fb9d0$@gmail.com> <533BA7E0.2050004@smontagu.org>
Message-ID: <004001cf4e3b$03512400$09f36c00$@gmail.com>

Some papers are indeed doing this sporadically. It looks like it is up to the individual writer. The samples I see are barely readable, incorrect and unprofessional any way you look at them, and seem to derive from the use of inappropriate software.

Best Regards,

Jonathan Rosenne

From mark at macchiato.com Wed Apr 2 01:27:23 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Wed, 2 Apr 2014 08:27:23 +0200
Subject: Bidi reordering of soft hyphen
In-Reply-To: <533B330A.8030401@ix.netcom.com>
References: <20140401211023.6042e0a5@JRWUBU2> <533B330A.8030401@ix.netcom.com>
Message-ID:

I tend to agree with Roozbeh and Behdad. I would expect to find the visible appearance of the hyphen "replacing" the letters that were broken off from the last word. That is, if the word was "beekeeper", I'd expect to see:

.... bee- .....

That would be no matter where the word occurred, and no matter what the direction of the paragraph or surrounding text. (If the SHY occurred at a directional boundary, I'd also say we don't care much...)

In any event, once we come up with an agreed recommendation, I'd suggest an implementation note like Asmus describes, but rather than talk about algorithmic steps, just point out the desired visual behavior (since there are many ways to do it).

Mark

*Il meglio è l'inimico del bene*

On 1 April 2014 23:43, Asmus Freytag wrote:

> I think this calls for an implementation note on UAX#9 along these lines.
> -------------------------
> During line breaking, if a line is broken at the location of a SHY, the
> text around the line break may change. A common case is the replacement of
> the invisible SHY by a visible HYPHEN, but see Section x.x in the Unicode
> Standard.
>
> For the purposes of the Bidi Algorithm, apply steps .. to .. after any
> substitutions have been made, using the directional classes for the
> substituted characters, instead of a single BN for the SHY character.
>
> Note, no special action need be taken for a SHY character in the middle of
> a line, unless they are rendered as visible glyphs in a "show hidden
> character" mode. In the latter case, the recommendation would be to treat
> the visible symbol substituted for the SHY as having bidi class ON.
> ------------------------
>
> I am not sure whether -car CBA or car- CBA is the right answer, nor
> whether the substitution will always be limited to the preceding line. (Old
> orthography German had Bäcker turning in to Bäk-|ker, where I've used
> | to show the line ending.) Those are details that the UBA should be
> ignorant about. The important thing is that the array of bidi directional
> classes is not constrained to contain a single entry for BN at the location
> of the original SHY.
>
> If "car- CBA" is the right answer then the substitution would have to be
> HYPHEN plus LRM to get this to come out right, but that would be under the
> control of the line-breaking conventions, and not legislated by the UBA.
>
> A./
>
>
> On 4/1/2014 1:31 PM, Whistler, Ken wrote:
> > Richard Wordingham noted:
> >
> > > As U+2010 HYPHEN would result in text like 'car-', in an English
> > > influenced context I would also go with 'car-'.
> >
> > That's always a possibility, I suppose, but I'm not sure what
> > "English influenced context" means here.
> >
> > The examples I just gave were for a RTL paragraph context.
> > In a LTR paragraph context, the same input would end up in
> > a very different order:
> >
> > Trace: Entering br_UBA_ReverseLevels [L2]
> > Current State: 19
> > Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
> > Bidi_Class:  R    R    R    L    L    L    L    L
> > Levels:      1    1    1    0    0    0    0    0
> > Runs:
> >
> > Order: [2 1 0 3 4 5 6 7]
> >
> > And you get the display:
> >
> > CBA car-
> > --------->
> >
> > As opposed to:
> >
> > -car CBA
> > <---------
> >
> > In either case, the hyphen-minus (or hyphen), ends up at the *end of the line*.
> >
> > My take is that *if* I am going to insert a visible glyph at the point of the
> > SHY, it would probably be best to insert it at the actual line break at the
> > end of the line, to be in the same position as an explicit hyphen-minus with
> > the same line break.
> >
> > --Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wjgo_10009 at btinternet.com Wed Apr 2 01:29:15 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 2 Apr 2014 07:29:15 +0100 (BST)
Subject: Emoji
In-Reply-To:
References:
Message-ID: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com>

For me, an important aspect of emoji is that they are independent of language. They can localize in the mind of the reader.

How can they express verbs such as need and must; and pronouns?

How can they express thanks?

William Overington

2 April 2014

From chris.fynn at gmail.com Wed Apr 2 01:46:11 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Apr 2014 12:46:11 +0600
Subject: Emoji
In-Reply-To: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID:

On 02/04/2014, William_J_G Overington wrote:
> For me, an important aspect of emoji is that they are independent of
> language.

Emoji seem fairly culturally specific. (Maybe the mobile-phone messaging culture.) Kind of shorthand expressions which may be used with several languages - but not independent of language.

I suspect some of them already convey one thing to a Japanese teenager and quite another to an American. And if you showed these symbols to many people in other countries they wouldn't have a clue as to what they are supposed to mean.

From richard.wordingham at ntlworld.com Wed Apr 2 02:36:24 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 2 Apr 2014 08:36:24 +0100
Subject: Bidi reordering of soft hyphen
In-Reply-To:
References: <20140402000257.32dd544d@JRWUBU2>
Message-ID: <20140402083624.24b8da63@JRWUBU2>

On Tue, 1 Apr 2014 23:41:48 +0000
"Whistler, Ken" wrote:

> > Is it legitimate to truncate the context to a single line? The BiDi
> > algorithm is attempting to interpret unlabelled text as embedded text
> > (it's not an arbitrary dance), and in just one line there is no
> > indicator of whether the hyphen is part of the LTR text embedded in
> > RTL text.

> For this discussion, I think yes. See Section 3.4 of UAX #9:

> The following rules describe the logical process of finding the
> correct display order. As opposed to resolution phases, these rules
> act on a per-line basis and are applied after any line wrapping is
> applied to the paragraph.

But it is a *resolution* rule that converts the true hyphen or minus sign to Bidi Class L; these apply before the scope reduces from paragraph to line.

Richard.

From chris.fynn at gmail.com Wed Apr 2 03:42:29 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Apr 2014 14:42:29 +0600
Subject: FYI: More emoji from Chrome
In-Reply-To: <533B6A41.1080508@it.aoyama.ac.jp>
References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp>
Message-ID:

Rather than Emoji it might be better if people learnt Han ideographs which are also compact (and a far more developed system of communication than emoji). One CJK character can also easily replace dozens of Latin characters - which is what is being claimed for emoji.

From chris.fynn at gmail.com Wed Apr 2 04:07:44 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Apr 2014 15:07:44 +0600
Subject: FYI: More emoji from Chrome
In-Reply-To: <533B6A41.1080508@it.aoyama.ac.jp>
References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp>
Message-ID:

On 02/04/2014, "Martin J. Dürst" wrote:
> Now that it's no longer April 1st (at least not here in Japan), I can
> add a (moderately) serious comment.

Long past April 1 here too - I'd already forgotten. ;-)

>>> More emoji from Chrome:
>>>
>>> http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html
>>>
>>> with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y
>>
>> I do not know... The demos leave me completely unimpressed: emoji -- by
>> their nature -- require higher resolution than text, so an emoji for
>> "pie" does not save any place comparing to the word itself. So the
>> impact of this on everyday English-language communication would not be
>> in any way beneficial.
>
> This is somewhat different for Japanese (and languages with similar
> writing systems) because they have higher line height.
>
> Regards, Martin.

So CJK glyphs take up similar space to that needed to display an emoji character. - Presumably the individual Han ideographs for "pie", "dumpling" or "turd" would save as much screen space as using the corresponding emoji pictographs.

Once there were enough emoji to carry on a conversation above the level of a 4 year old, they would also require an IME as complex as that needed for entering CJK text.

From asmusf at ix.netcom.com Wed Apr 2 05:17:35 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 02 Apr 2014 03:17:35 -0700
Subject: Bidi reordering of soft hyphen
In-Reply-To: <20140402083624.24b8da63@JRWUBU2>
References: <20140402000257.32dd544d@JRWUBU2> <20140402083624.24b8da63@JRWUBU2>
Message-ID: <533BE3BF.2010502@ix.netcom.com>

On 4/2/2014 12:36 AM, Richard Wordingham wrote:
> On Tue, 1 Apr 2014 23:41:48 +0000
> "Whistler, Ken" wrote:
>
>>> Is it legitimate to truncate the context to a single line? The BiDi
>>> algorithm is attempting to interpret unlabelled text as embedded text
>>> (it's not an arbitrary dance), and in just one line there is no
>>> indicator of whether the hyphen is part of the LTR text embedded in
>>> RTL text.
>
>> For this discussion, I think yes. See Section 3.4 of UAX #9:
>
>> The following rules describe the logical process of finding the
>> correct display order. As opposed to resolution phases, these rules
>> act on a per-line basis and are applied after any line wrapping is
>> applied to the paragraph.
> But it is a *resolution* rule that converts the true hyphen or minus
> sign to Bidi Class L; these apply before the scope reduces from
> paragraph to line.

When breaking a line at a soft hyphen, one is essentially modifying the text around the line break for display, because the SHY is not specific as to what should happen (as was the case with German old orthography, the changes go beyond simple substitution of a hyphen).

When you change the text, you have to fix up the resolution.

A./

>
> Richard.

From asmusf at ix.netcom.com Wed Apr 2 05:19:08 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 02 Apr 2014 03:19:08 -0700
Subject: FYI: More emoji from Chrome
In-Reply-To:
References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp>
Message-ID: <533BE41C.1050903@ix.netcom.com>

On 4/2/2014 1:42 AM, Christopher Fynn wrote:
> Rather than Emoji it might be better if people learnt Han ideographs
> which are also compact (and a far more developed system of
> communication than emoji). One CJK character can also easily replace
> dozens of Latin characters - which is what is being claimed for emoji.

One wonders why the Japanese, who already know Han ideographs, took to emoji as they did....

A./

From kojiishi at gluesoft.co.jp Wed Apr 2 06:05:22 2014
From: kojiishi at gluesoft.co.jp (Koji Ishii)
Date: Wed, 2 Apr 2014 11:05:22 +0000
Subject: FYI: More emoji from Chrome
In-Reply-To: <533BE41C.1050903@ix.netcom.com>
References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com>
Message-ID: <14FF721E-8589-498C-BA77-AB05F852048F@gluesoft.co.jp>

On Apr 2, 2014, at 7:19 PM, Asmus Freytag wrote:

> On 4/2/2014 1:42 AM, Christopher Fynn wrote:
>> Rather than Emoji it might be better if people learnt Han ideographs
>> which are also compact (and a far more developed system of
>> communication than emoji). One CJK character can also easily replace
>> dozens of Latin characters - which is what is being claimed for emoji.
>
> One wonders why the Japanese, who already know Han ideographs, took to emoji as they did....
All the ancient emoji characters we inherited from our ancestors were already turned into Han ideographs like this[1][2], so we needed new ones to add more Han ideographs in next centuries ;) [1] http://ameblo.jp/happy2525tkg/entry-11541848940.html [2] http://ameblo.jp/happy2525tkg/entry-11578197418.html /koji From chris.fynn at gmail.com Wed Apr 2 06:08:02 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Wed, 2 Apr 2014 17:08:02 +0600 Subject: FYI: More emoji from Chrome In-Reply-To: <533BE41C.1050903@ix.netcom.com> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> Message-ID: On 02/04/2014, Asmus Freytag wrote: > On 4/2/2014 1:42 AM, Christopher Fynn wrote: >> Rather than Emoji it might be better if people learnt Han ideographs >> which are also compact (and a far more developed system of >> communication than emoji). One CJK character can also easily replace >> dozens of Latin characters - which is what is being claimed for emoji. > > One wonders why the Japanese, who already know Han ideographs, took to > emoji as they did.... Perhaps because emoji are a sort of playful version of a means of communication they are already used to From duerst at it.aoyama.ac.jp Wed Apr 2 06:26:19 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 02 Apr 2014 20:26:19 +0900 Subject: FYI: More emoji from Chrome In-Reply-To: References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> Message-ID: <533BF3DB.1010103@it.aoyama.ac.jp> On 2014/04/02 20:08, Christopher Fynn wrote: > On 02/04/2014, Asmus Freytag wrote: >> On 4/2/2014 1:42 AM, Christopher Fynn wrote: >>> Rather than Emoji it might be better if people learnt Han ideographs >>> which are also compact (and a far more developed system of >>> communication than emoji). 
One CJK character can also easily replace >>> dozens of Latin characters - which is what is being claimed for emoji. >> >> One wonders why the Japanese, who already know Han ideographs, took to >> emoji as they did.... > > Perhaps because emoji are a sort of playful version of a means of > communication they are already used to Yes. Already used to the concept that a character can represent (more or less) a concept. Already used to the concept that there are lots of characters, and a few more won't make such a difference. Already used to the concept that character entry means keying a word or phrase and the selecting what you actually want. But I think the main reason for their spread was that the mobile phone companies introduced them and young people found them cute. In a followup, Line (http://line.me/en/), the most popular Japanese mobile message app (similar to WhatsApp) got popular mostly because of their gorgeous collection of 'stickers' (over 10,000), fortunately after realizing that the technically correct way to deal with them was not squeezing them into the PUA, but treating them as inline images, avoiding headaches down the line for the Unicode Consortium :-). Regards, Martin. From mpsuzuki at hiroshima-u.ac.jp Wed Apr 2 07:00:42 2014 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Wed, 02 Apr 2014 21:00:42 +0900 Subject: ["Unicode"] Re: FYI: More emoji from Chrome In-Reply-To: <533BF3DB.1010103@it.aoyama.ac.jp> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> <533BF3DB.1010103@it.aoyama.ac.jp> Message-ID: <533BFBEA.5040003@hiroshima-u.ac.jp> On 04/02/2014 08:26 PM, "Martin J. 
D?rst" wrote: > On 2014/04/02 20:08, Christopher Fynn wrote: >> On 02/04/2014, Asmus Freytag wrote: >>> On 4/2/2014 1:42 AM, Christopher Fynn wrote: >>>> Rather than Emoji it might be better if people learnt Han ideographs >>>> which are also compact (and a far more developed system of >>>> communication than emoji). One CJK character can also easily replace >>>> dozens of Latin characters - which is what is being claimed for emoji. >>> >>> One wonders why the Japanese, who already know Han ideographs, took to >>> emoji as they did.... >> >> Perhaps because emoji are a sort of playful version of a means of >> communication they are already used to > > Yes. Already used to the concept that a character can represent (more or less) a concept. Already used to the concept that there are lots of characters, and a few more won't make such a difference. Already used to the concept that character entry means keying a word or phrase and the selecting what you actually want. > > But I think the main reason for their spread was that the mobile phone companies introduced them and young people found them cute. > > In a followup, Line (http://line.me/en/), the most popular Japanese mobile message app (similar to WhatsApp) got popular mostly because of their gorgeous collection of 'stickers' (over 10,000), fortunately after realizing that the technically correct way to deal with them was not squeezing them into the PUA, but treating them as inline images, avoiding headaches down the line for the Unicode Consortium :-). Utilization of the words including rarely-used Han ideograph requests the deep knowledge about Chinese classics (except of the cases like "what is the most complex kanji?"). It is too hard for modern Japanese people who prefers video media than text media. I think the wide acceptance of new emojis and stickers (Japanese LINE users call as "stamp") by Japanese young people does not mean that they have something hard to express by existing characters or emoticons. 
Collecting them is something like an ambition to encode all comedy skits. Regards, mpsuzuki From asmusf at ix.netcom.com Wed Apr 2 07:02:54 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 02 Apr 2014 05:02:54 -0700 Subject: FYI: More emoji from Chrome In-Reply-To: <14FF721E-8589-498C-BA77-AB05F852048F@gluesoft.co.jp> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> <14FF721E-8589-498C-BA77-AB05F852048F@gluesoft.co.jp> Message-ID: <533BFC6E.3060308@ix.netcom.com> On 4/2/2014 4:05 AM, Koji Ishii wrote: > On Apr 2, 2014, at 7:19 PM, Asmus Freytag wrote: > >> On 4/2/2014 1:42 AM, Christopher Fynn wrote: >>> Rather than Emoji it might be better if people learnt Han ideographs >>> which are also compact (and a far more developed system of >>> communication than emoji). One CJK character can also easily replace >>> dozens of Latin characters - which is what is being claimed for emoji. >> One wonders why the Japanese, who already know Han ideographs, took to emoji as they did.... > All the ancient emoji characters we inherited from our ancestors were already turned into Han ideographs like this[1][2], so we needed new ones to add more Han ideographs in next centuries ;) You may be on to something :) > > [1] http://ameblo.jp/happy2525tkg/entry-11541848940.html > [2] http://ameblo.jp/happy2525tkg/entry-11578197418.html > > /koji > From theppitak at gmail.com Wed Apr 2 07:16:32 2014 From: theppitak at gmail.com (Theppitak Karoonboonyanan) Date: Wed, 2 Apr 2014 19:16:32 +0700 Subject: Unencoded Lao Characters In-Reply-To: <20140329223559.0965a007@JRWUBU2> References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>

<20140328091547.64b97a4f@JRWUBU2> <20140329223559.0965a007@JRWUBU2> Message-ID: On Sun, Mar 30, 2014 at 5:35 AM, Richard Wordingham wrote: > On Sat, 29 Mar 2014 11:10:52 +0700 > Theppitak Karoonboonyanan wrote, > under topic 'Pali in Thai Script': > >> On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham >> wrote: > >> > An older form of the Lao script is called the Thai Noi script. That >> > script has many of the characters needed. It has the characters, to >> > give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA, >> > DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA >> > and Vocalic R. The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be >> > due to their rarity, as with the lack of Vocalic L. >> >> I don't think so. From my studies so far, Tai Noi script (aka. Lao >> Buhan) writing system was not so different from that of contemporary >> Lao script. Some characters are just obsolete. >> >> In fact, I have been drafting a summarized proposal to encode Tai Noi >> script here: >> >> http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html > > That seems to be based on the analysis that the Tai Noi script is a > form of the Lao script. In that case, it ought to address GHA, NYA, > TTHA, NNA, DHA and BHA as seen in inscriptions, recorded for example in > the 1979 MA thesis of Thawaj Poonotoke (???? ???????) at > http://www.khamkoo.com/uploads/9/0/0/4/9004485/thai_noi_palaeography.pdf . I see. As said in the thesis, these Thai-borrowed characters were mostly used by the elites who were influenced by foreign states. That's why I don't find them in palm leaf documents which were inscribed by ordinary people, where the characters were simply borrowed from Tham script, not from (archaic) Thai when in use. And, as also said in the thesis, the official letters (Bai Jum) are not as abundant as palm leaves, and the author himself suggested that studying the writing system used in palm leaves were more useful. 
That's why most next-generation scholars, including those I consulted, do not mention the one used by the elites in their books at all. At least, they don't suggest it for contemporary use when the script is revived. Anyway, I think we should take the elites' writing system into account when we encode it. > The Buddhist Institute 'additions' should also be handled. There are > several fonts around that make presumptions about their encoding in > Unicode. I'm not convinced that the old Tai Noi and Buddhist Institute > forms of each of NYA and NNA are the same character - I suspect we may > have four characters here. The two versions of NYA are particularly > difficult to reconcile. Don't you think it's a matter of style, in the same manner that Lao Tham shares the same block with Lanna and Khun? > My thoughts on the subscript consonants are: > > 1) The Lao block already has two subscript consonants, U+0EBC LAO > SEMIVOWEL SIGN LO and U+0EBD LAO SEMIVOWEL SIGN NYO, though perhaps > the various forms of the latter need to be disunified. How does the > latter's J-shaped glyph kern? I'd rather leave the kerning to fonts (i.e. fonts for contemporary Lao and those for Tai Noi would kern differently). For the variations, I'm afraid it's a matter of style again. In case one insists on using different forms in the same document, I'm not sure how Variation Selectors fit? > 2) If we allow the Lao script to be split between planes, subscript > forms could be accommodated in an 'Archaic Lao' block in the SMP. This > would have the advantages that: > > (a) In UTF-8, a subscript consonant would only take 4 bytes, whereas > using a coeng in the BMP would require 6 bytes, 3 for the coeng and > 3 for the consonant identity. The memory requirement is 4 bytes for > both schemes in UTF-16. > > (b) Distinct subscripts for the same letter can easily be encoded > distinctly. 
For example, the Lao letters LO, DO and NO can easily be > taken to have two distinct subscript forms, and in the related Thai > Nithet script (?????????????), formerly used in Northern Thailand, one > can argue for four forms of the cluster HO MO - the ligature HO MO (as > LAO HO MO), and HO plus (i) a purely subscript MO (gc=Mn), (ii) > subscript MO with an ascender (gc=Mc), and (iii) a borrowing of Tai > Tham (gc=Mn if treated as a single character). What's the difference between HO plus (i) and HO plus (ii)? I think I haven't seen the former case yet. Yes, the supplement block can be a good alternative, as it can address different forms of subscripts more flexibly. Regards, -- Theppitak Karoonboonyanan http://linux.thai.net/~thep/ From wjgo_10009 at btinternet.com Wed Apr 2 08:01:40 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 2 Apr 2014 14:01:40 +0100 (BST) Subject: Transmission of emoji within plain text messages In-Reply-To: <533BF3DB.1010103@it.aoyama.ac.jp> References: <20140401164343.GA5003@powdermilk> <533B6A41.1080508@it.aoyama.ac.jp> <533BE41C.1050903@ix.netcom.com> <533BF3DB.1010103@it.aoyama.ac.jp> Message-ID: <1396443700.49390.YahooMailNeo@web87801.mail.ir2.yahoo.com> ""Martin J. D?rst"" wrote: > In a followup, Line (http://line.me/en/), the most popular Japanese mobile message app (similar to WhatsApp) got popular mostly because of their gorgeous collection of 'stickers' (over 10,000), fortunately after realizing that the technically correct way to deal with them was not squeezing them into the PUA, but treating them as inline images, avoiding headaches down the line for the Unicode Consortium :-). There is another possible way to proceed, namely to use markup bubbles for transmission and to decode them with a local OpenType colour font, where the glyphs of the decoded items are unmapped. 
I have successfully tested such a markup bubble technique in monochrome for nine-character markup bubbles, for a different project, using a font made using High-Logic FontCreator 7 and applied using Serif PagePlus X5. William Overington 2 April 2014 From James_Lin at symantec.com Wed Apr 2 12:00:08 2014 From: James_Lin at symantec.com (James Lin) Date: Wed, 2 Apr 2014 10:00:08 -0700 Subject: Emoji In-Reply-To: References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: Emoji or 顔文字, literally means Face word or Face Characters, essentially, provides an emotional state in the context of words. Emoji is very popular in APJ, and especially in Japan where most of your text will contain at least half a dozen Emoji characters. Remember, people in Japan spend more than half of their commute in the train, and no talk on the cellphone in the train, so most people text instead. Everyone can guess what are the following emoji that are used frequently in Japan: ?(???;)? - worried ?(??????? - happy ?(#`??)? - angry ??_???- confused there is a lot more... On 4/1/14, 11:46 PM, "Christopher Fynn" wrote: >On 02/04/2014, William_J_G Overington wrote: >> For me, an important aspect of emoji is that they are independent of >> language. > >Emoji seem fairly culturally specific. (Maybe the mobile-phone >messaging culture.) Kind of shorthand expressions which may be used >with several languages - but not independent of language. I suspect >some of them already convey one thing to a Japanese teenager and quite >another to an American. And if you showed these symbols to many people >in other countries they wouldn't have a clue as to what they are >supposed to mean. 
>_______________________________________________ >Unicode mailing list >Unicode at unicode.org >http://unicode.org/mailman/listinfo/unicode From nospam-abuse at ilyaz.org Wed Apr 2 13:27:23 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 11:27:23 -0700 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? Message-ID: <20140402182723.GA8350@powdermilk> Current (and 7.0.0-tobe) versions do not say much: 23AF HORIZONTAL LINE EXTENSION * used for extension of arrows x (vertical line extension - 23D0) If it is intended to be a variation selector (possibly prepended instead of appended!), then using it with ? should give longer double arrow, and using it with ? should give a longer variant of ?. If it is a glyph, then what is the difference with U+2500 ? ? It looks like then any vertical-positioning distinction is trivially understood from context? Any thoughts? Thanks, Ilya From nospam-abuse at ilyaz.org Wed Apr 2 13:36:45 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 11:36:45 -0700 Subject: HORIZONTAL SCAN LINEs Message-ID: <20140402183645.GA8366@powdermilk> The current version (and 7.0.0-tobe) describe them as: @ Scan lines for terminal graphics @+ The scan line numbers here refer to old, low-resolution technology for terminals, with only 9 scan lines per fixed-size character glyph. Even-numbered scan lines are unified with box-drawing graphics. 23BA HORIZONTAL SCAN LINE-1 23BB HORIZONTAL SCAN LINE-3 23BC HORIZONTAL SCAN LINE-7 23BD HORIZONTAL SCAN LINE-9 Is not it a complete BS? Was not this intended to be written similar to: Glyphs for even-numbered scan lines were never defined. The 5th scan line is unified with U+2500 (from box-drawing graphic). Please note that even well-researched fonts (like Symbola) treat them wrongly. Should not another comment (or two) be better added: Line-1 was the top line of the character box, and Line-9 was at the bottom. 
These characters were intended to connect on the sides, and with 23B8 LEFT VERTICAL BOX LINE 23B9 RIGHT VERTICAL BOX LINE Thanks, Ilya P.S. The references I found are http://lists.freedesktop.org/archives/wayland-devel/2012-May/003687.html http://invisible-island.net/vttest/images/VTTEST-VT100%20character%20sets.png illustrating http://invisible-island.net/vttest/ http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0037.html 5-9 in ftp://kermit.columbia.edu/kermit/ucsterminal/ucsterminal.txt http://paulbourke.net/dataformats/ascii/ (search for: control-backspace) From nospam-abuse at ilyaz.org Wed Apr 2 13:52:53 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 11:52:53 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <20140402185253.GA8489@powdermilk> On Wed, Apr 02, 2014 at 10:00:08AM -0700, James Lin wrote: > Everyone can guess what are the following emoji that used frequently in > Japan: What makes you think so? I would not have a slightest clue what the intended meaning is? > ?(???;)? - worried [I removed the rest since they crash the Web interface to the list anyway: http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0039.html ] Ilya From ken.whistler at sap.com Wed Apr 2 13:56:54 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 2 Apr 2014 18:56:54 +0000 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? In-Reply-To: <20140402182723.GA8350@powdermilk> References: <20140402182723.GA8350@powdermilk> Message-ID: Ilya, U+23AF is *definitely* not a variation selector at all. It is part of a set of bracket pieces (and other graphic pieces) in the range U+239B..U+23B1. See discussion of the topic at: http://www.unicode.org/forum/viewtopic.php?f=35&t=206 See also Section 2.13 of UTR #25: http://www.unicode.org/reports/tr25/ which discusses the use of these symbol pieces. 
It does not specifically talk about the arrow extender pieces, focusing instead on the bracket pieces, but the principles are the same. These glyphic pieces of symbols are only relevant and useful in the context of mathematical typesetting programs like TeX. The set of box drawing characters in the U+2500 block were encoded for compatibility with old character sets that did character cell graphics. So the two are different, but neither set is of much current relevance for general text currently using arrows. --Ken > Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? > > Current (and 7.0.0-tobe) versions do not say much: > > 23AF HORIZONTAL LINE EXTENSION > * used for extension of arrows > x (vertical line extension - 23D0) > > If it is intended to be a variation selector (possibly prepended > instead of appended!), then using it with ? should give longer double > arrow, and using it with ? should give a longer variant of ?. > > If it is a glyph, then what is the difference with U+2500 ? ? It > looks like then any vertical-positioning distinction is trivially > understood from context? > > Any thoughts? Thanks, > Ilya From nospam-abuse at ilyaz.org Wed Apr 2 14:35:28 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 12:35:28 -0700 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? In-Reply-To: References: <20140402182723.GA8350@powdermilk> Message-ID: <20140402193528.GA8861@powdermilk> On Wed, Apr 02, 2014 at 06:56:54PM +0000, Whistler, Ken wrote: > Ilya, > > U+23AF is *definitely* not a variation selector at all. > > It is part of a set of bracket pieces (and other graphic pieces) > in the range U+239B..U+23B1. Evidence does not support this (see below). > See discussion of the topic at: > > http://www.unicode.org/forum/viewtopic.php?f=35&t=206 Apparently, this has no relationship to U+23af at all? > See also Section 2.13 of UTR #25: > > http://www.unicode.org/reports/tr25/ Likewise? 
> which discusses the use of these symbol pieces. It does not > specifically talk about the arrow extender pieces, focusing > instead on the bracket pieces, but the principles are the same. AFAIU, they are not. > These glyphic pieces of symbols are only relevant and useful > in the context of mathematical typesetting programs like TeX. Are you sure? Did you look at Figure 6 in Appendix F of the TeXbook? • There are no horizontal extension pieces; • The vertical extension pieces consist of very short chunks (to make size tunable in small increments). The horizontal extension of arrows in TeX is done by macros like \longleftarrow and \Longleftarrow (Appendix B, or just look in plain.tex). However, these macros (again!) have no relation to U+23AF, since they need DIFFERENT extension pieces for single/double/triple/etc arrows. ======================================================= In short: if U+23AF were part of an extensible set, it would be short (never saw it like this) AND would have double/triple/etc counterparts. > The set of box drawing characters in the U+2500 block were > encoded for compatibility with old character sets that did > character cell graphics. > > So the two are different, but neither set is of much current > relevance for general text currently using arrows. I won't be so sure. I use extended arrows in "general text" mode; they are not shown reliably in ANY environment I know. I'd love to know a solution. Thanks, Ilya From jkorpela at cs.tut.fi Wed Apr 2 14:39:10 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Wed, 02 Apr 2014 22:39:10 +0300 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? In-Reply-To: References: <20140402182723.GA8350@powdermilk> Message-ID: <533C675E.6040805@cs.tut.fi> 2014-04-02 21:56, Whistler, Ken wrote: > U+23AF is *definitely* not a variation selector at all. > > It is part of a set of bracket pieces (and other graphic pieces) > in the range U+239B..U+23B1. [...] 
> These glyphic pieces of symbols are only relevant and useful > in the context of mathematical typesetting programs like TeX. I'm not sure whether TeX uses such characters at all. TeX is oriented towards typesetting glyphs, often not caring that much about abstract characters. When I use, say, $$\begin{pmatrix}...\end{pmatrix} in LaTeX to get a nicely formatted array with large parentheses around, I don't think LaTeX internally uses characters like U+239B. On the other hand, such characters can be used in very primitive "typesetting" in a plain text environment under some conditions. For example, to create a largish left parenthesis I could use U+239B U+239C ... U+239C U+239D each at the start of a new line: ⎛ ⎜ ⎜ ⎝ This won't work on everyone's email reader, of course. It works in Notepad, for example. On a web page, it works when you set the text solid, with line-height: 1. Of course, there would be the issue of font coverage, but I don't see any particular reason why such characters could not be used in plain text, in word processors, in HTML documents, apart from the practical point that there are usually better alternatives. U+23AF is a simpler building block, but it has its problems, too. Despite the purpose mentioned in a comment in the standard, there is no guarantee that it joins smoothly with adjacent simple arrows. But of course it is a graphic character, and one that can be expected to have a rather specific shape. It's not something abstract that says that some arrow should be extended; rather, it can be used as an extension. 
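The stacking of bracket pieces described above can be sketched in a few lines of Python. This is only an illustration, not anything TeX or a real typesetter does: the function names `tall_left_paren` and `bracket_rows` are hypothetical, and whether the pieces join up on screen still depends on the font and line spacing, as noted in the message.

```python
import unicodedata

# The three code points named in the message above.
UPPER_HOOK = "\u239B"  # LEFT PARENTHESIS UPPER HOOK
EXTENSION = "\u239C"   # LEFT PARENTHESIS EXTENSION
LOWER_HOOK = "\u239D"  # LEFT PARENTHESIS LOWER HOOK


def tall_left_paren(rows):
    """Return one parenthesis piece per text row (hypothetical helper)."""
    if rows < 2:
        raise ValueError("need at least two rows: upper hook and lower hook")
    return [UPPER_HOOK] + [EXTENSION] * (rows - 2) + [LOWER_HOOK]


def bracket_rows(rows_of_text):
    """Prefix each row of plain text with the matching parenthesis piece."""
    pieces = tall_left_paren(len(rows_of_text))
    return [piece + " " + text for piece, text in zip(pieces, rows_of_text)]


if __name__ == "__main__":
    # Print a 3-row example, then the code point and name of each piece.
    for line in bracket_rows(["1 0 0", "0 1 0", "0 0 1"]):
        print(line)
    for piece in tall_left_paren(3):
        print(f"U+{ord(piece):04X}", unicodedata.name(piece))
```

Each piece goes at the start of its own line, so the result is plain text with no markup; only the middle piece repeats, which is why any height of two rows or more can be built from the same three characters.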
Yucca From doug at ewellic.org Wed Apr 2 14:44:46 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 02 Apr 2014 12:44:46 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> Ilya Zakharevich wrote: > [I removed the rest since they crash the Web interface to the list > anyway: > http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0039.html > ] I didn't have any trouble viewing James's examples from the Web interface, although of course the private-use characters showed up as dots instead of whatever they were supposed to be. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From nospam-abuse at ilyaz.org Wed Apr 2 15:00:45 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Wed, 2 Apr 2014 13:00:45 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> References: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> Message-ID: <20140402200045.GA9154@powdermilk> On Wed, Apr 02, 2014 at 12:44:46PM -0700, Doug Ewell wrote: > Ilya Zakharevich wrote: > > > [I removed the rest since they crash the Web interface to the list > > anyway: > > http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0039.html > > ] > > I didn't have any trouble viewing James's examples from the Web > interface, Yes, I double-checked with wget, and it can retrieve the page fine. So the problem is in Firefox (it shows even the HTML source truncated?). > although of course the private-use characters showed up as > dots instead of whatever they were supposed to be. Private-use?! The page is in iso-2022-jp! Anyway, why would private-use show as dots in browsers? 
If it is defined somewhere in the places the browser looks at, it will be shown THAT way; otherwise the browser's last-resort would hit (which I never saw to be dots; HEX in Firefox; "non-squares" ;-] in Chrome). Ilya From rick at unicode.org Wed Apr 2 15:24:26 2014 From: rick at unicode.org (Rick McGowan) Date: Wed, 02 Apr 2014 13:24:26 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] In-Reply-To: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> References: <20140402124446.665a7a7059d7ee80bb4d670165c8327d.8654bc4039.wbe@email03.secureserver.net> Message-ID: <533C71FA.2080700@unicode.org> Also, fwiw, the new Mailman archives using Pipermail seem to do better than the "legacy" archives: http://unicode.org/pipermail/unicode/2014-April/000382.html On 4/2/2014 12:44 PM, Doug Ewell wrote: > I didn't have any trouble viewing James's examples from the Web > interface, although of course the private-use characters showed up as > dots instead of whatever they were supposed to be. From richard.wordingham at ntlworld.com Wed Apr 2 15:29:37 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Apr 2014 21:29:37 +0100 Subject: Unencoded Lao Characters In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>

<20140328091547.64b97a4f@JRWUBU2> <20140329223559.0965a007@JRWUBU2> Message-ID: <20140402212937.1543e9e2@JRWUBU2> On Wed, 2 Apr 2014 19:16:32 +0700 Theppitak Karoonboonyanan wrote: > On Sun, Mar 30, 2014 at 5:35 AM, Richard Wordingham > wrote: > > On Sat, 29 Mar 2014 11:10:52 +0700 > > In that case, it ought to address GHA, NYA, > > TTHA, NNA, DHA and BHA as seen in inscriptions, recorded for > > example in the 1979 MA thesis of Thawaj Poonotoke (???? ???????) at > > http://www.khamkoo.com/uploads/9/0/0/4/9004485/thai_noi_palaeography.pdf . > I see. As said in the thesis, these Thai-borrowed characters were > mostly used by the elites who were influenced by foreign states. Are they any more borrowed than the rest of the alphabet? > > I'm not convinced that the old Tai Noi and > > Buddhist Institute forms of each of NYA and NNA are the same > > character - I suspect we may have four characters here. The two > > versions of NYA are particularly difficult to reconcile. > Don't you think it's a matter of style, in the same manner that Lao > Tham share the same block with Lanna and Khun? Perhaps it will work. It's tidier if it does. > > 1) The Lao block already has two subscript consonants, U+0EBC LAO > > SEMIVOWEL SIGN LO and U+0EBD LAO SEMIVOWEL SIGN NYO, though perhaps > > the various forms of the latter need to disunified. How does the > > latter's J-shaped glyph kern? > I'd rather leave the kerning to fonts (i.e. fonts for contemporary > Lao and those for Tai Noi would kern differently). For the > variations, I'm afraid it's a matter of style again. My worry here is with the Khmu usage of the J-shaped glyph. Khmu uses U+0EBD as an initial consonant. If it is kerned in Khmu usage, then there is not a problem. > > ... 
in the > > related Thai Nithet script (?????????????), formerly used in > > Northern Thailand, one can argue for four forms of the cluster HO > > MO - the ligature HO MO (as LAO HO MO), and HO plus (i) a purely > > subscript MO (gc=Mn), (ii) subscript MO with an ascender (gc=Mc), > > and (iii) a borrowing of Tai Tham (gc=Mn if treated as > > a single character). > > What's the difference between HO plus (i) and HO plus (ii)? > I think I haven't seen the former case yet. It's the same as the difference between U+1A5E TAI THAM CONSONANT SIGN SA and or between U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA and . Richard. From doug at ewellic.org Wed Apr 2 15:31:04 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 02 Apr 2014 13:31:04 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <20140402133104.665a7a7059d7ee80bb4d670165c8327d.08a5508c35.wbe@email03.secureserver.net> Ilya Zakharevich wrote: >> although of course the private-use characters showed up as >> dots instead of whatever they were supposed to be. > > Private-use?! The page is in iso-2022-jp! Anyway, why would > private-use show as dots in browsers? If it is defined somewhere in > the places the browser looks at, it will be shown THAT way; otherwise > the browser?s last-resort would hit (which I never saw to be dots; HEX > in Firefox; ?non-squares? ;-] in Chrome). Sorry, this was my mistake. IE8 on Windows 7 displayed James's "angry" line like this: ?(????????????????????????????(#`??)? - angry The only "private-use" character was something that got transcoded to U+E559, and IE8 displayed that as a space, not a dot. But a quick look at the ISO-2022-JP source shows this isn't right at all. So I guess I did have trouble viewing it, maybe not a crash, but severe mojibake. 
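As a rough illustration of how an ISO-2022-JP page can turn into mojibake, here is a minimal Python sketch. It does not reproduce what IE8 or Firefox actually did with the archive page; the sample string and the helper names `encode_as_archive` and `decode_as` are assumptions made up for the example. It only shows that the same bytes read under the wrong charset label come out as escape sequences and unrelated ASCII.

```python
# Sample text chosen for illustration; it is not taken from the archived page.
TEXT = "絵文字 (emoji), 顔文字 (kaomoji)"


def encode_as_archive(text):
    """Encode text the way an ISO-2022-JP page would store it (hypothetical helper)."""
    return text.encode("iso2022_jp")


def decode_as(data, charset):
    """Decode bytes under the given charset label, replacing undecodable bytes."""
    return data.decode(charset, errors="replace")


raw = encode_as_archive(TEXT)
good = decode_as(raw, "iso2022_jp")  # correct charset: the text round-trips
bad = decode_as(raw, "latin-1")      # wrong charset: ESC sequences leak through

if __name__ == "__main__":
    print(raw)
    print(good)
    print(repr(bad))
```

ISO-2022-JP is a 7-bit encoding that switches character sets with escape sequences, so a decoder that ignores the label sees nothing but ESC bytes and printable ASCII, which is one way a page can look badly garbled without any byte being invalid.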
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From richard.wordingham at ntlworld.com Wed Apr 2 15:46:51 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Apr 2014 21:46:51 +0100 Subject: Bidi reordering of soft hyphen In-Reply-To: <533BE3BF.2010502@ix.netcom.com> References: <20140402000257.32dd544d@JRWUBU2> <20140402083624.24b8da63@JRWUBU2> <533BE3BF.2010502@ix.netcom.com> Message-ID: <20140402214651.5eef9b17@JRWUBU2> On Wed, 02 Apr 2014 03:17:35 -0700 Asmus Freytag wrote: > On 4/2/2014 12:36 AM, Richard Wordingham wrote: > > But it is a *resolution* rule that converts the true hyphen or minus > > sign to Bidi Class L; these apply before the scope reduces from > > paragraph to line. > When breaking a line at a soft hyphen, one is essentially modifying > the text around the line break for display, because the SHY is not > specific as to what should happen (as was the case with German old > orthography, the changes go beyond simple substitution of a hyphen). > > When you change the text, you have to fix up the resolution. The argument was based on what happened to U+002D HYPHEN-MINUS. The change to the text then is to replace what is, in code order, 'CARROT IS carrot...' by 'CARROT IS carrot...'. One can even argue that this replacement would result from SHY under the rules of English typography. Reapplying the resolution rules, the left-to-right run now includes, even after truncation, 'car'. Richard. From ken.whistler at sap.com Wed Apr 2 16:08:44 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 2 Apr 2014 21:08:44 +0000 Subject: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector? In-Reply-To: <533C675E.6040805@cs.tut.fi> References: <20140402182723.GA8350@powdermilk> <533C675E.6040805@cs.tut.fi> Message-ID: Yucca noted: > > These glyphic pieces of symbols are only relevant and useful > > in the context of mathematical typesetting programs like TeX. 
> > I?m not sure whether TeX uses such characters at all. TeX is oriented > towards typesetting glyphs, often not caring that much about abstract > characters. Yes, I'm not claiming there is any actual use of U+23AF in TeX per se. The original source of U+23AF, by the way, was as a compatibility mapping character for the PostScript "arrowhorizex". See octal 276 in the symbol font encoding for PostScript. The original proposal for the set of these can be seen in L2/99-346 in the UTC document register. The whole set of these pieces was included into the repertoire of mathematical symbols under discussion at the time, and was eventually published as part of Unicode 3.2. --Ken From mark at kli.org Wed Apr 2 18:39:39 2014 From: mark at kli.org (Mark E. Shoulson) Date: Wed, 02 Apr 2014 19:39:39 -0400 Subject: Emoji In-Reply-To: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <533C9FBB.2020401@kli.org> Relatedly, see https://www.kwikpoint.com/ Basically, little (laminated) booklets of pictures of hopefully understandable items and concepts you can point at in foreign countries. ~mark On 04/02/2014 02:29 AM, William_J_G Overington wrote: > For me, an important aspect of emoji is that they are independent of language. > > They can localize in the mind of the reader. > > How can they express verbs such as need and must; and pronouns? > > How can they express thanks? > > William Overington > > 2 April 2014 > > > > ----- Original Message ----- > From: Christopher Fynn > To: Unicode List > Cc: Nicole Selken > Sent: Wednesday, 2 April 2014, 7:04 > Subject: Re: Emoji > > On 02/04/2014, Nicole Selken wrote: > >> I think Emoji is totally beneficial as a communication form. > A reversion to a crude form of Hieroglyphics? 
From duerst at it.aoyama.ac.jp Wed Apr 2 20:13:51 2014
From: duerst at it.aoyama.ac.jp ("Martin J. Dürst")
Date: Thu, 03 Apr 2014 10:13:51 +0900
Subject: Emoji
In-Reply-To:
References: <1396420155.35962.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID: <533CB5CF.9010109@it.aoyama.ac.jp>

On 2014/04/03 02:00, James Lin wrote:
> Emoji or 顔文字, literally means Face word or Face Characters, essentially,

Emoji is 絵文字 (picture character), 顔文字 is kaomoji (face character).

Regards, Martin.

> provides an emotional state in the context of words. Emoji is very
> popular in APJ, and specially in Japan where most of your text will
> contain at least half dozen Emoji characters. Remember, people in Japan
> spend more than half of their commute in the train, and no talk on the
> cellphone in the train, so most people text instead.
>
> Everyone can guess what are the following emoji that used frequently in
> Japan:
>
> ?(???;)? - worried
>
>
>
> ?(??????? - happy
>
> ?(#`??)? - angry
>
> ??_???- confused
>
>
> there is a lot more...

From verdy_p at wanadoo.fr Wed Apr 2 21:51:26 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 3 Apr 2014 04:51:26 +0200
Subject: Emoji [And crash in the Web interface to the mailing list]
In-Reply-To: <20140402133104.665a7a7059d7ee80bb4d670165c8327d.08a5508c35.wbe@email03.secureserver.net>
References: <20140402133104.665a7a7059d7ee80bb4d670165c8327d.08a5508c35.wbe@email03.secureserver.net>
Message-ID:

There was no such browser bug in my Google Chrome install (in Gmail), which rendered the full string correctly (no dots, all characters displayed properly).

So this looks like a Firefox bug.
There's a rendering problem in IE, but no such critical bug that breaks the rest of the page. It looks like a problem of transcoding of emails by browsers (in Gmail, the transcoding to UTF-8 is performed apparently by the web server, so Firefox does not break when rendering the Gmail page). It is posible that what is really broken is in fact another webmail interface that incorrectly transcodes the email to UTF-8. I cannot know if the broken rendering was performed on everyone's client, if he use a webmail or standalone email agent. Many webmail services are broken in how they transcode the mails received in order to embed its content in a UTF-8 web page. 2014-04-02 22:31 GMT+02:00 Doug Ewell : > > Sorry, this was my mistake. IE8 on Windows 7 displayed James's "angry" > line like this: > > ?(????????????????????????????(#`??)? > - angry > > The only "private-use" character was something that got transcoded to > U+E559, and IE8 displayed that as a space, not a dot. But a quick look > at the ISO-2022-JP source shows this isn't right at all. So I guess I > did have trouble viewing it, maybe not a crash, but severe mojibake. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Apr 2 22:18:21 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 2 Apr 2014 21:18:21 -0600 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <201404030319.s333IgoV022247@unicode.org> It's really quite simple: Sending e-mails in ISO-2022-JP to the Unicode mailing list causes problems. ?? 
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell -----Original Message----- From: "Philippe Verdy" Sent: ?4/?2/?2014 20:51 To: "Doug Ewell" Cc: "Ilya Zakharevich" ; "Unicode Mailing List" Subject: Re: Emoji [And crash in the Web interface to the mailing list] There was no such browser bug in my Chrome install GChrome (in Gmail)which rendered the full string correctly (no dots, all characters displayed properly). So this looks like a Firefox bug. There's a rendering problem in IE, but no such critical bug that breaks the rest of the page. It looks like a problem of transcoding of emails by browsers (in Gmail, the transcoding to UTF-8 is performed apparently by the web server, so Firefox does not break when rendering the Gmail page). It is posible that what is really broken is in fact another webmail interface that incorrectly transcodes the email to UTF-8. I cannot know if the broken rendering was performed on everyone's client, if he use a webmail or standalone email agent. Many webmail services are broken in how they transcode the mails received in order to embed its content in a UTF-8 web page. 2014-04-02 22:31 GMT+02:00 Doug Ewell : > > Sorry, this was my mistake. IE8 on Windows 7 displayed James's "angry" > line like this: > > ?(????????????????????????????(#`??)? > - angry > > The only "private-use" character was something that got transcoded to > U+E559, and IE8 displayed that as a space, not a dot. But a quick look > at the ISO-2022-JP source shows this isn't right at all. So I guess I > did have trouble viewing it, maybe not a crash, but severe mojibake. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From verdy_p at wanadoo.fr Thu Apr 3 02:25:15 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 3 Apr 2014 09:25:15 +0200
Subject: Emoji [And crash in the Web interface to the mailing list]
In-Reply-To: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net>
References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net>
Message-ID:

But then why did I see these emoticons (using "ASCII art" with additional CJK characters) correctly from the same mailing list? I did not see any dots, or squares for PUA, but the stars, triangle and some kana.

I don't think that the mailing list itself is broken (or maybe Gmail corrected things afterwards).

You cannot always predict the encoding used by the actual sending mail agent (or the first relaying agent). Even Gmail will try to pick the "best" (popular) legacy 8-bit encoding according to the content and the recipient (where it will try to geolocate the target domain name, or use its own knowledge of the languages and encodings most often used by senders in that domain), instead of always sending UTF-8, when the user's settings are left at a "default" encoding and the user has not specified that mail should be forced to UTF-8.

2014-04-03 5:18 GMT+02:00 Doug Ewell :
> It's really quite simple: Sending e-mails in ISO-2022-JP to the Unicode
> mailing list causes problems.
>
> ??
>
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
> ------------------------------
> From: Philippe Verdy
> Sent: 4/2/2014 20:51
> To: Doug Ewell
> Cc: Ilya Zakharevich ; Unicode Mailing List
> Subject: Re: Emoji [And crash in the Web interface to the mailing list]
>
> There was no such browser bug in my Chrome install GChrome (in Gmail)which
> rendered the full string correctly (no dots, all characters displayed
> properly).
>
> So this looks like a Firefox bug. There's a rendering problem in IE, but
> no such critical bug that breaks the rest of the page.
It looks like a > problem of transcoding of emails by browsers (in Gmail, the transcoding to > UTF-8 is performed apparently by the web server, so Firefox does not break > when rendering the Gmail page). > > It is posible that what is really broken is in fact another webmail > interface that incorrectly transcodes the email to UTF-8. I cannot know if > the broken rendering was performed on everyone's client, if he use a > webmail or standalone email agent. Many webmail services are broken in how > they transcode the mails received in order to embed its content in a UTF-8 > web page. > > 2014-04-02 22:31 GMT+02:00 Doug Ewell : >> >> Sorry, this was my mistake. IE8 on Windows 7 displayed James's "angry" >> line like this: >> >> ?(????????????????????????????(#`??)? >> - angry >> >> The only "private-use" character was something that got transcoded to >> U+E559, and IE8 displayed that as a space, not a dot. But a quick look >> at the ISO-2022-JP source shows this isn't right at all. So I guess I >> did have trouble viewing it, maybe not a crash, but severe mojibake. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Thu Apr 3 14:26:24 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 03 Apr 2014 12:26:24 -0700 Subject: Emoji [And crash in the Web interface to the mailing list] Message-ID: <20140403122624.665a7a7059d7ee80bb4d670165c8327d.b5256a3b70.wbe@email03.secureserver.net> Philippe Verdy wrote: >> It's really quite simple: Sending e-mails in ISO-2022-JP to the >> Unicode mailing list causes problems. >> >> ?? > > But then why did I these emoticons (using "ASCII art" with additional > CJK characters) correctly from the same mailing list ? > I did not see any dot, or squares for PUA, but the stars, triangle and > some kanas. It was meant as a small joke. It's the *Unicode* mailing list. 
--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From verdy_p at wanadoo.fr Thu Apr 3 15:53:37 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 3 Apr 2014 22:53:37 +0200
Subject: Emoji [And crash in the Web interface to the mailing list]
In-Reply-To: <20140403122624.665a7a7059d7ee80bb4d670165c8327d.b5256a3b70.wbe@email03.secureserver.net>
References: <20140403122624.665a7a7059d7ee80bb4d670165c8327d.b5256a3b70.wbe@email03.secureserver.net>
Message-ID:

Yes, I know, and I noticed it immediately (look at my first message), but we are already discussing something other than Google's joke.

2014-04-03 21:26 GMT+02:00 Doug Ewell :
> Philippe Verdy wrote:
>
> >> It's really quite simple: Sending e-mails in ISO-2022-JP to the
> >> Unicode mailing list causes problems.
> >>
> >> ??
> >
> > But then why did I these emoticons (using "ASCII art" with additional
> > CJK characters) correctly from the same mailing list ?
> > I did not see any dot, or squares for PUA, but the stars, triangle and
> > some kanas.
>
> It was meant as a small joke. It's the *Unicode* mailing list.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell

From buck at yelp.com Thu Apr 3 18:16:39 2014
From: buck at yelp.com (Buck Golemon)
Date: Thu, 3 Apr 2014 16:16:39 -0700
Subject: Emoji [And crash in the Web interface to the mailing list]
In-Reply-To:
References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net>
Message-ID:

I too received the intended emoji via direct email, but I see the garbled characters in the web interface:

?(???;)? - worried

?(??????? - happy

?(#`??)? - angry

??_???- confused

I believe there is an encoding issue somewhere in the unicode.org/mail-arch toolchain.
On Thu, Apr 3, 2014 at 12:25 AM, Philippe Verdy wrote: > But then why did I these emoticons (using "ASCII art" with additional CJK > characters) correctly from the same mailing list ? > I did not see any dot, or squares for PUA, but the stars, triangle and > some kanas. > > I don't think that the mailing list itself is broken (or may be Gmail > corrected things after). > > You cannot always predict the encoding used by the effective sending > mailing agent (or the first reaying agent). Even Gmail will try to fit the > "best" (popular) legacy 8-bit encoding according to content and the > recipient (where it will try to geolocalize the target domain name, or use > its own knowledge of the languages and encodings most often used by senders > in that domain), instead of always sending with UTF-8, when the default > user settings are for using a "default" encoding if the user had not > specified that the mail would be forced to UTF-8. > > > > > 2014-04-03 5:18 GMT+02:00 Doug Ewell : > > It's really quite simple: Sending e-mails in ISO-2022-JP to the Unicode >> mailing list causes problems. >> >> ?? >> >> >> -- >> Doug Ewell | Thornton, CO, USA >> http://ewellic.org | @DougEwell >> ------------------------------ >> From: Philippe Verdy >> Sent: ?4/?2/?2014 20:51 >> To: Doug Ewell >> Cc: Ilya Zakharevich ; Unicode Mailing List >> Subject: Re: Emoji [And crash in the Web interface to the mailing list] >> >> There was no such browser bug in my Chrome install GChrome (in >> Gmail)which rendered the full string correctly (no dots, all characters >> displayed properly). >> >> So this looks like a Firefox bug. There's a rendering problem in IE, but >> no such critical bug that breaks the rest of the page. It looks like a >> problem of transcoding of emails by browsers (in Gmail, the transcoding to >> UTF-8 is performed apparently by the web server, so Firefox does not break >> when rendering the Gmail page). 
>> >> It is posible that what is really broken is in fact another webmail >> interface that incorrectly transcodes the email to UTF-8. I cannot know if >> the broken rendering was performed on everyone's client, if he use a >> webmail or standalone email agent. Many webmail services are broken in how >> they transcode the mails received in order to embed its content in a UTF-8 >> web page. >> >> 2014-04-02 22:31 GMT+02:00 Doug Ewell : >>> >>> Sorry, this was my mistake. IE8 on Windows 7 displayed James's "angry" >>> line like this: >>> >>> ?(????????????????????????????(#`??)? >>> - angry >>> >>> The only "private-use" character was something that got transcoded to >>> U+E559, and IE8 displayed that as a space, not a dot. But a quick look >>> at the ISO-2022-JP source shows this isn't right at all. So I guess I >>> did have trouble viewing it, maybe not a crash, but severe mojibake. >> >> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From naenaguru at gmail.com Thu Apr 3 21:52:48 2014 From: naenaguru at gmail.com (Naena Guru) Date: Thu, 3 Apr 2014 21:52:48 -0500 Subject: Singhala scirpt ill defined by OpenType standard Message-ID: Here is the proof that OpenType standard defined the Singhala script wrongly. Also find a BNF grammar that describes it. http://ahangama.com/unicode/index.htm Thanks. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From kojiishi at gluesoft.co.jp Thu Apr 3 23:48:52 2014
From: kojiishi at gluesoft.co.jp (Koji Ishii)
Date: Fri, 4 Apr 2014 04:48:52 +0000
Subject: Emoji [And crash in the Web interface to the mailing list]
In-Reply-To:
References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net>
Message-ID: <76A14357-762B-4A06-BD1E-68CB3C8C452D@gluesoft.co.jp>

Go to Encoding menu and choose UTF-8 to fix the garbled characters.

It looks like the page is served in UTF-8, but it declares itself as us-ascii:

and

/koji

On Apr 4, 2014, at 8:16 AM, Buck Golemon wrote:

I too received the intended emoji via direct email but I see the garbled characters in the web interface:

?(???;)? - worried

?(??????? - happy

?(#`??)? - angry

??_???- confused

I believe there is an encoding issue somewhere in the unicode.org/mail-arch toolchain.

From verdy_p at wanadoo.fr Fri Apr 4 00:45:08 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 4 Apr 2014 07:45:08 +0200
Subject: Emoji [And crash in the Web interface to the mailing list]
In-Reply-To: <76A14357-762B-4A06-BD1E-68CB3C8C452D@gluesoft.co.jp>
References: <81a90d39549afeec4ea1725d525c3bf9@mwinf5c28.me-wanadoo.net> <76A14357-762B-4A06-BD1E-68CB3C8C452D@gluesoft.co.jp>
Message-ID:

The content is transferred as UTF-8 at the MIME level for both the plain-text and HTML parts attached:

--_000_76A14357762B4A06BD1E68CB3C8C452Dgluesoftcojp_
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
...
--_000_76A14357762B4A06BD1E68CB3C8C452Dgluesoftcojp_
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: base64
...

Normally the XML declaration or meta tag in the (X)HTML headers should be ignored; mail agents will not transform the attachments, except possibly by changing the content-transfer-encoding (here base64 in both parts).
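The precedence described here (MIME part charset over any declaration inside the HTML) can be sketched with Python's standard `email` package. This is only an illustration under stated assumptions: the sample text, the subject line, and the `<meta charset="us-ascii">` tag are made up to mimic the situation, and the code says nothing about how any particular mail agent or the list archive actually behaves.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

text = "\u2605 \u25b3 \u30c4"  # star, triangle, katakana: sample content only

# Build a two-part message like the one described: both parts UTF-8 + base64,
# with an HTML <meta> tag that wrongly claims us-ascii (hypothetical markup).
msg = EmailMessage()
msg["Subject"] = "charset test"
msg.set_content(text, charset="utf-8", cte="base64")
html = '<html><head><meta charset="us-ascii"></head><body>' + text + "</body></html>"
msg.add_alternative(html, subtype="html", charset="utf-8", cte="base64")

# Serialize and re-parse, as a receiving agent would.
parsed = message_from_bytes(msg.as_bytes(), policy=policy.default)

decoded = {}
for part in parsed.walk():
    if part.is_multipart():
        continue
    payload = part.get_payload(decode=True)         # undo base64
    charset = part.get_content_charset("us-ascii")  # charset from the MIME header
    decoded[part.get_content_type()] = payload.decode(charset)
    if part.get_content_type() == "text/html":
        # What a reader would get by trusting the <meta> tag instead.
        wrong = payload.decode("ascii", errors="replace")

print(decoded["text/plain"])
print(wrong)
```

Decoding with the charset from the `Content-Type` header recovers the sample characters in both parts, while decoding the same HTML bytes as ASCII yields replacement characters in their place.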
If your mail agent does not process the HTML part, it will render the plain-text version, which has no declaration at all; the MIME content-type will then be the only indication.

I don't know how the sending email agent could generate "us-ascii" in XHTML headers, but in fact it should simply be discarded in all cases (in HTML5, us-ascii or iso-8859-1 and their aliases are normally all treated like windows-1252, and "us-ascii" is simply ignored; it is buggy by itself in almost all cases). But here we are not in an HTML5 case, so once the HTML headers are discarded, the next candidate is the MIME part declaration (the transport layer). As it specifies UTF-8, this should work without forcing the reading email agent to start using its "encoding guessing" magic.

But even in this case, UTF-8 is certainly a better guess than legacy Japanese charsets (using various settings in the reading mail agent, such as specifying a preferred default encoding to use one of these legacy charsets, has no effect: UTF-8 is always used to process the message).

So those that see bugs are affected if:
- a user sends his message with an outdated and buggy email agent for composing and sending the mail (which inserts completely incorrect XHTML headers)
- recipients themselves use an outdated and buggy email agent, not performing the most reasonable processing and guessing steps (or this behavior can only be reproduced by those using an email agent whose user localisation is Japanese).

The encoding guesser here is most probably buggy, but it is also affected by the fact that there is not enough contextual content to guess the encoding with good confidence (only a few isolated characters whose use here was discretionary and extremely rare in English text content; those few characters have a near-zero confidence value in English as long as no other East Asian language is used).
It looks like the reading email agent does not reach a minimum confidence threshold for the guessed encoding, so it seems that the result of the guesser is simply discarded, and the reading email agent then falls back on the user's default encoding setting for messages with unknown/unspecified encodings.

I'm not sure it is valid to discard the explicit UTF-8 MIME declaration, which does not come from the encoding guesser, as UTF-8 is now a solid default (it has long been the default for almost all new IETF standards, and a wide majority of software installations now effectively use it as their default), notably when it is specified, as it is here.

We know that UTF-8 is now the best guess for content at the *worldwide* level. But is UTF-8 still a minority encoding for content exchanged in Japan? ISO-2022-JP seems very unlikely to be used instead of UTF-8, and I would rather have expected a Shift-JIS variant, if Unicode is still not the best choice for Japan. But if the email agent is on a now antique OS (Windows XP or 2000, themselves installed in their Japanese localisation), maybe that user never updated his agent for that old OS (which would be quite surprising for Japan, which likes to use the newest technology products, unless the reading user has a tricky installation with lots of personal system settings for "geek" tools that have never been ported to newer OSes).

In my opinion we are in an extremely user-specific situation. But I do not see where the mailing list was acting incorrectly (it won't change its settings only for a few "geeks" with tricky installations and antique software).

2014-04-04 6:48 GMT+02:00 Koji Ishii :
> Go to Encoding menu and choose UTF-8 to fix the garbled characters.
> > It looks like the page is served in UTF-8, but it declares itself as > us-ascii: > > and > > > /koji > > On Apr 4, 2014, at 8:16 AM, Buck Golemon wrote: > > I too received the intended emoji via direct email but I see the garbled > characters in the web interface: > > ?(???;)? - worried > > > > ?(??????? - happy > > ?(#`??)? - angry > > ??_???- confused > > > I believe there is an encoding issue somewhere in the > unicode.org/mail-arch toolchain. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.fynn at gmail.com Fri Apr 4 02:15:33 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Fri, 4 Apr 2014 13:15:33 +0600 Subject: Singhala scirpt ill defined by OpenType standard In-Reply-To: References: Message-ID: If you think there is a problem with OpenType and Singala, the place to bring that up is on the OpenType list - not the Unicode list. There are often several different ways of accomplishing things with OpenType - and how you do them also depends on how you design the font. Creating and Supporting OpenType Fonts for Sinhala Script is not part of the OpenType specification it is just a guideline for creating OpenType Singhala fonts based on how Microsoft made their own Singhala font - but some other font maker could do things a little differently - as long as the text renders correctly with the font. If you have problems with the document itself and the use of terms - you should take that up with Microsoft typography and give them suggestions how to fix it. It was probably written by someone who makes the OpenType tables for fonts for many different scripts with no particular knowledge of the the Singhala language or of Sanskrit. On 04/04/2014, Naena Guru wrote: > Here is the proof that OpenType standard defined the Singhala script > wrongly. Also find a BNF grammar that describes it. > http://ahangama.com/unicode/index.htm > > Thanks. 
>

From wjgo_10009 at btinternet.com Fri Apr 4 05:09:26 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 4 Apr 2014 11:09:26 +0100 (BST)
Subject: Singhala scirpt ill defined by OpenType standard
In-Reply-To:
References:
Message-ID: <1396606166.45507.YahooMailNeo@web87804.mail.ir2.yahoo.com>

> If you think there is a problem with OpenType and Singala, the place to bring that up is on the OpenType list ...

subscribe: opentype-subscribe at indx.co.uk

I seem to remember that joining is by asking the gentleman who runs it, rather than there being an automated system.

It is a friendly mailing list with many experts on OpenType amongst the members.

William Overington

4 April 2014

From doug at ewellic.org Fri Apr 4 11:12:59 2014
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 04 Apr 2014 09:12:59 -0700
Subject: Singhala scirpt ill defined by OpenType standard
Message-ID: <20140404091259.665a7a7059d7ee80bb4d670165c8327d.f174ce1446.wbe@email03.secureserver.net>

Christopher Fynn wrote:

>> Here is the proof that OpenType standard defined the Singhala script
>> wrongly. Also find a BNF grammar that describes it.
>> http://ahangama.com/unicode/index.htm
>
> If you think there is a problem with OpenType and Singala, the place
> to bring that up is on the OpenType list - not the Unicode list.

If you look at the page Naena cited, you'll see that he conflates Unicode and OpenType -- the page is titled "Unicode misunderstands Singhala script" -- and also that he considers the Unicode encoding of Sinhala to have been motivated by evil intentions:

"However, this is threatened by the unscrupulous implementation of Unicode Sinhala code page specification closing door to objective criticism. A nearly decade long intransigence seems to be the willingness among the technocracy in the country to value personal wellbeing over obtaining a successful solution for digitizing Singhala."
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From rick at unicode.org Fri Apr 4 20:49:58 2014 From: rick at unicode.org (Rick McGowan) Date: Fri, 04 Apr 2014 18:49:58 -0700 Subject: Unicode.org scheduled maintenance time on April 5 Message-ID: <533F6146.7060505@unicode.org> On April 5, at 11:00 pm US Central time (0400 GMT on April 6), the Unicode.org server may experience downtime due to some maintenance in the data center. The outage is expected to last less than two hours. From mark at macchiato.com Thu Apr 10 02:09:13 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 10 Apr 2014 09:09:13 +0200 Subject: Updated emoji working draft Message-ID: We updated the working draft for emoji info, http://unicode.org/draft/reports/tr51/tr51.html. It now has a more comprehensive list of characters, with images from the various systems (thanks to those who supplied them!). This is still just a working draft, without any UTC status, and may change at any time. Please let me know of any comments before the May UTC. Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Thu Apr 10 08:43:38 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 10 Apr 2014 14:43:38 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: Message-ID: <1397137418.12792.YahooMailNeo@web87803.mail.ir2.yahoo.com> May I mention Chloe and Phil please? Chloe and Phil originated as part of my creative writing in the late 1990s and feature in various animations and songs. 
http://www.users.globalnet.co.uk/~ngo/cw000000.htm http://www.users.globalnet.co.uk/~ngo/de000000.htm http://www.users.globalnet.co.uk/~ngo/cpjs0001.htm http://www.users.globalnet.co.uk/~ngo/song1008.htm http://www.users.globalnet.co.uk/~ngo/song1011.htm http://www.users.globalnet.co.uk/~ngo/ast02400.htm The whole website has now been archived by the British Library in accordance with the 2013 regulations. http://www.bl.uk/aboutus/legaldeposit/index.html William Overington 10 April 2014 -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Sat Apr 12 02:15:11 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 12 Apr 2014 08:15:11 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: Message-ID: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> A multi-colour "multi-do-not" glyph displayed in Venice. https://maps.google.com/?ll=45.435077,12.333736&spn=0.00071,0.001124&t=m&z=19&layer=c&cbll=45.435113,12.333845&panoid=D0xVyae3dpu1Z5dA8nkXyA&cbp=12,292.04,,0,13.68 Please zoom in 3 times and go full screen. William Overington 12 April 2014 From verdy_p at wanadoo.fr Sat Apr 12 03:30:45 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 12 Apr 2014 10:30:45 +0200 Subject: Updated emoji working draft In-Reply-To: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> References: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> Message-ID: Clearly not a glyph but a free logographic composition with icons. Such composition pattern is in fact very common, not exclusive to this place, and the various sub-icons will change in all aspects: number of objects, placement, color, relatve sizes, and drawing styles (photos may be used as well). 
The types of restricted objects here are just foods and drinks (including cans), but there are other items commonly found: smoking cigarettes, pets dogs (except those specifically trained and equipped for guiding blind/handicaped people, as stated by law in some countries), umbrellas, shoes (near pools), trousers/shorts (only trunks allowed in pools), rollers, skis, radios/audio devices, mobile phones, fire (in natural areas), person shouting (keep silent), children... 2014-04-12 9:15 GMT+02:00 William_J_G Overington : > A multi-colour "multi-do-not" glyph displayed in Venice. > > > https://maps.google.com/?ll=45.435077,12.333736&spn=0.00071,0.001124&t=m&z=19&layer=c&cbll=45.435113,12.333845&panoid=D0xVyae3dpu1Z5dA8nkXyA&cbp=12,292.04,,0,13.68 > > Please zoom in 3 times and go full screen. > > William Overington > > 12 April 2014 > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Sat Apr 12 03:45:09 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 12 Apr 2014 09:45:09 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: <1397286911.12024.YahooMailNeo@web87801.mail.ir2.yahoo.com> Message-ID: <1397292309.25459.YahooMailNeo@web87806.mail.ir2.yahoo.com> I found this rather fine example of some "do not" signs in use in the Uffizi Gallery in Florence. It is zoomed-in, so one needs to zoom-out, three times, in order to be able to move around in the simulation. https://maps.google.com/?ll=43.768808,11.256574&spn=0.000715,0.001124&t=m&z=19&layer=c&cbll=43.768802,11.256064&panoid=qjjyp6ul4nnS4zBqcRzSJQ&cbp=12,273.36,,3,12.92 Are such symbols emoji? In the future, perhaps there will be a colour font useful for making such signs. 
William Overington 12 April 2014 ________________________________ From: Philippe Verdy To: William_J_G Overington Cc: Mark Davis ?? ; Unicode Public Sent: Saturday, 12 April 2014, 9:30 Subject: Re: Updated emoji working draft Clearly not a glyph but a free logographic composition with icons. Such composition pattern is in fact very common, not exclusive to this place, and the various sub-icons will change in all aspects: number of objects, placement, color, relatve sizes, and drawing styles (photos may be used as well). The types of restricted objects here are just foods and drinks (including cans), but there are other items commonly found: smoking cigarettes, pets dogs (except those specifically trained and equipped for guiding blind/handicaped people, as stated by law in some countries),?umbrellas, shoes (near pools), trousers/shorts (only trunks allowed in pools), rollers,?skis, radios/audio devices, mobile phones, fire (in natural areas), person shouting (keep silent), children... 2014-04-12 9:15 GMT+02:00 William_J_G Overington : A multi-colour "multi-do-not" glyph displayed in Venice. > >https://maps.google.com/?ll=45.435077,12.333736&spn=0.00071,0.001124&t=m&z=19&layer=c&cbll=45.435113,12.333845&panoid=D0xVyae3dpu1Z5dA8nkXyA&cbp=12,292.04,,0,13.68 > >Please zoom in 3 times and go full screen. > >William Overington > >12 April 2014 > From wjgo_10009 at btinternet.com Sat Apr 12 04:46:43 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 12 Apr 2014 10:46:43 +0100 (BST) Subject: Updated emoji working draft In-Reply-To: References: Message-ID: <1397296003.47902.YahooMailNeo@web87801.mail.ir2.yahoo.com> The document http://unicode.org/draft/reports/tr51/tr51.html at present includes the following. quote There is one further kind of label, called a "read-out", for text-to-speech. For accessibility when reading text, it is useful to have a semi-unique name for an emoji character. 
The Unicode character name can often serve as a basis for this, but its requirements for uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ⏭. Note that the labels need to be in each user's language to be useful. They cannot simply be a translation of an English label, since different words, or even different categorizations, may be what is expected in different languages. The terms given in the data files here have been collected from different sources. They are only initial suggestions, not expected to be complete, and only in English.

end quote

In March 2014 I published the attached document, depositing a copy with the British Library.

The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf

Is this format suitable to become standardized for use in producing localized text-to-speech from emoji to the chosen local language?

William Overington

12 April 2014
-------------- next part --------------
A non-text attachment was scrubbed...
Name: The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf
Type: application/pdf
Size: 32815 bytes
Desc: not available
URL:

From jf at colson.eu Sat Apr 12 09:36:29 2014
From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=)
Date: Sat, 12 Apr 2014 16:36:29 +0200
Subject: IPA and unofficial extensions
Message-ID: <53494F6D.8070501@colson.eu>

Hello

I have a few questions about the IPA and about its unofficial extensions.

In the consonant charts at https://upload.wikimedia.org/wikipedia/commons/f/f5/IPA_chart_2005_png.svg there are a few grey symbols which are already in the IPA: ????.

There are also three symbols I didn't find:
? palatal lateral fricative (Latin small letter turned y with belt)
? velar lateral fricative (Latin letter small capital l with belt)
?
retroflex lateral flap (Latin small letter turned r with long leg and retroflex hook)
Is there a proposal including those three letters?

In the suprasegmentals, I found an "extra stress" character. It looks like a double primary stress ˈˈ. Is that the right way to write it? Would a new character be required?

What about the strident diacritic in the diacritics table? Is it right to use the tilde below twice (n̰̰ a̰̰) or would a new diacritic (combining double tilde below) be proposed?

Voiced bilabial fricative.
Presently, for this letter, the Greek letter β is used. The Latin letter ꞵ (U+A7B5 Latin small letter beta) is about to be accepted. Would it be used instead of the Greek letter in IPA?

Voiceless uvular fricative.
Presently, for this letter, the Greek letter χ is used. Phonetic letters for German dialectology are about to be accepted. I've seen in several proposals that it includes a Latin small letter stretched x, but several code points were proposed for it in several proposals and I don't know which one is the latest. Would it be used instead of the Greek letter in IPA? The Greek chi has a wavy line while the stretched x is very similar to and confusable with the standard x.

From mark at macchiato.com Sat Apr 12 09:39:52 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sat, 12 Apr 2014 16:39:52 +0200
Subject: Updated emoji working draft
In-Reply-To: <1397296003.47902.YahooMailNeo@web87801.mail.ir2.yahoo.com>
References: <1397296003.47902.YahooMailNeo@web87801.mail.ir2.yahoo.com>
Message-ID:

On 12 April 2014 11:46, William_J_G Overington wrote:

> ... In March 2014 I published the attached document, depositing a copy with the
> British Library.
>
> The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf
>
> Is this format suitable to become standardized for use in producing
> localized text-to-speech from emoji to the chosen local language?

no, not particularly

Mark

*Il meglio è
l'inimico del bene*

From wjgo_10009 at btinternet.com Sat Apr 12 09:54:54 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 12 Apr 2014 15:54:54 +0100 (BST)
Subject: Updated emoji working draft
Message-ID: <1397314494.29013.YahooMailNeo@web87805.mail.ir2.yahoo.com>

The document http://unicode.org/draft/reports/tr51/tr51.html at present includes the following.

quote

The longer-term goal for implementations should be to support embedded graphics. That would allow arbitrary emoji characters, and not be dependent on additional Unicode encoding.

end quote

Would it be good, for an emoji that is not encoded in regular Unicode, to include mention of the possibility of transmission by markup bubble, rendered upon reception as an unmapped glyph by an OpenType colour font?

For example, as nine Unicode characters.

COLON COLON U1 U2 U3 U4 U5 COLON SEMICOLON

This would perhaps not always allow new emoji to be added as quickly as with embedded graphics, yet with this technique, the message could be archived as plain text and would be searchable and text-to-speech would be possible at the receiving end.

William Overington

12 April 2014

From mark at macchiato.com Sat Apr 12 11:07:42 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sat, 12 Apr 2014 18:07:42 +0200
Subject: Updated emoji working draft
In-Reply-To: <1397314494.29013.YahooMailNeo@web87805.mail.ir2.yahoo.com>
References: <1397314494.29013.YahooMailNeo@web87805.mail.ir2.yahoo.com>
Message-ID:

On 12 April 2014 16:54, William_J_G Overington wrote:

> Would it be good, for an emoji that is not encoded in regular Unicode, to
> include mention of the possibility of transmission by markup bubble,
> rendered upon reception as an unmapped glyph by an OpenType colour font?
>
> For example, as nine Unicode characters.
>
> COLON COLON U1 U2 U3 U4 U5 COLON SEMICOLON
>
> This would perhaps not always allow new emoji to be added as quickly as
> with embedded graphics, yet with this technique, the message could be
> archived as plain text and would be searchable and text-to-speech would be
> possible at the receiving end.

I don't think anything like what you suggest would be feasible, or desirable. Longer term, I think the most feasible approach is the interchange of embedded graphics, which can always have alt values (at least in HTML) for readings.

Mark

*Il meglio è l'inimico del bene*

From wjgo_10009 at btinternet.com Mon Apr 14 02:27:17 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Apr 2014 08:27:17 +0100 (BST)
Subject: Updated emoji working draft
Message-ID: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com>

>> Is this format suitable to become standardized for use in producing localized text-to-speech from emoji to the chosen local language?

> no, not particularly

Thank you for replying.

Well, I feel that it would be good if a format, whatever it may be, were decided at the May 2014 UTC (Unicode Technical Committee) meeting.

This would enable application implementers to use a standardized format with a standardized file name; and enable advocates for localization to a particular language to produce a localization file for that particular language confident that the file produced would be widely compatible with various applications, such as browsers and email clients and ebook readers, from various manufacturers.

The particular file format that I mentioned is a simplified variant of an earlier format that I produced for my research. The original format contains a facility for organizing a cascading menu system for use in generating messages as well.
Yet, in general, what features are needed for such a format and can such a format become specified in good time for discussion before the May 2014 UTC meeting?

William Overington

14 April 2014

From wjgo_10009 at btinternet.com Mon Apr 14 03:54:26 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Apr 2014 09:54:26 +0100 (BST)
Subject: Updated emoji working draft
In-Reply-To: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com>
Message-ID: <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com>

> I don't think anything like what you suggest would be feasible, or desirable.

Decoding a nine-character markup bubble to an unmapped monochrome glyph using an OpenType font is known to be possible as I have done that using a font made using High-Logic FontCreator 7 with the font tested using Serif PagePlus X5.

Colour fonts are now available and OpenType COLR/CPAL colour fonts can be made using FontCreator 7.5.

I do not at present have access to an application that is both OpenType-aware and that can also support colour fonts. Some people might have access to such an application: I do not know at present.

William Overington

14 April 2014

From wjgo_10009 at btinternet.com Mon Apr 14 09:01:11 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Apr 2014 15:01:11 +0100 (BST)
Subject: Updated emoji working draft
In-Reply-To: <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com>
Message-ID: <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com>

Here are two examples each of a symbol together with accompanying text in Venice.

The symbol is global and the text is local.
https://maps.google.com/maps?q=Venice,+Italy&hl=en&ll=45.432399,12.337928&spn=0.000702,0.001124&sll=37.0625,-95.677068&sspn=26.039016,36.826172&oq=venice&hnear=Venice,+Veneto,+Italy&t=m&layer=c&cbll=45.432473,12.337638&panoid=YazHmOmqVm1q5CZ2H7klMQ&cbp=12,16.36,,0,8.23&z=19

Going full screen and zooming-in is helpful.

William Overington

14 April 2014

From mark at macchiato.com Mon Apr 14 11:08:23 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Mon, 14 Apr 2014 18:08:23 +0200
Subject: Updated emoji working draft
In-Reply-To: <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com>
Message-ID:

This is really off topic. If you want to start up a thread about this, please use a different subject.

Mark

*Il meglio è l'inimico del bene*

On 14 April 2014 16:01, William_J_G Overington wrote:

> Here are two examples each of a symbol together with accompanying text in
> Venice.
>
> The symbol is global and the text is local.
>
> https://maps.google.com/maps?q=Venice,+Italy&hl=en&ll=45.432399,12.337928&spn=0.000702,0.001124&sll=37.0625,-95.677068&sspn=26.039016,36.826172&oq=venice&hnear=Venice,+Veneto,+Italy&t=m&layer=c&cbll=45.432473,12.337638&panoid=YazHmOmqVm1q5CZ2H7klMQ&cbp=12,16.36,,0,8.23&z=19
>
> Going full screen and zooming-in is helpful.
>
> William Overington
>
> 14 April 2014

From frederic.grosshans at gmail.com Mon Apr 14 11:51:38 2014
From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=)
Date: Mon, 14 Apr 2014 18:51:38 +0200
Subject: IPA and unofficial extensions
In-Reply-To: <53494F6D.8070501@colson.eu>
References: <53494F6D.8070501@colson.eu>
Message-ID: <534C121A.6030903@gmail.com>

On 12/04/2014 16:36, Jean-François Colson wrote:
>
> I have a few questions about the IPA and about its unofficial extensions.
>
> In the consonant charts at
> https://upload.wikimedia.org/wikipedia/commons/f/f5/IPA_chart_2005_png.svg
> there are a few grey symbols which are already in the IPA: ????.

You probably meant in Unicode.

> There are also three symbols I didn't find:
> ? palatal lateral fricative (Latin small letter turned y with belt)
> ? velar lateral fricative (Latin letter small capital l with belt)
> ? retroflex lateral flap (Latin small letter turned r with long leg
> and retroflex hook)
> Is there a proposal including those three letters?

They are in the SIL PUA at positions F267, F268 and F269. You have more details at http://scripts.sil.org/cms/scripts/page.php?item_id=SILPUAassignments , but the 6.2a version of it ( http://scripts.sil.org/cms/scripts/render_download.php?format=file&media_id=SILCorpPUAAssign20130215&filename=SILCorpPUAAssign20130215_6.2a.zip ) does not list them as proposed to Unicode.

I could not find any proposal for them, but the related LATIN CAPITAL LETTER L WITH BELT, used in the Alabama (Alibamu) language (aks), has been proposed in L2/12-080 by Joshua M Jensen and Karl Pentzlin ( http://www.unicode.org/L2/L2012/12080-l-with-belt.pdf ) and has been accepted for Unicode 7.0 as U+A7AD ( http://www.unicode.org/charts/PDF/Unicode-7.0/U70-A720.pdf ).

>
> In the suprasegmentals, I found an "extra stress" character.
> It looks like a double primary stress ˈˈ. Is that the right way to
> write it? Would a new character be required?
That is the way it is written on Wikipedia at least (e.g. here https://en.wikipedia.org/wiki/Stress_%28linguistics%29 and here https://en.wikipedia.org/wiki/Obsolete_and_nonstandard_symbols_in_the_International_Phonetic_Alphabet ). A new character does not seem to be required.

>
> What about the strident diacritic in the diacritics table? Is it right
> to use the tilde below twice (n̰̰ a̰̰) or would a new diacritic (combining
> double tilde below) be proposed?

I think the correct encoding is indeed two uses of the tilde below and no new character is needed.

>
> Voiced bilabial fricative.
> Presently, for this letter, the Greek letter β is used. The Latin
> letter ꞵ (U+A7B5 Latin small letter beta) is about to be accepted.
> Would it be used instead of the Greek letter in IPA?
>
> Voiceless uvular fricative.
> Presently, for this letter, the Greek letter χ is used. Phonetic
> letters for German dialectology are about to be accepted. I've seen in
> several proposals that it includes a Latin small letter stretched x,
> but several code points were proposed for it in several proposals and
> I don't know where is the last one.

The last encoding is U+AB53 LATIN SMALL LETTER CHI, as you can see here http://www.unicode.org/charts/PDF/Unicode-7.0/U70-AB30.pdf . I suppose its final name follows the reasoning of this proposal http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4296.pdf which, I guess, was following this thread of the mailing list: http://www.unicode.org/mail-arch/unicode-ml/y2012-m06/0048.html

> Would it be used instead of the Greek letter in IPA? The Greek chi has
> a wavy line while the streched x is very similar to and confusable
> with the standard x.

The wavy line of the Greek chi is optional. In some sans-serif fonts x and χ (chi) look exactly the same, so that Greek and Latin text blend better together.
This is, for example, the design choice taken by the Ubuntu font (http://font.ubuntu.com/), and it makes this font difficult to use for mathematical and phonetic texts. I feel that the stretched x form is then more distinct because of its different height, and having it not contrast with x simply makes no sense to any sane typographer.

The encoding of these two "Greek-Latin" letters used in IPA (together with theta) is a subject of discussion which comes up every few years on this list. The three following blog posts from Michael Everson and John Wells in 2010 (as well as the comments) discuss the unfortunate effect of their unification with Latin.

http://evertype.com/blog/blog/2010/07/23/latin-and-greek-a-problem-for-the-ipa/
http://phonetic-blog.blogspot.fr/2010/07/disunification-1.html
http://phonetic-blog.blogspot.fr/2010/07/disunification-2.html

Only the Latin theta is currently missing, but since it is used in some Native American, Unifon and Rromani orthographies, it will be integrated into Unicode at some future date. And it is indeed proposed here: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4262.pdf

So some people will voluntarily use the Latin version of these letters for IPA, and I'm almost sure Michael Everson will tell you that it is the right thing to do...

Frédéric Grosshans

From wjgo_10009 at btinternet.com Tue Apr 15 06:14:48 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 15 Apr 2014 12:14:48 +0100 (BST)
Subject: Updated emoji working draft
In-Reply-To: <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com>
Message-ID: <1397560488.5735.YahooMailNeo@web87803.mail.ir2.yahoo.com>

> This is really off topic. If you want to start up a thread about this, please use a different subject.
Well, perhaps I may explain why I consider the post to be on topic.

The document http://unicode.org/draft/reports/tr51/tr51.html at present includes the following.

quote

There is one further kind of label, called a "read-out", for text-to-speech. For accessibility when reading text, it is useful to have a semi-unique name for an emoji character. The Unicode character name can often serve as a basis for this, but its requirements for uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ⏭. Note that the labels need to be in each user's language to be useful. They cannot simply be a translation of an English label, since different words, or even different categorizations, may be what is expected in different languages. The terms given in the data files here have been collected from different sources. They are only initial suggestions, not expected to be complete, and only in English.

end quote

If the UTC (Unicode Technical Committee) accepts the introduction of read-out labels, each read-out label both linked to a pictograph character and also linked to a language-localization text string, then that will be a far-reaching enhancement to Unicode which may have enormous implications for facilitating communication through the language barrier.

Although suggested in the draft as for use in text-to-speech, a read-out label could also be displayed as text, either in addition to the pictograph or instead of the pictograph.

The linked picture in my post contains two examples, each of which may, in the present context, be regarded as a pictograph and a read-out label text string displayed together.

Imagine please if museum and art gallery websites each were to have an international webpage in its on-line shop.
If there were on the webpage emoji for Surname, Forename, Delivery address, Card number, Card expiry date and so on and the end-user could display text in his or her own language by displaying the appropriate read-out label next to the pictograph of the emoji, then that could be very helpful.

William Overington

15 April 2014

From mark at macchiato.com Tue Apr 15 07:57:52 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 15 Apr 2014 14:57:52 +0200
Subject: Updated emoji working draft
In-Reply-To: <1397560488.5735.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <1397465666.9700.YahooMailNeo@web87801.mail.ir2.yahoo.com> <1397484071.93086.YahooMailNeo@web87805.mail.ir2.yahoo.com> <1397560488.5735.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID:

On 15 April 2014 13:14, William_J_G Overington wrote:

> If the UTC (Unicode Technical Committee) accepts the introduction of
> read-out labels, each read-out label both linked to a pictograph character
> and also linked to a language-localization text string, then that will be a
> far-reaching enhancement to Unicode which may have enormous implications
> for facilitating communication through the language barrier.

> If the UTC (Unicode Technical Committee) accepts the introduction of read-out labels

The passage just points out that those can exist; the document does not provide any data for that.

> If there were on the webpage emoji for Surname, Forename, Delivery address, Card number

I can't see any possible future in which emoji like that are encoded.

As I said before, please move this discussion to another email subject. Otherwise, I'll take a step I should have long ago, and simply filter out all email coming from you.

Mark

*Il meglio è l'inimico del bene*

From petercon at microsoft.com Tue Apr 15 11:29:23 2014
From: petercon at microsoft.com (Peter Constable)
Date: Tue, 15 Apr 2014 16:29:23 +0000
Subject: Updated emoji working draft
In-Reply-To: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com>
Message-ID: <21684ddd2077444f8c9cd2fc30b40bee@BL2PR03MB450.namprd03.prod.outlook.com>

William, the UTC is not in the business of creating file formats for localization data.

Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington
Sent: April 14, 2014 12:27 AM
To: Unicode Public; Mark Davis ☕️
Cc: wjgo_10009 at btinternet.com
Subject: Re: Updated emoji working draft

>> Is this format suitable to become standardized for use in producing localized text-to-speech from emoji to the chosen local language?

> no, not particularly

Thank you for replying.

Well, I feel that it would be good if a format, whatever it may be, were decided at the May 2014 UTC (Unicode Technical Committee) meeting.

This would enable application implementers to use a standardized format with a standardized file name; and enable advocates for localization to a particular language to produce a localization file for that particular language confident that the file produced would be widely compatible with various applications, such as browsers and email clients and ebook readers, from various manufacturers.

The particular file format that I mentioned is a simplified variant of an earlier format that I produced for my research. The original format contains a facility for organizing a cascading menu system for use in generating messages as well.

Yet, in general, what features are needed for such a format and can such a format become specified in good time for discussion before the May 2014 UTC meeting?
William Overington

14 April 2014

From wjgo_10009 at btinternet.com Wed Apr 16 06:04:07 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 16 Apr 2014 12:04:07 +0100 (BST)
Subject: The application of localized read-out labels
Message-ID: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com>

> William, the UTC is not in the business of creating file formats for localization data.

> Peter

Thank you for replying.

Feeling that a format for the particular application is important I have now produced a format myself and published it.

Please find a copy attached. Posting the publication as an attachment here will also hopefully place it in the mailing list archives for long-term availability.

I have also sent a copy to the British Library for Legal Deposit.

The publication has the following title.

The format of the readouts.dat file suggested for possible use in the application of localized read-out labels

The file has the following file name.

The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf

William Overington

16 April 2014
-------------- next part --------------
A non-text attachment was scrubbed...
Name: The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf
Type: application/pdf
Size: 34283 bytes
Desc: not available
URL:

From chris.fynn at gmail.com Wed Apr 16 14:41:30 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Thu, 17 Apr 2014 01:41:30 +0600
Subject: Updated emoji working draft
In-Reply-To: <21684ddd2077444f8c9cd2fc30b40bee@BL2PR03MB450.namprd03.prod.outlook.com>
References: <1397460437.31343.YahooMailNeo@web87806.mail.ir2.yahoo.com> <21684ddd2077444f8c9cd2fc30b40bee@BL2PR03MB450.namprd03.prod.outlook.com>
Message-ID:

On 15/04/2014, Peter Constable wrote:
> William, the UTC is not in the business of creating file formats for
> localization data.
>
> Peter

Yes, a proper understanding of what is the scope of Unicode - and what is not within that scope - might help.

From chris.fynn at gmail.com Wed Apr 16 15:07:52 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Thu, 17 Apr 2014 02:07:52 +0600
Subject: The application of localized read-out labels
In-Reply-To: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com>
References: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com>
Message-ID:

On 16/04/2014, William_J_G Overington wrote:
>
>> William, the UTC is not in the business of creating file formats for
>> localization data.
>
>> Peter
>
> Thank you for replying.
>
> Feeling that a format for the particular application is important I have now
> produced a format myself and published it.

Whether or not it is important, it is clearly beyond the defined scope of Unicode so off-topic here.
From wjgo_10009 at btinternet.com Thu Apr 17 01:14:55 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 17 Apr 2014 07:14:55 +0100 (BST)
Subject: The application of localized read-out labels
In-Reply-To:
References: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com>
Message-ID: <1397715295.40851.YahooMailNeo@web87802.mail.ir2.yahoo.com>

Christopher Fynn wrote:

> Whether or not it is important, it is clearly beyond the defined scope of Unicode so off-topic here.

Well, whether it is beyond the scope of Unicode is not the issue here when considering whether it was reasonable for me to have made the post. The issue is whether it is beyond the scope of the Unicode Public Email List.

Please consider the following web page.

http://www.unicode.org/consortium/distlist-unicode.html

That page includes the following.

quote

Discussion list for Unicode and general internationalization issues

About 700 members world-wide, discuss such subjects as: implementing the Unicode Standard, discussion of new proposals, etc.

end quote

This mailing list is not only about producing the Unicode Standard, it is also for matters considering implementing end-user projects that use Unicode characters.

Going back to the issue of whether the post is relevant to Unicode as such, the fact of the matter is, that at the time of posting, and indeed now, the document at the following webpage is implicitly due to be put before the UTC (Unicode Technical Committee) meeting in May 2014.

http://unicode.org/draft/reports/tr51/tr51.html

That page includes the following.

quote

There is one further kind of label, called a "read-out", for text-to-speech. For accessibility when reading text, it is useful to have a semi-unique name for an emoji character. The Unicode character name can often serve as a basis for this, but its requirements for uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ⏭.
Note that the labels need to be in each user's language to be useful. They cannot simply be a translation of an English label, since different words, or even different categorizations, may be what is expected in different languages. The terms given in the data files here have been collected from different sources. They are only initial suggestions, not expected to be complete, and only in English.

end quote

So as the matter is raised in the Unicode Public Email List and is due to go before the UTC in May 2014, then I opine that it is both reasonable and within the rules to discuss the implications of the practical application of read-out labels in the Unicode Public Email List.

In fact, I did not know of this concept of a read-out label in relation to emoji characters before I read that text. I feel that it is an important matter.

It remains until the meeting for it to become clear what the UTC decides is relevant to its scope. Until recently, character colour was not in scope, everything was monochrome only. Now character colour is in scope. So until the Chair of the meeting reaches that topic, who can say what the UTC will decide to be in scope?

I am pleased that the pdf document has been circulated around the world and I hope that it will be of practical use in relation to accessibility. If the format is used by software manufacturers and by people producing specific localization files so that interoperability is achieved, then that will be a good result. It would be of great help if the UTC chooses to participate and I hope that it does, yet if that is not possible the format can still be applied by end-users of the Unicode Standard.

Here is a link to another item about accessibility that I produced some years ago.
http://www.users.globalnet.co.uk/~ngo/spec0001.htm

William Overington

17 April 2014

From corbett.dav at husky.neu.edu Thu Apr 17 08:39:20 2014
From: corbett.dav at husky.neu.edu (David Corbett)
Date: Thu, 17 Apr 2014 09:39:20 -0400
Subject: IPA and unofficial extensions
Message-ID:

> > What about the strident diacritic in the diacritics table? Is it right
> > to use the tilde below twice (n̰̰ a̰̰) or would a new diacritic (combining
> > double tilde below) be proposed?

> I think the correct encoding is indeed two uses of the tilde below and
> no new character is needed

The strident diacritic is U+1DFD COMBINING ALMOST EQUAL TO BELOW. L2/07-334R explains:

COMBINING ALMOST EQUAL TO BELOW could possibly be considered COMBINING TILDE BELOW + COMBINING TILDE BELOW. However, COMBINING ALMOST EQUAL TO BELOW is one character representing strident vowels. It does not represent creaky voiced which is what tilde below represents in the IPA. COMBINING ALMOST EQUAL TO ABOVE exists in Unicode and we believe the COMBINING ALMOST EQUAL TO BELOW should be added for linguistic usage as well.

From doug at ewellic.org Thu Apr 17 13:43:47 2014
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 17 Apr 2014 11:43:47 -0700
Subject: The application of localized read-out labels
Message-ID: <20140417114347.665a7a7059d7ee80bb4d670165c8327d.a0ad58431a.wbe@email03.secureserver.net>

Christopher Fynn wrote:

>> Feeling that a format for the particular application is important I
>> have now produced a format myself and published it.
>
> Whether or not it is important, it is clearly beyond the defined scope
> of Unicode so off-topic here.

It's labeled prominently as a "thought experiment," which means there is no expectation that anyone will implement the format or software which reads it, only think about what would happen if it were implemented.

I actually read through the document, 18-point body type and all, before noticing this key point.
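
On the strident diacritic raised in the IPA thread (U+1DFD versus two U+0330), the distinction can be checked with Python's standard `unicodedata` module. This is only a minimal sketch: the helper names `describe` and `equivalent` are made up for illustration, and it assumes a Python build whose Unicode database includes U+1DFD.

```python
import unicodedata

STRIDENT = "\u1DFD"     # COMBINING ALMOST EQUAL TO BELOW (strident, per L2/07-334R)
TILDE_BELOW = "\u0330"  # COMBINING TILDE BELOW (creaky voice in the IPA)


def describe(text):
    """List each code point of `text` with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def equivalent(s, t):
    """True if the two strings are canonically equivalent (compared in NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


strident_a = "a" + STRIDENT             # one dedicated combining mark
double_tilde_a = "a" + TILDE_BELOW * 2  # tilde below applied twice

for line in describe(strident_a) + describe(double_tilde_a):
    print(line)

# The two spellings are different character sequences, so searching or
# comparing text will not treat one as the other.
print(equivalent(strident_a, double_tilde_a))
```

Because the two sequences are not canonically equivalent, text written with a doubled tilde below would not match text written with U+1DFD.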
--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From eliz at gnu.org Sun Apr 20 05:24:22 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Sun, 20 Apr 2014 13:24:22 +0300
Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3
Message-ID: <83ha5ofgdl.fsf@gnu.org>

Would someone please help understand the following subtleties and obscure language in the UBA document found at http://www.unicode.org/reports/tr9/? Thanks in advance.

1. In paragraph 3.1.2, near its very end, we have this sentence (with my emphasis):

     As rule X10 will specify, an isolating run sequence is the unit to
     which the rules following it are applied, and the last character of
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     one level run in the sequence is considered to be immediately
     followed by the first character of the next level run in the
     sequence during this phase of the algorithm.

   What does it mean here by "the rules following it"? Following what?

2. In BD16 (paragraph 3.1.3), the 1st bullet says:

     . Create a stack for elements each consisting of a bracket character and a text position. Initialize it to empty.

   But then 1st sub-bullet of the 3rd bullet says:

     . If an opening paired bracket is found, push its Bidi_Paired_Bracket property value and its text position onto the stack.

   But the stack does not hold values of Bidi_Paired_Bracket property, it holds characters. Items 2 and 3 below that say:

     2. Compare the closing paired bracket being inspected or its canonical equivalent to the bracket in the current stack element.
     3. If the values match, meaning the two characters form a bracket pair, then [...]

   So I guess the 1st bullet is correct, but the 3rd bullet should say "... push the opening paired bracket character and its text position onto the stack". Is this the correct interpretation?

3. Paragraph 3.3.2 says, under "Non-formatting characters":

     X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, FSI, and PDI:

     .
Set the current character's embedding level to the embedding level of the last entry on the directional status stack. [...] Note that the current embedding level is not changed by this rule. What does this last sentence mean by "the current embedding level"? The first bullet of X6 mandates that "the current character's embedding level" _is_ changed by this rule, so what other "current embedding level" is alluded to here? 4. Rule X10 says in its last bullet: Apply rules W1–W7, N0–N2, and I1–I2, in the order in which they appear below, to each of the isolating run sequences, applying one rule to all the characters in the sequence in the order in which they occur in the sequence before applying another rule to any part of the sequence. The order that one isolating run sequence is treated relative to another does not matter. Does the last sentence mean that it is OK to apply W1 to the 1st isolating sequence, then apply W1 to the second isolating sequence, then apply W2 to the 1st isolating sequence, followed by W2 application to the 2nd isolating sequence, etc.? IOW, the last sentence refers to the order of processing between the isolating run sequences, but says nothing about the order of applying rules between the sequences. 5. Rule N0 says: . For each bracket-pair element in the list of pairs of text positions a. Inspect the bidirectional types of the characters enclosed within the bracket pair. b. If any strong type (either L or R) matching the embedding direction is found, set the type for both brackets in the pair to match the embedding direction. First, what is meant here by "strong type [...] matching the embedding direction"? Does the "match" here consider only the odd/even value of the current embedding level vs R/L type, in the sense that odd levels "match" R and even levels "match" L? Or does this mean some other kind of matching? 
Table 3, which the only place that seems to refer to the issue, is not entirely clear, either: e The text ordering type (L or R) that matches the embedding level direction (even or odd). Again, the sense of the "match" here is not clear. Next, what is meant here by "the characters enclosed within the bracket pair"? If the bracket pair encloses another bracket pair, which is inner to it, do the characters inside the inner pair count for the purposes of resolving the level of the outer pair? Lastly, I presume that by "the bidirectional types of the enclosed characters" the text means the resolved types as modified by the preceding phases, not the original types. Is that correct? Again, thanks in advance for any help. From asmusf at ix.netcom.com Sun Apr 20 14:58:23 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 20 Apr 2014 12:58:23 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83ha5ofgdl.fsf@gnu.org> References: <83ha5ofgdl.fsf@gnu.org> Message-ID: <535426DF.9020308@ix.netcom.com> On 4/20/2014 3:24 AM, Eli Zaretskii wrote: > Would someone please help understand the following subtleties and > obscure language in the UBA document found at > http://www.unicode.org/reports/tr9/? Thanks in advance. Eli, I've tried to give you some explanations - in some places, I concur with you that the wording could be improved and that such improved wording should be proposed to the UTC (or its editorial committee) for incorporation into a future update. For details, see below. > > 1. In paragraph 3.1.2, near its very end, we have this sentence (with > my emphasis): > > As rule X10 will specify, an isolating run sequence is the unit to > which the rules following it are applied, and the last character of > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > one level run in the sequence is considered to be immediately > followed by the first character of the next level run in the > sequence during this phase of the algorithm. 
> > What does it mean here by "the rules following it"? Following what? That looks like a bad referent, but from context, this "it" must be X10. > > 2. In BD16 (paragraph 3.1.3), the 1st bullet says: > > . Create a stack for elements each consisting of a bracket character > and a text position. Initialize it to empty. > > But then 1st sub-bullet of the 3rd bullet says: > > . If an opening paired bracket is found, push its > Bidi_Paired_Bracket property value and its text position onto > the stack. > > But the stack does not hold values of Bidi_Paired_Bracket property, it > holds characters. The Bidi_Paired_Bracket property is a character code (it is the character code of the other partner in the pair). > Items 2 and 3 below that say: > > 2. Compare the closing paired bracket being inspected or its > canonical equivalent to the bracket in the current stack > element. > 3. If the values match, meaning the two characters > form a bracket pair, then [...] > > So I guess the 1st bullet is correct, but the 3rd bullet should say > "... push the opening paired bracket character and its text position > onto the stack". Is this the correct interpretation? What's really required is that the stack contain a unique identifier for each bracket pair, so that, given a function that maps either opening or closing brackets (or their canonical equivalents) to this id, one can determine that both characters belong to the same pair. This unique id could be the opening or the closing bracket (or its canonical equivalent); it makes no practical difference. However, it looks like UAX#9 is written in terms of the code point for the closing bracket. Bullet 1 could be changed to . Create a stack for elements each consisting of a *code point* (Bidi_Paired_Bracket property value) and a text position. Initialize it to empty. to make things more clear. And a slight wording change might help the reader with item 2: 2. 
Compare the *code point for the* closing paired bracket being inspected or its canonical equivalent to the *code point* (Bidi_Paired_Bracket property value) in the current stack element. And, to continue 3. If the values match, meaning *the character being inspected and the character at the text position in the stack* form a bracket pair, then [...] > > 3. Paragraph 3.3.2 says, under "Non-formatting characters": > > X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, > FSI, and PDI: > > . Set the current character's embedding level to the embedding > level of the last entry on the directional status stack. > > [...] > > Note that the current embedding level is not changed by this rule. > > What does this last sentence mean by "the current embedding level"? > The first bullet of X6 mandates that "the current character's > embedding level" _is_ changed by this rule, so what other "current > embedding level" is alluded to here? I'm punting on that one - can someone else answer this? > > 4. Rule X10 says in its last bullet: > > Apply rules W1–W7, N0–N2, and I1–I2, in the order in which they > appear below, to each of the isolating run sequences, applying one > rule to all the characters in the sequence in the order in which > they occur in the sequence before applying another rule to any part > of the sequence. The order that one isolating run sequence is > treated relative to another does not matter. > > Does the last sentence mean that it is OK to apply W1 to the 1st > isolating sequence, then apply W1 to the second isolating sequence, > then apply W2 to the 1st isolating sequence, followed by W2 > application to the 2nd isolating sequence, etc.? IOW, the last > sentence refers to the order of processing between the isolating run > sequences, but says nothing about the order of applying rules between > the sequences. Apply rules W1–W7, N0–N2, and I1–I2 to each of the isolating run sequences. 
For each sequence, [completely] apply each rule in the order in which they appear below. The order that one isolating run sequence is treated relative to another does not matter. I believe the above restatement expresses the same thing in fewer words. The "completely" may be unnecessary. The text about applying the rules to "all characters" seems to be unnecessary, unless there is, in any of the rules, an option to not apply it to some characters. Unless incomplete application is envisaged, calling out the "all characters" here just confuses. > > 5. Rule N0 says: > > . For each bracket-pair element in the list of pairs of text positions > > a. Inspect the bidirectional types of the characters enclosed > within the bracket pair. > b. If any strong type (either L or R) matching the embedding > direction is found, set the type for both brackets in the pair > to match the embedding direction. > > First, what is meant here by "strong type [...] matching the embedding > direction"? Does the "match" here consider only the odd/even value of > the current embedding level vs R/L type, in the sense that odd levels > "match" R and even levels "match" L? Or does this mean some other > kind of matching? Table 3, which the only place that seems to refer > to the issue, is not entirely clear, either: > > e The text ordering type (L or R) that matches the embedding level > direction (even or odd). > > Again, the sense of the "match" here is not clear. even/odd --- R/L match, might be made more explicit > > Next, what is meant here by "the characters enclosed within the > bracket pair"? If the bracket pair encloses another bracket pair, > which is inner to it, do the characters inside the inner pair count > for the purposes of resolving the level of the outer pair? They do, so there's no need to change the text. 
> > Lastly, I presume that by "the bidirectional types of the enclosed > characters" the text means the resolved types as modified by the > preceding phases, not the original types. Is that correct? It's the strong type assigned by rule N0. A./ > > Again, thanks in advance for any help. From jjc at jclark.com Sun Apr 20 20:54:34 2014 From: jjc at jclark.com (James Clark) Date: Mon, 21 Apr 2014 08:54:34 +0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <535426DF.9020308@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: On Mon, Apr 21, 2014 at 2:58 AM, Asmus Freytag wrote: > On 4/20/2014 3:24 AM, Eli Zaretskii wrote: > > Would someone please help understand the following subtleties and > obscure language in the UBA document found at http://www.unicode.org/reports/tr9/? Thanks in advance. > > 3. Paragraph 3.3.2 says, under "Non-formatting characters": > > X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, > FSI, and PDI: > > . Set the current character's embedding level to the embedding > level of the last entry on the directional status stack. > > [...] > > Note that the current embedding level is not changed by this rule. > > What does this last sentence mean by "the current embedding level"? > The first bullet of X6 mandates that "the current character's > embedding level" _is_ changed by this rule, so what other "current > embedding level" is alluded to here? > > I'm punting on that one - can someone else answer this? > I assume "current embedding level" here meant "the embedding level of the last entry on the directional status stack". 
(This is a natural slip to make if you think in terms of an optimized implementation that stores each component of the top of the directional status stack in a variable, as suggested in 3.3.2.) James From asmusf at ix.netcom.com Mon Apr 21 01:03:20 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 20 Apr 2014 23:03:20 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: <5354B4A8.4030201@ix.netcom.com> On 4/20/2014 6:54 PM, James Clark wrote: > On Mon, Apr 21, 2014 at 2:58 AM, Asmus Freytag > wrote: > > On 4/20/2014 3:24 AM, Eli Zaretskii wrote: >> Would someone please help understand the following subtleties and >> obscure language in the UBA document found at >> http://www.unicode.org/reports/tr9/? Thanks in advance. >> 3. Paragraph 3.3.2 says, under "Non-formatting characters": >> >> X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, >> FSI, and PDI: >> >> . Set the current character's embedding level to the embedding >> level of the last entry on the directional status stack. >> >> [...] >> >> Note that the current embedding level is not changed by this rule. >> >> What does this last sentence mean by "the current embedding level"? >> The first bullet of X6 mandates that "the current character's >> embedding level" _is_ changed by this rule, so what other "current >> embedding level" is alluded to here? > I'm punting on that one - can someone else answer this? > > > I assume "current embedding level" here meant "the embedding level of > the last entry on the directional status stack". (This is a natural > slip to make if you think in terms of an optimized implementation that > stores each component of the top of the directional status stack in a > variable, as suggested in 3.3.2.) 
> > James > In general, I heartily dislike "specifications" that just narrate a particular implementation... A./ From wjgo_10009 at btinternet.com Mon Apr 21 02:02:40 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 08:02:40 +0100 (BST) Subject: The application of localized read-out labels In-Reply-To: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> References: <1397646247.70807.YahooMailNeo@web87806.mail.ir2.yahoo.com> Message-ID: <1398063760.79331.YahooMailNeo@web87801.mail.ir2.yahoo.com> The text of the first post in this thread was not recorded in the archive of the Unicode Public Email List. Maybe because there was an attachment to the post? This post is so as to include a transcript of the text of that post in the archive of the Unicode Public Email List. William Overington 21 April 2014 Transcript: > William, the UTC is not in the business of creating file formats for localization data. > Peter Thank you for replying. Feeling that a format for the particular application is important I have now produced a format myself and published it. Please find a copy attached. Posting the publication as an attachment here will also hopefully place it in the mailing list archives for long-term availability. I have also sent a copy to the British Library for Legal Deposit. The publication has the following title. The format of the readouts.dat file suggested for possible use in the application of localized read-out labels The file has the following file name. 
The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf William Overington 16 April 2014 From eliz at gnu.org Mon Apr 21 02:55:59 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 21 Apr 2014 10:55:59 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <535426DF.9020308@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: <83zjjfce0g.fsf@gnu.org> > Date: Sun, 20 Apr 2014 12:58:23 -0700 > From: Asmus Freytag > > On 4/20/2014 3:24 AM, Eli Zaretskii wrote: > > Would someone please help understand the following subtleties and > > obscure language in the UBA document found at > > http://www.unicode.org/reports/tr9/? Thanks in advance. > > Eli, > > I've tried to give you some explanations Thanks! > in some places, I concur with you that the wording could be improved > and that such improved wording should be proposed to the UTC (or its > editorial committee) for incorporation into a future update. How do we do that? > For details, see below. > > > > 1. In paragraph 3.1.2, near its very end, we have this sentence (with > > my emphasis): > > > > As rule X10 will specify, an isolating run sequence is the unit to > > which the rules following it are applied, and the last character of > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > one level run in the sequence is considered to be immediately > > followed by the first character of the next level run in the > > sequence during this phase of the algorithm. > > > > What does it mean here by "the rules following it"? Following what? > > That looks like a bad referent, but from context, this "it" must be X10 Ah, so simply saying "the following rules" or "rules following X10" would be enough. > Bullet 1 could be changed to > > . Create a stack for elements each consisting of a *code point* (Bidi_Paired_Bracket property value) > and a text position. Initialize it to empty. > > to make things more clear. 
And a slight wording change might help the > reader with item 2: > > 2. Compare the *code point for the* closing paired bracket being inspected or its > canonical equivalent to the *code point* (Bidi_Paired_Bracket property value) in the current stack > element. > > > And, to continue > > 3. If the values match, meaning *the character being inspected and the character > at the text position in the stack* form a bracket pair, then [...] Right, this makes the description a whole lot more clear. > Apply rules W1–W7, N0–N2, and I1–I2 to each of the isolating run sequences. > For each sequence, [completely] apply each rule in the order in which they appear below. > The order that one isolating run sequence is treated relative to another does not matter. > > I believe the above restatement expresses the same thing in fewer words. It does, thanks. > > 5. Rule N0 says: > > > > . For each bracket-pair element in the list of pairs of text positions > > > > a. Inspect the bidirectional types of the characters enclosed > > within the bracket pair. > > b. If any strong type (either L or R) matching the embedding > > direction is found, set the type for both brackets in the pair > > to match the embedding direction. > > > > First, what is meant here by "strong type [...] matching the embedding > > direction"? Does the "match" here consider only the odd/even value of > > the current embedding level vs R/L type, in the sense that odd levels > > "match" R and even levels "match" L? Or does this mean some other > > kind of matching? Table 3, which the only place that seems to refer > > to the issue, is not entirely clear, either: > > > > e The text ordering type (L or R) that matches the embedding level > > direction (even or odd). > > > > Again, the sense of the "match" here is not clear. > > even/odd --- R/L match, might be made more explicit I agree this should be made more explicit, as this is a somewhat subtle issue that might trip the reader. 
> > Next, what is meant here by "the characters enclosed within the > > bracket pair"? If the bracket pair encloses another bracket pair, > > which is inner to it, do the characters inside the inner pair count > > for the purposes of resolving the level of the outer pair? > They do, so there's no need to change the text. It might be a good idea to say that explicitly, e.g. as a note, or at least provide another example where the strong characters are only inside an inner bracket pair, which will send the same message to the reader. Thanks again for the clarifications. From eliz at gnu.org Mon Apr 21 03:01:13 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 21 Apr 2014 11:01:13 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> Message-ID: <83y4yzcdrq.fsf@gnu.org> > From: James Clark > Date: Mon, 21 Apr 2014 08:54:34 +0700 > Cc: Eli Zaretskii , unicode at unicode.org, Kenneth Whistler > > > X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, > > FSI, and PDI: > > > > . Set the current character's embedding level to the embedding > > level of the last entry on the directional status stack. > > > > [...] > > > > Note that the current embedding level is not changed by this rule. > > > > What does this last sentence mean by "the current embedding level"? > > The first bullet of X6 mandates that "the current character's > > embedding level" _is_ changed by this rule, so what other "current > > embedding level" is alluded to here? > > > > I'm punting on that one - can someone else answer this? > > > > I assume "current embedding level" here meant "the embedding level of the > last entry on the directional status stack". Thanks, that was my guess as well, but I wanted to be sure. IMO, the unfortunate wording here is that the same phrase ("current embedding level") was used just before the problematic sentence to mean something completely different. 
Having identical phrases close to one another always tricks readers into thinking they are describing the same thing; when they aren't, confusion settles in. So I would suggest to reword one or both of these references to the "current embedding level". Btw, why is that note, about the current embedding level not being changed by X6, important? Why would someone mistakenly think the contrary? From eliz at gnu.org Mon Apr 21 03:33:39 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 21 Apr 2014 11:33:39 +0300 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <5354B4A8.4030201@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> Message-ID: <83wqejcc9o.fsf@gnu.org> > Date: Sun, 20 Apr 2014 23:03:20 -0700 > From: Asmus Freytag > CC: Eli Zaretskii , unicode at unicode.org, > Kenneth Whistler > > >> Note that the current embedding level is not changed by this rule. > >> > >> What does this last sentence mean by "the current embedding level"? > >> The first bullet of X6 mandates that "the current character?s > >> embedding level" _is_ changed by this rule, so what other "current > >> embedding level" is alluded to here? > > I'm punting on that one - can someone else answer this? > > > > > > I assume "current embedding level" here meant "the embedding level of > > the last entry on the directional status stack". (This is a natural > > slip to make if you think in terms of an optimized implementation that > > stores each component of the top of the directional status stack in a > > variable, as suggested in 3.3.2.) > > > > James > > > In general, I heartily dislike "specifications" that just narrate a > particular implementation... I cannot agree more. In fact, my main gripe about the UBA additions in 6.3 are that some of their crucial parts are not formally defined, except by an algorithm that narrates a specific implementation. 
The two worst examples of that are the "definitions" of the isolating run sequence and of the bracket pair. I didn't ask about those because I succeeded to figure them out, but it took many readings of the corresponding parts of the document. It is IMO a pity that the two main features added in 6.3 are based on definitions that are so hard to penetrate, and which actually all but force you to use the specific implementation described by the document. My working definition that replaces BD13 is this: An isolating run sequence is the maximal sequence of level runs of the same embedding level that can be obtained by removing all the characters between an isolate initiator and its matching PDI (or paragraph end, if there is no matching PDI) within those level runs. As for bracket pair (BD16), I'm really amazed that a concept as easy and widely known/used as this would need such an obscure definition that must have an algorithm as its necessary part. How about this instead: A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters within the same isolating run sequence, such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent, and all the opening and closing bracket characters in between these two are balanced. Then we could use the algorithm to explain what it means for brackets to be balanced (for those readers who somehow don't already know that). Again, thanks for clarifying these subtle issues. I can now proceed to updating the Emacs bidirectional display with the changes in Unicode 6.3. 
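The bracket-pair definition above, together with the stack procedure discussed earlier in the thread, can be sketched in a few lines. The Python below is only a minimal illustration under stated assumptions, not the UAX#9 reference code: the PAIRED table is a small hand-picked subset standing in for the Bidi_Paired_Bracket data, the input string stands in for a single isolating run sequence, every bracket is assumed to still have bidi type ON, and no stack-size limit is modelled. The names PAIRED, CLOSERS, canon and bracket_pairs are invented for this sketch.

```python
import unicodedata

# Hypothetical, tiny subset of opening -> closing bracket pairs. A real
# implementation would read the Bidi_Paired_Bracket property from the
# Unicode Character Database instead of hard-coding a table.
PAIRED = {
    "(": ")",
    "[": "]",
    "{": "}",
    "\u2329": "\u232A",  # LEFT/RIGHT-POINTING ANGLE BRACKET
    "\u3008": "\u3009",  # LEFT/RIGHT ANGLE BRACKET
}
CLOSERS = set(PAIRED.values())


def canon(ch):
    """Map a character to its canonical decomposition, so that canonically
    equivalent brackets (U+232A and U+3009, for example) compare equal."""
    return unicodedata.normalize("NFD", ch)


def bracket_pairs(text):
    """Return (open_pos, close_pos) tuples sorted by opening position.

    The string is treated as one isolating run sequence (an assumption
    made to keep the sketch short).
    """
    stack = []  # elements: (expected closing code point, text position)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in PAIRED:
            # Push the partner's code point, as suggested in the thread.
            stack.append((canon(PAIRED[ch]), pos))
        elif ch in CLOSERS:
            # Search from the top of the stack downwards for a match.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == canon(ch):
                    pairs.append((stack[i][1], pos))
                    # Pop the matched element and everything above it.
                    del stack[i:]
                    break
            # No match found: the closing bracket is simply skipped.
    return sorted(pairs)


if __name__ == "__main__":
    print(bracket_pairs("a(b[c)d]"))
```

With this sketch, "a(b[c)d]" yields [(1, 5)]: the closing parenthesis discards the unmatched opening square bracket, so the later closing square bracket finds nothing to pair with. That is the "balanced" condition in the proposed wording expressed procedurally.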
From wjgo_10009 at btinternet.com Mon Apr 21 04:29:36 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 10:29:36 +0100 (BST) Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries Message-ID: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries Imagine please if museum and art gallery websites each were to have an international webpage in its on-line shop. If there were on the webpage colourful symbols, one each for Surname, Forename, Card number and so on and the end-user could display text in his or her own language by displaying the appropriate read-out label next to each symbol, thus localizing the web page, then that could be very helpful. I have produced designs for nine symbols. There are two glyphs for each symbol, one colourful and one monochrome. The symbols are octagonal, using not quite a regular octagon. In the monochrome glyphs there is a border around the edge, yet in the colourful glyphs there is no border. The colourful glyphs are displayed in blue and orange, the idea being that the effect to the viewer is of blue upon an orange background. The designs are influenced by heraldry to some extent. This is because I consider Surname to be the most important, so I used a heraldic chief. Then for Forename I used a pale as Forename is different from Surname yet accompanies to Surname to form a name. A bar is used for Address. Name as on card may be different from Forename concatenated with a space and Surname, due to use of Mr Mrs Miss Ms etc and initials, hence the reason for the design not being a union of a chief and one pale. Two bars are used for Card number. Card start date and Card expiry date seemed liked brackets, so that inspired the designs. 
Card security code is just a design so as to be different from the other designs yet not use any diagonal shapes. Delivery address is included to allow for the possibility of sending a gift directly to someone who lives at another address. I am hoping to attach images showing the designs to other posts in this thread. William Overington 21 April 2014 From wjgo_10009 at btinternet.com Mon Apr 21 04:47:24 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 10:47:24 +0100 (BST) Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries In-Reply-To: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com> > I am hoping to attach images showing the designs to other posts in this thread. Please find attached an image of the designs of the colourful glyphs. William Overington 21 April 2014 -------------- next part -------------- A non-text attachment was scrubbed... Name: colourful_glyphs.png Type: image/png Size: 21679 bytes Desc: not available URL: From wjgo_10009 at btinternet.com Mon Apr 21 04:49:00 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 10:49:00 +0100 (BST) Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries In-Reply-To: <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com> References: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1398073644.88727.YahooMailNeo@web87801.mail.ir2.yahoo.com> Message-ID: <1398073740.83541.YahooMailNeo@web87801.mail.ir2.yahoo.com> > I am hoping to attach images showing the designs to other posts in this thread. Please find attached an image of the designs of the monochrome glyphs. 
William Overington 21 April 2014 -------------- next part -------------- A non-text attachment was scrubbed... Name: monochrome_glyphs.png Type: image/png Size: 21415 bytes Desc: not available URL: From ruland at luckymail.com Mon Apr 21 05:18:25 2014 From: ruland at luckymail.com (Charlie Ruland ☘) Date: Mon, 21 Apr 2014 12:18:25 +0200 Subject: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries In-Reply-To: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1398072576.61843.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <5354F071.1060008@luckymail.com> I am sorry, but this doesn't look like internationalization. Rather it seems like another attempt by the British to force their culture upon the rest of the world. The richness of world-wide naming conventions for people is simply ignored, Putin Vladimir Vladimirovič won't be able to use his full name (let alone in the order required), and this will lead to World War III. William J. G. Overington, please admit that others know so much more about internationalization than you do, and stop these imperialist off-topic activities. Charlie Ruland ☘ William_J_G Overington wrote: > Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries > > Imagine please if museum and art gallery websites each were to have an international webpage in its on-line shop. > > If there were on the webpage colourful symbols, one each for Surname, Forename, Card number and so on and the end-user could display text in his or her own language by displaying the appropriate read-out label next to each symbol, thus localizing the web page, then that could be very helpful. > > I have produced designs for nine symbols. There are two glyphs for each symbol, one colourful and one monochrome. The symbols are octagonal, using not quite a regular octagon. 
In the monochrome glyphs there is a border around the edge, yet in the colourful glyphs there is no border. The colourful glyphs are displayed in blue and orange, the idea being that the effect to the viewer is of blue upon an orange background. > > The designs are influenced by heraldry to some extent. > > This is because I consider Surname to be the most important, so I used a heraldic chief. > > Then for Forename I used a pale as Forename is different from Surname yet accompanies to Surname to form a name. > > A bar is used for Address. > > Name as on card may be different from Forename concatenated with a space and Surname, due to use of Mr Mrs Miss Ms etc and initials, hence the reason for the design not being a union of a chief and one pale. > > Two bars are used for Card number. > > Card start date and Card expiry date seemed liked brackets, so that inspired the designs. > > Card security code is just a design so as to be different from the other designs yet not use any diagonal shapes. > > Delivery address is included to allow for the possibility of sending a gift directly to someone who lives at another address. > > I am hoping to attach images showing the designs to other posts in this thread. 
> > William Overington > > 21 April 2014 > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From wjgo_10009 at btinternet.com Mon Apr 21 07:34:38 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Apr 2014 13:34:38 +0100 (BST) Subject: The application of localized read-out labels In-Reply-To: <20140417114347.665a7a7059d7ee80bb4d670165c8327d.a0ad58431a.wbe@email03.secureserver.net> References: <20140417114347.665a7a7059d7ee80bb4d670165c8327d.a0ad58431a.wbe@email03.secureserver.net> Message-ID: <1398083678.81771.YahooMailNeo@web87803.mail.ir2.yahoo.com> Doug Ewell wrote: > It's labeled prominently as a "thought experiment," which means there is no expectation that anyone will implement the format or software which reads it, only think about what would happen if it were implemented. Well, it states as follows. quote This is a thought experiment at present. Automated localization would be by having a file readouts.dat available. In the thought experiment the file is a UTF-16 text file, such as can be saved from the WordPad program by selecting saving as a Unicode Text Document. end quote My reason for putting "This is a thought experiment at present." was that the format has not been tested by me in practical application and is only theoretically based at the present time, yet I am hoping that the situation may change and that the format might become implemented in practice by someone and become widely used; or maybe that the publication of the format will act as a catalyst to someone publishing a format that is accepted, so that the end result of a standardized format is achieved. > I actually read through the document, 18-point body type and all, before noticing this key point. Thank you for reading through the document. 
http://en.wikipedia.org/wiki/Thought_experiment http://en.wikipedia.org/wiki/John_Searle http://en.wikipedia.org/wiki/Philosophy_of_language ---- http://en.wikipedia.org/wiki/Thought_experiment http://en.wikipedia.org/wiki/Backcasting William Overington 21 April 2014 From asmusf at ix.netcom.com Mon Apr 21 09:32:15 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Apr 2014 07:32:15 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <83wqejcc9o.fsf@gnu.org> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> Message-ID: <53552BEF.90308@ix.netcom.com> On 4/21/2014 1:33 AM, Eli Zaretskii wrote: >> Date: Sun, 20 Apr 2014 23:03:20 -0700 >> From: Asmus Freytag >> CC: Eli Zaretskii , unicode at unicode.org, >> Kenneth Whistler >> >>>> Note that the current embedding level is not changed by this rule. >>>> >>>> What does this last sentence mean by "the current embedding level"? >>>> The first bullet of X6 mandates that "the current character's >>>> embedding level" _is_ changed by this rule, so what other "current >>>> embedding level" is alluded to here? >>> I'm punting on that one - can someone else answer this? >>> >>> >>> I assume "current embedding level" here meant "the embedding level of >>> the last entry on the directional status stack". (This is a natural >>> slip to make if you think in terms of an optimized implementation that >>> stores each component of the top of the directional status stack in a >>> variable, as suggested in 3.3.2.) >>> >>> James >>> >> In general, I heartily dislike "specifications" that just narrate a >> particular implementation... > I cannot agree more. > > In fact, my main gripe about the UBA additions in 6.3 are that some of > their crucial parts are not formally defined, except by an algorithm > that narrates a specific implementation.
The two worst examples of > that are the "definitions" of the isolating run sequence and of the > bracket pair. I didn't ask about those because I succeeded to figure > them out, but it took many readings of the corresponding parts of the > document. It is IMO a pity that the two main features added in 6.3 > are based on definitions that are so hard to penetrate, and which > actually all but force you to use the specific implementation > described by the document. > > My working definition that replaces BD13 is this: > > An isolating run sequence is the maximal sequence of level runs of > the same embedding level that can be obtained by removing all the > characters between an isolate initiator and its matching PDI (or > paragraph end, if there is no matching PDI) within those level runs. > > As for bracket pair (BD16), I'm really amazed that a concept as easy > and widely known/used as this would need such an obscure definition > that must have an algorithm as its necessary part. How about this > instead: > > A bracket pair is a pair of an opening paired bracket and a closing > paired bracket characters within the same isolating run sequence, > such that the Bidi_Paired_Bracket property value of the former > character or its canonical equivalent equals the latter character or > its canonical equivalent, and all the opening and closing bracket > characters in between these two are balanced. > > Then we could use the algorithm to explain what it means for brackets > to be balanced (for those readers who somehow don't already know > that). > > Again, thanks for clarifying these subtle issues. I can now proceed > to updating the Emacs bidirectional display with the changes in > Unicode 6.3. > > FWIW here is the restatement of BD16 that I used for myself (and that I put into the source comments of the sample Java implementation): // The following is a restatement of BD 16 using non-algorithmic language. 
//
// A bracket pair is a pair of characters consisting of an opening
// paired bracket and a closing paired bracket such that the
// Bidi_Paired_Bracket property value of the former equals the latter,
// subject to the following constraints.
// - both characters of a pair occur in the same isolating run sequence
// - the closing character of a pair follows the opening character
// - any bracket character can belong at most to one pair, the earliest possible one
// - any bracket character not part of a pair is treated like an ordinary character
// - pairs may nest properly, but their spans may not overlap otherwise

// Bracket characters with canonical decompositions are supposed to be treated
// as if they had been normalized, to allow normalized and non-normalized text
// to give the same result.

Your language is more concise, but you may compare for differences.

A./
-------------- next part --------------
An HTML attachment was scrubbed...
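For comparison, the restatement of BD16 above can be sketched as a small stack-based pairing routine, modeled on the procedure that UAX #9 gives for BD16. This is only a sketch under stated assumptions: the `PAIRS` table is a hypothetical three-entry stand-in for the real Bidi_Paired_Bracket data, the function name `find_bracket_pairs` is invented for illustration, and canonical equivalence, bidi classes and any implementation limits are ignored. The input string is assumed to be a single isolating run sequence.

```python
# Hypothetical, tiny stand-in for the Bidi_Paired_Bracket data (assumption:
# a real implementation would load the full table from BidiBrackets.txt).
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def find_bracket_pairs(run):
    """Return (open_index, close_index) pairs for one isolating run sequence,
    sorted by the position of the opening bracket."""
    stack = []  # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(run):
        if ch in PAIRS:
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack downwards for the nearest
            # opener that this closing bracket can pair with.
            for depth in range(len(stack) - 1, -1, -1):
                expected, open_pos = stack[depth]
                if expected == ch:
                    pairs.append((open_pos, pos))
                    # Pop through the matched opener; openers skipped on the
                    # way down stay unpaired, so spans can never overlap.
                    del stack[depth:]
                    break
            # If nothing matched, the closer is treated as an ordinary
            # character and the stack is left untouched.
    pairs.sort()
    return pairs


print(find_bracket_pairs("a(b[c)d]"))  # [(1, 5)]
print(find_bracket_pairs("([])"))      # [(0, 3), (1, 2)]
```

The first call shows the "no overlap" constraint at work: the square brackets remain unpaired because their span would cross the parenthesis pair. Guillemets are deliberately absent from the table; as far as I am aware they carry no Bidi_Paired_Bracket mapping in the real data either, so in a sentence such as "This is an [«] example [»] for demonstration only." this sketch pairs only the square brackets.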
URL: From doug at ewellic.org Mon Apr 21 09:40:15 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 21 Apr 2014 07:40:15 -0700 Subject: The application of localized read-out labels Message-ID: <20140421074015.665a7a7059d7ee80bb4d670165c8327d.302f747b25.wbe@email03.secureserver.net> William_J_G Overington wrote: > My reason for putting "This is a thought experiment at present." was > that the format has not been tested by me in practical application and > is only theoretically based at the present time, It's not, of course. It's specified in enough detail that conformant files could be created, and consumed by an application. > yet I am hoping that the situation may change and that the format > might become implemented in practice by someone and become widely > used; or maybe that the publication of the format will act as a > catalyst to someone publishing a format that is accepted, so that the > end result of a standardized format is achieved. It could be argued that this is at least part of the hypothesis for the "experiment." The expected result, not quite stated, is that the format will in fact be used, or will in fact stimulate the creation of a similar format. Because, of course, if there is no hypothesis, then this is neither a Gedankenexperiment nor any other kind of experiment, just an exercise in creating a file format, which is engineering. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From verdy_p at wanadoo.fr Mon Apr 21 11:18:09 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Apr 2014 18:18:09 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <53552BEF.90308@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> Message-ID: There are some cases where these rules will not be clear enough. 
Look at the following, where overlaps do occur but directionality still matters:

"This is an [«] example [»] for demonstration only."

There are two possible parsings if you just consider a hierarchic layout where overlaps are disallowed:

1. "This is an [...] for demonstration only.", embedding "«...»", itself embedding "] example [" (here the square brackets match externally)

2. "This is an [...] example [...] for demonstration only.", embedding two spans for "«" and "»" separately (they also pair externally)

Now suppose that the term "example" is translated into Arabic: it is not very clear how the UBA will work while preserving the correct pairing direction of the 3 pairs (one pair is "«...»", there are two pairs for "[...]"). Still, all 3 pairs have a coherent direction that Bidi reordering or glyph mirroring should not mix up.

I see only one solution to tag such text so that it will behave correctly: either the two pairs of square brackets or the pair of guillemets should be encoded with isolated Bidi overrides. But then what happens to the ordering of the surrounding text?

There should be a stable way to encode this case so that the UBA still preserves the correct reading order, the expected semantics and orientation of the pairs, and the fact that the guillemets are effectively not embedding the brackets but the translated word "example".

There are several ways to use Bidi-override or Bidi-embedding controls; I don't know which one is better, but all of them should still work with the UBA. I just hope that the complex case of the brackets in the middle ("]...[") can be handled gracefully.
In my opinion this would require embedding and isolating each square bracket, so that they no longer match each other (externally they are treated as symbols with transparent direction). But how do we ensure that the sequence "[«]" still occurs before the RTL (Arabic) word "example", followed by the sequence "[»]", and that the rest of the sentence ("for demonstration only") still occurs in the correct order? We would also have to embed/isolate "example", or the whole sequence "[«] example [»]", so that the main sentence "This is an ... for demonstration only" still has a coherent reading direction.

Such cases are not so exceptional: they arise when representing two distinct parallel readings of the same text, where a reading for one kind of pair simply treats the other pairs as ignored, "transparently".

It would be an interesting case to investigate for validating UBA implementations in a conformance test.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From asmusf at ix.netcom.com Mon Apr 21 12:48:43 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Apr 2014 10:48:43 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> Message-ID: <535559FB.60100@ix.netcom.com> Philippe, I fail to understand how your post contributes to the topic. The issue was unclear wording of the specification, not deficiencies in the UBA or the PBA in general.
Let's keep this discussion limited to issues of wording for the *existing* specification. Feel free to start a new discussion about something else under a new subject. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Apr 21 13:14:35 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 21 Apr 2014 11:14:35 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 Message-ID: <20140421111435.665a7a7059d7ee80bb4d670165c8327d.24acdc5211.wbe@email03.secureserver.net> From: Asmus Freytag wrote: > In general, I heartily dislike "specifications" that just narrate a > particular implementation... I agree completely.
I see this with CLDR as well; there is a more or less implicit assumption that I will be using ICU to implement whatever is being described. I don't care how robust and well-tested a wheel is; as a developer, I should be able to use the specification to reinvent it if I like. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From verdy_p at wanadoo.fr Mon Apr 21 13:23:57 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Apr 2014 20:23:57 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <535559FB.60100@ix.netcom.com> References: <83ha5ofgdl.fsf@gnu.org> <535426DF.9020308@ix.netcom.com> <5354B4A8.4030201@ix.netcom.com> <83wqejcc9o.fsf@gnu.org> <53552BEF.90308@ix.netcom.com> <535559FB.60100@ix.netcom.com> Message-ID: It is on topic because the proposed description attempts to explain how paired brackets should match and how this will then affect the rendering in bidirectional contexts. This is exactly the kind of thing that is difficult, because the proposed description assumes that paired brackets are organized hierarchically. Quote: "both characters of a pair occur in the same isolating run sequence" (does not work here: the sequences are not fully isolated) Quote: "any bracket character can belong at most to one pair, the earliest possible one" (does not work here: this is not the earliest possible one) 2014-04-21 19:48 GMT+02:00 Asmus Freytag : > Philippe, > > I fail to understand how your post contributes to the topic. > > The issue was unclear wording of the specification, not deficiencies in > the UBA or the PBA in general. > > Let's keep this discussion limited to issues of wording for the *existing* > specification. Feel free to start a new discussion about something else > under a new subject. > > A./ > > > On 4/21/2014 9:18 AM, Philippe Verdy wrote: > > There are some cases where these rules will not be clear enough.
Look at > the following where overlaps do occur; but directionality still matters: > > "This is an [<<] example [>>] for demonstration only." > > There are two parsings possible if you just consider a hierarchic layout > where overlaps are disabled: > > 1. "This is an [...] for demonstration only.", embedding "<<...>>", itself > embedding "] example [" (here the square brackets match externally) > > 2. "This is an [...] example [...] for demonstration only.", embedding > two spans for "<<" and ">>" separately (they also pair externally) > > Now suppose that the term "example" is translated in Arabic: It is not > very clear how the UBA will work while preserving the correct pariing > direction of the 3 pairs (one pair is "<<...>>", there are two pairs for > "[...]"). Still all 3 pairs have a coherent direction that Bidi-reordering > or glyph mirorring should not mix. > > I see only one solution to tag such text so that it will behave > correctly: either the two pairs of square brackets or the pair or > guillemets should be encoded with isolated Bidi overrides. But then what is > happening to the ordering of the surrounding text? > > There should be a stable way to encode this case so that UBA will still > work in preserving the correct reding order, and the expected semantics and > orientation of pairs and the fact that the guillemets are effectively not > really embedding the brackets, but the translated word "example". > > There are several ways to use Bidi-override or Bidi-embedding controls; > I don't know which one is better but all of them should still work with > UBA. I just hope that the complex cases of the brackets in the middle > ("]...[") can be handled gracefully. 
> > My opinion would require embedding and isolating the each square > bracket, they will no longer match together (externally they are treated as > symbols with transparent direction, but how we ensure that the sequence > "[<<]" will still occur before the RTL (Arabic) "example" word followed by > the sequence "[>>]" and that the rest of the sentence (for demonstration > only) will still occur in the correct order : we also have to embed/isolate > the "example", or the whole sequence "[<<] example [>>]" so that the main > sentence "This is an ... for demonstration only" will stil have a coherent > reading direction. > > Such cases are not so exceptional because they occur to represent two > distinct parallel readings of te same text, where in one reading for one > kind of pairs will simply treat the other pairs as ignored "transparently". > > It should be an interesting case to investigate for validating UBA > algorithms in a conformance test case. > > > 2014-04-21 16:32 GMT+02:00 Asmus Freytag