From chris.fynn at gmail.com Tue Jul 1 00:20:37 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 1 Jul 2014 11:20:37 +0600
Subject: Characters that should be displayed?
In-Reply-To:
References: <53B07F03.5010105@cs.tut.fi>
Message-ID:

On 30/06/2014, David Starner wrote:
> On Sun, Jun 29, 2014 at 2:02 PM, Jukka K. Korpela wrote:
>> They might be seen as “not displayable by normal rendering”, so yes. On the
>> practical side, although Private Use characters should not be used in public
>> information interchange, they are increasingly popular in “icon font” tricks.
> Since when is HTML necessarily public information interchange? I can't
> imagine where you would better use private use characters than in HTML,
> where a font can be named but you don't have enough control over the
> format to enter the data in some other format.

+1

If the font specified in the CSS has glyphs for those characters, they should be displayed.

There are also some Chinese national standards (do they count as a "private" agreement?) that make use of PUA and supplementary PUA characters - and quite a few web pages using them.

- C

From kojiishi at gluesoft.co.jp Tue Jul 1 00:55:00 2014
From: kojiishi at gluesoft.co.jp (Koji Ishii)
Date: Tue, 1 Jul 2014 05:55:00 +0000
Subject: Characters that should be displayed?
In-Reply-To:
References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp>
Message-ID:

>> Thanks for the reply. It’s very likely that the page contains images, borders, background, etc., so I can recognize that all the text is missing. But neither missing nor garbled text suggests to me how to fix it. I’d try another browser, then give up viewing the page.
>
> If it didn't suggest how to fix it to you before today, it should
> suggest it to you today. If you get a bunch of fallback characters,
> your first guess should be font problems.
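As a practical aside, the private-use code points discussed in this thread (both the BMP PUA and the supplementary private-use planes) are easy to detect before interchange; a minimal Python sketch (the helper name and the sample string are mine, for illustration only):

```python
import unicodedata

def private_use_chars(text):
    """Return (index, "U+XXXX") pairs for private-use characters.

    General category "Co" covers the BMP PUA (U+E000..U+F8FF) and the
    supplementary private-use planes 15 and 16.
    """
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(text)
            if unicodedata.category(c) == "Co"]

# An "icon font" style string: one BMP PUA glyph, one from plane 15.
sample = "menu \ue001 settings \U000F0001"
print(private_use_chars(sample))  # [(5, 'U+E001'), (16, 'U+F0001')]
```

A gatekeeper aimed at public interchange could warn on such hits, while an icon-font pipeline would deliberately let them through.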
> Anyone using scripts with poor support, especially stuff stored in the
> PUA, will recognize right off when the text isn't displaying.

I agree it’s nice if it suggests that. The point of disagreement is that fallback glyphs suggest nothing to me about how to fix it. But apart from my personal opinion, your opinion is noted and will not be disregarded when the CSS WG discusses this, along with all the others, agreeing and disagreeing.

I am still looking for any indication of security aspects to this issue. If there are none, the CSS WG might discuss this as a user issue.

/koji

From asmusf at ix.netcom.com Tue Jul 1 02:12:21 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 01 Jul 2014 00:12:21 -0700
Subject: Characters that should be displayed?
In-Reply-To:
References: <53B067D5.6050102@ix.netcom.com> <4BB2DAA1-264F-4B88-92C5-FF604BC32D41@gluesoft.co.jp> <0FFF3859-04E1-489C-9C14-D8BCC5D5C354@gluesoft.co.jp>
Message-ID: <53B25F55.1050803@ix.netcom.com>

On 6/30/2014 10:55 PM, Koji Ishii wrote:
>>> Thanks for the reply. It’s very likely that the page contains images, borders, background, etc., so I can recognize that all the text is missing. But neither missing nor garbled text suggests to me how to fix it. I’d try another browser, then give up viewing the page.
>> If it didn't suggest how to fix it to you before today, it should
>> suggest it to you today. If you get a bunch of fallback characters,
>> your first guess should be font problems. Anyone using scripts with
>> poor support, especially stuff stored in the PUA, will recognize right
>> off when the text isn't displaying.
> I agree it’s nice if it suggests that. The point of disagreement is that fallback glyphs suggest nothing to me about how to fix it. But apart from my personal opinion, your opinion is noted and will not be disregarded when the CSS WG discusses this, along with all the others, agreeing and disagreeing.
>
> I am still looking for any indication of security aspects to this issue. If there are none, the CSS WG might discuss this as a user issue.

My thinking is that I would expect security issues in the display of identifiers, but the examples should all be code points that are ruled out as valid parts of identifiers to begin with.

Similar issues might apply when displaying the text of legal contracts or documents. Again, there should be dedicated techniques to secure them.

So, I fail to understand under what conditions this suggestion is "best practice".

A./

> /koji
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

From jjc at jclark.com Tue Jul 1 22:10:45 2014
From: jjc at jclark.com (James Clark)
Date: Wed, 2 Jul 2014 10:10:45 +0700
Subject: Thai unalom symbol
Message-ID:

One of the most pervasive religious symbols in traditional Thai culture is the "unalom" (อุณาโลม). I was wondering whether it might be appropriate to encode this in Unicode.

Visually, it looks like KHOMUT U+0E5B, rotated 90 degrees counterclockwise and then reflected about its vertical axis (so that the spiral is right-handed rather than left-handed). However, the semantics are unrelated. KHOMUT marks the end of a chapter or document, whereas unalom is a religious, auspicious symbol.

More specifically, unalom represents the tuft of white hair curling from a mole between the eyebrows of the Buddha [1], and thus symbolises enlightenment. It is related to the concept of a third eye. The word อุณาโลม is a compound of อุณา, derived from the Sanskrit word urna and the Pali word unna, which literally mean wool but are also used to refer to auspicious marks on the forehead of the Buddha.

The unalom is widespread in Thailand. For example, the Thai Red Cross Society was originally founded as the Red Unalom Society, and its logo was a red Unalom combined with a cross. It forms the main component of the seal of Rama I (founder of the current Thai Royal dynasty).
It is even part of the logo of the Royal Thai Army. The unalom is used in Thai Buddhist culture in similar ways to how a cross is used in Western Christian culture.

The Royal Institute Thai Dictionary (the authoritative dictionary for the Thai language) has an entry for unalom showing the symbol:

https://pbs.twimg.com/media/BrdB2IsCYAAu4gP.jpg:large

One issue is whether this ought to be encoded in the Thai block or as a non-script-specific symbol. The concept of an auspicious mark on the forehead of the Buddha is a common feature of Buddhist art and culture. However, the exact form of the mark varies: sometimes it is a circular dot and sometimes a spiral. The Thai form of the unalom is also found in other South-East Asian countries bordering Thailand (Laos, Myanmar, Cambodia).

My inclination would be to include it in the Thai block, on the basis that it needs to harmonize typographically with U+0E5B, and that Khmer has its own separate version of khomut (U+17DA). Devanagari om U+0950 is a precedent for encoding a religious symbol in a script block. In fact, some scholars consider the unalom or urna to be a representation of the om sound [1]. Since it is not a character (in the sense of being part of the Thai writing system), the name should probably be "THAI UNALOM".

James

[1] Buddhist Sculpture of Northern Thailand, Carol Stratton
http://books.google.com/books?id=EVpSSigMi4cC&lpg=PA50&ots=v8uqIcyyFX&dq=urna%20unalom&pg=PA50#v=onepage&q&f=false

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From roozbeh at unicode.org Tue Jul 1 23:28:43 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Tue, 1 Jul 2014 21:28:43 -0700
Subject: Thai unalom symbol
In-Reply-To:
References:
Message-ID:

I think this is a very good candidate for encoding. I would recommend writing a proposal for the UTC and including a discussion of the potential location.
On Tue, Jul 1, 2014 at 8:10 PM, James Clark wrote:
> One of the most pervasive religious symbols in traditional Thai culture
> is the "unalom" (อุณาโลม). I was wondering whether it might be
> appropriate to encode this in Unicode.
>
> Visually, it looks like KHOMUT U+0E5B, rotated 90 degrees
> counterclockwise and then reflected about its vertical axis (so that the
> spiral is right-handed rather than left-handed). However, the semantics
> are unrelated. KHOMUT marks the end of a chapter or document, whereas
> unalom is a religious, auspicious symbol.
>
> More specifically, unalom represents the tuft of white hair curling from a
> mole between the eyebrows of the Buddha [1], and thus symbolises
> enlightenment. It is related to the concept of a third eye. The word
> อุณาโลม is a compound of อุณา, derived from the Sanskrit word urna and the
> Pali word unna, which literally mean wool but are also used to refer to
> auspicious marks on the forehead of the Buddha.
>
> The unalom is widespread in Thailand. For example, the Thai Red Cross
> Society was originally founded as the Red Unalom Society, and its logo was
> a red Unalom combined with a cross. It forms the main component of the seal
> of Rama I (founder of the current Thai Royal dynasty). It is even part of
> the logo of the Royal Thai Army. The unalom is used in Thai Buddhist culture
> in similar ways to how a cross is used in Western Christian culture.
>
> The Royal Institute Thai Dictionary (the authoritative dictionary for the
> Thai language) has an entry for unalom showing the symbol:
>
> https://pbs.twimg.com/media/BrdB2IsCYAAu4gP.jpg:large
>
> One issue is whether this ought to be encoded in the Thai block or as a
> non-script-specific symbol. The concept of an auspicious mark on the
> forehead of the Buddha is a common feature of Buddhist art and culture.
> However, the exact form of the mark varies: sometimes it is a circular dot
> and sometimes a spiral. The Thai form of the unalom is also found in other
> South-East Asian countries bordering Thailand (Laos, Myanmar, Cambodia).
>
> My inclination would be to include it in the Thai block, on the basis that
> it needs to harmonize typographically with U+0E5B, and that Khmer has its
> own separate version of khomut (U+17DA). Devanagari om U+0950 is a
> precedent for encoding a religious symbol in a script block. In fact, some
> scholars consider the unalom or urna to be a representation of the om sound
> [1]. Since it is not a character (in the sense of being part of the Thai
> writing system), the name should probably be "THAI UNALOM".
>
> James
>
> [1] Buddhist Sculpture of Northern Thailand, Carol Stratton
> http://books.google.com/books?id=EVpSSigMi4cC&lpg=PA50&ots=v8uqIcyyFX&dq=urna%20unalom&pg=PA50#v=onepage&q&f=false

From jkorpela at cs.tut.fi Wed Jul 2 02:18:20 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 02 Jul 2014 10:18:20 +0300
Subject: Thai unalom symbol
In-Reply-To:
References:
Message-ID: <53B3B23C.8050001@cs.tut.fi>

2014-07-02 6:10, James Clark wrote:
> The unalom is widespread in Thailand. For example, the Thai Red Cross
> Society was originally founded as the Red Unalom Society, and its logo
> was a red Unalom combined with a cross. It forms the main component of
> the seal of Rama I (founder of the current Thai Royal dynasty). It is
> even part of the logo of the Royal Thai Army. The unalom is used in Thai
> Buddhist culture in similar ways to how a cross is used in Western
> Christian culture.

Is there evidence of its use in text? This should be an essential question when discussing whether it should be defined as a Unicode character. Use as “logo”
or, rather, as a standalone graphic symbol, does not really mean it is used as a character.

> The Royal Institute Thai Dictionary (the authoritative dictionary for
> the Thai language) has an entry for unalom showing the symbol:
>
> https://pbs.twimg.com/media/BrdB2IsCYAAu4gP.jpg:large

A dictionary may explain the name of a symbol by showing the symbol, but this does not constitute use as a character.

> Since it is not a character (in the
> sense of being part of the Thai writing system), the name should
> probably be "THAI UNALOM".

I think that here you mean “letter” when you write “character”.

Yucca

From chris.fynn at gmail.com Wed Jul 2 02:22:51 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 2 Jul 2014 13:22:51 +0600
Subject: Thai unalom symbol
In-Reply-To:
References:
Message-ID:

On 02/07/2014, James Clark wrote:
> The Royal Institute Thai Dictionary (the authoritative dictionary for the
> Thai language) has an entry for unalom showing the symbol:
> https://pbs.twimg.com/media/BrdB2IsCYAAu4gP.jpg:large

Are there other dictionaries and books which use this symbol in text? With three or four more examples like this, I should think it would certainly be a good candidate for encoding. (Use in logos is not so persuasive.)

From jjc at jclark.com Wed Jul 2 02:48:16 2014
From: jjc at jclark.com (James Clark)
Date: Wed, 2 Jul 2014 14:48:16 +0700
Subject: Thai unalom symbol
In-Reply-To: <53B3B23C.8050001@cs.tut.fi>
References: <53B3B23C.8050001@cs.tut.fi>
Message-ID:

On Wed, Jul 2, 2014 at 2:18 PM, Jukka K. Korpela wrote:
>
> Is there evidence of its use in text? This should be an essential question
> when discussing whether it should be defined as a Unicode character. Use as
> “logo” or, rather, as a standalone graphic symbol does not really mean it
> is used as a character.

It is a standalone graphic symbol with a religious and astrological significance. There are a number of such symbols in Unicode, for example U+2626-U+262A and U+1F540-U+1F54A. My understanding is that such symbols are eligible to be encoded in Unicode, though there are many factors to be considered:

http://www.unicode.org/pending/symbol-guidelines.html

James

From verdy_p at wanadoo.fr Wed Jul 2 04:01:25 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 2 Jul 2014 11:01:25 +0200
Subject: Thai unalom symbol
In-Reply-To:
References: <53B3B23C.8050001@cs.tut.fi>
Message-ID:

These guidelines are quite old (1999). But even with these, I'm convinced that the proposed symbol is OK for encoding, and that it should harmonize with the glyphs for letters of the Thai script. The dictionary example is convincing enough for me, as it is hard to see it as just an illustration: it is inserted within the text itself.

The local usage by the Red Cross is not convincing (the symbol is used only as part of a logo, even though the Red Cross also uses other religious symbols, like the Christian/Swiss cross or the Islamic crescent, both of which are encoded as characters too). In fact the Red Cross does not use these symbols alone, but with defined colors and within a rectangular area with some extra surrounding padding and often a border; there is no transparency of the background, both the foreground and background use specific plain colors, and in such usage the logo is almost always much larger than any surrounding text.

However, the Red Cross usage still demonstrates that the symbol is widely understood as a symbol of peace and respect; it would not have been chosen if this meaning were not locally well understood. This local meaning is a sign that it is used, without further explanation, in other contexts and can replace text. It is certainly a better candidate than many emoji or dingbats (like the skull and bones) or the mysterious symbols of the Phaistos Disk.

----

Anyway, I'm still looking in Unicode for the symbol of the peacock ("paon" in French), i.e.
the male bird exhibiting its large wheel of plumes. Also the Faravahar symbol, used by Zoroastrians throughout the Middle East and as far as India for several millennia:

http://op-ed.the-environmentalist.org/2007/04/zoroastrianisms-influence-on-judaism.html

2014-07-02 9:48 GMT+02:00 James Clark :
> On Wed, Jul 2, 2014 at 2:18 PM, Jukka K. Korpela wrote:
>> Is there evidence of its use in text? This should be an essential
>> question when discussing whether it should be defined as a Unicode
>> character. Use as “logo” or, rather, as a standalone graphic symbol does
>> not really mean it is used as a character.
>
> It is a standalone graphic symbol with a religious and astrological
> significance. There are a number of such symbols in Unicode, for example
> U+2626-U+262A and U+1F540-U+1F54A. My understanding is that such symbols
> are eligible to be encoded in Unicode, though there are many factors to
> be considered:
>
> http://www.unicode.org/pending/symbol-guidelines.html
>
> James

From public at khwilliamson.com Wed Jul 2 09:39:40 2014
From: public at khwilliamson.com (Karl Williamson)
Date: Wed, 02 Jul 2014 08:39:40 -0600
Subject: Unencoded cased scripts and unencoded titlecase letters
Message-ID: <53B419AC.3050703@khwilliamson.com>

It's my sense that there are very few cased scripts in existence that are ever likely to be encoded by Unicode that haven't already been so encoded. I also suspect that very few new titlecase letters will ever be added to Unicode, as I believe these were all added to maintain round-trip compatibility with existing standards, and there just aren't any unencoded standards. Am I right?
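The existing titlecase letters can be enumerated from character properties; a minimal Python sketch (the exact count depends on the Unicode version shipped with the interpreter; U+01C5/U+01C6 are the Dž/dž digraphs kept for legacy round-tripping):

```python
import sys
import unicodedata

# Every assigned code point whose general category is Lt (titlecase letter).
titlecase = [chr(cp) for cp in range(sys.maxunicode + 1)
             if unicodedata.category(chr(cp)) == "Lt"]

print(len(titlecase))                  # 31 in recent Unicode versions
print(unicodedata.name(titlecase[0]))  # the lowest one is U+01C5

# str.title() uses the Unicode titlecase mapping: U+01C6 (dž) titlecases
# to the digraph U+01C5 (Dž), not to the uppercase digraph U+01C4.
print("\u01C6".title() == "\u01C5")    # True
```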
From public at khwilliamson.com Wed Jul 2 10:02:56 2014
From: public at khwilliamson.com (Karl Williamson)
Date: Wed, 02 Jul 2014 09:02:56 -0600
Subject: Corrigendum #9
In-Reply-To: <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com>
References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com>
Message-ID: <53B41F20.8030204@khwilliamson.com>

On 06/12/2014 11:14 PM, Peter Constable wrote:
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson
> Sent: Wednesday, June 11, 2014 9:30 PM
>
>> I have something like a library that was written a long time ago
>> (not by me) assuming that noncharacters were illegal in open interchange.
>> Programs that use the library were guaranteed that they would not receive
>> noncharacters in their input.
>
> I haven't read every post in the thread, so forgive me if I'm making incorrect inferences.
>
> I get the impression that you think that Unicode conformance requirements have historically provided that guarantee, and that Corrigendum #9 broke that. If so, then that is a mistaken understanding of Unicode conformance.

Any real-world application dealing with Unicode inputs needs to be protected from "bad" inputs. These can come in the form of malicious attacks, or the result of a noisy transmission, or just plain mistakes. It doesn't matter. Generally, a gatekeeper application is employed to furnish this protection, so that the other application doesn't have to keep checking things at every turn. And, since software is expensive to write and prone to error, a generic gatekeeper is usually used, shared among many applications. Such a gatekeeper may very well be configurable to let through some inputs that would normally be considered bad, to accommodate rare special cases.
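A configurable gatekeeper of the kind described might look like this; a minimal Python sketch (the function names and the policy flag are hypothetical, not from any real library):

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points: U+FDD0..U+FDEF plus
    the last two code points of each of the 17 planes (U+xxFFFE/U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def gatekeep(text: str, allow_noncharacters: bool = False) -> str:
    """Apply the caller's policy to noncharacters in the input.

    The default mirrors the pre-Corrigendum-#9 reading: strip them unless
    the protected application says it is prepared to handle them.
    """
    if allow_noncharacters:
        return text
    return "".join(c for c in text if not is_noncharacter(ord(c)))

print(gatekeep("abc\ufdd0\uffffdef"))  # 'abcdef'
print(gatekeep("abc\ufffe", allow_noncharacters=True) == "abc\ufffe")  # True
```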
In UTF-8, an example would be that Sun, I'm told, and for reasons I've forgotten or never knew, did not want raw NUL bytes to appear in text streams, so used the overlong sequence \xC0\x80 to represent them; overlong sequences generally being considered "bad" because they could be used to insert malicious payloads into the input.

The original wording of the non-character text, "should never be interchanged", doesn't necessarily indicate that they will never be valid in input, but rather that their deliberate appearance there would be something quite rare, and a gatekeeper application should default to not passing them through. A protected application could indicate to the gatekeeper that it is prepared to handle non-character inputs, but the default should be to not accept them.

Corrigendum #9 has changed this so much that people are coming to me and saying that inputs may very well have non-characters, and that the default should be to pass them through. Since we have no published wording for how the TUS will absorb Corrigendum #9, I don't know how this will play out. But this abrupt a change seems wrong to me, and it was done without public input or really adequate time to consider its effects.

Non-characters are still designed solely for internal use, and hence I think the default for a gatekeeper should still be to exclude them.

> Here is what has historically been said in the way of conformance requirements related to non-characters:
>
> TUS 1.0: There were no conformance requirements stated. This recommendation was given:
> "U+FFFF and U+FFFE are reserved and should not be transmitted or stored."
>
> This same recommendation was repeated in later versions. However, it must be recognized that "should" statements are never absolute requirements.
>
> Conformance requirements first appeared in TUS 2.0:
>
> TUS 2.0, TUS 3.0:
> "C5 A process shall not interpret either U+FFFE or U+FFFF as an abstract character."
>
> TUS 4.0:
> "C5 A process shall not interpret a noncharacter code point as an abstract character."
>
> "C10 When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points."
>
> Btw, note that C10 makes the assumption that a valid coded character sequence can include non-character code points.
>
> TUS 5.0 (trivially different from TUS 4.0):
> C2 = TUS 4.0, C5
>
> "C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points."
>
> TUS 6.0:
> C2 = TUS 5.0, C2
>
> "C7 When a process purports not to modify the interpretation of a valid coded character
> sequence, it shall make no change to that coded character sequence other than the possible
> replacement of character sequences by their canonical-equivalent sequences."
>
> Interestingly, the change to C7 no longer permits non-characters to be replaced or removed at all by a process that claims to leave the interpretation intact.
>
> So, there was a change in 6.0 that could impact conformance claims of existing implementations. But there has never been any guarantee made _by Unicode_ that non-character code points will never occur in open interchange. Interchange has always been discouraged, but never prohibited.
>
> Peter

From leob at mailcom.com Wed Jul 2 11:11:46 2014
From: leob at mailcom.com (Leo Broukhis)
Date: Wed, 2 Jul 2014 09:11:46 -0700
Subject: Contrastive use of kratka and breve
Message-ID:

Here

https://upload.wikimedia.org/wikipedia/commons/a/a4/Contrastive_use_of_kratka_and_breve.JPG

is an example of й and и
+ U+0306 COMBINING BREVE used contrastively (/j/ vs short /i/), thanks to a difference in typographic style between the Cyrillic breve (kratka) and the regular breve.

For me, in Win7, using и + U+0306 results in a contrast. But given that и + U+0306 is the canonical decomposition of й, and a renderer is allowed, if not encouraged, to use the glyph for й every time it sees и + U+0306, what is the right (portable) way to do that? Would и + ZWNJ + U+0306 work? Should it?

I'd like to reply to
https://ru.wikipedia.org/wiki/%D0%9F%D1%80%D0%BE%D0%B5%D0%BA%D1%82:%D0%92%D0%BD%D0%B5%D1%81%D0%B5%D0%BD%D0%B8%D0%B5_%D1%81%D0%B8%D0%BC%D0%B2%D0%BE%D0%BB%D0%BE%D0%B2_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%BE%D0%B2_%D0%BD%D0%B0%D1%80%D0%BE%D0%B4%D0%BE%D0%B2_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B2_%D0%AE%D0%BD%D0%B8%D0%BA%D0%BE%D0%B4#.D0.9E_.D0.BA.D1.80.D0.B0.D1.82.D0.BA.D0.B5_.D0.B8_.D0.B1.D1.80.D0.B5.D0.B2.D0.B8.D1.81.D0.B5

Thanks,
Leo

From asmusf at ix.netcom.com Wed Jul 2 11:37:06 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 02 Jul 2014 09:37:06 -0700
Subject: Corrigendum #9
In-Reply-To: <53B41F20.8030204@khwilliamson.com>
References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com>
Message-ID: <53B43532.6090208@ix.netcom.com>

On 7/2/2014 8:02 AM, Karl Williamson wrote:
> Corrigendum #9 has changed this so much that people are coming to me
> and saying that inputs may very well have non-characters, and that the
> default should be to pass them through. Since we have no published
> wording for how the TUS will absorb Corrigendum #9, I don't know how
> this will play out. But this abrupt a change seems wrong to me, and
> it was done without public input or really adequate time to consider
> its effects.
>
> Non-characters are still designed solely for internal use, and hence I
> think the default for a gatekeeper should still be to exclude them.

This is the crux of this issue. The Corrigendum was introduced with the intent to allow users to lean on library and tool writers to adopt a permissive attitude - by removing what many among the developers of such software had seen as language that endorsed or even encouraged strong filtering.

> On 06/12/2014 11:14 PM, Peter Constable wrote:
>
> I get the impression that you think that Unicode conformance
> requirements have historically provided that guarantee, and that
> Corrigendum #9 broke that. If so, then that is a mistaken
> understanding of Unicode conformance.

Not so much an issue of "guarantee", but language that was treating strong filtering as the default, and that was understood as such in the community.

A./

From verdy_p at wanadoo.fr Wed Jul 2 12:34:49 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 2 Jul 2014 19:34:49 +0200
Subject: Contrastive use of kratka and breve
In-Reply-To:
References:
Message-ID:

ZWNJ is not supposed to join or disjoin combining diacritics from a base letter (it has only a limited use in Indic scripts, between letters, to prevent clusters with subjoined letters).

CGJ would be better used to prevent canonical compositions, but it won't normally give a distinctive semantic.

It looks like you have a case where you would need to encode a variant of the base letter и (with a variation selector). The rendering may still be fuzzy, as there's no such variant registered for that CYRILLIC LETTER I.

Probably, given that you have fonts making a specific contrast for и + U+0306, adding a CGJ in the middle would do the trick if it is only to prevent the canonically equivalent composition, which could occur in many places.
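The composition and CGJ behaviour under discussion can be checked directly against a normalizer; a small illustrative Python sketch using the standard unicodedata module:

```python
import unicodedata

SHORT_I = "\u0438"  # CYRILLIC SMALL LETTER I (и)
BREVE = "\u0306"    # COMBINING BREVE
CGJ = "\u034f"      # COMBINING GRAPHEME JOINER

# NFC recomposes the decomposed pair to U+0439 (й), which is exactly why
# a renderer may substitute the precomposed glyph.
print(unicodedata.normalize("NFC", SHORT_I + BREVE) == "\u0439")  # True

# A CGJ between base and mark blocks that composition, so the sequence
# survives normalization unchanged.
blocked = SHORT_I + CGJ + BREVE
print(unicodedata.normalize("NFC", blocked) == blocked)  # True
```

Note that CGJ is ignored in interpretation, so whether a font actually shows a differently shaped breve for the blocked sequence remains font-dependent.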
2014-07-02 18:11 GMT+02:00 Leo Broukhis :
> Here
>
> https://upload.wikimedia.org/wikipedia/commons/a/a4/Contrastive_use_of_kratka_and_breve.JPG
>
> is an example of й and и + U+0306 COMBINING BREVE used contrastively (/j/
> vs short /i/), thanks to a difference in typographic style between the
> Cyrillic breve (kratka) and the regular breve.
>
> For me, in Win7, using и + U+0306 results in a contrast. But given that
> и + U+0306 is the canonical decomposition of й, and a renderer is allowed,
> if not encouraged, to use the glyph for й every time it sees и + U+0306,
> what is the right (portable) way to do that? Would и + ZWNJ + U+0306 work?
> Should it?
>
> Thanks,
> Leo

From verdy_p at wanadoo.fr Wed Jul 2 12:57:35 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 2 Jul 2014 19:57:35 +0200
Subject: Contrastive use of kratka and breve
In-Reply-To:
References:
Message-ID:

The alternative would be to encode a separate CYRILLIC COMBINING "LUNAR" BREVE for the case of the initial /j/, or to encode that letter /j/ specifically.

However, in your examples that letter /j/ only occurs in word-initial position, where phonology transforms the long /i/ into /j/. Contextually, you can still make the difference even if the glyphs are not contrasted.

But maybe you have examples showing that letter /j/ in a non-initial position (at the start of a syllable, i.e. after a vowel) or at the end of a word (also after a vowel), where such a contextual guess is not easy to decide between /j/ and long /i/.

It would be interesting to have details on where a kratka is expected and where a breve is expected, and where they can be confused. If you cannot find such examples, then a simple rendering rule would be to use the lunar form of the breve in syllable-start position (over и at the start of a word, or over и after another vowel), and the kratka form (with both arms terminated by a rounded bowl) in all other positions (over и in the middle of a syllable after a consonant... or also over и at the end of a word, unless you also need the distinction there between a final long /i/ and a final /j/?),

2014-07-02 19:34 GMT+02:00 Philippe Verdy :
> ZWNJ is not supposed to join or disjoin combining diacritics from a base
> letter (it has only a limited use in Indic scripts, between letters, to
> prevent clusters with subjoined letters).
>
> CGJ would be better used to prevent canonical compositions, but it won't
> normally give a distinctive semantic.
>
> It looks like you have a case where you would need to encode a variant of
> the base letter и (with a variation selector). The rendering may still be
> fuzzy, as there's no such variant registered for that CYRILLIC LETTER I.
>
> Probably, given that you have fonts making a specific contrast for и +
> U+0306, adding a CGJ in the middle would do the trick if it is only to
> prevent the canonically equivalent composition, which could occur in many
> places.
>
> 2014-07-02 18:11 GMT+02:00 Leo Broukhis :
>> Here
>>
>> https://upload.wikimedia.org/wikipedia/commons/a/a4/Contrastive_use_of_kratka_and_breve.JPG
>>
>> is an example of й and и + U+0306 COMBINING BREVE used contrastively (/j/
>> vs short /i/), thanks to a difference in typographic style between the
>> Cyrillic breve (kratka) and the regular breve.
>> For me, in Win7, using и + U+0306 results in a contrast. But given that
>> и + U+0306 is the canonical decomposition of й, and a renderer is allowed,
>> if not encouraged, to use the glyph for й every time it sees и + U+0306,
>> what is the right (portable) way to do that? Would и + ZWNJ + U+0306 work?
>> Should it?
>>
>> Thanks,
>> Leo

From jkorpela at cs.tut.fi Wed Jul 2 13:13:30 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 02 Jul 2014 21:13:30 +0300
Subject: Contrastive use of kratka and breve
In-Reply-To:
References:
Message-ID: <53B44BCA.4050305@cs.tut.fi>

2014-07-02 20:34, Philippe Verdy wrote:
> CGJ would be better used to prevent canonical compositions, but it won't
> normally give a distinctive semantic.

In the question, a visual difference was desired. The Unicode FAQ says:

“The semantics of CGJ are such that it should impact only searching and sorting, for systems which have been tailored to distinguish it, while being otherwise ignored in interpretation. The CGJ character was encoded with this purpose in mind.”

http://www.unicode.org/faq/char_combmark.html

So CGJ is to be used when you specifically want the same rendering but wish to make a distinction in processing.
Yucca From prosfilaes at gmail.com Wed Jul 2 13:19:32 2014 From: prosfilaes at gmail.com (David Starner) Date: Wed, 2 Jul 2014 11:19:32 -0700 Subject: Corrigendum #9 In-Reply-To: <53B41F20.8030204@khwilliamson.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> Message-ID: On Wed, Jul 2, 2014 at 8:02 AM, Karl Williamson wrote: > In > UTF-8, an example would be that Sun, I'm told, and for reasons I've > forgotten or never knew, did not want raw NUL bytes to appear in text > streams, so used the overlong sequence \xC0\x80 to represent them; overlong > sequences generally being considered "bad" because they could be used to > insert malicious payloads into the input. In C, NUL ends a string. If you have to run data that may have NUL characters through C functions, you can't store the NULs as \0. I might argue 11111111b for 0x00 in UTF-8 would be technically legal--the standard never specifies which bit sequences correspond to which byte values--but \xC0\x80 would probably be more reliably processed by existing code. -- Kie ekzistas vivo, ekzistas espero. From leob at mailcom.com Wed Jul 2 13:20:19 2014 From: leob at mailcom.com (Leo Broukhis) Date: Wed, 2 Jul 2014 11:20:19 -0700 Subject: Contrastive use of kratka and breve In-Reply-To: References: Message-ID: > The alternative would be to encode a separate CYRILLIC COMBINING "LUNAR" BREVE for the case of the initial /j/, or to encode that letter /j/ specifically. This is in effect what they are proposing on the wiki discussion page. A correction: the lunar breve is for the short /i/ sound, and the rounded bowl breve ('kraTka', cognate with 'shorT') is for the /j/ sound. You're right, it seems that the distinction can be made contextually. 
The lunar breve variant seems to appear only after ?, which is a letter, see http://ru.wikipedia.org/wiki/%D0%9D%D0%B5%D0%BD%D0%B5%D1%86%D0%BA%D0%B8%D0%B9_%D1%8F%D0%B7%D1%8B%D0%BA#.D0.A2.D0.B0.D0.B1.D0.BB.D0.B8.D1.86.D0.B0_.D1.81.D0.BE.D0.BE.D1.82.D0.B2.D0.B5.D1.82.D1.81.D1.82.D0.B2.D0.B8.D1.8F_.D0.B0.D0.BB.D1.84.D0.B0.D0.B2.D0.B8.D1.82.D0.BE.D0.B2 for the Nenets alphabet. It doesn't make a distinction between breves; but now it seems that we have two new characters to encode! Thanks, Leo On Wed, Jul 2, 2014 at 10:57 AM, Philippe Verdy wrote: > The alternative would be to encode a separate CYRILLIC COMBINING "LUNAR" > BREVE for the case of the initial /j/, or to encode that letter /j/ > specifically. > > However in your examples, that letter /j/ only occurs in the word initial > position where phonology transforms the long /i/ into /j/. Contextually you > can still make the difference even if the glyphs are not contrasted. > > But may be you have examples showing that letter /j/ in a non-initial > position (at start of a syllable, i.e. after a vowel) or at end of words > (also after a vowel) where such contextual guess is not easy to decide > between /j/ and long /i/. > > It would be interesting to have details where krafka is expected and where > a breve is expected, and where they can be confusing. If you cannot find > such example then a simple rendering rule would be to use the lunar form of > the breve in syllable start position (over ? at start of a word, or over > ? after another vowel), and the krafka form (with both arms terminated by > a rounded bowl) in all other positions (over ? in the middle of a > syllable after a consonnant... or also over ? 
at end of word unles you > also need the distinction there between a final long /i/ and a final /j/?), > > > > 2014-07-02 19:34 GMT+02:00 Philippe Verdy : > > ZWNJ is not supposed to join or disjoin combing diacritics from a base >> letter (even if it has such limited use in Indic scripts, but only between >> letters to prevent clusters with subjoined letters), >> >> CGJ would be better used to prevent canonical compositions but it won't >> normally give a distinctive semantic. >> >> It looks like you have a case where you would need to encode a variant of >> the base letter ? (with a variant selector). The rendering may be still >> fuzzy as there's no such variant registered for that CYRILIC LETTER I. >> >> Probably, given that you have fonts making a specific contrast for ? + >> U+306, adding a CGJ in the middle would do the trick if it is only to >> prevent the canonically equivalent composition which could occur in many >> places. >> >> >> >> 2014-07-02 18:11 GMT+02:00 Leo Broukhis : >> >>> Here >>> >>> https://upload.wikimedia.org/wikipedia/commons/a/a4/Contrastive_use_of_kratka_and_breve.JPG >>> is an example of ? and ? + U+0306 COMBINING BREVE used contrastively >>> (/j/ vs short /i/) thanks to a difference in typographic style of Cyrillic >>> breve (kratka) and regular breve. >>> For me in Win7 using ? + U+0306 results in a contrast, but given that ? >>> + U+0306 is a canonical decomposition of ? and a renderer is allowed, if >>> not encouraged, to use the glyph for ? every time it sees ? + U+0306, what >>> is the right (portable) way to do that? Would ? + ZWNJ + U+0306 work? >>> Should it? 
>>> >>> I'd like to reply to >>> https://ru.wikipedia.org/wiki/%D0%9F%D1%80%D0%BE%D0%B5%D0%BA%D1%82:%D0%92%D0%BD%D0%B5%D1%81%D0%B5%D0%BD%D0%B8%D0%B5_%D1%81%D0%B8%D0%BC%D0%B2%D0%BE%D0%BB%D0%BE%D0%B2_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%BE%D0%B2_%D0%BD%D0%B0%D1%80%D0%BE%D0%B4%D0%BE%D0%B2_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B2_%D0%AE%D0%BD%D0%B8%D0%BA%D0%BE%D0%B4#.D0.9E_.D0.BA.D1.80.D0.B0.D1.82.D0.BA.D0.B5_.D0.B8_.D0.B1.D1.80.D0.B5.D0.B2.D0.B8.D1.81.D0.B5 >>> >>> Thanks, >>> Leo >>> >>> >>> >>> _______________________________________________ >>> Unicode mailing list >>> Unicode at unicode.org >>> http://unicode.org/mailman/listinfo/unicode >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jul 2 13:44:38 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 2 Jul 2014 20:44:38 +0200 Subject: Contrastive use of kratka and breve In-Reply-To: <53B44BCA.4050305@cs.tut.fi> References: <53B44BCA.4050305@cs.tut.fi> Message-ID: Aren't we in such a case where the distinction (supposed to be guessed contextually) would be needed only to facilitate contextual analysis of text (such as counting syllables, or transforming the text to count them in a later process, or searching text phonologically, even if the look of the rendered glyph does not really need the distinction)? Anyway we have two variants of the breve (with "rounded bowls" and "lunar"). In the example given, the word-initial position was using the lunar form, and the word-medial and word-final positions were using the form with rounded bowls (sort of mix between a breve and a diaeresis). 
I still don't understand which one is for the short /i/ and which one is for /j/, and what is then the representation of the long /ii/, or how you represent and distinguish a short /ji/, a long /jii/ (like in English/French language name "Yi"), or /iji/ (like in French city and name of a cream "Chantilly"), and how you would represent the diphthong /aj/ and distinguish /a?i/ (two syllables like in French verb "haïr") from /aji/ (like in French noun "taillis") and /aj/ (like in French adjective "thaï" or verb/noun "taille"). All I know is that Cyrillic has dedicated letters for common syllables /ja/, /je/, /ju/ (inherited from old ligatures) and languages using Cyrillic vary in how they use them (add also the different Ukrainian letter for /i/ and its use of diaeresis in some cases and ambiguities in writing borrowed foreign words notably when they are trademarks like "Wikimedia" whose orthography varies depending on author's interpretation of the phonology). 2014-07-02 20:13 GMT+02:00 Jukka K. Korpela : > 2014-07-02 20:34, Philippe Verdy wrote: > > CGJ would be better used to prevent canonical compositions but it won't >> normally give a distinctive semantic. >> > > In the question, visual difference was desired. The Unicode FAQ says: > “The semantics of CGJ are such that it should impact only searching and > sorting, for systems which have been tailored to distinguish it, while > being otherwise ignored in interpretation. The CGJ character was encoded > with this purpose in mind.” > http://www.unicode.org/faq/char_combmark.html > > So CGJ is to be used when you specifically want the same rendering but > wish to make a distinction in processing. > > Yucca > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From leob at mailcom.com Wed Jul 2 13:48:06 2014 From: leob at mailcom.com (Leo Broukhis) Date: Wed, 2 Jul 2014 11:48:06 -0700 Subject: Contrastive use of kratka and breve In-Reply-To: <53B44BCA.4050305@cs.tut.fi> References: <53B44BCA.4050305@cs.tut.fi> Message-ID: Jukka, If the font happens to have the lunar breve at U+0306, whereas the letter ? has the rounded bowl breve, using CGJ should guarantee distinctive rendering, because <?, CGJ, U+0306> is not canonically equivalent to <?> (cf. "The sequences <u, umlaut> and <u, CGJ, umlaut> are not canonically equivalent.") and therefore the renderer must not be allowed to pick the glyph for ? instead as its canonical composition. This is a hack, but a legal hack. Leo On Wed, Jul 2, 2014 at 11:13 AM, Jukka K. Korpela wrote: > 2014-07-02 20:34, Philippe Verdy wrote: > > CGJ would be better used to prevent canonical compositions but it won't >> normally give a distinctive semantic. >> > > In the question, visual difference was desired. The Unicode FAQ says: > “The semantics of CGJ are such that it should impact only searching and > sorting, for systems which have been tailored to distinguish it, while > being otherwise ignored in interpretation. The CGJ character was encoded > with this purpose in mind.” > http://www.unicode.org/faq/char_combmark.html > > So CGJ is to be used when you specifically want the same rendering but > wish to make a distinction in processing. > > Yucca > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Wed Jul 2 13:55:53 2014 From: jkorpela at cs.tut.fi (Jukka K. 
Korpela) Date: Wed, 02 Jul 2014 21:55:53 +0300 Subject: Contrastive use of kratka and breve In-Reply-To: References: Message-ID: <53B455B9.1060203@cs.tut.fi> 2014-07-02 19:11, Leo Broukhis wrote: > Here > https://upload.wikimedia.org/wikipedia/commons/a/a4/Contrastive_use_of_kratka_and_breve.JPG > is an example of ? and ? + U+0306 COMBINING BREVE used contrastively > (/j/ vs short /i/) thanks to a difference in typographic style of > Cyrillic breve (kratka) and regular breve. I can't tell where the difference comes from, since this is a bitmap image. But my hypothesis is that this has nothing to do with making a difference in typographic style between kratka and breve. Rather, it is a matter of mixing fonts: in one case, you have the normal letter ?, and in another case, you have the letter ? and the combining breve *taken from another font*. > For me in Win7 using ? + U+0306 results in a contrast, This does not really depend on the operating system; rather, on the rendering system of the program being used and on the fonts used. For example, using Word 2013, ? and ? + U+0306 COMBINING BREVE look the same for many fonts, but if you select a font that does not contain U+0306, the results are odd. In some lucky cases, you might get a breve symbol in the right place. Using ? + U+034F COMBINING GRAPHEME JOINER + U+0306 COMBINING BREVE causes different results. For many fonts, such as Calibri, the result is different in the sense that the breve (kratka) has been displaced to the right, and the result looks odd. Changing the font to Arial results in a mess, since CGJ is displayed visibly. Apparently there are problems with CGJ in rendering software, even though CGJ should make no difference. Using ZWNJ between the base character and the diacritic is more logical, but it seems to yield similar results. 
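What normalization actually does with these sequences can be checked directly with a Unicode library; a minimal Python sketch (the Cyrillic letters here are illustrative, using the short-i pair under discussion):

```python
import unicodedata

# NFC composes U+0438 CYRILLIC SMALL LETTER I + U+0306 COMBINING BREVE
# into the precomposed U+0439 CYRILLIC SMALL LETTER SHORT I.
decomposed = "\u0438\u0306"
assert unicodedata.normalize("NFC", decomposed) == "\u0439"

# Inserting U+034F COMBINING GRAPHEME JOINER between the base letter and
# the breve blocks that composition: the three-character sequence is its
# own NFC form and is not canonically equivalent to U+0439.
blocked = "\u0438\u034f\u0306"
assert unicodedata.normalize("NFC", blocked) == blocked

# The reverse direction: NFD turns the precomposed letter back into
# base letter + combining breve.
assert unicodedata.normalize("NFD", "\u0439") == decomposed
print("all normalization checks passed")
```

So a renderer that normalizes its input may legitimately substitute the precomposed glyph for the plain two-character sequence, but not for the CGJ-separated one; whether it renders the CGJ sequence gracefully is another matter.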
The point is that whether or not these control characters should prevent the rendering of a base letter and a diacritic as a precomposed glyph, they tend to do that, but you mostly won't like the results. > but given that ? > + U+0306 is a canonical decomposition of ? and a renderer is allowed, if > not encouraged, to use the glyph for ? every time it sees ? + U+0306, > what is the right (portable) way to do that? Would ? + ZWNJ + U+0306 > work? Should it? As far as I can see, there is nothing that says that it (or any other similar method) should work, but it may often “work”, in a manner you might not like. > I'd like to reply to > https://ru.wikipedia.org/wiki/%D0%9F%D1%80%D0%BE%D0%B5%D0%BA%D1%82:%D0%92%D0%BD%D0%B5%D1%81%D0%B5%D0%BD%D0%B8%D0%B5_%D1%81%D0%B8%D0%BC%D0%B2%D0%BE%D0%BB%D0%BE%D0%B2_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%BE%D0%B2_%D0%BD%D0%B0%D1%80%D0%BE%D0%B4%D0%BE%D0%B2_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B2_%D0%AE%D0%BD%D0%B8%D0%BA%D0%BE%D0%B4#.D0.9E_.D0.BA.D1.80.D0.B0.D1.82.D0.BA.D0.B5_.D0.B8_.D0.B1.D1.80.D0.B5.D0.B2.D0.B8.D1.81.D0.B5 I think the idea of using CGJ is more wrong than the idea of using ZWNJ. But neither works *properly*. You can make a distinction between ? and ? with kratka (caron), at the character level, and applications are allowed to render them differently. But Unicode specifies no way to request such rendering, still less guarantee it. Yucca From kent.karlsson14 at telia.com Wed Jul 2 14:12:02 2014 From: kent.karlsson14 at telia.com (Kent Karlsson) Date: Wed, 02 Jul 2014 21:12:02 +0200 Subject: Contrastive use of kratka and breve In-Reply-To: Message-ID: Sounds to me that what you really want is to have two different breve characters (assuming that the distinction is real and intentional, and not a happenstance). That would require encoding a new combining character, AFAICT... /Kent K On 2014-07-02 20:48, "Leo Broukhis" wrote: > Jukka, > > If the font happens to have lunar breve at U+0306, whereas the letter ? 
has > the rounded bowl breve, using CGJ should guarantee to achieve distinctive > rendering, because is not canonically equivalent to? U+0306> (cf. "The sequences and are not > canonically equivalent.") and therefore the renderer must not be allowed to > pick the glyph for ? instead as its canonical composition. This is a hack, but > a legal hack. > > Leo > > > On Wed, Jul 2, 2014 at 11:13 AM, Jukka K. Korpela wrote: >> 2014-07-02 20:34, Philippe Verdy wrote: >> >>> CGJ would be better used to prevent canonical compositions but it won't >>> normally give a distinctive semantic. >> >> In the question, visual difference was desired. The Unicode FAQ says: >> ?The semantics of CGJ are such that it should impact only searching and >> sorting, for systems which have been tailored to distinguish it, while being >> otherwise ignored in interpretation. The CGJ character was encoded with this >> purpose in mind.? >> http://www.unicode.org/faq/char_combmark.html >> >> >> So CGJ is to be used when you specifically want the same rendering but wish >> to make a distinction in processing. >> >> Yucca >> >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Wed Jul 2 14:19:16 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 2 Jul 2014 21:19:16 +0200 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> Message-ID: 2014-07-02 20:19 GMT+02:00 David Starner : > I might argue 11111111b for 0x00 in UTF-8 would be technically > legal It is not. UTF-8 specifies the effective value of each 8-bit byte: if you store 11111111b in that byte you have exactly the same result as when storing 0xFF or -1 (unless your system uses "bytes" larger than 8 bits (the era of PDP mainframes with non-8-bit bytes is long over; all devices around use 8-bit byte values on their interface, even if they may internally encode exposed bits with longer sequences, such as with MFM encodings, or by adding extra control and clock/sync bits, or could use three rotating sequences of 3 states with automatic synchronization by negative or positive transitions at every encoded bit position, plus some breaking rules on some bits to find start of packets)). > the standard never specifies which bit sequences correspond to > which byte values--but \xC0\x80 would probably be more reliably > processed by existing code. But the same C libraries are also using -1 as end-of-stream values and if they are converted to bytes, they will be indistinguishable from the NULL character that could be stored anywhere in the stream. The main reason why 0xC0,0x80 was chosen instead of 0x00 is historical: Java's JNI interface originally used strings encoded as 8-bit sequences without a separate parameter to specify the length of the encoded sequence. 
0x00 was then used as in the basic ANSI C string library (string.h and stdio.h) and Java was ported to heterogeneous systems (including those small devices whose "int" type was also 8-bit only, blocking the use of BOTH 0x00 and 0xFF in some system I/O APIs). At least 0xC0,0x80 was safe (and not used by UTF-8, but at that time UTF-8 was still not a precisely defined standard, and it was legal to represent U+0000 as 0xC0,0x80; the prohibition of overlong sequences in UTF-8 or Unicode came many years later; Java used the early informative-only RFC specification, which was also supported by ISO, before ISO/IEC 10646-1 and Unicode 1.1 were aligned). Unicode and ISO/IEC 10646 have both changed (each in incompatible ways) but it was necessary to have both standards compatible with each other. Java could not change its ABI for JNI, it was too late. However, Java added another UTF-16-based interface for strings to JNI. But still this interface does not enforce UTF-16 rules about paired surrogates (just like C, C++ or even JavaScript). But the added 16-bit string interface for JNI has a separate field for storing the encoded string length (in 16-bit code units), so that interface uses the standard 0x0000 value for U+0000. As much as possible JNI extension libraries should use that 16-bit interface (which is simpler to handle also with modern OS APIs compatible with Unicode, notably on Windows). But the 8-bit JNI interface is still commonly used in JNI extension libraries for Unix/Linux (because it is safer to handle the conversion from 16-bit to 8-bit in the JVM than in the external JNI library using its own memory allocation and unable to use the garbage collector of the managed memory of the JVM). 
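The byte-level contrast can be sketched in a few lines of Python (standard library only; the helper name `modified_utf8_escape` is made up for illustration and handles only the NUL case, not Java's treatment of supplementary characters):

```python
# Standard UTF-8 encodes U+0000 as the single byte 0x00...
assert "\u0000".encode("utf-8") == b"\x00"

# ...and a conforming decoder must reject the overlong form 0xC0 0x80.
try:
    b"\xc0\x80".decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# Java's modified UTF-8 instead writes U+0000 as the pair 0xC0 0x80, so
# the encoded bytes never contain 0x00 and stay safe to pass through
# NUL-terminated C string APIs (illustrative helper, NUL case only).
def modified_utf8_escape(data: bytes) -> bytes:
    return data.replace(b"\x00", b"\xc0\x80")

assert modified_utf8_escape(b"a\x00b") == b"a\xc0\x80b"
print("ok")
```

Decoding such data with a strict standard UTF-8 decoder fails on exactly the bytes that make it safe for C strings, which is why the two encodings must not be confused.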
The Java modified UTF-8 encoding is still used in the binary encoding of compiled class files (this is invisible to applications that only see 16-bit encoded strings, unless they have to parse or generate compiled class files). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jul 2 14:40:52 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 2 Jul 2014 21:40:52 +0200 Subject: Contrastive use of kratka and breve In-Reply-To: <53B455B9.1060203@cs.tut.fi> References: <53B455B9.1060203@cs.tut.fi> Message-ID: 2014-07-02 20:55 GMT+02:00 Jukka K. Korpela : > I think the idea of using CGJ is more wrong than the idea of using ZWNJ. > I think exactly the opposite. CGJ brings the distinction in that it prohibits the canonical combination. As the resulting string is not canonically equivalent, it is also semantically distinctive. Note that we are in the case where CGJ would be used here just after a base letter. We are not in the situation where CGJ is used between TWO combining characters, where the first one has a *higher* combining class than the second one with a non-zero combining class (this case is where CGJ is used to prevent reordering of combining diacritics during normalization: this case occurs when these diacritics which usually don't interact with most base characters may collide on the same position and need an explicit difference, for example when the cedilla occurs above a letter rather than below it and interacts with another diacritic above that letter, and the relative order of the cedilla and that diacritic matters). This is used notably in the Hebrew script (due to the "strange" historic assignment of distinct non-zero combining classes to most of its diacritics even when they can interact and relative ordering is significant both semantically and graphically). 
We are also not in the situation where CGJ occurs between TWO combining characters having the *same* combining class, in order to stack them differently. In that case, no reordering occurs, even without CGJ, and the relative order is significant, but there's a distinction between vertical and horizontal stacking. CGJ is also used in cases where there's an enclosing diacritic and another one: should that diacritic be inside or outside the enclosing diacritic? (we have examples in mathematical notations with diacritics like arrows) The case of CGJ used immediately after a base letter (or after a combining character with combining class 0) encodes either a variant of the base letter, or of the diacritic after CGJ. The only case where CGJ is currently not used is at end of streams; or just before a base letter (but there could be applications for tricky cases in Indic scripts) where ZWJ and ZWNJ are preferred (to control ligatures and contextual letter forms such as subjoined letters). -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Wed Jul 2 15:08:42 2014 From: leob at mailcom.com (Leo Broukhis) Date: Wed, 2 Jul 2014 13:08:42 -0700 Subject: Contrastive use of kratka and breve In-Reply-To: References: Message-ID: The difference is real and intentional, but isn't it akin to the difference between (IIRC a discussion several years ago) the Polish/Czech acute and the French acute - the former is more vertical? It has been decided that there is no need for two combining acute signs. Leo On Wed, Jul 2, 2014 at 12:12 PM, Kent Karlsson wrote: > Sounds to me that what you really want is to have two different breve > characters > (assuming that the distinction is real and intentional, and not a > happenstance). > That would require encoding a new combining character, AFAICT... 
> > /Kent K > > > > Den 2014-07-02 20:48, skrev "Leo Broukhis" : > > Jukka, > > If the font happens to have lunar breve at U+0306, whereas the letter ? > has the rounded bowl breve, using CGJ should guarantee to achieve > distinctive rendering, because is not canonically > equivalent to (cf. "The sequences and umlaut> are not canonically equivalent.") and therefore the renderer must > not be allowed to pick the glyph for ? instead as its canonical > composition. This is a hack, but a legal hack. > > Leo > > > On Wed, Jul 2, 2014 at 11:13 AM, Jukka K. Korpela > wrote: > > 2014-07-02 20:34, Philippe Verdy wrote: > > CGJ would be better used to prevent canonical compositions but it won't > normally give a distinctive semantic. > > > In the question, visual difference was desired. The Unicode FAQ says: > ?The semantics of CGJ are such that it should impact only searching and > sorting, for systems which have been tailored to distinguish it, while > being otherwise ignored in interpretation. The CGJ character was encoded > with this purpose in mind.? > http://www.unicode.org/faq/char_combmark.html < > http://www.unicode.org/faq/char_combmark.html> > > So CGJ is to be used when you specifically want the same rendering but > wish to make a distinction in processing. > > Yucca > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode < > http://unicode.org/mailman/listinfo/unicode> > > > > ------------------------------ > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Wed Jul 2 15:33:03 2014 From: leob at mailcom.com (Leo Broukhis) Date: Wed, 2 Jul 2014 13:33:03 -0700 Subject: Missing Nenets letters? Message-ID: http://www.omniglot.com/writing/nenets.htm shows two letters (? and ?) 
in both versions of the Cyrillic Nenets alphabet ("voiced taser?" and "unvoiced taser?") that don't seem to be encoded as letters. Should they be encoded, or are U+2019 and U+201D good enough? Leo -------------- next part -------------- An HTML attachment was scrubbed... URL: From jf at colson.eu Wed Jul 2 17:02:31 2014 From: jf at colson.eu (=?windows-1252?Q?Jean-Fran=E7ois_Colson?=) Date: Thu, 03 Jul 2014 00:02:31 +0200 Subject: Missing Nenets letters? In-Reply-To: References: Message-ID: <53B48177.3000602@colson.eu> On 02/07/14 22:33, Leo Broukhis wrote: > http://www.omniglot.com/writing/nenets.htm > > shows two letters (? and ?) in both versions of the Cyrillic Nenets > alphabet ("voiced taser?" and "unvoiced taser?") that don't seem to be > encoded as letters. Should they be encoded, or are U+2019 and U+201D good > enough? > Simply use U+02BC MODIFIER LETTER APOSTROPHE instead of U+2019 and U+02EE MODIFIER LETTER DOUBLE APOSTROPHE instead of U+201D. Nothing new under the sun? From richard.wordingham at ntlworld.com Wed Jul 2 17:07:08 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Jul 2014 23:07:08 +0100 Subject: Contrastive use of kratka and breve In-Reply-To: <53B44BCA.4050305@cs.tut.fi> References: <53B44BCA.4050305@cs.tut.fi> Message-ID: <20140702230708.36292bbf@JRWUBU2> On Wed, 02 Jul 2014 21:13:30 +0300 "Jukka K. Korpela" wrote: > 2014-07-02 20:34, Philippe Verdy wrote: > > CGJ would be better used to prevent canonical compositions but it > > won't normally give a distinctive semantic. > In the question, visual difference was desired. The Unicode FAQ says: > “The semantics of CGJ are such that it should impact only searching > and sorting, for systems which have been tailored to distinguish it, > while being otherwise ignored in interpretation. The CGJ character > was encoded with this purpose in mind.” > http://www.unicode.org/faq/char_combmark.html Unfortunately, the Unicode FAQs need a thorough review. 
There is quite a bit with a low to zero truth value, especially about CGJ. > So CGJ is to be used when you specifically want the same rendering > but wish to make a distinction in processing. As Philippe has pointed out, a CGJ can affect rendering by encouraging renderers to apply marks in the order they appear in the normalised texts. I am puzzled by the difference between diaeresis and umlaut; if black letter styles do distinguish them, as has been denied, then CGJ does affect the rendering, for CGJ may be used to distinguish a diaeresis from an umlaut. Richard. From richard.wordingham at ntlworld.com Wed Jul 2 17:27:01 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Jul 2014 23:27:01 +0100 Subject: Contrastive use of kratka and breve In-Reply-To: References: <53B44BCA.4050305@cs.tut.fi> Message-ID: <20140702232701.17b2b72d@JRWUBU2> On Wed, 2 Jul 2014 11:48:06 -0700 Leo Broukhis wrote: > If the font happens to have lunar breve at U+0306, whereas the letter > ? has the rounded bowl breve, using CGJ should guarantee to achieve > distinctive rendering, because <?, CGJ, U+0306> is not canonically > equivalent to <?> (cf. "The sequences <u, umlaut> and <u, CGJ, umlaut> are not > canonically equivalent.") and therefore the > renderer must not be allowed to pick the glyph for ? instead as its > canonical composition. This is a hack, but a legal hack. And this may be the way to go, because we cannot change the canonical decompositions of U+0439 CYRILLIC SMALL LETTER SHORT I into <U+0438, U+0306> or of U+045E CYRILLIC SMALL LETTER SHORT U into <U+0443, U+0306>. Unfortunately, it is the mark that is part of U+0439 that seems to have been miscoded. Note also that the contrast is found in dictionaries, not in ordinary writing. In Russian the semivowel is called и краткое while the shortened vowel may be referred to as краткое и, so calling this 'kratka v. breve' is not very helpful. 
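Those two decompositions can be confirmed against the Unicode Character Database; a small Python check:

```python
import unicodedata

# U+0439 CYRILLIC SMALL LETTER SHORT I and U+045E CYRILLIC SMALL LETTER
# SHORT U decompose canonically into base letter + U+0306 COMBINING
# BREVE; these mappings are frozen by the normalization stability policy.
assert unicodedata.decomposition("\u0439") == "0438 0306"
assert unicodedata.decomposition("\u045e") == "0443 0306"

# Consequently NFC maps the decomposed sequences back to the precomposed
# letters, which is why a renderer may treat <base, U+0306> and the
# precomposed character interchangeably.
assert unicodedata.normalize("NFC", "\u0438\u0306") == "\u0439"
assert unicodedata.normalize("NFC", "\u0443\u0306") == "\u045e"
print("decomposition checks passed")
```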
Tapani Salminen put up some contrasting usages (in both Tundra and Forest Nenets) at http://www.helsinki.fi/~tasalmin/cyrillic_breve.html . Richard. From richard.wordingham at ntlworld.com Wed Jul 2 17:43:34 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Jul 2014 23:43:34 +0100 Subject: Contrastive use of kratka and breve In-Reply-To: References: Message-ID: <20140702234334.10e21ae2@JRWUBU2> On Wed, 2 Jul 2014 13:08:42 -0700 Leo Broukhis wrote: > The difference is real and intentional, but isn't it akin to the > difference between (IIRC a discussion several years ago) the > Polish/Czech acute and the French acute - the former is more > vertical? Not really. Look at the third entry at http://www.helsinki.fi/~tasalmin/Forest_Nenets_short_i_1.jpg , where the two occur side-by-side. Unification should depend on their being contextual variants of one another, and difficulty in laying down rules is the justification for U+017F LATIN SMALL LETTER LONG S being a separate character. Richard. From richard.wordingham at ntlworld.com Wed Jul 2 17:59:55 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 2 Jul 2014 23:59:55 +0100 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> Message-ID: <20140702235955.45550665@JRWUBU2> On Wed, 2 Jul 2014 21:19:16 +0200 Philippe Verdy wrote: > 2014-07-02 20:19 GMT+02:00 David Starner : > > > I might argue 11111111b for 0x00 in UTF-8 would be technically > > legal > But the same C libraries are also using -1 as end-of-stream values > and if they are converted to bytes, they will be undistinctable from > the NULL character that could be stored everywhere in the stream. 
A 0xFF byte in a narrow character stream is converted to 0x00FF (int is at least 16 bits wide) in the interfaces while the narrow character end-of-stream value EOF is required to be negative. Unfortunately, the wide character end-of-stream marker WEOF is not required to be negative, but it is not allowed to be a representable character. C appears to prohibit U+FFFF as well as supplementary characters if wchar_t is only 16 bits wide. Richard. From verdy_p at wanadoo.fr Wed Jul 2 18:23:14 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 3 Jul 2014 01:23:14 +0200 Subject: Contrastive use of kratka and breve In-Reply-To: <20140702234334.10e21ae2@JRWUBU2> References: <20140702234334.10e21ae2@JRWUBU2> Message-ID: The angle and form (straight or curved, with wedge, with rounded bowl or not, attached or detached from the letter) of the acute accent is not really defined, all variants are possible, including the Czech/Polish form. All that matters is the main direction of slanting. The only unacceptable rendering is a pure horizontal or vertical form (but there still exist some typographic styles, mostly used in logos) that use horizontal strokes not distinguishing visually the acute and grave accents, notably over capitals (this is acceptable for short titles or headings and for trademarks, whose exact orthography is not very important. And even more on capitals notably at start of words, where there's no ambiguity in French as it can only be É with acute; the distinction of acute and grave accents in French only occurs over letter e, which is the only one using an acute accent; and there's never any grave accent over e at start of words; the circumflex over E can also be easily inferred from the same glyph at start of words, as it occurs only in well-known words like the auxiliary verb "être".) For this reason the French accents are frequently flat if they are present over capitals. The grave accent occurs on initial capitals only in the preposition "à" 
where the grave accent is also unambiguous, the only one possible, so it can be flattened too. At the end of words (or before final mute letters (e)(s)), this is only "é" with acute (there's no "è" with grave and no "ê" with circumflex). Also, I really doubt that the Polish/Czech accents were unified with the accents in French; I would probably bet on Italian or even Spanish, from their presence in the Spanish Netherlands and contacts with Hanseatic leagues in harbours of the North Sea up to the Baltic, with influence on the Prussian kingdom (Spanish and Italian both have acute accents over all important vowels, but no grave and no circumflex in Italian, so it can be flattened as well), but Italian fonts originally used more vertical shapes. I think that what made the Czech and Polish accents more vertical was their use of double accents side by side rather than on top of each other. -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Wed Jul 2 19:15:38 2014 From: leob at mailcom.com (Leo Broukhis) Date: Wed, 2 Jul 2014 17:15:38 -0700 Subject: Missing Nenets letters? In-Reply-To: <53B48177.3000602@colson.eu> References: <53B48177.3000602@colson.eu> Message-ID: Thank you, but how convenient! Calling a letter a "modifier" makes it possible to avoid re-encoding the same shape in various alphabets. Leo On Wed, Jul 2, 2014 at 3:02 PM, Jean-François Colson wrote: > > Le 02/07/14 22:33, Leo Broukhis a écrit : > > http://www.omniglot.com/writing/nenets.htm >> >> shows two letters (’ and ”) in both versions of the Cyrillic Nenets >> alphabet ("voiced taser?" and "unvoiced taser?") that don't seem to be >> encoded as letters. Should they be encoded, or 2019 and 201D are good >> enough? >> >> > Simply use U+02BC modifier letter apostrophe instead of U+2019 > and U+02EE modifier letter double apostrophe instead of U+201D. > > Nothing new under the sun?
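Colson's suggestion can be checked against the Unicode general categories: the curly quotation marks are punctuation, while the modifier-letter apostrophes are letters (category Lm), which is what a mark used as part of an alphabet needs for casing, word selection, and identifier rules to behave. An illustrative check in Python (the character choices simply mirror the code points named above):

```python
import unicodedata

# Punctuation marks vs. the modifier letters Colson recommends instead.
for ch in ('\u2019', '\u201d', '\u02bc', '\u02ee'):
    print('U+%04X %-40s %s' % (ord(ch),
                               unicodedata.name(ch),
                               unicodedata.category(ch)))
# U+2019 and U+201D report category Pf (final punctuation);
# U+02BC and U+02EE report category Lm (modifier letter).
```

A word-boundary or identifier algorithm that treats Lm as letter-like will therefore keep a U+02BC-bearing word together, where U+2019 would split it.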
> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Thu Jul 3 00:39:21 2014 From: leob at mailcom.com (Leo Broukhis) Date: Wed, 2 Jul 2014 22:39:21 -0700 Subject: Contrastive use of kratka and breve In-Reply-To: <20140702234334.10e21ae2@JRWUBU2> References: <20140702234334.10e21ae2@JRWUBU2> Message-ID: ? with lunate breve is not a letter of the alphabet, the breve is just an indication to the reader of the dictionary that the ? in this particular word is pronounced short. While there may be homographs that differ in pronunciation only by the vowel length, the alphabet doesn't provide for that distinction. Leo On Wed, Jul 2, 2014 at 3:43 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Wed, 2 Jul 2014 13:08:42 -0700 > Leo Broukhis wrote: > > > The difference is real and intentional, but isn't it akin to the > > difference between (IIRC a discussion several years ago) the > > Polish/Czech acute and the French acute - the former is more > > vertical? > > Not really. Look at the third entry at > http://www.helsinki.fi/~tasalmin/Forest_Nenets_short_i_1.jpg , where > the two occur side-by-side. Unification should depend on their being > contextual variants or one another, and difficult in laying down rules > is the justification for U+017F LATIN SMALL LETTER LONG S being a > separate character. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jf at colson.eu Thu Jul 3 02:39:19 2014 From: jf at colson.eu (=?ISO-8859-1?Q?Jean-Fran=E7ois_Colson?=) Date: Thu, 03 Jul 2014 09:39:19 +0200 Subject: Contrastive use of kratka and breve In-Reply-To: References: <20140702234334.10e21ae2@JRWUBU2> Message-ID: <53B508A7.4020107@colson.eu> Le 03/07/14 01:23, Philippe Verdy a écrit : > The angle and form (straight or curved, with wedge, with rounded bowl > or not, attached or detached from the letter) of the acute accent is > not really defined; all variants are possible, including the > Czech/Polish form. > > All that matters is the main direction of slanting. The only > unacceptable rendering is a pure horizontal or vertical form (but > there still exist some typographic styles, mostly used in logos, that > use horizontal strokes not distinguishing visually the acute and grave > accents, notably over capitals; this is acceptable for short titles or > headings and for trademarks, whose exact orthography is not very important. > > And even more so on capitals, notably at the start of words, where there's no > ambiguity in French, as it can only be é with acute; the distinction of > acute and grave accents in French only occurs over the letter e, which is > the only one using an acute accent, and there's never any grave accent > over e at the start of words. Rarely, but not never: èbe, èche, ère, ès, ève > The circumflex over E can also be easily inferred from the same glyph > at the start of words; it occurs only in well-known words like the > auxiliary verb "être".) For this reason the French accents are > frequently flat if they are present over capitals. The grave accent > occurs on initial capitals only in the preposition "à", where the > grave accent is also unambiguous, the only one possible, so it can > be flattened too. At the end of words (or before final mute letters > (e)(s)), this is only "é" with acute (there's no "è" with grave and no > "ê" with circumflex). There are at least agapè, koinè, korè, psychè...
But it's true that a rendering similar to agape-, koine-, kore-, psyche- doesn't make them ambiguous. > > Also, I really doubt that the Polish/Czech accents were unified with > the accents in French; I would probably bet on Italian or even Spanish, > from their presence in the Spanish Netherlands and contacts with > Hanseatic leagues in harbours of the North Sea up to the Baltic, with > influence on the Prussian kingdom (Spanish and Italian both have acute > accents over all important vowels, but no grave and no circumflex in > Italian, so it can be flattened as well), but Italian fonts have > originally used more vertical shapes. > > I think that what made the Czech and Polish accents more vertical was > their use of double accents side by side rather than on top of each other. > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Thu Jul 3 10:12:52 2014 From: prosfilaes at gmail.com (David Starner) Date: Thu, 3 Jul 2014 08:12:52 -0700 Subject: Unencoded cased scripts and unencoded titlecase letters In-Reply-To: <53B419AC.3050703@khwilliamson.com> References: <53B419AC.3050703@khwilliamson.com> Message-ID: On Wed, Jul 2, 2014 at 7:39 AM, Karl Williamson wrote: > It's my sense that there are very few cased scripts in existence that are > ever likely to be encoded by Unicode that haven't already been so encoded. Michael Everson is working on making Cherokee a cased script, and on encoding Osage, a cased script that postdates the birth of Unicode. I do reckon you're correct about titlecased letters, though. -- Kie ekzistas vivo, ekzistas espero.
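The titlecase letters already in the standard (the Latin digraphs encoded for Serbo-Croatian, such as U+01C8) show why cased-script encoding carries a three-way case mapping, and any Unicode-aware runtime exposes it. An illustrative check in Python (the sample word is arbitrary):

```python
import unicodedata

s = '\u01c9ubljana'   # starts with U+01C9 LATIN SMALL LETTER LJ
t = s.title()         # titlecasing selects U+01C8, the titlecase digraph
u = s.upper()         # uppercasing selects U+01C7 instead

print(unicodedata.name(t[0]))  # the Lt digraph, distinct from the Lu one
print(unicodedata.name(u[0]))
```

The point is that titlecase is a third mapping, not "uppercase the first letter": the same input character maps to two different targets depending on the operation.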
From rscook at wenlin.com Thu Jul 3 13:02:08 2014 From: rscook at wenlin.com (Richard COOK) Date: Thu, 3 Jul 2014 11:02:08 -0700 Subject: Corrigendum #9 In-Reply-To: <53B41F20.8030204@khwilliamson.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> Message-ID: On Jul 2, 2014, at 8:02 AM, Karl Williamson wrote: > Corrigendum #9 has changed this so much that people are coming to me and saying that inputs may very well have non-characters, and that the default should be to pass them through. Since we have no published wording for how the TUS will absorb Corrigendum #9, I don't know how this will play out. But this abrupt a change seems wrong to me, and it was done without public input or really adequate time to consider its effects. Asmus, I think you will recall that in late 2012 and early 2013, when the subject of the proposed changes (or clarifications) to text relating to noncharacters first arose, we (at Wenlin) expressed our concerns. Some concerns were grave, and some of the discussion and comments were captured in this web page: There was much back and forth on the editorial list. Discussion clarified some of the issues for me, and mollified some of my concerns. At that time we did implement support for noncharacters in Wenlin, controlled by an Advanced Option to: Replace noncharacters with [U+FFFD] This user preference is turned on by default. Not sure if revisiting any of our prior discussion would help clarify the evolution of thinking on this issue. But I did want to mention that the comment "without public input" is not quite correct.
Work required may expand to fill the available time, and perhaps more time is now available. -Richard From asmusf at ix.netcom.com Thu Jul 3 15:48:59 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 03 Jul 2014 13:48:59 -0700 Subject: Corrigendum #9 In-Reply-To: References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> Message-ID: <53B5C1BB.4040303@ix.netcom.com> On 7/3/2014 11:02 AM, Richard COOK wrote: > On Jul 2, 2014, at 8:02 AM, Karl Williamson wrote: > >> Corrigendum #9 has changed this so much that people are coming to me and saying that inputs may very well have non-characters, and that the default should be to pass them through. Since we have no published wording for how the TUS will absorb Corrigendum #9, I don't know how this will play out. But this abrupt a change seems wrong to me, and it was done without public input or really adequate time to consider its effects. > Asmus, > > I think you will recall that in late 2012 and early 2013, when the subject of the proposed changes (or clarifications) to text relating to noncharacters first arose, we (at Wenlin) expressed our concerns. Some concerns were grave, and some of the discussion and comments were captured in this web page: > > > > There was much back and forth on the editorial list. Discussion clarified some of the issues for me, and mollified some of my concerns. > > At that time we did implement support for noncharacters in Wenlin, controlled by an Advanced Option to: > > Replace noncharacters with [U+FFFD] > > This user preference is turned on by default. > > Not sure if revisiting any of our prior discussion would help clarify the evolution of thinking on this issue. > > But I did want to mention that the comment ?without public input? is not quite correct. 
Richard, "public input" is best understood as PRI or similar process, not discussions by members or other people closely associated with the project. Also, in particular, discussions on the editorial list are invisible to the public. > As is so often the case, and as the web page above shows, there was input and discussion. Whether the amount of time given to this was really adequate is another question. Work required may expand to fill the available time, and perhaps more time is now available. Given the wide ranging nature of implementations this "clarification" affected, I believe the process failed to provide the necessary safeguards. Conformance changes are really significant, and a Corrigendum, no matter how much it is presented as harmless clarification, does affect conformance. The UTC would be well served to formally adopt a process that requires a PRI as well as resolutions taken at two separate UTCs to approve any Corrigendum. There are changes to properties and algorithms that would also benefit from such an extended process that has a guaranteed minimum number of times for the change to be debated, to surface in minutes and to surface in calls for public input, rather than sailing quietly and quickly into the standard. The threshold for this should really be rather low -- as the standard has matured, the number and nature of implementations that depend on it have multiplied, to the point where even a diverse membership is no guarantee that issues can be correctly identified and averted. With the minutes from the UTC only recording decisions, one change, to require an initial and a confirming resolution at separate meetings would allow more issues to surface. It would also help if proposal documents were updated to reflect the initial discussion, much as it is done with character encoding proposals that are updated to address additional concerns identified or resolved. 
That said, I could imagine a possible exception for true errata (typos), where correcting a clear mistake should not be unnecessarily drawn out, so the error can be removed promptly. Such cases usually turn on facts: was there an editing mistake, or was there new data about how a character is used that makes an original property assignment a mistake (rather than a less-than-optimal choice)? Despite being called a "clarification" this corrigendum is not in the nature of an erratum. A./ > > -Richard > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From public at khwilliamson.com Thu Jul 3 22:30:00 2014 From: public at khwilliamson.com (Karl Williamson) Date: Thu, 03 Jul 2014 21:30:00 -0600 Subject: Corrigendum #9 In-Reply-To: <53B5C1BB.4040303@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> <53B5C1BB.4040303@ix.netcom.com> Message-ID: <53B61FB8.3010002@khwilliamson.com> On 07/03/2014 02:48 PM, Asmus Freytag wrote: > On 7/3/2014 11:02 AM, Richard COOK wrote: >> On Jul 2, 2014, at 8:02 AM, Karl Williamson >> wrote: >> >>> Corrigendum #9 has changed this so much that people are coming to me >>> and saying that inputs may very well have non-characters, and that >>> the default should be to pass them through. Since we have no >>> published wording for how the TUS will absorb Corrigendum #9, I don't >>> know how this will play out. But this abrupt a change seems wrong to >>> me, and it was done without public input or really adequate time to >>> consider its effects.
>> Asmus, >> >> I think you will recall that in late 2012 and early 2013, when the >> subject of the proposed changes (or clarifications) to text relating >> to noncharacters first arose, we (at Wenlin) expressed our concerns. >> Some concerns were grave, and some of the discussion and comments were >> captured in this web page: >> >> >> >> There was much back and forth on the editorial list. Discussion >> clarified some of the issues for me, and mollified some of my concerns. >> >> At that time we did implement support for noncharacters in Wenlin, >> controlled by an Advanced Option to: >> >> Replace noncharacters with [U+FFFD] >> >> This user preference is turned on by default. >> >> Not sure if revisiting any of our prior discussion would help clarify >> the evolution of thinking on this issue. >> >> But I did want to mention that the comment ?without public input? is >> not quite correct. > > Richard, > > "public input" is best understood as PRI or similar process, not > discussions by members or other people closely associated with the > project. Also, in particular, discussions on the editorial list are > invisible to the public. > > >> As is so often the case, and as the web page above shows, there was >> input and discussion. Whether the amount of time given to this was >> really adequate is another question. Work required may expand to fill >> the available time, and perhaps more time is now available. > > Given the wide ranging nature of implementations this "clarification" > affected, I believe the process failed to provide the necessary safeguards. > > Conformance changes are really significant, and a Corrigendum, no matter > how much it is presented as harmless clarification, does affect > conformance. > > The UTC would be well served to formally adopt a process that requires a > PRI as well as resolutions taken at two separate UTCs to approve any > Corrigendum. 
> > There are changes to properties and algorithms that would also benefit > from such an extended process that has a guaranteed minimum number of > times for the change to be debated, to surface in minutes and to surface > in calls for public input, rather than sailing quietly and quickly into > the standard. > > The threshold for this should really be rather low -- as the standard > has matured, the number and nature of implementations that depend on it > have multiplied, to the point where even a diverse membership is no > guarantee that issues can be correctly identified and averted. > > With the minutes from the UTC only recording decisions, one change, to > require an initial and a confirming resolution at separate meetings > would allow more issues to surface. It would also help if proposal > documents were updated to reflect the initial discussion, much as it is > done with character encoding proposals that are updated to address > additional concerns identified or resolved. > > That said, I could imagine a possible exception for true errata (typos), > where correcting a clear mistake should not be unnecessarily drawn out, > so the error can be removed promptly. Such cases usually are turning on > facts (was there an editing mistake, was there new data about how a > character is used that makes an original property assignment a mistake > (rather than a less than optimal choice). > > Despite being called a "clarification" this corrigendum is not in the > nature of an erratum. > > A./ Exactly. There should have been a PRI before this was approved. I read the unicore list, and I was not aware of the change until after the fact. 
The first sentence of your more contemporaneous web page http://wenlininstitute.org/UnicodeNoncharacters/ indicates that you too did not know about this until after the fact, and undertook this effort upon finding out about it to understand the magnitude and cope with the change, which as Asmus said, is indeed a change and not a clarification. From rscook at wenlin.com Fri Jul 4 10:04:11 2014 From: rscook at wenlin.com (Richard COOK) Date: Fri, 4 Jul 2014 08:04:11 -0700 Subject: Corrigendum #9 In-Reply-To: <53B5C1BB.4040303@ix.netcom.com> References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> <53B5C1BB.4040303@ix.netcom.com> Message-ID: On Jul 3, 2014, at 1:48 PM, Asmus Freytag wrote: > On 7/3/2014 11:02 AM, Richard COOK wrote: >> On Jul 2, 2014, at 8:02 AM, Karl Williamson wrote: >> >>> Corrigendum #9 has changed this so much that people are coming to me and saying that inputs may very well have non-characters, and that the default should be to pass them through. Since we have no published wording for how the TUS will absorb Corrigendum #9, I don't know how this will play out. But this abrupt a change seems wrong to me, and it was done without public input or really adequate time to consider its effects. >> Asmus, >> >> I think you will recall that in late 2012 and early 2013, when the subject of the proposed changes (or clarifications) to text relating to noncharacters first arose, we (at Wenlin) expressed our concerns. Some concerns were grave, and some of the discussion and comments were captured in this web page: >> >> >> >> There was much back and forth on the editorial list. Discussion clarified some of the issues for me, and mollified some of my concerns. 
>> >> At that time we did implement support for noncharacters in Wenlin, controlled by an Advanced Option to: >> >> Replace noncharacters with [U+FFFD] >> >> This user preference is turned on by default. >> >> Not sure if revisiting any of our prior discussion would help clarify the evolution of thinking on this issue. >> >> But I did want to mention that the comment "without public input" is not quite correct. > > Richard, > > "public input" is best understood as PRI or similar process, not discussions by members or other people closely associated with the project. Also, in particular, discussions on the editorial list are invisible to the public. Asmus, The document (L2/13-015, see link above) which we submitted to UTC in response to the original proposal (L2/13-006) advocated caution. When L2/13-006 came to our attention it was perhaps rather late in the game (as Karl suggests in his reply). The changes were perhaps already a foregone conclusion in the minds of the proposers. I don't recall if anyone even proposed doing a PRI, but in retrospect that would have been a good idea; a PRI would have been ideal, and someone should have suggested it. > >> As is so often the case, and as the web page above shows, there was input and discussion. Whether the amount of time given to this was really adequate is another question. Work required may expand to fill the available time, and perhaps more time is now available. > > Given the wide-ranging nature of implementations this "clarification" affected, I believe the process failed to provide the necessary safeguards. > > Conformance changes are really significant, and a Corrigendum, no matter how much it is presented as harmless clarification, does affect conformance. > > The UTC would be well served to formally adopt a process that requires a PRI as well as resolutions taken at two separate UTCs to approve any Corrigendum.
> > There are changes to properties and algorithms that would also benefit from such an extended process that has a guaranteed minimum number of times for the change to be debated, to surface in minutes and to surface in calls for public input, rather than sailing quietly and quickly into the standard. > > The threshold for this should really be rather low -- as the standard has matured, the number and nature of implementations that depend on it have multiplied, to the point where even a diverse membership is no guarantee that issues can be correctly identified and averted. > > With the minutes from the UTC only recording decisions, one change, to require an initial and a confirming resolution at separate meetings would allow more issues to surface. It would also help if proposal documents were updated to reflect the initial discussion, much as it is done with character encoding proposals that are updated to address additional concerns identified or resolved. > > That said, I could imagine a possible exception for true errata (typos), where correcting a clear mistake should not be unnecessarily drawn out, so the error can be removed promptly. Such cases usually are turning on facts (was there an editing mistake, was there new data about how a character is used that makes an original property assignment a mistake (rather than a less than optimal choice). > > Despite being called a "clarification" this corrigendum is not in the nature of an erratum. So, there can be a continuum of cases between erratum and corrigendum. Corrigenda are at the severe end of the spectrum. It should be harder to issue a corrigendum since this affects conformance. Gray areas and thresholds dictate transparent process and caution. Judgement calls in critical cases require entertaining more opinions, and more second guessing, before it is too late. But someone does have to evaluate the possibilities, and make the judgement at each stage in the process. 
Sometimes too much second-guessing bogs the whole process down. How does the motto go? Is it "Don't let doing nothing perfectly stand in the way of doing something imperfectly."? On the other hand, sometimes the correct response to a call to action is to stand still. Or else somebody invades Iraq and Afghanistan. Can you issue a corrigendum to retract your corrigendum? Or just svn revert the whole software world? -R > A./ >> >> -Richard >> >> >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > From doug at ewellic.org Fri Jul 4 13:24:10 2014 From: doug at ewellic.org (Doug Ewell) Date: Fri, 4 Jul 2014 12:24:10 -0600 Subject: Official mappings between Wingdings/Webdings and Unicode Message-ID: <1C69CE39694745BA9E18CC747A1DC857@DougEwell> WG2 N4384, and at least one predecessor, included a "source references" document that attempted to provide a normative mapping between the glyphs in the Wingdings/Webdings font family and the corresponding existing or proposed Unicode characters. Now that Unicode 7.0 includes all of these characters, is the N4384 table still official, or will an official mapping be provided in some other format? It might be useful for apps to be able to convert between font-tagged Wingdings or Webdings and Unicode. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From emuller at adobe.com Sat Jul 5 17:46:13 2014 From: emuller at adobe.com (Eric Muller) Date: Sat, 5 Jul 2014 15:46:13 -0700 Subject: Help with arabic Message-ID: <53B88035.9090802@adobe.com> I am working on the digitization of a text that includes Arabic; could somebody please tell me what is the Unicode representation of the (short) fragments on those two pages? http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f33.image http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f474.image Thanks, Eric.
From wblackwo at tampabay.rr.com Sat Jul 5 13:47:57 2014 From: wblackwo at tampabay.rr.com (William Blackwood) Date: Sat, 5 Jul 2014 14:47:57 -0400 Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm Message-ID: <000a01cf9881$a3b03130$eb109390$@tampabay.rr.com> Can anyone provide me an actively resolving example of a .com domain name that demonstrates employment of the Unicode BIDI algorithm? Specifically, I am looking for realized/resolving examples of an Arabic number (AN) and character-containing domain name (such as ???.com), but one that employs the BIDI algorithm to change an Arabic 1 to a European number (EN) 1? (E.g. ????.com), or (??1?.com). The BIDI algorithm should be changing either the AN or EN, or vice versa; or has Verisign not yet incorporated the BIDI algorithm into its registry? Thank you, W. --- This email is free from viruses and malware because avast! Antivirus protection is active. http://www.avast.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From shervinafshar at gmail.com Sat Jul 5 19:08:49 2014 From: shervinafshar at gmail.com (Shervin Afshar) Date: Sat, 5 Jul 2014 17:08:49 -0700 Subject: Help with arabic In-Reply-To: <53B88035.9090802@adobe.com> References: <53B88035.9090802@adobe.com> Message-ID: Hi Eric, Here it is: - f33 - word in line 6: ??? - letters in line 7: ?, ? - words in line 8: ????, ????, ???? - f474 - words in line 6: - il à parle = ??? (but here it looks like there is a typo; it's written as ???) - il à écrit = ??? - il a marché = ??? - il a vu = ??? Shervin ? shervinafshar.name On Sat, Jul 5, 2014 at 3:46 PM, Eric Muller wrote: > I am working on the digitization of a text that includes Arabic; could > somebody please tell me what is the Unicode representation of the (short) > fragments on those two pages? > > http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f33.image > > http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f474.image > > Thanks, > Eric.
> > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From budelberger.richard at wanadoo.fr Sat Jul 5 20:39:34 2014 From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER) Date: Sun, 6 Jul 2014 03:39:34 +0200 (CEST) Subject: Help with arabic In-Reply-To: <53B88035.9090802@adobe.com> References: <53B88035.9090802@adobe.com> Message-ID: <748082224.28.1404610774586.JavaMail.www@wwinf1e04> > Message du 06/07/14 00:57 > De : Eric Muller > A : unicode at unicode.org > Objet : Help with arabic > > I am working on the digitization of a text that includes Arabic; could > somebody please tell me what is the Unicode representation of the > (short) fragments on those two pages? > > http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f33.image To the word ???, which means "room", one adds the letters ?, ? and ?, and writes ????, ????, and ????. > http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f474.image Il a parlé, il a écrit, il a marché, il a vu, ??? ,??? ,??? ,???. > Thanks, > Eric. From richard.wordingham at ntlworld.com Sun Jul 6 06:04:41 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 6 Jul 2014 12:04:41 +0100 Subject: Demonstrating Non-compliance to C6 (No distinct Interpretations) Message-ID: <20140706120441.38f0c55d@JRWUBU2> How does one establish non-compliance of a process to Conformance Requirement C6, "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct"? The problems I have are: 1. It is not sufficient to demonstrate that the process interprets canonically equivalent character sequences differently. 2. There therefore appears to be a mental activity involved. For example, the following snippet is non-compliant by virtue of the comment: # Function f should perform uppercasing.
if s1 is canonically equivalent to s2 and f(s1) is not canonically equivalent to f(s2) then print s1, " and ", s2, " are c.e. but f converts them to ", f(s1), " and ", f(s2) endif because it requires that if f() ever interprets a pair of canonically equivalent strings differently, it shall always interpret them differently. If I remove the comment, it seems that the snippet might be compliant, for it is no longer clear that f() 'interprets' its argument. If I buffer the function values, then it seems to be compliant, especially if I add a comment: # Function f should perform uppercasing. u1 = f(s1); u2 = f(s2); if s1 is canonically equivalent to s2 and u1 is not canonically equivalent to u2 then # This message should never be generated. print s1, " and ", s2, " are c.e. but f converts them to ", u1, " and ", u2 endif Have I understood C6? The background is that I am writing a regular expression engine for equivalence classes of strings under canonical equivalence and I realised that there was a novel issue in the choice of 'longest leftmost' when matching the pattern \p{ccc=0}\p{ccc≠0}. Would using character fragment positions in an unnormalised input string make my engine non-compliant with the Unicode standard? I think the 'practical' answer is that just using these positions makes selection of matching strings ill-defined as an operation on equivalence classes, and so should not be an option. Richard. From doug at ewellic.org Sun Jul 6 12:31:27 2014 From: doug at ewellic.org (Doug Ewell) Date: Sun, 6 Jul 2014 11:31:27 -0600 Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm Message-ID: <92B49CF49BF14342A61D1AB041A0F23A@DougEwell> William Blackwood wrote: > Can anyone provide me an actively resolving example of a .com domain > name that demonstrates employment of the Unicode BIDI algorithm?
> Specifically, I am looking for realized/resolving examples of an > Arabic number (AN) and character-containing domain name (such as > ???.com ), but one that employs the BIDI > algorithm to change an > Arabic 1 to a European number (EN) 1? (E.g. ????.com > ), or (??1?.com ). > The BIDI algorithm should be changing either the AN or EN, or vice > versa; or has Verisign not yet incorporated the BIDI algorithm into > its registry? I would never expect application of the Unicode Bidirectional Algorithm to change an Arabic digit like ١ into a European digit like 1. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From verdy_p at wanadoo.fr Sun Jul 6 13:10:12 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 6 Jul 2014 20:10:12 +0200 Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm In-Reply-To: <92B49CF49BF14342A61D1AB041A0F23A@DougEwell> References: <92B49CF49BF14342A61D1AB041A0F23A@DougEwell> Message-ID: Neither do I think that the Verisign registry has to perform such mappings; even if they have implemented an equivalence, they should still return a domain name using the Arabic digits if it was registered like this, and should allow resolving such names directly (even if it also resolves the name with the Arabo-European digits from ASCII). The Bidi algorithm is completely independent of IDNA, which addresses other issues and in fact is more concerned about canonical and compatibility equivalences or about confusables (notably with Indic digits, such as an Indic digit 4 that looks very much like a European digit 8). For IDNA there's a superset of equivalences or characters prohibited that goes far beyond basic canonical and compatibility equivalences, but Bidi is not an issue for the registration of IDNA labels in domain names (the possible issue is in the rendering of a FQDN domain alternating LTR and RTL labels, because the dot (.) separator has a weak direction in Bidi).
But this is not directly an issue of the domain name system; it is about how to render a URL (or more generally a URI), which should be parsed and have some characters (notably the dot and slash) changed to adopt a strong direction, different from the generic Bidi applied directly to the full URI as if it were using a human language.

There may be issues, however, with some domain name labels (separated by dots) that could mix characters allowed in a large set but with different strong directions. As such things may break and create lots of confusable labels (and as Bidi controls are prohibited in domain labels), this could create havoc. But today's browsers perform some validation of domain labels to make sure their resolved Bidi direction cannot change more than once.

There are also issues with the hyphen-minus within labels (allowed only in the middle without repetition); it also has a weak direction inherited from the letters/digits encoded before it, but with normal text rendering it could have its visual position changed and could create confusable domain names.

Each registry applies its own filters to allow or disallow some characters. They cannot open the full repertoire, and before extending their allowed character set they have to make sure that this will not create havoc with their own existing names (and they need to investigate how major web browsers will handle these new types of domain names, including in URLs).

The problem is harder to solve in some formats (notably when URLs are just embedded without any standard syntax identifying them in plain text, e.g. in plain-text emails, or in short text fields in a database where rich text encoding is not allowed, or in records of email addresses, or outside IDNA with user-selected user account names, including Facebook pages: they could be used to trick someone into connecting to the wrong account or downloading unsafe data).
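Incidentally, the EN/AN distinction discussed in this thread is just the character's Bidi_Class property, which can be inspected directly; a small Python illustration (nothing here is specific to any registry):

```python
import unicodedata

# Bidi_Class of digits: European digits are "EN" (European Number),
# Arabic-Indic digits are "AN" (Arabic Number). The class only drives
# reordering; the code point itself is never changed.
assert unicodedata.bidirectional("1") == "EN"       # U+0031 DIGIT ONE
assert unicodedata.bidirectional("\u0661") == "AN"  # U+0661 ARABIC-INDIC DIGIT ONE
assert unicodedata.bidirectional("\u06F1") == "EN"  # U+06F1 EXTENDED ARABIC-INDIC DIGIT ONE
```

Note that the Eastern Arabic-Indic digits U+06F0..U+06F9 really are classed EN, as Richard points out later in this thread.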
2014-07-06 19:31 GMT+02:00 Doug Ewell :

> William Blackwood wrote:
>
>> Can anyone provide me an actively resolving example of a .com domain
>> name that demonstrates employment of the Unicode BIDI algorithm?
>> Specifically, I am looking for realized/resolving examples of an
>> Arabic number (AN) and character-containing domain name, (such as
>> ???.com ), but that which employs the BIDI
>> algorithm to change an
>> Arabic 1, to a European number (EN) 1? (E.g. ????.com
>> ), or (??1?.com ).
>> The BIDI algorithm should be changing either the AN or EN, or vice-
>> versa; or has Verisign not yet incorporated the BIDI algorithm into
>> its registry?
>
> I would never expect application of the Unicode Bidirectional Algorithm to
> change an Arabic digit like ١ into a European digit like 1.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Sun Jul 6 13:15:39 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 6 Jul 2014 20:15:39 +0200
Subject: Help with arabic
In-Reply-To: 
References: <53B88035.9090802@adobe.com>
Message-ID: 

2014-07-06 2:08 GMT+02:00 Shervin Afshar :

> - words in line 6:
>   - il à parle = ??? (but here it looks like there is a typo; it's
>     written as ???)
>   - il à écrit = ???

Three typos in your French:

- there's no accent over "a" used as the auxiliary verb (in the present indicative, followed by the past participle, together forming the compound past)
- there must be an acute accent in "parlé" (it is the past participle, not the present tense)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Sun Jul 6 13:25:37 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 6 Jul 2014 20:25:37 +0200
Subject: Help with arabic
In-Reply-To: <748082224.28.1404610774586.JavaMail.www@wwinf1e04>
References: <53B88035.9090802@adobe.com> <748082224.28.1404610774586.JavaMail.www@wwinf1e04>
Message-ID: 

2014-07-06 3:39 GMT+02:00 Richard BUDELBERGER <budelberger.richard at wanadoo.fr>:
>
> http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f474.image
>
> Il a parlé, il a écrit, il a marché, il a vu, ??? ,??? ,??? ,???.

Note: the three comma-separated items, if they are just separated by the comma (in that example it is handwritten, but it is the European comma, not the Arabic comma), should use bidi-embedding controls (or ... in HTML, with browsers that use the "unicode-bidi" CSS property for the isolation mode, or using your own CSS stylesheet for the selector "bdi[dir=rtl] {...}", or an explicit style="" attribute) to avoid their reordering if you intend to have them in the same order as printed.

As the list is part of a paragraph and sentence in French, I don't think it is appropriate to leave the three Arabic words untagged (if you use , you should probably add the lang="ar" attribute to it), and you need three separate tags, leaving the commas (and the final dot of the sentence) outside of them.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Sun Jul 6 14:38:07 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 6 Jul 2014 20:38:07 +0100
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
In-Reply-To: <000a01cf9881$a3b03130$eb109390$@tampabay.rr.com>
References: <000a01cf9881$a3b03130$eb109390$@tampabay.rr.com>
Message-ID: <20140706203807.3a65ef70@JRWUBU2>

On Sat, 5 Jul 2014 14:47:57 -0400
"William Blackwood" wrote:

> Specifically, I am looking for realized/resolving examples of an
> Arabic number (AN) and character-containing domain name, (such as
> ???.com), but that which employs the BIDI algorithm to change an
> Arabic 1, to a European number (EN) 1?

The only interchange of these types in the BiDi algorithm is from European number to Arabic number, and this chiefly affects the tags on the characters for controlling reversals of the forwards direction, and whether European separators and terminators are treated as part of the run of digits. The changes of the BiDi tags only affect the appearance of the characters that look different in left-to-right and right-to-left text.

The AN digits are those of the Eastern end of the Arabic world. The decimal digits used with the Arabic script to the West and East are EN - U+0030 to U+0039 and U+06F0 to U+06F9.

Richard.

From doug at ewellic.org Sun Jul 6 18:26:49 2014
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 6 Jul 2014 17:26:49 -0600
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
Message-ID: <767CE4D1377F4BFF8B1784C517574AE8@DougEwell>

Richard Wordingham wrote:

>> Specifically, I am looking for realized/resolving examples of an
>> Arabic number (AN) and character-containing domain name, (such as
>> ???.com), but that which employs the BIDI algorithm to change an
>> Arabic 1, to a European number (EN) 1?
>
> The only interchange of these types in the BiDi algorithm is from
> European number to Arabic number, and this chiefly affects the tags on
> the characters for controlling reversals of the forwards direction and
> whether European separators and terminators are treated as part of the
> run of digits.

To clarify, I would not expect the Unicode Bidirectional Algorithm to change an Arabic digit into a European digit, or vice versa. I would not ever expect it to change the identity of any character.

The UBA can change the bidirectional TYPE of a particular instance of a character, so that it can be rendered properly, which I think is what Richard meant. But it will never replace a "١" with a "1", nor will it ever replace a "1" with a "١", which I think is what William Blackwood meant.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From wblackwo at tampabay.rr.com Mon Jul 7 09:13:28 2014
From: wblackwo at tampabay.rr.com (William Blackwood)
Date: Mon, 7 Jul 2014 10:13:28 -0400
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
In-Reply-To: <92B49CF49BF14342A61D1AB041A0F23A@DougEwell>
References: <92B49CF49BF14342A61D1AB041A0F23A@DougEwell>
Message-ID: <000601cf99ed$a0078520$e0168f60$@tampabay.rr.com>

Doug, from Unicode Standard Annex #9, Unicode Bidirectional Algorithm:

3.3.4

W2. Search backward from each instance of a European number until the first strong type (R, L, AL, or sos) is found. If an AL is found, change the type of the European number to Arabic number.

AL EN → AL AN
AL NI EN → AL NI AN
sos NI EN → sos NI EN
L NI EN → L NI EN
R NI EN → R NI EN

-----Original Message-----
From: Doug Ewell [mailto:doug at ewellic.org]
Sent: Sunday, July 06, 2014 1:31 PM
To: unicode at unicode.org
Cc: wblackwo at tampabay.rr.com
Subject: Re: Questions on the Unicode BiDirectional (BIDI) Algorithm

William Blackwood wrote:

> Can anyone provide me an actively resolving example of a .com domain
> name that demonstrates employment of the Unicode BIDI algorithm?
> Specifically, I am looking for realized/resolving examples of an
> Arabic number (AN) and character-containing domain name, (such as
> ???.com), but that which employs the BIDI algorithm to change an
> Arabic 1, to a European number (EN) 1? (E.g. ????.com), or (??1?.com).
> The BIDI algorithm should be changing either the AN or EN, or vice-
> versa; or has Verisign not yet incorporated the BIDI algorithm into
> its registry?

I would never expect application of the Unicode Bidirectional Algorithm to change an Arabic digit like ١ into a European digit like 1.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

---
This email is free from viruses and malware because avast! Antivirus protection is active.
http://www.avast.com

From eliz at gnu.org Mon Jul 7 09:56:11 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Mon, 07 Jul 2014 17:56:11 +0300
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
In-Reply-To: <000601cf99ed$a0078520$e0168f60$@tampabay.rr.com>
References: <92B49CF49BF14342A61D1AB041A0F23A@DougEwell> <000601cf99ed$a0078520$e0168f60$@tampabay.rr.com>
Message-ID: <8338eddxpw.fsf@gnu.org>

> From: "William Blackwood"
> Date: Mon, 7 Jul 2014 10:13:28 -0400
>
> Doug, from Unicode Standard Annex #9, Unicode Bidirectional Algorithm:
>
> 3.3.4
>
> W2. Search backward from each instance of a European number until the first strong type (R, L, AL, or sos) is found. If an AL is found, change the type of the European number to Arabic number.

This changes the _type_ of the character, but does not change the character itself.
The type of the character, as updated by W2, is then used in further processing. But the character stays what it originally was.

From verdy_p at wanadoo.fr Mon Jul 7 12:11:25 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 7 Jul 2014 19:11:25 +0200
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
In-Reply-To: <000601cf99ed$a0078520$e0168f60$@tampabay.rr.com>
References: <92B49CF49BF14342A61D1AB041A0F23A@DougEwell> <000601cf99ed$a0078520$e0168f60$@tampabay.rr.com>
Message-ID: 

It just changes the direction behavior; it does not replace characters, except in the preference mode using "national digits". Still, changing Euro-Arabic digits to some Arabic digits requires more knowledge about the language, and the BiDi algorithm does not track languages. So you cannot choose with the BiDi algorithm **alone** how to substitute Euro-Arabic digits with one of the two sets of Indo-Arabic digits (Western or Eastern; the Eastern set being called "extended" in the UCD but being the natural system of digits for Persian, Pashto and Urdu, and used by many Shiite Muslims; the former Western set being used also in Arabic by most Sunni Muslims)...

2014-07-07 16:13 GMT+02:00 William Blackwood :

> Doug, from Unicode Standard Annex #9, Unicode Bidirectional Algorithm:
>
> 3.3.4
>
> W2. Search backward from each instance of a European number until the
> first strong type (R, L, AL, or sos) is found. If an AL is found, change
> the type of the European number to Arabic number.
>
> AL EN → AL AN
> AL NI EN → AL NI AN
> sos NI EN → sos NI EN
> L NI EN → L NI EN
> R NI EN → R NI EN
>
> -----Original Message-----
> From: Doug Ewell [mailto:doug at ewellic.org]
> Sent: Sunday, July 06, 2014 1:31 PM
> To: unicode at unicode.org
> Cc: wblackwo at tampabay.rr.com
> Subject: Re: Questions on the Unicode BiDirectional (BIDI) Algorithm
>
> William Blackwood wrote:
>
> > Can anyone provide me an actively resolving example of a .com domain
> > name that demonstrates employment of the Unicode BIDI algorithm?
> > Specifically, I am looking for realized/resolving examples of an
> > Arabic number (AN) and character-containing domain name, (such as
> > ???.com ), but that which employs the BIDI
> algorithm to change an
> > Arabic 1, to a European number (EN) 1? (E.g. ????.com
> ), or (??1?.com ).
> > The BIDI algorithm should be changing either the AN or EN, or vice-
> > versa; or has Verisign not yet incorporated the BIDI algorithm into
> > its registry?
>
> I would never expect application of the Unicode Bidirectional Algorithm to
> change an Arabic digit like ١ into a European digit like 1.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
> ---
> This email is free from viruses and malware because avast! Antivirus
> protection is active.
> http://www.avast.com
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org Mon Jul 7 12:25:32 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 07 Jul 2014 10:25:32 -0700
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
Message-ID: <20140707102532.665a7a7059d7ee80bb4d670165c8327d.c6cf2d3bc2.wbe@email03.secureserver.net>

Philippe Verdy wrote:

> It just changes the direction behavior, it does not replace
> characters, except in the preference mode using "national digits".

It does not replace characters, period.
There is no "preference mode" in the UBA that replaces digits. The terms "preference mode" and "national digits" do not appear in UAX #9.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From verdy_p at wanadoo.fr Mon Jul 7 12:28:31 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 7 Jul 2014 19:28:31 +0200
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
In-Reply-To: <20140707102532.665a7a7059d7ee80bb4d670165c8327d.c6cf2d3bc2.wbe@email03.secureserver.net>
References: <20140707102532.665a7a7059d7ee80bb4d670165c8327d.c6cf2d3bc2.wbe@email03.secureserver.net>
Message-ID: 

Yes, my sentence was incomplete. But such a preference mode is implemented in common software (and without even needing the UBA itself).

2014-07-07 19:25 GMT+02:00 Doug Ewell :

> Philippe Verdy wrote:
>
> > It just changes the direction behavior, it does not replace
> > characters, except in the preference mode using "national digits".
>
> It does not replace characters, period. There is no "preference mode" in
> the UBA that replaces digits. The terms "preference mode" and "national
> digits" do not appear in UAX #9.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org Mon Jul 07 12:36:22 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 07 Jul 2014 10:36:22 -0700
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
Message-ID: <20140707103622.665a7a7059d7ee80bb4d670165c8327d.4839263463.wbe@email03.secureserver.net>

Philippe Verdy wrote:

>>> It just changes the direction behavior, it does not replace
>>> characters, except in the preference mode using "national digits".
>>
>> It does not replace characters, period. There is no "preference mode"
>> in the UBA that replaces digits. The terms "preference mode" and
>> "national digits" do not appear in UAX #9.
>
> Yes, my sentence was incomplete.
But such a preference mode is
> implemented in common software (and without even needing the UBA itself).

William asked specifically about the UBA, and cited a passage from the UBA, and the Subject line of this thread refers to the UBA. To avoid confusing him with unrelated information, my responses at least are intentionally confined to the UBA.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From verdy_p at wanadoo.fr Tue Jul 8 03:58:22 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 8 Jul 2014 10:58:22 +0200
Subject: Questions on the Unicode BiDirectional (BIDI) Algorithm
In-Reply-To: <20140707103622.665a7a7059d7ee80bb4d670165c8327d.4839263463.wbe@email03.secureserver.net>
References: <20140707103622.665a7a7059d7ee80bb4d670165c8327d.4839263463.wbe@email03.secureserver.net>
Message-ID: 

I do agree, but the user was already confused by the EN->AN "substitution", which is not a character substitution but just a change of Bidi class during the resolution steps needed to order things correctly. The final goal of the UBA is just to compute the "correct" visual reordering, and my sentence was already saying that (the "unless" part was effectively incomplete, as it was not clear that this was not about the UBA itself).

Indeed, the replacement of digits by "national digits" is described in the standard even if it is now deprecated (but it is maintained because we also have some deprecated / non-recommended formatting controls to define this behavior for a few pieces of software that still need it). And the UBA algorithm also includes some minimal support for these formatting controls, so this comment is not completely off topic (but even in this case the UBA does not perform the substitutions itself). These controls are not needed when composing Arabic plain text directly. This is just an old compatibility facility for cases where a piece of software is formatting numbers using ASCII digits (e.g. with printf("%d %f"...) in C) without knowing which other set of digits it should better use (no support for a more precise locale).

These substitutions are problematic anyway because they are almost blind to the context of use (are these used for numbers expressing quantities, or are they codes and identifiers like phone numbers or social security numbers or car registration numbers, or postal codes in a foreign country?). They could also generate problems if they cause a change of interpretation in dates and times or in currency amounts (because such substitutions are lossy), and in fact more problems than keeping digits untouched (even if they are not the preferred ones for a given locale).

What is really important is that substitution of ASCII digits is not possible only at the character encoding level, used by the UBA, because it requires some other knowledge about the language (or style for the Arabic script). Typically such substitutions are handled in the context of a specific font used by the renderer, which will need such formatting controls in order to know if it can substitute *glyphs* (not characters); the UBA does not work at the glyph level, but it is possible at the font level. It is even more difficult to do that for the Arabic script than for the Indic scripts, because the Arabic script has two distinct sets of "national" digits.

Digits are significantly important in critical cases, so performing automatic substitution of them without good knowledge of their context of use will cause severe security problems. The UBA is used so broadly by default that it is certainly not the algorithm in which such substitutions will occur (and I think this is for the same reason that the use of digit formatting controls is also strongly discouraged).

2014-07-07 19:36 GMT+02:00 Doug Ewell :

> Philippe Verdy wrote:
>
> >>> It just changes the direction behavior, it does not replace
> >>> characters, except in the preference mode using "national digits".
> >>
> >> It does not replace characters, period. There is no "preference mode"
> >> in the UBA that replaces digits. The terms "preference mode" and
> >> "national digits" do not appear in UAX #9.
> >
> > Yes, my sentence was incomplete. But such a preference mode is
> > implemented in common software (and without even needing the UBA itself).
>
> William asked specifically about the UBA, and cited a passage from the
> UBA, and the Subject line of this thread refers to the UBA. To avoid
> confusing him with unrelated information, my responses at least are
> intentionally confined to the UBA.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cewcathar at hotmail.com Tue Jul 8 18:38:24 2014
From: cewcathar at hotmail.com (CE Whitehead)
Date: Tue, 8 Jul 2014 19:38:24 -0400
Subject: Help with arabic
Message-ID: 

From: Richard BUDELBERGER
Date: Sun, 6 Jul 2014 03:39:34 +0200 (CEST)

> Message du 06/07/14 00:57
> De : Eric Muller
> A : unicode_at_unicode.org
> Objet : Help with arabic
>>
>> I am working on the digitization of a text that includes Arabic; could
>> somebody please tell me what is the Unicode representation of the
>> (short) fragments on those two pages?

Perhaps your question has been answered sufficiently by Richard:

First,

>> http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f33.image

Richard said:
> on ajoute au mot ??? qui signifie chambre, les lettres ? ,? et ?, et l'on écrit ???? ,????, et ????.

In my answer, I have used the escapes from http://www.unicode.org/charts/PDF/U0600.pdf in case you want these too (the complete alphabet showing first, middle, last, and isolated forms for letters is at Wikipedia: http://en.wikipedia.org/wiki/Arabic_alphabet):

** on ajoute au mot بيت qui signifie chambre, les lettres ي, ك et ه, et l'on écrit بيتي, بيتك, et بيته. **

So for ??? the escapes are: U+0628 U+064A U+062A

For the three pronoun suffixes the escapes are: U+064A, U+0643 and U+0647

[My note: the word bayt -- ??? -- actually means maison or house, not chambre or room.]

* * *

For the next text

>> http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f474.image

Richard said:
> Il a parlé, il a écrit, il a marché, il a vu, ??? ,??? ,??? ,???

The escapes/code point numbers:

U+0643;U+0644;U+0645; is ??? [note that the letters/characters that represent the k-l-m sounds are ordered as in English!]
U+0643;U+062A;U+0628; is ???
U+0645;U+0634;U+0649; is ???
Finally U+0634;U+0627;U+0641; is ???

[Another note: shaahad-a is the word I know for witnessed/saw-he: ???????
ra'aa-a is the word I know for perceived/saw-he: ?????
I am not familiar with the word used in your text, shaaf-a: ??? -- but my Arabic is very minimal.]

Best,

--
C. E. Whitehead
cewcathar at hotmail.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rwhlk142 at gmail.com Wed Jul 9 15:45:50 2014
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Wed, 9 Jul 2014 16:45:50 -0400
Subject: How to set up a Unicode 7-aware OpenType features lookup table for your fonts
Message-ID: 

Hello!

I've been editing fonts with FontLab Studio for some time now, but HAVE NOT YET really delved into editing OpenType features lookup tables...

(1) How'd you start? May I use FontLab Studio to do my OTF tables, or MUST I use VOLT?
(2) In a Unicode 7-capable OT font, what letter/accent and ligature combinations should I include?
(3) How should I use the 20 Stylistic Sets (ss01 through ss20) to my BEST advantage?
(4) What online resources are available to help me create suitable OTF lookup tables for my fonts?

Your cooperation would be greatly appreciated. Thank You!

Robert Lloyd Wheelock
INTERNATIONAL SYMBOLISM RESEARCH INSTITUTE
Augusta, ME U.S.A.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From chris.fynn at gmail.com Wed Jul 9 22:48:43 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Thu, 10 Jul 2014 09:48:43 +0600
Subject: How to set up a Unicode 7-aware OpenType features lookup table for your fonts
In-Reply-To: 
References: 
Message-ID: 

Robert

I suggest you subscribe to the OpenType mailing list and ask your questions there.

subscribe: opentype-subscribe at indx.co.uk

good luck with this

- Chris

On 10/07/2014, Robert Wheelock wrote:
> Hello!
>
> I've been editing fonts with FontLab Studio for some time now, but HAVE NOT
> YET really delved into editing OpenType features lookup tables...
>
> (1) How'd you start? May I use FontLab Studio to do my OTF tables, or
> MUST I use VOLT?
>
> (2) In a Unicode 7-capable OT font, what letter/accent and ligature
> combinations should I include?
>
> (3) How should I use the 20 Stylistic Sets (ss01 through ss20) to my BEST
> advantage?
>
> (4) What online resources are available to help me create suitable OTF
> lookup tables for my fonts?
>
> Your cooperation would be greatly appreciated. Thank You!
>
> Robert Lloyd Wheelock
> INTERNATIONAL SYMBOLISM RESEARCH INSTITUTE
> Augusta, ME U.S.A.

From rick at unicode.org Mon Jul 14 11:44:20 2014
From: rick at unicode.org (Rick McGowan)
Date: Mon, 14 Jul 2014 09:44:20 -0700
Subject: Erratum report for UTS #46
Message-ID: <53C408E4.3020506@unicode.org>

Recently we received an error report about one of the data files for the latest release of UTS #46, specifically the testing data in IdnaTest.txt.

The erratum notice is here:
http://www.unicode.org/errata/#current_errata

An update to this file has been generated, and that has now been posted. Users of UTS #46 test data may wish to take note.
From public at khwilliamson.com Mon Jul 14 13:16:39 2014
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 14 Jul 2014 12:16:39 -0600
Subject: Corrigendum #9
In-Reply-To: 
References: <20140602082722.665a7a7059d7ee80bb4d670165c8327d.680b2ce497.wbe@email03.secureserver.net> <53992CC1.3010101@khwilliamson.com> <66865346da6645be9558d53dbac1b53a@BL2PR03MB450.namprd03.prod.outlook.com> <53B41F20.8030204@khwilliamson.com> <53B5C1BB.4040303@ix.netcom.com>
Message-ID: <53C41E87.3050103@khwilliamson.com>

I ran across this in Section 3.7.4 of http://www.unicode.org/reports/tr36/

"Use pairs of noncharacter code points in the range FDD0..FDEF. These are "super" private-use characters, and are discouraged for general interchange. The transformation would take each nibble of a byte Y, and add to FDD0 and FDE0, respectively. However, noncharacter code points may be replaced by U+FFFD ( � ) REPLACEMENT CHARACTER by some implementations, especially when they use them internally. (Again, incoming characters must never be deleted, because that can cause security problems.)"

I'm not sure if this affects the calculus of the Corrigendum.

From rwhlk142 at gmail.com Mon Jul 14 21:07:54 2014
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Mon, 14 Jul 2014 22:07:54 -0400
Subject: Unified Canadian Aboriginal Syllabics - Missing Syllable Characters
Message-ID: 

Hello!

I just started to make an ASDF layout for the Inuktitut syllabics characters (in association with Fontboard). The syllabic characters are assigned to their (closest match) Inuit Latin keys (*a* on A, *pa* on P, ...) as follows:

VOWELS: *ai* (e) E *i* I *u* (o) U *a* A
INDEPENDENT CONSONANT: *h* O
-*AI* SYLLABLES: *pai* B *tai* D *kai* G *gai* (tse) J *mai* ? *nai* ~ *sai* Z *lai* | *jai* (ye) F *vai* (fe) V *rai* Q *lhai* (lhe) {
-*I* SYLLABLES: *pi* B *ti* D *ki* G *gi* (tsi) J *mi* / *ni* ` *si* Z *li* \ *ji* (yi) F *vi* (fi) V *ri* Q *lhi* [
-*U* SYLLABLES: *pu* P *tu* T *ku* K *gu* (tso) C *mu* M *nu* N *su* S *lu* L *ju* (yo) Y *vu* (fo) W *ru* R *lhu* (lho) }
-*A* SYLLABLES: *pa* P *ta* T *ka* K *ga* (tsa) C *ma* M *na* N *sa* S *la* L *ja* (ya) Y *va* (fa) W *ra* R *lha* ]

There are quite a few *missing* syllabics characters:

- The character for the syllable *lhai* (lhe) (like a horizontally mirrored *lhi*, or a rotated *lha*)
- The characters for the entire *sp*- series (shown on Wikipedia's article on UCAS as copies of ZESS, Z, N, and Russian Cyrillic I-OBROTNOYE)

Where would y'all think about where to place these still-missing UCAS characters?!?!

Thank You!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Tue Jul 15 14:13:28 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 15 Jul 2014 21:13:28 +0200
Subject: Erratum report for UTS #46
In-Reply-To: <53C408E4.3020506@unicode.org>
References: <53C408E4.3020506@unicode.org>
Message-ID: 

In that errata page, could you add in the side menu bar a link to the PRI page, which is all about pending changes and possible incoming *corrections* suggested by the bug report form?

Sometimes, it is when reading the content of a PRI with proposed changes that we detect old errors (which could bring discussions, and often not just a simple corrigendum). Some errors are only in informative documents, which may be corrected simply without breaking the standard (e.g. representative glyphs, or alternate contextual forms). Some corrigenda should also have a beta discussion period before being definitely finalized.

The bug report form is already linked from that page, but it does not show where the current PRIs are.
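The UTS #36 passage Karl quoted earlier describes a byte-to-noncharacter-pair mapping: each nibble of a byte is added to U+FDD0 and U+FDE0 respectively. A minimal Python sketch of that (explicitly discouraged) scheme, just to make the arithmetic concrete:

```python
# Sketch of the "noncharacter pair" byte encoding described in UTS #36:
# high nibble of byte y goes to U+FDD0.., low nibble to U+FDE0..
def encode_byte(y):
    return chr(0xFDD0 + (y >> 4)) + chr(0xFDE0 + (y & 0x0F))

def decode_pair(pair):
    hi = ord(pair[0]) - 0xFDD0
    lo = ord(pair[1]) - 0xFDE0
    return (hi << 4) | lo

assert encode_byte(0xAB) == "\ufdda\ufdeb"
assert all(decode_pair(encode_byte(y)) == y for y in range(256))
```

As the quoted text warns, implementations may replace noncharacters with U+FFFD, so this round-trip cannot be relied on in interchange.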
2014-07-14 18:44 GMT+02:00 Rick McGowan :

> Recently we received an error report about one of the data files for the
> latest release of UTS #46, specifically the testing data in IdnaTest.txt.
>
> The erratum notice is here:
> http://www.unicode.org/errata/#current_errata
>
> An update to this file has been generated, and that has now been posted.
> Users of UTS #46 test data may wish to take note.
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From roozbeh at unicode.org Tue Jul 15 18:33:05 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Tue, 15 Jul 2014 16:33:05 -0700
Subject: Noto adds CJK, plus new user-facing website
Message-ID: 

Please excuse the spam, but I think it would be interesting for people here to know that the Noto open source project now supports CJK, which brings it very close to the goal of supporting every major script (and several minor and historical ones).

Here is the CJK announcement:
http://googledevelopers.blogspot.com/2014/07/noto-cjk-font-that-is-complete.html

Here is the new user-oriented Noto website:
http://www.google.com/get/noto/

The data on the website is from the CLDR project, and the sample images are rendered using HarfBuzz and Pango.

And more will be coming. (Of all the scripts used for CLDR languages, only three have not been released yet.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Tue Jul 15 20:21:12 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 16 Jul 2014 03:21:12 +0200
Subject: Noto adds CJK, plus new user-facing website
In-Reply-To: 
References: 
Message-ID: 

Thanks for making this known.

Probably the Noto collection is the best drop-in replacement for Android smartphones and tablets. And the fonts will be useful to many websites.
They will also fit very well with Linux distributions.

Apple could feature the Adobe collection for Mac OS X. Will Microsoft follow with a comparable collection for Windows?

For languages like Burmese and the languages of Africa this is a great announcement. The Tibetan script still lacks complete support (and Divehi as well, even if it is much simpler than Arabic, but really ugly in existing fonts).

Next step: building monospaced variants of these fonts for use in programming languages and coding. Or maybe just integrate a feature in these fonts to support a monospaced rendering (using one or several fixed-width cells in a row for each cluster), or facilitating data input with easier placement of input carets and easier text selection (the alternative being to use simplified glyphs and simpler joiners for cursive scripts, at least temporarily for the word under focus, or an input tool showing the simplified rendering in a small window working like a magnifier when hovering over scripts with complex layouts; that tool could also work with IMEs; that alternative would deprecate monospace styles for many scripts where they are really ugly and not very easy to read fast; glyphs would be rendered with more natural sizes and positioning and more regular stroke weights).

After that, it will be the turn of a comprehensive font for maths formulas and pictograms for technical diagrams, and a font for pictograms (meteorology, astrology, games, cartographic symbols, arrows, clocks showing time, UI symbols, agendas, musical notations, emojis). And some others for old historic scripts (Linear A or B, old runic scripts), and experiments with new experimental scripts developed in the last half-century, or just since the appearance of personal computers in the early 1980s (which coincides with radical changes in how books/papers and other media showing text are produced, with radical changes in orthographies for the remaining minority languages).
The global public is just starting to rediscover the beauty of the historic scripts and how they could also be useful to complement their native alphabets that have suffered a lot since the advent of ASCII or early 8-bit charsets in computers everywhere and the early development of Unicode and incompatible charsets showing unreadable random results or just tofu (even today for modern languages like Burmese, or with "optional" diacritics rendered on the wrong letters in Russian with most commonly installed fonts).

Another font for SignWriting with specific features (if it is possible to design it to work with a stable orthographic convention for the layout, otherwise develop a standard layout UI control, or a simple schema for use in basic HTML or UI, rendering it with a subset of SVG using a set of component glyphs from a common font and a standard mapping).

Let's just hope that OSes will support all these new scripts (Windows has always been leaving users behind if they did not use the latest version whose linguistic support was frozen at least 2 years before the last release, with few extensions with OS or Office service packs, notably for the OpenType, GDI, 3D API, or .Net renderers and in i18n support APIs).

2014-07-16 1:33 GMT+02:00 Roozbeh Pournader : > Please excuse the spam, but I think it would be interesting for people > here to know that the Noto open source project now supports CJK, which > brings it very close to the goal of supporting every major script (and > several minor and historical ones). > > Here is the CJK announcement: > > http://googledevelopers.blogspot.com/2014/07/noto-cjk-font-that-is-complete.html > > Here is the new user-oriented Noto website: > http://www.google.com/get/noto/ > > The data on the website is from the CLDR project, and the sample images > are rendered using HarfBuzz and Pango. > > And more will be coming. (Of all the scripts used for CLDR languages, only > three have not been released yet.)
> > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From roozbeh at unicode.org Tue Jul 15 20:45:14 2014 From: roozbeh at unicode.org (Roozbeh Pournader) Date: Tue, 15 Jul 2014 18:45:14 -0700 Subject: Noto adds CJK, plus new user-facing website In-Reply-To: References: Message-ID: The Noto Sans Symbols font already supports a lot of the symbol classes you mentioned. Linear B and Runic are also supported by Noto. Same with some of the newer experimental scripts (Osmanya, Deseret, Shavian, etc.) HarfBuzz has been trying its best to support every character in Unicode as soon as possible. It is included in Android, ChromeOS, and I believe all modern Linux distributions. And last but not least, both HarfBuzz and Noto would appreciate any help in finding and fixing issues they may have. The bug fix turnaround is usually very quick, especially with HarfBuzz. On Tue, Jul 15, 2014 at 6:21 PM, Philippe Verdy wrote: > Thanks to get it known. > > Probably the Noto collection is the best drop in replacement for Android > smartphones and tablets. And they will be useful to many websites. They > will also fit very well with Linux distribs. > > Apple could feature the Adobe collection for MacOSX. Will Microsoft follow > with a comparable collection for Windows? > > For languages like Burmese and languages of Africa this is a great > announcement. Tibetan script still lacks some complete support (and Divehi > as well even if it is much simpler than Arabic; but really ugly in existing > fonts). > > Next step: building monospaced variants of these fonts for use in > programmng languages and coding.
Or may be just integrate a feature in > these fonts to support a monospaced rendering (using one or several > fixed-width cells in a row for each cluster), or facilitating data input > with easier placement of input carets and easier text selection (the > alternative being to use simplified glyphs and simpler joiners for cursive > scripts, at least temporarily for the word under focus or an input tool > showing the simplified rendering in a small window working like a magnifier > when hovering some scripts with complex layouts; that tool could work also > with IMEs; that alternative would deprecate monospace styles for many > scripts where they are really ugly and not very easy to read fast, glyphs > would be rendered with more natural sizes and positioning and more regular > stroke weights). > > After that, this will be the turn for a comprehensive font for Maths > formulas and pictograms for technical diagrams, and a font for pictograms > (meteorology, astrology, games, cartographic symbols, arrows, clocks > showing time, UI symbols, agendas, musical notations, emojis) > > And some other for old historic scripts (Linear A or B, old runic > scripts), and experiments with new experimental scripts developed in the > last half-century or just since the apparition of personal computers in the > early 1980s (coincides with radical changes about how books/papers and > other medias showing text are produced, with radical changes in > orthographies for the remaining minority languages). 
> > The global public is just starting to rediscover the beauty of the > historic scripts and how they could also be useful to complement their > native alphabets that have suffered a lot since the advent of ASCII or > early 8bit charsets in computers everywhere and the early development of > Unicode and incompaticle charsets showing unreadable random results or just > tofu (even today or modern languages like Burmese, or with "optional" > diacritics rendered on the wrong letters in Russian with most commonly > installed fonts). > > Another for SignWriting with specific features (if it is possible to > design it to work with a stable orthographic convention for the layout, > otherwise develop a standard layout UI control, or a simple schema for use > in basic HTML or UI, rendering it with a subset of SVG using a set of > component glyphs from a common font and a standard mapping). > > Let's just hope that OSes will support all these new scripts (Windows has > always been leaving users behind if they did not use the lastest version > whose linguistic support was frozen at least 2 years before the last > release, with few extensions with OS or Office service packs, notably for > the OpenType, GDI, 3D API, or .Net renderers and in i18n support APIs). > > > > 2014-07-16 1:33 GMT+02:00 Roozbeh Pournader : > >> Please excuse the spam, but I think it would be interesting for people >> here to know that the Noto open source project now supports CJK, which >> brings it very close to the goal of supporting every major script (and >> several minor and historical ones). >> >> Here is the CJK announcement: >> >> http://googledevelopers.blogspot.com/2014/07/noto-cjk-font-that-is-complete.html >> >> Here is the new user-oriented Noto website: >> http://www.google.com/get/noto/ >> >> The data on the website is from the CLDR project, and the sample images >> are rendered using HarfBuzz and Pango. >> >> And more will be coming. 
(Of all the scripts used for CLDR languages, >> only three have not been released yet.) >> >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Wed Jul 16 03:30:48 2014 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 16 Jul 2014 09:30:48 +0100 Subject: Noto adds CJK, plus new user-facing website In-Reply-To: References: Message-ID: On 16 July 2014 00:33, Roozbeh Pournader wrote: > Please excuse the spam, but I think it would be interesting for people here > to know that the Noto open source project now supports CJK, which brings it > very close to the goal of supporting every major script (and several minor > and historical ones). > > Here is the CJK announcement: > http://googledevelopers.blogspot.com/2014/07/noto-cjk-font-that-is-complete.html Fantastic news, but I personally think that the decision to include the four characters at U+9FCD through U+9FD0 in the Adobe Source Han / Noto Sans Simplified Chinese fonts (and U+9FD0 in the Traditional Chinese fonts) is extremely premature given that these characters have only just been added to the draft repertoire for ISO/IEC 10646:2014 Amd. 2, and have not yet completed even their first round of the ISO balloting process. As such the code point allocations are not stable, and should not be used in fonts for public consumption. Andrew From verdy_p at wanadoo.fr Wed Jul 16 06:03:43 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 16 Jul 2014 13:03:43 +0200 Subject: Noto adds CJK, plus new user-facing website In-Reply-To: References: Message-ID: What does "Noto" mean? is it an abbreviation of "no (more) tofu" ? 2014-07-16 3:45 GMT+02:00 Roozbeh Pournader : > The Noto Sans Symbols font already supports a lot of the symbol classes > you mentioned. 
Linear B and Runic are also supported by Noto. Same with > some of the newer experimental scripts (Osmanya, Deseret, Shavian, etc.) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jf at colson.eu Wed Jul 16 07:45:10 2014 From: jf at colson.eu (=?windows-1252?Q?Jean-Fran=E7ois_Colson?=) Date: Wed, 16 Jul 2014 14:45:10 +0200 Subject: Unified Canadian Aboriginal =?windows-1252?Q?Syllabics=97Mis?= =?windows-1252?Q?sing_Syllable_Characters?= In-Reply-To: References: Message-ID: <53C673D6.4020003@colson.eu> Le 15/07/14 04:07, Robert Wheelock a écrit : > Hello! Hello > > I just started to make an ASDF layout for the Innuktitut syllabics > characters (in association with Fontboard). Where can I find Fontboard's official website? > The syllabic charcters are assigned to their (closest match) Innuit > Latin keys (/a/ on A, /pa/ on P, ...) as follows: > > VOWELS: > /ai/ (e) E /i/ I /u/ (o) U /a/ A > > INDEPENDENT CONSONANT > /h/ O > > -/AI/ SYLLABLES > /pai/ B /tai/ D /kai/ G /gai/ (tse) J > /mai/ ? /nai/ ~ /sai/ Z /lai/ | > /jai/ (ye) F /vai/ (fe) V /rai/ Q /lhai/ (lhe) { > > -/I/ SYLLABLES > /pi/ B /ti/ D /ki/ G /gi/ (tsi) J /mi/ / /ni/ ` /si/ Z > /li/ \ > /ji/ (yi) F /vi/ (fi) V /ri/ Q /lhi/ [ > > -/U/ SYLLABLES > /pu/ P /tu/ T /ku/ K /gu/ (tso) C > /mu/ M /nu/ N /su/ S /lu/ L > /ju/ (yo) Y /vu/ (fo) W /ru/ R /lhu/ (lho) } > > -/A/ SYLLABLES > /pa/ P /ta/ T /ka/ K /ga/ (tsa) C /ma/ M /na/ N /sa/ S > /la/ L > /ja/ (ya) Y /va/ (fa) W /ra/ R /lha/ ] > I have a few questions:
• How do you handle the final consonants?
• How do you type long syllables (those with a dot above)? With a dead key or with the Alt Gr (= right Alt) key?
• There are a few missing characters in your description:
  • ? : You could map it to Shift + ? (Shift + O).
  • ? ? ? ? (qai qi qu qa): You could map them to the still unused keys H and X.
  • ? ? ? ? (ngai ngi ngu nga): There's no room left for them.
If you applied your general scheme (pai = Shift + pi, pu = Shift + pa), you could move ? to Shift + ? and ? to Shift + ?. That way, the keys E and U could be freed to accept ? ? ? ?.
  • ? ? ? (nngi nngu nnga): I have no idea where you could put them.
  • ? replaces the ?. Aren't there any question marks in Inuktitut?
> There are quite a few _/missing/_ syllablics characters: > •The character for the syllable /lhai/ (lhe) (like a horizontally > mirrored /lhi/, or a rotated /lha/) Once upon a time, the ai-pai-tai? syllables were discarded in Inuktitut because there weren't enough room for the whole syllabary on the daisy wheel of an electric typewriter. Later, they were readopted in Nunavik, but not in Nunavut where ? is still written ??. Apparently, ? ? ? ? ? ? ? are used only to the west of Hudson's Bay (in Nunavut). Therefore, there is no need for a lhai syllable. If you have examples of use of that syllable, please share them with us and the powers that be could perhaps investigate a little about it. > •The characters for the entire /sp/- series (shown on Wikipedia's > article on UCAS as copies of ZESS, Z, N, and Russian Cyrillic I-OBROTNOYE) I suppose ? and ?, which are used for l and r in some languages, come from there, and their finals, ? and ?, were not forgotten. But you're right: those letters, which look like Z, reversed Z, N, and ? were completely forgotten. Perhaps they were judged too archaic to be encoded. But now, Unicode includes hieroglyphs, cuneiforms, and many other old scripts, so I think those characters could be considered for a proposal. > Where would y'all think about where to place these still-missing UCAS > characters?!?! > Thank You! > > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jf at colson.eu Wed Jul 16 09:58:25 2014 From: jf at colson.eu (=?windows-1252?Q?Jean-Fran=E7ois_Colson?=) Date: Wed, 16 Jul 2014 16:58:25 +0200 Subject: Unified Canadian Aboriginal =?windows-1252?Q?Syllabics=97Mis?= =?windows-1252?Q?sing_Syllable_Characters?= In-Reply-To: <53C673D6.4020003@colson.eu> References: <53C673D6.4020003@colson.eu> Message-ID: <53C69311.4090004@colson.eu> Le 16/07/14 14:45, Jean-Fran?ois Colson a ?crit : > > Le 15/07/14 04:07, Robert Wheelock a ?crit : >> Hello! > > Hello > > >> >> I just started to make an ASDF layout for the Innuktitut syllabics >> characters (in association with Fontboard). > > Where can I find Fontboard?s official website ? > > >> The syllabic charcters are assigned to their (closest match) Innuit >> Latin keys (/a/ on A, /pa/ on P, ...) as follows: >> >> VOWELS: >> /ai/ (e) E /i/ I /u/ (o) U /a/ A >> >> INDEPENDENT CONSONANT >> /h/ O >> >> -/AI/ SYLLABLES >> /pai/ B /tai/ D /kai/ G /gai/ (tse) J >> /mai/ ? /nai/ ~ /sai/ Z /lai/ | >> /jai/ (ye) F /vai/ (fe) V /rai/ Q /lhai/ (lhe) { >> >> -/I/ SYLLABLES >> /pi/ B /ti/ D /ki/ G /gi/ (tsi) J /mi/ / /ni/ ` /si/ Z >> /li/ \ >> /ji/ (yi) F /vi/ (fi) V /ri/ Q /lhi/ [ >> >> -/U/ SYLLABLES >> /pu/ P /tu/ T /ku/ K /gu/ (tso) C >> /mu/ M /nu/ N /su/ S /lu/ L >> /ju/ (yo) Y /vu/ (fo) W /ru/ R /lhu/ (lho) } >> >> -/A/ SYLLABLES >> /pa/ P /ta/ T /ka/ K /ga/ (tsa) C /ma/ M /na/ N /sa/ S >> /la/ L >> /ja/ (ya) Y /va/ (fa) W /ra/ R /lha/ ] >> > > I have a few questions: > ? How do you handle the final consonants? > ? How do you type long syllables (those with a dot above)? with a dead > key or with the Alt Gr (= right Alt) key? > ? There are a few missing characters in your description: > ? ? : You could map it to Shift + ? (Shift + O). > ? ? ? ? ? (qai qi qu qa): You could map them to the still unused > keys H and X. > ? ? ? ? ? (ngai ngi ngu nga): There?s no room left for them. If > you applied your general scheme (pai = Shift + pi, pu = Shift + pa), > you could move ? 
to Shift + ? and ? to Shift + ?. That way, the keys E > and U could be freed to accept ? ? ? ?. > ? ? ? ? (nngi nngu nnga): I have no idea where you could put them. > ? ? replaces the ?. Aren?t there any question marks in Inuktitut? > > >> There are quite a few _/missing/_ syllablics characters: >> ?The character for the syllable /lhai/ (lhe) (like a horizontally >> mirrored /lhi/, or a rotated /lha/) > > Once upon a time, the ai-pai-tai? syllables were discarded in > Inuktitut because there weren?t enough room for the whole syllabary on > the daisy wheel of an electric typewriter. > Later, they were readopted in Nunavik, but not in Nunavut where ? is > still written ??. > Apparently, ? ? ? ? ? ? ? are used only at the West of Hudson?s Bay > (in Nunavut). Therefore, there is no need for a lhai syllable. > If you have examples of use of that syllable, please share them with > us and the powers that be could perhaps investigate a little about it. > > >> ?The characters for the entire /sp/- series (shown on Wikipedia?s >> article on UCAS as copies of ZESS, Z, N, and Russian Cyrillic >> I-OBROTNOYE) > > I suppose ? and ?, which are used for l and r in some languages, come > from there, and their finals, ? and ?, were not forgotten. > But you?re right : those letters, which look like Z, reversed Z, N, > and ? were completely forgotten. > Perhaps they were judged too archaic to be encoded. > But now, Unicode includes hieroglyphs, cuneiforms, and many other old > scripts, so I think those characters could be considered for a proposal. > I wonder whether there are any relations between those missing syllables and the Cree ? ? ? ? (she shi sho sha). > >> Where would y?all think about where to place these still-missing UCAS >> characters?!?! >> Thank You! 
>> >> >> >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From A.Schappo at lboro.ac.uk Wed Jul 16 10:09:49 2014 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 16 Jul 2014 15:09:49 +0000 Subject: Noto adds CJK, plus new user-facing website In-Reply-To: References: Message-ID: <7117CC4F-D938-466C-9996-92233669FC8C@lboro.ac.uk> Looks like you are also working on a color emoji font https://code.google.com/p/noto/source/browse/#git/color_emoji ??? Andr? On 16 Jul 2014, at 00:33, Roozbeh Pournader wrote: Please excuse the spam, but I think it would be interesting for people here to know that the Noto open source project now supports CJK, which brings it very close to the goal of supporting every major script (and several minor and historical ones). Here is the CJK announcement: http://googledevelopers.blogspot.com/2014/07/noto-cjk-font-that-is-complete.html Here is the new user-oriented Noto website: http://www.google.com/get/noto/ The data on the website is from the CLDR project, and the sample images are rendered using HarfBuzz and Pango. And more will be coming. (Of all the scripts used for CLDR languages, only three have not been released yet.) _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode ???????????????? http://twitter.com/andreschappo http://schappo.blogspot.co.uk http://weibo.com/andreschappo http://blog.sina.com.cn/andreschappo -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ken.whistler at sap.com Wed Jul 16 12:08:51 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 16 Jul 2014 17:08:51 +0000 Subject: Noto adds CJK, plus new user-facing website In-Reply-To: References: Message-ID: Andrew, Everybody recognizes the potential risks of getting out too far over one's skis in implementations, but this particular one seems a relatively small risk. Seldom (if ever?) has a NB objected in ballot to these small repertoire additions that have periodically been tacked on at the end of the URO range, once they have been reviewed and gone into ballot. Everybody recognizes that these 4 are needed. So for Adobe Source Han to implement early, given the lead times involved, seems like a smart move to me. It is just additional incentive for the various NB's involved in the ballot review to not mess with the code point allocations for these 4 during ballot comments. ;-) --Ken > Fantastic news, but I personally think that the decision to include > the four characters at U+9FCD through U+9FD0 in the Adobe Source Han / > Noto Sans Simplified Chinese fonts (and U+9FD0 in the Traditional > Chinese fonts) is extremely premature given that these characters have > only just been added to the draft repertoire for ISO/IEC 10646:2014 > Amd. 2, and have not yet completed even their first round of the ISO > balloting process. As such the code point allocations are not stable, > and should not be used in fonts for public consumption. > > Andrew From frederic.grosshans at gmail.com Wed Jul 16 13:12:44 2014 From: frederic.grosshans at gmail.com (=?windows-1252?Q?Fr=E9d=E9ric_Grosshans?=) Date: Wed, 16 Jul 2014 20:12:44 +0200 Subject: Unified Canadian Aboriginal =?windows-1252?Q?Syllabics=97Mis?= =?windows-1252?Q?sing_Syllable_Characters?= In-Reply-To: <53C673D6.4020003@colson.eu> References: <53C673D6.4020003@colson.eu> Message-ID: <53C6C09C.3080407@gmail.com> Le 16/07/2014 14:45, Jean-Fran?ois Colson a ?crit : > Once upon a time, the ai-pai-tai? 
syllables were discarded in > Inuktitut because there weren't enough room for the whole syllabary on > the daisy wheel of an electric typewriter. If they were discarded for electric typewriters, it means that they were historically used before. Wikipedia ( https://en.wikipedia.org/wiki/Inuktitut_syllabics#Modifications ) states it happened in the 1970s. If I understood this correctly, it should not be too difficult to find use examples of these characters, and I would be very surprised if such recent texts would not be thought fit for Unicode encoding. Frédéric From jf at colson.eu Wed Jul 16 14:31:04 2014 From: jf at colson.eu (=?windows-1252?Q?Jean-Fran=E7ois_Colson?=) Date: Wed, 16 Jul 2014 21:31:04 +0200 Subject: Unified Canadian Aboriginal =?windows-1252?Q?Syllabics=97Mis?= =?windows-1252?Q?sing_Syllable_Characters?= In-Reply-To: <53C6C09C.3080407@gmail.com> References: <53C673D6.4020003@colson.eu> <53C6C09C.3080407@gmail.com> Message-ID: <53C6D2F8.3080908@colson.eu> Le 16/07/14 20:12, Frédéric Grosshans a écrit : > Le 16/07/2014 14:45, Jean-François Colson a écrit : >> Once upon a time, the ai-pai-tai? syllables were discarded in >> Inuktitut because there weren't enough room for the whole syllabary >> on the daisy wheel of an electric typewriter. > > If they were discarded for electric typewriters, it means that they > were historically used before. Wikipedia ( > https://en.wikipedia.org/wiki/Inuktitut_syllabics#Modifications ) > states it happened in the 1970s. If I understood this correctly, it > should not be too difficult to find use examples of these characters, > and I would be very surprised if such recent texts would not be > thought fit for Unicode encoding. I agree: they're useless today but perhaps they were used in the recent past (i.e. less than half a century ago). I'm not in Canada and I don't plan to go there in the near future. Perhaps Robert Wheelock has some examples of texts with the syllable lhai?
> > Fr?d?ric > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From dwanders at sonic.net Wed Jul 16 14:50:33 2014 From: dwanders at sonic.net (Deborah W. Anderson) Date: Wed, 16 Jul 2014 12:50:33 -0700 Subject: =?iso-8859-2?Q?RE:_Unified_Canadian_Aboriginal_Syllabics-Missing_Syllable?= =?iso-8859-2?Q?_Characters?= Message-ID: <008101cfa12f$34d4cd50$9e7e67f0$@sonic.net> I am forwarding the following response to R Wheelock's original posting from Chris Harvey, who has just re-subscribed to this email list. (Chris worked with Michael Everson on the "Proposal to encode additional Unified Canadian Aboriginal Syllabics" [http://www.unicode.org/L2/L2008/08132r-n3427r-syllabics.pdf], and has long been involved in work on indigenous languages and support for them in keyboards, fonts, etc.) --Debbie Anderson -----Original Message----- From: Chris Harvey [mailto:languagegeek at gmail.com] Sent: Monday, July 14, 2014 8:36 PM > There are quite a few missing syllablics characters: > .The character for the syllable lhai (lhe) (like a horizontally > mirrored lhi, or a rotated lha) The dialect that uses /?/ does not have the ai-series diacritics in the orthography. > .The characters for the entire sp- series (shown on Wikipedia's > article on UCAS as copies of ZESS, Z, N, and Russian Cyrillic > I-OBROTNOYE) The sp-series was never used outside early experimentation with Cree syllabics. No language has ever used them. (Note: In Canada, where syllabics are used, it's spelled Inuit, not Innuit.) Chris --- This email is free from viruses and malware because avast! Antivirus protection is active. http://www.avast.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From frederic.grosshans at gmail.com Thu Jul 17 09:57:18 2014 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Thu, 17 Jul 2014 16:57:18 +0200 Subject: Unified Canadian Aboriginal Syllabics-Missing Syllable Characters In-Reply-To: <008101cfa12f$34d4cd50$9e7e67f0$@sonic.net> References: <008101cfa12f$34d4cd50$9e7e67f0$@sonic.net> Message-ID: <53C7E44E.9090408@gmail.com> Le 16/07/2014 21:50, Deborah W. Anderson a écrit : > gmail.com] > > Sent: Monday, July 14, 2014 8:36 PM > > > There are quite a few missing syllablics characters: > > > •The character for the syllable lhai (lhe) (like a horizontally > > > mirrored lhi, or a rotated lha) > > The dialect that uses /?/ does not have the ai-series diacritics in > the orthography. > Did this dialect use the ai series before it was discarded elsewhere (presumably in the 1970s, for electric typewriters)? > > > •The characters for the entire sp- series (shown on Wikipedia's > > > article on UCAS as copies of ZESS, Z, N, and Russian Cyrillic > > > I-OBROTNOYE) > > The sp-series was never used outside early experimentation with Cree > syllabics. No language has ever used them. > This seems to be a situation similar to the archaic Cherokee letter mv, which was dropped very early in the history of the script. It is however on its way to be encoded as U+13F5 CHEROKEE LETTER MV, following the proposal L2/14-064 by Michael Everson and Durbin Feeling ( http://www.unicode.org/L2/L2014/14064r-n4537r-cherokee.pdf ). Similarly, I guess the encoding of the SP series could be useful for discussing the script history (as on the Wikipedia page) and transcribing historic texts, like this 1841 Cree Hymn book ( http://peel.library.ualberta.ca/bibliography/209/reader.html#17 ), which has, for example, a “SPI” on the third line, p. 16.
Frédéric From doug at ewellic.org Thu Jul 17 16:10:19 2014 From: doug at ewellic.org (Doug Ewell) Date: Thu, 17 Jul 2014 14:10:19 -0700 Subject: Unified Canadian Aboriginal Syllabics-Missing Syllable Characters Message-ID: <20140717141019.665a7a7059d7ee80bb4d670165c8327d.334b9621b2.wbe@email03.secureserver.net> Frédéric Grosshans wrote: > Similarly, I guess the encoding of the SP series could be useful for > discussing the script history (as on the Wikipedia page) I've gotten into trouble before for this sort of comment, but I'll take my chances again: Any time a character is suggested for encoding *in order to facilitate talking about that character*, there are very few blobs and squiggles in the course of human experience that would not qualify for encoding. > and transcribing historic texts, like this 1841 Cree Hymn book ( > http://peel.library.ualberta.ca/bibliography/209/reader.html#17 ), > which has, for example, a “SPI” on the third line, p. 16. The Cree hymnbook would probably provide a stronger argument, assuming the character in question is distinct and not a glyph variant or typo or bug splat.
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From fantasai.lists at inkedblade.net Tue Jul 22 09:03:33 2014 From: fantasai.lists at inkedblade.net (fantasai) Date: Tue, 22 Jul 2014 07:03:33 -0700 Subject: Ambiguous hyphenation cases with In-Reply-To: <896B66B790C04549AD3BF12412E175E39D35164C0E@hst-mail01.sdbit.local> (sfid-20140512_074406_966504_FE3AFEC5) References: <896B66B790C04549AD3BF12412E175E39D2A944772@hst-mail01.sdbit.local> (sfid-20140212_141727_830873_C9E64B6E) <536C2765.1050306@inkedblade.net> <896B66B790C04549AD3BF12412E175E39D35164C0E@hst-mail01.sdbit.local> (sfid-20140512_074406_966504_FE3AFEC5) Message-ID: <53CE6F35.5000405@inkedblade.net> On 05/12/2014 12:43 AM, Håkan Save Hansson wrote: > Hi fantasai, > > Regarding your answer to my second suggestion (if you are referring > to James Clark's first answer): > > The problem is that the hyphenation system in itself can't decide how > to change the spelling, without any "dictionary" functionality. It > can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv" > ("carpet thief") when I wrote "mat&shy;tjuv". So there has to be a way > to tell the hyphenation system that. Hm. I don't think I have a solution for that problem. :/ Currently you'd just have to not hyphenate that word.
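One way to see why a plain soft hyphen cannot carry the Swedish respelling: U+00AD only marks a break opportunity, whereas the carpet-thief case needs a marker that also restores the contracted letter when the break is taken. A Python sketch (the "reduplicating" marker is hypothetical; U+2065 is an unassigned code point used here purely for illustration):

```python
SHY = "\u00ad"   # U+00AD SOFT HYPHEN: break opportunity, spelling unchanged
RSHY = "\u2065"  # hypothetical marker (U+2065 is an unassigned code point)

def render(word: str, broken: bool) -> str:
    """Render a word carrying soft-hyphen-like markers.

    SHY behaves as usual: invisible, or replaced by '-' at a taken break.
    RSHY, at a taken break, repeats the letter preceding it before the
    hyphen (Swedish 'mattjuv' -> 'matt-tjuv'); otherwise it vanishes.
    """
    if not broken:
        return word.replace(SHY, "").replace(RSHY, "")
    if RSHY in word:
        i = word.index(RSHY)
        word = word[:i] + word[i - 1] + "-" + word[i + 1:]
    return word.replace(SHY, "-")

# Food thief (mat-tjuv): a plain soft hyphen is enough.
print(render("mat" + SHY + "tjuv", broken=False))   # mattjuv
print(render("mat" + SHY + "tjuv", broken=True))    # mat-tjuv
# Carpet thief (matt-tjuv): the marker restores the contracted 't'.
print(render("mat" + RSHY + "tjuv", broken=False))  # mattjuv
print(render("mat" + RSHY + "tjuv", broken=True))   # matt-tjuv
```

Either way, the author still has to choose the right marker per word, so this only moves the "dictionary" into the text itself.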
CCing Unicode, in case anyone there has a solution Up-reference: http://lists.w3.org/Archives/Public/www-style/2014Feb/0739.html ~fantasai From christoph.paeper at crissov.de Tue Jul 22 11:14:11 2014 From: christoph.paeper at crissov.de (=?windows-1252?Q?Christoph_P=E4per?=) Date: Tue, 22 Jul 2014 18:14:11 +0200 Subject: Ambiguous hyphenation cases with In-Reply-To: <53CE6F35.5000405@inkedblade.net> References: <896B66B790C04549AD3BF12412E175E39D2A944772@hst-mail01.sdbit.local> (sfid-20140212_141727_830873_C9E64B6E) <536C2765.1050306@inkedblade.net> <896B66B790C04549AD3BF12412E175E39D35164C0E@hst-mail01.sdbit.local> (sfid-20140512_074406_966504_FE3AFEC5) <53CE6F35.5000405@inkedblade.net> Message-ID: fantasai : >> The problem is that the hyphenation system in itself can't decide how >> to change the spelling, without any "dictionary" functionality. It >> can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv" >> ("carpet thief") when I wrote "mat&shy;tjuv". So there has to be a way >> to tell the hyphenation system that. Imagine if there was also "matt‧juv" next to "mat‧tjuv" and "matt‧tjuv", or even "mat‧ttjuv". > Hm. I don't think I have a solution for that problem. :/ > Currently you'd just have to not hyphenate that word. Smart-font solution (OpenType, AFDKO syntax): "mattjuv, matttjuv"

    lookup tripleletters {
        sub t' t' t by t;
    } tripleletters;

    feature rlig {
        script latn;
        language SWE exclude_dflt;
        lookup tripleletters;
    } rlig;

Combining Grapheme Joiner (U+034F, "CGJ") could possibly be given an interpretation like this (XML syntax), but Zero-Width Non-Joiner (U+200C, "ZWNJ") should probably not be repurposed: "mattjuv, mat͏tjuv"

Possible Unicode solution with a new combining character that makes the preceding character or grapheme (I'm not sure which) invisible except at the end of a line: "mattjuv, matt⁥tjuv"

U+2065: Combining Collapse or Reduplicating Soft Hyphen or so

All solutions require author education.
The latter two require changes to existing software and specifications (including CSS), the former "just" updated fonts. The second solution would fall back gracefully to "mattjuv", the others to "matttjuv", maybe even with a .notdef glyph in there. All of these approaches are too complicated for Joe Sixpack (or Jo Sexpack), so I don't think that will work in practice, except in environments that already make sure to treat border cases like disambiguation of umlaut and diaeresis use of trema dots.

JFTR, Swedish is not the only language with this orthographic feature. The German orthography reform of 1996 did away with letter collapsing completely, probably for this very problem. Now there are instances of three times the same letter on the same line, which some consider ugly, but smart fonts can overcome most of the perceived problems by ligating the first two letters of such a sequence or by selecting an alternate glyph for the final one. The special treatment of the double-"k" grapheme was also abolished: it used to look like "ck" (often a ligature) except at the end of the line, where it showed its real face, "k-k"; now it's always typed, encoded and displayed as "ck" and cannot be separated. Theoretical graphemes "zz" and "hh" still look like "tz" and "ch" respectively, whereof only the former may be split "t-z".

From vargavind at gmail.com Tue Jul 22 11:24:36 2014 From: vargavind at gmail.com (Kess Vargavind) Date: Tue, 22 Jul 2014 18:24:36 +0200 Subject: Ambiguous hyphenation cases with In-Reply-To: <53CE6F35.5000405@inkedblade.net> References: <896B66B790C04549AD3BF12412E175E39D2A944772@hst-mail01.sdbit.local> <536C2765.1050306@inkedblade.net> <896B66B790C04549AD3BF12412E175E39D35164C0E@hst-mail01.sdbit.local> <53CE6F35.5000405@inkedblade.net> Message-ID: There actually is one simple solution that I sometimes use: do not contract three consecutive same-letter consonants at all! That is, do like Icelandic and write food thief as "mattjuv" and carpet thief as "matttjuv".
Then there is no trouble hyphenating. Yes, this goes against current spelling rules in Swedish, but it works. And until there is better hyphenation support for corner cases like this (either at character level or higher), that is how I have "solved" it when unable to do manual tweaking.

Would it be logical to add a character similar to U+00AD SOFT HYPHEN (shy) that says "you can break me here, but unless you do, please skip the previous character (however such would be defined in a case like this)"? Such that the word is either rendered "mattjuv" or broken up as "matt-tjuv".

Kess

2014-07-22 16:03 GMT+02:00 fantasai :

> On 05/12/2014 12:43 AM, Håkan Save Hansson wrote:
>
>> Hi fantasai,
>>
>> Regarding your answer to my second suggestion (if you are referring
>> to James Clark's first answer):
>>
>> The problem is that the hyphenation system in itself can't decide how
>> to change the spelling, without any "dictionary" functionality. It
>> can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv"
>> ("carpet thief") when I wrote "mat&shy;tjuv". So there has to be a way
>> to tell the hyphenation system that.
>
> Hm. I don't think I have a solution for that problem. :/ Currently you'd
> just have to not hyphenate that word.
>
> CCing Unicode, in case anyone there has a solution
>
> Up-reference: http://lists.w3.org/Archives/Public/www-style/2014Feb/0739.html
>
> ~fantasai
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fantasai.lists at inkedblade.net Wed Jul 23 14:45:48 2014
From: fantasai.lists at inkedblade.net (fantasai)
Date: Wed, 23 Jul 2014 20:45:48 +0100
Subject: Request for Information
Message-ID: <53D010EC.9060204@inkedblade.net>

I would like to request that Unicode include, for each writing system it encodes, some information on how it might justify.
Possible options include

a) Text justification typically expands at word-separating characters, but may also expand between letters.
b) Since this writing system does not use spaces, justification typically expands between letters.
c) Text justification can elongate glyphs and/or expand spaces, but because the script is cursive, cannot introduce inter-letter spacing.
d) We do not have information on text justifying practices for this script.

Anything, really, would be helpful. Even saying you have no clue is helpful.

I would also like to request that the prose chapters include, for each writing system encoded, some information on line-breaking conventions. For example

a) Latin typically breaks only at spaces and other punctuation. However, it also admits hyphenation within words. In some contexts (such as Japanese), it may, as a stylistic option, break anywhere (without hyphens).
b) Arabic breaks between words. Some languages (such as Uyghur) allow hyphenation, but most do not.
c) Japanese can break anywhere, except restrictions can be introduced by symbols and punctuation, and sometimes breaks before small kana are suppressed.
d) Javanese only breaks between clauses, where punctuation is used, resulting in horrendously ragged lines. (Did I get that right?)
e) We have no idea how this script would break across lines. There is only one inscription extant, and it is undeciphered.
f) We believe that this script can break across lines, but the encoding proposal neglected to tell us how. We suggest pretending it's Latin for now.

This information is of course encoded into UAX #14 and can be extracted from there (as I have done for Javanese above); however, it's helpful to have an overview of what to expect from the data tables, and also a clue about possible tailoring requirements.
~fantasai

From markus.icu at gmail.com Wed Jul 23 14:53:36 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 23 Jul 2014 12:53:36 -0700
Subject: Request for Information
In-Reply-To: <53D010EC.9060204@inkedblade.net>
References: <53D010EC.9060204@inkedblade.net>
Message-ID: 

Some of the data is available in the Unicode CLDR script metadata:

http://unicode.org/cldr/trac/browser/trunk/common/properties/scriptMetadata.txt
http://cldr.unicode.org/development/updating-codes/updating-script-metadata

markus
--
Google Internationalization Engineering

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Wed Jul 23 16:47:32 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 23 Jul 2014 23:47:32 +0200
Subject: Request for Information
In-Reply-To: <53D010EC.9060204@inkedblade.net>
References: <53D010EC.9060204@inkedblade.net>
Message-ID: 

2014-07-23 21:45 GMT+02:00 fantasai :

> c) Text justification can elongate glyphs and/or expand spaces, but because
> the script is cursive, cannot introduce inter-letter spacing

Cursive scripts can use inter-letter "spacing". In fact, many OpenType fonts for such scripts include the necessary mapping to a joining glyph that can be safely elongated, repeated or truncated to fill the gap and preserve the inter-letter joining. They also use contextual glyphs: when the script is horizontal, the joining glyph is typically a horizontal stroke, but there are cases where it may be diagonal or split in three parts with the central part horizontal and elongatable. In the Arabic script this glyph is even encoded as a compatibility character (with some eastern styles used in Persian, Urdu or Uyghur, the vertical placement of the joining glyph is complex).

But you can do the same as well for other cursive scripts such as Devanagari.
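The elongation mechanism described here can be illustrated crudely at the character level. Real kashida justification is done by the layout engine and font (glyph elongation, not character insertion); this sketch just inserts U+0640 ARABIC TATWEEL, the compatibility character mentioned, and the helper name is hypothetical:

```python
TATWEEL = "\u0640"  # U+0640 ARABIC TATWEEL, the encoded joining stroke

def pad_with_tatweel(word, extra):
    """Naively widen a joined Arabic word by inserting `extra` tatweels
    before its final letter (string in logical order). Illustration
    only: real kashida insertion is context- and font-dependent."""
    if extra <= 0 or len(word) < 2:
        return word
    return word[:-1] + TATWEEL * extra + word[-1]
```

For example, padding the three-letter word "كتب" with three tatweels yields a six-character string whose joined rendering is visually wider, while the underlying letters are unchanged.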
It's possible as well for the Mongolian script (even if its fonts are built with a 90 degrees clockwise rotation so that by default they render left-to-right with lines staked top to bottom like Latin, and the traditional rendering is vertical by just rotating glyphs 90 degrees clockwise so they render lines top-to-bottom and stack lines right to left (like with Sinograms, Yi, Tangut, Bopomofo, Hirag111ana, Katakana and with old Hangul styles; except these scripts almost never need rotation of glyphs because they are most often not joined in cursive styles) In some calligraphic artworks painted with brushes, creative cursive styles are very difficult to reproduce with a general purpose font: these artworks are best reproduced with graphic formats such as SVG; that do not need any text encoding with Unicode (except for the embedded metadata containing descriptive plain text or alternated text). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jul 23 17:06:29 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 24 Jul 2014 00:06:29 +0200 Subject: Request for Information In-Reply-To: References: <53D010EC.9060204@inkedblade.net> Message-ID: It would be useful to have a sample text of the language, useful to show examplar characters but a but more showing typical layout of words. Many font selectors use only the sme Englsh text, a non-sense when fonts are specifically designed for a non-Latin script. The first sentence of the UDHR is better but sme lanuages have typical sentences used for typographic usage, showing all (or most typical) letters of their alphabet, plus essential punctuation (comma, full stop) in at least one complete sentence. (sample digits are not needed they can be infered using number formats). 
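Concretely, the suggestion amounts to a per-language lookup. The table shape and keys below are invented for the sketch (CLDR has no such field as of this thread); the English and French entries are the standard pangrams alluded to:

```python
# Hypothetical sample-text table of the kind proposed here; the
# structure and keys are illustrative only, not actual CLDR data.
SAMPLE_TEXT = {
    "en": "The quick brown fox jumps over the lazy dog.",    # pangram
    "fr": "Portez ce vieux whisky au juge blond qui fume.",  # pangram
    "la": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
}

def sample_for(locale_id):
    """Look up a sample text, falling back from e.g. 'fr_CA' to 'fr'."""
    return SAMPLE_TEXT.get(locale_id.split("_")[0])
```

A locale without an entry simply gets no sample, which is where a UDHR first-sentence fallback could slot in.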
We could then have the addition of a "sample text" in CLDR (which would not be a set of translations: English would use the typical sentence about the "jumping lazy fox", French could use the sentence about the "juge blond", Latin could use the well-known first sentence of "Lorem Ipsum" even if it's pseudo-Latin; the meaning of the sentence does not really matter).

For large scripts (including Hangul, even though it is a small and simple alphabet with a featural layout), it is not possible to show all letters, but at least a representative subset showing typical features of the script; the translation of the 1st sentence of the UDHR is a good sample text.

2014-07-23 21:53 GMT+02:00 Markus Scherer :

> Some of the data is available in the Unicode CLDR script metadata:
>
> http://unicode.org/cldr/trac/browser/trunk/common/properties/scriptMetadata.txt
> http://cldr.unicode.org/development/updating-codes/updating-script-metadata
>
> markus
> --
> Google Internationalization Engineering
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From emuller at adobe.com Wed Jul 23 17:23:46 2014
From: emuller at adobe.com (Eric Muller)
Date: Wed, 23 Jul 2014 15:23:46 -0700
Subject: Parsers for the UnicodeSet notation?
Message-ID: <53D035F2.5070101@adobe.com>

I would like to work with the exemplarCharacters data in the CLDR. That uses the UnicodeSet notation. Is there somewhere a parser for that notation, that would return me just the list of characters in the set? Something a bit like the UnicodeSet utility at , but for use in apps/shell.

I suspect that the exemplarCharacters use a restricted form of the UnicodeSet notation (e.g. do not use property values). Is that correct, and if so, what's the subset?
Incidentally, I copy/pasted the punctuation exemplar characters for he.xml into the utility, and it reported that the set contains 8,130 code points, including the ascii letters. Somehow, that seems incorrect. What did I do wrong? Thanks, Eric. From roozbeh at unicode.org Wed Jul 23 17:28:51 2014 From: roozbeh at unicode.org (Roozbeh Pournader) Date: Wed, 23 Jul 2014 15:28:51 -0700 Subject: Parsers for the UnicodeSet notation? In-Reply-To: <53D035F2.5070101@adobe.com> References: <53D035F2.5070101@adobe.com> Message-ID: On Wed, Jul 23, 2014 at 3:23 PM, Eric Muller wrote: > I would like to work with the exemplarCharacters data in the CLDR. That > uses the UnicodeSet notation. Is there somewhere a parser for that > notation, that would return me just the list of characters in the set? Note that it's a set of strings, not characters. I suspect that the exemplarCharacters use a restricted form of the > UnicodeSet notation (e.g. do not use property values). Is that correct, and > if so, what's the subset? > I have an Apache-licensed parser in Python here: https://code.google.com/p/noto/source/browse/nototools/generate_website_data.py#180 -------------- next part -------------- An HTML attachment was scrubbed... URL: From srl at icu-project.org Wed Jul 23 17:31:24 2014 From: srl at icu-project.org (Steven R. Loomis) Date: Wed, 23 Jul 2014 15:31:24 -0700 Subject: Parsers for the UnicodeSet notation? In-Reply-To: <53D035F2.5070101@adobe.com> References: <53D035F2.5070101@adobe.com> Message-ID: <53D037BC.5010409@icu-project.org> On 07/23/2014 03:23 PM, Eric Muller wrote: > I would like to work with the exemplarCharacters data in the CLDR. > That uses the UnicodeSet notation. Is there somewhere a parser for > that notation, that would return me just the list of characters in the > set? Something a bit like the UnicodeSet utility at > , but for use in > apps/shell. > > I suspect that the exemplarCharacters use a restricted form of the > UnicodeSet notation (e.g. 
do not use property values). Is that > correct, and if so, what's the subset? > > Incidentally, I copy/pasted the punctuation exemplar characters for > he.xml into the utility, and it reported that the set contains 8,130 > code points, including the ascii letters. Somehow, that seems > incorrect. What did I do wrong? > > Thanks, > Eric. > Eric, UnicodeSet is a class available in ICU4J and ICU4C/C++ and so you can parse and query using the ICU API. I wrote a little command line utility badly named "ucd" that is similar to the web page mentioned above. It is here: http://source.icu-project.org/repos/icu/icuapps/trunk/ucd/ and here is the readme: http://source.icu-project.org/repos/icu/icuapps/trunk/ucd/readme.txt let me know what platform you are on and I can send you build instructions. -s -- IBMer but all opinions are mine. https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl From srl at icu-project.org Wed Jul 23 18:18:20 2014 From: srl at icu-project.org (Steven R. Loomis) Date: Wed, 23 Jul 2014 16:18:20 -0700 Subject: Parsers for the UnicodeSet notation? In-Reply-To: References: <53D035F2.5070101@adobe.com> Message-ID: <53D042BC.40103@icu-project.org> On 07/23/2014 03:28 PM, Roozbeh Pournader wrote: > On Wed, Jul 23, 2014 at 3:23 PM, Eric Muller > wrote: > > I would like to work with the exemplarCharacters data in the CLDR. > That uses the UnicodeSet notation. Is there somewhere a parser for > that notation, that would return me just the list of characters in > the set? > > > Note that it's a set of strings, not characters. > > I suspect that the exemplarCharacters use a restricted form of the > UnicodeSet notation (e.g. do not use property values). Is that > correct, and if so, what's the subset? > > > I have an Apache-licensed parser in Python here: > https://code.google.com/p/noto/source/browse/nototools/generate_website_data.py#180 > Nice, you should get those CLDR folks to add a link! 
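For readers without ICU or the Noto script at hand, the restricted subset that exemplar data uses in practice (literal characters, a-b ranges, and {multi-character strings}) can be parsed with a short sketch like this; it is emphatically not the ICU or Noto implementation, and it handles no property expressions, escapes, nesting, or set operations:

```python
def parse_exemplars(uset):
    """Parse a restricted UnicodeSet like '[a b {ch} d-f]' into a set
    of strings. No \\p{...} properties, escapes, or nested sets --
    those need a real parser (ICU's UnicodeSet, or the script above)."""
    if not (uset.startswith("[") and uset.endswith("]")):
        raise ValueError("expected a [...] set")
    body, out, i = uset[1:-1], set(), 0
    while i < len(body):
        c = body[i]
        if c.isspace():
            i += 1
        elif c == "{":                                  # multi-char string
            j = body.index("}", i)
            out.add(body[i + 1:j])
            i = j + 1
        elif i + 2 < len(body) and body[i + 1] == "-":  # range a-b
            out.update(chr(cp) for cp in range(ord(c), ord(body[i + 2]) + 1))
            i += 3
        else:                                           # single character
            out.add(c)
            i += 1
    return out
```

Note that the result is a set of strings, not characters, matching the point made above about exemplar sets.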
I'm cross posting this to cldr-users, which may be more appropriate. Eric, to answer your second question, the TR35 spec does not say that exemplars are a restricted set, as per http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-general.html#ExemplarSyntax - in practice, a restricted set is used, ranges are expanded. But there's no guarantee of this by the spec. -s -- IBMer but all opinions are mine. https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl From richard.wordingham at ntlworld.com Thu Jul 24 01:37:42 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 24 Jul 2014 07:37:42 +0100 Subject: Request for Information In-Reply-To: <53D010EC.9060204@inkedblade.net> References: <53D010EC.9060204@inkedblade.net> Message-ID: <20140724073742.6248cb29@JRWUBU2> On Wed, 23 Jul 2014 20:45:48 +0100 fantasai wrote: > I would like to request that Unicode include, for each writing system > it encodes, some information on how it might justify. Unicode encodes scripts, and I suspect CLDR only really supports living languages. Scripts can be used for multiple writing systems - the example of the Latin script for Romaji in Japanese was given in the original post. > a) Text justification typically expands at word-separating > characters, but may also expand between letters. > b) Since this writing system does not use spaces, justification > typically expands between letters. Are you hoping for details on this? This justification, which I've seen called 'Thai justification' in Microsoft Word, generally treats spacing combining marks (gc=Mc) like letters in the Tai Tham script when used for Tai Khuen. > a) Latin typically breaks only at spaces and other punctuation. > However, it also admits hyphenation within words. > In some contexts (such as Japanese), it may, as a stylistic > option, break anywhere (without hyphens). This is also a mediaeval European style! 
> c) Javanese only breaks between clauses, where punctuation is used,
> resulting in horrendously ragged lines. (Did I get that right?)

No. The text samples I could find quickly show scripta continua, but I suspect the line breaks are occurring at word or syllable boundaries. If I am right about the constraint on line break position, then this can be recovered by marking the optional line breaks with ZWSP. In addition, the consonants should be reclassified from AL to SA. However, such a change would be incompatible with a modern writing system in which words are separated by spaces (if such exists). I don't know what happens in Indonesian schools, so I can't report an error. Scripta continua and non-scripta continua in the same script are incompatible in plain text.

> This information is of course encoded into UAX #14 and can be extracted
> from there (as I have done for Javanese above),

Not when writing systems in the same script differ as to whether they delimit words by line-break-inducing marks. Some Thai script minority writing systems are supposed to use spaces to separate words, whereas Thai is written using scripta continua.

Richard.

From emuller at adobe.com Thu Jul 24 01:51:15 2014
From: emuller at adobe.com (Eric Muller)
Date: Wed, 23 Jul 2014 23:51:15 -0700
Subject: Parsers for the UnicodeSet notation?
In-Reply-To: <53D035F2.5070101@adobe.com>
References: <53D035F2.5070101@adobe.com>
Message-ID: <53D0ACE3.5090304@adobe.com>

Thanks for the answers. I take it from Steve's answer that Roozbeh's parser may work today but may break tomorrow.

A couple of suggestions:

- a full "parser" of UnicodeSet is non-trivial, since it involves having access to property values. That does not seem really necessary for exemplars, so maybe it would be good to restrict the UnicodeSet there.

- alternatively, since the extent of a UnicodeSet can involve property values, it means that the extent can depend on the Unicode version from which those values come from.
Which means that there ought to be a Unicode version number in the CLDR data; it would be nice for that number to be present in the data files (I don't see one in he.xml)

> Incidentally, I copy/pasted the punctuation exemplar characters for
> he.xml into the utility, and it reported that the set contains 8,130
> code points, including the ASCII letters. Somehow, that seems
> incorrect. What did I do wrong?

Sorry, I took the UnicodeSet straight out of he/characters.json, without handling the JSON serialization (or rather deserialization) of strings. Taking it straight out of he.xml (where there is no serialization effect) gives a much more reasonable set of twenty strings. XML wins again ;-)

Eric.

From duerst at it.aoyama.ac.jp Thu Jul 24 02:09:02 2014
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 24 Jul 2014 16:09:02 +0900
Subject: Request for Information
In-Reply-To: <20140724073742.6248cb29@JRWUBU2>
References: <53D010EC.9060204@inkedblade.net> <20140724073742.6248cb29@JRWUBU2>
Message-ID: <53D0B10E.3030400@it.aoyama.ac.jp>

On 2014/07/24 15:37, Richard Wordingham wrote:
> No. The text samples I could find quickly show scripta continua, but I
> suspect the line breaks are occurring at word or syllable boundaries.
> If I am right about the constraint on line break position, then this
> can be recovered by marking the optional line breaks with ZWSP. In
> addition, the consonants should be reclassified from AL to SA.
> However, such a change would be incompatible with a modern writing
> system in which words are separated by spaces (if such exists). I don't
> know what happens in Indonesian schools, so I can't report an error.
> Scripta continua and non-scripta continua in the same script are
> incompatible in plain text.

Shouldn't that be "scripta non-continua"?

Regards, Martin.
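Eric's he/characters.json mishap earlier in the thread comes down to double escaping: the JSON file stores the UnicodeSet as a JSON string value, so it must be JSON-decoded before being handed to any UnicodeSet parser. A sketch with a hypothetical excerpt (the real file's contents differ):

```python
import json

# Hypothetical excerpt of how an exemplar set might sit in a JSON data
# file: a JSON *string value* with \uXXXX escapes. Not the real file.
raw_value = '"[\\u05f3 \\u05f4 \\u2013]"'

# Copy/pasting the raw file contents hands a tool the escaped form;
# JSON-decoding first removes that layer and yields the characters.
exemplars = json.loads(raw_value)
```

The XML file avoids the issue only because there is no second serialization layer between the attribute value and the set notation.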
From mark at macchiato.com Thu Jul 24 02:10:01 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 24 Jul 2014 09:10:01 +0200
Subject: Parsers for the UnicodeSet notation?
In-Reply-To: <53D0ACE3.5090304@adobe.com>
References: <53D035F2.5070101@adobe.com> <53D0ACE3.5090304@adobe.com>
Message-ID: 

On Thu, Jul 24, 2014 at 8:51 AM, Eric Muller wrote:

> - a full "parser" of UnicodeSet is non-trivial, since it involves having
> access to property values. That does not seem really necessary for
> exemplars, so maybe it would be good to restrict the UnicodeSet there.
>
> - alternatively, since the extent of a UnicodeSet can involve property
> values, it means that the extent can depend on the Unicode version from
> which those values come from. Which means that there ought to be a Unicode
> version number in the CLDR data; it would be nice for that number to be
> present in the data files (I don't see one in he.xml)

Can you file a cldr ticket on this?

Mark

*« Il meglio è l'inimico del bene »* ("The best is the enemy of the good")

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jknappen at web.de Thu Jul 24 02:15:43 2014
From: jknappen at web.de (Jörg Knappen)
Date: Thu, 24 Jul 2014 09:15:43 +0200
Subject: Aw: Ambiguous hyphenation cases with
In-Reply-To: <53CE6F35.5000405@inkedblade.net>
References: <896B66B790C04549AD3BF12412E175E39D2A944772@hst-mail01.sdbit.local> (sfid-20140212_141727_830873_C9E64B6E) <536C2765.1050306@inkedblade.net> <896B66B790C04549AD3BF12412E175E39D35164C0E@hst-mail01.sdbit.local> (sfid-20140512_074406_966504_FE3AFEC5), <53CE6F35.5000405@inkedblade.net>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From chris.fynn at gmail.com Thu Jul 24 08:04:32 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Thu, 24 Jul 2014 19:04:32 +0600
Subject: Request for Information
In-Reply-To: 
References: <53D010EC.9060204@inkedblade.net>
Message-ID: 

On 24/07/2014, Philippe Verdy wrote:
> It would be useful to have a sample text of the language, useful to show
> exemplar characters but a bit more showing typical layout of words.

https://en.wikipedia.org/wiki/List_of_pangrams#Other_languages

From srl at icu-project.org Thu Jul 24 09:46:08 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Thu, 24 Jul 2014 07:46:08 -0700
Subject: Request for Information
In-Reply-To: <20140724073742.6248cb29@JRWUBU2>
References: <53D010EC.9060204@inkedblade.net> <20140724073742.6248cb29@JRWUBU2>
Message-ID: <53D11C30.5090002@icu-project.org>

On 07/23/2014 11:37 PM, Richard Wordingham wrote:
> On Wed, 23 Jul 2014 20:45:48 +0100
> fantasai wrote:
>> I would like to request that Unicode include, for each writing system
>> it encodes, some information on how it might justify.
> Unicode encodes scripts, and I suspect CLDR only really supports living
> languages.

That is widely suspected of CLDR, but is only theoretical until someone actually tries it! I don't think it's a good argument from silence.

-s

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From srl at icu-project.org Thu Jul 24 09:54:14 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Thu, 24 Jul 2014 07:54:14 -0700
Subject: Parsers for the UnicodeSet notation?
In-Reply-To: 
References: <53D035F2.5070101@adobe.com> <53D0ACE3.5090304@adobe.com>
Message-ID: <53D11E16.9030009@icu-project.org>

On 07/24/2014 12:10 AM, Mark Davis ☕️ wrote:
> On Thu, Jul 24, 2014 at 8:51 AM, Eric Muller wrote:
> > - a full "parser" of UnicodeSet is non-trivial, since it involves
> having access to property values.
That does not seem really
> necessary for exemplars, so maybe it would be good to restrict the
> UnicodeSet there.
> >
> - alternatively, since the extent of a UnicodeSet can involve property
> values, it means that the extent can depend on the Unicode version from
> which those values come from. Which means that there ought to be a Unicode
> version number in the CLDR data; it would be nice for that number to be
> present in the data files (I don't see one in he.xml)
>
> Can you file a cldr ticket on this?

Sounds like two tickets.

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From emuller at adobe.com Thu Jul 24 10:30:29 2014
From: emuller at adobe.com (Eric Muller)
Date: Thu, 24 Jul 2014 08:30:29 -0700
Subject: Parsers for the UnicodeSet notation?
In-Reply-To: <53D11E16.9030009@icu-project.org>
References: <53D035F2.5070101@adobe.com> <53D0ACE3.5090304@adobe.com> <53D11E16.9030009@icu-project.org>
Message-ID: <53D12695.8040900@adobe.com>

> Sounds like two tickets.

7730, 7731.

Eric.

From ken.whistler at sap.com Thu Jul 24 12:45:25 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Thu, 24 Jul 2014 17:45:25 +0000
Subject: Request for Information
In-Reply-To: <53D010EC.9060204@inkedblade.net>
References: <53D010EC.9060204@inkedblade.net>
Message-ID: 

Fantasai asked:

> I would like to request that Unicode include, for each writing system it
> encodes, some information on how it might justify.

Following up on the comment and examples provided by Richard Wordingham, I'd like to emphasize a relevant point:

Scripts may be used for *multiple* (different) writing systems.

Rules for justification of text are aspects of writing systems, orthographies, and typographical conventions -- and are not inherent properties of scripts.
So while there may be strong tendencies for certain scripts to fall into certain typographical practices, including behavior for text justification, I don't think that information is inherent to scripts per se. And it would be misleading and gardenpathy for the Unicode Standard to try to treat justification as somehow inhering to scripts. Note also that there are many cases where there is even argumentation over the edge cases for script identity -- where one script's behavior bleeds into another's historically, or where the status of certain elements as borrowed elements from another script into a certain orthography or as nativized elements borrowed from another script *into* a script (thereby requiring separate encoding). I think it would make more sense to turn fantasai's query on its head, as it were: First categorize what kinds of systems of justification there are, and then start filling in, from best understood out to the fringes of knowledge of practice, what writing systems (using what script or combination of scripts) are attested as regularly using each system. Lacunae are inevitable, however. I think it is just a mistake to assume from a query on the Script property identity of a character, what justification rule should apply to it in text. Note also that for many scripts there is no established modern typographical practice, so it is basically unknown or meaningless to ask what the justification rules are for it. Modern typographers setting old material will eventually make up the rules, and those will *become* the answer, but the Unicode Consortium cannot look at pictures of fragmentary Byzantine seals or fragments of papyri and *determine* what some normative (or even informative) property of justification should be for the script in such a record. 
--Ken From public at khwilliamson.com Thu Jul 24 14:38:44 2014 From: public at khwilliamson.com (Karl Williamson) Date: Thu, 24 Jul 2014 13:38:44 -0600 Subject: Question about WordBreak property rules Message-ID: <53D160C4.1080300@khwilliamson.com> http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 indicates that there should be no break between the first two letters in the sequence Hebrew_Letter Single_Quote Hebrew_Letter. However, rule 7a just below indicates that there should be no break between a Hebrew_Letter and a Single_Quote even if what follows is not a Hebrew_Letter. This is not contradictory, but it is suspicious. It makes me wonder if there is an error in the specification. Assuming there is not, then rule 7a ought to be before current rule 6, which itself should be divided so that there isn't redundant specification of the Hebrew_Letter rules. From public at khwilliamson.com Thu Jul 24 15:21:32 2014 From: public at khwilliamson.com (Karl Williamson) Date: Thu, 24 Jul 2014 14:21:32 -0600 Subject: Question about WordBreak property rules In-Reply-To: <53D160C4.1080300@khwilliamson.com> References: <53D160C4.1080300@khwilliamson.com> Message-ID: <53D16ACC.1070001@khwilliamson.com> On 07/24/2014 01:38 PM, Karl Williamson wrote: > http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 > indicates that there should be no break between the first two letters in > the sequence > Hebrew_Letter Single_Quote Hebrew_Letter. > > However, rule 7a just below indicates that there should be no break > between a Hebrew_Letter and a Single_Quote even if what follows is not a > Hebrew_Letter. > > This is not contradictory, but it is suspicious. It makes me wonder if > there is an error in the specification. Assuming there is not, then > rule 7a ought to be before current rule 6, which itself should be > divided so that there isn't redundant specification of the Hebrew_Letter > rules. In reading this after I sent it, I'm not sure I was clear enough. 
Rule 6 implies that you need additional context to decide whether to break between a Hebrew_Letter followed by a Single_Quote. Yet Rule 7a says that you don't need any additional context: you never break there.

From asmusf at ix.netcom.com Thu Jul 24 15:35:34 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 24 Jul 2014 13:35:34 -0700
Subject: Request for Information
In-Reply-To: 
References: <53D010EC.9060204@inkedblade.net>
Message-ID: <53D16E16.9050107@ix.netcom.com>

On 7/24/2014 10:45 AM, Whistler, Ken wrote:
> Fantasai asked:
>
>> I would like to request that Unicode include, for each writing system it
>> encodes, some information on how it might justify.
>
> Following up on the comment and examples provided by Richard
> Wordingham, I'd like to emphasize a relevant point:
>
> Scripts may be used for *multiple* (different) writing systems.
>
> Rules for justification of text are aspects of writing systems,
> orthographies, and typographical conventions -- and are not
> inherent properties of scripts.

The encoding of the Latin script is intended to be used for Fraktur. Fraktur, as used for German until the early 20th century, is its own system, with its own rules. These affect justification, and they are notably different from the rules used for German typeset in the modern style.

For an easy-to-understand example, Fraktur has a commonly used form of emphasis by increased inter-letter spacing, something that's rare or absent in other Latin-based writing systems. Because of its use for emphasis, it is not possible to use increased inter-letter spacing for justification. The counterexample is US newspaper layout, where this feature is commonly observed to help in justification of narrow columns.

(The use of letter-spacing for emphasis has not fully died out in German, though with modern computer typesetting bold and italic are easily available.
For that reason, its use for justification is felt as jarring to many readers, because it would subconsciously be interpreted as randomly applied emphasis.)

So, here you have a non-complex script like Latin, and two writing systems that fundamentally disagree on what is allowed, preferred or required for justification in certain contexts.

To make matters more interesting, Fraktur has required and optional ligatures, with the required ones staying together on letter spacing while the optional ones are broken apart. Fraktur typesetting will adjust the use of optional ligatures as part of the justification process, for yet another difference between it and other writing systems based on Latin.

> So while there may be strong tendencies for certain scripts to
> fall into certain typographical practices, including behavior for
> text justification, I don't think that information is inherent
> to scripts per se. And it would be misleading and gardenpathy
> for the Unicode Standard to try to treat justification as
> somehow inhering to scripts.

Well put.

A./

From rick at unicode.org Thu Jul 24 15:54:34 2014
From: rick at unicode.org (Rick McGowan)
Date: Thu, 24 Jul 2014 13:54:34 -0700
Subject: PRI #273, UTS #39 draft data updated
Message-ID: <53D1728A.1080702@unicode.org>

The draft and data for the proposed update UTS #39 were both changed on 2014-07-24.

It appears that the issue previously noted with idempotence in the UTS #39 tables can be addressed for all of the mappings, with some extensive changes. The issue will be documented in the text (see http://www.unicode.org/reports/tr39/proposed.html) and the PRI text is being adjusted as well.

The PRI page has also been updated: http://www.unicode.org/review/pri273/

From cewcathar at hotmail.com Fri Jul 25 10:17:27 2014
From: cewcathar at hotmail.com (CE Whitehead)
Date: Fri, 25 Jul 2014 11:17:27 -0400
Subject: Request for Information
Message-ID: 

From: fantasai
Date: Wed, 23 Jul 2014 20:45:48 +0100

> . . .
> b) Arabic breaks between words. Some languages (such as Uyghur) > allow hyphenation, but most do not. Here is a resource that describes them: http://ucam.ac.ma/fssm/rydarab/doc/expose/justificatione.pdf (page 10): "Calligraphers also build on other practices for justification, such as: word heaping (putting certain words above others); moving the broken fragment above the hyphenated word; word hyphenation; word hyphenation in the margin; decreasing of some words at the end of a line; curving of the baseline." One of the authors of the above has put another resource with examples online: http://www.tug.org/tugboat/tb27-2/tb87benatia.pdf Richard Ishida describes elongation, which I believe may also be done when inserting diacritics (to make them easier to read is one reason). Some resources do say hyphenation is not allowed, but you can find counter-examples/information, I believe (someone else might have more/better information). Hyphenation is not common; I think I would indicate that it does exist in Arabic, though. Best, --C. E. Whitehead cewcathar at hotmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From fantasai.lists at inkedblade.net Fri Jul 25 10:49:13 2014 From: fantasai.lists at inkedblade.net (fantasai) Date: Fri, 25 Jul 2014 16:49:13 +0100 Subject: Request for Information In-Reply-To: References: <53D010EC.9060204@inkedblade.net> Message-ID: <53D27C79.4070302@inkedblade.net> On 07/24/2014 06:45 PM, Whistler, Ken wrote: > Fantasai asked: > >> I would like to request that Unicode include, for each writing system it >> encodes, some information on how it might justify. > > Following up on the comment and examples provided by Richard > Wordingham, I'd like to emphasize a relevant point: > > Scripts may be used for *multiple* (different) writing systems. Hence the use of "for each writing system" rather than "for each script" in the sentence you quote above.
Also, from a practical perspective, the systems for which this information would be *really* useful for Unicode to provide are the lesser-used systems (like Javanese), which are tied to only a few languages and therefore belong to only a handful of writing systems with very little variation. > Rules for justification of text are aspects of writing systems, > orthographies, and typographical conventions -- and are not > inherent properties of scripts. > > So while there may be strong tendencies for certain scripts to > fall into certain typographical practices, including behavior for > text justification, I don't think that information is inherent > to scripts per se. And it would be misleading and gardenpathy > for the Unicode Standard to try to treat justification as > somehow inhering to scripts. Sure, but the practice of using spaces to separate words is, by your same argument, not a property of the script, but of the writing system. However, this information--whether a script is typically used with or without word separation--is often included in the Unicode standard's description of that script. To take another example, Unicode defines a set of line breaking conventions for UAX14's default rules. However, these could also be argued to be part of the writing system and not an inherent property of the script. What's chosen for UAX14 is the common case, and where there are multiple common cases UAX14 calls them out as possible tailorings. Are you arguing that all of this information should be removed from Unicode? > I think it would make more sense to turn fantasai's query on its > head, as it were: First categorize what kinds of systems of > justification there are, and then start filling in, from best > understood out to the fringes of knowledge of practice, what > writing systems (using what script or combination of scripts) > are attested as regularly using each system. Lacunae are > inevitable, however.
Justification systems typically expand or compress spaces, and when that fails (becoming too small or too large, where the tolerances vary widely per writing system), fall back to "letter-spacing". The interaction of different levels of justification (e.g. spaces vs. letter-spacing) depends on the justification algorithms, and the tolerances for spacing adjustments depend on the writing system and the quality of the typesetter. It is my observation that systems with fewer spaces are more tolerant of letter-spacing. Some data-related questions here are: 1. The frequency of spaces in that writing system. This is strongly related to whether stretchable spaces are used for word separation, phrase separation, or neither, and whether they are used around common or rare punctuation. This information is noted for some scripts in the Unicode standard, but it is irregularly considered. Many chapters make no mention of whether and how spaces are used. (For example, it would be nice if the standard mentioned whether punctuation like the Javanese pada lingsa are expected to be followed by a space character, so that font makers, layout engineers, and typists can coordinate accordingly to create the appropriate amount of white space on the screen.) 2. Which characters are "separable" for justification. Some languages (like German) may suppress such separation. And the rules for determining separable "clusters" can be language and/or font-dependent. However, it can be said with certainty that Latin letters, for example, are separable, whereas Arabic letters are not. This information is mostly represented in UAX29, with the exception that there's no really clear information on which scripts are "cursive" (have inseparable grapheme clusters). There are exceptional systems: - Arabic can use cursive elongation for justification. - Japanese and Chinese can compress the inherent "spaces" within the full-width glyphs of certain punctuation.
- Tibetan can use tsek marks as filler for justification. (Which is, by the way, discussed *extensively* in the Unicode standard, so you can't tell me that the Unicode Consortium considers notes on common justification practices to be out of scope.) > I think it is just a mistake to assume from a query on the Script > property identity of a character, what justification rule should > apply to it in text. I think when you have no further context, it is better to have a guess informed by the character properties than one completely ignorant of them. > Note also that for many scripts there is no established modern > typographical practice, so it is basically unknown or meaningless > to ask what the justification rules are for it. Modern typographers > setting old material will eventually make up the rules, and those > will *become* the answer, but the Unicode Consortium cannot > look at pictures of fragmentary Byzantine seals or fragments of > papyri and *determine* what some normative (or even informative) > property of justification should be for the script in such a > record. Right, so as I mentioned, "We don't know" is an acceptable answer. At least then I can assume that my best guess is equivalent to the state of the art. :) ~fantasai From asmusf at ix.netcom.com Fri Jul 25 14:47:14 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 25 Jul 2014 12:47:14 -0700 Subject: Request for Information In-Reply-To: <53D27C79.4070302@inkedblade.net> References: <53D010EC.9060204@inkedblade.net> <53D27C79.4070302@inkedblade.net> Message-ID: <53D2B442.80109@ix.netcom.com> On 7/25/2014 8:49 AM, fantasai wrote: > On 07/24/2014 06:45 PM, Whistler, Ken wrote: >> Fantasai asked: >> >>> I would like to request that Unicode include, *for each writing system it encodes*, some information on how it might justify.
>> >> Following up on the comment and examples provided by Richard >> Wordingham, I'd like to emphasize a relevant point: >> >> Scripts may be used for *multiple* (different) writing systems. > > Hence the use of "for each writing system" rather than "for each > script" in the sentence you quote above. But the sentence implies that Unicode encodes "writing systems". That is not the case. Unicode encodes characters, which are elements of scripts, and usually not specific to a given writing system. The various "default" algorithms that Unicode publishes are an attempt to deal with "plain text", where access to detailed information about the writing system may not be available. To be useful, it helps if these algorithms can be modified (tailored) for situations where additional knowledge is available, but even with a few examples of that process, this is a far cry from providing information "for each writing system". > > >> I think it would make more sense to turn fantasai's query on its >> head, as it were: First categorize what kinds of systems of >> justification there are, and then start filling in, from best >> understood out to the fringes of knowledge of practice, what >> writing systems (using what script or combination of scripts) >> are attested as regularly using each system. Lacunae are >> inevitable, however. > > Justification systems typically expand or compress spaces, > and when that fails (becoming too small or too large, where > the tolerances vary widely per writing system), fall back to > "letter-spacing". The interaction of different levels of > justification (e.g. spaces vs. letter-spacing) depends on > the justification algorithms, and the tolerances for spacing > adjustments depend on the writing system and the quality of > the typesetter. You left out the use of optional ligatures. > > It is my observation that systems with fewer spaces are more > tolerant of letter-spacing.
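The two-stage fallback quoted above (stretch inter-word spaces first, then resort to letter-spacing once a per-space tolerance is exceeded) can be sketched roughly as follows. The unit widths, the uniform distribution of slack, and the tolerance value are illustrative assumptions, not anything specified by Unicode or by any particular layout engine:

```python
def justify(words, line_width, char_width=1.0, space_width=1.0,
            max_space_stretch=0.5):
    """Distribute slack across spaces; fall back to letter-spacing.

    Widths are in arbitrary units; a real engine would measure glyphs.
    Returns (extra_per_space, extra_per_letter_gap).
    """
    text_width = sum(len(w) for w in words) * char_width
    n_spaces = len(words) - 1
    slack = line_width - text_width - n_spaces * space_width
    if slack <= 0 or n_spaces == 0:
        return 0.0, 0.0
    per_space = slack / n_spaces
    if per_space <= max_space_stretch:
        return per_space, 0.0          # spaces absorb all the slack
    # Spaces are stretched to their tolerance; the remaining slack
    # goes into the gaps between letters (letter-spacing).
    remaining = slack - n_spaces * max_space_stretch
    n_letter_gaps = sum(len(w) - 1 for w in words)
    per_letter = remaining / n_letter_gaps if n_letter_gaps else 0.0
    return max_space_stretch, per_letter
```

A writing system that forbids letter-spacing (as in the Fraktur discussion above) would correspond to rejecting any line for which the second return value is nonzero, rather than applying it.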
My example (Fraktur) shows that other factors can come into play and that some writing systems can depart quite decisively from writing systems for the same script, and, in this case, even the same language. > > > (For example, it would be nice if the standard mentioned > whether punctuation like the Javanese pada lingsa are expected > to be followed by a space character, so that font makers, > layout engineers, and typists can coordinate accordingly to > create the appropriate amount of white space on the screen.) This information is rather more proper for Unicode to collect since it addresses conventions of how to encode texts, which is central to the mission of the Unicode Standard as character encoding. > > 2. Which characters are "separable" for justification. > Some languages (like German) may suppress such separation. > And the rules for determining separable "clusters" can be > language and/or font-dependent. > However it can be said with certainty that Latin letters, > for example, are separable, whereas Arabic letters are not. The minute you make a blanket statement like that, you promulgate a lowest-common-denominator behavior for software, where any language requiring tailoring will see a degradation, because tailorings will be unlikely. The correct way to characterize Latin would be that while separation is allowable in some languages, it is not preferred, and justification algorithms would normally apply a penalty for any line that requires it. The penalty values for some languages are much higher, and may also depend on the application. It's easy to find examples of newspaper layout in the US that are quite tolerant (if not to say over-tolerant) of letter spacing, but those represent an extreme. The point I am trying to make is that justification goes a lot further in the direction of "typography" than what is covered by UAX#14 (which covers line breaking opportunities - but not how to select among them for best typographical result).
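The language-dependent penalty described above can be modelled in the style of classical demerits-based line breaking: each candidate line gets a score, and resorting to letter-spacing adds a flat cost whose size encodes the writing system's tolerance. The scoring formula and the numbers here are invented purely for illustration:

```python
def best_break(candidates, letter_spacing_penalty):
    """Pick the candidate line with the lowest demerits.

    candidates: list of (space_stretch, needs_letter_spacing) pairs,
    where space_stretch is the fractional widening of each space.
    letter_spacing_penalty: flat cost of resorting to letter-spacing;
    small for a tolerant convention (US newsprint), prohibitive for
    one that reserves letter-spacing for emphasis (Fraktur German).
    """
    def demerits(stretch, letter_spaced):
        d = (abs(stretch) * 100) ** 2          # squared space distortion
        if letter_spaced:
            d += letter_spacing_penalty
        return d

    scores = [demerits(s, ls) for s, ls in candidates]
    return scores.index(min(scores))

# The same two candidate lines, judged under different conventions:
lines = [(0.30, False),   # badly stretched spaces, no letter-spacing
         (0.05, True)]    # tight spaces, but letter-spaced
print(best_break(lines, letter_spacing_penalty=200))     # tolerant
print(best_break(lines, letter_spacing_penalty=10_000))  # strict
```

The point of the toy model is exactly Asmus's: the data that changes between writing systems is not a per-script boolean "separable or not" but a weight inside the justification algorithm.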
By providing some generalized statements, you may do more harm than good, because you would by necessity enshrine the lowest common denominator. For "plain text" in Latin, you could argue the case in reverse and state that using letter spacing is not "safe", because you don't know whether the text is in a language or for a context where letter-spacing should be given a high or very high penalty value. Something similar goes for applying ligatures. Applying them for plain text in Latin is inappropriate, because each language has or may have rules about when ligatures are appropriate or where they are required or forbidden - and these rules change over time (like the changes made in the 1990s orthography reform for German, or the 1940 switch away from Fraktur). Still, Latin does have ligatures, and there are writing systems where justification uses fewer or more ligatures to affect the rate at which text is condensed. But, as can be seen from these examples, a global statement on the script level is not helpful, unless a practice is really universal and can safely be applied to plain text. > > This information is mostly represented in UAX29, with the > exception that there's no really clear information on > which scripts are "cursive" (have inseparable grapheme > clusters). > > There are exceptional systems: > - Arabic can use cursive elongation for justification. > - Japanese and Chinese can compress the inherent "spaces" > within the full-width glyphs of certain punctuation. > - Tibetan can use tsek marks as filler for justification. > (Which is, by the way, discussed *extensively* in the Unicode > standard, so you can't tell me that the Unicode Consortium > considers notes on common justification practices to be out > of scope.) What you are asking for is, in effect, a "survey of typographical practices". This would be a fine project if published in some form that is independent of the Unicode Standard, for example as a technical note.
That survey would need to focus on *writing systems*, and not on script properties, because the latter are really inappropriate for making the correct choice. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Jul 25 19:36:43 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 26 Jul 2014 02:36:43 +0200 Subject: Request for Information In-Reply-To: References: Message-ID: 2014-07-25 17:17 GMT+02:00 CE Whitehead : > From: fantasai > > > Date: Wed, 23 Jul 2014 20:45:48 +0100 > > . . . > > b) Arabic breaks between words. Some languages (such as Uyghur) > > allow hyphenation, but most do not. > > Here is a resource that describes them: > http://ucam.ac.ma/fssm/rydarab/doc/expose/justificatione.pdf (page 10): > That PDF is completely broken and does not even show the various styles accurately. There are character encoding issues everywhere. It's impossible to understand the arguments or definitions just by reading it in its existing form (which was apparently produced by a broken PDF generator; maybe it was correct in the original editor format, possibly a Word or OpenOffice document, but here it looks as if it was first exported to HTML with incorrect encoding, producing tofu and mojibake, then reconverted as is). There isn't any working example of text justification in it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sat Jul 26 10:26:21 2014 From: doug at ewellic.org (Doug Ewell) Date: Sat, 26 Jul 2014 09:26:21 -0600 Subject: Request for Information In-Reply-To: References: Message-ID: <0F8B003784EB4A9E903E445EFB3DFB56@DougEwell> fantasai wrote: > I think when you have no further context, it is better to have > a guess informed by the character properties than one completely > ignorant of them. Some of the responses on this list already demonstrate a real risk of Unicode adding a property like this.
When Unicode publishes this sort of data, even if it is meant to be informative, people tend to treat it as normative and rigid, and as applying to all imaginable scenarios. So even for a script like Latin, where the customary method of justification is usually straightforward, you can have reasonable counterexamples like Fraktur as described by Asmus. And then someone might bring up a case where the rules might be different for different languages (Philippe sort of alluded to this with Arabic). And then there will be a historic example from the dawn of printing, and one from a highly styled advertising sign, and so forth, and it will be hard to tell when the "normal usage" line has been crossed. If necessary, someone will trudge out Latin letters on a neon sign, oriented normally but written vertically down the sign. Meanwhile Unicode will be criticized for not taking all the special cases into account. It's a bit like the locale collections (CLDR is not alone here) that specify a single date format for an entire country, as if all Americans only ever write a short date as "m/dd/yy" and anyone who uses a different format is employing some sort of weird hybrid system. The presence of "m/dd/yy" in the locale collection appears normative and rigid, and is often implemented in software as though that were the intent, even if the data is meant to be descriptive and a first approximation. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From eik at iki.fi Sat Jul 26 12:20:03 2014 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sat, 26 Jul 2014 20:20:03 +0300 Subject: VS: Request for Information In-Reply-To: <0F8B003784EB4A9E903E445EFB3DFB56@DougEwell> References: <0F8B003784EB4A9E903E445EFB3DFB56@DougEwell> Message-ID: <000001cfa8f5$d73ff2e0$85bfd8a0$@fi> +1 (although some people object to this notation) Sincerely, Erkki -----Alkuperäinen viesti----- Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Doug Ewell Lähetetty: 26.
heinäkuuta 2014 18:26 Vastaanottaja: unicode at unicode.org Kopio: fantasai.lists at inkedblade.net Aihe: Re: Request for Information fantasai wrote: > I think when you have no further context, it is better to have > a guess informed by the character properties than one completely > ignorant of them. Some of the responses on this list already demonstrate a real risk of Unicode adding a property like this. When Unicode publishes this sort of data, even if it is meant to be informative, people tend to treat it as normative and rigid, and as applying to all imaginable scenarios. So even for a script like Latin, where the customary method of justification is usually straightforward, you can have reasonable counterexamples like Fraktur as described by Asmus. And then someone might bring up a case where the rules might be different for different languages (Philippe sort of alluded to this with Arabic). And then there will be a historic example from the dawn of printing, and one from a highly styled advertising sign, and so forth, and it will be hard to tell when the "normal usage" line has been crossed. If necessary, someone will trudge out Latin letters on a neon sign, oriented normally but written vertically down the sign. Meanwhile Unicode will be criticized for not taking all the special cases into account. It's a bit like the locale collections (CLDR is not alone here) that specify a single date format for an entire country, as if all Americans only ever write a short date as "m/dd/yy" and anyone who uses a different format is employing some sort of weird hybrid system. The presence of "m/dd/yy" in the locale collection appears normative and rigid, and is often implemented in software as though that were the intent, even if the data is meant to be descriptive and a first approximation. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
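The short-date ambiguity described above is easy to demonstrate: the same string parses to two different dates under the US and the European conventions. The format strings and the sample date here are illustrative:

```python
from datetime import datetime

s = "7/4/14"   # July 4 to an American reader, 7 April to a European one
us = datetime.strptime(s, "%m/%d/%y")   # m/dd/yy convention
eu = datetime.strptime(s, "%d/%m/%y")   # dd/mm/yy convention
print(us.date(), eu.date())
```

Both parses succeed without any warning, which is exactly why treating a single per-country format as normative is risky: software has no way to tell which convention the writer of the string had in mind.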
_______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From emuller at adobe.com Sat Jul 26 16:11:40 2014 From: emuller at adobe.com (Eric Muller) Date: Sat, 26 Jul 2014 14:11:40 -0700 Subject: Help with Hebrew In-Reply-To: <100501cf990c$08ad0dd0$1a072970$@gmail.com> References: <100501cf990c$08ad0dd0$1a072970$@gmail.com> Message-ID: <53D4198C.9030607@adobe.com> Many thanks for all the answers on my Hebrew and Arabic questions. On 7/6/2014 4:18 AM, Matitiahu Allouche wrote: > The original text is interesting, combining French, Latin and Hebrew. There is also a fair amount of Greek, and a couple of Arabic words. > Unfortunately, the author and/or the type setter were not quite proficient in Hebrew, so that the Hebrew words in the 3 referenced pages contain quite a few errors. I think it's a safe assumption that the typesetter was not necessarily fluent in Hebrew. > > I am not sure if the digitization should reproduce faithfully the flaws of the original document, or if it is an opportunity to correct the errors (which may not be possible for the first page). I want both! In my XML source, I do record things like "mistake", and render that in the EPUBs I produce by "mistake [mistaque]". > 1) Eric's representation of the Hebrew words in f274.image seems correct. So the Unicode sequences are > Yod (U+05D9) Segol (U+05B6) Dalet (U+05D3) Segol (U+05B6) Alef (U+05D0) > And > Yod (U+05D9) Segol (U+05B6) Dalet (U+05D3) Segol (U+05B6) He (U+05D4) > > However, the Hebrew words are suspect: > a. The first one (Yod Dalet Alef) is not a stem known in Hebrew. It could be a deformation of the stem Yod Resh Alef whose meaning is to fear (= the French "craindre"). I would not be surprised if the typesetter confused dalet and resh. 
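The code point sequences spelled out above can be assembled and checked directly; this sketch uses Python's standard `unicodedata` module only to verify the character names given in the discussion (the variable names are made up for illustration). Note that NFC normalization leaves such base-plus-point sequences alone, since Hebrew presentation forms are composition exclusions:

```python
import unicodedata

# Yod Segol Dalet Segol Alef, and the variant ending in He, as above
word_alef = "\u05D9\u05B6\u05D3\u05B6\u05D0"
word_he   = "\u05D9\u05B6\u05D3\u05B6\u05D4"

for ch in word_alef:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+05D9 HEBREW LETTER YOD
# U+05B6 HEBREW POINT SEGOL
# U+05D3 HEBREW LETTER DALET
# U+05B6 HEBREW POINT SEGOL
# U+05D0 HEBREW LETTER ALEF

# NFC does not compose these sequences into presentation forms
assert unicodedata.normalize("NFC", word_alef) == word_alef
```

Encoding the text this way (letters followed by points, in canonical order) is what allows the two words to differ only in their final code point, exactly as the quoted passage describes.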
The good news is that the text I pointed to is one of the many re-editions of the work, and we have a facsimile of the original edition: http://gallica.bnf.fr/ark:/12148/btv1b8626248g/f26.image Here it seems clear that it's a resh in both examples. By the way, the whole sentence reads roughly "In Hebrew, there are words which are different only in that one ends with an aleph, and the other with a he, which are not pronounced, as <> which means fear and <> which means throw away." This follows a discussion that in French, "champ" and "chant" are pronounced the same, with the final p and t silent. > b. Both grammatical forms (with Segol under the rightmost two letters in both words) do not conform to proper conjugation, as far as I know (conjugation of Hebrew verbs is not a matter for the faint of heart). The original edition seems to show a qamats. Would that be better? > > 2) The case of f299.image is yet more complicated: The original edition: http://gallica.bnf.fr/ark:/12148/btv1b8626248g/f53.image > a. If you compare the rightmost letter in the Hebrew word following "mais dans" with the corresponding letter in the Hebrew word following "pour", you can see that they don't look identical. The first one has a rounded top-right corner while the second one has a more square shape. The first letter looks like a Hebrew letter Resh (U+05E8) and the second looks like a Hebrew letter Dalet (U+05D3, and it is the correct one). The original seems to show Dalet in all three cases. Overall, what I see there is - dalet, sheva, bet, patah, resh, space, shin, segol, qof, segol, resh - shin, segol, qof, segol, resh - dalet, sheva, bet, patah, resh - dalet, qamats, bet, qamats, resh The text is telling how the genitive is marked differently in Latin and in Hebrew. In Latin, in verbum falsitatis, it's falsitas that has been transformed into falsitatis to mark the genitive, while in Hebrew, it's (the word for verbum) that is modified. > > > c.
When a word starts with Dalet, there should generally be a Dagesh in the Dalet. That brings an interesting question. If you look at the French in the two editions (1660 and 1810), you will see that they use different orthographies, and that today's orthography (2014) is yet another one. There is no reason this would not happen in the same way for the Hebrew. So what I am really after is - what's on the page - what was meant to be on the page, when the editions were made (1660, 1810) - what one would want to put on the page if one were to make a modern edition, with modern orthography throughout Is it plausible that the dagesh would only be in the last case (modern orthography), since it's clearly absent in both facsimiles? > d. The point on the Shin (rightmost letter of the second word) is a Sin Dot, while it should be a Shin Dot. None in the original edition, apparently. > > The expression was probably quoted from Exodus XXIII, 7, where the vowel under the Bet is a Patah, which is also the way it would be written in modern Hebrew. > So the right sequences (after correcting the errors in the original document) are > - Dalet (U+05D3) Dagesh (U+05BC) Sheva (U+05B0) Bet (U+05D1) Patah (U+05B7) Resh (U+05E8) Space Shin (U+05E9) Shin Dot (U+05C1) Qamats (U+05B8) Qof (U+05E7) Segol (U+05B6) Resh (U+05E8) > - Shin (U+05E9) Shin Dot (U+05C1) Qamats (U+05B8) Qof (U+05E7) Segol (U+05B6) Resh (U+05E8) > - Dalet (U+05D3) Dagesh (U+05BC) Sheva (U+05B0) Bet (U+05D1) Patah (U+05B7) Resh (U+05E8) > - Dalet (U+05D3) Dagesh (U+05BC) Qamats (U+05B8) Bet (U+05D1) Qamats (U+05B8) Resh (U+05E8) That matches the original edition, except for the dagesh and shin dot, which seem absent. Some qamats tend to look like segol, but I suspect that can be attributed to poor printing. > > > > 3) The case of f310.image is also problematic. The original edition: http://gallica.bnf.fr/ark:/12148/btv1b8626248g/f66.image > > a.
The feminine pronoun is written with 2 errors (not bad for a word with only 2 consonants): firstly, the vowel under the first (rightmost) letter Alef (U+05D0) is missing and should be a Patah. Secondly, there should be a Dagesh (U+05BC) in the leftmost letter (Tav U+05EA). So the proper sequence is Alef (U+05D0) Patah (U+05B7) Tav (U+05EA) Dagesh (U+05BC). The original edition looks like: alef, qamats, tav, sheva. > > b. The masculine pronoun is also written with 2 errors: firstly, there should be a Dagesh (U+05BC) in the middle letter (Tav U+05EA). Secondly, the leftmost letter must be a He (U+05D4) and not an Alef as appearing in the original document. > So the proper sequence is Alef (U+05D0) Patah (U+05B7) Tav (U+05EA) Dagesh (U+05BC) Qamats (U+05B8) He (U+05D4). The original edition looks like: alef, qamats, tav, qamats, alef. Philippe Verdy wrote: > Note: the three comma-separated items, if they are just separated by > the comma (in that example it is handwritten, but it is the European > comma, not the Arabic comma) should use bidi-embedding controls That actually looks like the perfect job for three isolates. Thanks again to everyone, Eric. From nospam-abuse at ilyaz.org Sat Jul 26 19:22:08 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Sat, 26 Jul 2014 17:22:08 -0700 Subject: Request for Information In-Reply-To: <0F8B003784EB4A9E903E445EFB3DFB56@DougEwell> References: <0F8B003784EB4A9E903E445EFB3DFB56@DougEwell> Message-ID: <20140727002208.GA21963@powdermilk> On Sat, Jul 26, 2014 at 09:26:21AM -0600, Doug Ewell wrote: > It's a bit like the locale collections (CLDR is not alone here) that > specify a single date format for an entire country, as if all > Americans only ever write a short date as "m/dd/yy" and anyone who > uses a different format is employing some sort of weird hybrid > system.
The presence of "m/dd/yy" in the locale collection appears > normative and rigid, and is often implemented in software as though > that were the intent, even if the data is meant to be descriptive > and a first approximation. Hmm? So you think that the date format is not normative? [Based on a true story from a few years ago] So one comes off a transatlantic flight after not having slept for coupla days, and signs a lease with the end date in m/dd/yy format. When leaving, it turns out that the German landlord wants an extra month's payment, since in the German format this means one month later than in the American one. Who do you think gets/loses $1000 here? Hope this helps, Ilya P.S. Sorry, do not remember exact numbers in the dates. It might have been that the confusion was of month/year, not of month/day. From cewcathar at hotmail.com Sat Jul 26 20:33:27 2014 From: cewcathar at hotmail.com (CE Whitehead) Date: Sat, 26 Jul 2014 21:33:27 -0400 Subject: Request for Information In-Reply-To: References: , Message-ID: From: verdy_p at wanadoo.fr Date: Sat, 26 Jul 2014 02:36:43 +0200 Subject: Re: Request for Information To: cewcathar at hotmail.com CC: unicode at unicode.org; fantasai.lists at inkedblade.net 2014-07-25 17:17 GMT+02:00 CE Whitehead : From: fantasai Date: Wed, 23 Jul 2014 20:45:48 +0100 > . . . > b) Arabic breaks between words. Some languages (such as Uyghur) > allow hyphenation, but most do not. Here is a resource that describes them: http://ucam.ac.ma/fssm/rydarab/doc/expose/justificatione.pdf (page 10): That PDF is completely broken and does not even show the various styles accurately. There was an explanation, which is quoted. The examples were at: http://www.tug.org/tugboat/tb27-2/tb87benatia.pdf Unfortunately the examples of hyphenation in this text are not very good; when I have more time before I have to leave the wifi I will look for another example.
The one thing to note, as far as I know, is that when hyphenating Arabic script, whatever the language, you would show the letters in their connected shapes across the hyphen -- that's what I understand. Someone else correct me if it is not correct. I have only cited the above to make the point that if you mention how Arabic script might be justified, I do believe even text in the Arabic language can occasionally be hyphenated, especially maybe some of the old Qur'ans (I really would appreciate feedback from native Arabic speakers, though, as I have had a little Arabic in school and research). So maybe never say never in this case, as there may be a point when old texts are brought online from PDFs and digitized. Best, --C. E. Whitehead cewcathar at hotmail.com > There are character encoding issues everywhere. It's impossible to understand the arguments > or definitions just by reading it in its existing form (which was apparently produced by a > broken PDF generator; maybe it was correct in the original editor format, possibly a Word or > OpenOffice document, but here it looks as if it was first exported to HTML with incorrect > encoding, producing tofu and mojibake, then reconverted as is). > There isn't any working example of text justification in it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Mon Jul 28 09:57:11 2014 From: frederic.grosshans at gmail.com (=?windows-1252?Q?Fr=E9d=E9ric_Grosshans?=) Date: Mon, 28 Jul 2014 16:57:11 +0200 Subject: Unencoded cased scripts and unencoded titlecase letters In-Reply-To: <53B419AC.3050703@khwilliamson.com> References: <53B419AC.3050703@khwilliamson.com> Message-ID: <53D664C7.1060604@gmail.com> On 02/07/2014 16:39, Karl Williamson wrote: > It's my sense that there are very few cased scripts in existence that > are ever likely to be encoded by Unicode that haven't already been > so-encoded.
The Kaddare script ( http://www.skyknowledge.com/kaddare.htm , https://en.wikipedia.org/wiki/Kaddare_alphabet ) is a cased alphabet, historically used for Somali, which is not yet on the Unicode Roadmap. But I don't see what would prevent its encoding in the long term