From samjnaa at gmail.com Sat Feb 1 22:20:32 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Sun, 2 Feb 2014 09:50:32 +0530
Subject: Astrological symbol for Pluto?
Message-ID:

Currently Unicode encodes a distinct astrological symbol for Uranus, 2645 ♅, vs an astronomical symbol, 26E2 ⛢. However, the only symbol encoded for Pluto is the astronomical one: 2647 ♇. Just now I learnt from https://en.wikipedia.org/wiki/Pluto#Name that there is a distinct astrological symbol:

[image: Inline image 1]

Has there been any proposal to encode this? (I'm guessing Michael might be interested...)

--
Shriramana Sharma

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 200px-Pluto's_astrological_symbol.svg.png
Type: image/png
Size: 3056 bytes
Desc: not available

From ishida at w3.org Sun Feb 2 11:46:14 2014
From: ishida at w3.org (Richard Ishida)
Date: Sun, 02 Feb 2014 17:46:14 +0000
Subject: UniView is back
Message-ID: <52EE8466.2050307@w3.org>

If you were a user of my UniView tool, you'll find a new version at http://rishida.net/uniview/

I am rebuilding UniView without PHP. Most of the essential features work in the new version, but there are one or two that I have yet to rebuild, and you may find that the odd thing just won't work. The principal things still outstanding are:

- Searching the Unicode database (searching the information local to the page works, if you select 'local')
- Listing of characters with a given property (although filtering the information currently on the page still works, if you turn 'local' on)
- My notes on individual characters no longer appear at the bottom of the right panel
- You can't show an annotated list of all characters in a block

A graphic X indicates some of the things that don't work, or only partially work. The images will be removed as the features are reinstated.
You may also find that UniView is initially slower in a couple of ways. I haven't yet reinstated the AJAX calls that pull in character data at the point of need: instead the app downloads 2.5 MB of data before running. Also, the initial draw of a block on the left is somewhat slower than before. Hopefully, over time, I will address these issues.

If there's a feature you especially need that is not available, let me know and I may be able to prioritise work on it.

From jknappen at web.de Mon Feb 3 01:57:19 2014
From: jknappen at web.de (Jörg Knappen)
Date: Mon, 3 Feb 2014 08:57:19 +0100 (CET)
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...

From frederic.grosshans at gmail.com Mon Feb 3 04:45:42 2014
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 03 Feb 2014 11:45:42 +0100
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To:
References:
Message-ID: <52EF7356.2010207@gmail.com>

On 03/02/2014 08:57, "Jörg Knappen" wrote:
> Unfortunately, this astrological symbol is given in the Wikipedia article, but not sourced. So I think further evidence for its usage is needed.

Actually, it is sourced (with the other symbols) to http://www.uranian-institute.org/bfglyphs.htm, which lists no fewer than 4 symbols for Pluto...

Fred

From samjnaa at gmail.com Mon Feb 3 07:14:39 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Mon, 3 Feb 2014 18:44:39 +0530
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To: <52EF7356.2010207@gmail.com>
References: <52EF7356.2010207@gmail.com>
Message-ID:

On Mon, Feb 3, 2014 at 4:15 PM, Frédéric Grosshans <frederic.grosshans at gmail.com> wrote:
> Actually, it is sourced (with the other symbols) to http://www.uranian-institute.org/bfglyphs.htm, which lists no fewer than 4 symbols for Pluto...

In any case, it seems its astronomical symbol was encoded quite early (DerivedAge = 1.1), which was before the 2006 IAU decision to demote Pluto to dwarf-planet status. Of course, even if it were encoded today, I'm sure it would be the only dwarf planet to have a symbol encoded, since no other dwarf planet has captured the common man's imagination (and basic knowledge) like Pluto, and I have not heard of any of the other dwarf planets (Ceres, Haumea, Makemake and Eris) having any symbols...

--
Shriramana Sharma

From andrewcwest at gmail.com Mon Feb 3 07:29:22 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Mon, 3 Feb 2014 13:29:22 +0000
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To:
References: <52EF7356.2010207@gmail.com>
Message-ID:

On 3 February 2014 13:14, Shriramana Sharma wrote:
> In any case, it seems its astronomical symbol was encoded quite early (DerivedAge = 1.1), which was before the 2006 IAU decision to demote Pluto to dwarf-planet status. Of course, even if it were encoded today, I'm sure it would be the only dwarf planet to have a symbol encoded, since no other dwarf planet has captured the common man's imagination (and basic knowledge) like Pluto, and I have not heard of any of the other dwarf planets (Ceres, Haumea, Makemake and Eris) having any symbols...
Well, there are no fewer than four unencoded astrological symbols for Eris according to this Wikipedia article:

Andrew

From jknappen at web.de Mon Feb 3 07:43:59 2014
From: jknappen at web.de (Jörg Knappen)
Date: Mon, 3 Feb 2014 14:43:59 +0100 (CET)
Subject: Aw: Re: Astrological symbol for Pluto?
In-Reply-To:
References: <52EF7356.2010207@gmail.com>
Message-ID:

An HTML attachment was scrubbed...

From everson at evertype.com Mon Feb 3 10:34:41 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 3 Feb 2014 08:34:41 -0800
Subject: Astrological symbol for Pluto?
In-Reply-To:
References:
Message-ID:

On 1 Feb 2014, at 20:20, Shriramana Sharma wrote:
> Currently Unicode encodes a distinct astrological symbol for Uranus, 2645 ♅, vs an astronomical symbol, 26E2 ⛢.
>
> Has there been any proposal to encode this? (I'm guessing Michael might be interested...)

I'd be happy to, unless there was going to be pushback from the UTC.

Michael Everson * http://www.evertype.com/

From martinho.fernandes at gmail.com Tue Feb 4 08:43:37 2014
From: martinho.fernandes at gmail.com (Martinho Fernandes)
Date: Tue, 04 Feb 2014 15:43:37 +0100
Subject: Arabic percent sign and percent signs in RTL scripts
Message-ID:

Is the Arabic percent sign (U+066A) just a typographical variation of the "normal" percent sign (U+0025), or is it somehow more distinct than that? What about its placement? Is it placed to the left or to the right of the digits it applies to?

Best regards,
Martinho

From James_Lin at symantec.com Tue Feb 4 11:05:53 2014
From: James_Lin at symantec.com (James Lin)
Date: Tue, 4 Feb 2014 09:05:53 -0800
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To:
References:
Message-ID:

For Arabic, the percentage sign is fixed on the left side of the digits: %10. For Hebrew, the percentage sign is on the right side of the digits: 10%.
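[Editorial aside: the bidirectional character classes behind this placement behaviour can be inspected with Python's standard unicodedata module. This is an illustration added for reference, not part of the original thread.]

```python
import unicodedata

# Both percent signs are bidi class ET ("European Terminator"); where a
# sign ends up visually depends on the digits and the surrounding text,
# as resolved by the Unicode Bidirectional Algorithm (UAX #9).
for cp, label in [
    ("\u0025", "PERCENT SIGN"),
    ("\u066A", "ARABIC PERCENT SIGN"),
    ("\u0030", "DIGIT ZERO"),
    ("\u0660", "ARABIC-INDIC DIGIT ZERO"),
]:
    print(f"U+{ord(cp):04X} {label}: {unicodedata.bidirectional(cp)}")
# U+0025 PERCENT SIGN: ET
# U+066A ARABIC PERCENT SIGN: ET
# U+0030 DIGIT ZERO: EN
# U+0660 ARABIC-INDIC DIGIT ZERO: AN
```

Because both signs are ET, neither forces a side on its own; the rendered position falls out of the bidi algorithm applied to the whole run.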
-James

From: Martinho Fernandes
Date: Tuesday, February 4, 2014 6:43 AM
To: Unicode List
Subject: Arabic percent sign and percent signs in RTL scripts

> Is the Arabic percent sign (U+066A) just a typographical variation of the "normal" percent sign (U+0025), or is it somehow more distinct than that? What about its placement? Is it placed to the left or to the right of the digits it applies to?
>
> Best regards,
> Martinho

From jkorpela at cs.tut.fi Tue Feb 4 13:37:06 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Tue, 04 Feb 2014 21:37:06 +0200
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To:
References:
Message-ID: <52F14162.8040800@cs.tut.fi>

2014-02-04 19:05, James Lin wrote:
> For Arabic, the percentage sign is fixed on the left side of the digits: %10

There seem to be different opinions and practices on this. In the CLDR database, the formats have "%" (the ASCII percent sign) on the right of the number, as far as I can see; Arabic inherits the root settings for percentages.

Yucca

From richard.wordingham at ntlworld.com Tue Feb 4 18:51:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Feb 2014 00:51:09 +0000
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To: <52F14162.8040800@cs.tut.fi>
References: <52F14162.8040800@cs.tut.fi>
Message-ID: <20140205005109.6db5accd@JRWUBU2>

On Tue, 04 Feb 2014 21:37:06 +0200 "Jukka K. Korpela" wrote:
> 2014-02-04 19:05, James Lin wrote:
> > For Arabic, the percentage sign is fixed on the left side of the digits: %10
> There seem to be different opinions and practices on this. In the CLDR database, the formats have "%" (the ASCII percent sign) on the right of the number, as far as I can see; Arabic inherits the root settings for percentages.
As far as I can *see* (perhaps there are hidden format characters in the CLDR data), the '%' follows the digits, and so will occur on the left if the number plus percentage sign is flanked by Arabic text. Character sequence yields, recording right-to-left, glyph sequence. Both percentage signs are of bidi class ET ('European Terminator'), and the preceding Arabic text converts the digits to class AN ('Arabic Number'), whichever of the three sets of Arabic digits (DIGIT ZERO onwards, ARABIC-INDIC DIGIT ZERO onwards, or EXTENDED ARABIC-INDIC DIGIT ZERO onwards) is used.

Richard.

From khaledhosny at eglug.org Wed Feb 5 02:06:03 2014
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Wed, 5 Feb 2014 10:06:03 +0200
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To:
References:
Message-ID: <20140205080602.GA15328@khaled-laptop>

On Tue, Feb 04, 2014 at 03:43:37PM +0100, Martinho Fernandes wrote:
> Is the Arabic percent sign (U+066A) just a typographical variation of the "normal" percent sign (U+0025), or is it somehow more distinct than that?

The former. It is mainly used when Arabic-Indic or Extended Arabic-Indic digits are used.

> What about its placement? Is it placed to the left or to the right of the digits it applies to?

It should follow the digits in the input stream, and its proper visual placement should be handled by the Unicode bidirectional algorithm.

Regards,
Khaled

From rhavin at shadowtec.de Tue Feb 4 16:25:06 2014
From: rhavin at shadowtec.de (Rhavin Grobert)
Date: Tue, 04 Feb 2014 23:25:06 +0100
Subject: proposal for new character 'soft/preferred line break'
Message-ID: <52F168C2.7090401@shadowtec.de>

Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character. Also, the implementation in browsers would be very easy to accomplish.

Please support this proposal,
Rhavin Grobert

--
Rhavin Grobert · ShadowTec media, Bödikersteig 11, 13629 Berlin
http://rhavin.de/ · Sound engineering, consulting, maintenance and planning
MCITP & Event Professional, Windows & Linux administrator
C++ · Perl · Java · JavaScript · Ruby · XHTML+CSS · XML · PowerShell

From markus.icu at gmail.com Wed Feb 5 10:22:12 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 5 Feb 2014 08:22:12 -0800
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To: <52F168C2.7090401@shadowtec.de>
References: <52F168C2.7090401@shadowtec.de>
Message-ID:

On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...

That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

> And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character.

Unlikely. There are some unassigned code points that are predefined with Default_Ignorable_Code_Point, but that is not supported everywhere.

> Also, the implementation in browsers would be very easy to accomplish.

Maybe. You could research how widely <wbr> and U+200B are supported. (I don't have that data.)

Best regards,
markus

From jkorpela at cs.tut.fi Wed Feb 5 12:35:59 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 05 Feb 2014 20:35:59 +0200
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID: <52F2848F.6050103@cs.tut.fi>

2014-02-05 18:22, Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> > Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

As a suggested direct line break point, they both work fine, with a few caveats though, making it a bit difficult to decide which one is better; see my treatise http://www.cs.tut.fi/~jkorpela/html/nobr.html#suggest

In plain text, of course, U+200B is the way. The main problem with it is that some software, including some old browsers like IE 6, does not recognize it but tries to render it as a graphic character, possibly using a font that has no glyph for it. Adding a new character would not help here at all, of course.

> > And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character.
>
> Unlikely.

Indeed, there is no reason to expect old software to silently ignore characters that it does not recognize. Whatever the Unicode Standard might say, old software just does what it has been programmed to do, and this may well be "here's a character for which I have no special rule, so I'll use whatever is available in the font(s) I'm using", typically resulting in a small rectangle that represents a character for which no glyph is available.

But I'm not quite sure of the idea of the suggestion. If the idea is to provide an optional break point, in a position where none would normally be present, then U+200B is the way.
Not 100% reliable, but better than anything else (in plain text).

But if the idea is to suggest that, among permissible line break points, this one is preferable, then it's a different issue. Theoretically interesting, but in practical terms, things don't work that way. In practice, there are permissible line break points (either by implicit rules that e.g. normally allow a break after a space, or by explicit indication with U+200B). Programs will take it from there, and if they do some optimization, like good publishing software does, they typically optimize the division of an entire paragraph into lines, applying several criteria.

Yucca

From richard.wordingham at ntlworld.com Wed Feb 5 14:20:23 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Feb 2014 20:20:23 +0000
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID: <20140205202023.02bf8b48@JRWUBU2>

On Wed, 5 Feb 2014 08:22:12 -0800 Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> > Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

I don't think these are the same. They give permission for the line to be broken at those points, with a strong tendency for the opportunity nearest the end to be taken. What Rhavin wants to do is to override this tendency. I presume the idea is that if a line of poetry will not fit on a physical line, the line should instead be broken at its principal caesura. While such logic makes sense if a line only needs to be broken once, what if it needs to be broken three or four times?
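[Editorial aside: the behaviour under discussion can be sketched in a few lines of Python. This is a hypothetical illustration of the proposal, not an existing Unicode mechanism; '|' stands in for the proposed preferred-break character, and breaks are taken only at those marks.]

```python
def wrap_preferred(text, width, marker="|"):
    """Greedy wrap that breaks only at preferred-break markers.

    The marker stands in for the proposed (non-existent) invisible
    character; it is removed from the output whether or not a break
    is taken at it.
    """
    segments = text.split(marker)
    if len("".join(segments)) <= width:
        return ["".join(segments)]  # everything fits on one line
    lines, current = [], segments[0]
    for seg in segments[1:]:
        if len(current) + len(seg) > width:
            lines.append(current)
            current = seg.lstrip()
        else:
            current += seg
    lines.append(current)
    return lines

print(wrap_preferred(
    "The princely palace of the sun| stood gorgeous to behold", 40))
# ['The princely palace of the sun', 'stood gorgeous to behold']
```

Note the failure mode raised above: when a single segment is itself wider than the available width, this sketch simply emits an overlong line, and real layout would then have to fall back to ordinary space breaks.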
I feel this logic belongs in the realm of complex mark-up rather than the very simple mark-up afforded by characters. I'll give an example. As I don't trust my formatting to survive, I've marked the end of physical lines with a raised dot (·). For example, consider:

The princely palace of the sun stood gorgeous to behold·
On stately pillars builded high of yellow burnished gold·

If we break it at the principal caesuras, then

The princely palace of the sun· stood gorgeous to behold·
On stately pillars builded high· of yellow burnished gold·

looks fine. (Am I cheating by believing one would choose to have continuations of lines indented?) However, if the available width is reduced further:

The princely palace of the· sun· stood gorgeous to· behold·
On stately pillars builded· high· of yellow burnished· gold·

the result is a mess.

Richard.

From rhavin at shadowtec.de Wed Feb 5 15:44:18 2014
From: rhavin at shadowtec.de (Rhavin Grobert)
Date: Wed, 05 Feb 2014 22:44:18 +0100
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID: <52F2B0B2.9090209@shadowtec.de>

My last mail was sent from the wrong address; sorry if you get it twice, answer to this one ;)

On 05.02.2014 17:22, Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> > Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

No, you did not understand. <wbr> is like &shy;: it's below the whitespace level. If the line is too long, it breaks a word:

"This is a long line with a verylongawesomeword in its middle."

<wbr> gives the opportunity to break at long|awesome. But what I mean is (a non-existing "sbr", parallel to shy, assumed):

"Do you think me gentle,<sbr>do you think me cold?
do you wanna risk a<sbr>look into my thoughts?"

If the line is long enough:

"Do you think me gentle, do you think me cold?
do you wanna risk a look into my thoughts?"

If the line is not long enough:

"Do you think me gentle,
do you think me cold?
do you wanna risk a
look into my thoughts?"

Poems need some whitespace element that is *above* usual whitespaces when it comes to line breaks, and <wbr> and &shy; are *below* all whitespaces.

--
Rhavin Grobert · ShadowTec media, Bödikersteig 11, 13629 Berlin
http://rhavin.de/ · Sound engineering, consulting, maintenance and planning
MCITP & Event Professional, Windows & Linux administrator
C++ · Perl · Java · JavaScript · Ruby · XHTML+CSS · XML · PowerShell

From buck at yelp.com Wed Feb 5 16:15:44 2014
From: buck at yelp.com (Buck Golemon)
Date: Wed, 5 Feb 2014 14:15:44 -0800
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID:

On Wed, Feb 5, 2014 at 8:22 AM, Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
>> Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.
>
>> And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character.
>
> Unlikely. There are some unassigned code points that are predefined with Default_Ignorable_Code_Point, but that is not supported everywhere.
>
>> Also, the implementation in browsers would be very easy to accomplish.
>
> Maybe. You could research how widely <wbr> and U+200B are supported. (I don't have that data.)
Here's the wbr support story: http://www.quirksmode.org/oddsandends/wbr.html

From jkorpela at cs.tut.fi Wed Feb 5 16:27:10 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Thu, 06 Feb 2014 00:27:10 +0200
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To: <52F2B0B2.9090209@shadowtec.de>
References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de>
Message-ID: <52F2BABE.4070406@cs.tut.fi>

2014-02-05 23:44, Rhavin Grobert wrote:
> Wbr gives the opportunity to break at long|awesome. But what I mean is (a non-existing "sbr", parallel to shy, assumed):

Just giving a hypothetical character or tag an identifier does not specify its intended meaning.

> "Do you think me gentle,do you think me cold?
> do you wanna risk alook into my thoughts?"
>
> if the line is long enough:
>
> "Do you think me gentle, do you think me cold?
> do you wanna risk a look into my thoughts?"
>
> if the line is not long enough:
>
> "Do you think me gentle,
> do you think me cold?
> do you wanna risk a
> look into my thoughts?"

This seems to be, more or less, what Richard Wordingham guessed you meant.

> Poems need some whitespace element that is *above* usual whitespaces when it comes to line breaks, and <wbr> and &shy; are *below* all whitespaces.

Anything "above" the character level is generally up to higher-level protocols rather than what the Unicode Standard deals with.

It seems to me that what you actually want is to make some line break points the only allowed break points. So you would rather want to prohibit breaks elsewhere than introduce a "soft/preferred line break".

At the character level, you could use no-break spaces for the purpose. Using the entity reference &nbsp; (for U+00A0) for clarity here, you could write

Do&nbsp;you&nbsp;think&nbsp;me&nbsp;gentle, do&nbsp;you&nbsp;think&nbsp;me&nbsp;cold? do&nbsp;you&nbsp;wanna&nbsp;risk&nbsp;a look&nbsp;into&nbsp;my&nbsp;thoughts?
If the text contains hyphens or other characters that might allow a line break by default, you may need something extra.

If this is actually about HTML authoring, you can successfully use

<nobr>Do you think me gentle,</nobr> <nobr>do you think me cold?</nobr> <nobr>do you wanna risk a</nobr> <nobr>look into my thoughts?</nobr>

If you need/want to "conform to HTML standards", you can, with some marginal loss in functionality, use ... instead of nobr elements.

Anyway, there appear to be existing solutions to the problem. They might be a bit clumsy, but adding an "exclusive line break opportunity" to Unicode would introduce quite some complexity and burden on implementations.

Yucca

From asmusf at ix.netcom.com Wed Feb 5 17:55:46 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 05 Feb 2014 15:55:46 -0800
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To: <52F2BABE.4070406@cs.tut.fi>
References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F2BABE.4070406@cs.tut.fi>
Message-ID: <52F2CF82.2020607@ix.netcom.com>

I agree, the use of markup is more appropriate to the problem. This is not a plain text issue, and it even fails the "smell test" for "issue that is more elegantly solved by format characters than markup".

A./

On 2/5/2014 2:27 PM, Jukka K. Korpela wrote:
> 2014-02-05 23:44, Rhavin Grobert wrote:
>
>> Wbr gives the opportunity to break at long|awesome. But what I mean is (a non-existing "sbr", parallel to shy, assumed):
>
> Just giving a hypothetical character or tag an identifier does not specify its intended meaning.
>
>> "Do you think me gentle,do you think me cold?
>> do you wanna risk alook into my thoughts?"
>>
>> if the line is long enough:
>>
>> "Do you think me gentle, do you think me cold?
>> do you wanna risk a look into my thoughts?"
>>
>> if the line is not long enough:
>>
>> "Do you think me gentle,
>> do you think me cold?
>> do you wanna risk a
>> look into my thoughts?"
>
> This seems to be, more or less, what Richard Wordingham guessed you meant.
>
>> Poems need some whitespace element that is *above* usual whitespaces when it comes to line breaks, and <wbr> and &shy; are *below* all whitespaces.
>
> Anything "above" the character level is generally up to higher-level protocols rather than what the Unicode Standard deals with.
>
> It seems to me that what you actually want is to make some line break points the only allowed break points. So you would rather want to prohibit breaks elsewhere than introduce a "soft/preferred line break".
>
> At the character level, you could use no-break spaces for the purpose. Using the entity reference &nbsp; (for U+00A0) for clarity here, you could write
>
> Do&nbsp;you&nbsp;think&nbsp;me&nbsp;gentle, do&nbsp;you&nbsp;think&nbsp;me&nbsp;cold? do&nbsp;you&nbsp;wanna&nbsp;risk&nbsp;a look&nbsp;into&nbsp;my&nbsp;thoughts?
>
> If the text contains hyphens or other characters that might allow a line break by default, you may need something extra.
>
> If this is actually about HTML authoring, you can successfully use
>
> <nobr>Do you think me gentle,</nobr> <nobr>do you think me cold?</nobr> <nobr>do you wanna risk a</nobr> <nobr>look into my thoughts?</nobr>
>
> If you need/want to "conform to HTML standards", you can, with some marginal loss in functionality, use ... instead of nobr elements.
>
> Anyway, there appear to be existing solutions to the problem. They might be a bit clumsy, but adding an "exclusive line break opportunity" to Unicode would introduce quite some complexity and burden on implementations.
>
> Yucca
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

From fantasai.lists at inkedblade.net Fri Feb 7 02:02:01 2014
From: fantasai.lists at inkedblade.net (fantasai)
Date: Fri, 07 Feb 2014 00:02:01 -0800
Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes
In-Reply-To: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp>
References: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp>
Message-ID: <52F492F9.3020705@inkedblade.net>

On 01/27/2014 05:34 PM, Koji Ishii wrote:
> On Dec 21, 2013, at 20:39, CE Whitehead wrote:
>
>> 4.3
>> "alphabetic
>> The alphabetic baseline is assumed to be at the under margin edge.
>> "central
>> The central baseline is assumed to be halfway between the under and over margin edges of the box."
>> =>
>> "alphabetic
>> The alphabetic baseline is assumed to be at the under-margin edge.
>> "central
>> The central baseline is assumed to be halfway between the under- and over-margin edges of the box."
>>
>> {COMMENT: normally when you use two words to modify a single word, as when "under margin", "over margin" modify the word "edge" or "edges", then it is customary to join the two modifying words with a hyphen.}
>
> Fixed.

Actually, this is an incorrect edit. I've reverted it. "Under" and "over" are in this case used as adjectives, and are not part of the word "margin". This follows the pattern of "left margin" as opposed to "left-margin".

>> 6.2, second paragraph (after the list of four "flow-relative directions" -- block-end, block-start, etc.)
>> "Where unambiguous (or dual-meaning), the terms start and end are used in place of block-start/inline-start and block-end/inline-end, respectively."
>>
>> {COMMENT: "unambiguous" is the opposite of "dual-meaning" -- "dual meaning" means "ambiguous"; do you mean the following? (if so it's o.k. to eliminate the stuff in parentheses altogether):}
>
> Fixed.
Similarly, this is an incorrect edit. The intent is the opposite of "ambiguous" in the sense of "lacking clearness or definiteness". If the intent is clear from context OR if the intent encompasses both meanings, then the ambiguous terms start/end are allowed to be used. I have removed the parentheses to make this clear.

~fantasai

From kojiishi at gluesoft.co.jp Fri Feb 7 02:22:10 2014
From: kojiishi at gluesoft.co.jp (Koji Ishii)
Date: Fri, 7 Feb 2014 08:22:10 +0000
Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes
In-Reply-To: <52F492F9.3020705@inkedblade.net>
References: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp> <52F492F9.3020705@inkedblade.net>
Message-ID:

On Feb 7, 2014, at 0:02, fantasai wrote:
>>> 6.2, second paragraph (after the list of four "flow-relative directions" -- block-end, block-start, etc.)
>>> "Where unambiguous (or dual-meaning), the terms start and end are used in place of block-start/inline-start and block-end/inline-end, respectively."
>>>
>>> {COMMENT: "unambiguous" is the opposite of "dual-meaning" -- "dual meaning" means "ambiguous"; do you mean the following? (if so it's o.k. to eliminate the stuff in parentheses altogether):}
>>
>> Fixed.
>
> Similarly, this is an incorrect edit. The intent is the opposite of "ambiguous" in the sense of "lacking clearness or definiteness". If the intent is clear from context OR if the intent encompasses both meanings, then the ambiguous terms start/end are allowed to be used. I have removed the parentheses to make this clear.

After a bit more discussion with fantasai: the intent of "dual-meaning" in this context is "both directions", but I thought it meant "either direction". Maybe it's better to use different wording that indicates "both directions" better?
/koji

From fantasai.lists at inkedblade.net Fri Feb 7 01:06:16 2014
From: fantasai.lists at inkedblade.net (fantasai)
Date: Thu, 06 Feb 2014 23:06:16 -0800
Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes
In-Reply-To:
References: <52957E08.1060000@inkedblade.net>
Message-ID: <52F485E8.4010306@inkedblade.net>

On 12/26/2013 05:58 AM, Aharon (Vladimir) Lanin wrote:
> Hixie filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=24006 on Writing Modes in the beginning of December, and I added some comments there. It does not seem to have been addressed yet.

Thanks for punting that to the ML.

Wrt the paragraph beginning "In general...", it has been revised:

# In CSS, the paragraph embedding level must be set (following rule HL1)
# according to the direction property of the paragraph's containing
# block rather than by the heuristic given in steps P2 and P3 of the
# Unicode algorithm. There is, however, one exception: when the
# computed unicode-bidi of the paragraph's containing block is
# 'plaintext', the Unicode heuristics in P2 and P3 are used as
# described in [UAX9], without the HL1 override.

Wrt referring to the HL* rules: the bidi spec does not appear to require such references, only that modifications to the algorithm conform to those rules. However, I have added the references as you request to help clarify the intent.

Wrt using "must" everywhere: whether one agrees or disagrees with the style, it is not a habit of the CSS specs to do so, and statements without the modifier are nonetheless normative per http://www.w3.org/TR/css3-writing-modes/#conventions

>> is "the bidi control codes assigned to the end" defined anywhere?
>
> Yes, the control codes are defined under the various unicode-bidi values [..]

But I agree that some sort of reference is needed. Since this sentence is only a few paragraphs below the section that defines them, I haven't added a link.
But all of them are now talking about rule HL3, so this will help create that correspondence. > I now realize, however, that the spec does not make it 100% clear for > isolate-override whether it "combines" the isolate on the outside of > the override or vice-versa. This is now specified explicitly. Comment #2 is handled separately, see thread at http://lists.w3.org/Archives/Public/www-style/2014Feb/0267.htm Updated ED: http://dev.w3.org/csswg/css-writing-modes/ Please let me know if this sufficiently addresses the comment. ~fantasai From kojiishi at gluesoft.co.jp Fri Feb 7 13:01:07 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Fri, 7 Feb 2014 19:01:07 +0000 Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes In-Reply-To: References: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp> <52F492F9.3020705@inkedblade.net> Message-ID: > After a bit more discussion with fantasai, the intent of "dual-meaning" in this context > is "both directions", but I thought it means "either direction". > Maybe it's better to use different wording that indicates "both directions" better? And we've fixed this. /koji From fantasai.lists at inkedblade.net Fri Feb 7 14:48:15 2014 From: fantasai.lists at inkedblade.net (fantasai) Date: Fri, 07 Feb 2014 12:48:15 -0800 Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes In-Reply-To: References: <52957E08.1060000@inkedblade.net> <52F485E8.4010306@inkedblade.net> Message-ID: <52F5468F.7070304@inkedblade.net> On 02/07/2014 09:57 AM, Aharon (Vladimir) Lanin wrote: > Thanks, looks great! > > Just one nit: HL1 etc. are not rules. UAX9 refers to the HLs as > "clauses". So, the references to them should be something > like "clause HLx of [UAX9]". Fixed!
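The revised rule quoted above (embedding level from the 'direction' property per clause HL1, except the first-strong P2/P3 heuristic under 'plaintext') can be sketched in Python. This is an illustrative model, not any browser's code; `paragraph_level` is a hypothetical helper, and the P2 scan below ignores isolating runs and embeddings for simplicity:

```python
import unicodedata

def paragraph_level(text, direction="ltr", unicode_bidi="normal"):
    """Paragraph embedding level per the CSS rule quoted above:
    taken from the 'direction' property (clause HL1 of UAX #9),
    except that unicode-bidi: plaintext falls back to the
    first-strong heuristic of rules P2/P3."""
    if unicode_bidi == "plaintext":
        # Simplified P2: find the first strong character (L, R, or AL).
        for ch in text:
            bc = unicodedata.bidirectional(ch)
            if bc == "L":
                return 0
            if bc in ("R", "AL"):
                return 1
        return 0  # P3: no strong character => embedding level 0
    return 1 if direction == "rtl" else 0
```

For example, a paragraph of digits inside an RTL block still gets level 1 under HL1, but level 0 under 'plaintext', since digits are not strong characters.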
~fantasai From verdy_p at wanadoo.fr Mon Feb 10 01:13:05 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 10 Feb 2014 08:13:05 +0100 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F2B0B2.9090209@shadowtec.de> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> Message-ID: 2014-02-05 22:44 GMT+01:00 Rhavin Grobert : > last mail was sent from wrong address, sorry, if u get it twice, answer to > this one ;) > > > On 05.02.2014 17:22, Markus Scherer wrote: > > On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert > > wrote: > > > > Parallel to soft hyphen, a hyphen that is just inserted if the word > > was broken, it would be practical to have some way to tell browser: > > if you need to break the line, try here first. This would be really > > useful for poems, music lines, addresses,... > > > > > > That would be HTML <wbr> or > > U+200B ZERO WIDTH SPACE > > . > > > No, you did not understand. <wbr> is like ­, it's below the whitespace > level: if the line is too long, it breaks a word: > > "This is a long line with a verylongawesomeword in its middle." > > Wbr gives the opportunity to break at long|awesome. But what i mean is: > - non existing "sbr" in parallel to shy assumed - > > "Do you think me gentle,do you think me cold? > do you wanna risk alook into my thoughts?" > > if line is long enough: > > "Do you think me gentle, do you think me cold? > do you wanna risk a look into my thoughts?" > The <wbr> is enough for this purpose. A browser could even use them to give higher priority to break lines than on other breaking opportunities (on whitespace or with some punctuation).
However I'm not convinced this increased priority is a good thing if this cannot be controlled (the wbr element should then have an attribute to control this priority, relatively to standard priorities of break opportunities found in the plain text; it should also have attributes to control how the break will be realized, such as inserting hyphens or not, or another character, as it is not necessarily easy to deduce only from the language tagging). What you want is just to hint the line breaker in the renderer on where the line breaks are most beneficial. This is really something that does not belong to plain text, but to the presentation layer, and HTML for example is rich enough in such a presentation layer (that does not modify the underlying plain text, so if you get the "innerText" property of an element containing these tags, they are invisible and you'll only see the plain text itself). In my opinion the encoded SHY character is there only for legacy reasons (compatibility with older encodings when renderers had no good option to break words). But in HTML SHY is not needed and <wbr> will work better. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Mon Feb 10 01:53:37 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 10 Feb 2014 09:53:37 +0200 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> Message-ID: <52F88581.5080501@cs.tut.fi> 2014-02-10 9:13, Philippe Verdy wrote: > The <wbr> is enough for this purpose, No, since the purpose was clearly to specify a line break point that is preferred over other possible line break points, or even the only allowed line break point within a string. The <wbr> tag (an old nonstandard tag, now being standardized in HTML5) would not have been needed if browsers had supported U+200B.
It is nowadays debatable which one should be used (U+200B has the disadvantage of not being supported by IE 6, a still somewhat significant point). But in any case, they are for allowing direct line break points, nothing more. > A browser could even use them to give higher priority to break lines, That would be rather arbitrary and won't happen; there is no good reason for that. > What you want is just to hint the line breaker in the renderer on where > the line breaks are most beneficial. This is really something that > does not belong to plain text, but to the presentation layer, and HTML > for example is rich enough in such a presentation layer In rendering software, the choice between line break opportunities is usually either a very simple one (put as many characters on a line as possible) or a complicated layout decision that tries to optimize the spacing between words at a paragraph level. I don't think there is much room for any layout instructions at any layer, beyond interactive fine tuning where a human user instructs the program to split at a specific point and sees what happens, or prevents a specific break. Theoretically, it is an interesting idea to consider control characters or markup for line break opportunities with different preferability, but in practice, it would be too complicated as compared with the possible gain. > In my opinion the encoded SHY character is there only for legacy reasons > (compatibility with older encodings when renderers had no good option to > break words). But in HTML SHY is not needed and <wbr> will work better. They are completely different things. You might be confusing <wbr> with ­ (which is just a named reference for SHY, useful when you want it to be visible in source code).
Yucca From richard.wordingham at ntlworld.com Mon Feb 10 13:49:03 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Feb 2014 19:49:03 +0000 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F88581.5080501@cs.tut.fi> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> Message-ID: <20140210194903.07ad1df0@JRWUBU2> On Mon, 10 Feb 2014 09:53:37 +0200 "Jukka K. Korpela" wrote: > The <wbr> tag (an old nonstandard tag, now being standardized in > HTML5) would not have been needed if browsers had supported U+200B. > It is nowadays debatable which one should be used (U+200B has the > disadvantage of not being supported by IE 6, a still somewhat > significant point). U+200B has the distinct advantage of being a character, and therefore readily travelling with the words it separates. It's quite a useful character when dealing with inadequate or non-existent dictionaries for languages that don't have visible separators between words or, depending on line-breaking practice, syllables. Richard. From verdy_p at wanadoo.fr Mon Feb 10 14:30:41 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 10 Feb 2014 21:30:41 +0100 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F88581.5080501@cs.tut.fi> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> Message-ID: 2014-02-10 8:53 GMT+01:00 Jukka K. Korpela : > They are completely different things. You might be confusing <wbr> with > ­ (which is just a named reference for SHY, useful when you want it to > be visible in source code). > No, I make no confusion: <wbr> is a formatting HTML element, SHY (or ­ in HTML syntax for the defined entity) is a character.
Both play equivalent roles in HTML, except that ­ has a defined behavior to insert a hyphen at end of broken lines, where <wbr> would adopt a language-dependent behavior (not all languages use hyphens at end of lines to mark words that have been split by breaking lines). I really know that ­ and SHY are synonyms in this context but <wbr> is a bit different and is not part of plain-text (notably it will be filtered out from $(element).innerText, but not ­). Note that some browsers are resolving the "innerText" property of HTML DOM elements by parsing the CSS properties, so this property does not really reflect only the plain-text elements of the document: Chrome for example does this to remove spans of texts that are hidden, either by display:none, or display:hidden, or color:transparent, and it transforms <br> elements into newlines, and detects the boundary of block-elements (e.g. with "display:block" or "display:table-cell") to generate newline characters, or sometimes tabs. Chrome also injects text added by CSS ":before" and ":after" selectors. The effect of all this is that a browser uses the HTML DOM to still infer some plain text to return for the innerText element property, and <wbr> may become a SHY format control (should it?) -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Mon Feb 10 14:41:37 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 10 Feb 2014 22:41:37 +0200 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <20140210194903.07ad1df0@JRWUBU2> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> <20140210194903.07ad1df0@JRWUBU2> Message-ID: <52F93981.2000308@cs.tut.fi> 2014-02-10 21:49, Richard Wordingham wrote: > U+200B has the distinct advantage of being a character, and therefore > readily travelling with the words it separates. Granted, but it's still a character that the rendering software needs to know and support in order to have the desired effect. As I mentioned, some legacy software try to render it as a graphic character, with poor results. In contrast, in HTML, the <wbr> tag is safe in the sense that when it does not work (some modern browsers have oddities in this respect), it gets ignored. > It's quite a useful > character when dealing with inadequate or non-existent dictionaries for > languages that don't have visible separators between words or, > depending on line-breaking practice, syllables. That is correct. Yet, it needs to be supported by the relevant software. Yucca From jkorpela at cs.tut.fi Tue Feb 11 01:25:35 2014 From: jkorpela at cs.tut.fi (Jukka K.
Korpela) Date: Tue, 11 Feb 2014 09:25:35 +0200 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> Message-ID: <52F9D06F.4060604@cs.tut.fi> 2014-02-10 22:30, Philippe Verdy wrote: > No, I make no confusion: <wbr> is a formatting HTML element, SHY (or > ­ in HTML syntax for the defined entity) is a character. Both play > equivalent roles in HTML, Not at all. > except that ­ has a defined behavior to > insert a hyphen at end of broken lines, where <wbr> would adopt a > language-dependent behavior (not all languages use hyphens at end of > lines to mark words that have been split by breaking lines). Quite the opposite. The effect of SOFT HYPHEN is expected to be language-dependent (though it hardly is in web browsers): http://www.unicode.org/reports/tr14/#SoftHyphen Normally, it causes hyphenation, with a hyphen inserted when adequate. This is quite different from a direct break opportunity, which is what <wbr> means in browser practice, being standardized in HTML5: http://www.w3.org/TR/html5/text-level-semantics.html#the-wbr-element There is nothing language-dependent about <wbr>, in theory or in practice. It is never expected to result in the addition of a hyphen, and it never does that. When Netscape invented <wbr> long ago, they chose a cryptic name, which, when expanded (to "word break"), has seriously misled many people into thinking that it is for suggesting breaks inside human-language words. Its primary use is for breaks inside strings that are *not* words. (Exceptionally, it sometimes has use inside words: you might wish to write e.g. tax-free, but there the point is that a simple string break is OK, since the "-" is part of the word and no hyphen needs to be added when the word is divided.)
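The behavioural difference debated in this thread — U+00AD produces a hyphen when a break is actually taken there, while U+200B (like <wbr>) allows a bare break — can be sketched with a toy greedy line breaker. This is illustrative only (real renderers follow UAX #14, and the toy model ignores the hyphen's own width when closing a line):

```python
SHY, ZWSP = "\u00ad", "\u200b"

def wrap(text, width):
    """Toy greedy wrapper breaking at spaces, SHY and ZWSP.

    A break taken at SHY appends a hyphen to the closed line (like
    soft hyphen); a break at ZWSP or a space appends nothing (like
    <wbr>/U+200B).  Unbroken SHY and ZWSP render as nothing."""
    tokens, bounds, buf = [], [], ""
    for ch in text:
        if ch in (" ", SHY, ZWSP):
            tokens.append(buf)
            buf = ""
            # (joiner when not breaking here, suffix when breaking here)
            bounds.append((" ", "") if ch == " " else
                          ("", "-") if ch == SHY else ("", ""))
        else:
            buf += ch
    tokens.append(buf)

    lines, line = [], tokens[0]
    for (joiner, suffix), tok in zip(bounds, tokens[1:]):
        if len(line) + len(joiner) + len(tok) <= width:
            line += joiner + tok     # no break: keep space, hide SHY/ZWSP
        else:
            lines.append(line + suffix)  # break: SHY leaves a hyphen behind
            line = tok
    lines.append(line)
    return lines
```

With width 6, `"very\u00adlong word"` wraps to a hyphenated first line, while the same text with U+200B wraps without one.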
Yucca From emuller at adobe.com Wed Feb 12 10:46:58 2014 From: emuller at adobe.com (Eric Muller) Date: Wed, 12 Feb 2014 08:46:58 -0800 Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt Message-ID: <52FBA582.508@adobe.com> Does anybody have a program that transforms the UCD file BidiTest.txt to the format of BidiCharacterTest.txt, and that they are willing to share? Thanks, Eric. From ken.whistler at sap.com Wed Feb 12 13:09:33 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 12 Feb 2014 19:09:33 +0000 Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt In-Reply-To: <52FBA582.508@adobe.com> References: <52FBA582.508@adobe.com> Message-ID: Eric, The C version of the bidiref code does that, in part. See the function br_ParseFileFormatB in brinput.c. http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/ It doesn't actually *transform* the BidiTest.txt file to output the other format, but it parses the input and then constructs calls into the bidi testing API in the same format used when it parses BidiCharacterTest.txt. So you could adapt that code, if you want, to writing out lines in the format of BidiCharacterTest.txt. The main addition you would have to make would be to add a table of characters exemplifying each of the bidi classes, so you could map the bidi class values from BidiTest.txt back to actual code points to store in BidiCharacterTest.txt format. --Ken > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Eric > Muller > Sent: Wednesday, February 12, 2014 8:47 AM > To: unicode at unicode.org > Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt > > Does anybody have a program that transforms the UCD file BidiTest.txt to > the format of BidiCharacterTest.txt, and that they are willing to share? > > Thanks, > Eric. 
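The transformation Ken describes — mapping each Bidi_Class value in BidiTest.txt back to an actual code point for the BidiCharacterTest.txt format — hinges on a table of exemplar characters. A minimal sketch in Python; the exemplar choices below are my own (any character of the right class would do) and are not the table bidiref or ICU actually uses:

```python
# One exemplar code point per Bidi_Class value used in BidiTest.txt.
EXEMPLARS = {
    "L": 0x0061,    # 'a'
    "R": 0x05D0,    # HEBREW LETTER ALEF
    "AL": 0x0627,   # ARABIC LETTER ALEF
    "EN": 0x0030,   # '0'
    "AN": 0x0660,   # ARABIC-INDIC DIGIT ZERO
    "ES": 0x002B,   # '+'
    "ET": 0x0024,   # '$'
    "CS": 0x002C,   # ','
    "NSM": 0x0300,  # COMBINING GRAVE ACCENT
    "BN": 0x00AD,   # SOFT HYPHEN
    "B": 0x2029,    # PARAGRAPH SEPARATOR
    "S": 0x0009,    # TAB
    "WS": 0x0020,   # SPACE
    "ON": 0x0021,   # '!'
    "LRE": 0x202A, "RLE": 0x202B, "PDF": 0x202C,
    "LRO": 0x202D, "RLO": 0x202E,
    "LRI": 0x2066, "RLI": 0x2067, "FSI": 0x2068, "PDI": 0x2069,
}

def classes_to_hex(class_seq):
    """Turn a BidiTest.txt class sequence like 'L R EN' into the
    hex-code-point field used by BidiCharacterTest.txt lines."""
    return " ".join(f"{EXEMPLARS[c]:04X}" for c in class_seq.split())
```

For example, `classes_to_hex("L R EN")` yields `"0061 05D0 0030"`. A full converter would still have to carry over the paragraph-direction bitset and recompute the resolved levels and visual order fields.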
From markus.icu at gmail.com Wed Feb 12 13:46:03 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 12 Feb 2014 11:46:03 -0800 Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt In-Reply-To: References: <52FBA582.508@adobe.com> Message-ID: On Wed, Feb 12, 2014 at 11:09 AM, Whistler, Ken wrote: > Eric, > > The C version of the bidiref code does that, in part. > > See the function br_ParseFileFormatB in brinput.c. > > http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/ > > It doesn't actually *transform* the BidiTest.txt file to output the other > format, but it > parses the input and then constructs calls into the bidi testing API in > the same format > used when it parses BidiCharacterTest.txt. So you could adapt that code, > if you > want, to writing out lines in the format of BidiCharacterTest.txt. The > main addition you would have to make would be to add a table of > characters exemplifying each of the bidi classes, so you could map > the bidi class values from BidiTest.txt back to actual code points to > store in BidiCharacterTest.txt format. > ICU also has test code that parses both files, but it does not transform either one into the format of the other. We have both C++ and Java, and I can send you URLs if you are interested. There are also sample characters per Bidi_Class. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.fynn at gmail.com Thu Feb 13 06:30:44 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Thu, 13 Feb 2014 18:30:44 +0600 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F2B0B2.9090209@shadowtec.de> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> Message-ID: On 06/02/2014, Rhavin Grobert wrote: > No, you did not understand. is like ­ its below the whitespace > level: if the line is to long, it breaks a word: Not really alike. 
<wbr> is an HTML tag while ­ is a named reference for a character. Unicode has nothing to do with <wbr> as it is higher-level markup. - C From everson at evertype.com Fri Feb 14 07:57:12 2014 From: everson at evertype.com (Michael Everson) Date: Fri, 14 Feb 2014 07:57:12 -0600 Subject: Sorting notation Message-ID: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> So if A <<< a < B <<< b < C <<< c means that A and a sort together before B and b and that before C and c, what is the notation for where A and ? and a and ? sort together before B and ? and b and ? and then C and ? and c and ?? Michael Everson * http://www.evertype.com/ From verdy_p at wanadoo.fr Fri Feb 14 08:38:10 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 14 Feb 2014 15:38:10 +0100 Subject: Sorting notation In-Reply-To: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: A <<< a << ? <<< ? < B <<< b << ? <<< ? < C <<< c << ? <<< ? 2014-02-14 14:57 GMT+01:00 Michael Everson : > So if > > A <<< a < B <<< b < C <<< c > > means that A and a sort together before B and b and that before C and c, > what is the notation for where A and ? and a and ? sort together before B > and ? and b and ? and then C and ? and c and ?? > > Michael Everson * http://www.evertype.com/ > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Feb 14 08:41:12 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 14 Feb 2014 15:41:12 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: or if you don't want to include case distinctions at third level, but only sorting the groups for the same base letter on the secondary level: A << a << ? << ? < B << b << ? << ? < C << c << ?
<< ? A << ? << a << ? < B << ? << b << ? < C << ? << c << ? 2014-02-14 15:38 GMT+01:00 Philippe Verdy : > A <<< a << ? <<< ? < B <<< b << ? <<< ? < C <<< c << ? <<< ? > > > 2014-02-14 14:57 GMT+01:00 Michael Everson : > > So if >> >> A <<< a < B <<< b < C <<< c >> >> means that A and a sort together before B and b and that before C and c, >> what is the notation for where A and ? and a and ? sort together before B >> and ? and b and ? and then C and ? and c and ?? >> >> Michael Everson * http://www.evertype.com/ >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Fri Feb 14 10:26:07 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 14 Feb 2014 08:26:07 -0800 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: You need a reset point to say where in the UCA/CLDR universe this rule chain goes. http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings The default collation puts lowercase first. Normally you reset to a lowercase character and tailor variations to that, otherwise the few characters you tailor are inconsistent with the rest of Unicode. Implementations like ICU provide parametric settings (no need for rules) to specify uppercase first. http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options You should only reorder characters that the default order does not already have where you need them. For example, reset at each base letter, unless you want to reorder them relative to each other's default order. http://www.unicode.org/charts/collation/ See also http://cldr.unicode.org/index/cldr-spec/collation-guidelines especially about "Minimal Rules". 
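The strength notation being discussed (`<` primary, `<<` secondary, `<<<` tertiary) can be illustrated with a toy three-level sort key in Python. This is not UCA or ICU — the weights are made up — and it puts uppercase first to match Michael's example, whereas, as noted above, the CLDR default is lowercase first:

```python
import unicodedata

def sort_key(word):
    """Toy three-level collation key:
    primary   = base letter        (A < B < C)
    secondary = plain before accented  (a << ä)
    tertiary  = uppercase before lowercase  (A <<< a)"""
    key = []
    for ch in word:
        decomp = unicodedata.normalize("NFD", ch)
        base, accents = decomp[0], decomp[1:]
        key.append((base.lower(),               # primary
                    1 if accents else 0,        # secondary
                    0 if ch.isupper() else 1))  # tertiary
    return key

print(sorted(["b", "ä", "Ä", "a", "B", "A"], key=sort_key))
# prints ['A', 'a', 'Ä', 'ä', 'B', 'b']
```

That output is exactly the ordering written as A <<< a << Ä <<< ä < B <<< b in the rule notation.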
You can try out collation rules and settings at http://demo.icu-project.org/icu-bin/locexp?_=root&d_=en&x=col Best regards, markus -- Google Internationalization Engineering -------------- next part -------------- An HTML attachment was scrubbed... URL: From Perka at muchomail.com Fri Feb 14 04:37:19 2014 From: Perka at muchomail.com (=?utf-8?B?0JrRgNGD0YjQtdCy0ZnQsNC90LjQvQ==?=) Date: Fri, 14 Feb 2014 02:37:19 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian Message-ID: <20140214023719.A856914@m0048141.ppops.net> There is still a problem with letters б, г, д, п, т in italic, and б in regular mode. OpenType support is still very weak (Firefox, LibreOffice on Linux, Adobe's software and that's it, practically). It's also disappointing that Microsoft is still incapable of implementing and forcing this support on system level. Also, there are Serbian/Macedonian cyrillic vowels with accents (total: 7 types × 6 possible letters = 42 combinations) where the majority of them don't exist precomposed, and it is impossible to enter them. A lot of nowadays' fonts (even commercial) still have issues with accents. In Unicode, Latin scripts are always favored, which is simply not fair to the rest of the world. They have space to put glyphs for dominoes, a lot of dead languages etc. but they don't have space for real-world issues. I want Unicode organization to change their politics and pay attention to small countries like Serbia and Macedonia. We have real-world problems. Thank you. If you think these are biases of me, I say: a real-world problem for us. If you think changes would invalidate existing texts, I say: no, because *real* Serbian/Macedonian support still doesn't exist! And we can develop converters in the future, so I don't see any "huge cost" problems... -- ??????????? ???? _____________________________________________________________ The Free Email with so much more!
=====> http://www.MuchoMail.com <===== From mark at macchiato.com Fri Feb 14 15:00:45 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJU=?=) Date: Fri, 14 Feb 2014 13:00:45 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: Unicode is not anti-Serbian or Macedonian. The exact level of Unicode support will depend on your operating system and font choice. For example, on the Mac there are reasonable results with arbitrary accents. Here are examples with and q? Q? Here is an image, in case your emailer or OS doesn't handle these well. [image: Inline image 1] See also http://www.unicode.org/standard/where/ As to the italic, that also depends on the font support on your system. Mark *? Il meglio ? l?inimico del bene ?* On Fri, Feb 14, 2014 at 2:37 AM, ??????????? wrote: > There is still problem with letters ????? in italic, and ? in regular mode. > > OpenType support is still very weak (Firefox, LibreOffice on Linux, > Adobe's software and that's it, practically). It's also disappointing that > Microsoft is still incapable to implement and force this support on system > level. > > Also, there are Serbian/Macedonian cyrillic vowels with accents (total: 7 > types ? 6 possible letters = 42 combinations) where majority of them don't > exist precomposed, and is impossible to enter them. A lot of nowadays' > fonts (even commercial) still have issues with accents. > > In Unicode, Latin scripts are always favored, which is simply not fair to > the rest of the world. They have space to put glyphs for dominoes, a lot of > dead languages etc. but they don't have space for real-world issues. > > I want Unicode organization to change their politics and pay attention to > small countries like Serbia and Macedonia. We have real-world problems. > Thank you. > > If you think these are biases of me, I say ? 
real-world problem for us. > If you think changes would invalidate existing texts, I say ? no, because > *real* Serbian/Macedonian support still doesn't exist! And we can develop > converters in the future, so I don't see any "huge cost" problems... > > -- > ??????????? ???? > > _____________________________________________________________ > The Free Email with so much more! > =====> http://www.MuchoMail.com <===== > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2014-02-14 at 12.56.52.png Type: image/png Size: 7794 bytes Desc: not available URL: From ishida at w3.org Sat Feb 15 05:45:24 2014 From: ishida at w3.org (Richard Ishida) Date: Sat, 15 Feb 2014 11:45:24 +0000 Subject: [counter-styles] i18n-ISSUE-339: Should Japanese spec styles match implementations or vice versa? In-Reply-To: References: <52FE8762.3060504@w3.org> Message-ID: <52FF5354.7010903@w3.org> [cc public-i18n-cjk and unicode at unicode.org to get some more eyes on this] I don't think you revised the algorithm either. I think this discrepancy has been around for a long time. As Xidorn points out, we're talking here about characters that, yes, exist in the kana set, but that are not often used or not often used in this context. That said, this whole spec is about being able to customise these lists however you want. So in a sense the list of characters described in the spec is a kind of default. So I'm wondering whether, in that case, it's best to just document the exisiting implementations, and allow people to modify the list if they want. Unless you have a list of over 44 items you won't meet the problem anyway. Just thinking out loud, really. RI On 14/02/2014 23:18, Tab Atkins Jr. 
wrote: > On Fri, Feb 14, 2014 at 1:15 PM, Richard Ishida wrote: >> 6.2 Alphabetic: lower-alpha, lower-latin, upper-alpha, upper-latin, >> lower-greek, hiragana, hiragana-iroha, katakana, katakana-iroha >> http://dev.w3.org/csswg/css-counter-styles/#simple-alphabetic >> >> The hiragana, katakana, hiragana-iroha, and katakana-iroha seem to be >> implemented in the same way in Firefox, Chrome, Safari, and now Opera. The >> implementation differs from the spec only by the addition of one or two >> characters to the basic set. >> >> Should we change the spec to align with the implementations? >> >> For more information see the test results at >> http://www.w3.org/International/tests/repository/css3-counter-styles/predefined-styles/results-cstyles#simplealpha > > It's weird that the spec differs from implementations. I don't > *think* I revised those algorithms at all. > > I'd prefer to go ahead and match implementations unless they're totally off. > > ~TJ > From otto.stolz at uni-konstanz.de Sat Feb 15 11:01:06 2014 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Sat, 15 Feb 2014 18:01:06 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: <52FF9D52.2000403@uni-konstanz.de> Hello, Am 14.02.2014 11:37, schrieb ???????????: > There is still problem with letters ????? in italic, and ? in regular mode. As has been said, already, in this thread, this is a mere font issue: you have to use a particular font in order to display those italic letters, in the Serbian and Macedonian style. One example: The ?Gentium Plus? font from SIL International, available from can be configured to display the Serbian/Macedonian style italics rather than the glyphs used elsewhere. If this configuration is too cumbersome for you, feel free to ask me privately, for a copy of the font, configured for Serbian/Macedonian. 
I can send you that copy, without any obligation to maintain it or to adapt forth- coming versions. Best wishes, Otto Stolz From richard.wordingham at ntlworld.com Sat Feb 15 12:25:51 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 15 Feb 2014 18:25:51 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: <20140215182551.7535808d@JRWUBU2> On Fri, 14 Feb 2014 02:37:19 -0800 ??????????? wrote: > There is still problem with letters ????? in italic, and ? in regular > mode. > OpenType support is still very weak (Firefox, LibreOffice on Linux, > Adobe's software and that's it, practically). It's also disappointing > that Microsoft is still incapable to implement and force this support > on system level. I'll be interested to know what stops Gentium Plus, suggested by Otto Stolz, from working on, say, Windows 7. I'm very sure the support is there at a system level - the problem (if any) is more likely to be at an application level. Does LibreOffice on Windows not support Serbian italics? > Also, there are Serbian/Macedonian cyrillic vowels with accents > (total: 7 types ? 6 possible letters = 42 combinations) where > majority of them don't exist precomposed, and is impossible to enter > them. A lot of nowadays' fonts (even commercial) still have issues > with accents. Should these combinations be well known? They're not listed in the CLDR exemplar characters for Serbian. As for input, I would suggest that the solution for the simpler keyboarding techniques is to enter them as base character and then dead key. Dead keys could be available for more advanced input systems, e.g. ibus on Linux and 'text services' on Windows (Vista and above, I believe). > In Unicode, Latin scripts are always favored, which is simply not > fair to the rest of the world. 
They have space to put glyphs for > dominoes, a lot of dead languages etc. but they don't have space for > real-world issues. Precomposed characters are an unpleasant feature in Unicode. I am curious as to how the Serbian combinations escaped notice. When are they actually used? Each precomposed character adds a small processing overhead to an extremely large number of computers, not just to the computers that actually use it. By contrast, dominoes can be ignored when no-one using the computer is using the characters for them. Richard. From jsbien at mimuw.edu.pl Sat Feb 15 12:39:59 2014 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sat, 15 Feb 2014 19:39:59 +0100 Subject: precomposed characters (was: Unicode organization is still anti-Serbian and anti-Macedonian) In-Reply-To: <20140215182551.7535808d@JRWUBU2> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: <20140215193959.11167l0awd3dxlkv@mail.mimuw.edu.pl> Quote/Cytat - Richard Wordingham (Sat 15 Feb 2014 07:25:51 PM CET): > Each precomposed character adds a small processing > overhead to an extremely large number of computers, not just to the > computers that actually use it. This is a very strong claim. Would be so kind to elaborate? Best regards Janus -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From gansmann at uni-bonn.de Sat Feb 15 14:15:51 2014 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Sat, 15 Feb 2014 21:15:51 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: On Fri, 14 Feb 2014 11:37:19 +0100, ??????????? wrote: > There is still problem with letters ????? in italic, and ? 
in regular mode. > > OpenType support is still very weak (Firefox, LibreOffice on Linux, Adobe's software and that's it, practically). It's also disappointing that Microsoft is still incapable of implementing and forcing this support at system level. > > I want the Unicode organization to change their politics and pay attention to small countries like Serbia and Macedonia. We have real-world problems. Thank you. Just to make sure I am not arguing from a wrong premise: from what I gathered in quick research, the problem is that the upright letter б and the italic letters б, г, д, п and т have a different shape in Serbian and Macedonian Cyrillic than in other flavours of Cyrillic. First of all, the lack of support of such features by font creators, and the lack of support of font standards by a certain software company (which ironically happens to have created that standard), are hardly Unicode's fault. It's like complaining to your government that your favorite merchant still won't sell bananas, though bananas were legalised twenty years ago. But most importantly: encoding these characters won't do your goal any good, for many reasons: • Even if Unicode did include these characters today, it would take a long time for creators of fonts and other software to catch up – just consider how slowly support of OpenType (or other intelligent font standards) is growing, despite the fact that it concerns a lot of languages and not just two. • You cannot control every old text to be converted. However, for many such texts you can control with which font or font technologies they are rendered. The support of working solutions for the latter is likely to grow even slower if your request were granted. • People will be confused as to which characters they should use. • In the current situation, if a font does support Cyrillic but not the Serbian and Macedonian specialties, there is a decent if not identical fallback in many cases.
If the new characters were used, however, fonts that support Cyrillic but not the new characters (which especially includes every font that exists today) would not even render the upright versions of the new Serbian/Macedonian г, д, п and т correctly, even though they do contain these glyphs. • If you consider this only a temporary makeshift solution to the problem, it works against other temporary solutions (see below). Actually, the only advantage I see in encoding these letters separately is that it makes type designers aware of these specialties of Serbian and Macedonian – but neither is this the purpose of Unicode, nor is it the best way to do so, and moreover it does not compensate for the aforementioned disadvantages. Some suggestions on how to better invest your resources and energy on this issue: • Make type designers aware of this. • Enforce support of OpenType (or other intelligent font standards) or work on it yourself. (In general, it would be good if people stopped working on makeshift solutions for problems specific to their language, or complaining about their lack of support, and started working on the support of global standards that will not only solve their problem but benefit many other languages too.) • Improve open-source fonts by adding the special glyphs yourself. • As a temporary solution: request and advertise versions of important fonts that default to the Serbian/Macedonian versions of said characters instead of the others. Or, for open-source fonts: make those versions yourself. (See also Otto Stolz's answer.) > In Unicode, Latin scripts are always favored, which is simply not fair to the rest of the world. They have space to put glyphs for dominoes, a lot of dead languages etc. but they don't have space for real-world issues. It's somewhat amazing how you complain about Unicode's focus on Latin script and its encoding of things that are not Latin in one line. > Also, there are Serbian/Macedonian cyrillic vowels with accents (total: 7 types ×
6 possible letters = 42 combinations) where the majority of them don't exist precomposed, and it is impossible to enter them. A lot of today's fonts (even commercial) still have issues with accents. At least for the 6 accents and 5 vowels I found, using combining diacritical marks should work very well even without OpenType, given that the font supports these characters (and you can bet that a font which does not even support this would not support your requested precomposed glyphs). From richard.wordingham at ntlworld.com Sat Feb 15 17:33:09 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 15 Feb 2014 23:33:09 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140215182551.7535808d@JRWUBU2> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: <20140215233309.282203b0@JRWUBU2> On Sat, 15 Feb 2014 18:25:51 +0000 Richard Wordingham wrote: > On Fri, 14 Feb 2014 02:37:19 -0800 > Крушевљанин wrote: > > > There is still a problem with the letters б, г, д, п, т in italic, and б in > > regular mode. > > > OpenType support is still very weak (Firefox, LibreOffice on Linux, > > Adobe's software and that's it, practically). It's also > > disappointing that Microsoft is still incapable of implementing and > > forcing this support at system level. > > I'll be interested to know what stops Gentium Plus, suggested by Otto > Stolz, from working on, say, Windows 7. I do seem to have found a problem, though I find it hard to believe. When I looked for the OpenType language tag for Serbian at http://www.microsoft.com/typography/otspec/languagetags.htm , it wasn't there! Now I'm puzzled as to how any flavour of OpenType is supposed to automatically switch between Russian and Serbian italics as such. Gentium Plus (italic) has the Serbian italic glyphs, but via the aalt feature, which I don't think is what one would want for most uses.
> I'm very sure the support is there at a system level. It seems I was wrong! Richard. From richard.wordingham at ntlworld.com Sat Feb 15 19:55:58 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 01:55:58 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: References: Message-ID: <20140216015558.53e9deac@JRWUBU2> On Sat, 15 Feb 2014 17:43:05 -0800 "Steven R. Loomis" wrote: > Richard, SRB and MKD respectively are both in the page you linked to. Good. I made the mistake of thinking the list was sorted by English language name, rather than by tag. Richard. > >I do seem to have found a problem, though I find it hard to believe. > >When I looked for the OpenType language tag for Serbian at > >http://www.microsoft.com/typography/otspec/languagetags.htm , it > >wasn't there! From richard.wordingham at ntlworld.com Sun Feb 16 07:13:29 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 13:13:29 +0000 Subject: precomposed characters (was: Unicode organization is still anti-Serbian and anti-Macedonian) In-Reply-To: <20140215193959.11167l0awd3dxlkv@mail.mimuw.edu.pl> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> <20140215193959.11167l0awd3dxlkv@mail.mimuw.edu.pl> Message-ID: <20140216131329.4a9daa87@JRWUBU2> On Sat, 15 Feb 2014 19:39:59 +0100 "Janusz S. Bien" wrote: > Quote/Cytat - Richard Wordingham > (Sat 15 Feb 2014 07:25:51 PM CET): > > Each precomposed character adds a small processing > > overhead to an extremely large number of computers, not just to the > > computers that actually use it. > This is a very strong claim. Would you be so kind as to elaborate?
The following need to be stored simply because the character has been assigned: name (typically for character pick-lists); script (typically for breaking text runs by script); casing (upper/lower/titlecase); collation properties (not strictly necessary). There are many other properties, but many of them will often be covered by default rules and may not need to be stored explicitly. The only likely subsetting options I can think of would be to not support the supplementary planes or to not support CJK characters. This data will be moved when an operating system is installed, and the files are liable to be moved or replaced at other times. I will concede that it is possible that this information may not need to be moved from disk to memory - the data is likely to be ordered by codepoint, and if nearby codepoints are never used either, it will not need to be loaded. Some data files are mapped to memory, but unfortunately I can't comment on the processing overhead of increasing their size if the additional data is not accessed. The operation that will be most significantly affected is composition. I am assuming that composition information will be used even in the presence of a composition exclusion, e.g. to select the best glyph from a font. (One could optimise this away by potentially rendering the canonical decomposition of a precomposed character differently to the precomposed character.) The composition data, consisting of the pairs of characters to which precomposed characters decompose, will be stored in codepoint order of the decomposition. The net effect of this is that the existence of unused composition data will increase the number of cache misses, and thus increase the amount of processing required. If there is not a separate store of compositions not subject to composition exclusion, then the same effect will occur whenever a composition happens as part of the transform of a character string to NFC or NFKC, e.g.
in the processing of a non-ASCII internet domain name. If data access is not carefully optimised, there will be many more occasions when unused decompositions will nevertheless add to the processing load. Richard. From Perka at muchomail.com Sun Feb 16 05:33:56 2014 From: Perka at muchomail.com (Крушевљанин) Date: Sun, 16 Feb 2014 03:33:56 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian Message-ID: <20140216033356.31E84EC7@m0005299.ppops.net> O-kay, I got several on-list and off-list messages, so I'll compile some replies here. I receive this mailing list in daily digest, so please excuse my style of replying/commenting. Please read this compilation minutely and don't take everything as insult. People, I am perfectly aware of the existence of, and capable of using, fonts like: - from Microsoft (Windows Vista and above): Calibri, Cambria, Candara, Consolas (please make the upper part (macron) of italic 'г' longer, it looks stupid now), Constantia, Corbel, Sitka (Gabriola has the potential) - from Adobe: Arno Pro, Baskerville Cyrillic, Excelsior LT, Garamond Premier Pro, Helvetica Inserat, Minion Pro, Myriad (currently misses Serbian 'б'), Times Ten, Warnock Pro (Sava Pro also fits for this purpose) - DejaVu family (Sans, Serif, Mono) - GNU FreeFont family (Sans, Serif, Mono) - Ubuntu family (Ubuntu, Mono) - other useful fonts: Gentium Plus (SIL Graphite technology), EB Garamond. The Linux Libertine/Liberation/Biolinum families currently have severe issues and/or missing glyphs. And, font developers: please forgive me if I missed some good font for Serbian/Macedonian purposes! I would like Microsoft to alter and provide Serbian/Macedonian support to the following old (but unfortunately still used as default in many modern programs) fonts: Arial, Comic Sans (please provide Serbian 'б'
and fix italic 'г'), Courier New (please provide Serbian 'б'), Georgia, Impact (please provide Serbian 'б'), Tahoma (please provide Serbian 'б'), Times New Roman, Verdana (please provide Serbian 'б'). Adobe, Microsoft and others: please also note that, to cover both languages, in OpenType fonts you need to place both locale tags, language SRB and language MKD. (SRB for Serbian, MKD for Macedonian.) Macedonian cyrillic incorporates б, г, д, п, т from Serbian cyrillic, plus they have a separate character '?', and the italic glyph for that character rarely looks correct (GNU FreeSerif and EB Garamond have it best). What is interesting, I know next to nothing about Apple. (Probably because Macintosh computers are expensive as hell.) I have read something about AAT technology, but what about their fonts? Are there Serbian/Macedonian glyphs? I saw one old screenshot of some Serbian Wikipedia page viewed from MacOS (and Safari?, I don't know the exact details) but I didn't see proper glyphs. * * * The Unicode problems that small countries (like Serbia and Macedonia) have are SEVERE; they can not be called "a mere font issue". Please do not insult my intelligence. This is because the Serbian/Macedonian language and our cyrillic script are not used in the south Balkans only. People from all around the world communicate, and we all have different operating systems, software, fonts... When folks from America, Germany, Russia, China, Japan... exchange mail, documents, textual information on Wikipedia (even on Wikipedia, information is not always and everywhere tagged) with folks in Serbia and Macedonia, they all encounter problems – they get Russian cyrillic instead of Serbian/Macedonian. People, do you realize that proper glyphs are needed everywhere and every time, CONSTANTLY, even when an American ordinary user chats with a German ordinary user about the Serbian language, on different OS-es, textual e-mail/chat clients, GUI (Graphical User Interface) forms...
We must NOT rely on OpenType and similar technologies for this! Serbia and Macedonia became "second-class citizens", systematically discriminated against in the computer world! That's why I want Unicode to finally fulfill this requirement. To make Serbia and Macedonia "first-class citizens"! And you can not use "Private Use Areas", that's not reliable. Please read the further discussion below with an employee from Microsoft. And note that Serbian/Macedonian cyrillic is not just "preferable"; that is not the appropriate term. The correct glyphs are REQUIRED – we can not accept Russian glyphs. Especially when in Russian the small italic 'п' and 'т' look *exactly* like latin 'n' and 'm'! That's nonsense for Serbian/Macedonian users (because we also use latin). Furthermore, Serbian small 'б' is visually better than the Russian counterpart. Sure, this is my personal opinion, and I say it because the Russian version looks like the digit 6 and the Serbian doesn't (or, at least, not at very low sizes)! So, Serbian small 'б' can enter Unicode as an authentic Serbian letter. It resembles Greek gamma, but it's not exactly the same – the pronunciation is different and the upper part of the glyph design must be slightly altered, and the result would be fine. (And all Serbian glyphs are visually better than Russian. Yes, I claim it. The Russian "curvature" italic 'г', for example, is *extremely ugly* to me. The Serbian "i-macron" style is better. And the longer part of cursive/handwritten 'д' always goes below, like latin 'g' in some designs, not above.) * * * Technologies like OpenType, SIL Graphite and AAT are good. People want stylistic alternate shapes, ornaments etc. But these technologies can not replace Unicode. Unicode comes first, and it obviously shows that this organization must do internal, system-level support for Serbian/Macedonian issues. From the disappointing and incapable company called Microsoft, heh heh, one employee asked me to further clarify implementation and system-level OpenType support.
Well, I'm not a C/C++ programmer (man, I wish I were!), but for non-compliant software can't you somehow intercept all textual communication and replace Russian glyphs with Serbian? It is crucially important to apply this behaviour on all Windows GUI forms (native API, .NET framework etc.), system-wide. And why only in Internet Explorer 11 (currently via CSS 3 – can't you force this in settings?), and Office 2010 and above (Word only? Why not Excel, Access... man, we need it EVERYWHERE). Please continue reading the following. Mozilla Firefox has great support for resolving Serbian/Macedonian issues. The OpenType locale is supported, rendering is correct when you have an HTML attribute like lang="sr" and, for example, you can entirely disable the page author's choice of fonts, for any writing script Firefox supports. To compensate for bad or incomplete support, I use that powerful feature all the time, and I wish other manufacturers like Google, Opera etc. would do the same in their products. (Just implement the same as Firefox did – but then again, it's not an almighty feature.) LibreOffice also does a nice job, but currently under GNU/Linux only. (I talked to one developer, from Red Hat Software I believe, and the problem is the shaping/rendering engine they currently use for MS Windows – they should change it and adopt a better one like Pango, HarfBuzz...) It must be said that GNU/Linux in general stands much better than MS Windows in this regard. (If I recall correctly, in Ubuntu you can have OpenType locale support from the very beginning/installation.) So, Microsoft, start modelling your OT support on the one from GNU/Linux, make a good programming library and abandon the old useless stuff. Can DirectWrite help in this regard? So, I would like EVERY piece of software to have great OpenType/Graphite/AAT support like Firefox and GNU/Linux, but unfortunately, we are still far away from that "nirvana". (Conclusion: we are far away because of the Unicode organization and Microsoft, in the first place.)
* * * About the further support with accents. I was asked to provide "a reference" for the 42 combinations I mentioned. (The biggest reference is that I'm a Serb, heh heh, and I have modern local scientific books for proper spelling.) In Serbian (and Macedonian can not be much different in this regard) there are 5 vowels (а, е, и, о, у), and in some linguistic cases 'р' can be considered a vowel (all of these characters are cyrillic, not latin). So, that's 6 of them. In Serbian there are usually 4 accents, but for *full professional* linguistic purposes, 7 of them (grave, double grave, acute, breve, inverted breve, circumflex and macron). I inform you that I used MS Keyboard Layout Creator v1.4 and created an excellent keyboard layout for myself, but *most* fonts nowadays, even from Microsoft and Adobe, show their ugly behaviour in this regard. I think the DejaVu family stands on firm ground here, Gentium Plus too, and I also heard the SIL Doulos font has been created with professional linguistics in mind... So, the mathematics is: 6 × 7 = 42 combinations of *cyrillic* accented letters. Hmmm, that's for lowercase – do we need capital versions as well? Yikes, that makes 84 glyphs! Still, the best option is to have them precomposed, don't you agree, my friends? Font developers, please make *perfect* support with combining diacritics, and, just to be sure, draw these 84 characters precomposed now, mark them eventually as (Serbian) accented cyrillic, make excellent kerning, and I would buy such a precious font (with Serbian б, г, д, п, т, of course). Who knows, you then might be of interest to scientific institutions, government... and not just Serbian ones. * * * You know what? I'm not that young and incompetent a computer user. I've been struggling with these notorious issues for more than 15 years. It just happened that I expressed my rage now. Before posting this, I surely took some time and read previous related conversations on this mailing list, and a lot of related things besides.
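The accent arithmetic above can be checked against the Unicode character database itself. A minimal Python sketch, assuming the letter set а е и о у р and the seven accents listed (grave, double grave, acute, breve, inverted breve, circumflex, macron):

```python
import unicodedata

# The six base letters described above: five vowels plus syllabic р
letters = "аеиоур"
# Combining marks: grave, double grave, acute, breve, inverted breve,
# circumflex, macron
accents = "\u0300\u030F\u0301\u0306\u0311\u0302\u0304"

# NFC composes a base + mark pair into one codepoint only where a
# precomposed character is encoded
precomposed = [unicodedata.normalize("NFC", l + a)
               for l in letters for a in accents
               if len(unicodedata.normalize("NFC", l + a)) == 1]

print(len(letters) * len(accents))  # 42 combinations in total
print(precomposed)                  # the few that exist precomposed (e.g. й, ў)
```

Running this shows that only a small handful of the 42 combinations (such as й = и + breve and ў = у + breve) have precomposed forms; the rest exist in Unicode only as combining sequences, which is exactly why the combining-marks-plus-font-support route discussed in this thread is the practical one.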
I know perfectly well what (you say that) Unicode is and is not. It is easy for you Latin-oriented nations (USA, Germany...) to ignore the rest of the world, especially third-world countries. You are powerful, others are weak. You have big software companies like Microsoft and Adobe, others don't. Your Latin scripts are perfected, others have to battle with their own. You have fancy OpenType effects, others don't even deserve the basic support. It is easy for you to make only Russian-compatible fonts, and you do it practically always, because the market is considerably bigger than the market of the south Balkans. Who cares about their real-world problems... But all of this is simply NOT FAIR. My final conclusion: until Serbian and Macedonian people get the required/proper glyphs and the required accented letters, all of this SYSTEMATICALLY packaged at the Unicode and operating-system level ("first-class citizens"), not just in "popular software", Unicode will still be an anti-Serbian and anti-Macedonian organization. The whole Unicode standard will be faulty and the Unicode organization *politically aggressive* to small, "incompetent", "ugly" countries like Serbia and Macedonia. -- Best regards from Крушевљанин ???? (that's one resentful and provocative computer user from Krusevac town in Serbia) _____________________________________________________________ The Free Email with so much more! =====> http://www.MuchoMail.com <===== From verdy_p at wanadoo.fr Sun Feb 16 11:44:55 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 16 Feb 2014 18:44:55 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140215182551.7535808d@JRWUBU2> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: 2014-02-15 19:25 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Fri, 14 Feb 2014 02:37:19 -0800 > Крушевљанин wrote: > Should these combinations be well known?
They're not listed in the CLDR exemplar characters for Serbian. > > As for input, I would suggest that the solution for the simpler keyboarding techniques is to enter them as base character and then dead key. "Dead keys" don't work this way. Their name really indicates that these keys have no action (they seem dead) until another key is pressed AFTER them. So you press the dead key for the diacritic, then the key for the base letter, to produce EITHER: - a single precomposed character (where it exists); OR - a canonically equivalent decomposed combining sequence representing the letter with its diacritic(s) (preferably in NFC form). Dead keys may be combined in advanced keyboard drivers supporting complex input states for handling multiple diacritics typed before a base letter; but simple keyboard drivers (such as those generated by the MS Keyboard Layout Creator) do not handle these complex states. But nothing prohibits building such a keyboard driver. There's another input method where you can press a key for the diacritic after a base letter: this key is treated in isolation and immediately generates the combining diacritic, independently of the characters pressed before. But such an input method will not guarantee the NFC form, and can produce broken sequences (in some cases the diacritic may be invisible in the generated text). For simple alphabetic scripts (like Latin, Greek, Cyrillic), the dead key input method is generally preferred. The other one is used to enter isolated combining diacritics which are almost never used in association with other letters (and notably not in combining sequences equivalent to an existing precomposed letter). If you think about the combining diaeresis, as it is already used very frequently in association with Latin and Cyrillic letters using a dead key method, it should also be used as a dead key even for less frequent base letters such as the Cyrillic letter Q.
All that is needed is to use an updated driver adding the mapping for diacritic dead key + letter, in which it will output the NFC combining sequence if there's no precomposed NFC equivalent. ---- Unfortunately, the drivers generated by the MS Keyboard Layout Creator (MSKLC), when it does not find any explicitly predefined mapping for diacritic dead key + base letter, will generate the mapping for , followed by the base letter, meaning that you won't get the text , but ! The second limitation of MSKLC is that it cannot chain dead keys: each input state must be mapped to a single state represented by a single character, which is the spacing modifier letter that would be output if you pressed the SPACE bar after the diacritic. It incorrectly assumes that combinations which are not mapped explicitly will always be followed by a space-bar keystroke to produce a spacing modifier letter, as if all unmapped sequences were not possible and did not exist in the real world. The other limitation is that this input state can only be represented by a single character in the BMP (though it may be represented by a PUA character of the BMP, even if MSKLC warns that this character may not be supported by fonts on the native OS, or in the Console using the local legacy OEM or "ANSI" codepage (an 8-bit code page which may be either SBCS or DBCS)). Drivers built by MSKLC do not allow mapping a dead key outside the root state table (so after pressing a dead key, possibly in combination with state modifier keys like Shift, Ctrl, Alt, and with the current state of CapsLock/ShiftLock, you can only press a single base character, also possibly in combination with state modifier keys).
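The dead-key flow described in this thread – diacritic keystroke first, base letter second, with the driver emitting NFC where a precomposed character exists – can be sketched as a tiny state machine. This is an illustrative sketch only, not any real driver; the two dead-key mappings below are hypothetical:

```python
import unicodedata

# Hypothetical dead-key table: keystroke -> combining mark it holds
DEAD_KEYS = {"`": "\u0300",   # dead grave
             "^": "\u0302"}   # dead circumflex

def type_keys(keys):
    """Consume keystrokes; a dead key produces no output until the
    next (base) key arrives, then the pair is emitted in NFC."""
    out, pending = [], ""
    for key in keys:
        if key in DEAD_KEYS:
            pending += DEAD_KEYS[key]  # "dead": remember mark, emit nothing
        else:
            # base letter + stored mark(s), composed where Unicode allows
            out.append(unicodedata.normalize("NFC", key + pending))
            pending = ""
    return "".join(out)

print(type_keys("`a"))  # -> single precomposed "à"
print(type_keys("`р"))  # no precomposed form: р + combining grave
```

Note how the same two keystrokes yield one codepoint for à but a two-codepoint combining sequence for р with grave, which is exactly the behaviour an NFC-emitting driver should have for the Serbian accented letters that lack precomposed forms.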
Due to these limitations of MSKLC, trying to generate advanced keymaps to support extended sets of combining sequences requires using complex key combinations with state modifiers (for the dead key and for the base letter), which are very awkward to input, when the sequences would be simpler and faster to enter if chains of dead keys were supported. Dead keys are not very complex; in fact they are quite friendly and have the advantage of normalizing the input to NFC directly, without needing any additional support from the external text editor (modifying the text buffer on the fly). They are natural to users even if the input order of keystrokes is reversed compared to the Unicode encoding of the generated text (something that most users will never see, as they have no idea how the text will be finally encoded and used in their applications). -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sun Feb 16 15:57:38 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 16 Feb 2014 13:57:38 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216033356.31E84EC7@m0005299.ppops.net> References: <20140216033356.31E84EC7@m0005299.ppops.net> Message-ID: Every time you attack the only character set that supports various third-world African languages, various tiny North American languages, various small Indian languages and various Philippine scripts, with claims that it's "easy for you latin-oriented nations (USA, Germany...) to ignore the rest of the world, especially third-world countries", people stop listening to you. Unicode is the system designed to make it possible to write the scripts of all languages. Microsoft happens to have been one of the largest drivers behind it, having spent a lot of money on Unicode and OpenType to make this stuff possible.
> People, do you realize that proper glyphs are needed everywhere and every time, CONSTANTLY, even when an American ordinary user chats with a German ordinary user about the Serbian language They'd use Latin, because that's what their keyboards are going to support. Virtually every recent protocol runs over some sort of XML, so language tagging comes free, and if they don't, they need to provide some sort of language tagging. And if we picked your option and they did use Cyrillic? I'm betting the American ordinary user and the German ordinary user would load up their Russian keyboards and type away using Russian letters for Serbian. It is an incredibly well-known problem that if you have two similar-looking characters, users will use the more common one even when the less common one is the correct one. There won't be new precomposed characters, and there shouldn't be a need for them. There won't be new Serbian characters invalidating every text stored in systems in Serbian today. Maybe 15 years ago a change could in theory have been made, but not today. Deal with what you have, because those decisions have been made and written in stone. -- Kie ekzistas vivo, ekzistas espero. From richard.wordingham at ntlworld.com Sun Feb 16 16:12:23 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 22:12:23 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: <20140216221223.7369356f@JRWUBU2> On Sun, 16 Feb 2014 18:44:55 +0100 Philippe Verdy wrote: > 2014-02-15 19:25 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > On Fri, 14 Feb 2014 02:37:19 -0800 > > Крушевљанин wrote: > Should these combinations be well known? They're not listed in the CLDR exemplar characters for Serbian.
> > As for input, I would suggest that the solution for the simpler keyboarding techniques is to enter them as base character and then dead key. > There's another input method where you can press a key for the diacritic after a base letter: this key is treated in isolation and immediately generates the combining diacritic, independently of the characters pressed before. Sorry, this is what I meant. I should have written 'diacritic', not 'dead key'. > But such an input method will not guarantee the NFC form, Which is an argument for text editors to have normalisation functions, like the emacs ucs-normalize-NFC-region command. > and can produce broken sequences (in some cases the diacritic may be invisible in the generated text). Something many users of the Thai script currently have to live with. Richard. From richard.wordingham at ntlworld.com Sun Feb 16 17:50:45 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 23:50:45 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: References: <20140216033356.31E84EC7@m0005299.ppops.net> Message-ID: <20140216235045.7f916c0d@JRWUBU2> On Sun, 16 Feb 2014 13:57:38 -0800 David Starner wrote: > > People, do you realize that proper glyphs are needed everywhere and > > every time, CONSTANTLY, even when an American ordinary user chats with > > a German ordinary user about the Serbian language > And if we picked your option and they did use Cyrillic? I'm betting > the American ordinary user and the German ordinary user would load up their > Russian keyboards and type away using Russian letters for Serbian. The American *ordinary* user and the German *ordinary* user would not be typing Serbian. One issue here that I don't know the solution for is how the right glyphs should be chosen for displaying plain text communication.
I don't know any general mechanism for, say, specifying that by default Cyrillic text should use Serbian glyphs, CJK characters should use Japanese glyphs and Cuneiform should use Neo-Assyrian glyphs. > There won't be new Serbian characters invalidating > every text stored in systems in Serbian today. I don't like the idea, but one possibility would be to define Serbian glyph styles by adding variation selectors. Variation selectors are already 'defined' for the decimal digits U+0030 to U+0039. It would, however, mess up string comparison operations that weren't smart enough to ignore variation selectors. Richard. From tom at bluesky.org Sun Feb 16 18:23:04 2014 From: tom at bluesky.org (Tom Gewecke) Date: Sun, 16 Feb 2014 17:23:04 -0700 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216235045.7f916c0d@JRWUBU2> References: <20140216033356.31E84EC7@m0005299.ppops.net> <20140216235045.7f916c0d@JRWUBU2> Message-ID: <44391A18-ED55-4CE5-AF3D-1D363E9F89F7@bluesky.org> On Feb 16, 2014, at 4:50 PM, Richard Wordingham wrote: > > One issue here that I don't know the solution for is how the right > glyphs should be chosen for displaying plain text communication. I > don't know any general mechanism for, say, specifying that by > default Cyrillic text should use Serbian glyphs, CJK characters > should use Japanese glyphs and that Cuneiform should use Neo-Assyrian > glyphs. In Mac OS X and iOS, this is currently being done for the CJK case by switching fonts according to the order of languages in the system-level language preferences. If Japanese is higher than Chinese on the list, then by default a Japanese font is used for CJK plain text. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From tom at bluesky.org Sun Feb 16 18:51:55 2014 From: tom at bluesky.org (Tom Gewecke) Date: Sun, 16 Feb 2014 17:51:55 -0700 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216033356.31E84EC7@m0005299.ppops.net> References: <20140216033356.31E84EC7@m0005299.ppops.net> Message-ID: On Feb 16, 2014, at 4:33 AM, ??????????? wrote: > > What is interesting, I know next to nothing about Apple. (Probably because Macintosh computers are expensive as hell.) I have read something about AAT technology, but what about their fonts? Are there Serbian/Macedonian glyphs? I had a look, and I think the answer is "no". (Except for two, one of which is Chinese, which seem to have the Serbian '?' by mistake). -------------- next part -------------- An HTML attachment was scrubbed... URL: From gansmann at uni-bonn.de Mon Feb 17 03:33:05 2014 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Mon, 17 Feb 2014 10:33:05 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216235045.7f916c0d@JRWUBU2> References: <20140216033356.31E84EC7@m0005299.ppops.net> <20140216235045.7f916c0d@JRWUBU2> Message-ID: On Mon, 17 Feb 2014 00:50:45 +0100, Richard Wordingham wrote: > I don't like the idea, but one possibility would be to define Serbian glyph styles by adding variation selectors. Variation selectors are already 'defined' for the decimal digits U+0030 to U+0039. It would, however, mess up string comparison operations that weren't smart enough to ignore variation selectors. Also, for the variation selectors to work for the end user, it requires the same technologies whose lack of support is why we are discussing this in the first place, doesn't it? So, defining the corresponding variation selectors would not make the end user see the correct glyphs earlier.
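Richard's caveat, that variation selectors would "mess up string comparison operations that weren't smart enough to ignore variation selectors", is easy to demonstrate. A minimal Python sketch: the VS1-tagged Cyrillic letter below is a made-up illustration of the proposed scheme, not a registered variation sequence.

```python
import unicodedata

# VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF).
VARIATION_SELECTORS = set(range(0xFE00, 0xFE10)) | set(range(0xE0100, 0xE01F0))

def strip_variation_selectors(s):
    """Drop variation selectors so comparisons ignore glyph-variant requests."""
    return "".join(ch for ch in s if ord(ch) not in VARIATION_SELECTORS)

plain = "\u0431"           # CYRILLIC SMALL LETTER BE
tagged = "\u0431\ufe00"    # same letter plus a hypothetical VS1 "Serbian glyph" request

# A naive comparison sees two different strings...
assert plain != tagged
# ...while a selector-aware comparison treats them as the same text.
assert strip_variation_selectors(plain) == strip_variation_selectors(tagged)
```

Normalization does not help here: variation selectors have no decompositions, so NFC/NFD leave them in place, and only comparison code that explicitly ignores them behaves as Richard describes.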
From otto.stolz at uni-konstanz.de Mon Feb 17 07:57:56 2014 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Mon, 17 Feb 2014 14:57:56 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216235045.7f916c0d@JRWUBU2> References: <20140216033356.31E84EC7@m0005299.ppops.net> <20140216235045.7f916c0d@JRWUBU2> Message-ID: <53021564.4040102@uni-konstanz.de> Hello, ??????????? ???? had written: > People, do you realize that proper glyphs are needed everywhere and > every time, CONSTANTLY, even when American ordinary user chats with > German ordinary user about Serbian language On 2014-02-17 at 00:50 CET, Richard Wordingham wrote: > One issue here that I don't know the solution for is how the right > glyphs should be chosen for displaying plain text communication. I > don't know any general mechanism for, say, specifying that by > default Cyrillic text should use Serbian glyphs, CJK characters > should use Japanese glyphs and that Cuneiform should use Neo-Assyrian > glyphs. This boils down to the fact that, in plain-text communication, the receiver can, and should, choose the appropriate font. This holds, in particular, for classical e-mail. Thence my recent claim that the problem posed by ???? is a mere font issue. In HTML, this is a bit different: The author has control over the fonts (thence over the glyphic style) used for the display, but the reader can normally override the author's choice. Hence, WWW authors should specify suitable fonts for their respective articles (or even parts thereof). On paper, or in PDF and other facsimile formats, the author is entirely responsible for the glyphic style and appearance, and he should always choose suitable fonts. This is the realm of the solution involving that 'Gentium Plus srp' font I had mentioned recently. May I humbly remind ????
(and all other readers of this thread) that the problem manifests itself (mainly or only) with italic style letters; hence there remains virtually no problem with normal (non-italic) style. Best wishes, Otto Stolz From kent.karlsson14 at telia.com Mon Feb 17 08:23:00 2014 From: kent.karlsson14 at telia.com (Kent Karlsson) Date: Mon, 17 Feb 2014 15:23:00 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: Message-ID: On 2014-02-17 10:33, "Gerrit Ansmann" wrote: >> I don't like the idea, but one possibility would be to define Serbian glyph >> styles by adding variation selectors. Variation selectors are already >> 'defined' for the decimal digits U+0030 to U+0039. It would, however, >> mess up string comparison operations that weren't smart enough to ignore >> variation selectors. > Also, for the variation selectors to work for the end user, it requires > the same technologies whose lack of support is why we are discussing this > in the first place, doesn't it? So, defining the corresponding variation > selectors would not make the end user see the correct glyphs earlier. Still, variation selectors would be, in the text, a very localized indication, independent of the (displaying) user's preference settings or language declaration (from the author, in e.g. XML/HTML formats) for the text, and variation selectors are indeed more likely to survive operations like cut-and-paste. There would be a problem of inserting variation selectors at all places where appropriate, though. Spell checking functionality could, in principle at least, help with the latter. /Kent K From mathias at qiwi.be Thu Feb 20 04:42:01 2014 From: mathias at qiwi.be (Mathias Bynens) Date: Thu, 20 Feb 2014 11:42:01 +0100 Subject: Difference between 'combining characters' and 'grapheme extenders'? Message-ID: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> What is the difference between 'combining characters'
(http://www.unicode.org/faq/char_combmark.html) and 'grapheme extenders' (http://www.unicode.org/reports/tr44/#Grapheme_Extend) in Unicode? They seem to do the same thing, as far as I can tell, although the set of grapheme extenders is larger than the set of combining characters. I'm clearly missing something here. Why the distinction? I've also posted this question on Stack Overflow: http://stackoverflow.com/q/21722729/96656 From verdy_p at wanadoo.fr Thu Feb 20 05:10:09 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 20 Feb 2014 12:10:09 +0100 Subject: Difference between 'combining characters' and 'grapheme extenders'? In-Reply-To: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> References: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> Message-ID: Many grapheme extenders are not "combining characters". Combining characters are classified this way for legacy reasons (the very weak "general category" property) and this property is normatively stabilized. Also, most combining characters have a non-zero combining class, and they are stabilized for the purpose of normalization. Grapheme extenders include characters that are NOT combining characters but controls (e.g. joiners). Some grapheme clusters are also more complex in some scripts: there are extenders encoded BEFORE the base character, and these cannot be classified as combining characters, because combining characters are always encoded AFTER a base character. For legacy reasons (and roundtrip compatibility with older standards) not all scripts are encoded using the UCS character model with combining characters. (E.g. the Thai script does not follow the "logical" encoding order, but the model used in TIS-620 and other standards based on it, including on Windows and *nix systems.)
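Philippe's Thai example can be checked against the Unicode Character Database with Python's standard unicodedata module (property values depend on the Unicode version your Python ships):

```python
import unicodedata

# U+0E40 THAI CHARACTER SARA E is typed and stored *before* the consonant
# it modifies, yet it is an ordinary letter (Lo), not a combining mark:
assert unicodedata.category("\u0e40") == "Lo"

# A Thai tone mark such as U+0E48 MAI EK, by contrast, follows its base
# letter and is a nonspacing mark with a non-zero combining class:
assert unicodedata.category("\u0e48") == "Mn"
assert unicodedata.combining("\u0e48") == 107
```

The prefixed vowel is stored in visual order (vowel first), which is why it cannot be a combining character in the Unicode sense even though it attaches to the following consonant for display.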
2014-02-20 11:42 GMT+01:00 Mathias Bynens : > What is the difference between 'combining characters' ( > http://www.unicode.org/faq/char_combmark.html) and 'grapheme extenders' ( > http://www.unicode.org/reports/tr44/#Grapheme_Extend) in Unicode? > > They seem to do the same thing, as far as I can tell - although the set of > grapheme extenders is larger than the set of combining characters. I'm > clearly missing something here. Why the distinction? > > I've also posted this question on Stack Overflow: > http://stackoverflow.com/q/21722729/96656 > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tyler at tylercipriani.com Thu Feb 20 09:57:26 2014 From: tyler at tylercipriani.com (Tyler Cipriani) Date: Thu, 20 Feb 2014 08:57:26 -0700 Subject: Banjo glyph proposal--open discussion Message-ID: I'm proposing adding a single UCS character to further the goal of set completeness for the set of glyphs represented on the SMP block: Miscellaneous Symbols and Pictographs: 'BANJO' (proposed glyph U+1F3DB) My current proposal is available at: https://github.com/thcipriani/unicode-banjo/blob/master/Proposal/Banjo_Unicode_Proposal.markdown Thank you in advance for any feedback or comments. Tyler Cipriani -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Thu Feb 20 14:00:15 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 20 Feb 2014 20:00:15 +0000 Subject: Difference between 'combining characters' and 'grapheme extenders'? In-Reply-To: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> References: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> Message-ID: <20140220200015.6ba76901@JRWUBU2> On Thu, 20 Feb 2014 11:42:01 +0100 Mathias Bynens wrote: > What is the difference between 'combining > characters' (http://www.unicode.org/faq/char_combmark.html) and > 'grapheme > extenders' (http://www.unicode.org/reports/tr44/#Grapheme_Extend) in > Unicode? > > They seem to do the same thing, as far as I can tell, although the > set of grapheme extenders is larger than the set of combining > characters. I'm clearly missing something here. Why the distinction? Spacing combining marks (category Mc) are in general not grapheme extenders. The ones that are included are mostly included so that the boundaries between 'legacy grapheme clusters' http://www.unicode.org/reports/tr29/tr29-23.html are invariant under canonical equivalence. There are six grapheme extenders that are not nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
ZWNJ, ZWJ,
U+302E HANGUL SINGLE DOT TONE MARK
U+302F HANGUL DOUBLE DOT TONE MARK
U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
I can see that it will sometimes be helpful to keep ZWNJ and ZWJ along with the previous base character. The fullwidth sound marks U+3099 and U+309A are included for reasons of canonical equivalence, so it makes sense to include their halfwidth versions. I don't actually see the logic for including U+302E and U+302F.
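The category claims above can be checked with Python's unicodedata module. Note that the values reflect whatever Unicode version the interpreter ships; U+302E/U+302F in particular were reclassified from Mn to Mc in Unicode 6.1, so these assertions assume a reasonably recent build:

```python
import unicodedata

# None of the six grapheme extenders listed above is a nonspacing (Mn)
# or enclosing (Me) mark:
assert unicodedata.category("\u200c") == "Cf"  # ZWNJ - a format control
assert unicodedata.category("\u200d") == "Cf"  # ZWJ  - a format control
assert unicodedata.category("\u302e") == "Mc"  # HANGUL SINGLE DOT TONE MARK
assert unicodedata.category("\u302f") == "Mc"  # HANGUL DOUBLE DOT TONE MARK
assert unicodedata.category("\uff9e") == "Lm"  # HALFWIDTH KATAKANA VOICED SOUND MARK
assert unicodedata.category("\uff9f") == "Lm"  # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

# The fullwidth sound mark they correspond to is an ordinary combining
# character with a non-zero canonical combining class:
assert unicodedata.category("\u3099") == "Mn"
assert unicodedata.combining("\u3099") == 8
```

This makes the distinction concrete: Grapheme_Extend is a property tuned for cluster segmentation, not a synonym for "combining mark" in the General_Category sense.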
If you're going to encourage forcing someone who's typed the wrong base character before a sequence of 3 non-spacing marks to retype the lot, you may as well do the same with Hangul tone marks. Richard. From rwhlk142 at gmail.com Sat Feb 22 13:46:03 2014 From: rwhlk142 at gmail.com (Robert Wheelock) Date: Sat, 22 Feb 2014 14:46:03 -0500 Subject: Hebrew Extended Block(s) Message-ID: Hello! There's an empty subblock (U+00860–U+0089F) with 64 empty codepoints where we could put needed additional Hebrew characters in... The (U+00860) column could house such things as:
• ḤAṬAF-ḤIRIQ (ḥiriq + shəwa)
• ḤAṬAF-QIBBUṢ (qibbuṣ + shəwa)
• TRUE SHURUQ POINT FOR WAW (point inside WAW, but slightly higher up)
• WAW WITH SHURUQ (WAW having the TRUE SHURUQ POINT inside)
• DOUBLY-POINTED SHIN (a SHIN with both ŚIN and SHIN points atop)
• DOUBLY-POINTED SHIN WITH DAGHESH (the preceding letter, only with an added DAGHESH inside)
• WAW WITH DAGHESH AND ḤOLAM
• WAW WITH DAGHESH AND SHURUQ
• The 4 DIACRITICAL POINTS ABOVE (single, double horizontal, triple up triangle, and quad squared) used to extend the Hebrew alphabet to new sounds for other Jewish languages (Judeo-Arabic, ...)
• VARIQA HAFUKH (for the same purpose)
• GALGAL HAFUKH (to write the Yiddish palatals, instead of using a double yudh ligature; also extended to ʿAYIN for writing an /e/ vowel in Yiddish)
The remaining 48 codepoints (U+00870–U+0089F) could house additional letters that are used in other Jewish languages (Hebrew letters with points above that mimic those on the corresponding Arabic letters). Research needs to be done to determine the most widely used of those (keeping in mind that those based on KAF, MEM, NUN, PE, and ṢADHEH require 2 codepoints each: the 1st for its final form, followed by a 2nd for its regular form) to assign to these 48 codepoints. The remainder of those marked letters, along with variant cantillation signs, will need another codepoint subblock to reside at...
we got (at least) 3 variant cantillation systems, each with their own vowel points and reading signs; besides those, we also have occasionally- and rarely-used supplemental marked Hebrew letters. We should also reserve (at least) a codepoint for the YHWH TETRAGRAMMATON. Shalom! Thank You! -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Feb 22 14:06:37 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 22 Feb 2014 21:06:37 +0100 Subject: Hebrew Extended Block(s) In-Reply-To: References: Message-ID: 2014-02-22 20:46 GMT+01:00 Robert Wheelock : > Hello! > > There's an empty subblock (U+00860–U+0089F) with 64 empty codepoints > where we could put needed additional Hebrew characters in... > > The (U+00860) column could house such things as:
> • ḤAṬAF-ḤIRIQ (ḥiriq + shəwa)
> • ḤAṬAF-QIBBUṢ (qibbuṣ + shəwa)
> • TRUE SHURUQ POINT FOR WAW (point inside WAW, but slightly higher up)
> • WAW WITH SHURUQ (WAW having the TRUE SHURUQ POINT inside)
> • DOUBLY-POINTED SHIN (a SHIN with both ŚIN and SHIN points atop)
> • DOUBLY-POINTED SHIN WITH DAGHESH (the preceding letter, only with an added DAGHESH inside)
> • WAW WITH DAGHESH AND ḤOLAM
> • WAW WITH DAGHESH AND SHURUQ
> • The 4 DIACRITICAL POINTS ABOVE (single, double horizontal, triple up triangle, and quad squared) used to extend the Hebrew alphabet to new sounds for other Jewish languages (Judeo-Arabic, ...)
> • VARIQA HAFUKH (for the same purpose)
> • GALGAL HAFUKH (to write the Yiddish palatals, instead of using a double yudh ligature; also extended to ʿAYIN for writing an /e/ vowel in Yiddish)
Most of these are already encoded using multiple codepoints (e.g. doubly-pointed shin, with or without dagesh).
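Philippe's point, that combinations like the doubly-pointed shin are already representable as combining sequences, can be illustrated with an existing presentation form; a Python sketch:

```python
import unicodedata

# U+FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT is a presentation
# form whose canonical decomposition is a base letter plus combining
# points (shin + dagesh + shin dot):
assert unicodedata.normalize("NFD", "\ufb2c") == "\u05e9\u05bc\u05c1"

# The presentation form is a composition exclusion, so NFC keeps the
# multi-codepoint spelling -- the combining sequence *is* the
# normalized representation:
seq = "\u05e9\u05bc\u05c1"
assert unicodedata.normalize("NFC", seq) == seq
```

So a "doubly-pointed shin with dagesh" needs no new precomposed codepoint; the normalized text is the base letter followed by its combining points, and fonts are expected to render that sequence.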
The Hebrew script is already challenging enough; we don't need to add complexity by adding even more ways to encode the same thing (and then to update the already complex collation rules, for which there have been tons of comments and solutions already implemented, including borrowing some Arabic combining marks for use within the Hebrew script). Some of the characters you propose are just typographic variants. I suggest you read the PDF document about the development of the SIL SBL font; it is really informative about many of these issues and how that font was designed (according to many discussions that occurred years ago on this list). I'm convinced that this document should also be better referenced as an informative technical report for the script (because it is in fact not specific to the SBL font itself). -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sat Feb 22 14:55:33 2014 From: everson at evertype.com (Michael Everson) Date: Sat, 22 Feb 2014 12:55:33 -0800 Subject: Hebrew Extended Block(s) In-Reply-To: References: Message-ID: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com> On 22 Feb 2014, at 11:46, Robert Wheelock wrote: > There's an empty subblock (U+00860–U+0089F) with 64 empty codepoints where we could put needed additional Hebrew characters in... We're not going to add a rake of pre-composed Hebrew characters though. Michael Everson * http://www.evertype.com/ From jonathan.rosenne at gmail.com Sat Feb 22 15:10:16 2014 From: jonathan.rosenne at gmail.com (Jonathan Rosenne) Date: Sat, 22 Feb 2014 23:10:16 +0200 Subject: Hebrew Extended Block(s) In-Reply-To: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com> References: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com> Message-ID: <006d01cf3012$7d78b140$786a13c0$@gmail.com> May I suggest a correction: We're not going to add a rake of pre-composed characters though.
Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Michael Everson Sent: Saturday, February 22, 2014 10:56 PM To: unicode Unicode Discussion Subject: Re: Hebrew Extended Block(s) On 22 Feb 2014, at 11:46, Robert Wheelock wrote: > There's an empty subblock (U+00860 - U+0089F) with 64 empty codepoints where we could put needed additional Hebrew characters in. We're not going to add a rake of pre-composed Hebrew characters though. Michael Everson * http://www.evertype.com/ _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From verdy_p at wanadoo.fr Sun Feb 23 13:49:24 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 23 Feb 2014 20:49:24 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: OK, I ignored these resets only for simplicity; the question was not about a full set of rules to build a collation, but about a small subset of rules that could be used. It seems surprising that Michael Everson asks the question, when he already knows so much about Unicode algorithms (though perhaps less about the notations used in CLDR data). The CLDR also has several competing notations for specifying collations, so that may be the purpose of his question. I don't think that all notations need an explicit reset at start (it can be implicit for the first element in a chain of relations). 2014-02-14 17:26 GMT+01:00 Markus Scherer : > You need a reset point to say where in the UCA/CLDR universe this rule > chain goes. > http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings > > The default collation puts lowercase first. Normally you reset to a > lowercase character and tailor variations to that, otherwise the few > characters you tailor are inconsistent with the rest of Unicode.
> Implementations like ICU provide parametric settings (no need for rules) to > specify uppercase first. > http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options > > You should only reorder characters that the default order does not already > have where you need them. For example, reset at each base letter, unless > you want to reorder them relative to each other's default order. > http://www.unicode.org/charts/collation/ > > See also http://cldr.unicode.org/index/cldr-spec/collation-guidelines > especially about "Minimal Rules". > > You can try out collation rules and settings at > http://demo.icu-project.org/icu-bin/locexp?_=root&d_=en&x=col > > Best regards, > markus > -- > Google Internationalization Engineering > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Feb 23 15:32:45 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 23 Feb 2014 21:32:45 +0000 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: <20140223213245.26f99657@JRWUBU2> On Sun, 23 Feb 2014 20:49:24 +0100 Philippe Verdy wrote: > It seems surprisng that Michael Everson asks the question, when he > already knows so much about Unicode algorithms (but may be less about > notations used in CLDR data) > > The CLDR also has several competing notations for specifying > collations so that may be the purpose of his question. I have no confidence that his question has been understood. Collation is a monster, and it is unsafe to assume that one understands it. The ICU notation and implementation for an abstract definition of collation turned out to be full of traps, and won't catch up with CLDR definitions until Markus Scherer's raft of collation amendments goes in. (Or have I missed the announcement?) Rigorous definitions have had to address collation elements (i.e. 
sets of weights, one at each level with 0 a special value), which is not as abstract as the ICU notation was meant to be. As an example of the treachery of collation definitions, one might naïvely think that adding &a< References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: 2014-02-23 22:32 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sun, 23 Feb 2014 20:49:24 +0100 > Philippe Verdy wrote: *At least, referring to Version > 24 of the LDML specification, I assume > Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9, > which purports to define the meaning of "&[before 2]..<<". It's > conceivable that I am wrong, and the meaning of "&[before 2]? << ?" is > undefined. > This looks like a cryptic notation anyway. If we assume that there's an implicit reset at start of a collation rule, and that collation does not define any relative order for the empty string, you could simply write this reset at level 2 as: << ? << ? instead of the mysterious notation (and in fact verbose and probably inconsistent in the way the same level 2 is further used with "<<"): &[before 2]? << ? I don't think the "&" is necessary except as a separator between separate rules (where all rules must implicitly start with a reset at some level). The "monster" you describe belongs to the ICU implementation (which is not part of any standard but is now integrated in various products that have abandoned the idea of implementing (unstable and complex) collations themselves). My opinion is that this part of ICU should be detached from it into a completely separate project, to help simplify it, because all the rest of ICU has viable competing implementations (that are also more easily ported to other languages without having to create possibly unsafe binary bindings to native C/C++ code or Java).
It is notable that after so many years, collation is still not implemented in JavaScript, and still does not have a standardized API in the minimum JavaScript/ECMAScript string support (there is an implementation though in Lua, based on internal bindings to the native C/C++ code in its library; there are some attempts to emulate it also in Python; in C#/J# the implementation is performed by binding the native C/C++ code; but it still causes deployment problems for distributed applications that need to deliver code on the client side of web services: only Java works for now, not JavaScript except by using server-side helpers with really _slow_ remote APIs). When performance of applications on the client side is a problem (for client-side applications needing to perform dynamic collations), full collators are not implemented at all, and these applications use a much simpler model (even if they don't work very well with lots of languages). And the existing CLDR data about collation is simply not portable at all outside contexts where ICU can be used. Instead, each application supports its own (more or less limited) model implementing some unspecified part of the CLDR collation data (which is then insufficiently reused and corrected for handling real cases, even for the most frequently needed ones). -------------- next part -------------- An HTML attachment was scrubbed...
URL: From markus.icu at gmail.com Sun Feb 23 17:04:34 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 23 Feb 2014 15:04:34 -0800 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: On Sun, Feb 23, 2014 at 2:13 PM, Philippe Verdy wrote: > 2014-02-23 22:32 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > >> On Sun, 23 Feb 2014 20:49:24 +0100 >> Philippe Verdy wrote: *At least, referring to >> Version 24 of the LDML specification, I assume >> Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9, >> which purports to define the meaning of "&[before 2]..<<". It's >> conceivable that I am wrong, and the meaning of "&[before 2]? << ?" is >> undefined. >> > No, it's well-defined, and I believe that part of the spec is fairly complete since CLDR 24. > This looks like a cryptic notation anyway. If we assume that there's an > implicit reset at start of a collation rule, and that collation does not > define any relative order for the empty string, you could simply write this > reset at level 2 as: > << ? << ? > It might have made sense 15 years ago to permit relations without an initial reset, because at the time the rules were applied on a blank slate. Ever since ICU/CLDR collation rules were redefined to apply on top of DUCET (and later on top of the CLDR root collation), you really need to reset to something for the result to make sense. CLDR 24 forbids rules without initial reset, and ICU 53 will follow suit. > instead of the mysterious notation (and in fact verbose and probably > inconsistent in the way the same level 2 is further used with "<<"): > &[before 2]? << ? > It is true that the "2" and the strength of the operator are redundant, but the notation is now well-defined.
I don't know your criteria for "mysterious" :-) It does help to know the root collation mappings, or at least how they are generally constructed; for example, that ? maps to two collation elements. > I don't think the "&" is necessary except as a separator between separate > rules (where all rules must implicitly start with a reset at some level). > See above. > The "monster" you describe belongs to the ICU implementation (which is not part > of any standard but is now integrated in various products that have abandoned > the idea of implementing (unstable and complex) collations themselves. > I think Richard refers to the "monster" because it is very, very tricky to get one's head around the interaction of all of the pieces of the UCA, Unicode normalization, and the CLDR additions. At least when it comes to the heads of Richard, Mark, Ken, and my own... Also, the implementation of UCA is easy if you don't care about data size or speed of string comparisons. Once you care about size and speed and want additional functionality (like in ICU), it's a major chunk of code. In the case of ICU, that code had accreted functionality and changed with changing specs and had gotten buggy and hard to maintain, so I am in the process of reimplementing it, with hopes of getting it into ICU 53 in March. The code and data actually got smaller, but it's still large. > My opinion is that this part of ICU should be detached from it into a > completely separate project, to help simplify it, > It's complex for reasons stated above, and it benefits from many lower-level parts of ICU (Unicode properties, normalization, data loading, data structures, ...). > It is notable that after so many years, collation is still not > implemented in JavaScript, and still does not have a standardized API in > the minimum JavaScript/ECMAScript string support > Collation was added to the ECMAScript standard in 2012, with several browsers implementing it. PyICU makes it available in Python.
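For readers who want to experiment without installing PyICU: the Python standard library exposes the platform's collation through locale.strxfrm, which follows the same precompute-a-sort-key pattern Markus describes, though it is not UCA/CLDR collation. A sketch pinned to the portable (if uninteresting) "C" locale:

```python
import locale

# strxfrm transforms a string into a sort key for the current LC_COLLATE
# locale; comparing the transformed strings is equivalent to collating
# the originals. The "C" locale is guaranteed to exist everywhere and
# simply collates by code point, so uppercase sorts before lowercase.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["banana", "Apple", "cherry"]
assert sorted(words, key=locale.strxfrm) == ["Apple", "banana", "cherry"]
```

Switching LC_COLLATE to a real locale (where the platform provides one) gives linguistically sensible ordering from the same two lines of sorting code; for full UCA/CLDR tailoring, a binding such as PyICU is still needed.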
If someone wanted to port code to JavaScript or Python, and wanted it to be fast, the new (upcoming) ICU Java code might be a reasonable start. > When performance of applications on the client side is a problem (for > client-side applications needing to perform dynamic collations), full > collators are not implemented at all, and these applications use a much > simpler model (even if they don't work very well with lots of languages). > Right. If the client code need not collate newly typed strings, then one good technique is to have the server send the corresponding sort keys. By the way, ICU makes a strong effort to write very short sort keys. Best regards, markus -- Google Internationalization Engineering -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun Feb 23 17:26:01 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 24 Feb 2014 00:26:01 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: 2014-02-24 0:04 GMT+01:00 Markus Scherer : > Right. If the client code need not collate newly typed strings, then one > good technique is to have the server send the corresponding sort keys. By > the way, ICU makes a strong effort to write very short sort keys. > The size of sort keys does not really matter here. They are of the same order of magnitude as the texts to sort. The problem in dynamic applications is that the client may need to send lots of text to the server to get back lots of keys. The client may want to cache them, but then the application will be bound to the performance of the network and the server load, for round-trip response times.
In most application web sites, this is simply not an option: users will complain about the slow response time of the application for something that seems obvious to them, such as sorting columns in a long data report mixed with user data input (without having to download it again for the same data presented differently, and without losing current user input). In some cases, it is also not an option to send this data to the server, because it is private to the user, and the user wants that data to be stored elsewhere securely (including on a server that performs nothing but storage and cannot offer the collator service). Other applications needing performant collators are text editors (for client-side search-and-replace while editing; possibly even with support for regexps and collator-based text transforms...). -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Feb 23 23:29:47 2014 From: everson at evertype.com (Michael Everson) Date: Sun, 23 Feb 2014 21:29:47 -0800 Subject: Sorting notation In-Reply-To: <20140223213245.26f99657@JRWUBU2> References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: On 23 Feb 2014, at 13:32, Richard Wordingham wrote: > On Sun, 23 Feb 2014 20:49:24 +0100 > Philippe Verdy wrote: > >> It seems surprising that Michael Everson asks the question, when he >> already knows so much about Unicode algorithms (but may be less about >> notations used in CLDR data) Do me a favour, Mr Verdy. Don't think about me. Thanks.
Michael Everson * http://www.evertype.com/ From verdy_p at wanadoo.fr Mon Feb 24 02:36:33 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 24 Feb 2014 09:36:33 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: 2014-02-24 6:29 GMT+01:00 Michael Everson : > > On Sun, 23 Feb 2014 20:49:24 +0100 Philippe Verdy > wrote: > > > >> It seems surprising that Michael Everson asks the question, when he > >> already knows so much about Unicode algorithms (but may be less about > >> notations used in CLDR data) > > Do me a favour, Mr Verdy. Don't think about me. Thanks. Why? Didn't *you* ask the question to the list? If you don't like the replies, that's possibly because you did not ask the right question, or did not say what you need confirmed. Or maybe because you want to get opinions from others on something that is highly subject to variation and not really a widely adopted standard (the UCA algorithm is standard, not the notations for tailorings; even the CLDR data has changed several times). -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Feb 24 05:00:47 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 24 Feb 2014 11:00:47 +0000 (GMT) Subject: Websites in Hindi Message-ID: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com> An interesting thread about Websites in Hindi is on the Serif forum. https://community.serif.com/forum/webplus/8615/websites-in-hindi I know that there can be issues over the correct rendering of some Indian languages, though I do not know if that applies to Hindi specifically. It is possible that browsers and Adobe Reader resolve those issues, but I do not know. Could someone here say something about this please?
William Overington

24 February 2014

From richard.wordingham at ntlworld.com  Mon Feb 24 13:38:21 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 24 Feb 2014 19:38:21 +0000
Subject: Sorting notation
In-Reply-To: 
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2>
Message-ID: <20140224193821.23fa0cee@JRWUBU2>

On Sun, 23 Feb 2014 23:13:53 +0100
Philippe Verdy wrote:
> 2014-02-23 22:32 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > On Sun, 23 Feb 2014 20:49:24 +0100
> > Philippe Verdy wrote: *At least, referring to
> > Version 24 of the LDML specification, I assume
> > Part 5 Section 3.5, which defines "&..<<", also applies to Section
> > 3.9, which purports to define the meaning of "&[before 2]..<<".
> > It's conceivable that I am wrong, and the meaning of "&[before 2]?
> > << ?" is undefined.
> This looks like a cryptic notation anyway. If we assume that there's
> an implicit reset at start of a collation rule, and that collation
> does not define any relative order for the empty string, you could
> simply write this reset at level 2 as:
> << ? << ?
> instead of the mysterious notation (and in fact verbose and probably
> inconsistent in the way the same level 2 is further used with "<<"):
> &[before 2]? << ?

My understanding of the meaning of the notation is that:

1) ? is to have the same number and type of collation elements as ?
currently has;
2) The last collation element of ? that has a positive weight at level
2 is to be immediately before the corresponding collation element of
? at the secondary level;
3) No collation element is to be ordered between these two collation
elements; and
4) Their other collation elements are to be the same.

Thus, before the operation we have a? << ? << ? << ?. After it, we have
a? << ? << ? << ?. Is this really what your notation "<< ? << ?" is
intended to mean? If we are looking for a brief notation, I think "&? >> ?"
would be better.

Richard.

From verdy_p at wanadoo.fr  Tue Feb 25 14:02:47 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 25 Feb 2014 21:02:47 +0100
Subject: Sorting notation
In-Reply-To: <20140224193821.23fa0cee@JRWUBU2>
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2>
Message-ID: 

2014-02-24 20:38 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> My understanding of the meaning of the notation is that:
>
> 1) ? is to have the same number and type of collation elements as ?
> currently has;
> 2) The last collation element of ? that has a positive weight at level
> 2 is to be immediately before the corresponding collation element of
> ? at the secondary level;
> 3) No collation element is to be ordered between these two collation
> elements; and
> 4) Their other collation elements are to be the same.

I disagree with your point (1).

* The number of levels does not matter; the notation just indicates that the
relation does not specify any starting weight for levels lower than the one
indicated by the reset.
* And the effective number of collation elements does not matter: we should
assume that if one of the items has not enough collation elements, there's a
zero weight for each missing level. In practice this only affects the first
element, except in case of contractions.

I disagree as well on point (2). The starting element (at the reset) may have
a null weight at that level, so that we can still order other elements with
the same null weight at that level, notably if they have non-null weights for
higher levels.

I agree on your point (3) EXCEPT when the first item of a pair is a "reset"
(i.e. an empty string).

The point (4) is completely wrong. The other collation elements in the first
pair may be arbitrary (also possibly with distinct weights, but at higher
levels) !!!
In fact the two items listed after the reset do not matter at all. All that
is important is the item for the reset itself, and the first non-empty item
ordered after it.

That's why I think that "&[before2] xxx" makes sense (even alone) and is in
fact the same as "& << xxx" or even just "<< xxx" if you consider that every
rule starts with an implicit reset in order to create a valid pair (in the
first pair, the 1st item of the pair is the reset itself, i.e. an empty
string; the second item is the first non-empty string indicated after it; and
the pair itself has a numeric property specifying its level, here 2).

The form "&a < b < c < d ..." is a compressed form of these rules: "

From markus.icu at gmail.com  Tue Feb 25 15:29:53 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 25 Feb 2014 13:29:53 -0800
Subject: Sorting notation
In-Reply-To: 
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2>
Message-ID: 

On Tue, Feb 25, 2014 at 12:02 PM, Philippe Verdy wrote:

> 2014-02-24 20:38 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>
>> My understanding of the meaning of the notation is that:
>>
>> 1) ? is to have the same number and type of collation elements as ?
>> currently has;
>> 2) The last collation element of ? that has a positive weight at level
>> 2 is to be immediately before the corresponding collation element of
>> ? at the secondary level;
>> 3) No collation element is to be ordered between these two collation
>> elements; and
>> 4) Their other collation elements are to be the same.
>
> I disagree with your point (1).

Philippe, Richard is correct about what the specific example of

&[before 2]? << ?

should yield according to
http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Tailorings

Your opinions are not based on the LDML collation tailoring spec, but you
make it sound like they are.
I suggest the two of you agree on which spec to discuss, or you clarify that what you are doing is comparing the LDML spec with some other spec (I don't know which one that is). markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Feb 25 15:36:24 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 25 Feb 2014 22:36:24 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> Message-ID: I did not cite LDML, because it is far from being a stable standard for the question of collation (I endorse the term "monster" used by someone else), being adopted (and modified) mostly to document what ICU does (or does not know how to do better). As such this spec is still in a very alpha stage, and subject to various experimentations. 2014-02-25 22:29 GMT+01:00 Markus Scherer : > On Tue, Feb 25, 2014 at 12:02 PM, Philippe Verdy wrote: > >> 2014-02-24 20:38 GMT+01:00 Richard Wordingham < >> richard.wordingham at ntlworld.com>: >> >> My understanding of the meaning of the notation is that: >>> >>> 1) ? is to have the same number and type of collation elements as ? >>> currently has; >>> 2) The last collation element of ? that has a positive weight at level >>> 2 is to be immediately before the corresponding collation element of >>> ? at the secondary level; >>> 3) No collation element is to be ordered between these two collation >>> elements; and >>> 4) Their other collation elements are to be the same. >>> >> >> I disagree with point your point (1). >> > > Philippe, Richard is correct with what the specific example of > &[before 2]? << ? > > should yield according to > > http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Tailorings > > Your opinions are not based on the LDML collation tailoring spec, but you > make it sound like they are. 
> > I suggest the two of you agree on which spec to discuss, or you clarify > that what you are doing is comparing the LDML spec with some other spec (I > don't know which one that is). > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue Feb 25 18:08:27 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 26 Feb 2014 00:08:27 +0000 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> Message-ID: <20140226000827.6e189530@JRWUBU2> On Tue, 25 Feb 2014 21:02:47 +0100 Philippe Verdy wrote: > 2014-02-24 20:38 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: The immediately following text of mine is entirely concerned with the interpretation of the LDML specification "&[before 2]? << ?". > > My understanding of the meaning of the notation is that: > > > > 1) ? is to have the same number and type of collation elements as ? > > currently has; > > 2) The last collation element of ? that has a positive weight at > > level 2 is to be immediately before the corresponding collation > > element of ? at the secondary level; > > 3) No collation element is to be ordered between these two collation > > elements; and > > 4) Their other collation elements are to be the same. The terms collation element and weight as I use them are intended to be used as in the Unicode Collation Algorithm. It is conceivable that I have missed some subtlety in the difference between the extended weights of DUCET and the fractional weights preferred for the expression of the CLDR default collation. > I disagree with point your point (1). > * The number of levels does not matter, the notation just indicates > that the relation does not specify any starting weight for levels > lower than the one indicated by the reset. 
It does seem that what happens below the level of the reset is irrelevant. I
couldn't construct a counter-example to show that it can matter. I'd still
recommend copying at the lower levels just in case there is a subtle effect.

> * And the effective number of collation elements does not matter: we
> should assume that if one of the items has not enough collation
> elements, there's a zero weight for each missing
> level. In practice this only affects the first element, except in
> case of contractions.

This makes no sense to me. The collation elements for ? before the
application of the rule do not matter. The requirements I gave on the
collation elements of ? are for its collation elements *immediately after*
the rule has been applied. This incomprehension also applies to your comments
on points (2) to (4).

> I disagree as well on point (2). The starting element (at the reset)
> may have a null weight at that level, so that we can still order
> other elements with the same null weight at that level, notably if
> they have non-null weights for higher levels.
> I agree on your point (3) EXCEPT when the first item of a pair is a
> "reset" (i.e. an empty string).
>
> The point (4) is completely wrong. The other collation elements in
> the first pair may be arbitrary (also possibly with distinct weights,
> but at higher levels) !!!

The specification "&[before 2]? << ?" has to be invalid if ? has no non-zero
secondary weights. The LDML specification doesn't mention this input error.
This has nothing to do with the LDML notation. As far as I can tell, you are interpreting "<< xxx" to assign xxx a collating element with zero primary weight. > The form "&a < b < c < d ..." is a compressed form of these rules: > " level). The "reset" is then automatically the first item of each pair. > So my own syntax never needs any explicit reset, it just order > collection elements with simple rules, in which I can also add > optional statistics (used only for the generation of collation keys, > but not needed at all for comparing two strings). No. This is similar to the fallacy that a collation is defined by the relative ordering (and degree of difference) of the collating elements. Are you relying on deferred binding? And please try not to use 'collation element' (a sequence of weights, one per level) when you mean 'collating element' (either a string of characters or the ordered pair of a string and its corresponding sequence of collation elements). > And I still don't handle some of the preprocessing needed > for some Indic scripts (includng Thai),... Are you aware that Thai can be handled by contractions? Compared with how it might have been, Thai collation is extremely computer friendly. Richard. From verdy_p at wanadoo.fr Tue Feb 25 22:34:43 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 26 Feb 2014 05:34:43 +0100 Subject: Sorting notation In-Reply-To: <20140226000827.6e189530@JRWUBU2> References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> <20140226000827.6e189530@JRWUBU2> Message-ID: 2014-02-26 1:08 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > > And I still don't handle some of the preprocessing needed > > for some Indic scripts (includng Thai),... > > Are you aware that Thai can be handled by contractions? Compared with > how it might have been, Thai collation is extremely computer friendly. 
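[As an aside for readers of the archive: the contraction-based handling of Thai that is being discussed can be approximated by a preprocessing step. This is a purely illustrative sketch, not how ICU implements it (ICU uses contractions inside the tailoring itself), and it stands in for primary weights with raw code-point order, which happens to match for the modern Thai consonants and vowels used here.]

```python
# Toy sketch of Thai collation preprocessing (illustrative only).
# The preposed vowels U+0E40..U+0E44 are written *before* the consonant
# they logically follow, so swap each one with the next character before
# comparing; code-point order then approximates the primary weights.
PREPOSED = {chr(c) for c in range(0x0E40, 0x0E45)}  # เ แ โ ใ ไ

def thai_sort_key(s: str) -> str:
    chars = list(s)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

words = ["แมว", "ไก่", "กา"]  # cat, chicken, crow
print(sorted(words, key=thai_sort_key))  # ['กา', 'ไก่', 'แมว']
```

[A real tailoring would additionally give the tone marks secondary weights; this toy key just sorts them by code point.]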
I did not write anything here about contraction but only about preprocessing.
The "computer friendly" feature of Thai is basically for its rendering (not
part of this topic); I'm not sure this is really true when discussing
collations. Though as I said, I've not invested time to test it in real
cases. Only basic tests were performed (using some of the test cases listed
in CLDR data or in ICU, only for comparison of results).

Also I absolutely don't care about composite weights or fractional weights
used in ICU. For me these are implementation tricks and are irrelevant to how
a collator may work; they are one possible solution which in fact just
complicates the expression of the problems to solve. These are the kind of
things that are (IMHO) overspecified only for documenting how ICU works.
(Note: I don't oppose ICU; but ICU is not universal and cannot be used
universally, even if it is integrated in more projects today.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tomasek at etf.cuni.cz  Wed Feb 26 03:47:37 2014
From: tomasek at etf.cuni.cz (Petr Tomasek)
Date: Wed, 26 Feb 2014 10:47:37 +0100
Subject: Hebrew Extended Block(s)
In-Reply-To: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com>
References: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com>
Message-ID: <20140226094737.GA20110@ebed.etf.cuni.cz>

On Sat, Feb 22, 2014 at 12:55:33PM -0800, Michael Everson wrote:
> On 22 Feb 2014, at 11:46, Robert Wheelock wrote:
>
> > There's an empty subblock (U+0860 - U+089F) with 64 empty codepoints
> > where we could put needed additional Hebrew characters in?
>
> We're not going to add a rake of pre-composed Hebrew characters though.

What about the Babylonian and Palestinian punctuation? Anything new
currently? Thanks!
Petr Tomasek

From qsjn4ukr at gmail.com  Wed Feb 26 07:30:05 2014
From: qsjn4ukr at gmail.com (QSJN 4 UKR)
Date: Wed, 26 Feb 2014 15:30:05 +0200
Subject: Old Cyrillic Yest
In-Reply-To: 
References: <20121112092156.665a7a7059d7ee80bb4d670165c8327d.cea44632cc.wbe@email03.secureserver.net> <6725ADA5AC2341D9B1AEF2F398F81BDC@DougEwell> <7B5C469C-1EC7-4DCE-A1AC-2F22E7C69230@evertype.com>
Message-ID: 

2012/11/12 QSJN 4 UKR wrote:
> Old Cyrillic letter YEST (Є) has two variants: broad (also called
> Yakornoye Yest) and narrow. They are preserved in modern Ukrainian script
> (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD
> YEST and the modern, rectangle form of U+0415/0435 IE for the NARROW
> YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic
> YEST, but it is unclear how to distinguish the BROAD YEST and the
> NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and
> U+0415/0435 for the modern rectangle IE, some old-style fonts use only
> the old YEST but with codepoint U+0415/0435 and do not use U+0404/0454
> at all, some use U+0404/0454 for the BROAD YEST and U+0415/0435 for
> the NARROW YEST...

2012/11/23 Doug Ewell
> How many truly different letters, old and new, are we talking about? On
> November 12 you wrote, "UKRAINIAN IE and BROAD YEST is the same letter in
> fact." It would not make sense to assign a new BROAD YEST letter if it is
> really the same as UKRAINIAN IE, and if existing texts already use
> UKRAINIAN IE to represent it.
Full picture (Meaning - Glyph - Codepoint):

Old Church Slavonic:
Narrow Yest (regular form) - very narrow halfmoon - 0404/0454 (ambiguous) and
0415/0435 (probably the wrong glyph will be rendered) (there are no certain
codepoints)
Broad Yest (special form: initial, plural disambiguator) - broad halfmoon,
identical to Ukrainian Ie or maybe somewhat greater (breaking the baseline) -
0404/0454 indeed

Modern imitation of Church Slavonic, or really old texts, or texts where it
is hard to distinguish Broad and Narrow Yest:
Ambiguous Yest - identical to Ukrainian Ie or maybe like Narrow Yest (in an
old-style font) - 0404/0454 sure

Modern languages:
Ie - rectangle capital / closed rounded small (identical to Latin) - 0415/0435
Ukrainian Ie - identical to ambiguous Yest - 0404/0454

So there are two steps.

First. Required. A separate codepoint for Narrow Yest. It is just impossible
to work with Church Slavonic texts without it. Because: the wrong glyph is
rendered almost always (you must understand, we can't rely on language
detection, because the text certainly contains a mix of old text with modern
translation) - or - there is no way to show Broad Yest at all.

Second. Optional. A separate codepoint for Broad Yest. That's only necessary
if one part of a text contains the ambiguous Yests (coded as now, 0404/0454,
without changes!) but another part contains the Broad Yests and the author
can/wants to show this feature.

Am I the only person in the world who thinks that Unicode is poorly adapted
for Church Slavonic?

From samjnaa at gmail.com  Thu Feb 27 04:32:49 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Thu, 27 Feb 2014 16:02:49 +0530
Subject: ?MP = Multi*lingual* plane?
Message-ID: 

Given that Unicode encodes scripts and not languages, how appropriate is it
to call the BMP and the SMP the multi*lingual* planes?

-- 
Shriramana Sharma ???????????? ????????????

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From everson at evertype.com  Thu Feb 27 09:23:53 2014
From: everson at evertype.com (Michael Everson)
Date: Thu, 27 Feb 2014 07:23:53 -0800
Subject: ?MP = Multi*lingual* plane?
In-Reply-To: 
References: 
Message-ID: 

On 27 Feb 2014, at 02:32, Shriramana Sharma wrote:

> Given that Unicode encodes scripts and not languages, how appropriate is it
> to call the BMP and the SMP as the multi*lingual* planes?

You are more than two decades late in asking this. It may have seemed more
appropriate in an 8-bit code page world where rather small subsets limited
the number of languages accessible by one or another part of ISO/IEC 8859. A
new term like 'multiscriptal' would not have been appropriate. File this
under "We know the term 'ideograph' is a misnomer."

Michael Everson * http://www.evertype.com/

From asmusf at ix.netcom.com  Thu Feb 27 09:30:59 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 27 Feb 2014 07:30:59 -0800
Subject: ?MP = Multi*lingual* plane?
In-Reply-To: 
References: 
Message-ID: <530F5A33.1010005@ix.netcom.com>

On 2/27/2014 2:32 AM, Shriramana Sharma wrote:
> Given that Unicode encodes scripts and not languages, how appropriate
> is it to call the BMP and the SMP as the multi*lingual* planes?

Isn't it lovely how these things work?

A./

From richard.wordingham at ntlworld.com  Thu Feb 27 16:00:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 27 Feb 2014 22:00:09 +0000
Subject: Sorting notation
In-Reply-To: 
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> <20140226000827.6e189530@JRWUBU2>
Message-ID: <20140227220009.69655291@JRWUBU2>

On Wed, 26 Feb 2014 05:34:43 +0100
Philippe Verdy wrote:
> 2014-02-26 1:08 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > Compared
> > with how it might have been, Thai collation is extremely computer
> > friendly.
> The "computer friendly" feature of Thai is basically for its
> rendering (not part of this topic), I'm not sure this is really true
> when discussing about collations.

You just swap the preposed vowels with the immediately following consonant
(which can be done by a contraction), and then it's a straightforward sort of
a system having characters with a secondary weight. You don't need to know
anything more about the structure of Thai words.

However, the first Thai-Thai dictionary had a very different collation order -
see http://www.sealang.net/dictionary/bradley/theraphan1991lexicography.htm .
I think that order needs a very large collation element table. It may well be
beyond the capability of the UCA - the description I cited barely hints at
the problems.

Richard.

From adam at nohejl.name  Fri Feb 28 12:56:43 2014
From: adam at nohejl.name (Adam Nohejl)
Date: Fri, 28 Feb 2014 19:56:43 +0100
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi
Message-ID: 

Hello, I am comparing radical data for CJK characters from different sources,
including the Unihan database. According to the Unihan documentation* the
kRSUnicode radical should correspond to the kRSKangXi radical, which in turn
should be based on the Kang Xi dictionary. Is there any explanation for the
following discrepancies? Did I miss any other rules or reasoning behind the
content of these two fields?

Examples of the discrepancies:

(1) A very common character for "most, maximum".
U+6700 kRSKangXi 73.8
U+6700 kRSUnicode 13.10

(2) A funny character for autumn containing the turtle component.
U+9F9D kRSKangXi 115.16
U+9F9D kRSKanWa 115.16
U+9F9D kRSUnicode 213.5

There are also characters that actually are not included in the Kang Xi
dictionary**, but the Unihan data contain both a purported Kang Xi radical
and, in addition to that, a _different_ Unicode radical.
(3) The simplified turtle character (commonly assigned to the traditional
radical #213):
U+4E80 kRSKangXi 213.0
U+4E80 kRSUnicode 5.10

(4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary
decision, but unexpectedly the fields differ:
U+66FB kRSKangXi 72.7
U+66FB kRSUnicode 73.7

- - -

[*] : "Property: kRSUnicode // Description: (...) The first value is
intended to reflect the same radical as the kRSKangXi field and the stroke
count of the glyph used to print the character within the Unicode Standard."

[**] The two characters are missing from the '89 edition of Kang Xi (which
should be the same as used for Unihan) according to search on this site: 

-- 
Adam Nohejl
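[Archive note: the field comparison described in the message above can be made concrete with a short sketch. The parsing follows the documented "radical.residual-strokes" syntax of the kRSKangXi/kRSUnicode fields, where an apostrophe after the radical number marks a simplified radical; the sample values are the ones quoted in the message, hard-coded rather than read from the real Unihan files.]

```python
# Minimal sketch: flag Unihan entries where kRSKangXi and kRSUnicode
# disagree on the radical.  Field values use "radical.residual-strokes"
# syntax; an apostrophe after the radical number marks a simplified radical.
def parse_rs(value: str) -> tuple[int, bool, int]:
    radical, strokes = value.split(".")
    simplified = radical.endswith("'")
    return int(radical.rstrip("'")), simplified, int(strokes)

# The discrepancies quoted in the message above.
unihan = {
    0x6700: {"kRSKangXi": "73.8", "kRSUnicode": "13.10"},
    0x9F9D: {"kRSKangXi": "115.16", "kRSUnicode": "213.5"},
    0x4E80: {"kRSKangXi": "213.0", "kRSUnicode": "5.10"},
    0x66FB: {"kRSKangXi": "72.7", "kRSUnicode": "73.7"},
}

for cp, fields in sorted(unihan.items()):
    kx_rad, _, _ = parse_rs(fields["kRSKangXi"])
    uni_rad, _, _ = parse_rs(fields["kRSUnicode"])
    if kx_rad != uni_rad:
        print(f"U+{cp:04X}: radical {kx_rad} (KangXi) != {uni_rad} (Unicode)")
```

[All four characters above are reported, since each pair of fields names a different radical.]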