From richard.wordingham at ntlworld.com Thu May 1 17:44:24 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 1 May 2014 23:44:24 +0100 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <5359AA2D.4030306@ix.netcom.com> References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> Message-ID: <20140501234424.6357ce12@JRWUBU2> On Thu, 24 Apr 2014 17:19:57 -0700 Asmus Freytag wrote: > On this side show, Philippe finally is correct, because I received > his message without ASCII-i-fication; he cc'd me directly, and I > never saw the mangled text. It's a bit embarrassing for a Unicode mail > list to not even be able to let guillemets through unmolested. Are you sure it's the mail list that did the mangling? As I got the post, it had two parts, plain-text in ISO-8859-1, with '<<' and '>>' substituted for the guillemets '«' and '»', and HTML, also in ISO-8859-1, with character entities « and ». I suspect Philippe's e-mail client may be at fault. Richard. From asmusf at ix.netcom.com Thu May 1 20:58:13 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 01 May 2014 18:58:13 -0700 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140501234424.6357ce12@JRWUBU2> References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net><5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> Message-ID: <5362FBB5.5090303@ix.netcom.com> This has seen off-line discussion with the mail manager and we're good. A./ On 5/1/2014 3:44 PM, Richard Wordingham wrote: > On Thu, 24 Apr 2014 17:19:57 -0700 > Asmus Freytag wrote: > >> On this side show, Philippe finally is correct, because I received >> his message without ASCII-i-fication; he cc'd me directly, and I >> never saw the mangled text. It's a bit embarrassing for a Unicode mail >> list to not even be able to let guillemets through unmolested. 
> Are you sure it's the mail list that did the mangling? As I got the > post, it had two parts, plain-text in ISO-8859-1, with '<<' and '>>' > substituted for the guillemets '«' and '»', and HTML, also in > ISO-8859-1, with character entities « and ». I suspect > Philippe's e-mail client may be at fault. > > Richard. > From mark at macchiato.com Fri May 2 03:14:18 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 2 May 2014 10:14:18 +0200 Subject: Adding number system In-Reply-To: References: Message-ID: +unicode@ On 2 May 2014 09:51, Aleksandr Andreev wrote: > Dear list members, > > Question: how does one add a new numbering system to the CLDR? Should > I file a ticket? > Yes, and please provide supporting information about usage and identification (see below). > > I also noticed here: > > http://cldr.unicode.org/translation/numbering-systems > > that numbering systems may be "algorithmic". Are the algorithms > themselves described in the CLDR? > No, only enough information is provided by CLDR so as to uniquely identify the numbering system, not to provide a complete specification of the exact behavior. (The same is true of calendar systems.) > > Cordially, > > Aleksandr > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri May 2 09:57:36 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 2 May 2014 16:57:36 +0200 Subject: Unclear text in the UBA (UAX#9) of Unicode 6.3 In-Reply-To: <20140501234424.6357ce12@JRWUBU2> References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> Message-ID: The email was sent from Gmail on its webmail, French edition. 
Maybe Gmail is causing this; it is not expected, and I don't know why Gmail transforms the text to ISO 8859-1, breaking the text without notice (it could have used windows-1252, which has completely superseded ISO 8859-1 along with HTML5). But the HTML part was intact, and that's the HTML part that I see (I almost never look at the generated plain text part, which always has caveats if it is not sent with UTF-8). In my opinion it's a bad choice by Gmail to replace guillemets with ugly pairs of ASCII symbols (not even used in French contexts), which also confuse the conventional notation of citations; in fact, if it really wanted to use ISO 8859-1, it should have used the "ASCII double quotes". Is « 20 °C » OK with the degree symbol? So the guillemets should be OK too, and I don't think this makes emails less readable for the recipients reading only the plain text part in their old email agents. I just hope that Gmail does not mess things up worse by using UTF-8 and still making these ugly substitutions for characters that have been widely supported for about 30 years on so many systems. And I wonder then why Gmail offers immediate support for HTML composition, and a tool to remove the non-plain-text formatting that still preserves the UTF-8 encoding, if it then modifies what we write. Gmail would do better to send UTF-8 plain text parts instead of replacing any character that does not fit in its default legacy 8-bit encoding, notably because it does not offer any option to select the encoding that will be used (either in HTML or in the plain-text version). If you just read the plain text part, you know that it is lossy, so you can get random replacements for many characters in lots of scripts (even non-Latin ones), and symbols, if it's not sent with UTF-8. So this is not embarrassing for the Unicode mailing list; it is embarrassing for Google, which leaves its users unaware of what it will perform silently. 
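Philippe's point about the two MIME transfer encodings is easy to check with Python's standard library. The sketch below uses my own sample string, not anything Gmail actually runs: both guillemets are single bytes in ISO 8859-1, and a quoted-printable transfer encoding carries them as =AB and =BB instead of substituting ASCII pairs.

```python
import quopri

# Hypothetical sample string; « », and ° are each single bytes in ISO 8859-1.
text = 'Est-ce que « 20 °C » passe ?'
latin1 = text.encode('iso-8859-1')   # « = 0xAB, » = 0xBB, ° = 0xB0

# Quoted-printable escapes every byte outside printable ASCII, so nothing
# is lost: the guillemets come out as =AB and =BB.
qp = quopri.encodestring(latin1)
print(qp)

# Decoding restores the original bytes exactly.
assert quopri.decodestring(qp) == latin1
```

RFC 2045 quoted-printable is lossless by construction; the lossy '<<'/'>>' substitution being discussed happens, if at all, before any transfer encoding is applied.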
And sorry, I no longer use any standalone mail agent; I prefer using web storage without losing all my emails when I use another device or install a new OS. 2014-05-02 0:44 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Thu, 24 Apr 2014 17:19:57 -0700 > Asmus Freytag wrote: > > > On this side show, Philippe finally is correct, because I received > > his message without ASCII-i-fication; he cc'd me directly, and I > > never saw the mangled text. It's a bit embarrassing for a Unicode mail > > list to not even be able to let guillemets through unmolested. > > Are you sure it's the mail list that did the mangling? As I got the > post, it had two parts, plain-text in ISO-8859-1, with '<<' and '>>' > substituted for the guillemets '«' and '»', and HTML, also in > ISO-8859-1, with character entities « and ». I suspect > Philippe's e-mail client may be at fault. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Fri May 2 10:40:41 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 2 May 2014 16:40:41 +0100 Subject: Guillements in Email In-Reply-To: References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> Message-ID: <20140502164041.64b1c375@JRWUBU2> On Fri, 2 May 2014 16:57:36 +0200 Philippe Verdy wrote: > The email was sent from Gmail on its webmail, French edition. > Maybe Gmail is causing this; it is not expected and I don't know > why Gmail transforms the text to ISO 8859-1, breaking the > text without notice (it could have used windows-1252, which has > completely superseded ISO 8859-1 along with HTML5). The really weird thing is that the guillemets are ISO-8859-1 characters, so should only have been modified as part of the transfer-encoding. > Is « 20 °C » OK with the degree symbol? 
Weirdly, both the guillemets and the degree sign are preserved in the plain text version of the e-mail I'm answering. Have Google just fixed their e-mail client? Richard. From verdy_p at wanadoo.fr Fri May 2 10:57:37 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 2 May 2014 17:57:37 +0200 Subject: Guillements in Email In-Reply-To: <20140502164041.64b1c375@JRWUBU2> References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> <20140502164041.64b1c375@JRWUBU2> Message-ID: Apparently not. There's a difference: Gmail now used quoted-printable, which preserves these guillemets (as =AB and =BB) even though they are still encoded with ISO 8859-1 (without replacement by ASCII pairs of symbols). In the previous message, Gmail thought that quoted-printable was not needed for just these guillemets and replaced them (even though it was clearly not needed: ISO 8859-1 has had them for an extremely long time, since before the Internet was opened and email started spreading; at that time Google did not even exist). The degree sign (also in ISO 8859-1) was enough to trigger quoted-printable. 2014-05-02 17:40 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Fri, 2 May 2014 16:57:36 +0200 > Philippe Verdy wrote: > > > The email was sent from Gmail on its webmail, French edition. > > Maybe Gmail is causing this; it is not expected and I don't know > > why Gmail transforms the text to ISO 8859-1, breaking the > > text without notice (it could have used windows-1252, which has > > completely superseded ISO 8859-1 along with HTML5). > > The really weird thing is that the guillemets are ISO-8859-1 > characters, so should only have been modified as part of the > transfer-encoding. > > > Is « 20 °C » OK with the degree symbol? 
> > Weirdly, both the guillemets and the degree sign are preserved in the > plain text version of the e-mail I'm answering. Have Google just fixed > their e-mail client? > > Richard. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Fri May 2 12:11:26 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 2 May 2014 18:11:26 +0100 Subject: Guillements in Email In-Reply-To: References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> <20140502164041.64b1c375@JRWUBU2> Message-ID: <20140502181126.7e6399f4@JRWUBU2> On Fri, 2 May 2014 17:57:37 +0200 Philippe Verdy wrote: > Apparently not. There's a difference: Gmail now used > quoted-printable, which preserves these guillemets (as =AB and =BB) > even though they are still encoded with ISO 8859-1 (without replacement by > ASCII pairs of symbols). Using quoted-printable is usual for sending ISO-8859-1 through 7-bit channels. Richard. From verdy_p at wanadoo.fr Fri May 2 12:41:49 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 2 May 2014 19:41:49 +0200 Subject: Guillements in Email In-Reply-To: <20140502181126.7e6399f4@JRWUBU2> References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> <20140502164041.64b1c375@JRWUBU2> <20140502181126.7e6399f4@JRWUBU2> Message-ID: Yes, I know. But in the first message it was not used. 
I do not criticize the fact of using quoted-printable, but the fact of NOT using it to preserve characters, based on an arbitrary selection of characters that Google considers worth preserving only when they appear in combination with other characters. If people want to send guillemets, there's no reason this should depend on the presence of other characters (well, they are sent, but only in the HTML part, not in the plain text part). I consider this a bug from Google (probably in old tricky code, unreviewed for a very long time). 2014-05-02 19:11 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Fri, 2 May 2014 17:57:37 +0200 > Philippe Verdy wrote: > > > Apparently not. There's a difference: Gmail now used > > quoted-printable, which preserves these guillemets (as =AB and =BB) > > even though they are still encoded with ISO 8859-1 (without replacement by > > ASCII pairs of symbols). > > Using quoted-printable is usual for sending ISO-8859-1 through 7-bit > channels. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Fri May 2 13:01:31 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 2 May 2014 11:01:31 -0700 Subject: Guillements in Email In-Reply-To: References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> <20140502164041.64b1c375@JRWUBU2> <20140502181126.7e6399f4@JRWUBU2> Message-ID: If there is a Gmail bug, then please report it. Either way, I suggest you go into Gmail Settings and set it to "Use Unicode (UTF-8) encoding for outgoing messages" markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Fri May 2 13:08:15 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 2 May 2014 20:08:15 +0200 Subject: Guillements in Email In-Reply-To: References: <20140424124142.665a7a7059d7ee80bb4d670165c8327d.754867c073.wbe@email03.secureserver.net> <5359AA2D.4030306@ix.netcom.com> <20140501234424.6357ce12@JRWUBU2> <20140502164041.64b1c375@JRWUBU2> <20140502181126.7e6399f4@JRWUBU2> Message-ID: I don't know when this option was introduced; anyway, the French translation of this option is confusing/incoherent, and if I have used it in the past, I suspect the translation was reversed at some point. Thanks for pointing it out in the Gmail settings. The bug remains, though. 2014-05-02 20:01 GMT+02:00 Markus Scherer : > If there is a Gmail bug, then please report it. > > Either way, I suggest you go into Gmail Settings and set it to "Use > Unicode (UTF-8) encoding for outgoing messages" > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdaoden at yandex.com Fri May 2 15:33:50 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Fri, 02 May 2014 22:33:50 +0200 Subject: Guillements in Email Message-ID: <20140502213350.7wfiF8A2bfo9BDM+86bXHSbW@dietcurd.local> Sorry for not replying in the thread, and jumping in, in general; i'm currently “jolly well fed up” with dealing with mail, but... Philippe Verdy wrote: |I do not criticize the fact of using quoted-printable; but the |fact of NOT using it to preserve characters; based on an |arbitrary selection of characters It is also conceivable that Google -- definitely capable of ESMTP (RFC 1869) -- only falls back to QP or Base64 if the message would otherwise not conform to the standard. I.e., i'm thinking of line length issues here, which is not unlikely given that today everybody composes in ...-based textboxes, and 1000 bytes are reached pretty soon. 
--steffen From nospam-abuse at ilyaz.org Tue May 6 16:30:25 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Tue, 6 May 2014 14:30:25 -0700 Subject: Unicode ranges with baseline/x-height/X-height Message-ID: <20140506213024.GA11256@powdermilk> For the purpose of drawing characters from a secondary, substitution font, one must know whether one must rescale bbox to bbox, or X-height to X-height etc. Currently I use a very brain-damaged do-it-quickly scheme (only BMP matters so far): my $nobaseline_blocks = < We're struggling to master the intricacies of proposing new Unicode characters specific to the James Joyce masterpiece "Finnegans Wake". http://fwpages.blogspot.com/2014/05/unicode-for-james-joyce-needed.html There are somewhere from two to two-dozen 'sigla' that occur in the published text and the voluminous surviving notes, representing Joyce's basic archetypes: man/woman, boy/girl, old-man/old-woman, judge/jury, message, etc. Scholars currently use various elaborate compromises to represent these in published articles. Because they're all fairly simple geometric shapes, most of them already have approximate representations somewhere in Unicode (e.g. triangle, circle, caret, bracket), but as a gesture of respect it would be great to replace these so the whole set can be precisely matched for heights, linewidths, angles, etc. The highest priority would be the 90 and 270 degree rotations of the capital 'E' (still missing as far as I can tell). There's also rotations and reflections of capital 'F' that occur in the published text but not exactly as sigla (so far as we can judge -- scholarship is still in its early days). Are there precedents? Would testimonials help? Might this qualify for the full two dozen set? 
From asmusf at ix.netcom.com Thu May 8 12:20:02 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 08 May 2014 10:20:02 -0700 Subject: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake In-Reply-To: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> References: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> Message-ID: <536BBCC2.90609@ix.netcom.com> On 5/8/2014 9:09 AM, catherine butler wrote: > We're struggling to master the intricacies of proposing new Unicode characters specific to the James Joyce masterpiece "Finnegans Wake". > http://fwpages.blogspot.com/2014/05/unicode-for-james-joyce-needed.html > > There are somewhere from two to two-dozen 'sigla' that occur in the published text and the voluminous surviving notes, representing Joyce's basic archetypes: man/woman, boy/girl, old-man/old-woman, judge/jury, message, etc. Scholars currently use various elaborate compromises to represent these in published articles. > > Because they're all fairly simple geometric shapes, most of them already have approximate representations somewhere in Unicode (eg triangle, circle, caret, bracket), but as a gesture of respect it would be great to replace these so the whole set can be precisely matched for heights, linewidths, angles, etc. > > The highest priority would be the 90 and 270 degree rotations of the capital 'E' (still missing as far as I can tell). There's also rotations and reflections of capital 'F' that occur in the published text but not exactly as sigla (so far as we can judge-- scholarship is still in its early days). > > Are there precedents? Would testimonials help? Might this qualify for the full two dozen set? > > _______________________________________________ > Catherine, What is needed is an authoritative and complete inventory of these, using *images* from the works and notes to show their shapes (and a few images to document that they are indeed part of running text). 
Only on that basis could there be a useful discussion of whether existing Unicode character codes actually encompass these, and/or whether any are distinct enough to warrant a new character. For example, the siglum that looks like a reversed or rotated E probably should not be encoded using any of the character codes for "E" - unless(!) it is clear that Joyce intended a letter shape. I personally would make a distinction between letters and symbols that just happen to look like letters. However, for the triangle, there's no need to encode a new character. The existing one is already a symbol, and, unless the shape in Joyce's work was a special kind of triangle (like a tall, narrow one, or a right triangle, etc.) the generic triangle would be the correct encoding (Unicode does have all sorts of triangles so it's most likely covered). As for the exact details, line width, precise height, fine positioning on the line, at some point of specificity, these become a matter for a *font*. That means that the proper "respect" for his work would be shown by creating a font that renders these character codes in the correct "look&feel". People on this list can definitely help with the analysis, but see the first paragraph. A./ From emuller at adobe.com Thu May 8 17:23:05 2014 From: emuller at adobe.com (Eric Muller) Date: Thu, 8 May 2014 15:23:05 -0700 Subject: Time to learn French! Message-ID: <536C03C9.8080002@adobe.com> http://www.forbes.com/sites/pascalemmanuelgobry/2014/03/21/want-to-know-the-language-of-the-future-the-data-suggests-it-could-be-french/ http://www.france24.com/en/20140326-will-french-be-world-most-spoken-language-2050/ http://www.boston.com/bostonglobe/ideas/brainiac/2014/03/the_language_of_1.html Et cetera. Eric. From wjgo_10009 at btinternet.com Fri May 9 01:15:24 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 9 May 2014 07:15:24 +0100 (BST) Subject: Time to learn French! 
In-Reply-To: <536C03C9.8080002@adobe.com> References: <536C03C9.8080002@adobe.com> Message-ID: <1399616124.80518.YahooMailNeo@web87804.mail.ir2.yahoo.com> Song about colour fonts - in French http://forum.high-logic.com/viewtopic.php?f=36&t=4953 William Overington 9 May 2014 From crmb211 at yahoo.com Fri May 9 12:45:51 2014 From: crmb211 at yahoo.com (catherine butler) Date: Fri, 9 May 2014 10:45:51 -0700 (PDT) Subject: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake In-Reply-To: <1399572976.53367.YahooMailNeo@web164806.mail.gq1.yahoo.com> References: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> <536BBF56.8030405@unicode.org> <1399572976.53367.YahooMailNeo@web164806.mail.gq1.yahoo.com> Message-ID: <1399657551.94930.YahooMailNeo@web164803.mail.gq1.yahoo.com> "What is needed is an authoritative and complete inventory of these, using *images* from the works and notes to show their shapes (and a few images to document that they are indeed part of running text)." I don't have access to the manuscript facsimiles this would require. There are a few page-images around the Web, but no more than 20 handwritten variants of the most basic sigla. Since the top scholars still haven't resolved the exact shapes, but all agree on the identities being indicated (man, woman, etc), could Unicode leave the exact shapes up to future debate, but set aside some two-dozen slots using Joyce's own verbal designations? (HCE, ALP, Shem, Shaun, Issy, etc) (Also: I can photograph the published variants, but where do I post them?) 
((Apologies for sloppy header: Yahoo Mail seems to have disabled editing of reply-headers!)) From asmusf at ix.netcom.com Fri May 9 15:13:18 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 09 May 2014 13:13:18 -0700 Subject: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake In-Reply-To: <1399657551.94930.YahooMailNeo@web164803.mail.gq1.yahoo.com> References: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> <536BBF56.8030405@unicode.org> <1399572976.53367.YahooMailNeo@web164806.mail.gq1.yahoo.com> <1399657551.94930.YahooMailNeo@web164803.mail.gq1.yahoo.com> Message-ID: <536D36DE.9060007@ix.netcom.com> On 5/9/2014 10:45 AM, catherine butler wrote: > "What is needed is an authoritative and complete inventory of these, > using *images* from the works and notes to show their shapes (and a few > images to document that they are indeed part of running text)." > > I don't have access to the manuscript facsimiles this would require. There are a few page-images around the Web, but no more than 20 handwritten variants of the most basic sigla. > > Since the top scholars still haven't resolved the exact shapes, but all agree on the identities being indicated (man, woman, etc), could Unicode leave the exact shapes up to future debate, but set aside some two-dozen slots using Joyce's own verbal designations? (HCE, ALP, Shem, Shaun, Issy, etc) That's usually not how it's done. Additional reply off-line. A./ > (Also: I can photograph the published variants, but where do I post them?) 
> > ((Apologies for sloppy header: Yahoo Mail seems to have disabled editing of reply-headers!)) > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From budelberger.richard at wanadoo.fr Fri May 9 16:02:32 2014 From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER) Date: Fri, 9 May 2014 23:02:32 +0200 (CEST) Subject: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake In-Reply-To: <536D36DE.9060007@ix.netcom.com> References: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> <536BBF56.8030405@unicode.org> <1399572976.53367.YahooMailNeo@web164806.mail.gq1.yahoo.com> <1399657551.94930.YahooMailNeo@web164803.mail.gq1.yahoo.com> <536D36DE.9060007@ix.netcom.com> Message-ID: <1722509421.27207.1399669352879.JavaMail.www@wwinf1f14> > Message du 09/05/14 22:24 > De : "Asmus Freytag" > A : "catherine butler" , unicode at unicode.org > Copie à : > Objet : Re: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake > > On 5/9/2014 10:45 AM, catherine butler wrote: > > "What is needed is an authoritative and complete inventory of these, > > using *images* from the works and notes to show their shapes (and a few > > images to document that they are indeed part of running text)." > > > > I don't have access to the manuscript facsimiles this would require. There are a few page-images around the Web, but no more than 20 handwritten variants of the most basic sigla. > > > > Since the top scholars still haven't resolved the exact shapes, but all agree on the identities being indicated (man, woman, etc), could Unicode leave the exact shapes up to future debate, but set aside some two-dozen slots using Joyce's own verbal designations? (HCE, ALP, Shem, Shaun, Issy, etc) > > That's usually not how it's done. Additional reply off-line. Why « off-line »? R. B. 
From richard.wordingham at ntlworld.com Fri May 9 17:57:52 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 9 May 2014 23:57:52 +0100 Subject: Indic Syllabic Categories Message-ID: <20140509235752.27e23319@JRWUBU2> Is the provisional property 'Indic_Syllabic_Category' defined by anything deeper than the UCD file IndicSyllabicCategory itself? I started to review the current assignments and realised I couldn't explain the division between Vowel_Independent and Consonant. Is the division meant to be rational or, although possibly consistent within a script, is it arbitrary? For example, why are U+17A2 KHMER LETTER QA and U+0E24 THAI CHARACTER RU of category Consonant, while U+1021 MYANMAR LETTER A and U+1A50 TAI THAM LETTER UU are of category Vowel_Independent? Is this difference intended to matter? Both U+17A2 and U+1021 combine freely with dependent vowels, while U+0E24 and U+1A50 combine with very few dependent vowels (I think just U+0E45 THAI CHARACTER LAKKHANGYAO and U+1A63 TAI THAM VOWEL SIGN AA respectively). Another puzzle is that U+1038 MYANMAR SIGN VISARGA is of category Visarga while U+19B0 NEW TAI LUE VOWEL SIGN VOWEL SHORTENER is of category Vowel_Dependent. Both may follow a final consonant (one of category Consonant_Final in the case of Tai Lue!) without implying a new syllable. Is the property meant to be tailorable? For example, there are encoded characters in the Khmer script that serve as tone marks when it is used to write Thai. Richard. From samjnaa at gmail.com Fri May 9 20:32:03 2014 From: samjnaa at gmail.com (Shriramana Sharma) Date: Sat, 10 May 2014 07:02:03 +0530 Subject: Indic Syllabic Categories In-Reply-To: <20140509235752.27e23319@JRWUBU2> References: <20140509235752.27e23319@JRWUBU2> Message-ID: Dear Richard, It is true that Vowel_Independent can behave like Consonant characters. Given that consonant letters also have an inherent vowel in these scripts, IMO there is not really much to distinguish *technically*. 
At least in *Indian* Indic scripts we don't have Vowel_Independent letters participating in a cluster via a virama, unlike the consonant letters, but possibly in the South East Asian scripts this is not guaranteed. Hence IIUC it is merely the traditional classification based on the sound value of these letters that is reflected here. And that classification (which you probably know but just putting into writing) is: The consonant letters all denote the same inherent vowel preceded by one (or, rarely, more) consonant sounds. The independent vowel letters OTOH all denote different vowel sounds without (for the most part) any consonant sounds. HTH, Shriramana Sharma. -- Shriramana Sharma ???????????? ???????????? From crmb211 at yahoo.com Fri May 9 20:49:37 2014 From: crmb211 at yahoo.com (catherine butler) Date: Fri, 9 May 2014 18:49:37 -0700 (PDT) Subject: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake In-Reply-To: <536D36DE.9060007@ix.netcom.com> References: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> <536BBF56.8030405@unicode.org> <1399572976.53367.YahooMailNeo@web164806.mail.gq1.yahoo.com> <1399657551.94930.YahooMailNeo@web164803.mail.gq1.yahoo.com> Message-ID: <1399686577.34855.YahooMailNeo@web164805.mail.gq1.yahoo.com> "> Since the top scholars still haven't resolved the exact shapes, but all agree on the identities being indicated (man, woman, etc), could Unicode leave the exact shapes up to future debate, but set aside some two-dozen slots using Joyce's own verbal designations? (HCE, ALP, Shem, Shaun, Issy, etc) That's usually not how it's done." The Phaistos Disc's Unicode block names the characters after best-guesses of what they depict. If it's ever solved, we might realize some guesses were wrong, and re-draw them, but the Unicode numbering can stay the same. This could work similarly. The old Finnegans Wake Circular used ASCII designations: $E, $A, $I.1, etc. that everyone agrees on. 
But trying to resolve subtle questions like "Is the middle horizontal of the 'E' shorter?" by collating all handwritten exemplars will be a labor of decades... From asmusf at ix.netcom.com Fri May 9 21:58:03 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 09 May 2014 19:58:03 -0700 Subject: Indic Syllabic Categories In-Reply-To: References: <20140509235752.27e23319@JRWUBU2> Message-ID: <536D95BB.3010800@ix.netcom.com> On 5/9/2014 6:32 PM, Shriramana Sharma wrote: > Dear Richard, > > It is true that Vowel_Independent can behave like Consonant > characters. Given that consonant letters also have an inherent vowel > in these scripts, IMO there is not really much to distinguish > *technically*. At least in *Indian* Indic scripts we don't have > Vowel_Independent letters participating in a cluster via a virama > unlike the consonant letter, And the ability to construct a regular expression that caters to this restriction (for the scripts that have it) is supremely useful in a number of application areas. Having a division that is only occasionally required is much better than not having the division. A./ > but possibly in the South East Asian > scripts this is not guaranteed. Hence IIUC it is merely the > traditional classification based on the sound value of these letters > that is reflected here. > > And that classification (which you probably know but just putting into > writing) is: The consonant letters all denote the same inherent vowel > preceded by one (or, rarely, more) consonant sounds. The independent > vowel letters OTOH all denote different vowel sounds without (for the > most part) any consonant sounds. > > HTH, > > Shriramana Sharma. 
> > From asmusf at ix.netcom.com Fri May 9 21:59:21 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 09 May 2014 19:59:21 -0700 Subject: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake In-Reply-To: <1399686577.34855.YahooMailNeo@web164805.mail.gq1.yahoo.com> References: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> <536BBF56.8030405@unicode.org> <1399572976.53367.YahooMailNeo@web164806.mail.gq1.yahoo.com> <1399657551.94930.YahooMailNeo@web164803.mail.gq1.yahoo.com> <536D36DE.9060007@ix.netcom.com> <1399686577.34855.YahooMailNeo@web164805.mail.gq1.yahoo.com> Message-ID: <536D9609.5050608@ix.netcom.com> On 5/9/2014 6:49 PM, catherine butler wrote: > "> Since the top scholars still haven't resolved the exact shapes, but all agree on the identities being indicated (man, woman, etc), could the Unicode leave the exact shapes up to future debate, but set aside some two-dozen slots using Joyce's own verbal designations? (HCE, ALP, Shem, Shaun, Issy, etc) > That's usually not how it's done." > > The Phaistos Disc's Unicode block names the characters after best-guesses of what they depict. If it's ever solved, we might realize some guesses were wrong, and re-draw them, but the Unicode numbering can stay the same. This could work similarly. The old Finnegans Wake Circular used ascii designations: $E, $A, $I.1, etc. that everyone agrees on. But trying to resolve subtle questions like "Is the middle horizontal of the 'E' shorter?" by collating all handwritten exemplars will be a labor of decades... And not likely relevant to the encoding decision. 
A./ From eik at iki.fi Sat May 10 03:18:29 2014 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sat, 10 May 2014 11:18:29 +0300 Subject: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake In-Reply-To: <1399686577.34855.YahooMailNeo@web164805.mail.gq1.yahoo.com> References: <1399565389.50943.YahooMailNeo@web164805.mail.gq1.yahoo.com> <536BBF56.8030405@unicode.org> <1399572976.53367.YahooMailNeo@web164806.mail.gq1.yahoo.com> <1399657551.94930.YahooMailNeo@web164803.mail.gq1.yahoo.com> <536D36DE.9060007@ix.netcom.com> <1399686577.34855.YahooMailNeo@web164805.mail.gq1.yahoo.com> Message-ID: <000801cf6c28$6c7c4990$4574dcb0$@fi> The Unicode names for the Phaistos disc (and everything else, for that matter) would not change based on a new interpretation. Sincerely, Erkki -----Original message----- From: Unicode [mailto:unicode-bounces at unicode.org] On behalf of catherine butler Sent: 10 May 2014 04:50 To: unicode at unicode.org Subject: Re: Preliminary inquiry: Sigla for James Joyce's Finnegans Wake "> Since the top scholars still haven't resolved the exact shapes, but all agree on the identities being indicated (man, woman, etc), could the Unicode leave the exact shapes up to future debate, but set aside some two-dozen slots using Joyce's own verbal designations? (HCE, ALP, Shem, Shaun, Issy, etc) That's usually not how it's done." The Phaistos Disc's Unicode block names the characters after best-guesses of what they depict. If it's ever solved, we might realize some guesses were wrong, and re-draw them, but the Unicode numbering can stay the same. This could work similarly. The old Finnegans Wake Circular used ascii designations: $E, $A, $I.1, etc. that everyone agrees on. But trying to resolve subtle questions like "Is the middle horizontal of the 'E' shorter?" by collating all handwritten exemplars will be a labor of decades...
_______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From richard.wordingham at ntlworld.com Sat May 10 05:19:46 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 10 May 2014 11:19:46 +0100 Subject: Indic Syllabic Categories In-Reply-To: <536D95BB.3010800@ix.netcom.com> References: <20140509235752.27e23319@JRWUBU2> <536D95BB.3010800@ix.netcom.com> Message-ID: <20140510111946.7e442104@JRWUBU2> On Fri, 09 May 2014 19:58:03 -0700 Asmus Freytag wrote: > On 5/9/2014 6:32 PM, Shriramana Sharma wrote: > > At least in *Indian* Indic scripts we don't have > > Vowel_Independent letters participating in a cluster via a virama > > unlike the consonant letter, > And the ability to construct a regular expression that caters to this > restriction (for the scripts that have it) is supremely useful in a > number of application areas. Is it participation as C1, as C2 or as either in a cluster that one wishes to rule out? > Having a division that is only occasionally required is much better > than not having the division. What worries me is the possibility of false deductions from such a division. If the differences are widely known to vary from script to script, then there is much less cause to worry. > > And that [traditional] classification (which you probably know but > > just putting into writing) is: The consonant letters all denote the > > same inherent vowel preceded by one (or, rarely, more) consonant > > sounds. The independent vowel letters OTOH all denote different > > vowel sounds without (for the most part) any consonant sounds. Unsurprisingly, it's when the latter rule breaks down that things can get complicated, and one has to ask which symbols denoting consonant plus vowel are to be counted as independent vowels and which as consonant letters.
There don't seem to be any complications arising from different consonants having different implicit vowels (e.g. Khmer). Richard. From richard.wordingham at ntlworld.com Sun May 11 18:37:37 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 12 May 2014 00:37:37 +0100 Subject: Indic Syllabic Categories In-Reply-To: References: <20140509235752.27e23319@JRWUBU2> Message-ID: <20140512003737.1c4408e6@JRWUBU2> On Sat, 10 May 2014 07:02:03 +0530 Shriramana Sharma wrote: > At least in *Indian* Indic scripts we don't have > Vowel_Independent letters participating in a cluster via a virama > unlike the consonant letter, but possibly in the South East Asian > scripts this is not guaranteed. In Devanagari at least, there does appear to be the exceptional case of RA + VIRAMA + LETTER VOCALIC R. (Did this arise from the resolution of /r̥/ as /ri/ or /ru/?) Richard. From verdy_p at wanadoo.fr Sun May 11 22:22:36 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 12 May 2014 05:22:36 +0200 Subject: Indic Syllabic Categories In-Reply-To: References: <20140509235752.27e23319@JRWUBU2> Message-ID: My opinion is that, unlike Consonants, the Vowel_Independent characters generally do not need an extra Dependent_Vowel to alter them (only vowel modifiers for tone, stress, nasalisation, or newer phonetic distinctions needed to represent words borrowed from other languages...), and also generally don't need a Virama-like character to remove their inherent vowel sign. These Vowel_Independent characters also generally don't take the additional diacritics used to modify Consonants. So the Vowel_Independent characters can be viewed as if they already precombined the same "null" Consonant (not encoded, though some phonologies may render it with a true phonetic consonant such as a glottal stop, later added to the abugida as a true Consonant with its own inherent vowel; something that also occurs in Semitic abjads) with their associated Dependent_Vowel.
Now if you analyze the existing independent vowel A, it is by itself the representation of this null consonant associated with the inherent vowel. That vowel A followed by a virama then makes sense to represent a non-null consonant like the glottal stop. This independent vowel A is different from the other independent vowels, and is more like the consonants (this can be said too about the Semitic Alef, perceived as this null consonant with its inherent vowel dropped). This is the only case where the distinction between independent vowel and consonant is really fuzzy (unless we analyze it like the vowel A in western alphabets). And there are some transforms (or orthographic evolutions) where pairs of combining Dependent_Vowels could become a single Dependent_Vowel followed by a Vowel_Independent (or could also evolve to an abjad or a plain alphabet like Hangul), which affect things like collation and transcription. In the history of scripts, the separation line may become difficult to draw when abugidas and alphabets are evolutions (in opposite directions) of Semitic abjads, which initially did not separate phonemes but only syllables (like today's syllabaries, largely kept in sinograms and even graphically in the Hangul alphabet). It is in western alphabets that the syllabic separations have become almost invisible (so much that hyphenation rules in Western alphabets have become difficult, and they incorporated spaces and punctuation to separate words; but their origin comes in fact from Semitic abjads at a time when vowels were not written and had to be guessed). 2014-05-10 3:32 GMT+02:00 Shriramana Sharma : > Dear Richard, > > It is true that Vowel_Independent can behave like Consonant > characters. Given that consonant letters also have an inherent vowel > in these scripts, IMO there is not really much to distinguish > *technically*.
At least in *Indian* Indic scripts we don't have > Vowel_Independent letters participating in a cluster via a virama > unlike the consonant letter, but possibly in the South East Asian > scripts this is not guaranteed. Hence IIUC it is merely the > traditional classification based on the sound value of these letters > that is reflected here. > > And that classification (which you probably know but just putting into > writing) is: The consonant letters all denote the same inherent vowel > preceded by one (or, rarely, more) consonant sounds. The independent > vowel letters OTOH all denote different vowel sounds without (for the > most part) any consonant sounds. > > HTH, > > Shriramana Sharma. > > > -- > Shriramana Sharma ???????????? ???????????? > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon May 12 03:03:47 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 12 May 2014 09:03:47 +0100 Subject: Indic Syllabic Categories In-Reply-To: References: <20140509235752.27e23319@JRWUBU2> Message-ID: <20140512090347.281534f2@JRWUBU2> On Mon, 12 May 2014 05:22:36 +0200 Philippe Verdy wrote: > My opinion is that unlike Consonants, the Vowel_Independent are > generally not needing an extra Dependant_Vowel to alter them (only > vowel modifiers for tone, stress, nasalisation, or newer > distinguished phonetic variants needed to represent words borrowed > from other languages...), and also generally don't need a Virama-like > character to remove their inherent vowel sign. These > Vowel_Independent also generally don't take the additional diacritics > used to modofy Consonants. Mostly it's only LETTER A that loses its vowel, and then generally when it comes to be interpreted as a 'Consonant_Placeholder' or as a 'Consonant'. 
Oddly, there are a few of these characters that are classed as 'Vowel_Independent' when 'Consonant' seems more appropriate. In most of the other cases, an independent vowel in combination with a virama still acts as a combination of consonant and dependent vowel. Dependent vowels on independent vowels generally modify rather than replace the vowel sound of the independent vowel. Balinese provides a simple example; the Brahmi length mark has retained or regained its independence and is regularly applied to both independent and dependent vowels. Richard. From ken.whistler at sap.com Mon May 12 13:43:04 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Mon, 12 May 2014 18:43:04 +0000 Subject: Indic Syllabic Categories In-Reply-To: <20140509235752.27e23319@JRWUBU2> References: <20140509235752.27e23319@JRWUBU2> Message-ID: Richard Wordingham asked: > Is the provisional property 'Indic_Syllabic_Category' defined by > anything deeper than the UCD file IndicSyllabicCategory itself? Basically, no. It simply gathers together information scattered about in the core spec and elsewhere about claims regarding what all the characters are. The classification has been undergoing further review and will be updated again shortly for the 7.0 Unicode release with some further distinctions and corrections. However, the file(s) (and properties) will remain provisional for Unicode 7.0. And there is no overarching UTR which provides a definitive model for all of these categories. The values are evolving more along the lines of what is proving useful for implementation, rather than being a priori defined categories. > Is the property meant to be tailorable? For example, there are > encoded characters in the Khmer script that serve as tone marks when it > is used to write Thai. 
For a property to be "tailorable" in a Unicode context, you pretty much have to have some kind of algorithm defined which uses those property values and then changes them in some systematic way to modify the outcome of the algorithm. In this case, there is no Unicode algorithm defined (although implementers may have specific algorithms in their rendering engines), and the data is all provisional. There is a probability that the two Indic category files may be promoted to *informative* status as of Unicode 8.0, with further modifications, extensions, and corrections. The main difference would be that once a property becomes *informative* in the UCD, the UTC would be committed to keeping it around and maintaining it. By contrast, a provisional property can just be removed, if it doesn't pan out. My suggestion, for those who are interested in this topic, would be to review the relevant data files, implied script behaviors, and documents and proposals in the UTC document register -- and over the course of the next year participate in providing feedback on this topic and the data files, so that if/when the files and related properties become informative for Unicode 8.0 next year sometime, these questions and any concerns about the various edge cases as applied to Southeast Asian scripts, can be addressed before the properties become more difficult to update. --Ken From verdy_p at wanadoo.fr Mon May 12 14:54:32 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 12 May 2014 21:54:32 +0200 Subject: Indic Syllabic Categories In-Reply-To: <20140512090347.281534f2@JRWUBU2> References: <20140509235752.27e23319@JRWUBU2> <20140512090347.281534f2@JRWUBU2> Message-ID: 2014-05-12 10:03 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > Dependent vowels on independent vowels generally modify rather than > replace the vowel sound of the independent vowel. 
Balinese provides a > simple example; the Brahmi length mark has retained or regained its > independence and is regularly applied to both independent and > dependent vowels. > Hmmm... The length mark itself is not a dependent vowel by itself; it's a modifier that follows a vowel (dependent or not). This length mark is just like a simple macron in Latin, or other length marks used in Asian scripts. In many cases its use is optional, as it is not significant semantically or phonologically, or it remains by tradition in frequently used words. It is the introduction of a null consonant as a plain letter (H in Latin; a letter distinguished from other letters when the glottal stop is realized phonetically with semantic distinctions from a pure "mute" sound: a true glottal stop more or less advanced in the throat or palate, or an exhaled or inhaled breath, or a short mute pause, sometimes just a modification of tone for the following vowel...) that made possible the evolution of Semitic abjads into alphabets with separate letters (though without banning the use of diacritics). Orthographies have also used diacritics to represent this null consonant in alphabets; notably the diaeresis as in French, where it is not only used for this but also to avoid the interpretation of digraphs and to enforce the separate pronunciation of one of the two vowels; notably when applied to a final e (usually this final e is mute, but it would need to be pronounced in a separate syllable if not mute for emphasis purposes: see "aiguë": /e.ɡy/, or exceptionally /e.ɡy.ə/ with emphasis, where we see this null consonant as if the word were written "aiguhe"; the same term may also be written "aigüe" with an older orthography, the placement of the diaeresis on the first or second vowel being variable, though it is preferable today to write it on the second one).
Indic abugidas, by contrast, have not distinguished this null consonant explicitly; but it still exists logically as an unbreakable combination of that null consonant and the dependent vowel. This Indic model was abandoned in the Hangul script, which uses an explicit null-consonant jamo (but older Korean orthographies also wrote multiple vowels in the same syllabic cluster without marking this null consonant explicitly, and a syllable-leading vowel could also be left unwritten, or written with a special placement relative to the following consonant; today Korean no longer uses clusters of vowels and has also abandoned clusters of consonants by promoting 'de jure' some digraphs into plain consonants, even if this change is superficial). -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon May 12 16:58:10 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 12 May 2014 22:58:10 +0100 Subject: Indic Syllabic Categories In-Reply-To: References: <20140509235752.27e23319@JRWUBU2> <20140512090347.281534f2@JRWUBU2> Message-ID: <20140512225810.6f208910@JRWUBU2> On Mon, 12 May 2014 21:54:32 +0200 Philippe Verdy wrote: > 2014-05-12 10:03 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > Dependent vowels on independent vowels generally modify rather than > > replace the vowel sound of the independent vowel. Balinese > > provides a simple example; the Brahmi length mark has retained or > > regained its independence and is regularly applied to both > > independent and dependent vowels. I was writing in haste; the use of the length mark is not as regular as I thought it was. > Hmmm... The length mark itself is not a dependant vowel by itself, > it's a modifier that follows a vowel (dependant or not). It's counted as such in the Balinese script, and has become such in most Indic scripts, being the dependent vowel AA.
> Indic abugidas on the opposite have not distinguished this > null-consonnant explicitly; but it still exists logically as an > unbreakable combination of that null-consonnant and the dependant > vowel. In mainland SE Asia the distinction is made. The independent vowel whose vowel is the implicit vowel has been reinterpreted as the consonant for a glottal stop, and is combined with the dependent vowels. Several scripts, e.g. Tibetan and Thai, have largely done away with the independent vowels. Richard. From verdy_p at wanadoo.fr Mon May 12 17:55:38 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 13 May 2014 00:55:38 +0200 Subject: Indic Syllabic Categories In-Reply-To: <20140512225810.6f208910@JRWUBU2> References: <20140509235752.27e23319@JRWUBU2> <20140512090347.281534f2@JRWUBU2> <20140512225810.6f208910@JRWUBU2> Message-ID: 2014-05-12 23:58 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > In mainland SE Asia the distinction is made. The independent vowel > whose vowel is the implicit vowel has been reinterpreted as the > consonant for a glottal stop, and is combined with the dependent > vowels. Several scripts, e.g. Tibetan and Thai, have largely done away > with the independent vowels. > The same happened in Arabic with Alef in many uses, and in Greek in ancient times. And this goes further than the simple matres lectionis for "half-vowels" that still have remnants in the Latin script (in orthographic systems shaped by the phonology and the assimilation of regional "accent" variations and their evolution).
In fact it is in the alphabets (rather than abjads and abugidas) that the distinction between consonants and vowels (which is still clear in the phonetics) has become fuzziest: there is a large gap between the spoken language and the written one, which assimilates more and more local or historic phonetic variations and evolutions. This goes up to extreme points, as in English, whose orthography is very far from the spoken language and obeys absolutely no rule: lots of exceptions, lots of mute letters; it is completely counterintuitive, only partly compensated by the (over?) simplification of grammatical rules and the syntax (and it creates many interpretation ambiguities in written texts, whose understanding requires much more contextual analysis; it is a fact that many English texts are difficult to translate due to these frequent multiple interpretations and the non-marking of the tense of verbs, and this has become even worse with some arbitrary conventions in the written text like capitalization, which have then also contaminated the spoken language). Even the basic SVO syntax is threatened in English by the SSS model: if a few auxiliary verbs were not kept, English would now be just a juxtaposition of nouns, with a syntax reduced so much that it's difficult to distinguish a verb from a noun, or basic verb moods like infinitive and imperative, and the intended target of imperatives. Pronouns are also disappearing. This is only compensated by a large increase in the vocabulary (with lots of strange borrowings from other languages, frequently with very irregular orthography). And it is probably the basis for the promotion of "Simple English", which could become a new language far from the English we read today, or even a set of related languages for specialists in their own domains. More or less, the world has learnt to work with written English, but has difficulties (including within the Anglosphere) with the spoken language.
But at the same time, native English speakers write less and depend more on the spoken language. Children no longer use a pen, they use a computer, and they read less: they watch videos or listen to audio recordings in their more local community. With the huge separation between the oral and written language, a new written form appears and grows fast in popular usage (but these texts are more difficult to parse and understand for others living or working in different contexts). Could English become tomorrow like Church Latin? Today when I read a few short news headlines in English, it's hard to see what the article really speaks about, or what the real intent is. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue May 13 02:20:05 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 May 2014 08:20:05 +0100 Subject: Indic Syllabic Categories In-Reply-To: References: <20140509235752.27e23319@JRWUBU2> Message-ID: <20140513082005.624ecdad@JRWUBU2> On Mon, 12 May 2014 18:43:04 +0000 "Whistler, Ken" wrote: > Richard Wordingham asked: > > Is the provisional property 'Indic_Syllabic_Category' defined by > > anything deeper than the UCD file IndicSyllabicCategory itself? > Basically, no. Thanks for answering the question. There are two aspects to the evaluation of values. The first is which values there are to choose from, for which the recent UTC meeting planned to consider a proposed refinement. The second is which values are assigned to each character. The 'implied script behaviours' will be the most difficult to review - where are we to find them? In SE Asia, there's a clutch of characters that have multiple functions, even within the same writing system. Should we contemplate assigning multiple roles to characters? It would make more sense to do so for tasks like checking syllable structure. Richard.
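Richard's point about checking syllable structure (and Asmus's earlier remark that a regular expression built over these categories is "supremely useful") can be sketched in code. The category assignments below are a hypothetical hand-picked excerpt for illustration only; the real data is the UCD file IndicSyllabicCategory.txt, and real cluster grammars are considerably richer:

```python
import re

# Hypothetical toy excerpt of Indic_Syllabic_Category assignments for
# Devanagari, hand-picked for illustration; the real data lives in the
# UCD file IndicSyllabicCategory.txt.
ISC = {
    "\u0915": "Consonant",          # KA
    "\u0924": "Consonant",          # TA
    "\u094D": "Virama",             # VIRAMA
    "\u093F": "Vowel_Dependent",    # VOWEL SIGN I
    "\u0905": "Vowel_Independent",  # A
    "\u0906": "Vowel_Independent",  # AA
}

def char_class(category):
    """Regex character class matching every character of one category."""
    return "[" + "".join(c for c, cat in ISC.items() if cat == category) + "]"

C = char_class("Consonant")
H = char_class("Virama")
M = char_class("Vowel_Dependent")
V = char_class("Vowel_Independent")

# Deliberately simplified cluster grammar: an independent vowel stands
# alone, while consonants may stack via virama and take an optional matra.
# The Vowel_Independent/Consonant split is exactly what lets the regex
# forbid an independent vowel from joining a stack via virama.
CLUSTER = re.compile(f"(?:{V}|{C}(?:{H}{C})*{M}?)\\Z")

print(bool(CLUSTER.match("\u0915\u094D\u0924\u093F")))  # KA+VIRAMA+TA+I: True
print(bool(CLUSTER.match("\u0906\u094D\u0915")))        # AA+VIRAMA+KA: False
```

For a script where independent vowels do take virama (as discussed for some South East Asian scripts), one would simply fold V into the consonant branch; the point is that the category division makes the restriction expressible at all.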
From chris.fynn at gmail.com Thu May 15 05:21:23 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Thu, 15 May 2014 16:21:23 +0600 Subject: Unicode ranges with baseline/x-height/X-height In-Reply-To: <20140506213024.GA11256@powdermilk> References: <20140506213024.GA11256@powdermilk> Message-ID: Indic scripts generally have a hanging base From nospam-abuse at ilyaz.org Thu May 15 15:50:07 2014 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Thu, 15 May 2014 13:50:07 -0700 Subject: Unicode ranges with baseline/x-height/X-height In-Reply-To: References: <20140506213024.GA11256@powdermilk> Message-ID: <20140515205007.GA23375@powdermilk> On Thu, May 15, 2014 at 04:21:23PM +0600, Christopher Fynn wrote: > Indic scripts generally have a hanging base Sure. And many mathematical symbols should have a "math-centerline base". However, the font files I'm working with do not have the information about where these extra baselines are (or any per-script details); so I do not care about them at the moment. So my question was about the "compatible with European scripts" baseline (this is why I put it near uppercase/lowercase; probably I needed to be more explicit). Ilya P.S.: in more detail: I'm merging a (lousily autogenerated) font which has a complete 6.3 BMP (Unifont + my ?-level beautifier) with a (well designed) font which supports European scripts (but even those incompletely: DejaVu). So the only mismatches may appear in the scripts supported by both. P.P.S.: However, I presume that in technical writing, the notion of script is not so set in stone; characters/words from one script would pepper the text as "symbols" inside a running "human language" text (as Greek does in math-in-Latin, and Greek and Latin do in math-in-Cyrillic). Do not know how this is handled inside non-Latin/Greek/Cyrillic/Hebrew scripts.
From richard.wordingham at ntlworld.com Sat May 17 05:56:35 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 17 May 2014 11:56:35 +0100 Subject: Indic Syllabic Categories In-Reply-To: References: <20140509235752.27e23319@JRWUBU2> Message-ID: <20140517115635.7e03509f@JRWUBU2> On Mon, 12 May 2014 18:43:04 +0000 "Whistler, Ken" wrote: > My suggestion, for those who are interested in this topic, would be > to review the relevant data files, implied script behaviors, and > documents and proposals in the UTC document register -- and over the > course of the next year participate in providing feedback on this > topic and the data files, so that if/when the files and related > properties become informative for Unicode 8.0 next year sometime, > these questions and any concerns about the various edge cases as > applied to Southeast Asian scripts, can be addressed before the > properties become more difficult to update. I've reviewed the application of the revised categories as set forth in L2/14-126 (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf) as applied to the Thai, Lao and Tai Tham scripts, and noted a few other characters, and come up with the following proposed changes of syllabic category. I present them here rather than submit them as feedback *immediately*. Some of these changes are tentative and would benefit from discussion. I've come up with 3 new characters of category Bindu: 0303 ; Bindu # Mn COMBINING TILDE 0310 ; Bindu # Mn COMBINING CANDRABINDU 1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG (was Vowel_Dependent) Note that both U+0ECD LAO NIGGAHITA and U+1A74 function both as Bindu and as Vowel_Dependent. U+0303 is used in Patani Malay in the Thai script - see UTC document L2/10-451. U+0310 is used for Sanskrit in Tamil script, according to Indic list email 'Re: Tamil Punctuation', 27/7/12 9:24 +0530 from Shriramana Sharma. 
I've found 4 new characters of category Visarga: 0E30 ; Visarga # Lo THAI CHARACTER SARA A 0EB0 ; Visarga # Lo LAO VOWEL SIGN A 1A61 ; Visarga # Mc TAI THAM VOWEL SIGN A 19B0 ; Visarga # Mc NEW TAI LUE VOWEL SIGN VOWEL SHORTENER Note that the tone (or voice modulation) character U+1038 MYANMAR SIGN VISARGA is currently classified as Visarga. U+0E30 is used as visarga in Sanskrit, e.g. in the Royal Institute Dictionary. The typical sound of the four visargas above is /ʔ/ rather than /h/, and, through a feature of Tai (SW Tai?) phonology, they all have the additional function of shortening a vowel. As a vowel shortener, U+1A61 and U+19B0 may follow a final consonant. These 4 characters are currently classified as Vowel_Dependent. Except for the Lao script, that usage can easily be interpreted as a modification of the implicit vowel. Modern Lao does not acknowledge the existence of an implicit vowel, so that interpretation may be harder to accept. (Vowel_Dependent U+0EB1 LAO VOWEL SIGN MAI KAN is also a vowel shortener; in the 19th century it was denied that Vowel_Dependent U+0E31 THAI CHARACTER MAI HAN-AKAT was a vowel in Thai.) U+1A61 occasionally has the sound /k/, especially when used in conjunction with U+1A62 TAI THAM VOWEL SIGN MAI SAT. I think we should regard this as just one of the uses of visarga. I've found 3 new nuktas, at least, so long as the application of nukta is not restricted to *foreign* consonants. 0331 ; Nukta # Mn COMBINING MACRON BELOW 0359 ; Nukta # Mn COMBINING ASTERISK BELOW 1A7F ; Nukta # Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT U+0331 is used in Patani Malay in the Thai script - see L2/10-451 and the consonant chart on p16 of http://mlenetwork.org/sites/default/files/Patani%20Malay%20Presentation%20-%20Part%202.pdf. U+0331 and U+0359 have been used in English-Thai dictionaries to represent English sounds, very much a nukta role. They were previously classified as 'Other'.
U+0EC8 LAO TONE MAI EK functions as Nukta in Khmu as well as performing its principal rôle of Tone_Mark in Lao. U+0E3A THAI CHARACTER PHINTHU is used both as Nukta and as Pure_Killer; the latter is its traditional rôle, and its current classification. I've found 4 new pure killers, all currently classified as 'Other'. They are: 0E4C ; Pure_Killer # Mn THAI CHARACTER THANTHAKHAT 0ECC ; Pure_Killer # Mn LAO CANCELLATION MARK 1A7C ; Pure_Killer # Mn TAI THAM SIGN KHUEN-LUE KARAN 1A7A ; Pure_Killer # Mn TAI THAM SIGN RA HAAM U+0E4C THAI CHARACTER THANTHAKHAT and U+0E4E THAI CHARACTER YAMAKKAN once divided the role of vowel killing - U+0E4E formed clusters and U+0E4C removed final vowels. The use of U+0E4C came to be largely restricted to vowels associated with clusters of consonants. Removing the vowel made the final consonant of the cluster silent (spoken Thai does not permit final consonant clusters), and from this effect it has been reinterpreted as a consonant-killer. U+0ECC probably had the same behaviour as U+0E4C. I don't know if it is still used in Laos - foreign loanwords often don't follow the rules. The Tai Tham marks are still at the transitional stage - they are sometimes found on final unsubscripted consonants to indicate that they have no vowel. There is an unfortunate overlap with the final consonant mark for (pronunciation necessarily /n/). The Khuen and Lue form of the final consonant symbol has the same shape as the Thai and Lao form of the pure killer. Consequently U+1A7A serves as Consonant_Final in Tai Khuen and Tai Lue.
In Tai Khuen, at least, the use as a final consonant seems to have recently fallen into disfavour, so it seems most appropriate to classify U+1A7A as 'Pure_Killer'. I noted above that the 'Pure_Killer' U+0E3A THAI CHARACTER PHINTHU also serves as a nukta. I have a vague recollection that U+0E4C THAI CHARACTER THANTHAKHAT serves as a register mark in an orthography for the Chong language, so that would count as an auxiliary rôle as Tone_Mark. I think I have found one new 'Vowel_Independent', U+1A53 TAI THAM LETTER LAE, currently classified as 'Consonant'. However, it does not freely combine with true dependent vowels. It does pleonastically combine with U+1A6F TAI THAM VOWEL SIGN AE; U+1A53 arises as an abbreviation for . 1A53 ; Vowel_Independent # Lo TAI THAM LETTER LAE - or is it? It should be noted that U+1A62 TAI THAM VOWEL SIGN MAI SAT serves not only as Vowel_Dependent but also as Consonant_Final. This seems to be chiefly relevant to anyone attempting to deduce the pronunciation from the spelling. There are 4 characters currently categorised as 'Consonant' which I think are better categorised as 'Vowel': 0E24 ; Vowel # Lo THAI CHARACTER RU 0E26 ; Vowel # Lo THAI CHARACTER LU 1A42 ; Vowel # Lo TAI THAM LETTER RUE 1A44 ; Vowel # Lo TAI THAM LETTER LUE They serve both as independent and dependent vowels. Note that U+0E24 and U+0E26 may be followed by the length mark U+0E45 THAI CHARACTER LAKKHANGYAO, which is categorised as 'Vowel_Dependent'. I am not aware of any usage of U+0E45 as a true vowel. The sequence occurs with the same meaning, 'elephant', as U+1AAD. I don't know whether this justifies changing U+1AAD from 'Other' to 'Consonant_Placeholder'. I've found 2 new Consonants: 0EBD ; Consonant # Lo LAO SEMIVOWEL SIGN NYO (was Consonant_Medial) 0EDE ; Consonant # Lo LAO LETTER KHMU GO (was Other) U+0EBD is used as an initial consonant in Khmu, so U+0EBD has been used in all rôles in the Lao script, like U+0EA7 LAO LETTER WO, which is of category Consonant.
For information on Khmu usage, see UTC document L2/10-335 (http://www.unicode.org/L2/L2010/10335r-n3893r-lao-hosken.pdf). The omission of U+0EDE and U+0EDF is such a shock that I submitted an error report as I was drafting the email. The Khmu alphabet chart included backs up the text. (It also shows U+0EC8 LAO TONE MAI EK acting as a Nukta!) If 'repha' can be used as a general category, including for example Myanmar script kinzi, then there are two arguable new examples, currently categorised as Consonant_Final: 1A58 ; Consonant_Preceding_Repha? # Mn TAI THAM SIGN MAI KANG LAI 1A5A ; Consonant_Succeeding_Repha? # Mn TAI THAM CONSONANT SIGN LOW PA There are significant issues with U+1A58; while traditionally it behaves as repha/kinzi, some modern styles are better served by treating it as Consonant_Final. It takes some juggling for a single OTL-style rendering engine to be able to render either style depending on the lookups while oblivious to the difference, but it can be done. I've found 5 new instances of Consonant_Subjoined: 1A57 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN LA TANG LAI 1A5B ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA 1A5C ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MA 1A5D ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN BA 1A5E ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN SA They were all previously categorised as Consonant_Final. Note that U+1A57 is an abbreviation. It is derived by the addition of a stroke to the subscript form . Abbreviations of the word _tanglaai_ using U+1A57 normally include at least , so U+1A57 is not Consonant_Final. The word ?????? _nippa:na_ 'nirvana' immediately demonstrates that U+1A5B is not a final consonant. U+1A5C occurs in the Pali proper names ending -mmo , so is clearly not a final consonant. U+1A5D occurs in Northern Thai principally in one word, whose pronunciation is roughly /k?b??/. U+1A5D is not Consonant_Final in its phonetic effect. 
The word is a compound word (or perhaps just a visual compound), formed by chaining two syllables and striking out the duplicated characters. I have a text in which the constituents are to be encoded and , so the chained word may reasonably be encoded or . While all my examples of U+1A5E are word final, it seems to differ from on the basis of the room available for it. Both forms are used as a word final consonant. The only Pali consonant cluster ending in /s/ is /ss/, and that is written using U+1A54 TAI THAM LETTER GREAT SA, so a non-final will be rare. (I'm finding /ks/ written with U+1A47 TAI THAM LETTER HIGH SSA due to the application of RUKI.) However, I feel it would be rash to presume that every example of U+1A5E will be a final consonant. I have one new Consonant_Final: 0EDF ; Consonant_Final # Lo LAO LETTER KHMU NYO (was Other) See UTC document L2/10-335 for evidence. I have already submitted this omission as formal feedback. I have one possible new Vowel_Dependent: 1A7B ; Vowel_Dependent # Mn TAI THAM SIGN MAI SAM The value of its Indic_Matra_Category should be recorded as Top. I suspect renderers need to apply rearrangement rules to this mark, but I haven't experimented with other techniques yet. U+1A7B is principally a repetition mark, indicating the repetition of a word. As extensions of this role, it can also do at least the following: (1) Indicate a repeated (not geminate) consonant (2) Indicate an omitted implicit vowel (one omits an implicit vowel by replacing it with U+1A60) (3) Indicate an epenthetic vowel (extension of Role 2). In rôles (2) and (3), it serves as a dependent vowel. Richard. 
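As an aside, the proposed reclassification lines quoted in this message follow the UCD's "code point ; value # comment" layout, so a draft like this can be checked mechanically. A minimal, hypothetical sketch (the sample lines are copied from the message; the parser is illustrative only, and does not handle `cp..cp` ranges):

```python
# Parse UCD-style "code ; Value # comment" lines into {code point: value}.
LINES = """
0E4C ; Pure_Killer # Mn THAI CHARACTER THANTHAKHAT
0ECC ; Pure_Killer # Mn LAO CANCELLATION MARK
1A7A ; Pure_Killer # Mn TAI THAM SIGN RA HAAM
"""

def parse(lines):
    table = {}
    for line in lines.splitlines():
        line = line.split('#', 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        cp, value = (field.strip() for field in line.split(';'))
        table[int(cp, 16)] = value
    return table

table = parse(LINES)
print(table[0x0E4C])  # 'Pure_Killer'
```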
From richard.wordingham at ntlworld.com Sat May 17 06:10:31 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 17 May 2014 12:10:31 +0100 Subject: Unicode ranges with baseline/x-height/X-height In-Reply-To: <20140506213024.GA11256@powdermilk> References: <20140506213024.GA11256@powdermilk> Message-ID: <20140517121031.68a337f7@JRWUBU2> On Tue, 6 May 2014 14:30:25 -0700 Ilya Zakharevich wrote: > For the purpose of drawing characters from a secondary, substitution > font, one must know whether one must rescale bbox to bbox, or X-height > to X-height etc. > What do people use in 'real life' applications? I don't have an integrated system of my own, but matching bbox to bbox gives poor results. I've always felt that x-height to x-height will give the best results for mixing scripts. After all, the author may well have taken the trouble to ensure that the characters he has principally been working with are adequately matched by the obligatory ASCII (indeed, possibly even Latin-1) glyphs. LibreOffice changed its default size matching rules and thereby improved matching between different scripts in the same line, but I don't know what rules it actually adopted. Richard. 
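The x-height-to-x-height matching discussed here reduces to a single scale factor on the fallback font's point size; a hypothetical sketch (the metric values are invented for illustration, not taken from any real font):

```python
# Scale a fallback font so its x-height matches the primary font's.
# Metrics are in font units; upem is units-per-em.
def fallback_point_size(primary_size_pt,
                        primary_xheight, primary_upem,
                        fallback_xheight, fallback_upem):
    primary_xheight_pt = primary_size_pt * primary_xheight / primary_upem
    return primary_xheight_pt * fallback_upem / fallback_xheight

# A 12 pt primary font with x-height 500/1000 has a 6 pt x-height;
# a fallback with x-height 480/1000 must be set at 12.5 pt to match.
size = fallback_point_size(12, 500, 1000, 480, 1000)
print(size)  # 12.5
```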
From richard.wordingham at ntlworld.com Sun May 18 18:06:01 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 19 May 2014 00:06:01 +0100 Subject: Unicode Regular Expressions for Syllable Structure and Normalisation Message-ID: <20140519000601.63e694d2@JRWUBU2> While pondering the Indic Syllabic Category property and its application in regular expressions, I found myself worrying as to what Thai script expressions should match the 'regular expression' \p{isc=Consonant}\p{isc=Nukta}\p{isc=Vowel_Dependent} Now, with the present tables for the property Indic_Syllabic_Category, this is not a problem, but it so happens that U+0331 COMBINING MACRON BELOW serves as a nukta, and the problem I foresee will surface once it is assigned isc=Nukta, for U+0331 has canonical combining class 220 but U+0E38 THAI CHARACTER SARA U and U+0E39 THAI CHARACTER SARA UU have canonical combining class 103. Note that these problems do not arise with this expression when one adds U+0E3A THAI CHARACTER PHINTHU to the list of nuktas. Using U+0E07 THAI CHARACTER NGO NGU as the consonant, and considering the possibility of using U+034F COMBINING GRAPHEME JOINER to avoid rendering problems with naïve rendering engines, for which of the following strings ought a regular expression engine declare a match if U+0331 is given isc=Nukta? (Not in NFD) (In NFD, but not the specified order) (Extraneous character) I consulted UTS #18 Unicode Regular Expressions, and it appears from first glance that the requirement 'RL2.1 Canonical Equivalents' should supply the answer. However, that transpired not to contain any actual requirement! It does suggest a three-part strategy: 1. Putting the text to be matched into a defined normalization form (NFD or NFKD). 2. Having the user design the regular expression pattern to match against that defined normalization form. 
For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur. 3. Applying the matching algorithm on a code point by code point basis, as usual. Part 1 (NFD, not NFKD) is reasonable. Part 2 leaves me a bit confused. Taken literally, there is no problem with the pattern; it is in ASCII! However, expanding the pattern out to the possible sequences leaves one searching the NFD string for the non-NFD substring . Obviously, no matches will be found when Part 3 is applied. Is the correct solution to change the comprehensible regular expression to \p{isc=Consonant}(\p{isc=Nukta}\p{isc=Vowel_Dependent}| \p{isc=Vowel_Dependent}\p{isc=Nukta}) ? It does contain impossible sequences, but it will find the sequences that should be found. One drawback is that it will match , which one might expect to be a homograph of the canonically inequivalent sequence . (The combining marks do not interact typographically, but U+0E34 follows the Indic pattern of having canonical combining class 0. In practice, dotted circles are liable to appear for either string.) I have done a little mathematical work on regular expressions and canonical equivalence, but these were true regular expressions, i.e. recognisable by finite automata. I worked with NFD strings, and came to the conclusion that the result of concatenating strings should be defined as the result of character-wise concatenation followed by normalisation. Even with this definition, concatenations of regular expressions are still regular expressions, as in the standard theory. (Character-wise concatenation, excluding non-NFD juxtapositions, also yields a regular expression - the set of NFD strings can be defined by a regular expression.) 
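The combining-class clash behind the Thai example above is easy to verify directly; a small Python sketch (treating U+0331 as a nukta is the hypothetical property assignment under discussion, not current UCD data):

```python
import unicodedata

# Canonical combining classes drive reordering under NFD:
# COMBINING MACRON BELOW (the would-be nukta) outranks SARA U.
assert unicodedata.combining('\u0331') == 220
assert unicodedata.combining('\u0E38') == 103

typed = '\u0E07\u0331\u0E38'  # <NGO NGU, macron below, SARA U> as typed
nfd = unicodedata.normalize('NFD', typed)
# NFD puts the vowel (ccc 103) before the nukta (ccc 220), so a pattern
# written Consonant-Nukta-Vowel can never match NFD text literally.
print(nfd == '\u0E07\u0E38\u0331')  # True
```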
I came to the disappointing conclusion that (\p{name=COMBINING DOT BELOW}\p{name=COMBINING ACUTE ACCENT})* was not a true regular expression, at least, not in any sense that allows the expression to denote an infinite set of strings. One could define * to be restricted to character-wise concatenations that yielded NFD strings, but this is potentially very confusing. It might be argued to be in line with RL2.1 in UTS #18. If one takes the approach I've outlined of including normalisation in the concatenation operation, then one can revert to the definition \p{isc=Consonant}\p{isc=Nukta}\p{isc=Vowel_Dependent} and this will, because it only considers text to be searched once it has been converted to NFD, match both and , but not . I don't know how acceptable this approach is. Does anyone use it? The handling of joiners, non-joiners and disruptors (i.e. U+034F) is yet another topic. Richard. From wl at gnu.org Mon May 19 10:40:45 2014 From: wl at gnu.org (Werner LEMBERG) Date: Mon, 19 May 2014 17:40:45 +0200 (CEST) Subject: question to Akkadian Message-ID: <20140519.174045.241962631.wl@gnu.org> Folks, I'm trying to find an encoding of the following Akkadian cuneiform: ___ ___ ___ \ / \ / \ / | | | | /| | /| | | \| | \| | | | | |\_______ |/ My knowledge of cuneiforms is zero, but I can read Unicode tables :-) However, I haven't found it in the Akkadian cuneiforms block. Either I've missed it, or it gets represented as a ligature, or ... In case it is a ligature: Where should I look to find well drawn glyphs? Or to formulate it more generally: If I have a cuneiform text, where can I find glyph images to identify them? Werner -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cuneiform.png Type: image/png Size: 757 bytes Desc: not available URL: From tom at bluesky.org Mon May 19 11:11:31 2014 From: tom at bluesky.org (Tom Gewecke) Date: Mon, 19 May 2014 09:11:31 -0700 Subject: question to Akkadian In-Reply-To: <20140519.174045.241962631.wl@gnu.org> References: <20140519.174045.241962631.wl@gnu.org> Message-ID: <3C6BD2C8-277D-42F1-880D-110BCD83F81F@bluesky.org> On May 19, 2014, at 8:40 AM, Werner LEMBERG wrote: > If I have a cuneiform > text, where can I find glyph images to identify them? You might want to specify what you mean by "text". A photo of an inscription? Something from a printed book? Because of the considerable variation in glyphs over the long time period when this script was used, you may need to consult a reference that tries to cover that, like Labat's Manuel d'épigraphie Akkadienne. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wl at gnu.org Mon May 19 11:21:51 2014 From: wl at gnu.org (Werner LEMBERG) Date: Mon, 19 May 2014 18:21:51 +0200 (CEST) Subject: question to Akkadian In-Reply-To: <3C6BD2C8-277D-42F1-880D-110BCD83F81F@bluesky.org> References: <20140519.174045.241962631.wl@gnu.org> <3C6BD2C8-277D-42F1-880D-110BCD83F81F@bluesky.org> Message-ID: <20140519.182151.215362624.wl@gnu.org> >> If I have a cuneiform text, where can I find glyph images to >> identify them? > > You might want to specify what you mean by "text". A photo of an > inscription? Something from a printed book? I'm interested in representing one of the so-called Hurrian songs (tablet H.6, containing musical notation) with Unicode, cf. https://en.wikipedia.org/wiki/Hurrian_songs A much better drawing of the tablet can be found here on page 503: http://digital.library.stonybrook.edu/cdm/ref/collection/amar/id/7250 The character in question is the first one on the left after the double line. 
A nice article on this song can be found here: http://individual.utoronto.ca/seadogdriftwood/Hurrian/Website_article_on_Hurrian_Hymn_No._6.html Werner From tom at bluesky.org Mon May 19 12:28:58 2014 From: tom at bluesky.org (Tom Gewecke) Date: Mon, 19 May 2014 10:28:58 -0700 Subject: question to Akkadian In-Reply-To: <20140519.182151.215362624.wl@gnu.org> References: <20140519.174045.241962631.wl@gnu.org> <3C6BD2C8-277D-42F1-880D-110BCD83F81F@bluesky.org> <20140519.182151.215362624.wl@gnu.org> Message-ID: <84C105ED-14C6-40DB-8B3C-84770629D200@bluesky.org> On May 19, 2014, at 9:21 AM, Werner LEMBERG wrote: > > I'm interested in representing one of the so-called Hurrian songs > (tablet H.6, containing musical notation) with Unicode, cf. > > https://en.wikipedia.org/wiki/Hurrian_songs That says it represents qáb, which seems to be a version of Labat 88, which is U+1218F KAB. Unfortunately none of my fonts give the version shown in that drawing, but there may be one. Photo from Labat attached. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: kab.jpeg Type: image/jpg Size: 15847 bytes Desc: not available URL: From wl at gnu.org Mon May 19 14:13:48 2014 From: wl at gnu.org (Werner LEMBERG) Date: Mon, 19 May 2014 21:13:48 +0200 (CEST) Subject: question to Akkadian In-Reply-To: <84C105ED-14C6-40DB-8B3C-84770629D200@bluesky.org> References: <3C6BD2C8-277D-42F1-880D-110BCD83F81F@bluesky.org> <20140519.182151.215362624.wl@gnu.org> <84C105ED-14C6-40DB-8B3C-84770629D200@bluesky.org> Message-ID: <20140519.211348.277698877.wl@gnu.org> >> I'm interested in representing one of the so-called Hurrian songs >> (tablet H.6, containing musical notation) with Unicode, cf. >> >> https://en.wikipedia.org/wiki/Hurrian_songs > > That says it represents qáb, which seems to be a version of Labat > 88, which is U+1218F KAB. 
> > Unfortunately none of my fonts give the version shown in that > drawing, but there may be one. Thanks a lot! Will try to get the book you've mentioned... BTW, it seems to me that cuneiforms would benefit enormously by introducing variant selectors, collecting all cuneiform variants in a database similar to the CJK stuff. Werner From richard.wordingham at ntlworld.com Tue May 20 03:04:10 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 20 May 2014 09:04:10 +0100 Subject: Unicode Regular Expressions for Syllable Structure and Normalisation In-Reply-To: <20140519000601.63e694d2@JRWUBU2> References: <20140519000601.63e694d2@JRWUBU2> Message-ID: <20140520090410.76a72446@JRWUBU2> On Mon, 19 May 2014 00:06:01 +0100 Richard Wordingham wrote: > I have done a little mathematical work on regular expressions and > canonical equivalence, but these were true regular expressions, i.e. > recognisable by finite automata. > I came to the disappointing conclusion that > > (\p{name=COMBINING DOT BELOW}\p{name=COMBINING ACUTE ACCENT})* > > was not a true regular expression, at least, not in any sense that > allows the expression to denote an infinite set of strings. One could > define * to be restricted to character-wise concatenations that > yielded NFD strings, but this is potentially very confusing. It > might be argued to be in line with RL2.1 in UTS #18. It has been observed off-list that a literal search of an NFD string will obviously not match \u0323\u0301\u0323\u0301. To eliminate any confusion induced by using the example of searching, I will now talk about the 'regular expression' x(\u0323\u0301)*y. (I did my work with the conventional task of recognising patterns.) My aim in the mathematical work was to find a match if the pattern matches anything canonically equivalent to the searched string. 
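The shape of the canonically equivalent strings can be shown concretely with Python's unicodedata (a sketch of the argument, not a proof):

```python
import unicodedata

# Under NFD all DOT BELOW marks (ccc 220) sort before all ACUTE ACCENT
# marks (ccc 230), so n copies of the pair become n dots then n acutes:
for n in range(5):
    s = 'x' + '\u0323\u0301' * n + 'y'
    assert (unicodedata.normalize('NFD', s)
            == 'x' + '\u0323' * n + '\u0301' * n + 'y')
# A matcher respecting canonical equivalence must accept exactly the
# strings x (dot^n)(acute^n) y -- a language that requires counting,
# which no finite automaton can do.
print('ok')
```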
I accept that the matched substring will in general be discontiguous, and that if the searched string is not in NFD the substring will contain only parts of some characters. So, when searching an NFD string for what I thought of as 'x(\u0323\u0301)*y', that is a search for any of 'xy', 'x\u0323\u0301y', 'x\u0323\u0323\u0301\u0301y', 'x\u0323\u0323\u0323\u0301\u0301\u0301y' and so on. Those who have studied the mathematical theory of regular expressions should recall that there is no *finite* automaton that will implement this search. Therefore, by definition, the pattern is not a true regular expression. That is how I came to my 'disappointing conclusion' above. I think this limitation is not actually a problem, though it would be something to bear in mind. Length marks in Tibetan vowels and nuktas on Thai vowels looked the likeliest problem areas, but there seem not to be problems. Incidentally, although I have spoken of searching an NFD string, if one has a finite automaton to do that, one can (given enough time and memory) create another finite automaton that will do the same job on unnormalised strings. The trick to the *proof* is to remember the latest batch of non-starters in the unrearranged decomposition of the search string, grouped by canonical combining class. (Equivalence classes can be used to keep the numbers finite.) Whether this is useful may be another matter! Now, this process is very similar to converting the expansions of the pattern string to NFD and then searching for them. However, that conversion is itself a complicated business, similar to compiling a regular expression. One difference is that I would like a search for the Vietnamese vowel U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (decomposes to 0061 0302) to match the same vowel with a tone mark for the nặng tone, U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW (decomposes to 0061 0323 0302). 
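The Vietnamese case just mentioned can be sketched in a few lines of Python; `matches_at_start` is a hypothetical helper (not from any standard library) that skips combining marks whose canonical combining class differs from the pattern character's, since such marks cannot interact with it:

```python
import unicodedata

acirc = unicodedata.normalize('NFD', '\u00E2')       # 'a' + U+0302
a_circ_dot = unicodedata.normalize('NFD', '\u1EAD')  # 'a' + U+0323 + U+0302
assert acirc == 'a\u0302' and a_circ_dot == 'a\u0323\u0302'
assert acirc not in a_circ_dot  # plain substring search misses the match

def matches_at_start(pattern, text):
    """Match an NFD pattern at the start of NFD text, skipping combining
    marks of a different ccc (they cannot interact typographically)."""
    it = iter(text)
    for pc in pattern:
        for tc in it:
            if tc == pc:
                break
            if unicodedata.combining(tc) == 0 or \
               unicodedata.combining(tc) == unicodedata.combining(pc):
                return False  # a real mismatch, not an ignorable mark
        else:
            return False      # text exhausted before pattern finished
    return True

print(matches_at_start(acirc, a_circ_dot))  # True
```

This is only a sketch of one direction of the problem: a full implementation would also bound the skipped marks to a single combining sequence.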
As I asked before, is anyone using a regular expression system where the potentially complicated pattern string does not imply NFD strings but the search respects canonical equivalence? There are systems (e.g. HarfBuzz) that resort to changing the canonical combining classes to make things work. Richard. From richard.wordingham at ntlworld.com Tue May 27 17:18:22 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 27 May 2014 23:18:22 +0100 Subject: Unicode Sets in 'Unicode Regular Expressions' Message-ID: <20140527231822.79f9fd18@JRWUBU2> UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3 'Subtraction and Intersection' talks of Unicode sets. What is the relevant definition of a 'Unicode set'? Is it a finite set of non-empty strings? Other possibilities that occur to me, depending on context, include sets of codepoints and sets of indecomposable codepoints. Richard. From addison at lab126.com Tue May 27 17:36:04 2014 From: addison at lab126.com (Phillips, Addison) Date: Tue, 27 May 2014 22:36:04 +0000 Subject: Unicode Sets in 'Unicode Regular Expressions' In-Reply-To: <20140527231822.79f9fd18@JRWUBU2> References: <20140527231822.79f9fd18@JRWUBU2> Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB51FCCE134@ex10-mbx-9007.ant.amazon.com> A "Unicode set" in this context means "a set of code points". This is discussed in section 1.2: -- This is done by providing syntax for sets of characters based on the Unicode character properties, and allowing them to be mixed with lists and ranges of individual code points. -- More generally, there is no term "Unicode set" defined, although it is referred to in places such as RL1.3 as a shorthand. It merely means "the set of all code points selected" (by whatever selection, subtraction, intersection, or differencing has been applied beginning from the Universal Character Set as a whole). Or at least this is how I have always read it. 
Addison > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard > Wordingham > Sent: Tuesday, May 27, 2014 3:18 PM > To: unicode at unicode.org > Subject: Unicode Sets in 'Unicode Regular Expressions' > > UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3 > 'Subtraction and Intersection' talks of Unicode sets. What is the relevant > definition of a 'Unicode set'? Is it a finite set of non-empty strings? Other > possibilities that occur to me, depending on context, include sets of codepoints > and sets of indecomposable codepoints. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From ken.whistler at sap.com Tue May 27 17:44:45 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Tue, 27 May 2014 22:44:45 +0000 Subject: Unicode Sets in 'Unicode Regular Expressions' In-Reply-To: <20140527231822.79f9fd18@JRWUBU2> References: <20140527231822.79f9fd18@JRWUBU2> Message-ID: http://userguide.icu-project.org/strings/unicodeset Whenever UTS #18 talks of "Unicode sets", it means whatever is actually defined in the class UnicodeSet in ICU. --Ken > UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3 > 'Subtraction and Intersection' talks of Unicode sets. What is the > relevant definition of a 'Unicode set'? Is it a finite set of non-empty > strings? Other possibilities that occur to me, depending on context, > include sets of codepoints and sets of indecomposable codepoints. > > Richard. 
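To illustrate the "set of code points" reading, RL1.3-style intersection and subtraction become ordinary set algebra once properties are materialised as Python sets. A hedged sketch (the Thai block boundary is hard-coded here for illustration, and real engines compute such sets far more lazily than this brute-force scan):

```python
import unicodedata

def prop_set(predicate):
    """All code points whose character satisfies the predicate."""
    return {cp for cp in range(0x110000) if predicate(chr(cp))}

letters = prop_set(lambda c: unicodedata.category(c).startswith('L'))
thai_block = set(range(0x0E00, 0x0E80))  # Thai block, U+0E00..U+0E7F

# [\p{L}&&\p{InThai}] and [\p{InThai}-\p{L}] as set operations:
thai_letters = letters & thai_block
thai_non_letters = thai_block - letters

print(0x0E07 in thai_letters)      # NGO NGU (Lo) is a letter: True
print(0x0E4F in thai_non_letters)  # FONGMAN (Po) is not: True
```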
From ruland at luckymail.com Tue May 27 17:56:40 2014 From: ruland at luckymail.com (=?UTF-8?B?Q2hhcmxpZSBSdWxhbmQg4piY?=) Date: Wed, 28 May 2014 00:56:40 +0200 Subject: Unicode Sets in 'Unicode Regular Expressions' In-Reply-To: <20140527231822.79f9fd18@JRWUBU2> References: <20140527231822.79f9fd18@JRWUBU2> Message-ID: <53851828.6050205@luckymail.com> This is from the introduction to UTS#18: "Unicode is a large character set [...]" So I take "Unicode set" to mean "set of Unicode characters" with their respective codepoints, whether decomposable or not. Charlie ☘ Richard Wordingham schrieb: > UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3 > 'Subtraction and Intersection' talks of Unicode sets. What is the > relevant definition of a 'Unicode set'? Is it a finite set of non-empty > strings? Other possibilities that occur to me, depending on context, > include sets of codepoints and sets of indecomposable codepoints. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue May 27 19:19:26 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 28 May 2014 01:19:26 +0100 Subject: Unicode Sets in 'Unicode Regular Expressions' In-Reply-To: <53851828.6050205@luckymail.com> References: <20140527231822.79f9fd18@JRWUBU2> <53851828.6050205@luckymail.com> Message-ID: <20140528011926.020c4d72@JRWUBU2> On Wed, 28 May 2014 00:56:40 +0200 Charlie Ruland ☘ wrote: > So I take "Unicode set" to mean "set of Unicode characters" with > their respective codepoints, whether decomposable or not. The decomposability issue arises when trying to follow RL2.1 "Canonical Equivalence". In a pattern such as "f\p{L}te", \p{L} is not just a set of codepoints if the pattern is to be matched by "fête" when processing NFD strings. 
This is one reason I think Ken is right when he says the ICU meaning is intended. I believe I have a coherent resolution of RL2.1, but I'm currently wrestling with the other requirements that an implementation satisfying the spirit of RL2.1 ought to address. Richard. From mark at macchiato.com Tue May 27 23:42:13 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 28 May 2014 06:42:13 +0200 Subject: Unicode Sets in 'Unicode Regular Expressions' In-Reply-To: <20140527231822.79f9fd18@JRWUBU2> References: <20140527231822.79f9fd18@JRWUBU2> Message-ID: They are defined in http://unicode.org/reports/tr35/tr35.html#Unicode_Sets. We should add a pointer to that; could you please file a feedback report for #18 to that effect? Also, if you find any problems in the description in #35, you can file a ticket at http://unicode.org/cldr/trac/newticket to get them addressed. Mark *« Il meglio è l'inimico del bene »* On Wed, May 28, 2014 at 12:18 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3 > 'Subtraction and Intersection' talks of Unicode sets. What is the > relevant definition of a 'Unicode set'? Is it a finite set of non-empty > strings? Other possibilities that occur to me, depending on context, > include sets of codepoints and sets of indecomposable codepoints. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Wed May 28 15:56:16 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 28 May 2014 21:56:16 +0100 Subject: Unicode Sets in 'Unicode Regular Expressions' In-Reply-To: References: <20140527231822.79f9fd18@JRWUBU2> Message-ID: <20140528215616.00fa0bd0@JRWUBU2> On Wed, 28 May 2014 06:42:13 +0200 Mark Davis ☕️ wrote: > They are defined in > http://unicode.org/reports/tr35/tr35.html#Unicode_Sets. We should add > a pointer to that; could you please file a feedback report for #18 to > that effect? Fed back as requested. > Also, if you find any problems in the description in #35, you can > file a ticket at http://unicode.org/cldr/trac/newticket to get them > addressed. http://unicode.org/cldr/trac/ticket/7507 submitted. Richard. From richard.wordingham at ntlworld.com Thu May 29 17:39:56 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 29 May 2014 23:39:56 +0100 Subject: Long-Encoded Restricted Characters in High Frequency Modern Use Message-ID: <20140529233956.5db1ea5e@JRWUBU2> I am a little confused by the call for a review of UTS #39, Unicode Security Mechanisms (PRI #273). Are we being requested to report long-encoded 'restricted' characters in high frequency modern use? 'Restricted' refers to the classification in xidmodifications.txt. One linked pair of long-encoded restricted characters in high frequency use is U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM, which occurs in the extremely common Thai and Lao words for 'water' or 'liquid in general' น้ำ ນ້ຳ whose NFKC decompositions are the nonsensical forms น้ํา ນ້ໍາ, but may be faked by the linguistically incorrect นํ้า ນໍ້າ. In Thai the encodings are , and . Now, U+0E4D THAI CHARACTER NIKHAHIT is classified as 'allowed; recommended', although its main use is in writing Pali, which would suggest that it should be 'restricted; historic' or 'restricted; limited-use'. 
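The SARA AM behaviour described above can be demonstrated directly; a small Python sketch using the Thai word for 'water', encoded <NO NU, MAI THO, SARA AM>:

```python
import unicodedata

nam = '\u0E19\u0E49\u0E33'  # NO NU + MAI THO + SARA AM ('water')
nfkc = unicodedata.normalize('NFKC', nam)
# SARA AM compatibility-decomposes to NIKHAHIT + SARA AA, stranding the
# nikhahit after the tone mark -- the 'nonsensical' order:
assert nfkc == '\u0E19\u0E49\u0E4D\u0E32'
# The visually similar fake puts the nikhahit *before* the tone mark;
# it is not canonically equivalent to the NFKC form:
fake = '\u0E19\u0E4D\u0E49\u0E32'
same = unicodedata.normalize('NFC', fake) == unicodedata.normalize('NFC', nfkc)
print(same)  # False
```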
The situation is not so clear for Lao - U+0ECD LAO NIGGAHITA is a fairly common vowel in the Lao language. To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised as 'restricted; technical'. They are all in use in the Khmer language. U+17CB KHMER SIGN BANTOC is required for the main methods of writing the Khmer vowels /a/ and /ɑ/. U+17CC KHMER SIGN ROBAT is a repha, but I would be surprised to learn that it has recently become little-used. It is, however, readily confused with U+17CD KHMER SIGN TOANDAKHIAT, a 'pure killer' whose main modern use is to show that a consonant is silent, rather like the Thai letter U+0E4C THAI CHARACTER THANTHAKHAT. (The names are the same.) The confusion arises because Sanskrit -rCa was pronounced /-r/ in Khmer, and final /r/ recently became silent in Khmer, so the effect of the Sanskrit /r/ is now to silence the final consonant. While U+17CE KHMER SIGN KAKABAT and U+17CF KHMER SIGN AHSDA may not be common, they are still in modern use. Although U+17D0 KHMER SIGN SAMYOK SANNYA may have declined in frequency, it has not dropped out of use and is still a common enough way of writing the vowel /a/. Richard. From richard.wordingham at ntlworld.com Thu May 29 18:37:55 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 30 May 2014 00:37:55 +0100 Subject: Long-Encoded Restricted Characters in High Frequency Modern Use In-Reply-To: <20140529233956.5db1ea5e@JRWUBU2> References: <20140529233956.5db1ea5e@JRWUBU2> Message-ID: <20140530003755.40e6f41f@JRWUBU2> On Thu, 29 May 2014 23:39:56 +0100 Richard Wordingham wrote: > To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER > SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised > as 'restricted; technical'. They are all in use in the Khmer language. Indeed, all but U+17CE can be found on the single web page http://www.mrcmekong.org/khmer . 
Finding U+17CE KHMER SIGN KAKABAT is a bit trickier - there seems to be a choice of religious writing (both Christian and Buddhist) and blogs. Anyway, there are several different words sporting it at http://r-b-u.blogspot.com/2014/03/rbu_4.html . Richard. From wl at gnu.org Thu May 29 23:20:40 2014 From: wl at gnu.org (Werner LEMBERG) Date: Fri, 30 May 2014 06:20:40 +0200 (CEST) Subject: tablature characters for the Chinese guqin Message-ID: <20140530.062040.504340524.wl@gnu.org> Folks, there are two different tablature systems for the guqin (a Chinese zither): 工尺譜 gōngchěpǔ and 減字譜 jiǎnzìpǔ. Both systems contain various `composite' CJK characters like the two attached to this e-mail. It seems to me that they aren't encoded in Unicode, and I wonder whether this is already on the radar, or whether they should be encoded piecewise. https://en.wikipedia.org/wiki/File:Shenqi_Mipu_vol_3_pg_1.jpg https://en.wikipedia.org/wiki/File:Kam_Hok_Yap_Mun-Yeung_Kwan_Sam_Tip.jpg Werner -------------- next part -------------- A non-text attachment was scrubbed... Name: gongchepu.jpg Type: image/jpeg Size: 5276 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: jianzipu.jpg Type: image/jpeg Size: 1081 bytes Desc: not available URL: From wl at gnu.org Fri May 30 00:47:51 2014 From: wl at gnu.org (Werner LEMBERG) Date: Fri, 30 May 2014 07:47:51 +0200 (CEST) Subject: ["Unicode"] tablature characters for the Chinese guqin In-Reply-To: <53880DFA.7040004@hiroshima-u.ac.jp> References: <20140530.062040.504340524.wl@gnu.org> <53880DFA.7040004@hiroshima-u.ac.jp> Message-ID: <20140530.074751.54740794.wl@gnu.org> > China National Body had ever reported that they had a plan to > encode the character for the tablature, in IRG: [...] Thanks. > BTW, a few (only one?) characters for the latter style are sampled > in a normal dictionary "CiYuan", and will be included in CJK Unified > Ideograph Extension F. 
However, I don't think encoding only one > glyph for the tablature is so useful - there is any avantgarde > number using only one note? Well, the very structure of the guqin tablature characters is this: mod1 mod2 base char mod3 mod4 The number of `modifiers' varies, but according to the literature it can go up to six and more. Given that a modifier is usually a digit, and that we have a bunch of base characters, we easily reach 100000 and more characters if all possible permutations are encoded, and this is certainly not what Unicode wants :-) Werner From mpsuzuki at hiroshima-u.ac.jp Fri May 30 01:34:23 2014 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 30 May 2014 15:34:23 +0900 Subject: ["Unicode"] tablature characters for the Chinese guqin In-Reply-To: <20140530.074751.54740794.wl@gnu.org> References: <20140530.062040.504340524.wl@gnu.org> <53880DFA.7040004@hiroshima-u.ac.jp> <20140530.074751.54740794.wl@gnu.org> Message-ID: <5388266F.8010106@hiroshima-u.ac.jp> It seems that my first response to this discussion was not delivered because my attachment image was too big. I'm sorry, please let me post revised version... -- China National Body had ever reported that they had a plan to encode the character for the tablature, in IRG: http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg32/IRGN1574_ChinaActivityReportIRG32.doc When I asked the progress of the plan in later IRG meeting, they commented "nothing to report in the project". I wish if there is something ongoing in Chinese Character Repertoire project (oh, it was announced to be finished in 2015!). BTW, a character for the latter style was sampled in a normal Chinese dictionary "CiYuan", and China NB submission to CJK Unified Ideograph Extension F includes it (G_CY2255). 
However, there was a comment that the evidence image was not clearly scanned, so G_CY2255 is being queued in the postponed list (see PDF in http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg41/IRGN1979Appendix1_PostponedIRG40And41.zip ) Regards, mpsuzuki Werner LEMBERG wrote: >> China National Body had ever reported that they had a plan to >> encode the character for the tablature, in IRG: [...] > > Thanks. > >> BTW, a few (only one?) characters for the latter style are sampled >> in a normal dictionary "CiYuan", and will be included in CJK Unified >> Ideograph Extension F. However, I don't think encoding only one >> glyph for the tablature is so useful - there is any avantgarde >> number using only one note? > > Well, the very structure of the guqin tablature characters is this: > > mod1 mod2 > > base char > > mod3 mod4 > > The number of `modifiers' varies, but according to the literature it > can go up to six and more. Given that a modifier is usually a digit, > and that we have a bunch of base characters, we easily reach 100000 > and more characters if all possible permutations are encoded, and this > is certainly not what Unicode wants :-) > > > Werner From mpsuzuki at hiroshima-u.ac.jp Fri May 30 01:39:25 2014 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 30 May 2014 15:39:25 +0900 Subject: ["Unicode"] tablature characters for the Chinese guqin In-Reply-To: <20140530.074751.54740794.wl@gnu.org> References: <20140530.062040.504340524.wl@gnu.org> <53880DFA.7040004@hiroshima-u.ac.jp> <20140530.074751.54740794.wl@gnu.org> Message-ID: <5388279D.9020608@hiroshima-u.ac.jp> > and that we have a bunch of base characters, we easily reach 100000 > and more characters if all possible permutations are encoded, and this > is certainly not what Unicode wants :-) Indeed. Some people may want to encode the tablature characters as precomposed glyphs in square metric, and unify with Hanzi. 
The separated encoding as combining character would be better to process as a musical content, in my personal opinion. There is a joke saying Unicode is a Hanzi collection because the biggest part is CJK Unified Ideograph, but the precomposed encoding of guqin characters would modify it as "Unicode is now a musical note collection". Regards, mpsuzuki Werner LEMBERG wrote: >> China National Body had ever reported that they had a plan to >> encode the character for the tablature, in IRG: [...] > > Thanks. > >> BTW, a few (only one?) characters for the latter style are sampled >> in a normal dictionary "CiYuan", and will be included in CJK Unified >> Ideograph Extension F. However, I don't think encoding only one >> glyph for the tablature is so useful - there is any avantgarde >> number using only one note? > > Well, the very structure of the guqin tablature characters is this: > > mod1 mod2 > > base char > > mod3 mod4 > > The number of `modifiers' varies, but according to the literature it > can go up to six and more. 
Given that a modifier is usually a digit, > and that we have a bunch of base characters, we easily reach 100000 > and more characters if all possible permutations are encoded, and this > is certainly not what Unicode wants :-) > > > Werner > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From mpsuzuki at hiroshima-u.ac.jp Thu May 29 23:50:02 2014 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 30 May 2014 13:50:02 +0900 Subject: ["Unicode"] tablature characters for the Chinese guqin In-Reply-To: <20140530.062040.504340524.wl@gnu.org> References: <20140530.062040.504340524.wl@gnu.org> Message-ID: <53880DFA.7040004@hiroshima-u.ac.jp> Dear Werner, China National Body had ever reported that they had a plan to encode the character for the tablature, in IRG: http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg32/IRGN1574_ChinaActivityReportIRG32.doc When I asked the progress of the plan in later IRG meeting, they commented "nothing to report in the project". I wish if there is something ongoing in Chinese Character Repertoire project (oh, it was announced to be finished in 2015!). BTW, a few (only one?) characters for the latter style are sampled in a normal dictionary "CiYuan", and will be included in CJK Unified Ideograph Extension F. However, I don't think encoding only one glyph for the tablature is so useful - there is any avantgarde number using only one note? Regards, mpsuzuki P.S. Attached is IRG42 t-shirt of a tablature(?) taken from Dunhuang manuscript (Pelliot P3808). Werner LEMBERG wrote: > Folks, > > > there are two different tablature systems for the guqin (a Chinese > zither): ??? g?ngch?p? and ??? ji?nz?p?. Both systems contain > various `composite' CJK characters like the two attached to this > e-mail. It seems to me that they aren't encoded in Unicode, and I > wonder whether this is already on the radar, or whether they should be > encoded piecewise. 
> > https://en.wikipedia.org/wiki/File:Shenqi_Mipu_vol_3_pg_1.jpg > https://en.wikipedia.org/wiki/File:Kam_Hok_Yap_Mun-Yeung_Kwan_Sam_Tip.jpg > > > Werner > > > ------------------------------------------------------------------------ > > > ------------------------------------------------------------------------ > > > ------------------------------------------------------------------------ > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- A non-text attachment was scrubbed... Name: irg42-tshirt.jpg Type: image/jpeg Size: 106148 bytes Desc: not available URL: From andrewcwest at gmail.com Fri May 30 13:13:32 2014 From: andrewcwest at gmail.com (Andrew West) Date: Fri, 30 May 2014 19:13:32 +0100 Subject: ["Unicode"] tablature characters for the Chinese guqin In-Reply-To: <53880DFA.7040004@hiroshima-u.ac.jp> References: <20140530.062040.504340524.wl@gnu.org> <53880DFA.7040004@hiroshima-u.ac.jp> Message-ID: On 30 May 2014 05:50, suzuki toshiya wrote: > > BTW, a few (only one?) characters for the latter style are > sampled in a normal dictionary "CiYuan", and will be included > in CJK Unified Ideograph Extension F. I hope not. Just because it occurs in a Chinese dictionary does not mean that it is a Han ideograph, and guqin tablature signs most definitely are not Han ideographs. The component elements of guqin tablature signs should be encoded in a separate block, with an encoding model that allows for the composition of arbitrary tablature signs by fonts. > However, I don't think > encoding only one glyph for the tablature is so useful -> there is any avantgarde number using only one note? It would be extremely unuseful to do so. > Attached is IRG42 t-shirt of a tablature(?) taken from Dunhuang > manuscript (Pelliot P3808). Yes, it is tablature used for the pipa lute during the Tang dynasty. 
I have a table of pipa tablature signs at: http://babelstone.blogspot.co.uk/2012/12/one-to-twenty-in-jurchen-khitan-and-lute.html#Lute And I discuss Song and Yuan dynasty flute tablature signs at: http://babelstone.blogspot.co.uk/2012/12/one-to-ten-in-tangut-and-flute.html Glyphs for both flute and pipa tablature signs are available in my BabelStone Han font in the PUA at E000..E01D and E020..E04B respectively. Andrew From public at khwilliamson.com Fri May 30 13:26:18 2014 From: public at khwilliamson.com (Karl Williamson) Date: Fri, 30 May 2014 12:26:18 -0600 Subject: Corrigendum #9 Message-ID: <5388CD4A.4060704@khwilliamson.com> I'm having a problem with this http://www.unicode.org/versions/corrigendum9.html Some people now think it means that noncharacters are really no different from private-use characters, and should be treated very similarly if not identically. It seems to me that they should be illegal in open interchange, or perhaps illegal in interchange without prior agreement. Any system (process or group of related, cooperating processes) that uses noncharacters will want to not have any of the ones it uses present in its inputs. It will want to filter them out of those inputs, likely turning each into a REPLACEMENT CHARACTER. If it fails to do that, it leaves itself vulnerable to an attack by hackers, who can fool it into thinking the input data is different from what it really is. Hence, a system that creates outputs containing noncharacters cannot be assured that any other system will accept those noncharacters. Thus, I don't see how noncharacters can be considered to be valid in public interchange, given that the producers have to assume that the consumers will not accept them. Producers can assume that consumers will accept private-use characters, though they may not know their intent. I think the text in 6.2 section 16.7 is good and does not need to be changed: "Noncharacters ... 
are forbidden for use in open interchange of Unicode text data" Perhaps a bit better wording would be, "are forbidden for use in interchange of Unicode text data without prior agreement" The only reason I can think of for your too-large (in my opinion) backing away from what TUS has said about noncharacters since their inception is to accommodate processes that conform to C7, "that purports to not modify the interpretation of a valid coded character sequence". But, I think there is a better way to do that than what Corrigendum #9 currently says. I also am curious as to why the consecutive group of 32 noncharacters can't be split off into its own block instead of being part of an Arabic one. I'm unaware of any stability policy forbidding this. Another block is to be split, if I recall correctly, to accommodate the new Cherokee characters. From richard.wordingham at ntlworld.com Fri May 30 13:45:08 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 30 May 2014 19:45:08 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 Message-ID: <20140530194508.357ae366@JRWUBU2> Is there any good reason for UTS#18 'Unicode Regular Expressions' to express its requirements in terms of codepoints rather than scalar values? I was initially worried by RL1.1 requiring that one be able to specify surrogate codepoints in a pattern. It would not be compliant for an application to reject such patterns as syntactically or semantically incorrect! RL1.1 seemed to prohibit compliant regular expression engines that only handled well-formed UTF-8 strings. Furthermore, consider attempting to handle CESU-8 text as a sequence of UTF-8 code units. The code unit sequence for U+10000 will, corresponding to the UTF-16 code unit sequence D800 DC00, be ED A0 80 ED B0 80. 
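The byte arithmetic here can be checked with a short sketch (Python; the `surrogatepass` error handler stands in for a real CESU-8 codec, which the standard library does not provide, so the per-surrogate step below is an emulation, not a library feature):

```python
# U+10000 is first split into the UTF-16 surrogate pair D800 DC00;
# CESU-8 then encodes each surrogate as its own three-byte sequence.
hi, lo = "\ud800", "\udc00"
cesu8 = hi.encode("utf-8", "surrogatepass") + lo.encode("utf-8", "surrogatepass")
assert cesu8 == bytes.fromhex("EDA080EDB080")

# Real UTF-8 encodes the same character as one four-byte sequence,
# so a strict UTF-8 decoder rejects the CESU-8 bytes outright.
assert "\U00010000".encode("utf-8") == bytes.fromhex("F0908080")
try:
    cesu8.decode("utf-8")
except UnicodeDecodeError:
    pass  # the six bytes are ill-formed UTF-8, as described above
```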
If one follows the lead of the 'best practice' for processing ill-formed UTF-8 code unit sequences given in TUS Section 5.22, this will be interpreted as *four* ill-formed sequences, ED A0, 80, ED B0, and 80. I am not aware of any recommendation as to how to interpret these sequences as codepoints. While being able to specify a search for surrogate codepoint U+D800 might be useful when dealing with ill-formed UTF-16 Unicode sequences, UTS#18 Section 1.7, which discusses requirement RL1.7, states that there is no requirement for a one-codepoint pattern such as \u{D800} to match a UTF-16 Unicode string consisting just of one code unit with the value 0xD800. The convenient, possibly intended, consequence of this is that the RL1.1 requirement to allow patterns to specify surrogate codepoints can be satisfied by simply treating them as unmatchable; For example, such a 1-character RE could be treated as the empty Unicode set [\p{gc=Lo} && \p{gc=Mn}]. Now, I suppose one might want to specify a match for ill-formed (in context) UTF-8 code unit subsequences such as E0 80 (not a valid initial subsequence) and E0 A5 (lacking a trailing byte), but as matching is not required, I don't see the point in UTS#18 being changed to ask for an appropriate syntax to be added. Richard. From asmusf at ix.netcom.com Fri May 30 13:49:00 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 30 May 2014 11:49:00 -0700 Subject: Corrigendum #9 In-Reply-To: <5388CD4A.4060704@khwilliamson.com> References: <5388CD4A.4060704@khwilliamson.com> Message-ID: <5388D29C.9040502@ix.netcom.com> On 5/30/2014 11:26 AM, Karl Williamson wrote: > I'm having a problem with this > http://www.unicode.org/versions/corrigendum9.html You are not alone. > > Some people now think it means that noncharacters are really no > different from private-use characters, and should be treated very > similarly if not identically. 
> > It seems to me that they should be illegal in open interchange, or > perhaps illegal in interchange without prior agreement. > > Any system (process or group of related, cooperating processes) that > uses noncharacters will want to not have any of the ones it uses > present in its inputs. It will want to filter them out of those > inputs, likely turning each into a REPLACEMENT CHARACTER. If it fails > to do that, it leaves itself vulnerable to an attack by hackers, who > can fool it into thinking the input data is different from what it > really is. > > Hence, a system that creates outputs containing noncharacters cannot > be assured that any other system will accept those noncharacters. > > Thus, I don't see how noncharacters can be considered to be valid in > public interchange, given that the producers have to assume that the > consumers will not accept them. Producers can assume that consumers > will accept private-use characters, though they may not know their > intent. This is an important distinction. One of the concerns was that people felt that they had to have "data pipeline" style implementations (tools) go and filter these out - even if there was no intent for the implementation to use them internally in any way. Making clear that the standard does not require filtering allows for cleaner implementations of such ("pass-through") tools. However, like you, I feel that the corrigendum went too far. 
are forbidden for use in open interchange > of Unicode text data" > > Perhaps a bit better wording would be, "are forbidden for use in > interchange of Unicode text data without prior agreement" > > The only reason I can think of for your too-large (in my opinion) > backing away from what TUS has said about noncharacters since their > inception is to accommodate processes that conform to C7, "that > purports to not modify the interpretation of a valid coded character > sequence". But, I think there is a better way to do that than what > Corrigendum #9 currently says. > > I also am curious as to why the consecutive group of 32 noncharacters > can't be split off into its own block instead of being part of an > Arabic one. I'm unaware of any stability policy forbidding this. > Another block is to be split, if I recall correctly, to accommodate > the new Cherokee characters. This might have been possible at the time these were added, but now it is probably not feasible. One of the reasons is that block names are exposed (for better or for worse) as character properties and as such are also exposed in regular expressions. While not recommended, it would be really bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)" were to fail, because we split the block into three (with the middle one being the noncharacters). It's the usual dance: is it better to prevent such breakage, or is it better to not pile up more "exceptions" like noncharacters being filed under Arabic Presentation forms. The damage from the former is direct and immediate and eventually decays. The damage from the latter is subtle and cumulative over time. Tough choice. 
A./ > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From ken.whistler at sap.com Fri May 30 14:50:37 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Fri, 30 May 2014 19:50:37 +0000 Subject: Block Boundaries (was: RE: Corrigendum #9) Message-ID: Skipping over the wording related to noncharacters for the moment, let me address the block stability issue: > I also am curious as to why the consecutive group of 32 noncharacters > can't be split off into its own block instead of being part of an Arabic > one. I'm unaware of any stability policy forbidding this. Another > block is to be split, if I recall correctly, to accommodate the new > Cherokee characters. Actually, this is *not* correct. The Latin Extended-E block *will* be first published in Unicode 7.0 next month. In the charts for that version and in Blocks.txt, the range for Latin Extended-E is AB30..AB6F. True, it was initially approved with a more extended range, and was long shown with the longer range in the Roadmap. But the Roadmap is just a "roadmap", and not the standard. The new range allocated to the Cherokee Supplement (AB70..ABBF) is in ballot now, so that allocation is not final, although I personally consider it unlikely to change before publication next year. At any rate the revision of the range for the Latin Extended-E block occurred before actual publication of that block. The net net here is that the last major churning of block boundaries dates all the way back to Unicode 1.1 times and the great Hangul Catastrophe. And the last time any formal block boundary was touched was in 2002, when all blocks were firmly ended on xxxF boundaries as part of synchronizing documentation between the Unicode Standard and 10646. 
And while there is indeed no actual stability guarantee in place that would absolutely prevent the UTC or SC2 from adjusting a block boundary if it decided to, the committees are very unlikely to do so, for the reasons that Asmus cited. Keep in mind that even if the UTC, for some reason, decided it would be a cool idea to split the Arabic Presentation Forms-A block into a new, shorter range and two new blocks, just so FDD0..FDEF could have its own block identity for the noncharacter range, it would be rather likely that a fight would then ensue in the SC2 framework over balloting for such a change to be synchronized in 10646. Nobody has the stomach for that kind of a pointless fight over something with such marginal relevance and benefit. If people want to *fix* this, assuming that "this" is an actual problem, then the issue, as I see it, isn't really block ranges per se, which don't mean a whole lot outside of regex expressions that may use them. Instead, the issue is the de facto alignment of chart presentation with block boundaries. Jiggering the chart production to *present* the range FB50..FDFF as three *chart* units, instead of one, would solve most of the problem for all but the most hardcore Unicode metaphysicians out there. ;-) BTW, for those worried about the FDD0..FDEF range on noncharacters having to live in a mixed neighborhood in the Arabic Presentation Forms-A block, remember that we have lived since 2002 with the BOM itself residing in the Arabic Presentation Forms-B block. Nobody seems to get too worked up any more about that particular funky address. --Ken From markus.icu at gmail.com Fri May 30 15:22:58 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 30 May 2014 13:22:58 -0700 Subject: Block Boundaries (was: RE: Corrigendum #9) In-Reply-To: References: Message-ID: In addition, the Block property is not particularly useful even in regular expressions or other processing. 
It is almost always more useful to use Script, Alphabetic, Unified_Ideograph, etc. Blocks help with planning and allocation but little else. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Fri May 30 16:05:47 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 30 May 2014 22:05:47 +0100 Subject: Block Boundaries (was: RE: Corrigendum #9) In-Reply-To: References: Message-ID: <20140530220547.31e2eb0d@JRWUBU2> On Fri, 30 May 2014 13:22:58 -0700 Markus Scherer wrote: > In addition, the Block property is not particularly useful even in > regular expressions or other processing. It is almost always more > useful to use Script, Alphabetic, Unified_Ideograph, etc. > Blocks help with planning and allocation but little else. They also help with the code charts. Richard. From markus.icu at gmail.com Fri May 30 18:15:12 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 30 May 2014 16:15:12 -0700 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140530194508.357ae366@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> Message-ID: If you use Unicode 16-bit strings, it's easy to "pass through" unpaired surrogates and treat them like code points; it's often not productive or necessary to check for them all the time, that is, to be strict about UTF-16. On the other hand, I don't think anyone expects you to support invalid UTF-8, and especially not to support any and all Unicode 8-bit strings (see Unicode 3.9 Unicode Encoding Forms for what I mean here). If you find UTS #18 unclear or misleading, I suggest you submit feedback pointing out specific text issues. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sat May 31 03:59:58 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 31 May 2014 09:59:58 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> Message-ID: <20140531095958.61fecff7@JRWUBU2> On Fri, 30 May 2014 16:15:12 -0700 Markus Scherer wrote: > If you find UTS #18 unclear or misleading, I suggest you submit > feedback pointing out specific text issues. In this case it seems to be making a pointless, counter-productive 'demand'. I'm first raising the point here in case there is a good reason for the requirement. > If you use Unicode 16-bit strings, it's easy to "pass through" > unpaired surrogates and treat them like code points; it's often not > productive or necessary to check for them all the time, that is, to > be strict about UTF-16. Bear in mind that a pattern \uD808 shall not match anything in a well-formed Unicode string. \uD808\uDF45 specifies a sequence of two codepoints. This sequence can occur in an ill-formed UTF-32 Unicode string and before Unicode 5.2 could readily be taken to occur in an ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular expression engine, the codepoint sequence cannot occur in a UTF-16 Unicode string; instead, the code unit sequence is the codepoint sequence . > On the other hand, I don't think anyone expects you to support invalid > UTF-8, and especially not to support any and all Unicode 8-bit > strings (see Unicode 3.9 Unicode Encoding Forms for what I mean here). Is there a use case for having a 1-character RE \uD800 within a pattern? I could understand it if it were required to match a lone surrogate U+D800, but it isn't. If a regular expression engine matches lone surrogates, then using the same notation for all codepoints is reasonable, but if it doesn't it would be more useful for it to treat lone surrogate code points in patterns as errors. Richard. 
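Richard's distinction between the two-codepoint sequence <D808 DF45> and the single character U+12345 is easy to reproduce in a language whose strings admit lone surrogate code points (a sketch in CPython, whose str type is a sequence of arbitrary code points):

```python
pair = "\ud808\udf45"   # two surrogate code points, not a character
char = "\U00012345"     # the single supplementary code point U+12345
assert len(pair) == 2 and len(char) == 1
assert pair != char     # distinct at the code point level

# A lone surrogate has no well-formed UTF-8 representation:
try:
    pair.encode("utf-8")
except UnicodeEncodeError:
    pass
# ...whereas the real character round-trips through UTF-16 to code
# units with the pair's numeric values:
assert char.encode("utf-16-be").hex() == "d808df45"
```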
From richard.wordingham at ntlworld.com Sat May 31 04:02:34 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 31 May 2014 10:02:34 +0100 Subject: Corrigendum #9 In-Reply-To: <5388CD4A.4060704@khwilliamson.com> References: <5388CD4A.4060704@khwilliamson.com> Message-ID: <20140531100234.1a9095e9@JRWUBU2> On Fri, 30 May 2014 12:26:18 -0600 Karl Williamson wrote: > I'm having a problem with this > http://www.unicode.org/versions/corrigendum9.html > Some people now think it means that noncharacters are really no > different from private-use characters, and should be treated very > similarly if not identically. > It seems to me that they should be illegal in open interchange, or > perhaps illegal in interchange without prior agreement. So one just puts a notice on the web site saying that by downloading CLDR files one agrees to accept non-characters. Part of the original problem is that the CLDR mechanism for identifying Unicode scalar values in XML rather than quoting them (albeit by numeric entities) was broken. > Thus, I don't see how noncharacters can be considered to be valid in > public interchange, given that the producers have to assume that the > consumers will not accept them. The publishing of the CLDR data was strictly limited to the Milky Way, and will remain so for several decades at the very least. Therefore it was not public interchange. Practically, there is the very real issue that a system may be useful enough to be used as part of a larger system, and therefore called upon to handle any Unicode scalar value. One possible solution is to use, instead of non-characters, lone low surrogates. These have the advantage of having obvious representations for use with all three coding forms. Of course, internal checks on the well-formedness of Unicode strings would have to be relaxed, and one might prefer to use them doubled in UTF-16 so as not to weaken checks for broken strings. Richard. 
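The input-filtering policy Karl describes earlier in the thread (turn each noncharacter into a REPLACEMENT CHARACTER at a system boundary) is a few lines in practice; a sketch, with illustrative function names:

```python
def is_noncharacter(cp: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF plus the last two
    # code points of each of the 17 planes (xxFFFE, xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def sanitize(text: str) -> str:
    # One possible gatekeeping policy: replace each noncharacter
    # with U+FFFD REPLACEMENT CHARACTER on input.
    return "".join(
        "\ufffd" if is_noncharacter(ord(c)) else c for c in text
    )

assert sanitize("a\ufdd0b\uffffc") == "a\ufffdb\ufffdc"
```

A pass-through tool in the sense of Corrigendum #9 would simply skip this step, which is exactly the behavioral difference the thread is arguing about.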
From verdy_p at wanadoo.fr Sat May 31 06:09:16 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 31 May 2014 13:09:16 +0200 Subject: Corrigendum #9 In-Reply-To: <5388D29C.9040502@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> Message-ID: 2014-05-30 20:49 GMT+02:00 Asmus Freytag : > This might have been possible at the time these were added, but now it is > probably not feasible. One of the reasons is that block names are exposed > (for better or for worse) as character properties and as such are also > exposed in regular expressions. While not recommended, it would be really > bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)" > were to fail, because we split the block into three (with the middle one > being the noncharacters). > If you think about pseudocode testing for properties, then nothing forbids the test IsInArabicPresentationFormB(x) from checking two ranges instead of just one. Almost all character properties use multiple ranges of characters (including the more useful properties needed in lots of places in the code), so updating it so that the property covers two ranges is not a major change. But anyway, I have never seen the non-characters in the Arabic presentation forms used elsewhere than within legacy Arabic fonts, using these code points to map... Arabic presentation forms. OK, text documents do not need to encode these legacy forms in order to use these fonts (text renderers don't need them with modern OpenType fonts but will still use them in legacy non-OpenType TTF fonts, as a tentative fallback to render these contextual forms). So basically there's no interchange of *text*, but the fonts using these codepoints are still interchanged. I think it would be better to just reassign these characters as compatibility characters (or even as PUA) and not as non-characters. 
I see no rationale for keeping them illegal, when it just causes unnecessary complications for document validation. After all, most C0 and C1 controls also don't have any interchangeable semantics except being "controls", which are always application- and protocol-dependent (not meant for encoding texts, except in legacy more or less "rich" encodings, e.g. for storing escape sequences, not standardized and fully dependent on the protocol or terminal type, or on various legacy standards that did not separate text from style, or for the many protocols that need them for special purposes, such as tagging content, switching code pages, changing colors and font styles, positioning on a screen or input form, adding formatting metadata, implementing out-of-band commands, starting/stopping records, pacing the bandwidth use, starting/ending/redirecting/splitting/merging sessions, embedding non-text content such as bitmap images or structured data, changing transport protocol options such as compression schemes, exchanging encryption/decryption keys, adding checksum controls or autocorrection data, marking redundant data copies, inserting resynchronization points for error recovery...). So these "non-characters" in Arabic presentation forms are to be treated more or less like most C1 controls that have undefined behavior. Where there's said to be a need for a "prior agreement", the agreement may be made explicit by the fact that they are used in some old font formats (the same is true of old fonts using PUA assignments: the kind of agreement is basically the same, and in both cases, fonts are not plain-text documents). So the good question for us is only to be able to reply to this question: "is this document a valid and conforming plain text?" 
If:

* (1) your document contains
  - any of most of the C0 or C1 controls (except CR, LF, VT, FF, and NEL from C1),
  - any PUA code point,
  - any non-character,
  - any unpaired surrogate,
* or (2) your document does not validate against its encoding scheme,

then it is not plain text (to be interchangeable it also needs a recognized standard encoding, which also requires an agreement or a specification in the protocol or file format used to transport it). Personally, I think that surrogates are also non-characters. They are not assigned to any character, even if a pair of encodings use them internally to represent code units (not directly code points, which are first converted into two code units); this means that some documents are valid UTF-16 and UTF-32 documents even if they are not plain text under the current system (I don't like this situation, because UTF-16 and UTF-32 documents are supposed to be interchangeable, even if they are not all convertible to UTF-8). But with the non-characters in the Arabic presentation forms, everything behaves as if they were reserved for a possible future encoding that could use them internally for representing some text using sequences of code units containing or starting with them, or for some still mysterious encoding of a PUA agreement with an unspecified protocol (exactly the same situation as with most C1 controls), or as a possible replacement for some code units that could collide with the internal use of some standard controls in some protocols (e.g. to reencode a NULL, or to delimit the end of a variable-length escape sequence, when all other C0 and C1 controls are already used in a terminal protocol). But even in this case, it will be difficult to consider documents containing them as "plain text". 
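Philippe's checklist is concrete enough to sketch as a predicate. This encodes his personal criterion (1) only, not any Unicode conformance clause; the function name is illustrative, and note that his exception list omits TAB, so the sketch faithfully rejects it:

```python
ALLOWED_CONTROLS = {0x0A, 0x0B, 0x0C, 0x0D, 0x85}  # LF, VT, FF, CR, NEL

def is_noncharacter(cp: int) -> bool:
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def passes_verdy_criterion_1(text: str) -> bool:
    """True if none of the code points on the list above occur."""
    for c in text:
        cp = ord(c)
        if (cp < 0x20 or 0x7F <= cp <= 0x9F) and cp not in ALLOWED_CONTROLS:
            return False              # C0 control, DEL, or C1 control
        if 0xD800 <= cp <= 0xDFFF:
            return False              # unpaired surrogate code point
        if (0xE000 <= cp <= 0xF8FF
                or 0xF0000 <= cp <= 0xFFFFD
                or 0x100000 <= cp <= 0x10FFFD):
            return False              # private-use code point
        if is_noncharacter(cp):
            return False
    return True

assert passes_verdy_criterion_1("plain text\r\n")
assert not passes_verdy_criterion_1("\x1b[31m")  # ESC is a C0 control
assert not passes_verdy_criterion_1("\ufdd0")    # a noncharacter
```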
---- Note: I do not discuss the 34 non-characters in positions U+xxFFFE and U+xxFFFF: keep them as non-characters, they are sufficient for all possible internal use (in fact only U+FFFE and U+FFFF are needed: the first one for determining the byte order in streams that accept either big-endian or little-endian ordering, the second to mark the end of a stream), and I've still never seen any application needing more non-characters from the Arabic presentation form for such use. The non-character U+FFFE can be used to detect the byte order in UTF-16 and UTF-32, but not the bit order in bytes (because U+7FFF and U+FF7F are not non-characters). This is a problem in some protocols that can accept both without an explicit prior encoding of this order (they could need another non-character to help determine the bit order, in which case the encoding of the non-character U+1FFFE could be used: if the bit order is swapped in UTF-16 we get 0xDFFE as the second UTF-16 encoding, from which we can determine the bit order from the position of the clear bit with value 0x20000). But unlike the current code unit 0xFEFF used to detect a swapped BOM (which is considered a valid character and in-band, stripped conditionally only in the leading position), 0x1FFFE could be treated as a non-character and its presence always out-of-band, so once the bit and byte order has been detected, or changed with it within a stream of code units, it can always be stripped from the plain-text output of code points. Maybe in the future we'll need more distinctive order marks for bits, bytes, code units, but I am convinced that the 34 codepoints U+xFFFE and U+xFFFF will be far enough **without** ever needing to use the non-characters in the Arabic presentation form block. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Sat May 31 06:21:23 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 31 May 2014 13:21:23 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> Message-ID:

I think Richard did not speak about that, but about the behavior of a matcher that would start parsing a text using the wrong guessed encoding. He gave the example of a valid CESU-8 text containing U+10000: when reading it incorrectly as UTF-8, the parser gets the four invalid sequences ED A0, 80, ED B0, and 80. CESU-8 cannot be easily detected at the start of the stream with the encoding of the byte order mark U+FEFF. However, CESU-8 can be detected by the initial encoding of another byte order mark, U+1FFFE (which is a non-character that MUST be stripped, once detected, from the parsed stream of code points). However, documents starting with this non-character are supposed to be non-interoperable by definition, even though the presence of that special byte order mark would be very safe to secure CESU-8 and discriminate it from UTF-8.

2014-05-31 1:15 GMT+02:00 Markus Scherer : > If you use Unicode 16-bit strings, it's easy to "pass through" unpaired > surrogates and treat them like code points; it's often not productive or > necessary to check for them all the time, that is, to be strict about > UTF-16. > > On the other hand, I don't think anyone expects you to support invalid > UTF-8, and especially not to support any and all Unicode 8-bit strings (see > Unicode 3.9 Unicode Encoding Forms for what I mean here). > > If you find UTS #18 unclear or misleading, I suggest you submit feedback > pointing out specific text issues. > > markus > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Sat May 31 07:11:03 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 31 May 2014 13:11:03 +0100 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> Message-ID: <20140531131103.5fb182cc@JRWUBU2> On Sat, 31 May 2014 13:21:23 +0200 Philippe Verdy wrote: > However CESU-8 can be detected by the initial encoding of another byte > order mark U+1FFFE (which is a non-character that MUST be stripped > once detected from the parsed stream of code points) However, > documents starting by this non-cahracters are supposed to be > non-interoperable by definition even though the presence of that > special byte order mark would be very safe to secure CESU-8 and > discriminate it from UTF-8.

Where is this tagging defined?

It is in general not true that non-characters must be stripped on input. That would be highly inappropriate in a conversion program that transformed between UTFs. Also, the collations defined in CLDR Version 23 file collation/zh.xml would be severely damaged if the non-characters were stripped out. In version 24 and later the file uses a different syntax and doesn't contain non-characters.

Richard.

From verdy_p at wanadoo.fr Sat May 31 08:08:59 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 31 May 2014 15:08:59 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140531131103.5fb182cc@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531131103.5fb182cc@JRWUBU2> Message-ID:

I just spoke about stripping a single leading U+1FFFE in such a context of autodetecting UTF-8 vs. CESU-8, where a U+FFFE leading BOM is not reliable enough (if non-BMP characters are not found in the first few KB of data), or where the presence of a non-BMP character in that leading few KB could cause UTF-8 to be rejected, but CESU-8 still not selected as a candidate.
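The CESU-8 behavior under discussion can be reproduced by UTF-8-encoding each UTF-16 code unit of a surrogate pair separately. A sketch (CESU-8 is not a standard-library codec, so `cesu8_encode` here is a hand-rolled illustration):

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: supplementary code points become two
    3-byte sequences, one per UTF-16 surrogate code unit."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, encode each half as UTF-8
            cp -= 0x10000
            hi = 0xD800 + (cp >> 10)
            lo = 0xDC00 + (cp & 0x3FF)
            out += chr(hi).encode('utf-8', 'surrogatepass')
            out += chr(lo).encode('utf-8', 'surrogatepass')
        else:
            out += ch.encode('utf-8')
    return bytes(out)

# U+10000 -> surrogate pair D800 DC00 -> ED A0 80 ED B0 80,
# the six bytes discussed in this thread
assert cesu8_encode('\U00010000') == b'\xed\xa0\x80\xed\xb0\x80'
```

BMP characters come out byte-identical to UTF-8, which is why the two encodings cannot be told apart until a supplementary code point (or an agreed marker) appears in the stream.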
However CESU-8 is rarely used except for compatibility with some old processes built initially for handling only characters in the BMP and treating surrogates as if they were characters, if those processes cannot accept 4-byte UTF-8 encoded sequences. CESU-8 is a legacy; UTF-8 is far better and now well supported in most OSes and "Unicode" libraries and most old protocols (plus all new ones). Inserting the special BOM for CESU-8 on output is possible, while also forcing stripping of it on input.

And this has nothing in common with collation. Collations are *not* encodings, even if internally they may re-encode texts during preprocessing within a private interface between layers of the collation process (no need for an interchange agreement: these internal steps are mutually bound together and not easily separable, except the leading normalization step; but the most efficient collators do not separate these steps, they pipeline them in a finite-state automaton to reduce the use of multiple buffers and maximize data locality within a small set of state variables and code branches).

2014-05-31 14:11 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sat, 31 May 2014 13:21:23 +0200 > Philippe Verdy wrote: > > > However CESU-8 can be detected by the initial encoding of another byte > > order mark U+1FFFE (which is a non-character that MUST be stripped > > once detected from the parsed stream of code points) However, > > documents starting by this non-cahracters are supposed to be > > non-interoperable by definition even though the presence of that > > special byte order mark would be very safe to secure CESU-8 and > > discriminate it from UTF-8. > > Where is this tagging defined? > > It is in general not true that non-characters must be stripped on > input. That would be highly inappropriate in a conversion program that > transformed between UTFs.
Also, the collations defined in CLDR Version > 23 file collation/zh.xml would be severely damaged if the > non-characters were stripped out. In version 24 and later the file > uses a different syntax and doesn't contain non-characters. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sat May 31 08:41:10 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 31 May 2014 15:41:10 +0200 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140530194508.357ae366@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> Message-ID:

I think you have a point here. We should probably change to:

To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode scalar value (from U+0000 to U+D7FF and U+E000 to U+10FFFF), using the hexadecimal code point representation.

and then in the notes say that the same notation can be used for codepoints that are not scalar values, for implementations that handle them in Unicode strings.

Mark *— Il meglio è l'inimico del bene —*

On Fri, May 30, 2014 at 8:45 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > Is there any good reason for UTS#18 'Unicode Regular Expressions' to > express its requirements in terms of codepoints rather than scalar > values? > > I was initially worried by RL1.1 requiring that one be able to specify > surrogate codepoints in a pattern. It would not be compliant for an > application to reject such patterns as syntactically or semantically > incorrect! RL1.1 seemed to prohibit compliant regular expression > engines that only handled well-formed UTF-8 strings. > > Furthermore, consider attempting to handle CESU-8 text as a sequence of > UTF-8 code units. The code unit sequence for U+10000 will, > corresponding to the UTF-16 code unit sequence D800 DC00, be ED A0 80 > ED B0 80.
If one follows the lead of the 'best practice' for processing > ill-formed UTF-8 code unit sequences given in TUS Section 5.22, this > will be interpreted as *four* ill-formed sequences, ED A0, 80, ED B0, > and 80. I am not aware of any recommendation as to how to interpret > these sequences as codepoints. > > While being able to specify a search for surrogate codepoint U+D800 > might be useful when dealing with ill-formed UTF-16 Unicode sequences, > UTS#18 Section 1.7, which discusses requirement RL1.7, states that there > is no requirement for a one-codepoint pattern such as \u{D800} to match > a UTF-16 Unicode string consisting just of one code unit with the value > 0xD800. The convenient, possibly intended, consequence of this is that > the RL1.1 requirement to allow patterns to specify surrogate codepoints > can be satisfied by simply treating them as unmatchable; For example, > such a 1-character RE could be treated as the empty Unicode set > [\p{gc=Lo} && \p{gc=Mn}]. > > Now, I suppose one might want to specify a match for ill-formed (in > context) UTF-8 code unit subsequences such as E0 80 (not a valid > initial subsequence) and E0 A5 (lacking a trailing byte), but as > matching is not required, I don't see the point in UTS#18 being > changed to ask for an appropriate syntax to be added. > > Richard. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
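As a concrete check of the byte sequence Richard analyzes: a decoder following the current one-replacement-per-maximal-subpart recommendation treats each of those six bytes as its own ill-formed unit, because ED cannot be followed by A0 or B0, so no multi-byte subpart forms (the four-sequence grouping ED A0, 80, ED B0, 80 corresponds to the older reading Richard cites). Python's UTF-8 decoder follows the maximal-subpart approach:

```python
cesu8_bytes = b'\xed\xa0\x80\xed\xb0\x80'  # CESU-8 form of U+10000

# Each byte is an ill-formed subsequence on its own here, so the
# decoder substitutes one U+FFFD per byte.
decoded = cesu8_bytes.decode('utf-8', errors='replace')
assert decoded == '\ufffd' * 6
```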
URL: From asmusf at ix.netcom.com Sat May 31 11:17:45 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 31 May 2014 09:17:45 -0700 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> Message-ID: <538A00A9.1050907@ix.netcom.com> On 5/31/2014 4:09 AM, Philippe Verdy wrote: > 2014-05-30 20:49 GMT+02:00 Asmus Freytag >: > > This might have been possible at the time these were added, but > now it is probably not feasible. One of the reasons is that block > names are exposed (for better or for worse) as character > properties and as such are also exposed in regular expressions. > While not recommended, it would be really bad if the expression > with pseudo-code "IsInArabicPresentationFormB(x)" were to fail, > because we split the block into three (with the middle one being > the noncharacters). > > > If you think about pseudocode testing for properties then nothing > forbifs the test IsInArabicPresentationFormB(x) to check two ranges > onstead of just one.

Besides the point. The issue is that the result of evaluation of an expression would change over time.

A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sat May 31 14:27:55 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 31 May 2014 21:27:55 +0200 Subject: Long-Encoded Restricted Characters in High Frequency Modern Use In-Reply-To: <20140529233956.5db1ea5e@JRWUBU2> References: <20140529233956.5db1ea5e@JRWUBU2> Message-ID: Mark *— Il meglio è l'inimico del bene —* On Fri, May 30, 2014 at 12:39 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > I am a little confused by the call for a review of UTS #39, Unicode > Security Mechanisms (PRI #273). Are we being requested to > report long-encoded 'restricted' characters in high frequency modern > use? 'Restricted' refers to the classification in > xidmodifications.txt.
> First, "restricted" characters are meant not for everyday use, but specifically just for the purpose of programming identifiers and similar sorts of identifiers. Moreover, it sets up a framework, but the conformance requirements are only that any modification is declared. http://www.unicode.org/reports/tr39/proposed.html#C1 You may know this all, but just to be sure. > > One linked pair of long-encoded restricted characters in high frequency > use is U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM, > which occurs in the extremely common Thai and Lao words for 'water' or > 'liquid in general' น้ำ ນ້ຳ whose NFKC decompositions are the > nonsensical forms น้ํา ນ້ໍາ, but may be faked by the linguistically > incorrect นํ้า ນໍ້າ. In Thai the encodings are <U+0E19 THAI CHARACTER NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI CHARACTER SARA AM>, <NO NU, MAI THO, U+0E4D THAI CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER SARA AA> and <NO NU, NIKHAHIT, MAI THO, SARA AA>. The structure of the data is based on the use of NFKC characters in identifiers. So SARA AM and the Lao equivalent are both not NFKC characters, and are categorized as such, and would need to be represented by their NFKC forms. The process is in http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection You can see the categorization (for 6.3) for a whole script with a link like: http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=thai} (It only works for 6.3 right now, but these items haven't changed recently.) > Now, U+0E4D THAI > CHARACTER NIKHAHIT is classified as 'allowed; recommended', although > its main use is in writing Pali, which would suggest that it should be > 'restricted; historic' or 'restricted; limited-use'. For that, it would be best to submit via http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a feedback form at http://www.unicode.org/reporting.html, just to be sure. > The situation is > not so clear for Lao > - U+0ECD LAO NIGGAHITA is a fairly common vowel in the Lao language.
> Based on your information, the following appear (at least to me) to be caused by typos in the xidmodifications source files; they are all marked as 'technical'. http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=khmer} Again, best to submit this like above (via http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a feedback form at http://www.unicode.org/reporting.html). > To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER > SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised as > 'restricted; technical'. They are all in use in the Khmer language. > > U+17CB KHMER SIGN BANTOC is required for the main methods of writing > the Khmer vowels /a/ and /?/. > > U+17CC KHMER SIGN ROBAT is a repha, but I would be surprised to learn > that it has recently become little-used. It is, however, readily > confused with U+17CD KHMER SIGN TOANDAKHIAT, a 'pure killer' whose main > modern use is to show that a consonant is silent, rather like the Thai > letter U+0E4C THAI CHARACTER THANTHAKHAT. (The names are the same.) > The confusion arises because Sanskrit -rCa was pronounced /-r/ in > Khmer, and final /r/ recently became silent in Khmer, so the effect of > the Sanskrit /r/ is now to silence the final consonant. > > While U+17CE KHMER SIGN KAKABAT and U+17CF KHMER SIGN AHSDA may not be > common, they are still in modern use. > > Although U+17D0 KHMER SIGN SAMYOK SANNYA may have declined in > frequency, it has not dropped out of use and is still a common enough > way of writing the vowel /a/. > > Richard. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Sat May 31 14:36:37 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 31 May 2014 21:36:37 +0200 Subject: Corrigendum #9 In-Reply-To: <538A00A9.1050907@ix.netcom.com> References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> Message-ID:

Maybe; but there's real doubt that a regular expression that would need this property would be severely broken if that property was corrected. There are many other properties that are more useful (and much more used) whose associated set of code points changes regularly across versions.

I don't see any specific interest in maintaining non-characters in that block, as it effectively reduces the reusability of this property. And in fact it would be highly preferable to no longer state that these non-characters in Arabic Presentation Forms be treated like C1 controls or PUA (because they will never be reassigned to something more useful). Making them PUA would not change radically the fact that these characters are not recommended, but we would no longer have to bother about checking whether they are valid or not. They remain there only as a legacy with old outdated versions of Unicode, for a mysterious need that I've not clearly identified.

Let's assume we change them into PUA; some applications will start accepting them when some others won't. Not a problem, given that they are already not interoperable.
And regular expressions trying to use character properties have many more caveats to handle. The most serious concern canonical equivalences and discontinuous or partial matches: searches that focus only on exact sets of code points instead of sets of canonically equivalent texts. Another complication comes from the effect of collation and its variable strength, matching more or fewer parts of text spanning ignorable collation elements, i.e. possibly also discontinuous runs of ignorable code points, if we want to get consistent results independent of the normalization form. More complicated still is how to handle "partial matches", such as a combining character within a precomposed character that is canonically equivalent to a string in which this combining character appears.

And even trickier is how to handle substitution with regexps, for example when performing a search at primary collation level ignoring letter case, but where we want to replace base letters while preserving case in the substituted string: this requires specific lookup of characters using properties **not** specified in the UCD but in the collation tailoring data, and then ensuring that the result of the substitution in the plain-text source will remain valid text, not creating new unexpected canonical equivalences; that it will not break basic orthographic properties such as syllabic structures in a specific pair of language+script; and that it will not produce unexpected collation equivalents at the same collation strength, causing later never-ending loops of substitutions (for example on large websites with bots performing text corrections).
Regexps are still a very experimental proposal; they are still very difficult to make interoperable except in a small set of tested cases, and for this reason I really doubt that the "character encoding block" property is very productive for now with regexps (and notably not with this "compatibility" block, whose characters will remain used in isolation, independently of their context, if they are still used in rare cases).

I see little value in keeping this old complication in this block, just more interoperability problems for implementations. So these non-characters should be treated mostly like PUA, except that they have a few more properties: direction=RTL, script=Arabic, and starters working in isolation for the Arabic joining type (these properties can help limit their generic reusability like regular PUAs, but at least all other processes, and notably generic validators, won't have to bother about them).

2014-05-31 18:17 GMT+02:00 Asmus Freytag : > On 5/31/2014 4:09 AM, Philippe Verdy wrote: > > 2014-05-30 20:49 GMT+02:00 Asmus Freytag : > >> This might have been possible at the time these were added, but now it is >> probably not feasible. One of the reasons is that block names are exposed >> (for better or for worse) as character properties and as such are also >> exposed in regular expressions. While not recommended, it would be really >> bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)" >> were to fail, because we split the block into three (with the middle one >> being the noncharacters). >> > > If you think about pseudocode testing for properties then nothing > forbifs the test IsInArabicPresentationFormB(x) to check two ranges onstead > of just one. > > Besides the point. > > The issue is that the result of evaluation of an expression would change > over time. > > A./ > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From mark at macchiato.com Sat May 31 15:02:57 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 31 May 2014 22:02:57 +0200 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> Message-ID:

A few quick items. (I admit to only skimming your response, Philippe; there is only so much time in the day.)

Any discussion of changing non-characters is really pointless. See http://www.unicode.org/policies/property_value_stability_table.html

As to breaking up the block, that is not forbidden: but one would have to give pretty compelling arguments that the benefits would outweigh any likely problems, especially since we already don't recommend the use of the block property in regexes.

> And regular expressions trying to use character properties have many more caveats to handle (the most serious being with canonical equivalences and discontinuous matches or partial matches.

The UTC, after quite a bit of work, concluded that it was not feasible with today's regex engines to handle normalization automatically, instead recommending the approach in http://www.unicode.org/reports/tr18/#Canonical_Equivalents

> Regexps are still a very experimental proposal, they are still very difficult to make interoperatable except in a small set of tested cases

I have no idea where this is coming from. Regexes using Unicode properties are in widespread and successful use. It is not that hard to make them interoperable (as long as both implementations are using the same version of Unicode).

Mark *— Il meglio è l'inimico del bene —* On Sat, May 31, 2014 at 9:36 PM, Philippe Verdy wrote: > May be; but there's real doubt that a regular expression that would need > this property would be severely broken if that property was corrected.
> There are many other properties that are more useful (and mich more used) > whose associated set of codepoints changes regularly across versions. > > I don't see any specific interest in maintaining non-characters in that > block, as it effectively reduces the reusaibility of this property. > And in fact it would be highly preferable to no longer state that these > non-characters in ArabicPresenationForm be treated like C1 controls or PUA > (because they will ever be reassigned to something more useful). Making > them PUA would not change radically the fact thzt these characters are not > recommended but we xould no longer bother about checking if they are valid > or not. They remain there only as a legacy with old outdated versions of > Unicode for a mysterious need that I"ve not clearly identified. > > Let's assume we change them into PUA; some applications will start > accepting them when some other won't. Not a problem given that they are > already not interoperable. > > And regular expressions trying to use character properties have many more > caveats to handle (the most serious being with canonical equivalences and > discontinuous matches or partial matches; when searches are only focuing on > exact sets of code points instead of sets of canonical equivalent texts; > the other complciation coming with the effect of collation and its variable > strength matching more or less parts of text spanning ignorable collation > elements i.e, possibly also, discontinuous runs of ignorable codepoints if > we want to get consistant results independant of th normalization form. 
> more compicate is how to handle "partial matches" such as a combining > character within a precomposed character which is canonically equivalent to > string where this combining character appears > > And even more tricky is how to handle substitution with regexps, for > example when perfrming search at primary collation level ignoring > lettercase, but when we wnt to replace base letters but preserve case in > the substituted string: this requires specific lookup of characters using > properties **not** specified in the UCD but in the collation tailoring > data, and then how to ensure that the result of the substitution in the > plain-text source will remain a valid text not creating new unexpected > canonical equivalences, and that it will also not break basic orthographic > properties such as syllabic structures in a specific pair of > language+script, and without also producing unexpected collation > equivalents at the same collation strength; causing later unexpected never > ending loops of subtitutions, for example in large websites with bots > operating text corrections). > > Regexps are still a very experimental proposal, they are still very > difficult to make interoperatable except in a small set of tested cases and > for this reason I really doubt that the "characetrs encoding block" > property is very productive for now with regexps (and notably not with this > "compatibility" block, whose characters wll remain used isolately > independantly of their context, if they are still used in rare cases). > > I see little value in keeping this old complication in this block, but > just more interoperability problems for implementations. 
So these non > characters should be treated mostly like PUA, except that they have a few > more properties : direction=RTL, script= Arabic, and starters working in > isolation for the Arabic joining type (these properties can help limit > their generic reusability like regular PUAs but at least all other > processes and notably generic validtors won't have to bother about them). > > 2014-05-31 18:17 GMT+02:00 Asmus Freytag : > > On 5/31/2014 4:09 AM, Philippe Verdy wrote: >> >> 2014-05-30 20:49 GMT+02:00 Asmus Freytag : >> >>> This might have been possible at the time these were added, but now it >>> is probably not feasible. One of the reasons is that block names are >>> exposed (for better or for worse) as character properties and as such are >>> also exposed in regular expressions. While not recommended, it would be >>> really bad if the expression with pseudo-code >>> "IsInArabicPresentationFormB(x)" were to fail, because we split the block >>> into three (with the middle one being the noncharacters). >>> >> >> If you think about pseudocode testing for properties then nothing >> forbifs the test IsInArabicPresentationFormB(x) to check two ranges onstead >> of just one. >> >> Besides the point. >> >> The issue is that the result of evaluation of an expression would change >> over time. >> >> A./ >> >> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 31 20:19:59 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 1 Jun 2014 03:19:59 +0200 Subject: some outdated charts from Unicode 5.0 Message-ID: Unicode charts seem to have been left unmaintained since Unicode 5.0 in their French translation, such as the chart for Latin Extended-D: http://www.unicode.org/fr/charts/PDF/UA720.pdf which exhibits only 2 assignments.
Compare with the updated version in English http://www.unicode.org/charts/PDF/UA720.pdf It is not dramatic for new blocks that have been added after Unicode 5.0 for new scripts, but tables for existing blocks should be updated (even if there are missing French translations for character names and they are shown in English, preferably in italic with a note explaining the missing translations). -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sat May 31 21:15:52 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 31 May 2014 19:15:52 -0700 Subject: Corrigendum #9 In-Reply-To: References: <5388CD4A.4060704@khwilliamson.com> <5388D29C.9040502@ix.netcom.com> <538A00A9.1050907@ix.netcom.com> Message-ID: <538A8CD8.4070905@ix.netcom.com> On 5/31/2014 12:36 PM, Philippe Verdy wrote: > May be; but there's real doubt that a regular expression that would > need this property would be severely broken if that property was > corrected. There are many other properties that are more useful (and > mich more used) whose associated set of codepoints changes regularly > across versions.

We have learned that there are always more implementations of a feature than we might have predicted. That has been true, for Unicode, from day one.

More importantly, while a regex that uses an expression that is equivalent to "IsInArabicPresentationFormB(x)" may or may not be well-defined, there is no reason to break it by splitting the block. As blocks cannot be discontiguous (unlike other properties), some Arabic Presentation Forms would have to be put into a new block (Arabic Presentation Forms C). This is what would break such expressions - it has, in fact, nothing to do with the status of the noncharacters.

There's no reason to contemplate breaking changes of any kind at this point.

A./ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From markus.icu at gmail.com Sat May 31 21:24:09 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sat, 31 May 2014 19:24:09 -0700 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: References: <20140530194508.357ae366@JRWUBU2> Message-ID: On Sat, May 31, 2014 at 6:41 AM, Mark Davis ☕️ wrote: > I think you have a point here. We should probably change to: > > To meet this requirement, an implementation shall supply a mechanism for > specifying any Unicode scalar value (from U+0000 to U+D7FF and U+E000 to > U+10FFFF), using the hexadecimal code point representation. > > and then in the notes say that the same notation can be used for > codepoints that are not scalar values, for implementation that handle them > in Unicode strings.

This combination sounds good.

markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Sat May 31 21:28:27 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sat, 31 May 2014 19:28:27 -0700 Subject: Unicode Regular Expressions, Surrogate Points and UTF-8 In-Reply-To: <20140531095958.61fecff7@JRWUBU2> References: <20140530194508.357ae366@JRWUBU2> <20140531095958.61fecff7@JRWUBU2> Message-ID: On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > Bear in mind that a pattern \uD808 shall not match anything in a > well-formed Unicode string.

Depends. See the definitions of Unicode strings vs. UTF strings.

> \uD808\uDF45 specifies a sequence of two > codepoints.

Implementations that use Unicode 16-bit strings will usually treat this as one supplementary code point. In Java, there is no other way to escape one.

markus -------------- next part -------------- An HTML attachment was scrubbed... URL:
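Markus's point about 16-bit string types can be illustrated in Python, where `str` may hold lone surrogates; routing the pair through UTF-16 shows the single code point that a 16-bit string type such as Java's effectively represents (a sketch of the equivalence, not of Java's internal mechanics):

```python
# A "Unicode 16-bit string" view: two surrogate code points...
pair = '\ud808\udf45'
assert len(pair) == 2  # Python sees two code points

# ...but interpreted as UTF-16 code units, they fuse into one
# supplementary code point, which is how \uD808\uDF45 reads in Java.
joined = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
assert joined == '\U00012345'
assert len(joined) == 1
```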