From lists-cldr at chsc.dk Mon Apr 3 13:51:21 2017
From: lists-cldr at chsc.dk (Christian Schmidt)
Date: Mon, 3 Apr 2017 20:51:21 +0200
Subject: Generic location format misleading for America/Argentina/San_Juan
Message-ID:

UTS #35, part 2 suggests using "generic location format" in interfaces for time zone selection.

The generic location format for America/Argentina/San_Juan is "San Juan". As pointed out in https://github.com/symfony/symfony/pull/17628#issuecomment-290724976, there are many cities with that name, with the most prominent one being the capital of Puerto Rico (i.e. not the one in Argentina). Hence, the label "San Juan" is misleading to the user.

Would it be better to change the fallback format for timezones in countries with multiple zones to "Argentina Time (San Juan)" or "Argentina (San Juan) Time" (regionFormat and fallbackFormat combined)? Perhaps this was the intention all along?

Elsewhere in the spec it says about the generic location format that "the naming is more uniform than the generic non-location format and zones for the same country will be grouped together". I don't understand what this means, unless the country name is included in the format (and the formatted strings are sorted alphabetically).

Christian Schmidt

PS: I tried creating this in Trac, but for some reason the Akismet spam filter thinks it is spam.

From lists-cldr at chsc.dk Wed Apr 5 14:44:05 2017
From: lists-cldr at chsc.dk (Christian Schmidt)
Date: Wed, 5 Apr 2017 21:44:05 +0200
Subject: Generic location format misleading for America/Argentina/San_Juan
In-Reply-To:
References:
Message-ID:

> PS: I tried creating this in Trac, but for some reason the Akismet spam filter thinks it is spam.

Today the spam filter was more forgiving, so I have now filed a ticket for this issue: http://unicode.org/cldr/trac/ticket/10175

Christian
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rick at unicode.org Mon Apr 10 15:26:09 2017
From: rick at unicode.org (Rick McGowan)
Date: Mon, 10 Apr 2017 13:26:09 -0700
Subject: CLDR version 31.0.1 released
Message-ID: <58EBEA61.8090905@unicode.org>

Hello everyone,

There is a new update of CLDR v31, with fixes as described on: http://cldr.unicode.org/index/downloads/cldr-31

The largest change is the addition of a full set of derived emoji names and keywords.

From skeet at pobox.com Tue Apr 11 00:47:19 2017
From: skeet at pobox.com (Jon Skeet)
Date: Tue, 11 Apr 2017 06:47:19 +0100
Subject: Year error for 31.0.0 on downloads page
Message-ID:

Hi,

Having just seen the release note for 31.0.1, I visited http://cldr.unicode.org/index/downloads ... where 31.0.0 is listed as having been released on 2016-03-20, instead of 2017-03-20.

Is this notification enough for someone with suitable permissions to get it fixed, or should I file a bug instead?

Jon

From pedberg at apple.com Tue Apr 11 02:14:42 2017
From: pedberg at apple.com (Peter Edberg)
Date: Tue, 11 Apr 2017 00:14:42 -0700
Subject: Year error for 31.0.0 on downloads page
In-Reply-To:
References:
Message-ID: <06FAC6B1-0AD9-4AA3-A35C-4A19ED22091A@apple.com>

> On Apr 10, 2017, at 10:47 PM, Jon Skeet wrote:
>
> Hi,
>
> Having just seen the release note for 31.0.1, I visited http://cldr.unicode.org/index/downloads ... where 31.0.0 is listed as having been released on 2016-03-20, instead of 2017-03-20.
>
> Is this notification enough for someone with suitable permissions to get it fixed, or should I file a bug instead?
>
> Jon

Thanks! This notification was enough, I just fixed the problem.

- Peter E
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Sat Apr 29 19:42:03 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sat, 29 Apr 2017 17:42:03 -0700
Subject: Word break question
Message-ID:

Hey CLDR users,

I have a question regarding the word break rules from CLDR v31. Consider the following word break test:

÷ 0001 × 0308 ÷ 0041 ÷

I believe rule #5 should apply between 0308 and 0041, which looks like this:

$AHLetter × $AHLetter

0308 has a word break property of "Extend" which $AHLetter matches, and 0041 has a word break property of ALetter which $AHLetter also matches. The thing is, rule #5 indicates no break should occur between these characters. Furthermore, there are only two rules in which a break is indicated (3.1 and 3.2), both of which don't apply in this case. What am I missing?

Thanks!

-Cameron

From cldr-users at unicode.org Sat Apr 29 19:48:58 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sat, 29 Apr 2017 17:48:58 -0700
Subject: Macedonian-Latin Transformation Question
Message-ID:

Hey CLDR users,

In the Macedonian-Latin-BGN transformation rules in CLDR v31, Cyrillic small letter je is transformed into an uppercase Latin J. The transform rule in question is here.

ICU seems to either ignore this rule or use a corrected version of the CLDR data set, because it produces correct transformed results. If this is a mistake in the CLDR data, I'm happy to file a ticket, but would also like to ask why ICU seems to be using a non-published version of CLDR. The problem could also be an issue with my transformation implementation, in which case I'd like to ask what other rule should be matching instead.

Thanks for your help!

-Cameron
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Sat Apr 29 20:12:36 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Sun, 30 Apr 2017 02:12:36 +0100
Subject: Word break question
In-Reply-To:
References:
Message-ID: <20170430021236.62f345a3@JRWUBU2>

On Sat, 29 Apr 2017 17:42:03 -0700
Cameron Dutro via CLDR-Users wrote:

> Hey CLDR users,
>
> I have a question regarding the word break rules from CLDR v31.
> Consider the following word break test:
>
> ÷ 0001 × 0308 ÷ 0041 ÷
>
> I believe rule #5 should apply between 0308 and 0041, which looks
> like this:
>
> $AHLetter × $AHLetter
>
> 0308 has a word break property of "Extend" which $AHLetter matches,
> and 0041 has a word break property of ALetter which $AHLetter also
> matches. The thing is, rule #5 indicates no break should occur
> between these characters. Furthermore, there are only two rules in
> which a break is indicated (3.1 and 3.2), both of which don't apply
> in this case. What am I missing?

You're missing the shape of the brackets in "($ALetter $FEZ*)". The brackets are round, not square, so does not match $ALetter as it is not a string starting with something for which Word_Break=ALetter. Obviously does not match either.

Secondly, if you read http://unicode.org/reports/tr35/tr35-general.html#Segmentations, you will see that the final rule of "Any ÷ Any" is implicit.

Richard.

From cldr-users at unicode.org Sun Apr 30 00:41:33 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sat, 29 Apr 2017 22:41:33 -0700
Subject: Word break question
In-Reply-To: <20170430021236.62f345a3@JRWUBU2>
References: <20170430021236.62f345a3@JRWUBU2>
Message-ID:

Hey Richard,

Thank you for your response, it's been quite helpful.

I'm aware of the difference between the various types of brackets. I think my implementation is treating the round brackets as literal characters because rules are compiled into a regular expression.
Variable replacement gives this as (nearly) the final expanded form of the rule in question:

[(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}] ×
[(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}]

As you can see, the parentheses exist *within* the character class, and are therefore treated as literal characters.

I understand that the transformation rules are Unicode sets as opposed to true regular expressions. Is there documentation available that explains the differences between the two, or perhaps the syntax and expected behavior of a Unicode set?

Thank you,

-Cameron

From cldr-users at unicode.org Sun Apr 30 06:21:44 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Sun, 30 Apr 2017 12:21:44 +0100
Subject: Word break question
In-Reply-To:
References: <20170430021236.62f345a3@JRWUBU2>
Message-ID: <20170430122144.2ad82ce5@JRWUBU2>

On Sat, 29 Apr 2017 22:41:33 -0700
Cameron Dutro via CLDR-Users wrote:

> Hey Richard,
>
> Thank you for your response, it's been quite helpful.
>
> I'm aware of the difference between the various types of brackets. I
> think my implementation is treating the round brackets as literal
> characters because rules are compiled into a regular expression.
> Variable replacement gives this as (nearly) the final expanded form
> of the rule in question:
>
> [(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}] ×
> [(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}]

I trust this expansion has been abbreviated for citing in an email - ALetter and Hebrew_Letter should be treated the same.

> As you can see, the parentheses exist *within* the character class,
> and are therefore treated as literal characters.
>
> I understand that the transformation rules are unicode sets as
> opposed to true regular expressions. Is there documentation available
> that explains the differences between the two, or perhaps the syntax
> and expected behavior of a unicode set?

I'm not sure what you mean by a 'true regular expression'; whatever one might think of Unicode sets, they are true definitions of regular languages on the alphabet of code points. They are thus regular expressions, though possibly not as you know them.
There is a slight gap in the set of strings of code points; one cannot have a leading surrogate followed by a trailing surrogate if one is processing a well-formed Unicode string, for that is the UTF-16 representation of a single code point in a supplemental plane.

There is documentation on the syntax of Unicode sets at http://unicode.org/reports/tr35/tr35.html#Unicode_Sets. You will see that Version 31 Section 5.3.3.4 is very restrictive in what string specifications it allows. Moreover, the opening paragraph of Section 5.3.3 says, "A UnicodeSet represents a finite set of Unicode code points and strings". "(\p{Word_Break=ALetter} $FEZ*)" defines an infinite set of strings, so would be illegal even if it were syntactically correct.

The rules themselves are given in the form:

regular_expression boundary_decision regular_expression

These regular expressions are usually not Unicode sets:

1) They are not given in such notation.

2) At http://www.unicode.org/reports/tr29/#Notation there is the statement, "The left and right sides use the boundary property values in regular expressions."

3) The sets of applicable boundary strings are mostly infinite.

Richard.

From cldr-users at unicode.org Sun Apr 30 07:49:37 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Sun, 30 Apr 2017 14:49:37 +0200
Subject: Word break question
In-Reply-To:
References: <20170430021236.62f345a3@JRWUBU2>
Message-ID:

OK, the problem is with the extra parentheses that are transcluded as-is during expansion of $variables, even if these $variables are then used within a character class.

What this means is that ($ALetter $FEZ*) cannot be used in a character class, and the inclusion of $ALetter in a character class is invalid when its expansion is not a single character... Even if you drop the (unnecessary) parentheses in $ALetter $FEZ* it will not be correct.
In fact this variable definition is silly because it is self-referencing, so it would expand to (($ALetter $FEZ*) $FEZ*), then ((($ALetter $FEZ*) $FEZ*) $FEZ*), and so on infinitely. Removing the parentheses would still expand it to $ALetter $FEZ* $FEZ*, then $ALetter $FEZ* $FEZ* $FEZ*, and so on infinitely.

I think that the defined variable should be renamed... This is clearly a bug, IMHO.

From cldr-users at unicode.org Sun Apr 30 08:11:06 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Sun, 30 Apr 2017 14:11:06 +0100
Subject: Word break question
In-Reply-To:
References: <20170430021236.62f345a3@JRWUBU2>
Message-ID: <20170430141106.2fb797ed@JRWUBU2>

On Sun, 30 Apr 2017 14:49:37 +0200
Philippe Verdy via CLDR-Users wrote:

> ... Even if you drop the
> (unnecessary) parentheses in $ALetter
> $FEZ* it will not be correct.

> In fact this variable definition is silly because it is
> self-referencing itself, so it would expand to
> (($ALetter $FEZ*) $FEZ*), then
> ((($ALetter $FEZ*) $FEZ*) $FEZ*),
> and so on infinitely.
That is another reason for calling it a *variable*; it is not a *constant*. If you read the text at http://unicode.org/reports/tr35/tr35-general.html#Segmentations, you would find the statement "The ordering of variables is important; they are evaluated in order from first to last (see Section 9.1 Segmentation Inheritance)".

Richard.

From cldr-users at unicode.org Sun Apr 30 14:28:51 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sun, 30 Apr 2017 12:28:51 -0700
Subject: Word break question
In-Reply-To: <20170430141106.2fb797ed@JRWUBU2>
References: <20170430021236.62f345a3@JRWUBU2> <20170430141106.2fb797ed@JRWUBU2>
Message-ID:

Philippe, thanks for your response. As Richard said, the variable entries are "evaluated" from first to last, which obviates the self-referencing problem you mentioned. My Ruby implementation follows this rule, so I don't think that's the problem.

Richard, thanks again for clarifying the notation in use by the segmentation rules - I now understand the left- and right-hand sides to be regular expressions. It's still not clear to me how to interpret parentheses *inside* character classes, however. Consider the following generalized case:

[(abc d*)]

The above regular expression is nonsensical unless one considers the parentheses and asterisk to be individual characters in the character class, as opposed to a grouping of characters that must match in order (e.g. a capturing group). Of the programming languages I've used, I know of none that would treat the parentheses as anything but literal characters. As far as I can tell, such behavior isn't even mentioned in UTS #18. There is the provision for character *groups* within character classes - for example [{abc}{def}] - but that doesn't take repetitions like * and + into account.

That said, it appears that any conformant implementation of the segmentation rules must make an allowance for parentheses inside character classes. How then should [(abc d*)] be interpreted?
I can think of several interpretations:

1. Simply remove the square brackets.
2. Replace grouping symbols () with grouping symbols {}, which are explicitly allowed/supported in Unicode regular expressions (UTS #18). Unfortunately the issue of how to interpret repetition symbols is still in question.
3. Allow repetition symbols in character classes. This would require rewriting the regular expression above as something like (?:abc|d*).

Any guidance would be much appreciated!

-Cameron
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Sun Apr 30 16:01:49 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Sun, 30 Apr 2017 22:01:49 +0100
Subject: Word break question
In-Reply-To:
References: <20170430021236.62f345a3@JRWUBU2> <20170430141106.2fb797ed@JRWUBU2>
Message-ID: <20170430220149.0d183722@JRWUBU2>

On Sun, 30 Apr 2017 12:28:51 -0700
Cameron Dutro via CLDR-Users wrote:

> Richard, thanks again for clarifying the notation in use by the
> segmentation rules - I now understand the left- and right-hand sides
> to be regular expressions. It's still not clear to me how to interpret
> parentheses *inside* character classes however. Consider the following
> generalized case:
>
> [(abc d*)]

You should not be getting parentheses inside character classes. Ignoring Hebrew letters as a distraction, the rule is

$ALetter × $ALetter

and $ALetter has the value

(\p{Word_Break=ALetter} $FEZ*)

This is a regular expression; it is not defined by a single character class (or Unicode set).

At each point for which no break decision has been made, there shall be no break if a string immediately before and a string immediately after match that pattern.

Richard.

From cldr-users at unicode.org Sun Apr 30 17:09:17 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sun, 30 Apr 2017 15:09:17 -0700
Subject: Word break question
In-Reply-To: <20170430220149.0d183722@JRWUBU2>
References: <20170430021236.62f345a3@JRWUBU2> <20170430141106.2fb797ed@JRWUBU2> <20170430220149.0d183722@JRWUBU2>
Message-ID:

Hey Richard,

Unfortunately the Hebrew letters cannot be ignored since the $AHLetter variable introduces a character class, which is the source of my confusion.
Here are all the variables in question:

$AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
$Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
$Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
$ALetter(1) = \p{Word_Break=ALetter}
$ALetter(2) = ($ALetter(1) $FEZ*)
$FEZ = [$Format $Extend $ZWJ]
$Format = \p{Word_Break=Format}
$Extend = \p{Word_Break=Extend}
$ZWJ = \p{Word_Break=ZWJ}

The regular expressions for either side of the rule can be constructed using a series of simple substitutions:

$AHLetter
[$ALetter(2) $Hebrew_Letter(2)]
[($ALetter(1) $FEZ*) ($Hebrew_Letter(1) $FEZ*)]
[(\p{Word_Break=ALetter} $FEZ*) (\p{Word_Break=Hebrew_Letter} $FEZ*)]
[(\p{Word_Break=ALetter} [$Format $Extend $ZWJ]*) (\p{Word_Break=Hebrew_Letter} [$Format $Extend $ZWJ]*)]
[(\p{Word_Break=ALetter} [\p{Word_Break=Format} $Extend $ZWJ]*) (\p{Word_Break=Hebrew_Letter} [\p{Word_Break=Format} $Extend $ZWJ]*)]
[(\p{Word_Break=ALetter} [\p{Word_Break=Format} \p{Word_Break=Extend} $ZWJ]*) (\p{Word_Break=Hebrew_Letter} [\p{Word_Break=Format} \p{Word_Break=Extend} $ZWJ]*)]
[(\p{Word_Break=ALetter} [\p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}]*) (\p{Word_Break=Hebrew_Letter} [\p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}]*)]

As you can see, the resulting regular expression contains parentheses within character classes (and in fact some nested ones too). How should my implementation handle these cases?

Thanks,

-Cameron

From cldr-users at unicode.org Sun Apr 30 18:07:32 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Mon, 1 May 2017 00:07:32 +0100
Subject: Word break question
In-Reply-To:
References: <20170430021236.62f345a3@JRWUBU2> <20170430141106.2fb797ed@JRWUBU2> <20170430220149.0d183722@JRWUBU2>
Message-ID: <20170501000732.0605e779@JRWUBU2>

On Sun, 30 Apr 2017 15:09:17 -0700
Cameron Dutro via CLDR-Users wrote:

> Hey Richard,
>
> Unfortunately the Hebrew letters cannot be ignored since the $AHLetter
> variable introduces a character class, which is the source of my
> confusion. Here are all the variables in question:
>
> $AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
> $Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
> $Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
> $ALetter(1) = \p{Word_Break=ALetter}
> $ALetter(2) = ($ALetter(1) $FEZ*)
> $FEZ = [$Format $Extend $ZWJ]
> $Format = \p{Word_Break=Format}
> $Extend = \p{Word_Break=Extend}
> $ZWJ = \p{Word_Break=ZWJ}

> How should my implementation handle these cases?

It would have been friendlier if instead of doing macro-like expansions, it had compounded finite state machines.
Then, it would have reported an error at "$AHLetter = [$ALetter(2) $Hebrew_Letter(2)]". Basically, the CLDR definition is wrong! What CLDR should have is

$AHLetter(1) = [$ALetter(1) $Hebrew_Letter(1)]
$AHLetter(2) = ($AHLetter(1) $FEZ*)

Alternatively, it could have

$AHLetter = ($ALetter | $Hebrew_Letter)

At this point, you may realise that ICU does not derive the break iterators from the CLDR definitions. Instead, they are derived manually from the specifications. What can then happen is that someone works from what the specification should say, rather than from what it does say.

Richard.

From cldr-users at unicode.org Sun Apr 30 18:36:22 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 30 Apr 2017 16:36:22 -0700
Subject: Word break question
In-Reply-To: <20170501000732.0605e779@JRWUBU2>
References: <20170430021236.62f345a3@JRWUBU2> <20170430141106.2fb797ed@JRWUBU2> <20170430220149.0d183722@JRWUBU2> <20170501000732.0605e779@JRWUBU2>
Message-ID:

Richard, Cameron, Philippe, thanks for tracking this down... I filed a ticket at http://unicode.org/cldr/trac/ticket/10226. If you have any comments on the proposed solution, please add them there so we don't lose them.

Mark
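
The point made in the thread, that variables are "evaluated in order from first to last" and that a definition such as $ALetter = ($ALetter $FEZ*) therefore refers to the previous value of $ALetter rather than to itself, can be illustrated with a short sketch. This is only an illustration under stated assumptions: it is written in Python rather than the Ruby mentioned above, the function name expand_variables and the abbreviated variable list are hypothetical, and it does not reproduce the CLDR data file or ICU's behaviour.

```python
import re

# Hypothetical, abbreviated variable list (not copied from the CLDR file).
# Order matters: a definition may only use names defined before it,
# and reusing its own name picks up the earlier value.
VARIABLES = [
    ("$Format", r"\p{Word_Break=Format}"),
    ("$Extend", r"\p{Word_Break=Extend}"),
    ("$ZWJ", r"\p{Word_Break=ZWJ}"),
    ("$FEZ", "[$Format $Extend $ZWJ]"),
    ("$ALetter", r"\p{Word_Break=ALetter}"),
    ("$ALetter", "($ALetter $FEZ*)"),
]

# A variable reference is "$" followed by letters or underscores.
VAR_REF = re.compile(r"\$[A-Za-z_]+")


def expand_variables(definitions):
    """Expand definitions in a single pass, first to last."""
    values = {}
    for name, body in definitions:
        # Substitute values known so far. A reference to a name that has
        # not been defined yet raises KeyError instead of recursing.
        expanded = VAR_REF.sub(lambda m: values[m.group(0)], body)
        # Only now is the name (re)bound, so "$ALetter" inside its own
        # body was replaced by the previous value, exactly once.
        values[name] = expanded
    return values


if __name__ == "__main__":
    for name, value in expand_variables(VARIABLES).items():
        print(name, "=", value)
```

Because each body is expanded exactly once against the values already bound, the expansion terminates and the final $ALetter contains no remaining variable references. The sketch does not address the separate problem discussed above, namely that a parenthesised sequence must not end up inside a square-bracketed set.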