From cldr-users at unicode.org Wed Oct 16 15:32:37 2019
From: cldr-users at unicode.org (梁海 Liang Hai via CLDR-Users)
Date: Wed, 16 Oct 2019 13:32:37 -0700
Subject: An adorable candidate for an CLDR mascot!
Message-ID: <0421B22B-D548-4764-886F-8A5527122180@gmail.com>

Today at IUC, Ben Yang showed this adorable "seal-deer" ("CLDR") creature at his "ICU Does That? Lesser-known Features of ICU, How to Use Them, How to Extend Them, and How it All Relates to the CLDR" talk:

[Three JPEG attachments (image0.jpeg, image1.jpeg, image2.jpeg) were scrubbed from the archive.]

Best,
梁海 Liang Hai
https://lianghai.github.io

From cldr-users at unicode.org Sat Oct 19 15:46:20 2019
From: cldr-users at unicode.org (Matthew Stuckwisch via CLDR-Users)
Date: Sat, 19 Oct 2019 16:46:20 -0400
Subject: Interpreting t-h0- mechanism
Message-ID:

The T extension for BCP-47 is contained in CLDR and described at length in TR35, so I hope this is the correct forum for the question. I'm implementing a module for handling language tags and have a few questions relating to the h0 tag:

1. What is the difference between 'en-t-es-h0-hybrid' and 'en-t-h0-es'? Both styles are given as example encodings for what would be Spanglish (Spanish-English hybrid), but surely there ought to be a difference.
2. What would be the interpretation of 'en-t-pt-h0-es'? English, transformed from Portuñol (Spanish-Portuguese hybrid), or Spanglish, transformed from Portuguese?
3.
What would be the correct method for indicating, for instance, Portuñol transformed from Franglish (French-English hybrid)?
4. What is the format for a triple language hybrid? Would Asturian-Portuguese-Spanish be 'ast-t-h0-pt-h0-es'?

TR35 is fairly anemic in the h0 mechanism documentation, and particularly with regards to (4), it seems like much was intentionally left as an open question to be handled at a later date, hence my confusion.

Matéu

From cldr-users at unicode.org Sun Oct 20 08:46:56 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 20 Oct 2019 06:46:56 -0700
Subject: Interpreting t-h0- mechanism
In-Reply-To:
References:
Message-ID:

On Sat, Oct 19, 2019 at 1:48 PM Matthew Stuckwisch via CLDR-Users <cldr-users at unicode.org> wrote:

> The T extension for BCP-47 is contained in CLDR and described at length
> in TR35, so I hope this is the correct forum for the question. I'm
> implementing a module for handling language tags and have a few questions
> relating to the h0 tag:
>
> 1. What is the difference between 'en-t-es-h0-hybrid' and 'en-t-h0-es'?
> Both styles are given as example encodings for what would be Spanglish
> (Spanish-English hybrid), but surely there ought to be a difference
>

The latter style is illegal. The key h0 only currently takes one possible value, h0-hybrid.

https://unicode.org/reports/tr35/tr35.html#Hybrid_Locale

If you are seeing "h0-es" someplace in the spec, please let us know, since that would be a typo.

> 2. What would be the interpretation of 'en-t-pt-h0-es'? English,
> transformed from Portuñol (Spanish-Portuguese hybrid), or Spanglish,
> transformed from Portuguese?
>

That is illegal, so it doesn't have any interpretation.

> 3. What would be the correct method for indicating, for instance, Portuñol
> transformed from Franglish (French-English hybrid)?
>

That cannot currently be represented. You can file a ticket if you want to present a case for why that would be important.

> 4.
> What is the format for a triple language hybrid? Would
> Asturian-Portuguese-Spanish be 'ast-t-h0-pt-h0-es'?
>

That is illegal, so it doesn't have any interpretation. There is no syntax for a mixture of more than 2 languages. You can file a ticket if you want to present a case for why that would be important.

> TR35 is fairly anemic in the h0 mechanism documentation, and particularly
> with regards to (4), it seems like much was intentionally left as an open
> question to be handled at a later date, hence my confusion.
>

The validity for key-type combinations is established by referring to the https://github.com/unicode-org/cldr/tree/master/common/bcp47 data. Between that and https://unicode.org/reports/tr35/tr35.html#Hybrid_Locale, if there are any "open questions", please file a new ticket so that we can fix them.

> Matéu
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
>

From cldr-users at unicode.org Sun Oct 20 08:36:14 2019
From: cldr-users at unicode.org (Daniel Yacob via CLDR-Users)
Date: Sun, 20 Oct 2019 09:36:14 -0400
Subject: Regarding $wordBoundary in Katakana-Latin BGN Transformation
Message-ID:

Hi Cameron,

I've just joined the CLDR-Users list and so am not able to reply directly to your September 20th email on this topic. The $wordBoundary variable was introduced in the BGN transformation when they were composed initially in 2006 as a work-around to a bug at the time (#2034 in the old bug tracking system).

I'm sure it is no longer needed and the variable can be replaced with a standard word boundary marker. Hopefully it is otherwise harmless.

In the Katakana-Latin case, the variable should simply be ignored by ICU and have no impact.
The fact that it is unused may be the result of a file copy from another BGN file when the Katakana-Latin file was created and the variable was not removed when it should have been. Or perhaps it was referenced at one time and some later edit removed it.

regards,

-Daniel

From cldr-users at unicode.org Sun Oct 20 08:54:12 2019
From: cldr-users at unicode.org (Daniel Yacob via CLDR-Users)
Date: Sun, 20 Oct 2019 09:54:12 -0400
Subject: Considering Cleanup of Some Transforms
Message-ID:

Greetings,

Looking through the CLDR transforms recently, I noticed mixed conventions in the nodes and thought that I could volunteer some time to make them more consistent. What I have in mind is:

* add child nodes to encompass all rules (impacts 214 files)
* Set XML comments, to # comments (impacts 3 files)
* Fix a commented out header (one file)
* Update Copyright end date to all files (set to 2020 ?)
* replace $wordBoundary variable with corresponding ICU marker (31 files, having introduced this variable I feel some sense of responsibility for it)

I would fork the repo and make a pull request to provide an update. Before undertaking this work though, I would like to have some assurance that a pull request with these updates would be accepted. LMK if so and I will go forward with these revisions -or just some as advised.

thanks!

-Daniel

From cldr-users at unicode.org Sun Oct 20 11:26:55 2019
From: cldr-users at unicode.org (Matthew Stuckwisch via CLDR-Users)
Date: Sun, 20 Oct 2019 12:26:55 -0400
Subject: Interpreting t-h0- mechanism
In-Reply-To:
References:
Message-ID:

>> 1. What is the difference between 'en-t-es-h0-hybrid' and 'en-t-h0-es'? Both styles are given as example encodings for what would be Spanglish (Spanish-English hybrid), but surely there ought to be a difference
>>
> The latter style is illegal.
> The key h0 only currently takes one possible value, h0-hybrid.
>
> https://unicode.org/reports/tr35/tr35.html#Hybrid_Locale
>
> If you are seeing "h0-es" someplace in the spec, please let us know, since that would be a typo.

Indeed, in the documentation it says "Thus Hinglish should be represented as hi-t-h0-en where Hindi is the scaffold, and as en-t-h0-hi where English is", but the table represents the two Hinglishes as hi-t-en-h0-hybrid or en-t-hi-h0-hybrid, which is the source of the initial confusion.

When it said, "Should there ever be strong need for hybrids of more than two languages or for other purposes such as hybrid languages as the source of translated content, additional structure could be added." it was not clear to me that this meant "it is currently not possible to do this" over "by adding in additional structure [e.g. -h0- tags]".

I work occasionally with documents in Eonaviego which would best be coded as ast-t-gl-h0-hybrid, but then when translated to-from (which there are quite a few to/from Asturian or Spanish), there would be no valid encoding, so being able to represent a hybrid language as a source/destination of a transform is not a pure hypothetical for me.

I will go ahead and file some tickets about the docs/support, thanks

Matéu

From cldr-users at unicode.org Sun Oct 20 12:00:42 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 20 Oct 2019 19:00:42 +0200
Subject: Interpreting t-h0- mechanism
In-Reply-To:
References:
Message-ID:

On Sun, Oct 20, 2019 at 6:26 PM Matthew Stuckwisch wrote:

> >> 1. What is the difference between 'en-t-es-h0-hybrid' and
> >> 'en-t-h0-es'? Both styles are given as example encodings for what would be
> >> Spanglish (Spanish-English hybrid), but surely there ought to be a
> >> difference
> >>
> > The latter style is illegal. The key h0 only currently takes one
> > possible value, h0-hybrid.
> >
> > https://unicode.org/reports/tr35/tr35.html#Hybrid_Locale
> >
> > If you are seeing "h0-es" someplace in the spec, please let us know,
> > since that would be a typo.
>
> Indeed, in the documentation it says "Thus Hinglish should be represented
> as hi-t-h0-en where Hindi is the scaffold, and as en-t-h0-hi where English
> is", but the table represents the two Hinglishes as hi-t-en-h0-hybrid or
> en-t-hi-h0-hybrid, which is the source of the initial confusion.
>

Great, thanks for finding these (very misleading) typos.

> When it said, "Should there ever be strong need for hybrids of more than
> two languages or for other purposes such as hybrid languages as the source
> of translated content, additional structure could be added." it was not
> clear to me that this meant "it is currently not possible to do this" over
> "by adding in additional structure [e.g. -h0- tags]".
>

Right, that could be clearer; that the current structure does not permit it.

> I work occasionally with documents in Eonaviego which would best be coded
> as ast-t-gl-h0-hybrid, but then when translated to-from (which there are
> quite a few to/from Asturian or Spanish), there would be no valid encoding,
> so being able to represent a hybrid language as a source/destination of a
> transform is not a pure hypothetical for me.
>

The hybrids were originally designed for cases like Hinglish or Denglish, where there are large numbers of borrowings of words from a different language.

Eonaviego sounds like a set of dialects on the continuum between Asturian and Galician. That is, it doesn't appear to be Asturian with a batch of loan words from Galician. While "h0-hybrid" is currently the closest term for it, it might be better to define a new term for that that more precisely identifies "a set of dialects on the continuum between X and Y".

That being said, the structure isn't designed to allow transforms to or from these h0 entities; we'd have to think of how it could be extended for that.

> I will go ahead and file some tickets about the docs/support, thanks
>

Great, glad you found these issues.

>
> Matéu

From cldr-users at unicode.org Sun Oct 20 12:09:55 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 20 Oct 2019 19:09:55 +0200
Subject: Considering Cleanup of Some Transforms
In-Reply-To:
References:
Message-ID:

Thanks, this sounds like a useful project to me.

For something like this, you should file a ticket first, and then you can get the committee's take on the design before doing the work on it. For each one of the planned changes in the ticket, please add a link to a file that shows the issue, and what the change would look like in that file, so that the committee can see exactly what you plan. I give one simple one below.

1. add child nodes to encompass all rules (impacts 214 files)
2. Set XML comments, to # comments (impacts 3 files)
3. Fix a commented out header (one file)
4. Update Copyright end date to all files (set to 2020 ?)

   https://github.com/unicode-org/cldr/blob/master/common/transforms/Amharic-Latin-BGN.xml

   -4 Copyright © 1991-2013 Unicode, Inc.
   +4 Copyright © 1991-2020 Unicode, Inc.

5. replace $wordBoundary variable with corresponding ICU marker (31 files, having introduced this variable I feel some sense of responsibility for it)

Mark

On Sun, Oct 20, 2019 at 5:10 PM Daniel Yacob via CLDR-Users <cldr-users at unicode.org> wrote:

> Greetings,
>
> Looking through the CLDR transforms recently, I noticed mixed conventions
> in the nodes and thought that I could volunteer some time to make
> them more consistent. What I have in mind is:
>
> * add child nodes to encompass all rules (impacts 214
> files)
> * Set XML comments, to # comments (impacts 3 files)
> * Fix a commented out header (one file)
> * Update Copyright end date to all files (set to 2020 ?)
> * replace $wordBoundary variable with corresponding ICU marker (31 files,
> having introduced this variable I feel some sense of responsibility for it)
>
> I would fork the repo and make a pull request to provide an update.
> Before undertaking this work though, I would like to have some assurance
> that a pull request with these updates would be accepted. LMK if so and I
> will go forward with these revisions -or just some as advised.
>
> thanks!
>
> -Daniel

From cldr-users at unicode.org Sun Oct 20 12:12:56 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 20 Oct 2019 19:12:56 +0200
Subject: Regarding $wordBoundary in Katakana-Latin BGN Transformation
In-Reply-To:
References:
Message-ID:

If that is now handled by ICU (eg there wouldn't be a functional difference) this sounds great to me. (See my remarks on other email).

Mark

On Sun, Oct 20, 2019 at 5:08 PM Daniel Yacob via CLDR-Users <cldr-users at unicode.org> wrote:

> Hi Cameron,
>
> I've just joined the CLDR-Users list and so am not able to reply directly
> to your September 20th email on this topic. The $wordBoundary variable was
> introduced in the BGN transformation when they were composed initially in
> 2006 as a work-around to a bug at the time (#2034 in the old bug tracking
> system).
>
> I'm sure it is no longer needed and the variable can be replaced with a
> standard word boundary marker. Hopefully it is otherwise harmless.
>
> In the Katakana-Latin case, the variable should simply be ignored by ICU
> and have no impact. The fact that it is unused may be the result of a file
> copy from another BGN file when the Katakana-Latin file was created and the
> variable was not removed when it should have been.
> Or perhaps it was
> referenced at one time and some later edit removed it.
>
> regards,
>
> -Daniel

From cldr-users at unicode.org Sun Oct 20 14:58:20 2019
From: cldr-users at unicode.org (Doug Ewell via CLDR-Users)
Date: Sun, 20 Oct 2019 13:58:20 -0600
Subject: Interpreting t-h0- mechanism
Message-ID: <000a01d58780$ba8faf40$2faf0dc0$@ewellic.org>

Mark Davis wrote on the CLDR list:

>> I work occasionally with documents in Eonaviego which would best be
>> coded as ast-t-gl-h0-hybrid, but then when translated to-from (which
>> there are quite a few to/from Asturian or Spanish), there would be no
>> valid encoding, so being able to represent a hybrid language as a
>> source/destination of a transform is not a pure hypothetical for me.
>
> The hybrids were originally designed for cases like Hinglish or
> Denglish, where there are large numbers of borrowings of words from a
> different language. Eonaviego sounds like a set of dialects on the
> continuum between Asturian and Galician. That is, it doesn't appear to
> be Asturian with a batch of loan words from Galician.

It sounds like the best course of action might be to investigate adding a BCP 47 variant, rather than trying to shoehorn this dialectical situation into the T extension.
https://en.wikipedia.org/wiki/Galician-Asturian

--
Doug Ewell | Thornton, CO, US | ewellic.org

From cldr-users at unicode.org Sun Oct 20 16:28:31 2019
From: cldr-users at unicode.org (Matthew Stuckwisch via CLDR-Users)
Date: Sun, 20 Oct 2019 17:28:31 -0400
Subject: Interpreting t-h0- mechanism
In-Reply-To: <000a01d58780$ba8faf40$2faf0dc0$@ewellic.org>
References: <000a01d58780$ba8faf40$2faf0dc0$@ewellic.org>
Message-ID:

> On Oct 20, 2019, at 3:58 PM, Doug Ewell via CLDR-Users wrote:
>>> I work occasionally with documents in Eonaviego which would best be
>>> coded as ast-t-gl-h0-hybrid, but then when translated to-from (which
>>> there are quite a few to/from Asturian or Spanish), there would be no
>>> valid encoding, so being able to represent a hybrid language as a
>>> source/destination of a transform is not a pure hypothetical for me.
>>
>> The hybrids were originally designed for cases like Hinglish or
>> Denglish, where there are large numbers of borrowings of words from a
>> different language. Eonaviego sounds like a set of dialects on the
>> continuum between Asturian and Galician. That is, it doesn't appear to
>> be Asturian with a batch of loan words from Galician.
>
> It sounds like the best course of action might be to investigate adding a BCP 47 variant, rather than trying to shoehorn this dialectical situation into the T extension.

That's certainly fair, but I just used it as a quick example. In my head 'hybrid' doesn't imply exclusively lexical borrowing, and perhaps we could include other terms to describe the relationship more explicitly. For example, 'hybrid' for lexical borrowing (more for backwards compatibility than precision), 'codeswap' for true codeswapping, 'diacont' for being the intermediate in a dialect continuum, 'mixed' for an intermixing where both grammar and lexicon are taken, etc.

For fairly established mixes I think variants (or even language codes, as many mixed languages have them)
are a good idea, especially with the advantage they can be used anywhere the singletons can't be used.

Matéu

From cldr-users at unicode.org Tue Oct 22 13:08:42 2019
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Tue, 22 Oct 2019 11:08:42 -0700
Subject: Regarding $wordBoundary in Katakana-Latin BGN Transformation
In-Reply-To:
References:
Message-ID:

Hey Daniel,

Thanks for your reply, I was getting concerned my emails weren't making it through :/

As it turns out, I was able to fix the issue I described in my Sept 20th email by fixing several bugs in my implementation. The code was (successfully?) ignoring the $wordBoundary variable, but I thought it perhaps had some bearing on the test failures I was seeing, so I wanted to clarify with the users group.

Thanks again,

-Cameron

On Sun, Oct 20, 2019 at 10:15 AM Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:

> If that is now handled by ICU (eg there wouldn't be a functional
> difference) this sounds great to me. (See my remarks on other email).
>
> Mark
>
>
> On Sun, Oct 20, 2019 at 5:08 PM Daniel Yacob via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> Hi Cameron,
>>
>> I've just joined the CLDR-Users list and so am not able to reply directly
>> to your September 20th email on this topic. The $wordBoundary variable was
>> introduced in the BGN transformation when they were composed initially in
>> 2006 as a work-around to a bug at the time (#2034 in the old bug tracking
>> system).
>>
>> I'm sure it is no longer needed and the variable can be replaced with a
>> standard word boundary marker. Hopefully it is otherwise harmless.
>>
>> In the Katakana-Latin case, the variable should simply be ignored by ICU
>> and have no impact. The fact that it is unused may be the result of a file
>> copy from another BGN file when the Katakana-Latin file was created and the
>> variable was not removed when it should have been.
>> Or perhaps it was
>> referenced at one time and some later edit removed it.
>>
>> regards,
>>
>> -Daniel
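
The rule stated in the "Interpreting t-h0- mechanism" thread (the key h0 currently takes only the value "hybrid", and the second language goes in the tlang position, as in en-t-es-h0-hybrid) can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a BCP 47 validator: the names split_t_extension and is_hybrid_locale are hypothetical, the code assumes "-t-" is the only extension in the tag, and it does not check any subtag against the registry or against the CLDR bcp47 data that the thread points to for real key/type validation.

```python
import re

# A T-extension field key is a letter followed by a digit, e.g. "h0".
TKEY = re.compile(r"[a-z][0-9]")


def split_t_extension(tag):
    """Split a tag into (base subtags, tlang subtags, [(tkey, values), ...]).

    Hypothetical helper for illustration. Assumes "-t-" is the only
    extension in the tag; returns None if there is no T extension.
    """
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return None
    i = subtags.index("t")
    base, rest = subtags[:i], subtags[i + 1:]
    tlang, fields = [], []
    for subtag in rest:
        if TKEY.fullmatch(subtag):
            fields.append((subtag, []))     # a new field starts at each key
        elif fields:
            fields[-1][1].append(subtag)    # value subtag of the current field
        else:
            tlang.append(subtag)            # still inside the tlang part
    return base, tlang, fields


def is_hybrid_locale(tag):
    """True only for the shape <base>-t-<tlang>-h0-hybrid (simplified check)."""
    parts = split_t_extension(tag)
    if parts is None:
        return False
    base, tlang, fields = parts
    h0_values = [values for key, values in fields if key == "h0"]
    # Exactly one h0 field, whose only value is "hybrid", plus a tlang.
    return bool(base) and bool(tlang) and h0_values == [["hybrid"]]
```

With this sketch, is_hybrid_locale accepts 'en-t-es-h0-hybrid' and 'hi-t-en-h0-hybrid' and rejects the three forms the thread calls illegal: 'en-t-h0-es', 'en-t-pt-h0-es' and 'ast-t-h0-pt-h0-es'.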