From cldr-users at unicode.org Wed Nov 14 12:54:30 2018
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Wed, 14 Nov 2018 19:54:30 +0100
Subject: Re: Igbo [ig] exemplar characters set should not have ẹ
In-Reply-To:
References:
Message-ID:

We are planning to move to Jira, which should take care of the spam issues.

The question is whether ẹ U+1EB9 was used historically, or is used in
special circumstances in Igbo (e.g. foreign words, dialects, etc.); in
either of those cases it should be moved to "aux".

Mark

On Wed, Jul 25, 2018 at 11:21 AM Denis Jacquerye via CLDR-Users
<cldr-users at unicode.org> wrote:

> Hi,
>
> I'm posting this here since the spambot refuses to let me open new issues
> on the Bug Tracking site.
>
> ẹ U+1EB9 doesn't seem to be used in Igbo, yet it is listed in the CLDR
> Igbo [ig] locale default exemplar characters set.
> It cannot be found in:
> - Igbo Wikipedia site
> - BBC Igbo site
> - Wikipedia Igbo Language article
> - Kay Williamson, *Dictionary of Ọ̀nị̀chà Igbo*, 2006
> - Yvonne C. Mbanefo, *Okowaokwu Igbo Umuaka: Igbo Dictionary for
> Children*, 2016
> - Windows Igbo keyboard layout
>
> - I haven't checked Ayo Bamgbose, Orthographies of Nigerian languages:
> manual I, Lagos, Federal Ministry of Education, National Language
> Centre, 1982.
>
> ẹ U+1EB9 should be removed from the Igbo [ig] default exemplar characters
> set.
> Cheers,
> --
> Denis Moyogo Jacquerye
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
-------------- next part --------------
An HTML attachment was scrubbed...
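The corpus check described in this thread can be sketched in a few lines (illustrative only, not CLDR tooling; the sample text is a hypothetical stand-in for a real corpus):

```python
# Illustrative sketch: scan a text sample for occurrences of a character
# such as U+1EB9 before proposing an exemplar-set change. The sample
# string below is hypothetical, not a verified corpus.

def occurrences(text: str, char: str):
    """Return (count, words containing `char`) for a text sample."""
    words = [w for w in text.split() if char in w]
    return text.count(char), words

sample = "Kedụ ka ị mere? Ọ dị mma."  # hypothetical standard-Igbo snippet
count, hits = occurrences(sample, "\u1eb9")
print(count, hits)  # expect no hits: standard orthography uses ị/ọ/ụ, not ẹ
```

Running this over the sources Denis lists (Wikipedia dumps, dictionary text, etc.) is the kind of evidence the thread is asking for.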
URL:

From cldr-users at unicode.org Wed Nov 14 14:23:32 2018
From: cldr-users at unicode.org (Lorna Evans via CLDR-Users)
Date: Wed, 14 Nov 2018 14:23:32 -0600
Subject: Re: Igbo [ig] exemplar characters set should not have ẹ
In-Reply-To:
References:
Message-ID: <4de8abf6-4159-40cd-3d28-5d2321a3509c@sil.org>

FYI U+1EB9 is not listed for Igbo in Hartell's "Alphabets of Africa" either.

On 11/14/2018 12:54 PM, Mark Davis ☕️ via CLDR-Users wrote:
> We are planning to move to Jira, which should take care of the spam
> issues.
>
> The question is whether ẹ U+1EB9 was used historically, or is used in
> special circumstances in Igbo (e.g. foreign words, dialects, etc.); in
> either of those cases it should be moved to "aux".
>
> Mark
>
> On Wed, Jul 25, 2018 at 11:21 AM Denis Jacquerye via CLDR-Users
> <cldr-users at unicode.org> wrote:
>
>     Hi,
>
>     I'm posting this here since the spambot refuses to let me open new
>     issues on the Bug Tracking site.
>
>     ẹ U+1EB9 doesn't seem to be used in Igbo, yet it is listed in the
>     CLDR Igbo [ig] locale default exemplar characters set.
>     It cannot be found in:
>     - Igbo Wikipedia site
>     - BBC Igbo site
>     - Wikipedia Igbo Language article
>     - Kay Williamson, /Dictionary of Ọ̀nị̀chà Igbo/, 2006
>     - Yvonne C. Mbanefo, /Okowaokwu Igbo Umuaka: Igbo Dictionary for
>     Children/, 2016
>     - Windows Igbo keyboard layout
>
>     * I haven't checked Ayo Bamgbose, Orthographies of Nigerian
>     languages: manual I, Lagos, Federal Ministry of Education,
>     National Language Centre, 1982.
>
>     ẹ
U+1EB9 should be removed from the Igbo [ig] default exemplar > characters set > Cheers, > -- > Denis Moyogo Jacquerye > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Wed Nov 14 14:33:54 2018 From: cldr-users at unicode.org (Asmus Freytag via CLDR-Users) Date: Wed, 14 Nov 2018 12:33:54 -0800 Subject: Igbo [ig] exemplar characters set should not have ? In-Reply-To: References: Message-ID: <766d8932-1dfb-588c-e019-e84eed06d6dd@ix.netcom.com> An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Wed Nov 14 15:58:47 2018 From: cldr-users at unicode.org (Patrick Chew via CLDR-Users) Date: Wed, 14 Nov 2018 13:58:47 -0800 Subject: =?UTF-8?Q?Re=3A_Igbo_=5Big=5D_exemplar_characters_set_should_not_hav?= =?UTF-8?Q?e_=E1=BA=B9?= In-Reply-To: <4de8abf6-4159-40cd-3d28-5d2321a3509c@sil.org> References: <4de8abf6-4159-40cd-3d28-5d2321a3509c@sil.org> Message-ID: While unlikely to be canonical resource, the following show use of , unmarked-as-dialect *and* overtly marked as Igbo dialects: https://wikitravel.org/en/Igbo_phrasebook An ?k?r? ??kp?? resist-dyed with nsibidi symbols ??g??n?? kw?n??/g??n?? m??r?? very direct and informal, literally 'what's happening'. How are you? K?d? k? ?m??r??? Thank you. D?l??/Im??l?. buttocks ??k?? red mm?-mm?, uhie Where does this train/bus go? ?b? ?l? ka ?gbo igw?/bosu nka na ga? Where is the train/bus to _____? ?b? ?l? ka ?gbo igw?/bosu d?, nke na ga _____? Does this train/bus stop in _____? ?gbo igw?/bosu nka, ? n? k?sh? na _____? When does the train/bus for _____ leave? Mgbe ?le ka ?gbo igw?/bosu nke na ga _____? When will this train/bus arrive in _____? 
Mgbe ?le ka ?gbo igw?/bosu nk? gi ru _____? ...the airport? ... ??p??t??? A glass of red/white wine, please. Nkalama ?m??y? mm? mm?/?ch?, biko. Do you have this in my size? ? nw?r? ih?a na ?s?m?/? nw?r? ih?a na am?m? You're cheating me. ? na ? f?b?m na ?ny?./I na ? m?r?m mu jobu. Can I have a bag? ? nw?r? ?kp?? I haven't done anything wrong. ?? d??gh?? ?hy?? m??r??. Ch?n?k? ? kw?l? ??hy?? ??j?? 'God will not allow a bad thing' An exclamation made out of shock when a bad thing happens. https://books.google.com/books?id=FD4iDAAAQBAJ&pg=PA58&dq=%E1%BA%B9&hl=en&sa=X&ved=0ahUKEwj2v_CO7tTeAhVRJDQIHWbKDzMQ6AEIKjAA#v=onepage&q=%E1%BA%B9&f=false "Affixation and Auxiliaries in Igbo" by Onumajuru, Virginia Chinwe ?n?cha Igbo https://www.academia.edu/23308295/On_the_Vowels_of_Imilike_Dialect_of_the_Igbo_Laanguage "On the Vowels of Imilike Dialect of the Igbo Laanguage" (Gerald Nweya) Imilike Igbo https://www.researchgate.net/publication/253846534_Reflections_on_Address_Politeness_in_Igbo_Family Mbieri Igbo http://www.skase.sk/Volumes/JTL26/pdf_doc/04.pdf "? there are many sounds (mainly consonants found in some other dialects of Igbo which are lacking in the Onwu Orthography." On Wed, Nov 14, 2018 at 12:25 PM Lorna Evans via CLDR-Users < cldr-users at unicode.org> wrote: > FYI U+1EB9 is not listed for Igbo in Hartell's "Alphabets of Africa" > either. > On 11/14/2018 12:54 PM, Mark Davis ?? via CLDR-Users wrote: > > We are planning to move to Jira, which should take care of the spam issues. > > The question is whether ? U+1EB9 was used historically or is used special > circumstances in Igbo (eg foreign words, dialects, etc); in either of those > cases it should be moved to "aux". > > Mark > > > On Wed, Jul 25, 2018 at 11:21 AM Denis Jacquerye via CLDR-Users < > cldr-users at unicode.org> wrote: > >> Hi, >> >> I?m posting this here since the spambot refuses to let me open new issues >> on the Bug Tracking site. >> >> ? 
U+1EB9 doesn't seem to be used in Igbo, yet it is listed in the CLDR
>> Igbo [ig] locale default exemplar characters set.
>> It cannot be found in:
>> - Igbo Wikipedia site
>> - BBC Igbo site
>> - Wikipedia Igbo Language article
>> - Kay Williamson, *Dictionary of Ọ̀nị̀chà Igbo*, 2006
>> - Yvonne C. Mbanefo, *Okowaokwu Igbo Umuaka: Igbo Dictionary for
>> Children*, 2016
>> - Windows Igbo keyboard layout
>>
>> - I haven't checked Ayo Bamgbose, Orthographies of Nigerian
>> languages: manual I, Lagos, Federal Ministry of Education, National
>> Language Centre, 1982.
>>
>> ẹ U+1EB9 should be removed from the Igbo [ig] default exemplar characters
>> set.
>> Cheers,
>> --
>> Denis Moyogo Jacquerye
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Tue Nov 20 18:47:03 2018
From: cldr-users at unicode.org (Hugh Paterson via CLDR-Users)
Date: Tue, 20 Nov 2018 16:47:03 -0800
Subject: TR35 § 3.1 and keyboards
Message-ID:

Greetings,

I am reading over the TR35 documentation [1] for categorizing letters as
Auxiliary vs. Main for the purposes of designing keyboard layouts.

I read the following guidance for classifying letters between Auxiliary
and Main:

For a given language, there are a few factors that help for determining
whether a character belongs in the auxiliary set, instead of the main set:

- The character is not available on all normal keyboards.
- It is acceptable to always use spellings that avoid that character.
So my questions are as follows:

1. Is "all normal keyboards" supposed to be interpreted with language
scope, such that it means "all normal German keyboards" when considering
German "letters" and the task of determining whether they are Auxiliary
vs. Main?

2. I'm not actually working on German. I am working with Eastern Dan, a
language of the Ivory Coast. If the answer to question #1 is "yes,
interpret with language scope", then I have a follow-on question: Are
"keyboards" in the TR35 context understood to be keyboard layouts (such
as may be switched with software), or are they considered to be physical
keyboards? In the Eastern Dan context, keyboards are French or English,
as those are the two types of physical computers which make it into the
language-use context. However, it seems a bit silly to consider some of
the characters in the Eastern Dan orthography as "Auxiliary" just
because they don't appear on the set of "all physical keyboards".

3. When designing a new keyboard layout, or working with a language
which does not have a keyboard (or keyboard layout), how is one advised
to approach the distinction between Auxiliary vs. Main?

[1]: https://www.unicode.org/reports/tr35/tr35-general.html#Character_Elements

thanks, all the best
--
*Hugh Paterson III*
Innovation Analyst
*Innovation Development & Experimentation*, *SIL International*
*Web*: Contact & CV
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Wed Nov 21 08:42:39 2018
From: cldr-users at unicode.org (Marcel Schneider via CLDR-Users)
Date: Wed, 21 Nov 2018 15:42:39 +0100
Subject: Re: TR35 § 3.1 and keyboards
In-Reply-To:
References:
Message-ID: <785f9f3f-31f5-33fe-8df4-8ea95a827e2c@orange.fr>

On 21/11/2018 01:47, Hugh Paterson via CLDR-Users wrote:
> Greetings,
>
> I am reading over the TR35 documentation [1] for categorizing letters
> as Auxiliary vs. Main for the purposes of designing keyboard
> layouts.
> I read the following guidance for classifying letters between
> Auxiliary and Main:
>
> For a given language, there are a few factors that help for
> determining whether a character belongs in the auxiliary set, instead
> of the main set:
>
> * The character is not available on all normal keyboards.
> * It is acceptable to always use spellings that avoid that character.
>
> So my questions are as follows:
>
> 1. Is "all normal keyboards" supposed to be interpreted with language
> scope, such that it means "all normal German keyboards" when
> considering German "letters" and the task of determining whether they
> are Auxiliary vs. Main?
>
> 2. I'm not actually working on German. I am working with Eastern Dan,
> a language of the Ivory Coast. If the answer to question #1 is "yes,
> interpret with language scope", then I have a follow-on question: Are
> "keyboards" in the TR35 context understood to be keyboard layouts
> (such as may be switched with software), or are they considered to be
> physical keyboards? In the Eastern Dan context, keyboards are French
> or English, as those are the two types of physical computers which
> make it into the language-use context. However, it seems a bit silly
> to consider some of the characters in the Eastern Dan orthography as
> "Auxiliary" just because they don't appear on the set of "all
> physical keyboards".
>
> 3. When designing a new keyboard layout, or working with a language
> which does not have a keyboard (or keyboard layout), how is one
> advised to approach the distinction between Auxiliary vs. Main?
>
> [1]: https://www.unicode.org/reports/tr35/tr35-general.html#Character_Elements

I think that the advice to use the so-called normal keyboards as a
shibboleth is likely to be an unfortunate holdover in the spec.
The Information Hub for Linguists [2] does not mention it when quoting
this snippet:

| "The test to see whether or not a letter belongs in the main set
| is based on whether it is acceptable in your language to always use
| spellings that avoid that character. For example, English characters
| do not contain the accented letters that are sometimes seen in words
| like résumé or naïve, because it is acceptable in common practice to
| spell those words without the accents."

We see that keyboards can safely remain out of consideration when
sorting letters into standard vs. auxiliary. E.g. the German keyboard
you cited supports much more than the German standard letters, given
that it has dead keys for accented letters; moreover, the new standard
keyboard for Germany [3] supports all official European languages.

The reverse is also true: designing keyboard layouts based on which
letters are standard and which are auxiliary risks resulting in
suboptimal layouts, at least so far as I can tell from my experience.
E.g. in French we have "œ" and "æ" as mandatory standard letters, yet
they may be accessed in a Group 2 defined not as a label for levels 3
and 4, but as a real group accessed via a dead-key group selector. That
is already implemented on the new German standard multilingual keyboard
referred to above, but it may also fit a French keyboard, because it
makes for a less disruptive, more streamlined, less confusing and more
respectful keyboard layout that is both powerful and easy to use. That
is achieved by not sticking with the rule stipulating that all standard
national letters must be in Group 1 and all auxiliary, i.e. foreign,
letters are relegated to Group 2. œ and æ now happen to be conveniently
grouped together with ?, ?, ?, ?, ?, ? and many more letters; so far
I'm happy with it and am currently working on the documentation.

1.
When considering level 1 of the German keyboard shipping with Windows,
we do find all German standard lowercase letters, but the same won't
hold true for French: all circumflex-accented vowels are accessed
through a dead key, while e and i with diaeresis are typed using
another dead-key position (or place, so as not to interfere with part 7
of UTS #35). They are still standard letters, while doing the same on a
German keyboard yields auxiliary letters. Hence normal keyboards are
not a means to safely sort letters into categories.

That also helps answer question 2.

3. The second rule applies:

| "It is acceptable to always use spellings that avoid that character."

The stress is on *always*. E.g. in French it is (still) acceptable to
use spellings avoiding "œ" and "æ", but not always, i.e. not in all
books nor in all handwriting. It may be unacceptable in other written
material as well.

A point that I consider very important is implementability, because it
conditions usability. Marc Durdin advised long ago that we'd be well
advised not to implement the third level on Windows as a Ctrl+Alt key
combo (0x06), whether or not mapped to a single key on the right:

https://blog.keyman.com/2008/06/robust-key-mess/

But given that in Windows this is the only CapsLock-sensitive level
pair besides the classic Base and Shift levels, by mapping letters to
levels 3 and 4 on Windows we risk running into issues. Using a dead-key
group like Karl Pentzlin did for Germany is the way to go.

I'm not qualified to advise on designing keyboards for languages of
Africa, but I remember that a similar experience took place in Togo, as
discussed on the Unicode Public Mailing List:

https://unicode.org/mail-arch/unicode-ml/y2016-m02/0071.html
Continued: https://unicode.org/mail-arch/unicode-ml/y2016-m11/0005.html

Hope that helps. Good luck!
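The dead-key point above can be sketched concretely. All three sets below are simplified, hypothetical excerpts (not CLDR data and not a complete AZERTY definition): whether a standard letter is "available on the keyboard" depends on dead keys, not just on the engraved key caps.

```python
# Hypothetical, simplified sets for illustration only.
main = set("abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ")  # French main letters
direct = set("abcdefghijklmnopqrstuvwxyzàçéèù")           # engraved on AZERTY
dead_key = set("âêîôûëïüÿ")                               # via ^ and ¨ dead keys

via_dead_key = (main - direct) & dead_key   # standard, but need a dead key
unreachable = main - direct - dead_key      # standard, yet not typable at all

print(sorted(via_dead_key))
print(sorted(unreachable))  # œ and æ on this sketch, matching the point above
```

On this toy data, several standard French letters are typable only through dead keys, and œ/æ not at all, which is why "available on all normal keyboards" makes a poor main-vs.-auxiliary test.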
Best regards,
Marcel

[2] https://sites.google.com/site/cldr/translation/characters#TOC-Exemplar-Characters
[3] DIN 2137-2

From cldr-users at unicode.org Wed Nov 21 18:41:27 2018
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Thu, 22 Nov 2018 08:41:27 +0800
Subject: Formatting a Number Range
Message-ID: <4F7B6547-BBD0-4CC6-A1F4-CD5246164BDC@gmail.com>

In TR35 section 2.4.1 I see:

> Formats can be supplied for numbers (as above) or for currencies or
> other units. They can also be used with ranges of numbers, resulting
> in formatting strings like "$10K" or "$3–7M".

However, other than the more generic miscellaneous format for a range
(typically "{0}-{1}"), I'm unclear how I would format a range using the
example above.

I can see formatting each end of the range of course, and combining
using the range format "{0}-{1}". But I've no idea how to resolve the
format that would result in an output of "$3–7M", since all of the
short formats (and format masks) assume a single number.

What am I missing?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Thu Nov 22 00:31:09 2018
From: cldr-users at unicode.org (Marcel Schneider via CLDR-Users)
Date: Thu, 22 Nov 2018 07:31:09 +0100
Subject: Formatting a Number Range
In-Reply-To: <4F7B6547-BBD0-4CC6-A1F4-CD5246164BDC@gmail.com>
References: <4F7B6547-BBD0-4CC6-A1F4-CD5246164BDC@gmail.com>
Message-ID:

On 22/11/2018 01:41, Kip Cole via CLDR-Users wrote:
> In TR35 section 2.4.1 I see:
>
>> Formats can be supplied for numbers (as above) or for currencies or
>> other units. They can also be used with ranges of numbers,
>> resulting in formatting strings like "$10K" or "$3–7M".
>
> However, other than the more generic miscellaneous format for a range
> (typically "{0}-{1}"), I'm unclear how I would format a range using
> the example above.
>
> I can see formatting each end of the range of course, and combining
> using the range format "{0}-{1}".
But I've no idea how to resolve
> the format that would result in an output of "$3–7M", since all of the
> short formats (and format masks) assume a single number.
>
> What am I missing?

Indeed, the puzzle as I see it is that "$3–7M" is basically a
non-standard format, because large figures in ranges should not be
abbreviated (assuming that the meaning is not "from three dollars to
seven million dollars"):

| "Note that when expressing a range with very large numbers, to avoid
| confusion, the first number should not be abbreviated; for example,
| '$75–$80,000' means 'from $75 to $80,000,' not 'from $75,000 to
| $80,000.'"

https://www.dailywritingtips.com/use-a-dash-for-number-ranges/

For clarifying this, as well as the main-vs.-auxiliary question in the
keyboarding thread, we're really waiting for an authoritative response.

Thanks.

Best regards,
Marcel

From cldr-users at unicode.org Thu Nov 22 01:26:13 2018
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Thu, 22 Nov 2018 08:26:13 +0100
Subject: Formatting a Number Range
In-Reply-To:
References: <4F7B6547-BBD0-4CC6-A1F4-CD5246164BDC@gmail.com>
Message-ID:

Agreed that 3–7M is best avoided, since it is ambiguous for readers:
(3–7)M vs. 3-(7M).

Mark

On Thu, Nov 22, 2018 at 7:32 AM Marcel Schneider via CLDR-Users
<cldr-users at unicode.org> wrote:

> On 22/11/2018 01:41, Kip Cole via CLDR-Users wrote:
> > In TR35 section 2.4.1 I see:
> >
> >> Formats can be supplied for numbers (as above) or for currencies or
> >> other units. They can also be used with ranges of numbers,
> >> resulting in formatting strings like "$10K" or "$3–7M".
> >
> > However, other than the more generic miscellaneous format for a range
> > (typically "{0}-{1}"), I'm unclear how I would format a range using
> > the example above.
> >
> > I can see formatting each end of the range of course, and combining
> > using the range format "{0}-{1}".
But I've no idea how to resolve
> > the format that would result in an output of "$3–7M", since all of the
> > short formats (and format masks) assume a single number.
> >
> > What am I missing?
>
> Indeed, the puzzle as I see it is that "$3–7M" is basically a
> non-standard format, because large figures in ranges should not
> be abbreviated (assuming that the meaning is not "from three dollars
> to seven million dollars"):
>
> | "Note that when expressing a range with very large numbers, to avoid
> | confusion, the first number should not be abbreviated; for example,
> | '$75–$80,000' means 'from $75 to $80,000,' not 'from $75,000 to
> | $80,000.'"
>
> https://www.dailywritingtips.com/use-a-dash-for-number-ranges/
>
> For clarifying this, as well as the main-vs.-auxiliary question in the
> keyboarding thread, we're really waiting for an authoritative response.
>
> Thanks.
>
> Best regards,
> Marcel
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Fri Nov 30 08:09:48 2018
From: cldr-users at unicode.org (Marcel Schneider via CLDR-Users)
Date: Fri, 30 Nov 2018 15:09:48 +0100
Subject: Hard-to-use "annotations" files in LDML
Message-ID: <32b9eb3b-6fdb-2b22-a6bb-0f34fdf1d34d@orange.fr>

The first attribute in common/annotations/*.xml and
common/annotationsDerived/*.xml claims to be the code point ("cp"), but
it has all values showing up as literals. E.g. the first 'annotation'
element in annotations/fr.xml is:

peau | peau claire

So we need to use a tool to learn that the code point is actually U+1F3FB.
I'd like that element to be this way:

peau | peau claire

A ticket had been filed about that 4 months ago, but it is still
unaccepted and unscheduled (if my understanding of "milestone UNSCH" is
correct):

Adding code point scalar values in LDML
https://unicode.org/cldr/trac/ticket/11289

I think it is essential to be able to edit these files in a
straightforward way, first because, as per the instructions [1], we
need to remove all keywords that are just echoing the emoji name. In
the example above, given the emoji name is "peau claire", we'll need to
remove both "peau" ("skin", because it's the starting word of the emoji
name, "peau claire") and "peau claire" ("light skin tone", because it's
the emoji name). On the other hand, rather than leaving the field
blank, we may add "blanc" ("white"), because people with light skin
tone may be referred to as "white" people. And we should follow the
English example by adding the Fitzpatrick "type 1–2".

By the way, I don't know why the file still shows "peau claire" in
fr-FR while the chart doesn't:

https://www.unicode.org/cldr/charts/34/annotations/romance.html

After editing is completed, files are to be uploaded using the bulk
submission facility of the SurveyTool, according to earlier discussions.

Hence we would need a tool that adds the code points from the literals
and then, before submission, cleans the code points away (easily done
by passing a regex). The reason is that the literals may be either
unsupported or hard to recognize in a text editor.

By contrast, if the code points were part of LDML, they would need
neither to be added nor removed. That would of course break lots of
things, and require changes to the DTD of LDML:

In order to browse Emojipedia or other sources alongside, it would be
best to sort elements by code points.
Doing edits in SurveyTool is of course the easiest way, but a text editor allows to work very fast, while SurveyTool may be used to fine-tune the result in a second take. There is one thing that is most tedious, that is every vetter has to do the cleanup by him- or herself, while collaborating on an LDML file prior to sharing it would enable all vetters to submit a bulk of cleared votes, and then to easily check ST without having to do any more than a handful edits. Such a method would help significantly streamline CLDR survey and vetting, ultimately allowing organizations to set Coverage level to Comprehensive for everyone. (About that, please see ticket "Coverage:Comprehensive obfuscated in documentation and communication, while useful goal": https://unicode.org/cldr/trac/ticket/11524 ) [1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords Quote: | Here are some tips on how to be mindful of the number of keywords: | ? Don't add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, sak?}. | ? Don?t add emoji names (these will be added automatically) | ? Don?t add repeats of words starting with the same starting word in the emoji name. Best regards, Marcel From cldr-users at unicode.org Fri Nov 30 13:50:57 2018 From: cldr-users at unicode.org (Asmus Freytag via CLDR-Users) Date: Fri, 30 Nov 2018 11:50:57 -0800 Subject: Hard-to-use "annotations" files in LDML In-Reply-To: <32b9eb3b-6fdb-2b22-a6bb-0f34fdf1d34d@orange.fr> References: <32b9eb3b-6fdb-2b22-a6bb-0f34fdf1d34d@orange.fr> Message-ID: <4ace2465-2a07-4ec6-8ece-a25ec31300fa@ix.netcom.com> An HTML attachment was scrubbed... 
URL:

From cldr-users at unicode.org Fri Nov 30 14:57:51 2018
From: cldr-users at unicode.org (Marcel Schneider via CLDR-Users)
Date: Fri, 30 Nov 2018 21:57:51 +0100
Subject: Re: Hard-to-use "annotations" files in LDML
In-Reply-To: <4ace2465-2a07-4ec6-8ece-a25ec31300fa@ix.netcom.com>
References: <32b9eb3b-6fdb-2b22-a6bb-0f34fdf1d34d@orange.fr> <4ace2465-2a07-4ec6-8ece-a25ec31300fa@ix.netcom.com>
Message-ID: <40518fe4-6683-38c4-a793-2d65f1883fa7@orange.fr>

On 30/11/2018 20:50, Asmus Freytag via CLDR-Users wrote:
> Agree with you, using literals for a "CP" field is just bad schema design.
> A./

Thank you. My first Trac ticket ever was about the same problem, but in
the charts:

https://unicode.org/cldr/trac/ticket/10206

It was 20 months ago, when I knew even less English than today, so it
was titled:

annotations pages missing CP column

And started: "The annotations pages are of limited use as they are
missing a column for code points, beside the text style glyphs."
"Missing" should read "lacking."

But the reason I'm citing it here is that the suggestion was accepted
(soon) and implemented. Although making changes to an LDML file
structure is far less straightforward than enhancing the charts, I/we
hopefully look forward.

Best regards,
Marcel

> On 11/30/2018 6:09 AM, Marcel Schneider via CLDR-Users wrote:
>> The first attribute in common/annotations/*.xml and
>> common/annotationsDerived/*.xml claims to be the code point ("cp"),
>> but it has all values showing up as literals. E.g. the first
>> 'annotation' element in annotations/fr.xml is:
>>
>> peau | peau claire
>>
>> So we need to use a tool to learn that the code point is actually U+1F3FB.
>> >> I?d like that element to be this way: >> >> peau | peau claire >> >> A ticket had been filed about that 4 months ago, but it is still unaccepted and >> unscheduled (if my understanding of "milestone UNSCH" is correct): >> >> Adding code point scalar values in LDML >> https://unicode.org/cldr/trac/ticket/11289 >> >> I think it is essential to be able to edit these files in a straightforward way, first >> because as per instructions [1] we need to remove all keywords that are just echoing >> the emoji name. In the example above, given the emoji name is "peau claire", we?ll >> need to remove both "peau" ("skin", because it?s the starting word of the emoji name, >> "peau claire") and "peau claire" ("light skin tone", because it?s the emoji name). >> On the other hand, rather than leaving the field blank, we may add "blanc" ("white"), >> because people with light skin tone may be referred to as "white" people. And we should >> follow the English example by adding the Fitzpatrick "type 1?2". >> >> By the way I don?t know why the file still shows "peau claire" in fr-FR while the chart >> doesn?t: >> >> https://www.unicode.org/cldr/charts/34/annotations/romance.html >> >> After editing is completed, files are to be uploaded using the bulk submission facility >> of SurveyTool, according to earlier discussions. >> >> Hence we are to be using a tool that adds the code point from the literals, and then >> before submission, to clean the code points away (easily by passing a regex). The reason >> is that the literals may be either unsupported or hard to recognize in a text editor. >> >> By contrast, if the code points were part of LDML, they wouldn?t have to be neither added >> nor removed. That would of course break lots of things, and require changes to the DTD of >> LDML: >> >> >> ?? >> >> >> In order to browse Emojipedia or other sources alongside, best would be to sort elements >> by code points. 
That may be done on vetter side, given SurveyTool does accept data in any >> order provided it is valid LDML, but sorting hex may not be straightforward. >> >> Doing edits in SurveyTool is of course the easiest way, but a text editor allows to work >> very fast, while SurveyTool may be used to fine-tune the result in a second take. >> >> There is one thing that is most tedious, that is every vetter has to do the cleanup by >> him- or herself, while collaborating on an LDML file prior to sharing it would enable >> all vetters to submit a bulk of cleared votes, and then to easily check ST without >> having to do any more than a handful edits. >> >> Such a method would help significantly streamline CLDR survey and vetting, ultimately >> allowing organizations to set Coverage level to Comprehensive for everyone. >> (About that, please see ticket "Coverage:Comprehensive obfuscated in documentation and >> communication, while useful goal": >> https://unicode.org/cldr/trac/ticket/11524 >> ) >> >> [1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords >> Quote: >> | Here are some tips on how to be mindful of the number of keywords: >> | ???????? Don't add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, sak?}. >> | ???????? Don?t add emoji names (these will be added automatically) >> | ???????? Don?t add repeats of words starting with the same starting word in the emoji name. >> >> Best regards, >> Marcel >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users > > > > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Fri Nov 30 15:12:58 2018 From: cldr-users at unicode.org (Steven R. 
Loomis via CLDR-Users)
Date: Fri, 30 Nov 2018 13:12:58 -0800
Subject: Re: Hard-to-use "annotations" files in LDML
In-Reply-To: <4ace2465-2a07-4ec6-8ece-a25ec31300fa@ix.netcom.com>
References: <32b9eb3b-6fdb-2b22-a6bb-0f34fdf1d34d@orange.fr> <4ace2465-2a07-4ec6-8ece-a25ec31300fa@ix.netcom.com>
Message-ID:

Marcel, Asmus:

Perhaps 'cp' could have been named something else, but the spec is clear:
https://unicode.org/reports/tr35/tr35-general.html#Annotations

> The cp attribute value has two formats: either a single string, or if
> contained within […] a UnicodeSet

It's a string, not a (single) codepoint.

> So we need to use a tool to learn that the code point is actually U+1F3FB.

It's an XML file. There are many ways to process it. You could have a
separate tool which reads the XML file and adds a comment (which is
ignored on upload) that has all of the code points spelled out.

When you say:

> The reason is that the literals may be either unsupported or hard to
> recognize in a text editor.

I don't see this as a reason to change the structure. There are plenty
of other literal strings in CLDR. We could share ideas about which
editors work well; I use emacs and/or VS Code.

Steven

On Fri, Nov 30, 2018 at 11:50 AM Asmus Freytag via CLDR-Users wrote:
>
> Agree with you, using literals for a "CP" field is just bad schema design.
> A./
>
> On 11/30/2018 6:09 AM, Marcel Schneider via CLDR-Users wrote:
>
> The first attribute in common/annotations/*.xml and
> common/annotationsDerived/*.xml claims to be the code point ("cp"),
> but it has all values showing up as literals. E.g. the first
> 'annotation' element in annotations/fr.xml is:
>
> peau | peau claire
>
> So we need to use a tool to learn that the code point is actually U+1F3FB.
>
> I'd like that element to be this way:
>
> peau | peau claire
>
> A ticket had been filed about that 4 months ago, but it is still unaccepted and
> unscheduled (if my understanding of "milestone UNSCH" is correct):
>
> Adding code point scalar values in LDML
> https://unicode.org/cldr/trac/ticket/11289
>
> I think it is essential to be able to edit these files in a straightforward way, first
> because as per the instructions [1] we need to remove all keywords that are just echoing
> the emoji name. In the example above, given the emoji name is "peau claire", we'll
> need to remove both "peau" ("skin", because it's the starting word of the emoji name,
> "peau claire") and "peau claire" ("light skin tone", because it's the emoji name).
> On the other hand, rather than leaving the field blank, we may add "blanc" ("white"),
> because people with a light skin tone may be referred to as "white" people. And we should
> follow the English example by adding the Fitzpatrick "type 1–2".
>
> By the way, I don't know why the file still shows "peau claire" in fr-FR while the chart
> doesn't:
>
> https://www.unicode.org/cldr/charts/34/annotations/romance.html
>
> After editing is completed, files are to be uploaded using the bulk submission facility
> of SurveyTool, according to earlier discussions.
>
> Hence we are to use a tool that adds the code points from the literals and then,
> before submission, cleans the code points away (easily done with a regex). The reason
> is that the literals may be either unsupported or hard to recognize in a text editor.
>
> By contrast, if the code points were part of LDML, they would need neither to be added
> nor removed. That would of course break lots of things, and require changes to the DTD of
> LDML:
>
>
>
> In order to browse Emojipedia or other sources alongside, best would be to sort elements
> by code points.
That may be done on the vetter's side, given that SurveyTool accepts data in any
> order provided it is valid LDML, but sorting hex may not be straightforward.
>
> Doing edits in SurveyTool is of course the easiest way, but a text editor allows
> working very fast, while SurveyTool may be used to fine-tune the result in a second pass.
>
> The most tedious thing is that every vetter has to do the cleanup by
> him- or herself, whereas collaborating on an LDML file prior to sharing it would enable
> all vetters to submit a bulk of cleared votes, and then to easily check ST without
> having to do more than a handful of edits.
>
> Such a method would help significantly streamline CLDR survey and vetting, ultimately
> allowing organizations to set the Coverage level to Comprehensive for everyone.
> (About that, please see the ticket "Coverage:Comprehensive obfuscated in documentation and
> communication, while useful goal":
> https://unicode.org/cldr/trac/ticket/11524
> )
>
> [1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords
> Quote:
> | Here are some tips on how to be mindful of the number of keywords:
> | • Don't add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, saké}.
> | • Don't add emoji names (these will be added automatically)
> | • Don't add repeats of words starting with the same starting word in the emoji name.
>
> Best regards,
> Marcel
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
>
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users

From cldr-users at unicode.org  Fri Nov 30 17:57:42 2018
From: cldr-users at unicode.org (Marcel Schneider via CLDR-Users)
Date: Sat, 1 Dec 2018 00:57:42 +0100
Subject: Hard-to-use "annotations" files in LDML
In-Reply-To: 
References: <32b9eb3b-6fdb-2b22-a6bb-0f34fdf1d34d@orange.fr>
        <4ace2465-2a07-4ec6-8ece-a25ec31300fa@ix.netcom.com>
Message-ID: 

On 30/11/2018 22:12, Steven R. Loomis via CLDR-Users wrote:
> Marcel, Asmus:
>
> Perhaps 'cp' could have been named something else, but the spec is clear:
>
> https://unicode.org/reports/tr35/tr35-general.html#Annotations
>> The cp attribute value has two formats: either a single string, or if contained within […] a UnicodeSet
>
> It's a string, not a (single) codepoint.

I'd have loved to find both the sequence of code points and the literal string right in the file.

>
>> So we need to use a tool to learn that the code point is actually U+1F3FB.
>
> It's an XML file. There are many ways to process it.

Vetters have the choice between using SurveyTool's GUI and the bulk submission
facility. Whenever doing edits in LDML is more efficient than doing them in the GUI,
we're to edit XML, but I'd rather we were not expected to further process the files
beforehand.

>
> You could have a separate tool which reads the XML file, and adds a
> comment (which is ignored on upload) that has all of the code points
> spelled out.

I'm interested in any tool able to do that. Would you please share the one you recommend?

>
> When you say:
>> The reason is that the literals may be either unsupported or hard to recognize in a text editor.
>
> I don't see this as a reason to change the structure.
> There are plenty of other literal strings in CLDR.

Those in the emoji annotations seem to be the only ones that get in the way of
editing the files for the survey. So far I've put together the following list of
files containing what we're supposed to survey in ST:

1) common/main/*.xml
2) common/subdivisions/*.xml
3) common/annotations/*.xml
4) common/annotationsDerived/*.xml
5) common/rbnf/*.xml
6) common/casing/*.xml

If I'm missing some files, please let me know. So far, the only hard-to-read literal
strings are found in: 3) common/annotations/ and 4) common/annotationsDerived/.

>
> We could share ideas about which editors work well; I use emacs and/or VS Code.

Thank you, I've now installed VS Code and the ECDC extension, but the latter doesn't
work for me as per the provided instructions. I don't know whether it's me or the
software. And anyway the code points should be in the file.

Up to now I'm using Gedit on Linux, and Notepad++ on Windows, with the Gedit Draw
Spaces plugin showing nice triangles over invisible characters such as U+2007 and
U+2011 (the sort of thing that we could use in ST too, as already reported).

Not losing much time setting up environments and learning to code tools right now is
essential to me, as I'm very busy all the time, and nevertheless I'll have to get the
upcoming CLDR survey round done. Please help people like me with solutions that work
out of the box.

Thanks.

Marcel

>
> Steven
> On Fri, Nov 30, 2018 at 11:50 AM Asmus Freytag via CLDR-Users
> wrote:
>>
>> Agree with you, using literals for a "CP" field is just bad schema design.
>> A./
>>
>> On 11/30/2018 6:09 AM, Marcel Schneider via CLDR-Users wrote:
>>
>> The first argument in common/annotations/*.xml and common/annotationsDerived/*.xml is
>> claiming to be the code point ("cp"), but it has all values showing up as literals.
>> E.g. the first 'annotation' element in annotations/fr.xml is:
>>
>> <annotation cp="🏻">peau | peau claire</annotation>
>>
>> So we need to use a tool to learn that the code point is actually U+1F3FB.
>>
>> I'd like that element to be this way:
>>
>> peau | peau claire
>>
>> A ticket had been filed about that 4 months ago, but it is still unaccepted and
>> unscheduled (if my understanding of "milestone UNSCH" is correct):
>>
>> Adding code point scalar values in LDML
>> https://unicode.org/cldr/trac/ticket/11289
>>
>> I think it is essential to be able to edit these files in a straightforward way, first
>> because as per the instructions [1] we need to remove all keywords that are just echoing
>> the emoji name. In the example above, given the emoji name is "peau claire", we'll
>> need to remove both "peau" ("skin", because it's the starting word of the emoji name,
>> "peau claire") and "peau claire" ("light skin tone", because it's the emoji name).
>> On the other hand, rather than leaving the field blank, we may add "blanc" ("white"),
>> because people with a light skin tone may be referred to as "white" people. And we should
>> follow the English example by adding the Fitzpatrick "type 1–2".
>>
>> By the way, I don't know why the file still shows "peau claire" in fr-FR while the chart
>> doesn't:
>>
>> https://www.unicode.org/cldr/charts/34/annotations/romance.html
>>
>> After editing is completed, files are to be uploaded using the bulk submission facility
>> of SurveyTool, according to earlier discussions.
>>
>> Hence we are to use a tool that adds the code points from the literals and then,
>> before submission, cleans the code points away (easily done with a regex). The reason
>> is that the literals may be either unsupported or hard to recognize in a text editor.
>>
>> By contrast, if the code points were part of LDML, they would need neither to be added
>> nor removed. That would of course break lots of things, and require changes to the DTD of
>> LDML:
>>
>>
>>
>> In order to browse Emojipedia or other sources alongside, best would be to sort elements
>> by code points.
That may be done on the vetter's side, given that SurveyTool accepts data in any
>> order provided it is valid LDML, but sorting hex may not be straightforward.
>>
>> Doing edits in SurveyTool is of course the easiest way, but a text editor allows
>> working very fast, while SurveyTool may be used to fine-tune the result in a second pass.
>>
>> The most tedious thing is that every vetter has to do the cleanup by
>> him- or herself, whereas collaborating on an LDML file prior to sharing it would enable
>> all vetters to submit a bulk of cleared votes, and then to easily check ST without
>> having to do more than a handful of edits.
>>
>> Such a method would help significantly streamline CLDR survey and vetting, ultimately
>> allowing organizations to set the Coverage level to Comprehensive for everyone.
>> (About that, please see the ticket "Coverage:Comprehensive obfuscated in documentation and
>> communication, while useful goal":
>> https://unicode.org/cldr/trac/ticket/11524
>> )
>>
>> [1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords
>> Quote:
>> | Here are some tips on how to be mindful of the number of keywords:
>> | • Don't add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, saké}.
>> | • Don't add emoji names (these will be added automatically)
>> | • Don't add repeats of words starting with the same starting word in the emoji name.
>>
>> Best regards,
>> Marcel
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>
>>
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
>

From cldr-users at unicode.org  Fri Nov 30 20:48:55 2018
From: cldr-users at unicode.org (Marcel Schneider via CLDR-Users)
Date: Sat, 1 Dec 2018 03:48:55 +0100
Subject: Hard-to-use "annotations" files in LDML
In-Reply-To: 
References: <32b9eb3b-6fdb-2b22-a6bb-0f34fdf1d34d@orange.fr>
        <4ace2465-2a07-4ec6-8ece-a25ec31300fa@ix.netcom.com>
Message-ID: <531cf9b2-b817-eba4-fd3e-a8d82e3ae993@orange.fr>

On 01/12/2018 00:57, Marcel Schneider via CLDR-Users wrote:
> On 30/11/2018 22:12, Steven R. Loomis via CLDR-Users wrote:
[…]
>> We could share ideas about which editors work well; I use emacs and/or VS Code.
>
> Thank you, I've now installed VS Code and the ECDC extension, but the latter doesn't
> work for me as per the provided instructions. I don't know whether it's me or the
> software. And anyway the code points should be in the file.

It now works as specified (I had to click on [Reload] in the extension pane). But among
all the conversions, I cannot find one converting a string to UTF-32, as that is uncommon
in programming. UTF-16 is the closest we can get, as even HTML entities are either mixed
(named and decimal) or all decimal. There is no option to get the entities in hexadecimal
format.

Additionally, VS Code doesn't respect the XKB key bindings. While I've swapped Right
Control and Backspace in xkb/keycodes/evdev, VS Code keeps doing backspace on BKSP, and
does nothing on RCTL.
And Backspace is no real Backspace in VS Code: like Ctrl+Z, it deletes the whole last
run of keystrokes. I'm close to uninstalling VS Code.

>> You could have a separate tool which reads the XML file, and adds a
>> comment (which is ignored on upload) that has all of the code points
>> spelled out.

You know what? Would you mind adding these comments to a copy of the following two files:

https://www.unicode.org/repos/cldr/tags/release-34/common/annotations/fr.xml
https://www.unicode.org/repos/cldr/tags/release-34/common/annotationsDerived/fr.xml

and making the enhanced files available somewhere? If that is easy, you may run the tool
on both directories and post the enhanced clones on SVN.

Thanks in advance.

Best regards,

Marcel
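[The comment-adding workflow discussed above — spell out each cp value as U+XXXX code
points in a comment that upload ignores, then strip the comments again before bulk
submission — can be sketched in a few lines. This is a hypothetical helper, not an actual
CLDR tool: the function names are made up, and the regex-based approach assumes the plain
`<annotation cp="...">` layout of the release-34 files rather than full XML parsing.]

```python
# Unofficial sketch: add a U+XXXX comment before each <annotation> element
# of an LDML annotations file, and remove such comments again later.
import re


def annotate_code_points(xml_text: str) -> str:
    """Insert a <!-- U+XXXX ... --> comment before each <annotation cp="...">."""
    def add_comment(match: re.Match) -> str:
        literal = match.group(1)
        # One U+XXXX token per code point in the cp attribute value.
        hexes = " ".join(f"U+{ord(ch):04X}" for ch in literal)
        return f"<!-- {hexes} -->{match.group(0)}"
    return re.sub(r'<annotation cp="([^"]+)"', add_comment, xml_text)


def strip_code_point_comments(xml_text: str) -> str:
    """Inverse pass before bulk submission: remove the injected comments."""
    return re.sub(r'<!-- (?:U\+[0-9A-F]{4,6} )+-->', "", xml_text)
```

Since the injected text is a plain XML comment, a parser-based upload pipeline ignores it
even without the stripping pass; the regex removal is only needed when diffing or
submitting the raw text.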