Hard-to-use "annotations" files in LDML

Fri Nov 30 17:57:42 CST 2018

On 30/11/2018 22:12, Steven R. Loomis via CLDR-Users wrote:
> Marcel, Asmus:
> 
> Perhaps 'cp' could have been named something else, but the spec is clear:
> 
> https://unicode.org/reports/tr35/tr35-general.html#Annotations
>> The cp attribute value has two formats: either a single string, or if contained within […] a UnicodeSet
> 
> It's a string, not a (single) codepoint.

I’d have loved to find both the sequence of code points and the literal string right in the file.

> 
>> So we need to use a tool to learn that the code point is actually U+1F3FB.
> 
> It's an XML file. There are many ways to process it.

Vetters can choose between SurveyTool’s GUI and its bulk submission facility.
Whenever editing LDML is more efficient than working in the GUI, we’re expected to
edit the XML, but I’d prefer not to have to process the files any further beforehand.

> 
> You could have a separate tool which reads the XML file, and adds a
> comment (which is ignored on upload) that has all of the code points
> spelled out.

I’m interested in any tool able to do that. Would you please share the one you recommend?
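
For illustration only, here is a minimal Python sketch of what such a tool might do. It is
not an existing CLDR tool. It assumes each annotation element starts on its own line and
that cp holds a literal string (a bracketed UnicodeSet value would simply be listed
character by character). The function names and the "cp:" comment format are hypothetical.

```python
import re

# Start of an <annotation> element on its own line; captures indentation and cp value.
ANNOTATION_RE = re.compile(
    r'^(?P<indent>[ \t]*)<annotation\b[^>]*\bcp="(?P<cp>[^"]*)"',
    re.MULTILINE,
)
# The helper comment lines added below, so they can be stripped again before upload.
COMMENT_RE = re.compile(r'^[ \t]*<!-- cp: [^\n]*? -->\n', re.MULTILINE)


def code_points(s):
    """Spell out every code point of a string as U+XXXX, space-separated."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def add_codepoint_comments(xml_text):
    """Insert a comment line listing the code points before each annotation."""
    def repl(match):
        comment = f'{match.group("indent")}<!-- cp: {code_points(match.group("cp"))} -->\n'
        return comment + match.group(0)
    return ANNOTATION_RE.sub(repl, xml_text)


def strip_codepoint_comments(xml_text):
    """Remove the comment lines added by add_codepoint_comments."""
    return COMMENT_RE.sub("", xml_text)


if __name__ == "__main__":
    # "\U0001F3FB" is the light skin tone modifier from annotations/fr.xml.
    sample = '\t\t<annotation cp="\U0001F3FB">peau | peau claire</annotation>\n'
    annotated = add_codepoint_comments(sample)
    print(annotated, end="")
    # Stripping the comments restores the original text.
    assert strip_codepoint_comments(annotated) == sample
```

A regex over the raw text is used here instead of an XML parser so that the rest of the
file (formatting, existing comments, DOCTYPE) stays byte-for-byte untouched.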

> 
> When you say:
>> The reason is that the literals may be either unsupported or hard to recognize in a text editor.
> 
> I don't see this as a reason to change the structure.  There are
> plenty of other literal strings in CLDR.

Those in the emoji annotations seem to be the only ones that get in the way of editing the files for the survey.

So far I’ve put together the following list of files containing what we’re supposed to survey in ST:

1)   common/main/*.xml

2)   common/subdivisions/*.xml

3)   common/annotations/*.xml

4)   common/annotationsDerived/*.xml

5)   common/rbnf/*.xml

6)   common/casing/*.xml

If I’m missing some files, please let me know. So far, the only hard-to-read literal strings are found in:

3)   common/annotations/ and

4)   common/annotationsDerived/.

> 
> We could share ideas about which editors work well; I use emacs and/or VS Code.

Thank you, I’ve now installed VS Code and the ECDC extension, but I can’t get the latter to work by
following the provided instructions. I don’t know whether it’s me or the software. And anyway the code
points should be in the file.

Up to now I’ve been using Gedit on Linux and Notepad++ on Windows, with the Gedit Draw Spaces plugin
showing nice triangles over <NBSP>, <NNBSP>, U+2007, U+2011 (the sort of thing that we could use in ST
too, as already reported).

It is essential for me not to lose much time setting up environments and learning to code tools right
now: I’m very busy all the time, and I’ll still have to get the upcoming CLDR survey round done.

Please help people like me with solutions that work out of the box.

Thanks.

Marcel

> 
> Steven
> On Fri, Nov 30, 2018 at 11:50 AM Asmus Freytag via CLDR-Users
> <cldr-users at unicode.org> wrote:
>>
>> Agree with you, using literals for a "CP" field is just bad schema design.
>> A./
>>
>> On 11/30/2018 6:09 AM, Marcel Schneider via CLDR-Users wrote:
>>
>> The first attribute in common/annotations/*.xml and common/annotationsDerived/*.xml
>> claims to be the code point ("cp"), but all of its values show up as literals.
>> E.g., the first 'annotation' element in annotations/fr.xml is:
>>
>> <annotation cp="🏻">peau | peau claire</annotation>
>>
>> So we need to use a tool to learn that the code point is actually U+1F3FB.
>>
>> I’d like that element to be this way:
>>
>> <annotation cp="1F3FB" char="🏻">peau | peau claire</annotation>
>>
>> A ticket was filed about that 4 months ago, but it is still unaccepted and
>> unscheduled (if my understanding of "milestone UNSCH" is correct):
>>
>> Adding code point scalar values in LDML
>> https://unicode.org/cldr/trac/ticket/11289
>>
>> I think it is essential to be able to edit these files in a straightforward way, first
>> because as per instructions [1] we need to remove all keywords that are just echoing
>> the emoji name. In the example above, given the emoji name is "peau claire", we’ll
>> need to remove both "peau" ("skin", because it’s the starting word of the emoji name,
>> "peau claire") and "peau claire" ("light skin tone", because it’s the emoji name).
>> On the other hand, rather than leaving the field blank, we may add "blanc" ("white"),
>> because people with light skin tone may be referred to as "white" people. And we should
>> follow the English example by adding the Fitzpatrick "type 1–2".
>>
>> By the way I don’t know why the file still shows "peau claire" in fr-FR while the chart
>> doesn’t:
>>
>> https://www.unicode.org/cldr/charts/34/annotations/romance.html
>>
>> After editing is completed, files are to be uploaded using the bulk submission facility
>> of SurveyTool, according to earlier discussions.
>>
>> Hence we have to use a tool that adds the code points derived from the literals, and then,
>> before submission, strip the code points again (easily done with a regex). The reason
>> is that the literals may be either unsupported or hard to recognize in a text editor.
>>
>> By contrast, if the code points were part of LDML, they would have to be neither added
>> nor removed. That would of course break lots of things, and require changes to the DTD of
>> LDML:
>>
>> <!ELEMENT annotation ( #PCDATA ) >
>> <!ATTLIST annotation cp CDATA #REQUIRED >   <!-- repurposed -->
>> <!ATTLIST annotation char CDATA #REQUIRED > <!-- added -->
>>
>> In order to browse Emojipedia or other sources alongside, it would be best to sort elements
>> by code point. That may be done on the vetter’s side, given that SurveyTool accepts data in
>> any order provided it is valid LDML, but sorting hex may not be straightforward.
>>
>> Doing edits in SurveyTool is of course the easiest way, but a text editor allows very
>> fast work, while SurveyTool may be used to fine-tune the result in a second pass.
>>
>> The most tedious part is that every vetter has to do the cleanup by him- or herself,
>> whereas collaborating on an LDML file prior to sharing it would enable all vetters
>> to submit a batch of cleaned-up votes, and then to easily check ST without
>> having to make more than a handful of edits.
>>
>> Such a method would help significantly streamline CLDR survey and vetting, ultimately
>> allowing organizations to set Coverage level to Comprehensive for everyone.
>> (About that, please see ticket "Coverage:Comprehensive obfuscated in documentation and
>> communication, while useful goal":
>> https://unicode.org/cldr/trac/ticket/11524
>> )
>>
>> [1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords
>> Quote:
>> | Here are some tips on how to be mindful of the number of keywords:
>> | • Don't add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, saké}.
>> | • Don’t add emoji names (these will be added automatically)
>> | • Don’t add repeats of words starting with the same starting word in the emoji name.
>>
>> Best regards,
>> Marcel
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users