Hard-to-use "annotations" files in LDML

Steven R. Loomis via CLDR-Users cldr-users at unicode.org
Fri Nov 30 15:12:58 CST 2018


Marcel, Asmus:

Perhaps 'cp' could have been named something else, but the spec is clear:

https://unicode.org/reports/tr35/tr35-general.html#Annotations
> The cp attribute value has two formats: either a single string, or if contained within […] a UnicodeSet

It's a string, not a (single) codepoint.

> So we need to use a tool to learn that the code point is actually U+1F3FB.

It's an XML file. There are many ways to process it.

You could have a separate tool which reads the XML file, and adds a
comment (which is ignored on upload) that has all of the code points
spelled out.
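As a rough illustration of that idea, here is a minimal sketch (hypothetical helper names; real CLDR tooling may differ) that appends an XML comment spelling out the code points after each annotation element, and a companion function that strips those comments again before upload:

```python
import re

def annotate_codepoints(xml_text: str) -> str:
    """Append an XML comment listing the code points of the cp
    attribute after each <annotation> element (sketch only)."""
    def add_comment(match):
        points = " ".join(f"U+{ord(c):04X}" for c in match.group("cp"))
        return f"{match.group(0)} <!-- {points} -->"
    return re.sub(
        r'<annotation\s+cp="(?P<cp>[^"]*)"[^>]*>.*?</annotation>',
        add_comment,
        xml_text,
    )

def strip_codepoint_comments(xml_text: str) -> str:
    """Remove the generated comments again before bulk submission."""
    return re.sub(r'\s*<!-- (?:U\+[0-9A-Fa-f]+ ?)+-->', '', xml_text)
```

Running the first function over a line like the fr.xml example would yield `<annotation cp="🏻">peau | peau claire</annotation> <!-- U+1F3FB -->`, and the second restores the original text, so the comments never need to live in the checked-in file.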

When you say:
> The reason is that the literals may be either unsupported or hard to recognize in a text editor.

I don't see this as a reason to change the structure.  There are
plenty of other literal strings in CLDR.

We could share ideas about which editors work well; I use Emacs and/or VS Code.

Steven
On Fri, Nov 30, 2018 at 11:50 AM Asmus Freytag via CLDR-Users
<cldr-users at unicode.org> wrote:
>
> Agree with you, using literals for a "CP" field is just bad schema design.
> A./
>
> On 11/30/2018 6:09 AM, Marcel Schneider via CLDR-Users wrote:
>
> The first attribute in common/annotations/*.xml and common/annotationsDerived/*.xml
> claims to be the code point ("cp"), but all of its values show up as literals.
> E.g. the first 'annotation' element in annotations/fr.xml is:
>
> <annotation cp="🏻">peau | peau claire</annotation>
>
> So we need to use a tool to learn that the code point is actually U+1F3FB.
>
> I’d like that element to be this way:
>
> <annotation cp="1F3FB" char="🏻">peau | peau claire</annotation>
>
> A ticket had been filed about that 4 months ago, but it is still unaccepted and
> unscheduled (if my understanding of "milestone UNSCH" is correct):
>
> Adding code point scalar values in LDML
> https://unicode.org/cldr/trac/ticket/11289
>
> I think it is essential to be able to edit these files in a straightforward way, first
> because as per instructions [1] we need to remove all keywords that are just echoing
> the emoji name. In the example above, given the emoji name is "peau claire", we’ll
> need to remove both "peau" ("skin", because it’s the starting word of the emoji name,
> "peau claire") and "peau claire" ("light skin tone", because it’s the emoji name).
> On the other hand, rather than leaving the field blank, we may add "blanc" ("white"),
> because people with light skin tone may be referred to as "white" people. And we should
> follow the English example by adding the Fitzpatrick "type 1–2".
>
> By the way I don’t know why the file still shows "peau claire" in fr-FR while the chart
> doesn’t:
>
> https://www.unicode.org/cldr/charts/34/annotations/romance.html
>
> After editing is completed, files are to be uploaded using the bulk submission facility
> of SurveyTool, according to earlier discussions.
>
> Hence we have to use a tool that adds the code points next to the literals, and then,
> before submission, clean the code points away again (easily done with a regex). The reason
> is that the literals may be either unsupported or hard to recognize in a text editor.
>
> By contrast, if the code points were part of LDML, they would need neither to be added
> nor removed. That would of course break lots of things, and require changes to the DTD of
> LDML:
>
> <!ELEMENT annotation ( #PCDATA ) >
> <!ATTLIST annotation cp CDATA #REQUIRED >   <!-- repurposed -->
> <!ATTLIST annotation char CDATA #REQUIRED > <!-- added -->
>
> In order to browse Emojipedia or other sources alongside, it would be best to sort the
> elements by code point. That may be done on the vetter's side, given that SurveyTool
> accepts data in any order provided it is valid LDML, but sorting by hex value may not be
> straightforward.
>
> Doing edits in SurveyTool is of course the easiest way, but a text editor allows one to
> work very fast, while SurveyTool may be used to fine-tune the result in a second pass.
>
> There is one thing that is most tedious: every vetter has to do the cleanup by him- or
> herself, whereas collaborating on an LDML file prior to sharing it would enable all
> vetters to submit a bulk of cleared votes, and then easily check ST without having to
> do more than a handful of edits.
>
> Such a method would help significantly streamline the CLDR survey and vetting process,
> ultimately allowing organizations to set the Coverage level to Comprehensive for everyone.
> (About that, please see ticket "Coverage:Comprehensive obfuscated in documentation and
> communication, while useful goal":
> https://unicode.org/cldr/trac/ticket/11524
> )
>
> [1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords
> Quote:
> | Here are some tips on how to be mindful of the number of keywords:
> | • Don't add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, saké}.
> | • Don’t add emoji names (these will be added automatically)
> | • Don’t add repeats of words starting with the same starting word in the emoji name.
>
> Best regards,
> Marcel
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
