Hard-to-use "annotations" files in LDML

Marcel Schneider via CLDR-Users cldr-users at unicode.org
Fri Nov 30 08:09:48 CST 2018


The first attribute in common/annotations/*.xml and common/annotationsDerived/*.xml
claims to be the code point ("cp"), but all of its values show up as character literals.
E.g. the first 'annotation' element in annotations/fr.xml is:

<annotation cp="🏻">peau | peau claire</annotation>

So we need to use a tool to learn that the code point is actually U+1F3FB.
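
For illustration, here is a minimal Python sketch (the function name is my
own) of what such a tool has to do:

    def scalars(literal):
        # Return the space-separated hex scalar values of a literal.
        return " ".join("%04X" % ord(ch) for ch in literal)

    print(scalars("\U0001F3FB"))  # prints: 1F3FB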

I’d like that element to look like this:

<annotation cp="1F3FB" char="🏻">peau | peau claire</annotation>

A ticket was filed about this four months ago, but it is still unaccepted and
unscheduled (if my understanding of the "UNSCH" milestone is correct):

Adding code point scalar values in LDML
https://unicode.org/cldr/trac/ticket/11289

I think it is essential to be able to edit these files in a straightforward way, first
because, per the instructions [1], we need to remove all keywords that merely echo
the emoji name. In the example above, given that the emoji name is "peau claire"
("light skin tone"), we need to remove both "peau" ("skin", because it is the starting
word of the emoji name) and "peau claire" itself (because it is the emoji name).
On the other hand, rather than leaving the field blank, we may add "blanc" ("white"),
because people with light skin tone may be referred to as "white" people. And we should
follow the English example by adding the Fitzpatrick "type 1–2".
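
A hedged Python sketch of that cleanup rule (the function and its exact
matching logic are my own illustration, not an official CLDR tool):

    def clean_keywords(keywords, emoji_name):
        first_word = emoji_name.split()[0]
        kept = []
        for kw in keywords:
            if kw == emoji_name:             # drop the emoji name itself
                continue
            if kw.split()[0] == first_word:  # drop repeats of its starting word
                continue
            kept.append(kw)
        return kept

    print(clean_keywords(["peau", "peau claire", "blanc"], "peau claire"))
    # prints: ['blanc']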

By the way, I don’t know why the file still shows "peau claire" in fr-FR while the chart
doesn’t:

https://www.unicode.org/cldr/charts/34/annotations/romance.html

After editing is completed, files are to be uploaded using the bulk submission facility
of SurveyTool, according to earlier discussions.

Hence we are to use a tool that derives the code points from the literals, and then,
before submission, strips the code points away again (easily done with a regex). The
reason is that the literals may be either unsupported or hard to recognize in a text editor.
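
A sketch of that round trip in Python (the helper attribute name "cpHex" and
the exact regexes are my own assumptions, not part of LDML):

    import re

    ADD = re.compile(r'<annotation cp="([^"]*)"')
    STRIP = re.compile(r' cpHex="[0-9A-F ]*"')

    def decorate(xml):
        # Append a hypothetical hex attribute after each cp literal, for editing.
        def add_hex(m):
            hexes = " ".join("%04X" % ord(c) for c in m.group(1))
            return '%s cpHex="%s"' % (m.group(0), hexes)
        return ADD.sub(add_hex, xml)

    def undecorate(xml):
        # Strip the helper attribute again before SurveyTool submission.
        return STRIP.sub("", xml)

    line = '<annotation cp="\U0001F3FB">peau | peau claire</annotation>'
    assert undecorate(decorate(line)) == line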

By contrast, if the code points were part of LDML, they would have to be neither added
nor removed. That would of course break lots of things, and would require changes to the
LDML DTD:

<!ELEMENT annotation ( #PCDATA ) >
<!ATTLIST annotation cp CDATA #REQUIRED >   <!-- repurposed -->
<!ATTLIST annotation char CDATA #REQUIRED > <!-- added -->

In order to browse Emojipedia or other sources alongside, it would be best to sort the
elements by code point. That may be done on the vetter’s side, given that SurveyTool
accepts data in any order provided it is valid LDML, but sorting hex may not be straightforward.
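
A possible Python sketch of that sort, using only the standard library (the
file names are placeholders; note that ElementTree drops the DOCTYPE and
comments on rewrite, so this is an exploration aid, not a submission tool):

    import xml.etree.ElementTree as ET

    tree = ET.parse("fr.xml")  # placeholder file name
    annotations = tree.getroot().find("annotations")
    # Sort by numeric scalar values; unpadded hex strings would sort
    # "E9" after "1F3FB", which is the pitfall mentioned above.
    annotations[:] = sorted(annotations,
                            key=lambda e: [ord(c) for c in e.get("cp")])
    tree.write("fr-sorted.xml", encoding="UTF-8", xml_declaration=True)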

Doing edits in SurveyTool is of course the easiest way, but a text editor allows one to
work very fast, while SurveyTool may be used to fine-tune the result in a second pass.

One thing is most tedious: every vetter has to do the cleanup on his or her own, whereas
collaborating on an LDML file prior to sharing it would enable all vetters to submit a
bulk of cleaned-up votes, and then to check SurveyTool without having to make more than
a handful of edits.

Such a method would help significantly streamline CLDR surveying and vetting, ultimately
allowing organizations to set the coverage level to Comprehensive for everyone.
(About that, please see ticket "Coverage:Comprehensive obfuscated in documentation and
communication, while useful goal":
https://unicode.org/cldr/trac/ticket/11524
)

[1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords
Quote:
| Here are some tips on how to be mindful of the number of keywords:
| • Don’t add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, saké}.
| • Don’t add emoji names (these will be added automatically)
| • Don’t add repeats of words starting with the same starting word in the emoji name.

Best regards,
Marcel

