Hard-to-use "annotations" files in LDML

Marcel Schneider via CLDR-Users cldr-users at unicode.org
Fri Nov 30 14:57:51 CST 2018


On 30/11/2018 20:50, Asmus Freytag via CLDR-Users wrote:
> Agree with you, using literals for a "CP" field is just bad schema design.
> A./

Thank you. My first Trac ticket ever was about the same problem but in the charts:

https://unicode.org/cldr/trac/ticket/10206

It was 20 months ago, when I still knew even less English than today, so it was titled:


    annotations pages missing CP column

And started: “The annotations pages are of limited use as they are missing a column for code points, beside the text style glyphs.”
“Missing” should read “lacking.”

But the reason I’m citing it here is that the suggestion was soon accepted and implemented.

Although changing the structure of an LDML file is far less straightforward than enhancing the charts,
I / we look forward to it with hope.

Best regards,
Marcel
>
> On 11/30/2018 6:09 AM, Marcel Schneider via CLDR-Users wrote:
>> The first attribute in common/annotations/*.xml and common/annotationsDerived/*.xml
>> claims to be the code point ("cp"), but all of its values appear as literals.
>> E.g., the first 'annotation' element in annotations/fr.xml is:
>>
>> <annotation cp="🏻">peau | peau claire</annotation>
>>
>> So we need to use a tool to learn that the code point is actually U+1F3FB.
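
A minimal sketch of such a lookup, assuming Python 3 is at hand. The function name `code_points` is only illustrative and is not part of any CLDR tooling; it prints the scalar value of every character in a literal, so it should also work for multi-character emoji sequences.

```python
def code_points(literal: str) -> str:
    """Return the code points of a string as space-separated uppercase hex."""
    # ord() gives the Unicode scalar value; pad to at least four hex digits.
    return " ".join(f"{ord(ch):04X}" for ch in literal)


if __name__ == "__main__":
    # The literal from the fr.xml example, written as an escape so that it
    # survives editors and fonts that cannot display it.
    print(code_points("\U0001F3FB"))  # 1F3FB
```
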
>>
>> I’d like that element to be this way:
>>
>> <annotation cp="1F3FB" char="🏻">peau | peau claire</annotation>
>>
>> A ticket was filed about that 4 months ago, but it is still unaccepted and
>> unscheduled (if my understanding of "milestone UNSCH" is correct):
>>
>> Adding code point scalar values in LDML
>> https://unicode.org/cldr/trac/ticket/11289
>>
>> I think it is essential to be able to edit these files in a straightforward way, first
>> because as per instructions [1] we need to remove all keywords that are just echoing
>> the emoji name. In the example above, given the emoji name is "peau claire", we’ll
>> need to remove both "peau" ("skin", because it’s the starting word of the emoji name,
>> "peau claire") and "peau claire" ("light skin tone", because it’s the emoji name).
>> On the other hand, rather than leaving the field blank, we may add "blanc" ("white"),
>> because people with light skin tone may be referred to as "white" people. And we should
>> follow the English example by adding the Fitzpatrick "type 1–2".
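
The cleanup rule described above might be sketched as follows. This is one reading of the instructions in [1], not an official algorithm: a keyword is dropped if it equals the emoji name or if it begins with the first word of the emoji name. The function name `prune_keywords` and the pipe-separated input format (as in the element text) are assumptions made for illustration.

```python
def prune_keywords(emoji_name: str, keywords: str) -> list:
    """Drop keywords that merely echo the emoji name (one reading of the tips)."""
    name = emoji_name.casefold()
    first_word = name.split()[0]
    kept = []
    for raw in keywords.split("|"):
        keyword = raw.strip()
        if not keyword:
            continue
        folded = keyword.casefold()
        # Rule 1: the emoji name itself is added automatically, so skip it.
        if folded == name:
            continue
        # Rule 2: skip keywords starting with the name's starting word.
        if folded.split()[0] == first_word:
            continue
        kept.append(keyword)
    return kept


if __name__ == "__main__":
    print(prune_keywords("peau claire", "peau | peau claire | blanc"))
```

With the French example, both "peau" and "peau claire" are dropped and only a newly added keyword such as "blanc" remains.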
>>
>> By the way I don’t know why the file still shows "peau claire" in fr-FR while the chart
>> doesn’t:
>>
>> https://www.unicode.org/cldr/charts/34/annotations/romance.html
>>
>> After editing is completed, files are to be uploaded using the bulk submission facility
>> of SurveyTool, according to earlier discussions.
>>
>> Hence we have to use a tool that adds the code points derived from the literals, and then,
>> before submission, strip the code points again (easily done with a regex). The reason
>> is that the literals may be either unsupported or hard to recognize in a text editor.
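
A rough sketch of that add-then-strip round trip, under stated assumptions: the code points are stored in a temporary attribute called `hex` (a made-up name, not valid LDML, which is why it must be removed before submission), and the `cp` value holds literal characters rather than XML character references.

```python
import re

# Matches the cp attribute and captures its literal value.
CP_ATTR = re.compile(r'cp="([^"]*)"')
# Matches the temporary helper attribute (hypothetical name "hex").
HEX_ATTR = re.compile(r' hex="[0-9A-F ]*"')


def add_hex(line: str) -> str:
    """Append hex="..." after each cp="..." so the code points are visible."""
    def repl(match):
        hexes = " ".join(f"{ord(ch):04X}" for ch in match.group(1))
        return f'{match.group(0)} hex="{hexes}"'
    return CP_ATTR.sub(repl, line)


def strip_hex(line: str) -> str:
    """Remove the temporary hex attribute again before bulk submission."""
    return HEX_ATTR.sub("", line)


if __name__ == "__main__":
    original = '<annotation cp="\U0001F3FB">peau | peau claire</annotation>'
    annotated = add_hex(original)
    print(annotated)
    print(strip_hex(annotated) == original)
```
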
>>
>> By contrast, if the code points were part of LDML, they would have to be neither added
>> nor removed. That would of course break lots of things, and require changes to the DTD of
>> LDML:
>>
>> <!ELEMENT annotation ( #PCDATA ) >
>> <!ATTLIST annotation cp CDATA #REQUIRED >   <!-- repurposed -->
>> <!ATTLIST annotation char CDATA #REQUIRED > <!-- added -->
>>
>> In order to browse Emojipedia or other sources alongside, it would be best to sort elements
>> by code point. That may be done on the vetter's side, given that SurveyTool accepts data in any
>> order provided it is valid LDML, but sorting hex may not be straightforward.
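
To illustrate why sorting hex is not straightforward: a plain string sort puts "1F3FB" before "263A", although U+263A is the lower code point. A possible workaround, sketched in Python with an illustrative helper name `cp_sort_key`, is to compare the numeric values instead, treating multi-code-point sequences as tuples.

```python
def cp_sort_key(cp_hex: str) -> tuple:
    """Turn 'XXXX YYYY ...' into a tuple of integers for numeric comparison."""
    return tuple(int(part, 16) for part in cp_hex.split())


if __name__ == "__main__":
    cps = ["1F3FB", "263A", "1F468 200D 1F469", "A9"]
    print(sorted(cps))                   # text order, misleading
    print(sorted(cps, key=cp_sort_key))  # code point order
```
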
>>
>> Editing in SurveyTool is of course the easiest way, but a text editor allows one to work
>> very fast, while SurveyTool may be used to fine-tune the result in a second pass.
>>
>> One thing is most tedious: every vetter has to do the cleanup by
>> him- or herself, whereas collaborating on an LDML file prior to sharing it would enable
>> all vetters to submit a batch of cleaned-up votes, and then to easily check ST without
>> having to make more than a handful of edits.
>>
>> Such a method would help significantly streamline CLDR survey and vetting, ultimately
>> allowing organizations to set Coverage level to Comprehensive for everyone.
>> (About that, please see ticket "Coverage:Comprehensive obfuscated in documentation and
>> communication, while useful goal":
>> https://unicode.org/cldr/trac/ticket/11524
>> )
>>
>> [1] http://cldr.unicode.org/translation/short-names-and-keywords#TOC-Character-Keywords
>> Quote:
>> | Here are some tips on how to be mindful of the number of keywords:
>> | • Don’t add grammatical variants: pick one of {"walks", "walking"}; pick one of {sake, saké}.
>> | • Don’t add emoji names (these will be added automatically)
>> | • Don’t add repeats of words starting with the same starting word in the emoji name.
>>
>> Best regards,
>> Marcel
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users

