Feedback on CLDR JSON and encoding crucial data only in keys

Tue Nov 3 10:36:09 CST 2015

Will do, thanks! I'll file two separate issues, since they're unrelated.


On Mon, Nov 2, 2015 at 10:32 PM -0800, "Mark Davis ☕️" <mark at macchiato.com> wrote:

I suggest that you file this as a bug, and we can discuss in the meeting.

For #1, the knottiest issue is
"dayperiod": {
"displayName": "AM/PM",
"displayName-alt-variant": "am/pm"
},

We've wrestled with this. As I recall, we considered fleshing it out to be something like:

"dayperiod": {
  "displayName": {
    "plain": "AM/PM",
    "variant": "am/pm"
  }
},

But because 'alt' could potentially go on every leaf node, that would require adding a level (and "plain") for essentially every leaf node. (And where alt can go on non-leaf nodes, we'd have to work that in also.) But we could explore some ideas.
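
To make the cost of that extra level concrete, here is a rough Python sketch of the transformation. It is only an illustration, not anything CLDR ships: the helper name nest_alt_keys is made up, it assumes alternates are always spelled "<key>-alt-<name>", and it leaves alt on non-leaf nodes untouched.

```python
ALT_MARKER = "-alt-"  # assumed spelling, taken from "displayName-alt-variant"


def nest_alt_keys(node):
    """Hypothetical helper: fold "<key>-alt-<name>" leaves under "<key>".

    Every leaf gains one level of nesting; the unmarked value is stored
    under "plain", as in the shape sketched above.
    """
    result = {}
    for key, value in node.items():
        if isinstance(value, dict):
            # Non-leaf node: recurse. Alt on non-leaf keys is not handled here.
            result[key] = nest_alt_keys(value)
            continue
        base, marker, alt_name = key.partition(ALT_MARKER)
        variant = alt_name if marker else "plain"
        result.setdefault(base, {})[variant] = value
    return result


flat = {
    "dayperiod": {
        "displayName": "AM/PM",
        "displayName-alt-variant": "am/pm",
    }
}
nested = nest_alt_keys(flat)
```

Note that even a leaf with no alternates ends up wrapped as {"plain": ...}, which is the overhead described above.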

For #2, we could probably go to a simpler format for JSON. We could look at space-delimited strings, maybe with a special sequence for ranges, that would be easy to parse.

Mark

On Mon, Nov 2, 2015 at 10:25 AM, Ben Hamilton <beng at fb.com> wrote:
Hi folks,

I'm working on a server to allow arbitrary queries of slices of CLDR data using the GraphQL protocol (https://facebook.github.io/graphql/).

While working with the fully resolved CLDR JSON data, I noticed a few design decisions that complicate building a structured object model (required by GraphQL) to represent it:

1) Crucial LDML data is often encoded only in JSON keys, requiring clients to parse keys to extract them

For example, number formats (from main/root/numbers.json) require parsing the keys to know the range of values to which each format should be applied:

"decimalFormat": {
"1000-count-other": "0K",
"10000-count-other": "00K",
"100000-count-other": "000K",
"1000000-count-other": "0M",
(snip)
}

If I wanted to build an object model to represent this, I'd need to know that the keys of this dictionary include three pieces of data separated by "-" and write a parser which understands the meaning of each section.
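
As a rough illustration of the parser a client ends up writing, here is a minimal Python sketch. It assumes every key has exactly the shape "<magnitude>-count-<plural category>" seen in the excerpt; the names CompactPattern and parse_decimal_format are hypothetical.

```python
from typing import NamedTuple


class CompactPattern(NamedTuple):
    key: str              # original key, kept as an opaque identifier
    magnitude: int        # e.g. 1000
    plural_category: str  # e.g. "other"
    pattern: str          # e.g. "0K"


def parse_decimal_format(entries):
    """Split each "<magnitude>-count-<category>" key into separate fields."""
    parsed = []
    for key, pattern in entries.items():
        magnitude, label, category = key.split("-", 2)
        if label != "count":
            raise ValueError(f"unexpected key shape: {key!r}")
        parsed.append(CompactPattern(key, int(magnitude), category, pattern))
    return parsed


decimal_format = {
    "1000-count-other": "0K",
    "10000-count-other": "00K",
    "100000-count-other": "000K",
    "1000000-count-other": "0M",
}
patterns = parse_decimal_format(decimal_format)
```

Each client has to rediscover and hard-code this key layout, which is the problem being described.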

This becomes much more complicated when dealing with dateFields.json, which includes keys with particularly complex encodings. From main/root/dateFields.json:

"sat-narrow": {
"relative-type--1": "last Sa",
"relative-type-0": "this Sa",
"relative-type-1": "next Sa"
},
"dayperiod": {
"displayName": "AM/PM",
"displayName-alt-variant": "am/pm"
},

For this, I need to know that the "-" separators have multiple meanings, and might be present (or not), and could act either as a field separator, or as a negation operation in front of a number.
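
A small Python sketch of the ambiguity: a naive split on "-" loses the sign of the offset, so the client needs a pattern written for this specific key. The regular expression and helper names below are assumptions based only on the keys in the excerpt, not on any CLDR-defined grammar.

```python
import re

# Assumed key shape, inferred from the excerpt: "relative-type-<signed integer>".
RELATIVE_TYPE = re.compile(r"^relative-type-(-?\d+)$")


def parse_relative_offset(key):
    """Return the signed offset encoded in the key, or None if it is not one."""
    match = RELATIVE_TYPE.match(key)
    return int(match.group(1)) if match else None


def split_field_key(key):
    """Split e.g. "sat-narrow" into ("sat", "narrow"); width is None if absent."""
    field, separator, width = key.partition("-")
    return field, (width if separator else None)


# A naive split treats the minus sign as one more separator:
naive = "relative-type--1".split("-")

sat_narrow = {
    "relative-type--1": "last Sa",
    "relative-type-0": "this Sa",
    "relative-type-1": "next Sa",
}
offsets = {parse_relative_offset(key): text for key, text in sat_narrow.items()}
```

Here naive contains an empty string where the sign used to be, while offsets maps the integers -1, 0, and 1 to their display strings.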

I think we can keep the keys as-is as opaque unique identifiers, but the values should be more structured. A map with separate fields for the meanings of each item in the key (plus the original value) would be great. The original XML format does this pretty well; I think we can do that in the JSON without too much trouble.

2) Much of the LDML data is represented as serialized UTS #35 UnicodeSet objects, which requires deserializing them to understand the underlying meaning

For example, main/root/characters.json includes:

"characters": {
"exemplarCharacters": "[]",
"auxiliary": "[]",
"punctuation": "[\\- , ; \\: ! ? . ( ) \\[ \\] \\{ \\}]",
(snip)
}

This means every program which wants to interact with this data needs to include a UTS #35 UnicodeSet deserializer (or forward the raw patterns on to the client with the assumption that it will include a UnicodeSet deserializer).
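
To give a sense of what even a partial deserializer involves, here is a Python sketch that expands a small subset of the syntax. It is not a UTS #35 implementation: it assumes a single outer [...] pair and handles only ignored whitespace, backslash-escaped literals, simple a-z ranges, and {multi-character strings}. Properties, nested sets, set operators, negation, and \u escapes are rejected. The function name is hypothetical.

```python
def expand_simple_unicode_set(pattern):
    """Expand a small subset of UnicodeSet syntax into a list of strings."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a pattern wrapped in [...]")
    body = pattern[1:-1]
    items = []
    range_pending = False  # True after "a-" while waiting for the range end
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
            continue
        if ch == "{":
            # {abc} is one multi-character element of the set.
            end = body.index("}", i)
            items.append(body[i + 1:end])
            i = end + 1
            continue
        if ch == "-" and items and not range_pending:
            range_pending = True
            i += 1
            continue
        if ch in "[]^&$":
            raise ValueError(f"unsupported syntax at offset {i}: {ch!r}")
        if ch == "\\":
            i += 1
            if i >= len(body) or body[i] in "uUxNpP":
                raise ValueError("unsupported or dangling escape")
            ch = body[i]  # escaped character is taken literally
        i += 1
        if range_pending:
            start = items.pop()
            if len(start) != 1:
                raise ValueError("range must start at a single character")
            items.extend(chr(c) for c in range(ord(start), ord(ch) + 1))
            range_pending = False
        else:
            items.append(ch)
    return items


# The punctuation value above, after JSON decoding (one backslash per escape).
punctuation = expand_simple_unicode_set(r"[\- , ; \: ! ? . ( ) \[ \] \{ \}]")
```

Even this reduced version needs an escape-aware scanner, and real locale data uses features it rejects, so each consumer would still need a full parser.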

For many languages including JavaScript / ECMAScript, I don't think there exists such a deserializer today—please let me know if I'm wrong!

Ben

_______________________________________________
CLDR-Users mailing list
CLDR-Users at unicode.org
http://unicode.org/mailman/listinfo/cldr-users
