Feedback on CLDR JSON and encoding crucial data only in keys

Ben Hamilton beng at fb.com
Mon Nov 2 12:25:11 CST 2015


Hi folks,

I'm working on a server to allow arbitrary queries of slices of CLDR 
data using the GraphQL protocol (https://facebook.github.io/graphql/).

While working with the fully resolved CLDR JSON data, I noticed a few 
design decisions that complicate building a structured object model 
(required by GraphQL) to represent it:

1) Crucial LDML data is often encoded only in JSON keys, requiring 
clients to parse the keys to extract it

For example, number formats (e.g. from main/root/numbers.json) require 
parsing the keys to know the range of values to which the format should 
be applied:

"decimalFormat": {
"1000-count-other": "0K",
"10000-count-other": "00K",
"100000-count-other": "000K",
"1000000-count-other": "0M",
(snip)
}

If I wanted to build an object model to represent this, I'd need to know 
that the keys of this dictionary include three pieces of data separated 
by "-" and write a parser which understands the meaning of each section.
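To make the cost concrete, here is a minimal sketch of the kind of 
parser every client currently has to write. The CompactPattern class, 
its field names, and parse_compact_key are hypothetical names invented 
for this illustration, and the code assumes every key has exactly the 
shape "<number>-count-<plural category>" shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactPattern:
    """Hypothetical object model for one decimalFormat entry."""
    magnitude: int         # first key segment, e.g. 1000
    plural_category: str   # third key segment, e.g. "other"
    pattern: str           # the JSON value, e.g. "0K"


def parse_compact_key(key: str, pattern: str) -> CompactPattern:
    # Assumes the key is "<number>-count-<category>"; anything else is rejected.
    parts = key.split("-", 2)
    if len(parts) != 3 or parts[1] != "count":
        raise ValueError(f"unexpected decimalFormat key: {key!r}")
    return CompactPattern(int(parts[0]), parts[2], pattern)


decimal_format = {
    "1000-count-other": "0K",
    "10000-count-other": "00K",
    "100000-count-other": "000K",
    "1000000-count-other": "0M",
}

parsed = [parse_compact_key(k, v) for k, v in decimal_format.items()]
```

None of the knowledge baked into parse_compact_key (the separator, the 
literal "count" segment, the numeric first segment) is stated anywhere 
in the JSON itself.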

This becomes much more complicated when dealing with dateFields.json, 
which includes keys with particularly complex encodings. From 
main/root/dateFields.json:

"sat-narrow": {
"relative-type--1": "last Sa",
"relative-type-0": "this Sa",
"relative-type-1": "next Sa"
},
"dayperiod": {
"displayName": "AM/PM",
"displayName-alt-variant": "am/pm"
},

For this, I need to know that the "-" character has multiple meanings: 
it may or may not be present, and it can act either as a field 
separator or as a minus sign in front of a number.
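As a rough sketch of the problem, the following shows why a plain split 
on "-" is not enough for these keys and what a client ends up writing 
instead. The function parse_field_key and the "name" / "offset" / "alt" 
field names are hypothetical, and the two regular expressions cover only 
the key shapes quoted above, not everything in dateFields.json.

```python
import re
from typing import Dict, Optional, Union

# A naive split cannot tell a separator from a minus sign:
naive = "relative-type--1".split("-")   # ['relative', 'type', '', '1']

# Assumed key shapes, based only on the excerpt above.
RELATIVE_KEY = re.compile(r"^relative-type-(-?\d+)$")
ALT_KEY = re.compile(r"^(?P<name>[A-Za-z]+)-alt-(?P<alt>[A-Za-z]+)$")

Parsed = Dict[str, Union[str, int, None]]


def parse_field_key(key: str) -> Parsed:
    """Split a dateFields inner key into hypothetical structured fields."""
    match = RELATIVE_KEY.match(key)
    if match:
        # The optional "-" directly before the digits is a sign, not a separator.
        return {"name": "relative-type", "offset": int(match.group(1)), "alt": None}
    match = ALT_KEY.match(key)
    if match:
        return {"name": match.group("name"), "offset": None, "alt": match.group("alt")}
    # Anything else is treated as a plain field name.
    return {"name": key, "offset": None, "alt": None}


last = parse_field_key("relative-type--1")
variant = parse_field_key("displayName-alt-variant")
```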

I think we can keep the keys as-is as opaque unique identifiers, but the 
values should be more structured. A map with separate fields for the 
meanings of each item in the key (plus the original value) would be 
great. The original XML format does this pretty well; I think we can do 
that in the JSON without too much trouble.
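To illustrate, here is one possible shape for the decimalFormat data, 
written as a Python literal. This is only a sketch of the idea, not a 
proposal for exact names: "type" and "count" mirror the attribute names 
on the XML <pattern> element, while "pattern" and the pattern_for helper 
are hypothetical. The selection logic is deliberately simplified.

```python
from typing import Dict, Optional

# Keys stay as opaque identifiers; the values carry the structure.
structured = {
    "1000-count-other": {"type": 1000, "count": "other", "pattern": "0K"},
    "10000-count-other": {"type": 10000, "count": "other", "pattern": "00K"},
    "100000-count-other": {"type": 100000, "count": "other", "pattern": "000K"},
    "1000000-count-other": {"type": 1000000, "count": "other", "pattern": "0M"},
}


def pattern_for(value: int, plural_category: str,
                entries: Dict[str, dict]) -> Optional[str]:
    """Pick the entry with the largest "type" not exceeding value.

    Simplified lookup for illustration; no key parsing is needed.
    """
    candidates = [
        entry for entry in entries.values()
        if entry["count"] == plural_category and entry["type"] <= value
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["type"])["pattern"]


chosen = pattern_for(25000, "other", structured)
```

With values shaped like this, a client (or a GraphQL schema) can read 
the fields directly and never has to look inside the key.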

2) Much of the LDML data is represented as serialized UTS #35 UnicodeSet 
objects, which must be deserialized before their contents can be 
understood

For example, main/root/characters.json includes:

"characters": {
"exemplarCharacters": "[]",
"auxiliary": "[]",
"punctuation": "[\\\\- , ; \\\\: ! ? . ( ) \\\\[ \\\\] \\\\{ \\\\}]",
(snip)
}

This means every program which wants to interact with this data needs to 
include a UTS #35 UnicodeSet deserializer (or forward the raw patterns 
on to the client with the assumption that it will include a UnicodeSet 
deserializer).
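For a sense of what even a partial deserializer involves, here is a 
minimal sketch that handles only a small subset of the syntax: a single 
bracketed set containing literal characters, backslash-escaped 
characters, simple "a-z" ranges, and "{...}" multi-character strings. 
It is not a UTS #35 implementation (no properties, nested sets, set 
operations, negation, or \u escapes), the function name is hypothetical, 
and it assumes the JSON-decoded value has a single backslash before each 
escaped character.

```python
from typing import Optional, Set


def parse_simple_unicode_set(pattern: str) -> Set[str]:
    """Expand a very small subset of UnicodeSet syntax into a set of strings."""
    if len(pattern) < 2 or pattern[0] != "[" or pattern[-1] != "]":
        raise ValueError("expected a pattern of the form [...]")
    body = pattern[1:-1]
    items: Set[str] = set()
    prev: Optional[str] = None   # last single character, possible range start
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():                      # whitespace is ignored
            i += 1
        elif ch == "\\":                      # escaped literal character
            if i + 1 >= len(body):
                raise ValueError("dangling backslash")
            prev = body[i + 1]
            items.add(prev)
            i += 2
        elif ch == "{":                       # multi-character string
            end = body.find("}", i)
            if end == -1:
                raise ValueError("unterminated {")
            items.add(body[i + 1:end].replace(" ", ""))
            prev = None
            i = end + 1
        elif ch == "-" and prev is not None:  # range such as a-z
            j = i + 1
            while j < len(body) and body[j].isspace():
                j += 1
            if j < len(body) and body[j] == "\\":
                j += 1
            if j >= len(body):
                raise ValueError("range has no end character")
            for code_point in range(ord(prev), ord(body[j]) + 1):
                items.add(chr(code_point))
            prev = None
            i = j + 1
        elif ch in "[]&^$":                   # syntax outside this subset
            raise ValueError(f"unsupported syntax: {ch!r}")
        else:                                 # plain literal character
            items.add(ch)
            prev = ch
            i += 1
    return items


punctuation = parse_simple_unicode_set(r"[\- , ; \: ! ? . ( ) \[ \] \{ \}]")
```

Even this restricted version needs to track escapes, ranges, and braces, 
which is the logic each consumer of the JSON would otherwise have to 
reimplement or import.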

For many languages including JavaScript / ECMAScript, I don't think 
there exists such a deserializer today—please let me know if I'm wrong!

Ben


