Feedback on CLDR JSON and encoding crucial data only in keys
Ben Hamilton
beng at fb.com
Mon Nov 2 12:25:11 CST 2015
Hi folks,
I'm working on a server to allow arbitrary queries of slices of CLDR
data using the GraphQL protocol (https://facebook.github.io/graphql/).
While working with the fully resolved CLDR JSON data, I noticed a few
design decisions that complicate building a structured object model
(required by GraphQL) to represent it:
1) Crucial LDML data is often encoded only in JSON keys, requiring
clients to parse the keys to extract it
For example, number formats (e.g. from main/root/numbers.json) require
parsing the keys to know the range of values to which the format should
be applied:
    "decimalFormat": {
      "1000-count-other": "0K",
      "10000-count-other": "00K",
      "100000-count-other": "000K",
      "1000000-count-other": "0M",
      (snip)
    }
If I wanted to build an object model to represent this, I'd need to know
that the keys of this dictionary include three pieces of data separated
by "-" and write a parser which understands the meaning of each section.
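
To make the cost concrete, here is a minimal sketch in Python of the kind of parser every client ends up writing for these keys. The names CompactFormatKey and parse_compact_format_key are hypothetical, and the sketch assumes every key has exactly the shape <magnitude>-count-<plural category> seen in the excerpt above; it rejects anything else.

```python
from typing import NamedTuple


class CompactFormatKey(NamedTuple):
    """Hypothetical structured view of a key such as "1000-count-other"."""
    magnitude: int        # lower bound of the range the pattern applies to
    plural_category: str  # plural category, e.g. "other"


def parse_compact_format_key(key: str) -> CompactFormatKey:
    # Assumption: keys always look like "<digits>-count-<category>".
    parts = key.split("-")
    if len(parts) != 3 or parts[1] != "count" or not parts[0].isdigit():
        raise ValueError(f"unexpected compact format key: {key!r}")
    return CompactFormatKey(int(parts[0]), parts[2])


decimal_format = {
    "1000-count-other": "0K",
    "10000-count-other": "00K",
    "100000-count-other": "000K",
    "1000000-count-other": "0M",
}

# Map each opaque key to the data hidden inside it.
parsed = {key: parse_compact_format_key(key) for key in decimal_format}
```

Nothing in the JSON itself says that the first segment is a number or that the second is a fixed literal; that knowledge has to live in client code like this.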
This becomes much more complicated when dealing with dateFields.json,
which includes keys with particularly complex encodings. From
main/root/dateFields.json:
    "sat-narrow": {
      "relative-type--1": "last Sa",
      "relative-type-0": "this Sa",
      "relative-type-1": "next Sa"
    },
    "dayperiod": {
      "displayName": "AM/PM",
      "displayName-alt-variant": "am/pm"
    },
For this, I need to know that "-" has multiple meanings: it may or may
not be present, and it can act either as a field separator or as a
minus sign in front of a number.
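
As an illustration of that ambiguity, here is a rough Python sketch that handles only the key shapes shown in the excerpt above. The function name parse_date_field_key and the returned field names ("kind", "offset", "alt") are my own inventions, and real dateFields.json files contain other key shapes that this does not cover.

```python
import re

# "relative-type--1": the last "-" before the digits is a minus sign,
# while the earlier ones are separators.
RELATIVE_TYPE = re.compile(r"^relative-type-(-?\d+)$")

# "displayName" or "displayName-alt-variant": optional "-alt-<name>" suffix.
NAME_WITH_ALT = re.compile(r"^(?P<name>[A-Za-z]+)(?:-alt-(?P<alt>[A-Za-z]+))?$")


def parse_date_field_key(key: str) -> dict:
    """Split a dateFields key into the pieces it encodes (hypothetical shape)."""
    match = RELATIVE_TYPE.match(key)
    if match:
        return {"kind": "relative", "offset": int(match.group(1))}
    match = NAME_WITH_ALT.match(key)
    if match:
        return {"kind": match.group("name"), "alt": match.group("alt")}
    raise ValueError(f"unrecognized date field key: {key!r}")


for key in ("relative-type--1", "relative-type-0", "displayName-alt-variant"):
    print(key, "->", parse_date_field_key(key))
```

A naive split on "-" would turn "relative-type--1" into four segments with an empty string in the middle, which is why the sketch needs a pattern per key shape rather than one generic splitter.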
I think we can keep the keys as-is as opaque unique identifiers, but the
values should be more structured. A map with separate fields for the
meanings of each item in the key (plus the original value) would be
great. The original XML format does this pretty well; I think we can do
that in the JSON without too much trouble.
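
One possible shape for such structured values, sketched in Python for the decimalFormat excerpt quoted earlier. The field names "type" and "count" are chosen to mirror the attributes on the LDML XML pattern element; "pattern" and the function name restructure_decimal_format are placeholders of mine, not a proposal for exact naming.

```python
import json


def restructure_decimal_format(decimal_format: dict) -> dict:
    """Keep each key as an opaque identifier, but expose its parts as fields."""
    structured = {}
    for key, pattern in decimal_format.items():
        # Assumption: keys look like "<magnitude>-count-<plural category>".
        magnitude, _, plural_category = key.split("-")
        structured[key] = {
            "type": int(magnitude),
            "count": plural_category,
            "pattern": pattern,  # the original value, unchanged
        }
    return structured


example = {"1000-count-other": "0K", "1000000-count-other": "0M"}
print(json.dumps(restructure_decimal_format(example), indent=2))
```

With values like these, a client can read the magnitude and plural category directly and never has to inspect the key.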
2) Much of the LDML data is represented as serialized UTS #35 UnicodeSet
objects, which must be deserialized before their underlying meaning can
be understood
For example, main/root/characters.json includes:
    "characters": {
      "exemplarCharacters": "[]",
      "auxiliary": "[]",
      "punctuation": "[\\- , ; \\: ! ? . ( ) \\[ \\] \\{ \\}]",
      (snip)
    }
This means every program which wants to interact with this data needs to
include a UTS #35 UnicodeSet deserializer (or forward the raw patterns
on to the client with the assumption that it will include a UnicodeSet
deserializer).
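
To give a sense of the work involved, here is a deliberately tiny Python sketch that handles only the subset of the syntax visible in the excerpt above: a bracketed list of single characters, separated by whitespace, with backslash escapes for literal punctuation. It is not a UnicodeSet implementation; ranges, nested sets, properties, multi-character strings, and \u escapes are all rejected. The name parse_simple_unicode_set is hypothetical, and the input is assumed to be the string after JSON decoding.

```python
from typing import List


def parse_simple_unicode_set(pattern: str) -> List[str]:
    """Return the literal characters in a very restricted UnicodeSet pattern."""
    if len(pattern) < 2 or not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError(f"not a bracketed set: {pattern!r}")
    body = pattern[1:-1]
    chars = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # A backslash makes the next character literal.
            if i + 1 >= len(body):
                raise ValueError("dangling backslash")
            escaped = body[i + 1]
            if escaped.isalnum():
                # \uXXXX, \p{...} and friends are outside this sketch.
                raise ValueError(f"unsupported escape: \\{escaped}")
            chars.append(escaped)
            i += 2
        elif ch.isspace():
            i += 1  # unescaped whitespace is ignored
        elif ch in "[]{}-&^$":
            raise ValueError(f"unsupported syntax: {ch!r}")
        else:
            chars.append(ch)
            i += 1
    return chars


# The root punctuation set, as it looks after JSON decoding.
print(parse_simple_unicode_set(r"[\- , ; \: ! ? . ( ) \[ \] \{ \}]"))
```

Even this toy version needs escape handling and a list of reserved characters, and a real deserializer has to cover the full grammar.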
For many languages, including JavaScript / ECMAScript, I don't think
such a deserializer exists today. Please let me know if I'm wrong!
Ben