Feedback on CLDR JSON and encoding crucial data only in keys

Mark Davis ☕️ mark at macchiato.com
Tue Nov 3 00:32:14 CST 2015


I suggest that you file this as a bug, and we can discuss in the meeting.

For #1, the knottiest issue is:
"dayperiod": {
"displayName": "AM/PM",
"displayName-alt-variant": "am/pm"
},

We've wrestled with this. As I recall, we considered fleshing it out to be
something like:

"dayperiod": {
  "displayName": {
    "plain": "AM/PM",
    "variant": "am/pm"
  }
},

But because 'alt' could potentially go on every leaf node, that would
require adding a level (and "plain") for essentially every leaf node. (And
where alt can go on non-leaf nodes, we'd have to work that in too.) But we
could explore some ideas.
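
One idea, sketched here only as an illustration: add the extra level just for
keys that actually have an "-alt-" sibling, and leave every other leaf alone.
The function name nest_alt is made up for this example, and the "plain" label
is simply the one used in the snippet above; none of this is an existing CLDR
tool or format.

```python
ALT_MARKER = "-alt-"  # separator seen in keys such as "displayName-alt-variant"


def nest_alt(node):
    """Return a copy of node with "-alt-" keys folded into nested maps.

    Hypothetical helper: only keys that have an alt sibling gain the
    extra level; all other values are copied through unchanged.
    """
    if not isinstance(node, dict):
        return node

    # First pass: collect alt values, grouped by the key they modify.
    alts = {}
    for key, value in node.items():
        if ALT_MARKER in key:
            base, alt_name = key.split(ALT_MARKER, 1)
            alts.setdefault(base, {})[alt_name] = nest_alt(value)

    # Second pass: rebuild the map, wrapping only keys that have alts.
    result = {}
    for key, value in node.items():
        if ALT_MARKER in key:
            continue
        if key in alts:
            result[key] = {"plain": nest_alt(value), **alts[key]}
        else:
            result[key] = nest_alt(value)

    # Alt keys whose plain form is missing still get a nested map.
    for base, variants in alts.items():
        if base not in result:
            result[base] = dict(variants)
    return result
```

The trade-off is that a value is then sometimes a string and sometimes a map,
so a consumer that wants one uniform type per field would still have to
normalize it.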

For #2, we could probably go to a simpler format for JSON. We could look at
space-delimited strings, maybe with a special sequence for ranges, that
would be easy to parse.
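
To give a feel for how little code that would need, here is a rough sketch. The
format is invented for this example: items separated by spaces, and ".."
between two single characters standing for an inclusive range. Neither the
".." sequence nor the name parse_simple_set comes from CLDR, and a real design
would also need a way to represent the space character itself.

```python
RANGE_SEP = ".."  # hypothetical range marker, not part of any CLDR format


def parse_simple_set(text):
    """Expand a space-delimited string into a list of strings.

    A token like "a..e" expands to every character from "a" to "e".
    Any other token (including multi-character ones such as "ch") is
    kept as a single literal item.
    """
    items = []
    for token in text.split():
        first, sep, last = token.partition(RANGE_SEP)
        if sep and len(first) == 1 and len(last) == 1:
            if ord(first) > ord(last):
                raise ValueError(f"range runs backwards: {token!r}")
            items.extend(chr(cp) for cp in range(ord(first), ord(last) + 1))
        else:
            items.append(token)
    return items
```

For instance, parse_simple_set("a..e ch , ;") yields the five letters a
through e, then "ch", ",", and ";".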



Mark

On Mon, Nov 2, 2015 at 10:25 AM, Ben Hamilton <beng at fb.com> wrote:

> Hi folks,
>
> I'm working on a server to allow arbitrary queries of slices of CLDR data
> using the GraphQL protocol (https://facebook.github.io/graphql/).
>
> While working with the fully resolved CLDR JSON data, I noticed a few
> design decisions that complicate building a structured object model
> (required by GraphQL) to represent it:
>
> 1) Crucial LDML data is often encoded only in JSON keys, requiring clients
> to parse keys to extract them
>
> For example, number formats (e.g. from main/root/numbers.json) require
> parsing the keys to know the range of values to which the format should be
> applied:
>
> "decimalFormat": {
> "1000-count-other": "0K",
> "10000-count-other": "00K",
> "100000-count-other": "000K",
> "1000000-count-other": "0M",
> (snip)
> }
>
> If I wanted to build an object model to represent this, I'd need to know
> that the keys of this dictionary include three pieces of data separated by
> "-" and write a parser which understands the meaning of each section.
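
A minimal sketch of the parser described above, assuming every key in
"decimalFormat" has exactly the shape <number>-count-<plural category>. The
function names are made up for illustration; the output field names "type" and
"count" follow the attribute names used on the pattern element in the LDML XML.

```python
def parse_decimal_format_key(key):
    """Split a key such as "1000-count-other" into its parts."""
    parts = key.split("-")
    if len(parts) != 3 or parts[1] != "count" or not parts[0].isdigit():
        raise ValueError(f"unexpected decimalFormat key: {key!r}")
    return {"type": int(parts[0]), "count": parts[2]}


def structure_decimal_formats(formats):
    """Turn the flat key/pattern map into a list of structured records.

    The original key is kept alongside the parsed fields so it can
    still serve as an opaque identifier.
    """
    return [
        {"key": key, **parse_decimal_format_key(key), "pattern": pattern}
        for key, pattern in formats.items()
    ]
```

Every client that wants the numeric threshold has to carry something like
this, which is the duplication the structured form would remove.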
>
> This becomes much more complicated when dealing with dateFields.json,
> which include keys with particularly complex encodings. From
> main/root/dateFields.json:
>
> "sat-narrow": {
> "relative-type--1": "last Sa",
> "relative-type-0": "this Sa",
> "relative-type-1": "next Sa"
> },
> "dayperiod": {
> "displayName": "AM/PM",
> "displayName-alt-variant": "am/pm"
> },
>
> For this, I need to know that the "-" separators have multiple meanings,
> and might be present (or not), and could act either as a field separator,
> or as a negation operation in front of a number.
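
The double meaning of "-" shows up as soon as the key is split naively. The
sketch below, using a made-up helper name relative_offset, anchors on the
"relative-type-" prefix instead and treats an optional leading "-" as the sign
of the number. It assumes keys of exactly that shape and returns None for
anything else.

```python
import re

# A naive split cannot tell separator from sign:
#   "relative-type--1".split("-") gives ["relative", "type", "", "1"]

# Anchor on the fixed prefix; the optional "-" afterwards is a minus sign.
RELATIVE_KEY = re.compile(r"^relative-type-(-?\d+)$")


def relative_offset(key):
    """Return the signed offset encoded in a "relative-type-N" key.

    Returns None when the key does not have that shape, for example
    "displayName".
    """
    match = RELATIVE_KEY.match(key)
    return int(match.group(1)) if match else None
```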
>
> I think we can keep the keys as-is as opaque unique identifiers, but the
> values should be more structured. A map with separate fields for the
> meanings of each item in the key (plus the original value) would be great.
> The original XML format does this pretty well; I think we can do that in
> the JSON without too much trouble.
>
> 2) Much of the LDML data is represented as serialized UTS #35 UnicodeSet
> objects, which requires deserializing them to understand the underlying
> meaning
>
> For example, main/root/characters.json includes:
>
> "characters": {
> "exemplarCharacters": "[]",
> "auxiliary": "[]",
> "punctuation": "[\\\\- , ; \\\\: ! ? . ( ) \\\\[ \\\\] \\\\{ \\\\}]",
> (snip)
> }
>
> This means every program which wants to interact with this data needs to
> include a UTS #35 UnicodeSet deserializer (or forward the raw patterns on
> to the client with the assumption that it will include a UnicodeSet
> deserializer).
>
> For many languages including JavaScript / ECMAScript, I don't think there
> exists such a deserializer today—please let me know if I'm wrong!
>
> Ben