Merging unit skeletons for output - a better way?

Kip Cole kipcole9 at gmail.com
Sat May 16 02:26:05 CDT 2020


Congratulations to those who implemented the new Unit conversion and preference data in CLDR 37. It’s been a joy to implement on top of the data, and not without a few challenges :-)

One area that appears undocumented, and one that is quite tricky to implement, is merging unit skeletons when outputting a string representation. I will use some examples to illustrate. All examples are using a unit value of “3” unless otherwise indicated, and all in the “en” locale.

## Basic question

Is there a better heuristic or some algorithm I’m missing that would improve this? It’s totally OK that this is a new part of CLDR, and working around it with some heuristics is also fine. I’m just after the community’s view of the best approach to take.

## Outputting a translatable unit (meaning it has a single skeleton in CLDR)

“Kilometer-per-hour” => “{0} kilometers per hour”

This is a simple case and the merging of the value into the skeleton is deterministic.
No issues, simple substitutions.

My implementation produces "3 kilometres per hour"

## Outputting a compound unit (no direct translation, composing is required)

“Kilometer per second” => “{0} kilometers”, “ per ” and “{0} second”

Now we have three skeletons that need to be merged. Here are the issues as I see them:

1. In order to resolve the skeleton for the denominator “second” I take the plural form for the value “1” (i.e. always the singular form)
2. Ignore the placeholder in the denominator so “{0} second” becomes “ second”
3. String join the three skeletons
4. Merge the number value into the placeholder “{0}”
5. Replace the double space between “per” and “second” that arises because there is a trailing space in the “per” skeleton and a leading space in the “ second” skeleton

All of this is a heuristic and I’m not at all sure it carries over to all other locales.

My implementation produces "3 kilometres per second"
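For anyone who wants to experiment with the five steps above, here is a minimal Python sketch of them. It is not the Elixir implementation: the `UNIT_PATTERNS` table, the `plural_category` helper and the `format_per` function are hypothetical names made up for illustration, and the plural rule only covers English integers.

```python
import re

# Hypothetical, hand-copied "en" skeletons keyed by plural category.
# A real implementation would read these from the CLDR locale data.
UNIT_PATTERNS = {
    "kilometer": {"one": "{0} kilometer", "other": "{0} kilometers"},
    "second": {"one": "{0} second", "other": "{0} seconds"},
}
PER_PATTERN = " per "  # the "per" skeleton as quoted in the example above


def plural_category(n):
    # Simplified English rule for integers only: 1 is "one", anything else "other".
    return "one" if n == 1 else "other"


def format_per(value, numerator, denominator):
    numerator_pattern = UNIT_PATTERNS[numerator][plural_category(value)]
    # Step 1: the denominator always uses the plural form for the value 1.
    denominator_pattern = UNIT_PATTERNS[denominator][plural_category(1)]
    # Step 2: drop the placeholder from the denominator.
    denominator_pattern = denominator_pattern.replace("{0}", "")
    # Step 3: string-join the three skeletons.
    joined = numerator_pattern + PER_PATTERN + denominator_pattern
    # Step 4: merge the number into the remaining placeholder.
    merged = joined.replace("{0}", str(value))
    # Step 5: collapse the doubled space left between "per" and "second".
    return re.sub(r"\s{2,}", " ", merged).strip()


print(format_per(3, "kilometer", "second"))
```

The whitespace collapse in step 5 is the part that feels most fragile: it assumes the locale separates words with spaces at all.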

## Outputting with an SI prefix (and/or square and cubic prefix)

This is the case when the applied SI prefix has no direct translation and we are composing the translation.

“Millifurlong” => “milli{0}”, “{0} furlongs”

The heuristic I currently apply is:

1. Since the prefix skeleton has the placeholder after the text it is merged in front of the unit
2. The placeholder of the prefix skeleton is deleted => “milli”, “{0} furlongs”
3. The prefix is merged to the front of the text in the unit skeleton => “{0} millifurlongs”
4. Merge the number value into the placeholder

The heuristic of merging the SI (or other) prefix into the unit skeleton is unlikely to be correct for all locales.

My implementation produces "3 millifurlongs"
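The prefix heuristic above can be sketched in Python as follows. Again this is only an illustration, not the Elixir code: `apply_prefix` and `format_prefixed` are hypothetical names, and the sketch deliberately refuses any skeleton shape other than the one described (prefix text before its placeholder, unit text after its placeholder).

```python
PLACEHOLDER = "{0}"


def apply_prefix(prefix_pattern, unit_pattern):
    # Step 1: only handle prefix skeletons whose placeholder follows the text,
    # e.g. "milli{0}". Other shapes are outside this sketch.
    if not prefix_pattern.endswith(PLACEHOLDER):
        raise ValueError("prefix placeholder is not at the end: " + prefix_pattern)
    # Step 2: delete the placeholder from the prefix skeleton.
    prefix = prefix_pattern.replace(PLACEHOLDER, "")
    # Step 3: merge the prefix onto the front of the unit text.
    head, sep, tail = unit_pattern.partition(PLACEHOLDER)
    if sep == "" or head.strip():
        # Assumption: the unit skeleton starts with its placeholder, as in "{0} furlongs".
        raise ValueError("unsupported unit skeleton: " + unit_pattern)
    text = tail.lstrip()
    gap = tail[: len(tail) - len(text)]  # whitespace between placeholder and unit text
    return head + sep + gap + prefix + text


def format_prefixed(value, prefix_pattern, unit_pattern):
    # Step 4: merge the number into the placeholder.
    return apply_prefix(prefix_pattern, unit_pattern).replace(PLACEHOLDER, str(value))


print(format_prefixed(3, "milli{0}", "{0} furlongs"))
```

Raising on unexpected skeleton shapes at least makes it visible which locales the heuristic does not cover, rather than silently producing a wrong string.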

## Outputting a compound unit (using the “times” skeleton)

This is where we have a unit leveraging the “times” skeleton.

“Furlong light year” => “{0} light years”, “⋅”, “ {0} furlongs”

1. The order of the skeletons is determined by the canonical sort order in Units.xml
2. The “times” skeleton is introduced between the two units
3. Current heuristic is to omit the placeholder on all but the first skeleton (there may be n skeletons)
4. String join skeletons
5. Replace duplicate whitespace

This has issues similar to the earlier “per” example: collapsing duplicate whitespace is required.
It also needs a heuristic for deciding when a sub-unit takes the plural form and when it takes the singular form.

My implementation produces: "3 light years⋅furlongs"
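A minimal Python sketch of steps 2 to 5 follows, assuming the caller has already sorted the skeletons into canonical order (step 1). The name `format_times` is hypothetical, and trimming each later skeleton after its placeholder is removed is an extra assumption added here so that no space is left after the “⋅”.

```python
import re

TIMES_PATTERN = "⋅"  # the "times" skeleton as quoted in the example above


def format_times(value, unit_patterns):
    # unit_patterns: plural-resolved skeletons, already in canonical order.
    parts = []
    for index, pattern in enumerate(unit_patterns):
        if index == 0:
            parts.append(pattern)
        else:
            # Step 3: omit the placeholder on all but the first skeleton.
            # Assumption: also trim the whitespace the placeholder leaves behind.
            parts.append(pattern.replace("{0}", "").strip())
    # Steps 2 and 4: introduce the "times" skeleton between units and join.
    joined = TIMES_PATTERN.join(parts)
    # Merge the number into the one remaining placeholder.
    merged = joined.replace("{0}", str(value))
    # Step 5: replace any remaining duplicate whitespace.
    return re.sub(r"\s{2,}", " ", merged).strip()


print(format_times(3, ["{0} light years", " {0} furlongs"]))
```

With the two skeletons from the example this gives the same string as shown above. Every sub-unit is passed in with the same plural form, so the sketch inherits the same weakness.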

It uses the same plural form for all sub-units. It’s not “correct” English and it’s just as likely to be the wrong strategy for most locales (this is a guess).

Many thanks for any help or suggestions,

—Kip

PS: In case anyone gets this far, the implementation is in the Elixir language at https://github.com/elixir-cldr/cldr_units






