Merging unit skeletons for output - a better way?

Mark Davis ☕️ mark at macchiato.com
Sat May 16 15:00:00 CDT 2020


Thanks for the detailed report. Can you file this as a ticket?

The biggest problem is that the spec for constructing the fallback compound
names is missing details for the times pattern, the power patterns, and the
prefix patterns. The per pattern appears to be complete
(https://unicode.org/reports/tr35/tr35-general.html#perUnitPatterns), and
some of what is described there also applies to the other complex names, such
as the caveat that the fallback name construction may not work well for
languages with inflections, and that the "remove the placeholder" step does
require removing spaces around the {0}. Note that "square" doesn't work
right for gendered languages, because it often needs to agree with the
base unit.

First, we need to add full descriptions for all the complex fallback names.
Second, we need to add a test file with construction of some more
complicated names.

As to the details, the heuristics do have to play with the plurals,
including having all but the last in a times sequence use the singular.
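
A minimal sketch of that plural rule for a times sequence, in Python. The
pattern table, the separator constant, and the function names are hypothetical
stand-ins rather than the actual CLDR data layout, and only the English
"one"/"other" forms are modeled:

```python
# Hypothetical English unit patterns keyed by plural category.
# This is an assumed shape for illustration, not the real CLDR data layout.
UNIT_PATTERNS = {
    "furlong": {"one": "{0} furlong", "other": "{0} furlongs"},
    "light-year": {"one": "{0} light year", "other": "{0} light years"},
}
TIMES_SEPARATOR = "⋅"  # assumed separator, taken from the example in this thread


def bare_name(unit, category):
    """Return the unit name with the {0} placeholder and adjacent spaces removed."""
    return UNIT_PATTERNS[unit][category].replace("{0}", "").strip()


def format_times(number, category, units):
    """Join units with the times separator, singular for all but the last unit."""
    *leading, last = units
    names = [bare_name(unit, "one") for unit in leading]
    names.append(bare_name(last, category))
    # Assumes the number comes first, followed by a space, as in English.
    return f"{number} " + TIMES_SEPARATOR.join(names)


print(format_times(3, "other", ["light-year", "furlong"]))
# prints: 3 light year⋅furlongs
```

Only the last unit takes the plural category of the number; whether that holds
outside English is exactly the kind of thing the spec needs to state.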

We are in the process of gathering information for including gender and
case, and will need heuristics for those as well. For example, my current
draft has:


   1. Prefixes & powers: the gender of the whole is the same as the gender of
      the operand. In pseudocode:
      1. gender(square, meter) = gender(meter)
      2. gender(kilo, meter) = gender(meter)
   2. Per: the gender of the whole is the gender of the numerator
      1. gender(gram per meter) = gender(gram)
   3. Times: the gender of the whole is the gender of the last operand
      1. gender(gram-meter) = gender(meter)
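
The three draft rules could be sketched in Python roughly as follows. The
nested-tuple representation of a compound unit, the function name, and the
gender values are all assumptions made for illustration; the genders are
placeholders and not data for any particular language:

```python
# Placeholder genders for base units; illustrative only, not real locale data.
BASE_GENDERS = {"gram": "neuter", "meter": "masculine"}


def compound_gender(unit, base_genders=BASE_GENDERS):
    """Derive the gender of a compound unit from the draft rules.

    A unit is either a base-unit string or a tuple (hypothetical encoding):
      ("prefix", "kilo", operand), ("power", "square", operand),
      ("per", numerator, denominator), ("times", operand, operand, ...)
    """
    if isinstance(unit, str):
        return base_genders[unit]
    kind = unit[0]
    if kind in ("prefix", "power"):
        # Prefixes & powers: same gender as the operand.
        return compound_gender(unit[2], base_genders)
    if kind == "per":
        # Per: gender of the numerator.
        return compound_gender(unit[1], base_genders)
    if kind == "times":
        # Times: gender of the last operand.
        return compound_gender(unit[-1], base_genders)
    raise ValueError(f"unknown compound kind: {kind!r}")


print(compound_gender(("per", "gram", ("power", "square", "meter"))))
# prints: neuter
```

The rules compose recursively, so a unit such as kilogram per square meter
resolves through the prefix rule and the per rule to the gender of gram.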


NOTE: I'm sure that we will find cases of languages that have different
strategies for dealing with the plural, gender, and case in the complex
cases; so we'll undoubtedly need to refine as we go along.

Mark


On Sat, May 16, 2020 at 12:27 AM Kip Cole via CLDR-Users <
cldr-users at unicode.org> wrote:

> Congratulations to those who implemented the new Unit conversion and
> preference data in CLDR 37. It’s been a joy to implement on top of the data,
> and not without a few challenges :-)
>
> One area that appears undocumented, and one that is quite tricky to
> implement, is merging unit skeletons when outputting a string
> representation. I will use some examples to illustrate. All examples are
> using a unit value of “3” unless otherwise indicated, and all in the “en”
> locale.
>
> ## Basic question
>
> Is there a better heuristic or some algorithm I’m missing that would
> improve this? It’s totally OK that this is a new part of CLDR, and working
> around it with some heuristics is also fine. I’m just after the community’s
> view of the best approach to take.
>
> ## Outputting a translatable unit (meaning it has a single skeleton in
> CLDR)
>
> “Kilometer-per-hour” => “{0} kilometers per hour”
>
> This is a simple case and the merging of the value into the skeleton is
> deterministic.
> No issues, simple substitutions.
>
> My implementation produces "3 kilometres per hour"
>
> ## Outputting a compound unit (no direct translation, composing is
> required)
>
> “Kilometer per second” => “{0} kilometers”, “ per “ and “{0} second”
>
> Now we have three skeletons that need to be merged. Here are the
> issues as I see them:
>
> 1. In order to resolve the skeleton for the denominator “second” I take
> the plural value for “1” (ie always singular form)
> 2. Ignore the placeholder in the denominator so “{0} second” becomes “
> second”
> 3. String join the three skeletons
> 4. Merge the number value into the placeholder “{0}”
> 5. Replace the double space between “per” and “second” that arises because
> there is a trailing space in the “per” skeleton and a leading space in the
> “ second” skeleton
>
> All of this is a heuristic and I’m not at all sure it is transferable to
> all other locales.
>
> My implementation produces "3 kilometres per second"
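
The five steps above can be sketched in Python as follows. The pattern table
and function name are hypothetical, the patterns follow the shapes quoted in
this thread, and this is an illustration of the heuristic rather than the
cldr_units implementation:

```python
import re

# Hypothetical English patterns, shaped like the examples in this thread.
UNIT_PATTERNS = {
    "kilometer": {"one": "{0} kilometer", "other": "{0} kilometers"},
    "second": {"one": "{0} second", "other": "{0} seconds"},
}
PER_PATTERN = " per "  # assumed "per" skeleton with its surrounding spaces


def format_per(number, category, numerator, denominator):
    """Compose "<numerator> per <denominator>" using the heuristic steps."""
    # 1. The denominator always uses the singular ("one") form.
    den = UNIT_PATTERNS[denominator]["one"]
    # 2. Drop the denominator's placeholder: "{0} second" -> " second".
    den = den.replace("{0}", "")
    # 3. String-join the three skeletons.
    joined = UNIT_PATTERNS[numerator][category] + PER_PATTERN + den
    # 4. Merge the number into the one remaining placeholder.
    merged = joined.replace("{0}", str(number))
    # 5. Collapse the doubled space between "per" and the denominator.
    return re.sub(r" {2,}", " ", merged).strip()


print(format_per(3, "other", "kilometer", "second"))
# prints: 3 kilometers per second
```
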
>
> ## Outputting with an SI prefix (and/or square and cubic prefix)
>
> This is the case when the applied SI prefix has no direct translation and
> we are composing the translation.
>
> “Millifurlong” => “milli{0}”, “{0} furlongs”
>
> The heuristic I currently apply is:
>
> 1. Since the prefix skeleton has the placeholder after the text it is
> merged in front of the unit
> 2. The placeholder of the prefix skeleton is deleted => “milli”, “{0}
> furlongs”
> 3. The prefix is merged to the front of the text in the unit skeleton =>
> “{0} millifurlongs”
> 4. Merge the number value into the placeholder
>
> The heuristic of merging the SI (or other) prefix into the unit skeleton
> is unlikely to be correct for all locales.
>
> My implementation produces "3 millifurlongs"
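
A rough Python sketch of the prefix merge. It substitutes the unit text into
the prefix skeleton's placeholder, which gives the same result as the steps
above for this example; the function names are hypothetical and the approach
is an assumption, not the cldr_units implementation:

```python
def apply_prefix(prefix_pattern, unit_pattern):
    """Merge a prefix skeleton such as "milli{0}" into "{0} furlongs"."""
    # Text of the unit skeleton without its placeholder: "furlongs".
    unit_text = unit_pattern.replace("{0}", "").strip()
    # Put the unit text where the prefix placeholder was: "millifurlongs".
    prefixed = prefix_pattern.replace("{0}", unit_text)
    # Swap the prefixed text back into the unit skeleton, keeping its {0}.
    return unit_pattern.replace(unit_text, prefixed, 1)


def format_prefixed(number, prefix_pattern, unit_pattern):
    """Apply the prefix, then merge the number into the placeholder."""
    return apply_prefix(prefix_pattern, unit_pattern).replace("{0}", str(number))


print(apply_prefix("milli{0}", "{0} furlongs"))        # prints: {0} millifurlongs
print(format_prefixed(3, "milli{0}", "{0} furlongs"))  # prints: 3 millifurlongs
```

Because the unit text is substituted at the position of the prefix
placeholder, the same code would also handle a skeleton whose text follows
the placeholder, though that is untested against real locale data.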
>
> ## Outputting a compound unit
>
> This is where we have a unit leveraging the “times” skeleton.
>
> “Furlong light year” => “{0} light years”, “⋅”, “ {0} furlongs”
>
> 1. The order of the skeletons is determined by the canonical sort order in
> Units.xml
> 2. The “times” skeleton is introduced between the two units
> 3. Current heuristic is to omit the placeholder on all but the first
> skeleton (there may be n skeletons)
> 4. String join skeletons
> 5. Replace duplicate whitespace
>
> This has similar issues as the previous “prefix” example - collapsing
> duplicate whitespace is required.
> It also has the heuristic of determining when to use the plural form for a
> sub-unit or the singular form.
>
> My implementation produces: "3 light years⋅furlongs"
>
> It uses the same plural form for all sub units. It’s not “correct” English
> and it’s just as likely to be the wrong strategy for most locales (this is a
> guess).
>
> Many thanks for any help or suggestions,
>
> —Kip
>
> PS: In case anyone gets this far, the implementation is in the Elixir
> language at https://github.com/elixir-cldr/cldr_units
>