Use of Unicode 6.3 bidi format chars in CLDR number formats?

Philippe Verdy verdy_p at wanadoo.fr
Fri Apr 29 00:58:44 CDT 2016


2016-04-29 6:37 GMT+02:00 Steven Loomis <srl at icu-project.org>:

> Asmus:
>
> > Given the correct choice of internal format for the database,
>
> The internal format is a Unicode String, specifically, UTF-8.
>
> > Given that CLDR data should be specifying the desired appearance
>
> But CLDR is text, specifically, XML, and not glyphs…
>

In my opinion the only differences lie in how these translated resources
are used: if they are intended only for plain-text documents (without any
form of styling or DOM), then Bidi controls may be needed.

For usage within rich-text documents, those Bidi controls are in fact more
of a nuisance.

I don't think it is the role of CLDR to dictate how those generated strings
will be embedded in documents. So those resources should remain
self-contained, independently of their context of use: we should only
expect the Bidi algorithm to give the correct result for the translated
item itself, in isolation.

If with this isolation we don't need any control, don't insert any:
it's up to the outer context to specify those that will be needed around
the translated resource. So RLI/PDI or LRI/PDI should never be needed...
except to surround only a *part* of the translated resource, excluding the
start and/or end of it.
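
A minimal sketch of how a checker could tell the two cases apart: it
reports whether an isolate opened on the first character is closed by the
last one (i.e. it wraps the *whole* resource and could be left to the outer
context). The function name is hypothetical, not part of CLDR or ICU.

```python
# Isolate initiators and terminator introduced in Unicode 6.3.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
INITIATORS = {LRI, RLI, FSI}


def wraps_whole_resource(text: str) -> bool:
    """Return True if an isolate opened at index 0 is closed by the last char."""
    if len(text) < 2 or text[0] not in INITIATORS or text[-1] != PDI:
        return False
    depth = 0
    for index, char in enumerate(text):
        if char in INITIATORS:
            depth += 1
        elif char == PDI and depth > 0:
            depth -= 1
            if depth == 0:
                # The isolate opened at index 0 closes here; it wraps the
                # whole resource only if this is the final character.
                return index == len(text) - 1
    return False
```

A resource such as RLI + "abc" + PDI would be flagged, while one that
isolates only an inner part would not.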

However, if the *whole* translated resource may need Bidi controls in some
contexts, I think this should be provided only as external metadata, to
indicate how it can be safely embedded into another context. As those
resources are normally created for a specific locale, they already have an
implicit default direction associated with that locale (including root, if
ever needed).

The alternative would be to provide two distinct resources, one for use in
isolation (rich-text documents providing their own embedding via markup or
style), another with surrounding Bidi controls for use in non-isolated
contexts such as plain-text documents, but it would be overkill. My opinion
is that it is enough to specify that a translated resource MUST be used in
isolation only. (This is not strictly the case for currency amounts
composed of a formatted number and a currency symbol, or for other
formatted numbers with a measurement unit, both normally following the
regular word order of the external language, except in English and similar
languages which put the currency symbol before the amount.)

There are similar issues with formatting more complex numbers: ordering of
the positive or negative sign, ordering of the exponential notation,
ordering of an additional percent/permille symbol, ordering of additional
fractions (when not using the decimal notation but a true fraction
separated from the integer part by some additive notation or only by
"styling" the fraction itself), and ordering of date elements in numeric
formats, notably in abbreviated notations (e.g. "29/04/16", which should
not be implicitly reordered as "16/04/29" depending on the RTL or LTR
context before it). Each language has its own interpretation of dates in
specific orders, even if spans of characters inside the numeric notation
have weak directions.
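
To illustrate the last point, here is a small sketch that lists the Bidi
classes of the characters in the abbreviated date, using Python's standard
unicodedata module. It does not run the Bidi algorithm; it only shows that
the string contains no strongly directional character, which is why its
display order depends on the surrounding context. The helper names are
mine.

```python
import unicodedata

STRONG_CLASSES = {"L", "R", "AL"}  # strong Bidi classes in UAX #9


def bidi_classes(text: str) -> list[str]:
    """Return the Bidi class of each character (from the Unicode database)."""
    return [unicodedata.bidirectional(char) for char in text]


def has_strong_character(text: str) -> bool:
    """Return True if any character has a strong direction."""
    return any(cls in STRONG_CLASSES for cls in bidi_classes(text))


print(bidi_classes("29/04/16"))
# → ['EN', 'EN', 'CS', 'EN', 'EN', 'CS', 'EN', 'EN']
print(has_strong_character("29/04/16"))
# → False
```

EN is "European number" and CS is "common separator"; both are weak
classes.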

In all these cases, the resources providing the format should just specify
in metadata whether they expect a specific ordering (either "rtl", or
"ltr", or "inherited" by default for almost all resources), and whether the
resource should also be isolated or not (affecting the order of elements in
the context after it, or whether those elements after the embedded resource
should restore the direction that was effective before the start of the
embedded resource). This could be just a few optional attributes in
resources, with 5 radio-buttons in the CLDR interface to define them:

  * default/inherited
  * rtl
  * rtl isolated
  * ltr
  * ltr isolated

(The 6th option, inherited with isolation, is not needed in my opinion.)
The old "embed" style of CSS should be deprecated. And we should never have
to use Bidi overrides (RLO/PDF or LRO/PDF) or single marks like LRM and
RLM, which break everything.

In most translated resources the "default/inherited" option will be used,
with no need for any additional attributes in the LDML schema. Otherwise
we'll see two optional attributes: isolate="true" (the default being the
same as isolate="false"), and dir="ltr" or dir="rtl" (the default being
dir="inherited").
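
As a sketch of that mapping, the five choices could translate to attributes
like this, with defaults simply omitted. The "dir" and "isolate" attributes
are only the ones proposed in this message (they are not in the LDML
schema), and the function name is hypothetical.

```python
# Each radio-button choice maps to (direction, isolated).
OPTIONS = {
    "default/inherited": ("inherited", False),
    "rtl": ("rtl", False),
    "rtl isolated": ("rtl", True),
    "ltr": ("ltr", False),
    "ltr isolated": ("ltr", True),
}


def attributes_for(option: str) -> dict[str, str]:
    """Return only the non-default attributes for the chosen option."""
    direction, isolated = OPTIONS[option]
    attributes = {}
    if direction != "inherited":
        attributes["dir"] = direction
    if isolated:
        attributes["isolate"] = "true"
    return attributes


print(attributes_for("default/inherited"))  # → {}
print(attributes_for("rtl isolated"))       # → {'dir': 'rtl', 'isolate': 'true'}
```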

In such cases, we'll never need any Bidi control in the resources, or they
can be generated on the fly by the I18n library for use in plain-text-only
contexts.
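
A rough sketch of what such on-the-fly generation might look like for the
isolated cases, assuming the library has already read the proposed "dir"
and "isolate" metadata. The function is hypothetical and not an ICU API.
This message does not say which controls (if any) a non-isolated direction
would produce, so the sketch leaves that text unchanged.

```python
LRI, RLI, PDI = "\u2066", "\u2067", "\u2069"


def for_plain_text(text: str, direction: str = "inherited",
                   isolate: bool = False) -> str:
    """Wrap a resource in isolate controls for a plain-text context."""
    if direction not in ("inherited", "ltr", "rtl"):
        raise ValueError(f"unknown direction: {direction!r}")
    if direction == "inherited" or not isolate:
        # Default case, or a non-isolated direction (left unspecified
        # here): return the stored resource as-is, with no controls.
        return text
    opener = LRI if direction == "ltr" else RLI
    return opener + text + PDI
```

A rich-text caller would skip this step and express the same metadata with
markup or style instead.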