CLDR TL;DR article

Philippe Verdy verdy_p at wanadoo.fr
Wed Dec 24 07:35:02 CST 2014


2014-12-24 13:49 GMT+01:00 Jukka K. Korpela <jkorpela at cs.tut.fi>:

> 2014-12-24, 13:55, Philippe Verdy wrote, commenting on an announcement of
> http://perladvent.org/2014/2014-12-23.html :
>
>  That article about the Locale::CLDR gives an example of bad usage with:
>>
>>   *
>>
>>     fr: «foo», «bar» et «baz»
>>
>
> I agree, but I think it’s a more serious mistake to have
>
> ur: ”foo“، ”bar“، اور ”baz“
>
> This is a longstanding issue with no clear solution so far. In plain text,
> you can choose between SPACE, NO-BREAK SPACE, one of the “fixed-width”
> spaces like THIN SPACE, and the NARROW NO-BREAK SPACE. The “fixed-width”
> spaces (which largely aren’t fixed-width in reality) are by definition
> compatibility equivalent to SPACE, with its line breaking behavior. The
> NARROW NO-BREAK SPACE would seem ideal, but it has really been designed for
> a different purpose and there is no reason to expect that its width
> corresponds to that of espace fine insécable in French typography;
> moreover, its availability in fonts is limited, and it may still cause a
> symbol of undisplayable character to appear—surely worse than a space of
> any width, or no space.
>

THINSP (THIN SPACE, U+2009) is definitely THE most correct space to use; the
other spaces (NBSP and the standard SPACE) are just best-fit fallbacks for
contexts where Unicode is not usable. The absence of any space is definitely
wrong.

(This is no longer an issue in HTML. It may remain an issue only when
converting from Unicode to legacy 8-bit code pages, where NBSP is the only
such space available: ISO 8859-1, Windows-1252 and CP850, the three legacy
charsets most used for French.)
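
As a rough illustration of that conversion step, here is a minimal Python
sketch. It assumes the converter is allowed to substitute NBSP for the thin
spaces; the handler name "space_fallback", the FALLBACKS table and the
to_legacy() helper are made up for this example and are not part of CLDR or
of any library.

```python
import codecs

THIN_SPACE = "\u2009"
NARROW_NBSP = "\u202f"
NBSP = "\u00a0"

# Hypothetical best-fit table: both thin spaces degrade to NBSP.
FALLBACKS = {THIN_SPACE: NBSP, NARROW_NBSP: NBSP}


def space_fallback(error):
    """Codec error handler: substitute a best-fit space instead of failing."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Anything without a known fallback becomes "?", like errors="replace".
    replacement = "".join(FALLBACKS.get(ch, "?") for ch in bad)
    return replacement, error.end


codecs.register_error("space_fallback", space_fallback)


def to_legacy(text, charset="iso8859-1"):
    """Encode text to a legacy 8-bit charset, keeping a space where possible."""
    return text.encode(charset, errors="space_fallback")


print(to_legacy("«\u2009foo\u2009»"))
```

The point of the design is that the space is never dropped: the thin space
becomes NBSP in the legacy output rather than disappearing.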

THINSP is now rendered correctly by all current versions of OpenType
renderers.

The world now speaks Unicode in all new applications all over the web, and
even in databases; fallbacks may only be needed for some console
applications using those legacy charsets (but console drivers handle
these fallbacks, or should).

So the only thing I personally don't like in CLDR is that it restricts
these punctuation marks to a single Unicode character each (this restriction
is quite... ahem... stupid).

If you want to preserve compatibility, that matters ONLY for applications
that do not "speak" Unicode at all (not even UTF-8, which already generates
multibyte sequences that do not fit in one of their 8-bit "characters"). In
that case the single-character punctuation for legacy apps would have to be
restricted to ASCII, and you'll have problems with most Asian languages, or
with Armenian, Arabic, Persian, Urdu...: they need "multibyte" strings for
their punctuation in legacy apps, and will be forced to use either UTF-8 for
non-ASCII characters or various non-portable legacy 8-bit charsets.
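
To make the "multibyte" point concrete, a small Python check of how many
bytes a few of the punctuation characters discussed in this thread take in
UTF-8 (the helper name utf8_length is only for illustration):

```python
def utf8_length(ch):
    """Number of bytes needed to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "QUOTATION MARK": '"',          # ASCII
    "LEFT GUILLEMET": "\u00ab",     # «
    "ARABIC COMMA": "\u060c",       # ، as in the Urdu example
    "THIN SPACE": "\u2009",
}

for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

Only the ASCII quotation mark fits in a single 8-bit unit; the guillemet and
the Arabic comma take two bytes and the thin space takes three.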

Personally I think it's up to the adapters processing the CLDR data to
provide these fallbacks. For French, the fallback from THINSP to an empty
string is always wrong, even if it may be a good choice for English, whose
typographic thin space was traditionally narrower (1/8 em, instead of 1/6
to 1/4 em in French typography).
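
A minimal sketch of what such an adapter could do, in Python. The per-locale
chains below are an assumption made for this example (they are not CLDR
data), and pick_space() is a hypothetical helper: it returns the first
candidate that the target charset can encode.

```python
THIN_SPACE = "\u2009"
NBSP = "\u00a0"

# Hypothetical fallback chains, ordered from best to worst.
# French never falls back to "no space"; English may.
FALLBACK_CHAINS = {
    "fr": [THIN_SPACE, NBSP, " "],
    "en": [THIN_SPACE, ""],
}
DEFAULT_CHAIN = [THIN_SPACE, NBSP, " "]


def pick_space(locale, charset):
    """Return the best space for this locale that the charset can encode."""
    for candidate in FALLBACK_CHAINS.get(locale, DEFAULT_CHAIN):
        try:
            candidate.encode(charset)
        except UnicodeEncodeError:
            continue
        return candidate
    raise ValueError(f"no usable fallback for {locale!r} in {charset!r}")


def quote_fr(word, charset="utf-8"):
    """Wrap a word in guillemets with the best available inner space."""
    space = pick_space("fr", charset)
    return f"\u00ab{space}{word}{space}\u00bb"


print(repr(quote_fr("foo", "utf-8")))
print(repr(quote_fr("foo", "iso8859-1")))
```

With UTF-8 the French quote keeps THINSP; with ISO 8859-1 it degrades to
NBSP, and only the English chain is ever allowed to end at an empty string.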