CLDR (was: Private Use areas)

Marcel Schneider via Unicode unicode at
Fri Aug 31 05:17:41 CDT 2018

On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
> > one couldn’t simply pop them into XML or whatever, as the result would be 
> > disappointing and call for completion in the aftermath. Yet another task 
> > competing with CLDR survey.
> Please elaborate. It's not clear for me what do you mean.

These comments are designed for the Code Charts and as such must not be
disproportionate in exhaustivity. Eg we have lists of related languages ending 
in an ellipsis. Once this is popped into XML, ie extracted from NamesList.txt
to be fed in an extensible and unconstrained format (without any constraint 
as of available space, number and length of comments, and so on), any lack 
is felt as a discriminating neglect, and there will be a huge rush adding data.
Yet Unicode hasn’t set up products where that data could be published, ie not 
in the Code Charts (for the abovementioned reason), not in ICU so far as the 
additional information involved does not match a known demand on user side 
(localizing software does not mean providing scholarly exhaustive information
about supported characters). The use will be in character pickers providing 
every available information about a given character. That is why Unicode is
to prioritize CLDR for CLDR users, rather than extra information for the web.

> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
> Which XML? where?

More precisely it is LDML, the CLDR-specific XML.
What I called “digest charts” are the charts found here:

The access is via this page:

where the charts are in the Charts column, while the raw data is under SVN Tag.

> > and we really 
> > need to go through the data and correct the many many errors, please.
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.

I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive 
for the access to the XML data (except when knowing about SubVersioN).
Polish data is found here:

The access is via the top of the "Summary" index page (showing root data):

You may wish to particularly check the By-Type charts:

Here I’d suggest to first focus on alphabetic information and on punctuation.

Under Latin (table caption, without anchor) we find out what punctuation 
Polish has compared to other locales using the same script.
The exact character appears when hovering the header row.
Eg U+2011 NON-BREAKING HYPHEN is systematically missing, which is 
an error in almost every locale using hyphen. TC is about to correct that.

Further you will see that while Polish is using apostrophe
CLDR does not have the correct apostrophe for Polish, as opposed eg to French.
You may wish to note that from now on, both U+0027 APOSTROPHE and 
U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
preferred characters in publishing are U+2019 and, for Polish, the U+201E and 
U+201D that are already found in CLDR pl.

Note however that according to the information provided by English Wikipedia:
Polish also uses single quotes, that by contrast are still missing in CLDR.

Now you might understand what I meant when pointing that there are still 
many errors in many languages in CLDR, including in English.

Best regards,


> Best regards
> Janusz
> -- 
> , 
> Janusz S. Bien
> emeryt (emeritus)

More information about the Unicode mailing list