CLDR

Mon Sep 3 02:45:38 CDT 2018

On Fri, Aug 31 2018 at 10:27 +0200, Manuel Strehl via Unicode wrote:
> The XML files in these folders:
>
> https://unicode.org/repos/cldr/tags/latest/common/

Thanks for the link.

In the meantime I rediscovered Locale Explorer

http://demo.icu-project.org/icu-bin/locexp

which I used some time ago.

On Fri, Aug 31 2018 at 12:17 +0200, Marcel Schneider via Unicode wrote:
> On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
> […]
>> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
>> > one couldn’t simply pop them into XML or whatever, as the result would be 
>> > disappointing and call for completion in the aftermath. Yet another task 
>> > competing with CLDR survey.
>> 
>> Please elaborate. It's not clear for me what do you mean.
>
> These comments are designed for the Code Charts and as such must not be
> disproportionate in exhaustivity. Eg we have lists of related languages ending 
> in an ellipsis.

Looks like we have different comments in mind.

[...]

>> > Reviewing CLDR data is IMO top priority.
>> > There are many flaws to be fixed in many languages including in English.
>> > A lot of useful digest charts are extracted from XML there,
>> 
>> Which XML? where?
>
> More precisely it is LDML, the CLDR-specific XML.
> What I called “digest charts” are the charts found here:
>
> http://www.unicode.org/cldr/charts/34/
>
> The access is via this page:
>
> http://cldr.unicode.org/index/downloads
>
> where the charts are in the Charts column, while the raw data is under
> SVN Tag.

Thanks for the link. I found the Polish section in

https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html

especially interesting. It looks like complete rubbish, e.g.

plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of
Pomorze) transliterated into the Greek alphabet (and something in
Arabic).

The header of the page says "The coverage depends on the availability of
data in wikidata for these names" but I was unable to find this rubbish
in Wikidata (but I was not looking very hard).

>
>> 
>> > and we really 
>> > need to go through the data and correct the many many errors, please.

But who is the right person or institution to do it?

>> 
>> Some time ago I tried to have a close look at the Polish locale and
>> found the CLDR site prohibitively confusing.
>
> I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive 
> for the access to the XML data (except when knowing about SubVersioN).
> Polish data is found here:
>
> https://www.unicode.org/cldr/charts/34/summary/pl.html
>
> The access is via the top of the "Summary" index page (showing root data):
>
> https://www.unicode.org/cldr/charts/34/summary/root.html
>
> You may wish to particularly check the By-Type charts:
>
> https://www.unicode.org/cldr/charts/34/by_type/index.html
>
> Here I’d suggest to first focus on alphabetic information and on punctuation.
>
> https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html
>
> Under Latin (table caption, without anchor) we find out what punctuation 
> Polish has compared to other locales using the same script.
> The exact character appears when hovering the header row.
> Eg U+2011 NON-BREAKING HYPHEN is systematically missing, which is 
> an error in almost every locale using hyphen. TC is about to correct that.
>
> Further you will see that while Polish is using apostrophe
> https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
> CLDR does not have the correct apostrophe for Polish, as opposed eg to French.

I understand that by "the correct apostrophe" you mean U+2019 RIGHT
SINGLE QUOTATION MARK.

> You may wish to note that from now on, both U+0027 APOSTROPHE and 
> U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
> preferred characters in publishing are U+2019 and, for Polish, the U+201E and 
> U+201D that are already found in CLDR pl.

The situation seems more complicated because the chart

https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html

contains a different list of punctuation characters than

https://www.unicode.org/cldr/charts/34/summary/pl.html.

I guess the latter is the primary one, and it contains U+2019 RIGHT
SINGLE QUOTATION MARK (and U+2018 LEFT SINGLE QUOTATION MARK, too).
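
For reference, here is a minimal Python sketch that lists the code
points mentioned in this thread together with their Unicode names. It
uses only the standard unicodedata module; the names CODE_POINTS and
describe are mine, purely for illustration, and it does not read any
CLDR data, so it says nothing about which characters a given locale
actually contains.

```python
import unicodedata

# Code points mentioned in this thread (illustrative list, not CLDR data).
CODE_POINTS = [0x0022, 0x0027, 0x2011, 0x2018, 0x2019, 0x201D, 0x201E]


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME' for a code point, per Python's Unicode database."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


for cp in CODE_POINTS:
    # Show the formal name next to the character itself.
    print(describe(cp), repr(chr(cp)))
```

This makes it easy to check that "the correct apostrophe" and the
closing single quote are the same code point, U+2019.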

>
> Note however that according to the information provided by English Wikipedia:
> https://en.wikipedia.org/wiki/Quotation_mark#Polish
> Polish also uses single quotes, that by contrast are still missing in CLDR.

You are right, but who cares? It looks like this has no practical
importance. Nobody complains about the wrong use of quotation marks in
Polish by Word or OpenOffice, so apparently the software doesn't use
this information. So this is rather a matter of aesthetics...

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien