Mapping Unicode script name to CLDR script code

Richard Wordingham richard.wordingham at ntlworld.com
Sun Mar 14 07:04:31 CDT 2021


On Sun, 14 Mar 2021 09:41:22 +0800
Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:

> I note that
> https://unicode-org.github.io/cldr-staging/charts/39/supplemental/languages_and_scripts.html
> does map from Unicode language name (at least informally) to CLDR
> language code but that mapping isn’t, as far as I can see, in
> supplementalData.xml.

I think that's a map from allegedly English names to BCP 47 codes.  Not
all the script names are Unicode names.  For example, 'Lanna' is not.

> > On 14 Mar 2021, at 9:35 am, Kip Cole <kipcole9 at gmail.com> wrote:
> > 
> > Using the script properties (from Scripts.txt in the Unicode repo
> > for example), the script of some text can be detected.
> > 
> > However I am not able to find a mapping from Unicode script names
> > to CLDR script codes, i.e. a way to map "Hiragana -> Jpan" or
> > "Javanese -> Java".
> > 
> > I’ve checked supplementalData.xml and scriptMetadata.txt to no
> > avail.
> > 
> > Is there a canonical mapping somewhere?

As Doug pointed out, PropertyValueAliases.txt should normally work.
However, there are a number of cases that it doesn't handle:

Jpan is composed of (at least) three Unicode scripts: Hani, Hira and Kana.

Kore is a similar combination of the Unicode scripts Hani and Hang.

Hrkt expands to 'Hiragana or Katakana'; there might be some usage for
Japanese text that deliberately excludes kanji.

Latf and Latg are stylistic variants of Latn (Fraktur and Gaelic
respectively). I suspect there ought to be a lot of (largely)
predictable spelling differences between de-Latn and de-Latf
(basically where ligatures or their lack need to be noted) and between
ga-Latn and ga-Latg (how lenition is written).

Likewise, Syre, Syrj and Syrn are stylistic variants of Syrc.

Hans and Hant are the simplified and traditional character sets of
Chinese; both are specialisations of the generic code Hani.
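
The cases above can be sketched in Python as follows. This is only an
illustration, not a canonical mapping: the sample lines mimic the
'sc' entries of PropertyValueAliases.txt, and the names COMPOSITE,
VARIANT_OF, parse_script_aliases and script_code are hypothetical
helpers invented for this example. The rule of picking the smallest
composite code that covers the detected scripts is likewise an
assumption, not something CLDR specifies.

```python
# A few lines in the "property ; short alias ; long alias" format of
# PropertyValueAliases.txt. In real use, read the file from the UCD instead.
SAMPLE = """\
# Script (sc)
sc ; Hang ; Hangul
sc ; Hani ; Han
sc ; Hira ; Hiragana
sc ; Java ; Javanese
sc ; Kana ; Katakana
sc ; Latn ; Latin
sc ; Syrc ; Syriac
"""

# Codes that stand for more than one Unicode script (from the list above).
# Katakana's four-letter code is Kana.
COMPOSITE = {
    "Jpan": {"Hani", "Hira", "Kana"},
    "Kore": {"Hani", "Hang"},
    "Hrkt": {"Hira", "Kana"},
}

# Codes that are stylistic variants or specialisations of another code.
# Text detection by script property alone cannot distinguish these.
VARIANT_OF = {
    "Latf": "Latn", "Latg": "Latn",
    "Syre": "Syrc", "Syrj": "Syrc", "Syrn": "Syrc",
    "Hans": "Hani", "Hant": "Hani",
}


def parse_script_aliases(text):
    """Return {long script name: four-letter code} from the 'sc' lines."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            mapping[fields[2]] = fields[1]
    return mapping


def script_code(detected_names, aliases):
    """Map a set of detected Unicode script names to one code, or None.

    Assumption: when several scripts are detected, choose the smallest
    composite code whose components include all of them.
    """
    codes = {aliases[name] for name in detected_names}
    if len(codes) == 1:
        return next(iter(codes))
    covering = [c for c, parts in COMPOSITE.items() if codes <= parts]
    return min(covering, key=lambda c: len(COMPOSITE[c]), default=None)


aliases = parse_script_aliases(SAMPLE)
print(script_code({"Javanese"}, aliases))
print(script_code({"Han", "Hiragana", "Katakana"}, aliases))
```

Under these assumptions a single detected script maps straight
through the alias table, while a mix of Han, Hiragana and Katakana
resolves to Jpan. The VARIANT_OF table is there only to show which
codes need information beyond the script property (for instance the
locale or a font hint) before they can be chosen.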

Have fun.

Richard.