declination of CLDR language names

Jukka K. Korpela jkorpela at cs.tut.fi
Tue Nov 11 02:43:58 CST 2014


2014-11-11 10:04, Amir E. Aharoni wrote:

> CLDR has lists of language names. This is useful for displaying language
> names to users in lists or menus, but less useful for displaying them in
> full sentences.

Similar considerations apply to CLDR data in general. It is primarily 
useful for lists, menus, tables, and isolated expressions, and less 
suitable for text in sentence contexts. The reason is that sentence 
contexts raise much more difficult linguistic issues, and support for 
them is also less often needed.

> For example, it can work for displaying a sentence like
> "This page is not available in $language" in English, where the language
> name is unchanged by morphology, but it won't work for Russian and many
> other languages where the language name must change according to
> grammatical case (or something along these lines).

This is a case where localization needs to involve real translation of 
text, not just construction of localized expressions using data like CLDR.

> For fun, I already started implementing a relevant automatic grammar
> transformation for language names in Russian for my project (MediaWiki),
> but before I dive too deeply into this, I wanted to ask: is there an
> existing solution for this anywhere?

There must be existing solutions, since the problem is common and needs 
to be solved somehow, but I would expect the solutions to be simplistic 
and clumsy, often leading to non-idiomatic constructs. For example, much 
of the Facebook localization uses approaches that would in this case 
mean using “This page is not available in the language $language.” 
(e.g., “Эта страница не доступна на языке английский”), which is 
understandable but sounds as artificial as it is, and is often 
ungrammatical as well. To make it grammatical, at the cost of making it 
even clumsier, you could use “This page is not available in the 
following language: $language.”

Any general solution to the problem would be very complicated, because 
in different languages, different types of inflection information are 
needed in order to use (e.g.) language names in sentence context. This 
means that a CLDR entry would need to contain structured data with a 
structure that depends on the language. A way to deal with this would be 
to represent that data as a string (say, "5*J"), to be interpreted 
according to conventions defined separately for each language by the 
language experts (e.g., “5” might be the number of a declension type and 
“*J” would indicate that special rule J is to be applied). But such 
conventions do not usually exist as well-defined sets of rules.
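
As a rough sketch of that idea in Python: everything here is hypothetical. 
The "5*J" format, the meaning of its parts, and the names 
InflectionCode and parse_inflection_code are assumptions made for 
illustration, not an existing CLDR convention. A per-language interpreter 
might first split such a string into a declension type number and a list 
of special rules:

```python
import re
from typing import NamedTuple, Tuple


class InflectionCode(NamedTuple):
    declension_type: int            # e.g. 5 in "5*J"
    special_rules: Tuple[str, ...]  # e.g. ("J",) in "5*J"


# Assumed format: a number, then zero or more "*X" special-rule markers.
_CODE_RE = re.compile(r"(\d+)((?:\*[A-Z])*)")


def parse_inflection_code(code: str) -> InflectionCode:
    """Split a code such as "5*J" into its type number and rule letters."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized inflection code: {code!r}")
    number, rules = match.groups()
    # "*J*K".split("*") yields ["", "J", "K"]; drop the empty first item.
    return InflectionCode(int(number), tuple(r for r in rules.split("*") if r))
```

What the type number and each rule letter actually mean would still have 
to be defined separately for each language, which is the hard part.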

Of course, to the extent that the inflected forms can be created from 
the base form alone, with no extra information, the problem becomes pure 
language technology (of varying difficulty by language).
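
To illustrate the easy end of that range, here is a minimal Python 
sketch for Russian, under strong assumptions: it handles only adjectival 
language names ending in "-ский" or "-цкий", and it falls back to the 
clumsy "на языке ..." construction quoted above for everything else. 
The function names are made up for this example, and real data would 
need many more cases.

```python
from typing import Optional


def russian_prepositional(name: str) -> Optional[str]:
    """Return the masculine prepositional form of an adjectival name.

    Only names ending in "-ский" or "-цкий" are handled; None is
    returned for anything else so the caller can fall back.
    """
    if name.endswith(("ский", "цкий")):
        # Replace the final "ий" with "ом": английский -> английском
        return name[:-2] + "ом"
    return None


def not_available_message(name: str) -> str:
    """Build the "not available" sentence, inflecting the name if possible."""
    inflected = russian_prepositional(name)
    if inflected is not None:
        return f"Эта страница не доступна на {inflected} языке"
    # Fallback: uninflected name after the word for "language".
    return f"Эта страница не доступна на языке {name}"
```

For "английский" this produces "... на английском языке", while a noun 
such as "иврит" takes the fallback path.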

Yucca





More information about the CLDR-Users mailing list