question about identifying CLDR coverage % for Amharic

Fri Feb 24 15:42:54 CST 2017

On Fri, 24 Feb 2017 09:17:01 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> v27 is (relatively) ancient, so I'd suggest you look at more recent
> versions, rather than start off with "It gets worse...".

> Where we have confirmation that the root collation is sufficient for
> the language, then an empty file can be added to
> http://unicode.org/repos/cldr/tags/latest/common/collation/. (We used
> to use a "validSublocales" attribute, but found that simply having
> empty files worked better, procedurally.)

> In that directory, you'll see an item for am.xml. It isn't completely
> empty: it just rearranges the Ethi script ahead of Latin.

I notice a very similar file lo.xml.  When did Laos haul up the white
flag and more or less adopt the modern Thai collation order for Lao?  I
was startled to see the following statement therein, "The root collation
order is valid for this language. Just move the native script first".
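
For anyone unfamiliar with what "just move the native script first"
amounts to, here is a toy Python sketch.  It is not how CLDR or ICU
implement reordering: testing the script by Unicode character name and
using code point order in place of root collation weights are both
assumptions made purely for illustration, and native_first_key is a
hypothetical name.

```python
import unicodedata

def native_first_key(text, name_prefix):
    """Sort key that places one script ahead of everything else.

    name_prefix is the start of the Unicode character names of the
    native script, e.g. "ETHIOPIC" or "LAO" (an illustrative shortcut).
    """
    key = []
    for ch in text:
        is_native = unicodedata.name(ch, "").startswith(name_prefix)
        # Group 0 = native script, group 1 = everything else.
        # Within a group, code point order stands in for real weights.
        key.append((0 if is_native else 1, ord(ch)))
    return key

words = ["zebra", "ሰላም", "apple", "አማርኛ"]
ordered = sorted(words, key=lambda w: native_first_key(w, "ETHIOPIC"))
print(ordered)  # the Ethiopic words now precede "apple" and "zebra"
```

A reordering of this kind only moves whole script groups; it leaves the
order within the script exactly as the root collation has it.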

The Lao collations I am acquainted with require large numbers of
contractions, as one cannot leverage the lesser significance of tone
marks; each syllable has its own primary weight.  I suspended my
development for one of the simpler systems when I discovered my tables
were much more accurate than my test data - I need to buy a different
Lao dictionary.

Now, if Laos hasn't more or less standardised on DUCET, assessing its
coverage would not be easy.  Even at the discrete level, would it have
reached the 'core' level?  Although the collation would be wrong, it
would still be usable.  This is how the task of assessment 'gets worse'.

> > There may even be cases that CLDR refuses to cover.  An example in
> > English is that CLDR refuses to handle the difference in indefinite
> > article between "a 3-page letter" and "an 8-page letter".  What
> > percentage of non-coverage would one calculate for this?

> "Refuses"?

> That is a loaded term, usually part of an accusation. Joe could
> accuse, for example, you, Richard Wordingham, of "refusing" to run a
> 4-minute mile, even though: nobody ever asked you; it probably
> wouldn't be possible; and if it were, you probably wouldn't be able
> to spend the amount of time and effort to do so; or want to, given
> your busy life.

Are you saying that this refusal in the LDML specification is not a
response to my pointing out that the English plural rules didn't handle
this subtlety?  I suppose someone else may also have stumbled over the
issue.

> The scope of CLDR is to provide a core set of locale data for
> internationalization services. It does not have as a goal the ability
> to grammatically compose messages in all of the languages it covers.
> That is a huge task that many, many people are developing
> sophisticated ML models for doing.

The 'plural rules' had already struck me as a large undertaking.  The
occurrence of the nasal mutation after some Welsh numerals seems to
vary from valley to valley, though I suspect it's even less
systematic.  And that's before one reaches the point where one has to
ask whether the numbers are being said decimally or vigesimally.

> 
> We extend the scope of CLDR periodically when we get proposals for
> doing so that are feasible, and have a high enough priority given the
> many, many items on our "todo" list
> <http://unicode.org/cldr/trac/query?status=accepted&status=design&status=new&col=id&col=summary&col=owner&col=type&col=priority&col=milestone&col=component&col=weeks&report=20&desc=1&order=id>
> (1821, currently). We were able to do so with plural rules, for
> example. And it isn't out of scope in the future for us to support
> data for doing a limited set of local-scope adjustments across
> languages, if we have a practical proposal for doing so. We haven't
> "refused" to do a/an.

UTS#35 Version 30 Part 3 Section 5
(http://unicode.org/reports/tr35/tr35-numbers.html#Language_Plural_Rules)
reads like a refusal:

"On the other hand, the above constructions are relatively rare in
messages constructed using numeric placeholders, so the disruption for
implementations currently using CLDR plural categories wouldn't be
worth the small gain."
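
To show why the a/an choice is awkward to derive from the number alone,
here is a minimal Python sketch.  Everything in it is my own assumption
for illustration and none of it comes from CLDR or LDML: the function
names are hypothetical, it covers only non-negative integers in English,
and the rule for when a number is read as "NN hundred" is a guess at
common usage.

```python
def leading_group(n, hundreds_style=False):
    """Return the number that is spoken first when n is read aloud."""
    # Assumed rule: 1100..9999 with a non-zero hundreds digit may be
    # read as "NN hundred ...", so the first spoken number is n // 100.
    if hundreds_style and 1100 <= n <= 9999 and (n // 100) % 10 != 0:
        return n // 100
    digits = str(n)
    head = len(digits) % 3 or 3   # size of the top three-digit period
    return int(digits[:head])

def starts_with_vowel_sound(group):
    if group >= 100:
        group //= 100             # "eight hundred ..." starts with 8
    # eight, eleven, eighteen, eighty..eighty-nine
    return group in (8, 11, 18) or 80 <= group <= 89

def article(n, hundreds_style=False):
    group = leading_group(n, hundreds_style)
    return "an" if starts_with_vowel_sound(group) else "a"

for n in (3, 8):
    print(f"{article(n)} {n}-page letter")
```

The hundreds_style flag is the uncomfortable part: the same digits can
take either article depending on how the reader chooses to say them.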

There is a data synchronisation issue, unfortunately.  Is 1800
"eighteen hundred" or "one thousand eight hundred"?

> If you or others are interested in contributing to CLDR, please let us
> know. (One caveat; sometimes there are practical limitations on our
> accepting contributions because the size of the contribution imposes
> too high a cost on just the assessment of it.)

I have pi_Thai word- and line-breaking rules to provide once there
is a home for them.  They're not perfect, as I don't resolve sandhi.

Richard.