adding transforms to collation
Richard Wordingham via CLDR-Users
cldr-users at unicode.org
Fri Jan 12 02:46:40 CST 2018
On Fri, 12 Jan 2018 09:51:03 +0700
Martin Hosken via CLDR-Users <cldr-users at unicode.org> wrote:
> Dear Mark,
>
> > This may just be a case where the UCA doesn't work well enough
> > without preprocessing. The standard does allow for such
> > preprocessing, and the question is how to allow for that in CLDR
> > data. One way I can think of is by allowing a transform, applied to
> > text before sorting, to be specified in the UCA description.
> > For speed, implementations would probably do that in code, but we
> > could have a data representation that could be used in a reference
> > implementation.
>
> Which is quicker? To run a transform to reorder stuff or to process a
> thousand contractions? The reason I ask is that something like this
> could probably be handled by contractions (I haven't checked), but it
> would take a lot. Burmese, for its CCVT model (à la Lao), takes around
> 600.

Contractions may be quicker for Burmese - for the most part the task is
to reorder VC to CV, and the final consonants are clearly marked as
such. I am a bit bothered that I couldn't see a transform to do a
rewrite such as VC → CV where V and C are defined by Unicode sets. It
would be helpful to have a full syntax definition in the LDML.
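
As an illustration only, the VC → CV rewrite can be sketched in Python, with regular-expression character classes standing in for Unicode sets. The ranges below (vowel signs U+102B..U+1032, consonants U+1000..U+1021, and asat U+103A marking a consonant as final) are simplified assumptions chosen for the example, not CLDR data, and the function name is hypothetical. Tone marks, medials and stacked consonants are ignored.

```python
import re

# Assumed, simplified sets for illustration only (not taken from CLDR).
VOWEL_SIGNS = "\u102b-\u1032"   # Myanmar dependent vowel signs
CONSONANTS = "\u1000-\u1021"    # Myanmar consonant letters
ASAT = "\u103a"                 # marks the preceding consonant as a final

# One or more vowel signs (V) followed by an asat-marked final consonant (C).
VC = re.compile(f"([{VOWEL_SIGNS}]+)([{CONSONANTS}]{ASAT})")


def reorder_vc_to_cv(text: str) -> str:
    """Move each marked final consonant in front of the vowel signs before it."""
    return VC.sub(r"\2\1", text)


# KA + E + AA + NGA + ASAT is rewritten as KA + NGA + ASAT + E + AA.
reordered = reorder_vc_to_cv("\u1000\u1031\u102c\u1004\u103a")
```

The result would then be passed to the collator in place of the original string. A real transform would need the full sets and would have to leave unmarked consonants alone, which is what the asat test does here.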

Lao is much more complicated, as final consonants are not tagged as
such, but are recognised by context rules that are difficult to wind
into contractions and prefix rules for the CLDR collation algorithm.

For Vietnamese, one complicating factor is that, so far as I am aware,
there isn't a full implementation of the CLDR collation algorithm that
includes tailoring rules. According to the ICU user guide, no known
language has contractions that overlap canonical decompositions -
Vietnamese (as described) was not a known language. One can work round
this problem. Instead of having contractions for <tone mark, final
consonant>, one would have contractions for <vowel, tone mark, final
consonant>. The downside is that this increases the number of
contractions by an order of magnitude.
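
A rough Python sketch of both points follows. It first shows how a tone mark sits inside the canonical decomposition of a precomposed vowel, then counts the two styles of contraction. The inventories (5 tone marks, 8 final consonants, 12 vowel letters) are assumptions for the sake of the arithmetic and are not taken from any CLDR tailoring. The counts are raw cross products that ignore any phonotactic restrictions.

```python
import unicodedata
from itertools import product

# U+1EBF (e with circumflex and acute) decomposes to e + U+0302 + U+0301,
# so a contraction starting at the tone mark U+0301 begins inside the
# canonical decomposition of a single precomposed character.
decomposed = unicodedata.normalize("NFD", "\u1ebf")

# Assumed inventories, for counting only.
TONES = ["\u0300", "\u0301", "\u0303", "\u0309", "\u0323"]
FINALS = ["c", "ch", "m", "n", "ng", "nh", "p", "t"]
VOWELS = ["a", "\u0103", "\u00e2", "e", "\u00ea", "i",
          "o", "\u00f4", "\u01a1", "u", "\u01b0", "y"]

# Contractions of the form <tone mark, final consonant>.
short_form = [t + f for t, f in product(TONES, FINALS)]

# Workaround form <vowel, tone mark, final consonant>.
long_form = [v + t + f for v, t, f in product(VOWELS, TONES, FINALS)]

ratio = len(long_form) // len(short_form)
print(len(short_form), len(long_form), ratio)
```

Under these assumptions the workaround multiplies the contraction count by the number of vowels, which is where the order of magnitude comes from.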

Richard.