adding transforms to collation

Fri Jan 12 05:07:13 CST 2018

Contractions would generally be faster than preprocessing, although the
more contractions you have with the same initial substring, the slower
they get. Large sets of contractions also slow down the non-contracted
forms, because of the extra lookup: adding "& h < ch" will slow down every
instance of collating "c". They can also slow everything down through
their memory footprint.
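
To make the extra-lookup cost concrete, here is a toy sketch in Python. It
is not how ICU stores contractions (real implementations use tries and
other compact tables); the function name, the weights, and the probe
counter are all made up for illustration. It only shows that once "ch" is a
contraction, every "c" costs a lookahead even when no "h" follows.

```python
def collation_elements(text, singles, contractions):
    """Map text to toy primary weights, preferring the longest contraction.

    Returns (weights, probes); probes counts contraction-table lookups,
    a stand-in for the extra work a real implementation would do.
    """
    # Any character that begins a contraction forces a lookahead.
    starters = {key[0] for key in contractions}
    max_len = max((len(key) for key in contractions), default=1)
    weights, probes, i = [], 0, 0
    while i < len(text):
        matched = False
        if text[i] in starters:
            # Try the longest candidate first, down to length 2.
            for length in range(min(max_len, len(text) - i), 1, -1):
                probes += 1
                chunk = text[i:i + length]
                if chunk in contractions:
                    weights.append(contractions[chunk])
                    i += length
                    matched = True
                    break
        if not matched:
            weights.append(singles[text[i]])
            i += 1
    return weights, probes


# Hypothetical weights; "& h < ch" puts "ch" just after "h".
SINGLES = {"a": 10, "c": 30, "h": 80, "t": 200}
CONTRACTIONS = {"ch": 85}
```

With these tables, "cat" needs no probes when there are no contractions,
but one probe once "ch" is added, even though "cat" contains no "ch".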

> I am a bit bothered that I couldn't see a transform to do a rewrite such
> as VC → CV where V and C are defined by Unicode sets.

I was just circulating an idea, not a fully fleshed-out approach. However,
if we used the <transform> syntax
(http://unicode.org/reports/tr35/tr35-general.html#Transforms), that permits:

(S1)(S2) → $2$1;

where S1 and S2 are Unicode sets or sequences involving Unicode sets.

Examples in view-source:
http://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-Katakana.xml
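
As a rough illustration of what a VC → CV rule of that shape does, here is
a Python regex approximation. The two character classes are hypothetical
stand-ins for the Unicode sets V and C, and Python's re.sub is not the
transform engine, so treat it as a sketch of the effect rather than of the
LDML semantics.

```python
import re

# Hypothetical sets: plain Latin vowels and consonants stand in for V and C.
V = "[aeiou]"
C = "[b-df-hj-np-tv-z]"

# Equivalent in spirit to: (V)(C) → $2$1;
SWAP_VC = re.compile(f"({V})({C})")


def swap_vc(text):
    """Rewrite each non-overlapping vowel+consonant pair as consonant+vowel."""
    return SWAP_VC.sub(r"\2\1", text)
```

For example, swap_vc("ab") gives "ba". A preprocessing step like this would
run over the text before sort keys are built.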

However, another alternative to contractions is the Context Before syntax
(http://unicode.org/reports/tr35/tr35-collation.html#Context_Before).
Context is more limited than contractions, but can be much faster and may
be applicable for Vietnamese. With it, you can change the sort order of a
later letter based on the preceding context. It may not be powerful enough
to do what people want, but here is a simple example of where it would
work.

   - I want the syllable with á to sort as a primary difference, *as a
   whole*: "can" < "cán"
   - test case: "can y" < "cán x", where the x/y difference doesn't matter.
   - But within the syllable I want the difference between a and á *not* to
   swamp later consonants.
      - test case: "cán" < "cat", where the n/t difference predominates

The following can be entered in
http://demo.icu-project.org/icu-bin/collation.html

Rules:
&t<á|t
&n<á|n

// the syntax says: if a 't' comes after an 'á', then sort it as a
primary difference from a regular t (and likewise for 'n').

Input:
cat x
can x
cát x
cán x
cat y
can y
cát y
cán y

Output:
<1 can x
<1 can y
<1 cán x
<1 cán y
<1 cat x
<1 cat y
<1 cát x
<1 cát y

Without the new rules:

<1 can x
<2 cán x
<1 can y
<2 cán y
<1 cat x
<2 cát x
<1 cat y
<2 cát y
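
For anyone without access to the demo, the two orderings above can be
reproduced with a small Python model. This is not ICU and does not parse
the rule syntax: the weights, the function name, and the "tailored" flag
are invented for the sketch. It simply gives 'n' and 't' a slightly higher
primary weight when the preceding character is 'á', and treats the acute
accent as a secondary difference.

```python
import unicodedata

# Toy primary weights: space sorts before letters; letters are spaced by 10
# so that a context-dependent weight can slot in just after 'n' or 't'.
BASE = {" ": 0}
BASE.update({ch: 10 * (i + 1) for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")})

A_ACUTE = "\u00e1"  # á
ACUTE = "\u0301"    # combining acute accent


def sort_key(text, tailored=True):
    """Return (primary weights, secondary weights) for a toy two-level sort."""
    text = unicodedata.normalize("NFC", text)
    primaries, secondaries = [], []
    prev = ""
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        weight = BASE[base]
        # Models "&t<á|t" and "&n<á|n": after 'á', t and n sort as a
        # primary difference just after the plain letter.
        if tailored and prev == A_ACUTE and base in "nt":
            weight += 1
        primaries.append(weight)
        secondaries.append(1 if ACUTE in marks else 0)
        prev = ch
    return (primaries, secondaries)


WORDS = ["cat x", "can x", "cát x", "cán x", "cat y", "can y", "cát y", "cán y"]
with_rules = sorted(WORDS, key=sort_key)
without_rules = sorted(WORDS, key=lambda w: sort_key(w, tailored=False))
```

Here with_rules comes out in the same order as the "Output" list and
without_rules in the same order as the "Without the new rules" list.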

Mark


On Fri, Jan 12, 2018 at 9:46 AM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Fri, 12 Jan 2018 09:51:03 +0700
> Martin Hosken via CLDR-Users <cldr-users at unicode.org> wrote:
>
> > Dear Mark,
> >
> > > This may just be a case where the UCA doesn't work well enough
> > > without preprocessing. The standard does allow for such
> > > preprocessing, and the question is how to allow for that in CLDR
> > > data. One way I can think of by allowing a transform for text that
> > > is applied before sorting to be specified in the UCA description.
> > > For speed, implementations would probably do that in code, but we
> > > could have a data representation that could be used in a reference
> > > implementation.
> >
> > Which is quicker? To run a transform to reorder stuff or to process a
> > thousand contractions? The reason I ask is that something like this
> > could probably be handled by contractions (I haven't checked), but it
> > would take a lot. Burmese for it's CCVT model (ala Lao) takes around
> > 600.
>
> Contractions may be quicker for Burmese - for the most part the task is
> to reorder VC to CV, and the final consonants are clearly marked as
> such.  I am a bit bothered that I couldn't see a transform to do a
> rewrite such as VC → CV where V and C are defined by Unicode sets.  It
> would be helpful to have a full syntax definition in the LDML.
>
> Lao is much more complicated, as final consonants are not tagged as
> such, but are recognised by context rules that are difficult to wind
> into contractions and prefix rules for the CLDR collation algorithm.
>
> For Vietnamese, one complicating factor is that, so far as I am aware,
> there isn't a full implementation of the CLDR collation algorithm that
> includes tailoring rules.  According to the ICU user guide, no known
> language has contractions that overlap canonical decompositions -
> Vietnamese (as described) was not a known language. One can work round
> this problem. Instead of having contractions for <tone mark, final
> consonant>, one would have contractions for <vowel, tone mark, final
> consonant>.  The downside is that this increases the number of
> contractions by an order of magnitude.
>
> Richard.