adding transforms to collation

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Fri Jan 12 13:02:33 CST 2018


On Fri, 12 Jan 2018 12:07:13 +0100
Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:

> Contractions would generally be faster than preprocessing, although
> the more contractions you have with the same initial substring, the
> slower it is. But large sets of contractions also slow down the
> non-contracted forms, because of the extra lookup. So adding "& h <
> ch" will slow down every instance of collating "c". They also can
> burden everything because of memory impact.

Unless I am missing some nasty unassimilated borrowings like *séto, I
think one only needs 65 contractions for reasonably spelt Vietnamese in
NFD - 5 tone marks × (8 final consonants + 4 glide writings (<i, o, u,
y>) + 'a' for the 3 purely vocalic diphthongs <ia>, <ưa> and <ua>).  I
believe a tone mark will need a contraction more often than not.
Problems threaten, though, when the text has been stored in NFC.
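
As a rough check of that arithmetic, the count can be enumerated in a few lines of Python. The lists of final consonants and glide letters are my assumption about which items the 8 + 4 + 1 refer to. The last lines only illustrate the NFC concern: once the text is composed, the tone mark is no longer a separate code point.

```python
import unicodedata

# The five Vietnamese tone marks as combining characters (NFD form):
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = ["\u0300", "\u0301", "\u0303", "\u0309", "\u0323"]

# Assumed inventory of what may follow the toned vowel; only the counts
# (8 + 4 + 1) are given in the text, not the lists themselves.
FINAL_CONSONANTS = ["c", "ch", "m", "n", "ng", "nh", "p", "t"]
GLIDES = ["i", "o", "u", "y"]
DIPHTHONG_A = ["a"]  # second element of <ia>, <ưa>, <ua>

followers = FINAL_CONSONANTS + GLIDES + DIPHTHONG_A
contractions = [tone + f for tone in TONE_MARKS for f in followers]
contraction_count = len(contractions)  # 5 * (8 + 4 + 1)

# NFD keeps the tone mark as its own code point; NFC fuses it into the vowel.
nfd = "ca\u0301n"                         # c, a, combining acute, n
nfc = unicodedata.normalize("NFC", nfd)   # c, U+00E1, n
```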

> > I am a bit bothered that I couldn't see a transform to do a rewrite
> > such as VC → CV where V and C are defined by Unicode sets.

> I was just circulating an idea, not a fully-fleshed out approach.
> However, if we used the <transform> syntax (
> http://unicode.org/reports/tr35/tr35-general.html#Transforms), that
> permits:
> 
> (S1)(S2) → $2$1; // where S1 and S2 are unicode sets or sequences
> involving unicode sets

Reassuring to know, but are you sure this isn't just the ICU
implementation of LDML? :-) I couldn't find the meaning of '$1' in a
transform anywhere in Section 10.3 of the LDML. 
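
For comparison, the intended VC → CV effect is easy to state with regular-expression capture groups. The sketch below is plain Python, not LDML or ICU transform syntax, and the sets V and C are placeholders chosen only for illustration.

```python
import re

# Placeholder sets standing in for the Unicode sets V and C.
V = "aeiou"
C = "bcdfghjklmnpqrstvwxyz"

VC = re.compile(f"([{V}])([{C}])")

def swap_vc(text: str) -> str:
    """Rewrite each non-overlapping vowel+consonant pair as consonant+vowel."""
    # \2\1 plays the role that $2$1 plays in the quoted rule.
    return VC.sub(r"\2\1", text)
```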

> However, another alternative to contractions is to use the
> http://unicode.org/reports/tr35/tr35-collation.html#Context_Before.
> Using context is more limited than contractions, but can be much
> faster and may be applicable for Vietnamese. With that, you can then
> change the sort order of a later letter based on previous context.
> It may not be powerful enough to do what people want to do, but here
> is a simple example of where it would work.
> 
>    - I want the syllable with á to sort as a primary difference, *as a
>    whole*: "can" < "cán"
>    - test case: "can y" < "cán x", where the x/y difference doesn't
> matter.
>    - But within the syllable I want the difference between a and á
> *not* to swamp later consonants.
>       - test case: "cán" < "cat", where the n/t difference
> predominates
> 
> The following can be entered in
> http://demo.icu-project.org/icu-bin/collation.html
> 
> Rules:
> &t<á|t
> &n<á|n
> 
> // the syntax says: if a 't' comes after an 'á', then sort it as a
> primary difference from a regular t.

I use this approach in my massive Lao collation table.  However, how do
you ensure that "caX" < "cáY" when X and Y start with punctuation?  I
relied on a feature of Lao orthography, and am not happy with doing so.
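
To make the question concrete, here is a toy primary-level model of the two quoted rules. It is not ICU and makes several assumptions of mine: weights are plain tuples, only lowercase ASCII letters, space, '-' and 'á' are handled, and punctuation is given a primary weight below the letters.

```python
# Toy primary-level model of the rules "&t<á|t" and "&n<á|n".
ORDER = " -abcdefghijklmnopqrstuvwxyz"   # assumed primary order, punctuation first
CONTEXT_SENSITIVE = {"t", "n"}            # letters tailored when preceded by 'á'

def primary_key(text: str) -> list:
    """Return a list of (position, context_bump) pairs, one per character."""
    key = []
    prev = ""
    for ch in text:
        base = "a" if ch == "á" else ch   # the accent itself is not a primary difference
        bump = 1 if (prev == "á" and ch in CONTEXT_SENSITIVE) else 0
        key.append((ORDER.index(base), bump))  # bump sorts just after the plain letter
        prev = ch
    return key

# The quoted test cases hold in this model:
ok_syllable = primary_key("can y") < primary_key("cán x")
ok_final = primary_key("cán") < primary_key("cat")

# With punctuation after the vowel no rule fires, so the x/y difference decides:
punct_wrong = primary_key("cá-x") < primary_key("ca-y")
```

In this model `punct_wrong` is true, i.e. "cá-x" sorts before "ca-y", which is the opposite of the desired "caX" < "cáY". Covering that case with context rules alone would seem to need a rule for every character that can follow the vowel.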

Richard.


