adding transforms to collation
Richard Wordingham via CLDR-Users
cldr-users at unicode.org
Fri Jan 12 13:02:33 CST 2018
On Fri, 12 Jan 2018 12:07:13 +0100
Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:
> Contractions would generally be faster than preprocessing, although
> the more contractions you have with the same initial substring, the
> slower it is. But large sets of contractions also slow down the
> non-contracted forms, because of the extra lookup. So adding "& h <
> ch" will slow down every instance of collating "c". They also can
> burden everything because of memory impact.
Unless I am missing some nasty unassimilated borrowings like *séto, I
think one only needs 65 contractions for reasonably spelt Vietnamese in
NFD - 5 tone marks × (8 final consonants + 4 glide writings (<i, o, u,
y>) + 'a' for the 3 purely vocalic diphthongs <ia>, <ưa> and <ua>). I
believe a tone mark will need a contraction more often than not.
Problems loom when the text has been stored in NFC.
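The count above can be checked with a short Python sketch. The concrete lists of tone marks, final consonants and glide letters are my assumption of what the 5, 8 and 4 refer to, and modelling each contraction as "tone mark + following letter(s)" is likewise an assumption, not something stated in the message. The second half only illustrates why NFC storage is awkward: the tone mark is no longer a separate code point that a rule could match.

```python
import unicodedata

# Assumed inventory (hypothetical; the message gives only the counts).
TONE_MARKS = ["\u0300", "\u0301", "\u0303", "\u0309", "\u0323"]  # grave, acute, tilde, hook above, dot below
FINAL_CONSONANTS = ["c", "ch", "m", "n", "ng", "nh", "p", "t"]  # 8 finals
GLIDES = ["i", "o", "u", "y"]                                    # 4 glide writings
DIPHTHONG_A = ["a"]                                              # 'a' of <ia>, <ưa>, <ua>


def candidate_contractions():
    """Return tone mark + follower strings, one per assumed contraction."""
    followers = FINAL_CONSONANTS + GLIDES + DIPHTHONG_A
    return [tone + follower for tone in TONE_MARKS for follower in followers]


def has_separate_tone(word, form):
    """True if the acute tone mark survives as its own code point in `form`."""
    return "\u0301" in unicodedata.normalize(form, word)


if __name__ == "__main__":
    print(len(candidate_contractions()))   # 5 * (8 + 4 + 1)
    print(has_separate_tone("cán", "NFD"))  # tone mark is a separate code point
    print(has_separate_tone("cán", "NFC"))  # tone mark is folded into U+00E1
```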
> > am a bit bothered that I couldn't see a transform to do a rewrite
> > such as VC → CV where V and C are defined by Unicode sets.
> I was just circulating an idea, not a fully-fleshed out approach.
> However, if we used the <transform> syntax (
> http://unicode.org/reports/tr35/tr35-general.html#Transforms), that
> permits:
>
> (S1)(S2) → $2$1; // where S1 and S2 are unicode sets or sequences
> involving unicode sets
Reassuring to know, but are you sure this isn't just the ICU
implementation of LDML? :-) I couldn't find the meaning of '$1' in a
transform anywhere in Section 10.3 of the LDML.
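For readers unfamiliar with the idea, the VC → CV rewrite can be sketched with Python regular expressions. This is only an analogue, not LDML or ICU transform syntax: the backreferences \2\1 play the role that $2$1 plays in the quoted rule, and the two character sets V and C are toy assumptions chosen purely for illustration.

```python
import re

# Hypothetical sets standing in for the Unicode sets V and C.
V = "aeiou"
C = "bcdfghjklmnpqrstvwxyz"

# Two capture groups, one per set, mirroring (S1)(S2) in the quoted rule.
VC_PATTERN = re.compile(f"([{V}])([{C}])")


def swap_vc(text):
    """Rewrite every vowel+consonant pair as consonant+vowel."""
    return VC_PATTERN.sub(r"\2\1", text)


if __name__ == "__main__":
    print(swap_vc("at"))  # the pair is swapped
    print(swap_vc("ba"))  # no vowel+consonant pair, so unchanged
```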
> However, another alternative to contractions is to use the
> http://unicode.org/reports/tr35/tr35-collation.html#Context_Before.
> Using context is more limited than contractions, but can be much
> faster and may be applicable for Vietnamese. With that, you can then
> change the sort order of a later letter based on previous context.
> It may not be powerful enough to do what people want to do, but here
> is a simple example of where it would work.
>
> - I want the syllable with á to sort as a primary difference, *as a
> whole*: "can" < "cán"
> - test case: "can y" < "cán x", where the x/y difference doesn't
> matter.
> - But within the syllable I want the difference between a and á
> *not* to swamp later consonants.
> - test case: "cán" < "cat", where the n/t difference
> predominates
>
> The following can be entered in
> http://demo.icu-project.org/icu-bin/collation.html
>
> Rules:
> &t<á|t
> &n<á|n
>
> // the syntax says: if a 't' comes after an 'á', then sort it as a
> primary difference from a regular t.
I use this approach in my massive Lao collation table. However, how do
you ensure that "caX" < "cáY" when X and Y start with punctuation? I
relied on a feature of Lao orthography, and am not happy with doing so.
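The behaviour under discussion can be imitated with a toy two-level sort key in Python. This is a rough model under stated assumptions, not ICU's algorithm: the weights are invented, accents count only at the secondary level, and CONTEXT_RULES is a hypothetical stand-in for the two quoted rules. It reproduces the three quoted test cases and also shows the punctuation gap, since no rule fires when 'á' is followed by something other than 't' or 'n'.

```python
import unicodedata

# Hypothetical stand-in for the rules "&t<á|t" and "&n<á|n":
# (previous character, current character) pairs that get a bumped primary.
CONTEXT_RULES = {("á", "t"), ("á", "n")}


def sort_key(text):
    """Return (primary weights, secondary weights) for a toy collation."""
    text = unicodedata.normalize("NFC", text.lower())
    primary, secondary = [], []
    prev = ""
    for ch in text:
        base = unicodedata.normalize("NFD", ch)[0]  # letter without accents
        if base.isalpha():
            # Even weights leave a gap right after each letter.
            weight = 1000 + 2 * (ord(base) - ord("a"))
        else:
            weight = ord(base)  # spaces and punctuation sort below letters
        if (prev, ch) in CONTEXT_RULES:
            weight += 1  # primary difference just after the plain letter
        primary.append(weight)
        secondary.append(0 if ch == base else 1)  # accent is secondary only
        prev = ch
    return (tuple(primary), tuple(secondary))


if __name__ == "__main__":
    print(sort_key("can") < sort_key("cán"))
    print(sort_key("can y") < sort_key("cán x"))
    print(sort_key("cán") < sort_key("cat"))
    print(sort_key("ca.y") < sort_key("cá.x"))  # the punctuation case
```

In this model the last comparison comes out the wrong way round: with '.' after the vowel the primaries of "ca." and "cá." are identical, so the y/x difference decides the order before the accent is ever consulted.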
Richard.