Transform rule syntax clarifications

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Sat Nov 16 15:18:00 CST 2019


Hey Kip,

I'm certainly not an expert, but I did write the current transforms
implementation
<https://github.com/twitter/twitter-cldr-rb/tree/master/lib/twitter_cldr/transforms>
in TwitterCLDR <https://github.com/twitter/twitter-cldr-rb#transliteration>,
so I think I can be of some help here.

The transform rule syntax is very similar to regular expressions, so your
intuition that "$ddot?" is interpreted as "optional $ddot" is correct. You
are also correct about the meaning of the asterisk: it should work the same
as it does in the regex world.
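
As a rough illustration of the regex analogy, here is a small Python sketch
using the standard "re" module. The variable contents (DDOT, MACRON,
ACCENT_MINUS) and the sample letters are made-up stand-ins, not the real
definitions from the CLDR transform files, and this is not how any
particular transform engine is implemented.

```python
import re

# Hypothetical stand-ins for the rule variables; the real definitions live in
# the CLDR transform files and are not reproduced here.
DDOT = "\u0308"                    # combining diaeresis
MACRON = "\u0304"                  # combining macron
ACCENT_MINUS = "[\u0300\u0301]"    # a small made-up accent class

# "$ddot?" : zero or one occurrence of $ddot.
optional_ddot = re.compile(f"i{DDOT}?x")

# "($evowel $macron $accentMinus *) i" : a capture group that ends in zero or
# more accents; the group is what "$1" refers to on the other side of the rule.
zero_or_more = re.compile(f"([eo]{MACRON}{ACCENT_MINUS}*)i")

# "($notAbove+)" : one or more occurrences, again captured as group 1.
one_or_more = re.compile(f"t({ACCENT_MINUS}+)")


def first_group(pattern, text):
    """Return capture group 1 if the whole text matches, else None."""
    match = pattern.fullmatch(text)
    return match.group(1) if match else None
```

With these stand-ins, optional_ddot matches the text both with and without
the diaeresis, zero_or_more still matches when no accent follows the macron,
and one_or_more refuses a bare "t" because "+" needs at least one mark.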

The other bits of syntax you've mentioned are from the Unicode Set
specification, which you can find in UTS #35
<https://unicode.org/reports/tr35/#Unicode_Sets>. Unicode Sets are like
regex character classes, but as you've noticed, they support a couple of
special operations that regexes don't. Specifically, the "-" operator is
the set difference of the two operands: everything in the first set that
is not in the second. UTS 35 calls this "asymmetric difference," as
opposed to the symmetric difference
<https://en.wikipedia.org/wiki/Symmetric_difference>. The "&" operator is
the set intersection of the two operands, and writing two sets next to
each other with no operator gives their union.
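
To make the three set operations concrete, here is a minimal Python sketch
that models Unicode Sets as plain Python sets of characters. It reads the
"asymmetric difference" in UTS 35 as the ordinary set difference (elements
of the first set that are not in the second). The set contents and the
SAMPLE universe are made up for illustration; real Unicode Sets range over
all code points.

```python
import unicodedata

# Made-up sample universe of characters: a letter plus a few combining marks.
SAMPLE = {"a", "\u0301", "\u0304", "\u0323", "\u0345"}


def ccc(ch):
    """Canonical combining class of a character, as in [:ccc=...:]."""
    return unicodedata.combining(ch)


# [[$accent]-[$iotasub$macron]] : difference, i.e. accents that are neither
# the iota subscript nor the macron. (Hypothetical contents for $accent.)
accent = {"\u0301", "\u0304", "\u0345"}
iotasub_macron = {"\u0345", "\u0304"}
accent_minus = accent - iotasub_macron

# [[:^ccc=0:] & [:^ccc=230:]] : intersection of "ccc is not 0" with
# "ccc is not 230", evaluated over the sample universe only.
not_ccc_0 = {c for c in SAMPLE if ccc(c) != 0}
not_ccc_230 = {c for c in SAMPLE if ccc(c) != 230}
not_above = not_ccc_0 & not_ccc_230

# [[a-c][x-z]] : two sets side by side with no operator form a union.
union = set("abc") | set("xyz")
```

In this sample, accent_minus keeps only the acute accent, and not_above
keeps the marks whose combining class is neither 0 nor 230 (the dot below
and the iota subscript).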

Hope that helps!

-Cameron

On Fri, Nov 15, 2019 at 6:19 PM Kip Cole via CLDR-Users <
cldr-users at unicode.org> wrote:

> I’m implementing the transform rules and would appreciate a few
> confirmations or corrections:
>
> Ι ($glower $ddot?) $rough → H | ι $1 ;
> The “$ddot?” is interpreted as “optional $ddot”
> in the usual regex meaning
>
> $accent_minus = [[$accent]-[$iotasub$macron]];
> The “[[..]-[..]]” is regex character set negation?
>
> $notAbove = [[:^ccc=0:] & [:^ccc=230:]];
> The “[[..]&[..]]” is regex character set intersection?
>
> | $1 $iotasub ← ($evowel $macron $accentMinus *) i ;
> The “*” here is “zero or more times $accentMinus” in the
> usual regex meaning? And “$1” is the capture result in the usual regex
> meaning too?
>
> t ($notAbove+) ̈ ; # ARABIC LETTER TEH MARBUTA
> The “+” is the usual regex meaning of “one or more times”
>
> Many thanks, —Kip