Transform rule syntax clarifications

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Sat Nov 16 15:18:43 CST 2019


By the way, what language are you writing your implementation in? Is it
open-source? Would love to take a look if possible :)

-Cameron

On Sat, Nov 16, 2019 at 1:18 PM Cameron Dutro <cameron at lumoslabs.com> wrote:

> Hey Kip,
>
> I'm certainly not an expert, but I did write the current transforms
> implementation
> <https://github.com/twitter/twitter-cldr-rb/tree/master/lib/twitter_cldr/transforms>
> in TwitterCLDR
> <https://github.com/twitter/twitter-cldr-rb#transliteration>, so I think
> I can be of some help here.
>
> The transform rules syntax is very similar to regular expressions, so your
> intuition that "$ddot?" is interpreted as "optional $ddot" is correct. You
> are also correct about the asterisk: it should work the same as it does in
> the regex world.
>
> The other bits of syntax you've mentioned are from the Unicode Set
> specification, which you can find in UTS #35
> <https://unicode.org/reports/tr35/#Unicode_Sets>. Unicode Sets are like
> regex character classes, but as you've noticed, there are a couple of
> special operations they support that regexes don't. Specifically, the "-"
> operator is the set difference
> <https://en.wikipedia.org/wiki/Complement_(set_theory)> of the two
> operands, i.e. the members of the left-hand set that are not in the
> right-hand set (UTS 35 calls this the "asymmetric difference"). The "&"
> operator is the set intersection of the two operands, and no operator is
> their union.
>
> Hope that helps!
>
> -Cameron
>
> On Fri, Nov 15, 2019 at 6:19 PM Kip Cole via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> I’m implementing the transform rules and would appreciate a few
>> confirmations or corrections:
>>
>> Ι ($glower $ddot?) $rough → H | ι $1 ;
>> The "$ddot?" is interpreted as "optional $ddot"
>> in the usual regex meaning?
>>
>> $accent_minus = [[$accent]-[$iotasub$macron]];
>> The "[[..]-[..]]" is regex character set negation?
>>
>> $notAbove = [[:^ccc=0:] & [:^ccc=230:]];
>> The "[[..]&[..]]" is regex character set intersection?
>>
>> | $1 $iotasub ← ($evowel $macron $accentMinus *) i ;
>> That the "*" here is "zero or more times $accentMinus" in the
>> usual regex meaning? And "$1" is the capture result in the usual regex
>> meaning too?
>>
>> t ($notAbove+) ̈ ; # ARABIC LETTER TEH MARBUTA
>> The "+" is the usual regex meaning of "one or more times"?
>>
>> Many thanks, —Kip
>>
>
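
For anyone following along, the three Unicode Set operators discussed in
the quoted reply can be approximated with ordinary Python sets. This is
only a rough sketch: the contents of accent, iotasub and macron below are
assumed for illustration and are not the real CLDR variable definitions,
and not_above is a hypothetical helper that checks the canonical combining
class through the standard unicodedata module instead of building the full
set.

```python
import unicodedata

# Assumed stand-ins for the CLDR variables; the real definitions may differ.
accent = {"\u0300", "\u0301", "\u0304", "\u0342", "\u0345"}
iotasub = {"\u0345"}  # COMBINING GREEK YPOGEGRAMMENI
macron = {"\u0304"}   # COMBINING MACRON

# [[$accent]-[$iotasub$macron]]
# Juxtaposition inside the brackets is a union; "-" removes from the
# left-hand set everything that is in the right-hand set.
accent_minus = accent - (iotasub | macron)

# Difference is not the same as symmetric difference:
left, right = {"a", "b"}, {"b", "c"}
only_left = left - right   # {"a"}
either_one = left ^ right  # {"a", "c"}


def not_above(ch: str) -> bool:
    """Membership test for [[:^ccc=0:] & [:^ccc=230:]].

    "&" is intersection, so a character must satisfy both conditions:
    its combining class is neither 0 nor 230.
    """
    ccc = unicodedata.combining(ch)
    return ccc != 0 and ccc != 230
```

With these assumed values, accent_minus keeps only U+0300, U+0301 and
U+0342. not_above is true for U+0323 COMBINING DOT BELOW (ccc 220) and
false for U+0301 COMBINING ACUTE ACCENT (ccc 230) and for the plain
letter "a" (ccc 0).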