Transform resolution and before context matches

Mark Davis ☕️ mark at macchiato.com
Mon Mar 29 13:21:25 CDT 2021


Kip, would you mind filing a ticket on this, so that we can track it?

Mark


On Mon, Mar 29, 2021 at 7:27 AM Mark Davis ☕️ <mark at macchiato.com> wrote:

> Thanks for your message. There is more information in
> https://unicode-org.github.io/icu/userguide/transforms/general/ that
> should be incorporated into the LDML section.
>
> As to your particular points, I have some answers below, but I can follow
> up with details of the edge cases when I have more time.
>
> Mark
>
>
> On Mon, Mar 29, 2021 at 6:58 AM Kip Cole via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> I’m now implementing CLDR transforms and would appreciate some
>> understanding of the following two items:
>>
>> 1. Resolving the correct transform from “Any-Latin”. For example,
>> “de-Latin” has a transform rule “Any-Latin” but such a transform doesn’t
>> exist in the repo. So I presume an appropriate transform has to be
>> resolved. Reading the inheritance rules isn’t helping me. So using this
>> example, how does one resolve the correct transform for “Any-Latin”?
>>
>
> There are special inheritance rules for transforms with locales.
>
>    - "Any" is a special identifier: the text is broken into script runs,
>    and within each run "Any" is replaced by the script of that run.
>    - If there is no transform for a language, the fallback is language =>
>    script. The fallback forms a 'ladder' between the source and the target.
>
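A rough sketch of the script-run step described above, in Python. It only illustrates replacing "Any" with the script of each run; it does not model the language => script fallback ladder. The helper names (`rough_script`, `script_runs`, `resolve_any`) are hypothetical, and the script of a character is approximated from the first word of its Unicode name because the Python standard library does not expose the Script property. Attaching script-neutral characters (spaces, punctuation) to the preceding run is also an assumption of this sketch.

```python
import unicodedata


def rough_script(ch):
    """Approximate a character's script from the first word of its Unicode name.

    Assumption: this stands in for the real Unicode Script property.
    Non-letters are treated as script-neutral (None).
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0].capitalize() if name else None


def script_runs(text):
    """Split text into (script, substring) runs; neutral characters join the current run."""
    runs = []
    for ch in text:
        script = rough_script(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] is None:
            # A leading neutral run adopts the script of the first real letter.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, run) for script, run in runs]


def resolve_any(transform_id, text):
    """For an 'Any-X' ID, return (per-run transform ID, run) pairs.

    The ID is None for runs that are neutral or already in the target script.
    """
    source, target = transform_id.split("-", 1)
    if source != "Any":
        raise ValueError("expected an ID of the form Any-<target>")
    resolved = []
    for script, run in script_runs(text):
        run_id = None if script in (None, target) else f"{script}-{target}"
        resolved.append((run_id, run))
    return resolved


print(resolve_any("Any-Latin", "Αθήνα Москва"))
```

Each resulting ID (for example "Greek-Latin") would then be looked up as an ordinary transform, applying whatever fallback the implementation supports.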
>
>> 2. I’m not sure how to interpret the Unicode regular expression
>> “[[:Z:][:Ps:][:Pi:]$]” when it’s in a “before context”, as it is in
>> “Any-Publishing.xml”. Specifically, where does the “$” anchor?
>>
>>   (a) Does “$” in this case mean matching the character just before the
>> insertion point? Or does it mean matching an end-of-line at the insertion
>> point? Or something else?
>>
>
> It means "off the end of the string". So it is like ^ or $ in regular
> expressions.
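One way to read this answer for a before context consisting of the single set [[:Z:][:Ps:][:Pi:]$], sketched in Python. The function names are hypothetical, and the General_Category lookup via `unicodedata.category` is a stand-in for a real UnicodeSet implementation: the context matches either when the preceding character is in the set, or when there is no preceding character at all (the "$" case).

```python
import unicodedata


def in_set(ch):
    """True if ch is in [[:Z:][:Ps:][:Pi:]], judged by General_Category."""
    category = unicodedata.category(ch)
    return category.startswith("Z") or category in ("Ps", "Pi")


def before_context_ok(text, pos):
    """True if the one-element before context matches just before pos."""
    if pos == 0:
        # The "$" alternative: we are off the end (here, the start) of the string.
        return True
    return in_set(text[pos - 1])


print(before_context_ok('"hi"', 0))      # start of string
print(before_context_ok('say "hi"', 4))  # preceded by a space
print(before_context_ok('say"hi"', 3))   # preceded by a letter
```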
>
>>
>>   (b) For the majority of “before context” matches, which don’t have any
>> anchors in them (“$” or “^”), is the intent that the match aligns to the
>> text immediately before the insertion point (i.e., with an implied “$”
>> ending at the insertion point)? Or is it intended to match anywhere in the
>> prior context from the beginning of the string? (That would seem strange,
>> but TR35 doesn’t seem to explain the correct interpretation and TR18 is
>> silent on the topic.)
>
>
> It is immediately before.
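A small Python illustration of "immediately before", using Python regular expressions as a stand-in for transform rule syntax (an assumption; the two syntaxes differ). The hypothetical helper `before_context_matches` forces the match to end exactly at the given position by searching only the prefix and anchoring with `\Z`.

```python
import re


def before_context_matches(pattern, text, pos):
    """Return True if pattern matches text ending exactly at pos."""
    # Searching text[:pos] with a trailing \Z means the match must end at pos;
    # an occurrence earlier in the prefix does not count.
    return re.search(r"(?:%s)\Z" % pattern, text[:pos]) is not None


print(before_context_matches("ab", "xabc", 3))  # "ab" ends right at pos 3
print(before_context_matches("ab", "abxc", 3))  # "ab" occurs earlier, not adjacent
```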
>
>>
>>
>> As always, thanks for the insight and assistance,
>>
>> —Kip
>>
>>
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at corp.unicode.org
>> https://corp.unicode.org/mailman/listinfo/cldr-users
>>
>