Unicode Regex Question

Cameron Dutro cameron at lumoslabs.com
Thu Jan 1 13:02:39 CST 2015


Thanks very much Mark for that additional documentation, and thanks Nick
for filing the ticket :)

-Cameron

On Wed, Dec 31, 2014 at 11:18 AM, Steven R. Loomis <srl at icu-project.org>
wrote:

> Philippe, Mark:
> Transliterators seem to be in ICU 1.8, so 1999, which is 15 (almost 16)
> years ago.
>
> S
>
> Sent from our iPhone.
>
> On Dec 31, 2014, at 2:51 AM, Mark Davis <mark at macchiato.com> wrote:
>
>> No the way it is written is really a literal $ or a or b or a Greek
>> character.
>
> Philippe, you are once again not listening.
> The $ in CLDR transforms is NOT the same as $ in regex.
> I do know what I'm talking about here: Alan Liu and I designed this
> (though years ago).
>
> Now, there is a defect in the LDML documentation, in that the $ is not
> described fully. For that, people can look at the ICU documentation (from
> which LDML gets the transform syntax):
>
> http://userguide.icu-project.org/transforms/general/rules#TOC-ther
>
> Cameron, would you mind filing a CLDR ticket to update and expand the
> documentation?
>
>
> Mark <https://google.com/+MarkDavis>
>
> *— Il meglio è l’inimico del bene —*
>
> On Wed, Dec 31, 2014 at 11:02 AM, Philippe Verdy <verdy_p at wanadoo.fr>
> wrote:
>
>> No, the way it is written, it is really a literal $ or a or b or a Greek
>> character.
>> And yes, you used a notation embedding two character classes within
>> another character class to create a union. However, $ (if it means an end
>> of string) cannot be part of that union and cannot even be part of a
>> character class, as it is then not a character itself but a boundary
>> condition.
>>
>> So yes, your extension is very confusing (in addition to being incoherent
>> and not general enough to handle various boundary conditions).
>>
>> TL;DR: it was another proposal making a BETTER use of the $ for something
>> else more productive, and about how regexps can be embedded into a special
>> syntax allowing the definition of any custom boundary conditions, including
>> end of strings or other boundaries (and also not limited to properties
>> defined in the UCD). It is a generalisation of the concept, which
>> will be used everywhere Unicode properties are not sufficient, and without
>> necessarily needing the addition of new properties to handle specific
>> locales (for example, these boundaries could be used in CLDR data instead
>> of the UCD, or in specific locales not supported by CLDR).
>>
>>
>> 2014-12-31 10:27 GMT+01:00 Mark Davis <mark at macchiato.com>:
>>
>>>
>>> On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy <verdy_p at wanadoo.fr>
>>> wrote:
>>>
>>>> Your example with "[[a$b][:script=greek:]]" does not make any sense if
>>>> that $ means an "end of string" and it is embedded in a character
>>>> class that is itself nested in another character class.
>>>>
>>>
>>> That is incorrect. The way the transform works, any reference to a
>>> character position outside the bounds of a string matches $. So what I
>>> wrote matches the start or end of a string, or a, or b, or any Greek-script
>>> character.
>>>
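The behaviour described above can be illustrated with a small sketch in
plain Python. It does not use ICU or any CLDR API; it only emulates the
stated rule that a position outside the bounds of the string matches $.
The names OUT_OF_BOUNDS, char_at, is_greek, matches_set and
regex_set_matches are hypothetical, and the Greek check is a rough
stand-in for [:script=greek:] based on character names. How a literal
dollar sign would be written in a transform set is not covered in this
thread, so the sketch does not handle it.

```python
import re
import unicodedata

# Hypothetical sentinel standing in for "a position outside the string".
OUT_OF_BOUNDS = object()


def char_at(text, index):
    """Return the character at index, or the sentinel when out of bounds."""
    if 0 <= index < len(text):
        return text[index]
    return OUT_OF_BOUNDS


def is_greek(ch):
    """Rough approximation of [:script=greek:] using the character name."""
    return unicodedata.name(ch, "").startswith("GREEK")


def matches_set(text, index):
    """Emulate [[a$b][:script=greek:]] under the transform reading of $."""
    ch = char_at(text, index)
    if ch is OUT_OF_BOUNDS:
        # The $ member matches any position before the start or past the end.
        return True
    return ch in "ab" or is_greek(ch)


def regex_set_matches(ch):
    """For contrast: in Python's re module, $ inside [...] is a literal."""
    return re.fullmatch(r"[a$b]", ch) is not None
```

With this sketch, matches_set("xa", -1) and matches_set("xa", 2) both
succeed because the positions lie outside the string, while
matches_set("xa", 0) fails because "x" is not in the set. By contrast,
regex_set_matches("$") succeeds and regex_set_matches("") fails, since a
regex character class treats $ as an ordinary character and always
consumes one character.
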
>>> However, if you look at the transform data files, you'll see real cases
>>> where $ is used, rather than the artificial one I used.
>>>
>>> As to the rest of your post, tl;dr.
>>>
>>> Mark <https://google.com/+MarkDavis>
>>>
>>> *— Il meglio è l’inimico del bene —*
>>>
>>
>>