Unicode Regex Question

Steven R. Loomis srl at icu-project.org
Wed Dec 31 13:18:00 CST 2014


Philippe, Mark: 
 Transliterators seem to have been in ICU since 1.8, so since 1999: 15, almost 16, years ago.

S

Sent from our iPhone.

> On Dec 31, 2014, at 2:51 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:
> 
> > No, the way it is written it is really a literal $, or a, or b, or a Greek character.
> 
> Philippe, you are once again not listening. The $ in CLDR transforms is NOT the same as $ in regex. I do know what I'm talking about here: Alan Liu and I designed this (though years ago).
> 
> Now, there is a defect in the LDML documentation, in that the $ is not described fully. For that, people can look at the ICU documentation (from which LDML gets the transform syntax):
> 
> http://userguide.icu-project.org/transforms/general/rules#TOC-ther
> 
> Cameron, would you mind filing a CLDR ticket to update and expand the documentation?
> 
> 
> Mark
> 
> — Il meglio è l’inimico del bene —
> 
>> On Wed, Dec 31, 2014 at 11:02 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>> No, the way it is written it is really a literal $, or a, or b, or a Greek character.
>> And yes, you used a notation embedding two character classes within another character class to create a union. However, $ (if it means an end of string) cannot be part of that union and cannot even be part of a character class, as it is then not a character itself but a boundary condition.
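
For comparison, the conventional regex reading described above can be checked with a small sketch. It uses Python's standard `re` module, which is an assumption made for illustration only: the thread does not name a regex engine, and this says nothing about how CLDR/ICU transform rules treat `$`. The helper names are hypothetical.

```python
import re

# In Python's re module, "$" inside a character class is a literal dollar sign.
CLASS = re.compile(r"[a$b]")


def class_matches(ch: str) -> bool:
    """True if the single character ch is a member of the class [a$b]."""
    return CLASS.fullmatch(ch) is not None


def matches_at_end(text: str) -> bool:
    """True if the class can match at the end-of-string position of text."""
    return CLASS.match(text, len(text)) is not None


print(class_matches("$"))     # True: "$" is treated as a literal here
print(class_matches("a"))     # True
print(matches_at_end("xyz"))  # False: a class must consume one character
```

Outside a character class, `$` in the same engine is an anchor (for example, `re.search(r"a$", "ba")` matches), which is the distinction this part of the thread turns on.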
>> 
>> So yes, your extension is very confusing (in addition to being incoherent and not general enough to handle various boundary conditions).
>> 
>> TL;DR: it was another proposal making a BETTER use of the $ for something more productive, about how a regexp can be embedded into a special syntax allowing any custom boundary condition to be defined, including end of string or other boundaries (and not limited to properties defined in the UCD). It is a generalisation of the concept, which will be used everywhere Unicode properties are not sufficient, without necessarily needing new properties to be added to handle specific locales (for example, these boundaries could be defined in CLDR data instead of the UCD, or in specific locales not supported by CLDR).
>> 
>> 
>> 2014-12-31 10:27 GMT+01:00 Mark Davis ☕️ <mark at macchiato.com>:
>>> 
>>>> On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>>> Your example with "[[a$b][:script=greek:]]" does not make any sense if that $ means "end of string" and is embedded in a character class that is itself nested in another character class.
>>> 
>>> That is incorrect. The way the transform works, any reference to a character position outside the bounds of a string matches $. So what I wrote matches the start or end of a string, or a, or b, or any Greek-script character.
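
The behaviour described in the paragraph above can be modelled with a toy sketch. This is not ICU's implementation and does not use the ICU API: `BOUNDARY`, `char_at`, `is_greek`, and `in_set` are hypothetical names, and the Greek test is only a rough stand-in for `[:script=greek:]`, since Python's standard library exposes no script property.

```python
import unicodedata

# Hypothetical sentinel standing for "a position outside the string".
BOUNDARY = object()


def char_at(text, i):
    """Return the character at index i, or BOUNDARY if i is out of bounds."""
    return text[i] if 0 <= i < len(text) else BOUNDARY


def is_greek(ch):
    """Rough approximation of [:script=greek:] based on the character name."""
    return "GREEK" in unicodedata.name(ch, "")


def in_set(item):
    """Toy model of [[a$b][:script=greek:]] as described in the message."""
    if item is BOUNDARY:
        return True  # the "$" member: any out-of-bounds position
    return item in ("a", "b") or is_greek(item)


text = "αx"
print(in_set(char_at(text, -1)))         # True: before the start of the string
print(in_set(char_at(text, len(text))))  # True: past the end of the string
print(in_set(char_at(text, 0)))          # True: Greek letter alpha
print(in_set(char_at(text, 1)))          # False: "x" is not in the set
```

Under this model, `$` behaves like one extra set member that is "found" at every out-of-bounds position, so a single set can cover both string edges and ordinary characters.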
>>> 
>>> However, if you look at the transform data files, you'll see real cases where $ is used, rather than the artificial one I used.
>>> 
>>> As to the rest of your post, tl;dr.
>>> 
>>> Mark
>>> 
>>> — Il meglio è l’inimico del bene —
> 
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users