Unicode Regex Question

Mark Davis ☕️ mark at macchiato.com
Wed Dec 31 04:51:54 CST 2014


​​
​>
> No, the way it is written it is really a literal $ or a or b or a Greek
> character.

Philippe, you are once again not listening. The $ in CLDR transforms is NOT
the same as $ in regex. I do know what I'm talking about here: Alan Liu and I
designed this (though years ago).
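
To make the difference concrete, here is a minimal sketch in Python of the
behaviour described in my earlier message quoted below: any position outside
the bounds of the string matches $, so $ can sit in a set next to ordinary
characters. This is only an illustration, not the ICU implementation. The
names OUT_OF_BOUNDS, char_at, is_greek and matches_set are made up for the
example, and the Greek test is a rough approximation based on character names
rather than the real Script property.

```python
import unicodedata

# Hypothetical sentinel standing in for "a position outside the string".
OUT_OF_BOUNDS = object()


def char_at(text, index):
    """Return the character at index, or the sentinel if index is out of bounds."""
    if 0 <= index < len(text):
        return text[index]
    return OUT_OF_BOUNDS


def is_greek(ch):
    """Rough stand-in for [:script=greek:] (assumption: name-based check only)."""
    return "GREEK" in unicodedata.name(ch, "")


def matches_set(text, index):
    """Emulate the set [[a$b][:script=greek:]] at one position of text."""
    ch = char_at(text, index)
    if ch is OUT_OF_BOUNDS:
        # This is the $ member of the set: before the start or after the end.
        return True
    return ch in ("a", "b") or is_greek(ch)


if __name__ == "__main__":
    sample = "xaα"
    for i in range(-1, len(sample) + 1):
        print(i, matches_set(sample, i))
```

In this model a literal dollar sign in the text does not match the set, which
is the point: $ here stands for the string boundary, not for the character.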

Now, there is a defect in the LDML documentation, in that the $ is not
described fully. For that, people can look at the ICU documentation (from
which LDML gets the transform syntax):

http://userguide.icu-project.org/transforms/general/rules#TOC-ther

Cameron, would you mind filing a CLDR ticket to update and expand the
documentation?


Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Wed, Dec 31, 2014 at 11:02 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> No, the way it is written it is really a literal $ or a or b or a Greek
> character.
> And yes, you used a notation embedding two character classes within another
> character class to create a union. However, $ (if it means an end of
> string) cannot be part of that union and cannot even be part of a character
> class, as it is then not a character itself but a boundary condition.
>
> So yes, your extension is very confusing (in addition to being incoherent
> and not general enough to handle various boundary conditions).
>
> TL;DR: it was another proposal making a BETTER use of the $ for something
> else more productive, and about how regexps can be embedded into a special
> syntax allowing any custom boundary conditions to be defined, including end
> of strings or other boundaries (and not limited to properties defined in
> the UCD). It is a generalisation of the concept, which will be used
> everywhere Unicode properties are not sufficient, and without necessarily
> needing the addition of new properties to handle specific locales (for
> example, these boundaries could be used in CLDR data instead of the UCD,
> or in specific locales not supported by CLDR).
>
>
> 2014-12-31 10:27 GMT+01:00 Mark Davis ☕️ <mark at macchiato.com>:
>
>>
>> On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy <verdy_p at wanadoo.fr>
>> wrote:
>>
>>> Your example with "[[a$b][:script=greek:]]" does not make any sense if
>>> that $ means an "end of string" and where it is embedded in a character
>>> class itself in another embedding character-class.
>>>
>>
>> That is incorrect. The way the transform works, any reference to a
>> character position outside the bounds of a string matches $. So what I
>> wrote matches the start or end of a string, or a, or b, or any Greek-script
>> character.
>>
>> However, if you look at the transform data files, you'll see real cases
>> where $ is used, rather than the artificial one I used.
>>
>> As to the rest of your post, tl;dr.
>>
>> Mark <https://google.com/+MarkDavis>
>>
>> *— Il meglio è l’inimico del bene —*
>>
>
>


More information about the CLDR-Users mailing list