Unicode Regex Question

Cameron Dutro cameron at lumoslabs.com
Tue Dec 30 20:35:44 CST 2014


Thanks Philippe, the [:eos:] pseudo-class looks much less ambiguous than
the "$" character, and thanks for your thorough writeup. What would the
process be for getting your change reviewed/accepted?

-Cameron

On Tue, Dec 30, 2014 at 4:40 PM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> I do agree. The $ is just a common shortcut that represents a condition
> which could also be given more explicitly, with a named pseudo-class.
>
> Your example with "[[a$b][:script=greek:]]" does not make any sense if
> that $ means "end of string" while it is embedded in a character class
> that is itself nested in another character class.
>
> I would rather expect something like "[[ab][:script=greek:]]|$" or
> "[[ab][:script=greek:][:eos:]]", where [:eos:] matches no character but
> succeeds only at end of string (also at start of string? If so, it mixes
> two different kinds of matching: a precontext and a postcontext).
>
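For comparison, here is a minimal sketch of how the same distinction looks in Python's re module (not the CLDR transform syntax). It assumes we want "one of these characters, or end of string"; the helper name follows_x is made up for illustration. Inside a Python character class "$" is only a literal, so the end-of-string condition has to sit outside the class as an alternative.

```python
import re

# In Python's re module, "$" inside a character class is a literal dollar sign.
literal_class = re.compile(r"[a$b]")

# One way to say "a, b, or end of string": alternation with a zero-width anchor.
# \Z matches only at the very end of the string.
char_or_eos = re.compile(r"[ab]|\Z")


def follows_x(text):
    """For each 'x' followed by a, b, or end of string, return what followed.

    An empty string in the result means the 'x' sat at the end of the text.
    (Hypothetical helper, only for illustration.)
    """
    return [m.group(1) for m in re.finditer(r"x([ab]|\Z)", text)]
```

With this, follows_x("xa xc x") keeps the first and last "x" and skips the one followed by "c", which is roughly what an [:eos:] member of the set would be expected to do.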
> Regexps should also have a better and more explicit notation for
> precontexts and postcontexts (including with non-empty matched contents).
>
> And the beginning/end of strings/texts is not the only boundary needed in
> those pre/post-contexts; there are other interesting ones, notably
> start/end of words (depending on locale-sensitive definitions of word
> boundaries), or even start/end of sentences (here also locale-dependent).
>
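As a point of reference, existing regex engines already offer a limited form of pre/post-contexts through lookaround assertions. The sketch below uses Python's re module; the pattern names and the find_all helper are illustrative assumptions. Note that "\b" here is defined in terms of "\w" and is not the locale-sensitive word boundary described above.

```python
import re

# Precontext: (?<=...) requires text before the match without consuming it.
# Postcontext: (?=...) requires text after the match without consuming it.
digits_between_brackets = re.compile(r"(?<=\[)\d+(?=\])")

# \b is a zero-width word boundary derived from \w, not from locale rules.
whole_word = re.compile(r"\bcat\b")


def find_all(pattern, text):
    """Return the matched text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]
```

The brackets and the neighbouring letters are tested but never become part of the matched content, which is the behaviour a more explicit pre/post-context notation would presumably generalize.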
> Possibly we could also have regexps defining their own custom boundary
> conditions, assigned locally with a symbolic name (internally the regexp
> engine would parse all those conditions in parallel with the main parsing,
> to determine when each one raises its condition flag to true).
>
> E.g. [:^hex=[0-9a-f]+] (where internally the defined "hex" boundary is a
> normal regexp matched greedily in both directions, backward and forward).
> Then we could reuse that condition in several other places of the regexp
> with "[:^hex]"
> (if they are reused, they are still matched separately at distinct
> positions and do not have their matched content equal in each instance).
>
> With a similar system we could also define named subregexps such as
> [:$hex=[0-9a-f]+] defining the "[:$hex]" subregexp.
>
> The previous custom boundary could also be defined as [:^hex=[:$hex]].
> Here we see that this "$" would be more useful and would not imply any "end
> of string" meaning (but it would deviate from legacy regexps where this $
> is taken literally in character classes). In that last example the custom
> boundary and the defined subregexp are given the same "hex" name separately.
>
> But we could also say that they automatically share the same namespace, so
> that any defined subregexp would also be the name of a defined boundary. To
> use them, the first character ($ or ^) after "[:" tells whether we mean an
> expansion of a subregexp whose matched content will be part of the outer
> matched content, or a condition matched internally only, with a
> testable flag but not included in the outer matched content.
>
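To make the "$" versus "^" distinction concrete, here is a rough sketch that rewrites the proposed references into Python re syntax. Everything in it is an assumption for illustration: the DEFINITIONS table, the expand function, and the mapping of "[:$name]" to a non-capturing group and "[:^name]" to a lookahead. Only the forward direction is modeled, since Python lookbehinds must be fixed-width and cannot express the backward greedy match.

```python
import re

# Hypothetical shared namespace: one name usable as subregexp or boundary.
DEFINITIONS = {"hex": r"[0-9a-f]+"}

# Matches references of the proposed form [:$name] or [:^name].
_REFERENCE = re.compile(r"\[:([$^])(\w+)\]")


def expand(pattern, definitions=DEFINITIONS):
    """Translate [:$name] / [:^name] references into Python re syntax."""
    def substitute(match):
        kind, name = match.group(1), match.group(2)
        body = definitions[name]
        if kind == "$":
            # Expansion: the matched content is part of the outer match.
            return "(?:" + body + ")"
        # Condition: tested at this position, but zero-width.
        return "(?=" + body + ")"
    return _REFERENCE.sub(substitute, pattern)
```

For example, re.search(expand("0x[:$hex]"), "0xff;") consumes the hex digits, while re.search(expand("#[:^hex]"), "#ff") matches only the "#" and merely checks that hex digits follow.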
> Then the start/end of string condition is nothing else than the evaluation
> of the custom boundary condition "[:^eos]", definable as "[:^eos=.*]" and
> matched greedily by default; also here I'm assuming that "." matches any
> character, including newlines if they are part of the content of a
> "string", otherwise you'll need to define the "eos" custom boundary as
> "[:^eos=([\r\n]|.)*]"
>
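The caveat about "." and newlines can be checked in an existing engine. This small sketch uses Python's re module, where "." excludes "\n" unless the DOTALL flag is set; the sample text and variable names are arbitrary.

```python
import re

text = "line one\r\nline two"

# By default "." does not match "\n", so ".*" cannot span the whole text.
default_dot = re.fullmatch(r".*", text)

# With DOTALL, "." matches every character, including newlines.
dotall = re.fullmatch(r".*", text, re.DOTALL)

# The explicit alternative from the message works without any flag.
explicit = re.fullmatch(r"(?:[\r\n]|.)*", text)
```

Here default_dot is None, while the other two cover the entire text, so an "eos" boundary defined with a bare ".*" would depend on how the engine treats ".".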
>
>
>
>
> 2014-12-31 0:26 GMT+01:00 Cameron Dutro <cameron at lumoslabs.com>:
>
>> Also, would it be fair to say simply removing the outer set of square
>> brackets and treating the entire thing as a regex is correct? It doesn't
>> make sense to me to have these transform rules be "almost" regexes except
>> for this one "$" exception, especially given "$"'s special significance in
>> regexes.
>>
>> -Cameron
>>
>> On Tue, Dec 30, 2014 at 3:22 PM, Cameron Dutro <cameron at lumoslabs.com>
>> wrote:
>>
>>> Thanks Mark. Is that documented anywhere?
>>>
>>> -Cameron
>>>
>>> On Tue, Dec 30, 2014 at 11:40 AM, Mark Davis <mark at macchiato.com>
>>> wrote:
>>>
>>>> $ has a special meaning in the transforms; it means the end of string
>>>> (either end). Unlike normal regex, however, it can occur in character
>>>> classes, eg [[a$b][:script=greek:]]
>>>>
>>>>
>>>> Mark <https://google.com/+MarkDavis>
>>>>
>>>> *— Il meglio è l’inimico del bene —*
>>>>
>>>> On Tue, Dec 30, 2014 at 8:21 PM, Cameron Dutro <cameron at lumoslabs.com>
>>>> wrote:
>>>>
>>>>> Hey cldr-users,
>>>>>
>>>>> I'm looking at this entry
>>>>> <http://unicode.org/cldr/trac/browser/trunk/common/transforms/Any-Publishing.xml#L21>
>>>>> in CLDR transforms. I'm curious why that "$" character is inside the
>>>>> character class. Here's the line reproduced:
>>>>>
>>>>> <tRule>$makeRight = [[:Z:][:Ps:][:Pi:]$] ;</tRule>
>>>>>
>>>>> I see an outer character class that contains three internal Unicode
>>>>> character sets and a literal dollar sign. Usually in regular expressions,
>>>>> the dollar sign is used to match the end of the string. When it's included
>>>>> in a character class however, it should be interpreted as a literal
>>>>> character.
>>>>>
>>>>> Was including the dollar sign in the character class intentional?
>>>>> Should it be treated as an end-of-string anchor or a literal character?
>>>>>
>>>>> -Cameron
>>>>>
>>>>> _______________________________________________
>>>>> CLDR-Users mailing list
>>>>> CLDR-Users at unicode.org
>>>>> http://unicode.org/mailman/listinfo/cldr-users
>>>>>
>>>>>
>>>>
>>>
>>
>