Unicode Regex Question

Philippe Verdy verdy_p at wanadoo.fr
Tue Dec 30 18:40:28 CST 2014


I do agree. The $ is just a common shortcut that represents a condition
which could also be given more explicitly, with a named pseudo-class.

Your example with "[[a$b][:script=greek:]]" does not make any sense if that
$ means "end of string" and it is embedded in a character class that is
itself nested in another character class.

I would rather expect something like "[[ab][:script=greek:]]|$" or
"[[ab][:script=greek:][:eos:]]", where [:eos:] matches no character but only
succeeds at end of string (also at start of string? If so, it mixes two
different kinds of matching: a precontext instead of a postcontext).
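
For comparison, here is a small sketch of how a conventional engine (Python's
re module, not the CLDR transform syntax) treats the two spellings. The range
\u0370-\u03ff is only a rough stand-in I am assuming for [:script=greek:],
which re does not support, and spans() is a helper made up for this example.

```python
import re

# Assumption: \u0370-\u03ff roughly approximates [:script=greek:];
# Python's re module has no script property support.
GREEK = "\u0370-\u03ff"

# "$" inside a character class is just a literal dollar sign.
in_class = re.compile(f"[ab${GREEK}]")

# "$" outside the class, in an alternation, is a zero-width end anchor.
# (In Python, $ also matches just before a trailing newline; \Z is the
# strict end of string.)
alternation = re.compile(f"[ab{GREEK}]|$")

def spans(pattern, text):
    """Hypothetical helper: list the (start, end) span of every match."""
    return [m.span() for m in pattern.finditer(text)]

print(spans(in_class, "a$"))      # the "$" character itself is matched
print(spans(alternation, "a$"))   # "$" is skipped; an empty match at the end
```

In the second pattern the end of string is matched as an empty span and
consumes no character, which is the behaviour the alternation form expresses.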

Regexps should also have a better and more explicit notation for
precontexts and postcontexts (including with non-empty matched contents).
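
As a point of reference, the closest existing notation in common engines is
lookbehind/lookahead. The sketch below uses Python's re module; the pattern
and the sample text are made up for illustration.

```python
import re

# Precontext:  (?<=0x)   the match must be preceded by "0x" (not consumed).
# Postcontext: (?=[,;])  the match must be followed by "," or ";" (not consumed).
hex_in_context = re.compile(r"(?<=0x)[0-9a-f]+(?=[,;])")

m = hex_in_context.search("value=0x1f, next")
print(m.group(), m.span())   # only the digits are part of the match

# Without the required postcontext nothing matches.
print(hex_in_context.search("value=0x1fz"))
```

Both contexts are non-empty here, yet neither is included in the matched
content. The notation is cryptic, and Python restricts lookbehind to
fixed-width patterns, which illustrates why a more explicit notation would
help.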

And the beginning/end of strings/texts is not the only boundary needed in
those pre/post-contexts; there are other interesting ones, notably start/end
of words (depending on locale-sensitive definitions of word boundaries), or
even start/end of sentences (here also locale dependent).

Possibly we could also have regexps defining their own custom boundary
conditions, assigned locally with a symbolic name (internally the regexp
engine would evaluate all those conditions in parallel with the main parsing,
to determine when each one raises its condition flag to true).

E.g. "[:^hex=[0-9a-f]+]", where internally the defined "hex" boundary is a
normal regexp (matched greedily in both directions, backward and forward).
Then we could reuse that condition in several other places of the regexp
with "[:^hex]" (if it is reused, each instance is still matched separately
at its own position, and the instances do not need to have equal matched
content).
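
To make the idea a bit more concrete, here is a minimal sketch of one possible
reading of such a boundary: the flag is raised at every position where a
maximal (greedy) match of the definition starts or ends. This interpretation
and the name boundary_positions are my assumptions for illustration, not an
existing engine feature.

```python
import re

def boundary_positions(definition, text):
    """Hypothetical helper: positions where a greedy match of `definition`
    starts or ends, i.e. where the custom boundary flag would be true."""
    positions = set()
    for m in re.finditer(definition, text):
        if m.end() > m.start():          # ignore empty matches
            positions.update((m.start(), m.end()))
    return positions

# The "hex" boundary is evaluated independently of any outer regexp.
hex_flags = boundary_positions(r"[0-9a-f]+", "xx1fzz9")
print(sorted(hex_flags))
```

Each occurrence of the boundary is found at its own position, and the matched
contents ("1f" and "9" here) are unrelated to each other.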

With a similar system we could also define named subregexps such as
[:$hex=[0-9a-f]+] defining the "[:$hex]" subregexp.

The previous custom boundary could also be defined as [:^hex=[:$hex]]. Here
we see that this "$" would be more useful and would not imply any "end of
string" meaning (but it would deviate from legacy regexps, where this $ is
taken literally in character classes). In that last example the custom
boundary and the defined subregexp are given the same "hex" name separately.

But we could also say that they automatically share the same namespace, so
that any defined subregexp would also be the name of a defined boundary. To
use them, the first character ($ or ^) after "[:" tells whether we mean an
expansion of a subregexp whose matched content will be part of the outer
matched content, or a condition matched only internally, with a testable
flag, but not included in the outer matched content.
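
A rough sketch of that shared namespace, emulated on top of Python's re
module: "[:$name]" is expanded to a consuming group and "[:^name]" to a
zero-width test. The expand() helper and the DEFINITIONS table are
hypothetical, and the lookahead used for "^" only tests forward, which is a
simplification of a boundary matched in both directions.

```python
import re

# Hypothetical shared namespace: one definition per name.
DEFINITIONS = {"hex": r"[0-9a-f]+"}

# A reference is "[:$name]" or "[:^name]".
REFERENCE = re.compile(r"\[:([$^])(\w+)\]")

def expand(pattern, definitions=DEFINITIONS):
    """Rewrite the proposed references into standard regexp syntax."""
    def replace(m):
        kind, name = m.group(1), m.group(2)
        body = definitions[name]
        # "$": the content becomes part of the outer match.
        # "^": zero-width condition, tested but not consumed.
        return f"(?:{body})" if kind == "$" else f"(?={body})"
    return REFERENCE.sub(replace, pattern)

text = "color #ff00aa;"
consumed = re.search(expand("#[:$hex]"), text).group()
tested = re.search(expand("#[:^hex]"), text).group()
print(consumed, tested)
```

The same name gives two different results: with "$" the hex digits are part of
the outer matched content, with "^" only the "#" is.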

Then the start/end of string condition is nothing else than the evaluation
of the custom boundary condition "[:^eos]", definable as "[:^eos=.*]" and
matched greedily by default. Here I'm also assuming that "." matches any
character, including newlines if they are part of the content of a
"string"; otherwise you'll need to define the "eos" custom boundary as
"[:^eos=([\r\n]|.)*]".
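
The caveat about "." can be checked in an existing engine. In Python's re
module, for instance, "." does not match "\n" unless the DOTALL flag is set,
so only the explicit form covers a whole multi-line string by default. The
sample text is made up for illustration.

```python
import re

text = "line one\r\nline two"

# By default "." stops at "\n", so ".*" cannot span the whole string.
default_dot = re.fullmatch(r".*", text)

# With DOTALL, "." matches any character, including newlines.
dotall = re.fullmatch(r".*", text, flags=re.DOTALL)

# The explicit form works without any flag.
explicit = re.fullmatch(r"(?:[\r\n]|.)*", text)

print(default_dot is None, dotall is not None, explicit is not None)
```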





2014-12-31 0:26 GMT+01:00 Cameron Dutro <cameron at lumoslabs.com>:

> Also, would it be fair to say simply removing the outer set of square
> brackets and treating the entire thing as a regex is correct? It doesn't
> make sense to me to have these transform rules be "almost" regexes except
> for this one "$" exception, especially given "$"'s special significance in
> regexes.
>
> -Cameron
>
> On Tue, Dec 30, 2014 at 3:22 PM, Cameron Dutro <cameron at lumoslabs.com>
> wrote:
>
>> Thanks Mark. Is that documented anywhere?
>>
>> -Cameron
>>
>> On Tue, Dec 30, 2014 at 11:40 AM, Mark Davis <
>> mark at macchiato.com> wrote:
>>
>>> $ has a special meaning in the transforms; it means the end of string
>>> (either end). Unlike normal regex, however, it can occur in character
>>> classes, eg [[a$b][:script=greek:]]
>>>
>>>
>>> Mark <https://google.com/+MarkDavis>
>>>
>>> *— Il meglio è l’inimico del bene —*
>>>
>>> On Tue, Dec 30, 2014 at 8:21 PM, Cameron Dutro <cameron at lumoslabs.com>
>>> wrote:
>>>
>>>> Hey cldr-users,
>>>>
>>>> I'm looking at this entry
>>>> <http://unicode.org/cldr/trac/browser/trunk/common/transforms/Any-Publishing.xml#L21>
>>>> in CLDR transforms. I'm curious why that "$" character is inside the
>>>> character class. Here's the line reproduced:
>>>>
>>>> <tRule>$makeRight = [[:Z:][:Ps:][:Pi:]$] ;</tRule>
>>>>
>>>> I see an outer character class that contains three internal unicode
>>>> character sets and a literal dollar sign. Usually in regular expressions,
>>>> the dollar sign is used to match the end of the string. When it's included
>>>> in a character class however, it should be interpreted as a literal
>>>> character.
>>>>
>>>> Was including the dollar sign in the character class intentional?
>>>> Should it be treated as an end-of-string anchor or a literal string?
>>>>
>>>> -Cameron
>>>>
>>>> _______________________________________________
>>>> CLDR-Users mailing list
>>>> CLDR-Users at unicode.org
>>>> http://unicode.org/mailman/listinfo/cldr-users
>>>>
>>>>
>>>
>>
>