Regex for Grapheme Cluster Breaks

Mark Davis ☕️ via Unicode unicode at unicode.org
Wed Jan 3 04:38:17 CST 2018


Quick update: Manish pointed out that I'd misstated one of the rules; it
should be:

skin-sequence = $E_Base $Extend* $E_Modifier ;
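
For anyone who wants to see what the extra $Extend* changes, here is a toy
sketch in Python. It is not the real tooling: the three-entry property table,
the one-letter class codes, and the cut-down cluster rule are all assumptions
made for this illustration. Offsets are counted in UTF-16 code units, on the
guess that this is how the [0, 2, 4] and [0, 4] in the quoted report were
counted.

```python
import re

# Toy GCB table covering only the three code points of test 813 (assumption:
# a real implementation would read the Unicode property data instead).
TOY_GCB = {
    0x261D: "B",   # E_Base
    0x0308: "X",   # Extend
    0x1F3FB: "M",  # E_Modifier
}

OLD_SKIN = "BM"    # skin-sequence = $E_Base $E_Modifier
NEW_SKIN = "BX*M"  # skin-sequence = $E_Base $Extend* $E_Modifier


def boundaries(text, skin):
    """Return cluster boundaries as UTF-16 code-unit offsets."""
    # Translate each code point to its one-letter class ("O" for anything else).
    classes = "".join(TOY_GCB.get(ord(ch), "O") for ch in text)
    # Cut-down cluster rule: core = skin-sequence | any ; postcore = Extend.
    cluster = re.compile(f"(?:{skin}|.)X*")
    offsets, units = [0], 0
    for match in cluster.finditer(classes):
        for ch in text[match.start():match.end()]:
            units += 2 if ord(ch) > 0xFFFF else 1
        offsets.append(units)
    return offsets


sample = "\u261D\u0308\U0001F3FB"
print(boundaries(sample, OLD_SKIN))  # [0, 2, 4]
print(boundaries(sample, NEW_SKIN))  # [0, 4]
```

With the old rule the Extend character ends the first cluster before the
modifier is seen; with the new rule the whole sequence is one cluster.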

With that change, the test passes. (Thanks Manish!)

Mark

On Wed, Jan 3, 2018 at 10:16 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:

> I had a UTC action to adjust
> http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
> to update the regex and make other necessary changes to the surrounding
> text.
>
> Here is what I've come up with for an EBNF formulation. The $x names are
> the GCB (Grapheme_Cluster_Break) property values.
>
> cluster = crlf | $Control | precore* core postcore* ;
>
>
> crlf = $CR $LF ;
>
>
> precore =  $Prepend ;
>
>
> postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] );
>
>
> core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence
> | [^$Control $CR $LF] );
>
>
> hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;
>
>
> ri-sequence = $RI $RI ;
>
>
>
> skin-sequence = $E_Base $E_Modifier ;
>
>
> xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?:
> $Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;
>
>
> virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;
>
>
> I have tools to turn that into a (lovely) regex:
>
> \p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}])(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])*
>
> (It is a bit shorter if some more property names/values are abbreviated.)
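
As a rough illustration of that kind of expansion (not the actual tools), the
rules as originally posted can be turned into a single pattern by plain
textual substitution. The RULES table, TOKEN pattern, and expand() helper are
names made up for this sketch. The output is only a pattern string: Python's
built-in re module does not understand \p{...}, so it is meant for an engine
that does.

```python
import re

# The EBNF rules from the message, without the trailing ';' (skin-sequence is
# the originally posted form, so the result corresponds to the regex above).
RULES = {
    "cluster": r"crlf | $Control | precore* core postcore*",
    "crlf": r"$CR $LF",
    "precore": r"$Prepend",
    "postcore": r"(?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] )",
    "core": r"(?: hangul-syllable | ri-sequence | xpicto-sequence"
            r" | virama-sequence | [^$Control $CR $LF] )",
    "hangul-syllable": r"$L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+",
    "ri-sequence": r"$RI $RI",
    "skin-sequence": r"$E_Base $E_Modifier",
    "xpicto-sequence": r"(?: skin-sequence | \p{Extended_Pictographic} )"
                       r" (?: $Extend* $ZWJ"
                       r" (?: skin-sequence | \p{Extended_Pictographic} ))*",
    "virama-sequence": r"[$Virama $ZWJ] $LinkingConsonant",
}

# A token is a literal \p{...}, a $Property reference, or a rule name.
TOKEN = re.compile(r"\\p\{\w+\}|\$\w+|[a-z]+(?:-[a-z]+)*")


def expand(name):
    """Recursively inline rule references and drop insignificant whitespace."""
    def substitute(match):
        token = match.group(0)
        if token.startswith("\\p"):
            return token                          # already regex syntax
        if token.startswith("$"):
            return "\\p{gcb=" + token[1:] + "}"   # $X -> \p{gcb=X}
        return expand(token)                      # rule name -> its expansion
    return re.sub(r"\s+", "", TOKEN.sub(substitute, RULES[name]))


print(expand("cluster"))
```

Blind substitution is only safe here because every rule with a top-level
alternation is either used directly inside another alternation or already
wrapped in (?: ... ); a general converter would have to add grouping itself.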
>
> I then tested against the current test file: GraphemeBreakTest.txt. There
> is one outlying failure with that test file:
>
> 813) ☝̈🏻
>
> hex: 261D 0308 1F3FB
>
> test: [0, 4]
>
> ebnf: [0, 2, 4]
>
> I believe that is a problem with the test rather than the BNF, but I need
> to track it down in any event.
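
For comparing against the test file, a line can be decoded along these lines.
This is a minimal sketch that assumes the usual GraphemeBreakTest.txt layout
(hex code points separated by ÷ for "break" and × for "no break", with a
trailing # comment). The sample line is reconstructed from the expected
result reported above rather than copied from the file, and
parse_test_line() is a name invented here.

```python
BREAK = "\u00F7"     # the division sign marks a boundary
NO_BREAK = "\u00D7"  # the multiplication sign marks "no boundary here"


def parse_test_line(line):
    """Return (text, expected boundaries as UTF-16 code-unit offsets)."""
    fields = line.split("#", 1)[0].split()  # drop the comment, then tokenize
    text, offsets, units = "", [], 0
    for field in fields:
        if field == BREAK:
            offsets.append(units)
        elif field != NO_BREAK:
            ch = chr(int(field, 16))
            text += ch
            # Supplementary code points take two UTF-16 code units.
            units += 2 if ord(ch) > 0xFFFF else 1
    return text, offsets


# Reconstructed stand-in for the failing case (hex: 261D 0308 1F3FB).
sample_line = f"{BREAK} 261D {NO_BREAK} 0308 {NO_BREAK} 1F3FB {BREAK}\t# sample"
text, expected = parse_test_line(sample_line)
print([f"{ord(ch):04X}" for ch in text], expected)
# ['261D', '0308', '1F3FB'] [0, 4]
```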
>
> A regex is much easier for many applications to use than the current rule
> syntax, so I'm going to see if the other segmentations could be
> reformulated as EBNFs (ideally corresponding to regular grammars, or in
> the worst case, to PEGs).
>
> Feedback is welcome.
>
> Mark
>