Regex for Grapheme Cluster Breaks

Mark Davis ☕️ via Unicode unicode at unicode.org
Wed Jan 3 03:16:36 CST 2018


I had a UTC action to adjust
http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
to update the regex and make any other necessary changes to the surrounding text.

Here is what I've come up with for an EBNF formulation. The $x terms are the
GCB (Grapheme_Cluster_Break) property values.

cluster = crlf | $Control | precore* core postcore* ;

crlf = $CR $LF ;

precore = $Prepend ;

postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] ) ;

core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence | [^$Control $CR $LF] ) ;

hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;

ri-sequence = $RI $RI ;

skin-sequence = $E_Base $E_Modifier ;

xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?: $Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;

virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;
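
To make the shape of the grammar concrete, here is a minimal Python sketch of the top-level rule, restricted to crlf, $Control, ri-sequence and an Extend/ZWJ postcore. Python's built-in re module has no \p{gcb=...} support, so the character ranges below are hand-written stand-ins that cover only a small part of each property (an assumption made purely for illustration). The names CONTROL, EXTEND, ZWJ, RI, CLUSTER and segment are hypothetical.

```python
import re

# Hand-written stand-ins for a few GCB classes (illustrative subsets only).
CONTROL = r"\x00-\x09\x0b\x0c\x0e-\x1f\x7f-\x9f"  # C0/C1 controls minus CR, LF
EXTEND = r"\u0300-\u036f"                          # combining diacritical marks
ZWJ = r"\u200d"                                    # zero width joiner
RI = r"\U0001F1E6-\U0001F1FF"                      # regional indicators

# cluster = crlf | $Control | core postcore*   (precore and most cores omitted)
CLUSTER = re.compile(
    r"\r\n"
    rf"|[{CONTROL}]"
    rf"|(?:[{RI}][{RI}]|[^{CONTROL}\r\n])[{EXTEND}{ZWJ}]*"
)

def segment(text):
    """Return the substrings matched by the simplified cluster pattern."""
    return [m.group(0) for m in CLUSTER.finditer(text)]

# 'e' + combining acute, CRLF, then two regional-indicator pairs
sample = "e\u0301\r\n\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"
clusters = segment(sample)  # four clusters
```

Each match is one cluster, so repeatedly matching from the end of the previous match walks the text one grapheme cluster at a time.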


I have tools to turn that into a (lovely) regex:

\p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}])(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])*
​
(It would be a bit shorter if more of the property names/values were abbreviated.)
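
The kind of expansion such a tool performs can be sketched in a few lines of Python. This is not the tooling referred to above, only a naive illustration: the RULES table is transcribed from the grammar, and TOKEN and expand are hypothetical names. Every rule reference is wrapped in (?:...), so the result is more verbose than the regex shown above, although it should describe the same language. Python's re module cannot compile \p{gcb=...}, so the sketch only builds the pattern string.

```python
import re

# Grammar rules transcribed from the EBNF (trailing ';' dropped).
RULES = {
    "cluster": "crlf | $Control | precore* core postcore*",
    "crlf": "$CR $LF",
    "precore": "$Prepend",
    "postcore": "(?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] )",
    "core": "(?: hangul-syllable | ri-sequence | xpicto-sequence"
            " | virama-sequence | [^$Control $CR $LF] )",
    "hangul-syllable": "$L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+",
    "ri-sequence": "$RI $RI",
    "skin-sequence": "$E_Base $E_Modifier",
    "xpicto-sequence": r"(?: skin-sequence | \p{Extended_Pictographic} )"
                       r" (?: $Extend* $ZWJ"
                       r" (?: skin-sequence | \p{Extended_Pictographic} ))*",
    "virama-sequence": "[$Virama $ZWJ] $LinkingConsonant",
}

# 1: literal \p{...}   2: $Property   3: rule reference   else: whitespace
TOKEN = re.compile(r"(\\p\{[^}]*\})|\$(\w+)|([a-z]+(?:-[a-z]+)*)|\s+")

def expand(name):
    """Inline rule references recursively and rewrite $X as \\p{gcb=X}."""
    def repl(m):
        if m.group(1):
            return m.group(1)                       # keep \p{...} verbatim
        if m.group(2):
            return r"\p{gcb=%s}" % m.group(2)       # $X -> \p{gcb=X}
        if m.group(3):
            return "(?:" + expand(m.group(3)) + ")"  # inline the named rule
        return ""                                   # drop whitespace
    return TOKEN.sub(repl, RULES[name])

pattern = expand("cluster")
```

Wrapping every reference in a non-capturing group keeps the expander trivial; a real tool would presumably drop the groups that are not needed, which is how the shorter form above comes about.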

I then tested against the current test file: GraphemeBreakTest.txt. There
is one outlying failure with that test file:

813) ☝̈🏻

hex: 261D 0308 1F3FB

test: [0, 4]

ebnf: [0, 2, 4]
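
The offsets above are consistent with UTF-16 code units: U+261D and U+0308 take one unit each and U+1F3FB takes two, for a total of 4. As a hedged sketch of how such offsets can be derived, the Python below parses a line in the ÷ (break) / × (no break) notation used by GraphemeBreakTest.txt. The two sample lines are reconstructed from the results reported above, not copied from the test file, and boundaries_utf16 is a hypothetical helper name.

```python
def boundaries_utf16(test_line):
    """Return break offsets, in UTF-16 code units, for one test line."""
    spec = test_line.split("#", 1)[0].split()  # drop the trailing comment
    offsets = []
    pos = 0
    for token in spec:
        if token == "÷":        # break opportunity at the current offset
            offsets.append(pos)
        elif token == "×":      # no break here
            continue
        else:                   # a hex code point
            cp = int(token, 16)
            pos += 2 if cp > 0xFFFF else 1  # surrogate pair above the BMP
    return offsets

# Reconstructed from "test: [0, 4]" (no break before U+1F3FB)
expected = boundaries_utf16("÷ 261D × 0308 × 1F3FB ÷")
# Reconstructed from "ebnf: [0, 2, 4]" (break before U+1F3FB)
observed = boundaries_utf16("÷ 261D × 0308 ÷ 1F3FB ÷")
```

Comparing the two lists shows that the only disagreement is the boundary at offset 2, directly before U+1F3FB.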

I believe that is a problem with the test rather than with the EBNF, but I need
to track it down in any event.

A regex is much easier for many applications to use than the current rule
syntax, so I'm going to see if the other segmentations could be
reformulated as EBNFs (ideally corresponding to regular grammars or, in
the worst case, to PEGs).

Feedback is welcome.

Mark