Regex for Grapheme Cluster Breaks

Wed Jan 3 03:16:36 CST 2018

I had a UTC action to adjust
http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
by updating the regex and making any other necessary changes to the
surrounding text.

Here is what I've come up with for an EBNF formulation. The $x terms are
values of the Grapheme_Cluster_Break (GCB) property.

cluster = crlf | $Control | precore* core postcore* ;

crlf = $CR $LF ;

precore = $Prepend ;

postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] );

core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence
| [^$Control $CR $LF] );

hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;

ri-sequence = $RI $RI ;

skin-sequence = $E_Base $E_Modifier ;

xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?:
$Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;

virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;
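
To make the hangul-syllable rule concrete, here is a minimal Python sketch that checks it in isolation. It is only an illustration, not part of the proposal: the helper names and the one-letter codes (A for $LV, B for $LVT, X for anything else) are made up for this example, and the code point ranges are hard-coded because Python's standard library does not expose the GCB property.

```python
import re

def hangul_class(ch):
    """Map a character to a one-letter stand-in for its Hangul GCB value.

    L, V, T are the GCB values of the same name; A stands for LV, B for LVT,
    and X for everything else (these letter codes are hypothetical).
    """
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant (LV).
        return "A" if (cp - 0xAC00) % 28 == 0 else "B"
    return "X"

# hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;
HANGUL_SYLLABLE = re.compile(r"L*(?:V+|AV*|B)T*|L+|T+")

def is_single_syllable(text):
    """True if the whole string is matched by the hangul-syllable rule."""
    classes = "".join(hangul_class(c) for c in text)
    return HANGUL_SYLLABLE.fullmatch(classes) is not None

print(is_single_syllable("\u1112\u1161\u11ab"))  # conjoining L V T
print(is_single_syllable("\u11a8\u1100"))        # T followed by L
```

The first call reports a single syllable; the second does not, because nothing in the rule lets an $L follow a $T.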

I have tools to turn that into a (lovely) regex:

\p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}])(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])*

(It is a bit shorter if some more property names/values are abbreviated.)
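
For readers who want to experiment, the expansion can be approximated with plain textual substitution. The sketch below is not the tool mentioned above; it is a hypothetical Python version that works on a reduced rule table (RULES, TOKEN and expand are names invented here) and writes property values with the capitalization used in the grammar rather than lowercasing them.

```python
import re

# Hypothetical rule table: a reduced subset of the grammar, bodies copied
# without the trailing semicolons.
RULES = {
    "cluster": "crlf | $Control | precore* core postcore*",
    "crlf": "$CR $LF",
    "precore": "$Prepend",
    "postcore": "[$Extend $ZWJ $SpacingMark]",
    "core": "(?: ri-sequence | [^$Control $CR $LF] )",
    "ri-sequence": "$RI $RI",
}

# Tokens: a literal \p{...}, a $Property, a rule name, or any other symbol.
TOKEN = re.compile(r"\\p\{\w+\}|\$\w+|[a-z][a-z-]*|\S")

def expand(name, rules=RULES):
    """Recursively inline rule references and rewrite $X as \\p{gcb=X}."""
    out = []
    for tok in TOKEN.findall(rules[name]):
        if tok.startswith("$"):
            out.append(r"\p{gcb=" + tok[1:] + "}")
        elif tok in rules:
            out.append(expand(tok, rules))
        else:
            out.append(tok)  # operators, brackets, literal \p{...}
    return "".join(out)  # whitespace is dropped by the tokenizer

print(expand("cluster"))
```

This relies on each rule body carrying its own (?: ... ) grouping wherever a quantifier or concatenation needs it, as the grammar above does; a more general tool would have to add grouping itself.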

I then tested this against the current test file, GraphemeBreakTest.txt.
There is a single failure with that file:

813) ☝̈🏻

hex: 261D 0308 1F3FB

test: [0, 4]

ebnf: [0, 2, 4]

I believe that is a problem with the test rather than the EBNF, but I need
to track it down in any event.
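
The two results for test 813 can be reproduced with a toy model in Python. Everything here is an assumption made for illustration: the one-letter classes cover only the three code points involved, the offsets are taken to be UTF-16 code units (which is what makes the numbers line up), and the second pattern, which lets Extend sit between the base and the modifier, is just one guess at what the test data might encode.

```python
import re

# Hypothetical one-letter classes for the three code points in test 813.
CLASSES = {
    0x261D: "B",   # E_Base
    0x0308: "E",   # Extend
    0x1F3FB: "M",  # E_Modifier
}

# Toy reduction of the EBNF: the modifier must directly follow the base.
EBNF_LIKE = re.compile(r"(?:BM|B)E*|.E*")
# Assumed variant: Extend is tolerated between base and modifier.
EXTEND_TOLERANT = re.compile(r"B(?:E*M)?E*|.E*")

def utf16_breaks(text, pattern):
    """Return cluster boundaries as UTF-16 code unit offsets."""
    classes = "".join(CLASSES.get(ord(c), "X") for c in text)
    breaks, pos, units = [0], 0, 0
    while pos < len(classes):
        # The '.' alternative guarantees a match of at least one symbol.
        end = pattern.match(classes, pos).end()
        units += sum(2 if ord(c) > 0xFFFF else 1 for c in text[pos:end])
        breaks.append(units)
        pos = end
    return breaks

sample = "\u261d\u0308\U0001f3fb"
print(utf16_breaks(sample, EBNF_LIKE))        # [0, 2, 4]
print(utf16_breaks(sample, EXTEND_TOLERANT))  # [0, 4]
```

Under these assumptions the difference comes down to whether an Extend character may intervene inside a skin-sequence; which behaviour is intended is the question to track down.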

A regex is much easier for many applications to use than the current rule
syntax, so I'm going to see whether the other segmentations could be
reformulated as EBNFs (ideally corresponding to regular grammars or, in
the worst case, to PEGs).

Feedback is welcome.

Mark