Regex for Grapheme Cluster Breaks

Mark Davis ☕️ via Unicode unicode at
Wed Jan 3 03:16:36 CST 2018

I had a UTC action to update the regex and to make any other necessary
changes to the surrounding text.

Here is what I've come up with for an EBNF formulation. The $x are the GCB
(Grapheme_Cluster_Break) property values.

cluster = crlf | $Control | precore* core postcore* ;

crlf = $CR $LF ;

precore = $Prepend ;

postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] );

core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence
| [^$Control $CR $LF] );

hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;

ri-sequence = $RI $RI ;

skin-sequence = $E_Base $E_Modifier ;

xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?:
$Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;

virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;
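As an illustration only, here is a minimal Python sketch of how rules written in this notation could be expanded into a single regex by textual substitution. It is not the tooling referred to in this post. The PROPS table is a hypothetical, hand-picked stand-in for the real property data (only the Hangul L/V/T/LV/LVT ranges are meant to be realistic), and the names PROPS, RULES, TOKEN, expand and clusters are all made up for the example.

```python
import re

# ASSUMPTION: toy stand-ins for the property values, NOT the real Unicode data.
# Each value is the *inside* of a regex character class.
PROPS = {
    "CR": r"\r", "LF": r"\n",
    "Control": r"\x00-\x09\x0b\x0c\x0e-\x1f\x7f",
    "Prepend": r"\u0600-\u0605",
    "Extend": r"\u0300-\u036f\ufe0f",
    "ZWJ": r"\u200d",
    "Virama": r"\u094d",
    "SpacingMark": r"\u0903\u093e-\u0940",
    "LinkingConsonant": r"\u0915-\u0939",
    "RI": r"\U0001F1E6-\U0001F1FF",
    "E_Base": r"\u261d\U0001F44D",
    "E_Modifier": r"\U0001F3FB-\U0001F3FF",
    "Extended_Pictographic": r"\u261d\u2764\U0001F44D\U0001F468\U0001F469",
    "L": r"\u1100-\u115f", "V": r"\u1160-\u11a7", "T": r"\u11a8-\u11ff",
    # Precomposed Hangul: every 28th syllable from U+AC00 is LV, the rest LVT.
    "LV": "".join(chr(c) for c in range(0xAC00, 0xD7A4, 28)),
    "LVT": "".join(chr(c) for c in range(0xAC00, 0xD7A4) if (c - 0xAC00) % 28),
}

# The rules from the post, minus the trailing ';'.
RULES = {
    "cluster": "crlf | $Control | precore* core postcore*",
    "crlf": "$CR $LF",
    "precore": "$Prepend",
    "postcore": "(?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] )",
    "core": "(?: hangul-syllable | ri-sequence | xpicto-sequence"
            " | virama-sequence | [^$Control $CR $LF] )",
    "hangul-syllable": "$L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+",
    "ri-sequence": "$RI $RI",
    "skin-sequence": "$E_Base $E_Modifier",
    "xpicto-sequence": r"(?: skin-sequence | \p{Extended_Pictographic} ) (?:"
                       r" $Extend* $ZWJ (?: skin-sequence"
                       r" | \p{Extended_Pictographic} ))*",
    "virama-sequence": "[$Virama $ZWJ] $LinkingConsonant",
}

TOKEN = re.compile(
    r"(?P<cls>\[[^\]]*\])"             # bracketed class, e.g. [^$Control $CR $LF]
    r"|\\p\{(?P<uprop>\w+)\}"          # \p{Extended_Pictographic}
    r"|\$(?P<prop>\w+)"                # $Extend, $ZWJ, ...
    r"|(?P<rule>[a-z]+(?:-[a-z]+)*)"   # non-terminal, e.g. ri-sequence
    r"|\s+"                            # layout whitespace
)

def expand(name):
    """Recursively expand one rule into plain `re` syntax."""
    def sub(m):
        if m.group("cls"):
            # Inside [...]: splice in the class contents and drop the spaces.
            inner = re.sub(r"\$(\w+)", lambda p: PROPS[p.group(1)], m.group("cls"))
            return inner.replace(" ", "")
        prop = m.group("uprop") or m.group("prop")
        if prop:
            return "[" + PROPS[prop] + "]"
        if m.group("rule"):
            return "(?:" + expand(m.group("rule")) + ")"
        return ""  # whitespace is layout only
    return TOKEN.sub(sub, RULES[name])

CLUSTER = re.compile(expand("cluster"))

def clusters(text):
    """Split text into clusters by repeatedly matching CLUSTER."""
    out, pos = [], 0
    while pos < len(text):
        m = CLUSTER.match(text, pos)
        # ASSUMPTION of this sketch: anything the rules do not match
        # (e.g. a lone CR or LF) is treated as a one-character cluster.
        end = m.end() if m else pos + 1
        out.append(text[pos:end])
        pos = end
    return out
```

With these toy tables, clusters("\u261d\u0308\U0001F3FB") returns two pieces, U+261D U+0308 and then U+1F3FB, because skin-sequence as written leaves no room for an $Extend between $E_Base and $E_Modifier. Real use would need the actual property data in place of PROPS.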

I have tools to turn that into a (lovely) regex:

(It is a bit shorter if some more property names/values are abbreviated.)

I then tested against the current test file: GraphemeBreakTest.txt. There
is one outlying failure with that test file:

813) ☝̈🏻

hex: 261D 0308 1F3FB

test: [0, 4]

ebnf: [0, 2, 4]

I believe that is a problem with the test rather than the EBNF, but I need
to track it down in any event.
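For anyone wanting to reproduce this kind of comparison, here is a small Python sketch that reads one line in the GraphemeBreakTest.txt notation (÷ marks a break, × marks no break, code points are in hex, and everything after # is a comment) and returns the string plus its break offsets. Two things are assumptions: the offsets above look like UTF-16 code unit offsets, since U+1F3FB takes two units and the last boundary is 4, and the two sample lines are reconstructed from the hex and offsets above rather than copied from the file. The function name parse_break_test_line is made up for the example.

```python
def parse_break_test_line(line):
    """Return (text, break_offsets) for one GraphemeBreakTest-style line.

    Offsets are counted in UTF-16 code units (an assumption, chosen to
    match the [0, 4] / [0, 2, 4] figures quoted above).
    """
    # Drop the trailing comment first: it also contains the ÷ and × marks.
    tokens = line.split("#", 1)[0].split()
    text, breaks, pos = "", [], 0
    for tok in tokens:
        if tok == "÷":            # boundary allowed here
            breaks.append(pos)
        elif tok == "×":          # no boundary here
            continue
        else:                     # a hex code point
            ch = chr(int(tok, 16))
            text += ch
            pos += len(ch.encode("utf-16-le")) // 2
    return text, breaks


# Reconstructed from the hex above, not copied from the test file.
as_tested = "÷ 261D × 0308 × 1F3FB ÷"
as_ebnf = "÷ 261D × 0308 ÷ 1F3FB ÷"

print(parse_break_test_line(as_tested)[1])  # [0, 4]
print(parse_break_test_line(as_ebnf)[1])    # [0, 2, 4]
```

The only difference between the two results is the boundary at offset 2, between U+0308 and U+1F3FB.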

A regex is much easier for many applications to use than the current rule
syntax, so I'm going to see if the other segmentations could be
reformulated as EBNFs (ideally corresponding to regular grammars or, in
the worst case, to PEGs).

Feedback is welcome.
