Possible bug in formal grammar for extended grapheme cluster

Andre Schappo via Unicode unicode at unicode.org
Mon Dec 18 03:59:06 CST 2017

Ah! That explains why

pcre2grep -u '^\X{1}$'

matches with



André Schappo

On 17 Dec 2017, at 17:17, Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:

Thanks for the feedback. You're correct about this; that is a holdover from an earlier version of the document when there was a more basic treatment of RI sequences.

There is already an action to modify these. There is a placeholder review note about that just above


(scroll up just a bit).



On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode <unicode at unicode.org> wrote:

It’s possible I’m missing something, but the formal grammar/regular
expression given for extended grapheme clusters appears to have a bug
in it.

The bug is here:

    RI-Sequence := Regional_Indicator+

If the formal grammar is intended to exactly match the rules given in
the “Grapheme Cluster Boundary Rules” section below it as-is, then
this should be

    RI-Sequence := Regional_Indicator Regional_Indicator

since as given it would cause any number of RI characters to coalesce
into a single grapheme cluster, instead of pairs of characters. That
is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
grapheme cluster instead of the correct two.
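
A minimal sketch of the difference, assuming Python's `re` module and a
character class covering the Regional_Indicator range U+1F1E6..U+1F1FF.
The names `RI`, `ri_plus` and `ri_pair` are illustrative only and do not
come from the specification:

```python
import re

# Assumption: Regional_Indicator is the range U+1F1E6..U+1F1FF.
RI = "[\U0001F1E6-\U0001F1FF]"

ri_plus = re.compile(f"{RI}+")     # RI-Sequence := Regional_Indicator+
ri_pair = re.compile(f"{RI}{RI}")  # RI-Sequence := Regional_Indicator Regional_Indicator

# The four-character example from the message: U+1F1EC U+1F1E7 U+1F1EA U+1F1FA
text = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"

clusters_plus = ri_plus.findall(text)  # greedy: swallows all four characters
clusters_pair = ri_pair.findall(text)  # splits into two pairs

print(len(clusters_plus), len(clusters_pair))
```

This covers only the even-length example above. A lone trailing RI would
not be matched by the pair pattern at all, so a complete grammar would
presumably need an alternative for that case as well.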

dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
   we do these things not because they are easy,      +49 159 03847809
      but because we thought they were going to be easy
          — ‘The Programmers’ Credo’, Maciej Cegłowski

