Re: Clarification on Annex 29, GB12–13

Andy Heninger andy.heninger at gmail.com
Fri Apr 1 14:37:57 CDT 2022


This is a misunderstanding of the way the break rules are meant to be
applied.

The rules test for the presence (÷) or absence (×) of a boundary at a
single location in the subject text. When there is an extended context, as
in GB12 or GB13, the rules do not imply anything about boundaries, or the
lack thereof, within that context. For some other rules with context, like
WB6 and WB7 or the various sentence break rules, it does happen that there
are no boundaries within the context.
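
To make the "one location at a time" reading concrete, here is a rough
sketch in Python. It is not taken from any real implementation. It looks
only at GB12/GB13 and treats every other position as a break (GB999),
ignoring GB3 through GB11 entirely, and the names is_ri and ri_breaks are
made up for illustration. At each candidate position it counts the RI
characters immediately before it: an odd count means × (GB12/GB13 match),
an even count means no rule matches and the default ÷ applies.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # Regional Indicator Symbol Letters A..Z


def is_ri(ch):
    """True if ch is a Regional Indicator code point."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def ri_breaks(text):
    """Return the interior indices i with a boundary between text[i-1] and text[i].

    Simplifying assumption: only GB12/GB13 are modelled; any position they
    do not cover is treated as a break (GB999). Start and end of text
    (GB1, GB2) are not reported.
    """
    breaks = []
    for i in range(1, len(text)):
        if is_ri(text[i - 1]) and is_ri(text[i]):
            # Count the run of RI characters ending just before position i.
            run = 0
            j = i - 1
            while j >= 0 and is_ri(text[j]):
                run += 1
                j -= 1
            if run % 2 == 1:
                # Odd number of RIs before this point: GB12/GB13 say no break.
                continue
        # No modelled rule forbids a break here.
        breaks.append(i)
    return breaks


# The six RI code points from the example: AT, AU, AQ.
flags = "\U0001F1E6\U0001F1F9\U0001F1E6\U0001F1FA\U0001F1E6\U0001F1F6"
print(ri_breaks(flags))  # [2, 4]
```

Each position is decided independently. Positions 1, 3 and 5 have an odd
number of RIs before them and get ×; positions 2 and 4 have an even number,
so neither GB12 nor GB13 matches and the break falls through to the
default, giving the expected three flags.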

This can all get pretty confusing.

  -- Andy

On Thu, Mar 31, 2022 at 7:28 AM Don Hosek via Unicode <
unicode at corp.unicode.org> wrote:

> Annex 29 says:
> > Do not break within emoji flag sequences. That is, do not break between
> > regional indicator (RI) symbols if there is an odd number of RI characters
> > before the break point.
> > GB12  sot (RI RI)* RI  ×  RI
> > GB13  [^RI] (RI RI)* RI  ×  RI
>
> This would seem to indicate that any even number of RI tags should be
> treated as a single grapheme so given, e.g., 🇦🇹🇦🇺🇦🇶 this should be a
> single grapheme rather than the expected three. There is no test in
> https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakTest.txt
> that would enforce this however. Or is this just a case of my misreading
> the spec and there is an implicit ÷ after each pair of RI characters? (if
> the latter, it might be helpful for future implementors to have a note to
> that effect).
>
> -dh
>