Question Regarding UCD Draft Files and GraphemeBreakTest Discrepancy

Peter Constable pgcon6 at msn.com
Fri Mar 21 23:21:17 CDT 2025


At UTC meeting #182, UTC decided to remove the Extended_Pictographic property from a number of code points that are assigned to non-emoji characters. See

https://www.unicode.org/cgi-bin/GetL2Ref.pl?182-C20

This goes back to feedback submitted on the Unicode 15.0 beta — see feedback from Charlotte Buff (time stamp Fri Jun 24 10:24:49 CDT 2022) in

https://www.unicode.org/review/pri453/

which led to an action item to investigate removing the property from non-emoji characters, the outcome of which was a recommendation to UTC 182 to do just that — see section 5.1 in

https://www.unicode.org/L2/L2025/25006-utc182-properties-recs.pdf

The deeper background is that Extended_Pictographic was created as a code point property for code points that were likely to be assigned to emoji characters in the future, so that the line breaking implementation in a product sold (say) today would be forward compatible with emoji assigned later. The concern was that some devices might not get frequent software updates, but users might start using new emoji created some time after the device was released. As Charlotte Buff observed in her feedback,

"the Extended_Pictographic property has no use outside of emoji ZWJ sequences"


So, that's the background. The draft emoji-data.txt file for Unicode 17 has been updated to remove several code points from Extended_Pictographic in accordance with UTC decision 182-C20. It's possible that some test data that should have had a corresponding update was overlooked. If you think that's the case, please submit feedback for PRI #514

https://www.unicode.org/review/pri514/

which is the public review issue for the Unicode 17.0 alpha review. (See the contact form link in that page.)



Peter


-----Original Message-----
From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Naoto Sato via Unicode
Sent: Friday, March 21, 2025 2:25 PM
To: unicode at corp.unicode.org
Subject: Question Regarding UCD Draft Files and GraphemeBreakTest Discrepancy

Hello,

I have a question regarding the draft version of the UCD files (https://www.unicode.org/Public/draft/ucd/). I’m not sure if this is the appropriate place for such inquiries, so please forgive me if it is not.

While testing the draft "emoji-data.txt"
(https://www.unicode.org/Public/draft/ucd/emoji/emoji-data.txt), I encountered a failing test case in GraphemeBreakTest:

÷ 2701 × 200D × 2701 ÷  #  ÷ [0.2] UPPER BLADE SCISSORS (ExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) × [11.0] UPPER BLADE SCISSORS (ExtPict) ÷ [0.3]

This test case assumes that U+2701 is classified as Extended_Pictographic. However, the latest emoji-data.txt does not include it, whereas version 16.0 did. Additionally, the web version of the test
(https://www.unicode.org/Public/draft/ucd/auxiliary/GraphemeBreakTest.html#s23)
also indicates that U+2701 is an Extended_Pictographic, leading to an inconsistency.

This discrepancy is causing our test to fail. Could you clarify whether this is an issue or an expected change?

Thanks,
Naoto


