Unicode Properties and Canonical Equivalence

Richard Wordingham richard.wordingham at ntlworld.com
Fri Aug 12 00:17:51 CDT 2022


May a process conforming to Unicode requirement C6 (TUS Section 3.2),
"A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct", consider the
Unicode set

[\p{sc = Greek}&&\p{sc ≠ Greek}]

to be non-empty?

The problem is that the canonically equivalent characters U+00B4 ACUTE
ACCENT and U+1FFD GREEK OXIA have conflicting script properties, but a
Unicode-conformant process may freely interchange the two characters
whenever they appear as part of a string (Conformance Requirement C7).
This conflict was allowed to stand in Consensus 113-C16 back in 2007,
pending further study.
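
As a rough illustration of the conflict (a sketch only, not a normative test): Python's standard unicodedata module has no script property, so the two script values below are hard-coded as assumptions taken from the description above, and the helper name canonically_equivalent is made up for this example.

```python
import unicodedata

ACUTE = "\u00b4"  # U+00B4 ACUTE ACCENT
OXIA = "\u1ffd"   # U+1FFD GREEK OXIA

# Assumption: script values typed in by hand for illustration;
# the standard library does not expose the Script property.
SCRIPT = {ACUTE: "Common", OXIA: "Greek"}


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


equivalent = canonically_equivalent(ACUTE, OXIA)
scripts_differ = SCRIPT[ACUTE] != SCRIPT[OXIA]

print("canonically equivalent:", equivalent)
print("script values differ:  ", scripts_differ)
```

If both lines report True, a process that normalizes its input cannot tell the two characters apart, yet a character-by-character script test would classify them differently.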

For me, the question arose in the context of regular regular
expressions for Unicode strings under canonical equivalence.

The practical workaround of using scx=Greek instead does not work,
because U+00B4 does not include Greek in its script extensions.

The only sane resolution I can see is to treat \p{sc = Greek} as the
set of characters canonically equivalent to a character with the script
property value of Greek, and similarly \p{sc ≠ Greek} as the set of
characters canonically equivalent to a character with a script property
value other than Greek.  Disallowing the script property seems insane.
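
A minimal sketch of that reading, restricted to single characters and to a toy two-entry script table (both are assumptions of mine; a real implementation would need the full Script data and would have to deal with strings, not just characters).  The names SCRIPT, nfd and closed_set are hypothetical.

```python
import unicodedata

# Assumption: hand-written script table covering only the two
# characters discussed; not read from the Unicode Character Database.
SCRIPT = {"\u00b4": "Common", "\u1ffd": "Greek"}


def nfd(s: str) -> str:
    """Canonical decomposition, used as the equivalence-class key."""
    return unicodedata.normalize("NFD", s)


def closed_set(predicate, universe):
    """Characters of `universe` canonically equivalent to some
    character of `universe` whose script value satisfies `predicate`."""
    keys = {nfd(c) for c in universe if predicate(SCRIPT[c])}
    return {c for c in universe if nfd(c) in keys}


universe = set(SCRIPT)
greek = closed_set(lambda sc: sc == "Greek", universe)      # \p{sc = Greek}
not_greek = closed_set(lambda sc: sc != "Greek", universe)  # \p{sc ≠ Greek}
both = greek & not_greek

print(sorted(f"U+{ord(c):04X}" for c in both))
```

With this toy table, each set contains both characters, so under this reading the intersection asked about at the top is non-empty.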

Richard.


