Unicode Properties and Canonical Equivalence
Richard Wordingham
richard.wordingham at ntlworld.com
Fri Aug 12 00:17:51 CDT 2022
May a process conforming to Unicode requirement C6 (TUS Section 3.2),
"A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct", consider the
Unicode set
[\p{sc = Greek}&&\p{sc ≠ Greek}]
to be non-empty?
The problem is that the canonically equivalent characters U+00B4 ACUTE
ACCENT and U+1FFD GREEK OXIA have conflicting script properties, but a
Unicode-conformant process may freely interchange the two characters
whenever they appear as part of a string (Conformance Requirement C7).
This conflict was allowed to stand in Consensus 113-C16 back in 2007,
pending further study.
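
The canonical equivalence itself can be checked with a minimal sketch using Python's standard unicodedata module. The script values are not available from that module, so they appear only in comments (Common for U+00B4, Greek for U+1FFD, as in Scripts.txt); the names ACUTE and OXIA are just labels chosen for this illustration.

```python
import unicodedata

ACUTE = "\u00b4"  # ACUTE ACCENT, sc=Common
OXIA = "\u1ffd"   # GREEK OXIA, sc=Greek

# U+1FFD has a canonical singleton decomposition to U+00B4.
oxia_mapping = unicodedata.decomposition(OXIA)

# Normalization therefore replaces U+1FFD by U+00B4 in both NFD and NFC,
# so a normalizing process silently changes the script value seen by a
# later \p{sc=Greek} test.
oxia_nfd = unicodedata.normalize("NFD", OXIA)
oxia_nfc = unicodedata.normalize("NFC", OXIA)

print(unicodedata.name(OXIA), "->", oxia_mapping)
print(oxia_nfd == ACUTE, oxia_nfc == ACUTE)
```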
For me, the question arose in the context of regular regular
expressions for Unicode strings under canonical equivalence.
The practical workaround of using scx=Greek instead does not work, because
U+00B4 does not include Greek in its script extensions.
The only sane resolution I can see is to treat \p{sc = Greek} as the
set of characters canonically equivalent to a character with the script
property value of Greek, and similarly \p{sc ≠ Greek} as the set of
characters canonically equivalent to a character with a script property
value other than Greek. Disallowing the script property seems insane.
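
As a rough sketch of that reading, the following Python compares the literal interpretation of the two sets with one closed under canonical equivalence. The SCRIPT table is a hypothetical four-entry excerpt of Scripts.txt, the helper names are invented for illustration, and equivalence is tested only between single code points in that table by comparing NFD forms; a real implementation would need the full character database and would have to deal with sequences.

```python
import unicodedata

# Hypothetical excerpt of Scripts.txt: just enough code points for the example.
SCRIPT = {
    0x0041: "Latin",   # LATIN CAPITAL LETTER A
    0x00B4: "Common",  # ACUTE ACCENT
    0x03B1: "Greek",   # GREEK SMALL LETTER ALPHA
    0x1FFD: "Greek",   # GREEK OXIA
}

def nfd(cp):
    return unicodedata.normalize("NFD", chr(cp))

def equivalents(cp):
    """Code points in the table that are canonically equivalent to cp."""
    return {d for d in SCRIPT if nfd(d) == nfd(cp)}

def literal(script, negate=False):
    """Plain property lookup: sc = script, or sc != script if negate."""
    return {c for c in SCRIPT if (SCRIPT[c] == script) != negate}

def closed(script, negate=False):
    """Characters canonically equivalent to some member of the literal set."""
    base = literal(script, negate)
    return {c for c in SCRIPT if equivalents(c) & base}

literal_both = literal("Greek") & literal("Greek", negate=True)
closed_both = closed("Greek") & closed("Greek", negate=True)

print(sorted(f"U+{c:04X}" for c in literal_both))
print(sorted(f"U+{c:04X}" for c in closed_both))
```

Under the literal reading the intersection is empty; under the closed reading it contains exactly U+00B4 and U+1FFD, so the two equivalent characters are treated alike.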
Richard.