Unicode Properties and Canonical Equivalence

Mon Aug 15 21:08:50 CDT 2022

On Mon, 15 Aug 2022 11:38:24 -0700
Markus Scherer via Unicode <unicode at corp.unicode.org> wrote:

> On Thu, Aug 11, 2022 at 10:21 PM Richard Wordingham via Unicode <
> unicode at corp.unicode.org> wrote:  
> 
> > May a process conforming to Unicode requirement C6 (TUS Section
> > 3.2), "A process shall not assume that the interpretations of two
> > canonical-equivalent character sequences are distinct", consider the
> > Unicode set
> >
> > [\p{sc = Greek}&&\p{sc ≠ Greek}]
> >
> > to be non-empty?
> >  
> 
> Regardless of other considerations, a set and its inverse are
> disjoint.

You're now asserting that \P{prop = val} and \p{prop ≠ val} are
synonymous. To give a clear concrete example, I couldn't find a
definition of \p{scx ≠ Beng}.  Does this contain U+0964 DEVANAGARI
DANDA, which is in \p{scx = Beng}?
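
The two readings of "≠" at stake here can be sketched in Python. This is only a toy illustration: the table lists just a subset of the scx values of U+0964, the function names are hypothetical, and the second reading is merely one conceivable interpretation, not a definition taken from any specification.

```python
# Toy Script_Extensions table (assumption: illustrative subset only;
# the real scx set of U+0964 contains more scripts than listed here).
SCX = {
    0x0964: {"Beng", "Deva"},   # DEVANAGARI DANDA
    0x0995: {"Beng"},           # BENGALI LETTER KA
    0x0041: {"Latn"},           # LATIN CAPITAL LETTER A
}

def scx_eq(val):
    # \p{scx = val}: code points whose scx set contains val.
    return {cp for cp, scripts in SCX.items() if val in scripts}

def scx_ne_as_complement(val):
    # Reading 1: \p{scx != val} is the inverse set, \P{scx = val}.
    return set(SCX) - scx_eq(val)

def scx_ne_as_some_other(val):
    # Reading 2 (hypothetical): at least one scx value differs from val.
    return {cp for cp, scripts in SCX.items() if scripts - {val}}
```

Under reading 1, U+0964 is excluded and the two sets are disjoint; under reading 2, U+0964 is in both \p{scx = Beng} and \p{scx ≠ Beng}.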

Perhaps then it would be less confusing to ask whether

[\p{sc = Greek}&&\p{sc = Common}]

may be considered to be non-empty.

> > The problem is that the canonically equivalent characters U+00B4 ACUTE
> > ACCENT and U+1FFD GREEK OXIA have conflicting script properties,
> > but a Unicode-conformant process may freely interchange the two
> > characters whenever they appear as part of a string (Conformance
> > Requirement C7). This conflict was allowed to stand in Consensus
> > 113-C16 back in 2007, pending further study.

> Would you mind providing the information that you have already
> collected? Such as the script property values for these characters,
> and what that 2007 consensus says and what it was based on; and which
> value you think we should change to what other value.

Consensus 113-C16 is recorded in L2/07-346 and reads:

"[113-C16] Consensus: Due to the need for further study, the Script
property value for 5 Greek compatibility accents will stay "Greek" in
Unicode 5.1.0: [L2/07-202]

"U+1FC1 GREEK DIALYTIKA AND PERISPOMENI

"U+1FED GREEK DIALYTIKA AND VARIA

"U+1FEE GREEK DIALYTIKA AND OXIA

"U+1FEF GREEK VARIA

"U+1FFD GREEK OXIA"

To this day, U+1FEE and U+1FFD have sc=Greek, while their singleton
decompositions U+0385 GREEK DIALYTIKA TONOS and U+00B4 ACUTE ACCENT
have sc=Common.  The other three lack singleton decompositions and
therefore present me with no formal issues, though the script
assignments can lead to the first two of them being rendered
differently in their NFC forms (Greek font) and NFD forms (Latin font).
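
The mismatch for U+1FEE and U+1FFD can be checked with a short Python sketch. Python's unicodedata module does not expose the Script property, so the sc values below are hard-coded from this message rather than read from Scripts.txt; the helper name singleton is hypothetical.

```python
import unicodedata

# Script values as stated in this message (assumption: hard-coded,
# because the standard library has no Script property lookup).
SC = {0x1FEE: "Greek", 0x1FFD: "Greek", 0x0385: "Common", 0x00B4: "Common"}

def singleton(ch):
    """Return the singleton canonical decomposition of ch, or None."""
    fields = unicodedata.decomposition(ch).split()
    # A canonical singleton has exactly one field and no <tag> prefix.
    if len(fields) == 1 and not fields[0].startswith("<"):
        return chr(int(fields[0], 16))
    return None

for cp in (0x1FEE, 0x1FFD):
    target = singleton(chr(cp))
    # NFC never restores a singleton, so the Greek character disappears.
    assert unicodedata.normalize("NFC", chr(cp)) == target
    print(f"U+{cp:04X} sc={SC[cp]} -> U+{ord(target):04X} sc={SC[ord(target)]}")
```

Each printed line pairs a character with sc=Greek against a canonically equivalent character with sc=Common.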

I am still working on generating a modern-day equivalent (for Unicode
14.0, ideally also candidate Unicode 15.0) of the anomaly report
L2/07-071.

UTC minutes do not record technical reasoning.  I presume the problem
is that some stand-alone Greek accents would be treated as Greek and
others as Common (raised by Ken Whistler in L2/07-202).  I note that
this could lead to them being rendered using different fonts, especially
when in different paragraphs in plain text. There may be other issues.

For properties where a character and its singleton decomposition have
different property values, it has occurred to me that one solution
would be to support two derived properties instead:

1) The property value of the character's iterated singleton
decomposition if it has one, otherwise the character's own value.
2) The set of the values for the character and everything of which it
is a singleton decomposition.

In the case of property sc, I am wondering whether to notate them as,
say, sc_i and sc_s, or whether to reuse the name sc for one of them.
This takes me back to the original question.  (Brevity is useful as I
often use property-based regular expressions at the command line.)
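
A minimal Python sketch of the two derived properties follows. It assumes that sc_i names property 1 (iterated decomposition) and sc_s names property 2 (set of values), and it counts a character as a source if it reaches the target through any number of singleton steps. The script table is a toy one built from the values quoted in this message; every other code point is reported as "Unknown".

```python
import unicodedata
from collections import defaultdict

# Toy Script table (assumption: only the four values cited above).
SC = {0x1FEE: "Greek", 0x1FFD: "Greek", 0x0385: "Common", 0x00B4: "Common"}

def sc(ch):
    return SC.get(ord(ch), "Unknown")

def singleton(ch):
    """Return the singleton canonical decomposition of ch, or None."""
    fields = unicodedata.decomposition(ch).split()
    if len(fields) == 1 and not fields[0].startswith("<"):
        return chr(int(fields[0], 16))
    return None

def sc_i(ch):
    """Property 1: sc of the end of the singleton decomposition chain."""
    nxt = singleton(ch)
    while nxt is not None:
        ch, nxt = nxt, singleton(nxt)
    return sc(ch)

# Inverse map: target -> characters that reach it via singleton steps.
SOURCES = defaultdict(set)
for cp in range(0x110000):
    cur = singleton(chr(cp))
    while cur is not None:
        SOURCES[cur].add(chr(cp))
        cur = singleton(cur)

def sc_s(ch):
    """Property 2: sc values of ch and of everything decomposing to it."""
    return {sc(ch)} | {sc(src) for src in SOURCES.get(ch, ())}
```

With this toy table, sc_i gives Common for both U+1FFD and U+00B4, while sc_s of U+00B4 contains both Common and Greek.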

Richard.