Demonstrating Non-compliance to C6 (No distinct Interpretations)

Richard Wordingham richard.wordingham at ntlworld.com
Sun Jul 6 06:04:41 CDT 2014


How does one establish non-compliance of a process with Conformance
Requirement C6, "A process shall not assume that the interpretations of
two canonical-equivalent character sequences are distinct"?

The problems I have are:

1. It is not sufficient to demonstrate that the process interprets
canonically equivalent character sequences differently.

2. Non-compliance therefore appears to involve a mental activity (an
assumption), not just observable behaviour.  For example, the following
snippet is non-compliant by virtue of the comment:

# Function f should perform uppercasing.
if s1 is canonically equivalent to s2 and
   f(s1) is not canonically equivalent to f(s2) then
      print s1, " and ", s2, " are c.e. but f converts them to ", f(s1),
      " and ", f(s2)
endif

because it requires that if f() ever interprets a pair of canonically
equivalent strings differently, it shall always interpret them
differently.

If I remove the comment, it seems that the snippet might be compliant,
for it is no longer clear that f() 'interprets' its argument.

If I buffer the function values, then it seems to be compliant,
especially if I add a comment:

# Function f should perform uppercasing.
u1 = f(s1);
u2 = f(s2);
if s1 is canonically equivalent to s2 and
   u1 is not canonically equivalent to u2 then
# This message should never be generated.
      print s1, " and ", s2, " are c.e. but f converts them to ", u1,
      " and ", u2
endif
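
For concreteness, the buffered check might be sketched in Python as
follows.  This is only an illustration under stated assumptions: it
tests canonical equivalence by comparing NFD forms, and it uses
str.upper() as a stand-in for f.  The names canonically_equivalent and
check are hypothetical and are not part of any engine discussed here.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent iff their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def f(s: str) -> str:
    # Function f should perform uppercasing.
    # Assumption: Python's str.upper() stands in for the real f.
    return s.upper()


def check(s1: str, s2: str):
    # Buffer the function values before comparing them.
    u1 = f(s1)
    u2 = f(s2)
    if canonically_equivalent(s1, s2) and not canonically_equivalent(u1, u2):
        # This message should never be generated.
        return f"{s1!r} and {s2!r} are c.e. but f converts them to {u1!r} and {u2!r}"
    return None


# Precomposed e-acute versus e followed by COMBINING ACUTE ACCENT.
message = check("\u00e9", "e\u0301")
```

For this pair the inputs are canonically equivalent and so are the
uppercased results, so check() returns None.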

Have I understood C6?

The background is that I am writing a regular expression engine for
equivalence classes of strings under canonical equivalence and I
realised that there was a novel issue in the choice of 'longest
leftmost' when matching the pattern \p{ccc=0}\p{ccc≠0}.  Would using
character fragment positions in an unnormalised input string make my
engine non-compliant with the Unicode standard?  I think the 'practical'
answer is that using these positions alone makes the selection of
matching strings ill-defined as an operation on equivalence classes,
and so it should not be offered as an option.
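
The ill-definedness can be illustrated with a small Python sketch.  It
is a naive code-point-level matcher written only for this example, not
the engine described above; the name first_starter_plus_mark is
hypothetical.  It picks the leftmost starter followed by a non-starter
using positions in the string exactly as given, and is applied to two
canonically equivalent spellings of the same text.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent iff their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def first_starter_plus_mark(text: str):
    # Naive match of \p{ccc=0}\p{ccc≠0} on raw code point positions:
    # a starter (ccc = 0) immediately followed by a non-starter (ccc != 0).
    for i in range(len(text) - 1):
        if (unicodedata.combining(text[i]) == 0
                and unicodedata.combining(text[i + 1]) != 0):
            return text[i:i + 2]
    return None


# 'a' with dot below (ccc=220) and acute (ccc=230), in both orders.
s1 = "a\u0323\u0301"
s2 = "a\u0301\u0323"

m1 = first_starter_plus_mark(s1)  # 'a' + COMBINING DOT BELOW
m2 = first_starter_plus_mark(s2)  # 'a' + COMBINING ACUTE ACCENT

inputs_equivalent = canonically_equivalent(s1, s2)
matches_equivalent = canonically_equivalent(m1, m2)
```

Here the two inputs are canonically equivalent but the two selected
matches are not, so a selection rule based on positions in the
unnormalised string does not define an operation on equivalence
classes.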

Richard.


