The Unicode Standard and ISO

Fri Jun 8 12:41:20 CDT 2018

On Fri, 8 Jun 2018 13:40:21 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:

> Mark
> 
> On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:  
> 
> > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> > Marcel Schneider via Unicode <unicode at unicode.org> wrote:
> >  
> > > Thank you for confirming. All witnesses concur to invalidate the
> > > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > > After being invented in its actual form, sorting was standardized
> > > simultaneously in ISO/IEC 14651 and in Unicode Collation
> > > Algorithm, the latter including practice‐oriented extra
> > > features.  
> >
> > The UCA contains features essential for respecting canonical
> > equivalence.  ICU works hard to avoid the extra effort involved,
> > apparently even going to the extreme of implicitly declaring that
> > Vietnamese is not a human language.  

> A bit over the top, eh?

Then remove the "no known language" from the bug list, or declare that
you don't know SE Asian languages.

The root problem is that the UCA cannot handle syllable by syllable
comparisons; if the UCA could handle that, the correct collation of
unambiguous true Lao would become simple.  The CLDR algorithm provides
just enough memory to make Lao collation possible; however, ICU isn't
fast enough to load a collation from customisation - it takes hours!
One could probably do better if one added suffix contractions, but
adding that capability might be a nightmare.
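
To make the syllable-by-syllable point concrete, here is a minimal sketch
under loud assumptions: the weight table W is invented (it is not DUCET or
CLDR data), syllables are assumed to be separated by spaces as in
Vietnamese, and the flat key simply ignores the space. It only shows how a
whole-string multi-level comparison and a per-syllable comparison can
disagree; it is not how ICU or the UCA is implemented.

```python
# Hypothetical weights: (primary, secondary). Not taken from DUCET or CLDR.
W = {"a": (1, 0), "\u00e1": (1, 1), "b": (2, 0), "c": (3, 0)}


def flat_key(text):
    """Whole-string key: every primary weight first, then every secondary."""
    chars = [ch for ch in text if ch != " "]  # toy: the space is ignored
    return ([W[ch][0] for ch in chars], [W[ch][1] for ch in chars])


def syllable_key(text):
    """Per-syllable key: each syllable is fully compared before the next."""
    return [flat_key(syllable) for syllable in text.split(" ")]


pair = ["\u00e1 b", "a c"]
flat_order = sorted(pair, key=flat_key)          # accent decided last
syllable_order = sorted(pair, key=syllable_key)  # accent decided in syllable 1
print(flat_order, syllable_order)
```

With these invented weights the flat key puts "á b" first, because the
primary difference b < c outranks the accent, while the per-syllable key
puts "a c" first, because the accent is settled inside the first syllable.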

> I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868,
> which nicely outlines a proposal for dealing with a number of
> problems with Vietnamese.

It still includes a brute force work-around.

> We clearly don't support every sorting feature that various
> dictionaries and agencies come up with. Sometimes it is because we
> can't (yet) see a good way to do it:

>    1. it might be not determinant: many governmental standards or
> style sheets require "interesting" sorting, such as determining that
> "XI" is a roman numeral (not the president of China) and sorting as
> 11, or when "St." is meant to be Street *and* when meant to be Saint
> (St. Stephen's St.)

I believe the first is a character identity issue.  Some of us
see the difference between U+0058 LATIN CAPITAL LETTER X and the
discouraged U+2169 ROMAN NUMERAL TEN as more than just a round-tripping
difference.  For example, by hand, I write the 'V' in 'Henry V' with a
regnal number quite differently to 'Henry V.' where 'V' is short for a
name.
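
For reference, the two characters can be compared with Python's standard
unicodedata module. This is only a sketch of what the character database
records; the exact data depends on the Unicode version bundled with the
interpreter, and the "Henry" string is just the example from above.

```python
import unicodedata

x = "\u0058"    # LATIN CAPITAL LETTER X
ten = "\u2169"  # ROMAN NUMERAL TEN

for ch in (x, ten):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),              # Lu versus Nl
        unicodedata.numeric(ch, None),         # None versus 10.0
        unicodedata.decomposition(ch) or "-",  # compatibility mapping, if any
    )

# Canonical normalisation keeps the distinction; compatibility
# normalisation folds the numeral into the ordinary letter.
regnal = "Henry \u2164"  # U+2164 ROMAN NUMERAL FIVE
kept = unicodedata.normalize("NFC", regnal)
folded = unicodedata.normalize("NFKC", regnal)
```

So the numeral carries a numeric value and a letter-number category that
the plain letter lacks, and that information disappears once a process
applies NFKC or NFKD.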

> > > Since then,
> > > these two standards are kept in synchrony uninterruptedly.  

> > But the consortium has formally dropped the commitment to DUCET in
> > CLDR.  Even when restricted to strings of assigned characters, the
> > CLDR and ICU no longer make the effort to support the DUCET
> > collation. Indeed, I'm not even sure that the DUCET is a tailoring
> > of the root CLDR collation, even when restricted to assigned
> > characters.  Tailorings tend to have odd side effects; fortunately,
> > they rarely if ever matter. CLDR root is a rewrite with
> > modifications of DUCET; it has changes that are prohibited as
> > 'tailorings'! 

> CLDR does make some tailorings to the DUCET to create its root
> collation, notably adding special contractions of private use
> characters to allow for tailoring support and indexes [
> http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt
> ]  plus the rearrangement of some characters (mostly punctuation and
> symbols) to allow runtime parametric reordering of groups of
> characters (eg to put numbers after letters) [
> http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters
> ].

My main point is that for practical purposes (i.e. ICU), Unicode has
moved away from ISO/IEC 14651.  The difference is small.  I didn't say
that there weren't good reasons.

>    - If there are other changes that are not well documented, or if
> you think those features are causing problems in some way, please
> file a ticket.

Well, I don't have to use DUCET, though I've found it easier for
unmaintainable tailorings.  I need to write code to apply
non-parametric LDML tailorings - ICU is, alas, ridiculously slow.  I
hope that's just a matter of optimisation balance between compiling a
tailoring and applying it.  Are there any published compliance tests
for non-parametric tailorings?  I'm not sure how one would check that an
alleged parametric reordering of numbers and letters applied to a
tailoring of DUCET was in accordance with the LDML definition, but I
don't think you want to expend money sorting that out. 
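
As a toy illustration of what a parametric reordering of numbers and
letters amounts to, and of one property a test could check, consider the
sketch below. The grouping by Python character class and the function
names are assumptions for illustration only; real LDML reordering operates
on ranges of primary weights in the root collation, and this does not
reproduce it.

```python
# Toy model only: groups are Python character classes, not LDML
# reordering groups, and the per-character weight is just the casefolded
# character itself.
def group_of(ch):
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def reordered_key(text, order=("other", "digit", "letter")):
    rank = {group: i for i, group in enumerate(order)}
    return [(rank[group_of(ch)], ch.casefold()) for ch in text]


def letters_first_key(text):
    return reordered_key(text, order=("other", "letter", "digit"))


words = ["b2", "a1", "1a", "2b"]
digits_first = sorted(words, key=reordered_key)
letters_first = sorted(words, key=letters_first_key)
print(digits_first)   # ['1a', '2b', 'a1', 'b2']
print(letters_first)  # ['a1', 'b2', '1a', '2b']
```

One invariant a compliance test might assert is that strings starting in
the same group keep their relative order under any reordering; here "1a"
precedes "2b" and "a1" precedes "b2" in both results.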

>    - If there is a particular change that you think is not conformant
> to UCA, please also file that.

Sorry, I must have scanned the conformance requirements too quickly.  I
had got it into my head that someone had recklessly required that
tailorings be in accordance with LDML.  That constraint only applies
to parametric tailorings, so any properly structured unambiguously
defined finite complete set of weights (albeit some implicit) is a
tailoring of UCA.  Formally, the CLDR root collation uses prefix
weights, but using the CLDR collation algorithm on the CLDR root
collation is equivalent to using the UCA.  (This isn't always so - my
tailoring for Lao using the CLDR collation algorithm is not equivalent
to using the UCA on a finite table of weights.)
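
The idea of a collation defined by a finite table of weights can be
sketched as follows. The table is hypothetical (the values are not from
DUCET or the CLDR root), every character maps to exactly one collation
element, and there are no contractions, ignorable weights, or implicit
weights, all of which a real UCA implementation has to handle.

```python
# Hypothetical three-level weights (primary, secondary, tertiary).
WEIGHTS = {
    "a": (1, 1, 1), "A": (1, 1, 2),
    "\u00e1": (1, 2, 1),               # a with acute: secondary difference
    "b": (2, 1, 1), "B": (2, 1, 2),
}


def sort_key(text):
    """UCA-style key: all primaries, then all secondaries, then tertiaries."""
    elements = [WEIGHTS[ch] for ch in text]
    return tuple(tuple(e[level] for e in elements) for level in range(3))


words = ["ba", "\u00e1b", "Ab", "ab"]
ordered = sorted(words, key=sort_key)
print(ordered)  # ['ab', 'Ab', 'áb', 'ba']
```

Any such table that is complete and unambiguous yields a total order this
way; the Lao case mentioned above is exactly one where a plain
table lookup of this kind is not enough.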

Richard.