UCA question / Produce Collation Element Arrays

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Sat Dec 2 13:52:15 CST 2017


On Sat, 2 Dec 2017 16:25:30 +0100
Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:

> Suppose that you have the following, where S are starters and n are
> non-starters. | represents the current position.
> 
> | S1 S2 S3 n1 n2 n3 n4 S4
> 
> S1 S2 isn't in the CET, so you emit and logically change the input.
> I'll represent that as:
> 
> w(S1) | S2 S3 n1 n2 n3 n4 S4

One subtle nitpick here.  One also has to eliminate <S1 S2 S3>, <S1 S2
S3 n1>, ... and <S1 S2 S3 n1 n2 n3 n4 S4> before one can conclude that
the relevant collating element is <S1>.  I do this by recording, for
each collating element and each prefix of a collating element, whether
it is a proper prefix of a longer collating element.  This sort of
tagging is not logically necessary, but is practically very useful.
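
To make the tagging concrete, here is a minimal sketch in Python.  It
is not taken from any actual implementation; the names
build_prefix_table and longest_contiguous_match are made up for
illustration, and it handles only the contiguous longest match, not the
later step that pulls in non-adjacent non-starters.  It assumes that a
code point with no entry falls back to a one-code-point match (implicit
weights).

```python
def build_prefix_table(elements):
    """Map every collating element and every proper prefix of one to
    a pair [is_element, is_prefix_of_longer_element]."""
    table = {}
    for elem in elements:
        elem = tuple(elem)
        # The full sequence is a collating element in its own right.
        table.setdefault(elem, [False, False])[0] = True
        # Each proper prefix is tagged as extendable, whether or not
        # that prefix is itself a collating element.
        for k in range(1, len(elem)):
            table.setdefault(elem[:k], [False, False])[1] = True
    return table


def longest_contiguous_match(table, text, pos):
    """Return the length of the longest collating element that starts
    at text[pos], using the prefix tags to decide when to stop."""
    best = 1  # assumed fallback: a single code point
    length = 1
    while pos + length <= len(text):
        entry = table.get(tuple(text[pos:pos + length]))
        if entry is None:
            break  # neither an element nor a prefix of one
        is_element, is_prefix = entry
        if is_element:
            best = length
        if not is_prefix:
            break  # nothing longer can match
        length += 1
    return best


# Hypothetical table in which "abc" is an element but "ab" is not.
table = build_prefix_table(["a", "b", "c", "abc"])
print(longest_contiguous_match(table, "abc", 0))
print(longest_contiguous_match(table, "abd", 0))
```

Because "ab" is recorded as a prefix even though it is not itself a
collating element, the search keeps going past it and can still find
"abc"; without the tag it would have to probe every longer candidate, or
risk stopping too early.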

The simplest example of this issue in the DUCET is <U+0FB2 U+0F71
U+0F80>.  Or is a conformant implementation of the UCA allowed to reject
DUCET even if one can find a way to specify that it be used?  There's
no explicit concession that a CET has to be well-formed.

Richard.


