UCA question / Produce Collation Element Arrays

Mark Davis ☕️ via CLDR-Users cldr-users at unicode.org
Sun Dec 3 06:36:57 CST 2017


The algorithm is predicated on any input table being well-formed
(http://unicode.org/reports/tr10/#Well-Formed).

Tibetan is a documented exception in the DUCET, but UTS #10 also documents
how to fix it.
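
To make the Tibetan case concrete, here is a rough sketch of a check for the
prefix condition as I read it: a contraction of more than two code points that
ends in a non-starter should have its first N-1 code points present as a
contraction too. The function name and the tiny key list are made up for
illustration; only the code points come from this thread, and this is not how
any particular implementation does it.

```python
import unicodedata
from typing import Iterable, List, Tuple

Key = Tuple[int, ...]  # a table key as a sequence of code points


def missing_prefix_contractions(keys: Iterable[Key]) -> List[Key]:
    """Return the N-1 prefixes that the table would need but lacks.

    Hypothetical helper: only looks at keys longer than two code points
    whose last code point has a non-zero canonical combining class.
    """
    present = set(keys)
    missing = set()
    for key in present:
        ends_in_non_starter = unicodedata.combining(chr(key[-1])) != 0
        if len(key) > 2 and ends_in_non_starter:
            prefix = key[:-1]
            if prefix not in present:
                missing.add(prefix)
    return sorted(missing)


# Toy excerpt (keys only, weights omitted): the three single code points
# plus the three-code-point contraction mentioned in this thread.
toy_keys = [(0x0FB2,), (0x0F71,), (0x0F80,), (0x0FB2, 0x0F71, 0x0F80)]
gaps = missing_prefix_contractions(toy_keys)
```

On the toy list this reports <U+0FB2 U+0F71> as the missing contraction. My
understanding of the documented repair is that the missing key is added with
the collation elements of its individual code points, but check the report
rather than taking this sketch as authoritative.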

Mark <https://twitter.com/mark_e_davis>

On Sat, Dec 2, 2017 at 8:52 PM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Sat, 2 Dec 2017 16:25:30 +0100
> Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:
>
> > Suppose that you have the following, where S are starters and n are
> > non-starters. | represents the current position.
> >
> > | S1 S2 S3 n1 n2 n3 n4 S4
> >
> > S1 S2 isn't in the CET, so you emit and logically change the input.
> > I'll represent that as:
> >
> > w(S1) | S2 S3 n1 n2 n3 n4 S4
>
> One subtle nitpick here.  One also has to eliminate <S1 S2 S3>, <S1 S2
> S3 n1>, ... and <S1 S2 S3 n1 n2 n3 n4 S4> before one can conclude that
> the relevant collating element is <S1>.  I do this by recording whether
> each collating element and prefix of a collating element is the prefix
> of a collating element.  This sort of tagging is not logically
> necessary, but is practically very useful.
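
The tagging described above might look roughly like the sketch below. It is a
minimal illustration, not Richard's actual code: the names
`build_prefix_flags` and `longest_match` and the toy table are hypothetical,
and it handles only contiguous matches (no discontiguous matching over
non-starters).

```python
from typing import Dict, List, Tuple

Key = Tuple[str, ...]  # a collating element as a sequence of characters


def build_prefix_flags(table: Dict[Key, List[int]]) -> Dict[Key, bool]:
    """Tag every key and every proper prefix of a key.

    The flag is True when the sequence is a proper prefix of some key,
    i.e. when reading further input could still produce a longer match.
    """
    flags: Dict[Key, bool] = {key: False for key in table}
    for key in table:
        for n in range(1, len(key)):
            flags[key[:n]] = True
    return flags


def longest_match(seq: List[str], pos: int,
                  table: Dict[Key, List[int]],
                  flags: Dict[Key, bool]) -> Key:
    """Return the longest contiguous key of `table` starting at seq[pos].

    Falls back to the single character at `pos` when nothing matches.
    """
    best: Key = (seq[pos],)
    end = pos + 1
    while end <= len(seq):
        candidate = tuple(seq[pos:end])
        if candidate in table:
            best = candidate
        # Stop as soon as no longer key can start with this candidate.
        if not flags.get(candidate, False):
            break
        end += 1
    return best


# Toy table with made-up weights: <S1> and a four-element contraction.
toy_table = {("S1",): [1], ("S2",): [2], ("S1", "S2", "S3", "n1"): [9]}
toy_flags = build_prefix_flags(toy_table)
```

With this table, `longest_match` on the input S1 S2 S3 n2 has to look past
<S1 S2> and <S1 S2 S3> before it can settle on <S1>, which is the point of the
nitpick; the flags are what let it stop at <S1 S2 S3 n2> without scanning the
rest of the string.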
>
> The simplest example of this issue in the DUCET is <U+0FB2 U+0F71
> U+0F80>.  Or is a conformant implementation of the UCA allowed to reject
> DUCET even if one can find a way to specify that it be used?  There's
> no explicit concession that a CET has to be well-formed.
>
> Richard.