UCA question / Produce Collation Element Arrays

Sat Dec 2 09:25:30 CST 2017

Suppose that you have the following, where the S's are starters and the n's
are non-starters, and | represents the current position.

| S1 S2 S3 n1 n2 n3 n4 S4

S1 S2 isn't in the CET, so you emit the weight for S1 alone and logically
change the input. I'll represent that as:

w(S1) | S2 S3 n1 n2 n3 n4 S4

S2 S3 is in the CET, so set S to it. I'll show S as [ ... ]:

w(S1) [ S2 S3 ] | n1 n2 n3 n4 S4

You then successively look through each of the n's.

Suppose S2 S3 n1 isn't in the CET, so you continue.
Suppose S2 S3 n2 is in the CET, but n2 is blocked, so you also continue.
Suppose S2 S3 n3 is in the CET, and n3 is not blocked, so you set S to
them.

Logically, the input list now looks like the following:

w(S1) [ S2 S3 n3 ] n1 n2 | n4 S4

Suppose S2 S3 n3 n4 is in the CET, and n4 is not blocked, so you set S to
them. You now have:

w(S1) [ S2 S3 n3 n4 ] n1 n2 | S4

You have run out of non-starters, so you stop, emit w(S2 S3 n3 n4), and
reset the current position to just after the contiguous part of the match
(S2 S3).

w(S1) w(S2 S3 n3 n4) | n1 n2 S4

So the next item you consider is n1.
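
The walk-through above can be sketched in Python. This is only a toy
illustration, not a real UCA implementation: the CET contents, the ccc values
(220 for n1 and n2, 230 for n3 and n4) and the function name
collation_elements are all assumptions chosen so that n2 is blocked and n3
and n4 are not. Weights are shown as labels such as "w(S1)".

```python
# Toy data: every name and value here is an assumption for illustration.
CCC = {"S1": 0, "S2": 0, "S3": 0, "S4": 0,
       "n1": 220, "n2": 220, "n3": 230, "n4": 230}

CET = {("S1",), ("S2",), ("S3",), ("S4",),
       ("n1",), ("n2",), ("n3",), ("n4",),
       ("S2", "S3"), ("S2", "S3", "n2"),
       ("S2", "S3", "n3"), ("S2", "S3", "n3", "n4")}


def collation_elements(text, cet=CET, ccc=CCC):
    items = list(text)   # working copy that is "logically" modified
    out = []
    pos = 0
    while pos < len(items):
        # S2.1: longest contiguous match starting at the current position.
        for end in range(len(items), pos, -1):
            if tuple(items[pos:end]) in cet:
                break
        else:
            raise KeyError(items[pos])  # the toy table has no implicit weights
        match = items[pos:end]

        # S2.1.1 to S2.1.3: try to extend S with following non-starters.
        skipped = []     # non-starters passed over, still in the input
        i = end
        while i < len(items) and ccc[items[i]] != 0:
            c = items[i]
            # Only skipped items can block c; items already moved into S
            # have been removed from the input and are not considered.
            unblocked = all(ccc[b] < ccc[c] for b in skipped)
            if unblocked and tuple(match + [c]) in cet:
                match.append(c)
                del items[i]      # "replace S by S + C, and remove C"
            else:
                skipped.append(c)
                i += 1
        out.append("w(" + " ".join(match) + ")")
        pos = end        # skipped non-starters come next
    return out


print(collation_elements(["S1", "S2", "S3", "n1", "n2", "n3", "n4", "S4"]))
# ['w(S1)', 'w(S2 S3 n3 n4)', 'w(n1)', 'w(n2)', 'w(S4)']
```

Because matched non-starters are deleted from the working list, resuming at
the end of the contiguous match leaves n1 as the next item, as in the
walk-through.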

There is just one subtlety. Notice that when considering whether n4 is
blocked, you don't consider the items you have already put into S (here n3);
only the skipped items n1 and n2 count. So n3 and n4 can have the same ccc.
Normally people don't actually modify the input stream, so thinking n4 is
blocked is an easy error to make.
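
For an implementation that leaves the input untouched, one way to avoid that
error is to remember which positions have already been absorbed into S and
skip them in the blocking test. A minimal sketch, assuming a hypothetical
helper is_unblocked, a set of consumed indexes, and illustrative ccc values
(none of these come from TR10):

```python
def is_unblocked(items, ccc, match_end, i, consumed):
    """Return True if items[i] is an unblocked non-starter.

    match_end is the index just past the contiguous part of S;
    consumed holds indexes of non-starters already added to S.
    """
    c = ccc[items[i]]
    for j in range(match_end, i):
        if j in consumed:
            continue              # already part of S, so it cannot block
        b = ccc[items[j]]
        if b == 0 or b >= c:      # blocking context
            return False
    return True


# Illustrative values only: n1, n2 have ccc 220; n3, n4 have ccc 230.
ccc = {"S2": 0, "S3": 0, "n1": 220, "n2": 220, "n3": 230, "n4": 230}
items = ["S2", "S3", "n1", "n2", "n3", "n4"]

# Correct: n3 (index 4) is already in S, so it is ignored.
print(is_unblocked(items, ccc, 2, 5, consumed={4}))    # True
# The easy error: forgetting that n3 was consumed makes n4 look blocked.
print(is_unblocked(items, ccc, 2, 5, consumed=set()))  # False
```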

Mark <https://twitter.com/mark_e_davis>

On Sat, Dec 2, 2017 at 1:32 PM, Kip Cole via CLDR-Users <
cldr-users at unicode.org> wrote:

> Markus and co, probably another dumb question but I’m making progress. In
> section 7.2 of TR10 the algorithm for producing a CE array says:
>
> > S2.1 Find the longest initial substring S at each point that has a match
> > in the collation element table.
> >
> > S2.1.1 If there are any non-starters following S, process each
> > non-starter C.
> >
> > S2.1.2 If C is an unblocked non-starter with respect to S, find if S + C
> > has a match in the collation element table.
> >
> > Note: This condition is specific to non-starters, and is not precisely
> > the same as the concept of blocking in normalization, since it is dealing
> > with look ahead for a discontiguous match, rather than with normalization
> > forms. Hangul jamos and other starters are only supported with contiguous
> > matches.
> >
> > S2.1.3 If there is a match, replace S by S + C, and remove C.
> >
>
> For S2.1.1 I’m trying to confirm what “process each non-starter C” means.
> As best I understand so far, it means “ignore” or “skip” all C that are
> non-starters. Is that the correct interpretation? It would seem to be
> consistent with the annotation:
>
> Steps 2.1.1 “process each non-starter C” and 2.1.2 “find if S + C has a
> match in the table”, where one or more intermediate non-starters may be
> skipped (making it discontiguous), extends a contraction match by one code
> point at a time to find the next match. In particular, if C is a
> non-starter and if the table had a mapping for ABC but not one for AB, then
> a discontiguous-contraction match on text ABMC (with M being a skippable
> non-starter) would never be found. Well-formedness condition 5 requires the
> presence of the prefix contraction AB.