Vietnamese Collation Wrong for Polysyllables
Mark Davis ☕️ via CLDR-Users
cldr-users at unicode.org
Thu Jan 11 03:03:31 CST 2018
You raise a good issue.
The first question I'd have is whether the lexical ordering described in
https://en.wikipedia.org/wiki/Vietnamese_alphabet#Tone_marks is expected by
average Vietnamese. We have seen before cases where a formal government
specification (French accent ordering) is expected by nobody outside of a
small group of mavens.
Assuming it is required ...
This may just be a case where the UCA doesn't work well enough without
preprocessing. The standard does allow for such preprocessing, and the
question is how to allow for that in CLDR data. One way I can think of is
to allow a transform, applied to the text before sorting, to be
specified in the UCA description. For speed, implementations would probably
do that in code, but we could have a data representation that could be used
in a reference implementation. It appears that Vietnamese syllables are
well structured, which could allow a relatively simple transform to do the
job, along the lines of what you are suggesting. Even in code, it would
still probably make the sorting considerably slower than for other
languages, so we might want to offer two variants for sorting. In XML,
something like the following (X,Y are just for illustration):
XY → YX;
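
As a rough illustration of what such preprocessing could achieve, here is a
minimal Python sketch. It does not implement the XY → YX rule literally;
instead it builds a sort key per syllable (letters first, then tone), which
gives the same effect as moving the tone mark to the end of the syllable.
The names word_key and syllable_key, the tone order in TONES, and the
treatment of hyphens and spaces as syllable boundaries are all assumptions
made for this example, not part of CLDR or the UCA.

```python
import re
import unicodedata

# Assumed tone order: none, grave, hook above, tilde, acute, dot below.
TONES = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]
# Vietnamese base letters; other letters sort after these by code point.
ALPHABET = "aăâbcdđeêghiklmnoôơpqrstuưvxy"


def syllable_key(syllable):
    """Return (letter ranks, tone rank) for one syllable."""
    letters, tone = [], 0
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONES[1:]:
            tone = TONES.index(ch)      # pull the tone mark out of the syllable
        else:
            letters.append(ch)          # keep base letters and vowel-quality marks
    base = unicodedata.normalize("NFC", "".join(letters))
    ranks = tuple(
        ALPHABET.index(c) if c in ALPHABET else len(ALPHABET) + ord(c)
        for c in base
    )
    return (ranks, tone)


def word_key(word):
    """Compare syllable by syllable: letters, then tone, then next syllable."""
    syllables = [s for s in re.split(r"[\s\-]+", word) if s]
    return tuple(syllable_key(s) for s in syllables)


words = ["Á-Căn-Đình", "A-Phú-Hản"]
print(sorted(words, key=word_key))
```

With this key the tone of the first syllable is compared before any letter
of the second syllable, so <A-Phú-Hản> sorts before <Á-Căn-Đình>. Case
differences are ignored here for brevity.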
Another question I'd have is whether there are any changes to the CLDR
rules for Vietnamese that would make the ordering "closer" to what is
required, without such a transform or a gazillion collation rules. For
example, would making the tone-marks primary differences produce a result
that is closer?
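
One way to probe that question is to compare a flat ordering in which the
tone is part of each letter's primary weight against the syllable-by-syllable
ordering described on the Wikipedia page. The Python sketch below is only an
experiment under stated assumptions: the helper names, the tone order in
TONES, and the use of hyphens and spaces as syllable boundaries are all
hypothetical choices for illustration, and neither function reproduces the
actual CLDR tailoring.

```python
import re
import unicodedata

# Assumed tone order: none, grave, hook above, tilde, acute, dot below.
TONES = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]
ALPHABET = "aăâbcdđeêghiklmnoôơpqrstuưvxy"


def letter_and_tone(ch):
    """Split one precomposed letter into (base-letter rank, tone rank)."""
    marks = unicodedata.normalize("NFD", ch)
    tone = max((TONES.index(m) for m in marks if m in TONES[1:]), default=0)
    base = unicodedata.normalize(
        "NFC", "".join(m for m in marks if m not in TONES[1:])
    )
    rank = ALPHABET.index(base) if base in ALPHABET else len(ALPHABET) + ord(base[0])
    return rank, tone


def flat_key(word):
    """Tone treated as part of each letter's primary weight, no syllables."""
    text = unicodedata.normalize("NFC", word.lower())
    return [letter_and_tone(ch) for ch in text if ch.isalpha()]


def syllable_key(word):
    """Per syllable: all letters first, then the syllable's tone."""
    key = []
    for syllable in re.split(r"[\s\-]+", word):
        pairs = flat_key(syllable)
        if pairs:
            letters = tuple(rank for rank, _ in pairs)
            tone = max(tone for _, tone in pairs)
            key.append((letters, tone))
    return tuple(key)


for a, b in [("A-Phú-Hản", "Á-Căn-Đình"), ("án", "ang")]:
    print(a, b, flat_key(a) < flat_key(b), syllable_key(a) < syllable_key(b))
```

Under these assumptions the flat tone-primary key does put <A-Phú-Hản>
before <Á-Căn-Đình>, but it disagrees with the syllable-based key on pairs
such as <án> and <ang>, where a final consonant follows the toned vowel.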
In any event, we'd want to involve Vietnamese experts before going any
further.
On Thu, Jan 11, 2018 at 1:15 AM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:
> Is it in order for me to raise a ticket to report that the CLDR
> Vietnamese collation is wrong for polysyllabic words? For example, it
> sorts <Á-Căn-Đình> 'Argentina/e' before <A-Phú-Hản> 'Afghan(istan)',
> whereas <A-Phú-Hản> comes before <Á-Căn-Đình> on p1 of the 2016
> edition of 'Tuttle Compact Vietnamese Dictionary: Vietnamese-English
> English-Vietnamese'. The dictionary looks right - except that it has
> transposed the order of acute and grave accents!
> I know exactly what is wrong for this example - the final paragraph of
> https://en.wikipedia.org/wiki/Vietnamese_alphabet#Tone_marks explains
> how Vietnamese collation works with the tone marks. The key message
> is, "Ordering according to primary and secondary differences proceeds
> syllable by syllable". Thus <A> and <Á> have a primary difference in
> the two country names. I have a good idea of how to fix the problem,
> but I don't have time to work out the details this month, which might
> be needed for a ticket.
> There is one formal problem with the solution I have in mind. It
> involves collating elements such as <U+0301, n> to swap the tone mark
> (which really has primary weight) and final consonant, and the problem
> is that the FCD closure of a collation with such elements is infinite -
> it has to include such generated collating elements as <U+0301, ń, ń,
> ń, n>.
> I am also assuming that syllable boundaries are always marked
> in words with tone marks. Any revision of the CLDR definition should
> be checked against a Vietnamese dictionary - according to
> https://bugzilla.redhat.com/show_bug.cgi?id=516467, Nguyen Thai Ngoc Duy
> seems to have done the donkey work by providing