Vietnamese Collation Wrong for Polysyllables

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Wed Jan 10 18:15:30 CST 2018


Is it in order for me to raise a ticket to report that the CLDR
Vietnamese collation is wrong for polysyllabic words?  For example, it
sorts <Á-Căn-Đình> 'Argentina/e' before <A-Phú-Hản> 'Afghan(istan)',
whereas <A-Phú-Hản> comes before <Á-Căn-Đình> on p. 1 of the 2016
edition of 'Tuttle Compact Vietnamese Dictionary: Vietnamese-English
English-Vietnamese'.  The dictionary looks right - except that it has
transposed the order of acute and grave accents!

I know exactly what is wrong for this example - the final paragraph of
https://en.wikipedia.org/wiki/Vietnamese_alphabet#Tone_marks explains
how Vietnamese collation works with the tone marks.  The key message
is, "Ordering according to primary and secondary differences proceeds
syllable by syllable".  Thus <A> and <Á> have a primary difference in
the two country names.  I have a good idea of how to fix the problem,
but I don't have time to work out the details this month, which might
be needed for a ticket.
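
As a rough illustration of the syllable-by-syllable rule quoted above, here is a minimal Python sketch. It is not the CLDR tailoring and not a proposed fix. It assumes that syllables are separated by hyphens or spaces, that the base-letter order is the modern Vietnamese alphabet, and that the tone order is unmarked, grave, hook above, tilde, acute, dot below. The names `syllable_key`, `word_key` and `whole_word_key` are made up for this example; `whole_word_key` is only a simplified stand-in for comparing base letters across the whole word before looking at any tone mark.

```python
import re
import unicodedata

# Assumed base-letter order: the modern Vietnamese alphabet.
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
LETTER_RANK = {ch: i for i, ch in enumerate(ALPHABET)}

# Assumed tone order: unmarked (0), grave, hook above, tilde, acute, dot below.
TONE_RANK = {"\u0300": 1, "\u0309": 2, "\u0303": 3, "\u0301": 4, "\u0323": 5}


def syllable_key(syllable):
    """Return (base-letter ranks, tone rank) for one syllable, ignoring case."""
    tone = 0
    rest = []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_RANK:
            tone = TONE_RANK[ch]
        else:
            rest.append(ch)
    # Recompose so that breve, circumflex and horn rejoin their vowels.
    letters = unicodedata.normalize("NFC", "".join(rest))
    # Characters outside the assumed alphabet sort after it, by code point.
    ranks = tuple(LETTER_RANK.get(c, len(ALPHABET) + ord(c)) for c in letters)
    return (ranks, tone)


def split_syllables(word):
    """Split on hyphens and whitespace (an assumption about the input)."""
    return [s for s in re.split(r"[\s\-]+", word) if s]


def word_key(word):
    """Syllable by syllable: letters and tone of syllable 1, then syllable 2, ..."""
    return tuple(syllable_key(s) for s in split_syllables(word))


def whole_word_key(word):
    """Simplified contrast: all base letters of the word first, tones afterwards."""
    keys = [syllable_key(s) for s in split_syllables(word)]
    letters = tuple(rank for ranks, _ in keys for rank in ranks)
    tones = tuple(tone for _, tone in keys)
    return (letters, tones)


names = ["Á-Căn-Đình", "A-Phú-Hản"]
print(sorted(names, key=word_key))
print(sorted(names, key=whole_word_key))
```

With `word_key` the tone difference between <A> and <Á> in the first syllable is decided before the second syllable is examined, so <A-Phú-Hản> sorts first. With `whole_word_key` the later difference between <c> and <p> wins instead, which puts <Á-Căn-Đình> first.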

There is one formal problem with the solution I have in mind.  It
involves collating elements such as <U+0301, n> that swap the tone mark
(which really has primary weight) and the final consonant.  The problem
is that the FCD closure of a collation with such elements is infinite:
it has to include generated collating elements such as <U+0301, ń, ń,
ń, n>.
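
The following Python sketch shows one way to see why such generated elements arise. It only checks two facts about strings of the form <U+0301, ń, ..., ń, n>: each one passes an FCD check, and its NFD form is nothing but repetitions of <U+0301, n>. The helper names `is_fcd` and `generated_element` are invented for this example, and the FCD test is a simplified one based on the canonical combining classes of each character's decomposition. Reading this as the cause of the infinite closure is an interpretation of the paragraph above, not something the code proves.

```python
import unicodedata

ACUTE = "\u0301"    # COMBINING ACUTE ACCENT
N_ACUTE = "\u0144"  # LATIN SMALL LETTER N WITH ACUTE (precomposed)


def lead_trail_ccc(ch):
    """Combining classes of the first and last code points of NFD(ch)."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def is_fcd(text):
    """Simplified FCD test: no character's decomposition starts with a
    non-zero combining class lower than the class the previous one ends with."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


def generated_element(k):
    """<U+0301, n-acute repeated k times, n>."""
    return ACUTE + N_ACUTE * k + "n"


for k in range(4):
    s = generated_element(k)
    nfd = unicodedata.normalize("NFD", s)
    # The NFD form is k + 1 back-to-back copies of <U+0301, n>.
    print(k, is_fcd(s), nfd == (ACUTE + "n") * (k + 1))
```

Every value of k gives a distinct FCD string whose normalized form consists entirely of the element <U+0301, n>, so a collator that reads FCD text without normalizing it would apparently need a separate entry for each k.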

I am also assuming that syllable boundaries are always marked
in words with tone marks.  Any revision of the CLDR definition should
be checked against a Vietnamese dictionary - according to
https://bugzilla.redhat.com/show_bug.cgi?id=516467, Nguyen Thai Ngoc Duy
seems to have done the donkey work by providing
http://repo.or.cz/w/words-vi.git.

Richard.


