Vietnamese Collation Wrong for Polysyllables

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Thu Jan 11 14:51:51 CST 2018


On Thu, 11 Jan 2018 10:03:31 +0100
Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:

> The first question I'd have is whether the lexical ordering described
> in https://en.wikipedia.org/wiki/Vietnamese_alphabet#Tone_marks is
> expected by average Vietnamese. We have seen before cases where a
> formal government specification (French accent ordering) is expected
> by nobody outside of a small group of mavens.

How small is small?  The official Thai way seems not to be how Thais
usually compare manually!

> Assuming it is required ...
> 
> This may just be a case where the UCA doesn't work well enough without
> preprocessing. The standard does allow for such preprocessing, and the
> question is how to allow for that in CLDR data. One way I can think
> of is by allowing a transform, applied to text before sorting, to
> be specified in the UCA description.

That method may assist greatly with other mainland SE Asian languages
that compare syllable by syllable - Lao, Tai Lue and Burmese at least.
Indeed, it might make the Lao CCVT syllable-based ordering (i.e.
initial, final, vowel and then tone) much easier to implement, as well
as greatly simplifying the Lao CVCT order I was struggling to implement
in March 2017.  
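
For Vietnamese, a minimal sketch of such a pre-sort transform might look
like the Python below.  It is only an illustration under stated
assumptions, not a proposal for CLDR syntax: the names syllable_key and
sort_key are hypothetical, syllables are assumed to be separated by
spaces or hyphens, the tone order (ngang, huyền, hỏi, ngã, sắc, nặng)
is the one given in the Wikipedia article cited above, and letters
outside the Vietnamese alphabet are arbitrarily sorted last.

```python
import re
import unicodedata

# Assumed tone order: unmarked (ngang) = 0, then huyền, hỏi, ngã, sắc, nặng.
TONE_RANK = {"\u0300": 1, "\u0309": 2, "\u0303": 3, "\u0301": 4, "\u0323": 5}

# Vietnamese alphabet order; letters not listed here sort after all of these.
ALPHABET = "aăâbcdđeêghiklmnoôơpqrstuưvxy"
LETTER_RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def syllable_key(syllable):
    """Return (letter ranks, tone rank) for one syllable."""
    tone = 0
    base = []
    # NFD separates the tone mark from the letter and its vowel-quality mark.
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_RANK:
            tone = TONE_RANK[ch]
        else:
            base.append(ch)
    # Recompose so that ă, â, ê, ô, ơ, ư are single letters again.
    letters = unicodedata.normalize("NFC", "".join(base))
    ranks = tuple(LETTER_RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in letters)
    return (ranks, tone)


def sort_key(text):
    """Compare syllable by syllable: letters first, then tone, per syllable."""
    syllables = [s for s in re.split(r"[\s\-]+", text.strip()) if s]
    return tuple(syllable_key(s) for s in syllables)


words = ["bà con", "ba má", "ba ba"]
print(sorted(words, key=sort_key))
```

Here the tone of the first syllable is decided before the letters of the
second syllable are looked at, so "ba má" sorts before "bà con"; a
comparison that takes all the letters first would put "bà con" ahead,
since c < m.  The sketch treats any word written without a space or
hyphen as a single syllable, so it does nothing useful for unsegmented
polysyllabic words.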

> It appears that
> Vietnamese syllables are well structured, which could allow a
> relatively simple transform to do the job, along the lines of what
> you are suggesting.

In vi.xml, I find "St. Barthélemy", which is a little worrying.  Even
more important are words like "a-xít" or "axít" 'acid' and several other
partially assimilated words, such as "phốtpho" 'phosphorus'.

> In any event, we'd want to involve Vietnamese experts before going any
> further.

Glad to have someone else do the work!

Richard.


