The Unicode Standard and ISO

Fri Jun 8 20:53:04 CDT 2018

On Fri, 8 Jun 2018 20:45:26 +0200
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode
> <unicode at unicode.org>:

> The way tailoring is designed in CLDR, using only data consumed by a
> generic algorithm rather than a custom algorithm, is not the only way
> to collate Lao. You can perfectly well add new custom algorithm
> primitives that use new collation data rules, inserted as "hooks" in
> UCA (which provides several points at which this is possible, but UCA
> just makes these hooks act as "no-ops").

The ideal is to have a common library rather than add specific routines
to support specific languages.  Now, this can be done in a common
library; ICU break iterators have dedicated routines for CJK and for
Siamese.  I wonder if this could be done for Lao and possibly Tai
Lue.  I've a vague recollection that UCA collation for Tai Lue in the
New Tai Lue script only needs thousands of contractions, so it may work
well enough in the main CLDR collation algorithm.  Martin Hosken
provided the numbers, probably on the Unicore list, when New Tai Lue
formally switched from phonetic to visual order.  Taking the definition
of logical order literally, the change legitimised the logical order of
New Tai Lue. 

> You can be much faster if you create a specific library for Lao, that
> would still be able to process the basic collation rules and then
> make more advanced inferences based on larger cluster boundaries than
> just those considered in the standard basic UCA, so it is perfectly
> possible to extend it to cover more complex Lao syllables and various
> specific quirks (such as hyphenation in the middle of clusters, as
> seen in some Indic scripts using left matras).

How is this hyphenation done?  The answer probably belongs in the
thread entitled 'Hyphenation Markup', unless it's restricted to the
visual order scripts.  If it's occurring in the visual order scripts,
we may need to add contractions for <preposed vowel, soft hyphen,
consonant>; U+00AD breaks contractions, and, indeed, may be used for
exactly that purpose, as it is generally easier to type than CGJ.
While I've seen line-breaking after a left matra in Thai, I've never
*seen* a hyphen after a left matra.
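
To make the point about U+00AD concrete, here is a toy sketch in Python.
It is not ICU or CLDR code: the segment() function and the two
contraction tables are hypothetical, and matching is plain contiguous
longest-match on code points, which is only a rough stand-in for how a
real collator looks up contractions.  It uses Lao U+0EC0 (a preposed
vowel) and U+0E81 (a consonant) as the example pair.

```python
SOFT_HYPHEN = "\u00AD"
LAO_VOWEL_E = "\u0EC0"  # LAO VOWEL SIGN E, written before the consonant
LAO_KO = "\u0E81"       # LAO LETTER KO

def segment(text, contractions):
    """Split text into collation units by greedy longest match.

    'contractions' is a set of multi-character strings (hypothetical
    table); anything not covered falls back to single characters.
    """
    longest = max((len(c) for c in contractions), default=1)
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to length 2.
        for n in range(min(longest, len(text) - i), 1, -1):
            if text[i:i + n] in contractions:
                units.append(text[i:i + n])
                i += n
                break
        else:
            # No contraction starts here: emit one character.
            units.append(text[i])
            i += 1
    return units

# Table with only <preposed vowel, consonant>.
basic = {LAO_VOWEL_E + LAO_KO}
# Same table plus <preposed vowel, soft hyphen, consonant>.
extended = basic | {LAO_VOWEL_E + SOFT_HYPHEN + LAO_KO}

plain = LAO_VOWEL_E + LAO_KO
hyphenated = LAO_VOWEL_E + SOFT_HYPHEN + LAO_KO

print(segment(plain, basic))          # one unit: the contraction
print(segment(hyphenated, basic))     # three units: contraction broken
print(segment(hyphenated, extended))  # one unit again
```

With the basic table the soft hyphen splits the pair into three
separate units; only the extended table keeps the hyphenated sequence
together, which is why the extra contractions would be needed.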

Richard.