From richard.wordingham at ntlworld.com Wed Mar 1 17:24:58 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 1 Mar 2017 23:24:58 +0000 Subject: question about identifying CLDR coverage % for Amharic In-Reply-To: <20170224214254.368ca8a7@JRWUBU2> References: <49138035-387E-40A4-A7BB-B802AFFC45B4@wenlin.com> <20170224062458.23d13e27@JRWUBU2> <20170224214254.368ca8a7@JRWUBU2> Message-ID: <20170301232458.51979fe7@JRWUBU2> On Fri, 24 Feb 2017 21:42:54 +0000 Richard Wordingham wrote: > I notice a very similar file lo.xml. When did Laos haul up the white > flag and more or less adopt the modern Thai collation order for Lao? As there has been no answer to this question, I presume the surrender has not happened. As my ticket submission was rejected as spam, would someone kindly file a ticket along these lines: ==Lao collation is not linguistically correct== The file collation/lo.xml contains the reckless falsehood "The root collation order is valid for this language". If phonetic Lao syllables were represented by single characters, Lao collation would be a simple lexicographic order. It is therefore unable to use anything but primary weights. A Lao syllable may be considered to be composed of onset + vowel + coda + tone; the onset and vowel may be interleaved (as in Thai), and the tone is represented by a mark following the onset and no later than immediately after the vowel. There are two basic schemes ordering for single syllables: 1) 2) The first is the one most commonly used; the second is closer to the CLDR default. Unlike Thai, the vowel weighting for compound vowel symbols is not composed from the individual vowels. For example, part of the ordering is: ??? < ?? < ??? < ?? < ???? However, the current collation yields ?? < ??? < ???? < ?? < ??? This ordering is manifestly wrong. I suggest that the reckless comment be amended to something like, "The root collation is of some utility in sorting this language; accurate collation appears to require large tables". Yours faithfully, Richard Wordingham. From mark at macchiato.com Thu Mar 2 04:50:27 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 2 Mar 2017 11:50:27 +0100 Subject: question about identifying CLDR coverage % for Amharic In-Reply-To: <20170301232458.51979fe7@JRWUBU2> References: <49138035-387E-40A4-A7BB-B802AFFC45B4@wenlin.com> <20170224062458.23d13e27@JRWUBU2> <20170224214254.368ca8a7@JRWUBU2> <20170301232458.51979fe7@JRWUBU2> Message-ID: ?Filed on your behalf at http://unicode.org/cldr/trac/ticket/10098 (Surprised that it thought you were spam; are others seeing that?)? ?Also, would it be possible for you to supply the ordering rules for CLDR? Longer term, if the rules can be expressed without too much data, I think the change should be made in the DUCET; no need for that to differ gratuitously from ?what is acceptable in Laos. On Thu, Mar 2, 2017 at 12:24 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > > I notice a very similar file lo.xml. When did Laos haul up the white > > flag and more or less adopt the modern Thai collation order for Lao? > > As there has been no answer to this question, I presume the surrender > has not happened. As my ticket submission was rejected as spam, would > someone kindly file a ticket along these lines: > > ==Lao collation is not linguistically correct== > > The file collation/lo.xml contains the reckless falsehood "The root > collation order is valid for this language". 
>
> If phonetic Lao syllables were represented by single characters, Lao
> collation would be a simple lexicographic order. It is therefore unable
> to use anything but primary weights.
>
> A Lao syllable may be considered to be composed of onset + vowel + coda
> + tone; the onset and vowel may be interleaved (as in Thai), and the
> tone is represented by a mark following the onset and no later than
> immediately after the vowel. There are two basic ordering schemes for
> single syllables:
>
> 1)
> 2)
>
> The first is the one most commonly used; the second is closer to the
> CLDR default.
>
> Unlike Thai, the vowel weighting for compound vowel symbols is not
> composed from the individual vowels. For example, part of the ordering
> is:
>
> ??? < ?? < ??? < ?? < ????
>
> However, the current collation yields
> ?? < ??? < ???? < ?? < ???
>
> This ordering is manifestly wrong.
>
> I suggest that the reckless comment be amended to something like, "The
> root collation is of some utility in sorting this language; accurate
> collation appears to require large tables".
>
> Yours faithfully,
>
> Richard Wordingham.
>

Mark

From rajavelmani at gmail.com Thu Mar 2 12:09:53 2017
From: rajavelmani at gmail.com (Manikandan Ramalingam Kandaswamy)
Date: Thu, 2 Mar 2017 10:09:53 -0800
Subject: Fwd: Regarding Time zone name usage
In-Reply-To:
References:
Message-ID:

Hi CLDR,

I need some clarification on using the time zone names spec.

I am implementing the date/time format for the date symbol `zzzz` and I
have questions regarding the time zone format spec.

I am focusing on regionFormat-standard and regionFormat-daylight related
to the `zzzz` format "{0} Standard Time", *or* "{COUNTRY} Standard Time /
{CITY} Standard Time".

Based on the tr35 documentation spec, I use the metaZone and golden zone
in TimeZoneNames. But I have these questions, which I could not get
answered from the tr35 documentation:

- Is there any time zone which is not in a metaZone?
- If there is a time zone which is not in a metaZone:
  - Where is the localized COUNTRY or CITY name for constructing
    "{COUNTRY} Standard Time / {CITY} Standard Time"?
  - Should I use exemplarCity in TimeZoneNames.json?

Secondly, for an existing metaZone in TimeZoneNames for `z...zzz` there
are cases where there are no short time zone names, as for Asia/Calcutta.
In this case, I have these questions to be clarified:

- Should I fall back to the `O` (GMT) format?
- Can we have some algorithm in the zone format spec about short time
  zone name usage?

Can someone help me to clarify the above questions?

Thanks
Mani

From rajavelmani at gmail.com Thu Mar 2 12:31:49 2017
From: rajavelmani at gmail.com (Manikandan Ramalingam Kandaswamy)
Date: Thu, 2 Mar 2017 10:31:49 -0800
Subject: Regarding Time zone name usage
In-Reply-To:
References:
Message-ID:

Dear all,

I found the below use cases based on the tr35 time zone symbols table. It
would be less confusing if we could update the Time Zones documentation.

1. The `z`, `v` and `V` formats should have the formatted strings in
   TimeZoneNames.json for `regionFormat`, `regionFormat-type-standard` and
   `regionFormat-type-daylight`.
2. If the formatted time zone name is not found, the fallback is as follows:
   a. for `z...zzz`, fall back to `O` (the short localized GMT format);
   b. for `zzzz`, fall back to `OOOO` (the long localized GMT format);
   c. for `v` and `vvvv`, fall back to `VVVV` (the generic location format);
   d. for `VVVV`, if the `exemplarCity` is not found, fall back to `OOOO`
      (the long localized GMT format);
   e. for `VVV`, if the `exemplarCity` is not found, fall back to the
      `Etc/Unknown` time zone.

It would be helpful if the Time Zones documentation were updated with the
above information. Thank you for your support.

-Mani

On Thu, Mar 2, 2017 at 10:09 AM, Manikandan Ramalingam Kandaswamy <
rajavelmani at gmail.com> wrote:

> Hi CLDR,
>
> I need some clarification on using the time zone names spec.
>
> I am implementing the date/time format for the date symbol `zzzz` and I
> have questions regarding the time zone format spec.
>
> I am focusing on regionFormat-standard and regionFormat-daylight related
> to the `zzzz` format "{0} Standard Time", *or* "{COUNTRY} Standard Time /
> {CITY} Standard Time".
>
> Based on the tr35 documentation spec, I use the metaZone and golden zone
> in TimeZoneNames. But I have these questions, which I could not get
> answered from the tr35 documentation:
>
> - Is there any time zone which is not in a metaZone?
> - If there is a time zone which is not in a metaZone:
>   - Where is the localized COUNTRY or CITY name for constructing
>     "{COUNTRY} Standard Time / {CITY} Standard Time"?
>   - Should I use exemplarCity in TimeZoneNames.json?
>
> Secondly, for an existing metaZone in TimeZoneNames for `z...zzz` there
> are cases where there are no short time zone names, as for Asia/Calcutta.
> In this case, I have these questions to be clarified:
>
> - Should I fall back to the `O` (GMT) format?
> - Can we have some algorithm in the zone format spec about short time
>   zone name usage?
>
> Can someone help me to clarify the above questions?
>
> Thanks
> Mani

From richard.wordingham at ntlworld.com Thu Mar 2 14:11:48 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 2 Mar 2017 20:11:48 +0000
Subject: question about identifying CLDR coverage % for Amharic
In-Reply-To:
References: <49138035-387E-40A4-A7BB-B802AFFC45B4@wenlin.com> <20170224062458.23d13e27@JRWUBU2> <20170224214254.368ca8a7@JRWUBU2> <20170301232458.51979fe7@JRWUBU2>
Message-ID: <20170302201148.7f274827@JRWUBU2>

On Thu, 2 Mar 2017 11:50:27 +0100
Mark Davis ☕️ wrote:

> Also, would it be possible for you to supply the ordering rules for
> CLDR?

I currently define collation element weights; when I was working on
the rules, I did not trust the CLDR collation definition format. I
will make an effort to convert the rules into the proper format. If
I can't work out how to express them succinctly and intelligibly,
I'll try to create a C program to generate the data. (I currently
generate the table using a bash script, and I suspect bash source
code will not be welcome.)

> Longer term, if the rules can be expressed without too much data, I
> think the change should be made in the DUCET; no need for that to
> differ gratuitously from what is acceptable in Laos.

1) I'm interested to know how you would square the change with the
stability policy: (http://www.unicode.org/collation/ducet-changes.html)

"Changes for characters which have been in the standard for longer than
2 years should generally be disallowed. The UTC can overrule this and
mandate a change in a character weight entry, but should only do so
when it determines that there is an egregious error or finds some other
very strong motivation for disturbing an established value.
In less than such extreme circumstances, solutions involving tailoring should be preferred." 2) DUCET defines a finite collation element table; Lao needs an infinite collation element table for use in the Unicode Collation Algorithm. The CLDR syntaxes allow finite expression of infinite tables by means of chaining context-sensitive mappings. 3) I think there is 'too much data'; the last table I generated defined 184,674 collation elements for Lao. On the other hand, the generating script is just 642 lines long, and 25% of lines are comment lines. Lao text needs a preprocessing stage in which syllable boundary marks are inserted. That would shorten the table considerably, and allow the UCA to use a finite collation element table. Richard. From pedberg at apple.com Wed Mar 8 20:05:02 2017 From: pedberg at apple.com (Peter Edberg) Date: Wed, 08 Mar 2017 18:05:02 -0800 Subject: CLDR 31 beta and LDML spec review Message-ID: <5D59DBF8-391F-4A52-9CAD-99E117736C19@apple.com> Dear CLDR users and friends, A beta version of CLDR 31 is available for review. This includes: ? A draft of the updated LDML specification at http://www.unicode.org/reports/tr35/proposed.html ; a link to a summary of modifications appears near the beginning. ? A draft download page at https://sites.google.com/site/cldr/index/downloads/cldr-31 Thank you for any feedback! Peter Edberg for the CLDR team -------------- next part -------------- An HTML attachment was scrubbed... URL: From hchapman at us.ibm.com Thu Mar 9 08:50:42 2017 From: hchapman at us.ibm.com (Helena S Chapman) Date: Thu, 9 Mar 2017 09:50:42 -0500 Subject: Usable regional authorization workflow service integration Message-ID: A question around the scope and role of CLDR due to my ignorance so please feel free to direct and critique accordingly. A common use of OAth service integration is when someone signs on to a new service offering as a consumer, we are often asked to "authenticate" thru our Facebook, Google+, Linkedin or Twitter accounts in the US. Either that or pay the price of one registration per site to gain access. The challenge is, other than Linkedin, none of the other services are available in Mainland China. A popular social media in Japan is LINE. Or MyMFB in certain other regions. Is this something CLDR would be able to help with over time? Ability to support various authorization workload/service integration at regional level that changes as business wax and wane with users' preferences? Fully equipped with sample integrations to Node (e.g. EveryAuth), PHP (Opauth), Python (SocialOAuth), or Cloud (Auth0) in ICU for various regions in the same web or mobile app? Or is this completely out of scope? Authorization workflow is only one example, there are other global user experiences related topics that can fall under a similar umbrella. signed, a person with a day job in globalization and hands-on cybersecurity and privacy experiences -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Mar 9 14:45:15 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 9 Mar 2017 21:45:15 +0100 Subject: Usable regional authorization workflow service integration In-Reply-To: References: Message-ID: In China, Baidu is the most popular social network. May be it's possible for them to be a OAuth server (if they respect the protocol correctly and have a reliable support). 
I know that Baidu is now developing in other countries as well, via
various small community sites, blogs and advertising partners, as well
as via Chinese media with foreign branches using other languages (such
as CCTV, which supports at least Spanish- and French-speaking services).

Anyway, we need more diversity in the choices. Microsoft/Yahoo/Bing could
also be in the list, or Amazon, as well as well-known job-seeking sites,
or well-known big ISPs (most European countries have at least 2 or 3
"big" players with a lot of related services that can be unified with
single sign-on via OAuth; most of them also have international branches,
notably Orange, DT and Vodafone, and growing presences in content
sites...). You could also add websites used for posting talks/comments
on web articles, such as Disqus (very popular with media/TV/radio/news
sites).

2017-03-09 15:50 GMT+01:00 Helena S Chapman :

> A question around the scope and role of CLDR due to my ignorance so please
> feel free to direct and critique accordingly.
>
> A common use of OAth service integration is when someone signs on to a new
> service offering as a consumer, we are often asked to "authenticate" thru
> our Facebook, Google+, Linkedin or Twitter accounts in the US. Either that
> or pay the price of one registration per site to gain access. The challenge
> is, other than Linkedin, none of the other services are available in
> Mainland China. A popular social media in Japan is LINE. Or MyMFB in
> certain other regions.
>
> Is this something CLDR would be able to help with over time? Ability to
> support various authorization workload/service integration at regional
> level that changes as business wax and wane with users' preferences? Fully
> equipped with sample integrations to Node (e.g. EveryAuth), PHP (Opauth),
> Python (SocialOAuth), or Cloud (Auth0) in ICU for various regions in the
> same web or mobile app?
>
> Or is this completely out of scope? Authorization workflow is only one
> example, there are other global user experiences related topics that can
> fall under a similar umbrella.
>
> signed, a person with a day job in globalization and hands-on
> cybersecurity and privacy experiences

From markus.icu at gmail.com Thu Mar 9 14:58:37 2017
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 9 Mar 2017 12:58:37 -0800
Subject: Usable regional authorization workflow service integration
In-Reply-To:
References:
Message-ID:

This looks very out of scope for CLDR.
markus

From verdy_p at wanadoo.fr Thu Mar 9 15:37:20 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 9 Mar 2017 22:37:20 +0100
Subject: Usable regional authorization workflow service integration
In-Reply-To:
References:
Message-ID:

I'm not talking about the value of these services but about the initial
question of OAuth for single sign-in on the CLDR site, where only a few
major US providers are proposed (I'd say "advertised" there): do you want
to include only OAuth providers that are full members of the CLDR TC or
of Unicode?

2017-03-09 21:58 GMT+01:00 Markus Scherer :

> This looks very out of scope for CLDR.
> markus
From verdy_p at wanadoo.fr Thu Mar 9 15:41:10 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 9 Mar 2017 22:41:10 +0100
Subject: Usable regional authorization workflow service integration
In-Reply-To:
References:
Message-ID:

Note that Twitter is not a full member, just an associate member. If it
is proposed, it's only because it's "popular". And LinkedIn is proposed
but is not even a member (though it has some popularity and is the only
one available in China, where it is not as "popular" as Baidu).

2017-03-09 22:37 GMT+01:00 Philippe Verdy :

> I'm not talking about the value of these services but about the initial
> question of OAuth for single sign-in on the CLDR site, where only a few
> major US providers are proposed (I'd say "advertised" there): do you want
> to include only OAuth providers that are full members of the CLDR TC or
> of Unicode?
>
> 2017-03-09 21:58 GMT+01:00 Markus Scherer :
>
>> This looks very out of scope for CLDR.
>> markus

From vapier at gentoo.org Thu Mar 9 16:16:43 2017
From: vapier at gentoo.org (Mike Frysinger)
Date: Thu, 9 Mar 2017 17:16:43 -0500
Subject: Usable regional authorization workflow service integration
In-Reply-To:
References:
Message-ID: <20170309221643.GB31094@vapier>

On 09 Mar 2017 22:37, Philippe Verdy wrote:
> I'm not talking about the value of these services but about the initial
> question of OAuth for single sign-in on the CLDR site, where only a few
> major US providers are proposed (I'd say "advertised" there): do you want
> to include only OAuth providers that are full members of the CLDR TC or
> of Unicode?

the original post does not seem to be about "what providers can be used
to log in w/CLDR" but more "can CLDR be a focal point for OAuth/regional
authorities".

i'm with Markus -- this doesn't look like it's anywhere close to what
CLDR is doing or should be doing.
-mike

From richard.wordingham at ntlworld.com Fri Mar 10 17:16:48 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 10 Mar 2017 23:16:48 +0000
Subject: Lao Collation (was: question about identifying CLDR coverage % for Amharic)
In-Reply-To: <20170302201148.7f274827@JRWUBU2>
References: <49138035-387E-40A4-A7BB-B802AFFC45B4@wenlin.com> <20170224062458.23d13e27@JRWUBU2> <20170224214254.368ca8a7@JRWUBU2> <20170301232458.51979fe7@JRWUBU2> <20170302201148.7f274827@JRWUBU2>
Message-ID: <20170310231648.01fb2cb7@JRWUBU2>

On Thu, 2 Mar 2017 20:11:48 +0000
Richard Wordingham wrote:

I think on balance this is a CLDR question rather than an ICU support
question - the clincher is that the answer may be not to use ICU.

> On Thu, 2 Mar 2017 11:50:27 +0100
> Mark Davis ☕️ wrote:

> > Also, would it be possible for you to supply the ordering rules for
> > CLDR?

> I currently define collation element weights; when I was working on
> the rules, I did not trust the CLDR collation definition format. I
> will make an effort to convert the rules into the proper format. If
> I can't work out how to express them succinctly and intelligibly,
> I'll try to create a C program to generate the data. (I currently
> generate the table using a bash script, and I suspect bash source
> code will not be welcome.)

> 3) I think there is 'too much data'; the last table I generated
> defined 184,674 collation elements for Lao.
On the other hand, the > generating script is just 642 lines long, and 25% of lines are > comment lines. I've expressed the table as CLDR tailoring rules, but I am having difficulty in checking them. To check them, I need an implementation that: 1) is cheap (ideally free) to use. 2) interprets LDML the same way as ICU. It would help if I could be assured that two compliant implementations of CLDR would give the same results for collation when normalisation is enabled. I had hoped to use ICU to test the tailoring; building my own implementation would be very vulnerable to my misinterpretations of LDML. However, I've hit a hard limit with ICU Version 58.2 in the storage of 'CE32's in field CollationDataBuilder::ce32s - 0x7ffff, = 524,287, elements. I'm not sure where they're coming from, but there are a great many entries of length 20 to 26. (In DUCET form, my entries typically have 3 or 4 CEs.) The limit is connected to the size of bit fields in ICU, so it would be hard for me to change it. I hit the limit in the 26,070th expansion of the array in CollationDataBuilder::encodeExpansion32, which yields an index of 524267, but according to the error details, parsing has only reached line 7268 of my rules input file. Apart from the creation of abstract weights (given below) and comments, my rules input file has one contraction per line. I may be able to work round it by testing subsets of the script. I set up the abstract weights (corresponding to initial consonants, compound vowels, final consonants and tones) using # Vowels &\u0eb9 < \ufdd2? < ? < \ufdd2? < ? < \u0ebb < ? < \ufdd2AW < \ufdd2AAW < \ufdd2OE < \ufdd2OOE < \ufdd2IA < \ufdd2IIA < \ufdd2UEA < \ufdd2UUEA < \ufdd2UA < \ufdd2UUA < ? < ? < \ufdd2AO < ? # Initial consonants <\ufdd2\u0ede<\ufdd2\u0e81<\ufdd2\u0e82<\ufdd2\u0e84<\ufdd2\u0e87 <\ufdd2\u0e88<\ufdd2\u0eaa<\ufdd2\u0e8a<\ufdd2\u0edf<\ufdd2\u0e8d <\ufdd2\u0e94<\ufdd2\u0e95<\ufdd2\u0e96<\ufdd2\u0e97<\ufdd2\u0e99 <\ufdd2\u0e9a<\ufdd2\u0e9b<\ufdd2\u0e9c<\ufdd2\u0e9d<\ufdd2\u0e9e <\ufdd2\u0e9f<\ufdd2\u0ea1<\ufdd2\u0ea2<\ufdd2\u0ea3<\ufdd2\u0ea5 <\ufdd2\u0ea7<\ufdd2\u0eab<\ufdd2\u0ead <\ufdd2\u0eae # Tones <\u0ec8<\u0ec9<\u0eca<\u0ecb # Final consonants <\ufdd3\u0e81<\ufdd3\u0e87<\ufdd3\u0edf<\ufdd3\u0e8d<\ufdd3\u0e94 <\ufdd3\u0e99<\ufdd3\u0e9a<\ufdd3\u0ea1<\ufdd3\u0ea7<\ufdd3\u0ebd # Treat ? as ? finally - TBC! &\ufdd3?=\ufdd3? I don't need to tailor the weights for the vowels U+0EB0 LAO VOWEL SIGN A to U+0EB9 LAO VOWEL SIGN UU. I then use contractions like &\ufdd2\u0eab\ufdd2\u0e8d\ufdd2UUA\ufdd3\u0e9a = \u0eab\u0e8d?\u0e9a It is possible to eliminate some contractions - this one captures the observation, redundant in pure Lao, that in a word starting thus, U+0E9A LAO LETTER BO would belong to the first syllable, not the second. The vowel would be spelt differently in an open syllable. This matters, because, for example, ????? 'kettle' comes before its string-theoretical prefix ??? 'work, action', though with the vowel in this pair, characters beyond the '?' have to be taken into account. Actually, I think I had already eliminated this contraction, but then mistranslated from bash to C :-( So, is there a more capacious alternative to ICU? Speed matters less for testing, and even ICU currently takes over 2 hours to choke on my tailoring, though I think I can see a time-space tradeoff in CollationDataBuilder::encodeExpansion32 that may speed matters up. Alternatively, am I accidentally creating overly long contractions? 
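For concreteness, the kind of check I want to run is no more than the
following sketch, shown here against ICU4C; the rules file name and the
test pair are placeholders, and it assumes a subset of the tailoring
small enough to stay within the builder's limits. Any implementation
that parses LDML tailorings the same way would serve equally well.

/* checktailor.cpp - minimal harness; "lao-rules-subset.txt" and the
   test strings below are placeholders, not part of the real data. */
#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/ucol.h>
#include <unicode/tblcoll.h>
#include <fstream>
#include <iostream>
#include <sstream>

int main() {
    /* Read the LDML-syntax tailoring (a subset small enough for the
       builder) from a UTF-8 file. */
    std::ifstream in("lao-rules-subset.txt");
    std::stringstream buf;
    buf << in.rdbuf();
    icu::UnicodeString rules = icu::UnicodeString::fromUTF8(buf.str());

    UErrorCode status = U_ZERO_ERROR;
    icu::RuleBasedCollator coll(rules, status);
    if (U_FAILURE(status)) {
        std::cerr << "rule build failed: " << u_errorName(status) << '\n';
        return 1;
    }

    /* Compare one placeholder pair (KO + AA versus KHO SUNG + AA); a
       real test would loop over a list of strings already sorted by the
       reference weight table and check that each adjacent pair still
       compares as not-greater. */
    icu::UnicodeString a, b;
    a.append((UChar)0x0E81).append((UChar)0x0EB2);
    b.append((UChar)0x0E82).append((UChar)0x0EB2);
    UCollationResult r = coll.compare(a, b, status);
    std::cout << (r == UCOL_LESS ? "<" : r == UCOL_GREATER ? ">" : "=")
              << '\n';
    return U_SUCCESS(status) ? 0 : 1;
}

That is all the machinery the check needs; the hard part remains keeping
the rule set small enough for the builder to accept.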
I hope to eliminate abstract weights, but first I want to check the move from DUCET-style weight tables to LDML notation. Richard. From richard.wordingham at ntlworld.com Sat Mar 18 16:19:15 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 18 Mar 2017 21:19:15 +0000 Subject: Lao Collation (was: question about identifying CLDR coverage % for Amharic) In-Reply-To: References: <49138035-387E-40A4-A7BB-B802AFFC45B4@wenlin.com> <20170224062458.23d13e27@JRWUBU2> <20170224214254.368ca8a7@JRWUBU2> <20170301232458.51979fe7@JRWUBU2> Message-ID: <20170318211915.048d8d8a@JRWUBU2> On Thu, 2 Mar 2017 11:50:27 +0100 Mark Davis ?? wrote: > ?Also, would it be possible for you to supply the ordering rules for > CLDR? I'm still working on converting the rules from a table of weights to CLDR rules. Unfortunately, to get a tolerable number of rules, I seem to need contexts that start with a tone mark. It seems wrong to ask users to mark syllabary boundaries when they are blindingly obvious. The problem sequence is syllables like ???? 'stem' . In terms of abstract weights, I need to convert it to where these are the weights for the initial consonant, vowel, final consonant, and tone, in that order. The big problem in ordering it is that one needs to determine whether the final NO is syllable-final, or the start of the next syllable. If it were the start of the next syllable, the word would come before ?????? 'brave' because NO comes before HO SUNG in alphabetical order. However, as Lao orders syllable by syllable, we have the order ??? 'courageous' < ?????? < ????. I therefore end up being driven to a series of contractions: ?? > # Must defer weight for tone mark until final consonant is known. ? | ??? > # Identify consonant as starting a syllable - redundant for ?. Used in ??????. ? | ?? > # Would be bled by other contractions, similar to the above, if NO started a syllable. Used in ????. ? | ? > Used at the end of an intelligible run of word characters. Used by ??? at the end of a phrase, or if words are being separated by ZWSP at input. For use with ICU, there is an apparently fatal objection, which I'm currently having difficulty in disabling by hacking ICU. The context does not start at an NFC boundary, so ICU rejects the input as invalid. Now, for these examples, I have to add a consonant to the start of the context, which increases the number of contextual contractions thirty-fold. Now, I may be able to get the number of contractions down by tailoring more tightly to Lao phonology (with a weather-eye to other Laotian languages). Ultimately, it may be possible to secure drastic reductions by tailoring to the lexicon, with the certainty of some new words being missorted. > Longer term, if the rules can be expressed without too much data, I > think the change should be made in the DUCET; no need for that to > differ gratuitously from ?what is acceptable in Laos. Short of adding Lao tone processing to Quebecois accent processing, I think Lao tones will have to continue to make secondary differences in DUCET. The principle of ordering open syllables before closed syllables with the same initial and vowel may also need too much data. The 38 modern Lao vowel symbols, simple, compound and including what I reckon as matres lectionis, should sort as follows: ?? . This improves the ordering to: ?? and between . This would best be handled by treating SIGN LO as having a compatibility decomposition to LETTER LO. Finally, people rarely appreciate the simplicity of Thai collation. 
For Thai, the 'logical exception' vowels are simply swapped with the immediately following consonant. For Lao, the 'logical exception' vowels are swapped with the immediately following consonant *cluster*. It is particularly worth doing this for the 6 Lao letters that may be composed of two characters: 1) HO NGO 2) HO NYO 3) HO NO or 4) HO MO or 5) HO LO or 6) HO WO At present, DUCET gives the following odd ordering: ??? < ???