CLDR proposal: Move collator CLDR settings into ICU format

Philippe Verdy verdy_p at wanadoo.fr
Sat Apr 4 02:05:56 CDT 2015


Maybe there is a way to use (or create) a converter tool that would
automatically generate an equivalent XML version, at least for some versions
(to allow for a transition).
These generated files would be explicitly marked as "derived" (so that they
are no longer directly supported as references, only provided for
information).

Or put the sources of such a conversion tool in an open-source repository
(it should compile at least on Linux, possibly on Windows too, or be written
in a portable and widely used language available across platforms, such as
JavaScript or Java).

This open-source tool does not need to be optimized (it is a one-shot
conversion); it should be demonstrative, so its sources should remain as
simple as possible, without many dependencies on external libraries or APIs
and without complex data structures. In fact these sources can be a useful
informative companion to the specifications (it is often simpler and faster
to read the sources than to decipher natural English text and the
ambiguities that creep into it too easily).

But these sources can also give implementers programming hints about how to
parse the reference data correctly for their applications, even if they
will in fact use another, more appropriate internal format for better
performance at runtime. Collation is a critical functionality in
applications, where performance is highly desirable in order to manage
large volumes of text efficiently, for example in plain-text searches or
when sorting query result sets, so applications do not in fact use the ICU
public syntax or the XML syntax internally, and do not run parsers
repeatedly.
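
To make the idea concrete, here is a minimal sketch of such a converter,
going from the bracketed ICU options back to the deprecated XML elements. It
is only an illustration under stated assumptions, not the CLDR tooling: the
names split_options, icu_options_to_xml and SET_ELEMENTS are hypothetical,
the mapping is inferred from the examples in the proposal quoted below, it
only handles a string of options (not the collation rules themselves), and
it emits elements in input order rather than in the order the DTD requires.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: these options take a bracketed set and map to their own element.
SET_ELEMENTS = {
    "suppressContractions": "suppress_contractions",
    "optimize": "optimize",
}


def split_options(rules):
    """Yield the text inside each top-level [...] group, tracking nesting."""
    depth = 0
    start = None
    for i, ch in enumerate(rules):
        if ch == "[":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ']' at offset %d" % i)
            if depth == 0:
                yield rules[start:i]
    if depth != 0:
        raise ValueError("unbalanced '[' in options")


def icu_options_to_xml(rules):
    """Return a list of XML element strings for a string of ICU options."""
    elements = []
    pending = []  # consecutive simple settings, merged into one <settings/>

    def flush():
        if pending:
            attrs = " ".join("%s=%s" % (k, quoteattr(v)) for k, v in pending)
            elements.append("<settings %s/>" % attrs)
            pending.clear()

    for option in split_options(rules):
        parts = option.split(None, 1)
        if not parts:
            continue  # empty "[]" group: nothing to convert
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if name == "import":
            flush()
            # Assumption: the collation type, if any, follows "-u-co-".
            source, sep, ctype = value.partition("-u-co-")
            attrs = "source=%s" % quoteattr(source)
            if sep:
                attrs += " type=%s" % quoteattr(ctype)
            elements.append("<import %s/>" % attrs)
        elif name in SET_ELEMENTS:
            flush()
            tag = SET_ELEMENTS[name]
            elements.append("<%s>%s</%s>" % (tag, escape(value), tag))
        else:
            pending.append((name, value))
    flush()
    return elements


if __name__ == "__main__":
    sample = (
        "[caseFirst upper]\n"
        "[import da-u-co-standard]\n"
        "[suppressContractions [เ-ไ ເ-ໄ ꪵ ꪶ ꪹ ꪻ ꪼ]]\n"
        "[normalization on][alternate shifted][reorder Thai]"
    )
    for element in icu_options_to_xml(sample):
        print(element)
```

Run on the four ICU lines from the proposal, this prints the four XML lines
of the same example. Anything outside the brackets is silently ignored, so a
real tool would also have to separate the options from the rule text (and
from bracketed rule syntax such as [before 1]) before converting.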


2015-04-04 8:05 GMT+02:00 Mark Davis ☕ <mark at macchiato.com>:

> I'm strongly in favor of these changes.
>
>
> Mark <https://google.com/+MarkDavis>
>
> *— Il meglio è l’inimico del bene —*
>
> On Fri, Apr 3, 2015 at 10:59 PM, Markus Scherer <markus.icu at gmail.com>
> wrote:
>
>> Dear CLDR team & users,
>>
>> I would like to propose the following spec & data changes for CLDR 28.
>> Please provide *feedback by next Thursday, 2015-apr-09*.
>> CLDR ticket: http://unicode.org/cldr/trac/ticket/8289
>>
>> Proposal:
>> - Deprecate XML elements under <collation>:
>>     import, settings, suppress_contractions, optimize
>>   together with their specific attributes
>> - Change the CLDR collation tailorings data to
>>   replace the use of these XML elements with equivalent ICU syntax
>>
>> For example:
>>
>> <settings caseFirst="upper"/>
>> <import source="da" type="standard"/>
>> <suppress_contractions>[เ-ไ ເ-ໄ ꪵ ꪶ ꪹ ꪻ ꪼ]</suppress_contractions>
>> <settings normalization="on" alternate="shifted" reorder="Thai"/>
>>
>> ->
>>
>> [caseFirst upper]
>> [import da-u-co-standard]
>> [suppressContractions [เ-ไ ເ-ໄ ꪵ ꪶ ꪹ ꪻ ꪼ]]
>> [normalization on][alternate shifted][reorder Thai]
>>
>> Rationale:
>>
>> The LDML collation spec
>> <http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Element>
>> provides for two ways for parametric settings and special rules in
>> collation tailoring data: via special XML elements, or as part of the ICU
>> syntax rules in <cr><![CDATA[...]]></cr>. See the underlined elements in
>> the following line copied from the spec:
>>
>> <!ELEMENT collation (alias | ( *import*, settings?,
>> suppress_contractions?, optimize?*, cr*, special*)) >
>>
>> Two ways of doing the same thing lead to inconsistencies.
>>
>> CLDR tools and tests would not have to convert these elements to ICU
>> syntax any more.
>>
>> The spec would be simpler.
>>
>> This change makes it clearer that the settings get *import*ed too, not
>> just the rules.
>>
>> Note that CLDR 24
>> <http://www.unicode.org/reports/tr35/tr35-33/tr35.html#Modifications>
>> deprecated the XML syntax for rules and replaced the XML syntax rules data
>> with equivalent ICU syntax rules.
>>
>> Sincerely,
>> markus
>>
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>
>>