Standardised Encoding of Text
Richard Wordingham
richard.wordingham at ntlworld.com
Sun Aug 9 16:03:37 CDT 2015
On Sun, 9 Aug 2015 21:14:38 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>
> > On Sun, 9 Aug 2015 17:10:01 +0200
> > Mark Davis ☕️ <mark at macchiato.com> wrote:
> > > For example, perhaps the addition of real data to CLDR for a
> > > "basic-validity-check" on a language-by-language basis.
> > CLDR is currently not useful. Are you really going to get Mayan
> > time formats when the script is encoded? Without them, there will
> > be no CLDR data.
> That is a misunderstanding. CLDR provides not only locale
> (language) specific data for formatting, collation, etc., but also
> data about languages. It is not limited to the first.
I'm basing my statement on the 'minimal data commitment' listed in
http://cldr.unicode.org/index/cldr-spec/minimaldata .
If there is a sustained failure to provide the four main date/time
formats, the locale may be removed.
> > > It might be
> > > possible to use a BNF grammar for the components, for which we are
> > > already set up.
> > Are you sure?
> I said "might be possible". That normally indicates a degree of
> uncertainty. That is, "no, I'm not sure".
> There is no reason to be unnecessarily argumentative; it doesn't
> exactly encourage people to explore solutions to a problem.
I was responding to the 'for which we are already set up'. The problem
is that canonical equivalence can make it very difficult to specify a
syntax. The text segmentation appendices suggest that you have already
hit trouble with canonical equivalence; I suspect you have tools set up
to prevent such problems recurring.
With a view to analysing the requirements of the USE (the Universal
Shaping Engine), I investigated the effects of canonical
equivalence on regular expressions. I eventually discovered the
relevant mathematical theory - it replaces strings by 'traces', which
for our purposes are fully decomposed character strings modulo canonical
equivalence. I found very little interest in the matter on this list.
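To make the trace idea concrete, here is a minimal Python sketch
using only the standard unicodedata module (the sample marks are my
own choice for illustration). Marks with different canonical
combining classes commute, so NFD collapses every interleaving of
them to a single representative:

    import unicodedata

    # U+0323 COMBINING DOT BELOW has canonical combining class 220;
    # U+0307 COMBINING DOT ABOVE has class 230. Different classes
    # commute under canonical equivalence, so NFD's canonical
    # reordering maps both orders to one representative string.
    variants = ['a\u0323\u0307', 'a\u0307\u0323']
    reps = {unicodedata.normalize('NFD', v) for v in variants}
    print(len(reps))  # 1: one trace, two spellings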
I gave the example of the regular expression
[:InPC=Top:]*[:InPC=Bottom:]*
Usefully converting that expression into one that matches all NFD
equivalents, in accordance with UTS #18 Version 17 Section 2.1, is
non-trivial, though it is doable. I have a feeling that some have
claimed that an expression like that is already in NFD.
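Here is a small Python sketch of the difficulty, using only the
standard re and unicodedata modules. The concrete marks stand in for
the InPC classes (my substitution - Python's regex syntax has no
[:InPC=...:] properties): the expression matches a string but rejects
its canonically equivalent NFD form.

    import re
    import unicodedata

    # Stand-ins: U+0307 COMBINING DOT ABOVE (ccc 230) for 'Top',
    # U+0323 COMBINING DOT BELOW (ccc 220) for 'Bottom'.
    pattern = re.compile('\u0307*\u0323*')

    s = '\u0307\u0323'                    # Top then Bottom
    print(bool(pattern.fullmatch(s)))     # True

    # Canonical reordering in NFD puts the lower class first, giving
    # a canonically equivalent string the expression no longer
    # matches.
    nfd = unicodedata.normalize('NFD', s)
    print(nfd == '\u0323\u0307')          # True
    print(bool(pattern.fullmatch(nfd)))   # False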
> I don't think any algorithmic description would get all and only those
> strings that would be acceptable to writers of the language. What
> you'd end up with is a mechanism that had three values: clearly ok
> (e.g., cat), clearly bogus (e.g., a\u0308\u0308\u0308\u0308), and
> somewhere in between.
What have you got against 8th derivatives? -:)
You are looking at a different issue from the one I am. One of the
issues is rather that, for a word of one syllable, a pair of
non-commuting combining marks should have only one order per meaning,
appearance and pronunciation. For non-Indic scripts, that is
generally handled by ensuring that different orders of non-commuting
combining marks render differently.
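As a concrete case (again with marks of my own choosing): two
below-marks share canonical combining class 220, so normalization
leaves both orders intact as distinct strings, and a renderer stacks
them in the order given. Any validity definition therefore has to
designate one order as the spelling.

    import unicodedata

    # U+0323 COMBINING DOT BELOW and U+0331 COMBINING MACRON BELOW
    # both have canonical combining class 220, so they do not
    # commute: NFD performs no reordering, and the two orders remain
    # canonically distinct strings.
    a = unicodedata.normalize('NFD', 'o\u0323\u0331')
    b = unicodedata.normalize('NFD', 'o\u0331\u0323')
    print(a == b)  # False: two distinct sequences, two appearances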
> If the goal for the script rules is to cover all languages customarily
> written with that script, one way to do that is to develop the
> language rules as they come, and make sure that the script rules are
> broadened if necessary for each language. But there is also utility
> to having the language rules, especially for high-frequency languages.
The language rules serve a different function. The sequence
"xxxxlttttuuupppp" is clearly not English, but it is a perfectly
acceptable string for sorting, searching and rendering.
Richard.