Standardised Encoding of Text

Richard Wordingham richard.wordingham at ntlworld.com
Sun Aug 9 16:03:37 CDT 2015


On Sun, 9 Aug 2015 21:14:38 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:

> On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> 
> > On Sun, 9 Aug 2015 17:10:01 +0200
> > Mark Davis ☕️ <mark at macchiato.com> wrote:

> > > For example, perhaps the addition of real data to CLDR for a
> > > "basic-validity-check" on a language-by-language basis.

> > CLDR is currently not useful.  Are you really going to get Mayan
> > time formats when the script is encoded? Without them, there will
> > be no CLDR data.
 
> That is a misunderstanding. CLDR provides not only locale (language)
> specific data for formatting, collation, etc., but also data about
> languages. It is not limited to the former.

I'm basing my statement on the 'minimal data commitment' listed in
http://cldr.unicode.org/index/cldr-spec/minimaldata .

If there is a sustained failure to provide the four main date/time
formats, the locale may be removed.

> > > It might be
> > > possible to use a BNF grammar for the components, for which we are
> > > already set up.

> > Are you sure?

> I said "might be possible". That normally indicates that a degree of
> uncertainty. That is, "no, I'm not sure".

> There is no reason to be unnecessarily argumentative; it doesn't
> exactly encourage people to explore solutions to a problem.

I was responding to 'for which we are already set up'.  The problem
is that canonical equivalence can make it very difficult to specify a
syntax.  The text segmentation appendices suggest that you have already
hit trouble with canonical equivalence; I suspect you have tools set up
to prevent such problems from recurring.

With a view to analysing the requirements of the USE (the Universal
Shaping Engine), I investigated the effects of canonical equivalence
on regular expressions.  I eventually discovered the relevant
mathematical theory: it replaces strings by 'traces', which for our
purposes are fully decomposed character strings modulo canonical
equivalence.  I found very little interest in the matter on this list.
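
To make 'traces' concrete, here is a minimal sketch (my own
illustration, in Python with only the standard unicodedata module and
marks of my choosing): combining marks with distinct non-zero
combining classes commute under canonical equivalence, and NFD picks
one representative order for each trace.

import unicodedata

acute = '\u0301'      # COMBINING ACUTE ACCENT, ccc = 230 (above)
dot_below = '\u0323'  # COMBINING DOT BELOW,    ccc = 220 (below)

# Distinct non-zero combining classes: the two orders are canonically
# equivalent, i.e. the marks commute, and NFD maps both spellings to
# the same representative of the trace.
nfd1 = unicodedata.normalize('NFD', 'a' + acute + dot_below)
nfd2 = unicodedata.normalize('NFD', 'a' + dot_below + acute)
assert nfd1 == nfd2 == 'a' + dot_below + acute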

I gave the example of the regular expression

[:InPC=Top:]*[:InPC=Bottom:]*

Converting that expression so that it also matches the NFD
equivalents, in accordance with UTS #18 Version 17 Section 2.1, is
non-trivial, though doable.  I have a feeling that some have claimed
that an expression like that is already in NFD.
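
Here is a rough sketch of the difficulty (my own illustration;
Python's standard re module has no InPC properties, so two marks of
different combining classes stand in for Top and Bottom).  The naive
pattern is not closed under canonical equivalence: it accepts a
string yet rejects that string's NFD form.

import re
import unicodedata

TOP = '\u0301'     # stand-in 'top' mark:    COMBINING ACUTE,     ccc = 230
BOTTOM = '\u0323'  # stand-in 'bottom' mark: COMBINING DOT BELOW, ccc = 220

# Naive analogue of [:InPC=Top:]*[:InPC=Bottom:]* over the stand-ins.
pattern = re.compile(TOP + '*' + BOTTOM + '*$')

s = TOP + BOTTOM                     # accepted as written...
assert pattern.match(s)

t = unicodedata.normalize('NFD', s)  # ...but NFD reorders the marks
assert t == BOTTOM + TOP
assert not pattern.match(t)          # equivalent string, no match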

> I don't think any algorithmic description would get all and only those
> strings that would be acceptable to writers of the language. What
> you'd end up with is a mechanism that had three values: clearly ok
> (eg, cat), clearly bogus (eg, a\u0308\u0308\u0308\u0308), and
> somewhere in between.

What have you got against 8th derivatives? -:)

You are looking at a different issue from me.  One of the issues is
rather that, for a word of one syllable, a pair of non-commuting
combining marks should have only one order per meaning, appearance
and pronunciation.  For non-Indic scripts, that is generally handled
by ensuring that different orders of non-commuting combining marks
render differently, as the sketch below illustrates.
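
A small counterpart to the earlier sketch (again my own illustration
in Python): marks sharing a combining class do not commute, so the
two orders survive NFD as distinct strings, and a renderer that
stacks such marks in encoded order makes the difference visible.

import unicodedata

acute, grave = '\u0301', '\u0300'   # both ccc = 230 (above)

# Same combining class: canonical ordering is stable, so NFD keeps
# the relative order of the marks and the two spellings stay
# distinct; each order can carry its own appearance.
one = unicodedata.normalize('NFD', 'a' + acute + grave)
two = unicodedata.normalize('NFD', 'a' + grave + acute)
assert one != two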

> If the goal for the script rules is to cover all languages customarily
> written with that script, one way to do that is to develop the
> language rules as they come, and make sure that the script rules are
> broadened if necessary for each language. But there is also utility
> to having the language rules, especially for high-frequency languages.

The language rules serve a different function.  The sequence
"xxxxlttttuuupppp" is clearly not English, but it is a perfectly
acceptable string for sorting, searching and rendering.

Richard.


