ID_Start, ID_Continue, and stability extensions

Steffen Nurpmeso sdaoden at
Fri Apr 25 18:24:11 CDT 2014

Markus Scherer < at> wrote:
 |On Fri, Apr 25, 2014 at 6:05 AM, Steffen Nurpmeso <sdaoden at> wrote:
 |> So imho it's a bit like «Kraut und Rüben» («higgledy-piggledy»
 |> says <>).
 |I know what that means :-)

Hmm, possibly a bit too strongly worded.
It was in no way meant as a personal attack against a real person.
Unicode grew over two decades, so it is only logical that this
results in loose tissue here and there.

 |I parse most of the UCD .txt files with a Python script and munge them into

Ugh, this sounds terrible!  Programmers should have the option to
choose the right tools for the right tasks.  I mean, payment and
everything is nice, but in the end it is our own lifetime...

 |Unicode also publishes XML versions of the data, with most or all

Yes, sorry, but i'm not taking a soapy bath in a privately owned
ocean but instead am dealing with a washtub.
150 MB of shock-headed data that even machines have trouble with!
In the end the text files i need will be a tenth of that, and
i'm working with them (especially UnicodeData.txt) countless
times, i.e., direct human <-> text interaction.

 |You could also just use a library that provides these properties, rather
 |than roll your own.
 |Shameless plug for ICU here which has most of the low-level properties in
 |source code (from a generator), so no data loading for those. Ask the
 |list <> for help if needed.

But there still *are* products their creators can be proud of, so
no need for pudency of any kind, imho.
It is of course not as common as in other cultures, say, among
Turkish goldsmiths, African silversmiths or Japanese swordsmiths
and ceramists et cetera, but that makes it all the more remarkable.


Maybe i will switch to a two-pass approach for my own little
project, in order to use the final category.  Right now i'm
single-pass and am thus required to use ugly things like, e.g.,

      {.name="Other_Alphabetic", .props=sct_ALPHA, .addprint=true},
      {.name="Ideographic", .props=sct_IDEOGRAPH, .addprint=true},

      /* Control characters, including the Zl and Zp separators (imho misplaced;
       * they should go to C), are not PRINTable */
      if (pp->addprint && !(p & (sct_Cc | sct_Cs | sct_Co | sct_Zl | sct_Zp))) {
         p |= sct_PRINT;
         /* And whitespace is not GRAPHical */
         if (!(p & sct_Zs))
            p |= sct_GRAPH;
      }
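
For readers who want to see the two fragments in one piece, here is a
minimal, self-contained sketch of how they might fit together.  Only the
sct_* names and the .name/.props/.addprint fields come from the snippet
above; the bit values, the struct name `proprule`, the table name `rules`
and the function `apply_rule` are assumptions made up for illustration and
are not the actual project code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments; only the sct_* names are from the mail */
enum {
   sct_Cc = 1u << 0, sct_Cs = 1u << 1, sct_Co = 1u << 2,
   sct_Zl = 1u << 3, sct_Zp = 1u << 4, sct_Zs = 1u << 5,
   sct_ALPHA = 1u << 6, sct_IDEOGRAPH = 1u << 7,
   sct_PRINT = 1u << 8, sct_GRAPH = 1u << 9
};

/* One entry per property name found in the property files (assumed layout) */
struct proprule {
   char const *name;    /* Property name as it appears in the data file */
   uint32_t props;      /* Flags to set for each matching code point */
   bool addprint;       /* Whether PRINT/GRAPH may be derived right away */
};

static struct proprule const rules[] = {
   {.name="Other_Alphabetic", .props=sct_ALPHA, .addprint=true},
   {.name="Ideographic", .props=sct_IDEOGRAPH, .addprint=true},
};

/* Single pass: merge one rule into the flags `p' collected so far for a
 * code point and return the updated flags */
static uint32_t
apply_rule(uint32_t p, struct proprule const *pp)
{
   assert(pp != NULL);

   p |= pp->props;
   /* Control characters, including the Zl and Zp separators, are not
    * PRINTable */
   if (pp->addprint && !(p & (sct_Cc | sct_Cs | sct_Co | sct_Zl | sct_Zp))) {
      p |= sct_PRINT;
      /* And whitespace is not GRAPHical */
      if (!(p & sct_Zs))
         p |= sct_GRAPH;
   }
   return p;
}
```

Calling `apply_rule(0, &rules[0])` yields ALPHA plus PRINT and GRAPH, while
a code point already flagged sct_Cc only gains ALPHA.  The weakness of the
single pass shows here too: the result depends on the general category bits
already being present in `p' when the rule is applied, which is what a
two-pass design would avoid.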

 |Best regards,

Oh.  No mention of this brilliant idea of mine, PropRecipe.txt?
Have a nice weekend. :)




 |Google Internationalization Engineering

Oh Google, cute little thing you.
