ID_Start, ID_Continue, and stability extensions

Markus Scherer at
Fri Apr 25 10:56:17 CDT 2014

On Fri, Apr 25, 2014 at 6:05 AM, Steffen Nurpmeso <sdaoden at>wrote:

>  |What I tried to say is, if you need ID_Start, then parse ID_Start from
>  |DerivedCoreProperties.txt. That's more stable (and easier than parsing
>  |the pieces and deriving
>  |
>  |#      Lu + Ll + Lt + Lm + Lo + Nl
>  |#    + Other_ID_Start
>  |#    - Pattern_Syntax
>  |#    - Pattern_White_Space
>  |
>  |yourself.
> But i *do* need to parse several many pieces (since i'm hardly
> interested in ID_Start only)!

That's ok. Wherever there is a choice, parse the derived property rather
than parsing the pieces and doing your own derivation.
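
For illustration only, a minimal sketch of what "parse the derived property"
can look like. The UCD property files use lines of the form
`code point or range ; property name # comment`. The helper names
`parse_property` and `has_property` and the three-line sample are made up
here; real input would be the full text of DerivedCoreProperties.txt.

```python
def parse_property(text, wanted):
    """Return sorted (first, last) code point ranges that carry `wanted`."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        # A single code point has no "..", so it is its own range end.
        ranges.append((int(first, 16), int(last or first, 16)))
    return sorted(ranges)


def has_property(ranges, cp):
    """Linear membership test; fine for a sketch, slow for real use."""
    return any(first <= cp <= last for first, last in ranges)


# Hypothetical excerpt in the UCD line format, not the real file.
SAMPLE = """\
0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; ID_Start # Lo       FEMININE ORDINAL INDICATOR
0030..0039    ; ID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE
"""

id_start = parse_property(SAMPLE, "ID_Start")
```

No formula is evaluated anywhere: the derived values are read as published,
so a change to the derivation in a later Unicode version needs no code change.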

> So imho it's a bit like «Kraut und Rüben» («higgledy-piggledy»
> says <>).

I know what that means :-)

> Wouldn't it make sense to introduce a single PropListsJoined.txt
> that does it all.

Depends. You could just parse the files you need. They don't have to be

I parse most of the UCD .txt files with a Python script and munge them into
one combined file. Then I have C++ code that parses that. (Years ago I did
parse the pieces and derive at runtime but found it tedious to follow the
formula changes, and if the data structure eliminates redundancy, then the
data size is about the same.)
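
The combining step could look roughly like the sketch below. It is a guess at
the general shape, not the actual script: the function names, the in-memory
`sources` argument, and the output layout are all invented for illustration,
and only binary properties are handled. It gathers every
`range ; property` line into a per-code-point set, then emits maximal runs of
adjacent code points that share the same set, which is one way to remove the
redundancy between files.

```python
from collections import defaultdict


def collect(sources):
    """Map each code point to the set of binary property names it has."""
    props = defaultdict(set)
    for text in sources:
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            fields = [f.strip() for f in line.split(";")]
            if len(fields) != 2:
                continue  # this sketch skips non-binary property lines
            first, _, last = fields[0].partition("..")
            for cp in range(int(first, 16), int(last or first, 16) + 1):
                props[cp].add(fields[1])
    return props


def fmt(first, last, names):
    """Format one run as 'XXXX[..YYYY];Prop1,Prop2' (hypothetical layout)."""
    span = f"{first:04X}" if first == last else f"{first:04X}..{last:04X}"
    return f"{span};{','.join(sorted(names))}"


def combine(sources):
    """Merge adjacent code points with identical property sets into runs."""
    props = collect(sources)
    lines = []
    run_start = prev = None
    for cp in sorted(props):
        if prev is not None and cp == prev + 1 and props[cp] == props[prev]:
            prev = cp          # extend the current run
            continue
        if prev is not None:
            lines.append(fmt(run_start, prev, props[prev]))
        run_start = prev = cp  # start a new run
    if prev is not None:
        lines.append(fmt(run_start, prev, props[prev]))
    return lines


# Hypothetical excerpts standing in for two separate UCD files.
derived = "0041..005A ; ID_Start\n0030..0039 ; ID_Continue\n0041..005A ; ID_Continue\n"
proplist = "0041..0046 ; Hex_Digit\n0030..0039 ; Hex_Digit\n"

combined = combine([derived, proplist])
```

With these inputs the letters split into two runs, because only A through F
also carry Hex_Digit.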

Unicode also publishes XML versions of the data, with most or all
properties in a single file. (It's just not as convenient for me to parse
XML in my tools, and the XML files were missing some pieces when I looked
at them.)
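
For anyone who does prefer the XML route, reading one property out of it might
be sketched as follows. The details are assumptions to check against the
published XML documentation: the namespace URI, the `cp` / `first-cp` /
`last-cp` attributes, and `IDS` as the short attribute name for ID_Start. The
sketch also assumes the flat variant, where each `char` element carries its
own attributes and nothing is inherited from a `group` element. The sample
document is made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the UCD XML files.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"


def id_start_ranges(xml_text):
    """Return (first, last) ranges for <char> elements with IDS="Y"."""
    root = ET.fromstring(xml_text)
    ranges = []
    for char in root.iter(UCD_NS + "char"):
        if char.get("IDS") != "Y":
            continue
        if char.get("cp") is not None:
            # Single code point entry.
            first = last = int(char.get("cp"), 16)
        else:
            # Range entry.
            first = int(char.get("first-cp"), 16)
            last = int(char.get("last-cp"), 16)
        ranges.append((first, last))
    return ranges


# Hypothetical miniature document in the assumed flat layout.
SAMPLE_XML = (
    '<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>'
    '<char cp="0041" IDS="Y" IDC="Y"/>'
    '<char cp="0030" IDS="N" IDC="Y"/>'
    '<char first-cp="3400" last-cp="4DB5" IDS="Y" IDC="Y"/>'
    "</repertoire></ucd>"
)

xml_id_start = id_start_ranges(SAMPLE_XML)
```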

You could also just use a library that provides these properties, rather
than roll your own.
Shameless plug for ICU here which has most of the low-level properties in
source code (from a generator), so no data loading for those. Ask the
list <> for help if needed.

> ..and this is what i would do: offer a new file, say, Formula.txt,
> which defines exactly the necessary formula, e.g., to quote your
> example

It's not "my example". I copied that straight out of

It's not worth writing a parser that handles all formulas (they are meant
for human consumption) and derives the properties from them when you can
just parse the derived property values.

> I don't know why there need to be megabytes of duplicated data.

It's easier to maintain the data in pieces, although we have to check the
derived results as well.
For implementers, the derived properties are the way to go.

> Ach; and i'm not gonna start to dream of better support for ISO
> C / POSIX character classes.  (Oh.  ...It's surely sapless.)

Best regards,
Google Internationalization Engineering