ID_Start, ID_Continue, and stability extensions

Steffen Nurpmeso sdaoden at yandex.com
Fri Apr 25 08:05:05 CDT 2014


Hello,

Markus Scherer <markus.icu at gmail.com> wrote:
 |On Thu, Apr 24, 2014 at 12:56 PM, Steffen Nurpmeso <sdaoden at yandex.com> wrote:
 |> Markus Scherer <markus.icu at gmail.com> wrote:
 |>|I strongly recommend you parse the derived properties rather than trying to
 |>|follow the derivation formula, because that can change over time.
 |>
 |> ..this file includes only those core properties that have
 |> themselves a derivation-may-change property?
 |
 |I don't know what that means.

 |What I tried to say is, if you need ID_Start, then parse ID_Start from
 |DerivedCoreProperties.txt. That's more stable (and easier than parsing the
 |pieces and deriving
 |
 |#      Lu + Ll + Lt + Lm + Lo + Nl
 |#    + Other_ID_Start
 |#    - Pattern_Syntax
 |#    - Pattern_White_Space
 |
 |yourself).

But i *do* need to parse quite a few pieces (since i'm hardly
interested in ID_Start alone)!

Unicode has DerivedAge.txt (i don't know where that is derived
from) and i need to parse PropList.txt anyway (to get the full
list of whitespace characters, for example).
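
For illustration only: DerivedCoreProperties.txt and PropList.txt share
the same "range ; property # comment" line layout, so one small reader
can serve both.  The following is a minimal Python sketch; the function
name parse_property and the inline sample are assumptions made for this
example, and the sample is a tiny excerpt rather than the real files.

```python
def parse_property(lines, wanted):
    """Collect (first, last) code point ranges for one property name."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue  # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        # A single code point has no "..", so it is its own range end.
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


# Illustrative excerpt only (hypothetical mix of lines from both files).
sample = """\
0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; ID_Start # Lo       FEMININE ORDINAL INDICATOR
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
""".splitlines()

id_start = parse_property(sample, "ID_Start")
white_space = parse_property(sample, "White_Space")
```

In real use the lines would come from the opened data files instead of
the inline sample; the reader itself would not need to change.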

So imho it's a bit like «Kraut und Rüben» («higgledy-piggledy»,
say; see <http://www.dict.cc/?s=Kraut+und+R%C3%BCben>).

 |For example, at least one of the derivation formulas (for Alphabetic) is
 |changing from 6.3 to 7.0.

That is interesting or frightening, i don't know yet.

Wouldn't it make sense to introduce a single PropListsJoined.txt
that does it all?  Or, for the sake of small and possibly
space-constrained projects..

  ?0[steffen at sherwood ]$ (cd ~/arena/docs.coding/unicode/data;
  > ll DerivedCore* PropList*)
   100 [.]   99531 25 Sep  2013 PropList.txt
   820 [.]  836985 25 Sep  2013 DerivedCoreProperties.txt

..and this is what i would do: offer a new file, say, Formula.txt,
which defines exactly the necessary formulas, e.g., to quote your
example:

 ID_Start
 < UnicodeData.txt
 < PropList.txt
 + Lu + Ll + Lt
 + Lm
 + Lo + Nl
 + Other_ID_Start
 - Pattern_Syntax
 - Pattern_White_Space
 =
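
A rough sketch of how a parser could consume such an entry, in Python.
Everything here is hypothetical: Formula.txt does not exist, the layout
(name line, "<" source lines, "+"/"-" terms, "=" terminator) is only
the one proposed above, the names parse_formula and evaluate are made
up, and the sets hold toy numbers, not real code points.  The formula
is the ID_Start derivation quoted earlier.

```python
import re


def parse_formula(text):
    """Split one formula entry into (name, source files, terms)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    name, sources, terms = lines[0], [], []
    for ln in lines[1:]:
        if ln == "=":  # end of this entry
            break
        if ln.startswith("<"):
            sources.append(ln[1:].strip())
        else:
            # A line may carry several terms, e.g. "+ Lu + Ll + Lt".
            terms += re.findall(r"([+-])\s*(\w+)", ln)
    return name, sources, terms


def evaluate(terms, sets):
    """Apply the terms in order: '+' is set union, '-' is set difference."""
    result = set()
    for op, prop in terms:
        if op == "+":
            result |= sets[prop]
        else:
            result -= sets[prop]
    return result


entry = """
ID_Start
< UnicodeData.txt
< PropList.txt
+ Lu + Ll + Lt
+ Lm
+ Lo + Nl
+ Other_ID_Start
- Pattern_Syntax
- Pattern_White_Space
=
"""

# Toy membership, invented for the example (not real UCD data).
toy_sets = {
    "Lu": {1}, "Ll": {2}, "Lt": {3}, "Lm": {4, 5}, "Lo": {6}, "Nl": {7},
    "Other_ID_Start": {8},
    "Pattern_Syntax": {5},
    "Pattern_White_Space": {9},
}

name, sources, terms = parse_formula(entry)
members = evaluate(terms, toy_sets)
```

Because the terms are applied in file order, the subtractions have to
come after the additions they are meant to trim, as they do in the
quoted derivation.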

That concept seems to be scalable at first glance.  If i understood
correctly, old parsers will no longer generate correct data in the
future?  At least there should be a formula-compatibility version
tag added somewhere, so that parsers can automatically prevent
themselves from generating incorrect data.

I don't know why there need to be megabytes of duplicated data.
Ach; and i'm not gonna start to dream of better support for ISO
C / POSIX character classes.  (Oh.  ...It's surely sapless.)

Ciao,

--steffen



