Machine readable segmentation rules (was Re: ◌ in LB28a in UAX14 of Unicode 15.1.0)

Daniel Bünzli daniel.buenzli at erratique.ch
Mon Sep 4 18:11:01 CDT 2023


Thanks for your links.

> It didn't seem worth it for a one-off,

I understand. I have been annoyed more than once by the segmenters starting to be defined over character sets specified by regexps rather than by segmentation class properties. But if that’s an easier way forward to define meaningful segmentations for end users, I’m willing to play along.

> Unofficially, we have such a version in the tools code that generates the
> test data:

The reason I’m asking is that it’s becoming quite clear to me that I took a wrong turn in how I implemented the Unicode segmenters 10 years ago. Or rather, what looked reasonable then no longer is: grapheme clusters could be implemented by a 2D lookup table, and the other segmentation rules were simple enough to implement in an ad-hoc and *streaming* fashion.
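To make that concrete, the 2D lookup I mean is essentially this (an OCaml sketch, not actual code from my segmenters; the class list and table entries are illustrative and partial, and the newer rules like GB9c or GB11–GB13 need extra state beyond a pure pair table):

  (* Sketch: grapheme cluster boundaries as a pure 2D lookup over the
     Grapheme_Cluster_Break property of the previous and next scalar
     value. Illustrative and partial, not the actual rule set. *)

  type gcb = CR | LF | Control | Extend | ZWJ | L | V | T | LV | LVT | Other

  let to_int = function
  | CR -> 0 | LF -> 1 | Control -> 2 | Extend -> 3 | ZWJ -> 4
  | L -> 5 | V -> 6 | T -> 7 | LV -> 8 | LVT -> 9 | Other -> 10

  (* no_break.(i).(j) is true iff there is no boundary between a scalar
     value of class i and a following one of class j. A real table would
     be generated from the rules rather than written by hand. *)
  let no_break =
    let t = Array.make_matrix 11 11 false in
    t.(to_int CR).(to_int LF) <- true;                        (* GB3 *)
    List.iter (fun (a, b) -> t.(to_int a).(to_int b) <- true) (* GB6-GB8 *)
      [ L, L; L, V; L, LV; L, LVT; LV, V; LV, T; V, V; V, T; LVT, T; T, T ];
    List.iter (fun c -> t.(to_int c).(to_int Extend) <- true; (* GB9, partial *)
                        t.(to_int c).(to_int ZWJ) <- true)
      [ Extend; ZWJ; L; V; T; LV; LVT; Other ];
    t

  let boundary prev next = not no_break.(to_int prev).(to_int next)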

Nowadays this reasonable thing has grown into *horribly* ad-hoc and convoluted implementations of the segmenters. Sometimes the ad-hocness is nice because it allows the spec to be cleaned up [1], but pragmatically I end up spending way too much time when new rules get added. I need to split character classes myself (e.g. in this case AL), massage the rules, and cope with their changing complexity (e.g. the context grows). All this increases the probability that I implement them wrongly despite the test suite passing. I thought these rules would eventually stabilize the way the normalization standard (UAX #15) did, but that doesn’t seem to be the case (this is not a complaint).

So I wonder whether we could steer the segmentation standards towards the definition of some kind of general rule-based segmentation machine, with machine-readable rule specifications defined in the UCD.

For implementers it would be a matter of implementing this generic machine once; updating the segmenters would then just be a matter of feeding it the new rules, the way we update normalization data on new Unicode releases. It would also likely make it easier for APIs to provide hooks for tailoring (which might benefit end users too).

It seems to me that, except for the (annoying) rewrite rules, we are actually not far from that. IIRC the operational model of all the segmenters is: between each two code points of the string to segment, apply the rules in order and take the first one whose regexps match on the left and on the right to define the boundary status of that position. So a segmenter’s UCD data file could simply be:

 SEGMENTER := (RULE "\n")*
 RULE := REGEXP ("×" | "÷" | "!") REGEXP
 REGEXP := … # Unicode regexp as per UTS #18 syntax

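With that model the machine itself is tiny. Here is an OCaml sketch of what I have in mind, with the UTS #18 regexp engine abstracted behind a module parameter since that is the part implementers would have to provide; all the names here are mine, nothing official:

  (* Sketch of the generic machine. The rules are pure data, parsed from
     a data file like the one sketched above; the regexp engine is
     abstracted since the stdlib has none that works on scalar values. *)

  module type REGEXP = sig
    type t
    val match_suffix : t -> Uchar.t array -> bool (* matches at the end of the text before the position *)
    val match_prefix : t -> Uchar.t array -> bool (* matches at the start of the text after it *)
  end

  module Machine (R : REGEXP) = struct
    type status = Break | No_break | Mandatory  (* ÷ | × | ! *)
    type rule = { left : R.t; right : R.t; status : status }

    (* Rules are tried in order; the first one whose left regexp matches
       before [pos] and whose right regexp matches after it decides the
       boundary status. If none applies, the implicit final "÷ Any"
       rule does. *)
    let boundary_status rules text pos =
      let before = Array.sub text 0 pos in
      let after = Array.sub text pos (Array.length text - pos) in
      let applies r =
        R.match_suffix r.left before && R.match_prefix r.right after
      in
      match List.find_opt applies rules with
      | Some r -> r.status
      | None -> Break
  end

A streaming implementation would of course not slice arrays like this, but the point is that nothing in the machine needs to change when the rules do.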
For me this would be a much more sustainable approach from a maintenance point of view, similar to the one I have for normalization, where I simply regenerate compact data from the UCD on each new Unicode release.

Best, 

Daniel

[1]: https://www.unicode.org/mail-arch/unicode-ml/y2020-m03/0000.html
