Algorithms for Unicode script detection

Wed Jul 5 18:59:26 CDT 2017

On Thu, Jul 06, 2017 at 09:43:29AM +1000, Simon Cozens via Unicode wrote:
> I want to segment a Unicode text into runs according to their script.
> I've had a look through UAX#24 in the hope of finding a standard
> algorithm for doing this, but there isn't one specified. The
> implementation section gives some good pointers for what to be careful
> with (paired punctuation, etc.) but I can't find a step-by-step
> algorithm similar to the bidi algorithm or collation algorithm.
> 
> Equally, I don't see anything in ICU that segments into script-based
> runs. You can get script properties, but that doesn't help you resolve
> common characters in the context of a run.
> 
> Does anyone know of an open-source algorithm for doing this?

There is source/extra/scrptrun/ in the ICU source tree (though it is not
part of the API); apparently it is used by ICU's ParagraphLayout library.
(A copy of this code is used by Pango, and another copy by LibreOffice.)
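
To give a rough idea of what resolving common characters in the context
of a run can look like, here is a minimal Python sketch. It is not the
scrptrun code and not a standard algorithm. The names toy_script,
script_runs and _TOY_RANGES are made up for this example, and the range
table is a tiny hypothetical stand-in for the real Script property data
(Scripts.txt). The rule it assumes: Common and Inherited characters join
the run they appear in, and a run that starts with such characters takes
the script of the first "real" character that follows. It leaves out the
paired-punctuation handling mentioned above.

```python
from typing import List, Tuple

# Hypothetical, partial stand-in for the Unicode Script property.
# A real implementation would load Scripts.txt or use a Unicode library.
_TOY_RANGES = [
    (0x0041, 0x005A, "Latin"),      # A-Z
    (0x0061, 0x007A, "Latin"),      # a-z
    (0x0300, 0x0341, "Inherited"),  # some combining marks
    (0x03B1, 0x03C9, "Greek"),      # small Greek letters
    (0x0410, 0x044F, "Cyrillic"),   # basic Russian letters
    (0x05D0, 0x05EA, "Hebrew"),     # Hebrew letters
]

# Scripts that never start a new run on their own.
_NEUTRAL = {"Common", "Inherited"}


def toy_script(ch: str) -> str:
    """Return a script name for ch; anything not in the table is Common."""
    cp = ord(ch)
    for lo, hi, name in _TOY_RANGES:
        if lo <= cp <= hi:
            return name
    return "Common"


def script_runs(text: str) -> List[Tuple[int, int, str]]:
    """Split text into (start, end, script) runs, end exclusive."""
    runs: List[List] = []
    for i, ch in enumerate(text):
        sc = toy_script(ch)
        if not runs:
            # First character: neutral characters open a "Common" run.
            runs.append([i, i + 1, "Common" if sc in _NEUTRAL else sc])
            continue
        cur = runs[-1]
        if sc in _NEUTRAL or sc == cur[2]:
            # Neutral or same-script character: extend the current run.
            cur[1] = i + 1
        elif cur[2] == "Common":
            # Run so far was only neutral characters: adopt this script.
            cur[1] = i + 1
            cur[2] = sc
        else:
            # A different real script: start a new run.
            runs.append([i, i + 1, sc])
    return [(start, end, sc) for start, end, sc in runs]
```

With this sketch, script_runs("Hello, мир!") yields one Latin run that
includes the comma and space, followed by one Cyrillic run that includes
the exclamation mark. Attaching trailing neutral characters to the
preceding run is a design choice of the sketch, not something UAX#24
requires.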

Regards,
Khaled