precomposed polytonic Greek characters with macrons and other diacritics
Markus Scherer
markus.icu at gmail.com
Mon Feb 8 13:10:20 CST 2016
On Mon, Feb 8, 2016 at 10:47 AM, James Tauber <jtauber at jtauber.com> wrote:
> Even with all this, though, my own work includes accentuation and
> syllabification algorithms, all of which are made more cumbersome by the
> lack of precomposed characters indicating vowel length. I'm currently
> leaning towards adding a layer of "character" processing on top of Python
> 3's otherwise decent support that effectively treats the relevant character
> sequences as single characters even if they aren't (and can't be)
> precomposed.
>
I suggest you normalize the text (NFC or NFD), and then look for "grapheme
clusters". http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
In C++ and Java, you could use an ICU BreakIterator for the latter.
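
Since the question was about Python 3, here is a rough sketch of the same
two steps using only the standard library. It is an approximation, not an
implementation of UAX #29: it just attaches every combining mark (general
category M*) to the preceding character, which should cover Greek vowels
with macrons, breathings and accents but ignores the other rules in the
report. The function name simple_clusters is made up for illustration.

```python
import unicodedata


def simple_clusters(text, form="NFD"):
    """Normalize text, then group each base character with the
    combining marks that follow it.

    Assumption: this is a simplified stand-in for UAX #29 grapheme
    clusters. It only handles "base + combining marks" sequences.
    """
    normalized = unicodedata.normalize(form, text)
    clusters = []
    for ch in normalized:
        # General categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# U+1FB1 (alpha with macron) followed by a combining acute, then nu.
# There is no precomposed alpha with macron and acute, so the vowel
# stays a multi-code-point sequence in both NFC and NFD.
word = "\u1FB1\u0301\u03BD"
for cluster in simple_clusters(word):
    print(cluster, [f"U+{ord(c):04X}" for c in cluster])
```

Each list element can then be treated as one "character" by accentuation
or syllabification code. For the full boundary rules, I believe the
third-party regex module matches extended grapheme clusters with \X, and
ICU's BreakIterator is also available to Python through PyICU.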
markus