precomposed polytonic Greek characters with macrons and other diacritics
liz at dijkmat.nl
Mon Feb 8 13:29:35 CST 2016
> On 08 Feb 2016, at 20:10, Markus Scherer <markus.icu at gmail.com> wrote:
> On Mon, Feb 8, 2016 at 10:47 AM, James Tauber <jtauber at jtauber.com> wrote:
> Even with all this, though, my own work includes accentuation and syllabification algorithms, all of which are made more cumbersome by the lack of precomposed characters indicating vowel length. I'm currently leaning towards adding a layer of "character" processing on top of Python 3's otherwise decent support that effectively treats the relevant character sequences as single characters even if they aren't (and can't be precomposed).
> I suggest you normalize the text (NFC or NFD), and then look for "grapheme clusters". http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> In C++ and Java, you could use an ICU BreakIterator for the latter.
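In Python 3, the normalize-then-cluster suggestion might look roughly like the sketch below. It is a simplification, not a full implementation of the TR29 grapheme cluster rules: it only attaches characters in a Mark category (Mn, Mc, Me) to the preceding character, which should be enough for Greek vowels carrying macrons, breathings and accents. The function name `clusters` is made up for illustration.

```python
import unicodedata


def clusters(text, form="NFD"):
    """Split text into base-plus-combining-mark units.

    Simplified assumption: a cluster is one character followed by any
    number of characters whose general category starts with "M".
    This is NOT the complete UAX #29 algorithm.
    """
    normalized = unicodedata.normalize(form, text)
    result = []
    for ch in normalized:
        if result and unicodedata.category(ch).startswith("M"):
            # Combining mark: attach it to the cluster in progress.
            result[-1] += ch
        else:
            # Anything else starts a new cluster.
            result.append(ch)
    return result


# Alpha with macron (U+1FB1) plus a combining acute, followed by nu.
# There is no precomposed alpha-with-macron-and-acute, so even under NFC
# the first cluster remains a sequence of more than one code point.
word = "\u1fb1\u0301\u03bd"
print([len(c) for c in clusters(word, "NFD")])
print([len(c) for c in clusters(word, "NFC")])
```

Accentuation and syllabification code can then iterate over the returned list instead of over the raw string, so each vowel with all its diacritics is handled as one unit.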
Might I suggest looking at Rakudo Perl 6’s implementation of NFG (Normalization Form Grapheme), which generates synthetic codepoints on the fly under the hood.
For an introduction, see http://jnthn.net/papers/2015-spw-nfg.pdf
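To give a feel for the idea in Python terms, here is a toy sketch of synthetic codepoints. It is not how Rakudo implements NFG; the class name `SyntheticTable`, the use of negative integers as synthetic IDs, and the simplified cluster rule (a character followed by Mark-category characters) are all assumptions made for illustration.

```python
import unicodedata


class SyntheticTable:
    """Toy NFG-like encoder: one integer per grapheme (hypothetical)."""

    def __init__(self):
        self._ids = {}        # cluster string -> synthetic id (negative)
        self._clusters = []   # index (-id - 1) -> cluster string

    def encode(self, text):
        """Return a list of ints, one per (simplified) grapheme."""
        units = []
        for ch in unicodedata.normalize("NFC", text):
            if units and unicodedata.category(ch).startswith("M"):
                units[-1] += ch   # combining mark joins previous unit
            else:
                units.append(ch)
        codes = []
        for unit in units:
            if len(unit) == 1:
                # A single code point stands for itself.
                codes.append(ord(unit))
            else:
                # Multi-code-point cluster: allocate a synthetic id once.
                if unit not in self._ids:
                    self._clusters.append(unit)
                    self._ids[unit] = -len(self._clusters)
                codes.append(self._ids[unit])
        return codes

    def decode(self, codes):
        """Turn a list of ints back into an NFC string."""
        return "".join(
            chr(c) if c >= 0 else self._clusters[-c - 1] for c in codes
        )


table = SyntheticTable()
codes = table.encode("\u03b1\u0304\u0301\u03bd")  # alpha+macron+acute, nu
print(len(codes))  # one integer per grapheme
```

With a representation like this, indexing and length work per grapheme, which is the property the accentuation and syllabification code is after.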