Hyphenation

Cameron Dutro cameron at lumoslabs.com
Wed Feb 4 13:24:32 CST 2015


Thanks Jukka. I did some research and found out that LibreOffice (and
OpenOffice) uses a dictionary-based approach via the Hunspell project. They
have dictionaries for quite a few languages. Hunspell and TeX use an
algorithm developed at Stanford in a dissertation by Franklin Liang that
describes the format of such dictionaries and how to identify potential
hyphen locations in text. I realize this won't work for all non-dictionary
words, but Liang's algorithm purportedly does work for a great many of
them. I've attached a .pdf summary of how it works. Anyway, it's a place to
start. CLDR could perhaps incorporate the hyphenation dictionaries from
LibreOffice since I believe they're fairly permissively licensed.
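
For anyone who doesn't want to open the attachment, here is a minimal Python sketch of the pattern-matching idea as I understand it. The pattern list is a tiny illustrative set, not a real hyphenation dictionary, and the function names and the left/right minimums are my own assumptions rather than anything taken from Hunspell or TeX. Digits in a pattern score the gap between letters; the highest score at each gap wins, and an odd score permits a break while an even one forbids it.

```python
# Toy pattern list for illustration only (assumption: not a real dictionary).
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into ('hyph', [0, 0, 3, 0, 0])."""
    letters = []
    digits = [0]  # digits[i] scores the gap just before letters[i]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters.append(ch)
            digits.append(0)
    return "".join(letters), digits


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` split at the permitted break points."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[j] is the gap before padded[j]

    # Try every substring of the padded word against the pattern table
    # and keep the maximum score seen at each gap.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            digits = table.get(padded[start:end])
            if digits is not None:
                for offset, value in enumerate(digits):
                    pos = start + offset
                    scores[pos] = max(scores[pos], value)

    # The gap before word[k] is scores[k + 1] because of the leading dot.
    pieces = []
    last = 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:  # odd score: a break is allowed here
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces


print("-".join(hyphenate("hyphenation", TOY_PATTERNS)))  # hy-phen-ation
```

A word that matches no pattern simply comes back in one piece, which is the failure mode for words the patterns don't cover.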

-Cameron

On Wed, Feb 4, 2015 at 10:57 AM, Jukka K. Korpela <jkorpela at cs.tut.fi>
wrote:

> 2015-02-04, 19:58, Cameron Dutro wrote:
>
>> It is often the case, especially on smaller screens, that long words
>> must be hyphenated so they wrap in a natural way. As far as I can tell,
>> the CLDR data set does not define hyphenation rules.
>>
>
> That is correct. And they cannot really be described using the techniques
> currently deployed in CLDR.
>
>> I'm not even really
>> sure what the hyphenation rules should be for English.
>>
>
> They vary by version of English (and by authority).
>
>> The implementation I've seen uses a dictionary - maybe it's identifying
>> potential breaks at syllable boundaries?
>>
>
> Some simple hyphenators are dictionary-driven. But this does not work well
> even for English, since any word not in the dictionary would remain
> unhyphenated. It does not work well at all for languages that have, say, a
> thousand inflected forms for each verb or noun – but may have simple
> algorithmic rules for hyphenation.
>
> Hyphenation strategies vary greatly by language. At present, the best you
> can do is to try to find suitable hyphenation software for the languages
> that are relevant to you.
>
> Yucca
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tb87nemeth.pdf
Type: application/pdf
Size: 166735 bytes
Desc: not available
URL: <http://unicode.org/pipermail/cldr-users/attachments/20150204/0ea33acd/attachment-0001.pdf>