Chinese Word Breaking
c933103 at gmail.com
Tue Jul 21 05:10:14 CDT 2015
Modern Chinese text is written without any break between words. If you
segment it by runs of ideographic characters, what gets grouped together is
therefore a clause or a sentence, or even a whole paragraph if you are
handling older text without punctuation.
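
To make that concrete, here is a minimal Python sketch of grouping by runs
of ideographs. The function name, the sample sentence, and the choice of
code point ranges (only the CJK Unified Ideographs block and Extension A)
are my own assumptions for illustration, not anything defined by UAX #29.

```python
import re

# Assumption: approximate "ideographic characters" with two blocks only,
# CJK Unified Ideographs (U+4E00..U+9FFF) and Extension A (U+3400..U+4DBF).
IDEOGRAPH_RUN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]+")

def ideograph_runs(text):
    """Return the maximal runs of consecutive ideographs found in text."""
    return IDEOGRAPH_RUN.findall(text)

sample = "我今天去了圖書館，然後回家。"  # hypothetical sample sentence
print(ideograph_runs(sample))
# prints ['我今天去了圖書館', '然後回家']
```

Each run is a whole clause, split only at the punctuation, so nothing here
comes close to a word boundary.
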
Also, that group of characters is not used solely by Modern Standard
Chinese. For example, Japanese has expressions like 満を持す, whose four
characters are generally treated as one word even though, as you can see,
it is a mix of ideographs and hiragana. Similarly, Taiwanese (nan) users
also write Latin letters together with these ideographs to form words. In
these cases, if you changed CB to ID in that statement, what you select
would be just part of a word.
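
A rough Python sketch of the Japanese case, splitting the expression into
runs of the same script. Taking the first word of the Unicode character
name as the "script" is a crude stand-in of mine for the real Script
property, and the helper names are hypothetical.

```python
import unicodedata
from itertools import groupby

def script_of(ch):
    # Assumption: the first word of the character name ("CJK", "HIRAGANA",
    # "LATIN", ...) is used as a crude proxy for the Script property.
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_runs(text):
    """Group consecutive characters that share the same crude script label."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]

print(script_runs("満を持す"))
# prints [('CJK', '満'), ('HIRAGANA', 'を'), ('CJK', '持'), ('HIRAGANA', 'す')]
```

The single word comes out as four pieces, and selecting only the
ideographic ones would leave 満 and 持 as two separate fragments.
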
And at the character level you cannot even tell which language a character
is written in, let alone tell which characters form a word. In fact, in
Literary Chinese (lzh), most of these characters can be considered words
by themselves.
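
As a small illustration of the language point, the Python sketch below
compares two short strings, one Chinese and one Japanese. Both strings and
the helper name are hypothetical examples of mine.

```python
import unicodedata

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"

chinese = "山上有人"   # hypothetical Chinese sample
japanese = "山に登る"  # hypothetical Japanese sample

# Characters that occur in both strings are the very same code points.
shared = set(chinese) & set(japanese)
for ch in sorted(shared):
    print(ch, describe(ch))
# prints 山 U+5C71 CJK UNIFIED IDEOGRAPH-5C71
```

The character data is identical in both strings, so any language
information has to come from outside the character itself.
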
On 21 July 2015 at 2:59 PM, "Richard Wordingham" <richard.wordingham at ntlworld.com> wrote:
> I'm puzzled by a statement in UAX #29 Unicode Text Segmentation:
> "In particular, the characters with the Line_Break property values of
> Contingent_Break (CB), Complex_Context (SA/Southeast Asian), and
> Unknown (XX) are assigned word boundary property values based on
> criteria outside of the scope of this annex. That means that
> satisfactory treatment of languages like Chinese or Thai requires
> special handling."
> Is 'Contingent_Break (CB)' an error for 'Ideographic (ID)'? That would
> make sense for Chinese, for some applications need to group ideographs
> into words.
> While I am on the topic, does anyone know of character level
> mechanisms used to advise algorithms of the word boundaries (or lack
> of boundaries) in Chinese text?