Thai Word Breaking
richard.wordingham at ntlworld.com
Sat Aug 22 08:35:30 CDT 2015
I'm trying to work out the meaning of TUS 8.0 Section 23.2.
To do Thai word breaking properly, one needs to do a semantic analysis
of the text, the equivalent of resolving 'humanevents' into
'human events' rather than 'humane vents'. One also
needs to cope with unknown and misspelt words. (A lot of effort has
been devoted to avoiding the extreme of doing semantic analysis.)
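As an illustration only, the ambiguity can be reproduced with a toy
dictionary-based segmenter. The lexicon and the function name
segmentations are made up for this sketch; a real Thai word breaker
would use a large dictionary and some way of ranking the candidates.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words drawn from lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical four-word lexicon, just enough to show the ambiguity.
LEXICON = {"human", "humane", "events", "vents"}
candidates = segmentations("humanevents", LEXICON)
for words in candidates:
    print(" ".join(words))
```

Both 'human events' and 'humane vents' come out as valid splits; the
dictionary alone cannot choose between them.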
However, it is possible to read Section 23.2 as prohibiting the use of
certain information, and I would like to check whether this is the case.
The opening paragraph seems clear enough on first reading:
"The effect of layout controls is specific to particular text processes.
As much as possible, lay-out controls are transparent to those text
processes for which they were not intended. In other words, their
effects are mutually orthogonal."
However, my first question is, "Are paragraph boundaries
directly admissible as evidence for or against word boundaries not
adjacent to them?" For example, most Thai word breakers would not
regard a paragraph boundary as any more significant than a
phrase-delimiting space. However, a paragraph boundary often indicates
a change of topic.
My second question is, "Are line breaks admissible as evidence for
or against word boundaries not adjacent to them?" For example, if a
phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce
that it is likely that all word boundaries within it are marked
explicitly. This example is more useful for Khmer than for Thai, for
whereas Cambodians were once taught to mark word boundaries, Thais
rarely use ZWSP to mark word boundaries.
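A rough sketch of that kind of deduction follows. The function names
and the threshold max_chunk_len are assumptions chosen purely for
illustration, and lengths are counted in code points; whether such
evidence is admissible at all is exactly the question being asked.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def zwsp_chunks(phrase):
    """Split a phrase at ZWSP, dropping empty pieces."""
    return [chunk for chunk in phrase.split(ZWSP) if chunk]


def looks_fully_marked(phrase, max_chunk_len=8):
    """Guess whether every word boundary in phrase is marked by ZWSP.

    Heuristic (an assumption, not taken from any specification): the
    phrase contains at least one ZWSP and no run between ZWSPs is longer
    than max_chunk_len code points.
    """
    chunks = zwsp_chunks(phrase)
    return len(chunks) > 1 and max(len(c) for c in chunks) <= max_chunk_len
```

If looks_fully_marked returns true, a word breaker could decline to
propose any further boundaries inside the phrase.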
My third question is, "Is the absence of a line break opportunity
admissible as evidence for or against a word boundary?" Here I
see conflicting signals.
There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded
as the counterpart of ZWSP. The understanding was that ZWSP marked a
word boundary and provided a line-break opportunity, while WJ denied
both. This, however, is no longer the case. To quote the TUS section:
P2S1: The word joiner must not be confused with the zero width joiner
or the combining grapheme joiner, which have very different functions.
P2S2: In particular, inserting a word joiner between two characters has
no effect on their ligating and cursive joining behavior.
P2S3: The word joiner should be ignored in contexts other than line
breaking.
P2S4: Note in particular that the word joiner is ignored for word
segmentation.
P2S5: (See Unicode Standard Annex #29, “Unicode Text Segmentation.”)
Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in
word-breaking, but perhaps it does not if line-breaking is being used
as evidence for word boundaries.
P2S4 has three very different interpretations:
(i) This is an assertion of fact, and may therefore be incorrect.
(ii) The word 'is' is sloppy wording for 'should be'. Section 23.2
contains much sloppier wording, as I have already advised members of
the UTC (4 July 2015).
(iii) This is a deduction from other parts of the specification. Now,
if P2S4 said 'is normally ignored for word segmentation', that would
have made sense, for that applies to the default word boundary
specification in UAX#29. However, just before Section 4.1, UAX#29
explains that it does not specify what happens for word boundary
determination in Thai! (It does constrain what happens, though.)
At the end of UAX#29 Section 6.2, there is the provision, "The Ignore
rules should not be overridden by tailorings, with the possible
exception of remapping some of the Format characters to other
classes." To accord with the user perceptions of Unicode-aware
people who work with SE Asian scripts, I am tempted to ask for CLDR
to tailor the word-breaking algorithms for the corresponding languages
so that the word-breaking classes of WJ (and ZWNBSP) are changed from
Format to MidLetter. That would match the widespread old *perception*
that there should be no word break in a sequence <Thai letter, (Thai
mark,)* WJ, Thai letter>. However, there are several objections:
(a) Perhaps P2S3 and P2S4 prohibit this.
(b) If the word-break property of Thai letters falls back to Other,
there would still be a word break between them.
(c) If the word-break property of Thai letters fell back to ALetter,
an old suggestion, WJ would have no effect on the presence of a word
boundary.
(d) If Thai word breaking assigns word-break classes to each letter
(gc=Lo), then word boundaries can be suppressed by choosing the classes
appropriately. Non-spacing Thai vowels are very relevant to Thai
word-breaking, but formally are 'ignored'. WJ could be 'ignored' in
exactly the same way.
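For concreteness, here is a minimal sketch of the old perception as a
post-processing step: candidate break offsets, from whatever Thai word
breaker is in use, are dropped when they fall on either side of a WJ
in a sequence <Thai letter, (Thai mark,)* WJ, Thai letter>. All
function names are hypothetical, offsets are code point offsets, and
this is not what UAX#29 or CLDR specifies.

```python
import unicodedata

WJ = "\u2060"  # U+2060 WORD JOINER


def is_thai(ch):
    return "\u0e00" <= ch <= "\u0e7f"


def is_thai_letter(ch):
    return is_thai(ch) and unicodedata.category(ch) == "Lo"


def is_thai_mark(ch):
    return is_thai(ch) and unicodedata.category(ch) == "Mn"


def wj_joins_thai_letters(text, i):
    """True if text[i] is the WJ in <Thai letter, (Thai mark)*, WJ, Thai letter>."""
    if text[i] != WJ or i + 1 >= len(text) or not is_thai_letter(text[i + 1]):
        return False
    j = i - 1
    while j >= 0 and is_thai_mark(text[j]):
        j -= 1  # skip back over non-spacing marks
    return j >= 0 and is_thai_letter(text[j])


def suppress_breaks_at_wj(text, breaks):
    """Drop candidate break offsets immediately before or after such a WJ."""
    blocked = set()
    for i, ch in enumerate(text):
        if ch == WJ and wj_joins_thai_letters(text, i):
            blocked.update((i, i + 1))
    return [b for b in breaks if b not in blocked]
```

A break offset b here means a boundary between text[b - 1] and
text[b], so a WJ at index i blocks offsets i and i + 1.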