Thai Word Breaking

Richard Wordingham richard.wordingham at
Thu Aug 27 18:09:52 CDT 2015

On Thu, 27 Aug 2015 21:49:45 +0200 (CEST)
Marcel Schneider <charupdate at> wrote:

> On 22 Aug 2015 at 15:47, Richard Wordingham wrote:

> Still nobody answered the questions Richard Wordingham raised five
> days ago.

There are not many people in a position to say what unclear sections
of TUS are intended to mean.  I may have scared them into silence by
noting that people who change code because of one particular *new*
sentence in Section 23.2, namely:

> > P2S4: Note in particular that the word joiner is ignored for word
> > segmentation.

are at risk (but see below) of putting themselves in breach of the UK's
'Equality Act 2010'; more generally, they may be in breach of
transpositions of the EU Racial Equality Directive (2000/43/EC).  You
don't need to have racialist intentions to be in breach.

> > (ii) The word 'is' is sloppy wording for 'should be'. Section 23.2
> > contains much sloppier wording, as I have already advised members of
> > the UTC (4 July 2015).

This comment applies to the part of Section 23.2 that refers to U+FEFF
ZERO WIDTH NO-BREAK SPACE (ZWNBSP).  UTC members were advised that, to be
consistent, that part should receive changes corresponding to those made
for U+2060 WORD JOINER (WJ).  No such changes were made to the text on
ZWNBSP, so I can read Section 23.2 as saying that ZWNBSP can be used to
mark word boundaries whereas WJ cannot.  Reading the standard this way
would probably protect the writers of text editors (including word
processors) from the European legislation against indirect
discrimination.  It's still a shame about the degradation of old text
that uses WJ instead of ZWNBSP, but such text should still render fine
if one switches spell-checking off.  Word counts will change, though.
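
To make the word-count point concrete, here is a rough sketch in Python.
It is only a toy counter, not an implementation of UAX #29 or of any real
editor's behaviour; the function name count_words, its boundary_marks
parameter and the sample string are all made up for illustration.  It
treats a chosen invisible character as a word boundary, or else ignores
it, and counts what is left.

```python
import unicodedata

WJ = "\u2060"      # U+2060 WORD JOINER
ZWNBSP = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE


def count_words(text, boundary_marks=frozenset()):
    """Toy word counter (hypothetical, for illustration only).

    WJ or ZWNBSP listed in boundary_marks is treated as a word
    boundary; otherwise it is ignored, i.e. simply deleted.
    The remaining text is split on whitespace.
    """
    for mark in (WJ, ZWNBSP):
        replacement = " " if mark in boundary_marks else ""
        text = text.replace(mark, replacement)
    return len(text.split())


# Made-up sample: two Thai words with WJ between them, as in old text
# that used WJ to mark the word boundary.
sample = "ภาษา" + WJ + "ไทย"

# Both characters are invisible format characters (General Category Cf).
for ch in (WJ, ZWNBSP):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

print(count_words(sample, boundary_marks={WJ}))  # WJ marks a boundary
print(count_words(sample))                       # WJ ignored
```

The same text yields a different count under the two readings, which is
all that "word counts will change" amounts to here.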

