Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates)

Philippe Verdy verdy_p at
Mon Oct 5 18:26:16 CDT 2015

2015-10-05 19:11 GMT+02:00 Ken Whistler <kenwhistler at>:

> However, it would be reasonable (and permitted) for an API to actually
> report a default value for a surrogate code point (i.e., treating it more
> or less like the reserved code point U+50005 that Marcus mentioned).

Unassigned (reserved) code points, when followed by an assigned combining
mark, would still be treated by default as starters of a combining sequence.
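
As a quick illustration of those defaults, here is a minimal sketch using
Python's standard unicodedata module. The helper name describe() is only
for illustration, and the values reported depend on the Unicode version
bundled with the interpreter.

```python
import unicodedata

def describe(cp):
    """Return (General_Category, Canonical_Combining_Class) for a code point."""
    ch = chr(cp)
    return (unicodedata.category(ch), unicodedata.combining(ch))

# U+50005 is unassigned (reserved): category Cn, combining class 0,
# so by default it acts as a starter for a following combining mark.
reserved = describe(0x50005)

# U+0302 COMBINING CIRCUMFLEX ACCENT: a nonspacing mark, combining class 230.
mark = describe(0x0302)

# U+D800 is a surrogate code point: category Cs, combining class 0.
surrogate = describe(0xD800)

print(reserved, mark, surrogate)
```

Both the reserved code point and the surrogate report combining class 0,
so the default properties alone do not distinguish them for this purpose.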

This is not (IMHO) desirable for lone surrogates, which would be better
handled in isolation, independently of what follows them.

My opinion is that they should be treated like newline controls, so that
a combining mark after one is also separated into a defective combining
sequence without any starter. For example, 000A 0302 creates two clusters,
and this should be the same for D800 0302. D800 will have no defined glyph
to render, but the glyph for U+FFFD may be displayed, or just a ".notdef"
tofu box.
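
The clustering behavior suggested here can be sketched roughly as follows.
This is a deliberately simplified toy, not the UAX #29 algorithm: it only
attaches combining marks (Mn, Mc, Me) to the preceding cluster, and it
assumes that controls (Cc) and surrogate code points (Cs) never accept a
following mark. The function names are hypothetical.

```python
import unicodedata

def is_mark(ch):
    # Combining marks: nonspacing, spacing, and enclosing.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def refuses_marks(ch):
    # Assumption for this sketch: controls (such as U+000A or U+0000) and
    # lone surrogates stay isolated and never start a combining sequence.
    return unicodedata.category(ch) in ("Cc", "Cs")

def clusters(code_points):
    """Group code points into clusters under the simplified rules above."""
    out = []
    for cp in code_points:
        ch = chr(cp)
        if out and is_mark(ch) and not refuses_marks(chr(out[-1][-1])):
            out[-1].append(cp)   # extend the current combining sequence
        else:
            out.append([cp])     # start a new cluster
    return out

# 000A 0302 and D800 0302 each split into two clusters; the reserved
# code point U+50005 still acts as a starter and keeps its mark.
for seq in ([0x000A, 0x0302], [0xD800, 0x0302], [0x50005, 0x0302]):
    print([[f"{cp:04X}" for cp in c] for c in clusters(seq)])
```

With this rule the mark after D800 becomes a defective combining sequence
of its own, exactly as it does after 000A.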

Now for break opportunities: those lone surrogates should not create a
newline or paragraph break opportunity, but they may create a word break
opportunity to allow their easy separation and selection by a double-click
on this tofu in an editor. They may even create a syllable break
opportunity before and after them, to allow wrapping long lines there.
Those adaptations, however, are not described at all in the annexes on
text segmentation.

So those surrogates (which are permanently assigned) could have their own
code point properties more formally defined. In my opinion, handling them
like U+0000 is much better than handling them like U+50005, which should
stay reserved and be handled as a standard starter with the default
combining class.

Also those lone surrogates should be Bidi-neutral (imagine they occur in
the middle of some Arabic text, they should probably not change the
direction of the surrounding text and should not alter the embedding