Deleting Lone Surrogates
verdy_p at wanadoo.fr
Sun Oct 4 13:53:25 CDT 2015
The default behavior for unassigned characters is to treat them like base
characters, so if one is followed by a combining mark, the two would form a
default grapheme cluster, which is not appropriate here.
Surrogates are not characters (so they cannot have any character
properties), but they are assigned and so don't have "default" properties
(which are only meant for *unassigned* code points).
I still think that it is safer to treat them (for text segmentation
purposes) as pure isolates, i.e. exactly like basic controls such as
U+0000 NUL, or like the U+FFFD replacement character, which is typically
used as a visible placeholder for various errors.
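To make the "pure isolate" idea concrete, here is a minimal sketch in Python. It is not the UAX #29 algorithm: the function names (is_isolate, simple_clusters) are hypothetical, and the rule "a cluster is one code point plus any following marks, except that controls (Cc) and surrogates (Cs) never accept a following mark" is a simplifying assumption made only for illustration.

```python
import unicodedata

def is_isolate(ch):
    # Assumption for this sketch: controls (Cc) and surrogates (Cs)
    # are isolates, i.e. they never take part in a larger cluster.
    return unicodedata.category(ch) in ("Cc", "Cs")

def simple_clusters(text):
    # Simplified segmentation: a combining mark (general category M*)
    # attaches to the preceding cluster unless that cluster ends with
    # an isolate, in which case the mark starts a cluster of its own.
    clusters = []
    extendable = False  # may the last cluster absorb a following mark?
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and extendable:
            clusters[-1] += ch
        else:
            clusters.append(ch)
        extendable = not is_isolate(ch)
    return clusters

# U+0301 attaches to "a" but not to the lone surrogate U+D800.
result = simple_clusters("a\u0301\ud800\u0301b")
```

With this rule the combining acute after the lone surrogate is left as a defective cluster by itself, instead of being glued onto the surrogate as it would be if the surrogate were handled like a base character.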
For normalisation purposes they should also have combining class 0 (i.e.
act as blockers against reordering for canonical equivalence), and not be
treated as "transparent" (discarded and bypassed as if those surrogates
were not present at all).
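The blocking behaviour can be sketched as follows, again in Python. The helper canonical_reorder is a hypothetical name and covers only the canonical ordering step (a stable sort of each run of non-zero combining classes), not full normalisation. It relies on unicodedata.combining reporting 0 for surrogate code points, which is the treatment argued for above.

```python
import unicodedata

def canonical_reorder(text):
    # Canonical ordering only: stably sort each maximal run of code
    # points with non-zero combining class. Any code point of class 0,
    # including a lone surrogate, ends the run and so blocks reordering.
    out = []
    run = []  # pending marks with non-zero combining class
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# U+0301 has class 230 and U+0323 has class 220, so they are swapped
# when adjacent, but not when a lone surrogate sits between them.
adjacent = canonical_reorder("a\u0301\u0323")
blocked = canonical_reorder("a\u0301\ud800\u0323")
```

Under a "transparent" treatment the two marks in the second string would be reordered across the surrogate; with class 0 the sequence is left exactly as it was.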
2015-10-04 19:50 GMT+02:00 Markus Scherer <markus.icu at gmail.com>:
> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of-range values in 32-bit strings.
> Most processing will treat them like unassigned characters, like U+50005,
> with only default behaviors.