Deleting Lone Surrogates
verdy_p at wanadoo.fr
Mon Oct 5 09:51:25 CDT 2015
Not silently! Even if the removal is required before editing can go on,
the user must be notified, because the lone surrogate may occur in a part
of the file that was never edited (and it may be a sign that the document
is not fully plain text, in which case the user should not save the edited
file).
If it is caused by a quirk in the user input (a defect of the input mode or
keyboard layout), there should be a notification as well.
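
To make the notification idea concrete, here is a rough Python sketch. It
is only an illustration under my own assumptions: the text is modelled as a
list of UTF-16 code units, and the names find_lone_surrogates and
surrogate_warning are hypothetical, not taken from any editor.

```python
def find_lone_surrogates(units):
    """Return the offsets of unpaired surrogates in a list of UTF-16 code units."""
    lone = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate is well formed only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate that was not consumed by a preceding high one.
            lone.append(i)
        i += 1
    return lone


def surrogate_warning(units):
    """Build a message for the user instead of dropping anything (hypothetical helper)."""
    lone = find_lone_surrogates(units)
    if not lone:
        return None
    where = ", ".join(f"U+{units[i]:04X} at offset {i}" for i in lone)
    return (f"Warning: {len(lone)} lone surrogate(s) found ({where}); "
            "the file may not be plain text.")
```

The point is that the scan only reports; what to do next (go on editing,
reopen with another encoding, refuse to save) is left to the user.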
But for a general-purpose editor that allows editing any file, including
binary ones (e.g. Emacs), it is best NOT to drop those lone surrogates at
all, and to treat them in isolation for ALL purposes. The DELETE key should
not delete more than the lone surrogate itself. It may be necessary to
adjust the cursor position after the deletion if the editor does not
support placing the cursor in the middle of a combining sequence, but a
LONE surrogate followed by a combining character should still be treated as
two separate clusters, and it should be possible to place the cursor or
selection between the lone surrogate and the combining mark.
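
A minimal Python sketch of that behaviour follows. It assumes a deliberately
simplified segmentation (a base character plus following combining marks,
not the full UAX #29 rules), and the function names are hypothetical. In a
Python str every surrogate code point is by construction a lone one.

```python
import unicodedata


def is_surrogate(ch):
    """True for any code point in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def clusters(text):
    """Split text into simplified clusters; a lone surrogate never takes marks."""
    out = []
    for ch in text:
        attach = (
            bool(out)
            and unicodedata.combining(ch) != 0
            and not is_surrogate(out[-1][0])
        )
        if attach:
            out[-1] += ch      # combining mark joins the preceding cluster
        else:
            out.append(ch)     # new cluster, also right after a lone surrogate
    return out


def delete_cluster(text, index):
    """Delete exactly one cluster, as the DELETE key would."""
    parts = clusters(text)
    del parts[index]
    return "".join(parts)
```

With the input "e" + U+0301 + U+D800 + U+0301 + "x", this gives four
clusters, so the cursor can sit between the lone surrogate and the mark.
Deleting the surrogate cluster leaves the second U+0301 directly after the
first cluster, which then absorbs it; that is the case where the cursor
position has to be adjusted after the deletion.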
Note that file formats that contain both binary parts and plain text parts
do exist, e.g. media files that contain a final plain text section for
metadata or for some XML data signature: it is safe to edit that final
part in a text editor, provided that the editor does not silently change
the encoding of the binary part.
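
One way to see why silent dropping is harmful, sketched in Python. This
uses Python's own "surrogateescape" error handler (PEP 383), which is my
choice for the illustration and not something any particular editor is
known to do: undecodable bytes are carried through the string as lone low
surrogates and written back unchanged. The file content below is made up.

```python
def load_for_editing(data):
    """Decode bytes as UTF-8, mapping undecodable bytes to lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def save_after_editing(text):
    """Encode back to bytes, restoring the original undecodable bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def drop_surrogates(text):
    """What a 'silent cleanup' would do: remove every surrogate code point."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


# Hypothetical file: a few binary bytes followed by a plain text trailer.
binary_part = bytes([0xFF, 0xD8, 0x00, 0x80])
original = binary_part + b"<meta>old</meta>"

text = load_for_editing(original)
edited = text.replace("old", "new")      # edit only the text trailer
saved = save_after_editing(edited)       # binary part is byte-identical
damaged = save_after_editing(drop_surrogates(edited))  # binary part corrupted
```

Here saved still starts with the four original binary bytes, while damaged
has lost three of them without any error being raised.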
In summary, I do not like the idea of silently dropping lone surrogates in
editors. If the editor has to drop them because it cannot safely handle
binary parts, the notification should tell the user either to stop using
that editor and choose something else, or to select another, more
appropriate file encoding in which the file can be edited safely. The user
should not save the file blindly, as it would be corrupted silently. Doing
otherwise would be a security issue.
And this remark extends to all other protocols using plain text input:
lone surrogates should not be dropped silently (unless explicitly
requested, for example in a maintenance cleanup or repair). If a lone
surrogate would break the further processing, the only safe option is to
reject the whole text and to report an error if the text data is required
and is therefore missing.
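
The "reject and report" option can be sketched in a few lines of Python.
The names MalformedTextError and accept_text are hypothetical; the only
library behaviour relied on is that the strict UTF-8 encoder refuses
surrogate code points and reports where the first one is.

```python
class MalformedTextError(ValueError):
    """Raised when input text contains a lone surrogate (hypothetical name)."""


def accept_text(text):
    """Return text unchanged if it is well formed, otherwise reject it whole."""
    try:
        # Strict UTF-8 encoding fails on any surrogate code point.
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        bad = ord(text[exc.start])
        raise MalformedTextError(
            f"lone surrogate U+{bad:04X} at index {exc.start}; input rejected"
        ) from exc
    return text
```

Nothing is repaired or removed here: either the caller gets the text exactly
as received, or it gets an error it has to handle.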
2015-10-05 13:50 GMT+02:00 Martin J. Dürst <duerst at it.aoyama.ac.jp>:
> On 2015/10/05 04:30, Asmus Freytag (t) wrote:
>> On 10/4/2015 6:02 AM, Richard Wordingham wrote:
>>> In the absence of a specific tailoring, is the combination of a lone
>>> surrogate and a combining mark a user-perceived character? Does a lone
>>> surrogate constitute a user-perceived character?
>> In an editing interface, a lone surrogate should be a user-perceived
>> character, as otherwise you won't be able to manually delete it. Markus
>> suggests that it be treated like an unassigned code point.
> In an editing tool (of which an editing interface is a part), a lone
> surrogate should just be removed! Apparently, that's what happens in
> Richard's case, but only eventually.
> Regards, Martin.