Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Hans Åberg via Unicode unicode at unicode.org
Thu May 18 03:30:24 CDT 2017


> On 16 May 2017, at 15:21, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Tue, 16 May 2017 14:44:44 +0200
> Hans Åberg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode
>>> <unicode at unicode.org> wrote:  
>> ...
>>> I think Unicode should not adopt the proposed change.  
>> 
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original octet
>> sequence can be restored.
> 
> Escape sequences for the inappropriate bytes is the natural technique.
> Your problem is smoothly transitioning so that the escape character is
> always escaped when it means itself. Strictly, it can't be done.
> 
> Of course, some sequences of escaped characters should be prohibited.
> Checking could be fiddly.

One could write the inappropriate bytes as \xnn escape codes, terminating each escape sequence with \& as in Haskell, and translating a literal '\' into "\\". The result is then a C-style encoded string, not plain text.
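
A minimal sketch of this escaping idea, in Python, might look as follows. It rests on some assumptions that are not part of the proposal above: the escapes are fixed at exactly two hex digits, so the Haskell-style \& terminator is not needed and is left out, and the function names escape_bytes and unescape_to_bytes are made up for illustration. It relies on Python's built-in "backslashreplace" error handler for decoding.

```python
import re

def escape_bytes(raw: bytes) -> str:
    """Decode UTF-8, writing ill-formed bytes as \\xnn and '\\' as '\\\\'."""
    # Double literal backslashes at the byte level first. Byte 0x5C never
    # occurs inside a well-formed multi-byte UTF-8 sequence, so this does
    # not disturb any valid character.
    doubled = raw.replace(b"\\", b"\\\\")
    # "backslashreplace" turns every undecodable byte into a \xnn escape.
    return doubled.decode("utf-8", errors="backslashreplace")

# One token is either an escaped backslash or a two-digit \xnn escape
# (fixed width is an assumption of this sketch).
_TOKEN = re.compile(r"\\(\\|x[0-9a-fA-F]{2})")

def unescape_to_bytes(text: str) -> bytes:
    """Invert escape_bytes, restoring the original octet sequence."""
    out = bytearray()
    pos = 0
    for m in _TOKEN.finditer(text):
        # Plain text between tokens goes back to UTF-8 unchanged.
        out += text[pos:m.start()].encode("utf-8")
        tok = m.group(1)
        if tok == "\\":
            out += b"\\"
        else:
            out.append(int(tok[1:], 16))
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)

if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xff\\ \xe2\x82"   # valid text, a stray byte, a truncated sequence
    escaped = escape_bytes(raw)
    print(escaped)
    assert unescape_to_bytes(escaped) == raw
```

Because every backslash in the escaped form starts either "\\" or "\xnn", scanning left to right is unambiguous, and the original octets come back exactly. As noted above, the escaped form is an encoded string rather than plain text.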