Do you know a tool to decode "UTF-8 twice"

Wed Jan 29 12:21:55 CST 2014

Jörg:

This is the definition of cp1252 used by the WHATWG and all current browser
implementations.
I asked the cp1252 maintainer to update the definition so that we
don't have two competing standards, but the request was rejected.
I've been considering naming this variant cp1252-whatwg.
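
As a rough sketch of what such a variant could look like in Python: the codec
name cp1252-whatwg and the helper names below are only illustrative (not an
existing package), and the table is simply the built-in cp1252 codec with the
five undefined bytes mapped to the C1 controls of the same value.

```python
import codecs


def _build_decoding_table():
    """Return a 256-character table: cp1252, with latin-1 filling the gaps."""
    chars = []
    for byte in range(256):
        raw = bytes([byte])
        try:
            chars.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90, 0x9D: fall back to the C1 control
            # character with the same code point.
            chars.append(raw.decode("latin-1"))
    return "".join(chars)


DECODING_TABLE = _build_decoding_table()
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _search(name):
    # Hypothetical codec name; newer Python versions normalize "-" to "_".
    if name in ("cp1252-whatwg", "cp1252_whatwg"):
        return codecs.CodecInfo(encode=encode, decode=decode,
                                name="cp1252-whatwg")
    return None


codecs.register(_search)
```

With that registered, text that was UTF-8 decoded as cp1252 could be repaired
with something like mangled.encode("cp1252-whatwg").decode("utf-8"), including
sequences that contain one of the five bytes the stock cp1252 codec rejects.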

On Wed, Jan 29, 2014 at 6:59 AM, "Jörg Knappen" <jknappen at web.de> wrote:

> A little postscriptum to this old thread:
>
> On PyPI, there is now a codec available that handles the peculiar
> definition of "latin1" inside MySQL.
> The package is called mysql-latin1-codec and features an encoding
> consisting of cp1252 plus the bytes
> 0x81, 0x8D, 0x8F, 0x90, 0x9D (these five bytes are undefined in
> the Python codec for cp1252).
>
>  https://pypi.python.org/pypi/mysql-latin1-codec/1.0
>
> --Jörg Knappen
>
>  *Sent:* Wednesday, 30 October 2013 at 19:14
> *From:* "Buck Golemon" <buck at yelp.com>
> *To:* "Frédéric Grosshans" <frederic.grosshans at gmail.com>
> *Cc:* "Jörg Knappen" <jknappen at web.de>, unicode <unicode at unicode.org>
> *Subject:* Re: Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8
> twice"
>
>
> On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans <
> frederic.grosshans at gmail.com> wrote:
>>
>> On 30/10/2013 17:32, "Jörg Knappen" wrote:
>>
>>>
>>> The data did not only contain latin-1 type mangling for the non-existent
>>> Windows characters, but also sequences with the raw
>>> C1 control characters for all of latin-1. So I had to do them, too.
>>> The data weren't consistent at all, not even in their errors.
>>> --Jörg Knappen
>>
>>  Your question helped me dust off and repair a non-working Python snippet
>> I wrote for a similar problem. I was stuck with the mixing of windows-1252
>> and latin1 controls (linked with Chinese characters). I include it below
>> for reference.
>>
>> The python snippet below does not need sed, defines a function
>> (unscramble(S)) which works on strings. The extension to files should be
>> easy.
>>
>>     Frédéric Grosshans
>>
>>
>> def Step1Filter(S):
>>     # Works character by character because of the cp1252/latin1 ambiguity.
>>     for c in S:
>>         try:
>>             yield c.encode('cp1252')
>>         except UnicodeEncodeError:
>>             # Needed where cp1252 is undefined (81, 8D, 8F, 90, 9D).
>>             yield c.encode('latin1')
>>
>> def unscramble(S):
>>     # Undo one layer of mis-decoding, then decode the bytes as UTF-8.
>>     return b''.join(Step1Filter(S)).decode('utf8')
>>
>> PS: If anyone is interested in a licence, I consider this simple enough
>> to be in the public domain and uncopyrightable.
>>
>
>  The encoding you've implemented above is known as windows-1252 by the
> WHATWG and all browsers [1][2].
> The implementation of cp1252 in Python is instead a direct consequence of
> the unicode.org definition [3].
>
>  [1] http://encoding.spec.whatwg.org/index-windows-1252.txt
>  [2] http://bukzor.github.io/encodings/cp1252.html
>  [3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>