get the sourcecode [of UTF-8]

Giacomo Catenazzi cate at cateee.net
Mon Nov 4 03:17:37 CST 2024


On 2024-11-03 23:42, Jim DeLaHunt via Unicode wrote:
> Hello, anonymous person:
>
> On 2024-11-02 17:42, A bughunter via Unicode wrote:
>>
>> Where to get the sourcecode of relevent (version) UTF-8?: in order to 
>> checksum text against the specific encoding map (codepage).
>>
>> from A_bughunter at proton.me
>
> I'm afraid I don't really understand what you are asking here.
>
> UTF-8 is a data format, a way of representing 21-bit Unicode scalar 
> integers in 1, 2, 3, or 4 bytes (octets). It is defined in section 
> 2.5.3, "UTF-8" 
> <https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G11165>, 
> of the Core Specification of the Unicode Standard. It has not changed 
> over time, so it doesn't really have versions.
>
> If by "source code" you refer to an implementation of the UTF-8 
> format, then is no single answer. There are multiple implementations 
> of UTF-8, and so multiple independent bodies of "source code".
>
> And there are many things which could be called a "specific encoding 
> map (codepage)". I don't know which of those you are referring to.

A checksum may be tricky (if I interpret the question correctly). The most 
obvious problem is the newline: depending on the platform it is encoded 
as CR, CR+LF, or LF. Programs may translate between them, so to checksum 
text you may need to normalize line endings first.
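
To make the newline point concrete, here is a minimal sketch in Python. 
The function name checksum_text and the choice of SHA-256 are my own 
assumptions for illustration; the question did not say which checksum is 
wanted, and mapping everything to LF is only one possible convention.

```python
import hashlib


def checksum_text(data: bytes) -> str:
    """Hash text bytes after mapping CR+LF and lone CR to LF.

    Hypothetical helper: SHA-256 and the LF convention are assumptions.
    """
    # Replace CR+LF first so it does not turn into two LFs.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.sha256(normalized).hexdigest()
```

With this, the same two lines saved with Unix, DOS, or old Mac line 
endings give the same digest, while text with a real extra empty line 
still gives a different one.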

But then we have additional *problems* from Unicode itself: there may be 
more than one way to encode the same character. For example, an accented 
character may be encoded as one code point, or as two: a base character 
followed by a combining diacritic (accent). E.g. Apple prefers the 
latter, and Microsoft the former. So it depends on your encoding map 
preference (and possibly further normalization). We may argue that the 
short form should be preferred (in this case): one of the tasks of 
Unicode was to map each character of the commonly used (and also less 
used) encodings to a single Unicode character (so hinting at a preference 
for encoding mapping). So for a checksum, you may need to agree on a 
normalization form, and that unfortunately may depend on the Unicode 
version (or better: text written with new Unicode characters may not be 
correctly normalized by older programs).
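
As a small illustration of the two forms, here is a sketch using Python's 
standard unicodedata module. The helper name normalized_checksum, the 
default of NFC, and SHA-256 are assumptions of mine, not something any 
standard prescribes for checksums.

```python
import hashlib
import unicodedata


def normalized_checksum(text: str, form: str = "NFC") -> str:
    """Hash text after Unicode normalization (hypothetical helper)."""
    canonical = unicodedata.normalize(form, text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # base letter e followed by combining acute accent

# The raw UTF-8 bytes differ, so an unnormalized checksum would differ too.
raw_bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")

# After normalization both spellings hash to the same value.
same_after_nfc = normalized_checksum(composed) == normalized_checksum(decomposed)
```

The value unicodedata.unidata_version tells you which Unicode version the 
running interpreter normalizes with, which is the version dependency 
mentioned above: both sides have to agree on it for newer characters.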

Note: overlong UTF-8 encodings are not valid (that is, encoding a 
Unicode character with more bytes than the minimal-length UTF-8 
sequence). That should be caught before checksumming, but with a checksum 
you need to take care, or else this special case (like many others) may 
be abused, often a grave security issue. So it is complex, and your 
question is too vague (and imprecise) for us to help.
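
To show what I mean by overlong sequences, here is a minimal sketch that 
relies on Python's built-in strict UTF-8 decoder, which rejects them. The 
helper name is_valid_utf8 is hypothetical; other languages and libraries 
may be more lenient, so check the decoder you actually use.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


slash = b"/"                        # U+002F in its minimal one-byte form
overlong_2 = b"\xc0\xaf"            # the same code point padded to two bytes
overlong_3 = b"\xe0\x80\xaf"        # the same code point padded to three bytes
```

Validating the input first means the checksum is only ever computed over 
well-formed UTF-8, so two different byte strings cannot both claim to be 
the same character.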

I recommend that you look at existing implementations: the PGP (and GPG) 
protocol may give some hints on how to securely checksum text.

giacomo



