get the sourcecode [of UTF-8]
Giacomo Catenazzi
cate at cateee.net
Mon Nov 4 03:17:37 CST 2024
On 2024-11-03 23:42, Jim DeLaHunt via Unicode wrote:
> Hello, anonymous person:
>
> On 2024-11-02 17:42, A bughunter via Unicode wrote:
>>
>> Where to get the sourcecode of relevant (version) UTF-8?: in order to
>> checksum text against the specific encoding map (codepage).
>>
>> from A_bughunter at proton.me
>
> I'm afraid I don't really understand what you are asking here.
>
> UTF-8 is a data format, a way of representing 21-bit Unicode scalar
> integers in 1, 2, 3, or 4 bytes (octets). It is defined in section
> 2.5.3, "UTF-8"
> <https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G11165>,
> of the Core Specification of the Unicode Standard. It has not changed
> over time, so it doesn't really have versions.
>
> If by "source code" you refer to an implementation of the UTF-8
> format, then there is no single answer. There are multiple implementations
> of UTF-8, and so multiple independent bodies of "source code".
>
> And there are many things which could be called a "specific encoding
> map (codepage)". I don't know which of those you are referring to.

Checksumming text can be tricky (if I interpret the question correctly).
The most obvious problem is the newline: depending on the platform, a line
break is encoded as CR, CR+LF, or LF. Programs may translate between them,
so before computing a checksum of text you may need to normalize the line
endings.
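
As a minimal sketch of the line-ending point, assuming Python and SHA-256 (the
function name text_checksum and the choice of LF as the canonical line ending
are assumptions made for illustration, not something the question specifies):

```python
import hashlib


def text_checksum(text: str) -> str:
    """Return the SHA-256 hex digest of text with line endings unified to LF.

    Hypothetical helper: LF and SHA-256 are illustrative choices only.
    """
    # Convert CR+LF first, so that the CR of a CR+LF pair is not
    # turned into a second LF by the next replacement.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


unix_text = "line one\nline two\n"
dos_text = "line one\r\nline two\r\n"
old_mac_text = "line one\rline two\r"

# All three variants hash to the same value after normalization.
print(text_checksum(unix_text) == text_checksum(dos_text) == text_checksum(old_mac_text))
```

Without the normalization step the three byte sequences differ, and so would
their checksums.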

But then we have additional *problems* specific to Unicode: there may be
more than one way to encode the same character. For example, an accented
character may be encoded as one code point, or as two: a base character
followed by a combining diacritic (accent). E.g. Apple prefers the latter,
and Microsoft the former. So it depends on your preferred encoding mapping
(and possibly on further normalization). We may argue that the shorter form
is better (in this case): one of the tasks of Unicode was to map each
character of the commonly used (and also less used) encodings to a single
Unicode character (hinting at a preference for that mapping). So for a
checksum you may need to agree on a normalization form, and that
unfortunately may depend on the Unicode version (or better: text written
with new Unicode characters may not be correctly normalized by older
programs).
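
A small Python sketch of the two forms, using the standard unicodedata module.
The helper name normalized_checksum, the default form NFC, and SHA-256 are
assumptions for illustration; which normalization form to agree on is a
decision for the parties exchanging the checksum:

```python
import hashlib
import unicodedata


def normalized_checksum(text: str, form: str = "NFC") -> str:
    """Hash text after Unicode normalization (hypothetical helper)."""
    canonical = unicodedata.normalize(form, text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


precomposed = "\u00e9"   # U+00E9, e with acute as a single code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render the same but are different byte sequences in UTF-8.
print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81'

# After normalizing to one agreed form, the checksums match.
print(normalized_checksum(precomposed) == normalized_checksum(decomposed))

# The Unicode data version the running interpreter normalizes with.
print(unicodedata.unidata_version)
```

The last line matters for the version caveat: two programs built against
different Unicode data may normalize recently added characters differently.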

Note: overlong UTF-8 encodings are not considered valid (that is, encoding
a Unicode character with more bytes than the minimal-length UTF-8
sequence). That should already be caught earlier, when the text is decoded,
but with a checksum care must be taken, otherwise this special case (like
many others) may be abused, which is often a grave security issue. So it is
complex, and your question is too vague (and imprecise) for us to help.
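
For illustration, a sketch of rejecting an overlong sequence before any
checksum is computed. It relies on Python's built-in strict UTF-8 decoder; the
wrapper name decode_strict is hypothetical:

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any invalid sequence."""
    return data.decode("utf-8", errors="strict")


valid_slash = b"/"             # U+002F in its minimal one-byte form
overlong_slash = b"\xc0\xaf"   # the same code point in an overlong two-byte form

print(decode_strict(valid_slash))

try:
    decode_strict(overlong_slash)
except UnicodeDecodeError:
    # A strict decoder refuses the overlong form, so it never reaches
    # the normalization or checksum step.
    print("overlong sequence rejected")
```

Validating first means two different byte strings can never be treated as the
same text by a later, more lenient stage.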

I recommend you look at existing implementations: the PGP (and GPG)
protocol may give some hints on securely computing a checksum of text.

giacomo