get the sourcecode [of UTF-8]

Otto Stolz otto.stolz at uni-konstanz.de
Wed Nov 6 11:19:24 CST 2024


Hello bughunter,

before wording a question to any discussion group, it is recommended
to read (and understand) the pertinent FAQ list; otherwise the ensuing
discussion will focus on definitions and terms rather than the problem
at hand. You may start reading at <https://www.unicode.org/faq/>.

That said, I’ll try to answer your question. As your problem is not
quite clear, you’ll basically get three answers, and a technical hint
pertaining to two of them.

You have asked:
> Where to get the sourcecode of relevent (version) UTF-8?: 
> in order to checksum text against the specific encoding map (codepage).
My answer depends on the purpose of the checksum.

UTF-8 is one method (of a handful of standardized methods) to
represent Unicode text at the bit level in order to conveniently
transfer, or store, it. If the intent of your checksum is merely
to protect against transmission errors, or tampering, then you
would simply checksum this bit-level representation of the text;
no knowledge of Unicode, or UTFs, is required to achieve this goal.
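
A minimal sketch of this first case, in Python (my choice of language
for illustration). SHA-256 and the function name checksum_bytes are
assumptions; any checksum over the raw bytes would serve equally well:

```python
import hashlib

def checksum_bytes(data: bytes) -> str:
    """Checksum the stored or transmitted bytes as they are.

    The bytes are never decoded, so it does not matter whether they
    are UTF-8, another UTF, or not text at all.
    """
    return hashlib.sha256(data).hexdigest()

# The bytes as they would sit in a file or on the wire.
payload = "é".encode("utf-8")  # b'\xc3\xa9'
print(checksum_bytes(payload))
```

Note that the same text stored in a different UTF yields different
bytes, hence a different checksum.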

A Unicode code point is a number in the range from 0 to 1 114 111;
a Unicode text is a sequence of Unicode code points.
On the bit level, you can represent that sequence in various ways,
cf. <https://www.unicode.org/faq/utf_bom.html>. Hence, if you
want to compare two Unicode texts that are represented in arbitrary
bit-level representations (UTFs), then you would convert those
to the same UTF (preferably UTF-32) and checksum those. (UTF-32
stores the 21 bits needed to represent a Unicode code point in
one 32-bit-wide storage location, leaving 11 bits unused.)
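
As a rough sketch of this second case, again in Python: the function
name checksum_code_points is hypothetical, SHA-256 is an assumed choice
of checksum, and I pick the big-endian codec "utf-32-be" so that no
byte order mark is prepended. The caller must know the encoding of
each input:

```python
import hashlib

def checksum_code_points(data: bytes, encoding: str) -> str:
    """Decode from the given UTF, re-encode as UTF-32, then checksum."""
    text = data.decode(encoding)      # bytes -> sequence of code points
    utf32 = text.encode("utf-32-be")  # 4 bytes per code point, no BOM
    return hashlib.sha256(utf32).hexdigest()

# The same text in two different bit-level representations.
as_utf8 = "Grüße".encode("utf-8")
as_utf16 = "Grüße".encode("utf-16-le")

same = (checksum_code_points(as_utf8, "utf-8")
        == checksum_code_points(as_utf16, "utf-16-le"))
print(same)  # True
```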

In Unicode, some characters may be represented in various ways;
e. g. an “é” can be coded as one single Unicode code point, viz.
U+00E9, LATIN SMALL LETTER E WITH ACUTE, or, alternatively, as
a pair of Unicode code points, viz. U+0065 U+0301 LATIN SMALL
LETTER E + COMBINING ACUTE ACCENT. To cope with ambiguities of
this kind, Unicode defines those two representations as
“canonically equivalent”, i. e., they are to be treated in every
respect as equivalent and interchangeable; for details,
cf. <https://www.unicode.org/faq/normalization.html>. Hence,
if you want to check that two Unicode texts are canonically
equivalent, you would first convert them to UTF-32, then
‘normalize’ them (i. e. choose consistently the same representation
for all instances of canonically equivalent encodings), then
checksum the normalized representations.
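
A sketch of this third case, using the unicodedata module from the
Python standard library. The function name checksum_normalized, the
choice of normalization form NFC, and SHA-256 are my assumptions; NFD
would work just as well, as long as both texts get the same form:

```python
import hashlib
import unicodedata

def checksum_normalized(text: str, form: str = "NFC") -> str:
    """Normalize, encode as UTF-32 without BOM, then checksum."""
    normalized = unicodedata.normalize(form, text)
    return hashlib.sha256(normalized.encode("utf-32-be")).hexdigest()

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False: different code points
print(checksum_normalized(precomposed)
      == checksum_normalized(decomposed))  # True
```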

You were asking for source code, but the better way to do
conversion and normalization is by using an established and
well-tested program library, such as ICU,
cf. <https://icu.unicode.org/#h.i33fakvpjb7o>.

Good luck with your project,
     Otto


