<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div>Dear A bughunter,</div>


<div> </div>


<div>I will try to help. First, please note that this is a public mailing list, so I won't encrypt my answer with PGP.</div>


<div> </div>


<div>As I understand your question, you want to convert UTF-8 encoded text into its Unicode code points in order to generate a checksum.</div>


<div> </div>


<div>A Unicode code point is a 21-bit unsigned integer in the range 0x0 to 0x10FFFF.</div>


<div> </div>


<div>There are different encodings to represent a Unicode character. The simplest would be UTF-32, which uses 32 bits for each codepoint.</div>


<div> </div>


<div>The most common characters (Latin letters and basic punctuation) are found at the code points 0x00 to 0x7F, which require only 7 bits. A document stored as UTF-32 would therefore consist largely of zero bytes, which wastes memory. UTF-8 avoids this with a trick: multi-byte sequences.</div>


<div> </div>


<div>Each byte contains 8 bits.</div>


<div> </div>


<div>If the most significant bit is 1 (that is, the byte value is > 0x7F), the byte is either the start of a multi-byte sequence or a continuation byte within one. A continuation byte always starts with 0b10, that is, its value is between 0x80 and 0xBF.</div>
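<div>As a rough sketch of that check in Java (the class and method names are my own, purely for illustration), the three kinds of byte can be told apart with two bit masks. Note that it only looks at the top bits; it does not reject byte values such as 0xC0 or 0xFF, which never occur in well-formed UTF-8.</div>

```java
// Illustrative helper, not part of any library: classifies a single byte
// of UTF-8 data by its most significant bits.
final class Utf8ByteKind {
    // 0xxxxxxx: a single-byte (ASCII) character.
    static boolean isAscii(int b) {
        return (b & 0x80) == 0x00;
    }

    // 10xxxxxx: a continuation byte, value 0x80 to 0xBF.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // 11xxxxxx: the first byte of a multi-byte sequence (bit pattern only).
    static boolean isLead(int b) {
        return (b & 0xC0) == 0xC0;
    }

    public static void main(String[] args) {
        int[] sample = {0x41, 0xF0, 0x9F, 0x98, 0xB8};
        for (int b : sample) {
            String kind = isAscii(b) ? "ASCII" : isContinuation(b) ? "continuation" : "lead";
            System.out.println(String.format("0x%02X %s", b, kind));
        }
    }
}
```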


<div> </div>


<div>Characters in the range 0x00 to 0x7F are coded as they are. That means: codepoint 0b0xxxxxxx becomes 0b0xxxxxxx</div>


<div>Characters in the range 0x0080 to 0x07FF are coded starting with 0b110. That means, codepoint 0b00000yyy_xxxxxxxx becomes 0b110yyyxx_10xxxxxx</div>


<div>Characters in the range 0x0800 to 0xFFFF are coded starting with 0b1110. That means, codepoint 0byyyyyyyy_xxxxxxxx becomes 0b1110yyyy_10yyyyxx_10xxxxxx</div>


<div>

<div>Characters in the range 0x10000 to 0x10FFFF are coded starting with 0b11110. That means, codepoint 0b000zzzzz_yyyyyyyy_xxxxxxxx becomes 0b11110zzz_10zzyyyy_10yyyyxx_10xxxxxx</div>
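<div>The four cases above could be written in Java roughly as follows. This is only a sketch under my own naming (Utf8Encode and encode are not library names). It additionally rejects the surrogate range 0xD800 to 0xDFFF, which I have not discussed above but which cannot be encoded in UTF-8.</div>

```java
// Illustrative encoder: turns one code point into its UTF-8 byte sequence.
final class Utf8Encode {
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {
            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            // 110yyyxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp <= 0xFFFF) {
            // 1110yyyy 10yyyyxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        for (byte b : encode(0x20AC)) {
            System.out.print(String.format("0x%02X ", b & 0xFF));
        }
        System.out.println();
    }
}
```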


<div> </div>


<div>Now, for example, you encounter the following byte sequence and want to convert it from UTF-8 to the corresponding Unicode code point:</div>


<div> </div>


<div>0xF0 0x9F 0x98 0xB8</div>


<div>= 0b11110000 0b10011111 0b10011000 0b10111000</div>


<div>As you can see, the sequence starts with 0b11110, which means you have to parse four bytes. Each of the next three bytes starts with 0b10 (the continuation marker), so the sequence is structurally valid.</div>


<div> </div>


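<div>Let us transform this using the mapping from 0b11110zzz_10zzyyyy_10yyyyxx_10xxxxxx to 0b000zzzzz_yyyyyyyy_xxxxxxxx:</div>


<div>Dropping the marker bits from the four bytes leaves the payload bits 000, 011111, 011000 and 111000, which is 21 bits in total.</div>


<div>Concatenated, they give 0b00000001_11110110_00111000 = 0x0001F638 = U+1F638 = Grinning Cat Face with Smiling Eyes</div>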
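<div>For illustration, here is a minimal Java sketch of this decoding step. Utf8Decode and decodeAt are names I made up. It only checks the lead and continuation bit patterns; it does not reject overlong forms, surrogates or values above 0x10FFFF, so please do not treat it as a complete validator.</div>

```java
// Illustrative decoder: reads one UTF-8 sequence and returns its code point.
final class Utf8Decode {
    static int decodeAt(byte[] in, int offset) {
        int b0 = in[offset] & 0xFF;
        int length;
        int cp;
        if ((b0 & 0x80) == 0x00) {
            return b0;                       // 0xxxxxxx: single byte
        } else if ((b0 & 0xE0) == 0xC0) {
            length = 2; cp = b0 & 0x1F;      // 110xxxxx: 5 payload bits
        } else if ((b0 & 0xF0) == 0xE0) {
            length = 3; cp = b0 & 0x0F;      // 1110xxxx: 4 payload bits
        } else if ((b0 & 0xF8) == 0xF0) {
            length = 4; cp = b0 & 0x07;      // 11110xxx: 3 payload bits
        } else {
            throw new IllegalArgumentException("not a lead byte");
        }
        if (offset + length > in.length) {
            throw new IllegalArgumentException("truncated sequence");
        }
        for (int i = 1; i < length; i++) {
            int b = in[offset + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                throw new IllegalArgumentException("expected a continuation byte");
            }
            // Append the 6 payload bits of each continuation byte.
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) {
        byte[] cat = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0xB8};
        System.out.println(String.format("U+%04X", decodeAt(cat, 0))); // prints U+1F638
    }
}
```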


<div> </div>


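<div>There are several libraries which can be used to parse UTF-8 encoded text and split it into the corresponding code points. For example, you can use the Java class java.io.InputStreamReader, with the second constructor argument being the String literal "UTF-8". Note that the reader returns UTF-16 code units, so a character above U+FFFF arrives as a surrogate pair that still has to be combined into one code point.</div>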
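<div>A small sketch of that approach, assuming the input is already available as a byte array (Utf8ReaderDemo and readCodePoints are hypothetical names of mine). It collects the chars delivered by the reader and lets CharSequence.codePoints() combine surrogate pairs.</div>

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Illustrative only: decodes UTF-8 bytes with InputStreamReader and
// returns the resulting Unicode code points.
final class Utf8ReaderDemo {
    static int[] readCodePoints(byte[] utf8) {
        StringBuilder text = new StringBuilder();
        // The second constructor argument names the charset.
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(utf8), "UTF-8")) {
            int unit;
            while ((unit = reader.read()) != -1) {
                text.append((char) unit); // one UTF-16 code unit per read()
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // codePoints() merges each surrogate pair into a single code point.
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        byte[] input = {0x41, (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0xB8};
        for (int cp : readCodePoints(input)) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

<div>The resulting int values could then be fed into whatever checksum you have in mind. Passing java.nio.charset.StandardCharsets.UTF_8 instead of the String literal would also work and avoids the charset lookup by name.</div>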


<div> </div>


<div>I hope that helps.</div>


<div> </div>


<div>Best regards,</div>


<div> </div>


<div>Marius</div>

</div>


<div> 

<div> 

<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);">

<div style="margin: 0 0 10.0px 0;"><b>Sent:</b> Friday, 08 November 2024 at 01:36<br/>

<b>From:</b> "Markus Scherer via Unicode" <unicode@corp.unicode.org><br/>

<b>To:</b> "Jim Breen" <jimbreen@gmail.com><br/>

<b>Cc:</b> "unicode@corp.unicode.org" <unicode@corp.unicode.org><br/>

<b>Subject:</b> Re: get the sourcecode [of UTF-8]</div>


<div>

<div>

<div>On Thu, Nov 7, 2024 at 3:03 PM Jim Breen via Unicode <<a href="mailto:unicode@corp.unicode.org" onclick="" target="_blank">unicode@corp.unicode.org</a>> wrote:</div>


<div class="gmail_quote">

<blockquote class="gmail_quote" style="margin: 0.0px 0.0px 0.0px 0.8ex;border-left: 1.0px solid rgb(204,204,204);padding-left: 1.0ex;">On rare occasions, I need to dig into UTF-8 at the bit level. I have a<br/>

note pinned near my desk as an aide memoire. It has 3 lines:<br/>

<br/>

UTF-8<br/>

zzzzyyyyyxxxxx<br/>

1110zzzz 10yyyyyy 10xxxxxx</blockquote>


<div> </div>


<div>11110nnn 10zzzzzz 10yyyyyy 10xxxxxx</div>


<div> </div>


<div>markus</div>

</div>

</div>

</div>

</div>

</div>

</div></div></body></html>