get the sourcecode [of UTF-8]

Tue Nov 5 04:03:43 CST 2024

So let's analyse your question. But take a hint: if several people 
tell you that they do not understand you, maybe it is time to rephrase 
your question. Repeating it will not make it clearer.

So your original question:

 > Where to get the sourcecode of relevent (version) UTF-8?: in order to 
checksum text against the specific encoding map (codepage).

So first part:

 > Where to get the sourcecode of relevent (version) UTF-8?

1- "sourcecode" seems a strange word in the modern world.

2- "relevant": relevant for what? You should specify or just skip.

UTF-8 is an encoding of Unicode.
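
To illustrate what "an encoding of Unicode" means: UTF-8 is an 
algorithm, not a lookup table. The following is only a minimal sketch 
in Python (the function name utf8_encode is made up for this example); 
it turns one code point into bytes using bit arithmetic alone.

```python
def utf8_encode(code_point):
    """Encode a single Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in encoder for a few code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```

Nothing in this step depends on the Unicode version or on any 
character table.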

Unicode is described in https://www.unicode.org/versions/Unicode16.0.0/ 
and there you will find the Unicode Character Database

https://www.unicode.org/Public/16.0.0/ucd/

So you have all the tables you need. The first link gives you 
information on how to interpret the data, and it also links to the 
annex (UAX #44) which describes the format of the tables.

It is a database (and many tables).

There are many implementations which use the database. You should check 
the different implementations yourself. If you install e.g. Debian, you 
should be able to get the sources for all programs in Debian. Microsoft, 
Apple and Google don't publish all their sources, but you may find 
documentation and some source. It should be your task to google. There 
is no single source code. You can write your own, using the above 
database.
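
As a rough idea of what "write your own" could look like, here is a 
small Python sketch that parses one record in the semicolon-separated 
format of UnicodeData.txt (one of the files in the directory above). 
The sample line is the record for U+0041; the function name and the 
choice of fields to keep are my own assumptions, and a real reader 
would loop over the whole file and handle the range entries as well.

```python
# One record from UnicodeData.txt: 15 fields separated by semicolons.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"


def parse_record(line):
    """Parse a single UnicodeData.txt line into a small dict (sketch only)."""
    fields = line.strip().split(";")
    return {
        "code_point": int(fields[0], 16),      # field 0: code point in hex
        "name": fields[1],                     # field 1: character name
        "general_category": fields[2],         # field 2: e.g. Lu = uppercase letter
        # field 13: simple lowercase mapping, empty if there is none
        "simple_lowercase": int(fields[13], 16) if fields[13] else None,
    }


record = parse_record(SAMPLE)
print(record)
```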

 > In order to checksum text against the specific encoding map (codepage).

That part makes no sense. A "checksum" is a numeric sum of something. 
Just add up the single bytes of a text in UTF-8 and you have a 
checksum. But that is trivial. We do not understand: a checksum of 
what? And possibly you are using the word wrongly.
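
For what it is worth, the trivial checksum described above can be 
sketched in a few lines of Python. The function name and the 16-bit 
modulus are arbitrary choices made for this example, not part of any 
standard.

```python
def byte_sum_checksum(text, modulus=2**16):
    """Add up the UTF-8 bytes of a string, reduced modulo `modulus` (sketch)."""
    return sum(text.encode("utf-8")) % modulus


print(byte_sum_checksum("Smith"))  # 83 + 109 + 105 + 116 + 104 = 517
```

The result depends only on the bytes of the text. No character table 
and no Unicode version enters the calculation.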

 > against the specific encoding map (codepage).

What? UTF-8 is already an encoding. "codepage" does not have a unique 
meaning (you are using it together with "encoding", so probably you are 
using it as in the 1990s). Converting UTF-8 to most other encodings is 
not possible without losing information (and so checksums are not only 
useless there, but also dangerous).
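
A small Python sketch of that loss of information, assuming Latin-1 
(ISO 8859-1) as the example target encoding; the helper name is made 
up. The letter U+0142 has no Latin-1 equivalent, so it is replaced and 
the original text cannot be recovered afterwards.

```python
def to_latin1_and_back(text):
    """Convert to Latin-1, replacing unmappable characters, then decode again."""
    lossy_bytes = text.encode("latin-1", errors="replace")
    return lossy_bytes.decode("latin-1")


original = "S\u0142awomir"          # contains U+0142, which Latin-1 cannot represent
round_trip = to_latin1_and_back(original)

print(original.encode("utf-8"))     # bytes of the original text
print(round_trip.encode("utf-8"))   # bytes after the lossy conversion
print(round_trip == original)
```

Any checksum taken after such a conversion describes the altered text, 
not the original one.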

Or are you interested in how to do the mapping? Wikipedia helps (but I 
hope you have already read the relevant pages). Otherwise check the 
ECMA website (the standards are free, and often the same as ISO): they 
tell you how to use/interpret some multibyte encodings (but not all). 
By googling you will find many implementations. I'm not sure Android 
implements many of them. There is a program "iconv" which just 
transcodes text, and it is open source, so you can get the sources.

If you cannot phrase it in a better way (it happens a lot), an example 
will help: "I have X, Y, Z; I do foo(), bar(); I am missing step M (the 
question) to get A, B, C" (with real data for X, Y, Z and A, B, C).

giacomo

On 2024-11-05 10:28, A bughunter via Unicode wrote:
>
> My reply to Slawomer is interspersed below.
>
>
> from A_bughunter at proton.me
>
> Sent with Proton Mail secure email.
>
> On Monday, November 4th, 2024 at 21:05, Sławomir Osipiuk via Unicode 
> <unicode at corp.unicode.org> wrote:
>
>
> > On Monday, 04 November 2024, 00:43:29 (-05:00), A bughunter via 
> Unicode wrote:
>
>
> Originating Question
>
> Where to get the sourcecode of relevent (version) UTF-8?: in order to 
> checksum text against the specific encoding map (codepage).
>
> Such as this now keep the originating question pinned at the top of 
> each reply and let every reply focus on the originating question 
> because as you see I was dumped on with over 5 pages of unrelated and 
> offtopic nonsense in reply to my single line question.
> >> No, it does not answer my question.
>
>
> I didn't post to hold a free seminar on computer science. By my grace 
> I will expound: UTF-8 is a text format of Unicode. Unicode is a 
> standard. In order to get anything to produce Unicode UTF-8 it must be 
> compiled. Time is a sequence of events you have compile time and 
> runtime. Before something is compiled it is sourcecode. Wherever the 
> UTF-8 is input into the sourcecode it is then compiled into a runtime. 
> As far as your gripe about my strange use of "bytecode" I have already 
> defined it absolutely so. You may go back and re-read.
> > I don't think I'm alone in saying that your question is very 
> unclear, in major part by your very strange use of certain terms. I 
> don't think I've ever encountered "bytecode" outside of Java 
> implementations, and never does it refer to textual (prose) data as 
> you seem to do. I still don't know what "compile time UTF-8" is 
> supposed to be, and I've read both your messages multiple times.
>
>
> Your question is offtopic the only part you need to focus on to answer 
> the originating question is: "the character to glyph map must be known."
> >> In order to fully authenticate: the codepage of the character to 
> glyph map must be known.
>
>
> > To authenticate what? At the end of the day, you're always just 
> authenticating a stream of bits.
> You are wrong about the end of the day " At the end of the day, you're 
> always just authenticating a stream of bits." but I will not argue or 
> correct you because it does not answer my question nor is it specific 
> to Unicode.
>
> >> I need the bytecode to glyph map of UTF-8 as it is used by my 
> runtime software.
>
> No, I do not want to map. I need the bytecode/character to glyph map 
> in the sourcecode of whatever is being used to produce UTF-8. 
> Absolutely this must be contained in the runtime software in order for 
> anything to produce UTF-8. Yet you have failed to ask for the 
> information required to answer my concise yet full though simple one 
> line relevent ontopic question.
> > You want to map UTF8-encoded code points to characters? (Glyphs are 
> the visual representations of characters, determined by the font.) In 
> that case the "map" is the Unicode database. Each code point (encoded 
> as one or more bytes in UTF8) maps to a character. Versions of the 
> database are freely accessible online.
>
> Again the question has been pinned at the top. "Where to get the 
> sourcecode of relevent (version) UTF-8?"
> > But I am still very unsure of what you're asking for. Are you 
> concerned that code points may be reassigned in the future? That, for 
> example, writing "Smith" in version 16 may appear as "Smite" in a 
> future version, and this affects the apparent content of a checksummed 
> text file? If so, that is prevented by the Unicode Stability Policy; 
> assigned code points cannot have their character identity changed. I 
> don't see any practical way of exploiting differences between Unicode 
> versions to alter the apparent content of text.
>
>
> The rest of Slawomir's reply was so far removed from my question 
> "Where to get the sourcecode of relevent (version) UTF-8?" it is not 
> worth replying.
> > Sławomir