get the sourcecode [of UTF-8]

Giacomo Catenazzi cate at cateee.net
Tue Nov 5 02:33:13 CST 2024


On 2024-11-04 6:43, A bughunter via Unicode wrote:
> No, it does not answer my question.
>
> Yes 1 byte is 8 bits and UTF-8 is Unicode Text Format - 8 bit. Then 
> you give me a manual page which is clearly for Unicode version 16. 
> When I say relevant version, whether you call it core or not, I 
> anticipate you would ask about what implementation of UTF-8: the 
> answer is the relevant implementation is android 13 libbionic (bionic 
> C) which uses UTF-8.
>  Without the sourcecode you could only guess as to which unicode 
> version bionicC uses. With slight assumption, android 13 is open 
> source AOSP and, it would be possible to point out the exact unicode 
> used in it however this assumes my runtime matches a generic AOSP 
> android 13 source. So then the way in which I framed my question does 
> probe as to if there is any way to display the compile time UTF-8. 
> Sometimes there are --version options.
>  The part you do not seem to understand is the full circle of 
> authentication of a checksummed text. In order to fully authenticate: 
> the codepage of the character to glyph map must be known. Anything 
> further on this checksumming process would not be directly on topic of 
> this mailing list and you may ask me on the side. Although stating the 
> usecase is worth mentioning.

You may go to https://android.googlesource.com/platform/bionic/ to check 
the source, but it is a C library, so it may not even know about Unicode 
(and UTF-8): it may just care that strings are terminated with \0. And 
please note: we are not Android, so you are in the wrong place.
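
To illustrate why a byte-oriented C function can handle UTF-8 without 
knowing anything about Unicode, here is a small Python sketch. The helper 
c_strlen is hypothetical: it only imitates what a strlen-style function 
does and is not code taken from bionic. It relies on the fact that UTF-8 
uses the byte 0x00 only for U+0000 itself.

```python
def c_strlen(data: bytes) -> int:
    """Imitate C strlen: count bytes up to (not including) the first 0x00.

    Hypothetical helper for illustration only; not bionic code.
    """
    end = data.find(b"\0")
    return len(data) if end == -1 else end


text = "é日"                          # two code points
encoded = text.encode("utf-8") + b"\0"  # NUL-terminated, as a C string would be

# The byte-level length differs from the number of code points:
# U+00E9 takes 2 bytes and U+65E5 takes 3 bytes in UTF-8.
print(len(text), c_strlen(encoded))

# No byte inside the encoded text is 0x00, so the terminator is unambiguous.
assert 0 not in text.encode("utf-8")
```

So the library can copy, compare and measure such strings correctly as 
bytes, while never knowing which Unicode version the text follows.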

And you will hate me for the next link, but we give you the resources; 
you need to do the homework, read, and look at the details. Handling 
characters is a huge task, done by many libraries. From another mail, 
it seems you care about the position of a character (column). The simple 
way is "fixed-width characters": Unicode has a table of single- and 
double-width characters (the East_Asian_Width property), and there are 
also characters that do not take any space (e.g. combining marks). 
Double-width characters are used mainly by East Asian scripts (e.g. CJK).
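
The "simple way" can be sketched in Python with the standard unicodedata 
module. This is only an approximation under stated assumptions: the name 
display_width is made up for this example, wide and fullwidth characters 
are counted as two columns, and combining, format and control characters 
are counted as zero. Real terminals and layout engines may decide 
differently.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of fixed-width columns occupied by text."""
    width = 0
    for ch in text:
        # Combining marks, format and control characters take no column
        # of their own in this simplified model.
        if unicodedata.combining(ch) or unicodedata.category(ch) in (
            "Mn", "Me", "Cf", "Cc"
        ):
            continue
        # East_Asian_Width "W" (wide) and "F" (fullwidth) take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


print(display_width("abc"))       # three narrow characters
print(display_width("日本語"))     # three wide characters
print(display_width("e\u0301"))   # "e" followed by a combining acute accent

# The tables are the ones bundled with this Python runtime; this attribute
# reports which Unicode version they come from.
print(unicodedata.unidata_version)
```

Note that the result depends on the Unicode data shipped with the 
runtime, which is the same kind of version dependency the question is 
about.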

But usually things are much more complex, and a lot is still work in 
progress (hence the link on how Google and Android are changing the 
stack): there is a text layout library (mixing languages, left-to-right, 
right-to-left, justification, italic/bold, paragraphs, etc.). I think 
Android uses Minikin. Left-to-right/right-to-left handling may use a 
different library. Then you have the text shaper (HarfBuzz in Android 
and most browsers): it finds the glyphs to use, their dimensions, and 
where to put them. And you have other libraries to select the font and 
to display it: "rendering" (and anti-aliasing) according to resolution 
and other factors. And probably it is even more complicated, with still 
other libraries (right, they may use ICU, from Unicode, for algorithms 
such as splitting a word at the end of a line, etc.).

As you see, there is a lot, and much of it is work in progress. So your 
task is not easy (just because you need a lot of libraries and have to 
read sparse documentation). There is no need to be a genius (or a 
computer guy; in fact, maintainers of such tools have different 
backgrounds), but it is a "lonely place". Very few people enter it, so 
do not expect help here: we didn't dare to enter there; Unicode is 
already too complex. Good luck!

And the link: State of Text Rendering 2024: https://behdad.org/text2024/

giacomo




