get the sourcecode [of UTF-8]
Jim DeLaHunt
list+unicode at jdlh.com
Wed Nov 6 23:57:23 CST 2024
On 2024-11-06 18:45, A bughunter via Unicode wrote:
> …My query here is a sideline to my GitHub repo Unicode_map you may see
> here https://github.com/freedom-foundation/unicode_map and you may see
> my Proof of Concept
> https://github.com/freedom-foundation/unicode_map?tab=readme-ov-file#proof-of-concept
> if you would like to discuss the fine points of checksumming
Thank you for posting that link. It gives me a hint of what you want
to do.
What I see in the repo are various representations of historical
documents from the 18th century, originally produced as
English-language text hand-written on parchment with pen and ink. You
have images of the text on physical pages, and the character content of
the texts in UTF-8. You write there,
> These documents have priority to be archived in both ASCII wordcounted
> and checksummed text and PDF/A-1 archive format for long term
> preservation then signed and attested, garunteed to be veritable to
> corresponding printed replica documents.… I’m interested in sourcecode
> and libre-sourcecode. Libre-sourcecode being defined as allof a
> machine specification (chip design), compiler and application
> sourcecode which can be written out in respective computer programming
> languages and archived, saved, transmit, reproduced, and build and run
> all from paper written legit.…
Source: <https://github.com/freedom-foundation>
Your first two messages said,
> Where to get the sourcecode of relevent (version) UTF-8?: in order to checksum text against the specific encoding map (codepage).
Source:
<https://corp.unicode.org/pipermail/unicode/2024-November/011099.html>
> what implimentation of UTF-8: the answer is the relevent implimentation is android 13 libbionic (bionic C) which uses UTF-8.…
> …android 13 is open source AOSP and, it would be possible to point out the exact unicode used in it however this
> assumes my runtime matches a generic AOSP android 13 source. So then the way in which I framed my question does probe
> as to if there is any way to display the compile time UTF-8. Sometimes there are --version options.
> The part you do not seem to understand is the full circle of authentication of a checksummed text. In order to fully
> authenticate: the codepage of the character to glyph map must be known. Anything further on this checksumming process
> would not be directly on topic of this mailing list
Source:
<https://corp.unicode.org/pipermail/unicode/2024-November/011102.html>
Put in the conventional terminology of text processing and display in
software systems, it seems that you want to preserve historical
documents in digital form. This digital form includes an expansive swath
of the software stack: not just document content, but also several
layers of software and hardware necessary to present the document. As
part of this, you want to calculate some sort of robust digest of the
digital form, to let a receiver of the document assure themselves that
what they see (experience) when viewing the digital form of the document
has the same relationship to the original document as what you saw when
you authored the digital form.
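To make that concrete: a "robust digest" in this sense is usually a
cryptographic hash, such as SHA-256, computed over the raw byte sequence
of a file. Here is a minimal sketch in C, assuming OpenSSL's libcrypto
is available (build with -lcrypto); the program and its structure are my
illustration, not anything taken from your repo:

#include <openssl/evp.h>
#include <stdio.h>

/* Sketch: SHA-256 over the raw bytes of a file, printed as hex.
   Assumes OpenSSL libcrypto; build with: cc digest.c -lcrypto */
static int digest_file(const char *path,
                       unsigned char out[EVP_MAX_MD_SIZE],
                       unsigned int *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    int ok = ctx && EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    unsigned char buf[4096];
    size_t n;
    while (ok && (n = fread(buf, 1, sizeof buf, f)) > 0)
        ok = EVP_DigestUpdate(ctx, buf, n);
    if (ok)
        ok = EVP_DigestFinal_ex(ctx, out, out_len);

    EVP_MD_CTX_free(ctx);
    fclose(f);
    return ok ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int len;
    if (digest_file(argv[1], md, &len) != 0)
        return 1;
    for (unsigned int i = 0; i < len; i++)
        printf("%02x", md[i]);
    printf("  %s\n", argv[1]);
    return 0;
}

Note what such a digest commits to: the byte sequence only. It says
nothing about which glyphs those bytes produce on a screen, which is
exactly the gap you are asking about.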
One part of your software stack is similar to, but not necessarily the
same as, the Android Open Source Project's libbionic (an implementation
of libc).
You are looking for the source code for the part of your library which
processes character codes in UTF-8 form, believing that this source code
will show you how UTF-8 code units processed by that library will end up
displayed as "glyphs" <https://unicode.org/glossary/#glyph> on a display
surface. You want to capture this relationship between code units and
glyphs as part of your robust digest. You expect that the answer will be
simple enough that a single email to a list will result in a simple
reply which gives you what you seek.
I did a little web searching, and I think I can point you to some places
where libbionic
<https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc>
processes code units in UTF-8 form. The source code uses the tags "mb",
short for "multi-byte", and "wc", short for "wide character", in the
names of functions which operate on UTF-8 code unit data and Unicode
scalar values respectively. Take a look at:
function mbsnrtowcs()
<https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/wchar.cpp#68>
function mbrtoc32()
<https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/mbrtoc32.cpp#36>
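For illustration, here is a minimal sketch, in portable C11 rather than
bionic's C++, of what this family of functions does: mbrtoc32() from
<uchar.h> consumes multi-byte (UTF-8) code units and yields Unicode
scalar values. The sample string and the "C.UTF-8" locale name are my
choices for the example, and they assume a host whose locale encoding is
UTF-8 (on bionic it always is):

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>   /* C11: char32_t, mbrtoc32() */

int main(void)
{
    /* mbrtoc32() decodes the locale's multibyte encoding, so we need a
       UTF-8 locale; "C.UTF-8" is an assumption about the host system. */
    if (!setlocale(LC_ALL, "C.UTF-8")) {
        fprintf(stderr, "no C.UTF-8 locale available\n");
        return 1;
    }

    const char *s = "Libert\xC3\xA9";  /* "Liberté": 'é' is two UTF-8 code units */
    size_t len = strlen(s);
    mbstate_t st;
    memset(&st, 0, sizeof st);

    for (size_t i = 0; i < len; ) {
        char32_t c;
        size_t n = mbrtoc32(&c, s + i, len - i, &st);
        if (n == (size_t)-1 || n == (size_t)-2) {
            fprintf(stderr, "ill-formed or incomplete UTF-8 at byte %zu\n", i);
            return 1;
        }
        printf("U+%04X decoded from %zu code unit(s)\n", (unsigned)c, n);
        i += n;
    }
    return 0;
}

Running it prints one scalar value per line, e.g. "U+00E9 decoded from
2 code unit(s)" for the é. That byte-to-scalar conversion is all these
functions do.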
I imagine you will find these unsatisfying. They implement the UTF-8
data conversions with no mention of either UTF-8 version or Unicode
version. Nor do they mention glyphs, fonts, character-to-glyph mapping,
or any of the other text-rendering complexity which it seems you want to
characterise.
I have the impression that you are trying to reinvent a whole lot of
work in text representation, text display, digital document
preservation, archiving, and software preservation, without yet having
taken the time to learn about existing work in those fields. If your
intent is to preserve 18th century hand-written documents well, I
suggest you start by representing them as well-crafted PDF/A files. You
could perhaps get a PhD in digital archiving and still not exhaust all
the implications of what I think you are asking.
Good luck with your project! Best regards,
—Jim DeLaHunt
--
. --Jim DeLaHunt, jdlh at jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
multilingual websites consultant, Vancouver, B.C., Canada