get the sourcecode [of UTF-8]
A bughunter
A_bughunter at proton.me
Thu Nov 7 08:32:25 CST 2024
My reply to Jim is interspersed.
The originating question: concise yet full, though simple; one line, relevant, on-topic.
Where to get the sourcecode of the relevant (version) UTF-8?: in order to checksum text against the specific encoding map (codepage).
from A_bughunter at proton.me
Sent with Proton Mail secure email.
On Thursday, November 7th, 2024 at 05:57, Jim DeLaHunt via Unicode <unicode at corp.unicode.org> wrote:
>
> On 2024-11-06 18:45, A bughunter via Unicode wrote:
>
> > …My query here is a sideline to my GitHub repo Unicode_map you may see here https://github.com/freedom-foundation/unicode_map and you may see my Proof of Concept https://github.com/freedom-foundation/unicode_map?tab=readme-ov-file#proof-of-concept if you would like to discuss the fine points of checksumming
You did not need my GitHub page to gather the use-case. There is no hint. The full use-case has been stated: "checksum text". You have some misconception about what text is. Text is text.
> Thank you for posting that link. It provides me a hint of what you want to do.
Those are current Law, and not all of those were handwritten on parchment. My_Declaration https://github.com/freedom-foundation/My_Declaration and Journal of The Congress https://github.com/freedom-foundation/Journal-of-The-Congress are both of machine-typed origin, by myself. The photographs of the source documents are completely irrelevant to this mailing list thread. Not sure why you opened like the narrator and assessor of Antiques Roadshow.
> What I see in the repo are various representations of historical documents from the 18th century, which were originally produced as English language text hand-written on parchment with pen and ink. You have images of the text on physical pages, and the character content of the texts in UTF-8. You write there,
>
> > These documents have priority to be archived in both ASCII wordcounted and checksummed text and PDF/A-1 archive format for long-term preservation, then signed and attested, guaranteed to be veritable against corresponding printed replica documents.… I’m interested in sourcecode and libre-sourcecode. Libre-sourcecode being defined as all of a machine specification (chip design), compiler, and application sourcecode which can be written out in respective computer programming languages and archived, saved, transmitted, reproduced, and built and run, all from what is legitimately written on paper.…
Do you like the full-continuity definition of Libre-sourcecode? I enjoy the plug, but the moderator might not. None of this was necessary to answer the question on this mailing list thread. This is a one-line question, and you are the first here in this thread to give a relevant answer (out of about 18 pages of replies). I gave you the GitHub link as an invitation to discuss off-topic extras with me. Like this passage you have quoted me on: (nod of approval) yes, that is some great stuff. Join me.
> Source: <https://github.com/freedom-foundation>
>
> Your original first two messages said,
>
> > Where to get the sourcecode of the relevant (version) UTF-8?: in order to checksum text against the specific encoding map (codepage).
>
> Source: <https://corp.unicode.org/pipermail/unicode/2024-November/011099.html>
>
> > what implementation of UTF-8: the answer is that the relevant implementation is Android 13 libbionic (bionic C), which uses UTF-8.…
> > …Android 13 is open-source AOSP, and it would be possible to point out the exact Unicode used in it; however, this
> > assumes my runtime matches a generic AOSP Android 13 source. So the way in which I framed my question does probe
> > whether there is any way to display the compile-time UTF-8. Sometimes there are --version options.
> > The part you do not seem to understand is the full circle of authentication of a checksummed text. In order to fully
> > authenticate, the codepage of the character-to-glyph map must be known. Anything further on this checksumming process
> > would not be directly on-topic for this mailing list
>
> Source: <https://corp.unicode.org/pipermail/unicode/2024-November/011102.html>
>
Yes, these are my messages; not sure why you are requoting me, but at least I have a read receipt. Again, showing you a photograph of The Unanimous Declaration of the thirteen united States of America shouldn't make anything click in your head that lets you answer my originating question any better than without having seen the photograph. I gave you the GitHub link as an invitation to discuss off-topic extras with me.
You have a pretty good description here. I welcome this sort of discussion on my GitHub, although, given my maxim "the sourcecode must be known", it has no bearing on answering my question.
> Put in conventional terminology of text processing and display in software systems, it seems that you want to preserve historical documents in digital form. This digital form includes an expansive swath of the software stack: not just document content, but also several layers of software and hardware necessary to present the document. As part of this, you want to calculate some sort of robust digest of the digital form, to let a receiver of the document assure themselves that what they see (experience) when viewing the digital form of the document has the same relationship to the original document which you had when you authored the digital form.
>
It should be, but one doesn't really know, because when anyone buys an Android phone it does not come with the sourcecode required to comply with the GNU GPL, and if one asks, they would only point at AOSP on Google's website as if that were compliance: it is not. As I said in a previous post, it is "with slight assumption" that the source on Google's site was used to build my machine.
> One part of your software stack is similar to, but not necessarily the same as, the Android Open Source Project's libbionic (an implementation of libc).
Actually I asked for both. I said: you shouldn't have to reverse engineer the software to contrast it against the Unicode standard it purports to implement. Here: https://corp.unicode.org/pipermail/unicode/2024-November/011118.html Now, with about 18 pages of replies it is getting hard to track, and some mail I couldn't even find before replying, so I will just ask again: does Unicode have any reference model? When saying both, I mean the standard's source and the machine's source. I would compare the consortium standard, or a consortium reference, to the sourcecode. This is why I avoid saying "implementation": there is no need to bend and contort the runtime away from the standard.

Here you say "for the part of your library", but you had said "software stack"; try to be consistent. The library has to do with stdio (standard input/output) and terminal display, but since you said "stack", it may not be restricted to this library; there is also the Linux kernel sourcecode, and I made no such restriction in my question. I would need a pointer of sorts from the consortium, which should have some idea how the standard it makes is actually carried out in software such as Linux.

There is nothing here about "believing": actions can happen without any presupposition. The only reasonable supposition that usually fails is that those being emailed will comply with the GNU GPL, as aforementioned about the Android sourcecode. It is reasonable to suppose a business will comply with law, and one needs no "believing" as to whether or not they will comply. Neither do I need any "believing" about the competence of the Unicode consortium to answer a simple relevant question. Nor do I need any expectations about what the answer should be; that is pretty much why one asks a question. If I could write out the expected answer, I guess I wouldn't need to ask (where is a scalar snake hiding?). Although it is reasonable to expect an answer, or to suppose with some reserve that the Unicode consortium would answer a question.
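To make concrete what comparing the machine's runtime against the consortium's standard could look like, here is a minimal sketch. The ill-formed byte sequences and the expected rejections come straight from the Unicode Standard (Table 3-7, Well-Formed UTF-8 Byte Sequences); the harness itself, and the locale name "C.UTF-8", are my own assumptions, not anything published by bionic or the consortium:

    /* Feed the runtime's C11 decoder byte sequences the Unicode Standard
     * classifies as ill-formed and confirm it rejects them.  Requires a
     * libc providing <uchar.h> (glibc and bionic both do). */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <uchar.h>

    int main(void) {
        setlocale(LC_ALL, "C.UTF-8");   /* locale name varies by system */

        /* Each sequence is ill-formed per Table 3-7 of the standard. */
        static const struct { const char *name; const char *bytes; size_t len; } bad[] = {
            { "overlong slash C0 AF",      "\xC0\xAF",         2 },
            { "surrogate U+D800 ED A0 80", "\xED\xA0\x80",     3 },
            { "past U+10FFFF F4 90 80 80", "\xF4\x90\x80\x80", 4 },
        };

        for (size_t i = 0; i < sizeof bad / sizeof bad[0]; i++) {
            char32_t c;
            mbstate_t st;
            memset(&st, 0, sizeof st);
            size_t r = mbrtoc32(&c, bad[i].bytes, bad[i].len, &st);
            printf("%-28s -> %s\n", bad[i].name,
                   r == (size_t)-1 ? "rejected (conformant)"
                                   : "accepted (NOT conformant)");
        }
        return 0;
    }

If the runtime accepts any of these, it is not decoding UTF-8 as the standard defines it, and no amount of pointing at AOSP's website changes that.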
Also, for your betterment: I must believe what I know. If you do not believe what you know, then you have some mental malfunction.
> You are looking for the source code for the part of your library which processes character codes in UTF-8 form, believing that this source code will show you how UTF-8 code units processed by that library will end up displayed as "glyphs" <https://unicode.org/glossary/#glyph> on a display surface. You want to capture this relationship between code units and glyphs as part of your robust digest. You expect that the answer will be simple enough that a single email to a list will result in a simple reply which gives you what you seek.
No. No "wanting": I already got an example from Oren, here: https://corp.unicode.org/pipermail/unicode/2024-November/011113.html However, this has nothing to do with what my machine, the one being used to make the UTF-8, is using. Something like this is somewhere behind the displaying of the terminal stdio. It goes to prove this thread could have been answered within a couple or so replies.
I will take a look at this. It appears to be a piece of the sourcecode puzzle. If one uses a Debian FOSS system and it produces UTF-8 texts, there must be sourcecode.
> I did a little web searching, and I think I can point you to some places where libbionic <https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc> processes code units in UTF-8 form. The source code uses the tags "mb", short for "multi-byte", and "wc", short for "wide character", in the names of functions which operate on UTF-8 code unit data and Unicode scalar values respectively. Take a look at:
>
> function mbsnrtowcs() <https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/wchar.cpp#68>
>
> function mbrtoc32() <https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/mbrtoc32.cpp#36>
>
I would guess there may be interconnecting parts of the sourcecode. It is disappointing of the Unicode consortium that it does not know how Android or Linux implements its standard. As I said, it is as though the standard is something on the sideline, not having much to do with software in use.
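For what it is worth, here is a minimal sketch of exercising the second of those functions against the running libc, so one can at least observe the code-unit-to-scalar step those files implement. It checks the runtime behaviour, not the compile-time sourcecode, which remains my complaint; the locale name is again an assumption:

    /* Walk a UTF-8 string with C11 mbrtoc32(), printing how many code
     * units (bytes) each Unicode scalar value consumed. */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <uchar.h>

    int main(void) {
        setlocale(LC_ALL, "C.UTF-8");           /* assumption: UTF-8 locale */
        const char *s = "D\xC3\xA9claration";   /* "Déclaration", é = U+00E9 */
        size_t remaining = strlen(s);
        mbstate_t st;
        memset(&st, 0, sizeof st);

        while (remaining > 0) {
            char32_t c;
            size_t n = mbrtoc32(&c, s, remaining, &st);
            if (n == (size_t)-1 || n == (size_t)-2) {
                fprintf(stderr, "ill-formed or truncated input\n");
                return 1;
            }
            printf("%zu byte(s) -> U+%04X\n", n, (unsigned)c);
            s += n;
            remaining -= n;
        }
        return 0;
    }

Under a working UTF-8 locale this prints, e.g., "2 byte(s) -> U+00E9" for the é; observing that at runtime is still not the same as having the compile-time sourcecode.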
> I imagine you will find these unsatisfying. They implement the UTF-8 data conversions with no mention of either UTF-8 version or Unicode version. Nor do they mention glyphs, fonts, character to glyph mapping, or any of the other text rendering complexity which it seems you want to characterise.
>
I don't get why you think Unicode stops short of glyphs, when while this discussion goes on a guy will send a glyph to the mailing list, call it an English character, and ask for it to be added. Hence glyph and character become synonymous in such an instance. https://corp.unicode.org/pipermail/unicode/2024-November/011123.html You guys have these fragmented thoughts and misconceptions. There is a full continuity where all of these are synonymous: data, which is Unicode data, which is bytecode, which is a C integer, which is a character, which may be a letter, which is a glyph, which is text. There is a full continuity where these are, in specific instances, all synonymous, though not all backward compatible as a hierarchy. This is to say: when Mr. Reader is reading the screen, he sees TEXT. That text is constituted and comprised of all of these synonyms, which are synonyms where they converge (in an instance) to comprise the TEXT on Mr. Reader's screen.
Actually I have no desire for Unicode UTF-8. I was pulled into this because it happens to be what my machine uses. The 7-bit ASCII should be perfect for the use-case "checksum text". UTF-8 is a problem. It should seem to you that I am not trying to reinvent but to backtrack to the last-known-good. I do not need to learn; you have some misconception. A cyclic redundancy continuum on ASCII 7-bit text, with the codepage included, is simple enough and undeniably verifiable; a sketch of what I mean follows below. However, this is not the subject of the Unicode mailing list. I gave you the GitHub link as an invitation to discuss off-topic extras with me.
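As a sketch of that cyclic-redundancy idea (CRC-32 with the common reflected polynomial 0xEDB88320; the framing of codepage name, newline, then text is my own illustration, not any standard):

    /* Checksum 7-bit ASCII text together with the name of its encoding
     * map, so the same bytes under a different codepage verify differently. */
    #include <stdint.h>
    #include <stdio.h>

    /* Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320);
     * caller supplies the running register. */
    static uint32_t crc32_str(uint32_t crc, const char *s) {
        for (; *s; s++) {
            crc ^= (uint8_t)*s;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0u);
        }
        return crc;
    }

    int main(void) {
        const char *codepage = "US-ASCII";  /* the encoding map, named explicitly */
        const char *text = "We hold these truths to be self-evident";

        /* Refuse anything outside 7-bit ASCII before checksumming. */
        for (const char *p = text; *p; p++)
            if ((unsigned char)*p > 0x7F) {
                fprintf(stderr, "non-ASCII byte; refusing to checksum\n");
                return 1;
            }

        /* Fold the codepage identifier into the digest, then the text. */
        uint32_t crc = 0xFFFFFFFFu;         /* standard initial register */
        crc = crc32_str(crc, codepage);
        crc = crc32_str(crc, "\n");
        crc = crc32_str(crc, text);
        printf("CRC-32(%s + text) = %08X\n", codepage, (unsigned)~crc);
        return 0;
    }

The point is only that the codepage is part of what is attested, not an external assumption.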
> I have the impression that you are trying to reinvent a whole lot of work in text representation, text display, digital document preservation, archiving, and software preservation, without yet having taken the time to learn about existing work in these fields. If your intent is to preserve 18th century hand-written documents well, I suggest you start by representing them as well-crafted PDF/A files. You could perhaps get a PhD in digital archiving and still not exhaust all the implications of what I think you are asking.
You are overcomplicating. Oh well, this is a spot where I can plug that the simple preservations have potentially eternal implications. I did not ask anything about my use-case. Reread the originating question. I gave you the GitHub link as an invitation to discuss off-topic extras, such as use-case, with me.
Should I save the luck in a container until it expires?
> Good luck with your project! Best regards,
> —Jim DeLaHunt
>
> --
> . --Jim DeLaHunt, jdlh at jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
> multilingual websites consultant, Vancouver, B.C., Canada