get the sourcecode [of UTF-8]

A bughunter A_bughunter at proton.me
Tue Nov 5 05:43:04 CST 2024


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

My reply to Jim is interspersed below.
 Originating Question

Where can I get the sourcecode of the relevant version of UTF-8, in order to checksum text against the specific encoding map (codepage)?

from A_bughunter at proton.me

On Tuesday, November 5th, 2024 at 05:52, Jim DeLaHunt <list+unicode at jdlh.com> wrote:

> On 2024-11-04 20:04, A bughunter via Unicode wrote:

Hello Jim, pleased to hear back from you again. You have had the best answer so far. You wrote: "If by "source code" you refer to an implementation of the UTF-8 format, then is no single answer.", which is about the only correct paraphrase I have seen in the five or more pages dumped on me by this mailing list in reply to the single-line question I have pinned at the top of this reply. However, you did not go on to ask for the required information: which version is relevant? I posted that preemptively in a previous reply: UTF-8 as implemented in Android 13 libbionicC, but the reference of the Unicode standard is also relevant, and the two in comparison. Generally we put the standard into a computer language such as C. Therefore the Unicode V.16 standard of UTF-8 should also be the sourcecode of the implementation; these converge, making them synonymous at the convergence. For instance, the mathematical code of md5sum is encased in C {function} code; the formula is both the sourcecode in the RFC and the implementation of it. This same convergence will happen with a standard such as Unicode. The standard map, wherever it is shown to the programmer who is to implement the standard, is also the source which is imported into the C code to "implement" it.

To further keep the definitions in the context of this mailing list thread: I say bytecode and character also converge to be synonymous. Where the C language would call something a character, on disk and in RAM this is bytecode. I can say it either way, and I gave you that in a previous post: call it either. I do not like to repeat myself because it encourages the common habit of ignoring that which I have already said; yet here again I will say ASCII is a 7-bit codepage. When a programmer implements UTF-8, the ASCII is determined by only 7 bits, therefore in the C language this would be bit-code: I have already said this. Because UTF-8 is 8-bit, ALL of the ASCII subset in UTF-8 is then bytecode. Furthermore, the word glyph is also synonymous with character and with the bytecode which UTF-8 has for that character. Where they converge on the computer makes them synonymous: this is just the facts of speaking English. Do not assimilate what others are saying. Stay focused and kind of ignore the distractions.
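
For illustration only, here is a minimal sketch of what "putting the standard into C" can look like for UTF-8. It follows the bit layout given in the Unicode Standard (Table 3-6, "UTF-8 Bit Distribution"). The function name utf8_encode is hypothetical; this is not taken from bionic, glibc, or any other libc, and a real implementation may be organized quite differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration, not libc code).
 * Encodes one Unicode scalar value as UTF-8 into out[].
 * Returns the number of bytes written (1 to 4), or 0 if cp is not a
 * scalar value (a surrogate, or above U+10FFFF).
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        /* U+0000..U+007F: one byte, high-order bit 0, same value as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        /* surrogate code points are not encodable in UTF-8 */
        return 0;
    }
    if (cp <= 0xFFFF) {
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* above U+10FFFF */
}
```

Called with U+0041 this writes the single byte 0x41; called with U+20AC it writes the three bytes E2 82 AC. In this sketch there is no lookup table: the bytes are computed from the code point by shifts and masks.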

I defined the words so there are not any issues about meanings.
> People are trying to help, but the meaning you seem to have for certain
> words are different than the meaning we are used to when discussing
> Unicode and text encoding.
> 
> If you would like to learn how we use the words, consider reading
> Chapter 1 of the Unicode Core Specification:
> https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-1/.
> 
> Particularly, see what it says about code points and encoding forms.
> Also have a look at the terms in the Unicode glossary, especially
> https://unicode.org/glossary/#glyph and
> https://unicode.org/glossary/#glyph_image and
> https://unicode.org/glossary/#character and
> https://unicode.org/glossary/#code_unit.
>
No, a problem with common understanding is that ye swap and change meanings, even with webpages, manuals, and dictionaries. I have already given definitions within the self-contained context of this mailing list thread. My definitions are better, but if ye wish to propose some from the consortium docs, feel free to import them into this mailing list thread.
> But another way to find common understanding is for you to give an
> example of what you are looking for. For instance, can you show us the
> sourcecode for the ASCII bytecode to glyph map? Or the sourcecode for
> the bytecode to glyph map of another encoding standard?
That is pretty much why I am here: I am asking you for sourcecode. RFC20 would be the standard and the sourcecode to be imported into the machine, which is what ye claim UTF-8 has. Although RFC20 is 7-bit bit-code, wherever UTF-8 is implemented in sourcecode (glibC) it is then made into byte-code, as RFC20 puts it, "stored in 8 bits": that is, 7-bit bit-code stored in 8-bit UTF-8 byte-code.
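
To make the "7-bit code stored in 8 bits" point concrete, here is a minimal sketch in C. It assumes the text is already in memory as a byte buffer, and the function name is_7bit_ascii is hypothetical, not from glibc or any other library. It only checks that every byte has its high-order bit clear; a buffer that passes has the same byte values whether it is read as ASCII or as UTF-8, since UTF-8 encodes U+0000 through U+007F as single bytes with those same values.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration).
 * Returns true if every byte in buf[0..len-1] has its high-order bit
 * clear, i.e. each byte holds a 7-bit value stored in an 8-bit byte.
 */
bool is_7bit_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80) {
            return false; /* high bit set: not a 7-bit code */
        }
    }
    return true;
}
```

For the five bytes of "Hello" this returns true. For the two bytes C3 A9 (the UTF-8 form of U+00E9) it returns false, because both bytes have the high-order bit set.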

Sure, I can explain how I use it to checksum text. In fact, I have already invited ye to ask me outside of this mailing list, because checksum is not Unicode-specific.
> Also, given that sourcecode, can you explain how you use it to "checksum
> text"?

RFC20 is the same thing as a codepage. You can pull RFC20 here: https://github.com/freedom-foundation/ASCII-format-for-Network-Interchange . For whatever reason, IBM has taken down codepages on the website. You may note that ASCII has had something like 5 or more versions since 1968, and while I hear over and over again that ASCII is a subset of UTF-8, it cannot implement 5 differing versions simultaneously. I would need to see the sourcecode I have here asked for to identify which version of ASCII UTF-8 is using in what ye call a "subset". We would probably be better off without UTF-8; it is more like a shim (or slim-jim) that was added on top of ASCII to interfere with it.
> If you can show me the sourcecode for the ASCII bytecode to glyph map,
> and explain how to checksum text, maybe we will better understand what
> you seek.
> 
> --
> . --Jim DeLaHunt, jdlh at jdlh.com http://blog.jdlh.com/
> (http://jdlh.com/)
> multilingual websites consultant, Vancouver, B.C., Canada
-----BEGIN PGP SIGNATURE-----
Version: ProtonMail

wnUEARYKACcFgmcqBMQJkKkWZTlQrvKZFiEEZlQIBcAycZ2lO9z2qRZlOVCu
8pkAAPtpAQClaUbaDoEnBBCWo7U1rzTLsbBMvXBF6dqH/k2gdKweYgD+JMMk
/jWqodXNoGhtWxzhbPvJnnl5Y84cqA24IcE75As=
=4Dvg
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: publickey - A_bughunter at proton.me - 0x66540805.asc
Type: application/pgp-keys
Size: 653 bytes
Desc: not available
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20241105/4f16d617/attachment.key>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: publickey - A_bughunter at proton.me - 0x66540805.asc.sig
Type: application/pgp-signature
Size: 119 bytes
Desc: not available
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20241105/4f16d617/attachment.sig>