get the sourcecode [of UTF-8]
Phil Smith III
lists at akphs.com
Wed Nov 6 21:45:50 CST 2024
One more possibility:
>If my OS reads 101 as ABA another system could show 101 as BAB though
>the checksum should match because it is the same data, and yet because
>this is bytecode there are further multiple data possibilities to
>produce 101->nnn suppose 11011 and 01010: these two data files produce
>different checksums but should display the same output as it were
>intended ABA. This is to say the integrity is meant to be of the
>content delivered to mrreader, and when using UTF-8 a checksum cannot
>verify the integrity of text content.
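That makes me think that what you really want might be a Unicode normalization algorithm (NFC or NFD), not source code for a UTF-8 encoder per se. Normalization is defined on code points, independent of the encoding; you apply it first and then encode the result as UTF-8.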
Then if you have two strings that display the letter À, one of which comprises
U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
and one of which comprises
U+0041 LATIN CAPITAL LETTER A plus U+0300 COMBINING GRAVE ACCENT
you'll wind up with the same UTF-8 string to compare once both have been normalized. If you normalize to NFC, both become the first (precomposed) form; if you normalize to NFD, both become the second (decomposed) form. You won't care which at that point: you just need to be consistent about which one you use.
Is this what you need? The Unicode specification defines NFC and NFD (in Unicode Standard Annex #15), and, again, I'm sure there are plenty of implementations available. Those are unlikely to vary or be wrong, since a faulty implementation would cause comparisons, and thus searches, to fail when they should succeed.
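To illustrate the idea, here is a minimal sketch in Python, assuming the standard library's unicodedata and hashlib modules are acceptable and that SHA-256 stands in for whatever checksum you actually use. The function name normalized_digest is made up for this example. It normalizes the text first, then encodes to UTF-8 and hashes, so the precomposed and decomposed spellings of À give the same checksum.

```python
import hashlib
import unicodedata


def normalized_digest(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize, encode as UTF-8, then hash.

    SHA-256 is an assumption here; any checksum works the same way
    as long as it is computed over the normalized bytes.
    """
    normalized = unicodedata.normalize(form, text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


precomposed = "\u00C0"    # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
decomposed = "A\u0300"    # U+0041 followed by U+0300 COMBINING GRAVE ACCENT

# The raw UTF-8 bytes differ, so a checksum over them would differ too.
raw_bytes_differ = precomposed.encode("utf-8") != decomposed.encode("utf-8")

# After normalization (either form, used consistently) the digests agree.
same_nfc = normalized_digest(precomposed, "NFC") == normalized_digest(decomposed, "NFC")
same_nfd = normalized_digest(precomposed, "NFD") == normalized_digest(decomposed, "NFD")

print(raw_bytes_differ, same_nfc, same_nfd)
```

The only design decision is to pick one form and apply it on both sides before computing or verifying the checksum; mixing NFC on one side with NFD on the other brings the mismatch back.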