base1024 encoding using Unicode emojis

Sun Mar 11 12:07:27 CDT 2018

On Sun, Mar 11, 2018 at 11:25 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> Ideally, the purpose of such a base-1024 encoding is to compact
> arbitrary data into plain text that is safely preserved by Unicode
> normalization and by encoding forms such as UTF-8.
> But there are ways to do that which also minimize the UTF-8 string
> size (emojis are probably not the best set to use if most of them
> lie in supplementary planes).
>

Yeah, it certainly results in larger UTF-8 strings.  For example, a SHA-256
hash is 112 bytes when encoded as Ecoji in UTF-8.  In base64, the same hash
is 44 bytes.

Even though it's more bytes, Ecoji has fewer visible characters than base64
for SHA-256.  Ecoji has 28 visible characters and base64 has 44.  So that makes
me wonder which one would be quicker for a human to verify on average?
Also, which one is more accurate for a human to verify? I have no idea. For
accuracy, it seems like a lot of thought was put into the visual uniqueness
of Unicode emojis.
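
The figures above can be reproduced with a short Python sketch. It does not call Ecoji itself. It assumes that every 5-byte group, including a final partial one, is written as 4 emojis, and that each emoji takes 4 bytes in UTF-8. The helper name ecoji_sizes is made up for this illustration.

```python
import base64
import hashlib
import math

def ecoji_sizes(n_bytes):
    """Return (emoji count, UTF-8 byte count) for n_bytes of input.

    Assumptions: a final partial group is padded to 4 emojis, and every
    emoji is a 4-byte supplementary-plane character.
    """
    emojis = 4 * math.ceil(n_bytes / 5)
    return emojis, 4 * emojis

digest = hashlib.sha256(b"example input").digest()  # always 32 bytes
b64 = base64.b64encode(digest)                       # ASCII, 1 byte per character

ecoji_chars, ecoji_bytes = ecoji_sizes(len(digest))
print("ecoji :", ecoji_chars, "characters,", ecoji_bytes, "bytes")
print("base64:", len(b64), "characters,", len(b64), "bytes")
```

Under those assumptions this prints 28 characters and 112 bytes for Ecoji, and 44 characters and 44 bytes for base64.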

>
> You can choose another arbitrary set of 1024 codepoints in the BMP that is
> preserved by normalization (no decomposition, combining class=0) and text
> filters (no controls, no whitespace, possibly no punctuation, only letters
> or digits) and which is still simple to compute with a basic algorithm not
> requiring any table lookup (only a few tests for some boundary values or a
> very small lookup table with 16 entries, one entry for each subset of 64
> values).
>
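
A minimal Python sketch of the "16 entries, one per subset of 64 values" idea might look like the following. The table contents are a placeholder chosen only for illustration (sixteen consecutive 64-character blocks from CJK Unified Ideographs, U+4E00..U+51FF), not an alphabet anyone in this thread has proposed, and padding of a final partial group is left out.

```python
import unicodedata

# Hypothetical 16-entry table: first code point of each 64-character block.
BLOCK_STARTS = [0x4E00 + 64 * i for i in range(16)]

def encode(data):
    """Encode bytes whose length is a multiple of 5 (no padding support)."""
    if len(data) % 5:
        raise ValueError("this sketch only handles whole 5-byte groups")
    out = []
    for i in range(0, len(data), 5):
        group = int.from_bytes(data[i:i + 5], "big")    # 40 bits
        for shift in (30, 20, 10, 0):
            value = (group >> shift) & 0x3FF            # one 10-bit value
            # High 4 bits pick the table entry, low 6 bits the offset.
            out.append(chr(BLOCK_STARTS[value >> 6] + (value & 0x3F)))
    return "".join(out)

def decode(text):
    """Inverse of encode(); expects a multiple of 4 characters."""
    if len(text) % 4:
        raise ValueError("this sketch only handles whole 4-character groups")
    out = bytearray()
    for i in range(0, len(text), 4):
        group = 0
        for ch in text[i:i + 4]:
            cp = ord(ch)
            for block, start in enumerate(BLOCK_STARTS):
                if start <= cp < start + 64:
                    group = (group << 10) | (block << 6) | (cp - start)
                    break
            else:
                raise ValueError(f"U+{cp:04X} is not in the alphabet")
        out += group.to_bytes(5, "big")
    return bytes(out)

sample = b"0123456789"
text = encode(sample)
assert decode(text) == sample
assert unicodedata.normalize("NFKC", text) == text     # survives normalization
print(len(text), "characters,", len(text.encode("utf-8")), "UTF-8 bytes")
```

With this placeholder table the 10 sample bytes become 8 characters and 24 UTF-8 bytes, since every character falls in the 3-byte range of the BMP.
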
> Also, some frequent binary data (notably runs of null bytes) should be
> able to use shorter UTF-8 sequences from the ASCII set, so in my opinion
> the first 64 codes should be the same as in standard Base64. The others
> can easily be taken from CJK blocks or the PUA block in the BMP, but you
> can also select some blocks below U+0800 so that they are encoded as
> 2 bytes rather than the 3 needed for the rest of the BMP (or the 4 bytes
> needed for most emojis, where 10 bits become 32 bits, a huge waste of
> storage space in UTF-8).
>
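
To make the UTF-8 cost per range concrete, here is a small Python sketch. The sample code points are arbitrary representatives of each range, picked only for illustration.

```python
def utf8_len(cp):
    """Number of UTF-8 bytes used by one code point."""
    return len(chr(cp).encode("utf-8"))

# One arbitrary representative per range (illustrative choices).
samples = [
    ("ASCII", 0x0041),
    ("below U+0800", 0x0400),
    ("rest of the BMP", 0x4E00),
    ("supplementary planes", 0x1F600),
]

for name, cp in samples:
    n = utf8_len(cp)
    print(f"{name:21} U+{cp:04X}: {n} byte(s), {8 * n} bits stored per 10-bit symbol")
```

A 10-bit symbol therefore occupies 8, 16, 24 or 32 bits depending on where its code point sits.
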
> So the real need is to find the smallest set of subranges of 64
> consecutive codepoints with minimal values that contain only letters or
> digits and where all positions are assigned with such general properties.
> Emojis are unlikely to be among them! With this goal, you can even avoid
> using any PUAs (which are likely to be filtered/forbidden by some
> protocols), or compatibility characters (likely to be transformed by
> NFKC/NFKD).
>
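
The search described above could be sketched in Python with the standard unicodedata module. The exact criteria are my reading of the description (letter or decimal digit, combining class 0, unchanged by all four normalization forms), and the results depend on the Unicode version bundled with the interpreter, so no output is quoted here.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_candidate(cp):
    """True if the code point is an assigned letter or decimal digit,
    has combining class 0, and is unchanged by every normalization form."""
    ch = chr(cp)
    category = unicodedata.category(ch)
    if not (category.startswith("L") or category == "Nd"):
        return False   # also rejects unassigned, surrogate and PUA code points
    if unicodedata.combining(ch) != 0:
        return False
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)

def find_runs(length=64, limit=0x10000):
    """Yield the start of each non-overlapping run of `length` consecutive
    candidates below `limit`, lowest code points first."""
    count = 0
    for cp in range(limit):
        if is_candidate(cp):
            count += 1
            if count == length:
                yield cp - length + 1
                count = 0
        else:
            count = 0

runs = list(find_runs())
print(len(runs), "runs of 64 found in the BMP")
print("lowest 16:", [f"U+{start:04X}" for start in runs[:16]])
```

Taking the first 16 runs gives one possible base-1024 alphabet. Checking each normalization form directly also rejects Hangul syllables, which decompose under NFD even though they have no decomposition mapping in the character data.
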
> And even within just the BMP, you could reach more than 10-bit encoding
> (base-1024) and can probably find 12-bit encoding (base 4096) or more (CJK
> blocks of the BMP offer wide ranges of suitable characters, as well as some
> extended Latin or extended Cyrillic blocks)
>
> If you want to use supplementary characters that are already encoded, then
> you can certainly use CJK blocks in the large supplementary ideographic
> plane and create a 16-bit encoding (base 65536). Only some legacy Emojis in
> the BMP will be used before that.
>
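
As a rough comparison of the larger bases, the Python sketch below estimates the encoded size of a 32-byte input. It assumes no padding symbols, and that every symbol of a BMP alphabet costs 3 UTF-8 bytes while every supplementary-plane symbol costs 4. The function name encoded_size is made up for this illustration.

```python
import math

def encoded_size(n_bytes, bits_per_symbol, utf8_bytes_per_symbol):
    """Return (symbol count, UTF-8 byte count), ignoring any padding scheme."""
    symbols = math.ceil(8 * n_bytes / bits_per_symbol)
    return symbols, symbols * utf8_bytes_per_symbol

schemes = [
    ("base-1024 in the BMP", 10, 3),
    ("base-4096 in the BMP", 12, 3),
    ("base-65536 in a supplementary plane", 16, 4),
]

for name, bits, width in schemes:
    symbols, size = encoded_size(32, bits, width)
    print(f"{name:36} {symbols} symbols, {size} UTF-8 bytes")
```

Under these assumptions the three schemes need 26, 22 and 16 symbols, or 78, 66 and 64 bytes, so base-4096 in the BMP and base-65536 in a supplementary plane both carry 4 payload bits per stored byte.
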
>
>
> 2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode <unicode at unicode.org>:
>
>> I created a neat little project based on Unicode emojis.  I thought
>> some on this list may find it interesting.  It encodes arbitrary data
>> using a set of 1024 emojis.  The project is called Ecoji and is hosted
>> on GitHub at https://github.com/keith-turner/ecoji
>>
>> Below are some examples of encoding and decoding.
>>
>> $ echo 'Unicode emojis are awesome!!' | ecoji
>> [emoji output not preserved in this archive]
>>
>> $ echo [emoji string not preserved in this archive] | ecoji -d
>> Unicode emojis are awesome!!
>>
>> I would eventually like to create a base4096 version when there are more
>> emojis.
>>
>> Keith
>>
>>
>