base1024 encoding using Unicode emojis

Philippe Verdy via Unicode unicode at unicode.org
Sun Mar 11 10:25:13 CDT 2018


Ideally, the purpose of such a base-1024 encoding is to compact arbitrary
data into plain text that is safely preserved, including by Unicode
normalization and by encoding transforms such as UTF-8.
But then it should also be done in a way that minimizes the UTF-8 string
size (emojis are probably not the best set to use, since most of them lie
in supplementary planes).

You can choose another arbitrary set of 1024 codepoints in the BMP that is
preserved by normalization (no decomposition, combining class=0) and by
text filters (no controls, no whitespace, possibly no punctuation, only
letters or digits) and which is still simple to compute with a basic
algorithm not requiring any large table lookup (only a few tests for some
boundary values, or a very small lookup table with 16 entries, one entry
for each subset of 64 values).
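As a rough sketch of the 16-entry table idea (not a proposal for the
actual set): the block starts below are hypothetical and simply take 16
consecutive 64-codepoint slices of the CJK Unified Ideographs block from
U+4E00, and the function names are mine. The high 4 bits of each 10-bit
value select a table entry, the low 6 bits are an offset within it.

```python
# Hypothetical table: 16 block starts, each covering 64 consecutive code points.
# All of them sit in CJK Unified Ideographs (U+4E00 onward) purely for illustration;
# any 16 suitable, non-overlapping 64-codepoint blocks could be substituted.
BLOCK_STARTS = [0x4E00 + 64 * i for i in range(16)]


def symbol_for(value: int) -> str:
    """Map a 10-bit value to one character: high 4 bits pick the block, low 6 the offset."""
    if not 0 <= value < 1024:
        raise ValueError("value must fit in 10 bits")
    return chr(BLOCK_STARTS[value >> 6] + (value & 0x3F))


def value_for(symbol: str) -> int:
    """Inverse mapping: a few boundary tests against the 16 table entries."""
    cp = ord(symbol)
    for index, start in enumerate(BLOCK_STARTS):
        if start <= cp < start + 64:
            return (index << 6) | (cp - start)
    raise ValueError("character is not in the alphabet")


def encode_group(chunk: bytes) -> str:
    """Encode exactly 5 bytes (40 bits) as 4 symbols of 10 bits each."""
    if len(chunk) != 5:
        raise ValueError("this sketch only handles full 5-byte groups")
    n = int.from_bytes(chunk, "big")
    return "".join(symbol_for((n >> shift) & 0x3FF) for shift in (30, 20, 10, 0))


def decode_group(text: str) -> bytes:
    """Decode 4 symbols back into the original 5 bytes."""
    if len(text) != 4:
        raise ValueError("this sketch only handles full 4-symbol groups")
    n = 0
    for ch in text:
        n = (n << 10) | value_for(ch)
    return n.to_bytes(5, "big")
```

Padding of a final partial group is deliberately left out; a real encoding
would need a rule for it.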

Also, some frequent binary data (notably runs of null bytes) should be
able to use shorter UTF-8 sequences from the ASCII set, so my opinion is
that the first 64 codes should be the same as in standard Base-64. The
others can easily be taken from CJK blocks, or from the PUA block in the
BMP, but you can also select some blocks below the U+0800 codepoint so
that they get encoded as 2 bytes, and not 3 as for the rest of the BMP
(or 4 bytes for most emojis, where 10 bits become 32 bits, a huge waste of
storage space in UTF-8).
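To put numbers on the UTF-8 cost, here is a small sketch that only uses
Python's built-in encoder. The sample codepoints (U+0041, U+0400, U+4E00,
U+1F600) are my own picks for the 1-, 2-, 3- and 4-byte cases, and the
helper names are hypothetical.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for one code point."""
    return len(chr(cp).encode("utf-8"))


def efficiency(cp: int, payload_bits: int = 10) -> float:
    """Payload bits carried per encoded bit when each symbol holds payload_bits."""
    return payload_bits / (8 * utf8_len(cp))


# Sample code points chosen for illustration: ASCII, Cyrillic, CJK, emoji.
for cp in (0x0041, 0x0400, 0x4E00, 0x1F600):
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s), "
          f"{8 * utf8_len(cp)} bits for 10 payload bits, "
          f"efficiency {efficiency(cp):.4f}")
```

The emoji case comes out at 4 bytes, so 10 payload bits occupy 32 encoded
bits, while a codepoint below U+0800 needs only 16.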

So the real need is to find the smallest set of subranges of 64
consecutive codepoints, with minimal values, that contain only letters or
digits and where all positions are assigned with such general properties.
Emojis are unlikely to be part of them! With this goal, you can even avoid
using any PUAs (which are likely to be filtered/forbidden by some
protocols), or compatibility characters (likely to be transformed by
NFKC/NFKD).
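A search along those lines can be sketched with Python's unicodedata
module. The criteria are my reading of the constraints above (general
category L* or Nd, combining class 0, unchanged by all four normalization
forms), the function names are hypothetical, and the result depends on the
Unicode version bundled with the Python build, so no particular output is
claimed.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_suitable(cp: int) -> bool:
    """Assigned letter or decimal digit, combining class 0, stable under normalization."""
    ch = chr(cp)
    category = unicodedata.category(ch)
    if not (category.startswith("L") or category == "Nd"):
        return False  # also rejects unassigned, PUA, surrogates, controls, punctuation
    if unicodedata.combining(ch) != 0:
        return False
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)


def find_blocks(block_size: int = 64, limit: int = 0x10000, max_blocks: int = 16):
    """Return the lowest starts of non-overlapping runs of block_size suitable code points."""
    blocks = []
    run_start = None
    for cp in range(limit):
        if not is_suitable(cp):
            run_start = None  # run broken, start over at the next suitable code point
            continue
        if run_start is None:
            run_start = cp
        if cp - run_start + 1 == block_size:
            blocks.append(run_start)
            run_start = None
            if len(blocks) == max_blocks:
                break
    return blocks


if __name__ == "__main__":
    for start in find_blocks():
        print(f"U+{start:04X}..U+{start + 63:04X}")
```

The 16 starts returned could serve as the small lookup table for a
base-1024 alphabet; raising max_blocks to 64 would be the base-4096 case.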

And even within just the BMP, you could reach more than a 10-bit encoding
(base-1024) and can probably find a 12-bit encoding (base-4096) or more
(CJK blocks of the BMP offer wide ranges of suitable characters, as do
some extended Latin or extended Cyrillic blocks).
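A quick way to check how many bits the BMP could carry under these
constraints is to count the suitable codepoints. This is a sketch under
the same assumed criteria (letters or decimal digits, combining class 0,
stable under NFC/NFD/NFKC/NFKD); it ignores the requirement of 64
consecutive codepoints, so it gives an upper bound, and the exact count
varies with the Unicode version of the Python build.

```python
import unicodedata


def is_suitable(cp: int) -> bool:
    """Same assumed criteria: letter or decimal digit, ccc=0, normalization-stable."""
    ch = chr(cp)
    category = unicodedata.category(ch)
    if not (category.startswith("L") or category == "Nd"):
        return False
    if unicodedata.combining(ch) != 0:
        return False
    return all(unicodedata.normalize(form, ch) == ch
               for form in ("NFC", "NFD", "NFKC", "NFKD"))


def usable_bits(count: int) -> int:
    """Largest n such that 2**n symbols fit in an alphabet of `count` characters."""
    return count.bit_length() - 1


bmp_count = sum(1 for cp in range(0x10000) if is_suitable(cp))
print(f"{bmp_count} suitable BMP code points, "
      f"enough for a {usable_bits(bmp_count)}-bit alphabet at most")
```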

If you want to use supplementary characters that are already encoded, then
you can certainly use CJK blocks in the large supplementary ideographic
plane and create a 16-bit encoding (base 65536). Only some legacy Emojis in
the BMP will be used before that.



2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode <unicode at unicode.org>:

> I created a neat little project based on Unicode emojis.  I thought
> some on this list may find it interesting.  It encodes arbitrary data
> as 1024 emojis.  The project is called Ecoji and is hosted on github
> at https://github.com/keith-turner/ecoji
>
> Below are some examples of encoding and decoding.
>
> $ echo 'Unicode emojis are awesome!!' | ecoji
> [emoji output not preserved in the archive]
>
> $ echo [emoji input not preserved in the archive] | ecoji -d
> Unicode emojis are awesome!!
>
> I would eventually like to create a base4096 version when there are more
> emojis.
>
> Keith
>
>