Time format characters 'h' and 'k'

Philippe Verdy via CLDR-Users cldr-users at unicode.org
Sun Aug 20 06:41:19 CDT 2017


Anyway, I think that UTF-16 (and its surrogates) will eventually be
deprecated. Even for apps that want 16-bit code units, it is probable that
another 16-bit encoding will appear, preserving the same level of
compactness but simplifying the binary order:

It is easy to create such an alternative while leaving the 8-bit and
32-bit encodings (UTF-8 and UTF-32) unchanged.

Notably, you can create a 16-bit encoding (call it "UTF16S", with the "S"
meaning "shifted") that places the surrogates at the end, in
0xF800-0xFFFF, by shifting U+E000..U+FFFF down to 0xD800..0xF7FF. As
U+FFFE and U+FFFF are non-characters, they fall down to 0xF7FE..0xF7FF,
still usable as special markers such as end of text, so that all 16-bit
code units in 0x0000-0xF7FD would be valid (0xF7FD representing U+FFFD,
i.e. the replacement character used as a possible substitute when
transcoding from texts with invalid or mismatched encodings).
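
To make this concrete, here is a minimal sketch in C of what an encoder
for this hypothetical UTF16S could look like, following only the shifts
described above (identity below the surrogates, U+E000..U+FFFF moved down
by 0x0800, surrogate pairs relocated to 0xF800..0xFFFF); the name and
signature are illustrative, not from any specification:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical UTF16S encoder: one code point in, 1 or 2 shifted 16-bit
 * code units out; returns the number of units written, 0 if invalid. */
static size_t utf16s_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0xD800) {                 /* U+0000..U+D7FF: unchanged */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp < 0xE000)                   /* isolated surrogate code points: invalid */
        return 0;
    if (cp < 0x10000) {                /* U+E000..U+FFFF: shift down by 0x0800 */
        out[0] = (uint16_t)(cp - 0x0800);
        return 1;
    }
    if (cp <= 0x10FFFF) {              /* supplementary: shifted surrogate pair */
        uint32_t v = cp - 0x10000;
        out[0] = (uint16_t)(0xF800 | (v >> 10));   /* high unit: 0xF800..0xFBFF */
        out[1] = (uint16_t)(0xFC00 | (v & 0x3FF)); /* low unit:  0xFC00..0xFFFF */
        return 2;
    }
    return 0;                          /* beyond U+10FFFF: invalid */
}

With this layout, comparing UTF16S code units as unsigned 16-bit integers
gives the same order as comparing code points, so it matches the binary
order of UTF-8 and UTF-32.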

With this, instead of using U+0000 as the end-of-string marker, we could
use U+FFFF, encoded as 0xF7FF in this 16-bit encoding. The NULL control
would remain encoded as 0x0000 but would no longer mark an end of string.
Another alternative would be to use 0x0000 in this encoding to represent
U+FFFF, by also shifting the code points that are non-characters; U+0000
would then be represented either by 0xF7FD (but binary order would not be
preserved) or by 0x0001 (preserving the binary order of assigned
characters, but shifting all ASCII characters up by one position in this
16-bit encoding).
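
As a sketch of that first alternative (assuming the UTF16S mapping
above), a length scan that stops on 0xF7FF instead of NUL could look like
this; again the name is only illustrative:

/* Length of a UTF16S string terminated by 0xF7FF (the shifted encoding
 * of U+FFFF); 0x0000 (NUL) may appear freely inside the text. */
static size_t utf16s_strlen(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0xF7FF)   /* terminator: U+FFFF, shifted down by 0x0800 */
        n++;
    return n;                /* length in 16-bit code units, NULs included */
}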

There would still remain the two columns of non-characters
(U+FDD0..U+FDEF, in the Arabic compatibility block), but they could be
shifted as well, just before the surrogates, in another variant that
would place **all** non-characters (including surrogates) at the end of
the 16-bit encoding.
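
A sketch of that last variant, for BMP code points only; all the offsets
below are derived from the layout just described (the 34 BMP
non-characters packed into 0xF7DE..0xF7FF, just below the surrogate units
at 0xF800..0xFFFF), not from any specification:

/* Variant placing *all* BMP non-characters (U+FDD0..U+FDEF and
 * U+FFFE..U+FFFF) just before the shifted surrogates, so that valid
 * scalar values fill the contiguous range 0x0000..0xF7DD in code point
 * order. */
static uint16_t utf16s_all_nonchars_unit(uint32_t cp)  /* BMP only */
{
    if (cp < 0xD800) return (uint16_t)cp;            /* unchanged */
    if (cp < 0xE000) return (uint16_t)(cp + 0x2000); /* surrogates -> 0xF800..0xFFFF */
    if (cp < 0xFDD0) return (uint16_t)(cp - 0x0800); /* -> 0xD800..0xF5CF */
    if (cp < 0xFDF0) return (uint16_t)(cp - 0x05F2); /* non-chars -> 0xF7DE..0xF7FD */
    if (cp < 0xFFFE) return (uint16_t)(cp - 0x0820); /* -> 0xF5D0..0xF7DD */
    return (uint16_t)(cp - 0x0800);                  /* U+FFFE, U+FFFF -> 0xF7FE, 0xF7FF */
}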



2017-08-20 13:16 GMT+02:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> 2017-08-20 11:42 GMT+02:00 Mark Davis ☕️ via CLDR-Users <
> cldr-users at unicode.org>:
>
>> As I recall, one of those historical anomalies (like the surrogate range
>> not being at the top of the BMP).
>>
>
> I don't think this is an anomaly: placing the surrogates at the top would
> have prevented the emergence of UTF-8 with its compatibility with 7-bit
> US-ASCII. Placing them at the top would have broken many more things, and
> UTF-8 would not have become the most useful encoding and the default now
> supported by all web standards.
>
> Ideally the surrogates should have been at the end of the BMP (possibly
> leaving only a few non-characters after them, or tweaking the surrogates
> and allocations in places so that they would not have used U+FFFE and
> U+FFFF, kept as reserved surrogates not used in any pair for valid code
> points). They would then have sorted in binary mode and preserved the
> binary order between UTF-8, UTF-16 and UTF-32...
>
>