Time format characters 'h' and 'k'

Philippe Verdy via CLDR-Users cldr-users at unicode.org
Mon Aug 21 17:33:57 CDT 2017


2017-08-21 23:32 GMT+02:00 Asmus Freytag via CLDR-Users <
cldr-users at unicode.org>:

> On 8/21/2017 6:12 AM, Peter Constable via CLDR-Users wrote:
>
> This is waaaayyyyy off topic.
>
>
> Well, it gives a nice window into Philippe's thinking. He must be viewing
> encodings as if they were printed on giant wall charts, instead of thinking
> of them numerically, where the "upper" end of a range (top) would be the
> higher value.
>

I'm not alone in thinking like this, given that this is the default
presentation in the standard itself and in all the officially released code
charts. Unicode is about encoding text, and it naturally adopts a
text-related point of view where reading order matters; even with Bidi, all
the encoded scripts use a top-to-bottom layout by default (except historic
scripts using boustrophedon, and special uses such as rotating Latin text to
render it along the vertical border of a rectangular area or a very narrow
strip, where it could run bottom to top).
The terms "top" and "bottom" are used in many places in the standard for
presentation-related descriptions; they make no sense when speaking about
numerical values, where the correct terms are "first"/"last",
"lower"/"upper" or "start"/"end", which explicitly reference a precise
order (unlike "top" and "bottom", which are just 2D directions).

Of course there remains the English expression "top ten" (or with any other
number), but the number is required to give it the meaning "first" (in a
frequently open-ended sequence).

But the discussion was not off topic: Mark Davis did not really justify why
he would have preferred the surrogates at the end of the BMP. Various
decisions had to be made rapidly to accelerate the merging of efforts
between Unicode and ISO, but not everything was done coherently, even
though the new, mutually compatible versions of both standards that were
created were each incompatible with their initial versions. With the price
of an interoperability break already paid, some errors could have been
avoided.

If there are reasons why surrogates should have been at the end of the
BMP, they would have been:
* to have a 16-bit binary ordering compatible with the binary ordering of
UTF-8 and UTF-32 (see the sketch below)
* to facilitate some algorithms or optimize some storage
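
To make the first point concrete, here is a minimal sketch (my own
illustration, not taken from any particular library) showing that raw
UTF-16 code-unit comparison disagrees with code point order, while
comparing code points (as UTF-8 and UTF-32 binary order effectively does)
does not; the two characters are just an arbitrary example:

    /* U+FB01 sorts before U+10400 as a code point (and in UTF-8/UTF-32),
     * but its single UTF-16 unit 0xFB01 sorts *after* the lead
     * surrogate 0xD801 of U+10400. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t cp_a = 0xFB01;                  /* U+FB01, BMP */
        uint32_t cp_b = 0x10400;                 /* U+10400, supplementary */
        uint16_t utf16_a[] = { 0xFB01 };         /* one code unit */
        uint16_t utf16_b[] = { 0xD801, 0xDC00 }; /* surrogate pair */

        printf("code point order : U+FB01 %s U+10400\n",
               cp_a < cp_b ? "<" : ">");
        printf("UTF-16 unit order: 0x%04X %s 0x%04X\n",
               utf16_a[0], utf16_a[0] < utf16_b[0] ? "<" : ">", utf16_b[0]);
        return 0;
    }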

These reasons are valid, but nothing prohibits creating a 16-bit encoding
that achieves this (at exactly the same price as UTF-16): it only requires
remapping the 16-bit encoding space with a bijection that swaps some value
ranges.
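
Here is a minimal sketch of one such bijection (the helper name is mine;
the same fix-up is what code-point-order comparison routines apply, for
instance in ICU): swap the surrogate block D800..DFFF with the range
E000..FFFF so that surrogates sort at the very top of the 16-bit space,
after which plain code-unit comparison matches UTF-8/UTF-32 order:

    #include <stdint.h>
    #include <stdio.h>

    /* Bijective remapping of one 16-bit unit: surrogates move up to
     * F800..FFFF, the range E000..FFFF moves down to D800..F7FF,
     * everything below D800 is unchanged. */
    static uint16_t remap_unit(uint16_t u) {
        if (u >= 0xE000) return (uint16_t)(u - 0x0800);
        if (u >= 0xD800) return (uint16_t)(u + 0x2000);
        return u;
    }

    int main(void) {
        /* same pair as above: U+FB01 vs the lead surrogate of U+10400 */
        uint16_t a = 0xFB01, b = 0xD801;
        printf("raw units     : 0x%04X %s 0x%04X\n", a, a < b ? "<" : ">", b);
        printf("remapped units: 0x%04X %s 0x%04X\n",
               remap_unit(a), remap_unit(a) < remap_unit(b) ? "<" : ">",
               remap_unit(b));
        return 0;
    }

The cost is one extra branch per code unit when comparing, which is exactly
the kind of "same price as UTF-16" trade-off I mean.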

Here we are speaking about implementations of algorithms, not about
interchange, where UTF-16 is defined to be used but has fallen out of use
(except for the Windows-specific "Unicode" APIs, which are however not
really UTF-16 compliant, and in the NTFS filesystem or FAT32/exFAT with
their "Unicode" extension, which are also not completely Unicode
compliant).
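
(For context, what "UTF-16 compliant" mainly rules out is unpaired
surrogates; a minimal well-formedness check, written here purely as an
illustration, would look like this:)

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustration only: true if every lead surrogate (D800..DBFF) is
     * immediately followed by a trail surrogate (DC00..DFFF) and no
     * trail surrogate stands alone. The Windows "Unicode" APIs and
     * NTFS/FAT32 names accept arbitrary 16-bit unit sequences and do
     * not enforce this. */
    static bool is_well_formed_utf16(const uint16_t *s, size_t len) {
        for (size_t i = 0; i < len; i++) {
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {        /* lead */
                if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                    return false;
                i++;                                       /* skip trail */
            } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) { /* lone trail */
                return false;
            }
        }
        return true;
    }

    int main(void) {
        uint16_t bad[]  = { 0xD800 };          /* unpaired lead surrogate */
        uint16_t good[] = { 0xD801, 0xDC00 };  /* valid pair: U+10400 */
        return (!is_well_formed_utf16(bad, 1) &&
                is_well_formed_utf16(good, 2)) ? 0 : 1;
    }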

What I mean is that the remaining allocation of surrogates in the BMP is
just wasted space, kept only because of UTF-16 and not even needed for data
interchange, while implementations can still work internally with their own
16-bit encoding if needed, without even having to comply strictly with what
is defined in UTF-16. So what is done in NTFS or FAT32/exFAT does not
matter; it just follows what Microsoft defined for these filesystems.

But other 16-bit encodings are entirely possible using other mappings
(notably for Chinese, but also for Latin).

So you continue to think this is off topic for the Unicode mailing list,
when the subject was introduced (but not justified at all) by Mark Davis,
who did not explain his opinion about the placement of surrogates in the
BMP and used a confusing term to state it. I am convinced this was fully ON
topic for Unicode (including its mailing list, where Mark Davis made his
remark), its history, and its usage and implementations.

I'd say that surrogates were a bad solution to what was not really a
problem and should have remained in private implementations. Even UTF-16
should never have been standardized; it was not needed at all (and we would
also have avoided the nightmare of ZWNBSP used as a BOM in text files, and
its later disunification using another code point, another historical
error)!

If you asked people, they'd say: remove UTF-16 from the standard, and
remove surrogates completely from the BMP. Or at least deprecate them
completely (leaving their use only for internal legacy implementations, not
for any interchange, or letting implementations define their own 16-bit
encoding if they want and document it if needed).