0027, 02BC, 2019, or a new character?

Philippe Verdy via Unicode unicode at unicode.org
Thu Jan 25 09:40:44 CST 2018

This example shows that ignoring umlauts makes the document
counterintuitive. Nobody can tell whether "Paper" refers to a person
here or to a sheet of paper or an article...
At least he should have written "Paeper", which would be more correct (if
"Christoph Päper" is German, the umlaut is equivalent to a following "e"),
or even "Christoph Paper".

Apply that to the Kazakh language and attempt to drop the apostrophes
(because they very commonly cause various technical issues in software):
I'm sure you'll see problems of interpretation, or too many ambiguous
spellings, that the use of the acute accent would have avoided.

All software today is "8-bit clean" and supports at least ISO 8859-1 or
Windows-1252, if it doesn't support multibyte UTF-8. The time of 7-bit
ASCII ended long ago, except in very old systems that were not used at
all for Kazakh in Cyrillic anyway. So acute accents are more likely than
ASCII apostrophes to survive technical software constraints, notably if
the Latin letters with accents come from the ISO 8859-1 subset, whose
code points also fit in 8 bits in Unicode. Even in UTF-8, Latin letters
with accents from nearly any ISO 8859-* subset are 2 bytes wide, exactly
the same encoded size as a basic letter plus an ASCII quote, so the
encoding size is definitely not an issue anywhere (all existing Kazakh
Cyrillic letters already use a 2-byte encoding in UTF-8, as all their
assigned code point values are higher than 0x7F but lower than 0x800).

Choosing the ASCII quote for this "apostrophe" will not save anything, but
the regular Unicode apostrophe U+2019 would need... 3 bytes after the
1-byte basic Latin letter from ASCII (so it is worse!).
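
To make these byte counts concrete, here is a minimal Python sketch. The
sample base letter "a", the Kazakh letter chosen and the dictionary labels
are only illustrative assumptions, not part of the proposed orthography:

```python
# Compare the UTF-8 encoded size of candidate spellings of one letter.
# The base letter "a" and the labels are illustrative assumptions.
candidates = {
    "a + U+0027 ASCII quote": "a\u0027",
    "a + U+02BC modifier apostrophe": "a\u02bc",
    "a + U+2019 right single quote": "a\u2019",
    "U+00E1 precomposed a with acute": "\u00e1",
    "U+04D9 Cyrillic schwa (Kazakh)": "\u04d9",
}


def utf8_len(text: str) -> int:
    """Return the number of bytes needed to encode text in UTF-8."""
    return len(text.encode("utf-8"))


sizes = {label: utf8_len(text) for label, text in candidates.items()}
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

With these samples, the precomposed letter and the letter plus ASCII quote
both take 2 bytes, the same as the current Cyrillic letter, while the
spelling with U+2019 takes 4 bytes in total.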

Choosing the acute accent above Latin letters from ISO 8859-* would avoid
this issue, because they are precomposed, and the usually preferred
representation in UTF-8 is the NFC form, which uses a single code point.
JavaScript, Java, or C/C++ "wide string" types will also handle these
characters with a single code unit (so the measured string "length"
matches the number of letters). You will also avoid the SQL injection
problems that arise on web sites when you have to allow ASCII quotes
unfiltered in data input forms to represent the proposed Kazakh
orthography: with the acute, you can continue to reject all ASCII quotes
in input forms, and people won't be forced to use the alternative U+2019,
which is not found on their basic keyboards; nor will they substitute a
hyphen or space for it, or drop it completely. They'll just type letters
with acute accents with a single keystroke on their Latinized keyboard.
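
The point about NFC and string "length" can be checked with a short Python
sketch. The helper name utf16_units and the sample letter are assumptions
made for illustration; counting UTF-16 code units approximates what
JavaScript and Java report as a string's length:

```python
import unicodedata

# "a" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "a\u0301"
# NFC composes the pair into the single precomposed code point U+00E1.
composed = unicodedata.normalize("NFC", decomposed)


def utf16_units(text: str) -> int:
    """Count UTF-16 code units (hypothetical helper for illustration)."""
    return len(text.encode("utf-16-le")) // 2


print(composed == "\u00e1")        # True
print(utf16_units(composed))       # 1: one letter, one code unit
print(utf16_units("a\u0027"))      # 2: letter plus ASCII quote
print(utf16_units("a\u2019"))      # 2: letter plus U+2019
```

So with precomposed letters the reported length matches the number of
letters, whereas any letter-plus-apostrophe spelling counts as two units
per letter.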

2018-01-25 13:15 GMT+01:00 Andrew West via Unicode <unicode at unicode.org>:

> On 23 January 2018 at 00:55, James Kass via Unicode <unicode at unicode.org>
> wrote:
> >
> > Regular American users simply don't type umlauts, period.
> Not even the president of the Unicode Consortium when referring to
> Christoph Päper:
> http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf
> Andrew