0027, 02BC, 2019, or a new character?

Philippe Verdy via Unicode unicode at unicode.org
Thu Jan 25 09:48:42 CST 2018

Just a remark for fun:
- You'll also note that this whole discussion is about the apostrophe, and if
Kazakhstan wants to introduce it in 2019, that year will exactly match the
code point U+2019 [ ’ ]...
- This year, 2018, is also the year to discuss and reverse the apostrophe
decision, and it matches the code point U+2018 [ ‘ ] for the reversed

Happy New Year to ‘Kazakhstan’! But now we have a new way to memorize the
code point values for these apostrophes!

2018-01-25 16:40 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> Such an example shows that ignoring umlauts makes the document
> counterintuitive. Nobody can tell whether "Paper" refers to a person
> here or whether he actually meant a sheet of paper or an article...
> At least he should have written "Paeper", which would be more correct (if
> "Christoph Päper" is German, the umlaut is equivalent to a following "e"),
> or even "Christoph Paper".
> Apply that to the Kazakh language and attempt to drop the apostrophes
> (because they very commonly cause various technical issues in software):
> I'm sure you'll see problems of interpretation, or too many synonyms, that
> the use of the acute accent instead would have avoided.
> All software today is "8-bit clean" and supports at least ISO 8859-1 or
> Windows-1252, if it doesn't support multibyte UTF-8; the time of 7-bit
> ASCII ended long ago, except on very old systems, which were anyway
> not used at all for Kazakh in Cyrillic; so acute accents are more likely than
> ASCII apostrophes to survive technical software constraints, notably
> if the accented Latin letters come from the ISO 8859-1 subset, whose code
> points also fit in 8 bits in Unicode. Even with UTF-8, these accented Latin
> letters (from any ISO 8859-* subset) will be 2 bytes wide, so exactly the
> same encoding size as a basic letter + ASCII quote, and the encoding size is
> definitely not an issue anywhere (all existing Kazakh Cyrillic letters
> already use a 2-byte encoding in UTF-8, as all their assigned code point
> values are higher than 0x7F but lower than 0x800).
> Choosing the ASCII quote for this "apostrophe" will not save anything;
> but the regular Unicode apostrophe U+2019 would need... 3 bytes after the
> 1-byte basic Latin letter from ASCII (so it is worse!).
> Choosing the acute accent above Latin letters from ISO 8859-* would avoid
> this issue, because they are precomposed, and in UTF-8 the usual preferred
> representation is NFC, using a single code point. JavaScript, Java,
> or C/C++ "wide string" types will also handle these characters with a
> single code unit (so the measured string "length" matches the number of
> letters). You will also avoid the SQL injection problems that arise on web
> sites if you have to allow unfiltered ASCII quotes in data input forms to
> represent the proposed Kazakh orthography: with the acute, you can
> continue to reject all ASCII quotes in input forms, and people
> won't be forced to use the alternative U+2019, not found on their basic
> keyboards, nor will they substitute a hyphen or space for it or drop
> it completely; they'll just type letters with acute accents with a single
> keystroke on their Latinized keyboard.
> 2018-01-25 13:15 GMT+01:00 Andrew West via Unicode <unicode at unicode.org>:
>> On 23 January 2018 at 00:55, James Kass via Unicode <unicode at unicode.org>
>> wrote:
>> >
>> > Regular American users simply don't type umlauts, period.
>> Not even the president of the Unicode Consortium when referring to
>> Christoph Päper:
>> http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf
>> Andrew
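
For anyone who wants to check the encoding-size comparison in the quoted
message, here is a minimal sketch in Python. The helper name `sizes` and the
sample base letter "a" are only illustrative assumptions, not part of any
proposal; the sketch normalizes each candidate to NFC and counts code points,
UTF-8 bytes and UTF-16 code units (the "length" reported by JavaScript or
Java strings).

```python
import unicodedata


def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 code units) after NFC."""
    nfc = unicodedata.normalize("NFC", text)
    utf8_bytes = len(nfc.encode("utf-8"))
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    utf16_units = len(nfc.encode("utf-16-le")) // 2
    return len(nfc), utf8_bytes, utf16_units


# Hypothetical sample spellings of one "modified" letter, based on "a".
candidates = {
    "a + U+0027 (ASCII quote)": "a\u0027",
    "a + U+02BC (modifier apostrophe)": "a\u02bc",
    "a + U+2019 (right single quote)": "a\u2019",
    "a + U+0301 (composes to U+00E1 in NFC)": "a\u0301",
    "U+04D9 (existing Kazakh Cyrillic letter)": "\u04d9",
}

for label, text in candidates.items():
    code_points, utf8_bytes, utf16_units = sizes(text)
    print(f"{label}: {code_points} code point(s), "
          f"{utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
```

Under these assumptions the accented letter comes out as one code point and
two UTF-8 bytes, the same byte count as letter + ASCII quote, while letter +
U+2019 takes four bytes in total.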