Sentence_Break, Semi-colons, and Apparent Miscategorization

Mark Davis ☕️ via Unicode unicode at unicode.org
Thu Mar 8 09:04:44 CST 2018


From the first line, I guess you mean that all three questions have to do
with the Sentence_Break property values. Namely:

http://www.unicode.org/reports/tr29/proposed.html#Table_Sentence_Break_Property_Values
http://www.unicode.org/reports/tr29/proposed.html#SContinue

Mark

On Thu, Mar 8, 2018 at 9:25 AM, fantasai via Unicode <unicode at unicode.org>
wrote:

> Given that the comma and colon are categorized as SContinue,
> why is the semicolon also not SContinue?


> Also, why is the Greek Question Mark not categorized with
> the rest of the question marks?
>

As I recall, both are because the semicolon can also represent a Greek
question mark (they are canonically equivalent, so you can't reliably
distinguish between them).
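
For anyone who wants to check the equivalence, here is a minimal sketch using Python's standard unicodedata module. The tool choice and the helper name canonically_equivalent are mine for illustration, not anything from UAX #29. It compares NFD forms, which is the usual test for canonical equivalence.

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037e"  # U+037E GREEK QUESTION MARK
SEMICOLON = "\u003b"            # U+003B SEMICOLON

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# U+037E has a singleton canonical decomposition to U+003B, so any
# normalization form replaces it with a plain semicolon.
print(canonically_equivalent(GREEK_QUESTION_MARK, SEMICOLON))
print(unicodedata.normalize("NFC", GREEK_QUESTION_MARK) == SEMICOLON)
```

Since normalized text no longer contains U+037E at all, giving the two characters different Sentence_Break values would make segmentation depend on whether the text had been normalized.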

BTW, here is a table of property differences for code point X, toNfc(X) (if
a single character) and toNfkc(X) (again, if a single character).

https://docs.google.com/spreadsheets/d/1ZExxhAujA8kX42F8KBK3okX_So7Dt5YZvyanL8dH8tM/edit#gid=0

It was a quick dump so no guarantees that all the dots are crossed. It
skips comparing properties that are purposefully different across NFC (like
Decomposition_Mapping) or different code points (like Name or Block), and
most CJK properties (ones starting with 'k').
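
A rough idea of how such a comparison can be produced, as a sketch only: this is not the script behind the spreadsheet, and it assumes Python's unicodedata module, which exposes just a few properties (and not Sentence_Break). The function name property_differences and the PROPERTIES table are hypothetical.

```python
import sys
import unicodedata

# Assumption: only the few properties unicodedata exposes are compared here;
# the real table covers far more.
PROPERTIES = {
    "General_Category": unicodedata.category,
    "Bidi_Class": unicodedata.bidirectional,
    "East_Asian_Width": unicodedata.east_asian_width,
}

def property_differences(form: str = "NFC"):
    """Yield (X, normalized X, property name, old value, new value) rows."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        norm = unicodedata.normalize(form, ch)
        # Skip characters that are unchanged or that expand to a sequence.
        if norm == ch or len(norm) != 1:
            continue
        for name, getter in PROPERTIES.items():
            before, after = getter(ch), getter(norm)
            if before != after:
                yield cp, ord(norm), name, before, after

for cp, ncp, name, before, after in property_differences("NFC"):
    print(f"U+{cp:04X} -> U+{ncp:04X}  {name}: {before} -> {after}")
```

Passing "NFKC" instead of "NFC" gives the toNfkc(X) comparison. The results depend on the Unicode version bundled with the Python build.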


> Why aren't the vertical presentation forms categorized with
> the things they are presenting?
>

At least some of them are:
U+FE10 ( ︐ ) PRESENTATION FORM FOR VERTICAL COMMA
U+FE11 ( ︑ ) PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
U+FE31 ( ︱ ) PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ( ︲ ) PRESENTATION FORM FOR VERTICAL EN DASH
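
To see which character each of these forms presents, one can look at its compatibility mapping. A small sketch, again assuming Python's unicodedata module (the helper name presented_character is made up for illustration). It shows only the NFKC mapping, not the Sentence_Break value, which that module does not expose.

```python
import unicodedata

# The vertical presentation forms listed above.
VERTICAL_FORMS = ["\ufe10", "\ufe11", "\ufe13", "\ufe31", "\ufe32"]

def presented_character(ch: str) -> str:
    """Hypothetical helper: the character a presentation form maps to under NFKC."""
    return unicodedata.normalize("NFKC", ch)

for ch in VERTICAL_FORMS:
    base = presented_character(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(base):04X} {unicodedata.name(base)}")
```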
​

>
> Thanks~
> ~fantasai
>