Sentence_Break, Semi-colons, and Apparent Miscategorization
fantasai via Unicode
unicode at unicode.org
Thu Mar 29 17:23:06 CDT 2018
On 03/08/2018 07:04 AM, Mark Davis ☕️ wrote:
> From the first line, I guess you mean that all three questions have to do with the Sentence_Break property values. Namely:
> On Thu, Mar 8, 2018 at 9:25 AM, fantasai via Unicode <unicode at unicode.org <mailto:unicode at unicode.org>> wrote:
> > Given that the comma and colon are categorized as SContinue,
> > why is the semicolon not also SContinue?
> > Also, why is the Greek Question Mark not categorized with
> > the rest of the question marks?
> As I recall, both are because the semicolon can also represent a Greek question mark
> (they are canonically equivalent, so you can't reliably distinguish between them).
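For reference, the canonical equivalence can be checked with a minimal Python sketch like the one below. It assumes only the standard-library unicodedata module; the names GREEK_QUESTION_MARK, SEMICOLON, and normalizes_to_semicolon are made up for illustration.

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037e"  # U+037E GREEK QUESTION MARK
SEMICOLON = "\u003b"            # U+003B SEMICOLON


def normalizes_to_semicolon(form: str) -> bool:
    """Return True if U+037E becomes U+003B under the given normalization form."""
    return unicodedata.normalize(form, GREEK_QUESTION_MARK) == SEMICOLON


# U+037E has a singleton canonical decomposition, so every normalization
# form replaces it with the plain semicolon.
results = {form: normalizes_to_semicolon(form)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, same in results.items():
    print(form, same)
```

So once text has been normalized, a segmenter can no longer tell which of the two characters was originally typed.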
I'm guessing this is why all other semicolons (which don't have
this problem) are also categorized as Other instead of SContinue?
Given SContinue is a set of punctuation that's “softer” than
STerm, it seems to me it would make more sense to categorize
them all (including the Greek question mark) as SContinue,
and then allow implementations to tailor the Greek question
mark and semicolon to STerm as needed. Leaving them all under
Other means that all semicolons would have to be individually
tailored out of Other, which seems much more error-prone.
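To make the suggestion concrete, here is a rough Python sketch of what I have in mind. The table only mirrors the handful of values discussed in this thread, with the semicolon entries set to the proposed SContinue rather than the current Other. It is not read from the UCD data files, and PROPOSED_DEFAULTS, GREEK_TAILORING, and sentence_break are hypothetical names.

```python
# Hypothetical defaults: a tiny excerpt, not the real Sentence_Break data.
PROPOSED_DEFAULTS = {
    ",": "SContinue",
    ":": "SContinue",
    ";": "SContinue",       # proposed here; described as Other in this thread
    "\u037e": "SContinue",  # Greek question mark, proposed here
    "!": "STerm",
    "?": "STerm",
}

# A Greek-aware implementation would override only these two entries.
GREEK_TAILORING = {
    ";": "STerm",
    "\u037e": "STerm",
}


def sentence_break(ch, tailoring=None):
    """Look up the class of ch, letting a tailoring override the defaults."""
    if tailoring and ch in tailoring:
        return tailoring[ch]
    return PROPOSED_DEFAULTS.get(ch, "Other")


print(sentence_break(";"))                   # default behaviour
print(sentence_break(";", GREEK_TAILORING))  # tailored for Greek text
```

With defaults like these, an untailored implementation treats every semicolon as soft punctuation, and only Greek-aware implementations need to touch the two overridden entries.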
> > Why aren't the vertical presentation forms categorized with
> > the things they are presenting?
> At least some of them are:
> U+FE10 ( ︐ ) PRESENTATION FORM FOR VERTICAL COMMA
> U+FE11 ( ︑ ) PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
> U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
> U+FE31 ( ︱ ) PRESENTATION FORM FOR VERTICAL EM DASH
> U+FE32 ( ︲ ) PRESENTATION FORM FOR VERTICAL EN DASH
Yes, but others aren't:
︒ U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
︕ U+FE15 PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK
︖ U+FE16 PRESENTATION FORM FOR VERTICAL QUESTION MARK
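As a sanity check on "the things they are presenting", the sketch below uses Python's standard unicodedata module to map each of these three forms to its base character through compatibility normalization. VERTICAL_FORMS and base_character are names I made up, and this only shows the decomposition mappings, not any Sentence_Break values.

```python
import unicodedata

# The three vertical presentation forms listed above.
VERTICAL_FORMS = ["\ufe12", "\ufe15", "\ufe16"]


def base_character(ch: str) -> str:
    """Return the character a presentation form maps to under NFKC."""
    return unicodedata.normalize("NFKC", ch)


for ch in VERTICAL_FORMS:
    base = base_character(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}"
          f" -> U+{ord(base):04X} {unicodedata.name(base)}")
```

Each of them maps to a full stop, exclamation mark, or question mark, which is why I would expect them to share the Sentence_Break value of their base characters.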
I'm also wondering about Armenian, Coptic, and Ethiopic:
* Armenian exclamation mark and question mark are Other,
whereas Latin (ASCII) places them as STerm.
* All of the Coptic punctuation is categorized as Other,
not even the full stop, which I'd expect under STerm, is excepted.
* Ethiopic comma and colon are not grouped with commas and
colons in general under SContinue.
Were these intentionally or accidentally placed under Other?