Ancient Greek apostrophe marking elision

Mon Jan 28 01:31:40 CST 2019

Note that this is no different from the reasonably common cases in English
such as «the boys’ books».
(You can try various combinations at
http://unicode.org/cldr/utility/list-unicodeset.jsp)
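
As a rough illustration of why U+2019 and U+02BC select differently, here
is a minimal Python sketch. It is not the Unicode Text Segmentation
algorithm: it only mimics the rule that an apostrophe-like character stays
inside a word when letters sit on both sides, and it uses General_Category
from the standard unicodedata module as a stand-in for the Word_Break
property (an assumption made for brevity). The names MID_LETTER, is_letter,
in_word and word_at are hypothetical, chosen just for this example.

```python
import unicodedata

# Apostrophe-like characters that join a word only when flanked by letters
# (a simplified stand-in for the real mid-word handling; assumption).
MID_LETTER = {"\u0027", "\u2019"}

def is_letter(ch):
    # General_Category L* (this includes U+02BC, which is Lm) or a combining mark.
    return unicodedata.category(ch)[0] in ("L", "M")

def in_word(text, i):
    """True if text[i] belongs to a word under the simplified rule."""
    ch = text[i]
    if is_letter(ch):
        return True
    if ch in MID_LETTER:
        return (0 < i < len(text) - 1
                and is_letter(text[i - 1]) and is_letter(text[i + 1]))
    return False

def word_at(text, pos):
    """Return the span a double-click at index pos would select."""
    if not in_word(text, pos):
        return text[pos]
    start = pos
    while start > 0 and in_word(text, start - 1):
        start -= 1
    end = pos + 1
    while end < len(text) and in_word(text, end):
        end += 1
    return text[start:end]

print(word_at("don\u2019t", 0))            # don’t
print(word_at("the boys\u2019 books", 4))  # boys
print(word_at("the boys\u02bc books", 4))  # boysʼ
```

Under this simplified rule a word-final U+2019 is left out of the selection
in English and Ancient Greek alike, while U+02BC is kept because it is a
letter (General_Category Lm, versus Pf for U+2019).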

There are certainly cases that are suboptimal in word selection. As another
example, «re-iterate» seems like it should not break around hyphens, but on
the other hand in «an out-of-the-box experience» it seems like they should.
Expecting people to type in hard-to-find invisible characters just to
correct double-click selection is not realistic. Short of a dictionary
or ML lookup, there is no good way to distinguish certain tricky cases.
(And that probably needs more context, to distinguish «Ted was lyin’ to her
mother.» from «She said ‘Ted was lyin’ to her mother.».)

But the question is how important those are in daily life. I'm not sure why
the double-click selection behavior is so much more of a problem for
Ancient Greek users than it is for the somewhat larger community of English
users. Word selection is not normally as important an operation as line
break, which does work as expected.

Mark

On Sun, Jan 27, 2019 at 8:13 PM James Tauber via Unicode <
unicode at unicode.org> wrote:

> On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
>
>> Except the Unicode-compliant processes aren't required to follow the
>> scheme of TR29 Unicode Text Segmentation.  However, it is only required
>> to select the whole word because the U+2019 is followed by a letter.
>> TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret
>> as two 'words') and U+02BC (interpret as one word).  The GTK-based
>> email client I'm using has that difference, but also fails with
>> "don't" unless one uses U+02BC.
>>
>> However LibreOffice treats "don't" as a single word for U+0027, U+02BC
>> and U+2019, but "dogs'" as a single word only for U+02BC.  This
>> complies with TR29.  I'm not surprised, as LibreOffice does use or has
>> used ICU.
>>
>
> This comes back to my original question that started this thread. Many
> people creating Ancient Greek digital resources use U+02BC seemingly
> because of incorrect word-breaking with *word-final* U+2019 (which is the
> only time it occurs in Ancient Greek and always marking elision, never as
> the end of a quotation).
>
> I am trying to write guidelines as to why they should use U+2019. I'm
> convinced it's technically the right code point to use but am wanting to
> get my facts straight about how to address the word-breaking issue
> (specifically for word-final U+2019 in Ancient Greek, to be clear). In my
> original post, I asked if a language-specific tailoring of the text
> segmentation algorithm was the solution but no one here has agreed so far.
>
> Here's a concrete example from Smyth's Grammar:
>
> γένοιτ’ ἄν
>
> Double-clicking on the first word should select the U+2019 as well.
> Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the
> Terminal or here in Gmail on Chrome.
>
> To be clear: when I say "should" I mean that that is the expectation
> classicists have and the failure to meet it is why some of them insist on
> using U+02BC.
>
> I'm happy if the answer is "use U+2019 and go get your text segmentation
> implementations fixed"[2] but am looking for confirmation of that.
>
> James
>
> [1] To be honest, I was impressed Pages got it right.
> [2] In the same spirit as "if certain combining character combinations
> don't work, the solution is not to add precomposed characters, it's to
> improve the fonts" or "tonos and oxia are the same and if they look
> different, it's the fault of your font".
>