Another take on the English apostrophe in Unicode

Ted Clancy tclancy at mozilla.com
Thu Jun 11 01:08:42 CDT 2015


On Thu, Jun 11, 2015 at 1:17 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> The ASCII punctuation marks have been overridden for a lot of different roles.
> There's simply no way to map them to a category that matches their semantic
> role. [...] "Pd" (dash) is then appropriate for the ASCII hyphen-minus.
>

I agree, but I wasn't talking about the ASCII hyphen, U+002D
(HYPHEN-MINUS). I was talking about U+2010 (HYPHEN).

I also wasn't talking about changing the properties of U+0027 (APOSTROPHE).
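
For anyone who wants to check the property values under discussion, here is a minimal sketch using Python's standard unicodedata module. It only reports whatever version of the Unicode Character Database ships with your interpreter, and the describe() helper and the choice of code points are mine, for illustration. U+005F (LOW LINE) is included purely as a familiar example of a Pc character.

```python
import unicodedata

# Code points picked for illustration: the two hyphens, the two
# apostrophe candidates, and LOW LINE as a known "Pc" character.
CODE_POINTS = [0x002D, 0x2010, 0x0027, 0x2019, 0x005F]

def describe(cp):
    """Return (U+XXXX label, character name, General_Category) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))

for cp in CODE_POINTS:
    print(*describe(cp), sep="  ")
```

Both hyphens should be reported as Pd, which is the categorisation I was questioning for U+2010 only.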


> in dictionaries I've seen small slanted tildes, or slanted small equal
> signs, to make the distinction with true hyphens used in compound words
>

This is drifting off-topic, but I wanted to address the thing you just said
above. Firstly, in the dictionaries I've seen, the slanted double hyphen is
only used when a line break happens to occur at the same place as a "true
hyphen". It replaces the "true hyphen". When a line is broken at a
hyphenation point between letters, an ordinary-looking hyphen is displayed.

Secondly, this character is encoded in Unicode at U+2E17 (DOUBLE OBLIQUE
HYPHEN).
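
If you want to confirm that the character exists in your own environment, a small sketch along these lines should do it. It assumes Python's standard unicodedata module, and by_name() is a hypothetical helper of mine, not part of any library.

```python
import unicodedata

def by_name(name):
    """Look up a character by its Unicode name; return (U+XXXX label, General_Category)."""
    ch = unicodedata.lookup(name)  # raises KeyError if the name is unknown
    return (f"U+{ord(ch):04X}", unicodedata.category(ch))

print(by_name("DOUBLE OBLIQUE HYPHEN"))
print(by_name("HYPHEN"))
```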

- Ted


On Thu, Jun 11, 2015 at 1:17 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> The ASCII punctuation marks have been overridden for a lot of different
> roles. There's simply no way to map them to a category that matches their
> semantic role. So the ASCII hyphen and apostrophe-quote can only be given a
> very weak category that just exhibits their visual role. "Pd" (dash) is then
> appropriate for the ASCII hyphen-minus. You can't really tell from the
> character alone if it is a punctuation mark or a minus sign.
>
> If it is a minus sign you can reencode it better using the more specific
> mathematical minus sign. Otherwise, even if it is not a minus sign, it can
> be:
> - a connector between words in compound words (hyphen)
> - a trailing mark at end of lines for indicating a word has been broken in
> the middle (but remember that I asked previously for another character for
> that role because this word-breaking hyphen is not necessarily a
> horizontal hyphen (in dictionaries I've seen small slanted tildes, or
> slanted small equal signs, to make the distinction with true hyphens used
> in compound words), also because sometimes these breaks are not necessarily
> between two syllables in "pocket books" with very narrow columns and
> minimized spacing)
> - a bullet leading items in a vertical list (this should be an en dash,
> followed by some spacing)
> - a punctuation mark (not necessarily at the beginning of a line) marking
> the change of person speaking (very common in literature, notably in
> theatre).
>
> As a connector between words, there's a demonstrated need for
> differentiating regular hyphens, longer hyphens (preferably surrounded by
> thin spaces) for noting intervals (we can use the EN DASH for that), and
> long hyphens between two separate names that are joined (for example in
> proper names after marriage; in France, INSEE encodes this for now using
> TWO successive hyphens, which are also used on French identity cards,
> passports, social security green cards...).
>
>
> ----
>
> Still nobody replied to my past comment (about 1 month ago) about the
> various forms of the word-breaking hyphen / line-wrapping symbol:
>
> * I'm not speaking about the SHY control, but about the real character
> whose glyph appears when SHY is materialized at end of lines (and which
> should be neither a minus nor an en dash, but also not the same as the
> orthographic hyphen used between words in a compound word).
>
> * This character can also be found (and is needed) for breaking long
> mathematical formulas and must be clearly distinct from the regular minus.
>
> * This character is also needed for rendering long lines of programming
> code or textual data (it is something that must not be entered in programs
> but that must be rendered because these programs or codes have significant
> line breaks: the glyph indicates that the following rendered line break is
> to be discarded). Not all programming languages have a syntax allowing the
> use of an escape before the line break (such escaping varies: it may be a
> backslash in C/C++, or an underscore in Basic, but in data dumps such as
> CSV files, it is impossible to note such an escape in the data language
> itself, and we need to render some specific glyph).
>
> * This character is absolutely needed when rendering on a static medium
> (i.e. printing or broadcasting); for dynamic media (such as personal
> displays with a personal UI) we could still use scrolling, but users don't
> like horizontal scrolling and highly prefer reading the text directly. So
> they expect a distinctive glyph (or icon) that shows the distinction
> between line breaks that are significant and those that just wrap overly
> long lines, and that remains distinct from other regular hyphens and
> minus signs (that are also significant and very frequently distinct).
>
>
> 2015-06-11 0:51 GMT+02:00 Ted Clancy <tclancy at mozilla.com>:
>
>> On 4/Jun/2015 19:01, Leo Broukhis wrote:
>>>
>>> Along the same lines, we might need a MODIFIER LETTER HYPHEN, because,
>>> for example, the word ack-ack isn't decomposable into words, or even
>>> morphemes, "ack" and "ack".
>>>
>> I do think that U+2010 (HYPHEN) is miscategorised. I think it should have
>> General Category = Pc, not Pd. (That is, hyphens are connectors, not
>> dashes.) That would make it a "word" character.
>>
>> Or, at the very least, U+2010 should have Word Break = MidNumLet (meaning
>> it can occur in the middle of numbers or letters). UAX #29 says that U+2010
>> deliberately does *not* have Word Break = MidNumLet, though an
>> implementation may treat it as if it did. (UAX #29 doesn't give any reasons
>> for this decision. I can understand why U+002D (HYPHEN-MINUS) doesn't have
>> Word Break = MidNumLet, due to its history of being used as a dash or minus
>> sign, but U+2010 should never be used as a dash or minus sign, so I don't
>> see the problem.)
>>
>> But luckily, the miscategorisation of U+2010 hasn't led to any pressing
>> practical problems, unlike the misuse of U+2019 for the apostrophe.
>>
>> - Ted
>>
>>
>