UAX 29 questions

Philippe Verdy verdy_p at wanadoo.fr
Fri Jan 30 09:59:16 CST 2015


2015-01-30 9:32 GMT+01:00 Mark Davis ☕️ <mark at macchiato.com>:

> 2. Also, the following 2 rules are not equivalent:
>
> a) Any  × (Format | Extend)
> b) X (Extend | Format)* → X
>

That is what I said in my first message, but with an "as if" that was
not clear enough. My second reply reformulated it by being explicit about
the right-hand side (the substitution that occurs in the following rules,
which you view as a "shortcut").
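To make the difference between (a) and (b) concrete, here is a minimal
sketch. It is not the UAX #29 algorithm: the toy_class function, its class
assignments, and the single "ALetter × ALetter" rule are simplifying
assumptions chosen only to show why the two formulations disagree.

```python
import unicodedata

def toy_class(ch):
    # Hypothetical, heavily simplified classes (NOT the real UAX #29 data).
    if unicodedata.combining(ch) != 0:
        return "Extend"
    if unicodedata.category(ch) == "Cf":
        return "Format"
    if ch.isalpha():
        return "ALetter"
    return "Other"

def breaks_rule_a(text):
    """Rule (a) only: no break before Extend/Format.
    The letter rule still looks at the raw neighbouring characters."""
    out = []
    for i in range(1, len(text)):
        left, right = toy_class(text[i - 1]), toy_class(text[i])
        if right in ("Extend", "Format"):
            continue  # Any × (Format | Extend)
        if left == "ALetter" and right == "ALetter":
            continue  # ALetter × ALetter
        out.append(i)
    return out

def breaks_rule_b(text):
    """Rule (b): X (Extend | Format)* is treated as X by the later rules."""
    out = []
    effective = None  # class of the last character that was not absorbed
    for i, ch in enumerate(text):
        cls = toy_class(ch)
        if i > 0 and cls in ("Extend", "Format"):
            continue  # absorbed into X; 'effective' stays unchanged
        if i > 0 and not (effective == "ALetter" and cls == "ALetter"):
            out.append(i)
        effective = cls
    return out

sample = "a\u0301b"  # a, COMBINING ACUTE ACCENT, b
print(breaks_rule_a(sample), breaks_rule_b(sample))
```

With (a) alone, the later rule sees the pair (Extend, ALetter) and breaks
before "b"; with (b), the combining mark is absorbed and "a" and "b" are
kept together.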

Your first argument about convolution is not well justified for WB56 and
WB57, which remain clear when rewritten with ALetter and HebrewLetter
separated.

But I also note that this case, Hebrew's handling of apostrophes/quotes,
also exists in the Latin script (including in English alone), in the
context of word breaking only (it does not apply to line breaking or to
syllable breaking for hyphenation, which are other types of breakers).

The rule about Format and Extend is still kept separate in WB56 and listed
first only because it correctly preserves the canonical equivalences for
extenders, which include all combining characters with non-zero combining
class. It also embodies the golden rule of not breaking in the middle of
default grapheme clusters (which also covers joiners like CGJ and ZWJ, and
holds for any breaker algorithm, except code point breakers for some
conforming UTFs like UTF-16).

WB57 is evidently subject to tailoring. It just provides a default
behavior in which the single quote/apostrophe is handled as an elision
mark, most often used at the end of a word and glued to the next word
without an intervening space.

WB57 also handles the case where the single quote is followed by spaces or
other punctuation; it is then not an orthographic elision mark but a
punctuation mark ending a quotation.

One problem is that the SingleQuote class used in WB57 is possibly too
large: only a smaller number of single-quote-like characters act as an
elision mark (apostrophe).

The other problem with WB57 is that it assumes that elision marked by
apostrophes occurs only at the end of words (not true even for English),
and this is where per-language tailoring is not only possible but most
probably recommended.

Such tailoring will affect the behavior of WB56 (notably in English,
French, Italian... where the apostrophe is lexicalized and its usage is
regulated by their standard grammar).
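As an illustration of what per-language tailoring could look like, here is
a toy sketch. It is not UAX #29 and not any library's API: split_words and
its mid_letters parameter are hypothetical names, and "letter" is
approximated by Python's str.isalpha().

```python
def split_words(text, mid_letters=frozenset()):
    """Toy splitter returning runs of letters.
    A character in mid_letters is kept inside a word only when it
    sits directly between two letters (hypothetical tailoring hook)."""
    words, current = [], ""
    for i, ch in enumerate(text):
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        joins = ch in mid_letters and prev_is_letter and next_is_letter
        if ch.isalpha() or joins:
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

# Untailored: the apostrophe always separates.
print(split_words("l'homme"))
# Tailored: the apostrophe glues the elided article to the next word.
print(split_words("l'homme", mid_letters={"'"}))
# A leading or trailing apostrophe is still treated as punctuation here.
print(split_words("'tis the dogs' bowl", mid_letters={"'"}))
```

A language-specific tailoring would pass a different mid_letters set, or
replace the "between two letters" test with its own condition.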

----

But I wonder whether tailoring of WB57 is not also needed for Hebrew. I see
WB57 only as an initial default tailoring for the script itself, not for
the actual language (which may also be Yiddish). Tailoring could also cover
usual transcriptions of foreign words, or common but informal
abbreviations/contractions (the apostrophe is highly preferred to the dot
for abbreviating/contracting in the middle of a word, notably when the
abbreviated part is not even pronounced but completely elided).

It also seems that Swedish may use the colon in the middle of a word,
without space separation, instead of an apostrophe.

Other languages may prefer other signs for elisions (including a hyphen,
which does not break words but only syllables, as candidate break
positions in long lines), notably if there is a risk of confusion with
quote-like letters.

Another common notation (found in French typography) uses superscripts for
the final letters when letters are dropped in the middle of a word, but
this is in fact just a written abbreviation (it totally replaces the use of
the abbreviation dot, which is normally never used in the middle of a word
and is completely eliminated in acronyms). This is not really an elision:
the abbreviated word with superscript is still fully read without the
elision, so the apostrophe cannot be used.