UAX 29 questions
verdy_p at wanadoo.fr
Thu Jan 29 21:19:44 CST 2015
2015-01-29 19:52 GMT+01:00 Karl Williamson <public at khwilliamson.com>:
> Rule WB4 is
> "Ignore Format and Extend characters, except when they appear at the
> beginning of a region of text.".
> Not clearly stated, but it appears to me that the ZWJ must be considered
> here to be the beginning of a region of text, as we are looking at the
> boundary between it and the "A". No rule specifically mentions ALetter
> followed by an Extend, so by the default rule, WB14
> "Otherwise, break everywhere (including around ideographs)"
All of this text is aimed at finding candidate positions for breaks. It is
not very clear that "ignore" is definitive, meaning that there cannot be
any break before the Format and Extend characters, except at the
beginning of the text. So all the remaining rules are ignored: there was a
match and you stop there, with no break before:
Any × (Format | Extend)
This is confirmed by the other rules that use the word "otherwise", including
the last one (WB14) that you quote, which is explicitly not applicable.
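
To make the "match and stop" reading of "Any × (Format | Extend)" concrete,
here is a minimal sketch. It is only an approximation: Python's unicodedata
module does not expose the Word_Break property, so I assume that the general
categories Mn, Me, Mc and Cf stand in for Extend and Format. The name
no_break_before is hypothetical and does not come from UAX 29.

```python
import unicodedata

# Assumption: these General_Category values approximate Word_Break=Extend
# (Mn, Me, Mc) and Word_Break=Format (Cf). The real property values differ
# for some characters, so this is illustrative only.
IGNORED_CATEGORIES = {"Mn", "Me", "Mc", "Cf"}


def no_break_before(text, i):
    """Return True when 'Any x (Format | Extend)' forbids a break before text[i].

    False only means that this rule does not decide; later rules would
    then have to be consulted. Position 0 (the beginning of the text)
    is never covered by the rule.
    """
    if i <= 0 or i >= len(text):
        return False
    return unicodedata.category(text[i]) in IGNORED_CATEGORIES
```

With "A" followed by U+200D ZWJ, the rule matches at position 1, so no break
is reported between them and WB14 is never reached for that position.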
But I agree with you that rules WB56 and WB57 should better be rewritten as
ALetter × (MidLetter | MidNumLet | Single_Quote) (ALetter |
Hebrew_Letter × ((MidLetter | MidNumLet) (ALetter | Hebrew_Letter) |
Note also that in French the single quote is followed by a word break
(but NOT by a line break by default, and NOT by a syllable break for
hyphenation), except in a very few cases like "aujourd'hui", which is now
treated as a single word: there is an elision, but also a contraction of
4 words, as if it were written "au jour d' hui", and the term "hui" no
longer occurs in isolation anywhere outside that common word, where all
components are glued. Most elision apostrophes normally occur at the end
of a word (e.g. after the two apostrophes in « l'année n'est pas terminée »).
The rare cases where you should not break after an apostrophe are those
where elision occurs in the middle of a word, in some vulgar expressions
like « c't'après-m' », which contains two informal words, « c't' » and
« après-m' », abbreviating « cet après-midi » in popular language.
In English you have the case where the elision occurs at the beginning of a
word: « it's » is two words, « it » and « 's » (abbreviating « is »); or in
the middle: « aren't » contains two glued words, « are » and « n't »
(abbreviating « not »).
In both cases, you can use the WB rules, but then treat some exceptions for
This way a single matching rule is needed and you no longer need to look
for other rules.
But we are not discussing line breaks here, only word breaks (for the
purpose of performing dictionary lookups and grammar analysis): we
should be able to "unglue" the words with the default rules, then use
an exception list to see whether we must reattach them, as the resulting
pieces are not all words.
So first attempt to look for a word terminated by an apostrophe, and then
perform a language-dependent lookup for known exceptions (« aujourd' »
« hui » cannot match because « hui » is not a separate word), for which we
must try something else:
Look for a word starting with an apostrophe (in English, « it's » would first
be treated by the previous rule as « it' » and « s », but « s » alone is
treated as an exception; then with this rule it will correctly identify « ’s »
independently of the previous word, except if it follows an acronym, as in
« GMO's », because in that case the « 's » is not a separate verb or a
genitive particle but a known plural mark).
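
The two lookups above could be sketched roughly as follows. This is only an
illustration under my own assumptions: split_elision, DEMO_LEXICON and
keep_whole are hypothetical names, and the tiny lexicon stands in for a real
language-specific dictionary. The second pass also moves the cut leftwards
from the apostrophe, which is my way of covering « aren't ».

```python
APOSTROPHES = ("'", "\u2019")  # ASCII apostrophe and U+2019

# Hypothetical toy lexicon; a real one would be language-specific.
DEMO_LEXICON = {"l'", "année", "n'", "est", "it", "'s", "are", "n't"}


def _key(s):
    # Normalize case and the curly apostrophe before a lexicon lookup.
    return s.lower().replace("\u2019", "'")


def split_elision(token, lexicon, keep_whole=frozenset()):
    """Split a token around its first apostrophe using dictionary lookups."""
    if _key(token) in keep_whole:  # known exceptions such as "gmo's"
        return [token]
    pos = next((i for i, ch in enumerate(token) if ch in APOSTROPHES), -1)
    if pos < 0:
        return [token]
    # Pass 1: a word terminated by the apostrophe (French "l'" + "année").
    head, tail = token[:pos + 1], token[pos + 1:]
    if _key(head) in lexicon and _key(tail) in lexicon:
        return [head, tail]
    # Pass 2: a word starting at or before the apostrophe ("it" + "'s",
    # "are" + "n't"); the cut moves leftwards from the apostrophe.
    for cut in range(pos, 0, -1):
        head, tail = token[:cut], token[cut:]
        if _key(head) in lexicon and _key(tail) in lexicon:
            return [head, tail]
    # No split is confirmed by the lexicon: keep the token glued.
    return [token]
```

With this toy lexicon, « aujourd'hui » stays glued because neither pass finds
two known pieces, while « GMO's » stays glued only if it is listed in
keep_whole.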
Word breaks are more complicated to handle than line breaks, as they need
dictionary lookups to confirm them. But the purpose of a word breaking
process is precisely to be used in order to perform dictionary lookups.
With it, you can then safely tailor the line breaking algorithm in order to
implement syllable breaking for hyphenation, which also needs these
dictionary lookups to detect exceptions to the normal syllable breaks (which
can be performed only with language-specific lookups for some pairs, or digrams