I'm trying to understand this Word Break test
ashpilkin at gmail.com
Tue Oct 10 12:01:20 CDT 2023
On Tue, 2023-10-10 at 10:24 -0600, Karl Williamson wrote:
> The two relevant rules, I believe are
>
> Keep horizontal whitespace together.
>
> WB3d WSegSpace × WSegSpace
>
> Ignore Format and Extend characters, except after sot, CR, LF, and
> Newline. (See Section 6.2, Replacing Ignore Rules.) This also has the
> effect of: Any × (Format | Extend | ZWJ)
>
> WB4 X (Extend | Format | ZWJ)* → X
>
> [Rule 4] says to pretend that the
> Extend doesn't exist except after certain classes. The character
> preceding the Extend one is a WSegSpace character, so we get
>
> X Extend → X
> WSegSpace Extend → WSegSpace
Not quite. Section 6.2, referenced in the comment, says that the ignore
rule 4 means two things:
- First, don't break before (Extend | Format | ZWJ) unless a preceding
(higher-priority) rule mandates that;
- Second, in every *subsequent* (lower-priority) rule, replace every
boundary property X by X (Extend | Format | ZWJ)* .
As rule 3d precedes rule 4, we don't get to "pretend" that the
combining diaeresis doesn't exist for the purposes of rule 3d, as you
say; that holds only for rules 5, ..., 999. Thus rule 3d does not apply
anywhere, then rule 4 forbids a break between the first space and the
combining diaeresis, and finally rule 999 puts a break between the
diaeresis and the second space.
(And IIUC this makes some sense: putting a combining accent on a space
is a way to typeset that combining accent by itself that doesn't
require its standalone form to be encoded separately.)
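
To make the ordering concrete, here is a rough Python sketch of the
reasoning above. It is only an illustration, not the reference
algorithm: it models just rules 3d, 4 and 999, it knows only the
properties WSegSpace and Extend (Format, ZWJ, and the sot/CR/LF/Newline
exception to rule 4 are left out), and the names wb_property and
boundaries are made up for this example.

```python
# Toy model of WB3d, WB4 and WB999 only. All names here are hypothetical.
WSEGSPACE = "WSegSpace"
EXTEND = "Extend"
OTHER = "Other"


def wb_property(ch):
    """Toy property lookup (assumption: only SPACE and U+0308 matter here)."""
    if ch == " ":
        return WSEGSPACE
    if ch == "\u0308":  # COMBINING DIAERESIS
        return EXTEND
    return OTHER


def boundaries(text):
    """Return (offset, rule, is_break) for each position between two characters."""
    props = [wb_property(c) for c in text]
    result = []
    for i in range(1, len(props)):
        left, right = props[i - 1], props[i]
        if left == WSEGSPACE and right == WSEGSPACE:
            # WB3d comes before WB4, so it is tested on the raw adjacent
            # characters; an intervening Extend is NOT skipped here.
            result.append((i, "WB3d", False))
        elif right == EXTEND:
            # WB4: Any × (Extend | Format | ZWJ), i.e. no break before Extend.
            result.append((i, "WB4", False))
        else:
            # Rules 5..998 are omitted in this sketch; none of them would
            # match a WSegSpace pair anyway, so fall through to WB999.
            result.append((i, "WB999", True))
    return result


if __name__ == "__main__":
    for offset, rule, is_break in boundaries(" \u0308 "):
        print(offset, rule, "break" if is_break else "no break")
```

For space, U+0308, space this reports no break at offset 1 (rule 4) and
a break at offset 2 (rule 999), while two adjacent spaces are kept
together by rule 3d.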
--
Good luck,
Alex