I'm trying to understand this Word Break test

ashpilkin at gmail.com ashpilkin at gmail.com
Tue Oct 10 12:01:20 CDT 2023


On Tue, 2023-10-10 at 10:24 -0600, Karl Williamson wrote:
> The two relevant rules, I believe are
> 
> Keep horizontal whitespace together.
>
> WB3d 	WSegSpace 	× 	WSegSpace
>
> Ignore Format and Extend characters, except after sot, CR, LF, and 
> Newline. (See Section 6.2, Replacing Ignore Rules.) This also has the 
> effect of: Any × (Format | Extend | ZWJ)
>
> WB4 	X (Extend | Format | ZWJ)* 	→ 	X
> 
> [Rule 4] says to pretend that the 
> Extend doesn't exist except after certain classes.  The character 
> preceding the Extend one is a WSegSpace character, so we get
> 
> X Extend → X
> WSegSpace Extend → WSegSpace

Not quite. Section 6.2, referenced in the comment, says that the ignore
rule 4 means two things:

- First, don't break before (Extend | Format | ZWJ) unless a preceding
(higher-priority) rule mandates that;

- Second, in every *subsequent* (lower-priority) rule, replace every
boundary property X by X (Extend | Format | ZWJ)* .

As rule 3d precedes rule 4, we don't get to "pretend" that the
combining diaeresis doesn't exist for the purposes of rule 3d, as you
say; that holds only for rules 5, ..., 999.  Thus rule 3d does not apply
anywhere, then rule 4 forbids a break between the first space and the
combining diaeresis, and finally rule 999 puts a break between the
diaeresis and the second space.
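
To make the ordering concrete, here is a minimal Python sketch of just
these three rules. It is an illustration under stated assumptions, not
a UAX #29 implementation: the WORD_BREAK table and the boundary_rule
function are hypothetical names, the table covers only the two
characters in this test, and the rules between WB4 and WB999 are left
out because none of them mentions WSegSpace.

```python
# Toy Word_Break table covering only the characters in this test case.
# (Assumption: real property values come from the Unicode data files.)
WORD_BREAK = {
    "\u0020": "WSegSpace",  # SPACE
    "\u0308": "Extend",     # COMBINING DIAERESIS
}

IGNORED = {"Extend", "Format", "ZWJ"}


def boundary_rule(text, i):
    """Return (rule, is_break) for the position between text[i-1] and text[i].

    Only WB3d, WB4 and WB999 are modelled. The sot/CR/LF/Newline exceptions
    and the rules between WB4 and WB999 are deliberately omitted.
    """
    left = WORD_BREAK.get(text[i - 1], "Other")
    right = WORD_BREAK.get(text[i], "Other")

    # WB3d is checked before WB4, so it sees the raw neighbours:
    # an Extend sitting between two spaces is NOT skipped here.
    if left == "WSegSpace" and right == "WSegSpace":
        return ("WB3d", False)

    # WB4, first half: do not break before Extend, Format or ZWJ.
    if right in IGNORED:
        return ("WB4", False)

    # WB999: otherwise, break everywhere.
    return ("WB999", True)


if __name__ == "__main__":
    sample = " \u0308 "  # space, combining diaeresis, space
    for pos in range(1, len(sample)):
        print(pos, boundary_rule(sample, pos))
```

For the sequence space, U+0308, space this reports WB4 (no break) at
position 1 and WB999 (break) at position 2; for two adjacent spaces it
reports WB3d (no break).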

(And IIUC this makes some sense: putting a combining accent on a space
is a way to typeset that combining accent by itself that doesn't
require its standalone form to be encoded separately.)

-- 
Good luck,
Alex


