I'm trying to understand this Word Break test
Karl Williamson
public at khwilliamson.com
Tue Oct 10 11:24:15 CDT 2023
In the 15.1 UCD files in the auxiliary folder, there is the
WordBreakTest.txt file. It contains the following line:
÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]
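For anyone who wants to read that line mechanically, here is a minimal sketch of how it could be parsed. It assumes only the layout visible in the line itself: boundary marks (÷ for a break, × for no break) alternate with hexadecimal code points, and everything after "#" is a comment. The function name parse_break_test_line is made up for illustration.

```python
def parse_break_test_line(line):
    """Split a WordBreakTest.txt-style line into code points and boundary marks.

    marks[i] is the mark before code_points[i]; the last mark is the one
    at the end of the text.
    """
    # Drop the explanatory comment after '#', then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    marks = tokens[0::2]                               # '÷' or '×'
    code_points = [int(tok, 16) for tok in tokens[1::2]]
    return code_points, marks


line = "÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]"
code_points, marks = parse_break_test_line(line)

# The boundary in question is marks[2]: between U+0308 and the second U+0020.
for mark, cp in zip(marks, code_points):
    print(mark, f"U+{cp:04X}")
print(marks[-1])
```

The mark at index 2, between U+0308 and the second U+0020, is the ÷ that the rest of this message asks about.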
I don't understand how UAX #29 leads to a break between the COMBINING
DIAERESIS and the SPACE. The two relevant rules, I believe, are
Keep horizontal whitespace together.
WB3d WSegSpace × WSegSpace
Ignore Format and Extend characters, except after sot, CR, LF, and
Newline. (See Section 6.2, Replacing Ignore Rules.) This also has the
effect of: Any × (Format | Extend | ZWJ)
WB4 X (Extend | Format | ZWJ)* → X
Looking at the boundary I mentioned, we have
"Extend" followed immediately by "WSegSpace"
The higher priority rules don't involve these classes, so don't apply.
Rule 3d doesn't apply, but Rule 4 does. It says to pretend that the
Extend doesn't exist except after certain classes. The character
preceding the Extend one is a WSegSpace character, so we get
X Extend → X
WSegSpace Extend → WSegSpace
That means that we are to pretend that the boundary is between
WSegSpace WSegSpace
Rule 3d does apply to this, and says that no break is to happen. But
the test says instead Rule 999.0 applies and a break should occur.
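To make my reading concrete, here is a rough sketch of it in Python. It models only my interpretation as described in this message, not the UAX #29 algorithm as a whole; the helper names are made up, and the Extend_FE label from the test file is simplified to Extend.

```python
def collapse_ignored(classes):
    """WB4 as I read it: X (Extend | Format | ZWJ)* -> X,
    except after sot, CR, LF, and Newline."""
    ignorable = {"Extend", "Format", "ZWJ"}
    no_ignore_after = {"sot", "CR", "LF", "Newline"}
    out = ["sot"]  # start-of-text sentinel
    for cls in classes:
        if cls in ignorable and out[-1] not in no_ignore_after:
            continue  # pretend the character is not there
        out.append(cls)
    return out[1:]  # drop the sentinel


def wb3d_no_break(left, right):
    """WB3d: WSegSpace x WSegSpace."""
    return left == "WSegSpace" and right == "WSegSpace"


# SPACE, COMBINING DIAERESIS, SPACE
classes = ["WSegSpace", "Extend", "WSegSpace"]
collapsed = collapse_ignored(classes)

# Under this reading, WB3d sees two adjacent WSegSpace characters.
my_reading = "×" if wb3d_no_break(collapsed[0], collapsed[1]) else "÷"
expected_by_test = "÷"  # from the [999.0] entry in the test line

print(collapsed, my_reading, expected_by_test)
```

Under this reading the result is ×, which disagrees with the ÷ in the test file. That disagreement is exactly what I am asking about.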
Please explain.