I'm trying to understand this Word Break test

Karl Williamson public at khwilliamson.com
Tue Oct 10 11:24:15 CDT 2023


In the 15.1 UCD files in the auxiliary folder, there is the 
WordBreakTest.txt file.  It contains the following line:

÷ 0020 × 0308 ÷ 0020 ÷	#  ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING 
DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]

I don't understand how UAX #29 leads to a break between the COMBINING 
DIAERESIS and the SPACE.  The two relevant rules, I believe, are

Keep horizontal whitespace together.
WB3d 	WSegSpace 	× 	WSegSpace
Ignore Format and Extend characters, except after sot, CR, LF, and 
Newline. (See Section 6.2, Replacing Ignore Rules.) This also has the 
effect of: Any × (Format | Extend | ZWJ)
WB4 	X (Extend | Format | ZWJ)* 	→ 	X

Looking at the boundary I mentioned, we have

"Extend" followed immediately by "WSegSpace"

The higher priority rules don't involve these classes, so don't apply.
Rule 3d doesn't apply, but Rule 4 does.  It says to pretend that the 
Extend doesn't exist except after certain classes.  The character 
preceding the Extend is a WSegSpace character, so we get

X Extend → X
WSegSpace Extend → WSegSpace

That means that we are to pretend that the boundary is between

WSegSpace WSegSpace

Rule 3d does apply to this, and says that no break is to happen.  But 
the test says instead Rule 999.0 applies and a break should occur.
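
The reading described above can be written out as a small sketch. It models only my interpretation (collapse with Rule 4 first, then test Rule 3d on what remains), not necessarily what UAX #29 intends, and the names collapse_wb4 and wb3d_no_break are hypothetical.

```python
IGNORABLE = {"Extend", "Format", "ZWJ"}
NOT_AFTER = {"CR", "LF", "Newline"}  # sot is handled by the empty-list check


def collapse_wb4(classes):
    """Apply X (Extend | Format | ZWJ)* -> X as read in this message."""
    out = []
    for cls in classes:
        # An ignorable character is absorbed into the preceding X,
        # except at start of text or after CR, LF, Newline.
        if cls in IGNORABLE and out and out[-1] not in NOT_AFTER:
            continue
        out.append(cls)
    return out


def wb3d_no_break(left, right):
    """WB3d: WSegSpace x WSegSpace."""
    return left == "WSegSpace" and right == "WSegSpace"


classes = ["WSegSpace", "Extend", "WSegSpace"]
collapsed = collapse_wb4(classes)
no_break = wb3d_no_break(collapsed[0], collapsed[1])
print(collapsed, "no break" if no_break else "break")
```

Under this reading the sketch reports no break, while WordBreakTest.txt records a break at that position. That mismatch is exactly what I am asking about.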

Please explain.



