NNBSP and Word Boundaries

Mark Davis ☕️ mark at macchiato.com
Fri Oct 2 02:25:01 CDT 2015


Like Andy, I'm hesitant about changing the gc of NNBSP, because of
backwards compatibility concerns.

I'm also starting to think that scoping the wb change to Mongolian may not
be a bad thing. We might want to explore what it would look like, since it
would preserve the maximum compatibility for current use of NNBSP with
French and other languages. (Although NNBSP is not all that common in
French, I suspect its use there would swamp, in terms of frequency, its use
with Mongolian, simply because there is so much more French text worldwide.)

Context

The proposed WB change for NNBSP is from XX (Other) to EX (ExtendNumLet).

Old relevant props:

WB ; EX                               ; ExtendNumLet
WB ; LE                               ; ALetter
WB ; XX                               ; Other

Old rules with EX:

WB13a (AHLetter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
WB13b ExtendNumLet × (AHLetter | Numeric | Katakana)
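
To make the effect of moving NNBSP from XX to EX concrete, here is a rough Python sketch of just these two rules. It is a toy model, not the UAX #29 algorithm: every other rule is left out, so any pair not matched here is treated as a break, and the function name no_break is made up for the illustration.

```python
# Toy model of WB13a/WB13b only. All other word-break rules are omitted,
# so a pair that matches neither rule falls through to "break".
AHLETTER = {"ALetter", "Hebrew_Letter"}
JOINABLE = AHLETTER | {"Numeric", "Katakana"}


def no_break(left: str, right: str) -> bool:
    """Return True if WB13a or WB13b forbids a break between two WB values."""
    # WB13a: (AHLetter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
    if right == "ExtendNumLet" and left in JOINABLE | {"ExtendNumLet"}:
        return True
    # WB13b: ExtendNumLet × (AHLetter | Numeric | Katakana)
    if left == "ExtendNumLet" and right in JOINABLE:
        return True
    return False


# NNBSP as Other (old value): a break on both sides of it.
assert not no_break("ALetter", "Other")
assert not no_break("Other", "ALetter")
# NNBSP as ExtendNumLet (proposed): no break on either side, for any script.
assert no_break("ALetter", "ExtendNumLet")
assert no_break("ExtendNumLet", "ALetter")
```

The last two assertions are the compatibility concern: under the unscoped change, a letter next to NNBSP stops being a word boundary in French text as well as in Mongolian.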

====

Off the top of my head, perhaps something like:

We add:

WB ; ML                              ; Mongolian_Letter
WB ; NN                              ; NNBSP // maybe different name

We change the contents of LE and XX to move characters to the two new value
sets.
E.g., ML gets
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:scx=/Mong/:]&[:wb=ALetter:]

We change the "macro":

AHLetter (ALetter | Hebrew_Letter | Mongolian_Letter)

*At this point, all behaves the same; that is just a 'refactoring'.*
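
One way to sanity-check the "just a refactoring" claim, at least for the two EX rules quoted above, is to compare every pair of values before and after the split. The sketch below is only that: it models WB13a/WB13b and nothing else, and the names COLLAPSE, NEW_VALUES and refactoring_is_neutral are invented for the illustration.

```python
from itertools import product

OLD_AHLETTER = {"ALetter", "Hebrew_Letter"}
NEW_AHLETTER = OLD_AHLETTER | {"Mongolian_Letter"}

# Each new value mapped back to the value its characters had before the split.
COLLAPSE = {"Mongolian_Letter": "ALetter", "NNBSP": "Other"}

NEW_VALUES = [
    "ALetter", "Hebrew_Letter", "Mongolian_Letter", "Numeric",
    "Katakana", "ExtendNumLet", "NNBSP", "Other",
]


def no_break(left, right, ahletter):
    """WB13a/WB13b only, with the AHLetter macro passed in explicitly."""
    joinable = ahletter | {"Numeric", "Katakana"}
    if right == "ExtendNumLet" and left in joinable | {"ExtendNumLet"}:
        return True
    if left == "ExtendNumLet" and right in joinable:
        return True
    return False


def refactoring_is_neutral():
    """True if every pair gets the same answer before and after the split."""
    for left, right in product(NEW_VALUES, repeat=2):
        old = no_break(COLLAPSE.get(left, left),
                       COLLAPSE.get(right, right), OLD_AHLETTER)
        new = no_break(left, right, NEW_AHLETTER)
        if old != new:
            return False
    return True
```

The check passes only because Mongolian_Letter is added to the AHLetter macro; with the old macro, Mongolian letters would lose the behavior they had as ALetter.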


Now we can modify the behavior for sequences with NN adjacent to ML.

We add:

WB13c Mongolian_Letter × NNBSP
WB13d NNBSP × Mongolian_Letter
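
As a rough illustration of what these two rules would do to actual text, here is a small Python sketch. It is heavily simplified: the classifier wb() is a stand-in (real data would come from the property file, and the Mongolian range used here is an assumption for the demo), and the only rules modeled are WB5 (AHLetter × AHLetter) plus the two new ones.

```python
NNBSP_CHAR = "\u202F"  # NARROW NO-BREAK SPACE

AHLETTER = {"ALetter", "Hebrew_Letter", "Mongolian_Letter"}


def wb(ch):
    """Toy Word_Break classifier; a stand-in for real property data."""
    if ch == NNBSP_CHAR:
        return "NNBSP"
    if 0x1820 <= ord(ch) <= 0x1877:  # assumed range, for illustration only
        return "Mongolian_Letter"
    if ch.isalpha():
        return "ALetter"
    return "Other"


def no_break(left, right):
    # WB5: AHLetter × AHLetter
    if left in AHLETTER and right in AHLETTER:
        return True
    # WB13c: Mongolian_Letter × NNBSP
    if left == "Mongolian_Letter" and right == "NNBSP":
        return True
    # WB13d: NNBSP × Mongolian_Letter
    if left == "NNBSP" and right == "Mongolian_Letter":
        return True
    return False


def words(text):
    """Split text wherever none of the modeled rules forbids a break."""
    if not text:
        return []
    out = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if no_break(wb(prev), wb(cur)):
            out[-1] += cur
        else:
            out.append(cur)
    return out
```

With this model, a Mongolian stem plus NNBSP plus a Mongolian suffix comes out as one segment, while Latin letters around NNBSP still break on both sides, which is the compatibility-preserving behavior the scoping is meant to give.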

*If* we want to also change behavior on the other side of the NNBSP,
whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2
additional rules (with the appropriate values for ..., like Numeric)

WB13e Mongolian_Letter NNBSP × (...)
WB13f (...) × NNBSP Mongolian_Letter
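
These two context-sensitive rules need one extra value of lookbehind or lookahead. The sketch below shows one way that could be checked over a list of WB values. The contents of (...) are left open above, so the set is a parameter here; the default of Numeric is only the one value named as an example, and the function name no_break_at is hypothetical.

```python
def no_break_at(values, i, extra=frozenset({"Numeric"})):
    """True if no break is allowed between values[i-1] and values[i].

    Models only the four Mongolian/NNBSP rules. `extra` stands in for the
    unspecified (...) set; Numeric is just the example value mentioned.
    """
    left, right = values[i - 1], values[i]
    # Mongolian_Letter × NNBSP
    if left == "Mongolian_Letter" and right == "NNBSP":
        return True
    # NNBSP × Mongolian_Letter
    if left == "NNBSP" and right == "Mongolian_Letter":
        return True
    # Mongolian_Letter NNBSP × (...)
    if (left == "NNBSP" and right in extra
            and i >= 2 and values[i - 2] == "Mongolian_Letter"):
        return True
    # (...) × NNBSP Mongolian_Letter
    if (left in extra and right == "NNBSP"
            and i + 1 < len(values) and values[i + 1] == "Mongolian_Letter"):
        return True
    return False
```

For example, no_break_at(["Mongolian_Letter", "NNBSP", "Numeric"], 2) is True, while the same call with "ALetter" in the first position is False, so the far side of the NNBSP changes only when a Mongolian letter is on the near side.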