Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

Ken Whistler kenwhistler at att.net
Fri May 1 09:17:31 CDT 2015


Personally, I don't have a horse in this race, because I am not
responsible for any linebreaking implementation -- so a change for
halfwidth katakana doesn't matter one way or the other to me.

Secondly, there is no formal stability guarantee constraining Line_Break 
values (other than the generic guarantee that the property itself or
existing aliases cannot be *removed* from the standard). Nor is there
any stability guarantee regarding the rest of the algorithm definition 
in UAX #14.
So in principle, the UTC could rewrite it completely. But I doubt that
that would be in anybody's interest at this point. ;-)

But as I see it, the way this should work is that the major stakeholders
who *do* have linebreaking algorithms depending on UAX #14 implemented
and working in released products (and that would include people speaking
for various browsers and for Apple products in general, I think) should
be the ones either pushing for a change, because it would make their
behavior more acceptable for Japanese, or pushing back *against* a
change, because they depend on UAX #14 stability and would prefer
tweaking the behavior in their implementations, instead. So I'd like to
see a formal proposal for a change (specified *exactly* as to the set of
characters affected) brought to the UTC, where implementers and users of
ICU could make the case for or against.

The other thing that I think would need to happen here is that any proposal
should also provide suggested wording for UAX #14 which would explain
why halfwidth katakana specifically need to break with the general
principles that were used 15 years ago to assign LB classes based on
East_Asian_Width
considerations, and instead need to match the LB classes of their
fullwidth katakana counterparts. That should be made explicit in the text
of UAX #14, so somebody else doesn't "discover" another inconsistency
between sets of values and try to change things back later on -- not knowing
the rationale for the values.
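
As an illustration only (not part of any proposal), a candidate set could
be enumerated along the lines below. This is a rough sketch in Python under
some loud assumptions: "halfwidth katakana" is approximated by character
names starting with "HALFWIDTH KATAKANA", the "fullwidth counterpart" is
approximated by the NFKC mapping, and the function name is made up for this
example. Python's unicodedata module does not expose Line_Break, so the
actual LB values would still have to be read from LineBreak.txt.

```python
import unicodedata


def halfwidth_katakana_candidates():
    """List halfwidth katakana with their East_Asian_Width and NFKC counterpart.

    Assumption: membership is decided by character name prefix within the
    Halfwidth and Fullwidth Forms block (U+FF00..U+FFEF). A real proposal
    would have to spell out its own exact set.
    """
    rows = []
    for cp in range(0xFF00, 0xFFF0):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if not name.startswith("HALFWIDTH KATAKANA"):
            continue
        # NFKC maps each of these single characters to one wide character.
        wide = unicodedata.normalize("NFKC", ch)
        rows.append(
            (
                ch,
                unicodedata.east_asian_width(ch),
                wide,
                unicodedata.east_asian_width(wide),
            )
        )
    return rows


if __name__ == "__main__":
    for half, half_eaw, wide, wide_eaw in halfwidth_katakana_candidates():
        print(f"U+{ord(half):04X} ({half_eaw}) -> U+{ord(wide):04X} ({wide_eaw})")
```

A listing like this only shows which characters are in play and how their
East_Asian_Width differs from that of their counterparts; whether each one
should change Line_Break class is exactly what a proposal would need to
argue.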

Because a well-formed proposal for a change like this involves both
a justification for a property value change *and* a corresponding fix
to annex text, I think this is too late in the cycle to be taken as just
beta feedback for the Version 8.0 release, unfortunately. Because of
the potential hit on existing implementations (and test cases), this needs
full review, and should instead be pushed as an early proposal for
the Version 9.0 release cycle.


On 5/1/2015 5:33 AM, Koji Ishii wrote:
> I support Makoto for the change. Nobody should appreciate that behavior, either worked around locally (Firefox, IE) or unnoticed (Chrome). Rather than implementing yet another work around in Chrome, I wish it being fixed finally after 15 years.
> If this issue is like 5 people say break and 5 not to, or considering the long life of the bug, 9 say break and 1 say not to, I understand that Ken’s answer might make more sense. However, I’m quite sure that this is a 10-0 issue. Everyone using UAX#14 has to choose from tailor, unnoticed, or won’t fix. I think that kind of things should better be fixed.
> Half-width CJK should follow the same line breaking class as their wide counterparts. From that point of view, half-width Hangul being AL is actually correct. (Note that this is not the same as full-width oftentimes having the different classes than their narrow counterparts.)
> Half-width punctuations already have correct classes, so they’re fine. Symbols in U+FFE8-FFEE are AL, which looks also incorrect, but I do not find these code points in any CJK legacy encoding. Where had they come from? Logical thinking is to assign the same classes as their wide counterparts, but I can’t be sure without knowing where they came from.
> Ken, does this change cause problems in terms of the stability policy?
> /koji
