Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

Fri May 1 09:47:38 CDT 2015

On 5/1/2015 7:17 AM, Ken Whistler wrote:
>
> Koji,
>
> Personally, I don't have a horse in this race, because I am not 
> responsible for
> any linebreaking implementation -- so a change for halfwidth katakana 
> wouldn't
> matter one way or the other to me.
>
> Secondly, there is no formal stability guarantee constraining 
> Line_Break property
> values (other than the generic guarantee that the property itself or
> existing aliases cannot be *removed* from the standard). Nor is there
> any stability guarantee regarding the rest of the algorithm definition 
> in UAX #14.
> So in principle, the UTC could rewrite it completely. But I doubt that 
> that would
> be in anybody's interest at this point. ;-)
>
> But as I see it, the way this should work is that the major
> stakeholders who *do*
> have linebreaking algorithms depending on UAX #14 implemented and working
> in released products (and that would include people speaking for various
> browsers and for Apple products in general, I think) should be the ones
> either pushing for a change, because it would make their behavior more 
> correct
> and acceptable for Japanese, or pushing back *against* a change, 
> because they
> depend on UAX #14 stability and would prefer tweaking the behavior in 
> their
> implementations, instead. So I'd like to see a formal proposal for a 
> change
> (specified *exactly* as to the set of characters affected) brought to 
> the UTC,
> where implementers and users of ICU could make the case for or against.

I would go further and suggest that UTC make no change until it has 
positively heard from a representative sample of users/implementers.

This kind of seemingly innocuous change does affect implementations, but
implementers are usually not expecting to have the ground shift under
them after a decade or more of stable property assignments. Silence on
their part is just as likely to reflect a failure to appreciate the
possibility of an adverse outcome as actual acquiescence.

To the degree that the CSS working group relies on UAX#14 as default in 
some/any situations, it would be imperative to hear from them as well, 
before taking any action.

In principle, this should be the stated procedure by the UTC when making 
any change in long-standing property assignments -- particularly for 
widely deployed scripts.

That said, with proper buy-in from stakeholders, I see no objection to 
making a change.

A./
>
> The other thing that I think would need to happen here is that any 
> proposal
> should also provide suggested wording for UAX #14 which would explain
> why halfwidth katakana specifically need to break with the general 
> principles
> that were used 15 years ago to assign LB classes based on 
> East_Asian_Width
> considerations, and instead need to match the LB classes of their
> fullwidth katakana counterparts. That should be made explicit in the text
> of UAX #14, so somebody else doesn't "discover" another inconsistency
> between sets of values and try to change things back later on -- not 
> knowing
> the rationale for the values.
>
> Because a well-formed proposal for a change like this involves both
> a justification for a property value change *and* a corresponding fix
> to annex text, I think this is too late in the cycle to be taken as just
> beta feedback for the Version 8.0 release, unfortunately. Because of
> the potential hit on existing implementations (and test cases), this 
> needs
> full review, and should instead be pushed as an early proposal for
> the Version 9.0 release cycle.
>
> --Ken
>
> On 5/1/2015 5:33 AM, Koji Ishii wrote:
>> I support Makoto on the change. Nobody can be happy with the current
>> behavior, which is either worked around locally (Firefox, IE) or goes
>> unnoticed (Chrome). Rather than implementing yet another workaround in
>> Chrome, I would like to see it finally fixed after 15 years.
>>
>> If this were an issue where 5 people say break and 5 say not to, or,
>> considering the long life of the bug, 9 say break and 1 says not to, I
>> would understand that Ken’s answer might make more sense. However, I’m
>> quite sure that this is a 10-0 issue. Everyone using UAX#14 has to
>> choose between tailoring, leaving it unnoticed, or won’t fix. I think
>> that kind of thing had better be fixed.
>>
>> Half-width CJK should follow the same line breaking class as their
>> wide counterparts. From that point of view, half-width Hangul being
>> AL is actually correct. (Note that this is not the same as full-width
>> characters often having different classes from their narrow counterparts.)
>>
>> Half-width punctuation characters already have the correct classes, so
>> they’re fine. Symbols in U+FFE8-FFEE are AL, which also looks incorrect,
>> but I do not find these code points in any CJK legacy encoding. Where
>> did they come from? The logical choice would be to assign the same
>> classes as their wide counterparts, but I can’t be sure without knowing
>> where they came from.
>>
>> Ken, does this change cause problems in terms of the stability policy?
>>
>> /koji
>>
>>
>>
>
>
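
As a rough aid for drafting the "specified *exactly*" character set that Ken asks for above, the following Python sketch lists the code points in the Halfwidth and Fullwidth Forms block whose names begin with HALFWIDTH KATAKANA, along with their East_Asian_Width value and their NFKC (wide) counterpart. Several things here are assumptions rather than anything agreed in this thread: the name-based filter is only a starting point (it also picks up the middle dot and the sound marks, which a proposal might or might not want to cover), and the function name is purely illustrative. Python's standard unicodedata module does not expose Line_Break, so the LB classes themselves would still have to be read from LineBreak.txt.

```python
import unicodedata


def halfwidth_katakana_candidates():
    """Return (code point, name, East_Asian_Width, NFKC form) tuples.

    Hypothetical helper: scans the Halfwidth and Fullwidth Forms block
    (U+FF00..U+FFEF) and keeps characters whose name starts with
    "HALFWIDTH KATAKANA". This is a name-based guess at the candidate
    set, not the set any proposal has actually specified.
    """
    rows = []
    for cp in range(0xFF00, 0xFFF0):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        if name.startswith("HALFWIDTH KATAKANA"):
            eaw = unicodedata.east_asian_width(ch)
            wide = unicodedata.normalize("NFKC", ch)
            rows.append((cp, name, eaw, wide))
    return rows


if __name__ == "__main__":
    for cp, name, eaw, wide in halfwidth_katakana_candidates():
        wide_cps = " ".join(f"U+{ord(c):04X}" for c in wide)
        print(f"U+{cp:04X}  EAW={eaw}  NFKC={wide_cps}  {name}")
```

The NFKC column is included only to show which wide character each halfwidth form corresponds to; comparing the LB classes of the two columns against LineBreak.txt is left as a separate step.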