Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

Asmus Freytag (t) asmus-inc at ix.netcom.com
Sun May 3 14:53:19 CDT 2015

On 5/3/2015 9:47 AM, Koji Ishii wrote:
> Thank you so much, Ken and Asmus, for the detailed guides and histories.
> This helps me a lot.
> In terms of time frame, I don't insist on a specific one; Unicode
> 9 is fine if that works well for all.
> I'm not sure how much history and postmortem has to be baked into the
> section of UAX#14; I hope not much, because I'm not familiar with how it
> was defined beyond what Ken and Asmus kindly provided in this
> thread. But from that information, I feel more strongly than before that
> this was simply an unfortunate oversight. In the document Ken quoted,
> F and W are distinguished, but H and N are not. In the '90s, East Asian
> versions of Office and RichEdit were on my radar, and all of them
> handled halfwidth Katakana as ID for line breaking purposes.
> That's quite understandable given the number of code points to work
> on, given the priority of halfwidth Katakana, and given the difference
> between "what line breaking should be" and UAX#14 as Ken noted, but writing
> it up as a document doesn't look like an easy task.


Kana are special in that they are not shared among languages. From that 
perspective, there's nothing wrong with having a "general purpose" 
algorithm support the rules of the target language (unless that would 
add undue complexity, which isn't a consideration here).

Based on the data presented informally here in postings, I find your 
conclusion (oversight) quite believable. The task would therefore be to 
present the same data in a more organized fashion as part of a formal 
proposal. Should be doable.

I think you'd want to focus on a survey of modern practice in 
implementations (and if you have data on some of them going back to the 
'90s, all the better).

From the historical analysis it's clear that there was a desire to 
create assignments that didn't introduce random inconsistencies between 
LB and EAW properties, but that kind of self-consistency check just 
makes sure that all characters of some group defined by the intersection 
of property subsets are treated the same (unless there's an overriding 
reason to differentiate within the group). It seems entirely plausible 
that this process misfired for the characters in question, all the more 
so given that the earliest drafts of the tables were based on an 
implementation also being created by MS around the same time. That makes 
any difference from other MS products even more likely to be an oversight.
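
As a rough illustration of that kind of self-consistency check, the
Python sketch below groups a handful of characters by East_Asian_Width
and reports any group whose members carry more than one Line_Break
class. Python's unicodedata module exposes East_Asian_Width but not
Line_Break, so the LB values in SAMPLE_LB are hand-entered assumptions
based on the values quoted in this thread (halfwidth katakana lb=AL,
fullwidth katakana lb=ID), not data read from LineBreak.txt. The
function names and the sample table are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Hand-entered Line_Break values (assumption: copied from the values
# quoted in this thread, not read from the Unicode data files).
SAMPLE_LB = {
    "\u30A2": "ID",  # KATAKANA LETTER A
    "\u30AB": "ID",  # KATAKANA LETTER KA
    "\uFF71": "AL",  # HALFWIDTH KATAKANA LETTER A
    "\uFF76": "AL",  # HALFWIDTH KATAKANA LETTER KA
    "A": "AL",       # LATIN CAPITAL LETTER A
}


def group_by_eaw(lb_table):
    """Map each East_Asian_Width value to the set of LB classes seen in it."""
    groups = defaultdict(set)
    for ch, lb in lb_table.items():
        groups[unicodedata.east_asian_width(ch)].add(lb)
    return dict(groups)


def inconsistent_groups(lb_table):
    """Return only the EAW groups whose members disagree on LB class."""
    return {eaw: lbs for eaw, lbs in group_by_eaw(lb_table).items() if len(lbs) > 1}
```

On this sample every EAW group is internally uniform, so the check
reports nothing, even though the halfwidth (H) group ends up with the
same class as the narrow (Na) group. A check of this shape can confirm
that a group is treated consistently, but not that the class chosen
for the group is the right one.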

I do want to help UTC establish a precedent of getting changes like that 
endorsed by a representative sample of implementers and key external 
standards (where applicable; in this case that would be CSS), to avoid 
the chance of creating undue disruption (and to increase the chance that 
the resulting modified algorithm is actually usable off-the-shelf, for 
example for "default" or "unknown language" type scenarios).

Hence my insistence that you go out and drum up support. But it looks 
like this should be relatively easy, as there seems to be no strong case 
for maintaining the status quo, other than that it is the status quo.


> I agree that implementers and the CSS WG should be involved, but given
> that IE and FF have already tailored, and all MS products as well, I
> guess it should not be too hard. I'm on the Chrome team now, and the
> only problem for me in fixing it in Chrome is to justify why Chrome
> wants to tailor rather than fix UAX#14 (and the bug priority...)
> Either Makoto or I can bring it up with the CSS WG and get back to you.
> /koji
> On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t)
> <asmus-inc at ix.netcom.com> wrote:
>     Thank you, Ken, for your dedicated archeological efforts.
>     I would like to emphasize that, at the time, UAX#14 reflected
>     observed behavior, in particular (but not exclusively) for MS
>     products some of which (at the time) used an LB algorithm that
>     effectively matched an untailored UAX#14.
>     However, recently, the W3C has spent considerable effort to look
>     into different layout-related algorithms and specifications. If, in
>     that context, a consensus approach is developed that would point
>     to a better "default" behavior for untailored UAX#14-style line
>     breaking, I would regard that as a critical mass of support to
>     allow UTC to consider tinkering with such a long-standing set of
>     property assignments.
>     This would be true, especially, if it can be demonstrated that
>     (other than matching legacy behavior) there's no context that
>     would benefit from the existing classification. I note that this
>     was something several posters implied.
>     So, if implementers of the legacy behavior are amenable to achieving
>     this by tailoring, and if the change augments the number of
>     situations where untailored UAX#14-style line breaking can be
>     used, that would be a win that might offset the cost of a
>     disruptive change.
>     We've heard arguments why the proposed change is technically
>     superior for Japanese. We now need to find out whether there are
>     contexts where a change would adversely affect users/implementers.
>     Following that, we would look for endorsements of the proposal
>     from implementers or other standards organizations such as W3C
>     (and, if at all possible, agreement from those implementers who
>     use the untailored algorithm now). With these three preconditions
>     in place, I would support an effort of the UTC to revisit this
>     question.
>     A./
>     On 5/1/2015 9:48 AM, Ken Whistler wrote:
>>     Suzuki-san,
>>     On 5/1/2015 8:25 AM, suzuki toshiya wrote:
>>>     Excuse me, is there any record of the discussion, 15 years ago, of
>>>     how the UAX#14 class for halfwidth katakana was decided? If there
>>>     is, I would like to see a sample text (of halfwidth katakana) and
>>>     the expected layout result for it.
>>     The *founding* document for the UTC discussion of the initial
>>     Line_Break property values 15 years ago was:
>>     http://www.unicode.org/L2/L1999/99179.pdf
>>     and the corresponding table draft (before approval and conversion
>>     into the final format that was published with UTR #14 -- later
>>     UAX #14) was:
>>     http://www.unicode.org/L2/L1999/99180.pdf
>>     There is nothing different or surprising in terms of values
>>     there. The halfwidth katakana were lb=AL and the fullwidth
>>     katakana were lb=ID in that earliest draft, as of 1999.
>>     What is new information, perhaps, is the explicit correlation
>>     that can be found in those documents with the East_Asian_Width
>>     properties, and the explanation in L2/99-179 that the EAW
>>     property values were explicitly used to make distinctions for
>>     the initial LB values.
>>     There is no sample text or expected layout results from that time
>>     period, because that was not the basis for the original UTC
>>     decisions on any of this.
>>     Initial LB values were generated based on existing General_Category
>>     and EAW values, using general principles. They were not generated by
>>     examining and specifying in detail the line breaking behavior for
>>     every single script in the standard, and then working back from those
>>     detailed specifications to attempt to create a universal
>>     specification that would replicate all of that detailed behavior.
>>     Such an approach would have been nearly impossible, given the state
>>     of all the data, and might have taken a decade to complete.
>>     That said, Japanese line breaking was no doubt considered as part of
>>     the overall background, because the initial design for UTR #14
>>     was informed by experience in implementation of line breaking
>>     algorithms at Microsoft in the '90s.
>>>     You commented that the UAX#14 class should not be changed but
>>>     the tailoring of the line breaking behaviour would solve
>>>     the problem (as Firefox and IE11 did). However, some developers
>>>     may wonder "there might be a reason why UTC put halfwidth-katakana
>>>     to AL - without understanding it, we could not determine whether
>>>     the proposed tailoring should be enabled always, or enabled
>>>     only for a specific environment (e.g. locale, surrounding text)".
>>     See above, in L2/99-179. *That* was the justification. It had nothing
>>     to do with specific environment, locale, or surrounding text.
>>>     If UTC can supply the "expected layout result for halfwidth-
>>>     katakana (used to define the class in current UAX#14)", it
>>>     would be helpful for the developers to evaluate the proposed
>>>     tailoring algorithm.
>>     UAX #14 was never intended to be a detailed, script-by-script
>>     specification of line layout results. It is a default, generic,
>>     universal algorithm for line breaking that does a decent, generic
>>     job of line breaking in generic contexts without tailoring or
>>     specific knowledge of language, locale, or typographical
>>     conventions in use.
>>     UAX #14 is not a replacement for full specification of kinsoku
>>     rules for Japanese, in particular. Nor is it intended as any kind
>>     of replacement for JIS X 4051.
>>     Please understand this: UAX #14 does *NOT* tell anyone how
>>     Japanese text *should* line break. Instead, it is Japanese
>>     typographers, users and standardizers who tell implementers of
>>     line break algorithms for Japanese what the expectations for
>>     Japanese text should be, in what contexts. It is then the job of
>>     the UTC and of the platform and application vendors to negotiate
>>     the details of which part of that expected behavior makes sense
>>     to try to cover by tweaking the default line-breaking algorithm
>>     and the Line_Break property values for Unicode characters, and
>>     which part of that expected behavior makes sense to try to cover
>>     by adjusting commonly accessible and agreed upon tailoring
>>     behavior (or public standards like CSS), and finally which part
>>     of that expected behavior should instead be addressed by
>>     value-added, proprietary implementations of high end publishing
>>     software.
>>     Regards,
>>     --Ken
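
For readers evaluating the tailoring discussed in the quoted messages,
here is a minimal Python sketch of what such an override could look
like. It is not the Firefox, IE, or Office implementation, which this
thread does not describe in detail. The caller is assumed to supply
the default Line_Break class from its own data, since Python's
unicodedata module does not expose that property. Both function names
are hypothetical, and identifying halfwidth katakana letters by
character name is a simplification chosen for illustration.

```python
import unicodedata


def is_halfwidth_katakana_letter(ch):
    """True if the character name starts with HALFWIDTH KATAKANA LETTER."""
    # unicodedata.name() returns the default "" for unnamed code points.
    return unicodedata.name(ch, "").startswith("HALFWIDTH KATAKANA LETTER")


def tailored_line_break_class(ch, default_lb):
    """Override AL with ID for halfwidth katakana letters; else keep the default.

    default_lb is the untailored Line_Break class, supplied by the caller
    (assumption: looked up elsewhere, for example from LineBreak.txt).
    """
    if default_lb == "AL" and is_halfwidth_katakana_letter(ch):
        return "ID"
    return default_lb
```

For example, tailored_line_break_class("\uFF71", "AL") yields "ID",
while a Latin letter passed with "AL" stays "AL". Restricting the
override to characters whose default is AL leaves any halfwidth
katakana code points that already have a different class untouched.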
