Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

Koji Ishii kojiishi at gmail.com
Mon Aug 17 01:21:44 CDT 2015


Hi all,

I'm not in sync with the publishing schedule, sorry about that, but is it
possible to consider this change for the Unicode 9.0 time frame?

I believe all concerns were cleared in the discussion, but if any were
left, I'd be happy to discuss further.

And I hope I'm not too late this time?

/koji

On Tue, May 5, 2015 at 6:19 AM, Peter Edberg <pedberg at apple.com> wrote:

> I have been checking with various groups at Apple. The consensus here is
> that we would like to see the linebreak value for halfwidth katakana
> changed to ID.
>
> - Peter E
>
>
>
> On May 3, 2015, at 12:53 PM, Asmus Freytag (t) <asmus-inc at ix.netcom.com>
> wrote:
>
> On 5/3/2015 9:47 AM, Koji Ishii wrote:
>
> Thank you so much Ken and Asmus for the detailed guides and histories.
> This helps me a lot.
>
> In terms of time frame, I don't insist on a specific time frame; Unicode 9
> is fine if that works well for all.
>
> I'm not sure how much history and postmortem has to be baked into the
> section of UAX#14; I hope not much, because I'm not familiar with how it
> was defined other than what Ken and Asmus kindly provided in this thread.
> But from that information, I feel more strongly than before that this was
> simply an unfortunate oversight. In the document Ken quoted, F and W are
> distinguished, but H and N are not. In the '90s, East Asian versions of
> Office and RichEdit were on my radar, and all of them handled halfwidth
> Katakana as ID for line breaking purposes. That's quite understandable
> given the number of code points to work on, given the priority of halfwidth
> Katakana, and given the difference between "what line breaking should be"
> and UAX#14, as Ken noted, but writing it up as a document doesn't look like
> an easy task.
>
>
> Koji,
>
> kana are special in that they are not shared among languages. From that
> perspective, there's nothing wrong with having a "general purpose"
> algorithm support the rules of the target language (unless that would add
> undue complexity, which isn't a consideration here).
>
> Based on the data presented informally here in postings, I find your
> conclusion (oversight) quite believable. The task would therefore be to
> present the same data in a more organized fashion as part of a formal
> proposal. Should be doable.
>
> I think you'd want to focus on a survey of modern practice in
> implementations (and if you have data on some of them going back to the
> '90s, all the better).
>
> From the historical analysis it's clear that there was a desire to create
> assignments that didn't introduce random inconsistencies between LB and EAW
> properties, but that kind of self-consistency check just makes sure that
> all characters of some group defined by the intersection of property
> subsets are treated the same (unless there's an overriding reason to
> differentiate within). It seems entirely plausible that this process
> misfired for the characters in question, more likely so, given that the
> earliest drafts of the tables were based on an implementation also being
> created by MS around the same time. That makes any difference to other MS
> products even more likely to be an oversight.
>
> I do want to help UTC establish a precedent of getting changes like that
> endorsed by a representative sample of implementers and key external
> standards (where applicable, in this case that would be CSS), to avoid the
> chance of creating undue disruption (and to increase the chance that the
> resulting modified algorithm is actually usable off-the-shelf, for example
> for "default" or "unknown language" type scenarios).
>
> Hence my insistence that you go out and drum up support. But it looks like
> this should be relatively easy, as there seems to be no strong case for
> maintaining the status quo, other than that it is the status quo.
>
> A./
>
>
>
> I agree that implementers and the CSS WG should be involved, but given that
> IE and FF have already tailored, and all MS products as well, I guess it
> should not be too hard. I'm on the Chrome team now, and the only problem for
> me in fixing it in Chrome is justifying why Chrome wants to tailor rather
> than fixing UAX#14 (and the bug priority...)
>
> Either Makoto or I can bring it up with the CSS WG and get back to you.
>
> /koji
>
>
> On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) <asmus-inc at ix.netcom.com> wrote:
>
>> Thank you, Ken, for your dedicated archeological efforts.
>>
>> I would like to emphasize that, at the time, UAX#14 reflected observed
>> behavior, in particular (but not exclusively) for MS products some of which
>> (at the time) used an LB algorithm that effectively matched an untailored
>> UAX#14.
>>
>> However, recently, the W3C has spent considerable effort to look into
>> different layout-related algorithms and specifications. If, in that context,
>> a consensus approach is developed that would point to a better "default"
>> behavior for untailored UAX#14-style line breaking, I would regard that as
>> a critical mass of support to allow UTC to consider tinkering with such a
>> long-standing set of property assignments.
>>
>> This would be true, especially, if it can be demonstrated that (other
>> than matching legacy behavior) there's no context that would benefit from
>> the existing classification. I note that this was something several posters
>> implied.
>>
>> So, if implementers of the legacy behavior are amenable to achieving this
>> by tailoring, and if the change augments the number of situations where
>> untailored UAX#14-style line breaking can be used, that would be a win that
>> might offset the cost of a disruptive change.
>>
>> We've heard arguments why the proposed change is technically superior for
>> Japanese. We now need to find out whether there are contexts where a change
>> would adversely affect users/implementers. Following that, we would look
>> for endorsements of the proposal from implementers or other standards
>> organizations such as W3C (and, if at all possible, agreement from those
>> implementers who use the untailored algorithm now). With these three
>> preconditions in place, I would support an effort of the UTC to revisit
>> this question.
>>
>> A./
>>
>>
>> On 5/1/2015 9:48 AM, Ken Whistler wrote:
>>
>> Suzuki-san,
>>
>> On 5/1/2015 8:25 AM, suzuki toshiya wrote:
>>
>>
>> Excuse me, is there any record of the discussion of how the UAX#14 class
>> for halfwidth-katakana was decided 15 years ago? If there is, I would like
>> to see a sample text (of halfwidth-katakana) and the expected layout
>> result for it.
>>
>>
>> The *founding* document for the UTC discussion of the initial
>> Line_Break property values 15 years ago was:
>>
>> http://www.unicode.org/L2/L1999/99179.pdf
>>
>> and the corresponding table draft (before approval and conversion
>> into the final format that was published with UTR #14 -- later
>> *UAX* #14) was:
>>
>> http://www.unicode.org/L2/L1999/99180.pdf
>>
>> There is nothing different or surprising in terms of values there. The
>> halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in
>> that earliest draft, as of 1999.
>>
>> What is new information, perhaps, is the explicit correlation that can be
>> found in those documents with the East_Asian_Width properties, and the
>> explanation in L2/99-179 that the EAW property values were explicitly used
>> to make distinctions for the initial LB values.
>>
>> There is no sample text or expected layout results from that time period,
>> because that was not the basis for the original UTC decisions on any of
>> this.
>> Initial LB values were generated based on existing General_Category
>> and EAW values, using general principles. They were not generated by
>> examining and specifying in detail the line breaking behavior for
>> every single script in the standard, and then working back from those
>> detailed specifications to attempt to create a universal specification
>> that would replicate all of that detailed behavior. Such an approach
>> would have been nearly impossible, given the state of all the data,
>> and might have taken a decade to complete.
>>
>> That said, Japanese line breaking was no doubt considered as part of
>> the overall background, because the initial design for UTR #14 was
>> informed by experience in implementation of line breaking algorithms at
>> Microsoft in the 90's.
>>
>>
>> You commented that the UAX#14 class should not be changed but
>> that tailoring of the line breaking behaviour would solve
>> the problem (as Firefox and IE11 did). However, some developers
>> may wonder "there might be a reason why UTC put halfwidth-katakana
>> in AL; without understanding it, we cannot determine whether
>> the proposed tailoring should be enabled always, or enabled
>> only for a specific environment (e.g. locale, surrounding text)".
>>
>>
>> See above, in L2/99-179. *That* was the justification. It had nothing
>> to do with specific environment, locale, or surrounding text.
>>
>>
>> If UTC can supply the "expected layout result for halfwidth-
>> katakana (used to define the class in current UAX#14)", it
>> would be helpful for the developers to evaluate the proposed
>> tailoring algorithm.
>>
>>
>> UAX #14 was never intended to be a detailed, script-by-script
>> specification of line layout results. It is a default, generic, universal
>> algorithm for line breaking that does a decent, generic job of
>> line breaking in generic contexts without tailoring or specific
>> knowledge of language, locale, or typographical conventions in use.
>>
>> UAX #14 is not a replacement for full specification of kinsoku
>> rules for Japanese, in particular. Nor is it intended as any kind
>> of replacement for JIS X 4051.
>>
>> Please understand this: UAX #14 does *NOT* tell anyone how
>> Japanese text *should* line break. Instead, it is Japanese typographers,
>> users and standardizers who tell implementers of line break
>> algorithms for Japanese what the expectations for Japanese text should
>> be, in what contexts. It is then the job of the UTC and of the
>> platform and application vendors to negotiate the details of
>> which part of that expected behavior makes sense to try to
>> cover by tweaking the default line-breaking algorithm and the
>> Line_Break property values for Unicode characters, and which
>> part of that expected behavior makes sense to try to cover
>> by adjusting commonly accessible and agreed upon tailoring
>> behavior (or public standards like CSS), and finally which part of that
>> expected behavior should instead be addressed by value-added, proprietary
>> implementations of high end publishing software.
>>
>> Regards,
>>
>> --Ken
>>
>>
>>
>>
>>
>>
>
>
>
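
The tailoring discussed in this thread can be illustrated with a short
Python sketch. It is only an illustration under stated assumptions, not
the method used by any of the implementations mentioned. The standard
library's unicodedata module exposes East_Asian_Width but not Line_Break,
so the default Line_Break value is passed in by the caller. The function
name tailored_line_break and the constant HALFWIDTH_KATAKANA are
hypothetical names introduced here.

```python
import unicodedata

# Assumption: the halfwidth katakana of interest lie in U+FF66..U+FF9F
# (Halfwidth and Fullwidth Forms block).
HALFWIDTH_KATAKANA = range(0xFF66, 0xFFA0)


def tailored_line_break(ch: str, default_lb: str) -> str:
    """Return a tailored Line_Break class for a single character.

    Hypothetical tailoring: halfwidth katakana whose default class is
    AL are treated as ID, matching fullwidth katakana. Every other
    character keeps the default class supplied by the caller.
    """
    if ord(ch) in HALFWIDTH_KATAKANA and default_lb == "AL":
        return "ID"
    return default_lb


if __name__ == "__main__":
    halfwidth_a = "\uFF71"  # HALFWIDTH KATAKANA LETTER A
    fullwidth_a = "\u30A2"  # KATAKANA LETTER A

    # East_Asian_Width separates the two forms (H versus W).
    print(unicodedata.east_asian_width(halfwidth_a))
    print(unicodedata.east_asian_width(fullwidth_a))

    # With the tailoring, a default of AL becomes ID for halfwidth katakana.
    print(tailored_line_break(halfwidth_a, "AL"))
```

Only characters whose default class is AL are remapped, so halfwidth
characters that carry other classes keep whatever value the caller
supplies.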