Request for Information

fantasai fantasai.lists at inkedblade.net
Tue Aug 12 21:19:19 CDT 2014


On 07/23/2014 11:37 PM, Richard Wordingham wrote:
> On Wed, 23 Jul 2014 20:45:48 +0100
> fantasai <fantasai.lists at inkedblade.net> wrote:
>
>> I would like to request that Unicode include, for each writing system
>> it encodes, some information on how it might justify.
>
> Unicode encodes scripts, and I suspect CLDR only really supports living
> languages.  Scripts can be used for multiple writing systems - the
> example of the Latin script for Romaji in Japanese was given in the
> original post.
>
>>     a) Text justification typically expands at word-separating
>> characters, but may also expand between letters.
>>     b) Since this writing system does not use spaces, justification
>> typically expands between letters.
>
> Are you hoping for details on this?  This justification, which I've
> seen called 'Thai justification' in Microsoft Word, generally treats
> spacing combining marks (gc=Mc) like letters in the Tai Tham script when
> used for Tai Khuen.

Actually, for b) I was thinking more of Chinese and Japanese. I'm
mostly hoping for some kind of clue, whether detailed (like the Tibetan
chapter, which devotes an entire page to justification) or superficial.
Right now I have nothing for minority scripts like Javanese, but in
order to display Javanese I have to make some kind of assumption, one
way or another.
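
To make the difference between a) and b) concrete, here is a rough
Python sketch of how a layout engine might pick its expansion
opportunities. The function name and the rule of never opening a gap
in front of a non-spacing or enclosing mark (gc=Mn/Me) are my own
assumptions for illustration, not something taken from the Unicode
Standard or from any particular implementation.

```python
import unicodedata

def expansion_gaps(text, between_letters):
    """Return indices i where extra space could go between text[i-1] and text[i].

    between_letters=False: style a), expand only after a space.
    between_letters=True:  style b), expand between any two characters,
                           except in front of a non-spacing or enclosing
                           mark (assumption: marks stay with their base).
    """
    gaps = []
    for i in range(1, len(text)):
        if between_letters:
            if unicodedata.category(text[i]) not in ("Mn", "Me"):
                gaps.append(i)
        elif text[i - 1] == " ":
            gaps.append(i)
    return gaps

print(expansion_gaps("a b c", between_letters=False))
print(expansion_gaps("日本語", between_letters=True))
```

What I'm missing for a script like Javanese is which of these two
modes (or what third one) is the right default.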

>>     c) Javanese only breaks between clauses, where punctuation is used,
>>        resulting in horrendously ragged lines. (Did I get that right?)
>
> No.  The text samples I could find quickly show scripta continua, but I
> suspect the line breaks are occurring at word or syllable boundaries.
> If I am right about the constraint on line break position, then this
> can be recovered by marking the optional line breaks with ZWSP.  In
> addition, the consonants should be reclassified from AL to SA.

:) I picked this example because I was pretty sure that extrapolating
from the UAX14 data was going to give me a wrong answer. Thank you for
confirming.

Does the UTC think that including a statement or two about line
breaking conventions in the per-script descriptions is unnecessary
or inappropriate?

> However, such a change would be incompatible with a modern writing
> system in which words are separated by spaces (if such exists). I don't
> know what happens in Indonesian schools, so I can't report an error.
> Scripta continua and non-scripta continua in the same script are
> incompatible in plain text.

Not really. You can break at spaces and also break by dictionary,
and as long as the two methods agree on what is an unbreakable
"word", it will work. It's only if they disagree that you run
into a problem.
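
As a minimal sketch of what I mean, assuming a greedy longest-match
dictionary segmenter (one of several possible strategies) and a toy
word list of Latin placeholders rather than real Javanese: the text is
first split at spaces, then each space-free run is split by dictionary.
The function names are hypothetical.

```python
def segment_run(run, dictionary):
    """Greedy longest-match split of a space-free run into dictionary words."""
    units, i = [], 0
    max_len = max(map(len, dictionary), default=1)
    while i < len(run):
        for j in range(min(len(run), i + max_len), i, -1):
            if run[i:j] in dictionary:
                break
        else:
            j = len(run)  # no dictionary match: keep the rest unbreakable
        units.append(run[i:j])
        i = j
    return units

def unbreakable_units(text, dictionary):
    """Break at spaces first, then by dictionary inside each run."""
    units = []
    for run in text.split():
        units.extend(segment_run(run, dictionary))
    return units

words = {"ab", "cd", "e"}  # placeholder dictionary
print(unbreakable_units("abcde", words))    # scripta continua
print(unbreakable_units("ab cd e", words))  # spaced orthography
```

Both inputs yield the same units, so the same line breaks are
available either way. The trouble only starts when the spaced
orthography treats something as one word that the dictionary would
split, or the reverse.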

~fantasai

