Word_Break for Hieroglyphs

Mark Davis ☕️ via Unicode unicode at unicode.org
Thu Dec 14 08:53:13 CST 2017


Mark <https://twitter.com/mark_e_davis>

On Thu, Dec 14, 2017 at 3:22 PM, Michael Everson <everson at evertype.com>
wrote:

> On 14 Dec 2017, at 14:14, Mark Davis ☕️ via Unicode <unicode at unicode.org>
> wrote:
>
> > The Word_Break property doesn't have a value Complex_Context, but I
> think that was just a typo in your message.
> >
> > The word break and line break properties for 1,057 [:Script=Egyp:]
> characters are currently
> >
> > Word_Break=ALetter
> > Line_Break=Alphabetic
> >
> > Off the top of my head, I think the best course would be to make them
> both the same as for most of [:Script=Hani:]
> >
> > Word_Break=Other
> > Line_Break=Ideographic
>
> Egyptian is not ideographic and is certainly not fixed-width. CJK does not
> cluster. Why should you want to make them the same?


Fixed-width has *nothing* to do with these properties. The issue is
whether spaces are required between words. The impact of *these* properties
with their current values is that

   - you would never find a word boundary within a string of hieroglyphs
   (e.g. on double-click), and
   - you would only break a line within a string of hieroglyphs if there are
   no spaces, etc. on the line.
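
The first point can be illustrated with a toy model of double-click selection. This is only a sketch, not the UAX #29 algorithm: it assumes CAPS stand in for hieroglyphs, and the function `select_word` and its `glyph_class` parameter are hypothetical names invented for the illustration. With "ALetter" adjacent glyphs join into one word; with "Other" each glyph is a word by itself.

```python
def select_word(text, index, glyph_class):
    """Return the substring a double-click at `index` would select.

    Assumption: uppercase letters stand in for hieroglyphs.
    glyph_class="ALetter": glyphs behave like letters and join together.
    glyph_class="Other":   a boundary falls on both sides of every glyph.
    """
    def joins(c):
        # Lowercase letters always join; glyphs join only when ALetter.
        return c.isalpha() and (glyph_class == "ALetter" or not c.isupper())

    if not joins(text[index]):
        return text[index]
    start = index
    while start > 0 and joins(text[start - 1]):
        start -= 1
    end = index + 1
    while end < len(text) and joins(text[end]):
        end += 1
    return text[start:end]


GLYPHS = "ABCJKQ" * 50  # 300 stand-in hieroglyphs
SAMPLE = "the passage " + GLYPHS + " is constructed"

print(len(select_word(SAMPLE, 20, "ALetter")))  # whole run selected
print(len(select_word(SAMPLE, 20, "Other")))    # single glyph selected
```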

For example, if you have a string of 300 hieroglyphs in a paragraph,
double-clicking on one of them would select the entire string, because as far
as Word_Break is concerned, the entire 300 characters form one word. For
line breaking, you would only break when forced. So in a paragraph of passages
of English + hieroglyphs (represented here by CAPS), you would only break
at the spaces and when forced. For example, suppose we have:

... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER
is constructed from 15 words with...

It would not line break (with the current properties) as:

... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLAREN
KQLNRKEWLQNFNNAKDFNFNQKLER is constructed from
15 words with...

but rather as:

... the passage
ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEW
LQNFNNAKDFNFNQKLER is constructed from 15 words with...
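
The two layouts above can be reproduced with a toy greedy wrapper. This is a minimal sketch under stated assumptions, not the UAX #14 algorithm: CAPS stand in for hieroglyphs, the function `wrap` and its flag `break_inside_caps` are hypothetical, and the width of 39 columns is chosen only to match the forced break in the second layout. The exact split points depend on the width, so they need not coincide with the hand-wrapped lines above.

```python
def wrap(text, width, break_inside_caps):
    """Greedy line wrapper over space-separated words.

    break_inside_caps=False models Line_Break=Alphabetic: a CAPS run is
    one unit and is split only when it cannot fit on a line by itself.
    break_inside_caps=True models Line_Break=Ideographic: a break is
    allowed between any two CAPS letters.
    """
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
            continue
        if break_inside_caps and word.isupper() and current:
            # Fill the remainder of the current line with glyphs.
            room = width - len(current) - 1
            if room > 0:
                current += " " + word[:room]
                word = word[room:]
        if current:
            lines.append(current)
        while len(word) > width:
            # Forced break: the unit is longer than a whole line.
            lines.append(word[:width])
            word = word[width:]
        current = word
    if current:
        lines.append(current)
    return lines


HIERO = "ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER"
TEXT = "the passage " + HIERO + " is constructed from 15 words with"

for flag in (False, True):
    print("\n".join(wrap(TEXT, 39, flag)), end="\n\n")
```

With the flag off, the first line ends after "passage" and the glyph run is cut only where it overflows; with the flag on, the run starts on the first line and fills it.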



> Moreover, these properties were defined at the beginning, were they not?
> Bob Richmond and others will certainly have a view on this.
>

If there is defined clustering behavior that affects line break, then the
line break property value would need to be Complex_Context.

But the *current* value is Alphabetic, which makes any length of
hieroglyphs function as one (possibly very long) word. That appears clearly
wrong, even if it was "defined at the beginning". Properties are not carved
in stone (so to speak); we sometimes find out later, especially for seldom
used scripts, that property values can be improved.


> > We would only need to use Complex_Context [:lb=SA:] for scripts that
> keep some letters together and break others apart (typically needing
> dictionary lookup). I would suspect for modern use of Egyp, that is not the
> case;
>
> Please do not “suspect”. It is not hard to ask experts.
>

You misunderstand. When I say "I suspect" that means I'm not certain. Thus
I would like people who are both knowledgeable about hieroglyphs *and*
Unicode properties to weigh in. I know that people who satisfy both
criteria, like Andrew Glass, are on this list.

>
> > most people would expect the characters to just flow like
> ideographs, breaking between any pair:
>
> NO. Clusters cannot be broken up just anywhere.
>

A simple assertion without more information is useless.

Does that mean that ancient inscriptions would leave gaps at the end of
lines in order to not break a cluster, or that modern users would expect
software to leave gaps at the end of lines in order to not break a
cluster? And what constitutes a cluster? Is that semantically determined
(e.g. as in Thai), or is it based on algorithmic features of the hieroglyphs?


> > you wouldn't need to disallow breaks between a <man whose head is hit
> with an axe> and a <head of hippopotamus>, for example.
> >
> > Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic
> have line break and general category properties that seem odd and
> inconsistent to me.
> >
> > Line_Break=Close_Punctuation
> > General_Category=Other_Letter (items: 8)
> > Egyptian Hieroglyphs - O. Buildings, parts of buildings, etc. (items: 6)
> >
> >    U+1325B EGYPTIAN HIEROGLYPH O006D
> >    U+1325C EGYPTIAN HIEROGLYPH O006E
> >    U+1325D EGYPTIAN HIEROGLYPH O006F
> >    U+13282 EGYPTIAN HIEROGLYPH O033A
> >    U+13287 EGYPTIAN HIEROGLYPH O036B
> >    U+13289 EGYPTIAN HIEROGLYPH O036D
> > Egyptian Hieroglyphs - V. Rope, fiber, baskets, bags, etc. (items: 2)
> >
> >    U+1337A EGYPTIAN HIEROGLYPH V011B
> >    U+1337B EGYPTIAN HIEROGLYPH V011C
> > Line_Break=Open_Punctuation
> > General_Category=Other_Letter (items: 6)
> > Egyptian Hieroglyphs - O. Buildings, parts of buildings, etc. (items: 5)
> >
> >    U+13258 EGYPTIAN HIEROGLYPH O006A
> >    U+13259 EGYPTIAN HIEROGLYPH O006B
> >    U+1325A EGYPTIAN HIEROGLYPH O006C
> >    U+13286 EGYPTIAN HIEROGLYPH O036A
> >    U+13288 EGYPTIAN HIEROGLYPH O036C
> > Egyptian Hieroglyphs - V. Rope, fiber, baskets, bags, etc. (items: 1)
> >
> >    U+13379 EGYPTIAN HIEROGLYPH V011A
>
> These properties were chosen explicitly when Egyptian was first defined.
> Those are enclosing punctuation characters.
>

The issue is that the General_Category values of these characters are *not*
punctuation values (they are Other_Letter), so there appears to be an
inconsistency (as I said).
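
The General_Category side of this can be checked with Python's standard `unicodedata` module. This is a sketch with one loud assumption: `unicodedata` does not expose Line_Break, so the Line_Break values below are hard-coded from the listing quoted in this message and may not match the current Unicode data. The dictionary `QUOTED_LINE_BREAK` and the function `report` are names made up for the illustration.

```python
import unicodedata

# Line_Break values as quoted in this message (assumption: 2017 data).
# They are hard-coded because unicodedata has no Line_Break accessor.
QUOTED_LINE_BREAK = {
    "Close_Punctuation": [0x1325B, 0x1325C, 0x1325D, 0x13282,
                          0x13287, 0x13289, 0x1337A, 0x1337B],
    "Open_Punctuation": [0x13258, 0x13259, 0x1325A, 0x13286,
                         0x13288, 0x13379],
}


def report():
    """Return (code point, name, General_Category, quoted Line_Break) rows."""
    rows = []
    for line_break, code_points in QUOTED_LINE_BREAK.items():
        for cp in code_points:
            ch = chr(cp)
            rows.append((f"U+{cp:04X}", unicodedata.name(ch),
                         unicodedata.category(ch), line_break))
    return rows


for row in report():
    print(*row, sep="  ")
```

For comparison, `unicodedata.category("(")` is `Ps` and `unicodedata.category(")")` is `Pe`, the categories one would normally expect alongside Open_Punctuation and Close_Punctuation.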



>
> Michael Everson.