APL Under-bar Characters

Khaled Hosny khaledhosny at eglug.org
Sun Aug 16 11:53:52 CDT 2015


On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com wrote:
> Khaled,
> Thank you for the link. The normalization methods were already discussed,
> specifically here:
> 
> http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00047.html

Grapheme cluster boundary detection is different from normalisation;
please read the link I provided.

> Where the problem of "how big" is ä is discussed. The answer being that this is
> one symbol, because the Unicode Consortium decided that it is also its own
> standalone character. From the thread:
> 
> I'll give you an example. What would you want ⍴,'ä' to be?
> 
> Right now, that could return either 1 or 2 depending on whether the ä was using
> the precomposed character (U+00E4) or the combining mark (U+0061, U+0308).
> Visually, these are identical, and generally you'd expect them to compare
> equal.

If you are counting grapheme clusters, then the answer is one in both
cases.
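
To make the difference concrete, here is a small sketch in Python (not
APL; it uses only the standard unicodedata module, and the variable names
are purely illustrative). It shows that a plain code point count depends
on which form the text is in, and that NFC only helps where a precomposed
character happens to exist:

```python
import unicodedata

precomposed = "\u00e4"   # ä as the single code point U+00E4
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS

# Counting code points gives different answers for the same visible text.
print(len(precomposed))  # 1
print(len(decomposed))   # 2

# Normalisation converts one form into the other.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True

# 'A' followed by U+0332 COMBINING LOW LINE has no precomposed form,
# so NFC leaves it as two code points.
underbar_a = "A\u0332"
nfc_underbar_a = unicodedata.normalize("NFC", underbar_a)
print(len(nfc_underbar_a))  # 2
```

So normalising to NFC would make the ä case return 1, but it would not
change the result for an under-barred letter.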

> In Unicode, the comparison of equivalent strings (made up of different
> characters) is done by performing a normalisation step prior to comparison.
> There are 4 different types of normalisation, with different behaviour.

Quoting from the link I provided:

    A key feature of default Unicode grapheme clusters (both legacy and
    extended) is that they remain unchanged across all canonically
    equivalent forms of the underlying text. Thus the boundaries remain
    unchanged whether the text is in NFC or NFD. Using a grapheme
    cluster as the fundamental unit of matching thus provides a very
    clear and easily explained basis for canonically equivalent
    matching. This is important for applications from searching to
    regular expressions.

See also: http://unicode.org/faq/char_combmark.html#7
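
As a rough sketch of what counting by grapheme cluster looks like, again
in Python with only the standard library: the function below is a
deliberately simplified approximation of the rules in UAX #29, not a
conforming implementation. It only attaches non-spacing and enclosing
combining marks (general categories Mn and Me) to the preceding
character, and ignores Hangul jamo, CR LF, ZWJ sequences and the other
cases the report covers. The name count_clusters_approx is made up for
this example.

```python
import unicodedata

def count_clusters_approx(text):
    """Approximate grapheme cluster count: a combining mark (Mn or Me)
    joins the cluster of the character before it."""
    count = 0
    for index, ch in enumerate(text):
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and index > 0:
            # The mark extends the previous cluster; no new cluster starts.
            continue
        count += 1
    return count

samples = {
    "precomposed a-diaeresis": "\u00e4",
    "decomposed a-diaeresis": "a\u0308",
    "A with under-bar": "A\u0332",
    "A and B with under-bars": "A\u0332B\u0332",
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    # The approximate cluster count is the same in either normalisation form.
    print(label, count_clusters_approx(nfc), count_clusters_approx(nfd))
```

With this kind of counting, both forms of ä and an under-barred A each
count as one, without needing any new precomposed characters.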

> Now, the ä character has a precomposed form in Unicode, and if you couple that
> with the NFC normalisation form, you'd get the above expression to return 1.
> 
> 
> So I'm not sure why the allowance was made for ä as well as certain other
> characters, but not for other things (under-bar characters) that face
> similar representation issues.

It was encoded for compatibility with pre-existing character sets AFAIK.

Regards,
Khaled


> 
> 
>     -------- Original Message --------
>     Subject: Re: APL Under-bar Characters
>     From: Khaled Hosny <khaledhosny at eglug.org>
>     Date: Sun, August 16, 2015 8:17 am
>     To: alexweiner at alexweiner.com
>     Cc: unicode at unicode.org
> 
>     On Sun, Aug 16, 2015 at 07:35:17AM -0700, alexweiner at alexweiner.com wrote:
>     > Hello Unicode Mailing List,
>     >
>     > There is significant discussion about the problems of adding capital
>     > letters with individual under-bars in this mailing list for GNU APL.
>     >
>     > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html
>     >
>     > Pretty much it adds up to the following problem:
>     >
>     > The string length functionality would view an 'A' code point combined
>     > with an '_' code point as an item that has two elements, while something
>     > that looks like 'A' should be atomic, and return a length of one.
> 
>     I think what you need is better “character” counting [1], rather than
>     new precomposed characters.
> 
>     Regards,
>     Khaled
> 
>     1. http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> 

