APL Under-bar Characters
kenwhistler at att.net
Sun Aug 16 13:37:50 CDT 2015
It seems to me that APL has some very deeply embedded (and ancient)
assumptions about fixed-width 8-bit characters, dating from ASCII days.
It only got as far as it did with the current assumptions because people
hacked up 8-bit fonts for all the special characters for the APL syntax,
and because IBM implemented those as dedicated special character sets with
matching specialized APL keyboards.
A built-in function like ⍴ which returns the *size* of data is structurally
hand-in-hand with the definition of vectors and arrays. There seem to
be very deep assumptions in the APL data model that strings are simply
an array of *fixed-size* data elements, aka "characters".
So requiring ⍴,'ä' and ⍴,'_A_' to "just work" is the moral equivalent of
expecting the C library call strlen("ä") or strlen("_A_") to "just work", regardless
of the representation of the data in the string. It is a nonsensical requirement
if applied to general Unicode strings outside the context of a very
carefully restricted subset designed to ensure one-to-one relationship
between "character" and "array element".
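To make the strlen analogy concrete, here is a minimal Python sketch of how
the "length" of the same text changes with its representation. It assumes
"_A_" is represented as A followed by U+0332 COMBINING LOW LINE, which is only
one possible representation; the helper name element_counts is made up for
this illustration.

```python
def element_counts(text: str) -> dict:
    """Return the length of `text` under three different representations."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),            # what strlen() sees for UTF-8 data
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "code_points": len(text),                           # Unicode code points
    }


precomposed = "\u00e4"   # ä as the single code point U+00E4
decomposed = "a\u0308"   # a + COMBINING DIAERESIS, canonically equivalent to U+00E4
underbar_a = "A\u0332"   # assumed representation of "_A_": A + COMBINING LOW LINE

print(element_counts(precomposed))  # {'utf8_bytes': 2, 'utf16_units': 1, 'code_points': 1}
print(element_counts(decomposed))   # {'utf8_bytes': 3, 'utf16_units': 2, 'code_points': 2}
print(element_counts(underbar_a))   # {'utf8_bytes': 3, 'utf16_units': 2, 'code_points': 2}
```

None of the three counts is "the" length; which one rho would report depends
entirely on what the implementation treats as an array element.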
A Unicode-based APL implementation can (presumably) just up the size
of its "character" to 16 bits internally (actually a UTF-16 code *unit*)
and carefully restrict itself to the subset of ASCII & Latin-1, the APL
symbols and a few other operators needed to fill out the set.
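As a rough sketch of what such a restriction amounts to, the following Python
check accepts a string only if every character fits in one UTF-16 code unit
and is not a combining mark. The function name and the exact rule are
assumptions for illustration, not a description of any actual APL
implementation.

```python
import unicodedata


def one_element_per_character(text: str) -> bool:
    """True if each code point is a single 16-bit unit and not a combining mark."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Outside the BMP: would need a surrogate pair (two code units).
            return False
        if unicodedata.category(ch).startswith("M"):
            # Combining mark: attaches to the preceding character.
            return False
    return True


print(one_element_per_character("X\u2190\u2374Y"))  # True:  X←⍴Y, all single BMP characters
print(one_element_per_character("\u00e4"))          # True:  precomposed ä
print(one_element_per_character("A\u0332"))         # False: A + COMBINING LOW LINE
```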
Looking at the fonts people seem to actually be using in various
implementations, the general choice seems to be to use both uppercase and
lowercase Latin and forgo the old convention of underlined uppercase Latin
letters. That is a small adjustment to make to not stay stuck in the 70's, frankly.
I can understand Alex's request that Unicode then effectively "solve the
problem" by providing a fixed-width 16-bit entity for "_A_" that could then
just be added to the restricted subset in the APL implementations. But that
isn't going to happen -- because of the normalization stability guarantees
for the Unicode Standard.
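The asymmetry between ä and "_A_" is visible in normalization. A small Python
sketch, again assuming "_A_" is A followed by U+0332 COMBINING LOW LINE (the
helper nfc_length is hypothetical): NFC composes a + U+0308 into the existing
precomposed ä, but leaves A + U+0332 as two code points because no precomposed
form exists for it.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Number of code points in `text` after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


# a + COMBINING DIAERESIS composes to the precomposed U+00E4.
print(unicodedata.normalize("NFC", "a\u0308") == "\u00e4")  # True
print(nfc_length("a\u0308"))                                # 1

# A + COMBINING LOW LINE has no precomposed form, so it stays as two code points.
print(nfc_length("A\u0332"))                                # 2
```

As I understand the stability policy, the normalized form of existing strings
may not change, so even a newly encoded single "_A_" character could not
become the NFC form of the existing sequence.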
And in any case, if users of APL need something more sophisticated for
string handling than strictly limited subsets based on the assumption that
character=element_of_fixed_data_size_array, then rho and a limited subset
aren't going to handle it anyway. At that point, another layer
would have to be built on top of the basic array and vector processing. And
then Khaled's points about character=grapheme_cluster become relevant.
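To give a sense of what such a layer would compute, here is a deliberately
crude Python approximation that counts user-perceived characters by skipping
combining marks. It is not the grapheme cluster segmentation defined in
UAX #29 (it ignores Hangul jamo, emoji sequences, and so on); the function
name is made up for this sketch.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Count code points that are not combining marks (rough approximation only)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


underbar_abc = "A\u0332B\u0332C\u0332"      # assumed representation of "_A__B__C_"

print(len(underbar_abc))                    # 6 code points (array elements)
print(approx_grapheme_count(underbar_abc))  # 3 perceived characters
print(approx_grapheme_count("a\u0308"))     # 1
```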
On 8/16/2015 9:53 AM, Khaled Hosny wrote:
> On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com wrote:
> > So I'm not sure why the allowance was made for ä as well as other certain
> > characters, but not for other things (under-bar characters) that face
> > similar representation issues.
> It was encoded for compatibility of pre-existing character sets AFAIK.