Bit arithmetic on Unicode characters?

Fri Oct 7 11:06:31 CDT 2016

Richard Wordingham wrote:

> Yes, it's a trade-off. The application I had in mind is converting
> between mathematical letter variants and their 'plain' forms.

Long-time list members might remember a Windows utility I wrote to
convert between normal Unicode text and Mathematical Alphanumeric
Symbols (MAS). Andrew West (of BabelPad fame) has a similar, web-based
app that also supports things like small caps and superscript.

Both of these use lookup tables to do the conversions, and use
algorithms only for very broad-based operations, like distinguishing the
Latin-letter range in the MAS block from the Greek letters and the
digits. There's no practical value in implementing conversions like this
algorithmically. Maybe if there were one or two exceptions in the MAS
range instead of two dozen, it might be different.
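
A minimal sketch of the table-driven approach, in Python, for the
double-struck letters only. This is not the code of either utility
mentioned above; the function and table names are hypothetical. The
seven capitals that were already encoded in the Letterlike Symbols
block are listed as explicit exceptions, and offset arithmetic is used
only to fill in the rest of the table.

```python
# Hypothetical sketch of a table-driven converter (double-struck only).
# Names below are illustrative and do not come from any existing tool.

# Double-struck capitals encoded in Letterlike Symbols; the matching
# positions in the MAS range are unassigned, so plain offset arithmetic
# would produce the wrong code point for these letters.
DOUBLE_STRUCK_EXCEPTIONS = {
    "C": "\u2102", "H": "\u210D", "N": "\u2115", "P": "\u2119",
    "Q": "\u211A", "R": "\u211D", "Z": "\u2124",
}

DOUBLE_STRUCK_CAPITAL_A = 0x1D538
DOUBLE_STRUCK_SMALL_A = 0x1D552


def build_double_struck_table():
    """Map each plain Latin letter to its double-struck counterpart."""
    table = {}
    for i in range(26):
        upper = chr(ord("A") + i)
        lower = chr(ord("a") + i)
        # An explicit exception wins over the offset calculation.
        table[upper] = DOUBLE_STRUCK_EXCEPTIONS.get(
            upper, chr(DOUBLE_STRUCK_CAPITAL_A + i))
        table[lower] = chr(DOUBLE_STRUCK_SMALL_A + i)
    return table


_FORWARD = build_double_struck_table()
TO_DOUBLE_STRUCK = str.maketrans(_FORWARD)
TO_PLAIN = str.maketrans({styled: plain for plain, styled in _FORWARD.items()})


def to_double_struck(text):
    """Replace plain Latin letters; leave everything else untouched."""
    return text.translate(TO_DOUBLE_STRUCK)


def to_plain(text):
    """Inverse of to_double_struck for the characters in the table."""
    return text.translate(TO_PLAIN)
```

Once the table exists, both directions are a single lookup per
character, and the exceptions need no special handling at run time.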

> Perhaps there is just enough information in the UCD to allow
> exhaustive, automated tests.

I can't find anything in the UCD that distinguishes one "font variant"
from another (UnicodeData.txt shown as an example):

1D400;MATHEMATICAL BOLD CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D434;MATHEMATICAL ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D468;MATHEMATICAL BOLD ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D49C;MATHEMATICAL SCRIPT CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D4D0;MATHEMATICAL BOLD SCRIPT CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D504;MATHEMATICAL FRAKTUR CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D538;MATHEMATICAL DOUBLE-STRUCK CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D56C;MATHEMATICAL BOLD FRAKTUR CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D5A0;MATHEMATICAL SANS-SERIF CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D5D4;MATHEMATICAL SANS-SERIF BOLD CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D608;MATHEMATICAL SANS-SERIF ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D63C;MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D670;MATHEMATICAL MONOSPACE CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
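
The same observation can be checked programmatically. The sketch below
assumes Python's standard unicodedata module, which exposes the
decomposition field of UnicodeData.txt; the list and function names are
hypothetical. It collects the decomposition field for the thirteen code
points listed above into a set, so that any difference between styles
would show up as more than one element.

```python
import unicodedata

# The thirteen styled "capital A" code points from the excerpt above.
CAPITAL_A_VARIANTS = [
    0x1D400, 0x1D434, 0x1D468, 0x1D49C, 0x1D4D0, 0x1D504, 0x1D538,
    0x1D56C, 0x1D5A0, 0x1D5D4, 0x1D608, 0x1D63C, 0x1D670,
]


def decomposition_fields(code_points):
    """Return the set of distinct decomposition fields for the code points."""
    return {unicodedata.decomposition(chr(cp)) for cp in code_points}


# A single-element set means the field carries no style information.
print(decomposition_fields(CAPITAL_A_VARIANTS))
```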

And that's probably as it should be, because UTC never intended MAS to
be readily transformed to and from "plain" characters. They're supposed
to be used for mathematical expressions in which styled letters have
special meaning. (My utility, and I'm sure Andrew's, were written
entirely tongue-in-cheek.)

> My email client found a font to render U+1D547 as the unwary
> would expect, i.e. using a glyph suitable for ℙ U+2119 DOUBLE-STRUCK
> CAPITAL P. I was surprised when I first saw those gaps; I would have
> expected characters with appropriate singleton decompositions to protect
> the unwary. (The idea might have come up at the time of encoding, and
> been dismissed with reasons.)

Unifying identical characters with identical meanings, rather than
creating pointless duplicates, was a major design tenet of Unicode.

> I don't know whether the font's misrendering is an accident or is
> deliberate partial protection of the victims of bad character code
> selection.

Either way, it's a bug. Users who try to render an unassigned code point
should not be "protected" by showing them a glyph that the font designer
thought should be there. They should be shown a .notdef glyph so they
know something is wrong.
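
For anyone who wants to catch such code points before they reach a
renderer, here is a small sketch, again assuming Python's unicodedata
module (the helper names are hypothetical). It treats General_Category
Cn as "unassigned" and lists the holes in the double-struck capital
range, U+1D538..U+1D551.

```python
import unicodedata


def is_unassigned(code_point):
    """True if the code point has General_Category Cn (unassigned)."""
    return unicodedata.category(chr(code_point)) == "Cn"


def holes(first, last):
    """Return the unassigned code points in the inclusive range first..last."""
    return [cp for cp in range(first, last + 1) if is_unassigned(cp)]


# Positions where a double-struck capital would sit by offset arithmetic
# but where nothing is encoded.
for cp in holes(0x1D538, 0x1D551):
    print(f"U+{cp:04X}")
```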

--
Doug Ewell | Thornton, CO, US | ewellic.org