Re: What to do if a legacy compatibility character is defective?
Asmus Freytag
asmusf at ix.netcom.com
Fri Oct 24 17:26:53 CDT 2025
Fundamentally, when Unicode "unifies" characters, it often does so "across
sources". For example, ordinary ASCII letters are unified across
character sets, even if some legacy platform shows a somewhat different
pixel arrangement for a letter than some other platform does.
The most common reason for Unicode to disunify characters is that the
*same* source shows both characters as distinct.
These same considerations apply to compatibility characters.
The primary goal in encoding any compatibility character is to allow
round-tripping of data between the source encoding and systems operating
in Unicode. It is a non-goal to be able to tell from the Unicode code
point which legacy platform the character was mapped from or is being
mapped to.
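As a sketch of what that round-trip guarantee amounts to, here is a toy mapping table; the byte values and code points are invented for illustration and do not come from any real mapping file:

```python
# One-to-one mapping between a hypothetical legacy character set and
# Unicode: legacy byte -> Unicode code point. Values are illustrative.
TO_UNICODE = {0xA1: 0x2580, 0xA2: 0x2584}
# The reverse mapping is well defined because the forward map is injective.
FROM_UNICODE = {cp: b for b, cp in TO_UNICODE.items()}

def round_trip(legacy_bytes):
    """Map legacy bytes to Unicode code points and back again."""
    unicode_text = [TO_UNICODE[b] for b in legacy_bytes]
    return [FROM_UNICODE[cp] for cp in unicode_text]

data = [0xA1, 0xA2, 0xA1]
assert round_trip(data) == data  # data survives the trip unchanged
```

As long as each source character has its own Unicode code point, the conversion loses nothing, regardless of which platform the data came from.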
The evidence required to support a request for disunification therefore
always consists of a document, usually something other than a character
set table (a screenshot, for instance), showing that the two characters
are distinct in their source environment and that the distinction
matters (for example, that it can't be determined mechanically from
context).
From the original document (section 1, page 1), it looks as though
there are two characters that are distinct in the source but have been
mapped to the same Unicode character, U+1CE2B. I can certainly
sympathize with the view that unifying these based on their close visual
similarity was what we used to call a case of "arm's-length" unification.
In this example, a character stream encoding the pieces used to
represent a particular run of text in the "large character mode" would
not round-trip reliably, and after round-tripping (with a real device)
the displayed characters would look subtly different. For data being
processed transiently through Unicode, the loss of round-tripping
results in a change in the data stream without a change in contents,
which is exactly what compatibility characters are normally designed to
avoid. For a live terminal emulator, the effect would be a small
degradation in the fidelity of the emulation. There's no simple
workaround, as analyzing the fragments in what amounts to a 2-D text
display isn't without challenges.
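The failure mode can be shown in the same toy style; the byte values are invented, and the use of U+1CE2B here is only to illustrate two source characters landing on one code point:

```python
# Two distinct source bytes unified onto one Unicode code point.
TO_UNICODE = {0xB0: 0x1CE2B, 0xB1: 0x1CE2B}
# The reverse mapping can only choose one byte per code point,
# so the other source character can never be recovered.
FROM_UNICODE = {0x1CE2B: 0xB0}

def round_trip(legacy_bytes):
    return [FROM_UNICODE[TO_UNICODE[b]] for b in legacy_bytes]

data = [0xB0, 0xB1]
assert round_trip(data) == [0xB0, 0xB0]  # the 0xB1 variant is silently lost
```

The data stream has changed even though, from Unicode's point of view, the contents have not.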
I can understand the frustration of the submitter on being told that
there's an arbitrary limitation on fidelity and that some degradation
should be seen as acceptable. While not visually prominent, the
disposition needlessly violates source separation for a single character.
For the examples involving block characters, it is unclear whether they
involve unification within a source or across sources. If the
unification is across sources (platforms), then knowledge of the target
platform can be used to adjust the glyph being displayed, and there is
no issue. The same is true for any SHIFT mode in a source character set,
because whether the device is operating in shifted mode has to be known
anyway, and already affects what is displayed at a given byte location
in the source character set.
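The SHIFT-mode point can be sketched as follows; the byte value, mode names, and glyph assignments are hypothetical:

```python
# The same source byte displays differently depending on a device mode
# that the emulator must track anyway, so the mode itself provides the
# context needed to disambiguate. Assignments are invented.
NORMAL_GLYPHS  = {0x60: "\u258C"}   # LEFT HALF BLOCK in normal mode
SHIFTED_GLYPHS = {0x60: "\u2590"}   # RIGHT HALF BLOCK in shifted mode

def display(byte, shifted):
    """Return the glyph shown for a byte, given the current SHIFT state."""
    table = SHIFTED_GLYPHS if shifted else NORMAL_GLYPHS
    return table[byte]

assert display(0x60, shifted=False) != display(0x60, shifted=True)
```

Since the mode is already required state for correct display, reusing one code position across modes costs nothing at round-trip time.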
I cannot tell whether the Script Encoding disposition violates source
separation or merely suggests reuse of character codes for multiple
sources/modes in a way that may be amenable to disambiguation with
additional, but available context information.
A./