Unicode fundamental character identity

Mon Feb 3 14:24:45 CST 2025

On 2/3/2025 9:36 AM, Sławomir Osipiuk via Unicode wrote:
> On Monday, 03 February 2025, 12:19:06 (-05:00), Peter Constable via 
> Unicode wrote:
>
>     As stated previously, Unicode makes no guarantee of supporting
>     source separation / round-trip compatibility with HP264x.
>
>
> I'm honestly surprised by this. I always thought (because it was 
> repeated so many times - must remember repetition does not equal 
> truth) that round-trip compatibility with old character sets was a 
> founding cornerstone of Unicode and so contrastive use (aka source 
> separation) in an old charset would be persuasive evidence for inclusion.

You guys are talking past each other a bit.

Unicode decided early on to guarantee round-trip to important, widely 
used character sets of the time. The key interest was to be able to 
deploy software that worked internally in Unicode but could interface 
with existing systems without incurring data loss in round trip.
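
To make the round-trip idea concrete, here is a minimal sketch (not
anything the UTC uses) that checks whether a single-byte legacy code
page survives legacy -> Unicode -> legacy conversion unchanged. It
relies only on Python's built-in codecs; the function name round_trips
is made up for this illustration, and bytes a code page leaves
unassigned are simply skipped.

```python
def round_trips(codec: str) -> bool:
    """Check legacy -> Unicode -> legacy for every byte of a single-byte codec."""
    for value in range(256):
        raw = bytes([value])
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            # This byte is unassigned in the code page; nothing to round-trip.
            continue
        if text.encode(codec) != raw:
            # Converting back gave a different byte: data loss.
            return False
    return True


if __name__ == "__main__":
    for name in ("latin-1", "cp437", "koi8_r"):
        print(name, round_trips(name))
```

A character set covered by the guarantee should pass a check of this
kind; one that was never covered may not.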

This level of guarantee does not exist for just any character set. It
didn't even exist for all character sets then in existence.

However, if conflating two characters causes a particular problem,
Unicode has accepted case-by-case requests not to unify them, or even to
disunify them. In those cases, instead of applying a guarantee, the UTC
does a bit of a cost/benefit analysis, weighing the cost of having to
encode additional characters (in perpetuity) against the benefit for the
intended users.
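
As a rough illustration of what conflating two characters does to a
round trip, the sketch below uses an entirely hypothetical mapping
table (the legacy codes 0x41 to 0x43 and the helper names are invented
here, and are not the actual HP264x mapping). Two legacy codes are
unified with one Unicode character, so the reverse mapping has to pick
one of them and the other cannot be recovered.

```python
from collections import defaultdict

# Hypothetical legacy codes and mappings, for illustration only.
LEGACY_TO_UNICODE = {
    0x41: "\u2502",  # legacy glyph A -> BOX DRAWINGS LIGHT VERTICAL
    0x42: "\u2502",  # legacy glyph B, unified with the same character
    0x43: "\u2500",  # legacy glyph C -> BOX DRAWINGS LIGHT HORIZONTAL
}


def unified_sources(table):
    """Return {char: [legacy codes]} for characters with more than one source."""
    by_char = defaultdict(list)
    for code, char in table.items():
        by_char[char].append(code)
    return {char: sorted(codes) for char, codes in by_char.items() if len(codes) > 1}


def round_trip(code, table):
    """Map legacy -> Unicode -> legacy; the lowest legacy code wins on the way back."""
    reverse = {}
    for legacy, char in sorted(table.items()):
        reverse.setdefault(char, legacy)
    return reverse[table[code]]


if __name__ == "__main__":
    print(unified_sources(LEGACY_TO_UNICODE))
    print(hex(round_trip(0x42, LEGACY_TO_UNICODE)))
```

Disunifying the pair would mean giving 0x42 its own Unicode character,
which is exactly the extra encoding cost being weighed.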

If this is a problem with a single character, I don't really buy the
cost-savings argument, especially in a case where, after adding some
extensions, a whole set could be matched. If there is a group involved,
the cost goes up.

On the other hand, I would also like to understand the benefit for the
supposed user group. Is it mainly that of avoiding a single-pixel
infidelity in display only, or are these characters that would need to
round-trip, because they might be in data that is entered on a simulated
device, processed on a Unicode system and then output again?

I think it's stupid for both sides to fight over a single pixel. Yes, it 
smells like a bad unification even though the character is arcane (but 
so are others where minute details matter even though 'nobody' is likely 
to use that character much). Having a stupidly incomplete mapping can be 
frustrating, but is being unfaithful going to impact users in any 
noticeable way?

A./
