Corrigendum #9

Karl Williamson
Sun Jun 8 10:47:16 CDT 2014

On 06/07/2014 10:33 PM, Asmus Freytag wrote:
> On 6/7/2014 9:19 PM, Karl Williamson wrote:
>> On 06/02/2014 11:00 AM, Shawn Steele wrote:
>>> To further my understanding, can someone provide examples of how
>>> these are used in actual practice?  I can't think of any offhand and
>>> the closest I get is like the old escape characters to get a dot
>>> matrix printer to shift modes, or old word processor internal
>>> formatting sequences.
>> Here's an example of a possible use.  20 some years ago I wrote a
>> front-end to the Unix diff utility.  Showing the differences between
>> files (usually 2 versions of the same program's code) is an extremely
>> common programming activity.  I do it many times a day.  One reason is
>> to try to find out why a bug has crept in.  In doing so, there are
>> some differences that are not relevant to the task at hand, and their
>> being shown is a significant distraction. For example, in programming,
>> one might have renamed a variable (identifier) because its purpose has
>> changed somewhat and the name should accurately reflect its new
>> function so the reader is not subconsciously misled.  It would be nice
>> to be able to suppress the variable name changes from the difference
>> display. There could be thousands of them.  By changing the name in
>> each file version to the same noncharacter during the diff, these
>> differences won't be displayed, and there would not be any possible
>> conflict with the input files having that noncharacter in them.  (For
>> display the noncharacter is changed back to the original value in its
>> respective file)  Further, one might want to ignore the name changes
>> of two variables.  Just use a second noncharacter, up to 66.
>> I wrote this long before noncharacters were available.  What I do
>> instead is scan the files for rarely used characters until I find
>> enough ones that aren't in the files.  For example U+009F is unlikely to
>> appear.  Scanning the files takes time.  This step could be omitted
>> for noncharacters that are known to be illegal in the input.
> This "illegal in the input" so "I'm free to assume I can use them for my
> purposes" was definitely the primary(!) design goal discussed when the
> set of 32 were added to Unicode. Having UTC backpedal from that, many
> years after original design, based on a single meeting and without
> public review is really a breakdown of the process.
> A./

I should note that this front-end to 'diff' changes the input files,
writes the modified versions out, and calls 'diff' with those modified
files as its inputs.  By using noncharacters, it would be depending on
'diff' 1) not to use them itself and 2) not to filter them out, and
3) on the system being able to store and retrieve them in files.
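
To make the idea concrete, here is a rough sketch in Python.  It is not
my actual utility: the function names are made up for illustration, it
works on in-memory strings, and it uses Python's difflib rather than
writing files out and calling the external 'diff', so dependencies 1)
through 3) do not really arise in it.  It only shows the masking and
restoring steps, and it refuses to run if the inputs already contain a
noncharacter it wants to use.

```python
import difflib
import re

# The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code points
# (xxFFFE and xxFFFF) of each of the 17 planes.
NONCHARACTERS = [chr(cp) for cp in range(0xFDD0, 0xFDF0)] + [
    chr((plane << 16) | low) for plane in range(17) for low in (0xFFFE, 0xFFFF)
]


def mask(text, name, placeholder):
    """Replace whole-word occurrences of `name` with `placeholder`."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.sub(pattern, lambda _match: placeholder, text)


def diff_ignoring_renames(old, new, renames):
    """Diff two texts while hiding identifier renames (illustrative only).

    `renames` is a list of (old_name, new_name) pairs.  Each pair is
    mapped to one noncharacter in both texts before diffing.
    """
    if len(renames) > len(NONCHARACTERS):
        raise ValueError("more renames than available noncharacters")
    placeholders = NONCHARACTERS[: len(renames)]
    for ch in placeholders:
        if ch in old or ch in new:
            raise ValueError("input already contains U+%04X" % ord(ch))

    restore_old, restore_new = {}, {}
    for (old_name, new_name), ch in zip(renames, placeholders):
        old = mask(old, old_name, ch)
        new = mask(new, new_name, ch)
        restore_old[ch] = old_name
        restore_new[ch] = new_name

    result = []
    for line in difflib.unified_diff(
        old.splitlines(), new.splitlines(), "old", "new", lineterm=""
    ):
        # Removed lines get the old file's names back; added and context
        # lines get the new file's names (an arbitrary choice for context).
        table = restore_old if line.startswith("-") else restore_new
        for ch, name in table.items():
            line = line.replace(ch, name)
        result.append(line)
    return result
```

Calling diff_ignoring_renames(old_text, new_text, [("count", "total")])
would then report only the lines that differ in some way other than the
rename of 'count' to 'total'.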

I think a revision to the text was advisable to clarify that 2) and 3)
were acceptable.  I haven't heard anybody on this thread disagree with
that.

But item 1) shows how tricky this issue really is.  My utility looks 
like a fancier 'diff' to those people who call it, so they would be 
justified in wanting it not to use noncharacters because they have their 
own purposes for them.  If some of those callers were themselves 
utilities, their callers might want to use noncharacters for their own 
purposes.  And so on and so on.

I don't have a good answer, except to say that Asmus' characterization 
above looks reasonable.

The purpose of public reviews is to try to get a broad range of ideas, 
and if none are forthcoming, then the fact that there was such a review 
should be an adequate defense of the ultimate decision.  Not holding a 
review is an invitation to lingering suspicions on the part of the 
public about the motives behind any such decision.  These can fester and 
the trust level is permanently diminished.  There will always be people 
who won't like the decision, and who will assume that the deciders are 
malevolent.  But the vast majority will accept a decision that seems to 
have been made in good faith after public input.

This is just how things work, no matter what the venue or issue.  It may 
be that the UTC thought this was minor enough to not require a review, 
but if so, time has shown that to have been an incorrect perception.
