Characters that should be displayed?

Jukka K. Korpela jkorpela at cs.tut.fi
Sun Jun 29 16:02:59 CDT 2014


2014-06-29 21:44, Koji Ishii wrote:

> The spec currently has the following text[2]:
>
>> Control characters (Unicode class Cc) other than tab (U+0009), line
>> feed (U+000A), and carriage return (U+000D) are ignored for the
>> purpose of rendering. (As required by [UNICODE], unsupported
>> Default_ignorable characters must also be ignored for rendering.)
>
> and there’s a feedback saying that CSS should display visible glyphs
> for these control characters.

That would change the identity of the characters. They are by definition 
“control characters”, i.e. they have no visible glyphs, but they may 
have control effects. However, it might be argued that rendering them 
somehow would not mean normal rendering but be a diagnostic indication 
of an error. Those characters are invalid in HTML and XML (except XML 
1.1, but who uses it?).

However, the tradition of web browsers is permissive in order to be 
user-friendly. E.g., a casual control character somewhere might be 
interesting to a *developer* or maintainer to notice, so that he could 
analyze and fix the problem that caused it, but to a *user* (visitor), 
it would mostly be just disturbing. He can’t fix the problem, and it is 
mostly useless to him to see that the page has some control character in 
the source. So *developer tools* should indicate such problems or 
provide ways to detect them, but it seems correct to ignore them in normal 
rendering.
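
As a rough illustration of the quoted spec text, the rule can be sketched 
in Python. The helper names `is_ignored_control` and 
`strip_ignored_controls` are hypothetical, and the sketch only models the 
classification, not how any browser actually implements it.

```python
import unicodedata

# Control characters that the quoted spec text does NOT ignore:
# tab (U+0009), line feed (U+000A), carriage return (U+000D).
KEPT_CONTROLS = {"\t", "\n", "\r"}


def is_ignored_control(ch: str) -> bool:
    """True for class Cc characters other than tab, LF, and CR."""
    return unicodedata.category(ch) == "Cc" and ch not in KEPT_CONTROLS


def strip_ignored_controls(text: str) -> str:
    """Drop the characters that would be ignored for rendering.

    Hypothetical helper: it models only the rule in the quoted text.
    """
    return "".join(ch for ch in text if not is_ignored_control(ch))
```

A developer tool could use the same predicate the other way round, to 
report the offending characters instead of dropping them.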

> Since all major browsers do not display
> them today, this is a breaking-change

Well, I would not take that as a strong argument. This would be a change 
in error processing. But it would not be useful for other reasons.

> I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring
> Characters in Processing”[3]:
>
>> Surrogate code points, private-use characters, and control
>> characters are not given the Default_Ignorable_Code_Point property.
>> To avoid security problems, such characters or code points, when
>> not interpreted and not displayable by normal rendering, should be
>> displayed in fallback rendering with a fallback glyph
>
> By looking at this, my questions are as follows:
>
> 1. Should control characters that browsers do not interpret be
> displayed in fallback rendering?

It is reasonable to interpret that there are no such control characters, 
because all control characters except those with special handling are 
interpreted as being invalid data and therefore ignored.

> 2. Should private-use characters
> (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be
> displayed in fallback rendering?

They might be seen as “not displayable by normal rendering”, so yes. On 
the practical side, although Private Use characters should not be used 
in public information interchange, they are increasingly popular in 
“icon font” tricks. Whatever we think of such tricks, users should not 
be punished for them. If the trick fails (usually because a page uses a 
downloadable font for icon glyphs allocated to Private Use codepoints 
but something prevents the use of such a font), it is relevant to the 
user to know that there is *some* data, which can be crucial (e.g., an 
item in a navigation menu). So some dull fallback rendering is probably 
better than simply ignoring the characters.
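
A minimal sketch of such a fallback, assuming the three Private Use ranges 
listed in the question. The `font_has_glyph` callback is a hypothetical 
stand-in for a real font lookup, and the choice of U+FFFD as the 
placeholder is only an assumption made for illustration.

```python
# The three Private Use ranges listed in the question.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

# Assumption: any dull placeholder would do; U+FFFD is used here.
FALLBACK_GLYPH = "\uFFFD"


def is_private_use(ch: str) -> bool:
    """True if the character lies in one of the Private Use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def render_with_fallback(text: str, font_has_glyph) -> str:
    """Replace Private Use characters that lack a glyph with a placeholder.

    font_has_glyph is a hypothetical callback taking one character and
    returning True when the active font can display it.
    """
    return "".join(
        FALLBACK_GLYPH if is_private_use(ch) and not font_has_glyph(ch) else ch
        for ch in text
    )
```

With an icon font that failed to load, `render_with_fallback("Home \ue001", 
lambda ch: False)` keeps the label and marks the missing icon rather than 
silently dropping it.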

> 3. When the above text says “surrogate code points”, does that mean
> everything outside BMP?

No, it means code points that do not represent *any* characters due to 
being in certain special areas in the coding space. They are invalid in 
HTML and in XML. If they appear in data, the reason is usually that 
UTF-16 encoded data containing non-BMP characters is being processed in 
a wrong way. At the level of interpreting a byte stream as a stream of 
characters, surrogate code *units* in UTF-16 should be processed and 
interpreted in pairs so that one pair is taken as one character. And 
when CSS gets at it, it only sees the character in the DOM.
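
To make the distinction concrete, here is a small sketch of how a pair of 
UTF-16 surrogate code units maps to one code point. The function names are 
hypothetical; the arithmetic is the standard UTF-16 decoding rule.

```python
def is_surrogate_code_point(cp: int) -> bool:
    """True for U+D800..U+DFFF, which never represent characters alone."""
    return 0xD800 <= cp <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, the code units D83D DE00 combine to U+1F600, a non-BMP 
character that is not itself a surrogate code point.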

It is adequate to ignore surrogate code points, since they are invalid 
and signalling them to users (as opposed to developers) would hardly do 
any good.

> 4. Should every code point that are not
> given the Default_Ignorable_Code_Point property and that without
> interpretations nor glyphs displayed in fallback rendering? I could
> not find such statement in Unicode spec, but there are some people
> who believe so.
> 5. Is there anything else Unicode recommends to
> display in fallback rendering, or not to display? This must be RTFM,
> but pointing out where to read would be appreciated.

From the Unicode point of view, an implementation may decide what 
characters it supports. What it does to characters that it does not 
support seems to be generally up to the implementation to decide as 
regards rendering. Here, too, I would consider the practical impact 
on users. If a page contains characters that have no glyphs in the fonts 
that are used, then the page has data that is probably valid but cannot 
be rendered in a particular situation. Showing some indication of this 
is relevant, because the user knows he is missing something real, and he 
might be able to fix the situation in various ways (e.g., changing 
browser settings, downloading and installing extra fonts, or just 
switching to a different browser – browsers are known to differ in their 
abilities to use the fonts installed in a system).

Yucca



