Characters that should be displayed?

Koji Ishii kojiishi at
Mon Jun 30 13:35:22 CDT 2014

Thank you all for the great feedback. I learned a lot.

I, however, still don’t get one thing. In the spec text:

Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph
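
To make the three classes in the quoted text concrete, here is a minimal Python sketch that tests a character's General_Category against Cs (surrogate), Co (private use), and Cc (control). The name `is_fallback_candidate` is hypothetical. The sketch covers only the category test; whether a character is "not interpreted and not displayable" depends on the renderer and the available fonts, which it does not model.

```python
import unicodedata

# General_Category values for the classes named in the quoted text:
# Cs = surrogate, Co = private use, Cc = control.
FALLBACK_CATEGORIES = {"Cs", "Co", "Cc"}

def is_fallback_candidate(ch):
    """Return True if ch belongs to one of the three named classes.

    Hypothetical helper: it checks the category only and knows nothing
    about what a given renderer or font can actually display.
    """
    return unicodedata.category(ch) in FALLBACK_CATEGORIES

if __name__ == "__main__":
    for ch in ["\x00", "\ue000", "\ud800", "A", "\u200b"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_fallback_candidate(ch))
```

Note that U+200B ZERO WIDTH SPACE has category Cf, so it falls outside these three classes.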

How could displaying a missing PUA glyph help security? I can imagine that an address bar could carry such security risks, but this is about rendering. I can imagine that 0x00 could lead to buffer overflow attacks, but it seems to me that preventing such characters from being inserted into the DOM is safer, though I admit that I’m no security professional.

I understand that some here want to display them to help users identify broken characters, while others consider that it doesn’t help users at all. I tend to agree with the latter, but either way, it’s about helping users fix their documents.

Does anyone know what security risks the spec is talking about?


On Jul 1, 2014, at 1:33 AM, Philippe Verdy <verdy_p at> wrote:

I generally agree with your comment.

For your question: U+FFFD is not special in CSS. It's just a standard character that will be mapped to some symbol, either from any font or synthesized from an internal font (or collection of glyphs) of the renderer according to other styles. There's no guarantee that styles like italic or bold will look different; in fact, there's no good way to exhibit alternatives if the renderer does not look up a matching font, but at least the renderer should size it according to the computed "font-size:" setting. That symbol is often, but not necessarily, a "white" question mark in a "black" diamond; in fact, replace "white" by the background color/image/shades, and "black" by the "color:" setting, just as when regular fonts map any other symbol.
This symbol should also have an inherited direction, not a strong LTR direction: it should not alter the direction of text on either side (or break runs of text) for Bidi rendering. It may eventually be mirrored in resolved RTL runs if this is appropriate for the chosen glyph, which is not always easy to determine if the symbol is chosen from a matching font in context. But as the symbol to use is quite arbitrary and should be distinctive enough from other characters, this mirroring is not really necessary, unless the symbol shows some explicit text in a specific style; that is something to avoid, as the character is not specific to any script or language.

2014-06-30 17:59 GMT+02:00 Konstantin Ritt <ritt.ks at>:
2014-06-29 22:24 GMT+03:00 Asmus Freytag <asmusf at>:
but things get harder the more I think:

3. When the above text says “surrogate code points”, does that mean everything outside BMP? It reads so to me, but I’m surprised that characters in BMP and outside BMP have such differences, so I’m doubting my English skill.

No, those would be supplementary code points. Surrogates are values that are intended to be used in pairs as code units in UTF-16. Ill-formed data may contain unpaired values; those are referred to as surrogate code points.
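
The distinction can be sketched in a few lines of Python. The function names `classify` and `utf16_units` are made up for this illustration; the ranges and the pair arithmetic follow the UTF-16 encoding form.

```python
def classify(cp):
    """Classify a code point as 'surrogate', 'supplementary', or 'bmp'."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"       # inside the BMP, reserved for UTF-16 code units
    if cp > 0xFFFF:
        return "supplementary"   # outside the BMP, an ordinary code point
    return "bmp"

def utf16_units(cp):
    """Return the UTF-16 code units for a non-surrogate code point."""
    if cp <= 0xFFFF:
        return [cp]
    v = cp - 0x10000
    # High (lead) surrogate carries the top 10 bits, low (trail) the bottom 10.
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

if __name__ == "__main__":
    print(classify(0x1F600), [hex(u) for u in utf16_units(0x1F600)])
    print(classify(0xD83D))
```

So U+1F600 is a supplementary code point that UTF-16 encodes as the pair D83D DE00, while a D83D that appears without a following trail value is a surrogate code point in ill-formed data.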

IIRC, after HTML parsing, validation, and DOM construction, no lone surrogate code point should remain, since the presence of any ill-formed data in Unicode text makes the whole text ill-formed.
It's a security recommendation for decoders to replace any unpaired surrogate code point with U+FFFD instead, thus making the text well-formed. As a side effect, the unpaired surrogate code point becomes visible (usually as a square box fallback glyph).
What is the consideration regarding U+FFFD in CSS?
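
The replacement behaviour described above might look roughly like the following Python sketch, which walks a sequence of UTF-16 code units and substitutes U+FFFD for every unpaired surrogate. The function `decode_utf16_units` is a hypothetical, hand-written decoder for illustration only, not the algorithm of any particular HTML parser.

```python
REPLACEMENT = "\uFFFD"

def decode_utf16_units(units):
    """Decode UTF-16 code units, replacing unpaired surrogates with U+FFFD."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone high or low surrogate: emit the replacement character.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)

if __name__ == "__main__":
    print(ascii(decode_utf16_units([0x41, 0xD800, 0x42])))
```

The output string is always well-formed, and each lone surrogate shows up as one U+FFFD that the renderer then draws with whatever glyph it maps to that character.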

