Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates)
kenwhistler at att.net
Mon Oct 5 12:11:39 CDT 2015
Section 3.5, Properties, of the standard attempts to address this.
"Code point properties" are properties of the code points, per se, and
clearly do have all code points (U+0000..U+10FFFF) in their scope.
An example is the Surrogate code point property, which wouldn't
make much sense if it didn't apply to surrogate code points!
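
As a rough illustration (a sketch, not anything taken from the standard's text), one can ask a property API about every surrogate code point. The example assumes Python's standard unicodedata module and uses General_Category (value Cs) as a stand-in for the surrogate property; the helper name is_surrogate_code_point is made up for the example.

```python
import unicodedata

def is_surrogate_code_point(cp: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

# A Python str may hold a lone surrogate, so the lookup can be made directly.
# Collect the General_Category reported for every surrogate code point.
surrogate_categories = {
    unicodedata.category(chr(cp)) for cp in range(0xD800, 0xE000)
}
print(surrogate_categories)
```

The lookup succeeds for each surrogate code point without raising, which is the sense in which a code point property has all code points in its scope.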
"Encoded character properties" are properties of the characters
themselves -- attributes like Ideographic or Numeric_Value. Those
are given *default* values for all reserved code points (and
for noncharacter and PUA code points). In principle, the scope should be
all Unicode scalar values: U+0000..U+D7FF, U+E000..U+10FFFF,
because it doesn't make much sense to talk about character properties
for code points that are ill-formed and which cannot ever actually
represent a character.
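
The scalar value range given above can be written as a simple predicate. This is only a sketch in Python; the names is_scalar_value and can_encode_utf8 are invented for the example. The second helper relies on Python's strict UTF-8 codec refusing to encode surrogate code points.

```python
def is_scalar_value(cp: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

def can_encode_utf8(cp: int) -> bool:
    """True if Python's strict UTF-8 codec will encode this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for surrogate code points, which have no well-formed encoding.
        return False

# Boundary code points around the surrogate range.
for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000):
    print(f"U+{cp:04X}", is_scalar_value(cp), can_encode_utf8(cp))
```

For these boundary cases the two helpers agree: the code points excluded from the scalar value range are the ones the encoder rejects.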
However, in practice, it is simplest to extend the *default* values of
encoded character properties to the surrogate code points, so that
in the cases where they occur in ill-formed text, APIs and
applications have some hope of doing something useful,
rather than just reacting exceptionally to featureless singularities
embedded in text.
Hence, the bullet in the text in the standard:
* For each encoded character property there is a mapping from every
code point to some value in the set of values associated with that property.
There is nothing in the standard, as I read it, that imposes a conformance
requirement on any process that would *require* it to interpret
an isolated surrogate code point and give it a particular property value.
However, it would be reasonable (and permitted) for an API to actually
report a default value for a surrogate code point (i.e., treating it more
or less like the reserved code point U+50005 that Marcus mentioned).
Such behavior in a character property API is likely to result in more
graceful behavior than simply throwing exceptions.
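
A minimal sketch of such an API, assuming Python's standard unicodedata module as the underlying data source. The wrapper names numeric_value and general_category are hypothetical, and None is used here merely as a placeholder for "no numeric value"; it is not the default that the UCD itself documents for Numeric_Value.

```python
import unicodedata

def numeric_value(cp: int, default=None):
    """Numeric_Value for any code point, or `default` if it has none."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # unicodedata.numeric returns `default` instead of raising when the
    # code point has no numeric value, including surrogates and
    # unassigned (reserved) code points.
    return unicodedata.numeric(chr(cp), default)

def general_category(cp: int) -> str:
    """General_Category for any code point, surrogates included."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return unicodedata.category(chr(cp))

for cp in (0x0035, 0xD800, 0x50005):
    print(f"U+{cp:04X}", general_category(cp), numeric_value(cp))
```

With this shape, a caller scanning ill-formed text gets an ordinary value back for an isolated surrogate, much as it would for a reserved code point, and only a value outside U+0000..U+10FFFF is treated as an error.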
On 10/4/2015 12:30 PM, Richard Wordingham wrote:
> Do all Unicode character properties extend to all codepoints? If not,
> how does one tell which do and which don't? ...