Block Boundaries (was: RE: Corrigendum #9)

Whistler, Ken ken.whistler at sap.com
Fri May 30 14:50:37 CDT 2014


Skipping over the wording related to noncharacters for the moment,
let me address the block stability issue:

> I also am curious as to why the consecutive group of 32 noncharacters
> can't be split off into its own block instead of being part of an Arabic
> one.  I'm unaware of any stability policy forbidding this.  Another
> block is to be split, if I recall correctly, to accommodate the new
> Cherokee characters.

Actually, this is *not* correct.

The Latin Extended-E block *will* be first published in Unicode 7.0
next month. In the charts for that version and in Blocks.txt, the
range for Latin Extended-E is AB30..AB6F.

True, it was initially approved with a more extended range, and was
long shown with the longer range in the Roadmap. But the Roadmap
is just a "roadmap", and not the standard. The new range allocated
to the Cherokee Supplement (AB70..ABBF) is in ballot now, so that
allocation is not final, although I personally consider it unlikely to change
before publication next year.

At any rate, the revision of the range for the Latin Extended-E block
occurred before actual publication of that block.

The net net here is that the last major churning of block boundaries dates
all the way back to Unicode 1.1 times and the great Hangul Catastrophe.
And the last time any formal block boundary was touched was in 2002,
when all blocks were firmly ended on xxxF boundaries as part of synchronizing
documentation between the Unicode Standard and 10646.
And while there is indeed no actual stability guarantee in place that would
absolutely prevent the UTC or SC2 from adjusting a block boundary if it
decided to, the committees are very unlikely to do so, for the reasons
that Asmus cited.

Keep in mind that even if the UTC, for some reason, decided it would be
a cool idea to split the Arabic Presentation Forms-A block into a new, shorter
range and two new blocks, just so FDD0..FDEF could have its own
block identity for the noncharacter range, it would be rather likely that
a fight would then ensue in the SC2 framework over balloting for such
a change to be synchronized in 10646. Nobody has the stomach for
that kind of a pointless fight over something with such marginal relevance
and benefit.

If people want to *fix* this, assuming that "this" is an actual problem,
then the issue, as I see it, isn't really block ranges per se, which don't
mean a whole lot outside of regular expressions that may use them.
Instead, the issue is the de facto alignment of chart presentation with
block boundaries. Jiggering the chart production to *present* the
range FB50..FDFF as three *chart* units, instead of one, would solve
most of the problem for all but the most hardcore Unicode metaphysicians
out there. ;-)
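
To make the "three chart units" idea concrete, here is a minimal Python
sketch of the arithmetic. It is illustrative only: the helper name
split_around is hypothetical and has nothing to do with the actual chart
production tooling. The two ranges are the ones discussed above.

```python
# Inclusive code point ranges taken from the discussion above.
BLOCK = (0xFB50, 0xFDFF)     # Arabic Presentation Forms-A
NONCHARS = (0xFDD0, 0xFDEF)  # the 32 contiguous noncharacters


def split_around(block, hole):
    """Split an inclusive range into the parts before, at, and after `hole`.

    Hypothetical helper for illustration; both arguments are (lo, hi) tuples.
    """
    (b_lo, b_hi), (h_lo, h_hi) = block, hole
    if not (b_lo <= h_lo <= h_hi <= b_hi):
        raise ValueError("hole must lie inside block")
    parts = []
    if b_lo < h_lo:
        parts.append((b_lo, h_lo - 1))   # part before the hole
    parts.append((h_lo, h_hi))           # the hole itself
    if h_hi < b_hi:
        parts.append((h_hi + 1, b_hi))   # part after the hole
    return parts


chart_units = split_around(BLOCK, NONCHARS)
for lo, hi in chart_units:
    print(f"{lo:04X}..{hi:04X}")
```

This yields three presentation ranges while leaving the single block
range in Blocks.txt untouched, which is the point of the suggestion.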

BTW, for those worried about the FDD0..FDEF range of noncharacters
having to live in a mixed neighborhood in the Arabic Presentation Forms-A
block, remember that we have lived since 2002 with the BOM itself
residing in the Arabic Presentation Forms-B block. Nobody seems to get
too worked up any more about that particular funky address.
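
As a rough illustration of those "addresses", here is a small Python
sketch of a block lookup. The table is a hand-copied excerpt of three
ranges, not something read from Blocks.txt, and block_of is a
hypothetical helper name. Python's unicodedata module does not expose
the Block property, hence the explicit table.

```python
# Hand-copied excerpt of block ranges (inclusive); an assumption-laden
# stand-in for parsing the real Blocks.txt data file.
BLOCKS = [
    (0xAB30, 0xAB6F, "Latin Extended-E"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
]


def block_of(cp):
    """Return the block name for code point `cp`, or None if the code
    point falls outside this partial table."""
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return None


# The noncharacter range and the BOM both report an Arabic block.
for cp in (0xFDD0, 0xFDEF, 0xFEFF):
    print(f"U+{cp:04X}: {block_of(cp)}")
```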

--Ken





