trying to understand the relationship between the Version 1 Hangul syllables and the later versions'

Ken Whistler kenwhistler at
Fri Jun 19 17:12:59 CDT 2015


As usual, the situation is way more complicated than perhaps it has any right to be.

It isn't just Version 1 Hangul that has to be considered, but also
Version 1.1 Hangul.

Version 1.0 contained 2350 Hangul syllables, encoded in the range 3400..3D2D.

Version 1.1 contained 6656 Hangul syllables, encoded in the range 3400..3D2D
and a distinct new range 3D2E..4DFF. It thus added 4306 to what was in
Version 1.0 already.

Version 2.0 (and all subsequent versions) contained the 11172 Hangul
syllables we now see, encoded in the range AC00..D7A3. Version 2.0
*deleted* all the Hangul syllables in the range 3400..4DFF.
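The syllable counts above follow directly from the sizes of the ranges. A minimal sketch of that arithmetic, where the helper name span_size is just an illustrative choice:

```python
def span_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


# Ranges as given in the text above (inclusive on both ends).
RANGES = {
    "Version 1.0 syllables, 3400..3D2D": (0x3400, 0x3D2D),
    "Version 1.1 additions, 3D2E..4DFF": (0x3D2E, 0x4DFF),
    "Version 2.0 syllables, AC00..D7A3": (0xAC00, 0xD7A3),
}

for label, (first, last) in RANGES.items():
    print(f"{label}: {span_size(first, last)} code points")
```

This reports 2350, 4306 and 11172 code points respectively.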

You also need to pay attention to the history of the encoding of jamo.

Version 1.0 contained 94 "Hangul Elements", encoded in the range 3131..318E.

Version 1.1 retained the same 94 "Hangul Letters" in the range 3131..318E.
Version 1.1 added 240 conjoining jamo letters in the range 1100..11F9.

Version 2.0 retained both of those sets.
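Note that 1100..11F9 spans 250 code points, so the 240 conjoining jamo did not fill that range completely. One way to check both jamo counts is Python's unicodedata module, which carries a frozen copy of the Unicode 3.2 database as ucd_3_2_0. This is only a sketch, and it assumes the jamo repertoire in these two ranges did not change between Version 2.0 and Version 3.2; the helper name assigned_count is hypothetical.

```python
import unicodedata

# Frozen Unicode 3.2 database shipped with Python. Assumption: the jamo
# in the two ranges below were unchanged between Version 2.0 and 3.2.
ucd = unicodedata.ucd_3_2_0


def assigned_count(first: int, last: int) -> int:
    """Count code points in first..last that are assigned (category != Cn)."""
    return sum(
        1
        for cp in range(first, last + 1)
        if ucd.category(chr(cp)) != "Cn"
    )


print("3131..318E assigned:", assigned_count(0x3131, 0x318E))
print("1100..11F9 assigned:", assigned_count(0x1100, 0x11F9))
print("1100..11F9 span:    ", 0x11F9 - 0x1100 + 1)
```

If that assumption holds, the first two lines should report 94 and 240, against a span of 250 for the second range.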

O.k., now what were those various chunks?

The Unicode 1.0 set of 2350 was encoded for compatibility with KS C 
They were given no formal decompositions (the concept didn't yet exist), but
the implication in the standard was essentially that Hangul syllables could
just be spelled out with jamo letter sequences. The details were an exercise
for implementation, however, and were soon overtaken by events in
the Unicode/10646 merger.

The Unicode 1.1 set of 4306 additions came from the 10646 merger work,
and comprised two actual subsets:

Hangul Supplementary Syllables A (1930 modern syllables) from KS C 
(See the Unicode 1.1 subrange: 3D2E..44BD.)

Hangul Supplementary Syllables B (2376 old Korean syllables) from KS C 
(See the Unicode 1.1 subrange: 44BE..4DFF.)

*All* of the Unicode 1.1 Hangul syllables were given decompositions.
(Although the formalization of Unicode normalization did not yet exist.)
The decompositions can be seen in UnicodeData-1.1.5.txt. Because the
syllables were then encoded in three "alphabetical" extents, with a few
stragglers tucked on, the decompositions were not algorithmically
defined -- they were just enumerated in the data file. The decompositions
involved the new set of conjoining jamo letters, rather than the older set,
which were relegated to compatibility mapping status.

The Unicode 2.0 set of 11,172 was known as the "Johab" set from KS C 
That was an algorithmically designed replacement of the earlier sets from
Korean standards -- designed to cover all modern syllables algorithmically,
by putting all the combinations of initial, medial and final jamos in
alphabetical order, whether or not each resulting syllable was actually
attested in modern Korean use.
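The algorithmic arrangement can be sketched as follows. The constants (19 initials, 21 medials, 27 finals plus "no final", and the jamo base code points) are taken from the Hangul syllable arithmetic in the Unicode Standard, not from this message, and the function names compose and decompose are just illustrative.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable arithmetic.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first conjoining initial consonant
V_BASE = 0x1161   # first conjoining medial vowel
T_BASE = 0x11A7   # one before the first conjoining final consonant
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no final"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # 11172


def compose(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Syllable for the given initial, medial and final indices."""
    if not (0 <= l_index < L_COUNT
            and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def decompose(syllable: str) -> str:
    """Conjoining jamo sequence for a precomposed syllable in AC00..D7A3."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + l_index) + chr(V_BASE + v_index)
    if t_index:  # index 0 means the syllable has no final consonant
        jamo += chr(T_BASE + t_index)
    return jamo


# Cross-check against the canonical decomposition Python itself applies.
sample = "\uD55C"  # HANGUL SYLLABLE HAN
print(decompose(sample) == unicodedata.normalize("NFD", sample))
```

Because every combination gets a slot, 19 * 21 * 28 = 11172 code points are used, and the first and last slots land exactly on AC00 and D7A3.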

There was an enormous hullabaloo at the time, of course, about the changes
required to switch over from the old ranges to the new set. But the whole
shebang was balloted as Amendment 5 to ISO/IEC 10646-1:1993, and when
that ballot passed, Unicode adopted the change wholesale into the
documentation and data files for Unicode 2.0, to stay in synch.

But "The Korean Mess", as it was then known, led directly to the realization
by both SC2 and the UTC that such re-encoding of already standardized
and published characters was enormously damaging to both standards.
It was also expensive to the early implementers: Oracle, for example, long
maintained distinct database support for the Unicode 1.1 Korean, which was
incompatible with the Unicode 2.0 Korean.

In any case, if anybody has any lingering questions about why the following
policy exists and is *strictly* enforced:

or why the applicable version for that stability policy is 2.0+, the answer is
that it was a direct reaction to "The Korean Mess".


On 6/19/2015 1:29 PM, Karl Williamson wrote:
> I haven't found any information on this.  It can't just be a 
> transliteration difference, because the number of code points is 
> vastly different between them.
> Is it the case that the version 1 syllables is a failed abstraction 
> that was replaced by the later versions?
