"A Programmer's Introduction to Unicode"

Janusz S. Bień jsbien at mimuw.edu.pl
Sun Mar 12 00:04:56 CST 2017

On Fri, Mar 10 2017 at 19:55 CET, manish at mozilla.com writes:
> I recently wrote
> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> , which sort of addresses the whole hangup programmers have with
> treating code points as "characters".


This is just another confirmation that the present Unicode terminology
is confusing. Let me recall below a fragment of an old thread about

Best regards


On Thu, Sep 15 2016 at 21:12 CEST, jsbien at mimuw.edu.pl writes:
> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:
> [...]
>> In the new Swift programming language, which is white-hot in the Apple
>> community, Apple is moving toward a model of a transparent, generic
>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>> but in which a “character” contains however many code points it needs
>> (“e” with a stacked macron, acute accent, and dieresis is
>> algorithmically one “character” in Swift). Moreover,
>> e-with-an-acute-accent and e followed by a combining acute accent, for
>> example, compare as equal. At present, the underlying code is still
>> UTF-16LE.
> For several years I have used the name "textel" (text element, in Polish
> "tekstel") for such objects. I do so mostly orally in presentations
> for my students, but I have also used it in writing, e.g. in
> http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
> definition. I provided a rudimentary definition only in my
> recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply
> (on p. 69) "an elementary text element independently of its Unicode
> representation" (meaning in particular composed vs precomposed). I still
> hope to formulate sooner or later a more satisfactory definition :-)
> I think Swift confirms that such a notion is really needed.
> Best regards
> Janusz
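
To make the distinction in the quoted message concrete, here is a minimal
Python sketch (Python rather than Swift, and not part of the original
thread) of the "e" with stacked macron, acute accent and dieresis. It shows
how many code points and how many UTF-8, UTF-16 and UTF-32 code units that
single "character" occupies. The variable names are my own, and the last
count is only a crude approximation: it counts code points with canonical
combining class 0, which is an assumption that works for this example but
is not the extended grapheme cluster segmentation that Swift uses.

```python
import unicodedata

# "e" followed by combining macron, combining acute accent, combining diaeresis
s = "e\u0304\u0301\u0308"

code_points = len(s)                            # 4 code points
utf8_units = len(s.encode("utf-8"))             # 7 UTF-8 code units (bytes)
utf16_units = len(s.encode("utf-16-le")) // 2   # 4 UTF-16 code units
utf32_units = len(s.encode("utf-32-le")) // 4   # 4 UTF-32 code units

# Canonical combining class of each code point: 0 for the base letter,
# non-zero for the three combining marks stacked on it.
combining_classes = [unicodedata.combining(c) for c in s]

# Crude approximation only (an assumption of this sketch): treat every
# code point with combining class 0 as the start of a new text element.
approx_text_elements = sum(1 for c in s if unicodedata.combining(c) == 0)

print(code_points, utf8_units, utf16_units, utf32_units)
print(combining_classes, approx_text_elements)
```

Whatever view is chosen (UTF-8, UTF-16 or UTF-32), the number of code units
differs from the one text element a reader perceives.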

On Wed, Sep 21 2016 at  6:44 CEST, jsbien at mimuw.edu.pl writes:
> On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes:
>> Janusz Bień wrote:
>>> For me it means that Swift's characters are equivalence classes of the
>>> set of extended grapheme clusters under the canonical equivalence relation.
>> I still hope we can come to some conclusion on the correct Unicode name
>> for this concept. I don't think non-Unicode interpretations of terms
>> like "grapheme" are grounds for throwing out "grapheme cluster,"
> I agree.
>> but I can see that the equivalence class itself is lacking a name.
> I'm glad.
>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>> are identical entities, only that the language compares them as equal.
> I'm fully aware of this.
> Best regards
> Janusz
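
The point about <00E9> and <0065 0301> can be illustrated with a short
Python sketch (again my own illustration, not Swift and not from the
thread). The two sequences remain different as stored, but they share one
representative under canonical normalization, so that representative can
serve as a label for the equivalence class. The helper names canonical_key
and canonically_equal are hypothetical, and using the NFD form as the class
label is simply one convenient choice; NFC would serve equally well.

```python
import unicodedata
from collections import defaultdict

precomposed = "\u00E9"   # <00E9>, e with acute accent as one code point
decomposed = "e\u0301"   # <0065 0301>, e followed by combining acute accent

def canonical_key(text):
    """Return a representative of the canonical equivalence class (NFD form)."""
    return unicodedata.normalize("NFD", text)

def canonically_equal(a, b):
    """Compare two strings up to canonical equivalence."""
    return canonical_key(a) == canonical_key(b)

# The stored sequences are not identical entities ...
identical = precomposed == decomposed
# ... but they compare as equal up to canonical equivalence.
equivalent = canonically_equal(precomposed, decomposed)

# Group a few strings into equivalence classes keyed by the representative.
classes = defaultdict(list)
for text in [precomposed, decomposed, "e"]:
    classes[canonical_key(text)].append(text)

print(identical, equivalent, len(classes))
```

Here the first two strings fall into one class and the plain "e" into
another, which is the behaviour described for Swift's comparison.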

Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
