"A Programmer's Introduction to Unicode"

Sun Mar 12 13:43:22 CDT 2017

> This is just another confirmation that the present Unicode terminology
> is confusing.

I find this to be a symptom of our pedagogy around "characters" in
programming; most folks get taught that characters are bytes are code
points, especially because many languages try to make this the case.
The name "grapheme cluster" could be improved upon, but it's not the
primary source of this confusion.
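
As a rough illustration of why "characters are bytes are code points"
breaks down, here is a minimal Python sketch using the example quoted
below (an "e" with a stacked macron, acute accent, and dieresis). The
variable names are my own, and the last count is only a crude stand-in
for real grapheme cluster segmentation (UAX #29), which the Python
standard library does not implement.

```python
import unicodedata

# One user-perceived character: "e" followed by three combining marks
# (U+0304 macron, U+0301 acute accent, U+0308 diaeresis).
s = "e\u0304\u0301\u0308"

utf8_bytes = len(s.encode("utf-8"))            # bytes in UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units in UTF-16
code_points = len(s)                           # Python str is indexed by code point

# Crude approximation only (an assumption for illustration, not UAX #29):
# count code points whose canonical combining class is 0.
base_count = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(utf8_bytes, utf16_units, code_points, base_count)
```

Four different answers to "how long is this string?" are possible here,
and only the last one comes close to what a reader would call one
character.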
-Manish


On Sat, Mar 11, 2017 at 10:04 PM, Janusz S. Bień <jsbien at mimuw.edu.pl> wrote:
> On Fri, Mar 10 2017 at 19:55 CET, manish at mozilla.com writes:
>> I recently wrote
>> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>> , which sort of addresses the whole hangup programmers have with
>> treating code points as "characters".
>
> [...]
>
> This is just another confirmation that the present Unicode terminology
> is confusing. Let me recall below a fragment of an old thread about
> "textels".
>
> Best regards
>
> Janusz
>
>
> On Thu, Sep 15 2016 at 21:12 CEST, jsbien at mimuw.edu.pl writes:
>> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:
>>
>> [...]
>>
>>> In the new Swift programming language, which is white-hot in the Apple
>>> community, Apple is moving toward a model of a transparent, generic
>>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>>> but in which a “character” contains however many code points it needs
>>> (“e” with a stacked macron, acute accent, and dieresis is
>>> algorithmically one “character” in Swift). Moreover,
>>> e-with-an-acute-accent and e followed by a combining acute accent, for
>>> example, compare as equal. At present, the underlying code is still
>>> UTF-16LE.
>>
>> For several years I have used the name "textel" (text element, in Polish
>> "tekstel") for such objects. I do it mostly orally in my presentations
>> for my students, but I have also used it in writing, e.g. in
>> http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
>> definition. A rudimentary definition was provided only in my
>> recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply
>> (on p. 69) "an elementary text element independently of its Unicode
>> representation" (meaning in particular composed vs precomposed). I still
>> hope to formulate sooner or later a more satisfactory definition :-)
>>
>> I think Swift confirms that such a notion is really needed.
>>
>> Best regards
>>
>> Janusz
>
> On Wed, Sep 21 2016 at  6:44 CEST, jsbien at mimuw.edu.pl writes:
>> On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes:
>>> Janusz Bień wrote:
>>>
>>>> For me it means that Swift's characters are equivalence classes of the
>>>> set of extended grapheme clusters by canonical equivalence relation.
>>>
>>> I still hope we can come to some conclusion on the correct Unicode name
>>> for this concept. I don't think non-Unicode interpretations of terms
>>> like "grapheme" are grounds for throwing out "grapheme cluster,"
>>
>> I agree.
>>
>>> but I can see that the equivalence class itself is lacking a name.
>>
>> I'm glad.
>>
>>>
>>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>>> are identical entities, only that the language compares them as equal.
>>
>> I'm fully aware of this.
>>
>> Best regards
>>
>> Janusz
>
>
> --
>                            ,
> Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>
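
The distinction quoted above, that <00E9> and <0065 0301> are not
identical entities but compare as equal, can be sketched in Python with
the standard unicodedata module. This is only an illustration of
canonical equivalence, not a description of how Swift implements its
comparison; canonically_equal is a hypothetical helper name.

```python
import unicodedata

precomposed = "\u00e9"   # <00E9>: e with acute accent, one code point
decomposed = "e\u0301"   # <0065 0301>: e followed by combining acute accent

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings up to canonical equivalence
    by normalizing both to NFD before comparing code points."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The code point sequences differ, so a plain comparison says "not equal".
print(precomposed == decomposed)

# After normalization the two sequences are the same.
print(canonically_equal(precomposed, decomposed))
```

A plain == on Python strings compares code points, so the two forms
differ until they are normalized; a type whose equality is defined on
the equivalence class would hide that step from the programmer.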