"A Programmer's Introduction to Unicode"

Alastair Houghton alastair at alastairs-place.net
Mon Mar 13 14:18:00 CDT 2017


On 13 Mar 2017, at 17:55, J Decker <d3ck0r at gmail.com> wrote:
> 
> I liked Go's implementation of a character type, the rune, which is a code point, and strings that return runes by index.
> https://blog.golang.org/strings

IMO, returning code points by index is a mistake.  It over-emphasises the importance of the code point, which perpetuates the notion in some developers’ minds that code points are somehow “characters”.  It also leads people to use UCS-4 unnecessarily as an internal representation, which seems to have very few advantages in practice over UTF-16.
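To make that concrete, here is a minimal sketch in Python, whose str type happens to be indexed by code point.  The sample string and the helper name encoded_size are my own choices for illustration, not anything from the thread or from Go.  It shows an index landing on a bare combining accent rather than on anything a user would call a character, and compares storage for the same text in UTF-16 and in UTF-32 (UCS-4).

```python
import unicodedata

# "café" spelled with a combining accent: "e" followed by U+0301.
s = "cafe\u0301"

# Indexing by code point: the last element is only the accent,
# detached from the letter it modifies.
last = s[-1]
print(len(s), unicodedata.name(last))

# Reversing by code point puts the accent in front of its base letter,
# so it no longer attaches to the "e".
reversed_s = s[::-1]


def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# Same text, two internal representations.
print(encoded_size(s, "utf-16-le"), encoded_size(s, "utf-32-le"))
```

A user sees four characters here, but the string holds five code points, and for this sample the UTF-32 form takes twice the space of the UTF-16 one without making the indexing any more meaningful.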

> Doesn't solve the problem for composited codepoints though... 
> 
> texel looks to be defined as a graphic element already.  TEXture ELement.

Yes, but I thought the proposal was “textel”, with the extra “t”.  Re-using “texel” would be quite inappropriate; there are certainly people who work on rendering software who would strongly object to that, for very good reasons.

I would caution, however, that there’s already a lot of terminology associated with Unicode, perhaps for understandable reasons.  If the word “textel” is going to have a definition that differs from (say) an extended grapheme cluster, I think a great deal of consideration should be given to exactly what that definition should be.  We already have “characters”, code units, code points, combining sequences, graphemes, grapheme clusters, extended grapheme clusters and probably other things I’ve missed off that list.  Merely adding yet another term isn’t going to fix the problem of developers misunderstanding, or simply not being aware of, the correct terminology or some aspect of Unicode’s behaviour.
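As a rough illustration of how several of those terms give different counts for the same text, here is a small Python sketch.  The function name describe and the sample string are hypothetical, and the last count is only a crude stand-in for combining sequences (it counts code points whose canonical combining class is zero); it is not the grapheme cluster algorithm from UAX #29, and it will be wrong for things like ZWJ sequences.

```python
import unicodedata


def describe(text):
    """Count the same text at several of the levels Unicode names."""
    return {
        "utf8_code_units": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
        # Crude approximation only: count code points with canonical
        # combining class 0.  NOT the UAX #29 segmentation rules.
        "approx_combining_sequences": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }


# "e" + combining acute accent, followed by U+1F600 (an emoji).
sample = "e\u0301\U0001F600"
counts = describe(sample)
for name, value in counts.items():
    print(name, value)
```

Four different numbers come out for a string most users would describe as two characters, which is roughly the confusion a new term would have to cut through.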

Kind regards,

Alastair.

--
http://alastairs-place.net



