Dealing with Georgian capitalization in programming languages

Ken Whistler via Unicode unicode at unicode.org
Tue Oct 2 16:43:27 CDT 2018


On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote:
> capitalize: uppercase (or title-case) the first character of the 
> string, lowercase the rest
>
>
> When I say "cause problems", I mean producing mixed-case output. I 
> originally thought that 'capitalize' would be fine. It is fine for 
> lowercase input: it stays lowercase because Unicode Data indicates that 
> titlecase for lowercase Georgian letters is the letter itself. But it 
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE 
> input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>   UTC?
>
Not explicitly, that I recall. The whole issue of titlecasing came up 
very late in the preparation of case mapping tables for Mtavruli and 
Mkhedruli for 11.0.

But it seems to me that the problem you are citing can be avoided if you 
simply rethink what your "capitalize" means. It really should be 
conceived of as first lowercasing the *entire* string, and then 
titlecasing the *eligible* letters -- i.e., usually the first letter. 
(Note that this allows for the concept that titlecasing might then be 
localized on a per-writing-system basis -- the issue would devolve to 
determining what the rules are for "eligible" letters.) But the simple 
default would just be to titlecase the initial letter of each "word" 
segment of a string.

Note that conceived this way, for the Georgian mappings, where the 
titlecase mapping for Mkhedruli is simply the letter itself, this 
approach ends up with:

capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> 
mkhedrulistring

Thus avoiding any mixed case.
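
As a rough illustration of the difference, here is a minimal Python sketch. 
It assumes Python 3.7 or later, whose built-in str.lower() and str.title() 
use Unicode 11.0 or newer data. The function names and the sample word are 
hypothetical, chosen only for this example, and "capitalize" here touches 
only the first character of the string rather than each word segment.

```python
# Sketch only: relies on the interpreter's own Unicode tables (11.0 or later).

# The same sample word written in Mkhedruli (lowercase) and Mtavruli (uppercase).
MKHEDRULI = "\u10e5\u10d0\u10e0\u10d7\u10e3\u10da\u10d8"
MTAVRULI = "\u1ca5\u1c90\u1ca0\u1c97\u1ca3\u1c9a\u1c98"


def naive_capitalize(text: str) -> str:
    """Titlecase the first character as given, lowercase the rest."""
    if not text:
        return text
    # An Mtavruli first letter has no separate titlecase form, so it stays
    # Mtavruli while the rest becomes Mkhedruli: mixed case.
    return text[0].title() + text[1:].lower()


def capitalize(text: str) -> str:
    """Lowercase the entire string first, then titlecase the first character."""
    lowered = text.lower()
    if not lowered:
        return lowered
    # The titlecase mapping of a Mkhedruli letter is the letter itself.
    return lowered[0].title() + lowered[1:]
```

With these definitions, capitalize(MTAVRULI) and capitalize(MKHEDRULI) both 
return the all-Mkhedruli string, whereas naive_capitalize(MTAVRULI) keeps an 
Mtavruli first letter in front of a Mkhedruli remainder.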

--Ken


