Dealing with Georgian capitalization in programming languages

Philippe Verdy via Unicode unicode at unicode.org
Tue Oct 2 16:07:56 CDT 2018


I see no easy way to convert ALL UPPERCASE text to consistent casing, as
there is no rule for it, except by using dictionary lookups.
In reality, data should be input using default casing (as in dictionary
entries), independently of its position in sentences, paragraphs or
titles, with the contextual conversion of some or all characters to
uppercase done algorithmically (this is safe for conversion to ALL
UPPERCASE, and quite reliable for conversion to Title Case, with just a few
dictionary lookups for a small set of known words per language).
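
To illustrate the asymmetry, here is a minimal Python sketch. The function
names and the KNOWN_WORDS table are hypothetical, invented for this example;
a real application would need a proper per-language dictionary. Going from
default casing to ALL UPPERCASE is a plain algorithmic mapping, while the
reverse direction has to guess and needs lookups:

```python
# Hypothetical lookup table: default (dictionary) casing of known words.
KNOWN_WORDS = {
    "unicode": "Unicode",
    "icu": "ICU",
    "paris": "Paris",
}


def to_all_uppercase(text: str) -> str:
    """Safe direction: default casing -> ALL UPPERCASE, purely algorithmic."""
    return text.upper()


def restore_default_case(text: str) -> str:
    """Lossy direction: lowercase everything, then repair known words.

    Words missing from KNOWN_WORDS simply stay lowercase, which may be wrong.
    Note that split() also collapses runs of whitespace.
    """
    words = text.lower().split()
    return " ".join(KNOWN_WORDS.get(word, word) for word in words)


print(to_all_uppercase("the Unicode standard"))
print(restore_default_case("THE UNICODE STANDARD"))
```

Without the table entry for "unicode", the second call could not recover the
capital U, which is why storing the data in default casing is preferable.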

Note that title casing works differently in English (which most often
overdoes it by capitalizing every word), while most other languages
capitalize only selected words, or just the first selected word in French
(in addition to the possible first letter of non-selected words such as
definite and indefinite articles at the start of the sentence). Capitalizing
the initial of every word is wrong in German, which uses capitalization even
more strictly than French or Italian. When in doubt, do not perform any
titlecasing, and let the data provide the actual capitalization of titles
directly. It is OK and even recommended in German to have section headings,
or even book titles, written as if they were in the middle of sentences:
you capitalize only titles and headings that are grammatically full
sentences, not simple nominal groups.
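
As a rough sketch of what language-dependent handling could look like (in
Python; the function, the language tags and the small-word list are all
hypothetical and far too simple for real use), the rule of thumb is:
capitalize most words for English, only the first letter for French, and
leave the supplied casing alone for German or any unknown language:

```python
# Hypothetical list of English words left lowercase inside a title.
EN_SMALL_WORDS = {"a", "an", "the", "of", "in", "and", "or", "to"}


def _cap_first(word: str) -> str:
    """Uppercase the first character and leave the rest untouched."""
    return word[:1].upper() + word[1:]


def title_case(text: str, lang: str) -> str:
    """Very simplified title casing driven by a language tag."""
    if lang == "en":
        words = text.split(" ")
        return " ".join(
            word if i > 0 and word in EN_SMALL_WORDS else _cap_first(word)
            for i, word in enumerate(words)
        )
    if lang == "fr":
        # Only the first letter of the title; the rest is kept as supplied.
        return _cap_first(text)
    # German and anything unknown: when in doubt, do no titlecasing at all.
    return text


print(title_case("the history of the web", "en"))
print(title_case("les misérables de Hugo", "fr"))
print(title_case("die Leiden des jungen Werthers", "de"))
```

The fallback branch is the important design choice: text whose language
rules are not known is returned exactly as the data provided it.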

So title casing should not even be promoted by the UCD (where it in fact
uses only very basic, simplistic rules); it is applicable only in some
applications, for some languages, and in specific technical or rendering
contexts.



On Tue, Oct 2, 2018 at 22:21, Markus Scherer via Unicode <unicode at unicode.org>
wrote:

> On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
> unicode at unicode.org> wrote:
>
>> ... The only
>> operation that can cause problems is 'capitalize'.
>>
>> When I say "cause problems", I mean producing mixed-case output. I
>> originally thought that 'capitalize' would be fine. It is fine for
>> lowercase input: It stays lowercase because Unicode Data indicates that
>> titlecase for lowercase Georgian letters is the letter itself. But it
>> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
>> input.
>>
>> My questions here are:
>> - Has this been considered when Georgian Mtavruli was discussed in the
>>    UTC?
>> - How have any other implementers (ICU,...) addressed this, in
>>    particular the operation that's called 'capitalize' in Ruby?
>>
>
> By default, ICU toTitle() functions titlecase at word boundaries (with
> adjustment) and lowercase all else.
> That is, we implement Unicode chapter 3.13 Default Case Conversions R3
> toTitlecase(x), except that we modified the default boundary adjustment.
>
> You can customize the boundaries (e.g., only the start of the string).
> We have options for whether and how to adjust the boundaries (e.g., adjust
> to the next cased letter) and for copying, not lowercasing, the other
> characters.
> See C++ and Java class CaseMap and the relevant options.
>
> markus
>
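
The mixed-case problem described in the quoted message can be reproduced
with a small Python sketch. To stay independent of the Unicode version built
into the runtime, it does not use str.capitalize(); instead it models the
mappings with the fixed offset between the Mtavruli letters (U+1C90..U+1CBA,
U+1CBD..U+1CBF) and the Mkhedruli letters (U+10D0..U+10FA, U+10FD..U+10FF).
All function names are hypothetical, and safe_capitalize is only one
possible workaround, not what any particular library does:

```python
MTAVRULI_OFFSET = 0x1C90 - 0x10D0  # distance between the two letter ranges


def _is_mtavruli(ch: str) -> bool:
    cp = ord(ch)
    return 0x1C90 <= cp <= 0x1CBA or 0x1CBD <= cp <= 0x1CBF


def _is_mkhedruli(ch: str) -> bool:
    cp = ord(ch)
    return 0x10D0 <= cp <= 0x10FA or 0x10FD <= cp <= 0x10FF


def georgian_lower(text: str) -> str:
    """Map Mtavruli (uppercase) letters to Mkhedruli (lowercase)."""
    return "".join(
        chr(ord(ch) - MTAVRULI_OFFSET) if _is_mtavruli(ch) else ch
        for ch in text
    )


def naive_capitalize(text: str) -> str:
    """Model of 'capitalize': titlecase the first letter, lowercase the rest.

    Assumes, as described in the quoted message, that titlecasing leaves
    a Georgian letter unchanged, so the first character is copied as-is.
    """
    return text[:1] + georgian_lower(text[1:])


def is_mixed_case(text: str) -> bool:
    return any(map(_is_mtavruli, text)) and any(map(_is_mkhedruli, text))


def safe_capitalize(text: str) -> str:
    """Hypothetical workaround: never leave a lone Mtavruli initial."""
    return georgian_lower(text)


upper = "\u1C97\u1C91\u1C98\u1C9A\u1C98\u1CA1\u1C98"  # "Tbilisi" in Mtavruli
print(is_mixed_case(naive_capitalize(upper)))  # mixed-case result
print(is_mixed_case(safe_capitalize(upper)))   # consistent lowercase
```

Lowercase input passes through naive_capitalize unchanged; only ALL
UPPERCASE input ends up with a single Mtavruli initial followed by Mkhedruli
letters.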