New Unicode Working Group: Message Formatting

Philippe Verdy via Unicode unicode at unicode.org
Tue Jan 14 12:05:59 CST 2020


People's names are NOT transliterated freely. It is up to each person to
document their romanized name; it should not be invented by automatic
processes. And frequently the official romanized name does not match the
original name in another script (this is very frequent for Chinese people,
as well as for trademarks).
There are also common but informal names, not always official but commonly
used in the press/media, and their orthography varies across
countries/languages. If these people are well known (notably historic
personalities, or artists), they may have their own page in some Wikipedia
and in Wikidata.

There's no need to "translate" them: you'll use a database query to
retrieve names (including the preferred/most frequent one and the official
one). In some countries several orthographies may be used, e.g. for streets
named after people. These names are not translatable, except where the
streets are locally multilingual. That is not a database of people's names
but a geographic database for other purposes: even if these names originate
from people, they are still geographic names *derived* from people's names.

For this you'll still use placeholders in the messages, and the value of
the placeholder may be queried in the relevant database for the relevant
target language. Variable forms for these names (e.g. genitives) may be
found there but are not easily derived. If these are geographic names, they
may be transliterated, but there are competing standards for the
transliteration of toponyms, so if the geographic database does not already
contain an official/preferred romanization, you'll also need to tune your
application to select the romanization system relevant for the target
language. The international standards are language-neutral, but they are
not relevant for specific countries that have their own official
terminology, or for the United Nations, which needs to cite these names in
several official working languages. There are also needs for
transliteration from Latin to other scripts.
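
As a rough illustration of that lookup (only a sketch: the NAMES_DB table,
the entity key and the function names are hypothetical, and the fallback to
the English entry is an assumption of this example), the message keeps a
placeholder and the name is fetched for the target language, never
generated by transliteration:

```python
# Hypothetical in-memory stand-in for an external names database.
# Each entity stores the forms actually recorded for it, per language.
NAMES_DB = {
    "chebyshev": {
        "ru": "Чебышёв",
        "en": "Chebyshev",
        "de": "Tschebyschow",
    },
}


def lookup_name(entity_id, lang, fallback_lang="en"):
    """Return the recorded name for lang; never transliterate on the fly."""
    forms = NAMES_DB[entity_id]
    if lang in forms:
        return forms[lang]
    # Assumption of this sketch: fall back to another recorded form.
    return forms[fallback_lang]


def format_message(template, lang, **entity_ids):
    """Fill each placeholder with the name recorded for the target language."""
    values = {key: lookup_name(eid, lang) for key, eid in entity_ids.items()}
    return template.format(**values)


print(format_message("Polynome nach {person}", "de", person="chebyshev"))
```

The message catalog only ever sees the placeholder; which spelling ends up
in the output is decided by the data recorded in the names database.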

Anyway, proper names are to be treated specially: there's nothing that can
be used in a message format API to select what will be the effective
replacement value of a placeholder. But the replacement may, or may not,
specify alternate forms for correct formatting when multiple forms are
possible (genitives, capitalisation, elisions and contextual mutations)
for the same selected name coming from an external database.
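
One possible way to picture this (a sketch only: the NameValue class and
the "genitive" key are hypothetical and not part of any existing
MessageFormat API) is a replacement value that carries its own alternate
forms, so that the message merely names the form it needs:

```python
from dataclasses import dataclass, field


@dataclass
class NameValue:
    """A name taken from an external database, with optional alternate forms."""
    default: str
    forms: dict = field(default_factory=dict)

    def __format__(self, spec):
        # "{city}" yields the default form; "{city:genitive}" yields the
        # stored genitive, or the default form if none was recorded.
        if not spec:
            return self.default
        return self.forms.get(spec, self.default)


# Finnish example: the genitive is stored, not derived by rule.
helsinki = NameValue("Helsinki", {"genitive": "Helsingin"})
print("{city:genitive} sää".format(city=helsinki))
```

The formatter never tries to compute the genitive itself; if the database
supplied none, the default form is used unchanged.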

The MessageFormat API and translator tools should not have to manage the
external databases, which will be "translated" separately with enough forms
relevant for their presentation and composition in larger messages.

Why does this group exist now in CLDR? Most probably because there are
already difficulties in managing translations in existing CLDR data (which
is focused on a small part of what is translatable). CLDR is concerned with
only a few geographic items: countries, some subnational regions,
continents, and some cities used for timezones.

But the main problem is the proliferation of variant forms in CLDR, added
only for a few languages that need them, with no evident fallback to the
common form used in most other languages that don't need that distinction
or don't make the same kind of distinctions (e.g. plural forms, grammatical
gender or personal gender not always matching each other, politeness/formal
forms).
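
To make the missing fallback concrete (a sketch under assumptions: the
resolve function and the sample tables are hypothetical and do not
reproduce how CLDR data is actually resolved), a lookup can try the
requested variants from most to least specific and then fall back to the
common form, here stored under the key "other":

```python
def resolve(variants, requested, common="other"):
    """Return the first requested variant that exists, else the common form."""
    for key in requested:  # ordered from most to least specific
        if key in variants:
            return variants[key]
    return variants[common]


# Hypothetical sample data: English distinguishes "one", Japanese does not.
DAYS_EN = {"one": "{n} day", "other": "{n} days"}
DAYS_JA = {"other": "{n}日"}

print(resolve(DAYS_EN, ["one"]).format(n=1))
print(resolve(DAYS_EN, ["few"]).format(n=3))
print(resolve(DAYS_JA, ["one"]).format(n=1))
```

A language that needs no distinction then stores only the common form, and
a request for a variant it does not define still resolves.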

Once again I suggest you start contributing to a translation project and
experiment with it before continuing. Look at Wikimedia wikis
(translation templates, the translation extension, and the companion
Translatewiki.net wiki), Transifex, Google Translator, ResourceBundle and
the formatting API in Java, .po/.pot for Gettext in many open source
projects, the Facebook translation tool, internationalization APIs in
Windows, iOS, MacOS, and the ICU library which is the de facto base for
CLDR...


On Tue, 14 Jan 2020 at 16:11, wjgo_10009 at btinternet.com via Unicode <
unicode at unicode.org> wrote:

> The reply from Mr Verdy has indeed been helpful, as indeed has also been
> an offlist private reply from someone who has, thus far, not been a
> participant in this thread.
>
>
> Mr Verdy wrote:
>
>
> > You seem to have never seen how translation packages work and are used
> > in common projects (not just CLDR, but you could find them as well in
> > Wikimedia projects, or translation packages for lot of open source
> > packages).
>
> What seems to be the case to Mr Verdy is in fact the actual situation.
>
> I do not satisfy the second of the two conditions of the invitation to
> join the working group. I am, in fact, retired and I have never worked in
> the i18n/l10n industry. Also, from the explanations it is not as close to
> my research interests as I had thought, and indeed hoped. I just do what I
> can on my research project from time to time using a home computer, a
> personal webspace hosted by an internet service provider, some budget
> software, mainly High-Logic FontCreator, and Serif PagePlus desktop
> publishing package, together with the software bundled with Windows 10.
> Older people are often advised to try to keep the mind active, so my
> research activity at least does that. If the research itself has benefits
> more generally in making progress in the application of information
> technology then that is an additional benefit.
>
>
> One thing of which you might like to take account and specifically
> "build-out" in computer formatting is a tendency that can occur in some
> computer systems software and also in everyday transactions also before
> computers became widespread, namely of not allowing a person to be recorded
> or listed with more than two initials before his or her surname, to the
> extent that some people even have a practice of not using more than two
> initials even when the document, such as a letter, or a form, before them
> specifically uses three or more initials. Common explanations are that
> "It's for the computer" and "Two initials is enough to identify someone"
> and "Someone could have many names". Yet the second is not true and the
> first is only because somewhere along the line someone has decided that
> that is how it is to be done: the third is true, but the fact that that is the
> person's name on his or her birth certificate is the legal fact of the
> matter and so needs to be properly accommodated in systems recording names.
> Also, the United Kingdom and United States format of a given name, one or
> more additional given names, then a surname is not suitable for some other
> cultures. I remember some registration forms for college courses that would
> ask for surname and forenames, with a panel for each, together with a
> printed note on every such form "If your name cannot be expressed in that
> format, please write your whole name in the box labelled 'surname'".
>
>
> However, with localization there are other issues. I seem to remember
> somewhere that people whose name is correctly expressed in a script other
> than Latin script often have a transliterated "Romanized form" of their
> name as well for use on travel documents. So will your format system
> include provision for this please, such as by allowing both to be linked
> together in a document please?
>
>
> Another feature is that I have known people from various countries who
> have, in everyday use, chosen to be known in everyday workplace situations
> by an English first name rather than their official given name, while using
> their original surname, perhaps transliterated. So it would be good if
> the name format accounts for that too please, in a manner that does not
> give the possible impression of that use being for some questionable
> purpose. Maybe a new term such as ChosenSocialName could be used for that
> please.
>
>
> An interesting facet of transliteration is that the name of a famous
> mathematician whose name was properly written using Cyrillic characters,
> was transliterated into English as Chebyshev, whereas the set of
> polynomials named after him are each designated by including the letter T.
> The transliteration of the name of the mathematician into German starts
> with a T rather than the C used in English. There was a short thread that
> explored this topic on this mailing list around the year 2000,
> not necessarily in the year 2000 itself, but I have not been able to locate
> it.
>
>
> William Overington
>
>
> Tuesday 14 January 2020
>
>
>