Cameron Dutro via CLDR-Users
cldr-users at unicode.org
Fri Sep 20 00:48:28 CDT 2019
Hey CLDR users,
I'm currently trying to update TwitterCLDR
<https://github.com/twitter/twitter-cldr-rb> to CLDR v35 (from v29, yikes),
and running into an issue with the Katakana-Latin BGN transliteration
rules. It looks like the current rules don't take combining marks into
account. I'm specifically seeing issues with Dakuten
<https://en.wikipedia.org/wiki/Dakuten_and_handakuten>, special combining
Hiragana and Katakana sound marks. The transliteration rules specify
applying NFD normalization to the input text first, which splits each
Dakuten off into its own combining code point. However, the subsequent
transliteration rules do not account for these marks, so they remain in the
output text and combine with the Roman characters in unexpected ways.
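To make the symptom concrete, here is a minimal sketch in Python using only the standard unicodedata module. It shows how NFD splits a voiced Katakana letter into a base letter plus a combining mark, and how a rule set that only knows base letters leaves that mark behind. The base_only_rules mapping is a hypothetical stand-in for illustration, not the actual CLDR Katakana-Latin BGN rules.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# KATAKANA LETTER GA as a single precomposed code point.
composed = "\u30AC"

# NFD splits it into KATAKANA LETTER KA followed by the combining
# voiced sound mark (the Dakuten), U+3099.
decomposed = unicodedata.normalize("NFD", composed)

# Hypothetical toy rule set that only maps base kana (an assumption for
# illustration; the real CLDR rules are far more extensive).
base_only_rules = {"\u30AB": "ka"}

# Characters without a rule pass through unchanged, so the combining
# mark survives and attaches to the preceding Latin letter.
output = "".join(base_only_rules.get(ch, ch) for ch in decomposed)

print(describe(composed))
print(describe(decomposed))
print(describe(output))
```

The last line of output lists the two Latin letters followed by U+3099, which is the stray mark that then renders on top of the Roman text.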
Looking through the CLDR XML source, I came across a line that
specifies a special $wordBoundary variable and a corresponding comment
directing the reader to the old CLDR issue tracker. I was able to
cross-reference this issue
<https://unicode-org.atlassian.net/browse/CLDR-4238> in the new JIRA system
that I think is the successor to the old Trac issue. It describes exactly
the problem I'm dealing with.
Interestingly, the $wordBoundary variable in the transliteration rules XML
file is not actually used anywhere in the transliteration rules. This
suggests that at one point it was perhaps designed to handle Dakuten in some way.
It appears that ICU 64.2 emits the correct transliteration results while ICU
57.1 does not, without any alterations to the transliteration rules in CLDR.
This leaves me with the following questions:
1. Does ICU use the $wordBoundary variable even though it doesn't appear
in any rules? If so, how?
2. It appears somehow ICU was fixed without a concurrent update to the
CLDR rules. What changes were made to ICU?
Thanks in advance,