From richard.wordingham at ntlworld.com Mon Mar 1 02:52:43 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 1 Mar 2021 08:52:43 +0000
Subject: Grammatical features / gender power & prefix derivation
In-Reply-To:
References: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com> <20210301011840.62516e72@JRWUBU2>
Message-ID: <20210301085243.507e252c@JRWUBU2>

On Sun, 28 Feb 2021 19:54:36 -0800 Mark Davis ☕️ via CLDR-Users wrote:

> We are not talking a prefix "taking" gender, but rather contributing
> to the gender of the result. For example, given the gender of the
> unit "meter", and the prefix "kilo-", what is the gender of the unit
> "kilometer"? (In the target language, of course!)

My point was that there is scope for "metre" to be feminine, "kilometre" to be neuter and the noncy "megametre" to be feminine, with the gender of the last two being determined by the compounding. However, the order of the elements would probably stop that happening in Pali, but the principle is there for the effect to turn up in some other language.

For Pali, one would probably have to nativise "kilometre" to something like _mattasahassa_ to be sure of getting a neuter gender, but as a collective concept one could have a neuter _sahassamatta_ for "km". I'm not sure if those collective concepts can take plurals.

Richard.

From richard.wordingham at ntlworld.com Mon Mar 1 23:53:15 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 2 Mar 2021 05:53:15 +0000
Subject: Grammatical features / gender power & prefix derivation
In-Reply-To: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com>
References: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com>
Message-ID: <20210302055315.498f8357@JRWUBU2>

On Mon, 1 Mar 2021 06:50:35 +0800 Kip Cole via CLDR-Users wrote:

> If my understanding is correct, then looking at the Section 16.1:
>
> feature="gender" structure="prefix" value="0"/>
>
> Is there any circumstance whereby "value"
> could be anything other than "0"? Is there any
> circumstance where the power or prefix themselves would form part of
> the gender determination? (Based on the above I assume not, but
> confirmation would be helpful). Looking at the locales for "root",
> "de" and "fr", all of them have "value=0" for "power" and "prefix".

I'm not sure that it's relevant, but the suppletive form _kilo_ for _kilogram_ often has a different gender to the longer form, e.g. neuter rather than masculine in Slavonic languages, or optionally ki/vi class in Swahili as opposed to 'n' class. (Mark Rosenfelder found me the Slavonic case.)

Richard.

From mark at macchiato.com Tue Mar 2 09:04:48 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 2 Mar 2021 07:04:48 -0800
Subject: Grammatical features / gender power & prefix derivation
In-Reply-To: <20210302055315.498f8357@JRWUBU2>
References: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com> <20210302055315.498f8357@JRWUBU2>
Message-ID:

That is useful info. We currently have the long form (kilogram) and the abbreviation (kg) in different languages. So the most useful information would be cases of languages where those two have different genders. Or where the prefixed form has a different gender than the "base form", such as kilogram and gram.

We currently have a limited number of languages with the grammatical data, as you can see from the chart linked from the 39 release page. We're going to be gearing up to add more languages in the next release so this kind of information will be useful as we prepare for that release.

On Mon, Mar 1, 2021, 21:54 Richard Wordingham via CLDR-Users <cldr-users at unicode.org> wrote:

> On Mon, 1 Mar 2021 06:50:35 +0800
> Kip Cole via CLDR-Users wrote:
>
> > If my understanding is correct, then looking at the Section 16.1:
> >
> > feature="gender" structure="prefix" value="0"/>
> >
> > Is there any circumstance whereby "value" could be anything other than "0"?
> > Is there any
> > circumstance where the power or prefix themselves would form part of
> > the gender determination? (Based on the above I assume not, but
> > confirmation would be helpful). Looking at the locales for "root",
> > "de" and "fr", all of them have "value=0" for "power" and "prefix".
>
> I'm not sure that it's relevant, but the suppletive form _kilo_ for
> _kilogram_ often has a different gender to the longer form, e.g. neuter
> rather than masculine in Slavonic languages, or optionally ki/vi class
> in Swahili as opposed to 'n' class. (Mark Rosenfelder found me the
> Slavonic case.)
>
> Richard.

From richard.wordingham at ntlworld.com Tue Mar 2 16:28:04 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 2 Mar 2021 22:28:04 +0000
Subject: Grammatical features / gender power & prefix derivation
In-Reply-To:
References: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com> <20210302055315.498f8357@JRWUBU2>
Message-ID: <20210302222804.21943fb9@JRWUBU2>

On Tue, 2 Mar 2021 07:04:48 -0800 Mark Davis ☕️ via CLDR-Users wrote:

> That is useful info. We currently have the long form (kilogram) and
> the abbreviation (kg) in different languages. So the most useful
> information would be cases of languages where those two have
> different genders.

Well, when 'kilo' and 'kilogram' have different genders, the gender of number plus abbreviation is unknown! The gender differences according to Wiktionary are:

"kilogram" v. "kilo", masculine inanimate v. neuter: Czech, Polish

"kilogram" v. "kilo" v. "kilootje", masculine v. common v. neuter: Dutch (Masculine v. common looks wrong in principle!)

"kilogram" v. "kilo", neuter v. masculine/neuter: Norwegian (both standards)

"килограмм" v. "кило", masculine inanimate v. neuter: Russian

> Or where the prefixed form has a different gender than the "base
> form", such as kilogram and gram.

I've been looking, but I've only turned up one example of the Swahili plural form *vilogramu; the n-/n- noun classing of _kilogramu_ 'kilogram' has very little competition. The word _gramu_ is in the n-/n- noun class.

Richard.

From mark at macchiato.com Tue Mar 2 16:47:59 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 2 Mar 2021 14:47:59 -0800
Subject: Grammatical features / gender power & prefix derivation
In-Reply-To: <20210302222804.21943fb9@JRWUBU2>
References: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com> <20210302055315.498f8357@JRWUBU2> <20210302222804.21943fb9@JRWUBU2>
Message-ID:

On Tue, Mar 2, 2021 at 2:28 PM Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Tue, 2 Mar 2021 07:04:48 -0800
> Mark Davis ☕️ via CLDR-Users wrote:
>
> > That is useful info. We currently have the long form (kilogram) and
> > the abbreviation (kg) in different languages. So the most useful
> > information would be cases of languages where those two have
> > different genders.
>
> Well, when 'kilo' and 'kilogram' have different genders, the gender of
> number plus abbreviation is unknown! The gender differences according
> to Wiktionary are:

We appear to be talking past one another. We don't support "kilo" (as a separate term, or as an abbreviation for "kilogram" or other kilo-units), so your statement "when 'kilo' and 'kilogram' have different genders" is not relevant to CLDR currently. (It could be in the future, but I'd like to clearly distinguish current from future capabilities.)

> "kilogram" v. "kilo", masculine inanimate v. neuter: Czech, Polish
>
> "kilogram" v. "kilo" v. "kilootje", masculine v. common v. neuter: Dutch
> (Masculine v. common looks wrong in principle!)
>
> "kilogram" v. "kilo", neuter v.
> masculine/neuter: Norwegian (both
> standards)
>
> "килограмм" v. "кило", masculine inanimate v. neuter: Russian
>
> > Or where the prefixed form has a different gender than the "base
> > form", such as kilogram and gram.
>
> I've been looking, but I've only turned up one example of the
> Swahili plural form *vilogramu; the n-/n- noun classing of _kilogramu_
> 'kilogram' has very little competition. The word _gramu_ is in the
> n-/n- noun class.

That sounds useful, but I couldn't quite parse what you were saying. Do you mean that the word for gram is in the n-/n- noun class and the word for kilogram is not?

> Richard.

From richard.wordingham at ntlworld.com Wed Mar 3 16:34:42 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 3 Mar 2021 22:34:42 +0000
Subject: Fw: Grammatical features / gender power & prefix derivation
Message-ID: <20210303223442.5abb8ec5@JRWUBU2>

(This was meant for the list.)

Begin forwarded message:

Date: Wed, 3 Mar 2021 00:20:28 +0000
From: Richard Wordingham
To: Mark Davis ☕️
Subject: Re: Grammatical features / gender power & prefix derivation

On Tue, 2 Mar 2021 14:47:59 -0800 Mark Davis ☕️ wrote:

> We appear to be talking past one another. We don't support "kilo" (as
> a separate term, or as an abbreviation for "kilogram" or other
> kilo-units), so your statement "when 'kilo' and 'kilogram' have
> different genders" is not relevant to CLDR currently. (It could be in
> the future, but I'd like to clearly distinguish current from future
> capabilities.)

It is relevant as to when one cannot confidently assign a gender to the abbreviation. Whether CLDR supports the equivalent of "kilo" is linguistically irrelevant. At present I can believe you are only interested in generating text using a correct gender.

> > I've been looking, but I've only turned up one example of the
> > Swahili plural form *vilogramu; the n-/n- noun classing of
> > _kilogramu_ 'kilogram' has very little competition. The word
> > _gramu_ is in the n-/n- noun class.

> That sounds useful, but I couldn't quite parse what you were saying.
> Do you mean that the word for gram is in the
> n-/n- noun class and the word for kilogram is not?

The Swahili word for 'gram' is in the n-/n- noun class. The long Swahili word for 'kilogram' that I can consistently find is also in the n-/n- class. It seems likely that the usual word for 'kilogram' is _kilo_, which has the virtue of conforming to Swahili phonology. The word _kilo_ can reportedly be in either the n-/n- noun class or the ki-/vi- noun class, though the former seems to be commoner.

When the long word for 'kilogram' is in the n-/n- class, its plural is _kilogramu_. If it be in the ki-/vi- class, its plural is _vilogramu_. I've found only one example of it, in the sentence, "Kuna vilogramu 28 kwa gramu 100 za bidhaa", which Google Translate translates as "There are 28 kilograms per 100 grams of product". If the 'v' is not a typo, that sentence is a lovely example of different noun classes for base and derived units. By contrast, I get multiple hits on "kilogramu tatu" meaning '3 kg', so for generating text one should accept the word for 'kilogram' as being in the same noun class as the word for 'gram'. (It's far from unknown for Swahili words to be in multiple noun classes with little or no difference in meaning.)

Richard.
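
The two pluralisations described in the message above can be written out as a minimal sketch. It covers only the two noun classes mentioned in this thread; the function name and the class labels are made up for illustration, and real Swahili noun-class morphology has more cases than this.

```python
def swahili_plural(singular: str, noun_class: str) -> str:
    """Plural of a noun in one of the two classes discussed above.

    noun_class is a hypothetical label: "n/n" or "ki/vi".
    """
    if noun_class == "n/n":
        # n-/n- class: singular and plural have the same form.
        return singular
    if noun_class == "ki/vi":
        # ki-/vi- class: the singular prefix ki- is replaced by vi-.
        if not singular.startswith("ki"):
            raise ValueError(f"{singular!r} does not start with 'ki'")
        return "vi" + singular[2:]
    raise ValueError(f"unsupported noun class: {noun_class!r}")


print(swahili_plural("kilogramu", "n/n"))
print(swahili_plural("kilogramu", "ki/vi"))
```

The same surface word yields different plurals depending on which class it is assigned to, which is why the class of the derived unit matters when generating text.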
From richard.wordingham at ntlworld.com Sat Mar 6 05:36:19 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 6 Mar 2021 11:36:19 +0000
Subject: Grammatical features / gender power & prefix derivation
In-Reply-To: <20210301011840.62516e72@JRWUBU2>
References: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com> <20210301011840.62516e72@JRWUBU2>
Message-ID: <20210306113619.5b720979@JRWUBU2>

On Mon, 1 Mar 2021 01:18:40 +0000 Richard Wordingham via CLDR-Users wrote:

> On Mon, 1 Mar 2021 06:50:35 +0800
> Kip Cole via CLDR-Users wrote:
>
> > My understanding of TR35 section 16.1 is that when deriving the
> > grammatical gender of a "power" (like "square meter") or
> > "prefix" (like "milligram") the basic operation is to strip the
> > power and/or prefix and derive the gender of the base unit ("meter"
> > in this case).
> >
> > If my understanding is correct, then looking at the Section 16.1:
> >
> > feature="gender" structure="prefix" value="0"/>
> >
> > Is there any circumstance
> > whereby "value" could be anything other than "0"? Is there any
> > circumstance where the power or prefix themselves would form part of
> > the gender determination? (Based on the above I assume not, but
> > confirmation would be helpful). Looking at the locales for "root",
> > "de" and "fr", all of them have "value=0" for "power" and
> > "prefix".
>
> I think you need something like a Tigrinya or Sanskrit locale to give
> you any confidence.

It looks as though the simplification works for Tigrinya. The compounds tend not to univerbate in Tigrinya as it maintains the Semitic aversion to compounding nouns, but inanimate singulars lack obligatory gender and numbers above one happily take the singular.

Looking closer to hand, how is assimilation of numbers to the counted object to be handled? It has effects much like gender. An English example is "an amp" v. "a milliamp"; in a receding style of Welsh, we have "deng medr" (10 m) v. "deg cilomedr" (10 km).

Richard.
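
The derivation Kip describes in the quoted text above (strip the power and the prefix, then take the gender of the base unit, which is what value="0" selects) can be sketched roughly as follows. The gender table, the prefix and power lists, and the function names are all assumptions made for illustration; none of them is read from CLDR data or taken from an existing API.

```python
# Illustrative base-unit genders for one locale (assumed values, not CLDR data).
BASE_GENDER = {"meter": "masculine", "gram": "neuter", "second": "feminine"}

# Small, incomplete samples of the affixes that would be stripped.
POWERS = ("square-", "cubic-")
PREFIXES = ("kilo", "centi", "milli", "micro", "mega")


def strip_leading(unit: str, affixes) -> str:
    """Remove the first matching leading affix, if any."""
    for affix in affixes:
        if unit.startswith(affix) and len(unit) > len(affix):
            return unit[len(affix):]
    return unit


def derive_gender(unit_id: str, base_gender=BASE_GENDER) -> str:
    """Gender of a power/prefix unit, taken unchanged from its base unit."""
    core = strip_leading(unit_id, POWERS)
    if core not in base_gender:
        core = strip_leading(core, PREFIXES)
    return base_gender[core]


print(derive_gender("square-kilometer"))
print(derive_gender("milligram"))
```

Under this rule the prefix and the power never influence the result, which is exactly the simplification being questioned in this thread; a language where "kilometre" and "metre" differ in gender would need a different rule or per-unit data.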
From richard.wordingham at ntlworld.com Sat Mar 6 15:52:42 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 6 Mar 2021 21:52:42 +0000
Subject: Grammatical features / gender power & prefix derivation
In-Reply-To: <20210306113619.5b720979@JRWUBU2>
References: <86C2719E-57F0-4E9F-9178-DB220E3218DA@gmail.com> <20210301011840.62516e72@JRWUBU2> <20210306113619.5b720979@JRWUBU2>
Message-ID: <20210306215242.55b620df@JRWUBU2>

On Sat, 6 Mar 2021 11:36:19 +0000 Richard Wordingham via CLDR-Users wrote:

> Looking closer to hand, how is assimilation of numbers to the counted
> object to be handled? It has effects much like gender. An English
> example is "an amp" v. "a milliamp"; in a receding style of Welsh, we
> have "deng medr" (10 m) v. "deg cilomedr" (10 km).

And there's a normative example in Italian _uno steradiante_, which is about fifteen times as common as _un steradiante_.

Richard.

From kipcole9 at gmail.com Sat Mar 13 19:35:37 2021
From: kipcole9 at gmail.com (Kip Cole)
Date: Sun, 14 Mar 2021 09:35:37 +0800
Subject: Mapping Unicode script name to CLDR script code
Message-ID: <7E499380-6637-4E81-B408-10B10AC10EBB@gmail.com>

Using the script properties (from scripts.txt in the Unicode repo for example), the script of some text can be detected.

However I am not able to find a mapping from Unicode script names to CLDR script codes, i.e. a way to map "Hiragana -> Jpan" or "Javanese -> Java".

I've checked supplementalData.xml and scriptMetadata.txt to no avail.

Is there a canonical mapping somewhere?
Many thanks,
-Kip

From kipcole9 at gmail.com Sat Mar 13 19:41:22 2021
From: kipcole9 at gmail.com (Kip Cole)
Date: Sun, 14 Mar 2021 09:41:22 +0800
Subject: Mapping Unicode script name to CLDR script code
In-Reply-To: <7E499380-6637-4E81-B408-10B10AC10EBB@gmail.com>
References: <7E499380-6637-4E81-B408-10B10AC10EBB@gmail.com>
Message-ID:

I note that https://unicode-org.github.io/cldr-staging/charts/39/supplemental/languages_and_scripts.html does map from Unicode language name (at least informally) to CLDR language code but that mapping isn't, as far as I can see, in supplementalData.xml.

> On 14 Mar 2021, at 9:35 am, Kip Cole wrote:
>
> Using the script properties (from scripts.txt in the Unicode repo for example), the script of some text can be detected.
>
> However I am not able to find a mapping from Unicode script names to CLDR script codes, i.e. a way to map "Hiragana -> Jpan" or "Javanese -> Java".
>
> I've checked supplementalData.xml and scriptMetadata.txt to no avail.
>
> Is there a canonical mapping somewhere?
>
> Many thanks,
> -Kip

From doug at ewellic.org Sun Mar 14 00:02:30 2021
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 13 Mar 2021 23:02:30 -0700
Subject: Mapping Unicode script name to CLDR script code
Message-ID: <20210313230230.665a7a7059d7ee80bb4d670165c8327d.7674e287fd.wbe@email15.godaddy.com>

Kip Cole wrote:

> Using the script properties (from scripts.txt in the Unicode repo for
> example), the script of some text can be detected.
>
> However I am not able to find a mapping from Unicode script names to
> CLDR script codes, i.e. a way to map "Hiragana -> Jpan" or "Javanese ->
> Java".
>
> I've checked supplementalData.xml and scriptMetadata.txt to no avail.
>
> Is there a canonical mapping somewhere?

Have you tried PropertyValueAliases.txt? It's a Unicode Character Database file, not a CLDR file.
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org

From richard.wordingham at ntlworld.com Sun Mar 14 07:04:31 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 14 Mar 2021 12:04:31 +0000
Subject: Mapping Unicode script name to CLDR script code
In-Reply-To:
References: <7E499380-6637-4E81-B408-10B10AC10EBB@gmail.com>
Message-ID: <20210314120431.544941be@JRWUBU2>

On Sun, 14 Mar 2021 09:41:22 +0800 Kip Cole via CLDR-Users wrote:

> I note that
> https://unicode-org.github.io/cldr-staging/charts/39/supplemental/languages_and_scripts.html
>
> does map from Unicode language name (at least informally) to CLDR
> language code but that mapping isn't, as far as I can see, in
> supplementalData.xml.

I think that's a map from allegedly English names to BCP 47 codes. Not all the script names are Unicode names. For example, 'Lanna' is not.

> > On 14 Mar 2021, at 9:35 am, Kip Cole wrote:
> >
> > Using the script properties (from scripts.txt in the Unicode repo
> > for example), the script of some text can be detected.
> >
> > However I am not able to find a mapping from Unicode script names
> > to CLDR script codes, i.e. a way to map "Hiragana -> Jpan" or
> > "Javanese -> Java".
> >
> > I've checked supplementalData.xml and scriptMetadata.txt to no
> > avail.
> >
> > Is there a canonical mapping somewhere?

As Doug pointed out, PropertyValueAliases.txt should normally work. However, there are a number of cases that it doesn't handle:

Jpan is composed of (at least) 3 Unicode scripts: Hani, Hira and Kata. Kore is a similar combination of the Unicode scripts Hani and Hang.

Hrkt expands to 'Hiragana or Katakana'; there might be some usage for Japanese text that deliberately excludes kanji.

Latf and Latg are stylistic variants of Latn.
I suspect there ought to be a lot of (largely) predictable spelling differences between de-Latn and de-Latf (basically where ligatures or their lack need to be noted) and between ga-Latn and ga-Latg (how lenition is written). Likewise, Syre, Syrj and Syrn are stylistic variants of Syrc.

Hans and Hant are the simplified and traditional character sets of Chinese; both are specialisations of the generic code Hani.

Have fun.

Richard.

From richard.wordingham at ntlworld.com Sun Mar 28 14:11:43 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 28 Mar 2021 20:11:43 +0100
Subject: Problems with org.unicode.cldr.tool.ShowKeyboards
Message-ID: <20210328201143.44a442aa@JRWUBU2>

The LDML specification makes using it to document keyboards seem like a good idea. So I have been trying to document some of my own.

I have been having trouble devising an identifier for my X-SAMPA keyboard. Its purpose is that one can type in IPA using the X-SAMPA ASCIIfication and get out IPA in Normal Form C. It has been extended slightly to support capital letters and other diacritics that one encounters in transliteration.

1. My first attempt used "und-t-k0". The tool objected that I should rather use the language "en". I then tried "en-t-k0", which triggered the exception:

java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 7
    at java.lang.String.checkBoundsBeginEnd(java.base@9-internal/String.java:3119)
    at java.lang.String.substring(java.base@9-internal/String.java:1907)
    at org.unicode.cldr.tool.ShowKeyboards$Id.<init>(ShowKeyboards.java:811)

2. I then changed tack, and asked myself, "For which language do you most often switch to this keyboard?". The current answer is 'Pali', so I tried pi-Latn-t-k0-ubuntu, pi-Latn-t-k0-ubuntu. In each case I got a non-terminating exception, "org.unicode.cldr.draft.Keyboard$KeyboardException: Bad locale tag: pi-Latn-t-k0-ubuntu, [No minimal data for:pi_Latn]". Are keyboards not allowed for Pali?

3. The question of which vendor's system the keyboard is targeted at is difficult. It's being used on Linux, but 'debian' or 'Ubuntu' might be a more useful answer. The actual coding of the keyboard comes in three flavours, Keyman for Linux (KMfL), emacs (or quail) and M17N. KMfL is the simplest, but only works/worked with the iBus input manager, while M17N should work for both iBus and fcitx. It's not at all clear how I should reflect this in the keyboard identity.

4. The error message for loose text within elements ('PCDATA') is less helpful than it could be. For example,

Caused by: org.xml.sax.SAXParseException; systemId: file:///home/richard/unicode/cldr/38/keyboards/und/pi-t-k0-ubuntu.xml; lineNumber: 91; columnNumber: 12; The content of element type "keyMap" must match "(map|flicks)+".

tells one (by elimination) that there is such text somewhere in an element of type "keyMap" that ends on line 91. That is of limited help when an element has 1500 lines, as has happened to me. (Being new to the game, I had to eliminate misplaced elements or misspelt element names - they give different errors.) Unfortunately, this error message seems not to be under the control of the CLDR project.

--

Despite the warning about my being wicked enough to create a Pali keyboard, the charts and tables were produced for the keyboard. However, there are numerous lurking issues:

5. The layout chart shows only 95 graphic symbols (including space). Are there any plans to chart 'dead key' combinations and the like? (This may not be a trivial exercise.)

6. Most of the keys are shown as being dead keys, though the design intent is that they are not treated as dead keys - the 'default' option is intended, as opposed to 'settings/transformPartial="hide"'. The keyboard format provides no way to note this!

7. Typing the key labelled 'A' with shift enabled is intended to generate the character U+0251 LATIN SMALL LETTER ALPHA; only on typing a backslash does it change to U+0041 LATIN CAPITAL LETTER A. As the technology used assumes a mnemonic keyboard (in so far as it doesn't simply assume a US English keyboard), this is implemented as:

...

The two transforms are in the 'type=simple' transforms element. Should not the tool raise an eyebrow at this? I feel the charts ought to display the C01 key as producing 'ɑ'. Even more seriously, the tool seems to deduce just from the map element above that the keyboard can produce the letter 'A'.

8. Is there a list of available tools for capturing various keyboards in CLDR notation, for example converting a .klc file from MSKLC and, as more of a niche product, converting a .mim file from M17N?

If you feel tickets on CLDR should be raised, please advise how the issues should be grouped.

Richard.

From kipcole9 at gmail.com Mon Mar 29 08:57:34 2021
From: kipcole9 at gmail.com (Kip Cole)
Date: Mon, 29 Mar 2021 21:57:34 +0800
Subject: Transform resolution and before context matches
Message-ID: <702869D2-DCCA-4449-909C-6DD195C78298@gmail.com>

I'm now implementing CLDR transforms and would appreciate some understanding of the following two items:

1. Resolving the correct transform from "Any-Latin". For example, "de-Latin" has a transform rule "Any-Latin" but such a transform doesn't exist in the repo. So I presume an appropriate transform has to be resolved. Reading the inheritance rules isn't helping me. So using this example, how does one resolve the correct transform for "Any-Latin"?

2. I'm not sure how to interpret the Unicode regular expression "[[:Z:][:Ps:][:Pi:]$]" when it's in a "before context" as it is in "Any-Publishing.xml". Specifically, where does the "$" anchor?

(a) Does "$" in this case mean matching the character just before the insertion point? Or does it mean matches an end-of-line at the insertion point? Or something else?
(b) For the majority of "before context" matches, which don't have any anchors in them ("$" or "^"), is the intent that the match aligns to the text immediately before the insertion point (i.e. with an implied "$" ending at the insertion point)? Or is it intended to match anywhere in the prior context from the beginning of the string (that would seem strange, but TR35 doesn't seem to explain the correct interpretation and TR18 is silent on the topic)?

As always, thanks for the insight and assistance,

-Kip

From mark at macchiato.com Mon Mar 29 09:27:45 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 29 Mar 2021 07:27:45 -0700
Subject: Transform resolution and before context matches
In-Reply-To: <702869D2-DCCA-4449-909C-6DD195C78298@gmail.com>
References: <702869D2-DCCA-4449-909C-6DD195C78298@gmail.com>
Message-ID:

Thanks for your message. There is more information in https://unicode-org.github.io/icu/userguide/transforms/general/ that should be incorporated into the LDML section.

As to your particular points, I have some answers below, but I can follow up with details of the edge cases when I have more time.

Mark

On Mon, Mar 29, 2021 at 6:58 AM Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:

> I'm now implementing CLDR transforms and would appreciate some
> understanding of the following two items:
>
> 1. Resolving the correct transform from "Any-Latin". For example,
> "de-Latin" has a transform rule "Any-Latin" but such a transform doesn't
> exist in the repo. So I presume an appropriate transform has to be
> resolved. Reading the inheritance rules isn't helping me. So using this
> example, how does one resolve the correct transform for "Any-Latin"?

There are special inheritance rules for Transforms with locales.

- Any is a special identifier that breaks text by script run, and within that script run is replaced by the script of the run.
- The fallback if there is not a language is language => script. The fallback is a 'ladder' between the source and target.

> 2. I'm not sure how to interpret the Unicode regular expression
> "[[:Z:][:Ps:][:Pi:]$]" when it's in a "before context" as it is in
> "Any-Publishing.xml". Specifically, where does the "$" anchor?
>
> (a) Does "$" in this case mean matching the character just before the
> insertion point? Or does it mean matches an end-of-line at the insertion
> point? Or something else?

It means "off the end of the string". So it is like ^ or $ in regular expressions.

> (b) For the majority of "before context" matches, which don't have any
> anchors in them ("$" or "^"), is the intent that the match aligns to the
> text immediately before the insertion point (i.e. with an implied "$" ending
> at the insertion point)? Or is it intended to match anywhere in the prior
> context from the beginning of the string (that would seem strange, but TR35
> doesn't seem to explain the correct interpretation and TR18 is silent on
> the topic)?

It is immediately before.

> As always, thanks for the insight and assistance,
>
> -Kip

From mark at macchiato.com Mon Mar 29 13:21:25 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 29 Mar 2021 11:21:25 -0700
Subject: Transform resolution and before context matches
In-Reply-To:
References: <702869D2-DCCA-4449-909C-6DD195C78298@gmail.com>
Message-ID:

Kip, would you mind filing a ticket on this, so that we can track it?

Mark

On Mon, Mar 29, 2021 at 7:27 AM Mark Davis ☕️ wrote:

> Thanks for your message. There is more information in
> https://unicode-org.github.io/icu/userguide/transforms/general/ that
> should be incorporated into the LDML section.
>
> As to your particular points, I have some answers below, but I can
> follow up with details of the edge cases when I have more time.
>
> Mark
>
> On Mon, Mar 29, 2021 at 6:58 AM Kip Cole via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> I'm now implementing CLDR transforms and would appreciate some
>> understanding of the following two items:
>>
>> 1. Resolving the correct transform from "Any-Latin". For example,
>> "de-Latin" has a transform rule "Any-Latin" but such a transform doesn't
>> exist in the repo. So I presume an appropriate transform has to be
>> resolved. Reading the inheritance rules isn't helping me. So using this
>> example, how does one resolve the correct transform for "Any-Latin"?
>
> There are special inheritance rules for Transforms with locales.
>
> - Any is a special identifier that breaks text by script run, and
> within that script run is replaced by the script of the run.
> - The fallback if there is not a language is language => script. The
> fallback is a 'ladder' between the source and target.
>
>> 2. I'm not sure how to interpret the Unicode regular expression
>> "[[:Z:][:Ps:][:Pi:]$]" when it's in a "before context" as it is in
>> "Any-Publishing.xml". Specifically, where does the "$" anchor?
>>
>> (a) Does "$" in this case mean matching the character just before the
>> insertion point? Or does it mean matches an end-of-line at the insertion
>> point? Or something else?
>
> It means "off the end of the string". So it is like ^ or $ in regular
> expressions.
>
>> (b) For the majority of "before context" matches, which don't have any
>> anchors in them ("$" or "^"), is the intent that the match aligns to the
>> text immediately before the insertion point (i.e. with an implied "$" ending
>> at the insertion point)? Or is it intended to match anywhere in the prior
>> context from the beginning of the string (that would seem strange, but TR35
>> doesn't seem to explain the correct interpretation and TR18 is silent on
>> the topic)?
>
> It is immediately before.
>
>> As always, thanks for the insight and assistance,
>>
>> -Kip
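
As a rough illustration of the two answers in this thread ("$" inside the set means "off the end of the string", and a before context is matched immediately before the position), here is a minimal sketch of a check for the set [[:Z:][:Ps:][:Pi:]$] using Python's standard unicodedata module. The function name is hypothetical and this is not how ICU implements transform rules; it only models the behaviour described above.

```python
import unicodedata


def before_context_matches(text: str, pos: int) -> bool:
    """Check the before context [[:Z:][:Ps:][:Pi:]$] at index `pos`.

    Only the single character immediately before `pos` is examined.
    The '$' alternative is modelled as "there is nothing before `pos`".
    """
    if pos == 0:
        # '$' in the set: the position is at the edge of the string.
        return True
    category = unicodedata.category(text[pos - 1])
    # Z* = separators, Ps = opening punctuation, Pi = initial quote.
    return category.startswith("Z") or category in ("Ps", "Pi")


print(before_context_matches('"x', 0))     # edge of the string
print(before_context_matches('a "x', 2))   # preceded by a space
print(before_context_matches('a b"x', 3))  # preceded by a letter
```

The third call is the interesting one: there is a space earlier in the string, but because the context is anchored immediately before the position, that earlier space does not count.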