Loose Matching questions
Rafael Xavier
rxaviers at gmail.com
Thu Aug 21 08:20:13 CDT 2014
Dear all,
Reading the Loose Matching TR35 documentation
<http://unicode.org/reports/tr35/#Loose_Matching> led me to the questions
below. I have quoted the documentation and inlined my questions (probably
newbie ones).
Thanks in advance for your help!
7.2 Loose Matching
> Loose matching ignores attributes of the strings being compared that are
> not important to matching. It involves the following steps:
>
> - Remove "." from currency symbols and other fields used for matching,
>   and also from the input string unless:
>   - "." is in the decimal set, and
>   - its position in the input string is immediately before a decimal
>     digit
> - Ignore all format characters: in particular, ignore the RLM and LRM
> used to control BIDI formatting.
>
Where do I find a list of all format characters?
> - Ignore all characters in [:Zs:] unless they occur between letters.
> (In the heuristics below, even those between letters are ignored except to
> delimit fields)
>
Where do I find a list of all [:Zs:] characters?
> - Map all characters in [:Dash:] to U+002D HYPHEN-MINUS
>
Where do I find a list of all [:Dash:] characters?
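To make the three questions above concrete: my guess is that such lists come from the Unicode Character Database rather than from CLDR itself. Below is a minimal Python sketch using the standard unicodedata module. It assumes that "format characters" means General_Category Cf, and it approximates [:Dash:] with General_Category Pd. The Dash property is not exposed by unicodedata and also covers a few non-Pd characters such as U+2212 MINUS SIGN, so the last list is only an approximation. The helper name chars_in_category is mine.

```python
import sys
import unicodedata


def chars_in_category(category):
    """Return every code point whose General_Category equals `category`."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


# Assumption: "format characters" == General_Category Cf (includes LRM and RLM).
format_chars = chars_in_category("Cf")

# [:Zs:] is General_Category Zs (space separators, e.g. U+0020 and U+00A0).
space_separators = chars_in_category("Zs")

# Approximation only: Pd is a subset of the Dash property.
dash_punctuation = chars_in_category("Pd")
```

The results depend on the Unicode version bundled with the Python build (see unicodedata.unidata_version).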
> - Use the data in the <character-fallback> element to map equivalent
>   characters (for example, curly to straight apostrophes). Other
>   apostrophe-like characters should also be treated as equivalent,
>   especially if the character actually used in a format may be unavailable
>   on some keyboards. For example:
>   - U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as
>     U+2018 LEFT SINGLE QUOTATION MARK (‘).
>   - U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as
>     U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
>   - U+05F3 HEBREW PUNCTUATION GERESH (׳) might be typed instead as
>     U+0027 APOSTROPHE.
>
Only the U+05F3 example can be found in
http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json;
the other two cannot. Are those two the "other apostrophe-like characters"?
Where do I find a complete list of the apostrophe-like characters? Do the
mappings follow an algorithm, an algebraic formula or a lookup table?
On
http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data,
there's:
> There is more than one possible fallback: the recommended usage is that
> when a character value is not in the desired repertoire the following
> process is used, whereby the first value that is wholly in the desired
> repertoire is used.
>
> - toNFC(value)
> - other canonically equivalent sequences, if there are any
> - the explicit substitutes value (in order)
> - toNFKC(value)
>
Does it mean that when the character being looked up is not found, the
above process should be followed? Where do I find the definitions of toNFC(),
toNFKC(), canonical equivalence and explicit substitutes?
> - Apply mappings particular to the domain (i.e., for dates or for
> numbers, discussed in more detail below)
>
Where?
> - Apply case folding (possibly including language-specific mappings
> such as Turkish i)
>
Where do I find more information about it?
> - Normalize to NFKC; thus no-break space will map to space; half-width
> katakana will map to full-width.
>
Are these two mappings (no-break space and half-width katakana) all there is
to it, or are there other NFKC normalizations that should be applied?
Where do I find a complete list of what should be done? Do the mappings
follow an algorithm, an algebraic formula or a lookup table?
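As far as I can tell, NFKC is a general algorithm (compatibility decomposition followed by canonical composition) driven by per-character decomposition data, so the two mappings quoted would only be examples. A minimal Python sketch to check this, using the standard unicodedata module; the helper name explain is mine:

```python
import unicodedata


def explain(ch):
    """Return (decomposition mapping from the character database, NFKC form)."""
    return unicodedata.decomposition(ch), unicodedata.normalize("NFKC", ch)


no_break_space = explain("\u00a0")      # NO-BREAK SPACE
halfwidth_katakana = explain("\uff76")  # HALFWIDTH KATAKANA LETTER KA
fi_ligature = explain("\ufb01")         # LATIN SMALL LIGATURE FI
```

If this reading is right, the ligature maps to "fi" under NFKC even though the quoted text does not mention it, which suggests that an implementation should call a normalization library rather than hard-code the two cases.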
> Loose matching involves (logically) applying the above transform to both
> the input text and to each of the field elements used in matching, before
> applying the specific heuristics below. For example, if the input number
> text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before
> processing. The currency signs are also transformed, so "NA f." is
> converted to "naf" for purposes of matching. As with other Unicode
> algorithms, this is a logical statement of the process; actual
> implementations can optimize, such as by applying the transform
> incrementally during matching.
>
"NA f." is the currency symbol for ANG (Netherlands Antillean guilder, aka
Netherlands Antilles Florin according to Wikipedia
<http://en.wikipedia.org/wiki/Netherlands_Antillean_guilder>). nl-CW and
nl-SX define the ANG symbol as "NAf."; all other locales define it as "ANG".
Following the above recommendation (to map "NA f." to "naf"), how is an
implementation supposed to know that "naf" is ANG? Where do I find a mapping
between "naf" and ANG?
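To make the question concrete, here is a minimal Python sketch of how I read the transform. It rests on several assumptions: Zs characters are always dropped (the "between letters" exception is ignored), [:Dash:] is approximated by General_Category Pd plus U+2212, the apostrophe table is a tiny hand-written sample, and no language-specific case folding is applied. The names loose_transform, APOSTROPHE_LIKE and SYMBOLS are mine, and SYMBOLS is sample data, not real CLDR content. My guess is that the naf-to-ANG mapping is not stored anywhere; it would fall out of applying the same transform to every known currency symbol and building a reverse index.

```python
import unicodedata

# Hand-written sample table (assumption, not the CLDR fallback data).
APOSTROPHE_LIKE = {"\u2018": "'", "\u2019": "'", "\u02bb": "'",
                   "\u02bc": "'", "\u05f3": "'"}


def loose_transform(text, decimal_set=frozenset(".")):
    """Apply a simplified version of the loose-matching transform."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        category = unicodedata.category(ch)
        if ch == ".":
            # Keep "." only as a decimal separator directly before a digit.
            if ch in decimal_set and nxt.isdecimal():
                out.append(ch)
            continue
        if category in ("Cf", "Zs"):
            continue  # drop format characters and space separators
        if category == "Pd" or ch == "\u2212":
            out.append("-")  # map dash-like characters to HYPHEN-MINUS
            continue
        out.append(APOSTROPHE_LIKE.get(ch, ch))
    folded = "".join(out).casefold()
    return unicodedata.normalize("NFKC", folded)


# Sample symbol data keyed by currency code (hypothetical, for illustration).
SYMBOLS = {"ANG": ["NAf.", "ANG"], "USD": ["$", "US$", "USD"]}

# Reverse index: transformed symbol -> currency code.
INDEX = {loose_transform(symbol): code
         for code, symbols in SYMBOLS.items()
         for symbol in symbols}
```

With this, loose_transform(" - NA f. 1,000.00") gives "-naf1,000.00", and INDEX.get(loose_transform("NA f.")) gives "ANG". Is that the intended approach?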
--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br