Loose Matching questions

Philippe Verdy verdy_p at wanadoo.fr
Thu Aug 21 11:11:12 CDT 2014


2014-08-21 15:20 GMT+02:00 Rafael Xavier <rxaviers at gmail.com>:

> Dear all,
>
> Reading the Loose Matching TR35 documentation
> <http://unicode.org/reports/tr35/#Loose_Matching> led me to the
> questions below. I have quoted the documentation and inlined the questions
> (probably newbie).
>
> Thanks in advance for your help!
>
> 7.2 Loose Matching
>> Loose matching ignores attributes of the strings being compared that are
>> not important to matching. It involves the following steps:
>>
>>    - Remove "." from currency symbols and other fields used for
>>    matching, and also from the input string unless:
>>       - "." is in the decimal set, and
>>       - its position in the input string is immediately before a decimal
>>       digit
>>    - Ignore all format characters: in particular, ignore the RLM and LRM
>>    used to control BIDI formatting.
>>
>> Where do I find a list of all format characters?
>

Look at the general category "Cf" (format characters) in the UCD.

>
>>    - Ignore all characters in [:Zs:] unless they occur between letters.
>>    (In the heuristics below, even those between letters are ignored except to
>>    delimit fields)
>>
>>  Where do I find a list of all [:Zs:] characters?
>
Same as for Cf: "Zs" (space separators) is also a general category value
in the UCD.
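
Since Cf and Zs are both general category values, one helper can enumerate
either set. This is only a minimal sketch using Python's standard
unicodedata module; the function name chars_with_category is made up for
illustration, and the result depends on the Unicode version bundled with
your Python (see unicodedata.unidata_version).

```python
import sys
import unicodedata


def chars_with_category(category):
    """Return every character whose General_Category equals `category`."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


# Format characters (includes LRM U+200E and RLM U+200F).
format_chars = chars_with_category("Cf")
# Space separators (includes SPACE U+0020 and NO-BREAK SPACE U+00A0).
space_separators = chars_with_category("Zs")
```

For an authoritative list, the UCD data files remain the reference; the
sketch is just a quick way to inspect the sets locally.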

>
>>    - Map all characters in [:Dash:] to U+002D HYPHEN-MINUS
>>
>>  Where do I find a list of all [:Dash:] characters?
>
Look at the binary property "Dash" in the UCD; it is listed in PropList.txt.
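
Python's unicodedata module does not expose the Dash property, so one
option is to read it from a UCD data file. Below is a rough sketch that
parses lines in the "range ; Property # comment" layout used by those
files. SAMPLE is a tiny hand-picked excerpt for illustration, not the full
list, and the names parse_property_ranges and map_dashes are my own.

```python
def parse_property_ranges(text, prop):
    """Collect code points listed for `prop` in UCD-style data lines."""
    code_points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


# Abbreviated excerpt in the layout of the UCD property files (assumption:
# in practice you would read the complete file from the UCD instead).
SAMPLE = """\
002D          ; Dash # Pd       HYPHEN-MINUS
2010..2015    ; Dash # Pd   [6] HYPHEN..HORIZONTAL BAR
2212          ; Dash # Sm       MINUS SIGN
0022          ; Quotation_Mark # Po       QUOTATION MARK
"""

dash = parse_property_ranges(SAMPLE, "Dash")


def map_dashes(s, dash_set=dash):
    """Map every character in `dash_set` to U+002D HYPHEN-MINUS."""
    return "".join("-" if ord(ch) in dash_set else ch for ch in s)
```

Note that Dash is not the same set as general category Pd: MINUS SIGN
(U+2212), for instance, has category Sm.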

>
>>    - Use the data in the element to map equivalent characters (for
>>    example, curly to straight apostrophes). Other apostrophe-like characters
>>    should also be treated as equivalent, especially if the character actually
>>    used in a format may be unavailable on some keyboards. For example:
>>       - U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead
>>       as U+2018 LEFT SINGLE QUOTATION MARK (‘).
>>       - U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as
>>       U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
>>       - U+05F3 HEBREW PUNCTUATION GERESH (‎׳) might be typed instead as
>>       U+0027 APOSTROPHE.
>>
>>  Except for the U+05F3 example, the other two cannot be found in
> http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json.
> Are both the "other apostrophe-like characters"? Where do I find a complete
> list of the apostrophe-like characters? Do mappings follow an algorithm,
> algebraic formula or lookup table?
>
This rule is language-dependent. Some languages (notably those that were
converted to the Latin script from another script, where the Latin alphabet
was not enough to represent letters similar to the glottal stop, and H was
already used for a breathing consonant or in digraphs) use apostrophe-like
characters as plain letters, and sometimes even distinguish between a
left and a right apostrophe (in that case the straight ASCII apostrophe is a
bad fallback).

However, loose matching frequently ignores other differences as well, such
as vowel points and cantillation marks in Arabic and Hebrew. You need to
know the context in which you need "loose matching" and what users expect
from these matches.
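
To illustrate the language-dependent point about apostrophes, here is a
small sketch with per-language lookup tables. The tables are hypothetical
and only cover the examples quoted above; they are not taken from CLDR
data. The Hawaiian entry assumes the format uses U+02BB, so typed
variants are mapped to that letter rather than to the ASCII apostrophe.

```python
# Hypothetical tables: typed character -> character used in the format.
APOSTROPHE_EQUIVALENTS = {
    # Hawaiian writes the glottal stop letter as U+02BB; it is often
    # typed as U+2018 or U+0027 instead.
    "haw": {"\u2018": "\u02bb", "\u0027": "\u02bb"},
    # Default ("und"): fold apostrophe-like characters to U+0027.
    "und": {"\u2018": "\u0027", "\u2019": "\u0027", "\u02bc": "\u0027"},
}


def fold_apostrophes(text, language="und"):
    """Map apostrophe-like characters using the table for `language`."""
    table = APOSTROPHE_EQUIVALENTS.get(language, APOSTROPHE_EQUIVALENTS["und"])
    return text.translate(str.maketrans(table))
```

Both the input string and the fields used for matching would go through
the same function, so that they end up comparable.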

> On
> http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data,
> there's:
>
> There is more than one possible fallback: the recommended usage is that
>> when a character value is not in the desired repertoire the following
>> process is used, whereby the first value that is wholly in the desired
>> repertoire is used.
>>
>>    - toNFC(value)
>>    - other canonically equivalent sequences, if there are any
>>    - the explicit substitutes value (in order)
>>    - toNFKC(value)
>>
>>  Does it mean that when the character being looked up is not found, the
> above process should be followed? Where do I find the definition of
> toNFC(), toNFKC(), canonical equivalence and explicit substitutes?
>
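
As I read the quoted process, it amounts to trying candidates in order and
keeping the first one that lies wholly in the target repertoire. Here is a
rough sketch under that reading. The function name pick_fallback, the
is_ascii repertoire test and the substitute values in the usage lines are
assumptions for illustration. NFD stands in for "other canonically
equivalent sequences", which is a simplification.

```python
import unicodedata


def pick_fallback(value, substitutes, in_repertoire):
    """Return the first candidate wholly inside the repertoire, or None."""
    candidates = [
        unicodedata.normalize("NFC", value),
        # One canonically equivalent sequence; others may exist.
        unicodedata.normalize("NFD", value),
        *substitutes,  # explicit substitutes, in order
        unicodedata.normalize("NFKC", value),
    ]
    for candidate in candidates:
        if all(in_repertoire(ch) for ch in candidate):
            return candidate
    return None


def is_ascii(ch):
    """Example repertoire test: plain ASCII only."""
    return ord(ch) < 128


ligature = pick_fallback("\ufb01", [], is_ascii)        # via NFKC
fraction = pick_fallback("\u00bd", ["1/2"], is_ascii)   # via explicit substitute
```

toNFC and toNFKC are the normalization forms defined in UAX #15, which
also defines canonical equivalence.
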
>
>>    - Apply mappings particular to the domain (i.e., for dates or for
>>    numbers, discussed in more detail below)
>>
>>  Where?
>
>
>>    - Apply case folding (possibly including language-specific mappings
>>    such as Turkish i)
>>
>>  Where do I find more information about it?
>

>>    - Normalize to NFKC; thus no-break space will map to space;
>>    half-width katakana will map to full-width.
>>
All of this is documented in the standard. You should first read the initial
chapters to learn the basic concepts, notably chapter 3 about conformance,
and then look at the referenced chapters.
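
For a quick experiment with the last two steps quoted above (case folding,
then NFKC), Python's standard library is enough. This is only a sketch:
the name fold_for_matching is my own, and str.casefold() is not
locale-aware, so language-specific mappings such as Turkish i would need
separate handling.

```python
import unicodedata


def fold_for_matching(text):
    """Apply default case folding, then normalize to NFKC."""
    return unicodedata.normalize("NFKC", text.casefold())


with_nbsp = fold_for_matching("A\u00a0B")   # no-break space becomes space
katakana = fold_for_matching("\uff76")      # half-width KA becomes full-width
```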

