Removing accents and diacritics from a word

Walter Tross via Unicode unicode at unicode.org
Thu Jul 18 15:38:20 CDT 2019


OK, but if I, as a German, were to search for München in a context where I
only had ASCII characters available, I would type Muenchen.
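
A rough sketch of what "doing both" could look like in Python, using only the standard-library unicodedata module. The function names and the expansion table are made up for illustration, and the table covers only the ä/ö/ü cases mentioned in this thread:

```python
import unicodedata

# Hypothetical table: only the German expansions mentioned in this thread.
GERMAN_EXPANSIONS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def strip_marks(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def german_fold(text):
    """Expand umlauts to base letter + e, then strip any remaining marks."""
    # Compose first so that "u" + U+0308 is seen as the single letter "ü".
    composed = unicodedata.normalize("NFC", text)
    return strip_marks(composed.translate(GERMAN_EXPANSIONS))

def search_keys(text):
    """Return both foldings, case-folded, so either spelling can match."""
    return {strip_marks(text).casefold(), german_fold(text).casefold()}
```

With this, search_keys("München") contains both "munchen" and "muenchen", so a query typed either way would hit the same entry. Whether both keys should be indexed is exactly the "depends on the circumstance" part.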

On Thu, 18 Jul 2019 at 22:23, Asmus Freytag (c) <
asmusf at ix.netcom.com> wrote:

> On 7/18/2019 1:08 PM, Walter Tross wrote:
>
> Please remember that diacritics carry information.
>
> That goes without saying. The context is a situation like the one
> where you might need to allow someone to enter a word without accents (e.g.
> because they don't have the right keyboard).
>
> In Italian, e.g., where the grave or acute accent is almost always at the
> end of words, this information is preserved, when transliterating, by
> removing the accent and appending an apostrophe, like in però→pero' (pero
> would be a different word). E.g., my father-in-law has Nicolo' instead of
> Nicolò on his credit card.
> In German, ä, ö and ü are transliterated as ae, oe and ue. E.g., the
> portal of München (Munich) is https://www.muenchen.de/
> Etc.
>
> Whether to fold the umlauts using the added "e" or just the base letter,
> or to do both, would depend on the circumstance.
>
> This is not about preserving information, but enabling access/search from
> an approximation of the full word.
>
> A./
>
>
>
> On Thu, 18 Jul 2019 at 02:09, Asmus Freytag (c) via Unicode <
> unicode at unicode.org> wrote:
>
>> On 7/17/2019 11:25 AM, Sławomir Osipiuk wrote:
>>
>> “Transliteration”?
>>
>> Maybe more generic than what you’re looking for. Used for the process of
>> producing the “machine readable zone” on passports:
>>
>> https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see
>> section 6, page 30)
>>
>>
>>
>> “Accent folding” or “diacritic folding” is used in some places. String
>> folding is “A string transform F, with the property that repeated
>> applications of the same function F produce the same output: F(F(S)) = F(S)
>> for all input strings S”. Accent folding is a special case of that.
>>
>> https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions
>>
>> https://alistapart.com/article/accent-folding-for-auto-complete/
>>
>> Diacritic folding. Thanks. Just didn't think of the operation as folding
>> the way it came up, but that's what it is.
>>
>> A./
>>
>>
>>
>>
>>
>>
>> *From:* Unicode [mailto:unicode-bounces at unicode.org
>> <unicode-bounces at unicode.org>] *On Behalf Of *Asmus Freytag via Unicode
>> *Sent:* Wednesday, July 17, 2019 13:38
>> *To:* Unicode Mailing List
>> *Subject:* Removing accents and diacritics from a word
>>
>>
>>
>> A question has come up in another context:
>>
>> Is there any linguistic term for describing the process of removing
>> accents and diacritics from a word to create its “base form”, e.g. São Tomé
>> to Sao Tome?
>>
>> The linguistic term "string normalization" does not seem preferable in
>> a computing context.
>>
>> Any ideas?
>>
>> A./
>>
>>
>>
>>
>>
>
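
The folding property quoted above from UTR #23, F(F(S)) = F(S), can be spot-checked on a few sample strings. This is only a sketch with hypothetical helper names, using Python's standard-library unicodedata module; a check on samples does not prove the property for all inputs:

```python
import unicodedata

def fold_diacritics(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def is_idempotent_on(fold, samples):
    """Check F(F(S)) == F(S) for each sample string S."""
    return all(fold(fold(s)) == fold(s) for s in samples)

# Sample words taken from this thread.
SAMPLES = ["São Tomé", "München", "però", "Nicolò", "Sławomir"]

if __name__ == "__main__":
    for word in SAMPLES:
        print(word, "->", fold_diacritics(word))
    print("idempotent on samples:", is_idempotent_on(fold_diacritics, SAMPLES))
```

Note that "ł" in Sławomir has no canonical decomposition, so this approach leaves it unchanged; folding it to "l" would need an explicit mapping table.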