Mixed-Script confusables in prog.languages
Richard Wordingham
richard.wordingham at ntlworld.com
Sun Dec 4 16:45:58 CST 2016
On Sun, 4 Dec 2016 12:09:36 +0100
Reini Urban <reini at cpanel.net> wrote:
> * normalize identifiers (NFC) and only store normalized variants.
> this should catch bidi spoofs, combining characters and such.
That doesn't catch bidi spoofs.
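As a minimal sketch of why, the Python snippet below shows that NFC composes combining sequences but leaves a bidi control such as U+202E RIGHT-TO-LEFT OVERRIDE in place. The identifier strings and the `has_bidi_control` helper are illustrative assumptions, not anything proposed in the thread.

```python
import unicodedata

# Explicit bidi controls and marks (an illustrative list chosen for this sketch).
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

plain = "user_admin"
spoofed = "user_\u202enimda"  # hypothetical identifier that may render like `plain`

nfc_plain = unicodedata.normalize("NFC", plain)
nfc_spoofed = unicodedata.normalize("NFC", spoofed)

# NFC does what it is meant to do: e + COMBINING ACUTE ACCENT becomes U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")

# ...but the override character survives, so the two identifiers stay distinct.
rlo_survives = "\u202e" in nfc_spoofed
still_distinct = nfc_plain != nfc_spoofed


def has_bidi_control(identifier):
    """Separate check (hypothetical helper): flag any explicit bidi control."""
    return any(ch in BIDI_CONTROLS for ch in identifier)
```

A check along the lines of `has_bidi_control` would have to run in addition to normalization; NFC alone does not provide it.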
> * check each Unicode code point for its Script property and, besides
> Latin, Common and Inherited, only allow the first script, but error on
> any other mixed script. Additional scripts need to be declared.
> https://github.com/perl11/cperl/issues/229
>
> in perl like this:
> use utf8 'Greek', 'Cyrillic';
Your rule isn't clear. Would an identifier like ψ_S be automatically
allowed?
I presume you're handling the spoofing of the SMALL PHI characters by
other means.
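To make the question concrete, here is a rough Python sketch of one possible reading of the quoted rule: Latin, Common and Inherited are always allowed, the first other script seen is allowed, and any further script must be declared. Python's standard library does not expose the Script property, so `rough_script` guesses it from the character name; that helper, `check_identifier` and the `declared` parameter are assumptions made for illustration, and a real implementation would use the Unicode Scripts.txt data.

```python
import unicodedata

ALWAYS_ALLOWED = {"Latin", "Common", "Inherited"}
KNOWN = {"LATIN": "Latin", "GREEK": "Greek", "CYRILLIC": "Cyrillic"}


def rough_script(ch):
    """Crude stand-in for the Script property, guessed from the character name."""
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    if first_word == "COMBINING":
        return "Inherited"
    return KNOWN.get(first_word, "Common")


def check_identifier(identifier, declared=()):
    """Return (ok, scripts_seen) under one reading of the quoted rule."""
    first_other = None
    seen = []
    for ch in unicodedata.normalize("NFC", identifier):
        script = rough_script(ch)
        if script not in seen:
            seen.append(script)
        if script in ALWAYS_ALLOWED or script in declared:
            continue
        if first_other is None:
            first_other = script      # the first non-Latin script is accepted
        elif script != first_other:
            return False, seen        # a second undeclared script is rejected
    return True, seen


psi_s = check_identifier("\u03c8_S")                            # ψ_S: Greek, Common, Latin
greek_cyrillic = check_identifier("\u03c8_\u0434")              # ψ_д: Greek, then Cyrillic
declared_ok = check_identifier("\u03c8_\u0434", ("Cyrillic",))  # Cyrillic declared
```

Under this reading ψ_S is accepted automatically, because Greek is the first script and Latin is always allowed. The check says nothing about confusables inside one script: U+03C6 GREEK SMALL LETTER PHI and U+03D5 GREEK PHI SYMBOL are both Greek and remain distinct under NFC, so they would need separate handling.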
For multilingual support, you would want rules more like
'After script X, allow script Y'.
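One way such rules could be expressed, as a sketch only: treat each rule as an ordered pair (X, Y) and accept a change of script inside an identifier only if that pair is listed. The function name, the pair representation and the choice to skip Common and Inherited are assumptions for illustration; the input is a list of already-resolved script names, one per character.

```python
COMMON_LIKE = {"Common", "Inherited"}


def transitions_allowed(scripts, allowed_pairs):
    """Check a per-character script sequence against 'after X, allow Y' pairs."""
    previous = None
    for script in scripts:
        if script in COMMON_LIKE:
            continue                  # punctuation, digits, combining marks
        if previous is not None and script != previous:
            if (previous, script) not in allowed_pairs:
                return False          # this change of script was not declared
        previous = script
    return True


# Hypothetical rule set: Latin may follow Greek, but not the other way round.
rules = {("Greek", "Latin")}
greek_then_latin = transitions_allowed(["Greek", "Common", "Latin"], rules)
latin_then_greek = transitions_allowed(["Latin", "Common", "Greek"], rules)
```

With this rule set, an identifier such as ψ_S passes while S_ψ does not, which is the kind of ordering distinction a flat list of permitted scripts cannot express.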
> Of course there exist several languages which require more than one
> script,
<snip>
> or African languages, as some have other than Latin roots, e.g.
> Ethiopian from Semitic.
I don't see your problem here. What problem do you see with Amharic?
> Indian languages also sound problematic,
Is this the ZWJ/ZWNJ issue? That surely is a problem within a script.
> and
> all the Old_<script>
Now I am confused. What problem do you see that you don't have in the
Latin script?
Richard.