Mixed-Script confusables in prog.languages

Richard Wordingham richard.wordingham at ntlworld.com
Sun Dec 4 16:45:58 CST 2016


On Sun, 4 Dec 2016 12:09:36 +0100
Reini Urban <reini at cpanel.net> wrote:

> * normalize identifiers (NFC) and only store normalized variants.
> this should catch bidi spoofs, combining characters and such.

That doesn't catch bidi spoofs.
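
A minimal sketch of why, in Python. The identifier and the choice of U+202E RIGHT-TO-LEFT OVERRIDE are illustrative assumptions, not taken from the proposal. Bidi formatting characters have no canonical decomposition, so NFC passes them through untouched, and finding them needs a separate check on Bidi_Class:

```python
import unicodedata

# Hypothetical identifier with an embedded RIGHT-TO-LEFT OVERRIDE (U+202E).
spoofed = "user\u202Enimda"

nfc = unicodedata.normalize("NFC", spoofed)

# NFC only composes and reorders combining sequences; U+202E is not
# affected, so the normalized string equals the input.
unchanged = (nfc == spoofed)

# Bidi_Class values of the explicit embedding, override and isolate controls.
EXPLICIT_BIDI = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def bidi_controls(s):
    """Return the explicit bidi formatting characters in s as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in s
            if unicodedata.bidirectional(c) in EXPLICIT_BIDI]

found = bidi_controls(nfc)
```

This only covers the explicit formatting characters; reordering caused by mixing left-to-right and right-to-left letters is likewise left alone by NFC.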

> * check each unicode code point for its Script property and besides
> Latin, Common and Inherited only allow the first script, but error on
> any other mixed script. Additional scripts need to be declared.
> https://github.com/perl11/cperl/issues/229
> 
> in perl like this:
>     use utf8 'Greek', 'Cyrillic';

Your rule isn't clear.  Would an identifier like ψ_S be automatically
allowed?

I presume you're handling the spoofing of the SMALL PHI characters by
other means.

For multilingual support, you would want rules more like

'After script X, allow script Y'.
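
A rough sketch of what such a rule could look like, in Python. Everything here is an assumption for illustration: the rule table is hypothetical, and because Python's standard library does not expose the Script property, the first word of the character name is used as a crude stand-in for it. A real implementation would read Scripts.txt from the Unicode Character Database.

```python
import unicodedata

# Hypothetical rule table: once an identifier has started in script X,
# characters from the listed scripts Y may follow.
ALLOWED_AFTER = {
    "GREEK": {"LATIN"},
    "LATIN": set(),
}

def rough_script(ch):
    """Crude stand-in for the Script property (an assumption, not UCD data).

    Non-letters such as digits and underscore are treated as Common;
    letters are classified by the first word of their character name.
    """
    if not ch.isalpha():
        return "COMMON"
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def check_identifier(ident):
    """Return True if every script after the first is allowed by the table."""
    first = None
    for ch in ident:
        script = rough_script(ch)
        if script == "COMMON":
            continue
        if first is None:
            first = script
        elif script != first and script not in ALLOWED_AFTER.get(first, set()):
            return False
    return True

greek_then_latin = check_identifier("\u03C8_S")   # psi, underscore, S
latin_then_greek = check_identifier("S_\u03C8")   # S, underscore, psi
```

With this particular table the order matters: ψ_S passes while S_ψ does not, which is one way the question about ψ_S could be answered explicitly rather than left to the reader.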

> Of course there exist several languages which require more than one
> script, 
<snip>
> or african languages as some have other than Latin roots, e.g.
> Ethiopian from Semitic.

I don't see your problem here.  What problem do you see with Amharic?

> Indian languages also sound problematic,

Is this the ZWJ/ZWNJ issue?  That surely is a problem within a script.
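
To illustrate the point, a small Python sketch. The Devanagari sequence is an arbitrary example chosen here, and the name-prefix test is only a rough stand-in for the Script property. Two identifiers that differ only by a ZERO WIDTH NON-JOINER stay distinct after NFC, yet every visible letter belongs to one script, so a mixed-script check has nothing to object to:

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# KA + VIRAMA + SSA, with and without a ZWNJ after the virama.
plain = "\u0915\u094D\u0937"
with_zwnj = "\u0915\u094D" + ZWNJ + "\u0937"

a = unicodedata.normalize("NFC", plain)
b = unicodedata.normalize("NFC", with_zwnj)

# NFC keeps the ZWNJ, so the two identifiers remain different strings.
distinct = (a != b)

# Rough script check (assumption: first word of the character name).
scripts = {unicodedata.name(c).split()[0] for c in b if c != ZWNJ}
```

Whether the two forms render differently depends on the script and the font, which is why this needs handling separately from script mixing.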

> and
> all the Old_<script>

Now I am confused.  What problem do you see that you don't have in the
Latin script?

Richard.


