Mixed-Script confusables in prog.languages

Reini Urban reini at cpanel.net
Mon Dec 5 02:31:11 CST 2016

> On Dec 4, 2016, at 11:45 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> On Sun, 4 Dec 2016 12:09:36 +0100
> Reini Urban <reini at cpanel.net> wrote:
>> * normalize identifiers (NFC) and only store normalized variants.
>> this should catch bidi spoofs, combining characters and such.
> That doesn't catch bidi spoofs.

Right. Bidi spoofs are already caught by the ID_Start/ID_Continue rule.

i.e. ‮goog‬le <U+202E (right-to-left override), g, o, o, g, U+202C (pop directional formatting), l, e>
is already caught as illegal.
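
Both points can be checked quickly. This is only a sketch in Python, not perl, and it relies on Python's str.isidentifier(), which tests XID_Start/XID_Continue. That is close to, but not identical with, the ID_Start/ID_Continue rule meant above.

```python
import unicodedata

# U+202E RIGHT-TO-LEFT OVERRIDE ... U+202C POP DIRECTIONAL FORMATTING
spoof = "\u202egoog\u202cle"

# NFC leaves the bidi controls in place, so normalization alone does not help.
nfc_unchanged = unicodedata.normalize("NFC", spoof) == spoof

# Neither control character is XID_Continue, so the identifier check rejects it.
spoof_is_identifier = spoof.isidentifier()
plain_is_identifier = "google".isidentifier()

print(nfc_unchanged, spoof_is_identifier, plain_is_identifier)
# prints: True False True
```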

Mixing an RTL script such as Arabic with Latin is not caught by the mixed-script rule per se.

>> * check each unicode code point for its Script property and besides
>> Latin, Common and Inherited only allow the first script, but error on
>> any other mixed script. Additional scripts need to be declared.
>> https://github.com/perl11/cperl/issues/229
>> in perl like this:
>>    use utf8 'Greek', 'Cyrillic';
> Your rule isn't clear.  Would an identifier like ψ_S be automatically
> allowed?

ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common are always allowed, the only
new script is Greek. The first non-default script is automatically and silently allowed. Only a mix with
another non-default script, such as Cyrillic, would error or need an explicit declaration.

So ψ_S alone is fine, if everything else is Greek.
But mixing with the Cyrillic version would lead to an error.
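
A rough sketch of that rule in Python, under several assumptions: the Script property is only approximated by the first word of the character name (a real implementation would read Scripts.txt), the names approx_script, ScriptChecker and MixedScriptError are made up for this illustration, and I assume that once scripts are declared only the declared ones are accepted. This is not how cperl implements it.

```python
import unicodedata

DEFAULT_SCRIPTS = {"LATIN", "COMMON", "INHERITED"}


def approx_script(ch):
    """Heuristic stand-in for the Script property (NOT the real UCD data)."""
    category = unicodedata.category(ch)
    if category[0] not in "LM":
        return "COMMON"  # digits, underscore, punctuation, ...
    word = unicodedata.name(ch, "UNKNOWN").split()[0]
    return "INHERITED" if word == "COMBINING" else word


class MixedScriptError(ValueError):
    pass


class ScriptChecker:
    """Tracks the non-default scripts accepted so far in one source file."""

    def __init__(self, declared=()):
        self.allowed = {s.upper() for s in declared}

    def check(self, identifier):
        for ch in identifier:
            script = approx_script(ch)
            if script in DEFAULT_SCRIPTS or script in self.allowed:
                continue
            if self.allowed:
                raise MixedScriptError(
                    f"{identifier!r}: undeclared script {script}, "
                    f"already using {sorted(self.allowed)}"
                )
            self.allowed.add(script)  # first non-default script: silently allowed


checker = ScriptChecker()
checker.check("\u03c8_S")          # Greek psi: first non-default script, accepted
try:
    checker.check("\u0471_S")      # Cyrillic psi: second script, rejected
    rejected = False
except MixedScriptError:
    rejected = True

declared = ScriptChecker(declared=("Greek", "Cyrillic"))
declared.check("\u03c8_S")
declared.check("\u0471_S")         # accepted, both scripts were declared
```

The state lives in the checker and not in the single identifier, because the rule is about the whole file: ψ_S is fine on its own, and only becomes a problem once a second non-default script shows up.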

> I presume you're handling the spoofing of the SMALL PHI characters by
> other means.

The spoof attempt would be ѱ_S with Cyrillic U+0471, Common and Latin.
Together with the Greek ψ_S that makes 2 mixed non-default scripts, which is illegal if undeclared.
Same with PHI, which exists as Greek or Cyrillic. Many Greek characters have confusable
Cyrillic counterparts, which is why a declaration of use utf8 'Greek', 'Cyrillic';
i.e. mixing those two, sounds highly dangerous.
With the Unicode confusables table (UTS #39) this would be an error. With my rule it is not, since the user
declared those two scripts to be mixed.

> For multilingual support, you would want rules more like
> 'After script X, allow script Y'.

Can you expand on that with an example? I’m no expert on this.

Like after Hangul, allow Han? After Hiragana, allow Katakana?

>> Of course there exist several languages which require more than one
>> script, 
> <snip>
>> or african languages as some have other than Latin roots, e.g.
>> Ethiopian from Semitic.
> I don't see your problem here.  What problem do you see with Amharic?

Amharic is not defined as a UCD Script property value. Its alphabet is called Ge’ez, which we call
Ethiopic in the UCD. But that’s all I know. I’m not a domain expert. Does Ethiopic use
other Semitic scripts in its alphabet or is it complete? I learned some CJK languages,
where you historically allow mixed scripts. But for other scripts I’m clueless.
The examples I got mix it with Runic. Valid or nonsense?

The problem is to decide which scripts are commonly mixed in which languages, so that
those mixes can be allowed in valid identifiers.

How about the many Indian scripts? Do they mix?
Being an Indian movie expert tells me that Indian languages usually don’t mix.
They make Tamil and Bengali versions of Hindi movies, and usually fall back to English to
get common points across the barrier. But their scripts? No idea.

>> Indian languages also sound problematic,
> Is this the ZWJ/ZWNJ issue?  That surely is a problem within a script.
>> and
>> all the Old_<script>
> Now I am confused.  What problem do you see that you don't have in the
> Latin script?

That I have no idea whether those Old_<script> alphabets are still in use, and so whether to
create aliases for them.
In the examples in perl, which partially came from parrot, there’s a wild eclectic mix of various scripts
which make no sense at all. So I don’t know if I can trust those tests, that they make sense and
are readable at all. My guess is that the authors just liked code golfing and picked random unicode
characters. It’s from perl after all.

Such as this perl test t/mro/isa_c3_utf8.t

use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );

package 캎oẃ;
package urḲḵk;
@urḲḵk::ISA = 'kഌoんḰ';
package к;
@urḲḵk::ISA = ('kഌoんḰ', '캎oẃ');
package ṭ화ckэ;

These identifiers are unreadable, because I don’t assume that anybody will be able to understand
Hangul, Cyrillic, Ethiopic, Canadian_Aboriginal, Malayalam and Hiragana at once.
I understand a bit of Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal to me.
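
To see how mixed a single one of these identifiers is, here is a small Python sketch. It assumes that 'kഌoんḰ' consists of the code points k, U+0D0C, o, U+3093, U+1E30, and it again approximates the Script property by the first word of the character name. The helper approx_scripts is made up for this illustration.

```python
import unicodedata


def approx_scripts(identifier):
    """Approximate script names of the letters and marks in an identifier."""
    scripts = set()
    for ch in identifier:
        if unicodedata.category(ch)[0] in "LM":
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


# Assumed code points for the class name used in the @ISA lines of the test.
name = "k\u0d0co\u3093\u1e30"
found = approx_scripts(name)
print(sorted(found))
# prints: ['HIRAGANA', 'LATIN', 'MALAYALAM']
```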

So my rule makes sense. You need to declare the non-default scripts used in your identifiers if they are mixed.
(Not in strings. These can be anything, even illegal UTF-8.)
