Mixed-Script confusables in prog.languages

Wed Dec 14 11:28:23 CST 2016

> On Dec 5, 2016, at 1:51 PM, gfb hjjhjh <c933103 at gmail.com> wrote:
> 
> How about package names like ロシアМС21(Note the МС are Cyrillic), or πr²の秘密, or エリ_хорошо_μ'sic_4⃣ever? Although they aren't really names that people would usually use in package/var names, they are meaningful names…

My program thinks otherwise.

1st:
$ cperl5.25.2 -Mutf8 -e'package ロシアМС21;'

Invalid script Cyrillic in identifier МС21;
 for U+041C. Have Katakana at -e line 1.

Legalize those mixed scripts with this:
$ cperl5.25.2 -C -Mutf8=Katakana,Cyrillic -e'package ロシアМС21;'

2nd:
$ cperl5.25.2 -C -Mutf8 -e'$エリ_хорошо_μ::sic_4⃣;'
Invalid script Cyrillic in identifier хорошо_μ::sic_4⃣;
 for U+0445. Have Katakana at -e line 1.

Illegal mixed scripts Katakana + Cyrillic + Greek.
Almost legal with this:

$ cperl5.25.2 -C -Mutf8=Katakana,Cyrillic,Greek -e'$エリ_хорошо_μ::sic_4⃣;'
Unrecognized character \x{20e3}; marked by <-- HERE after о_μ::sic_4<-- HERE near column 20 at -e line 1.

U+20E3 is not an ID_Continue character.

3rd:
$ cperl5.25.2 -C -Mutf8 -e'$πr²の秘密;'
Unrecognized character \x{b2}; marked by <-- HERE after $πr<-- HERE near column 4 at -e line 1.

² is not an ID_Continue character.
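
Both rejections can be cross-checked outside cperl. The following is a minimal sketch in Python, not cperl's implementation. It assumes that Python's str.isidentifier(), which follows XID_Start/XID_Continue, is a close enough stand-in for the ID_Start/ID_Continue check. The helper name first_bad_char is made up for illustration.

```python
import unicodedata

def first_bad_char(name):
    """Return (index, char, general category) of the first character that
    stops `name` from being an identifier, or None if the name is fine."""
    for i in range(1, len(name) + 1):
        # Grow the prefix one character at a time; the first prefix that
        # is no longer an identifier ends with the offending character.
        if not name[:i].isidentifier():
            ch = name[i - 1]
            return i - 1, ch, unicodedata.category(ch)
    return None

print(first_bad_char("sic_4\u20e3"))  # U+20E3 COMBINING ENCLOSING KEYCAP
print(first_bad_char("πr²の秘密"))     # U+00B2 SUPERSCRIPT TWO
print(first_bad_char("πrの秘密"))      # no offending character
```

The general category is reported because it explains the rejection: enclosing marks (Me) and "other" numbers (No) are not part of the identifier-continue set.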

Legal with:
$ cperl5.25.2 -C -Mutf8=Greek,Hiragana,Han -e'$πrの秘密;'
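
The rule behind these three runs can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the cperl code: Latin, Common and Inherited are always allowed, the first other script seen is adopted silently when nothing was declared, and every further script must be declared. The names SCRIPT_RANGES, script_of and check_identifiers are hypothetical, and the script lookup is a coarse hand-written range table covering only the characters used above, not the real UCD Script property.

```python
# Coarse stand-in for the UCD Script property (assumption: block ranges
# only, enough for the examples in this mail; everything else is Common).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"), (0x0061, 0x007A, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3041, 0x309F, "Hiragana"),
    (0x30A1, 0x30FA, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
]
DEFAULT_SCRIPTS = {"Latin", "Common", "Inherited"}

def script_of(ch):
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Common"  # digits, underscore, anything not in the table

def check_identifiers(identifiers, declared=()):
    """Return None if all identifiers pass, else a tuple
    (offending script, code point, scripts already in use)."""
    have = set(declared)  # non-default scripts allowed so far
    for ident in identifiers:
        for ch in ident:
            script = script_of(ch)
            if script in DEFAULT_SCRIPTS or script in have:
                continue
            if not have:
                have.add(script)  # first non-default script: allowed silently
                continue
            return script, f"U+{ord(ch):04X}", sorted(have)
    return None

print(check_identifiers(["ロシアМС21"]))
print(check_identifiers(["ロシアМС21"], declared=["Katakana", "Cyrillic"]))
print(check_identifiers(["πrの秘密"], declared=["Greek", "Hiragana", "Han"]))
```

The set of scripts is kept across identifiers, so a second identifier in a different undeclared script is rejected just like a mix inside one name.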

> 2016年12月5日 16:39 於 "Reini Urban" <reini at cpanel.net> 寫道：
> 
> > On Dec 4, 2016, at 11:45 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> >
> > On Sun, 4 Dec 2016 12:09:36 +0100
> > Reini Urban <reini at cpanel.net> wrote:
> >
> >> * normalize identifiers (NFC) and only store normalized variants.
> >> this should catch bidi spoofs, combining characters and such.
> >
> > That doesn't catch bidi spoofs.
> 
> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
> 
> i.e. ‮goog‬le <U+202E (right-to-left override), g, o, o, g, U+202C (pop directional formatting), l, e>
> is already caught as illegal.
> 
> Mixing RTL scripts, such as Arabic, with Latin is not caught by the mixed-script rule per se.
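
The division of labour between the two checks can be illustrated with a small Python sketch. It assumes that Python's unicodedata.normalize and str.isidentifier() behave closely enough to the NFC and IDStart/IDContinue steps discussed here. The helper name canonical is made up.

```python
import unicodedata

def canonical(name):
    """Key under which an identifier would be stored: its NFC form."""
    return unicodedata.normalize("NFC", name)

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

# NFC maps both spellings to the same stored name.
print(composed == decomposed)
print(canonical(composed) == canonical(decomposed))

# The bidi spoof: RLO, g, o, o, g, PDF, l, e
spoof = "\u202egoog\u202cle"

# NFC leaves the directional controls in place ...
print(canonical(spoof) == spoof)
# ... but the identifier rule rejects them (general category Cf).
print(spoof.isidentifier(), unicodedata.category("\u202e"))
```

So normalization merges combining-character variants, while the bidi controls are only stopped by the identifier character rule.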
> 
> >> * check each unicode code point for its Script property and besides
> >> Latin, Common and Inherited only allow the first script, but error on
> >> any other mixed script. Additional scripts need to be declared.
> >> https://github.com/perl11/cperl/issues/229
> >>
> >> in perl like this:
> >>    use utf8 'Greek', 'Cyrillic';
> >
> > Your rule isn't clear.  Would an identifier like ψ_S be automatically
> > allowed?
> 
> ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common are always allowed, the only
> new script is Greek. The first non-default script is automatically and silently allowed, only a mix with another
> non-default script, such as Cyrillic would error or need an explicit declaration.
> 
> So ψ_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.
> 
> > I presume you're handling the spoofing of the SMALL PHI characters by
> > other means.
> 
> The spoof attempt would be ѱ_S with Cyrillic U+0471, Common, Latin.
> Together with the Greek ψ_S that makes 2 mixed scripts, which are illegal if undeclared.
> Same with PHI, which exists as Greek or Cyrillic. Most Greek characters have confusable
> Cyrillic counterparts, which is why a declaration of use utf8 'Greek', 'Cyrillic';
> i.e. mixing those two, sounds highly dangerous.
> With the UCD confusable table this would be an error. With my rule it would not, since the user
> declared that those two scripts may be mixed.
> 
> > For multilingual support, you would want rules more like
> >
> > 'After script X, allow script Y'.
> 
> Can you expand on that with an example? I’m no expert on this.
> 
> Like after Hangul, allow Han? After Hiragana, allow Katakana?
> 
> >> Of course there exist several languages which require more than one
> >> script,
> > <snip>
> >> or african languages as some have other than Latin roots, e.g.
> >> Ethiopian from Semitic.
> >
> > I don't see your problem here.  What problem do you see with Amharic?
> 
> Amharic is not defined as a UCD script property. Its alphabet is called Ge’ez, which we call
> Ethiopic in the UCD. But that’s all I know. I’m not a domain expert. Does Ethiopic use
> other Semitic scripts in its alphabet or is it complete? I learned some CJK languages,
> where you historically allow mixed scripts. But for other scripts I’m clueless.
> The examples I got mix it with Runic. Valid or nonsense?
> 
> The problem is to decide which scripts are commonly mixed in which languages to allow
> them to be valid identifiers.
> 
> How about the many Indian scripts? Do they mix?
> Being an Indian movie expert tells me that Indian languages usually don’t mix.
> They make Tamil and Bengali versions of Hindi movies, and usually fall back to English to
> get common points across the barrier. But their scripts? No idea.
> 
> >
> >> Indian languages also sound problematic,
> >
> > Is this the ZWJ/ZWNJ issue?  That surely is a problem within a script.
> >
> >> and
> >> all the Old_<script>
> >
> > Now I am confused.  What problem do you see that you don't have in the
> > Latin script?
> 
> That I have no idea whether those Old_<script> alphabets are still in use, i.e. whether to create
> aliases for them.
> In the examples in perl, which partially came from parrot, there’s a wild eclectic mix of various scripts
> which make no sense at all. So I don’t know if I can trust those tests, that they make sense and
> are readable at all. My guess is that the authors just liked code golfing and picked random Unicode
> characters. It’s from perl after all.
> 
> Such as this perl test t/mro/isa_c3_utf8.t
> 
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );
> 
> ...
> package 캎oẃ;
> package urḲḵｋ;
> @urḲḵｋ::ISA = 'kഌoんḰ';
> package к;
> @urḲḵｋ::ISA = ('kഌoんḰ', '캎oẃ');
> package ṭ화ckэ;
> ...
> 
> These identifiers are unreadable, because I don’t assume that anybody will be able to understand
> Hangul, Cyrillic, Ethiopic, Canadian_Aboriginal, Malayalam and Hiragana at once.
> I understand a bit of Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal to me.
> 
> So my rule makes sense. You need to declare non-default scripts used in your identifiers if mixed.
> (Not in strings. Those can be anything, even illegal UTF-8.)
> 
>