Mixed-Script confusables in prog.languages

Wed Dec 14 11:44:39 CST 2016

> On Dec 5, 2016, at 3:31 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> 
> On Mon, 5 Dec 2016 09:31:11 +0100
> Reini Urban <reini at cpanel.net> wrote:
> 
>>> On Dec 4, 2016, at 11:45 PM, Richard Wordingham
>>> <richard.wordingham at ntlworld.com> wrote:
>>> 
>>> On Sun, 4 Dec 2016 12:09:36 +0100
>>> Reini Urban <reini at cpanel.net> wrote:
>>> 
>>>> * normalize identifiers (NFC) and only store normalized variants.
>>>> this should catch bidi spoofs, combining characters and such.  
>>> 
>>> That doesn't catch bidi spoofs.  
>> 
>> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
>> 
>> i.e. ‮goog‬le <U+202E (right-to-left override), g, o, o, g, U+202C
>> (pop directional formatting), l, e> is already caught as illegal.
>> 
>> Mixing RTL scripts, such as Arabic with Latin is not caught with the
>> mixed-script rule per se.
>> 
>>>> * check each unicode code point for its Script property and besides
>>>> Latin, Common and Inherited only allow the first script, but error
>>>> on any other mixed script. Additional scripts need to be declared.
>>>> https://github.com/perl11/cperl/issues/229
>>>> 
>>>> in perl like this:
>>>>   use utf8 'Greek', 'Cyrillic';
>>> 
>>> Your rule isn't clear.  Would an identifier like ψ_S be
>>> automatically allowed?  
>> 
>> ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common
>> are always allowed, the only new script is Greek. The first
>> non-default script is automatically and silently allowed, only a mix
>> with another non-default script, such as Cyrillic would error or need
>> an explicit declaration.
>> 
>> So ψ_S alone is fine, if everything else is Greek.
>> But mixing with the Cyrillic version would lead to an error.
>> 
>>> I presume you're handling the spoofing of the SMALL PHI characters
>>> by other means.  
>> 
>> The spoof attempt would be ѱ_S with Cyrillic U+0471, Common, Latin.
>> That makes 2 mixed scripts, which are illegal if undeclared.
>> Same with PHI, which exists as Greek or Cyrillic. Most Greek
>> characters have confusable Cyrillic counterparts, that's why a
>> declaration of use utf8 'Greek', 'Cyrillic'; i.e. mixing those two,
>> sounds highly dangerous. With the UCD confusable table this would be
>> an error. Under my rule it is not, since the user declared those two
>> scripts to be mixed.
> 
> The choice with PHI includes:
> 
> U+0278 LATIN SMALL LETTER PHI
> U+03C6 GREEK SMALL LETTER PHI
> 
> a Greek (!) script character with compatibility decomposition to U+03C6
> 
> U+03D5 GREEK PHI SYMBOL
> 
> and a whole host of common script characters with compatibility
> decomposition to U+03C6:
> 
> U+1D6D7 MATHEMATICAL BOLD SMALL PHI
> U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
> U+1D711 MATHEMATICAL ITALIC SMALL PHI
> U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
> U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
> U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
> U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
> U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
> U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
> U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL
> 
> They are all ID_Start.

Oh my. Here be dragons. So I also need to add some trie tables to warn about those cases.
I don’t want to raise an error based only on some obscure confusables rule yet.
Perl doesn’t even ship the security tables, so people are not aware of them.
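
To get a feel for how much of the PHI list plain compatibility normalization would already fold together, here is a small Python sketch (Python only because its standard library ships the normalization tables; the names `PHI_VARIANTS` and `nfkc_fold` are made up for this example). It is an illustration, not a substitute for the security tables.

```python
import unicodedata

# The characters from the list above, in the same order.
PHI_VARIANTS = [
    "\u0278",      # LATIN SMALL LETTER PHI
    "\u03d5",      # GREEK PHI SYMBOL
    "\U0001d6d7", "\U0001d6df", "\U0001d711", "\U0001d719",
    "\U0001d74b", "\U0001d753", "\U0001d785", "\U0001d78d",
    "\U0001d7bf", "\U0001d7c7",
]

def nfkc_fold(ch):
    """Apply compatibility normalization (NFKC) to a single character."""
    return unicodedata.normalize("NFKC", ch)

for ch in PHI_VARIANTS:
    folded = nfkc_fold(ch)
    status = "folds to U+03C6" if folded == "\u03c6" else "unchanged"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {status}")
```

U+0278 has no decomposition at all, so normalization leaves it alone; only a confusables table would catch it.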

> You didn't mention the inherited script.  Is that automatically
> allowed, e.g. φ̈ᵣ <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
> SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)?  I
> encountered that variable name in a radar specification last week.

Inherited is allowed with ID_Continue, yes. Not in ID_Start position.
Combiners are normalized to NFC.
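
Roughly, as a Python sketch: this assumes general category M* as a stand-in for "combining mark", and `check_marks` is a hypothetical name. A real implementation would consult the ID_Start/ID_Continue tables instead.

```python
import unicodedata

def check_marks(ident):
    """Normalize to NFC, then reject a combining mark in start position."""
    s = unicodedata.normalize("NFC", ident)
    if not s:
        raise ValueError("empty identifier")
    if unicodedata.category(s[0]).startswith("M"):
        raise ValueError("combining mark in start position")
    return s

# e + COMBINING ACUTE ACCENT composes to a single code point under NFC.
print(len(check_marks("e\u0301")))
# U+03C6, U+0308, U+1D63 has no precomposed form, so the mark stays
# separate, but it is accepted because it is not in start position.
print(len(check_marks("\u03c6\u0308\u1d63")))
```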

> There might be issues - it's possible that क̐ <U+0915 DEVANAGARI LETTER
> KA, U+0310 COMBINING CANDRABINDU> might spoof कँ <U+0915, U+0901
> DEVANAGARI SIGN CANDRABINDU>.

Good test case:

\x{915}\x{310} is legal: Devanagari plus an Inherited combining mark. NFC leaves it as two code points.
\x{915}\x{901} are two legal Devanagari characters.
But the two sequences are confusables. This would need special confusable rules.
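
A quick Python check of this test case, as a sketch only. Normalization does not map one sequence onto the other, so the comparison has to go through some confusable mapping. `TOY_SKELETON` below is a one-entry table written by hand for this example; it is not taken from the Unicode confusables data.

```python
import unicodedata

ka_inherited = "\u0915\u0310"   # KA + COMBINING CANDRABINDU
ka_devanagari = "\u0915\u0901"  # KA + DEVANAGARI SIGN CANDRABINDU

def same_after(form):
    """True if both sequences are identical after the given normalization."""
    return (unicodedata.normalize(form, ka_inherited)
            == unicodedata.normalize(form, ka_devanagari))

# Hypothetical skeleton table: maps a mark to its lookalike.
TOY_SKELETON = {"\u0310": "\u0901"}

def toy_skeleton(s):
    """Decompose, then replace each character by its table entry, if any."""
    s = unicodedata.normalize("NFD", s)
    return "".join(TOY_SKELETON.get(ch, ch) for ch in s)

print(same_after("NFC"), same_after("NFKC"))
print(toy_skeleton(ka_inherited) == toy_skeleton(ka_devanagari))
```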

> 
>>> For multilingual support, you would want rules more like
>>> 
>>> 'After script X, allow script Y'.  
>> 
>> Can you expand on that with an example? I’m no expert on this.
>> 
>> Like after Hangul, allow Han? After Hiragana, allow Katakana?
> 
> It allows one to mix Japanese and Korean variables without being able
> to mix kana and Hangul.
> 
> Some of the Semitic abjads are sometimes used with vowel symbols
> normally associated with a different Semitic script.  One could use
> such a construct to limit the mixing.  However, for such cases a rule
> such as 'allow script Y marks on script X bases' would be much better.
> 
>>> I don't see your problem here.  What problem do you see with
>>> Amharic?  
> 
>> Amharic is not defined as a UCD script property. Its alphabet is
>> called Ge’ez, which we call Ethiopic in the UCD. But that’s all I
>> know. I’m not a domain expert. Does Ethiopic use other Semitic
>> scripts in its alphabet or is it complete? I learned some CJK
>> languages, where you historically allow mixed scripts. But for other
>> scripts I’m clueless. The examples I got mix it with Runic. Valid or
>> nonsense?
> 
> I would say nonsense - or graphic design.  The use of Chinese
> ideographs alongside sinoform scripts is the primary example.
> However, 'symbols' as opposed to letters may leak from one script to
> another, and that may be an issue for variable names.  For example,
> English can use Arabic numerals, Roman numerals or Roman letters for
> numbering in lists, and I've known people to resort to Greek letters.
> Accent marks can also move, though these are usually encoded
> separately.  I've already used the example of candrabindu being
> borrowed from the Devanagari script to the Latin script - it was
> borrowed for use in Sanskrit.
> 
>> How about the many Indian scripts? Do they mix?
> 
> Microsoft mostly won't let long-supported *Indian* scripts mix within
> syllables.
> 
> I would say they mixed in much the same way as the Latin and Cyrillic
> scripts mix.  In many ways they act as font variants of one another, so
> features and rare letters may move between them.  This is most
> noticeable where large chunks of the Brahmi character set are missing,
> such as Tamil and Lao.  For Tamil, the gaps may be filled by 'Grantha'
> letters.  For Lao, subscript consonants bear a very strong resemblance
> to the Tai Tham subscript forms.  On the other hand, the unencoded
> characters added to the Lao script to support Pali have been
> well harmonised to the Lao script, and using characters from other
> scripts for them would definitely be wrong.  (There's mostly a
> consensus as to what the right bogus coding for them within the Lao
> block is.  Unfortunately, I don't have good enough evidence for an
> encoding proposal.)

I see. This would be a good case for requiring a declaration of those mixed scripts,
similar to East Asian.
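
For reference, the mixed-script rule with declarations could be sketched like this in Python. Two loud assumptions: Python's standard library has no Script property, so `rough_script` guesses the script from the first word of the character name, which is only a crude approximation; and declared scripts are assumed not to use up the silently allowed "first script" slot. All names here are hypothetical.

```python
import unicodedata

ALWAYS_ALLOWED = {"LATIN", "COMMON", "INHERITED"}

def rough_script(ch):
    """Crude stand-in for the Script property (assumption, see above)."""
    cat = unicodedata.category(ch)
    if cat.startswith("M"):
        return "INHERITED"
    if not cat.startswith("L"):
        return "COMMON"
    word = unicodedata.name(ch, "COMMON").split()[0]
    return "COMMON" if word == "MATHEMATICAL" else word

class ScriptChecker:
    """Tracks the scripts seen across all identifiers of one source file."""

    def __init__(self, declared=()):
        self.declared = {s.upper() for s in declared}
        self.first = None  # first undeclared non-default script seen

    def check(self, ident):
        for ch in ident:
            script = rough_script(ch)
            if script in ALWAYS_ALLOWED or script in self.declared:
                continue
            if self.first is None:
                self.first = script  # silently allowed
            elif script != self.first:
                raise ValueError(
                    f"undeclared script {script} mixed with {self.first}")
        return True

checker = ScriptChecker()
checker.check("\u03c8_S")        # Greek psi: first script, allowed
try:
    checker.check("\u0471_S")    # Cyrillic psi: second script, rejected
except ValueError as err:
    print(err)
```

With `ScriptChecker(declared=("Greek", "Cyrillic"))` both identifiers pass, which is exactly the case where a confusables warning would still be useful.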

>> I have no idea whether those Old_<script> alphabets are still in
>> use, so that it would be worth creating aliases for them.
> 
> They'll still be in use.  We had a guy at work (computer department)
> who kept notes on his whiteboard in runes.  Someone analysing cuneiform
> texts might very well want to create variable names that are a mix of
> Latin for function (as 'n_' = "number of") and cuneiform for the form
> being counted or whatever.
> 
>> Such as this perl test t/mro/isa_c3_utf8.t
>> 
>> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam
>> Hiragana );
>> 
>> ...
>> package 캎oẃ;
>> package urḲḵｋ;
>> @urḲḵｋ::ISA = 'kഌoんḰ';
>> package к;
>> @urḲḵｋ::ISA = ('kഌoんḰ', '캎oẃ');
>> package ṭ화ckэ;
>> ...
>> 
>> These identifiers are unreadable, because I don’t assume that anybody
>> will be able to understand Hangul Cyrillic Ethiopic
>> Canadian_Aboriginal, Malayalam and Hiragana at once. I understand a
>> bit of Hangul, Cyrillic and Hiragana, but the mix sounds highly
>> illegal to me.
> 
> There's no law against it!  More to the point, it was just a test.

I declared a "no undeclared mixed scripts" law in my language :)

> However, allowing Cyrillic or Greek immediately makes every apparent
> 'o' (or 'A') a potential spoof.  Remember, "Letter 'O' Considered
> Harmful".

Yes, this should be warned about.