Mixed-Script confusables in prog.languages

Richard Wordingham richard.wordingham at ntlworld.com
Mon Dec 5 08:31:38 CST 2016


On Mon, 5 Dec 2016 09:31:11 +0100
Reini Urban <reini at cpanel.net> wrote:

> > On Dec 4, 2016, at 11:45 PM, Richard Wordingham
> > <richard.wordingham at ntlworld.com> wrote:
> > 
> > On Sun, 4 Dec 2016 12:09:36 +0100
> > Reini Urban <reini at cpanel.net> wrote:
> >   
> >> * normalize identifiers (NFC) and only store normalized variants.
> >> this should catch bidi spoofs, combining characters and such.  
> > 
> > That doesn't catch bidi spoofs.  
> 
> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
> 
> i.e. ‮goog‬le <U+202E (right-to-left override), g, o, o, g, U+202C
> (pop directional formatting), l, e> is already caught as illegal.
> 
> Mixing RTL scripts, such as Arabic with Latin is not caught with the
> mixed-script rule per se.
> 
> >> * check each unicode code point for its Script property and besides
> >> Latin, Common and Inherited only allow the first script, but error
> >> on any other mixed script. Additional scripts need to be declared.
> >> https://github.com/perl11/cperl/issues/229
> >> 
> >> in perl like this:
> >>    use utf8 'Greek', 'Cyrillic';
> > 
> > Your rule isn't clear.  Would an identifier like ψ_S be
> > automatically allowed?  
> 
> ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common
> are always allowed, the only new script is Greek. The first
> non-default script is automatically and silently allowed, only a mix
> with another non-default script, such as Cyrillic would error or need
> an explicit declaration.
> 
> So ψ_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.
> 
> > I presume you're handling the spoofing of the SMALL PHI characters
> > by other means.  
> 
> The spoof attempt would be ѱ_S with Cyrillic U+0471, Common, Latin.
> 2 mixed scripts which are illegal, if undeclared.
> Same with PHI, which exists as Greek or Cyrillic. Most of Greek
> characters have confusable Cyrillic counterparts, that’s why a
> declaration of use utf8 'Greek', 'Cyrillic'; i.e. mixing those two
> sounds highly dangerous. With the UCD confusable table this would be
> an error. In my rule not, since the user declared those two scripts
> to be mixed.

The choice with PHI includes:

U+0278 LATIN SMALL LETTER PHI
U+03C6 GREEK SMALL LETTER PHI

a Greek (!) script character with compatibility decomposition to U+03C6:

U+03D5 GREEK PHI SYMBOL

and a whole host of common script characters with compatibility
decomposition to U+03C6:

U+1D6D7 MATHEMATICAL BOLD SMALL PHI
U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
U+1D711 MATHEMATICAL ITALIC SMALL PHI
U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL

They are all ID_Start.
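
As a rough check of the list above, here is a small Python sketch using only the standard unicodedata module. The helper name describe is made up for illustration. Note one assumption: Python's str.isidentifier() tests XID_Start rather than ID_Start, so it is only a close stand-in for the property named above.

```python
import unicodedata

# Code points taken from the list above.
PHIS = [
    0x0278, 0x03C6, 0x03D5,
    0x1D6D7, 0x1D6DF, 0x1D711, 0x1D719, 0x1D74B, 0x1D753,
    0x1D785, 0x1D78D, 0x1D7BF, 0x1D7C7,
]

def describe(cp):
    """Collect the properties relevant to spoofing for one code point."""
    ch = chr(cp)
    return {
        "name": unicodedata.name(ch),
        # Raw decomposition field, e.g. a <compat> or <font> mapping.
        "decomposition": unicodedata.decomposition(ch),
        # Full compatibility normalization of the single character.
        "nfkc": unicodedata.normalize("NFKC", ch),
        # XID_Start test (assumption: used here in place of ID_Start).
        "can_start_identifier": ch.isidentifier(),
    }

for cp in PHIS:
    info = describe(cp)
    print(f"U+{cp:04X} {info['name']}: "
          f"decomposition={info['decomposition']!r}, "
          f"NFKC=U+{ord(info['nfkc']):04X}, "
          f"identifier start={info['can_start_identifier']}")
```

Only U+0278 survives NFKC unchanged; the others fold to U+03C6, so a compiler that normalizes identifiers with NFKC treats them as one name, while one that uses NFC keeps them apart.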

You didn't mention the inherited script.  Is that automatically
allowed, e.g. φ̈ᵣ <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)?  I
encountered that variable name in a radar specification last week.

There might be issues - it's possible that क̐ <U+0915 DEVANAGARI LETTER
KA, U+0310 COMBINING CANDRABINDU> might spoof कँ <U+0915, U+0901
DEVANAGARI SIGN CANDRABINDU>.
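
To make the question concrete, here is a minimal Python sketch of the quoted rule as I read it: Latin, Common and Inherited are always allowed, the first other script is accepted silently, and any further script must have been declared. Python's standard library does not expose the Script property, so approx_script guesses it from the first word of the character name. That is a loud assumption and is wrong for many characters (the mathematical alphabets, for one); a real implementation would read Scripts.txt from the UCD. The class and function names are hypothetical.

```python
import unicodedata

ALWAYS_ALLOWED = {"Latin", "Common", "Inherited"}

def approx_script(ch):
    """Crude stand-in for the UCD Script property (assumption: name prefix)."""
    category = unicodedata.category(ch)
    name = unicodedata.name(ch, "")
    # Generic combining marks such as U+0308 or U+0310 are Inherited.
    if category.startswith("M") and name.startswith("COMBINING"):
        return "Inherited"
    # Punctuation, digits, symbols and unnamed characters: call them Common.
    if not name or category[0] not in "LM":
        return "Common"
    # "GREEK SMALL LETTER PSI" -> "Greek", and so on.
    return name.split()[0].capitalize()

class MixedScriptChecker:
    """Tracks the first non-default script seen across all identifiers."""

    def __init__(self, declared=()):
        self.declared = set(declared)   # scripts declared up front
        self.first_script = None        # first undeclared script, allowed silently

    def check(self, identifier):
        for ch in identifier:
            script = approx_script(ch)
            if script in ALWAYS_ALLOWED or script in self.declared:
                continue
            if self.first_script is None:
                self.first_script = script
            elif script != self.first_script:
                raise ValueError(
                    f"{identifier!r}: {script} mixed with {self.first_script}")
        return True

checker = MixedScriptChecker()
checker.check("\u03c8_S")              # Greek psi, low line, Latin S
checker.check("\u03c6\u0308\u1d63")    # Greek, Inherited, Latin
try:
    checker.check("\u0471_S")          # Cyrillic psi after Greek was seen
except ValueError as err:
    print("rejected:", err)
```

Under this reading the radar variable passes, and so does the Devanagari pair above, since U+0310 is Inherited and U+0901 is Devanagari; a script check alone does not tell the two candrabindus apart.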

> > For multilingual support, you would want rules more like
> > 
> > 'After script X, allow script Y’.  
> 
> Can you expand on that with an example? I’m no expert on this.
> 
> Like after Hangul, allow Han? After Hiragana, allow Katakana?

It allows one to mix Japanese and Korean variables without being able
to mix kana and Hangul.
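
One possible way to express such a rule, sketched in Python: list the groups of scripts that may share an identifier, and accept an identifier only if its non-default scripts all fall within a single group. The group contents and the name may_mix are illustrative assumptions, not an existing API, and the function takes script names directly rather than looking them up.

```python
ALWAYS_ALLOWED = frozenset({"Latin", "Common", "Inherited"})

# Hypothetical policy: Han may follow kana or Hangul, but kana and
# Hangul never meet in one identifier.
COMPATIBLE_GROUPS = [
    frozenset({"Han", "Hiragana", "Katakana"}),
    frozenset({"Han", "Hangul"}),
]

def may_mix(scripts, groups=COMPATIBLE_GROUPS):
    """Return True if the scripts of one identifier may appear together."""
    extra = set(scripts) - ALWAYS_ALLOWED
    if len(extra) <= 1:
        return True                      # a single script is always fine
    # Several scripts: all of them must fit inside one permitted group.
    return any(extra <= group for group in groups)

print(may_mix({"Han", "Hiragana", "Latin"}))
print(may_mix({"Han", "Hangul"}))
print(may_mix({"Hiragana", "Hangul"}))
```

The check is per identifier, so Japanese and Korean names can still coexist in one source file.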

Some of the Semitic abjads are sometimes used with vowel symbols
normally associated with a different Semitic script.  One could use
such a construct to limit the mixing.  However, for such cases a rule
such as 'allow script Y marks on script X bases' would be much better.

> > I don't see your problem here.  What problem do you see with
> > Amharic?  

> Amharic is not defined as UCD script property. Its alphabet is
> called Ge’ez, which we call Ethiopic in the UCD. But that’s all I
> know. I’m not a domain expert. Does Ethiopic use other Semitic
> scripts in its alphabet or is it complete? I learned some CJK
> languages, where you historically allow mixed scripts. But for other
> scripts I’m clueless. The examples I got mix it with Runic. Valid or
> nonsense?

I would say nonsense - or graphic design.  The use of Chinese
ideographs alongside sinoform scripts is the primary example.
However, 'symbols' as opposed to letters may leak from one script to
another, and that may be an issue for variable names.  For example,
English can use Arabic numerals, Roman numerals or Roman letters for
numbering in lists, and I've known people to resort to Greek letters.
Accent marks can also move, though these are usually encoded
separately.  I've already used the example of candrabindu being
borrowed from the Devanagari script to the Latin script - it was
borrowed for use in Sanskrit.

> How about the many Indian scripts? Do they mix?

Microsoft mostly won't let long-supported *Indian* scripts mix within
syllables.

I would say they mixed in much the same way as the Latin and Cyrillic
scripts mix.  In many ways they act as font variants of one another, so
features and rare letters may move between them.  This is most
noticeable where large chunks of the Brahmi character set are missing,
as in Tamil and Lao.  For Tamil, the gaps may be filled by 'Grantha'
letters.  For Lao, subscript consonants bear a very strong resemblance
to the Tai Tham subscript forms.  On the other hand, the unencoded
characters added to the Lao script to support Pali have been
well harmonised to the Lao script, and using characters from other
scripts for them would definitely be wrong.  (There's mostly a
consensus as to what the right bogus coding for them within the Lao
block is.  Unfortunately, I don't have good enough evidence for an
encoding proposal.)

> That I have no idea if those Old_<script> alphabets are still in use
> to create aliases for them.

They'll still be in use.  We had a guy at work (computer department)
who kept notes on his whiteboard in runes.  Someone analysing cuneiform
texts might very well want to create variable names that are a mix of
Latin for function (as 'n_' = "number of") and cuneiform for the form
being counted or whatever.

> Such as this perl test t/mro/isa_c3_utf8.t
> 
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam
> Hiragana );
> 
> ...
> package 캎oẃ;
> package urḲḵk;
> @urḲḵk::ISA = 'kഌoんḰ';
> package к;
> @urḲḵk::ISA = ('kഌoんḰ', '캎oẃ');
> package ṭ화ckэ;
> ...
> 
> These identifiers are unreadable, because I don’t assume that anybody
> will be able to understand Hangul Cyrillic Ethiopic
> Canadian_Aboriginal Malayalam and Hiragana at once. I understand a
> bit Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal
> to me.

There's no law against it!  More to the point, it was just a test.

However, allowing Cyrillic or Greek immediately makes every apparent
'o' (or 'A') a potential spoof.  Remember, "Letter 'O' Considered
Harmful". 
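
A short Python illustration of why: the Latin, Greek and Cyrillic letters below look alike in most fonts, yet normalization leaves them distinct, so three identifiers that read the same remain three different names. The word "count" is just an example.

```python
import unicodedata

LOOKALIKE_O = {
    "Latin": "\u006f",      # LATIN SMALL LETTER O
    "Greek": "\u03bf",      # GREEK SMALL LETTER OMICRON
    "Cyrillic": "\u043e",   # CYRILLIC SMALL LETTER O
}

def spoof_variants(prefix, suffix):
    """Build one identifier per script, differing only in the 'o'."""
    return {script: prefix + o + suffix for script, o in LOOKALIKE_O.items()}

variants = spoof_variants("c", "unt")
normalized = {unicodedata.normalize("NFKC", name) for name in variants.values()}

for script, name in variants.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in name)
    print(f"{script:8} {name} valid={name.isidentifier()} {code_points}")
print("distinct after NFKC:", len(normalized))
```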

Richard.


