Mixed-Script confusables in prog.languages

Martin J. Dürst duerst at it.aoyama.ac.jp
Mon Dec 5 05:29:01 CST 2016

On 2016/12/05 17:31, Reini Urban wrote:

> ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common are always allowed, the only
> new script is Greek. The first non-default script is automatically and silently allowed, only a mix with another
> non-default script, such as Cyrillic would error or need an explicit declaration.
> So ψ_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.

Allowing mixing of Greek and Latin (or Cyrillic and Latin) would be a 
big problem. As an example, it would allow A_Α (the second letter is a 
Greek one).
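
To make the A_Α case concrete, here is a minimal Python sketch. It is 
only an illustration, not anybody's actual implementation. The helper 
names (script_of, scripts_in, passes_quoted_rule) are made up for this 
example, the script is guessed from the first word of the character 
name because Python's unicodedata module has no Script property, and 
the quoted rule is applied per identifier, which is a simplifying 
assumption.

```python
import unicodedata

def script_of(ch):
    # Heuristic only: Python's unicodedata module does not expose the UCD
    # Script property, so the first word of the character name stands in.
    if ch.isascii() and not ch.isalpha():
        return "Common"  # ASCII digits, underscore, punctuation
    name = unicodedata.name(ch, "")
    return name.split(" ")[0].capitalize() if name else "Unknown"

def scripts_in(identifier):
    # Set of scripts used by an identifier, ignoring Common characters.
    return {script_of(ch) for ch in identifier} - {"Common"}

def passes_quoted_rule(identifier):
    # Assumed reading of the quoted rule: Latin and Common are always
    # allowed, and at most one further script may appear.
    return len(scripts_in(identifier) - {"Latin"}) <= 1

print(scripts_in("A_\u0391"))              # Latin A, underscore, Greek Alpha
print(passes_quoted_rule("A_\u0391"))      # Latin plus Greek
print(passes_quoted_rule("\u03c8\u0430"))  # Greek psi plus Cyrillic a
```

Under this reading, A_Α passes the check although the two letters are 
different code points (U+0041 and U+0391) that look the same, while a 
Greek and Cyrillic mix is rejected.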

> Amharic is not defined as a UCD script property. Its alphabet is called Ge’ez, which we call
> Ethiopic in the UCD. But that’s all I know. I’m not a domain expert. Does Ethiopic use
> other Semitic scripts in its alphabet or is it complete?

It's complete. I have never heard that it would need Arabic or Hebrew or 
some such.

> How about the many Indian scripts? Do they mix?
> Being an Indian movie expert tells me that Indian languages usually don’t mix.
> They make Tamil and Bengali versions of Hindi movies, and usually fall back to English to
> get common points across the barrier. But their scripts? No idea.

I don't think they mix two different scripts in the same word. Would be 
very confusing.

> In the examples in Perl, which partially came from Parrot, there’s a wild eclectic mix of various scripts
> which make no sense at all. So I don’t know if I can trust those tests, that they make sense and
> are readable at all. My guess is that the authors just liked code golfing and picked random Unicode
> characters. It’s from Perl after all.
> Such as this Perl test, t/mro/isa_c3_utf8.t:
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );
> ...
> package 캎oẃ;
> package urḲḵk;
> @urḲḵk::ISA = 'kഌoんḰ';
> package к;
> @urḲḵk::ISA = ('kഌoんḰ', '캎oẃ');
> package ṭ화ckэ;
> ...
> These identifiers are unreadable, because I don’t assume that anybody will be able to understand
> Hangul, Cyrillic, Ethiopic, Canadian_Aboriginal, Malayalam and Hiragana at once.
> I understand a bit of Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal to me.

The mixes aren't illegal, in that they are not against any law. But they 
are completely unintelligible garbage anyway.

Regards,   Martin.
