UAX44: loose matching of symbolic values and the `is` prefix
Mathias Bynens
mathias at qiwi.be
Mon Jun 6 02:58:37 CDT 2016
http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:
> For loose matching of symbolic values, an initial prefix string "is" is ignored. […] Ignoring any initial "is" on a symbolic value during loose matching is likely to produce the best results in application areas such as regex. Removal of an initial "is" string for a loose matching comparison only needs to be done once for a symbolic value, and need not be tested recursively. There are no property aliases or property value aliases of the form "isisisisistooconvoluted" defined just to test implementation edge cases.
UAX44 provides the reason for the existence of this “feature”:
> The reason for this is that APIs returning property values are often named using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to a property value.
That seems like a rather weak argument. UTS18 (Unicode Regular Expressions) applies the rule to regex property syntax:
> "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"
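To make the rule concrete, here is a minimal Python sketch of what an engine has to do to treat those three spellings as equal. It is only my reading of UAX44-LM3 (ignore case, whitespace, underscores, hyphens, and one initial "is"); the function name `loose_key` is made up for illustration and does not come from the spec.

```python
def loose_key(name: str) -> str:
    """Fold a symbolic value roughly as UAX44-LM3 describes (a sketch)."""
    # Ignore case, whitespace, underscores and hyphens.
    key = "".join(
        ch for ch in name.lower() if ch not in "_-" and not ch.isspace()
    )
    # Ignore one initial "is" prefix. This is done once, not recursively.
    if key.startswith("is"):
        key = key[2:]
    return key


# All three spellings from UTS18 fold to the same key.
assert loose_key("Greek") == loose_key("isGreek") == loose_key("Is_Greek")
```

Note that this naive version folds the Line_Break value `IS` to an empty string, which is exactly the kind of edge case UAX44 has to warn implementers about.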
There is already a way to match all symbols in the Greek script, namely `Script=Greek` (not counting aliases and the other loose matching requirements). What good does it do to add support for yet another one?
Looking at implementations in the wild, Steven Levithan found (https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062) that some regex flavors use `Is` for scripts, some for blocks, some for scripts and blocks, some for neither. Since some script and block names collide, this causes problems, especially when porting regexes across flavors.
The `is` prefix doesn’t provide any functionality that would otherwise be unavailable. It adds no value, yet it causes incompatibility, confuses authors, and increases implementation complexity. UAX44 devotes two entire paragraphs to that last point:
> Removal of an initial "is" string for a loose matching comparison only needs to be done once for a symbolic value, and need not be tested recursively. There are no property aliases or property value aliases of the form "isisisisistooconvoluted" defined just to test implementation edge cases.
>
> Existing and future property aliases and property value aliases are guaranteed to be unique within their relevant namespaces, even if an initial prefix string "is" is ignored. The existing cases of note for aliases that do start with "is" are: dt=Iso (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value alias does not cause any problem, because there is no contrasting value alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is the entire property value alias, and is not a prefix. There is no null value for the Line_Break property for it to contrast with, but implementations of loose matching should be careful of this edge case, so that "lb=IS" is not misinterpreted as matching a null value.
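As an illustration of the extra care this demands, here is a hedged Python sketch of a lookup that copes with `dt=Iso` and `lb=IS`. The names `fold` and `match_value` and the two value lists are my own, hand-picked for the example; the lists are tiny subsets of the real alias tables, not the full data.

```python
from typing import Iterable, Optional


def fold(name: str) -> str:
    """Ignore case, whitespace, underscores and hyphens."""
    return "".join(
        ch for ch in name.lower() if ch not in "_-" and not ch.isspace()
    )


def match_value(query: str, known_values: Iterable[str]) -> Optional[str]:
    """Return the known value that loosely matches `query`, or None."""
    table = {fold(value): value for value in known_values}
    key = fold(query)
    # Try the unstripped form first, so that "IS" and "Iso" match
    # themselves instead of being reduced to "" and "o".
    if key in table:
        return table[key]
    # Otherwise drop a single initial "is" and try once more (no recursion).
    if key.startswith("is") and key[2:] in table:
        return table[key[2:]]
    return None


# Illustrative subsets only, not the complete alias lists.
LINE_BREAK = ["IS", "AL", "BK"]
DECOMPOSITION_TYPE = ["Iso", "Isolated", "Can", "Canonical"]

assert match_value("IS", LINE_BREAK) == "IS"
assert match_value("Is_Isolated", DECOMPOSITION_TYPE) == "Isolated"
assert match_value("isisIS", LINE_BREAK) is None
```

Trying the unstripped form first is a design choice of this sketch, not something UAX44 prescribes. It is arguably stricter than a literal reading of LM3, which strips the prefix on both sides of the comparison and would therefore also accept `dt=olated`.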
Backwards compatibility seems to be the only good reason to continue supporting the `is` prefix *for existing implementations*, such as the one in Perl. But why is it still a requirement for new engines to support it as part of UAX44-LM3?
I’d like to propose changing UAX44-LM3 to make supporting the `is` prefix optional for new implementations.