UAX44: loose matching of symbolic values and the `is` prefix
sisrivas at blueyonder.co.uk
Mon Jun 6 04:11:15 CDT 2016
> On 06 June 2016 at 08:58 Mathias Bynens <mathias at qiwi.be> wrote:
> http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:
> > For loose matching of symbolic values, an initial prefix string "is" is
> > ignored. […] Ignoring any initial "is" on a symbolic value during loose
> > matching is likely to produce the best results in application areas such
> > as regex. Removal of an initial "is" string for a loose matching
> > comparison only needs to be done once for a symbolic value, and need not
> > be tested recursively. There are no property aliases or property value
> > aliases of the form "isisisisistooconvoluted" defined just to test
> > implementation edge cases.
> UAX44 provides the reason for the existence of this “feature”:
> > The reason for this is that APIs returning property values are often
> > named using the convention of prefixing "is" (or "Is" or "Is_", and so
> > forth) to a property value.
> That seems like a rather weak argument. Specifically applying this to
> UTS18 (Unicode regular expressions):
> > "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"
> If there is already a way to match all symbols in the Greek script (not
> counting the use of aliases and other loose matching requirements), i.e.
> `Script=Greek` — what good does it do to add support for yet another one?
> Looking at implementations in the wild, Steven Levithan found
> that some regex flavors use `Is` for scripts, some for blocks, some for
> scripts and blocks, some for neither. Since some script and block names
> collide, this causes problems, especially when porting regexes across flavors.
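
To make the collision concrete, here is a small sketch. The two functions stand in for two hypothetical regex flavors, one reading `IsArabic` as a block and the other as a script. The data is hand-picked for illustration (the Arabic block range and three sample code points), not a real Unicode database lookup.

```python
# Illustrative data only: the block "Arabic" spans U+0600..U+06FF, and the
# sample set lists just two code points whose Script property is Arabic.
ARABIC_BLOCK = range(0x0600, 0x0700)
ARABIC_SCRIPT_SAMPLE = {0x0627, 0xFB50}


def is_arabic_as_block(ch: str) -> bool:
    """Hypothetical flavor A: `IsArabic` names the block."""
    return ord(ch) in ARABIC_BLOCK


def is_arabic_as_script(ch: str) -> bool:
    """Hypothetical flavor B: `IsArabic` names the script (sample data only)."""
    return ord(ch) in ARABIC_SCRIPT_SAMPLE


# U+0627 ARABIC LETTER ALEF: in the block and in the script.
# U+060C ARABIC COMMA: in the block, but its script is Common.
# U+FB50: script Arabic, but in the block Arabic Presentation Forms-A.
for ch in ("\u0627", "\u060C", "\uFB50"):
    print(f"U+{ord(ch):04X}", is_arabic_as_block(ch), is_arabic_as_script(ch))
```

The same pattern text accepts different characters depending on which namespace the flavor picks, which is the porting hazard described above.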
> The `is` prefix doesn’t provide any functionality that would otherwise be
> unavailable. It doesn’t add any value, yet it causes incompatibility and
> author confusion, and it increases implementation complexity. UAX44 includes
> two entire paragraphs pointing out that last part:
> > Removal of an initial "is" string for a loose matching comparison only
> > needs to be done once for a symbolic value, and need not be tested
> > recursively. There are no property aliases or property value aliases of
> > the form "isisisisistooconvoluted" defined just to test implementation
> > edge cases.
> > Existing and future property aliases and property value aliases are
> > guaranteed to be unique within their relevant namespaces, even if an
> > initial prefix string "is" is ignored. The existing cases of note for
> > aliases that do start with "is" are: dt=Iso
> > (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value
> > alias does not cause any problem, because there is no contrasting value
> > alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is
> > the entire property value alias, and is not a prefix. There is no null
> > value for the Line_Break property for it to contrast with, but
> > implementations of loose matching should be careful of this edge case,
> > so that "lb=IS" is not misinterpreted as matching a null value.
> Backwards compatibility seems to be the only good reason to continue
> supporting the `is` prefix *for existing implementations*, such as the one in
> Perl. But why is it still a requirement for new engines to support it as part
> of UAX44-LM3?
> I’d like to propose changing UAX44-LM3 to make supporting the `is` prefix
> optional for new implementations.