UAX44: loose matching of symbolic values and the `is` prefix

srivas sinnathurai sisrivas at blueyonder.co.uk
Mon Jun 6 04:11:15 CDT 2016


Thanks Ashley.

> 
>     On 06 June 2016 at 08:58 Mathias Bynens <mathias at qiwi.be> wrote:
> 
> 
>     http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:
> 
>     > For loose matching of symbolic values, an initial prefix string "is" is
>     > ignored. […] Ignoring any initial "is" on a symbolic value during loose
>     > matching is likely to produce the best results in application areas such
>     > as regex. Removal of an initial "is" string for a loose matching
>     > comparison only needs to be done once for a symbolic value, and need not
>     > be tested recursively. There are no property aliases or property value
>     > aliases of the form "isisisisistooconvoluted" defined just to test
>     > implementation edge cases.
> 
>     UAX44 provides the reason for the existence of this “feature”:
> 
>     > The reason for this is that APIs returning property values are often
>     > named using the convention of prefixing "is" (or "Is" or "Is_", and so
>     > forth) to a property value.
> 
>     That seems like a rather weak argument. Specifically applying this to
> UTS18 (Unicode regular expressions):
> 
>     > "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"
> 
>     If there is already a way to match all symbols in the Greek script (not
> counting the use of aliases and other loose matching requirements), i.e.
> `Script=Greek` — what good does it do to add support for yet another one?
> 
>     Looking at implementations in the wild, Steven Levithan found
> (https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062)
> that some regex flavors use `Is` for scripts, some for blocks, some for
> scripts and blocks, some for neither. Since some script and block names
> collide, this causes problems, especially when porting regexes across flavors.
> 
>     The `is` prefix doesn’t provide any functionality that would otherwise be
> unavailable. It adds no value, yet it causes incompatibility and author
> confusion, and it increases implementation complexity. UAX44 includes two
> entire paragraphs pointing out that last part:
> 
>     > Removal of an initial "is" string for a loose matching comparison only
>     > needs to be done once for a symbolic value, and need not be tested
>     > recursively. There are no property aliases or property value aliases of
>     > the form "isisisisistooconvoluted" defined just to test implementation
>     > edge cases.
>     >
>     > Existing and future property aliases and property value aliases are
>     > guaranteed to be unique within their relevant namespaces, even if an
>     > initial prefix string "is" is ignored. The existing cases of note for
>     > aliases that do start with "is" are: dt=Iso
>     > (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value
>     > alias does not cause any problem, because there is no contrasting value
>     > alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is
>     > the entire property value alias, and is not a prefix. There is no null
>     > value for the Line_Break property for it to contrast with, but
>     > implementations of loose matching should be careful of this edge case,
>     > so that "lb=IS" is not misinterpreted as matching a null value.
> 
> 
>     Backwards compatibility seems to be the only good reason to continue
> supporting the `is` prefix *for existing implementations*, such as the one in
> Perl. But why is it still a requirement for new engines to support it as part
> of UAX44-LM3?
> 
>     I’d like to propose changing UAX44-LM3 to make supporting the `is` prefix
> optional for new implementations.
> 
> 
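
For readers following along, here is a minimal Python sketch of the loose
matching discussed in the quoted text. It assumes the full UAX44-LM3 rule
(ignore case, whitespace, underscores, hyphens, and one initial "is"); the
function name `loose_key` is hypothetical and is not taken from any regex
engine. It is an illustration of the rule as quoted, not a reference
implementation.

```python
import re


def loose_key(symbolic_value: str) -> str:
    """Return a comparison key for a symbolic value (hypothetical helper).

    Assumed rule: ignore case, whitespace, underscores, hyphens,
    and a single initial "is" prefix.
    """
    # Drop whitespace, underscores, and hyphens, then fold case.
    key = re.sub(r"[\s_\-]", "", symbolic_value).lower()
    # Strip "is" once only, never recursively. Keep a bare "is" intact so
    # that lb=IS is not reduced to an empty (null) value.
    if key.startswith("is") and len(key) > 2:
        key = key[2:]
    return key


if __name__ == "__main__":
    for name in ("Greek", "isGreek", "Is_Greek", "IS", "Iso"):
        print(name, "->", loose_key(name))
```

Two values match loosely when their keys are equal, so "Greek", "isGreek",
and "Is_Greek" all compare equal, while "IS" keeps the key "is". Note that
"Iso" reduces to "o", which is harmless only because no contrasting alias
"o" exists, as the quoted UAX44 paragraph points out.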
