Why is pattern-matching of NULs slow?
Karl Williamson
public at khwilliamson.com
Sat Apr 9 11:26:34 CDT 2022
On 4/8/22 17:27, David Starner via Unicode wrote:
> On Fri, Apr 8, 2022 at 6:25 AM Roger L Costello via Unicode
> <unicode at corp.unicode.org> wrote:
>> Why would pattern-matching NULs be slower than pattern-matching other characters?
>
> Flex is written in C, and C strings use NUL as a terminator, and can't
> include NUL. The demand for Flex to handle NULs would be pretty
> minimal, it's mostly used on text documents that don't have NUL, and
> so I suspect someone tossed in a hack to make it work with NUL when it
> had to, and nobody has been back to fix it. It's mature software,
> without a release in five years, so I don't see that changing.
>
I can take Perl as a teaching example. Right off the bat it was used
for parsing binary data, so it had to accept embedded NULs.
Functions had to be written by hand to duplicate libc behavior while
allowing those NULs. Over the years, libc functions that do allow
embedded NULs, such as memchr() and memmem(), became available, and
Perl was converted to use them on platforms where they are provided.
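To illustrate the difference, here is a self-contained C sketch
(contains_byte() is just an illustrative name, not anything from Perl
or libc):

```c
#include <stddef.h>
#include <string.h>

/* Return nonzero if byte c occurs in the first len bytes of buf.
 * memchr() takes an explicit length, so it scans right past any
 * embedded NULs; strchr() would stop at the first NUL it sees. */
int contains_byte(const char *buf, size_t len, int c) {
    return memchr(buf, c, len) != NULL;
}
```

Given the 7-byte buffer "abc\0def", strchr(buf, 'd') returns NULL
because it treats the embedded NUL as the end of the string, while
contains_byte(buf, 7, 'd') finds the 'd'.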
But many libc functions still accept only NUL-terminated strings, so
workarounds are used. In some cases, that means re-implementing the
libc function in C code. The libc version is often implemented in
assembly language, making it faster than Perl's C version.
A particularly flagrant example is strxfrm(), used for collating text.
Perl did not want to re-implement the complex locale handling that
this function performs. So there is a wrapper for it that splits the
string into NUL-terminated segments and plays some shenanigans, all of
which take extra cycles.
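The idea is roughly the following. This is a hypothetical sketch, not
Perl's actual code (xfrm_with_nuls() is an invented name), and it
assumes, as is true of Perl's string buffers, that src[len] == '\0':

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: transform 'len' bytes of 'src', which may contain embedded
 * NULs, by running strxfrm() on each NUL-terminated segment and
 * concatenating the results, keeping a NUL between segments as a
 * separator.  Assumes src[len] == '\0'.  Returns a malloc'd buffer
 * and stores its length in *result_len; returns NULL on failure. */
char *xfrm_with_nuls(const char *src, size_t len, size_t *result_len) {
    size_t cap = 64, out_len = 0, i = 0;
    char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    while (i <= len) {
        const char *seg = src + i;
        size_t seg_len = strlen(seg);          /* stops at the next NUL */

        /* With a size of 0, strxfrm() just reports how much space the
         * transformed segment needs -- one of the extra passes that
         * cost cycles compared to a single strxfrm() call. */
        size_t need = strxfrm(NULL, seg, 0);

        if (out_len + need + 2 > cap) {
            while (out_len + need + 2 > cap)
                cap *= 2;
            char *tmp = realloc(out, cap);
            if (tmp == NULL) {
                free(out);
                return NULL;
            }
            out = tmp;
        }
        strxfrm(out + out_len, seg, need + 1); /* writes need bytes + NUL */
        out_len += need;

        i += seg_len + 1;                      /* step past the segment's NUL */
        if (i <= len)
            out[out_len++] = '\0';             /* keep a separator in the output */
    }
    out[out_len] = '\0';
    *result_len = out_len;
    return out;
}
```

The per-segment strlen() and size-query calls are where the extra
cycles go: a string with no embedded NULs could have been handed to
strxfrm() once.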