Why is pattern-matching of NULs slow?

Karl Williamson public at khwilliamson.com
Sat Apr 9 11:26:34 CDT 2022


On 4/8/22 17:27, David Starner via Unicode wrote:
> On Fri, Apr 8, 2022 at 6:25 AM Roger L Costello via Unicode
> <unicode at corp.unicode.org> wrote:
>> Why would pattern-matching NULs be slower than pattern-matching other characters?
> 
> Flex is written in C, and C strings use NUL as a terminator, and can't
> include NUL. The demand for Flex to handle NULs would be pretty
> minimal, it's mostly used on text documents that don't have NUL, and
> so I suspect someone tossed in a hack to make it work with NUL when it
> had to, and nobody has been back to fix it. It's mature software,
> without a release in five years, so I don't see that changing.
> 

I can take Perl as a teaching example.  Right off the bat it was used 
for parsing binary data, so it had to accept embedded NULs.

Routines had to be written by hand to duplicate libc functions while 
allowing those NULs.  Over the years, various libc functions such as 
memchr() and memmem() have been added that do allow embedded NULs, and 
Perl has been converted to use them on platforms that provide them.
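
To make the difference concrete, here is a minimal sketch (not Perl's 
actual code; the helper names find_byte and find_byte_cstr are made up 
for illustration).  It contrasts a length-based search with memchr() 
against strchr(), which treats the first NUL as the end of the data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search a buffer of known length.
 * Embedded NULs are ordinary data because memchr() is given the length. */
static ptrdiff_t find_byte(const char *buf, size_t len, char c)
{
    const char *p = memchr(buf, (unsigned char)c, len);
    return p ? p - buf : -1;
}

/* Hypothetical helper: the same search with strchr().
 * It stops at the first NUL, so anything after that is never examined. */
static ptrdiff_t find_byte_cstr(const char *buf, char c)
{
    const char *p = strchr(buf, c);
    return p ? p - buf : -1;
}
```

For the 5-byte buffer "ab\0cd", find_byte(buf, 5, 'd') returns 4, while 
find_byte_cstr(buf, 'd') returns -1 because the search ends at the NUL.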

But there remain many functions that accept only NUL-terminated strings, 
and so workarounds are used.  In some cases, that means re-implementing 
the libc function in C code.  Often the libc version will be implemented 
in assembly language, making it faster than Perl's C version.
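
As an illustration of what such a re-implementation can look like, here 
is a sketch of a length-aware substring search in plain C, along the 
lines of a fallback for platforms without memmem().  The name 
find_bytes is hypothetical and this is not Perl's code; it is a naive 
version, which is exactly why a tuned libc routine would usually beat 
it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fallback: find needle (needle_len bytes) inside hay
 * (hay_len bytes).  Both may contain NULs.  Returns a pointer to the
 * first match, or NULL if there is none. */
static const char *find_bytes(const char *hay, size_t hay_len,
                              const char *needle, size_t needle_len)
{
    if (needle_len == 0)
        return hay;
    if (needle_len > hay_len)
        return NULL;

    const char *last = hay + (hay_len - needle_len); /* last possible start */
    const char *p = hay;

    while (p <= last) {
        /* Jump to the next candidate first byte within the valid range. */
        p = memchr(p, (unsigned char)needle[0], (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, needle, needle_len) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

Unlike strstr(), both lengths are passed explicitly, so a needle such as 
the two bytes NUL, 'c' can be found in the middle of binary data.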

A particularly flagrant example is strxfrm() for collating text.  Perl 
did not want to re-implement the complex locale handling that this 
function performs.  So, there is a wrapper for it that splits the 
string into NUL-terminated segments, and plays some shenanigans, all of 
which take extra cycles.
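
The general shape of such a wrapper can be sketched as follows.  This 
is not Perl's implementation: the function name xfrm_with_nuls and the 
choice of joining the transformed segments with a 0x01 byte are 
assumptions made only for illustration, and a real wrapper has to be 
more careful about how the joined pieces collate.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: run strxfrm() over each NUL-terminated segment
 * of buf (len bytes, may contain NULs) and join the results with 0x01.
 * Returns a malloc'd buffer and stores its length in *out_len, or
 * returns NULL if allocation fails.  The caller frees the result. */
static char *xfrm_with_nuls(const char *buf, size_t len, size_t *out_len)
{
    /* Work on a copy with a guaranteed trailing NUL, so that every
     * segment, including the last one, is NUL-terminated. */
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, buf, len);
    copy[len] = '\0';

    char *out = NULL;
    size_t used = 0;
    size_t pos = 0;

    for (;;) {
        const char *seg = copy + pos;
        size_t seg_len = strlen(seg);

        /* First call only asks how much room the result needs. */
        size_t need = strxfrm(NULL, seg, 0);

        /* +1 for the NUL strxfrm() writes, +1 for a possible separator. */
        char *grown = realloc(out, used + need + 2);
        if (grown == NULL) {
            free(out);
            free(copy);
            return NULL;
        }
        out = grown;

        strxfrm(out + used, seg, need + 1);
        used += need;

        pos += seg_len;
        if (pos >= len)
            break;
        out[used++] = '\x01'; /* assumed separator standing in for the NUL */
        pos++;                /* step over the embedded NUL */
    }

    free(copy);
    *out_len = used;
    return out;
}
```

Even in this simplified form the extra cost is visible: a copy of the 
input, two strxfrm() calls and a reallocation per segment, where a 
NUL-free string would need a single pass.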

