http://www.perlmonks.org?node_id=935021

superawesome has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to speed up a few regexes in a script that gets called a few dozen times a day. Each invocation basically loops through a ton of source code and builds a sort of searchable index.

The problem is that a single run of this script now takes more than a day. There's some parallelization that can be done, but I'm hopeful there's something to be gained within each script as well.

The script in question is MXR's "genxref": here

Here's a relevant NYTProf run (one of the dozens that gets run daily, across different source repos): here. You can see some lines are getting hit a million times or more.

Here's a good example fragment:

# Remove nested parentheses.
while ($contents =~ s/\(([^\)]*)\(/\($1\05/g ||
       $contents =~ s/\05([^\(\)]*)\)/ $1 /g) {}

This is one problematic snippet, but hardly the only one... the script is littered with complicated regexes. Most of them are quick enough as-is, but some (like the above) have become a significant performance bottleneck as our code base has grown.

How might I improve this situation? Specific improvements and general ideas are both welcome... I know the basics from a theoretical perspective (don't capture if you don't have to, try not to backtrack, etc.), but not how to spot or fix problems; I don't have enough real-world experience with this.

Thanks!

Replies are listed 'Best First'.
Re: tight loop regex optimization
by moritz (Cardinal) on Nov 01, 2011 at 07:09 UTC

    You are using the $& special variable in your program. perlre warns:

    WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, "$`" or "$'", then patterns without capturing parentheses will not be penalized. So avoid $&, "$'", and "$`" if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.

    So avoid using $& and friends. A simple workaround is to put parens around the whole regex, and use $1 instead.
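    For example, genxref's java_clean() (quoted in full in a reply below) contains exactly this kind of rewrite:

    # Before: forces Perl to maintain $& for every match in the program.
    ##$contents =~ s/^\s*import.*;/&wash($&)/gem;

    # After: same replacement, but the whole match is captured explicitly.
    $contents =~ s/(^\s*import.*;)/&wash($1)/gem;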

    Depending on your perl version, perlre might offer other workarounds.

      I had seen this warning, but was mis-reading it as indicating that the line "use strict" was actually *causing* it, which I just could not figure out. Once the proper lines were pointed out, the fix was exactly what you mentioned (parens and $1).

      Unfortunately this didn't really help performance much in this particular case... but it's a good fix anyway. Thanks!

Re: tight loop regex optimization
by BrowserUk (Patriarch) on Nov 01, 2011 at 07:20 UTC

    Skip to the bottom first!

    There are a few places in the code where small savings can be achieved essentially for free:

    • 1 while $re; runs ~10% faster than while( $re ) {}.
    • $re && last; runs a few percent faster than $re && do{ last };

    But even if you could make this level of savings on every single line in the program, you'd still save maybe an hour at most.
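    (If you want to verify that kind of micro-claim on your own hardware, the core Benchmark module makes it a one-screen job; the regex and input below are made up for illustration, not taken from genxref:)

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $text = 'foo(bar(baz(qux)))' x 20;

    # Compare the two loop spellings over the same substitution.
    cmpthese( -2, {
        postfix_while => sub {
            my $s = $text;
            1 while $s =~ s/\(([^()]*)\)/$1/;
        },
        block_while => sub {
            my $s = $text;
            while ( $s =~ s/\(([^()]*)\)/$1/ ) {}
        },
    } );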

    Looking at a few of the individual REs, nothing leaps off the page as being particularly extravagant. You should heed the annotation at the top of the profiling output and try to remove the usage of $&. This has been known to effect a substantial time saving.

    The only place affected is this sub:

    sub java_clean {
        my $contents = $_[0];
        while ($contents =~ s/(\{[^\{]*)\{([^\{\}]*)\}/ $1."\05".&wash($2)/ges) {}
        $contents =~ s/\05/\{\}/gs;

        # Remove imports
        ##$contents =~ s/^\s*import.*;/&wash($&)/gem;
        $contents =~ s/(^\s*import.*;)/&wash($1)/gem;

        # Remove packages
        ##$contents =~ s/^\s*package.*;/&wash($&)/gem;
        $contents =~ s/(^\s*package.*;)/&wash($1)/gem;

        return $contents;
    }

    The uncommented replacements should have the same effect (untested), and the change could have a substantial effect on the overall performance of a script dominated by regex manipulations.

    While you're at it, you can also add a few micro-optimisations where they are called millions of times like:

    sub wash {
        ##### my $towash = $_[0];
        return ( "\n" x ( $_[0] =~ tr/\n// ) );
    }

    which will save the 7 seconds spent copying the input parameter. But given that the overall runtime is 7 minutes, that's not going to have a big effect. The only way you're likely to get substantial savings from within the script is to try optimising the algorithms used -- which amounts to tuning all of the individual regexes, and the heuristics they represent -- and that comes with enormous risk of breaking the logic completely and would require extensive and detailed testing.

    Bottom line

    All of that said, if you split the workload across two processors, you're likely to achieve close to a 50% saving; across 4, a 75% saving is theoretically possible. It really doesn't make much sense to spend time looking for savings within the script when, with a little restructuring, it lends itself so readily to being parallelised.



      On the $& fix, all I can say is *thank you*. I had investigated this previously, but I was completely misreading the warning in NYTProf as complaining that line 32 itself was doing this, which I couldn't figure out at all. You actually found the offending lines and even offered a fix! After making your change, the warning does indeed go away. From the docs this seems like a safe change to make.

      Sadly, this seems to have a very small effect in this particular environment / workload. A 38-second run was virtually unaffected... +/- one second. On a 15-minute repo, it fell by about 4-5 seconds. There are bigger repos that might show a bigger benefit, but I suspect we'll only shave at most a few minutes off over a full day's work. Still, every little bit helps and fixing the warning is nice in and of itself. :)

      I'm working on some parallelization for this, which should help. I'm slightly concerned this might overload the system it runs on, but that can be fixed with the proper application of a wallet.

      For example, we have a set of repos that have this script run on them every 4 hours. With a very rudimentary parallelization algorithm in the calling script, the runtime fell from about 1h50m to about 1h15m. This was just moving from serial execution to 2 parallel jobs... with some inefficiencies due to very simple shell backgrounding and "wait" semantics. A better implementation (say, using GNU parallel) may get it down to an hour flat.

      At the same time, there's a much bigger set of repos that gets processed daily. I think it takes over 24 hours to run (separate issue: it stopped reporting its status regularly). This runs concurrently, and obviously can overlap the 4-hour jobs. If I parallelize this daily job as well (2 processes), I run the risk of simultaneously maxing out all 4 cores in this particular system... which would (possibly) be bad, because it also runs the web app that these scripts populate the data for. So I kinda need to be careful how I do this.

      Still, it's clearly a very good approach to reducing wall-clock time. I was primarily hoping someone would spot something egregiously wrong with the regexes that I could fix, but that seems not to be the case. Oh well... guess we'll do it the hard way. :)

        With a very rudimentary parallelization algorithm in the calling script, the runtime fell from about 1h50m to about 1h15m. This was just moving from serial execution to 2 parallel jobs... with some inefficiencies due to very simple shell backgrounding and "wait" semantics.

        That smacks of very poor parallelisation. The "usual suspect" in this scenario is splitting the workload -- in this case the list of files -- in "half" (numerically) and running an independent process on each "half". The problem, of course, is that equal numbers of files are usually very unequal in terms of runtime, because of differences in the size and complexity of the individual files. Sod's law dictates that all the big, complex files will end up in one half, leaving all the short, simple ones in the other. Hence after the first few minutes, the latter runs out of files to process and you're back to a single linear workflow.

        Far better, and actually trivial to implement, is to remove the for $f (@f) { file-processing loop from each of the findxxxx() routines and instead feed the file list into a shared queue. You then run the findxxxx() sub in two or more threads, each of which reads the name of the next file to process from the queue and does its thing.

        The process then becomes self-balancing, allowing one thread to process many small files concurrently with another that processes a few large ones, or whatever combination of the two happens to result from the way they come off the queue.
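        A bare-bones sketch of that queue/worker structure (untested; process_file() here is just a hypothetical stand-in for one pass of the existing per-file loop, and @ARGV stands in for the real file list):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my $WORKERS = 2;                  # raise as cores allow
        my @files   = @ARGV;              # stand-in for the real file list

        my $q = Thread::Queue->new();
        $q->enqueue( @files, (undef) x $WORKERS );   # one undef per worker means "no more work"

        my @pool = map {
            threads->create( sub {
                while ( defined( my $file = $q->dequeue() ) ) {
                    process_file($file);
                }
            } );
        } 1 .. $WORKERS;

        $_->join() for @pool;

        sub process_file {
            my ($file) = @_;
            # ... the body of the old "for $f (@f)" loop goes here ...
        }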

        This requires a little restructuring of the existing script, but it would definitely be better for it. Bringing the script into the modern world by adding strictness and warnings and a level of modularisation wouldn't go amiss either.

        If I parallelize this daily job as well (2 processes), I run the risk of simultaneously maxing out all 4 cores in this particular system... which would (possibly) be bad, because it also runs the web app that these scripts populate the data for. So I kinda need to be careful how I do this.

        I've encountered this reasoning for avoiding parallelisation before. And whilst it seems like a good idea to reserve some cpu overhead for "important stuff", it doesn't hold water under closer inspection. You cannot "bank" CPU cycles and utilise them later. Unused CPU time is pure cost.

        Like underutilised staff in an empty shop, you are still paying the wages and keeping the lights on. Even on the very latest CPUs with complex power management, when the CPU goes to sleep, the RAM, IO cards, disks and the power supply are still using their full quota of energy.

        The way to ensure "important stuff" gets done in a timely manner, is to use the prioritisation capabilities of your OS. Your web server should be running with 'IO priority' (ABOVENORMAL), whilst CPU-intensive tasks like this are running with 'background priority' (BELOWNORMAL).

        In this way, the background workload will never usurp the web server for attention when it needs to do something, but it will utilise every scrap of CPU that is going spare.

        There is nothing wrong with 100% utilisation, provided that prioritisation ensures that the right job gets the CPU when it needs it.
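        (ABOVENORMAL/BELOWNORMAL are the Windows priority classes; on a Unix-like box the rough equivalent is to re-nice the batch job, for instance from inside the script itself -- a sketch, not something genxref does today:)

        use strict;
        use warnings;
        use POSIX ();

        # Drop this process to the lowest scheduling priority so the web
        # server always wins contention for the CPU; the xref job still
        # soaks up whatever cycles are left over.
        POSIX::nice(19);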

        Oh well... guess we'll do it the hard way.

        There are actually many refinements to the process you are currently using that could potentially provide even greater benefits than parallelisation.

        For example: How many of those source files change on a daily or 4-hourly basis?

        It would be a pretty trivial enhancement to retain and check the modification time -- or even MD5 checksum if you're paranoid -- for each of the source files and only reprocess those that actually change.

        Unless you have dozens of programmers constantly messing with your code-base, I could see that easily implemented change reducing your xrefing processing demands to single digit percentage points of your current setup.
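        Something along these lines (a sketch only; the cache file name is made up and @ARGV stands in for the usual source list):

        use strict;
        use warnings;
        use Storable qw(retrieve nstore);

        my $cache_file = 'xref-mtime.cache';
        my $seen  = -e $cache_file ? retrieve($cache_file) : {};
        my @files = @ARGV;                    # placeholder for the usual source list

        my @todo;
        for my $file (@files) {
            my $mtime = ( stat $file )[9];
            next unless defined $mtime;                          # vanished file
            next if exists $seen->{$file} && $seen->{$file} == $mtime;
            push @todo, $file;                                   # new or modified since last run
            $seen->{$file} = $mtime;
        }

        nstore( $seen, $cache_file );
        # ... hand @todo, rather than the full @files, to the xref passes ...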

        Even better, assuming that you are using source control -- and that source control is flexible enough to allow you to automate the process -- have the check-in process re-invoke the xrefing process only when a file is modified.

        Again, if you are paranoid, you could run the code-base through a full xref cycle once a week, at the weekend. I.e. use the equivalent of a weekly-full/daily-differential backup procedure.


Re: tight loop regex optimization
by ikegami (Patriarch) on Nov 01, 2011 at 05:34 UTC

    You're only going to see some limited benefit by tweaking the regex patterns. Shaving 20% off of 60 seconds is still 48 seconds.

    One thing I see is that you're scanning the file from top to bottom for each s/// operator, and often more than once per pattern. Your efforts might be better spent avoiding that. Some examples are self-contained,

    while ($contents =~ s/\05([^\n\05]+)\05/$1\05\05/gs) {}
        |
        |
        v
    $contents =~ s/\05([^\n\05]+)(?=\05)/$1\05/gs;

    But that's not going to help you remove this waste in general.
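    (Before committing to a rewrite like that, it's worth diffing the two forms on samples of your real input. A throwaway harness along these lines would do; the sample string here is made up, so add strings taken from real runs:)

    use strict;
    use warnings;

    my @samples = ( "foo\05identifier\05\nbar baz\n" );   # add real-world samples here

    for my $sample (@samples) {
        my $old = $sample;
        while ( $old =~ s/\05([^\n\05]+)\05/$1\05\05/gs ) {}

        my $new = $sample;
        $new =~ s/\05([^\n\05]+)(?=\05)/$1\05/gs;

        print $old eq $new ? "same\n" : "DIFFER:\n  old: $old\n  new: $new\n";
    }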

      Hmm... going to have to get help here. I'm not familiar enough with this code or its expected output to really tell when it's working or when I might introduce a subtle bug somewhere. This seems more like the latter scenario. Thanks for pointing this out!

      On this particular example, I'm not understanding how the two things are identical. The substitution pattern in your version has one less \05. Is that intentional? If so, how does that work? I don't really grok the look-around assertion, and/or how that's relevant to not needing the extra \05 in the substitution pattern.

      For that matter, I'm not sure what \05 even is. Most places seem to say that an octal character code requires exactly 3 digits, but some say you can get away with less, as long as the leading digit is zero. But, \05 interpreted as octal is non-printing... don't know what it is, or why it would be in these files. Any thoughts on that?

      Thanks!

        I don't really grok the look-around assertion

        Think of it as a subroutine call. The engine tries to match the sub-pattern at the current location, but the current location doesn't change.

                   +------------ Matched at pos 0.
                   |   +-------- Matched at pos 1.
                   |   |+------- Matched at pos 2.
                   |   || +----- Matched at pos 1.
                   |   || |+---- Matched at pos 2.
                   |   || ||+--- Matched at pos 3.
                   |   || |||
                   v   vv vvv
        'abcd' =~ /a(?!BC)..d/    # Matches

        Since the position in the outer regex isn't affected, replacements don't take what the sub-expression matched into consideration.

        my $s = 'abcd';
        $s =~ s/a(?=bc)/x/;    # 'xbcd' - lookahead matched, but only the 'a' was replaced

        $s = 'aefg';
        $s =~ s/a(?=bc)/x/;    # 'aefg' - lookahead failed, so nothing was replaced

        The substitution pattern in your version has one less \05. Is that intentional?

        Yes. One less \05 is being replaced, so one less \05 should be added.

        I'm not familiar enough with this code or its expected output to really tell when it's working or when I might introduce a subtle bug somewhere

        Then I guess the next order of business is to figure out the code and write test cases.