Re: tight loop regex optimization

by BrowserUk (Pope)
on Nov 01, 2011 at 07:20 UTC ( #935047=note )


in reply to tight loop regex optimization

Skip to the bottom first!

There are a few places in the code where small savings can be achieved essentially for free:

  • 1 while $re; runs ~10% faster than while( $re ) {}.
  • $re && last; runs a few percent faster than $re && do{ last };

But even if you could make this level of savings on every single line in the program, you'd still save maybe an hour at most.
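
(If you want to check that sort of claim on your own perl before relying on it, a quick Benchmark sketch along these lines will do; the regex and data below are made up purely for illustration.)

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $text = '{ a { b } c } ' x 100;

    cmpthese( -3, {
        # postfix form: 1 while EXPR;
        postfix_while => sub {
            my $s = $text;
            1 while $s =~ s/\{([^{}]*)\}/$1/;
        },
        # equivalent block form: while( EXPR ) {}
        block_while => sub {
            my $s = $text;
            while ( $s =~ s/\{([^{}]*)\}/$1/ ) {}
        },
    } );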

Looking at a few of the individual REs, nothing leaps off the page as being particularly extravagant. You should, however, heed the annotation at the top of the profiling output and try to remove the usage of $&. This has been known to effect a substantial time saving.

The only place affected is this sub:

    sub java_clean {
        my $contents = $_[0];

        while ($contents =~ s/(\{[^\{]*)\{([^\{\}]*)\}/ $1."\05".&wash($2)/ges) {}
        $contents =~ s/\05/\{\}/gs;

        # Remove imports
        ##$contents =~ s/^\s*import.*;/&wash($&)/gem;
        $contents =~ s/(^\s*import.*;)/&wash($1)/gem;

        # Remove packages
        ##$contents =~ s/^\s*package.*;/&wash($&)/gem;
        $contents =~ s/(^\s*package.*;)/&wash($1)/gem;

        return $contents;
    }

The uncommented replacements should have the same effect (untested), and the change could have a substantial impact on the overall performance of a script dominated by regex manipulations.

While you're at it, you can also add a few micro-optimisations where they are called millions of times like:

    sub wash {
        ##### my $towash = $_[0];
        return ( "\n" x ( $_[0] =~ tr/\n// ) );
    }

which will save the 7 seconds spent copying the input parameter. But given that the overall runtime is 7 minutes, that's not going to have a big effect. The only way you're likely to get substantial savings from within the script is to optimise the algorithms used -- which amounts to tuning all of the individual regexes, and the heuristics they represent -- and that comes with an enormous risk of breaking the logic completely and would require extensive and detailed testing.

Bottom line

All of that said, if you split the workload across two processors, you're likely to achieve close to a 50% saving. Across 4, a 75% saving is theoretically possible. It really doesn't make much sense to spend time looking for savings within the script when, with a little restructuring, it lends itself so readily to being parallelised.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: tight loop regex optimization
by superawesome (Initiate) on Nov 02, 2011 at 04:47 UTC

    On the $& fix, all I can say is *thank you*. I had investigated this previously, but I was completely misreading the warning in NYTProf as complaining that line 32 itself was doing this, which I couldn't figure out at all. You actually found the offending lines and even offered a fix! After making your change, the warning does indeed go away. From the docs this seems like a safe change to make.

    Sadly, this seems to have a very small effect in this particular environment / workload. A 38-second run was virtually unaffected... +/- one second. On a 15-minute repo, it fell by about 4-5 seconds. There are bigger repos that might show a bigger benefit, but I suspect we'll only shave at most a few minutes off over a full day's work. Still, every little bit helps and fixing the warning is nice in and of itself. :)

    I'm working on some parallelization for this, which should help. I'm slightly concerned this might overload the system it runs on, but that can be fixed with the proper application of a wallet.

    For example, we have a set of repos that have this script run on them every 4 hours. With a very rudimentary parallelization algorithm in the calling script, the runtime fell from about 1h50m to about 1h15m. This was just moving from serial execution to 2 parallel jobs... with some inefficiencies due to very simple shell backgrounding and "wait" semantics. A better implementation (say, using GNU parallel) may get it down to an hour flat.

    At the same time there's a much bigger set of repos that get processed daily. I think it takes over 24 hours to run (separate issue: it stopped reporting its status regularly). This runs concurrently, and obviously can overlap the 4-hour jobs. If I parallelize this daily job as well (2 processes), I run the risk of simultaneously maxing out all 4 cores in this particular system... which would (possibly) be bad, because it also runs the web app that these scripts populate the data for. So I kinda need to be careful how I do this.

    Still, it's clearly a very good approach to reducing wall-clock time. I was primarily hoping someone would spot something egregiously wrong with the regex's that I could fix, but that seems not to be the case. Oh well... guess we'll do it the hard way. :)

      With a very rudimentary parallelization algorithm in the calling script, the runtime fell from about 1h50m to about 1h15m. This was just moving from serial execution to 2 parallel jobs... with some inefficiencies due to very simple shell backgrounding and "wait" semantics.

      That smacks of very poor parallelisation. The "usual suspect" in this scenario is splitting the workload -- in this case the list of files -- in "half" (numerically) and running an independent process on each "half". The problem, of course, is that equal numbers of files are usually very unequal in terms of runtime, because of differences in the size and complexity of the individual files. Sod's law dictates that all the big, complex files will end up in one half, leaving all the short, simple ones in the other. Hence, after the first few minutes, the latter runs out of files to process and you're back to a single linear workflow.

      Far better, and actually trivial to implement, is to remove the for $f (@f) { file processing loop from each of the findxxxx() routines and then feed the file list into a shared queue. You then run the findxxxx() sub in two or more threads that read the names of the next file to process from the queue and do their thing.

      The process then becomes self-balancing. Allowing one thread to process many small files concurrently with another that processes a few large ones. Or whatever combination of the two that happens to result in the way they come off the queue.

      This requires a little restructuring of the existing script, but it would definitely be better for it. Bringing the script into the modern world by adding strictness and warnings and a level of modularisation wouldn't go amiss either.
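
      Something like this (untested) sketch shows the shape of it; findxxxx() and @f here stand in for your existing routines and file list:

          use strict;
          use warnings;
          use threads;
          use Thread::Queue;

          # Stub standing in for one of the real findxxxx() routines, with its
          # for $f (@f) loop removed so it handles a single file per call.
          sub findxxxx {
              my ($file) = @_;
              print "processing $file\n";
          }

          my @f = glob '*.java';            # the existing file list
          my $Q = Thread::Queue->new();

          sub worker {
              # Each thread pulls the next filename as soon as it is free, so
              # the load balances itself across big and small files.
              while ( defined( my $file = $Q->dequeue() ) ) {
                  findxxxx( $file );
              }
          }

          my @workers = map { threads->create( \&worker ) } 1 .. 2;

          $Q->enqueue( @f );
          $Q->enqueue( (undef) x @workers );    # one terminator per thread
          $_->join() for @workers;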

      If I parallelize this daily job as well (2 processes), I run the risk of simultaneously maxing out all 4 cores in this particular system... which would (possibly) be bad, because it also runs the web app that these scripts populate the data for. So I kinda need to be careful how I do this.

      I've encountered this reasoning for avoiding parallelisation before. And whilst it seems like a good idea to reserve some cpu overhead for "important stuff", it doesn't hold water under closer inspection. You cannot "bank" CPU cycles and utilise them later. Unused CPU time is pure cost.

      Like underutilised staff in an empty shop, you are still paying the wages and keeping the lights on. Even on the very latest CPUs with complex power management, when the CPU goes to sleep, the RAM, IO cards, disks and power supply are still using their full quota of energy.

      The way to ensure "important stuff" gets done in a timely manner, is to use the prioritisation capabilities of your OS. Your web server should be running with 'IO priority' (ABOVENORMAL), whilst CPU-intensive tasks like this are running with 'background priority' (BELOWNORMAL).

      In this way, the background workload will never usurp the web server for attention when it needs to do something, but it will utilise every scrap of CPU that is going spare.

      There is nothing wrong with 100% utilisation, provided that prioritisation ensures that the right job gets the CPU when it needs it.
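
      On a unixish box you can even do that from inside the script itself; a one-line sketch, assuming the usual setpriority(2) semantics where a higher number means a lower priority:

          # 0, 0 means "this process"; 10 is a low (background) priority,
          # much like launching the whole job under nice(1).
          setpriority( 0, 0, 10 );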

      Oh well... guess we'll do it the hard way.

      There are actually many refinements to the process you are currently using that could potentially provide even greater benefits than parallelisation.

      For example: How many of those source files change on a daily or 4-hourly basis?

      It would be a pretty trivial enhancement to retain and check the modification time -- or even MD5 checksum if you're paranoid -- for each of the source files and only reprocess those that actually change.
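
      Something along these lines would do (untested sketch; the cache filename and the helper name are invented for illustration):

          use strict;
          use warnings;
          use Digest::MD5;
          use Storable qw(retrieve nstore);

          my $CACHE = '.xref_md5.cache';    # invented cache file

          # Return only the files whose contents have changed since last run.
          sub changed_files {
              my @files = @_;
              my $seen  = -e $CACHE ? retrieve( $CACHE ) : {};
              my @changed;
              for my $file ( @files ) {
                  open my $fh, '<', $file or next;
                  binmode $fh;
                  my $md5 = Digest::MD5->new->addfile( $fh )->hexdigest;
                  push @changed, $file
                      if !defined $seen->{ $file } or $seen->{ $file } ne $md5;
                  $seen->{ $file } = $md5;
              }
              nstore( $seen, $CACHE );
              return @changed;
          }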

      Unless you have dozens of programmers constantly messing with your code-base, I could see that easily implemented change reducing your xrefing workload to single-digit percentages of your current setup.

      Even better, assuming that you are using source control -- and that source control is flexible enough to allow you to automate the process -- have the check-in process re-invoke the xrefing process only when a file is modified.

      Again, if you are paranoid, you could run the code-base through a full xref cycle once per week at the weekend. I.e. use the equivalent of a weekly-full/daily-differential backup procedure.


        That smacks of very poor parallelisation.

        Yep, this is exactly as you describe. Each repo being processed takes a different amount of time, and I'm simply running X of them and waiting for all X to finish before starting the next round of X jobs. If they take roughly the same amount of time, this works pretty well... if they are very disparate, it's not much better than serial. By my estimation I'm within 20% of optimal on this data set. I don't think this algorithm will get much closer without some hand-optimization. Still, even this simple algorithm has reduced the running time by more than half.

        The calling script is currently a shell script, not Perl... I may switch that to get more power here. Alternatively I may use something like GNU Parallel, which should be just as good at this level. I still won't have parallelism within a given job (which would be somewhat more efficient in all cases), but I can live with that. This is already a major improvement in runtime.
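
        If I do switch the calling script to Perl, something like Parallel::ForkManager should give dynamic scheduling in a handful of lines. A rough sketch (the repo list and the exact genxref invocation below are placeholders):

            use strict;
            use warnings;
            use Parallel::ForkManager;

            my @repos = </data/repos/*>;                 # placeholder repo list
            my $pm    = Parallel::ForkManager->new( 2 ); # at most 2 jobs at once

            for my $repo ( @repos ) {
                $pm->start and next;           # parent: move on to the next repo
                system( 'genxref', $repo );    # child: run one job
                $pm->finish;
            }
            # No round barriers: a new job starts as soon as a slot frees up.
            $pm->wait_all_children;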

        Far better, and actually trivial to implement, is to remove the for $f (@f) { file processing loop from each of the findxxxx() routines and then feed the file list into a shared queue. You then run the findxxxx() sub in two or more threads that read the names of the next file to process from the queue and do their thing.

        I think I could hack in a Thread::Queue in the bigger subs, enqueue the contents of @f, fork off worker threads, and have them dequeue from that queue. Shouldn't be *too* bad. One concern would be the @ft array, which is currently used side-by-side with @f... this seems like a guaranteed synchronization problem, and I suspect would need to be changed.
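
        One way around that (a sketch, assuming @ft holds a per-file attribute parallel to @f) would be to enqueue the two as a single unit, so a worker always dequeues a matched pair:

            # $Q is a Thread::Queue as in the sketch above; queue [file, type]
            # pairs instead of keeping two parallel arrays in sync.
            $Q->enqueue( map { [ $f[$_], $ft[$_] ] } 0 .. $#f );

            # Worker side:
            while ( defined( my $job = $Q->dequeue() ) ) {
                my ( $file, $type ) = @$job;
                # process one file of one type here
            }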

        What I've done so far is to approach this from the position of parallelizing the calls to this script, rather than internally within the script. One benefit of this is that it parallelizes everything the calling script does, of which genxref is the biggest (but not the only) job. It also has the benefit of not messing with the existing design... it feels a lot "safer" than altering the genxref script itself, and should still get me most of the benefit in practice. It won't be *as* good, but much better than serial.

        whilst it seems like a good idea to reserve some cpu overhead for "important stuff", it doesn't hold water under closer inspection. You cannot "bank" CPU cycles and utilise them later. The way to ensure "important stuff" gets done in a timely manner, is to use the prioritisation capabilities of your OS.

        While I agree with this in theory (and in fact the calling script already uses "nice" on every individual invocation), I'm generally a bit hesitant to go all-out right away. I can't guarantee I won't wedge the system with I/O wait or context switches, so I'm slowly ramping up the parallelism. I do expect to eventually parallelize both the daily and 4-hour jobs, I'm just being overly cautious about it. :)

        How many of those source files change on a daily or 4-hourly basis?

        An excellent observation... this would be a massive improvement, I think. We do have a lot of committers, but realistically the changes are often isolated to a comparatively small portion of the files. This kind of speed-up would probably mean we could process far more often than daily. On-commit is not out of the realm of possibility. Thanks for the idea!
