Re^3: tight loop regex optimization

by BrowserUk (Pope)
on Nov 02, 2011 at 07:02 UTC


in reply to Re^2: tight loop regex optimization
in thread tight loop regex optimization

With a very rudimentary parallelization algorithm in the calling script, the runtime fell from about 1h50m to about 1h15m. This was just moving from serial execution to 2 parallel jobs... with some inefficiencies due to very simple shell backgrounding and "wait" semantics.

That smacks of very poor parallelisation. The "usual suspect" in this scenario is splitting the workload -- in this case the list of files -- in "half" (numerically) and running an independent process on each "half". The problem, of course, is that equal numbers of files are usually very unequal in terms of runtime, because of differences in the size and complexity of the individual files. Sod's law dictates that all the big, complex files will end up in one half, leaving all the short, simple ones in the other. Hence, after the first few minutes, the latter half runs out of files to process and you're back to a single linear workflow.

Far better, and actually trivial to implement, is to remove the for $f (@f) { ... } file-processing loop from each of the findxxxx() routines and then feed the file list into a shared queue. You then run the findxxxx() sub in two or more threads that each read the name of the next file to process from the queue and do their thing.

The process then becomes self-balancing, allowing one thread to process many small files concurrently while another processes a few large ones, or whatever combination of the two happens to result from the way they come off the queue.
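
For illustration only, a minimal sketch of that structure using threads and Thread::Queue might look like this; process_one_file() stands in for the body of one of the findxxxx() routines with its file loop removed, and the file list is simply taken from the command line:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    # Stand-in for one findxxxx() routine, minus its internal file loop:
    # it now handles exactly one file per call.
    sub process_one_file {
        my ($file) = @_;
        # ... existing per-file xref logic goes here ...
    }

    my @f = @ARGV;                 # the file list the old loop used to walk
    my $Q = Thread::Queue->new();
    my $WORKERS = 2;               # tune to the available cores

    my @workers = map {
        threads->create( sub {
            # Each worker pulls the next file name until the queue drains.
            while ( defined( my $file = $Q->dequeue() ) ) {
                process_one_file( $file );
            }
        } );
    } 1 .. $WORKERS;

    $Q->enqueue( @f );
    $Q->enqueue( (undef) x $WORKERS );   # one terminator per worker
    $_->join() for @workers;

Whichever worker finishes its current file first simply takes the next one off the queue, which is exactly the self-balancing behaviour described above.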

This requires a little restructuring of the existing script, but it would definitely be better for it. Bringing the script into the modern world by adding strictness and warnings and a level of modularisation wouldn't go amiss either.

If I parallelize this daily job as well (2 processes), I run the risk of simultaneously maxxing out all 4 cores in this particular system... which would (possibly) be bad, because it also runs the web app that these scripts populate the data for. So I kinda need to be careful how I do this.

I've encountered this reasoning for avoiding parallelisation before, and whilst it seems like a good idea to reserve some CPU overhead for "important stuff", it doesn't hold water under closer inspection. You cannot "bank" CPU cycles and utilise them later. Unused CPU time is pure cost.

Like underutilised staff in an empty shop, you are still paying the wages and keeping the lights on. Even on the very latest CPUs with complex power management, when the CPU goes to sleep, the RAM, IO cards, disks and power supply are still using their full quota of energy.

The way to ensure "important stuff" gets done in a timely manner is to use the prioritisation capabilities of your OS. Your web server should be running with 'IO priority' (ABOVENORMAL), whilst CPU-intensive tasks like this run with 'background priority' (BELOWNORMAL).

In this way, the background workload will never usurp the web server for attention when it needs to do something, but it will utilise every scrap of CPU that is going spare.

There is nothing wrong with 100% utilisation, provided that prioritisation ensures that the right job gets the CPU when it needs it.
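
To make that concrete -- the ABOVENORMAL/BELOWNORMAL classes above are the Windows way; on a Unix-like box the same effect comes from the nice mechanism -- the background job can drop its own CPU priority with nothing more than the core POSIX module. A minimal sketch:

    use strict;
    use warnings;
    use POSIX qw(nice);

    # Lower this process's CPU priority before the heavy work starts.
    # nice() adds its argument to the current nice value, so positive
    # numbers make the process more "polite".
    defined nice(10)
        or warn "could not lower CPU priority: $!";

    # ... the CPU-intensive xref processing continues here, and will tend
    # to yield the CPU to higher-priority work such as the web app ...

On Linux, the IO side can be handled when launching the job, for example via the ionice utility.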

Oh well... guess we'll do it the hard way.

There are actually many refinements to the process you are currently using that could potentially provide even greater benefits than parallelisation.

For example: How many of those source files change on a daily or 4-hourly basis?

It would be a pretty trivial enhancement to retain and check the modification time -- or even MD5 checksum if you're paranoid -- for each of the source files and only reprocess those that actually change.
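
Purely for illustration -- the cache file name and the way the file list arrives are made up here -- a per-file mtime check could be as small as:

    use strict;
    use warnings;
    use Storable qw(retrieve nstore);

    # Hypothetical on-disk cache of each source file's last-seen mtime.
    my $cache_file = 'xref_mtimes.cache';
    my $seen = -e $cache_file ? retrieve( $cache_file ) : {};

    my @files = @ARGV;             # stands in for the script's real file list
    my @changed = grep {
        my $mtime = ( stat $_ )[9];
        my $dirty = !defined $seen->{$_} || $seen->{$_} != $mtime;
        $seen->{$_} = $mtime;
        $dirty;
    } @files;

    # ... reprocess only @changed, then persist the updated cache ...
    nstore( $seen, $cache_file );

Swap the mtime for a Digest::MD5 hash of the file contents if you want the paranoid version.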

Unless you have dozens of programmers constantly messing with your code-base, I could see that easily-implemented change reducing your xrefing processing demands to a single-digit percentage of your current load.

Even better, assuming that you are using source control -- and that source control is flexible enough to allow you to automate the process -- have the check-in process re-invoke the xrefing process only when a file is modified.
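
Purely as an illustration -- assuming, say, a Git repository and a hypothetical per-file reindex.pl entry point, neither of which genxref gives you out of the box -- a post-commit hook could be as small as:

    #!/usr/bin/perl
    # Illustrative .git/hooks/post-commit
    use strict;
    use warnings;

    # Files touched by the commit that just landed.
    my @changed = `git diff-tree --no-commit-id --name-only -r HEAD`;
    chomp @changed;
    exit 0 unless @changed;

    # Hand only the changed files to the (hypothetical) incremental pass.
    system( 'perl', 'reindex.pl', @changed ) == 0
        or warn "reindex failed: $?";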

Again, if you are paranoid, you could run the code-base through a full xref cycle once a week at the weekend; i.e. use the equivalent of a weekly full/daily differential backup procedure.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^4: tight loop regex optimization
by superawesome (Initiate) on Nov 03, 2011 at 00:43 UTC
    That smacks of very poor parallelisation.

    Yep, this is exactly as you describe. Each repo being processed takes a different amount of time, and I'm simply running X of them and waiting for all X to finish before starting the next round of X jobs. If they take roughly the same amount of time, this works pretty well... if they are very disparate, it's not much better than serial. By my estimation I'm within 20% of optimal on this data set. I don't think this algorithm will get much closer without some hand-optimization. Still, even this simple algorithm has reduced the running time by more than half.

    The calling script is currently a shell script, not Perl... I may switch that to get more power here. Alternatively I may use something like GNU Parallel, which should be just as good at this level. I still won't have parallelism within a given job (which would be somewhat more efficient in all cases), but I can live with that. This is already a major improvement in runtime.

    Far better, and actually trivial to implement, is to remove the for $f (@f) { ... } file-processing loop from each of the findxxxx() routines and then feed the file list into a shared queue. You then run the findxxxx() sub in two or more threads that each read the name of the next file to process from the queue and do their thing.

    I think I could hack in a Thread::Queue in the bigger subs, enqueue the contents of @f, fork off worker threads, and have them dequeue from that queue. Shouldn't be *too* bad. One concern would be the @ft array, which is currently used side-by-side with @f... this seems like a guaranteed synchronization problem, and I suspect it would need to be changed.
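
    Something like the following is what I have in mind -- with made-up stand-ins for @f and @ft, since the real contents aren't shown here -- enqueueing each file together with its matching @ft entry as one shared pair, so the workers never have to index into two parallel arrays:

        use strict;
        use warnings;
        use threads;
        use threads::shared;
        use Thread::Queue;

        # Stand-ins for the real parallel arrays (@ft's actual contents
        # aren't known here, so a dummy hash per file is used).
        my @f  = @ARGV;
        my @ft = map { { name => $_ } } @f;

        my $Q = Thread::Queue->new();

        # Pair each file with its matching @ft entry before enqueueing, so
        # a worker always receives both halves together.
        $Q->enqueue( shared_clone( [ $f[$_], $ft[$_] ] ) ) for 0 .. $#f;
        $Q->enqueue( undef );                # terminator (one per worker)

        my $worker = threads->create( sub {
            while ( defined( my $item = $Q->dequeue() ) ) {
                my ( $file, $ft_entry ) = @$item;
                # ... per-file processing using the matched pair ...
            }
        } );
        $worker->join();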

    What I've done so far is to approach this from the position of parallelizing the calls to this script, rather than parallelizing internally within the script. One benefit of this is that it parallelizes everything the calling script does, of which genxref is the biggest (but not the only) job. It also has the benefit of not messing with the existing design... it feels a lot "safer" than altering the genxref script itself, and should still get me most of the benefit in practice. It won't be *as* good, but it's much better than serial.

    whilst it seems like a good idea to reserve some CPU overhead for "important stuff", it doesn't hold water under closer inspection. You cannot "bank" CPU cycles and utilise them later. The way to ensure "important stuff" gets done in a timely manner is to use the prioritisation capabilities of your OS.

    While I agree with this in theory (and in fact the calling script already uses "nice" on every individual invocation), I'm generally a bit hesitant to go all-out right away. I can't guarantee I won't wedge the system with I/O wait or context switches, so I'm slowly ramping up the parallelism. I do expect to eventually parallelize both the daily and 4-hour jobs; I'm just being overly cautious about it. :)

    How many of those source files change on a daily or 4-hourly basis?

    An excellent observation... this would be a massive improvement, I think. We do have a lot of committers, but realistically the changes are often isolated to a comparatively small portion of the files. This kind of speed-up would probably mean we could process far more often than daily. On-commit is not out of the realm of possibility. Thanks for the idea!
