http://www.perlmonks.org?node_id=935526


in reply to Re^3: tight loop regex optimization
in thread tight loop regex optimization

That smacks of very poor parallelisation.

Yep, this is exactly as you describe. Each repo being processed takes a different amount of time, and I'm simply running X of them and waiting for all X to finish before starting the next round of X jobs. If they take roughly the same amount of time, this works pretty well... if they are very disparate, it's not much better than serial. By my estimation I'm within 20% of optimal on this data set. I don't think this algorithm will get much closer without some hand-optimization. Still, even this simple algorithm has reduced the running time by more than half.

The calling script is currently a shell script, not Perl... I may switch that to get more power here. Alternatively I may use something like GNU Parallel, which should be just as good at this level. I still won't have parallelism within a given job (which would be somewhat more efficient in all cases), but I can live with that. This is already a major improvement in runtime.
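
If I do move the caller to Perl, something like Parallel::ForkManager would give me the pool behaviour almost for free: start the next repo the moment a slot opens up, instead of waiting for the slowest job in the batch. A minimal sketch (the repo path and the exact genxref invocation here are made-up stand-ins, not the real caller):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @repos = glob('/srv/repos/*');           # hypothetical repo list
    my $pm    = Parallel::ForkManager->new(4);  # X concurrent slots

    for my $repo (@repos) {
        $pm->start and next;                    # parent: immediately moves on to the next repo
        system('nice', 'genxref', $repo) == 0
            or warn "genxref failed for $repo: $?\n";
        $pm->finish;                            # child exits; its slot frees at once
    }
    $pm->wait_all_children;

GNU Parallel does essentially the same slot-based scheduling from the shell side, so either route would fix the straggler problem without touching genxref itself.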

Far better, and actually trivial to implement, is to remove the for $f (@f) { file processing loop from each of the findxxxx() routines and then feed the file list into a shared queue. You then run the findxxxx() sub in two or more threads that read the names of the next file to process from the queue and do their thing.

I think I could hack in a Thread::Queue in the bigger subs, enqueue the contents of @f, spawn a few worker threads, and have them pull filenames from the queue. Shouldn't be *too* bad. One concern is the @ft array, which is currently used side-by-side with @f... keeping the two in step across threads seems like a guaranteed synchronization problem, so that part would probably need to change.
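
Roughly what I have in mind is below. It's only a sketch: I'm assuming @f and @ft line up index-for-index, and process_file() is a hypothetical stand-in for the body of the existing per-file loop. Enqueuing the two together as one item sidesteps the synchronization worry:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    sub process_file { }   # hypothetical stand-in for the body of the old for $f (@f) loop

    # @f and @ft are built by the existing findxxxx() setup; placeholders here.
    my (@f, @ft);

    my $NUM_WORKERS = 4;
    my $q = Thread::Queue->new;

    # Pair each @f entry with its @ft counterpart so they can never drift apart.
    $q->enqueue([ $f[$_], $ft[$_] ]) for 0 .. $#f;
    $q->enqueue(undef) for 1 .. $NUM_WORKERS;   # one "no more work" marker per worker

    my @workers = map {
        threads->create(sub {
            while ( defined( my $item = $q->dequeue ) ) {
                my ($file, $ft_entry) = @$item;
                process_file($file, $ft_entry);
            }
        });
    } 1 .. $NUM_WORKERS;

    $_->join for @workers;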

What I've done so far is to approach this by parallelizing the calls to this script, rather than parallelizing internally within the script. One benefit of this is that it parallelizes everything the calling script does, of which genxref is the biggest (but not the only) job. It also has the benefit of not messing with the existing design... it feels a lot "safer" than altering the genxref script itself, and should still get me most of the benefit in practice. It won't be *as* good, but it's much better than serial.

Whilst it seems like a good idea to reserve some CPU overhead for "important stuff", it doesn't hold water under closer inspection. You cannot "bank" CPU cycles and utilise them later. The way to ensure "important stuff" gets done in a timely manner is to use the prioritisation capabilities of your OS.

While I agree with this in theory (and in fact the calling script already uses "nice" on every individual invocation), I'm generally a bit hesitant to go all-out right away. I can't guarantee I won't wedge the system with I/O wait or context switches, so I'm slowly ramping up the parallelism. I do expect to eventually parallelize both the daily and 4-hour jobs, I'm just being overly cautious about it. :)
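
As an aside, if some of this work ends up inside the Perl script itself, the renicing can be done there too; a tiny sketch (the niceness value of 10 is arbitrary):

    use strict;
    use warnings;

    # Same effect as launching under "nice": arguments 0, 0 mean "this process",
    # and 10 is the (arbitrary) niceness to set.
    setpriority(0, 0, 10) or warn "setpriority failed: $!\n";

On Linux, ionice can do something similar for the I/O side, which is the part I'm most worried about wedging.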

How many of those source files change on a daily or 4-hourly basis?

An excellent observation... this would be a massive improvement, I think. We do have a lot of committers, but realistically the changes are often isolated to a comparatively small portion of the files. This kind of speed-up would probably mean we could process far more often than daily. On-commit is not out of the realm of possibility. Thanks for the idea!
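
A first cut at that, just a sketch (the stamp-file path is hypothetical, and @f stands in for the existing full file list), would be to compare each file's mtime against a marker touched at the end of the previous successful run:

    use strict;
    use warnings;

    # @f is the existing full file list built by the script; placeholder here.
    my @f = ();

    my $stamp    = '/var/tmp/genxref.last-run';   # hypothetical marker file
    my $last_run = (stat $stamp)[9] // 0;         # its mtime, or 0 on the first run

    # Keep only files modified since the previous pass.
    my @changed = grep { -f $_ and (stat _)[9] > $last_run } @f;

    # ... run the existing processing over @changed instead of @f ...

    # Refresh the stamp so the next pass starts from "now".
    open my $fh, '>', $stamp or die "cannot touch $stamp: $!";
    close $fh;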