Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
With a very rudimentary parallelization algorithm in the calling script, the runtime fell from about 1h50m to about 1h15m. This was just moving from serial execution to 2 parallel jobs... with some inefficiencies due to very simple shell backgrounding and "wait" semantics.

That smacks of very poor parallelisation. The "usual suspect" in this scenario is splitting the workload -- in this case the list of files -- in "half" (numerically) and running an independent process on each "half". The problem of course is that equal numbers of files are usually very unequal in terms of the runtime requirement because of differences in the size and complexity of the individual files. Sod's law dictates that all the big complex files will end up in one half, and leaving all the short, simple ones in the other. Hence after the first few minutes, the latter runs out of files to process and you're back to a single linear workflow.

Far better, and actually trivial to implement, is to remove the for $f (@f) { file processing loop from each of the findxxxx() routines and then feed the file list into a shared queue. You then run the findxxxx() sub in two or more threads that read the names of the next file to process from the queue and do their thing.

The process then becomes self-balancing. Allowing one thread to process many small files concurrently with another that processes a few large ones. Or whatever combination of the two that happens to result in the way they come off the queue.

This requires a little restructuring of the existing script, but it would definitely be better for it. Bringing the script into the modern world by adding strictness and warnings and a level of modularisation wouldn't go amiss either.

If I parallelize this daily job as well (2 processes), I run the risk of simultaneously maxxing out all 4 cores in this particular system... which would (possibly) be bad, because it also runs the web app that these scripts populate the data for. So I kinda need to be careful how I do this.

I've encountered this reasoning for avoiding parallelisation before. And whilst it seems like a good idea to reserve some cpu overhead for "important stuff", it doesn't hold water under closer inspection. You cannot "bank" CPU cycles and utilise them later. Unused CPU time is pure cost.

Like underutilised staff in an empty shop, you are still paying the wages and to keep the lights on. Even on the very latest CPUs with complex power management, when the CPU goes to sleep, the ram, IO cards, disks and the power supply are still using their full quota of energy.

The way to ensure "important stuff" gets done in a timely manner, is to use the prioritisation capabilities of your OS. Your web server should be running with 'IO priority' (ABOVENORMAL), whilst CPU-intensive tasks like this are running with 'background priority' (BELOWNORMAL).

In this way, the background workload will never usurp the web server for attention when it needs to do something, but it will utilise every scrap of CPU that is going spare.

There is nothing wrong with 100% utilisation, provided that prioritisation ensures that the right job gets the CPU when it needs it.

Oh well... guess we'll do it the hard way.

There are actually many refinements to the process you are currently using that could potentially provide even greater benefits than parallelisation.

For example: How many of those source files change on a daily or 4-hourly basis?

It would be a pretty trivial enhancement to retain and check the modification time -- or even MD5 checksum if you're paranoid -- for each of the source files and only reprocess those that actually change.

Unless you have dozens of programmers constantly messing with your code-base, I could see that easily implemented change reducing your xrefing processing demands to single digit percentage points of your current setup.

Even better, assuming that you are using source control -- and that source control is flexible enough to allow you to automate the process -- have the check-in process re-invoke the xrefing process only when a file is modified.

Again, if you are paranoid, you could run the code-base through a full xref cycle once per week on the weekends. Ie. Use an equivalent of the weekly full/daily differential backup procedure.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^3: tight loop regex optimization by BrowserUk
in thread tight loop regex optimization by superawesome

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others making s'mores by the fire in the courtyard of the Monastery: (5)
    As of 2018-11-21 06:06 GMT
    Find Nodes?
      Voting Booth?
      My code is most likely broken because:

      Results (237 votes). Check out past polls.