PerlMonks
multi threading

by egunth (Initiate)
on May 31, 2013 at 20:19 UTC [id://1036314]

egunth has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I've written a fairly simple script which reads through one master list line by line and compares each entry to other lists in other files, returning essentially a count of matches. The problem is, my comparison lists are rather large (hundreds of millions of entries in total) and each master entry needs to be run through each one. I have access to a large multi-core server, but I don't know how to modify my code to utilize more processors. I want to use a "work crew" thread model, with each thread running a portion of the master list simultaneously and the results concatenated at the end.

My elementary understanding of this is that I would turn my process into a subroutine with parameters dictating which list entries it will work on, then invoking new threads with the threads module to handle all sections of the master list. Does this sound like the right approach? I just want to know if there are any obvious problems with my plan of attack, or any major things I'm missing since I'm a complete multi-threading newbie.

Also, will Perl automatically run these different threads on different cores, or is there something else I need to do to make that happen? I'm running Perl on RHEL 6, by the way.
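A minimal sketch of the work-crew idea described above, using the core threads module. The lists here are tiny placeholders (in the real script @master and %comparison would be read from the files), and the chunking is the simplest possible even split:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;

# Placeholder data: the real master list and comparison lists come from files.
my @master     = qw( aaa bbb ccc ddd eee fff );
my %comparison = map { $_ => 1 } qw( bbb eee );

my $nthreads = 3;
my $chunk    = int( @master / $nthreads ) + 1;

my @workers;
for my $i ( 0 .. $nthreads - 1 ) {
    my $lo = $i * $chunk;
    last if $lo > $#master;
    my $hi = $lo + $chunk - 1;
    $hi = $#master if $hi > $#master;
    my @slice = @master[ $lo .. $hi ];

    # Each thread counts matches in its own slice of the master list.
    push @workers, threads->create( sub {
        my $count = 0;
        $count++ for grep { $comparison{$_} } @slice;
        return $count;
    } );
}

# Merge the per-thread results at the end.
my $total = 0;
$total += $_->join for @workers;
print "total matches: $total\n";    # prints "total matches: 2" for this data
```

On Linux, Perl's ithreads are scheduled by the kernel like any other threads, so they will spread across cores on their own; no extra step is needed for that.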

Replies are listed 'Best First'.
Re: multi threading
by perl-diddler (Chaplain) on May 31, 2013 at 21:16 UTC
I would wonder why you think threading would be a good approach.

I have a program that does something similar. I wanted to compare version numbers of "rpm"s against the version numbers, as reported by RPM, of all other rpms that have the same base name.

    In my case, the multiple lists are many but small. I.e. I can have 5-10K different rpm names, but usually only 1-3 different versions (assuming it's been pruned regularly).

The thing that takes time in my circumstance was calling "rpm" with a query that gives me its idea of how it split the Name from the Version and Release. Calling rpm, or its library, would still involve it opening the rpm package on disk to parse its header.

As a result, much time is spent in disk waits -- and a bit of experimentation on a 12-core machine over a RAID50 showed me that scheduling about 9 inspection threads/procs at a time yielded the least overall time (though the speedup is only between 3-6X, so it may not be the most efficient in terms of cpu usage, but I was looking for real-time benefits).

So I make sure my list is sorted by name and split my list by the number of procs to allow. They each go off and run through their list, and when done, report back to the master -- sending back their reduced lists and a second list of rpm names that are 'redundant' (to be removed).

    I also make sure that the minimum amount of work for each worker is at least "X" queries. If it isn't, I reduce the number of overall workers.
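That adjustment -- shrink the crew until each worker has at least "X" items -- can be sketched as a small helper; the name and the threshold below are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce the worker count until each worker gets at least $min_per_worker items.
sub pick_worker_count {
    my ( $nitems, $max_workers, $min_per_worker ) = @_;
    my $workers = $max_workers;
    $workers-- while $workers > 1 && $nitems / $workers < $min_per_worker;
    return $workers;
}

print pick_worker_count( 100,  9, 20 ), "\n";   # 5 workers: 20 items each
print pick_worker_count( 1000, 9, 20 ), "\n";   # plenty of work: keep all 9
```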

In development, I found it useful to write the results to temporary space before they were re-merged by the parent, but after it was working, I went to using named pipes. Note: I DID use /dev/shm for the location of the temporary space, so it was, in some respects, still IPC, but I could examine the results in the files created in /dev/shm if I wanted to during development.

I didn't see that threading offered any advantage over multiple procs, since perl threads are really separate procs anyway, and, at least for me, being able to send the child output to an intermediate space during development was really helpful. Since the child and parent both just used FDs, it was trivial to switch them to talking directly over pipes once development was done.
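The fork-plus-pipe arrangement described above can be sketched roughly as follows. The work list and the per-item job are trivial stand-ins (a sum instead of an rpm query), and the real script sends back lists rather than a single number:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a work list across N child processes; each child writes its
# result back to the parent over its own pipe.
my @work   = ( 1 .. 20 );           # stand-in for the sorted rpm-name list
my $nprocs = 4;
my $chunk  = int( @work / $nprocs );

my @readers;
for my $i ( 0 .. $nprocs - 1 ) {
    pipe( my $rd, my $wr ) or die "pipe: $!";
    my $pid = fork() // die "fork: $!";
    if ( $pid == 0 ) {                        # child
        close $rd;
        my @mine = @work[ $i * $chunk .. $i * $chunk + $chunk - 1 ];
        my $sum  = 0;
        $sum += $_ for @mine;                 # stand-in for the real per-item work
        print {$wr} "$sum\n";
        close $wr;
        exit 0;
    }
    close $wr;                                # parent keeps the read end
    push @readers, $rd;
}

# Merge the children's results, then reap them.
my $total = 0;
for my $rd (@readers) {
    chomp( my $line = <$rd> );
    $total += $line;
    close $rd;
}
wait() for 1 .. $nprocs;
print "total: $total\n";                      # prints "total: 210" (sum of 1..20)
```

During development, each child could just as easily write to a file under /dev/shm instead of the pipe, which is the switch described above.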

A larger than normal run (which took two different distro releases, combined them, and ran them through) looks like:

    > time remove-oldver-rpms-in-dir.pl
    Read 35841 rpm names.
    Use 9 procs w/3984 items/process
    #pkgs=21111, #deletes=14730, total=35841
    2 additional duplicates found in last pass
    Recycling 14732 duplicates...Done
     Cumulative  This Phase   ID
       0.000s      0.000s     Init
       0.000s      0.000s     start_program
       0.060s      0.060s     starting_children
       0.065s      0.005s     end_starting_children
     118.643s    118.578s     endRdFrmChldrn_n_start_re_sort
     123.202s      4.559s     afterFinalSort
    202.70sec 18.16usr 64.72sys (40.89% cpu)
    The difference between 123s and the final 202s was spent moving each of the files to a per-disk recycle bin, which I periodically empty via another script.

    For me, the use of threads would have complicated things.

    Does that give you any ideas?

    linda

    ----
    P.S. -- If speed were really important, I could likely benefit from using multiple cores on that final step that takes >100 secs, as it's all single-threaded. I could probably 'rename' at least 3-5 files in parallel ... but it's just a maintenance script, so not a real high priority...

      This is really the approach I wanted to take: dividing the list, having simultaneous processes handle the smaller lists in parallel, then merging the results at the end. I thought (apparently incorrectly) that threads were the way to do this.

      I'm extremely new to this kind of programming, so if there's any more insight, or a good reference you could point me to about using child processes, that would be really helpful. With my computing resource I have plentiful cores and memory, but I'm not sure about disk-reading capabilities, so my instinct was to find a way to just break the work up and use more cores. It sounds like this is what you did, just in a better way than I was planning to.

        I thought (apparently incorrectly) that threads were the way to do this

        This shows a method that might be adaptable to your task. But then again, maybe not.

        Had you ever supplied the requested information, you might have gotten a more definitive answer.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: multi threading
by BrowserUk (Patriarch) on May 31, 2013 at 20:38 UTC

    From your description, it seems likely that you are tackling the problem the wrong way. But it's hard to say without seeing your current single-threaded solution and your proposal for multi-threading it.

    To get a proper appraisal, post the code you are currently using and tell us how big the master list is (in lines and bytes).

    Also, where did you get the term "work crew" from?



        Didn't ask you; I asked the OP.

        His answer might have told me a little about where he is at.

        Yours tells me that you know how to type "google".


Re: multi threading
by vsespb (Chaplain) on Jun 01, 2013 at 08:26 UTC
    IMHO under perl for linux, fork() is better than Threads.
Re: multi threading
by sundialsvc4 (Abbot) on Jun 01, 2013 at 13:43 UTC

    This is strictly an I/O-bound process, so threading is unlikely to help you much here. The ruling constraint is how fast the (one?) disk-drive can spin and move its read/write head assembly around ... and particularly, how often the read/write head is obliged to move from one place to another. This is why you sometimes encounter the at-first counter-intuitive finding that “adding threads makes it slower,” this being a result of essentially randomizing the pattern of back-and-forth movements the read/write heads must make, and greatly increasing the “churn” of the operating-system's buffers.

    What I think you really, really want to do here is to use, say, a SQLite database file (or files), indexing the data so that you do not in fact have to read every line to find what might be a match. Indexes do not have to be perfect in order to be useful. Any strategy that reduces the number of records that must actually be examined, by any means whatever, is going to be worthwhile: in this case, you simply want to separate the data into meaningful clumps so that you only need to iterate through one of them.

    Another useful strategy, especially if you are “clumping,” is to grab a big handful of records-to-be-looked-for into a memory structure such as a list, big enough to fit in real memory (i.e. without paging), so that you can make each I/O against the other file do more work for you. Having spent the I/O time to retrieve the record (and its neighbors), you can compare it against the entire handful without incurring more I/O cost. (But if the structure is so large, among all the processes that may exist, that it does cause paging, then you have just incurred a hidden I/O cost that can be quite debilitating: a too-large-to-fit yet frequently-accessed group of pages, causing thrashing to occur.)
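    A sketch of the SQLite idea using DBI with DBD::SQLite (assuming both modules are installed; the table name, column name, and sample data are invented). The comparison entries are loaded and indexed once, then each master entry becomes a cheap indexed probe instead of a file scan:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory database for the sketch; a real run would use a file,
# e.g. dbi:SQLite:dbname=compare.db, so the load is paid only once.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );

$dbh->do('CREATE TABLE entries (entry TEXT)');
my $ins = $dbh->prepare('INSERT INTO entries (entry) VALUES (?)');
$ins->execute($_) for qw( bbb eee zzz );     # stand-in for the comparison lists
$dbh->do('CREATE INDEX idx_entry ON entries (entry)');
$dbh->commit;

# Probe the index once per master-list entry.
my $probe   = $dbh->prepare('SELECT COUNT(*) FROM entries WHERE entry = ?');
my $matches = 0;
for my $item (qw( aaa bbb ccc eee )) {       # stand-in for the master list
    $probe->execute($item);
    my ($n) = $probe->fetchrow_array;
    $matches++ if $n;
}
print "matches: $matches\n";                 # prints "matches: 2" for this data
```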

Node Type: perlquestion [id://1036314]
Approved by BrowserUk
Front-paged by Old_Gray_Bear