PerlMonks
Splitting large array for threads.

by Anonymous Monk
on Jun 15, 2014 at 01:27 UTC ( #1089915=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Alright, so I have a threading example below:
#! perl -slw
use strict;
use threads qw[ yield ];
use threads::shared;
use Thread::Queue;
use Time::HiRes qw[ sleep ];

use constant NTHREADS => 30;

my $pos :shared = 0;

open FILE, '<', $ARGV[ 0 ] or die $!;
my $size = -s FILE;

sub thread {
    my $Q = shift;
    my $tid = threads->tid;
    while( my $line = $Q->dequeue ) {
        printf "%3d: (%10d, %10d) :%s", $tid, $pos, $size, $line;
        sleep rand 5;
    }
}

my $Q = Thread::Queue->new;
my @threads = map threads->create( \&thread, $Q ), 1 .. NTHREADS;

while( !eof FILE ) {
    sleep 0.001 while $Q->pending;
    for( 1 .. NTHREADS ) {
        $Q->enqueue( scalar <FILE> );
        lock $pos;
        $pos = tell FILE;
    }
}

$Q->enqueue( (undef) x NTHREADS );
$_->join for @threads;
Basically, I'm trying to iterate over an array of over 4 million entries. At around 800 entries the script just stops. I was thinking it was because of how large the array is. Should I just split the array into chunks, then go through each? How could I get this to work properly?

Re: Splitting large array for threads.
by biohisham (Priest) on Jun 15, 2014 at 04:22 UTC

    Chunking will definitely reduce the IPC overhead. If you are sure that none of the 4 million entries in the array is causing an error that makes your program stall, then you can consider chunking.

    The MCE module has a direct chunking implementation you might want to explore.
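The chunking idea itself needs nothing beyond core Perl. As a rough sketch (names here are illustrative, and this is the plain idea that modules like MCE implement for you, not MCE's own API):

```perl
use strict;
use warnings;

# Split a list into chunks of at most $size elements.
# Returns a list of array references, one per chunk.
sub chunk_array {
    my ( $size, @items ) = @_;
    my @chunks;
    push @chunks, [ splice @items, 0, $size ] while @items;
    return @chunks;
}

# Each worker would then receive one chunk (one queue hand-off)
# instead of one message per element.
my @entries = ( 1 .. 10 );
my @chunks  = chunk_array( 4, @entries );   # [1..4], [5..8], [9,10]
```

With 4 million entries and a chunk size of, say, a few thousand, the number of queue operations drops by three orders of magnitude.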



Re: Splitting large array for threads.
by BrowserUk (Pope) on Jun 15, 2014 at 07:24 UTC
    Basically, I'm trying to iterate over an array of over 4 million entries. At around 800 entries the script just stops.

    Neither your post nor your code make any sense.

    There is no "array" (of "over 4million entrys" or otherwise) anywhere in your code. There is a file that you are reading into a queue.

    But, the way you are reading that file and populating the queue makes no sense.

    You wait until the queue is empty and then populate it with one line for each worker thread. You also place the current read position of the file into a shared variable after reading each line.

    However, as you only have one shared variable $pos, by the time the worker threads use that value, you will have overwritten it several times, so the same position will be attributed to several lines. I.e. NTHREADS lines will be reported with the same position, but only one of them will be correct. Nonsense.
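One way around that race (a sketch, not the poster's code) is to pair each line with the offset at which it started and enqueue the pair, so the position travels with the line instead of through a shared variable:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $Q = Thread::Queue->new;

# Producer side: record the offset *before* reading each line,
# then enqueue [ offset, line ] so nothing can overwrite it.
sub enqueue_lines {
    my ( $q, $fh ) = @_;
    while ( 1 ) {
        my $offset = tell $fh;        # position of the start of this line
        my $line   = <$fh>;
        last unless defined $line;
        $q->enqueue( [ $offset, $line ] );
    }
}

# Worker side: each dequeued item carries its own correct offset.
# while ( my $item = $Q->dequeue ) {
#     my ( $offset, $line ) = @$item;
#     printf "(%10d) :%s", $offset, $line;
# }
```

Thread::Queue makes a shared clone of the array ref on enqueue, so the pair arrives intact at whichever worker dequeues it.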

    Based upon what your posted code is actually doing, there is no logic in using threads to process this file, because the overheads of locking and queueing far exceed the cost of the per-line processing -- which consists entirely of printing each line to the console. What's more, those lines will be printed in some semi-random order.

    You'd be better off with a simple:

    while( <FILE> ) {
        printf "0: (%10d,%10d) : %s", tell( FILE ), $size, $_;
    }

    At least the lines would be printed in the same order they are read and with a different position -- albeit the position of the end of the line (i.e. the start of the next) rather than the start of it. And it will run much, much more quickly for the absence of threads.

    Processing the lines of a single file -- that must be read from disk serially -- using multiple threads makes no sense, unless the processing involved for each line takes longer than it takes to read that line from disk. Disks are slow; so you have to be doing a considerable amount of processing per line for that to be true.
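If the per-line work really were expensive enough to justify threads, one way to make the queue overhead worthwhile is to enqueue batches of lines rather than single lines (a sketch; BATCH and the trivial worker body are illustrative):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

use constant BATCH => 100;   # illustrative batch size

# Read the file in batches so each queue operation moves many lines.
sub enqueue_batches {
    my ( $q, $fh ) = @_;
    my @batch;
    while ( my $line = <$fh> ) {
        push @batch, $line;
        if ( @batch >= BATCH ) {
            $q->enqueue( [ @batch ] );
            @batch = ();
        }
    }
    $q->enqueue( [ @batch ] ) if @batch;
}

# A worker drains batches until the queue has been end()ed,
# at which point dequeue returns undef and the loop exits.
sub worker {
    my ( $q ) = @_;
    my $lines = 0;
    while ( my $batch = $q->dequeue ) {
        $lines += @$batch;    # the expensive per-line work would go here
    }
    return $lines;
}
```

The producer calls $Q->end after enqueue_batches so the workers' dequeue loops terminate cleanly.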

    As for why it hangs after 800 lines: it isn't immediately obvious from reading the code, but I'm not going to expend effort to verify or debug it, because it is nonsensical, do-nothing code.

    I appreciate that when we start using something new, we often write do-nothing programs to get a feel for things, but expecting others to debug that nonsensical code is asking a lot when there is no clear picture of what you are hoping to achieve.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Splitting large array for threads.
by perlfan (Curate) on Jun 16, 2014 at 20:49 UTC
    I would get a serial version of the code working before you create a threaded version of what you wish to do.

    Once you do this, you can think about how it makes most sense to split the work amongst your threads.

    If your input file is indeed an array, perhaps even of numbers, then I would consider looking at the various options and interfaces the Perl Data Language gives you.

    You may also want to consider using a truly threaded, Perl-like language such as Qore if you wish to do things in a more native way. I have a talk linked via my perlfan home node that relates it to Perl programmers.

Re: Splitting large array for threads.
by Anonymous Monk on Jun 16, 2014 at 21:39 UTC

    The latest version of threads distributed with Perl v5.20 states:

    WARNING

    The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that makes them easy to misuse. Few people know how to use them correctly or will be able to provide help.

    The use of interpreter-based threads in perl is officially discouraged.

    I believe this is the relevant discussion on P5P.

    You may want to look into doing this with multiple processes instead. A central manager process could take responsibility for handing reasonably-sized chunks out to the workers. See for example fork or Parallel::ForkManager, as well as perlipc.
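A minimal multi-process sketch with core fork, where the parent hands each child a slice of the data and collects results over a pipe (the names and the trivial summing "work" are illustrative):

```perl
use strict;
use warnings;

# Sum a list by splitting it across $nproc child processes.
# Each child sums its slice and reports the subtotal on a pipe.
sub parallel_sum {
    my ( $nproc, @data ) = @_;
    my $per = int( @data / $nproc ) + 1;   # slice size per child
    my @readers;
    for my $i ( 0 .. $nproc - 1 ) {
        my @slice = grep defined, @data[ $i * $per .. $i * $per + $per - 1 ];
        pipe my $r, my $w or die "pipe: $!";
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ( $pid == 0 ) {                 # child: do the work, report, exit
            close $r;
            my $sum = 0;
            $sum += $_ for @slice;
            print {$w} "$sum\n";
            close $w;
            exit 0;
        }
        close $w;                          # parent keeps only the read end
        push @readers, $r;
    }
    my $total = 0;
    for my $r ( @readers ) {
        chomp( my $subtotal = <$r> );
        $total += $subtotal;
        close $r;
    }
    wait for 1 .. $nproc;                  # reap the children
    return $total;
}
```

Parallel::ForkManager wraps the fork/reap bookkeeping for you; this is just the bare mechanism.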

Re: Splitting large array for threads.
by Preceptor (Chaplain) on Jun 18, 2014 at 16:59 UTC

    A few points if I may?

    • Don't enqueue 'undef'; use the 'end()' method of Thread::Queue. Much neater.
    • What are you trying to accomplish with '$pos'? Because it's a race condition. You lock it and update it, but a thread may - or may not - have already dequeued and read the variable.
    • You don't really need that 'eof' test, as it's implicit in reading 'FILE'. You might be better off with a while loop there.
    • 'use strict' is good. 'use warnings' is good too.
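The end() suggestion above looks like this in practice (a small sketch; Thread::Queue's end() marks the queue as finished, so dequeue returns undef once it is drained -- available in Thread::Queue 3.01 and later):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $Q = Thread::Queue->new;

# Worker loops until dequeue returns undef, which happens only
# after end() has been called and the queue is empty.
my $worker = threads->create( sub {
    my $count = 0;
    while ( defined( my $item = $Q->dequeue ) ) {
        $count++;    # real per-item work would go here
    }
    return $count;
} );

$Q->enqueue( $_ ) for 1 .. 5;
$Q->end;                        # no per-thread undef sentinels needed
my $processed = $worker->join;  # 5
```

Unlike the one-undef-per-thread trick, end() works no matter how many workers are draining the queue.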

    I can't see why your process would be stalling, though. Usually I would look for an empty queue or a lock. Can I suggest printing:

    print $Q->pending, "\n";

    in that loop, for the sake of verification? My guess is that you might be getting tripped up by that 'sleep' call, but I can't tell for sure, because it seems to work OK with an 1800-line file.

Node Type: perlquestion [id://1089915]
Approved by LanX
Front-paged by davido