
Splitting large array for threads.

by Anonymous Monk
on Jun 15, 2014 at 01:27 UTC ( #1089915=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Alright, so I have a threading example below:
#! perl -slw
use strict;
use threads qw[ yield ];
use threads::shared;
use Thread::Queue;
use Time::HiRes qw[ sleep ];

use constant NTHREADS => 30;

my $pos :shared = 0;

open FILE, '<', $ARGV[ 0 ] or die $!;
my $size = -s FILE;

sub thread {
    my $Q   = shift;
    my $tid = threads->tid;
    while( my $line = $Q->dequeue ) {
        printf "%3d: (%10d, %10d) :%s", $tid, $pos, $size, $line;
        sleep rand 5;
    }
}

my $Q = Thread::Queue->new;
my @threads = map threads->create( \&thread, $Q ), 1 .. NTHREADS;

while( !eof FILE ) {
    sleep 0.001 while $Q->pending;
    for( 1 .. NTHREADS ) {
        $Q->enqueue( scalar <FILE> );
        lock $pos;
        $pos = tell FILE;
    }
}

$Q->enqueue( (undef) x NTHREADS );
$_->join for @threads;
Basically, I'm trying to iterate over an array of over 4 million entries. At around 800 entries the script just stops. I was thinking it was because of how large the array is. Should I just split the array into chunks, then go through each? How could I get this to work properly?

Replies are listed 'Best First'.
Re: Splitting large array for threads.
by BrowserUk (Pope) on Jun 15, 2014 at 07:24 UTC
    Basically, I'm trying to iterate over an array of over 4 million entries. At around 800 entries the script just stops.

    Neither your post nor your code make any sense.

    There is no "array" (of "over 4million entrys" or otherwise) anywhere in your code. There is a file that you are reading into a queue.

    But, the way you are reading that file and populating the queue makes no sense.

    You wait until the queue is empty and then populate it with one line for each worker thread. You also place the current read position of the file into a shared variable after reading each line.

    However, as you only have one shared variable, $pos, by the time the worker threads use that value you will have overwritten it several times, so the same position will be attributed to several lines. I.e. NTHREADS lines will be reported with the same position, but only one of them will be correct. Nonsense.
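    One way to avoid that race (my sketch, not part of the original post) is to pass each line's own offset through the queue alongside the line, so no worker ever reads a shared variable that the reader has since overwritten. The demo file created with File::Temp is purely illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use File::Temp qw(tempfile);

# Build a small demo file so the sketch is self-contained.
my ( $out, $file ) = tempfile();
print {$out} "line $_\n" for 1 .. 10;
close $out;

my $Q = Thread::Queue->new;
my @seen :shared;                      # "offset:line" records collected by workers

my @workers = map {
    threads->create( sub {
        # dequeue() returns undef once the queue is ended and drained
        while ( defined( my $item = $Q->dequeue ) ) {
            my ( $pos, $line ) = @$item;
            chomp $line;
            lock @seen;
            push @seen, "$pos:$line";
        }
    } );
} 1 .. 4;

open my $in, '<', $file or die $!;
my $pos = 0;                           # byte offset where the next line starts
while ( my $line = <$in> ) {
    $Q->enqueue( [ $pos, $line ] );    # each item carries its own offset
    $pos = tell $in;
}
close $in;

$Q->end;                               # no need for (undef) x NTHREADS sentinels
$_->join for @workers;

printf "collected %d records\n", scalar @seen;
```

    Because the offset travels with the line, every record carries the position of its own start, regardless of the order in which workers dequeue.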

    Based upon what your posted code is actually doing, there is no logic in using threads to process this file, because the overheads of locking and queuing far exceed the cost of the per-line processing -- which consists entirely of printing each line to the console. What's more, those lines will be printed in some semi-random order.

    You'd be better off with a simple:

    while( <FILE> ) {
        printf "0: (%10d,%10d) : %s", tell( FILE ), $size, $_;
    }

    At least the lines would be printed in the same order they are read and with a different position -- albeit the position of the end of the line rather than the start of it. And it will run much, much more quickly for the absence of threads.

    Processing the lines of a single file -- that must be read from disk serially -- using multiple threads makes no sense, unless the processing involved for each line takes longer than it takes to read that line from disk. Disks are slow; so you have to be doing a considerable amount of processing per line for that to be true.

    As for why it hangs after 800 lines: it isn't immediately obvious from reading the code, but I'm not going to expend the effort to either verify that or attempt to debug it, because it is nonsensical, do-nothing code.

    I appreciate that when we start using something new, we often write do-nothing programs to get a feel for stuff, but expecting others to debug that nonsensical code is asking a lot when there is no clear picture of what you are hoping to achieve.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Splitting large array for threads.
by biohisham (Priest) on Jun 15, 2014 at 04:22 UTC

    Chunking will definitely reduce the IPC overhead. If you are sure that none of the 4 million entries in the array is causing an error that makes your program stall, then you can consider chunking.
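    The chunking itself can be sketched in a few lines of core Perl (my illustration, using splice; the array and chunk size are stand-ins):

```perl
use strict;
use warnings;

# Sketch: split a large array into fixed-size chunks so each unit of
# work handed to a worker carries many entries instead of one.
my @array      = 1 .. 10_000;    # stand-in for the 4-million-entry array
my $chunk_size = 1_000;

my @chunks;
push @chunks, [ splice @array, 0, $chunk_size ] while @array;

printf "%d chunks, first has %d entries\n",
    scalar @chunks, scalar @{ $chunks[0] };
```

    Each queue item (or message to a worker) then costs one round of IPC per thousand entries rather than one per entry.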

    The MCE module has a direct chunking implementation you might want to explore.

      Here is the link

      MCE 1.513 was moved to BackPAN some time back.

Re: Splitting large array for threads.
by perlfan (Curate) on Jun 16, 2014 at 20:49 UTC
    I would get the serial version of your code working before you create a threaded version of what you wish to do.

    Once you do this, you can think about how it makes most sense to split the work amongst your threads.

    If your input file is indeed an array, perhaps even of numbers, then I would consider looking at the various options and interfaces the Perl Data Language gives you.

    You may also want to consider using a truly threaded, Perl-like language such as Qore if you wish to do things in a more native way. I have a talk linked via perlfan for it as it relates to Perl programmers.

Re: Splitting large array for threads.
by Anonymous Monk on Jun 16, 2014 at 21:39 UTC

    The latest version of threads distributed with Perl v5.20 states:


    The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that make them easy to misuse. Few people know how to use them correctly or will be able to provide help.

    The use of interpreter-based threads in perl is officially discouraged.

    I believe this is the relevant discussion on P5P.

    You may want to look into doing this with multiple processes instead. A central manager process could take responsibility of handing out reasonably-sized chunks out to the workers. See for example fork or Parallel::ForkManager, as well as perlipc.
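    The multi-process approach can be sketched with core fork() alone (my example; Parallel::ForkManager wraps the same pattern with a nicer interface). The parent hands each child one chunk of the work:

```perl
use strict;
use warnings;

# Sketch with core fork() only: the parent splits the work into chunks
# and hands each forked child exactly one chunk to process.
my @entries  = 1 .. 1000;            # stand-in for the large array
my $nworkers = 4;
my $per      = int( @entries / $nworkers ) + 1;   # chunk size per child

my @pids;
while ( my @chunk = splice @entries, 0, $per ) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {               # child: process its own chunk
        my $sum = 0;
        $sum += $_ for @chunk;
        exit 0;                      # child is done
    }
    push @pids, $pid;                # parent: remember the child
}
waitpid $_, 0 for @pids;             # reap all children
print scalar(@pids), " workers finished\n";
```

    Unlike interpreter threads, each child gets its own copy of its chunk at fork time, so no locking is needed; results would come back via pipes, files, or Parallel::ForkManager's data-passing facilities.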

Re: Splitting large array for threads.
by Preceptor (Deacon) on Jun 18, 2014 at 16:59 UTC

    A few points if I may?

    • Don't enqueue 'undef'; use Thread::Queue's 'end()' method instead. Much neater.
    • What are you trying to accomplish with '$pos'? It's a race condition: you lock it and update it, but a thread may - or may not - have already dequeued and read the variable.
    • You don't really need that 'eof' test, as it's implicit in reading 'FILE'. You might be better off with a while loop there.
    • 'use strict' is good. 'use warnings' is good too.

    I can't see why your process would be stalling, though. Usually I would look for an empty queue or a lock. Can I suggest inserting:

    print $Q->pending(), "\n";

    into that loop, for the sake of verification. My guess is that you might be getting tripped up by that 'sleep' call, but I can't tell for sure, because it seems to work OK with an 1800-line file.
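    Picking up the first bullet, here is a minimal sketch (my example, not from the original post) of what end() buys you: one call replaces the (undef) x NTHREADS sentinel trick, and every worker's dequeue() returns undef once the queue is drained.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Minimal sketch of the end() suggestion.
my $Q = Thread::Queue->new;

my @workers = map {
    threads->create( sub {
        my $n = 0;
        # after end(), dequeue() returns undef once the queue is empty
        while ( defined( my $item = $Q->dequeue ) ) { $n++ }
        return $n;                # join() hands this count back
    } );
} 1 .. 3;

$Q->enqueue($_) for 1 .. 30;
$Q->end;                          # workers drain the queue, then exit

my $total = 0;
$total += $_->join for @workers;
print "processed $total items\n";
```

    This also removes the bug class where a legitimate undef (e.g. from reading past eof) is mistaken for a shutdown sentinel.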

Re: Splitting large array for threads.
by marioroy (Priest) on Dec 13, 2014 at 05:01 UTC
    Hi all,

    I am writing about MCE as one option. 4 million is a big number, so I will describe various approaches (non-chunking and chunking).

    First, chunk_size => 1

    use MCE::Loop max_workers => 4, chunk_size => 1;

    ## non-chunking takes 2m18s to complete
    mce_loop { MCE->say($_); } 1..4_000_000;

    Next, chunk_size => 'auto'

    use MCE::Loop max_workers => 4, chunk_size => 'auto';

    ## chunking takes 0m12s to complete (IPC becomes 11.5x faster)
    mce_loop {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        my @o;
        for (@{ $chunk_ref }) {
            push @o, $_;
        }
        MCE->say(@o);
    } 1..4_000_000;

    Finally, processing a file directly containing 4 million rows.

    use MCE::Loop max_workers => 4, chunk_size => 'auto';

    ## processing a file directly (mce_loop_f) takes 0m11.7s
    mce_loop_f {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        chomp @{ $chunk_ref };
        my @o;
        for (@{ $chunk_ref }) {
            push @o, $_;
        }
        MCE->say(@o);
    } '/path/to/four_million_rows.txt';

    But hold on... much of that time comes from writing 4 million rows to STDOUT. Fasten your seatbelt. Rerunning with output directed to /dev/null. The time still includes MCE->say(...).

    ## array non-chunking...: 1m54.467s
    ## array auto-chunking..: 0m 0.843s   136x
    ## file  auto-chunking..: 0m 0.467s   245x

    Running again with MCE->say(...) commented out, to take that out of the equation. I am pleasantly surprised to see 4 million rows with chunk_size => 1 in just over 1 minute. Gosh, that is fast considering chunk_size => 1 (over 61k items per second). However, chunking reduces the IPC altogether. Furthermore, MCE can process an input file directly for even less overhead.

    ## array non-chunking...: 1m 5.458s
    ## array auto-chunking..: 0m 0.821s    80x
    ## file  auto-chunking..: 0m 0.411s   159x

    It's not fair... :) Part of that time includes the time to load Perl itself and any modules, plus the time to spawn 4 workers and shut them down at the end. I tested this by adding MCE->last. The time needed is 0m0.074s. Therefore, the 0.411s above is really 0.337s.

    mce_loop_f {
        MCE->last;   # immediately leaves the block and input
        ...
    } '/path/to/four_million_rows.txt';

    Well then, here are the times after subtracting 0.074s from the above, to get the time needed for IPC only. Ha, still not able to break 1 minute for chunk_size => 1.

    ## array non-chunking...: 1m 5.384s
    ## array auto-chunking..: 0m 0.747s    88x
    ## file  auto-chunking..: 0m 0.337s   194x

    Chunking enables IPC to run many times faster due to the reduced overhead.
