PerlMonks  

Re^5: Process large text data in array

by BrowserUk (Patriarch)
on Mar 11, 2015 at 17:24 UTC [id://1119672]


in reply to Re^4: Process large text data in array
in thread Process large text data in array

When I increase the number of filters to 83 and skew the ordering so that matches occur towards the end (to simulate a CPU-bound process), I get a 5.8s run time for the same input with the single-threaded code, and a 6.1s run time with the multi-threaded code.

Could you post those versions of the two programs (to save me trying to reproduce them from your descriptions)? I'd like to do a little more analysis on them.



Re^6: Process large text data in array
by SimonPratt (Friar) on Mar 11, 2015 at 18:12 UTC

    Sure, this is the multi-threaded code:

    use strict;
    use threads;
    use Thread::Queue;
    use Time::HiRes 'time';

    use constant MAXTHREADS => 2;

    my $workQueue = Thread::Queue->new();
    my $outQueue  = Thread::Queue->new();

    # Filters receive the line as their argument; matching against $_[0]
    # (rather than the default $_, which is not set here) makes them work.
    my @filters;
    push @filters, sub { $_[0] =~ /blahdeblah/    ? 1 : undef } for 1 .. 83;
    push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
    push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

    my @threads = map { threads->new( \&worker ) } 1 .. MAXTHREADS;

    my $file_name = 'test.txt';
    open my $DATF, '<', $file_name or die "$file_name: $!";
    while ( <$DATF> ) {
        $workQueue->enqueue($_);
    }
    close $DATF;
    $workQueue->end();

    $_->join for @threads;
    $outQueue->end();

    my @dat;
    while ( defined( my $line = $outQueue->dequeue() ) ) {
        push @dat, $line;
    }

    print( time - $^T, "\n" );

    sub worker {
        while ( defined( my $line = $workQueue->dequeue() ) ) {
            chomp $line;
            foreach my $filter (@filters) {
                $filter->($line) or next;
                $outQueue->enqueue($line);
                last;
            }
        }
    }

    This is the single-threaded code:

    use strict;
    use Time::HiRes 'time';

    my @dat;

    # As above, filters match against their argument rather than $_.
    my @filters;
    push @filters, sub { $_[0] =~ /blahdeblah/    ? 1 : undef } for 1 .. 83;
    push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
    push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

    my $file_name = 'test.txt';
    open my $DATF, '<', $file_name or die "$file_name: $!";
    while ( my $line = <$DATF> ) {
        chomp $line;
        foreach my $filter (@filters) {
            $filter->($line) or next;
            push @dat, $line;
            last;
        }
    }
    close $DATF;

    print( time - $^T, "\n" );

    And the data file I used was made up of the following:

    active=sync|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53||foo=bar=bam|sync=53
    anotherfilter=forest|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53
    repeated ~100,000 times, to generate a file of ~300,000 lines at roughly 23MB
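    A generator along the following lines reproduces such a file. This is a sketch, not the poster's actual script: the exact field repetition is approximated from the sample above, and two lines per iteration times 100,000 iterations gives 200,000 lines, so the loop count would need adjusting to hit the ~300,000 lines described.

```perl
use strict;
use warnings;

# Write an approximation of the two-line pattern from the sample
# data ~100,000 times (the repeat count and filename are assumed).
open my $OUT, '>', 'test.txt' or die "test.txt: $!";
for ( 1 .. 100_000 ) {
    print {$OUT} 'active=sync',          '|foo=bar=bam|sync=53' x 5, "\n";
    print {$OUT} 'anotherfilter=forest', '|foo=bar=bam|sync=53' x 5, "\n";
}
close $OUT;
```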

    I initially came up with this threading model to parse I/B/E/S monthly financial data files, which are pretty hefty (roughly 30GB all up) and require a lot of processing for each line (anywhere between 40 and 200 lines of code), ultimately winding up with a massively CPU-bound operation. Given the ultimate dataset size, though (and the fact that it is already broken up into multiple files), the final model I went with, which provides the best speed enhancement for this scenario, is a multi-threaded model that divides the work by file rather than by line. Splitting on lines was good for processing an individual file, but not good enough for overall processing and ultimately not as scalable, due to natural limits on the performance gain available when the work is split into such tiny units.
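    The per-file split described above can be sketched with the same Thread::Queue pattern, queueing filenames instead of lines so each work unit is large enough that queueing overhead becomes negligible. This is only an illustration of the idea, not the actual production code; the `ibes_data_*.txt` glob and the thread count are made up.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

use constant MAXTHREADS => 4;    # assumed; tune to the machine

# Each work unit is a whole file, not a line.
my $fileQueue = Thread::Queue->new();
$fileQueue->enqueue($_) for glob 'ibes_data_*.txt';    # hypothetical file set
$fileQueue->end();

my @threads = map { threads->new( \&file_worker ) } 1 .. MAXTHREADS;

# Each worker returns how many files it handled.
my $total = 0;
$total += $_->join for @threads;
print "processed $total files\n";

sub file_worker {
    my $count = 0;
    while ( defined( my $file = $fileQueue->dequeue() ) ) {
        open my $FH, '<', $file or next;
        while ( my $line = <$FH> ) {
            # per-line processing for this file goes here
        }
        close $FH;
        $count++;
    }
    return $count;
}
```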

    Ultimately, you need to know the underlying environment and task very well to make a good decision about what can or needs to be multi-threaded, how it should be split up, and where the sweet spot is for performance.

    edit: Removed use warnings, as I didn't actually run this code with use warnings enabled
