Re^5: Multi-thread combining the results together

That's an interesting idea. I was thinking of trying a single string made of space-separated tokens. In that case the ^ and $ anchors would become \b's, and a grep would not be needed, because I would be doing a global match against a single string instead of running the built regex against each of the 80K tokens individually. There is also no reason that I couldn't join the tokens with \n instead, and I could try that without modifying build_regex().
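
To make the comparison concrete, here is a minimal sketch of the two approaches. The tokens, the pattern core and the sub names match_each/match_joined are made up for illustration; the real build_regex() and data are not shown here.

```perl
use strict;
use warnings;

# Hypothetical tokens and pattern core, for illustration only.
my @tokens = qw(alpha beta alphabet gamma alp);
my $core   = 'alp\w*';

# Current approach: anchor the pattern and test every token separately.
sub match_each {
    my ($pat, @list) = @_;
    my $re = qr/^(?:$pat)$/;
    return grep { /$re/ } @list;
}

# Proposed approach: one global match over a space-joined string,
# with \b standing in for ^ and $.
sub match_joined {
    my ($pat, @list) = @_;
    my $joined = join ' ', @list;
    my $re     = qr/\b($pat)\b/;
    return $joined =~ /$re/g;    # list context: returns every capture
}

print join(',', match_each($core, @tokens)),   "\n";
print join(',', match_joined($core, @tokens)), "\n";
```

Both calls should return the same tokens here. The \b version is only equivalent when every token consists purely of word characters and the pattern itself cannot match a space; a token such as "foo-bar" would behave differently.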

As a note, the entries in @tokens are all unique. For each token, I want it either fully copied or not at all (a yes/no decision for each of the 80K tokens). A typical regex will have 10-14 terms and produces a result set of about 6 results from 80K possibilities.

If I can get maybe 3x from algorithm improvements and another 3x from parallelization, I would be in the <10 minute max run time range, which is "good enough". As it turns out in practice, not every possibility needs to be run, and when a token needs to be investigated further for "close matches", I cache the result. More than a decade ago, the max run time was 20 minutes on a Win 95 machine. One of the "problems" with software that "works" is that it often winds up being applied to larger and larger data sets. The 80K terms are extracted from 3 million input lines. 12 years ago, this was only 200K input lines and a much smaller @tokens array!

I appreciate all of the ideas in this thread! I have a lot of experimentation ahead of me.

Ultimately, I would like to develop an algorithm that builds some kind of a tree structure which can be traversed much faster than any regex approach. I figure that will be non-trivial to accomplish.
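
As a rough idea of what such a structure could look like, here is a minimal trie sketch. It assumes the queries can be reduced to prefix lookups, which may not hold for the real 10-14 term regexes; the sub names trie_build and trie_with_prefix and the sample tokens are hypothetical.

```perl
use strict;
use warnings;

# Build a trie from a list of tokens. Each node is a hash keyed by the
# next character; the empty-string key marks the end of a whole token.
sub trie_build {
    my @list = @_;
    my %root;
    for my $token (@list) {
        my $node = \%root;
        for my $ch (split //, $token) {
            $node->{$ch} //= {};
            $node = $node->{$ch};
        }
        $node->{''} = 1;
    }
    return \%root;
}

# Return every stored token that starts with $prefix, sorted.
sub trie_with_prefix {
    my ($root, $prefix) = @_;
    my $node = $root;
    for my $ch (split //, $prefix) {
        $node = $node->{$ch} or return;    # no token has this prefix
    }
    my @found;
    my @stack = ([ $node, $prefix ]);
    while (my $item = pop @stack) {
        my ($n, $str) = @$item;
        for my $key (keys %$n) {
            if ($key eq '') { push @found, $str }
            else            { push @stack, [ $n->{$key}, $str . $key ] }
        }
    }
    my @sorted = sort @found;
    return @sorted;
}

my $trie = trie_build(qw(car cart care dog));
print join(',', trie_with_prefix($trie, 'car')), "\n";
```

The lookup walks only the characters of the prefix and then the subtree below it, so its cost depends on the prefix length and the number of hits rather than on all 80K tokens.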

Update: I tried the idea of using a multi-line, match global upon a string of \n separated tokens instead of running a regex on each token individually. This didn't work. This is significantly slower than the current code. It produces the same result, albeit slower. Next up: I will try the \b idea.

Re^6: Multi-thread combining the results together
by vr (Curate) on Jul 27, 2019 at 17:30 UTC

I tried the idea of using a multi-line, match global upon a string of \n separated tokens instead of running a regex on each token individually. This didn't work. This is significantly slower than the current code.

I can't explain it better, but if you really need anchoring within tokens, take care to let the regex engine fail as soon as possible and move ahead. In the example below, don't let it aimlessly run "/.+/" when it's clear it won't find "123" before the next separator. It's a really contrived example (and a no-op), and it's not about threads anymore; maybe it's time for another SOPW question with a real dataset and an SSCCE (and a better explanation).

use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';

my $N = 1000;
srand 123;
my @tokens = map { int rand 1e9 } 1 .. $N;

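# Both builders ignore their argument: the example is a deliberate no-op.
sub build_regex  { qr/^.+123/m }      # plain anchored pattern
sub build_regex2 { qr/^.+\K123/m }    # same pattern with \K before the literal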

{                       # case 1
my $t = time;
my $count = 0;
for my $token ( @tokens )
{
    my $regex = build_regex( $token );
    /$regex/ && $count++ for @tokens;
}
say $count;
say time - $t;
}

{                       # case 2
my $t = time;
my $count = 0;
my $concat = join "\n", @tokens;
for my $token ( @tokens ) {
    my $regex = build_regex( $token );
    $count++ while $concat =~ /$regex/g
}
say time - $t;
}

{                       # case 3
my $t = time;
my $count = 0;
my $concat = join "\n", @tokens;
for my $token ( @tokens ) {
    my $regex = build_regex2( $token );
    $count++ while $concat =~ /$regex/g
}
say time - $t;
}

__END__

5000
0.384264945983887
1.7059121131897
0.13309907913208
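
Cases 2 and 3 above do not print their $count, so here is a small sanity check, not part of the original script, that the per-token and joined approaches count the same matches. It uses a short made-up token list instead of the random data, and the sub names count_per_token and count_joined are hypothetical. It checks agreement only and says nothing about speed.

```perl
use strict;
use warnings;

# Small made-up token list (not the random data used in the script above).
my @sample = qw(9912345 123 4123 5551239 1230 77);

# Case 1 style: test the regex against each token separately.
sub count_per_token {
    my ($re, @list) = @_;
    return scalar grep { /$re/ } @list;
}

# Case 2/3 style: global match over the tokens joined with newlines.
sub count_joined {
    my ($re, @list) = @_;
    my $concat = join "\n", @list;
    my $count  = 0;
    $count++ while $concat =~ /$re/g;
    return $count;
}

my $plain  = qr/^.+123/m;
my $with_k = qr/^.+\K123/m;

printf "per-token: %d, joined: %d, joined with \\K: %d\n",
    count_per_token($plain, @sample),
    count_joined($plain, @sample),
    count_joined($with_k, @sample);
```

All three counts should agree for this sample, since each token can match at most once per line with either pattern.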


	PerlMonks