Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re^2: Improving script's speed and performance...

by Anonymous Monk
on Dec 28, 2015 at 17:53 UTC ( [id://1151280]=note: print w/replies, xml ) Need Help??


in reply to Re: Improving script's speed and performance...
in thread Improving script's speed and performance...

OK. So you're saying your strings are pretty short and consist almost entirely of the words of English language. The unused variable $bestalignment in your code makes me doubt that, but OK.

Here's one approach (just an illustration, you should definitely debug and improve it). It depends on certain properties of the English language, such as the fact that there are not very many unique words (even allowing for stuff like "array1"); certainly English has less then a million unique words, and only several tens of thousands are actually used. So you should preprocess your data:

use strict; use warnings; use String::LCSS_XS qw( lcss ); my @array1 = ('This sentence is stored in array1') x 1_000; my @array2 = ( 'Thank you for looking on this.', 'This sentence is a test sentence.', 'This sentence is stored in array2', 'This is an other sentence.', 'All variables are stored in array2', ) x 20_000; preprocess( \@array1 ); preprocess( \@array2 ); process( \@array1, \@array2 ); exit 0; my %words; my %codes; my $word_num; sub process { my ( $array1, $array2 ) = @_; for my $elem1 (@$array1) { my $best_subseq = ""; my $best_subseq_words = 0; my $best_subseq_chars = 0; my $found = 0; for my $elem2 (@$array2) { my $subseq = lcss( $elem1->[1], $elem2->[1] ); if ( $subseq && length($subseq) > $best_subseq_words ) { my $real_subseq = join ' ', map $codes{$_}, split //, $subseq; if ( length($real_subseq) > $best_subseq_chars ) { $best_subseq = $real_subseq; $best_subseq_words = length($subseq); $best_subseq_chars = length($real_subseq); $found = 1; } } } if ( $found == 1 ) { printf "<<%s>> is the lcss of\n\t<<%s>>\n", $best_subseq, $elem1->[0]; } } } sub preprocess { my ($array) = @_; for my $i ( 0 .. $#$array ) { my $packed = join '', map { my $code = $words{$_}; if ( not defined $code ) { $code = chr( $word_num += 1 ); $words{$_} = $code; $codes{$code} = $_; } $code; } $array->[$i] =~ /\w+/g; $array->[$i] = [ $array->[$i], $packed ]; } }
on my ancient laptop, time perl lcss.pl > /dev/null finished in 1 min 33 seconds; that's for 100 millions iterations (1000 elements in @array1, 20000 * 5 in @array2).

Note that if you actually use some wildly different data, this approach may break (and I didn't test it very much).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1151280]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2024-04-19 22:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found