OK. So you're saying your strings are pretty short and consist almost entirely of English words. The unused variable
$bestalignment in your code makes me doubt that, but OK.
Here's one approach (just an illustration; you should definitely debug and improve it). It relies on certain properties of the English language, such as the fact that there are not very many unique words (even allowing for things like "array1"): English certainly has fewer than a million unique words, and only several tens of thousands are in actual use. So you should preprocess your data:
use strict;
use warnings;
use String::LCSS_XS qw( lcss );

# Shared lookup tables, declared before first use.
my %words;      # word -> single-character code
my %codes;      # single-character code -> word
my $word_num;   # last code point assigned

my @array1 = ('This sentence is stored in array1') x 1_000;
my @array2 = (
    'Thank you for looking on this.',
    'This sentence is a test sentence.',
    'This sentence is stored in array2',
    'This is an other sentence.',
    'All variables are stored in array2',
) x 20_000;

preprocess( \@array1 );
preprocess( \@array2 );
process( \@array1, \@array2 );
exit 0;

sub process {
    my ( $array1, $array2 ) = @_;
    for my $elem1 (@$array1) {
        my $best_subseq       = "";
        my $best_subseq_words = 0;
        my $best_subseq_chars = 0;
        my $found             = 0;
        for my $elem2 (@$array2) {
            # lcss on the packed strings is a word-level LCSS
            my $subseq = lcss( $elem1->[1], $elem2->[1] );
            if ( $subseq && length($subseq) > $best_subseq_words ) {
                my $real_subseq =
                    join ' ', map $codes{$_}, split //, $subseq;
                if ( length($real_subseq) > $best_subseq_chars ) {
                    $best_subseq       = $real_subseq;
                    $best_subseq_words = length($subseq);
                    $best_subseq_chars = length($real_subseq);
                    $found             = 1;
                }
            }
        }
        if ($found) {
            printf "<<%s>> is the lcss of\n\t<<%s>>\n",
                $best_subseq,
                $elem1->[0];
        }
    }
}

sub preprocess {
    my ($array) = @_;
    for my $i ( 0 .. $#$array ) {
        # replace each word with a unique single character
        my $packed = join '', map {
            my $code = $words{$_};
            if ( not defined $code ) {
                $code = chr( $word_num += 1 );
                $words{$_}    = $code;
                $codes{$code} = $_;
            }
            $code;
        } $array->[$i] =~ /\w+/g;
        # keep both the original sentence and its packed form
        $array->[$i] = [ $array->[$i], $packed ];
    }
}
On my ancient laptop, time perl lcss.pl > /dev/null finished in 1 minute 33 seconds; that's for 100 million iterations (1,000 elements in @array1, 20,000 * 5 in @array2).
Note that if your actual data is wildly different, this approach may break (and I didn't test it very much).