
Better solution to the code

by Anonymous Monk
on Jan 25, 2008 at 09:43 UTC (#664253=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

Can I get a better approach for the script below, so that it runs faster?
Currently I iterate over an array with a large number of records and, for each record, scan a file (30000 lines) for lines that contain a matching string.
The matching lines are stored in a text file called Result_file.txt.

# The array @tag contains around 40000 records
# Input_file.dat contains 30000 lines
open(FH1,"+>Result_file.txt") or die "Cannot create file $!\n";
foreach my $fkey(@tag) {
    open(FH,"<Input_file.dat") or die "Cannot read $!\n";
    while(<FH>) {
        if($_ =~ m/$fkey/g) {
            print FH1 "$_\n";
        }
    }
    close(FH);
}
close(FH1);

Can anyone please help me improve the performance of the above code?

Re: Better solution to the code
by moritz (Cardinal) on Jan 25, 2008 at 10:20 UTC
    If you don't need to preserve the output order in Result_file.txt you can reduce the runtime to a single pass over Input_file.dat:
    # if @tag contains simple words
    my $re = join '|', @tag;
    # if they can be more complicated:
    # my $re = join '|', map { "(?:$_)" } @tag;
    open my $out, '+>', "Result_file.txt"
        or die "Can't open file Result_file.txt for writing: $!";
    open my $in, '<', 'Input_file.dat'
        or die "Can't read Input_file.dat: $!";
    while(<$in>){
        print $out $_ if m/$re/o;
    }
    close $in;
    close $out;

    If you use perl 5.10.0, the match against many (constant) alternatives is blazingly fast due to the trie optimizations, demerphq++
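If the tags are literal strings rather than patterns, regex metacharacters in them should be escaped before joining. Here is a minimal sketch of the single-pass approach under that assumption. The helper names build_tag_regex and filter_lines and the sample data are made up for illustration, and in-memory filehandles stand in for the real files.

```perl
use strict;
use warnings;

# Build one compiled regex that matches any of the given literal tags.
# quotemeta escapes characters such as '.', '+' or '(' so that each tag
# is matched literally instead of being interpreted as a pattern.
sub build_tag_regex {
    my @tags = @_;
    my $alternation = join '|', map { quotemeta($_) } @tags;
    return qr/$alternation/;
}

# Copy every line of $in that contains at least one tag to $out.
# Returns the number of lines written.
sub filter_lines {
    my ($in, $out, $re) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        if ($line =~ $re) {
            print {$out} $line;
            $count++;
        }
    }
    return $count;
}

# Hypothetical sample data, held in memory instead of in Input_file.dat.
my @tag    = ('foo', 'a.b', 'c+d');
my $input  = "has foo here\nno match\naxb only\nliteral a.b\nc+d too\n";
my $result = '';

open my $in,  '<', \$input  or die "Cannot open in-memory input: $!";
open my $out, '>', \$result or die "Cannot open in-memory output: $!";
my $written = filter_lines($in, $out, build_tag_regex(@tag));
close $in;
close $out;

print $result;
```

With real files, the two open calls would name Input_file.dat and Result_file.txt instead of the scalar references. Note that the line "axb only" is skipped here only because of quotemeta; an unescaped a.b would match it.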

Re: Better solution to the code
by Punitha (Priest) on Jan 25, 2008 at 09:53 UTC

    Hi, try this

    open(FH1,"+>Result_file.txt") or die "Cannot create file $!\n";
    open(FH,"<Input_file.dat") or die "Cannot read $!\n";
    while(<FH>) {
        my $data=$_;
        chomp($data);
        # match each tag against the line (not the line against each tag)
        print FH1 "DATA:$data\n" if(grep { $data =~ /$_/ } @tag);
    }
    close(FH);
    close(FH1);


      Precompiling the regexes should provide a speedup, and using List::MoreUtils::any() may help too if the chance of a match is good, since the test short-circuits on success. Naturally you will benchmark ;-)

      use List::MoreUtils qw(any);
      open my $OUT, '>', 'Result_file.txt' or die "Cannot create file: $!\n";
      open my $IN, '<', 'Input_file.dat' or die "Cannot read file: $!\n";
      # precompile the regexes.
      my @tag_rx = map {qr/$_/} @tag;
      while ( my $data = <$IN> ) {
          print $OUT $data if any { $data =~ /$_/ } @tag_rx;
      }
      close $IN;
      close $OUT;
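For the benchmarking step, something along these lines might do. It is only a sketch: the data is synthetic and much smaller than the real 40000 tags and 30000 lines, the sub names are invented for the example, and it uses any() from the core List::Util (1.33 or later) rather than List::MoreUtils, on the assumption that the two behave the same for this purpose.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(any);    # assumption: List::Util 1.33+ is installed

# Hypothetical test data: 200 literal tags and 500 lines,
# of which every tenth line carries a tag-like marker.
my @tag   = map { "<tag$_>" } 1 .. 200;
my @lines = map { $_ % 10 ? "plain line $_\n" : "line with <tag$_> inside\n" } 1 .. 500;

# Strategy 1: one big alternation, compiled once.
my $alternation = join '|', map { quotemeta($_) } @tag;
my $one_rx      = qr/$alternation/;

# Strategy 2: one precompiled regex per tag, tested with any().
my @tag_rx = map { qr/\Q$_\E/ } @tag;

sub count_with_alternation {
    return scalar grep { $_ =~ $one_rx } @lines;
}

sub count_with_any {
    my $count = 0;
    for my $line (@lines) {
        $count++ if any { $line =~ $_ } @tag_rx;
    }
    return $count;
}

# The timings only mean something if both strategies find the same lines.
die "strategies disagree\n" unless count_with_alternation() == count_with_any();

# Run each strategy for at least one CPU second and print a comparison table.
cmpthese(-1, {
    alternation => \&count_with_alternation,
    any_loop    => \&count_with_any,
});
```

The numbers depend heavily on the perl version and on how often a line matches, so it is worth rerunning this with a sample of the real data.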

Re: Better solution to the code
by Lu. (Hermit) on Jan 25, 2008 at 10:30 UTC

    The poor performance comes from the fact that you are opening and parsing the same (big) file many times.

    You would be better off reversing your strategy and opening the file, parsing it and comparing each line with the contents of your array @tag.

    You should also, if possible, consider loading your data into a hash instead of an array. If you do that, you will profit from exists.
    # your data is in %tag
    open (IN, "<Input_file.dat") or die "Cannot read $!\n";
    open (OUT,"+>Result_file.txt") or die "Cannot create file $!\n";
    while (<IN>) {
        print OUT $_ if exists $tag{$_};
    }

      The idea with the hash won't work, because the regex match searches for a matching substring, whereas the hash lookup compares the whole string.

      But that reminds me of another possible optimization: if @tag doesn't contain regexes but only constant substrings, index might speed up things.

      So instead of if ($_ =~ m/$something/){ ... }, you can write if (0 <= index $_, $something).
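A minimal sketch of the index variant, assuming every tag is a constant substring. The sub name contains_any_tag and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Return true if $line contains at least one of the literal strings in @$tags.
# index returns -1 when the substring is absent, so >= 0 means "found".
sub contains_any_tag {
    my ($line, $tags) = @_;
    for my $tag (@$tags) {
        return 1 if index($line, $tag) >= 0;
    }
    return 0;
}

# Hypothetical sample data.
my @tag   = ('foo', 'a.b');
my @lines = ("has foo here\n", "axb only\n", "literal a.b\n");

my @hits = grep { contains_any_tag($_, \@tag) } @lines;
print @hits;
```

Because index does no pattern matching, a tag such as a.b only matches those exact three characters, so no escaping is needed.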

      BTW, to put @tags into %tags use:
      my %tags;
      @tags{@tags} = undef;
      Yes, it's confusing to give a hash and an array the same name.
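To show what the hash lookup does and does not find, here is a small sketch with made-up data. It assumes the lookup is meant to compare whole lines (minus the newline) against the tags.

```perl
use strict;
use warnings;

# Hypothetical sample tags.
my @tags = ('foo', 'bar');

# Hash slice: every element of @tags becomes a key; the values stay undef.
my %tags;
@tags{@tags} = ();

my @lines = ("foo\n", "foo bar\n", "bar\n");
my @exact;
for my $line (@lines) {
    chomp(my $key = $line);    # drop the newline before the lookup
    push @exact, $key if exists $tags{$key};
}

# Only lines that are exactly equal to a tag are kept.
print "$_\n" for @exact;
```

The line "foo bar" contains both tags but is not reported, which is the substring limitation described above.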
