PerlMonks  

Searching pattern in 400 files and getting count out of each file

by Rita_G (Initiate)
on Nov 08, 2012 at 06:24 UTC
Rita_G has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a script to search for data in around 400 files. For each file, I need to know how many times each pattern is present. I have around 8,000 patterns, and each one must be searched for in all 400 files. Currently I am using the grep command in my script, which makes performance poor. Now I am thinking of using multithreading here, but I am a total newbie at this. Could you please help me find the correct solution?

Here is my current code

while (@data = $sth->fetchrow_array()) {
    $result = `grep -i -w -c "$data[0]" /u05/oracle/R12COE/spotlighter/Search_Files/Forms/*`;
    @search  = split('\n', $result);
    $arrsize = @search;
    for ($i = 0; $i < $arrsize; $i++) {
        ($path, $count)     = split(':',  $search[$i]);
        ($filename, $fextn) = split('\.', $path);
        $filename = `basename $path`;
        #$fextn = `echo $fname | sed 's/.*\.//'`;
        #$fname = `echo $path | perl -pe 's|.*/||'`;

Re: Searching pattern in 400 files and getting count out of each file
by Athanasius (Monsignor) on Nov 08, 2012 at 07:21 UTC

    Hello Rita_G, and welcome to the Monastery!

    Here is some good advice from the Camel Book (4th edition, p. 696):

    Avoid unnecessary syscalls. ... Avoid unnecessary system calls. ... Worry about starting subprocesses, but only if they’re frequent.

    The performance problems you are seeing almost certainly derive from the frequent use of backticks in your script. Each such use incurs an additional overhead.
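    The cost of shelling out can be made visible with a quick comparison using the core Benchmark module. This is just an illustrative sketch (the two subs do different work, and the numbers will vary by system); the point is that the backtick version forks a shell on every call while the builtin stays inside the perl process:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @lines = ("foo bar\n") x 1000;

    cmpthese(-1, {
        backtick => sub { my $n = `echo foo` },            # forks a shell each call
        builtin  => sub { my $n = grep { /foo/ } @lines }, # pure Perl, no subprocess
    });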

    The good news is that all the backtick operations in your script can be replaced with pure Perl. See grep, File::Basename, and the substitution operator s/// ("Regexp Quote-Like Operators" in perlop, and perlre).
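    For example, the `basename` backtick and the commented-out echo/sed pipeline could be replaced with pure Perl along these lines (a sketch; the $path value here is a made-up example):

    use strict;
    use warnings;
    use File::Basename;

    my $path = '/some/dir/XXINV.fmb';      # hypothetical example path
    my $filename = basename($path);        # replaces `basename $path`
    (my $fextn = $path) =~ s/.*\.//;       # replaces the echo | sed pipeline
    # $filename is now 'XXINV.fmb', $fextn is 'fmb', with no subprocesses started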

    Don’t even think about multithreading until you’ve re-implemented your script in pure Perl and benchmarked the results!

    Hope that helps,

    Athanasius <°(((>< contra mundum

      The original posting also (through the grep) reads each file for every pattern; that is, it opens every single file (400 of them) some 8,000 times. Look at strategies for reading each file just once. Another reply in this thread seems to have done this implicitly, without explaining why, though it also misses the opportunity to tidy up the patterns outside the file-reading loop.
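      That inversion can be sketched as follows (the patterns here are placeholders): compile every pattern once, up front, then read each file exactly once and test all patterns against each line:

      use strict;
      use warnings;

      # compile once, outside the file loop
      my @patterns = map { qr/\Q$_\E/i } qw(foo bar baz);

      for my $filename (@ARGV) {
          open my $fh, '<', $filename or die "could not open $filename: $!";
          my %count;
          while (my $line = <$fh>) {            # each file is read exactly once
              $count{$_}++ for grep { $line =~ $_ } @patterns;
          }
          close $fh;
          printf "%s: %d matches for %s\n", $filename, $count{$_} // 0, $_
              for @patterns;
      }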
      A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: Searching pattern in 400 files and getting count out of each file
by grondilu (Pilgrim) on Nov 08, 2012 at 08:44 UTC

    Indeed, if you are worried about performance you should probably not shell out to "grep", but use Perl's built-in grep function instead.

    Why not just simply open each file one at a time?

    open my $p, '<', shift @ARGV or die "could not open pattern file: $!";
    my @pattern = map { chomp; qr/$_/ } <$p>;
    for my $filename (@ARGV) {
        open my $f, '<', $filename or die "could not open $filename: $!";
        my @lines = <$f>;
        for my $pattern (@pattern) {
            printf "%s has %d matches for pattern %s\n",
                $filename, scalar(grep { /$pattern/ } @lines), $pattern;
        }
    }

    NB. This post was edited several times.

      IIRC, the original patterns are pulled from a database, but in the above case the per-pattern chomping could be avoided by reading and chomping them all in at once:
      open my $p, "< " . shift @ARGV or die "could not open pattern file: $!";
      my @patterns = <$p>;
      chomp @patterns;
      for my $filename (@ARGV) {
          open my $f, "< $filename" or die "could not open $filename: $!";
          my @lines = <$f>;
          for my $pattern (@patterns) {
              printf "%s has %d matches for pattern /%s/\n",
                  $filename, scalar(grep /$pattern/, @lines), $pattern;
          }
      }

      Going back to the original approach of pulling the patterns from a database gives us:

      # get column 0 from all result rows at once...
      my $results = $sth->fetchall_arrayref([0]);   # error check here?
      # results are a ref to an array of array refs, so flatten to a plain array
      # (we could avoid this and use $results directly)
      my @patterns = map { $_->[0] } @$results;
      for my $filename (@ARGV) {
          open my $f, "< $filename" or die "could not open $filename: $!";
          my @lines = <$f>;
          for my $pattern (@patterns) {
              printf "%s has %d matches for pattern /%s/\n",
                  $filename, scalar(grep /$pattern/, @lines), $pattern;
          }
      }


        Hi, thanks a lot for your reply! I never thought the Perl Monks would help in this way. The details provided really helped me improve performance. I am sending you the updated code; please have a look and let me know whether this is OK, or whether I can still improve it.

        #!/usr/bin/perl
        use DBI;
        use warnings;
        use strict;

        my $dbh = DBI->connect('DBI:Oracle:R12COE', 'apps', 'app5vis')
            or die "couldn't connect to database: " . DBI->errstr;
        my $sth = $dbh->prepare("SELECT DISTINCT UPPER(OBJECT_NAME)
                                 FROM CG_COMPARATIVE_MATRIX_TAB
                                 WHERE OBJECT_NAME IS NOT NULL ORDER BY 1 ASC")
            or die "couldn't prepare statement: " . $dbh->errstr;
        $dbh->{AutoCommit}    = 0;
        $dbh->{RaiseError}    = 1;
        $dbh->{ora_check_sql} = 0;
        $dbh->{RowCacheSize}  = 16;

        my @data;
        my @obj_name;
        $sth->execute;
        while (@data = $sth->fetchrow_array()) {
            push @obj_name, $data[0];
        }

        my $dir = '/u05/oracle/R12COE/spotlighter/Search_Files/Forms';
        opendir(DIR, $dir) or die $!;
        my @files = grep { -f "$dir/$_" } readdir(DIR);
        closedir(DIR);

        # prepare the insert once, with placeholders, instead of re-preparing
        # an interpolated statement for every pattern
        my $sth1 = $dbh->prepare("INSERT INTO CUSTOM_FILES_SUMMARY
            (FILE_NAME, FILE_TYPE, DEP_OBJECT_NAME, OCCURANCE)
            VALUES (?, ?, ?, ?)")
            or die "couldn't prepare insert statement: " . $dbh->errstr;

        foreach my $file (@files) {
            open my $fh, '<', "$dir/$file" or die "couldn't open $file: $!";
            my @lines = <$fh>;    # slurp once; reading <$fh> inside the
            close $fh;            # pattern loop would exhaust the handle
                                  # after the first pattern
            my ($ext) = $file =~ /(\.[^.]+)$/;
            for my $obj_name (@obj_name) {
                my $count = scalar grep { /$obj_name/ } @lines;
                $sth1->execute($file, $ext, $obj_name, $count);
            }
        }
        $sth->finish;
        $sth1->finish;
        $dbh->disconnect;

Node Type: perlquestion [id://1002822]
Approved by davido
Front-paged by MidLifeXis