
Re: Searching pattern in 400 files and getting count out of each file

by grondilu (Pilgrim)
on Nov 08, 2012 at 08:44 UTC


in reply to Searching pattern in 400 files and getting count out of each file

Indeed, if you are worried about performance, you should probably not shell out to "grep" but use Perl's built-in grep function instead.

Why not just simply open each file one at a time?

open my $p, '<', shift @ARGV or die "could not open pattern file: $!";
my @pattern = map { chomp; qr/$_/ } <$p>;

for my $filename (@ARGV) {
    open my $f, '<', $filename or die "could not open $filename: $!";
    my @lines = <$f>;
    for my $pattern (@pattern) {
        # grep needs a match expression; a bare qr// object is always true
        printf "%s has %d matches for pattern %s\n",
            $filename, scalar(grep { /$pattern/ } @lines), $pattern;
    }
}
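
To make the counting step concrete, here is a minimal sketch that runs on in-memory data instead of real files. The helper name count_matches and the sample lines are assumptions for illustration, not part of the original post. It relies on grep returning the number of matching elements in scalar context, and it matches each line with /$pattern/ rather than passing the compiled regex itself as the test.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count the lines
# that match a precompiled pattern.
sub count_matches {
    my ($pattern, @lines) = @_;
    # grep in scalar context returns the number of elements that matched.
    return scalar grep { /$pattern/ } @lines;
}

# Sample data standing in for the contents of one file.
my @lines    = ("foo bar\n", "baz\n", "foo foo\n");
my @patterns = map { qr/$_/ } ('foo', 'ba');

for my $pattern (@patterns) {
    printf "%d matching lines for pattern %s\n",
        count_matches($pattern, @lines), $pattern;
}
```

Note that this counts matching lines, not total occurrences: a line containing the pattern twice still counts once.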

NB. This post was edited several times.


Re^2: Searching pattern in 400 files and getting count out of each file
by space_monk (Chaplain) on Nov 08, 2012 at 11:30 UTC
    IIRC, the original patterns are pulled from a database, but in the above case pulling the patterns from a file each time could perhaps be avoided by reading them all in at once:

    open my $p, '<', shift @ARGV or die "could not open pattern file: $!";
    my @patterns = <$p>;
    chomp @patterns;

    for my $filename (@ARGV) {
        open my $f, '<', $filename or die "could not open $filename: $!";
        my @lines = <$f>;
        for my $pattern (@patterns) {
            printf "%s has %d matches for pattern /%s/\n",
                $filename, scalar(grep /$pattern/, @lines), $pattern;
        }
    }

    Going back to the original pulling patterns from a db gives us:

    # get column 0 from all result rows simultaneously....
    my $results = $sth->fetchall_arrayref([0]);
    # error check here?
    # results is a ref to an array of array refs, so flatten it to a plain array
    # we could avoid this and use "results" directly
    my @patterns = map { $_->[0] } @$results;

    for my $filename (@ARGV) {
        open my $f, '<', $filename or die "could not open $filename: $!";
        my @lines = <$f>;
        for my $pattern (@patterns) {
            printf "%s has %d matches for pattern /%s/\n",
                $filename, scalar(grep /$pattern/, @lines), $pattern;
        }
    }

    A Monk aims to give answers to those who have none, and to learn from those who know more.

      Hi, thanks a lot for your reply. I never thought that the Perl Monks would really help in this way. The details provided really helped me improve performance. Here is the updated code; please have a look and let me know whether it is OK or whether I can still improve it.

      #! /usr/perl
      use DBI;
      use warnings;
      use strict;
      my $dbh = DBI->connect('DBI:Oracle:R12COE','apps','app5vis') or die "couldn't connect to database: " . DBI->errstr;
      my $sth = $dbh->prepare("SELECT DISTINCT UPPER(OBJECT_NAME) FROM CG_COMPARATIVE_MATRIX_TAB WHERE OBJECT_NAME IS NOT NULL ORDER BY 1 ASC") or die "couldn't prepare statement: " . $dbh->errstr;
      $dbh->{AutoCommit} = 0;
      $dbh->{RaiseError} = 1;
      $dbh->{ora_check_sql} = 0;
      $dbh->{RowCacheSize} = 16;
      my $sth1;
      my $count = 0;
      my $i = 0;
      my $path = "";
      my $result;
      my @search;
      my $arrsize;
      my $filename;
      my $fextn;
      my @files;
      my $dir;
      my $file;
      my $fh;
      my $ext = "";
      my $j = 0;
      my @data;
      $sth->execute;
      my @obj_name;
      my $obj_name;
      while(@data = $sth->fetchrow_array()) {
          $obj_name[$j]= $data[0];
          $j++;
      }
      $dir = '/u05/oracle/R12COE/spotlighter/Search_Files/Forms';
      opendir(DIR,$dir)or die $!;
      @files = grep{-f "$dir/$_"} readdir(DIR);
      # $result = `grep -i -w -c "$data[0]" /u05/oracle/R12COE/spotlighter/Search_Files/Forms/*`;
      foreach $file (@files) {
          open $fh,"< $file" or die "couldn't open $file:$!";
          {
              for $obj_name(@obj_name) {
                  ($ext) = $file =~ /(\.[^.]+)$/;
                  #printf "%s has %d matches for pattern /%s/\n",$file,scalar(grep /$obj_name/, <$fh>),$obj_name;
                  $count = scalar(grep /$obj_name/, <$fh>);
                  $sth1 = $dbh->prepare("INSERT INTO CUSTOM_FILES_SUMMARY(FILE_NAME,FILE_TYPE,DEP_OBJECT_NAME,OCCURANCE)VALUES('$file','$ext','$obj_name',$count)") or die "couldn't insert statement: " . $dbh->errstr;
                  $sth1->execute;
              }
          }
          close $fh;
      }
      $sth->finish;
      $sth1->finish;
      $dbh->disconnect;

        Within the outer loop “foreach $file (@files)”, each file is opened once for reading, and then closed after the inner loop has completed. (The extra block enclosing this inner loop is redundant, BTW.) But within the inner loop, the filehandle $fh is read from each time through the loop. The result is that after the first call to <$fh> in list context, the entire file has been read and the filehandle now points to the end of the file. On each subsequent iteration of the inner loop, <$fh> returns an empty list, so $count will then always be zero.

        There are two ways to fix this:

        (1) Add the following line before the call to grep:

        seek($fh, 0, 0);

        This will ensure that the filehandle begins again at the beginning of the file on each iteration. See seek.

        (2) Read the entire file into memory before the inner loop (store it as an array of lines), and apply the grep to this in-memory array. This strategy may take up a lot of memory (i.e., if the files are large), but it will save a lot of processing time. Reading from a file is an inherently time-consuming operation, which your script is currently repeating each time through the inner loop (or, at least, it would be doing so if the seek were in there!).
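
The two fixes can be sketched side by side as follows. This is only an illustration under stated assumptions: the sub names counts_with_seek and counts_with_slurp, the sample text, and the patterns are invented here, and an in-memory filehandle stands in for a real file so the sketch runs without touching the disk.

```perl
use strict;
use warnings;

# Sample "file" contents and patterns (assumed for illustration only).
my $text     = "alpha beta\nbeta\ngamma\n";
my @patterns = ('alpha', 'beta');

# Fix (1): rewind the filehandle before every read.
sub counts_with_seek {
    my ($fh, @patterns) = @_;
    my %count;
    for my $pattern (@patterns) {
        seek($fh, 0, 0) or die "could not rewind: $!";
        # <$fh> in list context reads the whole file again on each pass.
        $count{$pattern} = scalar grep { /$pattern/ } <$fh>;
    }
    return %count;
}

# Fix (2): read the file once into an array, then grep the array.
sub counts_with_slurp {
    my ($fh, @patterns) = @_;
    my @lines = <$fh>;
    my %count;
    for my $pattern (@patterns) {
        $count{$pattern} = scalar grep { /$pattern/ } @lines;
    }
    return %count;
}

# Open the scalar as an in-memory file.
open my $fh, '<', \$text or die "could not open in-memory file: $!";
my %by_seek = counts_with_seek($fh, @patterns);

seek($fh, 0, 0) or die "could not rewind: $!";
my %by_slurp = counts_with_slurp($fh, @patterns);
close $fh;

for my $pattern (@patterns) {
    printf "%s: seek=%d slurp=%d\n",
        $pattern, $by_seek{$pattern}, $by_slurp{$pattern};
}
```

Both subs should report the same counts; the difference is that the first re-reads the file once per pattern, while the second trades memory for a single read.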

        Now some general advice: As a matter of good Perl style, you should declare a variable only at the latest possible place in the code. In the script as given, a number of variables are declared but not used at all, and others are declared way ahead of time. Perl is not C! Get in the habit of declaring variables at the point of first use, and your code will become clearer and easier to debug and maintain.
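
As a small illustration of declaring variables at the point of first use, the sketch below keeps every variable scoped to the block that needs it. The file names and the helper name extension_of are hypothetical; the extension regex is the one from the script under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: return the extension (including the dot),
# or undef if the name has none.
sub extension_of {
    my ($file) = @_;
    # $ext is declared right where it is first assigned.
    my ($ext) = $file =~ /(\.[^.]+)$/;
    return $ext;
}

# Made-up file names standing in for the directory listing.
my @files = ('order_entry.fmb', 'custom_lib.pll', 'README');

# The loop variable is declared in the loop itself, so it cannot
# leak into (or be clobbered by) code elsewhere in the script.
for my $file (@files) {
    my $ext = extension_of($file);
    printf "%s -> %s\n", $file, defined $ext ? $ext : '(no extension)';
}
```

With this style, strict and warnings can flag a misspelled or stale variable close to where the mistake happens, instead of silently reusing a value declared far above.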

        Update: Here is my (untested!) re-write of the script:

        Hope that helps,

        Athanasius <°(((><contra mundum
