PerlMonks  

File::Grep

by Masem (Monsignor)
on Jan 19, 2002 at 20:22 UTC
Category: File utilities
Author/Contact Info Michael K. Neylon [mailto://mneylon-pm@masemware.com]
Description:

There were a few questions in SOPW that involved finding patterns in a number of files, and the answers typically revolved around a system call to grep. I was thinking that it would be easier if they simply used File::Grep, when I realized that File::Grep does not exist at all, despite all the other File:: modules. While grepping is a trivial task, I wrote this with efficiency in mind: the calling context determines how much work is done, using the three-way state of wantarray (list, scalar, or void) to tell the difference.

Note that I've not released this to CPAN yet. Two issues are still to be decided. First, the File:: namespace appears to be typically reserved for several base packages, and I don't know if releasing this as File::Grep would invade that space (similar to releasing a DBI::-named package, which should be in the DBIx:: namespace). Second, I can think of some supplementary functions to add to this, such as "fgrep_flat", which would return the flattened array of matches for all files, and possibly "fgrep_into_array", to which the user would supply a reference to an array, so that for very large files the transfer of the return value would not temporarily duplicate the large number of matches and possibly overwhelm memory. So I'm soliciting other suggestions, including whether this module is even necessary (I've not seen anything via CPAN, Google, or PM Super Search that implies a similar module exists). If this is just duplication of effort, I'm not too worried; it only took half an hour to write and test appropriately, most of that in CVS and testing.
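The two supplementary functions mentioned above could look roughly like the following. This is only a sketch of the idea, not the author's implementation: the bodies, the `(&$@)` prototype, and the shared helper `_each_match` are assumptions made for illustration, and the code does not depend on File::Grep itself.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: call $callback->($line) for every line of every
# file for which $coderef (run with $_ set to the line) returns true.
sub _each_match {
    my ( $coderef, $callback, @files ) = @_;
    foreach my $file (@files) {
        my $fh;
        unless ( open $fh, '<', $file ) {
            carp "Cannot open file $file to grep: $!";
            next;
        }
        while ( defined( my $line = <$fh> ) ) {
            local $_ = $line;
            $callback->($line) if $coderef->();
        }
        close $fh;
    }
    return;
}

# Return one flat list of matching lines across all files.
sub fgrep_flat (&@) {
    my ( $coderef, @files ) = @_;
    my @flat;
    _each_match( $coderef, sub { push @flat, $_[0] }, @files );
    return @flat;
}

# Append matching lines to a caller-supplied array and return how many
# were added, so the matches are never copied through a return value.
sub fgrep_into_array (&$@) {
    my ( $coderef, $target, @files ) = @_;
    my $before = @$target;
    _each_match( $coderef, sub { push @$target, $_[0] }, @files );
    return @$target - $before;
}
```

With these definitions in scope, a call would read `my $n = fgrep_into_array { /index\.html/ } \@hits, @logfiles;`, where `@hits` and `@logfiles` are the caller's own variables.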

#!/usr/bin/perl -w

package File::Grep;

use strict;
use Carp;

BEGIN {
    use Exporter   ();
    use vars       qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS);
    $VERSION     = sprintf( "%d.%02d", q(0.01) =~ /(\d+)\.(\d+)/ );
    @ISA         = qw(Exporter);
    @EXPORT      = qw();
    @EXPORT_OK   = qw( fgrep );
    %EXPORT_TAGS = (  );
}

sub fgrep (&@) {
    my ( $coderef, @files ) = @_;
    my $returntype;
    if ( wantarray ) { 
        $returntype = 2;                # Return everything
    } elsif ( defined( wantarray ) ) {
        $returntype = 1;                # Return just the count
    } else {
        $returntype = 0;                # Return at first match
    }

    my @matches;
    my $count = 0;      # Start at 0 so scalar context never returns undef

    foreach my $file ( @files ) {
        if ( $returntype == 2 ) {
            push @matches, { filename => $file,
                             count => 0,
                             matches => [] };
        }
        # Three-argument open with a lexical handle: no global FILE
        # handle to clobber, and odd filenames are not misparsed.
        my $fh;
        unless ( open $fh, '<', $file ) {
            carp "Cannot open file $file to grep: $!";
            next;
        }
        while ( my $line = <$fh> ) {
            local $_ = $line;
            if ( $coderef->() ) {
                $count++;
                last if ( $returntype == 0 );   # Last of while loop!
                if ( $returntype == 2 ) {
                    $matches[-1]->{ count }++;
                    push @{ $matches[-1]->{ matches } }, $line;
                }
            }
        }
        close $fh;
        if ( !$returntype && $count ) {
            return 1;
        }
    }
    if ( $returntype == 2 ) { 
        return @matches;
    } elsif ( $returntype == 1 ) {
        return $count;
    } else {
        return 0;        # Void context; if here, nothing was found, ever
    }
}

1;
__END__

=head1 NAME

File::Grep - Find matches to a pattern in a series of files

=head1 SYNOPSIS

  use File::Grep qw( fgrep );
  
  # Boolean test (an if condition is scalar context, so this gets the count)
  if ( fgrep { /$user/ } "/etc/passwd" ) { do_something(); }

  # Scalar context
  my $hits = fgrep { /index\.html/ } glob "/var/log/httpd/access.log.*";
  print "The index page was hit $hits times\n";

  # Array context
  my @matches = fgrep { /index\.html/ } glob "/var/log/httpd/access.log.*";
  foreach my $matchset ( @matches ) {
      print "There were ", $matchset->{ count }, " matches in ", 
          $matchset->{ filename }, "\n";
  }


=head1 DESCRIPTION

File::Grep mimics the functionality of the grep function in perl, but
applies it to files instead of a list.  This is similar in nature to
the UNIX grep command, but more powerful as the pattern can be any legal
perl function.

While looking for patterns in files is trivially easy, File::Grep takes
steps to be efficient in both computation and resources.  Namely, if
called in void context, it will short circuit execution when a match is
located and immediately report truthfulness.  In scalar context, it will
only keep track of the number of matches and return that value.  In
array context, it will generate an array of hashes that include details
on the matching -- specifically for each hash, key "filename" will be
the name of the current file, "count" will be the number of hits, and
"matches" will be an array reference containing the matched lines, in
order of discovery.  The ordering of this array will follow the same
order of files as passed in to fgrep.

The syntax for this command is similar to grep: 

   fgrep BLOCK ARRAY.  
   
The block should be a subroutine that returns true if a match was found
and false otherwise.  The variable $_ will be localized before this
routine is called, so may be used to process the current line.  Note,
however, that only the original content of the line is saved in the
array of hashes in array context.  The array is a list of files to be
grepped.  If a file cannot be opened, a warning will be issued, though
the function will continue to process remaining files; in addition, an
entry in the array of hashes will still be created so as not to mess up
any indexing with the original file list.

=head1 EXPORT

"fgrep" may be exported, but this is not set by default.

=head1 AUTHOR

Michael K. Neylon, E<lt>mneylon-pm@masemware.comE<gt>

=head1 SEE ALSO

L<perl>.

=cut
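
The three-way wantarray dispatch that fgrep relies on can be seen in isolation with a small sketch. The sub name `record_context` and the `@seen` array are made up for this illustration. Note the last call: the condition of an `if` is evaluated in scalar context, not void context, so a test such as `if ( fgrep { ... } @files )` takes the counting branch rather than the short-circuit branch.

```perl
use strict;
use warnings;

my @seen;    # context recorded by each call, in order

# wantarray is true in list context, false but defined in scalar
# context, and undef in void context.
sub record_context {
    my $want = wantarray;
    push @seen, $want ? 'list' : defined $want ? 'scalar' : 'void';
    return 1;
}

my @list   = record_context();    # list assignment
my $scalar = record_context();    # scalar assignment
record_context();                 # result discarded
if ( record_context() ) { }       # boolean test

print join( ', ', @seen ), "\n";  # prints "list, scalar, void, scalar"
```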
Replies are listed 'Best First'.
Re: File::Grep
by larryk (Friar) on Jan 20, 2002 at 17:58 UTC
    Good job. The only thing I can add is perhaps you should binmode FILE; as (certainly on Windoze) I have had problems with corrupt logfile lines, binary data and perl jumping out of while loops. Correct me if I'm wrong but doing binmode on OSes that do not require it doesn't affect (no-op?) file ops? So it would be precautionary - alternatively use another module to handle (no pun intended) the file.
       larryk                                          
    perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
    
      Unfortunately, turning binmode on by default would be problematic as well. But I think I have another solution (which I'm soliciting feedback on here; hopefully people will catch this and offer replies...)

      Instead of passing a file list, I could have the array be a mixed set of either filehandles or scalars. If an element is a filehandle, it will be assumed to be an OPEN filehandle, and lines will be read directly from it. Otherwise, it will be treated as a filename, as above. The only problem here is trying to determine the difference between a scalar and a filehandle. I can't think of an easy test that will capture all the possible cases of filehandles, including those that are from the IO:: modules. If anyone has any ideas, that would be helpful.

      But obviously, if this were in place and you had binary files that you wanted to search, it would be rather trivial to create a list of open filehandles, all set to binmode, before passing them to this function.

      -----------------------------------------------------
      Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      "I can see my house from here!"
      It's not what you know, but knowing how to find it if you don't know that's important


        I ran into this problem in the past when trying to allow a function to accept filenames or filehandles.

        From my testing at the time I don't think it is possible to determine if a scalar is a filehandle or not. As you point out the IO:: objects would be difficult to handle.

        The solution that I came up with was just to assume that any references passed to the function were filehandles and any scalars were filenames. Here is an extract from the code.

        # If the filename is a reference it is assumed that it is a valid
        # filehandle, if not we create a filehandle.
        #
        if (ref($OLEfile)) {
            $fh = $OLEfile;
        }
        else {
            # Create a new file, open for writing
            $fh = FileHandle->new("> $OLEfile");
            if (not defined $fh) {
                croak "Can't open $OLEfile. ......";
            }

            # binmode file whether platform requires it or not
            binmode($fh);
        }

        # Store the filehandle
        $self->{_filehandle} = $fh;

        This isn't bullet-proof but if the documentation is explicit then it may be sufficient.
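
One possible way to tighten the "reference means filehandle" assumption is the `openhandle` function from Scalar::Util, which returns its argument if it is an open filehandle and undef otherwise. The sketch below assumes a Scalar::Util recent enough to provide `openhandle`; the helper name `handle_for` and its two-value return are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;
use Scalar::Util qw( openhandle );

# Hypothetical helper: given a filename or an already-open handle,
# return ( $handle, $we_opened_it ). Returns an empty list if a
# filename cannot be opened for reading.
sub handle_for {
    my ($source) = @_;

    # An open glob, glob reference, or IO::Handle object is used as-is.
    my $existing = openhandle($source);
    return ( $existing, 0 ) if defined $existing;

    # Anything else is treated as a filename.
    open my $fh, '<', $source
        or return;
    return ( $fh, 1 );
}
```

The second return value lets the caller close only the handles it opened itself and leave caller-supplied handles alone.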

        --
        John.

          ... if you had binary files that you wanted to search ...

        I actually meant ASCII files with one or two corrupt lines containing binary data - more specifically ^Z - the DOS EOF character which perl sees as the perfect opportunity to jump out of a while loop early (unless binmode is in effect).

        I'm not sure what you mean by

          Unfortunately, turning binmode on by default would be problematic as well.

        but I think I have a solution - use open. The open pragma affects I/O ops for the script and from perldoc open comes the following snippet

          The ":raw" discipline corresponds to "binary mode" and the ":crlf" discipline corresponds to "text mode" on platforms that distinguish between the two modes when opening files (which is many DOS-like platforms, including Windows). These two disciplines are currently no-ops on platforms where binmode() is a no-op, but will be supported everywhere in future.

        which may solve whatever problems you are suggesting binmode causes.
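
As an alternative to a script-wide pragma, the same layer can be requested for a single handle at open time. The following is a minimal sketch, assuming Perl 5.8 or later, where the mode argument of a three-argument open accepts layers; the sub name `count_lines_raw` is made up for the example.

```perl
use strict;
use warnings;

# Count the lines in $file, reading in binary mode so that a stray
# Ctrl-Z (0x1A) byte is not treated as end-of-file on DOS-like systems.
sub count_lines_raw {
    my ($file) = @_;
    open my $fh, '<:raw', $file
        or die "Cannot open $file: $!";
    my $lines = 0;
    while ( defined( my $line = <$fh> ) ) {
        $lines++;
    }
    close $fh;
    return $lines;
}
```

One trade-off to keep in mind: with ":raw" no CRLF translation takes place, so on Windows each line keeps its trailing "\r\n", and patterns anchored with `$` may need to allow for the "\r".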

           larryk                                          
        perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
        
Re: File::Grep: Add'l Functionality.
by dmitri (Priest) on Jan 21, 2002 at 22:42 UTC
    It would be nice if this module provided options to match across multiple lines (so that it is a little more Perlish).
      Right now, I don't slurp the entire file, so as to be efficient with memory. I can imagine a version where, at any one time, N lines from the file are in memory, thus allowing multiline regexes. The problem here is what to return in the case where you want the matched data. Do you return the N matched lines that triggered it? Or (and this is something to add to the current version) just the line number where the match started? I think the best way to handle this would be to write an extension for the module, File::Grep::Multiline, that I can add later after getting the main part up to CPAN.
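
The N-line window idea can be sketched as follows. This is an illustration only, not a proposed File::Grep::Multiline API: the name `window_grep` and its argument order are assumptions, and it reports the line number at which each matching window starts, which is the match's starting line only when N equals the number of lines the pattern spans.

```perl
use strict;
use warnings;

# Slide a window of at most $n lines over $fh. For each window in which
# $coderef returns true (with $_ set to the joined window), record the
# line number of the window's first line, skipping immediate repeats.
sub window_grep {
    my ( $coderef, $n, $fh ) = @_;
    my ( @window, @starts );
    my $lineno = 0;
    while ( defined( my $line = <$fh> ) ) {
        $lineno++;
        push @window, $line;
        shift @window if @window > $n;    # keep only the last $n lines
        local $_ = join '', @window;
        next unless $coderef->();
        my $start = $lineno - $#window;   # line number of window's first line
        push @starts, $start unless @starts && $starts[-1] == $start;
    }
    return @starts;
}
```

A call would look like `my @starts = window_grep( sub { /foo\nbar/ }, 2, $fh );`, with `$fh` an open read handle. Only N lines are held in memory at a time, which keeps the memory behaviour close to the line-by-line version.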

      -----------------------------------------------------
      Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      "I can see my house from here!"
      It's not what you know, but knowing how to find it if you don't know that's important
