Category: |
File utilities |
Author/Contact Info |
Michael K. Neylon [mailto:mneylon-pm@masemware.com]
Description: |
There were a few questions in SOPW that involved finding
patterns in a number of files, and the answers typically
revolved around using a system call to grep. I thought it
would be easier if posters could simply use File::Grep,
and then realized that File::Grep does not exist at all,
despite all the other File:: modules. While grepping is a
trivial task, I wrote this with efficiency in mind, such
that the calling environment affects performance, using
the three-way state of wantarray to determine the
difference.

Note that I've not released this to CPAN yet. Two issues remain to be
decided. First, the File:: namespace appears to be typically reserved for
several base packages, and I don't know if releasing this as File::Grep
would invade that space (similar to releasing a DBI::-named package, which
should instead live in the DBIx:: namespace). Secondly, I can think of some
supplementary functions to add to this, such as "fgrep_flat", which would
return a flattened array of matches across all files, and possibly
"fgrep_into_array", to which the user would supply an array reference, so
that for very large files the transfer of the return value would not
temporarily duplicate the large number of matches and possibly overwhelm
memory. So I'm soliciting other suggestions, or opinions on whether this
module is even necessary (I've not seen anything via CPAN, Google, or PM
Super Search that implies a similar beast exists). If this is just
duplication of effort, I'm not too worried; it only took half an hour to
write and test appropriately, most of that in CVS and testing.
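As a standalone illustration of the three-way wantarray dispatch mentioned above, here is a minimal sketch of the technique (not part of the module itself; the `context` sub is invented for the demo):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report which of the three calling contexts we were invoked in.
sub context {
    return "list"   if wantarray;            # true: list context
    return "scalar" if defined wantarray;    # defined but false: scalar
    return;                                  # undef: void context
}

my @l = context();    # list context
my $s = context();    # scalar context
context();            # void context; return value discarded

print "list: $l[0], scalar: $s\n";    # prints "list: list, scalar: scalar"
```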
#!/usr/bin/perl -w
package File::Grep;
use strict;
use Carp;

BEGIN {
    use Exporter ();
    use vars qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS);
    # The original sprintf/regex came from a CVS $Revision$ keyword;
    # against the plain string q(0.01) it never matched, so set the
    # version directly.
    $VERSION     = '0.01';
    @ISA         = qw(Exporter);
    @EXPORT      = qw();
    @EXPORT_OK   = qw( fgrep );
    %EXPORT_TAGS = ();
}

sub fgrep (&@) {
    my ( $coderef, @files ) = @_;
    my $returntype;
    if ( wantarray ) {
        $returntype = 2;    # List context: return everything
    }
    elsif ( defined wantarray ) {
        $returntype = 1;    # Scalar context: return just the count
    }
    else {
        $returntype = 0;    # Void context: return at first match
    }

    my @matches;
    my $count = 0;          # initialized so scalar context returns 0, not undef
    foreach my $file ( @files ) {
        if ( $returntype == 2 ) {
            push @matches, { filename => $file,
                             count    => 0,
                             matches  => [] };
        }
        my $fh;
        unless ( open $fh, '<', $file ) {
            carp "Cannot open file $file to grep: $!";
            next;
        }
        while ( my $line = <$fh> ) {
            local $_ = $line;
            if ( &$coderef ) {
                $count++;
                last if $returntype == 0;    # Exits the while loop only!
                if ( $returntype == 2 ) {
                    $matches[-1]->{count}++;
                    push @{ $matches[-1]->{matches} }, $line;
                }
            }
        }
        close $fh;
        if ( !$returntype && $count ) {
            return 1;
        }
    }
    if ( $returntype == 2 ) {
        return @matches;
    }
    elsif ( $returntype == 1 ) {
        return $count;
    }
    else {
        return 0;    # Void context; if here, nothing was found, ever
    }
}
1;
__END__
__END__
=head1 NAME

File::Grep - Find matches to a pattern in a series of files

=head1 SYNOPSIS

  use File::Grep qw( fgrep );

  # Void context
  if ( fgrep { /$user/ } "/etc/passwd" ) { do_something(); }

  # Scalar context
  my $hits = fgrep { /index\.html/ } glob "/var/log/httpd/access.log.*";
  print "The index page was hit $hits times\n";

  # List context
  my @matches = fgrep { /index\.html/ } glob "/var/log/httpd/access.log.*";
  foreach my $matchset ( @matches ) {
    print "There were ", $matchset->{ count }, " matches in ",
          $matchset->{ filename }, "\n";
  }

=head1 DESCRIPTION

File::Grep mimics the functionality of Perl's built-in grep function, but
applies it to files instead of a list. This is similar in nature to the
UNIX grep command, but more powerful, as the pattern can be any legal Perl
function.

While looking for patterns in files is trivially easy, File::Grep takes
steps to be efficient in both computation and resources. Namely, if called
in void context, it will short-circuit execution when a match is located
and immediately report truth. In scalar context, it will only keep track
of the number of matches and return that value. In list context, it will
generate an array of hashes that include details on the matching:
specifically, for each hash, the key "filename" will be the name of the
current file, "count" will be the number of hits, and "matches" will be an
array reference containing the matched lines, in order of discovery. The
ordering of this array will follow the same order as the files passed to
fgrep.

The syntax for this command is similar to grep:

  fgrep BLOCK LIST

The block should be a subroutine that returns true if a match was found.
The variable $_ will be localized before this routine is called, so it may
be used to process the current line. Note, however, that only the original
content of the line is saved in the array of hashes in list context. The
list is the set of files to be grepped. If a file cannot be opened, a
warning will be issued, though the function will continue to process the
remaining files; in addition, an entry in the array of hashes will still be
created, so as not to upset any indexing against the original file list.

=head1 EXPORT

"fgrep" may be exported, but it is not exported by default.

=head1 AUTHOR

Michael K. Neylon, E<lt>mneylon-pm@masemware.comE<gt>

=head1 SEE ALSO

L<perl>.

=cut
Re: File::Grep
by larryk (Friar) on Jan 20, 2002 at 17:58 UTC
Good job. The only thing I can add is that perhaps you should binmode FILE; as (certainly on Windoze) I have had problems with corrupt logfile lines, binary data, and perl jumping out of while loops early. Correct me if I'm wrong, but doing binmode on OSes that do not require it doesn't affect file ops (it's a no-op)? So it would be precautionary; alternatively, use another module to handle (no pun intended) the file.
larryk
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
Unfortunately, turning binmode on by default would be problematic as well. But I think I have another solution (which I'm soliciting opinions on here; hopefully people will catch this and offer replies...).
Instead of passing a file list, I could have the array be a mixed set of either filehandles or scalars. If an element is a filehandle, it will be assumed to be an OPEN filehandle, such that fgrep will read right from it. Otherwise, it will behave the same as above. The only problem here is determining the difference between a scalar and a filehandle. I can't think of an easy test that will capture all the possible kinds of filehandles, including those from the IO:: modules. If anyone has any ideas, that would be helpful.
But obviously, if this were in place and you had binary files that you wanted to search, it would be rather trivial to create a list of open filehandles, all set to binmode, before passing them to this function.
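A sketch of the caller side of that idea: open and binmode the handles yourself before any grepping happens. The example below is self-contained; it fakes a "binary" log with File::Temp and inlines the line loop (since fgrep does not accept handles yet, a filehandle-accepting version is hypothetical), showing that an embedded ^Z does not truncate the read once binmode is set:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Fake a log file containing an embedded ^Z (the DOS EOF character).
my ($out, $fname) = tempfile();
binmode $out;
print $out "hello\x1Aworld\n";
close $out;

# The caller opens and binmodes the handle up front; a hypothetical
# filehandle-accepting fgrep would then simply read from it.
open my $fh, '<', $fname or die "open: $!";
binmode $fh;
my $count = 0;
while ( my $line = <$fh> ) {
    $count++ if $line =~ /world/;
}
close $fh;
print "matches: $count\n";    # prints "matches: 1"; the ^Z did not stop the read
```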
-----------------------------------------------------
Dr. Michael K. Neylon - mneylon-pm@masemware.com
"You've left the lens cap of your mind on again, Pinky" - The Brain
"I can see my house from here!"
It's not what you know, but knowing how to find it if you don't know that's important
I ran into this problem in the past when trying to allow a function to accept filenames or filehandles.
From my testing at the time, I don't think it is possible to determine reliably whether a scalar is a filehandle or not. As you point out, the IO:: objects would be difficult to handle.
The solution that I came up with was simply to assume that any reference passed to the function was a filehandle and any plain scalar was a filename. Here is an extract from the code.
use FileHandle;    # required for FileHandle->new

# If the filename is a reference it is assumed that it is a valid
# filehandle; if not, we create a filehandle.
#
if ( ref($OLEfile) ) {
    $fh = $OLEfile;
}
else {
    # Create a new file, open for writing
    $fh = FileHandle->new("> $OLEfile");
    if ( not defined $fh ) {
        croak "Can't open $OLEfile. ......";
    }

    # binmode file whether platform requires it or not
    binmode($fh);
}

# Store the filehandle
$self->{_filehandle} = $fh;
This isn't bullet-proof but if the documentation is explicit then it may be sufficient.
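A hypothetical adaptation of that convention for fgrep's argument list might look like the following. The `resolve_handle` name is invented for this sketch; the ref test and FileHandle usage follow the extract above, except that files are opened read-only here rather than for writing:

```perl
use strict;
use warnings;
use Carp;
use FileHandle;

# Any reference is assumed to be an open filehandle; a plain scalar
# is treated as a filename and opened for reading.
sub resolve_handle {
    my ($thing) = @_;
    return $thing if ref $thing;              # pass handles through untouched
    my $fh = FileHandle->new("< $thing")
        or croak "Can't open $thing: $!";
    binmode $fh;                              # harmless where not needed
    return $fh;
}

open my $already, '<', __FILE__ or die "open: $!";
my $h1 = resolve_handle($already);    # unchanged: the very same reference
my $h2 = resolve_handle(__FILE__);    # opened on our behalf
print "passthrough: ", ( $h1 == $already ? "yes" : "no" ), "\n";    # prints "passthrough: yes"
```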
--
John.
... if you had binary files that you wanted search ...
I actually meant ASCII files with one or two corrupt lines containing binary data - more specifically ^Z - the DOS EOF character which perl sees as the perfect opportunity to jump out of a while loop early (unless binmode is in effect).
I'm not sure what you mean by
Unfortunately, turning binmode on by default would be problematic as well.
but I think I have a solution: use open. The open pragma affects I/O operations for the script, and from perldoc open comes the following snippet:
    The ":raw" discipline corresponds to "binary mode" and the ":crlf"
    discipline corresponds to "text mode" on platforms that distinguish
    between the two modes when opening files (which is many DOS-like
    platforms, including Windows). These two disciplines are currently
    no-ops on platforms where binmode() is a no-op, but will be
    supported everywhere in future.
which may solve whatever problems you are suggesting binmode causes.
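A minimal sketch of that approach: the open pragma applies the requested layer to handles opened in its lexical scope, so the ^Z below survives the read intact. File::Temp is used only to build a throwaway file, and its handle is binmoded explicitly since it is opened inside File::Temp rather than in this scope:

```perl
use strict;
use warnings;
use open IO => ':raw';    # default both input and output to binary mode
use File::Temp qw(tempfile);

my ($out, $fname) = tempfile();
binmode $out;                     # tempfile's handle is outside the pragma's scope
print {$out} "abc\x1Adef\n";      # embedded ^Z
close $out;

open my $in, '<', $fname or die "open: $!";    # :raw applies here
my $line = <$in>;
close $in;
print "read ", length($line), " bytes\n";    # prints "read 8 bytes"
```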
Re: File::Grep: Add'l Functionality.
by dmitri (Priest) on Jan 21, 2002 at 22:42 UTC
It would be nice if this module provided options to match across multiple lines (so that it is a little more Perlish).
Right now, I don't slurp the entire file, so as to be efficient on memory. I can imagine a version where, at any one time, N lines from the file are in memory, thus allowing multiline regexes. The problem here is what to return in the case where you want the matched data. Do you return the N lines that triggered the match? Or (something to add to this version regardless) just the line number where the match started? I think the best way to handle this would be to write an extension for the module, File::Grep::Multiline, that I can add later after getting the main part up to CPAN.
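One way such a File::Grep::Multiline might work, sketched with an invented window_grep helper: keep a rolling N-line window in memory, join it, and match the pattern against the joined text, reporting the 1-based line where each matching window starts. Purely illustrative; none of this is in File::Grep as posted:

```perl
use strict;
use warnings;

# Rolling-window multiline grep sketch. Returns the 1-based line
# numbers where a matching window begins.
sub window_grep {
    my ($regex, $n, @lines) = @_;
    my (@window, @hits);
    for my $i ( 0 .. $#lines ) {
        push @window, $lines[$i];
        shift @window if @window > $n;    # keep at most N lines
        if ( join('', @window) =~ $regex ) {
            push @hits, $i - $#window + 1;    # 1-based start of window
            @window = ();                     # don't re-report this region
        }
    }
    return @hits;
}

my @text = ( "begin\n", "middle\n", "end\n", "other\n" );
my @hits = window_grep( qr/middle\nend/, 2, @text );
print "match starting at line: @hits\n";    # prints "match starting at line: 2"
```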