Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Perl program to search files

by harangzsolt33 (Pilgrim)
on Dec 26, 2018 at 01:18 UTC ( #1227699=perlquestion: print w/replies, xml ) Need Help??

harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

I have created a little perl script that searches documents for a kind of binary string and lists the files that contain that exact string. Can you make suggestions on how I could improve this program or what I could have done to make this better? Also, if you see some obvious bugs in my program, would you please tell me? Thank you! Here is my script:

Edit: Okay, as requested, I am inserting the code here instead of just a link. And thank you for your ideas and advice. I will read through them more carefully tomorrow and try to implement them! Thank you!

#!/usr/bin/perl # # This Perl program looks for a certain binary string # in a file and lists the files found and the total number # of files found. I tested this script in TinyPerl 5.8. # ################################################################# use strict; use warnings; my $FIND = ".\nThis is"; my $CASE_SENSITIVE = 0; my $FOLDER = 'H:\\HOME'; my $RECURSIVE = 1; my @INCLUDE = qw( *.txt *.doc *.xls *.ppt *.log *.htm *.html * +.xml *. *.eml *.pdf *.rtf *.bas *.c *.h *.pl *.cgi *.bak *.reg *.ini +*.inf *.ion *.1st *.lst *.pas *.py *.cpp *.php *.xfm *.js *.asm *.rt +*.old ); my $VERBOSE = 0; my $PAGE_BREAK = 0; ################################################################# Main ### ### PROGRAM STARTS HERE: ## # my $TOTAL = 0; my $BUFF_SIZE = 4000000; my $LINUX = (index(uc($^O), 'WIN') < 0) ? 1 : 0; $CASE_SENSITIVE or $FIND = uc($FIND); CheckDIR($FOLDER); print "\n", '-' x 72; print "\n $TOTAL file(s) found with matching string\n"; exit; ################################################################# Sear +chFile # # This function is automatically called by CheckDIR() every time # a file is found. This function gets the path & name of the file. # # Usage: SearchFile(PATH, NAME) <-- Called by CheckDIR() # sub SearchFile { my $PATH = $_[0]; my $NAME = $_[1]; my $FULLNAME = "$PATH$NAME"; (-f $FULLNAME) or return; # not a plain file? my $L = -s $FULLNAME; # Get file size $L or return; # file size == 0? # Is the search string longer than # the file we're searching? $L >= length($FIND) or return; my $F; my $START = 0; my $FOUND = 0; my $BUFFER; open $F, "<$FULLNAME" or return; binmode $F; while ($START < $L) { if ($START > $L) { last; } read $F, $BUFFER, $BUFF_SIZE, $START; if (defined $BUFFER) { if (!$CASE_SENSITIVE) { $BUFFER = uc($BUFFER); } if (index($BUFFER, $FIND) >= 0) { $FOUND++; last; } } else { last; } if (length($BUFFER) < $BUFF_SIZE) { last; } $START += $BUFF_SIZE - length($FIND); } close $F; if ($FOUND) { $TOTAL++; if ($PAGE_BREAK) { if ($TOTAL % $PAGE_BREAK == 0) { print "\nPress ENTER to continue..."; <STDIN>; } } PRINT("FOUND: $FULLNAME"); } } ################################################################# Chec +kDIR # # This function reads the contents of a folder and calls # SearchFile() for each file that was found. # # Usage: CheckDIR(PATH) # sub CheckDIR { @_ or return; my $FULLNAME; my $PATH = shift; defined $PATH or return; length($PATH) or return; if ($VERBOSE) { PRINT("SEARCHING: ", FormatPath($PATH)); } $PATH = AddSeparator($PATH); opendir(my $DIR, $PATH) or return; while (my $NAME = readdir $DIR) { $FULLNAME = "$PATH$NAME"; if (-d($FULLNAME)) { # Check into subdirectory if RECURSIVE == 1 # Skip directory if its name starts with "." if ($RECURSIVE) { CheckDIR($FULLNAME) unless (vec($NAME, 0, 8) == 46); } next; } # Scan file contents only if the file name # matches a certain pattern. foreach my $W (@INCLUDE) { if (isMatch($NAME, $W)) { SearchFile($PATH, $NAME); } } } closedir $DIR; } ################################################################# isMa +tch # # This function returns 1 if a filename matches a certain # wildcard pattern. There may be several question marks in # the search pattern, but only one asterisk is allowed! # The asterisk may be at the beginning, the end, or in the # middle of the pattern. A question mark matches any # character. An asterisk matches zero or more characters. # The matching is NOT case sensitive! # # Usage: INTEGER = isMatch(FILENAME, WILDCARD) # # Example: isMatch("New_Document.txt", "n*.txt") ---> 1 # sub isMatch { @_ > 1 or return 0; my $F = shift; defined $F or return 0; length($F) or return 0; my $W = shift; defined $W or return 0; length($W) or return 0; $F = uc($F); $W = uc($W); # If there aren't any wildcards at all... if (CountChars($W, '*?') == 0) { return ($F eq $W) ? 1 : 0; } # Match what's before the asterisk... return 0 unless (_isMatch($F, $W, 1)); # Match what comes after the asterisk... return _isMatch($F, $W, -1); } ################################################################# _isM +atch # # This private function is called by isMatch() # This function compares two strings and returns 1 if # both strings match until the first asterisk. # This function can start comparing strings starting # from the beginning or starting from the end! # DIRECTION must be either 1 or -1. # # Usage: INTEGER = _isMatch(FILENAME, WILDCARD, DIRECTION) # sub _isMatch { my $F = shift; my $f; my $LF = length($F)-1; my $W = shift; my $w; my $LW = length($W)-1; my $DIR = shift; my $STOP = $LW; my $START = 0; my $FSTART = 0; if ($DIR < 0) { $STOP = 0; $START = $LW; $FSTART = $LF; } while ($START != $STOP) { $w = vec($W, $START, 8); # Grab byte from wildcard pattern $f = vec($F, $FSTART, 8); # Grab byte from filename $START += $DIR; $FSTART += $DIR; if ($w == 42) # ASTERISK? { return 1; } else { # If the character is "?" then skip, but if # it's not "?", then the characters must match. if ($w != 63) { ($f == $w) or return 0; } } } return 1; } ################################################################# PRIN +T # # This function prints the total number of matches found # so far followed by some text and a new line character. # # Usage: PRINT(STRING) # sub PRINT { print ' ', RSPACE(8, $TOTAL), @_, "\n"; } ################################################################# Form +atPath # # This function replaces the \\ or / characters in a path # to match the current OS's separator character. # # Usage: PATH = FormatPath(PATH) # sub FormatPath { @_ or return ''; my $PATH = $_[0]; length($PATH) or return ''; if ($LINUX) { $PATH =~ tr|\\|/|; } else { $PATH =~ tr|/|\\|; } return $PATH; } ################################################################# AddS +eparator # # This function makes sure that the given path ends with # a / or \ depending on the current OS. # # Usage: PATH = AddSeparator(PATH) # sub AddSeparator { @_ or return ''; my $PATH = $_[0]; length($PATH) or return ''; my $c = vec($PATH, length($PATH)-1, 8); return FormatPath($PATH . (($c == 47 || $c == 92) ? '' : '/') ); } ################################################################# RSPA +CE # # This function adds spaces to the end of a string list to # make sure it's exactly N characters long. If the string # is longer than N, then just returns the string itself. # # Usage: STRING = RSPACE(N, STRING, [STRINGS...]) # sub RSPACE { @_ > 1 or return ''; my $LEN = shift; my $S = join('', @_); my $L = length($S); $L < $LEN or return $S; return $S . (' ' x ($LEN - $L)); } ################################################################# Coun +tChars # # This function counts how many times STRING contains # any of the characters of SUBSTR. # # Usage: INTEGER = CountChars(STRING, SUBSTR) # sub CountChars { @_ > 1 or return 0; my $S = shift; defined $S or return 0; length($S) or return 0; my $L = shift; defined $L or return 0; length($L) or return 0; my $P; my $c; my $i = length($L); my $N = 0; while ($i-- > 0) { $P = 0; $c = substr($L, $i, 1); while (($P = 1+index($S, $c, $P)) > 0) { $N++; } } return $N; } #################################################################

Btw I tested the program, and it seems to work fine.

Replies are listed 'Best First'.
Re: Perl program to search files
by haukex (Chancellor) on Dec 26, 2018 at 10:07 UTC
    if you see some obvious bugs in my program, would you please tell me?

    Unfortunately your algorithm suffers from an issue: when the text to be found happens to be split across two buffers, the match is not found. To demonstrate, try setting $BUFF_SIZE = 4096, and then generating a test file with e.g. perl -e 'print "x"x4095, ".\nThis is", "x"x5000' >test.txt - the match won't be found, but if you either set $BUFF_SIZE = 4095, or change the "x"x4095 to "x"x4096, the match is found. A typical solution to this kind of problem would be to implement a sliding window (Update 2: see reply by vr below), making sure that the buffers are large enough so that the search string can be contained in them. Alternatively, if the files are always large enough to fit into memory, of course it's possible to just load the entire file into memory at once. You might want to have a look at these nodes, for example: Matching in huge files and Re: Search hex string in vary large binary file.

    Can you make suggestions on how I could improve this program or what I could have done to make this better?

    As Athanasius already mentioned, there are some things you're re-implementing. This is fine as a learning exercise, but I think it also helps to be aware of these kinds of things. It also gives you something to test your own implementations against (which I would recommend doing).

    In addition, I agree that variable names in all uppercase should be reserved for constants, IMO including variables that get set once at the top of the program and aren't supposed to be changed, so e.g. IMO $LINUX is fine, but e.g. $START is IMO not a good choice. Also I agree that magic numbers aren't good, although I would take it a step further, e.g. my $ASTERISK = ord("*");, my $DOT = ord(".");, etc. Some more descriptive variable names would be helpful in understanding the code as well, e.g. sub CountChars is all one-letter variables.

    A couple of other thoughts/issues:

    • my $LINUX = (index(uc($^O), 'WIN') < 0) ? 1 : 0; - this will misdetect darwin as Windows (or rather, "not Linux", despite it being a *NIX OS), and there are lots of other OSes that are neither Linux nor Windows. You might just want to stick to checking against "MSWin32" - but as I said above really the best solution here IMO is to leave the handling of filenames to a module.
    • opendir(my $DIR, $PATH) or return; and open $F, "<$FULLNAME" or return; - you might want to consider reporting this to the user instead of silently skipping files/dirs.
    • You might want to have a look at Getopt::Long to be able to handle command-line arguments instead of editing the variables in the source.
    • Update: Instead of a long list of globs in @INCLUDE, you could build a single regex to match against the filenames. See e.g. Building Regex Alternations Dynamically.

      Minor note:

      A typical solution to this kind of problem would be to implement a sliding window

      But he is trying to slide:

      $START += $BUFF_SIZE - length($FIND);

      Unfortunately, it seems that his 4th argument to read is understood as "offset into file", while, of course, it is "offset into buffer". If this

      read $F, $BUFFER, $BUFF_SIZE, $START;

      is replaced with

      seek $F, $START, 0; read $F, $BUFFER, $BUFF_SIZE;

      then his sliding window works.

        But he is trying to slide

        Ah, you are correct, thank you!

Re: Perl program to search files
by Athanasius (Bishop) on Dec 26, 2018 at 04:31 UTC

    Hello harangzsolt33,

    One thing I notice is that your sub RSPACE is essentially duplicating part of the functionality already present in Perl’s printf and sprintf builtin functions. In fact, your functions RSPACE and PRINT can be removed altogether, and the two places where PRINT is called can be written as follows:

    # (1) in sub SearchFile: PRINT("FOUND: $FULLNAME"); # original printf " %-8dFOUND: %s\n", $TOTAL, $FULLNAME; # replacement # (2) in sub CheckDIR: if ($VERBOSE) { PRINT("SEARCHING: ", FormatPath($PATH)); } + # original printf " %-8dSEARCHING: %s\n", $TOTAL, FormatPath($PATH) if $VERBOSE; + # replacement

    Note the use of a statement mofifier in the second case. In general, statement modifiers should be preferred over compound statements whenever a single statement is involved, as this simplifies the code.

    Some of your coding conventions could also be improved (IMHO):

    • The use of all-uppercase variable names should be reserved for constants.
    • Magic numbers (even ASCII values) should be replaced by constants:
      use Const::Fast; const my $ASTERISK => 42; const my $DOT => 46; const my $QUESTION => 63; # etc.
    • In a script intended to be called directly from the command line, and not used as a module, the distinction between “public” and “private” functions is pointless. (Conversely, if the script is intened to be called as a module, it should begin with an appropriate package declaration.)

    I haven’t studied your code in detail. Overall, it looks good: well thought-out and implemented.

    Regarding posting:
    (1) It would be better to include your code in the question itself (inside <readmore> tags!) than to provide a web address, as the latter is likely to become invalid and so render this thread incomprehensible to future readers.
    (2) You say you have tested the program. Adding your test code (again, in <readmore> tags, of course) would aid the monks in evaluating your code.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Note the use of a statement modifier in the second case. In general, statement modifiers should be preferred over compound statements whenever a single statement is involved, as this simplifies the code.

      I STRONGLY oppose to this statement. This might be true for you, but I personally hate statement modifiers used for this, as it causes me to read code backwards, which is a maintenance nightmare.

      IF simplification to one-liners is required, I use it the other way round (TIMTOWTDI)

      if ($VERBOSE) { PRINT ("SEARCHING: ", FormatPath ($PATH)); } + # original $VERBOSE and printf "" %-8dSEARCHING: %s\n", $TOTAL, FormatPath ($PATH +); # Readable and understandable printf " %-8dSEARCHING: %s\n", $TOTAL, FormatPath ($PATH) if $VERBOSE; + # unmaintanable

      Enjoy, Have FUN! H.Merijn

        Agree. This seems to be a matter of personal taste as indicated by the discussion at: A matter of style: how to perform a simple action based on a simple condition?

        In addition to preferring block-if to postfix-if I always use the multi-line form of block-if. Why? Consider changing:

        if ($VERBOSE) { PRINT("SEARCHING ..."); }
        to:
        if ($VERBOSE) { # Do something else here ... or just add a clarifying comment PRINT("SEARCHING ..."); }
        With multi-line block-if that's an easy to understand one line change and easier to understand at a glance when looking at the change history of the file in version control. OTOH, with postfix-if, adding a second statement to the if condition requires making more and harder-to-understand code changes.

Re: Perl program to search files
by Anonymous Monk on Dec 28, 2018 at 14:13 UTC
    A great example of how not to write even basic programs. You code also doesn't do what you claim it does, throw in reinventing the wheel, and making it worse than established methods.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1227699]
Approved by Athanasius
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2019-11-17 13:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Strict and warnings: which comes first?



    Results (86 votes). Check out past polls.

    Notices?