Re: Perl program to search files

by haukex (Archbishop)
on Dec 26, 2018 at 10:07 UTC ( [id://1227704] )


in reply to Perl program to search files

if you see some obvious bugs in my program, would you please tell me?

Unfortunately, your algorithm has a bug: when the text to be found happens to be split across two buffers, the match is not found. To demonstrate, try setting $BUFF_SIZE = 4096, and then generating a test file with e.g. perl -e 'print "x"x4095, ".\nThis is", "x"x5000' >test.txt - the match won't be found, but if you either set $BUFF_SIZE = 4095, or change the "x"x4095 to "x"x4096, the match is found. A typical solution to this kind of problem would be to implement a sliding window (Update 2: see reply by vr below), making sure that the buffers are large enough so that the search string can be contained in them. Alternatively, if the files are always small enough to fit into memory, of course it's possible to just load the entire file into memory at once. You might want to have a look at these nodes, for example: Matching in huge files and Re: Search hex string in vary large binary file.
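For illustration, here is a minimal sketch of one way to do a sliding window search. It is not the original program: the sub name find_in_stream and its parameters are made up for this example. The idea is to carry the last length($find) - 1 bytes of each chunk over into the next one, so a match that straddles a chunk boundary is still seen in one piece.

```perl
use strict;
use warnings;

# Return the byte offset of the first occurrence of $find in the data
# behind $fh, or -1 if it does not occur.
# (Hypothetical helper; $chunk_size plays the role of $BUFF_SIZE.)
sub find_in_stream {
    my ($fh, $find, $chunk_size) = @_;
    die "empty search string\n" unless length $find;

    my $overlap = length($find) - 1;  # longest partial match that can straddle a boundary
    my $window  = '';                 # carried-over tail plus the newest chunk
    my $base    = 0;                  # file offset of the first byte in $window

    while (1) {
        my $got = read($fh, my $chunk, $chunk_size);
        die "read failed: $!" unless defined $got;
        last if $got == 0;            # end of file
        $window .= $chunk;

        my $pos = index($window, $find);
        return $base + $pos if $pos >= 0;

        # Keep only the tail that could still be the start of a match.
        if (length($window) > $overlap) {
            my $drop = length($window) - $overlap;
            substr($window, 0, $drop, '');
            $base += $drop;
        }
    }
    return -1;
}
```

Called with a filehandle opened on test.txt from above, a search string of ".\nThis is" and a chunk size of 4096, this should report the match even though the "." and the "\n" end up in different chunks.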

Can you make suggestions on how I could improve this program or what I could have done to make this better?

As Athanasius already mentioned, there are some things you're re-implementing. This is fine as a learning exercise, but I think it also helps to be aware of these kinds of things. It also gives you something to test your own implementations against (which I would recommend doing).

In addition, I agree that variable names in all uppercase should be reserved for constants. IMO that includes variables that get set once at the top of the program and aren't supposed to be changed, so e.g. $LINUX is fine, but $START is not a good choice. I also agree that magic numbers aren't good, although I would take it a step further, e.g. my $ASTERISK = ord("*");, my $DOT = ord(".");, etc. More descriptive variable names would also help in understanding the code, e.g. sub CountChars uses only one-letter variables.
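As a small sketch of what that could look like: the sub below is a made-up stand-in, not a rewrite of the original CountChars (count_char_code and its variable names are invented here), but it shows named character codes and descriptive variable names together.

```perl
use strict;
use warnings;

# Set once at the top and never changed, so uppercase is appropriate.
my $ASTERISK = ord("*");
my $DOT      = ord(".");

# Hypothetical helper: count how often the character with code
# $wanted_code occurs among the bytes of $text.
sub count_char_code {
    my ($text, $wanted_code) = @_;
    my $count = 0;
    for my $code (unpack 'C*', $text) {
        $count++ if $code == $wanted_code;
    }
    return $count;
}
```

A call like count_char_code($filename, $DOT) then reads much more clearly than a comparison against a bare 46.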

A couple of other thoughts/issues:

  • my $LINUX = (index(uc($^O), 'WIN') < 0) ? 1 : 0; - this will misdetect darwin as Windows (or rather, "not Linux", despite it being a *NIX OS), and there are lots of other OSes that are neither Linux nor Windows. You might just want to stick to checking against "MSWin32" - but as I said above really the best solution here IMO is to leave the handling of filenames to a module.
  • opendir(my $DIR, $PATH) or return; and open $F, "<$FULLNAME" or return; - you might want to consider reporting this to the user instead of silently skipping files/dirs.
  • You might want to have a look at Getopt::Long to be able to handle command-line arguments instead of editing the variables in the source.
  • Update: Instead of a long list of globs in @INCLUDE, you could build a single regex to match against the filenames. See e.g. Building Regex Alternations Dynamically.
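The points above could be sketched roughly as follows. All sub names and the option names --find and --buffer-size are hypothetical, and the glob conversion is deliberately simplified (it only understands * and ?), so take it as a starting point rather than a finished implementation.

```perl
use strict;
use warnings;
use File::Spec;
use Getopt::Long qw(GetOptionsFromArray);

# True only on native Windows Perl; "darwin" and "cygwin" do not match.
sub is_windows { return $^O eq 'MSWin32' }

# Read settings from an argument list instead of editing the source.
sub parse_args {
    my @args = @_;
    my %opt  = (buffer_size => 4096);
    GetOptionsFromArray(\@args,
        'find=s'        => \$opt{find},
        'buffer-size=i' => \$opt{buffer_size},
    ) or die "Usage: $0 --find TEXT [--buffer-size N] DIR...\n";
    $opt{dirs} = \@args;    # whatever is left after the options
    return \%opt;
}

# Turn a list of simple globs into one regex.
# Assumption: only * and ? are special in the globs.
sub globs_to_regex {
    my @globs = @_;
    my @parts = map {
        my $re = quotemeta;       # escape everything first
        $re =~ s/\\\*/.*/g;       # then turn \* back into "any run of characters"
        $re =~ s/\\\?/./g;        # and \? into "any one character"
        $re;
    } @globs;
    my $alternation = join '|', @parts;
    return qr/\A(?:$alternation)\z/s;
}

# List the plain files in $dir whose names match $include_re, and
# tell the user why a directory is skipped instead of staying silent.
sub matching_files {
    my ($dir, $include_re) = @_;
    opendir(my $dh, $dir) or do {
        warn "Skipping $dir: $!\n";
        return;
    };
    my @names = grep { $_ =~ $include_re } readdir $dh;
    closedir $dh;
    # File::Spec builds the path with the right separator for the OS.
    return grep { -f $_ } map { File::Spec->catfile($dir, $_) } sort @names;
}
```

With this, something like parse_args(@ARGV) replaces the variables at the top of the script, and the result of globs_to_regex('*.txt', '*.pl') can be passed to matching_files for each directory.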

Re^2: Perl program to search files
by vr (Curate) on Dec 26, 2018 at 12:12 UTC

    Minor note:

    A typical solution to this kind of problem would be to implement a sliding window

    But he is trying to slide:

    $START += $BUFF_SIZE - length($FIND);

    Unfortunately, it seems he takes the 4th argument to read to be an "offset into file", while, of course, it is an "offset into buffer". If this

    read $F, $BUFFER, $BUFF_SIZE, $START;

    is replaced with

    seek $F, $START, 0; read $F, $BUFFER, $BUFF_SIZE;

    then his sliding window works.
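A short sketch of the difference, using an in-memory filehandle so it can be run without a test file (the variable names are made up for this example):

```perl
use strict;
use warnings;

my $file_contents = "0123456789";
open my $fh, '<', \$file_contents or die "open failed: $!";   # in-memory file

# 4-argument read: OFFSET says where in the *buffer* the bytes are placed.
# The file position is not affected, so reading still starts at byte 0.
my $buffer = '';
read $fh, $buffer, 3, 5;
# $buffer now holds five "\0" padding bytes followed by "012".

# seek moves the file position; a plain 3-argument read then starts there.
seek $fh, 5, 0;               # 0 means "relative to the start of the file"
read $fh, my $window, 3;
# $window now holds "567".
```

So with the 4-argument form the buffer keeps growing while the file is still read sequentially, which is why the overlap in $START never takes effect without the seek.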

      But he is trying to slide

      Ah, you are correct, thank you!
