in reply to Perl program to search files
if you see some obvious bugs in my program, would you please tell me?
Unfortunately your algorithm suffers from an issue: when the text to be found happens to be split across two buffers, the match is not found. To demonstrate, try setting $BUFF_SIZE = 4096, and then generating a test file with e.g. perl -e 'print "x"x4095, ".\nThis is", "x"x5000' >test.txt - the match won't be found, but if you either set $BUFF_SIZE = 4095, or change the "x"x4095 to "x"x4096, the match is found. A typical solution to this kind of problem is to implement a sliding window (Update 2: see reply by vr below), making sure that the buffers are large enough that the search string fits in them. Alternatively, if the files are always small enough to fit into memory, it's of course possible to just load the entire file at once. You might want to have a look at these nodes, for example: Matching in huge files and Re: Search hex string in vary large binary file.
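To illustrate the sliding window idea, here is a minimal sketch, not a drop-in replacement for your code. The sub name find_in_file and its arguments are made up for this example. It keeps the last length($needle)-1 bytes of the previous window in front of each new read, so a match that straddles a buffer boundary is still seen by index:

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions for this sketch):
# returns the byte offset of the first occurrence of $needle in the file
# $path, or -1 if it does not occur.
sub find_in_file {
    my ($path, $needle, $buff_size) = @_;
    die "Search string must not be empty\n" unless length $needle;
    die "Buffer size must be positive\n"    unless $buff_size > 0;

    # Number of bytes to carry over from one window to the next.
    my $overlap = length($needle) - 1;

    open my $fh, '<:raw', $path or die "Can't open $path: $!\n";
    my $window = '';
    my $base   = 0;    # file offset of the first byte in $window

    while (1) {
        # Append the next chunk to whatever was kept from the last round.
        my $got = read($fh, $window, $buff_size, length $window);
        die "Read error on $path: $!\n" unless defined $got;
        last if $got == 0;    # end of file

        my $pos = index($window, $needle);
        return $base + $pos if $pos >= 0;

        # Keep only the tail that could still be the start of a match.
        if (length($window) > $overlap) {
            $base  += length($window) - $overlap;
            $window = $overlap ? substr($window, -$overlap) : '';
        }
    }
    return -1;
}
```

A call might look like find_in_file('test.txt', 'This is', 4096). Because each read is appended to the kept tail, the window is always at least as long as the search string once enough data has been read, whatever the buffer size.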
Can you make suggestions on how I could improve this program or what I could have done to make this better?
As Athanasius already mentioned, there are some things you're re-implementing. This is fine as a learning exercise, but I think it also helps to be aware of these kinds of things. It also gives you something to test your own implementations against (which I would recommend doing).
- sub CountChars: How can I count the number of occurrences of a substring within a string?
- sub RSPACE and sub PRINT: sprintf/printf (as mentioned by Athanasius)
- sub AddSeparator and sub FormatPath: the core module File::Spec, in this case especially catdir and catfile (I prefer File::Spec::Functions; there's also the CPAN module Path::Class)
- sub isMatch: maybe Text::Glob (there are a few others on CPAN)
- sub CheckDIR: the core module File::Find (or maybe File::Find::Rule from CPAN)
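As a rough sketch of what some of these replacements can look like, using core modules only: the sub names below are invented for this example, and I'm assuming that RSPACE pads a string on the right with spaces, which you'd want to check against your own implementation.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(catfile);
use File::Find;

# Count non-overlapping occurrences of a substring: the match is
# evaluated in list context and the length of that list is assigned.
sub count_substr {
    my ($string, $substr) = @_;
    my $count = () = $string =~ /\Q$substr\E/g;
    return $count;
}

# Left-justify $text in a field of $width characters. Note that sprintf
# pads but does not truncate text that is longer than $width.
sub pad_right {
    my ($text, $width) = @_;
    return sprintf '%-*s', $width, $text;
}

# Recursively collect all plain files below $top using File::Find.
sub list_files {
    my ($top) = @_;
    my @files;
    find({
        wanted   => sub { push @files, $File::Find::name if -f },
        no_chdir => 1,    # leave the cwd alone; $_ is then the full path
    }, $top);
    my @sorted = sort @files;
    return @sorted;
}

# catfile joins path components with the separator of the current OS.
my $example_path = catfile('some', 'dir', 'file.txt');
```

Running your own subs and these side by side on the same input is an easy way to test your implementations.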
In addition, I agree that variable names in all uppercase should be reserved for constants. IMO that includes variables that get set once at the top of the program and aren't supposed to be changed, so e.g. $LINUX is fine, but $START is not a good choice. I also agree that magic numbers aren't good, although I would take it a step further, e.g. my $ASTERISK = ord("*");, my $DOT = ord(".");, etc. Some more descriptive variable names would help in understanding the code as well, e.g. sub CountChars uses nothing but one-letter variables.
A couple of other thoughts/issues:
- my $LINUX = (index(uc($^O), 'WIN') < 0) ? 1 : 0; - this will misdetect darwin as Windows (or rather, "not Linux", despite it being a *NIX OS), because uc("darwin") contains "WIN", and there are lots of other OSes that are neither Linux nor Windows. You might just want to stick to checking against "MSWin32" - but as I said above, really the best solution here IMO is to leave the handling of filenames to a module.
- opendir(my $DIR, $PATH) or return; and open $F, "<$FULLNAME" or return; - you might want to consider reporting this to the user instead of silently skipping files/dirs.
- You might want to have a look at Getopt::Long to be able to handle command-line arguments instead of editing the variables in the source.
- Update: Instead of a long list of globs in @INCLUDE, you could build a single regex to match against the filenames. See e.g. Building Regex Alternations Dynamically.
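A few of these points could be sketched roughly as follows. The option names (--path, --search, --buffsize, --include) and the sub names are invented for illustration, and I've simplified the globs down to plain filename suffixes, so treat this as a starting point under those assumptions:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Compare against the exact OS name instead of searching for "WIN".
my $IS_WINDOWS = $^O eq 'MSWin32';

# Parse a list of command-line arguments into a hash of options.
# Called as parse_args(@ARGV) in a real script.
sub parse_args {
    my @args = @_;
    my %opt  = (path => '.', buffsize => 4096);    # assumed defaults
    GetOptionsFromArray(\@args, \%opt,
        'path=s', 'search=s', 'buffsize=i', 'include=s@')
        or die "Usage: $0 [--path DIR] [--search TEXT] "
             . "[--buffsize N] [--include SUFFIX]...\n";
    $opt{include} ||= [];
    return \%opt;
}

# Build one case-insensitive regex matching any of the given suffixes.
# Longer alternatives are sorted first and everything is quotemeta'd.
sub build_suffix_regex {
    my @suffixes = @_;
    return qr/(?!)/ unless @suffixes;    # empty list: match nothing
    my $alternation = join '|', map { quotemeta }
        sort { length $b <=> length $a or $a cmp $b } @suffixes;
    return qr/\.(?:$alternation)\z/i;
}

# Report directories that can't be read instead of skipping them silently.
sub list_dir {
    my ($path) = @_;
    opendir(my $dh, $path)
        or do { warn "Skipping directory $path: $!\n"; return; };
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @entries;
}
```

With this, something like my $re = build_suffix_regex(@{ $opt->{include} }); followed by a single $name =~ $re test per file takes the place of looping over a list of globs.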