if you see some obvious bugs in my program, would you please tell me?
Unfortunately, your algorithm has a bug: when the text to be found is split across two buffers, the match is not found. To demonstrate, set $BUFF_SIZE = 4096 and generate a test file with e.g. perl -e 'print "x"x4095, ".\nThis is", "x"x5000' >test.txt - the match won't be found. If you instead set $BUFF_SIZE = 4095, or change the "x"x4095 to "x"x4096, the match is found. A typical solution to this kind of problem is to implement a sliding window (Update 2: see reply by vr below), making sure that the buffers are large enough to contain the search string. Alternatively, if the files are always small enough to fit into memory, it's of course possible to just load the entire file into memory at once. You might want to have a look at these nodes, for example: Matching in huge files and Re: Search hex string in vary large binary file.
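To illustrate the sliding-window idea, here is a minimal sketch. It is not your code and not vr's solution. The name find_in_file and its parameters are made up for this example, and it assumes a plain fixed-string search on raw bytes. The trick is to carry the last length($needle) - 1 bytes of each buffer over into the next read, so a match that straddles a buffer boundary still ends up in one piece.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original program): returns the byte
# offset of the first occurrence of $needle in the file $path, or -1.
sub find_in_file {
    my ($path, $needle, $buff_size) = @_;
    die "Empty search string\n" unless length $needle;

    # The buffer must be able to hold the whole search string.
    $buff_size = length($needle) if $buff_size < length($needle);

    # Number of bytes carried over from one read to the next.
    my $overlap = length($needle) - 1;

    open my $fh, '<:raw', $path or die "Can't open $path: $!\n";

    my $window = '';   # carried-over tail plus the newest chunk
    my $base   = 0;    # file offset of the first byte in $window

    while (1) {
        my $got = read($fh, my $chunk, $buff_size);
        die "Read error on $path: $!\n" unless defined $got;
        last if $got == 0;   # end of file

        $window .= $chunk;
        my $pos = index($window, $needle);
        return $base + $pos if $pos >= 0;

        # Keep only the tail that could still be the start of a match.
        if (length($window) > $overlap) {
            $base  += length($window) - $overlap;
            $window = $overlap ? substr($window, -$overlap) : '';
        }
    }
    close $fh;
    return -1;
}
```

Because the carried-over tail is one byte shorter than the search string, it can never contain a complete match by itself, so no match is reported twice. A regex search would need a different overlap, since the maximum match length is then not known in advance.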
Can you make suggestions on how I could improve this program or what I could have done to make this better?
As Athanasius already mentioned, there are some things you're re-implementing. This is fine as a learning exercise, but I think it also helps to be aware of these kinds of things. It also gives you something to test your own implementations against (which I would recommend doing).
In addition, I agree that variable names in all uppercase should be reserved for constants. IMO that includes variables that are set once at the top of the program and aren't supposed to be changed, so e.g. $LINUX is fine, but $START is not a good choice. I also agree that magic numbers aren't good, although I would take it a step further, e.g. my $ASTERISK = ord("*");, my $DOT = ord(".");, etc. More descriptive variable names would also help in understanding the code, e.g. sub CountChars uses only one-letter variables.
A couple of other thoughts/issues:
- my $LINUX = (index(uc($^O), 'WIN') < 0) ? 1 : 0; - this will misdetect darwin as Windows (or rather, "not Linux", despite it being a *NIX OS), and there are lots of other OSes that are neither Linux nor Windows. You might just want to stick to checking against "MSWin32" - but as I said above, the best solution here IMO is really to leave the handling of filenames to a module.
- opendir(my $DIR, $PATH) or return; and open $F, "<$FULLNAME" or return; - you might want to consider reporting these failures to the user (e.g. with warn and $!) instead of silently skipping files/dirs.
- You might want to have a look at Getopt::Long to be able to handle command-line arguments instead of editing the variables in the source.
- Update: Instead of a long list of globs in @INCLUDE, you could build a single regex to match against the filenames. See e.g. Building Regex Alternations Dynamically.
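The points in the list above could be combined roughly like this. This is only a sketch under assumptions: the option names --path and --ext, the sub names, and the idea that your @INCLUDE list boils down to a set of file extensions are all mine, not taken from your program.

```perl
use strict;
use warnings;
use File::Spec;
use Getopt::Long qw(GetOptionsFromArray);

# Check for Windows explicitly instead of searching $^O for "WIN"
# (which also matches "darwin").
my $IS_WINDOWS = $^O eq 'MSWin32';

# Hypothetical options: --path DIR and a repeatable --ext EXT.
sub parse_args {
    my @args = @_;
    my %opt  = (path => '.', ext => []);
    GetOptionsFromArray(\@args,
        'path=s' => \$opt{path},
        'ext=s@' => $opt{ext},
    ) or die "Usage: $0 [--path DIR] [--ext EXT ...]\n";
    return \%opt;
}

# Build one case-insensitive regex from a list of extensions,
# e.g. qw(txt pl) gives a pattern matching names ending in .txt or .pl
sub build_ext_regex {
    my @exts = @_;
    die "No extensions given\n" unless @exts;
    my $alternation = join '|', map { quotemeta } @exts;
    return qr/\.(?:$alternation)\z/i;
}

# Return the full names of the plain files in $dir that match $regex.
# A directory that can't be opened is reported, not silently skipped.
sub list_matching {
    my ($dir, $regex) = @_;
    opendir(my $dh, $dir) or do {
        warn "Can't open directory $dir: $!\n";
        return;
    };
    my @names = sort grep { $_ =~ $regex } readdir $dh;
    closedir $dh;
    # File::Spec picks the right path separator for the current OS.
    return grep { -f } map { File::Spec->catfile($dir, $_) } @names;
}
```

A call such as parse_args(@ARGV) followed by list_matching($opt->{path}, build_ext_regex(@{ $opt->{ext} })) would then replace editing the variables in the source. With File::Spec doing the path handling, $IS_WINDOWS is only needed if something else in the program really depends on the OS.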