
When you first mentioned this in the chatterbox, you asked whether anyone had suggestions to improve it. So here goes, FWIW.

You are writing your matches to a temporary file and then reading them back in to construct a hash. That means that if you find a substring with more than one occurrence, you end up with an entry in your data file for every packet it turns up in. For instance, say you find that token you mentioned and it's in every line of your (let's say) 500-record file. Then the first entry in the text file says 500 occurrences, the next says 499, and so on down to 2 occurrences. You don't get an entry that says one, but you will have checked for it.
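To put a number on that repeated work, here is a quick back-of-the-envelope sketch. The 500-record file is just my hypothetical from above, and the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $records   = 500;  # hypothetical file size from the example above
my $redundant = 0;

# The first pass scans all 500 records and is genuinely needed. Every
# later pass rescans whatever records are left after the shift, just to
# rediscover the same token: 499 records, then 498, ... down to 1.
for my $remaining (1 .. $records - 1) {
    $redundant += $remaining;
}

print "$redundant\n";  # prints 124750
```

That is the count of per-record scans spent on a single token that was already fully counted on the first pass.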

If, instead of writing to that file, you build the hash as you process the file, then right at the top of the loop you can check whether the substring is already in it. If it is, don't bother checking any further: you've already found all of its matches. For the example I gave, this equates to leaving out 124750 checks (499 + 498 + ... + 1), and that's just for one pattern.

Like this:

do
{
    my $p = $packets[0];
    foreach my $l ($level..length($p))
    {
        foreach my $pos (0..length($p)-$l)
        {
            my $str = substr($p,$pos,$l);

            next if exists $all{$str}; # if we've already found this
                                       # string somewhere else, exit
                                       # this iteration 
            my $num = 0;
            for (0..$#packets)
            {
                if ($l <= length($packets[$_]))
                {
                    pos($packets[$_]) = 0;
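                    # \Q..\E makes $str match literally, so substrings
                    # containing regex metacharacters are counted correctly
                    while ($packets[$_] =~ /\Q$str\E/g)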
                    {
                        $num ++; 
                    }
                }
            }
            # no second exists() test needed: the "next" above already
            # skipped any string that is in %all
            $all{$str} = $num unless $num < $threshold;
        }
    }
    shift(@packets);
} while ($#packets >= 0);
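If you want to try the idea in isolation, here is a minimal self-contained sketch. It is not your actual script: the sub name count_substrings, its argument order, and the three sample packets are all made up by me, and I'm assuming $level is the minimum substring length and $threshold the minimum count worth keeping.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop above: returns a hash of every
# substring of at least $level characters that occurs at least
# $threshold times across the packets.
sub count_substrings {
    my ($packets_ref, $level, $threshold) = @_;
    my @packets = @$packets_ref;   # work on a copy; the loop shifts it
    my %all;

    while (@packets) {
        my $p = $packets[0];
        for my $l ($level .. length($p)) {
            for my $pos (0 .. length($p) - $l) {
                my $str = substr($p, $pos, $l);

                # already counted on an earlier pass, skip the rescan
                next if exists $all{$str};

                my $num = 0;
                for my $packet (@packets) {
                    # \Q..\E matches $str literally; /g steps through
                    # the non-overlapping occurrences
                    $num++ while $packet =~ /\Q$str\E/g;
                }
                $all{$str} = $num if $num >= $threshold;
            }
        }
        shift @packets;
    }
    return %all;
}

# Made-up sample data: 'ab' appears 4 times in total.
my %all = count_substrings([ 'abcab', 'xabx', 'ab.' ], 2, 3);
print "$_ => $all{$_}\n" for sort keys %all;   # prints: ab => 4
```

Note that only strings that reach the threshold are stored, so rare strings still get rescanned on later passes. If that turns out to matter, you could remember them in a second "seen" hash as well, at the cost of more memory.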

Nuance

In reply to RE: Substring Finding/Counting by nuance
in thread Substring Finding/Counting by Guildenstern
