When you first mentioned this in the chatterbox, you
asked if anyone had suggestions to improve it,
so here goes, FWIW.
You are writing your matches to a temporary file
and then reading them back in to construct a hash.
That means that if you find any substring with more
than one occurrence, you end up with entries in
your data file for all of them. For instance, suppose
you find that token you mentioned and it's in every
line of a, let's say, 500-record file. Then the
first mention in the text file says 500
occurrences, the next says 499, and so on down to 2
occurrences. You don't get an entry that says one,
but you will have checked for it.
If instead of writing to that file you build
the hash as you process the file, then right at the
top you can just check whether the substring already
exists. If it does, don't bother checking any further:
you've already found all of its matches. For the
example I gave, this equates to leaving out 124750
checks, and that's just for one pattern.
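To sanity-check that number (assuming, as in the example, that the token occurs once in each of the 500 records): with the file-based approach the 2nd through 500th discoveries of the same substring each trigger a fresh count over the remaining records, so the wasted passes are 499 + 498 + ... + 1:

```perl
use strict;
use warnings;

# Redundant checks avoided for one pattern in a 500-record file:
# every rediscovery after the first re-counts the remaining records.
my $saved = 0;
$saved += $_ for 1 .. 499;    # 499 + 498 + ... + 1
print "$saved\n";             # prints 124750
```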
The main loop would look like this:
do
{
    my $p = $packets[0];
    foreach my $l ($level..length($p))
    {
        foreach my $pos (0..length($p)-$l)
        {
            my $str = substr($p,$pos,$l);
            next if exists $all{$str}; # if we've already found this
                                       # string somewhere else, skip
                                       # this iteration
            my $num = 0;
            for (0..$#packets)
            {
                if ($l <= length($packets[$_]))
                {
                    pos($packets[$_]) = 0;
                    while ($packets[$_] =~ /\Q$str\E/g) # \Q...\E keeps
                    {                                   # $str literal
                        $num++;
                    }
                }
            }
            # the 'next' above guarantees $str isn't in %all yet
            $all{$str} = $num unless $num < $threshold;
        }
    }
    shift(@packets);
} while ($#packets >= 0);
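One thing to watch with matches of the /$str/g style: any regex metacharacters in the packet data act as pattern syntax unless the string is quoted with \Q...\E. A quick illustration:

```perl
use strict;
use warnings;

my $str = 'a.c';    # '.' is a regex metacharacter

# bare interpolation: '.' matches any character, so 'abc' matches
print 'abc' =~ /$str/     ? "matched\n" : "no match\n";   # matched

# \Q...\E quotes the metacharacters, forcing a literal 'a.c'
print 'abc' =~ /\Q$str\E/ ? "matched\n" : "no match\n";   # no match
```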
Nuance