Monks,
I work on Unix, and when our data shares start filling up (around 80%) I run a program that builds a report of suspect "temporary" files (files matching the pattern /\Acore\z|copy|te?mp|bak|\b(?:old|test)|([a-z_])\1\1/i or files older than 2 years) and e-mails it to the department for review.
One category that this script misses entirely is temporary files that were modified within the last 2 years and have a true temporary name (like yacb4yGI6p, YQGsCV6Rbx, and SL8qEfnFDQ).
My approach to finding these (as seen in the testing script below) is to:
- Build a regex from the letter trigrams that occur most often in dictionary words
- Store dictionary words with a length >= 4 in a hash
- Look for all files that:
  - Have a lowercase letter (that's not in the extension)
  - Have an uppercase letter (that's not in the extension)
  - Have a digit
  - Only contain letters and digits and may or may not have an extension
  - Do not contain a word from the dictionary with a length >= 4
  - Do not match the trigram regex
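To make the filename tests concrete, here is a minimal sketch of the character-class checks and the word lookup applied to a few names. The tiny %words hash and the looks_temporary helper are hypothetical stand-ins for illustration only (the real script loads a system dictionary), and the trigram test is left out. The first two patterns use [^.]* so that a letter in the first position also counts, which is an assumption about the intent.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real dictionary hash (words of length >= 4).
my %words = map { $_ => 1 } qw(report backup final draft);

# Character-class tests a name must pass to be a candidate.
my @tests = (
    qr/\A[^.]*[a-z]/,                         # lowercase letter before any extension
    qr/\A[^.]*[A-Z]/,                         # uppercase letter before any extension
    qr/\d/,                                   # at least one digit
    qr/\A[a-zA-Z\d]+(?:\.[a-zA-Z]{1,4})?\z/,  # letters/digits only, optional extension
);

# Hypothetical helper: true if $name passes every test and holds no known word.
sub looks_temporary {
    my ($name) = @_;
    for my $re (@tests) {
        return 0 unless $name =~ $re;
    }
    # Split into word-like runs (Capitalized, lowercase, or ALLCAPS) and look each up.
    for my $token ($name =~ /([A-Za-z][a-z]+|[A-Z]+)/g) {
        return 0 if exists $words{lc $token};
    }
    return 1;
}

for my $name (qw(yacb4yGI6p YQGsCV6Rbx SL8qEfnFDQ Report2009.txt notes.txt)) {
    printf "%-16s %s\n", $name, looks_temporary($name) ? 'candidate' : 'skip';
}
```

With this toy word list, the three random-looking names come out as candidates, Report2009.txt is skipped because it contains "report", and notes.txt is skipped because it has no uppercase letter or digit.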
When I run this on ~1TB of data the results look decent. Some valid files show up, but the bulk are temporaries.
Does anyone have suggestions for improving this process or a new approach to offer? I'm certainly no linguist.
Also, keep in mind:
- This process only generates a report for humans to review; it does not take action and whack files. I believe automatic deletion would be impossible due to the idiosyncrasies of English, the sloppiness of some typists, and the occasional job that has legitimate yet awkward naming schemes or IDs.
- Files named in other languages are very rare for us.
Many thanks.
Code:
use strict;
use warnings;
use File::Find::Rule;
use List::MoreUtils qw(all);
use Number::Format qw(format_number);
use Regexp::Assemble;
my %ngram;
my %dict;
my $total;
### Which dictionary? 'Tis set up for testing at home and work.
my $uname = `uname -a`;
my $dict = $uname =~ /debian/i ? '/usr/share/dict/american-english'
         : $uname =~ /SunOS/i  ? '/usr/share/lib/dict/words'
         :                       undef;
### Gather ngrams.
open my $DICT, '<', $dict or die $!;
while (<$DICT>) {
    chomp;
    ### Only allow words that begin with a lowercase letter,
    ### contain only letters (no hyphens, quotes, etc.),
    ### and have 3 or more letters.
    next unless m/\A[a-z][A-Za-z]+\z/ && length >= 3;
    print "$_\n";
    ### Gather letter trios (ngrams, or, more specifically, trigrams).
    my $str = $_;
    my @ngrams = map {
        substr($str, $_, 3);
    } 0 .. (length $_) - 3;
    ### Tally.
    ++$ngram{$_} for @ngrams;
    ++$total;
    ### Only add 4+ lengths to the dictionary--many temps were matching lengths of 3.
    ++$dict{$_} if length >= 4;
}
print "\n";
print 'Total words: ', format_number($total), "\n";
### Show the results sorted by occurrence and remove those less than 1%.
print "All:\n";
for my $ngram (sort {$ngram{$b} <=> $ngram{$a}} keys %ngram) {
    my $percentage = format_number(($ngram{$ngram} / $total) * 100, 1, 1);
    printf "%3s: %4s (%4s%%)\n", $ngram, format_number($ngram{$ngram}), $percentage;
    delete $ngram{$ngram} if $percentage < 1;
}
print "\n";
print "Keepers:\n";
for my $ngram (sort {$ngram{$b} <=> $ngram{$a}} keys %ngram) {
    my $percentage = format_number(($ngram{$ngram} / $total) * 100, 1, 1);
    printf "%3s: %4s (%4s%%)\n", $ngram, format_number($ngram{$ngram}), $percentage;
}
print "\n";
### Build an RE based on the ngrams.
my $ra = Regexp::Assemble->new;
$ra->add($_) for keys %ngram;
print $ra->re, "\n";
### Files must match these to be considered temporary.
my @REs = (
    ### Lower/upper case letters not in the extension.
    ### [^.]* (not [^.]+) so a letter in the first position also counts.
    qr/\A[^.]*[a-z]/,
    qr/\A[^.]*[A-Z]/,
    ### Digit.
    qr/\d/,
    ### Name only contains upper/lower case letters or digits; ext. optional.
    qr/\A[a-zA-Z\d]+(?:\.[a-zA-Z]{1,4})?\z/,
);
File::Find::Rule->file
    ->exec(
        sub {
            my $file = $_;
            ### Test for REs, words, then ngrams.
            return unless all { $file =~ $_ } @REs;
            for ($file =~ /([A-Za-z][a-z]+|[A-Z]+)/g) {
                if (exists $dict{lc $_}) {
                    print "\tSkipping '$file' due to presence of '$_'\n";
                    return;
                }
            }
            ### Parenthesize lc: "lc $file =~ ..." parses as lc($file =~ ...),
            ### which would match against the un-lowercased name.
            return if lc($file) =~ $ra->re;
            print "$file\n";
        }
    )
    ->in(qw(/data /tmp));