#!/usr/bin/perl

# Simple program to remove duplicate email messages
# from an mbox file. Only the body of each message is compared
# for uniqueness, not the entire message with its headers.
# There is no file locking, so run this program on a backup
# of your mbox file.
# Enjoy.

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Grab the file names from the command line
# and do some error checking.
my ($from, $to) = @ARGV;
die "usage: $0 from to\n" unless (defined $from && defined $to);
my %uniq;
my $msg = "";
my $i = 0;

$|++;

# Store one complete message, keyed by the MD5 digest of its body.
sub store_msg {
    my ($m) = @_;
    # Since Evolution can give different headers on the same message,
    # only hash the body of the message, and use that to compare to other
    # emails. The entire message is stored in the hash, though.
    # The limit of 2 keeps blank lines inside the body from splitting it
    # into more than two pieces.
    my ($head, $body) = split /\n\n/, $m, 2;
    # Standard Perl technique for removing duplicates, using a hash
    # keyed by MD5 digest. A later duplicate replaces an earlier one.
    $uniq{md5_hex(defined $body ? $body : "")} = $m;
}

open(my $fh, "<", $from) or die "cannot open $from: $!";
while (<$fh>) {
    # Each message in an mbox file begins with a line matching /^From /.
    # When /^From / is matched, process the previous message,
    # then start on this message.
    if (m/^From /) {
        if ($msg ne "") {
            store_msg($msg);
            # Increment the counter for a status report.
            $i++;
            # Print a status report if necessary.
            # I like to do it this way.
            print '.' if (($i % 50) == 0);
            print " $i\n" if (($i % 1000) == 0);
        }

        # Done processing the previous message, start the next message.
        $msg = $_;
    } else {
        # The current line didn't match /^From /, so this line is part
        # of the current message. Just tack it on to the end.
        $msg .= $_;
    }
}
close($fh);

# The last message has no "From " line after it, so store it here.
store_msg($msg) if ($msg ne "");

# Print the results to a file. Messages come out in hash order,
# not in their original mbox order.
open(my $th, ">", $to) or die "cannot open $to: $!";
while (my ($k, $v) = each %uniq) {
    print $th $v;
}
close($th) or die "cannot close $to: $!";