PerlMonks  

Remove Duplicates from a mbox file

by coolmichael (Deacon)
on Sep 23, 2003 at 06:38 UTC ( #293415=perlcraft )

#!/usr/bin/perl

# Simple program to remove duplicate email messages
# from an mbox file. This program only looks at the content
# of the message for uniqueness, not the entire message with the headers.
# There is no file locking; use this program on a backup
# of your mbox file.
# Enjoy.

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# grab file names from the program parameters
# and do some error checking.
my ($from, $to) = @ARGV;
die "usage: $0 from to\n" unless (defined $from && defined $to);
my %uniq;
my $msg = "";
my $i = 0;

$|++;

# hash the body of one complete message and remember the message.
sub remember {
	my ($m) = @_;
	# since evolution can give different headers on the same message,
	# only hash the body of the message, and use that to compare to other
	# emails. The entire message will be stored in the hash though.
	# The limit of 2 splits at the first blank line only, so the whole
	# body is hashed, not just its first paragraph.
	my (undef, $body) = split /\n\n/, $m, 2;
	# a message with no body is hashed as an empty body.
	$body = "" unless defined $body;
	# standard perl technique for removing duplicates, using hashes and
	# md5 digests.
	$uniq{md5_hex($body)} = $m;
}

open(my $fh, '<', $from) or die "cannot open $from: $!";
while (<$fh>) {
	# emails in mbox files always begin with ^From
	# when /^From / is matched, process the previous message
	# then start on this message
	if (m/^From /) {
		if ($msg ne "") {
			remember($msg);
			# increment the counter and print a status report if necessary.
			# I like to do it this way
			$i++;
			print '.' if (($i % 50) == 0);
			print " $i\n" if (($i % 1000) == 0);
		}
		# done processing the previous message, start the next message
		$msg = $_;
	} else {
		# current line didn't match /^From / so this line is part of the
		# middle of the current message. Just tack it on to the end.
		$msg .= $_;
	}
}
close($fh);
# the last message has no "From " line after it, so process it here.
remember($msg) if ($msg ne "");

# print the results to a file. Messages come out in hash order,
# not in their original order.
open(my $th, '>', $to) or die "cannot open $to: $!";
print $th $_ for values %uniq;
close($th) or die "cannot close $to: $!";
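
The body-only comparison described in the comments can be tried on its own. What follows is a minimal sketch, not part of the original post: body_digest is a hypothetical helper name, and the two sample messages (addresses, dates, X-Evolution values) are made up for illustration. It shows two messages with different headers but the same body producing the same digest.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: return the MD5 hex digest of a message body,
# or undef if no blank line separates headers from body.
sub body_digest {
    my ($msg) = @_;
    # Limit of 2: split at the first blank line only, so the body
    # keeps any later blank lines between its own paragraphs.
    my (undef, $body) = split /\n\n/, $msg, 2;
    return defined $body ? md5_hex($body) : undef;
}

# Made-up sample data: same body, different headers.
my $first = join "\n",
    'From alice@example.com Tue Sep 23 06:38:00 2003',
    'X-Evolution: 0001-0000', '', 'Hello', '', 'World', '';
my $second = join "\n",
    'From alice@example.com Tue Sep 23 06:40:00 2003',
    'X-Evolution: 0002-0010', '', 'Hello', '', 'World', '';

# prints "duplicate"
print body_digest($first) eq body_digest($second) ? "duplicate\n" : "distinct\n";
```

Without the limit of 2, only the text up to the second blank line would be hashed, so two messages that differ only in a later paragraph would be treated as duplicates.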

Re: Remove Duplicates from a mbox file
by Anonymous Monk on Sep 23, 2003 at 22:36 UTC
      I had, but it seemed like a little bit of overkill for what I was doing. And I got to learn a little more Perl doing it.

      --
      negativespace.net - all things inbetween.

        I couldn't get the Perl code above to work right, so I kept searching and found the one on the web site below. It seems to work great! It removed 2400 duplicates from a 200MB mbox file. It also automatically creates a backup for you. www.wdr1.com/hacks/mbox-dedup.pl
      That md5/hash trick is pretty cool. Here's a better version of the program (could be cleaner but it works afaik). Using the Message-Id: would be faster but then we wouldn't need the md5. :)
      #!/usr/bin/perl

      # Simple program to remove duplicate email messages
      # from an mbox file. This program only looks at the content
      # of the message for uniqueness, not entire message with the headers.
      # There is no file locking, use this program on a backup
      # of your mbox file.
      # Enjoy.

      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);

      #grab file names from the program parameters.
      #and do some error checking.
      my $from = shift @ARGV;
      my $keep = shift @ARGV;
      my $junk = shift @ARGV;
      if ( $#ARGV != -1 || ! defined $junk ) {
          print STDERR "usage: $0 original clean junk\n";
          exit(-1);
      }
      my (%uniq, $msg);
      my $i = 0;
      my $dups = 0;
      my $nulls = 0;

      $|++;

      open (my $IN, '<', $from) or die "cannot open $from: $!";
      open (my $KEEP, '>', $keep) or die "cannot open $keep: $!";
      open (my $JUNK, '>', $junk) or die "cannot open $junk: $!";

      #file the complete message in $msg under keep or junk
      sub process {
          #increment the counter for a status report
          $i++;
          #print a status report if necessary.
          #I like to do it this way
          print '.' if(($i % 50) == 0);
          if(($i % 1000) == 0) {
              print " $i, $dups duplicates, $nulls null messages found\n"
          }
          #since evolution can give different headers on the same message,
          #only hash the body of the message, and use that to compare to other
          #emails.
          my ($head, $body) = split /\n\n/, $msg, 2;
          #standard perl technique for removing duplicates, using hashes and
          #md5 digests.
          if ( ! defined $body ) {
              $nulls++;
              print $JUNK $msg;
          } else {
              my $md5 = md5_hex($body);
              if ( !defined $uniq{$md5} ) {
                  $uniq{$md5} = 1;
                  print $KEEP $msg;
              } else {
                  print $JUNK $msg;
                  $dups++;
              }
          }
      }

      while(<$IN>) {
          #emails in mbox files always begin with ^From
          #when /^From / is matched, process the previous message
          #then start on this message
          if(m/^From /) {
              process() if (defined $msg && $msg ne "");
              #done processing the previous message, start the next message
              $msg = $_;
          } else {
              #current line didn't match /^From / so this line is part of the
              #middle of the current message. Just tack it on to the end.
              $msg .= $_;
          }
      }
      #the last message has no "From " line after it, so file it here
      process() if (defined $msg && $msg ne "");
      print "Done, $i messages, $dups duplicates, $nulls nulls\n";
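
The Message-Id idea mentioned in the reply above could look roughly like the sketch below. It is not from the thread: message_id and dedupe_by_message_id are hypothetical names, and keeping messages that have no Message-Id header is an assumption, not something the reply specifies. Note the trade-off: this treats any two messages with the same Message-Id as duplicates even if their bodies differ.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the Message-Id from the header block,
# or return undef if the header is absent.
sub message_id {
    my ($msg) = @_;
    # Only look at the headers, i.e. the text before the first blank line.
    my ($head) = split /\n\n/, $msg, 2;
    return undef unless defined $head;
    # Unfold continuation lines (a newline followed by a space or tab).
    $head =~ s/\n[ \t]+/ /g;
    return $head =~ /^Message-Id:\s*(\S+)/mi ? $1 : undef;
}

# Hypothetical helper: return the messages with repeats removed,
# in their original order, keeping the first copy of each Message-Id.
sub dedupe_by_message_id {
    my @messages = @_;
    my (%seen, @keep);
    for my $msg (@messages) {
        my $id = message_id($msg);
        # Assumption: a message with no Message-Id is kept, not discarded.
        next if defined $id && $seen{$id}++;
        push @keep, $msg;
    }
    return @keep;
}
```

Each element passed to dedupe_by_message_id is expected to be one complete message, starting at its "From " line, as accumulated in $msg by the programs above.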
