PerlMonks  

Remove Duplicates from a mbox file

by coolmichael (Deacon)
on Sep 23, 2003 at 06:38 UTC ( #293415=perlcraft )

#!/usr/bin/perl

# Simple program to remove duplicate email messages
# from an mbox file. This program only looks at the content
# of the message for uniqueness, not the entire message with the headers.
# There is no file locking, so use this program on a backup
# of your mbox file.
# Enjoy.

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Grab the file names from the command line
# and do some error checking.
my ($from, $to) = @ARGV;
die "usage: $0 from to\n" unless (defined $from && defined $to);
my (%uniq, @order);
my $msg = '';
my $i = 0;

$|++;

# Store one complete message, keyed by the MD5 digest of its body.
sub remember {
	my ($m) = @_;
	# Since Evolution can give different headers on the same message,
	# only hash the body of the message, and use that to compare to other
	# emails. The entire message will be stored in the hash though.
	# The limit of 2 splits at the first blank line only, so the whole
	# body is hashed, not just its first paragraph.
	my ($head, $body) = split /\n\n/, $m, 2;
	# A message with no body is keyed on its full text instead.
	my $key = md5_hex(defined $body ? $body : $m);
	# Standard Perl technique for removing duplicates, using a hash
	# keyed by MD5 digest. @order records the first-seen order.
	push @order, $key unless exists $uniq{$key};
	$uniq{$key} = $m;
}

open(my $fh, '<', $from) or die "cannot open $from: $!";
while (<$fh>) {
	# Emails in mbox files always begin with ^From
	# When /^From / is matched, process the previous message,
	# then start on this message.
	if (m/^From / && $msg ne '') {
		# Increment the counter and print a status report if necessary.
		$i++;
		print '.' if (($i % 50) == 0);
		print " $i\n" if (($i % 1000) == 0);
		remember($msg);
		# Done processing the previous message, start the next message.
		$msg = '';
	}
	# Every line, including the "From " line, belongs to the current message.
	$msg .= $_;
}
# The last message has no "From " line after it, so process it here.
remember($msg) if $msg ne '';
close($fh);

# Print the results to a file, in the order the messages were first seen.
open(my $th, '>', $to) or die "cannot open $to: $!";
print $th $uniq{$_} for @order;
close($th) or die "cannot close $to: $!";
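
The body-only comparison depends on how each message is split at its first blank line. Here is a small sketch of what `split`'s third argument (the limit) does to the part that gets hashed. The two sample messages and the `body_digest` helper are made up for illustration and are not part of the program.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Two hypothetical messages: same first paragraph, different second paragraph.
my $msg_a = "From a\@example.com\nSubject: hi\n\nHello,\n\nfirst ending\n";
my $msg_b = "From b\@example.com\nSubject: hi\n\nHello,\n\nsecond ending\n";

# Digest of whatever split() hands back as the second field.
# A limit of 0 behaves the same as leaving the limit out.
sub body_digest {
	my ($msg, $limit) = @_;
	my ($head, $body) = split /\n\n/, $msg, $limit;
	return md5_hex(defined $body ? $body : '');
}

# No limit: the body is cut at its own first blank line, so only
# "Hello," is hashed and the two messages look identical.
my $same_without_limit = body_digest($msg_a, 0) eq body_digest($msg_b, 0);

# Limit 2: everything after the header block is hashed.
my $same_with_limit = body_digest($msg_a, 2) eq body_digest($msg_b, 2);

print "without limit: ", ($same_without_limit ? "duplicate" : "distinct"), "\n";
print "with limit 2:  ", ($same_with_limit    ? "duplicate" : "distinct"), "\n";
```

Without a limit, two different mails that happen to open with the same paragraph would be treated as duplicates, which matters when the output file replaces the original.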

Re: Remove Duplicates from a mbox file
by Anonymous Monk on Sep 23, 2003 at 22:36 UTC
      I had, but it seemed like a little bit of overkill for what I was doing. And I got to learn a little more Perl doing it.

      --
      negativespace.net - all things inbetween.

        I couldn't get the Perl code above to work right, so I kept searching and found the one on the web site below. It seems to work great! It removed 2400 duplicates from a 200MB mbox file, and it automatically creates a backup for you. www.wdr1.com/hacks/mbox-dedup.pl
      That md5/hash trick is pretty cool. Here's a better version of the program (could be cleaner, but it works afaik). Using the Message-Id: would be faster, but then we wouldn't need the md5. :)
      #!/usr/bin/perl

      # Simple program to remove duplicate email messages
      # from an mbox file. This program only looks at the content
      # of the message for uniqueness, not the entire message with the headers.
      # There is no file locking, so use this program on a backup
      # of your mbox file.
      # Enjoy.

      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);

      # Grab the file names from the program parameters
      # and do some error checking.
      my $from = shift @ARGV;
      my $keep = shift @ARGV;
      my $junk = shift @ARGV;
      if ( $#ARGV != -1 || ! defined $junk ) {
          print STDERR "usage: $0 original clean junk\n";
          exit(-1);
      }
      my (%uniq, $msg);
      my $i = 0;
      my $dups = 0;
      my $nulls = 0;

      $|++;

      open(my $IN, '<', $from) or die "cannot open $from: $!";
      open(my $KEEP, '>', $keep) or die "cannot open $keep: $!";
      open(my $JUNK, '>', $junk) or die "cannot open $junk: $!";

      # Write one complete message to either the keep file or the junk file.
      sub file_message {
          my ($m) = @_;
          # Increment the counter and print a status report if necessary.
          $i++;
          print '.' if (($i % 50) == 0);
          if (($i % 1000) == 0) {
              print " $i, $dups duplicates, $nulls null messages found\n";
          }
          # Since Evolution can give different headers on the same message,
          # only hash the body of the message, and use that to compare to
          # other emails.
          my ($head, $body) = split /\n\n/, $m, 2;
          # Standard Perl technique for removing duplicates, using hashes
          # and MD5 digests.
          if ( ! defined $body ) {
              $nulls++;
              print $JUNK $m;
          } else {
              my $md5 = md5_hex($body);
              if ( !defined $uniq{$md5} ) {
                  $uniq{$md5} = 1;
                  print $KEEP $m;
              } else {
                  print $JUNK $m;
                  $dups++;
              }
          }
      }

      while (<$IN>) {
          # Emails in mbox files always begin with ^From
          # When /^From / is matched, process the previous message,
          # then start on this message.
          if (m/^From /) {
              file_message($msg) if (defined $msg && $msg ne "");
              $msg = $_;
          } else {
              # The current line didn't match /^From /, so it is part of
              # the current message. Just tack it on to the end.
              $msg .= $_;
          }
      }
      # The last message has no "From " line after it, so process it here.
      file_message($msg) if (defined $msg && $msg ne "");
      print "Done, $i messages, $dups duplicates, $nulls nulls\n";
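
For comparison, the Message-Id idea mentioned above could look roughly like the following. This is only a sketch under assumptions: `message_id` and `dedup_by_message_id` are hypothetical helpers, the sample mails are invented, and it assumes a Message-Id header identifies a message well enough, which is not guaranteed (some mail has none, and broken senders can reuse one).

```perl
use strict;
use warnings;

# Pull the Message-Id value out of one message's header block.
# Returns undef when the message has no such header.
sub message_id {
	my ($msg) = @_;
	my ($head) = split /\n\n/, $msg, 2;
	return undef unless defined $head;
	# Join folded header lines (continuations start with whitespace).
	$head =~ s/\n[ \t]+/ /g;
	return $head =~ /^Message-Id:\s*(\S+)/mi ? $1 : undef;
}

# Keep the first message seen for each Message-Id. Messages without
# a Message-Id are always kept, since there is nothing to compare.
sub dedup_by_message_id {
	my @messages = @_;
	my (%seen, @keep);
	for my $msg (@messages) {
		my $id = message_id($msg);
		next if defined $id && $seen{$id}++;
		push @keep, $msg;
	}
	return @keep;
}

# Invented sample data: the second mail repeats the first one's Message-Id.
my @mail = (
	"From a\nMessage-Id: <1\@example.com>\nSubject: x\n\nbody one\n",
	"From b\nMessage-ID: <1\@example.com>\nSubject: x (copy)\n\nbody one\n",
	"From c\nMessage-Id: <2\@example.com>\n\nbody two\n",
);
my @kept = dedup_by_message_id(@mail);
print scalar(@kept), " of ", scalar(@mail), " messages kept\n";
```

Unlike the body digest, this never reads past the headers, but it will miss copies of a mail that were assigned different Message-Ids.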
