Organising mbox into threads?

by LoonyPandora (Novice)
on May 22, 2006 at 16:26 UTC (#550971=perlquestion)
LoonyPandora has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Here is a question that has been plaguing me for a while... It's a bit long, so I apologise in advance. My eternal gratefulness if anyone can help :)

I have a fairly large mbox file of around 105,000 messages, and 350 MB in size. This is a backup of a Yahoo! groups message board that I created previously with my own code. It is well formed and otherwise in good condition - complete with message-id, and in-reply-to headers for 99.5% of the messages.

I am looking to get these messages into a Bulletin Board - The most important thing I need to do is organize the messages into threads, rather than the flat listing that Yahoo! groups provides.

In order to get these messages into my bulletin board, I need to arrange the messages into 1 mbox file per thread, or one directory per thread. I've looked at the Mail::Box modules (which seem a bit complex for my puny mind...) and Mail::MboxParser (which seems to just parse mbox files, which I've already done successfully).

Any code that I have come up with is very slow, as it basically runs a nested loop: it gets the message-id for the first message, then loops through each message in the file again checking to see if there is a matching 'in-reply-to' header, then writes them both out to a new file. This means I am processing 105k*105k messages, which takes WAY too long to do its work.

I've tried creating a hash of message-ids, and then adding any message with a matching 'in-reply-to' header to the relevant key. That is faster, but it has a flaw: if a message is a reply to the 2nd or 3rd message in a thread, there is no relevant key to add it to, because that message isn't the start of a thread. This means I can only process one level deep.

All other methods I have thought of have the same flaw as the hash method described above. I am unable to think of a way to do this without processing 105k*105k messages.

My question distils to this really: Is there any way to arrange this mbox file into threads organised by message-id WITHOUT looping through it 105k*105k times?

If I'm asking the impossible, I could post my existing code and see if it is possible to optimise it and make it run any quicker, if anyone would be willing to help with that.

Many Thanks,

Replies are listed 'Best First'.
Re: Organising mbox into threads?
by Fletch (Chancellor) on May 22, 2006 at 17:04 UTC
Re: Organising mbox into threads?
by ruzam (Curate) on May 22, 2006 at 18:00 UTC
    The problem seems to be that you have a linked list, but as it exists it only links backwards (each reply points to its parent). I would start by scanning the file and building a hash as you started, but put more in the hash.
        $hash{$msgid} = {
            filepos => $msgstart,
            inreply => $msgreply,
            nextmsg => [],    # reference to an empty array
        };
    Then I'd scan the hash keys and build the forward links
        my @heads;
        while (my ($key, $value) = each %hash) {
            # does this msg have a parent that is present in the hash?
            my $inreply = $value->{inreply};
            my $parent  = defined $inreply ? $hash{$inreply} : undef;
            if ($parent) {
                # link the parent to this message
                push @{ $parent->{nextmsg} }, $key;
            }
            else {
                # no parent so it must become a thread start
                push @heads, $key;
            }
        }
    Finally, I'd loop through @heads to get the message threads.
        foreach my $msgid (@heads) {
            # do whatever you have to do to start a new thread,
            # then recursively follow the messages
            thread_msgs($msgid);
        }

        # recursively read through the thread, starting from a message id
        sub thread_msgs {
            my $msgid = shift;
            my $msg   = $hash{$msgid};
            # use seek() to position the mbox file at the message start
            # using $msg->{filepos}, then copy to the end of the message
            foreach my $reply (@{ $msg->{nextmsg} }) {
                thread_msgs($reply);
            }
        }
    Note that this preserves the reply structure of each thread, but it doesn't preserve the original order of the messages. You may want to sort by message id when it comes time to copy the messages. (Also note that I haven't tested this!)
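
    The scanning pass that the first step relies on is not shown above. A minimal sketch of it might look like the following, assuming messages are separated by "From " lines and that the Message-ID and In-Reply-To headers each fit on one line. The name index_mbox and the extra seq field are illustrative additions, not part of the reply.

```perl
use strict;
use warnings;

# Build the index in a single pass over an mbox filehandle.
# Assumptions (not from the original post): messages are separated by
# "From " lines, and Message-ID / In-Reply-To are not folded across lines.
sub index_mbox {
    my ($fh) = @_;
    my (%index, @order);
    my ($start, $in_headers, $id, $parent);

    # Store the message collected so far, if it had a Message-ID.
    my $flush = sub {
        return unless defined $start && defined $id;
        $index{$id} = {
            filepos => $start,
            inreply => $parent,
            nextmsg => [],
            seq     => scalar @order,    # original position in the mbox
        };
        push @order, $id;
    };

    while (1) {
        my $pos  = tell $fh;             # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;

        if ($line =~ /^From /) {         # mbox separator: a new message starts
            $flush->();
            ($start, $in_headers, $id, $parent) = ($pos, 1, undef, undef);
        }
        elsif ($in_headers) {
            if    ($line =~ /^\r?$/)                       { $in_headers = 0 }
            elsif ($line =~ /^Message-ID:\s*(<[^>]+>)/i)   { $id     = $1 }
            elsif ($line =~ /^In-Reply-To:.*?(<[^>]+>)/i)  { $parent = $1 }
        }
    }
    $flush->();    # do not forget the last message in the file
    return (\%index, \@order);
}
```

    Because each entry records its original position in seq, the replies in nextmsg can be sorted with something like sort { $index->{$a}{seq} <=> $index->{$b}{seq} } before copying, which keeps messages in mbox order within a thread.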
Re: Organising mbox into threads?
by dsheroh (Prior) on May 22, 2006 at 19:44 UTC
    Have you tried a hash with one entry for each message which has the message-id as the key and in-reply-to (or null) as the value? (This differs from my understanding of your original hash attempt in that it sounds like you were trying to create hash entries only for heads of threads rather than for every message.)

    Once you have this hash, you can then (relatively) quickly identify which messages go with which heads:

    • $hash{A} is null (or a message-id which isn't in the hash), so it's a thread head
    • $hash{B} is A, and $hash{A} is a head, so it's in A's thread
    • $hash{C} is B, but B isn't a head, so look at $hash{B}, which is A; A is a head, so C is in A's thread
    • $hash{D} is C, but C isn't a head, so...
    A touch of recursion solves that neatly with just a few hash lookups instead of rescanning the mbox. If it's not fast enough for you, though, you can also easily set up a hash where $hash2{message-id} = (message-id of the thread's head), so that you can, when you get to D, just look up $hash2{C} instead of $hash{C}, then $hash{B}, then $hash{A}.

    Once you've identified the head of the thread that each message is in, you can then build the hash you originally attempted, mapping each head to an array of messages in that thread.
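
    The lookup described above, including the second hash that caches each message's thread head, could be sketched roughly as follows. It is written as a loop rather than with recursion so that very long reply chains do not recurse deeply. The names find_head and group_by_thread are hypothetical, and the check for looping in-reply-to chains is an added safeguard rather than something the reply mentions.

```perl
use strict;
use warnings;

# $parent_of: message-id => in-reply-to value (undef for none).
# $head_of:   the second hash, message-id => message-id of its thread head.
sub find_head {
    my ($id, $parent_of, $head_of) = @_;
    my (@path, %seen);

    # Walk up the reply chain until we reach a cached answer, a message
    # whose parent is missing from the hash, or a loop in the headers.
    while (!exists $head_of->{$id}) {
        my $parent = $parent_of->{$id};
        last if !defined $parent || !exists $parent_of->{$parent};
        last if $seen{$id}++;
        push @path, $id;
        $id = $parent;
    }
    my $head = exists $head_of->{$id} ? $head_of->{$id} : $id;

    # Cache the head for every message visited on the way up.
    $head_of->{$_} = $head for @path, $id;
    return $head;
}

# Map each thread head to the list of message-ids in its thread.
sub group_by_thread {
    my ($parent_of) = @_;
    my (%head_of, %threads);
    for my $id (sort keys %$parent_of) {
        push @{ $threads{ find_head($id, $parent_of, \%head_of) } }, $id;
    }
    return \%threads;
}
```

    With the cache in place, each message's chain is walked at most once, so the total work grows roughly with the number of messages rather than with its square.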

Re: Organising mbox into threads?
by parv (Priest) on May 23, 2006 at 04:43 UTC

    I was going to suggest keeping track of message IDs in a hash, with each reply being a value for the parent message. While working out how to extract the "Message-ID:" & "In-Reply-To:" values, I saw " $obj->threads([FOLDERS], OPTIONS)" in the "Mail::Box::Manager" pod.

    So, try the following & see if it makes any difference ...

        #!perl
        use warnings;
        use strict;

        use Mail::Box::Manager;

        my $mgr = Mail::Box::Manager->new( 'default_folder_type' => 'mbox' );

        foreach my $mb ( @ARGV ) {
            my $mbox = $mgr->open( 'folder' => $mb )
                or do { warn "Can't open $mb"; next; };

            my $threads = $mgr->threads(
                'folders'  => [ $mbox ],
                'timespan' => 'EVER',
                'window'   => 'ALL',
            );

            save_thread( $_ ) for $threads->all;

            $mgr->close( $mbox );
        }

        {
            my ( $count, @stat );

            sub save_thread {
                my ( $thread ) = @_;

                # Generate file name.
                my $file = 'thread-' . sprintf '%05d', ++$count;

                my $save = $mgr->open(
                    'folder' => $file,
                    'access' => 'rw',
                    'create' => 1,
                ) or die "Cannot open $file to save the thread.\n";

                push @stat, [ $count, $thread->numberOfMessages ];

                $_->copyTo( $save ) for $thread->threadMessages;

                $mgr->close( $save );
            }

            sub END { print_stat(); }

            sub print_stat {
                my $out = '';
                my ( $total_thr, $total_msg ) = ( 0 ) x 2;

                foreach my $s ( @stat ) {
                    $out .= sprintf "%4d : %2d\n", @{ $s };
                    $total_thr++;
                    $total_msg += $s->[1];
                }

                my $old = select STDERR;
                printf "Threads: %4d, Messages: %4d\n%s\n",
                    $total_thr, $total_msg, join '', qw( =- ) x 20;
                print $out;
                select $old;
            }
        }

    For my test case, mind that above code generates triplicates in some cases (but "mutt(1)" does not have the problem in generating threads). In addition, while "mutt" notices 602 threads, above code produces 675 (but that could be /in some part/ due to various threading options that are set for "mutt"). I can provide the original file and those created by "Mail::Box*" if anybody is interested.

    *Update, May 23 2006* Added a simple sub to print the statistics; removed unused "Mail::Message" usage.

Re: Organising mbox into threads?
by parv (Priest) on May 23, 2006 at 05:30 UTC
    Other options, if the messages are already properly threaded, are to use "mutt(1)" or "slrn(1)" to open the mbox file, then use macros to save each thread. There is also a "mutt" patch to tag the whole thread.
Re: Organising mbox into threads?
by LoonyPandora (Novice) on May 24, 2006 at 12:11 UTC

    Thanks for the help guys - much appreciated :) - I wasn't expecting such a collection of comprehensive and speedy responses ;)

    Fletch - I hadn't seen that module before - however I had read the article regarding threading that it links to. Don't know how I managed to miss the module... Thanks for pointing it out

    ruzam - Thanks for the code - I had a quick play with it, and couldn't get it working as I wanted. I'll have another crack at it, as I have only looked at it briefly.

    esper - Sorry, should have been clearer in my original posting - I did create a hash, where the keys were every message-id and tried your first suggestion - but it was indeed too slow. I didn't think of creating a 2nd hash containing the thread's head. I'll give that a shot, sounds like it could work!

    roboticus - Unfortunately, Yahoo groups threaded view goes by subject rather than message-id - giving me some really weird threads as there are thousands of messages with the subject "hi" in my mbox ;) - Also, it took me 3 weeks to get the messages downloaded, as yahoo prevent you from accessing pages too quickly, eventually banning your IP from the whole of yahoo groups! - I put a delay in my download script to get around this, but I'm not going to set the download going again!

    parv - Thanks for the code - I will test it, and see what I can do with it. I've never used mutt before, so it may take a while for me to try your other suggestions. If I can't get it working any other way, then I'll give it a shot.

Re: Organising mbox into threads?
by roboticus (Chancellor) on May 22, 2006 at 20:19 UTC

    Yahoo can display groups in threaded view. P'raps you could screen-scrape the hierarchy from that.

