Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Downloader for emails

by serf (Chaplain)
on Feb 09, 2006 at 12:19 UTC ( #529064=CUFP: print w/replies, xml ) Need Help??

Lately I've been having a play with WWW::Mechanize which I've really enjoyed.

I ended up coming to something that stumped me for a while, but with some inspiration to keep going (isn't the belief that you're close to your goal and should keep trying often the most useful help you can get!?) from Corion yesterday I cracked it.

There are good utilities out there like gotmail and fetchyahoo, but I couldn't find anything to deal with messages on's mail servers.

I wanted write a script to download & delete messages from my account, but the links from the folder index to the messages use JavaScript, which WWW::Mechanize doesn't know how to parse so can't follow links in (and neither it should!).

The form at the top of their index page has this (slightly tidied for readability):

<tr><form name="myForm" method="post" action="">

Which seems to be re-written by this:

function View(mid) { document.myForm.action = ''; document.myForm.showDraft.value = '0'; document.myForm.mid.value = mid; document.myForm.submit(); }

and the links to the messages look like this (slightly tidied):

<tr align="left" valign="top"> <td>&nbsp;&nbsp;&nbsp;</td> <td align="right"><input type="checkbox" name="midlist" value="5826884043">&nbsp;&nbsp;</td> <td>Sender Name&nbsp;&nbsp;</td> <td><a href="Javascript: View('5826884043')" onclick="(chopped out javascript popup)" >Email Subject...</a>&nbsp;&nbsp;</td> <td nowrap class="smaller">Fri 01-27-2006 08:09 AM&nbsp;&nbsp;</td> <td class="smaller">14 KB</td> </tr>

After figuring out from the JavaScript what I needed to get from it to post to the server (i.e. what URI to post to) I had to then work out how to post from WWW::Mechanize.

The thing which held me up was not being sure about using post (which wasn't re-documented in WWW::Mechanize because although I had found Corion's post Re: WWW::Mechanize and POST which pointed me to LWP::UserAgent which gave me the syntax I needed:

$ua->post( $url, \%form )

I wasn't getting the right response back from the server (I was unsure of what I actually needed to POST to satisfy it) so didn't know if I was even on the right track...

That's where Corion came in again... he suggested:

Corion 2006-02-08 08:04:46-05 serf: Consider using ethereal, HTTP Live Headers (FireFox Plugin) or my module, Sniffer::HTTP to see the POST requests generated. Then you can either replicate these requests in your Perl code or use my module..
Corion 2006-02-08 08:08:32-05 serf: Your goal is to post exactly the same as FireFox. At least until you find out what headers are relevant and that the URLs and content have the right values
Corion 2006-02-08 08:14:55-05 serf: I'd dump the headers/requests as they get generated from your browser vs. your Perl script and then try to make the Perl requests as similar as possible to the browser requests.

Well I'd played with ethereal the night before, had found it a bit confusing and not very nice to read the data in - because it displayed it as a hex dump + ascii broken into blocks - although you *can* save the transactions to a file.

I ended up deciding that it was easiest to go back to my long-time aquaintance tcpdump and google up the appropriate syntax to drive it (can never remember the switches for the damn thing!) and pipe that straight into a file. I found this worked:

tcpdump -A -s 65535 -i eth0 -l host $MY_IP and host $REMOTE >sniff.txt

Which gave me pretty much what I needed.

I piped the output from FireFox and my script into two seperate files, then looked at the files side-by-side and found that the string I was sending needed to have much more in it. I also clicked that the values in the %form hash would probably end up as that long & seperated line that was sent in the post.

(By the way, HTTP Live Headers was AWESOME for showing me this :o) - it's the cleanest and tidiest way to see what Firefox is sending - but doesn't let you watch the Perl script of course!)

The bit that I needed in (I've broken the 'goToMenu=' line with a few newlines and space so it wraps nicely here - it is all on one line) was this:

Content-Length: 262 goToMenu=&filefolder=&jumpList=&folder=INBOX&movefolder=&command= &mid=5095286849&index=0&doNewSort=&doDelete=&doEmpty=&doMove= &doSpam=&sort=date&sortdir=0 &midseq=5096143440.5096113306.5095988201.5095286849 &isSearch=0&showDraft=0&filefolder2=&jumpList2=

and with a few commands in vim I was able to nicely massage the line into a hash:

:.s/^/my %form = (^M " :.s/=/" => "/g :.s/&/",^M "/g :.s/$/"^M);

giving me:

my %form = ( "goToMenu" => "", "filefolder" => "", "jumpList" => "", "folder" => "INBOX", "movefolder" => "", "command" => "", "mid" => "5095286849", "index" => "0", "doNewSort" => "", "doDelete" => "", "doEmpty" => "", "doMove" => "", "doSpam" => "", "sort" => "date", "sortdir" => "0", "midseq" => "5096143440.5096113306.5095988201.5095286849", "isSearch" => "0", "showDraft" => "0", "filefolder2" => "", "jumpList2" => "" );

I was then able to go through and hash out key=>value pairs (testing the POST each time to make sure that I hadn't broken it) until I'd found the bare minimum that I needed for it to still work:

my %form = ( # "goToMenu" => "", # "filefolder" => "", # "jumpList" => "", "folder" => "INBOX", # "movefolder" => "", # "command" => "", "mid" => "5095286849", # "index" => "0", # "doNewSort" => "", # "doDelete" => "", # "doEmpty" => "", # "doMove" => "", # "doSpam" => "", # "sort" => "date", # "sortdir" => "0", # "midseq" => "5096143440.5096113306.5095988201.5095286849", # "isSearch" => "0", # "showDraft" => "0", # "filefolder2" => "", # "jumpList2" => "" )

I ended up with this as my basic script:

#!/usr/bin/perl # # # - download text-only emails from webmail servers. # use strict; use warnings; use WWW::Mechanize; use constant DEBUG => 1; my $login = 'mylogin'; my $passwd = 'mypassword'; my $folder = shift; die "Usage: $0 FOLDER\n" unless $folder; my $base = ""; my $loginuser = "$base/"; my $ua = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/200509 +22 Firefox/1.0.7 (Debian package 1.0.7-1)"; # Create a new instance of WWW::Mechanize my $mech = WWW::Mechanize->new( autocheck => 1 ); # Set User Agent $mech->agent($ua); # Continue with other tasks... DEBUG && print STDERR "Getting login page...\n"; $mech->get($loginuser); $mech->form("myForm"); $mech->field( "loginName", $login ); $mech->field( "user_pwd", $passwd ); $mech->field( "store_pwd", "on" ); DEBUG && print STDERR "Logging in...\n"; $mech->click("login"); my $main = "$base/$folder"; DEBUG && print STDERR "Getting: $main\n"; $mech->get($main); my $magic; # Session ID my %mid; # Mesage IDs for my $line ( split( $/, $mech->content() ) ) { $magic = $1 if $line =~ /^ document.myForm.action = '[0-9]+)';/; $mid{$1} = $1 if $line =~ /<td><a href="Javascript:\s*View\('([0-9]+)'\)"/i; } die "No mail found in '$folder' (no magic).\n" unless $magic; my $view = "$base/$magic"; for my $mid ( keys %mid ) { my %form = ( "mid" => $mid, "folder" => $folder, "fullHeaders" => "1", ); DEBUG && print STDERR "Getting: $mid\n"; $mech->post( $view, \%form ); my $content = $mech->content(); # # Strip off the start of the page down to the mail headers # $content =~ s|<html>.*Add Sender To Address Book</span></a></tr></table>\s +*||sg; # # Strip end of page after message # $content =~ s|</pre>\n</td></tr>.*$||sg; # # Strip the attachments block # $content =~ s|<td align="right" valign="top"><table><tr><td align= +"center"><b>Attachments</b></td></tr><tr><td>.*?<pre>\n||sg; # # Special header fixes # # Chop the last table block before the header $content =~ s|\s*<tr><td>&nbsp;</td></tr>\s*<tr><td><table width=" +100%"><tr><td valign="top" width="50%"><table width="100%" border="0" + cellspacing="0" cellpadding="0"><tr valign="top" align="left"><td wi +dth="1%" nowrap>||sg; # # So header lines don't come out double spaced # $content =~ s|</tr>\s+||sg; # # Put line breaks in the right places in the header # $content =~ s/<br>/\n/g; # # Strip any remaining page tags # for my $tag ( qw( tr td table img b ) ) { $content =~ s|<$tag[^>]*>||sg; $content =~ s|</$tag[^>]*>||sg; } # # Remove links # $content =~ s/<a\b[^>]+>//sg; $content =~ s|</a>||g; # # Reformat entities as text # $content =~ s/&nbsp;/ /g; $content =~ s/&amp;/\&/g; $content =~ s/&gt;/>/g; $content =~ s/&lt;/</g; $content =~ s/&quot;/"/g; my $out = "$mid.msg"; DEBUG && print STDERR "Writing output file: '$out'\n"; open (OUT, "> $out") || die "Can't write to '$out': $!\n"; print OUT "$content\n"; close(OUT); $form{"doDelete"} = "1"; $form{"midlist"} = "$mid"; DEBUG && print STDERR "Deleting: $mid\n"; $mech->post( $view, \%form ); }

NB: This will only work with plain-text emails, not HTML ones, because it strips all the HTML back to text. It would be simple enough to tune this to only weed out the page specific HTML and not touch the message. This script also doesn't understand attachments - that was not the point of this excercise :o) It should be easily customisable to work with any of the many other sites as well.

You will also find that in this bare-bones version you will need to run it more than once if you have more than a page full of messages in the folder because this cut-down version only processes the first page's index full of message IDs instead of checking for multiple pages and looping back to re-scrape the index for more message IDs until the folder is empty.

Since I got it working I now no longer fear HTTP POSTs, because I know I have the tools I need to do them. (and for years I'd been wondering about the witchcraft that enabled people to automate POSTs - all phear the foo of the Camel!)

Thanks to Corion for his assistance and for inspirating me and to GrandFather and Corion for their help with tidying up and posting this node.

I hope this might help someone else in their project too!

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://529064]
Approved by Corion
Front-paged by GrandFather
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2018-06-20 00:24 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (116 votes). Check out past polls.