Downloader for everyone.net emails
by serf (Chaplain)
on Feb 09, 2006 at 12:19 UTC
Lately I've been having a play with WWW::Mechanize which I've really enjoyed.
I ended up coming to something that stumped me for a while, but with some inspiration to keep going (isn't the belief that you're close to your goal and should keep trying often the most useful help you can get!?) from Corion yesterday I cracked it.
The form at the top of their index page has this (slightly tidied for readability):
Which seems to be re-written by this:
and the links to the messages look like this (slightly tidied):
The thing which held me up was not being sure about using post (which isn't re-documented in WWW::Mechanize). I had found Corion's post Re: WWW::Mechanize and POST, which pointed me to LWP::UserAgent, which gave me the syntax I needed:
Even so, I wasn't getting the right response back from the server (I was unsure of what I actually needed to POST to satisfy it), so I didn't know if I was even on the right track...
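For anyone who hasn't met it before, here is a minimal sketch of the general shape of a form POST in LWP. WWW::Mechanize is a subclass of LWP::UserAgent, so `$mech->post($url, \@form)` takes the same arguments. The URL and field names are made-up placeholders, not the real everyone.net ones, and the request is only built and printed, never sent:

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder URL and fields: assumptions for illustration only.
my $url  = 'http://mail.example.invalid/login';
my @form = (
    loginName => 'someuser',
    password  => 'secret word',
);

# Build the same kind of request that $ua->post($url, \@form) would send,
# without touching the network.
my $req = POST($url, \@form);

# Show what would go over the wire.
print $req->method, ' ', $req->uri, "\n";
print $req->header('Content-Type'), "\n";
print $req->content, "\n";
```

Printing the request like this is a cheap way to compare what the script is about to send against what the browser really sent.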
That's where Corion came in again... he suggested:
Corion 2006-02-08 08:04:46-05 serf: Consider using ethereal, HTTP Live Headers (FireFox Plugin) or my module, Sniffer::HTTP to see the POST requests generated. Then you can either replicate these requests in your Perl code or use my module..
Well I'd played with ethereal the night before, had found it a bit confusing and not very nice to read the data in - because it displayed it as a hex dump + ascii broken into blocks - although you *can* save the transactions to a file.
I ended up deciding that it was easiest to go back to my long-time acquaintance tcpdump and google up the appropriate syntax to drive it (can never remember the switches for the damn thing!) and pipe that straight into a file. I found this worked:
Which gave me pretty much what I needed.
I piped the output from Firefox and my script into two separate files, then looked at the files side-by-side and found that the string I was sending needed to have much more in it. I also clicked that the values in the %form hash would probably end up as that long &-separated line that was sent in the POST.
(By the way, HTTP Live Headers was AWESOME for showing me this :o) - it's the cleanest and tidiest way to see what Firefox is sending - but doesn't let you watch the Perl script of course!)
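Going the other way, a captured &-separated line can be split back into key/value pairs ready for a %form hash. This is only a rough core-Perl sketch; the sample line and its field names are invented for illustration and are not what everyone.net actually sends:

```perl
use strict;
use warnings;

# Turn an "a=1&b=two+words" style POST body into a list of key/value pairs.
sub parse_post_body {
    my ($body) = @_;
    my @pairs;
    for my $chunk (split /&/, $body) {
        next unless length $chunk;                # skip stray '&&'
        my ($key, $value) = split /=/, $chunk, 2;
        $value = '' unless defined $value;        # "flag" with no '='
        for ($key, $value) {
            tr/+/ /;                              # '+' means space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # decode %XX escapes
        }
        push @pairs, $key, $value;
    }
    return @pairs;
}

# Made-up sample line (hypothetical field names).
my $captured = 'action=inbox&folder=My+Mail&note=a%26b&empty=';
my %form     = parse_post_body($captured);

printf "%-8s => '%s'\n", $_, $form{$_} for sort keys %form;
```

Once the pairs are in a hash it is easy to comment entries out one at a time and re-test the POST.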
The bit that I needed (I've broken the 'goToMenu=' line with a few newlines and spaces so it wraps nicely here - it is all on one line) was this:
and with a few commands in vim I was able to nicely massage the line into a hash:
I was then able to go through and comment out (with a #) key=>value pairs, testing the POST each time to make sure that I hadn't broken it, until I'd found the bare minimum that I needed for it to still work:
I ended up with this as my basic script:
NB: This will only work with plain-text emails, not HTML ones, because it strips all the HTML back to text. It would be simple enough to tune this to only weed out the page-specific HTML and not touch the message. This script also doesn't understand attachments - that was not the point of this exercise :o) It should be easily customisable to work with any of the many other everyone.net sites as well.
You will also find that this bare-bones version needs to be run more than once if you have more than a page full of messages in the folder. It only processes the first page of the index, rather than checking for multiple pages and looping back to re-scrape the index for more message IDs until the folder is empty.
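A sketch of that missing loop, assuming (as the paragraph above implies) that a message drops out of the index once it has been downloaded. `fetch_ids` and `download` are hypothetical callbacks standing in for the index scrape and the per-message fetch; here they are faked with an in-memory list so the sketch runs without a network:

```perl
use strict;
use warnings;

# Keep re-reading the first index page until it comes back empty.
sub drain_folder {
    my (%arg) = @_;
    my $max_passes = $arg{max_passes} || 50;   # safety net against looping forever
    my $done       = 0;
    for my $pass (1 .. $max_passes) {
        my @ids = $arg{fetch_ids}->();
        last unless @ids;                      # folder is empty
        for my $id (@ids) {
            $arg{download}->($id);
            $done++;
        }
    }
    return $done;
}

# Fake mailbox: 7 messages, with an index that shows 3 per page.
my @mailbox = map { "msg$_" } 1 .. 7;
my @saved;

my $count = drain_folder(
    fetch_ids => sub {
        my $last = $#mailbox < 2 ? $#mailbox : 2;
        return @mailbox[0 .. $last];
    },
    download => sub {
        my ($id) = @_;
        push @saved, $id;
        @mailbox = grep { $_ ne $id } @mailbox;  # message leaves the folder
    },
);

print "downloaded $count messages\n";
```

The pass limit is there because if a download ever fails to remove its message from the index, the loop would otherwise never end.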
Since I got it working I now no longer fear HTTP POSTs, because I know I have the tools I need to do them. (and for years I'd been wondering about the witchcraft that enabled people to automate POSTs - all phear the foo of the Camel!)
I hope this might help someone else in their project too!