Reliable email parsing

powerman has asked for the wisdom of the Perl Monks concerning the following question:

Re: Reliable email parsing by xdg (Monsignor) on Sep 13, 2006 at 12:10 UTC
> The word "reliable" in my question means: defect free; compliant with all email-format-related RFCs.

To me, reliable means that it deals well with emails that don't follow the RFCs. I have found Mail::Box to be the best at dealing with whatever the internet throws at it. Yes, it's a complex OO beast, but it's been around for years and it's been battle tested.

"Defect free" is a meaningless phrase to me, and it's unrealistic in any complex piece of software, which I would consider any email parser to be. How would you confirm it anyway? Create a test set of emails and see if each parser can deal with it.

If you really want to do a code review, Mail::Box::Parser, Mail::Box::Parser::Perl and Mail::Box::Parser::C are all fairly self-evident; you don't really need to understand all the OO code to examine the parsers.

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
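The "create a test set of emails" suggestion can be sketched as a small harness. This is only an illustration: `run_corpus`, the `strict` and `lenient` toy parsers, and the two sample messages are hypothetical names and data, not part of Mail::Box or any other CPAN module. In practice each code ref would wrap the constructor of the parser under test.

```perl
use strict;
use warnings;

# Hypothetical harness (not from any CPAN module): $parsers maps a label to a
# code ref that takes raw message text and dies or returns false on failure.
sub run_corpus {
    my ($parsers, $messages) = @_;
    my %failures;
    for my $name (sort keys %$parsers) {
        $failures{$name} = [];
        for my $i (0 .. $#$messages) {
            # eval traps exceptions so one bad message cannot stop the run
            my $ok = eval { $parsers->{$name}->($messages->[$i]) };
            push @{ $failures{$name} }, $i unless $ok;
        }
    }
    return \%failures;    # label => [indexes of messages that failed]
}

# Toy stand-ins for real parsers, for illustration only.
my %parsers = (
    strict  => sub { $_[0] =~ /\A(?:[^\s:]+:.*\n)+\n/ or die "no header/body separator\n" },
    lenient => sub { 1 },
);
my @corpus = (
    "Subject: ok\n\nbody\n",             # well formed
    "Subject: broken\nno blank line",    # header is never terminated
);

my $report = run_corpus(\%parsers, \@corpus);
for my $name (sort keys %$report) {
    my @failed = @{ $report->{$name} };
    printf "%-8s failed on: %s\n", $name, @failed ? join(',', @failed) : 'none';
}
```

Recording failures by message index keeps the report comparable across parsers, so the same corpus can be rerun whenever a module is patched.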
Re^2: Reliable email parsing by powerman (Friar) on Sep 13, 2006 at 12:35 UTC
Dealing with emails that don't follow the RFCs is also required, of course. There is good information about such emails here: http://cr.yp.to/immhf.html. But when I say "RFC compliant" I mean support for all possible formats of email addresses, comments in email headers, and everything that is locale-, language-, or encoding-specific, for example:

`Content-Disposition: attachment; filename0=koi8-r'ru'%F0%D2%C9%D7%C5%D4 filename1=%20from filename2=" russia.txt"`

And about "defect free": there is enough information on this topic now. A good example is DJB's software and most basic, simple *NIX utils. I'm sure that more code == more bugs, so I always prefer smaller, simpler solutions. I'm sure a solution compliant with all these RFCs can be coded in under ~1000-1500 lines of code (I've partially done it already).
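The header in the example above looks like the RFC 2231 parameter-continuation form, in which the parameter names would be written `filename*0*`, `filename*1*` and `filename*2` and separated by semicolons. That reading is an assumption, since the header as shown has no asterisks. Under that assumption, here is a minimal sketch of reassembling such a parameter using only the core Encode module. The helper name `decode_rfc2231_param` is hypothetical, and the sketch ignores many cases a full parser must handle (comments, quoted-pair escapes, missing segments).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: join the RFC 2231 continuation segments of one parameter.
sub decode_rfc2231_param {
    my ($header, $name) = @_;
    my (%segment, $charset);

    # name*N*=value is percent-encoded; name*N=value is plain or quoted.
    while ($header =~ /\b\Q$name\E\*(\d+)(\*?)=("[^"]*"|[^;\s]+)/g) {
        my ($index, $encoded, $value) = ($1, $2, $3);
        $value =~ s/^"(.*)"$/$1/s;    # strip surrounding quotes, if any
        if ($encoded) {
            # Only the first segment carries the charset'language' prefix.
            if ($index == 0 && $value =~ s/^([^']*)'[^']*'//) {
                $charset = $1;
            }
            $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
        }
        $segment{$index} = $value;
    }

    # Concatenate in numeric order, then decode the bytes as one string.
    my $bytes = join '', map { $segment{$_} } sort { $a <=> $b } keys %segment;
    return $charset ? decode($charset, $bytes) : $bytes;
}

my $header = q{Content-Disposition: attachment; }
           . q{filename*0*=koi8-r'ru'%F0%D2%C9%D7%C5%D4; }
           . q{filename*1*=%20from; filename*2=" russia.txt"};

binmode STDOUT, ':encoding(UTF-8)';
print decode_rfc2231_param($header, 'filename'), "\n";
# prints: Привет from russia.txt
```

Decoding after concatenation, rather than per segment, matters for multi-byte charsets where a character could be split across two segments.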
Re^3: Reliable email parsing by adamk (Chaplain) on Sep 14, 2006 at 02:59 UTC
> I'm sure a solution compliant with all these RFCs can be coded in under ~1000-1500 lines of code (I've partially done it already).

Not to be rude, but a large portion of code complexity often comes from the last bits and pieces, which get left until last because they are hard to implement.

So, to sum up the apparent answers: no, there isn't yet something that does what you want. Email:: should be close. I'm sure they'd appreciate your code to help finish off the set of Email:: modules.
Re^3: Reliable email parsing by DrHyde (Prior) on Sep 14, 2006 at 10:00 UTC
> Good example of it is DJB software

OK, now I know you're trolling.
Re^2: Reliable email parsing by powerman (Friar) on Sep 14, 2006 at 20:00 UTC
OK, I'm now playing with Mail::Box. Before trying to parse emails I need to parse my mailbox to fetch the individual emails. So I wrote a simple one-liner that counts the messages in my mbox file 'Mail/-default':

`$ time perl -MMail::Box::Manager -le ' $m=Mail::Box::Manager->new; $f=$m->open(folder=>"Mail/-default"); print scalar $f->messages '`
`Unexpected end of header (C parser): charset="iso-8859-1"`
`3554`
`real 0m8.222s user 0m7.928s sys 0m0.200s`

Oops! Mutt says there are 3549 messages in this file, not 3554... So I've developed my own reader for the mbox format:

`$ time ./mbox_scan Mail/-default`
`3549 Mail/-default`
`real 0m0.492s user 0m0.408s sys 0m0.080s`

Hmm. 16 times faster?! Wow. And correct: it found 3549 messages, just as mutt did. So maybe I misunderstand something about this world, but WHY is my pure-Perl parser so much faster than the C parser in a mature CPAN module? OK, maybe Mail::Box does a lot of additional parsing that I don't do, maybe... but why does it produce incorrect results?

Here is the code of my parser, if interested (sorry, some lines run up to 80 columns): [code not shown, 3 kB]
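Since the mbox_scan code is not shown, here is a minimal sketch of the general technique of counting mbox messages by their separator lines. It is not powerman's code: `count_mbox_messages` is a hypothetical helper, and it assumes a message starts at any line beginning with "From " that is either the first line of the file or preceded by a blank line. Readers differ in how strictly they recognize separators, which is one possible reason two tools can report different counts for the same file.

```perl
use strict;
use warnings;

# Hypothetical helper, not the mbox_scan from the post above.
sub count_mbox_messages {
    my ($fh) = @_;
    my $count      = 0;
    my $prev_blank = 1;    # the start of the file counts as "after a blank line"
    while (my $line = <$fh>) {
        $count++ if $prev_blank && $line =~ /^From /;
        $prev_blank = $line =~ /^\r?\n?\z/ ? 1 : 0;
    }
    return $count;
}

# Usage with an in-memory file; a real run would open the mbox path instead.
my $mbox = join '',
    "From a\@example.com Thu Sep 14 20:00:00 2006\n",
    "Subject: one\n", "\n", "body\n",
    "From here on this is body text, not a separator\n", "\n",
    "From b\@example.com Thu Sep 14 20:01:00 2006\n",
    "Subject: two\n", "\n", "body\n";

open my $fh, '<', \$mbox or die "cannot open in-memory mbox: $!";
print count_mbox_messages($fh), "\n";    # prints 2
close $fh;
```

A body line that starts with "From " directly after a blank line would still be miscounted by this sketch unless the mailbox writer escaped it (for example as ">From ").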
Re^3: Reliable email parsing by Fletch (Bishop) on Sep 14, 2006 at 20:25 UTC
Well, your test mailbox looks to have invalid data (or at least invalid as far as Mail::Box is concerned; it looks to be a bad MIME header). Figure out which message is the offending one and let the maintainers know.

As for speed, your code is doing nothing more than reading the message body in; Mail::Box is building up objects to represent each of the messages. It isn't really surprising that code which (basically) throws away the data it reads, rather than doing anything useful with it, runs faster.
Re: Reliable email parsing by davorg (Chancellor) on Sep 13, 2006 at 11:51 UTC
You should look at the Email::* modules from the Perl Email Project; perhaps Email::Simple or Email::MIME will do what you want. And if they don't, the project members seem pretty open to receiving bug reports and patches.

-- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
Re^2: Reliable email parsing by powerman (Friar) on Sep 13, 2006 at 12:00 UTC
Thanks! I'll check them, but... I've noticed 818 Perl modules (packages) listed on this site. I agree, this task is complex. But having 818 modules for a single task suggests that none of them has really solved it. :(

UPDATE: I've checked them. They are far from compliant with all the RFCs I've listed. :( So my question still stands: does an implementation exist that complies with all (or most) of the listed RFCs?
Re^3: Reliable email parsing by davorg (Chancellor) on Sep 13, 2006 at 12:28 UTC
> I've checked them. They are far away from compliance to all RFC I've listed. :( So, my question is still actual: is there exists realization compliant to all (or most) listed RFCs?

Well, as I said, I'm pretty sure that's pretty well all of the email modules on CPAN. So if they don't come up to your high standards, then it looks like you're out of luck.

Or... here's a wild and crazy idea. If you have specific areas where you know the existing modules have problems, then why not get involved with the Perl Email Project and help them fix those problems? Your involvement could be as small as raising bugs against the existing modules and pointing out their deficiencies, or perhaps you could go as far as creating test cases that demonstrate the problems, or maybe you could even produce patches that fix them.

Just saying that there are problems doesn't really achieve much. Documenting the problems and helping to fix them benefits everyone.
Re^3: Reliable email parsing by davorg (Chancellor) on Sep 13, 2006 at 12:05 UTC
> I've noticed 818 perl modules(packages) listed on this site

Yes, well, that's because they seem to list every mail-handling module that they found on CPAN.

As I understand it, the Email::* namespace is supposed to be a complete set of modules for carrying out all email processing in Perl, all of which will work nicely with each other. Once that set is complete (and I'm afraid I don't know how far off that is), anything in any of the other namespaces will be redundant.

This is a similar approach to the one taken by the Perl DateTime project a few years ago.
Re^3: Reliable email parsing by dave0 (Friar) on Sep 14, 2006 at 03:42 UTC
> I've noticed 818 perl modules(packages) listed on this site. I agree, this task is complex. But having 818 modules for single task looks like no one of them really solved it. :(

The 818 modules are a nearly exhaustive list of the email-related modules available from CPAN, and they're certainly not all for a single task. The list includes POP3 clients, SMTP servers, MIME parsers, local delivery agents, pipemailers, antispam tools, etc. The reason they're all listed on the PEP wiki is to provide an easy way to categorize and annotate the modules, so that a subset of those modules can be recommended and improved as the current "best practice" for email handling in Perl.

> UPDATE: I've checked them. They are far away from compliance to all RFC I've listed. :( So, my question is still actual: is there exists realization compliant to all (or most) listed RFCs?

Well, given that I use quite a few of those modules on a daily basis, I find it hard to believe that they're all "far away" from compliance. Do you have any specific issues to point out?
Re: Reliable email parsing by dave0 (Friar) on Sep 13, 2006 at 13:53 UTC
The most robust MIME-parsing module on CPAN is probably MIME-tools. It suffers from about 10 years of accreted edge cases and bugfixes tacked on top of the original design, so while it isn't nice code by any means, it handles just about anything you can throw at it (RFC-compliant or not).

I currently work for the maintainer of MIME-tools, so it's on my TODO list to start refactoring some of the 10 years of cruft; patches are welcome.
Re^2: Reliable email parsing by jdtoronto (Prior) on Sep 13, 2006 at 17:23 UTC
I have to agree, dave0. I have spent a lot of time over the last 20 years or so writing email clients: originally in C, then C++ and, horror of horrors, in VB. But the most trouble-free has been Perl with Net::SMTP, Net::POP3 and MIME::Tools.

About a year ago I took all the email-related RFCs, plus things like the DJB documents, and bound them into a book about 40mm thick. With all of that as reference material, and the Perl modules, I haven't found a thing I can't handle.

Good luck with your refactoring efforts. I, for one, look forward to seeing the results and the extended lifetime of a great family of modules.

jdtoronto
Re^3: Reliable email parsing by mattk (Pilgrim) on Sep 16, 2006 at 13:49 UTC
Yes, the MIME-tools are great. MIME::Entity and Mail::GPG would be your best bet.
Re: Reliable email parsing by ruoso (Curate) on Sep 15, 2006 at 08:39 UTC
Oh, and don't forget to share your test base with us so all the bugs you found can be fixed.

daniel