powerman has asked for the wisdom of the Perl Monks concerning the following question:
Is there a reliable email parser for Perl?
A CPAN module, a standalone Perl script, or a standard Unix command-line utility that can be called from a Perl script would all be acceptable.
By "reliable" I mean:
- defect free
- compliant with all RFCs related to the email format:
I know there is a huge number of CPAN modules for parsing MIME and RFC 822 headers... I checked them all a few months ago without success. The small, simple modules were developed without regard for RFC compliance, and the few others are HUGE bundles containing too much complex OO code to be "defect free" or to give me a chance to check them for RFC compliance. :(
Here is an example interface showing the parsing result I need:
# $mime_info = mime_unpack($mime);
# $mime_info = {
# Subject => $subject,
# Date => $date,
# From => $from,
# Reply_to => $reply_to,
# To => \@to,
# Cc => \@cc,
# encrypted => $true,
# signed_by => { $gpgid1 => $true, $gpgid2 => $true },
# headers => "From: ...\nTo: ...\n...\n",
# type => "multipart/mixed",
# body => [
# {
# headers => "...",
# type => "multipart/related",
# body => [
# {
# headers => "...",
# type => "multipart/alternative",
# body => [
# {
# headers => "...",
# type => "text/plain",
# charset => "koi8-r",
# body => "...",
# },
# {
# headers => "...",
# type => "text/html",
# charset => "koi8-r",
# body => "...",
# },
# ],
# },
# {
# headers => "...",
# type => "image/gif",
# name => "722370018427e0cb56ef03.gif",
# cid => '000a01c69f03$d48fa5e0$0202a8c0@m',
# body => "...",
# },
# ],
# },
# {
# headers => "...",
# type => "image/jpeg",
# filename => "_devochka.jpeg",
# body => "...",
# },
# ],
# }
# die "message must be encrypted" if !$mime_info->{encrypted};
# die "message must be signed by authorized developer"
# if !$mime_info->{signed_by}{ $fingerprint{powerman} }
# && !$mime_info->{signed_by}{ $fingerprint{nikita} }
# ;
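To make the shape of that result concrete, here is a minimal sketch of how a caller might walk such a nested structure. It assumes only the layout shown above (a part's `body` is either a string or an array reference of sub-parts); `leaf_types` is a hypothetical helper name, and the data is a hand-built stand-in, not the output of any real parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the content type of every leaf part,
# depth first. A part is a leaf when its body is not an array reference.
sub leaf_types {
    my ($part) = @_;
    return $part->{type} unless ref($part->{body}) eq 'ARRAY';
    return map { leaf_types($_) } @{ $part->{body} };
}

# Hand-built stand-in for the structure sketched above.
my $mime_info = {
    type => 'multipart/mixed',
    body => [
        {
            type => 'multipart/related',
            body => [
                {
                    type => 'multipart/alternative',
                    body => [
                        { type => 'text/plain', charset => 'koi8-r', body => '...' },
                        { type => 'text/html',  charset => 'koi8-r', body => '...' },
                    ],
                },
                { type => 'image/gif', body => '...' },
            ],
        },
        { type => 'image/jpeg', filename => '_devochka.jpeg', body => '...' },
    ],
};

my @types = leaf_types($mime_info);
print join(', ', @types), "\n";
```

Because every level uses the same keys, a single recursive function covers any nesting depth.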
Re: Reliable email parsing
by xdg (Monsignor) on Sep 13, 2006 at 12:10 UTC
|
By "reliable" I mean: * defect free * compliant with all RFCs related to the email format
To me, reliable means that it deals well with emails that don't follow the RFCs. I have found Mail::Box to be the best at dealing with whatever the internet throws at it. Yes, it's a complex OO beast, but it's been around for years and it's been battle tested.
"Defect free" is a meaningless phrase to me and it's unrealistic in any complex piece of software -- which I would consider any email parser to be. How would you confirm it anyway? Create a test set of emails and see if each parser can deal with it.
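A corpus test of that kind could be sketched roughly as follows. Everything here is an assumption for illustration: `run_corpus` is a hypothetical helper, the "toy" parser is a placeholder that only checks for a header/body separator, and the two samples stand in for a real collection of messages. A real comparison would wrap each candidate module in a code reference of the same shape.

```perl
use strict;
use warnings;

# Hypothetical harness: run every candidate parser over every sample and
# record which (parser, sample) pairs die.
sub run_corpus {
    my ($parsers, $corpus) = @_;
    my %failed;
    for my $pname (sort keys %$parsers) {
        for my $sname (sort keys %$corpus) {
            my $ok = eval { $parsers->{$pname}->($corpus->{$sname}); 1 };
            push @{ $failed{$pname} }, $sname unless $ok;
        }
    }
    return \%failed;
}

# Placeholder parser: dies unless a blank line separates headers from body.
my %parsers = (
    toy => sub { $_[0] =~ /\n\r?\n/ or die "no header/body separator\n"; 1 },
);

# Stand-in corpus; real tests would load saved messages from disk.
my %corpus = (
    'well-formed'  => "Subject: hi\n\nbody\n",
    'headers-only' => "Subject: hi\n",
);

my $failed = run_corpus(\%parsers, \%corpus);
print "$_ failed on: @{ $failed->{$_} }\n" for sort keys %$failed;
```

Wrapping each call in `eval` keeps one crashing parser from aborting the whole run, so the same corpus can be replayed against several modules.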
If you really want to do a code review, Mail::Box::Parser, Mail::Box::Parser::Perl and Mail::Box::Parser::C are all fairly self-evident -- you don't really need to understand all the OO code to examine the parsers.
-xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
| [reply] |
|
Content-Disposition: attachment;
filename*0*=koi8-r'ru'%F0%D2%C9%D7%C5%D4
filename*1*=%20from
filename*2=" russia.txt"
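The header above uses RFC 2231 parameter continuations with a charset-tagged first segment. As a rough illustration of what a parser has to do with it, here is a minimal sketch in core Perl. `rfc2231_param` is a hypothetical helper, not part of any CPAN module; it assumes the header has already been unfolded into one line and that the segments are separated by semicolons, as RFC 2231 requires. It ignores many corner cases a compliant parser must handle.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: reassemble and decode one RFC 2231 parameter from an
# unfolded header value. Segments look like name*N*=pct-encoded or
# name*N="literal".
sub rfc2231_param {
    my ($header, $name) = @_;
    my (%segment, $charset);
    while ($header =~ /;\s*\Q$name\E\*(\d+)(\*?)=(?:"([^"]*)"|([^;\s]*))/g) {
        my ($n, $extended, $value) = ($1, $2, defined $3 ? $3 : $4);
        if ($extended) {
            # Only segment 0 carries the charset'language' prefix.
            if ($n == 0 && $value =~ s/^([^']*)'[^']*'//) {
                $charset = $1;
            }
            # Undo percent-encoding to get raw bytes.
            $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
        }
        $segment{$n} = $value;
    }
    return unless %segment;
    my $bytes = join '', map { $segment{$_} } sort { $a <=> $b } keys %segment;
    return $charset ? decode($charset, $bytes) : $bytes;
}

# Same parameter as in the header above, unfolded, with the semicolons
# this sketch relies on.
my $disposition = q{attachment; filename*0*=koi8-r'ru'%F0%D2%C9%D7%C5%D4;}
                . q{ filename*1*=%20from; filename*2=" russia.txt"};

my $filename = rfc2231_param($disposition, 'filename');
binmode STDOUT, ':encoding(UTF-8)';
print "$filename\n";
```

The segments are joined as bytes first and decoded once at the end, so a multi-byte character split across two segments would still decode correctly.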
As for 'defect free': there is enough information on this topic now. Good examples are DJB's software and most of the basic, simple *NIX utilities. I'm sure that more code == more bugs, so I always prefer smaller, simpler solutions.
I'm sure a solution compliant with all these RFCs can be coded in under ~1000-1500 lines of code (I've partially done it already).
|
> I'm sure solution compliant to all these RFC can be coded under ~1000-1500 lines of code (I've partially done it already).
Not to be rude, but often a large portion of code complexity comes from the last bits and pieces, that get left till last because they are hard to implement.
So to sum up the apparent answers.
No, there isn't yet something that does what you want. Email:: should be close. I'm sure they'd appreciate your code to help finish off the set of Email:: modules.
| [reply] |
|
Good examples are DJB's software
OK, now I know you're trolling.
| [reply] |
|
OK, I'm now playing with Mail::Box. Before trying to parse emails I need to parse my mailbox to fetch the individual messages, so I wrote a simple one-liner that counts the messages in my mbox file 'Mail/-default'.
$ time perl -MMail::Box::Manager -le '
$m=Mail::Box::Manager->new;
$f=$m->open(folder=>"Mail/-default");
print scalar $f->messages
'
Unexpected end of header (C parser):
charset="iso-8859-1"
3554
real 0m8.222s
user 0m7.928s
sys 0m0.200s
Oops! Mutt says there are 3549 messages in this file, not 3554... So I developed my own reader for the mbox format:
$ time ./mbox_scan Mail/-default
3549 Mail/-default
real 0m0.492s
user 0m0.408s
sys 0m0.080s
Hmm. 16 times faster?! Wow. And correct: it found 3549 messages, just like mutt. So maybe I misunderstand something about this world, but WHY is my pure-Perl parser so much faster than the C parser in a mature CPAN module? OK, maybe Mail::Box does a lot of additional parsing that I don't do, maybe... but why does it produce incorrect results?
Here is the code of my parser, if you're interested (sorry, the lines run up to 80 columns):
| [reply] [d/l] [select] |
|
Well, your test mailbox looks to have invalid data (or at least invalid as far as Mail::Box is concerned; looks to be a bad MIME header). Figure out what the offending message is and let the maintainers know.
As for speed, your code is doing nothing more than reading the message body in; Mail::Box is building up objects to represent each of the messages. It isn't really surprising that code which (basically) throws away the data it reads, rather than doing anything useful with it, runs faster.
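To illustrate how little work a bare message counter has to do, here is a minimal sketch in core Perl. It is not powerman's script, whose listing is not reproduced in this thread. It assumes the common convention that a message starts at a line beginning with "From " that is either the first line of the file or follows a blank line, and that body lines starting with "From " have been escaped as ">From " by whatever wrote the mailbox. `count_mbox_messages` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: count messages in mbox data read from a filehandle.
# Nothing is kept in memory except a counter and one flag.
sub count_mbox_messages {
    my ($fh) = @_;
    my $count      = 0;
    my $prev_blank = 1;    # the start of the file counts as "after a blank line"
    while (my $line = <$fh>) {
        $count++ if $prev_blank && $line =~ /^From /;
        $prev_blank = ($line =~ /^\r?\n?\z/) ? 1 : 0;
    }
    return $count;
}

# Tiny in-memory mailbox standing in for a real file.
my $sample = join '',
    "From alice\@example.org Wed Sep 13 12:00:00 2006\n",
    "Subject: one\n",
    "\n",
    ">From here on this is body text.\n",
    "\n",
    "From bob\@example.org Wed Sep 13 12:05:00 2006\n",
    "Subject: two\n",
    "\n",
    "Hi.\n";

open my $fh, '<', \$sample or die "cannot open in-memory mailbox: $!";
my $messages = count_mbox_messages($fh);
close $fh;
print "$messages\n";
```

A loop like this never parses headers, decodes bodies, or allocates per-message objects, which is the difference in work the reply above describes.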
| [reply] |
Re: Reliable email parsing
by davorg (Chancellor) on Sep 13, 2006 at 11:51 UTC
|
You should look at the Email::* modules from the Perl Email Project, perhaps Email::Simple or Email::MIME will do what you want. And if they don't, then the project members seem pretty open to receiving bug reports and patches.
--
< http://dave.org.uk>
"The first rule of Perl club is you do not talk about
Perl club." -- Chip Salzenberg
| [reply] |
|
Thanks! I'll check them, but...
I noticed 818 Perl modules (packages) listed on that site. I agree that the task is complex, but having 818 modules for a single task suggests that none of them has really solved it. :(
UPDATE: I've checked them. They are far from compliant with all the RFCs I listed. :( So my question still stands: is there an implementation compliant with all (or most) of the listed RFCs?
| [reply] |
|
I've checked them. They are far from compliant with all the RFCs I listed. :( So my question still stands: is there an implementation compliant with all (or most) of the listed RFCs?
Well, as I said, I'm pretty sure that's pretty well all of the email modules on CPAN. So if they don't come up to your high standards, then it looks like you're out of luck.
Or... here's a wild and crazy idea. If you have specific areas where you know the existing modules have problems, then why not get involved with the Perl Email Project and help them fix those problems. Your involvement could be as small as raising bugs against the existing modules pointing out their deficiencies, or perhaps you could go as far as creating test cases that demonstrate the problems, or maybe you could even produce patches that fix them.
Just saying that there are problems, doesn't really achieve much. Documenting the problems and helping to fix them benefits everyone.
| [reply] |
|
I noticed 818 Perl modules (packages) listed on that site
Yes, well that's because they seem to list every mail handling module that they found on CPAN. As I understand it, the Email::* namespace is supposed to be a complete set of modules for carrying out all email processing in Perl - all of which will work nicely with each other. Once that set is complete (and I'm afraid I don't know how far off that is) anything in any of the other namespaces will be redundant.
This is a similar approach to the one taken by the Perl DateTime project a few years ago.
| [reply] |
Re: Reliable email parsing
by dave0 (Friar) on Sep 13, 2006 at 13:53 UTC
|
The most robust MIME-parsing module on CPAN is probably MIME-tools.
It suffers from about 10 years of accreted edge-cases and bugfixes tacked on top of the original design, so while it isn't nice code by any means, it handles just about anything you can throw at it (RFC-compliant or not).
I currently work for the maintainer of MIME-tools, so it's on my TODO list to start refactoring some of the 10 years of cruft -- patches are welcome.
|
I have to agree with dave0. I have spent a lot of time over the last 20 years or so writing email clients. Originally in C, then C++ and, horror of horrors, in VB. But the most trouble-free has been Perl with Net::SMTP, Net::POP3 and MIME::Tools. About a year ago I took all the email-related RFCs, plus all the things like the DJB documents, and bound them into a book about 40mm thick. With all of that as reference material, and the Perl modules, I haven't found a thing I can't handle.
Good luck with your refactoring efforts. I, for one, look forward to seeing the results and the extended lifetime of a great family of modules.
jdtoronto
| [reply] |
Re: Reliable email parsing
by ruoso (Curate) on Sep 15, 2006 at 08:39 UTC
|
Oh, and don't forget to share with us your test base so all the bugs you found can be fixed.
| [reply] |