 
PerlMonks  

Reliable email parsing

by powerman (Friar)
on Sep 13, 2006 at 11:44 UTC ( [id://572716]=perlquestion )

powerman has asked for the wisdom of the Perl Monks concerning the following question:

Does a reliable email parser for Perl exist?

A CPAN module, a separate Perl script, or a standard Unix command-line utility that can be called from a Perl script would all be acceptable.

By "reliable" I mean:

* defect free
* complies with all email-format-related RFCs

I know there is a huge number of CPAN modules for MIME/RFC 822 header parsing. I checked them all a few months ago without success: the simple, small modules were developed without regard for RFC compliance, and the few others are HUGE bundles containing too much complex OO code to be "defect free" or to give me a chance to check them for RFC compliance. :(

Here is an example interface showing the parsing result I need:

# $mime_info = mime_unpack($mime);
# $mime_info = {
#     Subject   => $subject,
#     Date      => $date,
#     From      => $from,
#     Reply_to  => $reply_to,
#     To        => \@to,
#     Cc        => \@cc,
#     encrypted => $true,
#     signed_by => { $gpgid1 => $true, $gpgid2 => $true },
#     headers   => "From: ...\nTo: ...\n...\n",
#     type      => "multipart/mixed",
#     body      => [
#         {
#             headers => "...",
#             type    => "multipart/related",
#             body    => [
#                 {
#                     headers => "...",
#                     type    => "multipart/alternative",
#                     body    => [
#                         {
#                             headers => "...",
#                             type    => "text/plain",
#                             charset => "koi8-r",
#                             body    => "...",
#                         },
#                         {
#                             headers => "...",
#                             type    => "text/html",
#                             charset => "koi8-r",
#                             body    => "...",
#                         },
#                     ],
#                 },
#                 {
#                     headers => "...",
#                     type    => "image/gif",
#                     name    => "722370018427e0cb56ef03.gif",
#                     cid     => '000a01c69f03$d48fa5e0$0202a8c0@m',
#                     body    => "...",
#                 },
#             ],
#         },
#         {
#             headers  => "...",
#             type     => "image/jpeg",
#             filename => "_devochka.jpeg",
#             body     => "...",
#         },
#     ],
# }
# die "message must be encrypted" if !$mime_info->{encrypted};
# die "message must be signed by authorized developer"
#     if !$mime_info->{signed_by}{ $fingerprint{powerman} }
#     && !$mime_info->{signed_by}{ $fingerprint{nikita} }
#     ;
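A structure of this shape can be consumed with plain Perl. The following is only a sketch and not part of the original post: it assumes a hash laid out like the interface above, and the sample data, the fingerprint strings, and the helper names `leaf_parts` and `signed_by_any` are all hypothetical. `mime_unpack` itself is not shown anywhere in the thread and is not implemented here.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape of the interface above.
my $mime_info = {
    type      => 'multipart/mixed',
    encrypted => 1,
    signed_by => { 'FPR-POWERMAN' => 1 },
    body      => [
        {
            type => 'multipart/alternative',
            body => [
                { type => 'text/plain', charset => 'koi8-r', body => '...' },
                { type => 'text/html',  charset => 'koi8-r', body => '...' },
            ],
        },
        { type => 'image/jpeg', filename => '_devochka.jpeg', body => '...' },
    ],
};

# Recursively collect the leaf (non-multipart) parts, depth first.
# A part is a container when its body is an array reference.
sub leaf_parts {
    my ($part) = @_;
    return $part if ref($part->{body}) ne 'ARRAY';
    return map { leaf_parts($_) } @{ $part->{body} };
}

# Number of the given fingerprints that signed the message (0 means none).
sub signed_by_any {
    my ($info, @fingerprints) = @_;
    return scalar grep { $info->{signed_by}{$_} } @fingerprints;
}

my @types = map { $_->{type} } leaf_parts($mime_info);
print join(', ', @types), "\n";

die "message must be encrypted" if !$mime_info->{encrypted};
die "message must be signed by authorized developer"
    if !signed_by_any($mime_info, 'FPR-POWERMAN', 'FPR-NIKITA');
```

With the sample data this prints `text/plain, text/html, image/jpeg` and neither check dies.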

Replies are listed 'Best First'.
Re: Reliable email parsing
by xdg (Monsignor) on Sep 13, 2006 at 12:10 UTC
    By "reliable" I mean:
    * defect free
    * complies with all email-format-related RFCs

    To me, reliable means that it deals well with emails that don't follow the RFCs. I have found Mail::Box to be the best at dealing with whatever the internet throws at it. Yes, it's a complex OO beast, but it's been around for years and it's been battle tested.

    "Defect free" is a meaningless phrase to me and it's unrealistic in any complex piece of software -- which I would consider any email parser to be. How would you confirm it anyway? Create a test set of emails and see if each parser can deal with it.

    If you really want to do a code review, Mail::Box::Parser, Mail::Box::Parser::Perl and Mail::Box::Parser::C are all fairly self-evident -- you don't really need to understand all the OO code to examine the parsers.

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      Dealing with emails that don't follow the RFCs is also required, of course. Here is good information about such emails: http://cr.yp.to/immhf.html.

      But when I say "RFC compliant" I mean support for all possible email address formats, comments in email headers, and all the locale/language/encoding-specific things, for example:

      Content-Disposition: attachment;
       filename*0*=koi8-r'ru'%F0%D2%C9%D7%C5%D4;
       filename*1*=%20from;
       filename*2=" russia.txt"
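The header above uses RFC 2231 parameter continuations with a charset and percent encoding. As a rough illustration of what decoding it involves, here is a minimal sketch using only the core Encode module. The sub name `rfc2231_param` is made up for this example, and the sketch is deliberately simplified: it assumes the parameters are separated by `;`, that quoted values contain no `;` or escaped quotes, and it handles only the starred forms (`name*N*=`, `name*N=`, `name*=`). It is not a full header parser.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode one RFC 2231 parameter from a header line (simplified sketch).
sub rfc2231_param {
    my ($header, $name) = @_;
    my (@segments, $charset);
    for my $piece (split /;/, $header) {
        # name*N*=value  encoded segment N
        # name*N=value   plain segment N
        # name*=value    single encoded value
        next unless $piece =~ /^\s*\Q$name\E\*(\d*)(\*?)=(.*?)\s*$/s;
        my ($index, $star, $value) = ($1, $2, $3);
        my $encoded = $star || $index eq '';
        $index = 0 if $index eq '';
        if ($encoded) {
            # The first encoded segment starts with charset'language'.
            if ($index == 0 && $value =~ s/^([^']*)'[^']*'//) {
                $charset = $1;
            }
            $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;   # percent-decode to bytes
        }
        else {
            $value =~ s/^"(.*)"$/$1/s;                      # strip surrounding quotes
        }
        $segments[$index] = $value;
    }
    return unless @segments;
    my $bytes = join '', map { defined $_ ? $_ : '' } @segments;
    return (defined($charset) && length($charset))
        ? decode($charset, $bytes)
        : $bytes;
}

my $header = q{Content-Disposition: attachment;}
           . q{ filename*0*=koi8-r'ru'%F0%D2%C9%D7%C5%D4;}
           . q{ filename*1*=%20from;}
           . q{ filename*2=" russia.txt"};

binmode STDOUT, ':encoding(UTF-8)';
print rfc2231_param($header, 'filename'), "\n";
```

For this header the result is `Привет from russia.txt`. A real parser would also have to cope with folded lines, comments, quoted semicolons and non-compliant senders, which is where most of the complexity discussed in this thread comes from.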

      As for 'defect free': there is enough information on this topic now. A good example is DJB's software and most of the basic, simple *NIX utilities. I'm convinced that more code == more bugs, so I always prefer smaller/simpler solutions.

      I'm sure a solution compliant with all these RFCs can be coded in under ~1000-1500 lines of code (I've partially done it already).

        > I'm sure solution compliant to all these RFC can be coded under ~1000-1500 lines of code (I've partially done it already).

        Not to be rude, but often a large portion of code complexity comes from the last bits and pieces, that get left till last because they are hard to implement.

        So to sum up the apparent answers.

        No, there isn't yet something that does what you want. Email:: should be close. I'm sure they'd appreciate your code to help finish off the set of Email:: modules.
        A good example is DJB's software
        OK, now I know you're trolling.
      OK, I'm playing with Mail::Box now. Before trying to parse emails I need to parse my mailbox to fetch the individual emails. So I wrote a simple one-liner that counts the messages in my mbox file 'Mail/-default'.
      $ time perl -MMail::Box::Manager -le '
          $m=Mail::Box::Manager->new;
          $f=$m->open(folder=>"Mail/-default");
          print scalar $f->messages
        '
      Unexpected end of header (C parser): charset="iso-8859-1"
      3554

      real 0m8.222s
      user 0m7.928s
      sys 0m0.200s
      Oops! Mutt says there are 3549 messages in this file, not 3554... So I've written my own reader for the mbox format:
      $ time ./mbox_scan Mail/-default
      3549 Mail/-default

      real 0m0.492s
      user 0m0.408s
      sys 0m0.080s
      Hmm. 16 times faster?! Wow. And correct: it found 3549 messages, just like mutt. So maybe I misunderstand something about this world, but WHY is my pure-Perl parser so much faster than the C parser in a mature CPAN module? OK, maybe Mail::Box does a lot of additional parsing that I don't do... but why does it produce incorrect results?

      Here is the code of my parser, if you're interested (sorry, the lines run up to 80 columns):

        Well, your test mailbox looks to have invalid data (or at least invalid as far as Mail::Box is concerned; looks to be a bad MIME header). Figure out what the offending message is and let the maintainers know.

        As for speed, your code is doing nothing more than reading the message body in; Mail::Box is building up objects to represent each of the messages. It isn't really surprising that code which (basically) throws away the data it reads, rather than doing anything useful with it, runs faster.
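To illustrate why a bare counter can be so much cheaper than a full parser: counting the messages in an mbox needs little more than a scan for separator lines. The sketch below is not powerman's mbox_scan (its code does not appear in this thread); the sub name `count_mbox_messages` and the sample mailbox are invented for illustration. It assumes the common convention that a message starts at a line beginning with "From " that is either the first line of the file or directly follows a blank line, and that the writer has escaped body lines starting with "From ".

```perl
use strict;
use warnings;

# Count messages in an mbox stream by looking for "From " separator lines
# that are the first line or directly follow a blank line.
sub count_mbox_messages {
    my ($fh) = @_;
    my ($count, $prev_blank) = (0, 1);
    while (my $line = <$fh>) {
        $count++ if $prev_blank && $line =~ /^From /;
        $prev_blank = ($line =~ /^\r?$/) ? 1 : 0;   # blank line, LF or CRLF
    }
    return $count;
}

# Hypothetical two-message mailbox held in memory for the demonstration.
my $mbox = <<'MBOX';
From alice@example.org Wed Sep 13 11:44:00 2006
Subject: first

Hello,
From the start this line is body text (no blank line before it).

From bob@example.org Wed Sep 13 12:10:00 2006
Subject: second

>From-quoted lines are escaped by the writer.
MBOX

open my $fh, '<', \$mbox or die "cannot open in-memory mbox: $!";
print count_mbox_messages($fh), "\n";
close $fh;
```

This prints `2` for the sample. Nothing here parses headers, MIME structure or charsets, so a timing comparison against a module that builds message objects measures different amounts of work.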

Re: Reliable email parsing
by davorg (Chancellor) on Sep 13, 2006 at 11:51 UTC

    You should look at the Email::* modules from the Perl Email Project, perhaps Email::Simple or Email::MIME will do what you want. And if they don't, then the project members seem pretty open to receiving bug reports and patches.

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Thanks! I'll check them, but...

      I've noticed 818 Perl modules (packages) listed on this site. I agree, this task is complex. But having 818 modules for a single task suggests that none of them has really solved it. :(

      UPDATE: I've checked them. They are far from compliant with all the RFCs I've listed. :( So my question still stands: does an implementation compliant with all (or most) of the listed RFCs exist?

        I've checked them. They are far from compliant with all the RFCs I've listed. :( So my question still stands: does an implementation compliant with all (or most) of the listed RFCs exist?

        Well, as I said, I'm pretty sure that's pretty well all of the email modules on CPAN. So if they don't come up to your high standards, then it looks like you're out of luck.

        Or... here's a wild and crazy idea. If you have specific areas where you know the existing modules have problems, then why not get involved with the Perl Email Project and help them fix those problems. Your involvement could be as small as raising bugs against the existing modules pointing out their deficiencies, or perhaps you could go as far as creating test cases that demonstrate the problems, or maybe you could even produce patches that fix them.

        Just saying that there are problems, doesn't really achieve much. Documenting the problems and helping to fix them benefits everyone.

        --
        <http://dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg

        I've noticed 818 Perl modules (packages) listed on this site

        Yes, well that's because they seem to list every mail handling module that they found on CPAN. As I understand it, the Email::* namespace is supposed to be a complete set of modules for carrying out all email processing in Perl - all of which will work nicely with each other. Once that set is complete (and I'm afraid I don't know how far off that is) anything in any of the other namespaces will be redundant.

        This is a similar approach to the one taken by the Perl DateTime project a few years ago.

        --
        <http://dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg

        I've noticed 818 Perl modules (packages) listed on this site. I agree, this task is complex. But having 818 modules for a single task suggests that none of them has really solved it. :(
        The 818 modules are a nearly exhaustive list of any email-related module available from CPAN, and they're certainly not all for a single task. The list includes POP3 clients, SMTP servers, MIME parsers, local delivery agents, pipemailers, antispam tools, etc, etc.

        The reason they're all listed on the PEP wiki is to provide an easy way to categorize and annotate the modules, so that a subset of those modules can be recommended and improved as the current "best practice" for email handling in Perl.

        UPDATE: I've checked them. They are far from compliant with all the RFCs I've listed. :( So my question still stands: does an implementation compliant with all (or most) of the listed RFCs exist?
        Well, given that I use quite a few of those modules on a daily basis, I find it hard to believe that they're all "far away" from compliance. Do you have any specific issues to point out?
Re: Reliable email parsing
by dave0 (Friar) on Sep 13, 2006 at 13:53 UTC
    The most robust MIME-parsing module on CPAN is probably MIME-tools.

    It suffers from about 10 years of accreted edge-cases and bugfixes tacked on top of the original design, so while it isn't nice code by any means, it handles just about anything you can throw at it (RFC-compliant or not).

    I currently work for the maintainer of MIME-Tools, so it's on my TODO list to start refactoring some of the 10 years of cruft -- patches are welcome.

      I have to agree, dave0. I have spent a lot of time over the last 20 years or so writing email clients, originally in C, then C++ and, horror of horrors, in VB. But the most trouble-free has been Perl with Net::SMTP, Net::POP3 and MIME::Tools. About a year ago I took all the email-related RFCs, plus things like the DJB documents, and bound them into a book about 40mm thick. With all of that as reference material, and the Perl modules, I haven't found a thing I can't handle.

      Good luck with your refactoring efforts. I, for one, look forward to seeing the results and the extended lifetime of a great family of modules.

      jdtoronto

Re: Reliable email parsing
by ruoso (Curate) on Sep 15, 2006 at 08:39 UTC

    Oh, and don't forget to share with us your test base so all the bugs you found can be fixed.

    daniel
