Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Parsing emails with attachments

by neilwatson (Curate)
on Sep 05, 2002 at 16:54 UTC ( #195442=sourcecode: print w/ replies, xml ) Need Help??

Category: E-Mail Programs
Author/Contact Info Neil Watson
watson-wilson.ca
Description:

This code demonstrates how to use mime::tools to parse an incomming email into 6 parts:

  1. From
  2. To
  3. Subject
  4. Body
  5. Attachment names
  6. Actual attachment files
#!/usr/bin/perl -w

use strict;
use warnings;
use MIME::Parser;
use MIME::Entity;
use MIME::Body;

my (@body, $i, $subentity);
my $parser = new MIME::Parser;

#new attachment code start
#these are the types of attachments allowed
my @attypes= qw(application/msword
                application/pdf
                application/gzip
                application/tar
                application/tgz
                application/zip
                audio/alaw-basic
                audio/vox
                audio/wav
                image/bmp
                image/gif
                image/jpeg
                text/html
                text/plain
                text/vxml
);
my ($x, $newx, @attachment, $attachment, @attname, $bh, $nooatt);
#new attachement code end

my $to;      #contains the message to header
my $from;    #contains the message from header
my $subject; #contains the message subject heaer
my $body;    #contains the message body


$parser->ignore_errors(1);
$parser->output_to_core(1);

my $entity = $parser->parse(\*STDIN);
my $error = ($@ || $parser->last_error);

#get email headers
my $header = $entity->head;
$subject = $header->get('Subject');
$to = $header->get('To');
$from = $header->get('From');

chomp($subject);
chomp($to);
chomp($from);

#get email body
if ($entity->parts > 0){
    for ($i=0; $i<$entity->parts; $i++){
        
        $subentity = $entity->parts($i);
        
        if (($subentity->mime_type =~ m/text\/html/i) || ($subentity->
+mime_type =~ m/text\/plain/i)){
            $body = join "",  @{$subentity->body};
            #new attachment code start
            next;
            #new attachment code end
        }
        
        #this elsif is needed for Outlook's nasty multipart/alternativ
+e messages
        elsif ($subentity->mime_type =~ m/multipart\/alternative/i){

            $body = join "",  @{$subentity->body};
            
            #split html and text parts
            @body = split /------=_NextPart_\S*\n/, $body;
            
            #assign the first part of the message,
            #hopefully the text, part as the body
            $body = $body[1]; 
            
            #remove leading headers from body
            $body =~ s/^Content-Type.*Content-Transfer-Encoding.*?\n+/
+/is;
            #new attachment code start
            next;
            #new attachment code end
        }

        #new attachment code start
        #grab attachment name and contents
        foreach $x (@attypes){
            if ($subentity->mime_type =~ m/$x/i){
                $bh = $subentity->bodyhandle;
                $attachment = $bh->as_string;
                push @attachment, $attachment;
                push @attname, $subentity->head->mime_attr('content-di
+sposition.filename');
            }else{
                #some clients send attachments as application/x-type.
                #checks for that
                $newx = $x
                $newx =~ s/application\/(.*)/application\/x-$1/i;
                if ($subentity->mime_type =~ m/$newx/i){
                    $bh = $subentity->bodyhandle;
                    $attachment = $bh->as_string;
                    push @attachment, $attachment;
                    push @attname, $subentity->head->mime_attr('conten
+t-disposition.filename');
                }
            }
            
        }
        $nooatt = $#attachment + 1;
        #new attachment code end
    }
} else {
   $body = join "",  @{$entity->body};
}

#body may contain html tags. they will be stripped here
$body =~ s/(<br>)|(<p>)/\n/gi;           #create new lines
$body =~ s/<.+\n*.*?>//g;                #remove all <> html tages
$body =~ s/(\n|\r|(\n\r)|(\r\n)){3,}//g; #remove any extra new lines
$body =~ s/\&nbsp;//g;                   #remove html &nbsp characters

#remove trailing whitespace from body
$body =~ s/\s*\n+$//s;

open MAIL, ("|/usr/sbin/sendmail -t") || die "Unable to send mail: $!"
+;
print MAIL "To: $from\n";
print MAIL "From: root\n";
print MAIL "Subject: mime parser test\n\n";

print MAIL "Messege was contructed as follows:
\$from:    $from
\$to:      $to
\$subject: $subject

\$body:    $body
number of attachments: $nooatt
\$attachment(s): ".join ", ", @attname;
close MAIL;

#new attachment code start
#write contents of each attachment to a file
for ($x = 0; $x < $nooatt; $x++){
    open FH, ">/tmp/attachments/$attname[$x]" || die "cannot open FH: 
+$!\n";
    print FH "$attachment[$x]";
    close FH;
}
#new attachment code end

Comment on Parsing emails with attachments
Download Code
Re: Parsing emails with attachments
by LTjake (Prior) on Sep 05, 2002 at 17:29 UTC
    After a quick chat in the CB, it was noted that this line:
    $body =~ s/<.+\n*.*?>//g; #remove all <> html tages
    isn't reliable.

    Excerpt from CB:
    <ferrency> LTjake: given sufficiently complicated html, no regex can remove the tags and only the tags
    <Petruchio> Hehe... I'm not so sure, given a sufficiently complicated regex. :-)

    Maybe Ovid's code could help? (thread parent)
      I've said it before, and I'll say it again, this is one damn interesting regex, and I didn't write it ;) ( strip HTML tags )
      use Benchmark 'cmpthese'; my $data = join'',<DATA>; print untag($data), "\n\n\n", 'X' x 79, "\n\n\n", untagg($data), "\n\n\n", 'X' x 79, "\n\n\n",; warn "benchmarking the dumb way"; cmpthese(-3, { regex => sub { untag($data);}, parse => sub { untagg($data); }, }); warn "benchmarking the smart way"; warn "benchmarking the smart way"; use HTML::Parser; my $p = HTML::Parser->new( api_version => 3); my $ret =""; $p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'},'text,event'); cmpthese(-3, { regex => sub { untag($data);}, parse => sub { $p->parse($data); }, }); sub untagg { local $_ = $_[0] || $_; require HTML::Parser; my $p = HTML::Parser->new( api_version => 3); my $ret =""; $p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'} ,'text,event'); $p->parse($_); return($ret); } sub untag { local $_ = $_[0] || $_; # ALGORITHM: # find < , # comment <!-- ... -->, # or comment <? ... ?> , # or one of the start tags which require correspond # end tag plus all to end tag # or if \s or =" # then skip to next " # else [^>] # > s{ < # open tag (?: # open group (A) (!--) | # comment (1) or (\?) | # another comment (2) or (?i: # open group (B) for /i ( TITLE | # one of start tags SCRIPT | # for which APPLET | # must be skipped OBJECT | # all content STYLE # to correspond ) # end tag (3) ) | # close group (B), or ([!/A-Za-z]) # one of these chars, remember in (4) ) # close group (A) (?(4) # if previous case is (4) (?: # open group (C) (?! # and next is not : (D) [\s=] # \s or "=" ["`'] # with open quotes ) # close (D) [^>] | # and not close tag or [\s=] # \s or "=" with `[^`]*` | # something in quotes ` or [\s=] # \s or "=" with '[^']*' | # something in quotes ' or [\s=] # \s or "=" with "[^"]*" # something in quotes " )* # repeat (C) 0 or more times | # else (if previous case is not (4)) .*? # minimum of any chars ) # end if previous char is (4) (?(1) # if comment (1) (?<=--) # wait for "--" ) # end if comment (1) (?(2) # if another comment (2) (?<=\?) # wait for "?" ) # end if another comment (2) (?(3) # if one of tags-containers (3) </ # wait for end (?i:\3) # of this tag (?:\s[^>]*)? # skip junk to ">" ) # end if (3) > # tag closed }{}gsx; # STRIP THIS TAG return $_ ? $_ : ""; } __DATA__ u h a h <html> <head> <title>This title contains Perl but does not get changed.</title> </head> <body> <p>This is some text containing the term 'perl'.</p> <ol> <li>Unix</li> <li>Perl</li> <li>Linux</li> </ol> <p>Notice how the term perl in the following link doesn't change, but +the text does. <a href="http://www.perlmonks.org">Perlmonks.org</a></p> </body> </html> > < > < > < ! ] [ ] [ ] [ ] [ - <!-- --> 2 3 4 5 5 <<a href<<a>> <!-- foo bar --> <SCRIPT language="javascript"> // this is valid html // whether you like it or not // same goes for older browsers </SCRIPT>
      And the results are ;)
        I am using this script but I want the attachments to go to the same directory. I put this in the script with $parser->output_dir("opt/htdocs/webcache/attachments/"); Then I take the attachment names and hyperlink them in the body of the message to go to a loadfile for a database. attachment: file.doc body: https://webpage/webcache/attachments/file.doc the issue is that when MIME::Entity puts the file there it does collision resolution, I need to know how to get the filename it is actually assigning to the file and not the "recommended" filename from the message. Can you help with that?
      try to remove all <> tags with:
      $body =~ s/<[^>]*>//sg;
      The s modifier (at the end) tells the search to treat the $body variable as a single line. so \n's don't matter.
Re: Parsing emails with attachments
by davis (Vicar) on Sep 06, 2002 at 11:31 UTC
    Hi,

    It might be worth noting that MIME entities can themselves contain other entities, i.e., every email attachment can contain other email attachments. Therefore, to process these correctly, you'd need to break the email attachment parser into a recursive sub.

    Example:

    sub extract_files { my $entity = shift; my $num_parts = $entity->parts; # how many mime parts? if ($num_parts) { # we have a multipart mime message print "Multiple subentities found - parsing\n"; my $message; foreach (1..$num_parts) { $message .= extract_files( $entity->parts($_ - + 1) ); } return $message; } else { #Do some stuff.... } }

    There's a node around here (probably posted a couple of months ago) that details this a little better, but I can't find it using Super Search, or thepen's google-able archive.

    Cheers.
    davis
    Is this going out live?
    No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist
      Hi, I stored above script as read_mail.pl and tried to execute in passion below perl read_mail.pl < /var/mail/pgoupal But this reads only first mail, not rest of the emails. Pls help me on this.

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://195442]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (8)
As of 2014-09-03 02:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (35 votes), past polls