Category: |
E-Mail Programs |
Author/Contact Info |
Neil Watson
watson-wilson.ca
|
Description: |
This code demonstrates how to use mime::tools to parse an incomming email into 6 parts:
- From
- To
- Subject
- Body
- Attachment names
- Actual attachment files
|
#!/usr/bin/perl -w
use strict;
use warnings;
use MIME::Parser;
use MIME::Entity;
use MIME::Body;
my (@body, $i, $subentity);
my $parser = new MIME::Parser;
#new attachment code start
#these are the types of attachments allowed
my @attypes= qw(application/msword
application/pdf
application/gzip
application/tar
application/tgz
application/zip
audio/alaw-basic
audio/vox
audio/wav
image/bmp
image/gif
image/jpeg
text/html
text/plain
text/vxml
);
my ($x, $newx, @attachment, $attachment, @attname, $bh, $nooatt);
#new attachement code end
my $to; #contains the message to header
my $from; #contains the message from header
my $subject; #contains the message subject heaer
my $body; #contains the message body
$parser->ignore_errors(1);
$parser->output_to_core(1);
my $entity = $parser->parse(\*STDIN);
my $error = ($@ || $parser->last_error);
#get email headers
my $header = $entity->head;
$subject = $header->get('Subject');
$to = $header->get('To');
$from = $header->get('From');
chomp($subject);
chomp($to);
chomp($from);
#get email body
if ($entity->parts > 0){
for ($i=0; $i<$entity->parts; $i++){
$subentity = $entity->parts($i);
if (($subentity->mime_type =~ m/text\/html/i) || ($subentity->
+mime_type =~ m/text\/plain/i)){
$body = join "", @{$subentity->body};
#new attachment code start
next;
#new attachment code end
}
#this elsif is needed for Outlook's nasty multipart/alternativ
+e messages
elsif ($subentity->mime_type =~ m/multipart\/alternative/i){
$body = join "", @{$subentity->body};
#split html and text parts
@body = split /------=_NextPart_\S*\n/, $body;
#assign the first part of the message,
#hopefully the text, part as the body
$body = $body[1];
#remove leading headers from body
$body =~ s/^Content-Type.*Content-Transfer-Encoding.*?\n+/
+/is;
#new attachment code start
next;
#new attachment code end
}
#new attachment code start
#grab attachment name and contents
foreach $x (@attypes){
if ($subentity->mime_type =~ m/$x/i){
$bh = $subentity->bodyhandle;
$attachment = $bh->as_string;
push @attachment, $attachment;
push @attname, $subentity->head->mime_attr('content-di
+sposition.filename');
}else{
#some clients send attachments as application/x-type.
#checks for that
$newx = $x
$newx =~ s/application\/(.*)/application\/x-$1/i;
if ($subentity->mime_type =~ m/$newx/i){
$bh = $subentity->bodyhandle;
$attachment = $bh->as_string;
push @attachment, $attachment;
push @attname, $subentity->head->mime_attr('conten
+t-disposition.filename');
}
}
}
$nooatt = $#attachment + 1;
#new attachment code end
}
} else {
$body = join "", @{$entity->body};
}
#body may contain html tags. they will be stripped here
$body =~ s/(<br>)|(<p>)/\n/gi; #create new lines
$body =~ s/<.+\n*.*?>//g; #remove all <> html tages
$body =~ s/(\n|\r|(\n\r)|(\r\n)){3,}//g; #remove any extra new lines
$body =~ s/\ //g; #remove html   characters
#remove trailing whitespace from body
$body =~ s/\s*\n+$//s;
open MAIL, ("|/usr/sbin/sendmail -t") || die "Unable to send mail: $!"
+;
print MAIL "To: $from\n";
print MAIL "From: root\n";
print MAIL "Subject: mime parser test\n\n";
print MAIL "Messege was contructed as follows:
\$from: $from
\$to: $to
\$subject: $subject
\$body: $body
number of attachments: $nooatt
\$attachment(s): ".join ", ", @attname;
close MAIL;
#new attachment code start
#write contents of each attachment to a file
for ($x = 0; $x < $nooatt; $x++){
open FH, ">/tmp/attachments/$attname[$x]" || die "cannot open FH:
+$!\n";
print FH "$attachment[$x]";
close FH;
}
#new attachment code end
|
Re: Parsing emails with attachments
by davis (Vicar) on Sep 06, 2002 at 11:31 UTC
|
Hi,
It might be worth noting that MIME entities can themselves contain other entities, i.e., every email attachment can contain other email attachments. Therefore, to process these correctly, you'd need to break the email attachment parser into a recursive sub.
Example:
sub extract_files {
my $entity = shift;
my $num_parts = $entity->parts; # how many mime parts?
if ($num_parts) { # we have a multipart mime message
print "Multiple subentities found - parsing\n";
my $message;
foreach (1..$num_parts) {
$message .= extract_files( $entity->parts($_ -
+ 1) );
}
return $message;
} else {
#Do some stuff....
}
}
There's a node around here (probably posted a couple of months ago) that details this a little better, but I can't find it using Super Search, or thepen's google-able archive.
Cheers.
davis
Is this going out live?
No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist
| [reply] [d/l] |
|
Hi,
I stored above script as read_mail.pl and tried to execute in passion below
perl read_mail.pl < /var/mail/pgoupal
But this reads only first mail, not rest of the emails. Pls help me on this.
| [reply] |
Re: Parsing emails with attachments
by LTjake (Prior) on Sep 05, 2002 at 17:29 UTC
|
After a quick chat in the CB, it was noted that this line:
$body =~ s/<.+\n*.*?>//g; #remove all <> html tages
isn't reliable.
Excerpt from CB:
<ferrency> LTjake: given sufficiently complicated html, no regex can remove the tags and only the tags
<Petruchio> Hehe... I'm not so sure, given a sufficiently complicated regex. :-)
Maybe Ovid's code could help? (thread parent) | [reply] [d/l] |
|
use Benchmark 'cmpthese';
my $data = join'',<DATA>;
print untag($data),
"\n\n\n",
'X' x 79,
"\n\n\n",
untagg($data),
"\n\n\n",
'X' x 79,
"\n\n\n",;
warn "benchmarking the dumb way";
cmpthese(-3,
{
regex => sub { untag($data);},
parse => sub { untagg($data); },
});
warn "benchmarking the smart way";
warn "benchmarking the smart way";
use HTML::Parser;
my $p = HTML::Parser->new( api_version => 3);
my $ret ="";
$p->handler(default =>
sub { $ret .= $_[0] if $_[1] eq 'text'},'text,event');
cmpthese(-3,
{
regex => sub { untag($data);},
parse => sub { $p->parse($data); },
});
sub untagg {
local $_ = $_[0] || $_;
require HTML::Parser;
my $p = HTML::Parser->new( api_version => 3);
my $ret ="";
$p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'}
,'text,event');
$p->parse($_);
return($ret);
}
sub untag {
local $_ = $_[0] || $_;
# ALGORITHM:
# find < ,
# comment <!-- ... -->,
# or comment <? ... ?> ,
# or one of the start tags which require correspond
# end tag plus all to end tag
# or if \s or ="
# then skip to next "
# else [^>]
# >
s{
< # open tag
(?: # open group (A)
(!--) | # comment (1) or
(\?) | # another comment (2) or
(?i: # open group (B) for /i
( TITLE | # one of start tags
SCRIPT | # for which
APPLET | # must be skipped
OBJECT | # all content
STYLE # to correspond
) # end tag (3)
) | # close group (B), or
([!/A-Za-z]) # one of these chars, remember in (4)
) # close group (A)
(?(4) # if previous case is (4)
(?: # open group (C)
(?! # and next is not : (D)
[\s=] # \s or "="
["`'] # with open quotes
) # close (D)
[^>] | # and not close tag or
[\s=] # \s or "=" with
`[^`]*` | # something in quotes ` or
[\s=] # \s or "=" with
'[^']*' | # something in quotes ' or
[\s=] # \s or "=" with
"[^"]*" # something in quotes "
)* # repeat (C) 0 or more times
| # else (if previous case is not (4))
.*? # minimum of any chars
) # end if previous char is (4)
(?(1) # if comment (1)
(?<=--) # wait for "--"
) # end if comment (1)
(?(2) # if another comment (2)
(?<=\?) # wait for "?"
) # end if another comment (2)
(?(3) # if one of tags-containers (3)
</ # wait for end
(?i:\3) # of this tag
(?:\s[^>]*)? # skip junk to ">"
) # end if (3)
> # tag closed
}{}gsx; # STRIP THIS TAG
return $_ ? $_ : "";
}
__DATA__
u h a h
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
<li>Unix</li>
<li>Perl</li>
<li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but
+the text does.
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>
> < >
< > <
! ] [ ] [ ] [ ] [ - <!-- --> 2 3 4 5 5
<<a href<<a>>
<!-- foo bar -->
<SCRIPT language="javascript">
// this is valid html
// whether you like it or not
// same goes for older browsers
</SCRIPT>
And the results are ;)
| [reply] [d/l] |
|
I am using this script but I want the attachments to go to the same directory. I put this in the script with $parser->output_dir("opt/htdocs/webcache/attachments/"); Then I take the attachment names and hyperlink them in the body of the message to go to a loadfile for a database.
attachment: file.doc
body: https://webpage/webcache/attachments/file.doc
the issue is that when MIME::Entity puts the file there it does collision resolution, I need to know how to get the filename it is actually assigning to the file and not the "recommended" filename from the message. Can you help with that?
| [reply] |
|
try to remove all <> tags with:
$body =~ s/<[^>]*>//sg;
The s modifier (at the end) tells the search to treat the $body variable as a single line. so \n's don't matter.
| [reply] [d/l] |
|
|