Re: How to extract an email address from a mailto URL?
by linuxer (Curate) on Dec 29, 2008 at 20:23 UTC
|
Check (e.g. with grep) for lines that contain 'mailto:' (be more specific if you like, e.g. also match the 'href' ...);
then use Regexp::Common together with Regexp::Common::Email::Address to identify the mail address in the matching lines.
PS: Don't remove your original question here.
Update: Regexp::Common::Email::Address (RCEA) added.
Re: How to extract an email address from a mailto URL?
by CountZero (Bishop) on Dec 29, 2008 at 21:04 UTC
|
grep is indeed the answer to your question if you can be sure that the whole of the <a> ... </a> element is on the same line.
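To illustrate the same-line caveat, here is a minimal sketch. The sample lines, the variable names and the deliberately simplified pattern are assumptions made up for this example; they are not taken from the original poster's file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: the second anchor is split across two lines.
my @lines = (
    qq{<a href="mailto:one\@example.com">one</a>\n},
    qq{<a\n},
    qq{  href="mailto:two\@example.com">two</a>\n},
);

# Simplified pattern (an assumption): an <a ...> tag whose href is a mailto: URL.
my $re = qr/<a\s[^>]*href=["']?mailto:([^"'>\s?]+)/i;

# Line by line: a line only matches if "<a" and the href sit on that same line.
my @per_line = map { /$re/ ? $1 : () } @lines;

# Whole document at once: \s also matches the newline inside the split tag.
my $whole   = join '', @lines;
my @slurped = $whole =~ /$re/g;

print "per line: @per_line\n";
print "slurped:  @slurped\n";
```

The line-based pass finds only the first address, while matching against the whole text finds both. That is the trade-off to keep in mind when deciding between grep and slurping the file.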
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Extracting email addresses from mailto URIs using HTML and URI parsers instead of regular expressions
by dorward (Curate) on Dec 30, 2008 at 15:07 UTC
|
I don't like using regular expressions on HTML documents, so my approach would be to use a proper HTML parser instead. This has a number of benefits, including the decoding of entities in the HTML representing the email address.
This code uses LWP::UserAgent to fetch the HTML document, HTML::TokeParser to read it, and URI to parse the URIs in it.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;
use URI;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
my $root_uri = 'http://example.com/';
my $response = $ua->get($root_uri);

if ($response->is_success) {
    my $html = $response->decoded_content;
    my $p = HTML::TokeParser->new( \$html );
    # Walk every <a> start tag in the document.
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless $href;
        # Resolve the link against the page URI, then keep only mailto: URIs.
        my $uri = URI->new_abs( $href, $root_uri );
        next unless ($uri->scheme eq 'mailto');
        print $uri->to, "\n";
    }
} else {
    die $response->status_line;
}
Re: How to extract an email address from a mailto URL?
by eye (Chaplain) on Dec 30, 2008 at 07:03 UTC
|
If you want to differentiate between addresses in anchor tags and other uses of "mailto:" in the file, read the entire file into memory and use the match operator (m//). As suggested previously, you should use Regexp::Common::Email::Address to help compose a regular expression for the email address and the enclosing HTML. I would use "\s+" between the "a" and "href" and "\s*" on either side of the equal sign to match HTML's treatment of whitespace. Note that HTML allows quoting with both single and double quotes. Also, older HTML allowed you to leave the value after the equal sign unquoted in some circumstances.
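A minimal sketch of that whole-file approach follows. To keep it runnable with core Perl only, it swaps in a deliberately simplified address pattern instead of Regexp::Common::Email::Address; the sample HTML, the $addr pattern and the extract_mailto_addresses helper are all assumptions made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input. For a real file, slurp it instead, e.g.:
#   my $html = do { local $/; open my $fh, '<', $file or die "$file: $!"; <$fh> };
my $html = <<'HTML';
<td>Fax: (301)931-1285
<br><a
   href = 'mailto:KHargrove@servpro1010.com'>KHargrove@servpro1010.com</a>
<a class="x" href="mailto:info@example.com?subject=Hi">Mail us</a>
<a href=mailto:sales@example.org>Sales</a>
Plain text mention of mailto:nobody@example.net outside any anchor.
</td>
HTML

# Simplified address pattern (an assumption for this sketch only);
# Regexp::Common::Email::Address would be the more thorough choice.
my $addr = qr/[\w.+-]+\@[\w-]+(?:\.[\w-]+)+/;

# Hypothetical helper: return the addresses found in <a href=mailto:...> tags.
sub extract_mailto_addresses {
    my ($text) = @_;
    my @found;
    while ( $text =~ m{
            <a \s+ [^>]*?        # opening anchor tag, any earlier attributes
            \b href \s* = \s*    # href with optional whitespace around the equal sign
            ["']?                # optional quote, single or double
            mailto: ($addr)      # the address itself
        }gix ) {
        push @found, $1;
    }
    return @found;
}

print "$_\n" for extract_mailto_addresses($html);
```

The plain-text "mailto:" mention is skipped because it is not inside an anchor tag. Note that this sketch does not decode HTML entities, so an address written with entities in the source would be missed.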
|
My experience with Perl goes back about 3 weeks, so some of what you are saying is Greek to me. Can you provide an example of how you would do it? The source file to pull the information from has the tag as follows:
showTollfree(1010)
// -->
</script>
Fax: (301)931-1285
<br><a href='mailto:KHargrove@servpro1010.com'>KHargrove@servpro1010.com</a>
</td>
I'm sorry to have to be wet nursed through this...but I have learned a ton of stuff over the last few weeks...I feel like my brain is going to explode!
|
Well, first install these two modules (and their unresolved dependencies, if there are any): Regexp::Common and Regexp::Common::Email::Address.
Then you can do something like this (Quickshot, untested):
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common qw(Email::Address);
use Email::Address;

my $filename = 'file_to_parse.dat';
open my $rh, '<', $filename or die "$filename: $!";

# Requirement: href=, mailto: and the mail address must be on the same line!
my @addresses =
    # Return nothing for a line whose address fails to match, instead of a stale $1.
    map  { m/mailto:($RE{Email}{Address})/o ? $1 : () }
    grep { m/href=.+?mailto:/ }
    <$rh>;
close $rh;

{
    # Print one address per line.
    local $, = local $\ = "\n";
    print @addresses;
}
__END__
Re: How to extract an email address from a mailto URL?
by Anonymous Monk on Dec 30, 2008 at 09:46 UTC