<?xml version="1.0" encoding="windows-1252"?>
<node id="733265" title="Extracting email addresses from mailto URIs using HTML and URI parsers instead of regular expressions" created="2008-12-30 10:07:34" updated="2008-12-30 10:07:34">
<type id="11">
note</type>
<author id="48685">
dorward</author>
<data>
<field name="doctext">
I don't like using regular expressions on HTML documents, so my approach would be to use a proper HTML parser instead. This has a number of benefits; for example, the parser decodes any character entities used in the attribute value that holds the email address, so you get the real address back rather than the escaped markup.
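
To illustrate the entity point without touching the network, here is a minimal sketch. The sample markup and the mailto_addresses helper are made up for this example (they are assumptions, not part of any real page or module); the only library calls used are HTML::TokeParser's get_tag and URI's scheme and to methods.

&lt;code&gt;
#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical input: the href is obfuscated with character references
# ("&amp;#109;" is "m", "&amp;#64;" is "@").
my $html = '&lt;a href="&amp;#109;ailto:user&amp;#64;example.com"&gt;Contact&lt;/a&gt;';

# Hypothetical helper: return the address part of every mailto link
# found in a string of HTML.
sub mailto_addresses {
    my ($html) = @_;
    my @addresses;
    my $p = HTML::TokeParser-&gt;new( \$html );
    while ( my $tag = $p-&gt;get_tag('a') ) {
        # Attribute values arrive with their entities already decoded.
        my $href = $tag-&gt;[1]{href};
        next unless defined $href;
        my $uri = URI-&gt;new($href);
        next unless defined $uri-&gt;scheme &amp;&amp; $uri-&gt;scheme eq 'mailto';
        push @addresses, $uri-&gt;to;
    }
    return @addresses;
}

print "$_\n" for mailto_addresses($html);
&lt;/code&gt;

With the sample markup above I would expect this to print user@example.com. A regular expression looking for a literal "mailto:" would not match the obfuscated href at all.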

The script below uses &lt;a href="http://search.cpan.org/~gaas/libwww-perl-5.822/lib/LWP/UserAgent.pm"&gt;LWP::UserAgent&lt;/a&gt; to fetch the HTML document, &lt;a href="http://search.cpan.org/~gaas/HTML-Parser-3.59/lib/HTML/TokeParser.pm"&gt;HTML::TokeParser&lt;/a&gt; to read it, and &lt;a href="http://search.cpan.org/~gaas/URI-1.37/URI.pm"&gt;URI&lt;/a&gt; to parse the URIs in it.

&lt;code&gt;
#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;
use URI;

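my $ua = LWP::UserAgent-&gt;new;
$ua-&gt;timeout(10);
my $root_uri = 'http://example.com/';
my $response = $ua-&gt;get($root_uri);
if ($response-&gt;is_success) {
    # decoded_content undoes any Content-Encoding and charset encoding
    my $html = $response-&gt;decoded_content;
    my $p = HTML::TokeParser-&gt;new( \$html );
    # Visit every &lt;a&gt; start tag; $tag-&gt;[1] is the hash of its attributes
    while (my $tag = $p-&gt;get_tag('a')) {
        my $href = $tag-&gt;[1]{href};
        next unless $href;
        # Resolve relative links against the page URI so every link has a scheme
        my $uri = URI-&gt;new_abs( $href, $root_uri );
        next unless ($uri-&gt;scheme eq 'mailto');
        # For mailto URIs, to() returns the unescaped recipient address(es)
        print $uri-&gt;to, "\n";
    }
} else {
    die $response-&gt;status_line;
}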
&lt;/code&gt;</field>
<field name="root_node">
733104</field>
<field name="parent_node">
733104</field>
</data>
</node>
