
How to extract an email address from a mailto URL?

by jdlev (Scribe)
on Dec 29, 2008 at 20:06 UTC ( #733104=perlquestion )
jdlev has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks! I'm trying to pull a simple piece of information from a text file. The information I wish to pull is in the following code:

a href = ''>Their email address /a

There are other a href tags on the page, and the only one I want to pull from is the one that contains mailto:. How would I extract it from the .txt file?

Thanks, Jeff

Replies are listed 'Best First'.
Re: How to extract an email address from a mailto URL?
by linuxer (Curate) on Dec 29, 2008 at 20:23 UTC

    check (e.g. grep) for lines, which contain 'mailto:' (be more specific if you like to; match the 'href' ...);
    use Regexp::Common together with Regexp::Common::Email::Address to identify the mail address in matching lines

    PS: Don't remove your original question here.


    1. RCEA added
Re: How to extract an email address from a mailto URL?
by CountZero (Bishop) on Dec 29, 2008 at 21:04 UTC
    grep is indeed the answer to your question if you can be sure that the whole of the 'a' ... '/a' phrase is on the same line.
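    To illustrate that caveat, here is a minimal sketch using only core Perl. The sub name mailto_lines and the sample HTML are made up for illustration, and the address pattern is a deliberately crude stand-in, not a real email validator (Regexp::Common::Email::Address, mentioned above, would be the more careful choice).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the addresses found on lines that contain
# both "href" and "mailto:". The [^\s"'>?]+ pattern is a crude stand-in
# for a real address regex.
sub mailto_lines {
    my @lines = @_;
    return map  { m/mailto:([^\s"'>?]+)/ ? $1 : () }
           grep { m/href\s*=.*mailto:/i } @lines;
}

# Made-up sample input: the second anchor is split across two lines.
my @sample = (
    qq{<a href="mailto:one\@example.com">One</a>\n},
    qq{<a href=\n},
    qq{   "mailto:two\@example.com">Two</a>\n},
);

print "$_\n" for mailto_lines(@sample);
```

    With this sample, only the address from the single-line anchor is reported. The anchor that is split across two lines is missed, because no single line contains both href and mailto:.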


    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Extracting email addresses from mailto URIs using HTML and URI parsers instead of regular expressions
by dorward (Curate) on Dec 30, 2008 at 15:07 UTC
    I don't like using regular expressions on HTML documents, so my approach would be to use a proper HTML parser instead. This has a number of benefits, including the decoding of entities in the HTML representing the email address. This code uses LWP::UserAgent to fetch the HTML document, HTML::TokeParser to read it, and URI to parse the URIs in it.
    #!/usr/bin/perl
    use strict;
    use warnings;

    use LWP::UserAgent;
    use HTML::TokeParser;
    use URI;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $root_uri = '';
    my $response = $ua->get($root_uri);

    if ($response->is_success) {
        my $html = $response->decoded_content;
        my $p    = HTML::TokeParser->new( \$html );

        while (my $tag = $p->get_tag('a')) {
            my $href = $tag->[1]{href};
            next unless $href;

            my $uri = URI->new_abs( $href, $root_uri );
            next unless ($uri->scheme eq 'mailto');

            print $uri->to, "\n";
        }
    }
    else {
        die $response->status_line;
    }
Re: How to extract an email address from a mailto URL?
by eye (Chaplain) on Dec 30, 2008 at 07:03 UTC
    If you want to differentiate between addresses in anchor tags and other uses of "mailto:" in the file, read the entire file into memory and use the match operator (m//). As suggested previously, you should use Regexp::Common::Email::Address to help compose a regular expression for the email address and enclosing HTML. I would use "\s+" between the "a" and "href" and "\s*" adjacent to the equal sign to match HTML's treatment of whitespace. Note that HTML allows quoting with both single and double quotes. Also, older HTML allowed you to not quote the information after the equal sign in some circumstances.
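    As a rough illustration of that advice, here is a minimal core-Perl sketch. It is not eye's own code: the sub name anchor_mailtos and the sample document are invented, and the address part is matched with a crude "everything up to a quote, whitespace, ? or >" pattern rather than Regexp::Common::Email::Address, so treat it as a starting point only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: find mailto addresses inside <a ... href=...> tags
# in a whole document held in one string.
sub anchor_mailtos {
    my ($html) = @_;
    my @found;
    while ($html =~ m{
            <a \s+ [^>]*? \b href \s* = \s*   # tag name, whitespace, then href =
            (?: " \s* mailto: ([^"?\s]+)      # double-quoted value
              | ' \s* mailto: ([^'?\s]+)      # single-quoted value
              |       mailto: ([^\s>?]+)      # unquoted value, as in older HTML
            )
        }gixs) {
        # Only one of the three capture groups is set per match.
        push @found, grep { defined $_ } ($1, $2, $3);
    }
    return @found;
}

# Made-up sample standing in for the poster's file.
my $sample = <<'HTML';
<a href="http://example.com/">Home</a>
<a class="c"
   href = 'mailto:jane@example.com?subject=Hi'>Jane</a>
<A HREF=mailto:bob@example.org>Bob</A>
Write to mailto:nobody@example.net (not inside an anchor tag).
HTML

# Slurp mode: with $/ undefined, one read returns the entire input.
open my $fh, '<', \$sample or die "open: $!";
my $html = do { local $/; <$fh> };
close $fh;

print "$_\n" for anchor_mailtos($html);
```

    Because the whole document is matched as one string, an anchor that spans several lines is still found, and the bare mailto: outside any anchor tag is skipped. For a real file, replace the in-memory filehandle with an ordinary open on the file name.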
      My experience in perl is going on about 3 some of what you are saying is greek to me. Can you provide an example of how you would do it? The source file to pull the information from has the tag as follows:

      // -->
      Fax:  (301)931-1285 


      I'm sorry to have to be wet nursed through this...but I have learned a ton of stuff over the last few weeks...I feel like my brain is going to explode!

        Well, first install these two modules (and their unresolved dependencies if there are any):

        Then you can do something like this (Quickshot, untested):

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Regexp::Common qw(Email::Address);
        use Email::Address;

        my $filename = 'file_to_parse.dat';

        open my $rh, '<', $filename or die "$filename: $!";

        # Requirement: href=, mailto: and the mail address must be on the same line!
        my @addresses =
            map  { m/mailto:($RE{Email}{Address})/o ? $1 : () }   # skip lines without a valid address
            grep { m/href=.+?mailto:/ }
            <$rh>;

        close $rh;

        {
            # Print one address per line.
            local $, = local $\ = "\n";
            print @addresses;
        }

        __END__
Re: How to extract an email address from a mailto URL?
by Anonymous Monk on Dec 30, 2008 at 09:46 UTC

Node Type: perlquestion [id://733104]
Approved by GrandFather