PerlMonks  

How to extract an email address from a mailto URL?

by jdlev (Scribe)
on Dec 29, 2008 at 20:06 UTC (#733104)
jdlev has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks! I'm trying to pull a simple piece of information from a text file. The information I want is in the following code:

<a href = 'mailto:email@email.com'>Their email address</a>

There are other a href tags on the page; the only one I want is the one that contains mailto:. How would I extract email@email.com from the .txt file?

Thanks, Jeff

Re: How to extract an email address from a mailto URL?
by linuxer (Deacon) on Dec 29, 2008 at 20:23 UTC

    Check (e.g. with grep) for lines that contain 'mailto:' (be more specific if you like, e.g. also match the 'href').
    Then use Regexp::Common together with Regexp::Common::Email::Address to identify the email address in the matching lines.

    PS: Don't remove your original question here.

    Update: Regexp::Common::Email::Address (RCEA) added.
Re: How to extract an email address from a mailto URL?
by CountZero (Bishop) on Dec 29, 2008 at 21:04 UTC
    grep is indeed the answer to your question if you can be sure that the whole of the 'a' ... '/a' phrase is on the same line.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: How to extract an email address from a mailto URL?
by eye (Chaplain) on Dec 30, 2008 at 07:03 UTC
    If you want to differentiate between addresses in anchor tags and other uses of "mailto:" in the file, read the entire file into memory and use the match operator (m//). As suggested previously, you should use Regexp::Common::Email::Address to help compose a regular expression for the email address and enclosing HTML. I would use "\s+" between the "a" and "href" and "\s*" adjacent to the equal sign to match HTML's treatment of whitespace. Note that HTML allows quoting with both single and double quotes. Also, older HTML allowed you to not quote the information after the equal sign in some circumstances.
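    The whole-file match described above might look roughly like the sketch below. It is only an illustration: the `$email` pattern is a deliberately simplified stand-in (an assumption, not the pattern Regexp::Common::Email::Address provides), and `mailto_addresses` is a made-up helper name. It allows whitespace, including newlines, between "a" and "href", optional whitespace around the equal sign, and single, double, or no quotes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for an email pattern (an assumption for this sketch);
# Regexp::Common::Email::Address would supply a far more complete one.
my $email = qr/[^\s'"<>?]+\@[^\s'"<>?]+/;

# Hypothetical helper: return every address found in anchor tags whose
# href is a mailto: URL. Other uses of "mailto:" in the text are ignored.
sub mailto_addresses {
    my ($html) = @_;
    my @found;
    while ($html =~ m{
        <a \s+ [^>]*?        # opening anchor tag; whitespace may be a newline
        \bhref \s* = \s*     # optional whitespace around the equal sign
        (["']?)              # single quote, double quote, or no quote at all
        mailto: ($email)
        \1                   # the same quote character closes the value
    }gix) {
        push @found, $2;
    }
    return @found;
}

# Read the whole file named on the command line into memory, so that a tag
# split across lines still matches.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    my $html = do { local $/; <$fh> };
    close $fh;
    print "$_\n" for mailto_addresses($html);
}
```

    Because the file is matched as one string rather than line by line, it does not matter whether the "a" and the "href" sit on the same line.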
      My experience with Perl is about 3 weeks, so some of what you are saying is Greek to me. Can you provide an example of how you would do it? The source file I'm pulling the information from contains the tag as follows:

      showTollfree(1010)
      // -->
      </script>
      Fax:  (301)931-1285 
      <br><a
      href='mailto:KHargrove@servpro1010.com'>KHargrove@servpro1010.com</a>

      </td>

      I'm sorry to have to be wet nursed through this...but I have learned a ton of stuff over the last few weeks...I feel like my brain is going to explode!

        Well, first install these two modules (and their unresolved dependencies if there are any):

        Then you can do something like this (Quickshot, untested):

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Regexp::Common qw(Email::Address);
        use Email::Address;

        my $filename = 'file_to_parse.dat';

        open my $rh, '<', $filename or die "$filename: $!";

        # Requirement: href=, mailto: and the mail address must be on the same line!
        # grep keeps the lines that contain such a link; map pulls the address
        # out of each one and skips the line if no address is found.
        my @addresses = map  { m/mailto:($RE{Email}{Address})/ ? $1 : () }
                        grep { m/href=.+?mailto:/ }
                        <$rh>;

        close $rh;

        {
            # Separate the printed addresses with newlines.
            local $, = local $\ = "\n";
            print @addresses;
        }

        __END__
Re: How to extract an email address from a mailto URL?
by Anonymous Monk on Dec 30, 2008 at 09:46 UTC
Extracting email addresses from mailto URIs using HTML and URI parsers instead of regular expressions
by dorward (Curate) on Dec 30, 2008 at 15:07 UTC
    I don't like using regular expressions on HTML documents, so my approach would be to use a proper HTML parser instead. This has a number of benefits, including the decoding of entities in the HTML representing the email address. This code uses LWP::UserAgent to fetch the HTML document, HTML::TokeParser to read it, and URI to parse the URIs in it.
    #!/usr/bin/perl
    use strict;
    use warnings;

    use LWP::UserAgent;
    use HTML::TokeParser;
    use URI;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $root_uri = 'http://example.com/';
    my $response = $ua->get($root_uri);

    if ($response->is_success) {
        my $html = $response->decoded_content;
        my $p    = HTML::TokeParser->new( \$html );
        while (my $tag = $p->get_tag('a')) {
            my $href = $tag->[1]{href};
            next unless $href;
            my $uri = URI->new_abs( $href, $root_uri );
            next unless ($uri->scheme eq 'mailto');
            print $uri->to, "\n";
        }
    }
    else {
        die $response->status_line;
    }
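
    Since the original question is about a local text file rather than a live page, the same parser-based idea can be sketched without the HTTP fetch. This is a minimal variant under stated assumptions: `mailto_links` is a hypothetical helper name, the input is passed as a string, and HTML::TokeParser and URI must be installed. It should also cope with an address whose "@" is written as an entity (e.g. `&#64;`) and with trailing headers such as `?subject=...`, which a plain regex on the raw text would not handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;
use URI;

# Hypothetical helper: return the address(es) from every mailto: link
# found in an HTML string.
sub mailto_links {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "cannot parse input";
    my @found;
    while ( my $tag = $p->get_tag('a') ) {
        # Attribute values arrive with HTML entities already decoded.
        my $href = $tag->[1]{href};
        next unless defined $href;
        my $uri = URI->new($href);
        # Relative links have no scheme at all, so guard against undef.
        next unless ( $uri->scheme // '' ) eq 'mailto';
        push @found, $uri->to;
    }
    return @found;
}

# Read the whole file named on the command line and print what was found.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    my $html = do { local $/; <$fh> };
    close $fh;
    print "$_\n" for mailto_links($html);
}
```

    Saved under a name of your choosing, the script takes the file to scan as its only argument.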
