PerlMonks  

How do I parse links out of a web page

( #13054=categorized question )
Contributed by Anonymous Monk on May 18, 2000 at 20:28 UTC


Description:

I'm trying to parse all the links in a web page into an array organized like this: ($link, $description), where:

    <a href="http://www.mysite.com/mypage.html">Come <b>visit</b> my <u>web page</u>!</a>

gets parsed into:

    $link = "http://www.mysite.com/mypage.html"
    $description = "Come visit my web page!"

Thanks very much for the help!

Answer: How do I parse links out of a web page
contributed by tokpela

You can also use WWW::Mechanize, whose link objects expose both the URL and the link text:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url = "file:///D:/webpage.html";
    #my $url = "http://www.domain.com/webpage.html";

    my $mech = WWW::Mechanize->new();
    $mech->get( $url );
    my @links = $mech->links();

    foreach my $link (@links) {
        print "LINK: " . $link->url() . "\n";
        print "DESCRIPTION: " . $link->text() . "\n";
    }

Answer: How do I parse links out of a web page
contributed by gregorovius

Unfortunately, HTML::LinkExtor does not offer a way of extracting the link text from the 'A' tag. You can use HTML::TokeParser instead.

The HTML::TokeParser perldoc contains a snippet that does exactly what you ask for, except that the link URLs it extracts can be relative, so you need to resolve them against a base URL.
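
Here is a minimal sketch of that approach, along the lines of the perldoc snippet, with the base-URL step added. The sub name extract_links, the sample HTML, and the base URL are made up for illustration, and using URI->new_abs to resolve relative links is my own choice rather than something the perldoc prescribes:

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper: returns a list of [ $link, $description ] pairs.
sub extract_links {
    my ($html, $base) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    while (my $token = $p->get_tag('a')) {
        my $href = $token->[1]{href};
        next unless defined $href;              # skip <a name="..."> anchors
        # Collect the text up to </a>, ignoring nested tags like <b> and <u>
        my $text = $p->get_trimmed_text('/a');
        # Resolve relative URLs against the base (absolute ones pass through)
        push @links, [ URI->new_abs($href, $base)->as_string, $text ];
    }
    return @links;
}

# Sample input (assumed for illustration)
my $sample_html = <<'HTML';
<a href="http://www.mysite.com/mypage.html">Come <b>visit</b> my <u>web page</u>!</a>
<a href="other/page.html">Another page</a>
HTML
my $base_url = 'http://www.mysite.com/';

for my $pair (extract_links($sample_html, $base_url)) {
    my ($link, $description) = @$pair;
    print "$link => $description\n";
}
```

For a live page you would fetch the HTML first (with LWP, for example) and pass the response's base URL rather than a hard-coded one.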

Answer: How do I parse links out of a web page
contributed by Anonymous Monk

You could try HTML::LinkExtor as well (note that this collects only the URLs, not the link text):

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI::URL;

    my $url = "http://www.google.ca/";   # for instance
    my $ua  = LWP::UserAgent->new;

    # Set up a callback that collects the link targets of <a> tags
    my @links = ();
    sub callback {
        my ($tag, %attr) = @_;
        return if $tag ne 'a';   # we only look closer at <a ...>
        push(@links, values %attr);
    }

    # Make the parser. Unfortunately, we don't know the base yet
    # (it might be different from $url)
    my $p = HTML::LinkExtor->new(\&callback);

    # Request the document and parse it as it arrives
    my $res = $ua->request(HTTP::Request->new(GET => $url),
                           sub { $p->parse($_[0]) });

    # Expand all link URLs to absolute ones
    my $base = $res->base;
    @links = map { url($_, $base)->abs } @links;

    # Print them out
    print join("\n", @links), "\n";

Answer: How do I parse links out of a web page
contributed by merlyn

See HTML::LinkExtor, which is part of the HTML-Parser distribution on CPAN.

Answer: How do I parse links out of a web page
contributed by agent00013

The Perl Cookbook has a good example:

    #!/usr/local/bin/perl
    # xurl - extract unique, sorted list of links from URL
    use strict;
    use warnings;
    use HTML::LinkExtor;
    use LWP::Simple;

    my $base_url = shift or die "usage: $0 URL\n";
    my $html = get($base_url);
    die "Could not fetch $base_url\n" unless defined $html;

    # Passing a base URL makes the parser return absolute links
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($html)->eof;

    my %seen;
    foreach my $linkarray ($parser->links) {
        # Each entry is the tag name followed by attribute/value pairs
        my ($elt_type, @element) = @$linkarray;
        while (@element) {
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            $seen{$attr_value}++;
        }
    }
    print "$_\n" for sort keys %seen;

Hope this helps. /msg me if you need anything else.
