Contributed by Anonymous Monk
on May 18, 2000 at 20:28 UTC
Description: I'm trying to parse all the links in a web page into an array organized like this:
($link, $description)
where:
<a href="http://www.mysite.com/mypage.html">Come <b>visit</b> my <u>web page</u>!</a>
gets parsed into:
$link = "http://www.mysite.com/mypage.html"
$description = "Come visit my web page!"
Thanks very much for the help!

Answer: How do I parse links out of a web page
contributed by tokpela

Or you can use WWW::Mechanize:
use strict;
use warnings;
use WWW::Mechanize;
my $url = "file:///D:/webpage.html";
#my $url = "http://www.domain.com/webpage.html";
my $mech = WWW::Mechanize->new();
$mech->get( $url );
my @links = $mech->links();
foreach my $link (@links) {
    print "LINK: " . $link->url() . "\n";
    print "DESCRIPTION: " . $link->text() . "\n";
}
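
If you want the ($link, $description) pairs in an array rather than printed, a minimal sketch could look like the following. The helper name link_pairs is made up for illustration; it restricts itself to <a> tags (other link-bearing tags have no text) and uses url_abs so that relative links come back absolute.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper (not part of WWW::Mechanize):
# returns a list of [$link, $description] array refs for one page.
sub link_pairs {
    my ($url) = @_;
    # autocheck => 1 makes a failed get() die instead of failing silently
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);
    # Only <a> tags carry link text; url_abs resolves relative hrefs
    return map { [ $_->url_abs->as_string, $_->text ] }
           $mech->find_all_links( tag => 'a' );
}

# Usage: perl script.pl http://www.domain.com/webpage.html
for my $pair ( map { link_pairs($_) } @ARGV ) {
    my ($link, $description) = @$pair;
    $description = '' unless defined $description;   # e.g. image-only links
    print "LINK: $link\nDESCRIPTION: $description\n";
}
```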

Answer: How do I parse links out of a web page
contributed by gregorovius

Unfortunately, HTML::LinkExtor does not offer a
way of extracting the link text from the 'A' tag.
You can resort to HTML::TokeParser instead.
The HTML::TokeParser perldoc contains a snippet
that does exactly what you ask for, except that the
link URLs it extracts can be relative, so you need to
resolve them against a base URL.
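
A minimal sketch of that approach, adapted from the idea in the perldoc rather than copied from it: get_tag('a') finds each anchor, get_trimmed_text('/a') collects the text up to the closing tag (skipping inline markup such as <b> and <u>), and URI->new_abs resolves the href against a base. The function name extract_links and the sample base URL are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper: returns a list of [$link, $description] array refs.
sub extract_links {
    my ($html, $base) = @_;
    # A scalar reference tells the parser to treat the string as the document
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @pairs;
    while ( my $tag = $parser->get_tag('a') ) {
        my $href = $tag->[1]{href};
        next unless defined $href;               # skip <a name="..."> anchors
        my $text = $parser->get_trimmed_text('/a');
        # Resolve relative links against the base URL
        push @pairs, [ URI->new_abs($href, $base)->as_string, $text ];
    }
    return @pairs;
}

my $html = '<a href="mypage.html">Come <b>visit</b> my <u>web page</u>!</a>';
for my $pair ( extract_links($html, 'http://www.mysite.com/') ) {
    print "LINK: $pair->[0]\n";
    print "DESCRIPTION: $pair->[1]\n";
}
```

For a fetched page, the base would normally come from the HTTP response (for example $res->base with LWP) rather than being hard-coded.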

Answer: How do I parse links out of a web page
contributed by Anonymous Monk

You could try this as well:
#!/usr/bin/perl -w
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
my $url = "http://www.google.ca/"; # for instance
my $ua = LWP::UserAgent->new;
# Set up a callback that collects the href of every <a> tag
my @links = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag ne 'a'; # we only look closer at <a ...>
    push(@links, $attr{href}) if defined $attr{href};
}
# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
my $p = HTML::LinkExtor->new(\&callback);
# Request document and parse it as it arrives
my $res = $ua->request(HTTP::Request->new(GET => $url),
                       sub {$p->parse($_[0])});
$p->eof; # flush anything still buffered in the parser
# Expand all link URLs to absolute ones
my $base = $res->base;
@links = map { url($_, $base)->abs } @links;
# Print them out
print join("\n", @links), "\n";

Answer: How do I parse links out of a web page
contributed by merlyn

See HTML::LinkExtor on CPAN (it ships with the HTML-Parser distribution).

Answer: How do I parse links out of a web page
contributed by agent00013

The Perl Cookbook has a good example:
#!/usr/local/bin/perl
# xurl - extract unique, sorted lists of links from URL
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::Simple;
my $base_url = shift or die "usage: $0 URL\n";
my $html = get($base_url);
defined $html or die "Couldn't fetch $base_url\n";
# Passing a base URL makes the parser return absolute links
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($html)->eof;
my %seen;
foreach my $linkarray ($parser->links) {
    my ($elt_type, @element) = @$linkarray;   # e.g. ('a', href => $url)
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}
for (sort keys %seen) { print $_, "\n" }
Hope this helps. /msg me if you need anything else.