PerlMonks  

Recompose a webpage using LWP::UserAgent and HTML::Parse

by Discipulus (Monsignor)
on May 26, 2011 at 11:26 UTC (#906812=perlquestion)
Discipulus has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise ones,

I was experimenting with the LWP::* modules, aiming to build a tool that can time the download of an arbitrary web URL (www.domain.org | www.domain.org/page.cgi | www.domain.org/path/to/page.cgi). It was pretty simple to get the body of the page, but I suddenly realized that it was only a skeleton without all the included content (images and so on).

Then I had the idea to separate the content relative to the base URL from the content served by other sites.

I ended up with the testing code below, but I'm not at all sure it considers all the embedding/linking methods in use across the web.

I'm not even sure about the exhaustiveness of the parsed links (body img src) I used in the example code.

Excuse me for such a general question; trusting in your patience, I wait for some hints.
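For the timing part itself, a minimal sketch using the core Time::HiRes module could look like the following. The helper name `timed` is hypothetical, and the short sleep only stands in for a real `$ua->get($url)` call.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval sleep);

# Hypothetical helper: run a code ref and return (elapsed seconds, result).
sub timed {
    my ($code) = @_;
    my $t0     = [gettimeofday];
    my $result = $code->();
    return ( tv_interval($t0), $result );
}

# Stand-in workload; in the real tool this would be sub { $ua->get($url) }.
my ( $elapsed, $result ) = timed( sub { sleep(0.05); 'done' } );
printf "%.3f s\t%s\n", $elapsed, $result;
```

Wrapping each request this way would give one timing per fetched resource, which can then be summed alongside the byte counts.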
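One possible way to make that split, sketched with the URI module that is installed alongside LWP: resolve every link against the page URL and compare host names. The helper `classify_link` and the example.org / example.net addresses are assumptions made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $link against $base and report whether the
# result lives on the same host ('internal') or elsewhere ('external').
sub classify_link {
    my ( $link, $base ) = @_;
    my $abs = URI->new_abs( $link, $base );

    # Non-HTTP schemes (mailto:, javascript:, data:) have no host to compare.
    return ( 'other', $abs )
        unless $abs->scheme && $abs->scheme =~ /^https?$/;

    my $same = lc( $abs->host ) eq lc( URI->new($base)->host );
    return ( $same ? 'internal' : 'external', $abs );
}

for my $link ( '/img/logo.png', 'style.css', 'http://cdn.example.net/x.js' ) {
    my ( $kind, $abs ) =
        classify_link( $link, 'http://www.example.org/path/page.cgi' );
    print "$kind\t$abs\n";
}
```

Resolving first and comparing afterwards also handles relative paths, https links and scheme-relative links such as `//cdn.example.net/x.js`, which a match on a leading `/` or `http://` would misclassify.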
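As a point of comparison for the exhaustiveness question, here is a hedged sketch using HTML::LinkExtor (part of the HTML-Parser distribution), which reports every attribute it knows to be a link rather than a hand-picked list. The helper `embedded_links`, the sample HTML and the example hosts are invented for illustration; skipping `<a>` is an assumption about what counts as included content.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return [tag, attribute, absolute URL] for every link
# attribute HTML::LinkExtor recognises, skipping plain <a href> navigation.
sub embedded_links {
    my ( $html, $base ) = @_;
    my @found;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            return if $tag eq 'a';    # hyperlinks are not fetched with the page
            push @found, [ $tag, $_, "$attr{$_}" ] for sort keys %attr;
        },
        $base,    # with a base URL, links are returned in absolute form
    );
    $extor->parse($html);
    $extor->eof;
    return @found;
}

my $html = <<'HTML';
<html><head><link rel="stylesheet" href="/css/site.css">
<script src="http://cdn.example.net/lib.js"></script></head>
<body background="bg.png"><a href="/next.html">next</a>
<img src="/img/logo.png"><iframe src="http://ads.example.com/f.html"></iframe>
</body></html>
HTML

printf "%-8s%-12s%s\n", @$_ for embedded_links( $html, 'http://www.example.org/' );
```

This still only sees links written in the HTML attributes themselves. Resources pulled in from inside a stylesheet (`url(...)`) or requested by JavaScript at run time are not visible to an HTML parser.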

Lor*
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parse;
use Data::Dumper;
$|++;

foreach my $url (@ARGV) {
    my $totsize = 0;
    my (@intlink, @extlink, @brokenlink);
    print "PROCESSING:\t$url\n";
    $url = 'http://' . $url;
    $url =~ s/\s+//g;    # delete spaces
    $url =~ s/\/$//;     # remove a trailing /, if any
    my $ua = LWP::UserAgent->new;
    $ua->agent("libwww-perl/5.10.1");
    my $response = $ua->get($url);
    my $body     = $response->content;
    print "body size:\t", length($body), "\n";
    $totsize += length($body);
    my $parsed_html = parse_html($body);
    for (@{ $parsed_html->extract_links(qw(body img src)) }) {
        #print "@$_\n"; next;
        my ($link) = @$_;
        # internal included content
        # (\Q...\E keeps the dots in $url from acting as regex wildcards)
        if ($link =~ /^\// || $link =~ /^\Q$url\E/) {
            $link = $url . $link unless $link =~ /^\Q$url\E/;
            push @intlink, $link;
            #print "DEBUG a:->$link<-\n";
        }
        # external included content
        elsif ($link =~ /http:\/\//) {
            push @extlink, $link;
            #print "DEBUG b:->$link<-\n";
        }
        # ? included content
        else {
            push @intlink, $link;
            #print "DEBUG c:->$link<-\n";
        }
    }
    print "-" x 34, "\n", "code\tbytes\tlink\n", "-" x 34, "\n";
    $totsize += (get_links($url, @intlink) || 0);
    $totsize += (get_links($url, @extlink) || 0);
    print "\n\nTOTSIZE: " . Arrotonda_Mega($totsize) . " ($totsize bytes)\n";
}

# Fetch each link, print status code and size, return the total bytes.
sub get_links {
    my $urlbase = shift;
    my @links   = @_;
    my $totsize = 0;
    my $ua      = LWP::UserAgent->new;
    $ua->agent("libwww-perl/5.10.1");
    my $request = HTTP::Request->new('GET');
    foreach my $url (@links) {
        next if $url =~ /^#/;    # skip in-page anchors
        $request->uri($url);
        my $response = $ua->request($request);
        print $response->code . "\t" . length($response->content) . "\t$url\n";
        $totsize += length($response->content);
    }
    return $totsize;
}

################################################################################
# Scale a byte count to bytes/Kb/Mb/Gb.
sub Arrotonda_Mega {
    my ($size, $n) = (shift, 0);
    return "0 bytes" unless defined $size;
    return "0 bytes" unless $size > 0;
    ++$n and $size /= 1024 until $size < 1024;
    return sprintf "%.4f %s", $size, (qw[ bytes Kb Mb Gb ])[$n];
}
################################################################################
there are no rules, there are no thumbs..

Replies are listed 'Best First'.
Re: Recompose a webpage using LWP::UserAgent and HTML::Parse
by Anonymous Monk on May 26, 2011 at 13:10 UTC
      Oooohh... ten years of Perl without the vaguest idea of http://www.cpan.org/scripts/ !!!!

      Here I found weeks of reading.

      thanks

      L*

      there are no rules, there are no thumbs..
