Re: Web Crawling using Perl -- TIMTOWTDT

by Discipulus (Canon)
on Jun 24, 2017 at 13:49 UTC ( [id://1193453] )


in reply to Web Crawling using Perl

Hello ckj, the subject is very interesting. I just do not understand when you say:

> if it's good then I can post it on a larger platform

Is there something larger than PerlMonks? (Kidding, but you should probably ask each author's permission before reposting their code on another platform. For example, I'd prefer that you just link to this post from wherever you want.)

Now, I'm not an expert at web scraping, but there is something much simpler to use when starting out. Consider what I use to extract titles from nodes I want to bookmark; it also adds some HTML tags so the result can go into an unordered list:

io@COMP:C> perl -MLWP::UserAgent -e "print qq(<li>[id://$ARGV[0]|).LWP::UserAgent->new->get('http://www.perlmonks.org/index.pl?node_id='.$ARGV[0])->title,']</li>'" 1193449
<li>[id://1193449|Web Crawling using Perl]</li>

Just one step further, you can get the content in a few lines:

use strict;
use warnings;
use LWP::UserAgent ();

my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.perlmonks.org/?node_id=1193449');

if ( $response->is_success ) {
    print $response->decoded_content;
}
else {
    die $response->status_line;
}
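Once you have the content, the next step is usually pulling something out of it. As a zero-dependency sketch (core Perl only, with a hypothetical `$html` string standing in for `$response->decoded_content`), a naive regex can grab the title; for anything serious, a real HTML parser from CPAN is the right tool:

```perl
use strict;
use warnings;

# Hypothetical page content; in real use this would be
# $response->decoded_content from the LWP::UserAgent fetch above.
my $html = '<html><head><title>Web Crawling using Perl</title></head><body>...</body></html>';

# Naive title extraction: fine for a quick look, not for real-world HTML.
my ($title) = $html =~ m{<title>(.*?)</title>}si;
print "Title: $title\n";
```

This is exactly what the one-liner above gets for free from `->title`, which LWP fills in via HTML::HeadParser.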

Also, scraping is not just "copying content": you can extract from or examine the response. I've done this in my webtimeload023.pl:

# just monitoring --verbosity 0 --count 4 --sleep 5
perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 0 -c 4 -s 5
Sat Jun 24 15:34:09 2017 http://www.perlmonks.org/?node_id=1193449 200 99562 2.126046 45.7321 Kb/s
Sat Jun 24 15:34:16 2017 http://www.perlmonks.org/?node_id=1193449 200 99599 1.986645 48.9592 Kb/s
Sat Jun 24 15:34:23 2017 http://www.perlmonks.org/?node_id=1193449 200 99192 2.064141 46.9286 Kb/s
Sat Jun 24 15:34:30 2017 http://www.perlmonks.org/?node_id=1193449 200 98852 1.972459 48.9415 Kb/s

# some detail more with --verbosity 4
perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 4
====================================================================
http://www.perlmonks.org/?node_id=1193449        Sat Jun 24 15:34:40 2017
--------------------------------------------------------------------
Response code:            200
Response message:         OK
Response server:          Apache/2.4.26
Response declared length: UNDEF
Response title:           Web Crawling using Perl
--------------------------------------------------------------------
main page content (1): 31.8506 Kb in 1.248003 seconds @ 25.5212 Kb/s)
--------------------------------------------------------------------
detail of loaded pages (url):
--------------------------------------------------------------------
http://www.perlmonks.org/?node_id=1193449
--------------------------------------------------------------------
no included content found.
external content (1): 64.6660 Kb in 0.723429 seconds @ 89.3882 Kb/s)
no broken links found.
--------------------------------------------------------------------
detail of loaded content (url bytes seconds):
--------------------------------------------------------------------
http://promote.pair.com/i/pair-banner-current.gif    66218    0.723429
--------------------------------------------------------------------
downloaded 96.5166 Kb (98833 bytes) in 1.971432 seconds (48.9576 Kb/s)
====================================================================
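The throughput lines above come from webtimeload023.pl itself, which is not shown here. A minimal sketch of the same idea, timing a fetch with the core Time::HiRes module (the URL and variable names are only placeholders, and the output format merely imitates the script's), might look like:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use LWP::UserAgent;

my $url = 'http://www.perlmonks.org/?node_id=1193449';
my $ua  = LWP::UserAgent->new( timeout => 10 );

my $t0       = [gettimeofday];
my $response = $ua->get($url);
my $elapsed  = tv_interval($t0);    # seconds, with microsecond resolution

die $response->status_line unless $response->is_success;

# status code, body size, elapsed time, and derived throughput
my $bytes = length $response->content;
printf "%s %s %d %d %.6f %.4f Kb/s\n",
    scalar localtime, $url, $response->code,
    $bytes, $elapsed, ( $bytes / 1024 ) / $elapsed;
```

The Kb/s figure is just bytes / 1024 / seconds, which matches the numbers in the monitoring output above.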

As you can see, I just experimented with LWP::UserAgent. Many more good possibilities are present on CPAN, both for scraping and for parsing the results.
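One hedged example of such a parsing module: assuming HTML::LinkExtor (part of the HTML-Parser distribution, which LWP already depends on) is available, collecting every link on a page could be sketched like this, where the `$html` string is a stand-in for a fetched page:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Stand-in for $response->decoded_content from a real fetch.
my $html = <<'HTML';
<html><body>
  <a href="http://www.perlmonks.org/?node_id=1193449">Web Crawling using Perl</a>
  <img src="http://promote.pair.com/i/pair-banner-current.gif">
</body></html>
HTML

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ( $tag, %attr ) = @_;
        push @links, $attr{href} if $tag eq 'a';    # keep only <a href=...>
    }
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

Dropping the `$tag eq 'a'` filter would also report the image URL, which is how a crawler can follow embedded content as well as links.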

You may also be interested in reading the following wise threads:

The threads above link to much other useful information; more links can be found on my homenode.

When, next year (;=), you have tried them all, I'll be very glad to hear your opinion.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Replies are listed 'Best First'.
Re^2: Web Crawling using Perl -- TIMTOWTDT
by ckj (Chaplain) on Jun 24, 2017 at 14:24 UTC
    Thanks for your valuable feedback, Discipulus. I have already started going through the information you shared and will get back with my opinion (next year is way too far). The code I posted here was written by me a long time ago, and yes, a simpler example would make more sense than displaying all the functionality. Lastly, regarding "if it's good then I can post it on a larger platform": what I actually meant was putting this article into a bigger one where crawling in multiple languages is explained :) . Thanks again for your feedback; I can probably work on my article in a better way now.
