Re: Web Crawling using Perl -- TIMTOWTDT

by Discipulus (Canon)
on Jun 24, 2017 at 13:49 UTC ( [id://1193453] )


in reply to Web Crawling using Perl

Hello ckj, the subject is very interesting. I just do not understand when you say:

> if it's good then I can post it on a larger platform

Is there something larger than PerlMonks? (Kidding, but you should probably ask each author's permission before reposting their code on another platform. For example, I'd prefer that you just link to this post from wherever you want.)

Now, I'm not an expert at web scraping, but there is something much simpler to use when starting out. Consider what I use to extract titles from nodes I want to bookmark; it also adds some HTML tags so the result can go into an unordered list:

io@COMP:C> perl -MLWP::UserAgent -e "print qq(<li>[id://$ARGV[0]|).LWP::UserAgent->new->get('http://www.perlmonks.org/index.pl?node_id='.$ARGV[0])->title,']</li>'" 1193449
<li>[id://1193449|Web Crawling using Perl]</li>

Just one step further, you can get the content in a few lines:

use strict;
use warnings;
use LWP::UserAgent ();

my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.perlmonks.org/?node_id=1193449');

if ( $response->is_success ) {
    print $response->decoded_content;
}
else {
    die $response->status_line;
}
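Once you have the content, the next step is usually pulling something out of it. As a zero-dependency sketch (core Perl only, with a hypothetical `$html` string standing in for `$response->decoded_content`), a naive regex can grab the title; for anything serious, a real HTML parser from CPAN is the right tool:

```perl
use strict;
use warnings;

# Hypothetical page content; in real use this would be
# $response->decoded_content from the LWP::UserAgent fetch above.
my $html = '<html><head><title>Web Crawling using Perl</title></head><body>...</body></html>';

# Naive title extraction: fine for a quick look, not for real-world HTML.
my ($title) = $html =~ m{<title>(.*?)</title>}si;
print "Title: $title\n";
```

This is exactly what the one-liner above gets for free from `->title`, which LWP fills in via HTML::HeadParser.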

Also, scraping is not just "copying content": you can extract from or examine the response. I've done this in my webtimeload023.pl:

# just monitoring --verbosity 0 --count 4 --sleep 5
perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 0 -c 4 -s 5
Sat Jun 24 15:34:09 2017 http://www.perlmonks.org/?node_id=1193449 200 99562 2.126046 45.7321 Kb/s
Sat Jun 24 15:34:16 2017 http://www.perlmonks.org/?node_id=1193449 200 99599 1.986645 48.9592 Kb/s
Sat Jun 24 15:34:23 2017 http://www.perlmonks.org/?node_id=1193449 200 99192 2.064141 46.9286 Kb/s
Sat Jun 24 15:34:30 2017 http://www.perlmonks.org/?node_id=1193449 200 98852 1.972459 48.9415 Kb/s

# some detail more with --verbosity 4
perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 4
====================================================================
http://www.perlmonks.org/?node_id=1193449        Sat Jun 24 15:34:40 2017
--------------------------------------------------------------------
Response code:            200
Response message:         OK
Response server:          Apache/2.4.26
Response declared length: UNDEF
Response title:           Web Crawling using Perl
--------------------------------------------------------------------
main page content (1): 31.8506 Kb in 1.248003 seconds @ 25.5212 Kb/s)
--------------------------------------------------------------------
detail of loaded pages (url):
--------------------------------------------------------------------
http://www.perlmonks.org/?node_id=1193449
--------------------------------------------------------------------
no included content found.
external content (1): 64.6660 Kb in 0.723429 seconds @ 89.3882 Kb/s)
no broken links found.
--------------------------------------------------------------------
detail of loaded content (url bytes seconds):
--------------------------------------------------------------------
http://promote.pair.com/i/pair-banner-current.gif    66218    0.723429
--------------------------------------------------------------------
downloaded 96.5166 Kb (98833 bytes) in 1.971432 seconds (48.9576 Kb/s)
====================================================================
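The throughput lines above come from webtimeload023.pl itself, which is not shown here. A minimal sketch of the same idea, timing a fetch with the core Time::HiRes module (the URL and variable names are only placeholders, and the output format merely imitates the script's), might look like:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use LWP::UserAgent;

my $url = 'http://www.perlmonks.org/?node_id=1193449';
my $ua  = LWP::UserAgent->new( timeout => 10 );

my $t0       = [gettimeofday];
my $response = $ua->get($url);
my $elapsed  = tv_interval($t0);    # seconds, with microsecond resolution

die $response->status_line unless $response->is_success;

# status code, body size, elapsed time, and derived throughput
my $bytes = length $response->content;
printf "%s %s %d %d %.6f %.4f Kb/s\n",
    scalar localtime, $url, $response->code,
    $bytes, $elapsed, ( $bytes / 1024 ) / $elapsed;
```

The Kb/s figure is just bytes / 1024 / seconds, which matches the numbers in the monitoring output above.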

As you can see, I just experimented with LWP::UserAgent. Many more good possibilities are present on CPAN, both for scraping and for parsing the results.
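One hedged example of such a parsing module: assuming HTML::LinkExtor (part of the HTML-Parser distribution, which LWP already depends on) is available, collecting every link on a page could be sketched like this, where the `$html` string is a stand-in for a fetched page:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Stand-in for $response->decoded_content from a real fetch.
my $html = <<'HTML';
<html><body>
  <a href="http://www.perlmonks.org/?node_id=1193449">Web Crawling using Perl</a>
  <img src="http://promote.pair.com/i/pair-banner-current.gif">
</body></html>
HTML

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ( $tag, %attr ) = @_;
        push @links, $attr{href} if $tag eq 'a';    # keep only <a href=...>
    }
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

Dropping the `$tag eq 'a'` filter would also report the image URL, which is how a crawler can follow embedded content as well as links.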

You may also be interested in reading the following wise threads:

The threads above link to much other useful information; more links can be found on my homenode.

When, next year (;=), you have tried them all, I'll be very glad to hear your opinion.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Replies are listed 'Best First'.
Re^2: Web Crawling using Perl -- TIMTOWTDT
by ckj (Chaplain) on Jun 24, 2017 at 14:24 UTC
    Thanks for your valuable feedback, Discipulus. I have already started going through the information you shared and will get back with my opinion (next year is way too far). The code I posted here was written by me a long time ago, and yes, a simpler example would make more sense than displaying all the functionality. Lastly, regarding "if it's good then I can post it on a larger platform": what I actually meant was putting this article into a bigger one where crawling in multiple languages is explained :) . Thanks again for your feedback; I can probably work on my article in a better way now.
