PerlMonks  

Perl program for webscraping

by Emmasue (Initiate)
on Feb 08, 2018 at 20:04 UTC (#1208746=perlquestion)
Emmasue has asked for the wisdom of the Perl Monks concerning the following question:

Attached below is a Perl snippet that gets a specific recipe from allrecipes.com and writes the XML source code to a txt file.

#! C:/Perl64/bin/perl
# Calls the PERL interpreter
use strict;       # use features that provide detailed warning and cleanup
use warnings;     # use features that provide detailed warning and cleanup
use autodie;      # sometimes prevents the program from hanging and kills it
use LWP::Simple;  # PERL module to connect to Internet

my $out_file = 'recipe.txt';        # Defines output
my $encoding = ":encoding(UTF-8)";  # Defines encoding type for text typical of western webpages

# Opens output and assigns an internal name independent of file name
open (my $handle2, ">> $encoding", $out_file) || die "Could not open $out_file: $!";

# Gets the webpage as xml
my $content = get("http://allrecipes.com/recipe/11253/mock-peanut-brittle/") or die "ouch";

print $handle2 $content."\n";  # Writes the URL as text to the output file
exit;

My question is how I can build a program from this snippet that loops through a series of recipes (let's say from recipe 11253 to 11300) and writes each recipe's XML source to a separate file with its own file name. So I just have to insert a loop somewhere in this code that pulls the source code of each recipe in the range I specify from allrecipes.com and dumps the text into a file, which is then collected in one of my directories containing all the recipe files.

I am getting the recipe ID numbers from the allrecipes URL. Every recipe has a unique number, so www.allrecipes.com/recipe/10413/, for example, is an easy Valentine's Day cookie recipe.

I know that I will need to insert a sleep command between URL fetches or I will be flagged as a bot and kicked out, and that the delay needs to be of some random length. It will probably look like: sleep(rand(10))

Also, I have to choose a sensible way of naming each recipe file. In the end, I should be able to have 10,000 named txt files of the recipes I specify in a directory. After this step I will parse the info that I need from the text to do my analysis.
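For illustration only, here is a minimal sketch of the kind of loop described above: it walks a range of recipe IDs, saves each page to its own file, and pauses a random number of seconds between fetches. Several things in it are assumptions rather than part of the original snippet: the `recipes` output directory, the `recipe_<id>.txt` naming pattern, the 2 to 10 second pause, the helper names (`recipe_url`, `out_path`, `pause_seconds`, `fetch_range`), and the idea that the ID-only URL form quoted in the question is enough to reach each page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use File::Path qw(make_path);

# Assumption: output directory name chosen for illustration.
my $out_dir = 'recipes';

# Build the URL from the numeric ID alone, as in the question's example.
sub recipe_url {
    my ($id) = @_;
    return "http://allrecipes.com/recipe/$id/";
}

# One file per recipe; the ID is zero-padded so files sort in ID order.
# (Hypothetical naming scheme.)
sub out_path {
    my ($id) = @_;
    return sprintf '%s/recipe_%06d.txt', $out_dir, $id;
}

# Whole number of seconds between $min and $min + $spread, inclusive.
sub pause_seconds {
    my ($min, $spread) = @_;
    return $min + int rand($spread + 1);
}

# Fetch every ID in the range, skipping IDs that cannot be retrieved.
sub fetch_range {
    my ($first, $last) = @_;
    make_path($out_dir);    # no error if the directory already exists
    for my $id ($first .. $last) {
        my $content = get(recipe_url($id));
        if (!defined $content) {
            warn "Could not fetch recipe $id, skipping\n";
        }
        else {
            my $file = out_path($id);
            open(my $fh, '>:encoding(UTF-8)', $file)
                or die "Could not open $file: $!";
            print {$fh} $content, "\n";
            close($fh) or die "Could not close $file: $!";
        }
        # Pause between requests, but not after the last one.
        sleep pause_seconds(2, 8) if $id < $last;
    }
    return;
}

# Runs only when given a first and last ID on the command line.
fetch_range(@ARGV) if @ARGV == 2;
```

Saved under a name such as fetch_recipes.pl (hypothetical), it would be run as `perl fetch_recipes.pl 11253 11300`. A failed fetch produces a warning instead of ending the run, since not every number in a range necessarily corresponds to a recipe; each file is opened with `>` rather than `>>` so that re-running the script replaces a file instead of appending a second copy to it.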

Replies are listed 'Best First'.
Re: Perl program for webscraping
by Anonymous Monk on Feb 08, 2018 at 20:50 UTC
      Hi Anonymous Monk,

      What has TOS 7(e), from http://www.meredith.com/legal/terms, got to do with the site Emmasue is dealing with, i.e. http://allrecipes.com ?

        Probably the AM followed the link "Terms of Service" on the http://allrecipes.com Website. At least, when I tried that, I came to the aforementioned legal terms. Admittedly, due to "infinite scroll" it was a bit difficult to find that link on the bottom of the page…

        Because that is where you go when you click on "terms of service"?

Node Type: perlquestion [id://1208746]
Front-paged by Corion