http://www.perlmonks.org?node_id=1208746

Emmasue has asked for the wisdom of the Perl Monks concerning the following question:

Attached below is a Perl snippet that gets a specific recipe from allrecipes.com and writes the page source to a text file.

#! C:/Perl64/bin/perl  # Calls the Perl interpreter
use strict;            # use features that provide detailed warnings and cleanup
use warnings;          # use features that provide detailed warnings and cleanup
use autodie;           # sometimes prevents the program from hanging, and kills it
use LWP::Simple;       # Perl module to connect to the Internet

my $out_file = 'recipe.txt';        # Defines the output file
my $encoding = ":encoding(UTF-8)";  # Defines the encoding, typical of western webpages

# Opens the output file and assigns a handle independent of the file name
open(my $handle2, ">>$encoding", $out_file)
    || die "Could not open $out_file: $!";

# Fetches the webpage source
my $content = get("http://allrecipes.com/recipe/11253/mock-peanut-brittle/")
    or die "ouch";

print $handle2 $content . "\n";  # Writes the page source to the output file
exit;

My question is how I can build a program with this snippet that loops through a series of recipes (let's say from recipe 11253 to 11300) and writes the individual page source to separate files, each with a variable file name. So I just have to insert a loop somewhere in this code that pulls the source of each recipe in the range I specify from Allrecipes.com and dumps the text into a file that is then collected into one of my directories containing all the recipe files.

I am getting the recipe ID numbers from the allrecipes URL. Every recipe has a unique number, so www.allrecipes.com/recipe/10413/, for example, is an easy Valentine's Day cookie recipe.

I know that I will need to insert a sleep command between URL fetches or I will be flagged as a bot and kicked out, and that it needs to sleep for some random time. It will probably look like: sleep(rand(10)). Also, I have to choose a sensible way of naming each recipe file.

In the end, I should be able to have 10,000 named txt files of the recipes I specify in a directory. After this step I will parse the info that I need from the text to do my analysis.
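One way the loop could look, as a sketch rather than a finished program: the ID range, the output directory name "recipes", and the "recipe_<id>.txt" naming scheme are all assumptions you can change. The recipe slug (e.g. "mock-peanut-brittle") is omitted from the URL, assuming the site redirects from the bare ID; a failed fetch is skipped with a warning instead of killing the whole run, and a random sleep is placed between fetches.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use File::Path qw(make_path);

my $dir = 'recipes';        # assumed output directory
make_path($dir);            # create it if it does not exist yet

for my $id (11253 .. 11300) {           # assumed ID range
    my $url  = "http://allrecipes.com/recipe/$id/";
    my $file = sprintf "%s/recipe_%05d.txt", $dir, $id;  # e.g. recipes/recipe_11253.txt

    my $content = get($url);
    unless (defined $content) {
        warn "Could not fetch $url, skipping\n";
        next;                            # do not die on one bad ID
    }

    open my $fh, '>:encoding(UTF-8)', $file
        or die "Could not open $file: $!";
    print {$fh} $content;
    close $fh;

    sleep 1 + int rand 10;              # random pause so we look less like a bot
}
```

Zero-padding the ID with %05d keeps the files sorting in numeric order in a directory listing; with 10,000 files that is worth having.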

Replies are listed 'Best First'.
Re: Perl program for webscraping
by Anonymous Monk on Feb 08, 2018 at 20:50 UTC
      Hi Anonymous Monk,

      What has TOS 7(e), from http://www.meredith.com/legal/terms, got to do with the site Emmasue is dealing with, i.e. http://allrecipes.com ?

        Probably the AM followed the link "Terms of Service" on the http://allrecipes.com website. At least, when I tried that, I came to the aforementioned legal terms. Admittedly, due to "infinite scroll" it was a bit difficult to find that link at the bottom of the page…

        Because that is where you go when you click on "terms of service"?