PerlMonks  

Perl program for webscraping

by Emmasue (Initiate)
on Feb 08, 2018 at 20:04 UTC ( #1208746=perlquestion )
Emmasue has asked for the wisdom of the Perl Monks concerning the following question:

Attached below is a Perl snippet that gets a specific recipe from allrecipes.com and writes the page source to a text file.

#! C:/Perl64/bin/perl                          # Calls the Perl interpreter
use strict;                                    # use features that provide detailed warnings and cleanup
use warnings;                                  # use features that provide detailed warnings and cleanup
use autodie;                                   # makes failed file operations die automatically
use LWP::Simple;                               # Perl module to fetch web pages

my $out_file = 'recipe.txt';                   # Defines the output file
my $encoding = ":encoding(UTF-8)";             # Defines the encoding layer, typical of western webpages
open(my $handle2, ">>$encoding", $out_file)
    || die "Could not open $out_file: $!";     # Opens the output file under a handle independent of the file name
my $content = get("http://allrecipes.com/recipe/11253/mock-peanut-brittle/")
    or die "ouch";                             # Fetches the web page source
print $handle2 $content . "\n";                # Writes the page source to the output file
exit;

My question is how I can build a program with this snippet that loops through a series of recipes (let's say from recipe 11253 to 11300) and writes the individual page source to separate files, each having a variable file name. So I just have to insert a loop somewhere in this code that pulls the source code of each recipe in the range I specify from Allrecipes.com and dumps the text into a file that is then collected into one of my directories containing all the recipe files.

I am getting the recipe ID numbers from the allrecipes URL. Every recipe has a unique number, so www.allrecipes.com/recipe/10413/, for example, is an easy Valentine's Day cookie recipe.

I know that I will need to insert a sleep command between URL fetches, or I will be flagged as a bot and kicked out, and that it needs to pause for some random time. It will probably look like sleep(rand(10)). Also, I have to choose a sensible way of naming each recipe file.

In the end, I should be able to have 10,000 named txt files of the recipes I specify in a directory. After this step is when I will parse the info that I need from the text to do my analysis.
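A minimal sketch of that loop, assuming the numeric ID alone is enough for allrecipes.com to serve the recipe (as in the www.allrecipes.com/recipe/10413/ example above), and using the recipe ID itself in the file name, might look like this (the "recipes" directory name is just an illustration):

#! C:/Perl64/bin/perl
use strict;
use warnings;
use LWP::Simple;                              # fetches each page

my $out_dir = 'recipes';                      # hypothetical target directory for all recipe files
mkdir $out_dir unless -d $out_dir;

for my $id (11253 .. 11300) {
    my $url = "http://allrecipes.com/recipe/$id/";
    my $content = get($url);                  # undef if the fetch fails
    unless (defined $content) {
        warn "Could not fetch $url, skipping\n";
        next;
    }
    my $out_file = "$out_dir/recipe_$id.txt"; # one file per recipe ID
    open(my $fh, '>:encoding(UTF-8)', $out_file)
        or die "Could not open $out_file: $!";
    print $fh $content;
    close $fh;
    sleep(1 + int(rand(10)));                 # random pause between fetches
}

Naming each file after its recipe ID keeps the mapping back to the source URL obvious, and skipping (rather than dying on) failed fetches lets a long run survive the occasional missing recipe.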

Replies are listed 'Best First'.
Re: Perl program for webscraping
by Anonymous Monk on Feb 08, 2018 at 20:50 UTC
      Hi Anonymous Monk,

      What has TOS 7(e), from http://www.meredith.com/legal/terms, got to do with the site Emmasue is dealing with, i.e. http://allrecipes.com ?

        Probably the AM followed the link "Terms of Service" on the http://allrecipes.com Website. At least, when I tried that, I came to the aforementioned legal terms. Admittedly, due to "infinite scroll" it was a bit difficult to find that link at the bottom of the page…

        Because that is where you go when you click on "terms of service"?
