
Perl program for webscraping

by Emmasue (Initiate)
on Feb 08, 2018 at 20:04 UTC
Emmasue has asked for the wisdom of the Perl Monks concerning the following question:

Below is a Perl snippet that gets a specific recipe page and writes the XML source code to a .txt file.

#!C:/Perl64/bin/perl                        # Calls the Perl interpreter
use strict;                                 # Require declared variables
use warnings;                               # Enable detailed warnings
use autodie;                                # Turn failed builtins (open, etc.) into fatal errors
use LWP::Simple;                            # Perl module to fetch web pages

my $out_file = 'recipe.txt';                # Defines the output file
my $encoding = ":encoding(UTF-8)";          # Encoding layer typical of western web pages

open(my $handle2, ">>$encoding", $out_file)
    or die "Could not open $out_file: $!";  # Opens output under a handle independent of the file name

my $content = get("https://www.allrecipes.com/recipe/11253/")  # Example allrecipes recipe URL
    or die "ouch";                          # Gets the web page source as XML

print $handle2 $content . "\n";             # Writes the page source to the output file
exit;

My question is how I can build a program from this snippet that loops through a series of recipes (let's say from recipe 11253 to 11300) and writes the individual XML code to separate files, each having a variable file name. So I just have to insert a loop somewhere in this code that pulls the source code of each recipe in the range I specify from allrecipes.com and dumps the text into a file, with all the recipe files collected into one of my directories. I am getting the recipe ID numbers from the allrecipes URL; every recipe has a unique number, so, for example, one such ID points to an easy Valentine's Day cookie recipe.

I know that I will need to insert a sleep command between URL fetches, or I will be flagged as a bot and kicked out, and that the delay needs to be somewhat random. It will probably look like sleep(rand(10)). I also have to choose a sensible way of naming each recipe file.

In the end, I should be able to have 10,000 named .txt files of the recipes I specify in a directory. After this step, I will parse the information that I need from the text to do my analysis.
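A minimal sketch of one way to structure that loop, assuming the allrecipes URL pattern https://www.allrecipes.com/recipe/<ID>/ from the snippet above and naming each output file after its recipe ID (the base URL and directory name here are placeholders, not confirmed by the site):

#!C:/Perl64/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $base_url = 'https://www.allrecipes.com/recipe/';  # assumed URL pattern
my $out_dir  = 'recipes';                             # placeholder directory name

mkdir $out_dir unless -d $out_dir;                    # create the output directory once

for my $id (11253 .. 11300) {
    my $content = get("$base_url$id/");
    unless (defined $content) {
        warn "Could not fetch recipe $id\n";          # skip missing or failed pages
        next;
    }

    my $out_file = "$out_dir/recipe_$id.txt";         # name each file after its recipe ID
    open my $fh, ">:encoding(UTF-8)", $out_file
        or die "Could not open $out_file: $!";
    print $fh $content;
    close $fh;

    sleep(1 + int rand 10);                           # random pause between fetches, as described above
}

Naming the files recipe_<ID>.txt keeps the mapping back to the source URL trivial, which matters once there are 10,000 of them to parse.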

Replies are listed 'Best First'.
Re: Perl program for webscraping
by Anonymous Monk on Feb 08, 2018 at 20:50 UTC
      Hi Anonymous Monk,

      What has TOS 7(e), from the page you linked, got to do with the site Emmasue is dealing with, i.e. allrecipes.com?

        Probably the AM followed the "Terms of Service" link on the website. At least, when I tried that, I ended up at the aforementioned legal terms. Admittedly, due to the "infinite scroll", it was a bit difficult to find that link at the bottom of the page…

        Because that is where you go when you click on "terms of service"?
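Whatever the legal reading, on the technical side a polite crawler can let LWP::RobotUA (the robots.txt-aware sibling of LWP::UserAgent) enforce the site's crawl rules and request spacing automatically. A minimal sketch, with the agent name, e-mail address, and URL as placeholders:

#!C:/Perl64/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Agent name and e-mail address below are placeholders; robots.txt
# is fetched and honoured automatically by LWP::RobotUA.
my $ua = LWP::RobotUA->new(
    agent => 'EmmasueRecipeFetcher/0.1',
    from  => 'you@example.com',
);
$ua->delay(10 / 60);   # minimum delay between requests, in minutes

my $resp = $ua->get('https://www.allrecipes.com/recipe/11253/');
if ($resp->is_success) {
    print $resp->decoded_content;
} else {
    warn "Fetch failed: ", $resp->status_line, "\n";
}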
