PerlMonks  

Help with Web Scraping Script - Updated

by EagerforPerl (Novice)
on Oct 19, 2017 at 00:44 UTC

EagerforPerl has asked for the wisdom of the Perl Monks concerning the following question:

use strict;
use warnings;
use LWP::Simple;
use File::Compare;
use File::Copy;

$| = 1;

sub main {
    #Create a file with current content, compare with all present files in directory if same, delete, if not, keep.
    unless(-e('filesaves') or mkdir('filesaves')) {
        die("Directory Couldn't Be Created.\n");
    } #create directory if it does not already exist

    my $fileName;
    print("Enter Site Directory: "); #Test input: http://caveofprogramming.com
    #Gather site URL with directory
    my $siteDirectory = <STDIN>;
    print("Number of Times to Run: "); #Test input: 10
    my $runAmount = <STDIN>; #Gather the number of times to check the web address

    unless(opendir(DIR, 'C:\\Program Files\\OSNE')) {
        die("Unable to open directory 'C:\\Program Files\\OSNE'\n");
    }

    for(my $i = 0; $i <= $runAmount; $i++) {
        my $file = readdir(DIR);
        closedir(DIR);
        $file = grep(/\.txt$/i, $file); #Filter as to only look for .txt files
        my $searchTable = get($siteDirectory); #Get HTML code from website
        if(defined($searchTable)) {
            $fileName = localtime() . '.txt'; #Set file name to the time it will be created
            $fileName =~ s/:/-/g; #remove the disallowed characters and replace them so that it can be the file name
            open(my $outputFile, '>', $fileName) or die("Couldn't Create File.\n");
            while($searchTable =~ m|<\s*a\s+[^>]*href\s*=\s*['"]([^>"']+)['"][^>]*>\s*([^<>]*)</|sig) { #HTML code title filter regex
                print $outputFile ("$2: $1\n"); #print the titles to the text file
            }
            if(compare($fileName, $file) == 0) {
                close($outputFile); #close output
                unlink($fileName); #delete file
            }
            else {
                close($outputFile);
                move("C:\\Program Files\\OSNE\\'$file'","C:\\Program Files\\OSNE\\filesaves\\'$file'"); #Move the old file to filesave folder and keep the new file in the same directory as the script
                print("Change Detected.\n");
            }
        }
        else {
            print("URL Unaccessible: $siteDirectory\n");
        }
    }
}

main();

I'm new to Perl, and I am trying to make a program that reads a site's HTML (specifically the titles) repeatedly, as many times as the user has specified, and compares each scan with the previous scan of the website by comparing files. If the new file is the same as the old one, delete the newer file. If the file is different, move the old file into the filesaves folder and keep the newer file in the same directory as the script.

The program runs, but it doesn't create the number of files specified by the for loop, doesn't move them to the correct folder, and doesn't delete them. For example, if you specify the number of times to run as 10, you will only have 7 text files.

Console Log:

readdir() attempted on invalid dirhandle DIR at C:\Program Files\OSNE\OSNE.pl line 23, <STDIN> line 2.
closedir() attempted on invalid dirhandle DIR at C:\Program Files\OSNE\CPMonitor.pl line 23, <STDIN> line 2.
Use of uninitialized value $_ in pattern match (m//) at C:\Program Files\OSNE\CPMonitor.pl line 24, <STDIN> line 2.
Change Detected.

Replies are listed 'Best First'.
Re: Help with Web Scraping Script
by 1nickt (Canon) on Oct 19, 2017 at 11:07 UTC

    Hello EagerforPerl,

    Your program does not compile:

    Global symbol "$directory" requires explicit package name (did you forget to declare "my $directory"?) at 1201637.pl line 25.
    Global symbol "$directory" requires explicit package name (did you forget to declare "my $directory"?) at 1201637.pl line 26.
    1201637.pl had compilation errors.
    This is because the unless block creates its own scope: a lexical variable declared with my inside the block is not accessible outside it:
    $ perl -Mstrict -wlE 'unless (0) { my $foo = 42 }; say $foo'
    Global symbol "$foo" requires explicit package name (did you forget to declare "my $foo"?) at -e line 1.
    Execution of -e aborted due to compilation errors.
    If I change the code for opening the directory to:
    opendir my $directory, 'C:\\Program Files\\OSNE' or die("Unable to open directory 'C:\\Program Files\\OSNE'\n");
    ... then I get a warning:
    Name "main::OUTPUT" used only once: possible typo at 1201637.pl line 34.
    1201637.pl syntax OK
    This is because you open your filehandle as $output but try to use it as OUTPUT ...
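
To make the scoping point concrete, here is a minimal sketch (not the OP's script) that declares the directory handle lexically in the opendir call and reads the directory once, in list context. The helper name txt_files_in and the throwaway temp directory are assumptions made up for illustration. Note that grep in scalar context, as in the posted code, returns a count rather than the matching names, which is why the sketch assigns to an array.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return the sorted names of the .txt files in $dir.
sub txt_files_in {
    my ($dir) = @_;

    # Declared in the opendir call itself, $dh stays in scope for the rest
    # of the sub, unlike a variable declared inside an unless block.
    opendir(my $dh, $dir) or die "Unable to open directory '$dir': $!\n";

    # readdir in list context returns every entry; grep filters that list.
    my @txt = grep { /\.txt\z/i } readdir($dh);
    closedir($dh);    # close once, after all reading is done

    return sort @txt;
}

# Demonstration on a throwaway directory so the sketch runs anywhere.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.TXT notes.log)) {
    open(my $fh, '>', "$dir/$name") or die "Can't create '$dir/$name': $!\n";
    close($fh);
}

my @found = txt_files_in($dir);
print "$_\n" for @found;
```

Closing the handle only after the read also avoids the "invalid dirhandle" warnings that appear when closedir runs inside a loop and readdir is called again on the next pass.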

    Also please consider that most websites don't appreciate repeated or frequent polling; I would recommend no more than a daily check if you simply want to see whether a site has new pages.
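
One way to keep repeated checks gentle is to pause between requests. The following is only a sketch under assumptions: poll_politely is a hypothetical helper, the one-second delay is chosen so the demo finishes quickly (a real check would wait far longer), and a stand-in closure replaces LWP::Simple's get so the example runs offline.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch->() $runs times, sleeping $delay seconds
# between calls (but not after the last one). Returns the collected results.
sub poll_politely {
    my ($fetch, $runs, $delay) = @_;
    my @results;
    for my $i (1 .. $runs) {
        push @results, $fetch->();
        sleep($delay) if $i < $runs && $delay > 0;
    }
    return @results;
}

# Stand-in for get($url) from LWP::Simple, so no network is needed here.
my $calls      = 0;
my $fake_fetch = sub { return "page content " . ++$calls };

my @pages = poll_politely($fake_fetch, 3, 1);
print scalar(@pages), " fetches made\n";
```

In a real script the closure would be something like sub { get($url) }, with the delay set to whatever interval the site can reasonably tolerate.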


    The way forward always starts with a minimal test.
      I appreciate your response. The problem with the handle arose when I was trying to implement what the first reply suggested, and I missed one of the instances of OUTPUT, which I have fixed. Still doesn't work though. About your concern regarding frequent polling, this is going to no more than 5 people including myself and we don't plan on using it for very long. This is really just something I'm trying to do for practice. I of course don't want to unknowingly cause a denial of service on somebody's website.
Re: Help with Web Scraping Script
by stevieb (Canon) on Oct 19, 2017 at 01:18 UTC

    Welcome to the Monastery, EagerforPerl!

    You've provided code, that's awesome (so is formatting it well!).

    You're also using (at a quick glance) the majority of proper techniques (strict, warnings, 3-arg open etc ++).

    What I'd ask you to do so the Monks may be better able to help is tell us what the code currently does, and how it deviates from what you're expecting. It would also be beneficial if you could provide the data that you're sending in as standard input so the Monks can test for themselves. If the URLs/input are off-limits somehow, that's understandable too... you'll just have to provide more detail on the expected/problematic situations.

    ps. You do not need sub main {...} in Perl. Unless your file contains only a package (class), the code will run just fine without a main() function. You can just put your code left-justified at the top level (unlike e.g. C).

    pps. I would recommend, despite what I said above, one change to the 3-arg open you use. Bareword file handles (i.e., things like OUTPUT) are global in scope. It is best common practice to use lexical (i.e. scoped) handles instead. To do this, simply assign a scalar variable to hold the handle as opposed to the bareword: open my $fh, '...', '...' or die ...
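
A small sketch of the lexical-handle style, assuming a made-up file name (links.txt) in a temporary directory so it can run anywhere. The write handle $out is visible only inside its block.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/links.txt";    # hypothetical file name for the demo

# Lexical handle: $out exists only inside this block, so it cannot be
# confused with, or clobbered by, a handle of the same name elsewhere.
{
    open(my $out, '>', $path) or die "Couldn't create '$path': $!\n";
    print {$out} "Home: /index.html\n";
    close($out) or die "Couldn't close '$path': $!\n";
}

# Read the line back through a second, independent lexical handle.
open(my $in, '<', $path) or die "Couldn't read '$path': $!\n";
my $line = <$in>;
close($in);

print $line;
```

With a bareword such as OUTPUT, a typo in one place quietly creates a second global handle; with a lexical, strict catches the misspelled variable at compile time.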

      I'm aware I can run it without a main function/subroutine in Perl, it's just a convention that I've decided to willingly borrow from my programming instructor. The bareword file handles are perhaps another convention really, but I will consider what you have advised. Thank you for your response.
Re: Help with Web Scraping Script
by marto (Cardinal) on Oct 19, 2017 at 10:24 UTC

    This isn't a SSCCE (e.g. get is not shown), however, a couple of points. You read from STDIN but don't chomp. You don't print the reason for failure when you call die, e.g. ....or die "Can't open directory: $!\n". Hopefully you have rate limited requests so you're not hammering the sites in question. Also see How do I post a question effectively?
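
Both points can be sketched in a few lines. To let the example run unattended, the keyboard input is simulated with an in-memory filehandle; in the real script the reads would come from <STDIN>. The URL and the nonexistent directory path are placeholders invented for the demo.

```perl
use strict;
use warnings;

# Simulated keyboard input (an assumption for the demo, not real STDIN).
my $typed = "http://example.com/\n10\n";
open(my $in, '<', \$typed) or die "Can't open in-memory input: $!\n";

my $url = <$in>;
chomp($url);     # drop the trailing newline that the Enter key adds
my $runs = <$in>;
chomp($runs);
close($in);

print "URL: '$url', runs: $runs\n";

# Include $! so the message says why the call failed, not just that it did.
my $missing = '/no/such/directory/for/this/demo';    # hypothetical path
my $reason  = '';
opendir(my $dh, $missing)
    or $reason = "Can't open directory '$missing': $!";
print "$reason\n";
```

Without chomp, the URL handed to get still carries the newline, and the run count compares as a number only because Perl ignores trailing whitespace when numifying. The exact text of $! depends on the operating system.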

    Update: Strikeout, see below.

      This isn't a SSCCE (e.g. get is not shown)

      The OP is using LWP::Simple which exports get so it's probably safe to assume that it is that one.

        (Picard) Face palm. I sit corrected. Too early for me.

Node Type: perlquestion [id://1201637]
Approved by Athanasius