PerlMonks  

questions concerning using perl to monitor webpages

by Anonymous Monk
on May 22, 2003 at 00:29 UTC (#259948)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need a script that I can run manually that will check several webpages (which I would specify) to see whether they have changed since the last time it checked them and, if so, tell me which one(s). I don't know whether something like this already exists or whether I'd have to write it myself. Does anything like this exist, and is it easy to find and use?

20030522 Edit by Corion : Retitled from "Does this exist yet?..."

Re: questions concerning using perl to monitor webpages
by TVSET (Chaplain) on May 22, 2003 at 00:39 UTC
    I don't know if this exists, but it's trivial to do, if you ask me. You can schedule a job to fetch a page and compare it to the previous copy. You can use Text::Diff to see the differences. Alternatively, you can compute a checksum (like MD5) and check whether it has changed. I am sure there are a billion other ways to do this.
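
    A minimal sketch of the checksum idea, assuming the page content has already been fetched as raw bytes. The name page_changed and the %seen hash are made up for illustration, and a page seen for the first time is reported as changed:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Return true if $content differs from what was last recorded for $url,
# and record the new checksum in %$seen (url => MD5 hex digest).
# $content is assumed to be raw bytes; md5_hex croaks on wide characters.
sub page_changed {
    my ($seen, $url, $content) = @_;
    my $digest = md5_hex($content);
    my $old    = $seen->{$url};
    $seen->{$url} = $digest;
    return !defined $old || $old ne $digest;
}
```

    Typical use would be something like: print "$url changed\n" if page_changed(\%seen, $url, $html);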

    Leonid Mamtchenkov aka TVSET

      TVSET,
      Your solution is basically comparing two files. While this method is valid - there are some flaws. The first is assuming the new page you fetch will not be served up from cache some place. It would also be a problem if the overall content of the page was the same, but something like the <date> was different every day. Of course, this can be argued both ways, but one must assume that changed is subjective and not objective.

      I don't have any other solutions. I would probably roll my own very much like you have suggested. Since the number of pages to track could get large, I would probably store the MD5 sum and the URL in a database and that's it. I would code special cases for the problem cases.

      This being said - I am betting someone will show a much more elegant and powerful solution.

      Cheers - L~R

        The first is assuming the new page you fetch will not be served up from cache some place.

        That's not a problem with my solution. :) It should have access to two copies of the site from different times to compare them. :) Validating that the content was not supplied from a cache or something is either the user's headache or yet another add-on to the script. :)

        It would also be a problem if the overall content of the page was the same, but something like the <date> was different every day. Of course, this can be argued both ways, but one must assume that changed is subjective and not objective.

        Well, that was one of the reasons I suggested the use of Text::Diff from the very beginning, since it will minimize the headache. You'll be able to quickly grep away things like dates. :)
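
        A rough sketch of a related approach, not the Text::Diff one described above: strip the parts that are known to change on every fetch before computing a checksum, so that a date alone does not flag the page. The @volatile patterns and the name stable_digest are hypothetical, and real pages would need their own list of patterns:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical patterns for text that changes on every fetch.
# These are illustrations only; tune them per site.
my @volatile = (
    qr/\b\d{4}-\d{2}-\d{2}\b/,           # ISO dates such as 2003-05-22
    qr/\b\d{1,2}:\d{2}(?::\d{2})?\b/,    # clock times such as 00:29 or 00:29:15
);

# Checksum of the page with the volatile parts removed.
sub stable_digest {
    my ($content) = @_;
    $content =~ s/$_//g for @volatile;
    return md5_hex($content);
}
```

        Two fetches that differ only in the stripped parts then give the same digest, at the cost of missing changes that happen to match one of the patterns.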

        I would probably roll my own very much like you have suggested. Since the number of pages to track could get large, I would probably store the MD5 sum and the URL in a database and that's it.

        You could always start with a hash like:

        my %internet = ( 'url' => 'md5 checksum', );
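
        For the hash to be useful across runs it has to be saved somewhere. One possible sketch, assuming a plain file written with the core Storable module stands in for the database mentioned above; load_state, save_state and the file name are made up for illustration:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Load the url => checksum hash saved by the previous run,
# or start with an empty one if there is no state file yet.
sub load_state {
    my ($file) = @_;
    return retrieve($file) if -e $file;
    return {};
}

# Save the hash so the next run can compare against it.
sub save_state {
    my ($file, $state) = @_;
    store($state, $file);
    return;
}
```

        A run would call load_state('pages.sto') at the start, compare and update the checksums, then call save_state('pages.sto', $state) before exiting.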

        Thanks for the feedback anyway. :)

        Leonid Mamtchenkov aka TVSET

      Don't use Algorithm::Diff (nor stuff like Text::Diff that uses it) to simply compare for equality.

      The person wanted to know which pages had changed, not which lines of each page were unchanged, deleted, added, or modified. Going to the effort of finding the longest common subsequence of unchanged lines between the old and new versions of each page could be a huge waste of resources when all you really want back is a simple Boolean value (per page).

                      - tye

        I agree, it was a bit premature of me to rush in with Text::Diff, but I still remember having to do something very similar. Surprisingly, simple equality tests will not survive untouched for long in this case. See L~R's comment about dates and other small dynamics on websites.

        But your point is taken. :)

        Leonid Mamtchenkov aka TVSET

Re: questions concerning using perl to monitor webpages
by Abigail-II (Bishop) on May 22, 2003 at 00:56 UTC
    The HTTP protocol helps, as there is an "If-Modified-Since" header. So, you could use LWP, create a user agent and an HTTP Request object, set the "If-Modified-Since" header, do the request, and look at the return status of the response. If the status is 304, the content hasn't changed. If the status is 200, the content has changed. For any other status, see RFC 2068.
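
    The steps above might look roughly like this with LWP. This is only a sketch: the sub names are made up for illustration, and $since is assumed to be the time of the previous check in seconds since the epoch:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a GET request that asks the server to answer 304 if the page
# has not been modified since $since (seconds since the epoch).
sub conditional_request {
    my ($url, $since) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->if_modified_since($since);    # formatted as an HTTP date
    return $req;
}

# Map the status code onto the three cases described above.
sub classify_status {
    my ($code) = @_;
    return 'unchanged' if $code == 304;
    return 'changed'   if $code == 200;
    return 'other';
}

# Send the request and return 'unchanged', 'changed' or 'other'.
sub check_page {
    my ($ua, $url, $since) = @_;
    my $res = $ua->request(conditional_request($url, $since));
    return classify_status($res->code);
}
```

    For example, check_page(LWP::UserAgent->new, $url, time - 86400) asks whether the page has changed in the last day. A server that ignores the header will simply answer 200 every time.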

    Abigail

      This is great in principle for truly static pages, but for CGI output the "If-modified-since" is, like, now. So it depends on what kind of pages you're going to fetch if this particular trick will work.

      It's dead easy to use, though, when it works. :-)

      Gary Blackburn
      Trained Killer

Re: questions concerning using perl to monitor webpages
by newrisedesigns (Curate) on May 22, 2003 at 02:55 UTC
Re: questions concerning using perl to monitor webpages
by Anonymous Monk on May 22, 2003 at 03:47 UTC
    Let me see if I can explain better what I need. I work at a videogame news site, and I want a script I can run from my box that will check the press release pages of 20 or 30 different companies and tell me which ones have changed. Then I can pull up the changed pages in my browser to see *what* changed. So I can use LWP::UserAgent to get the Last-Modified header of a page? How would I do that?
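
    One way this could be sketched with LWP::UserAgent is a HEAD request per page, keeping the Last-Modified time of each URL between runs. The sub names here are made up for illustration, and pages that send no Last-Modified header (as dynamically generated ones often do) are skipped rather than reported:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Return the Last-Modified time of $url as seconds since the epoch,
# or undef if the request fails or the server sends no such header.
sub last_modified {
    my ($ua, $url) = @_;
    my $res = $ua->head($url);
    return undef unless $res->is_success;
    return $res->last_modified;
}

# Given url => epoch hashes from the previous run and from this run,
# list the URLs whose Last-Modified time is new or different.
sub changed_urls {
    my ($old, $new) = @_;
    my @changed = grep {
        defined $new->{$_}
            && (!defined $old->{$_} || $old->{$_} != $new->{$_})
    } keys %$new;
    return sort @changed;
}
```

    A run would fill a hash with last_modified($ua, $url) for every URL, pass it to changed_urls together with the hash from the previous run, and print the result.
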
      Completely non-Perl answer.

      However, I've been using Copernic Pro for site monitoring and many other things for a while.

      Yeah, it's not as sexy as rolling your own, but it works.

      No I don't work for them, and there may be other tools that function similarly.

      I, also, have a non-Perl answer. Using recent versions of Mozilla, you can make groups of tabs into bookmarks that go into your personal toolbar folder. Under 'Manage Bookmarks' you can schedule checking for changes (only on a per-bookmark basis, not globally for the whole tab-group) and then make the bookmark blink at you, or automatically open, or some such thing.
Re: questions concerning using perl to monitor webpages
by @ncientgoose (Novice) on May 22, 2003 at 04:18 UTC
    Use this bit to get the page code:

        use strict;
        use warnings;
        use Net::HTTP;

        my $link = 'http://www.whateverlink.com';
        (my $host = $link) =~ s{^.+://}{};    # strip the scheme, leaving the host name

        my $htmlcode = '';
        if (my $s = Net::HTTP->new(Host => $host)) {
            $s->write_request(GET => "/", 'User-Agent' => "Mozilla/5.0");
            my ($code, $mess, %h) = $s->read_response_headers;
            while (1) {
                my $buf;
                my $n = $s->read_entity_body($buf, 1024);
                die "read failed: $!" unless defined $n;
                last unless $n;    # 0 means the end of the body
                $htmlcode .= $buf;
            }
        }

    @ancientgoose
