PerlMonks  

web page update notifier

by Juerd (Abbot)
on Jun 17, 2004 at 20:51 UTC ( #367755=sourcecode )
Category: Web stuff
Author/Contact Info foo
Description: Very simple script that keeps me informed of changes in a static page. It runs daily from cron, which emails me any changes or error messages.
#!/usr/bin/perl -w
use strict;
use File::Copy qw(copy);
use LWP::Simple qw(mirror is_error);

my $url = 'http://...';
my $file = '/home/juerd/tmp/foo.html';

copy $file, "$file.old" or warn $!;

my $status = mirror $url, $file;
warn "HTTP $status" if is_error $status;

system qw(diff -u), "$file.old", $file;
 #!/usr/bin/perl -w
 use strict;
+use File::Copy qw(copy);
-use LWP::Simple qw(mirror is_success);
+use LWP::Simple qw(mirror is_error);

 my $url = 'http://...';
 my $file = '/home/juerd/tmp/foo.html';

-rename $file, "$file.old" or warn $!;
+copy $file, "$file.old" or warn $!;

 my $status = mirror $url, $file;
-warn "HTTP $status" unless is_success $status;
+warn "HTTP $status" if is_error $status;

 system qw(diff -u), "$file.old", $file;
Replies are listed 'Best First'.
Re: web page update notifier
by b10m (Vicar) on Jun 17, 2004 at 21:04 UTC

    In this snippet, you actually download the whole file each time you (cron) run(s) the script. Wouldn't it be nicer if you'd just ask for a HEAD and check the "Last-Modified" header and do some local testing on that?

    $ HEAD http://www.server.tld/page.htm | grep "Last-Modified"
    --
    b10m

    All code is usually tested, but rarely trusted.
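
The HEAD-based check suggested above could look roughly like this in Perl. It is only a sketch: the sub names `remote_mtime` and `needs_refetch` are made up for illustration, and it assumes the server sends a usable Last-Modified header. LWP::Simple's `head` returns the modification time as the third element of its list-context result.

```perl
use strict;
use warnings;

# Fetch the Last-Modified time of $url as epoch seconds, or undef if the
# request fails or the server sends no such header. LWP::Simple is loaded
# lazily so the comparison below can be tried without network access.
sub remote_mtime {
    my ($url) = @_;
    require LWP::Simple;
    # In list context head() returns (content type, length, modified
    # time, expires, server); it returns an empty list on failure.
    my (undef, undef, $mtime) = LWP::Simple::head($url);
    return $mtime;
}

# Hypothetical helper: true if the page should be fetched again.
sub needs_refetch {
    my ($remote_mtime, $local_file) = @_;
    return 1 unless defined $remote_mtime;    # no header: cannot tell, so fetch
    my $local_mtime = (stat $local_file)[9];
    return 1 unless defined $local_mtime;     # no local copy yet
    return $remote_mtime > $local_mtime ? 1 : 0;
}
```

A caller would do something like `needs_refetch(remote_mtime($url), $file)` and only download when that returns true.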

      In this snippet, you actually download the whole file each time you (cron) run(s) the script.

      Not true.

      From LWP::UserAgent, that LWP::Simple uses under the hood:

      $ua->mirror( $url, $filename ) This method will get the document identified by $url and store it in file called $filename. If the file already exists, then the request will contain an "If-Modified-Since" header matching the modification time of the file. If the document on the server has not changed since this time, then nothing happens. If the document has been updated, it will be downloaded again. The modification time of the file will be forced to match that of the server.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }
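
One consequence of the conditional request shows up in the status code that mirror returns: 304 (Not Modified) when nothing changed, a 2xx code when a new copy was stored, and 4xx or 5xx on failure. That is presumably why the script warns on is_error rather than on a missing is_success, since a 304 is neither. A minimal sketch of that classification with plain numeric ranges follows; the sub name `classify_mirror_status` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map the HTTP status returned by mirror() to what
# it means for the notifier. The ranges follow HTTP::Status, where
# is_success covers 2xx and is_error covers 4xx and 5xx.
sub classify_mirror_status {
    my ($status) = @_;
    return 'unchanged' if $status == 304;                   # If-Modified-Since matched
    return 'updated'   if $status >= 200 && $status < 300;  # new copy stored
    return 'error'     if $status >= 400 && $status < 600;  # worth a warning
    return 'other';                                         # 1xx or remaining 3xx
}

print "$_ => ", classify_mirror_status($_), "\n" for 200, 304, 404, 500;
```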

        Aren't you defeating the mirror check by renaming $file to "$file.old" before giving $file to the mirror call, by which time it won't exist?


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
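
The difference between the two approaches can be tried out without touching the network. This is a small demonstration only, run in a temporary directory with made-up file names: after copy the original is still there for mirror to take a modification time from, while after rename it is gone.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/foo.html";    # stand-in name, an assumption

# Create a stand-in for the previously mirrored page.
open my $fh, '>', $file or die "Cannot write $file: $!";
print {$fh} "<html>old</html>\n";
close $fh or die "Cannot close $file: $!";

# copy leaves the original in place, so mirror would still find a file
# whose modification time it can send as If-Modified-Since.
copy($file, "$file.old") or die "copy failed: $!";
my $exists_after_copy = -e $file ? 1 : 0;

# rename moves it away: nothing is left for mirror to check against.
rename $file, "$file.bak" or die "rename failed: $!";
my $exists_after_rename = -e $file ? 1 : 0;

print "after copy: $exists_after_copy, after rename: $exists_after_rename\n";
```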
Re: web page update notifier
by rob_au (Abbot) on Jun 18, 2004 at 07:13 UTC
    I've posted something similar previously on this site with the node Scripted Actions upon Page Changes, which may additionally be of interest (although it is somewhat dated now). This code differs in that it employs the Last-Modified header or, where this is unavailable, a message digest of the page, in order to determine page changes.

     

    perl -le "print unpack'N', pack'B32', '00000000000000000000001011100100'"
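
The header-or-digest fallback described above might be sketched like this. It is not the code from the referenced node, only an illustration under assumptions: the helper names are hypothetical, and the page body is assumed to be available as a byte string.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: prefer the Last-Modified value when the server
# sent one, otherwise fall back to an MD5 digest of the page body.
sub page_fingerprint {
    my ($last_modified, $content) = @_;
    return "lm:$last_modified"
        if defined $last_modified && length $last_modified;
    return 'md5:' . md5_hex($content);
}

# The page counts as changed when the fingerprint differs from the one
# recorded on the previous run (or when nothing was recorded yet).
sub fingerprint_changed {
    my ($previous, $current) = @_;
    return 1 unless defined $previous;
    return $previous ne $current ? 1 : 0;
}
```

The prefixes keep the two kinds of fingerprint apart, so a page that stops sending the header is reported as changed once rather than compared against a value of the wrong kind.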

Re: web page update notifier
by ihb (Deacon) on Jun 18, 2004 at 01:25 UTC

    I'd be even happier if you used Text::Diff or something equivalent instead of a system call. :-)

    ihb
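
For reference, a portable variant along these lines might look as follows. It is a sketch that assumes Text::Diff is installed from CPAN (it is not a core module); the wrapper name `unified_diff` is made up.

```perl
use strict;
use warnings;

# Return a unified diff of two inputs using Text::Diff. Each input may be
# a filename or a reference to a string holding the text.
sub unified_diff {
    my ($old, $new) = @_;
    require Text::Diff;    # loaded lazily; CPAN module, not core
    return Text::Diff::diff($old, $new, { STYLE => 'Unified' });
}

# In the notifier this would stand in for the system call:
#   print unified_diff("$file.old", $file);
```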

      I'd be even happier if you used Text::Diff or something equivalent instead of a system call. :-)

      For something that runs once per day, it's not worth the trouble. I even use `cat foo` in scripts like this one.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        In short, my point is that when sharing it with other monks I'd be happier to see a portable snippet since it doesn't require much work to make it that. Of course, it's better to share a non-portable snippet than not share at all; that's why I said "happier" and not "happy".

        It's not worth the trouble for you when you use it, but since this post isn't targeted to you I just figured it would be nice if you patched it so that more could benefit from it. Just as you'd do with any CPAN module you publish.

        ihb

Re: web page update notifier
by zby (Vicar) on Jun 18, 2004 at 08:36 UTC
    In my spare time I am developing a more complicated notifier with a web interface. The additional feature is that it lets you add some regexps to ignore some changes (it is useful for pages that, for example, show the current date somewhere). I plan for it to evolve into something like what RSS does by extracting what is new on the page (with a kind of HTML diff). You can read some documentation for that, download it or try it on my home server at Active Bookmarks Manual.

    I wanted to use it as a replacement for Personal Nodelet - so it has a special (undocumented) feature that links to Perl Monks are internally converted to links to appropriate The Pen pages.

    By the way most current web browsers can notify you about changes to pages in your bookmarks.
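
The idea of ignoring some changes by regexp can be illustrated in a few lines. This is not the code of the notifier described above, only a sketch of the technique; the helper name `strip_ignored` and the example pattern are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every line that matches one of the ignore
# patterns, so volatile content does not show up as a change.
sub strip_ignored {
    my ($text, @ignore) = @_;
    my @kept = grep {
        my $line = $_;
        !grep { $line =~ $_ } @ignore;
    } split /^/m, $text;
    return join '', @kept;
}

my @ignore = (qr/Generated on \d{4}-\d{2}-\d{2}/);    # example pattern
my $old = "<p>News</p>\nGenerated on 2004-06-17\n";
my $new = "<p>News</p>\nGenerated on 2004-06-18\n";

my $same = strip_ignored($old, @ignore) eq strip_ignored($new, @ignore);
print $same ? "no relevant change\n" : "changed\n";
```

Filtering both versions before comparing them means a page whose only difference is the date line is treated as unchanged.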

      By the way most current web browsers can notify you about changes to pages in your bookmarks.

      I don't just want to know that it changed, I want to know exactly which lines were added and removed. There are numerous scripts that do something like this, but creating a new one is MUCH easier than reading manuals of other scripts, because they're all bloated with features I don't need right now.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: web page update notifier
by danielcid (Scribe) on Jul 09, 2004 at 15:31 UTC

    You could use an MD5 hash to check whether the file has
    been modified or not. It is much more accurate and safe than
    only using a diff. In addition, using the MD5 hash will
    make the storage for the file much smaller.

    *someone said to use "HEAD", to check the last modified
    date. This value is not safe/trustworthy.

    []'s

    -DBC
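
A minimal sketch of the digest approach, assuming the page body has already been fetched into a string and that a small state file may be written next to it. The helper name `digest_changed` and the state-file layout are made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: compare the MD5 digest of $content with the digest
# saved in $state_file by the previous run, then save the new digest.
# Returns 1 if the content changed (or no digest was saved yet), else 0.
sub digest_changed {
    my ($content, $state_file) = @_;
    my $current = md5_hex($content);

    my $previous = '';
    if (open my $in, '<', $state_file) {
        $previous = <$in> // '';
        chomp $previous;
        close $in;
    }

    open my $out, '>', $state_file or die "Cannot write $state_file: $!";
    print {$out} "$current\n";
    close $out or die "Cannot close $state_file: $!";

    return $previous eq $current ? 0 : 1;
}
```

Only 32 hex characters are kept per page. The check reports that a page changed, not which lines changed.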

      It is much more accurate and safe than only using a diff.

      Accuracy is irrelevant for text documents. Either a line is the same, or it is not. Besides that, I'm especially interested in *which* lines are different, and how they changed. diff tells me exactly that.

      the last modified date. This value is not safe/trustworthy.

      It has proven to be worthy of my trust.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }
