Re: questions concerning using perl to monitor webpages
by Abigail-II (Bishop) on May 22, 2003 at 00:56 UTC
The HTTP protocol helps here, as there is an "If-Modified-Since"
header. So, you could use LWP: create a user agent and
an HTTP request object, set the "If-Modified-Since" header,
do the request, and look at the return status of the response.
If the status is 304, the content hasn't changed; if the status is 200, it has. For any other status,
see RFC 2068.
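A minimal sketch of that approach, assuming LWP is installed; the URL and the "last checked" timestamp are placeholders:

    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Date qw(time2str);

    my $url        = 'http://www.example.com/press/';
    my $last_check = time() - 24 * 60 * 60;   # when we last looked, e.g. a day ago

    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new(GET => $url);
    $req->header('If-Modified-Since' => time2str($last_check));

    my $res = $ua->request($req);

    if    ($res->code == 304) { print "$url: unchanged\n" }
    elsif ($res->code == 200) { print "$url: changed\n" }
    else                      { print "$url: ", $res->status_line, "\n" }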
Abigail | [reply] |
This is great in principle for truly static pages, but for CGI output the modification time is, like, now, so the server will never answer 304. Whether this particular trick works depends on what kind of pages you're going to fetch.
It's dead easy to use, though, when it works. :-)
Gary Blackburn
Trained Killer
| [reply] |
Re: questions concerning using perl to monitor webpages
by newrisedesigns (Curate) on May 22, 2003 at 02:55 UTC
| [reply] |
Re: questions concerning using perl to monitor webpages
by TVSET (Chaplain) on May 22, 2003 at 00:39 UTC
| [reply] |
Don't use Algorithm::Diff (nor stuff like Text::Diff that uses it) simply to test for equality.
The poster wanted to know which pages had changed, not which lines of each page were unchanged, deleted, added, or modified. Going to the effort of finding the longest common subsequence of unchanged lines between the old and new versions of each page can be a huge waste of resources when all you really want back is a simple Boolean value (per page).
- tye
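A minimal sketch of the Boolean test being described; the page contents here are placeholders:

    my $old_page = 'saved copy of the page';     # placeholder content
    my $new_page = 'freshly fetched copy';       # placeholder content

    # a plain string comparison gives the per-page Boolean directly
    my $changed = ($new_page ne $old_page);
    print "page changed\n" if $changed;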
| [reply] |
I agree, it was a bit hasty of me to rush to Text::Diff, but the memories are still fresh from when I had to do something very similar. Surprisingly, simple equality tests will not survive untouched for long in this case. See L~R's comment about dates and other small dynamic bits on websites.
But point taken. :)
Leonid Mamtchenkov aka TVSET
| [reply] |
TVSET,
Your solution is basically comparing two files. While this method is valid, there are some flaws. The first is assuming that the new page you fetch will not be served up from a cache somewhere. It would also be a problem if the overall content of the page were the same but something like the <date> were different every day. Of course, this can be argued both ways, but one has to accept that "changed" is subjective, not objective.
I don't have any other solutions. I would probably roll my own, very much like you have suggested. Since the number of pages to track could get large, I would probably store just the MD5 sum and the URL in a database and nothing else, and code special cases for the problem pages.
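A minimal sketch of that idea (not L~R's code), assuming LWP::Simple and Digest::MD5 are available; a tied SDBM file stands in for the database, and the URL is a placeholder:

    use LWP::Simple qw(get);
    use Digest::MD5 qw(md5_hex);
    use Fcntl;
    use SDBM_File;

    # url => last seen MD5 sum, kept in a DBM file between runs
    my %seen;
    tie %seen, 'SDBM_File', 'page_checksums', O_RDWR|O_CREAT, 0644
        or die "cannot tie DBM file: $!";

    my $url  = 'http://www.example.com/press/';
    my $page = get($url);

    if (defined $page) {
        my $sum = md5_hex($page);
        if (!exists $seen{$url} or $seen{$url} ne $sum) {
            print "$url has changed\n";
            $seen{$url} = $sum;
        }
    }

    untie %seen;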
This being said - I am betting someone will show a much more elegant and powerful solution.
Cheers - L~R
| [reply] |
The first is assuming the new page you fetch will not be served up from cache some place.
That's not a problem of my solution. :) It just needs access to two copies of the site from different times in order to compare them. :) Validating that the content was not served from a cache or somesuch is either the user's headache or yet another add-on to the script. :)
It would also be a problem if the overall content of the page was the same, but something like the <date> was different every day. Of course, this can be argued both ways, but one must assume that changed is subjective and not objective.
Well, that was one of the reasons I suggested the use of Text::Diff from the very beginning, since it will minimize the headache. You'll be able to quickly grep away things like dates. :)
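A minimal sketch of that filtering, assuming Text::Diff is installed; the file names and the date pattern are made up:

    use Text::Diff;

    # diff yesterday's saved copy against today's fetch
    my $diff = diff('page.old.html', 'page.new.html', { STYLE => 'Unified' });

    # keep only changed lines, minus the diff headers and anything that is
    # just a date, before deciding whether the page "really" changed
    my @real_changes = grep { /^[+-]/ and !/^[+-]{3} / and !/\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/ }
                       split /\n/, $diff;

    print "page really changed\n" if @real_changes;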
I would probably roll my own very much like you have suggested. Since the number of pages to track could get large, I would probably store the MD5 sum and the URL in a database and that's it.
You could always start off with a hash like:
my %internet = (
    'url' => 'md5 checksum',
);
Thanks for the feedback anyway. :)
Leonid Mamtchenkov aka TVSET | [reply] [d/l] |
Re: questions concerning using perl to monitor webpages
by @ncientgoose (Novice) on May 22, 2003 at 04:18 UTC
You can use this bit to get the page source:
@ancientgoose
use Net::HTTP;

my $link = 'http://www.whateverlink.com';

# Net::HTTP wants a bare host name, so strip the scheme if there is one
$link =~ s{.+://(.+)}{$1} if $link =~ m{://};

my $s = Net::HTTP->new(Host => $link) || 0;

my $htmlcode = '';
if ($s) {
    $s->write_request(GET => "/", 'User-Agent' => "Mozilla/5.0");
    my ($code, $mess, %h) = $s->read_response_headers;

    my $buf;
    while (my $n = $s->read_entity_body($buf, 1024)) {
        $htmlcode .= $buf;
    }
}
| [reply] [d/l] |
Re: questions concerning using perl to monitor webpages
by Anonymous Monk on May 22, 2003 at 03:47 UTC
Lemme see if I can explain better what I need... see, I work at a video game news site, and I want a script I can run from my box that will check the press release pages of like 20 or 30 different companies and tell me which ones have been changed. Then I can pull up the changed pages in my browser to see *what* changed... So I can use LWP::UserAgent to get the Last-Modified string of a page? How would I do that? | [reply] |
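A minimal sketch of the LWP::UserAgent check asked about above, assuming LWP is installed; the list of URLs is made up:

    use LWP::UserAgent;

    my @pages = (
        'http://www.example.com/press/',
        'http://www.example.org/news/releases.html',
    );

    my $ua = LWP::UserAgent->new;

    for my $url (@pages) {
        my $res = $ua->head($url);
        my $lm  = $res->header('Last-Modified');
        if ($res->is_success and defined $lm) {
            print "$url last modified: $lm\n";
        }
        else {
            # dynamic pages often send no Last-Modified; fall back to comparing content
            print "$url: no Last-Modified header\n";
        }
    }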
Completely non-Perl answer.However, I've been using Copernic Pro for site monitoring and many other things for a while. Yeah, it's not as sexy as rolling your own, but it works. No I don't work for them, and there may be other tools that function similarly.
| [reply] |
I, also, have a non-Perl answer. Using recent versions of Mozilla, you can make groups of tabs into bookmarks that go into your personal toolbar folder. Under 'Manage Bookmarks' you can schedule checking for changes (only on a per-bookmark basis, not globally for the whole tab-group) and then make the bookmark blink at you, or automatically open, or some such thing.
| [reply] |