PerlMonks  

Re: questions concerning using perl to monitor webpages

by TVSET (Chaplain)
on May 22, 2003 at 00:39 UTC (#259949)


in reply to questions concerning using perl to monitor webpages

I don't know whether this already exists, but it is trivial to do, if you ask me. You can schedule a job to fetch a page and compare it to the previous copy. You can use Text::Diff to see the differences. Alternatively, you can compute a checksum (such as MD5) and check whether it has changed. I am sure there are a billion other ways to do this.
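The fetch-and-checksum idea above could look roughly like the sketch below. It assumes the core modules HTTP::Tiny and Digest::MD5; the sub names fetch_page and has_changed, and the command-line convention of passing the previous checksum as the second argument, are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use HTTP::Tiny;

# Fetch the body of a page; returns undef if the request fails.
sub fetch_page {
    my ($url) = @_;
    my $res = HTTP::Tiny->new(timeout => 10)->get($url);
    return $res->{success} ? $res->{content} : undef;
}

# True if there is no previous checksum, or if it differs from
# the MD5 of the content just fetched.
sub has_changed {
    my ($old_md5, $content) = @_;
    return 1 unless defined $old_md5;
    return $old_md5 ne md5_hex($content);
}

# Hypothetical usage from a scheduled job:
#   perl check.pl URL [PREVIOUS_MD5]
if (@ARGV) {
    my ($url, $old_md5) = @ARGV;
    my $content = fetch_page($url);
    die "could not fetch $url\n" unless defined $content;
    print has_changed($old_md5, $content) ? "changed\n" : "unchanged\n";
    print md5_hex($content), "\n";    # checksum to keep for the next run
}
```

The scheduling itself (cron or similar) and where the previous checksum is kept are left open here.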

Leonid Mamtchenkov aka TVSET


Re: Re: questions concerning using perl to monitor webpages
by Limbic~Region (Chancellor) on May 22, 2003 at 00:53 UTC
    TVSET,
    Your solution basically compares two files. While this method is valid, it has some flaws. The first is the assumption that the new page you fetch will not be served from a cache somewhere. It would also be a problem if the overall content of the page was the same but something like the <date> was different every day. Of course, this can be argued both ways, but one must assume that "changed" is subjective, not objective.

    I don't have any other solutions. I would probably roll my own very much like you have suggested. Since the number of pages to track could get large, I would probably store the MD5 sum and the URL in a database and that's it. I would code special cases for the problem cases.
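As a rough sketch of storing only the URL and its MD5 sum, the code below uses a tied SDBM_File hash (a core module) as a stand-in for a real database. The sub names open_store and update_checksum and the file name in the usage comment are hypothetical; a proper database via DBI would be the more likely choice for a large number of pages.

```perl
use strict;
use warnings;
use Fcntl;                      # exports O_RDWR and O_CREAT
use SDBM_File;
use Digest::MD5 qw(md5_hex);

# Open (or create) an on-disk hash mapping URL => MD5 checksum.
sub open_store {
    my ($file) = @_;
    tie my %sums, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
        or die "cannot open $file: $!";
    return \%sums;
}

# Record the checksum of $content under $url.
# Returns true if the URL is new or its checksum differs from the stored one.
sub update_checksum {
    my ($store, $url, $content) = @_;
    my $new = md5_hex($content);
    my $old = $store->{$url};
    $store->{$url} = $new;
    return 1 unless defined $old;
    return $old ne $new;
}

# Hypothetical usage:
#   my $store = open_store('page_sums');
#   print "changed\n" if update_checksum($store, $url, $content);
```

Special cases (pages with dates, counters and so on) would still need their own handling before the checksum is computed.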

    This being said - I am betting someone will show a much more elegant and powerful solution.

    Cheers - L~R

      The first is assuming the new page you fetch will not be served up from cache some place.

      That's not a problem with my solution. :) It only needs access to two copies of the site from different times to compare them. :) Verifying that the content was not served from a cache is either the user's headache or yet another add-on to the script. :)

      It would also be a problem if the overall content of the page was the same, but something like the <date> was different every day. Of course, this can be argued both ways, but one must assume that changed is subjective and not objective.

      Well, that was one of the reasons I suggested the use of Text::Diff from the very beginning, since it will minimize the headache. You'll be able to quickly grep away things like dates. :)
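A related way to keep dates from triggering false alarms, which is not the Text::Diff approach described above but a swapped-in alternative, is to strip volatile text before comparing or checksumming. The patterns and the sub names normalize and stable_checksum below are assumptions for illustration; real pages would need their own patterns.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical patterns for text that changes on every fetch.
my @volatile = (
    qr/\b\d{4}-\d{2}-\d{2}\b/,           # dates such as 2003-05-22
    qr/\b\d{1,2}:\d{2}(?::\d{2})?\b/,    # times such as 00:39 or 00:39:05
);

# Remove volatile text and collapse runs of whitespace.
sub normalize {
    my ($text) = @_;
    for my $pattern (@volatile) {
        $text =~ s/$pattern//g;
    }
    $text =~ s/\s+/ /g;
    return $text;
}

# Checksum of the page with the volatile parts removed.
sub stable_checksum {
    my ($text) = @_;
    return md5_hex(normalize($text));
}
```

With this, two fetches that differ only in a timestamp produce the same checksum, while a change elsewhere in the text still shows up.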

      I would probably roll my own very much like you have suggested. Since the number of pages to track could get large, I would probably store the MD5 sum and the URL in a database and that's it.

      You could always start with a hash like:

      my %internet = ( 'url' => 'md5 checksum', );

      Thanks for the feedback anyway. :)

      Leonid Mamtchenkov aka TVSET

Re^2: Does this exist yet?... (eq not diff)
by tye (Cardinal) on May 22, 2003 at 01:07 UTC

    Don't use Algorithm::Diff (nor stuff like Text::Diff that uses it) to simply compare for equality.

    The person wanted to know which pages had changed, not which lines of each page were unchanged, deleted, added, or modified. Going to the effort of finding the longest common subsequence of unchanged lines between the old and new versions of each page could be a huge waste of resources when all you really want back is a simple Boolean value (per page).
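For the plain yes/no answer described above, a minimal sketch could use the core File::Compare module on two saved copies of a page. The sub name file_changed and the idea of keeping the old and new copies as files are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# True if the two saved copies differ, false if they are identical.
# compare() returns 0 for equal files, 1 for different files, -1 on error.
sub file_changed {
    my ($old_file, $new_file) = @_;
    my $rc = compare($old_file, $new_file);
    die "cannot compare $old_file and $new_file: $!\n" if $rc < 0;
    return $rc == 1;
}
```

If both versions are already in memory as strings, a simple `$old ne $new` gives the same Boolean without touching the disk.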

                    - tye

      I agree, it was premature of me to rush to Text::Diff, but the memories of having to do something very similar are still fresh in my head. Surprisingly, simple equality tests will not stay untouched for long in this case. See L~R's comment about dates and other small dynamic bits on websites.

      But your point is taken. :)

      Leonid Mamtchenkov aka TVSET
