LukeyBoy has asked for the wisdom of the Perl Monks concerning the following question:

So I'm writing a web grabbing utility, and all web related functions work great. Problem is I want to check a file's timestamp and see if the date and time of the file on the web server is more recent - and then obviously download a new version. Once that's done, I have to set the local file's timestamp to the server's.

I'm writing it under both Windows and Linux, and I've searched and cannot find a way to do this. Thanks in advance!

  • Comment on Changing and checking timestamps for files

Re: Changing and checking timestamps for files
by DamnDirtyApe (Curate) on Jul 03, 2002 at 06:18 UTC

    Update files, you will, hmmm? Web server you use, hmm? Yes... yes. LWP::Simple you seek. Aide you, mirror() will.
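A minimal sketch of what that looks like, assuming LWP::Simple is installed. mirror() sends an If-Modified-Since header based on the local file's timestamp and returns the HTTP status code. As far as I know, it also sets the local file's modification time from the server's Last-Modified header after a successful download, which would cover both halves of the question. The helper name mirror_outcome and the command-line handling are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Translate the HTTP status code returned by mirror() into a short word.
# (mirror_outcome is a hypothetical helper, not part of LWP::Simple.)
sub mirror_outcome {
    my ($code) = @_;
    return 'updated'   if $code == 200;   # a new copy was written
    return 'unchanged' if $code == 304;   # local copy is already current
    return 'failed';                      # anything else: treat as an error
}

# Runs only when called as: script.pl URL LOCALFILE
if (@ARGV == 2) {
    require LWP::Simple;   # loaded lazily; assumed to be installed
    my ($url, $file) = @ARGV;
    my $code = LWP::Simple::mirror($url, $file);
    print "$file: ", mirror_outcome($code), " (HTTP $code)\n";
}
```

Since LWP is pure Perl, the same script should behave the same way on Windows and Linux.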


    _______________
    D a m n D i r t y A p e
Re: Changing and checking timestamps for files
by Anonymous Monk on Jul 03, 2002 at 05:48 UTC
    See stat and utime
      Cool, that's one way to do it. Kinda lame that the file can't exist or else utime will set the stamp to now. Thanks!
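For reference, a small sketch of the stat/utime pair using only core Perl. The sub names local_mtime and set_mtime are made up for illustration. Note that utime does not create anything: the file has to exist already, and the times are whatever epoch values you pass in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the modification time (epoch seconds) of a file,
# or undef if the file cannot be stat'ed (for example, it is missing).
sub local_mtime {
    my ($file) = @_;
    my @st = stat($file);
    return unless @st;
    return $st[9];          # element 9 of stat's list is mtime
}

# Set both access and modification time of an existing file.
# utime returns the number of files it changed, so 1 means success.
sub set_mtime {
    my ($file, $epoch) = @_;
    utime($epoch, $epoch, $file) == 1
        or die "Cannot set times on $file: $!";
    return 1;
}

# Runs only when called as: script.pl FILE EPOCH
if (@ARGV == 2) {
    my ($file, $epoch) = @ARGV;
    set_mtime($file, $epoch);
    print "$file now has mtime ", local_mtime($file), "\n";
}
```

On FAT filesystems timestamps are stored with two-second resolution, so compare with a little tolerance if the files may live there.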
Re: Changing and checking timestamps for files
by dda (Friar) on Jul 03, 2002 at 07:09 UTC
    Try this code:
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my ($content_type, $document_length, $modified_time, $expires, $server)
        = head("http://www.gnu.org/index.html");
    $modified_time = localtime($modified_time) if $modified_time;
    print "Modified: " . ($modified_time ? $modified_time : "Unknown") . "\n";
    exit;
    Note that modified_time is not available for all web servers/pages.
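To actually decide whether to download, the raw epoch value from head() has to be compared with the local file's mtime before it is turned into a string by localtime. A minimal sketch of that comparison, core Perl only; the sub name needs_download and the choice to download when the server sends no modification time are assumptions, not something the code above prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide whether the remote copy should be fetched.
# $remote_mtime is the epoch value from head() (may be undef),
# $file is the path of the local copy.
sub needs_download {
    my ($remote_mtime, $file) = @_;
    return 1 unless -e $file;                # nothing local yet
    return 1 unless defined $remote_mtime;   # no Last-Modified: assume changed
    my $local_mtime = (stat $file)[9];
    return $remote_mtime > $local_mtime ? 1 : 0;
}
```

After a download, setting the local file's mtime to the remote value with utime keeps the next comparison meaningful.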

    --dda

Re: Changing and checking timestamps for files
by flocto (Pilgrim) on Jul 03, 2002 at 07:01 UTC

    If I understand you right, you should read about the If-Modified-Since header field in the HTTP request header. For example:

    If-Modified-Since: Wed, 03 Jul 2002 06:00:00 GMT

    This tells the webserver to send the file only if it has been modified after the given date. Otherwise (if it hasn't been edited since then) the server will answer with the status code 304 (Not Modified) (at least it should. Apache does, that's what I know for sure..)
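A conditional request of this kind is normally made with the If-Modified-Since request header, whose value is an HTTP date in GMT. Here is a rough sketch that builds that value from a local file's mtime and sends it along. The sub name http_date is made up, the formatting is done by hand to avoid locale-dependent day and month names, and the request part assumes LWP::UserAgent is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format epoch seconds as an HTTP date, e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf("%s, %02d %s %04d %02d:%02d:%02d GMT",
        $DAY[$wday], $mday, $MON[$mon], $year + 1900, $hour, $min, $sec);
}

# Runs only when called as: script.pl URL LOCALFILE
if (@ARGV == 2) {
    require LWP::UserAgent;   # loaded lazily; assumed to be installed
    my ($url, $file) = @ARGV;
    my $mtime   = (stat $file)[9];   # undef if the file does not exist
    my @headers = defined $mtime
        ? ('If-Modified-Since' => http_date($mtime))
        : ();
    my $res = LWP::UserAgent->new->get($url, @headers);
    if    ($res->code == 304) { print "Not modified, keeping $file\n" }
    elsif ($res->is_success)  { print "Modified, ", length($res->content), " bytes received\n" }
    else                      { print "Request failed: ", $res->status_line, "\n" }
}
```

Servers are not obliged to honour the header, and dynamically generated pages often ignore it, so a 200 answer does not prove the content changed.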

    For checking the age of the local file(s), use stat or the file test operator -M.

    Regards,
    -octo-

Far OT (was Re: Changing and checking timestamps for) remote (files)
by little (Curate) on Jul 03, 2002 at 23:29 UTC

    Sorry for being really far OT this time, folks, but I think it's an interesting problem, one that cannot be solved with Perl as the only tool, but where, as always, Perl could be helpful.

    You may get wrong information if a document you fetch from a remote server has "old" content but claims to be freshly generated. That happens when the server parsed and changed it, or when a content management system regenerated it, for example after a layout change that affected the document. Even worse, the document may not exist as a file at all, but be generated on request from some data source.

    No, I have no Perl solution to this: searching the page for "Document last modified" or "Last Updated" is not guaranteed to turn up any such info, especially if you also include files in other languages in your search.

    The only thing I can think of would be "something like a web service" via XML-RPC or the like, where a server answers queries for a document URL with an appropriate

    <document mime-type=".." last-modified=".." content="dynamic" />
    But that's just an idea for avoiding conflicts between the subjective and the objective modification date of that document.

    The much simpler approach would be to add an attribute for 'modified' or 'updated' to, let's say, a 'div' tag. For example, any node I visit on perlmonks.org would then carry its date with it (which it actually does, in a form that is easy to parse), even where displaying that date is not wanted. If, for example, you wanted to search through merlyn's Web Techniques columns, it would be helpful if he had added such an attribute to each article. That would also make it possible to mix "old" and "newer" content, so there is no need to go through the old content again. A small example to finish:

    <div id="article" lastmodified="20-02-1989" tmfmt="dd-mm-YYYY">
      <p>paragraph ...</p>
    </div>
    <div id="article" lastmodified="23-04-1996" tmfmt="dd-mm-YYYY">
      <p>paragraph ...</p>
    </div>
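If pages did carry such an attribute, pulling the newest date out with Perl would be short. A rough sketch, assuming the hypothetical lastmodified attribute always uses the dd-mm-YYYY layout shown in the markup example; the sub name newest_lastmodified is made up, and a regex is only good enough here because the markup is that regular. A real page would deserve a proper HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Return the most recent lastmodified="dd-mm-YYYY" value found in $html
# as epoch seconds (midnight GMT), or undef if there is none.
# timegm dies on impossible dates such as 31-02-2002.
sub newest_lastmodified {
    my ($html) = @_;
    my $newest;
    while ($html =~ /lastmodified="(\d{2})-(\d{2})-(\d{4})"/g) {
        my ($day, $month, $year) = ($1, $2, $3);
        my $epoch = timegm(0, 0, 0, $day, $month - 1, $year);
        $newest = $epoch if !defined $newest || $epoch > $newest;
    }
    return $newest;
}
```

The returned value can be compared directly with a local file's mtime from stat.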
    Ok, I'm also aware that PM might be the wrong place for such ideas :-) I'm just trying to do my part to make it a better world ;-)

    I really would like to get others' ideas or opinions about this. Or, if I've overlooked an existing solution, please point me to it.

    Have a nice day
    All decision is left to your taste

      You may get wrong information if a document you fetch from a remote server has "old" content but claims to be freshly generated. That happens when the server parsed and changed it, or when a content management system regenerated it, for example after a layout change that affected the document. Even worse, the document may not exist as a file at all, but be generated on request from some data source.

      I'm not 100% clear on what you think the problem is. If you're trying to detect whether a remote server is presenting new content for a page, and are being foiled by automatically generated timestamps in headers or footers (or elsewhere on the page), and you really, really need to know if content has changed, then I see two options for you.

      First, write page-specific processing code that strips out the dynamic parts. Then, compute an MD5 hash on what's left. If that hash hasn't changed since the last time you looked, you don't have new content.
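A minimal sketch of that first option, using the core Digest::MD5 module. The sub name content_fingerprint and the sample pattern in the usage note are made up; the patterns for the dynamic parts would have to be written per page. It assumes the page content is a byte string, since md5_hex dies on wide characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Strip the page-specific dynamic parts, collapse whitespace,
# and return an MD5 hex digest of what is left.
sub content_fingerprint {
    my ($html, @dynamic_patterns) = @_;
    for my $re (@dynamic_patterns) {
        $html =~ s/$re//g;
    }
    $html =~ s/\s+/ /g;   # ignore pure whitespace changes
    return md5_hex($html);
}
```

For a page whose footer says something like "Generated at 10:15:00", passing qr/Generated at [\d:]+/ as the pattern makes two fetches hash the same as long as nothing else changed. Store the digest from the last run and compare.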

      The other approach is to use Algorithm::Diff to do a diff, then try to get smart (perhaps on a page-by-page basis) about which differences you really care about. For example, if the text fragments that differ look like dates or times, ignore them.
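A rough sketch of that second option. It filters the hunks returned by Algorithm::Diff's diff(), where each hunk is a list of [sign, position, line] entries, and keeps only hunks that still differ after date-like and time-like fragments are masked. The sub names and the two masking patterns are made up for illustration and will certainly need tuning per page; the command-line part assumes Algorithm::Diff is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace fragments that look like times or dates with fixed placeholders.
sub mask_dates {
    my ($line) = @_;
    $line =~ s/\b\d{1,2}:\d{2}(?::\d{2})?\b/<TIME>/g;
    $line =~ s/\b\d{1,4}[-\/.]\d{1,2}[-\/.]\d{1,4}\b/<DATE>/g;
    return $line;
}

# Keep only the hunks whose removed and added lines still differ
# once dates and times are masked. In scalar context: their count.
sub significant_hunks {
    my (@hunks) = @_;
    return grep {
        my $hunk = $_;
        my $old = join "\n", map { mask_dates($_->[2]) } grep { $_->[0] eq '-' } @$hunk;
        my $new = join "\n", map { mask_dates($_->[2]) } grep { $_->[0] eq '+' } @$hunk;
        $old ne $new;
    } @hunks;
}

# Read a file into an array reference of lines.
sub read_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    my @lines = <$fh>;
    return \@lines;
}

# Runs only when called as: script.pl OLDFILE NEWFILE
if (@ARGV == 2) {
    require Algorithm::Diff;   # loaded lazily; assumed to be installed
    my @hunks = Algorithm::Diff::diff(read_lines($ARGV[0]), read_lines($ARGV[1]));
    printf "%d of %d changed hunks look like real changes\n",
        scalar(significant_hunks(@hunks)), scalar(@hunks);
}
```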