http://www.perlmonks.org?node_id=604468

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I need to traverse a web tree remotely over HTTP, parse the directory listing that comes back, and grab the latest or second-to-latest directory displayed.

Once I have that, I need to fetch some files within that directory by name (the date is part of the filename).

For example, I will see something like this:

   Parent Directory/                      -    Directory
   20060922/         2006-Nov-13 01:11:31 -    Directory
   20060927/         2006-Nov-13 01:16:45 -    Directory
   20061016/         2006-Dec-25 03:16:32 -    Directory
   20061103/         2006-Dec-25 03:18:05 -    Directory
   20061202/         2007-Jan-30 18:07:53 -    Directory
   20061224/         2007-Feb-13 23:23:44 -    Directory
   20070126/         2007-Mar-11 19:16:45 -    Directory
   20070208/         2007-Feb-09 03:04:34 -    Directory
   20070225/         2007-Feb-25 23:44:05 -    Directory

From here, I can see that I want either 20070225 or 20070208 as the latest and second-to-latest directories in the tree.

Once I know this, I need to traverse into one of those directories and fetch a series of files, which have the date in the filename. These files are enormous (tens of gigabytes each).

What is the best approach to this problem, keeping in mind that this is over HTTP, remotely, and that the ability to resume aborted fetches is highly critical (à la wget -c)?

Here is the order of events:

  1. Connect to the directory resource and fetch the HTML page that lists the available directories.
  2. Parse the list, sorting it and picking out the two most recent directories.
  3. Traverse into one or the other, starting with the second-to-latest, and fetch file-$DATE-001.dat .. n, resuming where required from previously aborted fetches.
  4. Store locally, verifying the full transfer, and delete any other local instances of previous directories that remain (thus keeping a "mirror" of only the latest two remote copies).

Which modules should I be exploring, other than the obvious LWP, WWW::Robot, File::Path, Date::Calc, Date::Manip and such?

Are there any canned routines or snippets somewhere that can help? Or in the absence of that, a tutorial that goes through some of this?


Re: Traversing directories to get the "most-recent" or "second-to-most-recent" directory contents
by sgifford (Prior) on Mar 13, 2007 at 05:04 UTC
    This looks pretty straightforward to me. You seem to have described a pretty good approach. You might want to look at HTML::Parser to parse the HTML, and Date::Parse to parse the dates.

    The only part that sounds tricky is resuming the download. It looks like the HTTP 1.1 Range header should let you do what you want, if the server supports it. See RFC 2616 sec. 14.35 for the details.
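
    A quick way to find out whether a given server honors Range at all is to ask for a small slice and see whether a 206 comes back. A minimal sketch with plain LWP (the URL is made up; the point is only the Range header and the 206 check):

        use strict;
        use warnings;
        use LWP::UserAgent;

        # Hypothetical URL, standing in for one of the real .dat files.
        my $url = 'http://example.com/20070225/file-20070225-001.dat';

        my $ua = LWP::UserAgent->new;
        $ua->max_size(1024);    # don't pull down gigabytes if the server ignores Range

        # Ask for just the first kilobyte of the resource.
        my $res = $ua->get( $url, 'Range' => 'bytes=0-1023' );

        if ( $res->code == 206 ) {       # 206 Partial Content: Range was honored
            print "ranges supported: ", $res->header('Content-Range'), "\n";
        }
        elsif ( $res->code == 200 ) {    # server ignored the Range header
            print "server sent the whole file; no resume support\n";
        }
        else {
            print "request failed: ", $res->status_line, "\n";
        }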

Re: Traversing directories to get the "most-recent" or "second-to-most-recent" directory contents
by ikegami (Patriarch) on Mar 13, 2007 at 03:37 UTC

    Wouldn't it be easier if you could just call a CGI script that returns the URIs of the desired files?

    I used such a system to download log files a while ago. get_list.cgi would return a list of available log files, and get_log.cgi?file=... would get a particular file (since they weren't in a web-accessible directory).

      It would, if the server were my own... but it isn't. I don't have access to the server's namespace or anything on it. It's run by another organization with which I have no direct (public) affiliation.

      Update: Ironically, that's almost exactly what was suggested in this node from abaxaba.

Re: Traversing directories to get the "most-recent" or "second-to-most-recent" directory contents
by Limbic~Region (Chancellor) on Mar 13, 2007 at 12:50 UTC
    hacker,
    This seems extremely straightforward.

    Step 1:
    To fetch the first page, I would probably use WWW::Mechanize, provided that you comply with the site's robots.txt.

    Step 2:
    To parse the directory listing page, I would use HTML::Parser, HTML::TokeParser::Simple, or HTML::TableContentParser - depending on what is appropriate.
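
    Putting steps 1 and 2 together, a rough sketch of the fetch-and-parse, assuming an Apache-style index page like the one in the original post (the base URL and the YYYYMMDD/ link pattern are assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use HTML::TokeParser::Simple;

        # Hypothetical base URL standing in for the real site.
        my $base = 'http://example.com/pub/dumps/';

        # Note: WWW::Mechanize does not check robots.txt for you.
        my $mech = WWW::Mechanize->new( autocheck => 1 );
        $mech->get($base);    # step 1: fetch the index page

        # Step 2: walk the anchors and keep hrefs that look like YYYYMMDD/ subdirectories.
        my $parser = HTML::TokeParser::Simple->new( string => $mech->content );
        my @dirs;
        while ( my $token = $parser->get_tag('a') ) {
            my $href = $token->get_attr('href') or next;
            push @dirs, $1 if $href =~ m{^(\d{8})/$};
        }

        print "found directories: @dirs\n";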

    Step 3:
    The directories, if in the YYYYMMDD format you describe, will sort correctly without the use of any module. Unless the listing were more than a few dozen entries, I wouldn't even bother with a modified high-water-mark algorithm (I would just sort). If you do end up needing a date module, DateTime is the way to go.
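
    For example, continuing the sketch above (the sample @dirs is just for illustration):

        # For YYYYMMDD names, a plain string sort is also a chronological sort.
        my @dirs = qw( 20060922 20070126 20070225 20070208 );
        my @newest_first = sort { $b cmp $a } @dirs;
        my ( $latest, $second_latest ) = @newest_first[ 0, 1 ];
        print "latest: $latest, second-to-latest: $second_latest\n";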

    Step 4:
    Again, I would use WWW::Mechanize to go into the desired directory and get a contents listing. It should be easy to use the links() method to retrieve a list of files and then a simple regex to filter that list for files containing the date in question.
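
    Something along these lines, with the base URL and directory name carried over from the hypothetical sketches above:

        use strict;
        use warnings;
        use WWW::Mechanize;

        # Hypothetical base URL and chosen directory.
        my $base = 'http://example.com/pub/dumps/';
        my $dir  = '20070208';

        my $mech = WWW::Mechanize->new( autocheck => 1 );
        $mech->get( $base . "$dir/" );

        # links() gives every anchor on the index page; filter for the
        # file-YYYYMMDD-NNN.dat names described in the original question.
        my @files = grep { m/file-\Q$dir\E-\d+\.dat$/ }
                    map  { $_->url_abs->as_string }
                    $mech->links;

        print "$_\n" for @files;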

    Step 5:
    This is the hard part. I do not know of any modules on CPAN that handle resuming downloads for you, so you may have to implement this feature yourself. You could try WWW::Curl, which is a wrapper around libcurl. In any case, this shouldn't be too difficult.
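
    If you do end up rolling it yourself on top of LWP rather than reaching for WWW::Curl, one approach is to send a Range header based on how much of the file is already on disk and append whatever comes back. A sketch only: it assumes the server honors Range (per the RFC section cited above), and the URL, local filename, and the fetch_resumable helper are made up.

        use strict;
        use warnings;
        use LWP::UserAgent;

        # Hypothetical helper: start, or resume, a download of $url into $file.
        sub fetch_resumable {
            my ( $url, $file ) = @_;
            my $have = -e $file ? -s $file : 0;    # bytes already on disk

            open my $fh, '>>', $file or die "open $file: $!";
            binmode $fh;

            my $ua  = LWP::UserAgent->new;
            my $res = $ua->get(
                $url,
                ( $have ? ( 'Range' => "bytes=$have-" ) : () ),
                ':content_cb' => sub {
                    my ( $chunk, $response ) = @_;
                    # If we asked for a partial fetch but got a full one back,
                    # bail out rather than appending duplicate data.
                    die "server ignored Range header\n"
                        if $have && $response->code != 206;
                    print {$fh} $chunk;
                },
            );
            close $fh or die "close $file: $!";

            # LWP records an exception thrown inside the callback in X-Died.
            die $res->header('X-Died') if $res->header('X-Died');
            return $res;
        }

        my $res = fetch_resumable(
            'http://example.com/pub/dumps/20070208/file-20070208-001.dat',
            'file-20070208-001.dat',
        );
        print $res->is_success ? "fetched (or resumed) OK\n"
                               : "failed: " . $res->status_line . "\n";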

    Cheers - L~R