http://www.perlmonks.org?node_id=604468

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I need to traverse a remote web tree over HTTP, parse the directory listing that comes back, and pick out the latest or second-to-latest directory shown.

Once I have that, I need to fetch some files within that directory by name (each filename includes the date).

For example, I will see something like this:

   Parent Directory/                      -    Directory
   20060922/         2006-Nov-13 01:11:31 -    Directory
   20060927/         2006-Nov-13 01:16:45 -    Directory
   20061016/         2006-Dec-25 03:16:32 -    Directory
   20061103/         2006-Dec-25 03:18:05 -    Directory
   20061202/         2007-Jan-30 18:07:53 -    Directory
   20061224/         2007-Feb-13 23:23:44 -    Directory
   20070126/         2007-Mar-11 19:16:45 -    Directory
   20070208/         2007-Feb-09 03:04:34 -    Directory
   20070225/         2007-Feb-25 23:44:05 -    Directory

From here, I can see that I want either 20070225 or 20070208, the latest and second-to-latest directories in the tree.
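One way to make that selection concrete is sketched below. It assumes every dated directory appears in the listing as eight digits followed by a slash (YYYYMMDD/), so a plain string sort is also a chronological sort. It sorts on the directory name rather than the modification time, because the two disagree in the listing above (20070126 was modified after 20070225). The function name latest_dirs is made up for illustration.

```perl
use strict;
use warnings;

# Return up to $count dated directory names (YYYYMMDD), newest first.
# Assumption: each entry shows up in the listing text as "YYYYMMDD/".
sub latest_dirs {
    my ($listing, $count) = @_;
    my %seen;
    # Capture eight digits followed by a slash and not preceded by a digit;
    # de-duplicate in case a name appears in both an href and the link text.
    my @dirs = grep { !$seen{$_}++ } $listing =~ m{(?<!\d)(\d{8})/}g;
    # Fixed-width YYYYMMDD names sort chronologically as plain strings.
    my @sorted = sort { $b cmp $a } @dirs;
    $#sorted = $count - 1 if @sorted > $count;
    return @sorted;
}

my $listing = <<'END';
   Parent Directory/                      -    Directory
   20061224/         2007-Feb-13 23:23:44 -    Directory
   20070126/         2007-Mar-11 19:16:45 -    Directory
   20070208/         2007-Feb-09 03:04:34 -    Directory
   20070225/         2007-Feb-25 23:44:05 -    Directory
END

my ($latest, $previous) = latest_dirs($listing, 2);
print "latest: $latest, previous: $previous\n";
# prints "latest: 20070225, previous: 20070208"
```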

Once I know this, I need to traverse into one of those directories and fetch a series of files, which have the date in the filename. These files are enormous (tens of gigabytes each).

What is the best approach to this problem, keeping in mind that it all happens remotely over HTTP and that the ability to resume aborted fetches is critical (à la wget -c)?
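For reference, the resume behaviour of wget -c comes down to an HTTP Range request: ask the server for the bytes from the current local file size onward and append them. The sketch below is a minimal version of that idea, not a finished downloader. It uses the core module HTTP::Tiny instead of LWP only so that it has no non-core dependencies, it assumes the server honours Range requests, and the function name fetch_resumable and the URL in the usage comment are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Fetch $url into $path, continuing from whatever is already on disk.
# $http is any object with an HTTP::Tiny-compatible get() method.
sub fetch_resumable {
    my ($http, $url, $path) = @_;
    my $offset = (-s $path) || 0;    # bytes already on disk, 0 if none

    my $fh;
    my $res = $http->get($url, {
        headers       => { Range => "bytes=$offset-" },
        data_callback => sub {
            my ($chunk, $partial) = @_;
            unless ($fh) {
                # 206 means the server honoured the Range, so append;
                # any other status here means a full body, so start over.
                my $mode = $partial->{status} == 206 ? '>>' : '>';
                open $fh, $mode, $path or die "Cannot open $path: $!";
                binmode $fh;
            }
            print {$fh} $chunk or die "Write to $path failed: $!";
        },
    });
    close $fh if $fh;

    # Assumption: 416 (range not satisfiable) with a non-empty local file
    # means there was nothing left to fetch.
    return 1 if $res->{status} == 416 && $offset > 0;
    return $res->{success} ? 1 : 0;
}

# Usage (hypothetical URL and filename):
# fetch_resumable(HTTP::Tiny->new,
#     'http://example.com/20070208/file-20070208-001.dat',
#     'file-20070208-001.dat') or warn "transfer incomplete\n";
```

Streaming each chunk to disk through the callback keeps memory use flat, which matters for files of this size. A true return only means the request succeeded, so comparing the final local size against the size the server reports would still be a separate check.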

Here is the order of events:

  1. Connect to directory resource and fetch html page that lists directories available
  2. Parse the list, sorting it and picking out the two most recent directories
  3. Traverse into one or the other, starting with second-to-latest, and fetch file-$DATE-001.dat .. n, resuming where required from previous aborted fetches.
  4. Store locally, verifying full transfer, and delete any other local instances of previous directories that remain (thus keeping a "mirror" of only the latest two remote copies).
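
The cleanup half of step 4 could look roughly like the sketch below, which removes every dated local directory except the ones to keep. It assumes the local mirror uses the same YYYYMMDD directory names as the remote side and leaves anything not matching that pattern alone. The function name prune_mirror and the path in the usage comment are made up for illustration, and verifying the transfer itself is not covered here.

```perl
use strict;
use warnings;
use File::Path qw(remove_tree);

# Delete every YYYYMMDD subdirectory of $root that is not listed in @keep.
# Returns the names that were removed.
sub prune_mirror {
    my ($root, @keep) = @_;
    my %keep = map { $_ => 1 } @keep;

    opendir my $dh, $root or die "Cannot read $root: $!";
    # Only touch names that look like dated snapshots; skip everything else.
    my @stale = sort grep { /^\d{8}$/ && !$keep{$_} && -d "$root/$_" }
                readdir $dh;
    closedir $dh;

    remove_tree("$root/$_") for @stale;
    return @stale;
}

# Usage (hypothetical local path):
# my @removed = prune_mirror('/data/mirror', '20070225', '20070208');
```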

Which modules should I be exploring, other than the obvious LWP, WWW::Robot, File::Path, Date::Calc, Date::Manip and such?

Are there any canned routines or snippets somewhere that can help? Or in the absence of that, a tutorial that goes through some of this?