http://www.perlmonks.org?node_id=447180

rhoadesjn has asked for the wisdom of the Perl Monks concerning the following question:

I'm a novice with Perl, so take it easy on me. I'm trying to open an HTTP directory on our network. What command do I use to open an HTTP directory and read all the directories below it? The opendir command does not read HTTP directories.

Our IS department has moved all the documentation we need for on-call to SharePoint, so I can't just copy and paste from a standard network directory anymore. Urrggh.

Here is a simple sample of code that I have written (this isn't complete)...

    #Open the directory
    opendir (DIR, "http://yoda/Documents/PSG/Customers/")
        || die ("Unable to open directory");

    #Read list of files in the directory
    my @files = readdir(DIR);
    closedir(DIR);

    #Remove the entries . and .. which represent the current and parent directories
    @files = grep { !/^\.\.?$/ } @files;

    #Go through each of the files in the directory
    foreach my $file (@files) {
        #Open subfolder
        opendir (DIR, "http://yoda/Documents/PSG/Customers/$file/Support/sitemap/");
        #code here copies each document from subfolder to c:\sitemaps
    }

Thanks!

Replies are listed 'Best First'.
Re: Opening a web page directory
by brian_d_foy (Abbot) on Apr 12, 2005 at 22:12 UTC

    There is really no such thing as an HTTP directory. URLs look like paths, but they might have nothing to do with files at all. Despite this, servers will sometimes serve up a "directory index".

    Besides guessing at URLs, the only way to figure out which URLs are available is to get someone to tell you (i.e. via a link on a page). Tools like WWW::Mechanize can help you with that. At a lower level, HTML::SimpleLinkExtor can grab links out of HTML.
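    For instance, a minimal WWW::Mechanize sketch along those lines (the URL is the one from the original post, and whether it actually serves an HTML index page is an assumption):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://yoda/Documents/PSG/Customers/');
        die "GET failed: " . $mech->status . "\n" unless $mech->success;

        # links() returns WWW::Mechanize::Link objects for <a>, <area>,
        # <frame>, etc.; url_abs() resolves each against the page's base URL.
        foreach my $link ( $mech->links() ) {
            print $link->url_abs(), "\n";
        }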

    You can also look at various web spiders to see how they do things.

    If you have access to the filesystem (perhaps because it's one of your servers), you can get the list of files from the directory and then construct the URLs. Besides that, maybe a case of beer would convince the IS people to give you a copy of what you need. :)
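    If that's your situation, a rough sketch of the list-then-construct approach (the local path and URL prefix below are guesses, not anything from the original post):

        use strict;
        use warnings;

        # Both of these are guesses -- substitute the real document root
        # and the URL prefix the server maps it to.
        my $docroot = '/var/www/Documents/PSG/Customers';
        my $baseurl = 'http://yoda/Documents/PSG/Customers';

        opendir my $dh, $docroot or die "Can't open $docroot: $!";
        for my $entry ( sort grep { !/^\.\.?$/ } readdir $dh ) {
            print "$baseurl/$entry\n";
        }
        closedir $dh;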

    --
    brian d foy <brian@stonehenge.com>
Re: Opening a web page directory
by elwarren (Priest) on Apr 12, 2005 at 23:17 UTC
    If your webserver is using WebDAV aka Web Folders in the Windows world, you can access the directory using the HTTP::DAV module. This is from the pod:
        use HTTP::DAV;

        $d   = new HTTP::DAV;
        $url = "http://host.org:8080/dav/";

        $d->credentials( -user => "pcollins", -pass  => "mypass",
                         -url  => $url,       -realm => "DAV Realm" );

        $d->open($url)
            or die("Couldn't open $url: " . $d->message . "\n");

        # Recursively get remote my_dir/ to .
        $d->get("my_dir/", ".");
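    If you only need a listing rather than a full copy, a PROPFIND sketch along the same lines might look like this (same placeholder host and credentials as the pod example above; untested against SharePoint):

        use strict;
        use warnings;
        use HTTP::DAV;

        my $d   = HTTP::DAV->new;
        my $url = "http://host.org:8080/dav/";

        $d->credentials( -user => "pcollins", -pass  => "mypass",
                         -url  => $url,       -realm => "DAV Realm" );
        $d->open($url) or die "Couldn't open $url: " . $d->message . "\n";

        # Depth 1 asks the server for the collection's immediate members.
        my $resource = $d->propfind( -url => $url, -depth => 1 );
        if ( $resource && $resource->is_collection ) {
            print $_->get_uri, "\n"
                for $resource->get_resourcelist->get_resources;
        }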
      Thank you for all the responses. Those are all great ideas.

      To explain in more detail: I'm trying to access a Webfolder in Microsoft SharePoint. I don't have permissions on the actual server...just the webfolders. I tried downloading HTTP::DAV to my local Windows XP laptop...but received an error that the PPD was not built for my version of perl. It looks like the easiest solution is to give IS a case of beer and get permissions to the directories on the server rather than the webfolders.

      Thanks again for all the help.
      Jessica
Re: Opening a web page directory
by moot (Chaplain) on Apr 12, 2005 at 21:15 UTC
    It would be nice (sometimes) if perl had this functionality, but sadly it's not that easy. You probably need to investigate WWW::Mechanize and possibly LWP::UserAgent or LWP::Simple. Essentially you will load a page via a web request, parse it for links, and then follow each link, possibly recursively. This sounds a lot like spidering, so you may also want to look into the various Spider modules like WWW::CheckSite::Spider.

    Oh, you could also look at wget and similar tools.
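    To make the fetch-parse-follow idea concrete, here is a rough one-level sketch using LWP::Simple and HTML::LinkExtor (the URL and the c:\sitemaps target come from the original post; that the server returns an HTML index with links is an assumption):

        use strict;
        use warnings;
        use LWP::Simple qw(get getstore);
        use HTML::LinkExtor;
        use URI;

        my $base = 'http://yoda/Documents/PSG/Customers/';
        my $html = get($base) or die "Couldn't fetch $base\n";

        # Collect href attributes from <a> tags, resolved against the base.
        my @urls;
        my $parser = HTML::LinkExtor->new(
            sub {
                my ( $tag, %attr ) = @_;
                push @urls, URI->new_abs( $attr{href}, $base )
                    if $tag eq 'a' && $attr{href};
            }
        );
        $parser->parse($html);

        # Follow each link one level down and save it locally.
        for my $url (@urls) {
            ( my $name = $url->path ) =~ s{.*/}{};
            next unless length $name;    # skip links back to directories
            getstore( $url, "c:/sitemaps/$name" );
        }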

Re: Opening a web page directory
by halley (Prior) on Apr 12, 2005 at 21:21 UTC
    You can't read a directory via HTTP; you can only read a page.

    Some servers will assume you meant to fetch index.html or index.htm or default.htm if you give a directory name. Some servers will also generate an automatic "file index" page when there is no such file in that directory, but there's no guarantee that the index page you receive contains links to all the files in that directory.
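    A quick way to find out which of those behaviors you're getting is simply to request the directory URL and inspect the response; a minimal probe (URL taken from the original post):

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua  = LWP::UserAgent->new;
        my $res = $ua->get('http://yoda/Documents/PSG/Customers/');

        # 200 + text/html is either a real index page or an auto-generated
        # file index; 403 often means directory indexes are switched off.
        print $res->status_line, "  ", $res->content_type, "\n";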

    --
    [ e d @ h a l l e y . c c ]

Re: Opening a web page directory
by ww (Archbishop) on Apr 12, 2005 at 22:06 UTC
    As with some others above, I'm not sure I understand exactly what you're trying to do, but it sounds as though you may be trying to find reference material inside (reasonably named) files in what you call the "http" directory. (Unless they're browser-accessible and findable via some sort of intranet search or index, why would they be in an html-oriented directory?)

    But, trying to follow the code rather than the words, I infer that may be the case because you say
    #code here copies each document from subfolder to c:\sitemaps.
    (I doubt the IT folk will be well pleased if you actually copy all the documents and load up some drive with dupes, so I'll offer some answers on the suspicion you would be satisfied to copy document names and addresses -- in effect, creating your own index.)

    If so, perhaps some of these references will be helpful... and, though I may be waaaay off target, I suggest you read the pod (documentation) for File::Find.
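    A bare-bones File::Find sketch, assuming the share is also reachable as an ordinary filesystem path (the UNC path below is a guess):

        use strict;
        use warnings;
        use File::Find;

        # Only works if the share is mounted/reachable as a filesystem
        # path -- substitute the real one.
        my $root = '//yoda/Documents/PSG/Customers';

        find(
            sub {
                # $File::Find::name holds the full path of the current entry.
                print "$File::Find::name\n" if -f;
            },
            $root
        );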

    Finally, it also might be well to read the Monastery's docs on search and supersearch, in case the answers here come up short on responsiveness.

Re: Opening a web page directory
by sh1tn (Priest) on Apr 12, 2005 at 21:35 UTC
    Not a Perl program, but a quick solution:

    c:\sitemaps\wget -e robots=off -rL -nd Your_site
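
    (For reference: -e robots=off tells wget to ignore robots.txt, -r recurses into linked pages, -L restricts the recursion to relative links, and -nd saves everything into the current directory instead of recreating the remote directory tree.)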