http://www.perlmonks.org?node_id=447180

rhoadesjn has asked for the wisdom of the Perl Monks concerning the following question:

I'm a novice with Perl, so take it easy on me. I'm trying to open an HTTP directory on our network. What command do I use to open an HTTP directory and read all the directories below it? The opendir command does not read HTTP directories.

Our IS department has moved all the documentation we need for on-call to SharePoint, so I can't just copy and paste from a standard network directory anymore. Urrggh.

Here is a simple sample of code that I have written (this isn't complete)...

    #Open the directory
    opendir (DIR, "http://yoda/Documents/PSG/Customers/")
        || die ("Unable to open directory");

    #Read list of files in the directory
    my @files = readdir(DIR);
    closedir(DIR);

    #Remove the entries . and .. which represent the current and parent directories
    @files = grep { !/^\.\.?$/ } @files;

    #Go through each of the files in the directory
    foreach my $file (@files) {
        #Open subfolder
        opendir (DIR, "http://yoda/Documents/PSG/Customers/$file/Support/sitemap/");
        #code here copies each document from subfolder to c:\sitemaps
    }

Thanks!

Replies are listed 'Best First'.
Re: Opening a web page directory
by brian_d_foy (Abbot) on Apr 12, 2005 at 22:12 UTC

    There is really no such thing as an HTTP directory. URLs look like paths, but they might have nothing to do with files at all. Despite this, servers will sometimes serve up a "directory index".

    Besides guessing at URLs, the only way to figure out which URLs are available is to get someone to tell you (i.e. via a link on a page). Tools like WWW::Mechanize can help you with that. At a lower level, HTML::SimpleLinkExtor can grab links out of HTML.
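    For instance, a minimal WWW::Mechanize sketch along those lines (the URL is the one from the original post, and whether it actually serves an HTML index page is an assumption):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://yoda/Documents/PSG/Customers/');
        die "GET failed: " . $mech->status . "\n" unless $mech->success;

        # links() returns WWW::Mechanize::Link objects for <a>, <area>,
        # <frame>, etc.; url_abs() resolves each against the page's base URL.
        foreach my $link ( $mech->links() ) {
            print $link->url_abs(), "\n";
        }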

    You can also look at various web spiders to see how they do things.

    If you have access to the filesystem (perhaps because it's one of your servers), you can get the list of files from the directory and then construct the URLs. Besides that, maybe a case of beer would convince the IS people to give you a copy of what you need. :)
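    If that's your situation, a rough sketch of the list-then-construct approach (the local path and URL prefix below are guesses, not anything from the original post):

        use strict;
        use warnings;

        # Both of these are guesses -- substitute the real document root
        # and the URL prefix the server maps it to.
        my $docroot = '/var/www/Documents/PSG/Customers';
        my $baseurl = 'http://yoda/Documents/PSG/Customers';

        opendir my $dh, $docroot or die "Can't open $docroot: $!";
        for my $entry ( sort grep { !/^\.\.?$/ } readdir $dh ) {
            print "$baseurl/$entry\n";
        }
        closedir $dh;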

    --
    brian d foy <brian@stonehenge.com>
Re: Opening a web page directory
by elwarren (Priest) on Apr 12, 2005 at 23:17 UTC
    If your webserver is using WebDAV aka Web Folders in the Windows world, you can access the directory using the HTTP::DAV module. This is from the pod:
        use HTTP::DAV;

        $d   = new HTTP::DAV;
        $url = "http://host.org:8080/dav/";

        $d->credentials( -user => "pcollins", -pass  => "mypass",
                         -url  => $url,       -realm => "DAV Realm" );

        $d->open($url)
            or die("Couldn't open $url: " . $d->message . "\n");

        # Recursively get remote my_dir/ to .
        $d->get("my_dir/", ".");
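    If you only need a listing rather than a full copy, a PROPFIND sketch along the same lines might look like this (same placeholder host and credentials as the pod example above; untested against SharePoint):

        use strict;
        use warnings;
        use HTTP::DAV;

        my $d   = HTTP::DAV->new;
        my $url = "http://host.org:8080/dav/";

        $d->credentials( -user => "pcollins", -pass  => "mypass",
                         -url  => $url,       -realm => "DAV Realm" );
        $d->open($url) or die "Couldn't open $url: " . $d->message . "\n";

        # Depth 1 asks the server for the collection's immediate members.
        my $resource = $d->propfind( -url => $url, -depth => 1 );
        if ( $resource && $resource->is_collection ) {
            print $_->get_uri, "\n"
                for $resource->get_resourcelist->get_resources;
        }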
      Thank you for all the responses. Those are all great ideas.

      To explain in more detail: I'm trying to access a Webfolder in Microsoft SharePoint. I don't have permissions on the actual server...just the webfolders. I tried downloading HTTP::DAV to my local Windows XP laptop...but received an error that the PPD was not built for my version of perl. It looks like the easiest solution is to give IS a case of beer and get permissions to the directories on the server rather than the webfolders.

      Thanks again for all the help.
      Jessica
Re: Opening a web page directory
by moot (Chaplain) on Apr 12, 2005 at 21:15 UTC
    It would be nice (sometimes) if perl had this functionality, but sadly it's not that easy. You probably need to investigate WWW::Mechanize and possibly LWP::UserAgent or LWP::Simple. Essentially you will load a page via a web request, parse it for links, and then follow each link, possibly recursively. This sounds a lot like spidering, so you may also want to look into the various Spider modules like WWW::CheckSite::Spider.

    Oh, you could also look at wget and similar tools.
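    To make the fetch-parse-follow idea concrete, here is a rough one-level sketch using LWP::Simple and HTML::LinkExtor (the URL and the c:\sitemaps target come from the original post; that the server returns an HTML index with links is an assumption):

        use strict;
        use warnings;
        use LWP::Simple qw(get getstore);
        use HTML::LinkExtor;
        use URI;

        my $base = 'http://yoda/Documents/PSG/Customers/';
        my $html = get($base) or die "Couldn't fetch $base\n";

        # Collect href attributes from <a> tags, resolved against the base.
        my @urls;
        my $parser = HTML::LinkExtor->new(
            sub {
                my ( $tag, %attr ) = @_;
                push @urls, URI->new_abs( $attr{href}, $base )
                    if $tag eq 'a' && $attr{href};
            }
        );
        $parser->parse($html);

        # Follow each link one level down and save it locally.
        for my $url (@urls) {
            ( my $name = $url->path ) =~ s{.*/}{};
            next unless length $name;    # skip links back to directories
            getstore( $url, "c:/sitemaps/$name" );
        }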

Re: Opening a web page directory
by halley (Prior) on Apr 12, 2005 at 21:21 UTC
    You can't read a directory via HTTP; you can only read a page.

    Some servers will assume you meant to fetch index.html or index.htm or default.htm if you give a directory name. Some servers will also generate an automatic "file index" page when there is no such file in that directory, but there's no guarantee that the index page you receive contains links to all the files in that directory.
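    A quick way to find out which of those behaviors you're getting is simply to request the directory URL and inspect the response; a minimal probe (URL taken from the original post):

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua  = LWP::UserAgent->new;
        my $res = $ua->get('http://yoda/Documents/PSG/Customers/');

        # 200 + text/html is either a real index page or an auto-generated
        # file index; 403 often means directory indexes are switched off.
        print $res->status_line, "  ", $res->content_type, "\n";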

    --
    [ e d @ h a l l e y . c c ]

Re: Opening a web page directory
by ww (Archbishop) on Apr 12, 2005 at 22:06 UTC
    As with some others above, I'm not sure I understand exactly what you're trying to do, but it sounds as though you may be trying to find reference material inside (reasonably named) files in what you call the "http" directory. (Unless they're browser-accessible and findable via some sort of intranet search or index, why would they be in an html-oriented directory?)

    But, trying to follow the code rather than the words, I infer that may be the case because you say
    #code here copies each document from subfolder to c:\sitemaps.
    (I doubt the IT folk will be well pleased if you actually copy all the documents and load up some drive with dupes, so I'll offer some answers on the suspicion you would be satisfied to copy document names and addresses -- in effect, creating your own index.)

    If so, perhaps some of these references will be helpful... and, though I may be waaaay off target, I suggest you read the pod (documentation) for File::Find.
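    A bare-bones File::Find sketch, assuming the share is also reachable as an ordinary filesystem path (the UNC path below is a guess):

        use strict;
        use warnings;
        use File::Find;

        # Only works if the share is mounted/reachable as a filesystem
        # path -- substitute the real one.
        my $root = '//yoda/Documents/PSG/Customers';

        find(
            sub {
                # $File::Find::name holds the full path of the current entry.
                print "$File::Find::name\n" if -f;
            },
            $root
        );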

    Finally, it also might be well to read the Monastery's docs on search and supersearch, in case the answers here come up short on responsiveness.

Re: Opening a web page directory
by sh1tn (Priest) on Apr 12, 2005 at 21:35 UTC
    Not a Perl program, but a quick solution:

    c:\sitemaps\wget -e robots=off -rL -nd Your_site
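
    (For reference: -e robots=off tells wget to ignore robots.txt, -r recurses into linked pages, -L restricts the recursion to relative links, and -nd saves everything into the current directory instead of recreating the remote directory tree.)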