PerlMonks  

HTML Crawler

by damian (Beadle)
on Aug 11, 2000 at 07:51 UTC ( [id://27454] )

damian has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks, I want to create a program that crawls through directories and prints the HTML files matching the pattern I am looking for. What I have done is this:
$basedir = "/usr/local/apache/htdocs/";
@htmlfiles = (
    '*.html', 'aboutus/*.html', 'abuse/*.html',
    'abuse/email-aup/*.html', 'abuse/news-aup/*.html', 'admin/*.html',
    'admin/emailpass/*.html', 'admin/loginpass/*.html', 'admin/webpass/*.html',
    'contact/*.html', 'customer/*.html',
    'customer/account/*.html', 'customer/account/changepackage/*.html',
    'customer/account/changeuserid/*.html', 'customer/account/suspend/*.html',
    'customer/account/terminate/*.html', 'customer/account/updateinfo/*.html',
    'customer/billing/*.html', 'customer/billing/billingcycle/*.html',
    'customer/billing/bpiexpress/*.html', 'customer/billing/faq/*.html',
    'customer/billing/payment/*.html', 'customer/billing/statement/*.html',
    'customer/email/*.html', 'customer/usage/*.html', 'download/*.html',
    'exclusives/*.html', 'exclusives/loyalty/*.html',
    'exclusives/megamall/*.html', 'exclusives/prepay/*.html',
    'news/*.html', 'news/intl/*.html', 'products/*.html',
    'products/business/*.html', 'products/corporate/*.html',
    'products/corporate/dedicated-dial/*.html', 'products/corporate/isdn/*.html',
    'products/corporate/leased_line/*.html', 'products/corporate/multi-user/*.html',
    'products/dealers/*.html', 'products/individual/*.html',
    'products/roaming/*.html', 'products/websolutions/*.html',
    'products/websolutions/colocation/*.html',
    'products/websolutions/webhosting/*.html',
    'search/imagesearch/*.html', 'search/lyricssearch/*.html',
    'search/mp3search/*.html', 'search/newssearch/*.html',
    'search/peoplesearch/*.html'
);
I am saving the glob patterns into an array. Is there a way to shorten this one? Thanks.
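One way to shorten the list (a sketch, assuming every entry really is just a directory name plus '*.html') is to store only the directory names and let map add the suffix; @dirs below is a trimmed sample, not the full set:

```perl
#!/usr/bin/perl -w
use strict;

my $basedir = "/usr/local/apache/htdocs/";

# Only the directory parts; '' stands for $basedir itself.
# A trimmed sample of the full list, for illustration.
my @dirs = ('', 'aboutus/', 'abuse/', 'abuse/email-aup/', 'admin/');

# Build the same glob patterns the hand-written list spelled out
my @htmlfiles = map { $basedir . $_ . '*.html' } @dirs;

print "$_\n" for @htmlfiles;
```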

Replies are listed 'Best First'.
Re: HTML Crawler
by chromatic (Archbishop) on Aug 11, 2000 at 08:16 UTC
    Use File::Find. Something like the following is relatively workable:
    #!/usr/bin/perl -w
    use strict;
    use Cwd;
    use File::Find;

    my $type = shift || '.html';

    sub fetch {
        print "$_\n" if $File::Find::name =~ /$type$/;
    }

    find(\&fetch, cwd());
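    One small caveat with /$type$/: the dot in '.html' is a regex metacharacter, so it would also match names ending in, say, 'xhtml'. A sketch of the same loop with the suffix quoted literally via \Q...\E:

```perl
#!/usr/bin/perl -w
use strict;
use Cwd;
use File::Find;

my $type = shift || '.html';

sub fetch {
    # \Q...\E quotes $type, so the '.' matches only a literal dot
    print "$File::Find::name\n" if $File::Find::name =~ /\Q$type\E$/;
}

find(\&fetch, cwd());
```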
RE: HTML Crawler
by t0mas (Priest) on Aug 11, 2000 at 12:13 UTC
    If you want to search *all* directories for html files, go for chromatic's solution. If on the other hand you wish to search *some* directories for html files, I suggest you put your dir/file names in a separate file that you read into an array. Otherwise you will need to put either the "do's" or the "don'ts" in your fetch sub.
    I find a separate file easier to maintain, but that's a matter of taste I guess...

    /brother t0mas
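    t0mas's suggestion might look like the sketch below; 'htmldirs.txt' is a hypothetical list file with one directory name per line, and the example writes a tiny sample file first just so it is self-contained:

```perl
#!/usr/bin/perl -w
use strict;

my $basedir  = "/usr/local/apache/htdocs/";
my $listfile = "htmldirs.txt";   # hypothetical: one directory name per line

# For the sake of a self-contained example, write a tiny list file first
open my $out, '>', $listfile or die "Can't write $listfile: $!";
print $out "aboutus\nabuse\nadmin\n";
close $out;

# Read the directory names back into an array
open my $in, '<', $listfile or die "Can't open $listfile: $!";
chomp(my @dirs = <$in>);
close $in;
unlink $listfile;

# Turn them into the same glob patterns as the hand-written list
my @htmlfiles = map { "$basedir$_/*.html" } @dirs;
print "$_\n" for @htmlfiles;
```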
Re: HTML Crawler
by eak (Monk) on Aug 11, 2000 at 08:32 UTC
    I know its not Perl, but this is one time I think the shell is a little cleaner.
    find /usr/local/apache/htdocs/ -name '*.html'
    --eric
      With find, unlike grep, you can search for the filename patterns you want across different directories.
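      Putting the two ideas together in Perl, here is a sketch that walks a tree and prints only the .html files whose *contents* match a pattern, which is what the original question asked for; the default directory '.' and the pattern 'foo' are stand-ins:

```perl
#!/usr/bin/perl -w
use strict;
use File::Find;

my $dir     = shift || '.';
my $pattern = shift || 'foo';   # stand-in for the pattern you are looking for

my @matches;

sub wanted {
    return unless -f $_ && /\.html$/;
    open my $fh, '<', $_ or return;
    my $text = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    push @matches, $File::Find::name if $text =~ /$pattern/;
}

find(\&wanted, $dir);
print "$_\n" for @matches;
```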
