PerlMonks  

HTML Crawler

by damian (Beadle)
on Aug 11, 2000 at 07:51 UTC ( [id://27454] )

damian has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks, I want to create a program that crawls through directories and prints the HTML files matching the pattern I am looking for. What I have done is this:
$basedir = "/usr/local/apache/htdocs/";
@htmlfiles = (
    '*.html', 'aboutus/*.html', 'abuse/*.html',
    'abuse/email-aup/*.html', 'abuse/news-aup/*.html', 'admin/*.html',
    'admin/emailpass/*.html', 'admin/loginpass/*.html', 'admin/webpass/*.html',
    'contact/*.html', 'customer/*.html',
    'customer/account/*.html', 'customer/account/changepackage/*.html',
    'customer/account/changeuserid/*.html', 'customer/account/suspend/*.html',
    'customer/account/terminate/*.html', 'customer/account/updateinfo/*.html',
    'customer/billing/*.html', 'customer/billing/billingcycle/*.html',
    'customer/billing/bpiexpress/*.html', 'customer/billing/faq/*.html',
    'customer/billing/payment/*.html', 'customer/billing/statement/*.html',
    'customer/email/*.html', 'customer/usage/*.html', 'download/*.html',
    'exclusives/*.html', 'exclusives/loyalty/*.html',
    'exclusives/megamall/*.html', 'exclusives/prepay/*.html',
    'news/*.html', 'news/intl/*.html', 'products/*.html',
    'products/business/*.html', 'products/corporate/*.html',
    'products/corporate/dedicated-dial/*.html', 'products/corporate/isdn/*.html',
    'products/corporate/leased_line/*.html', 'products/corporate/multi-user/*.html',
    'products/dealers/*.html', 'products/individual/*.html',
    'products/roaming/*.html', 'products/websolutions/*.html',
    'products/websolutions/colocation/*.html',
    'products/websolutions/webhosting/*.html',
    'search/imagesearch/*.html', 'search/lyricssearch/*.html',
    'search/mp3search/*.html', 'search/newssearch/*.html',
    'search/peoplesearch/*.html'
);
I am saving the glob patterns into an array. Is there a way to shorten this one? Thanks.
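One way to shorten the list (a sketch, assuming every entry really is just a directory name plus '*.html') is to store only the directory names and let map add the suffix; @dirs below is a trimmed sample, not the full set:

```perl
#!/usr/bin/perl -w
use strict;

my $basedir = "/usr/local/apache/htdocs/";

# Only the directory parts; '' stands for $basedir itself.
# A trimmed sample of the full list, for illustration.
my @dirs = ('', 'aboutus/', 'abuse/', 'abuse/email-aup/', 'admin/');

# Build the same glob patterns the hand-written list spelled out
my @htmlfiles = map { $basedir . $_ . '*.html' } @dirs;

print "$_\n" for @htmlfiles;
```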

Replies are listed 'Best First'.
Re: HTML Crawler
by chromatic (Archbishop) on Aug 11, 2000 at 08:16 UTC
    Use File::Find. Something like the following is relatively workable:
    #!/usr/bin/perl -w
    use strict;
    use Cwd;
    use File::Find;

    my $type = shift || '.html';

    sub fetch {
        print "$_\n" if $File::Find::name =~ /$type$/;
    }

    find(\&fetch, cwd());
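    One small caveat with /$type$/: the dot in '.html' is a regex metacharacter, so it would also match names ending in, say, 'xhtml'. A sketch of the same loop with the suffix quoted literally via \Q...\E:

```perl
#!/usr/bin/perl -w
use strict;
use Cwd;
use File::Find;

my $type = shift || '.html';

sub fetch {
    # \Q...\E quotes $type, so the '.' matches only a literal dot
    print "$File::Find::name\n" if $File::Find::name =~ /\Q$type\E$/;
}

find(\&fetch, cwd());
```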
RE: HTML Crawler
by t0mas (Priest) on Aug 11, 2000 at 12:13 UTC
    If you want to search *all* directories for html files, go for chromatic's solution. If on the other hand you wish to search *some* directories for html files, I suggest you put your dir/file names in a separate file that you read into an array. Otherwise you will need to put either the "do's" or the "don'ts" in your fetch sub.
    I find a separate file easier to maintain, but that's a matter of taste I guess...

    /brother t0mas
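    t0mas's suggestion might look like the sketch below; 'htmldirs.txt' is a hypothetical list file with one directory name per line, and the example writes a tiny sample file first just so it is self-contained:

```perl
#!/usr/bin/perl -w
use strict;

my $basedir  = "/usr/local/apache/htdocs/";
my $listfile = "htmldirs.txt";   # hypothetical: one directory name per line

# For the sake of a self-contained example, write a tiny list file first
open my $out, '>', $listfile or die "Can't write $listfile: $!";
print $out "aboutus\nabuse\nadmin\n";
close $out;

# Read the directory names back into an array
open my $in, '<', $listfile or die "Can't open $listfile: $!";
chomp(my @dirs = <$in>);
close $in;
unlink $listfile;

# Turn them into the same glob patterns as the hand-written list
my @htmlfiles = map { "$basedir$_/*.html" } @dirs;
print "$_\n" for @htmlfiles;
```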
Re: HTML Crawler
by eak (Monk) on Aug 11, 2000 at 08:32 UTC
    I know its not Perl, but this is one time I think the shell is a little cleaner.
    find /usr/local/apache/htdocs/ -name '*.html'
    --eric
      With find, unlike grep, you can search for the filename patterns you want across different directories.
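      Putting the two ideas together in Perl, here is a sketch that walks a tree and prints only the .html files whose *contents* match a pattern, which is what the original question asked for; the default directory '.' and the pattern 'foo' are stand-ins:

```perl
#!/usr/bin/perl -w
use strict;
use File::Find;

my $dir     = shift || '.';
my $pattern = shift || 'foo';   # stand-in for the pattern you are looking for

my @matches;

sub wanted {
    return unless -f $_ && /\.html$/;
    open my $fh, '<', $_ or return;
    my $text = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    push @matches, $File::Find::name if $text =~ /$pattern/;
}

find(\&wanted, $dir);
print "$_\n" for @matches;
```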
