PerlMonks  

Link Hunter

by wizbancp (Sexton)
on Feb 13, 2007 at 15:39 UTC ( #599702=sourcecode )
Category: Web Stuff
Author/Contact Info wizbancp
Description: A script for exploring a site and catching links. Simply specify the starting URL and the search depth (sorry for my English! :-)). At the end, the script produces a text file, link.txt, containing the addresses it caught.
After the criticism (:-)) I modified the script so that it catches only link addresses and no longer email addresses... =:-( Usage: "script.pl url depth" or simply "script.pl"
#!/usr/bin/perl -w

use LWP::UserAgent;

# link.txt collects every URL found during the crawl.
open LINK,  ">", "link.txt" or die "Cannot open link.txt: $!\n";

if (!@ARGV)
{
    print "Insert starting URL: ";
    $indirizzo=<STDIN>;
    chomp($indirizzo);
    print "\nInsert searching depth: ";
    $profond=<STDIN>;
    chomp($profond);
}
else
{
    $indirizzo = $ARGV[0];
    $profond = defined $ARGV[1] ? $ARGV[1] : 0;   # no depth given: read only the start page
}

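# Prepend the scheme only when the user did not type one.
my $indirizzohttp = $indirizzo =~ m{^https?://}i ? $indirizzo : "http://".$indirizzo;
my @elencolink = ($indirizzohttp);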

my $ua = LWP::UserAgent->new; 
$ua->agent('WizCaptureBot/1.11');
$ua->timeout(10);
$ua->env_proxy;

sub pausa #pausing the script before ending
{
   print "\nPress Enter to exit.\n";
   my $pausa = <STDIN>;
} 


sub catturalink # extract every http/https URL from the page source
{
   my $codice = shift;
   my $cont = 0;

   # Keep the pattern on one line; a line break inside a character class is matched literally.
   while ($codice =~ m/(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,\@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?/g)
    {
       my $indirizzolink = $&;
       $cont++;
       print LINK "$indirizzolink\n";
       push @elencolink, $indirizzolink;
    }

   print "Found $cont links\n";
}

sub visitapagina #capture the site code
{
    my $pagina = shift;
    
    my $response = $ua->get($pagina);
    if ($response->is_success)
    {
        my $codicehtml = $response->content;
        print "\n -- $pagina --\n";
        catturalink($codicehtml);
    }
    else
    {
        print "\n -- $pagina --\n";
        print $response->status_line."\n";
    }
}

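# $inizio and $fine delimit the slice of @elencolink found at the previous level.
my $inizio=0;
my $fine=0;

visitapagina($elencolink[0]);
while ($profond > 0)   # "> 0" also stops on a negative depth
{
    $profond--;
    $inizio=$fine+1;
    $fine = scalar(@elencolink)-1;
    for my $c ($inizio .. $fine)
    {
        print "\n$inizio  $c  $fine";
        visitapagina($elencolink[$c]);
    }
}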

print "\n Operation ended! \n";
close LINK;    # flush link.txt before waiting for the user

pausa();
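
The script above pushes every URL it finds onto @elencolink without checking whether it is already there, so the same address can be written to link.txt and visited more than once. Here is a minimal sketch of one way to avoid that with a %seen hash. It is not the author's code: crawl, extract_links and %fake_site are hypothetical names, %fake_site stands in for real LWP requests so the sketch runs without a network, and the URL pattern is a simplified stand-in for the original regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for $ua->get(...)->content: maps a URL to page text.
my %fake_site = (
    'http://a.example/' => 'see http://b.example/ and http://c.example/',
    'http://b.example/' => 'back to http://a.example/ or http://c.example/',
    'http://c.example/' => 'on to http://d.example/',
    'http://d.example/' => 'no links here',
);

# Return the http/https URLs found in a string (simplified pattern).
sub extract_links {
    my ($text) = @_;
    return $text =~ m{(https?://[^\s"'<>]+)}g;
}

# Visit pages level by level, like the original loop, but queue each URL once.
# $fetch is a code reference that returns the page text, or undef on failure.
sub crawl {
    my ($start, $depth, $fetch) = @_;
    my %seen  = ($start => 1);
    my @level = ($start);    # pages to read at the current level
    my @found = ($start);    # every distinct URL, in discovery order

    for (0 .. $depth) {
        my @next;
        for my $url (@level) {
            my $text = $fetch->($url);
            next unless defined $text;
            for my $link (extract_links($text)) {
                next if $seen{$link}++;    # already known: do not queue again
                push @next,  $link;
                push @found, $link;
            }
        }
        last unless @next;
        @level = @next;
    }
    return @found;
}

my @urls = crawl('http://a.example/', 1, sub { $fake_site{ $_[0] } });
print "$_\n" for @urls;
```

To try the same idea against real pages, the anonymous sub passed to crawl could wrap $ua->get and return the content only when is_success is true.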

Re: Link & Email Hunter
by merlyn (Sage) on Feb 13, 2007 at 16:33 UTC
      Pebbles and Bam Bam on a Friday night
      Tried to get to Heaven on a paper kite
      Lightning struck and down they fell
      Instead of going to Heaven they went straight to Hell
      Singing Yabba, Dabba, Dabba Doo
      Singing Yabba Dabba Dabba Dabba Dabba Doo
      Dino the dog was chewing his bone
      While Fred and Barney rocked the microphone
      I heard a scream I heard a shout
      It was Mrs. Slate knockin Wilma out
      Singing Yabba, Dabba, Dabba Doo
      Singing Yabba Dabba Dabba Dabba Dabba Doo
      There wasn't very much old Freddy could do!
      'Cept holler yabba dabba dabba dabba dabba do!
      It has been over 6 years since I was in the Army and yet I still can't see a Flintstone reference without thinking of this cadence. I didn't let it stop me from reading about 2/3rds of Learning Perl.

      Cheers - L~R

Re: Link & Email Hunter
by blue_cowdawg (Monsignor) on Feb 13, 2007 at 16:19 UTC

    The existence of just this sort of script is why I generally counsel clients of mine not to put email addresses on their websites directly, with the exception of "catchall" email accounts like info@blah.org and such.

    Email harvesting is quite frequently the tool of spammers.


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: Link Hunter
by wizbancp (Sexton) on Feb 14, 2007 at 08:23 UTC
    I modified the code ...:-)
    <----------------->
    Feel the Dark Power of Regular Expressions...