PerlMonks  

Proxy list fetcher

by sock (Monk)
on Jul 07, 2004 at 22:46 UTC ( [id://372591]=sourcecode )
Category: Web Stuff
Author/Contact Info Nick Gerakines, snipersock@gmail.com
Description: This script retrieves several websites that list anonymous HTTP proxies and then creates a formatted list based on what was retrieved. Please refer to my website, http://www.socklabs.com, for updates and more information. Update: use strict and use warnings have been applied, and the code has been reduced considerably.
#!/usr/bin/perl

# ProxyHunter 0.4
# By Sock (http://www.socklabs.com)
# ChangeLog
#  18.5.2004 First release

use strict;
#use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;
use Getopt::Long;

my $opt_page = 1;
our %info = (version => "0.4", 
             format => 0, 
             agent => "Sock's Proxy Hunter/0.3",
             clear => 0,
             anonymous => 0,
             spranonymous => 1,
             elite   => 1
            );
our @data;
our @marks;
# Let's define the data structure for the proxy array
# ip/host, port, type, country, anon level, use (0 no:1 yes)
push @data, {
  host => '127.0.0.1',
  port => 8080,
  type => "http",
  country => "USA",
  level => "http",
  use => 0,
};

# Enabled (1 yes, 0 no), name, target page, clear all whitespace (1 yes, 0 no)
push @marks, [1, "Stayinvisible.com", "http://www.stayinvisible.com/index.pl/proxy_list?order=&offset=0", 1];
push @marks, [1, "PublicProxyServer.com", "http://www.publicproxyservers.com/page1.html", 1];
push @marks, [1, "Proxy4Free.com", "http://www.proxy4free.com/page1.html", 1];

print <<"EOF";
ProxyHunter $info{version}
(c) 2003-2004 Sock
Released without warranty under the terms of the Artistic License.

EOF

our $opt_debug;
our $opt_format;
our $opt_verbose;
GetOptions(
    "help|?"    => \&showhelp,
    "debug"     => \$opt_debug,
    "page=s"    => \$opt_page,
    "output=s"  => \$opt_format,
    "verbose|v" => \$opt_verbose,
);

sub showhelp {
print << "EOF" ;
Usage: $0 [options] dir1 dir2 file1 file2 ...
Options:
--debug        Show extra debug information
--page=x      Set the page to read from on the stayinvis website. Not used
--output=x    Set the output format. See examples:
        [0] 192.168.1.1 80
        [1] http    192.168.1.1 8080 # Highly Anonymous - Russia
        [2] 192.168.1.1:80\@http
--help        Display this help message
EOF
exit;
}

our $debug = $opt_debug;
our $format = $opt_format ? $opt_format : $info{format};
our $t = 0;

print "Debug option set.\n" if ($debug);

#stayinvis();
markloop();

sub markloop {
    foreach my $m (@marks) {
        if ($m->[0] == 1) {
            print "[+Scan] Starting engine for $m->[1]\n";
            print " [Debug] Initializing agent.\n" if ($debug);
            my $proxhunter = LWP::UserAgent->new();
            $proxhunter->agent($info{agent});
            print " [Debug] Retrieving site: $m->[2].\n" if ($debug);
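            # These are plain page fetches, so use GET rather than POST.
            my $http_res = $proxhunter->request(GET $m->[2]);
            # Check for failure before touching the content, and move on
            # to the next site instead of abandoning the whole loop.
            unless ($http_res->is_success) {
                print " [Error] Error pulling site contents: ", $http_res->status_line, "\n";
                next;
            }
            my $stuff = $http_res->content;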
            if ($m->[3] == 1){
                print " [Debug] Removing white spaces.\n" if ($debug);
                $stuff =~ s/\s//g;
            }
            print " [Debug] Sifting through results.\n" if ($debug);
            while ($stuff =~ m/<td[^>]*>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})<\/td><td[^>]*>(\d{1,5})<\/td><td[^>]*>(\w*)<\/td><td[^>]*>(\w*)<\/td>/g) {
                $t++;
                # Default to 0 so unmatched rows carry a defined value.
                my $use_this = 0;
                for ($3) {
                    /transparent/ && $info{clear} == 1 && do {$use_this = 1; last;};
                    /anonymous/ && $info{anonymous} == 1 && do {$use_this = 1; last;};
                    /highanonymity/ && $info{spranonymous} == 1 && do {$use_this = 1; last;};
                }
                print " [Debug] Found: $1, $2, $3, $4, use $use_this\n" if ($debug);
                push @data, {
                    host => $1,
                    port => $2,
                    type => "http",
                    country => $3,
                    level => $4,
                    use => $use_this,
                };
#                push @data, [$1, $2, "http", $3, $4, "$use_this"];
            }
            print "[-Scan] Stopping engine for $m->[1]\n\n";
        }
    }
}

my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime(time);
$mon++;
my $g = 0;    # count of proxies written; start at 0 so the summary is never blank
my $outfile = "n$min"."h$hour"."d$mday"."m$mon".".proxylist";
print "Setting output file as $outfile.\n" if ($debug);
open(fileOUT, ">>", $outfile) or die "Cannot open $outfile: $!\n";
foreach my $d (@data) {
    if ($d->{use} == 1) {
        $g++;
        if ($format == 0) {print fileOUT "$d->{host}:$d->{port}\n";}
        if ($format == 1) {print fileOUT "$d->{type}\t$d->{host}:$d->{port} #$d->{country} - $d->{level}\n";}
        if ($format == 2) {print fileOUT "$d->{host}:$d->{port}\@$d->{type}\n";}
    }
}
close(fileOUT);
print "All tasks complete. Loaded $g out of $t proxies.\n";

exit;
Replies are listed 'Best First'.
Re: Proxy list fetcher
by grinder (Bishop) on Jul 07, 2004 at 23:58 UTC

    Not bad. There are a number of glitches that running under warnings and strictures would have picked up:

    In sub show_help, the part "192.168.1.1:80@http" will try and interpolate a non-existent array named @http. You can avoid this by not using a double-quoted heredoc.
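
    As a minimal sketch of that suggestion (help_text is a made-up helper name, not something from the script): a single-quoted heredoc performs no interpolation at all, so @http is kept literally. The trade-off is that $0 is no longer interpolated either, so the program name has to be added separately.

```perl
use strict;
use warnings;

# help_text() is a hypothetical helper, not part of the original script.
# The single quotes around 'EOF' switch off interpolation, so "@http"
# stays literal and needs no backslash.
sub help_text {
    my ($program) = @_;
    my $body = <<'EOF';
Options:
--output=x    Set the output format. See examples:
        [2] 192.168.1.1:80@http
EOF
    # The program name is added outside the heredoc, because nothing
    # inside a single-quoted heredoc is interpolated.
    return sprintf("Usage: %s [options]\n", $program) . $body;
}

print help_text($0);
```

    Escaping the sigil as \@http inside a double-quoted heredoc works too; the single-quoted form just removes the need to remember it.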

    # Lets define the data structure for the proxy array
    # ip/host, port, type, country, anon level, use
    push @data, ["127.0.0.1", "8080", "http", "USA", "Anonymous", "0"];

    The fact that you have to leave a comment to help people understand what the array contains leads me to conclude that you would be better served by a hash, to make the data structure self-documenting:

    push @data, { host => '127.0.0.1', port => 8080, ... };

    This lets you write print fileOUT "$data[$i]{host}:$data[$i]{port}" later on instead of print fileOUT "$data[$i][0]:$data[$i][1]", which is easier to read. It also saves you from shuffling magic constants around should you ever have the burning desire to add a new element to the beginning of the array, thereby throwing everything off by one.
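
    A short sketch of the idea, with made-up sample records rather than anything fetched by the script: each proxy is a hash reference, so fields are looked up by name and adding a field later does not shift anything.

```perl
use strict;
use warnings;

# Each proxy is a hash reference, so fields are addressed by name
# rather than by position. The values are made-up sample data.
my @data = (
    { host => '127.0.0.1',   port => 8080, type => 'http' },
    { host => '192.168.1.1', port => 80,   type => 'http' },
);

# Adding a field later does not disturb any existing lookups.
$_->{country} = 'unknown' for @data;

# Build "host:port" lines by field name, no magic indices involved.
my @lines = map { "$_->{host}:$_->{port}" } @data;
print "$_\n" for @lines;
```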

    sub stayinvis($format) { ... }

    The above doesn't really do anything, at least not what you expect. Variables passed to subs are made available in the @_ array. All this is really doing is getting the compiler confused about prototypes.
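
    What was probably intended looks more like the sketch below. The name describe_format is a hypothetical stand-in for stayinvis, and the fallback to format 0 is an assumption for illustration.

```perl
use strict;
use warnings;

# describe_format() is a hypothetical stand-in for stayinvis().
# Arguments arrive in @_; copy them into lexicals on the first line.
sub describe_format {
    my ($format) = @_;
    $format = 0 unless defined $format;   # assumed default when no argument is given
    return "using output format $format";
}

print describe_format(2), "\n";
print describe_format(),  "\n";
```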

    our %info = ("version", "0.3", "format", "0", ...

    The above is more idiomatically written as

    our %info = ( version => "0.3", format => 0, anonymous => 1, ...

    ... mainly because it shows more clearly which are the keys and which are their values, rather than making them an indistinguishable bunch that only the compiler knows how to get right. Make it easy on humans too.

    push @data, ["$1", "$2", "http", "$3", "$4", "$use_this"]

    You don't need to interpolate the variables in double quotes, they'll do just fine without them.
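
    For instance, in this sketch (the sample row and its pattern are made up, not taken from the sites the script reads), the capture variables go straight into the anonymous array, which copies their values:

```perl
use strict;
use warnings;

my @data;
my $row = '10.0.0.1 3128 transparent';   # made-up sample row

if ($row =~ /^(\S+)\s+(\d+)\s+(\w+)$/) {
    # $1, $2 and $3 are copied by value into the new array; no quotes needed.
    push @data, [$1, $2, 'http', $3];
}

print join(',', @{ $data[0] }), "\n";
```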

    Other than that, in terms of optimisation there's nothing much that needs to be done. $& is known to introduce slowdowns in code, but as you're dealing with lengthy http transactions anyway, it just does not matter here.
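
    If it ever did matter, one way around $& is to put parentheses around the whole pattern and read the capture instead. A sketch with made-up input:

```perl
use strict;
use warnings;

# Sample input is made up for illustration.
my $text = 'proxy 192.168.1.1 listens on port 8080';

# Wrap the whole pattern in capturing parentheses and take the capture
# from the list-context match instead of reading $& afterwards.
my ($ip) = $text =~ /(\d{1,3}(?:\.\d{1,3}){3})/;

print "matched $ip\n" if defined $ip;
```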

    The best advice I can give is to stick around and read some code here. The best practices tend to become obvious after a while.

    Oh and kudos to you for using some modules, rather than trying to do it all yourself. Big time savings there.

    - another intruder with the mooring of the heat of the Perl
