http://www.perlmonks.org?node_id=29917

elusion has asked for the wisdom of the Perl Monks concerning the following question:

I want to be able to grab a web page from a Perl script; how would I go about doing that? I want to save it as a text file after I do some stuff to it. I know how to do the second part, just not how to get the page.

p u n k k i d
"Reality is merely an illusion, albeit a very persistent one." -Albert Einstein

Re: Grabbing a Web Page
by btrott (Parson) on Aug 27, 2000 at 22:45 UTC
    Here you go; this will save the contents of the page to $page.
    use LWP::Simple;
    my $page = get 'http://www.perl.com/';
    If you want to write it directly to a file (it doesn't sound like you want to do this, but perhaps...), you can use the getstore routine.
    use LWP::Simple;
    getstore('http://www.perl.com/', 'foo.html');
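
    Note that get returns undef on failure and getstore returns the HTTP status code, so a minimal sketch with error checking might look like this (is_success is exported by LWP::Simple along with the rest):

    use LWP::Simple;
    my $page = get('http://www.perl.com/');
    die "couldn't fetch the page\n" unless defined $page;
    my $status = getstore('http://www.perl.com/', 'foo.html');
    die "fetch failed with status $status\n" unless is_success($status);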
      The Perl Cookbook by Tom Christiansen and Nathan Torkington has a five-page write-up on "Fetching a URL from a Perl Script" using the same use LWP::Simple; $content = get($URL); approach you discuss.

      This O'Reilly book might be worth a look if you have it around. I have always gotten the impression that author Tom Christiansen, along with Randal Schwartz and Larry Wall, forms the trinity of holy Perl worship, each worthy of reverence from humble initiates.
RE: Grabbing a Web Page
by BigJoe (Curate) on Aug 27, 2000 at 23:05 UTC
    If you don't have LWP installed, you can do what the CPAN.pm module does when LWP isn't installed:
    open(OUTFILE, ">out.txt") or die "can't open out.txt: $!";
    my $htmlpage = `lynx -source http://www.perl.com`;
    print OUTFILE $htmlpage;
    close OUTFILE;
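
    This does assume the machine has lynx on its PATH; a hedged addition would be to check the exit status before trusting the output:

    my $htmlpage = `lynx -source http://www.perl.com`;
    die "lynx failed (exit status $?)\n" if $? != 0;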


    --BigJoe

    Learn patience, you must.
    Young PerlMonk, craves Not these things.
    Use the source Luke.
(atl: see w3mir) RE: Grabbing a Web Page
by atl (Pilgrim) on Aug 27, 2000 at 23:17 UTC
    You might want to take a look at w3mir, a Perl script that mirrors entire sites. This includes grabbing the pages and saving them after rewriting the links ...

    Andreas

Re: Grabbing a Web Page
by reyjrar (Hermit) on Aug 28, 2000 at 03:27 UTC
    Or, if you want to learn how things work, communicate with raw sockets, and "re-invent the wheel" (which I find FAR more educationally valuable than modules), you can:
    #!/usr/bin/perl
    use Socket;
    use strict;

    my $line;
    my $URL = "http://www.yahoo.com";
    $URL =~ s/http:\/\///;                # strip the scheme
    my ($HOST, @temppage) = split('/', $URL);
    my $PAGE = join('/', @temppage);      # everything after the hostname
    $PAGE = "/$PAGE";                     # an empty path becomes just "/"
    open(OUTFILE, ">html.out");
    socket(HTML, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die $!;
    connect(HTML, sockaddr_in(80, inet_aton($HOST))) || die $!;
    my $REQUEST = "GET $PAGE HTTP/1.0\n\n";
    send(HTML, $REQUEST, 0);
    while (<HTML>) {
        print OUTFILE;
    }
    close HTML;
    close OUTFILE;
    And like I said, it's more time-efficient to use the LWP module, but this way you're actually using just Perl, and not relying on some machine to have lynx or LWP installed... and it's fun! :)
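
    One hedged refinement worth noting: HTTP strictly wants CRLF line endings, and name-based virtual hosts generally need a Host header even on an HTTP/1.0 request. Assuming the $HOST and $PAGE variables from the script above, the request string could be built as:

    my $REQUEST = "GET $PAGE HTTP/1.0\r\n"
                . "Host: $HOST\r\n"
                . "\r\n";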
      definitely fun, especially after spending all afternoon installing various modules prerequisite to LWP :)

      I modified this program so you can say:
      www.foo.com/ instead of www.foo.com/index.html
      and it reads the URL from the command line and just prints the page to STDOUT,
      just in case anyone cared:
      #!/usr/bin/perl
      use Socket;
      use strict;

      my $line;                          # i don't know what $line is, but i left it in anyway
      my $trailingslash;
      my $URL = $ARGV[0];                # get URL from command line
      $URL =~ s/http:\/\///;             # get rid of "http://" if it's there
      if ($URL =~ m/\/$/) {              # check for trailing slash
          $trailingslash = 'true';       # (i.e. get /index.foo)
      } else {
          $trailingslash = 0;
      }
      my ($HOST, @temppage) = split('/', $URL);
      my $PAGE = join('/', @temppage);
      if ($trailingslash && $PAGE) {
          $PAGE = "/$PAGE/";             # reattach the trailing slash
      } else {
          $PAGE = "/$PAGE";
      }
      socket(HTML, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die $!;
      connect(HTML, sockaddr_in(80, inet_aton($HOST))) || die $!;
      my $REQUEST = "GET $PAGE HTTP/1.0\n\n";
      send(HTML, $REQUEST, 0);
      while (<HTML>) {
          print;                         # to STDOUT
      }
      close HTML;

      Of course, we could just make the program respond to 301 Moved Permanently. Ha.
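
      A rough sketch of what that might look like, assuming the HTML socket from the script above (a real client would then go on to re-request the new URL):

      my $status = <HTML>;                        # e.g. "HTTP/1.0 301 Moved Permanently"
      if ($status =~ m{^HTTP/\d\.\d\s+30[12]}) {
          while (my $header = <HTML>) {
              last if $header =~ /^\r?$/;         # blank line ends the headers
              if ($header =~ /^Location:\s*(\S+)/i) {
                  warn "redirected to $1\n";
              }
          }
      }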
      -b
        definitely fun, especially after spending all afternoon installing various modules prerequisite to LWP :)

        Isn't that what the CPAN module is for? ;-)

        perl -MCPAN -eshell
        install LWP
        Then sit back and relax!

        Make sure you install the latest version of the CPAN module first though so it doesn't try to upgrade your perl to 5.6.0...

        I thought I tested everything, but I was wrong... good call. Uhm, I believe if we make this change it'll work too:

        my $REQUEST = "GET $PAGE HTTP/1.0\n\n";

        goes to:

        my $REQUEST = "GET $PAGE\n\n";

        I tested it on my Apache server and it seemed to work fine, and I recall from past experience with Squid that it will work. Lemme know if you find differently.
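
        For what it's worth, a request line with no version string is the old HTTP/0.9 "simple request", which is presumably why servers still answer it; the catch is that the response then comes back with no status line or headers, just the body:

        my $REQUEST = "GET $PAGE HTTP/1.0\r\n\r\n";   # full request: status line and headers come back
        my $REQUEST_09 = "GET $PAGE\r\n";             # HTTP/0.9 simple request: body only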

      > Or, if you want to learn how things work,
      > communicate with raw sockets, and "re-invent the
      > wheel" (which I find FAR more educationally
      > valuable than modules), you can:

      Or, for even MORE of an education, take out that

      use Socket;
      line. Then try and get it to run on different platforms! :) At the very least you'll gain an appreciation for the Socket module.

      P.S. The open in the code above should have an "or die..." after it.

Re: Grabbing a Web Page
by mcwee (Pilgrim) on Aug 28, 2000 at 00:46 UTC
    Although this isn't a Perl solution to this problem (but NB that TIMTOWTDI), you could just make a system call to lynx -dump http://www.somewhere.com (which would spit out the digested page, HTML interpreted) or lynx -source http://www.somewhere.com (which would give you the page's source, just like $page = get "http://www.somewhere.com").
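
    In Perl that's just backticks to capture the output; a minimal sketch, assuming lynx is installed:

    my $digested = `lynx -dump http://www.somewhere.com`;     # rendered text, HTML interpreted
    my $source   = `lynx -source http://www.somewhere.com`;   # raw HTML, same as get()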

    Remember, variety is the spice of life.

    The Autonomic Pilot; it's FunkyTown, babe.

Re: Grabbing a Web Page
by pschoonveld (Pilgrim) on Aug 28, 2000 at 15:57 UTC
    OK, I see a lot of good ideas have come from this node, and as a result it is worth something. But why is everyone voting up a question that has been answered so many times before?

    People need to stop rewarding others for being too lazy to even look around for this stuff.