http://www.perlmonks.org?node_id=29917

elusion has asked for the wisdom of the Perl Monks concerning the following question:

I want to be able to grab a web page from a Perl script; how would I go about doing that? I want to save it as a text file after I do some stuff to it. I know how to do the second part, just not how to get the page.

p u n k k i d
"Reality is merely an illusion, albeit a very persistent one." -Albert Einstein

Re: Grabbing a Web Page
by btrott (Parson) on Aug 27, 2000 at 22:45 UTC
    Here you go; this will save the contents of the page to $page.
    use LWP::Simple;
    my $page = get 'http://www.perl.com/';
    If you want to write it directly to a file (it doesn't sound like you want to do this, but perhaps...), you can use the getstore routine.
    use LWP::Simple;
    getstore('http://www.perl.com/', 'foo.html');
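
    Note that get returns undef on failure and getstore returns the HTTP status code, so a minimal sketch with error checking might look like this (is_success is exported by LWP::Simple along with the rest):

    use LWP::Simple;
    my $page = get('http://www.perl.com/');
    die "couldn't fetch the page\n" unless defined $page;
    my $status = getstore('http://www.perl.com/', 'foo.html');
    die "fetch failed with status $status\n" unless is_success($status);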
      The Perl Cookbook by Tom Christiansen and Nathan Torkington has a five-page write-up on "Fetching a URL from a Perl Script" using the same use LWP::Simple; $content = get($URL); approach you discuss.

      This O'Reilly book might be worth a look if you have it around. I have always gotten the impression that author Tom Christiansen, along with Randal Schwartz and Larry Wall, forms the trinity of holy Perl worship, each worthy of reverence from humble initiates.
RE: Grabbing a Web Page
by BigJoe (Curate) on Aug 27, 2000 at 23:05 UTC
    If you don't have LWP installed, you can do what the CPAN.pm module does when LWP isn't installed:
    open(OUTFILE, ">out.txt") or die "can't open out.txt: $!";
    my $htmlpage = `lynx -source http://www.perl.com`;
    print OUTFILE $htmlpage;
    close OUTFILE;
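
    This does assume the machine has lynx on its PATH; a hedged addition would be to check the exit status before trusting the output:

    my $htmlpage = `lynx -source http://www.perl.com`;
    die "lynx failed (exit status $?)\n" if $? != 0;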


    --BigJoe

    Learn patience, you must.
    Young PerlMonk, craves Not these things.
    Use the source Luke.
(atl: see w3mir) RE: Grabbing a Web Page
by atl (Pilgrim) on Aug 27, 2000 at 23:17 UTC
    You might want to take a look at w3mir, a Perl script that mirrors entire sites. This includes grabbing the pages and saving them after rewriting the links ...

    Andreas

Re: Grabbing a Web Page
by reyjrar (Hermit) on Aug 28, 2000 at 03:27 UTC
    Or, if you want to learn how things work, communicate with raw sockets, and "re-invent the wheel" (which I find FAR more educationally valuable than modules), you can:
    #!/usr/bin/perl
    use Socket;
    use strict;

    my $line;
    my $URL = "http://www.yahoo.com";
    $URL =~ s/http:\/\///;                # strip the scheme
    my ($HOST, @temppage) = split('/', $URL);
    my $PAGE = join('/', @temppage);      # everything after the hostname
    $PAGE = "/$PAGE";                     # an empty path becomes just "/"
    open(OUTFILE, ">html.out");
    socket(HTML, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die $!;
    connect(HTML, sockaddr_in(80, inet_aton($HOST))) || die $!;
    my $REQUEST = "GET $PAGE HTTP/1.0\n\n";
    send(HTML, $REQUEST, 0);
    while (<HTML>) {
        print OUTFILE;
    }
    close HTML;
    close OUTFILE;
    And like I said, it's more time-efficient to use the LWP module, but this way you're actually using just Perl, and not relying on some machine to have lynx or LWP installed... and it's fun! :)
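
    One hedged refinement worth noting: HTTP strictly wants CRLF line endings, and name-based virtual hosts generally need a Host header even on an HTTP/1.0 request. Assuming the $HOST and $PAGE variables from the script above, the request string could be built as:

    my $REQUEST = "GET $PAGE HTTP/1.0\r\n"
                . "Host: $HOST\r\n"
                . "\r\n";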
      definitely fun, especially after spending all afternoon installing various modules prerequisite to LWP :)

      I modified this program so you can say:
      www.foo.com/ instead of www.foo.com/index.html
      and it reads the URL from the command line and just prints the page to STDOUT,
      just in case anyone cared:
      #!/usr/bin/perl
      use Socket;
      use strict;

      my $line;                          # i don't know what $line is, but i left it in anyway
      my $trailingslash;
      my $URL = $ARGV[0];                # get URL from command line
      $URL =~ s/http:\/\///;             # get rid of "http://" if it's there
      if ($URL =~ m/\/$/) {              # check for trailing slash
          $trailingslash = 'true';       # (i.e. get /index.foo)
      } else {
          $trailingslash = 0;
      }
      my ($HOST, @temppage) = split('/', $URL);
      my $PAGE = join('/', @temppage);
      if ($trailingslash && $PAGE) {
          $PAGE = "/$PAGE/";             # reattach the trailing slash
      } else {
          $PAGE = "/$PAGE";
      }
      socket(HTML, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die $!;
      connect(HTML, sockaddr_in(80, inet_aton($HOST))) || die $!;
      my $REQUEST = "GET $PAGE HTTP/1.0\n\n";
      send(HTML, $REQUEST, 0);
      while (<HTML>) {
          print;                         # to STDOUT
      }
      close HTML;

      Of course, we could just make the program respond to 301 Moved Permanently. Ha.
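
      A rough sketch of what that might look like, assuming the HTML socket from the script above (a real client would then go on to re-request the new URL):

      my $status = <HTML>;                        # e.g. "HTTP/1.0 301 Moved Permanently"
      if ($status =~ m{^HTTP/\d\.\d\s+30[12]}) {
          while (my $header = <HTML>) {
              last if $header =~ /^\r?$/;         # blank line ends the headers
              if ($header =~ /^Location:\s*(\S+)/i) {
                  warn "redirected to $1\n";
              }
          }
      }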
      -b
        definitely fun, especially after spending all afternoon installing various modules prerequisite to LWP :)

        Isn't that what the CPAN module is for? ;-)

        perl -MCPAN -eshell
        install LWP
        Then sit back and relax!

        Make sure you install the latest version of the CPAN module first though so it doesn't try to upgrade your perl to 5.6.0...

        I thought I tested everything, but I was wrong... good call. Uhm, I believe if we make this change it'll work too:

        my $REQUEST = "GET $PAGE HTTP/1.0\n\n";

        goes to:

        my $REQUEST = "GET $PAGE\n\n";

        I tested it on my Apache server and it seemed to work fine, and I recall from past experience with Squid that it will work. Lemme know if you find differently.
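
        For what it's worth, a request line with no version string is the old HTTP/0.9 "simple request", which is presumably why servers still answer it; the catch is that the response then comes back with no status line or headers, just the body:

        my $REQUEST = "GET $PAGE HTTP/1.0\r\n\r\n";   # full request: status line and headers come back
        my $REQUEST_09 = "GET $PAGE\r\n";             # HTTP/0.9 simple request: body only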

      > Or, if you want to learn how things work,
      > communicate with raw sockets, and "re-invent the
      > wheel" (which I find FAR more educationally
      > valuable than modules), you can:

      Or, for even MORE of an education, take out that

      use Socket;
      line. Then try and get it to run on different platforms! :) At the very least you'll gain an appreciation for the Socket module.

      P.S. The open in the code above should have an "or die..." after it.

Re: Grabbing a Web Page
by mcwee (Pilgrim) on Aug 28, 2000 at 00:46 UTC
    Although this isn't a Perl solution to this problem (but NB that TIMTOWTDI), you could just make a system call to lynx -dump http://www.somewhere.com (which would spit out the digested page, HTML interpreted) or lynx -source http://www.somewhere.com (which would give you the page's source, just like $page = get "http://www.somewhere.com").
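
    In Perl that's just backticks to capture the output; a minimal sketch, assuming lynx is installed:

    my $digested = `lynx -dump http://www.somewhere.com`;     # rendered text, HTML interpreted
    my $source   = `lynx -source http://www.somewhere.com`;   # raw HTML, same as get()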

    Remember, variety is the spice of life.

    The Autonomic Pilot; it's FunkyTown, babe.

Re: Grabbing a Web Page
by pschoonveld (Pilgrim) on Aug 28, 2000 at 15:57 UTC
    OK, I see a lot of good ideas have come from this node, and as a result it is worth something. But why is everyone voting up a question that has been answered so many times before?

    People need to stop rewarding others for being too lazy to even look around for this stuff.