http://www.perlmonks.org?node_id=455094

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

monks,

Is it possible to extract (download) a web page? Please guide me: is there a module or a script for this?

Replies are listed 'Best First'.
Re: Extract Web Page
by gopalr (Priest) on May 09, 2005 at 04:53 UTC

    Hello

    Yes, a web page can be extracted in any of the ways below:

    1st Step:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url = 'http://www.cpan.org';
    my $ua = LWP::UserAgent->new;
    my $request = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);

    if ($response->is_success) {
        print $response->content;
    }
    else {
        print $response->status_line, " <URL:$url>\n";
    }

    2nd Step: It's very simple:

    use strict;
    use warnings;
    use LWP::Simple;

    getprint('http://www.cpan.org/');

    3rd Step: We can get the web page from the command prompt:

    C:\>lwp-download "http://www.cpan.org"

    4th Step:

    open MYHANDLE, "GET http://www.perlmonks.org|";
    while (<MYHANDLE>) {
        print $_;
    }

    Thanks

    Gopal

    Update: Added the 4th Step.

      open MYHANDLE, "GET http://www.perlmonks.org|";
      Incidentally, even when posting minimal code it is sensible to use the "open() or die" idiom, especially when showing stuff like this to a newbie, who has probably already been exposed to (bad) examples of unchecked open()s.

      Whatever... is this really supposed to work? AFAICT this is just a piped open(), so it depends on the availability of a "GET" command (which, BTW, I've never heard of) and is thus system-dependent at best. Did you perhaps mean "wget"?

      Speaking of which, ideally it would be nice to have an open mode (for the three args form of open(), for reasons of backward compatibility) doing exactly this by calling the appropriate modules behind the scenes, a la

      open my $fh, 'web', 'http://www.perlmonks.org' or die $!;
      or a {layer,discipline}, maybe:
      open my $fh, '<:web', 'http://www.perlmonks.org' or die $!;
      (Something vaguely along these lines has been discussed in p6l, but on a different level - of course!)

      Just a few random thoughts...
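To illustrate the "open() or die" idiom mentioned above as applied to a piped open, here is a minimal sketch. It assumes a GET (or lwp-request) command is available on the PATH; fetch_lines is a hypothetical helper name, not something from LWP. The list form of the piped open bypasses the shell, and on Windows it may require a reasonably recent perl.

```perl
use strict;
use warnings;

# fetch_lines is a hypothetical helper: it runs an external command
# and returns the lines the command wrote to standard output.
sub fetch_lines {
    my @cmd = @_;

    # Checked piped open with a lexical filehandle (list form, no shell).
    open(my $fh, '-|', @cmd)
        or die "Cannot start '@cmd': $!";
    my @lines = <$fh>;

    # close() on a pipe also reports a non-zero exit status of the command.
    close($fh)
        or die $! ? "Error closing pipe from '@cmd': $!"
                  : "'@cmd' exited with status " . ($? >> 8);
    return @lines;
}

# Usage (assumes GET is installed): perl fetch.pl http://www.perlmonks.org
print fetch_lines('GET', $ARGV[0]) if @ARGV;
```

Checking close() as well as open() matters here because a command that starts but then fails is otherwise indistinguishable from one that printed nothing.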

      The fourth option is closely related to the third: GET is usually a symlink to lwp-request, a sibling of lwp-download that prints the response to standard output instead of saving it to a file. BUT note that during installation of LWP you're asked whether you want the GET/HEAD/POST symlinks, and you (or the OP) could choose to answer no to avoid program name clashes, especially for the HEAD command on filesystems that do not distinguish case.

      A typical case is a cygwin environment living on a FAT32 filesystem: there, HEAD and head (which gives you the first lines of a file) would clash and you wouldn't get what you want. To be "portable" in the examples, I'd always use the commands whose names begin with lwp-.

      Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

      Don't fool yourself.
Re: Extract Web Page
by davido (Cardinal) on May 09, 2005 at 06:10 UTC

    There is a module: LWP::Simple. Guidance is found in its POD. If you get snagged on something specific, let us know.


    Dave
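For readers who want a starting point before opening the POD, here is a minimal sketch of the LWP::Simple approach mentioned above. It assumes LWP::Simple is installed; fetch_page is a hypothetical helper name introduced only for this example. It relies on get() returning undef when the fetch fails, as described in the module's documentation.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_page is a hypothetical helper used only for this sketch.
# It returns the page content, or dies if the fetch failed.
sub fetch_page {
    my ($url) = @_;
    my $content = get($url);    # get() returns undef on failure
    die "Could not fetch $url\n" unless defined $content;
    return $content;
}

# Usage: perl fetch.pl http://www.cpan.org/
print fetch_page($ARGV[0]) if @ARGV;
```

If the status code or response headers are needed, the LWP::UserAgent version shown earlier in the thread is the better fit, since get() only reports success or failure.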