http://www.perlmonks.org?node_id=136495

Bishma has asked for the wisdom of the Perl Monks concerning the following question:

Here's the situation: I need to get a list of files out of a remote web page. All I need to do is fetch the HTML source, remove all the HTML tags, and store each word on its own line of a local file. (Before you ask: I have the remote system admin's permission to access this page.)

I'm an intermediate Perl programmer (at best), and I just don't know how to fetch the remote web page.

Replies are listed 'Best First'.
Re: Getting data out of a remote web page
by Juerd (Abbot) on Jan 05, 2002 at 18:17 UTC
Re: Getting data out of a remote web page
by Amoe (Friar) on Jan 05, 2002 at 20:53 UTC
    And you should use HTML::Parser or a derivative to "strip the tags". Actually, it's more like "get everything that isn't a tag".
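    A minimal sketch along those lines, using HTML::TokeParser (one such derivative); the URL and output filename here are placeholders. It fetches a page, keeps only the text tokens, and writes one word per line:

        use strict;
        use LWP::UserAgent;
        use HTML::TokeParser;   # a derivative of HTML::Parser
        use HTML::Entities;     # to decode &amp; and friends

        my $ua = LWP::UserAgent->new;
        my $response = $ua->get('http://example.com/files.html');   # placeholder URL
        die 'GET failed: ', $response->status_line unless $response->is_success;

        my $html = $response->content;
        my $p = HTML::TokeParser->new(\$html);
        open my $out, '>', 'words.txt' or die "open: $!";
        while (my $token = $p->get_token) {
            next unless $token->[0] eq 'T';   # 'T' tokens are everything that isn't a tag
            my $text = decode_entities($token->[1]);
            print $out "$_\n" for split ' ', $text;
        }
        close $out;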
Re: Getting data out of a remote web page
by jonjacobmoon (Pilgrim) on Jan 06, 2002 at 19:56 UTC
    I'm not sure exactly what you want, so this might be more of an alternative than an answer...

    If the data is in html table format, you might look at HTML::TableExtract. I have used it for getting detailed financial info from Yahoo! and it worked quite well.
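    The API is small; a sketch, assuming the page's table has column headings like "Symbol" and "Price" (made up here) and that $html already holds the fetched source:

        use HTML::TableExtract;

        my $te = HTML::TableExtract->new(headers => ['Symbol', 'Price']);
        $te->parse($html);
        for my $ts ($te->tables) {        # one object per matching table
            for my $row ($ts->rows) {
                print join("\t", @$row), "\n";
            }
        }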

Re: Getting data out of a remote web page
by nindza (Novice) on Feb 10, 2002 at 01:09 UTC
    Hi!

    You can use this script as example... I used to download pics from that site...

    Cheers, nindza.

    ---
    #!/usr/bin/perl
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::SimpleLinkExtor;

    $ua = new LWP::UserAgent;
    while (1) {
        $request  = new HTTP::Request('GET', 'http://www.celebdaily.com/');
        $response = $ua->request($request);
        if ($response->is_success) {
            print "succ\n";
            last;
        }
        else {
            print "fail\n";
        }
    }
    $e = HTML::SimpleLinkExtor->new();
    $e->parse($response->content);
    @links = $e->href;
    chdir("/mnt/depot/babes");
    foreach $link (@links) {
        if ($link =~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
            system("wget -c \"http://www.celebdaily.com/$link\"");
        }
    }

      foreach $link (@links) {
          if ($link =~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
              system("wget -c \"http://www.celebdaily.com/$link\"");
          }
      }

      Ouch. If $link is '$(rm -rf /)1111-11-11', you're not going to like the result. You should anchor the regex (/^[0-9] ... [0-9]$/) so nothing can come before or after your pattern. Another big win is not letting a shell parse the command line at all: system('wget', '-c', "http://www.celebdaily.com/$link");.
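      A sketch of both fixes together (assuming, per the original code, that the interesting links are exactly a date, which may not match the site's real filenames):

          # anchored pattern, and a list-form system call so no shell ever sees $link
          if ($link =~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/) {
              system('wget', '-c', "http://www.celebdaily.com/$link");
          }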


      Putting that code in an infinite loop is a terrible strategy: it could put a horrendous strain on the server if it's already under high load (okay, so this probably isn't one of your concerns), it makes adding new code harder, and it reads horribly.

      Using system to get the files you want is also pointless. You've already shown that you can use LWP::UserAgent to get pages; why not just get the files with that?

      $ua->request(HTTP::Request->new(GET => "http://www.celebdaily.com/$link"), $link);

      That'll store the response in a local file of the same name. (It might have some security implications; I don't know the internals of simple_request.) Anyway, you don't really need to do this when you have <plug>pronbot</plug> to do it for you, and having had a look at the site, it should work with it. (Note: all disclaimers apply; pronbot is still a work in progress.)
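      For completeness, a sketch of dropping wget entirely, reusing $ua and @links from the script above; mirror() is a stock LWP::UserAgent method that saves the body to a named file, and the filename logic here is an assumption about the link format:

          for my $link (@links) {
              next unless $link =~ /[0-9]{4}-[0-9]{2}-[0-9]{2}/;
              (my $file = $link) =~ s{.*/}{};   # local name: last path component
              $ua->mirror("http://www.celebdaily.com/$link", $file);
          }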



      --
      my one true love
Re: Getting data out of a remote web page
by zOrK (Initiate) on Jun 11, 2008 at 19:42 UTC
    #!/usr/bin/perl
    use strict;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $response = $ua->get("http://www.unab.cl");
    if ($response->is_success) {
        open(FILE, ">fileout.html") or die "open: $!";
        print FILE $response->content;
        close(FILE);
    }
    else {
        die $response->status_line;
    }

    # let lynx strip the tags
    system("lynx --dump -nonumbers fileout.html > beauty.txt");
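    To finish the job the original question asked for (one word per line), a small follow-on sketch over the beauty.txt produced above (words.txt is a made-up output name):

        open my $in,  '<', 'beauty.txt' or die "open: $!";
        open my $out, '>', 'words.txt'  or die "open: $!";
        while (my $line = <$in>) {
            print $out "$_\n" for split ' ', $line;
        }
        close $in;
        close $out;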