http://www.perlmonks.org?node_id=136495

Bishma has asked for the wisdom of the Perl Monks concerning the following question:

Here's the situation: I need to get a list of files out of a remote web page. All I need to do is fetch the HTML source, remove all the HTML tags, and store each word on its own line of a local file. (Before you ask: I have the remote system admin's permission to access this page.)

I'm an intermediate Perl programmer (at best), and I just don't know how to fetch the remote web page.

Replies are listed 'Best First'.
Re: Getting data out of a remote web page
by Juerd (Abbot) on Jan 05, 2002 at 18:17 UTC
Re: Getting data out of a remote web page
by Amoe (Friar) on Jan 05, 2002 at 20:53 UTC
    And you should use HTML::Parser or a derivative to "strip the tags". Actually, it's more like "get everything that isn't a tag".
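    A minimal sketch along those lines, using HTML::TokeParser (one such derivative); the URL and output filename here are placeholders. It fetches a page, keeps only the text tokens, and writes one word per line:

        use strict;
        use LWP::UserAgent;
        use HTML::TokeParser;   # a derivative of HTML::Parser
        use HTML::Entities;     # to decode &amp; and friends

        my $ua = LWP::UserAgent->new;
        my $response = $ua->get('http://example.com/files.html');   # placeholder URL
        die 'GET failed: ', $response->status_line unless $response->is_success;

        my $html = $response->content;
        my $p = HTML::TokeParser->new(\$html);
        open my $out, '>', 'words.txt' or die "open: $!";
        while (my $token = $p->get_token) {
            next unless $token->[0] eq 'T';   # 'T' tokens are everything that isn't a tag
            my $text = decode_entities($token->[1]);
            print $out "$_\n" for split ' ', $text;
        }
        close $out;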
Re: Getting data out of a remote web page
by jonjacobmoon (Pilgrim) on Jan 06, 2002 at 19:56 UTC
    I'm not sure exactly what you want, so this might be more of an alternative than an answer...

    If the data is in html table format, you might look at HTML::TableExtract. I have used it for getting detailed financial info from Yahoo! and it worked quite well.
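    The API is small; a sketch, assuming the page's table has column headings like "Symbol" and "Price" (made up here) and that $html already holds the fetched source:

        use HTML::TableExtract;

        my $te = HTML::TableExtract->new(headers => ['Symbol', 'Price']);
        $te->parse($html);
        for my $ts ($te->tables) {        # one object per matching table
            for my $row ($ts->rows) {
                print join("\t", @$row), "\n";
            }
        }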

Re: Getting data out of a remote web page
by nindza (Novice) on Feb 10, 2002 at 01:09 UTC
    Hi!

    You can use this script as example... I used to download pics from that site...

    Cheers, nindza.

    ---
    #!/usr/bin/perl
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::SimpleLinkExtor;

    $ua = new LWP::UserAgent;
    while (1) {
        $request  = new HTTP::Request('GET', 'http://www.celebdaily.com/');
        $response = $ua->request($request);
        if ($response->is_success) {
            print "succ\n";
            last;
        }
        else {
            print "fail\n";
        }
    }
    $e = HTML::SimpleLinkExtor->new();
    $e->parse($response->content);
    @links = $e->href;
    chdir("/mnt/depot/babes");
    foreach $link (@links) {
        if ($link =~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
            system("wget -c \"http://www.celebdaily.com/$link\"");
        }
    }

      foreach $link (@links) {
          if ($link =~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
              system("wget -c \"http://www.celebdaily.com/$link\"");
          }
      }

      Ouch. If $link is '$(rm -rf /)1111-11-11', you're not going to like the result. You should anchor the regex (/^[0-9] ... [0-9]$/) so nothing can come before or after your pattern. Another big win is not letting a shell parse the command line at all: system('wget', '-c', "http://www.celebdaily.com/$link");.
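      A sketch of both fixes together (assuming, per the original code, that the interesting links are exactly a date, which may not match the site's real filenames):

          # anchored pattern, and a list-form system call so no shell ever sees $link
          if ($link =~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/) {
              system('wget', '-c', "http://www.celebdaily.com/$link");
          }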


      Putting that code in an infinite loop is a terrible strategy: it could put a horrendous strain on the server if it's already under high load (okay, so this probably isn't one of your concerns), it makes adding new code harder, and it reads horribly.

      Using system to get the files you want is also pointless. You've already shown that you can use LWP::UserAgent to get pages; why not just get the files with that?

      $ua->request(HTTP::Request->new(GET => "http://www.celebdaily.com/$link"), $link);

      That'll store the response in a local file of the same name. (It might have some security implications; I don't know the internals of simple_request.) Anyway, you don't really need to do this when you have <plug>pronbot</plug> to do it for you, and having had a look at the site, it should work with it. (Note: all disclaimers apply; pronbot is still a work in progress.)
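      For completeness, a sketch of dropping wget entirely, reusing $ua and @links from the script above; mirror() is a stock LWP::UserAgent method that saves the body to a named file, and the filename logic here is an assumption about the link format:

          for my $link (@links) {
              next unless $link =~ /[0-9]{4}-[0-9]{2}-[0-9]{2}/;
              (my $file = $link) =~ s{.*/}{};   # local name: last path component
              $ua->mirror("http://www.celebdaily.com/$link", $file);
          }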



      --
      my one true love
Re: Getting data out of a remote web page
by zOrK (Initiate) on Jun 11, 2008 at 19:42 UTC
    #!/usr/bin/perl
    use strict;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $response = $ua->get("http://www.unab.cl");
    if ($response->is_success) {
        open(FILE, ">fileout.html") or die "open: $!";
        print FILE $response->content;
        close(FILE);
    }
    else {
        die $response->status_line;
    }

    # let lynx strip the tags
    system("lynx --dump -nonumbers fileout.html > beauty.txt");
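    To finish the job the original question asked for (one word per line), a small follow-on sketch over the beauty.txt produced above (words.txt is a made-up output name):

        open my $in,  '<', 'beauty.txt' or die "open: $!";
        open my $out, '>', 'words.txt'  or die "open: $!";
        while (my $line = <$in>) {
            print $out "$_\n" for split ' ', $line;
        }
        close $in;
        close $out;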