And you should use HTML::Parser or a derivative to "strip the tags".
Actually, it's more like "get everything that isn't a tag".
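If all you need is the text, a minimal sketch with HTML::Parser's event API might look like this (reading the page from STDIN just for illustration):
#!/usr/bin/perl
use strict;
use HTML::Parser;

my $html = do { local $/; <> };    # slurp the page from STDIN
my $text = '';

# Fire a handler for every text chunk; 'dtext' hands us the text
# with entities already decoded.
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, 'dtext' ],
);
$p->parse($html);
$p->eof;

print $text;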
I'm not sure exactly what you want, so this might be more of an alternative than an answer... If the data is in HTML table format, you might look at HTML::TableExtract. I have used it for getting detailed financial info from Yahoo! and it worked quite well.
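A rough sketch of the sort of thing it does (the header names here are placeholders, not anything real from Yahoo!):
#!/usr/bin/perl
use strict;
use HTML::TableExtract;

my $html = do { local $/; <> };

# Find tables whose header row contains these labels, then walk
# the rows of every matching table.
my $te = HTML::TableExtract->new(headers => ['Symbol', 'Price']);
$te->parse($html);

foreach my $ts ($te->tables) {
    foreach my $row ($ts->rows) {
        print join(', ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}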
Hi!
You can use this script as an example... I used to download pics from that site...
Cheers, nindza.
---
#!/usr/bin/perl
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::SimpleLinkExtor;

$ua = new LWP::UserAgent;

# Keep hitting the front page until a GET succeeds.
while (1) {
    $request  = new HTTP::Request('GET', 'http://www.celebdaily.com/');
    $response = $ua->request($request);
    if ($response->is_success) {
        print "succ\n";
        last;
    } else {
        print "fail\n";
    }
}

# Pull every href out of the page and fetch the date-stamped ones.
$e = HTML::SimpleLinkExtor->new();
$e->parse($response->content);
@links = $e->href;

chdir("/mnt/depot/babes");
foreach $link (@links) {
    if ($link =~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
        system("wget -c \"http://www.celebdaily.com/$link\"");
    }
}
foreach $link (@links) {
    if ($link =~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
        system("wget -c \"http://www.celebdaily.com/$link\"");
    }
}
Ouch. If $link is '$(rm -rf /)1111-11-11', you're not going to like it. You should anchor the regex (/^[0-9] ... [0-9]$/) so nothing can come before or after your pattern. Even better, don't let a shell parse the command line at all: system('wget', '-c', "http://www.celebdaily.com/$link");.
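Both fixes together, as a sketch (the ^...$ anchors follow the advice above; loosen them if the links carry more than the date):
foreach $link (@links) {
    # Anchored match, and list-form system() so no shell ever sees $link.
    if ($link =~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/) {
        system('wget', '-c', "http://www.celebdaily.com/$link");
    }
}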
2;0 juerd@ouranos:~$ perl -e'undef christmas'
Segmentation fault
2;139 juerd@ouranos:~$
Putting that code in an infinite loop is a terrible strategy. It could be a horrendous strain on the server if the server is under high load already (okay, so this probably isn't one of your concerns), it doesn't make it easy to add new code, and it reads horribly; a bounded retry (sketched below) reads better.
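A rough sketch of a bounded retry (the attempt cap and backoff are arbitrary choices of mine):
my $response;
for my $attempt (1 .. 5) {
    $response = $ua->get('http://www.celebdaily.com/');
    last if $response->is_success;
    print "attempt $attempt failed\n";
    sleep 2 * $attempt;    # back off a little more each time
}
die "giving up\n" unless $response->is_success;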
Using system to get the files you want is also pointless. You've already shown that you can use LWP::UserAgent to get pages; why not just get the files with that?
$ua->request(HTTP::Request->new(GET => "http://www.celebdaily.com/$link"), $link);
That'll store the response in a local file of the same name. (It might have some security implications; I don't know the internals of simple_request.) Anyway, you don't really need to do this when you have <plug>pronbot</plug> to do it for you, and having had a look at the site, it should work with it. (Note: all disclaimers apply; pronbot is still a work in progress.)
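Put together, a wget-free fetch loop might look like this (basename() is my addition, to stop a hostile $link from writing outside the current directory):
use File::Basename qw(basename);

foreach my $link (@links) {
    next unless $link =~ /[0-9]{4}-[0-9]{2}-[0-9]{2}/;
    my $file = basename($link);
    # ':content_file' makes LWP write the body straight to disk.
    my $res = $ua->get("http://www.celebdaily.com/$link",
                       ':content_file' => $file);
    warn "$link: ", $res->status_line, "\n" unless $res->is_success;
}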
--
my one true love
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

my $response = $ua->get("http://www.unab.cl");
if ($response->is_success) {
    # Save the page, then let lynx render it to plain text.
    open(my $fh, '>', 'fileout.html') or die "can't write fileout.html: $!";
    print $fh $response->content;
    close($fh);
} else {
    die $response->status_line;
}
system("lynx --dump -nonumbers fileout.html > beauty.txt");