downloading images from a webpage

by Aldebaran (Curate)
on Apr 06, 2012 at 20:39 UTC ( [id://963858] )

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, good people.

I've been trying everything I can to download the images off a webpage. Although the content is ideological, I can assure you that I am not a fascist.

I redirected the output of this script to get a list of the images:

$ cat hitler1.pl

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $domain = 'http://www.nobeliefs.com/nazis.htm';
my $m      = WWW::Mechanize->new;
$m->get($domain);
my @list = $m->dump_images();
print "@list \n";

My @list doesn't print anything there (why not?), so I redirected the script's output to text1.txt.
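My best guess at why, after rereading the docs: dump_images() prints the image URLs straight to STDOUT and doesn't return them, which would explain why @list stays empty while redirecting the output still captures the list. A sketch that collects the URLs in Perl instead, using images() and url_abs() from the WWW::Mechanize docs, if that guess is right:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $domain = 'http://www.nobeliefs.com/nazis.htm';
my $m = WWW::Mechanize->new;
$m->get($domain);

# images() returns a list of WWW::Mechanize::Image objects;
# url_abs() resolves each src attribute against the page URL.
for my $img ( $m->images ) {
    print $img->url_abs, "\n";
}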

Then I tried to download these images with getstore and got only HTML as output, and this version gives me jpgs of zero size. What gives?

$ cat hitler5.pl

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

open FILE, "text1.txt" or die $!;
my $data;
my $url;
my $text;
my %params;

while (<FILE>) {
    $text = $_;
    $url  = 'http://www.nobeliefs.com/nazis/' . $text;
    $data = LWP::Simple::get $params{URL};
    $text =~ s#images/##;
    print "$url\n";
    print "$text\n";
    open (FH, ">$text");
    binmode (FH);
    print FH $data;
    close (FH);
}
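One thing I notice rereading it: %params is never populated, so LWP::Simple::get $params{URL} fetches undef, $data comes back empty, and the files get written with zero bytes. The lines read from text1.txt also still carry their trailing newlines. If those are the culprits, the loop would need to look more like this (whether the URL prefix should include nazis/ I'm still not sure; see the working version further down):

while ( my $text = <FILE> ) {
    chomp $text;                              # strip the trailing newline
    my $url  = 'http://www.nobeliefs.com/' . $text;
    my $data = LWP::Simple::get($url);        # fetch the URL we just built
    next unless defined $data;                # get() returns undef on failure

    ( my $file = $text ) =~ s#images/##;      # bare filename for saving
    open my $fh, '>', $file or die "Can't open $file: $!";
    binmode $fh;
    print $fh $data;
    close $fh;
}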

Thx for your comment and happy Easter.

Replies are listed 'Best First'.
Re: downloading images from a webpage
by blakew (Monk) on Apr 06, 2012 at 21:47 UTC
    You don't call getstore; you call get.

      That was my first tack, and I thought I was on the right track, but what I ended up with using getstore() was files that kind of thought they were jpg's and kind of thought they were HTML docs. Here's the script I used:

      #!/usr/bin/perl -w
      use strict;
      use LWP::Simple;

      open FILE, "text1.txt" or die $!;
      my $url;
      my $text;

      while (<FILE>) {
          $text = $_;
          $url  = 'http://www.nobeliefs.com/nazis/' . $text;
          $text =~ s#images/##;
          print "$url\n";
          print "$text\n";
          getstore($url, $text) or die "Can't download: $@\n";
      }

      An ls command shows question marks:

      $ ls ... prayingHitler.jpg? PraysingCelebration.jpg? priests-salute.jpg? received.jpg reichchurch.gif? ...

      And when I open up a jpg, it looks like this:

      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html>
      <head>
      <meta http-equiv="Content-type" content="text/html; charset=utf-8">
      <title>Website Moved</title>
      <style type="text/css">
      .statusBox { width: 80px; }
      .fb  { width:43%; float:left;  text-align:center; margin:5px 20px 5px 20px;
             padding:20px 0 20px 0px; background:#eef8fd; height:110px;
             border:solid 1px #dff4fe; }
      .fb2 { width:43%; float:right; text-align:center; margin:5px 20px 5px 20px;
             padding:20px 0 20px 0px; background:#eef8fd; height:110px;
             border:solid 1px #dff4fe; }
      ...
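      My current reading of the two symptoms: the trailing ? in the ls listing is ls rendering the newline still glued to each filename (the lines from text1.txt were never chomped), and the "Website Moved" page means the server answered with an HTML notice rather than an image, which getstore dutifully saved under a .jpg name. Also, getstore returns the HTTP status code rather than true/false, so that "or die" never fires. A sketch of a check that would actually report the problem (using the $url and $file from the loop above):

      use LWP::Simple qw(getstore is_success);

      my $status = getstore( $url, $file );   # returns the HTTP status code
      warn "GET $url came back $status\n"
          unless is_success($status);         # 200 is success; 404 and 30x are not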

      I think the trick might be to find a way to define %params so that this works, but I haven't been able to do that yet (I only get errors):

      my $data = LWP::Simple::get $params{URL};
      my $filename = "image.jpg";
      open (FH, ">$filename");
      binmode (FH);
      print FH $data;
      close (FH);
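      There's nothing magic about %params, though; it's an ordinary hash, and the snippet works as soon as the key is set. A minimal sketch (the image URL is only a guess assembled from the paths above):

      my %params = ( URL => 'http://www.nobeliefs.com/images/prayingHitler.jpg' );

      my $data = LWP::Simple::get( $params{URL} );
      die "download failed\n" unless defined $data;   # get() returns undef on failure

      open my $fh, '>', 'image.jpg' or die $!;
      binmode $fh;
      print $fh $data;
      close $fh;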

        I finally got output, but it looks like a kid did it. I'd like to polish it up and have a script that uses WWW::Mechanize more effectively.

        #!/usr/bin/perl -w
        use strict;
        use LWP::Simple;

        open FILE, "text1.txt" or die $!;
        my $url;
        my $text;

        while (<FILE>) {
            $text = $_;
            $text =~ s/\s+//;
            $url  = 'http://www.nobeliefs.com/' . $text;
            print qq[ '$url' ];
            $text =~ s#images/##;
            print "$text\n";
            getstore($url, $text) or die "Can't download: $@\n";
        }

        How would I use chomp instead of $text =~ s/\s+//;? Nothing I tried worked.
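        Looking at perldoc -f chomp again, I suspect the catch is that chomp modifies its argument in place and returns the number of characters removed, so an assignment like $text = chomp $text; leaves $text holding 0 or 1 rather than the string. A sketch of the in-place use:

        while ( my $text = <FILE> ) {
            chomp $text;           # strips the trailing newline, in place
            print "[$text]\n";     # brackets make any leftover whitespace visible
        }
        # Wrong: $text = chomp $text;  -- stores the count of removed chars, not the string.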

        My failure with WWW::Mechanize was almost complete; the most I could get it to do was dump the names of the images to STDOUT. How could I rewrite this to avoid all the nonsense of saving to a file which I then have to read back? The documentation for $mech->images says: "Lists all the images on the current page. Each image is a WWW::Mechanize::Image object. In list context, returns a list of all images. In scalar context, returns an array reference of all images." I tried a dozen different things, but I don't get why this is not list context:

        #!/usr/bin/perl -w
        use strict;
        use WWW::Mechanize;

        open FILE, "text2.txt" or die $!;
        my $domain = 'http://www.nobeliefs.com/nazis.htm';
        my $m = WWW::Mechanize->new;
        $m->get($domain);
        my @list = $m->images();
        print "@list \n";
        #$m->text();
        #$m->content( format => 'text2.txt' );
        #print FILE $m;
        close FILE;
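        Rereading the docs, I think @list really is in list context and really does get the WWW::Mechanize::Image objects; printing them just interpolates object references, which is why it looks useless. Walking the objects and saving each URL would skip the temp file entirely. A sketch along those lines (basename() from File::Basename is my addition, to derive a local filename):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use LWP::Simple qw(getstore is_success);
        use File::Basename qw(basename);

        my $domain = 'http://www.nobeliefs.com/nazis.htm';
        my $m = WWW::Mechanize->new;
        $m->get($domain);

        for my $img ( $m->images ) {            # list context: one object per <img>
            my $url  = $img->url_abs;           # URI object, resolved against the page
            my $file = basename( $url->path );  # e.g. prayingHitler.jpg
            print "$url -> $file\n";
            my $status = getstore( $url, $file );
            warn "GET $url came back $status\n" unless is_success($status);
        }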
