http://www.perlmonks.org?node_id=1037326

I'm looking for testers for a quick script I threw together and am not sure how to improve any further: a basic downloading tool I wrote over the last two days. It's nothing fancy, but it's useful for automating large batches of downloads when you need a quick-and-dirty way to fetch tons of files without having to babysit them.

I built it out of necessity for my job: I either needed to write a script that automated the download of over 2000 files, find one that worked well for the task, or do it all by hand... I couldn't find one that worked for me, so I learned more about LWP and WWW::Mechanize, asked a couple of questions of you great Monks, and got it working! It's currently downloading said 2000+ files onto one of our computers, so I thought I'd ask for criticism on how to improve the meager tool, and release it into the wilderness of the internet.

It's available for viewing/download on my Dropbox here; feel free to download it and tell me what you think. Are there glaring issues with it? Are there features you'd want added if you were going to use this tool? Be nice, please, but feel free to critique.

Edit: Here's the code, following the advice that I should post it directly here for people to view.
#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;
use WWW::Mechanize;
use Digest::MD5 qw( md5_hex );

# Coded by Brendan Galvin from June 3rd 2013, to June 5th 2013.
# This product is open-source freeware, and credit to the original source
# must be given to Brendan Galvin upon re-distribution of the original
# source or any program, script or application made using the original
# source.
# http://www.linkedin.com/pub/brendan-galvin/26/267/94b

my $flag=0;

print"\n\nURL for mass-downloading (only download links using the <a href> tag, not images or other embedded elements): ";
chomp(my $url = <STDIN>);
print"\nExtensions to download (seperated by comma's): ";
chomp(my $extensions = <STDIN>);
$extensions =~ s/[.]//g;
print"\nLocation to store downloaded files: ";
chomp(my $location = <STDIN>);
print"\nHow many downloads would you like to skip starting from the first (in case you started this download earlier and have already downloaded some of the files)? ";
chomp(my $skips = <STDIN>);
print"\nAre you going to want to skip any files while the program is running (y/n)?";
chomp(my $skiporno = <STDIN>);

my $error = "";
my @extension = split(',', $extensions);
my %extens = map{$_ => 1} @extension;

sub GetFileSize{
    my $url=shift;
    my $ua = new LWP::UserAgent;
    $ua->agent("Mozilla/5.0");
    my $req = new HTTP::Request 'HEAD' => $url;
    $req->header('Accept' => 'text/html');
    my $res = $ua->request($req);
    if ($res->is_success) {
        my $headers = $res->headers;
        return $headers;
    }else{
        $flag = 1;
        $error .= "Error retrieving file information at $url ";
    }
    return 0;
}

my $mech = WWW::Mechanize->new();
$mech->get($url);
my $base = $mech->base;
my @links = $mech->links();

for my $link ( @links ) {
    my $skip = 'n';
    if($link->url() =~ m/([^.]+)$/){
        my $ext = ($link->url() =~ m/([^.]+)$/)[0];
        if(exists($extens{$ext})){
            my $newurl = $link->url();
            if($newurl !~ /http::\/\/$/ig){
                my $baseurl = URI->new_abs($newurl, $base);
                $newurl = $baseurl;
            }
            my $filename = $newurl;
            $filename =~ m/.*\/(.*)$/;
            $filename = $1;
            if($skips > 0){
                $skips -= 1;
                print "\n\nSkipped $filename at " . $link->url();
                next;
            }else{
                my $header = GetFileSize($newurl);
                my $urlmech = WWW::Mechanize->new();
                $urlmech->show_progress("true");
                print"\n\n\n$filename at $newurl\n";
                print "File size: ".$header->content_length." bytes\n" unless $flag==1;
                print "Last modified: ".localtime($header->last_modified)."\n" unless $flag==1;
                if($skiporno eq 'y'){
                    print"Skip file (y/n)?";
                    chomp($skip = <STDIN>);
                }
                if($skip ne 'y'){
                    print " downloading...\n";
                    my $response = $urlmech->get($newurl, ':content_file' => "$filename")->decoded_content;
                }else{
                    print"\nSkipping...\n\n";
                    next or print"Error skipping.\n";
                }
            }
        }
    }
}

print"\n\n\nTasks completed.\n";
if($error ne ""){
    print"\nErrors: $error";
}else{
    print"No errors.\n";
}

Re: RFC: Script testers needed (ITYM code review)
by MidLifeXis (Monsignor) on Jun 06, 2013 at 13:13 UTC

    It sounds like you are looking more for a code review, not a tester. If that is the case, you may get more responses with a different title.

    If that is the case, post the code here, within <readmore><code>...</code></readmore> tags. PerlMonks seems to prefer posts to be fairly self-contained, in order to allow the content to be long-lived.

    That being said, I would probably start your option gathering with Getopt::Long or something similar. Prompting the user is not necessarily a bad thing, but not allowing for automation is (imo). That is as far as I got at this point.
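    A minimal sketch of what that might look like; the option names here (--url, --extensions, --location, --skips) are invented for illustration, not taken from the script:

      use strict;
      use warnings;
      use Getopt::Long;

      # Hypothetical option names, shown only to illustrate the idea.
      GetOptions(
          'url=s'        => \my $url,
          'extensions=s' => \my $extensions,
          'location=s'   => \my $location,
          'skips=i'      => \(my $skips = 0),
      ) or die "Usage: $0 --url URL --extensions pdf,zip [--location DIR] [--skips N]\n";

      # Prompt only for values not supplied on the command line, so the
      # script still works interactively but can also be fully automated.
      if (!defined $url) {
          print "URL for mass-downloading: ";
          chomp($url = <STDIN>);
      }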

    --MidLifeXis

      I was actually planning on making this able to take arguments from the command line in a later version - but thanks for pointing out Getopt::Long! That looks like it's a really useful module.

      I also took your advice and posted the code here on the post, for everyone to view within the Monastery. Hope that helps!

Re: RFC: Script testers needed
by Athanasius (Archbishop) on Jun 06, 2013 at 16:10 UTC

    Some comments on the code:

    1. Digest::MD5::md5_hex is imported but not used.

    2. $location is initialised via user input but then not used.

    3. Line 61 could be simplified to: my $ext = $1;

    4. Lines 71 and 72 could be written as a single line: $filename =~ s{.*/(.*)$}{$1};

    5. The next statements on lines 76 and 94 are redundant, as each occurs immediately before the loop iterates anyway.

    6. In addition, on line 94 the or clause does nothing. next can’t “fail”, and it doesn’t return a value.

    7. Lines 101 to 105 could be rewritten as a single line: print $error eq '' ? "No errors.\n" : "\nErrors: $error";
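    Putting points 3, 4, and 7 together, a small self-contained sketch (the URL and values are made up for the example):

      use strict;
      use warnings;

      my $newurl = 'http://example.com/files/report.pdf';   # made-up URL

      if ($newurl =~ m/([^.]+)$/) {
          my $ext = $1;                                 # point 3: reuse the capture; 'pdf'
          (my $filename = $newurl) =~ s{.*/(.*)$}{$1};  # point 4: strip the path in one step; 'report.pdf'
          print "$filename has extension $ext\n";
      }

      # Point 7: one conditional print instead of an if/else block.
      my $error = '';
      print $error eq '' ? "No errors.\n" : "\nErrors: $error";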

    Hope that helps,

    Athanasius <°(((>< contra mundum

      In order:

      1. You're right! Digest::MD5::md5_hex was used previously to check the md5 hash of the file. I will remove it now that it is no longer used.
      2. Whoa, that's a bit of a big error on my part. I guess I edited the use of $location out, but that's not how it's supposed to be. Thanks for catching that.
      3. Good point. I'm not very well versed on the intricacies of regular expressions, thanks for the tip.
      4. Alright, thanks for the help there; once again, regular expressions aren't something I use much of, usually just simple substitutions or word-matching regexes.
      5. Good catch, they will be removed.
      6. Kind of a newbie error to make, on my part. I was trying to make sure that as many of the possible errors as I could were caught and reported to the user, and I didn't take the time to figure out whether next even CAN fail. I'll remove the or clause.
      7. Awesome, thanks for that tip :)

      That was a lot of help, I'll be sure to tweak the code up using your help! Thanks!

Re: RFC: Code testers/reviewers needed
by hbm (Hermit) on Jun 07, 2013 at 12:48 UTC

    Some minor suggestions:

    1. In GetFileSize, change this:
      if ($res->is_success) {
          my $headers = $res->headers;
          return $headers;
      }else{
          $flag = 1;
          $error .= "Error retrieving file information at $url ";
      }
      return 0;
      To this:
      return $res->headers if $res->is_success;
      $flag = 1;
      $error .= "Error ...";
      return 0;
    2. Within your for loop, change this:
      if(exists($extens{$ext})){
      To this, and remove one level of indentation.
      next unless exists $extens{$ext};
    3. Change delimiters rather than escaping. And in this example the global modifier isn't necessary.
      #if($newurl !~ /http::\/\/$/ig){
      if($newurl !~ m#http:://$#i){
    4. $skips -= 1; vs. $skips--;?
    5. Change this:
      print"\n\n\n$filename at $newurl\n";
      print "File size: ".$header->content_length." bytes\n" unless $flag==1;
      print "Last modified: ".localtime($header->last_modified)."\n" unless $flag==1;
      To this:
      print "\n\n\n$filename at $newurl\n";
      printf("File size: %d bytes\nLast modified: %s\n",
          $header->content_length, scalar localtime($header->last_modified))
          unless $flag == 1;
    6. Spelling and grammar...
      #print"\nExtensions to download (seperated by comma's): ";
      print"\nExtensions to download (separated by commas): ";
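    For context, point 2's early next (applied to both tests at the top of the loop) flattens the body along these lines - a sketch against the script's own @links and %extens, not a drop-in replacement:

      for my $link (@links) {
          my $linkurl = $link->url();
          next unless $linkurl =~ m/([^.]+)$/;   # skip links with no extension at all
          next unless exists $extens{$1};        # skip extensions we don't want
          # ... rest of the loop body, one indentation level shallower ...
      }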

      Great suggestions all - they will definitely make the code more elegant. I haven't coded in Perl for a while; I guess I've forgotten how to write beautiful code :) Thanks for the help!
Re: RFC: Code testers/reviewers needed
by RichardK (Parson) on Jun 07, 2013 at 14:28 UTC

    If you need this for your job, then why not just use curl?

    curl has more features than you can shake a stick at, and just works -- no coding required :)

      Because at the time, curl wasn't working - thanks to Windows 8 sucking - so I made this, found out Windows 8 was what was causing everything to break, transferred the program to a different computer that runs a different OS, and now we basically have our own tool already - so why not use it? :)
Re: RFC: Code testers/reviewers needed
by vsespb (Chaplain) on Jun 07, 2013 at 08:49 UTC
    1. $header->content_length is not available for chunked transfer (according to HTTP spec)
    2. why are you mixing decoded_content and content_file ?
    3. you are not checking whether the file downloaded correctly (see the mirror method in LWP::UserAgent)
    4. is_success can give you false positives (but probably HEAD requests are OK in practice) https://github.com/libwww-perl/libwww-perl/issues/46 https://rt.cpan.org/Public/Bug/Display.html?id=85759
    5. you can avoid HEAD requests altogether - use LWP callbacks, and you will be able to read the length/last-modified time before the whole file is transferred
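    A sketch of point 5 using LWP::UserAgent's documented :content_cb option (the URL and filename here are made up): the response headers are complete before the first body chunk arrives, so the size and last-modified time can be printed without a separate HEAD request.

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua = LWP::UserAgent->new;
      open my $fh, '>', 'downloaded.bin' or die "open: $!";
      binmode $fh;

      my $seen_headers = 0;
      my $res = $ua->get(
          'http://example.com/file.bin',
          ':content_cb' => sub {
              my ($chunk, $response) = @_;
              unless ($seen_headers++) {    # runs once, on the first chunk
                  printf "File size: %s bytes\n",
                      $response->content_length // 'unknown (chunked)';
                  my $mod = $response->last_modified;
                  printf "Last modified: %s\n",
                      $mod ? scalar localtime($mod) : 'unknown';
              }
              print {$fh} $chunk;           # stream the body to disk
          },
      );
      close $fh;
      die "Download failed: ", $res->status_line unless $res->is_success;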
      All your points are good; I'll look into LWP callbacks so I can avoid HEAD requests. The mixing of decoded_content and content_file was due to me misunderstanding what decoded_content did. Can you elaborate a little on what it does, for me?
        Documentation for decoded_content: http://search.cpan.org/perldoc?HTTP%3A%3AMessage Note that there is no content at all, as the file was downloaded using the content_file option.
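        To see why, a short sketch using LWP::UserAgent's documented mirror method (which also addresses point 3 above; URL and filename are made up): with a content file, the body is written straight to disk, so there is nothing for decoded_content to return.

          use strict;
          use warnings;
          use LWP::UserAgent;

          my $ua = LWP::UserAgent->new;

          # mirror() saves the body to the file, sets the file's mtime from
          # Last-Modified, and checks the received length against Content-Length.
          my $res = $ua->mirror('http://example.com/file.pdf', 'file.pdf');

          # 2xx means downloaded; 304 means the local copy was already current.
          die "Download failed: ", $res->status_line
              unless $res->is_success or $res->code == 304;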