Help With MOJO::UserAgent IOLoop recurring

by mr_p (Scribe)
on May 13, 2013 at 19:46 UTC
mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Gurus, I have an issue with the code below. I programmed it not to process links such as .pdf and .gif, but in some cases it is still processing those links.

I put in print statements, and they clearly show that the link was never added for processing, but it is still being processed.

This is the current result:

Not Adding Link: http://www.vasco.com/Images/end_of_life.pdf
url getting problem: http://www.vasco.com/Images/end_of_life.pdf
#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;
use Try::Tiny;

# FIFO queue
my @urls = map { Mojo::URL->new($_) } qw(
    http://www.f5.com/
);

my @allUrls = (
    "http://www.novartis.com",
    "http://www.vasco.com",
    "http://www.ravenind.com",
    "http://www.nepstar.cn",
    "http://www.f5.com",
    "http://www.lorillard.com/",
    "http://www.lowes.com/",
    "http://www.leggmason.com/",
);

my $totalPagesVisited = 0;
my $maxPages = 100;
my %uniq;
my @highProbabableMatch = ( "RSS", "subscribe", "news feed", "press", "feed", "investor" );

# Limit parallel connections to 4
my $max_conn = 4;
my $incorrectAttempts = 0;
my $currentIP = ();

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
    ->new(max_redirects => 5)
    ->detect_proxy;

# Keep track of active connections
my $active = 0;

Mojo::IOLoop->recurring(
    0 => sub {
        for ($active + 1 .. $max_conn) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop or $totalPagesVisited > $maxPages)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);

sub get_callback {
    my (undef, $tx) = @_;

    # Deactivate
    --$active;
    # say "1.2 number of Links: $#urls active: $active";

    # Request URL
    my $url = $tx->req->url;

    # Parse only OK HTML responses
    if ((! $tx->res->is_status_class(200))
        or ($tx->res->headers->content_type !~ m{^text/html\b}ix)) {
        say "url getting problem: $url";
        return;
    }

    #say $url;
    parse_html($url, $tx);

    return;
}

sub parse_html {
    my ($url, $tx) = @_;

    my $rssPageFound = 0;
    my @rssUrls = ();
    my $followLink = 0;
    my $linkAndTitle = ();

    try {
        $linkAndTitle = $tx->res->dom->at('html title')->text;
    } catch {
        say "was not able to get content from link: $url";
    };
    #say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless grep { $link->protocol eq $_ } qw(http https);

        if ( !( testNegativeLinkMatch($link->to_string)) ) {
            say "Not Adding Link: " . $link->to_string;
            next;
        }

        # Don't go deeper than /a/b/c

        # Access every link only once
        state $uniq = {};
        ++$uniq->{$url->to_string};
        next if ++$uniq->{$link->to_string} > 1;

        ## Don't visit other hosts
        next if $link->host ne $url->host;

        if (testMatchIfRSS(\@highProbabableMatch, $e)) {
            $rssPageFound = 1;
        }

        if ($rssPageFound eq 1) {
            $totalPagesVisited++;
            if ( $totalPagesVisited < $maxPages) {
                say "adding link: " . $link->to_string;
                push @urls, $link;
            }
        }
    }

    return;
}

sub testMatchIfRSS {
    my ($rssUrls, $toMatchInfo) = @_;

    foreach my $match (@$rssUrls) {
        if ( ($toMatchInfo =~ />(.*?)$match(.*?)</is)
            or ( $toMatchInfo =~ /\042(.*?)$match(.*?)\042/is ) ) {
            #say "Matched: $match\n";
            return 1;
        }
    }
    return undef;
}

sub testNegativeLinkMatch {
    my ($testLink) = @_;

    if ( $testLink =~ /RSS/si ) { return 1; }
    if ($testLink =~ m{\.(?:css|js|png|pdf|jpe?g|wmv|mp3|docx?|xlsx?|pptx?|gif\$|mov\$|eps|avi)}i) {
        return undef;
    }
    if ($testLink =~ m{^.*(jpe?g|gif|xbm|png|bmp|svg)$}) { return undef; }
    #say "matchedLink: $flag";

    return 1;
}

{ #main
    for my $href ( @allUrls ) {
        #$currentIP = getHostIP($href->{webSite});
        my $origUrl = $href;
        @urls = ();
        $incorrectAttempts = 0;
        $totalPagesVisited = 0;
        push(@urls , Mojo::URL->new($origUrl));
        #push (@urls, $origUrl);
        $active = 0;
        say "Processing: $href";

        ### Start event loop if necessary
        Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
    }
} #endMain
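
For context, the concurrency control in the script above follows a common queue/worker pattern: a recurring zero-second timer tops the pool up to $max_conn non-blocking requests and stops the loop once the queue is drained and nothing is in flight. A minimal sketch of just that pattern (the example URLs are placeholders; only Mojo::UserAgent and Mojo::IOLoop calls already used in the question appear):

#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use Mojo::IOLoop;
use Mojo::URL;
use Mojo::UserAgent;

# Placeholder queue; in the real script this is @urls
my @queue    = map { Mojo::URL->new($_) } qw(http://example.com/ http://example.org/);
my $max_conn = 2;
my $active   = 0;
my $ua       = Mojo::UserAgent->new(max_redirects => 5);

Mojo::IOLoop->recurring(0 => sub {
    for ($active + 1 .. $max_conn) {
        # Stop the loop once the queue is empty and no request is in flight
        return ($active or Mojo::IOLoop->stop) unless my $url = shift @queue;

        # Mark a slot as busy and fetch non-blocking
        ++$active;
        $ua->get($url => sub {
            my (undef, $tx) = @_;
            --$active;    # free the slot so the recurring timer can refill it
            say $tx->req->url, ' => ', $tx->res->code // 'no response';
        });
    }
});

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;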

Re: Help With MOJO::UserAgent IOLoop recurring
by Anonymous Monk on May 14, 2013 at 07:16 UTC

    I put in print statements, and they clearly show that the link was never added for processing, but it is still being processed.

    Maybe it's a redirect.
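
    A likely mechanism, sketched below: with max_redirects => 5 the user agent follows redirects itself, so a queued URL that passed testNegativeLinkMatch can arrive in the callback as a transaction whose final request URL is the .pdf. The redirect chain is available via $tx->previous, and the filter can be re-applied to the final URL inside get_callback. The vasco.com URL is only the host taken from the output above; whether it still redirects like this is an assumption.

    #!/usr/bin/env perl
    use 5.010;
    use strict;
    use warnings;
    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new(max_redirects => 5);

    # Blocking request for simplicity; the same kind of transaction object
    # is what get_callback receives in the non-blocking version.
    my $tx = $ua->get('http://www.vasco.com/');

    say 'final URL:       ', $tx->req->url;
    for (my $prev = $tx->previous; $prev; $prev = $prev->previous) {
        say 'redirected from: ', $prev->req->url;
    }

    # Inside get_callback, filtering the *final* URL again would catch
    # redirect targets such as .pdf files, e.g.:
    #   return unless testNegativeLinkMatch($tx->req->url->to_string);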

      Thanks for the help. You are right, it was redirected.

      I have another problem that I am facing. I fetch these links looking for RSS feed links on many websites, and after it finds a couple of websites' RSS links, Mojo::UserAgent returns DUMMY when I do a get.

      Do you know anything about this?
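
      Without more output it is hard to say what the DUMMY response is, but a first step is usually to check $tx->error in the callback. A minimal, standalone sketch (the URL is a placeholder, and the return form of error depends on the Mojolicious version):

      #!/usr/bin/env perl
      use 5.010;
      use strict;
      use warnings;
      use Mojo::IOLoop;
      use Mojo::UserAgent;

      my $ua = Mojo::UserAgent->new(max_redirects => 5);

      $ua->get('http://www.example.com/' => sub {
          my (undef, $tx) = @_;

          # error() is set when something went wrong with the request
          # (connection errors, timeouts, error responses, depending on the
          # Mojolicious version); older releases return a plain message,
          # newer ones a hashref with {message, code}.
          if (my $err = $tx->error) {
              say 'request failed: ', ref $err ? $err->{message} : $err;
          }
          else {
              say 'got ', $tx->res->code, ' ', $tx->res->headers->content_type // '';
          }
          Mojo::IOLoop->stop;
      });

      Mojo::IOLoop->start;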
