Help With MOJO::UserAgent IOLoop recurring

by mr_p (Scribe)
on May 13, 2013 at 19:46 UTC
mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Gurus, I have an issue with the code below. I programmed it not to process links such as .pdf and .gif, but in some cases it is still processing those links.

I put in print statements, and they clearly show that the link is never added to the queue, yet it still gets processed.

This is the current result:

Not Adding Link: http://www.vasco.com/Images/end_of_life.pdf
url getting problem: http://www.vasco.com/Images/end_of_life.pdf
#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;
use Try::Tiny;

# FIFO queue
my @urls = map { Mojo::URL->new($_) } qw(
    http://www.f5.com/
);

my @allUrls = (
    "http://www.novartis.com",
    "http://www.vasco.com",
    "http://www.ravenind.com",
    "http://www.nepstar.cn",
    "http://www.f5.com",
    "http://www.lorillard.com/",
    "http://www.lowes.com/",
    "http://www.leggmason.com/",
);

my $totalPagesVisited = 0;
my $maxPages = 100;
my %uniq;
my @highProbabableMatch = ( "RSS", "subscribe", "news feed", "press", "feed", "investor" );

# Limit parallel connections to 4
my $max_conn = 4;
my $incorrectAttempts = 0;
my $currentIP = ();

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
    ->new(max_redirects => 5)
    ->detect_proxy;

# Keep track of active connections
my $active = 0;

Mojo::IOLoop->recurring(
    0 => sub {
        for ($active + 1 .. $max_conn) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop or $totalPagesVisited > $maxPages)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);

sub get_callback {
    my (undef, $tx) = @_;

    # Deactivate
    --$active;
    # say "1.2 number of Links: $#urls active: $active";

    # Request URL
    my $url = $tx->req->url;

    # Parse only OK HTML responses
    if ((! $tx->res->is_status_class(200))
        or ($tx->res->headers->content_type !~ m{^text/html\b}ix)) {
        say "url getting problem: $url";
        return;
    }

    #say $url;
    parse_html($url, $tx);

    return;
}

sub parse_html {
    my ($url, $tx) = @_;

    my $rssPageFound = 0;
    my @rssUrls = ();
    my $followLink = 0;
    my $linkAndTitle = ();

    try {
        $linkAndTitle = $tx->res->dom->at('html title')->text;
    } catch {
        say "was not able to get content from link: $url";
    };
    #say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless grep { $link->protocol eq $_ } qw(http https);

        if ( !( testNegativeLinkMatch($link->to_string)) ) {
            say "Not Adding Link: " . $link->to_string;
            next;
        }

        # Don't go deeper than /a/b/c

        # Access every link only once
        state $uniq = {};
        ++$uniq->{$url->to_string};
        next if ++$uniq->{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        if (testMatchIfRSS(\@highProbabableMatch, $e)) {
            $rssPageFound = 1;
        }

        if ($rssPageFound eq 1) {
            $totalPagesVisited++;
            if ( $totalPagesVisited < $maxPages) {
                say "adding link: " . $link->to_string;
                push @urls, $link;
            }
        }
    }

    return;
}

sub testMatchIfRSS {
    my ($rssUrls, $toMatchInfo) = @_;

    foreach my $match (@$rssUrls) {
        if ( ($toMatchInfo =~ />(.*?)$match(.*?)</is)
            or ( $toMatchInfo =~ /\042(.*?)$match(.*?)\042/is ) ) {
            #say "Matched: $match\n";
            return 1;
        }
    }
    return undef;
}

sub testNegativeLinkMatch {
    my ($testLink) = @_;

    if ( $testLink =~ /RSS/si ) { return 1; }

    if ($testLink =~ m{\.(?:css|js|png|pdf|jpe?g|wmv|mp3|docx?|xlsx?|pptx?|gif|mov|eps|avi)}i) {
        return undef;
    }
    if ($testLink =~ m{^.*(jpe?g|gif|xbm|png|bmp|svg)$}) {
        return undef;
    }

    #say "matchedLink: $flag";
    return 1;
}

{ #main
    for my $href ( @allUrls ) {
        #$currentIP = getHostIP($href->{webSite});
        my $origUrl = $href;

        @urls = ();
        $incorrectAttempts = 0;
        $totalPagesVisited = 0;

        push(@urls, Mojo::URL->new($origUrl));
        #push (@urls, $origUrl);
        $active = 0;

        say "Processing: $href";

        # Start event loop if necessary
        Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
    }
} #endMain

Re: Help With MOJO::UserAgent IOLoop recurring
by Anonymous Monk on May 14, 2013 at 07:16 UTC

    I put in print statements, and they clearly show that the link is never added to the queue, yet it still gets processed.

    Maybe it's a redirect.
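
    If it is, that would explain what you are seeing: with max_redirects => 5, the $tx->req->url you inspect in the callback is the final URL after any redirects, not the link you queued, so a queued HTML link can still land on a PDF. A minimal sketch of a guard, reusing your testNegativeLinkMatch (and assuming $tx->previous, which holds the previous transaction of a redirect chain, is available in your Mojolicious version):

        sub get_callback {
            my (undef, $tx) = @_;
            --$active;

            # After redirects this is the FINAL URL, which may differ
            # from the link that was originally pushed onto @urls.
            my $url = $tx->req->url;

            # The previous transaction of a redirect chain, if any,
            # shows where the request started.
            if (my $prev = $tx->previous) {
                say "redirected: " . $prev->req->url . " -> $url";
            }

            # Re-apply the link filter to the final URL as well.
            return unless testNegativeLinkMatch($url->to_string);

            # ... rest of the callback unchanged ...
        }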

      Thanks for the help. You are right, it was redirected.

      I have another problem. I fetch these links looking for RSS feed links on many websites, and after it finds a couple of websites' RSS links, Mojo::UserAgent's get returns me DUMMY when I do a get.

      Do you know anything about this?
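
      For reference, this is how I could check the transaction's error state to narrow it down (a sketch; I am assuming $tx->error returns the error message, plus the HTTP status code as a second value in list context, as in the Mojolicious versions of that era):

          $ua->get($url => sub {
              my (undef, $tx) = @_;

              # Distinguish connection-level failures from HTTP errors
              # before trying to use the response body.
              my ($err, $code) = $tx->error;
              if ($err) {
                  say $code ? "$code response: $err" : "Connection error: $err";
                  return;
              }

              say "fetched OK: " . $tx->req->url;
          });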
