PerlMonks  

Yet another example to get URLs in parallel

by karlgoethebier (Abbot)
on Jun 17, 2017 at 15:36 UTC ( [id://1193016]=CUFP )

The role

1.12

Please note that this version contains some annoying mistakes. Use 1.17 instead; see marioroy's explanations below in this thread.

package MyRole;

# $Id: MyRole.pm,v 1.12 2017/06/17 14:00:17 karl Exp karl $

use Role::Tiny;
use threads;
use MCE::Loop;
use MCE::Shared;
use MCE::Mutex;
use WWW::Curl::Easy;
use Config::Tiny;

my $cfg = Config::Tiny->read(q(MyRole.cfg));

MCE::Loop::init {
    max_workers => $cfg->{params}->{workers},
    chunk_size  => 1,
    interval    => $cfg->{params}->{interval},
};

my $fetch = sub {
    my $curl = WWW::Curl::Easy->new;
    my ( $header, $body );
    $curl->setopt( CURLOPT_URL,            shift );
    $curl->setopt( CURLOPT_WRITEHEADER,    \$header );
    $curl->setopt( CURLOPT_WRITEDATA,      \$body );
    $curl->setopt( CURLOPT_FOLLOWLOCATION, $cfg->{params}->{followlocation} );
    $curl->setopt( CURLOPT_TIMEOUT,        $cfg->{params}->{timeout} );
    $curl->perform;
    {
        header => $header,
        body   => $body,
        info   => $curl->getinfo(CURLINFO_HTTP_CODE),
        error  => $curl->errbuf,
    };
};

sub uagent {
    my $urls   = $_[1];
    my $shared = MCE::Shared->hash;
    my $mutex  = MCE::Mutex->new;

    mce_loop {
        MCE->yield;
        $mutex->enter( $shared->set( $_ => $fetch->($_) ) );
    }
    $urls;

    my $iter = $shared->iterator();
    my $result;
    while ( my ( $url, $data ) = $iter->() ) {
        $result->{$url} = $data;
    }
    $result;
}

1;

__END__
1.17
package MyRole;

# $Id: MyRole.pm,v 1.17 2017/06/18 08:45:19 karl Exp karl $

use Role::Tiny;
use threads;
use MCE::Loop;
use MCE::Shared;
use WWW::Curl::Easy;
use Config::Tiny;

my $cfg = Config::Tiny->read(q(MyRole.cfg));

MCE::Loop::init {
    max_workers => $cfg->{params}->{workers},
    chunk_size  => 1,
    interval    => $cfg->{params}->{interval},
};

my $fetch = sub {
    my $curl = WWW::Curl::Easy->new;
    my ( $header, $body );
    $curl->setopt( CURLOPT_URL,            shift );
    $curl->setopt( CURLOPT_WRITEHEADER,    \$header );
    $curl->setopt( CURLOPT_WRITEDATA,      \$body );
    $curl->setopt( CURLOPT_FOLLOWLOCATION, $cfg->{params}->{followlocation} );
    $curl->setopt( CURLOPT_TIMEOUT,        $cfg->{params}->{timeout} );
    $curl->perform;
    {
        header => $header,
        body   => $body,
        info   => $curl->getinfo(CURLINFO_HTTP_CODE),
        error  => $curl->errbuf,
    };
};

sub uagent {
    my $urls   = $_[1];
    my $shared = MCE::Shared->hash;

    mce_loop {
        MCE->yield;
        $shared->set( $_ => $fetch->($_) );
    }
    $urls;

    $shared->export;
}

1;

__END__

The config file

# $Id: MyRole.cfg,v 1.4 2017/06/17 13:48:19 karl Exp karl $

[params]
timeout=10
followlocation=1
interval=0.005
workers=auto
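Config::Tiny returns this file as a hash of hashes, so the values above become reachable as $cfg->{params}->{timeout} and friends. For illustration only, here is a minimal core-Perl sketch of the same INI-reading idea; it is a simplification of what the module does, not Config::Tiny's actual implementation:

```perl
#!/usr/bin/env perl
# Sketch of INI-style parsing as done (much more robustly) by Config::Tiny.
# Core Perl only; for illustration, not a replacement for the module.
use strict;
use warnings;

sub read_ini {
    my ($text) = @_;
    my ( $cfg, $section ) = ( {}, '_' );
    for my $line ( split /\n/, $text ) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        if ( $line =~ /^\s*\[([^\]]+)\]\s*$/ ) {
            $section = $1;                 # new [section]
        }
        elsif ( $line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/ ) {
            $cfg->{$section}{$1} = $2;     # key=value under current section
        }
    }
    return $cfg;
}

my $cfg = read_ini(<<'INI');
# comment
[params]
timeout=10
followlocation=1
interval=0.005
workers=auto
INI

print "$cfg->{params}{timeout}\n";    # -> 10
print "$cfg->{params}{workers}\n";    # -> auto
```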

The class

# $Id: MyClass.pm,v 1.5 2017/06/16 15:35:32 karl Exp karl $

package MyClass;
use Class::Tiny;
use Role::Tiny::With;
with qw(MyRole);

1;

__END__

The app

#!/usr/bin/env perl

# $Id: run.pl,v 1.14 2017/06/17 14:43:57 karl Exp karl $

use strict;
use warnings;
use MyClass;
use Data::Dump;
use HTML::Strip::Whitespace qw(html_strip_whitespace);
use feature qw(say);

my @urls = grep { $_ ne "" } <DATA>;
chomp @urls;

my $object = MyClass->new;
my $result = $object->uagent( \@urls );

# dd $result;

while ( my ( $url, $data ) = each %$result ) {
    say qq($url);
    say $data->{header};

    # my $html;
    # html_strip_whitespace(
    #     'source' => \$data->{body},
    #     'out'    => \$html
    # );
    # say $html;
}

__DATA__
http://fantasy.xecu.net
http://perlmonks.org
http://stackoverflow.com
http://www.trumptowerny.com
http://www.maralagoclub.com
http://www.sundialservices.com

Update: Fixed mistakes. Thank you marioroy.

Update2: Deleted unused module.

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Furthermore I consider that Donald Trump must be impeached as soon as possible

Replies are listed 'Best First'.
Re: Yet another example to get URLs in parallel
by marioroy (Prior) on Jun 17, 2017 at 18:08 UTC

    Hi karlgoethebier,

    Let's imagine for a minute, the following statement.

    $mutex->enter( $shared->set( $_ => $fetch->($_) ) );

    1. the worker enters the mutex, meaning one worker runs solo while inside it
    2. it then fetches the given URL
    3. it then stores the result in the shared hash
    4. finally, it leaves the mutex

    The statement above causes MCE workers to run serially, not in parallel. I've gone back to your earlier example here, and that looks fine. For this thread, maybe running solo is what karlgoethebier intended, and if so I respect his decision; but surely he wanted the code to run in parallel ;-).

    mce_loop {
        MCE->yield;

        # run parallel
        my $url    = $_;
        my $result = $fetch->($url);

        # run solo to store the result
        $mutex->enter( $shared->set( $url => $result ) );

        # am back to running parallel
        # ...
    }

    A mutex isn't needed when IPC involves a single trip, typical for the OO interface.

    mce_loop {
        MCE->yield;

        # run parallel, without a mutex
        $shared->set( $_ => $fetch->($_) );
    }

    A mutex is often necessary for a shared hash when constructed via the TIE interface.

    tie my %hash, 'MCE::Shared';
    my $shared = MCE::Shared->hash();
    my $mutex  = MCE::Mutex->new();

    $hash{number} = 0;              # 1 trip, store
    $shared->set( number => 0 );    # 1 trip

    # 2 trips (fetch and store), needs a mutex
    $mutex->enter( $hash{number} += 2 );

    # 1 trip via the OO interface
    $shared->incrby( number => 2 );
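    As a side note, the two-trip hazard is not specific to MCE: any read-modify-write on shared data needs exclusive access. A small sketch using only core threads and threads::shared (assumes a threads-enabled perl; here lock() plays the role of the mutex):

```perl
#!/usr/bin/env perl
# Sketch: $n += 2 is a read-modify-write (two trips), so it needs a lock.
# Core modules only; requires a perl built with thread support.
use strict;
use warnings;
use threads;
use threads::shared;

my $n : shared = 0;

my @workers = map {
    threads->create( sub {
        for ( 1 .. 1000 ) {
            lock($n);    # released at the end of each loop iteration;
            $n += 2;     # without it, concurrent increments may be lost
        }
    } );
} 1 .. 4;

$_->join for @workers;
print "$n\n";    # 8000 with the lock in place
```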

    Regards, Mario

Re: Yet another example to get URLs in parallel
by marioroy (Prior) on Jun 17, 2017 at 17:16 UTC

    Hi karlgoethebier,

    I want to share an optimization for extracting the results from the shared-manager. After the parallel phase completes, iterating and fetching keys individually from a shared hash is not necessary.

    my $iter = $shared->iterator();
    my $result;

    while ( my ( $url, $data ) = $iter->() ) {
        $result->{$url} = $data;
    }

    $result;

    All that IPC behind the scenes may be reduced to a single call.

    # export to a non-shared MCE::Shared::Hash object
    my $result = $shared->export();

    # or simply return an unblessed hash
    return $shared->export( { unbless => 1 } );

    # or export and destroy the shared object from the shared-manager,
    # because the shared hash isn't needed once the parallel phase completes
    return $shared->destroy( { unbless => 1 } );

    Our fellow brother 1nickt is the one who requested the unbless option. Thank you, 1nickt.

    Regards, Mario

Re: Yet another example to get URLs in parallel
by etj (Priest) on Sep 02, 2024 at 07:37 UTC
    In the spirit of "more than one way", Mojolicious::Command::bulkget, in its entirety:
    package Mojolicious::Command::bulkget;
    use Mojo::Base 'Mojolicious::Command';
    use Mojo::UserAgent;
    use Mojo::Promise;
    use Mojo::File 'path';
    use Mojo::Util qw(getopt);

    our $VERSION = '0.03';

    my $MAXREQ = 20;

    has description => 'Perform bulk get requests';
    has usage => sub { shift->extract_usage . "\n" };

    sub run {
        my ($self, @args) = @_;
        getopt \@args, 'v|verbose' => \my $verbose;
        my ($urlbase, $outdir, $suffixesfile) = @args;
        die $self->usage . "No URL" if !$urlbase;
        die $self->usage . "$outdir: $!" if ! -d $outdir;
        die $self->usage . "$suffixesfile: $!" if ! -f $suffixesfile;
        my $ua = Mojo::UserAgent->new;
        # Detect proxy for absolute URLs
        $urlbase !~ m!^/! ? $ua->proxy->detect : $ua->server->app($self->app);
        my $outpath = path($outdir);
        my @suffixes = _getsuffixes($suffixesfile, $outpath);
        my @promises = map _makepromise($urlbase, $ua, \@suffixes, $outpath, $verbose), (1..$MAXREQ);
        Mojo::Promise->all(@promises)->wait if @promises;
    }

    sub _makepromise {
        my ($urlbase, $ua, $suffixes, $outpath, $verbose) = @_;
        my $s = shift @$suffixes;
        return if !defined $s;
        my $url = $urlbase . $s;
        warn "getting $url\n" if $verbose;
        $ua->get_p($url)->then(sub {
            my ($tx) = @_;
            _handle_result($outpath, $tx, $s, $verbose);
            _makepromise($urlbase, $ua, $suffixes, $outpath, $verbose);
        });
    }

    sub _handle_result {
        my ($outpath, $tx, $s, $verbose) = @_;
        if ($tx->res->is_success) {
            warn "got $s\n" if $verbose;
            $outpath->child($s)->spurt($tx->res->body);
        } else {
            warn "error $s\n" if $verbose;
        }
    }

    sub _getsuffixes {
        my ($suffixesfile, $outpath) = @_;
        open my $fh, '<', $suffixesfile or die $!;
        grep { !-f $outpath->child($_); } map { chomp; $_ } <$fh>;
    }

Node Type: CUFP [id://1193016]
Approved by hippo