Yet another example to get URLs in parallel

by karlgoethebier (Prior)
on Jun 17, 2017 at 15:36 UTC (#1193016, CUFP)

The role

1.12

Please note that this version contains some annoying mistakes. Use 1.17 instead. See the explanations from marioroy below in this thread.

    package MyRole;

    # $Id: MyRole.pm,v 1.12 2017/06/17 14:00:17 karl Exp karl $

    use Role::Tiny;
    use threads;
    use MCE::Loop;
    use MCE::Shared;
    use MCE::Mutex;
    use WWW::Curl::Easy;
    use Config::Tiny;

    my $cfg = Config::Tiny->read(q(MyRole.cfg));

    MCE::Loop::init {
        max_workers => $cfg->{params}->{workers},
        chunk_size  => 1,
        interval    => $cfg->{params}->{interval},
    };

    my $fetch = sub {
        my $curl = WWW::Curl::Easy->new;
        my ( $header, $body );
        $curl->setopt( CURLOPT_URL,            shift );
        $curl->setopt( CURLOPT_WRITEHEADER,    \$header );
        $curl->setopt( CURLOPT_WRITEDATA,      \$body );
        $curl->setopt( CURLOPT_FOLLOWLOCATION, $cfg->{params}->{followlocation} );
        $curl->setopt( CURLOPT_TIMEOUT,        $cfg->{params}->{timeout} );
        $curl->perform;
        {
            header => $header,
            body   => $body,
            info   => $curl->getinfo(CURLINFO_HTTP_CODE),
            error  => $curl->errbuf,
        };
    };

    sub uagent {
        my $urls   = $_[1];
        my $shared = MCE::Shared->hash;
        my $mutex  = MCE::Mutex->new;
        mce_loop {
            MCE->yield;
            $mutex->enter( $shared->set( $_ => $fetch->($_) ) );
        }
        $urls;
        my $iter = $shared->iterator();
        my $result;
        while ( my ( $url, $data ) = $iter->() ) {
            $result->{$url} = $data;
        }
        $result;
    }

    1;
    __END__

1.17

    package MyRole;

    # $Id: MyRole.pm,v 1.17 2017/06/18 08:45:19 karl Exp karl $

    use Role::Tiny;
    use threads;
    use MCE::Loop;
    use MCE::Shared;
    use WWW::Curl::Easy;
    use Config::Tiny;

    my $cfg = Config::Tiny->read(q(MyRole.cfg));

    MCE::Loop::init {
        max_workers => $cfg->{params}->{workers},
        chunk_size  => 1,
        interval    => $cfg->{params}->{interval},
    };

    my $fetch = sub {
        my $curl = WWW::Curl::Easy->new;
        my ( $header, $body );
        $curl->setopt( CURLOPT_URL,            shift );
        $curl->setopt( CURLOPT_WRITEHEADER,    \$header );
        $curl->setopt( CURLOPT_WRITEDATA,      \$body );
        $curl->setopt( CURLOPT_FOLLOWLOCATION, $cfg->{params}->{followlocation} );
        $curl->setopt( CURLOPT_TIMEOUT,        $cfg->{params}->{timeout} );
        $curl->perform;
        {
            header => $header,
            body   => $body,
            info   => $curl->getinfo(CURLINFO_HTTP_CODE),
            error  => $curl->errbuf,
        };
    };

    sub uagent {
        my $urls   = $_[1];
        my $shared = MCE::Shared->hash;
        mce_loop {
            MCE->yield;
            $shared->set( $_ => $fetch->($_) );
        }
        $urls;
        $shared->export;
    }

    1;
    __END__

The config file

    # $Id: MyRole.cfg,v 1.4 2017/06/17 13:48:19 karl Exp karl $
    [params]
    timeout=10
    followlocation=1
    interval=0.005
    workers=auto

The class

    # $Id: MyClass.pm,v 1.5 2017/06/16 15:35:32 karl Exp karl $

    package MyClass;

    use Class::Tiny;
    use Role::Tiny::With;

    with qw(MyRole);

    1;
    __END__

The app

    #!/usr/bin/env perl

    # $Id: run.pl,v 1.14 2017/06/17 14:43:57 karl Exp karl $

    use strict;
    use warnings;
    use MyClass;
    use Data::Dump;
    use HTML::Strip::Whitespace qw(html_strip_whitespace);
    use feature qw(say);

    my @urls = grep { $_ ne "" } <DATA>;
    chomp @urls;

    my $object = MyClass->new;
    my $result = $object->uagent( \@urls );

    # dd $result;

    while ( my ( $url, $data ) = each %$result ) {
        say qq($url);
        say $data->{header};

        # my $html;
        # html_strip_whitespace(
        #     'source' => \$data->{body},
        #     'out'    => \$html
        # );
        # say $html;
    }

    __DATA__
    http://fantasy.xecu.net
    http://perlmonks.org
    http://stackoverflow.com
    http://www.trumptowerny.com
    http://www.maralagoclub.com
    http://www.sundialservices.com

Update: Fixed mistakes. Thank you marioroy.

Update2: Deleted unused module.

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Furthermore I consider that Donald Trump must be impeached as soon as possible

Replies are listed 'Best First'.
Re: Yet another example to get URLs in parallel
by marioroy (Priest) on Jun 17, 2017 at 17:16 UTC

    Hi karlgoethebier,

    I want to share an optimization for extracting the results from the shared-manager. Iterating over a shared hash and fetching the keys individually is not necessary once the parallel run has finished.

        my $iter = $shared->iterator();
        my $result;
        while ( my ( $url, $data ) = $iter->() ) {
            $result->{$url} = $data;
        }
        $result;

    All that IPC behind the scenes may be reduced to a single call.

        # export to a non-shared MCE::Shared::Hash object
        my $result = $shared->export( );

        # or simply return an unblessed hash
        return $shared->export( { unbless => 1 } );

        # or export-destroy the shared object from the shared-manager
        # because, the shared hash isn't needed once parallel is completed
        return $shared->destroy( { unbless => 1 } );
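A minimal, self-contained sketch of the single-call export described above. To stay runnable without network access it assumes a hypothetical `$fake_fetch` stand-in for the real `$fetch` closure, made-up `example.invalid` URLs, and a fixed worker count; none of these come from the original post. It also assumes an MCE::Shared release recent enough to support the `unbless` option.

```perl
use strict;
use warnings;
use MCE::Loop;
use MCE::Shared;

# Hypothetical stand-in for $fetch: no HTTP request is made,
# it only returns a fake status code for the given URL.
my $fake_fetch = sub {
    my ($url) = @_;
    return 200;
};

MCE::Loop::init { max_workers => 2, chunk_size => 1 };

# Create the shared hash before the workers are spawned.
my $shared = MCE::Shared->hash;

# Made-up URLs, for illustration only.
my @urls = map { "http://example.invalid/$_" } 1 .. 4;

mce_loop {
    # With chunk_size => 1, $_ holds a single URL.
    my $url = $_;
    $shared->set( $url => $fake_fetch->($url) );
} \@urls;

MCE::Loop::finish;

# One trip to the shared-manager instead of one per key.
my $result = $shared->export( { unbless => 1 } );

for my $url ( sort keys %$result ) {
    print "$url => $result->{$url}\n";
}
```

Because `$result` is a plain hash reference here, the caller can walk it with `keys` or `each` without further trips to the shared-manager.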

    Our fellow brother 1nickt is the one who requested the unbless option. Thank you, 1nickt.

    Regards, Mario

Re: Yet another example to get URLs in parallel
by marioroy (Priest) on Jun 17, 2017 at 18:08 UTC

    Hi karlgoethebier,

    Let's imagine for a minute, the following statement.

        $mutex->enter( $shared->set( $_ => $fetch->($_) ) );

    1. the worker enters a mutex, meaning one worker runs solo while inside the mutex
    2. then does a fetch on the given URL
    3. then stores the result into a shared hash
    4. finally, leaves the mutex

    The statement above causes the MCE workers to run serially, not in parallel. I went back to your earlier example and that looks fine. For this thread, maybe running solo is what karlgoethebier intended, and I respect his decision to do so. Surely, though, he wanted the code to run in parallel ;-).

        mce_loop {
            MCE->yield;

            # run parallel
            my $url    = $_;
            my $result = $fetch->($url);

            # run solo to store the result
            $mutex->enter( $shared->set( $url => $result ) );

            # am back to running parallel
            # ...
        }

    A mutex isn't needed when IPC involves a single trip, typical for the OO interface.

        mce_loop {
            MCE->yield;

            # run parallel, without a mutex
            $shared->set( $_ => $fetch->($_) );
        }

    A mutex is often necessary for a shared hash when constructed via the TIE interface.

        tie my %hash, 'MCE::Shared';
        my $shared = MCE::Shared->hash();
        my $mutex  = MCE::Mutex->new();

        $hash{number} = 0;              # 1 trip, store
        $shared->set( number => 0 );    # 1 trip

        # 2 trips fetch and store, needs a mutex
        $mutex->enter( $hash{number} += 2 );

        # 1 trip via the OO interface
        $shared->incrby( number => 2 );
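The difference between the two interfaces can be tried out with a small counter experiment. This is only a sketch under assumptions: the worker count and the 100 iterations are arbitrary choices, and the mutex is given a code reference, which is the calling form shown in the MCE::Mutex documentation for `enter`.

```perl
use strict;
use warnings;
use MCE::Loop;
use MCE::Shared;
use MCE::Mutex;

tie my %hash, 'MCE::Shared';
my $shared = MCE::Shared->hash();
my $mutex  = MCE::Mutex->new();

$hash{number} = 0;
$shared->set( number => 0 );

# Worker count and iteration count are arbitrary, for illustration only.
MCE::Loop::init { max_workers => 4, chunk_size => 1 };

my @items = 1 .. 100;

mce_loop {
    # TIE interface: += is a fetch followed by a store (two trips),
    # so both are wrapped in the mutex to keep other workers out.
    $mutex->enter( sub { $hash{number} += 2 } );

    # OO interface: incrby is a single trip, no mutex needed.
    $shared->incrby( number => 2 );
} \@items;

MCE::Loop::finish;

my $tied_total = $hash{number};
my $oo_total   = $shared->get('number');

print "tied: $tied_total, oo: $oo_total\n";
```

Both counters should end at the same total, since every one of the 100 iterations adds 2 to each of them. Removing the mutex around the tied update leaves the fetch and the store unprotected, so increments from different workers can overwrite each other.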

    Regards, Mario
