
Storable problem of data sharing in multiprocess

by hhheng (Initiate)
on Oct 03, 2014 at 08:49 UTC

hhheng has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to develop a script to grab URLs from a website. Since it is a very big site, I need to fork many processes and then use Storable to share the data among those processes.

The parent process fetches the main page and collects some URLs, putting them in a hash and an array. The hash contains the URLs; the array also contains the URLs, but is used for iteration (shifting one URL each time), with zero URLs left marking the end of the iteration. The child processes then fetch the pages behind those links and put any new URLs into the hash and array.

The design is that the child processes work on the shared hash and array, but in my script each child actually just copies the hash and array from the parent process. See the script below:

Please see the test result at this link: http://www.aobu.net/cgi-bin/test_gseSM.pl. You can see that each child process is doing the same thing, without %urls and @unique_urls being shared between them.

use strict;
use warnings;
use WWW::Mechanize;
use Storable qw( lock_store lock_retrieve );

my %urls;        # hash to contain all the urls
my @unique_urls; # array also contains all the urls, used for iteration
my $base = "http://www.somedomain.com";
my $mech = WWW::Mechanize->new;
$mech->get($base);

#### Start point of %urls and for start crawling
%urls = my_own_sub($mech->links); # my own sub to process & extract links from the page; key is the link
@unique_urls = keys %urls;

lock_store \%urls,        'url_ref';
lock_store \@unique_urls, 'unique_ref';

my @child_pids;
for (my $i = 0; $i < 10; $i++) {
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    push @child_pids, $pid;
    unless ($pid) {
        my $url_ref    = lock_retrieve('url_ref');
        my $unique_ref = lock_retrieve('unique_ref');
        print "Number url: ", scalar(keys %$url_ref),
              " num-unique_url: ", scalar(@$unique_ref), "\n";
        my $cnt = 0;
        while ($cnt++ < 100 && (my $u = shift @$unique_ref)) { # each fork processes at most 100 urls
            $mech->get($u);
            my %links = my_own_sub($mech->links);
            foreach my $link (sort keys %links) {
                next if exists $url_ref->{$link};
                push @$unique_ref, $link;
                $url_ref->{$link} = 1;
            }
        }
        lock_store $url_ref,    'url_ref';
        lock_store $unique_ref, 'unique_ref';
        sleep(1);
        exit(0);
    }
}
waitpid($_, 0) foreach @child_pids;

my $url_ref    = lock_retrieve('url_ref');
my $unique_ref = lock_retrieve('unique_ref');
print $_, "\n" foreach (sort keys %$url_ref);
print "Number of links left to be crawled: ", scalar(@$unique_ref), "\n";

Testing the code with a small website, I found that each forked child process gets the %urls and @unique_urls that the parent stored as the starting point. My aim is that each child process writes to the shared %urls, shifts URLs from and pushes new URLs into the shared @unique_urls, and then retrieves the %urls and @unique_urls as modified by the other child processes.
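To make the intent concrete, here is a minimal sketch (not my working code) of the read-modify-write cycle each child is supposed to perform, using only Storable plus a flock-ed lock file so that one child's update is visible to the next. The lock-file name 'crawl.lock' and the helper name update_shared are placeholders:

use Fcntl qw( :flock );
use Storable qw( lock_store lock_retrieve );

# Placeholder helper: run one retrieve-modify-store cycle atomically,
# so concurrent children cannot overwrite each other's updates.
sub update_shared {
    my ($code) = @_;
    open my $lock, '>', 'crawl.lock' or die "Cannot open lock file: $!";
    flock $lock, LOCK_EX or die "Cannot take lock: $!";

    my $url_ref    = lock_retrieve('url_ref');    # files must already exist
    my $unique_ref = lock_retrieve('unique_ref');

    $code->($url_ref, $unique_ref);               # modify both structures in place

    lock_store $url_ref,    'url_ref';
    lock_store $unique_ref, 'unique_ref';

    close $lock;                                  # releases the lock
}

# Example: a child claims the next URL inside the lock.
my $next_url;
update_shared(sub {
    my ($urls, $unique) = @_;
    $next_url = shift @$unique;
});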

I don't want to use other modules such as IPC::Shareable or Parallel::ForkManager to achieve this; I just want to use fork and the Storable module.

Can anybody tell me what's wrong in my script?

Replies are listed 'Best First'.
Re: Storable problem of data sharing in multiprocess
by jellisii2 (Hermit) on Oct 03, 2014 at 11:29 UTC
    This is kind of unrelated, but unless the targets of this script know you're doing this work, please process and respect robots.txt.
      Will do. I'll check robots.txt in another module, but first we need to get this script working.
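      For reference, a minimal sketch of such a check using WWW::RobotRules; the agent name, URLs, and the @urls_to_fetch list are placeholders, and the allowed() test would go just before each $mech->get:

      use strict;
      use warnings;
      use LWP::Simple qw( get );
      use WWW::RobotRules;

      # "MyCrawler/1.0" is a placeholder agent name.
      my $rules = WWW::RobotRules->new('MyCrawler/1.0');

      my $base       = 'http://www.somedomain.com';
      my $robots_url = "$base/robots.txt";
      my $robots_txt = get($robots_url);
      $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

      my @urls_to_fetch = ("$base/index.html");   # placeholder list
      for my $u (@urls_to_fetch) {
          next unless $rules->allowed($u);        # skip anything robots.txt disallows
          # ... $mech->get($u) as before
      }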
Re: Storable problem of data sharing in multiprocess
by Anonymous Monk on Oct 03, 2014 at 08:58 UTC

    Can anybody tell me what's wrong in my script?

    Maybe, does it work? What does it do correctly? What does it do incorrectly?

    You should be able to tell, by running your program against your test website, whether there is a problem that needs solving.

    :) yes I am a comedian

        Here is what I would do: partition the URL list upfront into different Storable files, so when you fork you're only sharing a single filename; later the parent process unifies the results of the child processes ... only the parent partitions jobs, because only the parent spawns children.

        #!/usr/bin/perl --
        ##
        ##
        ## perltidy -olq -csc -csci=3 -cscl="sub : BEGIN END " -otr -opr -ce -nibc -i=4 -pt=0 "-nsak=*"
        ## perltidy -olq -csc -csci=10 -cscl="sub : BEGIN END if " -otr -opr -ce -nibc -i=4 -pt=0 "-nsak=*"
        ## perltidy -olq -csc -csci=10 -cscl="sub : BEGIN END if while " -otr -opr -ce -nibc -i=4 -pt=0 "-nsak=*"
        use strict;
        use warnings;
        use Data::Dump qw/ dd /;
        use Storable qw/ lock_store /;

        Main( @ARGV );
        exit( 0 );

        sub Main {
            my @files = StorePartitionUrls( GetInitialUniqueUrls() );
            ForkThisStuff( @files );
            UnifyChildResults( 'Ohmy-unique-hostname-urls-storable', @files );
        } ## end sub Main

        sub GetInitialUniqueUrls {
            my @urls;
            ...;
            return \@urls;
        } ## end sub GetInitialUniqueUrls

        sub ForkThisStuff {
            my @files = @_;
            ## spawn kids with one file each, wait, whatever -- something forking here
            for my $file ( @files ) {
                EachChildGetsItsOwn( $file );
            }
        } ## end sub ForkThisStuff

        sub StorePartitionUrls {
            my( $urls, $partition, $fnamet ) = @_;
            $partition ||= 100;
            $fnamet    ||= 'Ohmy-candidate-urls-%d-%d-storable';
            my @files;
            while( @$urls ){
                my @hundred = splice @$urls, 0, $partition;
                #~ my $file = "Ohmy-".int( @$urls ).'-'.int( @hundred ).'-storable';
                my $file = sprintf $fnamet, int( @$urls ), int( @hundred );
                lock_store \@hundred, $file;
                push @files, $file;
            }
            return @files;
        } ## end sub StorePartitionUrls

        __END__
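        If it helps, here is one way the stubs above might be filled in. This is only a sketch under the same one-file-per-child layout; the per-child result file name "$file-result" and the EachChildGetsItsOwn placeholder body are assumptions, not something the outline specifies:

        use strict;
        use warnings;
        use Storable qw( lock_store lock_retrieve );

        sub ForkThisStuff {
            my @files = @_;
            my @pids;
            for my $file (@files) {
                my $pid = fork();
                die "Couldn't fork: $!" unless defined $pid;
                if ($pid) {
                    push @pids, $pid;              # parent remembers the child
                }
                else {
                    EachChildGetsItsOwn($file);    # child crawls only its own partition
                    exit 0;
                }
            }
            waitpid $_, 0 for @pids;               # parent waits for every child
        } ## end sub ForkThisStuff

        sub EachChildGetsItsOwn {
            my ($file) = @_;
            ...;    # placeholder: crawl the urls in $file, write findings to "$file-result"
        } ## end sub EachChildGetsItsOwn

        sub UnifyChildResults {
            my( $out_file, @files ) = @_;
            my %seen;
            for my $file (@files) {
                ## assumption: each child stored an array ref of urls in "$file-result"
                my $found = lock_retrieve("$file-result");
                $seen{$_} = 1 for @$found;
            }
            lock_store [ sort keys %seen ], $out_file;
        } ## end sub UnifyChildResults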
