PerlMonks
Perl crashing with Parallel::ForkManager and WWW::Mechanize

by NeonFlash (Novice)
on Aug 04, 2012 at 02:46 UTC ( #985378=perlquestion )
NeonFlash has asked for the wisdom of the Perl Monks concerning the following question:

I have written a Perl script using WWW::Mechanize which reads URLs from a text file and connects to them one by one. For each URL, it parses the content of the webpage looking for some specific keywords and, if they are found, writes them to the output file.
To speed up the process, I used Parallel::ForkManager with MAX_CHILDREN set to 3. Though I have observed an increase in speed, the problem is that after a while the script crashes: the perl.exe process gets killed without displaying any specific error message.
I have run the script multiple times to see if it always fails at the same point, however the point of failure seems to be intermittent.
Please note that I have already taken care of the known memory leaks in WWW::Mechanize and HTML::TreeBuilder::XPath as follows:
For WWW::Mechanize, I set stack_depth(0) so that it does not cache the history of visited pages.
For HTML::TreeBuilder::XPath, I delete the root node once I am done with it. This approach helped me resolve a memory-leak issue in another, similar script which does not use fork.
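In isolation, the two cleanup steps described above look like this (a minimal sketch; the HTML string and variable names are placeholders, not taken from the original script):

```perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

# 1. Stop WWW::Mechanize from keeping every visited page in memory.
my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);    # history depth 0: no back/forward page cache

# 2. Free the parse tree explicitly when done with it.
my $tree = HTML::TreeBuilder::XPath->new();
$tree->parse('<html><head><title>demo</title></head><body></body></html>');
$tree->eof;               # tell the parser the document is complete
my $title = $tree->findvalue('/html/head/title');
$tree->delete();          # HTML::Element trees are self-referential and
                          # need an explicit delete to be reclaimed
```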
Here is the structure of the script, I have mentioned only the relevant parts here, please let me know if more details are required to troubleshoot:

#! /usr/bin/perl
use warnings;
use diagnostics;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use Parallel::ForkManager;   # note: this was missing in the original posting

use constant MAX_CHILDREN => 3;

open(INPUT, "<", $input)
    || die("Couldn't read from the file, $input with error: $!\n");
open(OUTPUT, ">>", $output)
    || die("Couldn't open the file, $output with error: $!\n");

$pm = Parallel::ForkManager->new(MAX_CHILDREN);
$mech = WWW::Mechanize->new();
$mech->stack_depth(0);

while(<INPUT>) {
    chomp $_;
    $url = $_;
    $pm->start() and next;
    $mech->get($url);
    if($mech->success) {
        $tree = HTML::TreeBuilder::XPath->new();
        $tree->parse($mech->content);
        # do some processing here on the content and print the results to OUTPUT file
        # once done then delete the root node
        $tree->delete();
    }
    $pm->finish();
    print "Child Processing finished\n"; # it never reaches this point!
}
$pm->wait_all_children;

1. I would like to know why this Perl script keeps failing after a while.
2. For understanding purposes, I added a print statement right after the finish method of the fork manager, but it never prints.
3. I have also used the wait_all_children method since, per the module's CPAN documentation, it waits for all children of the parent process to finish.
4. I have not understood why the wait_all_children method is placed outside the while or for loop (as also observed in the documentation), since all the processing takes place inside the loop.
5. Memory usage of the perl.exe process keeps growing gradually even though I have taken care of the memory leaks in WWW::Mechanize and HTML::TreeBuilder.
Thanks.

Re: Perl crashing with Parallel::ForkManager and WWW::Mechanize
by bulk88 (Priest) on Aug 04, 2012 at 06:32 UTC
    Your sample script does not run as posted; it dies with a fatal error (it never loads Parallel::ForkManager). After fixing that, it didn't crash for me. But if I run it under the debugger (perl -d), it does crash with a "Free to wrong pool". This is on Perl 5.10. Looking at the C stack, the interpreter's curcop says the last Perl line executed was line 1, file = "(eval 2)C:/Perl/lib/DynaLoader.pm:225".

    line 225 is
    void
    Uninitialize(pTHX_ PERINTERP *pInterp)
    {
        DBG(("Uninitialize\n"));
        EnterCriticalSection(&g_CriticalSection);
        if (g_bInitialized) {
            OBJECTHEADER *pHeader = g_pObj;
            while (pHeader) {
                DBG(("Zombiefy object |%lx| lMagic=%lx\n",
                     pHeader, pHeader->lMagic));
                switch (pHeader->lMagic) {
                case WINOLE_MAGIC:
    >>>>>>>>>>>>    ReleasePerlObject(aTHX_ (WINOLEOBJECT*)pHeader);
                    break;
    Done for now. I might try to reproduce it on a newer Perl later today. What is causing the double free() (it may or may not be Win32::OLE, or just my old ActivePerl 5.10) I don't know.

      Yes indeed. I have not posted the complete script, as I mentioned in my first post.
      It only gives the structure of the Perl script and the relevant information, which should help in troubleshooting this issue.
      I am unable to tell from your response what the correct solution is, though!
      Thanks.

      Yeah, fork on Windows with 5.10 is guaranteed to have bugs.

      It is a much better idea to use threads on Windows, but not with 5.10 (also guaranteed to have bugs); much better to upgrade perl :)

Re: Perl crashing with Parallel::ForkManager and WWW::Mechanize
by aitap (Deacon) on Aug 04, 2012 at 08:28 UTC
    2. For understanding purpose, I added a print statement right after the finish method of fork manager, however it does not print that.

    This is because the child process is terminated earlier by $pm->finish();, and the parent process skips the code because of the next statement in $pm->start() and next;.
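The fix is simply to print before calling finish: anything after finish in the child is dead code. A minimal sketch (the URL list is a placeholder):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(3);
for my $url (qw( url1 url2 url3 )) {
    $pm->start and next;                # parent: skip the body, loop on
    print "child processing $url\n";    # runs: we are still in the child
    $pm->finish;                        # child exits HERE
    print "never printed\n";            # dead code, like the print in the question
}
$pm->wait_all_children;
```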

    4. I have not understood why, wait_all_children method is place outside the while or the for loop though (as observed in the documentation as well), since all the processing is taking place inside the loop.

    As the documentation says,

    use $pm->start to do the fork. $pm returns 0 for the child process, and child pid for the parent process (see also "fork()" in perlfunc(1p)). The "and next" skips the internal loop in the parent process. NOTE: $pm->start dies if the fork fails. $pm->finish terminates the child process (assuming a fork was done in the "start").
    $pm->start acts like fork. At the point in the program where fork is called, a new process (the child) is created; the other process is called the parent. In the parent process, fork returns the child's PID; any non-zero integer is "true" to Perl, so fork and next makes the parent start the next iteration of the loop. In the child process, fork returns 0; 0 is "false" to Perl, so fork and next does not make the child skip the loop body.

    So, the parent process creates all the children then waits for them to finish outside the loop.
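Put together, the canonical shape is: the parent forks inside the loop and reaps outside it. If the parent needs to know how each child fared, run_on_finish hands it the exit status as each child is reaped (a sketch, with made-up work and statuses):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(3);

# The parent collects each child's exit status as it is reaped.
$pm->run_on_finish(sub {
    my ($pid, $exit_code) = @_;
    print "reaped child $pid, exit code $exit_code\n";
});

for my $n (1 .. 5) {
    $pm->start and next;    # parent: fork and continue the loop
    # ... child work goes here ...
    $pm->finish($n);        # child exits; $n becomes its exit status
}

# Only the parent reaches this line; it blocks until all children exit.
$pm->wait_all_children;
```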

    Sorry if my advice was wrong.
Re: Perl crashing with Parallel::ForkManager and WWW::Mechanize
by Corion (Pope) on Aug 04, 2012 at 08:31 UTC

    My personal recommendation is to avoid fork() on Windows, either in favour of Thread::Queue and threads or in favour of AnyEvent.

    Personally I find that using threads requires fewer modifications to an existing single-threaded program.

Re: Perl crashing with Parallel::ForkManager and WWW::Mechanize
by BrowserUk (Pope) on Aug 04, 2012 at 12:13 UTC

    Windows & fork are like windows & stones; mix them and something's gonna break :)

    Try this (slightly tested):

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use LWP::Simple;
    use HTML::TreeBuilder::XPath;

    sub locked(\$) :lvalue { lock ${$_[0]}; ${$_[0]} }

    our $T //= 3;
    my $stdoutSem :shared;
    my $running :shared = 0;

    while( my $url = <> ) {
        chomp $url;
        async {
            ++locked( $running );
            if( my $content = get $url ) {
                my $tree = HTML::TreeBuilder::XPath->new();
                $tree->parse( $content );
                # do some processing here on the content
                if( my $title = $tree->findnodes( '/html/head/title' ) ) {
                    chomp $title;
                    lock $stdoutSem;
                    print "$url : $title";
                }
                # once done then delete the root node
                $tree->delete();
            }
            --locked( $running );
        }->detach;
        Win32::Sleep( 500 ) while $running >= $T;
    }

    And use it like this:

    thisScript.pl urls.list > output.file

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Thank you.

      So, I tried running my script on Linux Platform (Ubuntu based Distro) and the script ran completely without being killed in between. This is a new concept for me, thanks again :)

      One question though. When a child process is spawned by the parent process, do the two run on different cores of the processor, or on the same one?

      For instance, if I set MAX_CHILDREN to 3 so that 3 children run together, do they all run on the same core? :)

      Because I have a quad-core machine, I wanted to know whether increasing the MAX_CHILDREN setting will help achieve better speed.

        When a child process is spawned by the Parent Process

        When you use fork on Windows, you do not create a child process. You spawn a thread within the existing process that simulates forking.

        More generally -- unless you explicitly restrict them -- all threads are eligible to run on all available processors.

        And processes are threads, on all platforms. Even a single-threaded process is a thread at the OS level.
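perlfork documents this emulation: on Windows, fork() clones the interpreter into a new thread of the same process and hands back a negative pseudo-process id rather than a real OS pid. A portable sketch (the platform differences live in the comments):

```perl
use strict;
use warnings;

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: on Unix a real process; on Windows a thread inside the
    # same perl.exe pretending to be a process (see perlfork).
    exit 7;
}

# Parent: on Unix $pid is a positive OS pid; on Windows it is a
# negative pseudo-process id.
waitpid($pid, 0);
print "child exited with status ", $? >> 8, "\n";
```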



Node Type: perlquestion [id://985378]
Approved by bulk88
Front-paged by bulk88