PerlMonks  

Code using threads crashes intermittently

by ashish.kvarma (Monk)
on Sep 28, 2009 at 03:35 UTC ([id://797815]=perlquestion)

ashish.kvarma has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

For the last couple of days I have been stuck on an error and have no clue what I might be doing wrong. The code given below is a crawler that uses threads to crawl multiple pages at a time. It works fine most of the time, but sometimes it crashes with the following error:

Faulting application perl.exe, version 5.10.0.1003, time stamp 0x482a29fd, faulting module SSLeay.dll, version 0.0.0.0, time stamp 0x482a38ff, exception code 0xc0000005, fault offset 0x0001b323, process id 0x135c, application start time 0x01ca3ba8cbd1d852.

I am using ActiveState Perl 5.10.0.1003 (I have also tried 5.10.0.1005) on Windows Vista (I got the same results on Windows XP).

Here is a brief description of the code, followed by the code itself.

The event listener is a class that loops over the contents of the Thread::Queue (the listen method) and, based on what it finds, inserts data into the database and does other processing. The Crawler class provides methods to crawl the web sites (using WWW::Mechanize) and adds data to the queue. The crawler first logs in to the website and collects the links from which data is to be extracted, then crawls those pages inside an infinite loop. It returns immediately if a page has no data to extract; if data is found, it stays on the page and refreshes it every 30 seconds for as long as data is available.

#!c://perl/bin/perl -w
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use App::Options;
use Crawler;
use EventListner;
use Data::Dumper;

my $q = Thread::Queue->new();
my $options = \%App::options; # Get command line options

my $page_status = {};
share $page_status;
my $page_priorty = {};
share $page_priorty;

# Open Listneracks
my $event_listner = EventListner->new($q, $page_status, $page_priorty);
my $listner_thr = threads->create(sub { $event_listner->listen(); });
$listner_thr->detach();

# Create crawler object
my $crawler = Crawler->new($q, $options);
$crawler->login(); # Login
$crawler->fetch_pages(); # Fetch pages

my $threads_count :shared = 0;
while (1) {
    # Get all closed pages and sort on there priorty
    my @closed_pages = grep { $page_status->{$_} eq 'C' } keys %$page_status;
    my @priorty_sort_pages = sort {$page_priorty->{$a} <=> $page_priorty->{$b}} @closed_pages;
    foreach my $page (@priorty_sort_pages) {
        if ($threads_count < $options->{max_crawlers}) {
            $threads_count++;
            $page_status->{$page} = 'O';
            $page_priorty->{$page}++;
            my $thr = threads->create(sub {
                threads->detach();
                $crawler->run($page);
                $crawler->reset();
                unless ($page_status->{$page} eq 'X') {
                    $page_status->{$page} = 'C';
                }
                $threads_count--;
            });
            #if ($thr->is_running()) {
                sleep($options->{pause}) if $options->{pause};
            #}
            #print "A\n";
        }
        #print "B\n";
    }
}

So far I have found only two things:
  • The script crashes only if no data is found on the first two pages (note that the crawler returns if no data is found), but even this is not consistent: sometimes it works under these conditions too. It never fails under any other conditions.
  • I did some debugging by adding print statements and found that it prints "A" but crashes before "B" is printed.

I would appreciate any help in solving the problem and improving the script. Thanks to you all in advance.

Regards,
Ashish
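
The listener/crawler arrangement described in the question can be sketched with core modules only. This is a minimal illustration, not the poster's EventListner or Crawler classes (their source is not shown in the thread): the event strings, the thread counts, and the use of undef as a "stop" marker are all assumptions made for the example.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();

# Listener thread: dequeue() blocks until an item is available.
# An undef item is used here as a hypothetical "stop" signal.
my $listener = threads->create(sub {
    my @seen;
    while (defined(my $event = $q->dequeue())) {
        push @seen, $event;    # a real listener would insert into the database here
    }
    return scalar @seen;       # number of events handled
});

# Crawler threads: each one only enqueues what it finds.
my @crawlers = map {
    my $id = $_;
    threads->create(sub {
        $q->enqueue("page $id: event $_") for 1 .. 3;
    });
} 1 .. 2;

$_->join() for @crawlers;      # wait until all crawlers are done
$q->enqueue(undef);            # then tell the listener to stop
my $handled = $listener->join();
print "listener handled $handled events\n";
```

Joining the threads instead of detaching them, as done here, makes it possible to shut down in a defined order; the original script detaches its threads and loops forever, so it has no such shutdown step.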

Replies are listed 'Best First'.
Re: Code using threads crashes intermittently
by ikegami (Patriarch) on Sep 28, 2009 at 03:52 UTC
    Is SSLeay.dll threadsafe?
      I think it is not thread safe, as the Crypt::SSLeay documentation doesn't mention thread safety. The Net::SSLeay documentation, though, does talk about resetting callbacks to undef to prevent thread-safety problems and crashes on exit. So far I have ignored this for the following reasons:
      1. I believed LWP (and Mechanize as well) uses Crypt::SSLeay for HTTPS.
      2. Other than the login page (which is hit at the start of the script) there are no HTTPS links, so I was not sure SSLeay.dll was being used at all.
      I can try resetting the callbacks if that seems to be the issue, but I am not sure what to do if my assumption is correct that Crypt::SSLeay (and not Net::SSLeay) is used for HTTPS.
      Regards,
      Ashish
Re: Code using threads crashes intermittently
by diotalevi (Canon) on Sep 29, 2009 at 03:01 UTC

    You're sharing the hash reference $page_status but not the hash it references. You're also not taking care to lock the hash when accessing it from other threads.

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊
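
For readers who want to see what the locking advice above looks like in practice, here is a minimal sketch using only threads and threads::shared. It declares the hash itself as shared and takes a lock around every read-modify-write, including the thread counter. The page names and the helper subs claim_page and crawl_worker are hypothetical, and nothing here is claimed to be the cause of the crash reported in this thread.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical page names; 'C' = closed, 'O' = open, 'X' = excluded, as in the question.
my %page_status :shared;
%page_status = (page1 => 'C', page2 => 'C', page3 => 'C');
my $threads_count :shared = 0;

# Claim a page for crawling. The check and the update happen under one lock,
# so two threads cannot both claim the same page.
sub claim_page {
    my ($page) = @_;
    lock(%page_status);                 # released when the sub returns
    return 0 unless $page_status{$page} eq 'C';
    $page_status{$page} = 'O';
    return 1;
}

sub crawl_worker {
    my ($page) = @_;
    # ... the actual crawling would happen here ...
    {
        lock(%page_status);
        $page_status{$page} = 'C' unless $page_status{$page} eq 'X';
    }
    lock($threads_count);               # decrement is a read-modify-write, so lock it
    $threads_count--;
    return;
}

# Take a snapshot of the keys while holding the lock.
my @pages = do { lock(%page_status); sort keys %page_status };

my @workers;
for my $page (@pages) {
    next unless claim_page($page);
    { lock($threads_count); $threads_count++; }
    push @workers, threads->create(\&crawl_worker, $page);
}
$_->join() for @workers;
print "active crawlers: $threads_count\n";
```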

Re: Code using threads crashes intermittently
by ashish.kvarma (Monk) on Oct 04, 2009 at 05:47 UTC

    After experimenting with lots of solutions and code changes I was finally able to fix the issue.

    As was evident from the beginning, the issue was with sharing SSLeay (and Mechanize) between multiple threads. I was sharing the Mechanize object because I needed the session to be maintained.

    The solution turned out to be very simple: save the cookies after login, and use them to resume the session when creating a new Mechanize object. Very basic, isn't it?
    Now I don't share the Mechanize agent; instead I share the cookies and create one agent per crawler.

    Regards,
    Ashish
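
The fix described above might look roughly like the following. The thread does not say how the cookies were passed between threads, so this sketch assumes a cookie file on disk written with HTTP::Cookies; the sub names, the file path, the URLs, and the form fields are all hypothetical.

```perl
use strict;
use warnings;
use threads;
use HTTP::Cookies;
use WWW::Mechanize;

# Build a fresh agent whose cookie jar is backed by $cookie_file.
# The jar loads the file if it already exists. ignore_discard => 1 makes
# save() keep session cookies, which would otherwise be dropped.
sub new_agent {
    my ($cookie_file) = @_;
    my $jar = HTTP::Cookies->new(
        file           => $cookie_file,
        ignore_discard => 1,
    );
    return WWW::Mechanize->new( cookie_jar => $jar );
}

# Log in once and write the session cookies to disk. The agent goes out of
# scope when the sub returns, so it is not cloned into threads created later.
sub login_and_save {
    my ($cookie_file, $login_url, %fields) = @_;
    my $mech = new_agent($cookie_file);
    $mech->get($login_url);
    $mech->submit_form( with_fields => \%fields );
    $mech->cookie_jar->save();
    return;
}

# Start one crawler thread. The agent is created inside the thread, so no
# Mechanize (or SSL) object is ever shared between threads.
sub spawn_crawler {
    my ($cookie_file, $url) = @_;
    return threads->create(sub {
        my $mech = new_agent($cookie_file);
        $mech->get($url);
        return $mech->content;
    });
}
```

A caller would run login_and_save once in the main thread, with its own login URL and form fields, and then call spawn_crawler for each page, passing the same cookie file each time.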

Node Type: perlquestion [id://797815]
Approved by GrandFather
Front-paged by Corion