Re^2: Segmentation fault: problem with perl threads

Hi,

_replicate() is passed a data structure that looks like this:

$VAR1 = [
          {
            'sc_name' => 'XYZ_Scenario',
            'rsync' => [
                         {
                           'elements' => [
                                           'rsync --archive --relative --stats --verbose --links --copy-links --copy-unsafe-links --safe-links --times  --files-from=\'./Sep_17_2008(22h.50m.52s)/gen/file-from/XYZ_Scenario/rsync_input-1.txt\' ',
                                           'rsync://xxx.yyy.zzz.corp:1873/',
                                           'contexts',
                                           ' /var/workshare/contexts >> \'./Sep_17_2008(22h.50m.52s)/log/XYZ_Scenario/rsync_input-1.log\' 2>&1'
                                         ],
                           'statistics' => {
                                             'total_files' => 863,
                                             'size' => 232563375
                                           },
                           'status' => undef # here I write something meaningful for the later DB update.
                         },
                        ... 
                        ...
                        ], # this array will have 10 or fewer such objects, each with a different 'rsync_input-*.txt' file to replicate
            'total_size' => 2184209735,
            'total_dep' => 13725
          },
          ... 
          ...
        ]# varies, based on scenarios

I checked this data structure carefully; it looks like what I intended, so there is no problem up to this point.

Below is the same set of functions I posted earlier, because I think the problem lies here. I tried running my script on perl 5.8.8 and I still get 'Segmentation fault', whether I actually execute the rsync command in the thread or just print the rsync command in the thread and return. I strongly suspect that I am calling the join() method on a thread object whose thread has already died after finishing its job, so I end up dereferencing a reference that has been deallocated (maybe, or ...?).

This happens because, when 10 threads are running in parallel and I wait for, say, the 2nd thread to join, the 3rd, 4th or 8th (anything up to the 10th) might have finished running in the meantime. Once the 2nd joins, the main thread tries to call join() on the next thread object in the array (either the one returned by threads->list or my own array of thread objects), which may no longer exist, and I have no clue whether the thread is joinable.

I tried to check whether each thread is still running, as you can see in _replicate() in the code below:

sub _replicate{
    my $ref = shift;
    my $logger = get_logger();
    print "Starting replication of dependency files", $/;
    $logger->info("Starting replication of dependency files");
    foreach my $sc(@{$ref}){
        next unless (defined $sc);
        mkdir($LOG_FOLDER."/".$sc->{sc_name});
        $logger->info("\tScenario: ".$sc->{sc_name});
        $logger->info("\tLatest Dependencies: ".$sc->{total_dep}." of size "._get_readable_size($sc->{total_size}));
        my @thr_arr = ();
        foreach my $robj(@{$sc->{rsync}}){
            # I will add a key to this data structure to check whether the thread is joinable or still running.
            $robj->{thr} => 'running';
            
            my $th = threads->create(\&worker, $robj);
            $logger->info("\tThread-".$th->tid.", Total files: ".$robj->{statistics}->{total_files}.", Size: "._get_readable_size($robj->{statistics}->{size})."[".$robj->{statistics}->{size}."B]");
            $logger->info("\tcmd: ".join("", @{$robj->{elements}}));
            push @thr_arr, $th->tid;
        }
        $logger->info("\twaiting for threads to finish its job...");
        
        # 3rd try
        foreach my $k(0..$#thr_arr){            
            # lets check tid and then access the thread object!
            print $k," ",$sc->{rsync}->[$k]->{thr}, $/;
            if ($sc->{rsync}->[$k]->{thr} eq 'running'){ # if not, the thread might have died, and we would try to access memory that was deallocated after the thread's death
                my $t =  $thr_arr[$k];
                my $th = threads->object($t);
                $th->join() if ($th);
            }
        }

        # 2nd try
        # map{
            # my $th = $_;
            # just a blind belief that this might cause the 'Segmentation fault', hence the check. But maybe the thread object I'm referring to has been deallocated due to the death of the thread, hence the 'Segmentation fault'...?
            # my $k = $th->join if($th);        
        # }@thr_arr;
        
        # 1st try
        # Maybe the thread objects returned by threads->list are unjoined, but are they joinable? No clue...!
        #map {my $k = $_->join} threads->list;
        
        $logger->info("\tFinished replicating dependencies of ".$sc->{sc_name});
    }
}

sub worker{
    my $robj = shift;
    my ($rsync, $server, $from, $to) = @{$robj->{elements}};
    my $alt_server = $RSYNC_CONN_STR_2;

    print "Thread-".threads->self->tid." running";
    my $i = 0;
    while(++$i <= $MAX_REPL_ATTEMPT){
        #$logger->info("\t\t[Attempt-".$i."]Thread-".threads->self->tid." executing [".$rsync_cmd."]");
        #$logger->info("\t\t\tTotal files: ".$robj->{statistics}->{total_files}.", Size: "._get_readable_size($robj->{statistics}->{size})."[".$robj->{statistics}->{size}."B]");
        my $rsync_cmd = $rsync.$server.$from.$to;
        `$rsync_cmd`;
        if ($?){ # because of connection refusal from server, command fails
            $robj->{status} = "Completed with error!";
            $rsync_cmd = $rsync.$server.$from.$to;
            $server = ($i%2) ? $RSYNC_CONN_STR_1 : $RSYNC_CONN_STR_2; # just a small trick to use the other port on the same server for the connection
            #$logger->error("ERROR: Thread-".threads->self->tid." says, replication Attempt-".$i." failed, trying again after 2 mins.");
            sleep(120);
        }else{
            $robj->{status} = "Completed";
            last;
        }
    }
    $robj->{thr} = 'done';
    my $etime = time;
    
    my $spent_time = $etime - $stime;
    my $logger = get_logger();
    $logger->info("\t\t[Attempt-".$i."]Thread-".threads->self->tid." took "._format_spent_time($spent_time)." time");
}

So I would ask: is there any way to make sure all threads have finished, or to call join only on those threads that are joinable? Or do I have to go with the other solution I posted earlier, fork()ing processes instead of using threads?

Thanks in advance,
katharnakh.
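
One possible answer to the question above, offered as a sketch rather than a fix for the posted script: recent releases of the threads module provide threads->list(threads::joinable), which returns the threads that have finished but have not yet been joined. The version bundled with perl 5.8.8 is, as far as I know, too old to have it, so this assumes an upgraded threads module. The worker and its delays are hypothetical stand-ins for the rsync jobs.

```perl
use strict;
use warnings;
use threads;

# Hypothetical worker: pretends to replicate for $delay seconds.
sub worker {
    my ($delay) = @_;
    sleep $delay;
    return "delay $delay";
}

# Keep the thread objects (not the tids) returned by create().
# Creating in scalar context makes join() return a scalar.
my @workers;
for my $delay (0 .. 2) {
    my $thr = threads->create(\&worker, $delay);
    push @workers, $thr;
}

my @finished;
# In scalar context, threads->list() counts the threads that are
# neither joined nor detached.
while (threads->list()) {
    # Reap only the threads that have already finished.
    for my $thr (threads->list(threads::joinable)) {
        push @finished, scalar $thr->join;
    }
    sleep 1 if threads->list();   # crude polling interval
}

print "$_\n" for @finished;
```

Polling like this is only worthwhile if results should be handled as soon as each worker finishes; to simply wait for everything, calling join on each thread object in turn is enough.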

Re^3: Segmentation fault: problem with perl threads by BrowserUk (Patriarch) on Sep 18, 2008 at 08:35 UTC
> This happens because, when 10 threads are running in parallel and I wait for, say, the 2nd thread to join, the 3rd, 4th or 8th (anything up to the 10th) might have finished running in the meantime. Once the 2nd joins, the main thread tries to call join() on the next thread object in the array (either the one returned by threads->list or my own array of thread objects), which may no longer exist, and I have no clue whether the thread is joinable.

This is a red herring. When non-detached threads end, they wait until you call join on them before being cleaned up. You do not need to check anything before calling join. If the thread has ended before you call join, it will return immediately. If the thread is still running, it will block until the thread ends. This is how they are designed to work. Your problem lies elsewhere.

You keep posting these snippets of code, but they are so dependent upon the rest of the program that you are not posting, that it is impossible for anyone to run them in order to try and help. They are also full of lumps of commented out code, rambling comments that wrap 3 times and worst of all, all this insane "logger" crap which completely obscures the structure of the code. It is not surprising that you cannot get this to work as you cannot see what it is that your own code is doing.

So, a lot of criticism which you may not like, so I'll try to show you that the criticism can help. Here is your code above, with all the crap stripped away, a few extra spaces and blank lines etc.

sub _replicate{
    my $ref = shift;

    foreach my $sc ( @{ $ref } ) {
        next unless (defined $sc);
        mkdir( $LOG_FOLDER . "/" . $sc->{sc_name} );

        my @thr_arr = ();
        foreach my $robj( @{ $sc->{ rsync} } ){
            $robj->{thr} => 'running';
            my $th = threads->create( \&worker, $robj );
            push @thr_arr, $th->tid;
        }

        $_->join for @thr_arr;
    }
}

sub worker{
    my $robj = shift;
    my( $rsync, $server, $from, $to ) = @{ $robj->{ elements } };
    my $alt_server = $RSYNC_CONN_STR_2;

    for my $i ( 0 .. $MAX_REPL_ATTEMPT ){
        my $rsync_cmd = $rsync . $server . $from . $to;
        `$rsync_cmd`;

        if ($?){
            $rsync_cmd = $rsync . $server . $from . $to;
            $server = ( $i % 2 ) ? $RSYNC_CONN_STR_1 : $RSYNC_CONN_STR_2;
            sleep(120);
        }
        else{
            $robj->{status} = "Completed";
            last;
        }
    }
    $robj->{thr} = 'done';
}

Now the structure and essentials of the code are clear and easy to follow, and it is easy to pick out several problems:

You create your thread here `my $th = threads->create( \&worker, $robj );`, but then you do `push @thr_arr, $th->tid;` *which means that `@thr_arr` contains a list of thread ids, not thread objects!* Which means when you come to try and join your threads, you are trying to call the method `join()` on a number and that obviously isn't going to work.

Now that should not segfault. You should be seeing an error message (assuming you are using strict & warnings) along the lines of:

Can't call method "join" without a package or object reference at...

And you should have seen that error the very first time you ran this code, and every time you've run it since. Instead of fixing the actual problem, you've guessed as to what the cause might be and basically wasted your time trying to fix a problem that doesn't exist.

Please note: I'm not saying your code will work once you've fixed that problem. I am saying that it will never work until you do.

You are calling rsync using backticks: `$rsync_cmd`;, but you are doing nothing with any output produced. That means you are having the system build a pipe and collect the output, and then just throwing it all away. Have you heard of system?

And now for the biggest problem, the design of your code in `_replicate()`. You have 2 nested loops. Within the outer loop you run the inner loop which creates a bunch of threads all trying to contact the same server. And then block until that finishes, with several retries and 120 second waits, before starting another bunch of threads to contact the next server.

This is fundamentally bad design. If one server is slow, or broken, with all your threads trying to talk to the same server, you will basically be doing a lot of nothing, when you could be talking to one or more of the other servers in parallel.

If you are going to be doing multi-processing, whether through threads or forks, the secret is to start simple. Write your worker subroutine in a standalone, single threaded program, and make it work. Once you've made sure it is working that way, then try running two copies concurrently using threads or forks. Once you've got that working reliably, only then try to scale it up!

You asked whether you should move to using forks. If you have a native fork on the platform you are working on, then there is nothing obvious from the code you have posted that requires threads, so you probably could use forks. But, on the basis of the code you've posted, I think that you are likely to have just as many problems trying to work in that environment as you are having with threads.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
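
The join behaviour described above can be checked with a small standalone sketch. The worker, its delays and the returned strings are hypothetical stand-ins for the rsync jobs. The script keeps the thread objects themselves, lets half the workers finish well before join is called, and collects each worker's status through the return value of join. That route is used because each Perl thread works on its own copy of unshared data, so a hash entry set inside the worker is not visible to the main thread.

```perl
use strict;
use warnings;
use threads;

# Hypothetical worker: sleeps, then reports a status string.
sub worker {
    my ($id, $delay) = @_;
    sleep $delay;
    return "job $id completed";
}

my @threads;
for my $id (1 .. 4) {
    # Odd ids run for 3 seconds, even ids finish at once.
    my $delay = ($id % 2) ? 3 : 0;
    # Scalar context here means join() returns a scalar later.
    my $thr = threads->create(\&worker, $id, $delay);
    push @threads, $thr;          # the object, not $thr->tid
}

sleep 1;   # the even-numbered workers should have finished by now

my @results;
for my $thr (@threads) {
    # Returns at once for a finished thread, blocks for a running one.
    my $status = $thr->join;
    push @results, $status;
}

print "$_\n" for @results;
```

No check is made before any join call: threads 2 and 4 are already finished when they are joined, threads 1 and 3 are still running, and both cases go through the same call.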
Re^4: Segmentation fault: problem with perl threads by katharnakh (Novice) on Sep 23, 2008 at 07:04 UTC
> When non-detached threads end, they wait until you call join on them before being cleaned up. You do not need to check anything before calling join. If the thread has ended before you call join, it will return immediately. If the thread is still running, it will block until the thread ends. This is how they are designed to work. Your problem lies elsewhere.

Thanks for letting me know the design; I had falsely assumed something.

> You keep posting these snippets of code, but they are so dependent upon the rest of the program that you are not posting, that it is impossible for anyone to run them in order to try and help. They are also full of lumps of commented out code, rambling comments that wrap 3 times and worst of all, all this insane "logger" crap which completely obscures the structure of the code. It is not surprising that you cannot get this to work as you cannot see what it is that your own code is doing. So, a lot of criticism which you may not like, so I'll try to show you that the criticism can help.

My apologies for posting unstructured code with unwanted comments, which made it difficult for people who really want to help to read the code. Yes, I know the code I posted depends on the rest of the program, but I cannot post the whole code because it is too big. Thanks for showing me 'how criticism can help'.

> Now the structure and essentials of the code are clear and easy to follow, and it is easy to pick out several problems: You create your thread here my $th = threads->create( \&worker, $robj );, but then you do push @thr_arr, $th->tid and then call join() on that object; which means that @thr_arr contains a list of thread ids, not thread objects!

Correct.

> which means when you come to try and join your threads, you are trying to call the method join() on a number and that obviously isn't going to work.

No, you missed one part while reformatting the code to show how to post code that is clear and easy to follow: the lines that actually get the thread object associated with the thread id.

sub _replicate{
    ...
    foreach my $k(0..$#thr_arr){
        print $k," ",$sc->{rsync}->[$k]->{thr}, $/;
        if ($sc->{rsync}->[$k]->{thr} eq 'running'){
            my $t =  $thr_arr[$k];
            my $th = threads->object($t);
            $th->join() if ($th);
        }
    }
    ...
}

> You are calling rsync using backticks: `$rsync_cmd`;, but you are doing nothing with any output produced. That means you are having the system build a pipe and collect the output, and then just throwing it all away.

That is because I am redirecting the command output to a file. If you wish, you can look at `sub worker{ ... }` and the data structure I posted.

> Have you heard of system?

Yes. And I consider executing a command using backticks (``) to be just another way of doing it, which in my case would behave no differently from system, right?

> And now for the biggest problem, the design of your code in _replicate(). You have 2 nested loops. Within the outer loop you run the inner loop which creates a bunch of threads all trying to contact the same server. And then block until that finishes, with several retries and 120 second waits, before starting another bunch of threads to contact the next server. This is fundamentally bad design. If one server is slow, or broken, with all your threads trying to talk to the same server, you will basically be doing a lot of nothing, when you could be talking to one or more of the other servers in parallel.

May I ask what in the code made you think that I contact a different (or the next) server when I create the next bunch of threads? For every set of threads I create inside the loop, I contact the same server. During execution (in the thread), if the command fails, I wait 120s and then contact the same server on a different port.

I appreciate your descriptive post.

katharnakh.
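
On the backticks versus system point, a minimal sketch under stated assumptions: when the command string already redirects its output to a file, backticks have nothing left to capture, while the single-string form of system runs the same string through the shell and sets $? in the same way. The commands below are harmless stand-ins for the rsync command line, and run_logged is a hypothetical helper, not part of the posted script. A Unix-like shell is assumed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: run a shell command, appending its stdout and
# stderr to $log, and return the command's exit code.
sub run_logged {
    my ($cmd, $log) = @_;
    # A single string containing shell metacharacters goes through the
    # shell, so the redirection is honoured.
    system("$cmd >> '$log' 2>&1");
    return -1 if $? == -1;    # the command could not be started
    return $? >> 8;           # exit code (death by signal is ignored here)
}

my ($fh, $log) = tempfile(UNLINK => 1);
close $fh;

my $ok  = run_logged(q{echo replicated}, $log);   # stand-in for a successful rsync
my $bad = run_logged(q{sh -c 'exit 3'}, $log);    # stand-in for a failed rsync

open my $in, '<', $log or die "cannot read $log: $!";
my @lines = <$in>;
close $in;

print "ok=$ok bad=$bad\n";
print "log: $_" for @lines;
```

Returning the exit code also gives the caller something to retry on, which is what the `if ($?)` test in the posted worker does.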