Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

problem of my multithreading perl script

by qingfengzealot (Initiate)
on Jan 11, 2013 at 12:02 UTC ( #1012870=perlquestion: print w/replies, xml ) Need Help??
qingfengzealot has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl gurus, I met a problem when I tried to apply a multithread perl script. In my script, I need to read two lists of files (for example, list1 has 7 files, and list2 has 5 files) and grab some information into a hash (If there are 12 files, then I create 12 threads to read files and put all the information into the same hash). Each file has ~21 millions lines. So I want to use multithread to speed up my script. However, it seems that the multithread one ran even much slower than the single thread script. I know that there should be some problems in my script, but I don't know where they are. Hope you could help me. Thanks very much. Following is my script:
#!/usr/bin/perl -w use strict; use warnings; use threads; use threads::shared; use Statistics::R; # using R to do t.test if(@ARGV != 3) { print STDERR "Usage: input_list1 input_list2 output\n"; exit(0); } my ($inf1, $inf2, $outf)=@ARGV; my @inf1=`ls $inf1`; # get all of files in list1 my @inf2=`ls $inf2`; # get all of files in list2 my %hash :shared; sub read_file{ my $inf=shift; $hash{$inf} = &share({}); open(IN, $inf) or die "cannot open $inf\n"; while(<IN>){ if($_=~/\w/){ chomp; my @info=split(/\t/, $_); # lock(%hash); $hash{$inf}{$info[0]."_".$info[1]}=$info[2]; } } close IN; } my @threads; my $thread_count=0; for(my $i=0; $i<=$#inf1; $i++){ my $t = threads->create(\&read_file, $inf1[$i]); push(@threads, $t); $thread_count++; } for(my $i=0; $i<=$#inf2; $i++){ my $t = threads->create(\&read_file, $inf2[$i]); push(@threads, $t); $thread_count++; } print STDERR "Total threads to read the files: $thread_count\n"; $_->join foreach @threads; sleep 1; open(OUT, ">$outf") or die "cannot open $outf\n"; # Create a communication bridge with R and start R my $R = Statistics::R->new(); ### below is to do some R related calculation based on the hash; the +problem should exist above :) open(IN, $inf1[0]) or die "cannot open $inf1[0]\n"; while(<IN>){ if($_=~/\w/){ my @info=split(/\t/, $_); my $cpg_score; my $total1; my $total2; my @list1; my @list2; foreach my $sample (@inf1){ $cpg_score.="\t".$hash{$sample}{$info[0]."_".$info[1]}; $total1+=$hash{$sample}{$info[0]."_".$info[1]}; push(@list1, $hash{$sample}{$info[0]."_".$info[1]}); } foreach my $sample (@inf2){ $cpg_score.="\t".$hash{$sample}{$info[0]."_".$info[1]}; $total2+=$hash{$sample}{$info[0]."_".$info[1]}; push(@list2, $hash{$sample}{$info[0]."_".$info[1]}); } if(abs($total1/@sample1 - $total2/@sample2)>=0.2){ my $mean1=sprintf("%.2f", $total1/@inf1); my $mean2=sprintf("%.2f", $total2/@inf2); my $list1=join",", @list1; my $list2=join",", @list2; ### Run R commands $R->run(qq`x <- t.test(c($list1), c($list2))`); my $p_value= $R -> get('x$p.value'); print OUT "$info[0]\t$info[1]$cpg_score\t$mean1\t$mean2\t$p_ +value\n"; } } } $R->stop(); close IN; close OUT;

Replies are listed 'Best First'.
Re: problem of my multithreading perl script
by RichardK (Parson) on Jan 11, 2013 at 12:34 UTC

    Reading large files is likely to be I/O bound not CPU bound, so it's not surprising that making it multi-threads didn't help.

    You could try profiling your single threaded version to see where the time is being spent, and then try to improve that.

    How do I profile my Perl programs?

    But your code looks pretty simple, so it may not have much improvement to give.

      Thanks RichardK for your kind reply. This is my first time to use multi-threads in perl. So I'm not sure if I properly use it or not. Even it is likely to to be I/O bound, the multi-threads one shouldn't run slower than the single thread one, am I right? Please correct me if I'm wrong. Thanks again for your help. Best regards, Qiongyi


        No, there are multiple ways that a threaded program can be slower than a single threaded program. In this case, I don't know exactly which one(s) you're hitting, but there are two candidates I can think of:

        • Disk thrashing: When you're reading a file sequentially, the disk drive can normally read sector after sector without (many) intermediate seeks. If you're reading multiple files at the same time, your drive needs to seek frequently between file locations. In other words, reading files one at a time generally has fewer disk seeks than reading them in parallel, and seeks are slow operations.
        • Data structure locking: In a single-threaded program, you don't need to worry about locking your data structures. But in a multithreaded program, you need to lock data structures when modifying them if another thread has the potential to also update the data structure. Even when there's no contention when accessing the data structure you're spending time creating and freeing locks.

        Having said that, you can probably speed things up without going to a single-threaded program.

        If your program is experiencing disk thrashing, you might be able to avoid it by placing your files on different drives. That way you won't have as many disk seeks.

        If your program is spending too much time managing data structure locks, you might be able to rearrange your code a bit to reduce the number of locks. For example, you might have each thread read and process a file in a local (nonshared) data structure, and then once it's finished a file, then merge the data into the shared structure.

        There are other possible problems and solutions to your problem, these are just the ones that came to mind. (I've run into both of them frequently enough...unfortunately.)


        When your only tool is a hammer, all problems look like your thumb.

        There is extra management overhead dealing with the threads, so yes, a multi-threaded program can run slower than a single threaded program, especially if there is no work other than the I/O to do.


        Actually it will run slower. All of the threads are writing to the same hash. For each write, the hash is locked and any other threads have to way until the current thread finishes its write before they can start.

        Try having each thread write to a separate hash, then merge them in the main thread once the reading is complete.

Re: problem of my multithreading perl script
by BrowserUk (Pope) on Jan 11, 2013 at 16:54 UTC
    However, it seems that the multithread one ran even much slower than the single thread script.

    The problem here is nothing the threading per se -- although it may be compounded by it. A big part of the slowdown is due to asking your harddisk to read from multiple files simultaneously, forcing the read head to dance all over the disk to fetch one block from one file; then one block from another; then one block from another ...; and then back to the first; and so on.

    Regardless of the speed of your disk -- the same would be true for SSDs, though less so -- and regardless of whether you use threads or separate processes, reading 12 large files concurrently will be far slower than reading those same file sequentially.

    A good analogy would be you trying to read 12 books by reading one page of one then one of the next and so on. Even ignoring the affect that will have on your brain trying to keep all the stories straight; just the simple need to constantly switch from one book to another to another will seriously slow down your throughput rate.

    This can be somewhat mitigated by putting the files on different physical drives -- either multiple physical configured as multiple logical drives; or by multiple physical drives raided as a single logical drive -- because one can be actually transferring data whilst other(s) are moving their read heads. But even this will usually result in lower throughput than sequential reading because of the extra context switches (thread or process); system bus and device bus conflicts etc. It also creates extra load on the physical/virtual memory mapping; and the L1/L2/L3 memory and system file caches.

    In the past, I have had some success speeding up the reading of many files by serially slurping them into scalars and then handing those scalars off to threads to process line by line -- I do that by opening the slurped scalar as a ram file, and then using the familiar while( <$FH> ){ ... } loop on the scalar -- but even this requires considerable care to ensure that the (huge) slurped scalars don't get unnecessarily duplicated in the process of handing them off to the threads for processing.

    Swapping doesn't help

    I also note that you are building a shared hash containing (from what you said) 12 shared sub-hashes, each containing 21,000,000 key/value pairs.

    On the basis of a simple experiment -- a shared hash containing 5,000,000 key value pairs shared by just 2 threads requires 2.0GB on my 64-bit perl -- I therefore estimate that your shared hash will require at least 12 * 2.0/5*21 = 100.8GB, and possible much more if you have long keys or values. So, unless you have a very large amount of memory you will be moving your system into swapping by loading that much data. Indeed, trying to load that amount into a non-shared hash is going to move you into swapping even if you read them sequentially in a single-threaded process unless you have circa. 64GB ram in your system.

    Basically, I think you need to re-think the way you are tackling this problem. Is it really necessary to have all that data in memory concurrently?

    What is the data? What calculations are you performing?

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      -- the same would be true for SSDs, though less so --
      As an aside, I am quite intrigued by this comment.   At your convenience, I would like to know more of your insights with regard to this assertion, perhaps best moved to a separate thread.   I would intuit that such latencies could not exist with solid-state, although I do not dispute you; hence, “howcum?”

        Take a look at an SSD review.

        Note how the random 4k reads achieves a relatively low throughput (101.4MB/s in that particular case), whereas the throughput for sequential reads is considerably higher (431.8MB/s).

        This is a common factor with all SSDs. Much faster than spinning rust, but it is still considerably faster to read a file sequentially than randomly.

        If you are reading 12 files sequentially, concurrently, you are reading 4K blocks randomly from the viewpoint of the controller.

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1012870]
Approved by vinoth.ree
[shmem]: can't help you there since that song is not on my bench
[Lady_Aleena]: shmem, I need to look at MP3::Tag to see how to get the date from one song.
shmem sticks himself into the envelope. gn.(1<<3) all

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2017-04-23 22:21 GMT
Find Nodes?
    Voting Booth?
    I'm a fool:

    Results (432 votes). Check out past polls.