Hello again,

Recently I asked about tuning multiple regular expressions and got some good suggestions and help with benchmarking. That code is now working well. However, I'm working on a many-CPU machine and would like to take advantage of that by forking the different searches.

Essentially, my program (given at the bottom) does three things:

1. open a series of files one-by-one and load each into memory
2. for each file, perform many (10-30) searches
3. write the search results to a single file

My thought was to fork each of the separate string searches, so the string gets loaded from disk only once, and multiple CPUs can be used for searching it.

Aside from wondering whether this is a reasonable approach, my main question is about implementing the output to disk. Can I have all the children write to the same file-handle, or will this lead to corruption? And because the file output will almost certainly be the limiting step, am I being silly in parallelizing this in the first place? If there's a way to pass data back from the child to the parent, I could create one aggregated results hash and print it to disk in one go, for example.
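To make that last idea concrete, here is a rough sketch of what I have in mind. It is not tested on real chromosome data. It uses Perl's forking open (open with '-|') so that each child's STDOUT becomes a pipe the parent can read. The sub name parallel_search and the tab-separated "motif, position" line format are my own inventions for illustration; the index() loop is the same one as in my program.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (not part of my real program): fork one child per
# motif, let each child search the in-memory sequence, and collect every
# hit in the parent. Returns an array ref of [ $motif_name, $position ].
sub parallel_search {
    my ($sequence, $motifs) = @_;
    my @readers;

    foreach my $motif ( sort keys %$motifs ) {

        # Forking open: returns the child's PID in the parent and 0 in
        # the child, where STDOUT is now the write end of a pipe.
        my $pid = open(my $from_child, '-|');
        die "Unable to fork: $!" if !defined $pid;

        if ( 0 == $pid ) {
            # Child: same index() loop as the serial version.
            my $str = $motifs->{$motif};
            my $len = length($str);
            my $pos = 0;
            while ( ($pos = index($sequence, $str, $pos)) >= 0 ) {
                print join("\t", $motif, $pos), "\n";
                $pos += $len;
            }
            close(STDOUT);      # flush buffered results into the pipe
            POSIX::_exit(0);    # leave without running END blocks
        }

        push @readers, $from_child;
    }

    # Parent: drain each pipe in turn. close() on a piped handle waits
    # for that child and returns false if it exited with an error.
    my @hits;
    foreach my $fh (@readers) {
        while ( my $line = <$fh> ) {
            chomp $line;
            push @hits, [ split /\t/, $line ];
        }
        close($fh) or die "Child failed (exit status $?)";
    }

    return \@hits;
}
```

With something like this, only the parent would ever touch the output file: after parallel_search($sequence, \%motifs) returns, it could print the chromosome, position and motif for every hit. My assumption is that on a Unix-like system the children share the parent's copy of the sequence after fork rather than each re-reading it from disk, but I have not measured whether that holds up for strings this large.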

Any advice/suggestions much appreciated!

Here's the code:

### INCLUDES #########################################################################################
use strict;
use Bio::SeqIO;
use Carp;

### PARAMETERS #######################################################################################
my $chr_file = $ARGV[0];
my $seq_file = $ARGV[1];

if ( 2 != scalar(@ARGV) ) {
        croak 'Invalid parameter number';
        }
elsif ( ! -e $chr_file || ! -T $chr_file ) {
        croak 'Missing or invalid chromosome-listing file';
        }
elsif ( ! -e $seq_file || ! -T $seq_file ) {
        croak 'Missing or invalid sequence-listing file';
        }

### LOCALS ###########################################################################################
my @chromosomes;
my %motifs;

### LOAD THE CHROMOSOME LIST #########################################################################
open(my $fh_chr, '<', $chr_file) or croak "Unable to open chromosome list: $chr_file";

while (<$fh_chr>) {

        s/^\s+//;
        s/\s+$//;

        my $row = $_;
        next() if (!$row);

        push @chromosomes, $row;

        }

close($fh_chr);

### LOAD THE MOTIF LIST ##############################################################################
open(my $fh_seq, '<', $seq_file) or croak "Unable to open motif file: $seq_file";

while (<$fh_seq>) {

        s/^\s+//;
        s/\s+$//;

        my @row = split("\t");

        next() if ( 2 != scalar(@row) );

        $motifs{ $row[0] } = $row[1];

        }

close($fh_seq);

### FIND SEQUENCE MOTIFS #############################################################################
foreach my $chromosome (@chromosomes) {

        my $directory = $chromosome.'/';
        my $file = 'chr'.$chromosome.'.fa.masked';
        my $path = $directory.$file;

        my $seqio = Bio::SeqIO->new(
                -file    =>  "<$path",
                -format  =>  'largefasta'
                );

        my $seq = $seqio->next_seq();
        my $sequence = $seq->seq();

        foreach my $motif ( keys(%motifs) ) {

                my $str = $motifs{$motif};
                my $len = length($str);
                my $pos = 0;

                while ( ($pos = index($sequence, $str, $pos)) >= 0 ) {
                        print join("\t", $chromosome, $pos, $motif), "\n";
                        $pos += $len;
                        }

                }

        }

In reply to Forking Multiple Regex's on a Single String by bernanke01