Re: Regular expressions across multiple lines

Hello abcd,

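The following demonstration uses the BioUtil::Seq module from CPAN, which suits this use case: its FastaReader returns each sequence as a single string with the line breaks already trimmed, so the regular expression can match across what were separate lines in the file.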

use strict;
use warnings;

use BioUtil::Seq;
use constant { HDR => 0, SEQ => 1 };

# From the documentation:
#
# FastaReader returns an anonymous subroutine, when called, returns
# a fasta record which is a reference of an array containing the fasta
# header and sequence. By default, spaces and \r?\n are trimmed from
# the sequence.
#

my $next_seq = FastaReader("input_file.fasta");

while ( my $fa = $next_seq->() ) {
 # print ">$fa->[HDR]\n$fa->[SEQ]\n";
   my $name = ( split(/ /, $fa->[HDR], 2) )[0];

   while ( $fa->[SEQ] =~ /(?<=(.....))abc(.{10})def(?=(.....))/g ) {
      print "$name: $1, $2, $3\n";
   }
}
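
To see why the pattern uses lookbehind and lookahead for the five flanking characters, here is a minimal sketch on a made-up string (the sequence and the helper name flank_matches are invented for illustration, not taken from the original data). Because the lookarounds do not consume the flanks, two adjacent hits can share them; a pattern that consumes the flanks finds only the first hit.

```perl
use strict;
use warnings;

# Hypothetical sequence, invented for this illustration.
my $seq = 'AAAAAabc0123456789defGGGGGabc9876543210defTTTTT';

# Hypothetical helper: returns one "left, middle, right" string
# per match of $re in $seq.
sub flank_matches {
    my ( $seq, $re ) = @_;
    my @hits;
    while ( $seq =~ /$re/g ) {
        push @hits, "$1, $2, $3";
    }
    return @hits;
}

# Flanks captured inside lookarounds: they are not consumed by the match.
my @lookaround = flank_matches( $seq, qr/(?<=(.....))abc(.{10})def(?=(.....))/ );

# Flanks captured as part of the match: they are consumed.
my @consuming = flank_matches( $seq, qr/(.....)abc(.{10})def(.....)/ );

print "lookaround: $_\n" for @lookaround;
print "consuming:  $_\n" for @consuming;
```

The lookaround version reports two hits (the second one reuses GGGGG as its left flank), while the consuming version reports one.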

Regards, Mario.

Re^2: Regular expressions across multiple lines by marioroy (Prior) on Apr 25, 2016 at 00:13 UTC
Update: Changed chunk_size to '2M'.
Update: Added full example.
Update: Added missing tr line to trim white space.

For the spirit of Perl and Bioinformaticians at large, the following does the same thing by utilizing the record separator option in MCE. The "\n>" value is a special case which anchors ">" at the start of the line: workers receive records beginning with ">" and ending in "\n".

The following demonstration is fast for small and large sequences. A chunk_size greater than 8192 is taken as a number of bytes: at least that many bytes are read, and Perl then keeps reading until the record separator. A worker may receive one or several records depending on the size of the record(s).

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

mce_flow {
   max_workers => 4, chunk_size => '2m',
   input_data => "input_file.fasta", RS => "\n>",
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my ( $name, $output );

   for ( @{ $chunk_ref } ) {
      /^>(\w+)/; $name = $1;
      tr/\t\r\n //d;   # trim white space (the header line is still part of $_ here)

      while ( $_ =~ /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g ) {
         $output .= "$name: $1, $2, $3\n";
      }
   }

   print $out_fh $output if length($output);
};

The following demonstration was created mainly as a template for extracting the seq_id, seq_desc, and sequence separately, and doing so with low memory consumption. The whole header line is cut from the record, leaving just the sequence in $_ without Perl making an extra copy.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

mce_flow {
   max_workers => 4, chunk_size => '2m',
   input_data => "input_file.fasta", RS => "\n>",
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my ( $pos, $hdr, $seq_id, $seq_desc, $output );

   for ( @{ $chunk_ref } ) {
      $pos = index($_, "\n") + 1;
      $hdr = substr($_, 0, $pos - 1);

      # skip the first record, e.g. comment at the top of the file
      next if ( $chunk_id == 1 && substr($hdr, 0, 1) ne '>' );

      # extract seq_id and seq_desc
      $hdr =~ /^>(\w+)\s*([^\r\n]*)/;
      $seq_id = $1, $seq_desc = $2;

      # $_ becomes sequence, without making an extra copy
      substr($_, 0, $pos, '');

      # trim any white space in sequence
      tr/\t\r\n //d;

      # for printing ">header\nsequence\n", uncomment the next 3 lines
      # ( length $seq_desc )
      #    ? print ">$seq_id $seq_desc\n$_\n"
      #    : print ">$seq_id\n$_\n";

      # loop through match patterns
      while ( /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g ) {
         $output .= "$seq_id: $1, $2, $3\n";
      }
   }

   print $out_fh $output if length($output);
};

Regards, Mario.
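
The header/sequence split used in the second demonstration can be tried outside MCE. The following is a minimal sketch, assuming a record that starts with ">" and contains at least one newline; the record text and the helper name split_record are made up for illustration. It works through a reference so the record string is edited in place, mirroring the 4-argument substr trick above.

```perl
use strict;
use warnings;

# Hypothetical helper: cuts the header line off the record in place and
# returns ( seq_id, seq_desc ). Assumes the record contains a newline.
sub split_record {
    my ($rec_ref) = @_;

    # position just past the first newline, i.e. where the sequence starts
    my $pos = index( $$rec_ref, "\n" ) + 1;
    my $hdr = substr( $$rec_ref, 0, $pos - 1 );

    # seq_id is the first word after ">", seq_desc is the rest of the line
    my ( $seq_id, $seq_desc ) = $hdr =~ /^>(\w+)\s*([^\r\n]*)/;

    # 4-argument substr removes the header from the original string
    substr( $$rec_ref, 0, $pos, '' );

    # delete tabs, CRs, LFs, and spaces from what remains (the sequence)
    $$rec_ref =~ tr/\t\r\n //d;

    return ( $seq_id, $seq_desc );
}

# Hypothetical record, invented for this illustration.
my $rec = ">seq1 example description\nACGT ACGT\nTTGA\n";
my ( $seq_id, $seq_desc ) = split_record( \$rec );

print "id:   $seq_id\n";
print "desc: $seq_desc\n";
print "seq:  $rec\n";
```

After the call, $rec holds only ACGTACGTTTGA, with seq1 and "example description" returned separately.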
Re^2: Regular expressions across multiple lines by marioroy (Prior) on Apr 24, 2016 at 23:12 UTC
The following is a parallel demonstration when extra performance is desired for very large sequences. Otherwise, the serial demonstration is faster.

use strict;
use warnings;

use BioUtil::Seq;
use constant { HDR => 0, SEQ => 1 };

use MCE::Flow;
use MCE::Shared;

mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

# From the documentation:
#
# FastaReader returns an anonymous subroutine, when called, returns
# a fasta record which is a reference of an array containing the fasta
# header and sequence. By default, spaces and \r?\n are trimmed from
# the sequence.
#

mce_flow {
   max_workers => 4, chunk_size => 1,
   input_data => FastaReader("input_file.fasta")
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my $fa = $chunk_ref->[0];

 # my $fa = $_;   # same thing for chunk_size => 1
                  # therefore, the 2 lines above may be omitted

 # print ">$fa->[HDR]\n$fa->[SEQ]\n";
   my $name = ( split(/ /, $fa->[HDR], 2) )[0];
   my $output;

   while ( $fa->[SEQ] =~ /(?<=(.....))abc(.{10})def(?=(.....))/g ) {
      $output .= "$name: $1, $2, $3\n";
   }

   print $out_fh $output if length($output);
};

Regards, Mario.
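
For readers unfamiliar with the iterator style quoted from the documentation, here is a rough sketch of a closure that hands back one [header, sequence] array reference per call. It is not BioUtil::Seq's implementation: make_fasta_iter is a hypothetical stand-in written only to show the calling convention, and the input is an in-memory string invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a FastaReader-like iterator (NOT the
# BioUtil::Seq code). Returns a closure; each call yields the next
# record as [ header, sequence ], or undef when the input is exhausted.
sub make_fasta_iter {
    my ($fh) = @_;
    my ( $hdr, $done );

    return sub {
        return if $done;
        my $seq = '';

        while ( defined( my $line = <$fh> ) ) {
            $line =~ s/\r?\n\z//;              # strip the line ending

            if ( $line =~ /^>(.*)/ ) {
                my $next_hdr = $1;
                if ( defined $hdr ) {          # a record is complete
                    my $rec = [ $hdr, $seq ];
                    $hdr = $next_hdr;          # remember the next header
                    return $rec;
                }
                $hdr = $next_hdr;              # first header in the input
            }
            elsif ( defined $hdr ) {
                $line =~ tr/ \t//d;            # drop blanks inside the sequence
                $seq .= $line;
            }
        }

        $done = 1;                             # end of input reached
        return defined $hdr ? [ $hdr, $seq ] : ();
    };
}

# Hypothetical two-record input, invented for this illustration.
my $fasta = ">seq1 first\nACGT\nACGT\n>seq2 second\nTTTT\n";
open my $fh, '<', \$fasta or die "open error: $!\n";

my $next_seq = make_fasta_iter($fh);
my @records;
while ( my $fa = $next_seq->() ) {
    push @records, "$fa->[0] => $fa->[1]";
}
print "$_\n" for @records;
```

Each multi-line sequence comes back as one joined string, which is what lets the serial and parallel demonstrations run a single regular expression over it.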

