http://www.perlmonks.org?node_id=1071248

Oligo has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!

I'm trying to do something I thought would be relatively simple, but it's, well, not. Basically I have a large BLAST output file and a list of headers in FastA format. All I want to do is extract the blast entries (from "Query=" to "Effective search space used: ~") for each of the headers in the list, into one smaller blast output file. You're probably familiar with blast outputs, but here's a snippet anyway:

Query= M01757:19:000000000-A3WAT:1:1101:15902:2061 1:N:0:1 Length=53 +Score E Sequences producing significant alignments: ( +Bits) Value dbj|AP012555.1| Mycobacterium avium subsp. hominissuis TH135 chr... +75.0 3e-11 gb|CP005928.1| Mycobacterium avium subsp. paratuberculosis MAP4,... +75.0 3e-11 gb|CP000479.1| Mycobacterium avium 104, complete genome +75.0 3e-11 gb|AE016958.1| Mycobacterium avium subsp. paratuberculosis str. ... +75.0 3e-11 >dbj|AP012555.1| Mycobacterium avium subsp. hominissuis TH135 chromoso +mal DNA, complete genome Length=4951217 Score = 75.0 bits (40), Expect = 3e-11 Identities = 42/43 (98%), Gaps = 0/43 (0%) Strand=Plus/Plus Query 7 CGGTGGGCAAGAAGGAACTCGAGAAGGACCCGCTGATGGGCAC 49 |||||||||||||||||||||||||||||||| |||||||||| Sbjct 1543504 CGGTGGGCAAGAAGGAACTCGAGAAGGACCCGGTGATGGGCAC 1543546 Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 1252612366488 Query= M01757:19:000000000-A3WAT:1:1101:16099:2064 1:N:0:1 Length=102 ***** No hits found ***** Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 3701553444831

...and so on. The header list looks like:

>M01757:19:000000000-A3WAT:1:1101:15902:2061 1:N:0:1 >M01757:19:000000000-A3WAT:1:1101:19601:5029 1:N:0:1 >M01757:19:000000000-A3WAT:1:1101:17448:5029 1:N:0:1

etc etc.

.

I've tried a nested while loop and the until function, that looks like:

# List of headers # BLAST file # Output file (BLAST format) use strict; # Specify I/O my $headerfile = $ARGV[0]; my $blastfile = $ARGV[1]; my $outfile = $ARGV[2]; # open header file open (HEADERFILE, $headerfile) or die "Cannot open header file\n"; # open blast file open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n"; # open output file open(OUTPUT, ">$outfile"); my $ender = "Query="; while (my $headerline = <HEADERFILE> ) { chomp $headerline; my $headercut = substr $headerline, 1; while (my $blastline = <BLASTFILE>) { if ($blastline =~ $headercut) { print $blastline; print OUTPUT $blastline; print until ( ($_ = <BLASTFILE>) =~ /$ender/i); print OUTPUT until ( ($_ = <BLASTFILE>) =~ /$ender/i); } } seek (BLASTFILE,0,0); } exit;

This works on test files, but isn't appropriate for my proper files as the blast file are >10G each (so looping through the whole thing for a few thousand headers takes forever. Reversing the loop just gives me no output, so I've tried putting the headers into an array and looping through the larger file like so:

# List of headers # BLAST file # Output file (BLAST format) use strict; # Specify I/O my $blastfile = $ARGV[0]; my $headerfile = $ARGV[1]; my $outfile = $ARGV[2]; # open blast file open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n"; # open header file open (HEADERFILE, $headerfile) or die "Cannot open header file\n"; # open output file open(OUTPUT, ">$outfile"); my @headerline = <HEADERFILE>; my $headerline = join ( '', @headerline); while (my $blastline = <BLASTFILE> ) { my $blastcut = substr $blastline, 7; if ($headerline =~ /\Q$blastcut\E/) { print $blastcut; } } exit;

This works for getting the first part, i.e. the headers from the blast output file so I know the loop in theory works. It's getting the rest of the information which is a bit tricky. I've tried using the until function as before to get to the next query entry like so:

# open blast file open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n"; # open header file open (HEADERFILE, $headerfile) or die "Cannot open header file\n"; # open output file open(OUTPUT, ">$outfile"); my $ender = "Query="; my @headerline = <HEADERFILE>; my $headerline = join ( '', @headerline); while (my $blastline = <BLASTFILE> ) { my $blastcut = substr $blastline, 7; if ($headerline =~ /\Q$blastcut\E/) { print $blastcut; print until ( ($_ = <BLASTFILE>) =~ /$ender/i); } } exit;

...but that more or less prints the entire blast file, and I have no idea why. I can't use a here-doc command as the blast file doesn't have any solo EOF points between the individual entries, so I'm kind of stuck. Any ideas greatly appreciated (and sorry for the shameful length of this post)!

Replies are listed 'Best First'.
Re: Extracting BLAST hits from a list of sequences
by Kenosis (Priest) on Jan 20, 2014 at 02:45 UTC

    Given your datasets, perhaps the following will be helpful (it correctly returns the one BLAST entry which contains a matching fasta header):

    use strict; use warnings; my %headers; while (<>) { $headers{$1} = 1 if /^>(.+)/; last if eof; } local $/ = 'Query= '; while (<>) { chomp; print $/. $_ if /(.+)/ and defined $headers{$1}; }

    Usage: perl script.pl headerFile blastFile [>outFile]

    The last, optional parameter directs output to a file.

    The local $/ = 'Query= '; sets file reading to one BLAST entry ('record') at a time. The header info after Query= is captured and the complete entry is printed if that header is in the hash.

      Using $/ is a good suggestion (and I upvoted it), but I feel that relying on the magic of the <> operator is unnecessarily clever in this case.

        Appreciate the upvote. It seemed only 'natural' to use <> in this case, since the OP already had files in @ARGV and using <> to process them avoided adding the code to do so.

Re: Extracting BLAST hits from a list of sequences
by no_slogan (Deacon) on Jan 20, 2014 at 00:38 UTC

    I know nothing about BLAST files, but it's probably because

    my $blastcut = substr $blastline, 7;

    is sometimes an empty string. Maybe do something like this:

    if ($blastline =~ /$ender/ && $headerline =~ /\Q$blastcut\E/) {

    or this:

    my $printing = 0; while (<BLASTFILE>) { if (/Query=\s*(.*)/) { $printing = (index($headerline,$1) >= 0); } print if $printing; }
Re: Extracting BLAST hits from a list of sequences
by Oligo (Novice) on Feb 18, 2014 at 15:41 UTC

    Thank you so much everyone, that was a really big help (and also sorry for taking so long to reply!)