Re: Deconvoluting FastQ files
by BrowserUk (Patriarch) on Aug 06, 2012 at 08:25 UTC
#! perl -sw
use strict;

my %outFHs = map {
    open my $fh, '>', "$_.fastQ" or die $!;
    $_ => $fh;
} qw[ TTGT GGTT ACCT ];

until ( eof() ) {
    my @lines   = map scalar <>, 1 .. 4;
    my $barcode = substr $lines[1], 0, 9;
    my $tag     = substr $barcode, 3, 4;
    print { $outFHs{ $tag } } @lines;
}
__END__
usage:
thisScript theBigfile.fastQ
## outputs to TTGT.fastQ GGTT.fastQ ACCT.fastQ
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
Thank you, BrowserUk. Wow... I didn't realise you could sort out the whole thing in fewer than two dozen lines of Perl :O. I am so glad I asked.
However, when testing the script out with my fastq files I got three files with a few records in each, but then the script stopped and I got the following error message:
Use of uninitialized value within %outFHs in ref-to-glob cast at Sort.pl line 14, <> line 120.
Can't use string ("") as a symbol ref while "strict refs" in use at Sort.pl line 14, <> line 120.
How can I debug this? I guess this has to do with my records?
Thank you.
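One quick way to find the offending records is to take a census of the 4-letter tags before sorting, so any unexpected tag (the one crashing the script) stands out. A minimal sketch; the sample reads are invented and fed through an in-memory handle so it runs standalone, but for a real file you would read from <> instead:

```perl
use strict;
use warnings;

# Count every distinct 4-letter tag (characters 4-7 of line 2 of each
# record) in a fastq stream. Sample data is inline for illustration.
my $data = join '',
    "\@r1\n", "ACGTTGTAGCC\n", "+\n", "IIIIIIIIIII\n",
    "\@r2\n", "ACGNNNNAGCC\n", "+\n", "IIIIIIIIIII\n";
open my $in, '<', \$data or die $!;

my %count;
until ( eof $in ) {
    my @lines = map scalar <$in>, 1 .. 4;
    $count{ substr $lines[1], 3, 4 }++;
}
printf "%s\t%d\n", $_, $count{$_} for sort keys %count;
```

Any tag other than TTGT, GGTT or ACCT in the output identifies the records that have no matching filehandle.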
Line 120 of your data file contains a record whose key field (characters 4 through 7) is not one of
"TTGT" "GGTT" "ACCT"
To handle that, try this modified version:
#! perl -sw
use strict;

my %outFHs = map {
    open my $fh, '>', "$_.fastQ" or die $!;
    $_ => $fh;
} qw[ TTGT GGTT ACCT other ];

until ( eof() ) {
    my @lines   = map scalar <>, 1 .. 4;
    my $barcode = substr $lines[1], 0, 9;
    my $tag     = substr $barcode, 3, 4;
    print { $outFHs{ $tag } // $outFHs{ 'other' } } @lines;
}
__END__
usage:
thisScript theBigfile.fastQ
## outputs to TTGT.fastQ GGTT.fastQ ACCT.fastQ
## Unrecognised records are put into "other.fastQ"
Re: Deconvoluting FastQ files
by frozenwithjoy (Priest) on Aug 06, 2012 at 07:01 UTC
You should check out the FASTX Toolkit. It includes a very complete barcode-splitting script (written in Perl) in addition to many other useful tools. You can download pre-compiled binaries or the source here. If you want to subsequently trim the barcodes, you can use fastx_trimmer.
Thank you, frozenwithjoy. With FastQ on galaxy I need to trim the first three letters of my records to be able to use the barcode splitting function. I haven't tried the stand-alone version yet. I will give that a go once I can get it to work on my computer.
These three letters are important for my analysis, so I am not entirely sure if I can use FastX's barcode splitter tool. I am playing around with the galaxy version of it at present.
I took a look at fastx_barcode_splitter.pl and I think I've figured out a solution. I haven't tested it, but if you change line 161 from:
unless $barcode =~ m/^[AGCT]+$/;
to:
unless $barcode =~ m/^[AGCTN]+$/;
then you should be able to prefix your barcodes w/ 3 N's as long as you set --mismatches to at least 3 on the command line when running the script.
One caveat is that you will want to toss out any reads that have any Ns in the first X bases (where X = 3+ barcode length). Have you run FastQC? If so, this will tell you the per base N content. It probably won't be an issue if you've already done preliminary filtering based on Illumina's Y/N flags (assuming Illumina sequencing, of course).
Also, (depending on your computer, of course) I suspect fastx_barcode_splitter.pl will run a lot faster at the CLI than on Galaxy (at least if you are using the public galaxy server).
Edit: to avoid the potential problem w/ Ns, just use some other non-nucleotide character!
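For the CLI route, the barcode file and invocation might look roughly like this. The file name, sample names, and input file are made up for illustration, and the flag names are taken from the FASTX-Toolkit documentation; verify them against `fastx_barcode_splitter.pl --help` on your installed version:

```shell
# Hypothetical barcode file for fastx_barcode_splitter.pl:
# one "<name><TAB><sequence>" per line, with the 3-letter
# prefix masked as NNN per the suggestion above.
printf 'rep1\tNNNTTGT\nrep2\tNNNGGTT\nrep3\tNNNACCT\n' > barcodes.txt
cat barcodes.txt

# Then, roughly (flags as documented for the FASTX-Toolkit):
#
#   fastx_barcode_splitter.pl --bcfile barcodes.txt --bol --mismatches 3 \
#       --prefix split_ --suffix .fastq < theBigfile.fastq
```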
Re: Deconvoluting FastQ files
by ww (Archbishop) on Aug 06, 2012 at 08:02 UTC
If production is your goal, frozenwithjoy's suggestion may be just what you need.
If this is (as it appears) schoolwork (homework?), then you need to understand that this site is NOT a 'code-a-matic'.
We'll be pleased to help you learn; you need merely show that you've made a good faith effort to solve your problem. In this case, that means: post your code and tell us how it fails, or post an algorithm (or pseudo-code) where you can't work out the syntax.
You've outlined a fairly ambitious project for a 'complete newbie in perl,' so -- in case you're stuck on which of Perl's capabilities will help you here, consider
- Do you know how to get the data from the "main file" into a script? If not, perldoc perlop (the "I/O Operators" section) and perldoc -f readline will be helpful. Hint: given the size of your data file, you'll probably want to read it line by line.
- Consider pushing each line into a temporary cache as it's read; then test to see if it satisfies some criterion for being line 2. If not, read the next line, and see if it's line 2.
If so, test its first 9 chars with regexen (lotsa reading here: perldoc perlretut and company) and, if those match the characteristics for replicate 1, 2 or 3, stash the cached line, the line with the match, and the next two lines into the appropriate array, say @rep1, @rep2 or @rep3. (Hint: set a flag when you find a match for any target replicate, and use it together with a counter (see the ++ increment operator in perldoc perlop) to know when you've pushed all four lines of the record.)
- wash, rinse, repeat...
My suspicion is that working out an appropriate set of regular expressions (there's a broad hint in the word "set" and a part of one of many possible solutions next) will be your biggest challenge, so...
my( @rep1, @rep2, @rep3 );
my $prefix  = qr/[ACTG]{3}/;
my $rep1    = qr/TTGT/;
my $rep2    = qr/GGTT/;
my $rep3    = qr/ACCT/;
my $postfix = qr/[ACTG]{2}/;

while ( my $line = <DATA> ) {
    if ( $line =~ /^$prefix
                    $rep1
                    $postfix/x )
    {
        push @rep1, $line;  # ignoring, for regex instruction,
                            # the need to push your cached line, etc.
    }
    elsif ( $line =~ /^$prefix
                       $rep2
                       $postfix/x )
    {
        ....
There may be a way around this line-by-line approach. If you can absolutely count on "+" as the entire content of the third line of each record, you could use that fact as part of an approach to reading your "main file" record-by-record -- but that would be an additional complexity. Your addendum does, however, suggest an approach.
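The record-by-record idea above, with a sanity check on the "+" third line, might look like the following sketch. The sample data is inline (via an in-memory filehandle) so the snippet runs standalone; for a real file you would read from <> instead:

```perl
use strict;
use warnings;

# Read 4-line FastQ records, sanity-checking the '+' separator (line 3),
# and extract the replicate tag (characters 4-7 of the sequence line).
my $data = join '',
    "\@read1\n", "ACGTTGTAG\n", "+\n", "IIIIIIIII\n",
    "\@read2\n", "CCAGGTTAA\n", "+\n", "IIIIIIIII\n";
open my $in, '<', \$data or die $!;

my @tags;
while ( defined( my $header = <$in> ) ) {
    my( $seq, $sep, $qual ) = map scalar <$in>, 1 .. 3;
    die "Malformed record at $header" unless $sep =~ /^\+/;
    push @tags, substr $seq, 3, 4;    # chars 4-7: the replicate tag
}
print "$_\n" for @tags;
```

From there, each tag can be matched against the replicate regexen and the four cached lines pushed onto @rep1, @rep2 or @rep3 as described.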
So, my suggestion is: try this if you're working on homework... and come back when you get stuck, with code and details about the shortcomings of that code.
And BTW, welcome to the Monastery.
I was hoping to get an idea where I should focus my reading about perl, but obviously I am not expecting a code-o-matic solution. It's not quite a homework. I am more interested in the biological question rather than the coding part which I know I'm not very good at.
That's fine... and good for you. ++ The hope that that was your aim was my reason for including the code snippet and the links to a few relevant docs.
Re: Deconvoluting FastQ files
by Kenosis (Priest) on Aug 06, 2012 at 19:49 UTC
Here's an option that uses Tie::File to bind the fastq records' file to an array for processing. The files will automatically close when my %FHs falls out of scope. My thanks to BrowserUk for the elegant file handle/hash routine.
use Modern::Perl;
use Tie::File;

{
    tie my @fastqLines, 'Tie::File', 'records.fastq', recsep => "\n"
      or die $!;

    my %FHs = map {
        open my $fh, '>', "$_.fastq" or die $!;
        $_ => $fh;
    } qw[ TTGT GGTT ACCT ];

    for ( my $i = 0 ; $i < scalar @fastqLines ; $i += 4 ) {
        $fastqLines[ $i + 1 ] =~ /^...(.{4})/
          and print { $FHs{$1} } @fastqLines[ $i .. $i + 3 ];
    }

    untie @fastqLines;
}
Update: Don't try this at home, as the OP's 34G file is much too large for Tie::File to efficiently handle.
Oh, this is sad. I just had the script work on a 1G+ file, and extrapolated the time to about 17hrs of processing for a 34G file. It's not weeks, but it's also not practical. Didn't realize that Tie::File was so inefficient with large files.
Good call to point this out, BrowserUk! Will place an "Update:" in the posting.
my %mindHash = (
    'Tie::File + Large File' => '"Run Away!" (King Arthur, Monty Python)'
);
Re: Deconvoluting FastQ files
by snakebites (Initiate) on Aug 06, 2012 at 06:40 UTC
I must add that lines 2 and 4 have a length of 51 characters for each record. The length of line 1 is also the same for each record.