http://www.perlmonks.org?node_id=985609

snakebites has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,

I have a series of records (4 lines per record) in FastQ format from DNA sequencing projects. The data actually come from 3 different replicates that are "barcoded". The barcode is basically the first 9 letters in the second line of each record. In the example below:

@DJB775P1:314:C0K9AACXX:3:1101:1597:2067 1:N:0:

AGTTTGTATAGCACGCGCGCCAGCAGCTGCAGCAGGTGCCCGGGCTGCTGG

+

CCCFFFFFHHHHHJJJJJJJIJJJJJIJIJJJIJJJFGIJJIJHGFFFFEE

@DJB775P1:314:C0K9AACXX:3:1101:1644:2074 1:N:0:

CTTGGTTAGAGCTCTTTCTCGATTCCGTGGGTGGTGGTGCAGATCGGAAGA

+

CCCFFDFFHHHHHJJJJJJJJJJJJJJFHIJCGIDGIGHGIJJJJJJHIBH

@DJB775P1:314:C0K9AACXX:3:1101:1707:2079 1:N:0:

ACGACCTTATAAACGGGTGGGGTCCGCGCAGTCCGCCCGGAGGATTAGATC

+

@CCFFFFFHGGHHJJJJ<CGHJCGIJIIJ@AFCHEHHHBEACD?CCDCCDD

@DJB775P1:314:C0K9AACXX:3:1101:1543:2082 1:N:0:

TCTTTGTTCAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGAGATCG

+

@@CFFFDFHHHHHJJJJEIJIJIJJJJJJJJJJJD9B<CDDDCDDDDDCDD

There are 4 records and the first 9 letters for each record are the following:

AGTTTGTAT

CTTGGTTAG

ACGACCTTA

TCTTTGTTC

Thus, the general form of the barcode is the following:

NNNXXXXNN, where N can be any of the letters A, C, T or G, and XXXX is either TTGT, GGTT or ACCT (one for each library). As you can see, record 1 and record 4 belong to the same replicate.
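To make the barcode layout concrete, here is a minimal sketch (using Perl's built-in substr; the sequence is record 1's second line from the example above) of how the 4-letter library tag can be pulled out of a sequence line:

```perl
use strict;
use warnings;

# Record 1's sequence line from the example above.
my $seq = 'AGTTTGTATAGCACGCGCGCCAGCAGCTGCAGCAGGTGCCCGGGCTGCTGG';

my $barcode = substr $seq, 0, 9;      # first 9 letters: 'AGTTTGTAT'
my $tag     = substr $barcode, 3, 4;  # the XXXX part:   'TTGT'

print "$barcode -> $tag\n";           # prints "AGTTTGTAT -> TTGT"
```

Note that substr offsets are 0-based, so the XXXX tag occupies offsets 3 through 6 of the barcode.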

I would like to separate the records that I have (currently in a single file) into 3 separate files based on either TTGT, GGTT or ACCT. Currently, the main file is ~34GB in size.

I wonder if anyone can make a suggestion about how I can go about doing this using Perl and the BioPerl module. I am a complete newbie in Perl programming.

Thank you very much for your time and for reading my post.

Regards,

snakebites

Replies are listed 'Best First'.
Re: Deconvoluting FastQ files
by BrowserUk (Patriarch) on Aug 06, 2012 at 08:25 UTC

    Try this. It should take around 40 minutes for 34GB:

    #! perl -sw
    use strict;

    my %outFHs = map {
        open my $fh, '>', "$_.fastQ" or die $!;
        $_ => $fh;
    } qw[ TTGT GGTT ACCT ];

    until( eof() ) {
        my @lines   = map scalar <>, 1 .. 4;
        my $barcode = substr $lines[1], 0, 9;
        my $tag     = substr $barcode, 3, 4;
        print { $outFHs{ $tag } } @lines;
    }

    __END__
    usage: thisScript theBigfile.fastQ
    ## outputs to TTGT.fastQ GGTT.fastQ ACCT.fastQ

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Thank you browseruk. Wow.. I didn't realise you could sort out the whole thing using less than 2 dozen lines of perl script :O. I am so glad I asked. However, when testing the script out with my fastq files I got three files with a few records in each, but then the script stopped and I got the following error message:

      Use of uninitialized value within %outFHs in ref-to-glob cast at Sort.pl line 14, <> line 120.

      Can't use string ("") as a symbol ref while "strict refs" in use at Sort.pl line 14, <> line 120.

      How can I debug this? I guess this has to do with my records? Thank you.

        Line 120 of your data file contains a record where the key field (characters 4 through 7) is not one of "TTGT", "GGTT" or "ACCT".

        To handle that, try this modified version:

        #! perl -sw
        use strict;

        my %outFHs = map {
            open my $fh, '>', "$_.fastQ" or die $!;
            $_ => $fh;
        } qw[ TTGT GGTT ACCT other ];

        until( eof() ) {
            my @lines   = map scalar <>, 1 .. 4;
            my $barcode = substr $lines[1], 0, 9;
            my $tag     = substr $barcode, 3, 4;
            print { $outFHs{ $tag } // $outFHs{ other } } @lines;
        }

        __END__
        usage: thisScript theBigfile.fastQ
        ## outputs to TTGT.fastQ GGTT.fastQ ACCT.fastQ
        ## Unrecognised records are put into "other.fastQ"


Re: Deconvoluting FastQ files
by frozenwithjoy (Priest) on Aug 06, 2012 at 07:01 UTC
    You should check out the FASTX Toolkit. In it there is a very complete barcode splitting script (written in Perl) in addition to many other useful tools. You can download pre-compiled binaries or the source here. If you want to subsequently trim the barcodes, you can use the fastx_trimmer.
      Thank you frozenwithjoy. With FastQ on galaxy I need to trim the first three letters of my records to be able to use the barcode splitting function. I haven't tried the stand-alone version yet. I will give that a go once I can get it to work on my computer. These three letters are important for my analysis, so I am not entirely sure if I can use FastX's barcode splitter tool. I am playing around with the galaxy version of it at present.
        I took a look at fastx_barcode_splitter.pl and I think I've figured out a solution. I haven't tested it, but if you change line 161 from:
        unless $barcode =~ m/^[AGCT]+$/;
        to:
        unless $barcode =~ m/^[AGCTN]+$/;
        then you should be able to prefix your barcodes w/ 3 N's as long as you set --mismatches to at least 3 on the command line when running the script.
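        If I remember the FASTX Toolkit correctly, fastx_barcode_splitter.pl reads its barcodes from a tab-separated file passed via --bcfile; with the N-prefix trick described above, that file might look something like this (the replicate names here are made up for illustration):

```
rep1	NNNTTGT
rep2	NNNGGTT
rep3	NNNACCT
```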

        One caveat is that you will want to toss out any reads that have any Ns in the first X bases (where X = 3+ barcode length). Have you run FastQC? If so, this will tell you the per base N content. It probably won't be an issue if you've already done preliminary filtering based on Illumina's Y/N flags (assuming Illumina sequencing, of course).

        Also, (depending on your computer, of course) I suspect fastx_barcode_splitter.pl will run a lot faster at the CLI than on Galaxy (at least if you are using the public galaxy server).

        Edit: to avoid the potential problem w/ Ns, just use some other non-nucleotide character!

Re: Deconvoluting FastQ files
by ww (Archbishop) on Aug 06, 2012 at 08:02 UTC
    If production is your goal, frozenwithjoy's suggestion may be just what you need.

    If this is (as it appears) schoolwork (homework?), then you need to understand that this is NOT 'code-a-matic.'

    We'll be pleased to help you learn; you need merely show that you've made a good faith effort to solve your problem. In this case, that means: post your code and tell us how it fails, or post an algorithm (or pseudo-code) where you can't work out the syntax.

    You've outlined a fairly ambitious project for a 'complete newbie in perl,' so -- in case you're stuck on which of Perl's capabilities will help you here -- consider:

    • Do you know how to get the data from the "main file" into a script? If not, perldoc -f while will be helpful. Hint: given the size of your data file, you'll probably want to read it line by line.
    • Consider pushing each line into a temporary cache as it's read; then test to see if it satisfies some criterion for being line 2. If not, read the next line, and see if it's line 2. If so, test its first 9 chars with regexen (lottsa' reading here: perldoc perlretut and company) and, if those match the characteristics for replicate 1, 2 or 3, stash the cached line, the line with the match and the next two lines into the appropriate array, say @rep1, @rep2 or @rep3. (Hint: set a flag when you find a match for any target replicate, and use it together with the ++ increment operator -- see perldoc perlop -- to know when you've pushed all four lines of the record.)
    • wash, rinse, repeat...
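    As a rough, untested sketch of the caching idea above (counting lines rather than pattern-matching to find line 2, which is simpler because FastQ records are always exactly 4 lines; the sample data is record 1 from the question):

```perl
use strict;
use warnings;

my ( @rep1, @rep2, @rep3 );
my @cache;    # temporary store for the current record's lines

while ( my $line = <DATA> ) {
    push @cache, $line;
    next unless @cache == 4;            # wait until a full 4-line record is cached

    my $tag = substr $cache[1], 3, 4;   # characters 4-7 of the sequence line
    if    ( $tag eq 'TTGT' ) { push @rep1, @cache }
    elsif ( $tag eq 'GGTT' ) { push @rep2, @cache }
    elsif ( $tag eq 'ACCT' ) { push @rep3, @cache }
    @cache = ();
}

print scalar(@rep1), " lines cached for replicate 1\n";   # prints "4 lines cached for replicate 1"

__DATA__
@DJB775P1:314:C0K9AACXX:3:1101:1597:2067 1:N:0:
AGTTTGTATAGCACGCGCGCCAGCAGCTGCAGCAGGTGCCCGGGCTGCTGG
+
CCCFFFFFHHHHHJJJJJJJIJJJJJIJIJJJIJJJFGIJJIJHGFFFFEE
```

Note that for a 34GB file you would print each record to its output file as soon as it is classified (as BrowserUk's script does), rather than accumulating everything in arrays.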

    My suspicion is that working out an appropriate set of regular expressions (there's a broad hint in the word "set" and a part of one of many possible solutions next) will be your biggest challenge, so...

    my (@rep1, @rep2, @rep3);

    my $prefix  = qr/[ACTG]{3}/;
    my $rep1    = qr/TTGT/;
    my $rep2    = qr/GGTT/;
    my $rep3    = qr/ACCT/;
    my $postfix = qr/[ACTG]{2}/;

    while (my $line = <DATA>) {
        if ($line =~ /^$prefix $rep1 $postfix/x ) {
            push @rep1, $line;   # ignoring, for regex instruction,
                                 # the need to push your cached line, etc...
        } elsif ($line =~ /^$prefix $rep2 $postfix/x ) {
            ....

    There may be a way around this line-by-line approach. If you can absolutely count on "+" as the entire content of the third line of each record, you could use that fact as part of an approach to reading your "main file" record-by-record -- but that would be an additional complexity. Your addendum does, however, suggest an approach.

    So, my suggestion is -- try this, if you're working on homework... and come back when you get stuck, with code and details about the shortcomings of that code.

    And BTW, welcome to the Monastery.
      I was hoping to get an idea where I should focus my reading about perl, but obviously I am not expecting a code-o-matic solution. It's not quite a homework. I am more interested in the biological question rather than the coding part which I know I'm not very good at.
        That's fine... and good for you. ++ The hope that that was your aim was my reason for including the code snippet and the links to a few relevant docs.
Re: Deconvoluting FastQ files
by Kenosis (Priest) on Aug 06, 2012 at 19:49 UTC

    Here's an option that uses Tie::File to bind the fastq records' file to an array for processing. The files will automatically close when my %FHs falls out of scope. My thanks to BrowserUk for the elegant file handle/hash routine.

    use Modern::Perl;
    use Tie::File;

    {
        tie my @fastqLines, 'Tie::File', 'records.fastq', recsep => "\n"
            or die $!;

        my %FHs = map {
            open my $fh, '>', "$_.fastq" or die $!;
            $_ => $fh
        } qw[ TTGT GGTT ACCT ];

        for ( my $i = 0 ; $i < scalar @fastqLines ; $i += 4 ) {
            $fastqLines[ $i + 1 ] =~ /^...(.{4})/
                and print { $FHs{$1} } @fastqLines[ $i .. $i + 3 ];
        }

        untie @fastqLines;
    }

    Update: Don't try this at home, as the OP's 34G file is much too large for Tie::File to efficiently handle.

      I hate to be the bearer of bad tidings here, but if you'd ever tried using Tie::File on a 34GB file, you'd never be suggesting it to others. It would take weeks to complete.



        Oh, this is sad. I just had the script work on a 1G+ file, and extrapolated the time to about 17hrs of processing for a 34G file. It's not weeks, but it's also not practical. Didn't realize that Tie::File was so inefficient with large files.

        Good call to point this out, BrowserUK! Will place an "Update:" in the posting.

      Just to make the point about the inappropriateness of Tie::File for this:

      1. Using normal IO on a 400MB fastQ file takes 20 seconds:

        And uses 1.3 MB of RAM; performs 96e3 reads and 97e3 writes; and consumes 42 billion clock cycles.

        Giving a projected runtime for the OP's 34GB file of 30.8 minutes.

      2. Using Tie::File on that same 400MB file takes 1340 seconds:

        And uses 601MB of RAM; performs 10000e3 reads and 95e3 writes; and consumes 3,182 billion clock cycles.

        Giving a projected runtime for the OP's 34GB file of 34 hours!

      Tie::File has its uses. This is not one of them.



        I'll remember:

        my %mindHash = (
            'Tie::File + Large File' => '"Run Away!" (King Arthur, Monty Python)'
        );
Re: Deconvoluting FastQ files
by snakebites (Initiate) on Aug 06, 2012 at 06:40 UTC
    I must add that lines 2 and 4 have a length of 51 characters in each record. The length of line 1 is also the same for each record.