Re: Deconvolutinng FastQ files

by ww (Bishop)
on Aug 06, 2012 at 08:02 UTC

in reply to Deconvolutinng FastQ files

If production is your goal, frozenwithjoy's suggestion may be just what you need.

If this is (as it appears) schoolwork (homework?), then you need to understand that this is NOT 'code-a-matic.'

We'll be pleased to help you learn; you need merely show that you've made a good-faith effort to solve your problem. In this case, that means: post your code and tell us how it fails, or post an algorithm (or pseudo-code) where you can't work out the syntax.

You've outlined a fairly ambitious project for a 'complete newbie in perl,' so -- in case you're stuck on which of Perl's capabilities will help you here -- consider:

  • Do you know how to get the data from the "main file" into a script? If not, perldoc -f while will be helpful. Hint: given the size of your data file, you'll probably want to do so line by line.
  • Consider pushing each line into a temporary cache as it's read; then test to see if the line that follows satisfies some criterion for being line 2 of a record. If not, read the next line, and see if that one is line 2. If so, test its first 9 chars with regexen (lotsa reading here: perldoc perlretut and company). If those match the characteristics for replicate 1, 2 or 3, stash the cached line, the line with the match and the next two lines into the appropriate array, say, @rep1, @rep2 or @rep3. (Hint: set a flag when you find a match for any target replicate, and use it together with the ++ increment operator, documented in perldoc perlop, to know when you've pushed all four lines of the record.)
  • wash, rinse, repeat...
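The steps in the list above can be sketched roughly as follows. This is only one possible reading of them, not a finished solution: the sub name split_replicates, the %tag hash and the "looks like a sequence line" test (/^[ACGTN]+$/ preceded by a line starting with "@") are assumptions made for illustration, and a countdown stands in for the flag-plus-++ idea.

```perl
use strict;
use warnings;

# Hypothetical 9-character barcodes: 3 bases, a 4-base tag, 2 bases.
my %tag = (
    rep1 => qr/^[ACGT]{3}TTGT[ACGT]{2}/,
    rep2 => qr/^[ACGT]{3}GGTT[ACGT]{2}/,
    rep3 => qr/^[ACGT]{3}ACCT[ACGT]{2}/,
);

# split_replicates($fh) returns a hash ref of array refs, one per replicate.
sub split_replicates {
    my ($fh) = @_;
    my %out = map { $_ => [] } keys %tag;
    my $cached;         # the previously read line (a candidate header)
    my $target;         # replicate currently being collected, if any
    my $remaining = 0;  # lines of the current record still to be copied

    while (my $line = <$fh>) {
        if ($remaining) {
            # Inside a matched record: copy the "+" line, then the quality line.
            push @{ $out{$target} }, $line;
            $remaining--;
        }
        elsif (defined $cached && $cached =~ /^@/ && $line =~ /^[ACGTN]+$/) {
            # Assumed criterion for "line 2"; now test the leading 9 characters.
            for my $rep (sort keys %tag) {
                next unless $line =~ $tag{$rep};
                push @{ $out{$rep} }, $cached, $line;
                ($target, $remaining) = ($rep, 2);
                last;
            }
        }
        $cached = $line;    # every line becomes the cache for the next one
    }
    return \%out;
}
```

Called as my $reps = split_replicates($fh); with a filehandle opened on the main file, this leaves the lines of each matching record in $reps->{rep1}, $reps->{rep2} and $reps->{rep3}. Only one line is held in memory at a time apart from the collected records.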

My suspicion is that working out an appropriate set of regular expressions (there's a broad hint in the word "set" and a part of one of many possible solutions next) will be your biggest challenge, so...

    my (@rep1, @rep2, @rep3);
    my $prefix  = qr/[ACTG]{3}/;
    my $rep1    = qr/TTGT/;
    my $rep2    = qr/GGTT/;
    my $rep3    = qr/ACCT/;
    my $postfix = qr/[ACTG]{2}/;

    while (my $line = <DATA>) {
        if ($line =~ /^$prefix $rep1 $postfix/x ) {
            push @rep1, $line;   # ignoring, for regex instruction,
                                 # the need to push your cached line, etc.
        }
        elsif ($line =~ /^$prefix $rep2 $postfix/x ) {
            ....
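To try the regex pieces on their own, without any file handling, something like the following may help. It reuses the prefix, tag and postfix patterns from the snippet; the classify sub and the %core hash are hypothetical names introduced here, and looping over a hash is just one way to treat the patterns as a "set".

```perl
use strict;
use warnings;

my $prefix  = qr/[ACTG]{3}/;
my $postfix = qr/[ACTG]{2}/;

# Replicate name => core tag, copied from the patterns in the snippet.
my %core = (rep1 => qr/TTGT/, rep2 => qr/GGTT/, rep3 => qr/ACCT/);

# classify($seq) returns 'rep1', 'rep2' or 'rep3', or undef if nothing matches.
sub classify {
    my ($seq) = @_;
    for my $rep (sort keys %core) {
        my $tag = $core{$rep};
        # Under /x the spaces inside the pattern are ignored; they only aid reading.
        return $rep if $seq =~ /^ $prefix $tag $postfix/x;
    }
    return;
}
```

Because the pattern is anchored with ^ and spans exactly 3 + 4 + 2 characters, only the first 9 characters of the sequence line decide the outcome; anything after them is ignored.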

There may be a way around this line-by-line approach. If you can absolutely count on "+" as the entire content of the third line of each record, you could use that fact as part of an approach to reading your "main file" record-by-record -- but that would be an additional complexity. Your addendum does, however, suggest an approach.
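One way a record-by-record read might look, offered as a sketch rather than as the approach hinted at above: it simply pulls four lines at a time and uses the bare "+" on line 3 as a sanity check. The sub name next_record is hypothetical, and the code assumes every record is exactly four lines with no wrapped sequences.

```perl
use strict;
use warnings;

# next_record($fh) returns an array ref [header, sequence, '+', quality],
# or undef at end of file. A trailing partial record is silently dropped.
sub next_record {
    my ($fh) = @_;
    my @rec;
    for (1 .. 4) {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        push @rec, $line;
    }
    # Assumption: line 1 starts with "@" and line 3 is exactly "+".
    die "Unexpected record layout near '$rec[0]'\n"
        unless $rec[0] =~ /^@/ && $rec[2] eq '+';
    return \@rec;
}
```

A loop such as while (my $rec = next_record($fh)) { ... } then sees whole records, so the sequence is always $rec->[1] and no cache or flag is needed. The trade-off is that a single stray line anywhere in the file makes the layout check die.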

So, my suggestion is -- try this, if you're working on homework... and come back when you get stuck, with code, and with details about the shortcomings of that code.

And BTW, welcome to the Monastery.

Replies are listed 'Best First'.
Re^2: Deconvolutinng FastQ files
by snakebites (Initiate) on Aug 07, 2012 at 13:40 UTC
    I was hoping to get an idea of where I should focus my reading about perl, but obviously I am not expecting a code-o-matic solution. It's not quite homework. I am more interested in the biological question than in the coding part, which I know I'm not very good at.
      That's fine... and good for you. ++ The hope that that was your aim was my reason for including the code snippet and the links to a few relevant docs.
