http://www.perlmonks.org?node_id=741239

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have the following dataset, where each DNA (ACGT) string has its corresponding values. The values come in groups and are TAB separated. Given a string of length K, there will be K groups. Each group contains 4 values, which correspond to A, C, G, T respectively.
AGAC <TAB> 9 -29 -39 -37 <TAB> 27 -28 -39 -37 <TAB> 26 -27 -39 -37 <TAB> 27 -27 -39 12
What I want to do is to extract the corresponding base value of the given DNA string. Hence with the given string above the desired output is:
$VAR = [9,-39, 26, -27];
Note that the tag length may be greater than four (up to 100 bp). Is there a fast way to achieve this? There are millions of such lines.

---
neversaint and everlastingly indebted.......

Re: Picking up Values By Group
by BrowserUk (Pope) on Feb 04, 2009 at 12:07 UTC

    Mixing tab delimiters with space-delimited data is a really bad idea, and if you have any choice in the matter, you should change it.

    On the basis that you don't have the choice, the following should work, but realise that the tabs I've embedded in the data will likely have been corrupted in the process of upload and download, and the wrapping etc, that PM does to code:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    my %data;
    while( <DATA> ) {
        # Split on tabs, then trim surrounding whitespace from each field.
        my( $str, @values ) = map{ s[^\s+|\s+$][]g; $_ } split "\t";
        $data{ $str } = [ map [ split ' ' ], @values ];
    }
    pp %data;

    my @output;
    while( my( $key, $valueRef ) = each %data ) {
        my @required;
        for my $c ( 0 .. length( $key ) - 1 ) {
            # Pick the value for the base at position $c from group $c.
            push @required, $valueRef->[ $c ][ index "ACGT", substr $key, $c, 1 ];
        }
        push @output, \@required;
    }
    pp \@output;

    __DATA__
    AGAC	9 -29 -39 -37	27 -28 -39 -37	26 -27 -39 -37	27 -27 -39 12
    ACGT	1 -2 3 -4	5 -6 7 -8	9 -10 11 -12	13 -14 15 -16

    Output:

    c:\test>junk6
    (
      "AGAC",
      [
        [9, -29, -39, -37],
        [27, -28, -39, -37],
        [26, -27, -39, -37],
        [27, -27, -39, 12],
      ],
      "ACGT",
      [
        [1, -2, 3, -4],
        [5, -6, 7, -8],
        [9, -10, 11, -12],
        [13, -14, 15, -16],
      ],
    )
    [[9, -39, 26, -27], [1, -6, 11, -16]]

    This is the same, but I've substituted the text '<TAB>' for the tab character, which should make it easier to try:
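The listing for the '<TAB>' variant did not survive here. As a minimal sketch, assuming the only change is splitting on the literal text '<TAB>' instead of "\t", it might look something like this (the @lines array and its inline data are stand-ins for reading the real file):

```perl
use strict;
use warnings;

# Assumption: same logic as the tab-based listing; only the delimiter changes.
# The literal text '<TAB>' stands in for the tab character.
my @lines = (
    'AGAC <TAB> 9 -29 -39 -37 <TAB> 27 -28 -39 -37 <TAB> 26 -27 -39 -37 <TAB> 27 -27 -39 12',
    'ACGT <TAB> 1 -2 3 -4 <TAB> 5 -6 7 -8 <TAB> 9 -10 11 -12 <TAB> 13 -14 15 -16',
);

my @output;
for my $line ( @lines ) {
    # Split on the visible marker, swallowing the whitespace around it.
    my( $str, @groups ) = split /\s*<TAB>\s*/, $line;

    my @required;
    for my $c ( 0 .. length( $str ) - 1 ) {
        # Group $c holds the A, C, G, T values for position $c of the string.
        my @values = split ' ', $groups[ $c ];
        push @required, $values[ index 'ACGT', substr $str, $c, 1 ];
    }
    push @output, \@required;
}

# Prints one line of picked values per input line.
print join( ', ', @$_ ), "\n" for @output;
```

Because the marker is visible and does not match "\s", this version should survive copy and paste where the tab-based one may not.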

    Like I say, if you have any influence over the file format, change the tab delimiters to something visible that does not match "\s".


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I would use a lookup hash for the indices instead of calling index repeatedly.
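A minimal sketch of what such a lookup hash might look like. The names %index_of and pick_values are made up for illustration, and the groups are assumed to be already parsed into an array of [A, C, G, T] arrays:

```perl
use strict;
use warnings;

# Map each base to its position within a group of four values.
my %index_of = ( A => 0, C => 1, G => 2, T => 3 );

# Hypothetical helper: $groups is an array ref of [A, C, G, T] value groups,
# one group per character of $str.
sub pick_values {
    my( $str, $groups ) = @_;
    my $c = 0;
    return [ map { $groups->[ $c++ ][ $index_of{ $_ } ] } split //, $str ];
}

my $picked = pick_values(
    'AGAC',
    [ [ 9, -29, -39, -37 ], [ 27, -28, -39, -37 ], [ 26, -27, -39, -37 ], [ 27, -27, -39, 12 ] ],
);
print "@$picked\n";
```

For the sample line from the question this picks 9, -39, 26 and -27.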


      holli

      When you're up to your ass in alligators, it's difficult to remember that your original purpose was to drain the swamp.
        instead of calling index repeatedly.

        The tradeoff is:

        • scanning a 4 character string for a single character.
        • hashing a single character to a 32-bit hash and then performing a modulo 4 operation upon it.

        Which actually favours the former:

        #! perl -slw
        use strict;
        use Benchmark qw[ cmpthese ];

        # base => position within a group
        our %lookup = ( A=>0, C=>1, G=>2, T=>3 );
        our $input = 'ACGT' x 1000;

        cmpthese -1, {
            index => q[
                our( %lookup, $input );
                my $n;
                $n = index "ACGT", $_ for split '', $input;
            ],
            hash => q[
                our( %lookup, $input );
                my $n;
                $n = $lookup{ $_ } for split '', $input;
            ],
        };

        __END__
        c:\test>junk5
                Rate  hash index
        hash   107/s    --  -21%
        index  135/s   27%    --

Re: Picking up Values By Group
by Anonymous Monk on Feb 04, 2009 at 16:27 UTC
    How fast is "fast"?

    i.e. What code have you written? How fast does it run? How much faster does it need to run?
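One way to put a number on "how fast does it run" is to time the extraction over a known number of lines. A rough sketch, assuming a single made-up test line repeated $count times rather than the real file, and using the core Time::HiRes module:

```perl
use strict;
use warnings;
use Time::HiRes qw[ time ];

# Hypothetical test line, processed repeatedly to simulate a larger input.
my $line  = "AGAC\t9 -29 -39 -37\t27 -28 -39 -37\t26 -27 -39 -37\t27 -27 -39 12";
my $count = 100_000;

my $start = time;
my $last;
for ( 1 .. $count ) {
    my( $str, @groups ) = split /\t/, $line;
    my $c = 0;
    # For each base, take the matching value from that position's group.
    $last = [ map { ( split ' ', $groups[ $c++ ] )[ index 'ACGT', $_ ] } split //, $str ];
}
my $elapsed = time - $start;

# Guard against a zero elapsed time on very fast runs.
printf "%d lines in %.3f s (%.0f lines/s)\n",
    $count, $elapsed, $count / ( $elapsed || 1e-9 );
```

The figure will vary with the machine and with the real string lengths, so it is only useful as a baseline to compare alternatives against.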