PerlMonks
Lower-casing Substrings and Iterating Two Files together

by neversaint (Deacon)
on Dec 27, 2008 at 14:14 UTC (#732789=perlquestion)

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have two files as input:

data1.txt
>seq1
GGTACACAGAAGCCAAAGCAGGCTCCAGGCTCTGAGCTGTCAGCACAGAGACCGAT
>seq2
GTCTCTGTCTCTAAAATAAATAAACATTAAAAAAATTTTAAAAGAAAAGATTCTCTCC
data2.txt (this is called "hard masked")
>seq1
GGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNT
>seq2
GTCTCTGTCTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATTCTCTCC
My aim is to generate output by lower-casing data1.txt on the position of "N" that appears in data2.txt, yielding:
>seq1
GGTacacagaagccaaagcaggctccaggctctgagctgtcagcacagagaccgaT
>seq2
GTCTCTGTCTCtaaaataaataaacattaaaaaaattttaaaagaaaaGATTCTCTCC
Typically, in a real situation there are ~10^5 sequences, each 10^5 to 10^7 characters long (up to 3.5 GB in file size).
I have a script that slurps the two files and then loops over the sequence index. My approach is time-consuming and memory-inefficient, so I look for enlightenment from my fellow monks on how to perform this task more efficiently.
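A minimal sketch of the kind of per-position loop described above (the original script is not shown; this is an illustrative reconstruction, with the data inlined and the mask built programmatically):

```perl
use strict;
use warnings;

# Hypothetical reconstruction of the slurp-then-loop approach: hold the
# sequence and mask in memory and test every character position one by one.
my $seq  = 'GGTACACAGAAGCCAAAGCAGGCTCCAGGCTCTGAGCTGTCAGCACAGAGACCGAT';
my $mask = 'GGT' . ( 'N' x 52 ) . 'T';    # same shape as the data2.txt line

for my $pos ( 0 .. length($seq) - 1 ) {
    substr( $seq, $pos, 1 ) = lc substr( $seq, $pos, 1 )
        if substr( $mask, $pos, 1 ) eq 'N';
}
print $seq, "\n";    # GGTacacagaagccaaagcaggctccaggctctgagctgtcagcacagagaccgaT
```

Doing a substr fetch, a comparison, and a possible substr store for every single character is what makes this approach slow on multi-gigabyte inputs.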

Update: Running time comparison between BrowserUk's approach and mine is shown below.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Lower-casing Substrings and Iterating Two Files together
by BrowserUk (Pope) on Dec 27, 2008 at 14:31 UTC

    If you bitwise-or (|) an uppercase letter with a space (assuming Latin-1/ASCII files), it becomes lowercase:

    print 'ACGT' | '    ';;
    acgt

    So, if you translate all the 'N's in your mask to spaces and then bitwise or the sequence and the mask, it will achieve your goal very efficiently:

    $s = 'GGTACACAGAAGCCAAAGCAGGCTCCAGGCTCTGAGCTGTCAGCACAGAGACCGAT';;
    $m = 'GGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNT';;
    ( $mm = $m ) =~ tr[N][\x20];;
    print $mm;;
    GGT                                                    T
    print $s | $mm;;
    GGTacacagaagccaaagcaggctccaggctctgagctgtcagcacagagaccgaT
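A short aside on why OR-ing with a space works (ASCII assumed): an uppercase letter and its lowercase partner differ only in bit 0x20, and 0x20 is exactly the code for the space character.

```perl
use strict;
use warnings;

# 'A' is 0x41, 'a' is 0x61: they differ only in bit 0x20, which is the
# space character. OR-ing with a space sets that bit, lowercasing the letter.
printf "'%s' = 0b%08b\n", $_, ord $_ for 'A', 'a', ' ';
print 'ACGT' | '    ', "\n";    # acgt
```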

    Which makes your entire program (excluding the unmentioned fact that your files may be in FASTA format):

    #! perl -slw
    use strict;

    open SEQ,  '<', 'data1.dat' or die $!;
    open MASK, '<', 'data2.dat' or die $!;

    while( my $seq = <SEQ> ) {      ## Read a sequence
        my $mask = <MASK>;          ## And the corresponding mask
        $mask =~ tr[N][ ];          ## Ns => spaces
        print $seq | $mask;         ## bitwise-OR them and print the result
    }

    close SEQ;
    close MASK;

    Redirect the output to a third file and you're done.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Were I maintaining such code I would definitely want a comment along the lines of:
      # ord(" ") = 32 and flipping that bit lower cases ASCII.
      Otherwise++.
        ord(" ") = 32

        If it were my code, I think ord(" ") = 0x20 or even ord(" ") = 0b0010_0000 might be clearer, but that's essentially why I rarely comment the code I post. We tend to write comments that mean something to us, rather than others.

        My anticipation (hope!) is that what I post will rarely be used verbatim in other people's programs. Indeed, I quite often leave the code deliberately shy of being a complete solution to the OP's question, so that they will have to adapt it to their use. The intent is that anyone basing their code on my postings will first play with it a little to ensure that they understand it, and then adapt it to their needs.

        At that point they can adjust the coding style to their preferences or local requirements, and add whatever comments they think most appropriate.


      Dear BrowserUK,
      Your magical approach is at least 6980x faster than mine!

      My approach:
      191.52user 153.47system 5:49.44elapsed 98%CPU
      BrowserUk Perl's approach (modified to handle FASTA):
      open SEQ,  '<', $ARGV[0] or die $!;   # plain
      open MASK, '<', $ARGV[1] or die $!;   # hardmask

      while ( my $seq = <SEQ> ) {     ## Read a sequence
          my $mask = <MASK>;          ## And the corresponding mask
          if ( $mask =~ /^>/ ) {
              print $seq;
          }
          else {
              $mask =~ tr[N][ ];      ## Ns => spaces
              print $seq | $mask;     ## bitwise-OR them and print the result
          }
      }

      close SEQ;
      close MASK;
      Takes only this much time:
      0.45user 0.07system 0:05.59elapsed 9%CPU
      This is tested on 12MB dataset, and the breakdown of sequence length is this:
      #Seq_name  #Seq_len
      2-micron       6318
      MT            85779
      I            230208
      VI           270148
      III          316617
      IX           439885
      VIII         562643
      V            576869
      XI           666454
      X            745745
      XIV          784333
      II           813178
      XIII         924429
      XVI          948062
      XII         1078175
      VII         1090946
      XV          1091289
      IV          1531918
      Update: These test datasets (full and hardmasked) can be downloaded here.

      ---
      neversaint and everlastingly indebted.......
Re: Lower-casing Substrings and Iterating Two Files together
by bruno (Friar) on Dec 27, 2008 at 14:24 UTC
    From your posts, it seems that you are working with genome sequences. I suggest you give BioPerl a try. It might be slower, but you'll be able to do a lot with very little coding.

    In particular, they have a Bio::Seq::LargePrimarySeq module to deal with large sequences such as the ones that you are working with. Give it a go, it's funnier than reinventing wheels poorly.

      Funnier it may be, but nowhere near as educational ... I, for one, have learnt something from BrowserUk's response.

      A user level that continues to overstate my experience :-))
        I'm very happy that you learnt something! I also didn't know about the bitwise operator. My advice stands though; if you are working with large datasets of genome sequences, to me it's quite clear that BioPerl is the right tool for the job.

        If, however, you are just doing exercises to practice your Perl skills, then it's fine to roll your own solution, but it seemed to me that neversaint was trying to do some actual work.

      I suggest you give BioPerl a try. It might be slower, but you'll be able to do a lot with very little coding.

      That statement encapsulates exactly what is wrong with BioPerl. The very essence of dealing with Genome Sequences is that they are huge. Individually, the algorithms used are for the most part very simple. The problem is that they need to be applied to huge volumes of data, and that means that they take a long time unless they are crafted from the get-go to be as efficient as possible.

      The standard response to concerns of efficiency here (and more widely) is that programmer efficiency is more important than program efficiency, but this meme completely ignores the wider picture. User time is just as important, and arguably more so, than programmer time.

      The big value-add (ROI multiplier) of code, is that once constructed correctly, it can be reused over and over, by many users, for no extra cost beyond that of the time those users spend waiting for it to complete. Therefore, it is positive investment for the programmer to spend some extra time optimising the code he produces--where that code will be re-used, especially for algorithms and processes that will be used for large-scale processing. The extra time (and costs) the programmer spends making the algorithms efficient (once they are correct), amortises many-fold over the lifetime of that code.

      In particular, they have a Bio::Seq::LargePrimarySeq module to deal with large sequences such as the ones that you are working with. Give it a go, it's funnier than reinventing wheels poorly.

      First, there is an assumption in that statement that the algorithms and implementations in the (huge; over 2000 packages) BioPerl suite, are beautifully rounded wheels. My (very limited) experience of using a few of them is that not all of them are.

      They may work, but they often seem to have inconvenient interfaces--requiring glue code to allow the outputs of one module to be used as inputs to the next, even when those modules are both a part of the overall BIO suite. And they are often not written with a view to efficiency--which given the nature and scale of the problems they are designed to address, should be a primary goal.

      For example, for the user to obtain the actual data for a given sequence from a fasta format file, try chasing through the number of times that sequence gets copied--from the point where it is read from the file, to the point where it is actually delivered to the user for use. Much of that copying could be avoided by passing references to scalars rather than the scalars themselves.
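The cost difference is easy to see in a small sketch (the subroutine names here are mine for illustration, not from BioPerl):

```perl
use strict;
use warnings;

# Passing a large sequence by value copies the whole string when it is
# unpacked from @_; passing a reference copies only a small scalar.
sub gc_count_by_copy {
    my ($seq) = @_;              # this assignment copies the entire string
    return $seq =~ tr/GCgc//;    # tr in scalar context counts matches
}

sub gc_count_by_ref {
    my ($seq_ref) = @_;          # only the reference is copied
    return $$seq_ref =~ tr/GCgc//;
}

my $genome = 'ACGT' x 1_000_000;             # a 4 MB string
print gc_count_by_ref( \$genome ), "\n";     # 2000000 -- same answer, no 4 MB copy
```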

      The sheer scale of the Bio:: namespace is the other problem. You said: "you'll be able to do a lot with very little coding.", but that only holds true once you have discovered the appropriate modules and methods to use. And given the 2000+ component modules, that is a steep learning curve for a full-time programmer--let alone your average genomic biologist, for whom the code is just a means to an end.

      For example, you've suggested Bio::Seq::LargePrimarySeq, but why that module rather than, say, Bio::SeqIO::fasta or Bio::SeqIO::largefasta? And where is the breakpoint at which you should transition from one to the other?

      For the OP's question, which needs to deal with the sequences in the input files one at a time, the overhead of using the module you've recommended--taking those files and splitting them into a bunch of temporary files, each containing one sequence--is exactly that: unnecessary, pure overhead.

      I realise that the Bio suite contains advanced methods for parallelising (clustering) solutions, and that for labs who deal with large scale genomic processing on a regular basis, the investment and set-up costs to allow the utilisation of those techniques will be far more cost effective than optimising the serial processing. But for many of the people working in this area, those setup and hardware investments are a non-starter. They have to do what they need to do, with whatever equipment they have to hand. And the more efficiently they can process their data, the less time they'll spend waiting to see if their experiments are achieving results, allowing them to try more ideas and variations.

      For well-known processing--payroll, inventory etc.--of large scale data volumes done on a regular basis, it is usually possible to schedule that processing to run in the dead time of night, or over weekends, and whether it takes 2 hours or 20 doesn't really matter. But for the kind of experimental--what if I try this; or this; or this; or this;--investigations that typify the work of small players in new fields like genomics, the time taken to see the results of each investigation often comes straight out of the overall time allotment. And time spent waiting--whether trying to work out what combination of modules and methods to use; or for the results to be produced--is just dead time.

      The bottom line is that if the basic algorithms and implementations are written to operate efficiently, they will also save time (and money; eg. hardware costs), when used in clustered/parallel environments. So the failure to optimise the core algorithms and implementations of complex frameworks, is not hubris and laziness in the Perlish good-sense of those terms, but rather arrogance and negligence!


        There is little to argue about; most of what you've said I agree with.

        I agree, for example, that BioPerl is not a perfectly rounded wheel. It has rough edges, and some parts could use some love. But I also think that the solution is not to turn your back on it on the basis of poor performance or interface clunkiness alone.

        My general advice for anyone doing bioinformatics in Perl is: learn BioPerl. It's true that it's big, but that's because biology is big. But you don't have to know it all before you start using it, so the cognitive overhead that you talk about is really not that big.

        If you have a problem and, after a quick search, you find that there's a BioPerl module that solves it, chances are that it will be your last stop: the solution will most probably be correct and tested. But if you have performance issues, then you can start thinking of profiling and looking for alternative solutions, among which is contributing to the BioPerl module in question to try to make it faster.

        That concept above is (and I think that you'll agree with me here) essential for any open source project to move forward.

        As you may know, most of us doing research in computational biology are not programmers by training, and sometimes implement less-than-optimal algorithms or have poor programming style; this is evident when looking at BioPerl's code. We are, however, open to anyone (biologists or not) willing to make contributions and improvements to the code; many of the tasks in this area don't require any bio-knowledge.

Re: Lower-casing Substrings and Iterating Two Files together
by Corion (Pope) on Dec 27, 2008 at 14:19 UTC

    So where's your problem, exactly? Why not simply read one line from data1.txt, read one line from data2.txt, apply the mask, and then output the result? You'd keep going as long as there are lines in both files. Maybe if you showed some short and to-the-point code, we could tell you where to optimize your program.

Re: Lower-casing Substrings and Iterating Two Files together
by kirillm (Friar) on Dec 27, 2008 at 14:27 UTC

    How about reading one character from each of the files and, depending on the character read from data2.txt, printing the character from data1.txt either lowercased or unchanged? Something like this (no error checking etc., but you see the idea):

    open D1, 'data1.txt';
    open D2, 'data2.txt';

    my ($c1, $c2);
    while (read D2, $c2, 1) {
        read D1, $c1, 1;
        print $c2 eq 'N' ? lc($c1) : $c1;
    }

    close D2;
    close D1;

    This produces the desired output:

    >seq1
    GGTacacagaagccaaagcaggctccaggctctgagctgtcagcacagagaccgaT
    >seq2
    GTCTCTGTCTCtaaaataaataaacattaaaaaaattttaaaagaaaaGATTCTCTCC

    -- Kirill

Re: Lower-casing Substrings and Iterating Two Files together
by linuxer (Curate) on Dec 27, 2008 at 14:51 UTC

    My idea is based upon using index, rindex and substr.

    #!/usr/bin/perl
    use strict;
    use warnings;

    open SEQ,  '<', 'data1.dat' or die $!;
    open MASK, '<', 'data2.dat' or die $!;

    while ( my $seq = <SEQ> ) {
        my $mask   = <MASK>;
        my $start  = index( $mask, 'N' );
        my $length = rindex( $mask, 'N' ) - $start + 1;
        my $lc_seq = lc substr( $seq, $start, $length );
        substr( $seq, $start, $length, $lc_seq );
        print $seq;
    }

    Updates:

    1. Important notice from kirillm. My solution doesn't cover all possible cases.
    2. fixed calculation error
    3. see my other post for my updated code.

      Your proposal seems to assume that there will be one continuous run of N characters in the mask. What if there are non-N characters between $start and $start + $length?

        Very good question.

        I didn't know about those cases. I can only consider the cases mentioned in the examples.
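For instance (a hypothetical mask, not from the OP's data), with non-N characters between two runs of Ns, the index/rindex span lowercases the unmasked middle as well:

```perl
use strict;
use warnings;

# Hypothetical mask with a gap of unmasked bases between two runs of Ns.
my $seq  = 'GGTACACAGAAGT';
my $mask = 'GGTNNNAAANNNT';

# index/rindex span the first N through the last N, covering the gap too.
my $start  = index( $mask, 'N' );
my $length = rindex( $mask, 'N' ) - $start + 1;
substr( $seq, $start, $length, lc substr( $seq, $start, $length ) );

print $seq, "\n";   # GGTacacagaagT -- the CAG under 'AAA' should have stayed uppercase
```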

        Update: Here's an update that should handle the special case you mentioned, but I had to change the code and drop (r)index.

        #!/usr/bin/perl
        use strict;
        use warnings;

        open SEQ,  '<', 'data1.dat' or die $!;
        open MASK, '<', 'data2.dat' or die $!;

        while ( my $seq = <SEQ> ) {
            my $mask = <MASK>;
            while ( $mask =~ m/(N+)/g ) {
                my $length = length $1;
                my $start  = pos($mask) - $length;
                my $lc_seq = lc substr( $seq, $start, $length );
                substr( $seq, $start, $length, $lc_seq );
            }
            print $seq;
        }

        close MASK;
        close SEQ;
Re: Lower-casing Substrings and Iterating Two Files together
by eye (Chaplain) on Dec 27, 2008 at 19:52 UTC
    When I initially read this, I did not assume the sequences in "data2.txt" were always as long as the corresponding sequences in "data1.txt". Prior replies have assumed they are and produced suggestions that are more efficient than what I propose. If the assumption is not true, this can be done with regular expressions. This snippet uses the OP's "seq1" to demonstrate constructing search and replace strings from the "hard mask" provided.
    $seq  = 'GGTACACAGAAGCCAAAGCAGGCTCCAGGCTCTGAGCTGTCAGCACAGAGACCGAT';
    $mask = 'GGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNT';

    ($srch = $mask) =~ s/(N+)/($1)/g;
    $srch =~ tr/N/./;

    $cnt = 1;
    ($repl = $mask) =~ s/N+/" . lc(\$" . $cnt++ . ") . "/ge;

    print $srch, "\n", $repl, "\n\n", $seq, "\n";
    $seq =~ s/$srch/$repl/ee;
    print $seq, "\n";
    The output should look like:
    GGT(....................................................)T
    GGT" . lc($1) . "T

    GGTACACAGAAGCCAAAGCAGGCTCCAGGCTCTGAGCTGTCAGCACAGAGACCGAT
    GGTacacagaagccaaagcaggctccaggctctgagctgtcagcacagagaccgaT
    That said, if the aforementioned assumption does hold, I have to think that a search for efficiency should begin in the code that generated "data2.txt", if at all possible.

Node Type: perlquestion [id://732789]
Approved by Corion