Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Re: concatenating identical sequences

by Limbic~Region (Chancellor)
on Oct 04, 2011 at 15:51 UTC ( #929585=note: print w/ replies, xml ) Need Help??

in reply to concatenating identical sequences

How big are the files you need to work with. It should be fairly trivial to do this using a hash of arrays if everything fits in memory. Assume for a second you had a function that could fetch the next record (mutli-line or not) as well as the id. It would look something like this:

my %data; while (my $rec = fetch_record($fh)) { my $id = $rec->{id}; push @{$data{$id}}, $rec->{sequence}; } for my $id (keys %data) { print "$id "; print "$_\n" for @{$data{$id}}; }
Alternatively, if you can't afford to fit the entire file in memory, you could still use this technique by storing the file offset and not the actual sequence. This will require more IO with tell and seek but should allow the same simplicity in the code.

One last alternative would be to re-write the file merging all the rows for a record on one line. Next, sort the file so duplicate IDs are adjacent and then it should be straight forward to merge them. Since it appears each row is fixed length, recreating the original structure from a single line should be straight forward.

Cheers - L~R

Comment on Re: concatenating identical sequences
Download Code

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://929585]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (8)
As of 2014-07-29 05:32 GMT
Find Nodes?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:

    Results (211 votes), past polls