sorting headers in a file


	PerlMonks


by utpalmtbi (Acolyte)

on Dec 13, 2013 at 06:42 UTC


utpalmtbi has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have an input file with many headers and sequences, such as:

>or3

agagatgatagat

>or10

aacctttagtag

>or1

gtatatatata

>or2

tactacatgagg

......

I want to sort the headers according to the number after "or". For the input above, the output file becomes:

>or1

gtatatatata

>or2

tactacatgagg

>or3

agagatgatagat

>or10

aacctttagtag

......

Please suggest. Thanks.


Replies are listed 'Best First'.
Re: sorting headers in a file by Corion (Patriarch) on Dec 13, 2013 at 07:26 UTC
There are many, many nodes here regarding how to best parse FASTA files. Most likely Super Search will turn up even more nodes. Have you looked through what BioPerl has to offer?
Re: sorting headers in a file by hdb (Monsignor) on Dec 13, 2013 at 07:10 UTC
For the sorting, the file needs to be small enough to be read into memory. If it is not, I would suggest using some kind of database. To split your data into individual records, you could set the special variable `$/` to the value `">or"`. This way, when you read your file, you get it in the chunks that you need. You would then retrieve the number at the beginning of each chunk (e.g. by regex), sort the chunks by that number, and print them.
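A minimal sketch of the approach described above, assuming the whole file fits in memory and every header has the form `>or<number>`. The sub name `sort_records` and the in-memory sample string are illustrative only and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: read ">or"-separated records from a filehandle and
# return [number, sequence] pairs sorted numerically by the header number.
sub sort_records {
    my ($fh) = @_;
    local $/ = ">or";                 # each read now ends at the next ">or"
    my @records;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                 # chomp removes the trailing ">or"
        # Leading number, then the sequence text with surrounding whitespace trimmed.
        next unless $chunk =~ /\A(\d+)\s*(.*?)\s*\z/s;
        push @records, [ $1, $2 ];
    }
    return sort { $a->[0] <=> $b->[0] } @records;
}

# Sample input in the format shown in the question (an assumption for the demo).
my $sample = ">or3\n\nagagatgatagat\n\n>or10\n\naacctttagtag\n\n"
           . ">or1\n\ngtatatatata\n\n>or2\n\ntactacatgagg\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print ">or$_->[0]\n\n$_->[1]\n\n" for sort_records($fh);
close $fh;
```

The first chunk read is just the leading `>or` separator, so it has no number and is skipped. To read a real file, open it by name instead of using the in-memory string.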
Re: sorting headers in a file by Lennotoecom (Pilgrim) on Dec 13, 2013 at 07:57 UTC
One of the ways to go:

```perl
$/ = '>or';
sub{$h{$1} = $2 if /(\d+)\n\n(\w+)/}->() foreach <DATA>;
foreach (sort { $a<=>$b } keys %h){
    print ">or$_\n\n$h{$_}\n\n";
}
__DATA__
>or3

agagatgatagat

>or10

aacctttagtag

>or1

gtatatatata

>or2

tactacatgagg
```

Output:

```
>or1

gtatatatata

>or2

tactacatgagg

>or3

agagatgatagat

>or10

aacctttagtag
```
Re^2: sorting headers in a file by hdb (Monsignor) on Dec 13, 2013 at 08:34 UTC
I would write `sub{$h{$1} = $2 if /(\d+)\n\n(\w+)/}->() foreach <DATA>;` as `/(\d+)\n+(\w+)/ and $h{$1} = $2 for <DATA>;` which I think is more readable. The modification of the regex makes it a little more robust with respect to the formatting of the data. You would also lose data if there are duplicate ">or..." bits or multiline sequences.
Re^3: sorting headers in a file by Lennotoecom (Pilgrim) on Dec 13, 2013 at 08:47 UTC
Thanks for the corrections. Your code is beautiful; gotta remember that construction.
Re: sorting headers in a file by 2teez (Vicar) on Dec 13, 2013 at 08:03 UTC
Hi utpalmtbi, I agree totally with the advice to use some kind of database, as mentioned by hdb. However, with your dataset one can get his/her hands dirty using a Perl hash and a kind of "modified" Schwartzian transform, like so:

```perl
use warnings;
use strict;

my %hash;
my $key;
while (<DATA>) {
    next if /^\s+$/;
    if (/^>/) {
        $key = $_;
    }
    else {
        $hash{$key} = $_;
    }
}

print map { $_->[0], $hash{ $_->[0] }, $/ }
  sort { $a->[1] <=> $b->[1] }
  map { [ $_, /(\d+$)/ ] } keys %hash;

__DATA__
>or3

agagatgatagat

>or10

aacctttagtag

>or1

gtatatatata

>or2

tactacatgagg
```

NOTE: I can't tell how this will do on larger datasets.

If you tell me, I'll forget. If you show me, I'll remember. If you involve me, I'll understand. --- Author unknown to me
Re^2: sorting headers in a file by hdb (Monsignor) on Dec 13, 2013 at 08:23 UTC
Your code relies on a number of implicit assumptions: Using a hash implies that the ">or..." bits are unique. If two or more identical headers appear, your code would lose data. You also assume that blank lines are of no significance. If there is a multiline sequence, you will only store the last line.
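One way to avoid both problems, sketched under assumptions not stated in the thread: keep the records in an array rather than a hash, so duplicate headers survive, and append each sequence line to the current record, so multiline sequences are joined. The sub name `parse_records` and the sample data are hypothetical, and this sketch still treats blank lines as insignificant.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA-like input line by line into a list of
# [header, number, sequence] records. Duplicate headers are kept as separate
# records, and multiline sequences are concatenated.
sub parse_records {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;                    # assumption: blank lines carry no meaning
        if ( $line =~ /^>/ ) {
            my ($n) = $line =~ /(\d+)\s*$/;          # trailing number of the header
            push @records, [ $line, defined $n ? $n : 0, '' ];
        }
        elsif (@records) {
            $records[-1][2] .= $line;                # append to the current sequence
        }
    }
    return @records;
}

# Made-up sample with a duplicate header and a two-line sequence.
my $sample = ">or3\nagag\natga\n>or1\ngtat\n>or3\ncccc\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @records = parse_records($fh);
close $fh;

for my $rec ( sort { $a->[1] <=> $b->[1] } @records ) {
    print "$rec->[0]\n$rec->[2]\n";
}
```

Both `>or3` records are retained in the output, and the first one carries the joined sequence `agagatga`. Headers without a trailing number are given the sort key 0 here, which is an arbitrary choice for the sketch.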
Re^3: sorting headers in a file by 2teez (Vicar) on Dec 13, 2013 at 08:36 UTC
"Your code relies on a number of implicit assumptions:..."

Assumptions that are safely OK considering the OP's dataset. That is why I said "..using your dataset.." (i.e. the OP's) in my previous reply. I don't suppose that dataset contains anything that breaks the assumptions you stated, unless the OP has told you otherwise privately.
Re^4: sorting headers in a file by hdb (Monsignor) on Dec 13, 2013 at 09:07 UTC
