PerlMonks  

sorting headers in a file

by utpalmtbi (Acolyte)
on Dec 13, 2013 at 06:42 UTC ( [id://1066988]=perlquestion )

utpalmtbi has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have an input file with many headers and sequences, such as:

>or3

agagatgatagat

>or10

aacctttagtag

>or1

gtatatatata

>or2

tactacatgagg

......

I want to sort the records by the number after "or" in each header. For the input above, the output file should be:

>or1

gtatatatata

>or2

tactacatgagg

>or3

agagatgatagat

>or10

aacctttagtag

......

Please suggest. Thanks.

Replies are listed 'Best First'.
Re: sorting headers in a file
by Corion (Patriarch) on Dec 13, 2013 at 07:26 UTC

    There are many, many nodes here regarding how to best parse FASTA files. Most likely Super Search will turn up even more nodes. Have you looked through what BioPerl has to offer?

Re: sorting headers in a file
by hdb (Monsignor) on Dec 13, 2013 at 07:10 UTC

    For the sorting, the file needs to be small enough to be read into memory. If not, I would suggest using some kind of database.

    To split your data into individual records, you could set the special variable $/ to the value ">or". That way, when you read your file, you get it in the chunks that you need. You would then extract the number at the beginning of each chunk (e.g. with a regex), sort the chunks by that number, and print them.
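
A minimal sketch of the approach described above, assuming the whole input fits in memory and every header has the form ">or" followed by digits. The name sort_records and the in-memory filehandle are illustrative and not from the thread.

```perl
use strict;
use warnings;

# Read ">or"-separated chunks, pull out the leading number of each chunk,
# and return the records sorted numerically as one string.
sub sort_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = '>or';                  # each read ends at the next ">or"
    my @records;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                  # removes a trailing ">or" (the current $/)
        # Leading digits are the sort key; the rest, trimmed, is the sequence.
        next unless $chunk =~ /\A (\d+) \s+ (.*?) \s* \z/xs;
        push @records, [ $1, $2 ];
    }
    my $out = '';
    for my $rec ( sort { $a->[0] <=> $b->[0] } @records ) {
        $out .= ">or$rec->[0]\n\n$rec->[1]\n\n";
    }
    return $out;
}

my $input = ">or3\n\nagagatgatagat\n\n>or10\n\naacctttagtag\n\n"
          . ">or1\n\ngtatatatata\n\n>or2\n\ntactacatgagg\n";
print sort_records($input);
```

Keeping the records in an array rather than a hash means that two records with the same number would both survive the sort.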

Re: sorting headers in a file
by Lennotoecom (Pilgrim) on Dec 13, 2013 at 07:57 UTC
    One way to go:
    $/ = '>or';
    sub{$h{$1} = $2 if /(\d+)\n\n(\w+)/}->() foreach <DATA>;
    foreach (sort { $a<=>$b } keys %h){
        print ">or$_\n\n$h{$_}\n\n";
    }
    __DATA__
    >or3

    agagatgatagat

    >or10

    aacctttagtag

    >or1

    gtatatatata

    >or2

    tactacatgagg
    Output:
    >or1

    gtatatatata

    >or2

    tactacatgagg

    >or3

    agagatgatagat

    >or10

    aacctttagtag

      I would write

      sub{$h{$1} = $2 if /(\d+)\n\n(\w+)/}->() foreach <DATA>;
      as
      /(\d+)\n+(\w+)/ and $h{$1} = $2 for <DATA>;

      which I think is more readable. The modification of the regex makes it a little more robust with respect to the formatting of the data. Note that you would still lose data if there are duplicate ">or..." headers or multiline sequences.

        thanks for corrections
        your code is beautiful
        gotta remember that construction
Re: sorting headers in a file
by 2teez (Vicar) on Dec 13, 2013 at 08:03 UTC

    Hi utpalmtbi,
    I agree totally with hdb's advice to use some kind of database.
    However, with your dataset one can get one's hands dirty using a Perl hash and a kind of "modified" Schwartzian transform, like so:

    use warnings;
    use strict;

    my %hash;
    my $key;
    while (<DATA>) {
        next if /^\s+$/;
        if (/^>/) {
            $key = $_;
        }
        else {
            $hash{$key} = $_;
        }
    }

    print map { $_->[0], $hash{ $_->[0] }, $/ }
      sort { $a->[1] <=> $b->[1] }
      map { [ $_, /(\d+$)/ ] } keys %hash;

    __DATA__
    >or3

    agagatgatagat

    >or10

    aacctttagtag

    >or1

    gtatatatata

    >or2

    tactacatgagg
    NOTE:
    I can't tell how this will do with larger datasets.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me

      Your code relies on a number of implicit assumptions:

      • Using a hash implies that the ">or..." headers are unique. If two or more identical headers appear, your code would lose data.
      • You also assume that blank lines are of no significance.
      • If there is a multiline sequence, you will only store the last line.
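
A sketch of one way to avoid relying on these assumptions, reading line by line into an array of records instead of a hash. The name sort_fasta_lines and the record fields num, idx and text are hypothetical; headers without a trailing number are assumed to sort as 0.

```perl
use strict;
use warnings;

# Group lines into records: a line starting with ">" opens a new record and
# every other line is appended verbatim to the current one, so duplicate
# headers, blank lines and multiline sequences are all preserved.
sub sort_fasta_lines {
    my (@lines) = @_;
    my @records;
    for my $line (@lines) {
        if ( $line =~ /^>/ ) {
            my ($num) = $line =~ /(\d+)\s*$/;    # number at the end of the header
            push @records, {
                num  => defined $num ? $num : 0,
                idx  => scalar @records,         # input position, used as tie-breaker
                text => $line,
            };
        }
        elsif (@records) {
            $records[-1]{text} .= $line;
        }
    }
    return join '',
        map  { $_->{text} }
        sort { $a->{num} <=> $b->{num} or $a->{idx} <=> $b->{idx} } @records;
}

my @demo = ( ">or3\n", "agaga\n", "tgata\n", ">or1\n", "gtata\n", ">or3\n", "ccccc\n" );
print sort_fasta_lines(@demo);
```

Records with equal numbers keep their input order because of the idx tie-breaker, and lines before the first header are dropped.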

        Your code relies on a number of implicit assumptions:...

        Those assumptions are safely OK considering the OP's dataset.
        That is why I said "using your dataset", i.e. the OP's, in my previous reply. I don't suppose it contains anything that violates the assumptions you listed, unless the OP has told you otherwise privately.


Node Type: perlquestion [id://1066988]
Approved by hdb