PerlMonks  

create unique array

by Anonymous Monk
on Jan 27, 2007 at 09:27 UTC (#596851, perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!
I have an array like the following:
ID	ORGANISM	NUMBER
protein1	organism1	0.843534
protein2	organism2	2.45
protein3	organism3	9.5322
protein4	organism4	0.3475474
protein1	organism6	9.4534
protein2	organism7	0.43534
protein2	organism8	1.2434
protein3	organism9	0.000003
protein3	orgnanism10	1.23325
The fields in each array element are separated by tabs. What I want to do is make a unique array that holds only protein1, protein2, protein3 and protein4 and, if a protein is found more than once, keeps the entry with the smallest number.
The desired output is, for the above example:
protein1	organism1	0.843534
protein2	organism7	0.43534
protein3	organism9	0.000003
protein4	organism4	0.3475474
I tried searching the manual, but I have the problem of how to check the numbers and compare them in order to decide which number to keep for each protein...
Thank you for any help!

Replies are listed 'Best First'.
Re: create unique array
by GrandFather (Sage) on Jan 27, 2007 at 09:52 UTC

    "unique == hash". Consider:

    use strict;
    use warnings;
    use Data::Dump::Streamer;

    my @array = <DATA>;
    chomp @array;
    my %uniq;

    for (@array) {
        my @parts = split "\t";
        next if @parts < 3;
        next if exists $uniq{$parts[0]} and $uniq{$parts[0]}{val} < $parts[2];
        $uniq{$parts[0]} = {val => $parts[2], data => $_};
    }

    Dump (\%uniq);

    __DATA__
    protein1	organism1	0.843534
    protein2	organism2	2.45
    protein3	organism3	9.5322
    protein4	organism4	0.3475474
    protein1	organism6	9.4534
    protein2	organism7	0.43534
    protein2	organism8	1.2434
    protein3	organism9	0.000003
    protein3	orgnanism10	1.23325

    Prints:

    $HASH1 = {
        protein1 => {
            data => "protein1\torganism1\t0.843534",
            val  => 0.843534
        },
        protein2 => {
            data => "protein2\torganism7\t0.43534",
            val  => 0.43534
        },
        protein3 => {
            data => "protein3\torganism9\t0.000003",
            val  => 0.000003
        },
        protein4 => {
            data => "protein4\torganism4\t0.3475474",
            val  => 0.3475474
        }
    };
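
If an array of tab-separated lines is wanted rather than a dump of the hash, the `data` fields of `%uniq` can be collected in ID order. This is only a sketch: `uniq_lines` is a hypothetical helper name that does not appear in the thread, and the two-entry `%uniq` is a stand-in with the same shape as the hash built above so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-in for the %uniq hash built above:
# ID => { val => number, data => original tab-separated line }.
my %uniq = (
    protein2 => { val => 0.43534,  data => "protein2\torganism7\t0.43534" },
    protein1 => { val => 0.843534, data => "protein1\torganism1\t0.843534" },
);

# Hypothetical helper: return the stored lines, ordered by ID (string sort).
sub uniq_lines {
    my ($uniq) = @_;
    return map { $uniq->{$_}{data} } sort keys %$uniq;
}

my @unique = uniq_lines(\%uniq);
print "$_\n" for @unique;
```

Note that `sort keys` compares the IDs as strings, so an ID such as protein10 would sort before protein2; a custom sort block would be needed if that matters.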

    DWIM is Perl's answer to Gödel
Re: create unique array
by virtualsue (Vicar) on Jan 27, 2007 at 09:39 UTC
    Your task can be handled more easily with a hash than with an array. When you say that the data is in a tab-separated array, do you mean that you have a file which consists of tab-separated lines containing the data? Since you only want to keep one entry for each ID, a hash keyed on the ID, with each value a reference to an array holding the organism and number, will be a good way to go. Do you have any code we can help you with? Check out the Tutorials section if you need to read up on the data types.

    Update: A skeleton which will read in your data line by line...

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %proteins;

    while (my $line = <DATA>) {
        chomp $line;
        my ($pro, $org, $value) = split '\s+', $line;  # \s+ : one or more whitespace chars
    }

    __DATA__
    protein1	organism1	0.843534
    protein2	organism2	2.45
    protein3	organism3	9.5322
    protein4	organism4	0.3475474
    protein1	organism6	9.4534
    protein2	organism7	0.43534
    protein2	organism8	1.2434
    protein3	organism9	0.000003
    protein3	orgnanism10	1.23325
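
One way the skeleton above might be filled in, following the suggestion of a hash whose values are array references holding the organism and number. Treat it as a sketch under assumptions: `smallest_per_id` is a hypothetical helper name, the input lines are given inline rather than read from `DATA`, and on a tie the first line seen is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: takes tab-separated lines and returns a hash ref
# mapping each ID to [organism, number] for the line with the smallest number.
sub smallest_per_id {
    my @lines = @_;
    my %proteins;
    for my $line (@lines) {
        chomp $line;
        my ($pro, $org, $value) = split /\t/, $line;
        next unless defined $value;                 # skip blank or short lines
        next if exists $proteins{$pro}
            and $proteins{$pro}[1] <= $value;       # existing entry is already smaller
        $proteins{$pro} = [ $org, $value ];
    }
    return \%proteins;
}

# The sample data from the question, written with explicit \t separators.
my @lines = (
    "protein1\torganism1\t0.843534",
    "protein2\torganism2\t2.45",
    "protein3\torganism3\t9.5322",
    "protein4\torganism4\t0.3475474",
    "protein1\torganism6\t9.4534",
    "protein2\torganism7\t0.43534",
    "protein2\torganism8\t1.2434",
    "protein3\torganism9\t0.000003",
    "protein3\torgnanism10\t1.23325",
);

my $best = smallest_per_id(@lines);
print join("\t", $_, @{ $best->{$_} }), "\n" for sort keys %$best;
```

With the sample data this prints the four lines given as the desired output in the question. Each line is compared against the stored entry as it is read, so no sorting of the input is needed.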
Re: create unique array
by johngg (Canon) on Jan 27, 2007 at 12:46 UTC
    As virtualsue and GrandFather have advised, a hash is the way to go. You can always recreate an array from your hash afterwards if you need to. The code below uses a form of Schwartzian Transform: it splits each data line into its three fields, then sorts the lines in descending numerical order on the number field. Thus, for any given protein, the line with the smallest number comes last. Finally, the sorted items are placed in turn into the hash, keyed on the protein, so successively smaller values overwrite any previous larger "duplicates". I then rebuild an array at this point but that may not be what you actually want.

    use strict;
    use warnings;
    use Data::Dumper;

    # Split each line into fields, sort descending on the number, then load
    # the hash so that smaller values overwrite larger ones for each protein.
    my %smallest =
        map  { $_->[0] => { org => $_->[1], val => $_->[2] } }
        sort { $b->[2] <=> $a->[2] }
        map  { chomp; [ split m{\t} ] }
        <DATA>;

    my @sorted = ();
    foreach my $protein ( sort keys %smallest )
    {
        push @sorted,
            [
                $protein,
                $smallest{$protein}->{org},
                $smallest{$protein}->{val}
            ];
    }

    print Data::Dumper->Dump([\@sorted], [qw{*sorted}]);

    __END__
    protein1	organism1	0.843534
    protein2	organism2	2.45
    protein3	organism3	9.5322
    protein4	organism4	0.3475474
    protein1	organism6	9.4534
    protein2	organism7	0.43534
    protein2	organism8	1.2434
    protein3	organism9	0.000003
    protein3	orgnanism10	1.23325

    Here's the output

    @sorted = (
        [ 'protein1', 'organism1', '0.843534' ],
        [ 'protein2', 'organism7', '0.43534' ],
        [ 'protein3', 'organism9', '0.000003' ],
        [ 'protein4', 'organism4', '0.3475474' ]
    );

    I hope this is of use.

    Cheers,

    JohnGG

      Thank you all guys for your help!

Node Type: perlquestion [id://596851]
Approved by GrandFather