http://www.perlmonks.org?node_id=11152595

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
another newbie here, trying to make sense of hashes and how best to use them (if they are what I need) for the following problem.
Assume the following file, where each 'entry' has 3 lines, namely:
>id_1|id_2
sequence_of_chars
label_of_chars

Now, what I want is to store the unique entries. In my case, I treat two entries as duplicates when they have the same id_2 and sequence_of_chars. The label_of_chars does not matter much, as it will only vary a little if the other two values are the same. The only other difference is the id_1, which can take several values for the same id_2 and sequence, and I don't care which one I keep. Example below:
>4kt0_M|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
>6uzv_m|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII
>5oy0_m|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
>6hqb_M|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiiiMMMMMMMMMMMMMMIIIIII

Now, from the example above, the desired output would be any one of 4kt0_M, 6uzv_m, 5oy0_m or 6hqb_M followed by |P72986, then the sequence MALSDTQILAALVVALLPAFLAFRLSTELYK on the next line, and then any one of the 4 available labels. Is hashes the way to go? I can split the line starting with > and store each of the 4 elements into variables, but I don't know how to proceed from there.

Replies are listed 'Best First'.
Re: How to make unique entries
by hv (Prior) on Jun 02, 2023 at 00:12 UTC

    Let's assume you have split out the 4 elements as variables $id1, $id2, $sequence, $label. The next thing you need to create is the signature that represents a "unique" value, by combining $id2 and $sequence. The simplest way is to join them with some character known not to appear in either value. From the example above, I will guess that the pipe character '|' is safe to use:

    my $signature = join '|', $id2, $sequence;

    Now you can use this signature as the key in a hash. For simplicity, I'll use this to store the entire structure:

    my %hash;   # somewhere before you start to loop over the data
    ...
    # within the loop over your data
    my $signature = join '|', $id2, $sequence;
    my $structure = {
        id1 => $id1,
        id2 => $id2,
        sequence => $sequence,
        label => $label,
    };
    $hash{$signature} = $structure;   # save it

    In the case of duplicate signatures this overwrites, so ends up saving a structure for the last example of any given signature, but there are other strategies possible.
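One of those other strategies, sketched here as an assumption rather than something taken from the reply, is to keep the first record seen for each signature instead of the last. The `@records` list is made-up sample data standing in for values that have already been split out, and `//=` needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical pre-split records: [id1, id2, sequence, label].
my @records = (
    [ '4kt0_M', 'P72986', 'MALSDTQILAALVVALLPAFLAFRLSTELYK', 'iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII' ],
    [ '6uzv_m', 'P72986', 'MALSDTQILAALVVALLPAFLAFRLSTELYK', 'iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII' ],
);

my %hash;
for my $rec (@records) {
    my ($id1, $id2, $sequence, $label) = @$rec;
    my $signature = join '|', $id2, $sequence;

    # //= assigns only while the slot is still undefined, so the first
    # record for a signature wins and later duplicates are ignored.
    $hash{$signature} //= {
        id1 => $id1, id2 => $id2, sequence => $sequence, label => $label,
    };
}

my @kept_ids = map { $_->{id1} } values %hash;
print "kept: @kept_ids\n";
```

With plain assignment the same data would leave 6uzv_m in the hash; with `//=` it is 4kt0_M that survives.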

    You can then emit the data by looping over the hash something like:

    for my $signature (keys %hash) {
        my $structure = $hash{$signature};
        printf "%s|%s\n%s\n%s\n",
            $structure->{id1}, $structure->{id2},
            $structure->{sequence}, $structure->{label};
    }
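For readers who want to see the pieces joined up, the following is a minimal end-to-end sketch under some assumptions: the input is held in a string (`$input`, made-up sample data), entries are always exactly three lines, and the header regex is one guess at how to split out the two ids. It follows the last-one-wins behaviour described above and writes the leading `>` back on output.

```perl
use strict;
use warnings;

# Made-up sample input in the thread's three-line format.
my $input = <<'END';
>4kt0_M|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
>6uzv_m|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII
END

my @lines = split /\n/, $input;
my %hash;

# Consume the input three lines at a time: header, sequence, label.
while (my ($header, $sequence, $label) = splice @lines, 0, 3) {
    last unless defined $label;                 # incomplete final entry

    # Split ">id_1|id_2" into its two ids; skip lines that do not match.
    my ($id1, $id2) = $header =~ /^>([^|]+)[|](.+)$/
        or next;

    my $signature = join '|', $id2, $sequence;
    $hash{$signature} = {                       # later duplicates overwrite
        id1 => $id1, id2 => $id2, sequence => $sequence, label => $label,
    };
}

for my $signature (sort keys %hash) {
    my $structure = $hash{$signature};
    printf ">%s|%s\n%s\n%s\n",
        @{$structure}{qw(id1 id2 sequence label)};
}
```

Sorting the keys is only there to make the output order repeatable; hash order in Perl is otherwise unspecified.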

      One additional note: if you’re having a hard time finding a “safe” unused character, remember that Perl handles nulls in strings just fine, so "\0" is an option for the join character.

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.
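The separator point can be illustrated with a small sketch. The values below are invented purely to show a collision and do not come from the thread's data, where the pipe appears safe: when the join character can occur inside a value, two different pairs may produce the same key, while a NUL separator keeps them apart.

```perl
use strict;
use warnings;

# Hypothetical values in which the pipe character is NOT a safe separator.
my @pairs = (
    [ 'A|B', 'C'   ],   # id2 is 'A|B', sequence is 'C'
    [ 'A',   'B|C' ],   # id2 is 'A',   sequence is 'B|C'
);

my (%by_pipe, %by_nul);
for my $pair (@pairs) {
    my ($id2, $sequence) = @$pair;
    $by_pipe{ join '|',  $id2, $sequence } = 1;   # both pairs become 'A|B|C'
    $by_nul{  join "\0", $id2, $sequence } = 1;   # the two keys stay distinct
}

printf "pipe keys: %d, NUL keys: %d\n",
    scalar keys %by_pipe, scalar keys %by_nul;
```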

Re: How to make unique entries
by kcott (Archbishop) on Jun 02, 2023 at 07:58 UTC
    "Is hashes the way to go?"

    Your choice of data structure will depend on a number of factors, such as how you want to store and retrieve the structure, how you want to access the data in the structure, and so on. Have a read of "Perl Data Structures Cookbook" to get an idea of what's available.

    Here's one possible way:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my %data;

    {
        local $/ = "\n>";

        while (<DATA>) {
            $_ = substr $_, 1 if $. == 1;
            my ($ids, $seq, $lab) = split /\n/;
            my ($id1, $id2) = split /[|]/, $ids;
            push @{$data{"$id2-$seq"}}, [$id1, $lab];
        }
    }

    # For DEMO
    use Data::Dump;
    dd \%data;

    __DATA__
    >4kt0_M|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
    >6uzv_m|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII
    >5oy0_m|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
    >6hqb_M|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiiiMMMMMMMMMMMMMMIIIIII

    Output:

    {
      "P72986-MALSDTQILAALVVALLPAFLAFRLSTELYK" => [
        ["4kt0_M", "iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII"],
        ["6uzv_m", "iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII"],
        ["5oy0_m", "iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII"],
        ["6hqb_M", "iiiiiiiiiiiMMMMMMMMMMMMMMIIIIII"],
      ],
    }
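Since the original question asks for just one entry per id_2 and sequence, a structure of this shape can be reduced by taking the first [id_1, label] pair stored under each key. The following is only a sketch: `%data` is a shortened copy of the dump above, and it assumes id_2 never contains a hyphen, because the key is split back apart on the first '-'.

```perl
use strict;
use warnings;

# Same shape as the %data dumped above, shortened to two members.
my %data = (
    'P72986-MALSDTQILAALVVALLPAFLAFRLSTELYK' => [
        [ '4kt0_M', 'iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII' ],
        [ '6uzv_m', 'iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII' ],
    ],
);

my @entries;
for my $key (sort keys %data) {
    # Assumes id_2 itself contains no '-'; the limit of 2 stops the
    # split after the first hyphen.
    my ($id2, $seq) = split /-/, $key, 2;

    # Take the first [id1, label] pair recorded for this key.
    my ($id1, $lab) = @{ $data{$key}[0] };
    push @entries, ">$id1|$id2\n$seq\n$lab\n";
}

print @entries;
```

Picking index 0 is arbitrary; any other element of the list would satisfy the "any of them" requirement equally well.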

    — Ken

Re: How to make unique entries
by hippo (Bishop) on Jun 02, 2023 at 08:34 UTC
    another newbie here ... Is hashes the way to go?

    It's certainly the way to start, and the approaches given by hv and kcott will get you well along that path. Hashes are fundamental to so much of what Perl can do that, as someone new to the language, you will benefit from becoming familiar with them in both the short and long term.

    The only problem you might hit for this particular task is if the amount of data to be stored is comparable to or greater than the amount of RAM you have available. If that happens, performance will drop off a cliff and other tactics or strategies may be required. But only worry about that if/when you hit that limit.
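One possible tactic for reducing memory use, offered as an assumption and not something proposed in the thread, is to stream the file and remember only a fixed-size digest of each signature, printing an entry the first time its signature appears. The sketch below reads from an in-memory string (`$input`, made-up sample data) so that it is self-contained, and uses the core Digest::MD5 module. The `%seen` hash still grows with the number of unique entries, and a digest collision, though very unlikely by accident, would silently drop an entry.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Made-up sample input in the thread's three-line format.
my $input = <<'END';
>4kt0_M|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
>6uzv_m|P72986
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII
>7oy0_m|P72996
MALSDTQILAALVVALLPAFLAFRLSTELYK
iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
END

open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my (%seen, @printed_ids);
while (defined(my $header = <$fh>)) {
    my $sequence = <$fh>;
    my $label    = <$fh>;
    last unless defined $sequence && defined $label;   # incomplete final entry

    chomp(my $ids = $header);
    chomp(my $seq = $sequence);
    $ids =~ s/^>//;
    my ($id1, $id2) = split /[|]/, $ids;

    # Remember only a 16-byte digest of the signature, not the whole record.
    next if $seen{ md5(join "\0", $id2, $seq) }++;

    print $header, $sequence, $label;   # first occurrence: emit right away
    push @printed_ids, $id1;
}
close $fh;
```

Because entries are printed as they are read, the output keeps the input order and the first occurrence of each signature.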

    Do consider signing up for an account here. It is free and will make it easier for you to revisit your old posts and for the rest of us to recognise you instead of being lost among all our other anonymous brethren. You've written a very good post for a first-timer, and it would be good to see more of the same.


    🦛

Re: How to make unique entries
by tybalt89 (Monsignor) on Jun 02, 2023 at 08:46 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11152595
    use warnings;

    my $input = <<END;
    >4kt0_M|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
    >6uzv_m|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiiiiMMMMMMMMMMMMMMMMMII
    >5oy0_m|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
    >5oy0_m|P72986
    MALSDTQILDIFFERENTAFLAFRLSTELYK
    iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
    >7oy0_m|P72996
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiMMMMMMMMMMMMMMMMMIIIII
    >6hqb_M|P72986
    MALSDTQILAALVVALLPAFLAFRLSTELYK
    iiiiiiiiiiiMMMMMMMMMMMMMMIIIIII
    END

    my %unique;

    for ( split /(?=>)/, $input ) {
        $unique{ join "\n", (split /[|\n]/)[1, 2]} //= $_;
    }

    use Data::Dump 'dd';
    dd sort values %unique;

    Outputs:

    (
      ">4kt0_M|P72986\nMALSDTQILAALVVALLPAFLAFRLSTELYK\niiiiiiiiiMMMMMMMMMMMMMMMMMIIIII\n",
      ">5oy0_m|P72986\nMALSDTQILDIFFERENTAFLAFRLSTELYK\niiiiiiiiiMMMMMMMMMMMMMMMMMIIIII\n",
      ">7oy0_m|P72996\nMALSDTQILAALVVALLPAFLAFRLSTELYK\niiiiiiiiiMMMMMMMMMMMMMMMMMIIIII\n",
    )