http://www.perlmonks.org?node_id=1222551

pearllearner315 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have a file formatted like this:


bird beak
bird beak
bird claw
bird wings
bird feathers
snake fangs
snake scales
snake fangs
snake tail

I want to loop through this file and get bird and snake as keys, with the words next to them as values in an array. I only want unique elements in each array, though. For example, bird will be a key in the hash, and its corresponding value will be an array that contains (beak, claw, wings, feathers). I have the current code that works but it doesn't get rid of the duplicates:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $file = 'file.txt';
open( FILE, '<', $file ) or die $!;

my %hash;
while ( <FILE> ) {
    chomp;
    my $lines = $_;
    my $key   = (split(/" "/, $lines))[0];
    my $value = (split(/" "/, $lines))[1];
    push @{ $hash{$key} }, $value;
}

print Dumper(\%hash);

How would I get rid of the duplicates in the arrays?

Replies are listed 'Best First'.
Re: assigning arrays as values to keys of hash
by davido (Cardinal) on Sep 18, 2018 at 05:20 UTC

    In the canonical solution for finding unique elements a hash is employed since the keys are guaranteed to be unique. You simply need two hashes; one top-level (snake, bird), and one deeper level (scales, fangs, tail). But then once you've used that deeper level hash to remove the duplicate attributes, you can convert its keys to the contents of an anonymous array referred to by the top-level hash. In other words, you can replace the lower level hash with an array containing the keys that lower level hash once held.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use Data::Dumper;

    my %hash;

    while (<DATA>) {
        chomp;
        my ($k, $v) = split /\s/;
        $hash{$k}{$v} = undef;
    }

    $hash{$_} = [keys %{$hash{$_}}] for keys %hash;

    print Dumper \%hash;

    __DATA__
    bird beak
    bird beak
    bird claw
    bird wings
    bird feathers
    snake fangs
    snake scales
    snake fangs
    snake tail

    The output will be:

    $VAR1 = {
              'snake' => [
                           'tail',
                           'fangs',
                           'scales'
                         ],
              'bird' => [
                          'claw',
                          'wings',
                          'beak',
                          'feathers'
                        ]
            };

    Another approach could be to track whether an attribute pair has been seen before in realtime during the while loop, rather than postprocessing the hash of hashes into a hash of arrays. To do this you could use a temporary %attribseen hash where the keys are some unique concatenation of the animal type and a given attribute of that animal. For example, 'bird' and 'beak' could be used to form a hash key of bird|beak, and then you use that to assure uniqueness:

    my %hash;

    {
        my %attribseen;

        while (<DATA>) {
            chomp;
            my ($k, $v) = split /\s/;
            push @{$hash{$k}}, $v unless $attribseen{"$k|$v"}++;
        }
    }

    print Dumper \%hash;

    __DATA__
    bird beak
    bird beak
    bird claw
    bird wings
    bird feathers
    snake fangs
    snake scales
    snake fangs
    snake tail

    The output will be the same as before. For some this may be simpler to look at. Even better (from a legibility standpoint) may be to separate out the uniqueness check into its own object, which can offer some internal state:

    #!/usr/bin/env perl

    package PairUnique;

    use strict;
    use warnings;

    sub new {return bless {}, shift}

    sub unique {
        my ($self, $k, $v) = @_;
        return !$self->{"$k=>$v"}++ ? $v : ();
    }

    package main;

    use strict;
    use warnings;
    use Data::Dumper;

    my %hash;

    {
        my $get = PairUnique->new;

        while (<DATA>) {
            chomp;
            my ($k, $v) = split /\s/;
            my $aref = $hash{$k} //= [];
            push @$aref, $get->unique($k,$v);
        }
    }

    print Dumper \%hash;

    __DATA__
    bird beak
    bird beak
    bird claw
    bird wings
    bird feathers
    snake fangs
    snake scales
    snake fangs
    snake tail

    This separates out the uniqueness logic, and keeps only the structure-building logic inside the while loop. More code means more to maintain and understand, but it's possible that intent will be clearer to the person reading the code.


    Dave

Re: assigning arrays as values to keys of hash
by jwkrahn (Monsignor) on Sep 18, 2018 at 04:28 UTC

    For unique values you probably want to use a hash instead of an array, something like this:

    while ( <FILE> ) {
        my ( $key, $value ) = split;
        $hash{ $key }{ $value } = ();
    }

    print Dumper \%hash;

      I specifically need a hash of arrays.. any way that's possible?

        Why do you need a hash of arrays? Is this question related to homework? As mentioned by jwkrahn, using a multi-level hash would allow you to readily avoid duplicates.

        I have the current code that works but it doesn't get rid of the duplicates

        The code you posted does not seem to work. When I ran your code, the hash keys contained an entire line of text and the values were undefined array references. When I removed the double quotes from the first argument to split, I was able to get a hash of array references. However, as you mentioned, there are duplicates in the array references.

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use Data::Dumper;

        my $file = 'file.txt';
        open( FILE, '<', $file ) or die $!;

        my %hash;
        while ( <FILE> ) {
            chomp;
            my $lines = $_;
            my $key   = (split(/ /, $lines))[0];
            my $value = (split(/ /, $lines))[1];
            push @{ $hash{$key} }, $value;
        }

        print Dumper(\%hash);

        exit;
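
To see why the quoted pattern misbehaves, here is a small sketch using one line of the question's data (the variable names are illustrative only). The pattern /" "/ asks for a literal quote, space, quote sequence, which never occurs in the line, so split hands back the whole line as a single field.

```perl
use strict;
use warnings;

my $line = 'bird beak';

# /" "/ looks for three characters: quote, space, quote.
# The line contains no such sequence, so nothing is split.
my @quoted = split /" "/, $line;

# / / matches a single space, so the line splits into two fields.
my @plain = split / /, $line;

# The special string ' ' splits on runs of whitespace and
# ignores leading whitespace.
my @awk = split ' ', "  bird   beak";

print scalar(@quoted), " field(s): @quoted\n";   # 1 field(s): bird beak
print scalar(@plain),  " field(s): @plain\n";    # 2 field(s): bird beak
print scalar(@awk),    " field(s): @awk\n";      # 2 field(s): bird beak
```

With a single field, the list slice [1] in the original code yields undef, which is why the values came out empty.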

        There are quite a few ways you could go about removing the duplicates. Here is one way to do it with help from the uniq function of List::Util.

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use Data::Dumper;
        use List::Util qw/uniq/;

        my $file = 'file.txt';
        open( FILE, '<', $file ) or die $!;

        my %hash;
        while ( <FILE> ) {
            chomp;
            my $lines = $_;
            my ($key, $value) = split(/ /, $lines);
            push @{ $hash{$key} }, $value;
        }

        foreach my $key( keys %hash ){
            my @array = @{$hash{$key}};
            my @uniq_elems = uniq @array;
            $hash{$key} = \@uniq_elems;
        }

        print Dumper(\%hash);

        exit;

        push-ing each "organ" to an autovivified anonymous array keyed by its "animal" allows preservation of the original order of "organs" as found in the file (if this is of any importance). If preserving original order isn't important, use the simpler two-level hash approach described by others.

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use autodie;
         no autodie qw(open close);
         ;;
         use List::MoreUtils qw(uniq);
         ;;
         use Data::Dump qw(dd);
         ;;
         my $file =
           qq{bird beak\n}     .
           qq{bird beak\n}     .
           qq{bird claw\n}     .
           qq{bird wings\n}    .
           qq{bird feathers\n} .
           qq{snake fangs\n}   .
           qq{snake scales\n}  .
           qq{snake fangs\n}   .
           qq{snake tail\n}
           ;
         print qq{[[$file]]};
         ;;
         open my $fh, '<', \$file or die qq{opening ram file: $!};
         ;;
         my %hash;
         while (my $line = <$fh>) {
           my $parsed = my ($animal, $organ) =
             $line =~ m{ \A ([[:alpha:]]+) \s+ ([[:alpha:]]+) \Z }xmsg;
           ;;
           die qq{bad line '$line'} unless $parsed;
           ;;
           push @{ $hash{$animal} }, $organ;
           }
         ;;
         close $fh or die qq{closing ram file: $!};
         ;;
         @$_ = uniq @$_ for values %hash;
         dd \%hash;
        "
        [[bird beak
        bird beak
        bird claw
        bird wings
        bird feathers
        snake fangs
        snake scales
        snake fangs
        snake tail
        ]]
        {
          bird  => ["beak", "claw", "wings", "feathers"],
          snake => ["fangs", "scales", "tail"],
        }


        Give a man a fish:  <%-{-{-{-<

        my %hash;

        while ( <FILE> ) {
            my ( $key, $value ) = split;
            $hash{ $key }{ $value } = ();
        }

        $_ = [ keys %$_ ] for values %hash;

        print Dumper \%hash;

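
The conversion statement in the snippet above relies on values returning aliases to the hash's values rather than copies, so assigning to $_ replaces each inner hash with an array reference. A small sketch with made-up data; sort is added here only to make the printed order deterministic, since keys returns entries in no particular order:

```perl
use strict;
use warnings;

my %hash = (
    bird  => { beak  => undef, claw => undef },
    snake => { fangs => undef, tail => undef },
);

# values() yields aliases, so assigning to $_ updates %hash itself.
$_ = [ sort keys %$_ ] for values %hash;

print "$_: @{ $hash{$_} }\n" for sort keys %hash;
# bird: beak claw
# snake: fangs tail
```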
Re: assigning arrays as values to keys of hash
by tybalt89 (Parson) on Sep 18, 2018 at 20:06 UTC

    Why do it in two passes when it can be done in one pass? Efficiency is over-rated. (tybalt89 ducks :)

    #!/usr/bin/perl

    # https://perlmonks.org/?node_id=1222551

    use strict;
    use warnings;
    use Data::Dumper;

    my %hash;

    while( <DATA> )
      {
      /(\S+)\s+(\S+)/ and
        $hash{$1} = [ keys %{{map {$_, 1} @{$hash{$1}}, $2}} ];
      }

    print Dumper \%hash;

    __DATA__
    snake fangs
    snake tail
    snake fangs
    bird feathers
    bird beak
    snake scales
    bird beak
    bird claw
    bird wings

      Hi tybalt89!

      Great post!
      However, I am not convinced that your implementation would be more efficient than any of the "2 pass solutions".
      I thought my response to the OP at Re: assigning arrays as values to keys of hash to be reasonable and importantly: understandable by the OP.

      I think that sometimes PerlMonks fails new Perlers with overly complicated solutions that they can't understand or generalize.
      This OP is a beginner, not by user name, but by his original code.

      Your solution hides a foreach loop in terms of a map{} which does a lot of work. Shorter Perl code doesn't always mean "faster".

        It seems my "Efficiency is over-rated" comment was unclear. I fully believe (without testing, therefore as an article of faith) that my solution is slower than the two-pass solutions. I guess I didn't make that clear. I am less interested in efficiency and more interested in solutions that show more rarely used Perl capabilities.

        TIMTOWTDI forever :)
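
For anyone who would rather measure than take it on faith, a rough sketch with the core Benchmark module follows. The input list and the sub names one_pass and two_pass are made up for illustration, and the one-pass sub is an adaptation of the idea above (rebuilding each array through a temporary hash on every line), not the exact code posted. No timings are shown here because they depend on the machine and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(uniq);   # uniq needs List::Util 1.45 or later

# Made-up input: 1000 lines with heavy duplication.
my @lines = ('bird beak', 'bird claw', 'snake fangs', 'snake fangs') x 250;

# One pass: rebuild each array through a temporary hash on every line.
sub one_pass {
    my %hash;
    for my $line (@lines) {
        my ($k, $v) = split ' ', $line;
        $hash{$k} = [ keys %{ +{ map { $_ => 1 } @{ $hash{$k} // [] }, $v } } ];
    }
    return \%hash;
}

# Two passes: push everything, then de-duplicate each array once.
sub two_pass {
    my %hash;
    for my $line (@lines) {
        my ($k, $v) = split ' ', $line;
        push @{ $hash{$k} }, $v;
    }
    @$_ = uniq @$_ for values %hash;
    return \%hash;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { one_pass => \&one_pass, two_pass => \&two_pass });
```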

Re: assigning arrays as values to keys of hash
by BillKSmith (Parson) on Sep 18, 2018 at 14:33 UTC
    It is possible to do exactly what you asked for by explicitly testing for duplicates (use the function 'none' from List::Util) before storing. Because it uses arrays, it does preserve the order of the features. This solution is probably the slowest running of all the suggestions you received.
    ?type pearllearner315.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(none);
    use Data::Dumper;
    my $file = \<<'EOF';
    bird beak
    bird beak
    bird claw
    bird wings
    bird feathers
    snake fangs
    snake scales
    snake fangs
    snake tail
    EOF
    #my $file = 'file.txt';
    open( FILE, '<', $file ) or die $!;
    my %hash;
    while ( <FILE> ) {
        chomp;
        my $lines = $_;
        my ($key, $value) = split(/\s+/, $lines, 2);
        push @{ $hash{$key} }, $value
            if none {$value eq $_} @{$hash{$key}};
    }
    print Dumper(\%hash);

    ?perl pearllearner315.pl
    $VAR1 = {
              'snake' => [
                           'fangs',
                           'scales',
                           'tail'
                         ],
              'bird' => [
                          'beak',
                          'claw',
                          'wings',
                          'feathers'
                        ]
            };
    ?
    Bill
Re: assigning arrays as values to keys of hash
by Marshall (Abbot) on Sep 18, 2018 at 19:00 UTC
    I guess while we are beating this thing to death...
    I find this pretty easy to read... List::Util is a great module.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper;
    use List::Util qw(uniq);

    my %hash;
    while (<DATA>)
    {
        my ($animal, $part) = split /\s+/;
        push @{$hash{$animal}},$part;
    }

    @{$hash{$_}} = uniq @{$hash{$_}} for keys %hash;

    print Dumper \%hash;

    __DATA__
    bird beak
    bird beak
    bird claw
    bird wings
    bird feathers
    snake fangs
    snake scales
    snake fangs
    snake tail

      @{$hash{$_}} = uniq @{$hash{$_}} for keys %hash;

      I think
          @$_ = uniq @$_ for values %hash;
      is more concise and even easier to read :)
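
The shorter form works because each hash value is an array reference, so assigning to @$_ rewrites the array that the reference points to, with no need to look the key up again. A minimal sketch, assuming List::Util 1.45 or later (the first release that provides uniq); the data is hard-coded here for illustration:

```perl
use strict;
use warnings;
use List::Util qw(uniq);   # uniq needs List::Util 1.45 or later

my %hash = (
    bird  => [qw(beak beak claw wings feathers)],
    snake => [qw(fangs scales fangs tail)],
);

# Each $_ is an array reference stored in %hash; assigning to @$_
# replaces that array's contents, keeping first occurrences in order.
@$_ = uniq @$_ for values %hash;

print "$_: @{ $hash{$_} }\n" for sort keys %hash;
# bird: beak claw wings feathers
# snake: fangs scales tail
```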


      Give a man a fish:  <%-{-{-{-<