Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

'Simple' comparing a hash with an array

by Anonymous Monk
on Apr 16, 2008 at 21:19 UTC ( [id://680923]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am struggling with finding the cause of my code's strange output.

I am trying to find a list of words and count their frequency in a text file given through STDIN. So far, I hoped that the code would print the hash's key/value pairs that matched my array's elements.

It appears to output some key/values and not others using a logic I can't follow. I'm sure I'm commiting a common mistake for this bread-and-butter Perl problem.

Anonymous Newbie

#!/usr/bin/perl -w use strict; my %histogram; my @wordstofind = qw( a am an and are as at be been but by can co de do due each ); #Many more words snipped while (<STDIN>) { my @currentline = split; $histogram{$_}++ for @currentline; } foreach my $i (@wordstofind) { while ( my ($key, $value) = each (%histogram) ) { if ($key =~ /$i/) { print "Found $key, $value times\n"; } } }

Replies are listed 'Best First'.
Re: 'Simple' comparing a hash with an array
by moritz (Cardinal) on Apr 16, 2008 at 21:31 UTC
    The regex match looks a bit weird to me.

    Since 'a' is one of the "words" you're looking for, each word that contains an 'a' is printed out in the end. A word like "candidate" is printed for 'a', 'an' and 'can'.

    You should be a bit more specific about what kind of output you expect. Provide a few lines of sample input and what the output should look like, then we can help you a bit more.

      I'm replying here to you, moritz, not because I'm disagreeing with you but because I want to add to your advice and I think it flows better here than as another response to the OP.

      Anon could fix the specific regex issue using anchors, of course. There are other issues, though, like why a regex is used in the first place when $key eq $i would have worked just as well, or why one would loop over the keys of a hash and an array containing the important keys of the hash as well in the first place.

      I think your advice to be more specific about input and output is apt. I think, too, though, that the OP could use some time with each, perlre, keys, and maybe perldata, map, perlrequick, perlretut, and perlcheat.

Re: 'Simple' comparing a hash with an array
by FunkyMonk (Chancellor) on Apr 16, 2008 at 21:36 UTC
    Your match is finding portions of a word: $key =~ /$i/ will match when $key is "elephant" and $i is "a" because "elephant" contains an "a". To match on whole words only, use /b (word boundary - see perlre) before and after your search term:
    my %histogram; my @wordstofind = qw( a am an and are as at be been but by can co de do due each ); while (<DATA>) { my @currentline = split; $histogram{$_}++ for @currentline; } foreach my $i (@wordstofind) { while ( my ($key, $value) = each (%histogram) ) { if ($key =~ /\b$i\b/) { print "Found $key, $value times\n"; } } } __DATA__ hello, I am an elephant. I have a long trunk.

    Output:

    Found a, 1 times Found am, 1 times Found an, 1 times

Re: 'Simple' comparing a hash with an array
by mr_mischief (Monsignor) on Apr 16, 2008 at 21:45 UTC
    Let's start by eliminating unnecessary steps and by not storing more values into memory than necessary. By getting rid of the fluff, you might be able to see the problem and solution more clearly.

    #!/usr/bin/perl use strict; use warnings; my %histogram = map { $_ => 0 } qw( a am an and are as at be been but by can co de do due each ); # hash of the words to find so we can do an O(1) lookup for them while ( <> ) { chomp; for ( split ) { # split returns a list we can use directly $histogram{ $_ }++ if exists $histogram{ $_ }; # only store counts for words that matter } } foreach my $word ( keys %histogram ) { # keys() will list the keys, and we've already taken care # of making sure we don't have extra words stored. # Now there's no need to do two loops and check an array # against a hash. print "Found $word, $histogram{$word} times.\n"; }
      Thank you monks. I've learnt a lot from this thread :-) I had originally used a hash after a perlfaq4 suggestion about comparing arrays, which would avoid some iteration. However, I ended up iterating over the hash anyway as I didn't know any other ways :-) As some have asked, sample input to this problem is:
      WE regret that a press of matter prevents our noticing
      I want to count frequencies of certain words, as listed in mr_mischief's %histogram. As for output something along the line of the following is what I'm trying to obtain:
      Found for, 1 times. Found such, 1 times. Found up, 1 times. Found at, 2 times. Found had, 1 times. Found was, 1 times. Cumulative total of all words found: 50
      To obtain this, and because I don't need to be concerned about the case of the input, I have added to mr_mischief's elegant code very slightly:
      #!/usr/bin/perl use strict; use warnings; my %histogram = map { $_ => 0 } qw( a am an and are as at be been but by can co de do due each ); # hash of the words to find so we can do an O(1) lookup for them while ( <> ) { chomp; for ( split ) { # split returns a list we can use directly tr/A-Z/a-z/; # lowercase all input print "$_\n"; $histogram{ $_ }++ if exists $histogram{ $_ }; # only store counts for words that matter } } my $count=0; foreach my $word ( keys %histogram ) { # keys() will list the keys, and we've already taken care # of making sure we don't have extra words stored. # Now there's no need to do two loops and check an array # against a hash. if ($histogram{$word} >0) { print "Found $word, $histogram{$word} times.\n"; $count = $count + $histogram{$word}; } } print "Cumulative total of all words found: $count\n";
      Thanks again
        If you're at all worried about locale and language issues, or if you're just concerned about doing things the canonical way, you can use $_ = lc $_; instead of tr/A-Z/a-z/; to get a lowercase version. lc and uc are built in, and they honor the current language and localization settings.
      ++mr_mischief, nice design!
      --
      Wade
Re: 'Simple' comparing a hash with an array
by TGI (Parson) on Apr 16, 2008 at 21:41 UTC

    One problem leaps out at me immediately, you aren't chomping your input lines. Sometimes you'll end up with words like "at\n" in your histogram.

    while (defined (my $line = <STDIN> )) { chomp $line; my @currentline = split /\s+/, $line; $histogram{$_}++ for @currentline; }

    Another basic issue, why do you iterate over your histogram hash? This sort of thing is exactly what hashes are designed to allow you to avoid. Just test for existence, and retrieve the value as needed. I

    if( exists $histogram{$word} ) { print "Found $word, $histogram{$word} times\n"; }

    Update: ysth is right. The chomp is not needed. split seems to be especially hungry. I was expecting it to match like the code below.

    use Data::Dumper; __END__ #Result: while ( <STDIN> ) { my @foo = /(\s+)/; print Dumper \@foo; } C:\> test.pl cat food $VAR1 = [ ' ' ];


    TGI says moo

      One problem leaps out at me immediately, you aren't chomping your input lines. Sometimes you'll end up with words like "at\n" in your histogram.
      Nope. Try it. \n is whitespace, which is consumed by the split.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://680923]
Approved by almut
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (3)
As of 2024-04-20 08:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found