counting elements using a hash

bk1388 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: counting elements using a hash by roboticus (Chancellor) on Sep 23, 2012 at 01:42 UTC
bk1388: Your second for loop should also specify the variable you're using as an iterator: `foreach $number (sort keys %count) {` [download] Otherwise the code looks fine. ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l]
Re: counting elements using a hash by davido (Cardinal) on Sep 23, 2012 at 06:36 UTC
Are the numbers actually in the range 1 .. 100? What is the actual range? This could have some bearing on what constitutes a good solution. Your hash technique could be written like this (please note that we're supplying a numeric comparison to sort. Otherwise you don't get numerical order.) `use strict; use warnings; my @numbers = (1..100); my %count; $count{$_}++ foreach @numbers; print "$_ occurs $count{$_} time(s)\n" foreach sort { $a <=> $b } keys + %count;` [download] The reason I asked about the actual range is because sometimes there are patterns to the data set that might allow you to employ a less general approach, with greater efficiency. For example, if you're dealing with positive integers in some reasonably narrow range, you can avoid the O(n log n) sort, and the expense of hash lookups if you just use an array to hold your counts. In the following demonstration I did away with the output (output would swamp a benchmark). Instead, I build an array of anonymous arrays holding an integer value and a count. It's a little contrived, but the goal was to provide a similar platform upon which the two methods could be similarly compared. `use strict; use warnings; use Benchmark qw(cmpthese); @main::numbers = map {int(rand(10000))+1 } 1 .. 300000; sub hash { my %count; $count{$_}++ foreach @main::numbers; my @results = map { [ $_ => $count{$_} ] } sort { $a <=> $b } keys + %count; } sub array { my @count; $count[$_]++ foreach @main::numbers; my @results = map { [ $_ => $count[$_] ] } grep { $count[$_] } 0 . +. $#count; } cmpthese ( -5, { hash => \&hash, array => \&array, } );` [download] On my system... `Rate hash array hash 19.4/s -- -55% array 43.2/s 122% --` [download] Going with the less general solution wouldn't make sense if your data doesn't consist only of positive integers in a fairly narrow range. But this is an example of where a more specialized algorithm may be a better approach if the data supports it. In fact, if you are dealing with positive integers, and they do span an even greater range, as long as it fits in memory the array may still be a better approach. It doesn't scale as well for sparse data sets, but for data sets where the counts from 0 to 'n' can fit in memory, the larger sets will see even more improvement over the hash/sort technique (since a sort grows log-linearly, and the array approach grows linearly). Dave	[reply] [d/l] [select]
Re: counting elements using a hash by CountZero (Bishop) on Sep 23, 2012 at 07:12 UTC
You are missing the final semi-colon in your first line. Also, you are using an alphanumeric search (which means "2" will come between "19" and "20"). You will want to use a numeric sort. `use Modern::Perl; my @numbers = map int(rand(100)), (1..1000); my %count; $count{$_}++ for @numbers; say "$_ occurs $count{$_} time(s)" for sort {$a <=> $b} keys %count;` [download] CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics	[reply] [d/l]
Re: counting elements using a hash by Marshall (Canon) on Sep 23, 2012 at 02:02 UTC
The first task is to get the "number" that you wish to track from the input lines. The second step is simply to count using a hash table. It is not necessary to test for exists(). Perl will create a new key if the key doesn't exist. `#!/usr/bin/perl -w use strict; my %count; open (INPUT, '<', "some_input_file") or die "unable to open some_input_file"); while (<INPUT>) { my ($number) =~ /(\d+)/; #maybe???? next unless $number; #skip undef's $count{$number}++; #new key if $number not there } foreach my $number (sort{$a <=> $b} keys %count) #numeric sort { print "$number occurs $count{$number} time(s)\n"; }` [download] Try again and show some just a couple of lines of example input and your revised code. There is no need to iterate over all 101 possibilities for the number on each line. Process each line once, get the number, make a decision and it is "over with" - don't loop 100 times for each input line.	[reply] [d/l]
Re^2: counting elements using a hash by AnomalousMonk (Archbishop) on Sep 23, 2012 at 09:08 UTC
`next unless $number; #skip undef's` What happens if the extracted `$number` string is '0'? Will zeros be counted?	[reply] [d/l] [select]
Re^3: counting elements using a hash by Marshall (Canon) on Sep 23, 2012 at 12:28 UTC
That is a good point. These zero's or undef's are problematic. The problem statement had 1..100 so this zero didn't enter into the situation. In this case, $number==0 should be skipped (same as undef). So what about $number >100? Well, the OP should should show some revised code and we will iterate again to get it fine tuned. A major point was that Perl will create a new key (auto-vivify) when needed - there is no need to test whether or not the key exists. Second point was a numeric instead of alpha sort.	[reply]
Re^2: counting elements using a hash by bk1388 (Initiate) on Sep 23, 2012 at 21:13 UTC
here is my revised code and it's working like a charm I the file contains only numbers (integers and decimals), no zeros, thank you all for your input `my $INPUT_FILE; my $OUTPUT_FILE; my %count; open ($INPUT_FILE, "< " ) or die "Can't open file"; open ($OUTPUT_FILE, ">output") or die "can't open output file"; while($number = readline($INPUT_FILE)){ $count{$number}++; } foreach $number(sort keys %count){ print "$number occurs $count{$number} times(s)\n"; } close ($OUTPUT_FILE) or die "Can't close file!!!";` [download]	[reply] [d/l]
Re: counting elements using a hash by Kenosis (Priest) on Sep 23, 2012 at 06:58 UTC
Consider the following option: `use strict; use warnings; use Regexp::Common; my %numbers; open my $fh, '<', 'numbersFile.txt' or die $!; while (<$fh>) { $numbers{$1}++ while /($RE{num}{real})/g; } close $fh; print "Number: $_\tCount: $numbers{$_}\n" for sort { $a <=> $b } keys %numbers;` [download] Contents of `numbersFile.txt`: `12.7 abce -3.14 21 This is a line with 8 words in it. 10000 is a big number. Looks like 21 and 12.7 again. 5 + 7 = 12` [download] Output: `Number: -3.14 Count: 1 Number: 5 Count: 1 Number: 7 Count: 1 Number: 8 Count: 1 Number: 12 Count: 1 Number: 12.7 Count: 2 Number: 21 Count: 2 Number: 10000 Count: 1` [download] Regexp::Common is used (the `$RE{num}{real}` in the regex) to capture the numbers on each line that's read. You may not need to use it, if the numbers are just sets of ~~decimals~~ integers, in which case `\d+` can be used, instead. The `while` before the regex insures processing all numbers on a line. The anonymous subroutine `{ $a <=> $b }` is used to sort numbers. Hope this helps!	[reply] [d/l] [select]
Re^2: counting elements using a hash by Anonymous Monk on Sep 23, 2012 at 07:01 UTC
if the numbers are just sets of decimals, in which case \d+ can be used I call those integers, I think of decimals as reals (0.0)	[reply]
Re^3: counting elements using a hash by Kenosis (Priest) on Sep 23, 2012 at 07:05 UTC
Oh, my! They are called integers. Thank you for bringing this to my attention. Will correct. Time for sleep...	[reply]


Keep It Simple, Stupid
	PerlMonks