Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

counting elements using a hash

by bk1388 (Initiate)
on Sep 23, 2012 at 01:31 UTC ( [id://995149]=perlquestion: print w/replies, xml ) Need Help??

bk1388 has asked for the wisdom of the Perl Monks concerning the following question:

hi monks I have a file with several hundred thousand numbers , and I need to determine the number of times each value is present in the file and print out a numeric sorted list of values and the number of times seen. I thought using a hash such a the example below would work . my idea was For each number in the array, if the number already exists in the hash, then increment the value by 1 ; else if it does not exist in the hash, then add the number to the hash with the initial value of 1 . HOWEVER IT'S NOT WORKING :( AND I CAN'T FIND WHY, PLEASE HELP

my @numbers = (1..100) my %count; foreach $number(@numbers) { if (exists $count{$number}) { $count{$number}++; } else { $count{$number} = 1; } } foreach (sort keys %count) { print "$number occurs $count{$number} time(s)\n"; }

Replies are listed 'Best First'.
Re: counting elements using a hash
by roboticus (Chancellor) on Sep 23, 2012 at 01:42 UTC

    bk1388:

    Your second for loop should also specify the variable you're using as an iterator:

    foreach $number (sort keys %count) {

    Otherwise the code looks fine.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: counting elements using a hash
by davido (Cardinal) on Sep 23, 2012 at 06:36 UTC

    Are the numbers actually in the range 1 .. 100? What is the actual range? This could have some bearing on what constitutes a good solution.

    Your hash technique could be written like this (please note that we're supplying a numeric comparison to sort. Otherwise you don't get numerical order.)

    use strict; use warnings; my @numbers = (1..100); my %count; $count{$_}++ foreach @numbers; print "$_ occurs $count{$_} time(s)\n" foreach sort { $a <=> $b } keys + %count;

    The reason I asked about the actual range is because sometimes there are patterns to the data set that might allow you to employ a less general approach, with greater efficiency. For example, if you're dealing with positive integers in some reasonably narrow range, you can avoid the O(n log n) sort, and the expense of hash lookups if you just use an array to hold your counts. In the following demonstration I did away with the output (output would swamp a benchmark). Instead, I build an array of anonymous arrays holding an integer value and a count. It's a little contrived, but the goal was to provide a similar platform upon which the two methods could be similarly compared.

    use strict; use warnings; use Benchmark qw(cmpthese); @main::numbers = map {int(rand(10000))+1 } 1 .. 300000; sub hash { my %count; $count{$_}++ foreach @main::numbers; my @results = map { [ $_ => $count{$_} ] } sort { $a <=> $b } keys + %count; } sub array { my @count; $count[$_]++ foreach @main::numbers; my @results = map { [ $_ => $count[$_] ] } grep { $count[$_] } 0 . +. $#count; } cmpthese ( -5, { hash => \&hash, array => \&array, } );

    On my system...

    Rate hash array hash 19.4/s -- -55% array 43.2/s 122% --

    Going with the less general solution wouldn't make sense if your data doesn't consist only of positive integers in a fairly narrow range. But this is an example of where a more specialized algorithm may be a better approach if the data supports it. In fact, if you are dealing with positive integers, and they do span an even greater range, as long as it fits in memory the array may still be a better approach. It doesn't scale as well for sparse data sets, but for data sets where the counts from 0 to 'n' can fit in memory, the larger sets will see even more improvement over the hash/sort technique (since a sort grows log-linearly, and the array approach grows linearly).


    Dave

Re: counting elements using a hash
by CountZero (Bishop) on Sep 23, 2012 at 07:12 UTC
    You are missing the final semi-colon in your first line.

    Also, you are using an alphanumeric search (which means "2" will come between "19" and "20"). You will want to use a numeric sort.

    use Modern::Perl; my @numbers = map int(rand(100)), (1..1000); my %count; $count{$_}++ for @numbers; say "$_ occurs $count{$_} time(s)" for sort {$a <=> $b} keys %count;

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: counting elements using a hash
by Marshall (Canon) on Sep 23, 2012 at 02:02 UTC
    The first task is to get the "number" that you wish to track from the input lines.

    The second step is simply to count using a hash table. It is not necessary to test for exists(). Perl will create a new key if the key doesn't exist.

    #!/usr/bin/perl -w use strict; my %count; open (INPUT, '<', "some_input_file") or die "unable to open some_input_file"); while (<INPUT>) { my ($number) =~ /(\d+)/; #maybe???? next unless $number; #skip undef's $count{$number}++; #new key if $number not there } foreach my $number (sort{$a <=> $b} keys %count) #numeric sort { print "$number occurs $count{$number} time(s)\n"; }
    Try again and show some just a couple of lines of example input and your revised code.

    There is no need to iterate over all 101 possibilities for the number on each line. Process each line once, get the number, make a decision and it is "over with" - don't loop 100 times for each input line.

          next unless $number; #skip undef's

      What happens if the extracted  $number string is '0'? Will zeros be counted?

        That is a good point. These zero's or undef's are problematic.

        The problem statement had 1..100 so this zero didn't enter into the situation. In this case, $number==0 should be skipped (same as undef). So what about $number >100? Well, the OP should should show some revised code and we will iterate again to get it fine tuned. A major point was that Perl will create a new key (auto-vivify) when needed - there is no need to test whether or not the key exists. Second point was a numeric instead of alpha sort.

      here is my revised code and it's working like a charm I the file contains only numbers (integers and decimals), no zeros, thank you all for your input

      my $INPUT_FILE; my $OUTPUT_FILE; my %count; open ($INPUT_FILE, "< " ) or die "Can't open file"; open ($OUTPUT_FILE, ">output") or die "can't open output file"; while($number = readline($INPUT_FILE)){ $count{$number}++; } foreach $number(sort keys %count){ print "$number occurs $count{$number} times(s)\n"; } close ($OUTPUT_FILE) or die "Can't close file!!!";
Re: counting elements using a hash
by Kenosis (Priest) on Sep 23, 2012 at 06:58 UTC

    Consider the following option:

    use strict; use warnings; use Regexp::Common; my %numbers; open my $fh, '<', 'numbersFile.txt' or die $!; while (<$fh>) { $numbers{$1}++ while /($RE{num}{real})/g; } close $fh; print "Number: $_\tCount: $numbers{$_}\n" for sort { $a <=> $b } keys %numbers;

    Contents of numbersFile.txt:

    12.7 abce -3.14 21 This is a line with 8 words in it. 10000 is a big number. Looks like 21 and 12.7 again. 5 + 7 = 12

    Output:

    Number: -3.14 Count: 1 Number: 5 Count: 1 Number: 7 Count: 1 Number: 8 Count: 1 Number: 12 Count: 1 Number: 12.7 Count: 2 Number: 21 Count: 2 Number: 10000 Count: 1

    Regexp::Common is used (the $RE{num}{real} in the regex) to capture the numbers on each line that's read. You may not need to use it, if the numbers are just sets of decimals integers, in which case \d+ can be used, instead.

    The while before the regex insures processing all numbers on a line. The anonymous subroutine { $a <=> $b } is used to sort numbers.

    Hope this helps!

      if the numbers are just sets of decimals, in which case \d+ can be used

      I call those integers, I think of decimals as reals (0.0)

        Oh, my! They are called integers. Thank you for bringing this to my attention. Will correct. Time for sleep...

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://995149]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (3)
As of 2024-04-19 01:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found