Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Count byte/character occurrence (quickly)

by james28909 (Chaplain)
on Apr 01, 2016 at 06:19 UTC ( #1159245=perlquestion: print w/replies, xml ) Need Help??
james28909 has asked for the wisdom of the Perl Monks concerning the following question:

I have this script I have written, which seems to work (or it seems to haha), but it is lacking in speed. For a 16 MB file it takes 9-10 secs for it to complete and I will need to use it on 256 MB files in the future. I was hoping I could get some help in speeding it up, if at all possible.
use strict; use warnings; open my $file, '<', shift; binmode($file); my %seen = (); while ( read( $file, my $buf, 1 ) ) { if ( !$seen{$buf} ) { $seen{$buf} = 1; } else { $seen{$buf}++; } } print "$_ - $seen{$_}\n" for ( keys %seen );
Any help in speeding this up would be much appreciated. :)

Replies are listed 'Best First'.
Re: Count byte/character occurrence (1/4)
by BrowserUk (Pope) on Apr 01, 2016 at 09:37 UTC

    While your version took 12 seconds for 16MB on my system:

    C:\test>dir 1123355.bin 14/04/2015 16:50 16,777,216 1123355.bin C:\test>1159245 1123355.bin Took 12.897612 secs

    (That was the third run so the cache was primed.)

    This version took:

    C:\test>1159245 1123355.bin Took 3.832763 secs : 3762666 &#9786; : 46120 &#9787; : 43642 &#9829; : 44106 &#9830; : 43878 ...

    The code;

    #! perl -slw use strict; use Time::HiRes qw[ time ]; my $start = time; open I, '<:raw', $ARGV[ 0 ]; my @seen; while( read( I, my $buf, 16384 ) ) { ++$seen[$_] for unpack 'C*', $buf; } printf "Took %f secs\n", time() - $start; printf "%c : %u\n", $_, $seen[$_] for 0 .. 255;
      while( read( I, my $buf, 16484  ) ) {

      The choice of buffer size here is intriguing. I would have guessed that the ideal would be some exact multiple of the block size. What is the thinking behind the 100 byte excess?

        What is the thinking behind the 100 byte excess?

        A typo :) Now corrected. (Didn't make much difference in terms of speed; which surprises me.)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
      WOW! thats alot faster. I didnt even think about unpack. that code runs in ~2.6 seconds on my machine. Thanks!
      Away for pc right now, but would something with a for loop and substr be faster? I dont necessarily have to unpack any bytes, do I?
        would something with a for loop and substr be faster? I dont necessarily have to unpack any bytes, do I?

        That requires a call into C (substr) for every byte; where using unpack requires a single call for the entire string.

        The cardinal rule for optimising Perl code, is to get perl's built-ins to do as much of the work as you can.

        Using this loop:

        ++$seen[ ord chop $buf ] while length $buf;
        in place of the unpack loop is almost but not quite as fast.

        It trades 2 built-in calls per byte, against the cost of building the unpack return list on the stack, and loses by a hair.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Count byte/character occurrence (quickly)
by Corion (Pope) on Apr 01, 2016 at 06:30 UTC

    Don't read the file byte by byte, but read it in larger chunks and have an inner loop counting the characters:

    open my $file, '<', shift; binmode($file); my %seen = (); $/ = \( 1024 * 1024 ); # Set default buffer size while ( <$file> ) { for my $buf (split //, $_) { if ( !$seen{$buf} ) { $seen{$buf} = 1; } else { $seen{$buf}++; } } } print "$_ - $seen{$_}\n" for ( keys %seen );

      Sadly, when I try to set it to a high(er) buffer size it slows down for some reason. I have an Intel i5 2500k and samsung 840 pro with a biostar tz77xe4 motherboard. It should be reading faster than heck. Is mine (or yours) code some how saving the results of the read instead of throwing them away after reading and adding to seen? I tried to undef $buf as well and that even slowed it down

      EDIT: also, i am using perl 5.16.3 for certain reasons.

Re: Count byte/character occurrence (quickly)
by Tux (Abbot) on Apr 01, 2016 at 07:57 UTC

    Perl offers autovivivication and magic auto-increment. You don't need the if at all:

    $seen{$_}++ for split // => $buf;

    Enjoy, Have FUN! H.Merijn
      This is very good to know, thanks. It doesnt however speed up the script, which is what I am ultimately after. I have been looking at multi-threading, but I am lost on that subject and unsure if I would see that much of a speed up.
Re: Count byte/character occurrence (quickly)
by marioroy (Priest) on Apr 01, 2016 at 14:17 UTC

    The following is a parallel demonstration, based on the code by BrowserUk. There is a bug in MCE::Shared 1.001 and the reason for the length check below. MCE 1.704 and MCE 1.002 will be released in ETA ~ 1 week with the fix.

    MCE::Flow and MCE::Shared

    use strict; use warnings; use MCE::Flow; use MCE::Shared; use Time::HiRes qw[ time ]; my $start = time; my $fh = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] ); my @seen; sub tally { my ($aref) = @_; for ( 0 .. 255 ) { $seen[$_] += $aref->[$_] if $aref->[$_]; } return; } mce_flow { max_workers => 8 }, sub { my @_seen; while( read( $fh, my $buf, 16384 * 4 ) ) { # the length check may be omitted with MCE::Shared 1.002+ last unless length($buf); ++$_seen[$_] for unpack 'C*', $buf; } MCE->do('tally', \@_seen); }; close $fh; printf "Took %f secs\n", time() - $start; # for ( 0 .. 255 ) { # printf "%c : %u\n", $_, $seen[$_] if $seen[$_]; # }

    The serial code takes 8.390 seconds. In comparison, the parallel code completes in 2.253 seconds for a 126 MB file on a machine with 4 real cores and 4 hyper-threads.

    Update:

    The upcoming MCE::Shared 1.002 release will support the following construction by allowing the main or worker process to handle the error. I've been wanting for the shared open call to feel like the native open call.

    use MCE::Shared 1.002; mce_open my $IN, "<:gzip", "wat.paths.gz" or die "open error: $!"; mce_open my $OUT, ">", \*STDOUT or die "open error: $!";

      The following are parallel demonstrations using MCE::Hobo and threads.

      MCE::Hobo and MCE::Shared

      A Hobo is a migratory worker inside the machine that carries the asynchronous gene. Hobos are equipped with threads-like capability for running code asynchronously. Unlike threads, each hobo is a unique process to the underlying OS. The IPC is managed by MCE::Shared, which runs on all the major platforms including Cygwin.

      use strict; use warnings; use MCE::Hobo; use MCE::Shared; use Time::HiRes qw[ time ]; my $start = time; my $fh = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] ); my $seen = MCE::Shared->array; sub task { my @_seen; while( read( $fh, my $buf, 16384 * 4 ) ) { # the length check may be omitted with MCE::Shared 1.002+ last unless length($buf); ++$_seen[$_] for unpack 'C*', $buf; } for ( 0 .. 255 ) { $seen->incrby($_, $_seen[$_]) if $_seen[$_]; } } MCE::Hobo->create('task') for 1 .. 8; # do other stuff if desired $_->join for MCE::Hobo->list; close $fh; printf "Took %f secs\n", time() - $start; # export and destroy the shared array into a local non-shared array $seen = $seen->destroy; # for ( 0 .. 255 ) { # printf "%c : %u\n", $_, $seen->[$_] if $seen->[$_]; # }

      threads and MCE::Shared

      The code for MCE::Hobo and threads are very similar.

      use strict; use warnings; use threads; use MCE::Shared; use Time::HiRes qw[ time ]; my $start = time; my $fh = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] ); my $seen = MCE::Shared->array; sub task { my @_seen; while( read( $fh, my $buf, 16384 * 4 ) ) { # the length check may be omitted with MCE::Shared 1.002+ last unless length($buf); ++$_seen[$_] for unpack 'C*', $buf; } for ( 0 .. 255 ) { $seen->incrby($_, $_seen[$_]) if $_seen[$_]; } } threads->create('task') for 1 .. 8; # do other stuff if desired $_->join for threads->list; close $fh; printf "Took %f secs\n", time() - $start; # export and destroy the shared array into a local non-shared array $seen = $seen->destroy; # for ( 0 .. 255 ) { # printf "%c : %u\n", $_, $seen->[$_] if $seen->[$_]; # }

        You guys are awesome. Thanks for these good examples :) When I run this code, it calculates byte occurrence in .9 secs or less!

        Edit: I do have a few questions as well. I dont have time to ask right now, but I will be back!
Re: Count byte/character occurrence (quickly)
by james28909 (Chaplain) on Apr 05, 2016 at 19:24 UTC

    Thanks to all, I ended up doing it in C. I were having problems with MCE, if I dropped two files in one instance of the GUI it would not populated the count, and doing it in C has been the fastest so far. Thanks for everyones input.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1159245]
Approved by GrandFather
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2017-12-17 12:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What programming language do you hate the most?




















    Results (464 votes). Check out past polls.

    Notices?