PerlMonks  

How to process each byte in a binary file?

by John M. Dlugosz (Monsignor)
on Aug 12, 2002 at 19:46 UTC (#189604, perlquestion)
John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I've commonly used two ways to look at each byte of a binary buffer in turn.

The clearest is probably

    foreach my $byte (unpack "C*", $string) {
        # do something with $byte
    }

But I've also used something like

    while ($string =~ /(.)/sg) {
        # do something with $1
    }

and always found the equivalent construct using substr instead of a regex to be unenlightened.

But what's a good way to do this with a potentially large buffer? Does the first way generate a huge list first, or does it optimize down like 1..100000 does in a modern perl build?

Is there a, shall we say even "cooler", method that I'm missing?

—John
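
One way to sidestep the large-list concern is to unpack the buffer a slice at a time, so the temporary list never holds more than one slice of byte values. This is a minimal sketch, not something proposed in the thread: the helper name `each_byte_chunked` and the default slice size of 4096 are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback once per byte value (0..255),
# unpacking at most $chunk bytes at a time so the temporary list stays small.
sub each_byte_chunked {
    my ($string, $callback, $chunk) = @_;
    $chunk ||= 4096;    # assumed default; tune for your data
    my $len = length $string;
    for (my $off = 0; $off < $len; $off += $chunk) {
        # substr returns the remainder when fewer than $chunk bytes are left
        $callback->($_) for unpack "C*", substr($string, $off, $chunk);
    }
    return;
}

# Usage: sum the bytes of a 10,000-byte buffer in 1024-byte slices.
my $sum = 0;
each_byte_chunked("\x01" x 10_000, sub { $sum += $_[0] }, 1024);
print "$sum\n";
```

The callback keeps the per-byte work separate from the slicing, so the slice size can be changed without touching the processing code.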

Re: How to process each byte in a binary file?
by particle (Vicar) on Aug 12, 2002 at 19:54 UTC
    for large files...

    my $file = 'myverylargefile.bin';
    {
        local *INPUT;
        local $/ = \1;                      # read one-byte records
        open INPUT, '<', $file or die $!;
        binmode INPUT;                      # binary data: no newline translation
        while (<INPUT>) {
            ## process here...
        }
    }

    from perlvar:
    Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.

    ~Particle *accelerates*
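
The same `$/` mechanism also accepts a larger record size, which can then be combined with `unpack "C*"` on each record. The following is a sketch under assumptions: the 4096-byte record size is arbitrary, and the temporary file exists only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small binary file so the sketch is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} pack("C*", 0 .. 255) x 40;    # 10,240 bytes
close $tmp or die "close: $!";

my ($count, $sum) = (0, 0);
{
    local $/ = \4096;                      # records of up to 4096 bytes (assumed size)
    open my $in, '<', $path or die "open $path: $!";
    binmode $in;
    while (my $record = <$in>) {
        for my $byte (unpack "C*", $record) {
            $count++;
            $sum += $byte;
        }
    }
    close $in;
}
print "$count bytes, sum $sum\n";
```

Only the last record can be shorter than the requested size, so the inner loop needs no special case for the end of the file.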

      It looks like $/ doesn't have any effect on IO::Scalar. I see that if the input is already in a file, and really is a primitive file handle, that this saves the trouble of reading it in first. But I wonder if the overhead of one read at a time is still high, compared to reading a chunk at a time and processing the chunks using one of the other methods.

Re: How to process each byte in a binary file?
by Anonymous Monk on Aug 12, 2002 at 20:23 UTC
    vec EXPR,OFFSET,BITS

    Treats the string in EXPR as a bit vector made up of elements of width BITS, and returns the value of the element specified by OFFSET as an unsigned integer. BITS therefore specifies the number of bits that are reserved for each element in the bit vector. This must be a power of two from 1 to 32 (or 64, if your platform supports that).

    Maybe this is what you mean?
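
With BITS set to 8, each element of the bit vector is one byte, so `vec` can walk a string byte by byte. A minimal sketch, with a made-up sample string:

```perl
use strict;
use warnings;

my $string = "\x00\x7f\xff" . "Perl";      # sample data, chosen for illustration
my @bytes;
for my $i (0 .. length($string) - 1) {
    push @bytes, vec($string, $i, 8);      # element $i of width 8 bits = byte $i
}
print join(" ", @bytes), "\n";             # prints "0 127 255 80 101 114 108"
```

Unlike `unpack "C*"`, this never builds a list of the whole buffer unless the caller chooses to collect the values, as the sketch does for printing.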

Re: How to process each byte in a binary file?
by kschwab (Priest) on Aug 12, 2002 at 20:58 UTC
    How about IO::Scalar or IO::String?

    You could then seek() and tell() around the string, or read() in 1 byte increments.

    I'm not sure what's under the covers, but both seem elegant from the outside.
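
For comparison, Perl 5.8 and later can open a filehandle directly on a scalar reference with the core `open`. That is a different mechanism from IO::Scalar or IO::String, and the sketch below makes no claim about how it would fare in the timings discussed in this thread. The sample string is an assumption for illustration.

```perl
use strict;
use warnings;

my $string = "binary\x00data";             # 11 bytes of sample data
open my $fh, '<', \$string or die "open: $!";
binmode $fh;

# Read one byte at a time until read() returns 0 at end of data.
my @seen;
while (read($fh, my $byte, 1)) {
    push @seen, ord $byte;
}
my $pos_at_end = tell $fh;                 # offset after the last byte

# Jump back to an absolute offset and re-read a single byte.
seek $fh, 6, 0 or die "seek: $!";          # whence 0 = from the start
read $fh, my $nul, 1;
close $fh;

printf "%d bytes read, byte 6 is %d\n", scalar @seen, ord $nul;
```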

      Doing a read of one byte does indeed work better than changing the input record size as suggested by an earlier reply.

      It also runs an order of magnitude slower than the next-slowest method under discussion!

      How it works under the covers? It uses substr.

      —John

Benchmark Results
by John M. Dlugosz (Monsignor) on Aug 12, 2002 at 21:48 UTC
    Thus far, unpack "C*" is the fastest. vec is 9% faster on a small input, 2% on a larger input, so there may be setup overhead there?

    substr is about 10% slower than vec.

    The regex/g is 1/3 to 1/2 the speed of substr. And using IO::Scalar is ten times slower than that!

    —John

Re: How to process each byte in a binary file?
by kschwab (Priest) on Aug 12, 2002 at 22:32 UTC
    Okay, looks like my IO::Scalar suggestion is not going to work. I guess that leaves unpack(), substr(), split(), and the regex?

    Looks like unpack() is the clear winner on my machine:

    #!/usr/bin/perl
    use Benchmark;

    my $string = "X" x 102400;

    timethese(100, {
        'split'  => sub { for (split(//, $string)) {} },
        'unpack' => sub { for (unpack("C*", $string)) {} },
        'regex'  => sub { while ($string =~ /./sg) {} },
        'substr' => sub {
            for (my $i = 0; $i < length($string); $i++) {
                substr($string, $i, 1);
            }
        },
    });

    Gives me:

    $ perl foo
    Benchmark: timing 100 iterations of regex, split, substr, unpack...
         regex: 44 wallclock secs (43.13 usr +  0.00 sys = 43.13 CPU)
         split: 49 wallclock secs (47.90 usr +  0.04 sys = 47.94 CPU)
        substr: 58 wallclock secs (55.70 usr +  0.00 sys = 55.70 CPU)
        unpack: 27 wallclock secs (26.48 usr +  0.00 sys = 26.48 CPU)
    
    
    Update: Reposted results after correcting typo.

      I get similar results: split is between regex and substr. Makes me wonder, though, since split// is a "special case" that splits on every character, why it isn't simply as fast as unpack?

      —John
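
One difference worth keeping in mind when comparing the two: `split //` and `unpack "C*"` do not return the same thing. The sketch below only shows what each call produces. Whether that difference explains the timing gap is not established here. The `(a)*` template uses group syntax that needs Perl 5.8 or later, and the sample string is made up.

```perl
use strict;
use warnings;

my $str = "AZ\x00";                        # sample data, chosen for illustration

my @chars  = split //, $str;               # one-character strings: "A", "Z", "\0"
my @codes  = unpack "C*", $str;            # unsigned integers: 65, 90, 0
my @chars2 = unpack "(a)*", $str;          # one-character strings again, via unpack

print join(",", map { ord } @chars), "\n"; # prints "65,90,0"
print join(",", @codes), "\n";             # prints "65,90,0"
```

Code that needs the numeric value of each byte has to call `ord` on every element from `split`, which `unpack "C*"` has already done.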

        I added a test case for just using read(FILE, $byte, 1) from a real file, and it's about the same speed as the unpack() on a string (for largish strings).

        Of course, this leaves the file open the whole time, but it's wonderfully simple :) I also have a very expensive NetApp filer helping the speed with read-ahead and a huge cache, so YMMV.

Re: How to process each byte in a binary file?
by jmcnamara (Monsignor) on Aug 12, 2002 at 22:42 UTC

    I'd guess that unpack is the fastest, but if you are looking for alternatives to benchmark you could try this:

        for (split //, $str, length $str) { ... }

    Regardless of the method you choose it would probably be best to read and process the file in chunks. Playing around with the buffer size might lead to an optimization between the size of the read and size of the data to process:

    #!/usr/bin/perl -w
    use strict;

    open FILE, 'reload.xls' or die "Error message here: $!";
    binmode FILE;    # as required

    my $buffer = 4096;
    my $str;

    while (read FILE, $str, $buffer) {
        for (split //, $str, $buffer) {
            # Your code here
        }
    }

    --
    John.
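
To try different buffer sizes as suggested above, the chunked read can be wrapped in a sub and handed to Benchmark's `cmpthese`. This is a rough sketch under assumptions: the helper name `sum_bytes`, the three buffer sizes, the 256 KB temporary file, and the use of `unpack "C*"` in place of `split` are all illustrative choices. With only three iterations Benchmark may warn that the count is too low to be reliable, so raise it for real measurements.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Create a 256 KB binary file so the sketch is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} pack("C*", 0 .. 255) x 1024;  # 262,144 bytes
close $tmp or die "close: $!";

# Hypothetical helper: reads $file in $bufsize chunks and sums every byte.
sub sum_bytes {
    my ($file, $bufsize) = @_;
    open my $in, '<', $file or die "open $file: $!";
    binmode $in;
    my $sum = 0;
    while (read($in, my $chunk, $bufsize)) {
        $sum += $_ for unpack "C*", $chunk;
    }
    close $in;
    return $sum;
}

# One benchmark entry per assumed buffer size.
my %tests = map {
    my $size = $_;
    ("buf_$size" => sub { sum_bytes($path, $size) })
} (512, 4096, 65536);

cmpthese(3, \%tests);
```

The results depend heavily on the filesystem and cache, so they are only meaningful on the machine and data you actually care about.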
