http://www.perlmonks.org?node_id=1044445

Ineffectual has asked for the wisdom of the Perl Monks concerning the following question:

Hey all,

I'm attempting to unpack an 8-bit unsigned char matrix that was generated in Python. I'm not sure how to do this in Perl. Here's what I've attempted so far.

my $file = $matrixPath.'/'.$info{$transcriptName}{'PATH'};
my $fileSize = -s $file;
open IN, $file or die "Can't open inputfile $file $!";
binmode(IN);
my $buffer;
my $nRows = $info{$transcriptName}{'CNT'};
print "reading file $file with size $fileSize with nCols $nCols and nrows $nRows\n";
while ( my $read = sysread(IN, $buffer, $fileSize)) {
    my ($data) = unpack("C*", $read);
    print " data is ".Dumper $data;
    print sprintf("%08b", $data)."\n";
}
close IN;
This will print out:
reading file 05398.bin with size 90942 with nCols 3954 and nRows 23
nCols stays static for all of the files, but nRows changes for each file.
So this should come out to be a matrix that has 3954 columns and 23 rows.
I've tried using C$nCols or C$nRows in the split, or b$nRows or B$nRows, but none seems to give the appropriate output. Each cell of the matrix should contain one number, 0 or 1. I've tried splitting on length($read) or $nRows before unpacking. When I do an unpack('C*', $read) into an array, the number of elements is generally 5-6, not the length of my rows. I'm stumped. Help, Perl Monks!

Replies are listed 'Best First'.
Re: Unpack an 8 bit unsigned char matrix
by BrowserUk (Patriarch) on Jul 15, 2013 at 20:22 UTC

    The Python code doesn't show the value of nRow or from where it is obtained.

    And your words:

    reading file 05398.bin with size 90942 with nCols 3954 and nRows 23
    nCols stays static for all of the files, but nRows changes for each file.

    contradict the evidence of the Python code, which has nRow preset (somewhere) and calculates the value of nCols.

    But, taking you at your word, and assuming the files consist of N rows of 3954 columns, you could do this:

    use constant NCOLS => 3954;

    my $file = $matrixPath.'/'.$info{$transcriptName}{'PATH'};
    my $fileSize = -s $file;
    die 'Bad filesize' unless $fileSize % NCOLS == 0;

    open IN, $file or die "Can't open inputfile $file $!";
    binmode(IN);

    ## Note: You're reading the whole file in a single read. NO NEED FOR A LOOP.
    sysread(IN, my $buffer, $fileSize) or die "Can't read $file: $!";
    close IN;

    ## The first (rightmost) unpack splits the buffer into NCOLS length sections of bytes.
    ## The second unpack breaks each of those into its individual (uchar) integers
    ## and puts them in an anonymous array.
    ## The map assigns all the anonymous arrays to @matrix.
    # Update: template corrected, see posts below.
    my @matrix = map[ unpack 'C*', $_ ], unpack '(a' . NCOLS . ')*', $buffer;
    ...
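One detail in the code above is easy to miss: it unpacks $buffer, not the return value of sysread. sysread places the data in its second argument and returns only the number of bytes read, so unpacking the return value unpacks the digits of that number. That would be consistent with the 5-6 elements reported in the question. A minimal sketch, assuming a single read returned all 90942 bytes (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Assumption: sysread returned the byte count for the whole 90942-byte file.
my $read = 90942;

# unpack stringifies the count and yields one character code per digit.
my @codes = unpack 'C*', $read;

print scalar(@codes), " elements: @codes\n";

# The first code is what the original loop would have formatted as binary.
printf "%08b\n", $codes[0];
```

The five values are the character codes of '9', '0', '9', '4' and '2', not matrix cells.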

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi BrowserUk! Thanks for the help!

      The Python code gets the number of rows from a text file that was written when the matrix was written. So for each file, I have a text file that tells me the file size and the number of rows in that file (along with other metadata).

      When I use the code above, it gives back one row with 3954 columns. However, most of the files have more than one row of information. Is it truncating the rest? Do I need to use a while loop to only get fileSize/$nRows bytes and process those?

        A quick (not binary, but same difference) demo:

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];

        sysread( DATA, my $buffer, 80 ) or die $!;

        my @matrixX10 = map[ unpack 'C*', $_ ], unpack '(a10)*', $buffer;
        pp\@matrixX10;

        my @matrixX5 = map[ unpack 'C*', $_ ], unpack '(a5)*', $buffer;
        pp\@matrixX5;

        __DATA__
        12345678901234567890123456789012345678901234567890123456789012345678901234567890

        Produces:

        C:\test\primes>..\junk94
        [
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53, 54, 55, 56, 57, 48],
        ]
        [
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
          [49, 50, 51, 52, 53],
          [54, 55, 56, 57, 48],
        ]
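For a version of the same demo on real binary data, here is a rough round-trip sketch. It writes a tiny made-up 0/1 matrix to a temporary file as raw unsigned bytes, then reads it back with a single sysread and the chunked unpack. The 3x4 matrix, the $ncols value and the use of File::Temp are assumptions for illustration; the thread's files use 3954 columns.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $ncols = 4;                                          # hypothetical; the thread uses 3954
my @rows  = ( [0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0] );  # made-up 0/1 cells

# Write the cells as raw unsigned bytes, one row after another.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} pack('C*', map { @$_ } @rows);
close $out or die "Can't close $path: $!";

# Read the whole file back with one sysread call.
my $size = -s $path;
die "Bad file size $size" if $size % $ncols;
open my $in, '<:raw', $path or die "Can't open $path: $!";
my $got = sysread($in, my $buffer, $size);
die "Short read on $path" unless defined $got && $got == $size;
close $in;

# Split into $ncols-byte rows, then turn each row into a list of integers.
my @matrix = map { [ unpack 'C*', $_ ] } unpack "(a$ncols)*", $buffer;

printf "%d rows x %d cols\n", scalar @matrix, scalar @{ $matrix[0] };
print join(' ', @$_), "\n" for @matrix;
```

The number of rows falls out of the file size divided by the column count, so it does not need to be passed to unpack.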

        Broken code replaced above:



        I apologise. I omitted an essential part of the first unpack template (the repeat *). Please try substituting this:

        my @matrix = map[ unpack 'C*', $_ ], unpack '(a' . NCOLS . ')*', $buffer;
        The template has become: '(a3954)*' which tells unpack to split the buffer into as many 3954-byte chunks as are available.
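To make the effect of the repeat count concrete, here is a small sketch comparing the group template with and without the trailing *. The 12-byte buffer and the width of 4 are made-up stand-ins for a real file and its 3954 columns.

```perl
use strict;
use warnings;

# Twelve made-up bytes standing in for a 3-row, 4-column file.
my $buffer = pack 'C*', 0, 1, 1, 0,  1, 0, 0, 1,  1, 1, 0, 0;

my @first_only = unpack '(a4)',  $buffer;   # group applied once: one chunk
my @all_rows   = unpack '(a4)*', $buffer;   # group repeated to the end of the buffer

# Convert each 4-byte chunk into an array of integers.
my @matrix = map { [ unpack 'C*', $_ ] } @all_rows;

printf "without *: %d chunk(s)\n", scalar @first_only;
printf "with *:    %d chunk(s)\n", scalar @all_rows;
print "row $_: @{ $matrix[$_] }\n" for 0 .. $#matrix;
```

Without the *, only the first row comes back and the rest of the buffer is silently ignored, which matches the single 3954-column row reported earlier in the thread.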
