Pack/Unpack Tutorial (aka How the System Stores Data)

by pfaut (Priest)
on Jan 06, 2003 at 16:56 UTC ( [id://224666]=perltutorial )

Pack/Unpack Tutorial

A recent conversation in the chatterbox gave me the idea to write this. A beginning programmer was trying to encode some information with pack and unpack but was having trouble coming to grips with exactly how they work. I have never had trouble with them but I came to programming from a hardware background and I'm very familiar with assembly and C programming. People who have come to programming recently have probably never dealt with things at such a low level and may not understand how a computer stores data. A little understanding at this level might make pack and unpack a little easier to figure out.

Why we need pack and unpack

Perl can handle strings, integers and floating point values. Occasionally a Perl programmer will need to exchange data with programs written in other languages. These other languages have a much larger set of datatypes. They have integer values of different sizes. They may only be capable of dealing with fixed length strings (dare I say COBOL?). Sometimes, there may be a need to exchange binary data over a network with other machines. These machines may have different word sizes or even store values differently. Somehow, we need to get our data into a format that these other programs and machines can understand. We also need to be able to interpret the responses we get back.

Perl's pack and unpack functions allow us to read and write buffers of data according to a template string. The template string allows us to indicate specific byte orderings and word sizes or use the local system's default sizes and ordering. This gives us a great deal of flexibility when dealing with external programs.

In order to understand how all of this works, it helps to understand how computers store different types of information.

Integer Formats

Computer memory can be looked at as a large array of bytes. A byte contains eight bits and can represent unsigned values between 0 and 255 or signed values between -128 and 127. You can't do a whole lot of computation with such a small range of values so a modern computer's registers are larger than a byte. Most modern processors use 32 bit registers and there are some processors with 64 bit registers. A 32 bit register can store unsigned values between 0 and 4294967295 or signed values between -2147483648 and 2147483647.

When storing values greater than 8 bits long to memory, the value is broken up into 8 bit segments and stored in multiple consecutive storage locations. Some processors will store the segment containing the most significant bits in the first memory location and work up in memory with lesser segments. This is referred to as "big-endian" format. Other processors will store the least significant segment in the first byte and store more significant segments into higher memory locations. This is referred to as "little-endian" format.

This might be easier to see with a picture. Suppose a register contains the value 0x12345678 and we're trying to store it to memory at address 1000. Here's how it looks.

Address   Big-Endian Machine   Little-Endian Machine
1000      0x12                 0x78
1001      0x34                 0x56
1002      0x56                 0x34
1003      0x78                 0x12
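
You can see the same two layouts from Perl without knowing what kind of machine you are on. This is only a small sketch: it uses the 'N' and 'V' formats described in the table below and the 'H*' format (covered later) to show the bytes in memory order.

```perl
use strict;
use warnings;

my $value = 0x12345678;

# 'N' always writes 32 bits big-endian, 'V' always little-endian,
# regardless of the machine running the script.
my $big    = pack('N', $value);
my $little = pack('V', $value);

# unpack('H*', ...) shows the bytes in memory order as hex digits.
my $big_hex    = unpack('H*', $big);
my $little_hex = unpack('H*', $little);

print "big-endian:    $big_hex\n";     # 12345678
print "little-endian: $little_hex\n";  # 78563412
```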

If you have looked at perldoc -f pack or have looked up the pack function in Programming Perl, you have seen a table listing template characters with a description of the type of datum they match. That table lists integer formats of several sizes and byte orders. There are also signed and unsigned versions.

Format   Description
c,C      A signed/unsigned char (8-bit integer) value
s,S      A signed/unsigned short, always 16 bits
l,L      A signed/unsigned long, always 32 bits
q,Q      A signed/unsigned quad (64-bit integer) value
i,I      A signed/unsigned integer, native format
n,N      A 16/32 bit value in "network" (big-endian) order
v,V      A 16/32 bit value in "VAX" (little-endian) order

The s, l, and q formats pack 16, 32, and 64 bit values in the host machine's native byte order (q is available only if your perl was built with 64-bit integer support). The i format packs a value the size of the host C compiler's int, which is at least 32 bits. The n and v formats fix both the size and the storage order, which makes them useful for interchange with other systems.
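
If you want to check these sizes on your own system, something like the following should do. It is a sketch rather than a complete survey: 'q' is left out because not every perl supports it, and the size of 'i' is printed rather than assumed.

```perl
use strict;
use warnings;

# Number of bytes produced by each fixed-size integer format.
my %size = map { $_ => length(pack($_, 0)) } qw(c s l n N v V);
printf "%s: %d byte(s)\n", $_, $size{$_} for sort keys %size;

# The native 'i' format depends on the C compiler used to build perl,
# so its size is reported rather than assumed.
my $int_size = length(pack('i', 0));
print "i: $int_size byte(s) on this perl\n";

# The same byte reads differently as signed or unsigned.
my $byte     = "\xff";
my $signed   = unpack('c', $byte);   # -1
my $unsigned = unpack('C', $byte);   # 255
print "0xff as c: $signed, as C: $unsigned\n";
```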

Character Formats

Strings are stored as arrays of characters. Traditionally, each character was encoded in a single byte using some coding system like ASCII or EBCDIC. Newer encoding systems like Unicode use either multi-byte or variable length encodings to represent characters.

Perl's pack function accepts the following template characters for strings.

Format   Description
a,A      A null/space padded string
b,B      A bit (binary) string in ascending/descending bit order
h,H      A hexadecimal string, low/high nybble first
Z        A null terminated string

Strings are stored in successive increasing memory locations with the first character in the lowest address location.

Perl's pack function

The pack function accepts a template string and a list of values. It returns a scalar containing the list of values stored according to the formats specified in the template. This allows us to write data in a format that would be readable by a program written in C or another language or to pass data to a remote system through a network socket.

The template contains a series of letters from the tables above. Each letter is optionally followed by a repeat count (for numeric values) or a length (for strings). A '*' on an integer format tells pack to use this format for the rest of the values. A '*' on a string format tells pack to use the length of the string.
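
Here is a short sketch of how counts and '*' behave for numeric and string formats. The values are arbitrary and chosen only so the results are easy to recognize.

```perl
use strict;
use warnings;

# A count on a numeric format consumes that many values from the list.
my $three = pack('C3', 65, 66, 67);            # "ABC"

# '*' on a numeric format consumes all remaining values.
my $all = pack('C*', 72, 101, 108, 108, 111);  # "Hello"

# A count on a string format is a field length; short strings are padded.
my $fixed = pack('a6', 'abc');                 # "abc\0\0\0"

# '*' on a string format uses the string's own length.
my $exact = pack('a*', 'abc');                 # "abc"

printf "%-3s length %d\n", $_->[0], length($_->[1])
    for ['C3', $three], ['C*', $all], ['a6', $fixed], ['a*', $exact];
```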

Now, let's try an example. Suppose we're collecting some information from a web form and posting it for processing by our backend system which is written in C. The form allows a monk to request office supplies. The backend system wants to see input in the following format.

    struct SupplyRequest {
        time_t request_time;    // time request was entered
        int    employee_id;     // employee making request
        char   item[32];        // item requested
        short  quantity;        // quantity needed
        short  urgent;          // request is urgent
    };

After looking through our system header files, we determine that time_t is a long. To create a suitable record for sending to the backend, we could use the following.

$rec = pack( "l i Z32 s2", time, $emp_id, $item, $quan, $urgent);

That template says 'a long, an int, a 32 character null terminated string and two shorts'.

If monk number 217641 (hey! that's me!) placed an urgent order for two boxes of paperclips on January 1, 2003 at 1pm EST, $rec would contain the following (first line in decimal, second in hex, third as characters where applicable). Pipe characters indicate field boundaries.

Offset  Contents (increasing addresses left to right)
     0 160  44  19  62|  41  82   3   0|  98 111 120 101 115  32 111 102
        a0  2c  13  3e|  29  52  03  00|  62  6f  78  65  73  20  6f  66
                                       |   b   o   x   e   s       o   f
    16  32 112  97 112 101 114  99 108 105 112 115   0   0   0   0   0
        20  70  61  70  65  72  63  6c  69  70  73  00  00  00  00  00
             p   a   p   e   r   c   l   i   p   s
    32   0   0   0   0   0   0   0   0|   2   0|   1   0
        00  00  00  00  00  00  00  00|  02  00|  01  00

Let's figure out where all of that came from. The first template item is an 'l', which packs a long. A long is 32 bits or four bytes. The value that was stored came from the time function. The actual value was 1041444000 or 0x3e132ca0. See how that fits into the beginning of the buffer? My system has an Intel Pentium processor, which is little-endian, so the least significant byte (0xa0) is stored first.

The second template item is an 'i'. This calls for an integer of the machine's native size. The Pentium is a 32 bit processor so again we pack into four bytes. The monk's number is 217641 or 0x00035229.

The third template item is 'Z32'. This specifies a 32 character null terminated field. You can see the string 'boxes of paperclips' stored next in the buffer followed by zeros (null characters) until the 32 bytes have been filled.

The last template item is 's2'. This calls for two shorts which are 16 bit integers. This consumes two values from the list of values passed to pack. 16 bits get stored in two bytes. The first value was the quantity 2 and the second was the 1 indicating urgent. These two values occupy the last four bytes of the buffer.

Perl's unpack function

Unbeknownst to us when we wrote the web side of this application, someone was porting the backend from C to perl (something about eating dog food, I don't think I heard it right). But, since we've already written the web side of the application, they figured they would just use the same data format. Therefore, they need to use unpack to read the data we sent them.

unpack is kind of the opposite of pack. pack takes a template string and a list of values and returns a scalar. unpack takes a template string and a scalar and returns a list of values.

Theoretically, if we give unpack the same template string and the scalar produced by pack, we should get back the list of values we passed to pack. I say theoretically because if the unpacking is done on a machine with a different byte order (big vs. little endian) or a different word size (16, 32, 64 bit), unpack might interpret the data differently than pack wrote it. The formats we used all used our machine's native byte order and 'i' could be different sizes on different machines so we could be in trouble. But in our simple case, we'll assume the backend runs on the same machine as the web interface.
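
If the backend might run on a different machine, one way around the problem is to pin down the sizes and byte order in the template. The sketch below assumes both sides agree on little-endian 32-bit and 16-bit fields; that is my assumption for illustration, not something the C struct guarantees. Note that 'V' and 'v' are unsigned, so negative values would need extra care.

```perl
use strict;
use warnings;

my ($when, $emp_id, $item, $quan, $urgent) =
    (1041444000, 217641, 'boxes of paperclips', 2, 1);

# Explicit little-endian sizes: V = 32 bits, v = 16 bits.
my $rec = pack('V V Z32 v2', $when, $emp_id, $item, $quan, $urgent);

print length($rec), " bytes\n";      # 44
print unpack('H16', $rec), "\n";     # a02c133e29520300

# The same template reads the record back on any architecture.
my @fields = unpack('V V Z32 v2', $rec);
print join(', ', @fields), "\n";
```

These are the same bytes shown in the dump earlier, but they no longer depend on which processor did the packing.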

To unpack the data we wrote, the backend program would use a statement like this.

($order_time, $monk, $itemname, $quantity, $ignore) = unpack( "l i Z32 s2", $rec );

Notice that the template string is identical to the one we used above for packing and the same information is returned in the same order (except they used $ignore where we packed with $urgent, what are they trying to say?).
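
When a record has more than a few fields, a long list of scalars gets hard to read. One option is to unpack into a hash slice. This is just a sketch; the field names are ones I made up to match the C struct, and the record is built in the same script so the example runs on its own.

```perl
use strict;
use warnings;

my $template = 'l i Z32 s2';
my $rec = pack($template, 1041444000, 217641, 'boxes of paperclips', 2, 1);

# Hypothetical field names, listed in template order.
my @names = qw(request_time employee_id item quantity urgent);

# A hash slice assigns each unpacked value to its name.
my %order;
@order{@names} = unpack($template, $rec);

print "$_ = $order{$_}\n" for @names;
```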

Integer Formats
aka, Why all those template types?

You may be asking why there are so many different ways to write the same data type. 'i', 'l', 'N', and 'V' could all be used to write a 32 bit integer to a buffer. Why use any specific one? Well, that depends on what you are trying to exchange information with.

If you are only going to be exchanging information with programs on the same machine, you can use 'i', 'l', 's', and 'q' and their uppercase unsigned counterparts. Since both the reading and writing programs will be running on the same system architecture, you might as well use the native formats.

If you are writing a program to read files whose layout is architecture specific, use the 'n', 'N', 'v' and 'V' formats. This way, you will know that you are interpreting the information correctly no matter what architecture your program is running on. For example, the 'wav' file format is defined for Windows on the Intel processor which is little endian. If you were trying to read the header of a 'wav' file, you should use 'v' and 'V' to read out 16 and 32 bit values respectively.
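
To make the 'wav' example concrete, here is a sketch that unpacks a header with 'v' and 'V'. It assumes the common 44-byte PCM layout, and it builds a synthetic header in memory instead of reading a real file. Real files can contain additional chunks, so don't treat this as a complete parser.

```perl
use strict;
use warnings;

# Layout assumed here: the common 44-byte PCM header. Real files may
# carry extra chunks, so treat this as an illustration only.
my $template = 'a4 V a4 a4 V v v V V v v a4 V';

# Build a synthetic header instead of reading a real file.
my $header = pack($template,
    'RIFF', 36, 'WAVE',        # RIFF chunk: id, size, format
    'fmt ', 16, 1, 2,          # fmt chunk: id, size, PCM, 2 channels
    44100, 176400, 4, 16,      # sample rate, byte rate, block align, bits
    'data', 0);                # data chunk: id, size

my ($riff, $size, $wave, $fmt, $fmt_len, $format, $channels,
    $rate, $byte_rate, $align, $bits, $data, $data_len)
    = unpack($template, $header);

print "$channels channels, $rate Hz, $bits bits per sample\n";
```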

The 'n' and 'N' formats are called "network order" for a reason: they are the order specified for TCP/IP communications. If you are doing certain types of network programming, you will need to use these formats.
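
As a small illustration, here is a made-up message header holding a 16-bit port and a 32-bit IPv4 address in network order. The header layout is hypothetical; only the 'n', 'N' and 'C' formats come from pack itself.

```perl
use strict;
use warnings;

# A hypothetical message header: 16-bit port followed by a 32-bit
# IPv4 address, both in network (big-endian) order.
my $port = 8080;
my $addr = (127 << 24) | 1;          # 127.0.0.1 as a 32-bit integer

my $wire = pack('n N', $port, $addr);
print unpack('H*', $wire), "\n";     # 1f907f000001

# Reading it back: 'n' for the port, then four single bytes so the
# address can be shown as a dotted quad.
my ($got_port, @octets) = unpack('n C4', $wire);
print "port $got_port, address ", join('.', @octets), "\n";
```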

String formats

Choosing between the string formats is a little different. You would probably choose between 'a', 'A' and 'Z' depending on the language of the other program. If the other program is written in C or C++, you probably want 'a' or 'Z'. 'A' would be a good choice for COBOL or FORTRAN.

'a', 'A', and 'Z' formats

When packing, 'a' and 'Z' with a count fill extra locations with nulls. 'A' fills the extra locations with spaces. When unpacking, 'A' removes trailing spaces and nulls, 'Z' strips everything after the first null, and 'a' returns the full field as is.

Examples

pack('a8',"hello")            produces "hello\0\0\0"
pack('Z8',"hello")            produces "hello\0\0\0"
pack('A8',"hello")            produces "hello   "
unpack('a8',"hello\0\0\0")    produces "hello\0\0\0"
unpack('Z8',"hello\0\0\0")    produces "hello"
unpack('A8',"hello   ")       produces "hello"
unpack('A8',"hello\0\0\0")    produces "hello"

'b' and 'B' formats

The 'b' and 'B' formats pack strings consisting of '0' and '1' characters to bytes and unpack bytes to strings of '0' and '1' characters. Perl treats even valued characters as 0 and odd valued characters as 1 while packing. The difference between the two is the order of the bits within each byte. With 'b', the bits are specified in increasing order. With 'B', in descending order. The count represents the number of bits to pack.

Examples

ord(pack('b8','00100110'))    produces 100 (4 + 32 + 64)
ord(pack('B8','00100110'))    produces 38 (32 + 4 + 2)
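
Going the other way, unpack turns bytes back into strings of '0' and '1'. The sketch below also shows a commonly used idiom for getting the binary text of a number by packing it big-endian first.

```perl
use strict;
use warnings;

my $byte = chr(38);                      # 0x26

my $descending = unpack('B8', $byte);    # most significant bit first
my $ascending  = unpack('b8', $byte);    # least significant bit first
print "B8: $descending\n";               # 00100110
print "b8: $ascending\n";                # 01100100

# Binary text of a 32-bit number: pack it big-endian, then unpack
# the four bytes as 32 bits in descending order.
my $bits = unpack('B32', pack('N', 5));
print "$bits\n";
```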

'h' and 'H' formats

The 'h' and 'H' formats pack strings containing hexadecimal digits. 'h' takes the low nybble first, 'H' takes the high nybble first. The count represents the number of nybbles to pack. In case you were wondering, a nybble is half a byte.

Examples

Each of the following returns a two byte scalar.

pack('h4','1234')    produces 0x21,0x43
pack('H4','1234')    produces 0x12,0x34
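
With a '*' count, 'H' is a handy way to move between raw bytes and a readable hex string. A quick sketch, using an arbitrary four-byte value:

```perl
use strict;
use warnings;

my $bytes = pack('H*', 'deadbeef');    # four bytes: 0xde 0xad 0xbe 0xef
my $hex   = unpack('H*', $bytes);      # high nybble first
my $low   = unpack('h*', $bytes);      # low nybble first

print length($bytes), " bytes\n";      # 4
print "H*: $hex\n";                    # deadbeef
print "h*: $low\n";                    # eddaebfe
```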

Additional Information

Perl 5.8 includes its own tutorial for pack and unpack. That tutorial is a bit more in-depth than this one but some of the things it covers may be specific to perl 5.8. If you are still using perl 5.6, check your own documentation if things don't work as that tutorial describes.

There are more template characters that I haven't covered here. There are also ways to read and write counted ASCII fields as well as some additional tricks you can play with pack and unpack. Try perldoc -f pack on your system or refer to Programming Perl. And above all, don't be afraid to experiment (except on live programs). Use the DumpString function below to examine the buffers returned by pack until you understand how it manipulates data.
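
As a taste of the counted fields mentioned above, the '/' template character stores a length in front of each string. This is only a sketch; see perldoc -f pack for the full rules. I have used a one-byte count ('C') here, which limits each string to 255 bytes.

```perl
use strict;
use warnings;

# Pack: a one-byte length ('C') precedes each string ('a*').
my $packed = pack('C/a* C/a*', 'hello', 'monks');
print unpack('H*', $packed), "\n";    # 0568656c6c6f056d6f6e6b73

# Unpack: the count is read from the data itself, so 'a' carries
# no length of its own.
my @strings = unpack('C/a C/a', $packed);
print join(', ', @strings), "\n";     # hello, monks
```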

References

Programming Perl, Third Edition, Larry Wall, Tom Christiansen, and Jon Orwant, © 2000, 1996, 1991 O'Reilly & Associates, Inc. ISBN 0-596-00027-8

Thanks to bart for the reference to the pack/unpack tutorial from perl 5.8.

Thanks to Zaxo and jeffa for reviewing this document and sharing their own efforts at creating a tutorial.

Thanks to sulfericacid and PodMaster for inspiring this on the CB.

Example Code

The following program contains the examples in this document.

    #!/usr/bin/perl -w
    use strict;

    # dump the contents of a string as decimal and hex bytes and characters
    sub DumpString {
        my @a = unpack('C*',$_[0]);
        my $o = 0;
        while (@a) {
            my @b = splice @a,0,16;
            my @d = map sprintf("%03d",$_), @b;
            my @x = map sprintf("%02x",$_), @b;
            my $c = substr($_[0],$o,16);
            $c =~ s/[[:^print:]]/ /g;    # show unprintable bytes as blanks
            printf "%6d %s\n",$o,join(' ',@d);
            print " "x8,join('  ',@x),"\n";
            print " "x9,join('   ',split(//,$c)),"\n";
            $o += 16;
        }
    }

    # place our web order
    my $t = time;
    my $emp_id = 217641;
    my $item = "boxes of paperclips";
    my $quan = 2;
    my $urgent = 1;
    my $rec = pack( "l i Z32 s2", $t, $emp_id, $item, $quan, $urgent);
    DumpString($rec);

    # process a web order
    my ($order_time, $monk, $itemname, $quantity, $ignore) =
        unpack( "l i Z32 s2", $rec );
    print "Order time: ",scalar localtime($order_time),"\n";
    print "Placed by monk #$monk for $quantity $itemname\n";

    # string formats
    $rec = pack('a8',"hello");              # should produce 'hello\0\0\0'
    DumpString($rec);
    $rec = pack('Z8',"hello");              # should produce 'hello\0\0\0'
    DumpString($rec);
    $rec = pack('A8',"hello");              # should produce 'hello   '
    DumpString($rec);
    ($rec) = unpack('a8',"hello\0\0\0");    # should produce 'hello\0\0\0'
    DumpString($rec);
    ($rec) = unpack('Z8',"hello\0\0\0");    # should produce 'hello'
    DumpString($rec);
    ($rec) = unpack('A8',"hello   ");       # should produce 'hello'
    DumpString($rec);
    ($rec) = unpack('A8',"hello\0\0\0");    # should produce 'hello'
    DumpString($rec);

    # bit format
    $rec = pack('b8',"00100110");           # should produce 0x64 (100)
    DumpString($rec);
    $rec = pack('B8',"00100110");           # should produce 0x26 (38)
    DumpString($rec);

    # hex format
    $rec = pack('h4',"1234");               # should produce 0x21,0x43
    DumpString($rec);
    $rec = pack('H4',"1234");               # should produce 0x12,0x34
    DumpString($rec);
http://nbpfaus.net/~pfau/

Replies are listed 'Best First'.
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by diotalevi (Canon) on Jan 06, 2003 at 18:11 UTC

    I have a few comments and I'll just leave them here in no particular order:

    1. I wish you would have defined "word" prior to using it willy-nilly. It's jargon that your tutorial's audience isn't likely to be familiar with. From WordNet: a word is a string of bits stored in computer memory; "large computers use words up to 64 bits long".

    2. Bytes are almost always eight bits though that's not a universal constant. Perhaps it's infrequent enough that I didn't even need to mention but this one always gets my goat.

    3. Your use of "most" and "least" significant byte was also jargon. If you assume the value 0x12345678 then the most significant byte has the value 0x12 and the least significant has the value 0x78. From there the point on differently endian machines is just which order you start with when transcribing bytes.

    4. Your use of memory addresses is obfuscatory. This is better written as "Byte 0, byte 1, byte 2, byte 3". The only point at which a perl programmer cares about memory addresses is when doing non-perl programming or with the 'p' or 'P' format options. The point here is to indicate an order to the bytes in memory - that byte 0 might be located at a memory address 1000 is entirely beside the point.

    5. White space is allowed without consequence in an unpack/pack format. It's just ignored except when it's a fatal error. I haven't nailed it down but some uses of whitespace just don't parse. That may be a bug but it's worth noting. This just means that in general people should use whitespace in a format to enhance readability - it doesn't affect its operation.

    6. I've never been clear on the bit order within a byte - can you expand on that? I used to think that the differently endian machines also shuffled the bit order around as well. At this point I'm just confused.

      I've never been clear on the bit order within a byte - can you expand on that? I used to think that the differently endian machines also shuffled the bit order around as well. At this point I'm just confused.

      In many cases, you can ignore bit order since bits do not have separate memory addresses. "Byte order" matters because you can access a byte as either the "first" byte (lowest address) in a "string" of bytes or as the "high" or "low" byte in a multi-byte numeric value. If you don't try to do both, then "byte order" doesn't matter, but using pack or unpack often means you are looking at bytes both ways. But, there is no "first" bit in packed data.

      If you have a text format ("unpacked" string) that shows bits (or hexadecimal nybbles or octal "digits") in a specific order, then you may have to worry about "bit order" (or other sub-byte order) if you've got something not using the near-universal "most significant digit first" ordering that is used when writing numbers in any base. Of course, pack and unpack (quite unfortunately) make a mess of this, as noted in Re^2: pack/unpack 6-bit fields. (precision) and (tye)Re: Ascending vs descending bit order.

      Put another way, "byte order" is usually used in reference to a detail of a computer's design and "bit order" doesn't matter in this context. However, both "bit order" and "byte order" can be applied to text representations of data (or even other "unpacked" representations where bits from within a byte get encoded into multiple bytes/characters of some other representation).

      - tye        

      I acknowledge that my comments here come over a decade later, but still, ought to be made. This article was great, and I found it useful today (2014) for some work I am doing while bit-banging with perl. I found this article as a result of a very specific search with Google for what I am trying to do. These comments are my reply to the comments made just above here. In numbered order....

      1. I think that more people will know what "word" means, than what "willy-nilly" means. Remember the audience here: people wanting to use the pack/unpack functions.

      2. Bytes are almost eight bits, since the word "byte" is a contraction of "by eight", as in describing hardware design of memory. The context was while saying that there are a hundred or a thousand or a zillion memory addresses, BY EIGHT bits wide. In the old, old days, like magnetic core, there was a single bit of memory per location, or per cell. As things progressed, it was common to bring out a 'parallel' load or store, by eight bits. So yeah, eight bits.

      3. Captain obvious here... but this is exactly the point he was making. The most significant byte is often placed at the 'opposite' end in some systems compared to others. It's still the most significant but not always in the location where you would find the most significant byte.

      4. Obfuscatory unless someone is trying to read or write a memory-mapped location in memory, very typical of someone using perl to do this. Picking an arbitrary starting point like 0x1000 is better than starting at "byte 0", which implies that it has some special significance. It doesn't.

      5. White space in contrast to what, maybe a zero-fill? It's called white space because it doesn't show up on paper. Nulls, on the other hand, are often used to indicate end-of-string, which is something very different. Whitespace is printable; NUL is not.

      6. See #2 above for some clarification, although this is beyond the scope of this pack/unpack tutorial.

      Thank you again to the original author -- this was just the refresher I needed to use these awesome functions of perl.

      2018-07-08 Athanasius added paragraph tags

Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by fredopalus (Friar) on Jan 08, 2003 at 01:36 UTC
    People having trouble understanding some things in this tutorial may find Assembly Language Step-by-Step a very helpful book. It focuses more on the hardware aspects of programming.
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by NetWallah (Canon) on Sep 12, 2003 at 19:06 UTC
    Thanks, pfaut for a good tutorial.

    Your "Example code" has errors that need to be fixed.

    Undefined subroutine &main::DumpScalar called at <code-file> line 38.
    Replacing all 11 occurrences of "DumpScalar" by "DumpString" corrects the problem.
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by planetscape (Chancellor) on Apr 01, 2009 at 14:35 UTC
      thanks
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by planetscape (Chancellor) on Jul 04, 2006 at 14:40 UTC
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by almsdealer (Acolyte) on Apr 11, 2023 at 13:04 UTC
    Thank you for a great tutorial on pack/unpack. Still very useful 20 years later!
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by freonpsandoz (Beadle) on Aug 27, 2021 at 20:27 UTC

    "When unpacking, 'A' removes trailing spaces and nulls..." I couldn't find that important fact in the Perl documentation when I was trying to use 'A' with unpack. This entire article is too important to be so far down in the search results for documentation on pack/unpack. It would be nice if it could be integrated into perlpacktut.

Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by Anonymous Monk on Oct 09, 2008 at 07:37 UTC
    i had these errors when i ran the last example code on perl v5.8.8
    Possible unintended interpolation of @a in string at pack02.pl line 6.
    Possible unintended interpolation of @a in string at pack02.pl line 6.
    Possible unintended interpolation of @b in string at pack02.pl line 6.
    Possible unintended interpolation of @a in string at pack02.pl line 6.
    Possible unintended interpolation of @d in string at pack02.pl line 6.
    Global symbol "@a" requires explicit package name at pack02.pl line 6.
    Global symbol "$o" requires explicit package name at pack02.pl line 6.
    Global symbol "@a" requires explicit package name at pack02.pl line 6.
    Global symbol "@b" requires explicit package name at pack02.pl line 6.
    Global symbol "@a" requires explicit package name at pack02.pl line 6.
    Global symbol "@d" requires explicit package name at pack02.pl line 6.
    syntax error at pack02.pl line 11, near "my @d = map sprintf"
      (Might be a runaway multi-line ss string starting on line 6)
    Global symbol "@b" requires explicit package name at pack02.pl line 11.
    Global symbol "@b" requires explicit package name at pack02.pl line 12.
    Global symbol "$o" requires explicit package name at pack02.pl line 13.
    Global symbol "$o" requires explicit package name at pack02.pl line 15.
    Global symbol "@d" requires explicit package name at pack02.pl line 15.
    Global symbol "$o" requires explicit package name at pack02.pl line 18.
    Unmatched right curly bracket at pack02.pl line 19, at end of line
    pack02.pl has too many errors.
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by Anonymous Monk on Apr 29, 2009 at 00:28 UTC
    Great tutorial, despite having some experience with assembly-level programming I found the pack/unpack sections of the perlfunc man page very confusing and this tutorial cleared things up for me. Thanks!
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by gvenkat (Novice) on Mar 30, 2009 at 07:35 UTC
    Brilliant tutorial, Thanks for this! Cheers.
