### How is perl able to handle the null byte?

by muba (Priest)
on Jun 15, 2006 at 18:38 UTC
muba has asked for the wisdom of the Perl Monks concerning the following question:

I was wondering this, and hoping you could tell me.

First off: I know almost nothing about C, but I do know that C treats a null byte as the end of a string (or whatever those things are called in C).
But then how is perl able to have them included in strings?
Being written in C, a null byte in a perl string (let's say "abc\x00def") would logically just be "abc", with the "def" part being cut off by C.
Yet, that isn't the case.

How is that possible?

Re: How is perl able to handle the null byte?
by Joost (Canon) on Jun 15, 2006 at 18:52 UTC
Yup, C treats a 0 byte as the end of a string. And yes, perl is written in C. However, perl strings are not C strings.

In C a string is nothing more than a pointer to the first character of the string. C can't know how long a string is except by counting all the characters from the start until the first 0 byte. You need to know the length of a string in order to do things like copy or compare them.

Perl's strings are C structs that include (amongst other things) a pointer to the first character, AND the length of the string. Since perl doesn't need a special terminating character you can use 0 characters in the middle of a string too.

Note that internally perl strings are still 0-terminated in order to facilitate interoperation with functions that expect C style strings. Also note that sending strings with embedded 0 characters to system calls or XS libraries does not always work as you'd expect, though all the built-in string and IO operations (regexes, print etc) should work correctly.

A struct is basically like creating your own data type, right?

However, I understand now. That was fairly simple, actually...
Thank you!
A struct is a "complex" type. That means a struct contains a fixed number of fields that can be of different types themselves. A bit like a perl hash but with a fixed number of key-value pairs.

PerlGuts Illustrated has some nice diagrams.

Re: How is perl able to handle the null byte?
by hobbs (Monk) on Jun 15, 2006 at 20:33 UTC
To say that "C treats a null byte as the end of a string" is a bit unfair. C barely knows what a "string" is. A large amount of C code, including a great number of standard library functions, works with null-terminated strings. But that doesn't mean you can't write code that treats your own data however you like. Nothing is forcing you to stop at a null; there's just a certain class of pre-written functions that do so by convention. If you know better (because, for example, your string is stored together with its length), then that's fine and well. C only cares about bits and bytes.
Quite true. Strictly speaking, C doesn't have a string type: what's conventionally used instead is a pointer to a (single) character, which is equivalent to an array of characters because of the way C arrays work [1]. The "string type" in C is literally "char *".

[1] C arrays do not really have a length either: defining an array with a certain length only reserves that amount of memory; the length isn't stored anywhere.

Reminding me of a classic exchange from a CS class I took once (OK, the CS class I took...):

Student (slightly paraphrased): you mentioned that the address after the last member of the array is guaranteed to be a legal address, though you don't technically have it allocated to you. Don't a lot of people use that fact to just pretend their array indexes are 1-based instead of 0-based?

Professor: Lots of people jaywalk, too! Some of them get killed!

Ah, those were the days... ;-)

If God had meant us to fly, he would *never* have given us the railroads.
--Michael Flanders

Re: How is perl able to handle the null byte?
by ikegami (Pope) on Jun 15, 2006 at 22:02 UTC

Perl doesn't always handle \0 in strings properly.

>dir

Directory of dir

2006/06/15  05:58p      <DIR>          .
2006/06/15  05:58p      <DIR>          ..
0 File(s)              0 bytes
2 Dir(s)   2,322,644,992 bytes free

>echo file contents > file

>perl -e "open($fh, qq{file\0junk}) or die; print <$fh>;"
file contents

Be wary of system calls and library calls.

Update: Some might say this is the OS's fault, but I expect better of Perl. Conversion from SV to char* results in a loss of information, but Perl apparently assumes it's a safe operation. I shouldn't have to know implementation details of a language's builtins.

..., but I expect better of Perl.

Are you thinking that Perl should test every SV whose char* it passes to some system call, examine whether it contains a null byte at a position other than the last byte, and then what? Die? Issue a warning? Convert the embedded null to a space?

Except of course where the embedded null is a legitimate part of a multi-byte character--which means a full unicode verification of every string passed to every system API -- -X, chdir, chmod, chown, chroot, fcntl, glob, ioctl, link, lstat, mkdir, open, opendir, readlink, rename, rmdir, stat, symlink, sysopen, umask, unlink, utime, etc. etc.

Just because it is possible to do something--like embedding nulls in filenames--doesn't mean that it isn't an obviously bad idea; and hamstringing the performance of Perl and every other utility program in order to cater for idiots that ignore the obviously bad ideas would result in today's systems running with roughly the same performance as the 40MHz CPUs that became available in the late 1980s.

Give me pragmatism over perfection every time.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Well, especially for builtins perl should know if the char* it's handing off to the system call will be interpreted as a classic C string or as a wide character.

I question strongly that this is a serious performance issue - among other things, null-containing strings could easily be flagged as such, the way tainted strings are when running under taint mode.

As for what perl should do, I certainly think that a warning when running under -w is appropriate - this is at least as big a problem as interpolating an undef variable into a string. I might even be convinced that perl in taint mode should treat nul-containing strings as tainted when passing them to C APIs - that is, die.

--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@\$_}@/for@/
Except of course where the embedded null is a legitimate part of a multi-byte character--which means a full unicode verification of every string passed to every system API

Um, sorry, but if you're talking about the perl-internal representation of unicode, which is utf-8, the only thing that involves a null byte is the unicode code-point U+0000 (i.e. "NULL") -- and this, BTW, is simply the single-byte-null itself in utf-8. For every other utf-8 character, every byte is always non-null. And I don't know of any pre-unicode encodings that use nulls as parts of multi-byte characters.

If a string of octets is supposed to represent utf-16, then sure, we would expect some of those octets to be null -- each octet is supposed to be treated as half of a 16-bit binary "word"; but this is a very different situation. Here we are talking about something more akin to plain old raw binary data, not a string of characters that can be transmuted directly to a char* and treated as a string in C.

Good point. To be clear that this is not just a "windows" issue, I got the same behavior (scalar string truncated at the null byte when it was used as a file name in "open()") on Mac OS X, which is basically unix, so I'd expect all unix and linux systems to behave the same way as well -- if it's "the OS's fault", this is one of those rare cases where all OSes can share the blame equally.

As for expecting better of Perl, I don't know what all happens under the covers during the conversion from SV to char*, but if it were to include doing the extra steps to check for string-internal null bytes, so it could throw some sort of exception or warning whenever it found one, I would expect most scripts to slow down noticeably.

Some folks might view that cost as too high for the resulting benefit, so it's relatively more worthwhile to force you to know about these implementation details (especially in your case, since you happen to know them already, anyway :D ).

Re: How is perl able to handle the null byte?
by bart (Canon) on Jun 15, 2006 at 18:53 UTC
Perl doesn't use a delimiter as a string end marker; instead, it stores both the byte sequence and the length. So, a string can contain anything.
Re: How is perl able to handle the null byte?
by Moron (Curate) on Jun 16, 2006 at 09:26 UTC
C does not treat the null byte as the end of a string unless it is told to do so by the variable length string datatype - an array of char is not zero terminated and a pointer to char has no termination at all. In C you can even assign a pointer to char to the address of the first character of your variable length string in order to betray it by walking ad libitum off the end until the operating system detects an access violation - it's why strong typing isn't as safe as it's cracked up to be.

-M

Re: How is perl able to handle the null byte?
by zentara (Archbishop) on Jun 16, 2006 at 13:08 UTC
