PerlMonks  

Re: How is perl able to handle the null byte?

by ikegami (Pope)
on Jun 15, 2006 at 22:02 UTC (#555647)


in reply to How is perl able to handle the null byte?

Perl doesn't always handle \0 in strings properly.

>dir
 Directory of dir

2006/06/15  05:58p    <DIR>          .
2006/06/15  05:58p    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)   2,322,644,992 bytes free

>echo file contents > file

>perl -e "open($fh, qq{file\0junk}) or die; print <$fh>;"
file contents

Be wary of system calls and library calls.

Update: Some might say this is the OS's fault, but I expect better of Perl. Conversion from SV to char* results in a loss of information, but Perl apparently assumes it's a safe operation. I shouldn't have to know implementation details of a language's builtins.


Re^2: How is perl able to handle the null byte?
by graff (Chancellor) on Jun 15, 2006 at 23:56 UTC
    Good point. To be clear that this is not just a "windows" issue: I got the same behavior (scalar string truncated at the null byte when used as a file name in "open()") on macosx, which is basically unix, so I'd expect all unix and linux systems to behave the same way. If it's "the OS's fault", this is one of those rare cases where all OS's can share the blame equally.

    As for expecting better of Perl, I don't know what all happens under the covers during the conversion from SV to char*, but if it were to include the extra steps of checking for string-internal null bytes, so it could throw some sort of exception or warning whenever it found one, I would expect most scripts to slow down noticeably.

    Some folks might view that cost as too high for the resulting benefit, so it's relatively more worthwhile to force you to know about these implementation details (especially in your case, since you happen to know them already, anyway :D ).

Re^2: How is perl able to handle the null byte?
by BrowserUk (Pope) on Jun 16, 2006 at 00:22 UTC
    ..., but I expect better of Perl.

    What would your solution be?

    If you're thinking that Perl should test every SV whose char* it passes to some system call, examining whether it contains a null byte at any position other than the last -- then what? Die? Issue a warning? Convert the embedded null to a space?

    Except of course where the embedded null is a legitimate part of a multi-byte character--which would mean a full unicode verification of every string passed to every system API: -X, chdir, chmod, chown, chroot, fcntl, glob, ioctl, link, lstat, mkdir, open, opendir, readlink, rename, rmdir, stat, symlink, sysopen, umask, unlink, utime, etc.

    Just because it is possible to do something--like embedding nulls in filenames--doesn't mean that it isn't an obviously bad idea; and hamstringing the performance of Perl and every other utility program in order to cater for people who ignore the obviously bad ideas would leave today's systems running at roughly the performance of the 40MHz CPUs that became available in the late 1980s.

    Give me pragmatism over perfection every time.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Well, especially for builtins, perl should know whether the char* it's handing off to the system call will be interpreted as a classic C string or as a wide-character string.

      I question strongly that this is a serious performance issue - among other things, null-containing strings could easily be flagged as such, the way tainted strings are when running under taint mode.

      As for what perl should do, I certainly think that a warning when running under -w is appropriate - this is at least as big a problem as interpolating an undef variable into a string. I might even be convinced that perl in taint mode should treat null-containing strings as tainted when passing them to C APIs - that is, die.

      --
      @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
        I question strongly that this is a serious performance issue ...

        Think again.

        The taint flag is set at a few, very specific points of input, and it remains set until the string is modified.

        Think of all the different ways a string can be read in, constructed or modified. Interpolation of other strings, concatenation, join, pack, unpack, qq//, s///, tr///, substr, chomp, chop, sprintf, read, sysread, vec, promotion of IVs & NVs to PVs etc. etc. Every time a scalar is modified it would be necessary to recheck whether it now (or still) contains one or more null characters--and if it does, whether they are a legitimate part of a multibyte character or not.

        Still doubt the performance impact?


      Except of course where the embedded null is a legitimate part of a multi-byte character--which means a full unicode verification of every string passed to every system API

      Um, sorry, but if you're talking about the perl-internal representation of unicode, which is utf-8, the only thing that involves a null byte is the unicode code-point U+0000 (i.e. "NULL") -- and that, BTW, is simply the single null byte itself in utf-8. In every other utf-8 character, every byte is non-null. And I don't know of any pre-unicode encodings that use nulls as parts of multi-byte characters.

      If a string of octets is supposed to represent utf-16, then sure, we would expect some of those octets to be null -- each octet is supposed to be treated as half of a 16-bit binary "word"; but this is a very different situation. Here we are talking about something more akin to plain old raw binary data, not a string of characters that can be transmuted directly to a char* and treated as a string in C.

        You obviously know more about things unicode than I--I've had barely any reason to use them--so I'll ask you:

        Is there no possibility that when encoding a string to one of the many forms of unicode for output to an external system that there might legitimately be null bytes embedded within the string?

        If there isn't, then detecting and warning of embedded nulls would only require a single pass over every scalar passed to a system API, looking for nulls.

        If there is--and I feel sure that some of the MS wide character sets contain characters where one half of the 16-bit value can be null, but I don't have proof yet--then it would require two passes in order to guard against false positives causing spurious warnings/dies.


