PerlMonks
Re^7: Error binmode() on unopened filehandle

by Marshall (Canon)
on May 07, 2020 at 00:09 UTC


in reply to Re^6: Error binmode() on unopened filehandle
in thread Error binmode() on unopened filehandle

Of course, both are junk. Why would you only read the first 20,000 bytes?

There are a lot of scenarios where you might want to read the first part of a file without reading the whole file. I think some Unix file commands read the first 1-2 KB of a file to determine whether the file is text or binary. Or perhaps I want to concatenate some big .WAV files together: there is header info at the beginning of these files that needs to be interpreted. In the OP's question, this is a single .jpg, and there is no reason to read the file in chunks because the image has to be processed as a single unit. However, other scenarios do exist.

I do commend you for the choice of 8*1024 as the buffer size. That is a very good number with most file systems. Certain byte boundaries are important for the file system to work efficiently.


Replies are listed 'Best First'.
Re^8: Error binmode() on unopened filehandle
by ikegami (Patriarch) on May 07, 2020 at 18:57 UTC

    Re "There are a lot of scenarios": maybe, but the discussion at hand is about reading the entire file.

    I used 8*1024 because read reads in 8 KiB chunks anyway.

    $ perl -e'print "x" x 100_000' \
        | strace perl -e'read(\*STDIN, my $buf, 100_000)' 2>&1 \
        | grep -P 'read\(0,'
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 1696

    But the parameter refers to the number of characters to return, which can differ from the number of bytes read if an :encoding layer is used. So really, the number I picked is nothing to praise. If you want efficiency, it's probably best to use sysread with a very large number and decode afterwards.

      Thanks for your interesting strace demonstration. That surprised me.

      The reason why 8K is "good": the smallest unit of data that can be written to the disk is called a sector. For a bunch of historical and practical reasons, the most common value seen today is 512 bytes. There is no need for the file system to keep track of such a small unit, so the file system keeps track of blocks of sectors. An extremely common value for this smallest file-system data unit is 8 Kbytes, or 16 sectors. A combination of the df and du commands can show this on a Unix system; sorry, I don't have a Unix system right now to post an example. If you write a file with one byte in it, it will take 8K of space on the disk. It is more efficient to just start out with a buffer size that will make the file system happy (an increment of 8K). Bigger buffers typically help, but there are limits; I suspect there is not much to be gained once you are past 4*8192 bytes. Yes, sysread would have lower overhead, but the OP's situation doesn't sound like any kind of performance issue.

        I know that, but I've already explained why it's irrelevant. There's no correspondence between the parameter passed to read and the amount that needs to be read from disk, so saying that reading 8192 bytes from disk at a time is a good idea doesn't make requesting 8192 characters from read a good idea.

        Take, for example, text consisting entirely of ASCII characters save for one character with a 3-byte encoding. read(..., 8192) requires reading 8194 bytes from disk, so asking for 8190 characters would have been a better choice if reading 8192 bytes from disk at a time is optimal, as you claim.

        The only time one might be able to claim that providing a size of 8192 to read is a good choice is when reading text using a fixed-width encoding (so not UTF-8 or UTF-16le). These days, that would mostly be binary data, but using read at all to slurp a binary file is surely slower than using sysread. So even then, read(..., 8192) would be suboptimal.

        In fact, even with text files, using sysread and decoding afterwards is probably faster than using read with an encoding layer if you're interested in slurping the whole file.

        Your statements about performance seem quite uninformed.
