PerlMonks
Re^7: Error binmode() on unopened filehandle

by Marshall (Canon)
on May 07, 2020 at 00:09 UTC


in reply to Re^6: Error binmode() on unopened filehandle
in thread Error binmode() on unopened filehandle

Of course, both are junk. Why would you only read the first 20,000 bytes?

There are a lot of scenarios where you might want to read the first part of a file without reading the whole file. I think some Unix file commands read the first 1-2 KB of a file to determine whether the file is text or binary. Or perhaps I want to concatenate some big .WAV files together: there is header info at the beginning of these files that needs to be interpreted. In the OP's question, this is a single .jpg, and there is no reason to read the file in chunks because the image has to be processed as a single unit. However, other scenarios do exist.

I do commend you for the choice of 8*1024 as the buffer size. That is a very good number with most file systems. Certain byte boundaries are important for the file system to work efficiently.


Replies are listed 'Best First'.
Re^8: Error binmode() on unopened filehandle
by ikegami (Patriarch) on May 07, 2020 at 18:57 UTC

    Re "There are a lot of scenarios": maybe, but the discussion at hand is about reading the entire file.

    I used 8*1024 because read reads in 8 KiB chunks anyway.

    $ perl -e'print "x" x 100_000' \
        | strace perl -e'read(\*STDIN, my $buf, 100_000)' 2>&1 \
        | grep -P 'read\(0,'
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 1696

    But the parameter refers to the number of characters to return, which can differ from the number of bytes read if an :encoding layer is used. So really, the number I picked is nothing to praise. If you want efficiency, it's probably best to use sysread with a very large number and decode afterwards.

      Thanks for your interesting strace demonstration. That surprised me.

      The reason why 8K is "good": the smallest unit of data that can be written to the disk is called a sector. For a bunch of historical and practical reasons, the most common value seen today is 512 bytes. There is no need for the file system to keep track of such a small unit, so the file system keeps track of blocks of sectors. An extremely common value for this smallest file-system data unit is 8 Kbytes, or 16 sectors. A combination of the df and du commands can show this on a Unix system; sorry, I don't have a Unix system right now to post an example. If you write a file with one byte in it, it will take 8K of space on the disk. It is more efficient to just start out with a buffer size that will make the file system happy (an increment of 8K). Bigger buffers typically help, but there are limits; I suspect there is not much to be gained once you are past 4*8192 bytes. Yes, sysread would have lower overhead, but the OP's situation doesn't sound like any kind of performance issue.

        I know that, but I've already explained why it's irrelevant. There's no correspondence between the parameter passed to read and the amount that needs to be read from disk, so saying that reading 8192 bytes from disk at a time is a good idea doesn't make requesting 8192 characters from read a good idea.

        Take, for example, text consisting entirely of ASCII characters save for one character with a 3-byte encoding. read(..., 8192) requires reading 8194 bytes from disk, so asking for 8190 characters would have been a better choice if reading 8192 bytes from disk at a time is optimal, as you claim.

        The only time one might be able to claim that providing a size of 8192 to read is a good choice is when reading text using a fixed-width encoding (so not UTF-8 or UTF-16le). These days, that would mostly be binary data, but using read at all to slurp a binary file is surely slower than using sysread. So even then, read(..., 8192) would be suboptimal.

        In fact, even with text files, using sysread and decoding afterwards is probably faster than using read with an encoding layer if you're interested in slurping the whole file.

        Your statements about performance seem quite uninformed.
