Re^3: PerlIO file handle dup

by marioroy (Vicar)
on Mar 07, 2017 at 19:37 UTC


in reply to Re^2: PerlIO file handle dup
in thread PerlIO file handle dup

In this context, a record is one line; e.g. $/ = "\n". When the 3rd argument to read carries a 'k' or 'm' suffix (e.g. '4k'), read slurps that many bytes plus whatever remains of the current line, so each chunk ends on a record boundary rather than mid-line. This read behavior applies to MCE::Shared::Handle only. Without the 'k' or 'm' suffix, read behaves exactly like the native read.
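
For illustration, a minimal sketch of that behavior, assuming a handle constructed with mce_open and a hypothetical input.csv:

use MCE::Shared;

# mce_open constructs a handle backed by MCE::Shared::Handle
mce_open my $fh, '<', 'input.csv' or die "open error: $!";

# '4k' slurps ~4 KiB plus the remainder of the current line,
# so each chunk ends on a record boundary
while ( read $fh, my $buf, '4k' ) {
    my $chunk_id = $.;                 # $. holds the chunk id
    my @records  = split /\n/, $buf;
    # ... process @records ...
}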

Yes, I had thought about adding readlines at the time, but decided against it after writing the following.

my @lines = tied(*{$fh})->readlines(10);

In the end, I settled on having the file-handle specifics feel like native Perl, and it does. The 'k' or 'm' suffix (extra behavior) provides chunk IO. Likewise, $. gives you the chunk_id. One can get an estimate with "cat file.csv | head -500 | wc": take the byte count, divide by 1024, and append the 'k' suffix for use with read. IMHO, there's no reason for workers to receive the same number of lines. Some will get a little less, some a little more.
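
As a sketch of that estimate in Perl (the file name and helper sub are hypothetical):

# average the first 500 lines to pick a chunk size, e.g. '38k'
sub estimate_chunk_size {
    my ( $file, $n_lines ) = @_;
    my $bytes = 0;
    open my $in, '<', $file or die "open error: $!";
    while ( <$in> ) {
        $bytes += length;
        last if $. >= $n_lines;
    }
    close $in;
    return ( int( $bytes / 1024 ) || 1 ) . 'k';
}

# read $fh, my $buf, estimate_chunk_size( 'input.csv', 500 );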

A possibility that comes to mind is having MCE::Shared export "mce_read" to provide full MCE-like chunk IO capabilities. A value greater than 8192 would mean to read that many bytes, continuing to the end of the line. If doing so, the following would only work for handles constructed with mce_open.

# same as chunk_size => 1 in MCE
$n_lines = mce_read $fh, \@lines, 1;

# read max 500 lines
$n_lines = mce_read $fh, \@lines, 500;

# read 1m, including till the end of line
$n_lines = mce_read $fh, \@lines, '1m';

# read 16k, ditto regarding till the end of line
$n_lines = mce_read $fh, \@lines, '16k';

# same thing as above, but slurp into $buf
$n_chars = mce_read $fh, $buf, 500;
$n_chars = mce_read $fh, $buf, '1m';
$n_chars = mce_read $fh, $buf, '16k';

# $. gives chunk_id

Regards, Mario.

Replies are listed 'Best First'.
Re^4: PerlIO file handle dup
by chris212 (Scribe) on Mar 07, 2017 at 22:15 UTC

    It works great, and it is actually faster using MCE::Shared with chunked reading on uncompressed files than a dup'ed file handle and seeking to the correct position stored in a shared scalar. That holds even with the Text::CSV_XS module, although I need to call it once for every record rather than make a single call telling it to read 500 lines from a file handle (which I think would use getline 500 times). I don't see any improvement on output over dup'ed file handles with autoflush. The semaphores keeping the output in order already prevent concurrent writes, and they would still be needed with MCE::Shared.
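
    (For reference, a minimal sketch of the per-record parsing pattern described above; $chunk is assumed to hold one chunk returned by read:)

        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new( { binary => 1 } );

        # one parse call per record, instead of getline on a handle
        for my $line ( split /\n/, $chunk ) {
            $csv->parse( $line ) or die "CSV parse error\n";
            my @fields = $csv->fields;
            # ... process @fields ...
        }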

    Specifying the size of the chunk in bytes does make more sense for memory management since the size of each record can vary greatly from one file to the next. I think mce_read would be more intuitive, since you don't expect the usage of a core function like read to change like that, but I understand it now. Thanks!

    Maybe some other suggestions: an mce_read that returns an array (or array reference) of records, since it is already great at reading a chunk of records, just to save a split($/, $chunk). Also, a write that takes a chunk ID argument and keeps the output chunks in sequence. I'm not sure whether it should block until previous chunks are written (my script currently does) or buffer in memory and return before the data is eventually written (which could eat up memory if you read and process faster than you can write). A sketch of the buffering variant follows.
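
    (The write_ordered name and calling convention below are hypothetical, not part of MCE::Shared; in real use, %pending and $next_id would need to be shared across workers, e.g. via MCE::Shared:)

        my %pending;        # chunks that arrived ahead of their turn
        my $next_id = 1;    # next chunk id due on the output handle

        sub write_ordered {
            my ( $fh, $chunk_id, $output ) = @_;
            $pending{ $chunk_id } = $output;
            # flush every consecutive chunk that is now ready
            while ( exists $pending{ $next_id } ) {
                print {$fh} delete $pending{ $next_id };
                $next_id++;
            }
        }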

      Greetings,

      MCE::Shared was made to complement MCE and other parallel modules. After review, I realize that full parity with MCE regarding chunking was never the intention for MCE::Shared. Currently, MCE::Shared has limited chunking capabilities; that much was possible with little effort, simply by enhancing read with the 'k' and 'm' suffixes. Unfortunately, full chunk IO capability inside MCE::Shared is not likely anytime soon beyond what is possible now.

      Many examples have been provided for writing output in order. Another one was made moments ago, here. The workers there write directly to the output handle, orderly and serially, very much like testa.

      Cheers, Mario.
