in reply to Re: PerlIO file handle dup
in thread PerlIO file handle dup

If we read one record at a time, the input semaphore isn't needed. However, I'm reading 500 records at a time, and they need to be in sequence. I suppose if I read in and processed one record at a time, I could eliminate the input semaphore when MCE::Shared is being used (probably not for regular file handles). However, I think that would make output slower since each thread needs to block until its processed data is the next to be written.

I only put the yield in there because the first thread seemed to be hogging all the input before the other threads even started. In my actual script I'm not using MCE::Shared for the output file, and autoflush is needed to keep the output in order.

So this

read $fh, my($buf), '4k';

is the same but faster than this?

my $buf = <$fh>;

If it always reads exactly one entire record regardless of "chunk size", what does the chunk size do exactly? Or is the chunk size a minimum, then it continues reading until EOL? It is confusing that MCE's read works fundamentally differently from Perl's read.

I don't suppose there is a "readlines" function for MCE file handles? I assume if I could read all 500 lines at a time, that would minimize overhead related to MCE. For delimited input, I'm currently letting Text::CSV_XS read from the file handle, though.

Replies are listed 'Best First'.
Re^3: PerlIO file handle dup
by hippo (Chancellor) on Mar 07, 2017 at 16:25 UTC
    It is confusing that MCE's read works fundamentally differently from Perl's read.

    It's not clear what you mean by "MCE's read" but the snippet which you quoted as

    read $fh, my($buf), '4k';

    is most definitely Perl's read. HTH.

      MCE's read is how read behaves when an MCE::Shared file handle is used. MCE must install an I/O layer that makes Perl's read work differently, because the two do not behave the same. First of all, AFAIK, Perl's read does not support a size suffix such as "k". More importantly, Perl's read reads exactly the number of characters given in the third argument (unless it hits EOF). MCE's read does that, but then continues reading until it reaches $/, so it does not split records between reads. This is very useful, but still confusing. It would make more sense to have read behave the same with or without MCE, and to implement another function such as "getlines".


        Regarding MCE, MCE::Flow, and friends, the chunk_size option is dual mode. A value greater than 8192 means read that many bytes, then continue to the end of line. A value of 8192 or lower means read that many lines. I used the word lines to indicate the default $/ = "\n" or RS => "\n".
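        The dual-mode behavior described above can be sketched with MCE::Flow's file-processing form. This is an illustration only; 'input.txt' and the worker body are placeholders, not code from this thread.

```perl
# Minimal sketch of MCE::Flow's dual-mode chunk_size.
# chunk_size <= 8192 counts lines per chunk;
# chunk_size >  8192 counts bytes, then reads on to the end of line.
use strict;
use warnings;
use MCE::Flow;

mce_flow_f {
    chunk_size  => 500,   # <= 8192, so: 500 lines per chunk
    max_workers => 4,
}, sub {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    # $chunk_ref holds this chunk's lines, in their original order
    printf "chunk %d: %d lines\n", $chunk_id, scalar @{$chunk_ref};
}, 'input.txt';
```

        With chunk_size => '1m' instead, each worker would receive roughly one megabyte of input, extended to a line boundary.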

        Regarding MCE::Shared and MCE::Shared::Handle, which have native-like usage, read behaves like the native read function when the 'k' or 'm' suffix is absent. The 'k' or 'm' suffix is extra functionality that provides MCE-like chunking: basically, it makes read continue reading till the end of line.

        # this feels closer to Perl-like read, with extra functionality
        $n_chars = read $fh, $buf, "4k";

        # this is another option, but involves more typing
        @lines = tied(*{$fh})->getlines(500);

        $fh is a tied object so that it can be used natively. The extra behaviour applies to read only, for chunk-IO capabilities. Currently, it does not have 100% parity with MCE, because Perl's read function cannot store into an array. Therefore, I may have MCE::Shared and MCE::Shared::Handle export "mce_read", mentioned here, to reach 100% parity with MCE's chunking engine.

        Regards, Mario.

Re^3: PerlIO file handle dup
by marioroy (Vicar) on Mar 07, 2017 at 19:37 UTC

    In this context, a record is one line; e.g. $/ = "\n". When the 3rd argument to read contains the suffix 'k' or 'm', read slurps that many bytes (e.g. '4k') and then continues to the end of line, so a record is never split across reads. This read behavior applies to MCE::Shared::Handle only. Without the 'k' or 'm' suffix, read behaves exactly like the native read.

    Yes, I had thought about adding readlines at the time, but decided against it after writing the following.

    my @lines = tied(*{$fh})->readlines(10);

    In the end, I settled on having the file-handle specifics feel like native Perl, and they do. The 'k' or 'm' suffix (extra behavior) provides chunk IO. Likewise, $. gives you the chunk_id. One can get an estimate via "cat csv_file | head -500 | wc". Take the byte count, divide by 1024, and append the 'k' suffix for use with read. IMHO, there's no reason for workers to receive the same number of lines. Some will get a little less, some a little more.
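    The estimate above can also be done in Perl. This is a hypothetical helper for illustration (estimate_chunk_suffix is not part of MCE or MCE::Shared): it samples the first N lines of a file and returns a 'k' suffix value suitable for the read call shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: estimate a 'k' suffix for read() by measuring
# how many bytes the first $n_lines lines of the file occupy.
sub estimate_chunk_suffix {
    my ($path, $n_lines) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    my $bytes = 0;
    while (<$fh>) {
        $bytes += length;
        last if $. >= $n_lines;
    }
    close $fh;
    my $kb = int($bytes / 1024) + 1;   # round up to a whole KiB
    return "${kb}k";
}

# e.g. read $fh, my($buf), estimate_chunk_suffix('input.csv', 500);
```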

    A possibility that comes to mind is having MCE::Shared export "mce_read" to provide full MCE-like chunk-IO capabilities. A value greater than 8192 means read that many bytes, then continue to the end of line. If doing so, the following will only work for handles constructed with mce_open.

    # same as chunk_size => 1 in MCE
    $n_lines = mce_read $fh, \@lines, 1;

    # read max 500 lines
    $n_lines = mce_read $fh, \@lines, 500;

    # read 1m, including till the end of line
    $n_lines = mce_read $fh, \@lines, '1m';

    # read 16k, ditto regarding till the end of line
    $n_lines = mce_read $fh, \@lines, '16k';

    # same thing as above, but slurp into $buf
    $n_chars = mce_read $fh, $buf, 500;
    $n_chars = mce_read $fh, $buf, '1m';
    $n_chars = mce_read $fh, $buf, '16k';

    # $. gives chunk_id

    Regards, Mario.

      It works great, and it is actually faster using MCE::Shared with chunked reading on uncompressed files than using a dup'ed file handle and seeking to the correct position stored in a shared scalar. That includes using the Text::CSV_XS module, even though I need to call it once for every record rather than with a single call telling it to read 500 lines from a file handle (which I think would use getline 500 times). I don't see any improvement on output over dup'ed file handles with autoflush. The semaphores keeping the output in order already prevent concurrent writes, and would still be needed with MCE::Shared.

      Specifying the size of the chunk in bytes does make more sense for memory management since the size of each record can vary greatly from one file to the next. I think mce_read would be more intuitive, since you don't expect the usage of a core function like read to change like that, but I understand it now. Thanks!

      Some other suggestions: an mce_read that returns an array (or array reference) of records, since it is already great at reading a chunk of records; that would save a split($/, $chunk). Also maybe a write that takes a chunk-ID argument and keeps the output chunks in sequence. I'm not sure whether it should block until previous chunks are written (my script currently does), or buffer in memory and return before the data is eventually written (which could eat up memory if you read and process faster than you can write).


        MCE::Shared was made to complement MCE and other parallel modules. After review, I realize that it was never the intention for MCE::Shared to have 100% parity with MCE regarding chunking. Currently, MCE::Shared has limited chunking capabilities, but those came at little effort, simply by enhancing read with the (k,m) suffix. Unfortunately, full chunk-IO capability inside MCE::Shared is not likely anytime soon beyond what is possible now.

        Many examples were provided for writing output orderly. Another one was made moments ago, here. Workers there write directly to the output handle, orderly and serially, very much like testa.
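        One common idiom for ordered output, sketched below under the assumption that MCE::Candy is available ('input.txt', 'output.txt', and the uppercasing work are placeholders): workers gather ($chunk_id, $data), and MCE::Candy::out_iter_fh writes the chunks to the handle in chunk_id sequence, with no manual semaphores.

```perl
use strict;
use warnings;
use MCE::Flow;
use MCE::Candy;

open my $out_fh, '>', 'output.txt' or die "open: $!";

mce_flow_f {
    chunk_size => '1m',
    gather     => MCE::Candy::out_iter_fh($out_fh),
}, sub {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    my $result = join '', map { uc } @{$chunk_ref};  # placeholder work
    MCE->gather($chunk_id, $result);  # out_iter_fh reorders by chunk_id
}, 'input.txt';

close $out_fh;
```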

        Cheers, Mario.