Re: PerlIO file handle dup

by marioroy (Priest)
on Mar 07, 2017 at 08:45 UTC


in reply to PerlIO file handle dup

Greetings,

Welcome to the world of multi-threads, or cores for that matter. It's one thing to make this work on the UNIX platform and altogether something else on Windows. And let's not forget Cygwin.

For me, Perl is a box of crayons. MCE and MCE::Shared are my paintings. I imagine folks around the world joined together in a relay race. I have no idea what we're aiming for. It just happens to be my turn at this moment in time. The tools given to me are a box of crayons named Perl, a laptop, and an agent named Perseverance. Perseverance brought along a long-time friend named Grace. Grace invited Randomness, an Uncle.

The thing is that MCE and MCE::Shared may not be perfect. They are paintings, after all. Paintings take time to paint.

Regarding any slowness: IMHO, let the script fly. Oh, please do. For this use case, a semaphore is not necessary, nor is yield. Upon starting, a thread begins interacting with the shared-manager immediately. That is why the same thread ID is shown repeatedly for many lines in the output. Eventually, the 2nd thread finishes spawning and joins the 1st; thread 3 joins later still, having been spawned last.

I upped the count to 100k. The OP's script with semaphore + yield takes 3.6 seconds to run on a laptop (2.6 GHz Core i7 Haswell). Removing the semaphore + yield allows the script to complete in 1.1 seconds. The latter includes threads writing to a shared output handle. In case it was missed, I removed the line that autoflushes STDOUT (i.e. $| = 1). There's no reason to slow down IO. Let Perl fly. Ditto for MCE::Shared and workers.

use strict;
use threads;
use MCE::Shared;

{
    open my $fh, '|-', 'gzip > test.txt.gz';
    foreach (1..100000) {
        print {$fh} sprintf('%04d', $_).('abc123' x 10)."\n";
    }
    close $fh;
}

{
    mce_open my $fh, '-|', 'gzip -cd test.txt.gz' or die "open error: $!\n";
    mce_open my $out, '>', \*STDOUT or die "open error: $!\n";

    my @thrs;
    foreach (1..3) {
        push @thrs, threads->create('test');
    }
    $_->join() foreach @thrs;
    close($fh);

    sub test {
        my $tid = threads->tid();
        # using shared output to not garble among threads
        while ( my $line = <$fh> ) {
            print {$out} "thread: $tid, line: $., ".$line;
        }
    }
}

It can run faster. To be continued in the next post.

Regards, Mario.

Replies are listed 'Best First'.
Re^2: PerlIO file handle dup
by marioroy (Priest) on Mar 07, 2017 at 09:27 UTC

    Greetings,

    To decrease the number of trips to and from the shared-manager, one can give the 3rd argument to read a suffix: 'k' (* 1024) or 'm' (* 1024 * 1024). That enables chunk IO. Not to worry: the shared-manager keeps reading until it reaches the end of line or record. Notice $.; it is the chunk_id, not the actual line number. The chunk_id value is important when output order is desired.

    OP's script involving semaphore + yield: 3.6 seconds. Shared handle (non-chunking): 1.1 seconds.

    Below, chunking completes in 0.240 seconds, which is the total running time including the initial gzip.

    use strict;
    use threads;
    use MCE::Shared;

    {
        open my $fh, '|-', 'gzip > test.txt.gz';
        foreach (1..100000) {
            print {$fh} sprintf('%04d', $_).('abc123' x 10)."\n";
        }
        close $fh;
    }

    {
        mce_open my $fh, '-|', 'gzip -cd test.txt.gz' or die "open error: $!\n";
        mce_open my $out, '>', \*STDOUT or die "open error: $!\n";

        my @thrs;
        foreach (1..3) {
            push @thrs, threads->create('test');
        }
        $_->join() foreach @thrs;
        close($fh);

        sub test {
            my $tid = threads->tid();
            # using shared output to not garble among threads
            while (1) {
                my $n_chars = read $fh, my($buf), '4k';
                last if (!defined $n_chars || $n_chars <= 0);
                print {$out} "## thread: $tid, chunkid: $.\n".$buf;
            }
        }
    }

    Regards, Mario.

Re^2: PerlIO file handle dup
by chris212 (Scribe) on Mar 07, 2017 at 16:05 UTC

    If we read one record at a time, the input semaphore isn't needed. However, I'm reading 500 records at a time, and they need to be in sequence. I suppose if I read in and processed one record at a time, I could eliminate the input semaphore when MCE::Shared is being used (probably not for regular file handles). However, I think that would make output slower since each thread needs to block until its processed data is the next to be written.

    I only put the yield in there because the first thread seemed to be hogging all the input before the other threads even started. In my actual script I'm not using MCE::Shared for the output file, and autoflush is needed to keep the output in order.
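
    For reference, the pattern looks roughly like this; a minimal sketch, assuming Thread::Semaphore guards the input handle (the worker name and the 500-record batch size are illustrative):

    use Thread::Semaphore;

    my $in_sem = Thread::Semaphore->new(1);   # one reader at a time

    sub worker {
        while (1) {
            $in_sem->down;                    # serialize the batched read
            my @records;
            while (@records < 500 and defined(my $line = <$fh>)) {
                push @records, $line;
            }
            $in_sem->up;
            last unless @records;
            threads->yield;                   # give the other threads a chance to start
            # ... process @records, then write them in sequence ...
        }
    }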

    So this

    read $fh, my($buf), '4k';

    is the same but faster than this?

    my $buf = <$fh>;

    If it always reads exactly one entire record regardless of "chunk size", what does the chunk size do, exactly? Or is the chunk size a minimum, after which it continues reading until EOL? It is confusing that MCE's read works fundamentally differently from Perl's read.

    I don't suppose there is a "readlines" function for MCE file handles? I assume if I could read all 500 lines at a time, that would minimize overhead related to MCE. For delimited input, I'm currently letting Text::CSV_XS read from the file handle, though.

      It is confusing that MCE's read works fundamentally differently from Perl's read.

      It's not clear what you mean by "MCE's read", but the snippet you quoted as

      read $fh, my($buf), '4k';

      is most definitely Perl's read. HTH.

        MCE's read is how read behaves when an MCE::Shared file handle is used. MCE must install an I/O layer that changes how Perl's read works, because it does not behave the same. First of all, AFAIK, Perl's read does not support a size suffix such as "k". More importantly, Perl's read reads exactly the length of characters given in the 3rd argument (unless at EOF). MCE's read will do that, but then continues reading until it reaches $/, so it does not split records between reads. This is very useful, but still confusing. It would make more sense to have read behave the same with or without MCE, and to implement another function such as "getlines".

        MCE::Shared::Handle

      In this context, a record is one line; e.g. $/ = "\n". When the 3rd argument to read carries a suffix 'k' or 'm', it slurps up that amount (e.g. '4k') and then continues to the next end of line; it does not read to EOF. This read behavior applies to MCE::Shared::Handle only. Without the 'k' or 'm' suffix, read behaves exactly like the native read.
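
      For example, on a handle opened with mce_open, both forms below are valid; a small sketch (buffer names are illustrative):

      my $n1 = read $fh, my($buf1), 8192;   # no suffix: behaves like the native read
      my $n2 = read $fh, my($buf2), '8k';   # suffix: reads 8 KiB, then continues to the end of line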

      Yes, I had thought about adding readlines at the time, but decided against it after writing the following.

      my @lines = tied(*{$fh})->readlines(10);

      In the end, I settled on having the file-handle specifics feel like native Perl, and it does. The 'k' or 'm' suffix (extra behavior) provides chunk IO. Likewise, $. gives you the chunk_id. One can get an estimate with "cat file.csv | head -500 | wc -c". Take that, divide by 1024, and append the 'k' suffix to use with read. IMHO, there's no reason for workers to receive the same number of lines. Some will get a little less, some a little more.
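
      As a rough sketch of that estimate in Perl (the file name is illustrative):

      # estimate a chunk size covering about 500 lines
      my $bytes = 0;
      open my $in, '<', 'file.csv' or die "open error: $!\n";
      while (<$in>) { $bytes += length; last if $. == 500 }
      close $in;
      my $kib = int($bytes / 1024) + 1;   # round up to whole KiB
      # then: read $fh, my($buf), $kib . 'k';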

      A possibility that comes to mind is having MCE::Shared export "mce_read" to provide full MCE-like chunk IO capabilities. A value greater than 8192 would mean to read that number of bytes, continuing to the end of line. If so, the following would only work for handles constructed with mce_open.

      # same as chunk_size => 1 in MCE
      $n_lines = mce_read $fh, \@lines, 1;

      # read max 500 lines
      $n_lines = mce_read $fh, \@lines, 500;

      # read 1m, including till the end of line
      $n_lines = mce_read $fh, \@lines, '1m';

      # read 16k, ditto regarding till the end of line
      $n_lines = mce_read $fh, \@lines, '16k';

      # same as above, but slurp into $buf
      $n_chars = mce_read $fh, $buf, 500;
      $n_chars = mce_read $fh, $buf, '1m';
      $n_chars = mce_read $fh, $buf, '16k';

      # $. gives chunk_id

      Regards, Mario.

        It works great, and it is actually faster to use MCE::Shared with chunked reading on uncompressed files than a dup'ed file handle plus seeking to a position stored in a shared scalar. That includes using the Text::CSV_XS module, even though I need to call it once for every record rather than make a single call telling it to read 500 lines from a file handle (which I think would use getline 500 times). I don't see any improvement on output over dup'ed file handles with autoflush. The semaphores keeping the output in order already prevent concurrent writes, and would still be needed with MCE::Shared.

        Specifying the size of the chunk in bytes does make more sense for memory management since the size of each record can vary greatly from one file to the next. I think mce_read would be more intuitive, since you don't expect the usage of a core function like read to change like that, but I understand it now. Thanks!

        Maybe some other suggestions: an mce_read that returns an array (or array reference) of records, since it is already great at reading a chunk of records, just to save a split($/, $chunk); see the sketch below. Also maybe a write that takes a chunk ID argument and keeps the output chunks in sequence. I'm not sure whether it should block until previous chunks are written (my script currently does) or buffer in memory and return before the data is eventually written (which could eat up memory if you read and process faster than you can write).
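
        Something like this is the read-side pattern I have in mind; a minimal sketch, assuming a handle opened with mce_open (the worker name and '16k' size are illustrative):

        sub worker {
            while (1) {
                my $n = read $fh, my($chunk), '16k';   # chunk IO: reads to the end of line
                last unless $n;
                my $chunk_id = $.;                     # $. holds the chunk id here
                my @records  = split $/, $chunk;       # the split the suggestion would save
                # ... process @records, emit with $chunk_id to keep output in sequence ...
            }
        }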
