Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Fellow Monks,

I received a request from John Martel to process a large flat file and expand each record to many records based on splitting out items in field 4 delimited by semicolons. Each row in the output is given a unique ID starting with one while preserving output order.

Thank you, John. This is a great use-case for MCE::Relay (2nd example).

Input File -- Possibly larger than 500 GiB in size

foo|field2|field3|item1;item2;item3;item4;itemN|field5|field6|field7 bar|field2|field3|item1;item2;item3;item4;itemN|field5|field6|field7 baz|field2|field3|item1;item2;item3;item4;itemN|field5|field6|field7 ...

Output File

000000000000001|item1|foo|field2|field3|field5|field6|field7 000000000000002|item2|foo|field2|field3|field5|field6|field7 000000000000003|item3|foo|field2|field3|field5|field6|field7 000000000000004|item4|foo|field2|field3|field5|field6|field7 000000000000005|itemN|foo|field2|field3|field5|field6|field7 000000000000006|item1|bar|field2|field3|field5|field6|field7 000000000000007|item2|bar|field2|field3|field5|field6|field7 000000000000008|item3|bar|field2|field3|field5|field6|field7 000000000000009|item4|bar|field2|field3|field5|field6|field7 000000000000010|itemN|bar|field2|field3|field5|field6|field7 000000000000011|item1|baz|field2|field3|field5|field6|field7 000000000000012|item2|baz|field2|field3|field5|field6|field7 000000000000013|item3|baz|field2|field3|field5|field6|field7 000000000000014|item4|baz|field2|field3|field5|field6|field7 000000000000015|itemN|baz|field2|field3|field5|field6|field7 ...

Example One

This example configures a custom function for preserving output order. Unfortunately, the sprintf function alone involves extra CPU time causing the manager process to fall behind. The workers may idle while waiting for the manager process to respond to the gather request.

use strict; use warnings; use MCE::Loop; my $infile = shift or die "Usage: $0 infile\n"; my $newfile = 'output.dat'; open my $fh_out, '>', $newfile or die "open error $newfile: $!\n"; sub preserve_order { my ($fh) = @_; my ($order_id, $start_idx, $idx, %tmp) = (1, 1); return sub { my ($chunk_id, $aref) = @_; $tmp{ $chunk_id } = $aref; while ( my $aref = delete $tmp{ $order_id } ) { foreach my $line ( @{ $aref } ) { $idx = sprintf "%015d", $start_idx++; print $fh $idx, $line; } $order_id++; } } } MCE::Loop::init { chunk_size => 'auto', max_workers => 3, gather => preserve_order($fh_out) }; mce_loop_f { my ($mce, $chunk_ref, $chunk_id) = @_; my @buf; foreach my $line (@{ $chunk_ref }) { $line =~ s/\r//g; chomp $line; my ($f1,$f2,$f3,$items,$f5,$f6,$f7) = split /\|/, $line; my @items_array = split /;/, $items; foreach my $item (@items_array) { push @buf, "|$item|$f1|$f2|$f3|$f5|$f6|$f7\n"; } } MCE->gather($chunk_id, \@buf); } $infile; MCE::Loop::finish(); close $fh_out;

Example Two

To factor out sprintf from the manager process, another way is via MCE::Relay for incrementing the ID value. Workers obtain the current ID value and increment/relay for the next worker, ordered by chunk ID behind the scene. Workers call sprintf in parallel. This allows the manager process (out_iter_fh) to accommodate up to 32 workers and not fall behind. It also depends on IO performance, of course.

The MCE::Relay module is loaded automatically whenever the MCE init_relay option is specified.

use strict; use warnings; use MCE::Loop; use MCE::Candy; my $infile = shift or die "Usage: $0 infile\n"; my $newfile = 'output.dat'; open my $fh_out, '>', $newfile or die "open error $newfile: $!\n"; MCE::Loop::init { chunk_size => 'auto', max_workers => 8, gather => MCE::Candy::out_iter_fh($fh_out), init_relay => 1 }; mce_loop_f { my ($mce, $chunk_ref, $chunk_id) = @_; my @lines; foreach my $line (@{ $chunk_ref }) { $line =~ s/\r//g; chomp $line; my ($f1,$f2,$f3,$items,$f5,$f6,$f7) = split /\|/, $line; my @items_array = split /;/, $items; foreach my $item (@items_array) { push @lines, "$item|$f1|$f2|$f3|$f5|$f6|$f7\n"; } } my $idx = MCE::relay { $_ += scalar @lines }; my $buf = ''; foreach my $line ( @lines ) { $buf .= sprintf "%015d|%s", $idx++, $line } MCE->gather($chunk_id, $buf); } $infile; MCE::Loop::finish(); close $fh_out;

Relay accounts for the worker handling the next chunk_id value. Therefore, do not call relay more than once inside the block. Doing so will cause IPC to stall.

Regards, Mario

In reply to MCE gather and relay demonstrations by marioroy

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others studying the Monastery: (5)
    As of 2018-07-22 22:37 GMT
    Find Nodes?
      Voting Booth?
      It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?

      Results (456 votes). Check out past polls.