PerlMonks  

Re^4: Searching large files a block at a time

by JediWombat (Novice)
on Aug 03, 2017 at 23:56 UTC (#1196690)


in reply to Re^3: Searching large files a block at a time
in thread Searching large files a block at a time

Thanks again, Ken. I've built this code using your and Mario's responses:
    $/ = "";
    open my $fh, "-|", "/usr/bin/bzcat $file";
    while (<$fh>) {
        if (/uid=$mbnum/m) {
            print $_;
            last;
        }
    }

I've timed this version against all the others: it completed in 3.2 seconds, the previous version I built with your help took 8 seconds, and my original took 25.4! As I need to scan through three different LDIF files, that's a total of under 10 seconds, on average.
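For anyone following along, here is a minimal sketch of what setting $/ to the empty string does. It assumes records are separated by blank lines, as LDIF entries are, and reads from an in-memory string instead of bzcat so that it runs anywhere. The find_entry helper and the sample data are hypothetical and not taken from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first blank-line-separated record
# containing "uid=<value>", or nothing if no record matches.
sub find_entry {
    my ($fh, $uid) = @_;

    # Paragraph mode: each read returns one whole record, so the
    # match is tested once per entry rather than once per line.
    local $/ = "";

    while (my $record = <$fh>) {
        return $record if $record =~ /uid=$uid/m;
    }
    return;
}

# Made-up sample data standing in for a decompressed LDIF stream.
my $ldif = <<'END';
dn: uid=1001,ou=people,dc=example,dc=com
cn: Alice

dn: uid=2002,ou=people,dc=example,dc=com
cn: Bob
END

# Open the string as a read-only in-memory filehandle.
open my $fh, '<', \$ldif or die "cannot open in-memory handle: $!";
my $entry = find_entry($fh, 2002);
close $fh;

print defined $entry ? $entry : "no match\n";
```

Because the search returns at the first matching record, the rest of the stream is never read, which is the same early exit that `last` provides in the loop above.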

Replies are listed 'Best First'.
Re^5: Searching large files a block at a time
by marioroy (Parson) on Aug 04, 2017 at 04:41 UTC

    Hi JediWombat,

    The following demonstration processes all 3 LDIF files in parallel. It requires MCE 1.830 at minimum, because in earlier releases MCE::Relay stalls when the input record separator is changed. That said, I am posting the code before releasing 1.830 so that the changelog can reference the PerlMonks URL.

    Well, here is the code.

        #!/usr/bin/perl

        use strict;
        use warnings;

        # Requires MCE 1.830. Fixes MCE::Relay stalling from setting $/.
        use MCE::Loop 1.830;

        my $mbnum = $ARGV[0] or die "usage: $0 mbnum\n";

        my @ldif_files = qw(
            /path/to/file1.ldif.bz2
            /path/to/file2.ldif.bz2
            /path/to/file3.ldif.bz2
        );

        MCE::Loop->init(
            max_workers => scalar @ldif_files,
            chunk_size  => 1,
            init_relay  => ''
        );

        mce_loop {
            # my ($mce, $chunk_ref, $chunk_id) = @_;
            # When chunk_size 1 is specified, $_ contains
            # the same value as $chunk_ref->[0].
            my ($file, $ret) = ($_, '');

            # Must localize $/ to not stall MCE, fixed in 1.830.
            # Localizing $/ is recommended, but fixed MCE if not.
            local $/ = "";

            open my $fh, "-|", "/usr/bin/bzcat $file"
                or warn "open error ($file): $!";

            if (defined fileno($fh)) {
                while (<$fh>) {
                    if (/uid=$mbnum/m) {
                        $ret  = "## $file\n";
                        $ret .= $_;
                        last;
                    }
                }
                close $fh;
            }

            # Relay is beneficial for running a code block serially
            # and orderly. The init_relay option enables MCE::Relay.
            # All participating workers must call relay. Here, workers
            # write to STDOUT orderly starting with chunk_id 1.
            MCE::relay { print $ret };

        } \@ldif_files;

        MCE::Loop->finish;

    MCE 1.830 will be released on CPAN no later than Monday, the 7th of August. In the meantime, the MCE GitHub repository is current.

    Regards, Mario

      Another possibility is sending the result to the manager process via MCE->gather. MCE::Candy provides an ordered output iterator.

      Unlike the previous demonstration, this one does not require MCE 1.830.

          #!/usr/bin/perl

          use strict;
          use warnings;

          use MCE::Loop;
          use MCE::Candy;

          my $mbnum = $ARGV[0] or die "usage: $0 mbnum\n";

          my @ldif_files = qw(
              /path/to/file1.ldif.bz2
              /path/to/file2.ldif.bz2
              /path/to/file3.ldif.bz2
          );

          MCE::Loop->init(
              chunk_size  => 1,
              max_workers => scalar @ldif_files,
              gather      => MCE::Candy::out_iter_fh(\*STDOUT)
          );

          mce_loop {
              my ($mce, $chunk_ref, $chunk_id) = @_;
              my ($file, $ret) = ($chunk_ref->[0], '');

              # Must localize $/ to not stall MCE, fixed in 1.830.
              # Localizing $/ is recommended, but fixed MCE if not.
              local $/ = "";

              open my $fh, "-|", "/usr/bin/bzcat $file"
                  or warn "open error ($file): $!";

              if (defined fileno($fh)) {
                  while (<$fh>) {
                      if (/uid=$mbnum/m) {
                          $ret  = "## $file\n";
                          $ret .= $_;
                          last;
                      }
                  }
                  close $fh;
              }

              # The out_iter_fh iterator wants the chunk_id value.
              # Thus, all participating workers must call gather once only.
              # The manager process outputs the value for chunk_id 1 first,
              # then chunk_id 2, et cetera.
              MCE->gather($chunk_id, $ret);

          } \@ldif_files;

          MCE::Loop->finish;

      Regards, Mario

      Yet another demonstration, to be sure that MCE::Hobo and MCE::Shared are not affected when the record separator is modified. Not localizing the record separator works too. Workers store their results in a shared array.

          #!/usr/bin/perl

          use strict;
          use warnings;

          use MCE::Hobo;
          use MCE::Shared;

          my $mbnum = $ARGV[0] or die "usage: $0 mbnum\n";

          my @ldif_files = qw(
              /path/to/file1.ldif.bz2
              /path/to/file2.ldif.bz2
              /path/to/file3.ldif.bz2
          );

          my $ret = MCE::Shared->array();

          for my $idx (0 .. $#ldif_files) {
              my $file = $ldif_files[$idx];

              mce_async {
                  local $/ = "";
                  $ret->set($idx, "");

                  open my $fh, "-|", "/usr/bin/bzcat $file"
                      or warn "open error ($file): $!";

                  if (defined fileno($fh)) {
                      while (<$fh>) {
                          if (/uid=$mbnum/m) {
                              $ret->append($idx, $_);
                              last;
                          }
                      }
                      close $fh;
                  }
              };
          }

          $_->join for MCE::Hobo->list;    # or MCE::Hobo->waitall;

          print join('', $ret->values) if $ret->len;

      Regards, Mario
