Re^2: Searching large files a block at a time

by JediWombat (Novice)
on Aug 02, 2017 at 05:53 UTC (#1196511)

in reply to Re: Searching large files a block at a time
in thread Searching large files a block at a time

Thank you Ken! The bit that helped me understand what I was doing wrong was:

my $z = IO::Uncompress::Bunzip2::->new($filename);
while (<$z>) { }

What I needed was a way to use "while (<data>)" without the "getline()" method, which seemed to be reading the data in one line at a time. My LDIF is over 15 million lines, so that was quite slow. Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). Thanks to you and Roboticus for steering me in the right direction.

Cheers, JW.

Replies are listed 'Best First'.
Re^3: Searching large files a block at a time
by kcott (Bishop) on Aug 02, 2017 at 06:52 UTC
    "... helped me understand what I was doing wrong ..."

    OK, that's a good start.

    "Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). "

    I'm completely guessing, but the overhead may be due to the IO::Uncompress::Bunzip2 module. You could avoid using that module by setting up the same pipe from within the Perl script (rather than piping to that script).

    I put exactly the same data I used previously into a text file (just a copy and paste):

    $ cat > pm_1196493_paragraph_mode_test_data.txt
    Block1 Line1
    ...
    Block4 Line3
    ^D

    I then modified the start of my previous example code, so it now looks like this:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;
    use autodie;

    my $filename = 'pm_1196493_paragraph_mode_test_data.txt';

    open my $z, '-|', "cat $filename";

    {
        local $/ = '';

        while (<$z>) {
            chomp;
            print '--- One Block ---';
            print;
        }
    }

    This produces exactly the same output as before. Obviously, you'll want to change 'cat' to '/usr/bin/bzcat' (and, of course, use *.bz2 instead of *.txt files). This solution will not be platform-independent: that may not matter to you. See open for more on the '-|', and closely related '|-', modes.
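For anyone following along, here is a minimal sketch of the same '-|' read using the list form of open, which hands the arguments straight to the program without going through a shell (handy if a filename might contain spaces or shell metacharacters). It is only an illustration: the each_block helper and the sample data are hypothetical and not part of the code above, and it uses 'cat' on a temporary file so it runs anywhere 'cat' exists. Swapping in '/usr/bin/bzcat' and a *.bz2 file should work the same way.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: two blocks separated by a blank line.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print {$tmp} "Block1 Line1\nBlock1 Line2\n\nBlock2 Line1\n";
close $tmp or die "Cannot close '$filename': $!\n";

# each_block is an illustrative helper (not from the thread).
# It runs @cmd, reads its output one paragraph at a time,
# and calls $callback with each chomped block.
sub each_block {
    my ($callback, @cmd) = @_;

    # List form: no shell is involved, so arguments are not re-parsed.
    open(my $fh, '-|', @cmd) or die "Cannot run '@cmd': $!\n";

    local $/ = '';    # paragraph mode: records end at blank lines

    while (my $block = <$fh>) {
        chomp $block;    # in paragraph mode this strips all trailing newlines
        $callback->($block);
    }

    # For a piped open, close fails if the command exited non-zero;
    # in that case $! is 0 and the status is in $?.
    unless (close $fh) {
        die $! ? "Error closing pipe from '@cmd': $!\n"
               : "'@cmd' exited with status " . ($? >> 8) . "\n";
    }
    return;
}

my @blocks;
each_block(sub { push @blocks, $_[0] }, 'cat', $filename);

print "--- One Block ---\n$_\n" for @blocks;
```

Checking the return value of close matters more with pipes than with plain files, because a failure inside the command (a corrupt archive, say) typically only shows up there.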

    Also, note that I used the autodie pragma. If you want more control over handling I/O problems, you can hand-craft messages (e.g. open ... or die "..."), or use something like Try::Tiny.
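As a rough illustration of the hand-crafted alternative to autodie, the sketch below wraps open in a small helper that dies with its own message, then traps that message with a plain eval (Try::Tiny would do the same job, but it is not a core module, so it is left out here). The helper name open_for_reading and the path are hypothetical; the path is simply assumed not to exist.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Illustrative helper (not from the thread): open a file for reading
# and die with a custom message that includes the OS error in $!.
sub open_for_reading {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or die "Cannot open '$path' for reading: $!\n";
    return $fh;
}

# Hypothetical path, assumed not to exist on this machine.
my $missing = '/nonexistent/pm_example_data.txt';

# eval returns undef if the block dies; the message is left in $@.
my $fh    = eval { open_for_reading($missing) };
my $error = $@;

print "Caught: $error" unless defined $fh;
```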

    — Ken

      Thanks again, Ken. I've built this code using your and Mario's responses:
      $/ = "";
      open my $fh, "-|", "/usr/bin/bzcat $file";
      while (<$fh>) {
          if (/uid=$mbnum/m) {
              print $_;
              last;
          }
      }

      I've timed this version and all the others: this one completed in 3.2 seconds, the previous version I built with your help took 8 seconds, and my original took 25.4! As I need to scan through three different LDIFs, that's a total of under 10 seconds, on average.
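One possible refinement of the search loop, offered as a sketch rather than a correction: localizing $/ keeps paragraph mode from leaking into the rest of the script, \Q...\E stops any regex metacharacters in the search value from being interpreted, and a trailing \b keeps a value like 100 from also matching uid=1001. The find_record helper and the LDIF-like records below are made up for illustration, and an in-memory filehandle stands in for the bzcat pipe so the example is self-contained. The \b assumes the value ends in a word character.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical LDIF-like records, separated by blank lines.
my $data = <<'END';
dn: uid=1001,ou=people,dc=example,dc=com
cn: First Example

dn: uid=100,ou=people,dc=example,dc=com
cn: Second Example

dn: uid=2000,ou=people,dc=example,dc=com
cn: Third Example
END

# Illustrative helper (not from the thread): return the first record
# whose text contains uid=<value>, or undef if none matches.
sub find_record {
    my ($fh, $mbnum) = @_;

    local $/ = '';    # paragraph mode, restored when the sub returns

    while (my $record = <$fh>) {
        # \Q...\E quotes metacharacters; \b rejects longer numbers.
        return $record if $record =~ /uid=\Q$mbnum\E\b/;
    }
    return;
}

# An in-memory filehandle stands in for the pipe from bzcat.
open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!\n";
my $found = find_record($fh, '100');
close $fh;

print defined $found ? $found : "no match\n";
```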

        Hi JediWombat,

        The following is a parallel demonstration that processes all 3 LDIF files simultaneously. It requires at least MCE 1.830, because MCE::Relay in earlier versions stalls when the input record separator is set. That said, I am posting the code before releasing 1.830 in order to have the perlmonks URL for the changelog.

        Well, here is the code.

        #!/usr/bin/perl

        use strict;
        use warnings;

        # Requires MCE 1.830. Fixes MCE::Relay stalling from setting $/.
        use MCE::Loop 1.830;

        my $mbnum = $ARGV[0] or die "usage: $0 mbnum\n";

        my @ldif_files = qw(
            /path/to/file1.ldif.bz2
            /path/to/file2.ldif.bz2
            /path/to/file3.ldif.bz2
        );

        MCE::Loop->init(
            max_workers => scalar @ldif_files,
            chunk_size  => 1,
            init_relay  => ''
        );

        mce_loop {
            # my ($mce, $chunk_ref, $chunk_id) = @_;
            # When chunk_size 1 is specified, $_ contains
            # the same value as $chunk_ref->[0].
            my ($file, $ret) = ($_, '');

            # Must localize $/ to not stall MCE, fixed in 1.830.
            # Localizing $/ is recommended, but fixed MCE if not.
            local $/ = "";

            open my $fh, "-|", "/usr/bin/bzcat $file"
                or warn "open error ($file): $!";

            if (defined fileno($fh)) {
                while (<$fh>) {
                    if (/uid=$mbnum/m) {
                        $ret  = "## $file\n";
                        $ret .= $_;
                        last;
                    }
                }
                close $fh;
            }

            # Relay is beneficial for running a code block serially
            # and orderly. The init_relay option enables MCE::Relay.
            # All participating workers must call relay. Here, workers
            # write to STDOUT orderly starting with chunk_id 1.
            MCE::relay { print $ret };

        } \@ldif_files;

        MCE::Loop->finish;

        MCE 1.830 will be released on CPAN no later than Monday, the 7th of August. In the meantime, the MCE GitHub repository is current.

        Regards, Mario
