Searching large files a block at a time

by JediWombat (Novice)
on Aug 02, 2017 at 00:59 UTC (#1196493, perlquestion)

JediWombat has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm fairly new to Perl, but I'm in a job where a lot of existing Perl code is in use, so I'm working off what I have available and might be getting in over my head. I'm hoping you can help.

I have a large LDIF file in bzip2 format that I want to search. For the block (delimited by blank lines before and after it) that matches my pattern, I want to print the whole block. We have a shell script here that pipes into perl:

    /usr/bin/bzcat $ldif | perl -e "$/ = \"\n\n\"; while (<>) { if (/uid=$match/) { print $_ ; last; } ; }"

This sets $/, the input record separator, and then uses while (<>) to read a block at a time. I'd like to do this in pure Perl, but I can't find a way. I'm using the IO::Uncompress::Bunzip2 module, which gives me an IO::File object, but the only way I can seem to interact with it in a useful way is with getline() or getlines(), neither of which lets me iterate through the file a block at a time like the shell script does.

Like I said, I am pretty new to Perl, so I'm sure there's a better way to do this. Can someone offer some assistance please? Here's the code I've got, which works but is very slow:

    my $z = new IO::Uncompress::Bunzip2 $file;
    $mbnum = $ARGV[0];
    while ($line = $z->getline()) {
        if ($line =~ /^dn: uid=$mbnum,/) {
            $found = "true";
            print $line;
            for (my $i=0; $i<100; $i++) {
                $matchLine = $z->getline();
                print "$matchLine";
                if ($matchLine =~ /^$/) { last; }
            }
        }
        if ($found eq "true") { last; }
    }

Replies are listed 'Best First'.
Re: Searching large files a block at a time
by roboticus (Chancellor) on Aug 02, 2017 at 01:42 UTC

    JediWombat:

    It looks like IO::Uncompress::Bunzip2 handles the $/ variable just fine, so it shouldn't have any problem reading blocks:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

        $/ = "\n\n";

        my $z = new IO::Uncompress::Bunzip2("zzzzz.bz2")
            or die "Argh! $Bunzip2Error\n";

        my $cnt = 0;
        while (my $buff = $z->getline()) {
            ++$cnt;
            my $len = length($buff);
            print "BLOCK: $cnt\nLEN: $len\n$buff\n\n\n\n";
            last if $cnt > 10;
        }

    So perhaps I'm misunderstanding your problem....

    Never mind what I wrote below. When I saw you asking about reading by blocks, I thought you meant fixed-size blocks, not delimited ones as your code shows. Sorry about that.


    You should be able to read fixed-size blocks like this:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

        #my $z = new IO::Uncompress::Bunzip2("zzzzz.bz2", {BlockSize=>512})
        #my $z = new IO::Uncompress::Bunzip2("zzzzz.bz2", BlockSize=>512)
        my $z = new IO::Uncompress::Bunzip2("zzzzz.bz2")
            or die "Argh! $Bunzip2Error\n";

        my $buff;
        my $cnt = 0;
        while (my $status = $z->read($buff, 512)) {
            ++$cnt;
            my $len = length($buff);
            print "BLOCK: $cnt\nLEN: $len\n$buff\n\n\n\n";
            last if $cnt > 10;
        }

    NOTE: The BlockSize argument in the constructor didn't work. I tried it both as a hashref and as a couple of extra arguments, as shown in the commented-out lines above. The read method accepts a block size argument, though, so you can still read fixed-size blocks. If anyone sees what I did wrong in the constructor, I'd like to hear what it is.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks Roboticus. I think I might have been unclear: what I want to avoid is reading line by line, as my code is very slow. I assume that's because of the while (my $buff = $z->getline()) loop, but feel free to correct me on this. Using this structure, my program takes a solid minute to run, whereas the shell script that does the same thing completes in a second or two. Maybe I could call bzcat from the system and store its output in a variable? But I'm still not sure how to use while (<>) inside a full Perl program when I'm not reading from a pipe. Cheers, JW.
        Bunzip2's getline works just like <>; you can set $/ = "\n\n" to read a blank-line-delimited block at a time. It doesn't seem to be all that well optimized, though. You might try this:

            open my $BZ, "bzcat $file |";
            while (<$BZ>) { ... }
Re: Searching large files a block at a time
by kcott (Bishop) on Aug 02, 2017 at 05:21 UTC

    G'day JediWombat,

    Welcome to the Monastery.

    "This sets $/, the input record separator, and then uses while (<>) to read a block at a time. I'd like to do this in pure Perl, but I can't find a way."

    Firstly, here's a simple example of how you might do this.

        #!/usr/bin/env perl -l

        use strict;
        use warnings;

        {
            local $/ = '';

            while (<DATA>) {
                chomp;
                print '--- One Block ---';
                print;
            }
        }

        __DATA__
        Block1 Line1
        Block1 Line2
        Block1 Line3


        Block2 Line1
        Block2 Line2
        Block2 Line3



        Block3 Line1
        Block3 Line2
        Block3 Line3

        Block4 Line1
        Block4 Line2
        Block4 Line3

    Notes:

    • Setting $/ to an empty string puts you in what's called "paragraph mode". This allows reading blocks (lines separated by one or more blank lines). The number of blank lines doesn't matter: note how that differs from '$/ = "\n\n"' which specifies an exact number of blank lines. See $/ in perlvar for further details.
    • When you modify '$/', or indeed any special variable, you should localise the change in limited scope so that the special variable works normally elsewhere in your code. In this instance, I've used an anonymous block (the code is within braces by themselves); subroutine definitions, BEGIN blocks, and so on, could work just as well: just keep the special variable modification separate from other code. See local, and the links that page provides, for more on this.
    • I'm reading using '<DATA>', which is just a handy way of reading the data after '__DATA__'. You could use '<$filehandle>', where that filehandle may come from open or some other source (see below).
    • For the purposes of demonstration, I've separated the blocks with a varying number of blank lines (specifically 2, 3, and 1). This is to show that the number of intervening blank lines doesn't matter when in paragraph mode.
    • See also chomp and -l in perlrun which I've used. Also look at say.

    The output looks like this:

        --- One Block ---
        Block1 Line1
        Block1 Line2
        Block1 Line3
        --- One Block ---
        Block2 Line1
        Block2 Line2
        Block2 Line3
        --- One Block ---
        Block3 Line1
        Block3 Line2
        Block3 Line3
        --- One Block ---
        Block4 Line1
        Block4 Line2
        Block4 Line3
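The difference between paragraph mode ($/ = '') and a literal two-newline separator ($/ = "\n\n") described in the notes can be checked with a small sketch like the one below. The sample string and the helper name read_records are made up for illustration; the data is read from an in-memory filehandle so the script needs no external file.

```perl
use strict;
use warnings;

# Hypothetical sample: four blocks separated by 2, 3, and 1 blank lines.
my $text = "a1\na2\n\n\nb1\n\n\n\nc1\n\nd1\n";

# Read every record from an in-memory filehandle using the given separator.
sub read_records {
    my ($data, $separator) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @paragraph_mode = read_records($text, '');      # paragraph mode
my @exact_newlines = read_records($text, "\n\n");  # exactly two newlines

printf "paragraph mode: %d records\n", scalar @paragraph_mode;
printf "two newlines:   %d records\n", scalar @exact_newlines;
```

With runs of several blank lines in the input, the "\n\n" separator yields extra records that are empty or start with stray newlines, whereas paragraph mode returns one record per block.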

    I thought ++roboticus had generally covered issues relating to '$/' and IO::Uncompress::Bunzip2; however, your reply seems to suggest you were looking for something else.

    I'm not entirely sure what you're looking for. Note in IO::Uncompress::Bunzip2's Constructor section:

    ... the object, $z, returned from IO::Uncompress::Bunzip2 can be used exactly like an IO::File filehandle. This means that all normal input file operations can be carried out with $z. For example, to read a line from a compressed file/buffer you can use either of these forms

        $line = $z->getline();
        $line = <$z>;

    Try using '<$z>', in a way similar to my example with '<DATA>', and see if that does what you want. Something like this (untested):

        my $z = IO::Uncompress::Bunzip2::->new($filename);

        {
            local $/ = '';

            while (<$z>) {
                ...
            }
        }

    Note that the constructor code I've used differs from that shown in the IO::Uncompress::Bunzip2 documentation. This is on purpose and I recommend you use this instead. The IO::Uncompress::Bunzip2 documentation uses "Indirect Object Syntax": if you follow that link, you'll see in bold text

    "... this syntax is discouraged ..."

    along with a discussion of why that syntax should be avoided.

    — Ken
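As a rough end-to-end sketch of the approach suggested above (paragraph mode on a Bunzip2 handle), the script below builds a tiny compressed LDIF file and searches it block by block. The sample entries, the temporary file, and the find_block helper are all invented for illustration; the dn pattern is borrowed from the original question, and \Q...\E is added on the assumption that the uid should be matched literally.

```perl
use strict;
use warnings;
use File::Temp              qw(tempfile);
use IO::Compress::Bzip2     qw(bzip2 $Bzip2Error);
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Return the first blank-line-delimited block whose dn line matches the
# given uid, or undef if no block matches.
sub find_block {
    my ($filename, $uid) = @_;

    my $z = IO::Uncompress::Bunzip2->new($filename)
        or die "Cannot open $filename: $Bunzip2Error\n";

    local $/ = '';    # paragraph mode, limited to this sub
    while (my $block = <$z>) {
        # \Q...\E keeps regex metacharacters in $uid from being interpreted.
        if ($block =~ /^dn: uid=\Q$uid\E,/m) {
            $z->close;
            return $block;
        }
    }
    $z->close;
    return;
}

# Hypothetical test data, compressed into a temporary .bz2 file.
my $ldif = "dn: uid=alice,ou=people\ncn: Alice\n\n"
         . "dn: uid=bob,ou=people\ncn: Bob\n\n";
my (undef, $filename) = tempfile(SUFFIX => '.bz2', UNLINK => 1);
bzip2(\$ldif => $filename) or die "bzip2 failed: $Bzip2Error\n";

my $found = find_block($filename, 'bob');
print defined $found ? $found : "no match\n";
```

Returning as soon as a block matches mirrors the last in the original shell one-liner, so the rest of a large file is never decompressed.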

      Thank you Ken! The bit that helped me understand what I was doing wrong, was
          my $z = IO::Uncompress::Bunzip2::->new($filename);
          while (<$z>) { }
      What I needed was a way to use "while (<data>)" without using the "getline()" method that seemed to be reading the data in one line at a time. My LDIF is over 15 million lines, so that was quite slow. Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). Thanks to you and Roboticus for steering me in the right direction. Cheers, JW.
        "... helped me understand what I was doing wrong ..."

        OK, that's a good start.

        "Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). "

        I'm completely guessing but the overhead may be due to the IO::Uncompress::Bunzip2 module. You could avoid using that module by setting up the same pipe but from within the Perl script (rather than piping to that script).

        I put exactly the same data I used previously into a text file (just a copy and paste):

            $ cat > pm_1196493_paragraph_mode_test_data.txt
            Block1 Line1
            ...
            Block4 Line3
            ^D

        I then modified the start of my previous example code, so it now looks like this:

            #!/usr/bin/env perl -l

            use strict;
            use warnings;
            use autodie;

            my $filename = 'pm_1196493_paragraph_mode_test_data.txt';

            open my $z, '-|', "cat $filename";

            {
                local $/ = '';

                while (<$z>) {
                    chomp;
                    print '--- One Block ---';
                    print;
                }
            }

        This produces exactly the same output as before. Obviously, you'll want to change 'cat' to '/usr/bin/bzcat' (and, of course, use *.bz2 instead of *.txt files). This solution will not be platform-independent: that may not matter to you. See open for more on the '-|', and closely related '|-', modes.

        Also, note that I used the autodie pragma. If you want more control over handling I/O problems, you can hand-craft messages (e.g. open ... or die "..."), or use something like Try::Tiny.

        — Ken
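One possible variation on the piped open above, sketched under the assumption of a Unix-like system with cat on the PATH: passing the command as a list to open bypasses the shell, so odd characters in the filename are not interpreted, and checking the result of close reports a failing command. The helper name read_blocks and the temporary test file are made up for illustration; for the real task, '/usr/bin/bzcat' and the .bz2 file would replace 'cat' and the text file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run a command without a shell and return its output as a list of
# blank-line-delimited blocks, with trailing newlines removed.
sub read_blocks {
    my (@command) = @_;

    open(my $fh, '-|', @command)
        or die "Cannot start '$command[0]': $!\n";

    my @blocks;
    {
        local $/ = '';               # paragraph mode for readline and chomp
        chomp(@blocks = <$fh>);
    }

    # For a pipe, close returns false if the command exited non-zero.
    close($fh)
        or die $! ? "Error closing pipe: $!\n"
                  : "'$command[0]' exited with status " . ($? >> 8) . "\n";

    return @blocks;
}

# Hypothetical test data written to a temporary file.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "Block1 Line1\nBlock1 Line2\n\n\nBlock2 Line1\n";
close($tmp_fh) or die "Cannot close $filename: $!\n";

my @blocks = read_blocks('cat', $filename);
print "--- One Block ---\n$_\n" for @blocks;
```

This version collects every block in memory, which suits a small demonstration; for a multi-million-line LDIF, looping with while (<$fh>) and leaving early on a match, as in the examples above, is likely the better fit.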

Re: Searching large files a block at a time
by marioroy (Parson) on Aug 02, 2017 at 06:41 UTC

    Hello JediWombat,

    Perl may call bzcat directly via open, similarly to calling bzcat from shell.

    See tip by kcott for the empty input record separator (paragraph mode). Thanks kcott for the tip.
    See tip by Anonymous Monk for reading from bzcat directly.

        #!/usr/bin/perl

        use strict;
        use warnings;

        my $ldif  = "file.ldif.bz2";
        my $match = "g2ucab";

        $/ = "";

        open my $fh, "-|", "/usr/bin/bzcat $ldif" or die "open error ($ldif): $!";

        while ( <$fh> ) {
            if ( /uid=$match/m ) {
                print $_;
                last;
            }
        }

        close $fh;

    Parallel processing is another possibility when involving extra work inside the block, which isn't the case here. Thus, the next demonstration will not run any faster. Workers read 8 blocks at a time, configurable via the chunk_size option. Calling $mce->last causes all workers to leave the parallel block, similarly to calling last inside a while loop.

    The reason that this small use-case doesn't run any faster is due to workers waiting on bzcat.

        #!/usr/bin/perl

        use strict;
        use warnings;

        use MCE::Loop;

        my $ldif  = "file.ldif.bz2";
        my $match = "g2ucab";

        $/ = "";

        open my $fh, "-|", "/usr/bin/bzcat $ldif" or die "open error ($ldif): $!";

        MCE::Loop->init(
            max_workers => 3,
            chunk_size  => 8,
            use_slurpio => 1,
        );

        mce_loop {
            my ( $mce, $slurp_ref, $chunk_id ) = @_;

            open my $local_fh, "<", $slurp_ref;

            while ( <$local_fh> ) {
                if ( /uid=$match/m ) {
                    print $_;
                    $mce->last;
                }
            }

            close $local_fh;

        } $fh;

        MCE::Loop->finish;

        close $fh;

    Regards, Mario

Node Type: perlquestion [id://1196493]
Front-paged by stevieb