
Searching for 'chunks' of data in very large files

by Ovid (Cardinal)
on Feb 28, 2002 at 18:21 UTC (#148323, sourcecode)
Category: Utility Scripts
Author/Contact Info: Ovid

Recently, on the Perl beginners list, someone had a bit of a quandary. They were reading a 600 MB file and needed to find a search term, grab 200 bytes of data from the file both before and after that term, and then search for another term within that 'chunk' of data.

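I thought this was such a fun problem that I went ahead and wrote the program for this person (yeah, I know, I gave him a fish). It is deliberately overcommented in case the person did not know a lot of Perl. The basic idea is to search the file and return 400-byte 'chunks' in an array. Each chunk runs from 200 bytes before the start of the match to 200 bytes after the start of the match, and is shorter when the match is close to the beginning or end of the file.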

use strict;
use warnings;
use Data::Dumper;

# this is how far forward or back you need to read
my $width  = 200;

# this is your target string.  You can make it a regex if you prefer
my $target = 'search';

# file to search
my $file   = 'test.txt';
my $fsize  = -s $file;

# when you're done, this should contain the data you're looking for
my @chunks;

open my $fh, '<', $file or die "Cannot open $file for reading: $!";

while (<$fh>) {
    if ( /$target/g ) {
        # tell() gives the offset just past the line we have read
        my $file_position = tell $fh;

        # backwards from end of string
        my $word_position = $file_position - ( length( $_ ) - pos( $_ ) );
        # to beginning of word.  It's separate so you can
        # pull it out if necessary.  The matched length is used instead
        # of length $target so that a regex target also works.
        $word_position -= $+[0] - $-[0];
        push @chunks, get_chunk( $fh, $word_position, $file_position, $width, $fsize );
    }
}

print Dumper \@chunks;

close $fh;

sub get_chunk {
    my ( $fh, $word_position, $file_position, $width, $fsize ) = @_;

    # don't try to read before beginning of file
    my $start = $word_position >= $width
        ? $word_position - $width
        : 0;

    # don't try to read after end of file
    my $end   = $word_position + $width <= $fsize
        ? $word_position + $width
        : $fsize;

    # position to start of where we want to read
    seek $fh, $start, 0;
    my $chunk;

    # shouldn't fail unless I got my boundaries wrong
    read( $fh, $chunk, $end - $start ) or die "Problem reading file: $!";

    # put us back to where we were
    seek $fh, $file_position, 0;
    return $chunk;
}
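
The original question also asked for a second term to be found inside each chunk, which the script above leaves to the reader. One way that step might look is sketched below. The helper name chunks_containing and the sample data are made up for illustration, and the second term is assumed to be a literal string rather than a regex.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the
# chunks that contain a second term, treated as a literal string.
sub chunks_containing {
    my ( $second_target, @chunks ) = @_;

    # \Q...\E quotes regex metacharacters so the term matches literally
    return grep { /\Q$second_target\E/ } @chunks;
}

# Sample data standing in for the @chunks array built by the script
my @sample = ( 'a search for needles', 'search and rescue', 'needle search' );
my @hits   = chunks_containing( 'needle', @sample );

print scalar(@hits), " chunk(s) contain the second term\n";
```

In the real script you would pass @chunks instead of the sample list. If the second term should be a regex, drop the \Q...\E quoting.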