
Searching for 'chunks' of data in very large files

by Ovid (Cardinal)
on Feb 28, 2002 at 18:21 UTC (#148323, sourcecode)
Category: Utility Scripts
Author/Contact Info: Ovid

Recently, on the Perl beginners list, someone had a bit of a quandary. They were reading a 600 MB file and needed to find a search term, grab 200 bytes of data from the file both before and after that term, and then search for another term within that 'chunk' of data.

I thought this was such a fun problem that I went ahead and wrote the program for this person (yeah, I know, I gave him a fish). This is deliberately overcommented in case the person did not know a lot of Perl. The basic idea is to search the file and return 'chunks' of up to 400 bytes in an array: 200 bytes on either side of the start of each match, fewer when the match is near the beginning or end of the file.
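The position arithmetic the program relies on can be checked in isolation. What follows is a minimal sketch, not part of the original program: `match_offsets` is a hypothetical helper name, and it reads from an in-memory filehandle so that no test file is needed. It assumes the target is a literal string (hence `\Q...\E`), since subtracting `length $target` only gives the start of the match when the match has a known length.

```perl
use strict;
use warnings;

# Hypothetical helper: return the absolute byte offset of the first
# occurrence of a literal $target on each line read from $fh.
sub match_offsets {
    my ( $fh, $target ) = @_;
    my @offsets;
    while ( my $line = <$fh> ) {
        if ( $line =~ /\Q$target\E/g ) {
            # tell() is now just past the end of the current line
            my $line_end = tell $fh;

            # step back from the end of the line to the end of the match,
            # then from the end of the match to its start
            my $match_end = $line_end - ( length($line) - pos($line) );
            push @offsets, $match_end - length $target;
        }
    }
    return @offsets;
}

# in-memory "file": the target starts at byte 4 and at byte 21
my $data = "abc search here\nxxxx search\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @offsets = match_offsets( $fh, 'search' );
close $fh;

print "@offsets\n";    # prints "4 21"
```

Because the match is tested with `if`, only the first occurrence on each line is reported. Changing the `if` to a `while` would visit every occurrence on the line.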

use strict;
use warnings;
use Data::Dumper;

# this is how far forward or back you need to read
my $width  = 200;

# this is your target string.  You can make it a regex if you prefer
my $target = 'search';

# file to search
my $file   = 'test.txt';
my $fsize  = -s $file;

# when you're done, this should contain the data you're looking for
my @chunks;

open my $fh, '<', $file or die "Cannot open $file for reading: $!";

while (<$fh>) {
    if ( /$target/g ) {
        # tell() reports the offset just past the line we read
        my $file_position = tell $fh;

        # backwards from end of string
        my $word_position = $file_position - ( length( $_ ) - pos( $_ ) );
        # to beginning of word.  It's separate so you can
        # pull it out if necessary.
        $word_position -= length $target;
        push @chunks, get_chunk( $fh, $word_position, $file_position, $width, $fsize );
    }
}

print Dumper \@chunks;

close $fh;

sub get_chunk {
    my ( $fh, $word_position, $file_position, $width, $fsize ) = @_;

    # don't try to read before beginning of file
    my $start = $word_position >= $width
        ? $word_position - $width
        : 0;

    # don't try to read after end of file
    my $end   = $word_position + $width <= $fsize
        ? $word_position + $width
        : $fsize;

    # position to start of where we want to read
    seek $fh, $start, 0;
    my $chunk;

    # shouldn't fail unless I got my boundaries wrong
    read( $fh, $chunk, $end - $start ) or die "Problem reading file: $!";

    # put us back to where we were
    seek $fh, $file_position, 0;
    return $chunk;
}
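
The original question also asked for a second term to be searched for inside each chunk, which the program above leaves to the reader. One way to do that is sketched here under assumptions: the second term (`'qux'` in the example) and the helper name `chunks_containing` are invented for this illustration and are not part of the program above. The second term is treated as a literal string.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the chunks that contain $second.
# index() does a literal substring search, so no regex escaping is needed.
sub chunks_containing {
    my ( $second, @chunks ) = @_;
    return grep { index( $_, $second ) >= 0 } @chunks;
}

# stand-in data; in the program above @chunks is filled by get_chunk()
my @chunks = (
    "foo search bar",
    "baz search qux",
);

my @hits = chunks_containing( 'qux', @chunks );
print "$_\n" for @hits;    # prints "baz search qux"
```

Filtering after the fact keeps the file-scanning loop unchanged. For a very large number of matches it may be preferable to test each chunk as it is read rather than storing all of them first.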