http://www.perlmonks.org?node_id=470222

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I want to process a text file in human-compatible chunks of a certain size or below, as in, not "sysread" for chunks of 2048 bytes, but something which won't end up dividing words or sentences.

By which I mean I want to go paragraph by paragraph, until I get to a chunk of a certain size, so I got something like this:

open( X, '<', 'x.txt' ) or die "Cannot open x.txt: $!";
my $chunk = '';
while ( my $line = <X> ) {
    if ( length($chunk) + length($line) > 2048 ) {
        # the next line would take us over the set size
        doSomethingWithChunk($chunk);
        $chunk = $line;
    }
    else {
        $chunk .= $line;    # append line and keep going
    }
}
doSomethingWithChunk($chunk);    # process whatever's left in $chunk at the end

which is fine, but what I'd really like to do is have some kind of sub which would return me those chunks until the file ran out.

Something like how Algorithm::Permute does this:

my $p = new Algorithm::Permute(['a'..'d']);
while (@res = $p->next) {
    # do something with @res
}

Is there a module or a recognised way to do this? I'm blanking on the way to do it.



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: Handling A File In Human-Compatible Chunks
by BrowserUk (Patriarch) on Jun 27, 2005 at 11:59 UTC

    Setting local $/ = ''; (paragraph mode) will cause each use of <FH> or readline to return a 'paragraph' of text from the file, where a paragraph is defined as a block of text delimited by one or more blank lines.

    Paragraph mode is described briefly (and rather confusingly) in perlvar under the description for $INPUT_RECORD_SEPARATOR.

    Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline.
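
The difference between the two settings can be seen with a small experiment. This is only a sketch: the helper name read_records and the sample text are made up for illustration, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Read every record from a string using the given record separator.
# read_records is a hypothetical helper, not part of any module.
sub read_records {
    my ( $text, $sep ) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $sep;
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Two paragraphs separated by three empty lines.
my $text = "first para\n\n\n\nsecond para\n";

my @para   = read_records( $text, '' );        # paragraph mode
my @double = read_records( $text, "\n\n" );    # literal two-newline separator

printf "paragraph mode: %d records\n", scalar @para;
printf "two-newline separator: %d records\n", scalar @double;
```

With the empty string the run of empty lines is treated as one separator, so only the two paragraphs come back. With "\n\n" the leftover pair of newlines is returned as a record of its own.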

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Handling A File In Human-Compatible Chunks
by holli (Abbot) on Jun 27, 2005 at 12:23 UTC
    Here you are:
    use ChunkReader;

    # create reader, threshold is optional (def. 2048)
    my $reader = new ChunkReader (threshold=>1024);

    # open file
    $reader->open ("c:/test2.txt");

    # iterate
    while ( my $chunk = $reader->chunk ) {
        print "$chunk\n****************\n";
    }
    Put this in a module "ChunkReader.pm"
    package ChunkReader;

    use strict;
    use warnings;

    sub new {
        my $class = shift;
        my %args  = @_;
        $args{threshold} = 2048 unless defined $args{threshold};
        return bless {%args}, $class;
    }

    sub open {
        my $self = shift;
        my $file = shift || $self->{file};

        # CORE::open, because this package defines its own open()
        CORE::open( my $handle, "<", $file )
            or die "cannot open '$file': $!\n";
        $self->{handle} = $handle;
    }

    sub chunk {
        my $self   = shift;
        my $handle = $self->{handle} || die "File not open!\n";

        # start with the line held over from the previous call, if any
        my $chunk = defined $self->{lastline} ? $self->{lastline} : '';

        # reset last line
        $self->{lastline} = undef;

        while ( my $line = <$handle> ) {
            if ( length($chunk) + length($line) > $self->{threshold} ) {
                # nothing collected yet: a single line is bigger than
                # the threshold, so return the line on its own
                return $line unless length $chunk;

                # save the last line for further use
                # and return the chunk
                $self->{lastline} = $line;
                return $chunk;
            }

            # append line and keep going
            $chunk .= $line;
        }

        # end of file
        return $chunk;
    }

    1;


    Update:
    Following TheDamian's advice in "sub that sets $_", I updated the module so you can now do
    use ChunkReader;

    # create reader, threshold is optional (def. 2048)
    my $reader = new ChunkReader ();

    # open file
    $reader->open ("c:/test2.txt");

    while ( $reader->chunk ) {
        print "$_****************\n";
    }
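
The updated module is not shown in the thread, so the following is only a guess at the general idea, not holli's actual code. It assumes a hypothetical TopicIterator class that walks a plain list instead of a file. The method stores the next item in the global $_ and returns a true/false flag, so the caller's while loop can use $_ directly.

```perl
use strict;
use warnings;

package TopicIterator;    # hypothetical name, for illustration only

sub new {
    my ( $class, @items ) = @_;
    return bless { items => [@items] }, $class;
}

# Put the next item into the global $_ and report whether there was one.
sub next_item {
    my $self = shift;
    return 0 unless @{ $self->{items} };
    $_ = shift @{ $self->{items} };
    return 1;
}

package main;

my $it = TopicIterator->new( "chunk one\n", "chunk two\n" );
my @seen;
while ( $it->next_item ) {
    push @seen, $_;    # $_ was set by next_item
}
print @seen;
```

Returning an explicit flag rather than the item itself means an item such as "0" does not end the loop early. The cost is that the method overwrites whatever the caller already had in $_.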



    holli, /regexed monk/
      Thanks, everyone, for your contributions. I may have confused people with the word "paragraph"; my main concern was the sub rather than how the text gets split.

      Holli's contribution looks like exactly what I wanted.



      ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
      =~y~b-v~a-z~s; print
        Well, all I did was take your code and add some OO sugar around it.


        holli, /regexed monk/
Re: Handling A File In Human-Compatible Chunks
by tlm (Prior) on Jun 27, 2005 at 11:58 UTC

    You could set $/ to the empty string '' to get the special "paragraph mode". Then <X> would read input "paragraph-wise" instead of "line-wise". See $/ in perlvar.

    the lowliest monk

Re: Handling A File In Human-Compatible Chunks
by broquaint (Abbot) on Jun 27, 2005 at 12:10 UTC
    I don't know of an existing module but you could do it something like this:
    sub get_nice_chunk {
        my ( $fh, $size ) = @_;
        local $/ = '';    # paragraph mode
        local $_;
        my $chunk;
        $chunk .= $_
            while $_ = <$fh>
            and length($_) + length($chunk) < $size;
        # rewind over a paragraph that was read but did not fit
        seek $fh, tell($fh) - length($_), 0
            if defined $_ and length($_) + length($chunk) >= $size;
        return $chunk;
    }

    print "got chunk: $_" while $_ = get_nice_chunk( \*DATA, 19 );

    __DATA__
    foo bar

    baz quux

    ichi ni

    san shi

    the last bit
    So that just sets the INPUT_RECORD_SEPARATOR to paragraph mode, reads paragraphs until we've hit the $size limit, rewinds the file pointer and returns the chunk. However, it has a bug: it won't read paragraphs greater than $size. I'll leave what to do in that case as an exercise for the OP.
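
One possible way to handle the oversized-paragraph case is sketched below. It is not broquaint's code: make_chunker is a hypothetical name, the sample text is invented, and an in-memory filehandle replaces DATA. It avoids seek by holding the paragraph that did not fit in a closure variable, and it returns an oversized paragraph as a chunk of its own.

```perl
use strict;
use warnings;

# Returns an iterator (a code ref). Each call yields the next chunk,
# or undef once the input is exhausted. make_chunker is hypothetical.
sub make_chunker {
    my ( $fh, $size ) = @_;
    my $pending;    # paragraph that was read but not yet returned

    return sub {
        local $/ = '';    # paragraph mode
        my $chunk = defined $pending ? $pending : '';
        undef $pending;

        while ( defined( my $para = <$fh> ) ) {
            if ( length($chunk) && length($chunk) + length($para) > $size ) {
                $pending = $para;    # keep it for the next call
                return $chunk;
            }
            # an oversized paragraph lands here only when $chunk is
            # empty, so it becomes a chunk of its own
            $chunk .= $para;
        }
        return length($chunk) ? $chunk : undef;
    };
}

my $text = "foo bar\n\nbaz quux\n\n"
         . "a paragraph that is longer than the limit\n\n"
         . "the last bit\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $next = make_chunker( $fh, 19 );
my @chunks;
while ( defined( my $chunk = $next->() ) ) {
    push @chunks, $chunk;
}
printf "%d chunks\n", scalar @chunks;
```

Testing the return value with defined rather than plain truth keeps a chunk such as "0" from ending the loop early.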
    HTH

    _________
    broquaint