per - selects every Nth line

by jkahn (Friar)
on Nov 23, 2002 at 23:17 UTC (#215446=sourcecode)

Category: Text Processing
Author/Contact Info: Jeremy Kahn kahn@cpan.org
Description: For how many tasks have you wanted to use a sampling of every Nth line of a file?
  • selecting a "random" subset before running on all five million lines
  • getting a flavor of what's in a line-oriented database
  • holding out test data

Well, for me, it's nearly every line-based text-processing tool I write -- even when sampling isn't a hard requirement, it's usually much more informative to test on every 50th line of my test corpus than on the first 50 lines.

In fact, I find it very frustrating that there's no Unix power tool a la grep or tail that does this.

So, per is an addition to the Unix power-tool library -- it's sort of like head or tail except that it takes every Nth line instead of the first or last N. Save it as ~/bin/per (or /usr/bin/per) and use it every day, like me.

Windows users can run pl2bat on this and put the result somewhere in their path -- my NT box happily uses a variant of this.

Usage info is in POD, in the script. But here it is in HTML anyway (I love pod2html):


NAME

per - return one line per N lines


SYNOPSIS

  per [-oOFFSET] -N [files]
  per -90 -o2 file.txt  # every 90th line starting with line 2
  per -o500 -3 file.txt # every 3rd line starting with line 500
  per -o1 -2 file.txt   # every other line, starting with the first
  per -2 file.txt       # same as above

It can also read from STDIN, for pipelining:

  tail -5000 bigfile.txt | per -100 # show every 100th line for the
                                    # last 5000 in the file


DESCRIPTION

per writes every Nth line, starting with OFFSET, to STDOUT.


OPTIONS

-N
the integer value N provided (e.g. -50, -2) is used to decide which lines to return -- every Nth.

-oOFFSET
the value OFFSET provided says how far down in the input to proceed before beginning. The output will begin at line number OFFSET. Default is 1.

files

Note that per works on files specified on the commandline, or on STDIN if no files are provided. The special input file - indicates that remaining data should be read from STDIN.
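
To make the selection rule concrete, here is a minimal sketch that lists which line numbers per would print for a given N, OFFSET and input length. The helper name selected_lines is hypothetical and not part of the script; it only restates the two tests used in the script's main loop.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of per): list the line numbers that
# per would print from an input of $total lines.
sub selected_lines {
  my ($n, $offset, $total) = @_;
  # Keep a line once OFFSET is reached and its distance from OFFSET
  # is a multiple of N.
  return grep { $_ >= $offset && !(($_ - $offset) % $n) } 1 .. $total;
}

print join(',', selected_lines(3, 2, 12)), "\n";  # 2,5,8,11
print join(',', selected_lines(2, 1, 7)), "\n";   # 1,3,5,7
```

So -3 -o2 on a 12-line input yields lines 2, 5, 8 and 11, and -2 with the default offset yields the odd-numbered lines.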

#!/usr/bin/env perl
use strict;
use warnings;

use constant DEBUG => 0;

my ($divisor,$offset) = handleArgs();

if (DEBUG) {
  warn "offset $offset\n";
  warn "divisor $divisor\n";
}

while (<>) {
  next if $. < $offset; # haven't reached the first offset
  next if (($. - $offset) % $divisor); # not a multiple of N past the offset
  print;
}

sub handleArgs {
  my ($offset, $divisor);
  while (@ARGV and $ARGV[0] =~ s/^-//) {
    my $arg = shift @ARGV;
    if ($arg =~ s/^o//) {
      if (defined $offset) {
        warn "-o switch found more than once\n";
      }
      $offset = $arg;
    }
    else {
      if ($arg eq '') {
        unshift @ARGV, '-';
        last; # arg was '-', which says "ignore following"
      }
      if (defined $divisor) {
        warn "divisor argument (-N) found more than once\n";
      }
      $divisor = $arg;
    }
  }
  if (not defined $divisor) {
    die "no divisor (-N) defined on commandline!\n";
  }
  if (not defined $offset) {
    $offset = 1;
  }
  # Truncate first, so that a fractional value such as 0.5 cannot slip
  # past the range checks below and end up as a zero divisor.
  if ($divisor != int($divisor)) {
    warn "divisor $divisor non-integer. truncating\n";
    $divisor = int($divisor);
  }
  if ($offset != int($offset)) {
    warn "offset $offset non-integer. truncating\n";
    $offset = int($offset);
  }

  if ($divisor <= 0) {
    die "divisor $divisor is <= 0, which makes no sense.\n";
  }
  if ($offset <= 0) {
    die "offset $offset is <= 0, which makes no sense.\n";
  }
  return ($divisor, $offset);
}

=head1 NAME

per - return one line per N lines

=head1 SYNOPSIS

  per [-oOFFSET] -N [files]

  per -90 -o2 file.txt  # every 90th line starting with line 2
  per -o500 -3 file.txt # every 3rd line starting with line 500
  per -o1 -2 file.txt   # every other line, starting with the first
  per -2 file.txt       # same as above

It can also read from C<STDIN>, for pipelining:

  tail -5000 bigfile.txt | per -100 # show every 100th line for the
                                    # last 5000 in the file

=head1 DESCRIPTION

C<per> writes every C<N>th line, starting with C<OFFSET>, to
C<STDOUT>.

=head1 OPTIONS

=over

=item -N

the integer value C<N> provided (e.g. C<-50>, C<-2>) is used to decide
which lines to return -- every C<N>th.

=item -oOFFSET

the value C<OFFSET> provided says how far down in the input to proceed
before beginning. The output will begin at line number
C<OFFSET>. Default is 1.

=item [ files ]

=back

Note that C<per> works on files specified on the commandline, or on
C<STDIN> if no files are provided. The special input file C<->
indicates that remaining data should be read from C<STDIN>.

=cut

__END__
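
One behavior worth knowing: when several files are given, <> reads them as one continuous stream and $. keeps counting across file boundaries, so the every-Nth pattern is not restarted for each file. With two three-line files and -2, the script prints lines 1, 3 and 5 of the combined stream, that is, the first and third lines of the first file and the second line of the second. If per-file sampling is wanted instead, the idiom documented under eof in perlfunc is to close ARGV at the end of each file, which resets $.. The following is a sketch of that variant under that assumption, not the author's script; the sub name per_each_file is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical variant (not the original per): restart the line count
# for every file instead of treating all files as one stream.
sub per_each_file {
  my ($n, $offset, @files) = @_;
  my @picked;
  local @ARGV = @files;   # let <> iterate over just these files
  while (my $line = <>) {
    push @picked, $line
      if $. >= $offset && !(($. - $offset) % $n);
  }
  continue {
    close ARGV if eof;    # end of the current file: resets $.
  }
  return @picked;
}

# Example: every other line of each file named on the command line.
print per_each_file(2, 1, @ARGV) if @ARGV;
```

With the same two three-line files, this variant returns the first and third lines of each file.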
