Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

part - split up files according to column value

by Corion (Pope)
on Feb 07, 2007 at 09:44 UTC ( #598718=sourcecode: print w/replies, xml ) Need Help??
Category: Utility Scripts
Author/Contact Info Corion
Description:

I often have to split up text files for consumption in Excel. A convenient way of splitting up a file meaningfully is to split it up by the value of a column. The program does this by accumulating the input into a hash of arrays keyed by the column value.

There is an awk oneliner (I'm told) that circumvents the memory limitations this program encounters:

awk -F 'print $0 > $3' FILES

If you want something like this program in a module, see List::Part

Update: Also see join - join two files according to a common key. If you need one, you'll likely need the other too.

Update: v0.03 now can part according to more than one column.

Update: v0.04 now can output multiple header lines into every file.

Update: v0.06 fixes two errors: The column at the end of each line couldn't be used (well) as the key column. Header lines are now actually printed.

Update: Now there also is a Github repository for the case that you want to submit a patch.

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

use vars qw($VERSION);
$VERSION = '0.06';

# Try to load Pod::Usage and install a fallback if it doesn't exist
eval {
    require Pod::Usage;
    Pod::Usage->import();
    1;
} or do {
    *pod2usage = sub {
        die "Error in command line.\n";
    };
};

=head1 NAME

part - split up a single input file into multiple files according to a
+ column value

=head1 SYNOPSIS

part FILES

=head1 OPTIONS

=item B<--out> - set the output template

If the output template is not given it is guessed from
the name of the first input file or set to C<part-%s.txt>.
The C<%s> will be replaced by the column value.

=item B<--column> - set the column to part on

This is the zero-based number of the column.
Multiple columns may be given.

=item B<--separator> - set the column separator

This is the separator for the columns. It defaults
to a tab character ("\t").

=item B<--header-line> - output the first line into every file

This defines the line as header line which is output
into every file. If it is given an argument that string
is output as header, otherwise the first line read
will be repeated as the header.

If the value is a number, that many lines will be read from
the file and used as the header. This makes it impossible
to use just a number as the header.

=item B<--verbose> - output the generated filenames

In normal operation, the program will be silent. If you
need to know the generated filenames, the C<--verbose>
option will output them.

=item B<--filename-sep> - set the separator for the filenames

If you prefer a different separator for the filenames
than a newline, this option allows you to set it. If
the separator looks like an octal number (three digits)
it is interpreted as such. Otherwise it will
be taken literally. A common
use is to set the separator to C<000> to separate the
files by the zero character if you suspect that your
filenames might contain newlines.

It defaults to C<012>, a newline.

=item B<--version> - output version information

=head1 CAVEAT

The program loads the whole input into RAM
before writing the output. A future enhancement
might be a C<uniq>-like option that tells the
program to assume that the input will be grouped
according to the parted column so it does not
need to allocate memory.

If your memory is not large enough, the following
C<awk> one-liner might help you:

    # Example of parting on column 3
    awk -F '{ print $0 > $3 }' FILE

=head1 AUTHOR

Copyright (c) 2007 Max Maischein (C<< corion@cpan.org >>)

=cut

GetOptions(
    'out=s'             => \my $tmpl,
    'column=i'          => \my @col,
    'separator=s'       => \my $sep,
    'verbose'           => \my $verbose,
    'filename-sep=s'    => \my $filename_sep,
    'header-line:s'     => \my $header,
    'help'              => \my $help,
    'version'           => \my $version,
) or pod2usage(2);
pod2usage(1) if $help;
if (defined $version) {
    print "$VERSION\n";
    exit 0;
};
pod2usage("$0: No files given.")  if ((@ARGV == 0) && (-t STDIN));

if (! defined $tmpl) {
    # Let's hope we can guess from the first filename
    my $placeholder = '-%s' x @col;
    ($tmpl = $ARGV[0] || 'part.txt') =~ s/\.(\w+)$/$placeholder.$1/;
};

if (! defined $sep) {
    $sep = "\t";
};

$filename_sep ||= "012";
if ($filename_sep =~ /^\d{3}$/) {
    $filename_sep = chr oct $filename_sep
};

my %lines;
if (defined $header) {
    $header ||= 1;
    if ($header =~ /^\d+$/) {
        my $count = $header;
        $header = "";
        $header .= <>
            while $count--;
    };
};

while (<>) {
    s/\r?\n$//;
    my @c = split /$sep/o;
    my $key = join $sep, @c[ @col ];
    if (not defined $lines{ $key }) {
        $lines{ $key } ||= [];
    };
    push @{ $lines{$key}}, $_
}

for my $key (sort keys %lines) {
    my @vals = split /$sep/o, $key;
    my $name = sprintf $tmpl, @vals;
    open my $fh, ">", $name
        or die "Couldn't create '$name': $!";
    if ($header) {
        print {$fh} $header;
    }
    print "$name$filename_sep"
        if $verbose;
    print {$fh} "$_\n"
        for (@{ $lines{ $key }});
};
Replies are listed 'Best First'.
Re: part - split up files according to column value
by jdporter (Canon) on Feb 12, 2007 at 17:53 UTC

    The following writes out each line as it is read in, just like the awk version. The cost is having a potentially very large number of file handles open at once — one per unique value seen in the given column. In fact, you might very easily run into your system's open filehandle limit. :-)

    This solution avoids a potential problem with file naming. Rather than name the output files with the actual value seen in the field, it uses its own one-up scheme. At the end of processing the input, it prints out a table mapping filenames to column values.

    It also properly handles the cases where the given column has no content (empty string) and does not exist in the row at all (undef). The output file number zero is reserved for the latter case.

    # config: my $field = 0; my $sep = "\t"; $, = $sep; $\ = $/; my %file; # { num, name, $fh } my $fnum = 1; while (<>) { chomp; my @c = split /$sep/o; my( $key, $num ) = defined $c[$field] ? ( $c[$field], $fnum++ ) : ( '(column not present)', 0 ); unless ( $file{$key} ) { $file{$key}{num} = $num; $file{$key}{name} = sprintf 'part.%03d', $file{$key}{num}; -f $file{$key}{name} and die "Sorry, '$file{$key}{name}' exists; won't clobber."; open $file{$key}{fh}, ">", $file{$key}{name} or die "Error opening '$file{$key}{name}' for write - $!"; } print {$file{$key}{fh}} @c; } print $file{$_}{name}, $_ for sort { $file{$a}{num} <=> $file{$b}{num} } keys %file;

    Update: Corion has suggested FileCache as a way to circumvent the open filehandle limit.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
      Hi, Could you give an example of how to run this. I am new to perl. Also how would you incorporate the Filecache in this example. I want to split a file based on the first column and save in file with the name as the name in the first column field without the quotations. ex. data:
      "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "2", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "4", "This" , "is" , "test", "data" "2", "This" , "is" , "test", "data" "3", "This" , "is" , "test", "data"
      would create four files named 1,2,3,4 with the data in it.
      file 1: "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" file 2: "2", "This" , "is" , "test", "data" "2", "This" , "is" , "test", "data" file 3: "3", "This" , "is" , "test", "data" file 4: "4", "This" , "is" , "test", "data"
      It is large file so I need to use the Filecache Thanks For any help

        You can start by telling us where you encounter problems and what difficulties you have incorporating FileCache into jdporter's code.

      I have just found this piece of code and it is perfect for a regular task I have. However I am struggling to get headers to work. If I use the header-line switch it ignores the headline for file creation but it does not seem to paste the header in to the top of each file it is creating. Any advice?

        jdporter's program has no option for headers. Did you mean to reply to my program?

        If so, please do consider telling me how you invoke the program, and what the layout of the first few rows of your input file is and what you get for output, so that I can reproduce the problem.

Re: part - split up files according to column value
by Corion (Pope) on Jul 14, 2009 at 12:30 UTC

    As I constantly mislay the version of the program that reads and writes Excel files, here it is:

    #!/usr/bin/perl -w use strict; use Getopt::Long; use vars qw($VERSION); use Spreadsheet::WriteExcel; use Spreadsheet::ParseExcel; use File::Glob qw(bsd_glob); use ExcelTools; $VERSION = '0.05'; # Try to load Pod::Usage and install a fallback if it doesn't exist eval { require Pod::Usage; Pod::Usage->import(); 1; } or do { *pod2usage = sub { die "Error in command line.\n"; }; }; GetOptions( 'out=s' => \my $tmpl, 'column=i' => \my $col, 'verbose' => \my $verbose, 'header-line:s' => \my $header, 'help' => \my $help, 'version' => \my $version, ) or pod2usage(2); pod2usage(1) if $help; if (defined $version) { print "$VERSION\n"; exit 0; }; pod2usage("$0: No files given.") if ((@ARGV == 0) && (-t STDIN)); if (! defined $tmpl) { # Let's hope we can guess from the first filename ($tmpl = $ARGV[0] || 'part.xls') =~ s/\.(\w+)$/-%s.$1/; }; $col ||= 0; $header ||= 0; my $header_cols; @ARGV = map { bsd_glob $_ } @ARGV; my %lines; for my $file (@ARGV) { my $wb = Spreadsheet::ParseExcel::Workbook->Parse($file); my $data = ExcelTools::sheet_data($wb->{Worksheet}->[0]); $header_cols = splice @$data, 0, $header; for my $c (@$data) { $lines{ $c->[$col]} ||= []; push @{ $lines{$c->[$col]} }, $c }; }; for my $key (sort keys %lines) { (my $clean = $key) =~ s/\s+$//ms; my $name = sprintf $tmpl, $clean; my $out = Spreadsheet::WriteExcel->new($name); my $sheet = $out->add_worksheet('Kandidaten'); unshift @{$lines{ $key }}, $header_cols if $header; $sheet->write_col( 'A1', $lines{ $key }); print "$name\n" #$filename_sep" if $verbose; }; __END__ =head1 NAME xlpart - split up an Excel file into multiple files according to a col +umn value =head1 SYNOPSIS part [OPTIONS] FILES =head1 OPTIONS =item B<--version> - print program version Outputs the program version. =item B<--help> - print this page Outputs this help text. =item B<--out> - set the output template If the output template is not given it is guessed from the name of the first input file or set to C<part-%s.txt>. The C<%s> will be replaced by the column value. =item B<--column> - set the column to part on This is the zero-based number of the column. =item B<--header-line> - output the first line into every file This defines the line as header line which is output into every file. If it is given an argument that string is output as header, otherwise the first line read will be repeated as the header. If the value is a number, that many lines will be read from the file and used as the header. This makes it impossible to use just a number as the header. =item B<--verbose> - output the generated filenames In normal operation, the program will be silent. If you need to know the generated filenames, the B<--verbose> option will output them. =head1 CAVEAT The program loads the whole input into RAM before writing the output. A future enhancement might be a C<uniq>-like option that tells the program to assume that the input will be grouped according to the parted column so it does not need to allocate memory. If your memory is not large enough, the following C<awk> one-liner might help you: # Example of parting on column 3 awk -F '{ print $0 > $3 }' FILE =head1 AUTHOR Max Maischein (C<< corion@cpan.org >>)
Re: part - split up files according to column value
by toolic (Bishop) on Feb 09, 2011 at 15:50 UTC
    There are a couple minor POD errors in version 6:
    perldoc part ... POD ERRORS Hey! The above document had some coding errors, which are expla +ined below: Around line 29: '=item' outside of any '=over' Around line 77: You forgot a '=back' before '=head1'
    Here is a patch:
    --- part.6 2011-02-09 10:44:22.000000000 -0500 +++ part.fix 2011-02-09 10:45:32.000000000 -0500 @@ -26,6 +26,8 @@ =head1 OPTIONS +=over + =item B<--out> - set the output template If the output template is not given it is guessed from @@ -74,6 +76,8 @@ =item B<--version> - output version information +=back + =head1 CAVEAT The program loads the whole input into RAM
    Also, it would be helpful if you could add a few example command lines in your POD, like the one you showed in Re^6: part - split up files according to column value.

      Monks: Trying to modify script to give me a subtotal of column 14 (see below.)

      while (<>) { s/\r?\n$//; my @c = split /$sep/o; my $sub_total = 0; $sub_total += $c[14]; my $key = join $sep, @c[ @col ]; if (not defined $lines{ $key }) { $lines{ $key } ||= []; }; push @{ $lines{$key}}, $_ push @{ $totals{$key}}, $sub_total }

      My code is failing with the following: "my" variable %lines masks earlier declaration in same scope at ./part-v2.pl line 179. "my" variable $key masks earlier declaration in same statement at ./part-v2.pl line 180. syntax error at ./part-v2.pl line 176, near "$_ push" syntax error at ./part-v2.pl line 176, near "}}" Global symbol "$sub_total" requires explicit package name at ./part-v2.pl line 177.

      syntax error at ./part-v2.pl line 191, near "}" Execution of ./part-v2.pl aborted due to compilation errors.

      ========================

      Any help would be appreciated.

        Statements end with semicolons. Blocks do not:

        } push @{ $lines{$key}}, $_; push @{ $totals{$key}}, $sub_total; }

        You also have a problem in that you instantiate $sub_total with a value of zero every time through the loop, right before you add a value to it, so it will always equal that value. This is probably not what you want. You should move the my line that creates it to before your loop, so it can accumulate inside the loop.

        Aaron B.
        Available for small or large Perl jobs; see my home node.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://598718]
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (7)
As of 2018-07-19 08:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?















    Results (404 votes). Check out past polls.

    Notices?