My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, would definitely not follow the letter of them.
Knowing that I would generate N files, I would retrieve the ordered list from step 1. I would then build an array of arrays (AoA), pushing each file name onto the sub-array for its output file. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:
@set = (
    [ 'file00.csv', 'file01.csv', 'file02.csv', ],
    [ 'file03.csv', 'file04.csv', 'file05.csv', ],
    [ 'file06.csv', 'file07.csv', ],
    [ 'file08.csv', 'file09.csv', ],
    [ 'file10.csv', 'file11.csv', ],
);
The partitioning would be accomplished by a loop similar to the following:
# my $n = 5;
my @set;
my $file_count;
my $partition_size;
my $remainder;
$file_count = scalar @file;    # 12
if ( $file_count >= $n ) {
    $partition_size = int( $file_count / $n );    # 2
    $remainder      = $file_count % $n;           # 2
}
else {
    $partition_size = 1;
    $remainder      = 0;
}
my $i = 0;
while ( scalar @file ) {
    foreach my $j ( 1 .. $partition_size ) {
        my $fn = shift @file;
        push @{$set[$i]}, $fn;
    }
    if ( $i < $remainder ) {
        my $fn = shift @file;
        push @{$set[$i]}, $fn;
    }
    $i++;
}
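As a quick sanity check on the loop above, here is a minimal sketch that computes only how many files land in each partition. The helper name partition_sizes is hypothetical (it is not part of the code in this post); it is meant to apply the same rule, where the first $remainder partitions each take one extra file.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): returns the number
# of files each partition receives under the same splitting rule.
sub partition_sizes {
    my ( $file_count, $n ) = @_;
    return () if $file_count < 1 || $n < 1;

    # Fewer files than requested partitions: one file per partition.
    if ( $file_count < $n ) {
        my @ones = (1) x $file_count;
        return @ones;
    }

    my $partition_size = int( $file_count / $n );
    my $remainder      = $file_count % $n;

    # The first $remainder partitions each take one extra file.
    my @sizes = map { $partition_size + ( $_ < $remainder ? 1 : 0 ) }
        0 .. $n - 1;
    return @sizes;
}

print join( q{ }, partition_sizes( 12, 5 ) ), "\n";    # 3 3 2 2 2
print join( q{ }, partition_sizes( 3,  5 ) ), "\n";    # 1 1 1
```

The first line matches the 12-file, 5-output example above; the second shows the case of fewer files than requested partitions.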
At this point, it seems relatively easy to open each intended output file, loop through its list of input files using Text::CSV to read them line by line (skipping the first line of each), and write the rows to the output file using an IO::Compress::Gzip file handle and Text::CSV's print() method.
This avoids writing the temporary file, or having to add a marker to avoid splitting lines from an input file when writing the subfiles.
Thoughts?
Code implementing the above process:
#!/usr/bin/perl
use strict;
use warnings;
use Cwd;
use Data::Dumper;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;
$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;
$| = 1;
srand();
my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path = q{./};
$outfile_name =~ s/\.pl.*$//g;
GetOptions(
    q{help} => sub {
        &help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);
my $start_dir = getcwd;
if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}
my @file = get_files( path => $path, );
my @set =
    partition_files( files => \@file, n => $output_files, );
write_subfiles( set => \@set, prefix => $outfile_name, );
#
# Subroutines
#
sub help {
    my ( %param, ) = @_;
    print sprintf
        <<HELP_TEXT, $param{outfile_name}, $param{output_files}, $param{path};
Usage:
  $0
  $0 [--help]
  $0 [--output_files N] [--outfile_name str] [--path str]
Where:
  outfile_name str - Output filename prefix
                     (naming will be {prefix}-nn.csv.gz;
                     default: %s).
  output_files N   - Divide data into at most N files
                     (data in the same input file
                     will appear in the same file;
                     default: %d).
  path str         - Path to process
                     (default: %s).
HELP_TEXT
    exit;
}
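sub get_files {
    my ( %param, ) = @_;
    my @file = ();
    if ( !exists $param{path} ) {
        return @file;
    }
    opendir my $dir, $param{path} or die $!;
    while ( my $fn = readdir($dir) ) {

        # Skip the . and .. entries (the dot must be escaped).
        next if ( $fn =~ m/^\.{1,2}$/ );
        next unless ( $fn =~ m/\.csv$/i );

        # Keep the directory with the name so that -s below,
        # and the later open(), also work when the path is
        # not the current directory.
        push @file, $param{path} . q{/} . $fn;
    }
    closedir $dir;

    # Ascending by file size.
    @file = sort { -s $a <=> -s $b } @file;
    return @file;
}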
sub partition_files {
    my (%param) = @_;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;
    my $n    = $param{n};
    my @file = @{ $param{files} };

    # Guard against division by zero below (for example,
    # --output_files given without a value yields 0).
    die qq{n must be at least 1\n} if $n < 1;
    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }
    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }
    return @set;
}
sub write_subfiles {
    my (%param) = @_;
    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    # Nothing to write; this also avoids log(0) below.
    return if !@set;
    my $name_format =
          $prefix . q{-} . q{%0}
        . int( log( scalar @set ) / log(10) + 1 + 1 ) . q{d}
        . q{.csv} . q{.gz};
    my $csv =
        Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );
    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;
        my $z  = IO::Compress::Gzip->new(
            $fn,
            -Level =>
                IO::Compress::Gzip::Z_BEST_COMPRESSION,
            )
            or die
            qq{IO::Compress::Gzip failed: $GzipError\n};
        foreach my $ifn ( @{ $set[$i] } ) {

            # $flag is true only for the first line of each
            # input file, which is skipped.
            my $flag = 1;
            open my $ifh, q{<:encoding(utf8)}, $ifn
                or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {
                if ($flag) {
                    $flag--;
                    next;
                }
                my $status = $csv->print( $z, $row, );
                $row = undef;
            }
            close $ifh;
        }
        $z->close;
    }
}
2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).
2019-08-13: Added code implementing the described process.
2019-08-13: Reformatted added code using perltidy -l 60 -ple.