My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, would definitely not follow the letter of them.
Knowing that I would generate N files, I would retrieve the ordered list from step 1. I would then create an array of arrays (AoA) and push each file name into the appropriate sub-array. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:
@set = (
    [ 'file00.csv', 'file01.csv', 'file02.csv', ],
    [ 'file03.csv', 'file04.csv', 'file05.csv', ],
    [ 'file06.csv', 'file07.csv', ],
    [ 'file08.csv', 'file09.csv', ],
    [ 'file10.csv', 'file11.csv', ],
);
The partitioning would be accomplished by a loop similar to the following:
# my $n = 5;
my @set;
my $file_count;
my $partition_size;
my $remainder;
$file_count = scalar @file;    # 12
if ( $file_count >= $n ) {
    $partition_size = int( $file_count / $n );    # 2
    $remainder      = $file_count % $n;           # 2
}
else {
    $partition_size = 1;
    $remainder      = 0;
}
my $i = 0;
while ( scalar @file ) {
    foreach my $j ( 1 .. $partition_size ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    if ( $i < $remainder ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    $i++;
}
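The loop above empties @file as it goes. As a minimal sketch of the same split that works on a copy instead, the logic could be wrapped in a sub. The name partition_files and its argument order are my own assumptions for illustration, not part of the code above:

```perl
use strict;
use warnings;

# Hypothetical helper (name and signature are assumptions):
# split a list of names into at most $n groups, giving the
# first ($count % $n) groups one extra element each.
sub partition_files {
    my ( $n, @names ) = @_;
    my $count = scalar @names;
    return () if $n < 1 || $count == 0;

    # Fewer files than requested partitions: one file per group.
    $n = $count if $n > $count;

    my $size      = int( $count / $n );
    my $remainder = $count % $n;

    my @groups;
    for my $i ( 0 .. $n - 1 ) {
        my $take = $size + ( $i < $remainder ? 1 : 0 );
        push @groups, [ splice @names, 0, $take ];
    }
    return @groups;
}

# Same example as above: 12 files, 5 requested partitions.
my @file = map { sprintf 'file%02d.csv', $_ } 0 .. 11;
my @set  = partition_files( 5, @file );
print join( ' ', @$_ ), "\n" for @set;
```

For the 12-file example this prints five lines holding 3, 3, 2, 2 and 2 names, and @file is left untouched because the sub splices its own copy of the list.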
At this point, it would seem at first blush relatively easy to open each intended output file and loop through its list of input files, using Text::CSV to read them line by line (skipping the first line of each) and writing the lines to the output file through an IO::Compress::Gzip file handle and Text::CSV's print() method.
This avoids writing the temporary file, and removes the need for a marker to avoid splitting lines from an input file when writing the subfiles.
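A rough sketch of that merge step, for one output file, might look like the following. The sub name merge_csv_gz is hypothetical, and I am assuming that every input file starts with exactly one header line and that the output should contain data rows only, since that is all the description above specifies:

```perl
use strict;
use warnings;
use Text::CSV;
use IO::Compress::Gzip qw($GzipError);

# Hypothetical helper (name and signature are assumptions):
# append the data rows of each input CSV to a single
# gzip-compressed output file, skipping each first line.
sub merge_csv_gz {
    my ( $out_name, @inputs ) = @_;

    # Separate parser and writer so that setting eol for
    # output does not affect how input lines are read.
    my $reader = Text::CSV->new( { binary => 1, auto_diag => 1 } );
    my $writer = Text::CSV->new( { binary => 1, eol => "\n" } );

    my $out = IO::Compress::Gzip->new($out_name)
        or die "Cannot open $out_name: $GzipError";

    for my $in_name (@inputs) {
        open my $in, '<', $in_name
            or die "Cannot open $in_name: $!";
        $reader->getline($in);    # assumed header line, discarded
        while ( my $row = $reader->getline($in) ) {
            $writer->print( $out, $row );
        }
        close $in;
    }
    $out->close
        or die "Cannot close $out_name: $GzipError";
    return;
}
```

It would be called once per partition, for example as merge_csv_gz( "out$i.csv.gz", @{ $set[$i] } ), where the output file name is only a placeholder.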
Thoughts?
Code implementing the above process:
2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).
2019-08-13: Added code implementing the described process.
2019-08-13: Reformatted added code using perltidy -l 60 -ple.