My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, would definitely not follow the letter of them.

Knowing that I would generate N files, I would retrieve the ordered list from step 1. At that point, I would create an AoA (array of arrays) into which I would push the appropriate file names. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

    @set = (
                [ 'file00.csv', 'file01.csv', 'file02.csv', ],
                [ 'file03.csv', 'file04.csv', 'file05.csv', ],
                [ 'file06.csv', 'file07.csv', ],
                [ 'file08.csv', 'file09.csv', ],
                [ 'file10.csv', 'file11.csv', ],
            );

The partitioning would be accomplished by a loop similar to the following:

    my $n = 5;    # target number of output files (@file is the ordered list from step 1)
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;

    $file_count     = scalar @file;                   # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder = 0;
    }
    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{$set[$i]}, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{$set[$i]}, $fn;
        }
        $i++;
    }
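For a worked check of the arithmetic, here is a non-destructive variant of the same idea, offered only as a sketch. It uses splice on a copy of the list instead of repeated shift, and the name partition_list is hypothetical rather than part of the code in this post. With 12 names and a target of 5, int( 12 / 5 ) is 2 and 12 % 5 is 2, so the first two groups get 3 names each and the remaining three get 2 each.

```perl
use strict;
use warnings;

# Hypothetical helper: split @$items into at most $n groups whose
# sizes differ by no more than one, keeping the original order.
sub partition_list {
    my ( $items, $n ) = @_;

    my @rest = @{$items};    # work on a copy; the caller's list survives
    return if $n < 1 || !@rest;

    $n = @rest if $n > @rest;    # never create empty groups
    my $size      = int( @rest / $n );
    my $remainder = @rest % $n;

    my @set;
    foreach my $i ( 0 .. $n - 1 ) {

        # the first $remainder groups take one extra item
        my $take = $size + ( $i < $remainder ? 1 : 0 );
        push @set, [ splice @rest, 0, $take ];
    }
    return @set;
}

my @file = map { sprintf q{file%02d.csv}, $_ } 0 .. 11;
my @set  = partition_list( \@file, 5 );

# prints "3 3 2 2 2"
print join( q{ }, map { scalar @{$_} } @set ), qq{\n};
```

Because the input is copied, @file still holds all 12 names afterwards, which can be handy if the list is needed again for reporting.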

At this point it would seem, at first blush, relatively easy to open each intended output file, loop through its list of input files using Text::CSV to read them line by line (skipping the first line of each), and write the lines to the output file through an IO::Compress::Gzip file handle and Text::CSV's print() method.

This avoids writing the temporary file, or having to add a marker to avoid splitting lines from an input file when writing the subfiles.
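As a rough sketch of that idea (certainly not the only way to do it), the helper below appends the data rows of several CSV inputs to one gzip stream, dropping the first line of each input. The name append_csv_rows, the in-memory sample data, and the choice to pass already-opened handles are assumptions made for illustration; Text::CSV and IO::Compress::Gzip are the modules named above.

```perl
use strict;
use warnings;

use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;

# Hypothetical helper: copy every row except the first (header) row
# from each input handle to the output handle; returns the row count.
sub append_csv_rows {
    my ( $csv, $out, @in ) = @_;

    my $rows = 0;
    foreach my $ifh (@in) {
        $csv->getline($ifh);    # read and discard the header row
        while ( my $row = $csv->getline($ifh) ) {
            $csv->print( $out, $row ) or die qq{print failed\n};
            $rows++;
        }
    }
    return $rows;
}

# Two small in-memory "files" stand in for real CSV files (assumed data).
my @data = ( qq{id,name\n1,"a,b"\n2,c\n}, qq{id,name\n3,"d\ne"\n}, );
my @in;
foreach my $text (@data) {
    open my $fh, q{<}, \$text
      or die qq{in-memory open failed: $!\n};
    push @in, $fh;
}

my $gz  = q{};    # compressed output is collected in this scalar
my $csv = Text::CSV->new(
    { binary => 1, auto_diag => 1, eol => qq{\n}, } );
my $z = IO::Compress::Gzip->new( \$gz )
  or die qq{IO::Compress::Gzip failed: $GzipError\n};

my $rows = append_csv_rows( $csv, $z, @in );
$z->close;
```

Reading rows through Text::CSV rather than splitting on newlines is what keeps a quoted field with an embedded newline (the third data row here) in one piece.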

Thoughts?

Code implementing the above process:

#!/usr/bin/perl

use strict;
use warnings;

use Cwd;
use Data::Dumper;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;

$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;
srand();

my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path         = q{./};

$outfile_name =~ s/\.pl.*$//g;

GetOptions(
    q{help} => sub {
        &help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);

my $start_dir = getcwd;

if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}

my @file = get_files( path => $path, );
my @set =
  partition_files( files => \@file, n => $output_files, );
write_subfiles( set => \@set, prefix => $outfile_name, );

#
# Subroutines
#
sub help {
    my ( %param, ) = @_;

    print sprintf
      <<HELP_TEXT, $param{outfile_name}, $param{output_files}, $param{path};

Usage:
        $0
        $0 [--help]
        $0 [--output_files N] [--outfile_name str] [--path str]

Where:
    outfile_name str       - Output filename prefix
                               (naming will be {prefix}-nn.csv.gz;
                               default: %s).
    output_files N         - Divide data into at most N files
                               (data in the same input file
                               will appear in the same file;
                               default: %d).
    path str               - Path to process
                               (default: %s).

HELP_TEXT
    exit;
}

sub get_files {
    my ( %param, ) = @_;

    my @file = ();

    if ( !exists $param{path} ) {
        return @file;
    }

    opendir my $dir, $param{path} or die $!;
    while ( my $fn = readdir($dir) ) {
        next if ( $fn =~ m/^\.{1,2}$/ );    # skip . and ..
        next unless ( $fn =~ m/\.csv$/i );
        # keep the directory so -s and open work when --path is not ./
        push @file, qq{$param{path}/$fn};
    }
    closedir $dir;

    @file = sort { -s $a <=> -s $b } @file;

    return @file;
}

sub partition_files {
    my (%param) = @_;

    my @set;

    my $file_count;
    my $partition_size;
    my $remainder;

    my $n    = $param{n};
    my @file = @{ $param{files} };

    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }
    my $i = 0;
    while ( scalar @file ) {

        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }

    return @set;
}

sub write_subfiles {
    my (%param) = @_;

    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    # nothing to write (and log(0) below would be fatal)
    return if !@set;

    my $name_format =
        $prefix . q{-} . q{%0}
      . int( log( scalar @set ) / log(10) + 1 + 1 ) . q{d}
      . q{.csv} . q{.gz};

    my $csv =
      Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );

    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;

        my $z = IO::Compress::Gzip->new(
            $fn,
            -Level => IO::Compress::Gzip::Z_BEST_COMPRESSION,
          )
          or die
          qq{IO::Compress::Gzip failed: $GzipError\n};

        foreach my $ifn ( @{ $set[$i] } ) {
            my $flag = 1;
            open my $ifh, q{<:encoding(utf8)}, $ifn
              or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {
                if ($flag) {
                    $flag--;
                    next;
                }
                my $status = $csv->print( $z, $row, );
                $row = undef;
            }
            close $ifh;
        }
        $z->close;
    }
}
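The zero-padding width in $name_format is computed with logarithms, which Perl cannot take of zero, and it yields one digit more than the largest index needs. If an exact digit count were wanted instead, one possible alternative (a swapped-in technique, not what the code above does) is to take the string length of the largest index. The name index_width and the out- prefix below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: digits needed to print the largest index of a
# list with $count elements (at least 1, so an empty list is harmless).
sub index_width {
    my ($count) = @_;

    my $max_index = $count > 0 ? $count - 1 : 0;
    return length $max_index;
}

# Build a name format for 12 output files (indices 0 .. 11).
my $name_format = q{out-%0} . index_width(12) . q{d.csv.gz};
my $name        = sprintf $name_format, 7;

# prints "out-07.csv.gz"
print $name, qq{\n};
```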

2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).

2019-08-13: Added code implementing the described process.

2019-08-13: Reformatted added code using perltidy -l 60 -ple.

In reply to Re: Complex file manipulation challenge by atcroft
in thread Complex file manipulation challenge by jdporter
