<?xml version="1.0" encoding="windows-1252"?>
<node id="1013753" title="Re: Split a file based on column" created="2013-01-17 05:53:07" updated="2013-01-17 05:53:07">
<type id="11">
note</type>
<author id="880879">
space_monk</author>
<data>
<field name="doctext">
&lt;p&gt;All of the above answers seem liable to run into filehandle limits. Personally I would read the entire file into a hash of arrays, and then write each array out to a file named after its hash key. This has the advantage that only one file is open at any time. I will stick my neck out and say it will also be faster, due to less file I/O.&lt;/p&gt;

&lt;p&gt;As a second comment, you should use something like [mod://Text::CSV] to get the data, but if you want it quick and dirty there's a good argument for using split instead of a regex here.&lt;/p&gt;
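
&lt;p&gt;A minimal sketch of the quick and dirty route, assuming plain pipe-delimited fields with no quoting or embedded delimiters (the sample line is made up for illustration; anything messier is a job for Text::CSV):&lt;/p&gt;

&lt;code&gt;
use strict;
use warnings;

# Hypothetical sample line; real input would come from &lt;&gt;.
my $line = "alpha|north|42||\n";
chomp $line;

# The pipe is a regex metacharacter, so it must be escaped.
# A limit of -1 keeps trailing empty fields instead of dropping them.
my @elems = split /\|/, $line, -1;

print scalar(@elems), " fields; key column is '$elems[1]'\n";
&lt;/code&gt;

&lt;p&gt;An unescaped pipe gives an empty alternation that matches between every character, so the line would be split into single characters rather than fields.&lt;/p&gt;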

&lt;p&gt;&lt;b&gt;Amount of Data:&lt;/b&gt; 300k rows * 64k per row = approx 19.6GB of data, which may cause problems if it is all held in memory, so maybe a compromise is to write the data out when an array gets to a certain size.&lt;/p&gt;

&lt;p&gt;The following (untested and not debugged) shows the idea. It assumes you specify the file(s) you want to read from on the command line.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Update:&lt;/b&gt; Changed when it writes to file as a result of a comment from [davido].&lt;/p&gt;

&lt;code&gt;
use strict;
use warnings;

use constant ROW_LIMIT =&gt; 10000;

# Append the buffered rows for one key to its output file.
sub writeData {
    my ($name, $data) = @_;
    open my $fh, '&gt;&gt;', "sample_$name"
        or die "Cannot open sample_$name: $!";
    print {$fh} @$data;
    close $fh
        or die "Cannot close sample_$name: $!";
}

my %hash;
my $ctr = 0;
while (my $line = &lt;&gt;) {
    # the pipe is a regex metacharacter, so it must be escaped
    my @elems = split /\|/, $line;
    my $idx = $elems[1];

    # autovivification creates the array on first use
    push @{ $hash{$idx} }, $line;

    # if we've got too much data, write it out
    if (++$ctr &gt;= ROW_LIMIT) {
        # write data to each file...
        foreach my $key (keys %hash) {
            writeData($key, $hash{$key});
        }
        %hash = ();
        $ctr  = 0;
    }
}

# write remaining data to each file...
foreach my $key (keys %hash) {
    writeData($key, $hash{$key});
}
&lt;/code&gt;



&lt;!-- Node text goes above. Div tags should contain sig only --&gt;
&lt;div class="pmsig"&gt;&lt;div class="pmsig-880879"&gt;
A Monk aims to give answers to those who have none, and to learn from those who know more.
&lt;/div&gt;&lt;/div&gt;</field>
<field name="root_node">
1013641</field>
<field name="parent_node">
1013641</field>
</data>
</node>
