Re: Split a file based on column

by space_monk (Chaplain)
on Jan 17, 2013 at 10:53 UTC


in reply to Split a file based on column

All of the above answers seem to have problems with possible filehandle limits; personally I would read the entire file and convert it to a hash of arrays, then write each array out to a file named after its hash key. This has the advantage that only one output file is open at any time. I will stick my neck out and say it will also be faster, due to less file I/O.

As a second comment, you should use something like Text::CSV to parse the data, but if you want it quick and dirty there's a good argument for using split instead of a regex here.
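
For illustration, a minimal sketch of both approaches on one line of input (the pipe delimiter, the field layout and the key sitting in the second column are assumptions, not taken from the original post):

    use strict;
    use warnings;
    use Text::CSV;

    my $line = "foo|B00123|bar|baz";   # hypothetical pipe-delimited record

    # Quick and dirty: split on a literal pipe. Note the backslash;
    # an unescaped | in the pattern would split between every character.
    my @fields = split /\|/, $line;
    print "split key: $fields[1]\n";

    # More robust: Text::CSV with '|' as the separator, which also copes
    # with quoted fields that contain embedded separators.
    my $csv = Text::CSV->new({ binary => 1, sep_char => '|' })
        or die "Cannot construct Text::CSV: " . Text::CSV->error_diag();
    $csv->parse($line) or die "Parse failed: " . $csv->error_input();
    my @cols = $csv->fields();
    print "Text::CSV key: $cols[1]\n";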

Amount of data: 300k rows × 64KB per row ≈ 19.6GB, which may cause memory problems, so maybe a compromise is to write the data out whenever an array reaches a certain size.

The following (untested, undebugged) code shows the idea; it assumes you specify the file(s) you want to read from on the command line.

Update: Changed when the data is written to file as a result of a comment by davido.

use constant ROW_LIMIT => 10000;

sub writeData {
    my ($name, $data) = @_;
    open my $fh, '>>', "sample_$name" or die "Cannot open sample_$name: $!";
    print $fh @$data;
    close $fh;    # may not be needed (lexical handle closes when it goes out of scope)
}

my %hash;
my $ctr = 0;
while (<>) {
    my @elems = split /\|/;          # escape the pipe, or split breaks on every character
    my $idx   = $elems[1];
    if (exists $hash{$idx}) {
        # save to existing array
        push @{ $hash{$idx} }, $_;
    } else {
        # create new array
        $hash{$idx} = [ $_ ];
    }
    # if we've got too much data, write it out
    if ($ctr++ >= ROW_LIMIT) {
        # write data to each file...
        foreach my $key (keys %hash) {
            writeData($key, $hash{$key});
            delete $hash{$key};
        }
        $ctr = 0;
    }
}
# write remaining data to each file...
foreach my $key (keys %hash) {
    writeData($key, $hash{$key});
}
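
Assuming the above is saved as split_by_column.pl (the name is hypothetical), it would be run with the input file(s) as arguments:

    perl split_by_column.pl bigfile.dat

Every distinct value in the second column then accumulates in its own sample_<value> output file, which is appended to each time roughly ROW_LIMIT buffered rows have built up.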
A Monk aims to give answers to those who have none, and to learn from those who know more.


Re^2: Split a file based on column
by Anonymous Monk on Jan 17, 2013 at 10:59 UTC

    All of the above answers seem to have problems with possible filehandle limits;

    Re: Split a file based on column doesn't, and it also doesn't suffer from loading the whole file into RAM.

      You caught my comment whilst it was being drafted; I did state another reason for the approach I suggested.

      Memory is almost never a problem nowadays unless you're running it on your 15-year-old PC, but 300k rows × 64KB per row (~19GB?) may give some pause for thought. Time to go shopping for more memory or a bigger cache. :-)

      A Monk aims to give answers to those who have none, and to learn from those who know more.

        Loading a 19GB file into memory does indeed give pause for thought... a long, long pause. :) Time enough to contemplate approaches that do scale well.

        Your accumulate-and-write-when-full strategy is a pretty good idea. It would be a data cache rather than a filehandle cache, and the implementation ought to be pretty straightforward. Implementing the file-handle LFU cache seems like it would be more fun, though (a rough sketch follows below).


        Dave
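
        For illustration, a minimal sketch of the file-handle cache idea, LFU flavour (the 100-handle limit, the sample_ filename prefix and the pipe-delimited second-column key are assumptions):

            use strict;
            use warnings;

            use constant MAX_HANDLES => 100;   # assumed cap, kept below the OS filehandle limit

            my %fh;      # key => open output filehandle
            my %uses;    # key => how many times that handle has been used

            sub handle_for {
                my ($key) = @_;
                unless (exists $fh{$key}) {
                    # cache full: evict the least-frequently-used handle
                    if (keys %fh >= MAX_HANDLES) {
                        my ($victim) = sort { $uses{$a} <=> $uses{$b} } keys %fh;
                        close $fh{$victim};
                        delete $fh{$victim};
                        delete $uses{$victim};
                    }
                    # append mode, so an evicted key loses nothing when reopened later
                    open $fh{$key}, '>>', "sample_$key"
                        or die "Cannot open sample_$key: $!";
                    $uses{$key} = 0;
                }
                $uses{$key}++;
                return $fh{$key};
            }

            while (<>) {
                my @elems = split /\|/;
                print { handle_for($elems[1]) } $_;
            }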
