
Split a file based on column

by brad_nov (Novice)
on Jan 16, 2013 at 20:16 UTC
brad_nov has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file sample_1.txt (300k rows) with data like the following; each record is around 64k bytes:

11|1|abc|102553|125589|64k bytes of data
10|2|def|123452|123356|......
13|2|geh|144351|121123|...
25|4|fgh|165250|118890|..
14|1|abc|186149|116657|......
21|7|def|207048|114424|......
23|7|geh|227947|112191|......
26|32|fgh|248846|109958|......
27|23|abc|269745|107725|......
29|34|def|290644|105492|......
30|32|geh|311543|103259|......
33|23|fgh|332442|101026|......
35|34|abc|353341|98793|......
37|7|def|374240|96560|......
39|4|geh|395139|94327|......
41|2|fgh|416038|92094|......
44|23|abc|436937|89861|......
46|1|def|457836|87628|......
48|3|geh|478735|85395|......
50|23|fgh|499634|83162|......

I am trying to split the file based on the 2nd column, like below:

sample_1_1.txt
11|1|abc|102553|125589|......
14|1|abc|186149|116657|......
46|1|def|457836|87628|......

sample_1_2.txt
10|2|def|123452|123356|......
13|2|geh|144351|121123|......
41|2|fgh|416038|92094|......

and so on. Could someone help me with this? Thanks in advance.

Replies are listed 'Best First'.
Re: Split a file based on column
by davido (Archbishop) on Jan 16, 2013 at 21:03 UTC

    I would do something like this (untested):

    use strict;
    use warnings;
    use autodie;

    use constant IN_FN => 'sample_1.txt';

    my %handles;

    open my $infh, '<', IN_FN;
    while ( <$infh> ) {
        my ( $key ) = m/^[^|]+\|([^|]+)/;    # capture the 2nd pipe-delimited field
        if ( ! defined $key ) {
            warn "Line $. appears malformed. Skipping: $_";
            next;
        }
        open $handles{$key}, '>', "sample_1_$key.txt"
            unless exists $handles{$key};
        print {$handles{$key}} $_;
    }
    close $_ for $infh, values %handles;

    You didn't mention the need, but it would be pretty easy to adapt this to work with a list of input files. Just replace the constant with code to deal with different input filenames, and put it in a loop. :)

    What I like about this solution is that you only open each output file once, and then just keep track of the file handles as values in a hash, indexed on the key parsed from the 2nd column.

    Update: This solution has the efficiency advantage of not having to re-open an output file if it's already been opened before. But johngg correctly observed that at some point it's possible to get a "Too many open files" error. On one of my systems that kicked in after trying to open 1020 files simultaneously. My solution assumes that column two holds two digits, which would yield just under 100 possible output files. That should be ok.

    However, if it turns out that you're exceeding the number of allowable open files on your system, you can open/close on each iteration (the simplest solution).
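    For reference, the per-process cap varies by system; on Unix-like systems you can check it (and often raise the soft limit for the current session) from a POSIX-ish shell:

```shell
# show the per-process open file descriptor limit
ulimit -n
# raise the soft limit for this shell session, up to the hard limit
# (uncomment to try; the exact ceiling is system-dependent)
# ulimit -n 4096
```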


      davido, brad_nov:

      I saw davido's solution, and played around with it to add a limit to the number of open files in %handles using a least-recently used (LRU) cache. No real reason, but I thought I'd amuse myself while my son got ready for bed.

      You could trim it down a bit, as much of the code just implements traces to show what's happening as it runs.

      $ cat
      #!/usr/bin/perl
      # Updated PM 1013651 to have a limit on file handles
      use strict;
      use warnings;
      use autodie;
      use 5.10.0;

      my %handles;
      my $MAX_OPEN_FH = 3;

      while ( <DATA> ) {
          my ( $key ) = m/^[^|]\|([^|]+)/;
          if ( ! defined $key ) {
              warn "Line $. appears malformed. Skipping: $_";
              next;
          }
          print {FH("$key.txt")} $_;
      }
      close $$_{FH} for values %handles;

      sub FH {
          # Return file handle for named file
          state $cnt = 0;
          my $key = shift;

          # Return current handle if it exists
          if ( exists $handles{$key} ) {
              $handles{$key}{cnt} = ++$cnt;
              print "$key: (cnt=$cnt) found\n";
              return $handles{$key}{FH};
          }

          # Doesn't exist, retire the "oldest" one if we're at the limit
          if ( keys %handles >= $MAX_OPEN_FH ) {
              my @tmp = sort { $$a{cnt} <=> $$b{cnt} } values %handles;
              say "$key: Too many open files, close one: ",
                  join(", ", map { "$$_{FName}:$$_{cnt}" } @tmp);
              my $hr = $tmp[0];
              print "  closing $$hr{FName}\n";
              close $$hr{FH};
              delete $handles{$$hr{FName}};
          }

          open my $FH, '>>', $key;
          $handles{$key} = { cnt=>++$cnt, FName=>$key, FH=>$FH };
          print "$key: opened new file ($cnt)\n";
          return $FH;
      }

      __DATA__
      a|1|foo
      b|1|bar
      c|2|baz
      d|1|xyzzy
      e|2|blarg
      f|2|The
      g|3|quick
      h|2|red
      i|2|fox
      j|3|jumped
      k|4|over
      l|1|the
      m|1|lazy
      n|1|brown
      o|1|dog
      p|5|gorgonzola

      Running it gives me:

      $ ./
      1.txt: opened new file (1)
      1.txt: (cnt=2) found
      2.txt: opened new file (3)
      1.txt: (cnt=4) found
      2.txt: (cnt=5) found
      2.txt: (cnt=6) found
      3.txt: opened new file (7)
      2.txt: (cnt=8) found
      2.txt: (cnt=9) found
      3.txt: (cnt=10) found
      4.txt: Too many open files, close one: 1.txt:4, 2.txt:9, 3.txt:10
        closing 1.txt
      4.txt: opened new file (11)
      1.txt: Too many open files, close one: 2.txt:9, 3.txt:10, 4.txt:11
        closing 2.txt
      1.txt: opened new file (12)
      1.txt: (cnt=13) found
      1.txt: (cnt=14) found
      1.txt: (cnt=15) found
      5.txt: Too many open files, close one: 3.txt:10, 4.txt:11, 1.txt:15
        closing 3.txt
      5.txt: opened new file (16)


      When your only tool is a hammer, all problems look like your thumb.

        Good job, roboticus. I was thinking instead of a solution that would keep track of the frequency of use of the open filehandles: whenever 'open' fails due to too many open files, drop the least-used handle. But I wasn't sure how to implement the frequency structure. A heap (priority queue) sounds good, except that it's probably relatively expensive to update a filehandle's priority each time it's used; most heap implementations would just delete and re-insert the element being modified. It seems like there must be a solution that isn't prohibitively expensive, but I'm drawing a blank.

        There must be something on CPAN, but regardless, it would be nice to know how best to implement such a "priority cache". ;)
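        For what it's worth, core Perl ships with the FileCache module, which does roughly this: it lets you "open" more files than the system allows by transparently closing and re-opening handles behind the scenes. A minimal sketch under the OP's assumed sample_1_*.txt naming, assuming a reasonably modern Perl where cacheout returns the filehandle:

```perl
use strict;
use warnings;
use FileCache;    # core module: juggles more handles than the OS limit

my %written;      # remember which output files we touched
while ( my $line = <DATA> ) {
    # capture the 2nd pipe-delimited field as the key
    my ($key) = $line =~ /^[^|]+\|([^|]+)\|/ or next;
    my $file = "sample_1_$key.txt";
    # cacheout returns an open handle, closing and later re-opening
    # (in append mode) files as needed to stay under the system limit
    my $fh = cacheout '>>', $file;
    print $fh $line;
    $written{$file} = 1;
}
# flush and close every file we wrote; cacheout returns the
# still-cached handle for files that are currently open
close( cacheout( '>>', $_ ) ) for keys %written;

__DATA__
11|1|abc
10|2|def
14|1|ghi
```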


Re: Split a file based on column
by Kenosis (Priest) on Jan 16, 2013 at 20:43 UTC

    Although your data's a good candidate for a CSV parsing module, perhaps the following will assist:

    use strict;
    use warnings;

    while (<>) {
        if ( my ($col2) = /^\d+\|(\d+?)\|/ ) {
            open my $fh, '>>', 'sample_1_' . $col2 . '.txt' or die $!;
            print $fh $_;
        }
    }

    Usage: perl sample_1.txt

    The regex captures the value in the second column, and that's used later in creating the file name. The file's opened for appending, and the line's written to it. Since a lexically-scoped file handle is used, the file's automatically closed at the end of the if code block.

      Thanks, got it working

        Excellent, brad_nov, and you're most welcome!

Re: Split a file based on column
by Cristoforo (Curate) on Jan 16, 2013 at 20:47 UTC
    Deleted my answer :-(

    Update: I was trying to think of something along the lines of davido's answer, where each output file is opened only once rather than repeatedly opened and closed on every loop iteration.

Re: Split a file based on column
by space_monk (Chaplain) on Jan 17, 2013 at 10:53 UTC

    All of the above answers seem to have problems with possible filehandle limits; personally, I would read the entire file into a hash of arrays keyed on the second column, then write each array out to the file indicated by its key. This has the advantage that only one file is open at any time. I'll stick my neck out and say it will also be faster, due to less file I/O.

    As a second comment, you should use something like Text::CSV to parse the data, but if you want it quick and dirty there's a good argument for using split instead of a regex here.
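    A small sketch of the split approach on one of the OP's sample records. The one trap is that | is the regex alternation metacharacter, so it must be escaped in the pattern:

```perl
use strict;
use warnings;

my $line = "11|1|abc|102553|125589";
# split on a literal pipe; an unescaped /|/ would split between
# every character, because | is regex alternation
my @fields = split /\|/, $line;
my $key = $fields[1];    # 2nd column: "1"
```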

    Amount of data: 300k rows × 64k per row ≈ 19.6 GB, which may cause problems, so perhaps a compromise is to write the data out whenever an array reaches a certain size.

    The following (untested/undebugged) code assumes you specify the file(s) you want to read from on the command line.

    Update: Changed when it writes to file as a result of a davido comment

    use strict;
    use warnings;

    use constant ROW_LIMIT => 10000;

    sub writeData {
        my ($name, $data) = @_;
        open my $fh, '>>', "sample_$name" or die $!;
        print $fh @$data;
        close $fh;    # may not be needed (auto close on sub end)
    }

    my %hash;
    my $ctr = 0;

    while (<>) {
        my @elems = split /\|/;    # the pipe must be escaped
        my $idx   = $elems[1];
        if ( exists $hash{$idx} ) {
            # save to existing array
            push @{ $hash{$idx} }, $_;
        }
        else {
            # create new array
            $hash{$idx} = [ $_ ];
        }
        # if we've got too much data, write it out
        if ( $ctr++ >= ROW_LIMIT ) {
            # write data to each file...
            foreach my $key ( keys %hash ) {
                writeData( $key, $hash{$key} );
                delete $hash{$key};
            }
            $ctr = 0;
        }
    }

    # write remaining data to each file...
    foreach my $key ( keys %hash ) {
        writeData( $key, $hash{$key} );
    }
    A Monk aims to give answers to those who have none, and to learn from those who know more.

      All of the above answers seem to have problems with possible filehandle limits;

      Re: Split a file based on column doesn't, and it also doesn't suffer from loading the whole file into RAM.

        You caught my comment whilst it was being drafted; I did state another reason for the approach I suggested.

        Memory is almost never a problem nowadays unless you're running on a 15-year-old PC, but 300k rows × 64k per row (~19.6 GB) may give some pause for thought. Time to go shopping for more memory, or increase your cache. :-)

        A Monk aims to give answers to those who have none, and to learn from those who know more.

Node Type: perlquestion [id://1013641]
Approved by davido