PerlMonks  

Re: Split a file based on column

by davido (Archbishop)
on Jan 16, 2013 at 21:03 UTC (#1013651)


in reply to Split a file based on column

I would do something like this (untested):

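    use strict;
    use warnings;
    use autodie;

    use constant IN_FN => 'sample_1.txt';

    my %handles;    # output file handles, keyed by the value in column 2

    open my $infh, '<', IN_FN;

    while( <$infh> ) {
        # Capture the second pipe-delimited column.
        # The first column may be empty or any length.
        my( $key ) = m/^[^|]*\|([^|]+)/;
        if( ! defined $key ) {
            warn "Line $. appears malformed. Skipping: $_";
            next;
        }
        # Open each output file only once, then reuse the stored handle.
        open $handles{$key}, '>', IN_FN . "$key.txt"
            unless exists $handles{$key};
        print {$handles{$key}} $_;
    }

    close $_ for $infh, values %handles;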

You didn't mention the need, but it would be pretty easy to adapt this to work with a list of input files. Just replace the constant with code to deal with different input filenames, and put it in a loop. :)
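
A minimal sketch of that adaptation, assuming the input file names are passed on the command line. The sub name split_file is hypothetical, and each output file is named by appending the key to its input file name, as in the code above.

```perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: split one input file on its second column.
# Returns the names of the output files it wrote.
sub split_file {
    my( $in_fn ) = @_;
    my %handles;                    # fresh set of output handles per input file

    open my $infh, '<', $in_fn;
    while( my $line = <$infh> ) {
        my( $key ) = $line =~ m/^[^|]*\|([^|]+)/;
        if( ! defined $key ) {
            warn "$in_fn line $. appears malformed. Skipping: $line";
            next;
        }
        open $handles{$key}, '>', $in_fn . "$key.txt"
            unless exists $handles{$key};
        print {$handles{$key}} $line;
    }
    close $_ for $infh, values %handles;

    return map { $in_fn . "$_.txt" } sort keys %handles;
}

# Process every file named on the command line.
split_file( $_ ) for @ARGV;
```

Because the handles are closed at the end of each call, the open-file count never exceeds the number of distinct keys in a single input file.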

What I like about this solution is that you only open each output file once, and then just keep track of the file handles as values in a hash, indexed on the key parsed from the 2nd column.

Update: This solution has the efficiency advantage of not having to re-open an output file if it's already been opened before. But johngg correctly observed that at some point it's possible to get a "Too many open files" error. On one of my systems that kicked in after trying to open 1020 files simultaneously. My solution assumes that column two holds two digits, which would yield just under 100 possible output files. That should be ok.
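
To see what the limit is on a particular system, one option on POSIX-like systems is to ask through the core POSIX module. This is a minimal sketch; the number reported depends on the operating system and on per-process limits such as `ulimit -n`, so no output is shown here.

```perl
use strict;
use warnings;
use POSIX qw( sysconf _SC_OPEN_MAX );

# Ask the OS how many files this process may have open at once.
# sysconf returns undef when the value is not available.
my $max_open = sysconf( _SC_OPEN_MAX );

if( defined $max_open ) {
    print "This process may hold up to $max_open open files.\n";
}
else {
    print "The open-file limit could not be determined.\n";
}
```

STDIN, STDOUT, STDERR and the input file count against the same limit, so the number of output files that can actually be opened is somewhat lower than the reported value.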

However, if it turns out that you're exceeding the number of allowable open files on your system, you can open/close on each iteration (the simplest solution).
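
A minimal sketch of that open/close-per-line variant. The sub name split_reopening is hypothetical. It opens in append mode ('>>') because reopening with '>' would truncate the file on every line; as a consequence, output files left over from an earlier run have to be removed first.

```perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: split $in_fn on its second column, holding only
# one output handle open at any moment.
sub split_reopening {
    my( $in_fn ) = @_;

    open my $infh, '<', $in_fn;
    while( my $line = <$infh> ) {
        my( $key ) = $line =~ m/^[^|]*\|([^|]+)/;
        if( ! defined $key ) {
            warn "Line $. appears malformed. Skipping: $line";
            next;
        }
        # Open, append one line, close again.
        open my $outfh, '>>', $in_fn . "$key.txt";
        print {$outfh} $line;
        close $outfh;
    }
    close $infh;
    return;
}
```

This trades speed for safety: every input line pays for an open and a close, but the number of distinct keys no longer matters.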


Dave


Re^2: Split a file based on column
by roboticus (Canon) on Jan 17, 2013 at 04:19 UTC

    davido, brad_nov:

    I saw davido's solution, and played around with it to add a limit to the number of open files in %handles using a least-recently used (LRU) cache. No real reason, but I thought I'd amuse myself while my son got ready for bed.

    You could trim it down a bit, as much of the code just implements traces to show what's happening as it runs.

        $ cat t_file_queue.pl
        #!/usr/bin/perl
        # Updated PM 1013651 to have a limit on file handles
        use strict;
        use warnings;
        use autodie;
        use 5.10.0;

        my %handles;
        my $MAX_OPEN_FH=3;

        while( <DATA> ) {
            # Second pipe-delimited column; the first may be any length
            my( $key ) = m/^[^|]*\|([^|]+)/;
            if( ! defined $key ) {
                warn "Line $. appears malformed. Skipping: $_";
                next;
            }
            print {FH("$key.txt")} $_;
        }

        close $$_{FH} for values %handles;

        sub FH {
            # Return file handle for named file
            state $cnt=0;
            my $key= shift;

            # Return current handle if it exists
            if (exists $handles{$key}) {
                $handles{$key}{cnt}=++$cnt;
                print "$key: (cnt=$cnt) found\n";
                return $handles{$key}{FH};
            }

            # Doesn't exist, retire the "oldest" one if we're at the limit
            if (keys %handles >= $MAX_OPEN_FH) {
                my @tmp = sort { $$a{cnt} <=> $$b{cnt} } values %handles;
                say "$key: Too many open files, close one: ",
                    join(", ",map { "$$_{FName}:$$_{cnt}" } @tmp);
                my $hr = $tmp[0];
                print " closing $$hr{FName}\n";
                close $$hr{FH};
                delete $handles{$$hr{FName}};
            }

            open my $FH, '>>', $key;
            $handles{$key} = { cnt=>++$cnt, FName=>$key, FH=>$FH };
            print "$key: opened new file ($cnt)\n";
            return $FH;
        }

        __DATA__
        a|1|foo
        b|1|bar
        c|2|baz
        d|1|xyzzy
        e|2|blarg
        f|2|The
        g|3|quick
        h|2|red
        i|2|fox
        j|3|jumped
        k|4|over
        l|1|the
        m|1|lazy
        n|1|brown
        o|1|dog
        p|5|gorgonzola

    Running it gives me:

        $ ./t_file_queue.pl
        1.txt: opened new file (1)
        1.txt: (cnt=2) found
        2.txt: opened new file (3)
        1.txt: (cnt=4) found
        2.txt: (cnt=5) found
        2.txt: (cnt=6) found
        3.txt: opened new file (7)
        2.txt: (cnt=8) found
        2.txt: (cnt=9) found
        3.txt: (cnt=10) found
        4.txt: Too many open files, close one: 1.txt:4, 2.txt:9, 3.txt:10
         closing 1.txt
        4.txt: opened new file (11)
        1.txt: Too many open files, close one: 2.txt:9, 3.txt:10, 4.txt:11
         closing 2.txt
        1.txt: opened new file (12)
        1.txt: (cnt=13) found
        1.txt: (cnt=14) found
        1.txt: (cnt=15) found
        5.txt: Too many open files, close one: 3.txt:10, 4.txt:11, 1.txt:15
         closing 3.txt
        5.txt: opened new file (16)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Good job roboticus. I was thinking instead of some solution that would keep track of frequency of use for opened filehandles. Whenever 'open' fails due to too many files open, drop the least used handle. But I wasn't sure how to implement the frequency structure. A heap (priority queue) sounds good, except that it's probably relatively expensive to update the priority of a file handle each time it's used. Most heap implementations would just delete and re-insert the element being modified. Seems like there must be a solution that isn't prohibitively expensive, but I'm drawing a blank.
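
One low-tech way to approximate this without a heap, offered as a sketch rather than a recommendation: keep a use count per handle (a constant-time increment on each use) and scan for the smallest count only when a handle actually has to be evicted. The names lfu_handle, lfu_open_names and lfu_close_all and the limit of 3 are made up for this illustration, and unlike the idea described above it evicts at a fixed limit instead of waiting for open to fail.

```perl
use strict;
use warnings;
use autodie;

my $MAX_OPEN = 3;   # assumed limit, kept small for illustration
my %cache;          # file name => { fh => handle, uses => count }

# Hypothetical helper: return an append-mode handle for $fname,
# evicting the least frequently used handle when the cache is full.
sub lfu_handle {
    my( $fname ) = @_;

    if( my $entry = $cache{$fname} ) {
        $entry->{uses}++;               # constant-time bookkeeping per use
        return $entry->{fh};
    }

    if( keys %cache >= $MAX_OPEN ) {
        # Linear scan for the smallest count, paid only on eviction.
        my $victim;
        for my $name ( keys %cache ) {
            $victim = $name
                if ! defined $victim
                || $cache{$name}{uses} < $cache{$victim}{uses};
        }
        close $cache{$victim}{fh};
        delete $cache{$victim};
    }

    open my $fh, '>>', $fname;
    $cache{$fname} = { fh => $fh, uses => 1 };
    return $fh;
}

# Names of the files currently held open, sorted.
sub lfu_open_names { return sort keys %cache }

# Close every cached handle and empty the cache.
sub lfu_close_all {
    close $_->{fh} for values %cache;
    %cache = ();
    return;
}
```

The scan costs time proportional to the number of open handles, which is bounded by the limit. One known weakness of pure frequency counts: a file that was busy early keeps a high count and can outlive files that are busy now.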

      There must be something on CPAN, but regardless, it would be nice to know how best to implement a...um... "priority cache"? ;)


      Dave

        FileCache - keep more files open than the system permits
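
FileCache ships with Perl. A minimal sketch of using it for this problem, assuming output files named <key>.txt in a directory of the caller's choosing; the sub name split_with_filecache and the maxopen value of 3 are made up for the illustration. cacheout hands back the path itself for use as a symbolic file handle, which is why strict refs is relaxed around the print.

```perl
use strict;
use warnings;
use FileCache maxopen => 3;   # cap kept small for illustration

# Hypothetical helper: split the lines read from $in_fh into
# "$out_dir/<key>.txt", letting FileCache decide which handles stay open.
sub split_with_filecache {
    my( $in_fh, $out_dir ) = @_;
    my %seen;

    while( my $line = <$in_fh> ) {
        my( $key ) = $line =~ m/^[^|]*\|([^|]+)/;
        next unless defined $key;

        my $path = "$out_dir/$key.txt";
        {
            # The returned handle is the path string, used symbolically.
            no strict 'refs';
            my $fh = cacheout( '>>', $path );   # '>>' so reopened files append
            print {$fh} $line;
        }
        $seen{$path} = 1;
    }

    # Close (and so flush) whatever FileCache still holds open.
    FileCache::cacheout_close( $_ ) for keys %seen;

    return sort keys %seen;
}
```

As with any append-mode approach, output files left over from an earlier run should be removed before starting.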

        davido:

        I've used a priority queue in a C program a dozen or so years ago, and it worked well. As far as the overhead goes, I wouldn't expect it to be prohibitive, especially when compared to the time savings of opening a file.

        Part of the reason I chose an LRU cache for this one is that I've found they work pretty well for the types of applications I use--at least when the number of file handles is more reasonable. Most of the data I play with tends to be 'clumped' in that similar records tend to be closer together. For example, when I process some credit card data, I'll have long runs of Visa transactions, somewhat shorter runs of MasterCard transactions, while others (American Express, Discover) are frequently very short runs.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.
