Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Re^2: Split a file based on column

by roboticus (Chancellor)
on Jan 17, 2013 at 04:19 UTC ( #1013703=note: print w/replies, xml ) Need Help??

in reply to Re: Split a file based on column
in thread Split a file based on column

davido, brad_nov:

I saw davido's solution, and played around with it to add a limit to the number of open files in %handles using a least-recently used (LRU) cache. No real reason, but I thought I'd amuse myself while my son got ready for bed.

You could trim it down a bit, as much of the code just implements traces to show what's happening as it runs.

$ cat #!/usr/bin/perl # Updated PM 1013651 to have a limit on file handles use strict; use warnings; use autodie; use 5.10.0; my %handles; my $MAX_OPEN_FH=3; while( <DATA> ) { my( $key ) = m/^[^|]\|([^|]+)/; if( ! defined $key ) { warn "Line $. appears malformed. Skipping: $_"; next; } print {FH("$key.txt")} $_; } close $$_{FH} for values %handles; sub FH { # Return file handle for named file state $cnt=0; my $key= shift; # Return current handle if it exists if (exists $handles{$key}) { $handles{$key}{cnt}=++$cnt; print "$key: (cnt=$cnt) found\n"; return $handles{$key}{FH}; } # Doesn't exist, retire the "oldest" one if we're at the limit if (keys %handles >= $MAX_OPEN_FH) { my @tmp = sort { $$a{cnt} <=> $$b{cnt} } values %handles; say "$key: Too many open files, close one: ", join(", ",map { "$$_{FName}:$$_{cnt}" } @tmp); my $hr = $tmp[0]; print " closing $$hr{FName}\n"; close $$hr{FH}; delete $handles{$$hr{FName}}; } open my $FH, '>>', $key; $handles{$key} = { cnt=>++$cnt, FName=>$key, FH=>$FH }; print "$key: opened new file ($cnt)\n"; return $FH; } __DATA__ a|1|foo b|1|bar c|2|baz d|1|xyzzy e|2|blarg f|2|The g|3|quick h|2|red i|2|fox j|3|jumped k|4|over l|1|the m|1|lazy n|1|brown o|1|dog p|5|gorgonzola

Running it gives me:

$ ./ 1.txt: opened new file (1) 1.txt: (cnt=2) found 2.txt: opened new file (3) 1.txt: (cnt=4) found 2.txt: (cnt=5) found 2.txt: (cnt=6) found 3.txt: opened new file (7) 2.txt: (cnt=8) found 2.txt: (cnt=9) found 3.txt: (cnt=10) found 4.txt: Too many open files, close one: 1.txt:4, 2.txt:9, 3.txt:10 closing 1.txt 4.txt: opened new file (11) 1.txt: Too many open files, close one: 2.txt:9, 3.txt:10, 4.txt:11 closing 2.txt 1.txt: opened new file (12) 1.txt: (cnt=13) found 1.txt: (cnt=14) found 1.txt: (cnt=15) found 5.txt: Too many open files, close one: 3.txt:10, 4.txt:11, 1.txt:15 closing 3.txt 5.txt: opened new file (16)


When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^3: Split a file based on column
by davido (Archbishop) on Jan 17, 2013 at 06:11 UTC

    Good job roboticus. I was thinking instead of some solution that would keep track of frequency of use for opened filehandles. Whenever 'open' fails due to too many files open, drop the least used handle. But I wasn't sure how to implement the frequency structure. A heap (priority queue) sounds good, except that it's probably relatively expensive to update the priority of a file handle each time it's used. Most heap implementations would just delete and re-insert the element being modified. Seems like there must be a solution that isn't prohibitively expensive, but I'm drawing a blank.

    There must be something on CPAN, but regardless, it would be nice to know how best to implement "priority cache"? ;)



      I've used a priority queue in a C program a dozen or so years ago, and it worked well. As far as the overhead goes, I wouldn't expect it to be prohibitive, especially when compared to the time savings of opening a file.

      Part of the reason I chose an LRU cache for this one is that I've found they work pretty well for the types of applications I use--at least when the number of file handles is more reasonable. Most of the data I play with tends to be 'clumped' in that similar records tend to be closer together. For example, when I process some credit card data, I'll have long runs of Visa transactions, somewhat shorter runs of MasterCard transactions, while others (American Express, Discover) are frequently very short runs.


      When your only tool is a hammer, all problems look like your thumb.

      FileCache - keep more files open than the system permits O(n log n) time per handle fetch (if I'm reading the code right).

        The module seems to implement a Least Frequently Used cache (though it calls it Least Recently Used cache), and does so by incrementing and re-sorting on each request for a handle. And then there's this: "The module functionality relies on symbolic references, so things will break under 'use strict' unless 'no strict "refs"' is also specified." So the only way to use the module is to wrap the code that uses it in a scope block and specify no strict 'refs'; for that block. Talk about the implementation leaking outward into the interface!

        It's hard to believe it's still in the core.

        I think that a Fibonacci Heap solution would work nicely. Inserts (adding a handle) are constant time, remove least (dropping a seldom-used handle) is O(log n) time, change priority (calling for a handle and incrementing its use count) is a delete+insert (O(log n)). Of course the devil is in the details. If I find some free time I might give it a go and see how it works out.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1013703]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (7)
As of 2017-06-26 22:04 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (594 votes). Check out past polls.