http://www.perlmonks.org?node_id=1013714


in reply to Re^2: Split a file based on column
in thread Split a file based on column

Good job roboticus. I was thinking instead of some solution that would keep track of frequency of use for opened filehandles. Whenever 'open' fails due to too many files open, drop the least used handle. But I wasn't sure how to implement the frequency structure. A heap (priority queue) sounds good, except that it's probably relatively expensive to update the priority of a file handle each time it's used. Most heap implementations would just delete and re-insert the element being modified. Seems like there must be a solution that isn't prohibitively expensive, but I'm drawing a blank.

There must be something on CPAN, but regardless, it would be nice to know how best to implement a...um... "priority cache"? ;)


Dave

Replies are listed 'Best First'.
Re^4: Split a file based on column
by roboticus (Chancellor) on Jan 17, 2013 at 11:35 UTC

    davido:

    I've used a priority queue in a C program a dozen or so years ago, and it worked well. As far as the overhead goes, I wouldn't expect it to be prohibitive, especially when compared to the time savings of opening a file.

    Part of the reason I chose an LRU cache for this one is that I've found they work pretty well for the types of applications I use--at least when the number of file handles is more reasonable. Most of the data I play with tends to be 'clumped' in that similar records tend to be closer together. For example, when I process some credit card data, I'll have long runs of Visa transactions, somewhat shorter runs of MasterCard transactions, while others (American Express, Discover) are frequently very short runs.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^4: Split a file based on column
by Anonymous Monk on Jan 17, 2013 at 10:56 UTC
    FileCache - keep more files open than the system permits

      ...in O(n log n) time per handle fetch (if I'm reading the code right).

      The module seems to implement a Least Frequently Used cache (though it calls it Least Recently Used cache), and does so by incrementing and re-sorting on each request for a handle. And then there's this: "The module functionality relies on symbolic references, so things will break under 'use strict' unless 'no strict "refs"' is also specified." So the only way to use the module is to wrap the code that uses it in a scope block and specify no strict 'refs'; for that block. Talk about the implementation leaking outward into the interface!

      It's hard to believe it's still in the core.

      I think that a Fibonacci Heap solution would work nicely. Inserts (adding a handle) are constant time, remove least (dropping a seldom-used handle) is O(log n) time, change priority (calling for a handle and incrementing its use count) is a delete+insert (O(log n)). Of course the devil is in the details. If I find some free time I might give it a go and see how it works out.


      Dave