Monks,
My prog needs to load a set of files from a dir.
The filename format is aaa_bbb_ttt.ddd.eee, where ttt is a timestamp of file creation in epoch seconds.
The prog will receive 2 input params, start_datetime and end_datetime, which I'll convert to epoch secs to match against ttt above.
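A minimal sketch of that conversion, using the core Time::Piece module. The input format ('YYYY-MM-DD HH:MM:SS', treated as UTC) and the helper name to_epoch are assumptions for illustration; the real params may well arrive in another format or time zone.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: convert a datetime string to epoch seconds.
# Assumption: the string looks like 'YYYY-MM-DD HH:MM:SS' and is in UTC
# (Time::Piece->strptime parses its input as UTC).
sub to_epoch {
    my ($datetime) = @_;
    return Time::Piece->strptime($datetime, '%Y-%m-%d %H:%M:%S')->epoch;
}

# Example values only; the real ones come from the input params.
my $start_epoch = to_epoch('2009-02-13 23:31:30');
my $end_epoch   = to_epoch('2009-02-14 00:00:00');
print "range: $start_epoch .. $end_epoch\n";
```

If the params are in local time instead, the conversion would need adjusting (for example with Time::Local's timelocal).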
Ideally, I'd like a way of efficiently extracting the subset I need.
Note that there are 2 constraints:
1. Some timestamps may not be represented (i.e. no files with that value).
2. It is likely that many files will exist with the same timestamp(s).
I'm going to take a snapshot list of the files when I start, as the dir will still be being written to, but the end_datetime will be a fixed value, less than 'now'.
I'm sure it's possible in theory, via some combo of map/split/grep/sort/hash etc, to extract the middle part of the list, i.e. the files that I need. What I'm not sure of is whether the overall processing time will be any quicker than just working through my snapshot list sequentially.
Any file with a datetime in the desired range will be read and the contents inserted into a DB (Ingres).
The num of files in the dir will be on the order of 1k - 10k.
I was thinking of amending something like this:
@sorted = sort                      # default sort (string comparison, not numeric)
          map  { $_->[2] }          # grab 3rd field (timestamp) of array (ref)
          map  { [ split(/_/,$_) ] } # split fnames on '_', rtn array ref
          grep { !/^\./ }           # filter out dot files
          readdir(EVT_DIR);         # read all entries
except I don't need the sort (not reqd), so I'd need to replace that line with code that keeps only the files whose timestamp values are in the desired range.
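One way that replacement could look is a single grep over the snapshot list, which is one linear pass and needs no sort. This is only a sketch: the function name files_in_range, the sample file names, and the assumption that aaa and bbb contain no underscores and that ttt is all digits are mine, not confirmed details of the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names whose embedded timestamp lies
# in the inclusive range [$start, $end].
# Assumption: names look like aaa_bbb_ttt.ddd.eee, with no '_' inside
# aaa or bbb and with ttt made up of digits only.
sub files_in_range {
    my ($start, $end, @names) = @_;
    return grep {
        my ($ts) = /^[^_]+_[^_]+_(\d+)\./;   # capture ttt, undef if no match
        defined $ts && $ts >= $start && $ts <= $end;
    } @names;
}

# Made-up snapshot list for illustration.
my @snapshot = qw(
    evt_a_1000.dat.gz evt_b_1005.dat.gz evt_c_1005.dat.gz
    evt_d_1020.dat.gz .hidden notes.txt
);
my @wanted = files_in_range(1005, 1010, @snapshot);
print "$_\n" for @wanted;
```

Names that don't fit the pattern (dot files included) simply fail the match and are dropped, and gaps or duplicate timestamps need no special handling because each name is tested on its own.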
Cheers
Chris
PS I also need to ignore any dirs that exist in the target dir.
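A sketch of skipping subdirectories when taking the snapshot, using the -f file test. The function name plain_files is hypothetical, and the '.' passed in the example stands in for the real target dir. Note that readdir returns bare names, so the dir has to be prepended before the -f test.

```perl
use strict;
use warnings;

# Hypothetical helper: list plain files in $dir, skipping dot files
# and anything that is not a regular file (e.g. subdirectories).
sub plain_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    # readdir gives names relative to $dir, so -f needs the full path.
    my @names = grep { !/^\./ && -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return @names;
}

# Example call against the current directory.
my @snapshot = plain_files('.');
print scalar(@snapshot), " plain file(s) found\n";
```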