Extract the middle part of a list

chrism01 has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

My prog needs to load a set of files from a dir.
The filename format is aaa_bbb_ttt.ddd.eee, where ttt is a timestamp of file creation in epoch seconds.
The prog will receive 2 input params, start_datetime and end_datetime, which I'll convert to epoch secs to match against ttt above.
Ideally, I'd like a way of efficiently extracting the subset I need.

Note that there are 2 constraints:
1. some timestamps may not be represented (ie no files with that value)
2. it is likely that many files will exist with the same timestamp(s).

I'm going to take snapshot list of files when I start, as the dir will still be being written to, but the end_datetime will be a fixed value, less than 'now'.
I'm sure it's possible in theory, via some combo of map/split/grep/sort/hash etc, to extract the middle part of the list ie files that I need, but I'm not sure that the overall processing time will be any quicker than just working through my snapshot list sequentially.
Any file with a datetime in the desired range will be read and the contents inserted into a DB (Ingres).
The num of files in the dir will be in the order 1k - 10k approx.
I was thinking of amending something like this:

@sorted = sort                      # default sort numeric
         map  { $_->[2] }           # grab 3rd field (timestamp) of array (ref)
         map  { [ split(/_/,$_) ] } # split fnames on '_', rtn array ref
         grep { !/^\./ }            # filter out dot files
         readdir(EVT_DIR);          # read all entries

except I don't need the sort (not reqd); I'd need to replace that line with code that keeps only timestamp values in the desired range.

Cheers
Chris
PS Also need to ignore any dirs that exist in the target dir


Replies are listed 'Best First'.
Re: Extract the middle part of a list by Zaxo (Archbishop) on Jun 29, 2007 at 02:12 UTC
You want grep, with some data extraction in the choice routine.

    our ($starttime, $endtime) = init();
    # . . .
    my @selected = grep {
            my $time = (split /[_.]/)[2];
            $time > $starttime and $time <= $endtime;
        } </path/to/*>;

Not having to sort helps, since we don't need to keep values for comparison. If there are files there which don't follow the naming scheme, you may need to filter them out with another grep or with map, or a refinement of the glob in angles.

After Compline,
Zaxo
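A note on the glob form: it returns each name with its directory part attached, so `(split /[_.]/)[2]` counts fields from the start of the whole path. If the directory itself contains an underscore or a dot, the third field is no longer the timestamp. A minimal sketch of one way around that, assuming a hypothetical helper `timestamp_of` and a made-up directory name (neither is from the thread):

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper (not from the thread): pull the timestamp out of a
# path whose final component is assumed to look like aaa_bbb_ttt.ddd.eee.
sub timestamp_of {
    my ($path) = @_;

    # Split only the file name, not the directory part of the path.
    my $time = (split /[_.]/, basename($path))[2];

    # Names that do not follow the scheme yield undef.
    return (defined $time && $time =~ /^\d+$/) ? $time : undef;
}

# Made-up example path whose directory contains both '_' and '.'.
my $path = '/data/evt_2007.06/aaa_bbb_1183082400.ddd.eee';
my $time = timestamp_of($path);
print "timestamp: $time\n";
```

The same helper could sit inside the grep block in place of the bare split, with a `defined` check before the numeric comparisons.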
Re^2: Extract the middle part of a list by chrism01 (Friar) on Jun 29, 2007 at 04:51 UTC
Guys,
Thx for both of those. I decided to go with Zaxo because it's simpler and works down the page (I think). However, it's complaining about the curr/parent dir entries (., ..), which I tried to fix, but I'm not good with these nested code layouts. I tried:

    @t_arr = grep { grep { $_ !~ /^\./ }
                    $var3 = (split /[_.]/)[2];
                    $var3 > $var1 and $var3 <= $var2;
                  } readdir(EVT_DIR);

and a couple of variations, but still get warnings:

    Use of uninitialized value in pattern match (m//) at ./t.pl line 341.
    Use of uninitialized value in numeric gt (>) at ./t.pl line 345.
    Use of uninitialized value in pattern match (m//) at ./t.pl line 341.
    Use of uninitialized value in numeric gt (>) at ./t.pl line 345.

I also need to ignore any dirs that exist.
Any chance of the correct code? Guess I need a tutorial article on nested code blocks (if that's the correct description).

Cheers
Chris
Re^3: Extract the middle part of a list by Zaxo (Archbishop) on Jun 29, 2007 at 05:04 UTC
As you say, your blocking is incorrect. Here's how to write that:

    @t_arr = grep {
                 $var3 = (split /[_.]/)[2];
                 $var3 > $var1 and $var3 <= $var2;
             }
             grep { !-d and $_ !~ /^\./ }
             readdir(EVT_DIR);

where I've added `!-d` to exclude directories. You might want to use `-f` instead to admit only regular files.

After Compline,
Zaxo
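One caveat when applying file tests to a readdir list: readdir returns bare entry names, so `-d` and `-f` are evaluated relative to the current working directory unless the directory path is prepended. A minimal sketch of the same two-stage grep with that adjustment, assuming a hypothetical helper `files_in_range`, a lexical directory handle, and the aaa_bbb_ttt.ddd.eee layout from the question (the helper name and regex are illustrative, not from the thread):

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the names of regular
# files in $dir whose embedded timestamp lies in the range ($start, $end],
# the same bounds as the comparison in the reply above.
sub files_in_range {
    my ($dir, $start, $end) = @_;

    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = readdir($dh);    # snapshot of the directory listing
    closedir($dh);

    return grep {
        # Assumed name layout: aaa_bbb_ttt.ddd.eee, ttt in epoch seconds.
        my ($ts) = /^[^_]*_[^_]*_(\d+)\./;
        defined $ts and $ts > $start and $ts <= $end;
    }
    grep {
        # Skip dot entries; prepend the directory so -f tests the right path.
        !/^\./ and -f "$dir/$_";
    } @names;
}
```

A call might look like `my @t_arr = files_in_range('/path/to/events', $var1, $var2);`, where the path is a placeholder. Checking that the timestamp is defined before comparing it also avoids the "uninitialized value" warnings for names that do not follow the scheme.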
Re^4: Extract the middle part of a list by chrism01 (Friar) on Jun 29, 2007 at 05:38 UTC
Re^5: Extract the middle part of a list by doom (Deacon) on Jun 29, 2007 at 10:00 UTC
Re: Extract the middle part of a list by GrandFather (Saint) on Jun 29, 2007 at 02:10 UTC
You had all the pieces, just need to alter them slightly:

    use warnings;
    use strict;

    my @files = qw(
        aaa_bbb_10.ddd.eee aaa_bbb_11.ddd.eee aaa_bbb_12.ddd.eee
        aaa_bbb_13.ddd.eee aaa_bbb_14.ddd.eee aaa_bbb_15.ddd.eee
        );
    my $min = 12;
    my $max = 14;

    my @filtered =
        map {$_->[0]}
        grep {$_->[1] >= $min && $_->[1] <= $max}
        map {my ($time) = /(?:[^_]*_){2}(\d+)/; [$_, $time]}
        @files;

    print "@filtered";

Prints:

    aaa_bbb_12.ddd.eee aaa_bbb_13.ddd.eee aaa_bbb_14.ddd.eee

DWIM is Perl's answer to Gödel
Re: Extract the middle part of a list by jettero (Monsignor) on Jun 29, 2007 at 01:55 UTC
I don't think the default sort is numeric; I think it's ASCII (string) order. I seem to have to `sort { $a <=> $b } @ar` to get numeric things to happen.

If your subset is much smaller than the 10k in the dir, it might not be efficient to push them all into a giant array and push them through grep, map, map, sort. It might be better to do something more like:

    while( my $ent = readdir $dirhandle ) {
        next if $stuff or $ent =~ m/^\./;
        push @wanted, $ent if $various and $things;
    }

-Paul
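The point about sort order is easy to check. The sketch below uses made-up values (not from the thread) and compares the default sort with an explicit numeric comparison:

```perl
use strict;
use warnings;

# Made-up values of differing widths, chosen to expose the difference.
my @stamps = (900, 1000, 85, 1183082400);

my @as_strings = sort @stamps;                 # default: string comparison
my @as_numbers = sort { $a <=> $b } @stamps;   # explicit numeric comparison

print "string:  @as_strings\n";    # string:  1000 1183082400 85 900
print "numeric: @as_numbers\n";    # numeric: 85 900 1000 1183082400
```

Timestamps that all have the same number of digits happen to come out in the same order either way, which can hide the difference until the widths diverge.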
Re: Extract the middle part of a list by jbert (Priest) on Jun 29, 2007 at 07:00 UTC
One issue which is sometimes ignored in these file-sharing schemes is a race condition which can lead to half-written files being processed. You can have a problem where something like this happens:

1. Writing process gets current epoch time A, makes up filename
2. Writing process creates file, starts writing, doesn't finish
3. Reading process comes along at time A+delta, notes an 'old' file with timestamp A, opens and reads it.

In this case, the reading process has seen a half-completed file. One might argue that the writing process couldn't stall for long enough for this to happen, but that depends on the size of the file being written, whether it now (or in the future) will be writing over a network, whether the writing process has to wait to get more data, etc.

The safe way to do this (which you might already be doing) is for the reader and writer to agree on a pattern match of files to ignore (e.g. *.tmp). The writer can then create and write to a x-y-z.tmp file, flush and sync it to disk and then do a `rename()` on it once it's finished.
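The writer side of "create temp file, then rename" could be sketched roughly as follows. The helper name `write_then_rename` and the choice of appending `.tmp` to the final name are assumptions for illustration, not something described in the thread; `flush` and `sync` are the IO::Handle methods, and `sync` may be unavailable on some platforms.

```perl
use strict;
use warnings;
use IO::Handle;    # provides the flush and sync methods on file handles

# Hypothetical writer-side helper (not from the thread): write $data under
# a temporary name, then rename it to $final once it is complete.
sub write_then_rename {
    my ($final, $data) = @_;
    my $tmp = "$final.tmp";    # assumed convention: readers skip *.tmp

    open(my $fh, '>', $tmp) or die "Cannot open $tmp: $!";
    print {$fh} $data       or die "Cannot write $tmp: $!";
    $fh->flush              or die "Cannot flush $tmp: $!";
    $fh->sync               or die "Cannot sync $tmp: $!";
    close($fh)              or die "Cannot close $tmp: $!";

    # On POSIX systems a rename within one filesystem is atomic, so a
    # reader sees either no file under the final name or the whole file.
    rename($tmp, $final)    or die "Cannot rename $tmp to $final: $!";
    return $final;
}
```

With this naming, the reader would also need to skip the temporary names explicitly, for example with an extra `!/\.tmp$/` test, because a name like aaa_bbb_ttt.ddd.eee.tmp still carries a timestamp in the third field.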
Re^2: Extract the middle part of a list by chrism01 (Friar) on Jun 29, 2007 at 07:19 UTC
Actually, that's not a problem here. My prog is just loading up some old files which were missed when the DB was down. I'm taking a snapshot list of extant files at the start of the prog so it doesn't have to try to play catchup with the writing process. Unless the sysadmins mess up the datetime params badly, the end_datetime will be some way behind the 'latest' file.

The writer is part of a monitor system and runs 24/7. The monitor writes a file for each event and also a copy of the data should be written to a row in the DB. If the DB goes down, the new prog will 'fill in the gap' after the DB is fixed. The monitor will write event files even if it can't talk to the DB.

In fact, the update prog will prob be fast enough to catch up anyway.

Chris
Re^3: Extract the middle part of a list by jbert (Priest) on Jun 29, 2007 at 07:35 UTC
So you can guarantee that file_xxx_N is complete if file_xxx_N+1 (or later) exists? Fair enough. But for reliability, the reader should check this condition. (And not process a file if it is the latest one). But that might have problems too (if there hasn't been any activity since, then you'll miss the last record).

Really, I'm nitpicking, because the race condition is probably unlikely to be hit. But systems like this often run unattended for a long time, on systems which sometimes bog down under load. Race conditions lead to unpredictable behaviour and lots of those time-consuming "oh...we sometimes get that problem, we don't know why" issues.

IMHO, the only safe way to do this is "create temp file, then rename".

