Extract the middle part of a list

by chrism01 (Friar)
on Jun 29, 2007 at 01:37 UTC

chrism01 has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

My prog needs to load a set of files from a dir.
The filename format is aaa_bbb_ttt.ddd.eee, where ttt is a timestamp of file creation in epoch seconds.
The prog will receive 2 input params, start_datetime and end_datetime, which I'll cvt to epoch secs to match against ttt above.
Ideally, I'd like a way of efficiently extracting the subset I need.

Note that there are 2 constraints:
1. some timestamps may not be represented (ie no files with that value)
2. it is likely that many files will exist with the same timestamp(s).

I'm going to take a snapshot list of files when I start, as the dir will still be being written to, but the end_datetime will be a fixed value, less than 'now'.
I'm sure it's possible in theory, via some combo of map/split/grep/sort/hash etc, to extract the middle part of the list ie files that I need, but I'm not sure that the overall processing time will be any quicker than just working through my snapshot list sequentially.
Any file with a datetime in the desired range will be read and the contents inserted into a DB (Ingres).
The num of files in the dir will be in the order 1k - 10k approx.
I was thinking of amending something like this:

@sorted = sort                         # default sort numeric
          map  { $_->[2] }             # grab 3rd field (timestamp) of array (ref)
          map  { [ split(/_/,$_) ] }   # split fnames on '_', rtn array ref
          grep { !/^\./ }              # filter out dot files
          readdir(EVT_DIR);            # read all entries

except I don't need the sort (not reqd), but I'd need to replace that line with code that keeps only timestamp values in the desired range.

Cheers
Chris
PS Also need to ignore any dirs that exist in the target dir

Replies are listed 'Best First'.
Re: Extract the middle part of a list
by Zaxo (Archbishop) on Jun 29, 2007 at 02:12 UTC

    You want grep, with some data extraction in the choice routine.

    our ($starttime, $endtime) = init();
    # . . .
    my @selected = grep {
        my $time = (split /[_.]/)[2];
        $time > $starttime and $time <= $endtime;
    } </path/to/*>;

    Not having to sort helps, since we don't need to keep values for comparison. If there are files there which don't follow the naming scheme, you may need to filter them out with another grep or with map, or a refinement of the glob in angles.
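
    One way to filter out names that don't follow the scheme is to extract the timestamp with an anchored regex instead of split, so non-matching names simply yield undef. This is a minimal sketch, not the code from the reply above: the helper names timestamp_of and select_in_range are hypothetical, and it assumes the aaa and bbb parts are non-empty and contain no underscore.

```perl
use strict;
use warnings;

# Hypothetical helper: return the epoch timestamp embedded in a name of
# the form aaa_bbb_ttt.ddd.eee, or undef if the name does not match.
sub timestamp_of {
    my ($path) = @_;
    (my $name = $path) =~ s{.*/}{};          # strip any leading directory
    return $name =~ /\A[^_]+_[^_]+_(\d+)\./ ? $1 : undef;
}

# Keep names with start < ttt <= end, the same bounds as the grep above.
sub select_in_range {
    my ($start, $end, @paths) = @_;
    return grep {
        my $t = timestamp_of($_);
        defined $t and $t > $start and $t <= $end;
    } @paths;
}

my @demo = qw(
    /path/to/aaa_bbb_100.ddd.eee
    /path/to/aaa_bbb_200.ddd.eee
    /path/to/README
    /path/to/aaa_bbb_300.ddd.eee
);
print "$_\n" for select_in_range(100, 300, @demo);
```

    Because the directory part is stripped before matching, underscores or dots in /path/to no longer shift the field positions, which the plain split on the full glob result would be sensitive to.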

    After Compline,
    Zaxo

      Guys,

      Thx for both of those. I decided to go with Zaxo because it's simpler and works down the page (I think).
      However, it's complaining about the curr/parent dir entries (., ..), which I tried to fix, but I'm not good with these nested code layouts.
      I tried:

      @t_arr = grep { grep { $_ !~ /^\./ }
                      $var3 = (split /[_.]/)[2];
                      $var3 > $var1 and $var3 <= $var2;
                    } readdir(EVT_DIR);

      and a couple of variations, but still get warnings:

      Use of uninitialized value in pattern match (m//) at ./t.pl line 341.
      Use of uninitialized value in numeric gt (>) at ./t.pl line 345.
      Use of uninitialized value in pattern match (m//) at ./t.pl line 341.
      Use of uninitialized value in numeric gt (>) at ./t.pl line 345.

      I also need to ignore any dirs that exist.
      Any chance of the correct code?
      Guess I need a tutorial article on nested code blocks (if that's the correct description)

      Cheers
      Chris

        As you say, your blocking is incorrect. Here's how to write that:

        @t_arr = grep {
                     $var3 = (split /[_.]/)[2];
                     $var3 > $var1 and $var3 <= $var2;
                 }
                 grep { !-d and $_ !~ /^\./ }
                 readdir(EVT_DIR);

        where I've added !-d to exclude directories. You might want to use -f instead to admit only regular files.
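
        One caveat with file tests on readdir results: readdir returns bare names, so -d and -f are resolved against the current working directory unless the directory is prepended. Here is a hedged sketch of the -f variant with that handled. The sub name files_in_range and its arguments are illustrative, and it uses a lexical directory handle rather than EVT_DIR.

```perl
use strict;
use warnings;

# Sketch: list regular files in $dir whose embedded timestamp satisfies
# start < ttt <= end. The sub name and arguments are illustrative.
sub files_in_range {
    my ($dir, $start, $end) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = grep {
        my ($t) = /\A[^_]+_[^_]+_(\d+)\./;   # undef unless the name matches
        defined $t and $t > $start and $t <= $end;
    }
    grep { -f "$dir/$_" }                    # test the full path, not the bare name
    readdir($dh);
    closedir($dh);
    return @names;
}

# Example call; the directory and epoch bounds are placeholders.
my @wanted = files_in_range('.', 1183000000, 1183100000);
print "$_\n" for @wanted;
```

        The -f test also drops . and .. along with any subdirectories, so no separate dot-file filter is needed in this version.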

        After Compline,
        Zaxo

Re: Extract the middle part of a list
by GrandFather (Saint) on Jun 29, 2007 at 02:10 UTC

    You had all the pieces, just need to alter them slightly:

    use warnings;
    use strict;

    my @files = qw(
        aaa_bbb_10.ddd.eee aaa_bbb_11.ddd.eee aaa_bbb_12.ddd.eee
        aaa_bbb_13.ddd.eee aaa_bbb_14.ddd.eee aaa_bbb_15.ddd.eee
    );
    my $min = 12;
    my $max = 14;

    my @filtered =
        map  {$_->[0]}
        grep {$_->[1] >= $min && $_->[1] <= $max}
        map  {my ($time) = /(?:[^_]*_){2}(\d+)/; [$_, $time]} @files;

    print "@filtered";

    Prints:

    aaa_bbb_12.ddd.eee aaa_bbb_13.ddd.eee aaa_bbb_14.ddd.eee

    DWIM is Perl's answer to Gödel
Re: Extract the middle part of a list
by jettero (Monsignor) on Jun 29, 2007 at 01:55 UTC

    I don't think the default sort is numeric, I think it's ascii... I seem to have to sort { $a<=>$b } @ar to get numeric things to happen.
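
    A quick check of that difference, using made-up values rather than real timestamps:

```perl
use strict;
use warnings;

my @stamps = (9, 100, 25, 1000);

my @as_strings = sort @stamps;                 # default: string comparison
my @as_numbers = sort { $a <=> $b } @stamps;   # explicit numeric comparison

print "@as_strings\n";   # 100 1000 25 9
print "@as_numbers\n";   # 9 25 100 1000
```

    Epoch timestamps of equal length happen to sort the same either way, so the difference only shows up when the digit counts vary.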

    If your subset is much smaller than the 10k in the dir, it might not be efficient to push them all into a giant array and push them through grep, map, map, sort. It might be better to do something more like:

    while( defined( my $ent = readdir $dirhandle ) ) {
        next if $stuff or $ent =~ m/^\./;   # skip dot files only; target names contain dots
        push @wanted, $ent if $various and $things;
    }

    -Paul

Re: Extract the middle part of a list
by jbert (Priest) on Jun 29, 2007 at 07:00 UTC
    One issue which is sometimes ignored in these file-sharing schemes is a race condition which can lead to half-written files being processed.

    You can have a problem where something like this happens:

    1. Writing process gets current epoch time A, makes up filename
    2. Writing process creates file, starts writing, doesn't finish
    3. Reading process comes along at time A+delta, notes an 'old' file with timestamp A, opens and reads it.
    In this case, the reading process has seen a half-completed file.

    One might argue that the writing process couldn't stall for long enough for this to happen, but that depends on the size of the file being written, whether it now (or in the future) will be writing over a network, whether the writing process has to wait to get more data, etc.

    The safe way to do this (which you might already be doing) is for the reader and writer to agree on a pattern match of files to ignore (e.g. *.tmp). The writer can then create and write to a x-y-z.tmp file, flush and sync it to disk and then do a rename() on it once it's finished.
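
    The writer side of that scheme might look like the sketch below. It is only an illustration under stated assumptions: publish_file is a hypothetical name, the ".tmp" suffix is just one possible agreed pattern, and it assumes a platform where IO::Handle's sync is available and where the temp and final paths are on the same filesystem.

```perl
use strict;
use warnings;
use IO::Handle;   # provides the flush and sync methods on file handles

# Sketch of the writer side. publish_file and the ".tmp" suffix are
# illustrative; reader and writer only need to agree on some pattern.
sub publish_file {
    my ($final_path, $data) = @_;
    my $tmp_path = "$final_path.tmp";

    open(my $fh, '>', $tmp_path) or die "Cannot write $tmp_path: $!";
    print {$fh} $data            or die "Write to $tmp_path failed: $!";
    $fh->flush                   or die "Flush of $tmp_path failed: $!";
    $fh->sync                    or die "Sync of $tmp_path failed: $!";
    close($fh)                   or die "Close of $tmp_path failed: $!";

    # rename is atomic when both paths are on the same filesystem, so a
    # reader sees either no final file or a complete one.
    rename($tmp_path, $final_path) or die "Rename to $final_path failed: $!";
    return $final_path;
}
```

    On the reader side, the matching step would be to skip anything ending in the agreed suffix, for example with grep { !/\.tmp\z/ } ahead of the timestamp filter.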

      Actually, that's not a problem here. My prog is just loading up some old files which were missed when the DB was down.
      I'm taking a snapshot list of extant files at the start of the prog so it doesn't have to try to play catchup with the writing process.
      Unless the sysadmins mess up the datetime params badly, the end_datetime will be some way behind the 'latest' file.
      The writer is part of a monitor system and runs 24/7. The monitor writes a file for each event and also a copy of the data should be written to a row in the DB.
      If the DB goes down, the new prog will 'fill in the gap' after the DB is fixed.
      The monitor will write event files even if it can't talk to the DB.
      In fact, the update prog will prob be fast enough to catch up anyway

      Chris

        So you can guarantee that file_xxx_N is complete if file_xxx_N+1 (or later) exists? Fair enough.

        But for reliability, the reader should check this condition. (And not process a file if it is the latest one). But that might have problems too (if there hasn't been any activity since, then you'll miss the last record).

        Really, I'm nitpicking, because the race condition is probably unlikely to be hit. But systems like this often run unattended for a long time, on systems which sometimes bog down under load. Race conditions lead to unpredictable behaviour and lots of those time-consuming "oh...we sometimes get that problem, we don't know why" issues.

        IMHO, the only safe way to do this is "create temp file, then rename".
