 
PerlMonks  

Monitoring directory for new files

by ranciid (Novice)
on Jan 20, 2010 at 04:28 UTC (#818361)
ranciid has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

There's a directory (on Windows) that I have to monitor for new files uploaded (via sftp) there. Once a file is created/dropped, I have to move it (by sftp) to another server for processing.

There is no fixed schedule when these files are created and there's no fixed file size either. The files can come immediately one after another in a "stream" or file by file over a period of time.

I have looked through the Perl modules Win32::ChangeNotify, File::Monitor::Simple and Dir-Watch, but I can't work out how to find the filename of a newly created file.

Some concerns:

1) How do I avoid "race conditions"? If a big file happens to be created, how do I ensure that it is completely written before it is sftp-ed to the other processing server?

2) If the files come one after another, how should they be queued for processing?

Apologies for the lengthy maiden post as I am a complete newbie and I hope someone out there can help.

Thanks in advance

Re: Monitoring directory for new files
by ikegami (Pope) on Jan 20, 2010 at 05:29 UTC

    If a big file happened to be created, how do I ensure that it is completely created before it is sftp-ed to the other processing server?

    Create the big file in one directory, then rename it into the monitored directory. Renaming a file just edits the directory entries, so the file instantly appears in its entirety in the second directory.

    You could also keep the file locked until you are done adding to it.
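
    A minimal sketch of that write-then-rename approach (paths are hypothetical; the rename is only atomic when both directories are on the same volume):

        use strict;
        use warnings;

        my $staging = 'C:/staging/data.txt';   # outside the watched dir
        my $watched = 'C:/drop/data.txt';      # inside the watched dir

        open my $fh, '>', $staging or die "open $staging: $!";
        print {$fh} "record $_\n" for 1 .. 100_000;   # write it all first
        close $fh or die "close $staging: $!";

        # The file now appears in the watched directory in a single step.
        rename $staging, $watched or die "rename: $!";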

    but I don't seem to be able to find where / how to find the filename of the newly created file.

    Maybe you're supposed to read the directory's contents and compare them with what you read previously? I didn't do any research.
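
    A minimal sketch of that compare-with-the-previous-read idea (the directory path is a placeholder):

        use strict;
        use warnings;

        my $dir = 'C:/drop';
        my %seen;               # names reported on earlier passes

        while (1) {
            opendir my $dh, $dir or die "opendir $dir: $!";
            for my $name (grep { -f "$dir/$_" } readdir $dh) {
                print "new file: $name\n" unless $seen{$name}++;
            }
            closedir $dh;
            sleep 5;            # poll interval, tune to taste
        }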

Re: Monitoring directory for new files
by Marshall (Prior) on Jan 20, 2010 at 05:57 UTC
    You will be notified that something has changed in the "drop directory". When you see that happen, you read the directory and process all the files that you know of.

    If a file isn't finished being created, your read loop will hang while waiting for EOF. That appears to be fine in your app.

    The potential race condition comes into play after you have finished processing all files that you knew about when you got the triggering event.

    It could be that a new file has been or is being created in the "drop directory" since you last checked. So you run the "process_all_files()" routine again to make sure that the directory is empty as far as you know.

    But this may still not be perfect. Unless you have good OS event support (and I'm not sure that Windows does), you should run process_all_files() when the event trigger happens, then run process_all_files() again to clear the directory "to the best of your knowledge", and then run process_all_files() periodically so that no files "get stuck" in the drop directory without an event trigger.

    There are various alternate schemes, but the above is a good one provided that you can deal with a little delay when a file "gets stuck", which won't happen often.
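
    A minimal sketch of this scheme using Win32::ChangeNotify, with the process_all_files() above left as a hypothetical stub; the wait timeout doubles as the periodic sweep:

        use strict;
        use warnings;
        use Win32::ChangeNotify;

        my $dir    = 'C:/drop';
        my $notify = Win32::ChangeNotify->new($dir, 0, 'FILE_NAME SIZE')
            or die "Can't watch $dir";

        sub process_all_files {
            # hypothetical: read $dir and move out every finished file
        }

        while (1) {
            process_all_files();    # run on the trigger...
            process_all_files();    # ...and again to catch late arrivals
            $notify->wait(60_000);  # an event, or a 60-second sweep
            $notify->reset;         # re-arm before the next wait
        }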

      If a file isn't finished being created, your read loop will hang while waiting for EOF.

      I'm sure this is wrong (since we're talking about plain files, not pipes or sockets), but I can't test it right now.

        If that's not true, then what happens? I figure the reading process will keep trying to read until EOF. It will not see EOF until the other process that is writing the file has closed it. I could be mistaken, but again, if it doesn't work like this, then what happens in the "reading process"?

        So basically, yes, this is like a pipe. I think ikegami's test is correct.
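
        For reference, a quick test sketch (the temp filename is arbitrary). On a plain file, the read returns the bytes already written and then sees EOF straight away; it does not block like a pipe would:

            use strict;
            use warnings;
            use IO::Handle;

            my $file = 'eof_test.tmp';

            open my $out, '>', $file or die "open for write: $!";
            print {$out} "first line\n";
            $out->flush;                 # data written, handle still open

            open my $in, '<', $file or die "open for read: $!";
            my @lines = <$in>;           # returns at once, no waiting
            printf "read %d line(s) while the writer is still open\n",
                scalar @lines;

            close $in;
            close $out;
            unlink $file;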

Re: Monitoring directory for new files
by cdarke (Prior) on Jan 20, 2010 at 06:31 UTC
    The only *true* way to avoid race conditions is to use the Win32 API ReadDirectoryChangesW with OVERLAPPED, and do that from C/C++.

    Probably not what you want to hear, sorry.

      Actually, there is a module on this site by D. Faure that encapsulates this API - Win32::ReadDirectoryChangesW.

      I found this a few weeks ago and have been using it in a test program and it works well enough for my purposes. I am in the process of moving (countries) but intend to ask D Faure if I can put this on CPAN when I get settled.

Re: Monitoring directory for new files
by salva (Monsignor) on Jan 20, 2010 at 09:32 UTC
    IIRC, Windows will not let you open a file that's already open read-write by another process, so all you have to do is read the directory and try to open every file there (in a way that does not block waiting for the file to be released). If an open call fails, just continue with the next file. Sleep. Repeat.
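
    A minimal sketch of that probe-then-process loop (the directory path and the handle_file() hook are hypothetical):

        use strict;
        use warnings;

        my $dir = 'C:/drop';

        sub handle_file {
            my ($path) = @_;
            # hypothetical: sftp $path to the other server, then delete it
        }

        while (1) {
            opendir my $dh, $dir or die "opendir $dir: $!";
            for my $name (grep { -f "$dir/$_" } readdir $dh) {
                if (open my $fh, '<', "$dir/$name") {
                    close $fh;
                    handle_file("$dir/$name");
                }
                # else: the uploader still holds it open; retry next pass
            }
            closedir $dh;
            sleep 10;
        }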

    A completely different approach is to put the required functionality into the SFTP server itself: you can extend Net::SFTP::Server, adding your custom logic to the file close operation.

Re: Monitoring directory for new files
by pdltiger (Novice) on Jan 20, 2010 at 14:04 UTC

    It looks like you want something that will tell you whether the OS knows the file is still being changed or has been modified. I don't know of anything like that, but it's not hard to roll your own file-checker, one that simply 'gets the job done'. I've put my solution below. It monitors all the files in the directory for changes in size; when a file hasn't changed size since the last iteration, it deems it done and sftp's it. It also keeps monitoring files that have already been sftp'd, in case a new file arrives with the name of an old one.

    Note that this code compiles, but I didn't check to see if the logic actually works as advertised (-:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $dir_to_monitor = 'your/directory/here';
    chdir $dir_to_monitor or die "Can't chdir to $dir_to_monitor: $!";

    # We'll wait for 60 seconds between invocations
    my $sleep_time = 60;

    # We deem a file 'unchanging' if its size hasn't changed since
    # the last time we checked. Thus, these hashes hold the file
    # sizes from our last time through the loop:
    my %unchanging_file_sizes;  # Already been sftp'd elsewhere
    my %changing_file_sizes;    # Files that are potentially growing

    # Holds the state of the loop. This is set to zero when the file
    # named 'stop_sftp_checking' is found.
    my $still_running = 1;

    while ($still_running) {

        # Check all the files in the directory
        FILE: foreach my $filename (glob '*') {

            # Ignore the file if it hasn't changed since it was sent
            next FILE
                if exists $unchanging_file_sizes{$filename}
                and -s $filename == $unchanging_file_sizes{$filename};

            # Check files that were changing the last time around to
            # see if they're done changing.
            if (exists $changing_file_sizes{$filename}
                and -s $filename == $changing_file_sizes{$filename}) {
                sftp_file($filename);
                delete $changing_file_sizes{$filename};
                $unchanging_file_sizes{$filename} = -s $filename;
                next FILE;
            }

            # At this point we can be sure that either the file is new
            # or its size has changed since we last checked. Either
            # way, make sure we check it on our next go 'round the loop:
            delete $unchanging_file_sizes{$filename};
            $changing_file_sizes{$filename} = -s $filename;

            # Finally, our exit condition. If we find a file named
            # 'stop_sftp_checking' then quit after handling the last file.
            $still_running = 0 if $filename eq 'stop_sftp_checking';
        }

        # Wait and repeat
        sleep $sleep_time;
    }

    Note that I assumed you have a function named sftp_file() that will perform the needed sftp file copy command.
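
    A minimal sketch of what that sftp_file() could look like, using Net::SFTP::Foreign (the host name, user, and remote directory are placeholders):

        use Net::SFTP::Foreign;

        sub sftp_file {
            my ($filename) = @_;
            my $sftp = Net::SFTP::Foreign->new('processing.example.com',
                                               user => 'dropuser');
            $sftp->error
                and die "SFTP connection failed: " . $sftp->error;
            $sftp->put($filename, "incoming/$filename")
                or die "put $filename failed: " . $sftp->error;
        }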

    This has really, really obvious holes in it, but you should be able to patch them if you care. It waits for a minute between checks, so if you need a fast response, it won't work for you. Also, it assumes that you will never receive a file named 'stop_sftp_checking'; though it's unlikely you'll ever receive such a file by accident, it could pose a security hole if constant up-time is critical and you're facing the world. There are better ways to handle the stopping condition such as monitoring a control file outside the directory. I assume you have some notion of security and how to secure your system.

    Remember, Perl is fun, and hackish solutions are usually all that you really need to fix the problem and move on. Hope that helps!

Re: Monitoring directory for new files
by ranciid (Novice) on Jan 25, 2010 at 03:12 UTC
    Thanks for all your replies.

    All of you have provided very valuable input and very helpful snippets of code.

    Really appreciate you all taking the time to reply.

    Cheers !
