PerlMonks  

how to tell if a file is still being modified

by oylee (Pilgrim)
on Sep 15, 2003 at 19:18 UTC ( [id://291634]=perlquestion )

oylee has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script that runs from cron and periodically polls a directory. If the directory contains any files, it decrypts them and moves them to another directory. The problem is that those files are placed there by some other FTP process over which I have no control, so occasionally my script will try to process a file that is still being transferred. Is there any 'natural' Perl way to prevent this? I've poked around the File module list on CPAN and didn't see anything immediately relevant, but my eyes are pretty small. We've already looked into using fuser and File::stat (we could stat() the file, compare its mtime with the current time, and skip the file if the difference is below a certain threshold). We can certainly go with either of these solutions, but they don't feel perlfect...

Thanks for your help!
Allen

Replies are listed 'Best First'.
Re: how to tell if a file is still being modified (use the filename as a communications channel)
by grinder (Bishop) on Sep 15, 2003 at 20:08 UTC

    In cases like these I use the name of the file itself as a channel to other processes to let them know whether they are allowed to play with it or not. This does, however, require that you have control over the process that is sending you the files.

    All you have to do is to arrange for the sender to put files on your server according to a specific filename convention (e.g. PUT sekret.data or PUT sekret.data.uploading in ftp parlance).

    After the transfer is complete, the sender then sends down another command to rename the file: RENAME sekret.data sekret.data.ready or RENAME sekret.data.uploading sekret.data, respectively. Whatever works best for you. The trick is that the sender must do this, the receiver cannot.

    As a receiver, you only have to search for files with the agreed-upon extension (.ready or whatever). You can even push the vice as far as renaming the file, on the receiving side (e.g. sekret.data.done) so that the sending side knows that the file has been processed, should the housekeeping be their responsibility.

    This is also pretty robust in terms of sudden death reboots. It becomes trivial to determine if files need to be resent or reprocessed.

    This is a language- and platform-agnostic technique. You can use it pretty much anywhere you can give names to things. If you can't rename (sometimes not possible with anonymous ftp uploads), you can always create another file alongside the principal file (e.g. sekret.data.is-ready), possibly zero-length, possibly containing an MD5 checksum, to achieve a similar result.

    The main point to remember is that you don't want to try and second-guess the sender on the receiving side. To try and do so will cause untold pain. Just get the sender to tell you.
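
A minimal sketch of the receiving side of such a naming convention. The `.ready` and `.processing` suffixes and both sub names are assumptions made for illustration; use whatever you and the sender agree on.

```perl
use strict;
use warnings;
use File::Spec;

# Assumed convention (hypothetical): the sender renames a file to
# "<name>.ready" once its upload has finished.
my $READY_SUFFIX = '.ready';

# Return the paths of all plain files in $dir that carry the ready suffix.
sub ready_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @names = grep { /\Q$READY_SUFFIX\E\z/ } readdir $dh;
    closedir $dh;
    my @paths  = grep { -f $_ } map { File::Spec->catfile($dir, $_) } @names;
    my @sorted = sort @paths;
    return @sorted;
}

# Claim a ready file by renaming it, so that an overlapping cron run
# does not pick it up again. Returns the new name, or undef on failure.
sub claim {
    my ($path) = @_;
    (my $claimed = $path) =~ s/\Q$READY_SUFFIX\E\z/.processing/;
    return rename($path, $claimed) ? $claimed : undef;
}
```

The cron job would then loop over `ready_files($dir)`, call `claim` on each, and process only the files it managed to claim.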

      This is the method we use in batch processing at one of the largest banks in the US. Another method is to poll the size of the file every so many seconds. After the file doesn't grow for N polls (usually 2), we can assume the file is done. -s isn't that expensive, especially in the batch world.

      ------
      We are the carpenters and bricklayers of the Information Age.

      The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
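
The size-polling idea above might look roughly like this. It is only a sketch: the sub name, the option names and the default interval are assumptions, and it inherits the weakness discussed elsewhere in this thread (a stalled upload looks exactly like a finished one). It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Return 1 once the size of $path has stayed the same for
# $opt{stable_polls} consecutive checks taken $opt{interval} seconds
# apart. Return 0 if the file is missing or the size never settles
# within $opt{max_polls} checks.
sub size_settled {
    my ($path, %opt) = @_;
    my $interval     = $opt{interval}     // 5;    # assumed default
    my $stable_polls = $opt{stable_polls} // 2;    # "N polls (usually 2)"
    my $max_polls    = $opt{max_polls}    // 120;  # assumed upper bound

    my @st = stat $path;
    return 0 unless @st;
    my $last      = $st[7];    # element 7 of stat() is the size in bytes
    my $unchanged = 0;

    for (1 .. $max_polls) {
        sleep $interval;
        @st = stat $path;
        return 0 unless @st;
        if ($st[7] == $last) {
            return 1 if ++$unchanged >= $stable_polls;
        }
        else {
            $unchanged = 0;
            $last      = $st[7];
        }
    }
    return 0;
}
```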

Re: how to tell if a file is still being modified
by samtregar (Abbot) on Sep 15, 2003 at 21:33 UTC
    One way to solve this problem is to write your own FTP server in Perl. Net::FTPServer makes it easy. Just inherit from the provided base classes and implement your processing at the end of the close() method. I did this for Bricolage and it works very well.

    -sam

      Neat idea! This is something I'll definitely look into.
Re: how to tell if a file is still being modified
by halley (Prior) on Sep 15, 2003 at 19:45 UTC

    No easy way. No portable way.

    Your method is about as good as I've found, if you can't otherwise ask the OS about open files. Things like lsof on Linux may help, but only on the same machine which is actually regulating the filesystem; I can imagine certain network filesystems like Samba or NFS hiding those details.

    I made a quickie script I call 'settle' which runs a shell command repeatedly until the standard output or a given file ceases to change. It offers a variable window for polling for changes, but defaults to 5sec.

    --
    [ e d @ h a l l e y . c c ]

Re: how to tell if a file is still being modified
by blue_cowdawg (Monsignor) on Sep 15, 2003 at 19:55 UTC

    Along the same lines as what halley said in his reply, I once solved that problem using the algorithm described in the following pseudocode:

    open file for reading
    slurp file into a scalar
    close file
    calculate the MD5 of the slurp buffer
    wait X seconds
    open file for reading
    slurp into a second scalar
    close file
    calculate the MD5 of the second slurp buffer
    compare, and repeat if not equal

    Another scheme I have used in the past is to let the FTP client that is putting the files into the directory bit-twiddle the permission bits, so that when the client is done copying, the permission bits have some pre-agreed value (444 being my favorite); only then is the file opened.


    Peter L. Berghold -- Unix Professional
    Peter at Berghold dot Net
       Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.

      That scheme is inefficient and fragile. Why bother to read the entire file and calculate a hash on it, when checking the file size is a whole lot faster and for an FTP upload, is just as good?

      It is fragile; imagine what would happen if the upload stalled for X+1 seconds and then resumed. If you make X large to try to avoid this, it makes the processing slower.

      The permission bits scheme is better, if your system supports permission bits.

            That scheme is inefficient and fragile. Why bother to read the entire file and calculate a hash on it, when checking the file size is a whole lot faster and for an FTP upload, is just as good?
        If my scheme of checking a hash is fragile then checking file sizes is just as bad if not worse for the same reasons you stated mine was bad.

        As you say if the upload stalls for X+1 seconds then you are going to end up colliding with the upload when you act on what you assume is a finished file.

        I personally like the bit banging method much better, but unfortunately that is UNIX-centric and is not portable to, say, Win32 and friends.

        As others have said there are no really clean and portable ways of doing this and YMMV no matter what method you use. Generating MD5 hashes worked for me and in a batch environment are not that expensive.


        Peter L. Berghold -- Unix Professional
        Peter at Berghold dot Net
           Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.
      There is no reason to slurp the whole file into memory just to calculate an MD5. See Digest::MD5 for the correct way to do it efficiently. It's quite common for a disk file to be larger than the available RAM on a machine, so that's just about the worst algorithm error you could get in there (on an algorithm that works at all). Though the usefulness of an MD5 here is somewhat suspect in the first place: it is expensive, and gains nothing over the solution it would replace.
      --
      Snazzy tagline here
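
For reference, a short sketch of hashing a file without slurping it, using the `addfile` method of Digest::MD5, which reads from the filehandle in chunks. The sub name `md5_of_file` is just an illustrative choice.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of the file at $path. The file is read
# through the filehandle in chunks, never loaded into memory whole.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}
```
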
Re: how to tell if a file is still being modified
by nimdokk (Vicar) on Sep 15, 2003 at 20:26 UTC
    FTP does not lock files while it is still writing to them (I've run into this issue already, where a 500 Mbyte file was being uploaded and our process picked it up and moved it on to another location before the upload had completed). Our solution in that case was to have the sender create a "lock" file once they had finished transmitting the file to us. The lock file is usually very small (0-byte preferably); that way we look for the lock file and then perform the actions we need on the other file. It might not be the cleanest solution, but it seems to work.


    "Ex libris un peut de tout"
      As there is no reliable cross-platform file locking system, flag files are a common pattern for indicating process state. I worked with a system that had directories called "do", "done", "pending", "success", and "fail". The actual data file was dropped in the "pending" directory, then a file of the same name was created in the "do" directory. When the file transfer subsystem had done the transfer, it moved the "pending" file to "success", and moved the marker file from "do" to "done".

      Can you get the sending process to create a marker file (either with an extension, a prefix, or in another directory) to indicate completion, and monitor for the marker instead?

Re: how to tell if a file is still being modified
by Roger (Parson) on Sep 15, 2003 at 23:34 UTC
    I remember that we had this problem with our automatic job processing too. We kick off processing when a certain file has arrived via FTP. We came up with several solutions:

    1) Periodically check the size of the incoming file, and if the file size has stopped growing, assume that the file transfer has finished. However, this does *NOT* work, at least not reliably! We had one particular case where the FTP transfer paused / died, and the system thought that the file had been received properly and started to process it. It made a total mess that took many days to resolve.

    2) A better approach than the first one is to modify the system to receive the data file first, followed by a trigger file. The system acts on the arrival of the trigger file. This approach is more reliable than the first one; however, it assumes that the sender works properly. We had a case where the sender program/script sent half the data file, and then somehow sent the trigger file without checking that the data file had completed. This of course caused another mess.

    3) The best scenario is when the data file has an integrated verification mechanism, like a ZIP file. You can be certain whether the incoming ZIP file has arrived completely by periodically testing the integrity of the ZIP file with the zip -T switch. This method works 100% of the time.

    The 3rd method, with a self-validating file format, is the preferred method, if the client/sender can produce such a format. If that is not possible, fall back to the 2nd method with an additional trigger file. If that is not possible either, fall back to the 1st method and pray. :-D
      I have considered option 3, but it feels a bit too kludgy, and there's always the possibility that different versions of zip behave differently since there is no standard for creating zip archives. I'm still keeping it in the back of my mind though since, as you pointed out, option 2 can cause problems when the "trigger" file is loaded without verifying that the original file arrived correctly. I've also seen cases where someone will load a trigger file without sending the data file (or the other way around). I'd say that it works 80-90% of the time, which is good. It's just the 10-20% when it does not that is annoying, especially when you get paged at 3am because some idjit mistakenly created a data file without a trigger, or vice versa. The best solution would perhaps be a combination of some or all of these options (provided a workable solution could be created easily) :-)


      "Ex libris un peut de tout"
        I agree with you on the diversity of versions of ZIP out there. I'd say most of the differences would be in its encryption algorithm. (I am not in the US, so I am using the export version of the strong encryption algorithm to compile my ZIP program, hmmm, perhaps that is why I still haven't had any problems yet.) Provided there is no encryption requirement, ZIP is still a good solution though. And of course if there was any problem, it would show up in the testing phase, wouldn't it?
      You might even enhance the functionality of the 'trigger' file, by including the MD5 sum of the transferred file...
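
A sketch of that combination, under an assumed convention: for a data file `foo` the sender uploads `foo.md5` last, holding the hex MD5 of `foo` as its first token (the layout md5sum prints). The `.md5` suffix and the sub name are hypothetical. This catches the half-sent-data-plus-trigger case described above, because the digest will not match.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return 1 only if the trigger file "<data>.md5" exists and the MD5 it
# contains matches the data file as it currently is on disk.
sub trigger_confirms {
    my ($data_path) = @_;
    my $trigger = "$data_path.md5";    # assumed naming convention
    return 0 unless -f $data_path && -f $trigger;

    open my $tfh, '<', $trigger or die "Cannot open $trigger: $!";
    my $line = <$tfh>;
    close $tfh;
    return 0 unless defined $line && $line =~ /^\s*([0-9a-fA-F]{32})\b/;
    my $expected = lc $1;

    open my $dfh, '<', $data_path or die "Cannot open $data_path: $!";
    binmode $dfh;
    my $actual = Digest::MD5->new->addfile($dfh)->hexdigest;
    close $dfh;

    return $actual eq $expected ? 1 : 0;
}
```
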
Re: how to tell if a file is still being modified
by Abigail-II (Bishop) on Sep 15, 2003 at 21:52 UTC
    I would use fuser. You said you have looked into it, but you don't say why you rejected it. For obvious reasons, you either need to be root, or own the process that's accessing the file though. Alternatively, if your OS supports it, you might use the /proc filesystem, but then you must have the same permissions as fuser (which uses /proc as well). And if you have the license for it, I bet glance/advisor will be able to give you the information as well.

    Abigail
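
A hedged sketch of calling fuser from Perl. It assumes the psmisc version of fuser (the usual one on Linux), where `-s` suppresses output and an exit status of 0 means at least one process has the file open. The sub name is made up, and the permission caveat above still applies: without sufficient rights, a file held open by someone else's process may be reported as unused.

```perl
use strict;
use warnings;

# Returns 1 if fuser reports a process using $path, 0 if it reports
# none (or hit an error of its own), and undef if fuser could not be
# run at all, e.g. because it is not installed.
sub file_in_use {
    my ($path) = @_;
    my $rc = system('fuser', '-s', $path);
    return if $rc == -1;       # fuser could not be executed
    return if $rc & 127;       # fuser was killed by a signal
    return ($rc >> 8) == 0 ? 1 : 0;
}
```

Treating the undef result as "do not touch the file yet" is the cautious choice for a poller.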

Re: how to tell if a file is still being modified
by oylee (Pilgrim) on Sep 15, 2003 at 22:57 UTC
    Thanks for all the replies. The best of all possible worlds would be if we could control the FTP sender in addition to the receiver and just set up some handshake protocol there, but that'd make life too easy x). The poll-the-file-until-it-no-longer-changes approach is something that would most certainly work, but it just feels a li'l brittle.

    I've already implemented it using fuser, but the thought that the production machine's OS might not always have fuser available crossed my mind and wouldn't let go (the warning in Linux::Fuser stating that "even then it may not work on other than 2.2.* kernels" made me squirm a bit too).

    In any case, fuser seems to be working fine (though this thing is a bit awkward to test). Once again, I've shamed my ancestors and forgotten to heed the potential wisdom of YouArentGonnaNeedIt.

    Thanks again for all the input!
    Allen
Re: how to tell if a file is still being modified
by iburrell (Chaplain) on Sep 15, 2003 at 23:15 UTC
    Another mechanism you mentioned is to look at the creation time or mod time. If you can figure out the maximum upload time, your script can skip files younger than that threshold.

    For example, if you say every file will take 1 hour max to upload, a five-minute-old file could still be uploading, but a two-hour-old file is safe. This doesn't work if the files are really big, the upload time varies too much, or you need frequent processing.
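
A small sketch of the age check, using the mtime from `stat`. The sub name is illustrative and the threshold is whatever maximum upload time you are prepared to assume.

```perl
use strict;
use warnings;

# Return 1 if $path was last modified at least $min_age seconds ago,
# 0 if it is younger than that or cannot be stat()ed.
sub old_enough {
    my ($path, $min_age) = @_;
    my @st = stat $path;
    return 0 unless @st;
    my $age = time() - $st[9];    # element 9 of stat() is the mtime
    return $age >= $min_age ? 1 : 0;
}
```

With the one-hour example above, the poller would process a file only when `old_enough($file, 3600)` is true.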

Re: how to tell if a file is still being modified
by exussum0 (Vicar) on Sep 15, 2003 at 22:01 UTC
    If you can guarantee some sorta bandwidth on the party ftping the file in, then you can do two or three ls's and make sure the file size doesn't change.
    --
    Play that funky music white boy..
Re: how to tell if a file is still being modified
by monktim (Friar) on Sep 16, 2003 at 14:18 UTC
    Here is another alternative. It may be overkill for what you want to do and it involves more work, but maybe it's worth considering.

    Instead of another FTP process putting the files in your directory, you can run a process to establish a passive mode FTP connection to the FTP server and get the files. WGET is a good utility for this and is freely available from GNU. WGET won't put a partial file in your directory; it will wait until the download has completed successfully to do so. It also works well on unstable connections and can do retries.

    Then you can run your own FTP server on that same machine and allow access to the directory. On your final destination machine you can again use WGET. This time you can establish a passive mode connection to your FTP server to get the files.

    Passive mode FTP connections allow you to get files across the network(s) securely. That is, as long as you have good firewall rules and you have good security on your FTP server.

    Update: I jumped the gun. I was wrong about WGET not showing the file until it is complete. I tested it with a 100MB file and it got created slowly. The version of WGET I have dumps everything it gets into a single file and then creates the individual files from it. It still creates the individual files slowly and not all at once. My bad.

      This doesn't really do the job - you've just shifted the problem from one machine to another. How do you know the file is ready to get?

      This is unrelated, but where possible I would recommend getting files rather than putting them. I have found it much easier to set up UATs, parallel runs and so on with this arrangement.

      I've used variations on most of the methods mentioned above:

      • polling the checksum, mod time or size;
      • trigger files;
      • trailer records;
      • renaming or moving to another directory.

      I favour trigger files. I have to admit that I've spent far more time worrying about the possible problems than actually encountering them.

Re: how to tell if a file is still being modified
by Beechbone (Friar) on Sep 18, 2003 at 20:38 UTC
    What about the FTP server's log file? Isn't there something like "completed <filename>" if you raise the log level high enough?
      Old thread but here is another suggestion. If files arrive in a certain order then process everything except the last file.
