Re: how to tell if a file is still being modified (use the filename as a communications channel)
by grinder (Bishop) on Sep 15, 2003 at 20:08 UTC
|
In cases like these I use the name of the file itself as a channel to other processes to let them know whether they are allowed to play with it or not. This does, however, require that you have control over the process that is sending you the files.
All you have to do is to arrange for the sender to put files on your server according to a specific filename convention (e.g. PUT sekret.data or PUT sekret.data.uploading in ftp parlance).
After the transfer is complete, the sender then sends another command to rename the file: RENAME sekret.data sekret.data.ready or RENAME sekret.data.uploading sekret.data, respectively. Whatever works best for you. The trick is that the sender must do this; the receiver cannot.
As a receiver, you only have to search for files with the agreed-upon extension (.ready or whatever). You can even push the vice as far as renaming the file, on the receiving side (e.g. sekret.data.done) so that the sending side knows that the file has been processed, should the housekeeping be their responsibility.
This is also pretty robust in terms of sudden death reboots. It becomes trivial to determine if files need to be resent or reprocessed.
This is a language- and platform-agnostic technique. You can use it pretty much anywhere you can give names to things. If you can't rename (sometimes not possible with anonymous ftp uploads), you can always create another file alongside the principal file (e.g. sekret.data.is-ready), possibly zero-length, possibly containing an MD5 checksum, to achieve a similar result.
The main point to remember is that you don't want to try and second-guess the sender on the receiving side. To try and do so will cause untold pain. Just get the sender to tell you.
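For what it's worth, the receiving side of the renaming convention might look something like the sketch below. It assumes the ".ready" and ".done" suffixes used as examples above; the sub names ready_files and process_and_mark_done are made up for illustration, and the handler is whatever processing you need to do.

```perl
use strict;
use warnings;

# Return the full paths of the plain files in $dir that the sender has
# marked complete by renaming them to end in ".ready".
sub ready_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = sort grep { /\.ready\z/ && -f "$dir/$_" } readdir $dh;
    closedir $dh;
    return map { "$dir/$_" } @names;
}

# Run $handler on one ready file, then rename it from ".ready" to ".done"
# so that the sending side can see it has been processed.
sub process_and_mark_done {
    my ($path, $handler) = @_;
    $handler->($path);
    (my $done = $path) =~ s/\.ready\z/.done/;
    rename $path, $done or die "Cannot rename $path to $done: $!";
    return $done;
}
```

Files still named with any other suffix (e.g. ".uploading") are simply never seen by the receiver, which is the whole point.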
|
This is the method we use in batch processing at one of the largest banks in the US. Another method is to poll the size of the file every so many seconds. After the file doesn't grow for N polls (usually 2), we can assume the file is done. -s isn't that expensive, especially in the batch world.
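A rough sketch of the size-polling idea, with the caveat that the sub name and its parameters are invented here and the code is only lightly exercised. It uses stat rather than -s so that a missing file (undef) is clearly distinguished from an empty one.

```perl
use strict;
use warnings;

# Poll the size of $path every $interval seconds. Return 1 as soon as the
# size has been unchanged for $needed consecutive polls, or 0 if that has
# not happened within $max_polls polls. (All names here are illustrative.)
sub wait_until_stable {
    my ($path, $interval, $needed, $max_polls) = @_;
    my ($last_size, $unchanged) = (undef, 0);

    for my $poll (1 .. $max_polls) {
        sleep $interval if $poll > 1;
        my $size = (stat($path))[7];    # undef if the file is not there

        if (defined $size && defined $last_size && $size == $last_size) {
            return 1 if ++$unchanged >= $needed;
        }
        else {
            $unchanged = 0;             # size changed or file missing: start over
        }
        $last_size = $size;
    }
    return 0;
}
```

Something like wait_until_stable($file, 10, 2, 360) would then mean "two unchanged polls, ten seconds apart, give up after an hour". A stalled upload will still fool it, of course.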
------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: how to tell if a file is still being modified
by samtregar (Abbot) on Sep 15, 2003 at 21:33 UTC
|
One way to solve this problem is to write your own FTP server in Perl. Net::FTPServer makes it easy. Just inherit from the provided base classes and implement your processing at the end of the close() method. I did this for Bricolage and it works very well.
-sam
|
Neat idea! This is something I'll definitely look into.
Re: how to tell if a file is still being modified
by halley (Prior) on Sep 15, 2003 at 19:45 UTC
|
No easy way. No portable way.
Your method is about as good as I've found, if you can't otherwise ask the OS about open files. Things like lsof on Linux may help, but only on the same machine which is actually regulating the filesystem; I can imagine certain network filesystems like Samba or NFS hiding those details.
I made a quickie script I call 'settle' which runs a shell command repeatedly until the standard output or a given file ceases to change. It offers a variable window for polling for changes, but defaults to 5sec.
-- [ e d @ h a l l e y . c c ]
Re: how to tell if a file is still being modified
by blue_cowdawg (Monsignor) on Sep 15, 2003 at 19:55 UTC
|
Along the same lines as what halley had to say in his
reply, I once solved that problem by using an algorithm
described in the following pseudocode:
open file for reading
slurp in file into a scalar
close file
calculate the MD5 of slurp buffer
wait X seconds
open file for reading
slurp into second scalar
close file
calculate MD5 of second slurp buffer
compare and repeat if not equal.
Another scheme I have used in the past is to let the
FTP client that is putting the files into the directory
bit-twiddle the permission bits so that when the client
is done copying, the permission bits are of some pre-agreed
upon value (444 being my favorite), and only then open the
file.
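On the receiving end, the permission-bit check could be sketched roughly as follows. The sub name has_agreed_mode is hypothetical, and this assumes a Unix-style filesystem where the sender is able to chmod the file after the upload.

```perl
use strict;
use warnings;

# Return 1 if the permission bits of $path are exactly $agreed
# (for example 0444), and 0 otherwise or if the file cannot be stat'ed.
sub has_agreed_mode {
    my ($path, $agreed) = @_;
    my @st = stat($path);
    return 0 unless @st;
    # Field 2 of stat is the mode; mask off the file-type bits first.
    return (($st[2] & 07777) == $agreed) ? 1 : 0;
}
```

The receiver would skip any file for which has_agreed_mode($file, 0444) is false and look again on the next pass.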
Peter L. Berghold -- Unix Professional Peter at Berghold dot Net |
|
Dog trainer, dog agility exhibitor, brewer of
fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and
a good Belgian ale in your chalice. |
|
That scheme is inefficient and fragile. Why bother to read the entire file and calculate a hash on it, when checking the file size is a whole lot faster and for an FTP upload, is just as good?
It is fragile; imagine what would happen if the upload stalled for X+1 seconds and then resumed. If you make X large to try to avoid this, it makes the processing slower.
The permission bits scheme is better, if your system supports permission bits.
|
"That scheme is inefficient and fragile. Why bother to read the entire file and calculate a hash on it, when checking the file size is a whole lot faster and for an FTP upload, is just as good?"
If my scheme of checking a hash is fragile then checking
file sizes is just as bad if not worse for the same
reasons you stated mine was bad.
As you say if the upload stalls for X+1 seconds then
you are going to end up colliding with the upload when
you act on what you assume is a finished file.
I personally like the bit banging method much better
but unfortunately that is UNIX-centric and is not portable
to, say, Win32 and friends.
As others have said there are no really clean and
portable ways of doing this and YMMV no matter what
method you use. Generating MD5 hashes worked for me
and in a batch environment are not that expensive.
|
There is no reason to slurp the whole file into memory just to calculate an MD5. See Digest::MD5 for the correct way to do it efficiently.
It's quite common for a disk file to be larger than the available RAM on a machine, so that's just about the worst algorithm error you could get in there (on an algorithm that works at all).
Though the usefulness of an MD5 here is somewhat suspect in the first place; it is expensive, and gains nothing over the solution it would replace.
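For reference, the streaming approach with Digest::MD5 looks roughly like this. The sub name md5_of_file is made up; addfile reads from the filehandle in chunks, so the file never has to fit in memory.

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute the hex MD5 of a file without slurping it: addfile() reads
# the handle in chunks and feeds them to the digest object.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}
```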
-- Snazzy tagline here
Re: how to tell if a file is still being modified
by nimdokk (Vicar) on Sep 15, 2003 at 20:26 UTC
|
FTP does not lock files while it is still writing to them (I've already run into this issue, where a 500 Mbyte file was being uploaded and our process picked it up and moved it on to another location before the upload had completed). Our solution in that case was to have the sender create a "lock" file once they had completed transmitting the file to us. The lock file is usually very small (0-byte, preferably); that way we look for the lock file and then perform the actions we need on the other file. It might not be the cleanest solution, but it seems to work.
"Ex libris un peut de tout"
|
As there is no reliable cross-platform file locking system, flag files are a common pattern for indicating process state. I worked with a system that had directories called "do", "done", "pending", "success", and "fail". The actual data file was dropped in the "pending" directory, then a file of the same name was created in the "do" directory. When the file transfer subsystem had done the transfer, it moved the "pending" file to "success", and moved the marker file from "do" to "done". Can you get the sending process to create a marker file (either with an extension, a prefix, or in another directory) to indicate completion, and monitor for the marker instead?
Re: how to tell if a file is still being modified
by Roger (Parson) on Sep 15, 2003 at 23:34 UTC
|
I remember that we had this problem with our automatic job processing too. We kick off processing when a certain file has arrived via FTP. We came up with several solutions:
1) Periodically check the size of the incoming file, and if the size has stopped growing, assume that the transfer has finished. However, this does *NOT* work, at least not reliably! We had one particular case where the FTP transfer had paused / died and the system thought that the file had been received properly, and started to process it. It made a total mess that took many days to resolve.
2) A better approach than the first one is to modify the system to receive the data file first, followed by a trigger file. The system will act on the arrival of the trigger file. This approach is more reliable than the first one; however, it assumes that the sender works properly. We had a case where the sender program/script sent half the data file, and then somehow sent the trigger file without checking that the data file had completed. This of course caused another mess.
3) The best scenario is when the data file has an integrated verification mechanism, like a ZIP file. You can be certain whether the incoming ZIP file has arrived completely by periodically testing the integrity of the ZIP file with the zip -T switch. This method works 100% of the time.
The 3rd method, with a self-validating file format, is the preferred method, if the client/sender can produce such a format. If that is not possible, fall back to the 2nd method with an additional trigger file. If that is not possible either, fall back to the 1st method and pray. :-D
|
I have considered option 3, but it feels a bit too kludgy and there's always the possibility that different versions of zip behave differently since there is no standard for creating zip archives. I'm still keeping it in the back of my mind though since, as you pointed out, option 2 can cause problems when the "trigger" file is loaded without verifying that the original file arrived correctly, and I've also seen cases where someone will load a trigger file without sending the data file (or the other way around). I'd say that it works 80-90% of the time, which is good. It's just that the 10-20% when it does not is annoying, especially when you get paged at 3am because some idjit mistakenly created a data file without a trigger, or vice versa. The best solution would perhaps be a combination of some or all of these options (provided a workable solution could be created easily) :-)
"Ex libris un peut de tout"
|
I agree with you on the diversity of versions of ZIP out there. I'd say most of the differences would be in the encryption algorithm. (I am not in the US, so I am using the export version of the strong encryption algorithm to compile my ZIP program; hmmm, perhaps that is why I haven't had any problems yet.) Provided there is no encryption requirement, ZIP is still a good solution though. And of course if there was any problem, it would show up in the testing phase, wouldn't it?
|
|
You might even enhance the functionality of the 'trigger' file, by including the MD5 sum of the transferred file...
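A minimal sketch of that check on the receiving side, under a couple of assumptions that are mine rather than anything agreed in this thread: the trigger file's first line starts with the 32-character hex MD5 of the data file, and the sub name trigger_matches is invented.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return true only if $trigger exists, its first line starts with a
# 32-digit hex MD5, and that digest matches the contents of $data.
sub trigger_matches {
    my ($data, $trigger) = @_;

    open(my $tfh, '<', $trigger) or return 0;    # no trigger file yet
    my $line = <$tfh>;
    close $tfh;
    return 0 unless defined $line && $line =~ /\A([0-9a-fA-F]{32})\b/;
    my $expected = lc $1;

    open(my $dfh, '<', $data) or return 0;       # trigger without data
    binmode $dfh;
    my $actual = Digest::MD5->new->addfile($dfh)->hexdigest;
    close $dfh;

    return $actual eq $expected ? 1 : 0;
}
```

This would catch both the half-sent data file and the trigger that arrives without its data file, at the cost of reading the data once more.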
Re: how to tell if a file is still being modified
by Abigail-II (Bishop) on Sep 15, 2003 at 21:52 UTC
|
I would use fuser. You said you have looked into it,
but you don't say why you rejected it. For obvious reasons,
you either need to be root, or own the process that's accessing the file though. Alternatively, if your OS supports
it, you might use the /proc filesystem, but then you
must have the same permissions as fuser (which uses
/proc as well). And if you have the license for it,
I bet glance/advisor will be able to give you the information
as well.
Abigail
Re: how to tell if a file is still being modified
by oylee (Pilgrim) on Sep 15, 2003 at 22:57 UTC
|
Thanks for all the replies. The best of all possible worlds would be if we could control the FTP sender in addition to the receiver and just set up some handshake protocol there, but that'd make life too easy x). The poll-the-file-until-it-no-longer-changes is something that would most certainly work but it just feels a li'l brittle.
I've already implemented it using fuser but the thought that the production machine's OS might not always have fuser available crossed my mind and wouldn't let go (the warning in Linux::Fuser stating that "even then it may not work on other than 2.2.* kernels" made me squirm a bit too).
In any case, fuser seems to be working fine (though this thing is a bit awkward to test). Once again, I've shamed my ancestors and forgotten to heed the potential wisdom of YouArentGonnaNeedIt.
Thanks again for all the input!
Allen
Re: how to tell if a file is still being modified
by iburrell (Chaplain) on Sep 15, 2003 at 23:15 UTC
|
Another mechanism you mentioned is to look at the creation time or mod time. If you can figure out the maximum upload time, your script can skip files younger than that threshold.
For example, if you say every file will take 1 hour max to upload, a five-minute-old file could still be uploading. But a two-hour-old file is safe. This doesn't work if the files are really big, the upload time varies too much, or you need frequent processing.
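As a sketch, the age test is only a few lines. The sub name older_than is hypothetical, and the threshold in seconds is whatever maximum upload time you are willing to assume.

```perl
use strict;
use warnings;

# Return 1 if $path was last modified at least $min_age seconds ago,
# and 0 if it is younger than that or cannot be stat'ed.
sub older_than {
    my ($path, $min_age) = @_;
    my $mtime = (stat($path))[9];    # field 9 of stat is the mod time
    return 0 unless defined $mtime;
    return (time() - $mtime >= $min_age) ? 1 : 0;
}
```

With the one-hour example, the script would only pick up files for which older_than($file, 3600) is true.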
Re: how to tell if a file is still being modified
by exussum0 (Vicar) on Sep 15, 2003 at 22:01 UTC
|
If you can guarantee some sorta bandwidth on the party ftping the file in, then you can do two or three ls's and make sure the file size doesn't change.
--
Play that funky music white boy..
Re: how to tell if a file is still being modified
by monktim (Friar) on Sep 16, 2003 at 14:18 UTC
|
Here is another alternative. It may be overkill for what you want to do and it involves more work, but maybe it's worth considering.
Instead of another FTP process putting the files in your directory, you can run a process to establish a passive mode FTP connection to the FTP server and get the files. WGET is a good utility for this and is freely available from GNU. WGET won't put a partial file in your directory; it will wait until a complete successful download to do so. It also works well on unstable connections and can do retries.
Then you can run your own FTP server on that same machine and allow access to the directory. On your final destination machine you can again use WGET. This time you can establish a passive mode connection to your FTP server to get the files.
Passive mode FTP connections allow you to get files across the network(s) securely. That is, as long as you have good firewall rules and you have good security on your FTP server.
Update: I jumped the gun. I was wrong about WGET not showing the file until it is complete. I tested it with a 100MB file and it got created slowly. The version of WGET I have dumps everything it gets into a single file and then creates the individual files from it. It still creates the individual files slowly and not all at once. My bad.
|
This doesn't really do the job - you've just shifted the problem from one machine to another. How do you know the file is ready to get?
This is unrelated, but where possible I would recommend getting files rather than putting them. I have found it much easier to set up UATs, parallel runs and so on with this arrangement.
I've used variations on most of the methods mentioned above:
- polling the checksum, mod time or size;
- trigger files;
- trailer records;
- renaming or moving to another directory.
I favour trigger files. I have to admit that I've spent far more time worrying about the possible problems than actually encountering them.
Re: how to tell if a file is still being modified
by Beechbone (Friar) on Sep 18, 2003 at 20:38 UTC
|
What about the FTP server's log file? Isn't there something like "completed <filename>" if you raise the log level high enough?
|
Old thread but here is another suggestion. If files arrive in a certain order then process everything except the last file.