
Re^3: Finding and sorting files in massive file directory

by dave_the_m (Prior)
on Jan 20, 2013 at 22:41 UTC

in reply to Re^2: Finding and sorting files in massive file directory
in thread Finding and sorting files in massive file directory

So basically this reads one file into memory at a time?
No, it reads one filename at a time into memory.
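As a rough sketch of what that looks like (the directory name here is hypothetical), calling readdir in scalar context inside a while loop pulls back one entry per call, so memory use stays flat no matter how many files the directory holds:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical directory. With 2 million files, avoid calling readdir
# in list context -- that would slurp every name into memory at once.
my $dir = '/some/huge/dir';

opendir my $dh, $dir or die "opendir $dir: $!";
while (defined(my $name = readdir $dh)) {
    next if $name eq '.' or $name eq '..';
    # process one filename here
}
closedir $dh;
```

The `defined` guard matters: a file named `0` would otherwise end the loop early.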
Do you know how to deal with zipping and tarring each file up in that case?
Well, I've already shown you the easy way to compress individual files with the external gzip command. If you want to combine multiple files into a single tar file (possibly compressed), you're going to have to be more specific: roughly how many files do you want to put in a single tar file? All 2 million of them, or just a select few? And do you want to delete the files afterwards?

You are likely to need to use the Archive::Tar module.
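A minimal sketch of that module in use (file and archive names are hypothetical). One caveat worth knowing up front: Archive::Tar reads each added file's contents into memory, so with this many files you would build and write archives in batches rather than all at once:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Tar;

my $tar = Archive::Tar->new;

# add_files() reads each file's contents into memory, so keep the
# batch small when files are large or numerous.
$tar->add_files('a.log', 'b.log');

# COMPRESS_GZIP is exported by Archive::Tar; this writes a .tar.gz
$tar->write('logs.tar.gz', COMPRESS_GZIP);
```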



Re^4: Finding and sorting files in massive file directory
by CColin (Scribe) on Jan 21, 2013 at 08:35 UTC

    I'd like to combine the distinct file types into single compressed tar files. There are approximately 5 types: roughly 1 million files of one type, and roughly 300k - 500k each of the other 4.

    Yes, the files need to be deleted afterwards to make disk space for incoming.


      If there is a process adding new files to the directory while your "tar up" script is running, then you need to face the twin issues of deleting files which haven't been put in the tar file, and putting empty or half-written files into the tarball. If possible, you need to be able to stop the process from adding any new files while the script is running; but if you can't, then the following should be safe.

      Use the script I gave you above to, for example, move all files starting with 'b' into a b/ subdirectory. Then wait a few minutes, or however long it could reasonably take for the process to finish writing the current file, then from the command line, simply:

      $ tar -czf .../some-path/b.tar.gz b/
      $ tar -tzf .../some-path/b.tar.gz > /tmp/foo

      View /tmp/foo in a text editor to see if it looks reasonable, then:

      $ rm -rf b/
      If the rm fails because the argument list is too long (as a glob like rm b/* would with this many files), then write another perl script similar to the one above, but using 'unlink' to remove each file one by one.
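      Such a cleanup script might look like this (the subdirectory name is hypothetical). Since unlink removes one file per call, there is no command line to overflow:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dir = 'b';   # hypothetical subdirectory to empty and remove

opendir my $dh, $dir or die "opendir $dir: $!";
while (defined(my $name = readdir $dh)) {
    my $path = "$dir/$name";
    next unless -f $path;                  # skips '.', '..' and subdirs
    unlink $path or warn "unlink $path: $!";
}
closedir $dh;
rmdir $dir or warn "rmdir $dir: $!";       # succeeds only if now empty
```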


        Also, on Linux the inotify interface provides a way to discover when new files are created or opened, and later closed.
        Thanks, I'll try it.

        I am intrigued by your earlier reference to Archive::Tar. I did look at the module briefly but it seemed rather complicated. What would it add over and above using readdir with while and basic unix commands?
