PerlMonks  

Finding and sorting files in massive file directory

by CColin (Scribe)
on Jan 20, 2013 at 19:10 UTC
CColin has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a directory with c. 2-3 million files (and growing) that I need to categorise and compress.

Categorising might be by prefix, e.g. all files beginning with a, b, c and so on. Compressing would then mean, e.g., compressing/archiving all files beginning with b. I'd probably then want to delete all files beginning with b* once archived.

I can think of using File::Find::Rule to walk through the directory and group files by type, but am concerned about how to properly handle a directory of files of this size, the correct way to compress them as I go, and then how to delete them once compressed.

At the moment, shell commands on the directory - find, ls etc - just hang!

Thanks

Re: Finding and sorting files in massive file directory
by frozenwithjoy (Curate) on Jan 20, 2013 at 19:29 UTC
Re: Finding and sorting files in massive file directory
by dave_the_m (Parson) on Jan 20, 2013 at 20:38 UTC
    You don't make it clear whether all these files are in a single directory or in a directory hierarchy; I'm going to assume the former. Most shell commands and much perl code will appear to hang forever, since they attempt to read the entire directory listing into memory and sort it before doing anything else. What you need (in general terms) is the following perl code, which must be run from the directory in question:
    #!/usr/bin/perl
    use warnings;
    use strict;

    opendir my $dir, '.' or die "opendir .: $!\n";
    my $file;
    my $count = 0;
    while (defined($file = readdir($dir))) {
        # give yourself some progression feedback
        $count++;
        print "file $count ...\n" unless $count % 1000;

        # skip all files not beginning with b
        next unless $file =~ /^b/;

        # if you've created directories, may need to skip them;
        # this will slow things down, so don't do so unless necessary
        next unless -f $file;

        # do something with the file
        rename $file, "b/$file" or die "rename $file b/$file: $!\n";
    }
    This example deals with directory entries at the most efficient and lowest level. In this case, it just moves all files starting with "b" into the subdirectory b/.

    Obviously it needs adapting to your particular needs. For example, the rename could become

    system "gzip", $file;
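Putting those two pieces together, here is a minimal sketch of that adaptation. The sub name gzip_by_prefix and its arguments are my own, not from the thread. Note that gzip itself removes each original file once its .gz is written, so the "delete once archived" step comes for free.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hedged sketch -- gzip_by_prefix is illustrative, not from the thread.
# Compresses every plain file in $dirname whose name begins with
# $prefix, reading one directory entry at a time so the full listing
# never has to fit in memory.
sub gzip_by_prefix {
    my ($dirname, $prefix) = @_;
    opendir my $dir, $dirname or die "opendir $dirname: $!\n";
    my $done = 0;
    while (defined(my $file = readdir($dir))) {
        next unless index($file, $prefix) == 0;   # cheap prefix test
        next if $file =~ /\.gz\z/;                # skip already-compressed files
        my $path = "$dirname/$file";
        next unless -f $path;                     # skip subdirectories etc.
        system("gzip", $path) == 0
            or warn "gzip $path failed: $?\n";
        $done++;
    }
    closedir $dir;
    return $done;                                 # number of files handed to gzip
}

# gzip_by_prefix('.', 'b');   # compress every b* file in this directory
```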

    Dave.

      > You don't make it clear whether all these files are in a single directory

      Yes, they are. So basically this reads one file into memory at a time? Do you know how to deal with zipping and tarring each file up in that case?
        So basically this reads one file into memory at a time?
        No, it reads one filename at a time into memory.
        Do you know how to deal with zipping and tarring each file up in that case?
        Well, I've already shown you the easy way to compress individual files with the external gzip command. If you want to combine multiple files into a single tar file (possibly compressed), you're going to have to be more specific: how many files, approximately, do you want to put in a single tar file? All 2 million of them? Or just a select few? And do you want to delete the files afterwards?

        You are likely to need to use the Archive::Tar module.

        Dave
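Along the lines Dave suggests, here is a rough sketch of batched tarring with the core Archive::Tar module. The sub name archive_by_prefix, the batch size, and the archive naming scheme are my own inventions, not from the thread: filenames are collected a batch at a time so memory stays bounded, and originals are deleted only after their archive has been written successfully.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Archive::Tar;   # core module; exports COMPRESS_GZIP

# Hedged sketch -- archive_by_prefix and its parameters are
# illustrative.  Reads one directory entry at a time; every
# $batch_size matching files are written to a numbered .tar.gz
# and then unlinked.
sub archive_by_prefix {
    my ($dirname, $prefix, $batch_size) = @_;
    opendir my $dir, $dirname or die "opendir $dirname: $!\n";
    my (@batch, $archive_no);
    my $flush = sub {
        return unless @batch;
        my $tar = Archive::Tar->new;
        $tar->add_files(@batch);
        my $name = sprintf "%s/%s-%04d.tar.gz",
                           $dirname, $prefix, ++$archive_no;
        $tar->write($name, COMPRESS_GZIP) or die $tar->error;
        unlink @batch;              # delete only after a good write
        @batch = ();
    };
    while (defined(my $file = readdir($dir))) {
        next unless index($file, $prefix) == 0;
        next if $file =~ /\.tar\.gz\z/;   # skip our own archives
        my $path = "$dirname/$file";
        next unless -f $path;
        push @batch, $path;
        $flush->() if @batch >= $batch_size;
    }
    $flush->();                     # final partial batch
    closedir $dir;
    return $archive_no // 0;        # number of archives written
}
```

A batch size in the low thousands keeps each archive manageable; with 2-3 million files, one giant tar would itself be unwieldy to restore from.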

Re: Finding and sorting files in massive file directory
by pvaldes (Chaplain) on Jan 20, 2013 at 21:04 UTC

    I have a directory with c. 2-3 million files (and growing) that I need to categorise and compress.

    shell commands on the directory - find, ls etc - just hang!

    Use wildcards in bash or glob in perl (i.e. fragment and move to several directories with mv a*.* /A-dir; compress the files beginning with b with something like gzip b*.*).

      I seem to recall that when I tried something like this it failed due to too many arguments being passed to the command, i.e. you can't run a million or so files through the gzip command?
        If you put something like this in a while (there are files that start w/ a) loop, you should get around the issue of too many arguments. (Be sure to set increment=1 before the loop.)

        mkdir dir_a.$increment
        mv `ls a* | head -n1000` dir_a.$increment/
        let increment++

        See the xargs command. It can help with breaking apart large command lines.

        find . -type f -name b\* -depth -2 | xargs $command_that_can_be_run_multiple_times

        Update: added example.

        --MidLifeXis
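The same xargs-style batching can also be done from Perl, avoiding shell quoting issues with odd filenames. The sub gzip_in_chunks below is a hypothetical sketch, not from the thread: it hands filenames to a single gzip process a few hundred at a time, which avoids both the per-file process overhead and the argument-length limit that makes a plain gzip b* fail on millions of files.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hedged sketch -- gzip_in_chunks is my own name.  Collects matching
# filenames and runs one gzip per chunk rather than one per file.
sub gzip_in_chunks {
    my ($dirname, $prefix, $chunk_size) = @_;
    opendir my $dir, $dirname or die "opendir $dirname: $!\n";
    my (@chunk, $compressed);
    my $run = sub {
        return unless @chunk;
        system("gzip", @chunk) == 0
            or warn "gzip chunk failed: $?\n";
        $compressed += @chunk;
        @chunk = ();
    };
    while (defined(my $file = readdir($dir))) {
        next unless index($file, $prefix) == 0;
        next if $file =~ /\.gz\z/;       # skip already-compressed files
        my $path = "$dirname/$file";
        next unless -f $path;
        push @chunk, $path;
        $run->() if @chunk >= $chunk_size;
    }
    $run->();                            # flush the final partial chunk
    closedir $dir;
    return $compressed // 0;
}

# gzip_in_chunks('.', 'b', 500);   # e.g. 500 filenames per gzip invocation
```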

Node Type: perlquestion [id://1014320]
Approved by Lotus1
Front-paged by Arunbear