http://www.perlmonks.org?node_id=1014320

CColin has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a directory with c. 2-3 million files (and growing) that I need to categorise and compress.

Categorising might be by prefix, e.g. all files beginning with a, b, c and so on. Compressing would then be, e.g., compressing/archiving all files beginning with b. I'd probably then want to delete all files beginning with b* once archived.

I can think of using File::Find::Rule to walk through the directory and group files by type, but am concerned about how to properly handle a directory of files of this size, the correct way to compress them as I go, and how to delete them once compressed.

At the moment, shell commands on the directory - find, ls etc - just hang!

Thanks


Re: Finding and sorting files in massive file directory
by dave_the_m (Monsignor) on Jan 20, 2013 at 20:38 UTC
    You don't make it clear whether all these files are in a single directory, or in a directory hierarchy. I'm going to assume the former. Most shell commands and much perl code will appear to hang forever, since they will attempt to read the entire directory listing into memory, then sort it, before doing anything else. What you need (in general terms) is the following perl code, which must be run from the directory in question:
    #!/usr/bin/perl
    use warnings;
    use strict;

    opendir my $dir, '.' or die "opendir .: $!\n";
    my $file;
    my $count = 0;
    while (defined($file = readdir($dir))) {
        # give yourself some progression feedback
        $count++;
        print "file $count ...\n" unless $count % 1000;

        # skip all files not beginning with b
        next unless $file =~ /^b/;

        # if you've created directories, may need to skip them;
        # this will slow things down, so don't do so unless necessary
        next unless -f $file;

        # do something with the file
        rename $file, "b/$file" or die "rename $file b/$file: $!\n";
    }
    This example deals with directory entries at the most efficient and lowest level. In this case, it just moves all files starting with "b" into the subdirectory b/.

    Obviously it needs adapting to your particular needs. For example, the rename could become

    system "gzip", $file;

    Dave.
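
    If the goal is to compress the b files in place rather than move them, the same readdir loop can be adapted along these lines. This is only a rough sketch, untested at that scale: it assumes an external gzip is on the PATH, keeps the /^b/ prefix and progress counter from the example above, and relies on gzip replacing each file with file.gz on success, so no separate delete pass is needed:

    #!/usr/bin/perl
    use warnings;
    use strict;

    opendir my $dir, '.' or die "opendir .: $!\n";
    my $count = 0;
    while (defined(my $file = readdir($dir))) {
        # progression feedback, as in the example above
        $count++;
        print "file $count ...\n" unless $count % 1000;

        next unless $file =~ /^b/;     # only files beginning with b
        next unless -f $file;          # skip subdirectories etc.
        next if $file =~ /\.gz\z/;     # don't recompress already-gzipped files

        # gzip replaces the original with $file.gz on success,
        # so there is nothing left to delete afterwards
        system('gzip', $file) == 0
            or warn "gzip $file failed: $?\n";
    }
    closedir $dir;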

      > You don't make it clear whether all these files are in a single directory

      Yes, they are. So basically this reads one file into memory at a time? Do you know how to deal with zipping and tarring each file up in that case?
        So basically this reads one file into memory at a time?
        No, it reads one filename at a time into memory.
        Do you know how to deal with zipping and tarring each file up in that case?
        Well I've already shown you the easy way to compress individual files with the external gzip command. If you want to combine multiple files into a single tar file (possibly compressed), you're going to have to be more specific: approximately how many files do you want to put in a single tar file? All 2 million of them? Or just a select few? And do you want to delete the files afterwards?

        You are likely to need to use the Archive::Tar module.

        Dave
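
        For the many-files-into-one-archive case, a hypothetical sketch with Archive::Tar might look like the one below. Archive::Tar holds the file data in memory, so this is only realistic for a modest batch at a time rather than all 2 million files at once; the archive name b-files.tar.gz and the /^b/ prefix are just placeholders:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Archive::Tar;

        # collect the names of the files to archive (here: those beginning with b)
        opendir my $dir, '.' or die "opendir .: $!\n";
        my @batch;
        while (defined(my $file = readdir($dir))) {
            next unless $file =~ /^b/ && -f $file;
            push @batch, $file;
        }
        closedir $dir;

        my $tar = Archive::Tar->new;
        $tar->add_files(@batch);

        # COMPRESS_GZIP makes Archive::Tar gzip the archive as it writes it
        $tar->write('b-files.tar.gz', COMPRESS_GZIP)
            or die "write b-files.tar.gz: " . $tar->error . "\n";

        # delete the originals only after the archive has been written
        my $deleted = unlink @batch;
        warn "deleted only $deleted of ", scalar(@batch), " files\n"
            if $deleted != @batch;

        Running this repeatedly over narrower prefixes (ba*, bb*, ...) or fixed-size batches would keep memory use bounded.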

Re: Finding and sorting files in massive file directory
by frozenwithjoy (Priest) on Jan 20, 2013 at 19:29 UTC
Re: Finding and sorting files in massive file directory
by pvaldes (Chaplain) on Jan 20, 2013 at 21:04 UTC

    I have a directory with c. 2-3 million files (and growing) that I need to categorise and compress.

    shell commands on the directory - find, ls etc - just hang!

    Use wildcards in bash or glob in Perl (i.e. split the files across several directories with mv a*.* /A-dir, and compress the files beginning with b with something like gzip b*.*).
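
    In Perl, the same idea might look something like this sketch: glob expands the pattern inside the program, so there is no shell argument-length limit to run into, although it still builds the whole list of matching names in memory. The A-dir target directory is just the example name from above:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use File::Copy qw(move);

    # move every file beginning with "a" into A-dir
    mkdir 'A-dir' unless -d 'A-dir';
    for my $file (glob 'a*') {
        next unless -f $file;
        move($file, "A-dir/$file") or warn "move $file: $!\n";
    }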

      I seem to recall that when I tried something like this it failed due to too many arguments being passed to the command, i.e. you can't run a million or so files through the gzip command?
        If you put something like this in a while (there are files that start w/ a) loop, you should get around the issue of too many arguments. (Be sure to set increment=1 before the loop).
        mkdir dir_a.$increment
        mv `ls a* | head -n1000` dir_a.$increment/
        let increment++
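
        A rough Perl equivalent of that loop, for comparison: it reads names with readdir instead of ls a*, so no long command line is ever built. The dir_a.N directory names and the batch size of 1000 follow the shell example:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use File::Copy qw(move);

        my $increment = 0;
        my $in_batch  = 1000;    # force a new batch directory for the first file

        opendir my $dh, '.' or die "opendir .: $!\n";
        while (defined(my $file = readdir($dh))) {
            next unless $file =~ /^a/ && -f $file;

            # start a fresh dir_a.N directory every 1000 files
            if ($in_batch >= 1000) {
                $increment++;
                mkdir "dir_a.$increment" or die "mkdir dir_a.$increment: $!\n";
                $in_batch = 0;
            }

            if (move($file, "dir_a.$increment/$file")) {
                $in_batch++;
            }
            else {
                warn "move $file: $!\n";
            }
        }
        closedir $dh;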

        See the xargs command. It can help with breaking apart large command lines.

        find . -type f -name b\* -depth -2 | xargs $command_that_can_be_run_multiple_times

        Update: added example.

        --MidLifeXis