http://www.perlmonks.org?node_id=1026493


in reply to Random Files Recursively Form Directory

When you say the user gives a max limit (in percentage), is that a percentage of all the files in the directory, or only of the target files in the directory? For example, if you are supposed to check just 25% (by size, not by count) and 75% of the directory is made up of extensions you are not interested in, does that mean you will end up scanning all of the files with the extensions you are interested in?

I can give you one approach (assuming the max limit given by the user is a % of the files that match the extensions, not a % of all files):
Use File::Find to find all files with extensions you are interested in.
Using "-s" or stat() to get the file sizes, build a data structure that looks like this:
$myFiles = {
    "dir1" => {
        filelist => [
            { filename => "file1.txt", size => 123 },
            { filename => "file2.doc", size => 456 },
        ],
        dir_size => 579,
    },
    ...
};
Iterate through each directory in this structure.
For each random file in "filelist" (you can use shuffle from List::Util for this), scan/process the file as long as the running total of sizes processed in that directory does not exceed the user-supplied percentage of dir_size; see the sketch below.
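
Roughly, that skeleton could look like this (untested sketch; the extension list, the 25% figure, and the process_file() stub are placeholders for whatever you actually need to do):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use List::Util qw(shuffle);

my @exts      = qw(txt doc);   # extensions you care about (placeholder)
my $pct_limit = 25;            # user-supplied max limit in percent (placeholder)
my $top_dir   = shift // '.';

my $ext_re = join '|', map { quotemeta } @exts;
my %myFiles;   # dir => { filelist => [ { filename, size }, ... ], dir_size => N }

# Steps 1-2: collect matching files and their sizes
find(sub {
    return unless -f $_ && /\.(?:$ext_re)\z/i;
    my $size  = -s _;
    my $entry = $myFiles{$File::Find::dir} //= { filelist => [], dir_size => 0 };
    push @{ $entry->{filelist} }, { filename => $_, size => $size };
    $entry->{dir_size} += $size;
}, $top_dir);

# Steps 3-4: per directory, process shuffled files until the size budget is spent
for my $dir (keys %myFiles) {
    my $budget    = $myFiles{$dir}{dir_size} * $pct_limit / 100;
    my $processed = 0;
    for my $f (shuffle @{ $myFiles{$dir}{filelist} }) {
        last if $processed + $f->{size} > $budget;
        process_file("$dir/$f->{filename}");   # your scan goes here
        $processed += $f->{size};
    }
}

sub process_file { print "scanning $_[0]\n" }   # stand-in for the real work

Whether you stop at the first file that would push you over the budget (last) or keep looking for smaller files that still fit (next) is up to you.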

Re^2: Random Files Recursively Form Directory
by gautamparimoo (Beadle) on Apr 02, 2013 at 05:07 UTC

    Yes, the % is only for the target files (i.e. 25% should be of the files with the specified extensions). Also, the data structure would become very big: I have drives of at least 100 GB, which means at least 25 GB (25%), so the time would be quite high (time for File::Find and filelist preparation + the shuffle of the filelist). So is there another workaround?

    I saw the module File::Random, which implements random-selection functions. I was thinking of using this module in a while loop, iterating over the directory until the specified size is reached. What do you think of this?
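
    Something along these lines is what I have in mind (untested sketch; I'm assuming random_file() accepts the -dir, -check and -recursive options and returns a path relative to -dir; the 25 GB target and the scan() stub are placeholders):

    use strict;
    use warnings;
    use File::Random qw(random_file);

    my $dir          = shift // '.';
    my $target_bytes = 25 * 1024**3;   # e.g. 25% of a 100 GB drive (placeholder)
    my $scanned      = 0;
    my %seen;                          # random_file() may return the same file again
    my $misses       = 0;

    while ($scanned < $target_bytes) {
        last if $misses > 1000;        # probably exhausted the matching files
        my $file = random_file(
            -dir       => $dir,        # assumed File::Random options
            -check     => qr/\.(?:txt|doc)\z/i,
            -recursive => 1,
        ) or last;                     # no matching file found
        if ($seen{$file}++) { $misses++; next }   # skip repeats
        $misses = 0;
        my $size = -s "$dir/$file" or next;
        scan("$dir/$file");            # your processing here
        $scanned += $size;
    }

    sub scan { print "scanning $_[0]\n" }   # stand-in for the real work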

    One machine can do the work of fifty ordinary men. No machine can do the work of one extraordinary man. -Elbert Hubbard

      I don't see how the size of the files in a directory would affect File::Find, data structure preparation, or shuffling the list. I would say that the number of files in a directory would have an effect. What is the number of files per directory?

      I would think File::Random would also be slow if the number of files in a directory is huge. A quick check of the code in File::Random makes me think that it re-reads the whole directory every time you call random_file(), and therefore you could get the same file returned more than once.

        Each directory contains at least 5000 files, plus subdirectories with at least 3000 files. But anyhow, it looks like the only possible solution, so I'll give it a try. Thanks...

        One machine can do the work of fifty ordinary men. No machine can do the work of one extraordinary man. -Elbert Hubbard