Random Files Recursively From Directory

by gautamparimoo (Beadle)
on Apr 01, 2013 at 08:52 UTC (#1026442)
gautamparimoo has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks..

I need to search a given directory (recursively), picking files completely at random, subject to the following limits:

  • The user gives a maximum limit on how much of the directory is searched. The limit is on size: if the directory holds 100GB and the limit is 25%, then the total size of all randomly selected files must not exceed 25GB.
  • Only a few file extensions have to be searched, such as .txt, .doc, etc.

I thought it over and looked at old nodes, which gave me these strategies:

1. Use File::Find to recurse, then use shuffle from List::Util to put the files in random order, and then scan those files. But this creates a very large array holding the paths of the files, and it will also take significantly more time.

2. Use File::Find to recurse and, at the same time, use the rand function to select a file, scan it immediately, and then move on to the next random file. I do not know how to make this approach work.

3. A module File::Random exists, but according to its listed limitations it does not work on Windows.

Also, I am not able to decide how to implement the max size check (as described above). Could you please provide guidance, example code, or other approaches?
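
One way strategy 2 might look is sketched below. It is only an illustration under stated assumptions: the limit is passed as an absolute byte budget, each matching file is accepted with a fixed probability, and the names scan_random_files, $prob and $scan are hypothetical, not taken from any module. Note that files visited early in the walk are favoured once the budget fills up, so the selection is not uniformly random.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (not from any module): walk $root once, accept each
# matching file with probability $prob, and call $scan on it immediately,
# as long as the running total stays within $max_bytes.
sub scan_random_files {
    my ($root, $max_bytes, $prob, $scan, @exts) = @_;
    my %want  = map { lc($_) => 1 } @exts;   # extensions without the dot
    my $total = 0;
    find({
        no_chdir => 1,                       # keep full paths in $File::Find::name
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path;          # plain files only
            my ($ext) = $path =~ /\.([^.\/\\]+)\z/;
            return unless defined $ext && $want{ lc $ext };
            return unless rand() < $prob;    # random accept/reject
            my $size = (-s $path) || 0;      # -s is false for empty files
            return if $total + $size > $max_bytes;   # would exceed the budget
            $total += $size;
            $scan->($path);                  # scan right away, nothing is stored
        },
    }, $root);
    return $total;                           # bytes actually scanned
}
```

A call could look like scan_random_files('C:/data', 25 * 1024**3, 0.25, sub { print "$_[0]\n" }, qw(txt doc)); the path and numbers are placeholders. No list of files is kept in memory, but the whole tree is still walked.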


Re: Random Files Recursively From Directory
by CountOrlok (Friar) on Apr 01, 2013 at 14:19 UTC
    When you say the user gives a max limit (as a percentage), is that for all the files in the directory or only for the target files in the directory? E.g., say you are supposed to check just 25% (by size, not quantity) and 75% of the directory is made up of extensions you are not interested in: does that mean you will scan all of the files with extensions you are interested in?

    I can give you one approach (assuming the max limit given by the user is a % of the files that match the extensions, not a % of all files):
    Use File::Find to find all files with the extensions you are interested in.
    Use "-s" or stat() to get the file sizes, and build a data structure that looks like this:
    $myFiles = {
        "dir1" => {
            filelist => [
                { filename => "file1.txt", size => 123 },
                { filename => "file2.doc", size => 456 },
            ],
            dir_size => 579,
        },
        ...
    };
    Iterate through each directory in this structure.
    For each random file in "filelist" (you can use shuffle from List::Util for this), scan/process the file as long as the sum of the sizes of the files processed in that directory does not exceed the user-supplied percentage of dir_size.
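
    The steps above could be sketched roughly as follows. This is an illustration, not tested on large drives: build_index and sample_index are hypothetical names, extensions are passed without the dot, and a file that would push a directory over its budget is skipped so that a smaller file may still fit (stopping at the first such file would be an equally valid reading).

```perl
use strict;
use warnings;
use File::Find;
use List::Util qw(shuffle);

# Hypothetical helper: build
#   { dir => { filelist => [ { filename, size }, ... ], dir_size => N }, ... }
# for files whose extension is in @exts.
sub build_index {
    my ($root, @exts) = @_;
    my %want = map { lc($_) => 1 } @exts;
    my %index;
    find(sub {
        return unless -f $_;                 # $_ is the bare file name here
        my ($ext) = /\.([^.]+)\z/;
        return unless defined $ext && $want{ lc $ext };
        my $size  = (-s $_) || 0;            # -s is false for empty files
        my $entry = $index{$File::Find::dir} ||= { filelist => [], dir_size => 0 };
        push @{ $entry->{filelist} }, { filename => $_, size => $size };
        $entry->{dir_size} += $size;
    }, $root);
    return \%index;
}

# Hypothetical helper: per directory, take files in random order while the
# running total stays within $percent of that directory's dir_size.
sub sample_index {
    my ($index, $percent) = @_;
    my @chosen;
    for my $dir (sort keys %$index) {
        my $budget = $index->{$dir}{dir_size} * $percent / 100;
        my $used   = 0;
        for my $file (shuffle @{ $index->{$dir}{filelist} }) {
            next if $used + $file->{size} > $budget;   # too big, try another
            $used += $file->{size};
            push @chosen, "$dir/$file->{filename}";
        }
    }
    return @chosen;
}
```

    The index holds only one name and one size per matching file, so its memory use grows with the number of files, not with their total size.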

      Yes, the % applies only to the target files (i.e. the 25% is of the files with the specified extensions). But the data structure would become very big, as my drives hold at least 100GB, which means at least 25GB (25%) to scan, so the time would be quite high (time for File::Find and file list preparation, plus the shuffle of the file list). So is there another workaround?

      I saw the module File::Random, which implements functions for picking random files. I was thinking of using this module in a while loop to iterate over the directory until the specified size is reached. What do you think of this?

        I don't see how the size of the files in a directory would affect File::Find, data structure preparation, or shuffling the list. I would say that the number of files in a directory would have an effect. What is the number of files per directory?

        I would think File::Random would also be slow if the number of files in a directory is huge. A quick check of the code in File::Random makes me think that it re-reads the whole directory every time you call random_file(), and therefore you could get the same file returned more than once.
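
        If repeated reads and repeated results are a concern, one alternative is to read each directory once and shuffle the names, so every file comes back exactly once. A minimal sketch using only core Perl and List::Util; shuffled_files is a hypothetical helper name, and it handles a single directory without recursion:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: return the plain files of one directory in random
# order. The directory is read once and each file appears exactly once.
sub shuffled_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @files = grep { -f "$dir/$_" } readdir $dh;   # drops ., .. and subdirs
    closedir $dh;
    return shuffle @files;
}
```

        Looping over the returned list and stopping once the size budget is reached would then replace repeated calls to random_file().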

