Random Files Recursively From Directory

by gautamparimoo (Beadle)
on Apr 01, 2013 at 08:52 UTC
gautamparimoo has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks..

I need to search a given directory recursively, picking files totally at random, with the following constraints:

  • The user gives a maximum limit on how much of the directory may be searched. The limit is on size: if the directory holds 100 GB and the limit is 25%, then at most 25 GB is searched at random, i.e. the total size of all randomly selected files must not exceed 25 GB.
  • Only files with certain extensions have to be searched, e.g. .txt, .doc, etc.

I thought it over and found old nodes that suggested these strategies:

1. Use File::Find to recurse, then use shuffle from List::Util to pick random files, and then scan those files. But this creates a very large array holding the paths of the selected files, and the running time will be significantly longer.

2. Use File::Find to recurse and, at the same time, use the rand function to select a file, scan it immediately, and then move on to the next random file. I do not know how to implement this approach (see the sketch after this list).

3. The module File::Random exists, but according to its documented limitations it does not work on Windows.

Also, I cannot decide how to implement the maximum-size check described above. Please provide guidance, example code, or other approaches.
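
For approach 2 together with the size cap, here is a minimal sketch, not a definitive implementation: a first File::Find pass only sums the sizes of matching files (so no large array is built), and a second pass scans each matching file with probability equal to the requested fraction until the byte budget is spent. The root path and extension list are placeholders, and the selection is only approximately uniform, since files met early in the traversal are favoured once the budget runs low:

    use strict;
    use warnings;
    use File::Find;

    my $root     = 'C:/data';            # placeholder root directory
    my $fraction = 0.25;                 # user-supplied fraction
    my $match    = qr/\.(?:txt|doc)$/i;  # extensions of interest

    # Pass 1: total size of matching files only; no file list is kept.
    my $total = 0;
    find(sub { $total += -s _ if -f && /$match/ }, $root);
    my $budget = $fraction * $total;

    # Pass 2: scan each matching file with probability $fraction
    # until the byte budget is spent.
    my $used = 0;
    find(sub {
        return unless -f && /$match/ && $used < $budget;
        return if rand() > $fraction;          # random selection
        my $size = -s _;
        return if $used + $size > $budget;     # would bust the budget
        $used += $size;
        print "scanning $File::Find::name\n";  # scan/process the file here
    }, $root);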

Thnx..

One machine can do the work of fifty ordinary men. No machine can do the work of one extraordinary man. -Elbert Hubbard

Re: Random Files Recursively From Directory
by CountOrlok (Friar) on Apr 01, 2013 at 14:19 UTC
    When you say the user gives a max limit (as a percentage), is that for all the files in the directory or only the target files? E.g. if you are supposed to check just 25% (by size, not quantity) and 75% of the directory is made up of extensions you are not interested in, does that mean you will scan all of the files with the extensions you are interested in?

    I can give you one approach, assuming the max limit given by the user is a percentage of the files that match the extensions, not of all files:
    Use File::Find to find all files with extensions you are interested in.
    With the help of "-s" or stat() to get the file sizes, build a data structure that looks like this:
    $myFiles = {
        "dir1" => {
            filelist => [
                { filename => "file1.txt", size => 123 },
                { filename => "file2.doc", size => 456 },
            ],
            dir_size => 579,
        },
        ...
    };
    Iterate through each directory in this structure.
    For each random file in "filelist" (you can use shuffle from List::Util for this), scan/process files as long as the sum of the sizes of the files processed in that directory does not exceed the user-supplied percentage of dir_size. A sketch of this approach follows.
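
    A minimal sketch of this approach, with the 25% figure, the root path, and the extension list as placeholders (printing stands in for whatever scanning you actually do):

        use strict;
        use warnings;
        use File::Find;
        use List::Util qw(shuffle);

        my %byDir;
        find(sub {
            return unless -f && /\.(?:txt|doc)$/i;
            my $size = -s _;
            push @{ $byDir{$File::Find::dir}{filelist} },
                 { filename => $_, size => $size };
            $byDir{$File::Find::dir}{dir_size} += $size;
        }, 'C:/data');

        my $pct = 0.25;   # user-supplied percentage
        for my $dir (sort keys %byDir) {
            my $budget = $pct * $byDir{$dir}{dir_size};
            my $used   = 0;
            for my $f (shuffle @{ $byDir{$dir}{filelist} }) {
                next if $used + $f->{size} > $budget;
                $used += $f->{size};
                print "$dir/$f->{filename}\n";   # scan/process the file here
            }
        }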

      Yes, the % is only for the target files (i.e. 25% of the files with the specified extensions). But the data structure would become very big: my drives are at least 100 GB, which means at least 25 GB (25%), so the time would be quite high (time for File::Find and filelist preparation, plus the shuffle of the filelist). Is there another workaround?

      I saw the module File::Random, which implements random-selection functions. I was thinking of using it in a while loop, iterating over the directory until the specified size is reached (see the sketch below). What do you think of this?
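
      A minimal sketch of that loop, assuming File::Random's documented random_file() interface (-dir, -check, -recursive) and bearing in mind the Windows caveat above; the %seen hash and the miss counter are guards against the duplicate picks and endless looping that repeated random selection can cause:

          use strict;
          use warnings;
          use File::Random qw(random_file);

          my $dir    = 'C:/data';        # placeholder directory
          my $budget = 25 * 1024**3;     # placeholder: 25 GB in bytes
          my ($used, $misses) = (0, 0);
          my %seen;                      # random_file() can repeat files

          while ($used < $budget && $misses < 1000) {
              my $file = random_file(
                  -dir       => $dir,
                  -check     => qr/\.(?:txt|doc)$/i,
                  -recursive => 1,
              );
              if (!defined $file || $seen{$file}++) { $misses++; next; }
              my $size = -s "$dir/$file";
              if (!defined $size || $used + $size > $budget) { $misses++; next; }
              $used  += $size;
              $misses = 0;
              print "scanning $dir/$file\n";   # scan/process the file here
          }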

      One machine can do the work of fifty ordinary men. No machine can do the work of one extraordinary man. -Elbert Hubbard

        I don't see how the size of the files in a directory would affect File::Find, data-structure preparation, or shuffling the list. I would say that the number of files in a directory does have an effect. How many files per directory do you have?

        I would think File::Random would also be slow if the number of files in a directory is huge. A quick check of the code in File::Random makes me think that it re-reads the whole directory every time you call random_file(), and therefore you could get the same file returned more than once.
