Limiting a glob

by zod (Scribe)
on Mar 09, 2009 at 17:32 UTC (#749359)
zod has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I've got ~40,000 files in a directory (Windows XP). Rather than globbing all of the files into an array at once, I am looking for a way to glob only a certain number at a time (let's say 100). Can you set a limit on a glob like that?

I guess I'm thinking in SQL here (which can be dangerous). I don't care about the order of the glob -- I just want 100 file names at a crack. (Like a TOP or LIMIT in SQL.)

The goal here is to move every 100 files into a newly created directory. It doesn't matter which 100 files, just 100 files. I know I could glob the whole dir into an array and then process the array 100 at a time, but I'm just curious if you can do it without globbing the whole dir at once.

Gratefully,

Zod

Re: Limiting a glob
by moritz (Cardinal) on Mar 09, 2009 at 17:39 UTC
      ... and use a regular expression to select the files you are interested in.

      You can use Text::Glob to transform a file name containing wildcards into a regular expression.

      Here you will find another function (_glob_to_regex) implementing a glob to regex transformation.
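      A minimal sketch of the readdir-plus-regex idea, using only core Perl. The helper name select_files and the hand-written pattern are assumptions made for illustration; the pattern only roughly mirrors the glob "*.pdf" (it skips dot files and ignores case) and was not generated by Text::Glob.

```perl
use strict;
use warnings;

# Return the entries of $dir whose names match $re.
# select_files() is a hypothetical helper, not part of any module.
sub select_files {
    my ($dir, $re) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @matches;
    while (defined(my $name = readdir $dh)) {
        next if $name eq '.' or $name eq '..';
        push @matches, $name if $name =~ $re;
    }
    closedir $dh;
    return @matches;
}

# Hand-written stand-in for the glob pattern "*.pdf":
# no leading dot, ends in ".pdf", case-insensitive.
my $pdf_re = qr/\A[^.].*\.pdf\z/i;

# Example call (commented out so the sketch has no side effects):
# print "$_\n" for select_files('.', $pdf_re);
```

      Note that this still collects every matching name in @matches; the filtering happens per entry, but the result list is only as small as the pattern makes it.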

Re: Limiting a glob
by jethro (Monsignor) on Mar 09, 2009 at 17:43 UTC
    perl -e 'while ($s=<*>) { print $s,"\n"; sleep 5 }'

    As you can see, a glob in a while loop hands back filenames one at a time (hopefully perl really buffers the filenames in the background and doesn't read them in all at once). You might change it to something like this:

    use strict;
    use warnings;

    my @files;
    my $filecount = 0;
    while (my $s = <*>) {
        push @files, $s;
        if (++$filecount >= 100) {
            DoTheMoveWith(@files);
            @files = ();       # empty the batch, or it keeps growing
            $filecount = 0;
        }
    }
    DoTheMoveWith(@files) if (@files);   # move the last, partial batch
      Thanks. This illustrates a basic misunderstanding I had of the way glob actually works internally.
      (hopefully perl really buffers the filenames in the background and doesn't read them in all at once)

      No, perl reads the entire list of files returned by glob into memory at once. Run the following to test this (WARNING: it may run for a long time or cause system resource problems on weaker machines).

      #!/usr/bin/perl
      use strict;
      use warnings;

      mkdir "gtest" or die "screaming";
      for (1..40000) {
          open my $f, ">", "gtest/$_" or die "gnashing";
          close $f or die "howling";
      }
      my $c = 0;
      while (my $r = glob("gtest/*")) {
          $c++;
          if ($c == 1) {
              for (1..40000) { unlink "gtest/$_" or die "wailing" }
          }
      }
      print "$c\n";
      rmdir "gtest" or die "exhausted";

      I believe this is done so that glob returns a consistent snapshot of the directory contents as they were at some point, regardless of whether the contents change while you process the results. If you want more up-to-date data, with only current files being returned, you'll have to use opendir and readdir.


      All dogma is stupid.
        perl reads the entire list of files returned by glob into memory at once.

        This is what I originally assumed, hence my original question. So, I guess that makes the answer to my question, "No, you can't set a limit on a glob."

        Thanks

Re: Limiting a glob
by swampyankee (Parson) on Mar 09, 2009 at 17:51 UTC

    Don't use glob; use opendir and readdir to loop through the directory entries. Strip out file types you don't want to move (directories, system files, hidden files, etc.) and use rename or File::Copy's move function. Alternatively, you can push the names of the files you want to move onto an array and move them whenever the array fills up or you run out of files (I think it's quite unlikely that the number of files mod 100 will be 0). In the former case, empty the array and keep iterating; in the latter, you're done.
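    The approach described above could be sketched roughly as follows. The helper names next_batch and move_in_batches and the "batch_NNNN" directory naming are assumptions for illustration, not anything from the thread. Because the effect of changing a directory while readdir is walking it is not guaranteed to be the same everywhere, this sketch reopens the directory for each batch instead of moving files in the middle of a single readdir pass; it never holds more than one batch of names in memory.

```perl
use strict;
use warnings;
use File::Copy qw(move);
use File::Spec;

# Read at most $limit plain-file names from $src, then stop.
# next_batch() is a hypothetical helper.
sub next_batch {
    my ($src, $limit) = @_;
    opendir(my $dh, $src) or die "Cannot open $src: $!";
    my @batch;
    while (@batch < $limit and defined(my $name = readdir $dh)) {
        # -f skips '.', '..' and subdirectories
        push @batch, $name if -f File::Spec->catfile($src, $name);
    }
    closedir $dh;
    return @batch;
}

# Move every plain file in $src into new subdirectories of $dest_root,
# at most $limit files per subdirectory. Returns the number of files moved.
sub move_in_batches {
    my ($src, $dest_root, $limit) = @_;
    my ($batch_no, $moved) = (0, 0);
    while (my @batch = next_batch($src, $limit)) {
        my $dest = File::Spec->catdir($dest_root,
                                      sprintf('batch_%04d', ++$batch_no));
        mkdir $dest or die "Cannot create $dest: $!";
        for my $name (@batch) {
            move(File::Spec->catfile($src, $name),
                 File::Spec->catfile($dest, $name))
                or die "Cannot move $name: $!";
            $moved++;
        }
    }
    return $moved;
}

# Example call (commented out so the sketch has no side effects):
# move_in_batches('C:/data/incoming', 'C:/data/sorted', 100);
```

    The paths in the commented-out call are placeholders. The destination root may even sit inside the source directory, since the -f test ignores the batch directories themselves.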

    Now, are you moving 100 files into a new directory for each 100 files or moving all the files into a single new directory?


    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

      It's all fair advice, but why wouldn't zod want to use glob?

        I've no idea why zod wouldn't want to; I just offered him a way to do so without using glob.


        Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re: Limiting a glob
by zentara (Archbishop) on Mar 09, 2009 at 18:58 UTC
Re: Limiting a glob
by dwm042 (Priest) on Mar 09, 2009 at 20:57 UTC

    Given the small number of files you're reading into an array at once (40,000 names isn't a lot, really), I'd grab all of them and then use an array slice to operate on subsets of your array until finished.
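    As a rough illustration of the array-slice idea, here is one way to cut a list into groups of 100. The helper name chunk is an assumption made for this sketch.

```perl
use strict;
use warnings;

# Split @items into array refs holding at most $size elements each,
# using array slices. chunk() is a hypothetical helper name.
sub chunk {
    my ($size, @items) = @_;
    die "chunk size must be positive" unless $size > 0;
    my @chunks;
    for (my $i = 0; $i < @items; $i += $size) {
        my $end = $i + $size - 1;
        $end = $#items if $end > $#items;   # last slice may be short
        push @chunks, [ @items[$i .. $end] ];
    }
    return @chunks;
}

# Example call (commented out so the sketch has no side effects):
# for my $group (chunk(100, glob('*'))) {
#     print scalar(@$group), " files in this group\n";
# }
```

    Each element returned by chunk is an array reference, so the caller can hand @$group to whatever routine does the actual move.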

    Check also here, and here.

Re: Limiting a glob
by hbm (Hermit) on Mar 10, 2009 at 16:56 UTC

    A word from the field, and sorry it took me a while to track down my notes:

    I have a script that worked fine for months, doing something like my @pdfs = glob("$path/*.pdf");. One day the script failed, and with use diagnostics; I got this:

    internal error: glob failed at ...
    (P) Something went wrong with the external program(s) used for glob and <*.c>. This may mean that your csh (C shell) is broken. If so, you should change all of the csh-related variables in config.sh: If you have tcsh, make the variables refer to it as if it were csh (e.g. full_csh='/usr/bin/tcsh'); otherwise, make them all empty (except that d_csh should be 'undef') so that Perl will think csh is missing. In either case, after editing config.sh, run ./Configure -S and rebuild Perl.

    There were not "too many files", which I had seen before with glob. This was new... The environment is out of my hands, so I simply gave up with glob and used File::Find.
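    For readers wondering what such a replacement might look like, here is a minimal sketch using File::Find. It is not the script from this post; the helper name find_pdfs is an assumption, and the pruning is added only so that the result resembles glob("$path/*.pdf") rather than a recursive search.

```perl
use strict;
use warnings;
use File::Find;

# Collect the *.pdf files directly inside $path without calling glob.
# find_pdfs() is a hypothetical helper name.
sub find_pdfs {
    my ($path) = @_;
    my @pdfs;
    find(sub {
        # $_ is '.' for the starting directory; prune every other
        # directory so the search stays at the top level.
        if (-d $_ and $_ ne '.') {
            $File::Find::prune = 1;
            return;
        }
        push @pdfs, $File::Find::name if -f $_ and /\.pdf\z/;
    }, $path);
    my @sorted = sort @pdfs;
    return @sorted;
}

# Example call (commented out so the sketch has no side effects):
# my @pdfs = find_pdfs($path);
```

    Dropping the pruning block would turn this into a recursive search, which is File::Find's default behaviour.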
