PerlMonks  

Using grep and glob to find directories containing file

by Anonymous Monk
on Feb 03, 2013 at 14:42 UTC ( #1016826=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying (for example) to identify the subset of directories (in a list of directories) that contain a file beginning with the letter 'f'.
I am using the following sample code to test:
perl -e 'use Data::Dumper; print Dumper grep(glob("$_/f*"),("dir1", "dir2", "dir3") )'
The idea is that glob, evaluated in scalar context, returns true if there is a file of the form dirN/f*.

However, it does not seem to behave that way: if dir1 contains N files beginning with 'f', then the first N directories are returned by grep even if none of the others contain a file beginning with 'f'. It is as if glob is not being evaluated in scalar context. Note that even if I force scalar context with (scalar glob("$_/f*")), it still fails this way.

Any clue what is going wrong?
Any suggestions for alternative approaches?

Re: Using grep and glob to find directories containing file
by BrowserUk (Pope) on Feb 03, 2013 at 14:52 UTC

    grep will return the input (e.g. "dir1", "dir2", "dir3") if the glob returns true. Perhaps you want map.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I thought grep will *only* return the elements of the array for which the first expression is true. Since the first expression glob("$_/f*") is only true for 'dir1', it should only return that element of the directory list (even though glob returns 3 files in that directory). The glob finds no elements in the other 2 directories, so it should be undefined which would evaluate as false.

      I don't understand why the elements of glob("$_/f*") returned for the first directory entry seemingly spill over to subsequent directories in the (implicit) grep iteration.
        Since the first expression glob("$_/f*") is only true for 'dir1', it should only return that element of the directory list (even though glob returns 3 files in that directory).

        You are missing the fact that grep puts the glob in a scalar context, and that makes it act as an iterator:

        In scalar context, glob iterates through such filename expansions, returning undef when the list is exhausted.
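
A minimal sketch of that iterator behaviour, assuming a scratch tree built with File::Temp. The layout and the file names fa, fb and fc are made up for the demonstration, and the temporary path is assumed to contain no whitespace. The glob inside grep is a single operator, so it keeps handing out the remaining matches from dir1 while grep has already moved on to dir2 and dir3:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical layout: dir1 holds three f* files, dir2 and dir3 hold none.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/$_" or die "mkdir $_: $!" for qw(dir1 dir2 dir3);
for my $name (qw(fa fb fc)) {
    open my $fh, '>', "$root/dir1/$name" or die "open $name: $!";
    close $fh;
}

# The grep expression is evaluated in scalar context, so glob acts as an
# iterator: it only looks at its argument again once its current list of
# matches has been used up.
my @dirs = grep(glob("$root/$_/f*"), qw(dir1 dir2 dir3));

print "@dirs\n";    # all three directories, although only dir1 has matches
```

With three matches in dir1, the three grep iterations each receive one of them, which is the "first N directories" symptom described in the question.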

Re: Using grep and glob to find directories containing file
by moritz (Cardinal) on Feb 03, 2013 at 15:14 UTC
    Weird, your code works for me:
        $ mkdir dir1 dir2 dir3
        $ touch dir2/foo
        $ perl -wE 'use Data::Dumper;print Dumper grep(glob("$_/f*"),("dir1", "dir2", "dir3"))'
        $VAR1 = 'dir2';

    Which is exactly what you seem to want.

    Maybe your problem is actually elsewhere?

    Update: I found a problem. If there's more than one match, the code skips every second dir. The reason is that glob in scalar context is stateful, and needs one extra cycle to reset. A possible fix is to evaluate the glob in list context, and check if the list has at least one element.
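
One way to sketch that fix, assuming a throwaway directory tree made with File::Temp (the file names and the @matches variable are illustrative only). Assigning the result of glob to an array forces list context, and the array then serves as the truth value for grep:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical layout: dir1 and dir3 contain f* files, dir2 does not.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/$_" or die "mkdir $_: $!" for qw(dir1 dir2 dir3);
for my $path (qw(dir1/fa dir1/fb dir3/fc dir2/other)) {
    open my $fh, '>', "$root/$path" or die "open $path: $!";
    close $fh;
}

# Assigning to an array puts glob in list context, so every call expands
# its own pattern from scratch. As the last expression of the grep block,
# @matches is evaluated in scalar context and yields the match count.
my @dirs = grep { my @matches = glob("$root/$_/f*"); @matches } qw(dir1 dir2 dir3);

print "@dirs\n";    # dir1 dir3
```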

      Thanks. Putting it in array context -- i.e. @{[glob("$_/f*")]} -- did the trick. I changed the code to:
      perl -wE 'use Data::Dumper;print Dumper grep(@{[glob("$_/f*")]},("dir1", "dir2", "dir3"))'
      The code seems to work with that change but I just want to check that there is no need to explicitly check "if the list has at least one element" since presumably evaluating the list (by grep) will determine if it is empty or not.

      I'm still not really sure why scalar context doesn't work (indeed, I would have thought scalar context would be better than list context). The stateful carry over part seems to be weird if not buggy. But as long as it works for me by forcing an array context, then it's all good even if it was far from obvious at first glance.
        I just want to check that there is no need to explicitly check "if the list has at least one element" since presumably evaluating the list (by grep) will determine if it is empty or not.

        Right, no need for an explicit check. Arrays in scalar context evaluate to the number of elements, so only empty arrays are false in boolean context.

        The stateful carry over part seems to be weird if not buggy.

        Well, glob needs to carry state for this useful idiom to work:

            while (my $file = glob '*.txt') {
                # do something with $file
            }

        Which is more friendly to memory than using a list and iterating that.

        But one could argue that glob should discard its internal state when called with a different argument.

        > grep(@{[glob("$_/f*")]}

        there are easier ways to count in list context,

        grep { () = <$_/*> } <dir{1,2,3}>

        (kind of a "half goatse" ;)

        Cheers Rolf

Re: Using grep and glob to find directories containing file
by 7stud (Deacon) on Feb 03, 2013 at 17:28 UTC

    This behavior is certainly not obvious and it is not (clearly) documented either under 'perldoc -f glob' or at perldoc.perl.org. The line saying "In scalar context, glob iterates through such filename expansions, returning undef when the list is exhausted" does not make it clear (at least to me) that such a state persists across calls to glob with a new argument!

    I agree that the docs for glob (perldoc -f glob) are not clear. When I read the docs for glob, and I came upon the sentence:

    In scalar context, glob iterates through such filename expansions, returning undef when the list is exhausted.
    

    I had no idea what that meant. However, I immediately recognized that that sentence did NOT mean what you claimed it meant, namely that in scalar context glob returns true if there were any matching files. I have no idea how you arrived at that interpretation. In my opinion, the literal interpretation would be that in scalar context, glob() sits for a few seconds as it spins through the list of matches, and then glob returns undef, i.e. glob always returns undef in scalar context.

    In any case, after reading the docs I was prompted to try an experiment to see how glob() works in scalar context. So I set up this directory structure:

    /some_dir
       my_prog.pl
       dir1/
          a.txt
          b.txt
          f.txt
          f1.txt
       dir2/
       dir3/
          x.txt
          y.txt
    
    And then I ran this code:
        use strict;
        use warnings;
        use 5.012;

        while (my $x = glob "dir1/f*") {
            say $x;
        }

        --output:--
        dir1/f.txt
        dir1/f2.txt

    After examining the output, I quickly understood how glob() works in scalar context. To be clearer, the docs should say something like this:

    In scalar context, glob returns the next filename from the list of matching filenames or undef if the list has been exhausted.

    Edit-- so that statement should have some qualifications:

    In scalar context:

    1. Inside a loop: glob returns the next filename from the list of matching filenames, or undef if the list has been exhausted--with the next call to glob returning the first matching filename again.
    2. Outside a loop: glob() always returns the first matching filename.
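
A hedged sketch that may help separate the two cases. It assumes, as perlop describes for the glob operator, that each glob operator in the source keeps its own list of pending matches, so a loop is just the usual way one operator gets run several times. The helper name next_match and the files f1 and f2 are made up for the demonstration:

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Temp qw(tempdir);

my $root = tempdir(CLEANUP => 1);
for my $name (qw(f1 f2)) {
    open my $fh, '>', "$root/$name" or die "open $name: $!";
    close $fh;
}

# One glob operator, run four times through a sub and with no loop at all:
# it still walks through its list, reports undef, and then starts over.
sub next_match { return scalar glob "$root/f*" }

my @same_site = (next_match(), next_match(), next_match(), next_match());

# Two separate glob operators: each one starts its own list.
my $first  = glob "$root/f*";
my $second = glob "$root/f*";

print join(' ', map { defined $_ ? basename($_) : 'undef' } @same_site), "\n";    # f1 f2 undef f1
print basename($first), ' ', basename($second), "\n";                            # f1 f1
```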

      I agree with your final revision to the documentation string. The doc text truly doesn't make much literal sense. And unless one has perfect Perl Monk karma, I don't see how one can easily intuit the difference between scalar context in a looped vs. non-looped context. The purpose of documentation is (presumably) to help those who are not yet experts. In this case, I humbly propose that the documentation fails to adequately and properly document the behavior in scalar context.

      Also what happens when glob is called in a function that is embedded in a loop? Either way I can imagine challenges. If it is still considered in a loop then the behavior for example of glob used somewhere deep in a module function would vary depending on whether it was at some level called from something in a loop. On the other hand if calling it from a function that is embedded in a loop behaves differently from calling directly, then again you have an odd behavior where simply wrapping 'glob' in a function call would change its behavior.

      To me, this still seems quite flaky and unpredictable. At a minimum, it deserves copious documentation to explain the behavior and potential issues.
Re: Using grep and glob to find directories containing file
by 7stud (Deacon) on Feb 03, 2013 at 23:28 UTC

    Also what happens when glob is called in a function that is embedded in a loop?

    Let's find out:

        use strict;
        use warnings;
        use 5.012;

        sub do_stuff {
            glob "dir1/f*";
        }

        for my $i (1..10) {
            print "$i: ";
            if (my $x = do_stuff()) {
                print "\t$x";
            }
            print "\n";
        }

        --output:--
        1:     dir1/f1
        2:     dir1/f2
        3:
        4:     dir1/f1
        5:     dir1/f2
        6:
        7:     dir1/f1
        8:     dir1/f2
        9:
        10:    dir1/f1

    perl says the context of the glob iterator is still a loop. And it doesn't matter how deep the subs are nested:

        use strict;
        use warnings;
        use 5.012;

        sub do_stuff {
            get_glob(shift);
        }

        sub get_glob {
            glob shift;
        }

        for my $i (1..10) {
            print "$i: ";
            if (my $x = do_stuff('dir1/f*')) {
                print "\t$x";
            }
            print "\n";
        }

        --output:--
        1:     dir1/f1
        2:     dir1/f2
        3:
        4:     dir1/f1
        5:     dir1/f2
        6:
        7:     dir1/f1
        8:     dir1/f2
        9:
        10:    dir1/f1

    the behavior for example of glob used somewhere deep in a module function would vary depending on whether it was at some level called from something in a loop

    I can't figure out an example of that. Edit -- okay, here is an example that shows how a function that relies on the behavior of glob inside the function can produce faulty results when the function is called in a loop:

        use strict;
        use warnings;
        use 5.012;

        sub do_stuff {
            my $file_pattern = shift;

            #No loop in sight...
            my $d = glob $file_pattern;
            if ($d) {
                say $d;  #...so expect 'dir1/f1'
            }

            #No loop in sight...
            my $e = glob $file_pattern;
            if ($e) {
                say $e;  #...so expect 'dir1/f1' again
            }
        }

        for my $i (1..10) {
            do_stuff("dir1/f*");
        }

        --output:--
        dir1/f1
        dir1/f1
        dir1/f2   #uh oh
        dir1/f2   #no no
        dir1/f1
        dir1/f1
        dir1/f2   #Darn
        dir1/f2   #darn darn
        dir1/f1
        dir1/f1
        dir1/f2   #But, but...the docs...
        dir1/f2   #I'm fired?? !$#!#@$!@#!!!!
        dir1/f1
        dir1/f1
      Thanks 7stud for all your patience and persistence in helping me figure out this strange/unexpected behavior.

      It seems to me that this is a hidden and potentially significant time-bomb type issue since glob is a core function and it's not inconceivable that people will bury it somewhere in a module where it is used in static context. Then it will lie there waiting until one day someone calls the module from a loop and gets wrong results.

      This would be bad enough if the behavior were fully or even adequately documented. But currently, the documentation at best alludes rather obscurely to the behavior that can lead to an issue in such a context.

      Do people agree this is a valid issue that needs addressing either in 'fixing' glob or at least in documenting and warning about the behavior?

      If so how does one properly report such an issue?

        I think the problem is most similar to the problem of keys (not) resetting the iterator over a hash. I guess that the best solution is to not call glob in scalar context at all.

        Most likely, part of the documentation of keys can be adapted to be added to the glob documentation. I would open a bug report using the perlbug utility, best together with a proposed documentation patch that cautions against using glob in scalar context.
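
A sketch of what avoiding scalar context could look like in module code. The helper name first_match is hypothetical, and the scratch files f1 and f2 exist only for the demonstration. The parentheses around $first turn the assignment into a list assignment, so glob expands the whole pattern on every call and keeps no iterator between calls:

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Temp qw(tempdir);

# first_match is a hypothetical helper. The list assignment puts glob in
# list context, so nothing is left over for the next call to pick up.
sub first_match {
    my ($pattern) = @_;
    my ($first) = glob $pattern;
    return $first;    # undef when nothing matches
}

my $root = tempdir(CLEANUP => 1);
for my $name (qw(f1 f2)) {
    open my $fh, '>', "$root/$name" or die "open $name: $!";
    close $fh;
}

# Called repeatedly, the helper gives the same answer every time.
my @results = map { basename(first_match("$root/f*")) } 1 .. 4;

print "@results\n";    # f1 f1 f1 f1
```

The trade-off is that the full list of matches is built on each call, which is usually acceptable unless the directory is very large.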

Re: Using grep and glob to find directories containing file
by 7stud (Deacon) on Feb 04, 2013 at 18:18 UTC

    With "goatse" it's a list _assignment_ = which in scalar context returns the number of list elements assigned.

    I'll have to accept that on faith. Edit: Well, I guess I don't have to take it on faith:

        use strict;
        use warnings;
        use 5.012;

        my $x = () = ('a', 'b', 'c');
        #scalar context on left, a list assignment on right
        #Full goatse syntax: my $x =()= ('a', 'b', 'c');

        say $x;

        --output:--
        3

    PS: you replied to yourself twice.

    In my opinion, the indenting is a terrible feature. All posts in a thread should be at the same level of indenting--none. If you want to respond to a particular post, you should quote it.
Re: Using grep and glob to find directories containing file
by arnaud99 (Beadle) on Feb 06, 2013 at 12:06 UTC
    Hi, I have just noticed your post and would like to propose an alternative approach. It uses the File::Find::Rule package and simplifies things (in my view). This is running on perl 5.16.0

        use strict;
        use warnings;
        use autodie;
        use File::Find::Rule;

        my @f_files = File::Find::Rule
            ->file
            ->name(qr/^f.*$/)
            ->in(qw(/dir1 /dir2 /dir3));

        foreach (@f_files) {
            print "$_\n";
        }
    The output looks like:
        $ perl find_f_files.pl
        /dir1/f2.dat
        /dir1/f.dat
        /dir1/subdir1/f2.dat
        /dir1/subdir1/f.dat
        /dir3/subdir3/f3.dat
    I hope this helps.
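
Since the original question asked for directories rather than files, the file list would still need one more step. A possible sketch, using the core File::Basename module and the sample paths from the output above as stand-in data (in real use @f_files would come from File::Find::Rule). Note that this search descends into subdirectories, so the result can include directories that were not in the original list:

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Sample paths copied from the output above; in real use this would be the
# @f_files list returned by File::Find::Rule.
my @f_files = qw(
    /dir1/f2.dat
    /dir1/f.dat
    /dir1/subdir1/f2.dat
    /dir1/subdir1/f.dat
    /dir3/subdir3/f3.dat
);

# Keep each containing directory once, in first-seen order.
my %seen;
my @dirs = grep { !$seen{$_}++ } map { dirname($_) } @f_files;

print "$_\n" for @dirs;
```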

Node Type: perlquestion [id://1016826]
Approved by moritz
Front-paged by Arunbear