PerlMonks  

Reliable glob?

by hepcat72 (Sexton)
on Oct 21, 2014 at 18:34 UTC ( [id://1104595]=perlquestion )

hepcat72 has asked for the wisdom of the Perl Monks concerning the following question:

I became disenchanted with perl's glob function quite some time ago and started using bsd_glob instead. It seems to handle the strings it's given better and more comprehensively, but I encountered some unexpected results yesterday. I implemented a work-around, but I'd like the implementation to be simpler.

I debugged an issue where, when the script was called and given a long glob pattern, it was truncating it to just the preceding directory. So I just wrote a preprocessing routine to pre-expand any patterns containing '{...}' to hopefully shorten the string sent in on the command line before calling bsd_glob on it.

As far as I understand this issue - and I could be slightly off - the posix flag GLOB_LIMIT (or *ARG_MAX?) is set too low to be able to handle a string that was successfully submitted to the script.

By implementing my preprocessing of the string that comes from the command line, I was able to break anything up that had a '{...}' pattern in it into shorter pieces that I then spoon-feed to bsd_glob a bite at a time.

I feel like there's got to be a better solution. If the script is receiving the whole string from the command line, shouldn't the limits of bsd_glob be the same as those of the surrounding shell? Why would it not be able to handle as long a string as the surrounding shell script can give it?

Here's a toy example that fails in the first call; remove one item from the list and it works in the second call:

>perl -e '$x="cff_updated/1_lib/{A3DWE.1.Solexa-142587.splice.fastq,A3DWE.1.Solexa-142588.splice.fastq,A3DWE.1.Solexa-142589.splice.fastq,A3DWE.1.Solexa-142590.splice.fastq,A3DWE.1.Solexa-142594.splice.fastq,A3DWE.1.Solexa-142595.splice.fastq,A3DWE.1.Solexa-142596.splice.fastq,A3DWE.1.Solexa-142597.splice.fastq,A3DWE.1.Solexa-142598.splice.fastq,A3DWE.1.Solexa-142599.splice.fastq,A3DWE.1.Solexa-142600.splice.fastq,A3DWE.1.Solexa-142602.splice.fastq,A3DWE.1.Solexa-142603.splice.fastq,A3DWE.1.Solexa-142605.splice.fastq,A3DWE.1.Solexa-142606.splice.fastq,A3DWE.1.Solexa-142607.splice.fastq,A3DWE.1.Solexa-142608.splice.fastq,A3DWE.1.Solexa-142609.splice.fastq,A3DWE.1.Solexa-142610.splice.fastq,A3DWE.1.Solexa-142611.splice.fastq,A3DWE.1.Solexa-142612.splice.fastq,A3DWE.1.Solexa-142613.splice.fastq,A3DWE.1.Solexa-142614.splice.fastq,A3DWE.1.Solexa-142615.splice.fastq,A3DWE.1.Solexa-142616.splice.fastq,A3DWE.1.Solexa-142617.splice.fastq,A3DWE.1.Solexa-142618.splice.fastq,A3DWE.1.Solexa-142619.splice.fastq,A3DWE.1.Solexa-142621.splice.fastq}.drp.fna.lib";use File::Glob ":glob";@y=bsd_glob($x);print(join("\n",@y),"\n");'
cff_updated/1_lib/

>perl -e '$x="cff_updated/1_lib/{A3DWE.1.Solexa-142587.splice.fastq,A3DWE.1.Solexa-142588.splice.fastq,A3DWE.1.Solexa-142589.splice.fastq,A3DWE.1.Solexa-142590.splice.fastq,A3DWE.1.Solexa-142594.splice.fastq,A3DWE.1.Solexa-142595.splice.fastq,A3DWE.1.Solexa-142596.splice.fastq,A3DWE.1.Solexa-142597.splice.fastq,A3DWE.1.Solexa-142598.splice.fastq,A3DWE.1.Solexa-142599.splice.fastq,A3DWE.1.Solexa-142600.splice.fastq,A3DWE.1.Solexa-142602.splice.fastq,A3DWE.1.Solexa-142603.splice.fastq,A3DWE.1.Solexa-142605.splice.fastq,A3DWE.1.Solexa-142606.splice.fastq,A3DWE.1.Solexa-142607.splice.fastq,A3DWE.1.Solexa-142608.splice.fastq,A3DWE.1.Solexa-142609.splice.fastq,A3DWE.1.Solexa-142610.splice.fastq,A3DWE.1.Solexa-142611.splice.fastq,A3DWE.1.Solexa-142612.splice.fastq,A3DWE.1.Solexa-142613.splice.fastq,A3DWE.1.Solexa-142614.splice.fastq,A3DWE.1.Solexa-142615.splice.fastq,A3DWE.1.Solexa-142616.splice.fastq,A3DWE.1.Solexa-142617.splice.fastq,A3DWE.1.Solexa-142618.splice.fastq,A3DWE.1.Solexa-142619.splice.fastq}.drp.fna.lib";use File::Glob ":glob";@y=bsd_glob($x);print(join("\n",@y),"\n");'
cff_updated/1_lib/A3DWE.1.Solexa-142587.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142588.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142589.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142590.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142594.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142595.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142596.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142597.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142598.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142599.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142600.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142602.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142603.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142605.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142606.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142607.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142608.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142609.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142610.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142611.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142612.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142613.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142614.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142615.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142616.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142617.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142618.splice.fastq.drp.fna.lib
cff_updated/1_lib/A3DWE.1.Solexa-142619.splice.fastq.drp.fna.lib

Perhaps bsd_glob is behaving as it should and the design of my surrounding script is what needs to be fixed...

The surrounding shell script is part of a distributed package - a pipeline of analysis commands, each of which is a perl script. I had decided to handle the input files explicitly instead of supplying a glob pattern involving an '*' so that users could analyze a subset of their data files. Each step in the pipeline adds an extension to each of the files, which is why I chose to use the '{...}' glob pattern, so that I could do this: -d "{$STUBS}.extension". I wrapped it in double quotes so that the shell wouldn't expand them into a space-separated list and thus everything would be identified as an argument to the preceding flag. Like this:

script.pl -i "dir1/{$STUBS}.drp.fna.lib.n0s.cands" -d "dir2/{$STUBS}.drp.fna.lib"

Assuming that I might eventually encounter a command that exceeds an arbitrary command-line length limit, what would be a better way to submit a series of files, each with an added extension from the output of the previous step?

I await the wisdom of the perl monks.

Replies are listed 'Best First'.
Re: Reliable glob?
by Loops (Curate) on Oct 21, 2014 at 23:48 UTC

    For what it's worth, both examples you give work here on a Fedora box with Perl v5.18. The issue has nothing to do with the GLOB_LIMIT flag, since it is off by default. So it seems it's a problem with your system or perhaps an older version of Perl?

    What's interesting is that the same results can be obtained just by dropping the trailing curly brace in a glob of any length:

    use File::Glob ":bsd_glob";
    my $x = 'cff_updated/1_lib/{a,b,c';
    my @y=bsd_glob($x);
    print "Error: $!\n" if &File::Glob::GLOB_ERROR;
    print(join("\n",@y),"\n");

    Which prints the same thing as in your first example, without any error value returned:

    cff_updated/1_lib/

    You could add some diagnostics as shown in the example above and mentioned in the File::Glob documentation to your test and see if anything is returned. Also note that this File::Glob documentation also says that the ":glob" tag is now discouraged and you should use ":bsd_glob".

    So while it doesn't happen on Fedora you seem to be hitting some limit in the length of the input glob and it is being truncated, dropping the trailing curly brace. Of course maybe you'll actually get an error return value if you try the example above which will point to another issue.

    As an aside, since you mention an aversion to the built-in glob, the File::Glob documentation also mentions:

    Since v5.6.0, Perl's CORE::glob() is implemented in terms of bsd_glob(). Note that they don't share the same prototype--CORE::glob() only accepts a single argument. Due to historical reasons, CORE::glob() will also split its argument on whitespace, treating it as multiple patterns, whereas bsd_glob() considers them as one pattern.
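The whitespace difference described in that documentation excerpt can be seen with a small experiment. This is only a sketch: the scratch directory and the file name "my file.txt" are made up for illustration, and it imports just the bsd_glob function (no export tag), so the built-in glob keeps its default behaviour.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);    # import only the function, not a tag
use File::Temp qw(tempdir);

# Create a scratch directory holding one file whose name contains a space.
# (Hypothetical name, chosen only for this demonstration.)
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/my file.txt";
open(my $fh, '>', $file) or die "Cannot create $file: $!";
close($fh);

# bsd_glob() treats the whole string as one pattern, space included.
my @bsd = bsd_glob("$dir/my *.txt");

# The built-in glob() splits its argument on whitespace first, so it sees
# two patterns: "$dir/my" and "*.txt" (the latter relative to the cwd).
my @core = glob("$dir/my *.txt");

print "bsd_glob:   $_\n" for @bsd;
print "CORE::glob: $_\n" for @core;
```

With the space treated as part of the pattern, bsd_glob should find the file, while the built-in glob never looks for a name containing a space.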
    
      Regarding my perl version:

      perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level

      I expect that a good number of my users will be running on macs, but probably also a lot on Linux. Actually, the problem was reported from a Linux user (Ubuntu, I think) - though it was a much longer string when he used it. I had just started chopping off values from the '{}' pattern until it started to work when I started debugging the issue.

      I tried printing the error as you proposed:

      perl -e '$x="cff_updated/1_lib/{A3DWE.1.Solexa-142587.splice.fastq,A3DWE.1.Solexa-142588.splice.fastq,A3DWE.1.Solexa-142589.splice.fastq,A3DWE.1.Solexa-142590.splice.fastq,A3DWE.1.Solexa-142594.splice.fastq,A3DWE.1.Solexa-142595.splice.fastq,A3DWE.1.Solexa-14A3DWE.1.Solexa-142597.splice.fastq,A3DWE.1.Solexa-142598.splice.fastq,A3DWE.1.Solexa-142599.splice.fastq,A3DWE.1.Solexa-142600.splice.fastq,A3DWE.1.Solexa-142602.splice.fastq,A3DWE.1.Solexa-142603.splice.fastq,A3DWE.1.Solexa-142605.splice.fastq,A3DWE.1.Solexa-142606.splice.fastq,A3DWE.1.Solexa-142607.splice.fastq,A3DWE.1.Solexa-142608.splice.fastq,A3DWE.1.Solexa-142609.splice.fastq,A3DWE.1.Solexa-142610.splice.fastq,A3DWE.1.Solexa-142611.splice.fastq,A3DWE.1.Solexa-142612.splice.fastq,A3DWE.1.Solexa-142613.splice.fastq,A3DWE.1.Solexa-142614.splice.fastq,A3DWE.1.Solexa-142615.splice.fastq,A3DWE.1.Solexa-142616.splice.fastq,A3DWE.1.Solexa-142617.splice.fastq,A3DWE.1.Solexa-142618.splice.fastq,A3DWE.1.Solexa-142619.splice.fastq,A3DWE.1.Solexa-142621.splice.fastq}.drp.fna.lib";use File::Glob ":bsd_glob";@y=bsd_glob($x,GLOB_LIMIT | GLOB_CSH);print(join("\n",@y),"\n");print "Error: $\!\n" if &File::Glob::GLOB_ERROR;'
      cff_updated/1_lib/


      but like you, I didn't get an error. I even tried: "bsd_glob($x,GLOB_LIMIT | GLOB_CSH | GLOB_ERR)".

      My main problem with the built-in glob is that it splits on spaces even if they are escaped, and last I tried, it didn't do anything with glob characters like '?' or '{}' or maybe even character classes. I don't remember what version of perl I was running at the time, but it had to have been at least 5.6.
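For comparison, here is a quick check of how the built-in glob treats a brace pattern on a Perl where it is implemented in terms of bsd_glob (v5.6.0 and later, per the documentation quoted earlier in the thread). It is a minimal sketch with a made-up pattern; older or unusual builds may behave differently.

```perl
use strict;
use warnings;

# A brace-only pattern: each alternative is combined with the suffix.
# No matching files need to exist for the alternatives to be returned.
my @names = glob("{a,b}c");

print join(",", @names), "\n";
```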

      Ultimately, it seems like using perl code to expand the '{}' patterns is the only way to mitigate this truncation issue. Basically, I did it like this. Does anyone have suggestions to streamline it or make it more comprehensive?

      #Keep updating an array to be the expansion of a file pattern to
      #separate files
      my @expanded = ($nospace_string);

      #If there exists a '{X,Y,...}' pattern in the string
      if($nospace_string =~ /\{[^\{\}]+\}/)
        {
          #While the first element still has a '{X,Y,...}' pattern
          #(assuming everything else has the same pattern structure)
          while($expanded[0] =~ /\{[^\{\}]+\}/)
            {
              #Accumulate replaced file patterns in @buffer
              my @buffer = ();
              foreach my $str (@expanded)
                {
                  #If there's a '{X,Y,...}' pattern, split on ','
                  if($str =~ /\{([^\{\}]+)\}/)
                    {
                      my $substr = $1;
                      my $before = $`;
                      my $after  = $';
                      my @expansions = split(/,/,$substr);
                      push(@buffer,map {$before . $_ . $after} @expansions);
                    }
                  #Otherwise, push on the whole string
                  else
                    {push(@buffer,$str)}
                }

              #Reset @expanded with the newly expanded file strings so that
              #we can handle additional '{X,Y,...}' patterns
              @expanded = @buffer;
            }
        }

      #Pass the newly expanded file strings through
      return(wantarray ? @expanded : [@expanded]);
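One possible way to streamline the expansion above is a small recursive routine. This is only a sketch under stated assumptions: the name expand_braces is hypothetical, and it handles only non-nested '{X,Y,...}' groups (possibly several per string) with no escaping of braces or commas.

```perl
use strict;
use warnings;

# Hypothetical helper: expand the leftmost '{X,Y,...}' group that contains no
# braces of its own, then recurse on each result until no groups remain.
sub expand_braces {
    my ($pattern) = @_;
    if ($pattern =~ /\{([^{}]+)\}/) {
        my $alternatives = $1;
        my $before = substr($pattern, 0, $-[0]);   # text before the group
        my $after  = substr($pattern, $+[0]);      # text after the group
        # Limit of -1 keeps empty trailing alternatives such as '{a,}'
        return map { expand_braces($before . $_ . $after) }
               split(/,/, $alternatives, -1);
    }
    return $pattern;    # nothing left to expand
}

my @files = expand_braces('dir2/{stub1,stub2}.drp.fna.lib');
print "$_\n" for @files;
```

Using @- and @+ instead of $` and $' avoids the match-variable slowdown on older perls. The resulting strings are short enough to be passed to bsd_glob one at a time if any wildcards remain.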


      Rob

        Hi Rob,

        I took a look at the C source for bsd_glob and it does indeed truncate the input pattern in all cases, regardless of what options you use. That said, you're hitting an unusually short maximum buffer size. But that size is compiled into the C code and is not changeable at runtime. So as you suggest above you'll have to find some way to work around this if you're trying to support such platforms.
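A script could at least warn before handing bsd_glob a pattern that is likely to be cut short. The sketch below assumes the compiled-in buffer is about PATH_MAX characters. That is an inference, not something the File::Glob documentation states, although by my count the failing pattern in this thread is 1046 characters and the working one 1011, which would fit a 1024-character buffer. The function name and the 1024 fallback are hypothetical.

```perl
use strict;
use warnings;
use POSIX ();

# ASSUMPTION: the pattern buffer is roughly PATH_MAX characters long.
# The fallback of 1024 is a guess, not a documented File::Glob value.
sub pattern_may_be_truncated {
    my ($pattern, $limit) = @_;
    $limit = eval { POSIX::PATH_MAX() } unless defined $limit;
    $limit = 1024 unless defined $limit;
    return length($pattern) >= $limit ? 1 : 0;
}

# Build a long, made-up brace pattern and check it against the limit.
my $pattern = 'dir/{' . join(',', map { "file$_.fastq" } 1 .. 200) . '}.lib';
warn "Pattern is ", length($pattern), " characters; bsd_glob may truncate it\n"
    if pattern_may_be_truncated($pattern);
```

When the check fires, the pattern can be expanded in Perl first and the shorter pieces globbed individually.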

        If you don't mind a CPAN dependency, there are several Perl-only glob implementations on CPAN you could explore. I tried out Text::Glob::Expand and it handled your input string without a problem. I even added a second trailing brace expansion to make it longer, and it was still okay:

        use Text::Glob::Expand;
        my $x = 'cff_updated/1_lib/{A3DWE.1.Solexa-142587.splice.fastq,A3DWE.1.Solexa-142588.splice.fastq,A3DWE.1.Solexa-142589.splice.fastq,A3DWE.1.Solexa-142590.splice.fastq,A3DWE.1.Solexa-142594.splice.fastq,A3DWE.1.Solexa-142595.splice.fastq,A3DWE.1.Solexa-142596.splice.fastq,A3DWE.1.Solexa-142597.splice.fastq,A3DWE.1.Solexa-142598.splice.fastq,A3DWE.1.Solexa-142599.splice.fastq,A3DWE.1.Solexa-142600.splice.fastq,A3DWE.1.Solexa-142602.splice.fastq,A3DWE.1.Solexa-142603.splice.fastq,A3DWE.1.Solexa-142605.splice.fastq,A3DWE.1.Solexa-142606.splice.fastq,A3DWE.1.Solexa-142607.splice.fastq,A3DWE.1.Solexa-142608.splice.fastq,A3DWE.1.Solexa-142609.splice.fastq,A3DWE.1.Solexa-142610.splice.fastq,A3DWE.1.Solexa-142611.splice.fastq,A3DWE.1.Solexa-142612.splice.fastq,A3DWE.1.Solexa-142613.splice.fastq,A3DWE.1.Solexa-142614.splice.fastq,A3DWE.1.Solexa-142615.splice.fastq,A3DWE.1.Solexa-142616.splice.fastq,A3DWE.1.Solexa-142617.splice.fastq,A3DWE.1.Solexa-142618.splice.fastq,A3DWE.1.Solexa-142619.splice.fastq,A3DWE.1.Solexa-142621.splice.fastq}{.drp,.fna,.lib}';
        my @y = map { $_->text } @{Text::Glob::Expand->parse($x)->explode};
        print "Number of items: ", scalar @y, $/, join($/,@y);

        In any case, hope you find a relatively painless way to deal with this. Cheers.

Re: Reliable glob?
by Anonymous Monk on Oct 21, 2014 at 20:44 UTC
    check the docs, bsd_glob takes a lot of options
      Yes, I read about those after my post and added GLOB_CSH, which I deem to be an improvement, but it doesn't fix my issue with string length.
