PerlMonks  

How can I use printf FORMAT strings for (Win32) shell globs?

by ozboomer (Pilgrim)
on Jul 17, 2017 at 01:10 UTC (#1195222=perlquestion)
ozboomer has asked for the wisdom of the Perl Monks concerning the following question:

Hi, again, folks...

I'm trying to work out a simple/reliable regex to deal with printf 'format' strings with shell globs.

I've experimented a lot and had a look on-line... I've gone through the O'Reilly books (Perl & Regular Expression books, cookbooks, etc) and although they mention the glob2pat sub (Perl Cookbook, Sect 6.9), that converts the wrong way (but see Bug in glob2pat?)... and I can't work out an 'elegant' regex to do the job I need to do.

Although... the following works Ok... but it appears pretty long-winded and ugly to me... and I'm guessing there must be a cleaner way to do it...

    use Data::Dumper;

    ($match) = (@ARGV);                       # Get the 'printf format' string to match
    printf(" \$match: >%s<\n\n", $match);     # Input as `Img%04d.png` (backprime quoted)

    $match =~ /([\w]+)(%[0-9]+d)\.([\w]+)/i;  # EX: Img%04d.png
    $prefix = $1;
    $format = $2;
    $ext = $3;

    printf(" \$prefix: >%s<\n", $prefix);
    printf(" \$format: >%s<\n", $format);
    printf(" \$ext: >%s<\n\n", $ext);

    $format =~ /%([0-9]+)d/;                  # Note the '%04d' portion
    $count = $1;

    $glob_str = sprintf("%s%s\.%s",           # ...and build the DOS/glob match strings
                        $prefix, "\*", $ext);
    $grep_str = sprintf("%s[0-9]{%d}\.%s", $prefix, $count, $ext);

    printf("\$glob_str: >%s<\n", $glob_str);
    printf("\$grep_str: >%s<\n\n", $grep_str);

    @tmpfiles = glob($glob_str);              # Get all the glob-matched files
    printf("\@tmpfiles:\n");                  # ...EX: Img*.png
    print Dumper(@tmpfiles);
    printf("\n");

    @files = grep(/$grep_str/i, @tmpfiles);   # Filter to match the 'printf format'
    printf("\@files:\n");
    print Dumper(@files);
    printf("\n");

In this case, I'm looking at running ActiveState Perl 5.16.3 under Win32, although I'll probably also need it to work under a similar Perl version on Linux as well.

I'd greatly appreciate any guidance, please.

Thanks a heap.

Replies are listed 'Best First'.
Re: How can I use printf FORMAT strings for (Win32) shell globs?
by stevieb (Abbot) on Jul 17, 2017 at 01:55 UTC

    I had a few minutes to take a crack, and the following is what I came up with. I made it about half-way through. It is tested, but not well vetted; i.e., it runs, it's much safer than what you have, and it shows a somewhat cleaner way to display things (you don't need printf() in Perl in most cases, as variables interpolate within double quotes).

        use warnings;
        use strict;

        die "need arg!\n" if ! @ARGV;

        chomp (my $match = $ARGV[0]);

        print "match: $match\n";

        my ($prefix, $format, $ext)
            = $match =~ /([\w]+)(%[0-9]+d)\.([\w]+)/i
            or die "nope, can't dig up stuff\n";

        print "prefix: $prefix\n" .
              "format: $format\n" .
              "ext: $ext\n\n";

        my ($count) = $format =~ /%([0-9]+)d/
            or die "can't fetch format...\n";

        print "count: $count\n";

    Output:

        match: Img%04d.png
        prefix: Img
        format: %04d
        ext: png

        count: 04

    I am literally gearing up to head out of the city to photograph Aurora Borealis (as the Kp index is extremely high), so the other Monks can help out with the rest, and correct me where I've been hasty ;)

      I'll continue:

          my @files = @{ finder() };

          sub finder {
              use File::Find;
              my $lpre = length($prefix);
              my $lext = length($ext);
              # prefix + digits + '.' + extension
              my $mustbesize = $lpre + $count + 1 + $lext;
              my @txts;
              find(sub {
                  my $name = substr($File::Find::name, 2);   # kill the ./
                  return unless (length($name) == $mustbesize);
                  return unless (substr($name, 0, $lpre) eq $prefix);
                  # compare the tail, including the dot, against the extension
                  return unless (substr($name, -($lext + 1)) eq ".$ext");
                  return unless (substr($name, $lpre, $count) =~ m/^\d+$/);
                  push @txts, $name;
              }, '.');
              return \@txts;
          } # finder

      Edit: oops, added the digits test; hope I did it right with substr.

Re: How can I use printf FORMAT strings for (Win32) shell globs?
by kcott (Chancellor) on Jul 17, 2017 at 06:44 UTC

    G'day ozboomer,

    "... but it appears pretty long-winded and ugly to me ..."

    Agreed. I think you've thrown far too much code into that solution. In your first regex, you have character classes inside character classes ([\w]); and a pointless 'i' modifier (aim to avoid that anyway as it slows down the regex). You seem to have gone somewhat overboard with sprintf usage to create a regex; qr// would have been a better choice in my opinion (I've used it in the code below).

    Here's the guts of what I think you need:

        my @parts = $ARGV[0] =~ /^(\w+)%(\d+)d[.](\w+)$/;
        $parts[1] =~ s/^0*//;
        my $re = qr{(?x: ^ $parts[0] \d{$parts[1]} [.] $parts[2] $ )};
        my $glob_str = "$parts[0]*.$parts[2]";
        print for grep { /$re/ } glob $glob_str;

    It would have been useful if you'd put together sample data, test input and actual/expected output. I dummied up these filenames for testing:

        $ ls -1 Img*
        Img.png
        Img.svg
        Img0000.png
        Img1.png
        Img12.png
        Img123.png
        Img1234.png
        Img12345.png
        Img1239.png

    I put those five lines of code into a script (pm_1195222_fmt_glob_re.pl) so that you can see each stage (much as your OP code does).

        #!/usr/bin/perl -l

        use strict;
        use warnings;

        print 'Command line arg:';
        print $ARGV[0];

        my @parts = $ARGV[0] =~ /^(\w+)%(\d+)d[.](\w+)$/;
        print "Parts: @parts";

        $parts[1] =~ s/^0*//;
        print "Parts (after stripping zeros): @parts";

        my $re = qr{(?x: ^ $parts[0] \d{$parts[1]} [.] $parts[2] $ )};
        print "Filter RE: $re";

        my $glob_str = "$parts[0]*.$parts[2]";
        print "Glob string: $glob_str";

        print 'Found files:';
        print for grep { /$re/ } glob $glob_str;

    Here's a couple of sample runs (the first using your "Img%04d.png" string).

        $ pm_1195222_fmt_glob_re.pl 'Img%04d.png'
        Command line arg:
        Img%04d.png
        Parts: Img 04 png
        Parts (after stripping zeros): Img 4 png
        Filter RE: (?^:(?x: ^ Img \d{4} [.] png $ ))
        Glob string: Img*.png
        Found files:
        Img0000.png
        Img1234.png
        Img1239.png

        $ pm_1195222_fmt_glob_re.pl 'Img%03d.png'
        Command line arg:
        Img%03d.png
        Parts: Img 03 png
        Parts (after stripping zeros): Img 3 png
        Filter RE: (?^:(?x: ^ Img \d{3} [.] png $ ))
        Glob string: Img*.png
        Found files:
        Img123.png

    You might also be interested in Win32::Autoglob.

    — Ken

Re: How can I use printf FORMAT strings for (Win32) shell globs?
by haukex (Prior) on Jul 17, 2017 at 10:16 UTC

    That's an interesting question. But I am wondering about a few things:

    • Perl's sprintf format strings are pretty complex - are you aiming to support all of it (I assume not), or just a subset, and if so, what subset?
    • Why are you getting the patterns as printf format strings in the first place, instead of globs or regular expressions? Maybe you could approach the problem from a different angle and have the user input one of those?
    • A printf format string of "Img%04d.png" can also produce an output of "Img123456.png", since %04d is just a minimum width specifier, but in your code you take it to mean exactly four digits. Plus, you exclude negative numbers, which are also possible with %04d. Why the difference?
    • Why do you convert to glob patterns first, why not just stick to regular expressions all the way through? They're more powerful and can also be used for listing files in combination with readdir, Path::Class, Path::Tiny, or File::Find::Rule. (Update: And they could even be used like so: my @files = grep {/$regex/} glob(".* *");, although I'd strongly recommend one of the aforementioned modules.)
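
To illustrate the point about %04d being a minimum width: a minimal sketch, with sample numbers chosen purely for illustration, showing what sprintf produces for values that are short, exactly four digits, longer than four digits, and negative.

```perl
use strict;
use warnings;

# %04d pads with zeros up to a width of four; it never truncates,
# and the sign of a negative number counts towards the width.
my @names = map { sprintf 'Img%04d.png', $_ } (7, 1234, 123456, -5);

print "$_\n" for @names;
# Img0007.png
# Img1234.png
# Img123456.png
# Img-005.png
```

So a filter that insists on exactly four digits is stricter than the format string itself, which is fine only if the numbers are known to stay in range.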

    The following is just something I played with, note it is very minimal and unfinished, e.g. it currently supports only a few specifiers and doesn't do anything with the flags or width fields. But basically, it'll turn a format string like "Img%04d.png" into a regex roughly like /^Img([-+]?[0123456789]+)\.png$/ (I used Regexp::Common::number to implement the number matching).
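
The code from this reply is not reproduced here. As a stand-in, here is a rough sketch of the same idea under my own assumptions: the helper name format_to_regex is hypothetical, it uses a plain character class rather than Regexp::Common::number, and it handles only %d, %s and %%, parsing but ignoring flags and width.

```perl
use strict;
use warnings;

# Hypothetical helper (not the original poster's code): convert a
# printf-style format into an anchored regex. Flags and width are
# recognised so the spec can be split off, but otherwise ignored.
sub format_to_regex {
    my ($format) = @_;
    my $pattern = '';

    # Split into conversion specs and literal runs, keeping the specs.
    for my $piece (split /(%[-+ 0#]*\d*[a-zA-Z%])/, $format) {
        next if $piece eq '';
        if    ($piece eq '%%')                { $pattern .= '%' }
        elsif ($piece =~ /\A%[-+ 0#]*\d*d\z/) { $pattern .= '([-+]?[0-9]+)' }
        elsif ($piece =~ /\A%[-+ 0#]*\d*s\z/) { $pattern .= '(.*?)' }
        elsif ($piece =~ /\A%/)               { die "unsupported conversion: $piece\n" }
        else                                  { $pattern .= quotemeta $piece }
    }
    return qr/\A$pattern\z/;
}

my $re = format_to_regex('Img%04d.png');
print "$_\n"
    for grep { $_ =~ $re } qw(Img0001.png Img123456.png Img-005.png Img.png notes.txt);
# Img0001.png
# Img123456.png
# Img-005.png
```

Because the width is ignored, this matches everything the format could have produced, which is looser than the exactly-N-digits filters elsewhere in the thread.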

Re: How can I use printf FORMAT strings for (Win32) shell globs?
by salva (Abbot) on Jul 17, 2017 at 08:17 UTC
Re: How can I use printf FORMAT strings for (Win32) shell globs?
by RonW (Vicar) on Jul 17, 2017 at 21:19 UTC

    printf format codes are also used by scanf, and String::Scanf provides format_to_re()

    So, maybe something like:

        # Untested
        # Use a printf/scanf pattern to match file names

        use String::Scanf qw();
        use File::Find::Rule;

        my @w = qw( . ); # where to look
        my $f = $ARGV[0];

        $f =~ s/[\012\015]+$//; # universal chomp()

        my $r = String::Scanf::format_to_re($f);

        my @files = File::Find::Rule->file()
                                    ->name( $r )
                                    ->in( @w );
Re: How can I use printf FORMAT strings for (Win32) shell globs?
by pryrt (Deacon) on Jul 17, 2017 at 13:32 UTC

    Though I am far from a regex expert, I had a thought that I hadn't seen suggested yet: watching you use the first regex to extract $format, and then a second to extract $count from $format, you could actually combine the two extractions into the same regex. With nested parentheses, the captures are assigned in the left-to-right order of the opening parentheses (see perlretut#Extracting matches). Using this, plus stevieb's and kcott's excellent suggestions, my regex snippet would be:

        use warnings;
        use strict;
        ...
        my ($prefix, $format, $count, $ext) = ($match =~ /(\w+)(%(\d+)d)\.(\w+)/)
            or die "cannot find match";          # EX: Img%04d.png
        printf(" \$prefix: >%s<\n", $prefix);    # (\w+)   => Img
        printf(" \$format: >%s<\n", $format);    # (%...d) => %04d
        printf(" \$count: >%s<\n", $count);      # (\d+)   => 04
        printf(" \$ext: >%s<\n\n", $ext);        # (\w+)   => png
        ...

      One can elaborate on this. Surely, the % and d parts are pointless to extract: if anything is extracted, you know it's a format and you know it's a d format. The leading 0 on the 04 tells you zero versus space lead-padding in an integer format specifier, but padding with spaces in a file name seems a bit dodgy, so you're really only concerned with whether the width is true/false "fixed and leading-zero padded". (Update: Well, if it's never space-padded, I guess the 0 is superfluous and you don't have to worry about it at all: if a width is given, it's fixed!)

          c:\@Work\Perl\monks>perl -wMstrict -le
          "my $string = 'Img%04d.png';
           ;;
           my $d_format = my ($prefix, $fixed, $width, $ext) =
              $string =~ m{ \A (\w+) % (0?) (\d+) d [.] (\w+) \z }xms;
           ;;
           die 'no format' unless $d_format;
           $fixed = length $fixed ? 1 : '';
           ;;
           print qq{string: '$string'};
           print qq{prefix: '$prefix'};
           print qq{ fixed: '$fixed'};
           print qq{ width: '$width'};
           print qq{   ext: '$ext'};
          "
          string: 'Img%04d.png'
          prefix: 'Img'
           fixed: '1'
           width: '4'
             ext: 'png'

      One can imagine extending this approach to the %s format with various min/max widths, justification, etc.


      Give a man a fish:  <%-{-{-{-<

Re: How can I use printf FORMAT strings for (Win32) shell globs?
by ozboomer (Pilgrim) on Jul 17, 2017 at 23:33 UTC

    Many thanks for the really useful suggestions, folks.. I *knew* there had to be a better way... and, as the question involved regex, I expected *lots* of options ;)

    I guess I should've explained context a bit better.. and sure, example runs & data always help.. but I simply thought the question was one of 'translation' and 'substitution', so I didn't include test data and such this time around.

    By way of background... I'm basically putting an 'intelligent wrapper' around 'ffmpeg', the video manipulation tool... and one of the functions I'm dealing with is the creation/usage of a sequence of frames from a video. The normal syntax for extracting frame images uses an argument like 'Img%04d.png' so that the filename is a consistent length and ffmpeg can output/input a sequence of images in-order - this is why the spec. is zero-filled - and the wrapper determines the width ('4' in this example) after having counted all the frames in the video. This also explains why I don't need to consider negative numbers, etc... and the 'fixed-width, zero-filled' format is a requirement for some subsequent processing after this wrapper creates the image files.

    With Win32's CMD.exe (or 4NT.exe or TCC.exe, which I'm generally using), the shell gets confused with that '%' (even when escaped, it seems), so the string that's used on the command line is `Img%04d.png` (with the backprimes)... and that seems to work with Linux and Win32 CLI shells Ok...

    The wrapper does some checks with the string before it calls ffmpeg to see if there are any pre-existing, conflicting files... but I admit I'm slack in that I only considered using glob(), as I thought (unjustifiably?) it might be the simplest method and it might work better than using something like File::Find, particularly when checking for 10s of 1000s of (frame image) files per video... and I *did* forget that File::Find IS core in Perl these days... Ooop!

    Anyway, the original code was purposely 'ordinary' as I'm *still* not so flash with regex and I wanted to be sure I properly understood what each part of the code was doing.

    Certainly, there are a lot of *most helpful* postings here now, Fanx!... so I can try some more elegant methods.. and see which one best suits what I'm trying to do.

    Fanx! again, everyone...

      With Win32's CMD.exe (or 4NT.exe or TCC.exe, which I'm generally using), the shell gets confused with that '%' (even when escaped, it seems)
      Did you take into account that CMD escapes % differently from other characters? See e.g. on SS64 under "Escaping Percents"

        Oh, yes... ...as does printf() (when using ' or ")... and different versions of 4NT/TCC/TCC-LE/etc, which also use different multi-command separators which may also include '%' characters...

        Anyway, with all this escaping, the (originally) simple approach is getting more and more complex to read and understand... particularly when I come back to this in a year or sumfin'...(!) ...and I want to try and avoid any (somehow) avoidable complexity for something as simple as 'globbing' a list of files(!)

        ...but point well-taken... Fanx!

      ffmpeg

      Ah, well that narrows it down. From the documentation:

      The syntax foo-%03d.jpeg specifies to use a decimal number composed of three digits padded with zeroes to express the sequence number. It is the same syntax supported by the C printf function, but only formats accepting a normal integer are suitable.

      When importing an image sequence, -i also supports expanding shell-like wildcard patterns (globbing) internally, by selecting the image2-specific -pattern_type glob option.

      And the "image2" documentation goes into a little more detail:

      A sequence pattern may contain the string "%d" or "%0Nd", which specifies the position of the characters representing a sequential number in each filename matched by the pattern. If the form "%d0Nd" is used, the string representing the number in each filename is 0-padded and N is the total number of 0-padded digits representing the number. The literal character % can be specified in the pattern with the string "%%".

      And skimming the rest of the documentation, it seems these are pretty much the only two patterns you'll have to worry about, which makes the conversion much simpler - they can all be replaced by regexes of the form \d+ or \d{N}.
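
A minimal sketch of that conversion, assuming only the three forms quoted above (%d, %0Nd and %%) need handling. The helper names sequence_pattern_to_regex and existing_frames are made up for illustration, and the directory scan uses plain readdir, so it is non-recursive.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an ffmpeg image2 sequence pattern into a regex.
# %d -> \d+, %0Nd -> \d{N}, %% -> a literal percent sign.
sub sequence_pattern_to_regex {
    my ($pattern) = @_;
    my $regex = '';
    for my $piece (split /(%%|%0\d+d|%d)/, $pattern) {
        if    ($piece eq '%%')           { $regex .= '%' }
        elsif ($piece eq '%d')           { $regex .= '\d+' }
        elsif ($piece =~ /\A%0(\d+)d\z/) { $regex .= '\d{' . ($1 + 0) . '}' }
        else                             { $regex .= quotemeta $piece }
    }
    return qr/\A$regex\z/;
}

# Hypothetical helper: list the entries in $dir whose names match the
# sequence pattern, e.g. to spot pre-existing frame files.
sub existing_frames {
    my ($dir, $pattern) = @_;
    my $re = sequence_pattern_to_regex($pattern);
    opendir my $dh, $dir or die "cannot open $dir: $!\n";
    my @found = sort grep { $_ =~ $re } readdir $dh;
    closedir $dh;
    return @found;
}

my @conflicts = existing_frames('.', 'Img%04d.png');
print "existing frames: ", scalar(@conflicts), "\n";
```

This skips glob entirely, so no shell-style escaping of the prefix is needed. Note that the match is case sensitive, which may or may not be what you want on Win32.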

      Have you looked at FFmpeg on CPAN? There are other ffmpeg-related modules on CPAN, as well.

      (Note that the module name is case sensitive: FFmpeg is the Perl module; ffmpeg is the command-line program itself, which is written in C, not Perl.)
