PerlMonks  

multiple matches per line *AND* multiple capture groups per match

by Special_K (Monk)
on Dec 21, 2013 at 18:02 UTC (#1068046, perlquestion)

Special_K has asked for the wisdom of the Perl Monks concerning the following question:

I have the following precompiled regexp:

    $test_regexp = qr/url="(http:\/\/downloads\.bbc\.co\.uk\/podcasts\/worldservice\/globalnews\/(globalnews_${year}${mon}${mday}-\d{4}[a-z]\.mp3))"/;

As you can see, I have 2 capture groups built into this match. One captures the complete URL, the other capture gets just the filename itself. The webpage I use this particular regexp on may contain multiple valid matches on the same line. I would like to capture both the complete URL and the filename for each one. If it matters (and I don't believe it does), the precompiled regexp is passed to a function that dumps the webpage to a file, opens the file with the TEMP_XML_FILE handle, and searches for the $test_regexp matches on each line. Right now I have this:

    while (<TEMP_XML_FILE>) {
        if ((@complete_url, @filename) = ($_ =~ /$test_regexp/g)) {
            printf("found %d matches\n", scalar(@filename));
            <>;
            for ($i = 0; $i < @filename; $i++) {
                printf("filename = %s, complete_url = %s\n", $filename[$i], $complete_url[$i]);
                <>;
            }
        }
    }

The problem is that the printf statement is reporting 0 matches. After reading through the entire file, I want the @complete_url array to contain the complete list of URLs, and the @filename array to contain the complete list of filenames. How can I accomplish this? I realize I might be able to capture just the complete url and derive the filenames from it in a separate step, but for the sake of this discussion how can I capture both the filenames and urls into their respective arrays when there could be multiple matches per line?

Replies are listed 'Best First'.
Re: multiple matches per line *AND* multiple capture groups per match
by hdb (Monsignor) on Dec 21, 2013 at 18:24 UTC

    In the assignment

    (@complete_url, @filename) = ($_ =~ /$test_regexp/g)

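    all matches will be assigned to the array @complete_url, because the first array in a list assignment absorbs every value on the right-hand side and @filename is left empty. That is why scalar(@filename) reports 0. The URLs and filenames both end up in @complete_url, alternating, so you have everything in one array and need to split it into two afterwards.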
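A minimal sketch of that behaviour, using a made-up flat list in place of the real regex results (the URLs and filenames are placeholders, not data from the original page). It should report 4 elements in the first array and 0 in the second:

```perl
use strict;
use warnings;

# Hypothetical flat list, shaped like what a two-group regex returns
# under /g in list context: (url1, file1, url2, file2)
my @flat = ('http://example.com/a.mp3', 'a.mp3',
            'http://example.com/b.mp3', 'b.mp3');

# The first array on the left-hand side absorbs every value.
my (@complete_url, @filename) = @flat;

printf "complete_url has %d elements, filename has %d\n",
    scalar(@complete_url), scalar(@filename);
```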

      This. Your solution is fine, but you need to do a bit of postprocessing, e.g.
      if (@groups = /$test_regexp/g) {
          while (@groups) {
              ($url, $file) = splice @groups, 0, 2;
              # ...
          }
      }
      or use a loop:
      while (/$test_regexp/g) {
          ($url, $file) = ($1, $2);
          # ...
      }
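For completeness, here is a sketch of the loop form filling both arrays across several lines. The regex is a simplified stand-in for $test_regexp, and the host, path, and input lines are invented for illustration; only the two-capture-group shape is taken from the question.

```perl
use strict;
use warnings;

# Simplified stand-in for $test_regexp: group 1 is the whole URL,
# group 2 is just the filename. Host and path are made up.
my $rx = qr{url="(http://example\.com/podcasts/(news_\d{8}-\d{4}[a-z]\.mp3))"};

# Hypothetical input lines; the first one holds two matches.
my @lines = (
    'url="http://example.com/podcasts/news_20131221-0100a.mp3" x url="http://example.com/podcasts/news_20131221-0200b.mp3"',
    'nothing to see here',
    'url="http://example.com/podcasts/news_20131221-0300c.mp3"',
);

my (@complete_url, @filename);
for my $line (@lines) {
    # In scalar context, /g resumes after the previous match on each pass.
    while ($line =~ /$rx/g) {
        push @complete_url, $1;
        push @filename,     $2;
    }
}

printf "found %d matches\n", scalar @filename;
print "filename = $filename[$_], complete_url = $complete_url[$_]\n"
    for 0 .. $#filename;
```

Pushing inside the loop keeps the two arrays index-aligned, so $filename[$i] always belongs to $complete_url[$i].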
        Thanks, that also solves my problem.
      Thanks, this answers my original question. So in general, is it not possible to do a multi-array assignment in a single line, i.e.: (@a, @b) = <some_expression>
        "So in general, is it not possible to do a multi-array assignment in a single line, i.e.: (@a, @b) = <some_expression>"

        That's correct with the syntax you're using there; however, you can do it with references. Here's a rather contrived example to demonstrate.

        #!/usr/bin/env perl -l
        use strict;
        use warnings;

        my ($letters, $digits) = get_arrays();

        print "Letters REF: $letters";
        print "Letters: @$letters";
        print for @$letters;

        print "Digits REF: $digits";
        print "Digits: @$digits";
        print for @$digits;

        sub get_arrays {
            my @three_letters = qw{A B C};
            my @three_digits = qw{1 2 3};

            return (\@three_letters, \@three_digits);
        }

        Output:

        Letters REF: ARRAY(0x7ff684047ad0)
        Letters: A B C
        A
        B
        C
        Digits REF: ARRAY(0x7ff684047938)
        Digits: 1 2 3
        1
        2
        3

        If you're unfamiliar with references, a good place to start is "perlreftut - Mark's very short tutorial about references". In the "The Rest" section, you'll find links to more detailed documentation on this topic.
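Applied to the question in this thread, the same idea might look like the sketch below. The helper name split_pairs and the sample values are hypothetical; it assumes the input is a flat (url, file, url, file, ...) list such as the one a two-group regex returns under /g.

```perl
use strict;
use warnings;

# Hypothetical helper: walks a flat (url, file, url, file, ...) list and
# returns one array reference per column.
sub split_pairs {
    my @flat = @_;
    my (@urls, @files);
    # splice returns an empty list once @flat is exhausted, ending the loop.
    while (my ($url, $file) = splice @flat, 0, 2) {
        push @urls,  $url;
        push @files, $file;
    }
    return (\@urls, \@files);
}

my ($urls, $files) = split_pairs(
    'http://example.com/a.mp3', 'a.mp3',
    'http://example.com/b.mp3', 'b.mp3',
);
print "URLs:  @$urls\n";
print "Files: @$files\n";
```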

        -- Ken

Re: multiple matches per line *AND* multiple capture groups per match
by roboticus (Chancellor) on Dec 21, 2013 at 18:10 UTC

    Special_K:

    The URL part is always constant, so instead *just* capture the filenames, then build the URLs:

    if (@filename = ($_ =~ /$test_regexp/g)) {
        my @complete_urls = map { "http://...." . $_ } @filename;
        ...
    }

    Alternatively, capture the URL and split off the filenames:

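    if (@complete_urls = ($_ =~ /$test_regexp/g)) {
        # Copy each URL before stripping the path; s/// on $_ inside map
        # would otherwise modify the elements of @complete_urls themselves.
        my @filename = map { (my $file = $_) =~ s{^.*/}{}; $file } @complete_urls;
        ...
    }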

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: multiple matches per line *AND* multiple capture groups per match
by Laurent_R (Canon) on Dec 21, 2013 at 18:54 UTC
    Since you have so many slashes in your regex, I would suggest that you use some other character for delimiting your regex. This enables you to use slashes without escaping them and makes it much more readable. For example:
    $test_regexp = qr[url="(http://downloads\.bbc\.co\.uk/podcasts/worldservice/globalnews/ ... mp3))"];
      Wow, I didn't even know you could change the beginning and ending delimiter of a regexp. Is that only if you use the qr function, or can you change the delimiter in any context?
        You can do that with the quote and quote-like operators, but also, for regexes, with the m// and the s/// operators, which can be written, for example, m{...} and s[...]{...} or even m#...#, etc, as shown in the following Perl one-liners:
        $ perl -e 'print $1 if "foobar" =~ m{f(oo)ba}'
        oo
        $ perl -e 'print $1 if "foobar" =~ m#f(oo)ba#'
        oo
        Update: well, thinking again about what I wrote above, m// and s/// are in fact part of the quote and quote-like operators (so Ken's answer said it all), but I just wanted to point out that this can be done in direct regex constructs.
Re: multiple matches per line *AND* multiple capture groups per match
by AnomalousMonk (Bishop) on Dec 21, 2013 at 19:48 UTC

    I think I prefer roboticus's alternate suggestion to extract complete URLs first, then extract the filename from each URL, but if it absolutely must be done "in one line", this might serve (Perl version 5.10+ is needed for the state built-in, but this could be an ordinary my variable declared in the for-loop outside the if statement):

    >perl -wMstrict -le
    "use 5.010;
     ;;
     use List::MoreUtils qw(part);
     ;;
     my $rx = qr{ \b ([[:alpha:]]+ (\d+)) \b }xms;
     ;;
     for my $s (
       'foo abc333 bar de4444 baz fghi22 xyzzy jk123 z',
       'zzz123 xx yyyy12 xx xx1234',
       ) {
       if (my @matches = part { state $i = 0; $i++ % 2 } $s =~ m{ $rx }xmsg) {
         print qq{matched: full (@{$matches[0]}); digits (@{$matches[1]})};
         }
       else {
         print 'no matches';
         }
       }
    "
    matched: full (abc333 de4444 fghi22 jk123); digits (333 4444 22 123)
    matched: full (zzz123 yyyy12 xx1234); digits (123 12 1234)

    See List::MoreUtils::part.
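If List::MoreUtils is not available, roughly the same split can be sketched in core Perl with array slices over even and odd indices. This is an alternative to part, not what the post above uses; the sample string is shortened from the one in the post.

```perl
use strict;
use warnings;

my $rx = qr{ \b ([[:alpha:]]+ (\d+)) \b }xms;
my $s  = 'foo abc333 bar de4444 baz';

# Flat list: (full1, digits1, full2, digits2, ...)
my @flat = $s =~ m{$rx}g;

# Even indices hold group 1 (full token), odd indices hold group 2 (digits).
my @full   = @flat[ grep { $_ % 2 == 0 } 0 .. $#flat ];
my @digits = @flat[ grep { $_ % 2 == 1 } 0 .. $#flat ];

print "full (@full); digits (@digits)\n";
```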

Re: multiple matches per line *AND* multiple capture groups per match
by johngg (Canon) on Dec 21, 2013 at 19:39 UTC

    You can use a ternary to push onto either the complete URL array or the filename array. To save space I have simplified the URLs and pattern but the principle would still hold for your data. Things would get more complicated if your URLs broke across lines.

    $ perl -Mstrict -Mwarnings -MData::Dumper -e '
    open my $xmlFH, q{<}, \ <<EOF or die $!;
    blarg http://a.b.co.uk/path/to/file.mp3 bloop
    http://x.y.com/stuff.mp3 blooble
    http://some.firm.com/downloads/glooble.mp3 sploffle
    EOF

    my $rx = qr{(?x) ( http:// .*? ( [^/]+ \.mp3 ) ) };
    my( @comp, @fn );
    my $xmlText = do { local $/; <$xmlFH>; };

    push @{ $_ =~ m{^http://} ? \ @comp : \ @fn }, $_
        for $xmlText =~ m{$rx}g;

    print Data::Dumper->Dumpxs( [ \ @comp, \ @fn ], [ qw{ *comp *fn } ] );
    '
    @comp = (
              'http://a.b.co.uk/path/to/file.mp3',
              'http://x.y.com/stuff.mp3',
              'http://some.firm.com/downloads/glooble.mp3'
            );
    @fn = (
            'file.mp3',
            'stuff.mp3',
            'glooble.mp3'
          );
    $

    I hope this is helpful.

    Update: Corrected unescaped dot in regex and added (?x) extended syntax to space things out for readability.

    Cheers,

    JohnGG
