PerlMonks  

Help with placing the matches of a regex into an Array

by kereekerra (Initiate)
on Apr 28, 2012 at 15:04 UTC (#967819=perlquestion)
kereekerra has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to do some screen scraping. I'm using this regex to capture the variable portion of the URLs. However in my output, I only have the first match. My error messages also indicate that I'm only capturing one entry when I should be capturing about forty.

@game_array = ($gamepage =~ m/onclick="document\.location\.href='([.]+)'"/g);

Update: The data I'm trying to scrape is the URLs of pages for specific SC2 replays from the site "http://www.sc2rep.com/". I'm trying to scrape the individual game pages and output them to a file for use with a DBI script. On closer inspection, the match I'm getting is incorrect. Here is a more complete script. Sorry about the incomplete information.

#!/usr/bin/perl -w
use strict;
use DBI;
use Data::Dumper;

my $page = "http://sc2rep.com/";  # URL of page with list of game pages to scrape, must be from sc2rep.com
my $index = 0;
my $gamepage = `curl $page`;
my $game_data;
my @game_array;
my $counter;

@game_array = ($gamepage =~ m/onclick="document\.location\.href='([.]+)'"/g);
open (OUT, ">sc2data") or die $!;
for($index<40)
{
    $game_data = `curl "http://sc2rep.com$game_array[$index]"`;
    print OUT "$game_data" . "\n end_replay \n";
    print "looped successfully.\n";
    $index++;
}
close OUT;
exit;

The fix by stevieb made the code work. Removing the bracket made it work like a charm. Thank you all for your help and time.

Re: Help with placing the matches of a regex into an Array
by LanX (Canon) on Apr 28, 2012 at 15:51 UTC
    the perl-code is ok,
    DB<100> $gamepage=' a="1" b="2" '
     => " a=\"1\" b=\"2\" "
    DB<101> @g=($gamepage =~ /\b\w="(\d)"/g )
     => (1, 2)

    I suppose you're wrong about the structure of your input data. Try simplifying your regex until it matches and you will see what you missed (maybe ' instead of "?).

    Cheers Rolf

Re: Help with placing the matches of a regex into an Array
by stevieb (Hermit) on Apr 28, 2012 at 15:53 UTC

    I can make the match work by removing the square brackets (character class):

    while ( my $link = <DATA> ){
        $link =~ m/onclick="document\.location\.href='(.+)'"/g;
        say $1;
    }
    __DATA__
    onclick="document.location.href='hello'"
    onclick="document.location.href='there'"
    onclick="document.location.href='world'"
    __END__

    % ./match.pl
    hello
    there
    world
      true! putting a dot into a character class [.] is like escaping it \.

      But why did the OP say that he got one match?

      Cheers Rolf

        I don't know. There is not enough context, nor any sample data. For all I know, the @game_array could have been previously populated with something somewhere else. :)
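The point about [.] in this subthread can be checked with a small sketch. The sample string below is a hypothetical stand-in for one row of the real page, which is not shown in the thread, and the shortened pattern is only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the real sc2rep.com page.
my $html = q{onclick="document.location.href='/replay/1'"};

# Inside a character class the dot is literal, so [.]+ only matches
# a run of "." characters and this capture finds nothing.
my @literal = $html =~ /href='([.]+)'/g;

# Outside a class, . matches any character except newline.
my @any = $html =~ /href='(.+)'/g;

printf "[.]+ captured %d item(s)\n", scalar @literal;
printf ".+   captured %d item(s): %s\n", scalar @any, "@any";
```

With this input the first pattern should capture nothing and the second should capture the path, which is consistent with the OP's report that removing the brackets fixed the script.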

Re: Help with placing the matches of a regex into an Array
by ww (Bishop) on Apr 28, 2012 at 17:05 UTC
    A sample of your data and verbatim reports on your messages (are you sure they're 'error' messages, rather than warnings?) would be a great help in finding a way to help you.

    But, that said, this is a llama-level primer-version of an approach to your titled issue:

    #!/usr/bin/perl
    use 5.014;

    my @matches;
    my @arr = qw/a1 ab2 aab3 ac4 dc5 bc6 da7 cb8 dbca9 aa0/;

    for $_( @arr ) {
        if ($_ =~ m/a/g) {
            say "\t regex matches $_";
            push @matches, $_;
        }
    }

    for my $match( @matches) {
        say $match;
    }

    Note that this pushes each match onto the array rather than repeatedly replacing the array's contents, which would preserve only the last match. That's something which you might have spotted in the course of writing your question, had you provided the data.

    For the reasoning behind this, see Re: a bit of monastery zen.

    Update: One and all may consider this another illustration of the danger another Monk noted recently; the danger of keeping a tab open too long (long enough to answer a false alarm, in this case).
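The push approach can also be applied to captures from a single scalar, which is closer to the OP's situation. This is a minimal sketch: the `$page` contents and the shortened pattern are assumptions made for illustration, not the real markup.

```perl
use strict;
use warnings;

# Hypothetical input: several links in one scalar, roughly as curl
# might return them. The real page structure is an assumption here.
my $page = join "\n",
    q{<tr onclick="document.location.href='/a'">},
    q{<tr onclick="document.location.href='/b'">},
    q{<tr onclick="document.location.href='/c'">};

my @links;

# In scalar context, a /g match resumes where the previous one ended,
# so each pass through the loop sees the next link.
while ( $page =~ /href='([^']+)'/g ) {
    push @links, $1;
}

print "$_\n" for @links;
```

Assigning the same match in list context, as the OP's script does, should yield the same three captures in one step. The loop form is mainly useful when each match needs extra processing.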

Re: Help with placing the matches of a regex into an Array
by JavaFan (Canon) on Apr 28, 2012 at 20:06 UTC
    Replacing [.]+ with .+ appears to work. For now. But considering that you have a trailing /g, you are expecting multiple matches on a line.

    Your pattern will never match more than once on a line, due to the greedy .+. I suggest using

    /onclick="document\.location\.href='([^']+)'"/g
    as your pattern match.
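A short sketch of the difference, assuming two links on one line. The `$line` value is made up for illustration. The two patterns are the ones discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical line holding two links (not real data from the site).
my $line = q{<a onclick="document.location.href='/x'"> }
         . q{<a onclick="document.location.href='/y'">};

# Greedy .+ runs to the end of the line and backtracks only as far as
# the LAST '" it can find, so both links end up in one capture.
my @greedy = $line =~ /onclick="document\.location\.href='(.+)'"/g;

# [^']+ cannot cross a single quote, so each link is captured separately.
my @negated = $line =~ /onclick="document\.location\.href='([^']+)'"/g;

printf "greedy:  %d capture(s)\n", scalar @greedy;
printf "negated: %d capture(s): %s\n", scalar @negated, "@negated";
```

On this input the greedy pattern should return a single capture that starts at /x and ends at /y, while the negated class returns the two paths.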

      ahhhh, of course. Thanks JavaFan. In my example above, the /g isn't needed due to the fact I was reading a line at a time, but kept it in as I (correctly) assumed the OP would have multiple links in a scalar glob. Since I didn't test it as such though, I missed this bug.

      I'm far from an expert in regexes, so I'm wondering if using the non-greedy operator essentially does the same thing in this case... or is there possibly something else I am missing that you see?

      /onclick="document\.location\.href='(.+?)'"/g
        It depends on the data -- if the data is incorrectly formatted (a missing ' for instance), they may not do the same thing. And '.+?' has the tendency to be slower than '[^']+'. Usually, the more restrictive a pattern is, the faster: there's less opportunity to backtrack. There's much more commitment in '[^']+' than there is in '.+?': the former will always match two quotes in succession, and whatever is in between, regardless of what the rest of the pattern looks like, but that's not the case with '.+?'; there's nothing stopping the .+? part from matching quotes.
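The malformed-data case can be sketched as follows. The input is hypothetical: the first link is deliberately missing its closing single quote.

```perl
use strict;
use warnings;

# Hypothetical malformed input: the first href lacks its closing '.
my $data = q{onclick="document.location.href='/x" }
         . q{onclick="document.location.href='/y'"};

# Lazy .+? may consume quotes, so it stretches from the broken first
# link up to the first '" it can reach, swallowing the second link.
my @lazy = $data =~ /onclick="document\.location\.href='(.+?)'"/g;

# [^']+ refuses to cross a quote, so the broken link fails to match
# and only the well-formed one is captured.
my @negated = $data =~ /onclick="document\.location\.href='([^']+)'"/g;

print "lazy:    $_\n" for @lazy;
print "negated: $_\n" for @negated;
```

Here the lazy pattern should produce one capture containing junk from both links, whereas the negated class should skip the broken link and return only /y.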
Re: Help with placing the matches of a regex into an Array
by stevieb (Hermit) on Apr 29, 2012 at 00:00 UTC

    Huge props to the OP kereekerra, a brand new member who formatted their question wonderfully, and also had the presence of mind to put changes to their post in an 'Update:' instead of stripping the context from the original post by just changing it.

    Cheers!
