regex only matching from last match

by Foxpond Hollow (Sexton)
on Sep 19, 2009 at 23:38 UTC (#796325=perlquestion)

Foxpond Hollow has asked for the wisdom of the Perl Monks concerning the following question:

I was having a bear of a time recently trying to figure out why my regex wasn't matching correctly, until I finally tried changing the order in which my regexes ran. It turns out that each regex was only searching from wherever the previous one happened to leave off.

Here's the text it should be matching in:

    <!-- filename: full-000-body-cdl90 -->
    <tr>
    <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;020</strong></td>
    <td class="contentSmall" valign="top">|a 9780470086223 (hardback)</td>
    </tr>
    <!-- end: full-000-body-cdl90 -->
    <!-- filename: full-000-body-cdl90 -->
    <tr>
    <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;24510</strong></td>
    <td class="contentSmall" valign="top">|a Heads in the sand : |b how the Republicans screw up foreign policy and foreign policy screws up the Democrats / |c Matthew Yglesias</td>
    </tr>
    <!-- end: full-000-body-cdl90 -->
    <!-- filename: full-000-body-cdl90 -->
    <tr>
    <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;24610</strong></td>
    <td class="contentSmall" valign="top">|a How the Republicans screw up foreign policy and foreign policy screws up the Democrats</td>
    </tr>
    <!-- end: full-000-body-cdl90 -->
    <!-- filename: full-000-body-cdl90 -->
    <tr>
    <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;61020</strong></td>
    <td class="contentSmall" valign="top">|a Democratic Party (U.S.)</td>
    </tr>
    <!-- end: full-000-body-cdl90 -->

And here are the two regexes:

    if ($MARC_page =~ m{
            (?:020<)?    # MARC code followed by a bracket to identify
            .*?          # followed by anything
            \|a\s        # followed by a pipe and the subfield
            (\d{13})     # followed by a 13-digit ISBN code
        }xmgs) {
        my $isbn = $1;
    }

    if ($MARC_page =~ m{
            245\d{0,2}   # MARC code 245 followed by 0-2 indicators
            .*?          # followed by anything
            \|a\s        # followed by a pipe and the subfield
            (.*?)        # followed by the title
            \|           # followed by a pipe and the next subfield
        }xmgs) {
        my $title = $1;
    }

It works correctly now that I have rearranged the regexes. Before, when the ISBN regex came after the title regex, it would not match anything. I tried changing \d to just . to see where it would even land, and it matched "Democratic Pa", which would have been the next match after the point where the title regex matched. For the record, the correct matches are "9780470086223" for the ISBN and "Heads in the sand : " for the title.

As far as I'm aware, a regex with the g flag should match globally, meaning it would ignore wherever another regex happened to stop searching. Is this not correct? If I am right, can someone tell me why I'm seeing this behavior, and how I might correct it? Thanks a lot.

p.s. this is just a random example book and I don't mean to make any political statements by its use

Replies are listed 'Best First'.
Re: regex only matching from last match
by jwkrahn (Monsignor) on Sep 20, 2009 at 00:54 UTC

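    The match operator with the /g option returns all of the matches if it is used in list context. However, when used in scalar context it finds one match per call, and each call resumes from where the previous successful match on that string left off (the position reported by pos()). That position belongs to the string, not to the regex, which is what you are experiencing.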
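
    A minimal sketch of that behaviour, for illustration only. The sample string and the variable names are made up here and are not taken from the original post:

```perl
use strict;
use warnings;

# Hypothetical sample data, much simpler than the MARC page in the question.
my $str = 'isbn=111 title=abc isbn=222';

# List context: /g returns every capture in one go.
my @all = $str =~ m{isbn=(\d+)}g;

# Scalar context: a successful /g match stores its end offset in pos($str).
my $title;
$title = $1 if $str =~ m{title=(\w+)}g;
my $pos_after_title = pos($str);

# A different regex on the same string resumes from that offset,
# so it skips the earlier isbn=111 and finds isbn=222 instead.
my $isbn_after;
$isbn_after = $1 if $str =~ m{isbn=(\d+)}g;

print "all: @all\n";                                  # all: 111 222
print "title: $title (pos $pos_after_title)\n";       # title: abc (pos 18)
print "isbn after title: $isbn_after\n";              # isbn after title: 222
```

    The last line is the same effect as in the question: the second regex is not broken, it simply starts searching after the point where the first one finished.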

      Further to jwkrahn's reply, consider:
          >perl -wMstrict -le
          "my $str = '123 456 789';
           print qq{1st match: $1} if $str =~ m{ (\d{3}) }xmsg;
           print qq{2nd match: $1} if $str =~ m{ (\d{3}) }xmsg;
           print qq{3rd match: $1} if $str =~ m{ (\d{3}) }xmsg;
          "
          1st match: 123
          2nd match: 456
          3rd match: 789
      Try those matches without the //g modifier.
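
      As a rough sketch of what that experiment shows, and of one alternative (resetting pos() by hand, which is an addition here and not something suggested in the thread), assuming the same sample string:

```perl
use strict;
use warnings;

my $str = '123 456 789';

# Without /g, every match starts from the beginning of the string.
my @no_g;
for (1 .. 3) {
    push @no_g, $1 if $str =~ m{ (\d{3}) }xms;
}

# With /g in scalar context, clearing pos() rewinds the search,
# so the next match starts from the beginning again.
my @with_reset;
for (1 .. 3) {
    push @with_reset, $1 if $str =~ m{ (\d{3}) }xmsg;
    pos($str) = undef;    # forget where the last match ended
}

print "no /g:     @no_g\n";          # no /g:     123 123 123
print "pos reset: @with_reset\n";    # pos reset: 123 123 123
```

      For the two independent searches in the question, dropping /g is probably the simpler of the two options, since neither regex needs to iterate.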
        Thanks AnomalousMonk and jwkrahn, this explained it perfectly.
Re: regex only matching from last match
by Anonymous Monk on Sep 20, 2009 at 00:35 UTC
Re: regex only matching from last match
by Marshall (Canon) on Sep 20, 2009 at 22:32 UTC
    Here is a bit of a different approach to the problem.

    I think that you can do this with line by line processing rather than "slurping all lines into a scalar" and getting "fancy" with the regex.

    I don't know what a MARC code is. But the basic idea here is that we find an ISBN number, then we find the title. Then the search goes on for the next ISBN number, skipping all records in between. The ISBN number is easy to find, and if this MARC code makes a difference, you will see how to adapt the regexes below.

    This type of approach only looks at each input line once. There is no need to put all of the HTML text into one variable. HTML can have arbitrarily long lines and is typically not formatted for easy reading by humans - that is an advantage when we parse it!

    The code below uses one of my favorite tricks, NOT using $1, etc!
    (my $ISBN = (m/\|a\s+(\d{13})/)[0]); uses a list slice to assign $ISBN right away, eliminating the need for $1.

        #!/usr/bin/perl -w
        use strict;

        my %Isbn2Title;

        while (<DATA>)
        {
            next unless (my $ISBN = (m/\|a\s+(\d{13})/)[0]);
            my $title = get_title();
            $title =~s/\s+$//;  #delete trailing whitespace
            ${Isbn2Title}{$ISBN}=$title;
        }

        foreach my $isbn (sort keys %Isbn2Title)
        {
            print "ISBN=$isbn title=$Isbn2Title{$isbn}\n";
        }
        #Prints: ISBN=9780470086223 title=Heads in the sand

        sub get_title
        {
            while (<DATA>)
            {
                my $title = (m/\|a\s+(.*?)[\|:]/)[0];
                return $title if $title;
            }
        }

        __DATA__
        <!-- filename: full-000-body-cdl90 -->
        <tr>
        <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;020</strong></td>
        <td class="contentSmall" valign="top">|a 9780470086223 (hardback)</td>
        </tr>
        <!-- end: full-000-body-cdl90 -->
        <!-- filename: full-000-body-cdl90 -->
        <tr>
        <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;24510</strong></td>
        <td class="contentSmall" valign="top">|a Heads in the sand : |b how the Republicans screw up foreign policy and foreign policy screws up the Democrats / |c Matthew Yglesias</td>
        </tr>
        <!-- end: full-000-body-cdl90 -->
        <!-- filename: full-000-body-cdl90 -->
        <tr>
        <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;24610</strong></td>
        <td class="contentSmall" valign="top">|a How the Republicans screw up foreign policy and foreign policy screws up the Democrats</td>
        </tr>
        <!-- end: full-000-body-cdl90 -->
        <!-- filename: full-000-body-cdl90 -->
        <tr>
        <td class="contentSmall" valign="top" id=bold width="5%" nowrap><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;61020</strong></td>
        <td class="contentSmall" valign="top">|a Democratic Party (U.S.)</td>
        </tr>
        <!-- end: full-000-body-cdl90 -->
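
    For comparison, the list-slice capture can also be written as a plain list assignment. The sketch below is illustrative only: the variable names are invented, and the sample line is one of the rows from the question.

```perl
use strict;
use warnings;

my $line = '<td class="contentSmall" valign="top">|a 9780470086223 (hardback)</td>';

# List slice: take element 0 of the list of captures.
my $isbn_slice = ($line =~ m/\|a\s+(\d{13})/)[0];

# List assignment: parentheses on the left put the match in list context,
# so the first capture lands directly in the variable.
my ($isbn_list) = $line =~ m/\|a\s+(\d{13})/;

# On a failed match both forms leave the variable undefined.
my ($missing) = 'no subfield here' =~ m/\|a\s+(\d{13})/;

print "slice: $isbn_slice\n";                                # slice: 9780470086223
print "list:  $isbn_list\n";                                 # list:  9780470086223
print "missing is ", (defined $missing ? 'set' : 'undef'), "\n";   # missing is undef
```

    Neither form uses /g, so neither one depends on or changes pos(); each match starts from the beginning of the line.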
      Thanks. The pages I'm working with have a lot of really awfully formatted HTML that ends up being quite long, so this is probably a good workaround that saves me from loading it all into memory. Also, I like your trick for getting rid of $1; I hate using the regex variables.

Node Type: perlquestion [id://796325]
Approved by Bloodnok
Front-paged by tye