PerlMonks  

Matching set of paragraph tags with string inside.

by the_0ne (Pilgrim)
on Feb 08, 2008 at 20:45 UTC (#667062=perlquestion)
the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

I just can't seem to get these look-ahead/look-behind assertions down pat like some of you. I have this code...

$contents = "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>";
if ($contents =~ /<p>(.*?\[tab\].*?)<\/p>/) {
    print "yes: ${1}\n";
} else {
    print "no\n";
}
if ($contents =~ /<p>(?=\[tab\])<\/p/) {
    print "yes: ${1}\n";
} else {
    print "no\n";
}

What I want here is to say only match the set of paragraph markers with the actual string [tab] inside.

Notice the first one DOES match, however, since I am using .*? the first para marker will match and then the .* will take over and match all the way to the close para marker. I want the first para marker to fail because there is no [tab] inside the paragraph markers.

The second one is a look-ahead, but notice the [tab] would have to be immediately after the para marker. I can't figure out how to tell it the possibility of some text, then the [tab], then the possibility of more text.

Thanks for any assistance you may provide...

Replies are listed 'Best First'.
Re: Matching set of paragraph tags with string inside.
by Roy Johnson (Monsignor) on Feb 08, 2008 at 21:13 UTC
    my $re = qr/
        <p>                 # paragraph-open
        (                   # Capture
          (?:               # group
            (?!<\/p>)       # Make sure we aren't at a paragraph-close
            (?!\[tab\])     # and we're not at [tab]
            .               # consume a char
          )*                # any number of times
          \[tab\]           # consume [tab] (hooray!)
          .*?               # anything up to paragraph-close
        )
        <\/p>
    /x;
    Update: to not capture paragraph tags

    Caution: Contents may have been coded under pressure.
      Thanks a lot for the code and explanation. I think this is exactly what I need. I'll mess with it some more and see if I can break it.
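
A minimal sketch of how the pattern above might be used, assuming you want every matching paragraph rather than just the first. The sample string (which adds a second [tab] paragraph) and the name @hits are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Sample input assumed for illustration; it contains two [tab] paragraphs.
my $contents = '<p>no tabs here</p><p>column 1[tab]column 2</p>'
             . '<p>a[tab]b</p><p>no tabs here</p>';

my $re = qr/
    <p>
    (
      (?: (?!<\/p>) (?!\[tab\]) . )*   # chars that start neither a close tag nor [tab]
      \[tab\]                          # the literal marker
      .*?                              # lazily up to the nearest close tag
    )
    <\/p>
/x;

# In list context with the g modifier, the match returns every capture.
my @hits = $contents =~ /$re/g;
print "$_\n" for @hits;
```

Because the inner group refuses to step over a paragraph-close, a paragraph without [tab] fails on its own instead of swallowing the next one.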
Re: Matching set of paragraph tags with string inside.
by Tanktalus (Canon) on Feb 08, 2008 at 21:10 UTC

    If your HTML is actually XHTML-compliant, you could use XML::Twig to parse it, and then do something like this:

    my @tagged_paragraphs = $twig->get_xpath('//p[string()=~/\[tag\]/]');
    my @texts = map { $_->text() } @tagged_paragraphs;
    Note that if you have p's in p's (e.g., "<p>some text<p>inner [tag] stuff</p>outer</p>"), this may give you problems (you'll get both "some textinner [tag] stuffouter" and "inner [tag] stuff", I believe).

Re: Matching set of paragraph tags with string inside.
by jepri (Parson) on Feb 08, 2008 at 21:11 UTC
    I try to avoid getting too tricky with regexes, because I am a bear of little brain. I'd break the string up then inspect it in pieces, perhaps like this:

    my @arr = split /<\/p>/, $string;
    my @matches = grep { /\[tab\]/ } @arr;

    one extra line but much easier for me. If I was concerned about it looking neat, I'd put the code inside a subroutine called match_tabs().
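
As a rough sketch, the match_tabs() subroutine mentioned above could look like the following. The body is an assumption built from the two lines shown. It also strips the leading <p>, which the original two-liner does not do.

```perl
use strict;
use warnings;

# Hypothetical helper named after the suggestion above: returns the text of
# each paragraph that contains the literal string [tab].
sub match_tabs {
    my ($string) = @_;
    my @pieces  = split /<\/p>/, $string;       # split consumes each close tag
    my @matches = grep { /\[tab\]/ } @pieces;   # keep only pieces with [tab]
    s/^<p>// for @matches;                      # strip the leading open tag
    return @matches;
}

my $contents = "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>";
print "$_\n" for match_tabs($contents);
```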

    ___________________
    Jeremy
    I didn't believe in evil until I dated it.

      One minor difference is that splitting this way removes the close-paragraph, but not the open-paragraph, so you need to remove it.
      my @arr = grep {/\[tab\]/ and s/^<p>//} split /<\/p>/, $contents;
      If you wanted to retain both tags, you could do so by splitting on a lookbehind expression:
      my @arr = grep /\[tab\]/, split /(?<=<\/p>)/, $contents;
      You'd also retain the paragraph-close if you used it for $/ and read the string as a file.
      {
          local $/ = '</p>';
          open (STR, '<', \$contents) or die "Opening string: $!\n";
          @arr = grep /^<p>/ && /\[tab\]/, <STR>;
          print "read $_\n" for @arr;
      }
      But now I'm just getting silly.
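
For comparison, here is a small sketch that runs the plain split and the lookbehind split on the sample string from the question. The variable names @bare and @tagged are made up for this illustration.

```perl
use strict;
use warnings;

my $contents = "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>";

# Plain split: the close tag is consumed, and the open tag is stripped by hand.
my @bare = grep { /\[tab\]/ and s/^<p>// } split /<\/p>/, $contents;

# Lookbehind split: the separator is zero-width, so both tags stay attached.
my @tagged = grep { /\[tab\]/ } split /(?<=<\/p>)/, $contents;

print "bare:   $_\n" for @bare;
print "tagged: $_\n" for @tagged;
```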

      Caution: Contents may have been coded under pressure.
        You seem to be a bit focussed on using big regexes :P

        my @matches = map { s/<p>//; $_ } grep { /\[tab\]/ } split /<\/p>/, $string;

        ___________________
        Jeremy
        I didn't believe in evil until I dated it.

      hmmm, actually makes sense. Luckily most of the time I will be looking exactly for what I originally showed in my example. This should work ok. I'll have to mess with it and see if there's anything I didn't think of that may pop up.

      Thanks.
Re: Matching set of paragraph tags with string inside.
by moritz (Cardinal) on Feb 08, 2008 at 20:54 UTC
    In the second example you're trying to match <p></p> directly, with an additional assertion. Try /<p>(?=.*?\[tab\]).*?<\/p>/ instead. But that doesn't check whether the [tab] occurs before the </p>, so don't try to mess with lookarounds; use your first pattern.

    Update: OK, that doesn't fix your real problem. I'll have to think a bit more about it. In this simple case you can just use [^<>] instead of a dot everywhere between <p> and </p>.

    The old truth that HTML shouldn't be parsed with regexes still holds.

      My problem with the first match is it matches too much...
      original: "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>"
      result of first regex: no tabs here</p><p>column 1[tab]column 2
      What I want it to match is...
      column 1[tab]column 2
      Since that is the set of para's with the string [tab].
        My first shot was too fast, here's a working solution:
        my $contents = "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>";
        if ($contents =~ m{(<p>[^<>]*\[tab\][^<>]*</p>)} ){
            print "Matched '$1'\n";
        }
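
A possible variation on the character-class approach, sketched under the assumption that paragraphs never contain nested markup: moving the capture inside the tags returns only the paragraph text, and the g modifier collects every hit. The name @inner is an assumption for this example.

```perl
use strict;
use warnings;

my $contents = "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>";

# [^<>] cannot cross a tag boundary, so each match stays inside one paragraph.
# Assumption: paragraphs contain no nested markup such as <b>...</b>.
my @inner = $contents =~ m{<p>([^<>]*\[tab\][^<>]*)</p>}g;

print "Matched '$_'\n" for @inner;
```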

      I understand your concern of parsing html with regexes, but this string will never be a full set of html. It will have some html (of course with the para's) but using a parser would be overkill for this situation.

Node Type: perlquestion [id://667062]
Approved by kyle