regex: something...(!something)...something

on Jul 17, 2008 at 17:09 UTC

Hi! Regexes in Perl seem to be very powerful. But the thing I can't work out how to do is: match some part of text which starts with "somebegin", ends with "someend", and must NOT contain some text or regex snippet inside the block...

Maybe the solution involves (?!...) blocks, but this is the part of regexes I don't fully understand.

Can you give me some example or explanation of what I need? Thanks!)

Added: here is a code snippet that contains the bug - maybe it will be easier for you just to find the errors
$x = "<tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>";
$x =~ s{<td>(?!<td>)3(?!</td>)</td>}{<td>($1)a($2)</td>};
print $x;
# output must be: <tr><td>1</td><td>2</td><td><font>a</font></td><td>4</td><tr>
Added: I'm having such troubles not only with HTML, so please do not recommend HTML Parsers..

$text =~ s/(<td>)((?:(?!<td>).)*)3((?:(?!<\/td>).)*)/$1$2a$3/;
OR
$text =~ s{(?<=<td>)((?:(?!<td>).)*)3((?:(?!</td>).)*)(?=</td>)}{$1a$2};
Thanks to all the good people for the help )

Re: regex: something...(!something)...something
by Roy Johnson (Monsignor) on Jul 17, 2008 at 19:41 UTC
    My tutorial on look-ahead and look-behind covers this. You're specifying that the open tag is not followed immediately by another open tag; you want to specify that it's followed by any arbitrary text that isn't an open tag. This s/// expression should do what you want:
    $x =~ s{(?<=<td>)((?:(?!<td>).)*)3((?:(?!</td>).)*)(?=</td>)}{$1a$2};
    Update: I should have broken this down in commented style (I also note that I could have done lookahead for everything after the 3):
    $x =~ s{(?<=<td>)            # Start match with open-tag
            (                    # Capture
              (?:(?!<td>).)*     # Any number of characters that do not start an open-tag
            )3                   # Close capture; match literal 3
            (?=                  # Look ahead to match
              (?:(?!</td>).)*    # Any number of characters that do not start a close-tag
              </td>              # then a close-tag
            )}                   # End lookahead and pattern
           {$1a}x;
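To compare the two styles of solution in this thread side by side, here is a small sketch. The sub names swap_capture and swap_lookaround are made up for illustration, and the capturing variant is written so that the open tag is put back exactly once; treat it as one possible way to run the idea, not as the posters' exact code.

```perl
use strict;
use warnings;

# Capturing version: the open tag is part of the match, so it is captured
# as $1 and put back by the replacement.
sub swap_capture {
    my ($text) = @_;
    $text =~ s{(<td>)((?:(?!<td>).)*)3((?:(?!</td>).)*)}{$1$2a$3};
    return $text;
}

# Lookaround version: the tags are only asserted, never consumed, so the
# replacement does not need to mention them at all.
sub swap_lookaround {
    my ($text) = @_;
    $text =~ s{(?<=<td>)((?:(?!<td>).)*)3((?:(?!</td>).)*)(?=</td>)}{$1a$2};
    return $text;
}

my $row = '<tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>';
print swap_capture($row), "\n";
print swap_lookaround($row), "\n";
```

Both calls should leave the first two cells alone and change only the 3 inside the third cell.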

      Thank you! It also works (I see it is more complex than olus's regex, but probably more universal). I'll read your tutorial and will try never again to ask stupid questions about regexes :)
Re: regex: something...(!something)...something
by GrandFather (Sage) on Jul 17, 2008 at 21:09 UTC

    For markup use the appropriate parsing module. That will save you a pile of time, and substantive questions won't get clouded by the "but you should be using a module" answer - you've preempted that answer and can move on to real issues.

    So, let's look at the real issue without the distraction of HTML, XML, etc. Consider:

    use strict;
    use warnings;

    my $match1  = 'something ';
    my $fill    = 'xxxxxx ';
    my $match2  = 'somethingelse ';
    my $nomatch = 'nomatch ';

    my @targets = (
        "$nomatch$fill$match1$match2$nomatch",
        "$nomatch$fill$match1$fill$match2$nomatch",
        "$nomatch$fill$match1$fill$nomatch$fill$match2$nomatch",
    );

    for my $test (@targets) {
        if ($test =~ /$match1 ((?:(?!$nomatch) .)*) $match2/x) {
            print "Matched >$1< for: $test\n";
        }
        else {
            print "Failure for $test\n";
        }
    }


    Matched > < for: nomatch xxxxxx something somethingelse nomatch
    Matched > xxxxxx < for: nomatch xxxxxx something xxxxxx somethingelse nomatch
    Failure for nomatch xxxxxx something xxxxxx nomatch xxxxxx somethingelse nomatch

    The trick is the (?:(?!$nomatch) .)* which will only match a character if it is not the start of the nomatch criteria.
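The same trick can be wrapped into a small helper so that the begin, end and forbidden strings are passed in as plain text. This is only a sketch: block_regex is a hypothetical name, \Q...\E is used on the assumption that all three strings should be matched literally, and the lazy *? is a choice made here so the match stops at the nearest end marker.

```perl
use strict;
use warnings;

# Build "begin ... end" where the middle may not contain $forbidden.
# \Q...\E quotes the strings so that regex metacharacters in them are literal.
sub block_regex {
    my ($begin, $end, $forbidden) = @_;
    return qr/\Q$begin\E((?:(?!\Q$forbidden\E).)*?)\Q$end\E/s;
}

my $re = block_regex('somebegin', 'someend', 'nomatch');
for my $test ('somebegin xxx someend', 'somebegin nomatch someend') {
    if ($test =~ $re) { print "matched >$1< in: $test\n" }
    else              { print "no match in: $test\n" }
}
```

If the forbidden part is itself a regex rather than a literal string, the \Q...\E around $forbidden would have to go.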

      I'm beginning to understand the idea of the trick) Thank you.
      In my case the regex must be more complex; I'll soon post the solution in my first post.
Re: regex: something...(!something)...something
by olus (Curate) on Jul 17, 2008 at 17:31 UTC

    You could use negative lookahead. An example:

    use strict;
    use warnings;

    chomp(my @lines = <DATA>);    # drop newlines so the output stays on one line
    foreach my $data (@lines) {
        if ($data =~ /sometext (?!not)\w* endtext/) {
            print "$data passed \n";
        }
    }
    __DATA__
    sometext not endtext
    sometext positive endtext


    sometext positive endtext passed
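One thing worth seeing here: a single (?!not) is tested only at the position where it stands, so it rejects words that begin with "not" but says nothing about the rest of the word. The sketch below (the sub names and the test word "knot" are made up for illustration) compares it with a lookahead that is repeated before every character.

```perl
use strict;
use warnings;

# One lookahead: checked once, at the first character of the word.
sub single_check   { return $_[0] =~ /sometext (?!not)\w* endtext/ ? 1 : 0 }

# Lookahead repeated before every character of the word.
sub repeated_check { return $_[0] =~ /sometext (?:(?!not)\w)* endtext/ ? 1 : 0 }

for my $line ('sometext not endtext',
              'sometext knot endtext',
              'sometext positive endtext') {
    printf "%-28s single=%d repeated=%d\n",
        $line, single_check($line), repeated_check($line);
}
```

The word "knot" passes the single check because the lookahead is only tested in front of the "k"; the repeated form catches the "not" inside it.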
      Thanks, I see it works, but only for simple cases... please look above (I gave a snippet of my code with a bug inside) - maybe you will be able to correct it.
Re: regex: something...(!something)...something
by olus (Curate) on Jul 17, 2008 at 18:35 UTC

    man, you are hard to please

    use strict;
    use warnings;

    my $text = "<tr><td>1</td><td>2</td><td>qw<font>3</font></td><td>4</td><tr>";
    $text =~ s/<td>((?:(?!<td>).)*)3((?:(?!<\/td>).)*)/$1a$2/;
    print "$text";


      $text =~ s/(<td>)((?:(?!<td>).)*)3((?:(?!<\/td>).)*)/$1$2a$3/;
      Yes! Thank you! It is what I wanted so much :) *I corrected your code a bit, and it seems to work well*
Re: regex: something...(!something)...something
by pileofrogs (Priest) on Jul 17, 2008 at 17:24 UTC

    I think you've basically got it right.


    Should work. The only complexity involves lookahead vs. backtracking, and because you have stuff both before and after the stuff you don't want, that won't matter (in terms of the truth of the statement; I have no idea about the efficiency).

    Personally, if I'm confused by something like this I opt for the slow but readable...

    if ( /^somebegin(.*)someend$/ ) {
        my $middle = $1;
        if ( $middle !~ /^something$/ ) {
            # woot
        }
    }
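The two-step idea can also be packaged as a sub. This is a sketch under one assumption: that "not inside the block" means the forbidden text may not occur anywhere in the middle, so a plain substring search with index is used instead of the anchored /^something$/ above. The name middle_without is made up.

```perl
use strict;
use warnings;

# Step 1: grab what lies between the markers.
# Step 2: reject it if the forbidden text occurs anywhere inside.
sub middle_without {
    my ($text, $forbidden) = @_;
    return undef unless $text =~ /^somebegin(.*)someend$/s;
    my $middle = $1;
    return undef if index($middle, $forbidden) >= 0;    # plain substring search
    return $middle;
}

for my $test ('somebegin abc someend', 'somebegin a something b someend') {
    my $middle = middle_without($test, 'something');
    print defined $middle ? "ok >$middle<\n" : "rejected: $test\n";
}
```

It is slower to write than a single regex but each step can be read and tested on its own.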
      Hey! It is really not flexible. I'll give a more complex question: there is some HTML.
      <tr><td>1</td><td>2</td><td>3</td><td>4</td><tr>
      OR with <font> - the point is that it may be absent, or there can be something else
      <tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>
      I need to do something like this:
      replace <td>..(!<td>)..3..(!</td>)..</td>
      with <td>(everything that was in the left middle before 3)TEXT(everything that was in the right middle after 3)</td>
      Hope you understood me..

        For this, I'd use instead:

        use strict;
        use warnings;

        my $text = "<tr><td>1</td><td>2</td><td>qw<font>3</font></td><td>4</td><tr>";
        my @blocks = split /<td>(.*?)<\/td>/, $text;
        foreach my $block (@blocks) {
            if ($block =~ /3/) {
                print $block."\n";
            }
        }



        But you should consider one of the many HTML parser modules.
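If the goal is to change the text rather than just print the matching block, the same "one cell at a time" idea can be done with s///ge, where the replacement is Perl code run for every cell. This is a sketch only: replace_in_cells is a hypothetical name, and unlike the single s/// elsewhere in the thread it touches every cell that contains the search text.

```perl
use strict;
use warnings;

# Visit each <td>...</td> cell on its own; only the cell body is edited,
# and only the first occurrence of $from inside each cell is replaced.
sub replace_in_cells {
    my ($text, $from, $to) = @_;
    $text =~ s{<td>(.*?)</td>}{
        my $cell = $1;
        $cell =~ s/\Q$from\E/$to/;
        "<td>$cell</td>";
    }ge;
    return $text;
}

my $text = '<tr><td>1</td><td>2</td><td>qw<font>3</font></td><td>4</td><tr>';
print replace_in_cells($text, '3', 'a'), "\n";
```

Because the non-greedy (.*?) ends at the first closing tag, the inner substitution never sees text from a neighbouring cell.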

        Hm.. I see this looks like an ugly but working solution, and with some modifications I'll use it... but it is so sad there isn't any common regex to solve the problem (as I wrote in another post, I've met such problems not only with HTML..)
Re: regex: something...(!something)...something
by eosbuddy (Scribe) on Jul 17, 2008 at 18:36 UTC
    Hi, perhaps I haven't understood your question (and hence this solution may be wrong)... this relates to the greedy nature of quantifiers. Please review the code below and let me know in the same syntax:
    $x = "<tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>";
    print "$x\n";
    $x =~ s/<td>.*(<td>.*?)3(.*?<\/td>).*<\/td>/<td>$1a$2<\/td>/;
    print "$x\n";
    gives me:
    <tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>
    <tr><td><td><font>a</font></td></td><tr>
    is this your desired output?
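Since greediness came up, here is a short sketch of what greedy and non-greedy quantifiers do on this row. The sub names are made up. The last sub shows that a non-greedy .*? on its own still crosses tag boundaries when that is the only way to reach the 3.

```perl
use strict;
use warnings;

my $row = '<tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>';

# Greedy: .* runs to the end of the string and backs off to the LAST </td>.
sub greedy_cell   { return $_[0] =~ m{<td>(.*)</td>}  ? $1 : undef }

# Non-greedy: .*? stops at the FIRST </td> it can reach.
sub lazy_cell     { return $_[0] =~ m{<td>(.*?)</td>} ? $1 : undef }

# Non-greedy is not "stay inside the cell": to reach the 3 it walks
# straight through the closing and opening tags in between.
sub lazy_to_three { return $_[0] =~ m{<td>(.*?)3}     ? $1 : undef }

print "greedy:        ", greedy_cell($row),   "\n";
print "lazy:          ", lazy_cell($row),     "\n";
print "lazy up to 3:  ", lazy_to_three($row), "\n";
```

That last case is the reason the other replies put a negative lookahead in front of every character instead of relying on .*? alone.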
      Sorry, but no, this is not the desired result.. My mistake - I hadn't stated the desired result. If you are interested in the solution, please see olus's posts. Nevertheless, thanks for the help!
        Hi, Sorry about that, nonetheless, this code also will do the trick you want :-)
        $x =~ s/<td>.*<td>(.*?)3(.*?)<\/td>.*<\/td>/<td>${1}3$2<\/td>/;
Re: regex: something...(!something)...something
by poolpi (Hermit) on Jul 18, 2008 at 12:39 UTC

    See HTML::Element ( replace_with_content method ) and HTML::TreeBuilder

    #!/usr/bin/perl -w
    use strict;
    use HTML::TreeBuilder;

    my $html = q{<tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>};
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $td = $tree->look_down( '_tag', 'td', sub { $_[0]->as_text =~ m/\b3/ } );
    $td->replace_with_content();
    $tree->delete;


      It looks like XPath's brother for HTML) Thank you, I think I'll use this solution when I work directly with HTML code. For the solution, please see the question at the top of the topic - I added the solution there.
Re: regex: something...(!something)...something
by toolic (Bishop) on Jul 17, 2008 at 17:28 UTC
    there must NOT be some text or regex snippet inside the block
    Can you elaborate on that?

    Otherwise, there is plenty to read on the topic at: perlretut, perlrequick, perlfaq6, perlre, etc.

      I've read that.. I understand I'm not a Perl master, but if you know the solution, please tell me where I'm wrong..
      $x = "<tr><td>1</td><td>2</td><td><font>3</font></td><td>4</td><tr>";
      $x =~ s{<td>(?!<td>)3(?!</td>)</td>}{<td>($1)a($2)</td>};
      print $x;
      It does not work... are there any ideas?
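A short sketch of why that pattern cannot match: lookaheads consume no characters, so the pattern still demands the literal text <td>3</td>, and on top of that (?!</td>) forbids the very </td> that the next part of the pattern requires. It also contains no capturing parentheses, so $1 and $2 would be empty anyway. The sub names below are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the question: (?!</td>) followed directly by </td>
# is a contradiction, so this can never succeed on any input.
sub original_pattern {
    return $_[0] =~ m{<td>(?!<td>)3(?!</td>)</td>} ? 1 : 0;
}

# Same lookaheads, but each one guards a "." and the pair is repeated,
# so arbitrary text is allowed as long as it never starts the tag.
sub repeated_pattern {
    return $_[0] =~ m{<td>(?:(?!<td>).)*3(?:(?!</td>).)*</td>} ? 1 : 0;
}

for my $test ('<td>3</td>', '<td><font>3</font></td>') {
    printf "%-26s original=%d repeated=%d\n",
        $test, original_pattern($test), repeated_pattern($test);
}
```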


      Deep posts do not appear for some reason.. I'll post again at the top level:

      You see, I've met this problem not only with HTML... this is a general question. I'm pretty sure there must be some regex to solve the problem, but my knowledge is too small. But, nevertheless, thanks)

      Maybe someone else knows a regex solution?
        are there any ideas?
        Yes, consider abandoning a regex approach, and select an appropriate CPAN solution for parsing HTML. I have used HTML::TokeParser, although I am not experienced enough with it to know if it will solve your problem.
        Deep posts do not appear for some reason.. I'll post again at the top level:

        I personally believe you should go to your User Settings and avoid posting again at the top level - that is not going to earn you anything. It's a matter of visualization anyway. Personally, I've set both Replies header depth and Replies text depth to 1000, but IIRC, with lower values you still get a pointer to deeper posts.

        If you can't understand the incipit, then please check the IPB Campaign.

Node Type: perlquestion [id://698374]
Approved by olus
