http://www.perlmonks.org?node_id=1067550

taint has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Monks.

I'm working on editing some documents (web pages), where I need to replace a block context, with a larger one. I've experimented, but can't yet get it quite right. For example, I am attempting to match the following:

</div>
</body>
for some reason (my lack of experience with Perl RE) this doesn't work
\<\/div\>\n\<\/body\>
I can match </div> or </body>. But not both. Sorry for the bother. I'm so good with sed I feel I should be closer with Perl, than I am. But still haven't quite got the hang of it. :/

Thank you for your time, and consideration.

--Chris

UPDATE -- now with my broken example
Yes. What say about me, is true.

Re: Perl RE; how to capture, and replace based on a block?
by GrandFather (Saint) on Dec 18, 2013 at 00:23 UTC

    Don't do that. As a general thing correctly parsing and editing XML is tricky. Use a module such as XML::Twig to do the hard work for you.

    If you'd provided sample code, input data and expected output most likely someone would show you how to use an appropriate module to do the job.

    True laziness is hard work

      Ahh. I see. I didn't have any code (other than the RE I was using). Because, at this point, if I can't even match the block. There would be no point in attempting to replace. So I hadn't bothered to attempt to replace anything yet. I'm still trying to figure out how to correctly match what I need.

      Just seemed the logical progression, in learning to do it. :)

      Thanks GrandFather, for the reply (and suggestion).

      --Chris


        I agree that an XML parser is likely to be the better idea. Still, here is the kind of example your fellow monks were hoping for, showing what you tried and what resulted (except that this one works):

        >perl -wMstrict -le
        "my $s = qq{xxx </div>\n</body> xxx};
         print qq{[[$s]]};
         ;;
         my $tags = qr{ </div> \n </body> }xms;
         $s =~ s{ $tags }{gone}xms;
         print qq{[[$s]]};
        "
        [[xxx </div>
        </body> xxx]]
        [[xxx gone xxx]]

        Question: Are you sure it's only a single newline that's present? The presence of other whitespace characters than just a newline can confuse the issue. The following might be a better regex:
            qr{ </div> \s* </body> }xms
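
A minimal sketch of that more tolerant pattern in use, assuming the goal is to swap the closing tags for a larger block. The sample string and the replacement text in `$footer` are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample: the two tags are separated by a Windows line
# ending plus indentation, not by a lone "\n".
my $html = "<div>content\r\n</div>\r\n  </body>\r\n</html>\r\n";

# \s* accepts any run of whitespace (or none) between the tags.
my $tags = qr{ </div> \s* </body> }xms;

# Made-up replacement block, larger than what it replaces.
my $footer = "</div>\n<div id=\"footer\">footer text</div>\n</body>";

# Copy the input, then substitute in the copy.
(my $edited = $html) =~ s{$tags}{$footer}xms;

print $edited;
```

With `\n` in place of `\s*` the substitution would leave this sample untouched, because a `\r` and two spaces sit between the tags.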

Re: Perl RE; how to capture, and replace based on a block?
by educated_foo (Vicar) on Dec 18, 2013 at 04:16 UTC
    To minimize unhelpful replies here, you should probably do something like this:
    1. Add "use strict;" to the top of your code.
    2. Add "my ();" right after it.
    3. Keep adding "VAR," between the parens in #2 as long as Perl complains about 'Global symbol "VAR"...'
    Someone should probably write an Acme:: module to do this automatically.

      Sorry. I just

      cat ./FILE.html | perl {...}
      in an open xterm. After several failures, and no more ideas. I closed the xterm, and asked for help. I didn't think it'd be of any use in the request.

      I've since read every single reference in the Perl documentation, and while I think I've got the RE part down. I'm quite sure I don't know how to feed Perl the file properly to do any more than eat a single line at a time.

      So let me have another go at it. The following

      #!/usr/bin/perl -w
      #retest.pl
      # my feeble attempt to a multi-line RE in Perl
      $regexp = shift;
      while (<>) {
          print if /$regexp/;
      }
      won't work as
      # ./retest.pl \</\div\>\n\<\/body\> ./FILE.html
      because shift will only manage input one line at a time. Attempts to figure how to make use of psed, and s2p, have failed miserably.

      Apologies for the previous noise, and thank you for the thoughtful responses.

      --Chris


        Hi Chris, specifying a regex on the command line seems a difficult thing to do. At least you should be printing your $regexp to see what it contains.

        In any case, this code seems to work:

        my $str = "\n</div>\n</body>\n";
        print "Success\n" if $str =~ /\<\/div\>\n\<\/body\>/;

        which suggests that if you slurp in your whole file as a single string (e.g. by unsetting $/), your regex should do its job.

        local $/;
        my $str = <>;
        print "Success\n" if $str =~ /\<\/div\>\n\<\/body\>/;
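
To make the difference concrete, here is a small self-contained sketch comparing line-by-line reading with slurping. The temporary file stands in for FILE.html, and the `slurp` helper and the sample content are assumptions added for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small stand-in for FILE.html so the sketch is self-contained.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "<body>\n<div>\nhello\n</div>\n</body>\n";
close $fh or die "close failed: $!";

# Hypothetical helper: with $/ undefined, one read returns the whole file.
sub slurp {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    local $/;
    my $content = <$in>;
    close $in;
    return $content;
}

# Read line by line: no single line contains both tags.
open my $in, '<', $path or die "Cannot open $path: $!";
my $line_hits = grep { m{</div>\n</body>} } <$in>;
close $in;

# Read as one string: the pattern can span the newline.
my $str       = slurp($path);
my $slurp_hit = $str =~ m{</div>\n</body>} ? 1 : 0;

print "line-by-line matches: $line_hits\n";
print "slurped match: $slurp_hit\n";
```

From the shell, the `-0777` switch gives the same slurp behaviour, for example `perl -0777 -ne 'print "Success\n" if m{</div>\n</body>}' ./FILE.html`.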
        For fiddling with little bits of code, just use the debugger straight away:
        swedish_chef> perl -demo
        Loading DB routines from perl5db.pl version 1.32
        Editor support available.

        Enter h or `h h' for help, or `man perldebug' for more help.

        main::(-e:1):   mo
          DB<1> $string = "one two three four"

          DB<2> x $string =~ m/(\w+)/g
        0  'one'
        1  'two'
        2  'three'
        3  'four'

        Note that "my" variables don't work as expected, I think they get created in the Debug scope, and not in the interpreted scope. But otherwise, have fun in the sandbox.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      Thank you educated_foo.

      " Keep adding "VAR," between the parens in #2 as long as Perl complains about 'Global symbol "VAR"...' Someone should probably write an Acme:: module to do this automatically."

      I'll be glad to. Just as soon as I figure this all out. :)

      My biggest hangup, I think, is that I'm quite comfortable with sed. But sed is "greedy" by default, and while Perl RE can be. It's not, by default, and that's what I need here (not greedy).

      s/\<\/div\>/,/\<\/body\>/
      will match my pattern in sed. But it will match from the first </div> till the first </body>. Which is too much.

      Thanks again for the response, educated_foo

      --Chris

        By default, Perl RE are greedy. Have you considered the possibility that the end of line might be more than \n (if the file is coming from Windows, for example)?

        ... sed ...

        here is my test program

        use re 'debug';
        $_ = qq{</div>\n</body>};
        print 'does it match ', int m{\<\/div\>\n\<\/body\>};
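
Both points from this reply, greedy-by-default quantifiers and line endings that are more than `\n`, can be checked with a short sketch. The sample strings below are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample with two <div> blocks on one line.
my $html = "<div>a</div><div>b</div>";

# Greedy (Perl's default): .* takes as much as it can,
# so the match runs to the LAST </div>.
my ($greedy) = $html =~ m{(<div>.*</div>)}s;

# Non-greedy: .*? stops at the FIRST </div> that lets the match succeed.
my ($lazy) = $html =~ m{(<div>.*?</div>)}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";

# A Windows-style line ending puts "\r" in front of the "\n".
my $dos = "</div>\r\n</body>";

my $strict_nl = $dos =~ m{</div>\n</body>}    ? 1 : 0;
my $tolerant  = $dos =~ m{</div>\r?\n</body>} ? 1 : 0;

print "plain \\n matches:  $strict_nl\n";
print "\\r?\\n matches:     $tolerant\n";
```

The `?` after a quantifier is what switches it to non-greedy; without it, `*` and `+` always take the longest match they can.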
Re: Perl RE; how to capture, and replace based on a block?
by Anonymous Monk on Dec 18, 2013 at 00:12 UTC
    How about you post actual perl code, you know, stuff ready to run?
      Um. I did that;
      Code to replace
      </div>
      </body>
      RE I'm using, that doesn't work
      \<\/div\>\n\<\/body\>
      As stated in my OP; my RE (shown) matches one, or the other, not both. I had hoped to match both (</div></body>). Is it clearer?

      --Chris


        Um. I did that;

        Sorry, but you didn't. You posted some data and a pattern that you say should match the data but doesn't. Great, now show your code that uses the pattern with this data and fails to match.

        it works "perfectly" as expected

        Compiling REx "\<\/div\>\n\<\/body\>"
        Final program:
           1: EXACT <</div>\n</body>> (6)
           6: END (0)
        anchored "</div>%n</body>" at 0 (checking anchored isall) minlen 14
        Guessing start of match in sv for REx "\<\/div\>\n\<\/body\>" against "</div>%n</body>"
        Found anchored substr "</div>%n</body>" at offset 0...
        Guessed: match at offset 0
        Freeing REx: "\<\/div\>\n\<\/body\>"