multiple-line regexes?

by jens (Pilgrim)
on Aug 19, 2002 at 05:43 UTC (#191086)
jens has asked for the wisdom of the Perl Monks concerning the following question:

I don't have a great deal of Perl experience, so a bit of patience with this question please if it seems basic.

I recently used Perl to automate marking up some data files with HTML for inclusion in a website. This, naturally, proved much faster than coding HTML by hand! However, there was one scenario that I couldn't figure out, and wound up doing manually--I needed to remove six-line blocks of data, but only if a particular pattern existed in the fourth line.

For future reference, how do you do this in Perl? Thanks very much for your reply.


Replies are listed 'Best First'.
Re: multiple-line regexes?
by spurperl (Priest) on Aug 19, 2002 at 06:06 UTC
    In short, the /m and /s pattern modifiers should help when dealing with multi-line matches.
    m  Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.
    s  Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
    Read "perldoc perlre" for more information, or post a specific problem.
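
A minimal sketch of what the two modifiers change, assuming a small made-up three-line string (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# Without /m, ^ only matches at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ matches at the start of every line.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, . stops at a newline, so this cannot span lines.
my $no_s = $text =~ /first.*third/ ? 1 : 0;

# With /s, . also matches newlines, so the match can span lines.
my $with_s = $text =~ /first.*third/s ? 1 : 0;

print "plain: @plain\n";                         # plain: first
print "multi: @multi\n";                         # multi: first second third
print "no /s: $no_s, with /s: $with_s\n";        # no /s: 0, with /s: 1
```

Note that /m and /s are independent: /m only affects "^" and "$", /s only affects ".", and both can be combined on the same pattern.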
Re: multiple-line regexes?
by jens (Pilgrim) on Aug 19, 2002 at 06:55 UTC
    Here's some simplified code to give you an idea of what I wanted to accomplish:

    Example HTML code (produced by saving as HTML from OpenOffice Calc) simplified for sake of argument:

    <tag1>
    <tag2>
    <tag3>
    <tag4><MYTAG>is there stuff here?</MYTAG></tag4>
    </tag3>
    </tag2></tag1>

    If the contents between MYTAG were *blank*, then I wanted to delete the entire six lines.


      Does your data always look like pseudo-HTML, i.e. tagged? If so, you may get better results with HTML::TokeParser.

      Here's some untested code:

      use HTML::TokeParser;

      # $html is a file name or file handle; pass \$html to parse a string.
      my $p = HTML::TokeParser->new($html) || die "Can't tokenize: $!";

      # get each <tag1> alone.
      while (my $token = $p->get_tag('tag1')) {
          # store the original text of the <tag1> start tag
          my $origtext = $token->[3];

          # get data between <MYTAG></MYTAG>
          # (the parser lowercases tag names, so ask for 'mytag')
          my $myTag = $p->get_tag('mytag');
          my $text  = $p->get_text('/mytag');

          if ($text ne '') {
              # tag is not empty.. so $origtext retains
              # the data we want..
              # ... do whatever with $origtext and move on
          }
      }

      He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

      Chady

      Instead of using a multiline regex, you could consider looping over the lines in turn and removing every six lines where there's a match on the fourth.

      I think the code is pretty self-explanatory (@lines contains your file):

      my $i = 3;                              # from line 4 ...
      while ($i < @lines - 2) {               # ... until last but 3
          if ($lines[$i] =~ m!<MYTAG></MYTAG>!) {
              splice(@lines, $i - 3, 6);      # kill 6 lines from $i - 3
          } else {
              $i++;
          }
      }

      — Arien
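
For comparison with the loop above, here is a hedged sketch of the single-regex route the original question asked about. It assumes the whole file has been read into one string (here a made-up sample in `$data`), that every block is exactly six newline-terminated lines, and that an empty tag appears literally as `<MYTAG></MYTAG>` on the fourth line. None of this has been checked against the real OpenOffice output.

```perl
use strict;
use warnings;

# Made-up sample: two six-line blocks, the second with an empty <MYTAG>.
my $data = <<'END';
<tag1>
<tag2>
<tag3>
<tag4><MYTAG>keep me</MYTAG></tag4>
</tag3>
</tag2></tag1>
<tag1>
<tag2>
<tag3>
<tag4><MYTAG></MYTAG></tag4>
</tag3>
</tag2></tag1>
END

# /m lets ^ match at the start of every line, /x allows the layout and
# comments, and /g repeats the deletion for every matching block.
my $removed = $data =~ s{
    ^ (?: .* \n ){3}                # lines 1-3 of the block
    .* <MYTAG></MYTAG> .* \n        # line 4 must hold the empty tag
    (?: .* \n ){2}                  # lines 5-6 of the block
}{}xmg;

print "blocks removed: ", ($removed || 0), "\n";
print $data;
```

Because /s is not used, each `.*` stays within one line, so every `(?: .* \n )` group consumes exactly one line. With the sample above, only the second block is deleted and the first is printed unchanged.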

Node Type: perlquestion [id://191086]
Approved by rob_au