
regex help humbly sought

by charlie_pi (Initiate)
on Mar 06, 2008 at 03:59 UTC ( #672348=perlquestion )
charlie_pi has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I humbly beseech your help. I've been given over 9000 HTML pages from a Dreamweaver website. The *real* content (minus templates and HTML tag garbage) is to be stripped out and converted into text files for addition to a Joomla site. The structure of most of these pages looks like this (gt/lt signs represented as '(' and ')'):

blah blah blah
(!-- #BeginEditable "region 1" --)
important stuff we wish to keep
(!-- #EndEditable --)
Unimportant stuff
(!-- #BeginEditable "region 2" --)
more stuff we wish to keep
(!-- #EndEditable --)
Some more unimportant stuff

I've thrown everything into one monolithic line:

while (<>) {
$big_line .= "$_ ";
}

then tried to run:

if ($big_line =~ /(!-- #BeginEditable ".*?" --)) {
$big_line =~ s/^.*?(!-- #BeginEditable ".*?" --)(.*?)(!-- #EndEditable --)/$1/;

But the problems I run into are:

-1- My pattern matching is missing something
-2- There are an indefinite number of BeginEditable tags any given file might have.
Ideally, the output would go something like:

EditableRegion X = (whatever's inside)
EditableRegion Y = (elsething)

I've used regex's for lots of things, but I might've met my match with this one. If anyone has wisdom, please bestow it upon a Brother.

Humbly Yourn,

Replies are listed 'Best First'.
Re: regex help humbly sought
by Punitha (Priest) on Mar 06, 2008 at 04:21 UTC

    Hi charlie_pi

    you can try like this,

    use strict;
    local $/;    #####To read the whole content of a file
    while(<DATA>){
        while($_=~/\(\!-- #BeginEditable \"([^"]*)" --\)(.*?)\(\!-- #EndEditable --\)/sgi){
            print "IN:$1:$2\n";
            ##do your stuffs here
        }
    }
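The approach in this reply (read the whole file as one string, then loop with a non-greedy match under /g) can be sketched as a self-contained script that prints in the "EditableRegion X = ..." form the question asked for. This is only a sketch under assumptions: the markers are real HTML comments with angle brackets, regions are never nested, and the sub name `extract_regions` and the sample page are made up for illustration.

```perl
use strict;
use warnings;

# Return a list of [name, content] pairs, one per editable region.
# Assumes real "<!-- ... -->" comment markers and no nested regions.
sub extract_regions {
    my ($html) = @_;
    my @regions;
    # /s lets . match newlines, /g finds every region, /i ignores case.
    while ($html =~ /<!--\s*#BeginEditable\s+"([^"]*)"\s*-->(.*?)<!--\s*#EndEditable\s*-->/sgi) {
        my ($name, $content) = ($1, $2);
        $content =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        push @regions, [$name, $content];
    }
    return @regions;
}

# Hypothetical sample standing in for one slurped page.
my $page = <<'HTML';
blah blah blah
<!-- #BeginEditable "region 1" -->
important stuff we wish to keep
<!-- #EndEditable -->
Unimportant stuff
<!-- #BeginEditable "region 2" -->more stuff we wish to keep<!-- #EndEditable -->
Some more unimportant stuff
HTML

printf "EditableRegion %s = %s\n", @$_ for extract_regions($page);
```

Capturing the region name with `[^"]*` rather than `.*?` keeps the name from running past its closing quote, and the second region in the sample shows that it does not matter whether the markers sit on their own lines.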


      Punitha, you are a magician. I have made a donation to Perl Monks Foundation in your name, and I pledge also to study the regex you've written so that I can pass it on. Thank you for your timely help! Charlie Pi
Re: regex help humbly sought
by poolpi (Hermit) on Mar 06, 2008 at 09:47 UTC
Re: regex help humbly sought
by halfcountplus (Hermit) on Mar 06, 2008 at 05:19 UTC
    WOW, i'm sure i've never tried .*? in a regex, so if that was supposed to make sense, ignore me.

    Why would you put this into one line if it is already split into easily manipulated parts?

    You might as well just use BeginEditable and EndEditable and set a switch so that only the lines in the middle are accepted:

    use strict;
    my $sw = "off";
    while (<DATA>) {
            if ($_ =~ /BeginEditable/) {$sw="on";}
            elsif ($_ =~ /EndEditable/) {$sw="off";}
            elsif ($sw eq "on") {print "$_"}
    }
    __DATA__
    (!_-%BeginEditable$$)      nb. the "html" is irrelevant

    It's unclear from your question what else you wanted to do.
      .*? is perfectly legitimate in a Perl regex. Search for "non-greedy" sometime.

      Your solution is great if the input actually has the tags in question conveniently on lines by themselves, but breaks otherwise. If you've ever worked with real users, you know that's unlikely unless the input was purely machine generated with that constraint in mind and never touched by the users.
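To make that failure mode concrete, here is a small sketch comparing the two strategies on input where a marker and its content share a line. The sub names `keep_by_line` and `keep_by_regex` and the sample string are hypothetical, and the regex is deliberately loose; it assumes real angle-bracket comments.

```perl
use strict;
use warnings;

# Line-by-line switch: keeps only whole lines that fall strictly
# between a line containing BeginEditable and one containing EndEditable.
sub keep_by_line {
    my ($text) = @_;
    my $on = 0;
    my @kept;
    for my $line (split /\n/, $text) {
        if    ($line =~ /BeginEditable/) { $on = 1 }
        elsif ($line =~ /EndEditable/)   { $on = 0 }
        elsif ($on)                      { push @kept, $line }
    }
    return @kept;
}

# Regex over the whole string: indifferent to where line breaks fall.
# Call in list context to get every captured region.
sub keep_by_regex {
    my ($text) = @_;
    return $text =~ /BeginEditable[^>]*>(.*?)<!--\s*#EndEditable/sg;
}

# Hypothetical input where markers and content were joined on one line.
my $one_line = '<!-- #BeginEditable "a" -->keep me<!-- #EndEditable -->';

my @by_line  = keep_by_line($one_line);
my @by_regex = keep_by_regex($one_line);
print "by line:  ", scalar(@by_line), " region(s)\n";
print "by regex: ", join('|', @by_regex), "\n";
```

On this input the line switch sees BeginEditable on the only line, flips on, and never reaches a content line, so it keeps nothing, while the whole-string match still recovers "keep me".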

      "WOW, i'm sure i've never tried .*? in a regex, so if that was supposed to make sense, ignore me"

      This example should clearly illustrate the difference.

      $_ = '<S>abc<E> <S>def<E>';
      print("Greedy:\n");
      print(" $1\n") while /<S>(.*)<E>/g;
      print("Non-greedy:\n");
      print(" $1\n") while /<S>(.*?)<E>/g;

      Greedy:
       abc<E> <S>def
      Non-greedy:
       abc
       def

Node Type: perlquestion [id://672348]
Approved by kyle