PerlMonks
regex help humbly sought
by charlie_pi (Initiate) on Mar 06, 2008 at 03:59 UTC (#672348=perlquestion)
charlie_pi has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks, I humbly beseech your help. I've been given over 9000 HTML pages from a Dreamweaver website. The *real* content (minus templates and HTML tag garbage) is to be stripped out and converted into text files for addition to a Joomla site. The structure of most of these pages looks like this (gt/lt signs represented as '(' and ')'):
    (html)
    blah blah blah
    (!-- #BeginEditable "region 1" --)
    important stuff we wish to keep
    (!-- #EndEditable --)
    Unimportant stuff
    (!-- #BeginEditable "region 2" --)
    more stuff we wish to keep
    (!-- #EndEditable --)
    Some more unimportant stuff
    (/html)

I've thrown everything into one monolithic line:

    while (<>) { chomp; $big_line .= "$_ "; }

then tried to run:

    if ($big_line =~ /(!-- #BeginEditable ".*?" --)) {
        $big_line =~ s/^.*?(!-- #BeginEditable ".*?" --)(.*?)(!-- #EndEditable --)/$1/;
    }

But the problems I run into are:

-1- My pattern matching is missing something
-2- There are an indefinite number of BeginEditable tags any given file might have.

Ideally, the output would go something like:

    EditableRegion X = (whatever's inside)
    EditableRegion Y = (elsething)

I've used regex's for lots of things, but I might've met my match with this one. If anyone has wisdom, please bestow it upon a Brother.

Humbly Yourn,
charlie_pi
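
One possible approach, offered as a minimal sketch and not a definitive answer: instead of substituting away everything around the first region, match globally and collect each region as it is found. The sketch below assumes the real files use ordinary angle brackets (the post shows them as parentheses), that regions are not nested, and that region names contain no double quotes. The helper name `extract_regions` and the sample string are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns a list of
# [region name, region content] pairs found in an HTML string.
# Assumes non-nested regions delimited by Dreamweaver-style comments.
sub extract_regions {
    my ($html) = @_;
    my @regions;

    # /s lets "." match newlines, so the input need not be flattened first.
    # /g in a while loop resumes after the previous match, which handles
    # an indefinite number of regions per file.
    while ($html =~ /<!--\s*#BeginEditable\s+"([^"]*)"\s*-->(.*?)<!--\s*#EndEditable\s*-->/gs) {
        my ($name, $content) = ($1, $2);   # copy before running another regex
        $content =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
        push @regions, [$name, $content];
    }
    return @regions;
}

# Sample input modelled on the structure described in the question.
my $sample = <<'HTML';
<html> blah blah blah
<!-- #BeginEditable "region 1" -->
important stuff we wish to keep
<!-- #EndEditable -->
Unimportant stuff
<!-- #BeginEditable "region 2" --> more stuff we wish to keep <!-- #EndEditable -->
Some more unimportant stuff </html>
HTML

for my $region (extract_regions($sample)) {
    my ($name, $content) = @$region;
    print "EditableRegion $name = $content\n";
}
```

Run against the sample, this prints one `EditableRegion region 1 = ...` line and one `EditableRegion region 2 = ...` line. For real files, slurping each file into a scalar (for example by localizing `$/` to `undef` before reading) would probably be simpler than chomping and joining lines, since the `/s` modifier already copes with newlines. If regions can nest or the markup is irregular, a regex alone may not be enough and an HTML parser would be the safer choice.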