PerlMonks  

extract substring between token a and token b

by philosophia (Sexton)
on Jul 29, 2004 at 00:44 UTC (#378259, perlquestion)

philosophia has asked for the wisdom of the Perl Monks concerning the following question:

hi

i'm trying to extract $substring from $text - where $text is a block of html and $substring is anything between

<!--begin--> and
<!--/end-->

in $text.

i've tried

$text =~ /%<\!--begin-->%(.*?)%<\!--end-->/;
$substring = $1;
print $substring;


but this doesn't seem to be working. i'm not sure if my regex is right first of all. also, is there any way to do this without using the global variable $1?

thanks

Replies are listed 'Best First'.
Re: extract substring between token a and token b
by LassiLantar (Monk) on Jul 29, 2004 at 01:47 UTC
    This works... I'm not sure what the "%"s in your regex were for.
    #!/usr/bin/perl
    $text = "<!--begin-->this is a line of text<!--end--><!--begin-->another one<!--end-->";
    @lines = $text =~ /<\!--begin-->(.*?)<\!--end-->/g;
    foreach (@lines) {
        print $_ . "\n";
    }

    Peace!
    LassiLantar

Re: extract substring between token a and token b
by itub (Priest) on Jul 29, 2004 at 00:51 UTC
    If the string you want to match spans more than one line, you need to use the /s modifier, because otherwise the dot doesn't match the newline character.
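A minimal sketch of the difference the /s modifier makes. The sample `$text` is a made-up input for illustration, not the poster's actual HTML:

```perl
use strict;
use warnings;

# Hypothetical input: the marked section spans two lines.
my $text = "<!--begin-->line one\nline two<!--end-->";

# Without /s the dot stops at the newline, so the match fails.
my $without_s = $text =~ /<!--begin-->(.*?)<!--end-->/  ? 'matched' : 'no match';

# With /s the dot also matches "\n", so the match succeeds.
my $with_s    = $text =~ /<!--begin-->(.*?)<!--end-->/s ? 'matched' : 'no match';

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

Since a block of HTML almost always contains newlines, /s is probably what is wanted here.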

    If you do the pattern matching in list context with the /g (global) option, you can capture all the matches in one array without using $1. e.g., @matches = $text =~ /(pattern)/g;
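As a sketch of both forms, assuming a made-up `$text` with two marked sections: a list assignment without /g puts the first capture directly into a lexical variable, and list context with /g returns every capture.

```perl
use strict;
use warnings;

# Hypothetical input with two marked sections, one spanning a newline.
my $text = "<!--begin-->first<!--end--> filler <!--begin-->second\npart<!--end-->";

# List assignment without /g: the first capture lands in $first, no $1 needed.
my ($first) = $text =~ /<!--begin-->(.*?)<!--end-->/s;

# List context with /g: all captures are returned as a list.
my @all = $text =~ /<!--begin-->(.*?)<!--end-->/sg;

print "first: $first\n";
print "count: ", scalar(@all), "\n";
```

If nothing matches, `$first` is left undefined and `@all` is empty, so it is worth checking before using either.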

Re: extract substring between token a and token b
by philosophia (Sexton) on Jul 29, 2004 at 03:17 UTC
    you guys are awesome thanks
Re: extract substring between token a and token b
by revdiablo (Prior) on Jul 29, 2004 at 17:02 UTC

    Your question has been directly answered, but there is a major pitfall to your approach. Even using the non-greedy match, you will run into problems with nested tags. Example:

    $input = '<foo>bar<foo>oops</foo></foo>';
    print $input =~ m{<foo>(.*?)</foo>}, "\n";
    __OUTPUT__
    bar<foo>oops

    This is one big reason you should avoid simple regular expressions for parsing nested formats, like HTML. The best solution is to use a real parser that actually understands its source data, rather than simple pattern matches.
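To illustrate what "understanding nesting" means, here is a rough sketch that counts open and close tags instead of relying on a single pattern. The helper `outer_foo` is hypothetical and only recognizes the literal tags `<foo>` and `</foo>`; it is not an HTML parser and does not replace one (for real HTML, a CPAN module such as HTML::Parser is the kind of tool the advice above points to).

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first <foo>...</foo> pair,
# counting nested <foo> tags so that the matching close tag is found.
sub outer_foo {
    my ($input) = @_;
    my ($depth, $start) = (0, undef);
    while ($input =~ m{(</?foo>)}g) {
        if ($1 eq '<foo>') {
            # Remember where the outermost element's content starts.
            $start = pos($input) if $depth++ == 0;
        }
        elsif ($depth > 0 && --$depth == 0) {
            # pos() is just past "</foo>"; step back over the tag itself.
            my $end = pos($input) - length('</foo>');
            return substr($input, $start, $end - $start);
        }
    }
    return;    # no balanced pair found
}

print outer_foo('<foo>bar<foo>oops</foo></foo>'), "\n";
```

For the nested input this prints `bar<foo>oops</foo>`, the full contents of the outer element, where the non-greedy regex stopped at the first `</foo>`.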

Node Type: perlquestion [id://378259]
Approved by graff