http://www.perlmonks.org?node_id=7633

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question: (regular expressions)

How would I write a regular expression to extract all text between two key words such as 'start' and 'end'?

Originally posted as a Categorized Question.

Replies are listed 'Best First'.
Re: How do I extract all text between two keywords like start and end?
by chromatic (Archbishop) on Apr 14, 2000 at 19:41 UTC
    The basic regexp is:
    if ($text =~ /\bstart\b(.*?)\bend\b/) {
        $result = $1;
        # do something with results
    }
    Note that the . character matches any character except a newline (add the /s modifier to m// if you want to span lines), the * means match zero or more times, and the ? makes the * match as few times as possible, so it will stop at the first end instead of the last one. The \b anchors are there to prevent mismatches on words like 'starting' and 'backend'.
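
    To make the difference between the quantifiers and the effect of \b concrete, here is a small sketch. The sample text is made up for illustration and is not from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen to contain 'starting' and 'backend'.
my $text = 'starting up; start first end middle backend start second end';

# Non-greedy with word boundaries: stops at the first standalone 'end'.
my ($lazy) = $text =~ /\bstart\b(.*?)\bend\b/;

# Greedy with word boundaries: runs on to the last standalone 'end'.
my ($greedy) = $text =~ /\bstart\b(.*)\bend\b/;

# Without \b, the 'start' inside 'starting' is accepted as a keyword.
my ($loose) = $text =~ /start(.*?)end/;

print "lazy:   [$lazy]\n";      # [ first ]
print "greedy: [$greedy]\n";    # [ first end middle backend start second ]
print "loose:  [$loose]\n";     # [ing up; start first ]
```

    The loose version shows why the boundaries matter: it begins capturing in the middle of 'starting'.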

    It has the limitation of not catching nested starts and ends, in which case you might go the recursion route, and write this as a function:

    sub between {
        my $text = shift;
        if ($text =~ /start(.*?)end/) {
            # recurse on the captured text to peel off inner pairs
            return between($1);
        }
        return $text;
    }
    That can become prohibitively expensive, depending on your data set. I suspect there's a more hideous solution involving split and join, but that's likely to be counterproductive at this point. It also depends on having balanced tags -- if you don't, don't do this!
Re: How do I extract all text between two keywords like start and end?
by stephen (Priest) on Apr 15, 2000 at 02:16 UTC
    Hmmm... I'm afraid that the recursive 'between' above there might not work for complex cases. The non-greedy regexp would make it match the first start-end pair it found, so if we had:
    yadda yadda start this is comment start this is still comment end this should still be comment end yadda yadda
    then we should wind up with the whole thing, minus start and end and yadda, but instead we get:
    this is comment start this is still comment

    The only way I can think of to get around this is by keeping external track of the levels. This also de-recurses it, which makes it less beautiful, but faster (in theory):

    sub between {
        my ($text) = @_;
        my $level = 0;
        my @comments = ();
        while ( $text =~ m{\G .*? (start|end) (.*?) (?: (?=start|end) | $ ) }gxs ) {
            if ( $1 eq 'start' ) { $level++; }
            else { ($level > 0) and $level--; }
            $level > 0 and push(@comments, $2);
        }
        return join('', @comments);
    }

    This returns:

    this is comment this is still comment this should still be comment

    So what we're doing here is going through the text looking for 'start's and 'end's. We keep a counter indicating how many levels deep we are in 'start's and 'end's. Every time we hit a 'start', we add one. Every time we hit an 'end', we subtract one, checking first to make sure that our level doesn't go negative. (Otherwise, somebody could mess us up by starting a file "end end end".)

    Afterwards, we look at the patch of text between the current tag and the next start/end tag. If our level is greater than 0, we're between a 'start' and an 'end' tag, so we store that segment. Otherwise, we're not, so we look for another 'start' or 'end' tag until the end of file.
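
    The same level-counting idea can also be sketched with split instead of a \G loop. This is only an illustration under a few assumptions: between_levels is a hypothetical name, the sample string is made up, and \b is added so that words like 'starting' are not treated as tags.

```perl
use strict;
use warnings;

# Sketch only: between_levels is a hypothetical helper, not code from the thread.
sub between_levels {
    my ($text) = @_;
    my $level  = 0;
    my @kept;

    # Capturing parentheses in split keep the delimiters in the returned list.
    for my $piece (split /\b(start|end)\b/, $text) {
        if ($piece eq 'start') {
            $level++;
        }
        elsif ($piece eq 'end') {
            $level-- if $level > 0;    # ignore stray 'end's
        }
        elsif ($level > 0) {
            push @kept, $piece;        # text inside at least one 'start'
        }
    }
    return join '', @kept;
}

my $sample = 'yadda start a start b end c end yadda';
print '[', between_levels($sample), "]\n";    # [ a  b  c ]
```

    Because the keywords are split out as separate list items, each piece is either a tag or plain text, so no lookahead is needed.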

Re: How do I extract all text between two keywords like start and end?
by Anonymous Monk on Jul 03, 2000 at 00:04 UTC
Re: How do I extract all text between two keywords like start and end?
by little_mistress (Monk) on Apr 15, 2000 at 02:16 UTC
    I'm kinda wondering about this one. Since you know the structure of the data (i.e. the data starts after delimiter 'a' and ends with delimiter 'b') and you allow the keyword to be a regular word in the data, I would have to assume that the $text in chromatic's answer means you have all the text in the same string. As I recall (sorry, I've lost my Mastering Regular Expressions book, it's in Japan with Sawako), you need a regular expression that treats the newline character as an embedded character rather than as the end of a line.

    /start(.*)end/s; #rather than /start(.*)end/;

    #used like this
    $file = 'C:\fixthis.txt';
    open(SESAME, $file) or die "Can't open $file: $!";
    while (<SESAME>) { $text .= $_; }
    close(SESAME);
    print $text;
    $text =~ s/\n*$//; #get rid of trailing newlines
    $text =~ m/^start(.*)end$/s;
    print $1;

    ########the file has this data ##########
    # I inserted a lot of the words start and end to test it.
    #start
    #this is the start house startthat jackstart built
    #and i am end my fathers endchild
    #all end good boys do finend
    #and i eat end more chicken than any man that you have seen
    #end
    #############the output is this#############
    #this is the start house startthat jackstart built
    #and i am end my fathers endchild
    #all end good boys do finend
    #and i eat end more chicken than any man that you have seen
    That seems to work OK, if I understood the structure of your data correctly. If not, I'm sure you could modify the regular expression to fit your needs.
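
    For anyone who wants to try the /s approach without a file on disk, here is a minimal sketch. It assumes an in-memory filehandle as a stand-in for the file, a shortened copy of the sample data, and a lexical handle $fh in place of SESAME.

```perl
use strict;
use warnings;

# Stand-in for the file contents (shortened); not the full data from the post.
my $data = "start\nthis is the start house\nand i am end my fathers endchild\nend\n\n";

# Opening a reference to a scalar gives an in-memory filehandle.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$fh> };    # with $/ undefined, one read returns everything
close $fh;

$text =~ s/\n+\z//;                   # drop trailing newlines

# /s lets . match newlines; ^ and $ anchor the whole string here.
my ($body) = $text =~ /^start(.*)end$/s;
print $body;
```

    The captured text keeps the inner 'start' and 'end' words, because the anchors tie the match to the first and last keywords in the string.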

    Remember, simple is better.

    little_mistress@mainhall.com

Re: How do I extract all text between two keywords like start and end?
by Anonymous Monk on Aug 06, 2000 at 09:16 UTC
    Hi guys. This seems to work. I'd like some comments on this. I'm trying to further my perl education. Thanks. -Ty
    $line = "start this is the start this is another end first end start hello start inside of hello end there end";
    while ($line =~ s/\s*\bstart\b(?!.*?\bstart\b) (.+?) \bend\b//) {
        print $1, "\n";
    }
    tkroll@hawaii.edu
Re: How do I extract all text between two keywords like start and end?
by I0 (Priest) on Mar 07, 2001 at 11:22 UTC
    $_ = "start this is the start this is another end first end start hello start inside of hello end there end";
    ($re=$_)=~s/((\bstart\b)|(\bend\b)|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
    @$ = (eval{/$re/},$@);
    print join"\n",@$ unless $$[-1]=~/unmatched/;
Re: How do I extract all text between two keywords like start and end?
by Punto (Scribe) on May 17, 2000 at 16:28 UTC
    ($content) = $string =~ m/ start (.*) end /;
    That will get the stuff between " start " and " end " into $content.
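
    Two properties of this pattern are worth seeing in action: the spaces are literal, and .* is greedy. A small sketch with made-up input strings:

```perl
use strict;
use warnings;

# Made-up input for illustration.
my $string = 'a start one end b start two end c';

# List context returns the captures; greedy .* runs to the LAST ' end '.
my ($content) = $string =~ m/ start (.*) end /;
print "[$content]\n";    # [one end b start two]

# No space before 'start' here, so the pattern does not match at all.
my ($none) = 'start x end' =~ m/ start (.*) end /;
print defined $none ? "[$none]\n" : "no match\n";    # no match
```

    If the keywords can sit at the very beginning or end of the string, \b is likely a safer delimiter than a literal space.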