How do I extract all text between two keywords like start and end?

Contributed by Anonymous Monk on Apr 14, 2000 at 19:27 UTC


Description:

How would I write a regular expression to extract all text between two key words such as 'start' and 'end'?

Answer: How do I extract all text between two keywords like start and end?
contributed by chromatic

The basic regexp is:

    if ($text =~ /\bstart\b(.*?)\bend\b/) {
        $result = $1;
        # do something with results
    }

Note that the . character matches any character but a newline (add the /s modifier, described under m//, if you want to span lines), the * means match zero or more times, and the ? forces * to match as few times as possible, so it will pick up the first end instead of the last one. The \b is in there to prevent mismatches on words like 'starting' and 'backend'.
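
To make the difference between those modifiers and quantifiers concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the regexp itself comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical sample text: two start/end pairs, the first spanning a newline.
my $text = "start one\ntwo end filler start three end";

# Non-greedy with /s: stops at the first 'end', and . may cross the newline.
my ($lazy) = $text =~ /\bstart\b(.*?)\bend\b/s;

# Greedy with /s: runs on to the last 'end' in the string.
my ($greedy) = $text =~ /\bstart\b(.*)\bend\b/s;

# Without /s the first pair cannot match, because . will not cross the
# newline, so the match is found at the second pair instead.
my ($no_s) = $text =~ /\bstart\b(.*?)\bend\b/;

print "lazy:   [$lazy]\n";
print "greedy: [$greedy]\n";
print "no /s:  [$no_s]\n";
```

The last case is the easy one to miss: dropping /s does not make the match fail outright, it can quietly pick a different pair further along.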

It has the limitation of not catching nested starts and ends, in which case you might go the recursion route, and write this as a function:

    sub between {
        my $text = shift;
        if ($text =~ /start(.*?)end/) {
            my $result = $1;
            return between($result);    # look for a nested pair inside the match
        }
        else {
            return $text;
        }
    }

That can become prohibitively expensive, depending on your data set. I suspect there's a more hideous solution involving split and join, but that's likely to be counterproductive at this point. It also depends on having balanced tags -- if you don't, don't do this!
Answer: How do I extract all text between two keywords like start and end?
contributed by stephen

Hmmm... I'm afraid that the recursive 'between' above there might not work for complex cases. The non-greedy regexp would make it match the first start-end pair it found, so if we had:

    yadda yadda start this is comment start this is still comment end this should still be comment end yadda yadda

then we should wind up with the whole thing, minus start and end and yadda, but instead we get:
this is comment start this is still comment

The only way I can think of to get around this is by keeping external track of the levels. This also de-recurses it, which makes it less beautiful, but faster (in theory):

    sub between {
        my ($text) = @_;
        my $level = 0;
        my @comments = ();
        while ( $text =~ m{
                    \G .*?
                    (start|end)
                    (.*?)
                    (?: (?=start|end) | $ )
                }gxs ) {
            if ( $1 eq 'start') {
                $level++;
            }
            else {
                ($level > 0) and $level--;
            }
            $level > 0 and push(@comments, $2);
        }
        return join('', @comments);
    }

This returns:

this is comment this is still comment this should still be comment

So what we're doing here is going through the text looking for 'start's and 'end's. We keep a counter indicating how many levels deep we are in 'start's and 'end's. Every time we hit a 'start', we add one. Every time we hit an 'end', we subtract one, checking first to make sure that our level doesn't go negative. (Otherwise, somebody could mess us up by starting a file "end end end".)

Afterwards, we look at the patch of text between the current tag and the next start/end tag. If our level is greater than 0, we're between a 'start' and an 'end' tag, so we store that segment. Otherwise, we're not, so we look for another 'start' or 'end' tag until the end of file.
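
The same level-counting idea can also be sketched with split instead of a \G loop. This is only an illustration, not the code from the answer above: the name between_levels is made up, and it adds the \b anchors from chromatic's answer, so it treats 'starting' and 'backend' as ordinary text.

```perl
use strict;
use warnings;

# Hypothetical helper: level counting over the pieces returned by split.
sub between_levels {
    my ($text) = @_;
    my $level = 0;
    my @kept;

    # The capturing group makes split return the delimiters as well,
    # so tags and the text between them arrive in order.
    for my $piece (split /\b(start|end)\b/, $text) {
        if ($piece eq 'start') {
            $level++;
        }
        elsif ($piece eq 'end') {
            $level-- if $level > 0;    # never go negative on stray 'end's
        }
        elsif ($level > 0) {
            push @kept, $piece;        # text inside at least one 'start'
        }
    }
    return join '', @kept;
}

print between_levels("yadda start a start b end c end yadda"), "\n";
print between_levels("end end start kept end dropped"), "\n";
```

The second call shows the guard at work: the two leading 'end's leave the level at zero, so only the text inside the later start/end pair is kept.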

Answer: How do I extract all text between two keywords like start and end?
contributed by QandAEditors

Check out Text::DelimMatch on CPAN.

Answer: How do I extract all text between two keywords like start and end?
contributed by I0

    $_ = "start this is the start this is another end first end start hello start inside of hello end there end";
    ($re=$_)=~s/((\bstart\b)|(\bend\b)|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
    @$ = (eval{/$re/},$@);
    print join"\n",@$ unless $$[-1]=~/unmatched/;

Answer: How do I extract all text between two keywords like start and end?
contributed by little_mistress

I'm kinda wondering about this one. Since you know the structure of the data (i.e. the data starts after delimiter 'a' and ends with delimiter 'b') and you allow the keyword to be a regular word in the data, I would have to assume that the $text in chromatic's answer means you have all the text in the same string. As I recall (sorry, I've lost my Mastering Regular Expressions book, it's in Japan with Sawako), you need a regular expression that treats the newline character as an embedded character rather than the end of a line.

    /start(.*)end/s;    # rather than
    /start(.*)end/;

    # used like this
    $file = 'C:\fixthis.txt';
    open(SESAME, $file) or die "Cannot open $file: $!";
    while (<SESAME>) {
        $text .= $_;
    }
    close(SESAME);
    print $text;
    $text =~ s/\n*$//;    # get rid of trailing newlines
    $text =~ m/^start(.*)end$/s;
    print $1;

    ######## the file has this data ##########
    # I inserted a lot of the words start and end to test it.
    #start
    #this is the start house startthat jackstart built
    #and i am end my fathers endchild
    #all end good boys do finend
    #and i eat end more chicken than any man that you have seen
    #end
    ############# the output is this #############
    #this is the start house startthat jackstart built
    #and i am end my fathers endchild
    #all end good boys do finend
    #and i eat end more chicken than any man that you have seen

That seems to work OK, if I understood the structure of your data correctly. If not, I'm sure you could modify the regular expression to fit your needs.
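
For what it's worth, reading the whole file into one string is often done by undefining $/ rather than appending line by line. The following is only a sketch of that variation: the sample data is made up, and it reads from an in-memory filehandle so that it runs without a file on disk. With a real file you would open it by name instead.

```perl
use strict;
use warnings;

# Stand-in for the file on disk: an in-memory filehandle over sample data.
my $data = "start\nfirst line with an end in it\nsecond line\nend\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# Undefining $/ (the input record separator) makes <$fh> return everything.
my $text = do { local $/; <$fh> };
close($fh);

$text =~ s/\n+\z//;    # drop trailing newlines

# Greedy .* with /s: from the leading 'start' to the final 'end'.
my ($body) = $text =~ /\Astart(.*)end\z/s;
print $body;
```

Because the .* is greedy and the pattern is anchored at both ends, the 'end' inside the first line is treated as ordinary data, which is the behaviour this answer is after.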

Remember, simple is better.

little_mistress@mainhall.com

Answer: How do I extract all text between two keywords like start and end?
contributed by Anonymous Monk

Hi guys. This seems to work. I'd like some comments on this. I'm trying to further my perl education. Thanks. -Ty

    $line = "start this is the start this is another end first end start hello start inside of hello end there end";
    while ($line=~ s/\s*\bstart\b(?!.*?\bstart\b) (.+?) \bend\b//) {
        print $1, "\n";
    }

tkroll@hawaii.edu
Answer: How do I extract all text between two keywords like start and end?
contributed by Punto

($content) = $string =~ m/ start (.*) end /;
That will get the stuff between " start " and " end " into $content.
