Reading particular line which repeats itself many times in text

by uday_sagar (Scribe)
on Jan 27, 2012 at 11:36 UTC (#950335=perlquestion)
uday_sagar has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, here is the format of my text, from which I need to grab one part.

BEGIN hi
abc
def
ghi
jkl
END
BEGIN hello
abc
pqr
stu
mno
END

As you can see above, every BEGIN has its own END. I want the snippet from BEGIN hi to its own END (not the END of BEGIN hello). The code I have written is shown below, but it pulls the lines from the desired BEGIN all the way to the END of the other BEGIN.

open DATA, "</hello.txt";
my @hello = <DATA>;
foreach my $k (@hello) {
    if ($k=~/BEGIN ps7_slcr/../END/) {
        print $k;
    }
}

Any solution? Thanks in Advance, Uday Sagar.

Replies are listed 'Best First'.
Re: Reading particular line which repeats itself many times in text
by choroba (Chancellor) on Jan 27, 2012 at 11:58 UTC
    If you want to use $k instead of $_, you have to use it everywhere:
    if ($k =~ /BEGIN/ .. $k =~ /END/)
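A minimal sketch of that fix applied to the loop from the question, assuming the sample data from the post with BEGIN and END each on their own line. The helper name extract_block and the inline data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines from "BEGIN $name" through the
# first END that follows it, using the range (flip-flop) operator.
sub extract_block {
    my ($name, @lines) = @_;
    my @block;
    foreach my $k (@lines) {
        # Both operands are matched against $k explicitly. A bare /END/
        # on the right-hand side would be matched against $_ instead.
        if ($k =~ /^BEGIN \Q$name\E$/ .. $k =~ /^END$/) {
            push @block, $k;
        }
    }
    return @block;
}

# Sample data in the shape shown in the question (assumed one item per line).
my @hello = map { "$_\n" } (
    'BEGIN hi',    'abc', 'def', 'ghi', 'jkl', 'END',
    'BEGIN hello', 'abc', 'pqr', 'stu', 'mno', 'END',
);

print extract_block('hi', @hello);
```

The range operator remembers its state from one iteration to the next, which is why it can switch on at the wanted BEGIN and off again at the first END after it.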
Re: Reading particular line which repeats itself many times in text
by toolic (Bishop) on Jan 27, 2012 at 13:26 UTC
    Additionally, if your posted data actually had the string "ps7_slcr" in it, and if you were using warnings, you would have gotten warning messages of the type:
    Use of uninitialized value $_ in pattern match (m//)
Re: Reading particular line which repeats itself many times in text
by chessgui (Scribe) on Jan 28, 2012 at 08:30 UTC
    I don't know whether there is a fancy regular expression for this, but in cases where there are blocks I always use split (I consider it safer).
    my @begin_blocks = split /BEGIN $name/, $string_containing_blocks;
    my $block = $begin_blocks[1];
    $block =~ s/END\s*$//;
    If there is no match, $block ends up undefined and prints as an empty string (since @begin_blocks then has only the zero index element: the string itself). This won't work if BEGIN blocks can contain the string 'BEGIN' inside them. If that is necessary you can use an escape sequence: replace every 'BEGIN' inside a block with 'BEGIN_'. With such an encoding the search looks like this (one extra line decodes BEGIN_):
    my @begin_blocks = split /BEGIN $name/, $string_containing_blocks;
    my $block = $begin_blocks[1];
    $block =~ s/END\s*$//;
    $block =~ s/BEGIN_/BEGIN/g;
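One thing to watch with the split above: when the string holds several blocks, everything after the wanted `BEGIN $name` lands in `$begin_blocks[1]`, later blocks included, and only the very last END is trimmed. A hedged variant of the same idea, assuming every block starts a line with BEGIN followed by a one-word name and that block bodies never contain a line starting with BEGIN, splits on every BEGIN and then picks the chunk by name. The helper block_by_name is hypothetical, not something from this reply.

```perl
use strict;
use warnings;

# Hypothetical helper: split the text on every BEGIN marker, then return the
# body of the chunk whose name is $name, without its closing END line.
sub block_by_name {
    my ($name, $text) = @_;
    foreach my $chunk (split /^BEGIN[ \t]+/m, $text) {
        # Each non-empty chunk looks like "name\nbody lines...\nEND\n".
        if ($chunk =~ /\A\Q$name\E[ \t]*\n(.*?)^END\s*\z/ms) {
            return $1;
        }
    }
    return;    # no block with that name
}

my $text = "BEGIN hi\nabc\ndef\nghi\njkl\nEND\n"
         . "BEGIN hello\nabc\npqr\nstu\nmno\nEND\n";

print block_by_name('hi', $text);
```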

      Monks, thanks for your replies. I have solved the problem by combining a non-greedy quantifier (?) with reading the whole file into a single scalar variable (File::Slurp is loaded as well). Here is the code.

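      use File::Slurp;      # loaded, but the file is actually slurped by hand below
      local $/ = undef;     # no record separator: one read returns the whole file
      open my $fh, '<', "my_text.txt" or die "Couldn't open file: $!";
      binmode $fh;
      my $string = <$fh>;
      close $fh;
      # Non-greedy (.*?) stops at the first END; /s lets . match newlines too.
      if ($string =~ /BEGIN hi(.*?)END/s) {
          print $1, "\n";
      }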
Re: Reading particular line which repeats itself many times in text
by sundialsvc4 (Abbot) on Jan 27, 2012 at 13:42 UTC

    What is probably biting you in this case is the default “greedy” behavior of a regex.   From perldoc perlre ...

    By default, a quantified subpattern is “greedy.”   That is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.   If you want it to match the minimum number of times possible, follow the quantifier with a “?”.   (examples, not shown here, then follow in the perldoc)

    The longest possible string that matches “BEGIN .* END” extends from the first BEGIN, to the last END.   Which is what you are getting, but not what you want.
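To make the greedy versus non-greedy difference concrete, here is a small sketch run against the sample data from the question held in one string. The variable names are illustrative, and the /s modifier is added so that . can cross newlines.

```perl
use strict;
use warnings;

my $text = "BEGIN hi\nabc\ndef\nghi\njkl\nEND\n"
         . "BEGIN hello\nabc\npqr\nstu\nmno\nEND\n";

# Greedy: .* takes as much as it can, so the match runs to the LAST END.
my ($greedy) = $text =~ /BEGIN hi(.*)END/s;

# Non-greedy: .*? takes as little as it can, so the match stops at the
# FIRST END after "BEGIN hi".
my ($lazy) = $text =~ /BEGIN hi(.*?)END/s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture swallows the whole "BEGIN hello" block as well, while the non-greedy one contains only the lines between "BEGIN hi" and its own END.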

    The way that I usually choose to solve a problem like this, if I am not using a parser like Parse::RecDescent as I usually do, is with a regular expression something like /(BEGIN|END)(.*?)(BEGIN|END)/ig ... which uses the “repeating” modifier (so it can be used in a loop against a single string), and which grabs (as $1, $2, $3) all three of the pieces that I am looking for. My code then checks that (($1 =~ /BEGIN/i) && ($3 =~ /END/i)), to verify that the bounding delimiters that were actually found were the pair that I expected (otherwise, we have a syntax error, mismatched begin/end), and to do so in a case-insensitive way. What lies between is $2.
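A rough sketch of the loop described above, assuming the non-greedy form (.*?) and the /s modifier so that a block body can span several lines. The \b anchors and the helper name blocks_between are additions for illustration, not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text found between each pair of
# delimiters, and dies if a pair is not a BEGIN followed by an END.
sub blocks_between {
    my ($text) = @_;
    my @blocks;
    while ($text =~ /\b(BEGIN|END)\b(.*?)\b(BEGIN|END)\b/gis) {
        # Copy the captures first: the checks below run their own matches.
        my ($open, $body, $close) = ($1, $2, $3);
        die "mismatched begin/end: found $open ... $close\n"
            unless $open =~ /BEGIN/i && $close =~ /END/i;
        push @blocks, $body;
    }
    return @blocks;
}

my $text = "BEGIN hi\nabc\ndef\nghi\njkl\nEND\n"
         . "BEGIN hello\nabc\npqr\nstu\nmno\nEND\n";

print "[$_]\n" for blocks_between($text);
```

Each returned body still starts with the block name (" hi", " hello"), so picking one particular block means checking that prefix afterwards.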

      As the poster isn't using “BEGIN .* END”, how is this relevant?

      You do realise that if($k=~/BEGIN ps7_slcr/../END/) { is applying two regexen to two different strings, at two different times, don't you?
