PerlMonks  

command line perl command to get between lines with non greedy match

by ravi_perl_monks (Initiate)
on Jan 17, 2020 at 19:31 UTC (#11111545=perlquestion)

ravi_perl_monks has asked for the wisdom of the Perl Monks concerning the following question:

    PATTERN1 SOME INFO
    TEXT1
    TEXT2
    TEXT3
    PATTERN2 SOME INFO
    PATTERN1 SOME INFO
    TEXT4
    TEXT5
    TEXT6 PATTERN3 SOME INFO

I know the following code perl -ne 'print if (/PATTERN1/../PATTERN3/)' is a greedy match and prints everything.

What I want is to print the following output:

    PATTERN1 SOME INFO
    TEXT4
    TEXT5
    TEXT6

    PATTERN3 SOME INFO

Note this is an extremely large file, so I can't put the whole file into a string.

Thanks, Ravi

Replies are listed 'Best First'.
Re: command line perl command to get between lines with non greedy match
by GrandFather (Sage) on Jan 17, 2020 at 21:54 UTC

    One strategy is to read a line at a time and, once PATTERN1 is found, save lines until either PATTERN3 is found (and the saved lines are printed) or some other pattern is found (and the saved lines are discarded). That may be a bit much to do cleanly as a one-liner, so bite the bullet and write a script to do the work. The script can be called using a single command line, so you haven't lost any convenience of use by writing the script.
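
The strategy described above can be sketched as a small script. This is only a minimal sketch under stated assumptions, not a definitive implementation: it assumes that the "other pattern" that discards the saved lines is simply another PATTERN1, and that the patterns may match anywhere in a line. The sub name `shortest_blocks` and the in-memory sample are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every block that runs from the closest
# preceding PATTERN1 line up to and including a PATTERN3 line.
# Assumption: a new PATTERN1 discards whatever was buffered so far.
sub shortest_blocks {
    my ($fh) = @_;
    my (@buffer, @blocks);
    my $in_block = 0;
    while (my $line = <$fh>) {
        if ($line =~ /PATTERN1/) {
            @buffer   = ();       # discard lines saved from an earlier PATTERN1
            $in_block = 1;
        }
        next unless $in_block;    # ignore lines before any PATTERN1
        push @buffer, $line;
        if ($line =~ /PATTERN3/) {
            push @blocks, join '', @buffer;
            @buffer   = ();
            $in_block = 0;
        }
    }
    return @blocks;
}

# Sample data taken from the question, held in memory for the demo.
my $sample = join "\n",
    'PATTERN1 SOME INFO', 'TEXT1', 'TEXT2', 'TEXT3', 'PATTERN2 SOME INFO',
    'PATTERN1 SOME INFO', 'TEXT4', 'TEXT5', 'TEXT6 PATTERN3 SOME INFO', '';

open my $fh, '<', \$sample or die "in-memory open: $!";
print for shortest_blocks($fh);
close $fh;
```

Only the lines since the most recent PATTERN1 are held in memory while scanning. For a very large file with many blocks it would probably be better to print each block as soon as PATTERN3 is seen instead of collecting them in `@blocks`.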

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: command line perl command to get between lines with non greedy match
by haukex (Chancellor) on Jan 18, 2020 at 17:54 UTC
    Note this is an extremely large file, so I can't put the whole file into a string.

    How large is the section you want to read, does that fit into memory? One possible approach is to buffer the lines in an array, as several monks have shown. Just for fun, I thought about what a script like this might do if you didn't want to read the whole file nor the section being searched for into memory; in that case you could scan the file and remember the byte offsets of the strings you're looking for. In the following, I'm reading that data in, but that's not required, you could do something else with those byte offsets. Note that the code below only works with bytes, not Unicode characters.

        use warnings;
        use strict;

        my $file = 'in.txt';
        my ($start,$end);
        open my $fh, '<:raw', $file or die "$file: $!";
        my $offset = 0;
        while (<$fh>) {
            $start = $offset if /PATTERN1/;
            $offset = tell $fh or die "tell: $!";
            $end = $offset if /PATTERN3/;
        }
        die "Failed to find second pattern after first pattern"
            unless defined $start && defined $end && $end > $start;
        seek $fh, $start, 0 or die "seek: $!";
        my $bytes = $end-$start;
        read($fh, my $data, $bytes)==$bytes
            or die "failed to read $bytes bytes";
        close $fh;
        print $data;
Re: command line perl command to get between lines with non greedy match
by GrandFather (Sage) on Jan 17, 2020 at 22:01 UTC

    Except for very minor edits it is usual to note in the node that you have updated it. In this case you have completely reformatted the node (which is good), but left several replies looking silly.

Re: command line perl command to get between lines with non greedy match
by tybalt89 (Parson) on Jan 18, 2020 at 20:47 UTC

    Here's one that uses almost no storage, by seeking back to the last PATTERN1 and re-reading the input file. Note that this will not work on a pipe.

        #!/usr/bin/perl

        use strict; # https://perlmonks.org/?node_id=11111545
        use warnings;

        my $fh = *DATA; # FIXME to your input file, DATA only used for testing
        my $lastpattern1;   # byte offset of the most recent PATTERN1 line
        while( <$fh> )
          {
          if( /PATTERN1/ )
            {
            $lastpattern1 = tell($fh) - length $_;
            }
          # defined(): offset 0 is valid when PATTERN1 is on the first line of a real file
          elsif( defined $lastpattern1 and /PATTERN3/ )
            {
            seek $fh, $lastpattern1, 0;   # go back and re-read the block
            while( <$fh> )
              {
              my $end = s/ (?=PATTERN3)/\n\n/;   # split TEXT6 from PATTERN3
              print;
              $end and last;
              }
            $lastpattern1 = undef;
            }
          }

        __DATA__
        PATTERN1 SOME INFO
        TEXT1
        TEXT2
        TEXT3
        PATTERN2 SOME INFO
        PATTERN1 SOME INFO
        TEXT4
        TEXT5
        TEXT6 PATTERN3 SOME INFO
        PATTERN1 SOME INFO
        TEXT1
        TEXT2
        TEXT3
        PATTERN4 SOME INFO
        PATTERN1 SOME INFO
        TEXT4
        TEXT55
        TEXT6 PATTERN3 SOME INFO

    I also do the fix up on the PATTERN3 line, though I'm curious if that was just a typo on your part?

Re: command line perl command to get between lines with non greedy match (updated)
by AnomalousMonk (Bishop) on Jan 17, 2020 at 21:51 UTC

    Please use  <code> ... </code> code tags for code, command line invocations, error messages and input/output data. Please see Writeup Formatting Tips, Markup in the Monastery and How do I post a question effectively? (and hopefully also How do I change/delete my post?). Please see also Short, Self-Contained, Correct Example.

    Something like this (insofar as I understand your input) can be done as a command line one-liner, but it gets messy and I prefer to write a script for it, something like
        perl print-chunks.pl  < input.file  > output.file
    that might look like (untested | semi-tested):

        use strict;
        use warnings;

        my $rx_start = qr{ \A \s* PATTERN1 }xms;
        my $rx_stop  = qr{ \A \s* PATTERN3 }xms;

        my @records;

        RECORD:
        while (my $record = <STDIN>) {
            if ($record =~ $rx_start) {
                # push @records, $record;  # UPDATE: NO: still "greedy" extraction of records/lines
                @records = $record;        # UPDATE: FIXED: only extracts BETWEEN start/stop patterns
                next RECORD;
                }
            if ($record =~ $rx_stop) {
                print @records, $record;
                @records = ();
                next RECORD;
                }
            push @records, $record if @records;
            }

        exit;

    Update: Example code fixed to extract records only between the start/stop patterns. Any reformatting of output that may be needed is still not addressed.


    Give a man a fish:  <%-{-{-{-<

Re: command line perl command to get between lines with non greedy match
by LanX (Archbishop) on Jan 17, 2020 at 22:36 UTC
    Thanks for editing! :)

    > is a greedy match

    That's not the accurate term, it's just a multiple match and you only want the last one.

    One way to achieve this in a one-liner is not to print all the matches but to store them in an array and to only print the last match in an END{} block.

    The difficulty here is to always reset the array for previous matches.

    > Note this is an extremely large file, so I can't put the whole file into a string.

    In this case it might be better to go reverse and read a sliding window from the end.

    But I don't know how to do this with a one-liner.

    To decide this, one needs to know how "large" is "extreme"?
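
Reading from the end, as suggested above, is awkward as a one-liner but can be sketched as a script. The following is a hedged sketch, not a definitive implementation. The sub name `last_block_from_end`, its parameters, and the default chunk size are hypothetical. It matches literal byte strings rather than regexes, and the window grows backwards from the end of the file rather than truly sliding, so memory use is proportional to the distance between the wanted block and the end of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: scan $file backwards in $chunk-byte steps and return
# the last block that starts at a line containing $start_str and ends at the
# last line containing $stop_str. Literal strings and bytes only.
sub last_block_from_end {
    my ($file, $start_str, $stop_str, $chunk) = @_;
    $chunk ||= 64 * 1024;
    open my $fh, '<:raw', $file or die "$file: $!";
    my $pos = (-s $fh) || 0;
    my $buf = '';
    while ($pos > 0) {
        my $len = $pos < $chunk ? $pos : $chunk;
        $pos -= $len;
        seek $fh, $pos, 0 or die "seek: $!";
        read($fh, my $piece, $len) == $len or die "short read on $file";
        $buf = $piece . $buf;                        # window now covers $pos .. EOF

        my $stop = rindex $buf, $stop_str;           # last stop marker in the window
        next if $stop < 0;
        my $start = rindex $buf, $start_str, $stop;  # last start marker before it
        next if $start < 0;
        my $bol = rindex $buf, "\n", $start;         # newline preceding the start line
        next if $bol < 0 && $pos > 0;                # start line may begin in an earlier chunk
        my $eol = index $buf, "\n", $stop;           # newline ending the stop line
        $eol = length($buf) - 1 if $eol < 0;         # stop line is the unterminated last line
        close $fh;
        return substr $buf, $bol + 1, $eol - $bol;
    }
    close $fh;
    return;                                          # no complete block found
}
```

A call such as `print last_block_from_end('in.txt', 'PATTERN1', 'PATTERN3');` would print the last block, assuming `in.txt` is a regular, seekable file. Like any seek-based approach, this will not work on a pipe.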

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

    I think I misunderstood your problem, see Re: command line perl command to get between lines with non greedy match for another approach.

      > The difficulty here is to always reset the array for previous matches.

      Too lazy for a full example, look at this demo code in the debugger

            DB<72> map {if ($x=(/b/../d/)) { $out[$x]=$_; $last=$x }} a..e,a..e,a..b,1..3,d..e;

            DB<73> x @out[1..$last]
          0  'b'
          1  1
          2  2
          3  3
          4  'd'
            DB<74>

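      $x is actually the sequence number of the current line within the flip-flop match; it restarts at 1 for each new range, so it is reset 3 times here.

      We keep the most recent $x in $last, so after the final range $last holds the length of that range.

      You only need to add END{ print @out[1..$last] } to your one-liner to eject just these last lines.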
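
As a standalone illustration of the sequence numbers the range operator returns in scalar context (documented in perlop), here is a small sketch. The item list is made up for the demo, and `-` is just a placeholder printed where the flip-flop is false.

```perl
use strict;
use warnings;

my @seen;
for (qw(a b c d e a b 1 d e)) {
    # '' outside a range; 1, 2, ... inside; the last item of a range gets "E0" appended
    my $x = /b/ .. /d/;
    push @seen, length($x) ? "$_=$x" : "$_=-";
}
print "@seen\n";
# prints: a=- b=1 c=2 d=3E0 e=- a=- b=1 1=2 d=3E0 e=-
```

Because a value like "3E0" is numerically 3, it can be used directly as an array index, which is what the debugger demo above relies on.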


      update

      assuming PATTERN2 and PATTERN3 are similar

          >perl -ne"if ($x=(/PATTERN1/.../PATTERN?/)) { $out[$x]=$_; $last=$x; }; END{ print @out[1..$last] }" input
          PATTERN1 SOME INFO
          TEXT4
          TEXT5
          TEXT6 PATTERN3 SOME INFO

          C:\tmp\files>
Re: command line perl command to get between lines with non greedy match
by tybalt89 (Parson) on Jan 18, 2020 at 20:29 UTC

    Assuming from your example that you are on a Linux system, you should have "tac", a filter program that reverses a file line by line. So:

    tac somefilename | perl -ne 'print if /PATTERN3/../PATTERN1/' | tac

    works on my ArchLinux system, and all you then need to do is fix up the extra TEXT on the PATTERN3 line.

Re: command line perl command to get between lines with non greedy match
by AnomalousMonk (Bishop) on Jan 17, 2020 at 22:10 UTC

    Input:

        TEXT6 PATTERN3 SOME INFO

    Output:

        TEXT6

        PATTERN3 SOME INFO

    Is it true that your input has  TEXT6 and the terminating  PATTERN3 string on the same line, and that the output should be reformatted so that they are on separate lines separated by a blank line? (BTW: Thanks for editing your post, but you left no citation of any change (update: please see How do I change/delete my post?).)



Re: command line perl command to get between lines with non greedy match (no flip-flop)
by LanX (Archbishop) on Jan 18, 2020 at 15:48 UTC
    Seems like avoiding the range operator is the trick.

    It's kind of an iterator version of print grep /PATTERN3/, split /PATTERN1/, slurp("input")

        C:\tmp\files>perl -nE" @o=() if /PATTERN1/; push @o,$_; say qq(<@o>) if /PATTERN3/ " input2
        <PATTERN1 SOME INFO
         TEXT4
         TEXT5
         TEXT6 PATTERN3 SOME INFO
        >
        <PATTERN1 SOME INFO
         TEXT4
         TEXT5
         TEXT6 PATTERN3 SOME INFO
        >

        C:\tmp\files>

    NB: this will also fire if PATTERN3 appears before the first PATTERN1!

    If that's a problem, use a flag.


Re: command line perl command to get between lines with non greedy match
by LanX (Archbishop) on Jan 17, 2020 at 23:25 UTC
    I probably misunderstood your problem. This prints the last shortest match between PATTERN1 and PATTERN3:

        C:\tmp\files>perl -ne"@out=() if /PATTERN1/; push @out,$_ if /PATTERN1/../PATTERN3/; END{ print @out }" input
        PATTERN1 SOME INFO
        TEXT4
        TEXT5
        TEXT6 PATTERN3 SOME INFO

        C:\tmp\files>

    NB: on Linux you'll need to replace " with '


      This will print all shortest records between Pattern 1 and 3

      I just doubled your input.

          C:\tmp\files>perl -nE"if ($x=(/PATTERN1/../PATTERN3/)) { @out=() if /PATTERN1/; push @out,$_; print @out if $x=~/E0$/ }" input2
          PATTERN1 SOME INFO
          TEXT4
          TEXT5
          TEXT6 PATTERN3 SOME INFO
          PATTERN1 SOME INFO
          TEXT4
          TEXT5
          TEXT6 PATTERN3 SOME INFO

          C:\tmp\files>

      update

      a bit cleaner

          C:\tmp\files>perl -nE" $first=/PATTERN1/; $last=/PATTERN3/; if ( $first..$last) { @o=() if $first; push @o,$_; say @o if $last }" input2
          PATTERN1 SOME INFO
          TEXT4
          TEXT5
          TEXT6 PATTERN3 SOME INFO

          PATTERN1 SOME INFO
          TEXT4
          TEXT5
          TEXT6 PATTERN3 SOME INFO



Re: command line perl command to get between lines with non greedy match
by LanX (Archbishop) on Jan 17, 2020 at 21:49 UTC
    Unfortunately that's almost unreadable and even the original is only one long line.

    Please click the edit button and reformat your post to make it readable.

    Then please use <code>...</code> and <p> tags.


    Update

    OP added tags in the meantime. :)
