Grab 3 lines before and 2 after each regex hit

by HarryPutnam (Novice)
on Apr 24, 2014 at 14:09 UTC

HarryPutnam has asked for the wisdom of the Perl Monks concerning the following question:

I'm hoping there might be a module that can be made to do what I want (grab 3 lines before and 2 after each regex hit). This would be in common text files.

Seems like it would be as popular as gnu grep's -Anum -Bnum operators.

If not a module... then a brief walk-through of how it might be done so that I can write it myself... It's OK if it's primitive. I do have some experience... just not much.

I'm just thinking it must be a fairly common desire for hardened 'perlers', and maybe there is a common way of doing it.

Replies are listed 'Best First'.
Re: Grab 3 lines before and 2 after each regex hit
by InfiniteSilence (Curate) on Apr 24, 2014 at 14:35 UTC

    This is a fairly primitive way to do it:

    #!/usr/bin/perl -w
    use strict;
    my @lines = <DATA>;
    for(1..$#lines) {
        if($lines[$_]=~m/[^\d]+\d+/){
            print qq~
            $lines[$_-3]
            $lines[$_-2]
            $lines[$_-1]
            $lines[$_]
            $lines[$_+1]
            $lines[$_+2]
            ~;
        }
    }
    1;
    __END__
    alpha
    beta
    something
    a07607
    b-alpha
    b-beta
    b-something
    b-something else
    c-alpha
    c-beta
    c-somethin
    a9706
    d-alpha
    d-beta
    d-something
    d-something else

    produces...

    alpha
    beta
    something
    a07607
    b-alpha
    b-beta

    c-alpha
    c-beta
    c-somethin
    a9706
    d-alpha
    d-beta

    Celebrate Intellectual Diversity

      > This is a fairly primitive way to do it:

      using a sliding window (safer with huge streams)

      use strict;
      use warnings;
      use Data::Dump;

      my @window;
      push @window, scalar <DATA> for 1..5;   # init

      while (my $line = <DATA>) {
          push @window, $line;
          chomp @window;
          if( $window[3] =~ m/[^\d]+\d+/ ){
              dd \@window;
          }
          shift @window;
      }

      __END__
      alpha
      beta
      something
      a07607
      b-alpha
      b-beta
      b-something
      b-something else
      c-alpha
      c-beta
      c-somethin
      a9706
      d-alpha
      d-beta
      d-something
      d-something else

      -->
      ["alpha", "beta", "something", "a07607", "b-alpha", "b-beta"]
      ["c-alpha", "c-beta", "c-somethin", "a9706", "d-alpha", "d-beta"]

      Cheers Rolf

      ( addicted to the Perl Programming Language)

      update

      maybe more elegant

      use strict;
      use warnings;
      use Data::Dump;

      my @window;

      while (my $line = <DATA>) {
          push @window, $line;
          next if @window < 6;   # init
          if( $window[3] =~ m/[^\d]+\d+/ ){
              dd \@window;
          }
          shift @window;
      }

      Update

      Oh, the latter (more elegant) approach has a clear advantage: if you want to avoid overlapping results, you just need to empty the window after a match and it gets refilled automatically. :)
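The "empty the window after a match" idea could be sketched as follows. This is only an illustration: the helper name `windows_without_overlap` and the shortened sample data are assumptions, not code from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one array ref per hit, each holding the
# 3 lines before the hit, the hit itself, and the 2 lines after it.
sub windows_without_overlap {
    my ($re, @lines) = @_;
    my (@window, @hits);
    for my $line (@lines) {
        push @window, $line;
        next if @window < 6;          # still (re)filling the window
        if ($window[3] =~ $re) {
            push @hits, [@window];    # keep a copy of the 6 lines
            @window = ();             # empty it so results cannot overlap
            next;
        }
        shift @window;                # no hit: slide forward by one line
    }
    return @hits;
}

# Sample data (one word per line, shortened from the thread's example).
my @data = qw(alpha beta something a07607 b-alpha b-beta b-something
              c-alpha c-beta c-somethin a9706 d-alpha d-beta d-something);
print join(' ', @$_), "\n" for windows_without_overlap(qr/[^\d]+\d+/, @data);
```

One trade-off of this sketch: a hit that falls in the first three lines of the input, in the last two, or in the first three lines after the window was emptied is never tested at index 3 and is therefore skipped.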

        The sliding window sounds like another great suggestion

        Thank you.

      Your technique answers the need nicely...
      thank you

      for(1..$#lines)
      {
          if($lines[$_]=~m/[^\d]+\d+/){
               print qq~
      ....... ...         
               ~;
      

      I guess that `qq~' operates something like a here document?
      Can you explain a bit?
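For what it's worth, `qq` is Perl's generic double-quote operator, and the `~` here is just the delimiter chosen for it. A small sketch (the variable names are made up for illustration) comparing it with a here-document:

```perl
use strict;
use warnings;

my $name = 'monk';

# qq with a custom delimiter: everything between the two ~ is the string,
# variables interpolate, and embedded newlines are kept as written.
my $with_qq = qq~Hello, $name
second line
~;

# The same text written as a here-document.
my $with_heredoc = <<"END";
Hello, $name
second line
END

print $with_qq;
print "identical\n" if $with_qq eq $with_heredoc;
```

So a multi-line `qq~...~` behaves much like a double-quoted here-document. The main practical difference is that the string ends at the closing `~` rather than at a terminator line.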

      Can we go a little deeper into the intended usage of the techniques mentioned in this thread?

      I haven't understood everything that has been presented, but enough to use some of the information posted and complete a working script for my purpose soon.

      There was some talk of slurping sections or even whole files.
      On that topic, let me explain very briefly what the intended usage is. The code will be used to search and extract through some fairly massive piles of files at times.

      Once File::Find is added into the script, it will likely be expected to recurse through usenet-style hierarchies (hierarchies of my own creation, so smaller than real ones) that might consist of as many as 45000-55000 messages in total (not per group).

      So, with that scale of usage in mind, would slurping of whole files still be a wise way to go? Or would that be so labor intensive as to make it worthwhile to do it a different way?
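One way to sidestep the slurping question is to read each file a line at a time while File::Find walks the tree, so memory use depends on the current line and not on the number of files. A minimal sketch, assuming a hypothetical helper `scan_tree` and a placeholder directory name:

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $top, read each plain file one line at a time,
# and return [file, line number, line] for every line matching $re.
sub scan_tree {
    my ($top, $re) = @_;
    my @hits;
    find({
        no_chdir => 1,                     # $_ then holds the full path
        wanted   => sub {
            my $path = $_;
            return unless -f $path;        # skip directories and the like
            open my $fh, '<', $path
                or do { warn "skip $path: $!\n"; return };
            while (my $line = <$fh>) {     # only one line held at a time
                push @hits, [$path, $., $line] if $line =~ $re;
            }
            close $fh;
        },
    }, $top);
    return @hits;
}

# Usage (the directory name is a placeholder):
# printf "%s:%d: %s", @$_ for scan_tree('/path/to/news', qr/[^\d]+\d+/);
```

The sliding-window logic from earlier in the thread could replace the simple match inside the `while` loop. Slurping one message at a time would probably also be fine for files of ordinary message size, since each file's contents can be released before the next one is opened.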

      Marvelously elegant, if the file-size is not too big ... as these days it is unlikely to be. ++
        This is the usual sycophantic crap you post after you've been called out on a series of junk posts filled with lies and bad advice. Anyone reading your post history will be familiar with this pattern.
Re: Grab 3 lines before and 2 after each regex hit
by clueless newbie (Curate) on Apr 24, 2014 at 15:32 UTC
    Andy Lester's ack will do the trick.
      `ack' does look pretty interesting... thank you.
Re: Grab 3 lines before and 2 after each regex hit
by Anonymous Monk on Apr 24, 2014 at 14:44 UTC

    I can think of two approaches off the top of my head, other monks will likely have MTOWTDI:

    1. Read multiple lines at a time (if your files aren't too big, just read the whole file), then write a regex that matches your pattern and also captures the lines before and after it. Writing a regex that matches multiple lines is not too difficult once you've read about the following topics in perlre: the /m and /s modifiers, and the exact meaning of ^, $, ., \s and \n. Also, see the perlfaq6 entry "I'm having trouble matching over more than one line. What's wrong?"

    Here's a somewhat inelegant regex that captures the lines before and after a match:

    my $input = "line1\nline2\nline3 foo\nline4\nline5";
    my ($before,$match,$after) = $input=~/^
        (?:(.*)\n)?
        (.*foo.*)
        (?:\n (?:(.*)\n?)? )?
        /xm;
    print "before=<$before>, match=<$match>, after=<$after>\n";

    2. Keeping a buffer of lines, i.e. an array which always contains the most recent N lines. Such an array could be managed via push and shift. In other words, a sliding window of sorts. This approach would probably be considered less "perlish" than the first, but might be more efficient on large files. Actually, Tie::File may be better than managing the array yourself.
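The buffer approach in point 2 might look roughly like the sketch below. The helper name `grep_context` and its parameters are assumptions made for illustration. It keeps the most recent lines for the "before" context and counts down the "after" lines still owed, which is one way to get behaviour similar to grep's -B and -A options.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh line by line and return the hit lines
# together with up to $before lines of leading and $after lines of
# trailing context. Each input line is returned at most once.
sub grep_context {
    my ($fh, $re, $before, $after) = @_;
    my (@buffer, @out);
    my $remaining = 0;                   # "after" lines still owed
    while (my $line = <$fh>) {
        if ($line =~ $re) {
            push @out, @buffer, $line;   # buffered context, then the hit
            @buffer    = ();
            $remaining = $after;
        }
        elsif ($remaining > 0) {
            push @out, $line;            # trailing context of an earlier hit
            $remaining--;
        }
        else {
            push @buffer, $line;         # possible "before" context
            shift @buffer if @buffer > $before;
        }
    }
    return @out;
}

# Demo on an in-memory file handle using part of the thread's sample data.
my $text = join '', map { "$_\n" } 'alpha', 'beta', 'something', 'a07607',
                                   'b-alpha', 'b-beta', 'b-something';
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
print grep_context($fh, qr/[^\d]+\d+/, 3, 2);
```

Unlike the fixed six-line window, this version also reports hits near the start or end of the input, and hits that sit close together are merged into one run of lines instead of being printed twice.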

    InfiniteSilence just posted an answer that uses the array approach.

      "I can think of two approaches off the top of my head, other monks will likely have MTOWTDI:"

      Thank you. This may be a very promising approach to investigate. I was a bit worried about all that slurping.
