
Another take. This one discards the leading word (or word fragment) of each excerpt, unless the excerpt starts at the very beginning of the text.

#!/usr/bin/perl

use strict;
use warnings;

die "No search terms supplied!" unless @ARGV;
my @words = @ARGV;

my $text;
{
    local $/ = undef;
    $text = <DATA>;
}


my $regex = join ( "|", @words ); # Words to highlight
my $expr = qr /(?i)($regex)/;     # Compile regex
my $glen = 20;                    # Characters before and after the end of match to grab.


{
  no warnings 'uninitialized'; 
  my ( $ls, $le, @results );  # $ls=prev span start, $le=prev span end, @results=results destination
  
  # Mark up any matches, or exit the block if there are none.
  last unless $text =~ s/\b($expr)\b/[$1]/gi; 

  while ( $text =~ m/\b($expr)\b/sg or $le <= length ($text) ) {    
    my ($ipos,$spos,$epos); # char span positions
    
    if ($ipos  = pos($text)) { # If the last match succeeded
        $spos  = $ipos - $glen > 0 ? $ipos - $glen : 0;       # Range check
        $epos = $ipos + $glen < length($text) ? $ipos + $glen : length($text);

        # Assign to ($ls,$le) if this is our first time through and next.
        ( $ls, $le ) = ( $spos, $epos ) and next unless $le;  
    }

    if ( $spos and $spos < $le ) {    # If we have a match and it intersects the last match
        $le = $epos;                  # merge  overlapping char spans
    }
    else {
        # Lose the first word (possible fragment) unless the match is the first word.
        $ls = index($text," ", $ls) + 1  unless ($ls == 0); 
        push @results,substr( $text, $ls, $le - $ls ) ;
        ( $ls, $le ) = ( $spos, $epos );                    # Set "last position" to current.
    }
    last unless defined $spos;                              # End unless we have one more match
  }                                                     

  print '"',$_,'..."', "\n" foreach @results;
}
__DATA__
Regular expressions have always been a weak spot for me, and I've got a
 question that's got me stumped. Here's the problem I'm trying to solve.
 I have somewhat large articles of text (returned from a search), what I'd
 like to do is capture the word and X number of words before and after it
 while tagging the matching word in the captured text. My inital thought
 was to try something like this. The problem I have is that if there is
 more than one term and they overlap, the nth term will not be annotated.
 So my next thought is lookahead/lookbehind, but they don't capture.
 Is there a way to do this with a single regex? Is a regex even the best
 way to do this? Thanks, -Lee

-Lee

perl digital dash (in progress)

In reply to Re: Regex: Matching around a word(s) by shotgunefx
in thread Regex: Matching around a word(s) by shotgunefx
