Re: Regex: Matching around a word(s)

Seeing your output I tried to go back to the original questions:

Is there a way to do this with a single regex?

My answer here is to present a solution with "one and a half" regexes ;-) (see below)

Is a regex even the best way to do this?

Depends, I would say. Obviously, if the regex gets too convoluted, it is unlikely to be maintainable. OTOH, a solution with a lot of position calculation is more likely to suffer from off-by-one errors at the borders.

So here's my take:

use strict;
use warnings;
use Data::Dumper ();

die "No search terms supplied!" unless @ARGV;
my @words = @ARGV;

my $text = do { local $/ = undef; <DATA> };

my $blen = 20;    # (max) chars before a matching word to take with us
my $alen = 20;    # (max) chars after a matching word to take with us
my $jlen = $blen+$alen;    # (max) chars between 2 matching words captured together

my $strwords = join "|" => map quotemeta, @words; # Words to highlight
my $rxwords = qr/\b(?i:$strwords)\b/;    # ... compiled highlight word match
my $expr = qr/\b(?!\s)(?s:.{0,$blen}$rxwords(?:.{0,$jlen}$rxwords)*.{0,$alen}(?:(?<=\s)|[^\s]*\b))/;

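# Every captured snippet contains at least one search word, so s///g is
# always true and grep keeps each (now marked-up) snippet.
my $D = Data::Dumper->new(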
    [[grep {s/($rxwords)/[$1]/g} $text =~ /($expr)/g]], ['matched']
)->Indent(1);
print $D->Dump();

# reformatted the DATA to look nicer in the post
__DATA__
Regular expressions have always been a weak spot for me, and
I've got a question that's got me stumped.  Here's the
problem I'm trying to solve.  I have somewhat large articles
of text (returned from a search), what I'd like to do is
capture the word and X number of words before and after it
while tagging the matching word in the captured text.  My
inital thought was to try something like this.  The problem I
have is that if there is more than one term and they overlap,
the nth term will not be annotated.  So my next thought is
lookahead/lookbehind, but they don't capture.  Is there a way
to do this with a single regex?  Is a regex even the best way
to do this?  Thanks, -Lee

perl -Mstrict -Mwarnings context.pl is and the have
$matched = [
  'Regular expressions [have] always been a weak spot for me, [and]
I\'ve got a question',
  'me stumped.  Here\'s [the]
problem I\'m trying to solve.  I [have] somewhat large articles',
  'what I\'d like to do [is]
capture [the] word [and] X number of words before [and] after it
while tagging [the] matching word in [the] captured text.  My
',
  'like this.  [The] problem I
[have] [is] that if there [is] more than one term [and] they overlap,
[the] nth term will not be',
  'So my next thought [is]
lookahead/lookbehind',
  'don\'t capture.  [Is] there a way
to do this',
  'a single regex?  [Is] a regex even [the] best way
to do this'
];

The main idea is to handle "overlapping" context as a single string that spans two (or more) consecutive matching words which are "close" together.
Since such a string then contains multiple occurrences of matching words, these words need to be matched a second time for the markup.
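
To make the two passes easier to see in isolation, here is a minimal sketch of the same idea packed into a sub. The name context_snippets and the single $len parameter (used for both the left and the right context) are made up for this illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns marked-up snippets with up to $len chars of
# context on each side of a hit.
sub context_snippets {
    my ($text, $len, @words) = @_;
    my $join    = 2 * $len;    # hits at most this far apart share one snippet
    my $alt     = join '|', map quotemeta, @words;
    my $rxwords = qr/\b(?i:$alt)\b/;
    my $expr    = qr/\b(?!\s)(?s:.{0,$len}$rxwords(?:.{0,$join}$rxwords)*.{0,$len}(?:(?<=\s)|[^\s]*\b))/;

    my @snippets = $text =~ /($expr)/g;    # pass 1: capture the context spans
    s/($rxwords)/[$1]/g for @snippets;     # pass 2: mark up every hit inside them
    return @snippets;
}

# Two hits close together end up in one snippet ...
print "$_\n" for context_snippets('aaaa cat bbbb dog cccc', 5, qw(cat dog));
# ... while two hits far apart give two separate snippets.
print "$_\n" for context_snippets(
    'aaaa bbbb cat dddd eeee ffff gggg hhhh iiii dog jjjj kkkk', 5, qw(cat dog));
```

With 5 chars of context the first call prints the single line "aaaa [cat] bbbb [dog] cccc", the second prints "bbbb [cat] dddd" and "iiii [dog] jjjj".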

For the primary match I made some small adjustments to match full words in the prefix and suffix context:

\b              # a word boundary
(?!\s)          # following char is not a white space (1)
(?s:            # . matches newline in rest of regex
  .{0,$blen}    # up to $blen chars (left context)
  $rxwords      # followed by a word we search for
  (?:           # group for repeatedly matching
    .{0,$jlen}  #   up to $jlen=$blen+$alen chars (2)
    $rxwords    #   followed by a searched word
  )*            # repeatedly match
  .{0,$alen}    # up to $alen chars (right context)
  (?:           # group for disjunction (3)
    (?<=\s)     #    last matched char was white space
   |            #  or
    [^\s]*      #    non white space chars
    \b          #    up to the next word boundary
  )
)
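
As a side note, the annotated layout can be compiled almost verbatim with the /x modifier. The following is only a sketch with a fixed word list and short context lengths picked for the example. I kept "$" and "/" out of the comments, since variables are still interpolated inside /x comments and a "/" would end the pattern.

```perl
use strict;
use warnings;

my ($blen, $alen) = (5, 5);          # short contexts, just for the demo
my $jlen    = $blen + $alen;
my $rxwords = qr/\b(?i:cat|dog)\b/;  # fixed word list, just for the demo

# The one-line form from the script ...
my $compact = qr/\b(?!\s)(?s:.{0,$blen}$rxwords(?:.{0,$jlen}$rxwords)*.{0,$alen}(?:(?<=\s)|[^\s]*\b))/;

# ... and the same pattern spread out under the x modifier.
my $verbose = qr/
    \b (?!\s)                       # start at a word boundary, not on white space
    (?s:
        .{0,$blen} $rxwords         # left context, then the first hit
        (?: .{0,$jlen} $rxwords )*  # further hits that are close enough
        .{0,$alen}                  # right context
        (?: (?<=\s) | [^\s]* \b )   # stop after white space or finish the word
    )
/x;

my $text = 'aaaa bbbb cat dddd eeee ffff gggg hhhh iiii dog jjjj kkkk';
my @a = $text =~ /($compact)/g;
my @b = $text =~ /($verbose)/g;
print join('|', @a) eq join('|', @b) ? "same matches\n" : "different matches\n";
```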

Additional remarks:

By making sure that we always break at word boundaries, we always have full words in the match, on which we later reapply the $rxwords match to mark up the words we search for.
Specifically, with (1) we trim white space on the left. And with (3) we make sure that either we end in white space, where it is safe to split, or we extend the $alen chars with all following non white space chars up to the next word boundary.
When we look for the spanning context between two matching words (2), it can indeed incidentally contain additional matching words (consumed by the "." rather than by $rxwords). These extras are "safe", since the (2) match is bordered by the \b of $rxwords on both sides, and so they will be found when $rxwords is reapplied in the postprocessing. This is the essential trick in avoiding any explicit gluing of separate contexts.
This all assumes that the words to match do not themselves contain \b boundaries. Otherwise the uses of \b in the regex(es) have to be complemented or replaced by (negative) lookahead for (non) white space. Looking for phrases (allowing white space inside) with the appropriate context is probably a lot harder this way.
Pathological texts which do not contain (enough) white space are not handled well.
Instead of matching into array context, the /($expr)/g match could also be executed in a while condition to address temp memory concerns with large texts to match.
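
For the last remark, a while-loop variant could look roughly like this. It is only a sketch: each_snippet and its callback argument are names I made up here, and the text is passed by reference so that it is not copied on the way in.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $cb once per marked-up snippet instead of
# building the complete list of matches first.
sub each_snippet {
    my ($text_ref, $expr, $rxwords, $cb) = @_;
    while ($$text_ref =~ /($expr)/g) {   # scalar-context //g resumes at pos()
        my $snippet = $1;                # copy before the next regex resets $1
        $snippet =~ s/($rxwords)/[$1]/g;
        $cb->($snippet);
    }
    return;
}

my $rxwords = qr/\b(?i:cat|dog)\b/;
my $expr    = qr/\b(?!\s)(?s:.{0,5}$rxwords(?:.{0,10}$rxwords)*.{0,5}(?:(?<=\s)|[^\s]*\b))/;
my $text    = 'aaaa bbbb cat dddd eeee ffff gggg hhhh iiii dog jjjj kkkk';

each_snippet(\$text, $expr, $rxwords, sub { print "$_[0]\n" });
```

The substitution runs on the copy in $snippet, so it does not disturb pos() on the text that is being scanned.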

Comments welcome
