PerlMonks  

Re: Regex: Matching around a word(s)

by pKai (Priest)
on Dec 20, 2005 at 10:46 UTC ( [id://518018]=note )


in reply to Regex: Matching around a word(s)

Seeing your output, I went back to the original questions:
"Is there a way to do this with a single regex?"
My answer here is to present a solution with "one and a half" regexes ;-) (see below)
"Is a regex even the best way to do this?"
Depends, I would say. Obviously, if the regex is too convoluted, it is not likely to be maintainable. OTOH, a solution with a lot of pos() calculation is more likely to suffer from off-by-one errors at the borders.

So here's my take:

use strict;
use warnings;
use Data::Dumper ();

die "No search terms supplied!" unless @ARGV;
my @words = @ARGV;
my $text = do { local $/ = undef; <DATA> };

my $blen = 20;            # (max) chars before a matching word to take with us
my $alen = 20;            # (max) chars after a matching word to take with us
my $jlen = $blen+$alen;   # (max) chars between 2 matching words captured together

my $strwords = join "|" => map quotemeta, @words;   # Words to highlight
my $rxwords  = qr/\b(?i:$strwords)\b/;              # ... compiled highlight word match
my $expr = qr/\b(?!\s)(?s:.{0,$blen}$rxwords(?:.{0,$jlen}$rxwords)*.{0,$alen}(?:(?<=\s)|[^\s]*\b))/;

my $D = Data::Dumper->new(
    [[grep {s/($rxwords)/[$1]/g} $text =~ /($expr)/g]],
    ['matched']
)->Indent(1);
print $D->Dump();

# reformatted the DATA to look nicer in the post
__DATA__
Regular expressions have always been a weak spot for me, and I've got
a question that's got me stumped. Here's the problem I'm trying to
solve. I have somewhat large articles of text (returned from a
search), what I'd like to do is capture the word and X number of words
before and after it while tagging the matching word in the captured
text. My inital thought was to try something like this. The problem I
have is that if there is more than one term and they overlap, the nth
term will not be annotated. So my next thought is lookahead/lookbehind,
but they don't capture. Is there a way to do this with a single regex?
Is a regex even the best way to do this? Thanks, -Lee

perl -Mstrict -Mwarnings context.pl is and the have
$matched = [
  'Regular expressions [have] always been a weak spot for me, [and] I\'ve got a question',
  'me stumped. Here\'s [the] problem I\'m trying to solve. I [have] somewhat large articles',
  'what I\'d like to do [is] capture [the] word [and] X number of words before [and] after it while tagging [the] matching word in [the] captured text. My ',
  'like this. [The] problem I [have] [is] that if there [is] more than one term [and] they overlap, [the] nth term will not be',
  'So my next thought [is] lookahead/lookbehind',
  'don\'t capture. [Is] there a way to do this',
  'a single regex? [Is] a regex even [the] best way to do this'
];

The main idea is to handle "overlapping" contexts as a single string that spans consecutive matching words lying "close" together.
Because such a string can contain multiple occurrences of matching words, these words need to be matched again afterwards for the markup.
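
A minimal sketch of that two-step idea on a toy input. The short text and the small context sizes (5 chars instead of 20) are assumptions chosen only to keep the result readable; the regexes are the ones from the script above.

```perl
use strict;
use warnings;

# Illustration only: short text and small context sizes chosen for readability.
my $text  = 'one two red three blue four five six seven eight red nine';
my @words = qw(red blue);
my ($blen, $alen) = (5, 5);
my $jlen = $blen + $alen;

my $strwords = join '|', map { quotemeta($_) } @words;
my $rxwords  = qr/\b(?i:$strwords)\b/;
my $expr     = qr/\b(?!\s)(?s:.{0,$blen}$rxwords(?:.{0,$jlen}$rxwords)*.{0,$alen}(?:(?<=\s)|[^\s]*\b))/;

# Step 1: one pass collects each span. The first "red" and "blue" are at most
# $jlen chars apart, so they land in the same span instead of two overlapping ones.
my @found = $text =~ /($expr)/g;

# Step 2: re-apply the word regex to each span to add the markup.
s/($rxwords)/[$1]/g for @found;

print "$_\n" for @found;
```

With these inputs the first span holds both "red" and "blue", and the far-away second "red" gets a span of its own.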

For the primary match I made some small adjustments to match full words in the prefix and suffix context:

\b                  # a word boundary
(?!\s)              # following char is not a white space (1)
(?s:                # . matches newline in rest of regex
  .{0,$blen}        # up to $blen chars (left context)
  $rxwords          # followed by a word we search for
  (?:               # group for repeatedly matching
    .{0,$jlen}      # up to $jlen=$blen+$alen chars (2)
    $rxwords        # followed by a searched word
  )*                # repeatedly match
  .{0,$alen}        # up to $alen chars (right context)
  (?:               # group for disjunction (3)
    (?<=\s)         # last matched char was white space
    |               # or
    [^\s]*          # non white space chars
    \b              # up to the next word boundary
  )
)
Additional remarks:
  1. By making sure that we always break at word boundaries, we always have full words in the match, on which we later reapply the $rxwords match to mark up the words we search for.
    Specifically, by (1) we trim white space on the left. And with (3) we make sure that either we end in white space, where it is safe to split, or we extend the $alen chars with all following non white space chars up to the next word boundary.
  2. When we look for spanning context between two matching words (2), this can indeed incidentally contain additional matching words (consumed by the .{0,$jlen} part rather than by an explicit $rxwords), but these additional words are "safe", since the (2) match is bordered by the \b of $rxwords on both sides. And so they will be found in the reapplication of $rxwords in the postprocessing. This is the essential trick in avoiding any explicit gluing of separate contexts.
  3. This all assumes that the words to match do not incorporate \b boundaries. Otherwise the usage of \b in the regex(es) has to be complemented/substituted by (negative) look ahead for (non) white space. Looking for phrases (allowing white space inside) with the appropriate context is probably a lot harder in this way.
  4. Pathological texts which do not contain (enough) white space are not handled well.
  5. Instead of matching in list context, the /($expr)/g match could also be executed in a while condition to address temp memory concerns with large texts to match.
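
The while-loop variant from the last remark might look like the following sketch. The helper name each_snippet, its signature, and the callback interface are hypothetical and not part of the script above; the toy text and the 5-char context sizes are likewise just for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name and signature are illustrative, not from the post).
# Takes a reference to the text so a large string is not copied, and hands
# each marked-up span to $callback as soon as it is found.
sub each_snippet {
    my ($text_ref, $words, $callback, $blen, $alen) = @_;
    $blen = 20 unless defined $blen;
    $alen = 20 unless defined $alen;
    my $jlen = $blen + $alen;

    my $strwords = join '|', map { quotemeta($_) } @$words;
    my $rxwords  = qr/\b(?i:$strwords)\b/;
    my $expr     = qr/\b(?!\s)(?s:.{0,$blen}$rxwords(?:.{0,$jlen}$rxwords)*.{0,$alen}(?:(?<=\s)|[^\s]*\b))/;

    # In scalar context, //g resumes at pos($$text_ref) on every iteration,
    # so only one span is held at a time.
    while ($$text_ref =~ /($expr)/g) {
        my $span = $1;                  # copy before the next regex overwrites $1
        $span =~ s/($rxwords)/[$1]/g;   # markup pass on the copy
        $callback->($span);
    }
    return;
}

my $text = 'one two red three blue four five six seven eight red nine';
each_snippet(\$text, [qw(red blue)], sub { print "$_[0]\n" }, 5, 5);
```

The substitution runs on a copy of the span, so it does not disturb pos() on the text being scanned.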
Comments welcome

Re^2: Regex: Matching around a word(s)
by shotgunefx (Parson) on Dec 20, 2005 at 19:28 UTC
    Nice++
    Did a benchmark (I'll post later with some tweaks) and it seems that for a few terms, the match/span is about twice as fast as the double regex; with more terms it goes down to about 70% faster.

    Simply splitting without actually doing any processing

    my ($text,@words) = @_;
    my @text = split /\s+/, $text;
    my @results;
    for my $t (@text){
        for my $w (@words){
            push @results, $t if $t eq $w;
        }
    }

    is about 46% slower than the pattern match/span solution.


    -Lee

    perl digital dash (in progress)
