http://www.perlmonks.org?node_id=517393

shotgunefx has asked for the wisdom of the Perl Monks concerning the following question:

Regular expressions have always been a weak spot for me, and I've got a question that's got me stumped.

Here's the problem I'm trying to solve.

I have somwhat large articles of text (returned from a search), what I'd like to do is capture the word and X number of words before and after it while tagging the matching word in the captured text.

My inital thought was to try something like this.
my $regex = \b(.{1,50})\b(?i)('.join("|",@words).')\b(.{1,50})\b'; my $expr = qr /$regex/; # Later my $matchtext; while ($text=~m/$expr/g){ $matchtext.= "...$1<font color=\"FF0000\">$2</font>$3...<br><br>" ; }
The problem I have is that if there is more than one term and they overlap, the nth term will not be annotated.

So my next thought is lookahead/lookbehind, but they don't capture. Is there a way to do this with a single regex?

Is a regex even the best way to do this?

Thanks,
-Lee

perl digital dash (in progress)

Replies are listed 'Best First'.
Re: Regex: Matching around a word(s)
by Ovid (Cardinal) on Dec 16, 2005 at 23:04 UTC

    I think asking whether a regex is the best way to do this really gets at the heart of the problem. Regexes are great for matching text, but not so hot for parsing it. In this case, I rather think you're in kind of a grey, middle area.

    In your place, I might look at Text::ExtractWords. Combine that with Text::Sentence and you could have a very flexible solution that's easier to maintain.

    Cheers,
    Ovid

    New address of my CGI Course.

      To be honest, for this limited use, I'm hesitant to add the weight of two more modules (the application is a bit heavy as it is).

      If I'm not able to come up with a straight regex solution, I'll probably do something more iterative


      -Lee

      perl digital dash (in progress)
        I wouldn't be weary of weight, unless you are running in a 1990s CGI environment (as opposed to mod_perl or FastCGI), for example.

        This is due to two reasons. First of all, the efficiency difference will probably not be noticable unless benchmarked. The modules might even be more efficient than your regex stuff. Second, your time is worth more than the computer's time, as is the time of the maintenance programmer (you, or some other guy) that will come after you, in a few weeks/months time. This has value that can be later invested in beefier hardware. Unless the performance of the code is a problem (more over - a measurable problem), don't worry about it.

        This is just an opinion though, grain of salt, 2 cents, and all.

        -nuffin
        zz zZ Z Z #!perl
Re: Regex: Matching around a word(s)
by QM (Parson) on Dec 16, 2005 at 23:13 UTC
    I think your problem is that m//g doesn't grab overlapping matches without some help from you. In scalar context, it picks up where it left off.

    After a successful match, determine by inspection where the matching term is in the string, and then set pos just after that. Perhaps something like this (untested):

    while ($text =~ /$expr/g) { my ($prefix,$term,$suffix) = ($1, $2, $3); print "$prefix<b>$term</b>$suffix\n"; # update pos to just after $term pos() = pos() - length($suffix); }

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Thanks. On the right track, but it's a bit funky if the matches are right next to each other. I'm going to see if I can't merge close matches.

      -Lee

      perl digital dash (in progress)
Re: Regex: Matching around a word(s)
by BrowserUk (Patriarch) on Dec 16, 2005 at 23:15 UTC

    Update: I missed the bit about "overlapping matches....

    Something like this?

    $s = 'I have somwhat large articles of text (returned from a search), + what I\'d like to do is capture the word and X number of words befor +e and after it while tagging the matching word in the captured text.' +;; print $s =~ m[ ( (?: \S+ \s+ ){3} X (?: \s+ \S+ ){3} ) ]x;; the word and X number of words

    Update2: Here's one that does overlapping and tagging:

    #! perl -slw use strict; my $word = $ARGV[0] or die "No search term"; ( my $text = do{ local $/; <DATA> } ) =~ tr[\n][]d; $text =~ s[ ( (?: \S+ \s+ ){1,3} ) ( $word ) [[:punct:]]* (?= ( (?: \s+ \S+ ){1,3} ) ) ][ print "$1<$2>$3" ]gex; __END__ Regular expressions have always been a weak spot for me, and I've got +a question that's got me stumped. Here's the problem I'm trying to solve +. I have somwhat large articles of text (returned from a search), what I +'d like to do is capture the word and X number of words before and after +it while tagging the matching word in the captured text. My inital though +t was to try something like this. The problem I have is that if there is more than one term and they overlap, the nth term will not be annotate +d. So my next thought is lookahead/lookbehind, but they don't capture. Is there a way to do this with a single regex? Is a regex even the bes +t way to do this? Thanks, -Lee

    Some results

    P:\test>junk me weak spot for <me> and I've got aquestion that's got <me> stumped. Here's the P:\test>junk is I'dlike to do <is> capture the word problem I have <is> that if there my next thought <is> lookahead/lookbehind, but they P:\test>junk got me, and I've <got> aquestion that's got aquestion that's <got> me stumped. Here's P:\test>junk to problem I'm trying <to> solve.I have somwhat search), what I'dlike <to> do is capture My inital thoughtwas <to> try something like there a way <to> do this with even the bestway <to> do this? Thanks,

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Appreciate the help, though a couple issues. 1. I need to find multiple matches.
      2. This fails on some input.

      Take "and" for example.
      #! perl -slw use strict; my $word = $ARGV[0] or die "No search term"; ( my $text = do{ local $/; <DATA> } ) =~ tr[\n][]d; $text =~ s[ ( (?: \S+ \s+ ){1,3} ) ( $word ) [[:punct:]]* (?= ( (?: \s+ \S+ ){1,3} ) ) ][ print "$1<$2>$3" ]gex; __END__ this finds and matches and highlights matches.
      This outputs
      this finds <and> matches and highlights matches <and> highlights matches.

      -Lee

      perl digital dash (in progress)

        Strange. I copied the above code and ran it and got this output:

        P:\test>junk and this finds <and> matches and highlights matches <and> highlights matches.

        Which is correct as far as I can tell?

        It found and highlighted both "and"s; That is what you mean by multiple matches isn't it?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Regex: Matching around a word(s)
by GrandFather (Saint) on Dec 17, 2005 at 04:53 UTC

    The following code looks like it does most of what you want.

    use strict; use warnings; die "No search terms supplied!" unless @ARGV; my %targets; my ($prePost) = $ARGV[0] =~ /^(\d+)$/; my @words = split /(?= )/, do {local $/ = undef; <DATA>}; my @tags; shift if defined $prePost; $prePost = 5 unless defined $prePost; map {$targets{$_} = (0)} @ARGV; for (0..$#words) { $words[$_] =~ s/[\r\n]+/ /; my $curr = $words[$_]; my ($pre, $word, $post) = $curr =~ /([^\w]*)(\w*)(.*)/; if (exists $targets{$word}) { $words[$_] = "$pre<$word>$post"; push @tags, $_; } next if ! @tags or $tags[0] > $_ - $prePost; print "".(join '', @words[$tags[0] - $prePost .. $tags[-1] + $prePos +t])."\n"; @tags = (); } print join '', @words[$tags[0] - $prePost .. $#words] if @tags;

    Prints:

    a weak spot for me, <and> I've got a question that's that's got me stumped. Here's <the> problem I'm trying to solve. what I'd like to do <is> capture <the> word <and> X number of words +before X number of words before <and> after it while tagging <the> matching + word in the captured tagging <the> matching word in <the> captured text. My inital thought + this. The problem I have <is> that if there <is> more than one term +and <is> more than one term <and> they overlap, <the> nth term will not +be annotated. So my next thought <is> lookahead/lookbehind, but they do +n't capture. regex? Is a regex even <the> best way to do this?

    DWIM is Perl's answer to Gödel
Re: Regex: Matching around a word(s)
by shotgunefx (Parson) on Dec 17, 2005 at 09:03 UTC
    Another take. This will discard the first word or fragment, unless the first word is a match.
    #!/usr/bin/perl use strict; use warnings; die "No search terms supplied!" unless @ARGV; my @words = @ARGV; my $text; { local $/ = undef; $text = <DATA>; } my $regex = join ( "|", @words ); # Words to highlight my $expr = qr /(?i)($regex)/; # Compile regex my $glen = 20; # Characters before and after the en +d of match to grab. { no warnings 'uninitialized'; my ( $ls, $le, @results ); # $ls=prev span start, $le=prev span end +, @results, results destination # Markup any matches or exit block. last unless $text =~ s/\b($expr)\b/[$1]/gi; while ( $text =~ m/\b($expr)\b/sg or $le <= length ($text) ) { my ($ipos,$spos,$epos); # char span positions if ($ipos = pos($text)) { # If the last match succeded $spos = $ipos - $glen > 0 ? $ipos - $glen : 0; # Range +check $epos = $ipos + $glen < length($text) ? $ipos + $glen : length +($text); # Assign to ($ls,$le) if this is our first time through and ne +xt. ( $ls, $le ) = ( $spos, $epos ) and next unless $le; } if ( $spos and $spos < $le ) { # If we have a match and it inte +rsects the last match $le = $epos; # merge overlapping char spans } else { # Lose the first word(possible fragment) unless the match is t +he first word. $ls = index($text," ", $ls) + 1 unless ($ls == 0); push @results,substr( $text, $ls, $le - $ls ) ; ( $ls, $le ) = ( $spos, $epos ); # Set "las +t position" to current. } last unless defined $spos; # End unle +ss we have one more match } print '"',$_,'..."', "\n" foreach @results; } __DATA__ Regular expressions have always been a weak spot for me, and I've got +a question that's got me stumped. Here's the problem I'm trying to solv +e. I have somewhat large articles of text (returned from a search), what + I'd like to do is capture the word and X number of words before and after + it while tagging the matching word in the captured text. My inital thoug +ht was to try something like this. The problem I have is that if there i +s more than one term and they overlap, the nth term will not be annotat +ed. So my next thought is lookahead/lookbehind, but they don't capture. Is there a way to do this with a single regex? Is a regex even the be +st way to do this? Thanks, -Lee

    -Lee

    perl digital dash (in progress)
Re: Regex: Matching around a word(s)
by eric256 (Parson) on Dec 20, 2005 at 00:06 UTC

    Here is a slightly different approach. At least as far as I can tell this is unique. This builds a hash of matches and then rescans the source printing the matches. This automaticaly condenses down all the overlaps.

    #!/usr/bin/perl use strict; use warnings; die "No search terms supplied!" unless @ARGV; my @words = @ARGV; my $regex = join("|",@words) ; my $expr = qr /^($regex)$/; $/ = ' '; my $i = 0; my $words = {}; my $pos = tell(DATA); for my $word (<DATA>) { chomp $word; $i++; if ($word =~ /$expr/) { for my $j (-5 .. 5) { $words->{$i + $j}++; } }; } seek(DATA, $pos, 0); $i =0; for my $word (<DATA>) { $i++; chomp $word; $word = "<$word>" if ($word =~ /$expr/); print "$word " if exists $words->{$i}; } __DATA__ Regular expressions have always been a weak spot for me, and I've got +a question that's got me stumped. Here's the problem I'm trying to solv +e. I have somewhat large articles of text (returned from a search), what + I'd like to do is capture the word and X number of words before and after + it while tagging the matching word in the captured text. My inital thoug +ht was to try something like this. The problem I have is that if there i +s more than one term and they overlap, the nth term will not be annotat +ed. So my next thought is lookahead/lookbehind, but they don't capture. Is there a way to do this with a single regex? Is a regex even the be +st way to do this? Thanks, -Lee

    ___________
    Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;
      Thanks. I'll whip up a benchmark with some of the different approaches, (actually have, but don't have yours in yet. Though I'll probably switch the filehandle use.

      Also, I'd probably use a hash instead of a hashref for $words as it is slightly faster.

      Not sure if it's faster or not (though "feels" it), I'd probably rewrite the following to use a hash slice
      # before if ($word =~ /$expr/) { for my $j (-5 .. 5) { $words->{$i + $j}++; } }; # after my $mwords = 5; @words{$i-$mwords..$i+$mwords} = 1 if $word =~ /$expr/; # Note: Keys created, but all but first have undef values


      -Lee

      perl digital dash (in progress)
Re: Regex: Matching around a word(s)
by TomDLux (Vicar) on Dec 19, 2005 at 19:38 UTC

    My approach would be to read the file and split it into an array of words, then go through the array and detect matches. For each match you can print the subarray $words[$index-$pre .. $index+$post]. I think this is the sort of thing Ovid had in mind. Using modules is always easier than doing things yourself, but if your definition of words is imple enough, it may be not to onerous.

    Of course, if the file is LARGE, you might want to only read sections of the file at a time, and update your 'window' when you have fewer than $post words left in memory.

    The next problem is defining what is a word .... Wonder if Shakespeare had anything to say on the topic?

    Tom

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      Using modules is always easier than doing things yourself,

      Only if there is a module that does exactly what you want to do. Maybe.

      If there isn't, picking a module with a name vaguely related to the problem description and trying to bend it to the cause is pointless.

      Achieving what you describe--printing the target word and N words either side--is easy using a regex and does have the overhead of creating huge arrays just to stick em all back to gether for output. It is a bread and butter text processing task and exactly what the much lauded, highly prized, Perl-jewel-in-it's-crown regex engine is designed for.

      However, resolving the issues of almagamating multiple, closely consecetive matches into single snippets is equally problematic whether done with a regex or your "subarray" solution. (BTW.It should @array[ n .. m ] for a slice).


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      I thought I had responded to this earlier. Actually, I did, but must have mistaken preview for submit (dammit).

      BrowserUk stated my thoughts pretty well. As an aside, simply splitting the string and matching the tokens like so
      my @text = split /\s+/, $text; my @results; for (@text){ push @results, $1 if m/\b($expr)\b/; }
      is pretty slow in comparison to the solution I hit on.


      -Lee

      perl digital dash (in progress)
Re: Regex: Matching around a word(s)
by pKai (Priest) on Dec 20, 2005 at 10:46 UTC
    Seeing your output I tried to go back to the original questions:
    Is there a way to do this with a single regex?
    My answer here is to present a solution with "one and a half" regexes ;-) (see below)
    Is a regex even the best way to do this?
    Depends, I would say. Obviously, if the regex is too convoluted, it is not likely to be maintainable. OTOH a solution with a lot of pos-calculation is more likely to suffer from +/-1 border errors.

    So here's my take:

    use strict; use warnings; use Data::Dumper (); die "No search terms supplied!" unless @ARGV; my @words = @ARGV; my $text = do { local $/ = undef; <DATA> }; my $blen = 20; # (max) chars before a matching word to take with us my $alen = 20; # (max) chars after a matching word to take with us my $jlen = $blen+$alen; # (max) chars between 2 matching words capt +ured together my $strwords = join "|" => map quotemeta, @words; # Words to highlight my $rxwords = qr/\b(?i:$strwords)\b/; # ... compiled highlight word + match my $expr = qr/\b(?!\s)(?s:.{0,$blen}$rxwords(?:.{0,$jlen}$rxwords)*.{0 +,$alen}(?:(?<=\s)|[^\s]*\b))/; my $D = Data::Dumper->new( [[grep {s/($rxwords)/[$1]/g} $text =~ /($expr)/g]], ['matched'] )->Indent(1); print $D->Dump(); # reformatted the DATA to look nicer in the post __DATA__ Regular expressions have always been a weak spot for me, and I've got a question that's got me stumped. Here's the problem I'm trying to solve. I have somewhat large articles of text (returned from a search), what I'd like to do is capture the word and X number of words before and after it while tagging the matching word in the captured text. My inital thought was to try something like this. The problem I have is that if there is more than one term and they overlap, the nth term will not be annotated. So my next thought is lookahead/lookbehind, but they don't capture. Is there a way to do this with a single regex? Is a regex even the best way to do this? Thanks, -Lee

    perl -Mstrict -Mwarnings context.pl is and the have $matched = [ 'Regular expressions [have] always been a weak spot for me, [and] I\'ve got a question', 'me stumped. Here\'s [the] problem I\'m trying to solve. I [have] somewhat large articles', 'what I\'d like to do [is] capture [the] word [and] X number of words before [and] after it while tagging [the] matching word in [the] captured text. My ', 'like this. [The] problem I [have] [is] that if there [is] more than one term [and] they overlap, [the] nth term will not be', 'So my next thought [is] lookahead/lookbehind', 'don\'t capture. [Is] there a way to do this', 'a single regex? [Is] a regex even [the] best way to do this' ];

    The main idea is to handle the "overlapping" context as a single string which spans between 2 consecutive matching words "close" together.
    Because I then have multiple occurences of matching words, then these words need to be matched again for markup.

    For the primary match I made some small adjustments to match full words in the prefix and suffix context:

    \b # a word boundary (?!\s) # following char is not a white space (1) (?s: # . matches newline in rest of regex .{0,$blen} # up to $blen chars (left context) $rxwords # followed by a word we search for (?: # group for repeatedly matching .{0,$jlen} # up to $jlen=$blen+$alen chars (2) $rxwords # followed by a searched word )* # repeatedly match .{0,$alen} # up to $alen chars (right context) (?: # group for disjunction (3) (?<=\s) # last matched char was white space | # or [^\s]* # non white space chars \b # up to the next word boundary ) )
    Additional remarks:
    1. By making sure that we always break at word boundaries, we always have full words in the match on which we later reapply the $rxwords match to mark-up the words we search for.
      Specificall, by (1) we trim white space on the left. And with (3) we make sure that either we end in white space which is safe to split there, or we extend the $alen chars with all following non white space chars up to the next word boundary.
    2. When we look for spanning context between two matching words (2) this can indead incidentally contain additional matching words (not matched by $rxwords), but these additionals are "safe", since that (2) match borders to \b of $rxwords on both sides. And so they will be found in the reapplication of $rxwords in the postprocessing. This is the essential trick in avoiding any explicit gluing of separate contexts.
    3. This all assumes that the words to match do not incorporate \b boundaries. Otherwise the usage of \b in the regex(es) have to be complemented/substituted by (negative) look ahead for (non) white space. Looking for phrases (allowing white space inside) with the appropriate context is probably a lot harder in this way.
    4. Pathological texts which do not contain (enough) white space are not handled well.
    5. Instead of matching into array context, the /($expr)/g match could also be executed in a while condition to address temp memory concerns with large texts to match.
    Comments welcome
      Nice++
      Did a benchmark (I'll post later with some tweaks) and it seems that for a few terms, the match/span is about twice as fast then the double regex, with more terms it goes down to about 70% faster.

      Simply splitting without actually doing any processing
      my ($text,@words) = @_; my @text = split /\s+/, $text; my @results; for my $t (@text){ for my $w (@words){ push @results if $t eq $w; } }
      is about 46% slower then the pattern match/span solution.


      -Lee

      perl digital dash (in progress)
Re: Regex: Matching around a word(s)
by wolfger (Deacon) on Dec 20, 2005 at 19:10 UTC

    Edit: Ignore me please... I need to read the problem completely before whipping out code that doesn't work.

    #!/usr/bin/perl use strict; use warnings; my $string = "This is just and example text string to test our regex. +Please ignore it."; my $wordpadding = 2; my $searchword = "test"; $string =~ /((\w+\W+?){$wordpadding}$searchword(\W+?\w+){$wordpadding} +)/; print $1;

    jdporter unconsidered. Considered by wolfger: "please delete. I hit send before fully reading/understanding the problem." Final tally: 8 keep, 11 reap.