Re: Regex: Matching around a word(s)
by Ovid (Cardinal) on Dec 16, 2005 at 23:04 UTC
|
I think asking whether a regex is the best way to do this really gets at the heart of the problem. Regexes are great for matching text, but not so hot for parsing it. In this case, I rather think you're in kind of a grey, middle area.
In your place, I might look at Text::ExtractWords. Combine that with Text::Sentence and you could have a very flexible solution that's easier to maintain.
| [reply] |
|
To be honest, for this limited use, I'm hesitant to add the weight of two more modules (the application is a bit heavy as it is).
If I'm not able to come up with a straight regex solution, I'll probably do something more iterative
| [reply] |
|
I wouldn't be weary of weight, unless you are running in a 1990s CGI environment (as opposed to mod_perl or FastCGI), for example.
This is due to two reasons. First of all, the efficiency difference will probably not be noticable unless benchmarked. The modules might even be more efficient than your regex stuff. Second, your time is worth more than the computer's time, as is the time of the maintenance programmer (you, or some other guy) that will come after you, in a few weeks/months time. This has value that can be later invested in beefier hardware. Unless the performance of the code is a problem (more over - a measurable problem), don't worry about it.
This is just an opinion though, grain of salt, 2 cents, and all.
| [reply] |
|
|
|
Re: Regex: Matching around a word(s)
by QM (Parson) on Dec 16, 2005 at 23:13 UTC
|
I think your problem is that m//g doesn't grab overlapping matches without some help from you. In scalar context, it picks up where it left off.
After a successful match, determine by inspection where the matching term is in the string, and then set pos just after that. Perhaps something like this (untested):
while ($text =~ /$expr/g)
{
my ($prefix,$term,$suffix) = ($1, $2, $3);
print "$prefix<b>$term</b>$suffix\n";
# update pos to just after $term
pos() = pos() - length($suffix);
}
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] [d/l] [select] |
|
Thanks. On the right track, but it's a bit funky if the matches are right next to each other. I'm going to see if I can't merge close matches.
| [reply] |
Re: Regex: Matching around a word(s)
by BrowserUk (Patriarch) on Dec 16, 2005 at 23:15 UTC
|
$s = 'I have somwhat large articles of text (returned from a search),
+ what I\'d like to do is capture the word and X number of words befor
+e and after it while tagging the matching word in the captured text.'
+;;
print $s =~ m[ ( (?: \S+ \s+ ){3} X (?: \s+ \S+ ){3} ) ]x;;
the word and X number of words
Update2: Here's one that does overlapping and tagging:
#! perl -slw
use strict;
my $word = $ARGV[0] or die "No search term";
( my $text = do{ local $/; <DATA> } ) =~ tr[\n][]d;
$text =~ s[
( (?: \S+ \s+ ){1,3} )
( $word ) [[:punct:]]*
(?= ( (?: \s+ \S+ ){1,3} ) )
][
print "$1<$2>$3"
]gex;
__END__
Regular expressions have always been a weak spot for me, and I've got
+a
question that's got me stumped. Here's the problem I'm trying to solve
+.
I have somwhat large articles of text (returned from a search), what I
+'d
like to do is capture the word and X number of words before and after
+it
while tagging the matching word in the captured text. My inital though
+t
was to try something like this. The problem I have is that if there is
more than one term and they overlap, the nth term will not be annotate
+d.
So my next thought is lookahead/lookbehind, but they don't capture.
Is there a way to do this with a single regex? Is a regex even the bes
+t
way to do this? Thanks, -Lee
Some results
P:\test>junk me
weak spot for <me> and I've got
aquestion that's got <me> stumped. Here's the
P:\test>junk is
I'dlike to do <is> capture the word
problem I have <is> that if there
my next thought <is> lookahead/lookbehind, but they
P:\test>junk got
me, and I've <got> aquestion that's got
aquestion that's <got> me stumped. Here's
P:\test>junk to
problem I'm trying <to> solve.I have somwhat
search), what I'dlike <to> do is capture
My inital thoughtwas <to> try something like
there a way <to> do this with
even the bestway <to> do this? Thanks,
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
Appreciate the help, though a couple issues.
1. I need to find multiple matches.
2. This fails on some input.
Take "and" for example.
#! perl -slw
use strict;
my $word = $ARGV[0] or die "No search term";
( my $text = do{ local $/; <DATA> } ) =~ tr[\n][]d;
$text =~ s[
( (?: \S+ \s+ ){1,3} )
( $word ) [[:punct:]]*
(?= ( (?: \s+ \S+ ){1,3} ) )
][
print "$1<$2>$3"
]gex;
__END__
this finds and matches and highlights matches.
This outputs
this finds <and> matches and highlights
matches <and> highlights matches.
| [reply] [d/l] |
|
P:\test>junk and
this finds <and> matches and highlights
matches <and> highlights matches.
Which is correct as far as I can tell?
It found and highlighted both "and"s; That is what you mean by multiple matches isn't it?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
|
|
|
Re: Regex: Matching around a word(s)
by GrandFather (Saint) on Dec 17, 2005 at 04:53 UTC
|
use strict;
use warnings;
die "No search terms supplied!" unless @ARGV;
my %targets;
my ($prePost) = $ARGV[0] =~ /^(\d+)$/;
my @words = split /(?= )/, do {local $/ = undef; <DATA>};
my @tags;
shift if defined $prePost;
$prePost = 5 unless defined $prePost;
map {$targets{$_} = (0)} @ARGV;
for (0..$#words) {
$words[$_] =~ s/[\r\n]+/ /;
my $curr = $words[$_];
my ($pre, $word, $post) = $curr =~ /([^\w]*)(\w*)(.*)/;
if (exists $targets{$word}) {
$words[$_] = "$pre<$word>$post";
push @tags, $_;
}
next if ! @tags or $tags[0] > $_ - $prePost;
print "".(join '', @words[$tags[0] - $prePost .. $tags[-1] + $prePos
+t])."\n";
@tags = ();
}
print join '', @words[$tags[0] - $prePost .. $#words] if @tags;
Prints:
a weak spot for me, <and> I've got a question that's
that's got me stumped. Here's <the> problem I'm trying to solve.
what I'd like to do <is> capture <the> word <and> X number of words
+before
X number of words before <and> after it while tagging <the> matching
+ word in the captured
tagging <the> matching word in <the> captured text. My inital thought
+
this. The problem I have <is> that if there <is> more than one term
+and
<is> more than one term <and> they overlap, <the> nth term will not
+be
annotated. So my next thought <is> lookahead/lookbehind, but they do
+n't capture.
regex? Is a regex even <the> best way to do this?
DWIM is Perl's answer to Gödel
| [reply] [d/l] [select] |
|
| [reply] |
Re: Regex: Matching around a word(s)
by shotgunefx (Parson) on Dec 17, 2005 at 09:03 UTC
|
Another take. This will discard the first word or fragment, unless the first word is a match.
#!/usr/bin/perl
use strict;
use warnings;
die "No search terms supplied!" unless @ARGV;
my @words = @ARGV;
my $text;
{
local $/ = undef;
$text = <DATA>;
}
my $regex = join ( "|", @words ); # Words to highlight
my $expr = qr /(?i)($regex)/; # Compile regex
my $glen = 20; # Characters before and after the en
+d of match to grab.
{
no warnings 'uninitialized';
my ( $ls, $le, @results ); # $ls=prev span start, $le=prev span end
+, @results, results destination
# Markup any matches or exit block.
last unless $text =~ s/\b($expr)\b/[$1]/gi;
while ( $text =~ m/\b($expr)\b/sg or $le <= length ($text) ) {
my ($ipos,$spos,$epos); # char span positions
if ($ipos = pos($text)) { # If the last match succeded
$spos = $ipos - $glen > 0 ? $ipos - $glen : 0; # Range
+check
$epos = $ipos + $glen < length($text) ? $ipos + $glen : length
+($text);
# Assign to ($ls,$le) if this is our first time through and ne
+xt.
( $ls, $le ) = ( $spos, $epos ) and next unless $le;
}
if ( $spos and $spos < $le ) { # If we have a match and it inte
+rsects the last match
$le = $epos; # merge overlapping char spans
}
else {
# Lose the first word(possible fragment) unless the match is t
+he first word.
$ls = index($text," ", $ls) + 1 unless ($ls == 0);
push @results,substr( $text, $ls, $le - $ls ) ;
( $ls, $le ) = ( $spos, $epos ); # Set "las
+t position" to current.
}
last unless defined $spos; # End unle
+ss we have one more match
}
print '"',$_,'..."', "\n" foreach @results;
}
__DATA__
Regular expressions have always been a weak spot for me, and I've got
+a
question that's got me stumped. Here's the problem I'm trying to solv
+e.
I have somewhat large articles of text (returned from a search), what
+ I'd
like to do is capture the word and X number of words before and after
+ it
while tagging the matching word in the captured text. My inital thoug
+ht
was to try something like this. The problem I have is that if there i
+s
more than one term and they overlap, the nth term will not be annotat
+ed.
So my next thought is lookahead/lookbehind, but they don't capture.
Is there a way to do this with a single regex? Is a regex even the be
+st
way to do this? Thanks, -Lee
| [reply] [d/l] |
Re: Regex: Matching around a word(s)
by eric256 (Parson) on Dec 20, 2005 at 00:06 UTC
|
Here is a slightly different approach. At least as far as I can tell this is unique. This builds a hash of matches and then rescans the source printing the matches. This automaticaly condenses down all the overlaps.
#!/usr/bin/perl
use strict;
use warnings;
die "No search terms supplied!" unless @ARGV;
my @words = @ARGV;
my $regex = join("|",@words) ;
my $expr = qr /^($regex)$/;
$/ = ' ';
my $i = 0;
my $words = {};
my $pos = tell(DATA);
for my $word (<DATA>)
{
chomp $word;
$i++;
if ($word =~ /$expr/) {
for my $j (-5 .. 5) {
$words->{$i + $j}++;
}
};
}
seek(DATA, $pos, 0);
$i =0;
for my $word (<DATA>)
{
$i++;
chomp $word;
$word = "<$word>" if ($word =~ /$expr/);
print "$word " if exists $words->{$i};
}
__DATA__
Regular expressions have always been a weak spot for me, and I've got
+a
question that's got me stumped. Here's the problem I'm trying to solv
+e.
I have somewhat large articles of text (returned from a search), what
+ I'd
like to do is capture the word and X number of words before and after
+ it
while tagging the matching word in the captured text. My inital thoug
+ht
was to try something like this. The problem I have is that if there i
+s
more than one term and they overlap, the nth term will not be annotat
+ed.
So my next thought is lookahead/lookbehind, but they don't capture.
Is there a way to do this with a single regex? Is a regex even the be
+st
way to do this? Thanks, -Lee
___________
Eric Hodges
$_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9';
s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;
| [reply] [d/l] |
|
Thanks. I'll whip up a benchmark with some of the different approaches, (actually have, but don't have yours in yet. Though I'll probably switch the filehandle use.
Also, I'd probably use a hash instead of a hashref for $words as it is slightly faster.
Not sure if it's faster or not (though "feels" it), I'd probably rewrite the following to use a hash slice
# before
if ($word =~ /$expr/) {
for my $j (-5 .. 5) {
$words->{$i + $j}++;
}
};
# after
my $mwords = 5;
@words{$i-$mwords..$i+$mwords} = 1
if $word =~ /$expr/;
# Note: Keys created, but all but first have undef values
| [reply] [d/l] |
Re: Regex: Matching around a word(s)
by TomDLux (Vicar) on Dec 19, 2005 at 19:38 UTC
|
My approach would be to read the file and split it into an array of words, then go through the array and detect matches. For each match you can print the subarray $words[$index-$pre .. $index+$post]. I think this is the sort of thing Ovid had in mind. Using modules is always easier than doing things yourself, but if your definition of words is imple enough, it may be not to onerous.
Of course, if the file is LARGE, you might want to only read sections of the file at a time, and update your 'window' when you have fewer than $post words left in memory.
The next problem is defining what is a word .... Wonder if Shakespeare had anything to say on the topic?
Tom
--
TTTATCGGTCGTTATATAGATGTTTGCA
| [reply] |
|
Using modules is always easier than doing things yourself,
Only if there is a module that does exactly what you want to do. Maybe.
If there isn't, picking a module with a name vaguely related to the problem description and trying to bend it to the cause is pointless.
Achieving what you describe--printing the target word and N words either side--is easy using a regex and does have the overhead of creating huge arrays just to stick em all back to gether for output. It is a bread and butter text processing task and exactly what the much lauded, highly prized, Perl-jewel-in-it's-crown regex engine is designed for.
However, resolving the issues of almagamating multiple, closely consecetive matches into single snippets is equally problematic whether done with a regex or your "subarray" solution. (BTW.It should @array[ n .. m ] for a slice).
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
I thought I had responded to this earlier. Actually, I did, but must have mistaken preview for submit (dammit).
BrowserUk stated my thoughts pretty well. As an aside, simply splitting the string and matching the tokens like so
my @text = split /\s+/, $text;
my @results;
for (@text){
push @results, $1 if m/\b($expr)\b/;
}
is pretty slow in comparison to the solution I hit on.
| [reply] [d/l] |
Re: Regex: Matching around a word(s)
by pKai (Priest) on Dec 20, 2005 at 10:46 UTC
|
Seeing your output I tried to go back to the original questions:
Is there a way to do this with a single regex?
My answer here is to present a solution with "one and a half" regexes ;-) (see below)
Is a regex even the best way to do this?
Depends, I would say. Obviously, if the regex is too convoluted, it is not likely to be maintainable. OTOH a solution with a lot of pos-calculation is more likely to suffer from +/-1 border errors.
So here's my take:
use strict;
use warnings;
use Data::Dumper ();
die "No search terms supplied!" unless @ARGV;
my @words = @ARGV;
my $text = do { local $/ = undef; <DATA> };
my $blen = 20; # (max) chars before a matching word to take with us
my $alen = 20; # (max) chars after a matching word to take with us
my $jlen = $blen+$alen; # (max) chars between 2 matching words capt
+ured together
my $strwords = join "|" => map quotemeta, @words; # Words to highlight
my $rxwords = qr/\b(?i:$strwords)\b/; # ... compiled highlight word
+ match
my $expr = qr/\b(?!\s)(?s:.{0,$blen}$rxwords(?:.{0,$jlen}$rxwords)*.{0
+,$alen}(?:(?<=\s)|[^\s]*\b))/;
my $D = Data::Dumper->new(
[[grep {s/($rxwords)/[$1]/g} $text =~ /($expr)/g]], ['matched']
)->Indent(1);
print $D->Dump();
# reformatted the DATA to look nicer in the post
__DATA__
Regular expressions have always been a weak spot for me, and
I've got a question that's got me stumped. Here's the
problem I'm trying to solve. I have somewhat large articles
of text (returned from a search), what I'd like to do is
capture the word and X number of words before and after it
while tagging the matching word in the captured text. My
inital thought was to try something like this. The problem I
have is that if there is more than one term and they overlap,
the nth term will not be annotated. So my next thought is
lookahead/lookbehind, but they don't capture. Is there a way
to do this with a single regex? Is a regex even the best way
to do this? Thanks, -Lee
perl -Mstrict -Mwarnings context.pl is and the have
$matched = [
'Regular expressions [have] always been a weak spot for me, [and]
I\'ve got a question',
'me stumped. Here\'s [the]
problem I\'m trying to solve. I [have] somewhat large articles',
'what I\'d like to do [is]
capture [the] word [and] X number of words before [and] after it
while tagging [the] matching word in [the] captured text. My
',
'like this. [The] problem I
[have] [is] that if there [is] more than one term [and] they overlap,
[the] nth term will not be',
'So my next thought [is]
lookahead/lookbehind',
'don\'t capture. [Is] there a way
to do this',
'a single regex? [Is] a regex even [the] best way
to do this'
];
The main idea is to handle the "overlapping" context as a single string which spans between 2 consecutive matching words "close" together.
Because I then have multiple occurences of matching words, then these words need to be matched again for markup.
For the primary match I made some small adjustments to match full words in the prefix and suffix context:
\b # a word boundary
(?!\s) # following char is not a white space (1)
(?s: # . matches newline in rest of regex
.{0,$blen} # up to $blen chars (left context)
$rxwords # followed by a word we search for
(?: # group for repeatedly matching
.{0,$jlen} # up to $jlen=$blen+$alen chars (2)
$rxwords # followed by a searched word
)* # repeatedly match
.{0,$alen} # up to $alen chars (right context)
(?: # group for disjunction (3)
(?<=\s) # last matched char was white space
| # or
[^\s]* # non white space chars
\b # up to the next word boundary
)
)
Additional remarks:
- By making sure that we always break at word boundaries, we always have full words in the match on which we later reapply the $rxwords match to mark-up the words we search for.
Specificall, by (1) we trim white space on the left. And with (3) we make sure that either we end in white space which is safe to split there, or we extend the $alen chars with all following non white space chars up to the next word boundary.
- When we look for spanning context between two matching words (2) this can indead incidentally contain additional matching words (not matched by $rxwords), but these additionals are "safe", since that (2) match borders to \b of $rxwords on both sides. And so they will be found in the reapplication of $rxwords in the postprocessing. This is the essential trick in avoiding any explicit gluing of separate contexts.
- This all assumes that the words to match do not incorporate \b boundaries. Otherwise the usage of \b in the regex(es) have to be complemented/substituted by (negative) look ahead for (non) white space. Looking for phrases (allowing white space inside) with the appropriate context is probably a lot harder in this way.
- Pathological texts which do not contain (enough) white space are not handled well.
- Instead of matching into array context, the /($expr)/g match could also be executed in a while condition to address temp memory concerns with large texts to match.
Comments welcome | [reply] [d/l] [select] |
|
Nice++
Did a benchmark (I'll post later with some tweaks) and it seems that for a few terms, the match/span is about twice as fast then the double regex, with more terms it goes down to about 70% faster.
Simply splitting without actually doing any processing
my ($text,@words) = @_;
my @text = split /\s+/, $text;
my @results;
for my $t (@text){
for my $w (@words){
push @results if $t eq $w;
}
}
is about 46% slower then the pattern match/span solution.
| [reply] [d/l] |
Re: Regex: Matching around a word(s)
by wolfger (Deacon) on Dec 20, 2005 at 19:10 UTC
|
Edit: Ignore me please... I need to read the problem completely before whipping out code that doesn't work.
#!/usr/bin/perl
use strict;
use warnings;
my $string = "This is just and example text string to test our regex.
+Please ignore it.";
my $wordpadding = 2;
my $searchword = "test";
$string =~ /((\w+\W+?){$wordpadding}$searchword(\W+?\w+){$wordpadding}
+)/;
print $1;
jdporter unconsidered. Considered by wolfger:
"please delete. I hit send before fully reading/understanding the problem."
Final tally: 8 keep, 11 reap.
| [reply] [d/l] |