in reply to How do I extract a text between some delimiters
Barring a CPAN module that already does this, you can roll your own (untested code follows; please tell me if this doesn't work. It may be missing some whitespace in the substitution, for example):
    sub stripPOS {
        my $words = shift;
        # rip out any / plus following characters, up to the
        # first space
        $words =~ s!/\S*!!g;
        return $words;
    }

    $sentence =~ s! \s \<NP\d*\> (.*?) \</NP\> \s ! stripPOS($1) . '/NP' !egx;
Let's break that out:
- s!
begins the substitution.
- \s \<NP\d*\>
looks for an <NP#> tag preceded by whitespace (the number is optional)
- (.*?)
captures the shortest possible string until...
- \</NP\> \s
the closing tag, followed by whitespace, is found
- ! stripPOS($1) . '/NP' !e
replaces the match with the POS-stripped version of the stuff in the middle, followed by an /NP "pseudo-POS"; the e modifier makes Perl evaluate the replacement as code
- gx;
g does it everywhere, and x allows whitespace inside the pattern to make it easier to read
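To make the whitespace caveat concrete, here is a minimal runnable sketch. It is a variant of the substitution above, not the exact code: it assumes the desired result is the stripped phrase with /NP attached directly to it, so the padding just inside the tags is matched outside the capture and the whitespace around the tags is left untouched. The sample sentence and the `$result` variable are made up for illustration.

```perl
use strict;
use warnings;

# Remove each "/TAG" suffix: a slash plus everything up to the next whitespace.
sub stripPOS {
    my $words = shift;
    $words =~ s!/\S*!!g;
    return $words;
}

# Hypothetical sample input, not taken from the original question.
my $sentence = 'I/PRP saw/V <NP1> the/D dog/N </NP> run/V';

# Copy the sentence, then substitute on the copy.
(my $result = $sentence) =~ s!
    \<NP\d*\> \s*     # opening tag, then any padding
    (.*?)             # shortest possible contents, captured
    \s* \</NP\>       # any padding, then the closing tag
! stripPOS($1) . '/NP' !egx;

print "$result\n";    # prints: I/PRP saw/V the dog/NP run/V
```

Run on the same input, the original pattern also consumes the whitespace on either side of the tags, so the words around the replacement end up glued to it; that is the missing whitespace mentioned at the top.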
Hope that helps, -- jkahn
Update (ca. 9p GMT-8): I've just realized that this code won't work if there are nested tags, e.g.:
<NP1> <NP2> The/D best/A one/N </NP> of/P <NP2> the/D Perl/N Monks/N </NP> </NP>
Anonymous Monk, does this happen in your input data? I will look at this and see if I can come up with a good answer if it does, or if I feel like it.....
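If nesting does occur, one possible approach is to work from the inside out: only match an NP whose contents contain no "<", and repeat until nothing changes. This is a rough sketch under assumptions, not a tested answer for real data. It assumes that an outer NP should swallow its inner ones (so the inner /NP pseudo-tags get stripped along with the other POS tags), and it uses the same whitespace handling assumption as a phrase with /NP attached directly. The `$nested` variable is just for illustration.

```perl
use strict;
use warnings;

# Remove each "/TAG" suffix: a slash plus everything up to the next whitespace.
sub stripPOS {
    my $words = shift;
    $words =~ s!/\S*!!g;
    return $words;
}

# The nested example from the update above.
my $nested = '<NP1> <NP2> The/D best/A one/N </NP> of/P '
           . '<NP2> the/D Perl/N Monks/N </NP> </NP>';

# Only match an NP whose contents hold no "<", i.e. an innermost one.
# Repeat until a pass makes no substitutions, so outer tags are handled
# once their children have been collapsed.
1 while $nested =~ s!
    \<NP\d*\> \s*
    ([^<]*?)
    \s* \</NP\>
! stripPOS($1) . '/NP' !egx;

print "$nested\n";    # prints: The best one of the Perl Monks/NP
```

The first pass collapses the two inner phrases, and the second pass collapses the outer one. If the inner boundaries need to survive, the replacement would have to mark them with something stripPOS does not remove.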
Replies are listed 'Best First'.
Re^2: How do I extract a text between some delimiters
by adrianh (Chancellor) on Sep 15, 2002 at 01:26 UTC
Re: Re: How do I extract a text between some delimiters
by Anonymous Monk on Sep 16, 2002 at 19:58 UTC