How do I extract a text between some delimiters

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a bunch of sentences in a file and I want to match the noun phrases, which sit between the delimiters <NP#>...</NP>. Inside each noun phrase I want to delete all the tags and add a single tag "/NP" to the last word of the phrase. This should be done for each sentence...

<S> Who/WP is/VBZ <NP1> the/DT author/NN </NP> of/IN <NP2> the/DT book/NN </NP> ,/,/*comma* "/"/*quote* <NP3> The/DT Iron/NNP Lady/NNP </NP> :/: <NP4> A/DT Biography/NNP of/IN Margaret/NNP Thatcher/NNP </NP> "/"/*quote* ?/./*end-of-sentence*</S>

Result:

<S> Who/WP is/VBZ  the author/NP of/IN  the book/NP ,/,  The Iron Lady/NP :/:   A  Biography of Margaret Thatcher/NP  "/"...

So far, I am getting nothing from the match, and nothing from the replacement either:

 if ($sentence =~ / (<NP\d*>) ([a-zA-Z0-9.-_]+)\/([A-Z]+) ([^ ]+) (<\/NP>) /) {
    $text="NounPhrase";    # to test the match to replace 
    $result =~ s/$question/$text/;
    print "$result";};

Thanks in advance for your help

Re: How do I extract a text between some delimiters by jkahn (Friar) on Sep 13, 2002 at 18:23 UTC
I'm sure there's a CPAN module that does generic tag extraction, but it might not conform to the form you're using here (I assume you're testing/building a part-of-speech tagger). Barring a CPAN module (untested code follows, please tell me if this doesn't work -- it may be missing some whitespace in the substitution, for example):

    sub stripPOS {
        my $words = shift;
        # rip out any / plus following characters, up to the
        # first space
        $words =~ s!/\S*!!g;
        return $words;
    }

    $sentence =~ s! \s \<NP\d\> (.*?) \</NP\> \s ! stripPOS($1) . '/NP' !egx;

Let's break that out:

`s!` begins the substitution.
`\s \<NP\d\> \s` looks for an `<NP#>` tag between spaces
`(.*?)` looks for the shortest possible string until...
`\</NP\> \s` you can find the closing tag
`! stripPOS($1) . '/NP' !e` replace it with the POS-stripped version of the stuff in the middle, followed by an `/NP` "pseudo-POS"
`gx;` do it everywhere, and make it easier to read

Hope that helps,
-- jkahn

Update (ca. 9p GMT-8): I've just realized that this code won't work if there are nested tags, e.g.:

    <NP1> <NP2> The/D best/A one/N </NP> of/P <NP2> the/D Perl/N Monks/N </NP> </NP>

Anonymous Monk, does this happen in your input data? I will look at this and see if I can come up with a good answer if it does, or if I feel like it.....
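For readers who want to try the substitution idea, here is a minimal sketch of a whitespace-preserving variant. It is not jkahn's exact code: the names `strip_pos` and `collapse_np` are made up for illustration, the sample sentence is shortened from the thread, and it assumes the `<NPn> ... </NP>` spans are never nested and never cross a line break.

```perl
use strict;
use warnings;

# Remove every "/TAG" suffix (a slash plus the non-space characters after it).
sub strip_pos {
    my ($words) = @_;
    $words =~ s{/\S*}{}g;
    return $words;
}

# Collapse each <NPn> ... </NP> span into its bare words, with a single
# "/NP" attached to the last word. Assumes the spans are not nested.
sub collapse_np {
    my ($sentence) = @_;
    # The \s* on both sides of the capture keeps the padding spaces
    # out of $1, so "/NP" lands directly on the last word.
    $sentence =~ s{<NP\d*>\s*(.*?)\s*</NP>}{ strip_pos($1) . '/NP' }ge;
    return $sentence;
}

my $in = '<S> What/WP was/VBD <NP5> the/DT monetary/JJ value/NN </NP> '
       . 'of/IN <NP7> 1989/CD </NP> ?/.</S>';
print collapse_np($in), "\n";
# prints: <S> What/WP was/VBD the monetary value/NP of/IN 1989/NP ?/.</S>
```

Because the tags and their padding are matched outside the capture group, the spaces that separate the phrase from its neighbours are left untouched.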
Re^2: How do I extract a text between some delimiters by adrianh (Chancellor) on Sep 15, 2002 at 01:26 UTC
AM might want to consider extract_tagged() in Text::Balanced if s/he needs to cope with recursive NP tags. Of course - they could always go the whole hog and write a proper parser with Parse::RecDescent... or wait for perl6 rules to arrive :-)
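A rough sketch of what the `extract_tagged()` suggestion could look like on jkahn's nested example. The tag patterns passed in are assumptions chosen to fit this thread's `<NPn>`/`</NP>` format, and the trailing word `rocks/V` is invented so that there is a remainder to show. Note that the text has to start at the opening tag (leading whitespace is skipped by default).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# jkahn's nested example, plus one invented trailing word.
my $text = '<NP1> <NP2> The/D best/A one/N </NP> of/P '
         . '<NP2> the/D Perl/N Monks/N </NP> </NP> rocks/V';

# In list context extract_tagged returns:
# (extracted, remainder, prefix, opening tag, inner text, closing tag).
my ($extracted, $remainder, $prefix, $open, $inner, $close)
    = extract_tagged($text, '<NP\d+>', '</NP>');

if (defined $extracted && length $extracted) {
    print "outer tag: $open\n";       # the outermost opening tag
    print "inner:     $inner\n";      # still contains the nested <NP2> spans
    print "remainder: $remainder\n";  # whatever follows the balanced span
}
else {
    print "no balanced <NPn> ... </NP> span at the start of the text\n";
}
```

The inner text can then be processed again, recursively if needed, which is the part a single regex substitution cannot do reliably.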
Re: Re: How do I extract a text between some delimiters by Anonymous Monk on Sep 16, 2002 at 19:58 UTC
Hello Monk. It does not happen in my input data. There are no nested tags in my sentences. I already ran your code, and it works for the examples I have tried so far. Thanks
Re: How do I extract a text between some delimiters by fglock (Vicar) on Sep 13, 2002 at 18:09 UTC
If you can be sure that it will always be in the `<NPx> .. </NP>` format, you can use this pseudo-code: split the sentence on `<NP\d>` and `</NP>`; remove tags on the even elements; join the sentence elements again.

    @parts = split(/<NP.*?>|<\/NP>/, $sentence);
    $even = 0;
    for $part (@parts) {
        if ($even) {
            $part =~ s/\/\w\w//g;
            $part .= '/NP';
        }
        $even = ! $even;
    }
    $result = join('', @parts);

Untested!
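The split-and-rejoin idea can be sketched as a small self-contained script. This is an illustration rather than fglock's exact code: the function name `collapse_np_by_split` is hypothetical, tags of any length are stripped, and trailing whitespace is trimmed so that `/NP` attaches to the last word. As in the post, it assumes well-formed, non-nested tags.

```perl
use strict;
use warnings;

sub collapse_np_by_split {
    my ($sentence) = @_;

    # Without a capture group, split drops the tags themselves, so the
    # pieces alternate: outside, inside, outside, ...
    my @parts = split /<NP\d+>|<\/NP>/, $sentence;

    for my $i (0 .. $#parts) {
        next unless $i % 2;          # odd indexes were inside <NPn> ... </NP>
        $parts[$i] =~ s{/\S*}{}g;    # drop every /TAG, whatever its length
        $parts[$i] =~ s{\s+$}{};     # trim so /NP attaches to the last word
        $parts[$i] .= '/NP';
    }
    return join '', @parts;
}

print collapse_np_by_split(
    '<S> What/WP was/VBD <NP5> the/DT monetary/JJ value/NN </NP> of/IN '
  . '<NP7> 1989/CD </NP> ?/.</S>'
), "\n";
```

The spaces that surrounded the removed tags stay in the output, so double spaces appear where a tag used to be, much like in the desired result shown in the question.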
Re: Re: How do I extract a text between some delimiters by fglock (Vicar) on Sep 17, 2002 at 15:43 UTC
OK. Just replace `$part =~ s/\/\w\w//g;` with `$part =~ s/\/\w{2,3}//g;`. jkahn's solution doesn't have this problem. Using his solution it would be: `$part =~ s!/\S*!!g;`
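To see the difference between the three patterns side by side, here is a small comparison on one of the noun phrases from the thread. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $phrase = 'the/DT Nobel/NNP Peace/NNP Prize/NNP';

# Exactly two word characters after the slash: leaves the third letter behind.
(my $two_only = $phrase) =~ s/\/\w\w//g;

# Two or three word characters after the slash.
(my $two_or_three = $phrase) =~ s/\/\w{2,3}//g;

# A slash plus any run of non-space characters.
(my $any_length = $phrase) =~ s!/\S*!!g;

print "$two_only\n";      # the NobelP PeaceP PrizeP
print "$two_or_three\n";  # the Nobel Peace Prize
print "$any_length\n";    # the Nobel Peace Prize
```

The `\S*` form does not depend on tag length at all, so it would also cope with longer or non-alphabetic tags if the tag set contains any.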
Re: Re: How do I extract a text between some delimiters by Anonymous Monk on Sep 17, 2002 at 15:36 UTC
Hi Monk. I tested the code and it works well when the tag has only two characters, but a tag can also have three characters. The following was the output for one sentence:

<S> What/WP was/VBD the monetary value /NP of/IN the NobelP PeaceP PrizeP /NP in/IN 1989 /NP ?/./end-of-sentence</S>

The original input was:

<S> What/WP was/VBD <NP5> the/DT monetary/JJ value/NN </NP> of/IN <NP6> the/DT Nobel/NNP Peace/NNP Prize/NNP </NP> in/IN <NP7> 1989/CD </NP> ?/./end-of-sentence</S>

I tried to change the code, but I could not get it to work for both two and three characters. How do I handle the case of two "OR" three characters? Thanks

