PerlMonks  

How do I extract a text between some delimiters

by Anonymous Monk
on Sep 13, 2002 at 17:54 UTC ( [id://197671]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a bunch of sentences in a file and I want to match the noun phrases, which sit between the delimiters <NP#>...</NP>. Inside each noun phrase I want to delete all the tags and add a single tag "/NP" to the last word of the phrase. This should happen for each sentence, for example:
<S> Who/WP is/VBZ <NP1> the/DT author/NN </NP> of/IN <NP2> the/DT book/NN </NP> ,/,/*comma* "/"/*quote* <NP3> The/DT Iron/NNP Lady/NNP </NP> :/: <NP4> A/DT Biography/NNP of/IN Margaret/NNP Thatcher/NNP </NP> "/"/*quote* ?/./*end-of-sentence*</S>
Result:
<S> Who/WP is/VBZ the author/NP of/IN the book/NP ,/, The Iron Lady/NP :/: A Biography of Margaret Thatcher/NP "/"...
So far I am getting nothing from the match, and nothing from the replacement either:
if ($sentence =~ / (<NP\d*>) ([a-zA-Z0-9.-_]+)\/([A-Z]+) ([^ ]+) (<\/NP>) /) {
    $text = "NounPhrase";    # to test the match to replace
    $result =~ s/$question/$text/;
    print "$result";
};


Thanks in advance for your help

Replies are listed 'Best First'.
Re: How do I extract a text between some delimiters
by jkahn (Friar) on Sep 13, 2002 at 18:23 UTC
    I'm sure there's a CPAN module that does generic tag extraction, but it might not conform to the form you're using here (I assume you're testing/building a part-of-speech tagger).

    Barring a CPAN module (untested code follows, please tell me if this doesn't work -- it may be missing some whitespace in the substitution, for example):

    sub stripPOS {
        my $words = shift;
        # rip out any / plus following characters, up to the
        # first space
        $words =~ s!/\S*!!g;
        return $words;
    }

    $sentence =~ s! \s \<NP\d*\> (.*?) \</NP\> \s ! stripPOS($1). '/NP' !egx;

    Let's break that out:

    • s!

      begins the substitution.

    • \s \<NP\d*\>

      looks for an <NP#> tag preceded by whitespace

    • (.*?)

      looks for the shortest possible string until...

    • \</NP\> \s

      you can find the closing tag

    • ! stripPOS($1) . '/NP' !e

      replace it with the POS-stripped version of the stuff in the middle, followed by an /NP "pseudo-POS"

    • gx;

      g: do it everywhere; x: let the pattern contain whitespace, which makes it easier to read
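A minimal sketch of the same substitution with the whitespace handled, in case the untested version above eats the spaces around the tags. The wrapper name collapse_np and the choice to let \s* absorb the padding inside the tags are assumptions for illustration, not part of the original post; stripPOS is as posted. The sample string is a fragment of the question's sentence.

```perl
use strict;
use warnings;

# As in the post: remove a slash plus the non-space characters after it.
sub stripPOS {
    my $words = shift;
    $words =~ s!/\S*!!g;
    return $words;
}

# Hypothetical wrapper name, introduced only for this sketch.
sub collapse_np {
    my ($sentence) = @_;
    $sentence =~ s{
        <NP\d*> \s*     # opening tag and the padding after it
        (.*?)           # shortest possible run of text
        \s* </NP>       # padding and the closing tag
    }{ stripPOS($1) . '/NP' }egx;
    return $sentence;
}

my $in  = 'Who/WP is/VBZ <NP1> the/DT author/NN </NP> of/IN <NP2> the/DT book/NN </NP>';
my $out = collapse_np($in);
print "$out\n";    # Who/WP is/VBZ the author/NP of/IN the book/NP
```

The spaces outside the tags are never matched here, so the neighbouring words stay separated. Like the original, this assumes the tags are not nested.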

    Hope that helps, -- jkahn

    Update (ca. 9p GMT-8): I've just realized that this code won't work if there are nested tags, e.g.:

    <NP1> <NP2> The/D best/A one/N </NP> of/P <NP2> the/D Perl/N Monks/N </NP> </NP>

    Anonymous Monk, does this happen in your input data? I will look at this and see if I can come up with a good answer if it does, or if I feel like it.....

      AM might want to consider extract_tagged() in Text::Balanced if s/he needs to cope with recursive NP tags.
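A rough sketch of what that could look like, assuming the tags are always <NP#> ... </NP> and that only the outermost phrase needs pulling out. The trailing word rocks/V is made up for the example; the open and close tags are passed to extract_tagged() as regex strings.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Nested example from the update above, plus one invented trailing word.
my $text = '<NP1> <NP2> The/D best/A one/N </NP> of/P '
         . '<NP2> the/D Perl/N Monks/N </NP> </NP> rocks/V';

# In list context the first two return values are the balanced
# substring (outermost tags included) and whatever text follows it.
my ($extracted, $remainder) = extract_tagged($text, '<NP\d*>', '</NP>');

print "NP:   $extracted\n";
print "rest: $remainder\n";
```

extract_tagged() only looks at the start of the string (after optional whitespace), so walking a whole sentence would mean calling it repeatedly on the remainder and skipping the untagged words in between.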

      Of course - they could always go the whole hog and write a proper parser with Parse::RecDescent... or wait for perl6 rules to arrive :-)

      Hello Monk
      It does not happen in my input data. There are no nested tags in my sentences. I have already run your code, and it works for some of the examples I have tried so far.
      Thanks
Re: How do I extract a text between some delimiters
by fglock (Vicar) on Sep 13, 2002 at 18:09 UTC

    If you can be sure that it will always be in the  <NPx> .. </NP> format, you can use this pseudo-code:

    split the sentence on  <NP\d> and  </NP>; remove tags on the even elements; join the sentence elements again.

    @parts = split(/<NP.*?>|<\/NP>/, $sentence);
    $even = 0;
    for $part (@parts) {
        if ($even) {
            $part =~ s/\/\w\w//g;
            $part .= '/NP';
        }
        $even = ! $even;
    }
    $result = join('', @parts);

    Untested!

      Ok. Just replace this

      $part =~ s/\/\w\w//g;

      by this

      $part =~ s/\/\w{2,3}//g;

      jkahn's solution doesn't have this problem. Using his substitution, it would be:

      $part =~ s!/\S*!!g;
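Putting the pieces together, a runnable sketch of the split approach with the tag-length fix. The sub name collapse_np_split is hypothetical, and the whitespace trimming is an extra step added here (an assumption about the wanted output) so that the result reads value/NP rather than value /NP. The input is the sentence quoted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper name, introduced only for this sketch.
sub collapse_np_split {
    my ($sentence) = @_;
    # No capture groups, so split drops the tags themselves.
    my @parts = split /<NP\d*>|<\/NP>/, $sentence;
    for my $i (0 .. $#parts) {
        next unless $i % 2;               # odd indexes hold the text inside an NP
        $parts[$i] =~ s!/\S*!!g;          # drop every /TAG, whatever its length
        $parts[$i] =~ s/^\s+|\s+$//g;     # trim the padding left next to the tags
        $parts[$i] .= '/NP';
    }
    return join '', @parts;
}

my $input = '<S> What/WP was/VBD <NP5> the/DT monetary/JJ value/NN </NP> '
          . 'of/IN <NP6> the/DT Nobel/NNP Peace/NNP Prize/NNP </NP> '
          . 'in/IN <NP7> 1989/CD </NP> ?/./*end-of-sentence*</S>';
my $output = collapse_np_split($input);
print "$output\n";
# <S> What/WP was/VBD the monetary value/NP of/IN the Nobel Peace Prize/NP in/IN 1989/NP ?/./*end-of-sentence*</S>
```

The odd/even bookkeeping relies on every opening tag having a closing tag, the same assumption the pseudo-code makes.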
      Hi Monk
      I tested the code and it works well when the tag has only two characters, but a tag can also have three characters.

      The following was the output for one sentence:

      <S> What/WP was/VBD the monetary value /NP of/IN the NobelP PeaceP PrizeP /NP in/IN 1989 /NP ?/./*end-of-sentence*</S>

      The original input was:
      <S> What/WP was/VBD <NP5> the/DT monetary/JJ value/NN </NP> of/IN <NP6> the/DT Nobel/NNP Peace/NNP Prize/NNP </NP> in/IN <NP7> 1989/CD </NP> ?/./*end-of-sentence*</S>
      I tried to change the code, but it did not work for both two and three characters. How do I handle the case of two "OR" three characters?
      Thanks
