
Strip user-defined words with regexp

by Marcello (Hermit)
on Mar 04, 2004 at 14:25 UTC (#333862)

Marcello has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am looking for a way to get the first two words of a sentence. A word is defined as [A-Za-z0-9]+

The regexp must be able to handle arbitrary sentences, including newlines, \0, \t, etc.

I currently use this regexp, but I am sure there is an easier one:

    my $message = "\n +_ ABC1_\n2 3 4";
    if ($message =~ m/([^A-Z0-9]*)([A-Z0-9]*)([^A-Z0-9]*)([A-Z0-9]*)(.|\n)*/i) {
        print "[$2][$4]";
    }

TIA

Marcel

Update: The term "phrase" is probably better. I want to extract the first two words of a phrase (user input, so it probably contains every possible character :) ), no matter how many lines or sentences the input spans.

Replies are listed 'Best First'.
Re: Strip user-defined words with regexp
by Limbic~Region (Chancellor) on Mar 04, 2004 at 14:38 UTC
    Marcello,
    You do not say what marks the end of a sentence. In English, there are many ways to do this (period, question mark, exclamation mark, etc.). Also, your regex does not look like it should work according to your specifications. The code below will also break on the sentence "How are you today Dr. Smith?"

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $msg = "One bright day in the middle of the night,\n";
        $msg .= "two dead men got up to fight.\n";
        $msg .= "Back to Back they faced each other,\n";
        $msg .= "drew their swords and shot each other.\n";
        $msg .= "A deaf police man heard this noise,\n";
        $msg .= "came and killed those two dead boys.\n";
        $msg .= "If you don't believe this lie is true,\n";
        $msg .= "ask the blind man - he saw it too!\n";

        $msg =~ tr/\n//d;

        for my $sentence ( split /[.!?]/ , $msg ) {
            if ( $sentence =~ /^\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)\s+/ ) {
                print "$1 $2\n";
            }
        }
        __END__
        One bright
        Back to
        A deaf
        If you

    Cheers - L~R
Re: Strip user-defined words with regexp
by rnahi (Curate) on Mar 04, 2004 at 14:37 UTC

    I would do it this way. I don't know if it deserves a high grade, but it gets the job done :).

        my $count = 0;
        while ( $message =~ /([A-Za-z0-9]+)/g ) {
            last if $count++ > 1;
            print "$1\n";
        }

      rnahi,
      "but it gets the job done :)."

      Sorry to nitpick, but actually it doesn't. The first two words of each sentence were wanted.
      L~R

      Update: Edited to clarify (the added text was in italics in the original post). I am still wrong, because I misinterpreted the OP's requirements.

        Did you try it?

        It prints the first 2 words as defined by the OP.

        I think my original post was a bit too unclear.

        I am looking for at most the first two words. So it may be zero, one or two.

        I've modified his code to:

            my $firstWord  = undef;
            my $secondWord = undef;
            my $i = 0;
            while ($message =~ /([A-Za-z0-9]+)/g) {
                if ($i == 0) {
                    $firstWord = $1;
                }
                elsif ($i == 1) {
                    $secondWord = $1;
                    last;
                }
                $i++;
            }

        which does exactly what I was looking for. I only tried to do it in one regexp.

        Thanks, Marcel
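
For readers who want the "one regexp" form asked about above, here is a minimal sketch that relies on `m//g` returning all captures in list context. The helper name `first_two_words` is made up for illustration and does not appear in the thread; note that, unlike the `while` loop with `last`, this version scans the whole message before discarding the extra words.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns at most the first two
# words of $message, where a word is a run of [A-Za-z0-9].
sub first_two_words {
    my ($message) = @_;

    # In list context, m//g returns every captured word in order.
    my @words = $message =~ /([A-Za-z0-9]+)/g;

    # Keep only the first two; shorter lists are left untouched.
    splice @words, 2 if @words > 2;
    return @words;
}

my @found = first_two_words("\n +_ ABC1_\n2 3 4");
print join( '', map { "[$_]" } @found ), "\n";    # prints [ABC1][2]
```

The returned list has zero, one or two elements, which matches the "at most the first two words" requirement stated above.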
Re: Strip user-defined words with regexp
by BrowserUk (Pope) on Mar 04, 2004 at 16:01 UTC

    This might meet the reqs?

        my( $word1, $word2 ) = ( grep $_, split /[^A-Za-z0-9]+/, $message )[0,1];

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      Out of curiosity: why not just
      my ($word1, $word2) = split(/[^A-Za-z0-9]+/, $message);
      ?

      Marcel

        Given the OP's example input:

        With the grep

            $message = "\n +_ ABC1_\n2 3 4";
            print join '|', ( grep $_, split /[^A-Za-z0-9]+/, $message )[0,1];

            ABC1|2

        Without

            $message = "\n +_ ABC1_\n2 3 4";
            print join '|', split /[^A-Za-z0-9]+/, $message;

            |ABC1|2|3|4

        You'll notice the null leading element.

        The list slice is pretty redundant, but it does make it obvious that you are only wanting the first two.
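
One caveat with `grep $_` that the thread does not mention: it filters on truth, so a word consisting of the single character "0" would be dropped along with the empty leading field. The sketch below, using a made-up helper name `split_first_two`, filters on `length` instead; treat it as one possible variation rather than the poster's code.

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of non-word characters and keep the
# first two non-empty fields. `grep { length }` keeps a field like "0",
# which `grep $_` would discard because "0" is false in Perl.
sub split_first_two {
    my ($message) = @_;
    my @fields = grep { length } split /[^A-Za-z0-9]+/, $message;
    splice @fields, 2 if @fields > 2;
    return @fields;
}

print join( '|', split_first_two("\n +_ ABC1_\n2 3 4") ), "\n";    # ABC1|2
print join( '|', split_first_two("0 items left") ), "\n";          # 0|items
```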


Re: Strip user-defined words with regexp
by Happy-the-monk (Canon) on Mar 04, 2004 at 14:32 UTC
    What makes up or defines a sentence? The newline? The dot if it's not followed by a word character?
      Sorry, the term phrase is probably better. I am looking for the first two words in a phrase, the phrase can end with anything and can contain newlines, etc etc.
Re: Strip user-defined words with regexp
by halley (Prior) on Mar 04, 2004 at 14:32 UTC
    This kinda sounds like homework. You might try typing the following command at your command prompt.
    perldoc perlre
    Check out the \w match symbol, and ask yourself why you're using all these * in a regex when the problem as stated says +.

    --
    [ e d @ h a l l e y . c c ]

      I knew somebody was going to say this...

      It's not. I have an application which has to decide what to do based on the first two words of a phrase. This phrase can be anything; it might even be only one word. Examples:

          my $message = "test one";
          my $message = "test";
          my $message = "_$ test...";
          my $message = "_$ TEST..1..";
          my $message = "_$\nTEST1.\n.1.2.3";

      BTW: \w is not helping me here, since I do not want the underscore character.
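
To illustrate the point about the underscore, here is a small sketch comparing `\w+` with the explicit character class from the question. The sample string is invented for the comparison.

```perl
use strict;
use warnings;

my $text = "foo_bar baz";

# \w also matches "_", so "foo_bar" is treated as a single word.
my @with_w = $text =~ /(\w+)/g;

# The explicit class from the question splits at the underscore.
my @with_class = $text =~ /([A-Za-z0-9]+)/g;

print "@with_w\n";        # foo_bar baz
print "@with_class\n";    # foo bar baz
```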
        Limbic~Region has almost got it then:

            if ( $message =~ /([a-zA-Z0-9]+)[^a-zA-Z0-9]*([a-zA-Z0-9]*)/ ) {
                print "$1 $2\n";
            }
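
A rough way to check this single-regex form against sample phrases like those listed earlier is sketched below. The wrapper name `two_words_one_regex` is hypothetical, and the samples are written with single quotes or an escaped `$` so that nothing is interpolated. One detail worth noting is that the second capture comes back as an empty string, not undef, when the phrase has only one word.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the single-regex approach: returns the first
# word and, if present, the second. Returns an empty list when the message
# contains no word at all.
sub two_words_one_regex {
    my ($message) = @_;
    if ( $message =~ /([A-Za-z0-9]+)[^A-Za-z0-9]*([A-Za-z0-9]*)/ ) {
        my ( $w1, $w2 ) = ( $1, $2 );

        # $w2 is the empty string when only one word exists.
        return length($w2) ? ( $w1, $w2 ) : ($w1);
    }
    return;
}

for my $message ( 'test one', 'test', '_$ test...', '_$ TEST..1..',
    "_\$\nTEST1.\n.1.2.3" )
{
    my @words = two_words_one_regex($message);
    print scalar(@words), ": @words\n";
}
```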
