PerlMonks  

Removing Stopwords from a String

by mr.nick (Chaplain)
on Jan 06, 2001 at 21:28 UTC ( #50257=snippet )

Description: A stopword list and a short subroutine, findwords, that returns the words of a string with all the stopwords (a, for, and, not, while, etc.) removed; repeated words are returned only once. Very useful for finding the juicy bits inside a whole English phrase.

Thanx salvadors for the hints!

my @stopwords=qw(
                 a
                 about
                 above
                 according
                 across
                 actually
                 adj
                 after
                 afterwards
                 again
                 against
                 all
                 almost
                 alone
                 along
                 already
                 also
                 although
                 always
                 among
                 amongst
                 an
                 and
                 another
                 any
                 anyhow
                 anyone
                 anything
                 anywhere
                 are
                 aren't
                 around
                 as
                 at
                 be
                 became
                 because
                 become
                 becomes
                 becoming
                 been
                 before
                 beforehand
                 begin
                 beginning
                 behind
                 being
                 below
                 beside
                 besides
                 between
                 beyond
                 billion
                 both
                 but
                 by
                 can
                 can't
                 cannot
                 caption
                 co
                 company
                 corp
                 corporation
                 could
                 couldn't
                 did
                 didn't
                 do
                 does
                 doesn't
                 don't
                 down
                 during
                 each
                 eg
                 eight
                 eighty
                 either
                 else
                 elsewhere
                 end
                 ending
                 enough
                 etc
                 even
                 ever
                 every
                 everyone
                 everything
                 everywhere
                 except
                 few
                 fifty
                 first
                 five
                 for
                 former
                 formerly
                 forty
                 found
                 four
                 from
                 further
                 had
                 has
                 hasn't
                 have
                 haven't
                 he
                 he'd
                 he'll
                 he's
                 hence
                 her
                 here
                 here's
                 hereafter
                 hereby
                 herein
                 hereupon
                 hers
                 herself
                 him
                 himself
                 his
                 how
                 however
                 hundred
                 i
                 i'd
                 i'll
                 i'm
                 i've
                 ie
                 if
                 in
                 inc
                 indeed
                 instead
                 into
                 is
                 isn't
                 it
                 it's
                 its
                 itself
                 last
                 later
                 latter
                 latterly
                 least
                 less
                 let
                 let's
                 like
                 likely
                 ltd
                 made
                 make
                 makes
                 many
                 maybe
                 me
                 meantime
                 meanwhile
                 might
                 million
                 miss
                 more
                 moreover
                 most
                 mostly
                 mr
                 mrs
                 much
                 must
                 my
                 myself
                 namely
                 neither
                 never
                 nevertheless
                 next
                 nine
                 ninety
                 no
                 nobody
                 none
                 nonetheless
                 noone
                 nor
                 not
                 nothing
                 now
                 nowhere
                 of
                 off
                 often
                 on
                 once
                 one
                 one's
                 only
                 onto
                 or
                 other
                 others
                 otherwise
                 our
                 ours
                 ourselves
                 out
                 over
                 overall
                 own
                 per
                 perhaps
                 rather
                 recent
                 recently
                 same
                 seem
                 seemed
                 seeming
                 seems
                 seven
                 seventy
                 several
                 she
                 she'd
                 she'll
                 she's
                 should
                 shouldn't
                 since
                 six
                 sixty
                 so
                 some
                 somehow
                 someone
                 something
                 sometime
                 sometimes
                 somewhere
                 still
                 stop
                 such
                 taking
                 ten
                 than
                 that
                 that'll
                 that's
                 that've
                 the
                 their
                 them
                 themselves
                 then
                 thence
                 there
                 there'd
                 there'll
                 there're
                 there's
                 there've
                 thereafter
                 thereby
                 therefore
                 therein
                 thereupon
                 these
                 they
                 they'd
                 they'll
                 they're
                 they've
                 thirty
                 this
                 those
                 though
                 thousand
                 three
                 through
                 throughout
                 thru
                 thus
                 to
                 together
                 too
                 toward
                 towards
                 trillion
                 twenty
                 two
                 under
                 unless
                 unlike
                 unlikely
                 until
                 up
                 upon
                 us
                 used
                 using
                 very
                 via
                 ve
                 was
                 wasn't
                 we
                 we'd
                 we'll
                 we're
                 we've
                 well
                 were
                 weren't
                 what
                 what'll
                 what's
                 what've
                 whatever
                 when
                 whence
                 whenever
                 where
                 where's
                 whereafter
                 whereas
                 whereby
                 wherein
                 whereupon
                 wherever
                 whether
                 which
                 while
                 whither
                 who
                 who'd
                 who'll
                 who's
                 whoever
                 whole
                 whom
                 whomever
                 whose
                 why
                 will
                 with
                 within
                 without
                 won't
                 would
                 wouldn't
                 yeah
                 yes
                 yet
                 you
                 you'd
                 you'll
                 you're
                 you've
                 your
                 yours
                 yourself
                 yourselves
                 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
                );

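# Lookup table: lowercased stopword => 1
my %stop=map { lc $_ => 1 } @stopwords;

# Return the words of $string that are not stopwords, each at most once
# (compared case-insensitively), in order of first appearance.
sub findwords {
  my $string = shift;
  my (@ok, %seen);
  # A word is a run of word characters and apostrophes, e.g. "don't"
  while ($string =~ /((\w|')+)/g) {
    push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
  }
  return @ok;
}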

1;
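
A minimal usage sketch for the snippet above. The stoplist here is abbreviated and the input sentence is made up for illustration; with the full list the result may differ.

```perl
use strict;
use warnings;

# Abbreviated stoplist for illustration only (the snippet uses the full list)
my @stopwords = qw( a and for is it not the while );
my %stop = map { lc $_ => 1 } @stopwords;

# Same logic as the snippet: keep each non-stopword once, in first-seen order
sub findwords {
    my $string = shift;
    my (@ok, %seen);
    while ($string =~ /((\w|')+)/g) {
        push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
    }
    return @ok;
}

# Hypothetical input: mixed case, a repeated word, and an apostrophe
my @words = findwords("The quick fox is not the lazy dog's friend, and the Fox knows it");

# prints: quick fox lazy dog's friend knows
print join(' ', @words), "\n";
```

Note that "Fox" is dropped as a repeat of "fox" because the %seen check is case-insensitive, and "dog's" survives as one word because the pattern accepts apostrophes.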
Re: Removing Stopwords from a String
by salvadors (Pilgrim) on Jan 06, 2001 at 23:02 UTC

    Wow! That's a lot of regular expressions going on there...

    Personally I'd do something more akin to:

    my @stopwords = qw/ i'd add all my stop words in here /;
    my %stop = map { lc $_ => 1 } @stopwords;

    sub findwords {
      my $string = shift;
      my (@ok, %seen);
      while ($string =~ /((\w|')+)/g) {
        push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
      }
      return @ok;
    }

    My tests show this as coming out about 2 orders of magnitude faster, and it also copes better with apostrophized words that aren't in the stop list.

    Tony
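
For anyone who wants to measure the speed difference themselves, here is a rough sketch using the core Benchmark module. The regex-heavy version being compared against is not shown in this thread, so by_regex below is only a guessed stand-in (one substitution per stopword); the stoplist is abbreviated and the test text is made up. Timings will vary by machine and by list size.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Abbreviated stoplist for illustration only
my @stopwords = qw( a about after all and for not the while );
my %stop = map { lc $_ => 1 } @stopwords;

# Hash-lookup approach from the reply above: one pass over the string
sub by_hash {
    my $string = shift;
    my (@ok, %seen);
    while ($string =~ /((\w|')+)/g) {
        push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
    }
    return @ok;
}

# ASSUMPTION: stand-in for a regex-heavy approach, not the original code.
# It makes one pass over the string for every stopword in the list.
sub by_regex {
    my $string = shift;
    $string =~ s/\b\Q$_\E\b//gi for @stopwords;
    my %seen;
    return grep { !$seen{lc $_}++ } $string =~ /[\w']+/g;
}

# Hypothetical test text
my $text = join ' ', ("Not all of the words after the break are stopwords") x 50;

# Run each candidate for at least 1 CPU second and print a comparison table
cmpthese(-1, {
    hash  => sub { my @w = by_hash($text) },
    regex => sub { my @w = by_regex($text) },
});
```

The per-stopword loop rescans the whole string once for each list entry, so its cost grows with the length of the stoplist, while the hash version scans the string once regardless.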

      Hey, how do I use it? I mean, if I have an array containing whole strings and I want to remove these stopwords from it, how would I use this subroutine? Sorry, I am new to Perl. Please reply.
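
One way to answer the question above, as a sketch: call findwords once per array element with map. The stoplist is abbreviated and the @lines data is hypothetical.

```perl
use strict;
use warnings;

# Abbreviated stoplist for illustration only
my %stop = map { lc $_ => 1 } qw( a an and for from i in is of the to want );

# findwords as posted in this thread
sub findwords {
    my $string = shift;
    my (@ok, %seen);
    while ($string =~ /((\w|')+)/g) {
        push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
    }
    return @ok;
}

# Hypothetical input: an array of whole strings
my @lines = (
    "I want to remove the stopwords from a string",
    "The string is in an array of strings",
);

# Rebuild each string from its non-stopwords; @lines itself is left unchanged
my @cleaned = map { join(' ', findwords($_)) } @lines;

# prints:
#   remove stopwords string
#   string array strings
print "$_\n" for @cleaned;
```

Because %seen is declared inside findwords, duplicates are only dropped within a single string; "string" appears in both results here. To keep the words as lists instead of rejoining them, `map { [ findwords($_) ] } @lines` gives one array reference per input string.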
Re: Removing Stopwords from a String
by bobf (Monsignor) on Mar 05, 2009 at 01:46 UTC

    CPAN to the rescue...

    From Lingua::StopWords:

    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('en');

    my @words = qw( i am the walrus goo goo g'joob );

    # prints "walrus goo goo g'joob"
    print join ' ', grep { !$stopwords->{$_} } @words;

    From Lingua::EN::StopWords:

    use Lingua::EN::StopWords qw(%StopWords);

    my @words = ...;

    # Print non-stopwords in @words
    print join " ", grep { !$StopWords{$_} } @words;

Re: Removing Stopwords from a String
by Anonymous Monk on Nov 12, 2009 at 07:44 UTC
    This code didn't work :( Is there any other way to remove stop words? I am trying to remove stop words from some French data.
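
A possible reason the posted code struggles with French, and a sketch of one workaround. The pattern in findwords treats apostrophes as part of a word, so an elision such as "l'école" stays a single token and never matches a stoplist entry; accented letters also need character (not byte) strings to count as word characters. The version below is an adaptation, not the original author's code: findwords_fr and its tiny stoplist are made up for illustration, and a real list (for example from the Lingua::StopWords module mentioned above) would be needed in practice.

```perl
use strict;
use warnings;
use utf8;                           # this source contains UTF-8 literals
use feature 'unicode_strings';      # Unicode rules for \w and lc
binmode STDOUT, ':encoding(UTF-8)';

# ASSUMPTION: tiny hand-picked French stoplist, for illustration only
my %stop = map { lc($_) => 1 } qw(
    à au aux ce d de des du elle en est et il j je l la le les
    n ne pas qu que qui s un une
);

# Hypothetical French variant of findwords
sub findwords_fr {
    my $string = shift;
    my (@ok, %seen);
    # \w+ without the apostrophe splits elisions: "l'école" -> "l", "école"
    while ($string =~ /(\w+)/g) {
        push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
    }
    return @ok;
}

# prints: élève école
print join(' ', findwords_fr("L'élève n'est pas à l'école")), "\n";
```

Text read from a file or socket should be decoded to characters first (for example by opening the file with an `:encoding(UTF-8)` layer), otherwise accented letters arrive as raw bytes and will not match the stoplist.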
