PerlMonks  

Re^4: removing stop words

by fishbot_v2 (Chaplain)
on May 29, 2005 at 15:34 UTC ( [id://461528]=note )


in reply to Re^3: removing stop words
in thread removing stop words

Yes - Jarkko Hietaniemi's Regex::PreSuf does just that.

my $re = presuf( qw{ a about above across after afterwards } );
# yields: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?
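To show how a pattern like that might be applied, here is a minimal sketch that filters stop words out of a string. The pattern is copied by hand from the "yields" comment rather than produced by calling presuf(), and strip_stop_words is a hypothetical helper name, not something from Regex::PreSuf.

```perl
use strict;
use warnings;

# Pattern copied from the "yields" comment in the post; in real use it
# would be whatever presuf() returns for your stop list.
my $re = qr/a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?/;

# Hypothetical helper: drop every whitespace-separated token that is,
# in its entirety, one of the stop words.
sub strip_stop_words {
    my ($text) = @_;
    # \A and \z force a whole-token match, so "abandon" is kept
    my @kept = grep { !/\A$re\z/ } split ' ', $text;
    return join ' ', @kept;
}

print strip_stop_words('a walk across the field after lunch'), "\n";
# prints "walk the field lunch"
```

The anchoring matters: an unanchored pattern would also match the leading "a" of longer words, so the sketch tests each token as a whole instead.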

If we assume you aren't incurring the cost of building the regex each time (say you keep a stopwords file and a stopreg file and rebuild the latter whenever the former changes, or simply stat the stopwords file and rebuild from the main program), then you get significant savings:
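The stat-and-rebuild idea could be sketched roughly as follows. Everything here is an assumption for illustration: load_stop_regex is a hypothetical helper, the word list is assumed to hold one word per line, and a plain quoted alternation stands in for the presuf() call so the sketch needs only core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: return a compiled regex for the words listed in
# $words_file (one per line), reusing the pattern stored in $cache_file
# unless the word list has been modified more recently than the cache.
sub load_stop_regex {
    my ($words_file, $cache_file) = @_;

    my $words_mtime = (stat $words_file)[9];
    die "Cannot stat $words_file: $!" unless defined $words_mtime;
    my $cache_mtime = (stat $cache_file)[9];    # undef if no cache yet

    my $pattern;
    if (defined $cache_mtime && $cache_mtime >= $words_mtime) {
        # Cache is at least as new as the word list: reuse it
        open my $in, '<', $cache_file or die "Cannot read $cache_file: $!";
        local $/;                               # slurp the whole file
        $pattern = <$in>;
        close $in;
    }
    else {
        open my $in, '<', $words_file or die "Cannot read $words_file: $!";
        chomp(my @words = <$in>);
        close $in;
        @words = grep { length } @words;        # skip blank lines
        die "No stop words in $words_file" unless @words;

        # Stand-in for presuf(@words): a plain alternation of quoted words
        $pattern = join '|', map { quotemeta } @words;

        open my $out, '>', $cache_file or die "Cannot write $cache_file: $!";
        print {$out} $pattern;
        close $out or die "Cannot close $cache_file: $!";
    }

    return qr/\b(?:$pattern)\b/;
}
```

A caller would do something like `my $re = load_stop_regex('stopwords.txt', 'stopreg.txt');` once at startup (both file names are placeholders). Modification times have one-second resolution on many filesystems, so an edit made in the same second as the cache write would go unnoticed by this check.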

          Rate   reg  pre1 presuf
reg     33.1/s    --  -34%   -59%
pre1    50.4/s   53%    --   -37%
presuf  80.6/s  144%   60%     --

pre1 is my simple algorithm from upthread, reg is a straight alternation, and presuf is presuf(). I used the English stoplist from Lingua::EN::StopWords (about 200 words) and a 4000-word text.
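For anyone wanting to try a comparison of this kind, a rough sketch with the core Benchmark module follows. It is not the script behind the numbers above: the six-word list and the repeated sentence are illustrative stand-ins for the Lingua::EN::StopWords list and the 4000-word text, the pre1 case is omitted, and the factored pattern is copied from the earlier "yields" comment instead of being generated by presuf(). strip_with is a hypothetical helper.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative data only, much smaller than what the post used
my @stop = qw(a about above across after afterwards);
my $text = join ' ',
    ('the cat ran across the yard after a bird flew above') x 400;

# "reg": straight alternation of the quoted words
my $alt = join '|', map { quotemeta } @stop;
my $reg = qr/\b(?:$alt)\b/;

# "presuf": factored pattern copied from the post's "yields" comment
my $presuf = qr/\b(?:a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?)\b/;

# Hypothetical helper: return a copy of $text with all matches removed
sub strip_with {
    my ($re) = @_;
    (my $copy = $text) =~ s/$re//g;
    return $copy;
}

# Run each case for at least one CPU second and print a comparison table
cmpthese(-1, {
    reg    => sub { strip_with($reg) },
    presuf => sub { strip_with($presuf) },
});
```

Expect different ratios from the 2005 figures: perl 5.10 and later optimize alternations of literal strings internally, so the gap between the two patterns may be smaller on a current perl.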
