PerlMonks  

Module for text/phrase ranking

by menth0l (Monk)
on Feb 22, 2011 at 10:44 UTC
menth0l has asked for the wisdom of the Perl Monks concerning the following question:

Is there a good, fast module that can scan text for specified phrases and count their occurrences? It would also be great if it could automatically stem phrases (or provide fuzzy matching). I need it to rank some texts using given keywords. Any ideas?

Re: Module for text/phrase ranking
by moritz (Cardinal) on Feb 22, 2011 at 10:50 UTC

    If you want to scan the same text body multiple times for different keywords, it makes sense to build a proper index for them.

    If that is your case, KinoSearch and Plucene might be worth looking into, as well as ElasticSearch.

    I know that KinoSearch does stemming, and I suspect the others do too.

Re: Module for text/phrase ranking
by planetscape (Canon) on Feb 22, 2011 at 13:11 UTC
Re: Module for text/phrase ranking
by roboticus (Canon) on Feb 22, 2011 at 13:29 UTC

    menth0l:

    Well, the recommended approach is to first visit CPAN to see what bits you can leverage. A couple simple searches based on interesting words in your requirements should lead you to some potentially-useful modules:

    Search Term(s) | Module              | Notes
    ---------------+---------------------+------------------------------------------
    stem           | WordNet::Similarity | It has a stem module
    scan text      | Text::Scan          | Claims: "Fast search for very large numbers of keys in a body of text." and the synopsis shows an example of using it to count words.
    ranking        | Tie::Hash::Rank     | Hmmm... this looks interesting. I'll have to install and check this one out.

    Then you should review the module(s). You need to find out which one(s) are suitable to your needs, and determine what capabilities they have. You'll want to write a couple "try it out" scripts to verify that the modules work the way you expect them to and get a feel for how to use them.

    Next, you can design your script by figuring out a rough skeleton of how to accomplish the job you want to do. Keep in mind the capabilities of the module(s) you've selected, so you can leverage them as much as possible. During this stage, be sure to generate some test cases and determine what you want the output to be. That'll give you a way to test your code and can help prevent running around in circles changing the output layout, etc.

    Finally, write/test/debug your script.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

    Update: Filled in lower-right cell, repaired broken HTML.

Re: Module for text/phrase ranking
by chrestomanci (Priest) on Feb 22, 2011 at 13:46 UTC

    We had a similar question a couple of weeks back. On that occasion the supplicant had a multi-gigabyte file in a static format that they wanted to search quickly.

    In your case, the simplest solution for searching for and counting single words is to use grep, e.g.:

    grep -c 'some_word' /path/to/filename.txt
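
    One caveat: grep -c reports the number of matching lines, so a word that appears twice on the same line is counted once. A minimal sketch of counting every occurrence instead, assuming a grep that supports -o (GNU and BSD grep both do). The function name count_occurrences is only illustrative:

```shell
#!/bin/sh
# count_occurrences WORD FILE
# Prints how many times WORD occurs in FILE, ignoring case.
#   -o  print each match on its own line, so several hits on one line all count
#   -i  ignore case
#   -w  match whole words only ("perl" does not match inside "perlish")
#   -F  treat WORD as a fixed string, not a regular expression
count_occurrences() {
    grep -o -i -w -F -- "$1" "$2" | wc -l | tr -d ' '
}
```

    It would be called as: count_occurrences 'some_word' /path/to/filename.txt. This does no stemming or fuzzy matching; it only counts literal whole-word hits.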

    This approach is simple and reasonably fast. If you want more speed, you would need to construct an index, and the best way to do that would be to use a database. In the thread referenced above, erix documented how to import the data into a PostgreSQL database. It took several minutes to import and index the data, but once that was done, searches took around a tenth of a millisecond.

    If your data is not structured, then you would have more work to import it into a database, but it can still be done.
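
    Short of a database, the grep approach can also be stretched to the ranking you asked about. The following is only a sketch under a few assumptions: the keywords live in a file with one keyword or phrase per line (no blank lines, since an empty pattern matches everything), the score is simply the total number of hits, and rank_files is a made-up name:

```shell
#!/bin/sh
# rank_files KEYWORDS_FILE TEXT_FILE...
# KEYWORDS_FILE holds one keyword or phrase per line.
# Prints "<total hits> <file>" for each text, highest score first.
rank_files() {
    keywords=$1
    shift
    for f in "$@"; do
        # -f reads all patterns from the keywords file;
        # -o emits one output line per (non-overlapping) hit, which wc -l counts
        hits=$(grep -o -i -w -F -f "$keywords" -- "$f" | wc -l | tr -d ' ')
        printf '%s %s\n' "$hits" "$f"
    done | sort -rn
}
```

    For example: rank_files keywords.txt doc1.txt doc2.txt. Every text is rescanned on each run, so this only makes sense for modest amounts of data; for repeated queries an index remains the better choice.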
