PerlMonks  

Re: Why? - Writing inverted index code in perl might be overkill

by dpavlin (Friar)
on Aug 19, 2005 at 17:18 UTC


in reply to Why? - Writing inverted index code in perl might be overkill
in thread Writing a Search Engine in Perl?

The only downside of a Perl-only version is speed. Of course, that depends on the size of your input data. However, on my laptop I have more data that I want to index than any Perl-only solution can really handle (over 20 GB in various formats).

I have some experience with WAIT (and some pending patches at http://svn.rot13.org/~dpavlin/svnweb/index.cgi/wait/log/trunk/ ), swish-e, and Xapian (another great engine, whose Perl bindings were updated a few days ago). I also experimented with the CLucene Perl bindings and finally ended up with HyperEstraier.

I would suggest making a list of requirements for the search engine and then selecting the right one. My current list includes:

  • full text search
  • filter results by attributes (e.g. date, category...)
  • ability to update index content while running searches on it
  • wildcard support (or substring, even better!)
  • acceptable speed on projected amount of data
The last point influences the choice very much. I would go with Plucene if the data size is small enough (or just for prototyping).

Writing good parsers and analyzers for input formats (do you want to rank bold words higher than the surrounding text?) and a front-end is hard enough without writing your own inverted index implementation, especially since some very good ones already exist.
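To make the analyzer question concrete, here is a minimal sketch of an in-memory inverted index in which words inside `<b>...</b>` count more than the surrounding text. It is only an illustration, not how any of the engines named above work: the weights, the regex-based "parser", and the function names (`analyze`, `build_index`, `search`) are all assumptions made for this example, and a real analyzer would need a proper HTML parser, stemming, and on-disk storage.

```python
import re
from collections import defaultdict

# Hypothetical weights: a word in bold counts three times as much as plain text.
PLAIN_WEIGHT = 1.0
BOLD_WEIGHT = 3.0

BOLD_RE = re.compile(r"<b>(.*?)</b>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def analyze(html):
    """Return {term: weight} for one document."""
    weights = defaultdict(float)
    # Every word occurrence counts once at the plain weight...
    for term in tokenize(TAG_RE.sub(" ", html)):
        weights[term] += PLAIN_WEIGHT
    # ...and occurrences inside <b>...</b> are topped up to the bold weight.
    for bold_text in BOLD_RE.findall(html):
        for term in tokenize(TAG_RE.sub(" ", bold_text)):
            weights[term] += BOLD_WEIGHT - PLAIN_WEIGHT
    return weights


def build_index(docs):
    """docs is {doc_id: html}; the result maps term -> {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, html in docs.items():
        for term, weight in analyze(html).items():
            index[term][doc_id] = weight
    return index


def search(index, query):
    """AND query: return ids of documents containing all terms, best score first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= set(posting)
    scored = [(sum(p[doc_id] for p in postings), doc_id) for doc_id in hits]
    # Highest score first; ties are broken by document id for stable output.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Even this toy version leaves out index updates during searches, wildcard matching, and attribute filters, which are the items on the requirements list that an existing engine already handles.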


2share!2flame...


Re^2: Why? - Writing inverted index code in perl might be overkill
by Anonymous Monk on Apr 09, 2007 at 16:28 UTC
    In my experience, Plucene was not very good at handling your third requirement, "ability to update index content while running searches on it." The code that handles file locking is prone to die instead of wait. Not good for live websites. The following ASCII depicts my expression upon discovering this:

    8-[

    YMMV
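For comparison, the "wait instead of die" behaviour can be sketched as a lock helper that polls until a timeout rather than failing on the first conflict. This is not Plucene's locking code or API. It is a generic illustration using advisory `flock` locks on a Unix-like system, and the names `acquire_lock` and `release_lock`, the lock-file path, and the timeout values are assumptions made for this example.

```python
import fcntl
import time


def acquire_lock(path, timeout=10.0, poll=0.1):
    """Wait up to `timeout` seconds for an exclusive lock on `path`.

    Returns the open file handle that holds the lock. Raises TimeoutError
    only after the whole waiting period has passed.
    """
    handle = open(path, "a")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt: raises BlockingIOError if someone else holds it.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return handle
        except BlockingIOError:
            if time.monotonic() >= deadline:
                handle.close()
                raise TimeoutError(f"could not lock {path} within {timeout}s")
            time.sleep(poll)  # wait a little and try again


def release_lock(handle):
    """Release the lock and close the underlying file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

A writer would wrap its index update in `acquire_lock(...)` and `release_lock(...)`, so a concurrent process is delayed briefly instead of crashing a live request.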
