PerlMonks
Writing a Search Engine in Perl?

by techcode (Hermit)
on Aug 18, 2005 at 00:23 UTC ( id://484644 )

techcode has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks.

I've had this crazy idea in my head for a while: to make my own search engine. And when I say search engine, I mean the real thing: spider, indexer, etc.

Naturally I want to do it in Perl, and I wonder if it's suitable for such a thing (I spotted some modules like LWP::RobotUA that could be handy). I guess that the only time-critical part of the system would be querying the DB for search results - am I right?
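For what it's worth, here is a minimal sketch of the spidering side with LWP::RobotUA, which fetches robots.txt for each host and refuses disallowed URLs. The bot name, contact address, and the regex-based link extraction below are placeholders/toys - a real spider would use HTML::LinkExtor and URI from CPAN for the links:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy link extraction - a real spider should use HTML::LinkExtor + URI.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    while ($html =~ /<a\s[^>]*href="([^"]+)"/gi) {
        my $href = $1;
        $href = $base . $href if $href =~ m{^/};   # naive absolutizing
        push @links, $href if $href =~ m{^http};
    }
    return @links;
}

if (@ARGV) {    # crawl only when given a start URL on the command line
    require LWP::RobotUA;
    # Placeholder bot name and contact address.
    my $ua = LWP::RobotUA->new('example-crawler/0.1', 'me@example.com');
    $ua->delay(1);    # at least 1 minute between requests to one host

    my @queue = ($ARGV[0]);
    my %seen;
    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $resp = $ua->get($url);
        next unless $resp->is_success
            and $resp->content_type eq 'text/html';
        # ... hand $resp->decoded_content to the indexer here ...
        my ($host) = $url =~ m{^(https?://[^/]+)};
        push @queue, extract_links($resp->decoded_content, $host);
    }
}
```

Note that LWP::RobotUA's delay() is in minutes, not seconds, so even the default keeps the crawler polite.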

So I guess it's mostly up to the DB - and while I'm there, which DB should I use? Someone told me the best approach would be to use hash databases of some sort, with an inverted index. But in that case I can't even imagine how a phrase search would be done...
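For what it's worth, phrase search over an inverted index is usually handled by storing word *positions*, not just document IDs. A toy in-memory sketch in plain Perl hashes (the documents and IDs are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Positional inverted index: word => { doc_id => [positions] }
my %index;

sub add_doc {
    my ($doc_id, $text) = @_;
    my $pos = 0;
    for my $w (grep { length } split /\W+/, lc $text) {
        push @{ $index{$w}{$doc_id} }, $pos++;
    }
}

# Phrase search: intersect posting lists and require consecutive positions.
sub phrase_search {
    my @phrase = map { lc } @_;
    my $first = $index{ $phrase[0] } or return;
    my @hits;
  DOC: for my $doc (keys %$first) {
        for my $start (@{ $first->{$doc} }) {
            my $ok = 1;
            for my $i (1 .. $#phrase) {
                my $plist = $index{ $phrase[$i] }{$doc} or next DOC;
                $ok = grep { $_ == $start + $i } @$plist;
                last unless $ok;
            }
            if ($ok) { push @hits, $doc; next DOC }
        }
    }
    return sort { $a <=> $b } @hits;
}

add_doc(1, "Perl is a fine language for search engines");
add_doc(2, "search engines written in Perl");
my @docs = phrase_search("search", "engines");
print "@docs\n";   # prints "1 2" - both documents contain the phrase
```

The trick is the `$start + $i` check: the phrase matches only where each word sits exactly one position after the previous one.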

I tried to find general info on creating search engines, but most of what I found is along the lines of how to create a search engine for your own site using PHP. Does anyone know a good source of info on this?

Oh, and BTW - someone could ask why I would want to do such a thing. Well, there is only one - recently launched - real search engine for sites in Serbian (I'm from Serbia), and I think it can be done much better. Besides, it's interesting - and I could submit it as my Advanced School Dissertation, since I don't have any other idea.

Replies are listed 'Best First'.
Re: Writing a Search Engine in Perl?
by davido (Cardinal) on Aug 18, 2005 at 00:34 UTC

    There's no reason not to use Perl to create a search engine, other than the fact that it's been done before... but you've already answered that: not in Serbian.

    Perl definitely gives you access to many of the tools you would need: LWP::RobotUA, WWW::Mechanize, HTML parsers, XML parsers, DBI, and so on.

    Of course you'll also need a good database. PostgreSQL or something like that comes to mind. And a big hard drive. ;) The suite would have to consist of various types of crawlers, and then another script to handle the queries against the database.


    Dave

Re: Writing a Search Engine in Perl?
by spiritway (Vicar) on Aug 18, 2005 at 02:44 UTC

    I think this sort of thing is what makes Perl shine above many other languages - you can do some really amazing things plowing through text and pulling information out of it.

    You might want to check out a couple of O'Reilly books. One is called Spidering Hacks; the other is called Google Hacks. Both are useful. The Spidering Hacks book uses Perl exclusively for its examples. I recommend them both.

Re: Writing a Search Engine in Perl?
by lestrrat (Deacon) on Aug 18, 2005 at 02:48 UTC

    It's in Japanese, but http://xango.razil.jp is written mostly in Perl, using Xango, Senna, and MySQL. Granted, the underlying full-text search library is in C ;)

Re: Writing a Search Engine in Perl?
by biosysadmin (Deacon) on Aug 18, 2005 at 06:17 UTC
    Your idea isn't crazy at all, it's even been done before. Too bad this isn't 1994 and you're not going to grad school at Stanford (I think), otherwise you might have made a lot of money. :)

    I think that Perl is a very reasonable choice for a search engine of this kind. You'll want to look at text-encoding issues, but since you presumably speak Serbian, you probably already know more about those than I do.

    As far as examples go, check out this Perl.com article. Don't feel tied to the underlying algorithm, but working through their example would probably be informative.

      Yeah, too bad I'm not at Stanford - and in 1994 I was about 11 years old. I'm not saying it's impossible to do it at that age, but at the time I was enjoying SEGA Mega Drive II and Amiga games :)

      As for making a lot of money - well, who knows. As I said, the only other (real) search engine for Serbian sites (I don't count www.google.co.yu) started recently. Actually, a guy I know from classes/labs is its director/SEO in Serbia. They started a few years ago in nearby Slovenia...

      And somehow it's not quite it - and I believe it's written in Java... so naturally it can't be good :D Just kidding, of course.

      If such a thing were made, I believe a lot of press coverage would follow (I also had Communicology classes with some PR). It would be labeled Made in Serbia and could fit the government campaign: let's buy (in this case, use) domestic - if that's the right word in English anyway.

      If anyone else has some nice recommendations on sources about creating a search engine...

Re: Writing a Search Engine in Perl?
by inman (Curate) on Aug 18, 2005 at 07:42 UTC
Writing inverted index code in perl might be overkill
by dpavlin (Friar) on Aug 18, 2005 at 15:52 UTC
    I would suggest against writing your own inverted-index code. There are a lot of good full-text index engines out there... For a Perl-only implementation, see Plucene or KinoSearch.

    For a hybrid C/Perl combination I would suggest http://hyperestraier.sourceforge.net/, for which I plan to write a Perl-only P2P API (help appreciated).

    Somewhat off-topic, but there is also (shameless plug) http://pgfoundry.org/projects/pgestraier/ for querying a HyperEstraier index directly from PostgreSQL, to get the best of both worlds: structured data in PostgreSQL, joinable with full-text results from HyperEstraier. It will probably include the P2P API in the near future.


    2share!2flame...

      When I saw P2P I started thinking of a massive P2P effort to index the web... use spare processor power from computers around the world. With that kind of power you could build more interesting indexes of documents; I wonder, however, if it could truly rival Google's or Yahoo!'s indexers. Kind of like a dmoz.org or something. Just thinking out loud, don't mind me.


      ___________
      Eric Hodges
        The problem I see with a solution like this is that many would misuse it, depending on which part(s) of the system were shared. I'm thinking of reverse engineering that could reveal how to get ranked better.
        That's actually possible using the existing horizontal scalability of HyperEstraier.

        Just set up multiple servers which crawl separate parts of the web, and set up the search to query all nodes at once.
        The indexer can query the search index to find out whether some other indexer has already crawled a page (and optionally refresh the content if needed). That way you will have fresher pages with a bigger number of incoming links (which you can count and also use in page ranking - I hope this idea doesn't violate a Google patent).

        I don't have a pointer to a Perl solution for this (other than CPAN modules, which make every problem 90% done). On the other hand, with the current P2P architecture you can have multiple indexes (for e-mail, documents, etc.) and search over just some or all of them.


        2share!2flame...
      Eh nice link I made (if that is the right word in English anyway) ...

      Why do you think that writing an inverted index in Perl would be overkill? And in what sense would it be overkill?

      Sure, a C/Perl combination might be considered - I could finally put the C/C++ knowledge gained at advanced school to practical use...

        The only downside to a Perl-only version is speed. Of course, it depends on the size of your input data. However, on my laptop I have more data to index than any Perl-only solution can really handle (over 20 GB in various formats).

        I have some experience with WAIT (and some pending patches at http://svn.rot13.org/~dpavlin/svnweb/index.cgi/wait/log/trunk/ ), swish-e, and Xapian (another great engine, whose Perl bindings were updated a few days ago). I also experimented with the CLucene Perl bindings and finally ended up with HyperEstraier.

        I would suggest making a list of requirements for the search engine and then selecting the right one. My current list includes:

        • full text search
        • filter results by attributes (e.g. date, category...)
        • ability to update index content while running searches on it
        • wildcard support (or substring, even better!)
        • acceptable speed on projected amount of data
        The last point influences the choice very much. I would go with Plucene if the data size is small enough (or only for prototyping).

        Writing good parsers and analyzers for the input formats (do you want to rank bold words higher than the surrounding text?) and the front-end is hard enough without writing your own inverted-index implementation, especially since some very good ones already exist.
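As a toy illustration of that ranking point - weighting words inside <b> tags more heavily when building term frequencies. The regexes here stand in for a real HTML parser such as HTML::Parser, so treat this as a sketch only:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy analyzer: count each bold word twice, each plain word once.
# A production analyzer would use HTML::Parser, not regexes.
sub term_weights {
    my ($html) = @_;
    my %tf;
    # pull out bold runs first, removing them from the text
    while ($html =~ s{<b>(.*?)</b>}{}is) {
        my $bold = $1;
        $tf{lc $_} += 2 for $bold =~ /(\w+)/g;
    }
    # strip remaining tags, then count the surrounding text
    $html =~ s/<[^>]+>//g;
    $tf{lc $_} += 1 for $html =~ /(\w+)/g;
    return \%tf;
}

my $tf = term_weights('<p>plain <b>Bold</b> plain</p>');
print "$tf->{bold} $tf->{plain}\n";   # prints "2 2" - bold counted double
```

The same idea extends to <title>, <h1>, anchor text, and so on, each with its own weight.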


        2share!2flame...
Re: Writing a Search Engine in Perl?
by john_oshea (Priest) on Aug 18, 2005 at 20:44 UTC

    You might want to have a look at Xapian (and its associated front-end, Omega).

    It's C++, admittedly, but the source is available and there are Perl bindings. We use it at work, and it's quite happy indexing UTF-8 documents (including Japanese). There are also various language stemmers available - no Serbian at present, sadly - though I can't comment on how good they are, as we don't use them.

    If nothing else, it may give you insight / inspiration for your own efforts.
