Perfect Indexer & Search Engine

by YAFZ (Pilgrim)
on Jun 17, 2003 at 11:24 UTC

YAFZ has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I know there are many cool indexers, search engines, etc. out there, but I don't know whether any of them is perfect for my system. My system is not very unique: it is a usual hybrid system which stores text, HTML, XML, etc. content in the filesystem, and metadata about these files (which group they belong to, which template script will be used to render them, etc.) in the database.

For example assume that I've got this kind of file hierarchy:

/10/1/1345.html
/10/1/1346.html
/11/7/6544.html

1345, 1346 and 6544 are ID numbers for the content, and assume that I've got DB records like this:

ID    Type  Course  Week
1345  1     10      1
6544  5     11      7
1346  5     10      1

So users don't see 1345.html or 1346.html; according to the content type, their URL is something like this:

http://www.blabla.com/ContentPage?ID=1345
http://www.blabla.com/DiscussionPage?ID=1346
etc.

This means that the Indexer & Search system must take that into account: it is not a simple 'WORD -> THIS_FILE' structure but something that needs more transformations, according to rules that I'll provide to the system.
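
To make the kind of transformation I mean concrete, here is a minimal sketch (the %page_for_type table and its contents are purely illustrative; the real rules will come from the database):

    use strict;
    use warnings;

    # hypothetical mapping from the Type column to the page script
    # that renders that kind of content (illustration only)
    my %page_for_type = (
        1 => 'ContentPage',
        5 => 'DiscussionPage',
    );

    # given the metadata of one indexed file, build the URL users actually see
    sub public_url {
        my ($id, $type) = @_;
        my $page = $page_for_type{$type}
            or die "No page script known for content type $type\n";
        return "http://www.blabla.com/$page?ID=$id";
    }

    print public_url(1345, 1), "\n";   # http://www.blabla.com/ContentPage?ID=1345
    print public_url(1346, 5), "\n";   # http://www.blabla.com/DiscussionPage?ID=1346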

Since Perl is the one to rule text processing, manipulation, etc., I'll be glad if there are fellow monks who have encountered a similar situation and found a perfect solution. Please enlighten me with your experience and vision...

Replies are listed 'Best First'.
Re: Perfect Indexer & Search Engine
by ViceRaid (Chaplain) on Jun 17, 2003 at 11:53 UTC

    Let's say you're considering a type of search index which assigns a set of values for different categories to each document. In the simple case, the categories you're using to categorise documents are words (or perhaps stems). Given a short document text like:

    "Perl on Tuesday, Python on Wednesday, Rain on Thursday, Perl on Friday."

    You might get a document index that looks something like:

    # normalised all the index values to be in the range 0..1
    # removed "stop words"
    $docindex = {
        'Perl'      => 1,
        'Python'    => 0.5,
        'Rain'      => 0.5,
        'Tuesday'   => 0.5,
        'Wednesday' => 0.5,
        'Thursday'  => 0.5,
        'Friday'    => 0.5,
    };
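
    For what it's worth, here's a minimal sketch of how such an index might be built (naive word splitting and a tiny hard-coded stop list; a real indexer would do stemming and smarter tokenisation):

        use strict;
        use warnings;

        my $text = "Perl on Tuesday, Python on Wednesday, "
                 . "Rain on Thursday, Perl on Friday.";

        # toy stop-word list; real lists are much longer
        my %stop = map { $_ => 1 } qw(on the a an and of);

        # count occurrences of each non-stop word
        my %count;
        for my $word ($text =~ /(\w+)/g) {
            next if $stop{ lc $word };
            $count{$word}++;
        }

        # normalise so the most frequent term scores 1
        my ($max) = sort { $b <=> $a } values %count;
        my $docindex = { map { $_ => $count{$_} / $max } keys %count };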

    With one of these indexes for every document, stored somewhere, you'd have the kind of model that might be used, say in a vector space search engine. Once you've got indexes of this sort, there's no reason why you can't add keys representing categories other than the words within a document, such as "belonging to Course 11". Add keys to the per-document index that would never be words, but can be used internally to limit searches. For example:

    $docindex = {
        'Perl'       => 1,
        'Python'     => 0.5,
        ...
        '~~Course11' => 1,
    };
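
    To give a rough idea, searching with such keys could look something like this (a sketch only: simple additive scoring over a couple of toy per-document indexes; the '~~' prefix just keeps the category keys from colliding with real words):

        # toy per-document indexes of the kind described above
        my %indexes = (
            'doc_a' => { 'Perl' => 1,   'Rain'   => 0.5, '~~Course11' => 1 },
            'doc_b' => { 'Perl' => 0.5, 'Python' => 1 },   # not in course 11
        );

        sub search {
            my ($indexes, $terms, $required_key) = @_;
            my %score;
            for my $doc (keys %$indexes) {
                my $idx = $indexes->{$doc};
                # skip documents lacking the required category key
                next if defined $required_key && !$idx->{$required_key};
                $score{$doc} += $idx->{$_} || 0 for @$terms;
            }
            return grep { $score{$_} }
                   sort { $score{$b} <=> $score{$a} } keys %score;
        }

        my @hits = search(\%indexes, [ 'Perl', 'Rain' ], '~~Course11');
        print "@hits\n";   # doc_a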

    But ... all that said, since you've got a meaningful file hierarchy already (/$course/$week/$item), I'd strongly recommend you look into something like HTdig and use its restrict and exclude parameters to control where in the site a search is conducted. Basically, HTDig can be set to create multiple separate search indexes (perhaps one per course?) or create one big index, but then limit per-search results by path. Its documentation should help.
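
    For instance, restricting a search to one course through the standard htsearch CGI front end is just a matter of passing the course directory in the restrict input parameter. A small sketch of building such a query URL (host and script path are placeholders, obviously):

        use strict;
        use warnings;
        use URI::Escape qw(uri_escape);

        # build an htsearch query limited to one course's directory
        sub course_search_url {
            my ($words, $course) = @_;
            return sprintf
                'http://www.blabla.com/cgi-bin/htsearch?words=%s&restrict=%s',
                uri_escape($words), uri_escape("/$course/");
        }

        print course_search_url('regular expressions', 10), "\n";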

    HTH
    ViceRaid

    update: rephrased for clarity
      Well, I was considering HTdig but wasn't sure about its database-integration capabilities. After your recommendations I'll concentrate on this software and see if it is up to the task. Thanks for your comments.

        Sorry, I didn't understand your question as clearly as Zaxo. htDig's relational database integration capabilities are pretty much nil, AFAIK.

        Still, it might be easier to index the end product - the rendered pages - with an existing product, rather than indexing the database itself and the XML/txt/HTML sources in a roll-your-own system. Then you wouldn't have to worry about reconstructing the URLs from the search results, and since you've already got the category->URL mapping, you can build a user search interface that limits searches by category by restricting them to a URL path.

        As an aside, it's also quite hard to do good free-text searches within an RDBMS - MySQL's FULLTEXT indexes are pretty limited. On the site I'm working on at the moment, we've ditched a search system built around Oracle's ConText / Intermedia search tool in favour of an htDig system indexing the rendered pages within a CMS.

        cheers
        ViceRaid

Re: Perfect Indexer & Search Engine
by Zaxo (Archbishop) on Jun 17, 2003 at 12:43 UTC

    Perfection is elusive :-)

    If I understand your question, you want to know how to construct paths from DB entries chosen from a CGI query. That is just a matter of building up a string. Select the DB rows that match your query, according to your rules, and build the paths from the results.

    From your example data, it looks like the path is built from a DB record as "$course/$week/$id.$ext". Is the Type associated with the particular URL?
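
    Something along these lines, for example (a sketch only: I'm assuming DBI, a table named content with the columns you showed, and ".html" as the extension):

        use strict;
        use warnings;
        use DBI;

        # assumed connection details; adjust for your database
        my $dbh = DBI->connect('dbi:mysql:mysite', 'user', 'password',
                               { RaiseError => 1 });

        # select the rows matching the query rules, then build the paths
        my $sth = $dbh->prepare(
            'SELECT ID, Type, Course, Week FROM content WHERE Course = ?'
        );
        $sth->execute(10);

        while (my $row = $sth->fetchrow_hashref) {
            my $path = "/$row->{Course}/$row->{Week}/$row->{ID}.html";
            print "$path\n";   # e.g. /10/1/1345.html
        }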

    If the URL is significant to the search, consider making a directory for each, and putting a DirectoryIndex /cgi-bin/searchscript.pl line in each directory's .htaccess file. The searchscript.pl file can grab the URL it was called under.
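
    Inside searchscript.pl, grabbing that URL is just a matter of reading the CGI environment, roughly like this (a sketch; the parameter name and the course/week split are my assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI qw(param header);

        # the URL the script was invoked under, e.g. /10/1/
        my $request_uri = $ENV{REQUEST_URI} || '/';
        my ($course, $week) = $request_uri =~ m{^/(\d+)/(\d+)/};

        my $query = param('q') || '';
        print header('text/plain');
        print "Searching course $course, week $week for '$query'\n";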

    I'd like to see your whole design, what you show here seems slightly clunky.

    After Compline,
    Zaxo

      A solution without headaches (after it's implemented, of course) is perfect (until it causes new headaches, of course; well, that's what 'scalability' is for, isn't it? ;-).

      Yes, the Type is associated with some specific URL. The system knows which template (read script, special actions, etc.) to use on the content according to this Type information.

      I'm sorry for the clunky description of my design :) I tried to be as clear as possible, but that was the best I could compose at the time of writing.

      As you correctly put it, my problem can be described as 'knowing how to construct paths from DB entries chosen from a search engine query'. And after considering the words of the monks (including yours) I think I'll evaluate HTdig and see what I can do.
Re: Perfect Indexer & Search Engine
by Maclir (Curate) on Jun 17, 2003 at 12:44 UTC
    You may want to look into Swish-E. It has some very powerful search-result rewriting capabilities, plus a ready-made Perl front end.
      I've just read an article about Swish-E. It looks like a nice little indexer and searcher (being able to index and search man pages is an especially great feature), but I'm not sure whether it can handle thousands of files summing up to hundreds of MB of data (and there are also questions about deleting and modifying files, reindexing performance, etc.).
Re: Perfect Indexer & Search Engine
by belg4mit (Prior) on Jun 17, 2003 at 16:06 UTC
    This means that the Indexer & Search system must take that into account: it is not a simple 'WORD -> THIS_FILE' structure but something that needs more transformations, according to rules that I'll provide to the system.
    What's the point, then, eh? Do you enjoy duplicating code? Let the search engine do the work; just use a search engine that can do HTTP indexing instead of indexing the local filesystem, maybe http://www.perlfect.com/freescripts/search/

    --
    I'm not belgian but I play one on TV.

      Thanks for pointing me to this compact search engine written in Perl. I'll take a look at it. It would be great if I could find a detailed comparison of HTdig, SWISH-E, Perlfect Search, etc.
