Help on building a search engine (design / algorithms / parsing / tokenising / database design) by bobtfish (Scribe)
on Jun 07, 2004 at 10:59 UTC
bobtfish has asked for the wisdom of the Perl Monks concerning the following question:
I've recently been working on a project to mirror various file sources from web sites. I have a number of mirror scripts that work out which files need to be mirrored and do the mirroring. Once mirrored, each file has a number of tests performed on it (to try and classify it), and then an entry is added to an SQL database for that file. The files can be of many different types, but the most common types are:
What I want to be able to do is a Google-style search across all the files' metadata and their contents (or, for program files, probably just the comments extracted from the file).
I know that building an inverted index of the data and searching that is probably the way to go, and I have an initial implementation that works quite nicely.
Therefore my first question: can anyone recommend a comment parser that I can throw an arbitrary(ish) language file at and get back all the comments?
Update: Mostly solved by matching comments as suggested by andyf. However if someone wants to point me at a packaged (CPAN?) solution that'd still be welcome.
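For illustration, the comment-matching approach suggested above can be sketched as a table of per-language regexes (shown here in Python as a language-neutral sketch; the language names and patterns are my assumptions, and a real parser would also have to cope with comment markers inside string literals, heredocs, and so on):

```python
import re

# Rough comment matchers for a few common syntaxes. This is deliberately
# naive: a marker inside a quoted string will be matched as a "comment".
COMMENT_PATTERNS = {
    'perl': [r'#[^\n]*'],
    'c':    [r'/\*.*?\*/', r'//[^\n]*'],
    'sql':  [r'--[^\n]*', r'/\*.*?\*/'],
}

def extract_comments(source, language):
    """Return every comment found in `source` for the given language."""
    comments = []
    for pattern in COMMENT_PATTERNS[language]:
        # DOTALL lets /* ... */ span multiple lines.
        comments.extend(re.findall(pattern, source, re.DOTALL))
    return comments
```

A proper CPAN-style solution would use a real grammar per language, but for indexing purposes a matcher like this gets most of the way there.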
At this point there is meta-information about this file:
So, I have three tables:
The code that does this looks like:
All reasonably simple, and reasonably sane (other than the fact that my tokeniser is a bit rubbish currently).
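To make the indexing step concrete, here is a minimal sketch of tokenise-and-insert against the three-table layout described above (Python with SQLite for illustration; the `files` and `tokens` table and column names are my assumptions, only `search_files_to_tokens` comes from the post):

```python
import re
import sqlite3
from collections import Counter

# Hypothetical schema modelled on the post's description of three tables.
DDL = """
CREATE TABLE files  (file_id INTEGER PRIMARY KEY, path TEXT);
CREATE TABLE tokens (token_id INTEGER PRIMARY KEY, token TEXT UNIQUE);
CREATE TABLE search_files_to_tokens (
    file_id INTEGER, token_id INTEGER, count INTEGER,
    PRIMARY KEY (file_id, token_id)
);
"""

def tokenise(text):
    # Deliberately simple, like the post's "a bit rubbish" tokeniser:
    # lowercase runs of word characters.
    return re.findall(r'\w+', text.lower())

def index_file(db, path, text):
    """Insert one file and its per-token counts into the inverted index."""
    cur = db.execute("INSERT INTO files (path) VALUES (?)", (path,))
    file_id = cur.lastrowid
    for token, count in Counter(tokenise(text)).items():
        db.execute("INSERT OR IGNORE INTO tokens (token) VALUES (?)",
                   (token,))
        token_id = db.execute("SELECT token_id FROM tokens WHERE token = ?",
                              (token,)).fetchone()[0]
        db.execute("INSERT INTO search_files_to_tokens VALUES (?, ?, ?)",
                   (file_id, token_id, count))
    return file_id
```

Storing the per-file count alongside each (file_id, token_id) pair is what makes the scoring query later on a single cheap lookup.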
Now, when I come to search, I allow the user to input a phrase such as:
(linux or sun) and log
I parse the search query into bits I know about: (, ), and, or, not; and bits I don't: linux, sun, log.
For anything I don't know about, I tokenise it (as above) then search for that token (the case of a non-existent token adds lots of mess, but is reasonably easy to handle, so won't be considered here).
Once I have a token_id for each token, I then build an SQL query along the lines of:
SELECT file_id FROM search_files_to_tokens WHERE token_id = $token_id
for each file, and chain them together with UNION, INTERSECT and EXCEPT
For example, if tokens for linux = 1, sun = 2 and log = 3, query will be:
(((SELECT file_id FROM search_files_to_tokens WHERE token_id = 1) UNION (SELECT file_id FROM search_files_to_tokens WHERE token_id = 2)) INTERSECT (SELECT file_id FROM search_files_to_tokens WHERE token_id = 3))
Note the grouping: the two UNIONed subqueries must be bracketed together, because in standard SQL INTERSECT binds more tightly than UNION, and without the brackets the query would mean linux or (sun and log).
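The query-to-SQL translation above can be sketched as a small recursive-descent parser (Python for illustration; `to_sql` and the pre-resolved `token_ids` lookup are my assumptions, not the post's actual code, and this sketch gives "and"/"or" equal left-to-right precedence and only handles "not" in the form "and not"):

```python
# One subquery per token, exactly as in the post.
SUBQ = "(SELECT file_id FROM search_files_to_tokens WHERE token_id = {})"

def to_sql(query, token_ids):
    """Translate e.g. '(linux or sun) and log' into nested
    UNION / INTERSECT / EXCEPT subqueries, given token -> token_id."""
    words = query.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def expr():
        nonlocal pos
        left = term()
        while pos < len(words) and words[pos] in ('and', 'or'):
            op = words[pos]
            pos += 1
            if op == 'and' and words[pos] == 'not':   # "and not" -> EXCEPT
                pos += 1
                left = "({} EXCEPT {})".format(left, term())
            else:
                sql_op = 'INTERSECT' if op == 'and' else 'UNION'
                left = "({} {} {})".format(left, sql_op, term())
        return left

    def term():
        nonlocal pos
        if words[pos] == '(':
            pos += 1
            inner = expr()
            pos += 1          # consume the closing ')'
            return inner
        word = words[pos]
        pos += 1
        return SUBQ.format(token_ids[word])

    return expr()
```

Because every operator immediately brackets its two operands, the emitted SQL is fully parenthesised and never relies on the database's UNION/INTERSECT precedence rules.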
This gives me back a pile of rows containing a file_id for files that match the search criteria. I can then do another simple (and fast) query which gives me back the counts for each token/file instance.
Taking that data, I can easily iterate through it and provide each file with an overall score. My scoring system is simple:
For each token in this file, add (number of times this token appears in this file / number of times token appears in the result set as a whole) and this is your score.
I then sort the result set by score and display.
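The scoring rule above amounts to a few lines over the (file_id, token_id, count) rows the second query returns (Python for illustration; `score_files` is my name for it, not the post's):

```python
from collections import defaultdict

def score_files(rows):
    """rows: (file_id, token_id, count) tuples for the matching files.

    Each token contributes (count in this file) / (count across the
    whole result set), so tokens common to every match weigh less than
    tokens concentrated in a few files."""
    token_totals = defaultdict(int)
    for _, token_id, count in rows:
        token_totals[token_id] += count
    scores = defaultdict(float)
    for file_id, token_id, count in rows:
        scores[file_id] += count / token_totals[token_id]
    # Sort best score first, ready for display.
    return sorted(scores.items(), key=lambda fs: fs[1], reverse=True)
```

This is a crude cousin of TF-IDF weighting: normalising by the token's total in the result set plays the role of an inverse document frequency.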
This generally works very well; however, for queries with common tokens (which produce a large result set), it can be quite slow (though still faster than a regex search). Also, search speed can change drastically with query order. If I start indexing comments in files / text file contents, it's going to get unusable.
I look forward to comments / questions from the Monastery, as my knowledge of complex (stateful) parsing is pretty limited, and I've never tried a search algorithm before.
Caveat: There is a command line (Perl) search interface, however I must also be able to create a PHP search interface. Therefore using CPAN modules to do the parsing etc. is good; using CPAN to do tokenisation or search just isn't going to happen unless the module is so simple (or easy to read) that re-implementing it in PHP (in a reasonably small amount of time) is viable.
Update: Added <readmore> tags as I realised this post is huge and added answer to the first question so people don't waste their time searching for me.