Help on building a search engine (design / algorithms / parsing / tokenising / database design)

bobtfish has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I've recently been working on a project to mirror various file sources from web sites. I have got a number of mirror scripts that work out the files that need to be mirrored, and do the mirroring. Once mirrored, each file has a number of tests performed on it (to try and classify it), and then an entry is entered into an SQL database for that file. The files can be of many different types, however the most common types are:

Text (English language) files
Language Source files (c, perl, php are the most common, although shell, c#, java etc do exist also)
Compressed files (gzip, bzip2)
Archive files (tar, gzip, cpio)

The number of files is likly to grow pretty large over the lifetime of this project (think a couple of million)..

What I want to be able to do is a Google style search across all the files meta data and their contents (or probably just comments extracted from the file for program files).

I know that building an inverse index of the data and searching that is probably the way to go, and I have an initial implementation that works quite nicely.

Therefore my first question: Can anyone recommend a comment parser that I can throw an arbitary(ish) language file and get back all the comments?

Update: Mostly solved by matching comments as suggested by andyf. However if someone wants to point me at a packaged (CPAN?) solution that'd still be welcome.

At this point there is meta-information about this file:

file_id - A unique id for this file
file_source - An integer that links to another table giving details of there the file came from etc.
md5hash - The md5 hash of the file
description - Some of our file sources provide their own meta-data about the files (in current cases, this is just a textual / HTML description)
comment - User comments about this file (initially blank)

I have built a search engine for just comments / descriptions, however I'm not sure my design would stand up to full contents or a massive amount of data (however it's a shedload faster than the database's regex search currently).

So, I have three tables:

files - contains file_id and comment/description
search_files_tokens - contains normalised tokens for search and token_id
search_files_to_tokens - contains a token_id, a file_id and a count of the number of times this token appears in this file.

What I do is as follows:

For each file, take comments and descriptions and contatenate them into $comment.
Split on whitespace into individual tokens.
Normalise each token (remove punctuation, tr/a-z/A-Z/ etc)
For each new token, create a token in the database (returning a unique token_id).
For each token found in the comment, insert a row into search_files_to_tokens.

I can now lookup a unique token, and find the files that contain it most often.

The code that does this looks like:

my $known_tokens;
while ($comment = get_next_file_comments()) {
    my $tokens;
    my @possible_tokens = split(/\s+/s, $comment);
    foreach (@possible_tokens) {
        my $hs = HTML::Strip->new(); #HTML::Strip segfaults without ne
+w instance each time
        $_=$hs->parse($_);
        /[\w\d\.\/\\]+/) or next;
        tr/a-z/A-Z/;
    s/[\s\',\!\"\\xa3\#\$\%\^\&\*\\=\+\:\'\`\(\)\{\}\[\]\.\/\\\-\_]//g
+;
        next if ($_ eq "");
        if (!defined($tokens->{$_})) {
            $tokens->{$_}=1;
        } else {
            $tokens->{$_}++;
        }
    }
    foreach (keys(%{$tokens})) {
        my $token_id;
        if (defined($known_tokens->{$_})) {
            $token_id = $known_tokens->{$_};
        } else {
            my $res = $gettokenid_sth->execute($_);
            ($gettokenid_sth->rows() > 0) and do {
                ($token_id) = $gettokenid_sth->fetchrow_array();
            };
            if (!defined($token_id) or !$token_id) {
                $addtoken_sth->execute($_);
                $gettoken_sth->execute($_);
                ($token_id) = $gettokenid_sth->fetchrow_array();
            }
            $known_tokens->{$_}=$token_id;
        }
        $addtokeninfile_sth->execute($file_id, $token_id, $tokens->{$_
+});
    }
}
[download]

All reasonably simple, and reasonably sane (other than the fact that my tokeniser is a bit rubbish currently)..

Now, when I come to search, I allow the user to input a phrase such as:

(linux or sun) and log

I parse the search query into bits I know about; (,),and,or,not and bits I don't; linux, sun, log

For anything I don't know about, I tokenise it (as above) then search for that token (the case of a non-existant token adds lots of mess, but is reasonably easy to handle, so won't be considered here).

Once I have a token_id for each token, I then build an SQL query along the lines of:

SELECT file_id FROM search_files_to_tokens WHERE token_id = $token_id

for each file, and chain them together with UNION, INTERSECT and EXCEPT

For example, if tokens for linux = 1, sun = 2 and log = 3, query will be:

((SELECT file_id FROM search_files_to_tokens WHERE token_id = 1) UNION ((SELECT file_id FROM search_files_to_tokens WHERE token_id = 2)) INTERSECT (SELECT file_id FROM search_files_to_tokens WHERE token_id = 3))

This gives me back a pile of rows containing a file_id for files that match the search criteria. I can then do another simple (and fast) query which gives me back the counts for each token/file instance.

Taking that data, I can easilly iterate through it and provide each file with an overall score. My scoring system is simple:

For each token in this file, add (number of times this token appears in this file / number of times token appears in the result set as a whole) and this is your score.

I then sort the result set by score and display.

This generally works very well, however for queries with common tokens (that produce a large result set), it can be quite slow (but still faster than regex search). Also, search speed can change drastically with query order. If I start indexing comments in files / text file contents, it's going to get unusable..

Questions:

Is my database schema sane, (or, sane but dumb)?
Is my search algorithm sane / can it be optimised easilly (or again, just fine, but dumb)?
Am I going about this totally the wrong way?
Is this the optimum method, and I should just get faster hardware?

I know that a simple optimisation would be to throw out the top five to ten tokens (going to do this, just not implemented yet) and maybe throwing away any token that only occurs on one file (this would produce a massive optimisation, however it scares me as people will want to search for specific things that they know exist by name)

I look forward to comments / questions from the Monastry as my knowlegde of complex (statefull) parseing is pretty limited, and I've never tried a search algorithm before..

Caveat: There is a command line (perl) search interface, however I must also be able to create a php search interface. Therefore using CPAN modules to do the parsing etc is good, using CPAN to do tokenisation or search just isn't going to happen unless the module is so simple (or easy to read) that re-implementing it in php (in a reeasonably small amount of time) is viable.

Update: Added <readmore> tags as I realised this post is huge and added answer to the first question so people don't waste their time searching for me.

Back to Seekers of Perl Wisdom