Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

RE: Re: Search Engine Theory

by httptech (Chaplain)
on Jun 06, 2000 at 15:56 UTC ( [id://16584]=note: print w/replies, xml ) Need Help??


in reply to Re: Search Engine Theory
in thread Search Engine Theory

Clever. I like it. Would it not be a good idea to record the name of the file only once? Like:
foo: index.html,5,18; bar: index.html,6;page1.html,1; baz: index.html,7;
That would make the index smaller. Then with some trivial splitting, you end up with something like:
$seq{'foo'}{'index.html'} = [5,18]; $seq{'bar'}{'index.html'} = [6]; $seq{'bar'}{'page1.html'} = [1]; $seq{'baz'}{'index.html'} = [7];
Now I just need a clever algorithm to iterate over it all and figure out which document has foo bar baz in order. Hmmm.

Replies are listed 'Best First'.
RE: RE: Re: Search Engine Theory
by nardo (Friar) on Jun 06, 2000 at 20:16 UTC
    The following should work, but it's off the top of my head and hasn't been thought out, so it's possible there's a better algorithm and it's also possible there's a bug in the code, so be sure to give it a good once over to make sure i haven't screwed up somewhere.

    $seq{'foo'}{'index.html'} = [5,18]; $seq{'bar'}{'index.html'} = [6]; $seq{'bar'}{'page1.html'} = [1]; $seq{'baz'}{'index.html'} = [7]; @pages = &find_phrase("foo bar baz"); sub find_phrase { my @words = split(/\s+/, $_[0]); my $word = shift @words; my $pos; my $page; my %found; foreach $page (keys %{$seq{$word}}) { foreach $pos (@{$seq{$word}{$page}}) { if(&find_phrase_at($page, $pos + 1, @words)) { $found{$page} = 1; } } } return keys %found; } sub find_phrase_at { my $file = shift; my $position = shift; my @words = @_; my $word; if(@words == 0) { return 1; } $word = shift @words; if(grep($_ == $position, @{$seq{$word}{$file}})) { return &find_phrase_at($file, $position + 1, @words); } return 0; }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://16584]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2025-06-22 23:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.