PerlMonks  

Term document matrix for search engine

by Anonymous Monk
on Nov 09, 2012 at 13:50 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am new to Perl. I am doing a school project in which I'm creating a simple search engine. I give it a query (a string of words), a list of files is searched, and the name of the best-matching file is displayed. The basic plan is:
1) Preprocess the files
2) Document clustering
3) Create the term-document matrix
4) Search
I was able to write the preprocessing and clustering modules, but I am confused about the term-document matrix. Should I create a separate array for each term, or should I use a 2-D array? And how do I search for terms in the array? (The document that contains the most query terms should be displayed.)
And is there any better way to search than using a term-document matrix?
p.s. This is a pretty small project, so I don't need highly efficient search techniques, any easy ones would do.
Thank you

Re: Term document matrix for search engine
by SuicideJunkie (Priest) on Nov 09, 2012 at 15:14 UTC

    Whenever you want to look things up, and they are not sequentially numbered, think about using a hash instead of an array.

    I'm not sure exactly what you need, but I'll take a guess.

    Consider an example where you have a hash of terms and the documents they apply to:

    my %table = (
        recent  => { 'c:/autoexec.bat'   => undef, 'c:/frog.jpg'       => undef },
        text    => { 'c:/autoexec.bat'   => undef, 'c:/classnotes.txt' => undef },
        biology => { 'c:/classnotes.txt' => undef, 'c:/frog.jpg'       => undef },
    );

    Building the table would be a matter of going through each file and setting $table{$term}{$filename} = undef; for each term that the file matches.
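    As a rough sketch of that build step only: the %docs hash below is a hypothetical in-memory stand-in for your preprocessed files (the file names and text are made up), and splitting on whitespace is an assumed, very simple way to get terms.

```perl
use strict;
use warnings;

# Hypothetical stand-in for preprocessed files: filename => text.
my %docs = (
    'notes.txt' => 'biology text about frogs',
    'frog.txt'  => 'recent biology photo',
    'todo.txt'  => 'recent text',
);

# Build the term => { filename => undef } table.
my %table;
for my $filename (keys %docs) {
    # Assumption: terms are lowercase, whitespace-separated words.
    for my $term (split ' ', lc $docs{$filename}) {
        $table{$term}{$filename} = undef;
    }
}

# List the documents recorded for one term.
print join(', ', sort keys %{ $table{biology} }), "\n";
```

    With real files you would read each file's contents in place of the strings in %docs; the inner loop stays the same.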

    You could then determine which documents match 'recent and biology', via:

    my @terms   = ('recent', 'biology');
    my @matches = keys %{ $table{ (shift @terms) } };
    foreach my $term (@terms) {
        @matches = grep { exists $table{$term}{$_} } @matches;
    }
    printf "Found %05d matches!\n", scalar @matches;

    Of course, this example only does searches with all terms required. For boolean operations on your search terms, you'd want to make a tree and combine or intersect (or xor or whatever) the hashes at each node.
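    The original question asked for the document containing the most query terms rather than all of them. One possible way to do that with the same table is to count hits per document and sort by the count. This is only a sketch: score_documents is a hypothetical helper name, and "frogs" is a made-up query term that is deliberately absent from the table.

```perl
use strict;
use warnings;

# Same term => { filename => undef } layout as the example table.
my %table = (
    recent  => { 'c:/autoexec.bat'   => undef, 'c:/frog.jpg'       => undef },
    text    => { 'c:/autoexec.bat'   => undef, 'c:/classnotes.txt' => undef },
    biology => { 'c:/classnotes.txt' => undef, 'c:/frog.jpg'       => undef },
);

# Count, for each document, how many of the query terms it contains.
sub score_documents {
    my ($table, @terms) = @_;
    my %score;
    for my $term (@terms) {
        next unless exists $table->{$term};    # unknown term: skip it
        $score{$_}++ for keys %{ $table->{$term} };
    }
    return %score;
}

my %score = score_documents(\%table, qw(recent biology frogs));

# Highest count first; ties are broken by name so the order is stable.
my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
printf "%d  %s\n", $score{$_}, $_ for @ranked;
```

    The first element of @ranked is the best match; documents that contain none of the query terms never appear in %score at all.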

      Hashing seems like a good idea. I should search each file for a specific term and then place the file's path into the hash if the term is present. Can you help me with the code, please? I am still not good with hashes.
      Thank you

        That sounds like it would defeat the purpose of the assignment.

        Take a read through perldata, but IMO the best thing to do is to have a small test.pl script on the side where you can play around and try things out quickly.

        And to understand what your code is doing, I highly recommend putting use Data::Dumper; at the top and then, later in your code, saying print Dumper \%hash;. The output from Dumper is about the same as the code that would create the data structure, so it is really handy for learning: curly brackets for nested hashes, square brackets for nested arrays.
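        For instance, a throwaway test.pl along these lines shows both kinds of brackets. The contents of %hash are made up for illustration, and the two $Data::Dumper settings are optional conveniences rather than anything required.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # sort hash keys so the output is stable between runs
$Data::Dumper::Indent   = 1;    # simpler fixed indentation

# A small nested structure: one nested hash and one nested array.
my %hash = (
    biology => { 'c:/frog.jpg' => undef },
    terms   => [ 'recent', 'text' ],
);

# Pass a reference so the whole hash is dumped as one structure.
my $dump = Dumper(\%hash);
print $dump;
```

        Passing \%hash rather than %hash matters: without the backslash, Dumper receives a flat list of keys and values and prints each one as a separate variable.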
