Re: Term document matrix for search engine

by SuicideJunkie (Vicar)
on Nov 09, 2012 at 15:14 UTC (#1003156)

in reply to Term document matrix for search engine

Whenever you want to look things up, and they are not sequentially numbered, think about using a hash instead of an array.

I'm not sure exactly what you need, but I'll take a guess.

Consider an example where you have a hash of terms and the documents they apply to:

    my %table = (
        recent  => { 'c:/autoexec.bat'   => undef, 'c:/frog.jpg'       => undef },
        text    => { 'c:/autoexec.bat'   => undef, 'c:/classnotes.txt' => undef },
        biology => { 'c:/classnotes.txt' => undef, 'c:/frog.jpg'       => undef },
    );

Building the table would be a matter of going through each file and setting $table{$term}{$filename} = undef; for each term that the file matches. Only the keys matter here; the undef values are just placeholders.
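
A minimal sketch of that build loop, assuming the documents are already in memory as plain strings. The %docs hash and its contents are made up for illustration, and splitting on non-word characters is just one guess at what "matches" means for you:

```perl
use strict;
use warnings;

# Hypothetical sample data: document name => the terms found in it.
my %docs = (
    'c:/autoexec.bat'   => 'recent text',
    'c:/classnotes.txt' => 'text biology',
    'c:/frog.jpg'       => 'recent biology',
);

my %table;
for my $filename (keys %docs) {
    # Lowercase, split on non-word characters, and skip empty pieces.
    for my $term (grep { length } split /\W+/, lc $docs{$filename}) {
        $table{$term}{$filename} = undef;    # only the key matters
    }
}

# List the documents recorded for one term.
print join(', ', sort keys %{ $table{biology} }), "\n";
```

With this sample data the result has the same shape as the hand-written %table above. Reading real files would replace the %docs lookup with whatever file handling your task calls for.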

You could then determine which documents match 'recent and biology', via:

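    my @terms = ('recent', 'biology');

    # Start with every document that matches the first term...
    my @matches = keys %{ $table{ (shift @terms) } };

    # ...then keep only those that also match each remaining term.
    foreach my $term (@terms) {
        @matches = grep { exists $table{$term}{$_} } @matches;
    }
    printf "Found %05d matches!\n", scalar @matches;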

Of course, this example only handles searches where every term is required (an AND of all terms). For boolean operations on your search terms, you'd want to make a tree and combine or intersect (or xor or whatever) the hashes at each node.
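
Here is a rough sketch of what such a tree could look like. The node layout (a plain string for a term, or an array ref of [ operator, children... ]) and the evaluate sub are my own assumptions for illustration, not the only way to do it:

```perl
use strict;
use warnings;

my %table = (
    recent  => { 'c:/autoexec.bat'   => undef, 'c:/frog.jpg'       => undef },
    text    => { 'c:/autoexec.bat'   => undef, 'c:/classnotes.txt' => undef },
    biology => { 'c:/classnotes.txt' => undef, 'c:/frog.jpg'       => undef },
);

# Hypothetical node layout: a node is either a term (plain string)
# or [ $op, @children ] where $op is 'and', 'or' or 'xor'.
sub evaluate {
    my ($node) = @_;

    # Leaf: return a copy of the term's document set (empty if unknown),
    # so the combining steps below never modify %table itself.
    unless (ref $node) {
        my %copy = %{ $table{$node} || {} };
        return \%copy;
    }

    my ($op, @children) = @$node;
    my @sets   = map { evaluate($_) } @children;
    my $result = shift(@sets) || {};

    for my $set (@sets) {
        if ($op eq 'and') {
            # Intersection: drop documents missing from the other set.
            delete @{$result}{ grep { !exists $set->{$_} } keys %$result };
        }
        elsif ($op eq 'or') {
            # Union: add every document from the other set.
            @{$result}{ keys %$set } = ();
        }
        elsif ($op eq 'xor') {
            # Symmetric difference: toggle each document of the other set.
            for my $doc (keys %$set) {
                if (exists $result->{$doc}) { delete $result->{$doc} }
                else                        { $result->{$doc} = undef }
            }
        }
        else {
            die "Unknown operator '$op'";
        }
    }
    return $result;
}

# recent AND (text OR biology)
my $query = [ 'and', 'recent', [ 'or', 'text', 'biology' ] ];
print join(', ', sort keys %{ evaluate($query) }), "\n";
```

Each node hands back a fresh hash ref, so the parent can combine sets in place. Parsing a query string into this tree is a separate job and is left out here.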

Replies are listed 'Best First'.
Re^2: Term document matrix for search engine
by Anonymous Monk on Nov 09, 2012 at 16:24 UTC

    Hashing seems like a good idea. I should search a file for a specific term and then place the file address into the hash if the term is present. Can you help me with the code, please? I am still not good with hashes.
    Thank you

      That sounds like it would defeat the purpose of the assignment.

      Take a read through perldata, but IMO the best thing to do is keep a small script on the side where you can play around and try things out quickly.

      And to understand what your code is doing, I highly recommend adding use Data::Dumper; at the top and then, later in your code, calling print Dumper \%hash;. The output from Dumper is about the same as the code that would create the data structure, so it is really handy for learning. Curly brackets mark nested hashes, square brackets mark nested arrays.
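
A small scratch script along those lines might look like this. The %hash contents are made up purely to show one nested hash and one nested array; Sortkeys and Indent are optional Data::Dumper settings that only make the output steadier and more compact:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # print hash keys in sorted order
$Data::Dumper::Indent   = 1;    # compact, fixed-width indentation

# Made-up structure: one nested hash and one nested array.
my %hash = (
    biology => { 'c:/frog.jpg' => undef },
    sizes   => [ 10, 20 ],
);

# Pass a reference so the whole hash is dumped as a single structure.
print Dumper \%hash;
```

In the output, the nested hash appears inside curly brackets and the nested array inside square brackets.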
