Search Algorithm

by tenfourty (Novice)
on Aug 10, 2000 at 17:27 UTC ( [id://27279]=perlquestion )

tenfourty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all you gurus out there,

I am looking for a very efficient search algorithm, as I am writing a program to search through a list of files and find any occurrences of a list of keywords.

At the moment the algorithm I am using is:

read in all the keywords from a file into an array
get a listing of all the files I want to search into an array
for each file {
    for each keyword {
        for each line {
            check if the keyword is in the line
            if the keyword is found then log it
        }
    }
}

I can post my source up here if you want, but I didn't want to take up extra space.

This algorithm works fine for a small file list, but it can take ages when searching through several thousand files.

Is there any faster or more advanced way of doing this search out there, maybe a module or something that will do this for me? I have searched but haven't come across anything yet.

Re: Search Algorithm
by t0mas (Priest) on Aug 10, 2000 at 17:42 UTC
    From perldoc -f study:
    The following scans a list of files (@files) for a list of words (@words), and prints out the names of those files that contain a match:
        $search = 'while (<>) { study;';
        foreach $word (@words) {
            $search .= "++\$seen{\$ARGV} if /\\b$word\\b/;\n";
        }
        $search .= "}";
        @ARGV = @files;
        undef $/;
        eval $search;    # this screams
        $/ = "\n";       # put back to normal input delimiter
        foreach $file (sort keys(%seen)) {
            print $file, "\n";
        }
    Is this what you want?
    Anyway, undef $/ will be faster than line by line.
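    For comparison, a minimal sketch of the same scan in slurp mode, without the eval'd study trick (assuming @files and @words are populated as above):
        undef $/;                          # slurp mode: read whole files at once
        foreach my $file (@files) {
            open(F, $file) or die "Can't open $file: $!";
            my $text = <F>;                # the entire file in one string
            close(F);
            foreach my $word (@words) {
                if ($text =~ /\b\Q$word\E\b/) {
                    print "$file\n";
                    last;                  # report each file only once
                }
            }
        }
        $/ = "\n";                         # restore the input delimiter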

    /brother t0mas
Re: Search Algorithm
by maverick (Curate) on Aug 10, 2000 at 20:00 UTC
    If you're going to be doing a lot of searches through a lot of files, and the files you're searching through don't change very often, then it may be worth your time to build an index.

    The basic idea is this: using a DBM, make each key a word from one of the files, and the value a list of the files that contain that word. You trade time spent building the index for the speed of hash lookups.

    You'll end up with two programs. One that builds the index and one that does the search.

    This one will build the index.

    #!/usr/bin/perl
    use strict;

    # Berkeley DBMs are my fav.
    use DB_File;

    my %Index;

    # remove the old index and start fresh
    unlink("/home/maverick/tmp/index_dbm");
    tie(%Index, 'DB_File', "/home/maverick/tmp/index_dbm",
        O_RDWR|O_CREAT, 0640, $DB_BTREE) || die "Tie Failed: $!";

    foreach my $file (glob("/home/maverick/tmp/*.txt")) {
        open(F, $file) || die "Can't open $file: $!";

        # slurp up the file and make a list of words
        my @words = map { split(/\W+/, $_) } <F>;

        # add this file to the list of matches for each word
        my %uniq;
        foreach (@words) {
            if (!defined($uniq{$_})) {
                # we've not seen this word before, so we add it.
                # I'm also assuming that ~ is safe to use as a separator.
                if (!defined($Index{$_})) {
                    # it's the first addition of this word,
                    # so I don't need to prepend a '~'
                    $Index{$_} = $file;
                } else {
                    $Index{$_} .= "~$file";
                }
                $uniq{$_} = 1;
            }
        }
        close(F);
    }
    untie %Index;
    This one searches using it:
    #!/usr/bin/perl
    use strict;
    use DB_File;

    my %Index;
    tie(%Index, 'DB_File', "/home/maverick/tmp/index_dbm",
        O_RDWR, 0640, $DB_BTREE) || die "Tie Failed: $!";

    print ">";
    while (<>) {
        # chop off the newline
        $_ =~ s/[\r\n]//go;
        if (defined($Index{$_})) {
            print "$_ found in:\n";
            # replace all the ~ with \n (without modifying the index)
            print join("\n", split(/~/, $Index{$_})), "\n";
        } else {
            print "Not Found\n";
        }
        print ">";
    }
    untie(%Index);

    file1.txt contains:

    here's a file that contains
    a bunch of random keywords
    on many different lines
    that we can use for the sake of example.
    file2.txt contains:
    here's another file that contains
    even more random text for the sake of example.
    I hope this helps solve the problem
    presented by tenfourty.
    and if you run the index maker and then search:
    darkstar:~/tmp>./mkindex.pl
    darkstar:~/tmp>./search.pl
    >tenfourty
    tenfourty found in:
    /home/maverick/tmp/file2.txt
    >maverick
    Not Found
    >example
    example found in:
    /home/maverick/tmp/file1.txt
    /home/maverick/tmp/file2.txt
    >text
    text found in:
    /home/maverick/tmp/file2.txt
    >
    Hope this helps :)

    /\/\averick

Re: Search Algorithm
by lhoward (Vicar) on Aug 10, 2000 at 17:38 UTC
    Two ideas come to mind. I don't think that either of these would be spectacularly faster than what you are doing now, but they may provide some modest improvement.

    The following idea relies on you being able to break each line up into "words" to match against your list. If that is not possible ignore this suggestion.

    read in all the keywords from a file into a hash as keys
    get a listing of all the files I want to search into an array
    for each file {
        for each line {
            for each word in the line {
                if word is in hash of wanted words {
                    log existence of the word
                }
            }
        }
    }
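    A minimal Perl sketch of that hash-lookup loop (the keywords file name and the use of @ARGV for the file list are made up for illustration):
        #!/usr/bin/perl
        use strict;

        # read the keywords into a hash for constant-time lookups;
        # 'keywords.txt' is a hypothetical one-keyword-per-line file
        my %wanted;
        open(KEYWORDS, "keywords.txt") || die "Can't open keywords.txt: $!";
        while (<KEYWORDS>) {
            chomp;
            $wanted{$_} = 1;
        }
        close(KEYWORDS);

        # scan each file named on the command line, word by word
        foreach my $file (@ARGV) {
            open(FILE, $file) || die "Can't open $file: $!";
            while (my $line = <FILE>) {
                foreach my $word (split /\W+/, $line) {
                    print "$file:$.: $word\n" if $wanted{$word};
                }
            }
            close(FILE);
        }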
    You may also be able to get it going faster by putting all the words into a single regular expression, compiling the RE once, and then checking each line of each file against it. (This will only work if you want to know that "at least one" of the words is in a line and don't care exactly which one it was.)
    read in all the keywords from a file and make them into a regular expression
    get a listing of all the files I want to search into an array
    for each file {
        for each line {
            if line match RE {
                log it
            }
        }
    }
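    In Perl that might look like the following sketch (assuming @keywords and @files are already populated; qr// needs perl 5.005 or later):
        # join the keywords into one alternation; quotemeta protects
        # any regex metacharacters hiding in the keywords
        my $alternation = join '|', map { quotemeta } @keywords;
        my $re = qr/\b(?:$alternation)\b/;    # compiled once, reused below

        foreach my $file (@files) {
            open(FILE, $file) || die "Can't open $file: $!";
            while (<FILE>) {
                print "$file:$.: $_" if /$re/;
            }
            close(FILE);
        }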
Re: Search Algorithm
by BigJoe (Curate) on Aug 10, 2000 at 17:44 UTC
    I would use the great grep command instead of your 3rd loop.
    This is out of the Perl Cookbook, p. 114:
        @matching = grep { /^gnat/ } `who`;
    This matches any line that comes out of the who command against the regular expression ^gnat.
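    Applied to a file rather than the output of who, that might look like this sketch (the filename and pattern are made up):
        open(FILE, "somefile.txt") || die "Cannot open somefile.txt: $!\n";
        my @matching = grep { /\bkeyword\b/ } <FILE>;
        close(FILE);
        print @matching;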

    --BigJoe

    Learn patience, you must.
    Young PerlMonk, craves Not these things.
Re: Search Algorithm
by giorgos (Novice) on Aug 11, 2000 at 17:15 UTC
    Hi all,

    First of all, what I am about to describe is not a straightforward, copy & paste solution but rather a discussion of information retrieval techniques that could apply to other programming languages as well.

    In general, when writing search algorithms, the easiest way to optimize your code is to create an inverted index. An inverted index is a structure which maps each keyword that you want to make searchable to a list of the documents it occurs in. Optionally, together with each document you could store a corresponding score for that keyword in that document. This could be, for example, the number of occurrences (the frequency) of this keyword in the respective document.

    An entry in the inverted index would look like:

    keyword = (document1, score1, document2, score2....)

    As you can see, the big advantage of this approach is that you don't need to scan the documents every time you make a query. You only need to update the index when some of the documents change. Even better, the indexer could work incrementally and only examine files that have been modified (or created) since the last indexing run.
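    For illustration, an in-memory inverted index with frequency scores could be built along these lines (a sketch only; @documents is assumed to hold the file names):
        # map each word to per-document frequency counts:
        #   $index{$word}{$document} = number of occurrences
        my %index;
        foreach my $document (@documents) {
            open(DOC, $document) || die "Can't open $document: $!";
            while (my $line = <DOC>) {
                foreach my $word (split /\W+/, $line) {
                    $index{lc $word}{$document}++ if length $word;
                }
            }
            close(DOC);
        }

        # a query is then a single hash lookup instead of a full scan:
        # my $postings = $index{lc $term};   # { document => frequency, ... }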

    The other problem that has to be solved is picking a good scoring algorithm that will calculate a score for each keyword, for each document it occurs in. The most common algorithm in use is due to Gerard Salton, the father of information retrieval (we don't know who the mother is, though).

    This states that the weight W of a term T in a document D is:

    W(T, D) = tf(T, D) * log ( DN / df(T))

    where tf(T, D) is the term frequency of T in D, DN is the total number of documents, and df(T) is the number of documents in which T occurs, known as the document frequency of T.

    The quantity

    log ( DN / df(T))
    is called the inverse document frequency of T.

    So we can write:

    W(T, D) = tf(T, D) * idf(T)
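    As a toy worked example (the counts here are invented): with a collection of 1000 documents, a term occurring 7 times in a document and appearing in 10 documents weighs in at:
        my $tf  = 7;                      # occurrences of T in D
        my $DN  = 1000;                   # documents in the collection
        my $df  = 10;                     # documents containing T
        my $idf = log($DN / $df);         # Perl's log() is the natural log
        my $w   = $tf * $idf;             # W(T, D) = tf(T, D) * idf(T)
        printf "idf = %.3f, W = %.3f\n", $idf, $w;   # idf = 4.605, W = 32.236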

    Now of course, there is a reason why all this gives good results. I will not go into detail, but basically what is implied by the above formula is that the weight given to a term with respect to a document is higher if:

    • it occurs many times in that document
    • it doesn't appear that often in other documents in the collection
    which in simple words means that this term is distinctive for the document in question.

    Quite a few variants of the above weighting algorithm exist.

    An example of such a search engine is Perlfect Search, which I wrote. If you are interested, you could take a look at http://perlfect.com/freescripts/search/ and examine the code to see most of the things I describe above.

    With a few modifications it may even work for the problem that was mentioned by tenfourty.

    giorgos

Re: Search Algorithm
by mrmick (Curate) on Aug 10, 2000 at 17:46 UTC
    You may want to put the list of keywords in an array to create a regex and then iterate through the lines of each file and test for the keywords. This is a little sloppy but I hope you get the idea:
    my @KEYWORDS = qw(hello there you gurus);

    # create the regex (note: a single | between alternatives; a doubled
    # || would add an empty alternative that matches every line)
    my $string = '';
    foreach (@KEYWORDS) {
        $string .= "$_|";
    }
    $string =~ s/\|$//;

    # open the log file (for writing, since we print to it below)
    open(LOGFILE, ">$logfilename") || die "Cannot open $logfilename\n$!\n";

    # let's go through the files....
    foreach (@files) {
        $filename = $_;
        open(FILE, $filename) || die "Cannot open $filename\n$!\n";

        # check if a keyword is in the line
        while (<FILE>) {
            if (/$string/i) {
                print LOGFILE "Keyword found in $filename\n";
            }
        }
    }

    Mick
      With any reasonably large set of keywords, this is probably not the best idea. Not only does alternation within the regex really slow the engine down (Camel 3 explains why quite well), but you will likely exceed the maximum allowed size of a single regex quickly.

      I would suggest using qr//, which was introduced in Perl 5.005. It allows you to store a compiled regex in a scalar. Given that all the keywords are stored in @words, create a hash (wait for it) called %regex like this:

      %regex = map { $_ => qr/$_/ } @words;
      Then, modify the innermost while loop to look like this:
      LINE: while ( <FILE> ) {
          for my $word ( @words ) {
              my $pat = $regex{$word};
              next unless ( /$pat/ );
              print "$word was found\n";
              last LINE;
          }
      }
      The last LINE part assumes you can stop processing the file as soon as one pattern matches. Remove it if you want to test all the keywords against each line in the file.
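      For instance, a variant that drops last LINE and reports each keyword with its line number might look like this sketch:
          while ( <FILE> ) {
              for my $word ( @words ) {
                  my $pat = $regex{$word};
                  print "$word found on line $.: $_" if /$pat/;
              }
          }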

      mikfire

        Thanks to you all for your replies. This was the first time I had posted to this list, and I'm amazed at the fast response.

        I will typically be searching for around 200 keywords in up to 2000 files. I need to output in my log the name of the file, the number of occurrences of keywords, and then for each occurrence of a keyword I need to print that line and the line number.

        I think you are right that a regexp is not the best way to search a line, and that for each line I should check for the occurrence of each word in a hash. My search should not be case sensitive either; are the keys in a hash case sensitive, and if so, how do I get around this?

        Firstly, thanks for all your responses and for the help you guys have given me.

        I slept on what you gave me, and I realised that the Perl only searches for single keywords; however, my keywords are sometimes several words long.

        Just to let you know what my program does: it searches for non-ANSI SQL within source code, so the config file for my program is a list of Oracle-specific SQL to search for. An example of a keyword is ALTER TABLE, etc. I need to find all occurrences of this in a file. The other thing the code has to do is highlight multiple occurrences of keywords in the same line.

        This is why the code you gave me doesn't work: it goes through and searches word by word for each line.
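        A sketch of one way to handle such multi-word keywords, assuming @phrases holds the phrases read from the config file and FILE is the file being scanned:
            # compile each phrase once; \Q..\E protects spaces and
            # metacharacters, /i makes the match case-insensitive
            my @phrase_res = map { qr/\Q$_\E/i } @phrases;
            while (my $line = <FILE>) {
                for my $i (0 .. $#phrases) {
                    # count every occurrence of the phrase in this line
                    my $count = () = $line =~ /$phrase_res[$i]/g;
                    print "$phrases[$i] x$count at line $.: $line" if $count;
                }
            }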
