
Re: Search Algorithm

by mrmick (Curate)
on Aug 10, 2000 at 17:46 UTC

in reply to Search Algorithm

You may want to put the list of keywords in an array, build a single regex from it, and then test each line of each file against that regex. This is a little sloppy, but I hope you get the idea:
my @KEYWORDS = qw(hello there you gurus);

# create the regex: join the quoted keywords with a single |
# (an empty alternative, as in "a||b", would match every line)
my $string = join '|', map { quotemeta($_) } @KEYWORDS;

# open the log file for writing
open(LOGFILE, '>', $logfilename) || die "Cannot open $logfilename\n$!\n";

# let's go through the files....
foreach my $filename (@files) {
    open(FILE, '<', $filename) || die "Cannot open $filename\n$!\n";
    # check if a keyword is in the line
    while (<FILE>) {
        if (/$string/i) {
            print LOGFILE "Keyword found in $filename\n";
        }
    }
    close(FILE);
}


Replies are listed 'Best First'.
RE: Re: Search Algorithm
by mikfire (Deacon) on Aug 10, 2000 at 18:09 UTC
    With any reasonably large set of keywords, this is maybe not the best idea. Not only does alternation within the regex really slow the engine down (Camel 3 explains why quite well), but you will likely exceed the maximum allowed size of a single regex quickly.

    I would suggest using qr//, which was introduced in perl 5.005. It allows you to store a compiled regex in a scalar. Given that all the keywords are stored in @words, create a hash ( wait for it ) called %regex like this:

    %regex = map { $_ => qr/$_/ } @words;
    Then, modify the innermost while loop to look like

    LINE: while ( <FILE> ) {
        for my $word ( @words ) {
            my $pat = $regex{$word};
            next unless ( /$pat/ );
            print "$word was found\n";
            last LINE;
        }
    }

    The last LINE part assumes you can stop processing the file as soon as one pattern matches. Remove it if you want to test all the keywords against each line in the file.
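A minimal, self-contained sketch of the precompiled-pattern idea above, run against a few in-memory lines instead of a file. The helper name first_keyword and the sample data are made up for illustration, and \Q...\E is an addition that quotes any regex metacharacters in a keyword.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first keyword that matches $line, or undef.
sub first_keyword {
    my ($line, $words, $regex) = @_;
    for my $word (@$words) {
        return $word if $line =~ $regex->{$word};
    }
    return;
}

my @words = qw(hello there gurus);

# Compile each pattern once, outside the loop.
my %regex = map { $_ => qr/\Q$_\E/ } @words;

my @lines = ("nothing to see", "well hello world", "ask the gurus");
for my $line (@lines) {
    my $hit = first_keyword($line, \@words, \%regex);
    print "$hit was found\n" if defined $hit;
}
```

Returning from the helper on the first hit plays the same role as last LINE in the loop above.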


      Thanks to you all for your replies, this was the first time I had posted to this list and I'm amazed at the fast response.

      I will typically be searching for around 200 keywords in up to 2000 files. In my log I need to output the name of the file and the number of occurrences of keywords, and then, for each occurrence of a keyword, print that line and its line number.

      I think you are right that a regexp is not the best way to search a line, and that for each line I should check for the occurrence of each word in a hash. My search should also be case-insensitive. Are the keys in a hash case sensitive, and if so, how do I get around this?

        Keys in a hash are case sensitive, but nothing says you have to store them in a particular case. Pseudo-workable snippet follows:
        foreach (@words) { $word_idx{lc($_)} = $position; }
        The important magic is in lc. You'll have to use lc when you pull values out of the hash, too, or use a tied hash that does this for you.
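A small runnable sketch of the lc idea, for what it's worth. The keyword list, the sub name keyword_position and the choice to store each keyword's array index as its value are all assumptions made for the example; the snippet above leaves $position unspecified.

```perl
use strict;
use warnings;

my @words = ('ALTER TABLE', 'Decode', 'nvl');

# Store every key in lower case (value: index in @words, an assumption)...
my %word_idx;
$word_idx{ lc $words[$_] } = $_ for 0 .. $#words;

# ...and lower-case the probe before every lookup.
sub keyword_position {
    my ($candidate) = @_;
    return $word_idx{ lc $candidate };
}

for my $probe ('alter table', 'DECODE', 'Nvl', 'select') {
    my $pos = keyword_position($probe);
    print "$probe: ", ( defined $pos ? "position $pos" : "not a keyword" ), "\n";
}
```

Funnelling every lookup through one sub is a cheap alternative to a tied hash: there is only one place where forgetting lc could hurt.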

      Firstly thanks for all your responses and for the help that you guys have given me.

      I slept on what you gave me and I realised that the Perl code only searches for single-word keywords; however, my keywords are sometimes several words long.

      Just to let you know what my program does: it searches for non-ANSI SQL within source code, so the config file for my program is a list of Oracle-specific SQL to search for. An example of a keyword is ALTER TABLE. I need to find all occurrences of this in a file. The other thing the code has to do is highlight multiple occurrences of keywords in the same line.

      This is why the code you gave me doesn't work: it goes through each line and searches word by word.

        # while not EOF keep going
        while ( <DATA> ) {
            $lineCount++; # increment the lineCount
            for my $word ( @words ) {
                my $pat = $regex{ lc($word) };
                # next word unless the word is a keyword so store a report in @found
                next unless ( /$pat/ );
                @found = (@found, "\nError in line $lineCount of file $file occurence of \"$word\" :\n\t@words\n");
                $foundCount++; # increment total found words
            }
        }
        First, let's make this a bit less painful. Sorry this does not directly answer your problem; I will get to that in a bit.

        First and foremost, you need to stop that @found assignment. It is really painful to look at. The perlish way ( highly optimized as well as reading better ) is to push, like

        push @found, "\nError in line $lineCount of file $file occurence of \"$word\" :\n\t@words\n";

        The way you did it previously causes perl to expand the array into a list and then put that list back into the array. That expansion step is going to get very costly. The push doesn't bother with all that; it just tacks the data onto the end of the array.
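To make the difference concrete, here is a tiny sketch with both forms side by side. The report strings are placeholders, not the real messages.

```perl
use strict;
use warnings;

my @found;

# List assignment: every existing element is copied on each append.
@found = (@found, "first report");

# push: appends in place, without rebuilding the array.
push @found, "second report";
push @found, "third report", "fourth report";   # several items at once

print scalar(@found), " reports\n";
```

Both forms leave the same data in @found; only the amount of copying differs, which starts to matter once the array grows.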

        You can also get rid of the lineCount variable if you wish. Perl automagically keeps track of the current line number and stores that in $.
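A short sketch of $. in use, assuming an in-memory file so that it runs on its own (this needs perl 5.8 or later). The sample text and the pattern are invented for the example.

```perl
use strict;
use warnings;

# In-memory file, so the sketch is self-contained.
my $text = "select 1\nalter table t\nselect 2\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @found;
while ( my $line = <$fh> ) {
    chomp $line;
    # $. holds the number of the line most recently read from $fh
    push @found, sprintf( "line %d: %s", $., $line )
        if $line =~ /alter table/i;
}
close $fh;   # an explicit close resets $. for the next file

print "$_\n" for @found;
```

When looping over many files, closing each handle explicitly keeps the line numbers starting from 1 for every file.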

        Now, on to your problem. I am worried about this line

        # read in all the keywords from the configFile and put all the words into a hash
        %regex = map { $_ => qr/$_/ } init_keywords ($ConfigFile);
        # init_keywords just returns an array of keywords all in lower case.
        ....

        This means you will only match words when they are in lower case. Without knowing your data set, I cannot say for certain if this is your problem, but that is my guess. You can solve this many ways, but frankly I think something like
        %regex = map { $_ => qr/$_/i } init_keywords ($ConfigFile);
        is the cleanest way to do it. The additional /i modifier tells the regex to ignore case when doing the match.
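A quick way to see the effect of /i, sketched with two hashes built from the same lower-case keywords. The keyword list and the sample line are assumptions standing in for whatever init_keywords returns and for the real source files; \Q...\E is added so the keywords are matched literally.

```perl
use strict;
use warnings;

# Lower-case keywords, standing in for the output of init_keywords.
my @keywords = ('alter table', 'nvl');

my %sensitive   = map { $_ => qr/\Q$_\E/  } @keywords;
my %insensitive = map { $_ => qr/\Q$_\E/i } @keywords;

my $line = 'ALTER TABLE emp ADD (bonus NUMBER);';

for my $word (@keywords) {
    printf "%-12s case-sensitive: %-3s  with /i: %s\n",
        $word,
        ( $line =~ $sensitive{$word}   ? 'yes' : 'no' ),
        ( $line =~ $insensitive{$word} ? 'yes' : 'no' );
}
```

The /i flag is stored inside the compiled pattern, so the matching code does not need to repeat it.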

        Beyond this, you will need either somebody better than I am at this stuff ( and there are plenty of them here ) or I will need a sample of your data - both the key words and the files you are parsing.
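In the meantime, here is one possible sketch for multi-word keywords and repeated hits on a line. It is only a guess at the requirements, not a tested answer for the real data: the helper names phrase_regex and count_hits are made up, the sample line is invented, and the \b anchors assume every keyword starts and ends with a word character.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a keyword phrase into a case-insensitive pattern
# that accepts any run of whitespace between the words of the phrase.
sub phrase_regex {
    my ($phrase) = @_;
    my $body = join '\s+', map { quotemeta($_) } split ' ', $phrase;
    return qr/\b$body\b/i;   # \b assumes word characters at both ends
}

# Count every occurrence of every keyword in one line.
sub count_hits {
    my ($line, $regex) = @_;
    my %hits;
    for my $word (keys %$regex) {
        my $re = $regex->{$word};
        my $n  = () = $line =~ /$re/g;   # list-context match, then count
        $hits{$word} = $n if $n;
    }
    return \%hits;
}

my %regex = map { $_ => phrase_regex($_) } ('ALTER TABLE', 'NVL');
my $hits  = count_hits('alter  table a; Alter Table b; select nvl(x, 0)', \%regex);
print "$_: $hits->{$_}\n" for sort keys %$hits;
```

The /g match in list context returns every occurrence, so the per-line counts come out without a second pass over the line.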

