Faster Search Engine

by drewboy (Sexton)
on Jul 22, 2001 at 09:32 UTC ( [id://98769]=perlquestion )

drewboy has asked for the wisdom of the Perl Monks concerning the following question:

Hey everyone, I'm a newbie here. Anyway, I'm using a search engine package called 'Links 2.0', written in Perl, installed on a Unix system. My site hasn't opened yet, though I'm pretty much done with hacking and modifying it, and I'm in the process of adding links to the database. However, as the database builds up, the search gets much slower; sluggish, to be blunt. I hear that using MySQL is faster, although I don't have the money for that.

So basically my question is: how do I make my search engine perform better, faster, and more efficiently? I've been searching through the forums for this software and have read about Perl core dumps and the grep function, but no definite code has come up that makes it work.

My search.cgi is a 21 KB file that goes through a 70 KB database of around 200 links, each with an ID, Title, Category, Description, and Keywords.

Here is some code to guide you:

# Go through the database.
open (DB, "<$db_file_name")
    or &cgierr("error in search. unable to open database: $db_file_name. Reason: $!");
flock (DB, 1) if ($db_use_flock);
LINE: while (<DB>) {
    /^#/    and next LINE;   # Skip comment lines.
    /^\s*$/ and next LINE;   # Skip blank lines.
    chomp;                   # Remove trailing newline.
    @values = &split_decode($_);
    $grand_total++;


I wonder if I could somehow modify the code above to make the search faster. If you want the whole search.cgi file, you can view it at http://www.textcentral.com/search.txt. Any kind of answer is highly appreciated. Thanks, Perl gurus...

Replies are listed 'Best First'.
Re: Faster Search Engine
by tachyon (Chancellor) on Jul 22, 2001 at 11:50 UTC

    Hi. To be frank, it seems like you need a real search engine. This is covered in some detail in the O'Reilly book CGI Programming with Perl, aka the rat book, which I recommend to you.

    From the code you have posted it seems you are searching a flat text file from beginning to end, which, as you note, does not scale well: it gets slower the bigger the file gets. The basis of a fast search is either to use a real database or to generate an index that you can search (usually via a hash key). You do the processing in advance, when you generate the index, and your CGI then searches the index to find what you want. You try to limit the processing that has to happen in real time (i.e. in the CGI) so things feel fast from the user's point of view. A minimal sketch of the index idea follows.
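
    For instance, here is a minimal sketch of the index idea in Perl. The pipe-delimited record layout and the field order here are assumptions on my part; check links.def for the real field assignments in your links.db.

        #!/usr/bin/perl -w
        use strict;

        # Build a word index over the searchable fields. The pipe-delimited
        # layout and field order are assumed; adjust to match links.def.
        my $db_file = 'links.db';

        my %index;    # lowercase word => list of record IDs containing it
        open (DB, "<$db_file") or die "Can't read $db_file: $!\n";
        while (<DB>) {
            next if /^#/ or /^\s*$/;    # skip comments and blank lines
            chomp;
            my ($id, $title, $category, $description, $keywords) = split /\|/;
            my %seen;    # don't list the same record twice for one word
            for my $word ( map { lc } "$title $description $keywords" =~ /(\w+)/g ) {
                push @{ $index{$word} }, $id unless $seen{$word}++;
            }
        }
        close DB;

        # A lookup is now a single hash access, not a scan of the whole file.
        print "Find what? ";
        chomp(my $find = lc <STDIN>);
        if (my $ids = $index{$find}) {
            print "Found in records: @$ids\n";
        }
        else {
            print "No match\n";
        }

    In practice you would build %index once, offline (nph-build.cgi would be a natural place), and store it on disk, e.g. in a DBM file via dbmopen, so that search.cgi only ever does lookups.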

    A good free search engine you can incorporate into your site is available from http://www.whatuseek.com/. I use it for cheap and cheerful searches on shoestring sites. You can customise the results page into a format that looks like your site; the downside is ads and a page limit. It is not as good as your own Perl engine could be, but it is fast and easy to set up. On a site that uses it, view the source to see how the search box links to the search engine.

    Good luck. If you post more code, or preferably links to what you currently have, we may be able to suggest how to speed it up for you. It is not really clear what you want to search for.

    Update

    Forget the database. I have written a little search app for you that will grep out all the lines that match a given search criterion in your data file in ~16 milliseconds. See below. This should be fast enough for you :-)

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Thanks tachyon, I will buy that book. But can you also recommend something that's not too boring and is kind to beginners like me? I am also planning to take some classes on Perl and CGI; I wonder if any are offered here in New York City. Do you know of any?

      I am not using this search engine for http://www.textcentral.com; that is a totally different site. If you want to see what I've done so far, it's at http://alleba.dreamhost.com. My site is called 'Alleba', a search engine and web directory. It is true that it searches through a flat text file, called links.db. But what do you mean by 'use a real database or generate an index that you can search'? How could I do this?

      I'm also posting my links.db and category.db and several other files to give you the whole scheme of the script. Everything can be found at http://alleba.dreamhost.com/scripts.

      nph-build.cgi -- rebuilds all the category .html files as well as the homepage (index.html) and directory page (dir.html). I'm not sure if this is relevant to my question.

      db-utils.pl -- I think this file contains some subs that search.cgi uses, such as those for sorting links and categories.

      search.cgi -- this file does the actual search in links.db and category.db.

      site_html_templates.pl -- contains the templates on which several generated pages are based, such as the search results page. nph-build.cgi also builds the category pages from elements defined in this file.

      links.db -- the links database, contains the ID, title, keywords etc.

      category.db -- contains all the category and subcategory names.

      links.def -- defines the field assignments for each piece of information, e.g. title 1, description 5.

      links.cfg -- where all the important settings live, such as the absolute paths and URLs that each .cgi and .pl file relies on.

      Hopefully these files will enlighten everyone. Thanks, looking forward to your replies.

      drewboy

        Here is a really basic search application for you. In this script you are prompted for a search string, but this could easily be CGI input. Note that quotemeta escapes most characters with a backslash, which 1) makes the string safe to use in the grep regex and 2) helps thwart hackers. *Do not interpolate a user-supplied string into a regex without quotemeta.* The script then greps out all the lines that contain that string and stores them in an array. The /i makes the search case-insensitive. It looks for an exact match only and will not understand boolean logic.
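
        For example, a quick illustration of what quotemeta does (the input string here is hypothetical):

            # a user-supplied pattern containing regex metacharacters
            my $raw  = 'foo.*bar';
            my $safe = quotemeta $raw;    # now 'foo\.\*bar'
            # /$safe/ matches only the literal string "foo.*bar"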

        Using your 70 KB links.db text file as the data and searching for 'PlanetGimmick', which is the last entry in the file, a single run takes 0 seconds. If you ramp up and do the search 10000 times, so that it runs long enough to get a valid time, it takes 161 seconds, i.e. 16.1 milliseconds per search. This is on an old PII 233 MHz, 64 MB RAM, Win95, Perl 5.6 system (my laptop). I expect this is fast enough for most practical purposes. Once you have the matching lines in an array, you can do whatever processing you want on them. The advantage is that you only process the lines that matched your search criteria.

        #!/usr/bin/perl -wT
        use strict;

        # clean up the environment for CGI use
        delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
        $ENV{'PATH'} = '/bin:';    # you may need more path info

        my $db_file = 'c:/links.db';

        print "Find what? ";
        chomp(my $find = <>);

        # this escapes regex metachars and makes it safe
        # to interpolate $find into the regex in our grep.
        $find = quotemeta $find;

        # this untaints $find - we have made it safe above
        # using the quotemeta; this satisfies -T taint mode
        $find =~ m/^(.*)$/;
        $find = $1;

        my $start = time();

        open (FILE, "<$db_file") or die "Oops, can't read $db_file. Perl says: $!\n";
        my @db_file = <FILE>;    # get the whole database into an array in RAM
        close FILE;

        # do the search
        my @lines = grep {/$find/i} @db_file;

        my $time = time() - $start;
        print "Search took $time seconds\n";

        if (@lines) {
            print "Found\n@lines\n";
        }
        else {
            print "No match\n";
        }

        I expect this should solve your problem, as it is plenty fast enough. It should scale in a linear fashion, i.e. twice as big a file == twice as long a search. The scaling will break down when your file becomes larger than can be stored in main memory in an array and the operating system resorts to using swap space on disk as virtual RAM. If you get this big, send me some options in the IPO, OK!
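
        If you want to repeat the ramp-up timing without a hand-rolled loop, the core Benchmark module will do it for you. A sketch, assuming @db_file and $find are set up as in the script above:

            use Benchmark;
            # time 10000 repetitions of the grep search; Benchmark prints
            # elapsed and CPU time for the whole run
            timethis(10000, sub { my @hits = grep {/$find/i} @db_file });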

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Faster Search Engine
by scottstef (Curate) on Jul 22, 2001 at 18:36 UTC
    You mention that you cannot afford MySQL. If you are running your own server, you can download MySQL for free from http://www.mysql.com. If you are renting space from a provider, most of the ones I have seen charge $5-10/month, if that much, for a database. The performance gains on large databases are dramatic.
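
    For reference, talking to MySQL from Perl goes through the DBI module. Here is a minimal sketch; the database, table, and column names are assumptions, since you would first have to load links.db into a table:

        #!/usr/bin/perl -w
        use strict;
        use DBI;

        # Hypothetical connection details -- substitute your own.
        my $dbh = DBI->connect('DBI:mysql:database=links;host=localhost',
                               'user', 'password', { RaiseError => 1 });

        my $find = 'PlanetGimmick';

        # Placeholders (?) quote the user input safely, much as quotemeta
        # does for a regex.
        my $sth = $dbh->prepare(
            'SELECT id, title FROM links WHERE title LIKE ? OR keywords LIKE ?'
        );
        $sth->execute("%$find%", "%$find%");

        while (my ($id, $title) = $sth->fetchrow_array) {
            print "$id: $title\n";
        }
        $dbh->disconnect;

    Note that a LIKE with a leading % still scans the table; MySQL's FULLTEXT indexes are the usual way around that for word searches.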

    "The social dynamics of the net are a direct consequence of the fact that nobody has yet developed a Remote Strangulation Protocol." -- Larry Wall
