PerlMonks  

make a web site searchable

by bear0053 (Hermit)
on Aug 20, 2003 at 18:50 UTC (#285257)
bear0053 has asked for the wisdom of the Perl Monks concerning the following question:

I am currently trying to design a searchable website. Basically, I provide the user with a text box for keywords. When they click 'FIND', my script grabs the keywords and searches a SQL DB field. To index the pages for searching, I strip each page of all tags and put the plain text into the field I compare against. Are there any better ways that you know of to handle this search indexing? Thanks in advance.

Re: make a web site searchable
by tcf22 (Priest) on Aug 20, 2003 at 18:54 UTC
    Take a look at HTML::Index. This should do what you want.
•Re: make a web site searchable
by merlyn (Sage) on Aug 20, 2003 at 19:05 UTC
    If you don't mind bouncing them out to Google and back to your site, you can add "search this site with Google" without any programming whatsoever. I added the following HTML to the template for the bottom of my pages:
    <form action="http://www.google.com/search" method=GET>
      <INPUT TYPE=hidden name=site value=swr>
      <INPUT TYPE=hidden name=q value="site:stonehenge.com">
      <INPUT TYPE=text name=as_q size=31 maxlength=256 value="">
      <INPUT TYPE=submit name=btnG VALUE="Search stonehenge.com with Google">
    </form>
    Just replace the two occurrences of "stonehenge.com" with your domain name, and be sure that Google is hitting your site regularly.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I think it only works if your site's pages have already been indexed by Google. Otherwise, nothing is returned.
        If you don't want to wait for Google to find you, you can always submit your URL to Google. After a few days or so, your site will be indexed and searchable.

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        
Re: make a web site searchable
by bear0053 (Hermit) on Aug 20, 2003 at 19:18 UTC
    The Google suggestion won't work because I need this search to be in-house and not bounce off Google. Thanks for the suggestion, though; it would be nice if I could use this feature.
Re: make a web site searchable
by halley (Prior) on Aug 20, 2003 at 19:21 UTC
    Building a Vector Search Engine in Perl, a perl.com article, should give you some interesting ideas. They may or may not apply to your situation or skill level. Good luck.

    --
    [ e d @ h a l l e y . c c ]

Re: make a web site searchable
by cfreak (Chaplain) on Aug 20, 2003 at 19:25 UTC
      We use perlfect at work, and it seems pretty good. On the other hand, I didn't set it up, and when I tried setting it up on my own site, I found it to be a pain in the arse. So I stuck with the one that I'd written a couple of years before and which does a sufficiently good job that I am disinclined to try any harder with perlfect.
Re: make a web site searchable
by trs80 (Priest) on Aug 20, 2003 at 20:14 UTC
Re: make a web site searchable
by hmerrill (Friar) on Aug 20, 2003 at 20:15 UTC
    I've never used it myself, but it might have some value here. I don't know what database you're using, but MySQL has a Full Text Search capability, and I think Oracle has something similar. I'm not sure about PostgreSQL.

    Here's a snippet from a doc I found on www.mysql.com when I did a search (upper right) for 'full text search':

    As of Version 3.23.23, MySQL has support for full-text indexing and searching. Full-text indexes in MySQL are an index of type FULLTEXT. FULLTEXT indexes are used with MyISAM tables only and can be created from CHAR, VARCHAR, or TEXT columns at CREATE TABLE time or added later with ALTER TABLE or CREATE INDEX. For large datasets, it will be much faster to load your data into a table that has no FULLTEXT index, then create the index with ALTER TABLE (or CREATE INDEX). Loading data into a table that already has a FULLTEXT index will be slower.

    Full-text searching is performed with the MATCH() function.

    mysql> CREATE TABLE articles (
        ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
        ->   title VARCHAR(200),
        ->   body TEXT,
        ->   FULLTEXT (title,body)
        -> );
    Query OK, 0 rows affected (0.00 sec)

    mysql> INSERT INTO articles VALUES
        -> (NULL,'MySQL Tutorial', 'DBMS stands for DataBase ...'),
        -> (NULL,'How To Use MySQL Efficiently', 'After you went through a ...'),
        -> (NULL,'Optimising MySQL','In this tutorial we will show ...'),
        -> (NULL,'1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
        -> (NULL,'MySQL vs. YourSQL', 'In the following database comparison ...'),
        -> (NULL,'MySQL Security', 'When configured properly, MySQL ...');
    Query OK, 6 rows affected (0.00 sec)
    Records: 6  Duplicates: 0  Warnings: 0

    mysql> SELECT * FROM articles
        -> WHERE MATCH (title,body) AGAINST ('database');
    +----+-------------------+------------------------------------------+
    | id | title             | body                                     |
    +----+-------------------+------------------------------------------+
    |  5 | MySQL vs. YourSQL | In the following database comparison ... |
    |  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
    +----+-------------------+------------------------------------------+
    2 rows in set (0.00 sec)
    and it continues on...

    Here's the link to this page:

    http://www.mysql.com/doc/en/Fulltext_Search.html
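
For a Perl search script, the MATCH() ... AGAINST() query quoted above could be built roughly as follows. This is only a sketch: the helper name fulltext_query is made up for illustration, the table and column names are the "articles" example from the MySQL docs, and the DBI usage in the trailing comments assumes DBI and DBD::mysql are installed and that a database called "site" exists.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the raw text-box contents into a
# parameterised FULLTEXT query. Returns an empty list when the
# user typed nothing but whitespace.
sub fulltext_query {
    my ($keywords) = @_;
    my $clean = defined $keywords ? $keywords : '';
    $clean =~ s/^\s+|\s+$//g;    # trim both ends
    $clean =~ s/\s+/ /g;         # collapse runs of whitespace
    return if $clean eq '';

    # The placeholder (?) keeps user input out of the SQL text itself.
    my $sql = 'SELECT id, title FROM articles '
            . 'WHERE MATCH (title, body) AGAINST (?)';
    return ($sql, $clean);
}

my ($sql, @bind) = fulltext_query('  database   comparison ');
print "$sql\n";
print "bind: @bind\n";

# With DBI available (connection details are assumptions):
#   use DBI;
#   my $dbh  = DBI->connect('dbi:mysql:database=site', $user, $pass,
#                           { RaiseError => 1 });
#   my $rows = $dbh->selectall_arrayref($sql, { Slice => {} }, @bind);
#   print "$_->{id}: $_->{title}\n" for @$rows;
```

Passing the keywords as a bind value rather than interpolating them into the SQL string avoids quoting problems with whatever the user types into the box.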

    HTH.
Re: make a web site searchable
by johndageek (Hermit) on Aug 20, 2003 at 20:50 UTC
    Depending on the size of the site and the searching capability needed, a more scalable solution may be:
    Scan the pages
    Strip non-words/tags
    Parse the words and create a table with word and page URL as columns
    Index on word
    Please note that you can normalize the data if your site is large enough.

    Your search page can then parse the keywords and look the words up individually in the database with a join or a union (depending on the type of result you want); it can also search on partial words.
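
The steps above can be sketched in plain Perl, with a hash standing in for the word/page-URL table. Everything here is an assumption made for illustration: the sample pages, the sub names build_index and search, and the choice to treat "partial words" as prefix matches. In a real setup the hash would be a database table indexed on the word column.

```perl
use strict;
use warnings;

# Hypothetical sample data: URL => page text with tags already stripped.
my %pages = (
    '/index.html'   => 'Welcome to our Perl site about search engines',
    '/contact.html' => 'Contact the site owner about Perl training',
    '/search.html'  => 'Search tips and searching tricks',
);

# Build word => { url => count }, i.e. the "word, page-url" table.
sub build_index {
    my ($pages) = @_;
    my %index;
    for my $url (keys %$pages) {
        for my $word (lc($pages->{$url}) =~ /\w+/g) {
            $index{$word}{$url}++;
        }
    }
    return \%index;
}

# Look up each keyword individually.
#   mode => 'all' keeps pages containing every keyword (like a join),
#   mode => 'any' keeps pages containing at least one (like a union).
#   partial => 1 also matches indexed words that start with the keyword.
sub search {
    my ($index, $query, %opt) = @_;
    my %seen;
    my @words = grep { !$seen{$_}++ } (lc($query) =~ /\w+/g);
    my %hits;
    for my $word (@words) {
        my @terms = $opt{partial}
            ? (grep { index($_, $word) == 0 } keys %$index)
            : ($word);
        my %urls;
        for my $term (@terms) {
            $urls{$_} = 1 for keys %{ $index->{$term} || {} };
        }
        $hits{$_}++ for keys %urls;    # one hit per keyword per page
    }
    my $need  = ($opt{mode} || 'all') eq 'any' ? 1 : scalar @words;
    my @found = sort grep { $hits{$_} >= $need } keys %hits;
    return @found;
}

my $index = build_index(\%pages);
print join(', ', search($index, 'perl site', mode => 'all')), "\n";
```

Scanning every indexed word for prefix matches is fine for a small site; with a SQL table the same lookup would be a LIKE 'word%' query, which can still use the index on the word column.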

    Enjoy
    -John

Re: make a web site searchable
by dont_you (Hermit) on Aug 20, 2003 at 22:35 UTC
    Take a look at mnoGoSearch, they have done all the work for you already. I've got very good results with it.

    From their site: "mnoGoSearch (formerly known as UdmSearch) is a full-featured web search engine software for intranet and internet servers. mnoGoSearch for UNIX is a free software covered by the GNU General Public License"

    It's C based code, and the compiled CGI frontend works far better than the provided Perl script. Maybe some day someone will write an XS interface...

Re: make a web site searchable
by bean (Monk) on Aug 21, 2003 at 00:15 UTC
    This particular wheel has already been invented. Use ht://Dig if you want to accomplish this with a minimal amount of effort. Unless this is a programming assignment or for your own personal edification, in which case you'll need to exclude common words, research/choose/implement ranking algorithms, maybe even look into clustering techniques to find related/similar documents, etc.
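
For the do-it-yourself route, excluding common words and ranking the results might start out like the sketch below. It is not what ht://Dig does; the stop-word list is a tiny made-up sample, the sub names are hypothetical, and the scoring is just one simple tf-idf variant (term count times log(1 + N/df)) chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and in of the to is for);

# Split text into lower-case words and drop the common ones.
sub terms {
    my ($text) = @_;
    return grep { !$stop{$_} } (lc($text) =~ /\w+/g);
}

# Rank documents by the summed tf-idf weight of the query terms.
# %$docs maps a document name to its plain text. Documents that
# contain none of the query terms are left out of the result.
sub rank {
    my ($docs, $query) = @_;
    my (%tf, %df);
    for my $name (keys %$docs) {
        $tf{$name}{$_}++ for terms($docs->{$name});    # term frequency
        $df{$_}++ for keys %{ $tf{$name} || {} };       # document frequency
    }
    my $n = keys %$docs;
    my %score;
    for my $term (terms($query)) {
        next unless $df{$term};
        my $idf = log(1 + $n / $df{$term});    # rarer terms weigh more
        for my $name (keys %$docs) {
            my $count = $tf{$name}{$term} or next;
            $score{$name} += $count * $idf;
        }
    }
    # Best score first; ties broken by name for a stable order.
    return sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}

my %docs = (
    a => 'The Perl search engine and the Perl index',
    b => 'Search the site',
    c => 'Cooking recipes',
);
print join(', ', rank(\%docs, 'the perl search')), "\n";
```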
Re: make a web site searchable
by bugsbunny (Scribe) on Aug 21, 2003 at 11:22 UTC
    try this one :
    http://perlfect.com/
      Hi there,

      I am about to launch a Perl application which will do most of what you need. It will be available for demo soon at http://www.minigoogle.co.uk

      It searches msql databases and entire web directories, and contains an engine which allows the admin to specify a URL on another website to be spidered and indexed in the database. So you could group websites together under a topic and search these sites by keyword. I am hoping the concept will take off and I can "chain" the search engines together to make a much bigger searchable repository (similar to the way filesharing works, but with information). The intention is to create a service which allows a more focused search with more accurate results. But it is early days yet.

      I also have a Lite version of the script which searches only flat-file databases. It has been a team effort, with several programmers working on it from all over the world. Pretty soon there will also be a version which will work like a Yahoo-style directory engine, but the plans for this have only just been drawn up.

      Does anyone have any comments, thoughts or ideas?

      cheers
      Dataferret
        Does anyone have any comments, thoughts or ideas?

        Yes, I don't think Google will be particularly happy with your choice of application name.

        ~Particle *accelerates*

Re: make a web site searchable
by Anonymous Monk on Aug 21, 2003 at 13:26 UTC
    You might give swish-e a try, depending on the size of your project. It can handle lots of data, indexes pretty fast, searches quite speedily as well, and it's not too memory-consuming. It is used e.g. on apache.org, and in our company (daily newspaper) we index ~250,000 documents with it.
    Downside: no incremental updates are possible, but indexing is pretty fast (about 40 mins in our case), and it can handle multiple indices.

    cheers m
Re: make a web site searchable
by Anonymous Monk on Aug 22, 2003 at 04:37 UTC
    Hey,
    I am working on exactly what you are looking for (I am planning to eventually make it open source).

    Note that my approach is probably overkill in your situation; it is meant for large data sets (tested on 250MB of text, and it works surprisingly well).
    I found that the best way is to set up an inverted index of all the terms, as well as an index which shows the position of each word within each document.
    I then use an algorithm which gives a bonus if the words that are being searched for appear close to each other in a document. This proximity-search algorithm is described at http://citeseer.nj.nec.com/cachedpage/550719/1 .
    Also, to improve the inverted index, words are indexed by their stem (a stemming algorithm can be found here: http://www.ldc.usb.ve/~vdaniel/porter.pm ).
    As well, I have implemented an algorithm similar to Google's PageRank (a good description of it is at http://citeseer.nj.nec.com/cachedpage/368196/1 ); the popularity of a page is taken into account when returning results.

    I use MySQL for all the storage / indexes.
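
The positional index and proximity bonus described above might look roughly like this in miniature. This is not the poster's code or the algorithm from the linked paper: it uses in-memory hashes instead of MySQL, the sub names are invented, and the 1/distance bonus is an arbitrary weighting picked only to show the idea. Stemming and the PageRank-style popularity score are left out.

```perl
use strict;
use warnings;

# Build word => { doc => [positions] } from doc name => plain text.
sub build_positional_index {
    my ($docs) = @_;
    my %index;
    for my $doc (keys %$docs) {
        my $pos = 0;
        for my $word (lc($docs->{$doc}) =~ /\w+/g) {
            push @{ $index{$word}{$doc} }, $pos++;
        }
    }
    return \%index;
}

# Smallest distance (in words) between any occurrence of $w1 and any
# occurrence of $w2 in $doc; undef if either word is absent from it.
sub min_distance {
    my ($index, $doc, $w1, $w2) = @_;
    my $p1 = $index->{$w1} && $index->{$w1}{$doc} or return;
    my $p2 = $index->{$w2} && $index->{$w2}{$doc} or return;
    my $best;
    for my $x (@$p1) {
        for my $y (@$p2) {
            my $d = abs($x - $y);
            $best = $d if !defined $best || $d < $best;
        }
    }
    return $best;
}

# One point per query word present in the document, plus a bonus of
# 1/distance for each consecutive pair of query words.
sub proximity_score {
    my ($index, $doc, @words) = @_;
    my $score = grep { $index->{$_} && $index->{$_}{$doc} } @words;
    for my $i (0 .. $#words - 1) {
        my $d = min_distance($index, $doc, $words[$i], $words[$i + 1]);
        $score += 1 / $d if $d;
    }
    return $score;
}

my %docs = (
    near => 'perl search engine',
    far  => 'perl is a fine language for building a search',
);
my $index = build_positional_index(\%docs);
printf "%s: %.3f\n", $_, proximity_score($index, $_, qw(perl search))
    for sort keys %docs;
```

Both sample documents contain both query words, so only the proximity bonus separates them: the document where "perl" and "search" sit next to each other scores higher.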

Re: make a web site searchable
by richardX (Pilgrim) on Aug 22, 2003 at 09:14 UTC
    I have used both Perlfect and Swish but for my small sites I use the FREE service from Atomz.com
    Atomz does all the work for you and you don't have to install anything. KISS

    Richard

    There are three types of people in this world, those that can count and those that cannot. Anon
