Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Using DBIx::FullTextSearch

by trs80 (Priest)
on Sep 21, 2002 at 00:08 UTC ( #199670=perlmeditation: print w/ replies, xml ) Need Help??

I had to upgrade the search capability of a web site this week and it needed to be done quickly. I started by looking at what the current code did vs. what the site really needed it to do. The code (inherited) was horrible. Here is the "search":
sub search_biography { my $search_text=shift; my @result; my $table; my $sth=$dbh->prepare("select distinct hm_index,first_name,last_name, category from Biography where first_name like '%$search_text%' or last_name like '%$search_text%' or bio_text like '%$search_text%' or occupation like '%$search_text%' or description like '%$search_text%'" ); $sth->execute; while(my @row=$sth->fetchrow) { push @results , "blah - was long HTML string" } return (\@result); }
This nice code would match 'across' if you looked for 'Ross', I doubt if that is what they wanted. I felt there had to be something already available to do this on CPAN and so began my search.

It took me some searching at CPAN but I finally stumbled across the DBIx::FullTextSearch module, but it was late at night and the doucmentation didn't seem to make much sense so I called it a day.

Well that sleep did me good, I was finally able to make some sense out of the documentation and make a plan on how to convert what I had over.

The documentation still doesn't give a clear (at least to my mind) path to implementing a solution, that is understandable since it isn't exactly an "everyday" item. This is my attempt at explaining the steps I took to shoehorn it into an existing application without too much pain. The documentation talks about a 'frontend' and a 'backend', the frontend is the information to be indexed. In this example we are going to use a database for both the backend and the frontend. See the docs for more options.

Phase One - Tests

One thing I have learned the hard way is that you should test your code outside of the web environment before you add an additional layer of complexity on to it, aka "mod_perl/<your framework/template here>". With that in mind I made two scripts, full_text_search_create.pl and full_text_search.pl

The first and most important script is the full_text_search_create.pl script since it actually creates the indexes that are used by the second one.

In my case I was lucky when it came to the tables because they were already properly setup with Primary keys so it removed a step from the process. If you attempting to work on a table that doesn't have a primary key, I suggest you do that before you attempt using this module.

Here is our first script

full_text_search_create.pl
use DBIx::FullTextSearch; use DBI; use strict; # the data source, that is the tables to be indexed my $dbh = DBI->connect('dbi:mysql:database','user','password'); my $table_info = { 'DBIx_biography' => { table_name => 'Biography', column_name => 'bio_text', }, 'DBIx_glossary' => { table_name => 'Glossary', column_name => 'glossary_definatio +n', }, 'DBIx_timeline' => { table_name => 'Timeline', column_name => 'event_description' +, }, 'DBIx_quotes' => { table_name => 'Quotes', column_name => 'quote', }, }; my $table_check = $dbh->prepare('show tables'); $table_check->execute; my $tables; while ( my ($table) = $table_check->fetchrow() ) { $tables{$table} = 1; } foreach my $front_end_table ( keys %{$table_info} ) { if ($tables{$front_end_table}) { $fts = DBIx::FullTextSearch->open($dbh, $front_end_table); $fts->empty; } else { $fts = DBIx::FullTextSearch->create($dbh, $front_end_table, frontend => 'table', # type of frontend # table to be indexed table_name => $table_info->{$front_end_tabl +e}{'table_name'}, # column that has the info to be indexed column_name => $table_info->{$front_end_table}{'column_nam +e'}, backend => 'column', # type of storage for the back +end # you really should set these up, but I won't get # into that in this writeup #stoplist => 'sl_en', # from example #stemmer => 'en-us' # from example ); } my $ids = $dbh->prepare(qq!SELECT hm_index FROM $table_info->{$front_end_table}{'tabl +e_name'}!); $ids->execute() or die "$DBI::errstr problem"; while ( my ($id) = $ids->fetchrow ) { $fts->index_document($id); } }


In my table structures each table has the same id name so I was able to avoid having to add that into the hash, but it wouldn't be so bad to include it anyway just allow for future growth.

The above code will create a new index or erase an existing one and recreate it. This makes it simple to run anytime there is a major update or in a nightly cron job.

The new names are created inside of the database you connected to in your DBI connect. I named them DBIx_<table_name> so I could spot them easily and it is also unlikely to conflict with existing table names.

For every index you want to create there are three tables: <front_end_table_name> <front_end_table_name>_data <front_end_table_name>_words These are all created automatically by the create statement and no action is required on your end.

Once the indexes are created we need to be able to test our create from the command line to see if it is returning the correct results.

full_text_search.pl
# full_text_search.pl use strict; use DBIx::FullTextSearch; use DBI; # the data source, that is the tables to be indexed my $dbh = DBI->connect('dbi:mysql:database','user','password'); my $front_end_table = shift @ARGV; my $fts = DBIx::FullTextSearch->open($dbh, $front_end_table); my $table = shift @ARGV; # this is a mapping of the SQL statements for each table, I was able # to get these directly from some existing code and then just # replaced my where conditional. This minimized the amount of # refactoring and time required to swap things out. my $table_column = { Biography => 'distinct hm_index,first_name,last_name,category', Quotes => qq!distinct hm_index,quote_stamp,quote, date_format(birth_year,'%Y'), date_format(death_year,'%Y'), name,bio_index,media_index,category!, Timeline => qq!hm_index,event_title,event_brief,event_description +, timeline_stamp,date_format(event_date, '%Y')!, Glossary => qq!hm_index,glossary_title,glossary_defination!, }; my @docs = $fts->contains(@ARGV); my $doc = join(',',@docs); if ($doc) { my $info = $dbh->prepare(qq!SELECT $table_column->{$table} FROM $table WHERE hm_index IN($doc)! ); $info->execute; while ( my @row = $info->fetchrow ) { # this is just for testing so it doesn't need # to be pretty :) print "'$row[1]' '$row[2]'\n"; } } else { print "no matches for '", join(' ',@ARGV) , "'"; }


The second program expects a minimum of 3 command line arguments, something along the lines of:
full_text_search.pl DBIx_biography Biography trs80
Where:
DBIx_biography = front end table Biography = table that contains our primary key, this is the table that we will get our final mathes from trs80 = what to look for


You could also do:
full_text_search.pl DBIx_biography Biography trs80 jeffa Ovid
and it will search for all three. It gives fairly ugly/simple results, but it just an index/search test interface that isn't intended for end users.

Beyond this things become more specific to your own site and how you want to render the results, but hopefully this revealed some of the magic in this module.


Update(1): Made some minor verbiage adjustments and added code tags on commandline for running the program

Comment on Using DBIx::FullTextSearch
Select or Download Code
Re: Using DBIx::FullTextSearch
by Ryszard (Priest) on Sep 22, 2002 at 11:47 UTC
    This is *somewhat* interesting. I say somewhat because you've not really added a conclusion to your experience of inplementation.

    Personally, what I'd like to see is pros, cons, preformance results (both time, and search semantics) and a general opinion of weather or not you think this exercise was worthwhile, and what kind of applications you'd use it for (ie how far do *you* think it will scale)..

    None the less, an interesting article on the implementation...

      Pros
      • CPAN available
      • Implement in a short period of time if you have existing databases
      • Keeps search information inside the same container as the data
      Cons
      • Requires knowledge/familiarity with DBI, MySQL, and Perl data strucutures. Which may make it difficult to use for newer users of Perl.
      • Only works with MySQL
      • Could impact database performance if search is perfromed on the same database as the data in heavily loaded systems.

      Performance Results

      There is really no reason for me to do this at the scale of the application this was applied to. There is no noticable latency even when the search is performed over 100Mbps connected servers.

      Worthwhile

      The exercise worthwhile for several reasons:
      • Reused existing code
      • Provided results that were much more accurate
      • Scripts produced will allow for easier future implementations on other projects
      • Provided support for search modifiers (AND, OR, +)


      Where to use it (will it scale)
      • Since I have not done any performance tests I can't possibly predict how far this would scale, but speaking from my own install with roughly 5MB of searched content the seat of the pants performance is acceptable

      The system has been online now for 6 months and it has satisified the clients search requirements as well as aided site visitors in locating the information of interest to them. The solution is implemented on a mod_perl enabled Apache server with ~30,000 visitors per month at peak.
Re: Using DBIx::FullTextSearch
by SpritusMaximus (Sexton) on Sep 24, 2002 at 16:03 UTC
    I used DBIx::FullTextSearch in a project where I mirror and index a political commentary website. I have found that oddly, the index is broken for certain words - I never had a chance to debug to find out why.

    On an AMD K2-350, 230 MB of RAM, RedHat 7.3, running a number of services (my home server so it has a very light load), approximately 3,500 static HTML pages, the following queries took the following times:

    +Linux +Bush (5 seconds)
    inane (~1 second)
    "Bloom County" (~2 seconds)


    This can be found at: http://www.bitbuckets.com/bartbot/

    WARNING: Political Satire contained within!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://199670]
Approved by PodMaster
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (12)
As of 2014-09-02 18:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (28 votes), past polls