|Perl: the Markov chain saw|
How to build a Search Engine.by TedYoung (Deacon)
|on Feb 10, 2007 at 22:09 UTC||Need Help??|
Whether you are building a CMS, maintaining an intranet portal, managing a large database of information, or acting as an information aggregator, you will probably need to offer some level of search functionality to your user base. This meditation presents several different options, and offers and in-depth discussion and example of a comprehensive solution.
Search Engine Providers
For basic websites, the easiest and quickest solution is to use an existing search engine provider like Google. Once Google has indexed your site, visitors can search for content using the following Google search syntax:
Google even offers copy-and-paste solutions for adding a search box to your website that will automatically restrict visitors' searches to your domain.
Google now even offers webmaster tools to review how your site is indexed and locate potential problems. You can even notify Google of updates to your site, so they may potentially be indexed sooner.
While this is quick and easy, it has many drawbacks! Your dataset must be a website, it must be publicly available and cannot require any authentication, it must not look like a web application1, the search results look like a separate website and may include ads, you cannot offer advanced searching options (e.g. limiting search to one section of a site), and Google (or any other provider) will be slow at catching updates and changes to your site.
1 - Google does not index pages that look like they have an ID or session ID, or make use of more than a few parameters.
So, what do you do if you cannot live with any of these limitations? On a side note, Google does offer commercial search services, but I am cheap!
If your website stores its data in a database, you may have the option of using full-text catalogues. A full-text catalogue indexes one or more fields in a table for searching. You can then execute an SQL query which includes a search query. The database returns records that are relevant your search.
FTCs are convenient because they automatically update as your data changes and they don't require additional libraries. However, their functionality is often limited in terms of analysis, they are not extensible, and their query languages are often not what people are used to. And since each catalogue is limited to a table, they may not work for you at all if your data is related through many tables. For more information, check out Mysql's Full Text documentation.
Embeddable Search Engines
An embeddable search engine is a library that allows you to build search functionality directly into your program (or website) without using a stand-alone service. This is analogous to an embedded database system, as apposed to a database server like mysql.
Before we dive into our options, let's consider the various features we might be interested in:
A Review of Available Tools
There are several available open source embeddable search engines out there. The defacto standard is Apache's Lucene. While Lucene is written in Java, making it of little use to us as Perl developers, it has driven the design of most alternatives and so it is important to know about. On the Perl side, there are several different ports of Lucene available.
Plucene is a port of Lucene to Perl. I started using this several years ago for a large CMS that I had built and continue to maintain. Its one major pro is that it is all Perl. However, off the shelf, it includes only a very basic analyzer, no preview generation, and is not known for its speed. I had to string together several additional Perl packages to get these features. My biggest issue with Plucen, however, was the lack of documentation. Without a background in how Apache's Lucene worked, I was left to navigate a very large set of PODs to find answers to may questions. Also, I fear that the differences between Plucene and Lucene make Lucene's documentation a misleading reference now and then. Regardless, Plucene was very stable and performed very well during the several years I used it!
Lucene is a Perl binding the C++ port of Lucene, called CLucene. Note: do not confuse this with CLucene, the deprecated Perl bindings for the same library. Being a native port, it is certainly faster than Plucene. It offers a very direct and comprehensive overview, but still leaves you the look at Apache's Lucene's documentation. It offers several analyzers and incremental updates. But, you are still own your own for generating previews, and parsing PDFs, etc.
KinoSearch is a Perl/C++/XS search engine loosley based on Lucene. Unlike the two previous options, its API was designed for Perl, offering much easier and cleaner programming. Being native, it offers very fast indexing, good documentation on their website. It offers stemming and stopword analyzers, generates highlighted previews, you have more control over your index setup and contents, and is very extensible. Despite its low version number, it is very complete. This is the choice I went with and will discuss below.
Apache's Lucy is a brand-new project started by the creators of KinoSearch and a Ruby search engine called Ferret. They plan to create a native search engine with Perl and Ruby bindings. Perhaps, this is the beginning of a new defacto search engine with bindings for most languages (like mysql, and postgres in the database world). But, it has only just started, so it isn't yet an option.
Note: I want to make sure the authors of these modules are aware that I really appreciate their efforts and my criticism of their modules is merely a professional review. Maintaining a port of Lucene is a arduous task at best, and I thank you for all of your efforts!!!
HTDig is a non Lucene-related technology that seems dead. Their own website hasn't announced an update in 2.5 years.
SWISH-e is a very complete and comprehensive search indexing tool. It is also not related to Lucene. SWISH-e offers a command line tool to quickly index a file set or website. It indexes many different file types, and does all of the website crawling work for you. It is fast and SWISH offers Perl bindings. It has a smart analyzer and is extensible through unix-style piping. They have a nice article called How to Index Anything. This is a very quick and complete solution, but I needed more control over how my content was indexed and searched and so would not work for me.
Indexing Your Site
I chose KinoSearch for my needs this time around. However, you can use what you learn here with any Lucene related library.
The first thing you want to do is open an index. I suggest you put the index somewhere outside of the webspace. A minor annoyance with KinoSearch was the fact that you have to tell KinoSearch to create a new index if there isn't one already there. Otherwise it will barf. And you can't just always create a new index if you plan to do incremental updates.
The KinoSearch::Analysis::PolyAnalyzer is a great feature of KinoSearch. It automatically loads analyzers designed for your specified language, including Stemming features. Use this for "google style" searching.
The next step is to define the structure of what you want to index. This, to me, is one of they most powerful features of Lucene style engines. Think of this step as defining the columns of a table in the database. You indicate what fields you want, which ones are indexed (so they can be searched), which ones are stored (for use during search results), which ones are analyzed, etc.
Let's go over each field.
We create a new Document, the record in search engine terminology. We set the corresponding fields values, and then add the document to the index. We may, however, want to do some special processing to the content field first.
Searching Your Index
Now we want to create a search. Generally, this part would be built into a script (CGI, shell, ModPerl, etc). I will leave that up to you. Here is how you execute the search:
Up to this point there should be no surprises. The highlighter is used to highlight generated previews. You have to tell it which field is used for generating the preview. The highlighter above puts strong tags around words that match. You can customize this to meet your needs. Note: the highlighter is smart enough to highlight terms that match because of your analyzer. So, in a search for Eat: Eating and Eats would also be highlighted if found.
Now we display our results:
The preview generation step creates previews of each record. By default, they are limited to 200 chars, and show portions of the document that were most relevant to your search.
Inside the while loop, each $result is a hash reference with each of your fields for that record; id, url, section, title, content, etc. You will also have a field called score that has the score of the record, and excerpt which has your preview, all nicely highlighted. Also, the results come out in order of most to least relevant.
In the previous example, I took the query directly from the user and passed it to the index. But what about limiting sections, like I promised? Here we have an additional variable called $section containing their choice of section. I update the query to insure that all results are in that section as follows:
Sometimes it is easiest to just recrawl your dataset every once and while, or even every time it changes. KinoSearch, CLucene, SWISH-e, and Plucene are all plenty fast for most datasets. But, if you are concerned, or have extremely large sets of data (or a busy server) we can elect to update only the records that have been modified.
First, we open the index. Then for each updated record, we remove the existing entry:
This deletes all records with that ID number (in most cases, probably only one record). Then we create a new document, set all of the field values, and add it to the index.
When done updating, you need to finish the index, but you don't need to optimize at that moment. You can wait for several updates if your server is really busy.
If you just want to index a bunch of files (PDFs, DOCs, etc) consider SWISH-e. But, if you have some files on your website, you will want to extract the text from them before you add it to your index. Here are some tools that will help:
For fun, consider using Text::Aspell to support inline google-style spelling corrections.
I hope this helps!
Ted Young($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)