I am curious about the experience of others with natural language stemming for site indexes. I ask because I am in the process of rewriting a site search engine (to improve maintainability and to fit the corporate application environment) and have come across a number of discussions regarding natural language stemming in this type of application.
For those unfamiliar with the concept, stemming is the process of reducing a word to its stem or root form. This allows similar words such as computer and computing to be conflated, or reduced to a single root (for example, comput), thereby reducing index dictionary size and, in theory, storage requirements and processing time. A further discussion of this concept can be found here.
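To make the conflation idea concrete, here is a minimal sketch in Python. The suffix list and the names `toy_stem` and `build_index` are my own inventions for illustration; this is a naive suffix stripper, not the Porter or Paice-Husk algorithm, and not what Lingua::Stem does internally.

```python
# Toy suffix stripper: NOT Porter or Paice-Husk, just an illustration.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")  # hypothetical rule list


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


# Four distinct surface forms collapse into a single dictionary key.
docs = {1: "computer computing", 2: "computers computed"}
index = build_index(docs)
print(index)
```

With these two sample documents the index holds one key instead of four, which is the dictionary-size reduction described above.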
While this type of processing reduces the number of index dictionary keys, I am concerned about the likelihood of stemming errors, whereby dissimilar words are stemmed to the same root, particularly given that indexing speed and space requirements should not be an issue in the application environment. See here for a discussion of over- and under-stemming errors.
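One way I could imagine measuring such errors is to hand-label small groups of words that ought to share a stem and then count the pairs a stemmer gets wrong. The sketch below is in Python and rests on assumptions: the toy rules, the word groups, and the function name `stemming_errors` are all hypothetical, and a real comparison would plug in an actual stemmer in place of `toy_stem`.

```python
from itertools import combinations

SUFFIXES = ("ing", "er", "ed", "s")  # hypothetical toy rules, not Porter


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stemming_errors(groups, stem=toy_stem):
    """Return (over, under) lists of word pairs.

    groups: hand-labelled lists of words that SHOULD share a stem.
    over:   pairs from different groups that were given the same stem.
    under:  pairs from the same group that were given different stems.
    """
    labelled = [(w, gid) for gid, words in enumerate(groups) for w in words]
    over, under = [], []
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if g1 != g2 and same_stem:
            over.append((w1, w2))
        elif g1 == g2 and not same_stem:
            under.append((w1, w2))
    return over, under


# Hand-labelled sample groups (assumed for illustration only).
groups = [["wand", "wands"], ["wander", "wandered"], ["run", "running"]]
over, under = stemming_errors(groups)
print("over-stemmed: ", over)
print("under-stemmed:", under)
```

Here the toy rules conflate wander with wand (over-stemming) and fail to conflate run with running (under-stemming). Passing a different function as `stem` would allow the same labelled groups to be used to compare algorithms.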
And so I ask a barrage of questions:
- What are the experiences of fellow monks with natural language stemming?
- Have other monks found better results, as measured by minimal stemming errors, with one stemming algorithm (for example, Paice-Husk, Porter, etc.) than with another?
- And in particular, what are other monks' experiences with the Porter stemming algorithm as implemented in Lingua::Stem?
My thanks in advance.