"Suffix" Dictionaries

by cLive ;-) (Prior)
on Dec 02, 2002 at 17:25 UTC ( #216979=perlquestion )
cLive ;-) has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

In my current incarnation I'm developing a Search Engine from scratch. One thing I'd like to deal with is "headwords" and "derivatives" (excuse me if the terms aren't correct, I'm not a linguist :). Perhaps an example:

host hosts hosted hosting

I'm sure you get the idea...

Half of my problem is not knowing what to search for. I tried "suffix dictionaries" and "suffix tree" and found a few interesting articles, but CPAN appears to be rather sparse on this front (or maybe I'm searching on the wrong terms - I do find CPAN's search rather strange at times).


  • is there anything out there that might help here; or
  • does anyone know of any good books/papers that discuss this issue

thoughts welcomed

cLive ;-)

Replies are listed 'Best First'.
Re: "Suffix" Dictionaries
by valdez (Monsignor) on Dec 02, 2002 at 17:36 UTC
Re: "Suffix" Dictionaries
by Callum (Chaplain) on Dec 02, 2002 at 17:34 UTC
    Lingua::Stem will probably do what you're looking for, or at least point you in the right direction.
Re: "Suffix" Dictionaries
by fletcher_the_dog (Friar) on Dec 03, 2002 at 02:18 UTC
    I am a linguist, and if you want to find good material about this subject, do a search (in Google, not CPAN) for "morphology". This may not bring up results with code or canned algorithms, but it will give you a better understanding of what you should do. Finding the 'morphological root' of a word (in your example, "host") requires more than knowing suffixes. For example, the root of "happiness" is "happy": here you don't just stick a suffix onto a word, you first have to get rid of the "y". Therefore, if you want to reverse the process and find the root of a given word, you can't always just hack off a suffix. Fortunately, these types of things follow pretty regular rules, so you don't have to worry about a lot of exceptions (though you do have to account for "foot" and "feet"). Small changes to the root itself are very common, so it will pay off in the long run if you take the time to make sure your algorithm does more than just hack off suffixes.
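To make the point about root changes concrete, here is a minimal sketch of a rule-based root finder in plain Perl. Everything in it is an assumption for illustration: the `naive_root` name, the tiny `%irregular` table, the rule list, and the three-letter minimum are hypothetical, and this is not how Lingua::Stem or any published stemming algorithm works. It only shows that a rule can rewrite the end of the word ("iness" becomes "y") rather than simply deleting it, and that irregular pairs such as "feet"/"foot" need a lookup table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table of irregular forms; a real one would be far larger.
my %irregular = ( feet => 'foot', mice => 'mouse', geese => 'goose' );

# Hypothetical ordered rules: [ pattern, replacement ]. First usable match wins,
# so longer, more specific suffixes are listed before shorter ones.
my @rules = (
    [ qr/iness$/, 'y' ],   # happiness -> happy (restore the "y")
    [ qr/ness$/,  ''  ],   # kindness  -> kind
    [ qr/ies$/,   'y' ],   # parties   -> party
    [ qr/ing$/,   ''  ],   # hosting   -> host
    [ qr/ed$/,    ''  ],   # hosted    -> host
    [ qr/s$/,     ''  ],   # hosts     -> host
);

sub naive_root {
    my ($word) = @_;
    $word = lc $word;

    # Irregular forms are looked up before any suffix rule is tried.
    return $irregular{$word} if exists $irregular{$word};

    for my $rule (@rules) {
        my ($pattern, $replacement) = @$rule;
        (my $root = $word) =~ s/$pattern/$replacement/;
        next if $root eq $word;               # rule did not apply
        # Keep at least three letters so "sing" and "bed" are left alone.
        return $root if length($root) >= 3;
    }
    return $word;                             # no rule applied
}

print join(' ', map { naive_root($_) } qw(host hosts hosted hosting happiness feet)), "\n";
# prints: host host host host happy foot
```

The sketch deliberately stays small, so it gets plenty wrong: "running" comes out as "runn" and "class" as "clas", because doubled consonants, silent "e", and words that merely end in "s" are not handled. Those are the kinds of small root changes that a real stemmer needs extra rules for.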
Re: "Suffix" Dictionaries
by rob_au (Abbot) on Dec 03, 2002 at 11:23 UTC
    The term for this type of language processing is stemming. I asked a question about stemming for site indexes in this thread, which has some excellent replies with links and references.


    perl -le 'print+unpack("N",pack("B32","00000000000000000000000111110000"))'
