PerlMonks  

"Suffix" Dictionaries

by cLive ;-) (Parson)
on Dec 02, 2002 at 17:25 UTC (#216979=perlquestion)
cLive ;-) has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

In my current incarnation I'm developing a search engine from scratch. One thing I'd like to handle is "headwords" and their "derivatives" (excuse me if the terms aren't correct, I'm not a linguist :). For example:

host hosts hosted hosting

I'm sure you get the idea...

Half of my problem is not knowing what to search for. I tried "suffix dictionaries" and "suffix tree" and found a few interesting articles, but CPAN appears to be rather sparse on this front (or maybe I'm searching on the wrong terms - I do find CPAN's search rather strange at times).

So,

  • is there anything out there that might help here; or
  • does anyone know of any good books/papers that discuss this issue?

thoughts welcomed

cLive ;-)

Re: "Suffix" Dictionaries
by Callum (Chaplain) on Dec 02, 2002 at 17:34 UTC
    Lingua::Stem will probably do what you're looking for, or at least point you in the right direction.
Re: "Suffix" Dictionaries
by valdez (Monsignor) on Dec 02, 2002 at 17:36 UTC
Re: "Suffix" Dictionaries
by fletcher_the_dog (Friar) on Dec 03, 2002 at 02:18 UTC
    I am a linguist, and if you want to find good material on this subject, search (in Google, not CPAN) for "morphology". This may not turn up code or canned algorithms, but it will give you a better understanding of what you need to do. Finding the 'morphological root' of a word ("host" in your example) requires more than knowing suffixes. For example, the root of "happiness" is "happy": you don't just stick a suffix onto the word, you first have to get rid of the "y". So if you want to reverse the process and find the root of a given word, you can't always just hack off a suffix. Fortunately, these things follow fairly regular rules, so you don't have to worry about a lot of exceptions (though you do have to account for "foot" and "feet"). Small changes to the root itself are very common, so it will pay off in the long run to make sure your algorithm does more than just hack off suffixes.
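    To make the point above concrete, here is a minimal sketch in core Perl of a root finder that repairs the root instead of only chopping the suffix. The rule list, the exception table and the name root_of are all made up for illustration; a real stemmer needs far more rules than this, and this toy will get plenty of words wrong.

```perl
use strict;
use warnings;

# Irregular forms that no suffix rule can recover (tiny, hypothetical table).
my %irregular = (feet => 'foot', geese => 'goose', mice => 'mouse');

# Ordered rules: [ pattern, replacement ]. The first usable match wins,
# so the longer, more specific suffixes come first.
my @rules = (
    [ qr/iness$/    => 'y' ],   # happiness -> happy (restore the "y")
    [ qr/ies$/      => 'y' ],   # parties   -> party
    [ qr/ing$/      => ''  ],   # hosting   -> host
    [ qr/ed$/       => ''  ],   # hosted    -> host
    [ qr/(?<!s)s$/  => ''  ],   # hosts     -> host, but leave "glass" alone
);

sub root_of {
    my ($word) = @_;
    $word = lc $word;

    # Exceptions are checked before any rule is tried.
    return $irregular{$word} if exists $irregular{$word};

    for my $rule (@rules) {
        my ($pattern, $replacement) = @$rule;
        (my $candidate = $word) =~ s/$pattern/$replacement/;
        next if $candidate eq $word;       # rule did not apply
        next if length $candidate < 3;     # "sing" must not become "s"
        return $candidate;
    }
    return $word;                          # already a root, as far as we know
}

print "$_ -> ", root_of($_), "\n"
    for qw(host hosts hosted hosting happiness feet sing);
```

    The ordering of the rules and the minimum root length are design guesses, not established practice; they are only there to show why "just hack off the suffix" is not enough.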
Re: "Suffix" Dictionaries
by rob_au (Abbot) on Dec 03, 2002 at 11:23 UTC
    The term for this type of language processing is stemming. I asked about stemming for site indexes in this thread, which has some excellent replies with links and references on the subject.

     

    perl -le 'print+unpack("N",pack("B32","00000000000000000000000111110000"))'
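    For the search engine case in the original question, one common way to use stemming is to stem words both when building the index and when handling a query, so that "host", "hosts", "hosted" and "hosting" all land on the same key. A rough sketch in core Perl follows. The documents, crude_stem and search are hypothetical, and crude_stem is a deliberately naive stand-in for a real stemmer.

```perl
use strict;
use warnings;

# Placeholder stemmer (an assumption, not a real algorithm): strips a few
# common suffixes from words longer than four letters.
sub crude_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?:ing|ed|s)$// if length $word > 4;
    return $word;
}

# Hypothetical documents: id => text.
my %docs = (
    1 => 'We host static sites',
    2 => 'Hosting plans for small teams',
    3 => 'The server hosted three hosts',
    4 => 'Nothing relevant here',
);

# Inverted index keyed on the stem: stem => { doc id => 1 }.
my %index;
while (my ($id, $text) = each %docs) {
    $index{ crude_stem($_) }{$id} = 1 for $text =~ /(\w+)/g;
}

# Stem the query term the same way, then look it up.
sub search {
    my ($term) = @_;
    my $hits = $index{ crude_stem($term) } || {};
    return sort { $a <=> $b } keys %$hits;
}

print join(',', search('hosts')), "\n";   # prints 1,2,3
```

    Whatever stemmer you end up with, the important part is that the index and the query go through the same function.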
