PerlMonks  

Re^2: Splitting compound (concatenated) words )

by bitingduck (Chaplain)
on May 16, 2012 at 04:20 UTC ( [id://970744]=note )


in reply to Re: Splitting compound (concatenated) words )
in thread Splitting compound (concatenated) words )

Trying to do it with a regex could get pretty time-consuming if the dictionary or the subject text gets very long. I worked on a project in a natural language translation class way longer ago than I care to think about, and the approach to building the dictionary was a linked tree (a trie) where each letter is a node and the letters that can follow it in some word are its child nodes. At the end of each complete word you put a flag node that says "end of word" (EOW). Where a complete word is also the start of a longer one, as with a true compound word, the node has both a child for the next letter and a child with the "EOW" flag.

Kind of like this (where "." is end of word):

            T
          /   \
         H     O
        / \   / \
       E   I .   N
      / \   \     \
     .   N  etc.   .
          \
           .

This dictionary includes "To", "Ton", "The", and "Then", and starts to spell out "This". It makes finding the combined words fast and straightforward, but it doesn't help with distinguishing true compound words (e.g. "bookkeeper") from things like "theme", which could be "the me" (updated here to correct my bad choice of example). If you're really clever you might use some sort of Markov chain tool to guess that.
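A minimal sketch of that letter tree, written in Python for brevity (the same shape maps directly onto nested Perl hashes). The names `build_trie`, `is_word`, and the `"EOW"` key are my own choices for illustration, not anything from the original project; a multi-character key is used for the flag so it can never collide with a single letter.

```python
EOW = "EOW"  # end-of-word flag; multi-character so it never clashes with a letter key


def build_trie(words):
    """Build a nested-dict trie: one node per letter, EOW key marks a complete word."""
    root = {}
    for word in words:
        node = root
        for letter in word.lower():
            # Reuse the child node for this letter, or create it if missing.
            node = node.setdefault(letter, {})
        node[EOW] = True
    return root


def is_word(trie, text):
    """Return True only if text is a complete word, not merely a prefix."""
    node = trie
    for letter in text.lower():
        if letter not in node:
            return False
        node = node[letter]
    return EOW in node


trie = build_trie(["to", "ton", "the", "then"])
print(is_word(trie, "the"))    # True
print(is_word(trie, "th"))     # False: a valid prefix, but no EOW flag
print(sorted(trie["t"]["o"]))  # ['EOW', 'n']: "to" ends here and also continues to "ton"
```

The last line shows the case described above: the node for "to" carries both the end-of-word flag and a child for the next letter.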

I don't know the NLP modules well enough to say whether there's something kicking around on CPAN, but if you have a dictionary to slurp, you could do it yourself fairly easily.
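For the do-it-yourself route, here is a rough sketch (again in Python, with hypothetical names `build_trie` and `splits`) that walks the letter tree and lists every way a run-together string can be cut into dictionary words. The tiny word list is made up for the example; in practice it would come from whatever dictionary file you slurp. It only enumerates candidates and makes no attempt to decide which split is the intended one.

```python
EOW = "EOW"  # end-of-word flag; multi-character so it never clashes with a letter key


def build_trie(words):
    """Build a nested-dict trie with an EOW key on every complete word."""
    root = {}
    for word in words:
        node = root
        for letter in word.lower():
            node = node.setdefault(letter, {})
        node[EOW] = True
    return root


def splits(trie, text):
    """Return every way to cut text into words found in the trie."""
    text = text.lower()
    results = []

    def walk(start, parts):
        if start == len(text):
            results.append(parts)
            return
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no dictionary word continues with this letter
            if EOW in node:
                # text[start:end + 1] is a word; try to split the remainder.
                walk(end + 1, parts + [text[start:end + 1]])

    walk(0, [])
    return results


# Illustrative word list only; a real run would load a full dictionary.
trie = build_trie(["the", "me", "theme", "book", "keeper", "bookkeeper"])
print(splits(trie, "theme"))       # [['the', 'me'], ['theme']]
print(splits(trie, "bookkeeper"))  # [['book', 'keeper'], ['bookkeeper']]
```

Both readings come back for "theme", which is exactly the ambiguity the tree alone cannot resolve; picking between them would need something extra, such as word frequencies.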

Update: You might even get by OK with something like Text::SpellChecker.
