PerlMonks  

Re: testing parts of a string against a word database

by TomDLux (Vicar)
on Dec 01, 2011 at 00:33 UTC (#940975)


in reply to testing parts of a string against a word database

I normally complain about people reaching for features like regexes when simpler mechanisms are available. In this case, I think you are over-simplifying by walking the string with substr() when you could process it in one batch. I do see that you collect the punctuation at the top but never do anything with it ... maybe that's a bit of code you cleared away as not relevant to the problem.

What I would consider is merging the punctuation regex with splitting the line into words, using split to partition on non-word characters ... that is, not alpha, not numeric, not underscore. If that's too generous, you can be more specific.

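# \W+ treats a run such as ", " as one separator; grep drops a leading empty field
my @words = grep { length } split /\W+/, $sen;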
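
To make the "too generous" point concrete, here is a small sketch comparing three separators on a made-up sentence. The sample text and variable names are only for illustration, not taken from your script.

```perl
use strict;
use warnings;

# Made-up sample sentence, purely for illustration.
my $sen = 'Well, the cat_2 sat; the dog slept.';

# A single \W splits at every non-word character, so ", " leaves an
# empty field between the comma and the space.
my @single = split /\W/, $sen;

# \W+ treats a whole run of non-word characters as one separator.
my @runs = split /\W+/, $sen;

# Stricter: anything that is not an ASCII letter separates words, so
# digits and underscores no longer count as part of a word.
my @letters = grep { length $_ } split /[^A-Za-z]+/, $sen;

printf "%d fields with \\W, %d with \\W+\n", scalar @single, scalar @runs;
print "letters only: @letters\n";
```

Which class is right depends on what counts as a word in your data; apostrophes and hyphens, for instance, would need to be allowed explicitly.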

Also, how many NOUNS are you dealing with? If it's only a few million, I would read the file into a hash once and check each word against the hash. Re-reading the file dozens, hundreds or thousands of times is ghastly slow, and a few megabytes for the hash is not excessively painful. Maybe you can save a copy of nouns.txt split into one word per line ... or save it as a YAML file or some other format that loads quickly as a Perl data structure.
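
A minimal sketch of the hash approach, under some assumptions: load_nouns and find_nouns are hypothetical helper names, the nouns are taken to be whitespace-separated, matching is made case-insensitive, and an in-memory string stands in for nouns.txt so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read whitespace-separated nouns from a filehandle
# into a hash, once, so every later check is a single hash lookup.
sub load_nouns {
    my ($fh) = @_;
    my %nouns;
    while ( my $line = <$fh> ) {
        # Lower-case the keys so lookups are case-insensitive (an assumption).
        $nouns{ lc $_ } = 1 for split ' ', $line;
    }
    return \%nouns;
}

# Hypothetical helper: split a sentence on runs of non-word characters
# and keep only the words found in the noun hash.
sub find_nouns {
    my ( $nouns, $sen ) = @_;
    return grep { $nouns->{ lc $_ } } split /\W+/, $sen;
}

# In-memory data standing in for nouns.txt.
my $data = "dog cat\nhouse tree\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $nouns = load_nouns($fh);
close $fh;

my @found = find_nouns( $nouns, 'The dog sat, quietly, under the tree.' );
print "@found\n";    # prints "dog tree"
```

With a real file you would open nouns.txt in place of the in-memory string; the file is then read exactly once, no matter how many sentences you check.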

As Occam said: Entia non sunt multiplicanda praeter necessitatem.


Re^2: testing parts of a string against a word database
by Rudolf (Monk) on Dec 01, 2011 at 01:45 UTC

    I'm learning a lot from your help, much appreciated Tom!
