PerlMonks  

Re: testing parts of a string against a word database

by TomDLux (Vicar)
on Dec 01, 2011 at 00:33 UTC


in reply to testing parts of a string against a word database

I normally complain about people using features like regexes when simpler mechanisms are available. In this case, I think you are over-simplifying with substr() when you could batch process. I see you are collecting the punctuation at the top, although you don't do anything with it; maybe that's a bit of code you cleared away as not relevant to the problem.

What I would consider is merging the punctuation handling with splitting the line into words: use split to partition on non-word characters, that is, anything that is not a letter, a digit, or an underscore. If that's too generous, you can use a more specific character class.

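# \W+ treats a run of punctuation/whitespace as one separator; grep drops a leading empty field
my @words = grep { length } split /\W+/, $sen;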
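
To make the "too generous" point concrete, here is a small sketch. The sample sentence and the apostrophe-preserving character class are my own illustrations, not anything from the original question; only $sen is a name from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample sentence, for illustration only.
my $sen = "Hello, world! The cat's hat -- it's red.";

# Splitting on a single \W leaves empty strings wherever two
# separators sit side by side (", " or " -- ").
my @raw = split /\W/, $sen;

# Splitting on runs of \W avoids those; grep { length } also drops the
# leading empty field you get if the line starts with punctuation.
my @words = grep { length } split /\W+/, $sen;

# A more specific class: keep apostrophes inside words, so "cat's"
# stays in one piece instead of becoming "cat" and "s".
my @keep_apos = grep { length } split /[^\w']+/, $sen;

print join('|', @words), "\n";
print join('|', @keep_apos), "\n";
```

Whether you want "cat's" kept whole or broken into "cat" and "s" depends on what is in your word list, so treat the second pattern as one option rather than the answer.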

Also, how many NOUNS are you dealing with? If it's only a few million, I would read them into a hash once and check each word against the hash. Reading the file dozens, hundreds, or thousands of times is ghastly slow, and a few megabytes for the hash is not excessively painful. Maybe you can save a copy of nouns.txt split into one word per line, or save it as a YAML file or some other format that loads quickly as a Perl data structure.
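
A minimal sketch of the hash approach, assuming the nouns are stored one per line and that matching should ignore case. The sub names load_nouns and nouns_in are made up for this example, and the in-memory list stands in for nouns.txt so the snippet runs on its own.

```perl
use strict;
use warnings;

# Read one noun per line from a filehandle into a lookup hash.
# The one-word-per-line layout is an assumption about nouns.txt.
sub load_nouns {
    my ($fh) = @_;
    my %is_noun;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;      # trim stray whitespace
        next unless length $line;     # skip blank lines
        $is_noun{ lc $line } = 1;     # lowercase keys: case-insensitive match
    }
    return \%is_noun;
}

# Return the words of a sentence that appear in the noun hash.
sub nouns_in {
    my ($sen, $is_noun) = @_;
    return grep { $is_noun->{ lc $_ } } grep { length } split /\W+/, $sen;
}

# Stand-in data so the sketch runs without a file. In real use:
#   open my $fh, '<', 'nouns.txt' or die "Cannot open nouns.txt: $!";
my $sample = "cat\ndog\nhat\n";
open my $fh, '<', \$sample or die "Cannot open in-memory list: $!";
my $is_noun = load_nouns($fh);
close $fh;

my @found = nouns_in("The cat sat on the hat.", $is_noun);
print "@found\n";
```

The file is read exactly once, and every later check is a single hash lookup. If loading the text file itself turns out to be the slow part, the core Storable module (store and retrieve) is one way to save the hash in a format that loads quickly, though I have not measured it against your data.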

As Occam said: Entia non sunt multiplicanda praeter necessitatem.


Re^2: testing parts of a string against a word database
by Rudolf (Monk) on Dec 01, 2011 at 01:45 UTC

    I'm learning a lot from your help, much appreciated Tom!
