PerlMonks  

Re: testing parts of a string against a word database

by TomDLux (Vicar)
on Dec 01, 2011 at 00:33 UTC


in reply to testing parts of a string against a word database

I normally complain about people reaching for features like regexes when simpler mechanisms are available. In this case, though, I think you are over-simplifying with substr() when you could batch process. I also see you collect the punctuation at the top but never do anything with it; maybe that's a bit of code you cleared away as not relevant to the problem.

What I would consider is merging the punctuation regex with splitting the line into words: use split to partition on non-word characters, that is, anything that is not a letter, a digit, or an underscore. If that's too generous, you can be more specific.

# \W+ treats a run of punctuation and spaces as one separator, avoiding empty fields
my @words = split /\W+/, $sen;
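To make the behaviour of the split concrete, here is a minimal sketch. The sample sentence is made up, and the grep that drops empty fields is my own assumption about what you want, not something taken from your code. It splits on runs of non-word characters (/\W+/), because splitting on a single /\W/ yields an empty field for every extra separator, such as the space after a comma.

```perl
use strict;
use warnings;

# Hypothetical sample sentence, for illustration only.
my $sen = q{"Hello," said the cat; the dog's bowl was empty!};

# Split on runs of non-word characters (anything other than letters,
# digits and underscore). A separator at the very start of the string
# still yields one empty leading field, so grep drops zero-length strings.
my @words = grep { length } split /\W+/, $sen;

print join('|', @words), "\n";
# prints: Hello|said|the|cat|the|dog|s|bowl|was|empty
```

Note that "dog's" comes out as "dog" and "s". If that matters, a more specific pattern such as /[^\w']+/ keeps apostrophes inside words.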

Also, how many NOUNS are you dealing with? If it's only a few million, I would read the list into a hash once and check each word against the hash. Re-reading the file dozens, hundreds, or thousands of times is ghastly slow, and a few megabytes for the hash is not excessively painful. Maybe you can save a copy of nouns.txt split into one word per line, or save it as a YAML file or some other format that loads quickly as a Perl data structure.
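A rough sketch of the hash approach follows. I am assuming nouns.txt holds one word per line and that matching should ignore case. The in-memory string stands in for the real file so the example runs on its own, and the names %is_noun and $noun_data are just illustrative.

```perl
use strict;
use warnings;

# Stand-in for nouns.txt, assumed to be one word per line (made-up contents).
# With the real file you would write:
#   open my $fh, '<', 'nouns.txt' or die "Cannot open nouns.txt: $!";
my $noun_data = "cat\ndog\nbowl\n";
open my $fh, '<', \$noun_data or die "Cannot open noun list: $!";

# Read the list exactly once; every noun becomes a hash key.
my %is_noun;
while (my $line = <$fh>) {
    chomp $line;
    next unless length $line;    # skip blank lines
    $is_noun{lc $line} = 1;      # lower-case for case-insensitive matching
}
close $fh;

my $sen   = 'The cat sat by the bowl.';
my @words = grep { length } split /\W+/, $sen;

# Each check is now a hash lookup instead of another pass over the file.
my @found = grep { $is_noun{lc $_} } @words;

print "Nouns found: @found\n";
# prints: Nouns found: cat bowl
```

The file is read a single time no matter how many sentences or words you test afterwards, which is where the speedup comes from.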

As Occam said: Entia non sunt multiplicanda praeter necessitatem. (Entities are not to be multiplied beyond necessity.)

Re^2: testing parts of a string against a word database
by Rudolf (Pilgrim) on Dec 01, 2011 at 01:45 UTC

    I'm learning a lot from your help, much appreciated Tom!
