PerlMonks  

testing parts of a string against a word database

by Rudolf (Pilgrim)
on Nov 30, 2011 at 23:33 UTC ( [id://940971]=perlquestion )

Rudolf has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, the program I'm working on attempts to accomplish two things: 1: break a sentence (entered by the user) into its elements (words, or whatever is separated by a space character). 2: test those elements against a file of nouns to see if they match. If the first element in the sentence is a match, the program works the way I want it to (ex: Cars bla bla.): Cars will match. However, if the element is not the first one (ex: I like Cars.), Cars will not match Cars in the txt file. Any help or tips/hints would be much appreciated. Thank you!

#!/usr/bin/perl

my $sen = <STDIN>;
chomp $sen;

if($sen =~ s/(\.|\?|\!)$/ /g){ #get punctuation and replace with white space
    $punctuation = $&;
}

while($sen =~ m/ /g){ # test for spaces in sentence
    my $pos = pos $sen;
    my $element = substr($sen,0,$pos,""); # get chunk of sentence
    chop $element; #remove end whitespace
    push(@senElements,$element); #push chunk into array
}

open(NOUNS,'<',"nouns.txt") or die "Can't open noun database: $!\n";

# # # attempt to recognize sentence elements as a noun via file nouns.txt # # #

foreach $element (@senElements){
    while(<NOUNS>){
        chomp(my $line = $_);
        $line =~ s/ |\n//g; #remove any space chars and newlines from file line
        if($element =~ m/^($line)$/i){
            print "\n!MATCH! ~ $element is a noun\n";
        }
    }
}
close(NOUNS);

Replies are listed 'Best First'.
Re: testing parts of a string against a word database
by Eliya (Vicar) on Dec 01, 2011 at 00:16 UTC

    The problem is that you're trying to iterate through the filehandle NOUNS multiple times without reopening or resetting the file pointer to the beginning of the file.

    So either use seek, or change the nesting of the loops so that you just have to go through the file once, i.e.

    while (<NOUNS>) { foreach my $element (@senElements) { ... } }
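For the seek route mentioned above, a minimal sketch might look like the following. The helper name match_nouns and the in-memory noun list are assumptions made only so the example runs on its own; in the original program the handle would be the one opened on nouns.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the elements that appear in the noun list.
# $fh is any seekable read handle holding one noun per line.
sub match_nouns {
    my ($fh, @elements) = @_;
    my @matches;
    foreach my $element (@elements) {
        # Rewind before each pass; without this, the second and later
        # elements read from a handle that is already at end-of-file.
        seek($fh, 0, 0) or die "Can't rewind noun list: $!\n";
        while (my $line = <$fh>) {
            $line =~ s/\s+//g;          # strip spaces and the newline
            next unless length $line;
            push @matches, $element if lc $element eq lc $line;
        }
    }
    return @matches;
}

# In-memory stand-in for nouns.txt (an assumption for this sketch).
my $nouns = "Cars\nApple\n";
open(my $fh, '<', \$nouns) or die "Can't open in-memory noun list: $!\n";
print "$_ is a noun\n" for match_nouns($fh, qw(I like Cars));
close($fh);
```

This still rereads the whole list once per element, so swapping the loop nesting is the cheaper fix when the noun file is large.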
      Ah, a rookie mistake it seems. Thank you for your time Eliya!
Re: testing parts of a string against a word database
by TomDLux (Vicar) on Dec 01, 2011 at 00:33 UTC

    I normally complain about people using features like regex when simpler mechanisms are available. In this case, I think you are over-simplifying, with substr(), when you could batch process. But I see you are collecting the punctuation you see, at the top, although you don't do anything with it ... maybe that's a bit of code you cleared away as not relevant to the problem.

    What I would consider is merging the punctuation regex with splitting the line into words, using split to partition on non-word characters ... that is, not alpha, not numeric, not underscore. If that's too generous, you can be more specific.

    my @words = split /\W/, $sen;
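To see what that split produces, here is a small sketch. The helper name words_of is hypothetical, and it uses \W+ rather than \W so that a run such as ", " counts as one separator; that is a variation on the line above, not a correction of it.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sentence into its word-character runs.
sub words_of {
    my ($sen) = @_;
    # split drops trailing empty fields by itself; grep removes the empty
    # leading field produced when the sentence starts with punctuation.
    return grep { length } split /\W+/, $sen;
}

my @words = words_of("I like Cars.");
print join('|', @words), "\n";    # prints I|like|Cars
```

Note that \W also splits on apostrophes, so "don't" becomes "don" and "t". If that matters, a narrower separator such as /[\s.?!,]+/ is one option.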

    Also, how many NOUNS are you dealing with? If it's only a few million, I would read it into a hash, and check each word against the hash. Reading the file dozens, hundreds or thousands of times, is ghastly slow. A few megabytes for the hash is not excessively painful. Maybe you can save a copy of nouns.txt split into one word per line ... or save it as a YAML file or some other format that loads quickly as a Perl data structure.
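A rough sketch of the hash approach, assuming nouns.txt holds one noun per line. The helper name load_nouns and the in-memory noun list are illustrative assumptions, there only to make the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: load one noun per line into a hash keyed by the
# lower-cased word, so each later check is a single hash lookup.
sub load_nouns {
    my ($fh) = @_;
    my %is_noun;
    while (my $line = <$fh>) {
        $line =~ s/\s+//g;              # strip spaces and the newline
        $is_noun{lc $line} = 1 if length $line;
    }
    return \%is_noun;
}

# In-memory stand-in for nouns.txt (an assumption for this sketch).
my $nouns = "Cars\nApple\n";
open(my $fh, '<', \$nouns) or die "Can't open in-memory noun list: $!\n";
my $is_noun = load_nouns($fh);
close($fh);

# The file is read once; every word is then checked against the hash.
for my $word (grep { length } split /\W+/, "I like Cars.") {
    print "$word is a noun\n" if $is_noun->{lc $word};
}
```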

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      I'm learning a lot from your help, much appreciated Tom!

Re: testing parts of a string against a word database
by hbm (Hermit) on Dec 01, 2011 at 00:57 UTC

    I'd put the input into a hash; then scan nouns.txt (once) and for each word, see if it is in the hash.

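    use strict;
    use warnings;

    # Map each lower-cased input word to its original spelling,
    # with trailing sentence punctuation removed.
    my %words = map { s/[.?!]$//; lc $_, $_ } split/\s+/,<STDIN>;

    open(NOUNS,'<',"nouns.txt") or die "Can't open noun database: $!\n";
    while(<NOUNS>){
        s/\s+//g;    # strip spaces and the newline from the file line
        print "\n!MATCH! ~ $words{lc $_} is a noun\n" if exists $words{lc $_};
    }
    close(NOUNS);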
Re: testing parts of a string against a word database
by TJPride (Pilgrim) on Dec 01, 2011 at 01:45 UTC
    use strict;
    use warnings;

    my (%words, $c);

    ### Contains words 'apple' and 'cart'
    open (HANDLE, 'words.txt') or die "Can't open words.txt: $!\n";
    while (<HANDLE>) {
        chomp;
        $words{uc $_}++;    # key on the upper-cased word
    }
    close(HANDLE);

    while (<DATA>) {
        $c++;    # input line counter
        while (m/([A-Z']+)/ig) {
            print "[$c] $1 is a noun.\n" if $words{uc $1};
        }
    }

    __DATA__
    Jared ran forward.
    He picked up an apple.
    He put the apple in the cart.
