Re: extract phrases of n-words length

by ack (Deacon)
on Jun 24, 2009 at 17:13 UTC ( [id://774457]=note )


in reply to extract phrases of n-words length

Much like suaveantk and the others, I did not try to optimize but looked at how to easily handle any number of words in the phrase. I use a Max Phrase Length ($maxPhraseLen) but didn't think of what Polyglot did, which is to also allow a Min Phrase Length to be specified. I think adding that flexibility would be an easy mod to my approach.

Note that my code allows phrase lengths of 1, which could be used (with some approach for editing out uninteresting words like 'a', 'an', 'the', etc.) to generate keywords for searching, for example, the full test writeup that the abstract might refer to.
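A minimal sketch of that kind of keyword filtering, assuming a small hand-picked stop list. The %stop list and the sub name keywords_from are my own illustration, not something from the OP's code:

```perl
use strict;
use warnings;

# Hypothetical stop list; extend it to suit your own text.
my %stop = map { $_ => 1 } qw(a an the is of and);

# Return the unique, lower-cased words that are not in the stop list.
sub keywords_from {
    my ($text) = @_;
    my %seen;
    my @keywords;
    foreach my $word (split /\s+/, $text) {
        # Lower-case the word and trim punctuation from both ends.
        (my $clean = lc $word) =~ s/^[^a-z0-9]+|[^a-z0-9]+$//g;
        next if $clean eq '' or $stop{$clean} or $seen{$clean}++;
        push @keywords, $clean;
    }
    return @keywords;
}

print join(', ', keywords_from('Perl is a high-level, general-purpose language.')), "\n";
```

The trimming only touches the edges of each word, so hyphenated terms such as 'high-level' survive intact.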

My approach, shown in the code below, is to use an array of hashes where the array index indicates the number of words in the phrase and the hashes are just the same hashes that the OP used.
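To make the shape of that structure concrete, here is a tiny sketch with a few hand-entered phrases (the sample phrases are only for illustration):

```perl
use strict;
use warnings;

my $phrases = [];                      # array ref; index = words per phrase

# Autovivification creates each inner hash on first use.
$phrases->[1]{'Perl'}    = undef;
$phrases->[2]{'Perl is'} = undef;
$phrases->[2]{'is a'}    = undef;
$phrases->[2]{'Perl is'} = undef;      # duplicate key, so no new entry

# Index 0 is never assigned, so it stays undef.
my $count = keys %{ $phrases->[2] };
print "two-word phrases: $count\n";
```

Because the phrases are hash keys, repeated phrases collapse to a single entry for free.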

The code, then, is as follows:

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Array of hashes containing the different length phrases. The index
    # of each entry is the number of words in the phrases; the hash
    # contains all of the phrases of that length, exactly as the OP did
    # in the example.
    my $phrases = [];

    # Set to 4 to match the OP's example. Can be set to any value from 1
    # to the number of words in the abstract.
    my $maxPhraseLen = 4;

    my $abstract = 'Perl is a high-level, general-purpose, interpreted, ';
    $abstract .= 'dynamic programming language.';

    my @words = split(/\s+/, $abstract);

    die "Max Phrase Length exceeds number of words in abstract: aborting\n"
        if ($maxPhraseLen > (scalar @words));

    foreach my $numWordsInPhrase (1..$maxPhraseLen) {
        # The last valid start index is $#words - $numWordsInPhrase + 1;
        # stopping one short would drop the final phrase of each length.
        foreach my $index (0..($#words - $numWordsInPhrase + 1)) {
            my $phrase = "";    # clear out the phrase accumulator
            for (my $i = 0; $i < $numWordsInPhrase; $i++) {
                $phrase .= $words[($index + $i)] . ' ';
            }
            $phrases->[$numWordsInPhrase]->{$phrase} = undef;
        }
    }

    print "\n";
    for (my $i = 1; $i < (scalar @$phrases); $i++) {
        print "PHRASE LENGTH $i:\n";
        foreach my $phrase (keys %{$phrases->[$i]}) {
            print "   $phrase\n";
        }
    }

    exit(0);

This yields output that matches the OP's. You can set $maxPhraseLen to any value you want, up to the number of words in the abstract ($abstract). If you try to set the max phrase length to more than the number of words in the abstract, the script dies with the message "Max Phrase Length exceeds number of words in abstract: aborting".
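As for the Min Phrase Length that Polyglot used, here is a rough sketch of how it could be bolted on. The sub name phrases_between and its parameters are my own invention for illustration, and I use join on an array slice rather than the accumulator loop:

```perl
use strict;
use warnings;

# Hypothetical helper: collect phrases of $min..$max words from $text.
sub phrases_between {
    my ($text, $min, $max) = @_;
    my @words = split ' ', $text;

    die "Min Phrase Length must be between 1 and Max Phrase Length\n"
        if $min < 1 or $min > $max;
    die "Max Phrase Length exceeds number of words in abstract: aborting\n"
        if $max > @words;

    my $phrases = [];    # index = number of words in the phrase
    foreach my $n ($min .. $max) {
        # Start indices run from 0 to (word count - $n), inclusive.
        foreach my $start (0 .. @words - $n) {
            my $phrase = join ' ', @words[$start .. $start + $n - 1];
            $phrases->[$n]{$phrase} = undef;
        }
    }
    return $phrases;
}

my $found = phrases_between('Perl is a dynamic programming language.', 2, 3);
foreach my $n (2 .. 3) {
    print "PHRASE LENGTH $n:\n";
    print "   $_\n" for sort keys %{ $found->[$n] };
}
```

Entries below the minimum length are simply never created, so the array slots for them stay undef.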

That is my suggestion for a way to handle the OP's objective.

ack Albuquerque, NM
