Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

extract phrases of n-words length

by arun_kom (Monk)
on Jun 24, 2009 at 14:54 UTC ( #774421=perlquestion: print w/ replies, xml ) Need Help??
arun_kom has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I would like to extract from an abstract, all unique phrases of two-, three- and four-words length in seperate groups. Order of the words in the phrase must be preserved but the order of the phrases in the output need not be. Using a hash achieves this in addition to removing duplicates if any. Please see below for my solution that achieves what i want.
However, I would like to know what is a better way as this approach will be cumbersome to scale up for phrases of n-words length as n gets larger.

Thanks in advance.
#!/usr/bin/perl use strict; use warnings; my $abstract = 'Perl is a high-level, general-purpose, interpreted, dy +namic programming language.'; my (%double_words, %triple_words, %four_words); my @words = split(/\s+/, $abstract); foreach(0..$#words){ $double_words{$words[$_].' '.$words[$_+1]}= undef if($_<$#words); $triple_words{$words[$_].' '.$words[$_+1].' '.$words[$_+2]}= undef +if($_<$#words-1); $four_words{$words[$_].' '.$words[$_+1].' '.$words[$_+2].' '.$words +[$_+3]}= undef if($_<$#words-2); } foreach(keys %four_words){ print $_."\n"; } #sample output
Sample output of all four-word phrases:

general-purpose, interpreted, dynamic programming
a high-level, general-purpose, interpreted,
Perl is a high-level,
interpreted, dynamic programming language.
is a high-level, general-purpose,
high-level, general-purpose, interpreted, dynamic

Comment on extract phrases of n-words length
Download Code
Re: extract phrases of n-words length
by wfsp (Abbot) on Jun 24, 2009 at 15:24 UTC
    An array slice might help here
    @array[$start..$end]
    #! /usr/bin/perl use strict; use warnings; use Data::Dumper; $Data::Dumper::Indent = 2; my $abstract = q{ Perl is a high-level, general-purpose, interpreted, dynamic programming language. }; my @words = split(q{ }, $abstract); my %phrase_table; my $count = $#words; for my $phrase_length (2..4){ for my $i (0..$count-$phrase_length+1){ push @{$phrase_table{$phrase_length}}, [ @words[$i..$i+$phrase_length-1] ]; } } print Dumper \%phrase_table;
    output update: <cough> Had an "off by two" error. :-)
Re: extract phrases of n-words length
by BrowserUk (Pope) on Jun 24, 2009 at 15:32 UTC

    You can simplify the code a little:

    #! perl -sw use 5.010; use strict; use Data::Dump qw[ pp ]; my $abstract = 'Perl is a high-level, general-purpose, interpreted, dy +namic programming language.'; my (%double_words, %triple_words, %four_words); my @words = split ' ', $abstract; $double_words{ join ' ', @words[ $_ .. $_ +1 ] } = undef for 0 .. $#wo +rds - 1; $triple_words{ join ' ', @words[ $_ .. $_ +2 ] } = undef for 0 .. $#wo +rds - 2; $four_words{ join ' ', @words[ $_ .. $_ +3 ] } = undef for 0 .. $#wo +rds - 3; say join '|', keys %double_words; say join '|', keys %triple_words; say join '|', keys %four_words;

    But that's doing essentially the same amount of work. I don't see much in the way of optimisations, but do you need them? How big can your abstracts get?

    I ran the above on a text contain 21,000 words and it took a little over 1/10th second.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: extract phrases of n-words length
by Polyglot (Monk) on Jun 24, 2009 at 15:37 UTC
    I might do it something like this:
    my $abstract = 'Perl is a high-level, general-purpose, interpreted, dynamic programming language.'; my $maxphraselength = 4; my $minphraselength = 2; my %phrasearray; my ($i,$offset); $abstract =~ s/\n|\r//g; for ($i=$minphraselength; $i<$maxphraselength; $i++) { for ($offset=0; $offset<$i; $offset++) { while ($abstract =~ /(\w+\W+){$offset}((\w+\W+){$i})/) { $phrasearray{$2}++; } } } foreach(keys %phrasearray){ print $_."\n"; } #sample output
    Would that work?

    Blessings,

    ~Polyglot~

Re: extract phrases of n-words length
by suaveant (Parson) on Jun 24, 2009 at 15:57 UTC
    One thing you could do to scale up would be instead of using many different hashes, use a single multi-level hash... so you would do $phrase{$word_one}{$word_two}{$word_three}{$word_four}, that should save some space and get you all your phrases, though it would be more work to get them out.

    Then again, all the added hash overhead might nullify the savings... it would make the data more usable in certain applications, however.

                    - Ant
                    - Some of my best work - (1 2 3)

Re: extract phrases of n-words length
by ack (Deacon) on Jun 24, 2009 at 17:13 UTC

    Much like suaveantk and the others, I did not try to optimize but looked at how to easily handle any number of words in the phrase. I use a Max Phrase Length ($maxPhraseLen) but didn't think of what Polyglot did which to also have the ability to specify a Min Phrase Length. I think it is an easy mod to my approach to add that additional flexibility.

    Note that in my code I allow for phrase lengths of 1 which could be used to generate (with some approach for editing out non-interesting words like 'a', 'an', 'the', etc) keywords for searching, for example, the full test writeup that the abstract might refer to.

    My approach, shown in the code below, is to use an array of hashes where the array index indicates the number of words in the phrase and the hashes are just the same hashes that the OP used.

    The code, then, is as follows:

    #!/usr/bin/perl/ use warnings; use strict; my $phrases = []; # Array of Hashes containing the different # length phrases...the index of each entry # is the number of words in tne phrases, the # hash contains all of the phrases of that length, # exactly as the OP did in the example my $maxPhraseLen = 4; # to match the OPs example. Can be set to any # value from 1 to the Length of the Abstract my $abstract = 'Perl is a high-level, general-purpose, interpreted, ' +; $abstract .= 'dynamic programming language.'; my @words = split(/\s+/, $abstract); die "Max Phrase Length exceeds number of words in abstract: aborting\n +" if($maxPhraseLen > (scalar @words)); foreach my $numWordsInPhrase (1..$maxPhraseLen){ foreach my $index (0..($#words-$numWordsInPhrase)) { my $phrase = ""; # clear out the phrase accumulator for(my $i = 0; $i<$numWordsInPhrase; $i++){ $phrase .= $words[($index+$i)] . ' '; } $phrases->[$numWordsInPhrase]->{$phrase} = undef; } } print "\n"; foreach(my $i = 1; $i<(scalar @$phrases); $i++){ print "PHRASE LENGTH $i:\n"; foreach my $phrase (keys %{$phrases->[$i]}){ print " $phrase\n"; } } exit(0);

    This yields the output that matches the OPs. But you can modify the $maxPhraseLen to be any value you want (up to the number of words in the abstract, $abstract...if you try to set the max phrase length to greater than the number of words in the abstract, the script dies with the message "Max Phrase Length exceeds number of words in abstract: aborting".

    That is my suggestion for a way to handle the OPs objective.

    ack Albuquerque, NM
Re: extract phrases of n-words length
by Fletch (Chancellor) on Jun 24, 2009 at 17:55 UTC
    use feature ':5.10'; use strict; use List::MoreUtils qw( natatime ); my $test_sentence = "Perl is a high-level, general-purpose, interpreted, dynamic pro +gramming language."; sub uniq_phrases { my $sentence = shift; my ( $min, $max ); given ( scalar @_ ) { when (2) { ( $min, $max ) = @_; } when (1) { ( $min, $max ) = ($_[0], $_[0]); } default { ( $min, $max ) = ( 2, 4 ); }; } my @words = split( /\s+/, $sentence ); my @pairs; for my $size ( $min .. $max ) { my %seen; for my $window ( 0 .. ( $#words - $size ) ) { my $it = natatime $size, @words[ $window .. $#words ]; while ( my @p = $it->() ) { next if @p != $size; my $p = join( " ", @p ); next if $seen{$p}++; push @pairs, $p; } } } return wantarray ? @pairs : \@pairs; } say join( "\n", sort { $a cmp $b } uniq_phrases( $test_sentence, 4 ) ) +; __END__
    Or . . .

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: extract phrases of n-words length
by jandrew (Hermit) on Jun 25, 2009 at 00:38 UTC
    At the risk of being redundant and borrowing heavily I submit a possible variation
    use strict; use warnings; use Smart::Comments '###'; my $abstract = 'Perl is a high-level, general-purpose, interpreted, + dynamic programming language.'; my $MinLength = 2;#Must be at least 1; my $MaxLength = 4;#Must be >= $MinLength my $MinPosition = $MinLength - 1;#Counting from 0 my ( $ArrayRef, $AnswerRef ) = (); @$ArrayRef = split(/\s+/, $abstract); #Iterate through @$ArrayRef for my $position (0..($#$ArrayRef-$MinPosition)) { #handle the first word with no space my $phrase = $ArrayRef->[$position]; for my $x (1..$MaxLength) {#Build the phrase #For all but the first time $phrase .= ' ' . $ArrayRef->[$position] if $x != 1; #Load the output $AnswerRef->[$x]->{$phrase} = 1 if $x >= $MinLength; $position++; #Bump up against the end last if $position > $#$ArrayRef; } } #use Smart::Comment.pm ### The unique 3 word phrases are: $AnswerRef->[3]
    Update adjusted the output to demonstrate surfing the answer ref and used .=
Re: extract phrases of n-words length
by planetscape (Canon) on Jun 25, 2009 at 00:40 UTC

    I think you'd benefit from taking a look at Ted Pedersen's Ngram Statistics Package, which not only handles n-grams composed of letters (two letters being a bigram, for example), but also of words.

    HTH,

    planetscape
Re: extract phrases of n-words length
by ELISHEVA (Prior) on Jun 25, 2009 at 05:10 UTC

    There are as I see it two scalability issues in the above code. The first of these has been addressed well by the various suggestions to use a sliding window ($i...$i+$n on the array of words).

    The second is the keys themselves. You are currently using the actual words in each phrase as a key. This means that searching for all sequences of N-M words in a text with K word characters (e.g. alpha-numerics) could concievably take up (N+N+1+...+M)*K characters for its keys alone. The actual amount depends on the frequency of each particular word runs: "a a a ..." will obviously use less space than "a b c d e..." or "a..z b..za c..zab ...".

    If you intend to make either N, the range from N...M or K large, you might want to consider assigning each word in the text a integer id and composing your key out of the integers rather than the word strings as you are doing now. Keys composed out of numerical indexes would save space and possibly search time.

    In pseudocode, this would work something like this:

    my @aWords; # words from the abstract my %hWords; # maps words to their id my $iUniqueWordsSoFar; # cheap way to assign ids my @aIds; # sliding window with ids from last N-M words for (0..($#aWords-$n)) { #look up/assign word id my $sWord = $aWords[$_]; my $iWord; if (exists($hWords{$sWord})) { $iWord = $hWords{$sWord}; } else { $iWord = $hWords{$sWord} = ++$iUniqueWordsSoFar; } # update sliding window of ids for last M words shift(@aIds) if scalar(@aIds); push @aIds, $iWord; # add key to hash for N..M length phrases by taking # first X elements of sliding window to construct # the key. } # final pass: convert what is left in @aIds to keys and # update appropriate phrase hashes.

    Best, beth

      Keys composed out of numerical indexes would save space ...

      Not so! Using the text from your post as a representative abstract, and building the the 3 hashes using both the original words and integers representing those words I get the following hash sizes:

      c:\test>774421-2 Using words: 44501 Using numbers: 40801

      Giving a total saving just under 10%. But... If you add in the requirements of the array required to hold the ordered numbers--ordering is a requirement--plus a hash used to unique the words:

      Numbers array: 12776 Lookup hash: 8860

      The memory requirement is nearly 50% higher.

      As for "and possibly search time.". As the hash keys are the concatenation of the string representations of the integers assigned, and hash keys are hashed byte-wise; and the average (English) words length is a little under 6-chars; and the average length of the integers required to represent them a little under 5-chars; the difference in the time taken to actually hash the keys will be very minimal. And as the number of pairs, triples & quads will be the same, the hashes will contain the same number of keys, so the lookup times will be very similar. These are the timings I got using the above sample:

      Lookup words: 0.000440120697021484 seconds Lookup numbers: 0.000699043273925781 seconds Lookups words: 515; numbers:515 Lookup words: 0.000420093536376953 seconds Lookup numbers: 0.000321149826049805 seconds Lookups words: 515; numbers:515 Lookup words: 0.000285863876342773 seconds Lookup numbers: 0.000767946243286133 seconds Lookups words: 515; numbers:515

      Too wildly varying to consider making an assessment.

      The code used to produce the figures:


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        What array to hold the ordered numbers? The algorithm I proposed had a moving window - an array that never held more than M integers where M is the maximum phrase length - it used the original word array to keep track of ordering. The code above translates the entire word array into numbers before generating the phrase combinations. That is like comparing slurping to line at a time file reads.

        The key question is: is my post representative of the word distribution of abstracts? Truthfully, I don't know, but you raise an interesting point. The space efficiency of the algorithm described above is highly dependent on the length of repeated words and the rate of word repetition without word order repetition - of which there is little in that text.

        It is hardly surprising that you would get the minimal space benefits you did with my post's text (only 10%). There are 186 distinct words (defined as runs bounded by whitespace, start of text or end of text) and both the median, average word repetition is less than 2 and word length is about 4.6 rather than 6. The average 3 word phrase contains only 11 characters including whitespace. This is only slightly longer than its numeric representation. Given that there are 186 words, the average numeric id is closer to three characters. For a phrase, that means an id of between 8 and 11 characters (2+3*2=8, 2+3*3=11).

        As an extreme example of why word order and word length matters, counting word runs in "a b a b a b" extended even ad-infinitum is going to cost more using indexing than not. Alhough words repeat greatly, only two phrases ever appear: "a b" and "b a". As words are the same length as the integers that represent them, the indexing mechanism would save nothing. The indexing hash takes up space and there is 0 saving in key length: key length of "1,2" and "2,1" is exactly the same as "a b", "b a".

        Best, beth

        Update:initially misunderstood what BrowserUK meant by array to hold the sequence.

        Update:added more statistics on text sample; added acknolwegement to BrowserUK.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://774421]
Approved by wfsp
Front-paged by wfsp
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2014-09-19 23:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (151 votes), past polls