PerlMonks  

count trigrams of a whole file

by lakssreedhar (Acolyte)
on Dec 20, 2012 at 08:10 UTC (#1009701=perlquestion)
lakssreedhar has asked for the wisdom of the Perl Monks concerning the following question:

Re: count trigrams of a whole file
by frozenwithjoy (Curate) on Dec 20, 2012 at 08:50 UTC

    UPDATE:

    A quick fix (if your file isn't too large or if you have sufficient RAM) would be to populate @words like so:

    while (<>) { push @words, split /\s/; }

    That way, you can move on to your for loop and you should get the result you want. This was the result I got after making this change:

    trigram frequencies in your text:
    iwentthere! 1
    wentthere!she 1
    there!shealso 1
    shealsowent 1
    alsowentthere. 1
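One caveat worth checking with the split /\s/ used above: it splits on every single whitespace character, so leading or repeated whitespace produces empty strings that can then end up inside trigrams. The sketch below compares it with the special split ' ' form, which skips leading whitespace and treats a run of whitespace as one separator. The helper names words_single_ws and words_awk_mode and the sample line are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.

# Splits on each individual whitespace character (as in the snippet above).
sub words_single_ws { my ($line) = @_; return split /\s/, $line; }

# Splits on runs of whitespace and ignores leading whitespace.
sub words_awk_mode { my ($line) = @_; return split ' ', $line; }

# Assumed sample line with leading and doubled spaces.
my $line = "  I went  there!\n";

my @a = words_single_ws($line);
my @b = words_awk_mode($line);

# The first count includes empty strings; the second counts only real words.
printf "split /\\s/ : %d fields\n", scalar @a;
printf "split ' '  : %d fields\n", scalar @b;
```

If the input never has leading or doubled whitespace, both forms give the same words, so this only matters for less tidy files.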

    ORIGINAL POST:

    Hi lakssreedhar, I didn't change anything, but I felt the need to reformat your code for better readability:

    @trigrams = ();
    while (<>) {
        @words = split /\s/, $_;
        for ( $i = 0 ; $i < $#words - 1 ; $i++ ) {
            $trigram = $words[$i] . $words[ $i + 1 ] . $words[ $i + 2 ];
            $found   = -1;
            if (@trigrams) {
              SEARCHtrigramINDEX:
                for ( $index = 0 ; $index <= $#trigrams ; $index++ ) {
                    if ( $trigrams[$index] eq $trigram ) {
                        $found = $index;
                        last SEARCHtrigramINDEX;
                    }
                }
            }
            if ( $found > -1 ) {
                $trigramfrequency[$found]++;
            }
            else {
                push @trigrams, $trigram;
                $trigramfrequency[$#trigrams]++;
            }
        }
    }
    print "trigram frequencies in your text:\n";
    for ( $index = 0 ; $index <= @trigrams ; $index++ ) {
        print "$trigrams[$index] $trigramfrequency[$index]\n";
    }
      Thanks, I got it.

      @frozenwithjoy The trigrams come out correctly with the code above, but the frequency counts of the trigrams differ.

        Can you show some examples of how they are differing?
Re: count trigrams of a whole file
by Anonymous Monk on Dec 20, 2012 at 10:16 UTC
    A couple of improvements you can make:
    • On each line, include the last two elements from the previous line in your array (if they were defined). That way you handle the overlapping cases without needing to read the whole file into memory.
    • A hash is a perfect tool for keeping track of the counts. Then you can do away with your loop to search the array.
    • In general, it is bad style to use the C-style three-part for loop in Perl. There is almost always a better option: foreach (@array), for my $i (0..$#array), etc.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %trigrams;
    my @words;
    while (<DATA>) {
        # Prepend the last two words of the previous line to this line's words.
        @words = ( $words[-2] // (), $words[-1] // (), split( /\s/, $_ ) );
        $trigrams{"@words[$_..$_+2]"}++ for ( 0 .. $#words - 2 );
    }

    print "trigram frequencies in your text:\n";
    # Sort the trigrams in descending order of frequency.
    for ( sort { $trigrams{$b} <=> $trigrams{$a} } keys %trigrams ) {
        print "$_: $trigrams{$_}\n";
    }

    __DATA__
    I went there! Me
    She also went there.
    Did you know that I went there!
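The hash key in the code above relies on two Perl features that are easy to miss: an array slice interpolated in double quotes is joined with $" (a single space by default), and a range such as 0 .. $#words - 2 is simply empty when there are fewer than three words. A small sketch of just that step, using a hypothetical helper named trigrams_of that is not part of the thread:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: returns the trigrams of a word list
# as space-separated strings, in order.
sub trigrams_of {
    my @words = @_;
    my @out;
    # The range is empty when there are fewer than three words.
    for my $i ( 0 .. $#words - 2 ) {
        # An array slice in double quotes is joined with $" (a space by default).
        push @out, "@words[$i .. $i + 2]";
    }
    return @out;
}

# Prints one trigram per line for the four sample words.
print "$_\n" for trigrams_of(qw(I went there! Me));
```

Keeping the space between words also keeps the keys unambiguous, which plain concatenation of the three words does not guarantee.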

      I need the words and their counts printed in the order in which the words appear in the file.

        Thanks, I got it.
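Printing in file order needs a little extra bookkeeping, because a Perl hash does not remember insertion order. A minimal sketch, assuming "order" means order of first appearance: keep the counts in a hash and push each trigram onto an array the first time it is seen. The name trigrams_in_order and the sample lines are made up for illustration; this is not code from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: takes a list of lines and returns
# [trigram, count] pairs in order of first appearance.
sub trigrams_in_order {
    my @lines = @_;
    my ( %count, @order, @words );
    for my $line (@lines) {
        # Carry the last two words over so trigrams can span line breaks.
        my @carry = @words >= 2 ? @words[ -2, -1 ] : @words;
        @words = ( @carry, split ' ', $line );
        for my $i ( 0 .. $#words - 2 ) {
            my $trigram = "@words[$i .. $i + 2]";
            # Record the trigram only the first time it is seen.
            push @order, $trigram unless $count{$trigram}++;
        }
    }
    return map { [ $_, $count{$_} ] } @order;
}

# Assumed sample input, one string per line of the file.
my @lines = (
    "I went there! Me\n",
    "She also went there.\n",
    "Did you know that I went there!\n",
);
printf "%s: %d\n", @$_ for trigrams_in_order(@lines);
```

Because each window of three words contains at most the two carried-over words, every window includes at least one new word, so no trigram is counted twice.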

Front-paged by Corion