Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

need help in extending the code

by sarvan (Sexton)
on Jul 27, 2011 at 18:21 UTC ( #917093=perlquestion: print w/ replies, xml ) Need Help??
sarvan has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have the following script for computing similarity between two sentences.
#!/usr/bin/perl -w use strict; my $candidate = "it is not probable that it is the end"; my $reference = "but it is unlikely that this is the end"; my @candidate_words = split (/\W+/, $candidate); my $candidate_count=@candidate_words; my %cand_histogram; my %ref_histogram; my $valHolder=""; our $res=0; $cand_histogram{$_}++ foreach (split(/\s+/,$candidate)); $ref_histogram{$_}++ foreach (split(/\s+/,$reference)); my %seen; printf "%-6s %-10s %-10s %10s\n", 'Key','Candidate','Reference','MinCo +unt'; foreach my $key ( sort { $seen{$b} <=> $seen{$a} #descending word cnt or $a cmp $b #alphabetic otherwise } grep {!$seen{$_}++} # each key just once, # but count 'em also! (keys %cand_histogram) ) { #$valHolder=$cand_histogram{$key}; $valHolder=0; if($cand_histogram{$key} && $ref_histogram{$key}){ $valHolder=$cand_histogram{$key}; if($cand_histogram{$key} >= $ref_histogram{$key}){ $valHolder=$ref_histogram{$key}; } } $res+=$valHolder; printf "%-6s %-10s %-10s %10s\n", $key, $cand_histogram{$key}||='0', $ref_histogram{$key} ||='0',$valHolder; } my $BS=$res/$candidate_count; print "The calculated bleu Score is:$BS\n";

Here i have computed the similarity score between two sentences by taking unigrams(i.e individual words from sentences).

Now, i want to turn this same script to do the same computation but with n-gram(with n=3 i.e taking three words together through out the computation).

for eg. in unigram computation if i have a sentence "it is not probably that it is the end". then in n-gram words should be in the form

it is not is not probably not probably that that it is it is the is the end

In this way i want to compute the similarity score between the two sentences.. any help please Thanks..

Output:
Key Candidate Reference MinCount end 1 1 1 is 2 2 2 it 2 1 1 not 1 0 0 probable 1 0 0 that 1 1 1 the 1 1 1 The calculated bleu Score is:0.666666666666667

Comment on need help in extending the code
Select or Download Code
Re: need help in extending the code
by Marshall (Prior) on Jul 27, 2011 at 21:06 UTC
    #!/usr/bin/perl -w use strict; my $x= "it is not probably that it is the end"; my @ngram; my @wds = split(' ',$x); while (@wds>=3) { push @ngram, join(" ",@wds[0..2]); shift @wds; } print join("\n",@ngram),"\n"; __END__ it is not is not probably not probably that probably that it that it is it is the is the end
      Hi marshall,

      Thanks for the help. And in the same script which you have provided me few days ago and i also modified and posted yesterday. In that i want the comparison to be case-sensitive which means all words should be compared without case-sensitivesness..

      for eg. If i have the word "finding" in the candidate and "Finding" in reference. The present script dont give the count in reference for that word, since it has capital F.

      So, how can deal with this case-sensitive problem. Any help plz

        I'm not sure what you need. To count the number of occurrences of "finding" in a case insensitive way would require the /i switch for case_insensitive and then the /g switch to keep going globally.

        There is a critter, called the goatse operator, "=()=" that will give the scalar count of the number of matches. You can read about it here: goatse operator. Or just look at the code and play with it..

        !/usr/bin/perl -w use strict; my $x = "Finding fiNding fIndinG finding"; my $count = () = $x =~ /finding/ig; print "$count\n"; #prints 4

        In that i want the comparison to be case-sensitive which means all words should be compared without case-sensitivesness..

        Say that out loud and see if it makes sense to you -- in the same sentence you use "case-sensitive" to mean both case-sensitive and case-insensitive

        So what is your goal?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://917093]
Approved by Tanktalus
Front-paged by Tanktalus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (6)
As of 2014-09-23 23:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (241 votes), past polls