Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Module for comparing text

by Anonymous Monk
on Sep 30, 2003 at 06:03 UTC ( #295187=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am looking for a Perl module that would let me to compare two texts and tell me how similar they are. I have searched on CPAN, but I couldn't find anything. I need to come up with some arbitrary value that would tell how similar two texts are. I was thinking of implementing it myself (splitting two versions into fixed length substrings and then comparing how many of those substrings are the same in both texts), but I am sure there should be a better algorithm and possibly a module for this. The whole purpose of this is to allow for a user to enter an answer on a web form that would allow for a certain level of difference from the answer stored in the database.

Thank you in advance

Replies are listed 'Best First'.
Re: Module for comparing text
by Zaxo (Archbishop) on Sep 30, 2003 at 06:17 UTC

    Look at Algorithm::Diff or, if the strings are short and expected to be similar, Text::Levenshtein, a more sophisticated direct measure of similarity.

    A third way is to compress the two strings, and compress their concatenation. If they are similar, the concatenation will compress to not much more size than one of its parts. I don't know of a module that does that.

    After Compline,

Re: Module for comparing text
by tachyon (Chancellor) on Sep 30, 2003 at 08:24 UTC

    I have been meaning to release this for a while. It does pretty much exactly what you want:

    package Algorithm::HowSimilar; use 5.006; use strict; use warnings; use Algorithm::Diff qw(traverse_sequences); use Carp; require Exporter; use vars qw( @ISA @EXPORT_OK $VERSION ); our @ISA = qw(Exporter); @EXPORT_OK = qw( compare ); $VERSION = '0.01'; sub compare { my $is_array = ref $_[0] eq 'ARRAY' ? 1 : 0; my $i = 0; if ( $is_array ) { my $seq1 = $_[0]; my $seq2 = $_[1]; my (@match,@d1, @d2) = ((),(),()); traverse_sequences( $seq1, $seq2, { MATCH => sub { push @match, $seq1->[$_[0]] }, DISCARD_A => sub { push @d1, $seq1->[$_[0]] }, DISCARD_B => sub { push @d2, $seq2->[$_[1]] }, }); my $m1 = @match/(@match+@d1); my $m2 = @match/(@match+@d2); my $mav = ($m1+$m2)/2; return $mav, $m1, $m2, \@match, \@d1, \@d2; } else { my ( $seq1, $seq2 ); if ( $_[2] and ref $_[2] eq 'CODE' ) { local $_ = $_[0]; $seq1 = &{$_[2]}; local $_ = $_[1]; $seq2 = &{$_[2]}; carp "Did not get an array ref from callback!\n" unless ref $seq1 eq 'ARRAY' and ref $seq2 eq 'ARRAY'; } else { $seq1 = _tokenize($_[0]); $seq2 = _tokenize($_[1]); } my ($match,$d1, $d2) = ('','',''); traverse_sequences( $seq1, $seq2, { MATCH => sub { $match .= $seq1->[$_[0]] }, DISCARD_A => sub { $d1 .= $seq1->[$_[0]] }, DISCARD_B => sub { $d2 .= $seq2->[$_[1]] }, }); my $m1 = length($match)/(length($match)+length($d1)); my $m2 = length($match)/(length($match)+length($d2)); my $mav = ($m1+$m2)/2; return $mav, $m1, $m2, $match, $d1, $d2; } } sub _tokenize { return [split //, $_[0]] } 1; __END__ =head1 NAME Algorithm::HowSimilar - Perl extension for quantifying similarites bet +ween things =head1 SYNOPSIS use Algorithm::HowSimilar qw(compare); @res = compare( $str1, $str2, sub { s/\s+//g; [split //] } ); @res = compare( \@ary1, \@ary2 ); =head1 DESCRIPTION This module leverages Algorithm::Diff to let you compare the degree of + sameness of array or strings. It returns a result set that defines exactly how +similar these things are. =head1 METHODS =head2 compare( ARG1, ARG2, OPTIONAL_CALLBACK ) You can call compare with either two strings compare( $str1, $str2 ): my ( $av_similarity, $sim_str1_to_str2, $sim_str2_to_str1, $matches, $in_str1_but_not_str2, $in_str2_but_not_str1 ) = compare( 'this is a string-a', 'this is a string bbb' ); Note that the mathematical similarities of one string to another will +be different unless the strings have the same length. The first result re +turned is the average similarity. Totally dissimilar strings will return 0. I +dentical strings will return 1. The degree of similarity therefore ranges from +0-1 and is reported as the biggest float your OS/Perl can manage. You can also compare two array refs compare( \@ary1, \@ary2 ): my ( $av_similarity, $sim_ary1_to_ary2, $sim_ary2_to_ary1, $ref_ary_matches, $ref_ary_in_ary1_but_not_ary2, $ref_ary_in_ary2_but_not_ary1 ) = compare( [ 1,2,3,4 ], [ 3,4,5,6,7 ] ); When called with two string you can specify an optional callback that +changes the default tokenization of strings (a simple split on null) to whatev +er you need. The strings are passed to you callback in $_ and the sub is expe +cted to return an array ref. So for example to ignore all whitespace you could: @res = compare( 'this is a string', 'this is a string ', sub { s/\s+//g; [split //] } ); You already get the intersection of the strings or arrays. You can get + the union like this: @res = compare( $str1, $str2 ); $intersection = $res[3]; $union = $res[3].$res[4].$res[5]; @res = compare( \@ary1, \@ary2 ); @intersection = @{$res[3]}; @union = ( @{$res[3]}, @{$res[4]}, @{$res[5]} ); =head2 EXPORT None by default. =head1 AUTHOR Dr James Freeman <> =head1 SEE ALSO L<perl>. =cut




Re: Module for comparing text
by matthewb (Curate) on Sep 30, 2003 at 06:12 UTC
    Not used it myself, but a quick search on CPAN with the keyword `diff' turns up Text::Diff , which looks like the sort of thing you are after.

Re: Module for comparing text
by ehdonhon (Curate) on Sep 30, 2003 at 06:14 UTC
Re: Module for comparing text
by dree (Monsignor) on Sep 30, 2003 at 09:00 UTC
Re: Module for comparing text
by castaway (Parson) on Sep 30, 2003 at 13:53 UTC

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://295187]
Approved by moxliukas
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2021-12-06 09:14 GMT
Find Nodes?
    Voting Booth?
    R or B?

    Results (32 votes). Check out past polls.