Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

String Comparison & Equivalence Challenge

by Polyglot (Friar)
on Mar 14, 2021 at 04:47 UTC ( #11129602=perlquestion: print w/replies, xml ) Need Help??

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

BACKGROUND

Suppose we start with a database of the Bible, which has ~31,102 verses. Some of the verses are similar to each other. Some are textually identical. Suppose we wish to alert readers of one verse to other verses which are very similar or are the same; or perhaps return search results ranked by similarity.

CHALLENGE

1) How would we find and rank the similarity of other verses, such as on a percentage of equivalence/similarity?

2) How would we index the rankings?

EXAMPLE VERSES

(Identical)

2 Kings 19:1 -- And it came to pass, when king Hezekiah heard it, that he rent his clothes, and covered himself with sackcloth, and went into the house of the LORD.

Isaiah 37:1 -- And it came to pass, when king Hezekiah heard it, that he rent his clothes, and covered himself with sackcloth, and went into the house of the LORD.

(Only punctuation differs)

Exodus 20:7 -- Thou shalt not take the name of the LORD thy God in vain; for the LORD will not hold him guiltless that taketh his name in vain.

Deuteronomy 5:11 -- Thou shalt not take the name of the LORD thy God in vain: for the LORD will not hold him guiltless that taketh his name in vain.

(Similar)

Psalm 14:1 -- The fool hath said in his heart, There is no God. They are corrupt, they have done abominable works, there is none that doeth good.

Psalm 53:1 -- The fool hath said in his heart, There is no God. Corrupt are they, and have done abominable iniquity: there is none that doeth good.

(Should rank as similar)

Exodus 20:15 -- Thou shalt not steal.

Deuteronomy 5:19 -- Neither shalt thou steal.

(Less similar)

Genesis 1:1 -- In the beginning God created the heaven and the earth.

John 1:1 -- In the beginning was the Word, and the Word was with God, and the Word was God.


For any given verse there will be a different scale of similarities to other verses. Some verses may be so unique that almost no other verse would approach them in similarity. Other verses may have one or more identical or virtually identical repeats. Finding the nearest equivalent, which may not be very equivalent, might be a challenge. For some verses, perhaps many similar verses could be found, whereas for others the list may be very short--or even none at all, depending on where the line is drawn for threshold of equivalency.

All of this depends on being able, first and foremost, to measure the equivalence of two different strings. Secondarily, how should these rankings be preserved?

As a bonus, finding the most efficient way of indexing the table might be interesting. Does one iterate over each of the 31,102 verses once for every verse?

Enjoy!


RESOURCES

(Claim to have free SQL-format Bibles, with sign-up)

http://biblehub.net

- OR -

(Free CSV download of KJV Bible, easily converted to database [script below])

http://my-bible-study.appspot.com/assets/KJV_fixed.csv

And here's a script to use to push the CSV file into a database:

CREATE BIBLE DATABASE

#!/usr/bin/perl use DBI; use strict; use warnings; # DEFAULT SCRIPT USER our $db_user_name = 'root'; # DEFAULT PASSWORD FOR SCRIPT USER our $db_password = '[mysql_root_password_here]'; our $database = "Bibles"; our $table = "KJV"; our $dbfilename = 'KJV_fixed.csv'; our $DEBUG=0; # Treat all input and output as UTF-8 and set the flags correctly # Because _everyone_ should be using UTF8 these days! binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; binmode STDIN, ":utf8"; ################# #begin &initializeDB; my @source = &readsource($dbfilename); &filltable(@source); exit; #end ################# sub initializeDB { print "Creating database: $database\n"; my $dsn = "DBI:mysql:host=localhost"; my $statement = qq| CREATE DATABASE IF NOT EXISTS $database CHARAC +TER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci; |; &connectdb($statement,$dsn); } #END SUB initializeDB sub filltable { my @source = @_; my ($booknum, $chapternum, $verse, $text, $other) = ('', '', '', '', ' +'); my $line = ''; my $count = 0; my $percent = 0; my $statement = qq|create table if not exists $table (RecordNum SM +ALLINT NOT NULL PRIMARY KEY AUTO_INCREMENT, Book TINYINT, Chapter SMA +LLINT, Verse SMALLINT, Text text) CHARACTER SET utf8mb4 COLLATE utf8m +b4_unicode_ci ENGINE=MyISAM DEFAULT CHARSET=utf8mb4;|; + &connectdb($statement); print "Filling the table with the data...\n\nThis could take awhil +e.\n"; print "Total lines: ".scalar @source."\n"; foreach $line (@source) { $count++; if ($count/3110 == int($count/3110)) { $percent++; print "${percent}0\% completed...\n"; } chomp $line; $line =~ s/'/\\'/g; $line =~ s/(\d+),(\d+),(\d+),"(.*?)"/$1\t$2\t$3\t$4/g; ($booknum, $chapternum, $verse, $text, $other) = split(/\t/, $ +line); $text =~ s/^\s+//; $text =~ s/\s+$//; $statement = qq|INSERT INTO $table (RecordNum, Book, Chapter, +Verse, Text) VALUES (null, '$booknum', '$chapternum', '$verse', '$tex +t'); |; &connectdb($statement); } } #END SUB filltable sub readsource { my $filename=shift @_; my @data = (); print "\nReading source: $filename..."; open SOURCE, "<$filename" or die "Cannot open source file $!\n"; @data = <SOURCE>; close SOURCE; print "Done.\n\n"; return @data; } #END SUB readsource sub connectdb { if ($DEBUG==9) {print "sub connectdb: \n"}; my $statement = shift @_; my $dsn = shift @_ || "DBI:mysql:$database:localhost"; my $dbh = DBI->connect($dsn, $db_user_name, $db_password, { mysql_enable_utf8 => 1 }) or die "Can't connect to the DB: $DBI::errstr\n +"; my $quest = $dbh->prepare($statement, { RaiseError => 1 }) or die +"Cannot prepare statement! $DBI::errstr\n"; if ($DEBUG>0) { $quest->execute() or die qq|CONNECT DATABASE Statement: \n $sta +tement \n \n Error in first database statement! \n $DBI::errstr \n |; } else { $quest->execute() }; } # END SUB connectdb

READ BIBLE DATABASE

#!/usr/bin/perl use DBI; use strict; use warnings; # DEFAULT SCRIPT USER our $db_user_name = 'root'; # DEFAULT PASSWORD FOR SCRIPT USER our $db_password = '[mysql_root_password_here]'; our $database = "Bibles"; our $table = "KJV"; our $DEBUG=0; # Treat all input and output as UTF-8 and set the flags correctly # Because _everyone_ should be using UTF8 these days! binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; binmode STDIN, ":utf8"; ################# #begin my @results = &read_table_from_DB; &processResults(@results); exit; #end ################# sub processResults { my @data = @_; my ($record, $booknum, $chapternum, $versenum, $text) = ('', '', '', ' +', ''); #OPTIONAL FORM OF ITERATION: # JUST KEEP IN MIND THAT ARE FIVE (5) # RECORDS FOR EACH "ROW" while (@data) { $record = shift @data; $booknum = shift @data; $chapternum = shift @data; $versenum = shift @data; $text = shift @data; #CODE TO ANALYZE AND COMPARE AMONG TEXTS #print $record."\n"; #WILL GENERATE CONSIDERABLE TEXT TO SCREE +N } } #END SUB processResults sub read_table_from_DB { my @results = (); my ($record, $booknum, $chapternum, $verse, $text, $other) = ('', '', +'', '', '', ''); my $line = ''; my $count = 0; my $statement = qq|SELECT * FROM $table;|; + @results = &connectdb($statement); } #END SUB read_table_from_DB sub connectdb { if ($DEBUG==9) {print "sub connectdb: \n"}; my $statement = shift @_; my $dsn = shift @_ || "DBI:mysql:$database:localhost"; my @results = (); my $dbh = DBI->connect($dsn, $db_user_name, $db_password, { mysql_enable_utf8 => 1 }) or die "Can't connect to the DB: $DBI::errstr\n +"; my $quest = $dbh->prepare($statement, { RaiseError => 1 }) or die +"Cannot prepare statement! $DBI::errstr\n"; if ($DEBUG>0) { $quest->execute() or die qq|CONNECT DATABASE Statement: \n $sta +tement \n \n Error in first database statement! \n $DBI::errstr \n |; } else { $quest->execute() }; while(my @row = $quest->fetchrow_array()) { foreach my $item (@row) { push @results, $item; } } return @results; } # END SUB connectdb

Blessings,

~Polyglot~

Replies are listed 'Best First'.
Re: String Comparison & Equivalence Challenge
by erix (Prior) on Mar 14, 2021 at 08:51 UTC

    PostgreSQL's trigram-comparing doesn't fare too badly, I think. The trigram functionality is inside module pg_trgm (a contrib module).

    I compared your example sentences and asked pg_trgm what similarity it thought they had. pg_trgm expresses that similarity as a number between 0 and 1 (='very different' to 'virtually the same'):

    prior | strict_word_similarity | initi +al part ----------------------------+------------------------+---------------- +-------------------------- (Identical) | 1 | And it came to +pass, when king Hezekia... (Only punctuation differs) | 1 | Thou shalt not +take the name of the LO... (Similar) | 0.8333333 | The fool hath s +aid in his heart, There... (Should rank as similar) | 0.8 | Thou shalt not +steal. (Less similar) | 0.45614034 | In the beginnin +g God created the heave... (5 rows) -- (column 3 truncated)

    The SQL I used (the db must have pg_trgm installed, which you can do with CREATE EXTENSION pg_trgm;):

    select prior -- , similarity(txt1, txt2) , strict_word_similarity(txt1, txt2) , substring(txt1, 1, 40) -- || chr(10) || txt2 from (values ( '2 Kings 19:1' , 'And it came to pass, when king Hezekiah hea +rd it, that he rent his clothes, and covered himself with sackcloth, +and went into the house of the LORD.', 'Isaiah 37:1' , 'And it came to pass, when king Hezekiah hea +rd it, that he rent his clothes, and covered himself with sackcloth, +and went into the house of the LORD.' , '(Identical)') , ( 'Exodus 20:7', 'Thou shalt not take the name of the LORD thy God + in vain; for the LORD will not hold him guiltless that taketh his na +me in vain.', 'Deuteronomy 5:11', 'Thou shalt not take the name of the LORD th +y God in vain: for the LORD will not hold him guiltless that taketh h +is name in vain.', '(Only punctuation differs)' ) , ( 'Psalm 14:1' , 'The fool hath said in his heart, There is n +o God. They are corrupt, they have done abominable works, there is no +ne that doeth good.', 'Psalm 53:1' , 'The fool hath said in his heart, There is n +o God. Corrupt are they, and have done abominable iniquity: there is +none that doeth good.', '(Similar)' ) , ( 'Exodus 20:15' , 'Thou shalt not steal.', 'Deuteronomy 5:19', 'Neither shalt thou steal.', '(Should rank as similar)' ) , ( 'Genesis 1:1' , 'In the beginning God created the heaven and + the earth.', 'John 1:1' , 'In the beginning was the Word, and the Word + was with God, and the Word was God.', '(Less similar)' ) ) as f(verse1, txt1, verse2, txt2, prior)

      That does appear like an interesting function with relevant output. How would one find a similar function for MariaDB?

      Blessings,

      ~Polyglot~

        Sorry, no idea. I haven't tried. (Most likely it doesn't exist.)

        Loading your file into postgres wasn't hard, I just tried it. YMMV, of course, especially when you don't have postgresql installed yet.

        In case it helps, here is a quick-'n-dirty load into a (postgres!) table, of your file.

        #!/bin/bash file=KJV_fixed.csv schema=public table=kjv t=$schema.$table echo " drop table if exists $t ; create table if not exists $t ( recordnum serial PRIMARY KEY , book int , chapter int , verse int , text text ) " | psql -X < $file perl -ne 'chomp; my @arr = split(/[,]/, $_, 4); print join("\t", @arr), "\n"; ' | psql -c " copy $t(book, chapter, verse, text) from stdin (format csv, header false, delimiter E'\t'); analyze $t; "

        Output from that is:

        DROP TABLE CREATE TABLE Timing is on. ANALYZE Time: 326.648 ms
Re: String Comparison & Equivalence Challenge (tf-idf)
by LanX (Sage) on Mar 14, 2021 at 07:34 UTC
Re: String Comparison & Equivalence Challenge
by bliako (Monsignor) on Mar 14, 2021 at 09:50 UTC

    regarding preserving the similarities, a sparse (2D) matrix can be used. Either from a CPAN module e.g. Math::SparseMatrix and others or simply emulate one using a 2D hash.

    Similarities can be at different levels with different metrics: exact phrase, re-arranged phrase, similar words, similar sentiment. But why select one of these when you can use them all in a multi-dimensional similarity index. Something like this (totally untested):

    use List::Util qw(reduce); # store similarities as a sparse matrix as a 2-level hash my $S = {}; # metric weights, all 1's means not weighted, usually sum-of-weights=1 my $W = {'metric1' => 1, 'metric2' => 1, 'metric3' => 1]; # get a list of similarity values as a hash, keyed on metric names my $sims = similarity($phrase1, $phrase2); # get the most similar to phrase1 my $most = most_similar($phrase1); print "most similar to '$phrase1' is ".$most->{'phrase'}."\n"; # main entry to finding similarity between phrases A and B sub similarity { my ($A, $B) = @_; if( ! exists($S->{$A}) && ! exists($S->{$A}->{$B}) ){ # useless negation to satisfy certain monks' pet peeve $S->{$A}->{$B} = { 'metric-1' => metric1($A,$B), 'metric-2' => metric2($A,$B), 'metric-3' => metric3($A,$B), }; # this is a weighted similarity, it's a rough 1D metric based # on all other metrics. my $weighted = 0; $weighted += $W->{$_} * $S->{$A}->{$B}->{$_} for keys %$W; $S->{$A}->{$B}->{'weighted'} = $weighted; } return $S->{$A}->{$B} } # calculate similary between phrases A and B using metric1 sub metric1 { my ($A,$B) = @_; return ... # a real e.g. 3.5 } sub most_similar { my ($A, $metric_name) = @_; if( ! defined($metric_name) !! ! exists($W->{$metric_name}) ){ $metric_name = 'weighted' } my $w = $S->{$A}; my $max_sim_phrase = List::Util::reduce { $w->{$b}->{$metric_name} > + $w->{$a}->{$metric_name} ? $b : $a } keys %$w; my $max_sim_value = $w->{$max_sim_phrase}->{$metric_name}; return { 'phrase' => $max_sim_phrase, 'value' => $max_sim_value } }

    Edit: P.S. Stemming this ancient form of english can be a challenge as stemming relies on pre-trained models. Using the ancient greek bible text could be even more challenging finding models.

    bw, bliako

      > Using the ancient greek bible text could be even more challenging finding models.

      Only the New Testament is originally in Greek, the old one is in Hebrew and Aramaic AFAIK.

      Please correct me.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        I had no idea but it's reasomable (edit: what you say).

Re: String Comparison & Equivalence Challenge
by tybalt89 (Prior) on Mar 15, 2021 at 03:08 UTC
    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11129602 use warnings; use Algorithm::Diff qw(traverse_sequences); use Tk; use Tk::ROText; my $file = 'KJV_fixed.csv'; # download from 11129602 (see above) my $verses = do { local (@ARGV, $/) = $file; <> }; my $theverse = '19,14,1'; my $thebody = ''; my $font = 'times 12'; my $bestcount = 20; my $mw = MainWindow->new; $mw->geometry( '+0+200' ); $mw->title('String Comparison & Equivalence Challenge'); $mw->Button(-text => 'Exit', -command => sub {$mw->destroy}, )->pack(-side => 'bottom', -fill => 'x'); my $ro = $mw->Scrolled(ROText => -scrollbars => 'osoe', -height => 20, -width => 100, -font =>$font, -spacing3 => 5, -wrap => 'word', )->pack(-side => 'bottom', -fill => 'both', -expand => 1); my $button = $mw->Button(-text => 'Find Closest', -command => \&closes +t, )->pack(-side => 'bottom', -fill => 'x'); my $thetext = $mw->Scrolled(ROText => -scrollbars => 'osoe', -height => 4, -wrap => 'word', -font =>$font, )->pack(-side => 'bottom', -fill => 'x', -expand => 0); $mw->Label(-text => 'Verse', -fg => 'blue', )->pack(-side => 'left', -fill => 'x'); my $entry = $mw->Entry( -textvariable => \$theverse, )->pack(-side => 'left', -fill => 'x'); $entry->focus; $mw->Button(-text => 'Find Verse', -command => \&findverse, )->pack(-side => 'left', -fill => 'x'); $entry->bind('<Return>' => \&findverse); $mw->Entry(-textvariable => \$bestcount, -width => 5, )->pack(-side => 'right'); $mw->Label(-text => 'How Many', -fg => 'blue', )->pack(-side => 'right'); MainLoop; exit; sub findverse { $thetext->delete('1.0' => 'end'); $thebody = ''; if( $verses =~ /^$theverse,(.*)/m ) { $thebody = $1; $thetext->insert(end => $thebody); closest(); } else { $thetext->insert(end => '*** Verse not found ***'); } } sub closest { $ro->delete('1.0' => 'end'); $ro->insert(end => "\nSearching ..." ); $mw->update; my @matches; while( $verses =~ /^((\d+,\d+,\d+),(.*))/gm ) { $2 eq $theverse and next; my $whole = $1; my $text = $3; my @sortfrom = sort my @from = $thebody =~ /\w+/g; my @sortto = sort my @to = $text =~ /\w+/g; my $total = my $good = 0; traverse_sequences( \@from, \@to, my $actions = { MATCH => sub {$total += 10; $good += 10}, DISCARD_A => sub {$total++}, DISCARD_B => sub {$total++}, } ); traverse_sequences( \@sortfrom, \@sortto, $actions ); push @matches, sprintf "%06d %s", $good / ++$total * 1e6, $whole; } $ro->insert(end => ' Sorting ...' ); $mw->update; @matches = reverse sort @matches; $ro->delete('1.0' => 'end'); $ro->insert(end => join "", map s/^(\d\d)\d*/$1%/r . "\n", @matches[0 .. $bestcount - 1] ); }

    It takes about 2.6 seconds to go through the entire file searching for matches on my machine.

      I took several minutes to install Tk.pm so that I could run this. I was really curious what it actually accomplished in just 2.6 seconds--or if, perhaps, it would be much different on my computer. Alas, I have perl 5.12, which is not compatible with the s///r pragma. I tried a couple of things to work around that, but feel like I'm working in the dark here, never having entirely grasped map or what it does. ...at the risk of appearing foolish, here's what I tried for that line that didn't work.

      $ro->insert(end => join "", map { do { (my $s = $_ ) =~ s/^(\d\d)\d* +/$1%/ . "\n"; $s }, @matches[0 .. $bestcount - 1] }); #RESULT: #Useless use of concatenation (.) or string in void context at KJV_Ver +seMatcher_PM_Script.pl line 88. #syntax error at KJV_VerseMatcher_PM_Script.pl line 89, near "})"

      Blessings,

      ~Polyglot~

        $ro->insert(end => join "", map s/^(\d\d)\d*/$1%/r . "\n", @matches[...] );
        $ro->insert(end => join "", map { do { (my $s = $_ ) =~ s/^(\d\d)\d*/$1%/ . "\n"; $s }, @matches[...] });

        Try:
            $ro->insert(end => join "", map { (my $r = $_) =~ s/^(\d\d)\d*/$1%/;  "$r\n"; } @matches[0 .. $bestcount - 1]);

        Update:

        ... the s///r pragma.
        NB: s///r is not a pragma. /r is a modifier (update: available from Perl version 5.14 on) of the s/// operator.


        Give a man a fish:  <%-{-{-{-<

      Thank you very much!

      With Anomalous Monk's help, I was able to run the script. It takes about 15.1 seconds on my computer--and that is checking for the similarities of one verse. But, wow! Tk made it like a GUI! I'd never used Tk before--so that was a new experience. Interesting. I might have to try experimenting more with that at some point when I have time. For now, I just need a simple text-based process, and, let's see, about five and half days to index the entire Bible at that level.

      Now I just need to set up a loop and redirect the output to a file. That shouldn't be too hard, should it? (I managed to do the latter already. Perhaps if I can isolate the logic from the GUI it might make it more efficient.)

      Blessings,

      ~Polyglot~

        I found Perl/Tk to be a steep learning curve for all options of all widgets, but very convenient for everyday use of a few, familiar widgets. A couple of good books, both from O'Reilly:

        • Mastering Perl/Tk, by Stephen Lidie and Nancy Walsh
        • Learning Perl/Tk, by Nancy Walsh


        Give a man a fish:  <%-{-{-{-<

      How does this rank

      They are corrupt

      Vs

      Corrupt are they

      ?

      One or three matches?

      10/3 or 30/3 ?

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: String Comparison & Equivalence Challenge (plagiarism scanner)
by LanX (Sage) on Mar 15, 2021 at 00:39 UTC
    On a - slightly humoristic but serious - side note:

    Your use-case seems to match pretty much the niche called "plagiarism checker"

    (very popular in Germany since some politicians with cheat PhDs had to resign.)

    You might find more solutions there.

    And on another side note:

    A "challenge" is a programing competition where the author already has a solution and wants to entertain the gamers in a competition.

    Please don't call it a challenge if you don't know what to expect and just want your problem to be solved for free. :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Perhaps the lingo here is beyond me, but what kind of "challenge" already has a solution? I thought companies like Micro$oft and others actually challenged people to try to hack their software--for money if holes could be found. Was that not a "challenge" because they didn't already know of the holes' existence? And why would they have offered people a reward for finding what they already knew?

      Oh, well, as I've said before, I'm not a real programmer. Maybe that's why there's been so many down-votes on this one. I had thought maybe some people just didn't like seeing the Bible texts or something, or that I had posted too much code (trying to be helpful).

      Regardless of what one might call this--it seems that most have no good solution, only a few clues and ideas.

      As to the "plagiarism"--there was plenty of that in the Bible. There are places where a whole chapter is the same as another part. Sometimes the original writer is mentioned--often not. "Plagiarism" appears to be an unbiblical concept. According to the Bible, all wisdom is from God, and any person taking credit to himself/herself is the one truly "plagiarizing."

      Blessings,

      ~Polyglot~

        I can't see that you have been downvoted in this thread and I certainly never did.

        But IMHO is "challenge" indeed the wrong lingo here and in the open source world.

        See also https://perlweeklychallenge.org/

        > it seems that most have no good solution, only a few clues and ideas.

        Well YMMV but in my view I've already given you excellent counseling and spend considerable time trying to help you.

        One of the mantras of this site is "not a code writing service!"

        We want you and other - current or future - readers of this thread to learn and make advances in programing Perl. That's why we share for free.

        But you keep repeating that you don't like the math or real programming.

        So probably you might be better served by hiring a "real" programmer?

        Since you already mentioned that M$ is paying for their "challenges", this seems to be your go-to model.

        Good luck! :)

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: String Comparison & Equivalence Challenge
by vincent_veyron (Sexton) on Mar 15, 2021 at 13:50 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11129602]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2021-09-19 02:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?