Beefy Boxes and Bandwidth Generously Provided by pair Networks Bob
Syntactic Confectionary Delight
 
PerlMonks

Longest Common Substring

by rkg (Hermit)
 | Log in | Create a new user | The Monastery Gates | Super Search | 
 | Seekers of Perl Wisdom | Meditations | PerlMonks Discussion | 
 | Obfuscation | Reviews | Cool Uses For Perl | Perl News | Q&A | Tutorials | 
 | Poetry | Recent Threads | Newest Nodes | Donate | What's New | 

on Apr 09, 2003 at 15:13 UTC ( #249239=perlquestion: print w/ replies, xml ) Need Help??
rkg has asked for the wisdom of the Perl Monks concerning the following question:

Hi --

I am looking for a module to solve (or appoximately solve) the longest common substring problem.

I found two modules on CPAN: String-LCSS-0.10 and String-Ediff-0.01.

The first didn't download ("File not found") from CPAN; the second required a C compiler and I don't have one available (Activestate perl). Any have any suggestions?

Thanks

R

Comment on Longest Common Substring
Re: Longest Common Substring
by nothingmuch (Priest) on Apr 09, 2003 at 15:25 UTC
    Algorithm::Diff is probably of use. It's pure perl, and nearly every other module i found relies on it.

    -nuffin
    zz zZ Z Z #!perl
[reply]
[reply]
Re: Longest Common Substring
by glwtta (Hermit) on Apr 09, 2003 at 15:36 UTC
    String::LCSS certainly seems to be what you want. And I don't seem to be having any trouble getting it from CPAN.
[reply]
Re: Longest Common Substring
by BrowserUk (Saint) on Apr 09, 2003 at 16:24 UTC

    Probably not the most efficient implementation (and maybe incomplete), but if you can't get the modules to work, you could wrap this up as a subroutine.

    #! perl -slw use strict; use Data::Dumper; my $needle = 'the lazy cat'; my $haystack = 'the quick brown fox jumps over the lazy dog'; my @matches; for my $start (0..length $needle) { for my $len ($start+1 .. length $needle) { my $substr = substr($needle, $start, $len); push @matches, $haystack =~ m[($substr)]g; } } my ($len, $longest) = 0; length > $len and ($longest, $len) = ($_, length) for @matches; print "The longest common substring between\n", $needle, "\nand\n", $h +aystack, "\nis '$longest'"; __END__ C:\test>249239.pl The longest common substring between the lazy cat and the quick brown fox jumps over the lazy dog is 'the lazy ' C:\test>

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
[reply]
[d/l]
      BrowserUK's solution doesn't quite work I tried the following and did not get back what was expected ( " the quick brown fox "):
      my $needle = 'i pushed the lazy dog into a creek, the quick brown fox +told me to'; my $haystack = 'the quick brown fox jumps over the lazy dog';
      and got back 'the lazy dog' (it works if you reverse the $needle and the $haystack.

      But to his credit it seems to work better than String-LCSS-0.10, and he did mention that it might be incomplete (run it both ways, get the longest string and print that, I believe should give you the most right answer, but my tests might also be incomplete).

      Where which running the following code:

      I would use Algorithm-Diff using this part of the documentation as a starting point.

      Note: The Author of String-LCSS-0.10 has been notified.

      -enlil

[reply]
[d/l]
[select]
Re: Longest Common Substring
by BrowserUk (Saint) on Apr 09, 2003 at 20:13 UTC

    Enlil's right, my code was a bit slovenly. I believe this is both more correct and more efficient, but I can't guarentee it doesn't fail on some edge case or another.

    #! perl -slw use strict; sub lcss (\$\$) { my ($needle, $haystack) = @_; ($needle, $haystack) = ($haystack, $needle) if length $$needle > length $$haystack; my ($longest_c, $longest) = 0; for my $start (0..length $$needle) { for my $len ( reverse $start+1 .. length $$needle) { my $substr = substr($$needle, $start, $len); length $1 > $longest_c and ($longest_c, $longest) = (lengt +h $1, $1) while $$haystack =~ m[($substr)]g; } } return $longest; } my $needle = 'the lazy cat'; my $haystack = 'the quick brown fox jumps over the lazy dog'; print "The longest common substring between\n" , $needle, "\nand\n" , $haystack, "\nis '" , lcss($needle, $haystack) , "'"; $needle = 'i pushed the lazy dog into a creek, the quick brown fox tol +d me to'; $haystack = 'the quick brown fox jumps over the lazy dog'; print "The longest common substring between\n" , $needle, "\nand\n" , $haystack, "\nis '" , lcss($needle, $haystack) , "'"; __END__ C:\test>249239.pl The longest common substring between the lazy cat and the quick brown fox jumps over the lazy dog is 'the lazy ' The longest common substring between i pushed the lazy dog into a creek, the quick brown fox told me to and the quick brown fox jumps over the lazy dog is 'the quick brown fox '

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
[reply]
[d/l]
Re: Longest Common Substring
by PodMaster (Abbot) on Apr 10, 2003 at 00:53 UTC
    I don't know if it was you , but somebody has requested String-Ediff and I have put it up on my repository.

    If it was more than a single function it would've been impossible for me to do since the guy who wrote it used swig, and well, swig on win32 hasn't been working out for me. I ended up writing an XS port (pretty easily I might add), and the source can be found at http://crazyinsomniac.perlmonk.org/perl/files/String-Ediff-0.01.tar.gz.


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

[reply]
LCS != LCSS
by rkg (Hermit) on Apr 11, 2003 at 12:38 UTC
    Hi, all. Thanks for the help. After spending some time with Algorithm::Diff, I've realized that Alg::Diff's LCS(Longest Common SubSequence) is not the same as String-LCSS (Longest Common Substring). Just wanted to share that (perhaps obvious) observation. LCSS does what I need, and is working fine for me...thanks!

    R
[reply]

Back to Seekers of Perl Wisdom


Login:
Password
remember me
What's my password?
Create A New User

Node Status
node history
Node Type: perlquestion [id://249239]
Approved by nothingmuch
Front-paged by enoch
help
Community Ads
Chatterbox
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users
Others romping around the Monastery: (12)
GrandFather
wfsp
atcroft
herveus
Eyck
clinton
NodeReaper
djp
$self
vishi83
gnosti
im2
As of 2009-11-21 09:46 GMT
Sections
The Monastery Gates
Seekers of Perl Wisdom
Meditations
PerlMonks Discussion
Categorized Q&A
Tutorials
Obfuscated Code
Perl Poetry
Cool Uses for Perl
Perl News
Information
PerlMonks FAQ
Guide to the Monastery
What's New at PerlMonks
Voting/Experience System
Tutorials
Reviews
Library
Perl FAQs
Other Info Sources
Find Nodes
Nodes You Wrote
Super Search
List Nodes By Users
Newest Nodes
Recently Active Threads
Selected Best Nodes
Best Nodes
Worst Nodes
Saints in our Book
Leftovers
The St. Larry Wall Shrine
Offering Plate
Awards
Craft
Snippets Section
Code Catacombs
Quests
Editor Requests
Buy PerlMonks Gear
PerlMonks Merchandise
Planet Perl
Perlsphere
Use Perl
Perl.com
Perl 5 Wiki
Perl Jobs
Perl Mongers
Perl Directory
Perl documentation
CPAN
Random Node
Voting Booth

Future historians will find that the material characteristic of the current era is...

Aluminium
Plastic
Oil
Water
Carbon dioxide
Copper
Iron
Silicon
Salt
Uranium
Hydrogen
Other

Results (729 votes), past polls