PerlMonks  

Re^4: Fast common substring matching

by bioMan (Beadle)
on Nov 29, 2005 at 16:53 UTC (#512704)


in reply to Re^3: Fast common substring matching
in thread Fast common substring matching

Roy

There is one difference between your algorithm and GrandFather's. His code returns the longest substring for each pair of input strings.

With my original data set your code returns one substring. GrandFather's code returned over three thousand (where $minmatch = 256). On the other hand, your code finds every occurrence of the longest common substrings when they all have the same length, which I like.

Mike


Re^5: Fast common substring matching
by Roy Johnson (Monsignor) on Nov 29, 2005 at 17:08 UTC
    Yes, after I came up with my algorithm, I realized what all the output from GrandFather's code meant. I had thought it was just some sort of cryptic progress meter. :-)

    The (reasonably) obvious way to get the longest substring for each pair of input strings would be to run my algorithm using each pair of strings as input rather than the whole list of strings. That's probably more work than GF's method, though. I thought about trying it, but something shiny caught my attention...

    Update: but now I've done it. It runs on 20 strings of 1000 characters in something under 10 seconds for me. 100 strings of 1000 characters takes about 4 minutes.
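The pairwise approach described above can be sketched roughly as follows. This is only an illustration, not Roy Johnson's or GrandFather's actual code: it swaps in the textbook dynamic-programming search for the longest common substring of two strings, and the names `longest_common_substring`, `pairwise_lcs`, and the `$min_len` parameter are hypothetical.

```perl
use strict;
use warnings;

# Longest common substring of two strings, using the textbook
# dynamic-programming table (an assumption for illustration; this is
# not the algorithm from either poster's code).
# Returns '' when no common substring reaches $min_len characters.
sub longest_common_substring {
    my ($s, $t, $min_len) = @_;
    $min_len = 1 unless defined $min_len;

    # $prev[$j] holds the length of the common run ending at
    # character $i-1 of $s and character $j of $t.
    my @prev = (0) x (length($t) + 1);
    my ($best_len, $best_end) = (0, 0);

    for my $i (1 .. length($s)) {
        my @curr = (0) x (length($t) + 1);
        for my $j (1 .. length($t)) {
            next unless substr($s, $i - 1, 1) eq substr($t, $j - 1, 1);
            $curr[$j] = $prev[$j - 1] + 1;
            if ($curr[$j] > $best_len) {
                ($best_len, $best_end) = ($curr[$j], $i);
            }
        }
        @prev = @curr;
    }

    return $best_len >= $min_len
        ? substr($s, $best_end - $best_len, $best_len)
        : '';
}

# Run the two-string search on every pair from a list of strings.
# Returns a hash reference keyed by "i,j" (indices into the list),
# holding only the pairs whose match is at least $min_len long.
sub pairwise_lcs {
    my ($strings, $min_len) = @_;
    my %result;
    for my $i (0 .. $#$strings - 1) {
        for my $j ($i + 1 .. $#$strings) {
            my $lcs = longest_common_substring(
                $strings->[$i], $strings->[$j], $min_len);
            $result{"$i,$j"} = $lcs if length $lcs;
        }
    }
    return \%result;
}
```

Called as `pairwise_lcs(\@strings, 256)`, this would return one entry per pair that shares a substring of at least 256 characters. The number of pairs grows quadratically (190 pairs for 20 strings, 4950 for 100), which is in line with the jump in run time between the two test sizes.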


    Caution: Contents may have been coded under pressure.

      I had thought it was just some sort of cryptic progress meter. :-)

      LOL - I know what you mean.

      I'm still going over your original code to see how you did what you did -- trying to learn some Perl :-)

      I'll give the new code a try. I also see that the minimum length in your code doesn't have to be a power of 2. This should allow me to analyze a limit boundary that appears to be present in my data. GrandFather's code allowed me to come up with what I feel is a pretty good estimate for the value of the limit, but this should allow a closer examination of it.

      Mike

        Actually, as far as I can remember, my code doesn't require a power of 2 for the minimum size either. It may have been more important in earlier versions than in the current version.

        Somewhere on my todo list is an item to look at Roy's code, but I've not got down to that item on the list yet. :)


        DWIM is Perl's answer to Gödel

        Thanks for the clarification. For some reason I got it in my head that the minimum length of the substring had to be a power of 2. That idea must have come from someone else's algorithm for the longest common substring search.

        Nonetheless, your script has been very useful to me.

        Mike
