Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: Algorithm inspiration required.

by pryrt (Abbot)
on Jun 18, 2018 at 22:08 UTC ( [id://1216899]=note: print w/replies, xml ) Need Help??


in reply to Algorithm inspiration required.

My thought process on this: assume your stream was the lyrics for the Twelve Days of Christmas, on infinite repeat. You come in on "four calling birds" -- at this point, you don't know whether you've just started the Fourth Day, or whether you've nearly finished the Twelfth Day... and I believe (like the others have mentioned) that it doesn't matter whether it's the "true" start, since that's rather arbitrary. (If it does matter, once you've found a huge repeating sequence, you can spawn something else to find the official "start" of that huge sequence). At some point, you know you'll get back to the _same_ "four calling birds" in the next repeat of the song, rather than just one of the nine from the fourth to twelfth day).

Using a hash function that easily allows streamed input (rather than requiring the whole file already: I think most of the SHA and related accept a stream, though I don't remember off the top of my head if most of the perl implementations do), start hashing from wherever you came in as $huge_hash; also, keep a count of the number of bytes hashed in $huge_hash as $huge_bytes. After you've grabbed the first phrase (in this instance, I picked 18 characters), use that as your $trigger. If you find this $trigger again, you _might_ have started over; if you haven't found this trigger, you know you haven't started over yet.

So, in my example, you continue reading the stream: "... on the fifth day... four calling birds".... and you've found the $trigger: set $archive_hash = $huge_hash; $archive_bytes = $huge_bytes; start a $repeat_hash from this instance of $trigger, and count $repeat_bytes, but continue with the $huge_hash and $huge_bytes (just in case you haven't found the biggest length yet). When you get to "... the sixth day ... four calling birds", you've found $trigger again, but your $repeat_hash and $repeat_bytes won't match the $archive_hash and $archive_bytes. Thus, you'll know that you haven't really found a repeating sequence. set $archive_hash = $huge_hash; $archive_bytes = $huge_bytes; reset $repeat_hash and $reset_bytes starting on this instance of $trigger, and continue.

At some point, you might want to decide "there have been too many false triggers, so I want a longer $trigger to avoid false triggers": maybe double the length of the trigger to look for. Unfortunately, we've lost the initial trigger information. Since you know you are always somewhere within the largest repeated sequence, and you don't (yet) care what the official "start" is, just start over from the beginning: reset $huge_hash from here, clear the $archive_hash, and set $trigger to 36 bytes instead of 18 bytes (for example).

Eventually, you'll hit a trigger that really is the start of the next time through, so your hashes and lengths will match.

Another sequence that you might want to experiment with (assume all whitespace equal -- and could be collapsed to no whitespace at all):

1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 123456789 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 123456789 1234567890

Then try starting randomly in that sequence somewhere, and see if the algorithm I presented (or someone else presented) works.

(rereading some of Eily's suggestions, I think maybe this is the same idea... but I'm not sure I fully understood, so maybe mine's a little different. It's presented differently, either way, so maybe will help or will trigger a better thought process in you or someone else.)

edit 1: rephrase first paragraph for clarity; also, fix the spelling of "twelfth"

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1216899]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (2)
As of 2024-04-19 18:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found