Re^3: Finding repeat sequences.

in reply to Re^2: Finding repeat sequences.
in thread Finding repeat sequences.

That is a very substantial improvement and it now beats tobyink's non-regex solution, but cannot touch hdb's:

Partisipants in performance tests: hdb tobyink damianc
b:    4 in s:         6        hdb :: 0.000089 s
b:    4 in s:         6    tobyink :: 0.000336 s
b:    4 in s:         6    damianc :: 0.000081 s
b:    4 in s:        43        hdb :: 0.000070 s
b:    4 in s:        43    tobyink :: 0.000070 s
b:    4 in s:        43    damianc :: 0.000266 s
b:    4 in s:       402        hdb :: 0.000083 s
b:    4 in s:       402    tobyink :: 0.000092 s
b:    4 in s:       402    damianc :: 0.000210 s
b:    4 in s:      4000        hdb :: 0.000391 s
b:    4 in s:      4000    tobyink :: 0.000158 s
b:    4 in s:      4000    damianc :: 0.000719 s
b:   26 in s:        32        hdb :: 0.000075 s
b:   26 in s:        32    tobyink :: 0.000097 s
b:   26 in s:        32    damianc :: 0.000109 s
b:   26 in s:       260        hdb :: 0.000076 s
b:   26 in s:       260    tobyink :: 0.000153 s
b:   26 in s:       260    damianc :: 0.000124 s
b:   26 in s:      2600        hdb :: 0.000111 s
b:   26 in s:      2600    tobyink :: 0.000152 s
b:   26 in s:      2600    damianc :: 0.000182 s
b:   26 in s:     26016        hdb :: 0.000125 s
b:   26 in s:     26016    tobyink :: 0.000947 s
b:   26 in s:     26016    damianc :: 0.000620 s
b:64000 in s:    100595        hdb :: 0.006378 s
b:64000 in s:    100595    tobyink :: 6.614787 s
b:64000 in s:    100595    damianc :: 6.247666 s
b:64000 in s:    657902        hdb :: 0.017817 s
b:64000 in s:    657902    tobyink :: 48.717781 s
b:64000 in s:    657902    damianc :: 12.516777 s
b:64000 in s:   6444916        hdb :: 0.235547 s
b:64000 in s:   6444916    tobyink :: 582.471826 s
b:64000 in s:   6444916    damianc :: 179.466052 s
b:64000 in s:  64031416        hdb :: 2.763495 s
b:64000 in s:  64031416    tobyink :: 6158.077379 s
b:64000 in s:  64031416    damianc :: 2104.112186 s
{
  damianc => {
    4 => { 1 => "8.10623168945313e-005", 10 => "0.000266075134277344",
+ 100 => "0.000210046768188477", 1000 => "0.000719070434570313", all =
+> "0.00127625465393066", },
    26 => { 1 => "0.000108957290649414", 10 => "0.000123977661132813",
+ 100 => "0.000181913375854492", 1000 => "0.000619888305664063", all =
+> "0.00103473663330078", },
    64000 => { 1 => "6.2476658821106", 10 => "12.5167770385742", 100 =
+> "179.466052055359", 1000 => "2104.11218619347", all => "2302.342681
+16951", },
    all => "2302.3449921608",
  },
  hdb => { 
    4 => { 1 => "8.89301300048828e-005", 10 => "7.00950622558594e-005"
+, 100 => "8.29696655273438e-005", 1000 => "0.000391006469726563", all
+ => "0.000633001327514648", },
    26 => { 1 => "7.48634338378906e-005", 10 => "7.60555267333984e-005
+", 100 => "0.000110864639282227", 1000 => "0.000124931335449219", all
+ => "0.000386714935302734", },
    64000 => { 1 => "0.006378173828125", 10 => "0.0178170204162598", 1
+00 => "0.235547065734863", 1000 => "2.76349496841431", all => "3.0232
+3722839355", },
    all => "3.02425694465637",
  },
  tobyink => {
    4 => { 1 => "0.000335931777954102", 10 => "7.00950622558594e-005",
+ 100 => "9.17911529541016e-005", 1000 => "0.000158071517944336", all 
+=> "0.000655889511108398", },
    26 => { 1 => "9.70363616943359e-005", 10 => "0.000153064727783203"
+, 100 => "0.000151872634887695", 1000 => "0.000946998596191406", all 
+=> "0.00134897232055664", },
    64000 => { 1 => "6.61478710174561", 10 => "48.7177810668945", 100 
+=> "582.471826076508", 1000 => "6158.07737898827", all => "6795.88177
+323341", },
    all => "6795.88377809525",
  },
}
[download]

I'm not sure why you think it would be any easier to maintain that the non-regex solutions?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re^3: Finding repeat sequences. Download Code

Replies are listed 'Best First'.
Re^4: Finding repeat sequences. by DamianConway (Beadle) on Jun 20, 2013 at 08:02 UTC
I'm not sure why you think it would be any easier to maintain that the non-regex solutions? hdb's solution is extremely clever and obviously vastly superior in performance. Most impressive. Nevertheless, the reasons I'd prefer to have to maintain the regex solution include: The algorithm `find_substring()` uses is extremely clever, perhaps even a little subtle. I always assume that's likely to be a disadvantage for future maintainability. And, indeed, after ten minutes of close inspection, I'm still not entirely sure I fully understand `find_substring()`. By Kernighan's Metric, if I'm not smart enough to understand it, I'm certainly not half smart enough to maintain it. Infinite loops and manually iterated string indexes always make me nervous. They are opportunities for off-by-one errors and overlooked edge-case behaviours to lurk...or to creep in when the code is subsequently updated. I genuinely prefer functional or declarative styles of programming. The regex solution describes exactly what it does (provided you're fluent in the regex dialect...which I am), whereas `find_substring()`'s implementation not at all self-explanatory (to me). Continuing on with the functional/imperative contrast: the regex solution uses exactly one automatically-preset read-only variable. In contrast `find_substring()` uses a couple of manually-assigned read-write variables. In my view that means the latter has several extra places where future well-intentioned modifications could quietly break something. I think that the regex solution would also be much easier to integrate into a larger parsing system, when the current simple text processing task subsequently grows more complicated (which it inevitably will). I can debug the behaviour of the regex-based solution visually using Regexp::Debugger. I'd have to use `perl -d` to debug `find_substring()`. <shudder> The `find_substring()` implementation somehow reminds me of my years of coding in C, and at this point I really don't need that kind of post-traumatic flashback undoing all the therapy. ;-) Damian	[reply] [d/l] [select]
Re^5: Finding repeat sequences. by hdb (Monsignor) on Jun 20, 2013 at 09:00 UTC
I fully agree with your arguments and admit that I got carried away in my strive to make the algorithm as short as possible. In particular your comment on a small change could break it, is very valid. For something more maintainable, it should be written like this which does exactly the same thing (with the exception of the test in the while loop which is redundant). hdb => sub { my $input = shift; my $length = length $$input; my $i = 0; while( $i < $length ) { # redundant, but to be on + the safe side $i++; # length of next candidat +e my $possible = substr $$input, 0, $i; # next candidate my $j = index $$input, $possible, $i; # try to find it to the r +ight if( $j > 0 ) { # if found, skip ahead $possible = substr $$input, 0, $j; # the candidate has to be + this long at least $i = $j; # store its length } # check if we have a solution already if( substr( $$input, $i ) eq substr($$input, 0, $length - $i) ) +{ return $possible; # success ! } } # will never reach here return ''; }, [download] And an explanation why the skip ahead is allowed should probably also be added...	[reply] [d/l]
Re^6: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 21, 2013 at 16:26 UTC
Thought you might like to see my refactoring of your algorithm. Same performance, but somewhat clearer implementation: `hdbm => sub { my $r = shift; my $i = 1; until( substr( $$r, $i ) eq substr( $$r, 0, -$i ) ) { my $j = index( $$r, substr( $$r, 0, $i ), $i+1 ); $i = $j > 0 ? $j : ++$i; } return substr( $$r, 0, $i ) },` [download] Specifically, it avoids the unfounded "infinite loop" charge without needing to introduce a redundant test. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^7: Finding repeat sequences. by hdb (Monsignor) on Jun 21, 2013 at 17:21 UTC
Re^8: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 21, 2013 at 17:36 UTC
Some notes below your chosen depth have not been shown here
Re^5: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 20, 2013 at 15:11 UTC
You said: "I know which one I'd* rather maintain,"; I asked: "I'm not sure why you think it would be any easier to maintain that the non-regex solutions?"; and you've explained in terms of: your* preferences, assumptions, understanding, fears, preferences, views, thoughts, skill-set & bad memories. There is no way to argue with any of that. I cannot tell you what you should prefer, feel, think. etc. The only vaguely arguable points are: "... or to creep in when the code is subsequently updated." Why would the code be updated. "... future well-intentioned modifications could quietly break something." Why would anyone modify it, when it performs the required task as is. "... when the current simple text processing task subsequently grows more complicated (which it inevitably will)." It isn't a "text processing task" per se. If you look at the source data for the third test of my performance tests, you'll see that the source data is actually very large bitstrings. The task is to reduce storage requirements by run-length encoding leading repeating sequences. This will not change. "... I'd have to use perl -d to debug find_substring(). <shudder>" I added a couple of print statements and it told me everything I needed to know: `hdb => sub { my $input = shift; my $length = length $$input; my $i = 0; my $possible; my $j; while( 1 ) { $possible = substr $$input, 0, ++$i; print "i: $i : l:", length $possible; $possible = substr $$input, 0, $i=$j if ($j = index $$input +, $possible, $i) > 0 ; print "i:$i j:$j : l:", length $possible; return $possible if substr( $$input, $i ) eq substr($$input +, 0, $length - $i); } },` [download] I actually print the value of `$possible` rather than the length, but that would be inappropriate to post. The output from the last, longest test data shows why it is so fast: <Reveal this spoiler or all in this thread> With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^6: Finding repeat sequences. by DamianConway (Beadle) on Jun 20, 2013 at 22:56 UTC
Why would the code be updated. I don't know...yet. That's more-or-less the definition of maintenance: making changes when the code needs to be updated for some currently unpredictable reason. A few obvious possibilities: The data to be analysed alters in structure, which requires a new algorithm to detect. You decide that you want even greater compression by more detailed (possibly recursive) detection of repetition patterns. You encounter another very-similar-but-not-identical task, and decide to refactor the subroutine to handle both, rather than copy-and-pasting. You find another improvement on the current algorithm, which requires a modification of the existing code. You discover a bug or limitation in the current algorithm that requires a modification of the existing code. P5P discover a bug or limitation in `index()` (perhaps to do with Unicode edge-cases) and rewrite it correctly, which makes it vastly slower. P5P deprecates and prepares to remove `index()` (presumably because it's just too much like smartmatching ;-) Why would anyone modify it, when it performs the required task as is. Again, I don't know...yet. A couple of obvious possibilities: Because the required task may change or evolve. Because the code may eventually be found not to perform the task correctly as is. Because the insatiable need for speed, on ever-growing string lengths, may force you to improve the existing algorithm. I added a couple of print statements... Yes, I use that approach too, in preference to the horrors of the Perl debugger. But that's still not nearly as convenient, quick, explanatory, or powerful as Regexp::Debugger is on regexes. And look...you're now modifying the code! That itself is a potential source of bugs. You may never have accidentally left a "debugging `print`" in code, but I certainly have. I've even shipped code with them left in. Which eventually leads to bug-reports and more maintenance. :-( As you point out, I am only reporting my own preferences for code maintenance. But those preference aren't merely whims; they're based on long experience of my own abilities and limitations as a developer, and my own psychology and habits as a coder. And I honesty believe that many far-better developers and coders than me would also benefit from endeavouring to write code that is less subtle, more declarative, has fewer side-effects (i.e. variables), doesn't manually iterate through infinite loops, and is easier to debug without actually having to modify the code to do so. Sure, sometimes you need subtlety. And sometimes you have to write "manual" code. The unbeatable performance of hdb's excellent solution demonstrates that perfectly. But I will continue to believe that such code should be the exception, not the paradigm. Damian	[reply] [d/l] [select]
Re^7: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 21, 2013 at 00:20 UTC
Re^8: Finding repeat sequences. by DamianConway (Beadle) on Jun 21, 2013 at 02:03 UTC
Some notes below your chosen depth have not been shown here

In Section Seekers of Perl Wisdom