Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^5: Finding repeat sequences.

by BrowserUk (Pope)
on Jun 20, 2013 at 15:11 UTC ( #1039988=note: print w/ replies, xml ) Need Help??


in reply to Re^4: Finding repeat sequences.
in thread Finding repeat sequences.

You said: "I know which one I'd rather maintain,"; I asked: "I'm not sure why you think it would be any easier to maintain that the non-regex solutions?"; and you've explained in terms of: your preferences, assumptions, understanding, fears, preferences, views, thoughts, skill-set & bad memories.

There is no way to argue with any of that. I cannot tell you what you should prefer, feel, think. etc.

The only vaguely arguable points are:

  • "... or to creep in when the code is subsequently updated."

    Why would the code be updated.

  • "... future well-intentioned modifications could quietly break something."

    Why would anyone modify it, when it performs the required task as is.

  • "... when the current simple text processing task subsequently grows more complicated (which it inevitably will)."

    It isn't a "text processing task" per se. If you look at the source data for the third test of my performance tests, you'll see that the source data is actually very large bitstrings. The task is to reduce storage requirements by run-length encoding leading repeating sequences. This will not change.

  • "... I'd have to use perl -d to debug find_substring(). <shudder>"

    I added a couple of print statements and it told me everything I needed to know:

    hdb => sub { my $input = shift; my $length = length $$input; my $i = 0; my $possible; my $j; while( 1 ) { $possible = substr $$input, 0, ++$i; print "i: $i : l:", length $possible; $possible = substr $$input, 0, $i=$j if ($j = index $$input +, $possible, $i) > 0 ; print "i:$i j:$j : l:", length $possible; return $possible if substr( $$input, $i ) eq substr($$input +, 0, $length - $i); } },

    I actually print the value of $possible rather than the length, but that would be inappropriate to post.

    The output from the last, longest test data shows why it is so fast:

    i: 1 : l:1 i:1 j:1 : l:1 i: 2 : l:2 i:2 j:2 : l:2 i: 3 : l:3 i:3 j:3 : l:3 i: 4 : l:4 i:4 j:4 : l:4 i: 5 : l:5 i:5 j:5 : l:5 i: 6 : l:6 i:6 j:6 : l:6 i: 7 : l:7 i:7 j:7 : l:7 i: 8 : l:8 i:8 j:8 : l:8 i: 9 : l:9 i:9 j:9 : l:9 i: 10 : l:10 i:10 j:10 : l:10 i: 11 : l:11 i:11 j:11 : l:11 i: 12 : l:12 i:12 j:12 : l:12 i: 13 : l:13 i:13 j:13 : l:13 i: 14 : l:14 i:14 j:14 : l:14 i: 15 : l:15 i:15 j:15 : l:15 i: 16 : l:16 i:16 j:16 : l:16 i: 17 : l:17 i:17 j:17 : l:17 i: 18 : l:18 i:18 j:18 : l:18 i: 19 : l:19 i:19 j:19 : l:19 i: 20 : l:20 i:20 j:20 : l:20 i: 21 : l:21 i:21 j:21 : l:21 i: 22 : l:22 i:22 j:22 : l:22 i: 23 : l:23 i:23 j:23 : l:23 i: 24 : l:24 i:24 j:24 : l:24 i: 25 : l:25 i:25 j:25 : l:25 i: 26 : l:26 i:26 j:26 : l:26 i: 27 : l:27 i:27 j:27 : l:27 i: 28 : l:28 i:28 j:28 : l:28 i: 29 : l:29 i:29 j:29 : l:29 i: 30 : l:30 i:30 j:30 : l:30 i: 31 : l:31 i:31 j:31 : l:31 i: 32 : l:32 i:32 j:32 : l:32 i: 33 : l:33 i:65 j:65 : l:65 i: 66 : l:66 i:194 j:194 : l:194 i: 195 : l:195 i:64000 j:64000 : l:64000 b:64000 in s: 64020675 hdb :: 2.743027 s

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re^5: Finding repeat sequences.
Select or Download Code
Re^6: Finding repeat sequences.
by DamianConway (Beadle) on Jun 20, 2013 at 22:56 UTC

    Why would the code be updated.

    I don't know...yet. That's more-or-less the definition of maintenance: making changes when the code needs to be updated for some currently unpredictable reason. A few obvious possibilities:

    • The data to be analysed alters in structure, which requires a new algorithm to detect.
    • You decide that you want even greater compression by more detailed (possibly recursive) detection of repetition patterns.
    • You encounter another very-similar-but-not-identical task, and decide to refactor the subroutine to handle both, rather than copy-and-pasting.
    • You find another improvement on the current algorithm, which requires a modification of the existing code.
    • You discover a bug or limitation in the current algorithm that requires a modification of the existing code.
    • P5P discover a bug or limitation in index() (perhaps to do with Unicode edge-cases) and rewrite it correctly, which makes it vastly slower.
    • P5P deprecates and prepares to remove index() (presumably because it's just too much like smartmatching ;-)

    Why would anyone modify it, when it performs the required task as is.

    Again, I don't know...yet. A couple of obvious possibilities:

    • Because the required task may change or evolve.
    • Because the code may eventually be found not to perform the task correctly as is.
    • Because the insatiable need for speed, on ever-growing string lengths, may force you to improve the existing algorithm.

    I added a couple of print statements...

    Yes, I use that approach too, in preference to the horrors of the Perl debugger. But that's still not nearly as convenient, quick, explanatory, or powerful as Regexp::Debugger is on regexes.

    And look...you're now modifying the code! That itself is a potential source of bugs. You may never have accidentally left a "debugging print" in code, but I certainly have. I've even shipped code with them left in. Which eventually leads to bug-reports and more maintenance. :-(

    As you point out, I am only reporting my own preferences for code maintenance. But those preference aren't merely whims; they're based on long experience of my own abilities and limitations as a developer, and my own psychology and habits as a coder. And I honesty believe that many far-better developers and coders than me would also benefit from endeavouring to write code that is less subtle, more declarative, has fewer side-effects (i.e. variables), doesn't manually iterate through infinite loops, and is easier to debug without actually having to modify the code to do so.

    Sure, sometimes you need subtlety. And sometimes you have to write "manual" code. The unbeatable performance of hdb's excellent solution demonstrates that perfectly. But I will continue to believe that such code should be the exception, not the paradigm.

    Damian
      Why would the code be updated. -- I don't know ...

      Why would anyone modify it, when it performs the required task as is. -- Again, I don't know.

      So you try to predict the future; ie, guess. I don't.

      If at some point in the future the code needs modification; I'll adapt it. Then.

      If at some point after that, the code requires modification again, or I encounter another task that I realise it can be adapted to, Then I'll consider trying to generalise it. But right now, it has one, and only one, very specific purpose.

      And I'll willingly and knowingly trade the near 3 orders of magnitude performance gain for that task now, against any potential savings against potential future maintenance costs.

      I'm very firmly of the opinion -- based on my years of experience -- that premature generalisation has cost this industry far more, in both financial and in terms of its reputation for spending a fortune developing huge, all encompassing, singing & dancing solutions that never work, and quietly or otherwise, just end up in the bit bucket; than premature optimisation ever has or ever will.

      And look...you're now modifying the code! That itself is a potential source of bugs. You may never have accidentally left a "debugging print" in code, but I certainly have. I've even shipped code with them left in.

      Hand on heart, no, I never have.

      But then, I don't use test harnesses that steal my output and summarises it to a bunch of meaningless statistics.

      Equally, nor do I do my explorations on my 'live' code. (Ie. The function in the actual application is very unlikely to be an anonymous subroutine value in a hash, to a key called hdb. Nor is it likely to be called find_substring().

      In fact, it is quite likely to not look much like hdb's implementation at all. Now I've found and understood the algorithm, I'll almost certainly re-write it to better fit with the nature of application.

      Eg. I probably pass in a reference to the bitstring, convert it to the bytestring internally, and return a packed tuple that encapsulates the compressed bitvector as (say):

      return pack 'L L Q*', $reps, $bits, substr( $$bitvector, 0, int( ( $_ + + 63) / 64 ) * 8 );

      This thread is all about algorithm, not implementation. (Which still leaves me wondering if hdb's algorithm couldn't be encapsulated into a regex?)

      Sure, sometimes you need subtlety. And sometimes you have to write "manual" code.... But I will continue to believe that such code should be the exception, not the paradigm.

      I completely agree; but were this paradigmatic problem, I probably wouldn't have needed to ask for help.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        So you try to predict the future; ie, guess.

        No. I simply try to develop coding habits that make the inherently unpredictable nature of the future less error-prone and less fraught. Basing those habits on repeated patterns of behaviour and outcome I have identified from observing the past.

        I'm very firmly of the opinion -- based on my years of experience -- that premature generalisation has cost this industry far more, in both financial and in terms of its reputation for spending a fortune developing huge, all encompassing, singing & dancing solutions that never work, and quietly or otherwise, just end up in the bit bucket; than premature optimisation ever has or ever will.

        I'm sure you're right. But you seem to be berating me for something I neither suggested nor advocate.

        It didn't prefer the regex solution because it was already more generalized; I preferred it because it was more descriptive (to me), less error-prone (for me), and easier to rework or enhance (by me) should that eventually become necessary.

        And I definitely wasn't suggesting premature optimization. After all, I'm not the one who would:

        ...willingly and knowingly trade the near 3 orders of magnitude performance gain for that task now, against any potential savings against potential future maintenance costs.

        ;-)

        I was merely saying that I believe that code maintainability is generally more important than code performance. Which is why I still prefer the regex-based solution, even through it's three orders of magnitude slower.

        I doubt we're ever going to agree on this...which is fine. But I'm certainly not going to apologize for making maintainability my own higher priority, nor for advocating it as a priority for most developers.

        Damian

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1039988]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (12)
As of 2014-09-02 12:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (22 votes), past polls