PerlMonks  

Dear monks,

as a fan of quality assurance modules like Perl::Critic, I am meditating on a plugin that detects cut-and-paste fragments in Perl code and suggests isolating each duplicate in a separate subroutine/method. That could be handy when I suddenly have to maintain a large chunk of foreign code. AFAIK Perl::Critic has no such plugin yet.

What would be the best strategy and infrastructure for finding code duplicates?

  • On the source code level? Then only verbatim copies (ignoring whitespace) would qualify, but a moderately complex recognizer would suffice.
  • Or on the op tree level? Then non-verbatim copies sharing the same code structure would also qualify. Differently named variables and syntactically equivalent code fragments would not defeat the duplicate detection. (Reminder: I need to check what can be done within Perl::Critic and what needs B::* modules.)
  • There should probably be a minimum length (?) as well as a minimum frequency (2) threshold for a duplicate to be considered.
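To make the source-level option concrete, here is a minimal sketch of the normalization step, assuming a line-by-line text filter is good enough. The name `normalize_line` and the `rename_vars` option are hypothetical, not part of Perl::Critic or any B::* module, and the variable renaming is only a crude regex stand-in for real structural (op tree) matching.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl::Critic): reduce one line of Perl
# source to a canonical form so that copies differing only in whitespace or
# comments compare equal. This is a naive text filter, not a parser: a '#'
# inside a string or regex is misread as a comment.
sub normalize_line {
    my ($line, %opt) = @_;
    $line =~ s/^\s*#.*$//;     # drop full-line comments
    $line =~ s/\s+#.*$//;      # drop trailing comments (naive)
    $line =~ s/\s+//g;         # ignore all whitespace
    # Optional, crude stand-in for structural matching: give every variable
    # the same placeholder name so that renamed copies still compare equal.
    $line =~ s/([\$\@%])\w+/${1}V/g if $opt{rename_vars};
    return $line;
}
```

With `rename_vars => 1`, lines such as `my $sum = 1;` and `my $total=1;` reduce to the same string. A real recognizer would more likely work on parsed statements than on raw lines.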

    How to detect the duplicates? The plugin should be able to recognize duplicates at 'long' distances, or even better, across files/modules. I am thinking of attaching an MD4 checksum to clusters of statements for fast recognition (hashing clusters with MD4 keys).
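The checksum idea might look roughly like the sketch below. It makes several assumptions: it uses the core module Digest::MD5 in place of MD4 (Digest::MD4 would need a CPAN install), it treats a fixed window of consecutive non-blank lines as the "cluster", and the function name `find_duplicate_windows` and its parameters are made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: index every window of $window consecutive non-blank
# lines by its digest, across all given files. Returns a hash reference
# { digest => [ [file, first_line_number], ... ] } containing only digests
# seen at least $min_freq times.
sub find_duplicate_windows {
    my ($sources, $window, $min_freq) = @_;   # $sources: { filename => text }
    my %seen;
    for my $file (sort keys %$sources) {
        my @lines;
        my $n = 0;
        for my $text (split /\n/, $sources->{$file}) {
            $n++;
            (my $norm = $text) =~ s/\s+//g;   # ignore whitespace differences
            push @lines, [$n, $norm] if length $norm;
        }
        for my $i (0 .. $#lines - $window + 1) {
            my @chunk = @lines[$i .. $i + $window - 1];
            my $key = md5_hex(join "\n", map { $_->[1] } @chunk);
            push @{ $seen{$key} }, [$file, $chunk[0][0]];
        }
    }
    my %dups;
    for my $key (keys %seen) {
        $dups{$key} = $seen{$key} if @{ $seen{$key} } >= $min_freq;
    }
    return \%dups;
}
```

Because the index is keyed by digest only, the distance between two copies, or whether they live in the same file, makes no difference. Equal digests should still be confirmed by comparing the text itself before reporting a duplicate.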

    Setting the right cluster size might be tricky. Natural places to make the cuts would have minimal 'statefulness', e.g. just before a scope is opened or just after it is closed. The duplicate's badness could be determined by its frequency and code length.
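One rough way to cut at scope boundaries is to track brace depth and close a cluster whenever the depth returns to zero. The sketch below assumes plain brace counting is acceptable, which it often is not (braces in strings, regexes, heredocs or comments are counted too). Both function names are hypothetical, and the scoring rule is an assumption of mine: the number of lines saved if the duplicate were extracted once.

```perl
use strict;
use warnings;

# Hypothetical helper: split source lines into clusters that end wherever the
# brace depth returns to zero. Naive: every '{' and '}' is counted, including
# those inside strings, regexes and comments.
sub split_at_scope_boundaries {
    my (@lines) = @_;
    my (@clusters, @current);
    my $depth = 0;
    for my $line (@lines) {
        push @current, $line;
        # tr/// with an empty replacement list only counts the characters
        $depth += ($line =~ tr/{//) - ($line =~ tr/}//);
        if ($depth <= 0) {
            push @clusters, [@current];
            @current = ();
            $depth   = 0;
        }
    }
    push @clusters, [@current] if @current;   # unbalanced tail, if any
    return @clusters;
}

# Assumed scoring rule (not from the post above): lines saved by replacing
# all copies with calls to one extracted subroutine.
sub badness {
    my ($frequency, $length) = @_;
    return ($frequency - 1) * $length;
}
```

With this rule a whole `if` or `sub` block becomes one cluster, and each top-level statement becomes its own. A parser such as PPI, which Perl::Critic is built on, would give far more reliable statement and block boundaries.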

    The problem then is to find all duplicate-subsets in a set. That sounds like autocorrelation or a diff of the whole with parts of itself. Any suggestions for good algorithms or even existing modules?
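The "diff of the whole with parts of itself" can be sketched as a longest-repeated-run search over a list of (normalized) lines, using the usual dynamic programming table for longest common substring applied to the sequence against itself. This is only an illustration under assumptions: `longest_repeated_run` is a hypothetical name, the two copies are required not to overlap, and the quadratic running time would be too slow for a large code base, where a digest index or a suffix array would probably scale better.

```perl
use strict;
use warnings;

# Hypothetical helper: find the longest run of consecutive lines that occurs
# twice, without overlap, in @$lines. Returns (length, first_start,
# second_start) with 0-based indices, or (0) if no line repeats.
sub longest_repeated_run {
    my ($lines) = @_;
    my $n = scalar @$lines;
    my ($best, $at_i, $at_j) = (0, 0, 0);
    my @prev = (0) x ($n + 1);             # DP row for the previous $i
    for my $i (1 .. $n) {
        my @cur = (0) x ($n + 1);
        for my $j ($i + 1 .. $n) {
            next unless $lines->[$i - 1] eq $lines->[$j - 1];
            my $len = $prev[$j - 1] + 1;   # extend the run ending one line earlier
            $len = $j - $i if $len > $j - $i;   # keep the copies from overlapping
            $cur[$j] = $len;
            ($best, $at_i, $at_j) = ($len, $i - $len, $j - $len)
                if $len > $best;
        }
        @prev = @cur;
    }
    return $best ? ($best, $at_i, $at_j) : (0);
}
```

To list all duplicates rather than just the longest, one could report every table cell whose run length reaches the minimum-length threshold and cannot be extended further.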

    I welcome your feedback, thanks.


    In reply to how to get rid of cut-and-paste sins? by hexcoder

    As of 2014-07-12 23:21 GMT