As a fan of quality assurance modules like Perl::Critic, I am
meditating on a plugin that
detects cut-and-paste fragments in Perl code and suggests
isolating each duplicate in a separate subroutine/method.
That could be handy when I suddenly have to maintain a large chunk of foreign code. AFAIK Perl::Critic has no such plugin yet.
What would be the best strategy and infrastructure for finding code duplicates?
On the source code level?
Then only verbatim copies (ignoring whitespace) would qualify, but a moderately complex recognizer would be enough.
Or on the op tree level?
Then non-verbatim copies sharing the same code structure would also qualify. Differently named variables and syntactically equivalent code fragments would not defeat the duplicate detection. (Reminder: I need to check what can be done within Perl::Critic and what needs B::* modules.)
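To make the difference between the two levels concrete, here is a minimal sketch. It is written in Python only for brevity and works on raw text with a naive regex; the function names are made up for illustration, and a real plugin would presumably walk the PPI tokens that Perl::Critic already provides (or the op tree via B::*) rather than pattern-match source lines.

```python
import re

# Naive pattern for Perl variables such as $foo, @bar, %baz.
# Assumption: good enough for a demo; it ignores strings, heredocs, etc.
VARIABLE = re.compile(r"[$@%][A-Za-z_]\w*")

def normalize_verbatim(line):
    """Source-level view: ignore whitespace, keep every name as written."""
    return re.sub(r"\s+", "", line)

def normalize_structural(line):
    """Structure-level view: also replace each variable name by its sigil + 'V'."""
    without_names = VARIABLE.sub(lambda m: m.group(0)[0] + "V", line)
    return re.sub(r"\s+", "", without_names)

a = "my $total = $price * $count;"
b = "my  $sum=$cost*$n ;"            # same structure, other names
c = "my $total   =   $price * $count;"  # same text, other spacing

print(normalize_verbatim(a) == normalize_verbatim(c))      # True
print(normalize_verbatim(a) == normalize_verbatim(b))      # False
print(normalize_structural(a) == normalize_structural(b))  # True
```

The verbatim view only matches `a` and `c`; the structural view also matches `b`, which is the kind of renamed copy that an op-tree comparison is meant to catch.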
There should probably be a minimum length (?) as well as a minimum frequency (2) threshold for the duplicate to be considered.
How to detect the duplicates? The plugin should be able to recognize duplicates at 'long' distances and, even better, across files/modules. I am thinking of attaching an MD4 checksum to clusters of statements for fast recognition (hashing clusters with MD4 digests as keys).
Setting the right cluster size might be tricky.
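A rough sketch of the checksum idea, under a few assumptions: the statements are already normalized strings, the cluster is a fixed-size sliding window, and MD5 from Python's standard library stands in for MD4 (which is not reliably available there). The names `window_index` and `duplicate_windows` are hypothetical.

```python
import hashlib
from collections import defaultdict

def window_index(statements, size):
    """Map the digest of each run of `size` statements to its start positions."""
    index = defaultdict(list)
    for start in range(len(statements) - size + 1):
        cluster = "\n".join(statements[start:start + size])
        digest = hashlib.md5(cluster.encode("utf-8")).hexdigest()
        index[digest].append(start)
    return index

def duplicate_windows(statements, size, min_frequency=2):
    """Return the start positions of every cluster seen at least min_frequency times."""
    index = window_index(statements, size)
    return sorted(pos for pos in index.values() if len(pos) >= min_frequency)

statements = ["open", "read", "close", "print", "open", "read", "close"]
print(duplicate_windows(statements, size=3))  # [[0, 4]]
```

Because the index is keyed by digest, feeding it statements from several files (with a file name stored next to each position) would find duplicates across modules at no extra cost. The window size is exactly the tricky knob: too small and everything matches, too large and shorter duplicates are missed.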
Natural places to make the cuts are points of minimal 'statefulness', e.g. just before a scope is opened and just after it is closed.
The duplicate's badness could be determined by its frequency and code length.
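One possible way to combine the two factors, purely as an assumption on my part: count the statements that would disappear if the duplicate were moved into a single subroutine, i.e. every copy beyond the first counts in full. The candidate data below is invented for illustration.

```python
def badness(frequency, length):
    """Redundant statements: each copy beyond the first counts with its full length."""
    return (frequency - 1) * length

# Hypothetical candidates: name -> (frequency, length in statements)
candidates = {"A": (2, 12), "B": (5, 2), "C": (3, 8)}

ranked = sorted(candidates, key=lambda name: badness(*candidates[name]), reverse=True)
print(ranked)  # ['C', 'A', 'B']
```

With this score a long fragment copied once can outrank a tiny fragment copied many times, which seems to match the intuition that refactoring effort should go where the most code is saved.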
The problem then is to find all duplicated subsets in a set. That sounds like autocorrelation, or a diff of the whole against parts of itself. Any suggestions for good algorithms or even existing modules?
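One candidate for the "diff of the whole with itself" is a suffix-sorting approach: sort all suffixes of the statement list and compare neighbours, since repeated runs end up next to each other. This is a naive sketch with hypothetical names, quadratic in the worst case, and not a claim about how any existing module does it.

```python
def common_prefix_length(left, right):
    """Number of leading items that two sequences share."""
    count = 0
    for a, b in zip(left, right):
        if a != b:
            break
        count += 1
    return count

def repeated_runs(statements, min_length=2):
    """Return (length, first_start, second_start) for runs shared by neighbouring suffixes."""
    order = sorted(range(len(statements)), key=lambda i: statements[i:])
    found = []
    for a, b in zip(order, order[1:]):
        length = common_prefix_length(statements[a:], statements[b:])
        if length >= min_length:
            found.append((length, min(a, b), max(a, b)))
    return sorted(found, reverse=True)

statements = ["a", "b", "c", "x", "a", "b", "c", "y"]
print(repeated_runs(statements))  # [(3, 0, 4), (2, 1, 5)]
```

Unlike fixed-size windows, this finds the run length by itself. The second result is just the tail of the first, so a real tool would want to drop runs that are contained in a longer one, and to reject runs that overlap their own copy.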
I welcome your feedback, thanks.