how to get rid of cut-and-paste sins?
by hexcoder (Chaplain) on Feb 08, 2008 at 22:15 UTC
As a fan of quality-assurance modules like Perl::Critic, I am thinking about a plugin that detects cut-and-paste code duplicates. What would be the best strategy and infrastructure for finding them?
There should probably be a minimum length threshold as well as a minimum frequency threshold (at least two occurrences) for a duplicate to be reported.
How to detect the duplicates? The plugin should be able to recognize duplicates at 'long' distances, ideally even across files and modules. I am thinking of attaching an MD4 checksum to clusters of statements for fast recognition (hashing each cluster by its MD4 key).
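A minimal sketch of the hashing idea (in Python for brevity; it uses fixed-size line windows instead of statement clusters, MD5 in place of MD4 since MD4 is not universally available in standard libraries, and assumed threshold values):

```python
import hashlib
import re
from collections import defaultdict

MIN_LINES = 4   # assumed minimum cluster length to consider
MIN_COUNT = 2   # minimum number of occurrences to report

def normalize(line):
    """Collapse whitespace so layout differences don't hide duplicates."""
    return re.sub(r"\s+", " ", line.strip())

def find_duplicates(files):
    """Map a digest of each MIN_LINES-line window to the places it occurs.

    `files` is a dict of {filename: source text}; the result maps each
    digest seen at least MIN_COUNT times to its (file, line) locations.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        lines = [normalize(l) for l in text.splitlines()]
        for i in range(len(lines) - MIN_LINES + 1):
            window = "\n".join(lines[i:i + MIN_LINES])
            digest = hashlib.md5(window.encode()).hexdigest()
            seen[digest].append((name, i + 1))
    return {d: locs for d, locs in seen.items() if len(locs) >= MIN_COUNT}
```

Because every window is reduced to a short digest, duplicates at arbitrary distances (and across files) fall into the same hash bucket without any pairwise comparison.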
Setting the right cluster size might be tricky. Natural places to make the cuts would be points of minimal 'statefulness', e.g. before a scope is opened and after it is closed. A duplicate's badness could be scored by its frequency and code length.
The problem then is to find all duplicated subsets within a set. That sounds like autocorrelation, or a diff of the whole with parts of itself. Any suggestions for good algorithms, or even existing modules?
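One classic way to "diff the whole with parts of itself" is a suffix array: sort all suffixes of the token stream, and repeated runs become adjacent suffixes whose common prefix gives the duplicate. A naive sketch (illustrative only; the sort compares full suffixes, so it is not the efficient O(n log n) construction):

```python
def longest_repeat(tokens):
    """Return (length, positions) of the longest repeated run of tokens."""
    n = len(tokens)
    # Sort suffix start positions by suffix content; any repeated run
    # makes its two occurrences' suffixes sort next to each other.
    suffixes = sorted(range(n), key=lambda i: tokens[i:])
    best_len, best = 0, ()
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of the two adjacent suffixes.
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k > best_len:
            best_len, best = k, (a, b)
    return best_len, best
```

The same adjacent-suffix scan generalizes to reporting *all* repeats above a minimum length, not just the longest one.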
I welcome your feedback, thanks.