[OT] Finding similar program code
by afoken (Canon) on Aug 27, 2021 at 11:10 UTC
afoken has asked for the wisdom of the Perl Monks concerning the following question:
I have a huge pile of
Indenting is mostly random, lines are wrapped as randomly as the indenting. So I wrote a small script to fix indenting and line wrapping. That script also extracts each function to a separate file.
So far, I have extracted about 100 functions, and I guess another 100 are still waiting for me in Excel files I have not yet touched.
Subroutine calls must have been a very strange and hard to understand concept to the author, so instead, he copied blocks of several hundred lines around, resulting in functions of 1000 lines and more AFTER cleaning up indenting and line wraps. Of course, the copied blocks were later changed, but not all copies at the same time and in the same way.
Now my problem is to identify blocks copied from one function to another, after both copies have been modified. Indent changes and wrapping changes should have been eliminated by my script, but smaller changes remain: Spelling errors fixed in only one or two copies, implicit objects suddenly inserted (e.g. ActiveSheet.Cells(...) instead of just Cells(...)), comments added or removed, empty lines added or removed, some code lines commented out in only one copy, and so on.
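To make that list of remaining differences concrete, here is a minimal sketch of a normalization pass that would neutralize some of them before any comparison. It is written in Python and assumes VBA-like source in which an apostrophe outside a string starts a comment. The names normalize and strip_comment are made up for illustration. It does nothing about spelling fixes, and it drops commented-out code together with ordinary comments.

```python
import re


def strip_comment(line: str) -> str:
    """Cut the line at the first apostrophe outside a double-quoted string.

    Assumption: VBA-style comments. 'Rem' comments are not handled.
    """
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_string = not in_string
        elif ch == "'" and not in_string:
            return line[:i]
    return line


def normalize(source: str) -> list[str]:
    """Reduce a function body to canonical lines for comparison."""
    lines = []
    for raw in source.splitlines():
        line = strip_comment(raw)
        # Treat ActiveSheet.Cells(...) and Cells(...) as the same thing.
        line = re.sub(r"\bActiveSheet\.", "", line, flags=re.IGNORECASE)
        # Collapse whitespace and ignore case (VBA is case-insensitive).
        line = re.sub(r"\s+", " ", line).strip().lower()
        if line:  # empty and comment-only lines vanish here
            lines.append(line)
    return lines
```

With this, a line such as `ActiveSheet.Cells(1, 2) = x ' set value` and a plain `Cells(1,  2) = x` in another copy normalize to the same string.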
So, checksums won't work. Blindly running diff on every pair of functions will generate a lot of noise and very little signal. Grouping functions by line count or byte count, then running diff only on functions of similar size, might generate a little less noise. Basically, any function with fewer than about 100 lines is unlikely to contain eroded/evolved copies of other functions. That rules out about 80% of the functions, leaving me with 20 known and probably another 20 unknown functions.
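The size filter, plus a pairwise score that is easier to scan than raw diff output, might be sketched as follows. This is only an illustration, not something I have run against the real code: it uses Python's difflib.SequenceMatcher in place of diff, and the 100-line threshold comes from the estimate above while the 0.3 cut-off is an arbitrary assumption. The function name candidate_pairs is hypothetical. Lines must match exactly to count, so it works best on already normalized lines.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(functions, min_lines=100, min_shared=0.3):
    """Rank pairs of large functions by how many lines they share.

    functions maps a function name to its list of source lines.
    The score is the number of lines matched in order, divided by
    the length of the shorter function.
    """
    # Size filter: short functions are unlikely to hold copied blocks.
    big = {name: lines for name, lines in functions.items()
           if len(lines) >= min_lines}

    results = []
    for (name_a, a), (name_b, b) in combinations(sorted(big.items()), 2):
        # autojunk=False keeps frequently repeated lines in play.
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        shared = sum(block.size for block in matcher.get_matching_blocks())
        score = shared / min(len(a), len(b))
        if score >= min_shared:
            results.append((score, name_a, name_b))

    # Highest overlap first.
    return sorted(results, reverse=True)
```

With about 40 functions left after the filter, that is 780 pairs, which is small enough to compare exhaustively.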
Is there a smarter way to find those modified copies of existing code?
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)