http://www.perlmonks.org?node_id=761781

szabgab has asked for the wisdom of the Perl Monks concerning the following question:

I just got a new client that has a bunch of Perl scripts in three almost identical copies.

The story is that they used to have a set of scripts dealing with one kind of input. When they started to receive a slightly different kind of input - from another source - they copied the scripts and slightly modified some of them to cope with it. Then this happened again, so now they have 3 sets of almost identical code.

Naturally no unit tests are around.

In addition some of the subroutines were also copied among the files of the original set so I have some duplication there too.

Luckily all of the files use strict and warnings and there are almost no globals, so the code is not bad at all.

What would be the best approach to deal with this?

I thought it would be nice to track down the identically named subroutines in the codebase and then recognize if the subroutines are the same or nearly the same, showing the diff between the subroutines.
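Lacking a ready-made tool, a first pass at this can be scripted in core Perl alone: scan each file for sub definitions, record each sub's source, and report names defined in more than one place. A rough sketch follows - the brace-matching heuristic is deliberately naive (it ignores braces inside strings and comments; something like PPI would parse this properly), and the report format is invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the source text of every "sub name { ... }" in the given files.
# Returns: { name => [ [file, body], ... ] }
sub collect_subs {
    my @files = @_;
    my %subs;
    for my $file (@files) {
        open my $fh, '<', $file or die "$file: $!";
        my $src = do { local $/; <$fh> };
        while ( $src =~ /^(sub\s+(\w+)\s*\{)/mg ) {
            my ( $start, $name ) = ( pos($src) - length($1), $2 );
            # Naive brace matching: walk forward until the opening
            # brace of the sub is balanced again.
            my ( $depth, $end ) = ( 0, $start );
            for ( $end = $start ; $end < length $src ; $end++ ) {
                my $c = substr $src, $end, 1;
                $depth++ if $c eq '{';
                if ( $c eq '}' ) { $depth--; last if $depth == 0 }
            }
            push @{ $subs{$name} },
                [ $file, substr $src, $start, $end - $start + 1 ];
        }
    }
    return \%subs;
}

# Report sub names defined in more than one place, noting whether the
# copies are byte-identical or merely same-named (and so need a diff).
sub report_duplicates {
    my ($subs) = @_;
    my @lines;
    for my $name ( sort keys %$subs ) {
        my @copies = @{ $subs->{$name} };
        next if @copies < 2;
        my %distinct = map { $_->[1] => 1 } @copies;
        push @lines, sprintf "%s: %d copies (%s), %d distinct",
            $name, scalar @copies,
            join( ', ', map { $_->[0] } @copies ),
            scalar keys %distinct;
    }
    return @lines;
}

print "$_\n" for report_duplicates( collect_subs(@ARGV) );
```

Run as, say, perl find_dup_subs.pl setA/*.pl setB/*.pl - anything reported as "2 copies, 1 distinct" can be lifted into a module verbatim; the ones with more than one distinct body are the ones worth diffing by hand.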

For extra bonus it would be nice if I could easily unite two almost identical scripts into a module and a script (and what about when I have 3 copies...).

So this is a real issue I have now - or rather I had, as I manually worked through it over the past 2 days - but this could be a nice tool and a nice plugin for an IDE.

So let me also add my promotional link here to the Vertical Metre of Beer 3 - The Padre Plugin Contest! I'd be glad to see people starting to write a plugin that can help us handle such a situation.

Replies are listed 'Best First'.
Re: Refactoring copy-pasted code
by ELISHEVA (Prior) on May 04, 2009 at 18:40 UTC

    There are many ways to do this, but this is how I might go about it:

    1. Make sure all scripts are under version control.
    2. Develop a set of unit tests for one of the three scripts so you have a way of verifying that you haven't broken anything. Make sure the tests cover all of the major functionality that must not break.
    3. Study the code in all three, noting all the places where there are differences. To get an idea of where to focus your attention, you might use the shell command diff to compare the files.
    4. First refactoring pass:
      1. design/declare a data structure to hold all of the differences. Most of these will be plain data. However, you may notice that there are bits of flow of control that also differ. If so, write a subroutine that encapsulates those bits and put a code reference into your data structure.
      2. refactor the script for which you wrote the test suite so that it uses the data structure you designed in the previous step.
      3. run your test suite to make sure that the refactoring broke nothing
    5. Second refactoring pass:
      1. define a class. The data for objects in the class will be the data structure. The methods will be the functions in that first script. The parameters to the new method will be the data that populates that structure.
      2. refactor again so that the first script simply creates the object and calls its run method
      3. test again
    6. For the remaining scripts:
      1. write a test suite. If the outputs are similar enough (e.g. only input-expectation pairs change), you may find it easier just to expand the first test suite so that it can be used with more than one script.
      2. refactor the script so that it creates an object and runs it
      3. test
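    To make the two refactoring passes concrete, here is a minimal sketch of the shape they might produce. Every name in it (the field names, parse_line, Report::Runner) is invented for illustration; the real differences between the three scripts are what would go into the per-source hashes:

```perl
use strict;
use warnings;

# Pass 1: one hash per input source holding everything that differs --
# plain data where possible, code refs where flow of control differs.
my %sources = (
    source_a => {
        separator  => ',',
        parse_line => sub { split /,/,  shift },
    },
    source_b => {
        separator  => "\t",
        parse_line => sub { split /\t/, shift },
    },
);

# Pass 2: a class whose object data IS that structure; the shared logic
# from the original scripts becomes its methods.
package Report::Runner;

sub new {
    my ( $class, %config ) = @_;
    return bless {%config}, $class;
}

sub run {
    my ( $self, @lines ) = @_;
    my @records;
    for my $line (@lines) {
        push @records, [ $self->{parse_line}->($line) ];
    }
    return \@records;    # the formerly duplicated processing goes here
}

package main;

# Each former script shrinks to: build the object, call run().
my $runner  = Report::Runner->new( %{ $sources{source_a} } );
my $records = $runner->run("x,y,z");
print "@{ $records->[0] }\n";    # prints "x y z"
```

    The point of keeping code refs in the structure is that the third (and fourth...) input source then becomes one more hash entry rather than one more copy of the script.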

    If you are really comfortable with objects and refactoring you may be able to get away with merging the two refactoring phases into one (I usually do). I included them separately because I think I mentally go through both those phases even if I only code the second one.

    Best, beth

Re: Refactoring copy-pasted code
by jethro (Monsignor) on May 04, 2009 at 18:35 UTC
    I would put the three versions as different branches into git (or some other version control system) and use 'git diff' to check for the differences. Merging them would be the final objective. It might even be that merging and resolving the conflicts is all that is needed to integrate the 3 versions, without any preliminary diffs.

    The merge step would be a bit unusual in that it involves programming, but the merge process would practically provide a recipe or path one could follow.
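    For anyone who hasn't driven git this way before, the branch setup described above might look like the following sketch. The file names and contents are stand-ins; the final merge is left commented out because resolving its conflicts is exactly the manual programming step in question:

```shell
set -e
cd "$(mktemp -d)"

# Two stand-in "variants" of the near-identical script (contents invented)
printf '%s\n' 'print "report A\n";' > variantA.pl
printf '%s\n' 'print "report B\n";' > variantB.pl

git init -q merged && cd merged
git config user.email 'you@example.com'   # needed to commit in a fresh repo
git config user.name  'you'

cp ../variantA.pl process.pl              # first copy on the default branch
git add process.pl
git commit -qm 'import variant A'

git checkout -qb variantB                 # second copy becomes a branch
cp ../variantB.pl process.pl
git commit -qam 'import variant B'

git checkout -q -                         # back to the first branch
git diff HEAD variantB -- process.pl      # shows exactly what differs
# git merge variantB                      # final step: resolve by hand
```

    With a third copy, the same import gives a variantC branch, and the conflict markers from each merge enumerate precisely the spots that need a parameter, a code ref, or a per-source module.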

Re: Refactoring copy-pasted code
by graff (Chancellor) on May 05, 2009 at 00:26 UTC
    I don't know if it'll help, but you could take a look and try it out -- I posted it here at the monastery a few years back: Tabulate sub defs, sub calls in Perl code.

    Maybe some minor tweaks (like counting string-lengths or line-counts of subs) would make it useful for focusing attention on the basis of a simple tabulation.

Re: Refactoring copy-pasted code
by DStaal (Chaplain) on May 04, 2009 at 18:01 UTC

    If possible, I'd break this into two tasks: Parsing the input file, and then processing the data. (I am assuming the intent is to do the same thing with all the files, just that the data format is slightly different.)

    The end result I'd be aiming for is a script and several modules (one per data format, most likely). You then run the script, have some way for it to figure out which parser to pass the file to (either automatically or manually via a command-line switch), and have it take back the data in some canonical form (a hash?) and do whatever you want.
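    A skeleton of that design, with the per-format parsers shown as entries in a dispatch table for brevity (in the real layout each would be its own module, and the field names here are invented):

```perl
use strict;
use warnings;

# One parser per data format; each returns the same canonical hash ref,
# so everything downstream stays format-agnostic.
my %parser_for = (
    csv => sub { my %r; @r{qw(id name)} = split /,/,  shift; \%r },
    tsv => sub { my %r; @r{qw(id name)} = split /\t/, shift; \%r },
);

sub process_record {
    my ( $format, $line ) = @_;
    my $parse = $parser_for{$format}
        or die "no parser for format '$format'";
    my $record = $parse->($line);    # canonical form: a hash ref
    return "record $record->{id}: $record->{name}";    # shared processing
}

# The driver picks the parser from a command-line switch (or sniffs the file):
print process_record( csv => "42,widget" ), "\n";
print process_record( tsv => "43\tgadget" ), "\n";
```

    Adding a fourth input format then means writing one more parser that emits the canonical hash, rather than a fourth copy of the script.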

    (Note/Advertisement: if the data files are all line-oriented (that is, each line can be read on its own), my Mail::Log::Parse modules (which I'm working on generalizing, not that they need much work) can do a lot of the heavy lifting of opening, buffering, seeking, and decompressing the files for you, leaving you to write only one parsing function per file format.)

      I think there is a misunderstanding here.

      I am not looking for a solution to the "how do I collect data from various sources?" problem.

      I am looking for an answer to "how do I locate duplicate code?" and maybe "how do I unite duplicate code into a single copy?".

Re: Refactoring copy-pasted code
by jplindstrom (Monsignor) on May 05, 2009 at 12:45 UTC
    I've also been thinking about a copy-paste detector, and there turns out to be a fair amount of prior art.

    If you Google for "copy paste detection" and "code duplication", you'll find plenty of software, although mostly for Java or C#.

    I've tried DuDe, which is kinda useful but far from perfect.

    /J

Re: Refactoring copy-pasted code
by talexb (Chancellor) on May 05, 2009 at 13:40 UTC

    I'm very happy using gvimdiff between two files to get visual feedback on what parts of the file are identical. And if you find that a bunch of the scripts are identical from lines 200-400, say, it should be possible to centralize those routines in a module.
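    Once gvimdiff shows that a stretch like that is identical everywhere, those routines can move into one shared module. A toy sketch of the end state - the package is inlined here so the example is self-contained, but in practice Report::Common would live in its own .pm file, and format_row is just a hypothetical stand-in for whatever the duplicated routines are called:

```perl
use strict;
use warnings;

# --- would normally be lib/Report/Common.pm ---
package Report::Common;
use Exporter 'import';
our @EXPORT_OK = qw(format_row);

# Stand-in for one of the routines duplicated across the scripts.
sub format_row {
    my @fields = @_;
    return join "\t", @fields;
}

# --- each script then just imports the one shared copy ---
package main;
Report::Common->import('format_row');

print format_row(qw(a b c)), "\n";    # the tab-joined row "a  b  c"
```

    After that, every script that used to carry its own copy of lines 200-400 shrinks to a use line plus calls into the module.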

    Writing tests for this kind of script can be a challenge, but setting up some test data files with as many corner cases as possible should go a long way toward solving that.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Refactoring copy-pasted code
by dwm042 (Priest) on May 05, 2009 at 21:48 UTC
    szabgab++. This isn't uncommon. I have one instance with over 25 near-duplicate bodies of code (involving 6-12 scripts) and another, in PHP, where the near-duplication runs into the hundreds of scripts. If people can come up with rational means to handle near duplicates at this kind of scale, I'm all ears.

    David.
      I have found the 'ediff' code-merging tools within Emacs useful for situations that are kind of like yours. There is an 'ediff-buffers' command that splits the screen highlighting the detected differences between the code in two edit buffers, and allows you to step through the list of differences. Single keystrokes 'a' and 'b' will put what is in buffer A into the corresponding difference in buffer B, and vice versa. There are many other things you can do in this mode as well.

      Emacs also provides 'ediff-buffers3', which does 3 files at once, and 'ediff-directories', which sets up a 'session' going through all the files in a pair of directories.

      I might feel inclined to load up scripts A and B, copy one of them to a new file AB, and manually construct AB to be an abstracted union of the two. You could then do the same with script C, then D...

      Hey!! Wait! This isn't my signature!!