
Searching for duplication in legacy code

by yulivee07 (Sexton)
on Nov 23, 2016 at 08:59 UTC

yulivee07 has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,

I inherited a really large legacy codebase which I am meant to maintain for the future. I have been working with this codebase for a while now and noticed by scrolling through the code that there is code duplication (especially the same subroutines) in many cases. I would like to make modules for all that duplicate functionality, but first I am looking for a way to find the code duplication.

I have the Perl source files for analysis. My codebase consists of ~60 daemons with 3000-6000 lines of code each, so diffing all daemons against each other isn't really a practical way of approaching the problem. I was told that B::Xref may be a way to identify duplicate subroutines. Do you have additional suggestions for what I can do in a situation like this?

Kind regards, yulivee


Replies are listed 'Best First'.
Re: Searching for duplication in legacy code (refactoring strategy)
by LanX (Saint) on Nov 23, 2016 at 11:19 UTC
    Hi Yulivee,

    It depends on the nature of the duplication.

    Do equally named subs have identical code?

    Cut&paste programming involves mutations.

    General approach for refactoring

    a) identify all sub definitions in a file

    Possible Tools

    b) identify their dependencies
    • where are they called
    • which subs do they call
    • which global or closure variables do they use
    c) normalize sub code

    Formatting can differ

    d) diff potentially equal subs to measure similarity

    What "potentially" means depends on the quality of your code.

    probably changes happened to

    • sub name
    • local variable names
    • ...

    e) try to visualize dependencies to decide where best to start

    for example with Graphviz or a tree structure

    f) create a test suite to assure refactoring quality
    (The code might also show good inspection techniques)

    g) start refactoring incrementally, while constantly testing the outcome

    Depending on the quality of your tests, you might first start with only one daemon in production.

    h) care about a fallback scenario

    Especially use version control!
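Steps c) and d) above could be sketched roughly as follows. This is a minimal, core-only illustration, not a robust tool: the record layout (`file`, `name`, `body`) and the helper names are assumptions made up for this example, and the regex-based comment stripping will mangle a `#` inside strings or regexes, so any match should be treated as a candidate for a real diff.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Step c): crude normalization. Drop comments, then all whitespace.
sub normalize_code {
    my ($code) = @_;
    $code =~ s/^\s*#.*$//mg;     # full-line comments
    $code =~ s/\s+#\s.*$//mg;    # trailing "# comment"
    $code =~ s/\s+//g;           # formatting differences
    return $code;
}

# Step d): rename variables in order of first appearance (v1, v2, ...),
# so that copies with renamed locals still compare equal.
sub rename_variables {
    my ($code) = @_;
    my (%seen, $n);
    $code =~ s/([\$\@%])(\w+)/ $1 . ($seen{$2} //= 'v' . ++$n) /ge;
    return $code;
}

# Group sub records by the digest of their normalized body.
# Each record is a hash: { file => ..., name => ..., body => ... }.
sub group_by_fingerprint {
    my (@subs) = @_;
    my %group;
    for my $s (@subs) {
        my $fp = md5_hex(rename_variables(normalize_code($s->{body})));
        push @{ $group{$fp} }, "$s->{file}:$s->{name}";
    }
    return grep { @$_ > 1 } values %group;
}

# Hypothetical input: two copies with different names and layout, one unrelated sub.
my @subs = (
    { file => 'a.pl', name => 'add', body => q{{
        my ($x, $y) = @_;   # add two numbers
        return $x + $y;
    }} },
    { file => 'b.pl', name => 'sum', body => q{{ my ($left, $right) = @_; return $left + $right; }} },
    { file => 'b.pl', name => 'mul', body => q{{ my ($x, $y) = @_; return $x * $y; }} },
);

print join(' == ', @$_), "\n" for group_by_fingerprint(@subs);
```

Exact digests only catch copies that are identical after normalization. For mutated copies, the normalized text would have to be fed into a diff or similarity measure instead.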


    Sorry, very general tips, because it really depends on the structure of your legacy code. Probably grep is already enough...

    (Think about it, you might also need "nested refactoring" because new modules still have duplicated code and need using other modules and so on)
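For step e), one way to visualize dependencies is to write them out as Graphviz DOT text. The sketch below assumes you have already collected a "caller => callees" mapping by some other means; the `%calls` data and the function name `calls_to_dot` are invented for illustration.

```perl
use strict;
use warnings;

# Turn { caller => [callees] } into Graphviz DOT text.
sub calls_to_dot {
    my ($calls) = @_;
    my @lines = ('digraph calls {');
    for my $caller (sort keys %$calls) {
        for my $callee (sort @{ $calls->{$caller} }) {
            push @lines, qq{    "$caller" -> "$callee";};
        }
    }
    push @lines, '}';
    return join("\n", @lines) . "\n";
}

# Hypothetical call data for two daemons sharing helper subs.
my %calls = (
    'daemon_a::main' => [ 'read_config', 'log_msg' ],
    'daemon_b::main' => [ 'read_config' ],
    'read_config'    => [ 'log_msg' ],
);

print calls_to_dot(\%calls);
```

Saved to a file such as `calls.dot`, the output can be rendered with `dot -Tsvg calls.dot -o calls.svg`. Subs with many incoming edges are natural first candidates for a shared module.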


    I did some googling yesterday after our conversation for "refactoring" and "duplication" and the term "plagiarism detection" popped up.

    like in these discussions:

    Couldn't find a general refactoring project for Perl, but also didn't spend much time yet.

    I think that covering all edge cases of a worst-case scenario would certainly require PPI (at least), or even a patched B::Deparse to scan the op-tree with PadWalker, to identify variable dependencies and side effects.

    HTH! :)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: Searching for duplication in legacy code (updated)
by haukex (Archbishop) on Nov 23, 2016 at 11:26 UTC

    Hi yulivee07,

    I've done a bit of work with PPI and there is a chance it could be useful to you. This was an interesting question to me, so I went off and whipped something up (Update: that means please consider this a beta) that finds identical subs; perhaps it's useful to you. PPI could also be used for more powerful identification of duplicated code.


        $ ./
        Potential Duplicates:
            sub add in file, line 5, col 1
            sub add in file, line 5, col 1
        Potential Duplicates:
            sub subtr in file, line 11, col 1
            sub subtract in file, line 11, col 1
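The script that produced this output is not reproduced here. As a rough illustration of the same idea (this is not haukex's code), a PPI-based sketch could fingerprint each sub body by its significant tokens, so that whitespace, comments, and the sub's name are ignored. The helper name `index_subs` and the sample sources are assumptions for this example.

```perl
use strict;
use warnings;
use PPI;

# Add every named sub found in $source to %$index, keyed by a fingerprint
# of its body that ignores whitespace, comments, and the sub's name.
sub index_subs {
    my ($label, $source, $index) = @_;
    my $doc = PPI::Document->new(\$source)
        or die "Could not parse $label";
    my $subs = $doc->find('PPI::Statement::Sub') || [];
    for my $sub (@$subs) {
        my $block = $sub->block or next;    # skip forward declarations
        my $fingerprint = join ' ',
            map  { $_->content }
            grep { $_->significant } $block->tokens;
        push @{ $index->{$fingerprint} },
            sprintf 'sub %s in %s, line %d', $sub->name, $label, $sub->line_number;
    }
    return $index;
}

# Hypothetical sources standing in for two files.
my $file_a = <<'END_A';
sub add {
    my ($x, $y) = @_;
    return $x + $y;   # sum
}
END_A
my $file_b = <<'END_B';
sub plus { my ($x, $y) = @_; return $x + $y; }
sub minus { my ($x, $y) = @_; return $x - $y; }
END_B

my %index;
index_subs('a.pl', $file_a, \%index);
index_subs('b.pl', $file_b, \%index);

for my $places (grep { @$_ > 1 } values %index) {
    print "Potential Duplicates:\n";
    print "    $_\n" for @$places;
}
```

For real files, `PPI::Document->new($path)` reads from disk instead of a string reference. Renamed variables would still defeat this fingerprint.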

    Hope this helps,
    -- Hauke D

Re: Searching for duplication in legacy code (updated)
by stevieb (Canon) on Nov 23, 2016 at 13:50 UTC

    My Devel::Examine::Subs can help with some of this. It uses PPI behind the scenes. It can gather all subs in a file, or a whole directory, then list all subs in all those files. It can even examine each sub and collect only ones that have lines containing specified search patterns, print out which lines each sub starts/ends, and also how many lines are in each sub.

    Collect and display all subs in all files in the current working directory:

        use warnings;
        use strict;

        use Devel::Examine::Subs;

        my $des = Devel::Examine::Subs->new(file => '.');
        my $data = $des->all;

        for my $file (keys %$data){
            print "$file\n";
            for (@{ $data->{$file} }){
                print "\t$_\n";
            }
        }

    Sample output:

        lib/Test/BrewBuild/
            new
            git
            link
            name
            clone
            pull
        lib/Test/BrewBuild/
            new
            brew
            info
            installed
            using
            available
            install
            remove
            is_win
            _legacy_perls

    Get all the subs in the same manner, but collect them as objects instead to get a lot more information on each one:

        use warnings;
        use strict;

        use Devel::Examine::Subs;

        my $des = Devel::Examine::Subs->new(file => '.');
        my $data = $des->all;

        for my $file (keys %$data){
            print "$file\n";
            my $subs = $des->objects(file => $file);
            for my $sub (@$subs){
                print "\t" . $sub->name ."\n";
                print "\t\t lines: " . $sub->line_count ."\n";
                print "\t\t start: " . $sub->start ."\n";
                print "\t\t end: " . $sub->end . "\n";
            }
        }

    Sample output:

        lib/Test/BrewBuild/
            _fork
                lines: 111
                start: 146
                end: 256
            new
                lines: 21
                start: 21
                end: 41
            dispatch
                lines: 87
                start: 42
                end: 128
            _config
                lines: 17
                start: 129
                end: 145
        lib/Test/BrewBuild/
            name
                lines: 6
                start: 34
                end: 39
            git
                lines: 17
                start: 12
                end: 28

    The main reason I wrote this software is so that I could introspect subs accurately, and then if necessary insert code in specific subs at either a line number or search term (yes, this distribution does that as well). You can even search for specific lines in each sub, and print out the line numbers those search patterns appear on.

    Of course, using the above techniques, it would be trivial to find which files have duplicated subs: stash all the duplicate names (along with the file name), then use the objects to compare the length of the subs as a cursory check of whether they appear to be an exact copy/paste (that is, whether the number of lines is the same). The synopsis in the docs explains how to get the objects within a hash, so that the hash's key is the sub's name. This may make things easier.

    update: I forgot to mention that each subroutine object also contains the full code for the sub in $sub->code. This should help tremendously in programmatically comparing a sub from one file to the dup sub in another file.
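The filtering described above might look roughly like this. To keep the sketch self-contained it works on plain hashes rather than real Devel::Examine::Subs objects; the record layout (`file`, `name`, `line_count`, `code` as a single string) is an assumption modelled on the fields mentioned in this post, and you would fill it from the objects yourself.

```perl
use strict;
use warnings;

# Classify every sub name that occurs in more than one record.
# Records are plain hashes: { file, name, line_count, code }.
sub classify_same_named_subs {
    my (@records) = @_;
    my %by_name;
    push @{ $by_name{ $_->{name} } }, $_ for @records;

    my %report;
    for my $name (sort keys %by_name) {
        my @copies = @{ $by_name{$name} };
        next if @copies < 2;    # defined only once, nothing to compare
        my %counts = map { ($_->{line_count} => 1) } @copies;
        my %codes  = map { ($_->{code}       => 1) } @copies;
        $report{$name} = keys(%codes)  == 1 ? 'identical'
                       : keys(%counts) == 1 ? 'same length, different text'
                       :                      'diverged';
    }
    return \%report;
}

# Hypothetical records for three daemons.
my @records = (
    { file => 'd1.pl', name => 'log_msg', line_count => 3, code => "sub log_msg {\n  print \@_;\n}" },
    { file => 'd2.pl', name => 'log_msg', line_count => 3, code => "sub log_msg {\n  print \@_;\n}" },
    { file => 'd1.pl', name => 'connect', line_count => 3, code => "sub connect {\n  open_db('a');\n}" },
    { file => 'd3.pl', name => 'connect', line_count => 3, code => "sub connect {\n  open_db('b');\n}" },
    { file => 'd2.pl', name => 'cleanup', line_count => 2, code => "sub cleanup {\n}" },
);

my $report = classify_same_named_subs(@records);
print "$_: $report->{$_}\n" for sort keys %$report;
```

Subs reported as 'identical' are the cheapest ones to move into a module first; the other two categories need a manual diff.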

Re: Searching for duplication in legacy code (ctags and static parsing)
by LanX (Saint) on Nov 23, 2016 at 18:18 UTC
    For completeness, the "classical" approach to finding all subs in a source file is ctags.

    I can't test at the moment, but I suppose after all these years it's well tested by now.

    But please keep in mind that this (like PPI) does static parsing (and Only Perl can parse Perl).

    Approaches like B::Xref do compile the code (i.e. let Perl parse Perl) before inspecting it. *

    See for instance How to find all the available functions in a file or methods in a module? for a list of edge cases where static parsing fails.

    Again just for completeness, from your description I suppose that static parsing is sufficient for you, but maybe you should be aware of the limitations.
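To illustrate the "let Perl parse Perl" route with a core module: once the code is compiled, B::Deparse can turn a code reference back into source text in a uniform layout, which removes formatting and comment differences before comparing. This is only a sketch; the sample subs and the helper name `same_compiled_body` are made up, and the subs must be loaded (with all the side effects of compiling the file) before this works.

```perl
use strict;
use warnings;
use B::Deparse;

# Three sample subs: two with the same logic but different layout.
sub add { my ($x, $y) = @_; return $x + $y; }
sub plus {
    my ( $x, $y ) = @_;    # same logic, different formatting
    return $x + $y;
}
sub minus { my ($x, $y) = @_; return $x - $y; }

# Compare two compiled subs by their deparsed source text.
sub same_compiled_body {
    my ($first, $second) = @_;
    my $deparse = B::Deparse->new;
    return $deparse->coderef2text($first) eq $deparse->coderef2text($second);
}

print 'add vs plus:  ', (same_compiled_body(\&add, \&plus)  ? 'same' : 'different'), "\n";
print 'add vs minus: ', (same_compiled_body(\&add, \&minus) ? 'same' : 'different'), "\n";
```

Variable names still appear in the deparsed text, so copies with renamed locals would not compare equal this way.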

    HTH! :)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    *) with the drawback that compiling can already have the side effects of running code, while static parsing is "safe".

    **) or Re^3: Perl not BNF-able??, with some limitations listed by adamk, who is PPI's author.

Re: Searching for duplication in legacy code
by stevieb (Canon) on Nov 23, 2016 at 21:44 UTC

    Perhaps another thing that can help you sort out what is calling each sub from where is to enable stack tracing in all of your subs. Normally I wouldn't go so far away from the original question, but looking at my module this morning had me testing a few others so I thought I'd throw it out there in hopes it can help in some way.

    I wrote Devel::Trace::Subs to do this tracing. It uses Devel::Examine::Subs in the background (in fact, I wrote Devel::Examine::Subs originally specifically to be used by this module). It is intrusive... it injects a command into every single sub within specified files (both inserting and removing is done with a single command line string).

    Here's an example where I configure every Perl file in my Mock::Sub directory to save trace information (there's only one file in this case, but I still just use the current working directory as the 'file' param).

    Configure all files (make a backup copy of your directory first!):

    perl -MDevel::Trace::Subs=install_trace -e 'install_trace(file => ".");'

    In my case, I install my distribution, but that may not be your case if scripts just access the libraries where they sit.

    Here's an example script that uses the module that now has tracing capabilities:

        use warnings;
        use strict;

        use Devel::Trace::Subs qw(trace_dump);
        use Mock::Sub;

        $ENV{DTS_ENABLE} = 1;

        my $mock = Mock::Sub->new;
        my $blah_sub = $mock->mock('blah');

        blah();

        trace_dump();

        sub blah {
            print "blah!\n";
        }

    The only parts of interest are the use Devel::Trace::Subs ... line, the $ENV{DTS_ENABLE} = 1; line, which enables the tracing, and the trace_dump(); line, which dumps the trace data. The Mock::Sub stuff and everything else is irrelevant; it's just an example of normal code flow using other modules.

    Here is the output of the trace_dump():

        Code flow:

            1: Mock::Sub::new
            2: Mock::Sub::mock
            3: Mock::Sub::Child::new
            4: Mock::Sub::Child::side_effect
            5: Mock::Sub::Child::_check_side_effect
            6: Mock::Sub::Child::return_value
            7: Mock::Sub::Child::_mock
            8: Mock::Sub::Child::name
            9: Mock::Sub::Child::_check_side_effect
            10: Mock::Sub::Child::__ANON__

        Stack trace:

            in: Mock::Sub::new
                sub: -
                file:
                line: 10
                package: main

            in: Mock::Sub::mock
                sub: -
                file:
                line: 11
                package: main

            in: Mock::Sub::Child::new
                sub: Mock::Sub::mock
                file: /usr/local/share/perl/5.18.2/Mock/
                line: 50
                package: Mock::Sub

            in: Mock::Sub::Child::side_effect
                sub: Mock::Sub::mock
                file: /usr/local/share/perl/5.18.2/Mock/
                line: 52
                package: Mock::Sub

            in: Mock::Sub::Child::_check_side_effect
                sub: Mock::Sub::Child::side_effect
                file: /usr/local/share/perl/5.18.2/Mock/Sub/
                line: 185
                package: Mock::Sub::Child

            in: Mock::Sub::Child::return_value
                sub: Mock::Sub::mock
                file: /usr/local/share/perl/5.18.2/Mock/
                line: 53
                package: Mock::Sub

            in: Mock::Sub::Child::_mock
                sub: Mock::Sub::mock
                file: /usr/local/share/perl/5.18.2/Mock/
                line: 56
                package: Mock::Sub

            in: Mock::Sub::Child::name
                sub: Mock::Sub::Child::_mock
                file: /usr/local/share/perl/5.18.2/Mock/Sub/
                line: 49
                package: Mock::Sub::Child

            in: Mock::Sub::Child::_check_side_effect
                sub: Mock::Sub::Child::_mock
                file: /usr/local/share/perl/5.18.2/Mock/Sub/
                line: 81
                package: Mock::Sub::Child

            in: Mock::Sub::Child::__ANON__
                sub: -
                file:
                line: 13
                package: main

    in: is the sub currently being executed. The rest of the info is the caller of that sub.

    After you're done, you can remove tracing just as easily:

    perl -MDevel::Trace::Subs=remove_trace -e 'remove_trace(file => ".");'

    In the above example, there's only a single library. If the directory had several, you'd see the calls between the different modules in the proper order.
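The kind of information such a trace holds (the running sub, plus the sub, file, line, and package it was called from) can be illustrated with Perl's built-in `caller`. The sketch below is not how Devel::Trace::Subs is implemented; `record_call`, `trace_log`, `clear_trace`, and the two sample subs are hypothetical names for this example.

```perl
use strict;
use warnings;

my @trace;    # in-memory trace log for this sketch

# Call this as the first statement of any sub you want traced.
sub record_call {
    # Frame 1 describes the sub that called record_call() and its call site.
    my ($package, $file, $line, $current) = caller(1);
    # Frame 2, if present, names the sub that the call came from.
    my $called_by = (caller(2))[3] // '-';
    push @trace, {
        in        => $current,
        called_by => $called_by,
        file      => $file,
        line      => $line,
        package   => $package,
    };
    return;
}

sub trace_log   { return @trace }
sub clear_trace { @trace = (); return }

# Two sample subs with tracing "installed" by hand.
sub helper { record_call(); return 42 }
sub worker { record_call(); return helper() }

worker();
printf "in: %s  sub: %s  line: %d  package: %s\n",
    @{$_}{qw(in called_by line package)} for trace_log();
```

Read across several daemons, a log like this shows which copies of a duplicated sub are actually exercised and from where.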

Re: Searching for duplication in legacy code
by cguevara (Vicar) on Nov 23, 2016 at 19:53 UTC
Re: Searching for duplication in legacy code
by duyet (Friar) on Nov 23, 2016 at 10:20 UTC
    If you are on a Unix/Linux system, just grep for the sub and/or sub name:

        grep -r "sub " *
        grep -r "sub <sub name>" *
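Going one step beyond grep, a few lines of Perl can report which sub names are defined in more than one file. This is a sketch under the same limitations as grep: the line-based regex also matches inside POD, heredocs, and strings, and the helper names and sample sources are invented here.

```perl
use strict;
use warnings;

# Names of all "sub NAME" definitions found in a source string.
sub sub_names {
    my ($source) = @_;
    return $source =~ /^\s*sub\s+(\w+)/mg;
}

# Given { file => source }, return { name => [files] } for every
# sub name that is defined in two or more files.
sub names_in_several_files {
    my ($sources) = @_;
    my %files_for;
    for my $file (sort keys %$sources) {
        my %seen;
        push @{ $files_for{$_} }, $file
            for grep { !$seen{$_}++ } sub_names($sources->{$file});
    }
    delete $files_for{$_}
        for grep { @{ $files_for{$_} } < 2 } keys %files_for;
    return \%files_for;
}

# Hypothetical file contents; in practice, slurp each daemon's source here.
my %sources = (
    'daemon_a.pl' => "sub read_config { 1 }\nsub run_a { 1 }\n",
    'daemon_b.pl' => "sub read_config { 1 }\nsub run_b { 1 }\n",
);

my $shared = names_in_several_files(\%sources);
print "$_: @{ $shared->{$_} }\n" for sort keys %$shared;
```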
Re: Searching for duplication in legacy code
by 1nickt (Canon) on Nov 23, 2016 at 11:12 UTC

    As I am reading this the CPAN nodelet to the right shows recent upgrades to Class::Inspector, which has methods for examining the functions or methods in a loaded or other class, one of which will return "a reference to an array of CODE refs of the functions", which seems like it might be something to start with.
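As a rough, core-only illustration of what "examining the functions in a loaded class" involves (this walks the symbol table directly and is not Class::Inspector's API), the sketch below lists the subs defined in a package. The helper name `defined_subs` and the package `Demo::Daemon` are made up for the example.

```perl
use strict;
use warnings;

# List the names of subs that exist in an already loaded package.
# Functions imported into the package show up as well.
sub defined_subs {
    my ($package) = @_;
    no strict 'refs';
    my @names = grep { defined &{"${package}::$_"} } keys %{"${package}::"};
    return sort @names;
}

{
    package Demo::Daemon;    # hypothetical package standing in for real code
    sub start { return 'started' }
    sub stop  { return 'stopped' }
}

print join(', ', defined_subs('Demo::Daemon')), "\n";
```

Like B::Xref, this only sees code that has been compiled and loaded, so the caveat about side effects of loading the daemons applies.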

    The way forward always starts with a minimal test.
Re: Searching for duplication in legacy code
by fishy (Friar) on Nov 23, 2016 at 20:11 UTC
Re: Searching for duplication in legacy code
by hexcoder (Curate) on Sep 15, 2017 at 21:43 UTC

    I wrote a text duplication checker (see Code::DRY), which uses suffix arrays for performance. It has no special knowledge of Perl or units like subs, but it can find duplicated lines quite fast. You would need a C compiler to build the libraries, but then as memory permits you can scan whole directory trees for duplicates.
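To give an idea of how a suffix array finds duplicated lines, here is a deliberately naive pure-Perl sketch. It is not taken from Code::DRY: it builds the suffix array over whole lines with a plain `sort`, which is far too slow for a large codebase, and the function name `repeated_runs` is an assumption for this example.

```perl
use strict;
use warnings;

# Find runs of at least $min identical consecutive lines that occur at two
# positions in @$lines. Neighbours in the sorted suffix array share the
# longest common prefixes, so only adjacent suffixes need comparing.
sub repeated_runs {
    my ($lines, $min) = @_;
    my $n = @$lines;

    # Order two suffixes, given by their starting line indexes.
    my $cmp = sub {
        my ($i, $j) = @_;
        while ($i < $n && $j < $n) {
            my $c = $lines->[ $i++ ] cmp $lines->[ $j++ ];
            return $c if $c;
        }
        return ($n - $i) <=> ($n - $j);    # the shorter suffix sorts first
    };
    my @suffix = sort { $cmp->($a, $b) } 0 .. $n - 1;

    my @found;
    for my $k (1 .. $#suffix) {
        my ($i, $j) = sort { $a <=> $b } @suffix[ $k - 1, $k ];
        # Skip pairs that are just the tail of a longer run already reported.
        next if $i > 0 && $lines->[ $i - 1 ] eq $lines->[ $j - 1 ];
        my $len = 0;
        $len++ while $j + $len < $n && $lines->[ $i + $len ] eq $lines->[ $j + $len ];
        push @found, { first => $i, second => $j, length => $len } if $len >= $min;
    }
    return @found;
}

# Hypothetical input: the same three lines appear twice.
my @lines = (
    'my $fh = open_log();', 'print {$fh} $msg;', 'close $fh;',
    'return 1;',
    'my $fh = open_log();', 'print {$fh} $msg;', 'close $fh;',
    'return 0;',
);

printf "lines %d and %d start a %d-line duplicate\n",
    $_->{first} + 1, $_->{second} + 1, $_->{length}
    for repeated_runs(\@lines, 3);
```

Because it compares raw lines, this has the same property mentioned above: it knows nothing about Perl or subs, so reformatted or renamed copies are missed.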

    I once planned to use it for a refactoring tool, but first wanted to implement the option to find structural duplicates (e.g. in token streams), where I got stuck...

    Hope this helps, hexcoder
