aquarium has asked for the wisdom of the Perl Monks concerning the following question:

Looking for advice (Perl or just a generic approach) on programmatically finding duplicated blocks of code across 40 or so PHP script files, so I can then manually refactor the code chunks that I know exist. It's not doable manually, as we're talking 55k+ lines of code in total. The copy/pasted bits of duplicate code are not wrapped in any function or class structure, i.e. it is pure procedural code someone else wrote. That's unfortunately how development on this occurred before I started working with it. The author is available, but it's also beyond his ability (or anyone's) to now manually unravel the duplication by hand.
the hardest line to type correctly is: stty erase ^H

Replies are listed 'Best First'.
Re: general advice finding duplicate code
by GrandFather (Sage) on Jun 21, 2011 at 05:54 UTC

    Interesting, I started solving a very similar problem for planetscape some time ago. In her case she was wanting to refactor a web site where there were large chunks of duplicated HTML. The general approach was to normalise the HTML then extract chunks of some minimum size and populate a hash using the chunks as a key and adding the file location of each chunk to a list stored in the hash element. The interesting part is to then use the matched chunks as seed points and grow the match area to encompass as large a common region as sensible.

    With the HTML matching, choosing "sensible" was something of a trade-off. As the common region was increased, the number of places that matched the region tended to reduce. It may be that in your case, if the code really has just been copied around without change, the regions are pretty well defined.

    My guess is that for something small like 55K LOC the technique would work quite well and in a timely fashion. Probably you don't need to worry so much about the normalise step.
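
A minimal Perl sketch of the seed-and-grow idea described above, under stated assumptions: chunks are runs of whole lines, the normalise step is skipped, and only cross-file matches are reported. The file names, sample lines, `$MIN`, and the subs `seed_chunks`, `grow_match` and `find_regions` are hypothetical names made up for illustration; this is not the code written for planetscape.

```perl
use strict;
use warnings;

# Hypothetical in-memory "files": name => array of lines.
my %lines_of = (
    'a.php' => [ 'x = 1;', 'open();', 'read();', 'close();', 'y = 2;' ],
    'b.php' => [ 'z = 3;', 'open();', 'read();', 'close();', 'w = 4;' ],
);

my $MIN = 2;    # minimum chunk size in lines (an assumed tuning knob)

# Seed step: key = chunk text, value = list of [file, start index].
sub seed_chunks {
    my ($files, $min) = @_;
    my %where;
    for my $name (sort keys %$files) {
        my $l = $files->{$name};
        for my $i (0 .. scalar(@$l) - $min) {
            my $key = join "\n", @{$l}[ $i .. $i + $min - 1 ];
            push @{ $where{$key} }, [ $name, $i ];
        }
    }
    return \%where;
}

# Grow step: widen a seed match line by line in both directions.
sub grow_match {
    my ($la, $i, $lb, $j, $len) = @_;
    # Extend backwards while the preceding lines agree.
    while ($i > 0 && $j > 0 && $la->[$i - 1] eq $lb->[$j - 1]) {
        $i--; $j--; $len++;
    }
    # Extend forwards while the following lines agree.
    while ($i + $len < @$la && $j + $len < @$lb
           && $la->[$i + $len] eq $lb->[$j + $len]) {
        $len++;
    }
    return ($i, $j, $len);
}

# Pair up every location of each shared chunk and report grown regions once.
sub find_regions {
    my ($files, $min) = @_;
    my $where = seed_chunks($files, $min);
    my (%seen, @regions);
    for my $key (sort keys %$where) {
        my @locs = @{ $where->{$key} };
        for my $p (0 .. $#locs - 1) {
            for my $q ($p + 1 .. $#locs) {
                my ($fa, $i) = @{ $locs[$p] };
                my ($fb, $j) = @{ $locs[$q] };
                next if $fa eq $fb;    # sketch: cross-file matches only
                my ($si, $sj, $len) =
                    grow_match($files->{$fa}, $i, $files->{$fb}, $j, $min);
                my $id = "${fa}:$si ${fb}:$sj $len";
                push @regions, $id unless $seen{$id}++;
            }
        }
    }
    return @regions;
}

print "$_\n" for find_regions(\%lines_of, $MIN);
```

For the sample data this prints `a.php:1 b.php:1 3`: both two-line seeds grow into the same three-line region starting at (zero-based) line 1 of each file, so it is reported only once. Raising `$MIN` trades fewer spurious seeds against missing short duplicates.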

    True laziness is hard work
Re: general advice finding duplicate code
by Ratazong (Monsignor) on Jun 21, 2011 at 06:29 UTC
Re: general advice finding duplicate code
by NetWallah (Abbot) on Jun 21, 2011 at 05:14 UTC
    I found this on StackOverflow - you may want to contact this person at Clone Doctor.
    I'm completing a CloneDR-based duplicated code finder tool (see for Perl. I really like real examples. Can I have your 30 files? If it all works, I'll send you the report and eventually the production tool. (Zip file?) Ira Baxter Jun 7 at 23:27

                "XML is like violence: if it doesn't solve your problem, use more."

Re: general advice finding duplicate code
by planetscape (Chancellor) on Jun 21, 2011 at 11:34 UTC
Re: general advice finding duplicate code
by 7stud (Deacon) on Jun 21, 2011 at 05:23 UTC
    use strict;
    use warnings;
    use 5.010;

    my $fname = 'somefile.php';

    # Slurp whole file:
    my $file;
    {
        local $/ = undef;
        $file = <DATA>;
    }

    my %files_for;

    while ($file =~ m{ <[?]php \s* (.*?) \s* [?]> }xmsg) {
        my $php_code = $1;
        push @{ $files_for{$php_code} }, $fname;
    }

    use Data::Dumper;
    say Dumper(\%files_for);

    __END__
    <div>hello</div>
    <div><?php echo 'world'; ?>
    <div><?php echo 'hello';?>
    <div><?php echo 'world';?>

    --output:--
    $VAR1 = {
              'echo \'world\';' => [
                                     'somefile.php',
                                     'somefile.php'
                                   ],
              'echo \'hello\';' => [
                                     'somefile.php'
                                   ]
            };
Re: general advice finding duplicate code
by Anonymous Monk on Jun 21, 2011 at 05:55 UTC
    to now manually unravel the duplication by hand

    This is a classic refactoring problem. The amount of duplication doesn't matter; you simply go file by file and rewrite each file to be modular, using the appropriate amount of abstraction.

    By the time you're on file 20 (of 500), you'll know whether it is possible (and worth the effort) to refactor all 55k lines of code, or whether to start from scratch.

    If this were perl, I would say use B::Xref to generate a graph, and then look for cycles ...

    Or if the code is at all modular, use autodia and/or GraphViz::ISA to get a picture

    surely php has something similar, maybe :)

      I gotta agree with this one. 55k LOC really isn't that much. If you have a duplication problem, you probably have a factoring problem.

      Refactoring the code will not only help eliminate your duplication issue, but will also teach you what the code is doing, and give a much better end result than simply eliminating duplication.

      Eliminating the trivial copy/pastes is a good start though; anything that helps maintainability will buy time for refactoring and help others avoid making the problem worse while you race to make it better.


Re: general advice finding duplicate code
by aquarium (Curate) on Jun 21, 2011 at 06:05 UTC
    Thanks for the responses so far. I'll look up the Clone Doctor code; however, I cannot send this codebase to third parties.
    the second approach, using dumper, looks like will only identify duplicated but individual lines of code across the scripts...which would be just as easy to do using
    cat *.php | sort | uniq -c
    I'll keep thinking about it too, and will post any gems. A brute-force, shrinking sliding window between two scripts is possible, but it would probably blow out to hours/days of running time for the 40 or so script pair combinations.
      looks like will only identify duplicated but individual lines of code across the scripts

      Every approach is this approach :) it's like a search engine

      You iterate over your files, and you index each file

      To index, you pick a unit (e.g. one word, or three adjacent lines of code)

      Generate a list of all units for a file

      Normalize each unit. For words you would stem (remove prefix/suffix..) to find the root, for lines you would remove insignificant whitespace, insignificant commas... normalize quoting characters...

      Hash each unit (sha1), and associate all this in a database

      Then, to find duplication, query the database to find duplicate hashes
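
The indexing steps above can be sketched in Perl as follows. This is only an illustration under assumptions: the unit is three adjacent non-blank lines, the normalisation rules are deliberately crude, and a plain in-memory hash stands in for the database. `normalise`, `index_file`, `duplicates` and the sample file contents are hypothetical names and data; `Digest::SHA` is the core module that provides `sha1_hex`.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Normalise one line: trim, collapse whitespace runs, unify quote characters.
# These rules are illustrative assumptions, not a complete PHP normaliser.
sub normalise {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    $line =~ s/\s+/ /g;
    $line =~ tr/"/'/;
    return $line;
}

# Index one file: hash every window of $unit adjacent non-blank lines.
# %$index maps a SHA-1 digest to a list of "file:first_line" strings.
sub index_file {
    my ($index, $name, $text, $unit) = @_;
    my @lines;
    my $n = 0;
    for my $raw (split /\n/, $text) {
        $n++;
        my $norm = normalise($raw);
        push @lines, [ $n, $norm ] if length $norm;
    }
    for my $i (0 .. scalar(@lines) - $unit) {
        my @win    = @lines[ $i .. $i + $unit - 1 ];
        my $digest = sha1_hex(join "\n", map { $_->[1] } @win);
        push @{ $index->{$digest} }, "${name}:$win[0][0]";
    }
    return;
}

# "Query" step: every digest that was seen in more than one place.
sub duplicates {
    my ($index) = @_;
    return grep { @$_ > 1 } map { $index->{$_} } sort keys %$index;
}

# Hypothetical sample input; real use would read each *.php file instead.
my $a_php = join "\n",
    '$total = 0;',
    '  connect("db");',
    '$rows   = fetch_all();',
    'render($rows);',
    'footer();';
my $b_php = join "\n",
    'header();',
    "connect('db');",
    '$rows = fetch_all();',
    '',
    'render($rows);',
    'exit;';

my %index;
index_file(\%index, 'a.php', $a_php, 3);
index_file(\%index, 'b.php', $b_php, 3);

print join(' ', @$_), "\n" for duplicates(\%index);
```

With the sample data this prints `a.php:2 b.php:2`: the three-line unit starting at line 2 of each file hashes identically once indentation, spacing, quoting and the blank line are normalised away. Overlapping duplicate windows from a longer copied block show up as runs of consecutive line numbers, which can then be merged into one region.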

      This is not unlike what git (git gc) does, so I wouldn't be surprised if git provided a tool to help you visualize these duplications, although I don't know of one

      It goes without saying that before making code changes, you need a comprehensive test suite :)

Re: general advice finding duplicate code
by sundialsvc4 (Abbot) on Jun 21, 2011 at 11:19 UTC

    In the case of classic PHP, it might be a valid approach to consider the problem as being one of “extracting the (small) amounts of code” from what is basically “a very large and ugly template.”   PHP strongly encouraged the mix-up of code and (in effect) data in the same file, and, “as soon as it works, sort of, go on to the next one starting with a copy of the last one.”   Perhaps, though, a preponderance of the actual material is HTML, rather than logic?

Re: general advice finding duplicate code
by aquarium (Curate) on Jun 21, 2011 at 06:16 UTC
    looks like some answers might be fractal. those things follow me everywhere ;)
Re: general advice finding duplicate code
by aquarium (Curate) on Jun 22, 2011 at 03:57 UTC
    Thank you everyone for the great help. I ended up using CPD with very good results. Amazingly enough, it even ran straight from the link to the Java Web Start. I was worried that any automated tool might have problems, as the PHP also contains HTML and VML (ugh). But the output shows clearly that about 20 or so PHP files share (significantly) on the order of 100-150 lines of code in various (specified) places. So doing this dedupe should cut another several thousand lines of code. I'm trying to get to a code base that is actually maintainable by some mere mortal like myself or someone else. The code was all written by a single author.
Re: general advice finding duplicate code
by sundialsvc4 (Abbot) on Jun 23, 2011 at 13:22 UTC

    As we all know, probably the most annoying aspect of these problems is that the various “duplicated” bits of code are often not quite the same.

    PHP’s biggest weakness, in my opinion, is also a fundamental aspect of its design:   “code and data are intermingled.”   Logic is scattered willy-and-yon among the presentation and is usually completely governed by it.   I have turned a lot of PHP modules into Template::Toolkit files, but it was never, ever easy.   You basically are re-writing the damn thing . . . basically, from scratch.   But sometimes you just can’t make a silk purse.

      Oh wise monks. In the five months since initially posting this I've learned a lot, but not achieved that much. Past the very few instances where multiple copies of code could be centralised, it turns out there are too many design flaws in this behemoth of an application. Upon further inspection there is an unwieldy spider web of possible calls and entry points between the scripts, use of the wrong data structures, and mixing of code, data, and presentation. For any little challenging situation, yet another URL variable was invented. It's a right royal mess. Any meaningful further fixes all seem to lead to basically a complete rewrite. To top it off, the self-taught author of the app is still keeping his fingers in the pie, often negating any useful fix with unfounded fear arguments against that specific fix. So I've resigned myself to putting less effort into this, as I can see it doesn't deserve all my attention. When it does fall over, I won't take it too seriously...even though the app is used by a select few, instead of the masses. I'm now spending at least some time on desktop publishing and info-graphics. There's potential for programmers to follow some of Edward Tufte's ideas when generating data graphics.


        Be sure to document these issues (both social and technical) and how you would re-architect the program, and use them as valuable field experience (also CYA)

        Also, keeping your spirits up is very important, good to figure this out sooner

        De-stress early, de-stress often