general advice finding duplicate code

aquarium has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: general advice finding duplicate code by GrandFather (Saint) on Jun 21, 2011 at 05:54 UTC
Interesting, I started solving a very similar problem for planetscape some time ago. In her case she wanted to refactor a web site where there were large chunks of duplicated HTML. The general approach was to normalise the HTML, then extract chunks of some minimum size and populate a hash, using the chunks as keys and adding the file location of each chunk to a list stored in the hash element. The interesting part is to then use the matched chunks as seed points and grow the match area to encompass as large a common region as is sensible. With the HTML matching, choosing "sensible" was something of a trade-off: as the common region was increased, the number of places that matched the region tended to reduce. It may be that in your case, if the code really has just been copied around without change, the regions are pretty well defined. My guess is that for something small like 55K LOC the technique would work quite well and in a timely fashion. You probably don't need to worry so much about the normalise step.

True laziness is hard work
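The seed-and-grow idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not GrandFather's actual code: the function name `find_common_regions`, the use of whole lines as the unit, and the `{ name => [lines] }` input shape are all made up for the example, and no normalisation is done.

```perl
use strict;
use warnings;

# Hypothetical helper: index every run of $min consecutive lines (the
# "seeds"), then grow each matching pair forward while the lines agree.
sub find_common_regions {
    my ($files, $min) = @_;    # $files: { name => [ lines ] }

    my %seen;                  # chunk text => [ [name, start index], ... ]
    for my $name (sort keys %$files) {
        my $lines = $files->{$name};
        for my $i (0 .. scalar(@$lines) - $min) {
            my $key = join "\n", @{$lines}[$i .. $i + $min - 1];
            push @{ $seen{$key} }, [$name, $i];
        }
    }

    my @regions;
    for my $locs (values %seen) {
        next if @$locs < 2;    # chunk occurs only once: not a duplicate
        for my $p (0 .. $#$locs - 1) {
            for my $q ($p + 1 .. $#$locs) {
                my ($fp, $ip) = @{ $locs->[$p] };
                my ($fq, $iq) = @{ $locs->[$q] };
                my ($lp, $lq) = ($files->{$fp}, $files->{$fq});

                # Skip seeds that merely continue a region starting earlier.
                next if $ip > 0 && $iq > 0 && $lp->[$ip - 1] eq $lq->[$iq - 1];

                # Grow the match forward one line at a time.
                my $len = $min;
                $len++ while $ip + $len < @$lp
                          && $iq + $len < @$lq
                          && $lp->[$ip + $len] eq $lq->[$iq + $len];

                push @regions, { a => [$fp, $ip], b => [$fq, $iq], len => $len };
            }
        }
    }
    return sort { $b->{len} <=> $a->{len} } @regions;
}

# Made-up sample data, purely for demonstration.
my %demo = (
    'a.php' => [qw(A B C D E X)],
    'b.php' => [qw(Y B C D E Z)],
);
for my $r (find_common_regions(\%demo, 2)) {
    printf "%s line %d and %s line %d share %d lines\n",
        $r->{a}[0], $r->{a}[1] + 1, $r->{b}[0], $r->{b}[1] + 1, $r->{len};
}
```

The sketch compares every pair of locations that share a seed, so a chunk that is repeated very often (blank lines, closing braces) will produce many pairs; a larger minimum chunk size or a normalise step that drops such lines would likely be needed on real code.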
Re: general advice finding duplicate code by Ratazong (Monsignor) on Jun 21, 2011 at 06:29 UTC
Looks like the algorithms used for detecting plagiarism could help you here. If you are able to understand German, the following link to the Bauhaus University Weimar may help you: Search for documents on plagiarism-detection (most in German). In 2009 the university held a competition which sounds similar to your use-case.

HTH, Rata
Re^2: general advice finding duplicate code by Anonymous Monk on Jun 21, 2011 at 07:02 UTC
Brilliant, that is just the keyword that was needed: plagiarism.

Source Code Similarity Detection Tools - Plagiarism | Subject Centre for Information and Computer Sciences
PMD - Finding copied and pasted code

Note that CPD works with Java, JSP, C, C++, Fortran and PHP code.
Re: general advice finding duplicate code by NetWallah (Canon) on Jun 21, 2011 at 05:14 UTC
I found this on StackOverflow - you may want to contact this person at Clone Doctor.

`I'm completing a CloneDR-based duplicated code finder tool (see www.semanticdesigns.com/Products/CloneDR) for Perl. I really like real examples. Can I have your 30 files? If it all works, I'll send you the report and eventually the production tool. (Zip file?) – Ira Baxter Jun 7 at 23:27`

"XML is like violence: if it doesn't solve your problem, use more."
Re: general advice finding duplicate code by planetscape (Chancellor) on Jun 21, 2011 at 11:34 UTC
See also:

Brainstorming session: detecting plagiarism
word similarity measure
Fingerprinting text documents for approximate comparison

HTH, planetscape
Re: general advice finding duplicate code by 7stud (Deacon) on Jun 21, 2011 at 05:23 UTC
    use strict;
    use warnings;
    use 5.010;

    my $fname = 'somefile.php';

    # Slurp whole file:
    my $file;
    {
        local $/ = undef;
        $file = <DATA>;
    }

    # Map each PHP snippet to the list of files it appears in.
    my %files_for;
    while ($file =~ m{ <[?]php \s* (.*?) \s* [?]> }xmsg) {
        my $php_code = $1;
        push @{ $files_for{$php_code} }, $fname;
    }

    use Data::Dumper;
    say Dumper(\%files_for);

    __END__
    <div>hello</div>
    <div><?php echo 'world'; ?>
    <div><?php echo 'hello';?>
    <div><?php echo 'world';?>

    --output:--
    $VAR1 = {
        'echo \'world\';' => [ 'somefile.php', 'somefile.php' ],
        'echo \'hello\';' => [ 'somefile.php' ]
    };
Re: general advice finding duplicate code by Anonymous Monk on Jun 21, 2011 at 05:55 UTC
"to now manually unravel the duplication by hand"

This is a classic refactoring problem. The amount of duplication doesn't matter; you simply go file by file and rewrite each file to be modular, using the appropriate amount of abstraction. By the time you're on file 20 (of 500), you'll know whether it is possible (and worth the effort) to refactor all 55k lines of code, or whether to start from scratch.

If this were Perl, I would say use B::Xref to generate a graph, and then look for cycles ... Or if the code is at all modular, use autodia and/or GraphViz::ISA to get a picture. Surely PHP has something similar, maybe :)
Re^2: general advice finding duplicate code by armstd (Friar) on Jun 22, 2011 at 04:47 UTC
I gotta agree with this one. 55k LOC really isn't that much. If you have a duplication problem, you probably have a factoring problem. Refactoring the code will not only help eliminate your duplication issue, but will also teach you what the code is doing, and give a much better end result than simply eliminating duplication. Eliminating the trivial copy/pastes is a good start though; anything that helps maintainability will buy time for refactoring, helping others avoid making the problem worse while you race to make it better.

--Dave
Re: general advice finding duplicate code by aquarium (Curate) on Jun 21, 2011 at 06:05 UTC
Thanks for the responses so far. i'll look up the clone doctor code...however i cannot send this codebase to 3rd parties. the second approach, using dumper, looks like it will only identify duplicated but individual lines of code across the scripts...which would be just as easy to do using `cat *.php | sort | uniq -c`. i'll keep thinking about it too..and will post any gems. a brute force reducing sliding window between two scripts is possible but would probably blow out to hours/days of running time for the 40 or so script pair combinations.

the hardest line to type correctly is: stty erase ^H
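For comparing one pair of scripts, a possible alternative to the reducing sliding window mentioned above is the textbook longest-common-substring table, applied to lines instead of characters. This is a swapped-in technique, not what the poster described; the function name `longest_common_run` and the sample data are invented for the example. Its cost per pair is proportional to the product of the two line counts.

```perl
use strict;
use warnings;

# Longest run of identical consecutive lines shared by two files,
# using a dynamic-programming table that is kept one row at a time.
sub longest_common_run {
    my ($x, $y) = @_;                      # two array refs of lines
    my ($best, $end_x, $end_y) = (0, 0, 0);
    my @prev = (0) x (scalar(@$y) + 1);    # run lengths for the previous row

    for my $i (1 .. scalar @$x) {
        my @curr = (0) x (scalar(@$y) + 1);
        for my $j (1 .. scalar @$y) {
            next unless $x->[$i - 1] eq $y->[$j - 1];
            # A match extends the run that ended on the previous diagonal cell.
            $curr[$j] = $prev[$j - 1] + 1;
            ($best, $end_x, $end_y) = ($curr[$j], $i, $j) if $curr[$j] > $best;
        }
        @prev = @curr;
    }
    # Returns: run length, 0-based start index in $x, 0-based start index in $y.
    return ($best, $end_x - $best, $end_y - $best);
}

# Made-up sample data, purely for demonstration.
my ($len, $sx, $sy) = longest_common_run([qw(A B C D E X)], [qw(Y B C D E Z)]);
print "longest shared run: $len lines, starting at line ",
    $sx + 1, " and line ", $sy + 1, "\n";
```

This only reports the single longest shared run per pair; finding every shared region would mean scanning the table for all runs above some minimum length.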
Re^2: general advice finding duplicate code by Anonymous Monk on Jun 21, 2011 at 06:49 UTC
"looks like will only identify duplicated but individual lines of code across the scripts"

Every approach is this approach :) It's like a search engine:

- You iterate over your files, and you index each file.
- To index, you pick a unit (e.g. one word, or three adjacent lines of code).
- Generate a list of all units for a file.
- Normalize each unit. For words you would stem (remove prefix/suffix..) to find the root; for lines you would remove insignificant whitespace, insignificant commas... normalize quoting characters...
- Hash each unit (sha1), and associate all this in a database.
- Then, to find duplication, query the database to find duplicate hashes.

This is not unlike what git (git gc) does, so I wouldn't be surprised if git provided a tool to help you visualize these duplications, although I don't know of one.

It goes without saying that before making code changes, you need a comprehensive test suite :)
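The indexing steps listed above could look something like the sketch below. Several things here are assumptions made for illustration: the unit is a window of `$n` consecutive non-blank lines, the normalisation rule is just whitespace trimming and collapsing, an in-memory hash stands in for the database, and the function names `normalise`, `index_units` and `duplicates` are hypothetical.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Normalise one line: trim the ends and collapse inner whitespace.
# (An assumed, minimal rule; real code would likely need more.)
sub normalise {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    $line =~ s/\s+/ /g;
    return $line;
}

# Index every window of $n consecutive non-blank lines by SHA-1 digest.
# Lines are assumed to be byte strings, not decoded Unicode.
sub index_units {
    my ($files, $n) = @_;      # $files: { name => [ raw lines ] }
    my %index;                 # digest => [ "name:first_line", ... ]
    for my $name (sort keys %$files) {
        my @kept;              # [ original line number, normalised text ]
        my $lineno = 0;
        for my $raw (@{ $files->{$name} }) {
            $lineno++;
            my $norm = normalise($raw);
            push @kept, [$lineno, $norm] if length $norm;
        }
        for my $i (0 .. scalar(@kept) - $n) {
            my $unit = join "\n", map { $_->[1] } @kept[$i .. $i + $n - 1];
            push @{ $index{ sha1_hex($unit) } }, join ':', $name, $kept[$i][0];
        }
    }
    return \%index;
}

# "Query" step: keep only the digests that occur more than once.
sub duplicates {
    my ($index) = @_;
    return grep { @$_ > 1 } values %$index;
}

# Made-up sample data, purely for demonstration.
my %demo_files = (
    'a.php' => ['  echo 1;', '', 'echo   2;', 'echo 3;'],
    'b.php' => ['echo 1;', 'echo 2;', 'echo 9;'],
);
print join(' ', @$_), "\n" for duplicates(index_units(\%demo_files, 2));
```

With a window of one line this reduces to counting repeated lines; larger windows find copied blocks, at the price of missing copies that differ inside the window.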
Re: general advice finding duplicate code by aquarium (Curate) on Jun 21, 2011 at 06:16 UTC
looks like some answers might be fractal. those things follow me everywhere ;)

the hardest line to type correctly is: stty erase ^H
Re: general advice finding duplicate code by sundialsvc4 (Abbot) on Jun 21, 2011 at 11:19 UTC
In the case of classic PHP, it might be a valid approach to consider the problem as being one of “extracting the (small) amounts of code” from what is basically “a very large and ugly template.” PHP strongly encouraged the mix-up of code and (in effect) data in the same file, and, “as soon as it works, sort of, go on to the next one starting with a copy of the last one.” Perhaps, though, a preponderance of the actual material is HTML, rather than logic?
Re: general advice finding duplicate code by aquarium (Curate) on Jun 22, 2011 at 03:57 UTC
Thank you everyone for the great help. I ended up using CPD with very good results. Amazingly enough it even ran straight from the link to the java web start. I was worried that any automated tool might have problems as the php also contains html and vml (ugh). But the output shows clearly that about 20 or so php files have (significantly) in common on the order of 100-150 lines of code in various (specified) places. So doing this dedupe should cut another several thousand lines of code. I'm trying to get to a code base where it actually becomes maintainable by some mere mortal like myself or someone else. the code was all written by a single author.

the hardest line to type correctly is: stty erase ^H
Re: general advice finding duplicate code by sundialsvc4 (Abbot) on Jun 23, 2011 at 13:22 UTC
As we all know, probably the most annoying aspect of these problems is that the various “duplicated” bits of code are often not quite the same. PHP’s biggest weakness, in my opinion, is also a fundamental aspect of its design: “code and data are intermingled.” Logic is scattered willy-and-yon among the presentation and is usually completely governed by it. I have turned a lot of PHP modules into Template::Toolkit files, but it was never, ever easy. You basically are re-writing the damn thing . . . basically, from scratch. But sometimes you just can’t make a silk purse.
Re^2: general advice finding duplicate code by aquarium (Curate) on Nov 10, 2011 at 22:32 UTC
Oh wise monks. In the five months since initially posting this I've learned a lot, but not achieved that much. Past the very few instances where multiple copies of code could be centralised, it turns out there are too many design flaws in this behemoth of an application. Upon further inspection there is an unwieldy spider web of possible calls and entry points between the scripts, use of the wrong data structures, and mixing of code, data, and presentation. For any little challenging situation, yet another url variable was invented. It's a right, royal mess. Any meaningful further fixes all seem to lead to basically a complete rewrite. To top it off, the self-taught author of the app is still keeping his fingers in the pie, often negating any useful fix with unfounded fear arguments against that specific fix. So I've resigned myself to putting less effort into this, as I can see it doesn't deserve all my attention. When it does fall over, I won't take it too seriously...even though the app is used by a select few, instead of the masses. I'm now spending at least some time in desktop publishing and info-graphics. There's potential for programmers to follow some of Edward Tufte's ideas when generating data graphics.

the hardest line to type correctly is: stty erase ^H
Re^3: general advice finding duplicate code by Anonymous Monk on Nov 10, 2011 at 22:51 UTC
Pity. Be sure to document these issues (both social and technical), and how you would re-architect the program, and use them as valuable field experience (also CYA). Also, keeping your spirits up is very important; good to figure this out sooner. De-stress early, de-stress often.