PerlMonks
Re: general advice finding duplicate code

by GrandFather (Saint)
on Jun 21, 2011 at 05:54 UTC ( [id://910682] )


in reply to general advice finding duplicate code

Interesting. I started solving a very similar problem for planetscape some time ago. In her case she wanted to refactor a web site that contained large chunks of duplicated HTML. The general approach was to normalise the HTML, extract chunks of some minimum size, and populate a hash using each chunk as a key, appending the file location of each occurrence to a list stored in the hash element. The interesting part is then to use the matched chunks as seed points and grow the match area to encompass as large a common region as is sensible.
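A minimal sketch of that chunk-hashing idea in Perl. The window size, the whitespace-only normalisation, and all the names here are my illustrative assumptions, not the actual code:

```perl
#!/usr/bin/perl
# Sketch of the approach described above: normalise, hash fixed-size
# chunks to their locations, then grow matches from the seed points.
use strict;
use warnings;

my $window = 4;    # minimum chunk size, in lines (assumed value)

my %files;     # file name => array ref of normalised lines
my %chunks;    # chunk text => list of [file name, 1-based start line]

sub index_file {
    my ($name, @lines) = @_;

    # "Normalise": collapse whitespace so trivial layout differences
    # don't hide duplicates. Real HTML normalisation would do more.
    for (@lines) {
        s/\s+/ /g;
        s/^ //;
        s/ $//;
    }
    $files{$name} = \@lines;

    # Every $window-line chunk records where it was seen.
    for my $i (0 .. @lines - $window) {
        my $chunk = join "\n", @lines[$i .. $i + $window - 1];
        push @{$chunks{$chunk}}, [$name, $i + 1];
    }
}

# Seed points: chunks that occur in more than one place.
sub duplicates {
    return grep {@{$chunks{$_}} > 1} keys %chunks;
}

# Grow a seed match downward for as long as the two copies stay
# identical; growing upward would be symmetrical.
sub grow {
    my ($p, $q) = @_;    # each is [file name, 1-based start line]
    my ($fp, $ip) = @$p;
    my ($fq, $iq) = @$q;
    my $len = $window;
    while (defined $files{$fp}[$ip - 1 + $len]
        && defined $files{$fq}[$iq - 1 + $len]
        && $files{$fp}[$ip - 1 + $len] eq $files{$fq}[$iq - 1 + $len]) {
        ++$len;
    }
    return $len;
}
```

Run index_file over every file, then grow over the location pairs of each duplicate chunk; overlapping seeds collapse into the same grown region, which gives you the maximal common areas.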

With the HTML matching, choosing "sensible" was something of a trade-off: as the common region was grown, the number of places that matched it tended to shrink. If in your case the code really has just been copied around without change, the regions may be pretty well defined.

My guess is that for something small like 55K LOC the technique would work quite well, and in a timely fashion. You probably don't need to worry so much about the normalise step.

True laziness is hard work
