Generally, *cut*-and-paste isn't the issue, *copy*-and-paste is. Granted, you ca...

likpok · on March 11, 2009

Pattern Insight does this, IIRC. They found at least one bug in Linux as a result of copypasta.

http://www.patterninsight.com/products/pattern-miner.html

I saw a demo, and it was pretty slick. I have never used it on an actual project though.

russell · on March 10, 2009

I meant copy and paste.

As you have discovered, the problem is complex because the number of comparisons explodes if you try to do it in any general way. It would be useful to be able to ask "is there any other code like the section that I selected?" The copied code usually has changes in things like constants and arguments. If you abstracted away constants, it might be easier to compare. Just a suggestion.
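A minimal sketch of the "abstract away constants" idea, assuming the code being compared is Python so that the standard `tokenize` module can do the lexing. The function name `normalize_constants` and the `<NUM>` / `<STR>` placeholders are made up for illustration; a real tool would need a lexer for whatever language it targets.

```python
import io
import tokenize

# Token types that carry layout rather than content; skipped when comparing.
_LAYOUT = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER)

def normalize_constants(source):
    """Return the token strings of `source` with literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NUMBER:
            out.append("<NUM>")      # hypothetical placeholder for numeric literals
        elif tok.type == tokenize.STRING:
            out.append("<STR>")      # hypothetical placeholder for string literals
        elif tok.type in _LAYOUT:
            continue
        else:
            out.append(tok.string)
    return out

a = normalize_constants("total = price * 3 + fee('basic')\n")
b = normalize_constants("total = price * 7 + fee('premium')\n")
print(a == b)  # True: the two lines differ only in their constants
```

Identifiers are left untouched here, so a copy with renamed variables would still look different.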

silentbicycle · on March 10, 2009

Right. Allowing preprocessing with a syntax-specific extension or (ideally) scanning on the AST would be more interesting - code can be surprisingly similar except for variable names, types, and other semantic annotations. Rather than calling them "patterns" and prizing them, sometimes abstracting them away would be a better approach.

My main focus is on curing the heartbreak of copy-and-paste-programming, but I'm sure it would have more general applications (if I ever get it out of quadratic performance, etc.). Unsurprisingly, what is "similar enough" can be a very slippery concept.

scott_s · on March 10, 2009

Doing it on the AST should help performance. Even if you're still using an N^4 algorithm, the AST should be significantly smaller than the string that represents the code.

And, as it sounds like you realize, this will catch cases where variable names have been changed, but the structure is the same.
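To illustrate how an AST comparison can ignore renamed variables, here is a rough sketch using Python's standard `ast` module as a stand-in for whatever parser the real tool would use. The helper names (`_BlankNames`, `structure`) are invented for this example, and mapping every identifier to the same placeholder is a deliberately coarse assumption: it would also treat `a + b` and `a + a` as identical.

```python
import ast

class _BlankNames(ast.NodeTransformer):
    """Replace every variable and parameter name with a single placeholder."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

def structure(source):
    """Return a string describing the shape of `source`, ignoring variable names."""
    tree = _BlankNames().visit(ast.parse(source))
    return ast.dump(tree)  # line and column attributes are omitted by default

one = "total = 0\nfor item in items:\n    total += item.cost\n"
two = "acc = 0\nfor x in xs:\n    acc += x.cost\n"
three = "acc = 0\nfor x in xs:\n    acc -= x.cost\n"

print(structure(one) == structure(two))    # True: same shape, different names
print(structure(one) == structure(three))  # False: += versus -=
```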

silentbicycle · on March 10, 2009

Right, and there would also be more semantic boundaries, (possibly greatly) reducing the size of the overall problem space. If (by sheer coincidence) there was a token overlap between the last two lines of one function and the first two of another, that's probably not relevant in the same way a copied-and-pasted version of the same function in two places would be.

scott_s · on March 10, 2009

The AST also makes it easier to use heuristics based on semantic information to reduce the number of subtrees (previously substrings) you need to consider.

For example, in a C-like language, you could decide to only look at subtrees that are at least complete statements, or go to a coarser granularity and only look at compound statements. You could also eliminate common idioms, such as

  for (i = 0; i < N; i++)

which will pop up all over the place and give false positives.
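A sketch of that filtering, again using Python's `ast` module as a stand-in for a C parser. The name `candidate_subtrees`, the list of node types, and the `min_nodes` threshold are all assumptions made for illustration. A size cutoff is only one crude way to drop boilerplate such as a bare loop header.

```python
import ast

# Node types treated as "compound statements" in this sketch.
COMPOUND = (ast.For, ast.While, ast.If, ast.With, ast.Try, ast.FunctionDef)

def candidate_subtrees(source, min_nodes=15):
    """Yield compound-statement subtrees containing at least `min_nodes` AST nodes."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, COMPOUND):
            size = sum(1 for _ in ast.walk(node))
            if size >= min_nodes:
                yield node

src = """
for i in range(n):
    pass

for row in table:
    if row.total > limit:
        flagged.append(row.name)
        count += 1
"""

kept = [type(node).__name__ for node in candidate_subtrees(src)]
print(kept)  # the trivial first loop is too small to be reported
```

Only the subtrees that survive this filter would then be handed to the (expensive) comparison step.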

jibiki · on March 10, 2009

If you're literally just looking for code that appears in two places, then there is an obvious polynomial time algorithm, no? (Check every substring against every other substring...)

scott_s · on March 10, 2009

Yeah, it's polynomial, but it's still really slow and not practical.

In a string of length N, there are N substrings of length 1, N-1 substrings of length 2, N-2 substrings of length 3, ..., 1 substring of length N. This is the sum of the first N integers, which is N(N+1)/2. If we're doing big-O analysis, we can just say that's O(N^2).

Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).

Running an O(N^4) algorithm on nontrivial sizes of N (and nontrivial sizes of N are where you need it the most) will take a long time.
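The counting argument above can be checked numerically. This small sketch (function names are illustrative) counts substrings by position and then the number of pairwise comparisons. It counts comparisons only, not the character-by-character work inside each one.

```python
def substring_count(n):
    """Number of (start, length) substrings in a string of length n."""
    return sum(n - length + 1 for length in range(1, n + 1))

def pair_count(n):
    """Number of unordered pairs of substrings, i.e. comparisons in the naive approach."""
    m = substring_count(n)
    return m * (m - 1) // 2

# The closed form N(N+1)/2 matches the explicit sum.
for n in (1, 10, 100, 1000):
    assert substring_count(n) == n * (n + 1) // 2

print(substring_count(10), pair_count(10))  # 55 1485
```

Multiplying N by 10 multiplies the pair count by roughly 10^4, which is why the naive approach stops being practical so quickly.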

Disclaimer: it's been a long time since I've done any algorithm analysis, so please check my work.

ramchip · on March 10, 2009

My algorithms teacher does research in something similar. IIRC he has a program exactly for showing copy/pasted code. You may want to have a look at his papers on Clone Detection Tools, "A Novel Approach to Optimize Clone Refactoring Activity", etc.

http://www.polymtl.ca/recherche/rc/en/professeurs/details.ph...

silentbicycle · on March 10, 2009

Awesome, thanks!

lacker · on March 10, 2009

"Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4)."

If you are just checking for equality, comparing each of N items to another group of N items is O(N). Use a hash table.
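A minimal sketch of the hash-table approach for exact matches. It assumes the unit of comparison is a fixed-size window of `k` consecutive lines, which is a simplification chosen for the example rather than something from the thread. The name `repeated_windows` is likewise made up.

```python
from collections import defaultdict

def repeated_windows(lines, k):
    """Map each run of k consecutive lines that occurs more than once to its start indices."""
    seen = defaultdict(list)
    for i in range(len(lines) - k + 1):
        # Tuples are hashable, so each window is looked up in expected O(1)
        # (plus the O(k) cost of building and hashing the window itself).
        seen[tuple(lines[i:i + k])].append(i)
    return {window: starts for window, starts in seen.items() if len(starts) > 1}

lines = [
    "a = 1", "b = 2", "c = a + b", "print(c)",
    "a = 1", "b = 2", "c = a + b",
]
print(repeated_windows(lines, 3))
# {('a = 1', 'b = 2', 'c = a + b'): [0, 4]}
```

Each window is inserted once instead of being compared against every other window, so the pairwise comparison step disappears for the equality case. It does not help with "similar but not identical" code unless the input is normalized first.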

silentbicycle · on March 11, 2009

Exactly. Interned strings / symbols work out the same, too.