Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Generally, cut-and-paste isn't the issue, copy-and-paste is. Granted, you can cut and paste paste paste ..., but cut-and-paste usually means moving things to a more sensible place.

The biggest problem is when a code blob is duplicated all over the place and (worse still!) the copies slowly drift out of sync, making them hard to reunite. I've been working on a program off and on that tries to detect this in a language-agnostic manner, but I've hit a couple dead ends. (I half expect somebody with a more orthodox CS education to take one look at it and go, "Oh, that's NP-complete," but I haven't let it stop me - it's been a great way to learn about graph algorithms.)



Pattern Insight does this, IIRC. They found at least one bug in Linux as a result of copypasta.

http://www.patterninsight.com/products/pattern-miner.html

I saw a demo, and it was pretty slick. I have never used it on an actual project though.


I meant copy and paste.

As you have discovered the problem is complex because of the explosion of compares if you try to do it in any general way. It would be useful to query "is there any other code like the section that I selected?" The copied code usually has changes in things like constants and argument. If you abstracted away constants it might be easier to compare. Just a suggestion.


Right. Allowing preprocessing with a syntax-specific extension or (ideally) scanning on the AST would be more interesting - code can have be surprisingly similar except for variable names, types, and other semantic annotations. Rather than calling them "patterns" and prizing them, sometimes abstracting them away would be a better approach.

My main focus is on curing the heartbreak of copy-and-paste-programming, but I'm sure it would have more general applications (if I ever get it out of quadratic performance, etc.). Unsurprisingly, what is "similar enough" can be a very slippery concept.


Doing it on the AST should help performance. Even if you're still using an N^4 algorithm, the AST should be significantly smaller than the string that represents the code.

And, as it sounds like you realize, this will catch cases where variable names have been changed, but the structure is the same.


Right, and there would also be more semantic boundaries, (possibly greatly) reducing the size of the overall problem space. If (by sheer coincidence) there was a token overlap between the last two lines of one function and the first two of another, that's probably not relevant in the same way a copied-and-pasted version of the same function in two places would be.


The AST also makes it easier to use heuristics based on semantic information to shorten the number of subtrees you're concerned with (previously substrings).

For example, if in a C-like language, you could decide to only look at subtrees that are at least complete statements, or go to a higher granularity and only look at compound statements. You could also eliminate common idioms, such as

  for (i = 0; i < N; i++)
Which will pop up all over the place and give false positives.


If you're literally just looking for code that appears in two places, then there is an obvious polynomial time algorithm, no? (Check every substring against every other substring...)


Yeah, it's polynomial, but it's still really slow and not practical.

In a string of length N, there are N substrings of length 1, N-1 substrings of length 2, N-2 substring of length 3, ..., 1 substring of length N. This is the sum of the first N integers, which is N(N+1)/2. If we're doing big oh analysis, we can just say that's O(N^2).

Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).

Doing a O(N^4) algorithm on nontrivial sizes of N (and nontrivial sizes of N are where you need it the most) will take a long time.

Disclaimer: it's been a long time since I've done any algorithm analysis, so please check my work.


My algorithms teacher does research in something similar. IIRC he has a program exactly for showing copy/pasted code. You may want to have a look at his papers on Clone Detection Tools, "A Novel Approach to Optimize Clone Refactoring Activity", etc.

http://www.polymtl.ca/recherche/rc/en/professeurs/details.ph...


Awesome, thanks!


Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).

If you are just checking for equality, comparing each of N items to another group of N items is O(N). Use a hash table.


Exactly. Interned strings / symbols work out the same, too.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: