Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I meant copy and paste.

As you have discovered the problem is complex because of the explosion of compares if you try to do it in any general way. It would be useful to query "is there any other code like the section that I selected?" The copied code usually has changes in things like constants and argument. If you abstracted away constants it might be easier to compare. Just a suggestion.



Right. Allowing preprocessing with a syntax-specific extension or (ideally) scanning on the AST would be more interesting - code can have be surprisingly similar except for variable names, types, and other semantic annotations. Rather than calling them "patterns" and prizing them, sometimes abstracting them away would be a better approach.

My main focus is on curing the heartbreak of copy-and-paste-programming, but I'm sure it would have more general applications (if I ever get it out of quadratic performance, etc.). Unsurprisingly, what is "similar enough" can be a very slippery concept.


Doing it on the AST should help performance. Even if you're still using an N^4 algorithm, the AST should be significantly smaller than the string that represents the code.

And, as it sounds like you realize, this will catch cases where variable names have been changed, but the structure is the same.


Right, and there would also be more semantic boundaries, (possibly greatly) reducing the size of the overall problem space. If (by sheer coincidence) there was a token overlap between the last two lines of one function and the first two of another, that's probably not relevant in the same way a copied-and-pasted version of the same function in two places would be.


The AST also makes it easier to use heuristics based on semantic information to shorten the number of subtrees you're concerned with (previously substrings).

For example, if in a C-like language, you could decide to only look at subtrees that are at least complete statements, or go to a higher granularity and only look at compound statements. You could also eliminate common idioms, such as

  for (i = 0; i < N; i++)
Which will pop up all over the place and give false positives.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: