Something related I've been thinking about lately is combining something like th...

Darmani · on June 1, 2019

That's called MOSS ("Measure of Software Similarity"), and it's been used by universities for over a decade.

kqr · on June 1, 2019

My goal is slightly different, though -- I'm more interested in semantic similarity than source similarity. I saw one article that suggested running the detection algorithm on the decompiled output of the compiler to get better results, which seems like a good idea as well.
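
To make the idea of comparing compiler output concrete, here is a rough sketch. It does not use a decompiler at all; as a stand-in it compares the CPython bytecode of two functions via the standard `dis` module, looking only at opcode names. The helper names (`opcode_sequence`, `same_shape`) and the sample functions are made up for illustration, and matching opcode sequences is only a crude proxy for semantic similarity.

```python
import dis


def opcode_sequence(func):
    """Return the list of opcode names in a function's compiled bytecode."""
    return [ins.opname for ins in dis.get_instructions(func)]


def same_shape(f, g):
    """True when two functions compile to the same opcode-name sequence.

    Operands (variable names, constants) are ignored on purpose, so this
    is a coarse comparison, not a proof of equivalence.
    """
    return opcode_sequence(f) == opcode_sequence(g)


# Hypothetical sample functions: the first two differ only in naming
# and comments, the third computes the same result a different way.
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s


def add_all(values):
    # same loop, different identifiers
    acc = 0
    for v in values:
        acc += v
    return acc


def total_builtin(xs):
    return sum(xs)


if __name__ == "__main__":
    print(same_shape(total, add_all))
    print(same_shape(total, total_builtin))
```

Renamed variables and comments disappear at this level, but `total_builtin` still looks different even though it means the same thing, which is roughly where the harder semantic clone problem starts.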

Darmani · on June 5, 2019

Oh, I misunderstood. That's called "semantic clone detection," and it's been a breeding ground for second-tier papers for a long time.

sixwing · on May 31, 2019

very doable, at least for certain types of clones, and a topic of active research.

while leveraging the ASTs and scope graphs produced by semantic can allow you to attack the more complicated clone types (e.g., code that has nearly the same meaning, with significantly different implementation), various parsing + hashing methods have proven useful for the simpler cases.
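
As an illustration of the "parsing + hashing" approach for the simpler clone types, here is a minimal sketch in Python. It is not how semantic or any particular detector works; it is one common pattern, assumed here for the example: tokenize, replace identifiers and literals with placeholders, hash overlapping k-token windows, and compare the resulting fingerprint sets with Jaccard similarity. The function names and the choice of `k=5` are arbitrary.

```python
import hashlib
import io
import keyword
import tokenize

# Token types that carry no structural information for this comparison.
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source):
    """Tokenize Python source, replacing identifiers and literals with placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")    # any identifier
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")   # any numeric literal
        elif tok.type == tokenize.STRING:
            out.append("STR")   # any string literal
        else:
            out.append(tok.string)  # keywords and operators kept verbatim
    return out


def fingerprints(source, k=5):
    """Hash every window of k consecutive normalized tokens."""
    toks = normalized_tokens(source)
    grams = (" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 0)))
    return {hashlib.sha1(g.encode()).hexdigest()[:12] for g in grams}


def similarity(a, b, k=5):
    """Jaccard similarity of the two fingerprint sets (0.0 if both are empty)."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


if __name__ == "__main__":
    a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    b = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
    print(similarity(a, b))
```

This catches copies with renamed identifiers and changed literals, but not code that reaches the same result through a different implementation; that is where the ASTs and scope graphs come in.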

useful for far more than detecting plagiarism, too. it can boost signal for search, allow for more nuanced semantic navigation, assist in refactoring, and help understand the propagation/provenance of code (which can be important for understanding licenses, etc).