
Something related I've been thinking about lately is combining something like this with plagiarism detection tools to weed out code duplication. I'm not yet sure how, but it would be interesting to explore.


That's called MOSS ("Measure of Software Similarity"), and it's been used by universities for over a decade.


My goal is slightly different, though -- I'm more interested in semantic similarity than source similarity. I saw one article that suggested running the detection algorithm on the decompiled output of the compiler to get better results, which seems like a good idea as well.
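A rough sketch of that idea, using CPython bytecode as a stand-in for "the output of the compiler" (that substitution is my assumption, not what the article did). The helper name `opcode_shape` is made up for illustration. It compares only opcode names and ignores operands, so it is deliberately coarse: renamed variables disappear, but so can some real differences.

```python
import dis


def opcode_shape(func):
    """Return the function's opcode names in order, ignoring all operands."""
    return tuple(instr.opname for instr in dis.get_instructions(func))


def add(a, b):
    return a + b


def plus(left, right):
    # Same logic as add(), different identifiers.
    return left + right


def running_total(xs):
    # Structurally different: contains a loop.
    total = 0
    for x in xs:
        total += x
    return total


# Renaming variables does not change the opcode sequence.
same = opcode_shape(add) == opcode_shape(plus)
# A function with different control flow gets a different sequence.
different = opcode_shape(add) != opcode_shape(running_total)
```

Because operands are dropped, a real tool would need a finer-grained comparison than this to avoid false matches.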


Oh, I misunderstood. That's called "semantic clone detection," and it's been a breeding ground for second-tier papers for a long time.


very doable, at least for certain types of clones, and a topic of active research.

while leveraging the ASTs and scope graphs produced by semantic can allow you to attack the more complicated clone types (e.g., code that has nearly the same meaning but a significantly different implementation), various parsing + hashing methods have proven useful for the simpler cases.
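a minimal sketch of the parsing + hashing approach for the simpler cases, using Python's standard `ast` module rather than semantic (that swap is mine). it blanks out identifiers and literal values, then hashes what is left, so functions that differ only in naming or constants collide. the names `fingerprint`, `find_clones`, and `SAMPLE` are hypothetical, and this is only one of many possible normalizations.

```python
import ast
import copy
import hashlib
from collections import defaultdict


class _Normalize(ast.NodeTransformer):
    """Blank out identifiers and literal values, keeping only structure."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Constant(self, node):
        node.value = None
        return node


def fingerprint(func):
    """Hash the normalized AST of one function (works on a copy)."""
    normalized = _Normalize().visit(copy.deepcopy(func))
    # ast.dump omits line/column attributes by default.
    dumped = ast.dump(normalized, annotate_fields=False)
    return hashlib.sha256(dumped.encode()).hexdigest()


def find_clones(source):
    """Group function names whose normalized ASTs hash identically."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[fingerprint(node)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SAMPLE = '''
def total(items):
    acc = 0
    for item in items:
        acc += item
    return acc

def sum_all(xs):
    s = 0
    for x in xs:
        s += x
    return s

def product(xs):
    p = 1
    for x in xs:
        p *= x
    return p
'''

clones = find_clones(SAMPLE)
```

here `total` and `sum_all` land in the same group, while `product` does not because its operator differs. attribute names are left untouched, and anything beyond renamed-identifier clones (reordered statements, different algorithms with the same meaning) is out of reach for this kind of hashing.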

useful for far more than detecting plagiarism, too. it can boost signal for search, allow for more nuanced semantic navigation, assist in refactoring, and help in understanding the propagation/provenance of code (which can be important for understanding licenses, etc.).



