If you think that good code is a plain, expressionless and elegant string of characters that is, at its best, utterly anonymous, think again. New research suggests that programmers have distinctive ways of writing code that can be used as digital fingerprints.
Whether it’s how they lay out code with spaces and tabs, their naming conventions with capitals and underscores, or quirks in commenting, a team from Drexel University, the University of Maryland, the University of Goettingen and Princeton can spot who wrote a piece of code with alarming accuracy. Using natural language processing and machine learning to attribute anonymous source code based on coding style alone, the team can identify the person behind the script with 95 per cent accuracy.
The work uses indicators such as layout and lexical attributes to work out who wrote a piece of code. But it also uses something called “abstract syntax trees”, which “capture properties of coding style that are completely independent from writing style.” In other words, it looks beyond naming, comments and spaces, to find hidden clues in the structure of code. Testing their machine learning software on publicly available data from Google’s Code Jam, the team showed that analysing 630 lines of code per author gives it enough information to identify the coder from a fresh piece of script with 95 per cent accuracy. Increase the line count to 1900, and the identification accuracy reaches 97 per cent.
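To give a feel for the three kinds of clue described above, here is a minimal sketch in Python. It is not the researchers' software: the article does not describe their implementation, so the function name `style_features`, the particular features chosen and the two sample snippets are all assumptions made for illustration. It uses Python's standard `ast` module to read Python source, and simply measures a few layout habits, a few naming habits, and how often each type of syntax-tree node appears.

```python
import ast
from collections import Counter


def style_features(source: str) -> dict:
    """Return a few layout, lexical and syntactic style features.

    Hypothetical helper for illustration only; not the method from the study.
    """
    lines = source.splitlines()

    # Layout: indentation character, blank lines and comment lines.
    indented = [ln for ln in lines if ln.strip() and ln[0] in " \t"]
    features = {
        "tab_indent_ratio": (
            sum(ln.startswith("\t") for ln in indented) / len(indented)
            if indented else 0.0
        ),
        "blank_line_ratio": (
            sum(not ln.strip() for ln in lines) / len(lines) if lines else 0.0
        ),
        "comment_line_ratio": (
            sum(ln.strip().startswith("#") for ln in lines) / len(lines)
            if lines else 0.0
        ),
    }

    tree = ast.parse(source)
    nodes = list(ast.walk(tree))

    # Lexical: naming habits, taken from the identifiers in the tree.
    names = set()
    for node in nodes:
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
    features["underscore_name_ratio"] = (
        sum("_" in n for n in names) / len(names) if names else 0.0
    )
    features["avg_name_length"] = (
        sum(len(n) for n in names) / len(names) if names else 0.0
    )

    # Syntactic: relative frequency of each abstract syntax tree node type.
    # These ignore spacing, comments and the names themselves.
    node_counts = Counter(type(node).__name__ for node in nodes)
    total = sum(node_counts.values())
    for node_type, count in node_counts.items():
        features[f"ast_{node_type}"] = count / total

    return features


# Two made-up snippets that do the same job in different styles.
sample_a = (
    "def add_numbers(first_value, second_value):\n"
    "    return first_value + second_value\n"
)
sample_b = "def addNumbers(a, b):\n\t# sum\n\treturn a + b\n"

features_a = style_features(sample_a)
features_b = style_features(sample_b)
```

The two snippets behave identically, yet their feature values differ in indentation and naming. A real attribution system would collect many more features over hundreds of lines per author and hand them to a trained classifier; this sketch stops at the feature extraction step.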
As well as being a neat trick, software of this kind has clear applications. Being able to accurately identify who wrote an anonymous piece of code could help authorities track down hackers more easily, for instance, or identify those committing online fraud. Now, it’s time to do with code what you used to do with handwriting as a kid: learn to fake someone else’s. [Drexel via IT World]