CITP Luncheon Speaker Series
Aylin Caliskan-Islam – De-anonymizing Programmers and Code Stylometry – Large Scale Authorship Attribution from Source Code and Executable Binaries of Compiled Code
Date: Tuesday, March 1, 2016
Location: 306 Sherrerd Hall
Streaming Live: https://www.youtube.com/user/citpprinceton
Food and discussion begin at 12:30 pm. Open to current Princeton faculty, staff, and students. Open to members of the public by invitation only. Please contact Jean Butcher at email@example.com if you are interested in attending a particular lunch.
De-anonymization is the process of identifying the source of an anonymous piece of data. Research has shown that it is possible to de-anonymize individuals in large datasets by searching for fingerprints in anonymous data. In this work, we are interested in de-anonymizing programmers by attributing authorship to code samples of unknown authors. Being able to de-anonymize programmers can aid in resolving plagiarism issues, forensic investigations, and copyright-copyleft disputes. However, being able to de-anonymize individuals among thousands is a direct threat to privacy and anonymity. In this talk, we will examine de-anonymizing programmers from the standpoint of machine learning, which provides methods applicable to code stylometry for modeling programmers’ coding styles. Stylometry is the study of writing style, which is considered unique to each author. Similarly, code stylometry shows that coding style is unique to each programmer. Consequently, it is possible to de-anonymize programmers from their coding styles. A source code sample can be represented as a feature vector consisting of stylometric properties of code, such as lexical, layout, and syntactic properties. Using these numeric representations, a machine learning classifier generates a model of each programmer’s style and attributes authorship to 14,400 anonymous source code samples of 1,600 programmers with a 94% correct classification rate, which is a breakthrough in accuracy.
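As a rough illustration of the feature-vector idea described above, the following Python sketch maps a code sample to a handful of layout and lexical features and attributes an unknown sample to the author of the nearest training sample. Everything in it is an assumption made for illustration: the feature choices, the keyword list, the names `features`, `attribute`, and `TRAINING`, the toy samples, and the use of a nearest-neighbor rule are not taken from the talk, and syntactic features are omitted entirely.

```python
import math
import re
from typing import List, Tuple

# Hypothetical keyword list, chosen only for illustration.
KEYWORDS = ("for", "while", "if", "else", "return")


def features(code: str) -> List[float]:
    """Turn one source code sample into a small numeric feature vector."""
    lines = code.splitlines() or [""]
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    names = [t for t in tokens if t not in KEYWORDS]
    n_lines = len(lines)
    n_tokens = max(len(tokens), 1)

    # Layout features: how the code is arranged on the page.
    layout = [
        sum(line.startswith("\t") for line in lines) / n_lines,  # tab-indented lines
        sum(not line.strip() for line in lines) / n_lines,       # blank lines
        sum(line.strip() == "{" for line in lines) / n_lines,    # brace on its own line
    ]
    # Lexical features: average name length and keyword frequencies.
    lexical = [sum(len(n) for n in names) / max(len(names), 1)]
    lexical += [tokens.count(k) / n_tokens for k in KEYWORDS]
    return layout + lexical


def attribute(training: List[Tuple[str, str]], unknown: str) -> str:
    """Return the author label of the training sample nearest to `unknown`."""
    target = features(unknown)
    nearest = min(training, key=lambda pair: math.dist(features(pair[1]), target))
    return nearest[0]


# Toy labeled samples (author, code) with deliberately different styles.
TRAINING = [
    ("author_a", "int main() {\n\tint n;\n\tfor (int i = 0; i < n; i++) {\n"
                 "\t\tn += i;\n\t}\n\treturn 0;\n}\n"),
    ("author_a", "int f(int x) {\n\tif (x > 0) {\n\t\treturn x;\n\t}\n\treturn 0;\n}\n"),
    ("author_b", "int main()\n{\n    int totalCount = 0;\n\n    while (totalCount < 10)\n"
                 "    {\n        totalCount += 1;\n    }\n\n    return totalCount;\n}\n"),
    ("author_b", "int computeValue(int inputValue)\n{\n    // clamp negative input\n"
                 "    if (inputValue < 0)\n    {\n        return 0;\n    }\n\n"
                 "    return inputValue;\n}\n"),
]

if __name__ == "__main__":
    sample = "int g(int a) {\n\twhile (a > 9) {\n\t\ta -= 9;\n\t}\n\treturn a;\n}\n"
    print(attribute(TRAINING, sample))
```

The features here are left unscaled and the classifier is the simplest one available; a system working at the scale of thousands of programmers would presumably need far richer features and a stronger learning method.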
We also tackle the much harder problem of de-anonymizing programmers from executable binaries of compiled code. Many potentially distinguishing features present in source code, e.g., variable names, are removed in the compilation process, and compiler optimization may alter the structure of a program, further obscuring features that are known to be useful in determining authorship. We examine executable binary authorship attribution by using a novel set of features that includes ones obtained by decompiling the executable binary to source code. We show that many syntactic features present in source code do in fact survive compilation and can be recovered from decompiled executable binaries. This allows us to add a powerful set of techniques from the domain of source code authorship attribution to the existing ones used for executable binaries, resulting in significant improvements to accuracy and scalability. We demonstrate this improvement on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 20 candidate programmers. We also demonstrate that our approach is robust to a range of compiler optimization settings and to binaries that have been stripped of their symbol tables. Finally, for the first time that we are aware of, we demonstrate that authorship attribution can be performed on real-world code found “in the wild” by performing attribution on single-author GitHub repositories.
Aylin Caliskan-Islam is a Postdoctoral Research Associate at CITP. Her work in two main areas, security and privacy, involves the use of machine learning and natural language processing. In her previous work, she demonstrated that de-anonymization is possible through analyzing linguistic style in a variety of textual media, including social media, cyber criminal forums, and source code. She is currently extending her de-anonymization work to include non-textual data such as binary files and developing countermeasures against de-anonymization. Aylin’s other research interests include quantifying and classifying human privacy behavior and designing privacy nudges to avoid private information disclosure as a countermeasure. At Princeton, she works with Dr. Arvind Narayanan on text sanitization of sensitive documents for public disclosure, which can enable researchers to share data with linguists, sociologists, psychologists, and computer scientists without breaching the research subjects’ privacy. Her work has been featured in prominent privacy and security conferences such as the USENIX Security Symposium, the IEEE Symposium on Security and Privacy, the Privacy Enhancing Technologies Symposium, and the Workshop on Privacy in the Electronic Society. In addition, she has given lectures and talks on privacy, security, and machine learning subjects at the Chaos Communication Congress and Drexel University. She holds a PhD in Computer Science from Drexel University and a Master of Science in Robotics from the University of Pennsylvania.