Security

De-anonymizing Programmers

Large Scale Authorship Attribution from Executable Binaries of Compiled Code and Source Code
Last year I presented research showing how to de-anonymize programmers based on their coding style. This is of immediate concern to open source software developers who would like to remain anonymous. On the other hand, being able to de-anonymize programmers can help in forensic investigations, or in resolving plagiarism claims or copyright disputes. I will report on our new research findings in the past year. We were able to increase the scale and accuracy of our methods dramatically and can now handle 1,600 programmers, reaching 94% de-anonymization accuracy. In ongoing research, we are tackling the much harder problem of de-anonymizing programmers from binaries of compiled code. This can help identify the author of a suspicious executable file and can potentially aid malware forensics. We demonstrate the efficacy of our techniques using a dataset collected from GitHub.
It is possible to identify individuals by de-anonymizing different types of large datasets. Once individuals are de-anonymized, different types of personal details can be detected from data that belong to them. Furthermore, their identities across different platforms can be linked. This is possible through utilizing machine learning methods that represent human data with a numeric vector that consists of features. Then a classifier is used to learn the patterns of each individual, to classify a previously unseen feature vector. Tor users, social networks, underground cyber forums, the Netflix dataset have been de-anonymized in the past five years. Advances in machine learning and the improvements in computational power, such as cloud computing services, make these large scale de-anonymization tasks possible in a feasible amount of time. As data aggregators are collecting vast amounts of data from all possible digital media channels and as computing power is becoming cheaper, de-anonymization threatens privacy on a daily basis. Last year, we showed how we can de-anonymize programmers from their source code. This is an immediate concern for programmers who would like to remain anonymous. (Remember Saeed Malekpour, who was sentenced to death after the Iranian government identified him as the web programmer of a porn site.) We scaled our method to 1,600 programmers after last year’s talk on identifying source code authors via stylometry. We reach 94% accuracy in correctly identifying the 1,600 authors of 14,400 source code samples. These results are a breakthrough in accuracy and magnitude when compared to related work. This year we have been focusing on de-anonymizing programmers from their binaries of compiled code. Identifying stylistic fingerprints in binaries is much more difficult in comparison to source code. Source code goes through compilation to generate binaries and some stylistic fingerprints get lost in translation while some others survive. We reach 65% accuracy, again a breakthrough, in de-anonymizing binaries of 100 authors. De-anonymization is a threat to privacy but it has many security enhancing applications. Identifying authors of source code helps aid in resolving plagiarism issues, forensic investigations, and copyright-copyleft disputes. Identifying authors of binaries can help identify the author of a suspicious executable file or even be extended to malware classification. We show how source code and binary authorship attribution works on a real world datasets collected from GitHub. I hope this talk raises awareness on the dangers of de-anonymization while showing how it can be helpful in resolving conflicts in some other areas. Binary de-anonymization could potentially enhance security by identifying malicious actors such as malware writers or software thieves. I would like to conclude by mentioning two future directions. Can binary de-anonymization be used for malware family classification and be incorporated to virus detectors? Obfuscators are not the counter measure to de-anonymizing programmers. We can identify the authors of obfuscated code with high accuracy. There is an immediate need for a code anonymization framework, especially for all the open source software developers who would like to remain anonymous.

Additional information

Type lecture
Language English

More sessions

12/27/15
Security
Joanna Rutkowska
Hall 2
Can we build trustworthy client systems on x86 hardware? What are the main challenges? What can we do about them, realistically? Is there anything we can?
12/27/15
Security
Hall 6
Unser Vortrag demonstriert einen PLC-only Wurm. Der PLC-Wurm kann selbstständig ein Netzwerk nach Siemens Simatic S7-1200 Geräten in den Versionen 1 bis 3 durchsuchen und diese befallen. Hierzu ist keine Unterstützung durch PCs oder Server erforderlich. Der Wurm „lebt“ ausschließlich in den PLCs.
12/27/15
Security
Hall 2
Dr. Peter Laackmann und Marcus Janke zeigen mit einem tiefen Einblick in die Welt der Hardware-Trojaner, auf welchem Wege „Institutionen“ versuchen können, sich versteckten Zugang zu Sicherheits-Hardware zu verschaffen.
12/27/15
Security
Yaniv Balmas
Hall G
Key-Loggers are cool, really cool. It seems, however, that every conceivable aspect of key-logging has already been covered: from physical devices to hooking techniques. What possible innovation could be left in this field?
12/27/15
Security
Ilja van Sprundel
Hall 2
This presentation covers windows kernel driver security issues. It'll discuss some background, and then give an overview of the most common issues seen in drivers, covering both finding and fixing issues.
12/27/15
Security
Alexander Graf
Hall 2
Did you ever want to have access to a few hundred thousand network end points? Or a few hundred thousand phone numbers? A short look behind the curtains of how not to do network security.
12/27/15
Security
Hall 1
For years SCADA StrangeLove team speaks about vulnerabilities in Industrial Control Systems. Now we want to show by example of railway the link between information security and industrial safety and demonstrate how a root access gained in a few minutes can bring to naught all the years of efforts that were devoted to the improvement of fail-safety and reliability of the ICS system. Railroads is a complex systems and process automation is used in different areas: to control power, switches, ...