Rosetta’s Way Back to the Source
Towards Reverse Engineering of Complex Software
What is this project about?
The Rosetta project, funded by the EU in the form of an ERC grant, aims to develop techniques to enable reverse engineering of complex software sthat is available only in binary form. To the best of our knowledge we are the first to start working on a comprehensive and realistic solution for recovering the data structures in binary programs (which is essential for reverse engineering), as well as techniques to recover the code. The main success criterion for the project will be our ability to reverse engineer a realistic, complex binary. Additionally, we will show the immediate usefulness of the information that we extract from the binary code (that is, even before full reverse engineering), by automatically hardening the software to make it resilient against memory corruption bugs (and attacks that exploit them).
In the Rosetta project, we target common processors like the x86, and languages like C and C++ that are difficult to reverse engineer, and we aim for full reverse engineering rather than just decompilation (which typically leaves out data structures and semantics). However, we do not necessarily aim for fully automated reverse engineering (which may well be impossible in the general case). Rather, we aim for techniques that make the process straightforward. In short, we will push reverse engineering towards ever more complex programs.
Our methodology revolves around recovering data structures, code and semantic information iteratively. Specifically, we will recover data structures not so much by statically looking at the instructions in the binary program (as others have done), but mainly by observing how the data is used
Research question. The project addresses the question whether the compilation process that translates source code to binary code is irreversible for complex software. Irreversibility of compilation is an assumed property that underlies most of the commercial software today. Specifically, the project aims to demonstrate that the assumption is false.
There has been quite some interest in the Rosetta project. For instance, an article about the project on Tweakers (Dutch) led to hundreds of reactions. Some informed, some not. Of course, it is fine to have opinions, but it is better to have informed opinions. On this page, I will try to explain the project in a bit more detail. For those who cannot wait: I answer some of the comments in the Tweakers’ posts here.
Why do we want it?
Most software is available only in binary form, without any access to source code. The reasons are two-fold:
- One reason is that source code may be lost. Sometimes source code disappears during periods of transition, or the source code that was archived corresponds to an older version of the executable. As a famous example of lost source code, Sega America boss Simon Jeffrey admitted in 2008 that they cannot find the source code of many of their old arcade games.
- More importantly, most software vendors do not want their clients to have access to source code. They argue that providing the source code to software, gives away their competitive edge. Also, by keeping the source code to themselves (known as the closed source model of software), they may give their other software an advantage over that of competitors, by allowing it to use undocumented features that are very difficult or impossible to use for competing products. Essentially, these vendors treat the compilation from source to binary as a pseudo-unbreakable code, and rely on the fact that it is exceedingly difficult to reverse the compilation.
Reverse engineering is commonly used in security settings, both by hackers to discover vulnerabilities in commercial software, and by security experts to discover the workings of malicious code. In either case, reverse engineering requires a huge, mostly manual effort. For large programs, reverse engineering is sufficiently hard to make it impractical, and software vendors treat compilation as an (in practice) unbreakable code. We will refer to this view as the irreversibility assumption.
If the source code is lost, the lack of access to the source is clearly undesirable. It is impossible for users to maintain, or even rewrite the software without access to source code. Perhaps the software has security vulnerabilities or other bugs that might be easy to fix if only we had access to the source. Rewriting the program from scratch is often not an option. The investment required for rewriting, testing and debugging the software from scratch may be gigantic and risky. What if the production line or security control system stops working because of wrinkles in the new code?
The second case, where vendors deliberately close the source, appears to be more complex. Who could argue against vendors trying to protect their intellectual investment and competitive edge? Some think that closed source is unfair or evil (e.g., because dominant vendors may give their own other software an advantage over that of competitors, by allowing it to use undocumented features that are very difficult or impossible to use for competing products). Others think that it is the only way to maintain the pace of innovation and that the absence of it will lead to software patents (typically said with a shudder).
I will not go into this at this point. Personally, I use mainly open source products, but go ahead if you want to use closed source stuff. I do, sometimes. Skype for instance (with some misgivings). However, beyond the question of whether closed source is good or evil, there are issues:
- What is the program doing? If we buy a program, we would like to verify that it does what it is advertised to do. For instance, can we trust the vendor on his word that all sensitive data in memory is always encrypted? This may be very important in security sensitive settings and extremely difficult to check. Moreover, we want to be sure that the program has no extra, hidden features that are harmful. For instance, few people really know what the popular communication program Skype is doing. Skype is extremely secretive. Can you be sure the protection of Skype traffic is safe? Can you be sure the program does not do harm? Could it even be spyware? Can you afford to take the risk?In a security sensitive environment, these are unacceptable properties. At the very least, it should be possible to analyse the code to determine whether the software is benign or malicious, secure or vulnerable. This is true not just for Skype, but also for less obfuscated code, such as media players, browsers, plugins, etc.
- Can we fix the software? Software contains bugs and bugs lead to crashes, misbehavior, and exploitable vulnerabilities. As good programmers generate around 2 to 10s of bugs per 1000 lines of code, it is unlikely that bugs will disappear. Sometimes, we may actually be able to fix some of the bugs, if only we had access to the source code. For instance, memory corruption (such as a buffer overflow) is one of the main attack vectors used by hackers to compromise a system. However, given the source, we could apply novel techniques such as write integrity testing (WIT) and stack protection to prevent the majority of such attacks. Without the source, we are at the mercy of the software vendor. As adequate protection introduces overhead, vendors may be slow in applying additional security measures, especially when better performance sells products.
Explicitly then: the project does not aim at illegal activities, violating copyrights and plagiarising other people’s code. It does aim to make it possible to analyse code that you bought and paid for, and to make it possible to fix broken software without the code. In particular, one thing I want to do with the data structures after I have excavated them, is protect them. Against common attacks, like buffer overflows. The idea is simple enough (the execution is not!): if you know there is a buffer of length 128 at location X, make sure that all accesses to the buffer stay within the buffer boundaries.
Whatever the motivation, the scientific question is whether it is even possible to revert an existing, complex program to readable source code. At first sight, the task is hopeless (which is why companies like Microsoft have been able to build on the irreversibility assumption). Binary code lacks most of the features that enable programmers to understand well-structured programs in a higher-level language. Let us consider some of the concepts that are lacking at the binary level:
- data structures (e.g., knowledge that a certain structure consists of 2 integers, 4 bytes, and a pointer);
- higher-level code (e.g., that a sequence of instructions corresponds to a tree walk algorithm);
- semantics (e.g., meaningful symbol names or knowledge of the purpose of functions and variables).
A tiny bit of technical stuff (bit not much!)
Let us zoom in on these difficulties a bit. To do so, we cannot avoid looking at the way modern computers and compilers operate. Nevertheless, by abstracting away the lowest-level details, the explanation should be at a sufficiently high level to be understood by a general audience, and sufficiently specific to satisfy domain experts.
Missing data structures and semantics. Arguably the hardest challenge is the lack of information about data structures and their semantics. Programmers are taught to design their data structures first and write the code second. Phrased differently, the data structures form the heart of the program. Not having the data structures makes reverse engineering an arduous, manual task. Unfortunately, after compilation the data structures themselves are completely invisible. To illustrate this, suppose a function in a program has a variable that represents an employee record. The data type used for the record consists of an array of up to 128 characters (bytes) to represent the employees name, a 16 bit integer number to represent the year of birth, a byte to represent the month, and another byte to represent the day. Most programming languages have language constructs to define such a record. For instance, the left hand side of the figure below shows how this could be defined in the language C.
Things get much worse. For instance, compiler optimizations may change the binary beyond recognition. This is known as WYSINWYX (What You See Is Not What You eXecute). Also, allocations on the stack are more difficult than explicit allocations on the heap.
Missing higher-level code and semantics. Besides the data structures, we also lack the actual code in a highlevel programming language, but here at least we have the instructions at the assembly level (shown as 2 instructions of pseudo-assembly in the figure on the right). Although we no longer have the names of fields and variables, and compiler optimizations may well have changed the program structure considerably, good programmers are able to recognize certain blocks of assembly code and translate them to a higher-level equivalent. Doing so is still manual labour, but unlike with data structures, we know where to start.
- Will all the tools be released as open source? To this question I replied that I cannot promise this yet, but that this is what I normally do. I stress that this is what I have always done. Many of my projects (Argos, Streamline) are available as open source. However, I have to be a bit cautious as I am not sure about legal issues, licenses of software that we will use etc. Also, code has to be stable (and I know from experience that software, once released, typically takes a lot of time in maintenance, bug fixes, etc). At any rate, the results of this research will be made public. As always we intend to publish how we do this and how well things work. So anyone can reproduce the results.
- Is the aim to destroy the software industry? What a strange idea. I do not aim to plagiarise commercial software, kill off any industries, or work for miscreants that work evil. I do want to enable you to analyse and improve the software you bought or downloaded.
- Compiler-optimisations will make it impossible to reverse engineer. Ah yes, they make it harder. An unrolled loop may no longer look like a loop, and if the unrolled loop accesses an array, it is difficult to recognise the accesses to the array as such (and thus to recover the array). But impossible? No.
- Will the RE be fully automated? No. I do not think it can be. Some higher level semantics and naming will still be done by a human engineer. The plan is to make RE straightforward rather than close to impossible.