# Review of the SAILR paper

###### tags: `review` `post` `goto` `sailr` `paper` `cfg` `restructuring`

Hello everyone! πŸ‘‹

This is a (not so) brief (and unsolicited πŸ˜†) review of the [SAILR paper](https://www.zionbasque.com/files/publications/sailr_usenix24.pdf). It's a nice piece of work, and it's interesting for us since it compares against an old paper of ours, [A Comb for Decompiled C Code](https://rev.ng/downloads/asiaccs-2020-paper.pdf).

This was put together by [@alerevng](https://twitter.com/alerevng) and Pietro, co-founders of rev.ng and authors of the above-mentioned paper. It was originally conceived as a series of tweets, but it got too long; hence the organization of the post.

---

Before we begin, please consider subscribing to our newsletter. We're going to release the new website and start the closed beta *very* soon. People will be invited FIFO, so do it now.

:::success
<center><strong><a href="https://rev.ng/register-for-nightly.html">Register to the closed beta</a></strong></center>
:::

Now, let's get to it! πŸ’ͺ

---

> ![](https://pad.rev.ng/uploads/upload_3ebc3b068a9c1528c016135d421df17d.png)

I disagree. The original could be assembly, or written in another programming language.

The goal of a decompiler is to generate code that is easy to understand for humans and easy to analyze for other program analysis tools. "Code easy to understand and analyze" is a multi-objective problem. Sources of complexity include: unstructured control flow (`goto`s), complex expressions in `if` conditions (see DREAM), duplicated code blocks, and more.

However, the authors have a point: eliminating `goto`s at any cost, like our AsiaCCS paper does, is an extreme approach that only cares about one source of complexity. Moreover, there's a cognitive cost in duplicating code. DREAM pays a cost too, mostly in the form of complex `if` conditions.

However, the paper frames the role of a decompiler as trying to reconstruct the original source code.
This dismisses approaches based on duplication, because by definition they don't try to match the original code. Instead, they try to rewrite the code to be less complex as a whole. Of course, there is the risk of duplicating too much code (a risk we ignored in our AsiaCCS paper). But if duplication is treated as a capability, part of a broader policy, it can be beneficial. Spoiler: this is something we're working on πŸ™‚

---

Duplication can also be beneficial! In fact, each copy of the duplicated block will run under simpler conditions. This can enable dead code elimination to simplify it! We call this "basic block specialization".

:::success
### Original source code
```c
int *A = NULL;
int *B = NULL;

A = malloc(sizeof(int));
if (A == NULL)
  goto cleanup;

B = malloc(sizeof(int));
if (B == NULL)
  goto cleanup;

other_stuff(A, B);

cleanup:
if (A != NULL)
  free(A);
if (B != NULL)
  free(B);
return;
```
:::

:::success
### Post combing
```c
int *A = NULL;
int *B = NULL;

A = malloc(sizeof(int));
if (A == NULL) {
  if (A != NULL)
    free(A);
  if (B != NULL)
    free(B);
  return;
}

B = malloc(sizeof(int));
if (B == NULL) {
  if (A != NULL)
    free(A);
  if (B != NULL)
    free(B);
  return;
}

other_stuff(A, B);

if (A != NULL)
  free(A);
if (B != NULL)
  free(B);
return;
```
:::

:::success
### After optimizations
```c
int *A = NULL;
int *B = NULL;

A = malloc(sizeof(int));
if (A == NULL)
  return;

B = malloc(sizeof(int));
if (B == NULL) {
  free(A);
  return;
}

other_stuff(A, B);

free(A);
free(B);
return;
```
:::

We can debate whether these transformations are desirable in decompiled code, but I don't blindly accept that we need to mimic what the original source does.

---

> ![](https://pad.rev.ng/uploads/upload_77a526136cd5e5d761c1835846d0010d.png)

As a consequence of the previous observation, I do not consider this a good metric to optimize for.

---

> ![](https://pad.rev.ng/uploads/upload_e3212049420571fde70b78e12dd03ced.png)

Not necessarily.
For instance, the C `switch` construct cannot be translated directly to assembly, and there are many ways to lower it, each with different trade-offs.

---

> ![](https://pad.rev.ng/uploads/upload_ac915a8132c4dbc5dc9411cc54550841.png)

It's true that they exist, and in large numbers. Linus himself has said good things about them in the past. But, in most cases, `goto`s in Linux are well structured and used to [tidily clean up resources on failure](https://github.com/torvalds/linux/blob/6764c317b6bb91bd806ef79adf6d9c0e428b191e/block/bsg-lib.c#L398).

When *writing* code, duplication is bad for maintainability and increases the opportunities for errors. Reverse engineers, instead, have to *understand* code: duplication is not necessarily bad for that.

---

> ![](https://pad.rev.ng/uploads/upload_8ae58596dbecd621952be5ef2a63392b.png)

Again, in terms of software engineering, `goto`s are the lesser evil in certain situations. This does not mean that during decompilation we can't do better. We're not *writing* code, we're *reading* code.

---

> ![](https://pad.rev.ng/uploads/upload_14c74af8db69518ed16e8c9f1f3dc538.png)

This is the exact opposite of what we try to do. It is an approach with some value, but it doesn't sound scalable or future-proof: it would end up requiring updates for each new ISA-platform-compiler tuple.

In compilers, one usually normalizes the code and enforces invariants so that downstream passes can opportunistically perform effective transformations. In a decompiler, trying to pattern-match the compiler's behavior 1-to-1 is like, in a compiler, trying to pattern-match the developer's code 1-to-1. Enumerating all the possible situations is not viable; it's better to devise a more general solution and refine it as you meet situations you don't handle well. It takes effort over a long period of time, but it pays back.

That said, the effort they put into studying what compilers do is valuable, and it complements manual QA on the decompiled code very well.
We'll certainly take notes from these findings to further refine our approach! πŸ™‚

---

> ![](https://pad.rev.ng/uploads/upload_a1c8c7b770d397d7a3aa4fea6f1173aa.png)

The LLVM optimization pipeline runs SimplifyCFG many times: https://github.com/llvm/llvm-project/blob/3b82336/llvm/lib/Passes/PassBuilderPipelines.cpp

Basically, it runs after each set of transformations that can change the CFG. Compilation pipelines are iterative in the transformations they apply; they cannot simply be "reversed".

---

> ![](https://pad.rev.ng/uploads/upload_de671540521921a9f50a015a18652497.png)

Using an average of absolute values doesn't seem like a good idea: this metric is heavily affected by the average size of a function. It does not capture the fact that a decompiler might be very good on small functions but very bad on big functions. As a consequence, such a decompiler might score better than one that is bad on small functions but better on large ones. A metric scaled on the function size might have been more effective.

---

> ![](https://pad.rev.ng/uploads/upload_dc1f23cdd7171cb5033c3f5dc19b79b3.png)

It would have been nice to specify which one is bit-rotten and which one is closed source. πŸ˜…

By the way, we very much like their approach of re-implementing the various techniques; it seems quite fair. Good job!

As a side note, we're going to open-source the whole decompilation pipeline (but not the analyses). The part of rev.ng we're dealing with here *will* be open source. \o/

---

> ![](https://pad.rev.ng/uploads/upload_84b08416e08002bd957248329349909c.png)

It's not entirely clear to me why they don't consider their approach to be like any other graph-schema-matching algorithm. They have made a significant effort to find a large number of previously unexploited graph schemas, which can benefit the final quality. Still, the approach seems to be smart preprocessing (basically all of Section 5) plus regular graph-schema matching, just with better schemas.
---

> ![](https://pad.rev.ng/uploads/upload_aac94d197e1724c9330aba5c5e607522.png)

The appropriate term would be "lowered" πŸ€“ The term "lifted" is, AFAIU, a reference to reversing what traditional compilers do along the pipeline, i.e., going from a "high-level" representation to a "low-level" one (hence, lowering).

---

> ![](https://pad.rev.ng/uploads/upload_c88c3cb6fed83efd386f1d4f4073af74.png)

I don't quite buy this argument. They talk about the high quality of the original C code, but even in actively maintained codebases there are horrors like `goto`s and labels wrapped in macros; see the [`vfprintf` internals from the `glibc`](https://elixir.bootlin.com/glibc/latest/source/stdio-common/vfprintf-internal). Imagine this compounded with inlining. The source of an actively maintained package with millions of users might be optimized for performance more than for readability. So "quality" on its own doesn't necessarily mean "readability", which is what a reverse engineer really values.

Also, due to the graph edit distance metric they chose to adopt, they are forced to evaluate on programs for which they have the source code. Not sure how representative these programs are of, e.g., malware or legacy code.

---

> **DREAM**
> ![](https://pad.rev.ng/uploads/upload_23841842d0ca3fa26446dc26e5abc1c8.png)
> **Phoenix**
> ![](https://pad.rev.ng/uploads/upload_5923f5a57ea93da98a397f85bf646f48.png)
> **SAILR**
> ![](https://pad.rev.ng/uploads/upload_f91dee24736b095a1682d4e92ea183c6.png)

This is how rev.ng would decompile this code:

```c
int64_t schedule_job(int64_t _argument0, int64_t _argument1, int64_t _argument2) {
  if (_argument0 && _argument1) {
    complete_job();
    if (_argument2 != EARLY_EXIT) {
      next_job();
      refresh_jobs();
      fast_unlock();
    }
  } else {
    refresh_jobs();
    if (_argument1) {
      fast_unlock();
    }
  }
  complete_job();
  log_workers();
  return job_status(_argument1);
}
```

---

> ![](https://pad.rev.ng/uploads/upload_d49dc0602edb6d16c0543b78f1c11b4a.png)

Eh, not really.
What about `if` statements using short-circuit? If you're unable to put the second part of the condition in a single expression, what do you do? This happens all the time.

```c
if (a() && b()) {
  foo();
} else {
  bar();
}
```

```graphviz
digraph {
  bgcolor="#161b22"
  node [color="#6c7278",fontname="monospace",fillcolor="#24282f",fontcolor=white,shape=box,style="rounded,filled",width="1.3"];
  edge [color=white,fontcolor=white,fontname=monospace];
  A [label="a()?"]
  B [label="b()?"]
  then [label="foo()"]
  else [label="bar()"]
  A -> B:n [label="yes"]
  B -> then:n [label="yes"]
  A -> else:ne [label="no"]
  B -> else:n [label="no"]
  else -> C
  then -> C
}
```

---

> ![](https://pad.rev.ng/uploads/upload_970eccdf425292a0f878c8056b78ff3b.png)

I'm not an expert in GCC internals, but what the compiler driver exposes in terms of optimization pipeline configuration is very limited. Transformations that might induce `goto`s run multiple times at multiple points, and it's not always possible to disable them individually. They are often canonicalization passes, not really "optimizations".

---

> ![](https://pad.rev.ng/uploads/upload_5718e8406ad1e7537f4f8b25f5497a55.png)

Eh, this is a major thing. Real-world binaries have inlined code. My feeling is that, due to their choice of metric (graph edit distance), they cannot do otherwise, because inlining would turn the results upside down. But this makes the reported results less convincing.

---

> ![](https://pad.rev.ng/uploads/upload_a29bbfdd585330eba977a170a4725049.png)

This seems to be a CFG recovery problem. It can be circumvented by using debug info to detect whether a function is `noreturn`.

---

> ![](https://pad.rev.ng/uploads/upload_fdeadee5b034a9e42a7db31e7a104d2f.png)

This would be an interesting optimization to implement. It just has this feeling of being quadratic. I wonder if it's possible to do better than quadratic with more reasoning, possibly exploiting some analysis that compiler research has already cooked up for us.
---

> ![](https://pad.rev.ng/uploads/upload_63cb8d2213fc897f88aaa72c4022710e.png)

rev.ng would produce the following result:

```c
if (a && b) {
  puts("first print");
  puts("second print");
  puts("third print");
} else {
  puts("second print");
  if (b)
    puts("third print");
}
```

---

> ![](https://pad.rev.ng/uploads/upload_7e8f585db0f0c0ff2ed3c48fcacafce8.png)

I don't see why this graph would induce a `goto`.

---

> ![](https://pad.rev.ng/uploads/upload_4ea6d8fe3c763d545d81292114529683.png)

If by "iteratively" they mean until a fixed point, that's going to be quite expensive.

---

> ![](https://pad.rev.ng/uploads/upload_03969d9224cc12c82ed24bcdf39ebb01.png)

AFAIU, the authors first set an arbitrary threshold to let some `goto`s survive, and then apply something very similar to our combing. I understand one contribution is ISD deopt, but the rest sounds like "combing with a threshold", which is good but similar to what we did.

---

> ![](https://pad.rev.ng/uploads/upload_582a3c9d942e83d47faea98bda830f5e.png)

rev.ng also has a "short-circuit reduction" strategy; it's an important part of our work. See below how some occurrences of short-circuit reduction are "invisible" through the lens of graph edit distance, showing some limits of this metric.

---

> ![](https://pad.rev.ng/uploads/upload_6bef6e8043b57cdf81df830d8d9f3d6e.png)

We used cyclomatic complexity, and it was a mistake. It's true that it's flawed: it doesn't distinguish an `if-then-else` from a `while (true)`, and it correlates too strongly with SLOC. Instead, we believe a much more promising metric is cognitive complexity: https://www.sonarsource.com/docs/CognitiveComplexity.pdf

---

Let's now spend a moment on how good a metric graph edit distance is for measuring the quality of the results. Three observations.

---

**Observation #1**

> ![](https://pad.rev.ng/uploads/upload_a3e564a3414e480740d524a9a9d2faf3.png)

Prior work using graph edit distance used it for binary-to-binary comparisons.
SAILR uses it to compare source-to-source the original and the decompiled code. Both are re-parsed to produce a Control Flow Graph, and then the two CFGs are compared. Going from source to CFG is not 1-to-1 and can be done in many ways, even in simple cases like `switch` or short-circuiting.

Consider the figures below: a code snippet (Figure A), its CFG (Figure B), and various valid decompilations of it (Figures C, D, E). If the CFG for E is built with short-circuiting semantics, E and C have the same CFG, hence the same graph edit distance. However, E is clearly superior, and you simply can't see that through the lens of graph edit distance. Also, no-goto decompilers have strategies to go from D to E, but graph edit distance cannot reward them, since it can't distinguish E from C.

**Figure A**

```c
if (a && *a) {
  do_X();
} else {
  do_Z();
}
do_Y();
```

**Figure B**

```graphviz
digraph {
  bgcolor="#161b22"
  node [color="#6c7278",fontname="monospace",fillcolor="#24282f",fontcolor=white,shape=box,style="rounded,filled",width="1.8"];
  edge [color=white,fontcolor=white,fontname=monospace];
  A [label="if (a)"];
  deref_A [label="if (*a)"];
  X [label="do_X();"]
  Y [label="do_Y();"]
  Z [label="do_Z();"]
  A -> deref_A;
  deref_A -> X;
  A -> Z;
  deref_A -> Z;
  X -> Y;
  Z -> Y;
}
```

**Figure C**

```c
// With graph-schema matching
if (a) {
  if (*a)
    do_X();
  else
    goto Z_label;
} else {
Z_label:
  do_Z();
}
do_Y();
```

**Figure D**

```c
// With no-goto like rev.ng
if (a) {
  if (*a)
    do_X();
  else
    do_Z();
} else {
  do_Z();
}
do_Y();
```

**Figure E**

```c
// Perfect outcome
if (a && *a) {
  do_X();
} else {
  do_Z();
}
do_Y();
```

---

**Observation #2**

> ![](https://pad.rev.ng/uploads/upload_fb815a1ca7ea2f58c90bdfd97edd4347.png)

To justify the use of graph edit distance for source-to-source comparison, they cite the Decomperson study, in which humans tried to write C code that would compile to assembly similar to a reference binary.
The authors of SAILR take the entries written by the humans and measure the evolution of GED over time. Two problems with that. First, they can admittedly do it only on very small examples, because computing graph edit distance is NP-hard, and what they observe on small examples might not hold on larger ones. Second, the graph edit distance doesn't converge to 0: the reverse engineers in the study continued to change their code even after it hit GED 0 (around submission 85-90), until it converged to GED 2. This seems to hint that what is being optimized is not GED.

---

**Observation #3**

![](https://pad.rev.ng/uploads/upload_49c4cec146ecdadaa31422a0723c846f.png)

AFAIU, CFGED is a new metric that the authors define in this paper. I don't think evaluating it only on graphs with fewer than 12 nodes is meaningful, especially since the graphs where control flow structuring algorithms go sideways in the real world can be much, much bigger. I get that doing it on much larger graphs runs into NP-hardness, but this still leaves me with a little less confidence in its meaningfulness.

---

![](https://pad.rev.ng/uploads/upload_444d3dc4f50eee4b9e4a76fac3db15ed.png)

Not sure what version they used; the last publicly available demo was quite old.

---

Overall, I feel their efforts are not going in the right direction, for a couple of reasons.

1. Learning the internals of a compiler is good. Trying to "reverse" them, or overfit on them, is not. It won't scale, and it's not effective in general due to the iterative nature of compiler pipelines.
2. Trying to be similar to the original code is not necessarily a goal. Code is written to be maintainable. Analysts do not need to maintain code, just read it. In this context, sometimes deviating from the original source is good (see the basic block specialization example above).
3. Combing with a threshold is a good approach, but there is more to discuss.
If one looks closely at the `goto`s that IDA emits, they are often not due to being unable to correctly match/transform/duplicate/deduplicate things. Take the following example:

```c
switch (x) {
case 0:
  do_0();
// Missing case 1
case 2:
  do_2();
case 3:
  do_3();
default:
  do_default();
}
```

Here's the CFG:

```graphviz
digraph {
  bgcolor="#161b22"
  node [color="#6c7278",fontname="monospace",fillcolor="#24282f",fontcolor=white,shape=box,style="rounded,filled",width="1.8"];
  edge [color=white,fontcolor=white,fontname=monospace];
  pre [label="x > 3"]
  pre -> switch [label="no"]
  pre -> do_default:n [label="yes"]
  switch [label="switch(x)"];
  do_0;
  do_2;
  do_3;
  do_default;
  switch -> do_0:n [label="0"]
  switch -> do_2:n [label="2"]
  switch -> do_3:n [label="3"]
  switch:sse -> do_default:nnw [label="1"]
}
```

IDA will produce something like this:

```c
if (x > 3)
  goto label;

switch (x) {
case 0:
  do_0();
case 1:
label:
  do_default();
case 2:
  do_2();
case 3:
  do_3();
}
```

Why? Because IDA wasn't able to understand that the `do_default` block runs under conditions that are the complement of all the other cases of the switch. Specifically, `do_default` runs if `x > 3 || x == 1`, while all the other cases cover `x == 0 || x == 2 || x == 3`. If one of the cases is the complement of all the others... that's `default`! Things like this are all over the place and require approaches that are not purely CFG-based. This is something we're working on right now.

---

Let's now get to the experimental results.

![](https://pad.rev.ng/uploads/upload_63887ca8fb56e1feccbbe757070a42cb.png)
![](https://pad.rev.ng/uploads/upload_b17aad37ab27f3fd03ff03b9e5c1ca76.png)

The focus is pretty much on average CFGED, but in Table 3 they report other interesting data. Looking at median CFGED, the difference among the tools is not so stark anymore, and GHIDRA even has a better median, which means that GHIDRA keeps CFGED low for a larger number of functions than SAILR.
I'd be curious about other percentiles. My guess is that `revng` comes out so badly in terms of average CFGED because there are a few functions that just blow up with duplication, but we'd have to investigate that.

The data about "Calls" is a proxy for code duplication and roughly matches the data in our AsiaCCS paper about code duplication. It is very likely caused by a small set of really bad functions, and it's most definitely the cause of our horrible CFGED.

We have already acknowledged that our approach to code duplication in AsiaCCS was just too radical. We are currently evolving it to take a less simplistic view of code quality, and we'll definitely integrate some ideas from SAILR too!

In particular, the analysis in Section 5 provides a good set of examples from which to design new (de)compiler optimizations, as they did. This is probably the most interesting part of the paper. These (de)optimizations are general enough to be used both by graph-schema-matching and by duplication-based decompilers. And we can do even more with smarter, less extreme policies for duplication, to improve the readability of decompiled code.

---

That's all, folks. I'd like to thank the authors for considering our work! I really hope to see more of that in the upcoming releases! πŸ™‚

---

If you got this far...

:::success
<center><strong><a href="https://rev.ng/register-for-nightly.html">Register to the closed beta</a></strong></center>
:::