REP1: Propagating Scalar Types

# REP1: Propagating Scalar Types ###### tags: `design` `vma2` `rep` ## What is this? We need to propagate the flavor of scalar types in the IR. There will be two users of this: 1. DLA: if we get fine grained information about scalar types on IR instructions, the data types we reconstruct will benefit from them. 2. Backend: propagating types can reduce casts. These two users work on different representations, the former on LLVM IR, the latter in MLIR on the Clift dialect. ## Why is it important? We no longer want `generic32_t` on DLA and we need to drop cast from `generic32_t` to `uint32_t` and the like. ## Design The idea is to have a small pipeline with multiple frontends, an "IR" (just a graph) and then a backend where results are produced, i.e., a map from an element of the input IR to a type of the model. ```graphviz digraph { node [shape=box] rankdir = LR; llvm [label="LLVM IR"] clift [label="Clift"] tfg [label="Type Flow Graph"] output [label="map<Use *, QualifiedType>"] llvm -> tfg clift -> tfg tfg -> output } ``` The graph will be *undirected*. ## Roadmap * Implement the graph * I suggest to use `GenericGraph` with a `ForwardNode`. * Implement a `verify` method for the graph. * Make sure the graph can be serialized by implementing `DOTGraphTraits` * Implement the LLVM IR frontend * Test * Next phase: implement propagation ## Readings * [Crash course on types in the model](https://pad.rev.ng/9RZ_MXlGRvSyu-ItFF5beg#) * `GenericGraph` * Implementation: `include/revng/ADT/GenericGraph.h` * Unit tests: `tests/unit/GenericGraph.cpp` ## Implementation ### Building an undirected graph I suggest to use `GenericGraph` with a `ForwardNode`. `GenericGraph` is a *directed* graph. An idea to have an *undirected* graph is, for each edge `A-B`, to create two directed edes `A->B` and `B->A`. In this way, the "successors" of a node in the directed graph will actually just be the "neighbors" of the undirected graph. ### Importing LLVM IR We'll likely need a node for the result of each instruction and one node for operand of an instruction. We can associate the node representing the result of an instruction to the `Instruction *` itself. The operands instead can be represented using a [`Use *`](https://llvm.org/doxygen/classllvm_1_1Use.html) (which you can think as `pair<Instruction *, unsigned>` where `unsigned` represents the index of the operand). #### Simple binary operation ```llvm define i32 @func(i32 %a, i32 %b) { %c = add i32 %a, %b ret i32 %c } ``` ```graphviz graph { node [shape=box] rankdir = TB; A -- C1 B -- C2 C1 -- C C2 -- C C -- ret1 } ``` #### Comparison ```llvm define i1 @func(i32 %a, i32 %b) { %c = icmp eq i32 %a, %b ret i1 %c } ``` ```graphviz graph { node [shape=box] rankdir = TB; A -- C1 C1 -- C2 B -- C2 C -- ret1 } ``` Graph layout is weird, but the graph si correct. `C1` and `C2` are *not* connected to `C`. #### Reading and assign to a local variable There is type flow between the two sides of the assignment (in LLVM terms, the operands of the `store`) and there is type flow between the local variable and those reading it (in LLVM terms, between the operand of the `load` and the result). ```llvm define i32 @func(i32 %a) { %local = alloca i32 store i32 %a, ptr %local %c = load i32, ptr %local ret i32 %c } ``` ```graphviz graph { node [shape=box] rankdir = TB; local -- store2 A -- store1 store1 -- store2 local -- load1 load1 -- c c -- ret1 } ``` **TODO**: we need to better discuss how to handle pointers. One option is to have a `PointersCount` field in the node to represent the fact that the node represents `T` with N pointers (e.g., for `N = 3` we get `T ***`). ### What's in a node? The node must contain a reference to the element that generated it in the LLVM IR or in the MLIR. Therefore, it needs to be a template argument. Also, for instance in case of LLVM IR, it can either be a `Instruction *` or a `Use *`. So I expect something like this: ```c++ template<typename Reference> struct NodeBody { model::QualifiedType TheType; Reference Source; }; using LLVMIRSource = std::variant<llvm::Instruction *, llvm::Use *>; using LLVMNodeBody = NodeBody<LLVMIRSource>; ``` ### Key idea behind propagation The scalar types associated to a node can be relaxed or strict. Let's consider the hierarchy of `PrimitiveKind` (plus pointers): ```graphviz digraph { node [shape=box] rankdir = BT; { node [style=dashed] Pointer uint8ptr [label="uint8_t *"] uint16ptr [label="uint16_t *"] otherptr [label="... *"] } Generic -> Float Generic -> PointerOrNumber PointerOrNumber -> Number Number -> Unsigned Number -> Signed PointerOrNumber -> Pointer Pointer -> uint8ptr Pointer -> uint16ptr Pointer -> otherptr } ``` `generic32_t` (a primitive of type `Generic`) is very relaxed, it's compatible with any other scalar type of the same size. `uint32_t` (a primitive of type `Unsigned`) is very strict, it's only compatible with itself. The goal of the algorithm is to refine relaxed types with compatible types in order to minimize casts. However, this needs a futher design phase. ### Testing We could have some LLVM IR input file, pass it through the tool that dumps the graph and use `FileCheck` to make assertions on it (e.g., presence of an edge). ## Open points * Probably should also run on MLIR to propagate types, but in this case not everything can be changed. Probably it needs to run until fixed point along with MakeModelGEP. Should we try to update the graph between one run and the other (as opposed to rebuild it from scratch)?