Clift Type System
===

Clift provides a type system designed to be richer and higher-level than both assembly and LLVM IR.

It includes the following types:

- `LabelType`
- `ValueType`
  - `PrimitiveType`
  - `ArrayType`
  - `PointerType`
  - `DefinedType`s
    - `EnumType`
    - `StructType`
    - `UnionType`
    - `FunctionType`
    - `TypedefType`

The `LabelType`s represent the type of a label for a `goto` statement.
Informally speaking, they can only be used in `goto`-related operations.
As a consequence, they basically never propagate and never interact with other types.

On the other hand, `ValueType`s are very interconnected with each other, and effectively represent a graph.

The hierarchy among the `ValueType`s is depicted in the following graph.
The graph adopts the following conventions for representing nodes:

- nodes with dotted border represent `mlir::Attribute`s defined by the Clift Dialect
- nodes with solid border represent `mlir::Type`s defined by the Clift Dialect
- nodes with dashed border represent [MLIR Interfaces](https://mlir.llvm.org/docs/Interfaces/) (in particular a `mlir::TypeInterface`) defined by the Clift Dialect
- dashed arrows (going from Interfaces to `Type`s) mean that the source Interface is implemented by the target `Type`
- solid arrows (going from `Type`s to `Attribute`s) mean that the source `Type` contains an instance of the target `Attribute`

```graphviz
digraph {
  node [shape=box];
  rankdir = BT;

  LabelType [label=<LabelType>]

  ValueType [label=<ValueType<br/>Pure-Virtual Methods<br/>bool isConst()<br/>uint64_t getByteSize()>,style=dashed]

# Value Types
  PrimitiveType [label=<PrimitiveType<br/>ValueType Methods<br/>bool isConst()<br/>uint64_t getByteSize()<br/>Fields<br/>PrimitiveKind kind<br/>bool is_const<br/>uint64_t size>]
  ArrayType [label=<ArrayType<br/>ValueType Methods<br/>bool isConst()<br/>uint64_t getByteSize()<br/>Fields<br/>ValueType ElementType<br/>uint64_t NumElements<br/>bool is_const>]
  PointerType [label=<PointerType<br/>ValueType Methods<br/>bool isConst()<br/>uint64_t getByteSize()<br/>Fields<br/>ValueType pointee_type<br/>uint64_t size<br/>bool is_const>]
  DefinedType [label=<DefinedType<br/>ValueType Methods<br/>bool isConst()<br/>uint64_t getByteSize()<br/>Other Methods<br/>uint64_t id()<br/>llvm::StringRef getName()<br/>Fields<br/>TypeDefinition element_type<br/>bool is_const>]

# Edges to Value Types
  ValueType -> PrimitiveType [style=dashed]
  ValueType -> ArrayType [style=dashed]
  ValueType -> PointerType [style=dashed]
  ValueType -> DefinedType [style=dashed]

# Attributes
  {
    node [style=dotted]
    EnumType [label=<EnumType<br/>Methods<br/>uint64_t getByteSize()<br/>Fields<br/>uint64_t ID<br/>llvm::StringRef name<br/>ValueType underlying_type<br/>ArrayRef&lt;FieldAttr&gt; Entries>]
    StructType [label=<StructType<br/>Methods<br/>uint64_t getByteSize()<br/>Fields<br/>uint64_t ID<br/>llvm::StringRef name<br/>ArrayRef&lt;FieldAttr&gt; Fields>]
    UnionType [label=<UnionType<br/>Methods<br/>uint64_t getByteSize()<br/>Fields<br/>uint64_t ID<br/>llvm::StringRef name<br/>ArrayRef&lt;FieldAttr&gt; Fields>]
    FunctionType [label=<FunctionType<br/>Methods<br/>uint64_t getByteSize()<br/>Fields (TODO update)<br/>uint64_t ID<br/>llvm::StringRef name<br/>TypeRange Arguments<br/>ValueType ReturnType>]
    TypedefType [label=<TypedefType<br/>Methods<br/>uint64_t getByteSize()<br/>Fields<br/>uint64_t ID<br/>llvm::StringRef name<br/>ValueType UnderlyingType>]
  }

# Edges to Attributes
  DefinedType -> EnumType
  DefinedType -> StructType
  DefinedType -> UnionType
  DefinedType -> FunctionType
  DefinedType -> TypedefType
}
```

The reason why the hierarchy is designed this way is that, by design, MLIR assumes all the constructs of Dialects do not use inheritance, only composition and Interfaces.
So Clift must do this in order to play well with the rest of the MLIR infrastructure.

For this reason `ValueType` is an Interface which is implemented by `PrimitiveType`, `ArrayType`, `PointerType`, and `DefinedType`.
For the same reason, `DefinedType` cannot be derived from.
Instead the various defined types such as `StructType` are defined as `mlir::Attribute`s used by `DefinedType`.
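The composition-over-inheritance arrangement described above can be sketched in plain C++17 without any MLIR dependency. This is only an illustration: `EnumDef`, `StructDef`, `TypedefDef`, and `DefinedTypeSketch` are hypothetical stand-ins rather than the real Clift classes, and `std::variant` is used here merely to mimic a closed set of attributes that a single wrapper type forwards to.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical stand-ins for the attributes; NOT the real Clift classes.
struct EnumDef    { uint64_t id; std::string name; uint64_t size; };
struct StructDef  { uint64_t id; std::string name; uint64_t size; };
struct TypedefDef { uint64_t id; std::string name; uint64_t size; };

// Composition instead of inheritance: a closed set of alternatives.
using TypeDefinition = std::variant<EnumDef, StructDef, TypedefDef>;

// Plays the role of DefinedType: it owns a definition and forwards to it.
struct DefinedTypeSketch {
  TypeDefinition definition;
  bool is_const = false;

  bool isConst() const { return is_const; }

  uint64_t getByteSize() const {
    return std::visit([](const auto &d) { return d.size; }, definition);
  }

  uint64_t id() const {
    return std::visit([](const auto &d) { return d.id; }, definition);
  }

  std::string getName() const {
    return std::visit([](const auto &d) { return d.name; }, definition);
  }
};
```

Nothing derives from `DefinedTypeSketch`: adding a new kind of definition means adding an alternative to the variant, which loosely mirrors adding a new `Attribute` in the dialect.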
`mlir::Attributes`

All the `Attribute`s used by `DefinedType`s have a unique `uint64_t` ID that identifies them, in practice acting as a GUID.
These IDs exist because MLIR has structural typing: if two types are structurally equivalent, they are the same type to MLIR.
Having an explicit ID enables Clift to represent two different types even if they are structurally equivalent, since the ID is used to tell them apart.
This is important for decompiling big C programs, where otherwise all structurally equivalent types would collapse onto the same type.
In practice, the ID itself is held by each `Attribute`, and `DefinedType` provides an `id()` method for accessing it.

All the `Attribute`s used by `DefinedType` have a name, which can be empty.
The name is strictly optional and is meant to be used **only** for debug purposes and for serializing Clift in an eye-pleasing way.
It will never be used by the decompilation engine.
Making names optional decouples naming from transformations.
In practice, the name itself is held by each `Attribute`, and `DefinedType` provides a `getName()` method for accessing it.

`PrimitiveType`s don't have a name.
The name used for eye-pleasing serialization is pre-defined and implied by the structure of the `PrimitiveType` itself.
`ArrayType`s and `PointerType`s don't have a name either, since it's not necessary: they can always be emitted inline from the element types or the pointee types.
If someone wants to emit named array or pointer types, it's always possible to use a `TypedefType`.

Each `ValueType` that is not a `PrimitiveType` also refers to other `ValueType`s.
This introduces the possibility of having recursive types, e.g. a `StructType` with a field that is a `PointerType` that points to the parent (i.e.
the `StructType` represents a node of a linked list).

### `LabelType`

Clift label type.
It represents the type of a label of a `GoToOp`.
It basically has no other property.

### `ValueType`s

These types represent types for data manipulated by a C program.

`ValueType` is an Interface that is implemented by `PrimitiveType`, `ArrayType`, `PointerType`, `ScalarTupleType` and `DefinedType` (informally called the "Value Types" altogether).

A `ValueType` represents the idea of a type with a known size (which is an integer number of bytes) and that can optionally have the `const` qualifier.
This is represented by the fact that the Interface defines 2 pure-virtual methods:

- `bool isConst()`
- `uint64_t getByteSize()`

#### --- Implementation Guide ---

**Methods:**

| Method | C++ type | Description |
|--|--|--|
| `isConst()` | `bool` | True if the represented type is `const`-qualified in C |
| `getByteSize()` | `uint64_t` | The size of the type in bytes |

**Verify:**

No `ValueType` should have zero size, except for `void` and `FunctionType`.
The size of incomplete `UnionType`s and `StructType`s is undefined.

### `PrimitiveType`s

These types are always available and don't require the user to define them explicitly.
They are modeled around C's `stdint.h` (`int8_t`, `uint64_t` and friends) but are not 100% equivalent.

They are designed according to the following core ideas:

- enabling precise representation of the low-level semantics on the CPU (e.g. size, signedness, int-vs-float-vs-pointer) in an architecture-independent way;
- disallowing C-like types that don't have a strictly well-defined architecture-independent size.

This makes the Clift types at the same time:

- more expressive than C types: there's not only `uint64_t` but also e.g. `int128_t` (which is not standardized yet), and `float80_t` (e.g.
  floating point types on the x87 FP coprocessor), and `pointer_or_number32_t` to represent a type that is known to be an integer or a pointer but definitely not a floating point.
- more restrictive than C types: there are no vanilla C-like `int`s, `float`s, nor `bool` or `char` because their mapping onto machine code is ABI-dependent.

This design makes these types also more expressive and more restrictive than MLIR builtin types.
This choice has two consequences on the design of the Dialect:

- it prevents the use of the MLIR builtin types even for modeling primitive types;
- it prevents the use, even in part, of other MLIR dialects that use MLIR builtin types (such as `arith`), because it would require patching all those Dialects to teach them our `PrimitiveType`s too.

These two consequences mean more work for the implementation of Clift, but they give finer control over it, and they allow for stricter verification of the validity of the Clift IR, which is considered to strike the right balance with the goals of the project.

#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ type | Description |
|--|--|--|
| `kind` | `PrimitiveKind` | An `enum` with the values `GenericKind`, `FloatKind`, `PointerOrNumberKind`, `NumberKind`, `SignedKind`, `UnsignedKind`, `VoidKind` |
| `size` | `uint64_t` | The size of the type |
| `is_const` | `bool` | True if the represented type is `const`-qualified in C |

**`ValueType` Methods:**

| Method | Description |
|--|--|
| `bool isConst()` | Returns `is_const` |
| `uint64_t getByteSize()` | Returns `size` |

**Verify:**

- There is a finite list of allowed combinations of `PrimitiveKind` + `Size`, which is determined by the model equivalent type.

### `PointerType`

It represents a pointer to another `ValueType`.
#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `PointeeType` | `clift::ValueType` | The underlying pointed type |
| `size` | `uint64_t` | The size of the type |
| `is_const` | `bool` | True if the represented type is `const`-qualified in C |

**`ValueType` Methods:**

| Method | Description |
|--|--|
| `bool isConst()` | Returns `is_const` |
| `uint64_t getByteSize()` | Returns `size` |

**Verify:**

- `PointeeType` must verify
- Must be pointer-sized (**FIXME**: this is not implemented yet. We don't want to put architecture-specific stuff in the dialect, but if we don't do that this point will not be verifiable. What are we gonna do? We can just leave this out of verification, but it's a missed opportunity. Otherwise we need a way to embed in Clift some information about the size of pointers, potentially allowing for a situation with many possible pointer sizes at the same time in the same binary, e.g. for architectures with many valid pointer sizes or addressing modes, and for multi-binary or multi-arch modules).

### `ArrayType`

Represents the type of an array of `ValueType`s, with a non-zero number of elements.

Arrays with zero elements are not supported, because they contradict the important design principle of Clift `ValueType`s that they should precisely represent the size of an instance of a type in memory.
Zero-sized arrays in C don't represent this size precisely: in theory they represent something of size 0, but in practice in C they represent an array of arbitrary size.

In principle we could decide to support 0-sized array types in Clift, by specifying that they should actually be 0-sized, unlike in C.
However, this would then pose problems when decompiling Clift types to C, potentially introducing subtle differences in semantics.
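As a rough illustration of the non-zero-length rule, the following minimal C++ sketch models an array type by its element size and element count only. `ArraySketch` is a hypothetical name, the element type is reduced to its byte size, and verification of the element type itself is deliberately left out; the size formula (element count times element size) is the one this document gives for `ArrayType`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an array type; NOT the real Clift ArrayType.
struct ArraySketch {
  uint64_t element_size;    // stands in for element_type.getByteSize()
  uint64_t elements_count;  // must be non-zero

  // Only the array-specific rule is checked here: zero-length arrays
  // are rejected because their in-memory size would not be precise.
  bool verify() const { return elements_count > 0; }

  // The size in memory is exactly count * element size.
  uint64_t getByteSize() const { return elements_count * element_size; }
};
```

With this model, an array of ten 4-byte elements occupies exactly 40 bytes, while a zero-length array fails verification instead of silently standing for "some unknown number of elements".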
#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `element_type` | `clift::ValueType` | The underlying type |
| `elements_count` | `uint64_t` | Non-zero number of elements |
| `is_const` | `bool` | True if the represented type is `const`-qualified in C |

**`ValueType` Methods:**

| Method | Description |
|--|--|
| `bool isConst()` | Returns `is_const` |
| `uint64_t getByteSize()` | Returns `elements_count * element_type.getByteSize()` |

**Verify:**

- `element_type` verifies
- `elements_count > 0`

### `DefinedType`s

These types basically represent all those types in C for which the full definition is completely provided by a programmer.
This includes `enum`s, `struct`s, `union`s, function types, and `typedef`s.

As mentioned above, they have a unique ID, which is used to discriminate them given that MLIR has structural typing.
The payload of the ID is in the underlying `TypeDefinition`.

The ID makes it possible to have 2 or more otherwise structurally identical types, which is very important for decompilation, since we want to be able e.g. to treat `size_t` or `pid_t` as separate types even if under the hood they might be the same `PrimitiveType`.

As a consequence of this choice, given that e.g. `pid_t` and `size_t` are two different types in Clift (though the underlying `PrimitiveType` might be the same), a cast operation is required in Clift to transform a `pid_t` to a `size_t` or vice versa.
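The role of the ID can be illustrated with a small C++ sketch. It is not the Clift implementation: `TypedefSketch`, `structurallyEqual`, `sameType`, and `needsCast` are hypothetical names, the underlying type is reduced to a byte size, and the assumption that identity is decided by the ID alone is made here purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical model of a defined type; NOT the real Clift classes.
struct TypedefSketch {
  uint64_t id;               // unique ID, acting as a GUID
  std::string name;          // optional, for debugging/serialization only
  uint64_t underlying_size;  // stands in for the underlying PrimitiveType
};

// What a purely structural comparison would see: the ID and the name
// are ignored, so look-alike types collapse into one.
bool structurallyEqual(const TypedefSketch &a, const TypedefSketch &b) {
  return a.underlying_size == b.underlying_size;
}

// Assumption for this sketch: two defined types are the same type
// exactly when their IDs match.
bool sameType(const TypedefSketch &a, const TypedefSketch &b) {
  return a.id == b.id;
}

// Converting between distinct defined types requires an explicit cast,
// even when they are structurally identical.
bool needsCast(const TypedefSketch &from, const TypedefSketch &to) {
  return !sameType(from, to);
}
```

On a hypothetical target where both `size_t` and `pid_t` are 4 bytes wide, the two definitions are structurally equal but still distinct types, so a conversion between them needs a cast.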
#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `element_type` | `TypeDefinition` | The user-defined type this type represents |
| `is_const` | `bool` | True if the represented type is `const`-qualified in C |

**`ValueType` Methods:**

| Method | Description |
|--|--|
| `bool isConst()` | Returns `is_const` |
| `uint64_t getByteSize()` | Returns `element_type.getByteSize()` |
| `uint64_t id()` | Returns `element_type.id()` |
| `llvm::StringRef getName()` | Returns `element_type.getName()` |

### `EnumAttr`

An `EnumAttr` represents an `enum` in C.
In Clift it also has an underlying `PrimitiveType`, which is a reference to a `Signed` or `Unsigned` `PrimitiveType`.

It also has a set of values for its entries, each represented by an `EnumFieldAttr`.
References to these entries are held in an array sorted by the value of the entry.
Unlike in C, it's not possible to have multiple entries with the same underlying value.

Entries optionally have names, used **only** for eye-pleasing serialization.

#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `id` | `uint64_t` | The integer uniquely identifying this type. |
| `name` | `llvm::StringRef` | The (optional) name of the type used for serialization |
| `underlying_type` | `PrimitiveType` | A `Signed` or `Unsigned` `PrimitiveType`, representing the underlying type of the `enum` |
| `fields` | `llvm::ArrayRef<mlir::clift::EnumFieldAttr>` | An array of references to the entries, sorted by value. |

**Methods:**

| Method | Description |
|--|--|
| `uint64_t getByteSize()` | Returns `underlying_type.getByteSize()` |
| `llvm::StringRef getName()` | Returns `name` |

**Verify:**

- `underlying_type`, ignoring typedefs, must be a `Signed` or `Unsigned` `PrimitiveType`, and must verify.
- The values of the `fields` must be representable by `underlying_type`.
- The values of the `fields` must be unique and sorted in ascending order.
- The names of the `fields` must be empty or unique.

### `StructType`

A `StructType` represents a `struct` in C.
It has a set of fields, each represented by a `mlir::clift::FieldAttr`.
References to these fields are held in an array sorted in ascending order by their offsets.

Fields are not required to be dense in the `StructType`.
In case of a sparse `StructType`, the corresponding number of empty padding bytes is implied.
The `size` of the `StructType` can be larger than the offset of the last byte of the last field.
In that case, the correct number of empty padding bytes is implied.

Each field optionally has a name, used **only** for eye-pleasing serialization.

#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `id` | `uint64_t` | The integer uniquely identifying this type. |
| `name` | `llvm::StringRef` | The (optional) name of the type used for eye-pleasing serialization. |
| `size` | `uint64_t` | The size of the type in bytes. |
| `fields` | `ArrayRef<mlir::clift::FieldAttr>` | The list of fields of the struct, sorted by offset. |

**Methods:**

| Method | Description |
|--|--|
| `uint64_t getByteSize()` | Returns `size` |
| `llvm::StringRef getName()` | Returns `name` |

**Verify:**

- The `fields` must be sorted in ascending order by their offsets.
- There must not be overlapping fields.
- The names of the `fields` must be empty or unique.
- For each field:
  - The field type must verify.
  - The field type must have non-zero size.
  - `Field.Offset + Field.Size <= StructType.Size`

### `UnionType`

A `UnionType` represents a `union` in C.
It has a set of fields, each represented by a `mlir::clift::FieldAttr`.
References to these fields are held in an array.

For a `UnionType` to be valid, the starting offset of every field must be 0.
The size of a `UnionType` is the maximum among the sizes of its fields.

Each field optionally has a name, used **only** for eye-pleasing serialization.
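The field-layout rules that this document states for `StructType` and `UnionType` can be sketched in plain C++ as follows. This is a simplified model, not the dialect's verifier: `FieldSketch`, `verifyStruct`, `verifyUnion`, and `unionByteSize` are hypothetical names, and each field type is reduced to its byte size, so "the field type must verify" is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a field; NOT the real mlir::clift::FieldAttr.
struct FieldSketch {
  uint64_t offset;   // byte offset inside the aggregate
  uint64_t size;     // byte size of the field type
  std::string name;  // may be empty
};

// Field names must be empty or unique.
bool namesAreEmptyOrUnique(const std::vector<FieldSketch> &fields) {
  std::unordered_set<std::string> seen;
  for (const FieldSketch &f : fields)
    if (!f.name.empty() && !seen.insert(f.name).second)
      return false;
  return true;
}

// Struct rules: non-zero field sizes, ascending offsets, no overlap,
// and every field must end within the struct. Gaps are allowed and
// count as implied padding.
bool verifyStruct(uint64_t struct_size,
                  const std::vector<FieldSketch> &fields) {
  uint64_t previous_end = 0;
  for (const FieldSketch &f : fields) {
    if (f.size == 0)
      return false;
    if (f.offset < previous_end)  // unsorted or overlapping
      return false;
    // Overflow-safe form of: offset + size <= struct_size.
    if (f.size > struct_size || f.offset > struct_size - f.size)
      return false;
    previous_end = f.offset + f.size;
  }
  return namesAreEmptyOrUnique(fields);
}

// Union rules: at least one field, every field at offset 0 and with
// non-zero size.
bool verifyUnion(const std::vector<FieldSketch> &fields) {
  if (fields.empty())
    return false;
  for (const FieldSketch &f : fields)
    if (f.size == 0 || f.offset != 0)
      return false;
  return namesAreEmptyOrUnique(fields);
}

// The size of a union is the maximum among the sizes of its fields.
uint64_t unionByteSize(const std::vector<FieldSketch> &fields) {
  uint64_t size = 0;
  for (const FieldSketch &f : fields)
    size = std::max(size, f.size);
  return size;
}
```

Checking each field against the end of the previous one covers both the ordering rule and the no-overlap rule in a single pass, since a field that starts before the previous one ends is either out of order or overlapping.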
#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `id` | `uint64_t` | The integer uniquely identifying this type. |
| `fields` | `ArrayRef<mlir::clift::FieldAttr>` | The list of fields of the union |
| `name` | `llvm::StringRef` | The (optional) name of the type used for eye-pleasing serialization |

**Methods:**

| Method | Description |
|--|--|
| `uint64_t getByteSize()` | Returns `std::max` of the `getByteSize()` of all field types. |
| `llvm::StringRef getName()` | Returns `name` |

**Verify:**

- A union must contain at least one field.
- For each field:
  - The field type must verify.
  - The field type must have non-zero size.
  - The starting offset must be zero.
- The names of the `fields` must be empty or unique.

### `FunctionType`

TODO: massimo is changing this to enable function arguments to have names. This entry must be updated to reflect that.

A `FunctionType` represents the `ValueType` of a function.
It always has size 0.
It's the only non-`Void` 0-sized `ValueType`.

Each `FunctionType` has

- a list of argument types (with optional names), which are references to other `ValueType`s
- a single unnamed return type, which is a reference to another `ValueType`

#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `id` | `uint64_t` | The integer uniquely identifying this type. |
| `return_type` | `clift::ValueType` | A reference to the return type of the function. |
| `argument_types` | `std::vector<clift::ValueType>` | A vector of references to the `ValueType`s of the arguments. |
| `name` | `llvm::StringRef` | The (optional) name of the type used for eye-pleasing serialization. |

**Methods:**

| Method | Description |
|--|--|
| `uint64_t getByteSize()` | Returns 0 |
| `llvm::StringRef getName()` | Returns `name` |

**Verify:**

- All arguments and return type must verify
- Arguments cannot have 0 size
- `return_type` cannot be a `FunctionType` but can be `void`.
  All return types that are not `void` must have size > 0.
- The function type itself has zero size
- There should not be 2 different arguments with the same non-empty name

### `TypedefType`

It represents a strongly typed alias.
It is basically just a thin wrapper around a reference to another `Type`, the `UnderlyingType`, but unlike C `typedef`s it's not a weakly typed alias.
The `Size` must always be the same as that of the `UnderlyingType`.

The decision to make `TypedefType`s first-class citizens in the Clift Type System is driven by the need to represent e.g. struct fields or function arguments with given types.
If `TypedefType`s were not strongly typed, then a function with an integer argument would not be distinguishable from a function with a `pid_t` argument.
Given that Clift is designed to support the decompilation of semantically rich C code, this is an important thing to have; hence `TypedefType`s need to be strongly typed, otherwise they would just fade away in Clift.

#### --- Implementation Guide ---

**Parameters:**

| Parameter | C++ Type | Description |
|--|--|--|
| `id` | `uint64_t` | The ID of the type |
| `underlying_type` | `clift::ValueType` | A reference to the underlying type |
| `name` | `llvm::StringRef` | The (optional) name of the type used for serialization |

**Methods:**

| Method | Description |
|--|--|
| `uint64_t getByteSize()` | Returns `underlying_type.getByteSize()` |
| `llvm::StringRef getName()` | Returns `name` |

**Verify:**

FIXME