4205: Compilers
# Introduction
## The structure of a compiler
The compiling process has two parts. The *analysis* part breaks the
source program up into its constituent pieces and imposes a grammatical
structure on them. In addition, it alerts the user about illegal input
(lexically or syntactically ill-formed programs, etc.) and fills in the
*symbol table*. The *synthesis* part constructs the target program from
the intermediate representation and the symbol table. The two parts are
called the *front end* and the *back end*, respectively.
### Semantic Analysis
The *semantic analyzer* uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency. It
also gathers type information and saves it to the symbol table. An
important part here is *type checking* where it checks that operators
have matching operands. It might also handle *coercions*, i.e.
transforming INT + FLOAT to FLOAT + FLOAT.
## Programming Language Basics
### Aliasing
There is an interesting consequence of call-by-reference (references
passed by value): two formal parameters can refer to the same location,
in which case they are aliases of each other. You need to understand
this in order to optimize a program, because certain optimizations may
only be done if one can be sure there is no aliasing.
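A minimal C sketch of the issue (the function names are just illustrative; call-by-reference is simulated with pointers):

```c
#include <stdio.h>

/* p and q may refer to the same location. The compiler can only keep *p in
 * a register across the store to *q if it can prove p and q never alias. */
int f(int *p, int *q) {
    *p = 1;
    *q = 2;       /* if p == q, this also changes *p */
    return *p;    /* 2 when the parameters alias, 1 otherwise */
}

int main(void) {
    int x = 0, y = 0;
    printf("%d %d\n", f(&x, &y), f(&x, &x));  /* prints "1 2" */
    return 0;
}
```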
# A Simple Syntax-Directed Translator
## Introduction
Intermediate code generation produces one of two main forms: the
*abstract syntax tree*, which represents the hierarchical syntactic
structure of the source program, and *three-address instructions*.
## Syntax-Directed Translation
Syntax-directed translation is done by attaching rules or program
fragments to productions in a grammar.
### Synthesized Attributes
The idea of associating quantities with programming constructs can be
expressed in terms of grammars. We attach rules to productions of the
grammar. A *syntax-directed definition* associates:
1. With each grammar symbol, a set of attributes
2. With each production, a set of *semantic rules* for computing values
of the attributes associated with the symbols appearing in the
production.
An attribute is *synthesized* if its value at a parse-tree node N is
determined from attribute values at the children of N and N itself, and
can be evaluated by a single bottom-up traversal.
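As an example (the classic desk-calculator SDD; the grammar is illustrative, not taken from these notes), each non-terminal gets a synthesized attribute *val*:
- E $\to$ E$_1$ + T with rule E.val = E$_1$.val + T.val
- E $\to$ T with rule E.val = T.val
- T $\to$ T$_1$ \* F with rule T.val = T$_1$.val \* F.val
- T $\to$ F with rule T.val = F.val
- F $\to$ ( E ) with rule F.val = E.val
- F $\to$ **digit** with rule F.val = **digit**.lexval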
## A Translator for Simple Expressions
A syntax-directed translation scheme often serves as the specification
for a translator. There can be a conflict between a grammar that is
convenient for translation and one that is convenient for parsing.
### Abstract and Concrete Syntax
Abstract syntax trees (ASTs) are a useful starting point for creating a
translator. ASTs resemble parse trees to an extent, but the two are
conceptually different: in a parse tree the interior nodes represent
non-terminals, while in a syntax tree they represent programming
constructs. Many productions correspond to programming constructs, but
some exist only to simplify parsing; these helper productions are
dropped in the syntax tree.
## Lexical Analysis
A lexical analyzer reads characters from the input and groups them into
“token objects” with attribute values. A sequence that comprises a
single token is called a *lexeme*. A **num** token can carry a value and
an **id** token can carry a lexeme.
### Removal of White Space and Comments
White space and comments can be removed by the lexer, so the parser
won't have to worry about them.
## Intermediate Code Generation
### Construction of Syntax Trees
Syntax trees can be created for any construct, not just expressions.
Each construct is represented by a node, with children for the
semantically meaningful components of the construct. These nodes can be
represented by a class structure with an arbitrary number of children.
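A minimal C sketch of such a node representation (the type and field names are made up for illustration):

```c
#include <stdlib.h>

/* One node kind per construct; children holds the semantically
 * meaningful components of the construct. */
enum NodeKind { ND_NUM, ND_ID, ND_ADD, ND_ASSIGN, ND_IF, ND_WHILE };

struct Node {
    enum NodeKind kind;
    int value;               /* used by ND_NUM nodes */
    const char *name;        /* used by ND_ID nodes */
    int nchildren;
    struct Node **children;  /* arbitrary number of children */
};

static struct Node *new_node(enum NodeKind kind, int nchildren) {
    struct Node *n = calloc(1, sizeof *n);
    n->kind = kind;
    n->nchildren = nchildren;
    n->children = calloc((size_t)nchildren, sizeof *n->children);
    return n;
}

int main(void) {
    /* build the tree for the assignment x = 1 + 2 */
    struct Node *add = new_node(ND_ADD, 2);
    add->children[0] = new_node(ND_NUM, 0); add->children[0]->value = 1;
    add->children[1] = new_node(ND_NUM, 0); add->children[1]->value = 2;
    struct Node *assign = new_node(ND_ASSIGN, 2);
    struct Node *id = new_node(ND_ID, 0); id->name = "x";
    assign->children[0] = id;
    assign->children[1] = add;
    return 0;
}
```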
# Lexical Analysis
## The Lexical-Analyzer Generator Lex
This tool allows one to specify a lexical analyzer by writing regular
expressions that describe the tokens. The expressions are transformed
into deterministic finite automata.
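A minimal Flex specification sketch (the token set and the printing actions are made up for illustration; Flex/Lex generates the function `yylex()` from this):

```
%{
#include <stdio.h>
%}
%%
[ \t\n]+                 { /* strip white space */ }
"//".*                   { /* strip line comments */ }
[0-9]+                   { printf("NUM(%s)\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*   { printf("ID(%s)\n", yytext); }
.                        { printf("CHAR(%s)\n", yytext); }
%%
int yywrap(void) { return 1; }

int main(void) {
    yylex();   /* tokenize stdin until end of file */
    return 0;
}
```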
# Syntax Analysis
Every programming language has rules that describe the syntactic
structure of its programs. These rules are normally expressed as
context-free grammars. Grammars have significant benefits:
- A grammar gives a precise, yet easy-to-understand, syntactic
specification of a programming language.
- Certain classes of grammars can be parsed deterministically and
efficiently, and the process of constructing the parser may reveal
ambiguous constructs in the language.
## Top-Down Parsing
Top-down parsing is the problem of constructing a parse tree for the
input string, starting from the root and creating the nodes of the parse
tree in preorder (depth-first). *This corresponds to a leftmost
derivation.* At each step the problem is to determine which production
to apply for the current non-terminal. A predictive recursive-descent
parser can choose productions by looking ahead at the next input
symbols; grammars that can be parsed this way with k symbols of
lookahead form the LL(k) class.
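A minimal sketch of a predictive recursive-descent parser in C for the toy grammar E $\to$ T { + T }, T $\to$ ( E ) | digit (grammar and names are illustrative). One character of lookahead is enough to choose a production, so the grammar is LL(1):

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static int lookahead;                 /* single token (character) of lookahead */

static void advance(void) { lookahead = getchar(); }

static void match(int t) {
    if (lookahead == t) advance();
    else { fprintf(stderr, "syntax error\n"); exit(1); }
}

static void E(void);

static void T(void) {                 /* T -> ( E ) | digit */
    if (lookahead == '(') { match('('); E(); match(')'); }
    else if (isdigit(lookahead)) advance();
    else { fprintf(stderr, "syntax error\n"); exit(1); }
}

static void E(void) {                 /* E -> T { + T } */
    T();
    while (lookahead == '+') { match('+'); T(); }
}

int main(void) {
    advance();
    E();
    if (lookahead != '\n' && lookahead != EOF) {
        fprintf(stderr, "syntax error\n");
        return 1;
    }
    puts("accepted");
    return 0;
}
```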
## Bottom-Up Parsing
A bottom-up parse corresponds to the construction of a parse tree for an
input string that begins at the leaves (the bottom) and works up towards
the root (the top). A general style of bottom-up parsing is shift-reduce
parsing, and such parsers can be built for LR grammars.
### Reductions
We can think of bottom-up parsing as the process of reducing a string
*w* to the start symbol of the grammar. Each step matches a substring
and replaces it with a non-terminal. By definition, a reduction is the
reverse of a step in a derivation. The goal is to construct a derivation
in reverse, which in fact will be a rightmost derivation.
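As a small worked example (the grammar here is only for illustration), take E $\to$ E + T | T and T $\to$ **id**. The string **id** + **id** is reduced step by step: **id** + **id**, then T + **id**, then E + **id**, then E + T, then E. Read in reverse, this is exactly the rightmost derivation
$$E \Rightarrow E + T \Rightarrow E + \mathbf{id} \Rightarrow T + \mathbf{id} \Rightarrow \mathbf{id} + \mathbf{id}$$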
## Using Ambiguous Grammars
No ambiguous grammar is LR. Ambiguous grammars might still be useful
because of their expressiveness. We can specify disambiguating rules,
although these should be used with care.
### Precedence and Associativity to Resolve Conflicts
Instead of using an ambiguous grammar such as E $\to$ E + E | E \* E |
(E) | **id** directly, we could rewrite the productions so that the
grammar itself encodes precedence. That is not always the best approach,
because the parser will have more states and it is hard to modify the
precedence later. With the ambiguous grammar, conflicts between
different operators are resolved with precedence, and conflicts between
occurrences of the same operator with associativity.
## Parser Generators
*yacc* (yet another compiler-compiler) can construct a translator for a
programming language using a specification. It will parse with the LALR
method. It is used the same way as *lex* (from specification to program
that can read an input stream). The yacc specification file has three
parts:
- Declarations
- Translation rules
- Supporting C routines
#### The Declarations Part
Consists of two parts: ordinary C declarations and declarations of
grammar tokens. The grammar tokens are then available in the second and
third parts. You can make the tokens available to Lex as well.
#### The Translation Rules Part
Put the rules here, where each rule consists of a grammar production and
its associated semantic action. Unquoted names denote non-terminals,
quoted characters denote terminals. *\$\$* refers to the attribute value
associated with the head, *\$i* to that of the ith grammar symbol of the
body.
#### The Supporting C-Routines Part
Must provide *yylex()* (might come from lex/flex) which produces tokens.
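A minimal yacc specification sketch illustrating the three parts (essentially the classic textbook calculator; here `yylex()` is hand-written in the third part instead of coming from Lex):

```
%{
/* declarations part: C declarations and token declarations */
#include <ctype.h>
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token DIGIT

%%
/* translation rules part: productions with semantic actions */
line : expr '\n'        { printf("%d\n", $1); }
     ;
expr : expr '+' term    { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor  { $$ = $1 * $3; }
     | factor
     ;
factor : '(' expr ')'   { $$ = $2; }
       | DIGIT
       ;
%%
/* supporting C-routines part: must provide yylex() */
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;
}

int main(void) { return yyparse(); }
```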
### Using Yacc with Ambiguous Grammars
Some grammars will produce parsing-action conflicts, and yacc reports
how many. Unless instructed otherwise, it resolves all conflicts with
the following rules:
- A reduce/reduce conflict is resolved by choosing the conflicting
production listed first in the specification.
- A shift/reduce conflict is resolved in favour of shift (which resolves
the dangling-else ambiguity correctly).
These rules might not be what the compiler writer wants, so we can
declare associativity (*%left ’+’ ’-’*) or force an operator to be
non-associative (*%nonassoc*). You can also assign precedence.
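A sketch of the calculator above rewritten with an ambiguous expression grammar, where the conflicts are resolved in the declarations part (operator set chosen for illustration):

```
%{
#include <ctype.h>
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUM
%left '+' '-'                /* lowest precedence, left-associative */
%left '*' '/'
%right UMINUS                /* highest precedence (unary minus) */

%%
line : expr '\n'             { printf("%d\n", $1); }
     ;
expr : expr '+' expr         { $$ = $1 + $3; }
     | expr '-' expr         { $$ = $1 - $3; }
     | expr '*' expr         { $$ = $1 * $3; }
     | expr '/' expr         { $$ = $1 / $3; }
     | '-' expr %prec UMINUS { $$ = -$2; }
     | '(' expr ')'          { $$ = $2; }
     | NUM
     ;
%%
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return NUM; }
    return c;
}
int main(void) { return yyparse(); }
```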
# Intermediate-Code Generation
## Type Checking
To do *type checking* a compiler needs to assign a type expression to
each component of the source program. It must then determine that these
types conform to the rules (the type system) of the language. Type
checking has the potential of catching errors. A *sound type system*
eliminates the need for dynamic checking for type errors. A language is
*strongly typed* if its compiler can guarantee that the programs it
accepts will run without type errors.
- Static/dynamic typing is about WHEN type information is acquired
(compile time or run time).
- Strong/weak typing is about how strictly types are distinguished.
### Rules for Type Checking
Type checking comes in two forms: *type synthesis*, which builds up the
type of an expression from the types of its subexpressions, and *type
inference*, which determines the type of a language construct from the
way it is used. The latter can be used for languages that permit names
to be used without declaration but still perform type checking. *void*
is a special type that denotes the absence of a value.
### Type Conversions
There are *widening* conversions (which preserve information) and
*narrowing* conversions (which may lose information). If a type
conversion is done automatically by the compiler, it is *implicit*.
Implicit conversions are also known as *coercions* and are normally
limited to widening conversions. Two helper functions are used:
*max(a,b)*, which finds the higher of two types in the hierarchy, and
*widen()*, which widens a value to a given type.
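A minimal C sketch of these helpers, assuming a linear hierarchy char < int < float < double and emitting the conversion as text (everything here is illustrative):

```c
#include <stdio.h>

enum Type { T_CHAR, T_INT, T_FLOAT, T_DOUBLE };  /* ordered hierarchy */

static const char *type_name(enum Type t) {
    static const char *names[] = { "char", "int", "float", "double" };
    return names[t];
}

/* max(a, b): the higher of the two types in the hierarchy. */
static enum Type max_type(enum Type a, enum Type b) {
    return a > b ? a : b;
}

/* widen(x, from, to): emit a conversion of x only when needed. */
static void widen(const char *x, enum Type from, enum Type to) {
    if (from < to)
        printf("t = (%s) %s\n", type_name(to), x);
}

int main(void) {
    /* For i + f with i an int and f a float, both operands are widened
       to max(int, float) = float before the addition. */
    enum Type t = max_type(T_INT, T_FLOAT);
    widen("i", T_INT, t);      /* prints: t = (float) i */
    widen("f", T_FLOAT, t);    /* prints nothing: already a float */
    return 0;
}
```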
## Control Flow
## Backpatching
A key problem when generating code for boolean expressions and
flow-of-control statements is that of matching a jump instruction with
the target of the jump. An approach to this is called *backpatching*, in
which lists of jumps are passed as synthesized attributes. When a jump
is generated, the target of the jump is temporarily left unspecified.
### One-Pass Code Generation using Backpatching
Backpatching can be used to generate code for boolean expressions and
flow-of-control statements in one pass. We use the synthesized
attributes *truelist* and *falselist* to manage labels in the jumping
code for boolean expressions.
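A minimal C sketch of the usual helper operations, assuming generated instructions live in an array indexed by instruction number (the names follow the textbook convention makelist/merge/backpatch; everything else is illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

#define MAXCODE 100
static char code[MAXCODE][32];  /* generated three-address instructions */
static int nextinstr = 0;       /* index of the next instruction to emit */

struct List { int instr; struct List *next; };  /* incomplete jumps */

/* makelist(i): a new list containing only instruction index i. */
static struct List *makelist(int i) {
    struct List *l = malloc(sizeof *l);
    l->instr = i;
    l->next = NULL;
    return l;
}

/* merge(a, b): concatenation of the two lists. */
static struct List *merge(struct List *a, struct List *b) {
    if (!a) return b;
    struct List *p = a;
    while (p->next) p = p->next;
    p->next = b;
    return a;
}

/* backpatch(l, target): fill in target as the jump target of every
 * instruction on list l. */
static void backpatch(struct List *l, int target) {
    for (; l; l = l->next)
        snprintf(code[l->instr], sizeof code[l->instr], "goto %d", target);
}

int main(void) {
    /* Emit two jumps whose targets are not yet known ... */
    struct List *a = makelist(nextinstr++);
    struct List *b = makelist(nextinstr++);
    /* ... and patch both of them later, when the target becomes known. */
    backpatch(merge(a, b), 42);
    printf("%s\n%s\n", code[0], code[1]);   /* goto 42 / goto 42 */
    return 0;
}
```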
### Backpatching for Boolean Expressions
### Flow-of-Control Statements
# Run-Time Environments
A compiler must implement the abstractions in the source language. In
addition to names, scopes, bindings, etc., it must also cooperate with
the operating system and other system software. To do this, the compiler
creates a *run-time environment* in which it assumes the program is
being executed. This environment deals with storage, naming, mechanisms
for accessing variables, parameters, return values, the interface to the
operating system, and so on.
## Storage Organization
From the perspective of the compiler writer, the program runs in its own
logical address space, and the OS maps from logical to physical
addresses. A typical layout is Code $\to$ Static $\to$ Heap $\to$ Free
Memory $\to$ Stack. The amount of storage needed is determined by the
types in the program. Many machines require data to be *aligned*, e.g.
that addresses must be divisible by 4. The compiler may avoid the wasted
space by packing data, at the cost of extra instructions at run time.
The sizes of the code and the static data are known at compile time.\
\
There are two areas for dynamic storage allocation, the *stack* and the
*heap*, whose sizes can change during program execution. In some
languages heap storage is reclaimed automatically by a garbage
collector; in others the programmer must explicitly allocate and free
space on the heap.
### Static Versus Dynamic Storage Allocation
A storage-allocation decision is *static* (compile time) if it can be
made by looking only at the text of the program. It is *dynamic* if it
can be made only while the program is running. Two techniques for
dynamic allocation:
- Stack. For data local to a procedure call.
- Heap. For data that might outlive the call to the procedure that
created it.
## Stack Allocation of Space
Almost all compilers for languages with procedures or methods use a
run-time stack. Stack allocation lets activations whose lifetimes do not
overlap in time share space, and the relative addresses of the data in a
frame are always the same.
### Activation Records
Procedure calls and returns are usually managed by a run-time stack
(the *control stack*). Each live activation has an activation record
(frame), which might include (a rough sketch follows the list):
- Temporary values
- Local data
- A saved machine status
- Access link
- Control link, pointing to the activation record of the caller.
- Space for return value
- The actual parameters used by the calling procedure.
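A rough C sketch of how such a frame could be laid out (field names and sizes are purely illustrative; real layouts are machine- and convention-dependent):

```c
struct ActivationRecord {
    long actual_params[4];                  /* parameters passed by the caller */
    long return_value;                      /* space for the return value */
    struct ActivationRecord *control_link;  /* caller's activation record */
    struct ActivationRecord *access_link;   /* frame of the enclosing scope */
    long saved_machine_status[8];           /* return address, saved registers */
    long local_data[16];                    /* locals of the procedure */
    long temporaries[8];                    /* compiler-generated temporaries */
};
```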
## Heap Management
For data that lives indefinitely, or until the program explicitly
deletes it.
### The Memory Manager
The memory manager keeps track of all the free space in heap storage at
all times. Performs:
- *Allocation*. Must provide a memory region of the requested size.
- *Deallocation*. For reusing space.
# Code Generation
The final phase of the compiler is the code generator. It takes an
intermediate representation as input and produces a semantically
equivalent target program. It must
- Preserve the semantic meaning of the source program
- Make efficient use of the resources of the target machine
- Itself run efficiently
Generating optimal code is undecidable, but many useful heuristic
methods exist that can improve the performance of the generated code
considerably.
## Addresses in the Target Code
## Instruction Selection by Tree Rewriting
Instruction selection can be a large combinatorial task, especially on
CISC machines. We treat instruction selection as a tree-rewriting
problem, for which efficient algorithms can be constructed.
### Tree-Translation Schemes
The input to code generation will be a sequence of trees at the semantic
level. We apply a sequence of tree-rewriting rules to reduce each input
tree to a single node. Each rule represents the translation of the
portion of the tree matched by its template, and the replacement process
is called *tiling* the subtree.
### Code Generation by Tiling an Input Tree
Given an input tree, the templates in the tree-rewriting rules are
applied to tile its subtrees. When a match is found, the matched subtree
is replaced with the replacement node of the rule and the action
associated with the rule is performed (code is generated). Two issues:
- How is the tree-pattern matching to be done?
- What if more than one template matches?
There must always be a match and no infinite chains of replacements. One
approach is to create a postfix representation of the tree and a grammar
for it; on a reduce/reduce conflict take the largest production, on a
shift/reduce conflict shift. This is known as *maximal munch*[^1]. It is
a good and well-understood method, but it has its challenges.
# Machine-Independent Optimizations
High-level language constructs can introduce substantial run-time
overhead if we naively translate each construct independently. We have
three forms:
- Local code optimization - within a block
- Global code optimization - across basic blocks
- Interprocedural code optimization - across functions and procedures
Most global optimizations are based on *data-flow analyses*.
## The Principal Sources of Optimization
A compiler optimization must preserve the semantics of the original
program, and the optimizer is usually only able to apply low-level
semantic transformations.
### Causes of redundancy
Sometimes redundancy is present at the source level, but often it is
introduced because the source is written in a high-level language (array
accesses, for example, cause redundant address computations). By having
the compiler eliminate the redundancies, we get the best of both worlds:
the programs are both efficient and easy to maintain.
### A Running Example: Quicksort
### Semantics-Preserving Transformations
A compiler can improve a program without changing the function it
computes.
### Global Common Subexpressions
An occurrence of an expression $E$ is called a *common subexpression* if
$E$ was previously computed and the values of the variables in $E$ have
not changed since the previous computation.
### Copy Propagation
A *copy statement* is an assignment of the form $u = v$. The idea is to
use $v$ in place of $u$ in later statements that use $u$. It may not
appear to be an improvement by itself, but it gives dead-code
elimination the opportunity to get rid of the copy statement.
### Dead-Code Elimination
A variable is *live* at a point in a program if its value can be used
subsequently; otherwise it is *dead*. We would like to remove dead
statements. Copy propagation enables dead-code elimination to delete
many copy statements.\
\
Deducing at compile time that the value of an expression is a constant,
and using that constant instead, is known as *constant folding*.
### Code Motion
Loops are a very important place for optimizations. *Code motion* takes
an expression that evaluates to the same value on every iteration and
moves it outside the loop.
### Induction Variables and Reduction in Strength
A variable *x* is said to be an induction variable if there is a
positive or negative constant *c* such that each time *x* is assigned,
its value increases by *c*. Induction variables can be computed with a
single increment per loop iteration. Transforming an expensive operation
into a cheaper one is called *strength reduction*, and it can be applied
to induction variables. It is normal to work from the inside out when
optimizing loops. When there are two or more induction variables in a
loop, it might be possible to get rid of all but one.
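A minimal C sketch of strength reduction on the induction variable of a loop (written in source form for readability, assuming 4-byte ints as in the usual textbook example; function names are illustrative):

```c
#include <stddef.h>

/* Before: the offset 4*i is recomputed with a multiplication each iteration.
 * (Both functions assume sizeof(int) == 4.) */
void zero_before(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        *(int *)((char *)a + 4 * i) = 0;
}

/* After: t is an induction variable with t == 4*i, updated by a cheap
 * addition each iteration; the multiplication is gone. */
void zero_after(int *a, size_t n) {
    size_t t = 0;
    for (size_t i = 0; i < n; i++, t += 4)
        *(int *)((char *)a + t) = 0;
}
```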
## Introduction to Data-Flow Analysis
*Data-flow analysis* refers to a body of techniques that derive
information about the flow of data along program execution paths. It can
answer many questions relevant to optimization, such as global common
subexpression elimination and dead-code elimination.
### The Data-Flow Abstraction
You can view the program as a series of transformations, with program
states between the statements. Control flows from statement to statement
within a single block, and between blocks where there is an edge in the
flow graph. An *execution path* is the sequence of program points the
execution passes through; normally there is an infinite number of
possible paths. Different analyses may choose to abstract out different
information.\
\
The definitions that may reach a program point along some path are known
as reaching definitions.
### The Data-Flow Analysis Schema
With each program point we associate a *data-flow value*, which is an
abstraction of the set of all possible program states that can be
observed at that point. We denote the data-flow values before and after
each statement $s$ by $IN[s]$ and $OUT[s]$, and the *data-flow problem*
is to find a solution to a set of constraints on these values for all
statements $s$.
#### Transfer Functions
The relationship between the data-flow values before and after the
assignment statement is known as a *transfer function*. Might have two
directions (forward and backward analysis).
#### Control-Flow Constraints
Within a basic block, control flows from each statement to the next, so
$IN[s_{i+1}] = OUT[s_i]$; you can therefore obtain $IN[B]$ and $OUT[B]$
by running through all the statements of the block.
### Data-Flow Schemas on Basic Blocks
Going through a block is easy: just follow each statement, and we get
$$OUT[B] = f_B(IN[B])$$
If, for constant propagation, we are interested in the set of constants
that *may* be assigned to a variable, we have a forward-flow problem
with union as the meet operator:
$$IN[B] = \bigcup_{P\ \text{a predecessor of}\ B} OUT[P]$$
There is no unique solution; we look for the most precise solution that
satisfies the control-flow and transfer constraints.
### Reaching Definitions
One of the most common and useful data-flow schemas. We say a definition
$d$ *reaches* a point $p$ if there is a path from the point immediately
following $d$ to $p$ such that $d$ is not killed along that path. We
might have aliases, and that is why we must be conservative.
#### Transfer Equation for Reaching Definitions
$$f_d(x) = gen_d \cup (x - kill_d)$$
The gen set contains the definitions that are visible immediately after
the block (only the last assignment to each variable is generated),
while the kill set contains all other definitions of the variables
assigned in the block, even those made earlier inside the same block.
#### Control-Flow Equations
Union is the meet operator for reaching definitions. It creates a
summary of the contributions from different paths at the confluence of
those paths.
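A minimal C sketch of the iterative algorithm on a tiny hard-coded flow graph, with definitions represented as bits in an unsigned int (the gen/kill sets and the graph structure are made up for illustration):

```c
#include <stdio.h>

#define NBLOCKS 4

/* gen/kill bit vectors per block (arbitrary illustrative values). */
static const unsigned gen[NBLOCKS]  = { 0x3, 0x4, 0x8, 0x0 };
static const unsigned kill[NBLOCKS] = { 0xc, 0x1, 0x2, 0x0 };
/* pred[b] lists the predecessors of block b (-1 terminates the list). */
static const int pred[NBLOCKS][NBLOCKS] = {
    { -1 }, { 0, 3, -1 }, { 1, -1 }, { 2, -1 }
};

int main(void) {
    unsigned in[NBLOCKS] = {0}, out[NBLOCKS] = {0};
    int changed = 1;
    while (changed) {                      /* iterate to a fixed point */
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            unsigned newin = 0;
            for (int i = 0; pred[b][i] >= 0; i++)   /* IN[B] = U OUT[P] */
                newin |= out[pred[b][i]];
            unsigned newout = gen[b] | (newin & ~kill[b]);  /* transfer fn */
            if (newin != in[b] || newout != out[b]) changed = 1;
            in[b] = newin;
            out[b] = newout;
        }
    }
    for (int b = 0; b < NBLOCKS; b++)
        printf("B%d: IN=%x OUT=%x\n", b, in[b], out[b]);
    return 0;
}
```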
### Live-Variable Analysis
Live-variable analysis is useful for register allocation in basic blocks
(dead values need not be stored). We use a **backward analysis** and
define:
1. $def_B$ as the set of variables defined in the block prior to any use
in the block.
2. $use_B$ as the set of variables whose values may be used in the block
prior to any definition.
This is about the same as reaching definitions, except in the opposite
direction:
$$IN[B] = use_B \cup (OUT[B] - def_B)$$
### Available Expressions
An expression $x + y$ is *available* at a point $p$ if every path from
the entry node to $p$ evaluates $x + y$, and after the last such
evaluation prior to reaching $p$ there are no subsequent assignments to
$x$ or $y$. A block *kills* expression $x + y$ if it assigns (or may
assign) $x$ or $y$ and does not subsequently recompute $x + y$. A block
*generates* $x + y$ if it evaluates it and does not subsequently
redefine $x$ or $y$. The primary use of available expressions is
detecting global common subexpressions. To compute the set $S$ of
available expressions, at each statement $x = y + z$:
1. Add to $S$ the expression $y + z$.
2. Delete from $S$ any expression involving $x$.
The two steps must be done in this order (consider a statement such as
$x = x + z$). The data-flow analysis can be set up the same way as for
reaching definitions (a forward analysis), but the meet operator is
*intersection rather than union*: an expression is available at the top
of a block only if it is available at the end of every predecessor
block.\
\
In reaching definitions we start with a set that is too small and work
up; in available expressions we start with a set that is too large and
work down. We start with the assumption that everything is available
everywhere and let the iteration remove expressions, which yields the
largest solution that is still safe.
## Loops in Flow Graphs
Loops are important when considering optimizations because programs
spend most of their time executing them. Additionally, analysis of a
flow graph is faster if it contains no loops, because only one pass over
the code is then needed.
### Dominators
A node *d* of a flow graph *dominates* node n, written d dom n, if every
path from the entry node of the flow graph to n goes through d. Every
node dominates itself. The information can be presented in a *dominator
tree*, where entry is the root and each node dominates its descendants.
The problem can be formulated as a forward data-flow analysis.
### Edges in a Depth-First Spanning Tree
We can traverse the flow graph with a depth-first search, obtaining a
depth-first spanning tree; the edges of the flow graph are then
classified as:
- Advancing edges
- Retreating edges
- Cross edges
# Interprocedural Analysis
An interprocedural analysis operates across an entire program. One
simple and useful technique is *inlining* of functions, where the
function invocation is replaced with the body of the function itself.
Recursive methods cannot immediately be inlined, and inlining may expand
code size exponentially (causing more cache misses). The advantage of
function inlining is improved speed, since there is no method-invocation
overhead.
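A minimal C sketch of the transformation (function names are illustrative):

```c
static int square(int x) { return x * x; }

/* Before inlining: two call/return sequences per invocation. */
int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}

/* After inlining: the body of square replaces each call, so there is no
 * call overhead and the expression can be optimized further in place. */
int sum_of_squares_inlined(int a, int b) {
    return (a * a) + (b * b);
}
```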
# Other definitions
#### Loop Unrolling
Loop unrolling replicates the loop body, creating more instructions in
the body and permitting global scheduling algorithms to find more
parallelism. It increases code size and may thereby introduce more cache
misses and memory references.
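A minimal C sketch of unrolling by a factor of four (assuming, for simplicity, that n is a multiple of 4; names are illustrative):

```c
/* Before unrolling: one addition and one branch test per element. */
void add_before(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* After unrolling by 4: four independent additions per iteration give the
 * scheduler more parallelism, at the cost of larger code. */
void add_unrolled(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
}
```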
[^1]: From Wikipedia: In computer programming and computer science,
“maximal munch” or “longest match” is the principle that when
creating some construct, as much of the available input as possible
should be consumed.