4205: Compiler Construction
See also: TDT4205: Compiler Technology
The structure of a compiler
The compiling process has two parts. The analysis part breaks up the source program and imposes a grammatical structure on them. In addition to this it will alert the user about illegal input (grammar, syntactical etc.) and fill inn the symbol table. The synthesis part constructs the target program from the intermediate representation and the symbol table. The parts are called front end and back end respectively.
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency. It also gathers type information and saves it to the symbol table. An important part here is type checking where it checks that operators have matching operands. It might also handle coercions, i.e. transforming INT + FLOAT to FLOAT + FLOAT.
Programming Language Basics
There is as interesting consequence of call-by-reference (references passed by value). Two formal parameters can refer to the same location, and they are alias of each other. You need to understand this in order to optimize a program, because certain optimizations may only be done if one can be sure there is no aliasing.
A Simple Syntax-Directed Translator
Intermediate code-generation comes in two forms, abstract syntax tree which represents the hierarchical syntactic structure of the source program. The other one is three address instructions.
Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar.
The idea of associating quantities with programming constructs can be expressed in terms of grammars. We attach rules to productions of the grammar. A syntax-directed definition associates:
With each grammar symbol, a set of attributes
With each production, a set of semantic rules for computing values of the attributes associated with the symbols appearing in the production.
An attribute is synthesized if its value at a parse-tree node N is determined from attribute values at the children of N and N itself, and can be evaluated by a single bottom-up traversal.
A Translator for Simple Expressions
A syntax-directed translation scheme often serves as the specification for a translator. We might come up in a conflict between a translatable and parsable grammar.
Abstract and Concrete Syntax
Useful starting point for creating a translator. AST resemble parse trees to an extent, but these are conceptually different. Parse trees have the interior nodes representing non-terminals, and a syntax tree represent programming constructs. Many productions represent grammar constructs, but many are just there to simplify parsing. The helpers are dropped in the syntax tree.
A lexical analyzer reads characters from the input and groups them into “token objects” with attribute values. A sequence that comprises a single token is called a lexeme. An num can have a value and an id can have a lexeme.
Removal of White Space and Comments
White space and comments can be removed by the lexer, so the parser won't have to worry.
Intermediate Code Generation
Construction of Syntax Trees
Syntax trees can be created for any construct, not just expressions. Each construct is represented by a node, with children for the semantically meaningful components of the construct. These nodes can be represented in a class structure with an arbitrary amount of children.
The Lexical-Analyzer Generator Lex
This tool allows one to specify a lexical analyzer by specifying regular expressions to describe tokens. The expressions are transformed to deterministic finite automatas. Use:
Lex source program
$\to$lex compiler $\to$c source program
C source program
$\to$C compiler $\to$Lexer executable
$\to$Lexer executable $\to$Sequence of tokens
Every program has rules that describes the syntactic structure. These rules are normally expressed as context-free grammars. Grammars have significant benefits:
A grammar gives a precise, yet easy-to-understand syntactic specification of a programming language
Certain classes of grammars can be deterministically and efficiently parsed, and the process of making the parser might reveal ambiguous constructs in the language.
The problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first). Will generate the leftmost derivation. At each step the problem is to determine the production to apply for each non-terminal. A predictive recursive-descent parser can choose productions by looking at the next symbol. We call this the LL(k) class.
Using Ambiguous Grammars
Any ambiguous grammar fails to be in LR. Ambiguous grammars might still be useful, because of it’s expressiveness. We can specify disambiguating rules, although these should be used with care.
Precedence and Associativity to Resolve Conflicts
We might alter productions to give precedence (E
yacc (yet another compiler-compiler) can construct a translator for a programming language using a specification. It will parse with the LALR method. It is used the same way as lex (from specification to program that can read an input stream). The yacc specification file has three parts:
Supporting C routines
The Declarations Part
Consists of two parts, normal C declarations and declarations of grammar tokens. The grammar tokens will then be available for the second and third part. You can make the tokens available for Lex as well.
The Translation Rules Part
Put rules here, where each rule consists of grammar productions and associated semantic actions. Unquoted means non-terminals, quoted means terminals. \$\$ means the attribute value associated with the head, \$i means the ith grammar symbol.
The Supporting C-Routines Part
Must provide yylex() (might come from lex/flex) which produces tokens.
Using Yacc with Ambiguous Grammars
Some grammars might produce conflicts, and it will report so. Unless instructed otherwise, it will resolve all conflicts with following rules:
A reduce/reduce is resolved by choosing the conflicting production listed first in the specification.
A shift/reduce conflict is resolved in favor of shift (will resolve dangling-else ambiguity correctly).
These rules might not be what the compiler writer wants, so we can choose associativity (%left ’+’ ’-’) or force it to be non-associative. You can also force precedence.
To do type checking a compiler needs to assign a type expression to each component of the source program. It must then determine that these types conform to the rules (type system) of a program. It has the potential of catching errors. A sound type system eliminates the need for dynamic checking for type errors. A language is strongly typed if a compiler guarantees that the program will run without type errors.
Static/dynamic typed is about WHEN type information is acquired (compile-time or run-time).
Strongly/weakly typed is about how strictly types are distinguished.
Rules for Type Checking
Two forms, synthesis which builds up the type of an expression from the types of its sub-expressions and type inference which determines the type of a language construct from the way it is used. The latter can be used for languages that permits names to be used without declaration, but still uses type checking. void is a special type that denote the absence of a value.
You have widening (preserve information) and narrowing (may lose information). If type conversions are done automatically by the compiler, it is implicit. These conversions are also known as coercions, and is normally limited to widening conversions. Two functions, max(a,b) finds the highest type in the hierarchy and widen() which widens a type.
A key problem when generating code for boolean expressions and flow of control statements is that of matching a jump instruction with the target of the jump. An approach to this is called backpatching, in which lists of jumps are passed as synthesized attributes. When a jump is generated, the target of the jump is temporarily left unspecified.
One-Pass Code Generation using Backpatching
Backpatching can be used to generate code for boolean expressions and flow-of-control statements in one pass. We use synthesized attributes truelist and falselist to manage labels in jumping code for bool expressions.
Backpatching for Boolean Expressions
A compiler must implement the abstractions in the source language. In addition to names, scopes, bindings etc, it must also cooperate with the operating system and other system software. To do this, the compiler creates an run-time environment which it assumes the program are being executed. This deals with storage, naming, mechanisms to access variables, parameters, return values, interface to the operating system etc.
In the perspective of the compiler writer, the program runs in its own
logical address space, where the OS maps from logical to physical.
Normal setup is Code
Static Versus Dynamic Storage Allocation
A storage allocation is static (compile time) if the decision can be made at looking at the text of a program. A storage allocation is dynamic if it can be decided only when the program is running. Two techniques:
Heap. For data that might outlive the call to the procedure that created it.
Stack Allocation of Space
Almost all compilers for procedures/methods will use the stack. Will make sure no variables overlap in time and the relative addresses are always the same.
Procedure calls and returns are usually managed by a run-time stack (Control stack). Each live activation has an activation record (frame). Might include:
A saved machine status
Control link, pointing to the activation record of the caller.
Space for return value
The actual parameters used by the calling procedure.
For data that must be explicitly deleted.
The Memory Manager
The memory manager keeps track of all the free space in heap storage at all times. Performs:
Allocation. Must provide a memory region in requested size.
De-allocation. For reusing space.
The final phase in the compiler is the code generator. It takes IR and input and produces a semantically equivalent target program. It must
Preserve semantic meaning
Make efficient use of resources on target machine
The code generator itself must run efficiently
Making optimal code are undecidable, but many useful methods with heuristics exists which will improve program performance by many factors.
Addresses in the Target Code
Instruction Selection by Tree Rewriting
Instruction selection can be a large combinatorial task, especially on CISC. We treat instruction selection as a tree-rewriting problem. Here we can make efficient algorithms.
The input of code-generation will be a sequence of trees at the semantic level. We need to apply a sequence of tree-rewriting rules to reduce the input tree to a single node. Each rule represents the translation of a portion of the tree given by the template. The replacement is called tiling of the subtree
Code Generation by Tiling an Input Tree
Given an input tree, the templates in the tree-rewriting rules are applied to tile its subtrees. When match is found, the subtree is replaced with replacement node of a rule and action associated with the rule is done (generate code). Two issues:
How is the tree-pattern matching to be done?
What if we have more matches?
Must always have a match and not infinite amount of matches. You can create a postfix representation of a tree and create a grammar for it. By reduce-reduce, take largest production, by shift-reduce, shift. This is known as maximal munch[^1]. This is a good and well understood method, but it has its challenges.
High-level language constructs can introduce substantial run-time overhead if we naively translate each construct independently. We have three forms:
Local code optimization - within a block
Global code optimization - across basic blocks
Inter-procedural code optimization - across functions and procedures
Most global optimization are based on data-flow analyses.
The Principal Sources of Optimization
A compiler optimization must preserve the semantics of the original program, and the optimizer is usually only able to apply low-level semantic transformations.
Causes of redundancy
Sometimes redundancy is available at source level, but often introduced because the source is written in a high level language (array accesses will cause redundancy). By having a compiler eliminate the redundancies, we get the best of both worlds: the programs are both efficient and easy to maintain.
A Running Example: Quicksort
A compiler can improve a program without changing the function it computes.
Global Common Sub-expressions
An occurrence of an expression
A copy statement is an assignment on the form
A variable is live at a point in a program if its value can be used subsequently, otherwise dead. We would like to remove dead statements. Copy propagation makes dead code elimination able to delete many copy statements.\ \ Deducing at compile time that a value of an expression is known as constant folding.
Loops are very important place for optimizations. Code motion takes an expression that evaluates to the same each time and lifts it out of the loop.
Induction Variables and Reduction in Strength
A variable x is said to be an induction variable if there is a positive or negative constant c such that each time x is assigned, its value increases by c. They can be computed with a single increment per loop iteration. Transforming from an expensive to a cheap operation is strength reduction, which you can do on induction variables. It’s normal to work inside-out when optimizing loops. When there are two or more induction variables in a loop, it might be possible to get rid of all but one.
Introduction to Data-Flow Analysis
Data-flow analysis refers to a body of techniques that derive information about the flow of data along program execution paths. You can answer many questions about optimization, like global common sub-expression and dead code elimination.
The Data-Flow Abstraction
You can view the program as a series of transformations, where the states are between the statements. Control flow will go from statement to statement in single blocks, and go between blocks if there is an edge between. Execution path is the instructions executed, which there normally is an infinite number of. Different analyses may choose to abstract out different information.\ \ The definitions that may reach a program point along some path are known as reaching definitions.
The Data-Flow Analysis Schema
At each program point we associate a data-flow value which is the set
of all possible program states that can be observed for that point. We
denote the data-flow values before and after each statement
The relationship between the data-flow values before and after the assignment statement is known as a transfer function. Might have two directions (forward and backward analysis).
You can create
Data-Flow Schemas on Basic Blocks
Going through a block is easy, just follow each statement, and we get:
If we do constant propagation, we are interested in the sets of constants that may be assigned to a variable, then we have forward flow with union operator.
No unique solution, but find the most precise that satisfies control-flow and transfer constraints.
One of the most common and useful data-flow schemas. We say a definition
Transfer Equation for Reaching Definitions
The gen-set contains all definitions that are directly visible after the block (will only generate last assignment of each variable), but the kill-set kills all definitions of assignment (even those inside the same block).
Union is the meet operator for reaching definitions. It creates a summary of the contributions from different paths at the confluence of those paths.
Useful for register allocation for basic blocks (dead values need not be stored). We use backward analysis and we define:
$def_B$as the set of variables defined in the block.
$use_B$as the set of variables used in the block.
About the same as the reaching definitions, except opposite direction:
$S$the expression $y + z$.
$S$any expression involving $x$.
Must be done in correct order. Data-flow can be done the same way as reaching definitions (forward analysis), but the meet operator is intersection rather than union. An expression is available at the top of a block if it is available at the end of each predecessor block.\ \ In reaching definitions, we start with a set too small and work up, in available expression, we work with a set too large and work down. We start with the assumption that everything is available everywhere, to keep the optimization conservative.
Loops in Flow Graphs
Loops when considering optimizations are important because programs spend most of their time executing them. Additionally, program analysis can happen faster if it contains no loops, because you only need one pass through the source code.
A node d of a flow graph dominates node n, written d dom n, if every path from the entry node of the flow graph to n goes through d. Every node dominates itself. The information can be presented in a dominator tree, where entry is the root and each node dominates its descendants. The problem can be formulated as a forward data-flow analysis.
Edges in a Depth-First Spanning Tree
We can traverse it with DFS, getting following edges
An interprocedural analysis operates across an entire program. One simple and useful technique is inlining of functions, where you replace the function invocation with the body of the function itself. You cannot immediately inline recursive methods, and inlining may expand code size exponentially (cache miss). Advantages of function inlining is improved speed, due to no method invocation overhead.
Loop unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism. Will increase code size, and hereby introduce more cache misses and memory references.
[^1]: From wikipedia: In computer programming and computer science, “maximal munch” or “longest match” is the principle that when creating some construct, as much of the available input as possible should be consumed.