4205: Compilers
# Introduction
## The structure of a compiler
The compiling process has two parts. The *analysis* part breaks the
source program up into its constituent pieces and imposes a grammatical
structure on them. In addition, it alerts the user about illegal input
(lexically or syntactically ill-formed programs, etc.) and fills in the
*symbol table*. The *synthesis* part constructs the target program from
the intermediate representation and the symbol table. The two parts are
called the *front end* and the *back end*, respectively.
### Semantic Analysis
The *semantic analyzer* uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency. It
also gathers type information and saves it to the symbol table. An
important part here is *type checking* where it checks that operators
have matching operands. It might also handle *coercions*, i.e.
transforming INT + FLOAT to FLOAT + FLOAT.
## Programming Language Basics
### Aliasing
There is an interesting consequence of call-by-reference (references
passed by value): two formal parameters can refer to the same location,
in which case they are aliases of each other. You need to understand
this in order to optimize a program, because certain optimizations may
only be done if one can be sure there is no aliasing.
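A minimal C sketch of the issue (the function names are just illustrative; call-by-reference is simulated with pointers):

```c
#include <stdio.h>

/* p and q may refer to the same location. The compiler can only keep *p in
 * a register across the store to *q if it can prove p and q never alias. */
int f(int *p, int *q) {
    *p = 1;
    *q = 2;       /* if p == q, this also changes *p */
    return *p;    /* 2 when the parameters alias, 1 otherwise */
}

int main(void) {
    int x = 0, y = 0;
    printf("%d %d\n", f(&x, &y), f(&x, &x));  /* prints "1 2" */
    return 0;
}
```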
# A Simple Syntax-Directed Translator
## Introduction
Intermediate code generation produces one of two main forms: the
*abstract syntax tree*, which represents the hierarchical syntactic
structure of the source program, and *three-address instructions*.
## Syntax-Directed Translation
Syntax-directed translation is done by attaching rules or program
fragments to productions in a grammar.
### Synthesized Attributes
The idea of associating quantities with programming constructs can be
expressed in terms of grammars. We attach rules to productions of the
grammar. A *syntax-directed definition* associates:
1. With each grammar symbol, a set of attributes
2. With each production, a set of *semantic rules* for computing values
of the attributes associated with the symbols appearing in the
production.
An attribute is *synthesized* if its value at a parse-tree node N is
determined from attribute values at the children of N and N itself, and
can be evaluated by a single bottom-up traversal.
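As an example (the classic desk-calculator SDD; the grammar is illustrative, not taken from these notes), each non-terminal gets a synthesized attribute *val*:
- E $\to$ E$_1$ + T with rule E.val = E$_1$.val + T.val
- E $\to$ T with rule E.val = T.val
- T $\to$ T$_1$ \* F with rule T.val = T$_1$.val \* F.val
- T $\to$ F with rule T.val = F.val
- F $\to$ ( E ) with rule F.val = E.val
- F $\to$ **digit** with rule F.val = **digit**.lexval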
## A Translator for Simple Expressions
A syntax-directed translation scheme often serves as the specification
for a translator. There can be a conflict between a grammar that is
convenient for translation and one that is convenient for parsing.
### Abstract and Concrete Syntax
Abstract syntax trees (ASTs) are a useful starting point for creating a
translator. ASTs resemble parse trees to an extent, but the two are
conceptually different: in a parse tree the interior nodes represent
non-terminals, while in a syntax tree they represent programming
constructs. Many productions correspond to programming constructs, but
some exist only to simplify parsing; these helper productions are
dropped in the syntax tree.
## Lexical Analysis
A lexical analyzer reads characters from the input and groups them into
“token objects” with attribute values. A sequence that comprises a
single token is called a *lexeme*. A **num** token can carry a value and
an **id** token can carry a lexeme.
### Removal of White Space and Comments
White space and comments can be removed by the lexer, so the parser
won't have to worry about them.
## Intermediate Code Generation
### Construction of Syntax Trees
Syntax trees can be created for any construct, not just expressions.
Each construct is represented by a node, with children for the
semantically meaningful components of the construct. These nodes can be
represented by a class structure with an arbitrary number of children.
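A minimal C sketch of such a node representation (the type and field names are made up for illustration):

```c
#include <stdlib.h>

/* One node kind per construct; children holds the semantically
 * meaningful components of the construct. */
enum NodeKind { ND_NUM, ND_ID, ND_ADD, ND_ASSIGN, ND_IF, ND_WHILE };

struct Node {
    enum NodeKind kind;
    int value;               /* used by ND_NUM nodes */
    const char *name;        /* used by ND_ID nodes */
    int nchildren;
    struct Node **children;  /* arbitrary number of children */
};

static struct Node *new_node(enum NodeKind kind, int nchildren) {
    struct Node *n = calloc(1, sizeof *n);
    n->kind = kind;
    n->nchildren = nchildren;
    n->children = calloc((size_t)nchildren, sizeof *n->children);
    return n;
}

int main(void) {
    /* build the tree for the assignment x = 1 + 2 */
    struct Node *add = new_node(ND_ADD, 2);
    add->children[0] = new_node(ND_NUM, 0); add->children[0]->value = 1;
    add->children[1] = new_node(ND_NUM, 0); add->children[1]->value = 2;
    struct Node *assign = new_node(ND_ASSIGN, 2);
    struct Node *id = new_node(ND_ID, 0); id->name = "x";
    assign->children[0] = id;
    assign->children[1] = add;
    return 0;
}
```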
# Lexical Analysis
## The Lexical-Analyzer Generator Lex
This tool allows one to specify a lexical analyzer by writing regular
expressions that describe the tokens. The expressions are transformed
into deterministic finite automata.
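A minimal Flex specification sketch (the token set and the printing actions are made up for illustration; Flex/Lex generates the function `yylex()` from this):

```
%{
#include <stdio.h>
%}
%%
[ \t\n]+                 { /* strip white space */ }
"//".*                   { /* strip line comments */ }
[0-9]+                   { printf("NUM(%s)\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*   { printf("ID(%s)\n", yytext); }
.                        { printf("CHAR(%s)\n", yytext); }
%%
int yywrap(void) { return 1; }

int main(void) {
    yylex();   /* tokenize stdin until end of file */
    return 0;
}
```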
# Syntax Analysis
Every programming language has rules that describe the syntactic
structure of its programs. These rules are normally expressed as
context-free grammars. Grammars have significant benefits:
- A grammar gives a precise, yet easy-to-understand, syntactic
specification of a programming language.
- Certain classes of grammars can be parsed deterministically and
efficiently, and the process of constructing the parser may reveal
ambiguous constructs in the language.
## Top-Down Parsing
Top-down parsing is the problem of constructing a parse tree for the
input string, starting from the root and creating the nodes of the parse
tree in preorder (depth-first). *This corresponds to a leftmost
derivation.* At each step the problem is to determine which production
to apply for the current non-terminal. A predictive recursive-descent
parser can choose productions by looking ahead at the next input
symbols; grammars that can be parsed this way with k symbols of
lookahead form the LL(k) class.
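A minimal sketch of a predictive recursive-descent parser in C for the toy grammar E $\to$ T { + T }, T $\to$ ( E ) | digit (grammar and names are illustrative). One character of lookahead is enough to choose a production, so the grammar is LL(1):

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static int lookahead;                 /* single token (character) of lookahead */

static void advance(void) { lookahead = getchar(); }

static void match(int t) {
    if (lookahead == t) advance();
    else { fprintf(stderr, "syntax error\n"); exit(1); }
}

static void E(void);

static void T(void) {                 /* T -> ( E ) | digit */
    if (lookahead == '(') { match('('); E(); match(')'); }
    else if (isdigit(lookahead)) advance();
    else { fprintf(stderr, "syntax error\n"); exit(1); }
}

static void E(void) {                 /* E -> T { + T } */
    T();
    while (lookahead == '+') { match('+'); T(); }
}

int main(void) {
    advance();
    E();
    if (lookahead != '\n' && lookahead != EOF) {
        fprintf(stderr, "syntax error\n");
        return 1;
    }
    puts("accepted");
    return 0;
}
```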
## Bottom-Up Parsing
A bottom-up parse corresponds to the construction of a parse tree for an
input string that begins at the leaves (the bottom) and works up towards
the root (the top). A general style of bottom-up parsing is shift-reduce
parsing, and such parsers can be built for LR grammars.
### Reductions
We can think of bottom-up parsing as the process of reducing a string
*w* to the start symbol of the grammar. Each step matches a substring
and replaces it with a non-terminal. By definition, a reduction is the
reverse of a step in a derivation. The goal is to construct a derivation
in reverse, which in fact will be a rightmost derivation.
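As a small worked example (the grammar here is only for illustration), take E $\to$ E + T | T and T $\to$ **id**. The string **id** + **id** is reduced step by step: **id** + **id**, then T + **id**, then E + **id**, then E + T, then E. Read in reverse, this is exactly the rightmost derivation
$$E \Rightarrow E + T \Rightarrow E + \mathbf{id} \Rightarrow T + \mathbf{id} \Rightarrow \mathbf{id} + \mathbf{id}$$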
## Using Ambiguous Grammars
No ambiguous grammar is LR. Ambiguous grammars might still be useful
because of their expressiveness. We can specify disambiguating rules,
although these should be used with care.
### Precedence and Associativity to Resolve Conflicts
Instead of using an ambiguous grammar such as E $\to$ E + E | E \* E |
(E) | **id** directly, we could rewrite the productions so that the
grammar itself encodes precedence. That is not always the best approach,
because the parser will have more states and it is hard to modify the
precedence later. With the ambiguous grammar, conflicts between
different operators are resolved with precedence, and conflicts between
occurrences of the same operator with associativity.
## Parser Generators
*yacc* (yet another compiler-compiler) can construct a translator for a
programming language using a specification. It will parse with the LALR
method. It is used the same way as *lex* (from specification to program
that can read an input stream). The yacc specification file has three
parts:
- Declarations
- Translation rules
- Supporting C routines
#### The Declarations Part
Consists of two parts: ordinary C declarations and declarations of
grammar tokens. The grammar tokens are then available in the second and
third parts. You can make the tokens available to Lex as well.
#### The Translation Rules Part
Put the rules here, where each rule consists of a grammar production and
its associated semantic action. Unquoted names denote non-terminals,
quoted characters denote terminals. *\$\$* refers to the attribute value
associated with the head, *\$i* to that of the ith grammar symbol of the
body.
#### The Supporting C-Routines Part
Must provide *yylex()* (might come from lex/flex) which produces tokens.
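A minimal yacc specification sketch illustrating the three parts (essentially the classic textbook calculator; here `yylex()` is hand-written in the third part instead of coming from Lex):

```
%{
/* declarations part: C declarations and token declarations */
#include <ctype.h>
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token DIGIT

%%
/* translation rules part: productions with semantic actions */
line : expr '\n'        { printf("%d\n", $1); }
     ;
expr : expr '+' term    { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor  { $$ = $1 * $3; }
     | factor
     ;
factor : '(' expr ')'   { $$ = $2; }
       | DIGIT
       ;
%%
/* supporting C-routines part: must provide yylex() */
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;
}

int main(void) { return yyparse(); }
```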
### Using Yacc with Ambiguous Grammars
Some grammars will produce parsing-action conflicts, and yacc reports
how many. Unless instructed otherwise, it resolves all conflicts with
the following rules:
- A reduce/reduce conflict is resolved by choosing the conflicting
production listed first in the specification.
- A shift/reduce conflict is resolved in favour of shift (which resolves
the dangling-else ambiguity correctly).
These rules might not be what the compiler writer wants, so we can
declare associativity (*%left ’+’ ’-’*) or force an operator to be
non-associative (*%nonassoc*). You can also assign precedence.
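A sketch of the calculator above rewritten with an ambiguous expression grammar, where the conflicts are resolved in the declarations part (operator set chosen for illustration):

```
%{
#include <ctype.h>
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUM
%left '+' '-'                /* lowest precedence, left-associative */
%left '*' '/'
%right UMINUS                /* highest precedence (unary minus) */

%%
line : expr '\n'             { printf("%d\n", $1); }
     ;
expr : expr '+' expr         { $$ = $1 + $3; }
     | expr '-' expr         { $$ = $1 - $3; }
     | expr '*' expr         { $$ = $1 * $3; }
     | expr '/' expr         { $$ = $1 / $3; }
     | '-' expr %prec UMINUS { $$ = -$2; }
     | '(' expr ')'          { $$ = $2; }
     | NUM
     ;
%%
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return NUM; }
    return c;
}
int main(void) { return yyparse(); }
```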
# Intermediate-Code Generation
## Type Checking
To do *type checking* a compiler needs to assign a type expression to
each component of the source program. It must then determine that these
types conform to the rules (the type system) of the language. Type
checking has the potential of catching errors. A *sound type system*
eliminates the need for dynamic checking for type errors. A language is
*strongly typed* if its compiler can guarantee that the programs it
accepts will run without type errors.
- Static/dynamic typing is about WHEN type information is acquired
(compile time or run time).
- Strong/weak typing is about how strictly types are distinguished.
### Rules for Type Checking
Type checking comes in two forms: *type synthesis*, which builds up the
type of an expression from the types of its subexpressions, and *type
inference*, which determines the type of a language construct from the
way it is used. The latter can be used for languages that permit names
to be used without declaration but still perform type checking. *void*
is a special type that denotes the absence of a value.
### Type Conversions
There are *widening* conversions (which preserve information) and
*narrowing* conversions (which may lose information). If a type
conversion is done automatically by the compiler, it is *implicit*.
Implicit conversions are also known as *coercions* and are normally
limited to widening conversions. Two helper functions are used:
*max(a,b)*, which finds the higher of two types in the hierarchy, and
*widen()*, which widens a value to a given type.
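A minimal C sketch of these helpers, assuming a linear hierarchy char < int < float < double and emitting the conversion as text (everything here is illustrative):

```c
#include <stdio.h>

enum Type { T_CHAR, T_INT, T_FLOAT, T_DOUBLE };  /* ordered hierarchy */

static const char *type_name(enum Type t) {
    static const char *names[] = { "char", "int", "float", "double" };
    return names[t];
}

/* max(a, b): the higher of the two types in the hierarchy. */
static enum Type max_type(enum Type a, enum Type b) {
    return a > b ? a : b;
}

/* widen(x, from, to): emit a conversion of x only when needed. */
static void widen(const char *x, enum Type from, enum Type to) {
    if (from < to)
        printf("t = (%s) %s\n", type_name(to), x);
}

int main(void) {
    /* For i + f with i an int and f a float, both operands are widened
       to max(int, float) = float before the addition. */
    enum Type t = max_type(T_INT, T_FLOAT);
    widen("i", T_INT, t);      /* prints: t = (float) i */
    widen("f", T_FLOAT, t);    /* prints nothing: already a float */
    return 0;
}
```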
## Control Flow
## Backpatching
A key problem when generating code for boolean expressions and
flow-of-control statements is that of matching a jump instruction with
the target of the jump. An approach to this is called *backpatching*, in
which lists of jumps are passed as synthesized attributes. When a jump
is generated, the target of the jump is temporarily left unspecified.
### One-Pass Code Generation using Backpatching
Backpatching can be used to generate code for boolean expressions and
flow-of-control statements in one pass. We use the synthesized
attributes *truelist* and *falselist* to manage labels in the jumping
code for boolean expressions.
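A minimal C sketch of the usual helper operations, assuming generated instructions live in an array indexed by instruction number (the names follow the textbook convention makelist/merge/backpatch; everything else is illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

#define MAXCODE 100
static char code[MAXCODE][32];  /* generated three-address instructions */
static int nextinstr = 0;       /* index of the next instruction to emit */

struct List { int instr; struct List *next; };  /* incomplete jumps */

/* makelist(i): a new list containing only instruction index i. */
static struct List *makelist(int i) {
    struct List *l = malloc(sizeof *l);
    l->instr = i;
    l->next = NULL;
    return l;
}

/* merge(a, b): concatenation of the two lists. */
static struct List *merge(struct List *a, struct List *b) {
    if (!a) return b;
    struct List *p = a;
    while (p->next) p = p->next;
    p->next = b;
    return a;
}

/* backpatch(l, target): fill in target as the jump target of every
 * instruction on list l. */
static void backpatch(struct List *l, int target) {
    for (; l; l = l->next)
        snprintf(code[l->instr], sizeof code[l->instr], "goto %d", target);
}

int main(void) {
    /* Emit two jumps whose targets are not yet known ... */
    struct List *a = makelist(nextinstr++);
    struct List *b = makelist(nextinstr++);
    /* ... and patch both of them later, when the target becomes known. */
    backpatch(merge(a, b), 42);
    printf("%s\n%s\n", code[0], code[1]);   /* goto 42 / goto 42 */
    return 0;
}
```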
### Backpatching for Boolean Expressions
### Flow-of-Control Statements
# Run-Time Environments
A compiler must implement the abstractions in the source language. In
addition to names, scopes, bindings, etc., it must also cooperate with
the operating system and other system software. To do this, the compiler
creates a *run-time environment* in which it assumes the program is
being executed. This environment deals with storage, naming, mechanisms
for accessing variables, parameters, return values, the interface to the
operating system, and so on.
## Storage Organization
From the perspective of the compiler writer, the program runs in its own
logical address space, and the OS maps from logical to physical
addresses. A typical layout is Code $\to$ Static $\to$ Heap $\to$ Free
Memory $\to$ Stack. The amount of storage needed is determined by the
types in the program. Many machines require data to be *aligned*, e.g.
that addresses must be divisible by 4. The compiler may avoid the wasted
space by packing data, at the cost of extra instructions at run time.
The sizes of the code and the static data are known at compile time.\
\
There are two areas for dynamic storage allocation, the *stack* and the
*heap*, whose sizes can change during program execution. In some
languages heap storage is reclaimed automatically by a garbage
collector; in others the programmer must explicitly allocate and free
space on the heap.
### Static Versus Dynamic Storage Allocation
A storage-allocation decision is *static* (compile time) if it can be
made by looking only at the text of the program. It is *dynamic* if it
can be made only while the program is running. Two techniques for
dynamic allocation:
- Stack. For data local to a procedure call.
- Heap. For data that might outlive the call to the procedure that
created it.
## Stack Allocation of Space
Almost all compilers for languages with procedures or methods use a
run-time stack. Stack allocation lets activations whose lifetimes do not
overlap in time share space, and the relative addresses of the data in a
frame are always the same.
### Activation Records
Procedure calls and returns are usually managed by a run-time stack
(the *control stack*). Each live activation has an activation record
(frame), which might include (a rough sketch follows the list):
- Temporary values
- Local data
- A saved machine status
- Access link
- Control link, pointing to the activation record of the caller.
- Space for return value
- The actual parameters used by the calling procedure.
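A rough C sketch of how such a frame could be laid out (field names and sizes are purely illustrative; real layouts are machine- and convention-dependent):

```c
struct ActivationRecord {
    long actual_params[4];                  /* parameters passed by the caller */
    long return_value;                      /* space for the return value */
    struct ActivationRecord *control_link;  /* caller's activation record */
    struct ActivationRecord *access_link;   /* frame of the enclosing scope */
    long saved_machine_status[8];           /* return address, saved registers */
    long local_data[16];                    /* locals of the procedure */
    long temporaries[8];                    /* compiler-generated temporaries */
};
```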
## Heap Management
For data that lives indefinitely, or until the program explicitly
deletes it.
### The Memory Manager
The memory manager keeps track of all the free space in heap storage at
all times. Performs:
- *Allocation*. Must provide a memory region of the requested size.
- *Deallocation*. For reusing space.
# Code Generation
The final phase of the compiler is the code generator. It takes an
intermediate representation as input and produces a semantically
equivalent target program. It must
- Preserve the semantic meaning of the source program
- Make efficient use of the resources of the target machine
- Itself run efficiently
Generating optimal code is undecidable, but many useful heuristic
methods exist that can improve the performance of the generated code
considerably.
## Addresses in the Target Code
## Instruction Selection by Tree Rewriting
Instruction selection can be a large combinatorial task, especially on
CISC machines. We treat instruction selection as a tree-rewriting
problem, for which efficient algorithms can be constructed.
### Tree-Translation Schemes
The input to code generation will be a sequence of trees at the semantic
level. We apply a sequence of tree-rewriting rules to reduce each input
tree to a single node. Each rule represents the translation of the
portion of the tree matched by its template, and the replacement process
is called *tiling* the subtree.
### Code Generation by Tiling an Input Tree
Given an input tree, the templates in the tree-rewriting rules are
applied to tile its subtrees. When a match is found, the matched subtree
is replaced with the replacement node of the rule and the action
associated with the rule is performed (code is generated). Two issues:
- How is the tree-pattern matching to be done?
- What if more than one template matches?
There must always be a match and no infinite chains of replacements. One
approach is to create a postfix representation of the tree and a grammar
for it; on a reduce/reduce conflict take the largest production, on a
shift/reduce conflict shift. This is known as *maximal munch*[^1]. It is
a good and well-understood method, but it has its challenges.
# Machine-Independent Optimizations
High-level language constructs can introduce substantial run-time
overhead if we naively translate each construct independently. We have
three forms:
- Local code optimization - within a block
- Global code optimization - across basic blocks
- Interprocedural code optimization - across functions and procedures
Most global optimizations are based on *data-flow analyses*.
## The Principal Sources of Optimization
A compiler optimization must preserve the semantics of the original
program, and the optimizer is usually only able to apply low-level
semantic transformations.
### Causes of redundancy
Sometimes redundancy is present at the source level, but often it is
introduced because the source is written in a high-level language (array
accesses, for example, cause redundant address computations). By having
the compiler eliminate the redundancies, we get the best of both worlds:
the programs are both efficient and easy to maintain.
### A Running Example: Quicksort
### Semantics-Preserving Transformations
A compiler can improve a program without changing the function it
computes.
### Global Common Subexpressions
An occurrence of an expression $E$ is called a *common subexpression* if
$E$ was previously computed and the values of the variables in $E$ have
not changed since the previous computation.
### Copy Propagation
A *copy statement* is an assignment of the form $u = v$. The idea is to
use $v$ in place of $u$ in later statements that use $u$. It may not
appear to be an improvement by itself, but it gives dead-code
elimination the opportunity to get rid of the copy statement.
### Dead-Code Elimination
A variable is *live* at a point in a program if its value can be used
subsequently; otherwise it is *dead*. We would like to remove dead
statements. Copy propagation enables dead-code elimination to delete
many copy statements.\
\
Deducing at compile time that the value of an expression is a constant,
and using that constant instead, is known as *constant folding*.
### Code Motion
Loops are a very important place for optimizations. *Code motion* takes
an expression that evaluates to the same value on every iteration and
moves it outside the loop.
### Induction Variables and Reduction in Strength
A variable *x* is said to be an induction variable if there is a
positive or negative constant *c* such that each time *x* is assigned,
its value increases by *c*. Induction variables can be computed with a
single increment per loop iteration. Transforming an expensive operation
into a cheaper one is called *strength reduction*, and it can be applied
to induction variables. It is normal to work from the inside out when
optimizing loops. When there are two or more induction variables in a
loop, it might be possible to get rid of all but one.
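A minimal C sketch of strength reduction on the induction variable of a loop (written in source form for readability, assuming 4-byte ints as in the usual textbook example; function names are illustrative):

```c
#include <stddef.h>

/* Before: the offset 4*i is recomputed with a multiplication each iteration.
 * (Both functions assume sizeof(int) == 4.) */
void zero_before(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        *(int *)((char *)a + 4 * i) = 0;
}

/* After: t is an induction variable with t == 4*i, updated by a cheap
 * addition each iteration; the multiplication is gone. */
void zero_after(int *a, size_t n) {
    size_t t = 0;
    for (size_t i = 0; i < n; i++, t += 4)
        *(int *)((char *)a + t) = 0;
}
```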
## Introduction to Data-Flow Analysis
*Data-flow analysis* refers to a body of techniques that derive
information about the flow of data along program execution paths. It can
answer many questions relevant to optimization, such as global common
subexpression elimination and dead-code elimination.
### The Data-Flow Abstraction
You can view the program as a series of transformations, with program
states between the statements. Control flows from statement to statement
within a single block, and between blocks where there is an edge in the
flow graph. An *execution path* is the sequence of program points the
execution passes through; normally there is an infinite number of
possible paths. Different analyses may choose to abstract out different
information.\
\
The definitions that may reach a program point along some path are known
as reaching definitions.
### The Data-Flow Analysis Schema
With each program point we associate a *data-flow value*, which is an
abstraction of the set of all possible program states that can be
observed at that point. We denote the data-flow values before and after
each statement $s$ by $IN[s]$ and $OUT[s]$, and the *data-flow problem*
is to find a solution to a set of constraints on these values for all
statements $s$.
#### Transfer Functions
The relationship between the data-flow values before and after the
assignment statement is known as a *transfer function*. Might have two
directions (forward and backward analysis).
#### Control-Flow Constraints
Within a basic block, control flows from each statement to the next, so
$IN[s_{i+1}] = OUT[s_i]$; you can therefore obtain $IN[B]$ and $OUT[B]$
by running through all the statements of the block.
### Data-Flow Schemas on Basic Blocks
Going through a block is easy: just follow each statement, and we get
$$OUT[B] = f_B(IN[B])$$
If, for constant propagation, we are interested in the set of constants
that *may* be assigned to a variable, we have a forward-flow problem
with union as the meet operator:
$$IN[B] = \bigcup_{P\ \text{a predecessor of}\ B} OUT[P]$$
There is no unique solution; we look for the most precise solution that
satisfies the control-flow and transfer constraints.
### Reaching Definitions
One of the most common and useful data-flow schemas. We say a definition
$d$ *reaches* a point $p$ if there is a path from the point immediately
following $d$ to $p$ such that $d$ is not killed along that path. We
might have aliases, and that is why we must be conservative.
#### Transfer Equation for Reaching Definitions
$$f_d(x) = gen_d \cup (x - kill_d)$$
The gen set contains the definitions that are visible immediately after
the block (only the last assignment to each variable is generated),
while the kill set contains all other definitions of the variables
assigned in the block, even those made earlier inside the same block.
#### Control-Flow Equations
Union is the meet operator for reaching definitions. It creates a
summary of the contributions from different paths at the confluence of
those paths.
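A minimal C sketch of the iterative algorithm on a tiny hard-coded flow graph, with definitions represented as bits in an unsigned int (the gen/kill sets and the graph structure are made up for illustration):

```c
#include <stdio.h>

#define NBLOCKS 4

/* gen/kill bit vectors per block (arbitrary illustrative values). */
static const unsigned gen[NBLOCKS]  = { 0x3, 0x4, 0x8, 0x0 };
static const unsigned kill[NBLOCKS] = { 0xc, 0x1, 0x2, 0x0 };
/* pred[b] lists the predecessors of block b (-1 terminates the list). */
static const int pred[NBLOCKS][NBLOCKS] = {
    { -1 }, { 0, 3, -1 }, { 1, -1 }, { 2, -1 }
};

int main(void) {
    unsigned in[NBLOCKS] = {0}, out[NBLOCKS] = {0};
    int changed = 1;
    while (changed) {                      /* iterate to a fixed point */
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            unsigned newin = 0;
            for (int i = 0; pred[b][i] >= 0; i++)   /* IN[B] = U OUT[P] */
                newin |= out[pred[b][i]];
            unsigned newout = gen[b] | (newin & ~kill[b]);  /* transfer fn */
            if (newin != in[b] || newout != out[b]) changed = 1;
            in[b] = newin;
            out[b] = newout;
        }
    }
    for (int b = 0; b < NBLOCKS; b++)
        printf("B%d: IN=%x OUT=%x\n", b, in[b], out[b]);
    return 0;
}
```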
### Live-Variable Analysis
Live-variable analysis is useful for register allocation in basic blocks
(dead values need not be stored). We use a **backward analysis** and
define:
1. $def_B$ as the set of variables defined in the block prior to any use
in the block.
2. $use_B$ as the set of variables whose values may be used in the block
prior to any definition.
This is about the same as reaching definitions, except in the opposite
direction:
$$IN[B] = use_B \cup (OUT[B] - def_B)$$
### Available Expressions
An expression $x + y$ is *available* at a point $p$ if every path from
the entry node to $p$ evaluates $x + y$, and after the last such
evaluation prior to reaching $p$ there are no subsequent assignments to
$x$ or $y$. A block *kills* expression $x + y$ if it assigns (or may
assign) $x$ or $y$ and does not subsequently recompute $x + y$. A block
*generates* $x + y$ if it evaluates it and does not subsequently
redefine $x$ or $y$. The primary use of available expressions is
detecting global common subexpressions. To compute the set $S$ of
available expressions, at each statement $x = y + z$:
1. Add to $S$ the expression $y + z$.
2. Delete from $S$ any expression involving $x$.
The two steps must be done in this order (consider a statement such as
$x = x + z$). The data-flow analysis can be set up the same way as for
reaching definitions (a forward analysis), but the meet operator is
*intersection rather than union*: an expression is available at the top
of a block only if it is available at the end of every predecessor
block.\
\
In reaching definitions we start with a set that is too small and work
up; in available expressions we start with a set that is too large and
work down. We start with the assumption that everything is available
everywhere and let the iteration remove expressions, which yields the
largest solution that is still safe.
## Loops in Flow Graphs
Loops are important when considering optimizations because programs
spend most of their time executing them. Additionally, analysis of a
flow graph is faster if it contains no loops, because only one pass over
the code is then needed.
### Dominators
A node *d* of a flow graph *dominates* node n, written d dom n, if every
path from the entry node of the flow graph to n goes through d. Every
node dominates itself. The information can be presented in a *dominator
tree*, where entry is the root and each node dominates its descendants.
The problem can be formulated as a forward data-flow analysis.
### Edges in a Depth-First Spanning Tree
We can traverse the flow graph with a depth-first search, obtaining a
depth-first spanning tree; the edges of the flow graph are then
classified as:
- Advancing edges
- Retreating edges
- Cross edges
# Interprocedural Analysis
An interprocedural analysis operates across an entire program. One
simple and useful technique is *inlining* of functions, where the
function invocation is replaced with the body of the function itself.
Recursive methods cannot immediately be inlined, and inlining may expand
code size exponentially (causing more cache misses). The advantage of
function inlining is improved speed, since there is no method-invocation
overhead.
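A minimal C sketch of the transformation (function names are illustrative):

```c
static int square(int x) { return x * x; }

/* Before inlining: two call/return sequences per invocation. */
int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}

/* After inlining: the body of square replaces each call, so there is no
 * call overhead and the expression can be optimized further in place. */
int sum_of_squares_inlined(int a, int b) {
    return (a * a) + (b * b);
}
```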
# Other definitions
#### Loop Unrolling
Loop unrolling replicates the loop body, creating more instructions in
the body and permitting global scheduling algorithms to find more
parallelism. It increases code size and may thereby introduce more cache
misses and memory references.
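A minimal C sketch of unrolling by a factor of four (assuming, for simplicity, that n is a multiple of 4; names are illustrative):

```c
/* Before unrolling: one addition and one branch test per element. */
void add_before(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* After unrolling by 4: four independent additions per iteration give the
 * scheduler more parallelism, at the cost of larger code. */
void add_unrolled(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
}
```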
[^1]: From Wikipedia: In computer programming and computer science,
“maximal munch” or “longest match” is the principle that when
creating some construct, as much of the available input as possible
should be consumed.