TDT4205: Compilers
What is a compiler?
A compiler translates structured text from one language to another, and compiling code is very useful.
# Practical information
# Recommended reading
* Language classification
* Strong and weak typing
* Dynamic and static typing
* Grammars and parsing
* Regular ⊆ Context-free ⊆ Context-sensitive ⊆ Unrestricted
* FIRST and FOLLOW sets
* LL-parsing
* Left-recursive grammars
* LR-parsing
* SLR
* LALR
* Handling ambiguity
* Operator precedence
* Shift/reduce
* The dangling else problem
* Optimizations
* When to implement what optimization, see slide number 27 of "Lecture 23: Introduction to Optimizations", CS412/413 Spring 2008 Slide Set by Tim Teitelbaum.
* Data-flow analysis
* Control flow graphs
* Transfer functions (monotonic, distributive)
* Different types of analysis
* Dominator trees
* Interference graphs
* Put lattices somewhere in here
# Compilers - how do they work?
Compilers are programs that translate structured text from one language to another. This course studies compilers that compile computer programs. A compiler is usually built from several parts:
## Scanner
A scanner reads the input text as a stream of characters and translates them into a stream of tokens. Tokenizer is another word for scanner. The scanner is a lexical analyzer.
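As a toy illustration (the token names and regular expressions below are my own, not from the course material), a scanner can be built from a handful of regular expressions tried in order:

```python
import re

# One named group per token kind; earlier entries win ties.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),  # whitespace is consumed but never emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Turn a character stream into a stream of (kind, text) tokens.
    Characters matching no pattern are silently skipped in this toy version."""
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(tokenize("x = 3 + 42")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '42')]
```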
## Parser
A parser reads the stream of tokens from the scanner and translates it into an abstract syntax tree, using the grammar of the source language as a guide. The parser is a syntax analyzer; semantic analysis (such as type checking) is usually a separate pass over the syntax tree.
## Intermediate code generator
An intermediate code generator translates the abstract syntax tree into a simpler intermediate code representation. This is to make things easier for the optimizer in the next step.
## Intermediate code optimizer
The intermediate code optimizer optimizes the intermediate code using lots of nifty tricks and techniques that are described in detail later in this compendium.
## Target language code generator
The target language code generator finally emits the program in the target language, based on the optimized intermediate code.
# Language classification
Programming languages can be classified into many different classes based on things such as type system, paradigm, language generation, areas of use and so on, but this course seems to focus on classifications based on type systems.
## Dynamic vs static
A programming language is **dynamically typed** when most of its type checking is performed at run-time instead of at compile-time. Values have types, but variables do not. Popular dynamic languages include Python, Ruby, PHP, Lisp, JavaScript, Objective-C, Oz (which happens to be the best language ever), and Smalltalk.
A programming language is **statically typed** if it does most of its type checking at compile-time. Static typing is a form of program verification because it can in its purest form guarantee type safety before the program is run. Popular static languages include Pascal, Ada, C/C++, Java/C#, ML, Fortran.
Dynamic languages are typically more expressive, but static languages are safer (as in program verification) and often faster.
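A two-line illustration in Python of "values have types, but variables do not" (my own example):

```python
x = 42            # the value 42 has type int...
print(type(x))    # <class 'int'>
x = "hello"       # ...but x itself has no type and can be rebound freely
print(type(x))    # <class 'str'>
```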
## Strong vs weak
A programming language is **strongly typed** if it restricts which types may be combined by an operator. A language may for instance allow (int) $$+$$ (int), but disallow (str) $$+$$ (int), because str and int are not (in this example) defined to be compatible types for the $$+$$ operator. Popular strongly typed languages include Pascal, C#/Java, Python, OCaml and Ada.
A programming language is **weakly typed** if it does not restrict which types may be combined by an operator. Weakly typed programming languages often support features like implicit type conversion and ad-hoc polymorphism. Popular weakly typed languages include BASIC, JavaScript, Perl, PHP and C.
Weakly typed languages are typically more expressive, but strongly typed languages are safer (as in program verification).
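For contrast, a small Python example (mine) of strong typing in action: the + operator refuses to mix str and int, where weakly typed JavaScript would quietly coerce and give "11":

```python
try:
    "1" + 1                 # no implicit conversion between str and int
except TypeError as err:
    print(err)              # can only concatenate str (not "int") to str
```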
# Grammars
A grammar is a set of production rules for strings in a formal language. Grammars do not specify meaning of strings, or what can be done with them, or anything like that. Grammars only describe the form of the strings.
A grammar $$G = (N, \Sigma, P, S)$$ consists of:
* A finite set $$N$$ of nonterminal symbols
* A finite set $$\Sigma$$ of terminal symbols that is disjoint from $$N$$.
* A finite set $$P$$ of production rules, each rule of the form:
$$(\Sigma \cup N)^* N (\Sigma \cup N)^* \rightarrow (\Sigma \cup N)^*$$
* A symbol $$S \in N$$ that is the start symbol
The language of a grammar $$G$$, $$L(G)$$, is defined as all the sentences that can be derived in a finite number of steps from the start symbol $$S$$.
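As a tiny worked example (my own, not from the course material): take $$N = \{S\}$$, $$\Sigma = \{a, b\}$$, start symbol $$S$$ and the productions $$S \rightarrow aSb$$ and $$S \rightarrow \epsilon$$. Then $$L(G) = \{a^n b^n \mid n \geq 0\}$$, derived for instance as $$S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb$$. This language famously cannot be generated by any regular grammar, which separates the two lowest levels of the hierarchy below.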
## Notation
When writing about grammars, the notation convention is to use capital letters for nonterminals, lowercase letters for terminals, $$\epsilon$$ or $$\Lambda$$ for the empty string, and $$S$$ for the start symbol.
## The Chomsky hierarchy
Grammars are classified into a hierarchy based on the restrictiveness of the production rules. A more restrictive class can express fewer languages than a less restrictive class. The Chomsky hierarchy is, in order of decreasing restrictiveness:
Regular ⊆ Context-free ⊆ Context-sensitive ⊆ Unrestricted
### Regular grammars
Regular grammars are the most restrictive grammars in the grammar hierarchy: every production is of the form $$A \rightarrow a$$, $$A \rightarrow aB$$ or $$A \rightarrow \epsilon$$ (right-regular; the symmetric left-regular form also works). Regular grammars describe exactly the regular languages.
### Context-free grammars
Context-free grammars are grammars where every production rule is of the form $$A \rightarrow \gamma$$, where $$A$$ is a single nonterminal and $$\gamma$$ is a (possibly empty) string of terminals and nonterminals.
### Context-sensitive grammars
Context-sensitive grammars are grammars where every production is of the form $$\alpha A \beta \rightarrow \alpha \gamma \beta$$, where $$\alpha$$, $$\beta$$ and $$\gamma$$ are strings of nonterminals and terminals, and $$\gamma$$ is not $$\epsilon$$.
### Unrestricted grammars
An unrestricted grammar is a grammar with no restrictions on its productions.
# Finite automata
A finite automaton consists of:
* A finite set of states
* Transitions between states
* A start state
* A set of final states
There are two kinds of finite automata studied in this course: NFAs and DFAs. Here is a series of random facts about finite automata:
* Every NFA has an equivalent DFA.
* There exist algorithms that convert NFAs to DFAs (the subset construction, sketched below) and back.
* Every finite automaton is equivalent to a regular expression.
* Finite automata can be thought of as deciders answering the question "is this input grammatically correct?" for the language associated with the automaton.
## Deterministic finite automata (DFA)
A DFA is a finite automaton that has exactly one transition out of each state for each possible input symbol.
## Non-deterministic finite automata (NFA)
An NFA is a finite automaton where a state may have zero, one or several transitions on the same input symbol, and may additionally have $$\epsilon$$-transitions that consume no input. An NFA accepts an input if some sequence of choices leads to a final state.
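Here is a minimal sketch of the subset construction mentioned above. The NFA representation (a dict from (state, symbol) pairs to sets of successor states, with None standing in for $$\epsilon$$) is my own choice for the example:

```python
from collections import deque

def epsilon_closure(states, nfa):
    """All states reachable from `states` using only epsilon (None) moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, None), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Build an equivalent DFA whose states are frozensets of NFA states."""
    dfa_start = epsilon_closure({start}, nfa)
    dfa_trans, dfa_accepting = {}, set()
    seen, worklist = {dfa_start}, deque([dfa_start])
    while worklist:
        current = worklist.popleft()
        if current & accepting:          # any accepting NFA state inside?
            dfa_accepting.add(current)
        for symbol in alphabet:
            move = set()
            for s in current:
                move |= nfa.get((s, symbol), set())
            target = epsilon_closure(move, nfa)
            dfa_trans[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return dfa_start, dfa_trans, dfa_accepting

# a*b as an NFA: state 0 loops on 'a' and steps to accepting state 1 on 'b'
nfa = {(0, "a"): {0}, (0, "b"): {1}}
print(nfa_to_dfa(nfa, 0, {1}, {"a", "b"}))
```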
# Parsing
## Top-down parsing
Top-down parsers look for leftmost derivations of an input stream by searching for parse trees using a top-down expansion of the grammar's production rules.
### Recursive descent parser
A top-down parser built from a set of mutually recursive procedures where each such procedure usually implements one of the production rules of the grammar.
#### Predictive parser (LL-parser)
A predictive parser is a recursive descent parser that does not require backtracking. Predictive parsers only work for LL(k) grammars.
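A sketch of a predictive recursive descent parser (the grammar and all names are mine, chosen for illustration) for the classic LL(1) expression grammar $$E \rightarrow T E'$$, $$E' \rightarrow + T E' \mid \epsilon$$, $$T \rightarrow num \mid ( E )$$. Each nonterminal becomes one procedure, and the single token of lookahead is always enough to pick the production, so no backtracking happens:

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def parse_e(self):                   # E -> T E'
        value = self.parse_t()
        while self.peek() == "+":        # E' -> + T E' | eps, written as a loop
            self.eat("+")
            value += self.parse_t()
        return value

    def parse_t(self):                   # T -> num | ( E )
        tok = self.peek()
        if tok == "(":
            self.eat("(")
            value = self.parse_e()
            self.eat(")")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected a number or '(', got {tok!r}")
        self.pos += 1
        return int(tok)

print(Parser(["1", "+", "(", "2", "+", "3", ")"]).parse_e())   # 6
```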
## Bottom-up parsing
Bottom-up parsers start with the input stream and attempt to reach the start symbol through a series of rewrites. Shift-reduce parsing is another name for bottom-up parsing.
### LR-parser
LR-parsers read their input from Left to right and produce a Rightmost derivation in reverse, and they run in O(n) time for deterministic context-free languages. Canonical LR-parsing, though fast, requires prohibitively large tables. Luckily, more memory-efficient parsers exist, such as the SLR and LALR parsers.
#### Simple LR-parser (SLR-parser)
An SLR-parser is an LR(1)-parser with a relatively simple parser generator algorithm. SLR lookahead set generators calculate lookahead by an easy approximation method (FOLLOW sets) based directly on the grammar, ignoring the individual parser states and transitions.
#### Look-Ahead LR parser (LALR-parser)
An LALR-parser is an LR(k)-parser (k>0), and a simplified version of a canonical LR-parser. LALR lookahead set generators calculate lookahead by a more precise method based on exploring the graph of parser states and their transitions.
## FIRST and FOLLOW sets
FIRST($$\alpha$$) is the set of terminals that can begin a string derived from $$\alpha$$ (plus $$\epsilon$$ if $$\alpha$$ can derive the empty string). FOLLOW($$A$$) is the set of terminals that can appear immediately after the nonterminal $$A$$ in some sentential form. LL-parser generators use these sets to pick the right production on one token of lookahead, and SLR-parser generators use FOLLOW sets to decide when to reduce.
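A sketch of computing FIRST sets by fixed-point iteration (the grammar representation is my own: a dict from nonterminal to a list of productions, with $$\epsilon$$ encoded as the empty production):

```python
EPSILON = "eps"

def compute_first(grammar, nonterminals):
    """Iterate until no FIRST set grows anymore (a classic fixed point)."""
    first = {a: set() for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for head, productions in grammar.items():
            for prod in productions:
                all_nullable = True          # does the whole body derive eps?
                for sym in prod:
                    if sym not in nonterminals:          # terminal symbol
                        if sym not in first[head]:
                            first[head].add(sym)
                            changed = True
                        all_nullable = False
                        break
                    new = first[sym] - {EPSILON}
                    if not new <= first[head]:
                        first[head] |= new
                        changed = True
                    if EPSILON not in first[sym]:
                        all_nullable = False
                        break
                if all_nullable and EPSILON not in first[head]:
                    first[head].add(EPSILON)
                    changed = True
    return first

# S -> A b | b,  A -> a | eps  (eps as the empty production [])
grammar = {"S": [["A", "b"], ["b"]], "A": [["a"], []]}
print(compute_first(grammar, {"S", "A"}))   # S: {a, b}, A: {a, eps}
```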
## Shift/reduce
A shift/reduce parser keeps a stack of symbols. A *shift* pushes the next input token onto the stack; a *reduce* pops the right-hand side of a production off the top of the stack and pushes its left-hand-side nonterminal. A shift/reduce conflict occurs when the parser cannot decide between the two actions.
## Handling ambiguity
A grammar is ambiguous if some sentence has more than one parse tree. Ambiguity shows up as conflicts in the parser tables, and is typically resolved with tie-breaking rules like the ones below.
### Operator precedence
Precedence and associativity declarations tell the parser which action to prefer in a conflict, so that for instance $$a + b * c$$ parses as $$a + (b * c)$$ because $$*$$ has higher precedence than $$+$$.
### The dangling else problem
In a statement like "if a then if b then s1 else s2", the grammar allows the else to attach to either if. The conventional resolution is to bind the else to the nearest unmatched if, which corresponds to choosing shift over reduce in the resulting shift/reduce conflict.
# Optimizations
When compiling code, a compiler should generally try to optimize its output as much as possible. A compiler that does this is called an optimizing compiler. Optimization in this context refers to the process of minimizing some attribute of the output program, most commonly execution time, but also memory footprint, without changing the semantics of the program. Optimization is typically computationally intensive, with some problems being NP-complete or even undecidable.
## Data-flow analysis
Data-flow analysis is a technique for gathering information about the possible sets of values computed at various points in a program. The optimizations in this section use that information to decide where a rewrite is safe.
### Common subexpression elimination
Wikipedia says it best: "In the expression (a + b) - (a + b)/4, "common subexpression" refers to the duplicated (a + b). Compilers implementing this technique realize that (a + b) won't change, and as such, only calculate its value once."
### Constant folding and propagation
Constant folding is the process of pre-computing expressions that consist only of constants and replacing them with their final value; as an example, the expression $$3 + 2$$ can be replaced by the expression $$5$$. Constant propagation then substitutes the known constant values of variables into the expressions that use them, which often exposes more folding opportunities.
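A toy constant-folding pass over a miniature expression AST (the tuple encoding is invented for this example):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Recursively replace subtrees whose operands are all constants."""
    if not isinstance(node, tuple):       # a constant (int) or a variable name
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)       # pre-compute at "compile time"
    return (op, left, right)

print(fold(("+", 3, ("*", 2, 5))))    # 13
print(fold(("+", "x", ("*", 2, 5))))  # ('+', 'x', 10)
```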
### Additional techniques
These might not be part of the curriculum, I am not sure:
* Induction variable recognition and elimination
* Alias classification and pointer analysis
* Dead store elimination
### Control flow graphs (CFG)
A control flow graph represents all code paths that might be traversed through a program during its execution. Each node in the graph represents a basic block: a straight-line piece of code with no jumps in or out except at its beginning and end. Possible jumps are represented as directed edges between nodes. Additionally, one block is marked as the entry block, where execution starts, and one block is marked as the exit block, where execution ends.
#### Reachability
Subgraphs that are not reachable from the entry node can safely be discarded from the graph, as they can never be executed. This is an important technique for dead code elimination.
If the exit block is unreachable, the program has an infinite loop. Although not all infinite loops can be detected, it is still nice to detect some of them!
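Reachability itself is just a graph search. A small sketch (the CFG representation, a dict from block label to successor labels, is invented for the example):

```python
def reachable(cfg, entry):
    """Depth-first search for every block reachable from the entry block."""
    seen, stack = set(), [entry]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(cfg.get(block, []))
    return seen

cfg = {"entry": ["loop"], "loop": ["loop"], "exit": []}
seen = reachable(cfg, "entry")
print(seen)               # {'entry', 'loop'}; "exit" is dead
print("exit" in seen)     # False: this program loops forever
```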
#### Transfer functions
A transfer function describes how a single statement or basic block transforms the data-flow facts that pass through it; for a forward analysis it typically has the shape $$out[B] = gen[B] \cup (in[B] \setminus kill[B])$$. A transfer function is **monotonic** if a larger input never produces a smaller output ($$x \sqsubseteq y \Rightarrow f(x) \sqsubseteq f(y)$$), and **distributive** if $$f(x \sqcap y) = f(x) \sqcap f(y)$$. Distributivity is what makes the MFP solution coincide with the MOP solution (see the tips section at the end).
#### Different types of analysis
##### Alias analysis
##### Pointer analysis
##### Shape analysis
##### Escape analysis
##### Array access analysis
##### Dependence analysis
##### Control flow analysis
##### Data flow analysis
###### Live variable analysis (aka liveness analysis)
A variable in a program is live at a point of the program if it holds a value that may be needed in the future. Read up on lattices, because they are useful when doing live variable analysis.
A use-define chain (UD chain) is a data structure that is used in liveness analysis. UD chains are data structures that consist of a use of a variable (U) and all the definitions of said variable that can reach that use without any other intervening definitions (D). Usually, a definition means an assignment of a value to a variable.
A UD chain makes it possible to look up, for each use of a variable, which earlier assignments its current value may depend on. When every use of a variable has a UD chain of length 1, the variable can easily be split into several distinct variables, one per definition. This is very useful for putting a program into static single assignment form.
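A compact liveness sketch using backward fixed-point iteration over the standard equations $$in[B] = use[B] \cup (out[B] \setminus def[B])$$ and $$out[B] = \bigcup_{S \in succ(B)} in[S]$$ (the block representation is my own):

```python
def liveness(blocks):
    """blocks maps a label to (use set, def set, successor labels)."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for name, (use, defs, succs) in blocks.items():
            out = set().union(*[live_in[s] for s in succs])
            inn = use | (out - defs)
            if out != live_out[name] or inn != live_in[name]:
                live_out[name], live_in[name] = out, inn
                changed = True
    return live_in, live_out

# B1: x = 1             (defines x, falls through to B2)
# B2: y = x + 1; ret y  (uses x, defines y)
blocks = {"B1": (set(), {"x"}, ["B2"]), "B2": ({"x"}, {"y"}, [])}
print(liveness(blocks))   # x is live into B2, hence live out of B1
```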
###### Available expression analysis
#### Dominator trees
Lemme explain: In a CFG, a node $$d$$ dominates a node $$n$$ if every path from the entry node to $$n$$ must go through $$d$$. We write $$d$$ dom $$n$$, or $$d \gg n$$. A dominator tree is a tree where each node's children are the nodes it immediately dominates; the entry node is the root of the tree. Dominator trees are used for computing static single assignment (SSA) form for an IR, that is, a form in which every variable is assigned exactly once.
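A simple quadratic fixed-point sketch for computing full dominator sets, from which the tree can be read off (real compilers use faster algorithms such as Lengauer-Tarjan; the representation is invented for the example):

```python
def dominators(preds, entry):
    """preds maps each node to its list of predecessors; all nodes reachable."""
    nodes = set(preds)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Diamond CFG: entry -> a, entry -> b, then both join at exit.
preds = {"entry": [], "a": ["entry"], "b": ["entry"], "exit": ["a", "b"]}
print(dominators(preds, "entry"))
# entry dominates every node; neither a nor b dominates exit
```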
### Interference graphs
Interference graphs are used for register allocation. An interference graph is a graph where nodes represent the variables in a program, and edges represent interference between them: two variables interfere if they are live at the same time. If two variables do not interfere, the same register can be used for both of them, reducing the number of registers needed.
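Register allocation is often phrased as coloring the interference graph with $$k$$ colors, one per register. A greedy sketch (a toy of my own; production allocators use Chaitin/Briggs-style coloring with proper spilling heuristics):

```python
def color(graph, k):
    """Color highest-degree nodes first; None means 'spill to memory'."""
    coloring = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        taken = {coloring.get(nbr) for nbr in graph[node]}
        free = [c for c in range(k) if c not in taken]
        coloring[node] = free[0] if free else None
    return coloring

# x interferes with both y and z; y and z can share a register.
graph = {"x": {"y", "z"}, "y": {"x"}, "z": {"x"}}
print(color(graph, 2))   # e.g. {'x': 0, 'y': 1, 'z': 1}
```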
# Nice tips for the final
* The difference between DFAs and NFAs
* Scope of identifiers in object-oriented class systems. See slide number 12 of "Lecture 12: Symbol Tables", CS412/413 Spring 2008 Slide Set by Tim Teitelbaum.
* Classifying different well-known programming languages as strongly/weakly typed and statically/dynamically typed ("given these languages, where would you put them? Ex.: Fortran, COBOL, MATLAB, Python etc.")
* Function call on objects, dynamic dispatch vector, see slide number 6 of "Lecture 22: Implementing Objects", CS412/413 Spring 2008 Slide Set by Tim Teitelbaum.
* When to implement what optimization, see slide number 27 of "Lecture 23: Introduction to Optimizations", CS412/413 Spring 2008 Slide Set by Tim Teitelbaum.
* FP ⊑ MFP ⊑ MOP ⊑ IDEAL, important in compiler technology; lecture 27, slide 22; also slide 15 in the next set
* Multiple choice tip for compilers: is C weakly typed? Yes.