Table of Contents

Introduction
1. The structure of a compiler
  1. Semantic Analysis
2. Programming Language Basics
  1. Aliasing
A Simple Syntax-Directed Translator
Lexical Analysis
1. The Lexical-Analyzer Generator Lex
Syntax Analysis
Intermediate-Code Generation
Run-Time Environments
Code Generation
1. Addresses in the Target Code
2. Instruction Selection by Tree Rewriting
  1. Tree-Translation Schemes
  2. Code Generation by Tiling an Input Tree
Machine-Independent Optimizations
Interprocedural Analysis
Other definitions
1. Loop Unrolling

‹

4205: Compiler Construction

Tags:

komptek

Introduction

The structure of a compiler

The compiling process has two parts. The analysis part breaks up the source program and imposes a grammatical structure on them. In addition to this it will alert the user about illegal input (grammar, syntactical etc.) and fill inn the symbol table. The synthesis part constructs the target program from the intermediate representation and the symbol table. The parts are called front end and back end respectively.

Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency. It also gathers type information and saves it to the symbol table. An important part here is type checking where it checks that operators have matching operands. It might also handle coercions, i.e. transforming INT + FLOAT to FLOAT + FLOAT.

Programming Language Basics

Aliasing

There is as interesting consequence of call-by-reference (references passed by value). Two formal parameters can refer to the same location, and they are alias of each other. You need to understand this in order to optimize a program, because certain optimizations may only be done if one can be sure there is no aliasing.

A Simple Syntax-Directed Translator

Introduction

Intermediate code-generation comes in two forms, abstract syntax tree which represents the hierarchical syntactic structure of the source program. The other one is three address instructions.

Syntax-Directed Translation

Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar.

Synthesized Attributes

The idea of associating quantities with programming constructs can be expressed in terms of grammars. We attach rules to productions of the grammar. A syntax-directed definition associates:

With each grammar symbol, a set of attributes
With each production, a set of semantic rules for computing values of the attributes associated with the symbols appearing in the production.

An attribute is synthesized if its value at a parse-tree node N is determined from attribute values at the children of N and N itself, and can be evaluated by a single bottom-up traversal.

A Translator for Simple Expressions

A syntax-directed translation scheme often serves as the specification for a translator. We might come up in a conflict between a translatable and parsable grammar.

Abstract and Concrete Syntax

Useful starting point for creating a translator. AST resemble parse trees to an extent, but these are conceptually different. Parse trees have the interior nodes representing non-terminals, and a syntax tree represent programming constructs. Many productions represent grammar constructs, but many are just there to simplify parsing. The helpers are dropped in the syntax tree.

Lexical Analysis

A lexical analyzer reads characters from the input and groups them into “token objects” with attribute values. A sequence that comprises a single token is called a lexeme. An num can have a value and an id can have a lexeme.

Removal of White Space and Comments

White space and comments can be removed by the lexer, so the parser won't have to worry.

Intermediate Code Generation

Construction of Syntax Trees

Syntax trees can be created for any construct, not just expressions. Each construct is represented by a node, with children for the semantically meaningful components of the construct. These nodes can be represented in a class structure with an arbitrary amount of children.

Lexical Analysis

The Lexical-Analyzer Generator Lex

This tool allows one to specify a lexical analyzer by specifying regular expressions to describe tokens. The expressions are transformed to deterministic finite automatas. Use:

Lex source program $\to$ lex compiler $\to$ c source program
C source program $\to$ C compiler $\to$ Lexer executable
Input stream $\to$ Lexer executable $\to$ Sequence of tokens

Syntax Analysis

Every program has rules that describes the syntactic structure. These rules are normally expressed as context-free grammars. Grammars have significant benefits:

A grammar gives a precise, yet easy-to-understand syntactic specification of a programming language
Certain classes of grammars can be deterministically and efficiently parsed, and the process of making the parser might reveal ambiguous constructs in the language.

Top-Down Parsing

The problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first). Will generate the leftmost derivation. At each step the problem is to determine the production to apply for each non-terminal. A predictive recursive-descent parser can choose productions by looking at the next symbol. We call this the LL(k) class.

Bottom-Up Parsing

A bottom-up parse corresponds to the construction of a parse tree for an input string that begins at the leaves (the bottom) and working up towards the root (the top). A general style of bottom-up parsing is shift-reduce parsing, which can be built by using LR grammars.

Reductions

We can think of bottom-up parsing as the process of reducing a string w to the start symbol of the grammar. Each step matches a substring and replaces it with a non-terminal. By definition, a reduction is the reverse of a step in a derivation. The goal is to construct a derivation in reverse, which in fact will be a rightmost derivation.

Using Ambiguous Grammars

Any ambiguous grammar fails to be in LR. Ambiguous grammars might still be useful, because of it’s expressiveness. We can specify disambiguating rules, although these should be used with care.

Precedence and Associativity to Resolve Conflicts

We might alter productions to give precedence (E $\to$ E + E | E * E | (E) | id). This might not always be the optimal approach, because the parser will have more states, and it is hard to modify precedence. We can express conflicts with precedence if we have different operators, associativity when considering equal operators.

Parser Generators

yacc (yet another compiler-compiler) can construct a translator for a programming language using a specification. It will parse with the LALR method. It is used the same way as lex (from specification to program that can read an input stream). The yacc specification file has three parts:

Declarations
Translation rules
Supporting C routines

The Declarations Part

Consists of two parts, normal C declarations and declarations of grammar tokens. The grammar tokens will then be available for the second and third part. You can make the tokens available for Lex as well.

The Translation Rules Part

Put rules here, where each rule consists of grammar productions and associated semantic actions. Unquoted means non-terminals, quoted means terminals. \$\$ means the attribute value associated with the head, \$i means the ith grammar symbol.

The Supporting C-Routines Part

Must provide yylex() (might come from lex/flex) which produces tokens.

Using Yacc with Ambiguous Grammars

Some grammars might produce conflicts, and it will report so. Unless instructed otherwise, it will resolve all conflicts with following rules:

A reduce/reduce is resolved by choosing the conflicting production listed first in the specification.
A shift/reduce conflict is resolved in favor of shift (will resolve dangling-else ambiguity correctly).

These rules might not be what the compiler writer wants, so we can choose associativity (%left ’+’ ’-’) or force it to be non-associative. You can also force precedence.

Intermediate-Code Generation

Type Checking

To do type checking a compiler needs to assign a type expression to each component of the source program. It must then determine that these types conform to the rules (type system) of a program. It has the potential of catching errors. A sound type system eliminates the need for dynamic checking for type errors. A language is strongly typed if a compiler guarantees that the program will run without type errors.

Static/dynamic typed is about WHEN type information is acquired (compile-time or run-time).
Strongly/weakly typed is about how strictly types are distinguished.

Rules for Type Checking

Two forms, synthesis which builds up the type of an expression from the types of its sub-expressions and type inference which determines the type of a language construct from the way it is used. The latter can be used for languages that permits names to be used without declaration, but still uses type checking. void is a special type that denote the absence of a value.

Type Conversions

You have widening (preserve information) and narrowing (may lose information). If type conversions are done automatically by the compiler, it is implicit. These conversions are also known as coercions, and is normally limited to widening conversions. Two functions, max(a,b) finds the highest type in the hierarchy and widen() which widens a type.

Control Flow

Backpatching

A key problem when generating code for boolean expressions and flow of control statements is that of matching a jump instruction with the target of the jump. An approach to this is called backpatching, in which lists of jumps are passed as synthesized attributes. When a jump is generated, the target of the jump is temporarily left unspecified.

One-Pass Code Generation using Backpatching

Backpatching can be used to generate code for boolean expressions and flow-of-control statements in one pass. We use synthesized attributes truelist and falselist to manage labels in jumping code for bool expressions.

Backpatching for Boolean Expressions

Flow-of-Control Statements

Run-Time Environments

A compiler must implement the abstractions in the source language. In addition to names, scopes, bindings etc, it must also cooperate with the operating system and other system software. To do this, the compiler creates an run-time environment which it assumes the program are being executed. This deals with storage, naming, mechanisms to access variables, parameters, return values, interface to the operating system etc.

Storage Organization

In the perspective of the compiler writer, the program runs in its own logical address space, where the OS maps from logical to physical. Normal setup is Code $\to$ Static $\to$ Heap $\to$ Free Memory $\to$ Stack. The amount of storage needed is determined by the types in the program. Many machines require the data to be aligned, that all addresses must be divisible by 4. The compiler may omit this limitation by packing data. Code size and static data is known compile-time.\ \ You have two areas for dynamic storage allocation, the stack and the heap, which sizes can change during program execution. Some programs allocate the heap automatically and has a garbage collection, some must have the programmer explicitly allocate space on the heap.

Static Versus Dynamic Storage Allocation

A storage allocation is static (compile time) if the decision can be made at looking at the text of a program. A storage allocation is dynamic if it can be decided only when the program is running. Two techniques:

Stack
Heap. For data that might outlive the call to the procedure that created it.

Stack Allocation of Space

Almost all compilers for procedures/methods will use the stack. Will make sure no variables overlap in time and the relative addresses are always the same.

Activation Records

Procedure calls and returns are usually managed by a run-time stack (Control stack). Each live activation has an activation record (frame). Might include:

Temporary values
Local data
A saved machine status
Access link
Control link, pointing to the activation record of the caller.
Space for return value
The actual parameters used by the calling procedure.

Heap Management

For data that must be explicitly deleted.

The Memory Manager

The memory manager keeps track of all the free space in heap storage at all times. Performs:

Allocation. Must provide a memory region in requested size.
De-allocation. For reusing space.

Code Generation

The final phase in the compiler is the code generator. It takes IR and input and produces a semantically equivalent target program. It must

Preserve semantic meaning
Make efficient use of resources on target machine
The code generator itself must run efficiently

Making optimal code are undecidable, but many useful methods with heuristics exists which will improve program performance by many factors.

Addresses in the Target Code

Instruction Selection by Tree Rewriting

Instruction selection can be a large combinatorial task, especially on CISC. We treat instruction selection as a tree-rewriting problem. Here we can make efficient algorithms.

Tree-Translation Schemes

The input of code-generation will be a sequence of trees at the semantic level. We need to apply a sequence of tree-rewriting rules to reduce the input tree to a single node. Each rule represents the translation of a portion of the tree given by the template. The replacement is called tiling of the subtree

Code Generation by Tiling an Input Tree

Given an input tree, the templates in the tree-rewriting rules are applied to tile its subtrees. When match is found, the subtree is replaced with replacement node of a rule and action associated with the rule is done (generate code). Two issues:

How is the tree-pattern matching to be done?
What if we have more matches?

Must always have a match and not infinite amount of matches. You can create a postfix representation of a tree and create a grammar for it. By reduce-reduce, take largest production, by shift-reduce, shift. This is known as maximal munch[^1]. This is a good and well understood method, but it has its challenges.

Machine-Independent Optimizations

High-level language constructs can introduce substantial run-time overhead if we naively translate each construct independently. We have three forms:

Local code optimization - within a block
Global code optimization - across basic blocks
Inter-procedural code optimization - across functions and procedures

Most global optimization are based on data-flow analyses.

The Principal Sources of Optimization

A compiler optimization must preserve the semantics of the original program, and the optimizer is usually only able to apply low-level semantic transformations.

Causes of redundancy

Sometimes redundancy is available at source level, but often introduced because the source is written in a high level language (array accesses will cause redundancy). By having a compiler eliminate the redundancies, we get the best of both worlds: the programs are both efficient and easy to maintain.

A Running Example: Quicksort

blabla

Semantics-Preserving Transformations

A compiler can improve a program without changing the function it computes.

Global Common Sub-expressions

An occurrence of an expression $E$ is called a common sub-expression if $E$ was previously computed and the values of the variables in $E$ have not changed since the previous computation.

Copy Propagation

A copy statement is an assignment on the form $u = v$. The idea is to use $v$ for $u$ in other statements using $u$. It may not appear to be an improvement, but dead-code elimination gives the opportunity to get rid of the copy statement.

Dead-Code Elimination

A variable is live at a point in a program if its value can be used subsequently, otherwise dead. We would like to remove dead statements. Copy propagation makes dead code elimination able to delete many copy statements.\ \ Deducing at compile time that a value of an expression is known as constant folding.

Code Motion

Loops are very important place for optimizations. Code motion takes an expression that evaluates to the same each time and lifts it out of the loop.

Induction Variables and Reduction in Strength

A variable x is said to be an induction variable if there is a positive or negative constant c such that each time x is assigned, its value increases by c. They can be computed with a single increment per loop iteration. Transforming from an expensive to a cheap operation is strength reduction, which you can do on induction variables. It’s normal to work inside-out when optimizing loops. When there are two or more induction variables in a loop, it might be possible to get rid of all but one.

Introduction to Data-Flow Analysis

Data-flow analysis refers to a body of techniques that derive information about the flow of data along program execution paths. You can answer many questions about optimization, like global common sub-expression and dead code elimination.

The Data-Flow Abstraction

You can view the program as a series of transformations, where the states are between the statements. Control flow will go from statement to statement in single blocks, and go between blocks if there is an edge between. Execution path is the instructions executed, which there normally is an infinite number of. Different analyses may choose to abstract out different information.\ \ The definitions that may reach a program point along some path are known as reaching definitions.

The Data-Flow Analysis Schema

At each program point we associate a data-flow value which is the set of all possible program states that can be observed for that point. We denote the data-flow values before and after each statement $s$ by $IN[s]$ and $OUT[s]$, and the data-flow problem is to find a solution to a set of constraints on these definitions for all statements $s$.

Transfer Functions

The relationship between the data-flow values before and after the assignment statement is known as a transfer function. Might have two directions (forward and backward analysis).

Control-Flow Constraints

You can create $IN[B]$ and $OUT[B]$ by running through all the statements.

Data-Flow Schemas on Basic Blocks

Going through a block is easy, just follow each statement, and we get:

$$OUT[B] = f_b(IN[B])$$

If we do constant propagation, we are interested in the sets of constants that may be assigned to a variable, then we have forward flow with union operator.

$$IN[B] = \cup_{P a predecessor of B}OUT[P]$$

No unique solution, but find the most precise that satisfies control-flow and transfer constraints.

Reaching Definitions

One of the most common and useful data-flow schemas. We say a definition $d$ reaches a point $p$ if there is a path from the point immediately following $d$ to $p$, such that $d$ is not killed along that path. We might have aliases, and that is why we must be conservative.

Transfer Equation for Reaching Definitions

$$f_d(x) = gen_d \cup (x - kill_d)$$

The gen-set contains all definitions that are directly visible after the block (will only generate last assignment of each variable), but the kill-set kills all definitions of assignment (even those inside the same block).

Control-Flow Equations

Union is the meet operator for reaching definitions. It creates a summary of the contributions from different paths at the confluence of those paths.

Live-Variable Analysis

Useful for register allocation for basic blocks (dead values need not be stored). We use backward analysis and we define:

$def_B$ as the set of variables defined in the block.
$use_B$ as the set of variables used in the block.

About the same as the reaching definitions, except opposite direction:

$$IN[B] = use_b \cup (OUT[B] - def_B)$$

Available Expressions

An expression $x + y$ is available at a point $p$ if every path from the entry node to $p$ evaluates $x + y$, and after the last such evaluation prior to reaching $p$, there are no subsequent assignments to $x$ or $y$. A block kills expression $x + y$ if it assigns (or may assign) $x$ or $y$ and does not subsequently re-compute $x + y$. A block generates an expression $x + y$ if it evaluates it and not subsequently define $x$ or $y$. The primary use of available expressions is detecting global common sub-expressions. At each step ($x = y + z$):

Add to $S$ the expression $y + z$.
Delete from $S$ any expression involving $x$.

Must be done in correct order. Data-flow can be done the same way as reaching definitions (forward analysis), but the meet operator is intersection rather than union. An expression is available at the top of a block if it is available at the end of each predecessor block.\ \ In reaching definitions, we start with a set too small and work up, in available expression, we work with a set too large and work down. We start with the assumption that everything is available everywhere, to keep the optimization conservative.

Loops in Flow Graphs

Loops when considering optimizations are important because programs spend most of their time executing them. Additionally, program analysis can happen faster if it contains no loops, because you only need one pass through the source code.

Dominators

A node d of a flow graph dominates node n, written d dom n, if every path from the entry node of the flow graph to n goes through d. Every node dominates itself. The information can be presented in a dominator tree, where entry is the root and each node dominates its descendants. The problem can be formulated as a forward data-flow analysis.

Edges in a Depth-First Spanning Tree

We can traverse it with DFS, getting following edges

Advancing edges
Retreating edges
Cross edges

Interprocedural Analysis

An interprocedural analysis operates across an entire program. One simple and useful technique is inlining of functions, where you replace the function invocation with the body of the function itself. You cannot immediately inline recursive methods, and inlining may expand code size exponentially (cache miss). Advantages of function inlining is improved speed, due to no method invocation overhead.

Other definitions

Loop Unrolling

Loop unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism. Will increase code size, and hereby introduce more cache misses and memory references.

[^1]: From wikipedia: In computer programming and computer science, “maximal munch” or “longest match” is the principle that when creating some construct, as much of the available input as possible should be consumed.

Written by

ilsegv Stian Jensen sigveseb kulia pbsds perod hawk_aa

Last updated: Thu, 23 May 2019 18:45:47 +0200 .