Table of Contents
  1. Introduction
    1. Disclaimer
    2. Specialization course
    3. The hardware/software codesign flow
  2. Estimation and partitioning
    1. Architecture exploration
    2. Co-simulation
    3. Partitioning
      1. Formulas
        1. Objective function
        2. Cost function
        3. Closeness function
      2. Local minima
      3. Designer interaction
      4. Algorithms
        1. Random mapping
        2. Hierarchical clustering
        3. Multi-stage clustering
        4. Group Migration
        5. Ratio cut
        6. Simulated annealing
        7. Genetic evolution
    4. Estimators
      1. Quality
        1. Speed
        2. Accuracy
        3. Fidelity
      2. Simulated annealing
  3. Multiprocessors and accelerators
    1. Speed-up formula
  4. Instruction set extensions
    1. Generator
      1. ISEGEN
  5. On-chip communication architectures
    1. Bus throughput
    2. Split Transfer
    3. Arbitration schemes
      1. Time Division Multiple Access (TDMA)
      2. Round Robin (RR)
      3. TDMA/RR
  6. Implementing embedded systems
    1. Address space sharing and folding
  7. Retargetable compilers
    1. Acronyms
    2. Code-selection
    3. Loop optimization techniques
      1. Loop permutation
      2. Loop fusion
      3. Loop fission
      4. Loop unrolling
      5. Loop splitting
  8. High Level Synthesis
    1. Academic (non-commercial) tools
    2. Optimization methods
  9. Uncategorized topics
    1. Pareto curve example
    2. Closeness vs. cost
    3. Homogeneous and heterogeneous specification

TFE02: Hardware software codesign with embedded systems

Tags: sw, fpga, hw

Introduction

Disclaimer

This compendium is a work in progress and needs review, and possibly restructuring and citations. The current structure is based on the ordering and naming of lectures from the 2017 semester.

Specialization course

This subject is part of the Design of Digital Systems, Specialization Course (TFE4525).

The hardware/software codesign flow

(Figure: the hardware/software codesign flow.)

Estimation and partitioning

Estimation and partitioning are important for evaluating the quality of an architecture.

Architecture exploration

Architecture exploration is the process of mapping out the hardware composition of a system. This is typically done at an early stage and may include a choice of processors, application specific hardware (such as ASICs), reprogrammable hardware (such as FPGAs) and memories.

Co-simulation

Co-simulation is the simulation of software and hardware together using computer-aided design (CAD) tools. It may be done at different levels of abstraction, ranging from physical models and RTL to system-level modelling.

SystemC is a noteworthy co-simulation tool.

Partitioning

Following architecture exploration is typically partitioning, where functionality of the application is grouped. The groups are mapped onto architectural components.

The partitioning problem: Given a set of objects $O=\{o_1,o_2,...,o_n\}$, determine a partition $P=\{p_1,p_2,...,p_m\}$ such that $p_1 \cup p_2 \cup ... \cup p_m = O$, $p_i \cap p_j = \emptyset, \quad\forall\{i,j\}, \quad i\neq j$, and the cost determined by an objective function $Objfct(P)$ is minimal.

(Figure: typical configuration of a partitioning system.)

Possible benefits:

  • Virtual prototyping (actual system specific hardware not needed)
  • Coarse simulation (reduced accuracy at the benefit of fast results)
  • Increased quality and coverage of verification (by utilizing both simulation and hardware prototyping)
  • Bugs may be fixed in the hardware itself, more easily and efficiently than through software workarounds

All of these may reduce overall design and verification time.

Formulas

Objective function

$$ Objfct = k_1 \cdot area + k_2 \cdot delay + k_3 \cdot power $$

The area, delay and power metrics are often substituted with functions accounting for constraints. For example, if the area is above 100 % utilization, the cost of area should be increased significantly in order to balance the objective function for realistic partitioning.
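
A minimal Python sketch of such a constraint-aware objective function; the weights, limits and the shape of the penalty are illustrative assumptions, not values from the course:

def penalized(metric, limit):
    """Normalized metric, heavily penalized once it exceeds its constraint."""
    if metric <= limit:
        return metric / limit                        # within budget
    return 1.0 + 10.0 * (metric - limit) / limit     # steep penalty past the limit

def objfct(area, delay, power, k=(1.0, 1.0, 1.0), limits=(100.0, 50.0, 10.0)):
    k1, k2, k3 = k
    a_lim, d_lim, p_lim = limits
    return (k1 * penalized(area, a_lim)
            + k2 * penalized(delay, d_lim)
            + k3 * penalized(power, p_lim))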

Cost function

In a simplified model, the cost function of a partitioning may be defined as $$ C_{\text{tot}} = \sum_{P_i \in \text{SW}}C_{\text{SW}}(P_i) + \sum_{P_i \in \text{HW}}C_{\text{HW}}(P_i) + \sum_{P_{ij} \in (\text{SW} \leftrightarrow \text{HW})}C_{\text{comm}}(P_{ij}) $$
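
A direct Python transcription of this model; the per-object estimators and the list of communicating pairs are assumed inputs:

def total_cost(sw, hw, edges, c_sw, c_hw, c_comm):
    """sw, hw: sets of objects; edges: iterable of communicating pairs (a, b)."""
    cost = sum(c_sw(p) for p in sw) + sum(c_hw(p) for p in hw)
    # only channels crossing the HW/SW boundary contribute communication cost
    cost += sum(c_comm(a, b) for (a, b) in edges if (a in sw) != (b in sw))
    return cost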

Closeness function

...

Local minima

Some partitioning algorithms can escape local minima; such algorithms are called hill-climbing. Their counterparts are called greedy, meaning they only accept moves that immediately improve cost or closeness.

Both types of algorithms can be combined to optimize solving the partitioning problem.

Designer interaction

Both human designers and synthesis tools benefit from interacting with the partitioning methods through the following choices:

  • Object granularity
  • Allocation of system components
  • Quality metrics
  • Objective and closeness functions
  • Algorithm choice(s)

Algorithms

A system with $n$ objects and $m$ system components has $m^n$ possible mappings.
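
For example, a pure HW/SW partitioning ($m = 2$) of $n = 20$ objects already has $2^{20} \approx 10^6$ possible mappings, so exhaustive search quickly becomes infeasible.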

Partitioning algorithms are divided into constructive and iterative algorithms, depending on whether they build a partition from the ground up or improve an existing partition.

Random mapping

Type: constructive
Complexity: $\mathcal{O}(n)$
Grouping metric: (irrelevant)
Search type: (irrelevant)

Hierarchical clustering

Type: constructive
Complexity: $\mathcal{O}(n^2)$
Grouping metric: closeness
Search type: greedy

May be applied to partitioning problems with simplified hardware/software processing and communication costs.

(example in 2015 exam solution)

Simplified heuristic steps (sketched in Python after the list):

  • Initialize each object as its own group
  • Compute closenesses between objects
  • Merge closest objects and recompute closenesses
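
A minimal Python sketch of these steps, assuming a user-supplied symmetric closeness function (higher means closer) and an assumed average-linkage rule for group closeness:

import itertools

def hierarchical_clustering(objects, closeness, n_groups):
    """Greedy agglomerative clustering sketch.
    objects: list of hashable objects; closeness(a, b): symmetric score,
    higher means the two objects benefit more from being grouped."""
    groups = [frozenset([o]) for o in objects]  # each object starts as its own group

    def group_closeness(g1, g2):
        # assumed merge rule: average pairwise closeness between the groups
        return sum(closeness(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

    while len(groups) > n_groups:
        # merge the closest pair of groups, then recompute (greedy: no undo)
        g1, g2 = max(itertools.combinations(groups, 2),
                     key=lambda pair: group_closeness(*pair))
        groups.remove(g1); groups.remove(g2)
        groups.append(g1 | g2)
    return groups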


Further reading at Wikipedia: Hierarchical clustering.

Multi-stage clustering

Essentially the same as hierarchical clustering, except that the closeness metric may vary between clustering stages.

Group Migration

Type: iterative
Complexity: $\mathcal{O}(n^3)$ (or $\mathcal{O}(n^2)$ or even $\mathcal{O}(n)$)*
Grouping metric: cost
Search type: hill-climbing

* depending on constant $Objfct$ and use of structural partitioning

The Kernighan–Lin minimal cut algorithm is a hill-climbing, heuristic algorithm for finding partitions of graphs. It can be used to determine good distributions of functionality between hardware and software, e.g. for accelerators.

Read more about this algorithm over at Wikipedia.

Simplified heuristic (a Python sketch follows the list):

  • Initialize
  • Create a sequence of n moves
    • Tentatively move each object $o$ once, committing the move that results in $\min(C)$
  • If a better cost was found
    • then: Update P, repeat
    • else: exit
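
A compact Python sketch of one such pass over a two-way HW/SW partition; the cost function is a user-supplied assumption (e.g. the simplified $C_{\text{tot}}$ above):

def group_migration_pass(sw, hw, cost):
    """One Kernighan-Lin-style pass (hill-climbing): every object is moved
    exactly once, always committing the cheapest move, and the best partition
    seen along the move sequence is kept."""
    sw, hw = set(sw), set(hw)
    best_cost, best = cost(sw, hw), (set(sw), set(hw))
    unmoved = sw | hw
    while unmoved:
        candidates = []
        for o in unmoved:
            s, h = set(sw), set(hw)
            if o in s:
                s.discard(o); h.add(o)      # tentative SW -> HW move
            else:
                h.discard(o); s.add(o)      # tentative HW -> SW move
            candidates.append((cost(s, h), o, s, h))
        # commit the cheapest move of this round, even if it worsens the cost
        c, o, sw, hw = min(candidates, key=lambda m: m[0])
        unmoved.discard(o)
        if c < best_cost:
            best_cost, best = c, (set(sw), set(hw))
    return best_cost, best   # rerun the pass while the best cost keeps improving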

Example: Initial partitioning and costs

| Evaluate | In software | In hardware | Cost* |
| --- | --- | --- | --- |
| Starting point | $P_1$, $P_2$ | $P_3$, $P_4$ | 32 |
| Move $P_1$ | $P_2$ | $P_1$, $P_3$, $P_4$ | 35 |
| Move $P_2$ | $P_1$ | $P_2$, $P_3$, $P_4$ | 29 |
| Move $P_3$ | $P_1$, $P_2$, $P_3$ | $P_4$ | 32 |
| Move $P_4$ | $P_1$, $P_2$, $P_4$ | $P_3$ | 37 |
| Move best ($P_2$) | $P_1$ | $P_2$, $P_3$, $P_4$ | 29 |
| Move $P_1$ | (none) | $P_1$, $P_2$, $P_3$, $P_4$ | 23 |
| Move $P_3$ | $P_1$, $P_3$ | $P_2$, $P_4$ | 35 |
| Move $P_4$ | $P_1$, $P_4$ | $P_2$, $P_3$ | 34 |
| Move best ($P_1$) | (none) | $P_1$, $P_2$, $P_3$, $P_4$ | 23 |
| Move $P_3$ | $P_3$ | $P_1$, $P_2$, $P_4$ | 29 |
| Move $P_4$ | $P_4$ | $P_1$, $P_2$, $P_3$ | 28 |
| Move best ($P_4$) | $P_4$ | $P_1$, $P_2$, $P_3$ | 28 |
| Move $P_3$ | $P_3$, $P_4$ | $P_1$, $P_2$ | 26 |
| Move best ($P_3$) | $P_3$, $P_4$ | $P_1$, $P_2$ | 26 |
| Keep best partition | (none) | $P_1$, $P_2$, $P_3$, $P_4$ | 23 |
| (start new round) | | | |

* Calculated using the Cost function formula above.

Ratio cut

Type: constructive
Complexity: unknown
Grouping metric: closeness
Search type: greedy

Reduces cut sizes without grouping distant objects and without constraining group sizes.

$$ ratio = \frac{cut(P)}{size(p_1) \times size(p_2)} $$
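
As an illustration (assumed numbers): a cut of 4 edges between groups of sizes 2 and 8 gives $ratio = 4/(2 \times 8) = 0.25$, while the same cut between two groups of size 5 gives $4/(5 \times 5) = 0.16$; minimizing the ratio therefore favours balanced groups.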

Simulated annealing

Type: constructive
Complexity: unknown
Grouping metric: cost
Search type: hill-climbing

Similar to Group Migration, except that it, for example:

  • may move each object more than once
  • accepts any move that improves cost
  • uses randomness

Theoretically provides the globally optimal solution, given equilibrium at every temperature and an infinitesimal temperature decrease.

Execution time and result depend heavily on the choice of the Accept, Equilibrium, DecreaseTemp and Frozen functions in the following algorithm.

Algorithm:

temp = initial temperature
cost = Objfct(P)
while not Frozen loop
    while not Equilibrium loop
        P_tentative = RandomMove(P)
        cost_tentative = Objfct(P_tentative)
        Δcost = cost_tentative - cost
        if (Accept(Δcost,temp) > Random(0,1)) then
            P = P_tentative
            cost = cost_tentative
        end if
    end loop
    temp = DecreaseTemp(temp)
end loop
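
A runnable Python version of the pseudocode above, with one common (assumed) choice for each of the four policies: Metropolis acceptance, a fixed number of moves per temperature as the equilibrium criterion, geometric cooling, and a freeze threshold:

import math
import random

def simulated_annealing(P, objfct, random_move, temp=100.0,
                        cooling=0.95, moves_per_temp=100, frozen=0.01):
    cost = objfct(P)
    while temp > frozen:                        # Frozen: stop at a low temperature
        for _ in range(moves_per_temp):         # Equilibrium: fixed number of moves
            P_tent = random_move(P)
            d = objfct(P_tent) - cost
            # Metropolis acceptance: improvements always, worsening moves with
            # probability exp(-d / temp) -- this is the hill-climbing part
            if d < 0 or math.exp(-d / temp) > random.random():
                P, cost = P_tent, cost + d
        temp *= cooling                          # DecreaseTemp: geometric cooling
    return P, cost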

Genetic evolution

Type: constructive (starts with random partitions)
Complexity: unknown
Grouping metric: cost
Search type: hill-climbing

  • Emulates evolution through a set of methods called selection, crossover, and mutation.
  • Requires more memory, since it keeps generations of multiple partitions.
  • Long run times, similar to simulated annealing.

Estimators

Quality

There are three main factors considered when determining the quality of an estimator, namely speed, accuracy and fidelity.

Speed

The speed of an estimator indicates the time it takes to produce an estimate. This may be the essential metric in iterative improvement (e.g. Group Migration).

Accuracy

The accuracy indicates the quality of estimates with respect to real implementation results, where $E_i$ and $M_i$ denote the estimated and measured values of sample $i$. It can be quantified using the following formula.

$$ \mathcal{A} = \frac{1}{N}\sum_{i=1}^{N}{\left(1-\frac{|E_i-M_i|}{M_i}\right)} $$

Fidelity

The term fidelity describes the quality of an estimation methodology: the fraction of solution pairs for which the estimates predict the same ordering as the actual measured results.

We use the following formula for fidelity:

$$ \mathcal{F} = \frac{2}{n(n-1)}\sum_{i=1}^{n-1}{\sum_{j=i+1}^{n}{\mu_{ij}}}, \quad \mu_{ij} = (E_i<E_j \land M_i<M_j) \lor (E_i>E_j \land M_i>M_j), $$

where $E_i$ and $M_i$ are the estimated and measured values of each solution sample.

Example:

We have estimated the power of different solutions to an architecture. The model is sampled with five implementation variants.

| Solution index | E | M |
| --- | --- | --- |
| 1 | 40 | 22 |
| 2 | 50 | 54 |
| 3 | 60 | 73 |
| 4 | 70 | 62 |
| 5 | 80 | 72 |

$$\begin{matrix} \mu_{12}=1&\mu_{13}=1&\mu_{14}=1&\mu_{15}=1\\ \mu_{23}=1&\mu_{24}=1&\mu_{25}=1\\ \mu_{34}=0&\mu_{35}=0\\ \mu_{45}=1 \end{matrix}$$

$$ \mathcal{F} = \frac{2}{5 \cdot 4}(1+1+1+1+1+1+1+0+0+1) = 0.8 = 80\,\% $$
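
Both quality metrics are straightforward to compute. The Python snippet below reproduces the 80 % fidelity of the example above, and also computes the accuracy of the same samples:

from itertools import combinations

E = [40, 50, 60, 70, 80]   # estimated values
M = [22, 54, 73, 62, 72]   # measured values

accuracy = sum(1 - abs(e - m) / m for e, m in zip(E, M)) / len(E)

n = len(E)
# mu_ij: 1 when estimate and measurement agree on the ordering of samples i, j
# (assumes no ties, as in this data set)
fidelity = 2 / (n * (n - 1)) * sum(
    (E[i] < E[j]) == (M[i] < M[j])
    for i, j in combinations(range(n), 2))

print(f"accuracy = {accuracy:.2f}, fidelity = {fidelity:.0%}")  # fidelity = 80%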

Simulated annealing

Simulated annealing is a partition algorithm that models the annealing process in physics, where a material is melted and its minimal energy state is achieved by lowering the temperature slowly enough that equilibrium is reached at each temperature.

– Solution example for 2014 exam


Multiprocessors and accelerators

Speed-up formula

This formula defines the speedup achieved by a specific utilization of hardware accelerators.

$$ S = n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})], $$ where $n$ is the number of iterations, and $t_{in}$ and $t_{out}$ are the times needed to write data to and read results from the accelerator that cannot be overlapped with its execution time $t_x$.

The formula above gives absolute times. A slight rewrite gives the relative (fractional) time saved per accelerated iteration:

$$ S' = 1 - \frac{t_{in}+t_x+t_{out}}{t_{CPU}} $$
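
For illustration (assumed numbers): with $t_{CPU} = 100$, $t_{in} = t_{out} = 10$ and $t_x = 40$ time units, $S' = 1 - 60/100 = 0.4$, i.e. each accelerated iteration saves 40 % of the CPU time, and $n = 1000$ iterations save $S = 1000(100 - 60) = 40\,000$ time units.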

(example in 2015 exam solution)

Instruction set extensions

Generator

ISEGEN

ISEGEN is an instruction set extension generator. It allows the user to optimize a system in which an application runs on a CPU, by identifying chunks of the program that are beneficial to replace with a custom instruction, i.e. an instruction set extension, implemented in hardware.

– Solution example for 2014 exam

SetInitialConditions()
last_best_C ⇐ C
loop (until exit condition)
    best_C ⇐ last_best_C
    while (there exists unmarked node in DFG)
        foreach (unmarked node n)
            Calculate M_toggle(n,best_C)
        endfor
        best_node ⇐ Node with maximum Gain
        Toggle and Mark best_node
        CalcImpactOfToggle(best_node,best_C)
        if (toggling best_node satisfies constraints)
            Update best_C from toggling best_node
            Calculate M(best_C)
        endif
    endwhile
    if (M(best_C) > M(last_best_C))
        last_best_C ⇐ best_C
        Unmark all nodes
    endif
endloop
C ⇐ last_best_C

On-chip communication architectures

Bus throughput

Bus throughput may be defined as

$$ throughput_{bus} = width_{bus} \times clock\_frequency_{bus} $$
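
For example (assumed values), a 32-bit bus clocked at 100 MHz has a peak throughput of $32 \times 100 \cdot 10^6 = 3.2$ Gbit/s; arbitration, protocol overhead and idle cycles reduce the usable share.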

Split Transfer

A slave may (through the arbiter) temporarily split a bus access, allowing another master to use the bus until the original slave is ready to resume communication, at which point the transfer is unsplit.

Arbitration schemes

Time Division Multiple Access (TDMA)

Time Division Multiple Access (TDMA) arbitration assigns a bus access time slot to each master. The length of the slots assigned to each master can vary, depending on its transfer requirements. If a master has nothing to send in its time slot, the time slot is wasted, reducing bandwidth utilization.

Round Robin (RR)

Round Robin (RR) arbitration gives the masters access to the bus in a circular manner. It is simple to implement, and every master will get access to the bus, but with unpredictable delay. Critical data may have to wait a long time.

TDMA/RR

A two-level TDMA/RR scheme first divides the bus bandwidth into a set of time slots, one per master. If a master has nothing to send in its time slot, the slot is offered to the other masters in a circular manner. This increases bandwidth utilization compared with pure TDMA, at the cost of increased arbiter implementation complexity.

– Solution example for 2012 exam
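
The two-level scheme is easy to express in code. Below is a minimal Python sketch; the slot table, master list and the wants_to_send request model are illustrative assumptions, not from the course:

def tdma_rr_arbiter(slot_table, masters, wants_to_send, n_cycles):
    """Two-level TDMA/RR: the slot owner has priority; an idle slot is offered
    to the other masters in round-robin order instead of being wasted."""
    rr = 0                      # round-robin pointer for the second level
    grants = []
    for cycle in range(n_cycles):
        owner = slot_table[cycle % len(slot_table)]     # first level: TDMA
        if wants_to_send(owner, cycle):
            grants.append(owner)
            continue
        granted = None                                  # second level: RR
        for i in range(len(masters)):
            cand = masters[(rr + i) % len(masters)]
            if cand != owner and wants_to_send(cand, cycle):
                granted = cand
                rr = (rr + i + 1) % len(masters)
                break
        grants.append(granted)   # None: slot wasted, as in pure TDMA
    return grants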

Implementing embedded systems

Address space sharing and folding

(example in 2014 exam solution example)

Retargetable compilers

A retargetable compiler is driven by a description of machine resources, such as

  • instruction set
  • register files
  • instruction scheduling

In a retargetable compiler it is possible to edit this (back-end) processor model and thereby retarget the compiler to a new processor.

Acronyms

  • IR: intermediate representation
  • CFG: control flow graph
  • DFG: data flow graph

Code-selection

(example, using DFG, in 2013 exam)

Loop optimization techniques

Techniques for optimizing loops for performance, power or even area. Note that some techniques cannot be applied if, for example, different iterations of the loop depend on each other.

Loop permutation

Primarily intended for optimizing address locality and cache hit rate.

Simplified code example:

for k in range(M):
    for j in range(N):
        foo(p[j][k])

would likely be optimized for cache by swapping the iteration order of the indices, as in the following transformation:

for j in range(N):
    for k in range(M):
        foo(p[j][k])

Loop fusion

Very simply illustrated with the following example transformation. May be useful for cache optimization.

for j in range(N):
    foo(p[j])

for j in range(N):
    bar(p[j])

$$ \implies $$

for j in range(N):
    foo(p[j])
    bar(p[j])

Loop fission

Loop fission is the above transformation in reverse. It is likely favourable on multi-core processors, since the split loops can be processed concurrently.

Loop unrolling

Illustrated in the following example transformation.

$N=15$ loop iterations

for (i = 0; i < 15; i++)
{
    p[i] = foo(i);
}

$$ \implies $$

for (i = 0; i < 15; i+=3)
{
    p[i] = foo(i);
    p[i+1] = foo(i+1);
    p[i+2] = foo(i+2);
}

with an unroll factor $f=3$.

Note that loops with $N \bmod f \neq 0$ must include additional code for the leftover iterations that do not fit in the unrolled loop.
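
A Python sketch of the usual fix, an epilogue loop for the $N \bmod f$ leftover iterations (p, foo and N are assumed to be defined as in the examples above):

f, i = 3, 0
while i + f <= N:                  # unrolled main loop: f iterations per pass
    p[i], p[i+1], p[i+2] = foo(i), foo(i+1), foo(i+2)
    i += f
while i < N:                       # epilogue: the N mod f remaining iterations
    p[i] = foo(i)
    i += 1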

Loop splitting

for j in range(N):
    if j<K:
        foo()
    else:
        bar()

$$ \implies $$

for j in range(K):
    foo()

for j in range(K, N):
    bar()

High Level Synthesis

Benchmarking (2017) shows that commercial tools currently yield generally better results.

Academic (non-commercial) tools

  • Dwarv (CoSy C $\rightarrow$ VHDL)
  • Bambu (GCC C $\rightarrow$ Verilog)
  • LegUp (C $\rightarrow$ RTL)

Optimization methods

  • Bitwidth analysis and optimization
  • Memory space allocation
  • Exploiting spatial parallelism
  • If-conversion
  • Operation Chaining
  • Loop Optimizations
  • Hardware Resource Libraries
  • Speculation and Code Motion

Uncategorized topics

Pareto curve example

(Figure: sample points only, and the same samples with the Pareto curve drawn in.)

Closeness vs. cost

A cost function (or objective function) uses metrics (e.g., area, power consumption, performance) and weights to define the “goodness” of a given solution. Different metrics can have different weights (e.g., area may be more important than power consumption), which is expressed through weighting constants in the function.

A closeness function uses closeness metrics to indicate the desirability of grouping objects. Parts of a system that can use the same hardware, e.g., a multiplier, but at different points in time during execution, can for instance be grouped together. Processes that communicate a lot will also often benefit from being grouped together, and are hence seen as close.

– Solution example for 2012 exam

Homogeneous and heterogeneous specification

...

Written by

larshb
Last updated: Thu, 30 Nov 2017 16:59:21 +0100.