TFE02: Hardware software codesign with embedded systems
# Introduction
## Disclaimer
This compendium is a work in progress and needs review and possibly restructuring and citations. The current structure is based on the ordering and naming of lectures from the 2017 semester.
## Specialization course
This subject is part of the [Design of Digital Systems, Specialization Course (TFE4525)](https://www.wikipendium.no/TFE4525_Design_of_Digital_Systems_Specialization_Course).
## The hardware/software codesign flow

# Estimation and partitioning
Estimation and partitioning are important for evaluating the quality of an architecture.
## Architecture exploration
__Architecture exploration__ is the process of mapping out the hardware composition of a system. This is typically done at an early stage and may include a choice of processors, application specific hardware (such as ASICs), reprogrammable hardware (such as FPGAs) and memories.
## Co-simulation
__Co-simulation__ is the simulation of software and hardware together using computer-aided design (CAD) tools. It may be done at different levels of abstraction, ranging from physical models and RTL to system-level modelling.
SystemC is a noteworthy co-simulation tool.
## Partitioning
Following architecture exploration is typically __partitioning__, where functionality of the application is grouped. The groups are mapped onto architectural components.
__The partitioning problem:__ Given a set of objects $O=\{o_1,o_2,...,o_n\}$, determine a __partition__ $P=\{p_1,p_2,...,p_m\}$ such that $p_1 \cup p_2 \cup ... \cup p_m = O$, $p_i \cap p_j = \emptyset, \quad\forall\{i,j\}, \quad i\neq j$, and the cost determined by an objective function $Objfct(P)$ is minimal.

Typical configuration of a partitioning system.
__Possible benefits:__
* Virtual prototyping (actual system specific hardware not needed)
* Coarse simulation (reduced accuracy at the benefit of fast results)
* Increased quality and coverage of verification (by utilizing both simulation and hardware prototyping)
* Bugs may be fixed more easily and efficiently in hardware rather than in software workarounds
All of these may help reduce the overall design and verification time.
### Formulas
#### Objective function
$$ Objfct = k_1 \cdot area + k_2 \cdot delay + k_3 \cdot power $$
The _area_, _delay_ and _power_ metrics are often substituted with functions that account for constraints. E.g. if the _area_ is above 100% utilization, the cost of area should be significantly increased so that the objective function steers the search towards realistic partitionings.
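The constraint handling described above can be sketched as follows. This is a minimal illustration only: the helper `penalized`, the penalty factor, the default weights and the normalization of all metrics to fractions are assumptions, not part of the course material.

```python
def penalized(value, limit, penalty=10.0):
    """Return the metric unchanged within its limit; scale the excess by `penalty`."""
    if value <= limit:
        return value
    return limit + penalty * (value - limit)


def objfct(area, delay, power, k=(1.0, 1.0, 1.0), area_limit=1.0):
    """Weighted sum k1*area + k2*delay + k3*power with a penalized area term.

    `area` is a utilization fraction (1.0 == 100 %); all inputs are assumed
    to be normalized so that the weights are comparable.
    """
    k1, k2, k3 = k
    return k1 * penalized(area, area_limit) + k2 * delay + k3 * power


within = objfct(area=0.9, delay=0.5, power=0.3)  # area constraint met
over = objfct(area=1.1, delay=0.5, power=0.3)    # 10 % over the area budget
print(within, over)
```

With these assumed numbers, exceeding the area budget by 10 percentage points costs far more than the raw increase in area would suggest, which is the intended effect of the penalty.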
#### Cost function
The cost function of partitioning may be defined in a simplified model as such
$$ C_{\text{tot}} = \sum_{P_i \in \text{SW}}C_{\text{SW}}(P_i) + \sum_{P_i \in \text{HW}}C_{\text{HW}}(P_i) + \sum_{P_{ij} \in (\text{SW} \leftrightarrow \text{HW})}C_{\text{comm}}(P_{ij}) $$
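As a rough sketch, the simplified cost model can be evaluated like this. The function name `total_cost`, the dictionary layout and all cost numbers are hypothetical and only serve to illustrate the three sums.

```python
def total_cost(mapping, c_sw, c_hw, c_comm):
    """Sum SW costs, HW costs and communication costs across the SW/HW boundary.

    mapping: dict object -> "SW" or "HW"
    c_sw, c_hw: dict object -> cost when implemented in software / hardware
    c_comm: dict (object_a, object_b) -> cost if the two end up on different sides
    """
    cost = sum(c_sw[o] if side == "SW" else c_hw[o] for o, side in mapping.items())
    cost += sum(c for (a, b), c in c_comm.items() if mapping[a] != mapping[b])
    return cost


# Hypothetical numbers for three processes.
c_sw = {"P1": 4, "P2": 6, "P3": 9}
c_hw = {"P1": 7, "P2": 3, "P3": 2}
c_comm = {("P1", "P2"): 5, ("P2", "P3"): 1}

mapping = {"P1": "SW", "P2": "HW", "P3": "HW"}
print(total_cost(mapping, c_sw, c_hw, c_comm))  # 4 + 3 + 2 + 5 = 14
```

Only the $P_1 \leftrightarrow P_2$ link crosses the SW/HW boundary here, so only its communication cost is counted.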
#### Closeness function
...
### Local minima
Some partitioning algorithms are able to escape __local minima__ by temporarily accepting moves that worsen the cost; these are called _hill-climbing_ algorithms. Their counterparts are called _greedy_ algorithms, meaning they only accept _moves_ that immediately improve _cost_ or _closeness_.
Both types of algorithms can be combined to optimize solving the partitioning problem.
### Designer interaction
Both human designers and synthesizers benefit from interaction with the partitioning methods through the following selections:
* Object granularity
* Allocation of system components
* Quality metrics
* Objective and closeness functions
* Algorithm choice(s)
### Algorithms
A system with $n$ objects and $m$ system components has $m^n$ possible mappings.
Partitioning algorithms are divided into _constructive_ and _iterative_ algorithms, depending on whether they create partitions from the ground up or need an existing partition to start from.
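The $m^n$ growth can be made concrete by enumerating every mapping for a small system. This is only a sketch: the helper name `all_mappings` and the object and component names are made up for illustration.

```python
from itertools import product


def all_mappings(objects, components):
    """Yield every assignment of objects to components (m**n in total)."""
    for choice in product(components, repeat=len(objects)):
        yield dict(zip(objects, choice))


objects = ["o1", "o2", "o3", "o4"]
components = ["CPU", "FPGA"]
mappings = list(all_mappings(objects, components))
print(len(mappings))  # 2**4 = 16
```

Exhaustive enumeration like this is only feasible for very small $n$, which is why the heuristic algorithms below are used instead.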
#### Random mapping
||__Type__ ||constructive ||
||__Complexity__ ||$\mathcal{O}(n)$||
||__Grouping metric__||_(irrelevant)_ ||
||__Search type__ ||_(irrelevant)_ ||
#### Hierarchical clustering
||__Type__ ||constructive ||
||__Complexity__ ||$\mathcal{O}(n^2)$||
||__Grouping metric__||closeness ||
||__Search type__ ||greedy ||
May be applied to partitioning problems with simplified hardware/software processing and communication costs.
_(example in 2015 exam solution)_
__Simplified heuristic steps:__
* Initialize each object as a group
* Compute closenesses between objects
* Merge closest objects and recompute closenesses
__Example:__

Further reading at [Wikipedia: Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
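The simplified heuristic steps can be sketched as below. Note the assumptions: group-to-group closeness is taken as the average over object pairs (one of several possible choices), the stopping criterion is a target number of groups, and the closeness values are invented. This naive version recomputes all closenesses on every merge and does not aim for the complexity listed in the table.

```python
def hierarchical_clustering(objects, closeness, n_groups):
    """Greedily merge the two closest groups until n_groups remain.

    closeness: dict frozenset({a, b}) -> closeness between objects a and b.
    """
    groups = [frozenset([o]) for o in objects]

    def group_closeness(g, h):
        pairs = [closeness.get(frozenset([a, b]), 0.0) for a in g for b in h]
        return sum(pairs) / len(pairs)

    while len(groups) > n_groups:
        # Find the closest pair of groups (recomputed after every merge).
        g, h = max(
            ((g, h) for i, g in enumerate(groups) for h in groups[i + 1:]),
            key=lambda gh: group_closeness(*gh),
        )
        groups = [x for x in groups if x not in (g, h)] + [g | h]
    return groups


closeness = {
    frozenset(["A", "B"]): 8,
    frozenset(["A", "C"]): 1,
    frozenset(["A", "D"]): 2,
    frozenset(["B", "C"]): 3,
    frozenset(["B", "D"]): 1,
    frozenset(["C", "D"]): 9,
}
result = hierarchical_clustering("ABCD", closeness, n_groups=2)
print(result)
```

With these values, C and D are merged first (closeness 9), then A and B (closeness 8), leaving the two groups {A, B} and {C, D}.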
#### Multi-stage clustering
In short the same as the above except allowing intermediate variation in closeness metrics.
#### Group Migration
||__Type__ ||iterative ||
||__Complexity__ ||$\mathcal{O}(n^3)$ _(or $\mathcal{O}(n^2)$ or even $\mathcal{O}(n)$\)_\*||
||__Grouping metric__||cost ||
||__Search type__ ||hill-climbing ||
\* depending on constant $Objfct$ and use of structural partitioning
The __Kernighan-Lin__ minimal cut algorithm is a [hill climbing](https://en.wikipedia.org/wiki/Hill_climbing), [heuristic](https://en.wikipedia.org/wiki/Heuristic_(computer_science)) algorithm for finding [partitions of graphs](https://en.wikipedia.org/wiki/Graph_partition). It can be used to find good (though not necessarily optimal) distributions of functionality between hardware and software, e.g. for accelerators.
Read more about this algorithm over at [Wikipedia](https://en.wikipedia.org/wiki/Kernighan%E2%80%93Lin_algorithm).
__Simplified heuristic:__
* _Initialize_
* Create a sequence of n moves
* Move all $o$ once depending on which results in $min(C)$
* If a better cost was found
* then: Update P, _repeat_
* else: exit
__Example:__

||__Evaluate__ ||__In Software__ ||__In Hardware__ ||__Cost__\*||
||__Starting point__ ||$P_1$, $P_2$ ||$P_3$, $P_4$ || 32||
||Move $P_1$ ||$P_2$ ||$P_1$, $P_3$, $P_4$ || 35||
||Move $P_2$ ||$P_1$ ||$P_2$, $P_3$, $P_4$ || 29||
||Move $P_3$ ||$P_1$, $P_2$, $P_3$||$P_4$ || 32||
||Move $P_4$ ||$P_1$, $P_2$, $P_4$||$P_3$ || 37||
||_Move best ($P_2$)_ ||$P_1$ ||$P_2$, $P_3$, $P_4$ || 29||
||Move $P_1$ || ||$P_1$, $P_2$, $P_3$, $P_4$|| 23||
||Move $P_3$ ||$P_1$, $P_3$ ||$P_2$, $P_4$ || 35||
||Move $P_4$ ||$P_1$, $P_4$ ||$P_2$, $P_3$ || 34||
||_Move best ($P_1$)_ || ||$P_1$, $P_2$, $P_3$, $P_4$|| 23||
||Move $P_3$ ||$P_3$ ||$P_1$, $P_2$, $P_4$ || 29||
||Move $P_4$ ||$P_4$ ||$P_1$, $P_2$, $P_3$ || 28||
||_Move best ($P_4$)_ ||$P_4$ ||$P_1$, $P_2$, $P_3$ || 28||
||Move $P_3$ ||$P_3$, $P_4$ ||$P_1$, $P_2$ || 26||
||_Move best ($P_3$)_ ||$P_3$, $P_4$ ||$P_1$, $P_2$ || 26||
||_Keep best partition_|| ||$P_1$, $P_2$, $P_3$, $P_4$|| 23||
||_(start new round)_ || || || ||
\* Calculated using the _Cost function_ formula above.
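One round of the heuristic can be sketched for a two-way SW/HW partition as follows. The function name `group_migration_pass` and the representation of a partition as the set of objects in software are assumptions; the cost values are taken directly from the example table above rather than computed from a cost function.

```python
def group_migration_pass(sw, objects, cost):
    """One round of group migration on a two-way SW/HW partition.

    sw: frozenset of objects currently in software (the rest are in hardware).
    cost: function mapping a software set to the cost of that partition.
    Returns the best partition seen during the round and its cost.
    """
    best_sw, best_cost = sw, cost(sw)
    unmoved = set(objects)
    while unmoved:
        # Try moving each unmoved object to the other side; keep the cheapest
        # move even if it increases the cost (this is the hill-climbing part).
        candidates = {o: cost(sw ^ {o}) for o in unmoved}
        o = min(candidates, key=candidates.get)
        sw = sw ^ {o}
        unmoved.remove(o)
        if candidates[o] < best_cost:
            best_sw, best_cost = sw, candidates[o]
    return best_sw, best_cost


# Costs of the partitions visited in the example table, keyed by the SW set.
table = {
    frozenset({"P1", "P2"}): 32,
    frozenset({"P2"}): 35,
    frozenset({"P1"}): 29,
    frozenset({"P1", "P2", "P3"}): 32,
    frozenset({"P1", "P2", "P4"}): 37,
    frozenset(): 23,
    frozenset({"P1", "P3"}): 35,
    frozenset({"P1", "P4"}): 34,
    frozenset({"P3"}): 29,
    frozenset({"P4"}): 28,
    frozenset({"P3", "P4"}): 26,
}
best_sw, best_cost = group_migration_pass(
    frozenset({"P1", "P2"}), ["P1", "P2", "P3", "P4"], table.__getitem__
)
print(sorted(best_sw), best_cost)  # [] 23
```

The round moves $P_2$, $P_1$, $P_4$ and $P_3$ in that order and keeps the all-hardware partition with cost 23. A complete implementation would start a new round from that partition and stop once a round brings no improvement.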
#### Ratio cut
||__Type__ ||constructive ||
||__Complexity__ ||_unknown_ ||
||__Grouping metric__||closeness ||
||__Search type__ ||greedy ||
Reduce cut sizes without grouping _far_ objects and without constraining _group size_.
$$ ratio = \frac{cut(P)}{size(p_1) \times size(p_2)} $$
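A small sketch of the ratio metric for a two-way partition, assuming that $cut(P)$ is the summed weight of edges crossing the partition and that $size(p)$ is the number of objects in the group. The graph and its weights are invented.

```python
def ratio_cut(edges, p1, p2):
    """Return cut(P) / (size(p1) * size(p2)) for a two-way partition.

    edges: dict (a, b) -> edge weight; cut(P) sums the weights of edges
    with one endpoint in p1 and the other in p2 (p2 is the complement of p1).
    """
    cut = sum(w for (a, b), w in edges.items() if (a in p1) != (b in p1))
    return cut / (len(p1) * len(p2))


edges = {("A", "B"): 4, ("B", "C"): 1, ("C", "D"): 4, ("A", "D"): 1}
balanced = ratio_cut(edges, {"A", "B"}, {"C", "D"})  # cut 2, sizes 2 x 2
lopsided = ratio_cut(edges, {"A"}, {"B", "C", "D"})  # cut 5, sizes 1 x 3
print(balanced, lopsided)
```

The partition that keeps the strongly connected pairs together gets the lower ratio (0.5 against roughly 1.67), without any explicit constraint on group size.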
#### Simulated annealing
||__Type__ ||iterative ||
||__Complexity__ ||_unknown_ ||
||__Grouping metric__||cost ||
||__Search type__ ||hill-climbing||
Similar to _Group Migration_, except that it e.g.:
* may move each object more than once
* accepts any move that improves cost, and accepts a move that worsens cost with a probability that decreases as the temperature drops
* selects moves at random
Theoretically provides a globally optimal solution, given equilibrium at every temperature and an infinitesimal temperature decrease.
Execution time and result depend heavily on the choice of the _Accept_, _Equilibrium_, _DecreaseTemp_ and _Frozen_ functions in the following algorithm.
__Algorithm:__

    temp = initial temperature
    cost = Objfct(P)
    while not Frozen loop
        while not Equilibrium loop
            P_tentative = RandomMove(P)
            cost_tentative = Objfct(P_tentative)
            Δcost = cost_tentative - cost
            if (Accept(Δcost,temp) > Random(0,1)) then
                P = P_tentative
                cost = cost_tentative
            end if
        end loop
        temp = DecreaseTemp(temp)
    end loop
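A runnable sketch of the algorithm above. The concrete choices for _Accept_ (Metropolis-style $e^{-\Delta cost/temp}$), _Equilibrium_ (a fixed number of moves per temperature), _DecreaseTemp_ (geometric cooling) and _Frozen_ (a temperature threshold) are common but assumed here, as are all parameter values and the toy cost tables. Tracking the best partition seen so far is an addition to the pseudocode.

```python
import math
import random


def simulated_annealing(partition, objfct, random_move, *,
                        initial_temp=10.0, alpha=0.9, frozen_temp=0.01,
                        moves_per_temp=50, seed=None):
    """Simulated annealing with assumed Accept/Equilibrium/DecreaseTemp/Frozen."""
    rng = random.Random(seed)
    temp = initial_temp
    cost = objfct(partition)
    best, best_cost = partition, cost
    while temp > frozen_temp:                      # not Frozen
        for _ in range(moves_per_temp):            # not Equilibrium
            tentative = random_move(partition, rng)
            tentative_cost = objfct(tentative)
            dcost = tentative_cost - cost
            # Accept: always for improving moves, else with probability e^(-dcost/temp)
            accept = 1.0 if dcost <= 0 else math.exp(-dcost / temp)
            if accept > rng.random():
                partition, cost = tentative, tentative_cost
                if cost < best_cost:
                    best, best_cost = partition, cost
        temp *= alpha                              # DecreaseTemp
    return best, best_cost


# Toy problem: each of six objects is mapped to SW (0) or HW (1).
SW_COST = [4, 6, 9, 3, 7, 5]
HW_COST = [7, 3, 2, 8, 4, 5]


def objfct(p):
    return sum(HW_COST[i] if side else SW_COST[i] for i, side in enumerate(p))


def random_move(p, rng):
    i = rng.randrange(len(p))                      # flip one random object
    return p[:i] + (1 - p[i],) + p[i + 1:]


best, best_cost = simulated_annealing((0,) * 6, objfct, random_move, seed=1)
print(best, best_cost)
```

The toy cost has no communication term, so the optimum is simply the cheaper side for each object (total cost 21); a realistic objective function would make the search space much less regular.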
#### Genetic evolution
||__Type__ ||constructive _(starts with random partitions)_ ||
||__Complexity__ ||_unknown_ ||
||__Grouping metric__||cost ||
||__Search type__ ||hill-climbing||
* Emulates evolution through a set of methods called _selection_, _crossover_, and _mutation_.
* Requires more memory with generations of multiple partitions.
* Long run times, similar to _simulated annealing_.
## Estimators
### Quality
There are three main factors considered when determining the quality of an estimator, namely _speed_, _accuracy_ and _fidelity_.
#### Speed
The __speed__ of an estimator indicates the time it takes to produce an estimate. This may be the essential metric in iterative improvement (e.g. Group Migration), where a new estimate is needed for every candidate move.
#### Accuracy
The __accuracy__ indicates the quality of estimates with respect to real implementation results. It can be quantified using the following formula.
$$ \mathcal{A} = \frac{1}{N}\sum_{i=1}^{N}{\left(1-\frac{|E_i-M_i|}{M_i}\right)} $$
#### Fidelity
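The __fidelity__ of an estimation method is the percentage of pairwise comparisons between design alternatives that the estimates rank the same way as the measured results. An estimator with high fidelity can be used to choose between alternatives even when its absolute accuracy is poor.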
We use the following formula for fidelity:
$$ \mathcal{F} = \frac{2}{n(n-1)}\sum_{i=1}^{n-1}{\sum_{j=i+1}^{n}{\mu_{ij}}}, \quad \mu_{ij} = \begin{cases} 1 & \text{if } (E_i<E_j \land M_i<M_j) \lor (E_i>E_j \land M_i>M_j)\\ 0 & \text{otherwise,} \end{cases} $$
where $E_i$ is the estimated value and $M_i$ is the measured value of solution sample $i$.
__Example:__
We have made an estimation of the power of different solutions to an architecture. The model is sampled with 5 variations of implementation.
||Solution index||E ||M ||
||1 ||40||22||
||2 ||50||54||
||3 ||60||73||
||4 ||70||62||
||5 ||80||72||
$$\begin{matrix}
\mu_{12}=1&\mu_{13}=1&\mu_{14}=1&\mu_{15}=1\\
\mu_{23}=1&\mu_{24}=1&\mu_{25}=1\\
\mu_{34}=0&\mu_{35}=0\\
\mu_{45}=1
\end{matrix}$$
$$ \mathcal{F} = \frac{2}{5 \cdot 4}(1+1+1+1+1+1+1+0+0+1) = 0.8 = 80 \% $$
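The example can be checked with a few lines of Python. The function names are arbitrary; the formulas are the accuracy and fidelity formulas given above, applied to the table of estimated and measured values.

```python
from itertools import combinations


def accuracy(estimated, measured):
    """Mean of 1 - |E_i - M_i| / M_i over all samples."""
    return sum(1 - abs(e - m) / m for e, m in zip(estimated, measured)) / len(measured)


def fidelity(estimated, measured):
    """Fraction of sample pairs whose ordering is the same in E and in M."""
    pairs = list(combinations(range(len(estimated)), 2))
    agree = sum(
        (estimated[i] < estimated[j] and measured[i] < measured[j])
        or (estimated[i] > estimated[j] and measured[i] > measured[j])
        for i, j in pairs
    )
    return agree / len(pairs)


E = [40, 50, 60, 70, 80]
M = [22, 54, 73, 62, 72]
print(fidelity(E, M))            # 0.8
print(round(accuracy(E, M), 3))  # 0.738
```

For this data set the fidelity (80 %) is higher than the accuracy (about 74 %): the estimator ranks most pairs correctly even though the first sample is far off in absolute terms.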
### Simulated annealing
> Simulated annealing is a partition algorithm that models the annealing process in physics, where a material is melted and its minimal energy state is achieved by lowering the temperature slowly enough that equilibrium is reached at each temperature.
> – Solution example for 2014 exam

    temp = Initial temperature
    cost = Objfct(P)
    while not Frozen loop
        while not Equilibrium loop
            P_tentative = RandomMove(P)
            cost_tentative = Objfct(P_tentative)
            Δcost = cost_tentative - cost
            if (Accept(Δcost,temp) > Random(0,1)) then
                P = P_tentative
                cost = cost_tentative
            end if
        end loop
        temp = DecreaseTemp(temp)
    end loop
# Multiprocessors and accelerators
## Speed-up formula
This formula gives the speedup, expressed as the total time saved, from moving a computation to a hardware accelerator.
$$ S = n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})], $$
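where $n$ is the number of iterations, $t_{CPU}$ is the time one iteration takes in software on the CPU, $t_x$ is the execution time of the accelerator, and $t_{in}$ and $t_{out}$ are the times needed to transfer data to and from the accelerator that cannot be overlapped with $t_x$.
The above formula gives the absolute time saved. Dividing it by the total software time $n \cdot t_{CPU}$ gives the relative speedup, i.e. the fraction of the software execution time that is saved.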
$$ S' = 1 - \frac{t_{in}+t_x+t_{out}}{t_{CPU}} $$
_(example in 2015 exam solution)_
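Both formulas can be evaluated together, as in this small sketch. The function name and the timing values are hypothetical.

```python
def speedup(n, t_cpu, t_in, t_x, t_out):
    """Return (S, S'): total time saved and the saved fraction of software time."""
    t_accel = t_in + t_x + t_out
    saved = n * (t_cpu - t_accel)
    relative = 1 - t_accel / t_cpu
    return saved, relative


# Hypothetical timings in microseconds per iteration.
saved, relative = speedup(n=1000, t_cpu=20.0, t_in=2.0, t_x=5.0, t_out=1.0)
print(saved, relative)  # 12000.0 0.6
```

Note that the accelerator only pays off when $t_{in} + t_x + t_{out} < t_{CPU}$; otherwise both $S$ and $S'$ become negative.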
# Instruction set extensions
## Generator
### ISEGEN
> ISEGEN is an instruction set extension generator. ISEGEN will allow the user to optimise a system where an application is running on a CPU by identifying chunks of the program that can be beneficial to replace with a custom instruction, i.e. an instruction set extension, implemented in hardware.
> – Solution example for 2014 exam

    SetInitialConditions()
    last_best_C ⇐ C
    loop (until exit condition)
        best_C ⇐ last_best_C
        while (there exists unmarked node in DFG)
            foreach (unmarked node n)
                Calculate M_toggle(n,best_C)
            endfor
            best_node ⇐ Node with maximum Gain
            Toggle and Mark best_node
            CalcImpactOfToggle(best_node,best_C)
            if (toggling best_node satisfies constraints)
                Update best_C from toggling best_node
                Calculate M(best_C)
            endif
        endwhile
        if (M(best_C) > M(last_best_C))
            last_best_C ⇐ best_C
            Unmark all nodes
        endif
    endloop
    C ⇐ last_best_C
# On-chip communication architectures
## Split Transfer
A __split transfer__ lets a slave that cannot respond immediately _split_ its bus access. The arbiter then temporarily grants the bus to another master. When the initial slave is ready to complete the transfer, it signals the arbiter, which _unsplits_ the bus and returns it to the original master.
## Arbitration schemes
### Time Division Multiple Access (TDMA)
> __Time Division Multiple Access__ (TDMA) arbitration assigns a bus access time slot to
each master. The length of the slots assigned to each master can vary, depending on its
transfer requirements. If a master has nothing to send in its time slot, the time slot is wasted,
with reduced band width utilization as result.
### Round Robin (RR)
> __Round Robin__ (RR) arbitration gives the masters access to the bus in a circular manner. It
is simple to implement and every master will get access to the bus but with unpredictable
delay. Critical data may have to wait a long time.
### TDMA/RR
> A two-level __TDMA/RR__ first divides the bus band width into a set of time slots for each
master. If a master has nothing to send in its time slot, it is given to the other masters in a
circular manner. This increases the band width utilization compared with pure TDMA. The
cost is an increase in the arbiter implementation complexity.
> – Solution example for 2012 exam
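The difference in bandwidth utilization between pure TDMA and two-level TDMA/RR can be illustrated with a toy arbiter model. Everything here is an assumption made for illustration: one transfer per slot, a fixed slot wheel, and a simple rotating pointer for the round-robin level.

```python
def tdma(slots, requests):
    """Pure TDMA: the slot owner gets the bus only if it has a request pending.

    slots: list of master ids, one per time slot (the TDMA wheel).
    requests: dict master -> number of pending transfers.
    Returns the grant per slot (None marks a wasted slot).
    """
    pending = dict(requests)
    grants = []
    for owner in slots:
        if pending.get(owner, 0) > 0:
            pending[owner] -= 1
            grants.append(owner)
        else:
            grants.append(None)
    return grants


def tdma_rr(slots, requests, masters):
    """Two-level TDMA/RR: an idle slot is handed to the next requesting master."""
    pending = dict(requests)
    grants = []
    rr = 0  # round-robin pointer into `masters`
    for owner in slots:
        grant = owner if pending.get(owner, 0) > 0 else None
        if grant is None:
            for k in range(len(masters)):
                candidate = masters[(rr + k) % len(masters)]
                if pending.get(candidate, 0) > 0:
                    grant = candidate
                    rr = (rr + k + 1) % len(masters)
                    break
        if grant is not None:
            pending[grant] -= 1
        grants.append(grant)
    return grants


masters = ["M0", "M1", "M2"]
slots = ["M0", "M1", "M2"] * 2          # six slots, equal share
requests = {"M0": 4, "M1": 0, "M2": 1}  # M1 is idle

print(tdma(slots, requests))              # ['M0', None, 'M2', 'M0', None, None]
print(tdma_rr(slots, requests, masters))  # ['M0', 'M0', 'M2', 'M0', 'M0', None]
```

In this run pure TDMA wastes three of six slots, while TDMA/RR wastes only the last one, at the price of the extra round-robin logic in the arbiter.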
# Implementing embedded systems
## Address space sharing and folding
_(example in 2014 exam solution)_
# Retargetable compilers
A retargetable compiler takes a description of the target machine's resources as input, such as
* instruction set
* register files
* instruction scheduling
This (back-end) processor model can be edited in order to retarget the compiler to a different processor.
## Acronyms
|| IR||Intermediate representation||
||CFG||Control flow graph||
||DFG||Data flow graph||
## Code-selection
_(example, using DFG, in 2013 exam)_
## Loop optimization techniques
Techniques for optimizing loops for performance, power or even area. Beware that some methods cannot be used if, for example, different iterations or steps of the loop(s) are dependent on each other.
### Loop permutation
Primarily intended for optimizing address locality and cache hit rate.
__Simplified code example:__

    for k in range(M):
        for j in range(N):
            foo(p[j][k])

would likely be optimized for cache by swapping the iteration order of the indexes (assuming row-major storage, so that the inner loop walks consecutive addresses), as in the following example

    for j in range(N):
        for k in range(M):
            foo(p[j][k])

### Loop fusion
Very simply illustrated with the following example transformation. May be useful for cache optimization.

    for j in range(N):
        foo(p[j])
    for j in range(N):
        bar(p[j])

$$ \implies $$

    for j in range(N):
        foo(p[j])
        bar(p[j])

### Loop fission
Loop fission is basically doing the above transformation in reverse. Likely favourable for multi-core processors for concurrent processing.
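A minimal sketch of fission, assuming the two statements in the loop body are independent of each other. The function names and loop bodies are made up.

```python
def fused(p, q):
    for j in range(len(p)):
        p[j] = p[j] * 2      # statement A: touches only p
        q[j] = q[j] + 1      # statement B: touches only q


def fissioned(p, q):
    # The two statements are independent, so each gets its own loop;
    # the two loops could then be scheduled on different cores.
    for j in range(len(p)):
        p[j] = p[j] * 2
    for j in range(len(q)):
        q[j] = q[j] + 1


a1, b1 = [1, 2, 3], [10, 20, 30]
a2, b2 = [1, 2, 3], [10, 20, 30]
fused(a1, b1)
fissioned(a2, b2)
print(a1 == a2 and b1 == b2)  # True
```

If statement B read a value written by statement A in a different iteration, the transformation would have to be checked against that dependency first.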
### Loop unrolling
Illustrated in the following example transformation.
$N=15$ loop iterations

    for (i = 0; i < 15; i++)
    {
        p[i] = foo(i);
    }

$$ \implies $$

    for (i = 0; i < 15; i += 3)
    {
        p[i] = foo(i);
        p[i+1] = foo(i+1);
        p[i+2] = foo(i+2);
    }

with an unrolling factor $f=3$.
Note that loops with $N \mod f \neq 0$ must include additional lines of code for operations that do not fit in the new loop.
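The extra code for the leftover iterations is often called an epilogue. A sketch for $f=3$, with a hypothetical loop body `foo` and an iteration count that is not a multiple of 3:

```python
F = 3  # unrolling factor


def foo(i):
    return i * i  # hypothetical loop body


def unrolled(n):
    p = [0] * n
    main = n - n % F              # largest multiple of F not exceeding n
    for i in range(0, main, F):   # main loop, unrolled by F = 3
        p[i] = foo(i)
        p[i + 1] = foo(i + 1)
        p[i + 2] = foo(i + 2)
    for i in range(main, n):      # epilogue: the n mod F leftover iterations
        p[i] = foo(i)
    return p


print(unrolled(16) == [foo(i) for i in range(16)])  # True
```

For $N=16$ the main loop covers indexes 0 to 14 and the epilogue handles the single remaining iteration.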
### Loop splitting

    for j in range(N):
        if j<K:
            foo()
        else:
            bar()

$$ \implies $$

    for j in range(K):
        foo()
    for j in range(K, N):
        bar()

# High Level Synthesis
Benchmarking (as of 2017) indicates that commercial tools generally yield better results than academic ones.
## Academic (non-commercial) tools
* Dwarv (CoSy C $\rightarrow$ VHDL)
* Bambu (GCC C $\rightarrow$ Verilog)
* LegUp (C $\rightarrow$ RTL)
## Optimization methods
* Bitwidth analysis and optimization
* Memory space allocation
* Exploiting spatial parallelism
* If-conversion
* Operation Chaining
* Loop Optimizations
* Hardware Resource Libraries
* Speculation and Code Motion
# Uncategorized topics
## Pareto curve example

$$ \implies $$

## Closeness vs. cost
> A __cost__ function (or objective function) uses metrics (e.g., area, power consumption,
performance) and weights to define the “goodness” of a given solution. Different metrics can
have different weight (e.g., area may be more important than power consumption). This can
be included using weighing constants in the function.
> A __closeness__ function uses closeness metrics to indicate the desirability of grouping objects.
Parts of a system that can use the same hardware, e.g., a multiplier, but at different points in
time during execution, can for instance be grouped together. Processes that communicate a
lot will also often benefit from being grouped together, and are hence seen as close.
> – Solution example for 2012 exam