TFE02: Hardware software codesign with embedded systems
# Disclaimer
This compendium is a work in progress and needs review and possibly restructuring. The current structure is based on the ordering and naming of lectures from the 2017 semester.
# Estimation and partitioning
Estimation and partitioning is important to evaluate the quality of an architecture.
## Architecture exploration
__Architecture exploration__ is the process of mapping out the hardware composition of a system. This is typically done at an early stage and may include a choice of processors, application specific hardware (such as ASICs), reprogrammable hardware (such as FPGAs) and memories.
## Partitioning
Following architecture exploration is typically __partitioning__, where functionality of the application is grouped. The groups are mapped onto architectural components.
### Co-simulation
Simulating both software and hardware together through computer aided design tools (CAD). May be done at different levels of abstraction ranging from physical models and RTL to system modelling.
SystemC is a noteworthy co-simulation tool.
__Possible benefits:__
* Virtual prototyping (actual system specific hardware not needed)
* Coarse simulation (reduced accuracy at the benefit of fast results)
* Increased quality and coverage of verification (by utilizing both simulation and hardware prototyping)
* Bugs may be fixed more easily and efficiently in hardware rather than in software workarounds
All of which may aid in reduced design and verification time, overall.
### Speed-up formula
This formula defines the speedup by a specific utilization of hardware accelerators.
$$ S = n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})], $$
### Hierarchical clustering
A greedy heuristic that may be applied to partitioning problems with simplified hardware/software processing and communication costs.
_(example wanted)_
Further reading at [Wikipedia: Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
### Kernighan–Lin algorithm
The __Kerninghan-Lin__ minimal cut algorithm is a [hill climbing](https://en.wikipedia.org/wiki/Hill_climbing), [heuristic](https://en.wikipedia.org/wiki/Heuristic_(computer_science)) algorithm for finding [partitions of graphs](https://en.wikipedia.org/wiki/Graph_partition). It can be utilized to determine optimal distributions of functionality between hardware and software for eg. accelerators.
Read more about this algorithm over at [Wikipedia](https://en.wikipedia.org/wiki/Kernighan%E2%80%93Lin_algorithm).
## Fidelity
The term __fidelity__ is used to describe the quality of an estimation methodology, often by comparing estimated results with actual results.
We use the following formula for fidelity:
$$ \mathcal{F} = \frac{2}{n(n-1)}\sum_{i=1}^{n-1}{\sum_{j=i+1}^{n}{\mu_{ij}}}, \quad\text{where }\\
\mu_{ij} = (E_i<E_j \land M_i<M_j) \lor (E_i>E_j \land M_i>M_j),
$$
where $E$ is the estimated values and $M$ is the measured values of each solution sample.
### Example
We have made an estimation of the power of different solutions to an architecture. The model is sampled with 5 variations of implementation.
||Solution index||E ||M ||
||1 ||40||22||
||2 ||50||54||
||3 ||60||73||
||4 ||70||62||
||5 ||80||72||
$$\begin{matrix}
\mu_{12}=1&\mu_{13}=1&\mu_{14}=1&\mu_{15}=1\\
\mu_{23}=1&\mu_{24}=1&\mu_{25}=1\\
\mu_{34}=0&\mu_{35}=0\\
\mu_{45}=1
\end{matrix}$$
$$ \mathcal{F} = \frac{2}{5*4}(1+1+1+1+1+1+1+0+0+1) = 0.8 = 80 \% $$
# Multiprocessors and accelerators
# Instruction set extensions
# On-chip communication architechtures
## Split Transfer
When slaves are enabled to _split_ bus access (through an arbiter) temporarily to allow another master to utilize the bus until the initial slave is ready to return communication and consequently _unsplits_ the bus.
# Implementing embedded systems
# Retargetable compilers
## Acronyms
|| IR||Intermediate representation||
||CFG||Control flow graph||
||DFG||Data flow graph||
## Loop unrolling example
__Original code:__
$N=16$ loop iterations
for (i = 0; i < 15; i++)
{
p[j] = foo(j);
}
__Modified code:__
With a loop factor $f=3$
for (i = 0; i < 15; i+=3)
{
p[j] = foo(j);
p[j+1] = foo(j+1);
p[j+2] = foo(j+2);
}
Note that loops with $N \mod f \neq 0$ must include additional lines of code for operations that do not fit in the new loop.
# High Level Synthesis