This is an old version of the compendium, written Nov. 25, 2017, 1:29 a.m. Changes made in this revision were made by larshb. View rendered version.

TFE02: Hardware software codesign with embedded systems

# Introduction ## Disclaimer This compendium is a work in progress and needs review and possibly restructuring. The current structure is based on the ordering and naming of lectures from the 2017 semester. ## The hardware/software codesign flow ![Figure showing the HW/SW codesign flow](http://i.pi.gy/8ZwzD.png) # Estimation and partitioning Estimation and partitioning is important to evaluate the quality of an architecture. ## Architecture exploration __Architecture exploration__ is the process of mapping out the hardware composition of a system. This is typically done at an early stage and may include a choice of processors, application specific hardware (such as ASICs), reprogrammable hardware (such as FPGAs) and memories. ## Partitioning Following architecture exploration is typically __partitioning__, where functionality of the application is grouped. The groups are mapped onto architectural components. ### Co-simulation Simulating both software and hardware together through computer aided design tools (CAD). May be done at different levels of abstraction ranging from physical models and RTL to system modelling. SystemC is a noteworthy co-simulation tool. __Possible benefits:__ * Virtual prototyping (actual system specific hardware not needed) * Coarse simulation (reduced accuracy at the benefit of fast results) * Increased quality and coverage of verification (by utilizing both simulation and hardware prototyping) * Bugs may be fixed more easily and efficiently in hardware rather than in software workarounds All of which may aid in reduced design and verification time, overall. ### Speed-up formula This formula defines the speedup by a specific utilization of hardware accelerators. $$ S = n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})], $$ where $n$ is the number of iterations, $t_{in}$ and $t_{out}$ are the times needed to read and write data to the accelerator that cannot be overlapped with its execution time $t_x$. The above formula accounts for specific times. By tweaking this formula a bit we can easily calculate the speedup in terms of degree. $$ S' = 1 - \frac{t_{in}+t_x+t_{out}}{t_{CPU}} $$ _(example in 2015 exam solution)_ ### Hierarchical clustering A greedy heuristic that may be applied to partitioning problems with simplified hardware/software processing and communication costs. _(example in 2015 exam solution)_ Further reading at [Wikipedia: Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering). ### Kernighan–Lin algorithm The __Kerninghan-Lin__ minimal cut algorithm is a [hill climbing](https://en.wikipedia.org/wiki/Hill_climbing), [heuristic](https://en.wikipedia.org/wiki/Heuristic_(computer_science)) algorithm for finding [partitions of graphs](https://en.wikipedia.org/wiki/Graph_partition). It can be utilized to determine optimal distributions of functionality between hardware and software for eg. accelerators. Read more about this algorithm over at [Wikipedia](https://en.wikipedia.org/wiki/Kernighan%E2%80%93Lin_algorithm). ## Fidelity The term __fidelity__ is used to describe the quality of an estimation methodology, often by comparing estimated results with actual results. We use the following formula for fidelity: $$ \mathcal{F} = \frac{2}{n(n-1)}\sum_{i=1}^{n-1}{\sum_{j=i+1}^{n}{\mu_{ij}}}, \quad\text{where }\\ \mu_{ij} = (E_i<E_j \land M_i<M_j) \lor (E_i>E_j \land M_i>M_j), $$ where $E$ is the estimated values and $M$ is the measured values of each solution sample. ### Example We have made an estimation of the power of different solutions to an architecture. The model is sampled with 5 variations of implementation. ||Solution index||E ||M || ||1 ||40||22|| ||2 ||50||54|| ||3 ||60||73|| ||4 ||70||62|| ||5 ||80||72|| $$\begin{matrix} \mu_{12}=1&\mu_{13}=1&\mu_{14}=1&\mu_{15}=1\\ \mu_{23}=1&\mu_{24}=1&\mu_{25}=1\\ \mu_{34}=0&\mu_{35}=0\\ \mu_{45}=1 \end{matrix}$$ $$ \mathcal{F} = \frac{2}{5*4}(1+1+1+1+1+1+1+0+0+1) = 0.8 = 80 \% $$

## Estimators ### Simulated annealing > Simulated annealing is a partition algorithm that models the annealing process in physics, where a material is melted and its minimal energy state is achieved by lowering the temperature slowly enough that equilibrium is reached at each temperature. > – Solution example for 2014 exam temp = Initial temperature cost = Objfct(P) while not Frozen loop while not Equilibrium loop P_tentative = RandomMove(P) cost_tentative = Objfct(P_tentative) Δcost = cost_tentative - cost if (Accept(Δcost,temp) > Random(0,1)) then P = P_tentative cost = cost_tentative end if end loop temp = DecreaseTemp(temp) end loop # Multiprocessors and accelerators

# Instruction set extensions ## Generator ### ISEGEN > ISEGEN is an instruction set extension generator. ISEGEN will allow the user to optimise a system where an application is running on a CPU by identifying chunks of the program that can be beneficial to replace with a custom instruction, i.e. an instruction set extension, implemented in hardware. > – Solution example for 2014 exam SetInitialConditions() last_best_C ⇐ C loop (until exit condition) best_C ⇐ last_best_C while (there exists unmarked node in DFG) foreach (unmarked node n) Calculate M_toggle(n,best_C) endfor best_node ⇐ Node with maximum Gain Toggle and Mark best_node CalcImpactOfToggle(best_node,best_C) if (toggling best_node satisfies constraints) Update best_C from toggling best_node Calculate M(best_C) endif endwhile if (M(best_C) > M(last_best_C)) last_best_C ⇐ best_C Unmark all nodes endif endloop C ⇐ last_best_C # On-chip communication architechtures ## Split Transfer When slaves are enabled to _split_ bus access (through an arbiter) temporarily to allow another master to utilize the bus until the initial slave is ready to return communication and consequently _unsplits_ the bus. # Implementing embedded systems # Retargetable compilers In a retargetable compiler a description of machine resources such as * instruction set * register files * instruction scheduling In a retargetable compiler it is possible to edit this (back-end) processor model. ## Acronyms || IR||Intermediate representation|| ||CFG||Control flow graph|| ||DFG||Data flow graph|| ## Loop optimization techniques Techniques for optimizing loops for performance, power or even area. Beware that some methods are not possible to utilize if, for example, different steps of the loop(s) are dependant on each other. ### Loop permutation Primarily intended for optimizing address locality and cache hit rate. __Simplified code example:__ for k in range(M): for j in range(N): foo(p[j][k]) would likely be optimized for cache by swapping iteration order of the indexes like in the following counter-example for j in range(N): for i in range(M): foo(p[j][k]) ### Loop fusion Very simply illustrated with the following example transformation. May be useful for cache optimization. for j in range(N): foo(p[j]) for j in range(N): bar(p[j]) $$ \implies $$ for j in range(N): foo(p[j]) bar(p[j]) ### Loop fission Loop fission is basically doing the above transformation in reverse. Likely favourable for multi-core processors for concurrent processing. ### Loop unrolling Illustrated in the following example transformation. $N=16$ loop iterations for (i = 0; i < 15; i++) { p[j] = foo(j); } $$ \implies $$ for (i = 0; i < 15; i+=3) { p[j] = foo(j); p[j+1] = foo(j+1); p[j+2] = foo(j+2); } With a loop factor $f=3$ Note that loops with $N \mod f \neq 0$ must include additional lines of code for operations that do not fit in the new loop. ### Loop splitting for j in range(N): if j<K: foo() else: bar() $$ \implies $$ for j in range(K): foo() for j in range(K, N): bar() # High Level Synthesis # Unspecified topics ## Adress space sharing and folding

_(example in 2014 exam solution example)_ ## Simulated annealing > Simulated annealing is a partition algorithm that models the annealing process in physics, where a material is melted and its minimal energy state is achieved by lowering the temperature slowly enough that equilibrium is reached at each temperature. > – Solution example for 2014 exam temp = Initial temperature cost = Objfct(P) while not Frozen loop while not Equilibrium loop P_tentative = RandomMove(P) cost_tentative = Objfct(P_tentative) Δcost = cost_tentative - cost if (Accept(Δcost,temp) > Random(0,1)) then P = P_tentative cost = cost_tentative end if end loop temp = DecreaseTemp(temp) end loop