TFE02: Hardware software codesign with embedded systems
Introduction
Disclaimer
This compendium is a work in progress and needs review and possibly restructuring and citations. The current structure is based on the ordering and naming of lectures from the 2017 semester.
Specialization course
This subject is part of the Design of Digital Systems, Specialization Course (TFE4525).
The hardware/software codesign flow
Estimation and partitioning
Estimation and partitioning are important for evaluating the quality of an architecture.
Architecture exploration
Architecture exploration is the process of mapping out the hardware composition of a system. This is typically done at an early stage and may include a choice of processors, application specific hardware (such as ASICs), reprogrammable hardware (such as FPGAs) and memories.
Co-simulation
Co-simulation means simulating software and hardware together using computer-aided design (CAD) tools. It may be done at different levels of abstraction, ranging from physical models and RTL to system modelling.
SystemC is a noteworthy co-simulation tool.
Partitioning
Following architecture exploration is typically partitioning, where functionality of the application is grouped. The groups are mapped onto architectural components.
The partitioning problem: Given a set of objects
Typical configuration of a partitioning system.
Possible benefits:
- Virtual prototyping (actual system specific hardware not needed)
- Coarse simulation (reduced accuracy at the benefit of fast results)
- Increased quality and coverage of verification (by utilizing both simulation and hardware prototyping)
- Bugs may be fixed more easily and efficiently in hardware rather than in software workarounds
All of these may help reduce the overall design and verification time.
Formulas
Objective function
The area, delay and power metrics are often substituted with functions accounting for constraints. E.g. if the area is above 100% utilization, the cost of area should be significantly increased in order to balance the objective function for realistic partitioning.
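The constraint-aware substitution described above can be sketched as follows. This is only an illustration: the weighted-sum form, the `penalized` helper, the penalty factor and all numbers are assumptions made for the example, not taken from the course material.

```python
def penalized(value, limit, penalty=10.0):
    """Normalize a metric to its limit; any overshoot is scaled up by `penalty`."""
    ratio = value / limit
    if ratio <= 1.0:
        return ratio
    return 1.0 + penalty * (ratio - 1.0)


def objective(area, delay, power, limits, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of constraint-aware area, delay and power terms (assumed form)."""
    k_area, k_delay, k_power = weights
    return (k_area * penalized(area, limits["area"])
            + k_delay * penalized(delay, limits["delay"])
            + k_power * penalized(power, limits["power"]))


# Hypothetical limits and two candidate partitions.
limits = {"area": 100.0, "delay": 50.0, "power": 2.0}
fits = objective(area=90.0, delay=40.0, power=1.0, limits=limits)
overflows = objective(area=110.0, delay=40.0, power=1.0, limits=limits)
print(fits, overflows)
```

With these numbers, exceeding the area limit by 10% raises the objective noticeably more than the raw 20-unit area difference would, which steers a partitioner away from infeasible solutions.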
Cost function
The cost function of partitioning may be defined in a simplified model as such
Closeness function
...
Local minima
Some partitioning algorithms are able to escape local minima by accepting moves that temporarily worsen the cost; these are called hill-climbing algorithms. Their counterparts are called greedy algorithms, meaning they only accept moves that immediately improve cost or closeness.
Both types of algorithms can be combined to optimize solving the partitioning problem.
Designer interaction
Both human designers and synthesizers benefit from interaction with the partitioning methods through the following selections:
- Object granularity
- Allocation of system components
- Quality metrics
- Objective and closeness functions
- Algorithm choice(s)
Algorithms
A system with
Partitioning algorithms are divided into constructive and iterative algorithms, depending on whether they create partitions from the ground up or need to be based on an existing partition.
Random mapping
Type | constructive |
Complexity | |
Grouping metric | (irrelevant) |
Search type | (irrelevant) |
Hierarchical clustering
Type | constructive |
Complexity | |
Grouping metric | closeness |
Search type | greedy |
May be applied to partitioning problems with simplified hardware/software processing and communication costs.
(example in 2015 exam solution)
Simplified heuristic steps:
- Initialize each object as a group
- Compute closenesses between objects
- Merge closest objects and recompute closenesses
Example:
Further reading at Wikipedia: Hierarchical clustering.
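The heuristic steps above can be sketched in Python as follows. The closeness values, the object names and the choice of average pairwise closeness between groups are assumptions made for illustration; the course example may use a different rule for recomputing closeness.

```python
def hierarchical_clustering(objects, closeness, num_groups):
    """Greedily merge the two closest groups until num_groups remain."""
    groups = [frozenset([o]) for o in objects]   # each object starts as its own group

    def group_closeness(a, b):
        # Assumption: closeness between groups is the average pairwise closeness.
        pairs = [(x, y) for x in a for y in b]
        return sum(closeness[frozenset((x, y))] for x, y in pairs) / len(pairs)

    while len(groups) > num_groups:
        # Closenesses are recomputed every round; pick the closest pair of groups.
        best = max(
            ((a, b) for i, a in enumerate(groups) for b in groups[i + 1:]),
            key=lambda pair: group_closeness(*pair),
        )
        groups = [g for g in groups if g not in best] + [best[0] | best[1]]
    return groups


# Hypothetical closeness values between four objects.
closeness = {
    frozenset(("a", "b")): 30, frozenset(("a", "c")): 5, frozenset(("a", "d")): 10,
    frozenset(("b", "c")): 15, frozenset(("b", "d")): 5, frozenset(("c", "d")): 25,
}
result = hierarchical_clustering(["a", "b", "c", "d"], closeness, num_groups=2)
print(sorted(sorted(g) for g in result))
```

The search is greedy: a merge is never undone, so an early merge that looks good locally can rule out a better final grouping.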
Multi-stage clustering
In short, the same as hierarchical clustering, except that the closeness metric may be varied between intermediate stages.
Group Migration
Type | iterative |
Complexity | |
Grouping metric | cost |
Search type | hill-climbing |
* depending on constant
The Kernighan-Lin minimal cut algorithm is a hill-climbing heuristic algorithm for finding partitions of graphs. It can be used to search for good distributions of functionality between hardware and software, e.g. for accelerators.
Read more about this algorithm over at Wikipedia.
Simplified heuristic:
- Initialize
- Create a sequence of n moves
  - Move all $o$ once depending on which results in $\min(C)$
- If a better cost was found
  - then: Update P, repeat
  - else: exit
Example:
Evaluate | In Software | In Hardware | Cost* |
Starting point | | | 32 |
Move | | | 35 |
Move | | | 29 |
Move | | | 32 |
Move | | | 37 |
Move best | | | 29 |
Move | | | 23 |
Move | | | 35 |
Move | | | 34 |
Move best | | | 23 |
Move | | | 29 |
Move | | | 28 |
Move best | | | 28 |
Move | | | 26 |
Move best | | | 26 |
Keep best partition | | | 23 |
(start new round) | | | |
* Calculated using the Cost function formula above.
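A minimal sketch of one way to implement the group migration heuristic for a two-way software/hardware partition is given below. The cost model (`SW_TIME`, `HW_AREA`, `COMM`, `example_cost`) is entirely hypothetical and is not the cost function referred to above; only the move-every-object-once structure follows the heuristic.

```python
def flip(partition, obj):
    """Return a copy of the partition with obj moved to the other side."""
    moved = dict(partition)
    moved[obj] = "hw" if moved[obj] == "sw" else "sw"
    return moved


def group_migration(partition, cost):
    best, best_cost = dict(partition), cost(partition)
    while True:
        current = best
        unmoved = set(current)
        round_best, round_best_cost = None, best_cost
        # Build a sequence of n moves; each object is moved exactly once.
        while unmoved:
            # Take the cheapest move even if it worsens the cost (hill-climbing).
            obj = min(sorted(unmoved), key=lambda o: cost(flip(current, o)))
            current = flip(current, obj)
            unmoved.remove(obj)
            if cost(current) < round_best_cost:
                round_best, round_best_cost = current, cost(current)
        if round_best is None:            # no better cost found: exit
            return best, best_cost
        best, best_cost = round_best, round_best_cost   # update P, repeat


# Hypothetical cost model: software time + hardware area + cut communication.
SW_TIME = {"a": 10, "b": 4, "c": 8, "d": 2}
HW_AREA = {"a": 3, "b": 5, "c": 2, "d": 6}
COMM = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 3}


def example_cost(p):
    time = sum(SW_TIME[o] for o in p if p[o] == "sw")
    area = sum(HW_AREA[o] for o in p if p[o] == "hw")
    cut = sum(w for (x, y), w in COMM.items() if p[x] != p[y])
    return time + area + cut


start = {o: "sw" for o in SW_TIME}
best, best_cost = group_migration(start, example_cost)
print(best, best_cost)
```

As in the table, every round evaluates all remaining moves, commits the best one, and only at the end of the round keeps the best partition seen along the sequence.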
Ratio cut
Type | constructive |
Complexity | unknown |
Grouping metric | closeness |
Search type | greedy |
Reduces the cut size without grouping distant objects and without constraining the group size.
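As an illustration, the sketch below assumes the commonly used ratio-cut metric, i.e. the cut size divided by the product of the two group sizes. The edge weights and object names are made up for the example.

```python
def ratio_cut(group_a, group_b, edges):
    """Cut size divided by the product of group sizes (assumed metric).

    edges: dict mapping (x, y) object pairs to connection weights.
    """
    cut = sum(w for (x, y), w in edges.items()
              if (x in group_a and y in group_b) or (x in group_b and y in group_a))
    return cut / (len(group_a) * len(group_b))


# Hypothetical connection weights between four objects.
edges = {("a", "b"): 4, ("b", "c"): 1, ("c", "d"): 4, ("a", "c"): 1}
balanced = ratio_cut({"a", "b"}, {"c", "d"}, edges)
lopsided = ratio_cut({"a"}, {"b", "c", "d"}, edges)
print(balanced, lopsided)
```

A lower ratio is better. Dividing by the product of the group sizes favours reasonably balanced groups without imposing a fixed size constraint.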
Simulated annealing
Type | iterative |
Complexity | unknown |
Grouping metric | cost |
Search type | hill-climbing |
Similar to Group Migration, except that it, e.g.:
- may move each object more than once
- accepts any move that improves cost, and may also accept other moves depending on the cost change, the temperature and randomness
Theoretically provides a globally optimal solution, given equilibrium at every temperature and an infinitesimal temperature decrease.
Execution time and result depend heavily on the choice of the Accept, Equilibrium, DecreaseTemp and Frozen functions in the following algorithm.
Algorithm:
temp = initial temperature
cost = Objfct(P)
while not Frozen loop
    while not Equilibrium loop
        P_tentative = RandomMove(P)
        cost_tentative = Objfct(P_tentative)
        Δcost = cost_tentative - cost
        if (Accept(Δcost,temp) > Random(0,1)) then
            P = P_tentative
            cost = cost_tentative
        end if
    end loop
    temp = DecreaseTemp(temp)
end loop
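A runnable Python sketch of the algorithm above follows. The concrete choices are assumptions, since the pseudocode leaves them open: Accept is the Metropolis rule, Equilibrium is a fixed number of moves per temperature, DecreaseTemp is geometric cooling, and Frozen is a temperature threshold. Remembering the best partition seen so far is an addition that is not in the pseudocode, and the cost model is hypothetical.

```python
import math
import random


def simulated_annealing(partition, objfct, rng, temp=10.0, cooling=0.9,
                        moves_per_temp=50, frozen_temp=0.01):
    cost = objfct(partition)
    best, best_cost = dict(partition), cost
    while temp > frozen_temp:                    # Frozen (assumed: threshold)
        for _ in range(moves_per_temp):          # Equilibrium (assumed: fixed count)
            tentative = dict(partition)          # RandomMove: flip one random object
            obj = rng.choice(sorted(tentative))
            tentative[obj] = "hw" if tentative[obj] == "sw" else "sw"
            delta = objfct(tentative) - cost
            # Accept (assumed: Metropolis rule); improvements are always accepted.
            accept = 1.0 if delta < 0 else math.exp(-delta / temp)
            if accept > rng.random():
                partition, cost = tentative, cost + delta
                if cost < best_cost:
                    best, best_cost = dict(partition), cost
        temp *= cooling                          # DecreaseTemp (assumed: geometric)
    return best, best_cost


# Hypothetical cost model: software time if in software, area if in hardware.
SW_TIME = {"a": 10, "b": 4, "c": 8, "d": 2}
HW_AREA = {"a": 3, "b": 5, "c": 2, "d": 6}


def example_cost(p):
    return sum(SW_TIME[o] if side == "sw" else HW_AREA[o] for o, side in p.items())


start = {o: "sw" for o in SW_TIME}
best, best_cost = simulated_annealing(start, example_cost, random.Random(0))
print(best, best_cost)
```

At high temperature almost every move is accepted, so the search wanders freely; as the temperature drops, worsening moves become increasingly unlikely and the search settles into a minimum.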
Genetic evolution
Type | constructive (starts with random partitions) |
Complexity | unknown |
Grouping metric | cost |
Search type | hill-climbing |
- Emulates evolution through a set of methods called selection, crossover, and mutation.
- Requires more memory with generations of multiple partitions.
- Long run times, similar to simulated annealing.
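The three methods can be sketched as below. Everything concrete here is an assumption for illustration: truncation selection (keep the better half), uniform crossover, per-object flip mutation, the parameter values and the cost model.

```python
import random


def genetic_partition(objects, cost, rng, population_size=20, generations=40,
                      mutation_rate=0.1):
    """Evolve sw/hw partitions and return the cheapest one in the last generation."""
    population = [{o: rng.choice(("sw", "hw")) for o in objects}
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: population_size // 2]   # selection: keep best half
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            # Crossover: each object inherits its side from one of the parents.
            child = {o: rng.choice((mother[o], father[o])) for o in objects}
            for o in objects:                            # mutation: random flips
                if rng.random() < mutation_rate:
                    child[o] = "hw" if child[o] == "sw" else "sw"
            children.append(child)
        population = survivors + children
    return min(population, key=cost)


# Hypothetical cost model: software time if in software, area if in hardware.
SW_TIME = {"a": 10, "b": 4, "c": 8, "d": 2}
HW_AREA = {"a": 3, "b": 5, "c": 2, "d": 6}


def example_cost(p):
    return sum(SW_TIME[o] if side == "sw" else HW_AREA[o] for o, side in p.items())


best = genetic_partition(sorted(SW_TIME), example_cost, random.Random(1))
print(best, example_cost(best))
```

The whole population is kept in memory for every generation, which is the memory overhead mentioned in the list above.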
Estimators
Quality
There are three main factors considered when determining the quality of an estimator, namely speed, accuracy and fidelity.
Speed
The speed of an estimator indicates the time it takes to produce an estimate. This may be the essential metric in iterative improvement (e.g. Group Migration), where an estimate is needed for every evaluated move.
Accuracy
The accuracy indicates the quality of estimates with respect to real implementation results. It can be quantified using the following formula.
Fidelity
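The term fidelity is used to describe the quality of an estimation methodology by comparing estimated results with actual results: it indicates how often the estimates rank two design alternatives in the same order as the actual results do.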
We use the following formula for fidelity:
where
Example:
We have made an estimation of the power of different solutions to an architecture. The model is sampled with 5 variations of implementation.
Solution index | E | M |
1 | 40 | 22 |
2 | 50 | 54 |
3 | 60 | 73 |
4 | 70 | 62 |
5 | 80 | 72 |
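The table can be evaluated with the short sketch below. It assumes that E holds the estimated and M the measured values, that fidelity is the percentage of solution pairs ordered the same way by both, and that accuracy is 1 - |E - M| / M. These definitions are assumptions, since the formulas themselves are not reproduced in this text.

```python
from itertools import combinations

estimated = [40, 50, 60, 70, 80]   # column E of the table
measured = [22, 54, 73, 62, 72]    # column M of the table


def sign(x):
    return (x > 0) - (x < 0)


def fidelity(est, meas):
    """Percentage of solution pairs whose ordering the estimates predict correctly."""
    pairs = list(combinations(range(len(est)), 2))
    agree = sum(sign(est[i] - est[j]) == sign(meas[i] - meas[j]) for i, j in pairs)
    return 100.0 * agree / len(pairs)


def accuracy(est, meas):
    """Per-solution accuracy, assumed here to be 1 - |E - M| / M."""
    return [1.0 - abs(e - m) / m for e, m in zip(est, meas)]


print(fidelity(estimated, measured))
print(accuracy(estimated, measured))
```

Under these assumptions, 8 of the 10 solution pairs are ordered correctly (the pairs 3-4 and 3-5 are not), giving a fidelity of 80% even though several individual estimates are far from the measured values.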
Simulated annealing
Simulated annealing is a partition algorithm that models the annealing process in physics, where a material is melted and its minimal energy state is achieved by lowering the temperature slowly enough that equilibrium is reached at each temperature.
– Solution example for 2014 exam
temp = Initial temperature
cost = Objfct(P)
while not Frozen loop
    while not Equilibrium loop
        P_tentative = RandomMove(P)
        cost_tentative = Objfct(P_tentative)
        Δcost = cost_tentative - cost
        if (Accept(Δcost,temp) > Random(0,1)) then
            P = P_tentative
            cost = cost_tentative
        end if
    end loop
    temp = DecreaseTemp(temp)
end loop
Multiprocessors and accelerators
Speed-up formula
This formula defines the speedup by a specific utilization of hardware accelerators.
The above formula accounts for specific times. By tweaking this formula a bit we can easily calculate the speedup in terms of degree.
(example in 2015 exam solution)
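As a hedged illustration of the two views, the sketch below assumes one common textbook form: the time saved is the number of iterations times the difference between the CPU time and the accelerator time (input transfer, execution, output transfer), and the speedup degree is the ratio of total time before and after. The function names, parameter names and numbers are assumptions and may differ from the formula used in the course.

```python
def time_saved(n, t_cpu, t_in, t_x, t_out):
    """Time saved over n iterations when the kernel runs on the accelerator."""
    t_accel = t_in + t_x + t_out       # transfer in + execute + transfer out
    return n * (t_cpu - t_accel)


def speedup_degree(t_total, n, t_cpu, t_in, t_x, t_out):
    """How many times faster the whole application becomes."""
    return t_total / (t_total - time_saved(n, t_cpu, t_in, t_x, t_out))


# Hypothetical numbers, all in the same time unit.
saved = time_saved(n=100, t_cpu=50, t_in=5, t_x=10, t_out=5)
degree = speedup_degree(t_total=8000, n=100, t_cpu=50, t_in=5, t_x=10, t_out=5)
print(saved, degree)
```

The example shows why transfer times matter: the accelerator executes the kernel five times faster than the CPU, but the application as a whole only becomes 1.6 times faster.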
Instruction set extensions
Generator
ISEGEN
ISEGEN is an instruction set extension generator. ISEGEN will allow the user to optimise a system where an application is running on a CPU by identifying chunks of the program that can be beneficial to replace with a custom instruction, i.e. an instruction set extension, implemented in hardware.
– Solution example for 2014 exam
SetInitialConditions()
last_best_C ⇐ C
loop (until exit condition)
    best_C ⇐ last_best_C
    while (there exists unmarked node in DFG)
        foreach (unmarked node n)
            Calculate M_toggle(n,best_C)
        endfor
        best_node ⇐ Node with maximum Gain
        Toggle and Mark best_node
        CalcImpactOfToggle(best_node,best_C)
        if (toggling best_node satisfies constraints)
            Update best_C from toggling best_node
            Calculate M(best_C)
        endif
    endwhile
    if (M(best_C) > M(last_best_C))
        last_best_C ⇐ best_C
        Unmark all nodes
    endif
endloop
C ⇐ last_best_C
On-chip communication architectures
Bus throughput
May be defined as
Split Transfer
In a split transfer, a slave that cannot respond immediately splits the bus access (through an arbiter), which temporarily allows another master to utilize the bus. When the initial slave is ready to return communication, the bus is unsplit and the original transfer continues.
Arbitration schemes
Time Division Multiple Access (TDMA)
Time Division Multiple Access (TDMA) arbitration assigns a bus access time slot to each master. The length of the slots assigned to each master can vary, depending on its transfer requirements. If a master has nothing to send in its time slot, the time slot is wasted, with reduced bandwidth utilization as a result.
Round Robin (RR)
Round Robin (RR) arbitration gives the masters access to the bus in a circular manner. It is simple to implement and every master will get access to the bus but with unpredictable delay. Critical data may have to wait a long time.
TDMA/RR
A two-level TDMA/RR first divides the bus bandwidth into a set of time slots for each master. If a master has nothing to send in its time slot, the slot is given to the other masters in a circular manner. This increases the bandwidth utilization compared with pure TDMA. The cost is an increase in the arbiter implementation complexity.
– Solution example for 2012 exam
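The two-level scheme can be modelled with a small behavioural sketch. The slot table, master names and request pattern are hypothetical, and a real arbiter would be implemented in hardware rather than as a Python function.

```python
def tdma_rr_arbiter(slot_owners, requests_per_slot):
    """Return the master granted the bus in each time slot (None if idle).

    slot_owners: masters owning the slots of one TDMA frame, in order.
    requests_per_slot: for each time slot, the set of masters with pending data.
    """
    masters = sorted(set(slot_owners))
    rr_next = 0                        # round-robin pointer for unused slots
    grants = []
    for slot, requests in enumerate(requests_per_slot):
        owner = slot_owners[slot % len(slot_owners)]
        if owner in requests:          # level 1: the TDMA owner uses its slot
            grants.append(owner)
            continue
        granted = None                 # level 2: hand the unused slot out in RR order
        for offset in range(len(masters)):
            candidate = masters[(rr_next + offset) % len(masters)]
            if candidate in requests:
                granted = candidate
                rr_next = (rr_next + offset + 1) % len(masters)
                break
        grants.append(granted)
    return grants


frame = ["m0", "m0", "m1", "m2"]       # m0 is assigned a longer share of the frame
requests = [{"m0"}, {"m1", "m2"}, {"m2"}, set(), {"m0", "m1"}, {"m1", "m2"}]
grants = tdma_rr_arbiter(frame, requests)
print(grants)
```

Pure TDMA would leave slots 1, 2 and 5 of this trace unused because their owners have nothing to send; the round-robin level reassigns them, at the cost of the extra pointer and search logic in the arbiter.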
Implementing embedded systems
Address space sharing and folding
(example in 2014 exam solution)
Retargetable compilers
A retargetable compiler is driven by a description of machine resources such as
- instruction set
- register files
- instruction scheduling
In a retargetable compiler it is possible to edit this (back-end) processor model.
Acronyms
IR | Intermediate representation |
CFG | Control flow graph |
DFG | Data flow graph |
Code-selection
(example, using DFG, in 2013 exam)
Loop optimization techniques
Techniques for optimizing loops for performance, power or even area. Beware that some methods cannot be used if, for example, different iterations of the loop(s) are dependent on each other.
Loop permutation
Primarily intended for optimizing address locality and cache hit rate.
Simplified code example:
for k in range(M):
    for j in range(N):
        foo(p[j][k])
would likely be optimized for cache (assuming row-major storage) by swapping the iteration order of the indices, as in the following example
for j in range(N):
    for k in range(M):
        foo(p[j][k])
Loop fusion
Illustrated with the following example transformation, which may be useful for cache optimization.
for j in range(N):
    foo(p[j])
for j in range(N):
    bar(p[j])
becomes
for j in range(N):
    foo(p[j])
    bar(p[j])
Loop fission
Loop fission is basically doing the above transformation in reverse. Likely favourable for multi-core processors for concurrent processing.
Loop unrolling
Illustrated in the following example transformation.
for (i = 0; i < 15; i++)
{
    p[i] = foo(i);
}
becomes
for (i = 0; i < 15; i+=3)
{
    p[i] = foo(i);
    p[i+1] = foo(i+1);
    p[i+2] = foo(i+2);
}
With a loop factor
Note that loops with
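The example above works without extra handling because the trip count, 15, is divisible by 3. A common way to handle other trip counts is a remainder loop, sketched here in Python. The `foo` body is a stand-in, and the function names are chosen for the example.

```python
def foo(i):
    return i * i                       # stand-in for the loop body's computation


def rolled(n):
    p = [0] * n
    for i in range(n):
        p[i] = foo(i)
    return p


def unrolled_by_3(n):
    p = [0] * n
    main_end = n - n % 3               # largest multiple of 3 not exceeding n
    for i in range(0, main_end, 3):    # unrolled body: three iterations per pass
        p[i] = foo(i)
        p[i + 1] = foo(i + 1)
        p[i + 2] = foo(i + 2)
    for i in range(main_end, n):       # remainder loop for the leftover iterations
        p[i] = foo(i)
    return p


print(unrolled_by_3(8) == rolled(8))
```

The unrolled version performs one third of the loop-control checks in the main loop, at the cost of larger code.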
Loop splitting
for j in range(N):
    if j<K:
        foo()
    else:
        bar()
becomes
for j in range(K):
    foo()
for j in range(K, N):
    bar()
High Level Synthesis
Benchmarking (2017) shows that commercial tools currently yield generally better results than academic ones.
Academic (non-commercial) tools
- Dwarv (CoSy C $\rightarrow$ VHDL)
- Bambu (GCC C $\rightarrow$ Verilog)
- LegUp (C $\rightarrow$ RTL)
Optimization methods
- Bitwidth analysis and optimization
- Memory space allocation
- Exploiting spatial parallelism
- If-conversion
- Operation Chaining
- Loop Optimizations
- Hardware Resource Libraries
- Speculation and Code Motion
Uncategorized topics
Pareto curve example
Closeness vs. cost
A cost function (or objective function) uses metrics (e.g., area, power consumption, performance) and weights to define the “goodness” of a given solution. Different metrics can have different weights (e.g., area may be more important than power consumption). This can be included using weighting constants in the function.
A closeness function uses closeness metrics to indicate the desirability of grouping objects. Parts of a system that can use the same hardware, e.g., a multiplier, but at different points in time during execution, can for instance be grouped together. Processes that communicate a lot will also often benefit from being grouped together, and are hence seen as close.
– Solution example for 2012 exam
Homogeneous and heterogeneous specification
...