TDT1: Architecture of Computing Systems
This compendium covers the 2015 version of TDT1.
The following papers form the curriculum:
Part 1 Modern computing architectures (Lasse): with a focus on multicores and energy-efficient computing.
- 1.1 Models and metrics to enable energy-efficiency optimizations, Suzanne Rivoire et al., Computer, December 2007 (ITSL = made available to students through Its Learning). Presentation
- 1.2 Case Studies of Multi-core Energy Efficiency in Task Based Programs, Hallgeir Lien et al., in Proc. of ICT-GLOW (ICT against Global Warming), Vienna, Sept 2012 (ITSL)
- 1.3 Feedback-Driven Threading: Power-Efficient and High-Performance Execution of Multithreaded Workloads on CMPs, Suleman et al., ASPLOS 2008 (ITSL)
- 1.4 Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction, Kumar et al., MICRO 2003 (ITSL)
- 1.6 Optimized Hardware for Suboptimal Software: The Case for SIMD-aware Benchmarks, J. M. Cebrian et al., ISPASS 2014. Presentation
Part 2 Unconventional computing architectures (Stefano, www.nichele.eu, nichele@idi.ntnu.no):
Current and future research on new alternative computing architectures that go beyond the traditional Turing/von Neumann paradigm:
- 2.1 Moshe Sipper - The Emergence of Cellular Computing (http://www.cs.bgu.ac.il/~sipper/papabs/cellcomp.pdf)
- 2.2 Melanie Mitchell - Life and Evolution in Computers (http://web.cecs.pdx.edu/~mm/life-and-evolution.pdf)
- 2.3 Moshe Sipper et al. - A Phylogenetic, Ontogenetic and Epigenetic View of Bio-Inspired Hardware Systems (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=585894&tag=1 also available on ITSLEARNING)
- 2.4 Julian Miller et al. - Evolution-in-Materio: Evolving Computation in Materials (http://www.cartesiangp.co.uk/papers/ei2014-miller.pdf) Presentation
- 2.5a D-Wave Quantum Computer Architecture (http://www.dwavesys.com/sites/default/files/D-Wave-brochure-Aug2015B.pdf)
- 2.5b Programming with D-Wave, Map Coloring Problem (http://www.dwavesys.com/sites/default/files/Map Coloring WP2.pdf)
- 2.6 Lila Kari - DNA Computing: Arrival of Biological Mathematics (http://link.springer.com/article/10.1007%2FBF03024425 also available on ITSLEARNING)
1.1 Models and metrics to enable energy-efficiency optimizations
Intro
Power consumption has become a major concern in data centers, as the consumption doubled between 2000 and 2006 (and a lot has probably happened since). Better energy efficiency can reduce costs directly, because computers use less power. In addition, more efficient computers generate less heat and thus require less cooling.
The paper introduces a new benchmark for measuring energy efficiency, namely JouleSort.
Energy-efficiency metrics
The ideal benchmark for energy efficiency would balance power and performance perfectly, and its rules should be impossible to circumvent.
Energy-delay product
Comparing processors based on energy (the product of execution time and power) could motivate processor designers to lower the clock frequency alone to achieve "energy efficiency", which would be very bad for performance. The energy-delay product, however, weighs power against the square of execution time, so it rewards designs that are genuinely efficient rather than merely slow.
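As a worked illustration (all numbers are invented, not from the paper), a small Python sketch comparing a fast and a slow hypothetical processor under the plain energy metric and the energy-delay product:

```python
# Toy comparison of energy vs. energy-delay product (EDP).
# The configurations below are hypothetical, chosen only to illustrate the metrics.

def energy(power_w, time_s):
    """Energy in joules: E = P * t."""
    return power_w * time_s

def edp(power_w, time_s):
    """Energy-delay product: EDP = E * t = P * t^2."""
    return power_w * time_s ** 2

fast = {"power_w": 100.0, "time_s": 10.0}   # fast, power-hungry design
slow = {"power_w": 20.0, "time_s": 40.0}    # slow, low-power design

for name, cfg in [("fast", fast), ("slow", slow)]:
    print(f"{name}: energy = {energy(**cfg):.0f} J, EDP = {edp(**cfg):.0f} J*s")

# The slow design wins on plain energy (800 J vs 1000 J) but loses badly on
# EDP (32000 J*s vs 10000 J*s), because EDP weighs execution time quadratically.
# Both are special cases of the performance^N / watt family mentioned in paper
# 1.2: for a fixed amount of work, N = 1 is the inverse of energy and N = 2
# the inverse of EDP.
```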
JouleSort design goals
- Should evaluate trade-off between power and performance.
- No reward for only high performance or only low power.
- The metric measured is energy (not energy-delay). This is because another benchmark that emphasizes performance is not needed; JouleSort can instead emphasize energy alone.
- Also, the energy-delay product does not make as much sense at the system level as it does at the processor level.
- Should be balanced
- Should be inclusive
- As many systems as possible should be measurable with the benchmark
- The metric and the workload should be applicable to many different technologies.
The workload in JouleSort is a standard external sort (the same as in many other sort benchmarks).
JouleSort ranks systems by the number of records they can sort per joule consumed.
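A minimal sketch of the JouleSort score itself, with invented numbers (the real benchmark measures whole-system wall power over a complete external sort):

```python
# JouleSort score: records sorted per joule of whole-system energy.
# All figures below are invented purely for illustration.
records_sorted = 1_000_000_000   # records processed by the external sort
avg_power_w    = 80.0            # average wall power during the sort (W)
elapsed_s      = 1200.0          # elapsed sort time (s)

energy_j = avg_power_w * elapsed_s      # joules consumed by the whole run
score    = records_sorted / energy_j    # records per joule
print(f"JouleSort score: {score:.0f} records/J")
```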
Results of JouleSort
- Winners of PennySort do well.
- Cost efficiency has improved more than energy efficiency over the last years.
- Laptops and file servers have the best results.
In trying to create a machine custom tailored for the JouleSort benchmark, the most important decision was to balance the performance between CPU and I/O. A fast CPU is no use if it needs to idle most of the time waiting for I/O.
Improving the performance of mainly energy efficient computers seems to be a better approach than trying to make high-performance computers more energy efficient.
The authors created a "CoolSort" machine consisting of a high-end mobile CPU and 13 SATA laptop disks. This ratio let both the CPU and the disks be utilized at close to maximum capacity.
1.2 Case Studies of Multi-core Energy Efficiency in Task Based Programs
Intro
Saving energy is a top priority in most computing systems. Recently, we have seen a convergence between embedded systems and High Performance Computing. Both these market segments now have energy efficiency as a major design goal.
Task Based Programming (TBP) has recently gained increasing interest. In newer TBP systems, like OmpSs, the runtime system takes care of task dependencies. This further simplifies programming of heterogeneous and hybrid architectures.
The authors use three benchmarks, Black-Scholes, Fastest Fourier Transform in the West (FFTW), and Matrix Multiplication, to compare the effects of applying vectorization and thread-level parallelism. This is done through OmpSs, with vectorization using AVX and SSE.
Background
The authors use FLOPS/W as their energy metric. The choice is motivated by the European HPC project Mont Blanc, which uses the same metric in its studies. Other metrics exist as well, like energy and the Energy-Delay Product (EDP). Energy (the product of power and execution time) motivates the use of slow processors with low frequency to obtain good energy numbers. EDP could be used instead, as it puts more weight on performance. Article 1.1 presents the general metric Performance^N / Watt; EDP corresponds to N=2, while N=0 focuses on power consumption.
This study uses performance counters to measure performance. Only useful floating-point operations are counted, i.e., the FLOPs that actually contribute towards solving the problem. Energy measurements were obtained from special registers in the architecture used in the study; these only measure the on-chip energy consumption. To give fair comparisons between the benchmarks, it is important that the problem sizes fit in the Last Level Cache (LLC, L3).
Experiments and Results
Black-Scholes scales favorably with the use of multithreading. It also retains performance and energy efficiency at out-of-cache problem sizes. AVX provides better results than SSE overall for this benchmark.
FFTW is bandwidth bound for problem sizes beyond the available cache space. For problem sizes fitting in the L2 cache, single-threaded execution gives better energy efficiency. For larger problem sizes, FFTW gains no benefit from hyperthreading. AVX performs better than SSE.
Matrix Multiplication is computation bound, which gives high performance and energy efficiency when using 4 threads on problem sizes above 512x512. 8 threads have lower energy efficiency and also perform worse than 4 threads. Since two hyperthreads share one core, they also share resources (ALU and L1/L2 cache). Because Matrix Multiply is computation bound, some hardware threads sit idle waiting for resources while still consuming energy.
Conclusion
Vectorization provides a significant improvement in both performance and on-chip energy efficiency. However, even though hyperthreading provides a performance boost, it does not necessarily imply better energy efficiency.
1.3 Feedback-Driven Threading: Power-Efficient and High-Performance Execution of Multithreaded Workloads on CMPs
Framework to find optimal threading for a program.
- FDT: Feedback-driven threading
- SAT: Synchronization-aware threading
- BAT: Bandwidth-aware threading
- The FDT system is used to implement SAT and BAT. These reduce execution time and power significantly.
- Shared data in a multi-threaded application is kept synchronized by using critical sections.
- With critical sections, we can use SAT to boost performance. A critical section can only be executed by one thread at a time, so all other threads have to wait for that thread to finish.
- Training works by having the compiler insert markers before and after each critical section, so the time spent in critical sections is known.
- When an application is highly parallelized, fetching new data will often be the bottleneck.
- In these cases, we can use BAT to optimize for bandwidth usage.
- Training works by putting markers before and after each access to off-chip data, so the time spent waiting on off-chip accesses is known.
- One can combine SAT and BAT.
- It works like this:
- Run a short training phase that executes a small part of the work and measures where the time goes.
- After the short time it takes to train, run the rest of the work with the number of threads suggested by the training information.
- If we combine SAT and BAT, do this:
- Compute the recommended number of threads for both BAT and SAT. The minimum of the two minimizes the overall execution time.
- For applications limited by data synchronization, increasing the number of threads significantly increases execution time and power.
- For applications limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement.
- Having a system that controls the number of threads can reduce both execution time and power.
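As a hedged sketch of the idea (simplified illustrative formulas, not the exact expressions from the paper; all function names and example numbers are invented):

```python
# Hypothetical sketch of Feedback-Driven Threading (FDT).
# The models below are crude stand-ins for the SAT/BAT analysis in the paper.

def sat_threads(t_parallel_per_thread, t_cs, max_threads):
    """Synchronization-Aware Threading (SAT): critical sections serialize
    execution, so stop adding threads once the serialized critical-section
    time would rival the per-thread parallel work."""
    if t_cs <= 0:
        return max_threads
    return max(1, min(max_threads, round(t_parallel_per_thread / t_cs)))

def bat_threads(per_thread_bw_gbs, system_bw_gbs, max_threads):
    """Bandwidth-Aware Threading (BAT): once off-chip bandwidth is saturated,
    extra threads burn power without improving performance, so cap the
    thread count at the saturation point."""
    if per_thread_bw_gbs <= 0:
        return max_threads
    return max(1, min(max_threads, int(system_bw_gbs // per_thread_bw_gbs)))

def fdt_threads(t_parallel_per_thread, t_cs, per_thread_bw_gbs, system_bw_gbs, max_threads):
    """Combined FDT: take the minimum of the SAT and BAT recommendations."""
    return min(sat_threads(t_parallel_per_thread, t_cs, max_threads),
               bat_threads(per_thread_bw_gbs, system_bw_gbs, max_threads))

# Example with invented training measurements:
print(fdt_threads(t_parallel_per_thread=8.0,  # seconds of parallel work per thread
                  t_cs=1.0,                   # seconds spent inside critical sections
                  per_thread_bw_gbs=3.0,      # off-chip bandwidth used by one thread (GB/s)
                  system_bw_gbs=12.0,         # off-chip bandwidth available (GB/s)
                  max_threads=16))            # prints 4: bandwidth is the limiter here
```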
1.4 Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction
The paper investigates whether we can achieve power savings while keeping performance by using single-ISA heterogeneous multi-core architectures. The answer? Yes, we can!
How is this achieved?
The paper describes four different architectures with differing degrees of complexity.
Task-switching may be done when the current process is scheduled out of the CPU.
Power usage is estimated using Wattch and adjusted for architectural differences using peak power and typical power.
Benchmarking is done using SPEC2000.
Results
The study shows that the performance ratio between the different cores is not constant; it varies across applications and program phases. Using an oracle-based switcher that always picks the best core gives very good results.
In some cases, using two cores may be sufficient to produce significant gains.
Core switching
Every 100th time interval, one or more cores are sampled. Different heuristics are:
- neighbor
- neighbor-global
- random
- all
Then, a core is selected.
These heuristics achieve up to 93% of the energy-delay gains of the oracle-based switcher.
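A hedged sketch of the sampling-and-switching idea (the interfaces and trigger policy below are simplified assumptions; the actual study evaluates this in simulation, and the neighbor-global variant is omitted):

```python
# Simplified sketch of dynamic core switching on a single-ISA heterogeneous
# multi-core. The core labels follow the Alpha cores used in the paper.
import random

CORES = ["EV4", "EV5", "EV6", "EV8-"]   # four complexity-graded cores

def sample_cores(current, heuristic):
    """Pick which other cores to try during the next sampling interval."""
    idx = CORES.index(current)
    if heuristic == "neighbor":
        neighbours = [i for i in (idx - 1, idx + 1) if 0 <= i < len(CORES)]
        return [CORES[random.choice(neighbours)]]
    if heuristic == "random":
        return [random.choice([c for c in CORES if c != current])]
    if heuristic == "all":
        return [c for c in CORES if c != current]
    raise ValueError(f"unknown heuristic: {heuristic}")

def switch_if_better(current, measure_edp, heuristic="neighbor"):
    """Run the sampled cores for one interval and keep the core with the
    lowest measured energy-delay product (EDP)."""
    best, best_edp = current, measure_edp(current)
    for candidate in sample_cores(current, heuristic):
        edp = measure_edp(candidate)
        if edp < best_edp:
            best, best_edp = candidate, edp
    return best
```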
Conclusion
A sample heterogeneous multi-core design with four complexity-graded cores has the potential to increase energy efficiency (defined as energy-delay product, in this case) by a factor of three, in one experiment, without dramatic losses in performance.
Using the single-ISA heterogeneous multi-core architecture yields far better results for energy efficiency than simply reducing the clock frequency. Some performance loss is expected, but the energy efficiency gains outweigh the losses in performance.
1.6 Optimized Hardware for Suboptimal Software: The Case for SIMD-aware Benchmarks
There is a need to create new benchmarks that better reflect new breakthroughs in processor design. Creating new processors based on old benchmarks that do not take new problem-solving techniques into consideration will not lead to better processors. For example, if a benchmark does not measure vectorization, and new processors are tuned to optimize for that benchmark, the resulting processors will not be as good as they could be.
This paper introduces a new benchmark that takes vectorization into consideration. The benchmark shows how SIMD code changes scalability, energy efficiency and hardware requirements.
2.1 The emergence of Cellular Computing
- Von Neumann studied cellular models of computation toward the end of his academic career.
- The von Neumann architecture uses one complex processor to sequentially perform a single complex task.
- Cellular computing uses a different approach for computing these tasks.
- At heart, three principles: simplicity, vast parallelism and locality.
- Simplicity
- Each unit in the cellular computer - the cell - is simple. On its own it cannot do much, but together, multiple cells can do a lot.
- Vast parallelism
- Each cell does its calculations mostly on its own, with some input from its neighbors. This creates parallelism on a much bigger scale than seen elsewhere.
- Locality
- A cell can only talk to its closest cells, and only a small amount of information can be carried through. There is no central controller.
- Cellular computing = simplicity + vast parallelism + locality
- Cell type
- Discrete - the cell takes on discrete values.
- Continuous - the cell takes on analog values.
- Cell definition
- Exhaustive enumeration
- Parameterized function
- Program
- Behavioral rules
- Cell mobility
- A cell can either stay put and only change its value, or move and change its environment.
- Connectivity scheme
- Regular grid - an n-dimensional array of a given geometry gives the connectivity of the cells, e.g. the Game of Life.
- Graph - a non-regular connectivity structure.
- Deterministic models will give the same output and the same trajectory through the state space for a given input.
- Nondeterministic models will not necessarily give the same trajectory and output for a given input.
- Global vs. local problems
- The difference can be subtle.
- A global problem is, for example, deciding whether there are more 1s than 0s in the CA, which requires information from the whole grid.
- A local problem is something like grayscaling an image, since each pixel can be processed by one cell using only local information.
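As a concrete illustration of these principles (not taken from the paper), a minimal Game of Life step in Python: each cell is trivially simple, all cells update in lockstep, and each cell only reads its eight nearest neighbours:

```python
# One synchronous update step of Conway's Game of Life on a wrapped grid.

def life_step(grid):
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live neighbours on a torus (edges wrap around).
            live = sum(grid[(r + dr) % rows][(c + dc) % cols]
                       for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                       if (dr, dc) != (0, 0))
            # Classic rules: a live cell survives with 2 or 3 live neighbours,
            # a dead cell becomes alive with exactly 3 live neighbours.
            new[r][c] = 1 if live == 3 or (grid[r][c] and live == 2) else 0
    return new

# A glider on a 6x6 grid; repeated steps move it diagonally across the grid.
glider = [[0, 1, 0, 0, 0, 0],
          [0, 0, 1, 0, 0, 0],
          [1, 1, 1, 0, 0, 0],
          [0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0]]
print(life_step(glider))
```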
2.2 Life and Evolution in Computers
Arguments for what computers need in order to have 'life':
- Autonomy
- Metabolism
- Self-reproduction
- Survival instinct
- Evolution and adaptation
The question is whether we can make "intelligent" and "alive" computers. There is some controversy about this, but the author argues that we absolutely can.
- A couple of things are commonly said to be impossible for a computer program: autonomy, metabolism, self-reproduction, survival instinct, evolution and adaptation, etc. All of these have more or less been achieved by a computer.
- "Achieved" here means that there is at least one example the author considers good enough. Whether it actually is good enough is up for debate.
- You can make a program that prints its own source code (see the quine sketch at the end of this section). To achieve this you exploit the fact that information in a computer can be either executable code or data.
- The program can then read itself as data and print that data.
- You still need an interpreter that executes the program; this interpreter is completely external to the program itself.
- In other self-replication, for instance in DNA, the building blocks for creating the "interpreter" are included (messenger RNA), which makes DNA itself completely self-replicating.
- Computer programs like "genetic algorithms" use an idealized computational version of evolution.
- These systems have proven extremely effective at finding good and sometimes unusual solutions to hard problems.
- Emergent behaviour.
- New computer systems are moving towards more parallel, decentralized designs.
- These mimic nature better than traditional systems.
- They have many properties that are desired of computer systems nowadays.
- Fast at detecting complex patterns.
- Robust - if one part fails, the rest can move on.
- Good at adapting to circumstances.
- Cellular automata
- Massively parallel; e.g. the Game of Life.
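The self-printing program mentioned above fits in two lines. A minimal Python quine as an illustration (not from the paper); note that it still depends on the external Python interpreter, exactly as the notes point out:

```python
# A program that prints its own source: the string s is used both as data
# (formatted into the output) and as the template for the code that formats it.
s = 's = %r\nprint(s %% s)'
print(s % s)
```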
2.3 A Phylogenetic, Ontogenetic, and Epigenetic View of Bio-Inspired Hardware Systems
Three levels of organization:
- Phylogenetic
- temporal evolution of the genetic programs within individuals and species
- Ontogenetic
- developmental process of a single multicellular organism
- Epigenetic
- learning processes during an individual organism’s lifetime
- A classification model built from these three axes is used to classify bio-inspired hardware systems.
- This could lead to a better organization of the different approaches in the field.
Systems can be categorized on planes in the POE model:
- PO: Phylogenetic-Ontogenetic
- PE: Phylogenetic-Epigenetic
- OE: Ontogenetic-Epigenetic
2.4 Evolution-in-Materio: Evolving Computation in Materials
Basically what it is
Evolution-in-materio is about using evolutionary algorithms on a physical material to get a desired computational behaviour out of that material. There are several reasons for doing this: we are approaching the silicon cap, where packing in more transistors would make our processors melt, so we may want to find some other basis for computation than silicon. Tests have also shown that this technique is a good way of supplementing the human mind's thinking, since the human mind is often bound by its existing knowledge.
Working examples
There are multiple examples of this actually working. The evolved NASA antenna, flown on the ST5 mission satellites, might be the best-known example.
Measuring and changing problems
When working with evolution-in-materio, we must be able to tell whether we are significantly changing the material in question. Matter changes all the time, and it can be difficult to see whether the material is changing because of us or regardless of us. To combat this, we need ways of measuring changes in the material. If we are not careful, we might be measuring floating pins and still see the same results as we would expect from measuring the actual material. If our material just spits out random numbers, we have created a random number generator, not a machine that computes anything useful. We want the material to be on the edge of chaos.
Ways of evolving
There are three main ways of using evolutionary algorithms in the context of evolution in materio.
- Purely in software
You evolve your design in software, to be used in software (see the sketch after this list).
- Evolution in software produces results in real life
You evolve in software and take the result into the real world. This is often a blueprint or a design that is used to create a physical item.
- Physical evolution in actual materio
You have a material whose physical properties you change, and from which you read off data.
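A hedged sketch of the first option ("purely in software"): a tiny genetic algorithm that evolves a bit string toward a target pattern. The target, fitness function and parameters are toy choices, not taken from the paper:

```python
# Minimal genetic algorithm: evolve a bit string to match a fixed target.
import random

TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]

def fitness(genome):
    """Number of bits matching the target (higher is better)."""
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - g if random.random() < rate else g for g in genome]

def crossover(a, b):
    """Single-point crossover between two parent genomes."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=30, generations=200):
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for gen in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == len(TARGET):
            return gen, pop[0]                    # perfect match found
        parents = pop[: pop_size // 2]            # simple truncation selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return generations, pop[0]

print(evolve())
```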
Some notes on the field itself
This is a very new field and it will take more research before it matures. The results so far have to be taken with a grain of salt, as they are not always easy, or even possible, to replicate.
2.5 D-wave
a) Architectural overview
- Qubit: can be either 1 or 0; you don't know which until you measure it.
- You can have entangled qubits, so that when you have read one value, the other values are determined.
- Two entangled particles A and B can be entangled so that when A is 1, B has to be 0.
- This is used, according to D-Wave, to solve NP-hard problems in linear time.
- Architecture:
- The processor is shielded so that the residual magnetic field is 50,000 times weaker than the Earth's magnetic field.
- Vacuum where the pressure is 10 billion times lower than atmospheric pressure.
- Extreme cold: 0.015 kelvin.
- Programming
- D-Wave has created an API that wraps the way one programs the machine.
- Since it is a probabilistic computer, it gives multiple good answers in a short period of time.
- 10,000 answers in a second, which is a lot of good alternatives to choose from.
- Used by industry to find new and exciting ways to solve complex problems.
- Actual results are yet to be shown.
b) Programming: The Map Coloring Problem
- To solve a problem, follow these four steps:
- Turn on exactly one of several qubits (one qubit per allowed color).
- Map a single region to a unit cell.
- Implement neighbor constraints using intercell couplers.
- Clone neighbors to meet all neighbor constraints.
- Since there are no registers, we have to influence the qubits indirectly.
- Assign a weight to each qubit, and introduce the notion of a coupler that connects two qubits together; each coupler is given a strength.
- This makes it possible to write down an objective function (see the sketch below).
- When the objective is minimized, the constraints are satisfied.
- This paper shows that with enough qubits, one can solve the four-color map problem. However, this requires very many logical qubits, which translates into even more physical qubits.
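As a hedged illustration of the weights-and-couplers objective mentioned above, a brute-force check of a tiny two-region, two-color instance (the encoding and penalty values are assumptions for illustration, not taken from the D-Wave whitepaper):

```python
# Toy QUBO-style objective for map coloring: qubit q[(region, color)] = 1 means
# "this region gets this color". Weights and coupler strengths are chosen so
# that minimizing the objective (1) turns on exactly one color per region and
# (2) keeps neighbouring regions from sharing a color.
from itertools import product

REGIONS   = ["A", "B"]
COLORS    = [0, 1]
NEIGHBORS = [("A", "B")]   # A and B share a border
P = 2.0                    # penalty strength (an arbitrary choice)

def objective(q):
    """q maps (region, color) -> 0/1. Lower objective values are better."""
    total = 0.0
    for r in REGIONS:
        on = sum(q[(r, c)] for c in COLORS)
        total += P * (on - 1) ** 2                # exactly-one-color penalty
    for r1, r2 in NEIGHBORS:
        for c in COLORS:
            total += P * q[(r1, c)] * q[(r2, c)]  # same-color penalty on neighbours
    return total

# Brute force over all assignments, as a stand-in for what the annealer is
# asked to do: find the minimum-energy state of the objective.
variables = list(product(REGIONS, COLORS))
best = min((dict(zip(variables, bits)) for bits in product((0, 1), repeat=len(variables))),
           key=objective)
print({r: [c for c in COLORS if best[(r, c)]] for r in REGIONS}, objective(best))
```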