Table of Contents

Curriculum 2019
Chapter 1 - Introduction
Chapter 2 - Memory Hirarchy Design
1. VIPT - Virtually Indexed Physically Tagged
2. Cache optimization
Chapter 3 - Instruction-Level Parallelism
1. Dependencies
Chapter 4 - Data level paralellism (also includes parts of chapter 1)
1. The Three Laws
2. Computational intensity
  1. Roofline model
Chapter 6 - Warehouse Scale Computing
Appendix F - Interconnects
Paper - The Future of Multiprocessors, Borkar et al., Communications of the ACM, 2011
1. Factors to consider when designing CMPs
2. Transistors are shrinking
  1. Transistors per IO pin
  2. Less cache per core
The future of microprocessors
Vilje
Redefining the Role of the CPU in the Era of CPU-GPU Integration

‹

TDT4260: Computer Architecture

Tags:

Curriculum 2019

Computer Architecture - A Quantitative Approach, 5th /6th Edition, Hennessy and Patterson

Appendix B.1-B.7
Appendix C.1-C.5
Chapter 1.1-1.12
Chapter 2.1-2.8
Chapter 3.1-3.9, 3.11-3.14, 3.15/3.10 (5th / 6th edition)
Chapter 4.1-4.9
Chapter 5.1-5.7, 5.9
Chapter 6.1-6.6
Appendix F.1-F.6

Papers

Ten Simple Rules for Structuring Papers, Brett Mensh and Konrad Kording
The Future of Microprocessors, Shekahr Borkar and Andrew A. Chien
Digital Leads the Pack with 21164 - First of Next-Generation RISCs Extends Alpha’s Performance Lead - Linley Gwennap

Chapter 1 - Introduction

Moore's Law, explosive growth in computer power, yada yada yada.

Power dissipation problems have arrived since 2002, now multiple cores is the shit.

Dennard scaling states that as transistors get smaller, the power density stays constant. This basically means that cores can go faster because of more transistors, and still have the same power consuption.

Defining Computer Architecture

Originally, Computer Architecture design was only the instruction set, but now a lot more is needed to design a computer.

MIPS is used as example ISA in the book.

There are two components to a computer, organization and hardware.

The organization includes the high-level aspects of a computer's design, for example memory system and interconnect and the design of the internal CPU.

The hardware refers to more specifics of a computer, the logic design and packaging.

Computer Architecture is defined in the book as all three aspects of a computer, ISA, organization and hardware.

To design a computer, you need to meet functional requirements as well as price, power, performance and availabilty goals. And as well you will often need to define what the functional requirements are.

Performance

"X is n times faster than Y":

$$ n=\frac{Execution\:time_{Y}}{Execution\:time_{X}}=\frac{\frac{1}{Performance_{Y}}}{\frac{1}{Performance_{X}}}=\frac{Performance_{X}}{Performance_{Y}}$$

According to the book, the only consistent and reliable measure of performance is the execution time. Time is however not defined unambigiously. The most straightforward definition is wall-clock time (or response time or elapsed time), which his the time you would get timing a task with a stopwatch. CPU time looks at how long time the CPU works on the task, not waiting for I/O or running other programs.

Quantitative Principles of Computer Design

Take advantage of Parallelism

Really important, at every level.

System level

Multiple processors and disks, improved throughput. Ability to expand memory and number of CPUs is called scalability.

Processor level

Exploit parallelism among instructions. Easiest way is through pipelining, i.e. overlap instruction execution to reduce the total time to complete an instruction sequence. Not every instruction depends on its immediate predecessor.

Principle of Locality

Programs tend to reuse data and instructions it has used recently. An implication of locality is that we can predict with reasonable accuracy what instructions and data the program will use in near future, based on recent accesses.

Focus on the Common Case

If you want to make something faster, it's often worth focusing on the part that takes up most time, which is usually a process that happens often. If 80% of execution time is spent waiting for memory requests, that's a strong indication that this part should be improved.

Amdahl's Law

Amdahl's Law defines the speedup that can be gained by using a particular feature, which is an enhancement that will improve performance when it is used.

$$Speedup=\frac{Performance\:with\:enhancement}{Performance\:without\:enhancement}$$

Alternatively,

$$Speedup=\frac{Execution\:time\:without\:enhancement}{Execution\:time\:with\:enhancement}$$

Speedup tells us how much faster a task will run using the computer with the enhancement.

$$Execution\:time_{new} = Execution\:time_{old}\times \left((1-Fraction_{enhanced})+\frac{Fraction_{enhanced}}{Speedup_{enhanced}}\right)$$

$$Speedup_{overall} = \frac{1}{Execution\:time_{new}}$$

Example

Suppose that we want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overal speedup gained by incorporating the enhancement?

Answer: $$Fraction_{enhanced} = 0.4$$

$$Speedup_{enhanced} = 10$$

$$Execution\:time_{overall} = 0.6+\frac{0.4}{10} = 0.64$$

$$Speedup_{overall} = \frac{1}{0.64} \approx 1.56$$

The Processor Performance Equation

CPU time can be expressed as:

$$CPU\:time = CPU\:clock\:cycles\:for\:a\:program\times Clock\:cycle\:time$$

$$CPU\:time = \frac{CPU\:clock\:cycles\:for\:a\:program}{Clock\:rate}$$

Cycles per instruction (CPI) is useful.

$$CPI = \frac{CPU\:clock\:cycles\:for\:a\:program}{Instruction count}$$

Transposing:

$$CPU\:time = Instruction\:count\times Cycles\:per\:instruction\times Clock\:cycle\:time$$

Finding the total number of clock cycles:

$$CPU\:clock\:cycles = \sum_{i=1}^{n}IC_{i} \times CPI_{i}$$

This can be used to find the total CPU time:

$$CPU\:time = \left(\sum_{i=1}^{n}IC_{i}\times CPI_{i}\right)\times Clock\:cycle\:time$$

And overall CPI:

$$CPI = \sum_{i=1}^{n}\frac{IC_{i}}{Instruction\:count}\times CPI_{i}$$

Chapter 2 - Memory Hirarchy Design

VIPT - Virtually Indexed Physically Tagged

Using virtual addresses in the cache compicates matters, yet we still want to use them. What we do is use VIPT, where we use the virtual index to index into the cache, but the tag that is stored in the cache is compared against the real tag of address requested. We get the real tag by looking up the virtual address in the translation lookaside buffer.

Cache optimization

Overview of cache optimizations and what they improve:

	Timing	Power	Throughput	Miss Penalty	Miss Rate
L1 Configuration	x	x		x
Way prediction	x	x
Pipelining	x		x
Non-blocking			x	x
Multibanked		x	x
Critical Word First				x
Merging Write Buffer				x
Compiler optimization					x
Prefetching				x	x

L1 Configuration

We can use VIPT and increase assosiativity

Way prediction

We can predict which way the cache block will be found in.

Pipelining

Similar to CPU execution pipeline, we pipeline accesses to the cache. On modern processors, even an L1 hit can take multiple cycles.

Non-blocking

Do not block on miss, but continue execution and hope that the next instruction will be a hit. In which case(and no dependencies) we can continue execution. We only stall if there are no more independent instructions we can execute. We store the outstanding misses in MSHRs, miss status holding registers.

Multibanked

We can spread the cache across multiple banks. That allows them to be accessed independently, and thus increases throuput by supporting multiple loads and stores.

Early Restart & Critical Word First

Early restart means we do not wait for an entire cache block to arrive, but once we get the word we are waiting for, we pass it to the CPU so it can continue execution. Critical Word First will transfer the requested word first, and then send the rest of the block.

Merge Write Buffer

Instead of sending off multiple write requests to consecutive addresses, we merge them into one continous write request. Ie, better to write 3 consecutive bytes with one request, than to send three requests.

Compiler optimization

Here we have loop interchange, loop fusion, loop fission and blocking.

Prefetching

Attempt to predict future memory requests, and try to fetch them ahead of time. It's much easier to prefetch instructions than data. We can have both hardware and software prefetching.

Chapter 3 - Instruction-Level Parallelism

Pipelining, used in every processor since 1985.

Dependencies

Data dependences and Hazards

Also known as Read-after-write (RAW).

An instruction depends on the result of an earlier instruction.

1. A = 3
2. B = A
3. C = B

Line 3 has a data depency to 2, and line 2 has a dependency to line 1. Line 3 has also a depency to line 1.

Anti-dependency

Also known as write-after-read (WAR).

When a instruction requires a value that is later updated. Means that the order of the instructions can not be changed.

1. B = 3
2. A = B + 1
3. B = 7

Since A needs the value of B, and it is later updated, the order of these instructions can not be changed.

Output dependency

Also known as write-after-write (WAW).

The order of the instructions affects the final value of a variable.

1. B = 3
2. A = B + 1
3. B = 7

If the order of line 1 and 3 changes, the output value of B changes. Thus, there's a output dependency there somewhere.

Chapter 4 - Data level paralellism (also includes parts of chapter 1)

The Three Laws

Many exam questions from previous years in this course requires knowledge of three laws: Amdahl's law, Gustafson's law, and Pollack's rule. All re

Amdahl's Law (parallel ver.)

$$ S(n) = \frac{T(1)}{T(n)} = \frac{1}{(1 - p) + \frac{p}{n}} $$

$ S(n) $ is theoretical speedup that can be had by executing a given algorithm on a system capable of executing $ n $ threads of execution. $ T(n) $ is the time the algorithm takes to finish when being executed on $ n $ threads. $ p $ is the fraction of the algorithm that can be parallelized (i.e., is not strictly serial).

The law can also be expressed with what part is strictly serial (b). It is almost just the same:

$$ S(n) = \frac{1}{b + \frac{1 - b}{n}}$$

Gustafson's Law

$$ S(P) = P - \alpha (P - 1) $$

$ S(P) $ is the theoretical speedup that can be had by executing a given algorithm on a system with $ P $ available processors. $ \alpha $ is the non-paralellizable fraction of the process. This is also called scaled speedup.

Pollack's Rule

Really something else, but in this course is generously interpreted to mean:

$$ P(\delta a) \approxeq \sqrt{\delta a} $$

$ P $ is processor performance. $ \delta a $ is the change of the size of the chip in terms of transistors (and typically therefore physical area as well).

Computational intensity

Conputational intensity is the ratio of calculation divided by words moved between different levels of memory.

Roofline model

A roofline model is a model used to provide performance estimates of an application.

Chapter 6 - Warehouse Scale Computing

WSC vs HPC

WSC is more cost conscious and handles more independent workload. We usually exploit task/request level parallelism. HPC has interdependent workloads, which require more communication. HPC usually has longer running jobs and cares about the latency til completion, while WSC cares about throughput(tail latency).

Cloud models

On Premises -> do not use cloud
Intrastructure -> Get a VM
Platform -> Get a runtime environment, only care about application and data.
Software -> Just get a hosted software, do not care about anything.

WSC vs Datacenters

The slides from 2019 says you get

5.7x reduction in storage cost
7.1x reduction in administrative costs
7.3x reduction in networking cost
Probably volume discount on servers
PUE ~1.2, vs 2. PUE (Power usage effectiveness) is the ratio of total energy usage vs energy used by computing equipment.

MapReduce

Common algorithm/method for handeling massive computation/data, which conists of two steps.

Map - Map the data onto different workers. Each worker will then process it's part of it.
Reduce - Collect the results from each worker and reduce it to a single answer.

Consistency - ACID vs BASE

ACID - Adomicity, consistency, isolation, durability

BASE - Basically available, soft state, eventual consistency

Base ensures the data across multiple replicas will be consisten eventually, but not immediately. This meas BASE scales much better, but might not be sutable for credit card transactions?

Organization

A WSC is divided into clusters which consists of server racks. Each rac consists of multiple servers. These are connected by multiple levels of switches.

Accelerators

Amazon, Google and Microsoft all provide support for accelerators in their cloud google has Cloud Tensor Processor Units, while Amazon and MS both have some sort of FPGAs.

Power

Use DVFS (dynamic voltage frequency scaling) to reduce clockspeed and voltage during low loads, and then scale up during peak hours.

Cost

CAPEX - Capital Expences

This is the things you have to pay up front, servers, land, etc...

OPEX - Operating Expences

This is mainly salaries and energy expences.

Appendix F - Interconnects

On chip interconnect

AMBA

AMBA is a bus spesification introduced by ARM for connecting CPU cores and peripherals on system on chip. It has been released in multiple revisions

APB - Advanced peripheral bus (1996)
AHB - Advanced high-performance bus (1999)
AXI - Advanced extensible interface (2003)
ACE - AXI coherency extension (2011)
CHI - Coherent Hub Interface (2013)

The newer revisions will have higher speeds, and higher bandwidth, and to support slower/cheaper pheripherals, there will be bridges between the busses.

Comparison / summary

AMBA 3.0 AXI	AMBA 2.0 AHB
Channel based spesification, with five separate channels for read address, read data, write address, write data, and write response enabling flexibility in implementation.	Explicit bus-based spesification, with single shared address bus and separate read and write data busse
Burst mode, requires transmitting address of only first data item on the bus.	Requires transmitting address of every data item transmitted on the bus.
OO transaction completion provides native support for multiple, outstanding transactions.	Simpler SPLIT transaction scheme provides limited and rudimentary outstanding transaction completion.
Fixed burst mode for memory mapped I/O peripherals.	No fixed burst mode.
Exclusive data access (semaphore operation) support.	No exclusive access support.
Advanced security and cache hint support.	Simple protection and cache hint support
Register slice support for timing isolation	No inherent support for timing isolation
Native low-power clock control interface	No low-power interface
Default bus matrix topology support	Default hierarchical bus topology

AXI Coherence Extensions (ACE)

Provides a five state cache model.

Interconnect Topologies

First and foremost, we can either have switched or dedicated pahts. This is KTN stuff.

Topology parameters

Routing distance - Number of hops between two spesific nodes
Diameter - Maximum routing distance (maximum latency)
Average distance - Indication of the average latency
Bisection bandwith - The minimal cut of links that partition the network in two equal parts, and is an indication of the available bandwidth.
Degree of a router - Number of ports the router has.

Network topologies

	Description	Pros	Cons
Crossbar	Every node is connected to all other nodes	Low latency and Non-blocking	Expensive and does not scale
Ring	All are connected in a ring	Cheap	High latency, not easy to scale and bisection bandwith remains constant
Mesh interconnect		Low cost, good average latency, high bandwidth, easy to layout on chip.	Large diameter (O(sqrt(N))
2D Torus
3D Torus		High bandwith and low diameter
Hypercube
Multistage Networks		Low latency	Can block
Tree

Routing, Arbitration and Switching

We need a mechanism to rute packets/connections. We also need to arbitrate access to the connections. In regular topologies, simply using +-1 for up/down, left/right will suffice. We can also do Dimension-order routing (DOR), where we first rout in one dimension, and then move to the next. Think 2D/3D torus.

Deadlock

For a deadlock to occur, we need a shared resource which is intrementally allocated, and is ot preemptible. We can handle deadlocks either by preventing them, or by recovering when it happens.

Flow control

We can extend the connection with stop/go csignals which are sent when the receiver detects the buffer filling up. It can then notify when the buffer has been sufficiently emptied.

Paper - The Future of Multiprocessors, Borkar et al., Communications of the ACM, 2011

Modern computers are Chip Multiprocessor systems. There are several ways to boost the performance of these systems, but they often have trade-offs as side effects.

Factors to consider when designing CMPs

Processor organization

Whether powerful out-oforder issue processors, or smaller, more numerous inorder processors provide superior throughput.

Cache hierarchy

The amount of cache memory per processor that results in maximal throughput. The ideal capacity is a function of processor organization, memory latency, and available off-chip bandwidth.

Off-chip bandwidth

Finite bandwidth limits the number of cores that can be placed on a chip, forcing more area to be devoted to on-chip caches to reduce bandwidth demands.

Application characteristics

Applications with different access patterns require different CMP designs to attain the best throughput. Different applications display varying sensitivities to L2 cache capacity, resulting in widely varying bandwidth demands.

Transistors are shrinking

The transistors in modern computers are shrinking. In the paper, transistor sizes down towards 35nm are considered, and the advantages and emerging problems related to this development are discussed. For the most part, small transistors are simply a gateway to more available area when designing processors. It allows for more cores, larger cores, more cache and larger internal buses. There are however other problems that become more prominenet as you utilize these advantages.

Transistors per IO pin

Processors have a rapidly growing number of transistors, and a IO pin count that is almost stationary. This result in a system where we have more cores fighting over the same amount of IO. This is most critical with regards to memory access, as we can't add more memory channels without the physical pins. Many applications will be unable to utilize the higher core count when the memory hiarchy is unable to follow the development.

Less cache per core

With more area available, we have space for more cache on the chip. This is however a tradeoff, as each transistor used for cache is a transistor not used for more cores. It is favorable to have as many cores as possible to maximize throughput. This is valid only to the point where the extra cores are left waiting for memory too much of the time to be effective. The result is that with larger memory available, we still have to constantly search for the optimal amount of caching. The result is that relative to the amount of memory available, we have less L1 cache with smaller transistors.

The future of microprocessors

The current growth in transistor speed cannot continue as it have the last 20 years. We have to start looking for ways to increase performance without depending on faster transistors.

Key insights

Moore’s Law continues but demands radical changes in architecture and software.
Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware.
Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.

Solutions

Parallelism

The current solution to the clock frequency wall is to apply parallelism to processors. Most processors today have multiple cores running at the same frequency as older single core processors. They are not as powerfull on their own as a modern sigle core processor would have been, but many applications are able to utilize this parallelism is such a way that they benefit from their combined power, resulting in a more powerful computer.

Heterogeneity

As the number of cores in our processors rise, new problems start to arise. The most critical of these are related to the way memory access gets slower and slower as more cores fight about it. We can solve this by finding an alternative to higher core count, and heterogeneity is one such alternative. The consept is to instead of a large count of cores that are equally good at all tasks, we can have a smaller number of cores speccializing in different operations. In doing so, we can have programs run their operations on the cores that are able to run the code best, resulting in a faster computer.

Energy efficiency

Since Intel Pentium 4, we have realized that running CPUs at blazing speeds create too much heat to be sustainable. We can't keep making smaller and smaller transistors if they still produce as much energy as they do today, as it simply gets too hot. There are many solutions to this in existing processors, but we need to keep being creative and finding new ways to make processors energy efficient if we are gonna keep creating more powerfull processors.

Vilje

Vilje is the supercomputer at NTNU, and a bragging paper is in the curriculum.

Main points:

SGI Altix ICE X system
eight core Intel Sandy Bridge processor
two chips at each node
- 16 physical cores
- 32 logical cores
- 32 GB node memory
23040 cores
479.23 TFLOPS peak
8D enhanced hypercube
currently 82nd at top500.org

Cooperation between met.no, NTNU and Sintef. Runs the weather forecast for Norway and Sweden.

Redefining the Role of the CPU in the Era of CPU-GPU Integration

This research studied modern architectures where some CPU instructions are offloaded to the GPU. They found that the CPU performed differently, mainly in the form of difficulties to exploit both ILP and load/store prefetching, more use of the branch predictor, and a lesser focus on vector instructions and utilisation of multi cores. Future CPU-GPU architectures need to redesign the CPU design, as it is still critical to maximize perfromance.

Written by

chrisdef sigveseb jl1456 sindrsb jesper2k Ose runholm lepaulse gombos

Last updated: Mon, 3 Jun 2024 12:43:09 +0200 .