TDT4260: Computer Architecture
Computer Architecture - A Quantitative Approach, 5th /6th Edition, Hennessy and Patterson
- Appendix B.1-B.7
- Appendix C.1-C.5
- Chapter 1.1-1.12
- Chapter 2.1-2.8
- Chapter 3.1-3.9, 3.11-3.14, 3.15/3.10 (5th / 6th edition)
- Chapter 4.1-4.9
- Chapter 5.1-5.7, 5.9
- Chapter 6.1-6.6
- Appendix F.1-F.6
- Ten Simple Rules for Structuring Papers, Brett Mensh and Konrad Kording
- The Future of Microprocessors, Shekahr Borkar and Andrew A. Chien
- Digital Leads the Pack with 21164 - First of Next-Generation RISCs Extends Alpha’s Performance Lead - Linley Gwennap
Chapter 1 - Introduction
Moore's Law, explosive growth in computer power, yada yada yada.
Power dissipation problems have arrived since 2002, now multiple cores is the shit.
Dennard scaling states that as transistors get smaller, the power density stays constant. This basically means that cores can go faster because of more transistors, and still have the same power consuption.
Defining Computer Architecture
Originally, Computer Architecture design was only the instruction set, but now a lot more is needed to design a computer.
MIPS is used as example ISA in the book.
There are two components to a computer, organization and hardware.
The organization includes the high-level aspects of a computer's design, for example memory system and interconnect and the design of the internal CPU.
The hardware refers to more specifics of a computer, the logic design and packaging.
Computer Architecture is defined in the book as all three aspects of a computer, ISA, organization and hardware.
To design a computer, you need to meet functional requirements as well as price, power, performance and availabilty goals. And as well you will often need to define what the functional requirements are.
"X is n times faster than Y":
According to the book, the only consistent and reliable measure of performance is the execution time. Time is however not defined unambigiously. The most straightforward definition is wall-clock time (or response time or elapsed time), which his the time you would get timing a task with a stopwatch. CPU time looks at how long time the CPU works on the task, not waiting for I/O or running other programs.
Quantitative Principles of Computer Design
Take advantage of Parallelism
Really important, at every level.
Multiple processors and disks, improved throughput. Ability to expand memory and number of CPUs is called scalability.
Exploit parallelism among instructions. Easiest way is through pipelining, i.e. overlap instruction execution to reduce the total time to complete an instruction sequence. Not every instruction depends on its immediate predecessor.
Principle of Locality
Programs tend to reuse data and instructions it has used recently. An implication of locality is that we can predict with reasonable accuracy what instructions and data the program will use in near future, based on recent accesses.
Focus on the Common Case
If you can make something better, it should be the most common thing, as that will occur more often. This is kind of intuitive, dumbasss. Amdahl's Law can be used to quantify this.
Amdahl's Law defines the speedup that can be gained by using a particular feature, which is an enhancement that will improve performance when it is used.
Speedup tells us how much faster a task will run using the computer with the enhancement.
Suppose that we want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overal speedup gaine by incorporating the enhancement?
The Processor Performance Equation
CPU time can be expressed as:
Cycles per instruction (CPI) is useful.
Finding the total number of clock cycles:
This can be used to find the total CPU time:
And overall CPI:
Chapter 2 - Memory Hirarchy Design
VIPT - Virtually Indexed Physically Tagged
Using virtual addresses in the cache compicates matters, yet we still want to use them. What we do is use VIPT, where we use the virtual index to index into the cache, but the tag that is stored in the cache is compared against the real tag of address requested. We get the real tag by looking up the virtual address in the translation lookaside buffer.
Overview of cache optimizations and what they improve:
|Timing||Power||Throughput||Miss Penalty||Miss Rate|
|Critical Word First||x|
|Merging Write Buffer||x|
We can use VIPT and increase assosiativity
We can predict which way the cache block will be found in.
Similar to CPU execution pipeline, we pipeline accesses to the cache. On modern processors, even an L1 hit can take multiple cycles.
Do not block on miss, but continue execution and hope that the next instruction will be a hit. In which case(and no dependencies) we can continue execution. We only stall if there are no more independent instructions we can execute. We store the outstanding misses in MSHRs, miss status holding registers.
We can spread the cache across multiple banks. That allows them to be accessed independently, and thus increases throuput by supporting multiple loads and stores.
Early Restart & Critical Word First
Early restart means we do not wait for an entire cache block to arrive, but once we get the word we are waiting for, we pass it to the CPU so it can continue execution. Critical Word First will transfer the requested word first, and then send the rest of the block.
Merge Write Buffer
Instead of sending off multiple write requests to consecutive addresses, we merge them into one continous write request. Ie, better to write 3 consecutive bytes with one request, than to send three requests.
Here we have loop interchange, loop fusion, loop fission and blocking.
Attempt to predict future memory requests, and try to fetch them ahead of time. It's much easier to prefetch instructions than data. We can have both hardware and software prefetching.
Chapter 3 - Instruction-Level Parallelism
Pipelining, used in every processor since 1985.
Data dependences and Hazards
Also known as Read-after-write (RAW).
An instruction depends on the result of an earlier instruction.
1. A = 3 2. B = A 3. C = B
Line 3 has a data depency to 2, and line 2 has a dependency to line 1. Line 3 has also a depency to line 1.
Also known as write-after-read (WAR).
When a instruction requires a value that is later updated. Means that the order of the instructions can not be changed.
1. B = 3 2. A = B + 1 3. B = 7
Since A needs the value of B, and it is later updated, the order of these instructions can not be changed.
Also known as write-after-write (WAW).
The order of the instructions affects the final value of a variable.
1. B = 3 2. A = B + 1 3. B = 7
If the order of line 1 and 3 changes, the output value of B changes. Thus, there's a output dependency there somewhere.
Chapter 4 - Data level paralellism (also includes parts of chapter 1)
The Three Laws
Many exam questions from previous years in this course requires knowledge of three laws: Amdahl's law, Gustafson's law, and Pollack's rule. All re
The law can also be expressed with what part is strictly serial (b). It is almost just the same:
Really something else, but in this course is generously interpreted to mean:
Conputational intensity is the ratio of calculation divided by words moved between different levels of memory.
A roofline model is a model used to provide performance estimates of an application.
Chapter 6 - Warehouse Scale Computing
WSC vs HPC
WSC is more cost conscious and handles more independent workload. We usually exploit task/request level parallelism. HPC has interdependent workloads, which require more communication. HPC usually has longer running jobs and cares about the latency til completion, while WSC cares about throughput(tail latency).
- On Premises -> do not use cloud
- Intrastructure -> Get a VM
- Platform -> Get a runtime environment, only care about application and data.
- Software -> Just get a hosted software, do not care about anything.
WSC vs Datacenters
The slides from 2019 says you get
- 5.7x reduction in storage cost
- 7.1x reduction in administrative costs
- 7.3x reduction in networking cost
- Probably volume discount on servers
- PUE ~1.2, vs 2. PUE (Power usage effectiveness) is the ratio of total energy usage vs energy used by computing equipment.
Common algorithm/method for handeling massive computation/data, which conists of two steps.
- Map - Map the data onto different workers. Each worker will then process it's part of it.
- Reduce - Collect the results from each worker and reduce it to a single answer.
Consistency - ACID vs BASE
ACID - Adomicity, consistency, isolation, durability
BASE - Basically available, soft state, eventual consistency
Base ensures the data across multiple replicas will be consisten eventually, but not immediately. This meas BASE scales much better, but might not be sutable for credit card transactions?
A WSC is divided into clusters which consists of server racks. Each rac consists of multiple servers. These are connected by multiple levels of switches.
Amazon, Google and Microsoft all provide support for accelerators in their cloud google has Cloud Tensor Processor Units, while Amazon and MS both have some sort of FPGAs.
Use DVFS (dynamic voltage frequency scaling) to reduce clockspeed and voltage during low loads, and then scale up during peak hours.
CAPEX - Capital Expences
This is the things you have to pay up front, servers, land, etc...
OPEX - Operating Expences
This is mainly salaries and energy expences.
Appendix F - Interconnects
On chip interconnect
AMBA is a bus spesification introduced by ARM for connecting CPU cores and peripherals on system on chip. It has been released in multiple revisions
- APB - Advanced peripheral bus (1996)
- AHB - Advanced high-performance bus (1999)
- AXI - Advanced extensible interface (2003)
- ACE - AXI coherency extension (2011)
- CHI - Coherent Hub Interface (2013)
The newer revisions will have higher speeds, and higher bandwidth, and to support slower/cheaper pheripherals, there will be bridges between the busses.
Comparison / summary
|AMBA 3.0 AXI||AMBA 2.0 AHB|
|Channel based spesification, with five separate channels for read address, read data, write address, write data, and write response enabling flexibility in implementation.||Explicit bus-based spesification, with single shared address bus and separate read and write data busse|
|Burst mode, requires transmitting address of only first data item on the bus.||Requires transmitting address of every data item transmitted on the bus.|
|OO transaction completion provides native support for multiple, outstanding transactions.||Simpler SPLIT transaction scheme provides limited and rudimentary outstanding transaction completion.|
|Fixed burst mode for memory mapped I/O peripherals.||No fixed burst mode.|
|Exclusive data access (semaphore operation) support.||No exclusive access support.|
|Advanced security and cache hint support.||Simple protection and cache hint support|
|Register slice support for timing isolation||No inherent support for timing isolation|
|Native low-power clock control interface||No low-power interface|
|Default bus matrix topology support||Default hierarchical bus topology|
AXI Coherence Extensions (ACE)
Provides a five state cache model.
First and foremost, we can either have switched or dedicated pahts. This is KTN stuff.
- Routing distance - Number of hops between two spesific nodes
- Diameter - Maximum routing distance (maximum latency)
- Average distance - Indication of the average latency
- Bisection bandwith - The minimal cut of links that partition the network in two equal parts, and is an indication of the available bandwidth.
- Degree of a router - Number of ports the router has.
|Crossbar||Every node is connected to all other nodes||Low latency and Non-blocking||Expensive and does not scale|
|Ring||All are connected in a ring||Cheap||High latency, not easy to scale and bisection bandwith remains constant|
|Mesh interconnect||Low cost, good average latency, high bandwidth, easy to layout on chip.||Large diameter (O(sqrt(N))|
|3D Torus||High bandwith and low diameter|
|Hypercube||Copy and connect nodes for each dimention (0D has one node)||Low diameter and high bisection bandwidth||Costly|
|Multistage Networks||Low latency||Can block|
Routing, Arbitration and Switching
We need a mechanism to rute packets/connections. We also need to arbitrate access to the connections. In regular topologies, simply using +-1 for up/down, left/right will suffice. We can also do Dimension-order routing (DOR), where we first rout in one dimension, and then move to the next. Think 2D/3D torus.
For a deadlock to occur, we need a shared resource which is intrementally allocated, and is ot preemptible. We can handle deadlocks either by preventing them, or by recovering when it happens.
We can extend the connection with stop/go csignals which are sent when the receiver detects the buffer filling up. It can then notify when the buffer has been sufficiently emptied.
Paper - The Future of Multiprocessors, Borkar et al., Communications of the ACM, 2011
Modern computers are Chip Multiprocessor systems. There are several ways to boost the performance of these systems, but they often have trade-offs as side effects.
Factors to consider when designing CMPs
Whether powerful out-oforder issue processors, or smaller, more numerous inorder processors provide superior throughput.
The amount of cache memory per processor that results in maximal throughput. The ideal capacity is a function of processor organization, memory latency, and available off-chip bandwidth.
Finite bandwidth limits the number of cores that can be placed on a chip, forcing more area to be devoted to on-chip caches to reduce bandwidth demands.
Applications with different access patterns require different CMP designs to attain the best throughput. Different applications display varying sensitivities to L2 cache capacity, resulting in widely varying bandwidth demands.
Transistors are shrinking
The transistors in modern computers are shrinking. In the paper, transistor sizes down towards 35nm are considered, and the advantages and emerging problems related to this development are discussed. For the most part, small transistors are simply a gateway to more available area when designing processors. It allows for more cores, larger cores, more cache and larger internal buses. There are however other problems that become more prominenet as you utilize these advantages.
Transistors per IO pin
Processors have a rapidly growing number of transistors, and a IO pin count that is almost stationary. This result in a system where we have more cores fighting over the same amount of IO. This is most critical with regards to memory access, as we can't add more memory channels without the physical pins. Many applications will be unable to utilize the higher core count when the memory hiarchy is unable to follow the development.
Less cache per core
With more area available, we have space for more cache on the chip. This is however a tradeoff, as each transistor used for cache is a transistor not used for more cores. It is favorable to have as many cores as possible to maximize throughput. This is valid only to the point where the extra cores are left waiting for memory too much of the time to be effective. The result is that with larger memory available, we still have to constantly search for the optimal amount of caching. The result is that relative to the amount of memory available, we have less L1 cache with smaller transistors.
The future of microprocessors
The current growth in transistor speed cannot continue as it have the last 20 years. We have to start looking for ways to increase performance without depending on faster transistors.
Moore’s Law continues but demands radical changes in architecture and software.
Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware.
Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.
The current solution to the clock frequency wall is to apply parallelism to processors. Most processors today have multiple cores running at the same frequency as older single core processors. They are not as powerfull on their own as a modern sigle core processor would have been, but many applications are able to utilize this parallelism is such a way that they benefit from their combined power, resulting in a more powerful computer.
As the number of cores in our processors rise, new problems start to arise. The most critical of these are related to the way memory access gets slower and slower as more cores fight about it. We can solve this by finding an alternative to higher core count, and heterogeneity is one such alternative. The consept is to instead of a large count of cores that are equally good at all tasks, we can have a smaller number of cores speccializing in different operations. In doing so, we can have programs run their operations on the cores that are able to run the code best, resulting in a faster computer.
Since Intel Pentium 4, we have realized that running CPUs at blazing speeds create too much heat to be sustainable. We can't keep making smaller and smaller transistors if they still produce as much energy as they do today, as it simply gets too hot. There are many solutions to this in existing processors, but we need to keep being creative and finding new ways to make processors energy efficient if we are gonna keep creating more powerfull processors.
Vilje is the supercomputer at NTNU, and a bragging paper is in the curriculum.
- SGI Altix ICE X system
- eight core Intel Sandy Bridge processor
- two chips at each node
- 16 physical cores
- 32 logical cores
- 32 GB node memory
- 23040 cores
- 479.23 TFLOPS peak
- 8D enhanced hypercube
- currently 82nd at top500.org
Cooperation between met.no, NTNU and Sintef. Runs the weather forecast for Norway and Sweden.
Redefining the Role of the CPU in the Era of CPU-GPU Integration
This research studied modern architectures where some CPU instructions are offloaded to the GPU. They found that the CPU performed differently, mainly in the form of difficulties to exploit both ILP and load/store prefetching, more use of the branch predictor, and a lesser focus on vector instructions and utilisation of multi cores. Future CPU-GPU architectures need to redesign the CPU design, as it is still critical to maximize perfromance.