TDT4260: Computer Architecture
Computer Architecture - A Quantitative Approach, 5th Edition, Hennessy and Patterson
- Chapter 1.1-1.6, 1.8-1.9, 1.11-1.12
- Appendix B.1-B.4, pages B.49-50, B.6-B.7 (partly overlapping with Chapter 2)
- Chapter 2.1-2.5, 2.7-2.8
- Chapter 3.1-3.12, 3.15
- Chapter 4.1-4.5, 4.8-4.9
- Chapter 5.1-5.6, 5.9-5.10
- Chapter 6.1-6.6
- Appendix F.1-F.6
- Exploring the Design Space of Future CMPs, by Jaehyuk Huh, Stephen W. Keckler, and Doug Burger, PACT 2001
- The Future of Microprocessors, by Shekhar Borkar and Andrew A. Chien, Communications of the ACM, 2011
- Dark silicon and the end of multicore scaling, by Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, and Doug Burger, ISCA, 2011
- Redefining the Role of the CPU in the Era of CPU-GPU Integration, by Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott Baden, and Dean Tullsen, MICRO 2012
Chapter 1 - Introduction
Moore's Law, explosive growth in computer power, and so on.
Power dissipation problems have limited frequency scaling since 2002, so multiple cores are now the way forward.
Defining Computer Architecture
Originally, Computer Architecture design was only the instruction set, but now a lot more is needed to design a computer.
MIPS is used as example ISA in the book.
There are two components to a computer, organization and hardware.
The organization includes the high-level aspects of a computer's design, for example the memory system, the interconnect, and the design of the internal CPU.
The hardware refers to the specifics of a computer, such as the detailed logic design and the packaging technology.
Computer Architecture is defined in the book as all three aspects of a computer, ISA, organization and hardware.
To design a computer, you need to meet functional requirements as well as price, power, performance, and availability goals. Often you will also need to define what the functional requirements are.
"X is n times faster than Y":
According to the book, the only consistent and reliable measure of performance is execution time. Time is however not defined unambiguously. The most straightforward definition is wall-clock time (or response time or elapsed time), which is the time you would get timing a task with a stopwatch. CPU time instead counts only how long the CPU works on the task, not time spent waiting for I/O or running other programs.
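The distinction shows up directly in Python's standard library: time.perf_counter() measures wall-clock (elapsed) time, while time.process_time() counts only the CPU time of the current process, so time spent sleeping (waiting) appears in the first but not the second. A small sketch (the sleep duration and workload are arbitrary):

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)          # waiting, not computing: only wall-clock time advances
_ = sum(range(200_000))  # actual computation: both clocks advance

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

# wall_elapsed is at least 0.2 s; cpu_elapsed is only the few ms of summing.
print(f"wall-clock: {wall_elapsed:.3f}s, CPU: {cpu_elapsed:.3f}s")
```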
Quantitative Principles of Computer Design
Take advantage of Parallelism
Really important, at every level.
Multiple processors and disks improve throughput. The ability to expand memory and the number of CPUs is called scalability.
Exploit parallelism among instructions. The easiest way is through pipelining, i.e. overlapping instruction execution to reduce the total time to complete an instruction sequence. This works because not every instruction depends on its immediate predecessor.
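The gain from overlapping can be quantified with the classic idealized pipeline model: a k-stage pipeline needs k + (n - 1) cycles for n instructions instead of n * k. A small Python sketch (the counts 1000 and 5 are illustrative, and hazards/stalls are ignored):

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs start to finish before the next begins."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    """Execution overlaps: after the pipeline fills (n_stages cycles),
    one instruction completes every cycle."""
    return n_stages + (n_instructions - 1)

n, k = 1000, 5
print(unpipelined_cycles(n, k))  # 5000
print(pipelined_cycles(n, k))    # 1004 - close to a 5x speedup
```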
Principle of Locality
Programs tend to reuse data and instructions they have used recently. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future, based on its recent accesses.
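Temporal locality can be illustrated with a small Python simulation of an LRU cache over two access traces (the trace sizes and cache capacity are made up): a loop that reuses a few addresses gets a very high hit rate, while a linear scan over distinct addresses gets none.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Simulate a tiny LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)          # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)    # evict least recently used
    return hits / len(trace)

# A loop touching the same 8 addresses 100 times has strong temporal locality...
looping = list(range(8)) * 100
# ...while a linear scan over 800 distinct addresses has none.
streaming = list(range(800))

print(lru_hit_rate(looping, capacity=16))    # 0.99 (only 8 cold misses)
print(lru_hit_rate(streaming, capacity=16))  # 0.0 (every access misses)
```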
Focus on the Common Case
Improvements should target the most common case, since it occurs most often and so contributes most to the total. This is fairly intuitive. Amdahl's Law can be used to quantify it.
Amdahl's Law defines the speedup that can be gained by using a particular feature (an enhancement that improves performance when it is used).
Speedup tells us how much faster a task will run using the computer with the enhancement:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Suppose that we want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement? Answer: Speedup = 1 / (0.6 + 0.4/10) = 1/0.64 ≈ 1.56.
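The example can be checked with a small Python sketch of Amdahl's Law (the 40% / 10x figures come from the example above):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Web-serving example: computation (40% of the time) becomes 10x faster.
overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10)
print(overall)  # ≈ 1.5625 - I/O dominates, so the 10x enhancement gives ~1.56x
```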
The Processor Performance Equation
CPU time can be expressed as:
CPU time = CPU clock cycles for a program × Clock cycle time
Cycles per instruction (CPI) is a useful metric:
CPI = CPU clock cycles for a program / Instruction count
When instruction classes differ in CPI, the total number of clock cycles is:
CPU clock cycles = sum over i of (IC_i × CPI_i)
This can be used to find the total CPU time:
CPU time = (sum over i of IC_i × CPI_i) × Clock cycle time
And the overall CPI:
CPI = (sum over i of IC_i × CPI_i) / Instruction count
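A small Python sketch of these equations, using a made-up instruction mix and clock rate:

```python
# Hypothetical instruction mix: (instruction count, CPI) per class.
# All numbers are illustrative, not from the book.
instruction_mix = [
    (50_000, 1.0),  # ALU ops
    (30_000, 2.0),  # loads/stores
    (20_000, 3.0),  # branches
]
CLOCK_CYCLE_TIME = 0.5e-9  # seconds per cycle (an assumed 2 GHz clock)

total_cycles = sum(ic * cpi for ic, cpi in instruction_mix)
total_instructions = sum(ic for ic, _ in instruction_mix)
overall_cpi = total_cycles / total_instructions
cpu_time = total_cycles * CLOCK_CYCLE_TIME

print(overall_cpi)  # (50000*1 + 30000*2 + 20000*3) / 100000 = 1.7
print(cpu_time)     # 170000 cycles * 0.5 ns = 85 microseconds
```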
Chapter 2 - Instruction-Level Parallelism
Pipelining, used in every processor since 1985.
Paper - Exploring the Design Space of Future CMPs, Huh, Keckler and Burger, PACT 2001
Modern computers are Chip Multiprocessor systems. There are several ways to boost the performance of these systems, but they often have trade-offs as side effects.
Factors to consider when designing CMPs
- Whether powerful out-of-order issue processors or smaller, more numerous in-order processors provide superior throughput.
- The amount of cache memory per processor that results in maximal throughput. The ideal capacity is a function of processor organization, memory latency, and available off-chip bandwidth.
- Finite bandwidth limits the number of cores that can be placed on a chip, forcing more area to be devoted to on-chip caches to reduce bandwidth demands.
- Applications with different access patterns require different CMP designs to attain the best throughput. Different applications display varying sensitivities to L2 cache capacity, resulting in widely varying bandwidth demands.
Transistors are shrinking
The transistors in modern computers are shrinking. In the paper, transistor sizes down towards 35nm are considered, and the advantages and emerging problems related to this development are discussed. For the most part, smaller transistors are simply a gateway to more available area when designing processors: they allow for more cores, larger cores, more cache and wider internal buses. There are however other problems that become more prominent as you exploit these advantages.
Transistors per IO pin
Processors have a rapidly growing number of transistors but an almost stationary I/O pin count. This results in more cores competing for the same amount of I/O. It is most critical for memory access, since we cannot add more memory channels without the physical pins. Many applications will be unable to utilize the higher core count when the memory hierarchy cannot keep up with the development.
Less cache per core
With more area available, there is space for more cache on the chip. This is however a trade-off, as each transistor used for cache is a transistor not used for another core. Maximizing throughput favors having as many cores as possible, but only up to the point where the extra cores spend too much time waiting for memory to be effective. So even as the available area grows, we constantly have to search for the optimal amount of caching, and relative to the number of cores, we end up with less L1 cache per core as transistors shrink.
The future of microprocessors
The growth in transistor speed cannot continue as it has for the last 20 years. We have to start looking for ways to increase performance without depending on faster transistors.
Moore’s Law continues but demands radical changes in architecture and software.
Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware.
Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.
The current solution to the clock frequency wall is to apply parallelism to processors. Most processors today have multiple cores running at the same frequency as older single-core processors. Each core is not as powerful on its own as a modern single-core processor would have been, but many applications are able to utilize this parallelism in such a way that they benefit from the combined power, resulting in a more powerful computer.
As the number of cores in our processors rises, new problems start to arise. The most critical of these is that memory access gets slower and slower as more cores compete for it. We can address this by finding an alternative to a higher core count, and heterogeneity is one such alternative. The concept is that instead of a large number of cores that are equally good at all tasks, we can have a smaller number of cores specializing in different operations. Programs can then run their operations on the cores best suited for the code, resulting in a faster computer.
Since the Intel Pentium 4, we have realized that running CPUs at blazing speeds creates too much heat to be sustainable. We cannot keep making smaller and smaller transistors if they still dissipate as much energy as they do today, as the chip simply gets too hot. Existing processors already employ many solutions to this, but we need to keep finding new ways to make processors energy efficient if we are going to keep building more powerful processors.
The Three Laws
Many exam questions from previous years in this course require knowledge of three laws: Amdahl's law, Gustafson's law, and Pollack's rule.
Gustafson's law can also be expressed with the strictly serial fraction of the work (b), which gives almost the same form: Speedup(N) = N - b × (N - 1).
Pollack's rule is really something else, but in this course it is generously interpreted to mean that single-core performance grows roughly with the square root of the increase in complexity (area): Performance ∝ sqrt(area).
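The three laws can be compared with a Python sketch (the serial fraction, core count, and area ratio below are illustrative numbers, not from the course material):

```python
import math

def amdahl(serial_fraction, n_cores):
    """Fixed problem size: speedup is capped by the serial part."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

def gustafson(serial_fraction, n_cores):
    """Scaled problem size: Speedup(N) = N - b * (N - 1)."""
    return n_cores - serial_fraction * (n_cores - 1)

def pollack(area_ratio):
    """Single-core performance grows roughly with sqrt of the area increase."""
    return math.sqrt(area_ratio)

b, n = 0.1, 64
print(amdahl(b, n))     # ~8.8: a 10% serial part caps the gain on 64 cores
print(gustafson(b, n))  # ~57.7: a scaled workload keeps the cores busy
print(pollack(4))       # 2.0: 4x the core area buys only ~2x the speed
```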
Vilje is the supercomputer at NTNU, and a bragging paper is in the curriculum.
- SGI Altix ICE X system
- eight-core Intel Sandy Bridge processors
- two chips at each node
- 16 physical cores
- 32 logical cores
- 32 GB node memory
- 23040 cores
- 479.23 TFLOPS peak
- 8D enhanced hypercube
- currently 82nd at top500.org
A cooperation between met.no, NTNU and SINTEF. It runs the weather forecast for Norway and Sweden.
Redefining the Role of the CPU in the Era of CPU-GPU Integration
This research studied modern architectures where some CPU work is offloaded to the GPU. They found that the remaining CPU workload changes character: it becomes harder to exploit ILP and load/store prefetching, the branch predictor is used more heavily, and there is less emphasis on vector instructions and multi-core utilisation. Future CPU-GPU architectures need to rethink the CPU design, as the CPU is still critical to maximizing performance.
Data dependency
Also known as read-after-write (RAW).
An instruction depends on the result of an earlier instruction.
1. A = 3
2. B = A
3. C = B
Line 3 has a data dependency on line 2, and line 2 has a dependency on line 1. Line 3 therefore also has a dependency on line 1.
Anti dependency
Also known as write-after-read (WAR).
An instruction requires a value that a later instruction overwrites, which means the order of the instructions cannot be changed.
1. B = 3
2. A = B + 1
3. B = 7
Since line 2 needs the value of B, and line 3 later updates it, the order of these instructions cannot be changed.
Output dependency
Also known as write-after-write (WAW).
The order of the instructions affects the final value of a variable.
1. B = 3
2. A = B + 1
3. B = 7
If the order of lines 1 and 3 changes, the final value of B changes. Thus there is an output dependency between lines 1 and 3.
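The examples above can be mimicked in Python by modeling each instruction as a function that updates a register file; reordering instructions that have a WAW (and WAR) dependence changes the result:

```python
# Model each "instruction" as a function updating a register file (a dict).
# i1/i2/i3 mirror the three-line example above.
def i1(regs): regs["B"] = 3               # line 1: B = 3
def i2(regs): regs["A"] = regs["B"] + 1   # line 2: A = B + 1 (RAW on B, WAR with line 3)
def i3(regs): regs["B"] = 7               # line 3: B = 7 (WAW with line 1)

def run(program):
    regs = {}
    for instruction in program:
        instruction(regs)
    return regs

print(run([i1, i2, i3]))  # {'B': 7, 'A': 4} - the intended result
# Swapping lines 1 and 3 violates the WAW dependence on B (and the WAR
# dependence between lines 2 and 3), so both final values change:
print(run([i3, i2, i1]))  # {'B': 3, 'A': 8}
```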