TDT4260: Computer Architecture

# Curriculum 2016

Computer Architecture - A Quantitative Approach, 5th Edition, Hennessy and Patterson:

- Chapter 1.1-1.6, 1.8-1.9, 1.11-1.12
- Appendix B.1-B.4, page B.49-50, B.6-B.7 (partly overlapping with Chapter 2)
- Chapter 2.1-2.5, 2.7-2.8
- Chapter 3.1-3.12, 3.15
- Chapter 4.1-4.5, 4.8-4.9
- Chapter 5.1-5.6, 5.9-5.10
- Chapter 6.1-6.6
- Appendix F.1-F.6

Papers:

- Exploring the Design Space of Future CMPs, by Jaehyuk Huh, Stephen W. Keckler, and Doug Burger, PACT 2001
- The Future of Microprocessors, by Shekhar Borkar and Andrew A. Chien, Communications of the ACM, 2011
- Dark Silicon and the End of Multicore Scaling, by Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, and Doug Burger, ISCA 2011
- Redefining the Role of the CPU in the Era of CPU-GPU Integration, by Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott Baden, and Dean Tullsen, MICRO 2012

# Chapter 1 - Introduction

Moore's Law, explosive growth in computer power, yada yada yada. Power dissipation problems have arrived since 2002, now multiple cores is the shit.

## Defining Computer Architecture

Originally, computer architecture design was only the instruction set, but now a lot more is needed to design a computer. MIPS is used as the example ISA in the book. Beyond the ISA, there are two components to the implementation: _organization_ and _hardware_. The organization includes the high-level aspects of a computer's design, for example the memory system, the interconnect and the design of the internal CPU. The hardware refers to the specifics of a computer: the detailed logic design and the packaging. The book defines computer architecture as all three aspects of a computer: ISA, organization _and_ hardware.

To design a computer, you need to meet functional requirements as well as price, power, performance and availability goals. Often you will also need to define what the functional requirements are.

## Performance

"X is _n_ times faster than Y":

$$ n=\frac{Execution\:time_{Y}}{Execution\:time_{X}}=\frac{\frac{1}{Performance_{Y}}}{\frac{1}{Performance_{X}}}=\frac{Performance_{X}}{Performance_{Y}}$$

According to the book, the only consistent and reliable measure of performance is execution time. Time is however not defined unambiguously. The most straightforward definition is _wall-clock time_ (or _response time_ or _elapsed time_), which is the time you would get timing a task with a stopwatch. _CPU time_ only counts the time the CPU spends working on the task, not time spent waiting for I/O or running other programs.

## Quantitative Principles of Computer Design

### Take advantage of Parallelism

Really important, at every level.

#### System level

Multiple processors and disks, improved throughput. The ability to expand memory and the number of CPUs is called _scalability_.

#### Processor level

Exploit parallelism among instructions. The easiest way is through pipelining, i.e. overlapping instruction execution to reduce the total time to complete an instruction sequence. This works because not every instruction depends on its immediate predecessor.

### Principle of Locality

Programs tend to reuse data and instructions they have used recently. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future, based on its recent accesses.
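As a minimal sketch (my own example, not from the book), the two loop nests below read the same matrix but differ in spatial locality, since C lays out arrays row-major:

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Row-wise traversal: consecutive iterations read consecutive
       addresses, so a whole cache line is used before moving on
       (good spatial locality). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-wise traversal: consecutive iterations read addresses
       N * sizeof(double) bytes apart, so each access may touch a new
       cache line (poor spatial locality, typically much slower). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```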
### Focus on the Common Case

If you can make something better, it should be the most common thing, as that will occur more often. This is kind of intuitive, dumbass. _Amdahl's Law_ can be used to quantify it.

### Amdahl's Law

Amdahl's Law defines the _speedup_ that can be gained by using a particular feature, i.e. an enhancement that improves performance when it is used.

$$Speedup=\frac{Performance\:for\:entire\:task\:using\:the\:enhancement\:when\:possible}{Performance\:for\:entire\:task\:without\:using\:the\:enhancement}$$

Alternatively,

$$Speedup=\frac{Execution\:time\:for\:entire\:task\:without\:using\:the\:enhancement}{Execution\:time\:for\:entire\:task\:using\:the\:enhancement\:when\:possible}$$

Speedup tells us how much faster a task will run using the computer with the enhancement.

$$Execution\:time_{new} = Execution\:time_{old}\times \left((1-Fraction_{enhanced})+\frac{Fraction_{enhanced}}{Speedup_{enhanced}}\right)$$

$$Speedup_{overall} = \frac{Execution\:time_{old}}{Execution\:time_{new}} = \frac{1}{(1-Fraction_{enhanced})+\frac{Fraction_{enhanced}}{Speedup_{enhanced}}}$$

#### EXAMPLE!!!!

Suppose that we want to enhance the processor used for web serving. The new processor is 10 times faster on computation in the web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

__Answer:__

$$Fraction_{enhanced} = 0.4$$

$$Speedup_{enhanced} = 10$$

$$Speedup_{overall} = \frac{1}{0.6+\frac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$
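The same calculation as a small C helper; a sketch of the formula above, with a function name of my own choosing:

```c
#include <stdio.h>

/* Overall speedup from Amdahl's Law, given the fraction of the old
   execution time that can use the enhancement and the speedup of
   that fraction. */
static double amdahl_speedup(double fraction_enhanced,
                             double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) +
                  fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* The web-serving example above: 40% of the time is computation
       that the new processor runs 10x faster. */
    printf("%.2f\n", amdahl_speedup(0.4, 10.0)); /* prints 1.56 */
    return 0;
}
```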
### The Processor Performance Equation

CPU time can be expressed as:

$$CPU\:time = CPU\:clock\:cycles\:for\:a\:program\times Clock\:cycle\:time$$

or

$$CPU\:time = \frac{CPU\:clock\:cycles\:for\:a\:program}{Clock\:rate}$$

Cycles per instruction (CPI) is useful:

$$CPI = \frac{CPU\:clock\:cycles\:for\:a\:program}{Instruction\:count}$$

Transposing:

$$CPU\:time = Instruction\:count\times Cycles\:per\:instruction\times Clock\:cycle\:time$$

Finding the total number of clock cycles over $ n $ instruction classes, where $ IC_{i} $ is the instruction count and $ CPI_{i} $ the cycles per instruction of class $ i $:

$$CPU\:clock\:cycles = \sum_{i=1}^{n}IC_{i} \times CPI_{i}$$

This can be used to find the total CPU time:

$$CPU\:time = \left(\sum_{i=1}^{n}IC_{i}\times CPI_{i}\right)\times Clock\:cycle\:time$$

And overall CPI:

$$CPI = \sum_{i=1}^{n}\frac{IC_{i}}{Instruction\:count}\times CPI_{i}$$
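A quick numeric sketch of these formulas; the instruction mix, CPI values and clock rate below are invented for illustration:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical instruction mix for a program: per-class
       instruction counts and CPIs (ALU, load/store, branch). */
    double ic[]  = { 5e8, 3e8, 2e8 };
    double cpi[] = { 1.0, 2.0, 1.5 };
    double clock_cycle_time = 0.5e-9; /* 2 GHz clock */

    double total_ic = 0.0, total_cycles = 0.0;
    for (int i = 0; i < 3; i++) {
        total_ic     += ic[i];
        total_cycles += ic[i] * cpi[i]; /* sum of IC_i * CPI_i */
    }

    /* Overall CPI weights each class's CPI by its share of the
       instruction count; CPU time is total cycles times the clock
       cycle time. */
    printf("overall CPI = %.2f\n", total_cycles / total_ic);           /* 1.40 */
    printf("CPU time    = %.2f s\n", total_cycles * clock_cycle_time); /* 0.70 s */
    return 0;
}
```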
# Chapter 2 - Instruction-Level Parallelism

Pipelining, used in every processor since 1985.

# Paper - Exploring the Design Space of Future CMPs, Huh, Keckler and Burger, PACT 2001

Modern computers are chip multiprocessor (CMP) systems. There are several ways to boost the performance of these systems, but they often have trade-offs as side effects.

## Factors to consider when designing CMPs

### Processor organization

Whether powerful out-of-order issue processors, or smaller, more numerous in-order processors, provide superior throughput.

### Cache hierarchy

The amount of cache memory per processor that results in maximal throughput. The ideal capacity is a function of processor organization, memory latency, and available off-chip bandwidth.

### Off-chip bandwidth

Finite bandwidth limits the number of cores that can be placed on a chip, forcing more area to be devoted to on-chip caches to reduce bandwidth demands.

### Application characteristics

Applications with different access patterns require different CMP designs to attain the best throughput. Different applications display varying sensitivities to L2 cache capacity, resulting in widely varying bandwidth demands.

## Transistors are shrinking

The transistors in modern computers are shrinking. In the paper, transistor sizes down towards 35 nm are considered, and the advantages and emerging problems related to this development are discussed. For the most part, smaller transistors are simply a gateway to more available area when designing processors: they allow for more cores, larger cores, more cache and larger internal buses. There are however other problems that become more prominent as you exploit these advantages.

### Transistors per IO pin

Processors have a rapidly growing number of transistors, but an IO pin count that is almost stationary. The result is a system where more and more cores fight over the same amount of IO. This is most critical with regard to memory access, as we can't add more memory channels without the physical pins. Many applications will be unable to utilize the higher core count when the memory hierarchy is unable to keep up with the development.

### Less cache per core

With more area available, we have space for more cache on the chip. This is however a trade-off, as each transistor used for cache is a transistor not used for more cores. It is favorable to have as many cores as possible to maximize throughput, but only up to the point where the extra cores are left waiting for memory too much of the time to be effective. So even as the available area grows, we constantly have to search for the optimal amount of caching, and relative to the number of cores we end up with less cache per core as transistors shrink.

# Paper - The Future of Microprocessors, Borkar and Chien, Communications of the ACM, 2011

The current growth in transistor speed cannot continue as it has for the last 20 years. We have to start looking for ways to increase performance without depending on faster transistors.

## Key insights

* Moore's Law continues but demands radical changes in architecture and software.
* Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware.
* Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.

## Solutions

### Parallelism

The current solution to the clock frequency wall is to apply parallelism to processors. Most processors today have multiple cores running at the same frequency as older single-core processors. They are not as powerful on their own as a modern single-core processor would have been, but many applications are able to utilize this parallelism in such a way that they benefit from the combined power, resulting in a more powerful computer.

### Heterogeneity

As the number of cores in our processors rises, new problems start to arise. The most critical of these is that memory access gets slower and slower as more cores fight over it. We can solve this by finding an alternative to a higher core count, and heterogeneity is one such alternative. The concept is that instead of a large number of cores that are equally good at all tasks, we can have a smaller number of cores specializing in different operations. In doing so, programs can run their operations on the cores best suited to the code, resulting in a faster computer.

## Energy efficiency

Since the Intel Pentium 4, we have realized that running CPUs at blazing speeds creates too much heat to be sustainable. We can't keep making smaller and smaller transistors if they still dissipate as much power as they do today, as it simply gets too hot. There are many solutions to this in existing processors, but we need to keep being creative and find new ways to make processors energy efficient if we are going to keep creating more powerful processors.

# The Three Laws

Many exam questions from previous years in this course require knowledge of three laws: Amdahl's law, Gustafson's law, and Pollack's rule. A combined code sketch follows the three definitions below.
## Amdahl's Law
$$ S(n) = \frac{T(1)}{T(n)} = \frac{1}{(1 - p) + \frac{p}{n}} $$

$ S(n) $ is the theoretical speedup gained by executing a given algorithm on a system capable of executing $ n $ threads of execution. $ T(n) $ is the time the algorithm takes to finish when being executed on $ n $ threads. $ p $ is the fraction of the algorithm that can be parallelized (i.e., is not strictly serial).

The law can also be expressed in terms of the fraction that is strictly serial, $ b $. It is almost just the same:

$$ S(n) = \frac{1}{b + \frac{1 - b}{n}}$$

## Gustafson's Law

$$ S(P) = P - \alpha (P - 1) $$

$ S(P) $ is the theoretical speedup gained by executing a given algorithm on a system with $ P $ available processors. $ \alpha $ is the non-parallelizable fraction of the process. This is also called _scaled speedup_.

## Pollack's Rule

Really something else, but in this course it is generously interpreted to mean:

$$ P(\delta a) \approxeq \sqrt{\delta a} $$

$ P $ is processor performance. $ \delta a $ is the change in the size of the chip in terms of transistors (and typically therefore physical area as well).
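As referenced above, here is a small C sketch of all three laws side by side; the sample parameter values are invented for illustration:

```c
#include <math.h>
#include <stdio.h>

/* Amdahl: speedup on n threads when fraction p of the work is
   parallelizable. */
static double amdahl(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

/* Gustafson: scaled speedup on P processors when fraction alpha of
   the process is non-parallelizable. */
static double gustafson(double P, double alpha) {
    return P - alpha * (P - 1.0);
}

/* Pollack: performance grows roughly as the square root of the
   increase in transistor count (and hence area). */
static double pollack(double delta_a) {
    return sqrt(delta_a);
}

int main(void) {
    printf("Amdahl    (p=0.9, n=8):  %.2f\n", amdahl(0.9, 8.0));    /* ~4.71 */
    printf("Gustafson (P=8, a=0.1):  %.2f\n", gustafson(8.0, 0.1)); /* 7.30 */
    printf("Pollack   (4x the area): %.2f\n", pollack(4.0));        /* 2.00 */
    return 0;
}
```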
# Vilje

Vilje is the supercomputer at NTNU, and a bragging paper about it is in the curriculum. Main points:

- SGI Altix ICE X system
- eight-core Intel Sandy Bridge processors
- two chips at each node
- 16 physical cores per node
- 32 logical cores per node
- 32 GB node memory
- 23040 cores in total
- 479.23 TFLOPS peak
- 8D enhanced hypercube interconnect
- currently 82nd at top500.org

A cooperation between met.no, NTNU and SINTEF. Runs the weather forecast for Norway and Sweden.

# Dependencies

## Data dependency

Also known as read-after-write (RAW). An instruction depends on the result of an earlier instruction.

1. A = 3
2. B = A
3. C = B

Line 3 has a data dependency on line 2, and line 2 has a dependency on line 1. Line 3 therefore also has a dependency on line 1.

## Anti-dependency

Also known as write-after-read (WAR). An instruction requires a value that is later updated, which means the order of the instructions can not be changed.

1. B = 3
2. A = B + 1
3. B = 7

Since A needs the value of B, and B is later updated, the order of these instructions can not be changed.

## Output dependency

Also known as write-after-write (WAW). The order of the instructions affects the final value of a variable.

1. B = 3
2. A = B + 1
3. B = 7

If the order of lines 1 and 3 changes, the final value of B changes. Thus, there is an output dependency between them. All three kinds are shown together in the sketch below.
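A combined sketch (my own example) with the dependency each statement creates named in its comment:

```c
#include <stdio.h>

int main(void) {
    int A, B, C;

    B = 3;      /* (1) writes B                                       */
    A = B + 1;  /* (2) reads B: RAW (data) dependency on (1)          */
    B = 7;      /* (3) writes B: WAR (anti) dependency on (2)'s read,
                       WAW (output) dependency on (1)'s write; moving
                       (3) before (1) or (2) changes the result       */
    C = A;      /* (4) reads A: RAW (data) dependency on (2)          */

    printf("A=%d B=%d C=%d\n", A, B, C); /* A=4 B=7 C=4 */
    return 0;
}
```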