TDT4258: Energy Efficient Computer Design
- Curriculum
    - Computers as Components, 2nd or 3rd edition
        - Both editions: Chapters 1-6
        - 2nd edition: Chapter 8.2.1
        - 3rd edition: Chapter 8.4.4
    - EFM32 Application Note 27: Energy Optimization
    - EFM32 Application Note 61: Energy Harvesting
    - Compendium
    - Lecture slides
# Chapter 1 - Embedded computing
An embedded computer system is any device that includes a programmable computer but is not itself intended to be a general-purpose computer.
Embedded systems are more complex to design than ordinary PCs because they must satisfy multiple design constraints at once. Constraints on cost, size and performance are often much stricter than elsewhere.
## Characteristics of embedded computing algorithms
Embedded computing systems have to provide sophisticated functionality.
- Complex algorithms
- Operations performed by a microprocessor may be very sophisticated
- User interface
- Microprocessors are frequently used to control complex user interfaces that may include multiple menus and many options (e.g. the moving maps in GPS navigation).
## Summary
Embedded computing, while fun, is often very complex. This is mostly due to the strict constraints that must be satisfied.
Trying to hack together a complex embedded system probably won't work, so a design process is needed.
Requirements such as meeting deadlines, consuming very little power, and fitting within a certain physical size are very common in embedded systems.
Both top-down and bottom-up processes of design might be needed to successfully develop an embedded system.
# Chapter 2 - Instruction sets
## Computer architecture
A computer usually has one of two architectures: von Neumann or Harvard. Both architectures have a CPU, but the memory organisation is different.
A von Neumann machine has one memory and one CPU connected by a bus. In this architecture the CPU has an internal register, the __program counter (PC)__, which holds the address of the instruction currently being executed. The PC is then updated so that the next instruction is fetched from memory.
Harvard on the other hand, has one memory for programs and one for data which are both connected to the CPU.
One effect of the Harvard architecture is that it's harder to write self modifying programs, since the data is completely separate from the program.
The Harvard architecture is widely used today for one very good reason: it allows for higher performance in digital signal processing. Since program and data memory have separate ports, the available memory bandwidth is higher. It also makes it easier to move data to the processor at the correct times.
A __microcontroller__ is defined as a single-chip computer that includes a processor, memory and programmable I/O devices.
## Assembly languages
Another axis along which computer architectures can be classified is their instruction set and how instructions are executed.
Many early computers had instruction sets that today are known as __CISC (complex instruction set computers)__. These instruction sets included many complex instructions, such as finding a substring in a string, and typically had a large number of instructions with varying lengths.
Another way of making computers is known as the __RISC (reduced instruction set computers)__. This architecture tended towards having fewer and simpler instructions.
RISC machines tended to use the __load/store__ instruction set. This means that you can't do data manipulation directly in memory, but have to load data into registers and store it back in memory when the task is done.
RISC instructions were chosen so that they could be effectively pipelined in the processor, and their execution was heavily optimized.
Early RISC machines substantially outperformed CISC machines, though the gap has narrowed somewhat in later years.
Some other characteristics of assembly languages are:
- __Word length__ (4-bit, 8-bit, 16-bit, 32-bit...): the size of the basic unit of data the processor operates on.
- __Little-endian__ vs __big-endian__: whether the lowest-order (least significant) byte of a word is stored at the lowest (little) or highest (big) byte address in memory. For example, a little-endian machine stores the byte 0x44 of the word 0x11223344 at the word's lowest address (the sketch after this list shows how to check this).
- __Single-issue__, __multiple-issue__, __superscalar__ and __VLIW__: how many instructions the processor can issue per clock cycle, and who finds the parallelism (see multi-instruction execution below).
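A minimal C sketch of the endianness point above (written for these notes, not taken from the book): it stores a known 32-bit word and inspects which byte ends up at the lowest address.
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t word = 0x11223344;
    uint8_t *lowest = (uint8_t *) &word;  // points at the lowest-addressed byte of the word

    if (*lowest == 0x44)
        printf("little-endian: lowest-order byte at the lowest address\n");
    else
        printf("big-endian: highest-order byte at the lowest address\n");
    return 0;
}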
There are different assembly languages per computer architecture, but they usually share the same basic features:
- One instruction appears per line.
- Labels, which give names to memory locations, start in the first column.
- Instructions must start in the second column or after to distinguish them from labels.
- Comments run from some designated comment character to the end of the line.
## Multi-instruction execution
In some cases, where multiple instructions have no dependencies between them, the CPU can execute several instructions simultaneously. One technique to achieve this (used by desktop and laptop computers) is superscalar execution. A superscalar processor scans the program during execution to find sets of instructions that can be executed together. Another technique for simultaneous instruction execution is __very long instruction word (VLIW)__ processors. These processors rely on the compiler to identify sets of instructions that can be executed in parallel.
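A minimal illustration of what such a set looks like (not tied to any particular processor): the first two instructions below have no data dependency and could be issued in the same cycle, while the third uses the result of the first and must wait for it.
ADD r0, r1, r2    ; independent of the instruction below
SUB r3, r4, r5    ; shares no registers with the ADD above, can issue in parallel
ADD r6, r0, r7    ; reads r0, so it depends on the first ADD and must wait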
### Superscalar vs. VLIW.
- Superscalar execution uses a dynamic approach, where hardware on the processor does all the work.
- VLIW execution uses a static approach, where all the work is done by the compiler.
- Superscalar execution can find parallelism that VLIW processors can't.
- Superscalar execution is more expensive in both cost and energy consumption.
## ARM
ARM is a family of RISC (Reduced instruction set computer) architectures. ARM instructions are written as follows:
LDR r0,[r8] ; comment goes here
label ADD r4,r0,r1
We can see that ARM instructions are written one per line, and that comments begin with a semicolon and continue to the end of the line. The label gives a name to a memory location, and comes at the beginning of the line, starting in the first column.
### Memory organization
The ARM architecture supports two basic types of data:
- The standard ARM word is 32 bits long.
- One word may be divided into four 8-bit bytes.
ARM is a load-store architecture, which means that data operands must first be loaded into the CPU and then stored back to main memory to save the results.
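A minimal sketch of what this looks like in practice (the register choices here are just placeholders for illustration): the operands are first loaded into registers, the arithmetic is done between registers, and the result is then stored back to memory.
LDR r0, [r2]      ; load the first operand from memory into a register
LDR r1, [r3]      ; load the second operand
ADD r0, r0, r1    ; data manipulation happens only between registers
STR r0, [r4]      ; store the result back to main memory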
### ARM data operations
ARM uses a load-store architecture for data operations. It has 16 general-purpose registers (r0 to r15), _though some are often used for a specific task_.
r15 is used as the program counter, so it should not be overwritten with arbitrary values.
Another important register is the __current program status register (CPSR)__, which holds status information about arithmetic, logical and shifting operations.
The CPSR has the following useful information in its top four bits (a small worked example follows this list):
- The negative (N) bit is set when the result is negative in two's-complement arithmetic.
- The zero (Z) bit is set if the result is zero.
- The carry (C) bit is set when there is a carry out of the operation.
- The overflow (V) bit is set when an arithmetic operation results in an overflow.
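As a small worked example (an illustration, not from the book): a compare instruction subtracts its operands and updates these flags without storing the result. If r1 = 5 and r2 = 5, the subtraction gives zero and produces neither a borrow nor an overflow.
CMP r1, r2        ; computes r1 - r2 = 0 and updates N, Z, C and V
                  ; here: N = 0, Z = 1 (result is zero), C = 1 (no borrow), V = 0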
r11 is used as the __frame pointer (fp)__. This register points to the end of the previous frame. A frame is the block of stack memory that belongs to the function currently being executed (its local variables, arguments and saved registers). To access variables within a frame, you subtract an offset from the frame pointer. The concept of frames was introduced to allow for nested function calls and recursion, as well as a structured way of handling function arguments and return values. The frame pointer is technically not necessary unless the frame can grow during execution.
r13 is the __stack pointer (sp)__; it points to the end of the frame currently being executed, i.e. the top of the stack.
r14 is the __link register (lr)__; it contains the address to return to after a function call has completed. Now, you may be yelling at your screen, frustrated that I just said that that's what stack frames are for. Fret not, there is logic to this. The link register is an architecture feature provided by ARM for the purpose of supporting returns from function calls. The link register gets overwritten each time a branch-and-link instruction is executed, so on its own it cannot support recursion or nested function calls. This is what makes the stack frame structure necessary: the saved return address lives on the stack whenever a function makes further calls.
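A minimal sketch of how this plays out (the labels my_func and helper are made up for illustration): BL writes the return address into lr, and a function that itself calls other functions saves lr on the stack before the nested call overwrites it.
BL my_func        ; call: the return address is placed in lr (r14)
; execution continues here after my_func returns

my_func:
PUSH {fp, lr}     ; save fp and the return address, since the BL below overwrites lr
BL helper         ; nested call: lr is overwritten with a new return address
POP {fp, pc}      ; restore fp and return by loading the saved return address into pc

helper:
BX lr             ; leaf function: it makes no calls, so it can return through lr directly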
### Small ARM examples
Please see the book for a bigger reference to ARM assembly.
#### Translate this expression to assembly
__NOTE:__ a is at offset -24, b at -28 and z at -44 with respect to the frame pointer.
Translate this C code to ARM. `z = (a << 2) | (b & 15);`
ldr r3, [fp, #-24]
lsl r2, r3, #2
ldr r3, [fp, #-28]
and r3, r3, #15
orr r3, r2, r3
str r3, [fp, #-44]
One thing to take from this example is that when computing an expression with multiple parts, you start with the innermost sub-expressions and work your way out.
#### Implement this if-statement
C-code:
if (a > b) {
a = 5;
} else {
b = 3;
}
Assembly:
.L1:
ldr r2, [fp, #-24]
ldr r3, [fp, #-28]
cmp r2, r3
; branch to the false block if the condition (a > b) does not hold
ble .L3
; true block
mov r3, #5
str r3, [fp, #-24]
b .L4
; false block
.L3:
mov r3, #3
str r3, [fp, #-28]
.L4:
; continue here
### Advanced ARM features
Many ARM processors provide advanced features aimed at more specialised applications.
* DSP - Extensions provided with the purpose of improving digital signal processing. For instance, MAC (multiply-and-accumulate) instructions can operate on 16x16 or 32x16 operands.
* SIMD - (Single Instruction Multiple Data) A single register is treated as holding several smaller elements of data. The same operation is then applied to each element in parallel.
* NEON - NEON instructions are an extension of SIMD, providing not only instructions optimized for vectors of data, but also larger registers, enabling a greater degree of data parallelism.
* TrustZone - Security features. A special instruction is available for entering TrustZone, a separate processor mode, allowing one to perform operations not permitted in normal mode.
* Jazelle - Allows for direct execution of 8-bit Java™ bytecode instructions, removing the need for an interpreter.
#### SIMD Example
Two 32-bit registers are considered to consist of subsets of data, each subset a byte.
r1 = 0x00 0xfe 0x00 0xfe
r2 = 0x11 0x01 0x10 0x01
Performing a SIMD-add on the two registers would result in the following
simd_add r1, r1, r2
r1 = 0x11 0xff 0x10 0xff
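On ARM processors that implement the ARMv6 SIMD extensions (an assumption here; not every ARM core provides them), this kind of operation corresponds to a real instruction:
UADD8 r1, r1, r2  ; four independent unsigned byte additions, with no carry between the byte lanes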
# Chapter 3 - CPUs
## Input and output mechanisms.
I/O devices typically have several registers: data registers and status registers.
- Data registers hold values that are treated as data by the device.
- Status registers provide information about the device's operation, such as whether the current transaction has completed.
It is very common for a device's status and data registers to be mapped into the main memory address space. Some architectures (e.g. Intel x86) instead provide a separate address space for I/O devices and special instructions for reading from and writing to devices.
#### Example of writing to and reading from a memory mapped I/O device
Our device has a status register and a data register. The status register is mapped to memory address 0x40006080 and data to 0x40006084. Both registers are 32 bits in size.
Reading from the status register and writing the value to the data register in ARM assembly (this also shows how to work around the fact that a full 32-bit address cannot be encoded directly in a single 32-bit instruction word: the LDR r0, =STATUS pseudo-instruction makes the assembler place the address in a literal pool and load it from there):
STATUS = 0x40006080
; Read from status register
LDR r0, =STATUS
LDR r1, [r0]
; Write to data register
STR r1, [r0, #0x4]
The corresponding C-code:
#include <stdint.h>
#define STATUS ((volatile uint32_t *) 0x40006080)
#define DATA   ((volatile uint32_t *) 0x40006084)
// Reading
uint32_t status_val = *STATUS;
// Writing
*DATA = status_val;
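Note that the registers are accessed through volatile pointers, so the compiler does not optimise away or reorder the loads and stores to the device.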
### Busy/wait
Using busy/wait, the CPU repeatedly tests the device status while the I/O transaction is in progress, which is extremely inefficient. The CPU could instead be doing useful work in parallel with the I/O transaction, and to allow this, we can use interrupts.
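A minimal busy/wait sketch in C, reusing the STATUS and DATA registers from the example above (the meaning of bit 0 of the status register is an assumption made purely for illustration):
#include <stdint.h>

#define STATUS ((volatile uint32_t *) 0x40006080)
#define DATA   ((volatile uint32_t *) 0x40006084)

uint32_t read_when_ready(void)
{
    while ((*STATUS & 0x1) == 0) {
        // the CPU spins here, doing no useful work, until the device reports ready
    }
    return *DATA;   // read the data register once the transaction has completed
}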
### Interrupts
Using interrupts, the device can force the CPU to execute a particular piece of code, so the CPU can do other work while the I/O transaction is in progress or while the device has no need for the CPU's attention. This is done using an interrupt handler routine (or device driver) that gets called when the device generates an interrupt. An interrupt controller is used for prioritising and ordering the different kinds of interrupts. Typically, each I/O controller can generate interrupts, making a mechanism for gathering and prioritising interrupts necessary. The interrupt controller chooses which interrupt to forward to the CPU. When an interrupt is forwarded, the CPU suspends its execution, saves its current state to the stack, and calls the interrupt handler corresponding to where the interrupt originated.
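A minimal interrupt-driven sketch in C (the handler name, the way it is registered in the vector table, and the device behaviour are all assumptions for illustration, not tied to a specific chip): the handler runs only when the device raises an interrupt, and the main program is free to do other work in between.
#include <stdint.h>

#define DATA ((volatile uint32_t *) 0x40006084)

volatile uint32_t latest_value;   // written by the handler, read by the main program
volatile int data_ready = 0;      // flag telling the main program that new data has arrived

void device_irq_handler(void)     // assumed to be registered in the interrupt vector table
{
    latest_value = *DATA;         // read the data register (assumed to acknowledge the device)
    data_ready = 1;               // signal the main program
}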
## Memory management and address translation.
## Caches.
## Performance and power consumption of CPUs.
# Chapter 4 - Computing platforms
Coming soon
# Chapter 5 - Program design and analysis
## Cyclomatic complexity
A simple measure of a program's control complexity is its cyclomatic complexity.
Cyclomatic complexity is the number of linearly independent paths through the execution of a program.
It is calculated as follows:
$M = E - N + 2P$
where E is the number of edges in the program's CDFG (control/data flow graph), N is the number of nodes, and P is the number of connected components.
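As a small worked example: the if-else statement from the ARM section has a CDFG with four nodes (the test, the true block, the false block, and the join point) and four edges, all in one connected component, so $M = 4 - 4 + 2 \cdot 1 = 2$, matching the two linearly independent paths through the statement.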
# Chapter 6 - Processes and operating systems
Coming soon