This is an old version of the compendium, written May 11, 2015, 2:07 a.m. Changes made in this revision were made by chrisdef.
TDT4258: Energy Efficient Computer Design

- Curriculum
    - Computers as Components, 2nd or 3rd edition
        - Both editions: Chapters 1-6
        - 2nd edition: Chapter 8.2.1
        - 3rd edition: Chapter 8.4.4
    - EFM32 Application Note 27: Energy Optimization
    - EFM32 Application Note 61: Energy Harvesting
    - Compendium
    - Lecture slides

# Chapter 1 - Embedded computing

An embedded computer system is any device that includes a programmable computer but is not itself intended to be a general-purpose computer. Embedded systems are more complex to design than normal PCs, as they must meet multiple design constraints. Constraints on cost, size and performance are often much stricter than elsewhere.

## Characteristics of embedded computing algorithms

Embedded computing systems have to provide sophisticated functionality:

- Complex algorithms - The operations performed by a microprocessor may be very sophisticated.
- User interface - Microprocessors are frequently used to control complex user interfaces that may include multiple menus and many options (e.g. the moving maps in GPS navigation).

## Summary

Embedded computing, while fun, is often very complex. This is mostly due to the strict constraints that must be satisfied. Trying to hack together a complex embedded system probably won't work, so a design process is needed. Requirements such as meeting hard deadlines, consuming very little power and fitting a certain physical size are very common in embedded systems. Both top-down and bottom-up design processes might be needed to successfully develop an embedded system.

# Chapter 2 - Instruction sets

## Computer architecture

A computer usually has one of two architectures: either von Neumann or Harvard. Both architectures have a CPU, but the memory design differs. The von Neumann architecture has one __shared memory for data and programs__, connected to the CPU by a bus. In this architecture the CPU has an internal register, the __program counter (PC)__, which keeps track of which part of memory is being executed.
The PC is then updated to run a different instruction from memory. Harvard, on the other hand, has one memory for programs and one for data, both connected to the CPU. One effect of the Harvard architecture is that it's harder to write self-modifying programs, since the data is completely separate from the program. Harvard architecture is widely used today for one very good reason: it allows for higher performance in digital signal processing. Since there are two memory ports, you get higher bandwidth, and it is also easier to move data at the correct times. A __microcontroller__ is defined as a single-chip computer that includes a processor, memory and programmable I/O devices.

## Assembly languages

Another axis along which one can classify computer architectures relates to their instruction sets and how they are executed. Many early computers had instruction sets of the kind known today as __CISC (complex instruction set computers)__. These have many instructions which can do all sorts of things, like finding a substring in a string, with loads and loads of instructions of varying length, and so on. Another way of making computers is known as __RISC (reduced instruction set computers)__. This architecture tends towards fewer and simpler instructions. RISC machines tend to use a __load/store__ instruction set. This means that you can't manipulate data directly in memory, but have to load data into registers and store results back to memory when the task is done. RISC instructions were chosen so they could be effectively pipelined in the processor, and were heavily optimized. Early RISC machines substantially outperformed CISC machines, though the gap has narrowed in later years. Some other characteristics of assembly languages are:

- __Word length__ (4-bit, 8-bit, 16-bit, 32-bit...)
: the length of each block of data the processor handles as a unit
- __Little-endian__ vs __big-endian__
: whether the lowest-order byte resides in the lower (little) or higher (big) bits of the word
- __Single-issue instruction__, __multiple-issue instruction__, __superscalar__ and __VLIW__

There are different assembly languages per computer architecture, but they usually share the same basic features:

- One instruction appears per line.
- Labels, which give names to memory locations, start in the first column.
- Instructions must start in the second column or after, to distinguish them from labels.
- Comments run from some designated comment character to the end of the line.

## Multi-instruction execution

In some cases, where instructions have no dependencies between them, the CPU can execute several instructions simultaneously. One technique to achieve this (used by desktop and laptop computers) is superscalar execution. A superscalar processor scans the program during execution to find sets of instructions that can be executed together. Another technique for simultaneous instruction execution is __very long instruction word (VLIW)__ processors. These rely on the compiler to identify sets of instructions that can be executed in parallel.

### Superscalar vs. VLIW

- Superscalar execution uses a dynamic approach, where hardware on the processor does all the work.
- VLIW execution uses a static approach, where all the work is done by the compiler.
- Superscalar execution can find parallelism that VLIW processors can't.
- Superscalar execution is more expensive in both cost and energy consumption.

## ARM

ARM is a family of RISC (reduced instruction set computer) architectures. ARM instructions are written as follows:

           LDR r0,[r8]     ; comment goes here
    label: ADD r4,r0,r1

We can see that ARM instructions are written one per line, and that comments begin with a semicolon and continue to the end of the line.
The label gives a name to a memory location and comes at the beginning of the line, starting in the first column.

### Memory organization

The ARM architecture supports two basic types of data:

- The standard ARM word is 32 bits long.
- One word may be divided into four 8-bit bytes.

ARM is a load-store architecture, which means that data operands must first be loaded into the CPU and then stored back to main memory to save the results.

### ARM data operations

ARM uses a load-store architecture for data operations. It has 16 general purpose registers (r0 to r15), _though some are often used for a specific task_. r15 is used as the program counter, so it should obviously not be overwritten. Another important register is the __current program status register (CPSR)__, which holds information about arithmetic, logical or shifting operations. The CPSR has the following useful information in its top four bits:

- The negative (N) bit is set when the result is negative in two's-complement arithmetic.
- The zero (Z) bit is set if the result is zero.
- The carry (C) bit is set when there is a carry out of the operation.
- The overflow (V) bit is set when an arithmetic operation results in an overflow.

r11 is used as the __frame pointer (fp)__. This register points to the end of the previous frame. A frame is the block of stack memory belonging to the function being executed right now. To access variables within a frame, you subtract an offset from the frame pointer. The concept of frames was introduced to allow for nested function calls and recursion, as well as a structured way of handling function arguments and return values. The frame pointer is technically not necessary unless the frame can grow during execution. r13 is the __stack pointer__; it points to the end of the frame currently being executed. r14 is the __link register__; it contains the address to return to after a function call has completed.
Now, you may be yelling at your screen, frustrated that I just said that that's what stack frames are for. Fret not, there is logic to this. The link register is an architecture feature provided by ARM to support returning from function calls. The link register gets overwritten each time a branch-and-link instruction is executed, so recursion and nested function calls can't rely on it alone. This is what makes the stack frame structure necessary.

### Small ARM examples

Please see the book for a bigger reference to ARM assembly.

#### Translate this expression to assembly

__NOTE:__ a is at -24, b at -28 and z at -44 with respect to the frame pointer. Translate this C code to ARM.

`z = (a << 2) | (b & 15);`

    ldr r3, [fp, #-24]
    lsl r2, r3, #2
    ldr r3, [fp, #-28]
    and r3, r3, #15
    orr r3, r2, r3
    str r3, [fp, #-44]

One thing to take from this example: when computing an expression with multiple parts, always start at the inner part and work your way out.

#### Implement this if-statement

C code:

    if (a > b) {
        a = 5;
    } else {
        b = 3;
    }

Assembly:

    .L1:
        ldr r2, [fp, #-24]
        ldr r3, [fp, #-28]
        cmp r2, r3
        ; jump if false
        ble .L3
        ; true block
        mov r3, #5
        str r3, [fp, #-24]
        b .L4
        ; false block
    .L3:
        mov r3, #3
        str r3, [fp, #-28]
    .L4:
        ; continue here

### Advanced ARM features

Many ARM processors provide advanced features for less general applications.

* DSP - Extensions provided for improved digital signal processing. For instance, MAC (multiply-and-accumulate) instructions can be applied to 16x16 or 32x16 operands.
* SIMD - (Single Instruction Multiple Data) A single register is treated as several smaller sets of data. The same operation is then applied to each single element.
* NEON - NEON instructions are an extension of SIMD, providing not only instructions optimized for vectors of data, but also larger registers, enabling a greater level of data parallelism.
* TrustZone - Security features.
A special instruction is available for entering TrustZone, a separate processor mode, allowing one to perform operations not permitted in normal mode.
* Jazelle - Allows for direct execution of 8-bit Java™ bytecode instructions, removing the need for an interpreter.

#### SIMD Example

Two 32-bit registers are considered to consist of subsets of data, each subset a byte.

    r1 = 0x00 0xfe 0x00 0xfe
    r2 = 0x11 0x01 0x10 0x01

Performing a SIMD add on the two registers would give the following result:

    simd_add r1, r1, r2
    r1 = 0x11 0xff 0x10 0xff

# Chapter 3 - CPUs

## Input and output mechanisms

I/O devices typically have several registers: data registers and status registers.

- Data registers hold values that are treated as data by the device.
- Status registers provide information about the device's operation, such as whether the current transaction has completed.

It is very common for a device's status and data registers to be mapped into the main memory. Some architectures (Intel x86) provide a separate address space for I/O devices and special instructions for reading/writing from/to devices.

#### Example of writing to and reading from a memory-mapped I/O device

Our device has a status register and a data register. The status register is mapped to memory address 0x40006080 and the data register to 0x40006084. Both registers are 32 bits in size. Reading from the status register and writing the value to the data register in ARM assembly (this is also a good example of how a 32-bit address that won't fit inside a 32-bit instruction word is loaded via the literal pool):

    STATUS = 0x40006080
    ; Read from status register
    LDR r0, =STATUS
    LDR r1, [r0]
    ; Write to data register
    STR r1, [r0, #0x4]

The corresponding C code:

    #define STATUS ((volatile uint32_t *) 0x40006080)
    #define DATA   ((volatile uint32_t *) 0x40006084)

    // Reading
    uint32_t status_val = *STATUS;
    // Writing
    *DATA = status_val;

### Busy/wait

Using busy/wait, the CPU tests the device status while the I/O transaction is in progress, which is extremely inefficient.
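A minimal busy/wait sketch in C, reusing the register layout from the example above (the position of the ready bit in the status register is an assumption for illustration):

```c
#include <stdint.h>

/* Assumed ready-bit position in the status register. */
#define READY_BIT (1u << 0)

/* Busy/wait write: spin on the status register until the device
   reports ready, then write the value to the data register. */
void busywait_write(volatile uint32_t *status,
                    volatile uint32_t *data,
                    uint32_t value)
{
    while ((*status & READY_BIT) == 0)
        ;               /* the CPU does no useful work here */
    *data = value;
}
```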
The CPU could do useful work in parallel with the I/O transaction, and to allow this, we can use interrupts.

### Interrupts

Using interrupts, the device can force the CPU to execute a particular piece of code, so the CPU can do other work while the I/O transaction is in progress or while the device has no need for the CPU's attention. This is done by using an interrupt handler routine (or device driver) that gets called when the device generates an interrupt. An interrupt controller is used for prioritisation and ordering of different kinds of interrupts. Typically, each I/O controller can generate interrupts, making a mechanism for gathering and prioritising interrupts necessary. The interrupt controller chooses which interrupt to forward to the CPU. The CPU then suspends its execution, saves the current state to the stack, and calls the interrupt handler corresponding to where the interrupt originated.

## Supervisor mode, exceptions, and traps

### Supervisor mode

Some processor architectures provide a __supervisor mode__. Most programs run in user mode, but some tasks are usually reserved for execution in supervisor mode. Supervisor modes are less frequently provided by e.g. DSPs and more frequently by processors intended for more complex systems, typically systems needing an OS. For instance, on systems with an MMU (memory management unit), writing to the MMU configuration registers can typically only be done in supervisor mode.

### Exceptions

An exception can be viewed as a CPU-generated interrupt, triggered by an internally detected error. If a program, for instance, tries to divide by zero, the CPU will generate an exception and a predefined routine will be called to handle the error. Which routine is called is, as with interrupts, defined by a vector, and the ordering of exceptions is defined by each exception's priority.

## Memory system mechanisms

The clock speeds of processors are increasing at a much higher rate than memory access speeds.
It is therefore useful to have ways to narrow the gap between processing speeds and memory read/write speeds.

### Caches

A cache is a small, fast memory which holds parts of the main memory for quicker lookups by the CPU. Because the cache contains copies of parts of main memory, the CPU can do quick lookups of values it needs rather than always fetching them from main memory (a cache lookup often takes only a few ns, while a main memory lookup typically takes 50 - 75 ns). Between the CPU and the cache/main memory there is a cache controller. When the CPU needs data, the cache controller requests it from both the cache and the main memory. If the data is in the cache (a __cache hit__), the main memory request is aborted. If not (a __cache miss__), the data is retrieved from main memory. Writing to cache/memory requires a bit more work, as you need to update both the cache and the main memory.

### Memory management and address translation

In modern CPUs, the memory management unit (MMU) is responsible for mapping the virtual, logical addresses used by the processor to actual physical addresses in RAM, using translation tables.

## Performance and power consumption of CPUs

# Chapter 4 - Computing platforms

This chapter covers the three main parts of a computing platform: the microprocessor, the memory and the I/O devices. A microcontroller is a single chip that contains all of these. A computing platform is equally dependent on hardware and software - each needs the other to perform its function. Most embedded systems include a hardware abstraction layer (HAL) to provide a basic level of abstraction from the hardware. This is frequently used by device drivers.

## Buses

A bus provides a common connection between the CPU and the memory and/or I/O devices. The word bus refers to both the physical connection and the protocol by which the connection is established.
The signals transmitted on the bus comprise the data itself, addresses, a clock and some control signals. The primary task of a bus is to provide an interface to memory. Every bus has a bus master, ensuring that only one transaction is active on the bus at a time. In a typical bus, the CPU functions as the bus master.

Most buses utilize the four-cycle handshake, which ensures a safe connection. This works in the following way: device 1 signals that it's ready to transmit, device 2 signals that it's ready to receive, device 2 signals that it has received the data, and device 1 ends the transmission.

Data can be transmitted in different ways on the bus. A burst transfer refers to transmitting data repeatedly without going through all the steps required to transmit each piece of data in a separate transaction. A disconnected transfer refers to requesting the transfer and then completing it later, freeing the bus in the meantime.

Direct memory access (DMA) refers to a bus operation that allows reading and writing data without the use of the CPU. This is established through a two-step process: first, the DMA controller sends a bus request to the CPU; then, the CPU sends a bus grant to the DMA controller. When this is done, the DMA controller can act as a bus master. The CPU enables DMA by writing to registers in the DMA controller; when the DMA operation is complete, the CPU receives an interrupt. Since the CPU can't access the bus while a DMA transfer is active, cyclic scheduling of DMA requests is frequently used, meaning that the DMA controller occupies the bus for only a few cycles at a time.

Multiple bus configurations are used, since the purposes of buses differ. Smaller buses have fewer connections and less risk of being occupied. Larger buses can access more of the system. Also, the higher the speed of the bus, the higher the cost. It is therefore common to connect different buses together. Interconnected buses utilize a bridge to communicate with each other.
The bridge acts as a master for one bus and a slave for the other. Bus bandwidth refers to the data capacity of the bus. The transfer time can be computed with the two following formulas, where N is the data size in bytes, D is the number of clock cycles per data transfer, O is the overhead per transfer in clock cycles, W is the bus width in bytes, and B is the number of transfers per burst.
$$ T_{basic}(N) = (D + O) \cdot \frac{N}{W} $$ (for basic transfers)
$$ T_{burst}(N) = (BD + O) \cdot \frac{N}{BW} $$ (for burst transfers)

For example, transferring N = 64 bytes over a W = 4-byte bus with D = 1 and O = 2 takes (1 + 2) * 64/4 = 48 cycles as basic transfers, but only (4 * 1 + 2) * 64/(4 * 4) = 24 cycles with bursts of B = 4.

## Memory

Memory is organized as a two-dimensional array. Dynamic RAM (DRAM) is commonly used as main memory in modern computing platforms. Multiple types of DRAM are available, varying widely in speed, capacity and other capabilities. Memory for PCs often comes as S-/DIMMs: Single/Dual In-line Memory Modules. ROM and flash are two other types of memory in use. Memory components interact with the CPU through a memory controller.

To achieve parallelism in memory, we utilize channels and banks. Channels are separate connections to the CPU. Banks are separate memory blocks, each with its own memory arrays and addressing logic. Banks in the same part of memory share the same channel. Channels are generally more expensive than banks. Memory blocks can have different aspect ratios, meaning the height and width can vary. If the size is constant, increasing the height shrinks the width and vice versa.

## Choosing a computing platform

When choosing a computing platform for your embedded system, you should adapt the CPU, bus, memory and I/O devices to what suits your needs. You should also choose the right software to run on your system.

## Consumer electronic devices

Consumer electronics should satisfy both functional and non-functional requirements. The first are technical specifications that the system should support. The latter are things like battery life, looks and price.

# Chapter 5 - Program design and analysis

## Circular buffers and stream-oriented programming

Many embedded systems, especially DSP-based ones, do work on streams of data. A circular buffer is a memory-efficient data structure for handling such streams. If we assume our algorithm needs a certain subset of the data stream to produce the next output, we can use a circular buffer to hold a window into the data stream at the current time. Since the size of this window doesn't change, the buffer can also be of constant size.
In practice, a circular buffer is implemented as an array plus a position variable pointing to the start of the current window in the array.

### Example

Data stream:

    data: 1 2 3 4 5
    time --->

At time $t$ the data window consists of `1 2 3 4`, at $t+1$: `2 3 4 5`.

Circular buffer at $t$:

    buf = 1 2 3 4

Circular buffer at $t+1$:

    buf = 5 2 3 4

At time $t$ we're done with the first data element, and can therefore discard it, making room for the new element arriving at $t+1$. Another important data structure for stream-oriented programming is the queue. While the circular buffer holds a constant number of elements, the queue can be of varying size.

## Models of programs

A __control/data flow graph__ (CDFG) is used as a unifying way of modeling programs. It allows us to describe program behaviour regardless of what programming language is used or what platform it is supposed to run on. A typical CDFG consists of components describing both data and control operations.

### Data flow graphs

A __data flow graph__ (DFG) is a way of modeling data operations. In high-level languages, a block of code containing no conditionals (one entry point/one exit point) is known as a basic block. DFGs allow us to model basic blocks. The first step in modeling a basic block is ensuring it conforms to __single-assignment form__, in which each variable is assigned to only once. To rewrite an expression not in this form, one splits each variable that is assigned to more than once into two variables. With an expression in single-assignment form, we can represent variables as edges and operators as nodes in our DFG.

### Control/data flow graphs

With our DFG in place, we can use a CDFG to model conditionals and control flow (if/else, for loops, while loops, etc.). A CDFG encapsulates DFGs by representing them as __data flow nodes__. The other type of nodes in a CDFG are decision nodes, representing conditionals and control flow.
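The circular buffer described at the start of this chapter can be sketched in C roughly as follows (the window size of four matches the earlier example; all names are my own):

```c
#include <stdint.h>

#define WINDOW 4  /* size of the data window, as in the example */

struct circ_buf {
    int data[WINDOW];
    int pos;      /* index of the oldest element (start of window) */
};

/* Overwrite the oldest element with a newly arrived sample and
   advance the window start, wrapping around the array. */
void circ_push(struct circ_buf *b, int sample)
{
    b->data[b->pos] = sample;
    b->pos = (b->pos + 1) % WINDOW;
}

/* Read the i-th element of the current window (0 = oldest). */
int circ_get(const struct circ_buf *b, int i)
{
    return b->data[(b->pos + i) % WINDOW];
}
```

Pushing 5 onto a buffer holding `1 2 3 4` overwrites the oldest element, leaving the array as `5 2 3 4` with the window start at index 1, exactly as in the example above.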
## Assembly, linking and loading

### Program generation workflow

A day in the life of a program™:

    +----------------------+   +--------+   +-------------+   +---------+
    |High level source code|-->|Compiler|-->|Assembly code|-->|Assembler|
    +----------------------+   +--------+   +-------------+   +---------+
                                                                   |
                                                                   v
                                                             +-----------+
                                                             |Object code|
                                                             +-----------+
                                                                   |
                                                                   v
    +---------+   +------+   +-----------------+   +------+
    |Execution|<--|Loader|<--|Executable binary|<--|Linker|
    +---------+   +------+   +-----------------+   +------+

An assembler's main task is to generate binary instructions from the supplied assembly code. The assembler must do this with respect to the instruction formats, and must also translate labels into actual addresses. Each assembly source file gets assembled into its own object file. Since code in one file may depend on labeled code in a different file, we need to merge the object files together in a sensible way. This is done by the linker.

### Assembler

So how does the assembler actually generate object code? In short, it makes two passes through the code: the first to generate a symbol table, and the second to actually assemble each instruction at the addresses calculated in the first pass. For the time being, let us assume we know the starting address of the program. (Most assemblers can generate relocatable code, alleviating us from having to worry about starting addresses.) During the first pass, the assembler maintains a __program location counter (PLC)__, starting at the starting address of our program and incrementing by the size of an instruction each time it progresses a line. Each time the assembler encounters a label during this pass, it adds that label to the symbol table, associated with the current PLC value. During the second pass, the assembler generates binary instructions for each line of assembly in the source file, at memory locations conforming to the information in the now-generated symbol table.
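The first pass described above can be sketched like this (a simplification assuming fixed 4-byte instructions, one instruction per line, and the column convention from chapter 2: a line holds a label if and only if its first column is non-blank; all names are my own):

```c
#include <string.h>
#include <stdint.h>

#define INSN_SIZE 4  /* fixed instruction size, as on ARM */

struct symbol { char name[32]; uint32_t addr; };

/* First pass: walk the source lines, record each label together with
   the current program location counter (PLC), and advance the PLC by
   one instruction size per line. Returns the number of symbols found. */
int first_pass(const char *lines[], int n,
               uint32_t start_addr, struct symbol syms[])
{
    uint32_t plc = start_addr;
    int nsyms = 0;
    for (int i = 0; i < n; i++) {
        if (lines[i][0] != ' ' && lines[i][0] != '\t') {
            /* label in the first column: copy up to the ':' */
            const char *colon = strchr(lines[i], ':');
            size_t len = colon ? (size_t)(colon - lines[i])
                               : strlen(lines[i]);
            if (len >= sizeof syms[nsyms].name)
                len = sizeof syms[nsyms].name - 1;
            memcpy(syms[nsyms].name, lines[i], len);
            syms[nsyms].name[len] = '\0';
            syms[nsyms].addr = plc;
            nsyms++;
        }
        plc += INSN_SIZE;  /* each line holds one instruction */
    }
    return nsyms;
}
```

The second pass would then walk the lines again, emitting one binary instruction per line and substituting symbol-table addresses for label operands.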
The symbol table is later used by the linker to structure the entire executable.

### Linker

The next step in the process is the linking stage. Most programs are spread across several source files and are therefore assembled into separate object files, each with its own symbol table. The linker stitches together all the object files into a single executable, making sure to resolve any external references in each object file. The linker also takes a linker script, defining where in physical memory to place program and data.

## Cyclomatic complexity

A simple measure of a program's control complexity is its cyclomatic complexity: the number of linearly independent paths through the execution of a program. It is calculated as follows:

$M = E - N + 2P$

where E is the number of edges in the program's CDFG, N is the number of nodes and P is the number of connected components. For example, a simple if/else modeled as one connected graph with four nodes (decision, true block, false block, join) and four edges gives $M = 4 - 4 + 2 \cdot 1 = 2$.

# Chapter 6 - Processes and operating systems

Coming soon

# Chapter 7 - Stuff

Not in the curriculum per 2014. Having section seven refer to chapter eight looked rather silly though, so I'm including this paragraph.

# Chapter 8 - I$^2$C bus

The I$^2$C bus is a very common, low cost, easy to implement bus mostly used for linking microcontrollers into systems. The bus exists in two variations: the standard (supporting a transfer rate of up to 100 kbps) and the extended (up to 400 kbps). An I$^2$C bus uses only two lines: a __serial data line__ (SDL) for data, and a __serial clock line__ (SCL) indicating valid data on the SDL. Each node in an I$^2$C network is connected to both lines. On the data link layer, every device connected to an I$^2$C bus must have a unique address. In the standard I$^2$C definition an address is seven bits, while the extended definition allows for ten bits. The zero address (0000000) is used to signal a __general call__, addressing all devices connected to the bus. A bus transaction consists of a series of one-byte transmissions on the SDL.
Transactions are either reads or writes. In a read transaction, the master transmits the address it wants to read from, and the corresponding slave responds with the requested data. A write transaction also transmits the address to write to, but follows it up with the data to write. An address transmission consists of eight bits: seven for the address in question and one indicating the data direction (read/write). So how does the SCL fit into this? A bus transaction is initiated by a start signal and ended with a stop signal.

* __Start signal__ - SCL held high, SDL transitions 1->0
* __Stop signal__ - SCL held high, SDL transitions 0->1
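The eight-bit address transmission described above can be sketched as follows (placing the seven address bits above the direction bit is the common convention, but treat the exact packing here as an assumption; the names are my own):

```c
#include <stdint.h>

#define I2C_WRITE 0u
#define I2C_READ  1u

/* Pack a 7-bit slave address and a read/write direction bit into
   the single address byte sent at the start of a transaction. */
uint8_t i2c_addr_byte(uint8_t addr7, uint8_t rw)
{
    return (uint8_t)(((addr7 & 0x7Fu) << 1) | (rw & 1u));
}
```

For instance, a read from the slave at address 0x50 would begin with the byte `i2c_addr_byte(0x50, I2C_READ)`.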