TDT4200: Parallel Computing

One thing to remember: location, location, location (of your memory).

# Distributed computing

Supercomputers, clusters: not feasible with shared memory. Enter message passing. Can achieve super-linear speedup due to more fast memory (cache). Scales better for huge data sets (to achieve Gustafson's law-like speedups).

## MPI

A standard for message passing, implemented by several libraries (for instance OpenMPI).
MPI organises processes in communicators, which are collections of processes that can send messages to each other. `MPI_COMM_WORLD` contains all the processes. In addition, there are specialised communicators, such as the Cartesian communicator, which simplifies organising the processes in a Cartesian grid.
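As a minimal sketch of how these communicators are used (the 2D grid dimensions are picked by `MPI_Dims_create`, and all variable names are illustrative), a program can query its rank and size in `MPI_COMM_WORLD` and then build a Cartesian communicator:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); /* this process' id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);       /* total number of processes */

    /* Let MPI factor the processes into a 2D grid, then build a
       Cartesian communicator covering all of them. */
    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);
    int periods[2] = {0, 0};                    /* non-periodic in both dimensions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);            /* ranks may be reordered in cart */
    MPI_Cart_coords(cart, cart_rank, 2, coords);
    printf("world rank %d -> grid position (%d, %d)\n",
           world_rank, coords[0], coords[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

Because `reorder` is set to 1, MPI may renumber the processes in the new communicator, which is why the rank is queried again from `cart` before looking up the coordinates.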
MPI supports a large number of different calls; only some of them are part of this course:
```c
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Wait(MPI_Request *request, MPI_Status *status)
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
```

Note that the functions starting with an I are non-blocking. Their counterparts without the I are, however, not always blocking calls. In many implementations the regular send is only blocking when the data to be sent is too large to fit in the send buffer. The size of this buffer may vary, so depending on the system running the program, it may or may not deadlock.

# Shared memory parallelism

## OpenMP

OpenMP is an API for multi-platform shared-memory multiprocessing. It provides compiler directives for parallelising seemingly procedural code and is excellent at parallelising (for) loops. It does not expose the "low-level" features used (mutexes etc.).

```c
#pragma omp parallel for
for (int i = 0; i < 10000; i++) {
    array[i] = myFunction(i);
}
```

Different schedules for the parallelisation may be chosen, like this:

```c
#pragma omp parallel for schedule(kind [, chunk_size])
```

The different kinds of available schedules are:

Static
:   Divides the loop into equal-sized chunks, or as equal as possible. The default chunk size is `loop_count/number_of_threads`.

Dynamic
:   Uses an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1.

Guided
:   Similar to dynamic in that not all iterations are allocated at the start. However, threads get a large contiguous chunk to start with, and the chunk size gradually decreases as the program runs, down to the limit set in the `chunk_size` parameter.

Note these functions, as they may be necessary to know on the exam:

```c
omp_get_num_threads(); // returns the total number of threads
omp_get_thread_num();  // returns the id of this thread
```

## Pthreads

A basic threading library. Basically, a thread spawns other threads that start running a supplied function. There is no built-in barrier, but there is access to mutexes etc.

### Process vs thread

A process and a thread are both independent sequences of execution. The main difference is that threads spawned by the same process run in the same shared memory. They are able to alter each other's data, as well as pass messages between each other.

# GPUs - SIMD

## Advantages, limitations

## OpenCL

OpenCL is an ANSI C extension that compiles to kernels suitable for running on a GPGPU.

## CUDA

CUDA is a parallel computing platform. It is comparable to OpenCL in that it creates small programs (kernels) that are executed on a GPGPU.

### Architecture and memory model

Grid
:   The GPU is divided into a grid of blocks. The programmer may choose the size of this grid within the boundaries imposed by the hardware of the GPGPU. The grid has a common global memory, which all threads can read and write to at any time. It is slow because of the vast number of threads that are able to write to it, and it is often congested. The grid also has a global constant memory. This is a read-only memory which only the host may write to and all threads can read from. It is faster than the global memory.

Block
:   Each cell in this grid is called a block. A block is further divided into warps. The number of warps in a block can be set by the programmer within the hardware limitations of the GPGPU. All warps in a block have access to a shared memory for the block. It is not possible for a thread to access the shared memory of another block.

Warp
:   A warp is a collection of 32 threads that are executed together. It is not possible to do jumps in only some of the threads at a time; they are executed SIMD style, 16 at a time on the GPU, alternating.

Thread
:   A thread is a single execution of a kernel. It is normally one of many identical threads, but it will have its own unique input. This means that it may also have a unique output. A thread has access to its very own set of registers, which are (as always) the fastest memory available to the system.
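To make the grid/block/thread indexing concrete, here is a minimal CUDA sketch (the vector-add kernel and all names are illustrative, not taken from the course assignments). Each thread computes one element; its global index is built from `blockIdx`, `blockDim` and `threadIdx`, and the buffers live in global device memory:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Each thread computes one output element. Its global index is built from
   the block index, the block size and the thread index within the block. */
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  /* the last block may reach past the end of the arrays */
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                            /* buffers in global device memory */
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int block_size = 256;                              /* 8 warps of 32 threads per block */
    int grid_size = (n + block_size - 1) / block_size; /* enough blocks to cover all n elements */
    vector_add<<<grid_size, block_size>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); /* implicitly waits for the kernel */
    printf("c[0] = %f\n", h_c[0]);                       /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

With a block size of 256 there are 8 warps per block, and the index guard is needed because the last block may extend past the end of the arrays.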
# Serial optimisation

You want the serial portion to be as small as possible; less computation per core can drastically reduce wall time.

- Profile, profile, profile: find out where most time is spent, as this is typically where you want to optimise
- Remove branches
- Use libraries (ATLAS: linear algebra routines; PETSc: serial and parallel (MPI) routines)
- Location, location, location (you want to be able to predict memory usage for the cache)
- Use a better algorithm
- Tune compilation options

## Use early cases most often in small switches

Compilers will turn switch statements with few cases (typically anything less than 3) into a chain of if statements. Early cases are faster to find in such chains. In large switches, a slower jump table is used. Jump tables spend the same amount of time jumping regardless of which case is used.

# Race conditions

Race conditions, or race hazards, are situations where multiple input signals race to affect a system, and the order or timing in which they arrive determines some state. This is a hazard, as it may cause the system to enter some unintended state because the programmer did not handle all input orders that could occur.

## Non-critical race condition

A race condition which only affects some part of the system, while the system as a whole ends up in the same state regardless of input order. The program output is unaffected by such race conditions.

## Critical race condition

A critical race condition is a race condition which affects the final state of the program. This will often result in bugs.

# Locality of reference

A storage location, or a location related to it, is often accessed multiple times. A computer can increase performance by keeping this data ready for when it is accessed; this is the function of a cache. There are a few types of locality:

Temporal locality
:   Data is frequently accessed multiple times within a time frame.

Spatial locality
:   Data located close to recently accessed data is likely to be accessed (see the loop-order sketch after this list).

There exists locality in other areas as well:

Branch locality
:   The paths a branch has taken recently are likely to be repeated. This is a temporal behaviour, as the branch locations may be spaced far apart.
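As a small illustration of spatial locality (the matrix size is arbitrary), both loop nests below compute the same sum. The row-major traversal walks through memory contiguously and reuses cache lines, while the column-major traversal strides through memory and misses the cache on almost every access:

```c
#include <stdio.h>

#define N 2048

static double m[N][N];   /* stored row-major: m[i][0..N-1] are contiguous in memory */

int main(void) {
    double sum = 0.0;

    /* Row-major traversal: consecutive accesses hit the same cache lines
       (spatial locality), so this loop order is fast. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    /* Column-major traversal: each access jumps N*sizeof(double) bytes ahead,
       so nearly every access misses the cache. Same arithmetic, slower loop. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("%f\n", sum);
    return 0;
}
```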
# Other

- Alias in/out, MPI_Scatter
- Monte Carlo methods?

## Amdahl's Law

Amdahl's law is used to find the maximum expected improvement to a system when only a part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup when using multiple processors. The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.

**Example:** A program needs 20 hours using a single processor core. A particular portion of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (95%) of execution time can be parallelized. No matter how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that one hour. Hence, the speedup is limited to at most 20×.

The execution time of an algorithm when executed on $n$ threads, where $B$ is the fraction of the algorithm which cannot be parallelized:

$$ T(n) = T(1) \bigg( B + \frac{1}{n}(1-B) \bigg) $$

The theoretical speedup then becomes:

$$ S(n) = \frac{T(1)}{T(n)} = \frac{T(1)}{ T(1) \Big( B + \frac{1}{n}(1-B) \Big) } = \frac{1}{B + \frac{1}{n}(1-B)} $$

For the example above, $B = 1/20 = 0.05$, so $S(n) \to 1/B = 20$ as $n \to \infty$.

## Gustafson's Law

Gustafson's law says that computations involving arbitrarily large data sets can be efficiently parallelized.

$$ S(P) = P - \alpha \cdot (P - 1) $$

* $P$, the number of processors
* $S$, the speedup
* $\alpha$, the non-parallelizable fraction

Both laws are evaluated numerically in the sketch at the end of this compendium.

## Strong vs weak scaling

Strong scaling describes how a program's execution time changes when processors are added while the problem size is kept constant. Weak scaling describes how the execution time changes when the problem size is increased linearly with the number of processors.

## The Brick Wall

A combination of three other "walls":

The power wall
:   Power consumption increases exponentially with increasing operating frequency.

The memory wall
:   The increasing gap between processor and memory speeds.

The ILP wall
:   The diminishing returns of finding more instruction-level parallelism (ILP).

## NUMA

***Non-uniform Memory Access***

NUMA is a computer memory design used in multiprocessing. Because CPUs are so much faster than memory, sophisticated measures are needed to avoid stalling the CPU. Under NUMA, the memory access time depends on the memory location relative to the processor: memory which is local to the processor can be accessed faster than memory which is local to another processor, or memory which is shared between multiple processors.

## Replicated vs. Distributed Grids

|| || **Replicated** || **Distributed** ||
|| **Description** || The same data is copied to all nodes in the grid || Data is only stored on one node (like RAID) ||
|| **Pros** || Reliable || Scales much better ||
|| **Cons** || Low performance, scales poorly || Lower reliability ||
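As the numeric sketch referred to above (the processor counts are arbitrary, and the serial fraction 0.05 comes from the 20-hour example), Amdahl's and Gustafson's speedups can be tabulated directly from the formulas:

```c
#include <stdio.h>

/* Amdahl: S(n) = 1 / (B + (1 - B)/n), where B is the serial fraction. */
static double amdahl(double B, double n) {
    return 1.0 / (B + (1.0 - B) / n);
}

/* Gustafson: S(P) = P - alpha * (P - 1), where alpha is the serial fraction. */
static double gustafson(double alpha, double P) {
    return P - alpha * (P - 1.0);
}

int main(void) {
    double B = 0.05;   /* the 20-hour example: 1 of 20 hours is serial */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n=%4d  Amdahl S=%6.2f  Gustafson S=%7.2f\n",
               n, amdahl(B, n), gustafson(B, n));
    return 0;
}
```

Amdahl's speedup flattens out towards $1/B = 20$, while Gustafson's keeps growing because the problem size is assumed to grow with the processor count.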