TDT4200: Parallel Computing

One thing to remember: location, location, location (of your memory).

# Distributed computing

Supercomputers and clusters are not feasible with shared memory. Enter message passing. Can achieve super-linear speedup due to having more fast memory (cache) in total. Scales better for huge data sets (to achieve Gustafson's law-like speedups).

## MPI

A standard for message passing, implemented by several libraries (for instance OpenMPI). MPI organises processes in communicators, which are collections of processes that can send messages to each other. MPI_COMM_WORLD contains all processes. There are also specialised communicators, such as the Cartesian communicator, which simplify organising the processes in a Cartesian grid.

# Shared memory parallelism

## OpenMP

Compiler directives for parallelising seemingly procedural code. Excellent at parallelising (for) loops. Does not expose the "low-level" features it uses (mutexes etc.).

```c
#pragma omp parallel for
for (int i = 0; i < 10000; i++) {
    array[i] = myFunction(i);
}
```

Different schedules for the parallelisation may be chosen, like this:

```c
#pragma omp parallel for schedule(kind [, chunk_size])
```

The different kinds of available schedules are:

Static
: Divides the loop into equal-sized chunks, or as equal as possible. The default chunk size is `loop_count/number_of_threads`.

Dynamic
: Uses an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1.

Guided
: Similar to dynamic in that not all iterations are allocated at the start. However, threads get a large contiguous chunk to start with, and the chunk size gradually decreases as the program runs, down to the limit set in the `chunk_size` parameter.

## Pthreads

Basic threading library. A thread spawns other threads that start running a supplied function. No built-in barrier, but access to mutexes etc.

# GPUs - SIMD

## Advantages, limitations

## OpenCL

## CUDA

# Serial optimisation

You want the serial portion to be as small as possible; less computation per core can drastically reduce wall time.

- Profile, profile, profile: find out where most time is spent, as this is typically where you want to optimise
- Remove branches (see the sketch after this list)
- Use libraries (ATLAS for linear algebra routines; PETSc, which contains serial and parallel (MPI) routines)
- Location, location, location (you want memory access to be predictable for the cache)
- Better algorithm
- Tune compilation options
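
As a sketch of the "remove branches" point: a data-dependent branch inside a hot loop can often be replaced with arithmetic, which spares the branch predictor and may let the compiler vectorise the loop. The function names below are made up for illustration, and whether this actually wins depends on the compiler and the data.

```c
#include <stddef.h>

/* Branchy version: the if inside the loop is hard to predict
 * when the sign of the data is random. */
double sum_positive_branchy(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0.0)
            sum += a[i];
    }
    return sum;
}

/* Branchless version: the comparison becomes a 0/1 factor, so every
 * iteration does the same work and the loop is easier to vectorise. */
double sum_positive_branchless(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (a[i] > 0.0) * a[i];
    }
    return sum;
}
```
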
# Race conditions

Race conditions, or race hazards, are situations where multiple input signals race to affect a system, and the order or timing in which they arrive determines some state. This is a hazard, as it may cause the system to enter some unintended state because the programmer did not handle all input orders that could occur.

## Non-critical race condition

A non-critical race condition is a race condition which only affects some part of the system, while the system as a whole will end up in the same state regardless of input order. The program output is unaffected by such race conditions.
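
A minimal sketch of a harmless race, assuming Pthreads and C11 atomics (the variable and function names are made up): two threads race to set the same flag, but since both write the same value, the final state does not depend on which write happens first.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Both threads race to set the flag, but they write the same value,
 * so the final state is identical no matter which store "wins". */
static atomic_int work_is_done = 0;

static void *signal_done(void *arg) {
    (void)arg;
    atomic_store(&work_is_done, 1);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, signal_done, NULL);
    pthread_create(&t2, NULL, signal_done, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Always prints 1: the race is non-critical. */
    printf("work_is_done = %d\n", atomic_load(&work_is_done));
    return 0;
}
```
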
## Critical race condition

A critical race condition is a race condition which affects the final state of the program. This will often result in bugs, as in the sketch below.
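
A minimal sketch of a critical race, assuming Pthreads (names are made up): two threads increment a shared counter without synchronisation, so updates can be lost and the final value depends on how the threads interleave. Compile with e.g. `gcc -pthread`; the racy run typically prints a value below the expected total, while the mutex-protected run always reaches it.

```c
#include <pthread.h>
#include <stdio.h>

#define INCREMENTS 1000000

static long counter = 0;                                  /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Racy: counter++ is a read-modify-write, so two threads can read the
 * same old value and one of the increments gets lost. */
static void *racy_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        counter++;
    return NULL;
}

/* Fixed: the mutex makes the read-modify-write atomic, so the final
 * value is always 2 * INCREMENTS. */
static void *safe_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;

    counter = 0;
    pthread_create(&t1, NULL, racy_worker, NULL);
    pthread_create(&t2, NULL, racy_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("racy: %ld (expected %d)\n", counter, 2 * INCREMENTS);

    counter = 0;
    pthread_create(&t1, NULL, safe_worker, NULL);
    pthread_create(&t2, NULL, safe_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("safe: %ld (expected %d)\n", counter, 2 * INCREMENTS);

    return 0;
}
```
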
# Locality of reference

A storage location, or a location related to it, is often accessed multiple times. A computer can increase performance by keeping this data ready for when it is accessed; this is the function of a cache in a computer.
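
As a sketch of why this matters (the array name and size are made up): summing a matrix stored in row-major order touches memory sequentially when the inner loop runs over columns, but jumps a whole row ahead on every access when the loop order is swapped, which throws away spatial locality.

```c
#define N 4096

/* Row-major traversal: consecutive iterations touch consecutive
 * addresses, so almost every access hits in the cache. */
double sum_row_major(const double m[N][N]) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal of the same array: each access is N doubles
 * away from the previous one, so spatial locality is lost. */
double sum_col_major(const double m[N][N]) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```
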
There are a few types of locality:

Temporal locality
: Data is frequently accessed multiple times within a time frame.

Spatial locality
: Data located relatively close to recently accessed data is likely to be accessed.

There exists locality in other areas as well:

Branch locality
: The paths a branch has taken recently are likely to be repeated. This is a temporal behaviour, as the branch locations may be spaced far apart.

# Other

- Alias in/out, MPI_Scatter
- Monte Carlo methods?

## Amdahl's Law

Amdahl's law is used to find the maximum expected improvement to a system when only a part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors. The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.

**Example:** A program needs 20 hours using a single processor core. A particular portion of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (95%) of execution time can be parallelized. No matter how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that one hour. Hence, the speedup is limited to at most 20×.

The execution time of an algorithm when executed on $n$ threads, where $B$ is the fraction of the algorithm which cannot be parallelized:

$$ T(n) = T(1) \bigg( B + \frac{1}{n}(1-B) \bigg) $$

The theoretical speedup then becomes:

$$ S(n) = \frac{T(1)}{T(n)} = \frac{T(1)}{ T(1) \Big( B + \frac{1}{n}(1-B) \Big) } = \frac{1}{B + \frac{1}{n}(1-B)} $$

## Gustafson's Law

Gustafson's law says that computations involving arbitrarily large data sets can be efficiently parallelized.

$$ S(P) = P - \alpha \cdot (P - 1) $$

* $P$, the number of processors
* $S$, the speedup
* $\alpha$, the non-parallelizable fraction

## The Brick Wall

A combination of three other "walls":

The power wall
: Power consumption grows exponentially with increasing operating frequency.

The memory wall
: The increasing gap between processor and memory speeds.

The ILP wall
: The diminishing returns on finding more ILP (instruction-level parallelism).

## NUMA

***Non-uniform Memory Access***

NUMA is a computer memory design used in multiprocessing. Because CPUs are so much faster than memory, sophisticated measures are needed to avoid stalling the CPU. Under NUMA, the memory access time depends on the memory location relative to the processor: memory local to the processor can be accessed faster than memory local to another processor, or memory shared between multiple processors.

## Replicated vs. Distributed Grids

||                 || **Replicated**                                    || **Distributed**                             ||
|| **Description** || The same data is copied to all nodes in the grid || Data is only stored on one node (like RAID) ||
|| **Pros**        || Reliable                                          || Scales much better                          ||
|| **Cons**        || Low performance, scales poorly                    || Lower reliability                           ||