TDT4200: Parallel Computing

One thing to remember: location, location, location (of your memory).

# Distributed computing

Supercomputers and clusters are not feasible with shared memory; enter message passing. Can achieve super-linear speedup due to more fast memory (cache) in total. Scales better for huge data sets (to achieve Gustafson's-law-like speedups).

## MPI

A standard for message passing, implemented by several libraries (for instance OpenMPI). MPI organises processes in communicators, which are collections of processes that can send messages to each other. MPI_COMM_WORLD contains all processes. There are also specialised communicators, such as the Cartesian communicator, which simplify organising the processes in a Cartesian grid.

# Shared memory parallelism

## OpenMP

Compiler directives for parallelising seemingly procedural code. Excellent at parallelising (for) loops. Does not expose the "low-level" features it uses (mutexes etc.).

```c
#pragma omp parallel for
for (int i = 0; i < 10000; i++) {
    array[i] = myFunction(i);
}
```

Different schedules for the parallelisation may be chosen, like this:

```
#pragma omp parallel for schedule(kind [, chunk_size])
```

The different kinds of available schedules are:

Static
:   Divides the loop into equal-sized chunks, or as equal as possible. The default chunk size is `loop_count/number_of_threads`.

Dynamic
:   Uses an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1.

Guided
:   Similar to dynamic scheduling in that not all iterations are allocated at the start. However, threads get a large contiguous chunk to start with, and the chunk size gradually decreases as the program runs, down to the limit set in the `chunk_size` parameter.

## Pthreads

Basic threading library: a thread spawns other threads that start running a supplied function. No built-in barrier, but access to mutexes etc.

# GPUs - SIMD

## Advantages, limitations

## OpenCL

## CUDA

# Serial optimisation

You want the serial portion to be as small as possible; less computation per core can drastically reduce wall time.

- Profile, profile, profile: find out where most time is spent, as this is typically where you want to optimise
- Remove branches
- Use libraries (ATLAS for linear algebra routines, PETSc for serial and parallel (MPI) routines)
- Location, location, location (you want predictable memory access patterns so data stays in cache)
- Better algorithms
- Tune compilation options

# Other

- NUMA (Non-uniform memory access)
- Amdahl's Law vs. Gustafson's Law (the standard formulas are restated just below this list)
- Heterogeneous systems
- Alias in/out, MPI_Scatter (a small sketch is given at the end of this compendium)
- Monte Carlo methods?
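Since the list above contrasts Amdahl's Law and Gustafson's Law, the two standard formulas are restated here for reference. The notation (p for the parallelisable fraction of the work, n for the number of processors) is chosen for this restatement and is not taken from the course material.

$$ S_{\text{Amdahl}}(n) = \frac{1}{(1 - p) + \frac{p}{n}} $$

$$ S_{\text{Gustafson}}(n) = (1 - p) + p \cdot n $$

Amdahl's Law keeps the problem size fixed, so the serial fraction (1 - p) bounds the achievable speedup no matter how many processors are added. Gustafson's Law lets the problem size grow with n, which is the sense in which huge data sets give the "Gustafson's-law-like speedups" mentioned under Distributed computing.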
## The Brick Wall

A combination of three other "walls":

The power wall
:   Consuming exponentially increasing power with increasing operating frequency.

The memory wall
:   The increasing gap between processor and memory speeds.

The ILP wall
:   The diminishing returns on finding more instruction-level parallelism (ILP).
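To make the "Alias in/out, MPI_Scatter" bullet and the MPI section above more concrete, here is a minimal sketch, assuming that "alias in/out" refers to aliasing the send and receive buffers, which MPI collectives only allow through MPI_IN_PLACE. The chunk size and variable names are illustrative choices, not something taken from the course material.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 4;  /* elements per process (illustrative) */
    int *my_part = malloc(chunk * sizeof(int));
    int *data = NULL;

    if (rank == 0) {
        /* Only the root holds the full array before the scatter. */
        data = malloc((size_t)size * chunk * sizeof(int));
        for (int i = 0; i < size * chunk; i++) {
            data[i] = i;
        }
        /* "Alias in/out": the root passes MPI_IN_PLACE as its receive buffer,
         * so its own chunk simply stays where it already is inside data[]. */
        MPI_Scatter(data, chunk, MPI_INT, MPI_IN_PLACE, chunk, MPI_INT,
                    0, MPI_COMM_WORLD);
    } else {
        /* Non-root ranks ignore the send arguments and just receive their chunk. */
        MPI_Scatter(NULL, chunk, MPI_INT, my_part, chunk, MPI_INT,
                    0, MPI_COMM_WORLD);
        printf("rank %d received elements starting at %d\n", rank, my_part[0]);
    }

    free(my_part);
    free(data);
    MPI_Finalize();
    return 0;
}
```

Compile and run with the MPI wrappers, e.g. `mpicc scatter.c -o scatter` followed by `mpirun -np 4 ./scatter`.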