TDT4200: Parallel Computing
One thing to remember: location, location, location (of your memory)
# Distributed computing
Supercomputers and clusters: shared memory is not feasible at that scale, so enter message passing. Can achieve superlinear speedup because the total amount of fast memory (cache) grows with the number of nodes. Scales better for huge data sets (to achieve Gustafson's-law-like speedups).
## MPI
A standard for message passing, implemented by several libraries (for instance Open MPI).
MPI organises processes into communicators, which are collections of processes that can send messages to each other. MPI_COMM_WORLD contains all processes. There are also specialised communicators, such as the Cartesian communicator, which simplify organising processes in a Cartesian grid.
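
A minimal sketch of the two communicator ideas above: every process starts in MPI_COMM_WORLD, and MPI_Cart_create derives a Cartesian communicator from it (the 2D grid shape here is just an illustrative choice):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Let MPI factor the processes into a 2D grid. */
    int dims[2] = {0, 0};
    MPI_Dims_create(world_size, 2, dims);

    /* Non-periodic boundaries in both directions; allow rank reordering. */
    int periods[2] = {0, 0};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* Ranks may be reordered in the new communicator, so query it directly. */
    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);

    printf("world rank %d is grid cell (%d, %d) of %dx%d\n",
           world_rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```
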
# Shared memory parallelism
## OpenMP
Compiler directives for parallelising seemingly procedural code. Excellent at parallelising (for) loops. Does not expose the "low-level" machinery it uses underneath (mutexes etc.).
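
A small sketch of the typical use case (the arrays and loop bodies are placeholders): one directive parallelises a for loop, and a reduction is expressed declaratively rather than with explicit locks.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    /* Each iteration is independent, so one directive splits the loop
       across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Reductions are also expressed declaratively; OpenMP handles the
       per-thread partial sums and the final combine. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += c[i];

    printf("sum = %f\n", sum);
    return 0;
}
```
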
## Pthreads
Basic threading library. The main thread spawns other threads, each of which starts running a supplied function. No built-in barrier in the original standard (pthread_barrier_t is a later POSIX addition and not universally available), but mutexes, condition variables etc. are provided.
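
A small sketch (names like worker and counter are just placeholders) of the spawn-and-join pattern, with a mutex protecting shared state:

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Shared counter protected by a mutex. */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each spawned thread starts running this function. */
static void *worker(void *arg)
{
    long id = (long)arg;

    pthread_mutex_lock(&counter_lock);
    counter += id;
    pthread_mutex_unlock(&counter_lock);

    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    /* The main thread spawns the workers... */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* ...and joining them acts as a simple end-of-work barrier. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    printf("counter = %ld\n", counter);
    return 0;
}
```
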
# GPUs - SIMD
## Advantages, limitations
## OpenCL
## CUDA
# Serial optimisation
You want the serial portion to be as small as possible; less computation per core can drastically reduce wall time.
- Profile, profile, profile: find out where most time is spent; this is typically where you want to optimise
- Remove branches
- Use libraries (e.g. ATLAS for linear algebra routines, PETSc for serial and parallel (MPI) routines)
- Location, location, location (you want predictable, cache-friendly memory access patterns; see the sketch after this list)
- Better algorithm
- Tune compilation options
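
A minimal illustration of the "location" point: both functions compute the same sum over a matrix (the array and its size are placeholders), but the loop order decides whether memory is traversed contiguously and therefore cache-friendly.

```c
#include <stdio.h>

#define N 2048

static double a[N][N];

/* Cache-friendly: traverses memory contiguously (C arrays are row-major). */
double sum_rows_first(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Cache-hostile: jumps N * sizeof(double) bytes between accesses,
   so most accesses miss the cache. */
double sum_cols_first(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    printf("%f %f\n", sum_rows_first(), sum_cols_first());
    return 0;
}
```
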
# Other
- NUMA (Non-uniform memory access)
- Amdahl's Law vs. Gustafson's Law (the usual forms are sketched after this list)
- Heterogeneous systems
- Alias in/out, MPI_Scatter
- Monte Carlo methods?
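
For reference, the usual textbook forms of the two laws, with s the serial fraction and p the number of processes:

```latex
% Amdahl (fixed problem size): speedup is capped by the serial fraction s.
S_{\text{Amdahl}}(p) = \frac{1}{s + \frac{1 - s}{p}} \le \frac{1}{s}

% Gustafson (problem size scales with p): the parallel part grows instead.
S_{\text{Gustafson}}(p) = s + p\,(1 - s)
```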