TDT4200: Parallel Computing
One thing to remember: location, location, location (of your memory)
# Distributed computing
Supercomputers and clusters cannot feasibly be built on shared memory alone.
Enter message passing. It can achieve superlinear speedup because adding nodes also adds fast memory (cache).
It also scales better for huge data sets (to achieve Gustafson's law-like speedups).
## MPI
A standard for message passing, implemented by several libraries (for instance OpenMPI).
MPI organises processes into communicators: collections of processes that can send messages to each other.
MPI_COMM_WORLD contains all processes, and there are specialised communicators, such as the Cartesian communicator, that simplify organising the processes in a Cartesian grid.
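A minimal sketch in C of how this fits together (the 2D grid shape and variable names are illustrative; the MPI calls themselves are standard):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Let MPI pick a 2D grid shape that fits the number of processes.
    int dims[2] = {0, 0};
    MPI_Dims_create(world_size, 2, dims);

    int periods[2] = {0, 0};   // non-periodic in both dimensions
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    // Ranks may be reordered in the new communicator, so query it again.
    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);
    printf("Rank %d of %d is at grid position (%d, %d)\n",
           cart_rank, world_size, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}
```

Compiled with `mpicc` and launched with e.g. `mpirun -np 4`, each process prints its own position in the grid.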
# Shared memory parallelism
## OpenMP
Compiler directives for parallelizing seemingly procedural code.
Excellent at parallelizing (for) loops.
Does not expose the "low-level" primitives it uses internally (mutexes etc.).
```c
#pragma omp parallel for
for (int i = 0; i < 10000; i++) {
    array[i] = myFunction(i);
}
```
The schedule used to distribute iterations across threads can be chosen with the `schedule` clause:
```c
#pragma omp parallel for schedule(kind [, chunk_size])
```
The available schedule kinds are:
Static
: Divides the loop into chunks that are as close to equal-sized as possible.
The default chunk size is `loop_count/number_of_threads`.
Dynamic
: Uses an internal work queue to hand out a chunk-sized block of loop iterations to each thread. When a thread finishes its block, it retrieves the next one from the top of the work queue. By default, the chunk size is 1.
Guided
: Guided scheduling is similar to dynamic in that not all iterations are allocated at the start. However, threads get a large contiguous chunk to start with, and the chunk size gradually decreases as the program runs, down to the limit set in the `chunk_size` parameter.
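As an illustration of the `schedule` clause, a self-contained sketch (the stand-in `myFunction` and the chunk size of 4 are arbitrary choices): dynamic scheduling pays off when the cost per iteration varies.

```c
#include <stdio.h>

#define N 10000

// Hypothetical stand-in for myFunction() above; its cost varies with i,
// which is exactly the situation where dynamic scheduling helps.
static double myFunction(int i)
{
    double x = 0.0;
    for (int k = 0; k < i % 1000; k++)
        x += k * 0.5;
    return x;
}

static double array[N];

int main(void)
{
    // Chunks of 4 iterations are handed out on demand, so threads that
    // finish their chunk early immediately pick up more work.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        array[i] = myFunction(i);

    printf("array[42] = %f\n", array[42]);
    return 0;
}
```

Compile with `-fopenmp` (GCC/Clang); without it the pragma is ignored and the loop simply runs serially.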
## Pthreads
A basic threading library: a thread spawns other threads, each of which starts running a supplied function.
There is no built-in barrier in the original API (pthread_barrier_t was added later as an optional extension), but mutexes, condition variables, etc. are available.
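A minimal sketch of the spawn-and-join pattern (the worker function and the shared counter are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

// The supplied function each spawned thread starts running.
static void *worker(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&counter_lock);   // serialise access to shared state
    counter += id;
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   // wait for all workers to finish

    printf("counter = %ld\n", counter);
    return 0;
}
```

Compile and link with `-pthread`. The mutex serialises updates to the shared counter; `pthread_join` is the simplest way to wait for all workers.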
# GPUs - SIMD
## Advantages, limitations
## OpenCL
## CUDA
# Serial optimisation
You want the serial portion to be as small as possible;
less computation per core can drastically reduce wall time.
- Profile, profile, profile: find out where most time is spent; this is typically where you want to optimise
- Remove branches
- Use libraries (e.g. ATLAS for linear algebra routines, PETSc for serial and parallel (MPI) routines)
- Location, location, location (you want predictable memory access patterns so data stays in cache; see the sketch after this list)
- Better algorithm
- Tune compilation options
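As a sketch of the locality point above (the array size is arbitrary): C stores 2D arrays row-major, so the loop order alone decides whether memory is touched sequentially.

```c
#define N 1024

static double a[N][N];

// Cache-friendly: the inner index j varies fastest, so consecutive
// iterations touch consecutive addresses and the prefetcher keeps up.
double sum_row_major(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

// Cache-hostile: each access jumps N * sizeof(double) bytes, so for
// large N nearly every access misses the cache, even though the
// arithmetic is identical.
double sum_column_major(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions compute the same sum, but the row-major traversal is typically several times faster for large `N` because it moves through memory sequentially.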
# Other
- Amdahl's Law vs. Gustafson's Law
- Heterogeneous systems
- Alias in/out, MPI_Scatter
- Monte Carlo methods?
## Amdahl's Law
Amdahl's law is used to find the maximum expected improvement to a system when only a part of the system is improved.
It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
**Example:** A program needs 20 hours using a single processor core,
and a particular portion of the program which takes one hour to execute cannot be parallelized,
while the remaining 19 hours (95%) of execution time can be parallelized.
No matter how many processors are devoted to a parallelized execution of this program,
the minimum execution time cannot be less than the one hour.
Hence, the speedup is limited to at most 20×.
The execution time $T(n)$ of an algorithm executed on $n$ threads,
where $B$ is the fraction of the algorithm which cannot be parallelized, is:
$$ T(n) = T(1) \bigg( B + \frac{1}{n}(1-B) \bigg) $$
The theoretical speedup then becomes:
$$ S(n) = \frac{T(1)}{T(n)} = \frac{1}{B + \frac{1}{n}(1-B)} $$
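Applying this to the example above, with $B = 0.05$ (the one hour out of 20 that cannot be parallelized), the speedup is bounded by $1/B$ no matter how many threads are used:
$$ \lim_{n \to \infty} S(n) = \lim_{n \to \infty} \frac{1}{B + \frac{1}{n}(1-B)} = \frac{1}{B} = \frac{1}{0.05} = 20 $$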
## The Brick Wall
A combination of three other "walls":
The power wall
: Consuming exponentially increasing power with increasing operating frequency.
The memory wall
: The increasing gap between processor and memory speeds.
The ILP wall
: The diminishing returns on finding more instruction-level parallelism (ILP).
## NUMA
***Non-uniform Memory Access***
NUMA is a computer memory design which is used in multiprocessing.
Because CPUs are so much faster than the memory, sophisticated measures are needed to avoid stalling the CPU.
Under NUMA, the memory access time depends on the memory location relative to the processor.
Memory which is local to a processor can be accessed faster than memory which is local to another processor or shared between multiple processors.
## Replicated vs. Distributed Grids
|                 | **Replicated**                                    | **Distributed**                              |
|-----------------|---------------------------------------------------|----------------------------------------------|
| **Description** | The same data is copied to all nodes in the grid  | Data is only stored on one node (like RAID)  |
| **Pros**        | Reliable                                          | Scales much better                           |
| **Cons**        | Low performance, scales poorly                    | Lower reliability                            |