TDT4200: Parallel Computing
One thing to remember: location, location, location (of your memory)
Distributed computing
Supercomputers, clusters, not feasible with shared memory. Enter message passing. Can achieve super linear speedup due to more fast memory (cache). Scales better for huge data sets (to achieve Gustafson's law-like speedups).
MPI
A standard for message passing, implemented by several libraries (for instance OpenMPI).
MPI organises processes in communicators, which are collections of processes that can send messages to each other.
MPI_COMM_WORLD
contains all the processes.
In addition, there are specialised communicators such as the cartesian communicator to simplify organising the processes in a cartesian grid.
MPI supports a lot of different calls. Only some are part of this subject (These are some of them, but the list is incomplete):
int MPI_Ssend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
int MPI_Wait ( MPI_Request *request, MPI_Status *status)
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Issend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
Note that the functions starting on an I are non-blocking. Their counterparts without the I are however not always blocking calls. The regular send function is in many implementations only blocking when the data that is to be sent is to large to fit in the send buffer. The size of this buffer may vary, so depending on the system running the program it may or may not deadlock.
Let's look at one of the more advanced MPI functions: MPI_Reduce
.
The use case for MPI_Reduce
can be simple: say all processes have some data, and we would like to perform some operation on that data, and reduce it to a single element.
For instance, let's say all processes have an integer, and we would like to find the sum of all those integers.
This is the signature of the reduce function:
MPI_Reduce(void *input, void *output, int count, MPI_Datatype datatype, MPI_Op operation, int dest, MPI_Comm comm)
.
Note that while all processes need to pass a pointer to the output value, only one process will actually get the result - dest
.
The following minimal program illustrates the use of MPI_Reduce
:
#include <stdio.h>
#include <mpi.h>
int main() {
MPI_Init(NULL, NULL);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int number = rank;
int accumulator = 0;
MPI_Reduce(&number, &accumulator, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
printf("accumulator = %d\n", accumulator);
MPI_Finalize();
}
Other operators include MAX
, MIN
, PROD
, LOR
(logical OR) and BOR
(bitwise OR).
It is worth knowing that input
and output
can't point to the same variable.
Another variant of the function, MPI_Allreduce
is similar, but results in all processes getting the result;
naturally, this function does not take a dest
argument.
GPUs - SIMD
Why GPUs?
We say GPU, but we really mean GPGPUs, General Purpose GPUs. GPUs have a lower clock speed, but in turn they have a lot more cores. If used correctly, they can speed up computations that apply the same instruction to different data, SIMD. Code run on the GPU are usually small simple program units kalled kernels.
CUDA
Cuda is a parallel computing platform developed by Nvidia to be run on their graphics cards. CUDA is similar to C and is combiled down to binaries thay is linked into your application and then loaded onto the GPU.
OpenCL
OpenCL is a standard/API originally developed by Apple, but is currently maintained by the Khronos Group. OpenCL defines an API that allows you write code targeting numerous platforms, including NVIDIA. Platform owners, such as Intel, Nvidia, AMD, etc, then provide OpenCL runtimes for their platforms. Contrary to CUDA, the kernels are compiled when program is executing, rather than ahead of time and shipped with the program.
Pros and cons
OpenCL supports a lot more hardware, but this also means the API is more generalized. This can, though does not have to, reduce performance.
Launching a kernel
- Create a context
- Load the kernel source code
- Compile the kernel
- Allocate memory on device
- Copy memory to device
- Launch kernel
- Copy back result
These steps are the steps for OpenCL, for CUDA, 1-3 is done ahead of time, and thus only 4-7 is done when launching the kernel.
CUDA vs OpenCL concepts
CUDA | OpenCL |
Platform | |
Host | Host |
Device | Device |
Streaming Multiprocessor | Compute Unit |
CUDA Core | Processing Element |
CUDA and OpenCL terms
CUDA | OpenCL |
Thread | Work-Item |
Warp | Collection of Work-Items |
Thread-Block | Work-Group |
Global memory | Global memory |
Constant memory | Constant memory |
Shared memory | Local memory |
Local memory | Private memory |
__global__ function |
__kernel function |
__device__ function |
N/A |
__constant__ variable |
__constant variable |
__device__ variable |
__global variable |
__shared__ variable |
__local variable |
Architecture and memory model of CUDA
- Grid
- The GPU is divided into a grid of some blocks. The programmer may choose the size of this grid within the boundaries imposed by the hardware of the GPGPU. The grid has a common global memory, which all threads can read and write to at any time. It is slow because of the wast number of threads that are able to write to it, and it is often congested. The grid also has a global constant memory. This is a read only memory, which only the host may write to, and all threads can read from. It is faster than the global memory.
- Block
- Each cell in this grid is called a block. A block is further divided into warps. The number of warps in a block can be set by the programmer within the hardware limitations of the GPGPU. All warps in a block have access to a shared memory for the block. It is not possible for a thread to access the shared memory of another block.
- Warp
- A warp is a collection of 32 threads that are executed together. Threads are stored consecutively in warps, meaning that the first warp will contain thread 0. A warp executes one common instruction at a time (SIMD style), so full efficiency is realized when all threads of a warp agree on their execution path. If branch divergence occurs, each possible path will be executed, and threads not on that path will be disabled.
- Thread
- A thread is a single execution of a kernel. It is normally one of many identical threads, but will have it's own unique input. This mean that it may also have a unique output. A thread have access to it's very own set of registers, which are (as always) the fastest memory available to the system.
Serial optimisation
You want the serial portion to be as small as possible, less computation per core can drastically reduce wall time.
- Profile, profile, profile, find out where most time is spent, this is typically where you want to optimise
- Remove branches
- Use libraries (ATLAS, linear algebra routines, PETSc, contains serial and parallell (MPI) routines)
- Location, location, location (you want to be able to predict memory usage for cache)
- Better algorithm
- Tune compilation options
Use early cases most often in small switches.
Compilers will turn switch statements with few (typically anything less than 3) cases to a chain of if statements. Early cases are faster to find in such cases. In large switches, a slower jump table is used. Jump tables spend the same amount of time jumping regardless of what case is used.
Race conditions
Race conditions, or race hazards, are situations where multiple input signals race to affect a system. The order or timing in which they arrive determine some state. This is a hazard as it may cause the system to enter some unintended state, because the programmer did not handle all input orders that could occur.
Non-critical race condition
A race condition which only affects some part of the system, while the system as a whole will end up in the same state regardless of input order. The program output is unaffected by such race conditions.
Critical race condition
A critical race condition is a race condition which affects the final state of the program. This will often result in bugs.
How to fix/avoid race condition
Depends on what is causing it, but usually, you are looking for a synchronization mechanism. The ones we care about are
- Semaphore
- Mutex
- Spinlock
The samaphore is a counter that indicates the number of resources available. A thread will decrease the counter to aquire a resource, and increase it when it is finished. A mutex is a semaphore with the counter set to 1.
Locality of reference
A storage location or related location is often accessed multiple times. A computer can increase performance by keeping this data ready for when it is accessed. This is the function of a cache in a computer. There are a few types of locality:
- Temporal locality
- Data is frequently accessed multiple times within a time frame.
- Spatial locality
- Data located on a relatively close location to recently accessed data is likely to be accessed.
There exists locality in other areas as well:
- Branch locality
- The paths a branch have taken recently are likely to be repeated. This is a temporal behavior, as the branch locations may be spaced out far.
Other
- Alias in/out, MPI_Scatter
- Monte carlo methods?
Amdahl's Law
Amdahl's law is used to find maximum expected improvement to a system when only a part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
Example: A program needs 20 hours using a single processor core, and a particular portion of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (95%) of execution time can be parallelized. No matter how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than the one hour. Hence, the speedup is limited to at most 20×.
The execution time of an algorithm when executed on
The theoretical speedup then becomes:
Gustafson's Law
Gustafson's law says that computations involving arbitrarily large data sets can be efficiently parallelized.
$P$ , the number of processors$S$ , the speedup$\alpha$ , the non-parallelizable fraction
Strong vs weak scaling
Strong scaling describes how a program's execution time is altered when adding processors and keeping the problem size constant.
Weak scaling describes how a program's execution time is altered when adding processors and increasing the problem size linearly.
The Brick Wall
A combination of three other "walls":
- The power wall
- Consuming exponentially increasing power with increasing operating frequency.
- The memory wall
- The increasing gap between processor and memory speeds.
- The ILP wall
- The diminishing returns on finding more ILP.
NUMA
Non-uniform Memory Access
NUMA is a computer memory design which is used in multiprocessing. Because CPUs are so much faster than the memory, sophisticated measures are needed to avoid stalling the CPU. Under NUMA, the memory access time depends on the memory location relative to the processor. Memory which is local to the processor, can be accessed faster than memory which is local to another processor, or memory which is shared between multiple processors.
Replicated vs. Distributed Grids
Replicated | Distributed | |
Description | The same data is copied to all nodes in the grid | Data is only stored on one node (like RAID) |
Pros | Reliable | Scales much better |
Cons | Low performance, scales poorly | Lower reliability |