‹

TDT4200: Parallel Computing

Tags:

One thing to remember: location, location, location (of your memory)

Distributed computing

Supercomputers, clusters, not feasible with shared memory. Enter message passing. Can achieve super linear speedup due to more fast memory (cache). Scales better for huge data sets (to achieve Gustafson's law-like speedups).

MPI

A standard for message passing, implemented by several libraries (for instance OpenMPI).

MPI organises processes in communicators, which are collections of processes that can send messages to each other. MPI_COMM_WORLD contains all the processes. In addition, there are specialised communicators such as the cartesian communicator to simplify organising the processes in a cartesian grid.

MPI supports a lot of different calls. Only some are part of this subject (These are some of them, but the list is incomplete):

int MPI_Ssend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
int MPI_Wait ( MPI_Request *request, MPI_Status *status)
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Issend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )

Note that the functions starting on an I are non-blocking. Their counterparts without the I are however not always blocking calls. The regular send function is in many implementations only blocking when the data that is to be sent is to large to fit in the send buffer. The size of this buffer may vary, so depending on the system running the program it may or may not deadlock.

Let's look at one of the more advanced MPI functions: MPI_Reduce. The use case for MPI_Reduce can be simple: say all processes have some data, and we would like to perform some operation on that data, and reduce it to a single element. For instance, let's say all processes have an integer, and we would like to find the sum of all those integers. This is the signature of the reduce function: MPI_Reduce(void *input, void *output, int count, MPI_Datatype datatype, MPI_Op operation, int dest, MPI_Comm comm). Note that while all processes need to pass a pointer to the output value, only one process will actually get the result - dest. The following minimal program illustrates the use of MPI_Reduce:

#include <stdio.h>
#include <mpi.h>

int main() {
  MPI_Init(NULL, NULL);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int number = rank;
  int accumulator = 0;
  MPI_Reduce(&number, &accumulator, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("accumulator = %d\n", accumulator);
  MPI_Finalize();
}

Other operators include MAX, MIN, PROD, LOR (logical OR) and BOR (bitwise OR). It is worth knowing that input and output can't point to the same variable. Another variant of the function, MPI_Allreduce is similar, but results in all processes getting the result; naturally, this function does not take a dest argument.

Shared memory parallellism

OpenMP

OperMP is an API for multi-platform shared memory multiprocessing. Compiler directives for parallellising seemingly procedural code. Excellent at parallellising (for) loops. Does not expose "low-level" features used, mutexes etc.

#pragma omp parallel
for(int i = 0; i < 10000; i++){
    array[i] = myFunction(i);
}

Different schedules for the parallelizing may be chosen, like this:

#pragma omp parallel for schedule(kind [, chunk_size])

The different kinds of available schedules are

Static: Divides the loop into equal-sized chunks or as equal as possible. The default chunk size is loop_count/number_of_threads.
Dynamic: Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1.
Guided: Guided scheduling is similar to dynamic in that not all iterations are allocated at the start. However, threads get a large contiguous chunk to start with, and the chunk size gradually decreases as the program runs, down to the limit set in the chunk_size parameter.

Note the functions, as they may be necessary to know on the exam:

omp_get_num_threads(); // return the total number of threads
omp_get_thread_num();  // return the id of this thread

Pthreads

Basic threading library. Basically a thread spawns other threads that start running a supplied function. No built-in barrier, but access to mutexes etc.

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine) (void *), void *arg);

int pthread_join(pthread_t thread, void **value_ptr);

Process vs thread

A Process and a thread are both independent sequences of execution. The main difference is that threads spawned by the same process run in the same shared memory. They are able to alter each others data, as well as pass messages between each other.

GPUs - SIMD

Why GPUs?

We say GPU, but we really mean GPGPUs, General Purpose GPUs. GPUs have a lower clock speed, but in turn they have a lot more cores. If used correctly, they can speed up computations that apply the same instruction to different data, SIMD. Code run on the GPU are usually small simple program units kalled kernels.

CUDA

Cuda is a parallel computing platform developed by Nvidia to be run on their graphics cards. CUDA is similar to C and is combiled down to binaries thay is linked into your application and then loaded onto the GPU.

OpenCL

OpenCL is a standard/API originally developed by Apple, but is currently maintained by the Khronos Group. OpenCL defines an API that allows you write code targeting numerous platforms, including NVIDIA. Platform owners, such as Intel, Nvidia, AMD, etc, then provide OpenCL runtimes for their platforms. Contrary to CUDA, the kernels are compiled when program is executing, rather than ahead of time and shipped with the program.

Pros and cons

OpenCL supports a lot more hardware, but this also means the API is more generalized. This can, though does not have to, reduce performance.

Launching a kernel

Create a context
Load the kernel source code
Compile the kernel
Allocate memory on device
Copy memory to device
Launch kernel
Copy back result

These steps are the steps for OpenCL, for CUDA, 1-3 is done ahead of time, and thus only 4-7 is done when launching the kernel.

CUDA vs OpenCL concepts

CUDA	OpenCL
	Platform
Host	Host
Device	Device
Streaming Multiprocessor	Compute Unit
CUDA Core	Processing Element

CUDA and OpenCL terms

CUDA	OpenCL
Thread	Work-Item
Warp	Collection of Work-Items
Thread-Block	Work-Group
Global memory	Global memory
Constant memory	Constant memory
Shared memory	Local memory
Local memory	Private memory
`__global__`function	`__kernel` function
`__device__` function	N/A
`__constant__` variable	`__constant` variable
`__device__` variable	`__global` variable
`__shared__` variable	`__local` variable

Architecture and memory model of CUDA

Grid: The GPU is divided into a grid of some blocks. The programmer may choose the size of this grid within the boundaries imposed by the hardware of the GPGPU. The grid has a common global memory, which all threads can read and write to at any time. It is slow because of the wast number of threads that are able to write to it, and it is often congested. The grid also has a global constant memory. This is a read only memory, which only the host may write to, and all threads can read from. It is faster than the global memory.
Block: Each cell in this grid is called a block. A block is further divided into warps. The number of warps in a block can be set by the programmer within the hardware limitations of the GPGPU. All warps in a block have access to a shared memory for the block. It is not possible for a thread to access the shared memory of another block.
Warp: A warp is a collection of 32 threads that are executed together. Threads are stored consecutively in warps, meaning that the first warp will contain thread 0. A warp executes one common instruction at a time (SIMD style), so full efficiency is realized when all threads of a warp agree on their execution path. If branch divergence occurs, each possible path will be executed, and threads not on that path will be disabled.
Thread: A thread is a single execution of a kernel. It is normally one of many identical threads, but will have it's own unique input. This mean that it may also have a unique output. A thread have access to it's very own set of registers, which are (as always) the fastest memory available to the system.

Serial optimisation

You want the serial portion to be as small as possible, less computation per core can drastically reduce wall time.

Profile, profile, profile, find out where most time is spent, this is typically where you want to optimise
Remove branches
Use libraries (ATLAS, linear algebra routines, PETSc, contains serial and parallell (MPI) routines)
Location, location, location (you want to be able to predict memory usage for cache)
Better algorithm
Tune compilation options

Use early cases most often in small switches.

Compilers will turn switch statements with few (typically anything less than 3) cases to a chain of if statements. Early cases are faster to find in such cases. In large switches, a slower jump table is used. Jump tables spend the same amount of time jumping regardless of what case is used.

Race conditions

Race conditions, or race hazards, are situations where multiple input signals race to affect a system. The order or timing in which they arrive determine some state. This is a hazard as it may cause the system to enter some unintended state, because the programmer did not handle all input orders that could occur.

Non-critical race condition

A race condition which only affects some part of the system, while the system as a whole will end up in the same state regardless of input order. The program output is unaffected by such race conditions.

Critical race condition

A critical race condition is a race condition which affects the final state of the program. This will often result in bugs.

How to fix/avoid race condition

Depends on what is causing it, but usually, you are looking for a synchronization mechanism. The ones we care about are

Semaphore
Mutex
Spinlock

The samaphore is a counter that indicates the number of resources available. A thread will decrease the counter to aquire a resource, and increase it when it is finished. A mutex is a semaphore with the counter set to 1.

Locality of reference

A storage location or related location is often accessed multiple times. A computer can increase performance by keeping this data ready for when it is accessed. This is the function of a cache in a computer. There are a few types of locality:

Temporal locality: Data is frequently accessed multiple times within a time frame.
Spatial locality: Data located on a relatively close location to recently accessed data is likely to be accessed.

There exists locality in other areas as well:

Branch locality: The paths a branch have taken recently are likely to be repeated. This is a temporal behavior, as the branch locations may be spaced out far.

Other

Alias in/out, MPI_Scatter
Monte carlo methods?

Amdahl's Law

Amdahl's law is used to find maximum expected improvement to a system when only a part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.

The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.

Example: A program needs 20 hours using a single processor core, and a particular portion of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (95%) of execution time can be parallelized. No matter how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than the one hour. Hence, the speedup is limited to at most 20×.

The execution time of an algorithm when executed on $n$ threads, where $B$ is the fraction of the algorithm which cannot be parallelized:

$$ T(n) = T(1) \bigg( B + \frac{1}{n}(1-B) \bigg) $$

The theoretical speedup then becomes:

$$ S(n) = \frac{T(1)}{T(n)} = \frac{T(1)}{ T(1) \Big( B + \frac{1}{n}(1-B) \Big) } = \frac{1}{B + \frac{1}{n}(1-B)} $$

Gustafson's Law

Gustafson's law says that computations involving arbitrarily large data sets can be efficiently parallelized.

$$ S(P) = P - \alpha \cdot (P - 1) $$

$P$, the number of processors
$S$, the speedup
$\alpha$, the non-parallelizable fraction

Strong vs weak scaling

Strong scaling describes how a program's execution time is altered when adding processors and keeping the problem size constant.

Weak scaling describes how a program's execution time is altered when adding processors and increasing the problem size linearly.

The Brick Wall

A combination of three other "walls":

The power wall: Consuming exponentially increasing power with increasing operating frequency.
The memory wall: The increasing gap between processor and memory speeds.
The ILP wall: The diminishing returns on finding more ILP.

NUMA

Non-uniform Memory Access

NUMA is a computer memory design which is used in multiprocessing. Because CPUs are so much faster than the memory, sophisticated measures are needed to avoid stalling the CPU. Under NUMA, the memory access time depends on the memory location relative to the processor. Memory which is local to the processor, can be accessed faster than memory which is local to another processor, or memory which is shared between multiple processors.

Replicated vs. Distributed Grids

	Replicated	Distributed
Description	The same data is copied to all nodes in the grid	Data is only stored on one node (like RAID)
Pros	Reliable	Scales much better
Cons	Low performance, scales poorly	Lower reliability

Written by

Stian Jensen niklasmh emiltayl sindrsb paal runholm martinhath aliceg

Last updated: Sat, 17 Nov 2018 14:28:14 +0100 .