TMA4280: Introduction to Supercomputing

# Practical notes

This exam lets you (as of 2014 at least) bring most printed and hand-written examination aids, which means that if you print and bring:

1. All lecture notes, slides, codes, exercises and suggested solutions from the TMA4280 course material.
2. Rottmann: Mathematical formulas.
3. Earlier exams + suggested solutions in TMA4280.
4. The LINPACK specification and FAQ.
5. All handwritten notes, including annotations on printed material.
6. A simple, approved calculator.

you should be pretty alright on the exam.
# Introduction
Some of the theory on parallelisation is covered by [TDT4200: Parallel Computing](/TDT4200); the relevant parts are distributed and shared memory parallelisation, as well as general parallelisation theory. In addition to the topics covered by TDT4200, TMA4280 covers some of the mathematical theory behind the problems typically solved by supercomputers, and efficient ways of finding said solutions.

# Shared and distributed memory parallelisation

It is not feasible to share all the memory on large clusters, which contain anything from just below 10000 cores to 3 million cores on the world's fastest supercomputer. A hybrid model is therefore often utilised, with several cores sharing memory within each node and message passing between nodes.

## Distributed memory parallelisation: MPI

Message Passing Interface. There are four main modes of communication: one-to-one, one-to-all, all-to-one and all-to-all. MPI organises processes into groups, which are ordered sets of processes, potentially with virtual topologies such as placement in a Cartesian grid; a communicator provides the communication context for such a group. Distributed memory forces you to think about where your data is, which is good. It also excludes race conditions, because each process has its own part of memory to work with.

### Distributed file I/O

MPI I/O lets several processes read and write different parts of the same file in parallel, instead of funnelling all file access through a single process.

### One-to-one

One process communicates with one other process. The MPI calls used would be MPI\_Send and MPI\_Recv.

### One-to-all

Send a message from one process to all processes in the communicator, using MPI\_Bcast. Useful when sending configuration parameters. If you need to send a different message to each process, use MPI\_Scatter, which sends different parts of an array to different processes. Useful when all processes need to sum a part of an array, for example.

### All-to-one

This category can be used to make all processes calculate a partial result and store all the parts on one process. MPI\_Gather is used to collect the results on one process.

### All-to-all

MPI\_Allgather collects a piece of data from every process and distributes the combined result to every process.

### Communicators

Communicators in MPI let processes in a group send data to each other. We usually only use MPI\_COMM\_WORLD as the communicator, but more complicated models with multiple communicators are possible. Every process in a group has a unique rank.
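To make the one-to-one mode concrete, here is a minimal sketch (not taken from the course material) where rank 0 sends an integer to rank 1 with MPI\_Send and MPI\_Recv:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send one int to rank 1 with message tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive of one int from rank 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run with, e.g., `mpirun -np 2 ./a.out`.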
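The collective modes can be sketched the same way. The following (again a sketch, assuming the number of processes divides the array length `N`) scatters an array from rank 0, lets every process sum its own chunk, and gathers the partial sums back on rank 0:

```c
#include <mpi.h>
#include <stdio.h>

#define N 8  /* assumed divisible by the number of processes */

int main(int argc, char **argv)
{
    int rank, size;
    double data[N], part[N], partial = 0.0, sums[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;  /* elements handled by each process */
    if (rank == 0)
        for (int i = 0; i < N; i++)
            data[i] = i + 1.0;

    /* One-to-all: each process receives its own chunk of the array */
    MPI_Scatter(data, chunk, MPI_DOUBLE, part, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)
        partial += part[i];

    /* All-to-one: rank 0 collects one partial sum per process */
    MPI_Gather(&partial, 1, MPI_DOUBLE, sums, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        double total = 0.0;
        for (int i = 0; i < size; i++)
            total += sums[i];
        printf("Total: %g\n", total);  /* 36 for N = 8 */
    }

    MPI_Finalize();
    return 0;
}
```

MPI\_Bcast works the same way but sends identical data to every process, and MPI\_Allgather behaves like an MPI\_Gather followed by a broadcast of the gathered result.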
## Shared memory parallelisation: OpenMP

Parallelisation through threading (on the same node). All threads have access to the same memory, which makes race conditions possible. OpenMP has several directives to avoid race conditions, like atomic, critical and barrier; a minimal sketch is given at the end of this compendium.

# The maths used in this course

TMA4280 is a maths course, but most of the curriculum is centered around parallel computing. Nevertheless, some actual knowledge of maths is required.

## Elliptic partial differential equation

> An elliptic partial differential equation is a general partial differential equation of second order of the form
> $$ Au_{xx} + 2Bu_{xy} + Cu_{yy} + Du_x + Eu_y + F = 0 $$
> that satisfies the condition
> $$ B^2 - AC < 0 $$
> (assuming implicitly that $ u_{xy} = u_{yx} $).

### The Poisson problem

The Poisson equation is an elliptic partial differential equation. The Poisson problem is the solution of the Poisson equation given boundary conditions. The Poisson equation is typically denoted
$$ -\nabla^2 u = f \quad \text{in}\ \Omega. $$
Here, $ u $ is the unknown, $ f $ is the load, and $ \Omega $ is the domain; $ \nabla^2 $ is the Laplacian, the sum of the second order partial derivatives. In two dimensions this reads $ -u_{xx} - u_{yy} = f $, so $ A = C = -1 $ and $ B = 0 $, giving $ B^2 - AC = -1 < 0 $, which confirms ellipticity. A standard way of discretising the Poisson problem is sketched at the end of this compendium.

## Speedup

To determine the speedup from 1 to $ P $ processors, we use the times $ T_1 $ and $ T_P $:
$$ S_P = \frac{T_1}{T_P}. $$
$ S_P = P $ is the best theoretically possible speedup. More processors require more communication, and achieve less speedup per processor; in other words, the parallel efficiency $ \eta_P = S_P / P $ typically decreases as $ P $ grows. For example, if $ T_1 = 100 $ s and $ T_{16} = 8 $ s, then $ S_{16} = 12.5 $ and $ \eta_{16} \approx 0.78 $.
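As promised in the OpenMP section above, here is a minimal sketch (not from the course material) of a shared-memory loop where an atomic directive prevents a race condition on the shared accumulator:

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* The iterations are split between threads; 'sum' is shared by
       all of them, so unsynchronised updates would be a race. */
    #pragma omp parallel for
    for (int i = 1; i <= n; i++) {
        double term = 1.0 / ((double)i * i);
        #pragma omp atomic
        sum += term;  /* the atomic update avoids the race condition */
    }

    /* The series converges to pi^2/6 ~= 1.6449 */
    printf("sum = %.10f\n", sum);
    return 0;
}
```

In practice a `reduction(+:sum)` clause on the parallel for would be the idiomatic (and much faster) alternative, since it gives each thread a private accumulator and combines them at the end.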
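Returning to the Poisson problem: a sketch of how it is typically made computable (the standard second-order finite-difference approach; the details here go beyond what is written above). Discretise $ \Omega $ with a uniform grid of spacing $ h $ and approximate the Laplacian with the five-point stencil, giving one equation per interior grid point $ (i, j) $:
$$ -\frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{h^2} = f_{i,j}. $$
The result is a large, sparse system of linear equations, which is exactly the kind of problem the parallel techniques above are used to solve efficiently.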