This is an old version of the compendium, written June 1, 2014, 1:50 p.m. Changes made in this revision were made by stiaje. View rendered version.

TDT4190: Distributed systems

> Better than monotonic systems. (This is "funny" if you've taken the course [TDT4205](/TDT4205)) # Characterization of Distributed Systems (Chapter 1) **Distributed systems:** Components located at networked computers which communicate and coordinate by passing messages between them. This definition has three big implications: - **Concurrency:** All computers connected in a network are able to do work at the same time. - **No global clock:** When programs are executing a task together, in a network with delays, how can they agree on what time it is? They need some kind of system to handle timing of actions. - **Independent failures:** All computer programs fail. Distributed systems can fail in even more ways than your regular non-network-connected software. When two computers are working together, and one of them stops responding, it's hard for the second one to know if the computer is dead, the network has failed, or if it's just become terribly slow.

## Why use a distributed system - Enables sharing of resources between multiple users. - A~~dvantages of distr~~pplications that allow multiple users to communicate are not possib~~uted systems~~ le without distributed systems. - ~~Share resources~~ Allows for higher scalab~~etween multiple users~~ility than a single computer. ~~- Chat - Facebook, y'all - Scalability and such~~

## Challenges in a distributed system ### Heterogeneity Distributed systems may consist of many different types of networks, hardware, operating systems, programming languages and differing implementations. For instance, a computer connected to the internet through ethernet should be able to communicate with someone using Wi-Fi. Likewise, a computer running a UNIX based operating system needs to be able to exchange data with a Windows computer (although it would much rather just hang out with its Linux friends). To overcome such differences, we need standards for how data should be exchanged that needs to be agreed upon by everyone. A solution that helps disguise these differences between systems is called _middleware_. Middleware takes care of all the peculiarities of a specific system and provides a consistent API for programming against. Virtual machines are another solution to the problem of heterogeneity. A virtual machine will let you run the same code on many different kinds of machines. The JVM (Java Virtual Machine) is a very common example of this. One virtual machine needs to be implemented for each system it needs to run on, but then you only need to compile code for running on that single virtual machine. ### Openness Specifications and documentation for software interface of network components. RFCs (Request For Comments) describe how most of the internet protocols are constructed. The W3C publishes and develops standards for the web. ### Security Security in distributed systems includes: - **Confidentiality:** Avoiding unauthorized access to systems. - **Integrity:** Avoiding corruption of data. - **Availability:** Securing against DoS attacks. ### Scalability A system is scalable if it can remain efficient even under high demand. Achieving a scalable system is not trivial and involves a combination of optimizing hardware and software. Software should try not to introduce limitations (I'm looking at you IPv4), and allow the system to grow with demand. Scalability can't always be solved by doubling the amount of resources as that rarely leads to a doubling of performance. The system should have few bottlenecks, as they will lead to poor performance. They might also become a single point of failure that can take down the whole system, if all traffic needs to go through them. ### Failure handling - Detecting failures - Masking failures - Tolerating failures - Recover from failures - Redundancy ### Concurrency Making sure data remains consistent when it is accessed from more than one place at the same time. ### Transparency To ease the use of a distributed system, the system should try to hide the separation of different components. Access transparency : Allowing both local and remote resources to be accessed with identic operations. Location transparency : Accessing resources without knowing their physical network location. Concurrency transparency : Allowing multiple processes to operate on the same shared resource without interfering with each other. Replication transparency : Letting multiple instances of a resource exist achieves higher reliabillity and performance. Replication transparency lets application programmers and users ignore these details. Failure transparency : Handling failures so that applications can still complete their tasks. Mobility transparency : Letting resources remain available even though they're being moved around. Scaling transparency : Allows a system to be scaled without changing applications or the structure of the system. ### Quality of service - Reliability - Security - Performance - Adaptability # System Models (Chapter 2) - Physical models - Architectural models - Fundamental models # Remote Invocation (Chapter 5) ## Request-reply Protocols A pattern on top of the most basic message passing between entities. Request-reply protocols support synchronous two-way exchange of messages, and is used as the basis for [RPC](#remote-procedure-call-rpc) and [RMI](#remote-method-invocation-rmi). ### Operations - `doOperation`: Used by clients to execute remote operations. The one who calls `doOperation` waits until the server execute the requested operation and the answer has reached the client. - `getRequest`: Used by a server to acquire requests from clients. - `sendReply`: When the server has invoked the requested operation, it sends a reply back to the client. The client then unblocks and continues its execution. If doOperation, getRequest and sendReply are sent over UDP, they may arrive in incorrect order or even not arrive at all. Therefore, doOperation uses a timeout when waiting for a response from the server. When a operation times out, doOperation may: - Return an indication to the client that the operation has failed (this is very simple, and rarely used). - The timeout may be caused by either the request or the reply message being lost in the network. doOperation will keep sending the message again until it's certain that the reason for not getting an answer is that the server doesn't respond, rather than the message getting lost along the way. When a request message is re-sent, the server may receive it multiple times. We want the server to avoid executing an operation more than once for the same request. If the server hasn't sent a reply when it receives a duplicate request it doesn't have to do anything. It will simply send a reply when it is done. If the server has already sent an answer when it receives an identical request, it will have to re-execute the operation to achieve the same result, unless the result was saved the first time. Some servers can execute the same operation multiple times and always achieve the same result. Operations that may be executed multiple times are called idempotent. #### Request and reply over TCP TCP is considered a reliable channel, which discards the need to re-send messages, and also removes any issues with non-idempotent operations. Sometimes, however, TCP is unnecessary. It introduces overhead, and features which are often not needed. Extra messages for setting up the connection, when you are often sending very small amounts of data. #### HTTP HTTP is an example of a request-reply protocol. Client requests specify a URL which includes the hostname of the server. The server supports a fixed set of methods (GET, POST, PUT...). It also allows clients to specify what kind of content they want to receive (json/xml), and supports authenticated requests. HTTP is implemented over TCP. ## Remote Procedure Call (RPC) Functions on remote machines may be called as if they were in the local address space. The underlying RPC system will hide all the ugly details of the network such as encoding and decoding of parameters and the result of the procedure call. In RPC, you use interfaces to control what kind of communication is possible between modules. Only information available through the interface will be available. Sun RPC and SOAP are among the most common implementations of RPC. ## Remote Method Invocation (RMI) Like RPC, but for objects. It lets you refer to, and call methods on, remote machines. Java RMI is one implementation of RMI. # Distributed File Systems (Chapter 12) ## Network File System (NFS) Provides transparent access to remote files. Any computer can act as both a server and a client, and the files can be available for remote access. When reading files, NFS will only cache to RAM, and the files will not persist over a reboot. It's common practice to have some machines be dedicated servers. ## Andrew File System (AFS) The Andrew File System (henceforth abbreviated AFS) is a distributed computing enviroment originally develpoed for use as a campus computing and information system. AFS will make a local copy of the whole file when it is accessed, and store this on disk. These files will presist through reboots. This makes AFS well suited for systems where the users have a computer of their own, which they use regulary. It is however, not suited for systems where many users use many different computers, and do not use the same computer each time. ## Big Table Big table is a compressed, high performance data storage system built on Google File System and some other google services. [incomplete]