TDT4190: Distributed systems
Characterization of Distributed Systems (Chapter 1)
Distributed systems: Components located at networked computers which communicate and coordinate by passing messages between them.
This definition has three big implications:
- Concurrency: All computers connected in a network are able to do work at the same time.
- No global clock: When programs are executing a task together, in a network with delays, how can they agree on what time it is? They need some kind of system to handle timing of actions.
- Independent failures: All computer programs fail. Distributed systems can fail in even more ways than your regular non-network-connected software. When two computers are working together and one of them stops responding, it's hard for the other to know whether the first has crashed, the network has failed, or it has just become terribly slow.
Why use a distributed system
- Enables sharing of resources between multiple users.
- Applications that allow multiple users to communicate are not possible without distributed systems.
- Allows for higher scalability than a single computer.
Challenges in a distributed system
Distributed systems may consist of many different types of networks, hardware, operating systems, programming languages and differing implementations. For instance, a computer connected to the internet through Ethernet should be able to communicate with one using Wi-Fi. Likewise, a computer running a UNIX-based operating system needs to be able to exchange data with a Windows computer (although it would much rather just hang out with its Linux friends).
To overcome such differences, we need standards for how data should be exchanged, agreed upon by everyone. A solution that helps disguise these differences between systems is called middleware. Middleware takes care of the peculiarities of each specific system and provides a consistent API to program against. Virtual machines are another solution to the problem of heterogeneity: a virtual machine lets you run the same code on many different kinds of machines. The JVM (Java Virtual Machine) is a very common example. A virtual machine must be implemented once for each system it runs on, but code then only needs to be compiled for that single virtual machine.
Openness: the software interfaces of network components are published through specifications and documentation.
RFCs (Request For Comments) describe how most of the internet protocols are constructed.
The W3C publishes and develops standards for the web.
Security in distributed systems includes:
- Confidentiality: Avoiding unauthorized access to systems.
- Integrity: Avoiding corruption of data.
- Availability: Keeping services accessible, e.g. securing against denial-of-service (DoS) attacks.
A system is scalable if it can remain efficient even under high demand. Achieving a scalable system is not trivial and involves a combination of optimizing hardware and software. Software should try not to introduce limitations (I'm looking at you, IPv4), and should allow the system to grow with demand. Scalability can't always be solved by doubling the amount of resources, as that rarely leads to a doubling of performance. The system should have few bottlenecks, as they lead to poor performance. They may also become a single point of failure that can take down the whole system if all traffic needs to go through them.
Failure handling
- Detecting failures
- Masking failures
- Tolerating failures
- Recovering from failures
Concurrency: making sure data remains consistent when it is accessed from more than one place at the same time.
To ease the use of a distributed system, the system should try to hide the separation of different components.
- Access transparency
- Allowing both local and remote resources to be accessed with identical operations.
- Location transparency
- Accessing resources without knowing their physical network location.
- Concurrency transparency
- Allowing multiple processes to operate on the same shared resource without interfering with each other.
- Replication transparency
- Letting multiple instances of a resource exist achieves higher reliability and performance. Replication transparency lets application programmers and users ignore these details.
- Failure transparency
- Handling failures so that applications can still complete their tasks.
- Mobility transparency
- Letting resources remain available even though they're being moved around.
- Scaling transparency
- Allows a system to be scaled without changing applications or the structure of the system.
Quality of service: how well the system delivers non-functional properties such as reliability, security and performance.
System Models (Chapter 2)
- Physical models
- Architectural models
- Fundamental models
Remote Invocation (Chapter 5)
Request-reply protocols are built from three communication primitives:
doOperation: Used by clients to invoke remote operations. The caller of doOperation blocks until the server has executed the requested operation and the reply has reached the client.
getRequest: Used by a server to acquire requests from clients.
sendReply: When the server has invoked the requested operation, it sends a reply back to the client. The client then unblocks and continues its execution.
If doOperation, getRequest and sendReply are sent over UDP, they may arrive in the wrong order, or not arrive at all. Therefore, doOperation uses a timeout when waiting for a response from the server. When an operation times out, doOperation may:
- Return an indication to the client that the operation has failed (this is very simple, and rarely used).
- The timeout may be caused by either the request or the reply message being lost in the network. doOperation will keep sending the message again until it's certain that the reason for not getting an answer is that the server doesn't respond, rather than the message getting lost along the way.
When a request message is re-sent, the server may receive it multiple times. We want the server to avoid executing an operation more than once for the same request. If the server hasn't yet sent a reply when it receives a duplicate request, it doesn't have to do anything special; it simply sends a reply when it is done. If the server has already sent a reply when it receives an identical request, it has to re-execute the operation to reproduce the result, unless the result was saved the first time. Operations that can safely be executed multiple times with the same result are called idempotent.
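A minimal sketch of how a doOperation built on UDP might handle timeouts and retries. This is not the textbook's exact signature; the class and constant names are illustrative, and request identifiers and server-side duplicate filtering are left out:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

public class RequestReplyClient {
    private static final int TIMEOUT_MS = 2000; // how long to wait for a reply
    private static final int MAX_RETRIES = 5;   // give up after this many attempts

    // Sends a request and blocks until a reply arrives, re-sending on timeout.
    public static byte[] doOperation(InetAddress server, int port, byte[] request)
            throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(TIMEOUT_MS);
            byte[] buffer = new byte[1024];
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                socket.send(new DatagramPacket(request, request.length, server, port));
                DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
                try {
                    socket.receive(reply); // blocks until a reply or a timeout
                    return java.util.Arrays.copyOf(reply.getData(), reply.getLength());
                } catch (SocketTimeoutException e) {
                    // The request or the reply may have been lost: try again.
                }
            }
            throw new Exception("server not responding after " + MAX_RETRIES + " attempts");
        }
    }
}
```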
Request and reply over TCP
TCP is considered a reliable channel, which removes the need to re-send messages and avoids any issues with non-idempotent operations. Sometimes, however, TCP is unnecessary: it introduces overhead and features that are often not needed, such as extra messages for setting up a connection when only small amounts of data are being sent.
HTTP is an example of a request-reply protocol. Client requests specify a URL which includes the hostname of the server. The server supports a fixed set of methods (GET, POST, PUT, ...). It also allows clients to specify what kind of content they want to receive (JSON/XML), and supports authenticated requests.
HTTP is implemented over TCP.
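As a sketch, here is an HTTP request-reply exchange using Java's standard HttpClient (the URL is hypothetical). The Accept header is how the client states which content type it wants:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // The URL identifies the server (hostname) and the resource on it.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.org/api/items"))
                .header("Accept", "application/json") // content negotiation
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```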
Remote Procedure Call (RPC)
Functions on remote machines may be called as if they were in the local address space. The underlying RPC system will hide all the ugly details of the network such as encoding and decoding of parameters and the result of the procedure call.
In RPC, you use interfaces to control what kind of communication is possible between modules. Only information available through the interface will be available.
Sun RPC and SOAP are among the most common implementations of RPC.
Remote Method Invocation (RMI)
Like RPC, but for objects. It lets you refer to, and call methods on, objects living on remote machines.
Java RMI is one implementation of RMI.
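A minimal Java RMI sketch (names like Greeter are made up): the remote interface controls which methods are callable remotely, and the RMI runtime hides parameter encoding and network communication. For simplicity, server and client are shown in one program:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// The remote interface: only methods declared here are callable remotely.
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

// Server-side implementation of the interface.
class GreeterImpl implements Greeter {
    public String greet(String name) { return "Hello, " + name; }
}

public class RmiExample {
    public static void main(String[] args) throws Exception {
        // Server: export the object and register it under a name.
        Greeter stub = (Greeter) UnicastRemoteObject.exportObject(new GreeterImpl(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("greeter", stub);

        // Client (normally on another machine): look up the stub and call it
        // as if it were a local object.
        Greeter remote = (Greeter) LocateRegistry.getRegistry("localhost", 1099)
                .lookup("greeter");
        System.out.println(remote.greet("world"));
    }
}
```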
Peer-to-peer Systems (Chapter 10)
Networks that don't distinguish between clients and servers, where all nodes in the network both supply and consume data.
Peer-to-peer systems can be classified into three generations. The first generation was introduced by Napster in 1999. The next generation came with file-sharing services such as Gnutella, Kazaa and BitTorrent.
The third generation introduced dedicated middleware layers which provided application-independent management of distributed resources.
Functional requirements for peer-to-peer middleware
- Allow clients to find and communicate with any individual resource made available to a service, even though the resources are widely distributed.
- Clients should be able to add new resources and remove them when they want to.
- Clients should be able to add hosts to the service, and be able to remove them.
- Like other middleware, peer-to-peer middleware should provide a simple API for application programmers.
Non-functional requirements for peer-to-peer middleware
- Global scalability.
- Load balancing.
- Optimizing for interaction between neighbors.
- Neighbors in the P2P system aren't necessarily close in the physical network.
- Placing data near users that use it most.
- Allow dynamic availability.
- Routing should function even while machines are disconnecting.
A routing overlay is a distributed algorithm responsible for locating nodes and objects in a peer-to-peer system. Since no single node can know about all objects, the key is to ask a node that might know more than you. Routing overlays must handle lookup, insertion, deletion and churn (nodes joining and leaving).
Unstructured networks
- Examples: Gnutella (older versions), Freenet
- The simplest form of p2p networks.
- All nodes are equal.
- Nodes may connect to any other node in the network.
- No planned structure.
- A node cannot know if another node has more information than itself.
- All nodes retain full control of their data
- Nodes decide for themselves which nodes to connect to
- Searches need to be broadcast (flooded) through the network
- No guarantee that a search reaches every node
Hybrids (super nodes)
- Examples: Kazaa, Skype, Gnutella (newer versions)
Structured networks
- Distributed hash tables are most common
- Chord, CAN, Pastry, Tapestry
- Fixed structure
- Nodes find their position based on ID
- Normally a hash based on IP address, etc.
- Tries to achieve a uniform distribution
- Nodes and data are contained in the same address space
- Data is stored on the node whose ID is closest to the data's ID
- Lookup is usually O(log n)
- ID based positioning makes search efficient and targeted
- Robust against failures
- Lookup is fast and is guaranteed to succeed
- Good routing guarantees
- Can only do lookups on exact IDs; keyword and range searches require extra mechanisms on top
- Structure controls who stores what
Pastry
- A distributed hash table
- Shared 128-bit address space for nodes and objects
- Every node gets a unique ID (GUID)
- Every object gets a unique ID (GUID)
- Circular routing based on prefix
- Every node has an incomplete (partial) routing table
- Lookup: O(log n)
- Normally uses UDP
- A Java implementation is available at freepastry.org
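A toy sketch of the prefix-matching idea behind Pastry's routing: GUIDs as hex strings and a simplified routing table indexed by shared prefix length and next digit. The real algorithm also falls back to the leaf node list, which is omitted here:

```java
// Toy illustration of prefix routing with GUIDs as hex strings.
public class PrefixRouting {
    // Length of the shared prefix between two GUIDs, in hex digits.
    static int sharedPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // routingTable[p][d] holds the GUID of a known node that shares a prefix
    // of length p with our own GUID and has digit d in position p.
    static String nextHop(String ownGuid, String target, String[][] routingTable) {
        int p = sharedPrefix(ownGuid, target);
        if (p == target.length()) return ownGuid; // we are the target
        int d = Character.digit(target.charAt(p), 16);
        return routingTable[p][d]; // may be null: fall back to the leaf node list
    }

    public static void main(String[] args) {
        System.out.println(sharedPrefix("54C2", "54B1")); // 2: "54" is shared
    }
}
```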
Connecting to a new node
- The new node X creates a GUID.
- X finds its nearest neighbor in the Pastry network.
- X sends a join message, with its own GUID as the destination, via this neighbor.
- The message is routed like a lookup.
- It is received by Z, the node whose GUID is numerically closest to X's.
- The nodes that route the message onwards attach parts of their routing tables, which together build X's routing table.
- Node Z sends X a copy of its leaf node list.
- X sends its routing table and leaf node list to all the nodes that appear in them.
Example from the exercises
A node with GUID 54C contacts node AF3 and wants to join the pastry network. What happens?
Node 54C first finds a node that is physically close to it in the network, here AF3, and sends it a join message. The join message is routed as a lookup on the GUID 54C. The node with the longest prefix match with 54C, in this case 54B, finally receives the message. In this particular case, 54B was in AF3's routing table, so the message was routed there directly.
Every node that routed the join message onwards added parts of its own routing table to build 54C's routing table. Node 54B (which received the join message) also sent 54C its leaf node list.
Finally, 54C sends its routing table and leaf node list to all nodes that are in one of them. This way, they are now able to lookup 54C.
Handling node failures
- A node X that doesn't answer its neighbors is removed from the ring.
- The node that discovers the failure finds another node near X, and asks for its routing table.
- The node repairs its routing table.
- The node sends out information about the failure and its new routing table to its neighbors.
- Routing breaks down only if l consecutive nodes (nodes with adjacent GUIDs) fail simultaneously, where 2l is the size of the leaf node list.
Example from the exercises
A node has lost its connection to the network (the node 'died' and is not available anymore). Describe the most important consequences of this.
When a node fails, the network will at some point notice that it has disappeared: either when someone tries to contact it, or when a liveness probe goes unanswered. When a dead node is discovered, the routing tables need to be updated. The node that discovered the failure asks an alternative node for its routing table, and uses it to repair its own. Finally, it distributes its new routing table to all the other nodes that need an update.
- Every node maintains a routing table
- Every field in the routing table has multiple alternatives
- 65A1xxxx matches multiple nodes
- Every node chooses the alternative that produces the lowest message delay ("physically closest")
- Routes will be 30-50 % longer than optimal
- Nodes send heartbeats to their neighbors to indicate that they are still alive.
- Routing may fail.
- Reliable transfer uses "at-least-once" semantics.
- Messages are repeated if they don't receive an answer.
- A random selection of messages are routed along sub-optimal paths.
- They pick an earlier row in the routing table than necessary.
- This makes routing more random.
- Makes it more difficult for "evil" nodes and random faults to stop the routing.
Security (Chapter 11)
except 11.5, 11.6.3 and 11.6.4
Names used in security protocols
- Alice: First participant
- Bob: Second participant
- Carol: Participant when a third party is involved
- Dave: Participant when four parties are involved
- Mallory: Malicious attacker
- Sara: A server
Distributed File Systems (Chapter 12)
Network File System (NFS)
Provides transparent access to remote files. Any computer can act as both server and client, and its files can be made available for remote access. NFS only caches file data in RAM, so the cache does not persist across reboots. It's common practice to have some machines act as dedicated servers.
Andrew File System (AFS)
The Andrew File System (henceforth abbreviated AFS) is a distributed computing environment originally developed for use as a campus computing and information system. AFS makes a local copy of the whole file when it is accessed, and stores this on disk. These copies persist through reboots. This makes AFS well suited for systems where users have computers of their own which they use regularly. It is, however, not suited for systems where many users use many different computers, and do not use the same computer each time.
Bigtable is a compressed, high-performance data storage system built on Google File System and some other Google services. [incomplete]
Time and Global States (Chapter 14)
Why is time important?
Time is important in distributed systems for many reasons. It is often necessary to store the time an event occurred. Many distributed algorithms also depend on correct event ordering (causality).
All computers have an internal clock which notifies the computer that a certain amount of time has elapsed. These clocks are not perfect, and two computers will usually not show the same time.
- Clock skew
- The difference between the readings of two clocks at the same instant.
- Clock drift
- The change in the offset between a computer's clock and a nominal perfect reference clock, per unit of time as measured by the reference clock.
A good way to synchronize the clocks in a distributed system is to calibrate them against an external source which follows Coordinated Universal Time (UTC). The most precise clocks in existence are atomic clocks, and by synchronizing with one of these across the internet you get pretty good synchronization. Another option is to synchronize with the GPS system, but this requires special hardware and is only done when you have especially strict synchronization requirements.
It's worth noting that clocks with clock drift can't simply be adjusted at any time. If you have a program that depends on causality and uses timestamps, you run the risk of comparing timestamps in the wrong order after a clock has been set backwards.
Synchronizing in a synchronous system
In a synchronous system, we know the maximum clock skew and the maximum transfer delay in the network. In such systems, we can always calculate how long we have to wait before executing an operation to guarantee that synchronization faults won't occur. If no other node can have created a timestamp within the bounds the system has set around our own timestamp, we can guarantee correct causality.
Cristian's algorithm is a method for clock synchronization which uses a time server with an accurate clock. When a process asks this server for the time, the server replies with its current time t. The process measures the round-trip time T_round of the whole exchange, and assumes the reply spent about half of it in transit. The process then sets its clock to t + T_round/2; the error is at most ±T_round/2.
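A minimal sketch of Cristian's algorithm, where askTimeServer is a hypothetical stand-in for the actual request-reply exchange with the time server:

```java
public class CristianSync {
    // Hypothetical call asking the time server for its current time.
    // In practice this would be a UDP request-reply exchange.
    static long askTimeServer() { return System.currentTimeMillis(); } // placeholder

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();   // just before sending the request
        long serverTime = askTimeServer();      // t in the server's reply
        long t1 = System.currentTimeMillis();   // just after receiving the reply
        long roundTrip = t1 - t0;               // T_round

        // Assume the reply spent about half the round trip in transit,
        // so the server's clock now reads roughly t + T_round / 2.
        long estimatedNow = serverTime + roundTrip / 2;
        System.out.println("estimate: " + estimatedNow + " (±" + roundTrip / 2 + " ms)");
    }
}
```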
An alternative method for time synchronization is not to use a central server that everyone asks for the time. The Berkeley algorithm uses a master node that asks all the servers for their time. Based on all the timestamps it receives, it calculates the average time, and makes this the correct time in the system. The master will then, instead of distributing the correct time, send a message to each server with how much it needs to adjust its clock. This way, some of the error that arises from network delay is removed.
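A sketch of the master's calculation in the Berkeley algorithm. Real implementations also correct readings for half the round-trip time and discard clocks that have drifted too far, both omitted here:

```java
import java.util.List;

public class BerkeleySync {
    // The master polls each node for its clock reading, averages the readings
    // together with its own, and returns per-node adjustments rather than an
    // absolute time.
    static long[] computeAdjustments(long masterClock, List<Long> nodeClocks) {
        long sum = masterClock;
        for (long c : nodeClocks) sum += c;
        long average = sum / (nodeClocks.size() + 1);

        long[] adjustments = new long[nodeClocks.size()];
        for (int i = 0; i < nodeClocks.size(); i++) {
            adjustments[i] = average - nodeClocks.get(i); // "add this to your clock"
        }
        return adjustments;
    }

    public static void main(String[] args) {
        // average = (1000 + 980 + 1030) / 3 = 1003 -> adjustments +23 and -27
        long[] adj = computeAdjustments(1000, List.of(980L, 1030L));
        System.out.println(java.util.Arrays.toString(adj));
    }
}
```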
Network Time Protocol (NTP)
Since Cristian's algorithm and the Berkeley algorithm were both constructed for use on intranets, they both have issues when used over the internet, because the internet has highly variable delays and transfer quality.
NTP bases its calculations on statistical data from multiple NTP servers. The servers are spread all around the world, and each of them has multiple paths onto the internet, so even if an ISP or a server goes down, there is redundancy. These servers must also be able to authenticate each other, so that their time cannot be adjusted by malicious third parties. The NTP network consists of main servers which receive their time from UTC radio clocks.
Servers are classified in levels of "stratum", which indicate how many jumps they are from a main reference clock. A server with stratum 2 is synchronized with a main server, while one with stratum 3 is synchronized with a server with stratum 2.
Stratum 16 means a server is unsynchronized.
Logical Time and Logical Clocks
A message will always be sent before it is received. Two events in the same process occur in the order in which the process observed them. These obvious facts can be used to ensure correct causality in message transfers and transactions. We can keep track of these orderings using logical clocks. Logical clocks are monotonic counters which don't necessarily have any relation to physical clocks.
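A Lamport logical clock is one such monotonic counter; a minimal sketch:

```java
public class LamportClock {
    private long time = 0;

    // A local event or a send: step the counter.
    synchronized long tick() { return ++time; }

    // On receive, jump past the sender's timestamp so the receive event
    // is ordered after the corresponding send event.
    synchronized long onReceive(long senderTimestamp) {
        time = Math.max(time, senderTimestamp) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock();
        LamportClock b = new LamportClock();
        long sendStamp = a.tick();               // a sends a message stamped 1
        long recvStamp = b.onReceive(sendStamp); // b's clock jumps to 2
        System.out.println(sendStamp + " < " + recvStamp);
    }
}
```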