TTK4145: Real Time Programming
Contributors
Mainly written by Mads Erdal, Erik Wilthil, Kristian Breistøl, Dag Slettebø, Andreas Lindahl Flåten, Anders Dahlen, Marius Thoresen, Eirik Lie Strandbråten, Bjørn Spockeli, Kristoffer Gryte, Kine Iversen, Cathrine Bruteig, Jostein Munz, Lars Skjærpe Midttun, Jon Håmann Brusevold and Geir Kulia.
Introduction to real-time systems
This chapter gives a definition of a real-time system, as well as characteristics and examples of real-time systems. This chapter will also present languages for programming and the development cycle for real-time systems. This book will consider three programming languages: Ada, Java and C, or more specifically: C/Real-Time POSIX, Real-Time Java and, well, Ada.
Definition of a real-time system
The Oxford Dictionary of Computing's definition of a real-time system: Any system in which the time at which output is produced is significant. This is usually because the input corresponds to some movement in the physical world, and the output has to relate to that same movement. The lag from input time to output time must be sufficiently small for acceptable timeliness. 'Timeliness' is system-dependent. For instance, a missile guidance system requires output within milliseconds, whereas a computer-controlled water dam may only need output within a second.
- Young's (1982) definition of a real-time system:
- Any information processing activity or system which has to respond to externally generated input stimuli within a finite and specified period.
- Another definition is (Randell et al., 1995):
- A real-time system is a system that is required to react to stimuli from the environment (including the passage of physical time) within time intervals dictated by the environment.
The correctness of a real-time system depends not only on the logical result of the computation, but also on the time at which the results are produced. Hard real-time systems are those where it is absolutely imperative that responses occur within the specified deadline. Soft systems will still function correctly if deadlines are occasionally missed. A system may have both soft and hard real-time sub-systems, and soft and hard deadlines. A service may even have both a soft and a hard deadline, with different reactions to the two deadlines.
A deadline that can be missed occasionally, but where there is no benefit from late delivery, is called firm. Another way of classifying the role time plays in a real-time system is to distinguish between reactive systems and time-aware systems. Time-aware systems make explicit references to the time frame of the enclosing environment. A reactive system is more concerned with relative times; however, since reactive systems are often also control systems, they need to be synchronized with their environment. In order to 'keep up with their environment', reactive systems are often structured to be time-triggered: all computations are periodic in that they have a defined cycle time, and are released for execution by an internal clock. The alternative is event-triggered, where the environment explicitly controls the release for execution of some activity. These activities are termed aperiodic, or sporadic if the number of releasing events within a time interval is bounded.
Examples of real-time systems
A system where a computer interacts with the environment through sensors (for instance temperature or pressure transducers) and actuators (valves, motors and so on) is an example of a real-time system, e.g. plants, elevators, washing machines, etc. Real-time systems are used in process control, manufacturing, communication, multimedia, command and control, and pretty much everywhere else. These examples are thoroughly explained in the book for those interested.
Characteristics of real-time systems
The following are characteristics of real-time systems, which will be thoroughly examined in the later chapters of the book. A real-time system need not, of course, exhibit all these characteristics, but a real-time language and its run-time system should have facilities which support them.
Real-Time facilities
As mentioned, response time is crucial in any embedded (/real-time) system. Since it is often near impossible to guarantee response times under all possible conditions with limited computing resources, real-time systems are often constructed using processors with considerable spare capacity. This ensures that a worst-case scenario does not produce unwelcome delays. Given adequate processing power, language and run-time support are required to enable the programmer to:
- specify times at which actions are to be performed;
- specify times at which actions are to be completed;
- support repeating (periodic or aperiodic) work;
- control (i.e. bound) the jitter on input and output operations;
- respond to situations where not all of the timing requirements can be met;
- respond to situations where the timing requirements are changed dynamically.
These are called real-time control facilities, and enable the program to synchronize with time itself.
Some systems are more demanding in certain periods, and therefore need to change their timing requirements dynamically. As an example, an aircraft flight control system needs to focus on the task at hand, but must also be ready to give all computing resources to handling an emergency. Such changes are generally known as mode changes, and they also have consequences for the temporal characteristics of the executing software.
Concurrent control of separate system components
Often, an embedded system must handle several coexisting external elements with which the computer must interact simultaneously. It may then be necessary to consider distributed and multiprocessor embedded systems. The problem becomes how to express this concurrency in the structure of the program. Chapters 4, 5 and 6 will consider various models of concurrent programming, and how to achieve reliable communication and synchronization between concurrent processes in the presence of design errors.
Low-level programming
An embedded system typically requires the computer to interact with the external world, monitoring sensors and controlling actuators for a wide variety of devices. These devices interface with the computer via input and output registers, have operational requirements that are device- and computer-dependent, and may also generate interrupts. A real-time system is also often battery-powered in a mobile device, and, as mentioned, very dependent on response time. High-level programming enables the programmer to abstract away from implementation details and concentrate on solving the problem at hand; for a real-time programmer this luxury is not always affordable, since delay and power consumption are central concerns when programming a real-time system. Low-level programming features built into the language allow device registers and interrupts to be handled directly, avoiding both extra delay and extra power consumption.
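A common low-level idiom is memory-mapped I/O, where device registers appear as addresses in the computer's address space. The sketch below assumes a hypothetical device with made-up register addresses and bit layout; real values would come from the device's datasheet.

```c
#include <stdint.h>

/* Hypothetical register map -- real addresses and bit layouts come
   from the device's datasheet, not from this sketch. */
#define STATUS_REG_ADDR 0x40001000u
#define DATA_REG_ADDR   0x40001004u
#define READY_BIT       (1u << 0)

/* 'volatile' forces every access to really happen, since the
   register contents can change behind the program's back. */
#define STATUS_REG (*(volatile uint32_t *)(uintptr_t)STATUS_REG_ADDR)
#define DATA_REG   (*(volatile uint32_t *)(uintptr_t)DATA_REG_ADDR)

/* Pure helper: does a status word have the ready bit set? */
int device_ready(uint32_t status)
{
    return (status & READY_BIT) != 0;
}

/* Busy-wait for the device, then read one sample. Only meaningful
   on hardware where these addresses actually exist. */
uint32_t read_sample(void)
{
    while (!device_ready(STATUS_REG)) {
        /* spin -- a real driver would block or use an interrupt */
    }
    return DATA_REG;
}
```

A language with good low-level facilities lets this kind of register access be written without dropping into assembly.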
Support for numerical computation
When a real-time system controls some engineering activity, it often involves continuous signals. For very simple systems, one might have an analog controller working directly on the continuous signals. One wishes to compute an input based on comparing the output with a reference signal; this will often require heavy calculations, depending on the complexity of the model and the number of inputs and outputs. Because of these difficulties, most controllers are implemented as digital computers. Since nature is analog and computers are digital, converters are needed to convert analog signals to digital, and vice versa. This must in turn be taken into account when designing the control algorithms. A fundamental requirement of a real-time programming language, therefore, is the ability to manipulate real or floating-point numbers. Fortunately, most engineering languages provide the necessary abstractions in this area.
Large and complex
Largeness is, according to Lehman and Belady (1985), related to variety. Since most real-time systems respond to real-world events, the amount of variety is great; hence, most real-time systems will exhibit the rather undesirable property of largeness. Due to this largeness, real-time systems need to be broken down into smaller components which can be managed effectively. Real-time languages provide features such as abstract data types, classes, objects, generic components and separate compilation to manage this software complexity.
Extremely reliable and safe
For some systems, failure is simply not an option. Systems providing life support, controlling nuclear plants or transferring money cannot fail; if they do fail, they must fail in a controlled way. When human interaction is required, care must be taken in the design of the interface in order to minimize the possibility of human error. Chapters 2 and 3 will consider the problems of producing reliable and safe software. Coping with both expected and unexpected error conditions will be examined in chapters 7 and 13(?).
Structure of the book
If you are interested in the structure of the book, read the book, not the compendium.
Development cycle for real-time systems
The most important stage in the development of any real-time system is the generation of a consistent design that satisfies an authoritative specification of requirements. The book does not focus on issues of design, but rather the investigation of language and operating system primitives which allow designs to be realized. This section gives a brief overview of some of the typical stages passed through in the top-down design approach:
- requirements specification - during which an authoritative specification of the system's required functional and non-functional behaviour is produced;
- architectural design - during which a top-level description of the proposed system is developed;
- detailed design - during which the complete system design is specified;
- coding - during which the system is implemented;
- testing - during which the efficacy of the system is tested.
Requirement specification
Almost all computing projects start with an informal description of what is desired, followed by an extensive analysis of requirements. In this stage, functionality is defined, temporal behaviour is made explicit, and reliability requirements and the desired behaviour of the software in the event of component failure are specified. The environment must also be modelled: since interaction with the environment is especially important for a real-time system, the maximum rate of interrupts, the maximum number of objects, and failure modes are all important.
Design activities
Decomposition and abstraction form the basis of most software engineering methods. Decomposition involves the systematic breakdown of the complex system into smaller and smaller parts, until components are isolated that can be understood and engineered by individuals or small groups. Abstraction enables the consideration of detail, particularly that appertaining to implementation, to be postponed.

Sub-levels in the design must have well-defined roles in order to be verifiable. If the specification of the entire system can be verified just in terms of the specifications of the immediate subcomponents, the decomposition is said to be compositional. This is an important property when formally analysing programs.

Whenever a class (object) is used, it typically involves the addition of some form of task (process). Both object and task abstractions are important in the design and implementation of reliable embedded systems. These forms of encapsulation lead to the use of modules, from which more sizable components can be defined that may even be re-usable in subsequent designs.

Cohesion and coupling are two metrics that are often used to describe the relationships between entities within a design. Cohesion is concerned with how well a module holds together - its internal strength. Allworth and Zobel gave six measures of cohesion, ranging from the very poor (coincidental) to the strong (functional). Coupling, by comparison, is a measure of the interdependence of program modules. Two modules passing control information between them are said to be tightly, or highly, coupled; conversely, the coupling is loose if only data is communicated. Loose coupling means that a module is easy to replace with another one, i.e. what we want. A good decomposition is one with strong cohesion and loose coupling. This principle is equally true in the sequential and concurrent programming domains.
Testing and simulation
To assist in any complex testing activity, a realistic test bed presents many attractions. For software, such a test environment is called a simulator. Using a simulator helps create normal as well as abnormal system behaviour, without having to melt down a nuclear reactor in order to test the system. It is crucial that the simulator is able to reproduce accurately the sequence of events expected in the real system. As mentioned before, high reliability requirements are the essence of most real-time systems, so it is clear that testing must be extremely stringent.
Languages for programming real-time systems
There are three classes of programming languages to be identified: assembly languages, sequential systems implementation languages and high-level concurrent languages. These will be reviewed after introducing some general language design criteria.
General language design criteria
Young (1982) lists the following six criteria as the basis of a real-time language design: security, readability, flexibility, simplicity, portability and efficiency.
- Security
- The security of a language design is a measure of the extent to which programming errors can be detected automatically by the compiler or language run-time support system. Some errors, like errors in the programmer's logic, cannot be detected automatically. A secure language must, therefore, be well structured and readable so that such errors can easily be spotted. The disadvantage of security is compiler complexity and compilation time (I'm not slacking off, my code is compiling!).
- Readability
- Readability depends on the language having an appropriate choice of keywords, the ability to define types and facilities for program modularization. Good readability leads to increased maintainability and security, and reduced documentation costs. It does, however, tend to increase the length of any given program.
- Flexibility
- A language is flexible if it allows the programmer to express all the required operations in a straightforward and coherent fashion, without having to resort to operating system commands or machine code.
- Simplicity
- Simplicity reduces the cost associated with programmer training, minimizes the effort required to produce compilers, and diminishes the possibility of making programming errors as a result of misinterpreting the language. Simplicity and flexibility can also be related to expressive power, the ability to express the solution of a wide range of problems, and usability, the ease of use of the language.
- Portability
- Portability is the ability to run a program independently of the hardware on which it is executed. Java claims to have this ability. Portability is difficult to achieve for a real-time system; however, it should be possible to isolate the machine-dependent part of the program from the machine-independent part.
- Efficiency
- Efficiency is the language's ability to allow efficient and predictable programs to be produced. Response times are crucial in a real-time system; however, efficiency must be balanced against security, flexibility and readability requirements.
Assembly languages
Initially, most real-time systems were written in assembly language, mostly due to the lack of support for high-level languages on microcomputers. Assembly language achieves efficient implementations, but is machine-oriented rather than problem-oriented. This keeps development costs high and makes maintenance difficult and expensive.
Sequential systems implementation languages
Most common programming languages used for real-time programming today are sequential. They also tend to be weak in the facilities they provide for real-time control and reliability. As a result of these shortcomings, it is often necessary to rely on operating system support and assembly code inserts.
High-level concurrent programming languages
The software crisis in the 1970s had several symptoms; in short, the general language design criteria discussed above were seldom met. This was partly because there were too many general-purpose languages in use: in the American Department of Defense (DoD) alone, over 450 different languages were used in embedded computer applications. The DoD concluded that a single language was a desirable goal, and that no current language was suitable, so they decided to develop a new one. The result was Ada in 1983. Ada was later updated in 1995, 2005 and 2012, which is the current version.
Reliability and fault tolerance
Real-time and embedded systems usually have much higher requirements for reliability and safety. It is pretty annoying when Spotify crashes or Matlab hangs, but these problems have little to no impact on the surroundings. A control system for, say, avionics has much higher demands on reliability.
Faults and failure
Let us first clear up some definitions:
- Reliability
- A measure of the success with which the system conforms to some authoritative specification of its behaviour.
- Fault
- A mechanical or algorithmic cause which, given the right conditions, produces an unexpected or incorrect result.
- Error
- An internal unexpected problem in a system, caused by activation of a fault.
- Failure
- System deviation from its specification.
Note that this implies that a system maintains its reliability until a failure, and a failure only happens when the system does not interact with the environment in the prescribed manner. This means a reliable system either prevents activation of faults, or prevents errors from propagating through the system. This chain is shown in figure 1. Note that a failure in a subsystem will lead to a fault in the enclosing system. There are four sources of faults that can result in a system failure:
- Inadequate specification
- Software faults
- Hardware faults
- Interference in the communication subsystems
The last two are in some ways predictable errors, in that the system designer will know what happens: for example, if a network module fails, a message will not be delivered. However, when an error stems from incorrect interaction or wrong code, the result can be more unpredictable, as the scope of the effects cannot be foreseen as clearly as for a hardware error. In this regard, a fault is the source of misbehaviour. Regardless of the above sources, there are three types of faults that can exist in a system. Also note that most system faults originate from an incorrect or incomplete specification.
- Transient faults
- These faults occur at a particular time, remain in the system for a period, then disappear afterwards. An example of this is radioactivity.
- Permanent faults
- A fault that persists in the system until it is fixed. All software faults are permanent faults.
- Intermittent faults
- A special case of transient faults that occur repeatedly. Overheating is an example.
It is worth noting that software faults have a special name: the notorious bug. Bugs can be divided into two categories, with names taken from their quantum physics counterparts. Bohrbugs are easily reproduced, and thus easy to find and remedy. Heisenbugs are more complex bugs that arise from certain, often complex, combinations of events. A good example is threads that are not properly synchronized, producing seemingly random behaviour depending on the state of each thread.
Failure modes and fault prevention
There are, generally speaking, two ways a system can fail. When a system produces a value failure, the output is wrong or even corrupted/unusable. When the system does not deliver the value in the right time slot, it has produced a time failure. A combination of the two is called an arbitrary failure. Note that while a time failure intuitively covers delayed data, data that is undelivered (delayed infinitely) or even delivered too early is also considered a time failure.
The book also gives an extensive list of different ways a system might fail. These are given in a bullet list on page 32, and cover everything from complete chaos (uncontrollable failure) to an immune system (never fail), via the controlled shutdown/reduced operation failures. Fault prevention is an approach that aims to reduce the failures in a system by reducing the number of faults. The two main ways to do this are intuitive: you can either find and remove the faults, or you can prevent their introduction in the first place. These are known as fault removal and fault avoidance, respectively. While there are hardware aspects to fault avoidance, such as using components suited for the task at hand, the most important part is to write software with as few faults as possible. There are several ways to do this: design methods, for example UML, and programming languages designed for real-time systems, such as Ada, will help reduce the number of faults. Extensive testing and simulation of the system will also help reduce the number of faults present during operation.
Fault tolerance and redundancy
Even though all precautions might have been taken during design, there will, in practice, still be faults in the system when it goes live. No software programmer writes 100% correct code all the time. No design specification is complete, consistent and unambiguous. Testing can only show the presence of faults, not their absence. Thus a system designer has to make sure the system continues to operate when a fault activates. This is known as fault tolerance, and the curriculum describes three different levels.
- Full fault tolerance
- ensures the system continues operation as normal, at least for some time.
- Graceful degradation
- reduces the functionality to only perform core tasks while the system waits for repairs.
- Fail safe
- the system stops all operation, but ensures its integrity. For example, a failing pitch controller on a helicopter might reset the pitch to zero before shutting down.
In order to achieve this, system designers introduce extra elements to detect and handle faults. This is called redundancy, and is possible to implement in both software and hardware. We will focus on the software part, where N-version programming and recovery blocks are the most common techniques.
N-version programming
N-version programming relies on several pieces of software computing the same result. There are several degrees of diversity, ranging from using different compilers for the same program, to letting several teams build the same functionality in different programming languages. During execution, a driver process starts the programs, receives the results and acts depending on the output. If the programs do not produce the same result, the driver might choose the most common result, ask the processes to compute it again, or simply terminate the faulty process. One of the downsides to this solution is that, in a real-time system, the driver process and the different programs might need to communicate, introducing a communication module as well. Add to this the fact that software is by far the most extensive and time-consuming part of a real-time system, and the project costs might double by introducing a redundant version. There are also increased strains on hardware, or even a need for separate hardware, which increases the cost and/or complexity of the system. As a final note, recall that most software faults originate from the specification; thus all N versions might suffer from the same fault.
Dynamic redundancy
Note that N-version programming is a case of static redundancy: all N versions of the program run continuously, whether the system fails or not. An alternative is to have modules that stay in the background and only run when an error is detected. This can decrease resource usage while still handling errors properly. This type of redundancy has four stages.
Error detection
Since the recovery system is not continually running, there needs to be some way to detect a fault. There are many triggers that could be used.
- Replication checks are an extension to N-version programming that start a recovery routine if one or more of the N results are wrong. Also consider a 2-version system: we cannot know which result is correct, but we know something is wrong, and action has to be taken.
- Timing checks take the form of a watchdog timer that each process has to reset within a set interval. Note that timing checks can only be used to check for time failures; other error detection mechanisms have to be used for value failures.
- A reversal check can be used if there is a one-to-one relationship between the input and output. Given an output, we can do a reversal calculation and compare the result to the original input.
- Coding checks rely on extra information sent along with the data, which can be compared on the receiving side. This is often used in communication. For example, imagine an integer being sent on the network. A checksum might be the square of the number, calculated on the sending side. When both numbers are received, the square root of the second number is compared to the first: a mismatch is proof of error.
- Reasonableness checks rely on knowledge of the system. This includes range checks, such as date and time validation.
- Structural checks are used to ensure that data structures are correct. We know the number of items in a list; if this changes when it is not supposed to, something is wrong. Go users might recall that the json.Marshal function produces an error when the information it receives does not fit the target struct.
- Dynamic reasonableness checks compare the current output to the previous output. If the output of a system on the form $x' = ax$ suddenly changes from a large positive value to a small negative value, this jump should not occur, and something is wrong.
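The square-as-checksum coding check described above can be sketched as follows; the message layout is invented for illustration (comparing squares rather than taking square roots avoids floating point).

```c
#include <stdint.h>

/* Coding check from the text: send an integer together with its
   square; the receiver recomputes and compares. A mismatch proves
   the pair was corrupted in transit. The struct layout is a made-up
   example, not a real protocol. */
struct message {
    int32_t value;
    int64_t checksum; /* sender stores value * value here */
};

struct message encode(int32_t v)
{
    struct message m = { v, (int64_t)v * v };
    return m;
}

/* Receiving side: returns 1 if the message is consistent,
   0 if an error is detected. */
int coding_check(struct message m)
{
    return m.checksum == (int64_t)m.value * m.value;
}
```

As with all coding checks, this detects corruption but cannot tell whether the value or the checksum was damaged.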
Damage confinement
Normally, an error is not detected the instant it occurs. Thus, when an error is detected, the system has to figure out how far it has spread: damage confinement concerns how the rest of the system is affected by an error. The two main techniques for this are modular decomposition and atomic actions. Modular decomposition means breaking the system down into components, where each component only communicates with the others through well-defined interfaces; the internal details are hidden from other modules. If done properly, all errors are detected before another module needs the result. Atomic actions will be described in a later chapter.
Error recovery
When the error is found and we know the extent of it, the recovery process can start. There are two methods of error recovery: forward and backward recovery. Forward recovery involves continuing as before with selected corrections. While this is an efficient method of error recovery, it requires a good understanding of what the error is and how it can be remedied, otherwise the corrections might only make things worse. Backward recovery steps back to a previous state, called a recovery point, and tries to accomplish the same action with a different method. This is straightforward when there are no threads, but it quickly gets complicated with multiple threads because of inter-thread communication: if we roll one thread back to an earlier recovery point, we might have to undo a message sent to another thread. If so, the other thread might also have to roll back, requiring messages sent to yet other threads to be undone. Thus, because one thread has had an error, multiple other threads may have to roll back as well. This is called the domino effect.
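Backward recovery can be sketched in C with setjmp/longjmp, where setjmp establishes the recovery point and longjmp rolls execution back to it. This single-threaded sketch ignores the state-restoration problem that makes backward recovery hard in practice, and the function names are made up.

```c
#include <setjmp.h>

static jmp_buf recovery_point;

/* Primary algorithm; 'fail' simulates a detected error. */
static int primary(int x, int fail)
{
    if (fail)
        longjmp(recovery_point, 1); /* roll back to the recovery point */
    return x * 2;
}

/* Alternative algorithm tried after rollback. */
static int alternative(int x)
{
    return x + x;
}

int compute(int x, int primary_fails)
{
    if (setjmp(recovery_point) == 0)
        return primary(x, primary_fails); /* first attempt */
    else
        return alternative(x);            /* after backward recovery */
}
```

longjmp only restores the program counter and stack, not the program's data; real backward recovery must also restore (or never have modified) the state the alternative depends on.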
Fault treatment
While error recovery remedies the error, it does not remove the underlying fault. The fault might be easy to identify and/or remedy, so that the experienced error will not happen again. Both hardware and software faults are relatively easy to fix once identified, if the system is allowed to be brought offline. If the program has to be modified while executing, things get a little more complicated.
Recovery blocks
Recovery blocks are an implementation of dynamic redundancy applied to software. Every entry to a software block is defined as a recovery point, and an acceptance test must pass before the block can exit. The method is summarized in Figure 2. Note that the acceptance test can involve several of the methods discussed for error detection. For a short summary of the difference between N-version programming and recovery blocks, read chapter 2.7.
Measuring reliability
Hardware reliability is pretty simple to measure: take a bunch of similar components, test them, and establish a measure of reliability from the number of components that fail. Software reliability, on the other hand, is much harder to measure, as code only comes from one source and does not deteriorate over time. Thus, new measures of reliability have to be established. At first, software was simply regarded as either working or not working. As discussed, however, software can be designed to maintain only critical functionality when shit hits the fan, so this boolean interpretation of software reliability is not accurate enough. A more recent definition of software reliability is based on probability; more specifically, it is the probability that a given program will operate correctly in a specified environment for a specified length of time. One main way to calculate this is to use a growth model, increasing the reliability estimate for every fault that is corrected in the system. Another method is to estimate the probability statistically by applying a number of test cases to the system and, without correcting any faults, determining the proportion of correct cases.
Exceptions and exception handling
There are five general requirements for an exception-handling facility:
- As with all language features, the facility must be simple to understand and use.
- The code for exception handling should not be so obtrusive as to obscure understanding of the program's normal error-free operation. A mechanism which intermingles code for normal processing and exceptional processing will prove difficult to understand and maintain, and may well lead to a less reliable system.
- The mechanism should be designed so that run-time overheads are incurred only when handling an exception. Although most applications require that the performance of a program which uses exceptions is not adversely affected under normal operating conditions, this may not always be the case: in particular, where speed of recovery is of prime importance, an application may be prepared to tolerate a little overhead on the normal error-free operation.
- The mechanism should allow the uniform treatment of exceptions detected both by the environment and by the program. For example, an exception such as arithmetic overflow, which is detected by the hardware, should be handled in exactly the same manner as an exception raised by the program as a result of an assertion failure.
- The exception mechanism should allow recovery actions to be programmed.
Exception handling in older real-time languages
Unusual return value is one of the simplest methods for exception handling, and works as follows:
if (function_call(parameters) == AN_ERROR) {
/* error handling code */
} else {
/* normal return code */
}
If a function already returns a value and it is not possible to partition the range of values to indicate an error, then status flags are used. These are atomic shared variables which can be set and tested. Using a forced branch, the instruction immediately following the subroutine call is skipped to indicate the presence (or absence) of an error. This is achieved by the subroutine incrementing its return address (program counter) by the length of a simple jump instruction to indicate an error-free (or error) return.
Modern exception handling
While most of the older real-time languages have error handling that changes the flow of execution in the program, most modern languages have exception handling built directly into the language. There are four classes of exceptions:
- Detected by the environment and raised synchronously - an array bounds violation or divide by zero are examples of such exceptions.
- Detected by the application and raised synchronously - for example, the failure of a program-defined assertion check.
- Detected by the environment and raised asynchronously - an exception raised as the result of a power failure or the failure of some health-monitoring mechanism.
- Detected by the application and raised asynchronously - for example, one process may recognize that an error condition has occurred that will result in another process not meeting its deadline or not terminating correctly.
Chapter 3 mainly focuses on synchronous exception handling. Ada requires exceptions to be declared as constants, while Java and C++ have a more object-oriented view.
Exception propagation
When an exception is raised, there is not always a handler associated with the enclosing block to take care of it. There are two possible methods for dealing with a situation where no immediate exception handler can be found. The first approach is to regard the absence of a handler as a programmer error which should be reported at compile-time. This can be challenging, as it is not always possible for the compiler to check whether the calling context includes the appropriate exception handlers. The second approach is to look for handlers up the chain of invokers at run-time; this is called propagating the exception, and is the approach taken in Ada, Java and C++. If an exception is propagated outside its scope, it will be impossible to find a handler by name. Most languages therefore have a 'catch all' exception handler to solve this potential problem.
- The resumption model
- Using this model the handler tries to cure the problem that caused the exception to be raised, and the invoker resumes as if nothing had happened. This model is also called the notify model, and is illustrated on p. 67 in the textbook.
- The termination model
- Using this model, the control is not returned to the invoker after an exception is taken care of. This model is the most common, and is also called the escape model.
- The hybrid model
- With the hybrid model the handler decides whether the error is recoverable or not, and can choose if the invoker is to be terminated.
Exception handling in Ada, Java and C
See the textbook for examples.
Ada
Ada supports explicit exception declaration in the same fashion as constants, with the keyword exception. An exception can be declared in the same place as any other declaration and has the same scope. Some standard exceptions, like Constraint_Error and Storage_Error, have the whole program as their scope. Every block in Ada can contain an optional collection of exception handlers declared at the end of the block, with the syntax: when <exception name> => <action> (see the textbook for more details). when others can be used as the last exception-handling choice to pick up all exceptions not listed previously. In Ada, an exception which is raised and not handled by a subprogram is propagated to the caller of the subprogram. Therefore, such an exception will only be handled by the initialization code if it itself called the subprogram (see the textbook, p. 73, for an example). An exception can also be propagated further by re-raising it in the local handler; this facility is useful in the programming of last wishes (see the textbook, p. 74, for an example). Exceptions will never fully meet requirement 3, as detecting possible error conditions requires some resources. Ada therefore provides a facility (pragma Suppress) for switching off the run-time checks that raise the standard exceptions, for applications where these checks become too costly. Difficulties with the Ada model of exceptions:
- It can be hard to keep track of where an exception is raised.
- Ada does not allow a full range of parameters to be passed to handlers; only a character string can be passed. This can be inconvenient if an object of a particular type needs to be passed.
- It is possible for exceptions to be propagated outside of the scope of their declaration. Such exceptions can only be trapped by when others. However, they may come back into scope again when propagated further up the dynamic chain. This is disconcerting, although probably inevitable when using a block-structured language and exception propagation.
Java
Java supports a termination model of exception handling, like Ada, but the exceptions are integrated into the object-oriented model. Raising an exception in Java is called throwing the exception, and it can only be handled from within a try-block. Each handler is specified using a catch statement, which works like a function declaration. A handler with parameter type T will catch a thrown object of type E if:
- T and E are the same type; or
- T is a parent (super) class of E at the throw point.
As with Ada, if no handler is found in the calling context of a function, that context is terminated and a handler is sought in a try-block within its calling context. Java also supports a finally clause as part of a try-block. The code attached to this clause is guaranteed to execute whatever happens in the try-block, irrespective of whether exceptions are thrown, caught or propagated, or even if no exceptions are thrown at all. To illustrate this, consider the following example, which will return false:
try {
return true;
} finally {
return false;
}
C
C does not define any exception-handling facilities within the language. This limits the possibilities of programming reliable systems, but it is possible to provide some form of exception-handling mechanism by using the macro facility of the language. To implement a termination model in C, it is necessary to save the status of a program's registers, and so on, on entry to an exception domain and then restore them if an exception occurs. This can be done using setjmp and longjmp.
Summary
Although many different models exist, they all address the following issues:
- Exception representation - an exception may, or may not, be explicitly represented in a language.
- The domain of an exception handler - associated with each handler is a domain which specifies the region of computation during which, if an exception occurs, the handler will be activated. The domain is normally associated with a block, subprogram or statement.
- Exception propagation - this is closely related to the idea of an exception domain. It is possible that when an exception is raised there is no exception handler in the enclosing domain.
- Resumption or termination model - this determines the action to be taken after an exception has been handled.
- Parameter passing to the handler - this may or may not be allowed.
The table below shows the exception-handling facilities of various languages.
| Language | Domain | Propagation | Mode | Parameters |
| --- | --- | --- | --- | --- |
| Ada | Block | Yes | Termination | Limited |
| Java | Block | Yes | Termination | Yes |
| C++ | Block | Yes | Termination | Yes |
| CHILL | Statement | No | Termination | No |
| CLU | Statement | No | Termination | Yes |
| Mesa | Block | Yes | Hybrid | Yes |
It is not unanimously accepted that exception-handling facilities should be provided in a language. To sceptics, an exception is a goto where the destination is indeterminable and the source is unknown. They can, therefore, be considered to be the antithesis of structured programming.
Concurrent programming
Introduction
Definition: Concurrent programming is the name given to techniques for expressing potential parallelism and for solving the resulting synchronization and communication challenges. The motivation behind concurrent programming:
- Real-world systems (robots etc.) have a parallel nature.
- Modern processors are much faster than the I/O devices with which they interact, and a sequential program that is waiting for I/O is unable to perform other operations.
- To allow more than one processor to solve a problem.
- Without concurrency, the software must be constructed as a single control loop. This structure cannot retain the logical distinction between system components.
Speed-up from parallelizing: Amdahl's law gives
$\text{Max speed-up} = \frac{1}{(1-P)+\frac{P}{N}}$
where P is the proportion of code that benefits from parallelization and N is the number of available processors.
Terms and definitions
Task/thread/process
A task/thread is the unit of parallelism: a single thread of control. A process consists of one or more threads executing within its own memory context; threads within the same process share memory, while separate processes do not.
Fundamental facilities
All concurrent programming builds on these fundamental facilities:
- Activities (threads, tasks or processes).
- Synchronization mechanisms.
- Support of communication between concurrent activities.
Types of interaction
Interaction between tasks can be grouped as one of the following:
- Independent: no synchronization or communication between the tasks.
- Cooperating: the tasks synchronize and communicate to achieve a common goal.
- Competing: the tasks compete for their share of resources. Although they communicate and synchronize to obtain resources, they are essentially independent.
Hierarchy
Based on hierarchy and relation, one often distinguishes between tasks being:
- Parent/child: the parent is responsible for the creation of another task, the child. The parent may be delayed while the child is being created and initialized.
- Guardian/dependant: a task may be dependent on a guardian task. The guardian is not allowed to terminate until all its dependent tasks have terminated.
- Sometimes the parent is also the guardian, but not necessarily so with dynamic tasks.
Concurrent execution
Variation between the languages
The models of concurrency in programming languages vary, especially when it comes to:
- Structure: the number of tasks can be either:
  - Static: fixed at compile-time
  - Dynamic: tasks can be created at any time
- The level of parallelism: tasks can be either:
  - Nested: allowed to be defined within other tasks
  - Flat: not nested
- Granularity: within languages that support nested constructs, we distinguish between:
  - Coarse-grained parallelism: the program contains few tasks, each with a long life. (Applies to most concurrent programming languages, typified by Ada.)
  - Fine-grained parallelism: the program contains a large number of simple tasks, some of which exist only for a single action (e.g. occam2).
- Initialization: a task can be supplied with information by:
  - passing it as parameters (most modern languages allow this)
  - communicating explicitly after it has commenced its execution
- Termination: can happen in a variety of ways:
  - normal completion of execution of the task's body
  - suicide, by a self-terminate statement
  - abortion, through the action of another task
  - an untrapped error condition
  - never (when executing non-terminating loops)
  - when no longer needed
Task representation
The three basic mechanisms for representing concurrent execution are: fork and join, cobegin, and explicit task declaration. The latter is the most used in modern real-time languages.
Fork and join
The fork statement specifies that a designated routine should start executing concurrently with the invoker of the fork. The join statement allows the invoker to synchronize with the completion of the invoked routine. For example:
A procedure P (the parent) begins and forks a function F (the child). They then execute concurrently until P issues a join, at which point P waits for F to finish (if it has not already done so). Fork and join allow dynamic task creation and provide a means of passing information to the child via parameters. They can, however, be error-prone in use; for example, in some systems a guardian must explicitly rejoin all dependants rather than merely wait for their completion.
Cobegin
The cobegin (also called parbegin) denotes the concurrent execution of a collection of statements: the statements between cobegin and the matching coend are executed concurrently, and the construct completes only when all of them have completed.
Explicit task declaration
With explicit task declaration, routines themselves state whether they will be executed concurrently. This has become the standard way of expressing concurrency in real-time languages.
Language parallelism
Ada
The unit of parallelism in Ada is called a task. Tasks consist of a specification and a body. They can be passed initialization data upon creation, but only of discrete types and access (pointer) types. Tasks can also be declared as task types. Tasks that do not have types declared for them, and have no pointers to them, are called anonymous.
Dynamic tasks can be created by giving a non-static value to the bounds of an array (of tasks) or by using the 'new' operator on an access type (of a task type). Tasks created by the 'new' allocator have the important property that the guardian (called master in Ada) is not the block in which the task was created, but the one that contains the declaration of the access type. That means that if task Y is created inside an inner block but its access type was declared outside, the inner block can terminate even though Y has not yet terminated. Termination: a task will terminate when it completes its body, when an exception goes unhandled, when it is no longer needed (via a select statement with a terminate alternative), or when it is aborted. Any task can abort any other task, and as a result all its dependants will also be aborted. This is a dangerous action, and should be used only when no alternative actions are possible.
Java
Java has a predefined class, java.lang.Thread, which provides threads. Java also has an interface, called Runnable, to express concurrent execution; it provides a run method.
There are two ways to create threads. One way is to declare a class to be a subclass of Thread and override the run method. The second way is to declare a class that implements the Runnable interface. Common to all threads in Java is that they are not started automatically when their associated objects are created, but must be explicitly started using the start method (e.g. thread1.start();).
Java allows dynamic thread creation like Ada, but unlike Ada, Java allows arbitrary data to be passed as parameters. Java has no guardian concept, and relies on garbage collection to clean up objects that are no longer accessible; however, the main program still terminates only when all its (non-daemon) threads have terminated. The join method works as described under "fork and join", and the isAlive method tells whether a target thread has terminated.
- Termination
- A thread terminates as it completes its body or there is an unhandled exception.
Daemon threads provide general services and typically never terminate until they are no longer needed (similar to Ada tasks using the select statement). In Java, one no longer has the same opportunity to abort threads as in Ada.
C/Real-Time POSIX
C/Real-Time POSIX provides three mechanisms for creating concurrent activities. The first is the fork and wait mechanism. The second is spawn, which is a combination of fork and exec. The third allows a process to contain several threads, as in Ada and Java. A thread is created, and starts executing its start routine, when pthread_create is called; this is also where its parameters are passed.
Termination: a thread can terminate normally by returning from its start_routine or by calling pthread_exit. One thread can wait for another to terminate with the pthread_join function. Cleaning up and reclaiming the storage of a terminated thread is called detaching.
Multiprocessor and distributed systems
Definitions:
- Multiprogramming
- Tasks multiplex their executions on a single processor
- Multiprocessing
- Tasks multiplex their executions on a multiprocessor system where there is access to shared memory
- Distributed Processing
- Tasks multiplex their executions on several processors which do not share memory
Global dispatching means that a task can be given to any available processor. From a real-time perspective, fixing tasks to particular processors will usually result in a more predictable response. Linux allows threads to be constrained to execute on a limited set of processors; this is called processor affinity. Although Ada, Real-Time Java and C/Real-Time POSIX all support multiprocessor implementations, they do not provide mechanisms to set the processor affinity of a task.
Steps to make a distributed system
Production of an application for a distributed system involves some steps not required when programs are produced for a single or multiprocessor platform.
- Partitioning
- dividing the system into parts
- Configuration
- the partitioned parts are associated with particular elements in the system
- Allocation
- turning the configured system into a system of executable modules
- Transparent execution
- the distributed software is executed so remote resources can be accessed independent of location
- Reconfiguration
- the dynamic change to the location of a software component
Language-supported vs operating-system-supported concurrency
There has been a long debate as to whether support for concurrency should be provided by the language or by the operating system. Pros and cons of language-level support follow. Pros:
- More readable and maintainable programs.
- There are many different operating systems.
- An embedded computer may not have any resident operating system.
Cons:
- Different languages have different models of concurrency.
- It may be difficult to implement a language's model of concurrency efficiently on top of an operating system's model.
- Operating systems API standards, such as POSIX, have emerged and therefore programs are more portable.
Message-Based Synchronization and Communication
Message passing is introduced as an alternative to shared-variable synchronization and communication. This approach is typified by the use of a single construct for both communication and synchronization. There is wide variation in the semantics of message passing; because of this, three major issues are considered and discussed:
- The model of synchronization
- The method of task naming
- The message structure
Process Synchronization
In a message-based system a receiver obviously cannot read a message before it is sent. This means that the receiver is explicitly notified whenever new information is available, as opposed to shared-variable synchronization, where a receiver cannot know whether the variable has been written to or not. The task synchronization of the sender in a message-based system can be broadly classified into three categories:
- Asynchronous - the sender proceeds immediately after sending the message
- Synchronous - the sender waits for the message to be received
- Remote invocation - the sender proceeds only when a reply has been returned
These three forms are related: two asynchronous sends can constitute a synchronous send, and two synchronous (or asynchronous) sends can constitute a remote invocation, as the following sketches show.
Using asynchronous tasks for synchronized messaging
Task 1: Task 2:
asyn_send(message) wait(message)
wait(acknowledgement) asyn_send(acknowledgement)
Using synchronous tasks for remote invocation messaging
Task 1: Task 2:
syn_send(message) wait(message)
wait(reply) ...
construct reply
...
syn_send(reply)
As asynchronous send can be used to construct the other two, this model gives the most flexibility, and languages and operating systems should therefore support it. There are, however, some drawbacks to consider when using asynchronous send:
- Infinite buffers might be needed to store messages that have not yet been read (perhaps because the receiver has terminated)
- Most asynchronous sends are in practice programmed to expect an acknowledgement, which makes the communication synchronous anyway
- More communication is needed with the asynchronous model, hence increased complexity
- It is more difficult to prove correctness of the complete system
Task Naming and Message Structure
In task naming there are two things to consider: direct vs. indirect naming, and symmetry. In a direct naming scheme the sender explicitly names the receiver:
send(<task-name>, <message>)
An indirect naming scheme, on the other hand, names an intermediary (known variously as a channel, mailbox, link or pipe):
send(<mailbox>, <message>)
A naming scheme is symmetric if both the sender and receiver name each other (directly or indirectly). It is asymmetric if the receiver does not name a specific source, hence accepting messages from any sender. Asymmetric naming fits the client-server paradigm, as a server can receive messages from, and perform services for, any number of clients (tasks). Indirect asymmetric naming has some further issues, as the intermediary can have different structures:
- many-to-one
- many-to-many
- one-to-one
- one-to-many
Message Structure
A language should allow any data object of any type (predefined or user-defined) to be sent as a message. This can be difficult in practice, especially if the sender and receiver have different data representations.
Selective Waiting, Non-Determinism and Synchronization Primitives
Message passing has so far been assumed to make the receiver wait for a message. This is often too restrictive, as the receiver may want to wait for several different tasks to call. To facilitate this common need, receivers can be allowed to wait selectively for messages from different tasks.
Dijkstra's guarded command is a programming structure used to solve this. A guarded command is executed only if its guard evaluates to TRUE, and it is an element of a guarded command set. If several guards in a set evaluate to true, the choice of which command proceeds is arbitrary; that is, the construct is non-deterministic, as opposed to if-statements.
The reason for making this choice non-deterministic is that most concurrent languages make few assumptions about which tasks are executed when; the scheduler is likewise assumed to be non-deterministic as to when different tasks are scheduled. The same argument applies to any queue defined by a synchronization primitive: non-deterministic scheduling implies that a queue should release tasks non-deterministically.
Message Passing in Ada
The semantics of remote invocation have many superficial similarities with a procedure call in Ada: data is passed to a receiver, the receiver executes, and data is then returned. Because of this similarity, Ada supports message passing in a way that is compatible with the definition of procedures, protected subprograms and entries.
For a task to be able to receive a message, it must define an entry. Entries can be declared private, meaning they can only be called by tasks local to the defining task's body. Facilities are also provided for defining, in effect, an array of entries; these are called an entry family (see the multiplexer on p. 197 for an example). Calling a task (sending a message) simply involves naming the receiving task and its entry directly. If an entry call is made on a task that is no longer active, the exception Tasking_Error is raised at the point of call.
In Ada, data can also be transferred in the opposite direction to the message itself, from the receiver back to the sender. Because of this two-way data flow, Ada's message passing is usually called an 'extended rendezvous'.
Every entry should have an accept statement associated with it; this statement must appear in the task's body. The naming in an accept statement is asymmetric, as it names the entry and not the task from which the message came. If multiple tasks call the same entry, they are ordered in a FIFO queue.
Exception Handling and the Rendezvous
Any valid Ada code can be executed during a rendezvous, so it is possible for an exception to be raised within the accept statement itself. If this occurs, the exception can either be handled inside the accept statement, or the statement terminates and the exception is propagated both to the called task and to the caller, where it must then be handled.
Message Passing via Task Interfaces
Ada's direct asymmetric naming means that tasks must be named explicitly. However, this is not always appropriate: a client may not want to be concerned with the task's type, only with the fact that it provides the appropriate service. Task interfaces are ideal for this purpose. A task type in Ada 2005 can be declared to 'implement' zero, one or more combinations of synchronized and task interfaces.
Select statement
The select statement in Ada must contain at least one accept alternative (and may contain arbitrarily many), and at most one of three further alternatives:
- a terminate alternative
- an else alternative
- a delay alternative
The general form of a select statement is:
select
   when <Boolean-Expression> =>
      accept <entry> do
         ...
      end <entry>;
      -- any sequence of statements
or
   -- similar
   ...
end select;
The terminate alternative terminates the task when no tasks that could call its entries remain able to do so. The else alternative is executed if no other alternative is immediately available. The delay alternative provides a timeout: if no entry call is received within the specified time, the statements following the delay are executed. If there are multiple entry calls waiting, the choice of which is accepted is non-deterministic.
C/Real-Time POSIX message queues
C/Real-Time POSIX supports asynchronous, indirect message passing through the notion of message queues. A message queue can have many readers and many writers. A priority may be associated with each message (not part of the curriculum). Message queues are intended for communication between processes, but they can be used by threads as well. See Program 6.1, p. 207, for the C/Real-Time POSIX interface to message queues.
Distributed systems
In a distributed system, where tasks operate on different nodes, the goal is to make the communication between them as easy and reliable as possible. In reality, complex communication protocols are needed, and a set of mechanisms must be provided so that:
- tasks do not need to decode or encode messages.
- all messages received by a user task can be assumed to be intact, and in good condition.
- messages received by a task is of the type the task expects.
- tasks are not restricted to communicate only in terms of predefined, built-in types.
It is possible to identify three main standards for communication in a distributed system:
- externally-defined application programmer's interface (API), to access network transport protocols
- remote procedure call (RPC) paradigm
- distributed object paradigm
Remote Procedure Call
RPC is typically used when communication between programs written in the same language is needed. A procedure (server) is identified as one that can be called remotely. From this, two procedures are generated: a client stub and a server stub (see figure 6.1, p. 211). The stubs are sometimes called middleware, as they sit between the application and the operating system. The client stub acts on behalf of the server at the client's side, and the server stub acts on behalf of the client at the server's side. The client stub is used for:
- identify the address of the server (stub) procedure
- convert parameters into bytes suitable for transmission across the network (parameter marshalling)
- send the request to execute the procedure to the server (stub)
- wait for a reply from the server (stub) and unmarshall the message
- return control to the client procedure along with the returned message
The server stub is used for:
- receive requests from client (stub) procedures
- unmarshall the received message
- call the server
- catch any exceptions that are raised by the server
- marshall and send return parameters and/or exceptions
If the clients and servers are written in different languages, the marshalling and unmarshalling must be machine- and language-independent.
The Distributed Object Model
The distributed object model allows:
- dynamic creation of an object on a remote machine
- identification of an object to be determined and held on any machine
- transparent invocation of a method in an object as if it were a local method
- transparent run-time dispatching of a method call across the network
Not all systems supporting this model provide mechanisms for all this functionality.
- Ada
- supports the static allocation of objects, allows the identification of remote Ada objects, facilitates the transparent execution of remote methods, and supports distributed run-time dispatching of method calls.
- Java
- allows the code of a Java object to be sent across the network and instances to be created remotely, the remote naming of a Java object, the transparent invocation of its methods, and distributed run-time dispatching.
- CORBA
- allows objects to be created in different languages on different machines, facilitates the transparent execution of remote methods, and supports distributed run-time dispatching of method calls.
Atomic actions
This chapter is meant as a basic explanation of the concepts behind atomic actions. For specific implementations of atomic actions in C, Ada and Java, please see (Burns and Wellings, 2009), chapters 7.2, 7.5-7.7.
Atomic actions
In concurrent programming, several tasks can run in parallel, each performing operations or actions. With concurrency, it is easy for tasks to interfere with each other when performing the same action. A solution to this problem is to require the actions to be run as indivisible or atomic actions. There are several ways of expressing the properties of atomic actions:
- An action is atomic if the tasks performing it are not aware of the existence of any other active task, and no other active task is aware of the activity of the tasks during the time they are performing the action.
- An action is atomic if the tasks performing it do not communicate with other tasks while the action is being performed.
- An action is atomic if the tasks performing it can detect no state change except those performed by themselves, and if they do not reveal their state changes until the action is complete.
- Actions are atomic if they can be considered, so far as other tasks are concerned, to be indivisible and instantaneous, such that the effects on the system are as if they were interleaved as opposed to concurrent.
The essence of these statements is that the system always sees a consistent state, and that all actions performed appear to be instantaneous. These concepts are usually not supported natively in programming languages, so the functionality must be implemented using other, more primitive concepts. An atomic action, even though viewed as indivisible, can be composed of internal atomic actions. This structure is called nested atomic actions. To fulfil the stated requirements for atomic actions, it is important that the tasks involved in a nested action are a subset of the tasks involved in the outer action.
Two-phase atomic actions
For an action to be completely atomic, no communication with other parts of the system is allowed. This is, however, a problem when it comes to acquiring resources, because the task will then communicate with a resource manager such as a server task. This leads to two options. The first is to make the server task, which serves all other tasks as well, part of the atomic action. This is very inefficient, as it serializes all tasks requesting data from the server, removing the concurrency. The second option is to let the atomic action communicate freely with resource servers, and this is in reality the only viable option. For efficiency and increased concurrency, it is advantageous if tasks wait as long as possible before acquiring resources, and free them as soon as possible. This could, however, make it possible for other tasks to detect a change in state before the original task has finished the atomic action. To avoid this, atomic actions are split into two separate phases: a 'growing' phase where only acquisitions can be made, and a 'shrinking' phase where resources can be released but no new acquisitions can be made. This way, other tasks always see a consistent system state.
Atomic transactions
Atomic transactions are like regular atomic actions, with one additional feature: after execution, the transaction is defined as either a success or a failure. If an atomic action fails, the system may be left in an inconsistent state; if an atomic transaction fails, the system returns to the state prior to the execution. Because of this feature, atomic transactions are also called recoverable actions. There are two distinct properties of atomic transactions:
- failure atomicity
- The transaction must complete successfully or have no effect.
- synchronization atomicity
- The transaction must be indivisible so that no other transaction can observe its execution.
Even though transactions provide error detection, they are not suitable for fault-tolerant systems. This is because they provide no recovery mechanism in case of errors, and will not allow any recovery procedures to be run.
Requirements for atomic actions
For the implementation of atomic actions in a real-time system, a well-defined set of requirements which must be fulfilled is necessary.
- Well-defined boundaries
- an action should have a start, end and side boundary. The start boundary is the locations in the tasks where the action is initiated, the end boundary is the locations where the action ends. The side boundary separates the tasks involved in the action from the rest of the system.
- Indivisibility
- atomic actions cannot allow any exchange of information between tasks inside and outside the action. Also, there is no synchronization at the start of an atomic action, but there is at the end: tasks cannot leave the action until all participating tasks are ready to leave.
- Nesting
- nesting is allowed as long as the atomic actions don't overlap with other atomic actions.
- Concurrency
- it should be possible to execute different atomic actions concurrently. The overall effect of running atomic actions concurrently must be the same as if they were run sequentially.
The atomic actions should allow recovery procedures to be implemented, as their intention is damage confinement.
Recoverable atomic actions
Backward error recovery
The concept of backward error recovery is that when it is applied to a group of communicating tasks, the tasks can be rolled back to their starting point of execution. In general this is problematic, because there is no easy way of defining a consistent set of recovery points. However, when an error occurs in an atomic action, the tasks involved can be rolled back to the clearly defined starting point, and an alternative procedure can be executed. As atomic actions operate independently, this guarantees correct results. Atomic actions used in this way are called conversations. Each action statement contains a recovery module, as seen here:
action A with (P1,P2) do
ensure <acceptance test>
by
- primary module
else by
- recovery module
else error
end A;
The basic principles of a conversation can be summarized:
- On entry to the conversation, the state of a task is saved. The set of entry points form the recovery line.
- While inside the conversation, a task is allowed only to communicate with other tasks active in the conversation and general resource managers. As conversations are built from atomic actions, this property is inherited
- In order to leave the conversation, all tasks active in the conversation must have passed their acceptance test. If this is the case, then the conversation is finished and all recovery points are discarded.
- If any task fails its acceptance test, all tasks execute their alternative modules. It is, therefore, assumed that any error recovery to be performed inside a conversation must be performed by all tasks taking part in the conversation.
- Nesting is allowed, but only strict nesting is allowed (Inner structure entirely contained in outer).
- If all alternatives in the conversation fail, then recovery must be performed at a higher level
Conversations have been criticized even though they offer recovery options for sets of tasks. If a set of tasks fails, the same set of tasks will try another procedure, but the fault is often at a higher level, and the tasks may need to cooperate with entirely different tasks to recover successfully. This leads to a lot of inefficiency, and the system may not recover at all.
Forward error recovery
The principle of forward error recovery is that if an exception occurs in one of the tasks active in an atomic action, then that exception is raised in all the active tasks. Such an exception is called asynchronous, as it originates in another task. The following example illustrates the concept:
action A with (P2 , P3 ) do
- the action
exception:
when exception_a =>
- sequence of statements
when exception_b =>
- sequence of statements
when others =>
raise atomic_action_failure
end A;
There are two models of exception handling: termination and resumption. With the termination method, the atomic action will complete normally if all exceptions are handled and no further exceptions are raised. With the resumption method, the tasks will resume where they were when the exception was raised. If one task doesn't have an exception handler or fails to handle an exception, the atomic action will fail and a new exception will be raised in the other tasks.
- Resolution of concurrently raised exceptions
- A problem with exceptions arises when a fault cannot be uniquely identified and causes different exceptions to be raised simultaneously. As there may be a different exception handler for each of the exceptions in each task, the problem is how to decide which to choose. The solution is to create a priority tree for the exceptions, and to handle the exception at the root of the smallest subtree which contains all the raised exceptions.
- Exceptions and internal atomic actions
- Another problem with exceptions is when exceptions are raised when other tasks are in a nested action. All tasks must participate in the recovery action, but internal actions are indivisible, and to handle an exception would break with that principle. Another problem is that the nested actions may not be able to handle all kinds of exceptions that can be raised. The solution is to let the nested actions have a predefined abort exception case which aborts the nested actions, and since the nested action fails, the system is restored to the initial state.
Asynchronous notification
In real implementations of recoverable atomic actions, forward and backward error handling are usually combined, and backward error handling can even be implemented with forward error handling. The key to implementing error handling is the asynchronous messages, and most programming languages have some feature allowing asynchronous messages to be sent. There are mainly two principles for this: Termination and resumption.
The resumption model (often called event handling) is like a software interrupt. A task contains an event handler which will handle predefined events. When the task is interrupted, the event handler is executed, and when it's finished, the task will continue execution from the point where it was interrupted.
With the termination model, each task specifies a domain in which it will accept an asynchronous signal. If the task is inside the defined domain and receives an asynchronous signal, the domain is terminated. This is called asynchronous transfer of control, or ATC. If a task is outside the defined domain, the ATC is queued or ignored. When the ATC is finished, control is returned to the task at a location after the domain, not at the point where it was interrupted.
The user need for asynchronous notification
The main feature of asynchronous notification is the immediate reaction it provides. It adds complexity to the system, but the alternative is to wait for events to occur and processes to finish, and in real-time systems this isn't always good enough. The main reasons for using asynchronous notifications are
- Error recovery
- When errors occur, it can mean that the system state is no longer valid, that processes may never finish, or that a timing error has occurred. In all these cases there is no longer any point in waiting for a process to finish, and the best solution is to respond immediately.
- Mode changes
- For large systems, there can be many operation modes with entirely different sets of tasks being performed. When a sudden mode change happens, we want the system to respond immediately, for example on a transition to an emergency state.
- Scheduling using partial/imprecise computations
- Many algorithms that calculate a specific result operate by giving a more accurate result for each iteration. These algorithms have no natural finishing point, so an asynchronous notification is often the simplest way of terminating them.
- User interrupts
- Whenever a system receives interrupts from a user, for example when an error occurs, the response must be immediate.
Resource control
Implementation of resource entities requires some form of control agent. If the resource is passive, it is said to be protected (synchronized). If the resource is active, it is called a server. This chapter discusses the problem of reliable resource control.
Resource control and atomic actions
- Tasks need to communicate and synchronize to perform resource allocation; this need not be in the form of an atomic action, because the only information exchanged is that needed to achieve the allocation.
- The code necessary for a particular task to communicate with the controller should be an atomic action.
Resource management
Resources must be encapsulated and be accessed only through a high-level procedural interface.
- Packages in Ada.
- Naturally achieved with monitors in Java and C.
Expressive power and ease of use
- Monitors/synchronized methods, servers, protected resources.
- Bloom's criteria for evaluating synchronization primitives in the context of resource management:
- Ease of use of a synchronization primitive encompasses:
- the ease with which it expresses each of these synchronization constraints.
- the ease with which it allows the constraints to be combined to achieve more complex synchronization schemes.
- In the context of resource control, the information needed to express these constraints can be categorized as follows:
- the type of service request
- the order in which requests arrive
- the state of the server and any objects it manages
- the parameters of requests
- Comparison between condition synchronization and avoidance synchronization:
- condition wait: all requests are accepted, but any task whose request cannot currently be met is suspended on a queue.
- avoidance: requests are not accepted unless they can be met.
Request type
- Priority one request over another.
- Monitor and synchronized methods: Programmed as distinct procedures.
- Ada: Represented by different entries in the server task, or protected object.
Request order
- Monitor: FIFO
- Ada: calls to the same entry can be serviced in FIFO order if the appropriate queuing policy is chosen.
- Ada: calls to different entries are serviced in an arbitrary manner (out of the programmer's control). FIFO ordering across entries is possible if all clients call a common "register" entry first.
Server state
- Some operations may be permissible only when the server and the objects it administers are in a particular state.
- Monitors could be used with condition variables being used to implement constraints.
Request parameters
The order of operations of a server may be constrained by information contained in the parameters of a request, typically the size of the request. How can a resource allocator in Ada be constructed to handle this?
- Different entries for different sizes.
- Two stage interaction: Call first "sign in" and then "allocate" to the server.
- Problem: if the client sends "sign in" and is then aborted by another task before sending "allocate", the server deadlocks if it waits indefinitely.
- If the server does not wait indefinitely on "allocate", there is instead a problem if the client is merely slow.
- solutions:
- Define the abort primitive to apply to an atomic action rather than a task; forward and backward error recovery can then be used when communicating with the server.
- Assume that abort is only used in extreme situations where the breaking of the atomic action is of no consequence.
- Try to protect the server from the effect of client abort.
Requester priority
Client priority. It is possible in Java, C and Ada to define a priority queue, but in general concurrent programming languages, tasks are released from primitives such as semaphores or condition variables in either an arbitrary or FIFO manner.
The requeue facility
One means of enhancing the usability of avoidance synchronization is to add a requeue facility. The Ada language has such a facility.
The key notion behind requeue is to move a task (which has already passed one guard or barrier) to 'beyond' another guard.
Ada allows requeues between task entries and protected object entries.
Semantics of requeue
In short: the requeue statement completes the current entry body and places the caller on the queue of the target entry; the original call completes only when the requeued entry call completes. With "requeue with abort" the call may additionally be cancelled while queued. (See chapter 8.4.1 for the full semantics.)
Requeuing to other entries
Example in the book ch 8.4.2
Asymmetric naming and security
- asymmetric naming → the server being unaware of the identity of the client.
- A server task may wish to know the identity of the calling client so that:
- a request can be refused on the grounds of deadlock prevention or fairness (that is, granting the request would be unfair to other clients).
- it can be guaranteed that resources are released only by the task that earlier obtained them.
- Ada: get the client id from task identification
- Java: get the client id using Thread.currentThread()
Resource usage
- Shared access: more than one can access at the same time.
- Exclusive access: only one can access at the same time.
- A combination of the above; problem: a task requesting shared access while another task already has exclusive access, and vice versa.
- A task that allocates resources should release them as soon as possible.
- But releasing too early can lead to errors if the action subsequently fails.
- Modified two-phase resource usage: the resources are not released until the action is completed.
- Recovery procedure: success → release the resources. Failure → undo any effects on the resources.
Deadlock
- Many tasks competing for a finite number of resources.
- livelock
- Several tasks are continually attempting to gain sole access to the same resource; if the resource allocation policy is not fair, one task may never get its turn. This is called indefinite postponement, starvation or lockout.
- liveness → in concurrent systems: if something is supposed to happen, it eventually will.
- four necessary conditions that must hold if deadlock is to occur:
- Mutual exclusion
- only one task can use a resource at once (that is, the resource is non-sharable or at least limited in its concurrent access).
- Hold and wait
- there must exist tasks which are holding resources while waiting for others.
- No pre-emption
- a resource can only be released voluntarily by a task.
- Circular wait
- a circular chain of tasks must exist such that each task holds resources which are being requested by the next task in the chain.
- Three approaches to handling deadlock:
- deadlock prevention
- deadlock avoidance
- deadlock detection and recovery
Scheduling real-time systems
Scheduling: restricting the non-determinism in concurrent systems by providing an ordering of the use of the resources of the system, and predicting the worst-case behaviour under this ordering. I.e. priorities must be assigned to the different processes, and a schedulability test run. A static scheduler decides/calculates this prior to execution, while a dynamic scheduler calculates it at run-time.
The cyclic executive approach
For this simple approach, all tasks are purely periodic and known in advance. Tasks are either running or not initiated, never suspended. It is only a sequence of procedure calls, thus there is no need for semaphores etc. The major cycle determines the longest possible period of a task, while the minor cycle determines the shortest. The scheduler groups the tasks in minor cycles, making sure all tasks run at least one time (the task with the longest period) every major cycle. Everything is planned in advance, however placing the tasks in the cycles is similar to the knapsack problem which is NP-hard.
Task-based scheduling
A task can now be either running, suspended waiting for a timing event, or suspended waiting for a non-timing event.
Scheduling approaches
- Fixed-Priority Scheduling (FPS)
- All tasks have static priority.
- Earliest Deadline First (EDF)
- Earliest deadline yields highest priority. Dynamic.
- Value-Based Scheduling
- Adaptive; assigns a value to each task to decide which to run next.
Scheduling characteristics
A schedulability test can be:
- sufficient
- success indicates that deadlines are always met.
- necessary
- failure indicates that the system will miss some deadlines.
- sustainable
- the test still predicts schedulability when the system parameters (period, deadline..) improve. (Not all tests do this)
Preemption and non-preemption
- Preemptive
- the scheduler immediately switches to tasks of higher priority.
- Non-preemptive
- the scheduler waits for tasks to complete, or give up their resources, before switching to other tasks. I.e. it allows lower-priority tasks to complete before switching to tasks of higher priority.
- deferred preemption/cooperative dispatching
- the scheduler allows a lower-priority task to continue to execute within a bounded time, and not necessarily to completion, before switching to a higher-priority task.
Simple task model
The book makes some assumptions on the tasks: fixed set of periodic, independent tasks with known periods, constant worst-case execution times, and deadlines equal to their periods. They run on a single processor where overheads run in zero time. A critical instant is the instant when the processor runs with maximum load. See table 11.2 page 370 for standard notation used in the book.
Fixed-priority scheduling (FPS)
- Rate monotonic priority assignment
- shortest period yields highest priority. In the book 1 is the lowest priority.
Utilization-based schedulability tests for FPS
If the following is true (for rate monotonic priority ordering), then all N tasks will meet their deadlines:

$U \equiv \sum_{i=1}^{N} \frac{C_i}{T_i} \le N(2^{1/N} - 1)$

The test is sufficient but not necessary: a system may fail the test and still be schedulable.
If we increase the number of tasks to infinity with no restriction on relative period, we can prove that the maximum utilization approaches 69%. This is called asymptotic analysis of utilization.
Improved utilization-based tests for FPS
The schedulability test from previous section can be improved by reinterpretation of the given equation. Instead of having N stand for number of tasks, it is now defined as number of distinct task families. A task family is tasks that have periods that are multiples of a common value. Ex. 3, 9, 12, here N=1. If the given periods are 3, 6, 10, then N=2 as 10 does not share a common multiple with the other periods. (Read BW ch. 11.4.1)
Another improved utilization-based test has the form:

$\prod_{i=1}^{N} (U_i + 1) \le 2$

where $U_i = C_i/T_i$ is the utilization of task i.
Response time analysis (RTA) for FPS
As opposed to the utilization-based FPS tests, the RTA test is exact (sufficient and necessary). The worst-case response time of task i is $R_i = C_i + I_i$, where $I_i$ is the interference from higher-priority tasks. $R_i$ is found as the smallest solution to the recurrence

$w_i^{n+1} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{w_i^n}{T_j} \right\rceil C_j$

Here $hp(i)$ is the set of tasks with priority higher than that of task i. Starting from $w_i^0 = C_i$, the iteration stops when $w_i^{n+1} = w_i^n$ (then $R_i = w_i^n$), or when $w_i^{n+1}$ exceeds the deadline $D_i$ (the task is unschedulable).
Sporadic and aperiodic tasks
Up until this point we have assumed that all tasks are periodic. A sporadic task is released by an event, but has a known minimum inter-arrival time, so it can be analysed as a periodic task with T equal to that minimum. An aperiodic task has no such bound, and therefore cannot be given a hard deadline.
Hard and soft tasks
Worst-case figures are usually considerably higher than average-case figures, thus schedulability testing with worst-case figures can cause low processor utilization. To improve this, it should be possible to schedule all tasks using average figures, i.e. we allow transient overload: situations where we might not meet all deadlines. However, hard real-time tasks should never miss deadlines, so they should be schedulable using worst-case figures.
Aperiodic tasks and fixed-priority execution-time servers
- Execution-time server
- makes sure hard tasks meet their deadlines, but allows soft tasks to run as often as possible.
Dual-priority scheduling
- Priorities are split into a high, a medium and a low band. Aperiodic tasks run in the medium band. Hard tasks start in the low band, but are promoted to the high band in time to meet their deadlines.
Task systems with D < T
Deadline monotonic priority ordering (DMPO): tasks have a fixed priority that is inversely proportional to their deadline. It is optimal for a fixed-priority scheme. FPS works well with D < T, while the simple EDF utilization test no longer applies. However, since EDF is more effective than FPS, a system that passes an FPS test will also be schedulable under EDF.
Task interactions and blocking
This far, tasks have been viewed as independent. In reality they are not, as they often share resources. Thus we have to add the possibility for tasks to be suspended; waiting for a future event (e.g. an unlocked semaphore).
- Priority inversion
- a task with high priority has to wait for a task with lower priority (e.g. for it to free a semaphore). The high-priority task is then blocked.
- Priority inheritance
- If p is suspended, waiting for q, then q's priority is increased to p's priority (if it originally was lower).
To make the response-time equation work with the expanded model, we add a term $B_i$, the maximum blocking time of task i:

$R_i = C_i + B_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j$

See the book page 385 for an explanation of how to calculate $B_i$.
Priority ceiling protocols
Priority ceiling protocols are used to manage and synchronize shared resources so that deadlocks and priority inversions are avoided and blocking is minimized.
Original ceiling priority protocol (OCPP):
- All tasks have a static default priority
- Resources have a static ceiling value, that corresponds to the max priority of the tasks using the resource
- Tasks have a dynamic priority; its own static priority or any (higher) priorities inherited from blocking higher-priority tasks
- A resource may only be locked by a task whose dynamic priority is higher than the ceiling of any currently locked resource (excluding any it has locked itself)
We see that the priority of a task will be increased if it is blocking a higher priority task.
Immediate ceiling priority protocol (ICPP)
- All tasks have a static default priority
- Resources have a static ceiling value, that corresponds to the max priority of the tasks using the resource
- Tasks have a dynamic priority; the max of its own static priority and the ceiling value of any resources it had locked
Instead of waiting until it actually blocks a higher-priority task, ICPP raises the priority of a task as soon as it locks a resource. A task under ICPP will not start executing until all the resources it needs are free, since an unavailable resource implies that some task with a priority at least as high is running. ICPP is easier to implement and leads to less switching between tasks, but causes more priority changes than OCPP.
Ceiling protocols, mutual exclusion and deadlock
- "Proof" that the ceiling protocols ensure mutual exclusion
- if a task under ICPP has access to a resource, it runs with at least the ceiling priority of that resource, and thus no other task demanding the resource can run.
- "Proof" that the ceiling protocols are deadlock-free
- if a task holds one resource and requests another, the requested resource cannot be locked by another task (the ceiling would have prevented the holder from running otherwise); hence a task is never blocked while holding a resource, and no circular wait can arise.
An extendible task model for FPS
So far we have extended the FPS model to include
- Release jitter
- the max variation in a task's release
- Arbitrary deadlines ($D > T$)
- the analysis must account for several outstanding releases of the same task; a task is not schedulable if $R > D$.
- Cooperative scheduling/deferred preemption
- mutex via non-preemption. Increases schedulability and can lead to lower C values.
- Fault tolerance
- includes forward- and backward recovery to the model
- Offsets
- analysis of arbitrary offsets is an NP-hard problem
Priority assignment
If a task p is assigned the lowest priority and is feasible, then, if a feasible priority ordering exists for the complete task set, an ordering exists with task p assigned the lowest priority. This observation gives an optimal priority assignment algorithm: find a task that is feasible at the lowest priority, fix it there, and repeat for the remaining tasks at the next priority level up.
Insufficient priorities
If we have insufficient priority levels, tasks must share priorities, and the analysis must then assume interference between tasks of equal priority.
Execution-time servers
A virtual resource layer between the applications and the processor(s). Guarantees level of service and resource allocation. Both periodic and sporadic servers act as periodic tasks.
Earliest deadline first (EDF) scheduling
Alternative to fixed-priority scheduling (FPS), but less supported by programming languages and operating systems.
Utilization-based schedulability tests for EDF
The utilization-based test for EDF:

$\sum_{i=1}^{N} \frac{C_i}{T_i} \le 1$

where $C_i$ and $T_i$ are as before. For D = T this test is both sufficient and necessary.
Comparison between fixed-priority scheduling (FPS) and earliest deadline first (EDF)
FPS:
- FPS is easier to implement as the scheduling attribute(priority) is static
- Easier to incorporate tasks without deadlines by giving them an arbitrary priority.
- Using priority as a scheduling attribute can be used to describe more factors.
- During overload situations, FPS behaves predictably, with the lower-priority tasks missing their deadlines first.
EDF:
- EDF uses deadlines as scheduling attribute which are dynamic and requires more complex run-time systems.
- Giving a task an arbitrary deadline is strictly not the same as a task without a deadline.
- Deadlines do not represent the criticality of a task, i.e. the task with the earliest deadline may not be the most critical task.
- EDF is unpredictable under overload and may experience a domino effect where a large number of tasks miss their deadline.
- Higher utilization than FPS
Processor demand criteria for EDF
The utilization-based schedulability test cannot be applied to EDF when deadlines are shorter than periods. In that case schedulability can be checked with the Processor Demand Criterion (PDC) method. In PDC the load of the system, $h(t)$, is the total computation time that must be completed by time t:

$h(t) = \sum_{i=1}^{N} \left\lfloor \frac{t + T_i - D_i}{T_i} \right\rfloor C_i$

where, as before, $C_i$, $T_i$ and $D_i$ are the computation time, period and relative deadline of task i. A system is schedulable if and only if $h(t) \le t$ for all $t > 0$. However, one only needs to check the values of t that correspond to absolute deadlines, up to a known upper bound L.
The QPA test
For non-trivial systems, i.e. systems with large task sets, the upper bound L may be very large, so checking every deadline up to L is expensive. Quick Processor-demand Analysis (QPA) instead starts near the upper bound and works backwards: let t start just below L; while $h(t) \le t$ and $h(t) > D_{min}$, set t to $h(t)$ if $h(t) < t$, and otherwise to the largest absolute deadline smaller than t. The task set is schedulable if the iteration terminates with $h(t) \le D_{min}$.
Blocking and EDF
Priority inversion in FPS corresponds to deadline inversion in EDF, i.e. when a task requires a resource that is in use by another task with a longer deadline. Blocking issues in EDF can be prevented by using the Stack Resource Policy (SRP). SRP assigns a preemption level to each task, reflecting the length of its deadline: the shorter the deadline, the higher the preemption level.
At run-time, shared resources are given to the task with the highest preemption level, then to the task with the second highest, and so on. Thus tasks are blocked at most once (at release) and deadlocks are prevented.
Aperiodic tasks and EDF execution-time servers
As with FPS, execution-time servers can also be defined under EDF; the Constant Bandwidth Server (CBS) is the standard example, giving each aperiodic activity a budget and a deadline so that it can never consume more than its reserved fraction of the processor.
Dynamic systems and online analysis
In hard real-time systems, arrival patterns and computation times are known a priori, but in dynamic soft real-time systems they may not be known. As a consequence, offline analysis cannot be performed on the system and some kind of online analysis is needed.
EDF is a dynamic scheduling scheme that performs badly under transient overloads: when a task misses its deadline, it can start a domino effect causing the next task to miss its deadline, and so on. Therefore an online scheduling scheme is needed to manage any overload that is likely to occur due to the dynamics of the system's environment.
Most online scheduling schemes uses the following two mechanisms to handle overload:
1: An admissions control procedure that limits the number of tasks that compete for the processors.
2: An EDF dispatching routine for those tasks that are admitted by 1.
Admissions control prevents the processor from being overloaded with too many tasks, so that EDF works effectively. Admissions control will admit some tasks and reject others; therefore an overall priority has to be established. This is done by giving both admitted and rejected tasks values, classified as
- Static
- the task always has the same value whenever it is released
- Dynamic
- the value of the task is computed at release, as it depends on the state of the system
- Adaptive
- the value of the task will change during execution according to the dynamics of the system
Best-effort
Hybrid systems
Worst-Case execution time
In all scheduling approaches described in the book, i.e. cyclic executives, FPS and EDF, the worst-case execution time (WCET) is assumed known, but in most cases it has to be measured or found through analysis.
Measuring WCET is problematic, as one cannot be sure the worst case has been observed. The drawback of analysis is that an effective and accurate processor model has to be available.
Most analysis techniques use the following steps to estimate the WCET:
1: Decompose the code into a directed graph of basic blocks, which represent straight-line code.
2: Use the processor model to estimate each basic block's WCET.
3: With the WCET of all basic blocks known, the cost of moving from one block to another in the directed graph is the WCET of the destination block.
4: Calculate the WCET by traversing the graph, eliminating all but the largest paths between blocks.
This, however, demands that the code is restricted, i.e. all loops and recursion must be bounded.
Multiprocessor scheduling
Scheduling problems for multi-core processor systems are significantly more complicated than for single-processor systems. Three issues have to be addressed:
1: the placement of tasks on processors
2: the scheduling of any shared network
3: the implementation of locks in shared-memory multiprocessor architectures
Global or partitioned placement
Two task placement schemes are possible:
- Global
- allocate task as they become runnable and tasks may be reallocated during execution
- Partitioned
- pre-run-time allocation
| | Global | Partitioned |
| --- | --- | --- |
| Advantages | Tasks are allocated as they become runnable and may be reallocated during execution, so spare capacity is shared flexibly | Does not require run-time support, and is able to handle systems that the global scheme would find difficult |
| Disadvantages | Difficult to identify the optimal scheduling policy. For single processors EDF is optimal, but no optimal scheme is known for multiprocessor systems | Difficult to generate a valid allocation of tasks: processors must not be overloaded, and finding optimal allocations for large numbers of tasks and processors is not possible without heuristics |
Scheduling the network
For bus-based tightly coupled multiprocessors, the behaviour of multi-level caches makes worst-case execution time (WCET) even harder to analyse. With network communication, the messages must be scheduled as well for reliable end-to-end communication. There are, however, several network schemes more suitable for timing analysis, e.g.
- Time Division Multiple Access(TDMA)
- each processor is allocated a fixed time slot within a specific cycle where the running task can send messages.
- Controller Area Network (CAN)
- each message is given a fixed priority and messages are sent according to priority.
TDMA is only suited to static task allocation, as no two processors should ever produce messages at the same time.
Mutual exclusion on multiprocessor platforms
Implementing mutual exclusion (mutex) in multiprocessor shared-memory systems requires locks that can be contended for globally. When a lock is requested by a task on another processor but not granted, the requesting task will busy-wait on the lock, waiting for it to become free.
A system with many globally shared objects and nested usage patterns(i.e task accessing one object while holding the lock to other objects) will be prone to deadlocks and livelocks using global locks.
Therefore the use of lock-free algorithms is a better alternative. Creating multiple copies of the data, where many are read from but only one is updated, is one way of implementing a lock-free algorithm.
Scheduling for power-aware systems
Power-aware systems are often systems that run on batteries, and they often use variable-speed processors to save power. If the frequency of the processor is lowered, the voltage required for stable operation is also lowered considerably. Because dynamic power consumption scales quadratically with the voltage used, running at frequencies requiring half the voltage can effectively quadruple battery life. This, however, raises a different scheduling problem: at what speed must the processor run when executing a set of tasks for the system to be schedulable?
In order to answer the question above we must first consider the following questions:
- Is the system schedulable when the processor is running at maximum speed?
- If the system is schedulable, by what maximum factor k can all the computation times
$(C_i)$ be increased such that the system still remains schedulable?
Incorporating system overheads
So far, the overheads of implementing a multitasking scheduling system have been ignored. However, in real systems the following have to be taken into consideration:
- The cost of a context switch (the cost of switching the processor from one task to another) is not negligible.
- All context switch operations are non-preemptive.
- The cost of handling an interrupt and releasing an application's sporadic task is not insignificant.
- A clock interrupt could result in periodic tasks being moved from a delay queue to the dispatch/ready queue.
Transaction Fundamentals
- Transaction
- The term refers to an atomic transaction, which provides an "all-or-nothing" property to work that is conducted within its scope. An atomic transaction has the same properties as an atomic action, plus the added feature that if a failure happens, the components are returned to their original state instead of being left possibly inconsistent, as in an atomic action.
ACID properties: The properties of an atomic transaction.
- Atomicity: The transaction completes successfully (commits) or if it fails (aborts) all of its effects are undone (rolled back).
- Consistency: Transactions produce consistent results and preserve the internal consistency of the data they act on.
- Isolation: Intermediate states produced while a transaction is executing are not visible to others. Furthermore, transactions appear to execute serially, even if they are actually executed concurrently.
- Durability: The effects of a committed transaction are never lost.
- Termination
- A transaction can be terminated in two ways: committed or aborted. When a transaction is committed, all changes made within it are made durable (forced onto stable storage). When a transaction is aborted, all of the changes are undone.
- Coordinator
- Associated with every transaction and is responsible for governing the outcome of the transaction. It can be implemented as a separate device or may be co-located with the user for improved performance. It communicates with enlisted participants to inform them of the desired termination requirements.
- Transaction manager factory
- responsible for managing coordinators for many transactions.
Atomicity
- Two-phase commit
- The industry-standard multi-phase commit protocol required to ensure that the transaction has an atomic outcome. In phase 1 the coordinator communicates with all the action participants to determine whether they will commit or abort. An abort reply, or no reply, from any participant acts as a veto, causing the entire action to abort. If the transaction commits, then phase 2 is entered and the coordinator forces the participants to carry out the decision. Each participant that made a commit response to the coordinator in phase 1 remains blocked until it has received the coordinator's phase 2 message. Note that the two-phase commit protocol is not client-server based: it simply talks about a coordinator and participants, and makes no assumption about where they are located.
Consistency
A transactional application should maintain the consistency of the resources that it uses. Unlike the other ACID properties, this is something the transaction cannot achieve by itself since it does not possess any semantic information about the resources it manipulates and it is therefore the programmer's responsibility to ensure consistency.
Isolation (Serializability)
- Interference free/Serializable
- If you have variables that are not shared between programs, you can execute some aspects of the programs concurrently. A partly concurrent execution order is termed serializable if it produces a result equivalent to executing in some serial order. If a program possesses this property it is said to be atomic with respect to concurrency.
Concurrency Control Mechanisms: To ensure serializability.
- Two-Phase Concurrency Control: All operations on an object are of type "read" or "write". Many computations can hold a "read lock", which a computation associates with an object before reading it. However, changing (inserting/modifying/deleting) the object requires a "write lock", which cannot be obtained if any other computation holds either a read or a write lock on that object. This ensures that no one changes something others are using in their operations. If you have obtained a write lock, it blocks all other requests for reading and writing until you release the lock. All computations must follow a two-phase locking policy. In phase 1, the growing phase, locks are acquired but not released. In phase 2, the shrinking phase, locks are released but cannot be acquired. To avoid the cascade roll back problem (see below), all locks are released simultaneously in the shrinking phase.
- Pessimistic Concurrency Control: When a resource is accessed, a lock is obtained on it. The lock remains held on that resource for the duration of the transaction. Drawbacks of this are:
- Deadlocks
- Concurrency may be limited by the granularity of the locks
- The overhead of acquiring and maintaining concurrency control information in an environment where conflict or data sharing is not high.
- Optimistic Concurrency Control: Locks are only acquired at the end of the transaction, when it is about to terminate. When updating a resource, the operation needs to check whether this conflicts with updates that may have occurred in the interim, using timestamps. Most transaction systems that offer optimistic concurrency control will roll back if a conflict is detected, which may result in a lot of work being lost.
- Type-Specific Concurrency Control: Concurrent read/write or write/write operations are permitted on an object from different transactions provided these operations can be shown to be non-interfering. Object-oriented systems are well suited to this approach, since semantic knowledge about the operations of objects can be exploited to control permissible concurrency within objects.
- Deadlock Detection and Prevention: If a deadlock is detected, the only way to resolve it is for at least one of the transactions to roll back. Two popular techniques for deadlock detection are:
- Timeout-based: A transaction waits a specified period of time and then rolls back, assuming there is a deadlock.
- Graph-based: Tracks waiting transaction dependencies by constructing a waits-for graph: nodes are waiting transactions and edges are waiting situations. Guaranteed to detect all deadlocks, but may be costly to execute in a distributed environment.
- The Cascade Roll Back Problem
- Suppose that a computation in its shrinking phase is to be rolled back, and that some of the objects with write locks have already been released. If some of these objects have been locked by other computations, then abortion of the computation will require these computations to be aborted as well.
Durability
Any state changes that occur during the transaction must be saved in such a manner that a subsequent failure will not cause them to be lost; for example, a database will typically maintain its state on disk in case of machine failure.
Two-Phase Commit Optimizations
There are several variants of the standard two-phase commit protocol (used to achieve atomic outcomes for transactions).
- Presumed abort
- If, during system recovery from failure, no logged evidence of the commit of some transaction is found by the recovery procedure, it assumes that the transaction was aborted and acts accordingly. This means that aborts need not be logged at all, and such logging can be saved under this assumption.
- One-phase
- If there is only one participant involved in the transaction, there is implicit consensus on whether to commit or not and the coordinator need not drive it through the prepare phase.
- Read-only
- A participant may indicate to the coordinator that it is responsible for a service that did not do any work during the course of the transaction, and it can then be left out of the second phase.
- Last resource commit
- It is possible for a single resource that is only one-phase aware (no prepare; only commit or roll back) to be enlisted in a transaction with two-phase aware resources. The coordinator executes the prepare phase on only the latter, and if the decision is to commit, the one-phase aware resource is then informed of this.
Synchronization
Synchronizations are informed that a transaction is about to commit, when it has completed, and in which state it completed.
Synchronizations essentially turn the two-phase commit protocol into a four-phase protocol. Before the transaction starts the two-phase commit, all registered Synchronizations are informed. What they then do is implementation-dependent: a typical implementation will flush a cached copy of some transactional object state to a database, where its ultimate fate will be controlled by a participant in the two-phase commit protocol. Any failure at this stage will cause the transaction to roll back. The coordinator then conducts the normal two-phase commit protocol (this does not involve the Synchronizations). When the transaction coordinator has finished the two-phase commit protocol, the Synchronizations are informed again.
Heuristic Transactions
Heuristic transactions are one of the most significant types of failure that can occur in a transaction system. They can give you or the system a lot of work to resolve, and they are unavoidable.
The two-phase commit protocol is blocking in order to guarantee atomicity. If a failure occurs, participants may remain blocked, even if failure recovery mechanisms exist. Some applications and participants cannot tolerate this blocking (for the same reasons as with deadlock). To break the blocking, participants that are past the prepare phase are allowed to make their own decision as to whether to commit or roll back. The choice will depend on the participant implementation and the application/environment. The participant must record this decision in case it later has to complete the original transaction. If the coordinator later informs the participant that the actual outcome differs from its decision, a heuristic outcome has happened, with a corresponding heuristic decision.
Two levels of heuristic interaction:
- Participant-to-coordinator
- If a participant makes a unilateral decision, it must assume that it has caused a heuristic and remember the choice. At this point the participant is in a heuristic state.
- Coordinator-to-participant
- If the participant's choice differs from the coordinator's, a heuristic outcome must be reported to the application. The transaction will be marked as committed or rolled back, and it is in a heuristic state.
Possible heuristic outcomes (the worst type of failure):
- Heuristic rollback
- The commit operation failed because the participants unilaterally rolled back the transaction.
- Heuristic commit
- An attempted rollback failed, because all of the participants unilaterally committed.
- Heuristic mixed
- Some participants were committed while others were rolled back.
- Heuristic hazard
- The outcome of some participants is unknown; those whose outcome is known have either all been committed or all been rolled back.
Heuristic decisions should be used with care and only in exceptional circumstances, since there is a possibility that the decision will differ from the one eventually determined by the coordinator, which can lead to loss of integrity in the system.
The Transaction Log
To guarantee atomicity in the presence of failure, it is necessary for the transaction service itself to maintain state. Transaction log: a record of the registered participants and of where each has reached in the protocol. After the coordinator's decision has been carried out, a participant can inform the coordinator and its log entry can be deleted.
Failure Recovery
Failures can occur in both centralized and distributed systems, and more components lead to a greater chance of failure. In a distributed system, failures are often independent.
Transactions are a good tool for ensuring consistency in the presence of concurrent users, and they are also an excellent fault-tolerance mechanism. When using a transaction you get the benefits of atomicity despite failures (all or nothing happens). Failures can, however, also affect the transaction system itself. Without failure recovery and fault-tolerance capabilities, the benefits a transaction system offers are seriously degraded.
In order to cope with failure, transaction service implementations must possess a failure recovery subsystem. The subsystem ensures that results of a transaction are applied consistently to all resources affected by the transaction. The original application does not need to restart in order to perform the recovery.
Recovery after failure requires the information from the transaction log, which is held in some durable state-store. The information is used to recreate the transaction, and the recovery subsystem then continues to complete it.
If participants waited for recovery to be driven from the coordinator, there would be issues: for example, if the coordinator itself also failed and recovered, it could take some time before its recovery subsystem gets around to recovering the specific transaction.
Most transaction systems require a failure recovery component in both the coordinator and participant machines. Implementing recovery is extremely complex. Most transaction service implementations do not provide it.
Trust Implications
The trust between the parties involved in the transaction is often overlooked. In any transaction there are three actors:
- The participant that is driven through the two-phase protocol.
- The coordinator, with which participants register and that drives them through the commit protocol.
- The application logic that tells the coordinator whether it wants to commit or rollback the transaction.
Whatever the application logic might want is never guaranteed by the coordinator, because failure can occur that forces the transaction to roll back. If the coordinator and participant behave correctly and no failures occur, then whatever the application logic tells the coordinator to do, the participant will follow. The coordinator can only do what it is told to by its users.
Types of Transactions
The most common form of transaction is the top-level transaction. Top-level transactions are most suitably viewed as "short-lived" entities, performing stable state changes to the system. They are less suited for structuring "long-lived" application functions, as such use may reduce the concurrency in the system to an unacceptable level.