TTK4145: Real Time Programming
# Contributors
Edited by Geir Kulia.
# Preface
The purpose of this compendium is to give a shorter summary of the curriculum in TTK4145 Real-Time Programming. It has been made to create an easier gateway into understanding the many terms and concepts of this subject.
It should be noted that the Institute has not taken part in the preparation of this compendium. It has been made as a collaborative text by many authors, and with little time to quality control, one must expect some typos and misspellings. Furthermore, it cannot be guaranteed that all important aspects of the curriculum have been mentioned.
Good luck with your exams!
# Introduction to real-time systems
This chapter gives a definition of a real-time system, as well as characteristics and examples of real-time systems. This chapter also presents languages for programming and the development cycle for real-time systems. This book considers three programming languages: Ada, Java and C, or more specifically: C/Real-Time POSIX, Real-Time Java and, well, Ada.
## Definition of a real-time system
The Oxford Dictionary of Computing's definition of a real-time system:
: Any system in which the time at which output is produced is significant. This is usually because the input corresponds to some movement in the physical world, and the output has to relate to that same movement. The lag from input time to output time must be sufficiently small for acceptable timeliness.
'Timeliness' is system-dependent. For instance, a missile guidance system requires output within milliseconds, whereas a computer-controlled water dam may only need output within a second.
Young's (1982) definition of a real-time system:
: Any information processing activity or system which has to respond to externally generated input stimuli within a finite and specified period.
Another definition is (Randell et al., 1995):
: A real-time system is a system that is required to react to stimuli from the environment (including the passage of physical time) within time intervals dictated by the environment.
The correctness of a real-time system depends not only on the logical result of the computation, but also on the time at which the results are produced. Hard real-time systems are those where it is absolutely imperative that responses occur within the specified deadline. For soft systems, the system will still function correctly if deadlines are occasionally missed. A system may have both soft and hard real-time sub-systems, and soft and hard deadlines. A service may even have both a soft and a hard deadline with different reactions to the two deadlines.
A deadline that can be missed occasionally, but where there is no benefit from late delivery, is called firm.
Another means of classifying the role that time has in a real-time system is to distinguish between reactive systems and time-aware systems. Time-aware systems make explicit references to the time frame of the enclosing environment. A reactive system is more concerned with relative times; however, since reactive systems are often also control systems, they need to be synchronized with their environment.
In order for reactive systems to 'keep up with their environment', they are often structured to be time-triggered: all computations are periodic in that they have a defined cycle time, and are released for execution by an internal clock. The alternative is event-triggered, where the environment explicitly controls the release for execution of some activity. These activities are termed aperiodic, or sporadic if the number of releasing events within a time interval is bounded.
## Examples of real-time systems
Elevators...
A system where a computer interacts with the environment through sensors (for instance temperature or pressure transducers) and actuators (valves, motors in general) is an example of a real-time system, e.g. process plants, elevators, washing machines, etc. Real-time systems are used in process control, manufacturing, communication, multimedia, command and control, and pretty much everywhere else. These examples are thoroughly explained in the book for those interested.
## Characteristics of real-time systems
The following are characteristics of real-time systems, which will be thoroughly examined in the later chapters of the book. A real-time system need not, of course, exhibit all these characteristics, but should have facilities which support them.
### Real-Time facilities
As mentioned, response time is crucial in any embedded (real-time) system. As this is often near impossible to guarantee under all possible conditions with the available computing resources, real-time systems are often constructed using processors with considerable spare capacity. This is to ensure that a worst-case scenario does not produce unwelcome delays.
Given adequate processing power, language and run-time support are required to enable the programmer to:
- specify times at which actions are to be performed;
- specify times at which actions are to be completed;
- support repeating (periodic or aperiodic) work;
- control (i.e. bound) the jitter on input and output operations;
- respond to situations where not all of the timing requirements can be met;
- respond to situations where the timing requirements are changed dynamically.
These are called real-time control facilities, and enable the program to synchronize with time itself.
Some systems are more demanding at certain periods, and therefore need to change their timing requirements dynamically. As an example, an aircraft flight control system needs to focus on the task at hand, and also be ready to give all computing resources to handling an emergency. Such changes are generally known as mode changes, and they also have consequences for the temporal characteristics of the executing software.
### Concurrent control of separate system components
Often, an embedded system must handle several coexisting external elements with which the computer must interact simultaneously. It is then necessary to consider distributed and multiprocessor embedded systems. The problem then becomes how to express this concurrency in the structure of the program.
Chapters 4, 5 and 6 will consider various models of concurrent programming, and how to achieve reliable communication and synchronization between concurrent processes in the presence of design errors.
### Low-level programming
An embedded system typically requires the computer components to interact with the external world, by monitoring sensors and controlling actuators for a wide variety of devices. These devices interface to the computer via input and output registers, their operational requirements are device- and computer-dependent, and they may also generate interrupts. A real-time system is also often run from a battery in a mobile device. We have also mentioned that real-time systems are very dependent on response time. Low-level programming reduces both delay and extra power consumption by using built-in language features. High-level programming enables the programmer to abstract away from implementation details, and to concentrate on solving the problem at hand. For a real-time programmer, this luxury is unacceptable, since delay and power consumption are two central problems when programming a real-time system.
### Support for numerical computation
When a real-time system is to control some engineering activity, it often involves continuous signals. For very simple systems, one might have an analog controller working on the continuous signals. One wishes to calculate a control input based on comparing the output with a reference signal. This will often require heavy calculations, depending on the complexity of the model and the number of inputs and outputs. Because of these difficulties, most controllers are implemented as digital computers. Since nature is analog, and computers are digital, converters are needed in order to convert analog signals to digital, and vice versa. This must again be taken into account when designing the control algorithms. A fundamental requirement of a real-time programming language, therefore, is the ability to manipulate real or floating-point numbers. Fortunately, most engineering languages do provide the necessary abstractions in this area.
### Large and complex
Largeness is, by Lehman and Belady (1985), related to variety. Since most real-time systems respond to real-world events, the amount of variety is great. Hence, most real-time systems will exhibit the rather undesirable property of largeness. Due to this largeness, real-time systems need to be broken down into smaller components which can be managed effectively. Real-time languages provide features such as abstract data types, classes, objects, generic components, separate compilation, etc. to manage this software complexity.
### Extremely reliable and safe
For some systems, failure is simply not an option. Systems providing life support, controlling nuclear plants or transferring money cannot fail, and if they do fail, they must fail in a controlled way. When human interaction is required, care must be taken in the design of the interface in order to minimize the possibility of human error. Chapters 2 and 3 will consider the problems of producing reliable and safe software. Coping with both expected and unexpected error conditions will be examined in chapters 7 and 13(?).
### Structure of the book
If you are interested in the structure of the book, read the book, not the compendium...
## Development cycle for real-time systems
The most important stage in the development of any real-time system is the generation of a consistent design that satisfies an authoritative specification of requirements. The book does not focus on issues of design, but rather the investigation of language and operating system primitives, which allow designs to be realized. This section will give a brief overview of some of the typical stages that are passed through in the top-down design approach:
- requirements specification - during which an authoritative specification of the system's required functional and meta-functional behaviour is produced;
- architectural design - during which a top-level description of the proposed system is developed;
- detailed design - during which the complete system design is specified;
- coding - during which the system is implemented;
- testing - during which the efficacy of the system is tested.
### Requirements specification
Almost all computing projects start with an informal description of what is desired, followed by an extensive analysis of requirements. In this stage, functionality is defined, temporal behaviour is made explicit, and the reliability requirements and the desired behaviour of the software in the event of component failure are specified. The environment must also be modelled. Since interaction with the environment is especially important for a real-time system, the maximum rate of interrupts, the maximum number of objects and the failure modes are all important.
### Design activities
Decomposition and abstraction form the basis of most software engineering methods. Decomposition involves the systematic breakdown of the complex system into smaller and smaller parts, until components are isolated that can be understood and engineered by individuals or small groups. Abstraction enables the consideration of detail, particularly that appertaining to implementation, to be postponed. Sub-levels in the design must have well-defined roles in order to be verifiable. If the specification of the entire system can be verified just in terms of the specification of the immediate subcomponents, the decomposition is said to be compositional. This is an important property when formally analysing programs.
Whenever a class (object) is used, it typically involves the addition of some form of task (process). Both object and task abstractions are important in the design and implementation of reliable embedded systems. These forms of encapsulation lead to the use of modules. From the definition of modules, more sizable components can be defined that may even be re-usable in subsequent designs.
Cohesion and coupling are two metrics that are often used to describe the relationships between entities within a design. Cohesion is concerned with how well a module holds together - its internal strength. Allworth and Zobel gave six measures of cohesion, ranging from the very poor (coincidental) to the strong (functional). Coupling, by comparison, is a measure of the interdependence of program modules. Two modules passing control information between them are said to be tightly coupled, or have high coupling. At the other end, you have loose coupling if only data is communicated. Loose coupling means that a module is easy to replace with another one, i.e. what we want.
A good decomposition is one with strong cohesion, and loose coupling. This principle is equally true in sequential and concurrent programming domains.
### Testing and simulation
To assist in any complex testing activity, a realistic test bed presents many attractions. For software, such a test environment is called a simulator. Using a simulator helps create normal, as well as abnormal, system behaviour without having to melt down a nuclear reactor in order to test the system. It is crucial that the simulator is able to reproduce accurately the sequence of events expected in the real system. As mentioned before, high reliability requirements are the essence of most real-time systems, so it is clear that testing must be extremely stringent.
## Languages for programming real-time systems
There are three classes of programming languages to be identified: assembly languages, sequential systems implementation languages and high-level concurrent languages. These will be reviewed, after introducing some general language design criteria.
### General language design criteria
Young (1982) lists the following six criteria as the basis of a real-time language design: security, readability, flexibility, simplicity, portability and efficiency.
Security
: The security of a language design is a measure of the extent to which programming errors can be detected automatically by the compiler or language run-time support system. Some errors, like errors in the programmer's logic, cannot be detected automatically. A secure language must, therefore, be well structured and readable so that such errors can easily be spotted. The disadvantage of security is compiler complexity and compilation time (I'm not slacking off, my code is compiling!).
Readability
: Readability depends on the language having an appropriate choice of keywords, the ability to define types and facilities for program modularization. Good readability leads to increased maintainability and security, and reduced documentation costs. It does, however, tend to increase the length of any given program.
Flexibility
: Flexibility means that the language allows the programmer to express all the required operations in a straightforward and coherent fashion, without having to resort to operating system commands or machine code.
Simplicity
: Simplicity reduces the costs associated with programmer training, minimizes the effort required to produce compilers and diminishes the possibility of making programming errors as a result of misinterpretation of the language. Simplicity and flexibility can also be related to expressive power, the ability to express the solution of a wide range of problems, and usability, the ease of use of the language.
Portability
: Portability is the ability to run a program independently of the hardware on which it is executed. Java claims to have this ability. This is difficult to achieve for a real-time system; however, it should be possible to isolate the machine-dependent part of the program from the machine-independent part.
Efficiency
: Efficiency is the language's ability to allow efficient and predictable programs to be produced. Response times are crucial in a real-time system; however, this must be balanced against security, flexibility and readability requirements.
### Assembly languages
Initially, most real-time systems were written in assembly language, mostly due to the lack of high-level language support on microcomputers. Assembly language achieves efficient implementations, but is machine-oriented rather than problem-oriented. This keeps development costs high and makes maintenance difficult and expensive.
### Sequential systems implementation languages
Most common programming languages used for real-time programming today are sequential. They also tend to be weak in the facilities they provide for real-time control and reliability. As a result of these shortcomings, it is often necessary to rely on operating system support and assembly code inserts.
### High-level concurrent programming languages
The software crisis in the 1970s had several symptoms. In short, the general language design criteria discussed above were seldom met. This was partly because there were too many general-purpose languages in use. In the American Department of Defense (DoD), over 450 different languages were used in their embedded computer applications. The DoD concluded that a single language was a desirable goal, and that no current language was suitable, so they decided to develop a new language, which resulted in Ada in 1983. Ada was later updated in 1995, 2005 and 2012, which is the current version.
# Reliability and fault tolerance
Real-time and embedded systems usually have much higher requirements for reliability and safety. It is pretty annoying when Spotify crashes or Matlab hangs, but these problems have little to no impact on the surroundings. A control system for, say, avionics has much higher demands on reliability.
## Faults and failure
Let us first clear up some definitions.
Reliability
: A measure of the success with which the system conforms to some authoritative specification of its behaviour.
Fault
: A mechanical or algorithmic cause which, given the right conditions, produces an unexpected or incorrect result.
Error
: An internal unexpected problem in a system, caused by activation of a fault.
Failure
: System deviation from its specification.
Note that this implies a system maintains its reliability until a failure, and a failure only happens when the system does not interact with the environment in the prescribed manner. This means a reliable system either prevents activation of faults, or prevents errors from propagating through the system. This chain is shown in figure 1. Note that a failure in a subsystem will lead to a fault in the system.
There are four sources of faults that can result in a system failure:
- Inadequate specification
- Software faults
- Hardware faults
- Interference in the communication subsystems
The last two are in some ways predictable errors, in that the system designer will know what happens. For example, if a network module fails, a message will not be delivered. However, when an error comes from incorrect interaction or wrong code, the result can be more unpredictable, as the scope of these effects cannot be foreseen as clearly as for a hardware error. In this regard, a fault is the source of misbehaviour. Regardless of the above sources, there are three types of faults that can exist in a system.
Also note that most system faults originate from incorrect or incomplete specification.
Transient faults
: These faults occur at a particular time, remain in the system for a period, then disappear afterwards. An example of this is radioactivity.
Permanent faults
: A fault that persists in the system until it is fixed. All software faults are permanent faults.
Intermittent faults
: A special case of transient faults that occur repeatedly. Overheating is an example.
It is worth noting that software faults have a special name: the notorious bug. Bugs can be divided into two categories, with names taken from their quantum physics counterparts. Bohrbugs are easily reproduced and easy to find and thus remedy. Heisenbugs are more complex bugs that arise from certain, often complex, combinations of events. A good example of this is threads that are not properly synchronized, producing random behaviour depending on the state of each thread.
## Failure modes and fault prevention
There are, generally speaking, two ways a system can fail. When a system produces a value failure, the output is wrong or even corrupted/unusable. When the system does not deliver the value in the right time slot, it has produced a time failure. A combination of these two is called an arbitrary failure. Note that while time failure intuitively covers delayed data, undelivered (delayed infinitely) data and even information that is delivered too early are also considered time failures.
The book also gives an extensive list of different ways a system might fail. These are given in a bullet list on page 32, and cover everything from complete chaos (uncontrollable failure) to an immune system (never fail), via the controlled shutdown/reduced operation failures.
Fault prevention is an approach that aims to reduce the failures in a system by reducing the number of faults. The two main ways to do this are pretty intuitive: you can either find and remove the faults, or you can prevent their introduction in the first place. These are known as fault removal and fault avoidance. While there are hardware aspects to fault avoidance, such as using components that are suited for the task at hand, the most important is to write software with as few faults as possible. There are several ways to do this. Design methods, for example UML, and programming languages designed for real-time systems, such as Ada, will help reduce the number of faults. Extensive testing and simulation of the system will also help reduce the number of faults present during operation.
## Fault tolerance and redundancy
Even though all precautions might have been taken during design, there will, in practice, still be faults in the system when it goes live. No software programmer writes 100% correct code all the time. No design specification is complete, consistent and unambiguous. Testing can only show the presence of faults, not their absence. Thus a system designer has to make sure the system continues to operate when a fault activates. This is known as fault tolerance, and the curriculum describes three different levels.
Full fault tolerance
: ensures the system continues operation as normal, at least for some time.
Graceful degradation
: reduces the functionality to only perform core tasks while the system waits for repairs.
Fail safe
: the system stops all operation, but ensures its integrity. For example, a failing pitch controller on a helicopter might reset the pitch to zero before shutting down.
In order to achieve this, system designers introduce extra elements to detect and handle faults. This is called redundancy, and it can be implemented in both software and hardware. We will focus on the software part, where N-version programming and recovery blocks are the most common techniques.
## N-version programming
N-version programming relies on several pieces of software computing the same result. There are several degrees of diversity, ranging from using different compilers for the same program, to letting several teams build the same functionality in different programming languages. During execution, a driver process starts the programs, receives the results and acts depending on the output. If the programs do not produce the same result, the driver might choose the most common result, ask the processes to compute it again, or simply terminate the faulty process.
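To make the voting idea concrete, here is a minimal C sketch with hypothetical functions version_a/b/c standing in for the independently developed versions (in a real system these would run concurrently, preferably on separate hardware), and a driver that majority-votes over their results:

```c
#include <stdio.h>

/* Hypothetical: three independently developed versions of the same computation. */
int version_a(int x) { return x * x; }
int version_b(int x) { int r = 0; for (int i = 0; i < x; i++) r += x; return r; }
int version_c(int x) { return x * x; }

/* Driver: run all versions and majority-vote over the results. */
int vote(int x, int *ok) {
    int r[3] = { version_a(x), version_b(x), version_c(x) };
    *ok = 1;
    if (r[0] == r[1] || r[0] == r[2]) return r[0];
    if (r[1] == r[2]) return r[1];
    *ok = 0;                       /* no majority: report failure to the caller */
    return 0;
}

int main(void) {
    int ok;
    int result = vote(7, &ok);
    if (ok) printf("agreed result: %d\n", result);
    else    printf("no majority - treat as a failure\n");
    return 0;
}
```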
One of the downsides to this solution is that, in a real-time system, the driver process and the different programs might need to communicate, introducing a communication module as well. Add to this the fact that software is by far the most extensive and time-consuming part of a real-time system, and the project costs might double by introducing a redundant version. There are also increased strains on hardware, or even a need for separate hardware pieces, which increases the cost and/or complexity of the system. As a final note, recall that most software faults originate from the specification; thus all N versions might suffer from the same fault.
## Dynamic redundancy
Note that N-version programming is a case of static redundancy: all N versions of the program run continuously, whether the system fails or not. An alternative to this is to have modules in the background that only run when an error is detected. This can decrease resource usage while still handling errors properly. This type of redundancy has four stages.
### Error detection
Since the recovery system is not continually running, there needs to be some way to detect a fault. There are many triggers that could be used.
- Replication checks are an extension to N-version programming that starts a recovery routine if one or more of the N results are wrong. Also consider a 2-version system: we cannot know which result is correct, but we know something is wrong, and action has to be taken.
- Timing checks take the form of a watchdog timer that each process has to reset within a set interval. Note that timing checks can only be used to check for time failures. Other error detection systems have to be used for value failures.
- A reversal check can be used if there is a one-to-one relationship between the input and output. Thus, given an output, we can do a reversal calculation and compare it to the original input.
- Coding checks rely on extra information sent along with the data, which can be compared on the receiving side. This is often used in communication. For example, imagine an integer is being sent on the network. A checksum might be the square of the number, which is calculated on the sending side. When both numbers are received, the square root of the second number is compared to the first: a mismatch is proof of error. (A small sketch of a coding check follows this list.)
- Reasonableness checks rely on knowledge of the system. This includes range checks, such as date and time validation.
- Structural checks are used to ensure the data structures are correct. We know the number of items in a list; if this changes when it is not supposed to, something is wrong. Go users might recall that the json.Unmarshal function produces an error when the information it has received does not fit the target struct.
- Dynamic reasonableness checks compare the current output to previous outputs. Imagine a system of the form $x' = ax$ suddenly changing from a large positive value to a small negative value: this jump should not occur, and something is wrong.
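To make the coding-check idea a bit more concrete, here is a minimal, self-contained C sketch (the additive checksum and the frame layout are made up for the example; real systems would typically use a CRC):

```c
#include <stdint.h>
#include <stdio.h>

/* Simple additive checksum over a byte buffer. */
static uint8_t checksum(const uint8_t *data, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += data[i];
    return sum;
}

int main(void) {
    uint8_t frame[5] = { 10, 20, 30, 40, 0 };
    frame[4] = checksum(frame, 4);   /* sender appends the check value */

    frame[2] ^= 0x08;                /* simulate interference on the link */

    /* Receiving side: recompute and compare - a mismatch reveals the value failure. */
    if (checksum(frame, 4) != frame[4])
        printf("coding check failed: message rejected\n");
    else
        printf("message accepted\n");
    return 0;
}
```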
### Damage confinement
Normally, errors are not detected the instant they occur. Thus when an error is detected, the system has to figure out how far it has spread. Damage confinement concerns how the rest of the system is affected by an error. The two main techniques for doing this are modular decomposition and atomic actions. Modular decomposition emphasizes breaking down the system into components where each component only communicates with others through well-defined interfaces. The internal details are hidden from other modules. If done properly, all errors are detected before another module needs the result. Atomic actions will be described in a later chapter.
### Error recovery
When the error is found and we know the extent of it, the recovery process can start. There are two methods of error recovery: forward and backward recovery. Forward recovery involves continuing as before with selected corrections. While this is an efficient method of error recovery, it requires a good understanding of what the error is and how it can be remedied, else the corrections might only make things worse. Backward recovery steps back to a previous state, called a recovery point, and tries to accomplish the same action with a different method. This works straightforwardly when there are no threads, but it quickly gets complicated with multiple threads, because of communication between threads: if we need to roll one thread back to an earlier recovery point, we might have to undo a message sent to another thread. If this is the case, the other thread might have to roll back, requiring messages sent to yet other threads to be undone. Thus, because one thread has had an error, multiple other threads will have to roll back as well. This is called the domino effect.
### Fault treatment
While error recovery remedies the error, it does not remove the underlying fault. This fault might be easy to identify and/or remedy, so that the experienced error will not happen again. Both hardware and software faults are relatively easy to fix once identified, if the system is allowed to be brought offline. If the program has to be modified while executing, things get a little more complicated.
## Recovery blocks
Recovery blocks are an implementation of dynamic recovery applied to software. Each entry to a software block is defined as a recovery point, and an acceptance test must pass before the block can exit. The method is summarized in Figure 2. Note that the acceptance test can involve several of the methods discussed for error detection. For a short summary of the difference between N-version programming and recovery blocks, read chapter 2.7.
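Since dedicated recovery block syntax (ensure <acceptance test> by <primary> else by <alternative> else error) is not available in mainstream languages, here is a rough C approximation under assumed names (primary, alternative and acceptable are all hypothetical); the point is the saved recovery point, the ordered alternates and the acceptance test guarding the exit:

```c
#include <stdio.h>

typedef struct { double value; } state_t;

/* Hypothetical primary and alternative algorithms for the same job. */
int primary(state_t *s)     { s->value = -1.0; return 0; }  /* pretend this one is buggy */
int alternative(state_t *s) { s->value =  4.2; return 0; }

/* Acceptance test that must pass before the block may exit. */
int acceptable(const state_t *s) { return s->value >= 0.0; }

int recovery_block(state_t *s) {
    state_t recovery_point = *s;                  /* state saved on entry */
    int (*alternates[])(state_t *) = { primary, alternative };

    for (unsigned i = 0; i < 2; i++) {
        *s = recovery_point;                      /* backward recovery: restore saved state */
        if (alternates[i](s) == 0 && acceptable(s))
            return 0;                             /* acceptance test passed */
    }
    *s = recovery_point;
    return -1;                                    /* all alternates failed: raise the error upwards */
}

int main(void) {
    state_t s = { 0.0 };
    if (recovery_block(&s) == 0) printf("accepted: %.1f\n", s.value);
    else                         printf("recovery block failed\n");
    return 0;
}
```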
## Measuring reliability
Hardware reliability is pretty simple to measure: take a bunch of similar components, test them, and establish a measure of reliability from the number of components that fail. Software reliability, on the other hand, is much harder to measure, as code only comes from one source and does not deteriorate over time. Thus new measures of reliability have to be established. At first, the notion that software was either working or not working was used. As discussed, software can be designed to maintain only critical functionality when shit hits the fan, and thus this boolean interpretation of software reliability is not accurate enough.
A more recent definition of software reliability is based on probability. More specifically, it is determined as the probability that a given program will operate correctly in a specified environment for a specified length of time.
One main way to calculate this is to use a growth model, increasing the reliability for every fault that is corrected in the system. Another method is to calculate this probability statistically by applying a number of test cases to the system and, without correcting any faults, determining the proportion of correct cases.
# Exceptions and exception handling
There are five general requirements for an exception-handling facility:
- As with all language features, the facility must be simple to understand and use.
- The code for exception handling should not be so obtrusive as to obscure understanding of the program's normal error-free operation. A mechanism which intermingles code for normal processing and exceptional processing will prove difficult to understand and maintain. It may well lead to a less reliable system.
- The mechanism should be designed so that run-time overheads are incurred only when handling an exception. Although the majority of applications require that the performance of a program which uses exceptions is not adversely affected under normal operating conditions, this may not always be the case. Under some circumstances, in particular where speed of recovery is of prime importance, an application may be prepared to tolerate a little overhead on the normal error-free operation.
- The mechanism should allow the uniform treatment of exceptions detected both by the environment and by the program. For example, an exception such as arithmetic overflow, which is detected by the hardware, should be handled in exactly the same manner as an exception raised by the program as a result of an assertion failure.
- The exception mechanism should allow recovery actions to be programmed.
## Exception handling in older real-time languages
Unusual return value is one of the simplest methods for exception handling, and works as follows:
if (function_call(parameters) == AN_ERROR) {
/* error handling code */
} else {
/* normal return code */
}
If a function already returns a value and it is not possible to partition the range of values to indicate an error, then status flags are used. These are atomic shared variables which can be set and tested.
With a forced branch, the instruction immediately following the subroutine call is skipped to indicate the presence (or absence) of an error. This is achieved by the subroutine incrementing its return address (program counter) by the length of a simple jump instruction to indicate an error-free (or error) return.
## Modern exception handling
While most of the old real-time languages have error handling that changes the flow of execution in the program, most modern languages have exception handling built directly into the language.
There are four classes of exceptions.
- Detected by the environment and raised synchronously - an array bounds violation or divide by zero are examples of such exceptions.
- Detected by the application and raised synchronously - for example, the failure of a program-dened assertion check.
- Detected by the environment and raised asynchronously - an exception raised as the result of power failure or the failure of some health-monitoring mechanism.
- Detected by the application and raised asynchronously - for example, one process may recognize that an error condition has occurred that will result in another process not meeting its deadline or not terminating correctly.
Chapter 3 mainly focuses on synchronous exception handling.
Ada requires exceptions to be declared as constants, while Java and C++ have a more object-oriented view.
## Exception propagation
If an exception is raised, it is not necessarily a handler associated with that block that takes care of it. There are two possible methods for dealing with a situation where no immediate exception handler can be found.
The first approach is to regard the absence of a handler as a programmer error which should be reported at compile-time. This can be challenging, as it is not always possible for the compiler to check whether the calling context includes the appropriate exception handlers.
The second approach is to look for handlers up the chain of invokers at run-time; this is called propagating the exception. This is the approach taken in Ada, Java and C++. If an exception is propagated outside its scope, it will be impossible to find a handler. Most languages have a 'catch all' exception handler to solve this potential problem.
The resumption model
: Using this model the handler tries to cure the problem that caused the exception to be raised, and the invoker resumes as if nothing happened. This model is also called the notify model. This model is illustrated on p.67 in the textbook.
The termination model
: Using this model, the control is not returned to the invoker after an exception is taken care of. This model is the most common, and is also called the escape model.
The hybrid model
: With the hybrid model the handler decides whether the error is recoverable or not, and can choose if the invoker is to be terminated.
## Exception handling in Ada, Java and C
See textbook for examples.
### Ada
Ada supports explicit exception declaration in the same fashion as constants, with the keyword exception. An exception can be declared in the same place as any other declaration and has the same scope. Some standard exceptions, like Constraint_Error and Storage_Error, have the whole program as scope.
Every block in Ada can contain an optional collection of exception handlers declared at the end of the block. They have the following syntax: when (the exception identity) => (action). See textbook for more details. when others can be used as the last exception-handling choice to pick up all exceptions not listed previously.
In Ada, an exception that is raised and not handled by a subprogram is propagated to the caller of the subprogram. Therefore, such an exception will only be handled by the initialization code if it itself called the subprogram. See textbook (p.73) for example.
The exception can also be propagated by a program re-raising the exception in the
local handler. This facility is useful in the programming of last wishes. See textbook (p.74) for example.
Exceptions can never fully meet requirement 3, as detecting possible error conditions requires some resources. The Ada language provides a facility for suppressing the run-time checks that raise the standard exceptions, if these checks become too costly for a particular application.
Difficulties with the Ada model of exceptions:
- It can be hard to keep track of where an exception is raised.
- Ada does not allow a full range of parameters to be passed to handlers - only a character string. This can be inconvenient if an object of a particular type needs to be passed.
- It is possible for exceptions to be propagated outside of the scope of their declaration. Such exceptions can only be trapped by when others. However, they may come back into scope again when propagated further up the dynamic chain. This is disconcerting, although probably inevitable when using a block structured language and exception propagation.
### Java
Java supports a termination model of exception handling, like Ada, but the exceptions are integrated into the object-oriented model.
Raising an exception in Java is called throwing the exception, and it can only be handled from within a try-block. Each handler is specified using a catch statement, which works like a function declaration. A handler with parameter type T will catch a thrown object of type E if:
- T and E are the same type; or
- T is a parent (super) class of E at the throw point.
As with Ada, if no exception handler is found in the calling context of a function, the calling context is terminated and a handler is sought in a try-block within its calling context. Java supports a finally clause as part of a try-block. The code attached to this clause is guaranteed to execute whatever happens in the try-block, irrespective of whether exceptions are thrown, caught or propagated, or, indeed, even if no exceptions are thrown at all.
### C
C does not define any exception-handling facilities within the language. This limits the possibilities for programming reliable systems, but it is possible to provide some form of exception-handling mechanism by using the macro facility of the language. To implement a termination model in C, it is necessary to save the status of a program's registers, and so on, on entry to an exception domain and then restore them if an exception occurs. This can be done using setjmp and longjmp.
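As a minimal sketch of that idea (the exception "identity" DIVIDE_BY_ZERO is just an integer constant made up for the example): setjmp saves the context on entry to the exception domain, and longjmp acts as the raise, giving a termination-model handler:

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf recovery_point;        /* saved register/stack context */

#define DIVIDE_BY_ZERO 1              /* hypothetical exception identity */

double divide(double a, double b) {
    if (b == 0.0)
        longjmp(recovery_point, DIVIDE_BY_ZERO);  /* "raise": restore the saved context */
    return a / b;
}

int main(void) {
    switch (setjmp(recovery_point)) { /* returns 0 when the exception domain is entered */
    case 0:
        printf("result: %f\n", divide(1.0, 0.0)); /* the raise aborts this branch */
        break;
    case DIVIDE_BY_ZERO:
        /* termination model: control does not return to the point of the raise */
        printf("handled divide-by-zero\n");
        break;
    }
    return 0;
}
```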
## Summary
Although many different models exist, they all address the following issues:
- Exception representation - an exception may, or may not, be explicitly represented in a language.
- The domain of an exception handler - associated with each handler is a domain which specifies the region of computation during which, if an exception occurs, the handler will be activated. The domain is normally associated with a block, subprogram or a statement.
- Exception propagation - this is closely related to the idea of an exception domain. It is possible that when an exception is raised there is no exception handler in the enclosing domain.
- Resumption or termination model - this determines the action to be taken after an exception has been handled.
- Parameter passing to the handler - this may or may not be allowed. The table below shows the exception-handling facilities of various languages.

| Language | Domain | Propagation | Mode | Parameters |
|----------|--------|-------------|------|------------|
| Ada | Block | Yes | Termination | Limited |
| Java | Block | Yes | Termination | Yes |
| C++ | Block | Yes | Termination | Yes |
| CHILL | Statement | No | Termination | No |
| CLU | Statement | No | Termination | Yes |
| Mesa | Block | Yes | Hybrid | Yes |
It is not unanimously accepted that exception-handling facilities should be provided in a language. To sceptics, an exception is a goto where the destination is indeterminable and the source is unknown. They can, therefore, be considered to be the antithesis of structured programming.
# Concurrent programming
## Introduction
Definition: Concurrent programming is the name of techniques used to express potential parallelism and solve the resulting synchronization and communication challenges.
Motivation behind concurrent programming:
- Real-world systems (robots etc.) have a parallel nature.
- Modern processors are much faster than the I/O devices with which they interact, and a sequential program that is waiting for I/O is unable to perform other operations.
- To allow more than one processor to solve a problem.
- Without concurrency, the software must be constructed as a single control loop. This structure cannot retain the logical distinction between system components.

Speed-up from parallelizing: Amdahl's law gives that
$\text{Max speed-up} = \frac{1}{(1-P)+\frac{P}{N}}$
where P is the proportion of code that benefits from parallelization and N is the number of available processors.
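As a quick illustration with arbitrarily chosen numbers: if $P = 0.9$ and $N = 4$, the maximum speed-up is $\frac{1}{0.1 + 0.9/4} \approx 3.1$, and even with infinitely many processors it can never exceed $\frac{1}{1-P} = 10$.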
## Terms and definitions
### Task/thread/process
A task/thread is the unit of parallelism and a single thread of control. A process is a collection of threads working within their own shared memory context.
### Fundamental facilities
All concurrent programming consists of these fundamental facilities:
1. Activities (threads, tasks or processes).
2. Synchronization mechanisms.
3. Support of communication between concurrent activities.
### Types of interaction
Interaction between tasks can be grouped as one of the following:
- Independent: No synchronization or communication between threads.
- Cooperating: synchronization and communication between threads.
- Competing: the tasks compete for their share of resources. Although they communicate and synchronize to obtain resources, they are essentially independent.
### Hierarchy
Based on hierarchy and relation, one often distinguishes between tasks as being:
- Parent/child: Responsibility for the creation of another task. The parent may be delayed while the child is being created and initialized.
- Guardian/dependant: A task may be dependent on the guardian task itself. The guardian is not allowed to terminate until all dependent tasks have terminated.
- Sometimes the parent is also the guardian, but not necessarily with dynamic tasks.
## Concurrent execution
### Variation between the languages
The models of concurrency in the programming languages vary, especially when it comes to:
- Structure: The number of tasks can be either:
- Static: fixed at compile-time
- Dynamic: tasks are created any time
- The level of parallelism: Tasks can be either:
- Nested: allowed to be defined within other tasks
- Flat: not nested
- Granularity (level of detail): Within languages that support nested constructs, we distinguish between:
- Coarse-grain parallelism: the program contains few tasks, each with a longer life. (Applies to most concurrent programming languages, typified by Ada.)
- Fine-grain parallelism: the program contains a large number of simple tasks, some of which exist only for a single action (e.g. occam2).
- Initialization: The tasks can be supplied with information by:
- Passing it as parameters (most modern languages allow this)
- Communicating explicitly after it has commenced its execution.
- Termination: Can be done in a variety of ways:
1. Normal completion of execution of the task's body
2. Suicide, by a self-terminate statement
3. Abortion, through action of another task
4. An untrapped error condition
5. Never (when executing non-terminating loops)
6. When no longer needed
## Task representation
The three basic mechanisms for representing concurrent execution are: fork and join, cobegin and explicit task declaration. The latter is most used in modern real-time languages.
### Fork and join
The fork statement specifies that a designated routine should start executing concurrently with the invoker of the fork. The join statement allows the invoker to synchronize with the completion of the invoked routine. For example:
A procedure P (parent) begins and forks function F (child). They will now execute concurrently until the point where P states join. Here P will wait for F to finish (if it hasn't already done so).
Fork and join allow for dynamic task creation and provide a means of passing information to the child via parameters. This can however be error prone in use, for example, in some systems a guardian must explicitly rejoin all dependants rather than merely wait for their completion.
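In POSIX C, the fork and join idea maps fairly directly onto fork() and waitpid(); a minimal sketch:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();            /* fork: create a concurrently executing child process */
    if (child == -1) { perror("fork"); return 1; }

    if (child == 0) {                /* child branch: runs concurrently with the parent */
        printf("child working...\n");
        exit(0);
    }

    /* parent continues here, then "joins" by waiting for the child to complete */
    waitpid(child, NULL, 0);
    printf("parent: child has completed\n");
    return 0;
}
```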
### Cobegin
The cobegin denotes the concurrent execution of a collection of statements, from $S_1$ to $S_n$. The cobegin terminates (equals join in the "Fork and join") when all statements have terminated. Data can be passed to the invoked tasks via parameters of the procedure call. A cobegin statement can even include a collection of statements that itself has a cobegin within it.
### Explicit task declaration
The routines within the "Explicit task declaration" can state themselves whether they will be executed concurrently. This has become the standard way of expressing concurrency in real-time languages.
Ada: The unit of parallelism is called a task in Ada. Tasks consist of a specification and a body. They can be passed initialization data upon creation, but only discrete types and access (pointer) types. The tasks can be given as task types. Tasks that do not have types declared for them and have no pointers to them are called anonymous.
Dynamic tasks can be created by giving a non-static value to the bounds of an array (of tasks) or by using the 'new' operator on an access type (of a task type). Tasks created by the 'new' allocator have the important property that the guardian (called master in Ada) is not the block in which the task was created, but the one that contains the declaration of the access type. That means that if task Y is created inside an inner block, but its access type was first declared outside, the inner block can terminate even though Y hasn't yet terminated.
Termination: A task will terminate when it completes its body, when there is an unhandled exception, when it is no longer needed (done by a select statement), or when it is aborted. Any task can abort any other task, and as a result, all its dependants will also abort. This is a dangerous action, and should be used only when no other alternative actions are possible.
Java: Java has a predefined class, java.lang.Thread, which provides threads. Java also has an interface, called Runnable, to express concurrent execution. This interface provides a run method.
There are two ways to create threads. One way is to declare a class to be a subclass of Thread and override the run method. The second way is to declare a class that implements the Runnable interface. Common for all threads in Java is that they are not created automatically when their associated objects are created, but must be explicitly created and started using the start method (e.g. thread1.start();).
Java allows dynamic thread creation like Ada, but unlike Ada, Java allows arbitrary data to be passed as parameters. Java has no guardian concept, and relies on garbage collection to clean up objects that are no longer accessible. However, the main program only terminates when all its user threads have terminated. The join method works as described in "fork and join", and the isAlive method tells whether a target thread has terminated.
Termination
: A thread terminates as it completes its body or there is an unhandled exception.
Daemon threads provide general services and typically never terminate until they are no longer needed (like in Ada using the select statement). In Java, one no longer has the same opportunity to abort threads as in Ada.
C/Real-Time POSIX: C/Real-Time POSIX provides three mechanisms for creating concurrent activities. The first is the fork and wait mechanism. The second is spawn, which is a combination of fork and exec. It also allows a process to contain several threads, like in Ada and Java. A thread starts executing when it is created with pthread_create. Here the parameters are passed.
Termination: A thread can terminate normally by returning from its start_routine or by calling pthread_exit. One thread can wait for another to terminate with the pthread_join function. Cleaning up and reclaiming the storage of a thread is called detaching.
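A minimal pthread sketch of this (compile with -pthread; the worker function and its argument are made up for the example):

```c
#include <pthread.h>
#include <stdio.h>

/* start_routine: its argument is passed via the last parameter of pthread_create */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;                      /* normal termination (equivalent to pthread_exit) */
}

int main(void) {
    pthread_t tid;
    int id = 1;
    if (pthread_create(&tid, NULL, worker, &id) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(tid, NULL);          /* wait for the thread to terminate */
    return 0;
}
```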
## Multiprocessor and distributed systems
Definitions:
- Multiprogramming: Tasks multiplex their executions on a single processor
- Multiprocessing: Tasks multiplex their executions on a multiprocessor system where there is access to shared memory
- Distributed Processing: Tasks multiplex their executions on several processors which do not share memory
Global dispatching means that a task can be given to any available processor. From a real-time perspective, fixing tasks to certain processors will usually result in a more predictable response. Linux allows threads to be constrained to execute on a limited set of processors. This is called processor affinity. Although Ada, Real-Time Java and C/Real-Time POSIX all support multiprocessor implementations, they do not provide mechanisms to set the processor affinity of a task.
### Steps to make a distributed system
Production of an application for a distributed system involves some steps not required when programs are produced for a single or multiprocessor platform.
Partitioning
: dividing the system into parts
Configuration
: the partitioned parts are associated with particular elements in the system
Allocation
: turning the configured system into a system of executable modules
Transparent execution
: the distributed software is executed so remote resources can be accessed independent of location
Reconfiguration
: the dynamic change to the location of a software component
## Language-supported vs operating-system-supported concurrency
There has been a long debate as to whether the support for concurrency should be provided by the language or the operating system. Pros and cons for the former follows:
Pros:
- More readable and maintainable programs.
- There are many different operating systems.
- An embedded computer may not have any resident operating system.
Cons:
- Different languages have different models of concurrency.
- It may be difficult to implement a language's model of concurrency efficiently on top of an operating system's model.
- Operating systems API standards, such as POSIX, have emerged and therefore programs are more portable.
# Shared variable-based synchronization and communication
The relevant learning goal for this chapter is: The ability to use (correctly) and evaluate mechanisms for shared variable synchronization.
Discussions about issues when using multi-core architectures are left out. In short, a correct program will work on both single-core and multi-core + shared memory architectures. However, errors may not be detected on a single-core system due to the sequential nature of a single-core environment.
Problems with integrating concurrency with object-oriented programming (such as in Java) are also not covered.
## Introduction
The correct behaviour of a concurrent program is critically dependent on synchronization and communication between tasks. The classical errors data race, deadlock, livelock and resource starvation are a consequence of incorrect synchronization and/or communication of concurrent tasks.
There are two main ways of performing task communication, either with shared variables or with message passing. In this section, shared variables are discussed. An example of message-based communication combined with synchronization is Google Go's unbuffered channels.
To introduce some important concepts, consider two tasks with a shared integer X. Both tasks wish to update the value of this integer with the following statement:
X := X+1
The problem with this statement is that it is not executed as an indivisible (atomic) operation, but is in fact split into instructions that can become interleaved. We would like some way to ensure that only one of the tasks executes this statement at any given time. We call a sequence of statements we require to execute indivisibly a critical section. The synchronization required to protect a critical section is called mutual exclusion.
## Signals and Busy waiting
The simplest way for two tasks to communicate is with a shared signal variable. One of the tasks continuously reads the shared variable until the other task sets this signal variable to some value. The first task is then allowed to continue, since a synchronization has occurred. The main problem with this is that the first task has to continuously poll the signal; this is called busy waiting and is very resource demanding. In addition, implementing higher level features like mutual exclusion is complicated and hard to prove correct (especially when more than two tasks are involved).
With busy waiting, a state called livelock can occur. This happens when a task is busy waiting and does not receive a signal in finite time. This means that the task cannot continue to execute and will in addition spend resources on busy waiting.
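A minimal C sketch of this kind of signal-variable synchronization (C11 atomics plus pthreads, compile with -pthread; note how the waiting task just burns CPU time in the spin loop):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int signal_flag = 0;    /* the shared signal variable */

static void *setter(void *arg) {
    (void)arg;
    /* ... do some work, then signal the waiting task ... */
    atomic_store(&signal_flag, 1);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, setter, NULL);

    /* Busy waiting: polls the flag continuously until it is set. */
    while (atomic_load(&signal_flag) == 0)
        ;                             /* spin */

    printf("synchronized with the other task\n");
    pthread_join(t, NULL);
    return 0;
}
```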
## Semaphores
The most basic building block in shared variable synchronization is the semaphore. It is therefore important to understand what it does, how it can be implemented, its limitations, and why it works. An important feature of the semaphore is that it eliminates the need for busy waiting.
The semaphore is conceptually a shared non-negative integer equipped with two methods: wait and signal. It is imperative that the implementation of these methods is atomic, i.e. two tasks executing wait at the same time will not interfere with each other. The two methods work as follows.
wait(S)
: If the value of the semaphore, S, is greater than zero, then decrement its value by one; otherwise delay the task until S is greater than zero (and then decrement its value).
signal(S)
: Increment the value of the semaphore, S, by one.
With the semaphore in hand, we can implement mutual exclusion of two tasks like this:
mutex : semaphore; // initially 1
task P1;
loop
wait(mutex);
<critical section>
signal(mutex);
task P2;
<same as P1>
The question now is: how exactly does this eliminate busy waiting? It seems like the wait procedure needs to perform some sort of busy waiting and wait for a signal from some other process. This is where the RTSS (run-time support system) comes in. The RTSS keeps track of which tasks are waiting and removes them from the processor. Eventually, if the program is correct (for example, no deadlocks), some other task will call signal and the RTSS will let the waiting task continue.
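The mutual exclusion example above maps directly onto POSIX semaphores (Linux, compile with -pthread); a minimal sketch, where the only non-obvious detail is that POSIX calls signal "post":

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t mutex;                  /* semaphore used purely for mutual exclusion */
static long counter = 0;             /* shared variable protected by the critical section */

static void *task(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);            /* wait: suspends the task if the value is 0 */
        counter++;                   /* critical section */
        sem_post(&mutex);            /* signal */
    }
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    sem_init(&mutex, 0, 1);          /* initial value 1, i.e. a binary semaphore */
    pthread_create(&p1, NULL, task, NULL);
    pthread_create(&p2, NULL, task, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    sem_destroy(&mutex);
    return 0;
}
```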
## Error conditions and liveness
The error condition livelock was explained in section 5.1. Another error condition introduced by synchronization mechanisms is the deadlock. This is a state where a set of tasks cannot proceed due to conflicting synchronization demands. Consider the following two tasks, with two synchronization semaphores S1 and S2.
// Task 1
wait(S1);
wait(S2);
...
<critical section>
...
signal(S2);
signal(S1);
and
// Task 2
wait(S2);
wait(S1);
...
<critical section>
...
signal(S1);
signal(S2);
In this example, we risk interleaving the wait calls such that Task 1 is locked at the call to wait(S2), while Task 2 is locked at wait(S1) and none of them can continue their operation. This is called a deadlock.
The third error condition, which is less severe, is called starvation. This is when a task is not given access to a needed resource in finite time.
If it can be proven that a task is free from livelocks, deadlocks or starvation, we say that it possesses the liveness property.
## Conditional critical regions
The semaphore is simple, but is error prone when implementing complex synchronization features between multiple tasks. To resolve some of these issues, we introduce a language feature called Conditional Critical Regions (CCRs). This language feature is designed to implement mutual exclusion in critical regions.
The idea is to group shared data, and label this shared data as a resource. Whenever we have a block of code that needs to use this resource, we label it as a region using this resource with a conditional guard. The CCR feature ensures that any two regions associated with the same resource do not execute simultaneously (thus giving mutual exclusion). The following example illustrates how a language could include CCRs:
// Pseudo-code to illustrate conditional critical regions
shared_var_t := Integer ;
SHARED_VAR_MAX : Integer := 50;
resource shared_var : shared_var_t := 0;
task P1;
loop
region when shared_var < SHARED_VAR_MAX do
shared_var := shared_var + 1;
end region ;
end loop ;
task P2;
loop
region when shared_var < SHARED_VAR_MAX do
shared_var := shared_var + 2;
end region ;
end loop ;
task P3;
loop
region when shared_var >= SHARED_VAR_MAX do
shared_var := 0; // reset the counting
end region ;
end loop ;
We see that two new keywords, resource and region, are introduced to use CCRs in our language.
## Monitors
The main problem with CCRs is that the blocks of code referring to a shared resource can be dispersed throughout the program. Monitors are introduced as a way to structure both resources and procedures together. A monitor has a set of variables, and a set of procedures operating on these shared variables. Each call to a procedure is protected by mutual exclusion provided by the programming language.
The main problem with the monitor is that it still relies on some sort of internal conditional synchronization. This can be implemented with a semaphore-like structure called a condition variable. The similarities with the semaphore bring along the weaknesses of the semaphore; the condition variable is error prone and can be complicated to use.
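To illustrate the monitor idea and where the condition variable fits in, here is a minimal sketch (not from the book) of a bounded buffer written as a 'monitor by convention' in C: the shared data and the procedures operating on it are grouped together, a mutex gives the mutual exclusion a monitor construct would provide automatically, and condition variables give the internal conditional synchronization. All names and the buffer size are illustrative.

#include <pthread.h>

#define N 8                      /* buffer capacity (illustrative) */

/* Monitor data: only touched by the procedures below. */
static int buf[N], head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

/* Monitor procedure: insert an item, waiting while the buffer is full. */
void buffer_put(int item) {
    pthread_mutex_lock(&lock);
    while (count == N)
        pthread_cond_wait(&not_full, &lock);
    buf[tail] = item;
    tail = (tail + 1) % N;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Monitor procedure: remove an item, waiting while the buffer is empty. */
int buffer_get(void) {
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    int item = buf[head];
    head = (head + 1) % N;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return item;
}

The while-loops around pthread_cond_wait show the error-proneness referred to above: forgetting to re-check the condition after being woken is a classic mistake, one that a guarded construct such as the protected objects below rules out by design.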
## Protected objects
In Ada, a language feature called protected objects improves on the monitor by using guards instead of condition variables on the internal operations which need conditional synchronization.
If an entry of a protected object has a guard (barrier), the guard is evaluated when a task calls the entry. If the guard evaluates to False, the task is suspended. The guards are re-evaluated every time a task completes a procedure or entry call on the protected object.
# Message-Based Synchronization and Communication
Message passing is introduced as an alternative to shared variable synchronization and communication. This approach is typified by the use of a single construct for both communication and synchronization. There is a wide variety in the semantics of message passing; because of this, three major issues are considered and discussed:
- The model of synchronization
- The method of task naming
- The message structure
## Process Synchronization
In a message-based system a receiver, obviously, cannot read a message before it is sent. This is opposed to shared variable synchronization, where a receiver will not know whether the variable has been written to or not.
The task synchronization of the sender in a message-based system can be broadly classified into three categories:
- Asynchronous - the sender proceeds immediately after sending the message
- Synchronous - the sender waits for the message to be received
- Remote invocation - the sender proceeds only when a reply has been returned
These three forms are related, as two asynchronous sends can constitute a synchronous send (see Ex. 2). Likewise, two synchronous (or asynchronous) sends can constitute a remote invocation (see Ex. 3).
Using asynchronous sends for synchronized messaging:

Task 1:
    asyn_send(message)
    wait(acknowledgement)

Task 2:
    wait(message)
    asyn_send(acknowledgement)

Using synchronous sends for remote invocation messaging:

Task 1:
    syn_send(message)
    wait(reply)

Task 2:
    wait(message)
    ...
    construct reply
    ...
    syn_send(reply)
As asynchronous send can be used to construct the other two, this model gives the most flexibility. Because of this, languages and operating systems should support this model. There are, however, some drawbacks to consider when using asynchronous send:
- Infinite buffers might be needed to store messages that have not yet been read (perhaps because the receiver has terminated)
- Asynchronous communication is out-of-date, as most sends are programmed to expect an acknowledgement (i.e. synchronous communication)
- More communication is needed with the asynchronous model, hence increased complexity
- It is more difficult to prove the correctness of the complete system
## Task Naming and Message Structure
In task naming there are two things to consider: direct vs. indirect naming, and symmetry. In a direct naming scheme the sender explicitly names the receiver:
send <message> to <task-name>
An indirect naming scheme, on the other hand, names an intermediary (known variously as a channel, mailbox, link or pipe):
send <message> to <mailbox>
A naming scheme is symmetric if both the sender and receiver name each other (directly or indirectly). It is asymmetric if the receiver does not name a specific source, and hence accepts messages from any sender. Asymmetric naming fits the client-server paradigm, as a server can receive messages from, and perform services for, any number of clients (tasks). Indirect asymmetric naming has some further issues, as the intermediary can have different structures:
- many-to-one
- many-to-many
- one-to-one
- one-to-many
### Message Structure
A language should allow any data object of any type (pre- or user-defined) to be sent as a message. This can be difficult in practice, especially if the sender and receiver have different data representations.
## Selective Waiting, Non-Determinism and Synchronization Primitives
Message passing has so far been assumed to make the receiver wait for a message. This is often too restrictive, as the receiver may want to wait for several different tasks to call. To facilitate this common need, receivers can be allowed to wait selectively for messages from different tasks.
Dijkstra's guarded command is a programming structure that is used to solve this. A guarded command is an element of a guarded command set, and is eligible for execution if its guard evaluates to TRUE. If several guards in a set evaluate to true, the choice of which command is allowed to proceed should be arbitrary. In other words, the construct is non-deterministic, as opposed to if-statements.
The reason for making this choice non-deterministic is that most concurrent languages make few assumptions about which tasks are executed when. The scheduler is also assumed to be non-deterministic with respect to when different tasks are scheduled. The same argument applies to any queue defined by a synchronization primitive: non-deterministic scheduling implies that a queue should release tasks non-deterministically.
## Message Passing in Ada
The semantics of remote invocation have many superficial similarities with a procedure call in Ada. Data is passed to a receiver, the receiver executes, and then data is returned. Because of this similarity, Ada supports the definition of a program's messages in a way that is compatible with the definition of procedures, protected subprograms and entries.
For a task to be able to receive a message, it must define an entry. Entries can be defined as private, meaning they can only be called by tasks local to the task's body. Facilities are also provided for defining, in effect, an array of entries. These are called an entry family (see the multiplexor on p. 197 for an example). To call a task (send a message) simply involves naming the receiving task and its entry directly. If an entry call is made on a task that is no longer active, then the exception Tasking_Error is raised at the point of call.
In Ada, data can be transferred in the opposite direction of the message itself. Because of the terminology confusion that can arise from this, message passing in Ada is usually called an 'extended rendezvous'.
All entries should have an accept statement associated with them; accept statements must appear in the task's body. The naming in an accept statement is asymmetric, as it names the entry and not the task from where the message came. If multiple tasks call the same entry, they are ordered in a FIFO queue.
### Exception Handling and the Rendezvous
Any valid Ada code can be executed during a rendezvous, therefore it is possible for an exception to be raised within the accept statement itself. If this occurs, the exception can either be handled appropriately inside the statement itself, or the statement will terminate and the exception must be handled outside. If the exception is not handled within the scope of the statement, errors may occur.
### Message Passing via Task Interfaces
Ada's direct asymmetric naming means that tasks must be named explicitly. However, this is not always the appropriate behaviour, as a client may not want to be concerned with the task's type, only with the fact that it provides the appropriate service. Task interfaces are ideal for this purpose.
A task type in Ada 2005 can be declared to 'implement' zero, one or more combinations of synchronized and task interfaces.
### Select statement
The select statement in Ada must contain at least one accept alternative (and may contain as many as needed), and at most one of three further alternatives:
- a terminate alternative
- an else alternative
- a delay alternative
The general form of a select statement is:
select
    when <Boolean-Expression> =>
        accept <entry> do
            ...
        end <entry>;
        -- any sequence of statements
or
    -- similar alternatives
    ...
end select;
The terminate alternative terminates the task if no remaining task can call its entries. The else alternative is executed if no other alternative is immediately available. The delay alternative is taken if no entry call arrives before the specified delay expires. If several entries have waiting calls, the choice of which to accept is non-deterministic.
## C/Real-Time POSIX message queues
C/Real-Time POSIX supports asynchronous, indirect message passing through the notion of message queues. A message queue can have many readers and writers. A priority may be associated with each message (not part of the curriculum). Message queues are intended for communication between processes, but they can be used by threads as well.
See Program 6.1, p. 207 for the C/Real-Time POSIX interface to message queues.
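As a rough illustration of what the interface looks like (this is not the book's Program 6.1; the queue name, sizes and flags below are assumptions), a single process can create a queue, send to it and receive from it as follows. On Linux the program is typically linked with -lrt.

/* Minimal sketch of POSIX message queue usage. */
#include <mqueue.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct mq_attr attr = { .mq_flags = 0, .mq_maxmsg = 10, .mq_msgsize = 64 };
    mqd_t mq = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    const char *msg = "hello";
    mq_send(mq, msg, strlen(msg) + 1, 0);        /* asynchronous send, priority 0 */

    char buf[64];
    unsigned int prio;
    ssize_t n = mq_receive(mq, buf, sizeof buf, &prio);   /* blocks until a message arrives */
    if (n >= 0) printf("received: %s (priority %u)\n", buf, prio);

    mq_close(mq);
    mq_unlink("/demo_queue");
    return 0;
}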
## Distributed systems
In a distributed system, where tasks operate on different nodes, the goal is to make the communication between them as easy and reliable as possible. In reality, complex communication protocols are needed. A set of mechanisms must be provided so that:
- tasks do not need to decode or encode messages.
- all messages received by a user task can be assumed to be intact, and in good condition.
- messages received by a task are of the type the task expects.
- tasks are not restricted to communicate only in terms of predefined, built-in types.
It is possible to identify three main standards for communication in a distributed system:
- externally-defined application programmer's interface (API) to access network transport protocols
- remote procedure call (RPC) paradigm
- distributed object paradigm
### Remote Procedure Call
RPC is typically used when communication between programs written in the same language is needed. A procedure (server) is identified as being one that can be called. From this, two procedures are identified: a client stub and a server stub (see figure 6.1, p. 211). The stubs are sometimes called middleware, as they sit between the application and the operating system.
The client stub is on the caller's side (it stands in for the server at the client procedure), and the server stub is on the receiver's side (at the server procedure). The client stub is used to:
- identify the address of the server (stub) procedure
- convert parameters into bytes suitable for transmission across the network (parameter marshalling)
- send the request to execute the procedure to the server (stub)
- wait for a reply from the server (stub) and unmarshall the message
- return control to the client procedure along with the returned message
The server stub is used for:
- receive requests from client (stub) procedures
- unmarshall the received message
- call the server
- catch any exceptions that are raised by the server
- marshal and send return parameters and/or exceptions.
If the clients and servers are written in different languages, the marshalling and unmarshalling must be machine and language independent.
### The Distributed Object Model
The distributed object model allows:
- dynamic creation of an object on a remote machine
- identification of an object to be determined and held on any machine
- transparent invocation of a method in an object as if it were a local method
- transparent run-time dispatching of a method call across the network
Not all systems supporting this model provide mechanisms for all this functionality.
Ada
: supports the static allocation of objects, allows the identification of remote Ada objects, facilitates the transparent execution of remote methods, and supports distributed run-time dispatching of method calls.
Java
: allows the code of a Java object to be sent across the network and instances to be created remotely, the remote naming of a Java object, the transparent invocation of its methods, and distributed run-time dispatching.
CORBA
: allows objects to be created in different languages on different machines, facilitates the transparent execution of remote methods, and supports distributed run-time dispatching of method calls.
# Atomic actions
This chapter is meant as a basic explanation of the concepts behind atomic actions. For specific implementations of atomic actions in C, Ada and Java, please see (Burns and Wellings, 2009), chapters 7.2, 7.5-7.7.
## Atomic actions
In concurrent programming, several tasks can run in parallel, each performing operations or actions. With concurrency, it is easy for tasks to interfere with each other when performing the same action. A solution to this problem is to require the actions to be run as indivisible or atomic actions. The properties of atomic actions are defined as follows.
1. An action is atomic if the tasks performing it are not aware of the existence of any other active task, and no other active task is aware of the activity of the tasks during the time they are performing the action.
2. An action is atomic if the tasks performing it do not communicate with other tasks while the action is being performed.
3. An action is atomic if the tasks performing it can detect no state change except those performed by themselves, and if they do not reveal their state changes until the action is complete.
4. Actions are atomic if they can be considered, so far as other tasks are concerned, to be indivisible and instantaneous, such that the effects on the system are as if they were interleaved as opposed to concurrent.
The essence of these statements is that the system always sees a consistent state, and that all actions performed appear to be instantaneous. These concepts are usually not supported natively in programming languages, so the functionality must be implemented using other, more primitive concepts.
An atomic action, even though viewed as indivisible, can be composed of internal atomic actions. This structure is called nested atomic actions. To fulfil the stated requirements for atomic actions, it is important that only tasks involved in the outer action take part in the internal (nested) actions.
### Two-phase atomic actions
For an action to be completely atomic, no communication with other parts of the system is allowed. This is, however, a problem when it comes to acquiring resources, because the task will then communicate with a resource manager such as a server task. This leaves two options. The first is to make the server task, which serves all other tasks as well, a part of the atomic action. This will be very inefficient, as it will serialize all tasks requesting data from the server, removing the concurrency. The second option is to let the atomic action communicate freely with resource servers, and this is in reality the only option.
For efficiency and increased concurrency, it is advantageous if tasks wait as long as possible before acquiring resources, and free them as soon as possible. This could, however, make it possible for other tasks to detect a change in state before the original task is done performing the atomic action. To avoid this, atomic actions are split into two separate phases: a 'growing' phase where only acquisitions can be made, and a 'shrinking' phase where resources can be released while no new acquisitions can be made. This way, other tasks will always see a consistent system state.
### Atomic transactions
Atomic transactions are like regular atomic actions, with the addition of one feature: the atomic transaction is defined as a success or a failure after execution. If an atomic action fails, the system may be left in an inconsistent state. If an atomic transaction fails, the system will return to the state prior to the execution. Atomic transactions are also called recoverable actions because of this feature.
There are two distinct properties of atomic transactions:
failure atomicity
: The transaction must complete successfully or have no effect.
synchronization atomicity
: The transaction must be indivisible so that no other transaction can observe its execution.
Even though transactions provide error detection, they are not suitable for fault-tolerant systems. This is because they provide no recovery mechanism in case of errors, and will not allow any recovery procedures to be run.
### Requirements for atomic actions
For the implementation of atomic actions in a real-time system, a well-defined set of requirements is needed, and these must be fulfilled.
Well-defined boundaries
: tasks should have a start, end and side boundary. A start boundary is the location in each task where the action is initiated, and the end boundary is the location where the action ends. The side boundary is the separation of the tasks involved in an action from the rest of the system.
Indivisibility
: atomic actions cannot allow any exchange of information between the tasks inside and outside the action. Also, there is no synchronization at the start of an atomic action, but there is at the end. That means tasks cannot leave an action until all involved tasks are ready to leave.
Nesting
: nesting is allowed as long as the atomic actions don't overlap with other atomic actions.
Concurrency
: it should be possible to execute different atomic actions concurrently. The overall effect of running atomic actions concurrently must be the same as if they were run sequentially.
The atomic actions should allow recovery procedures to be implemented, as their intention is damage confinement.
## Recoverable atomic actions
### Backward error recovery
The concept of backward error recovery is that when it is applied to a group of communicating tasks, it is possible for the tasks to be rolled back to their starting point of execution. In general, this is problematic because there is no easy way of defining a consistent set of recovery points. However, when an error occurs in an atomic action, the tasks involved can be rolled back to the clearly defined starting point, and alternative procedures can be executed. As atomic actions operate independently, this will guarantee correct results. Atomic actions used in this way are called conversations. Each action statement will contain a recovery module, as seen here:
action A with (P1, P2) do
    ensure <acceptance test>
    by
        -- primary module
    else by
        -- recovery module
    else error
end A;
The basic principles of a conversation can be summarized:
- On entry to the conversation, the state of a task is saved. The set of entry points forms the recovery line.
- While inside the conversation, a task is only allowed to communicate with other tasks active in the conversation and with general resource managers. As conversations are built from atomic actions, this property is inherited.
- In order to leave the conversation, all tasks active in the conversation must have passed their acceptance test. If this is the case, then the conversation is finished and all recovery points are discarded.
- If any task fails its acceptance test, all tasks execute their alternative modules. It is, therefore, assumed that any error recovery to be performed inside a conversation must be performed by all tasks taking part in the conversation.
- Nesting is allowed, but only strict nesting (the inner structure must be entirely contained in the outer).
- If all alternatives in the conversation fail, then recovery must be performed at a higher level.
Conversations have been criticized even though they offer recovery options for sets of tasks. If a set of tasks fails, the same set of tasks will try another procedure, but often the appropriate recovery lies at a higher level, and the tasks may need to work together with entirely different tasks to recover successfully. This leads to a lot of inefficiency, and the system may not even recover at all.
### Forward error recovery
The principle of forward error recovery is that if an exception occurs in one of the tasks active in an atomic action, then that exception is raised in all the active tasks. This exception is called asynchronous, as it originates in another task. The following example illustrates the concept:
action A with (P2, P3) do
    -- the action
exception
    when exception_a =>
        -- sequence of statements
    when exception_b =>
        -- sequence of statements
    when others =>
        raise atomic_action_failure;
end A;
There are two models of exception handling. With the termination model, the atomic action will complete normally if all exceptions are handled and no further exceptions are raised. With the resumption model, the tasks will resume where they were when the exception was raised. If one task doesn't have an exception handler, or fails to handle an exception, the atomic action will fail and a new exception will be raised in the other tasks.
Resolution of concurrently raised exceptions
: A problem with exceptions is that a fault may not be uniquely identified and may cause different exceptions to be raised simultaneously. As there may be different exception handlers for each of the exceptions in each task, the problem is how to decide which to choose. The solution is simply to create a priority tree of the exceptions, and to handle the exception at the root of the subtree which contains all raised exceptions.
Exceptions and internal atomic actions
: Another problem with exceptions is when an exception is raised while other tasks are inside a nested action. All tasks must participate in the recovery action, but internal actions are indivisible, and handling the exception there would break that principle. Another problem is that the nested action may not be able to handle all kinds of exceptions that can be raised. The solution is to give nested actions a predefined abort exception which aborts the nested action; since the nested action fails, the system is restored to the initial state.
## Asynchronous notification
In real implementations of recoverable atomic actions, forward and backward error handling are usually combined, and backward error handling can even be implemented with forward error handling. The key to implementing error handling is asynchronous messages, and most programming languages have some feature allowing asynchronous messages to be sent. There are mainly two principles for this: termination and resumption.
The resumption model (often called event handling) is like a software interrupt. A task contains an event handler which will handle predefined events. When the task is interrupted, the event handler is executed, and when it's finished, the task will continue execution from the point where it was interrupted.
With the termination model, each task specifies a domain in which it will accept an asynchronous signal. If the task is in the defined domain and receives an asynchronous signal, it will terminate the domain. This is called Asynchronous Transfer of Control, or ATC. If a task is outside the defined domain, the ATC will be queued or ignored. When the ATC is finished, control is returned to the calling task at a different location than from where it was called.
### The user need for asynchronous notification
The main feature of the asynchronous notification is the instantaneous message it provides. It adds complexity to the system, but the alternative is to wait for events to occur and processes to finish. In real-time systems, this isn't always good enough. The main reasons for using asynchronous notifications are:
Error recovery
: When errors occur it can mean that the system state is no longer valid, that processes may never finish, or that a timing error has occurred. All this means there is no longer any point in waiting for a process to finish, and the best solution is to respond immediately.
Mode changes
: For large systems, there can be many operation modes with entirely different tasks being performed. When sudden mode changes happen, we want the system to respond immediately, for example when a transition to an emergency state occurs.
Scheduling using partial/imprecise computations
: Many algorithms calculating specific results operate by giving a more accurate result for each iteration. These algorithms have no natural finish point, so using an asynchronous notification will often be the simplest way of terminating them.
User interrupts
: Whenever a system receives interrupts from a user, for example when an error occurs, the response must be immediate.
# Resource control
Implementation of resource entities requires some form of control agent. If the resource is passive, it is said to be protected (synchronized). If the resource is active, it is called a server. This chapter discusses the problem of reliable resource control.
## Resource control and atomic actions
- Tasks need to communicate and synchronize to perform resource allocation; this need not be in the form of an atomic action, because the only information exchanged is that necessary to achieve the allocation.
- The code necessary for a particular task to communicate with the controller should be an atomic action.
## Resource management
Resources must be encapsulated and be accessed only through a high-level procedural interface.
- Packages in Ada.
- Achieved naturally when using monitors in Java and C.
## Expressive power and ease of use
- Monitors/synchronized methods, servers, protected resources.
- Bloom's criteria for evaluating synchronization primitives in the context of resource management:
- Ease of use of a synchronization primitive encompasses:
- the ease with which it expresses each of these synchronization constraints.
- the ease with which it allows the constraints to be combined to achieve more complex synchronization schemes.
- In the context of resource control, the information needed to express these constraints can be categorized as follows:
- the type of service request
- the order in which requests arrive
- the state of the server and any objects it manages
- the parameters of requests
- Comparison between condition synchronization and avoidance synchronization:
- condition wait: all requests are accepted, but any task whose request cannot currently be met is suspended on a queue.
- avoidance: requests are not accepted unless they can be met.
### Request type
- Prioritizing one type of request over another.
- Monitor and synchronized methods: Programmed as distinct procedures.
- Ada: Represented by different entries in the server task, or protected object.
### Request order
- Monitor: FIFO
- Ada: calls to the same entry can be serviced FIFO if the appropriate queuing policy is chosen.
- Ada: calls to different entries are serviced in an arbitrary manner (out of the programmer's control). A defined ordering is possible if all clients first call a common "register" entry.
### Server state
- Some operations may be permissible only when the server and the objects it administers are in a particular state.
- Monitors can be used, with condition variables implementing the constraints.
### Request parameters
The order of operations of a server may be constrained by information contained in the parameters of a request, typically the size of the request. How can a resource allocator in Ada be constructed to handle this?
- Different entries for different sizes.
- Two-stage interaction: call first "sign in" and then "allocate" on the server.
- Problem: if the client sends "sign in" and is then aborted by another task before sending "allocate", the server deadlocks if it waits infinitely.
- If the server does not wait infinitely on "allocate", there will be a problem if the client is just slow.
- Solutions:
- Define the abort primitive to apply to an atomic action rather than a task; forward and backward error recovery can then be used when communicating with the server.
- Assume that abort is only used in extreme situations where the breaking of the atomic action is of no consequence.
- Try to protect the server from the effects of a client abort.
### Requester priority
Client priority. It is possible in Java, C and Ada to define a priority queue, but in general concurrent programming languages, tasks are released from primitives such as semaphores or condition variables in either an arbitrary or FIFO manner.
## The requeue facility
One means of enhancing the usability of avoidance synchronization is to add a requeue facility. The Ada language has such a facility.
The key notion behind requeue is to move a task (which has already passed one guard or barrier) to 'beyond' another guard.
Ada allows requeues between task entries and protected object entries.
## Semantics of requeue
For the detailed semantics of requeue, see chapter 8.4.1 in the book.
### Requeuing to other entries
See the example in the book, chapter 8.4.2.
## Asymmetric naming and security
- asymmetric naming → the server being unaware of the identity of the client.
- A server task may wish to know the identity of the calling client so that:
- a request can be refused on the grounds of deadlock prevention or fairness (that is, granting the request would be unfair to other clients).
- it can be guaranteed that resources are released only by the task that earlier obtained them.
- Ada: get the client id from the task identification facility.
- Java: get the client id using Thread.currentThread().
## Resource usage
- Shared access: more than one can access at the same time.
- Exclusive access: only one can access at the same time.
- A combination of the two; a problem arises when a task requests shared access while another task already holds exclusive access, and vice versa.
- A task that allocates resources should release them as soon as possible.
- Releasing too fast → errors.
- Modified two-phase resource usage, so that resources are not released until the action is completed.
- Recovery procedure: success → release the resources; failure → undo any effects on the resources.
## Deadlock
- Many tasks competing for a finite number of resources.
- livelock
- Several tasks are continually attempting to gain sole access to the same resource; if the resource allocation policy is not fair, one task may never get its turn to access the resource. This is called indefinite postponement, starvation or lockout.
- liveness → in concurrent systems: if something is supposed to happen, it eventually will.
- four necessary conditions that must hold if deadlock is to occur:
- Mutual exclusion - only one task can use a resource at once (that is, the resource is non-sharable, or at least limited in its concurrent access).
- Hold and wait - there must exist tasks which are holding resources while waiting for others.
- No pre-emption - a resource can only be released voluntarily by a task.
- Circular wait - a circular chain of tasks must exist such that each task holds resources which are being requested by the next task in the chain.
- Three solutions to handle deadlock
- deadlock prevention
- deadlock avoidance
- deadlock detection and recovery
# Scheduling real-time systems
Scheduling: restricting the non-determinism in concurrent systems by providing an ordering of the use of the resources of the system, and predicting the worst-case behaviour under this ordering. I.e. the scheduler has to assign priorities to the different processes and run a schedulability test. In a static scheduler this is decided/calculated prior to execution, while a dynamic scheduler calculates it at run-time.
## The cyclic executive approach
For this simple approach, all tasks are purely periodic and known in advance. Tasks are either running or not initiated, never suspended. The schedule is only a sequence of procedure calls, thus there is no need for semaphores etc. The major cycle determines the longest possible period of a task, while the minor cycle determines the shortest. The scheduler groups the tasks into minor cycles, making sure every task (including the one with the longest period) runs at least once every major cycle. Everything is planned in advance; however, placing the tasks in the cycles is similar to the knapsack problem, which is NP-hard.
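A minimal sketch of the idea in C (the procedure names, the 25 ms minor cycle and the 100 ms major cycle are assumptions, not taken from the book): each iteration of the outer loop is one major cycle, and clock_nanosleep with an absolute wake-up time plays the role of the clock interrupt that ends each minor cycle.

/* Illustrative cyclic executive: task_a every 25 ms, task_b every 50 ms,
   task_c every 100 ms; minor cycle 25 ms, major cycle 100 ms. */
#define _POSIX_C_SOURCE 200809L
#include <time.h>

void task_a(void) {}   /* placeholders for the real periodic procedures */
void task_b(void) {}
void task_c(void) {}

static void wait_until(struct timespec *t, long ms) {
    t->tv_nsec += ms * 1000000L;
    while (t->tv_nsec >= 1000000000L) { t->tv_nsec -= 1000000000L; t->tv_sec++; }
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, t, NULL);
}

int main(void) {
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (;;) {                                               /* one major cycle */
        task_a(); task_b();   wait_until(&next, 25);         /* minor cycle 1 */
        task_a(); task_c();   wait_until(&next, 25);         /* minor cycle 2 */
        task_a(); task_b();   wait_until(&next, 25);         /* minor cycle 3 */
        task_a();             wait_until(&next, 25);         /* minor cycle 4 */
    }
}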
## Task-based scheduling
A task can now be either running, suspended waiting for a timing event, or suspended waiting for a non-timing event.
### Scheduling approaches
Fixed-Priority Scheduling (FPS)
: All tasks have static priority.
Earliest Deadline First (EDF)
: Earliest deadline yields highest priority. Dynamic.
Value-Based Scheduling
: Adaptive; assigns a value to each task to decide which to run next.
### Scheduling characteristics
A schedulability test can be:
sufficient
: success indicates that deadlines are always met.
necessary
: failure indicates that the system will miss some deadlines.
sustainable
: the test still predicts schedulability when the system parameters (period, deadline, ...) improve. (Not all tests do this.)
### Preemption and non-preemption
Preemptive
: the scheduler automatically switches to tasks of higher priority.
deferred preemption / cooperative dispatching
: a hybrid of the two: a higher-priority task may have to wait a short, bounded time before preemption takes effect, for example until the running task reaches its next preemption point.
Non-preemptive
: the scheduler depends on the tasks to complete, or give up their resources, before it can switch to other tasks. I.e. it allows lower-priority tasks to complete before switching to tasks of higher priority.
### Simple task model
The book makes some assumptions on the tasks: a fixed set of periodic, independent tasks with known periods, constant worst-case execution times, and deadlines equal to their periods. They run on a single processor where overheads run in zero time.
A critical instant is the instant when the processor runs with maximum load.
See table 11.2 page 370 for standard notation used in the book.
## Fixed-priority scheduling (FPS)
Rate monotonic priority assignment
: shortest period yields highest priority. In the book 1 is the lowest priority.
## Utilization-based schedulability tests for FPS
If the following is true (for rate monotonic priority ordering), then all N tasks will meet their deadlines:
$\sum_{i=1}^N \frac{C_i}{T_i} \leq N \cdot (2^{\frac{1}{N}}-1)$
i.e. the total utilization of the task set must not exceed the bound $N \cdot (2^{\frac{1}{N}}-1)$.
The test is sufficient but not necessary. (A system may fail the test and still be schedulable.)
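The test is easy to evaluate mechanically; the following is a small illustrative helper in C (the function and array names are assumptions, not from the book). A return value of 0 only means the test is inconclusive, since the test is sufficient but not necessary.

/* Utilization-based schedulability test for rate-monotonic FPS.
   C[] are worst-case execution times, T[] the periods of the N tasks. */
#include <math.h>
#include <stdio.h>

int rm_utilization_test(const double C[], const double T[], int N) {
    double U = 0.0;
    for (int i = 0; i < N; i++)
        U += C[i] / T[i];                              /* total utilization */
    double bound = N * (pow(2.0, 1.0 / N) - 1.0);      /* N(2^(1/N) - 1) */
    return U <= bound;          /* 1: schedulable; 0: inconclusive */
}

int main(void) {
    double C[] = {12, 10, 10};
    double T[] = {50, 40, 30};
    printf("passes utilization test: %d\n", rm_utilization_test(C, T, 3));
    return 0;
}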
### Improved utilization-based tests for FPS
Several improvements are suggested:
Let N stand for the number of different families of tasks. Members of a task family have a common denominator for their periods: 8, 16, 64, ... This correctly deems many systems schedulable, but the test is not sustainable: improving the periods from 8, 16, 64 to 9, 17, 65 makes N increase from 1 to 3, which can make the system fail the test.
A second suggested improvement is the following product-based test:
$ \prod_{i=1}^{N} (\frac{C_i}{T_i}+1) \leq 2 $
## Response time analysis (RTA) for FPS
As opposed to the utilization-based FPS tests, the RTA test is exact (sufficient and necessary).
$ \omega_{i}^{n+1} = C_i + \sum_{j \in hp(i)} \lceil \frac{\omega_{i}^{n}}{T_j} \rceil \cdot C_j$
Here $hp(i)$ is the set of tasks with higher priority than task $i$, and $\omega_i$ is the response time of the $i$'th task. In other words, the response time of task $i$ equals the execution time of task $i$ plus the sum, over all tasks $j$ of higher priority than $i$, of the number of times $j$ is released while $i$ is pending multiplied by the execution time of $j$. To find $\omega_i$ numerically, the recurrence is iterated until $\omega_{i}^{n+1} = \omega_{i}^{n}$, starting with $\omega_{i}^{0} = C_i$. The process is repeated for all tasks in the system.
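The iteration is straightforward to code; the sketch below (not from the book, all names are assumptions) assumes the tasks are sorted by decreasing priority so that tasks 0..i-1 form hp(i), and also checks each result against a deadline D_i. With the example values in main, the utilization test above fails (U ≈ 0.93 > 0.78 for N = 3) while RTA shows that the task set is in fact schedulable.

/* Response time analysis for FPS, tasks sorted by decreasing priority.
   C[]: worst-case execution times, T[]: periods, D[]: deadlines. */
#include <math.h>
#include <stdio.h>

int response_time_analysis(const double C[], const double T[],
                           const double D[], int N) {
    for (int i = 0; i < N; i++) {
        double w = C[i];                         /* start with w_i^0 = C_i */
        for (;;) {
            double w_next = C[i];
            for (int j = 0; j < i; j++)          /* interference from hp(i) */
                w_next += ceil(w / T[j]) * C[j];
            if (w_next > D[i]) return 0;         /* deadline missed */
            if (w_next == w) break;              /* converged: R_i = w */
            w = w_next;
        }
    }
    return 1;                                    /* all deadlines met */
}

int main(void) {
    double C[] = {3, 3, 5}, T[] = {7, 12, 20}, D[] = {7, 12, 20};
    printf("schedulable by RTA: %d\n", response_time_analysis(C, T, D, 3));
    return 0;
}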
## Sporadic and aperiodic tasks
Up until this point we have assumed $D = T$, which is unreasonable for sporadic/aperiodic tasks. The recurrence above is still valid for $D < T$, as long as we stop the iteration when $\omega_{i}^{n+1} > D_i$.
### Hard and soft tasks
The worst-case figure is usually considerably higher than the average case, thus schedulability testing with worst-case figures can cause low processor utilization. To improve this, it should be possible to schedule all tasks using average figures, i.e. we allow transient overload: situations where we might not meet all deadlines. However, hard real-time tasks should never miss deadlines, so they should be schedulable using worst-case figures.
## Aperiodic tasks and fixed-priority execution-time servers
Execution-time server
: makes sure hard tasks meet their deadlines, but allows soft tasks to run as often as possible.
Dual-priority scheduling
: Priorities are split into a high-, a medium- and a low band. Aperiodic tasks are in the medium band. Hard tasks are initially in the low band, but are bumped to the high band to meet their deadlines.
## Task systems with D < T
Deadline monotonic priority ordering (DMPO): tasks have a fixed priority, and the shorter the deadline, the higher the priority. DMPO is optimal for a fixed-priority scheme.
FPS works well with D < T, while the EDF utilization test fails. However, since EDF is more effective than FPS, a system that passes an FPS test will also be schedulable under EDF.
## Task interactions and blocking
So far, tasks have been viewed as independent. In reality they are not, as they often share resources. Thus we have to add the possibility for tasks to be suspended, waiting for a future event (e.g. an unlocked semaphore).
Priority inversion
: a task with high priority has to wait for a task with lower priority (e.g. waiting for it to release a semaphore). The high-priority task is then said to be blocked.
Priority inheritance
: If p is suspended, waiting for q, then q's priority is increased to p's priority (if it originally was lower).
To make the response-time recurrence work with the expanded model, we add a term $B_i$, which is the maximum blocking time of task $i$.
$w_i^{n+1} = C_i + B_i + \sum_{j\in hp(i)} \lceil \frac{w_i^n}{T_j} \rceil C_j$
$B_i = \sum^K_{k=1} usage(k,i)C(k)$
See page 385 in the book for an explanation of the usage function used to calculate $B_i$.
## Priority ceiling protocols
Priority ceiling protocols minimize blocking (a task can be blocked at most once per execution by lower-priority tasks), eliminate deadlocks, and provide mutually exclusive access to resources.
Original ceiling priority protocol (OCPP):
- All tasks have a static default priority
- Resources have a static ceiling value, that corresponds to the max priority of the tasks using the resource
- Tasks have a dynamic priority; its own static priority or any (higher) priorities inherited from blocking higher-priority tasks
- A task may only lock a resource if its dynamic priority is higher than the ceiling of any currently locked resource (excluding any resource it has locked itself)
We see that the priority of a task will be increased if it is blocking a higher priority task.
### Immediate ceiling priority protocol (ICPP)
- All tasks have a static default priority
- Resources have a static ceiling value, that corresponds to the max priority of the tasks using the resource
- Tasks have a dynamic priority; the max of its own static priority and the ceiling value of any resources it had locked
Instead of waiting until it actually blocks a higher-priority task, ICPP increases the priority of a task as soon as it locks a resource. With ICPP, a task will not start executing until all the resources it needs are free: if a resource it needed were locked, the task holding it would be running with a priority at least as high as its own (the ceiling).
ICPP is easier to implement and leads to less switching between tasks, but causes more priority changes than OCPP.
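ICPP-style locking is also available in C/Real-Time POSIX as the 'priority protect' mutex protocol. A minimal sketch (the ceiling value 30 and the function name are assumptions; running with real-time priorities typically requires appropriate privileges):

/* Illustrative: a mutex whose holder is immediately raised to the ceiling
   priority, as in ICPP. */
#include <pthread.h>

pthread_mutex_t resource;

int init_resource_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, 30);   /* max priority of users of the resource */
    return pthread_mutex_init(&resource, &attr);
}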