TTM4110: Dependability and Performance with Discrete Event Simulation
Preface
This compendium attempts to provide a sufficent summary of the most important concepts of the course. This compendium is written by students for educational purposes and is largely based on the course book and the lecture slides of 2018.
The compendium is not written by professionals, so there are bound to be errors. If you spot any such errors, feel free to correct it. If you feel like something important is missing from the compendium, feel free to add it yourself.
Chapter 1 Introduction
This chapter attempts to explain what the course is all about and throws a bunch of new terms and concepts that will be further explored in the following chapters. In this chapter we will give you the most central concepts (the ones that has been asked about in previous exams) and provide an explaination of some of them.
Systems, models and properties
This course is all about evaluating the dependability and/or performance of systems so it makes sense to start out with clarifying what that entails.
- System
- A system can be defined as a regularly interacting or interdependent group of items forming a unified whole.
Basically a system can be anything that does anything, as long as there are several components working together to peform some sort of funciton.
A system can be described by two types of characteristics: functional and non-functional properties.
- Functional properties
- Which functions are performed.
- Non-functional properties
- How well these functions are performed.
Non-functional properties are the most important concepts in this course. Non-functional properties can be either performance (e.g carrying capacity, speed, blocking probability, delay) or dependability (the system works and fullfills its function, safety).
When evaluating the dependability and performance of a system, we often don't want to use the actual system for the evalutation. Instead we use a model of the system.
- Model
- A model is an abstraction of the real or projected system.
A model is in simpler words: A simplifcation of the real system. There can be several reasons for using a model. Developing a model is most likely a lot less constly than building the actual system, and by analysing the model is is possible to discover problems, that could have been costly to fix if the system was already built. Another reason can be simplicity. Properties of a system can sometimes be difficult to handle or measure. By building a simpler model, these properties can be studied in a controlled manner. When using a model it is important to know that the results derived from the model should be valid for the real system, not just for the model.
A system has three basic properties:
- System components
- System components are items that limit the dependability and/or performance of the system (e.g processors, hard disks, transmission channels).
- Structure
- A reflection of how the system components interact. The structure of a system can be physical, logical, or derived from the physical and/or logical sturctures.
- Behaviour
- The interactions (themselves) between the system components (e.g Queueing disciplines, protocols, other traffic mechanisms apart form protocols, fault handling).
QoS
The book continues to introduce concepts that are required for understanding the concepts later on in the coruse, so here it goes:
- Quality of Service (QoS)
- Degree of compliance of a service to the areement that exists between the user and the provider of the service.
Basically: Does the system live up to my expecations? How well does it do that?
Dependability
- Dependability
- Trustworthiness of a system such that reliance can justifiably be placed on the service it delivers.
Basically: Can I trust that system works correctly when using it?
- Failure
- Deviation of the delivered service from the compliance with the specification. Transition from correct service to incorrect service (e.g the service beomes unavailable).
- Error
- Part of the system state which is liable to lead to a failure.
- Fault
- Adjudged or hypothesized cause of an error.
These three concepts together are the threats to dependability. They are connected like this: A fault causes an error and an error may lead to system failure. There two basic appoaches to achieve a dependable system:
- Fault prevention
- To prevent the occurence or introduction of faults.
- Fault tolerance
- To prevent that errors cause fialures or, in other words deliver a correct service despite the presence of faults.
There are several types of faults, these are the most important to know:
- Physical faults
- These are the "classical" faults, e.g physcial wear out of components.
- Transient faults
- These faults are present only for a short period of time and physical change occurs in the system, e.g. external disturbances like electromagnetic interference and radiation.
- Intermittent faults
- These faults come and go, for instance due to hardware components operating close to its tolerance limit, e.g. a gate that receive only sometimes (and not always) a satisfactory signal level.
- MTFF
- Mean Time to First Failure
- MTCF
- Mean Time to Catastrophic Failure
- MTBF
- Mean Time Between Failures
- MUT
- Mean Up Time
- MDT
- Mean Down Time
- MTTF
- Mean Time To Failure
The last major definitions related to dependability are:
- Availability
- Ability of a system to provide a set of services at a given time or at any instant of time or at any instant within a given time interval.
- Reliability
- Ability of a system to provide uninterrupted service.
- Safety
- Ability of a system to provide service without the occurence of catastrophic
Performance
Performance also has several definitions and concepts we must understand so bear with me:
- Performance
- Abiliy of a system to provide the resources needed to deliver its services.
- Capacity
- Maximum load a syste mcan handle per time unit.
- Carried traffic
- Average number of resources in a time interval. Measured in Erlang.
- Primary traffic
- Traffic generated by the users.
- Secondary traffic
- Traffic used for signalling between network nodes.
- Time Congestion
- The Fraction of tiem all resources are busy.
- Throughput
- Portion of the system capaciy that is utilised be the users.
- Delay
- Time it takes to complete a service.
- Waiting time
- Accumulated time a service request is pending for service in a system.
- Service time
- Accumulated time a servide request is served by a system.
- Sojourn time
- Total time a service request is in a system.
Methods
There are three primary ways of analysing a system: Mathematical analysis, simulation and measurement on a prototype of a system. Here are the main benefits of each of them:
- Mathematical analysis
- In the case of well-known problems, results can easily be obtained using formulas or algorithms.
- Simulation
- Relatively detailed models can be used.
- Measurements on (a prototype of) a system
- Results are trustworthy, realistic and detailed.
Chapter 2: Fundamentals of probability
This chapter is just plain statistics and should be familiar if you have taken a statistics course. We will thus not prioritise this chapter, and we suggest you look at alternative sources if you need to brush up on your statistics knowledge. Do feel free to add material to this chapter yourself if you so desire.
Expected value:
Sample mean (avarage):
Variance:
Sample variance:
Standard deviation:
Standard error:
Intensity:
Conditional probability:
Baye's formula:
Negative exponential distribution (n.e.d) (probability density function):
Erlang -k distribution probability density function (k is shape and λ is rate of each phase):
and cumulative:
Death rate (failure rate):
Chapter 3: Fundamentals of stochastic processes
This chapter introduces some applied statistics in the context of the course.
- Regular point process
- A point process is regular if and only if at most one event occurs at a time.
- Renewal process
- A point process is a renewal process if and only if the inter-event times are independent and identically distributed.
- Merging of renewal processes
- Merging renewal processes will in general not result in a new renewal processs.
- Little's formula
- The average number of customers in a stationary system is equal to their average arrival intensity multiplied by their average sojourn time in the system. Note that the system must be stable with no losses.
$ \bar{N} = \bar{\lambda} * \hat{W} $
Chapter 4: Simulation
Random variable generation:
Important properties of a random varaible generator is: fast, correct statistical properties, reproducible,long cycle, portable.
There are 4 main methods
- Direct method
- This method is fast, but only applicable to discrete distributions.
- Inverse transform method
- Fast, but requires a closed form inverse cumulative distribution function.
- Acceptance-rejection method
- Many distributions, but requires several drawing per random variate
- Convolution
- Simpe implementation for complicated distributions, but only applicable to sums of variates.
Types of simulation models
Simulation can be classified as static or dynamic, discrete or continuous, and deterministic or stochastic.
In this course we use dynamic, discrete, and stochastic simulation models.
Static simulation model
Also called Monte-Carlo simulation models a particular point in time (time not part of model) E.g throwing a dice.
Dynamic simulation model
Model represents a system as it changes over time. E.g. simulate a bank from 08:00-16:00.
Discrete simulation model
State variables change only at discrete set of points in time.
Continuous simulation model
State variable changes continuously over time.
Deterministic simulation model
Contains no random variables.
Stochastic simulation model
Contains at least one random input variable
Discrete event simulation (DES):
The system dynamics is given by discrete events. The system state is distinct and will only change at a specific time instance. Simulations of discrete events need only to simulate the events that change the system state, and a finite number of events will take place in a finite interval of time.
Time driven vs. event-driven simulation:
Time driven: Fixed time stamp. For example: Move the clock 1 second forward and determine the system state.
Event driven: Variable time-stamp. For example: Move the clock to the time of next event.
Concepts in Discrete event simulation:
Model: Abstract representation of system.
Entity: Object with explicit model description.
Attributes: Properties of an entity.
System: Collection of entities.
System state: Set of variables containing sufficient information.
List: Collection of associated entities.
Event: Change of system state.
Event notice: A record of an event to occur.
Event list: List of event notices (ordered by time)
Activity: Specified duration.
Delay: Unspecified duration.
Resource: Passive object used to perform an activity.
Process oriented simulation:
A Process describes the behaviour of an entity (sequence of activities, life cycle). Focus on the entities and their dependencies and interactions (object-oriented approach). Event-driven approach. Requires a list of events and synchronization mechanisms.
Activity diagram:
A graphical representation of a model. Activites/life-cycles of entities. Interactions between entities (cooperation, interrupt, generation, priorities). Entities and resources (comptetition, synchronization).
Chapter 5: Poisson and Markov process
This chapter presents the Poisson process and its properties. In particular, it describes how Poisson processes can be aggregated to form Markov processes that are an important part of Markov models.
Poisson and Markov Process
- Poisson process
- Regular point process (only one event at the time). Intensity is constant (homogenous PP). Intensity is independent of previous events.
- Merging of Posson processes (Palm's theorem)
- Merging Poisson processes gives a new Poisson process with intensity equal to the sum of the intensities.
Markov process: Events from many Poisson Processes.
Poisson distribution: Number of events until and including time t.
Exponential distribution: Time between two consecutive events.
Poisson process -merging
The merging (superposition, multiplexing) of two Poisson processes is a Poisson process.
Splitting of a Posson process in n processes forms n Poisson processes if the splitting is probabilitstic (with constant probability).
Stationary traffic models
Examples of queuing systems
Phone subscirbers share a limited number of telephone lines in a telephone exchange
Mobile phone subscribers share a limited number of radio transmitters/receivers in a mobile base station.
Taxi call centers have delay queues.
IP routers have input and output buffers.
Maintenance centers have order lists and repairmen.
Performance: Delay, blocking, utilization, waiting time...
Queuing model
A system does not need to have any queuing positions to be classified as a queuing system.
A queueing system is characterized by the arrival process, service time distribution, number of servers, maimum number of customers in the system, number of sources, and queuing dicipline (service order).
Chapter 7: Dependability models
Faults
Software faults are the most common couse of failure in ICT systems.
- Transient fault
- A fault that is only present in the system for a short time.
Random extensions
Use of modeling in development and dimensioning.
TODO
Point process - Point process describes occurrence of an event - No gradation of type of event - The state, X(t), of a process is the number of events, N(t) Regular point process only one event at a time.
Time to k'th event:
$$ S_k = \sum_{i=1}^k Y_i $$ har også at$$ P(N(t)<k) = P(S_k>t) $$
sd
Renewal process Regular point processes Time to next event is independent of what happened in the past The superposition of merging two renewal processes is generally not a renewal process. However, the superposition of a large number of renewal processes tends to a Poisson process.
Little's formula:
Erlang: Erlang is a dimensionless unit that is used as a measure for offered load or carried load from service-providing elements. Often expressed as call arrival rate times average call length. Given a scenario with a single cord in telephone system. The cord can provide phone calls for 60 minutes per hour. A full utilization of the cord results in 1 erlang. If users now attempts to make calls in this system (measured in eg. per hour): Carried load is the traffic that was carried for the successful call-attempts. Offered load is the traffic that would have been carried if all the call-attempts succeeded.
Offered traffic (in erlangs) is related to the call arrival rate, λ, and the average call-holding time (the average time of a phone call), h, by:
Relative frequency: Number of times the event occurs divided by the number of times an experiment is carried out
Probability: Asymptotic relative frequency of the event investigated
Joint probability:
The probability that both events occurs in conjunction
(Intersection)
Conditional probability: If A and B are two events, and P(B) > 0 P(A|B) = P(A ∩ B) P(B)
Multiplicative rule P(A ∩ B) = P(A|B)P(B)
Law of total probability: P(B) = P j P(B ∩ Aj) = P j P(B|Aj) · P(Aj)
Negative exponential distribution:
X ∼ Exp(λ)
Erlang-k distribution: It is the distribution of the sum of k random variables identically distributed.
X ~ Erlang-k(λ)
Binomial distribution:
Probability distribution of the number X of events 1 in a
sequence of n independent Bernoulli trials.
X ∼ Bin(n, p)
Poisson distribution:
Corresponds to the binomial distribution with an infinite number
of Bernoulli trials.
Probability of the number of occurrences of an event, given
that it occurs on average at the intensity
Poissoon process:
TODO: format
• Regular point process (only one event at the time) • Intensity is constant (homogeneous PP) • Intensity is independent of previous events The merging (superposition, multiplexing) of two Poisson processes is a Poisson process Splitting of a Poisson process in n processes forms n Poisson processes if the splitting is probabilistic (with constant probability)
Queueing model
Kendall's notations:
TODO: format
Examples Kendall’s notation, example: M/M/1 – M = Poisson arrival process – M = service time distribution is n.e.d. – 1 server – No system capacity given, i.e. infinite – No queuing discipline, i.e. arbitrary discipline can be used • Infinite server systems: M/M/∞ • Loss systems: M/M/n/n • Delay systems with infinite queue: M/M/n • Jackson’s queueing network: network of M/M/nj -system
Dependability
Series system:
Availability:
Reliability function:
Parallel system:
Availability:
Reliability function:
K-of-n system:
Availability:
Reliability function:
Mean Time to First Failure:
Mean Time to Catastrophic Failure:
Mean Time Between Failures:
Mean Up Time:
Mean Down Time:
Mean Time To Failure:
Mean Time to First Failure:
For a system that cannot be repaired and a structure of
independent elements with constant failure rate
How to improve system availability:
- Identify the most critical factor (element) in the system model
- Set element availability to 1
- Calculate the net change in system availability
- Repeat for all elements
- Which element has the larges impact on the system availability?
- What does it take to improve that element?
- Alternatives:
- Replacement of a component with lower failure rate
- Improve recovery time
- Change the structure by adding more
Cut set
Cut set
The system fails if all elements in the set fails Minimum cut set The system fails if and only if (iff) when all elements in the set fail, given that the system elements not in the set are working
Path set
Path set
The system is working if all elements in the set are working Minimum path set The system is working iff all elements in the set are working, given that all system elements not in the set are failed.