This is an old version of the compendium, written May 29, 2013, 5:01 p.m. Changes made in this revision were made by jonasft. View rendered version.

TDT4145: Data Modelling and Database Systems

# Types of failures Failures of type 1-4 are common, and the system is expected to automatically recover from these. Failures of type 5 and 6 are less common, and recovering from them can be a major undertaking. 1. A computer failure / crash 2. A transaction or system error 3. Local errors or exception conditions detected by the transaction 4. Concurrency control enforcement 5. Disk failure 6. Physical problems and catastrophes # Database storage stuctures Databases are typically stored on secondary (non-volatile) storage. ## List of storage structures || Heap file || The simplest storage structure: an unordered list of records. Records are inserted at the end of the heap file, giving O(1) insertion. Retrieval requires a linear search ( O(n) ), and is therefore quite inefficient for large values of n. Deleted records are only marked as deleted, and not actually removed. This is O(1) (in addition to the time it takes to retrieve a record to be deleted). Highly volatile heap-files must be re-structured often to prune away records that are marked as deleted. || || Hash buckets || Hash functions directly calculate the page address of a record based on values in the record. A good hashing function has an even spread across the address space. When doing calculations on hash buckets (say, for an exam), it is safe to assume about a 60% fill rate. Ignoring collisions, hash buckets are O(1) for retrieval, insertion, modification and deletion. Hash buckets often suck for range retrieval. || || B+-trees || A pretty good all-around storage structure. Retrieval, insertion, modification and deletion is O(log n). Good for volatile data, as the index dynamically expands and contracts as the data set grows and shrinks. Less optimal for non-volatile/stable files - use ISAM for that. || || ISAM || (Indexed Sequential Access Method) || ## Hashing schemes || Extendible hashing || Extendible hashing treats a hash as a bit string, and uses a trie for bucket lookup. Re-hashing is done incrementally, one bucket at a time, as needed. Both LSB and MSB implementations are common. || || Linear hashing || Linear hashing allows for expansion of the hash table one slot at a time, round-robin. || # Transactions A transaction is a unit of work performed within a DBMS. They provide reliability and isolation. Transactions are "all or nothing", in the sense that the entire unit of work must complete without error, or have no effect at all. ## ACID: Atomicitiy, consistency, isolation and durability Databases must conform to the ACID principles: || Atomicity || A unit of work cannot be split into smaller parts—a unit of work can either complete entirely, or fail entirely. || || Consistency || A transaction always leaves the database in a valid state upon completion, with all data valid according to defined rules such as constraints, triggers, cascades etc. || || Isolation || No transaction can interfere with another transaction. || || Durability || When a transaction completes, it remains complete, even in the case of power loss, crashes etc. || A typical transaction follows the following lifecycle: 1. Begin the transaction 2. Execute a set of data manipulations/queries 3. If no errors occur, then commit the transaction and end it 4. If errors occur, then rollback the transaction and end it ## Policies used in transaction control The following are four policies that can be used in transaction control in a database: || NO-FORCE policy || Changes made to objects are not required to be written to disk in-place. Changes must still be logged to maintain durability, and can be applied to the object at a later time. This reduces seek-time on disk for a commit, as the log is usually sequentially stored in memory, and objects can be scattered. Frequently updated objects can also merge accumulated writes from the log, thereby reducing the total number of writes on the object. Faster, but adds rollback complexity. || || FORCE policy || Changes made to objects are required to be written to disk in-place. Sucks. || || STEAL policy || Allows a transaction to be written on a nonvolatile storage before its commit occurs. Faster, but adds rollback complexity. || || NO-STEAL policy || Does not allow a transaction to be written on a nonvolative storage before its commit occurs. Sucks. || Performance wise, if anyone asks, STEAL NO-FORCE is the way to go. ## The system log A DBMS maintains a system log tracking all transaction operations to be able to recover from failures. The log must be kept on non-volatile storage, for obvious reasons. A log consists of records of different types: 1. [start\_transaction, transaction\_id] 2. [write\_item, transaction\_id, item\_id, old\_value, new_value] 3. [read\_item, transaction\_id, item_id] 4. [commit, transaction_id] 5. [abort, transaction_id] ## ARIES (aka Algorithms for Recovery and Isolation Exploiting Semantics) Recovery algorithm designed to work with a no-force, steal database approach. ARIES has three main principles: || Write ahead logging || Any change to an object is first recorded in the log, and the log must be written to non-volatile storage before changes to the object are made. || || Repeating history during redo || On restart after a crash, ARIES retraces the steps in history necessary to bring the database to the exact state at the time of the crash, and then undos the transactions that were still active at the time of the crash. || || Logging changes during undo || Changes made to the database during undo are logged to prevent unwanted repeats. || Two datastructures must be maintained to gather enough information for logging: the dirty page table (DPT) and the transaction table (TT): || Dirty page table (DPT) || Keeps record of all the pages that have been modified and not yet written to disk. It also stores the sequence number of the first sequence that made a page dirty (recLSN). || || Transaction table (TT) || Contains all transactions that are currently running, as well as the sequence number of the last sequence that caused an entry in the log. || ARIES periodically saves the DPT and the TT to the log file, creating checkpoints, to avoid rescanning the entire log file in the case of a redo. This allows for skipping blocks that are not in the DPT, or that are in the DPT but have a recLSN that is greater than the logg post LSN. ## Schedule (aka History) Schedules are lists of operations (usually data reads/writes, locks, commits etc) ordered by time. Schedules can be classified into different classes: || Serial || Transactions are non-interleaved. No transaction starts before the previous one has ended. Example: D = R1(X)W1(X)C1W(2)C2. || || Serializable || Schedules that have equivalent outcomes to a serial schedule are serializable. A schedule is serializable if and only if its precedence graph is acyclic || || Recoverable || Transactions commit only after all transactions whose changes they read, commit. || || Unrecoverable || A schedule which is not recoverable || || ACA (Avoids Cascading Aborts, aka Cascadeless) || Transactions may not read uncommitted changes from another transaction during the same schedule (RAW is disallowed). All ACA are recoverable. || || Strict || Write operations in one transaction which precedes a conflicting operation in another transaction must be committed before the conflicting operation in the second transaction occurs. All strict schedules are ACA. Disallows WAW and RAW. || # Database normalization Database normalization is the process of organizing the fields and tables of a relational database to minimalize redundancy and dependency. ## Normal forms (aka NF) A normal form is a set of criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The most common normal forms are: || 1NF || A table represents a relation and has no repeating groups. All attribute values are required to be atomic. || || 2NF || No non-prime attribute in the table is partially dependent on any candiate key. || || 3NF || Every non-prime attribute in the table is directly dependent on every candidate key. || || BCNF || Every non-trivial functional dependency is a dependency on a superkey. || Note that to meet the requirements of a normal form, a table must also meet all the requirementes of the lower normal forms. ## Functional Dependencies A functional dependency, written $ A \rightarrow B $, on some relation (table) where A and B are sets of attributes (columns) in the relation, means that if two tuples have the same value in A they must have the same value in B. $$$ A \rightarrow B \Longleftrightarrow t\_{n}[A]=t\_{m}[A] \Rightarrow t\_{n}[B]=t\_{m}[B] $$$ If more than one attribute appear on either side of the arrow they should be and'ed together ## Keys || Superkey || A set of attributes of a relation schema that are unique for each tuple. I.e. if S is a superkey of the relation schema R, then no two tuples in R will have the same combination of values in S: $$ t\_{n}, t\_{m}\in R $$, $$ n\neq m $$, $$ t\_{n}[S]\neq t\_{m}[S] $$ || || Key || A minimal superkey: removing one of the key's attributes results in it ceasing to be a superkey. || || Candidate key || If a relation has more than one key, they are called candidate keys: candidate keys are a set of attributes that qualify as a key. || || Primary key || Whichever candidate key is arbitrarily chosen as a relation's key. || || Prime attribute || Key attributes are attributes that are part of one or more candidate keys. || # Relational algebra ## Primitive operators || Projection (π) || The projection of a relation R on a set of attributes A is the set of all tuples in R restricted to the set of attributes in A. Basically the SELECT of relational algebra. || || Selection (σ) || A selection with the set of conditions C selects all the tuples in a relation R such that the conditions in C hold. Basically the WHERE of relational algebra. || || Rename (ρ) || Renames attributes in a relation. Basically the AS of relational algebra. ||

|| Natural Join (⋈) || Joins two relations R and S. The result of the join is the set of all combinations of tuples in R and S that are equal on their common attribute names. Basically the INNER JOIN of relational algebra.|| # Andre relevante dokumenter: _Notater fra kapittel 16 til kapittel 24:_ https://docs.google.com/document/d/175PwQqt08J5Lw0j6MmBiMQxgfROGDt37-0gGwdx_1Uo/edit?usp=sharing _Kompendier skrevet av enkeltpersoner:_ https://wiki.online.ntnu.no/projects/online/wiki/TDT4145