TDT4225: Store, distribuerte datamengder
> One day we'll be done with this, and we'll never have to look back unless some vague memory pops into our mind by happenstance which is actually apt to happen. So, y'know, good luck or whatever.
# Curriculum 2015
The four lab exercices are relevant for the exam.
In addition, the slides that have been used in lectures this semester:
Doeppner, kap 6
: File Systems 1-49, 53-90, 97-123, 126-129, 138-151
Forfang/Bratsberg
: Key-value stores 1-25
Kjell B. kap 7, Bloom filters
: 31, 32, 37, 38, 39, 40, 41
Kjell B. kap 9
: Matrices: 6, 7, 9-11, 14-22, 23-28
Retrival of objects: 29-37, 39, 40, 42, 45 (formel b=x3=Vs), 46, 48, 52
DB opt: 53-56
Kjell B. kap 10
: Sorting: 2-9, 11-12.
Kjell B. kap 11
: Algebra 4-6, 16, 17-18, 20, 22-27, 29, 31, 33, 34, 35
Kjell B. kap 12
: Adv. algebra 2-3, 4-7, 14, 21, 24-29, 33, 34, 40, 42
Kjell B. kap 13
: Indexes and opt 2-4, 6-10
# Exam tips
__NOTE:__ these tips were written based on old exams written by Kjell Bratbergsengen. They may not apply if the exam is made by someone else, or if he's not involved with the course the semester you're taking the exam.
* If information is not readily available in a problem (such as the specifications for a hard drive), make wild assumptions about it and reason based on those.
* Answer first whatever seems most obvious and simple -- even if it seems so blatantly obvious that you'd not expect to be asked about it in a fourth year course. Spend time trying to be clever once you've done that for all the questions.
* Values given in "[KMGT]B" might be either [KMGT]iB (binary, IEC definition) or [KMGT]B (decimal, metric system). It's usually not specified and you should state in your answer whatever you assume it is.
# File systems
## S5FS
- Simple
- Ancient
S5FS considers the read time of all sectors on the disk to be the same.
It does not take the time the disk head needs to move into account.
It does not try to minimize seek time (the time the disk head needs to position itself), or rotational latency (time before the desired sector is under the disk head).
The only thing S5FS optimizes for, is a minimum number of disk operations.
The first block on the disk is the _boot block_. It contains a program for loading the operating system into memory.
The second block is the _superblock_. It holds references to _free lists_ and describes the layout of the file system.
Then comes the _i-list_. This is an array of inodes, which represent files.
Last is the _data region_, which contain the disk blocks that make up file contents.
### Inodes
The inode contains direct pointers to the first 10 disk blocks of a file.
If the file is larger than 10 disk blocks, the inode also points to what is called an _indirect block_.
This is a disk block which contains pointers to up to 256 new data blocks.
The inode may also point to a _double indirect block_, which contains pointers to 256 new indirect blocks.
Finally, the inode may contain a pointer to a _triple indirect block_.
In total, this allows for files up to around 16 GB.
The actual maximum allowed file size however, is 2 GB, because the file size must be representable as a 32-bit word.
When opening a file, its inode is loaded into primary memory.
This means that accessing the first 10 data blocks of the file will be very fast.
When accessing blocks beyond this, additional indirect block lists will first need to be accessed.
### Problems with S5FS
- Small block size
- Poor file-allocation strategy
- Separation of inodes and data blocks
- Vulnerable to data loss caused by crashes
The poor performance comes from lots of slow seeks. When loading a file, it must first fetch the inode.
Then, it musk seek to another part of the disk to access the data blocks.
For large files, when indirect blocks are needed, this takes even longer.
Blocks are allocated when needed, and not placed for sequential access.
To quote Doeppner, the "performance is so terrible that one really wonders why this file system was used at all.".
_(Operating Systems In Depth, Thomas Doeppner)_
## FFS
_Fast File System_
## NTFS
Inodes are called _file records_.
The inode file is called the _master file table_.
## Journaling
Journaling is essentialy write ahead logging. We use the term journaling to avoid confusion with log-structured file systems. The main principle is to write any changes that are to be commited to a log before the changes are written to disk. If a crash now happenes, we can recover by reading the log at restart. Ext3 uses journaling.
# Key-value stores
## LevelDB
Fast key-value storage engine written at Google.
Provides an ordered mapping from string keys to string values.
**API:**
- Put(key, value)
- Get(key)
- Delete(key)
The data in LevelDB is stored internally in memtables and sstables.
### Memtable
Stored entirely in memory.
Max. a few MB size.
### Sstable
_(Sorted String Table)_
Around 2 MB in size.
# Bloom filter
Returns either _probably in set_ or _definitely not in set_.
That is, it may give false positives, but never false negatives.
The bloom filter is an array of _m_ bits, all initialized to 0.
It also needs _k_ hash functions.
**Inserting:**
When adding an element, it is run through all the _k_ hash functions. The output of each hash function denotes a position in the _m_-bit array. All these positions are then set to 1.
**Querying:**
To check if an element is in the set, run it through the hash functions, like when inserting.
If any of the array positions returned by the hash functions contain a zero, the element is _definitely not_ in the set.
If all the bits are 1, then the element _may be_ in the set (or they have all just been set to 1 by accident).
Because the bloom filter cannot permit false negatives, removing elements is not supported (that would mean changing 1-bits back to 0, which would also "remove" other elements that map to the same bit).
# Two-phase reservoir/replacement sort (external sorting method)
This is "explained" in chapter 10 of SAMOLDV.
The replacement/reservoir sort described in the book is a kind of external tournament sort used when we don't have enough working memory to hold the entire data set that is to be sorted.
It consists of two phases:
* The sort phase, see [Tournament sort on wikipedia](http://en.wikipedia.org/wiki/Tournament_sort) for a brief explanation and [this stackoverflow question](http://stackoverflow.com/questions/16326689/replacement-selection-sort-v-selection-sort) for a better explanation than what the book offers.
* The merge phase, in which the "runs" from the sort phase are merged. The book describes some hella weird merging scheme in 10.2.4 and 10.2.5.
# R-trees
R-trees are used for spatial access methods. Like indexing multi-dimensional information such as geographical coordinates.
Imagine a two-dimensional grid with things on it that are indexed in the R-tree.
The terminal/leaf nodes of the R-tree can be thought of as rectangular boxes defined by the coordinates of two of their opposite corners.
All things/objects/items on the grid that are found within the rectangular box of a terminal node is stored in that terminal node.
But guess what? Every node is represented in this manner!
It's boxes all the way down.
The rectangular boxes are also known as bounding boxes.
## Quad(ratic) split
Quadratic split find the pair of rectangles that are the least desirable to have in the same node, and uses these two as the initial objects in two new groups.
So within the node or block or whatever it is you are splitting, for each box/object, calculate the area of the bounding box that would encompass it and one other box/object (for all other objects).
These two objects, $S_{1}$ and $S_{2}$, are chosen as the seeds for the split.
For every other object $O$, calculate the size of the bounding box encompassing $S_{1} \cup O$ and $S_{2} \cup O$.
Add $O$ to the set which would experience the smallest increase in its bounding box area if $O$ is added to it.
I.e., if $A(x)$ is the bounding box area of $x$, add $O$ to $S_{1}$ if $A(S_1 \cup O) - A(S_1) < A(S_2 \cup O) - A(S_2)$ and to $S_2$ otherwise.
Continue doing this until the sum of size of the smallest set/group and the remaining objects is equal to $m$.
When this happens add all the remaining objects to the smallest set/group.
Wait, $m$? Yeah, $m$. It is a _selected parameter_, which means __you__ get to decide what it is.
$ 2 \leq m \leq M$, where $M$ is the number of objects in the block/node.
# K-d tree
Short for K-dimensional tree.
# Storage of raster graphics and voxels
Raster graphics are pretty much just a bitmap: a 2-dimensional grid where each cell corresponds to a pixel or point in the image.
A voxel represents a value on a regular grid in 3-dimensional space.
Think of it like a pixel but for 3d images.
A voxel can be a color value (like a pixel) representing an arbitrary volume in the picture or model.
## Arbitrary cut
# Hashing, hash file ("randomisert fil")
## Linear hashing
# Relational Algebra
Yeah that's right, relational algebra is up in this one as well.
But it's not exactly the same as in [TDT4145](/TDT4145).
It's outlined in 11.3 of SAMOLDV, page 352.
The "basic algebra operations" are defined in terms of what you might remember as relational algebra operations from TDT4145.
Which is weird because the book defines selection using both $\sigma$ and $\pi$.
## Syntax
$$\text{TableName : selector : reductor}$$
* __TableName__ = The name of the table we want to do stuff with.
* __selector__ (s) = selection criteria for tuples/records/rows in _TableName_.
* __reductor__ (r) = Which columns/attributes of the selected tuples should be "carried on"/made available in the result.
If the reductor is empty or `*` then all attributes are returned.
If an attribute is listed and immediately followed by a `/` followed by a string (`Attribute/string`) then that attribute/column is renamed to / made available as the string in the result.
$r_{R}$ is the reductor supplied with table R. $s_{A}$ is the selector supplied with table A.
## Basic algebra operations
### Selection
`SELECT(A:s:r, R:s:r);`
$$R = \pi_{r_R}\sigma_{s_{R}}(\pi_{r_{A}}\sigma_{s_A}A)$$
You can already tell it's weird because we've defined selection using the relational algebra operation selection ($\pi$) twice.
I don't know what's up with that either. But it's totally unnecesary.
Just check out [the wikipedia article on Relational Algebra](http://en.wikipedia.org/wiki/Relational_algebra) for a better explanation.
### Projection
### Join
### Cross Product
## Set operations
### Union
### Difference
### Intersection
### Subset
### Testing table equality
### Testing table unequality
## Grouping Operations
### Aggregation
### Division
The division is a binary operation that is written as $R \div S$. The result consists of the restrictions of tuples in R to the attribute names unique to R, i.e., in the header of R but not in the header of S, for which it holds that all their combinations with tuples in S are present in R.
### Group Sort
## Order of operations / operand ordering
See Problem 6 on the 2013v Exam for an example problem of this kind.
Note that when calculating the "total I/O volume", tables are read when involved in an operation and intermediate results are written back to disk before being read again for the next operation.
That's how the answer key usually describes it anyway.
For natural join, __*__, use a greedy approach and find the smallest given intermediate result and select that.
Repeat until there are no more intermediate results to select.
For example, if asked to find the execution order of __R=A\*B\*C__ that results in the smallest I/O volume, pick whichever one of __A\*B__, __B\*C__ or __A\*C__ (natural join is symmetric) that has the smallest intermediate result.
The next operation would be whichever one you just picked and the remaining table.
Total I/O volume is equal to: $A_{size} + B_{size} + C_{size} + R_{size} + 2\times X_{size}$, where $X$ is the intermediate operation you picked.
# Advanced methods for execution of relational algebra
See also chapter 12 of _SAMOLDV_.
# Relevant papers
## Facebook TAO
"The Associations and Objects"
<https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf>
## Amazon Dynamo
## RAMCloud
## Google's Spanner