This is an old version of the compendium, written May 20, 2015, 4:32 p.m. Changes made in this revision were made by stiaje. View rendered version.

TDT4225: Store, distribuerte datamengder

> One day we'll be done with this, and we'll never have to look back unless some vague memory pops into our mind by happenstance which is actually apt to happen. So, y'know, good luck or whatever. # Curriculum 2014 The curriculum for 2014 was the entirety of _Storing and Management of Large Data Volumes_, (SAMOLDV) by Kjell Bratbergsengen. The book was made available on the course's ITSL-page. Kjell also did the lectures and, well, everything else in the course I think. Anyway next year (2015) some other guy is apparently going to be running the boat or ship so things might change then. # Exam tips __NOTE:__ these tips were written based on old exams written by Kjell Bratbergsengen. They may not apply if the exam is made by someone else, or if he's not involved with the course the semester you're taking the exam. * If information is not readily available in a problem (such as the specifications for a hard drive), make wild assumptions about it and reason based on those. * Answer first whatever seems most obvious and simple -- even if it seems so blatantly obvious that you'd not expect to be asked about it in a fourth year course. Spend time trying to be clever once you've done that for all the questions. * Values given in "[KMGT]B" might be either [KMGT]iB (binary, IEC definition) or [KMGT]B (decimal, metric system). It's usually not specified and you should state in your answer whatever you assume it is. # Two-phase reservoir/replacement sort (external sorting method) This is "explained" in chapter 10 of SAMOLDV. The replacement/reservoir sort described in the book is a kind of external tournament sort used when we don't have enough working memory to hold the entire data set that is to be sorted. It consists of two phases: * The sort phase, see [Tournament sort on wikipedia](http://en.wikipedia.org/wiki/Tournament_sort) for a brief explanation and [this stackoverflow question](http://stackoverflow.com/questions/16326689/replacement-selection-sort-v-selection-sort) for a better explanation than what the book offers. * The merge phase, in which the "runs" from the sort phase are merged. The book describes some hella weird merging scheme in 10.2.4 and 10.2.5. # R-trees R-trees are used for spatial access methods. Like indexing multi-dimensional information such as geographical coordinates. Imagine a two-dimensional grid with things on it that are indexed in the R-tree. The terminal/leaf nodes of the R-tree can be thought of as rectangular boxes defined by the coordinates of two of their opposite corners. All things/objects/items on the grid that are found within the rectangular box of a terminal node is stored in that terminal node. But guess what? Every node is represented in this manner! It's boxes all the way down. The rectangular boxes are also known as bounding boxes. ## Quad(ratic) split Quadratic split find the pair of rectangles that are the least desirable to have in the same node, and uses these two as the initial objects in two new groups. So within the node or block or whatever it is you are splitting, for each box/object, calculate the area of the bounding box that would encompass it and one other box/object (for all other objects). These two objects, $S_{1}$ and $S_{2}$, are chosen as the seeds for the split. For every other object $O$, calculate the size of the bounding box encompassing $S_{1} \cup O$ and $S_{2} \cup O$. Add $O$ to the set which would experience the smallest increase in its bounding box area if $O$ is added to it. I.e., if $A(x)$ is the bounding box area of $x$, add $O$ to $S_{1}$ if $A(S_1 \cup O) - A(S_1) < A(S_2 \cup O) - A(S_2)$ and to $S_2$ otherwise. Continue doing this until the sum of size of the smallest set/group and the remaining objects is equal to $m$. When this happens add all the remaining objects to the smallest set/group. Wait, $m$? Yeah, $m$. It is a _selected parameter_, which means __you__ get to decide what it is. $ 2 \leq m \leq M$, where $M$ is the number of objects in the block/node. # K-d tree Short for K-dimensional tree. # Storage of raster graphics and voxels A raster image/raster graphics are pretty much just a bitmap: a 2-dimensional grid where each cell corresponds to a pixel or point in the image. A voxel represents a value on a regular grid in 3-dimensional space. Think of it like a pixel but for 3d images. A voxel can be a color value (like a pixel) representing an arbitrary volume in the picture or model. ## Arbitrary cut # Hashing, hash file ("randomisert fil") ## Linear hashing # Relational Algebra Yeah that's right, relational algebra is up in this one as well. But it's not exactly the same as in TDT4145. It's outlined in 11.3 of SAMOLDV, page 352. The "basic algebra operations" are defined in terms of what you might remember as relational algebra operations from TDT4145. Which is weird because the book defines selection using both $\sigma$ and $\pi$. ## Syntax $$\text{TableName : selector : reductor}$$ * __TableName__ = The name of the table we want to do stuff with. * __selector__ (s) = selection criteria for tuples/records/rows in _TableName_. * __reductor__ (r) = Which columns/attributes of the selected tuples should be "carried on"/made available in the result. If the reductor is empty or "*" then all attributes are returned. If an attribute is listed and immediately followed by a / followed by a string ("Attrbute/string") then that attribute/column is renamed to/made available as the string in the result. $r_{R}$ is the reductor supplied with table R. $s_{A}$ is the selector supplied with table A. ## Basic algebra operations ### Selection `SELECT(A:s:r, R:s:r);` $$R = \pi_{r_R}\sigma_{s_{R}}(\pi_{r_{A}}\sigma_{s_A}A)$$ You can already tell it's weird because we've defined selection using the relational algebra operation selection ($\pi$) twice. I don't know what's up with that either. But it's totally unnecesary. Just check out [the wikipedia article on Relational Algebra](http://en.wikipedia.org/wiki/Relational_algebra) for a better explanation. ### Projection ### Join ### Cross Product ## Set operations ### Union ### Difference ### Intersection ### Subset ### Testing table equality ### Testing table unequality ## Grouping Operations ### Aggregation ### Division The division is a binary operation that is written as $R \div S$. The result consists of the restrictions of tuples in R to the attribute names unique to R, i.e., in the header of R but not in the header of S, for which it holds that all their combinations with tuples in S are present in R. ### Group Sort ## Order of operations / operand ordering See Problem 6 on the 2013v Exam for an example problem of this kind. Note that when calculating the "total I/O volume", tables are read when involved in an operation and intermediate results are written back to disk before being read again for the next operation. That's how the answer key usually describes it anyway. For natural join, __*__, use a greedy approach and find the smallest given intermediate result and select that. Repeat until there are no more intermediate results to select. For example, if asked to find the execution order of __R=A\*B\*C__ that results in the smallest I/O volume, pick whichever one of __A\*B__, __B\*C__ or __A\*C__ (natural join is symmetric) that has the smallest intermediate result. The next operation would be whichever one you just picked and the remaining table. Total I/O volume is equal to: $A_{size} + B_{size} + C_{size} + R_{size} + 2\times X_{size}$, where $X$ is the intermediate operation you picked. # Advanced methods for execution of relational algebra See also chapter 12 of _SAMOLDV_.