This is an old version of the compendium, written May 14, 2016, 2:30 p.m. Changes made in this revision were made by stiaje.

TDT4305: Big Data Architecture

## Introduction

### Terminology

First of all, we define some terms that are often used:

|| __Word__ || __Description__ ||
|| Structured data || Well-defined fields in tables. ||
|| Unstructured data || Data that is usually made by and for humans, e.g. text messages. ||
|| Semi-structured data || Self-describing data, e.g. XML and JSON. ||
|| Batch-oriented || Running a series of programs without human interaction. ||
|| Near-realtime || A short delay between the data becoming available and the data being processed. ||
|| Realtime || Data is processed as it is made available. ||
|| Stream data || An ordered sequence of instances, e.g. sensor data or Twitter messages. The data can be processed without having the whole stream. ||

### Definition of Big Data

As the lectures did, we start by introducing some definitions:

> “Big data is a broad term for data sets so large or complex that traditional data processing applications (i.e., DBMS) are inadequate for capturing/storing/managing/analyzing.” – McKinsey

> “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” – Gartner

The Gartner Group characterised Big Data by three V's in 2011:

|| __Volume__ || Unlike traditional data sets, big data sets are incredibly large. This is because the Internet of Things and other developments have caused a greater production of data. ||
|| __Velocity__ || The data sets are changed and updated very quickly (high velocity), so one has to process them quickly. ||
|| __Variety__ || In traditional applications the data was mainly of one type: transactions involving some industry, e.g. financial, insurance, travel, healthcare, retail or government. The range of data sources has since expanded. ||

Other researchers have added properties:

|| __Veracity__ || Two built-in features: the credibility of the source, and the suitability of the data for its target audience.
Many sources generate incomplete or inaccurate data, so one has to check the data for trustworthiness and validity before use. ||
|| __Value__ || The potential value of the data. ||

The following table shows how some aspects differ between the old relational database scenario and the new Big Data scenario:

![](http://i.imgur.com/kVaNnVZ.png)

From D. Loshin's _Big Data Analytics_ (2013).

## NOSQL Database Systems

### Introduction to NOSQL

__What is NOSQL?__ It stands for Not Only SQL, and refers to database systems that are not relational, are usually schemaless and are not based on the Structured Query Language (SQL).

__Why do we need NOSQL databases?__ The growth of web 2.0 applications like social networks, blogs and applications that share data has given us new requirements for database systems that relational systems cannot meet. We sacrifice some database properties for higher availability, scalability, schemalessness and support for semi-structured data.

__What do traditional SQL database systems offer?__ A simple data model, a powerful query language (SQL), uniform storage of data, indexing, constraints, procedures and schemas, and last but not least: transactions with their ACID properties (Atomicity, Consistency, Isolation, Durability).

__What do NOSQL databases usually offer?__ Horizontal scaling of simple operations over many machines, and replication of data over many nodes. There is a simple interface for managing the data: CRUD/SCRUD. Distributed indexes and RAM are used efficiently. No schema is required, and semi-structured, self-describing data is supported, which means we can use formats like JSON and XML. The consistency model is weaker than ACID ("eventual consistency"): instead of ACID we say it is BASE (Basically Available, Soft state, Eventually consistent).

__NOSQL systems have the following characteristics:__

|| __Characteristic__ || __Introduction and description__ ||
|| Scalability || There are two types of scalability: horizontal and vertical.
Horizontal scalability is generally used in NOSQL systems: you add more nodes for data storage and processing as the amount of data grows. Vertical scalability adds power to existing nodes. ||
|| Availability, Replication and Eventual Consistency || Applications that use NOSQL systems often require continuous availability. This is accomplished by keeping the same data on multiple nodes. Replication improves availability and read performance, but makes writes slower because every copy must be updated. If strict consistency is less important, one can use eventual consistency instead. ||
|| Replication models || There are two replication models in NOSQL systems. _Master-slave replication_ requires one master copy. All write operations are applied to that copy and then propagated to the slave copies. The slave copies will _eventually_ be the same as the master copy. _Master-master replication_ means that one can read from and write to any copy, but there is no guarantee that a given copy holds the latest values. ||
|| Sharding of Files || _Sharding_, aka _horizontal partitioning_, distributes the file records over multiple nodes. ||
|| High-Performance Data Access || One uses either hashing or range partitioning on object keys to be able to find individual records among millions. With hashing, a hash function $h(K)$ is applied to the key $K$ and determines the location of the value. With range partitioning, the location is given by a range of key values, e.g. location $i$ would hold the objects whose key values $K$ are in the range $Ki_{min} \leq K \leq Ki_{max}$. ||

__Characteristics related to data models and query languages:__

|| __Characteristic__ || __Introduction and description__ ||
|| Not requiring a schema || Some NOSQL systems let you define a partial schema, but it is not required. Constraints on the data have to be programmed into the application.
||
|| Less powerful query languages || Applications built on NOSQL systems may not require a powerful query language such as SQL, because the read queries in these systems often locate single objects in a single file based on their keys. NOSQL systems usually give you an API for SCRUD operations (Search, Create, Read, Update, Delete). Many NOSQL systems do not provide join operations as part of their query language, so joins have to be implemented in the application itself. ||
|| Versioning || Some NOSQL systems provide storage for multiple versions of data items. ||

__NOSQL system categories:__

|| __Category__ || __Description__ ||
|| Document-based NOSQL systems || Store data in the form of documents using well-known formats, e.g. JSON. ||
|| NOSQL key-value systems || Simple data model based on fast access by key to the associated value. The value can be a record, an object or a more complex structure. ||
|| Column-based or wide column NOSQL systems || Partition a table by column into column families. Each column family is stored in its own files. These systems also support versioning of data values. ||
|| Graph-based NOSQL systems || Data is represented as graphs. Related data can be found by traversing the graph. ||

Hybrid NOSQL systems have characteristics from two or more of the above categories. Object databases and XML databases are examples of NOSQL system categories that were available before the term NOSQL system was around.

### The CAP Theorem

The CAP theorem, also known as Brewer's theorem after Eric Brewer, who introduced it as the CAP principle, states:

> For a distributed system it is impossible to simultaneously provide all three of the following guarantees:
>
>* Consistency (all nodes see the same data at the same time; this is not the same concept as the consistency in the ACID properties).
>* Availability (every read/write request will either be processed successfully or will give a message that it could not be completed).
>* Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures).

Also see the [Wikipedia article](https://en.wikipedia.org/wiki/CAP_theorem) on the subject, which is also the source for this explanation.

__Regarding the difference between _Consistency_ in CAP and ACID:__ In ACID it refers to the fact that a transaction will not violate the integrity constraints in the database schema. In CAP it refers to the consistency between the values in different copies on different nodes.

In traditional SQL systems, guaranteeing the ACID properties is important. In distributed NOSQL systems one instead chooses which two of the CAP properties to guarantee, and a weaker level of consistency is often accepted in exchange for availability and partition tolerance.

### Document-based NOSQL Systems

#### MongoDB

### NOSQL key-value stores

#### Voldemort

### Column-based or Wide Column based NOSQL systems

#### Hbase

### NOSQL Graph Databases

#### Neo4j

## Big Data Analysis and Technologies

### MapReduce

### Hadoop

### Spark

### Recommender Systems