Introduction
NoSQL is a type of database management system (DBMS) that provides mechanisms for data storage and retrieval by means other than the tabular relation used in relational databases. In practice, NoSQL DBMSs usually complement, rather than replace, relational DBMSs. NoSQL is suitable for storing schema-less data and for web servers that need high availability, which are some of the reasons why NoSQL is a popular choice in developing web-related services. For a brief overview, see "NoSQL Databases: An Overview". For a detailed overview in the form of a book written by the same author, see NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence.
First, a high-level view of the NoSQL land that is not specific to a particular data model is presented. This includes distribution models, consistency, version stamps, and map-reduce. Then, the four major NoSQL data models are given in more details: key-value, document, column-family, and graph. The workshop portion provides basic, hands-on exposure to some of the popular NoSQL DBMSs, namely Apache HBase, CouchDB, and ArangoDB. Occasionally, NoSQL is compared to and contrasted with the relational model where appropriate.
Distribution and Consistency
One of the reasons for the rise of NoSQL is its ability to run databases on a cluster in order to meet the high demands of a web server, for example. Unlike the relational model, an aggregate data model (which encompasses the major NoSQL data models except graph) is suitable to be scaled to a cluster, since the aggregate is a unit of distribution for a cluster of nodes. There are two methods of distribution: sharding and replication. Replication can be further categorized into master-slave and peer-to-peer replication. Sharding and replication can be used independently or in combination.
Consistency is an important concern in NoSQL and comes in different forms. Relational DBMS usually exhibits strong consistency, which treats any inconsistency as an unacceptable state. NoSQL, on the other hand, can have varying levels of consistency depending on the usage requirements and demands. Generally, giving up a certain degree of consistency is normally a response to other necessary characteristics of the system. In NoSQL, the CAP theorem explains why one would need to relax consistency. Related to consistency is durability, which is a characteristic of the system that indicates the reliability of the data in the event of a node failure. Durability and responsiveness are often opposing factors.
Version stamps are used in detecting conflicts in reading and updating data. They can implemented using UUIDs, message digests, timestamps, incremental counters, or any combinations of these. In the context of running databases on a cluster, version stamps are important in determining whether different nodes have conflicting reads or updates.
Map-Reduce
Running databases on a cluster affects computation as well as data storage. Since there are many nodes, where a computation takes place is an important contributing factor to performance. Transferring data across the network slows down computation, so one should perform computation on the same node where an aggregate is located as much as that is possible.
The map-reduce pattern is a method of consolidating the computations on a set of aggregates such that data transfer among different nodes is minimized. A map is a function that takes a single aggregate and returns a set of key-value pairs. Since an application of the map function is independent of another, the mapping can be parallelized and distributed along with the aggregate that it is acting on to different nodes, thereby increasing the locality of data access.
The reduce function acts on multiple map outputs that have the same key and returns a single value for that key. A special type of reduce function that receives and returns the same type and semantics of data is known as a combinable reducer. By using a combinable reducer, the amount of data transfer from one node to another can be further lessened. On the flip side, an intrinsic property of a reduce function is that it cannot act across different keys. Therefore, it is vital that the design of an aggregate and the computation steps acting on it are taken into consideration when running on a distributed system.
The Four Major Data Models
Key-Value
The key-value model is the simplest of the four and is nothing more than a simple hash table that associates a key with a value. The value is opaque and, hence, can be text or binary. It is up to the application accessing the database to know the type and structure of the stored value. Many key-value stores are actually in-memory.
Since the key-value model treats the value as a blob, designing the keys is crucial, since there is no direct way of using an aspect of the value as part of a query. The resulting downside is that the application must read the value in order to determine whether the key-value pair meets a set of criteria.
Document
A document-oriented database is similar to a key-value store but with one fundamental difference: the value is examinable. In the document data model, this value is called a document. The fact that the value can be examined increases the level of querying capabilities of the database, since information about the values can now be part of the query design. As such, the value (i.e., document) is usually in some form of structured data, such as JSON (which is the most popular format), and contains a unique identifier.
Column-Family
Like the key-value model, the column-family model (also called wide-column model) maps a key to an associated value. However, in the column-family model, the key represents a row, and the row contains a set of column families, where each family is a group of arbitrary columns. Each column, then, contains a value that is treated as a blob just as in the key-value model.
The column-family model has much conceptual overlap with the relational model but with important differences. A table in a column-family store is analogous to a table in a relational database, where a table contains a set of rows. Similar to a relational database where the columns of a table are predefined, a table in a column-family store contains rows that must have the same column families. But unlike a relational database, a column family in a row can contain columns that are different from the columns in the same column family in another row. Hence, one way to think of a table in the column-family model is to view it as a multidimensional map.
Graph
The graph data model is different from the other three models in that it is not an aggregate model and works on connected vertices of a graph. As such, many graph-database systems have similar drawbacks that a relational DBMS has with regards to the distribution models. Furthermore, a graph DBMS ensures consistency through transactions and is normally ACID-compliant, just as in relational DBMSs.
Graph databases are used when relationships between entities are important. Such relationship is specified as an edge connecting from one vertex to another, where directionality is important and affects how a graph is traversed. A graph DBMS allows multiple levels of relationships between the entities and performs traversals very quickly. In a relational database, such relationships can be mimicked using joins, but which does not lend itself to easy traversals of the relationships. Moreover, adding another relationship type in a relational database often entails substantial changes to the schema and much data movement.
Hands-on with NoSQL
Instructions for obtaining data sets, issuing shell commands, and running scripts (including web crawlers) will be provided during the workshop.
Apache HBase
Apache HBase is a column-family DBMS. During the workshop, we will use a simplistic form of metaprogramming to put some data into HBase. The sample data set is a subset of the genomic features (originally formatted as GFF3) of Chlamydomonas reinhardtii (green algae) provided by Phytozome, a portal of the United States Department of Energy's Joint Genome Institute.
No knowledge of molecular biology, the genome, or C. reinhardtii is necessary. The data set simply provides an illustrative scenario that underscores some of the advantages of the column-family model over the relational model when confronted with columns that have natural groupings (resulting in column families). Once the data set is stored in accordance with the column-family model, HBase provides a filter language that allows the retrieval of rows satisfying a set of criteria, while only displaying columns that are of interest.
CouchDB
CouchDB is a document-oriented DBMS. Using a custom-made web scraper, extracting stock-market data of the Dow Jones technology sector will be demonstrated. After the stock-market information is extracted, another dedicated script will be used to store the data in CouchDB as JSON documents. Finally, we will look at how to retrieve the stored market data using CouchDB's JSON querying language, Mango.
ArangoDB
ArangoDB is a graph DBMS. Another custom-made web crawler will be used to extract information of several books from Amazon's website. The web crawler traverses a network of books that are both directly and indirectly related to a given list of books up to an arbitrary depth. The information of the books that are encountered by the web crawler is then stored in ArangoDB as a graph, where a vertex is a book (with its associated data), and an edge is the relationship between two books. (ArangoDB does not support hypergraphs.) We will then look at some simple uses of the ArangoDB query language (AQL) in retrieving book data and in traversing a network of related books.