A DBA's Journey into the Cloud and Big Data: Understanding What is a NoSQL Database?

Everyone Is Trying to Understand NoSQL Databases

When I work with customers who are looking at Big Data and Hadoop solutions, I am often asked to define what NoSQL databases are. A lot is being written about NoSQL databases because they are one of the hot technology areas of Big Data. In this post I explain what NoSQL databases are and how they fit into Big Data ecosystems.

Understanding Big Data

Traditional systems (relational databases and data warehouses) have been the way most organizations have stored, managed and analyzed data. These traditional systems are not going anywhere; what they do, they do well. However, today's data environment has changed significantly, and traditional systems have difficulty working with a large part of today's data (Big Data). Big Data has been given a lot of different definitions, but what it really is, is a data environment that meets one or more of the following criteria:

A large amount of data to be stored, processed and analyzed.
Data that is often largely semi-structured or unstructured.
Data that can arrive at high ingestion rates.
Large amounts of data that have to be processed quickly.

Traditional Systems Were Not Designed For Big Data

Traditional systems were not designed from their foundations to handle this type of data environment. Big Data is the environment that exists when the data gets too difficult or expensive for traditional systems to handle. Organizations are finding that data about you can be just as critical to business success as the data generated internally, and they are almost desperate to correlate internal data with data that is generated externally (social media, VoIP, machine data, RFID, geographical coordinates, videos, sound, etc.). NoSQL systems are designed from the ground up to deal with this type of data environment cost effectively. Traditional database vendors do not want to miss out on this wave of Big Data, but they are providing add-ons to their existing systems, whereas NoSQL databases are designed from the ground up to work with Big Data. The other challenge with traditional systems is that their vendors want to sell very expensive hardware and software licenses, compared to the relatively inexpensive open source solutions.

What is NoSQL?

NoSQL describes a class of database management systems with characteristics and capabilities that can address Big Data in ways that traditional databases were not designed for. NoSQL solutions usually have the following features or characteristics:

Scalability of big data (100s of TB to PBs). Horizontal scalability with x86 commodity hardware.
Schema-on-read (versus the schema-on-write of traditional databases), which makes it much easier to work with semi-structured and unstructured data.
Data spread out using distributed file systems that use replicas for high availability.
High availability and self-healing capability.
Connectivity that can include, but is not limited to, SQL, Thrift, REST, JavaScript and APIs.
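
To make the schema-on-read idea in the list above concrete, here is a minimal Python sketch. It is not taken from any particular NoSQL product; the records, the field names and the read_with_schema helper are all made up for illustration. The point is only that the raw records are stored untouched, and each query decides which fields it wants and how to interpret them when it reads.

```python
import json

# Raw events are stored exactly as they arrive; nothing is validated on write.
# (Hypothetical sample data, not from a real system.)
raw_store = [
    '{"user": "alice", "action": "login", "ts": 1384128000}',
    '{"user": "bob", "action": "purchase", "amount": "19.99"}',
    '{"user": "carol", "tags": ["mobile", "beta"]}',
]


def read_with_schema(raw_records, schema):
    """Apply a schema at read time.

    schema maps a field name to (cast, default). Fields present in a record
    are cast; fields missing from a record fall back to the default.
    """
    for line in raw_records:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else default)
            for field, (cast, default) in schema.items()
        }


# Two different "schemas" projected over the same stored records.
purchases_schema = {"user": (str, None), "amount": (float, 0.0)}
activity_schema = {"user": (str, None), "action": (str, "unknown")}

purchases = list(read_with_schema(raw_store, purchases_schema))
activity = list(read_with_schema(raw_store, activity_schema))

for row in purchases:
    print(row)
```

With schema-on-write, the third record (which has neither an action nor an amount) would have been rejected or forced into a fixed table layout at load time. Here it is simply stored, and each reader decides what to do about the missing fields.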

Here is the Wikipedia definition of NoSQL.

A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are often highly optimized key–value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput. NoSQL databases are finding significant and growing industry use in big data and real-time web applications. NoSQL systems are also referred to as "Not only SQL" to emphasize that they may in fact allow SQL-like query languages to be used.

The term NoSQL describes an approach to data management rather than a rigid definition. There are different types of NoSQL databases; they often share certain characteristics, but each is optimized for specific types of data, which in turn requires different capabilities and features. NoSQL may mean Not only SQL, or it may mean "No" SQL. A No SQL database may use APIs or JavaScript to access data instead of traditional SQL. NoSQL datastores may be optimized for key-value, columnar, document-oriented, XML, graph and object data structures. NoSQL databases are very scalable, have high availability and provide a high level of parallelization for processing large volumes of data quickly. NoSQL solutions are evolving constantly.

A number of the NoSQL databases can point to Google's BigTable design as their parent source. Characteristics of Google BigTable include:

Designed to support massive scalability of tens to hundreds of petabytes.
Move the programs to the data versus relational databases that move the data to the programs (memory).
Data is sorted using row keys.
Designed to be deployed in a clustered environment using x86 commodity hardware.
Supports compression algorithms.
Distributes data across local disk drives on commodity hardware supporting massive levels of IOPS.
Supports replicas of data for high availability.
Uses a parallel execution framework like MapReduce or something similar for extremely high parallelization capabilities.
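
Three of the characteristics above (rows sorted by row key, moving the program to the data, and a MapReduce-style framework) can be sketched together in a few lines of Python. This is a toy, single-process illustration and not BigTable's or HBase's actual API; the SortedTable class, the row keys and the status field are all assumptions made up for the example.

```python
from bisect import bisect_left
from collections import Counter


class SortedTable:
    """Toy table that keeps rows ordered by row key, so a range scan is a
    contiguous slice rather than a pass over all the data."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._rows[key] = value

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def map_partition(rows):
    """Map step: runs against one partition and emits a small partial count."""
    return Counter(value["status"] for _, value in rows)


def reduce_counts(partials):
    """Reduce step: merge the partial counts from every partition."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


table = SortedTable()
for key, status in [("web#003", "ok"), ("app#001", "ok"), ("web#001", "error"),
                    ("app#002", "error"), ("web#002", "ok")]:
    table.put(key, {"status": status})

# Two key ranges stand in for two nodes, each holding one slice of the key space.
partitions = [table.scan("app#", "app#~"), table.scan("web#", "web#~")]
totals = reduce_counts(map_partition(p) for p in partitions)
print(totals)
```

In a real cluster each partition would live on a different machine, and the map step would run on that machine, so only the small partial counts travel over the network.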

The two primary NoSQL databases supported by the Hortonworks Data Platform (HDP) are HBase and Accumulo. Here are some examples of NoSQL databases:

HBase (Columnar) – designed for optimized scanning of column data
Accumulo – Key-value datastore that can maintain data consistency at the petabyte level, read and write in near real-time and contains cell-level security. Accumulo was developed at the National Security Agency.
Cassandra – A real-time datastore that is highly scalable. Uses a peer-to-peer distributed system. Key oriented using column families. Supports primary and secondary indexes. Uses CQL (Cassandra Query Language) for its SQL-like language.
MongoDB (document-oriented) – Highly scalable database that runs MapReduce jobs using JavaScript.
CouchDB (document-oriented) – Highly scalable database that can survive just about anything except maybe a nuclear bomb. Uses JavaScript to access data.
Terracotta – Uses a big memory approach to deliver fast, highly scalable systems.
Voldemort – A key-value distributed storage system.
MarkLogic – Highly scalable XML based database management system.
Neo4J (graph oriented) – A graph database that allows you to access your data in the form of a graph. A graph database gives you fast access to information associated with nodes and relationships.
VMware vFabric GemFire (object entries) – Uses key-value pairs for in-memory data management.
Redis (key-value) – String keys map to values that can be strings, hashes, lists or sets. Entire data set is cached in memory with disk persistence. Highly scalable.
Riak (key-value) – Text oriented, scalable system based on Amazon's Dynamo.
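
To see how the data models named in the list above differ, here is a rough Python sketch that expresses one made-up customer order as key-value, document, wide-column and graph data. It uses plain Python structures only and does not call any of the products listed; every key, field and relationship name is hypothetical.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"order:1001": '{"customer": "alice", "total": 42.5}'}

# Document: a nested, self-describing record that can be queried by field.
document = {
    "_id": 1001,
    "customer": {"name": "alice", "city": "Austin"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Wide column: row key -> column family -> column -> value.
# Different rows may carry different columns.
wide_column = {
    "alice#1001": {
        "order": {"total": "42.5", "status": "shipped"},
        "items": {"A1": "2", "B7": "1"},
    }
}

# Graph: nodes plus typed relationships between them.
nodes = {"alice": "Customer", "order1001": "Order", "A1": "Product", "B7": "Product"}
edges = [
    ("alice", "PLACED", "order1001"),
    ("order1001", "CONTAINS", "A1"),
    ("order1001", "CONTAINS", "B7"),
]


def neighbors(node, relation):
    """Follow one relationship type outward from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


# The question "what did alice order?" answered in three of the models.
from_document = [item["sku"] for item in document["items"]]
from_wide_column = sorted(wide_column["alice#1001"]["items"])
from_graph = [p for o in neighbors("alice", "PLACED") for p in neighbors(o, "CONTAINS")]

# The key-value store only hands back the value; the application must parse it.
order_total = json.loads(key_value["order:1001"])["total"]

print(from_document, from_wide_column, from_graph, order_total)
```

The key-value form is the simplest and fastest to look up but knows nothing about what is inside the value, while the graph form makes following relationships the natural operation. That trade-off is a large part of why there are so many different NoSQL databases.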

NoSQL databases are not designed to replace the traditional RDBMS. NoSQL databases are becoming part of the enterprise data platform for organizations, providing functionality that traditional systems do not handle well because of the size or complexity of the data or the volume of data being ingested.

NoSQL and SQL Analogies
Here is another way of looking at NoSQL and SQL from a coauthor and friend, Steven Jones.

Think of SQL and No SQL in terms of distinctions. Here are some word pictures of distinctions:

No SQL databases deliver fast answers from big, messy piles of data.

SQL databases deliver deliberate, logically churned-out answers from data that is well organized and groomed to the essentials.

Think of them as the odd couple: one is Felix and the other is Oscar.

Or No SQL is detective Columbo and SQL is detective Monk.

One is an answer from a hot mess; the other is an architect's blueprint where logical reasoning reduces truth to its essence.

No SQL is rap or dubstep, SQL is classical.

SQL assumes, by its order or structure, that you know the questions to be asked.

No SQL assumes no order until you can think of a question or a need in the moment.

SQL is mathematically derived.

No SQL is merely reasonably ordered.

Rankings of Different Types of Databases from DB-Engines (November 2013)

DB-Engines ranks Wide Column Stores

Cassandra
HBase
Accumulo
Hypertable

DB-Engines ranks Document Stores

MongoDB
CouchDB
Couchbase
RavenDB
Gemfire

DB-Engines ranks Graph DBMS

Neo4J
Titan
OrientDB
Dex

DB-Engines ranks Key Value Stores

Redis
Memcached
Riak
Ehcache
DynamoDB

Note: Berkeley DB (7th), Coherence (8th), Oracle NoSQL (10th)

DB-Engines ranks Object Oriented DBMS

Caché
Db4o
Versant Object Database

DB-Engines ranks Relational DBMS

Oracle
MySQL
Microsoft SQL Server
PostgreSQL
DB2

Note: Teradata (9th), Hive (12th), SAP HANA (16th)

1 comment:

ademartins, November 10, 2013 at 11:24 PM
Irrespective of new DBMS platforms, the skills of seasoned DBAs and DBA architects will still be required to design, support, and maintain these new platforms; the difference is that a DBA's need for continuous learning and innovation becomes a lifelong practice.

Sunday, November 10, 2013
