Saturday, October 26, 2013

Demystifying Hadoop for Data Architects, DBAs, Devs and BI Teams

I started doing my Demystifying series on subjects such as database technologies, infrastructure and Java back in the Oracle 8.0 days.  The topics have ranged from Demystifying Oracle RAC to Demystifying Oracle Fusion and Demystifying MySQL.  So I guess it's time to Demystify Hadoop.  :)

Whether you are talking Oracle RAC, Oracle Exadata, MySQL, SQL Server, DB2, Teradata or application servers, it's really all about the data.  Companies are constantly striving to make faster business decisions with higher degrees of accuracy.  Traditional systems such as Oracle, SQL Server, IBM and Teradata are scaling to store hundreds of terabytes, even petabytes, on hardware that keeps getting faster.  However, these traditional systems were designed for transaction processing and have a lot of difficulty working with big data.  I'm going to talk about why traditional systems are not designed for big data, and about how Hadoop is the right technology at the right time to address it.

What's The Deal About Big Data

Across the board, industry analyst firms consistently report almost unimaginable numbers on the growth of data.  Traditional data in relational databases and data warehouses is growing at incredible rates and is a challenge by itself (shown as Enterprise Data below).
The big news, though, is that VoIP, social media and machine data are growing at exponential rates, completely dwarfing the data growth of traditional systems.  Most organizations are learning that this data is just as critical to making business decisions as their traditional data.  This non-traditional data is usually semi-structured or unstructured.  Examples include web logs, mobile web and clickstream data, spatial and GPS coordinates, sensor data, RFID, video, audio and image data.  The chart below shows the growth of non-traditional data (Machine Data, Social Media, VoIP) relative to traditional data (Enterprise Data).  Source: IDC.

Data becomes big data when the volume, velocity and/or variety of data gets to the point where it is too difficult or too expensive for traditional systems to handle.  Big data is not defined by reaching a certain volume, ingestion velocity or type of data.  Big data is when traditional systems are no longer viable solutions due to the volume, velocity and/or variety of data.  A good first book on big data is Disruptive Possibilities.

The Big Data Challenge

The reason traditional systems have a problem with big data is they were not designed for big data.
  • Problem - Schema-On-Write: Traditional systems are schema-on-write: data must be validated against a schema when it is written.  This means a lot of work has to be done before a new data source can be analyzed.  Here is an example of the problem.  When a company wants to start analyzing a new unstructured or semi-structured source, it will usually spend months (i.e. 3 - 6 months) designing schemas to store the data in a data warehouse.  That is 3 - 6 months during which it cannot use the data to make business decisions, and by the time the design is complete, the data has often changed again.  Data structures from social media change on a regular basis.  The schema-on-write model is too slow and inflexible to deal with the dynamics of semi-structured and unstructured data.  The other problem is that traditional systems usually handle unstructured data with BLOBs, and anyone who has worked with BLOBs at big data scale would rather get their gums scraped than do it again.
  • Solution - Schema-On-Read: Hadoop systems are schema-on-read: any data can be written to the storage system immediately and is not validated until it is read.  This allows Hadoop systems to load any type of data and begin analyzing it quickly, giving them extremely short data latency compared to traditional systems.  Data latency is the time between data hitting the disk and that data being able to provide business value.  Schema-on-read gives Hadoop a tremendous advantage over traditional systems in the area that matters most: being able to analyze the data faster to make business decisions.
  • Problem - Cost of Storage: Traditional systems use SAN storage.  As organizations start to ingest larger volumes of data, SAN storage is cost prohibitive.
  • Solution - Local Storage: Hadoop uses HDFS, a distributed file system that leverages local disks on commodity servers.  SAN storage runs about $1.20/GB while local storage is about $0.04/GB.  HDFS creates three replicas by default for high availability, so at roughly $0.12 per GB it is still a fraction of the cost of traditional SAN storage.  As organizations store much larger volumes of data, traditional SAN storage is too expensive to be a viable solution.
  • Problem - Cost of Proprietary Hardware: Large proprietary hardware solutions can be cost prohibitive when deployed to process extremely large volumes of data.  Organizations are spending millions of dollars in hardware and software licensing costs while supporting large data environments, often growing their hardware in million dollar increments to handle the increasing data.
  • Solution - Commodity Hardware: People new to Hadoop do not realize that it is possible to build a high performance super computing environment using Hadoop on commodity hardware.  One customer was evaluating a proprietary hardware vendor whose solution came to $1.2 million in hardware costs and $3 million in software licensing.  The Hadoop solution with the same processing power was $400,000 for hardware, with free software and support costs included.  Since data volumes would constantly increase, the proprietary solution would grow in $500k to $1 million increments while the Hadoop solution would grow in $10,000 to $100,000 increments.
  • Problem - Complexity: Any traditional proprietary solution is full of extremely complex silos: system administrators, DBAs, application server teams, storage teams and network teams.  Often there is one DBA for every 40 - 50 database servers.  Anyone running traditional systems knows that complex systems fail in complex ways.
  • Solution - Simplicity: Since Hadoop uses commodity hardware, the hardware stack is one that a single person can understand.  Numerous organizations running Hadoop have one administrator for every 1,000 data nodes.
  • Problem - Causation: Because data is so expensive to store in traditional systems, data is filtered and aggregated, and large volumes are thrown out due to the cost of storage.  Minimizing the data to be analyzed reduces the accuracy and confidence of the results.
  • Solution - Correlation: Due to the relatively low cost of Hadoop storage, the detailed records are kept in Hadoop's storage system, HDFS.  Traditional data can then be analyzed alongside non-traditional data to find correlation points that provide much higher accuracy of analysis.  We are moving to a world of correlation because the accuracy and confidence of the results are factors higher than with traditional systems.  For example, the Centers for Disease Control and Prevention (CDC) used to take 28 - 30 days to identify an outbreak.  The CDC had traditionally obtained data from doctors and hospitals, which was analyzed in large volumes and cross-referenced against other sources to validate it.  By going back a number of years and correlating that data with social media sources such as Twitter and Facebook, they validated the accuracy of the correlation results.  Now, using big data, the CDC can identify an outbreak in 5 - 6 hours.  Organizations are seeing big data as transformational.
  • Problem - Bringing Data to the Programs: In relational databases and data warehouses, data is usually loaded into memory in 8K - 16K data blocks so programs can process it.  When you need to process tens, hundreds or thousands of terabytes, this model completely breaks down or is extremely expensive to implement.
  • Solution - Bringing Programs to the Data: With Hadoop, the programs are moved to the data.  Hadoop data is spread across all the disks of the local servers that make up the cluster, in 128MB (default) blocks.  Individual programs, one per block, run in parallel across the cluster, delivering a very high level of parallelization and IOPS.  This architecture lets Hadoop systems process extremely large volumes of data much faster than traditional systems, at a fraction of the cost.
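The schema-on-read idea above can be sketched in a few lines of plain Java.  This is a hedged illustration, not Hadoop code: the log lines and field names are invented for the example, and a real cluster would keep the raw files in HDFS and apply structure through a framework such as Hive at query time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SchemaOnRead {
    // "Ingest": store each raw line untouched, whatever its shape.
    // The second record carries new fields -- no schema change, no reload.
    static final List<String> rawStore = Arrays.asList(
        "user=alice action=click page=/home",
        "user=bob action=buy sku=42 coupon=SPRING"
    );

    // "Read": structure is applied only at query time.
    static Map<String, String> parse(String line) {
        Map<String, String> record = new HashMap<>();
        for (String kv : line.split(" ")) {
            String[] pair = kv.split("=", 2);
            record.put(pair[0], pair[1]);
        }
        return record;
    }

    public static void main(String[] args) {
        for (String line : rawStore)
            System.out.println(parse(line).get("action")); // click, buy
    }
}
```

The point of the sketch is the ordering: storage accepts anything immediately, and interpretation (and validation) happens when a reader asks a question, which is why new data sources become analyzable in hours instead of months.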
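The storage-cost argument above is simple arithmetic, sketched here using the per-GB figures quoted in the text ($1.20/GB SAN, $0.04/GB local disk, three HDFS replicas).  The figures are illustrative circa-2013 numbers, not vendor quotes.

```java
public class StorageCost {
    static final double SAN_PER_GB = 1.20;   // quoted SAN price per GB
    static final double LOCAL_PER_GB = 0.04; // quoted local-disk price per GB
    static final int HDFS_REPLICAS = 3;      // HDFS default replication

    static double sanCost(double gb) {
        return gb * SAN_PER_GB;
    }

    // HDFS stores every block three times, so multiply by the replica count.
    static double hdfsCost(double gb) {
        return gb * LOCAL_PER_GB * HDFS_REPLICAS;
    }

    public static void main(String[] args) {
        double gb = 100_000; // 100 TB of user data
        System.out.printf("SAN:  $%,.0f%n", sanCost(gb));
        System.out.printf("HDFS: $%,.0f%n", hdfsCost(gb));
    }
}
```

Even after paying the 3x replication penalty, the local-disk approach comes out at a tenth of the SAN cost, which is the whole economic argument for HDFS.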
Successfully leveraging big data is transforming how organizations analyze data and make business decisions.  The "value" of big data results has most companies racing to build Hadoop solutions for data analysis.  The diagram below shows how significant big data is.  Customers often bring in Hortonworks and say, "we need you to make sure we out-Hadoop our competitors."  Hadoop is not just a transformational technology; it's the strategic difference between success and failure.

Examples of New Types of Data

Hadoop is being used by every type of organization: Internet companies, telecommunication firms, banks, credit card companies, gaming companies, online retailers, etc.  Anyone who needs to analyze data is moving to Hadoop.  Here are examples of data being processed by organizations.

Hadoop Distributions - The Hortonworks Data Platform (HDP)

A Hadoop distribution is made up of a number of different open source frameworks.  An organization can build its own distribution from the different versions of each framework, but anyone running a production system needs an enterprise version of a distribution.  Since Hortonworks has key committers and project leaders on the different open source framework projects, we use our expertise to pick the latest version of each framework that works reliably with the others.  Hortonworks then tests the combination and builds an enterprise distribution of Hadoop.  For example, Hadoop 2 went GA the week of October 15th, 2013, but it had been running on clusters with thousands of nodes since January 2013, tested by the large set of Hortonworks partners.

The Hortonworks distribution is called the Hortonworks Data Platform.  The new GA release based on Hadoop 2 is called HDP 2.  Hortonworks runs on a true open source model: every line of code written by Hortonworks for Hadoop is given back to the Apache Software Foundation (ASF), which means every Hortonworks distribution is only a few patch sets off the main Apache baseline.  The result is that HDP 2 is extremely stable from a support perspective and protects an organization from vendor lock-in.  Here is an example of the HDP 2 distribution and the key frameworks associated with it.  Hortonworks builds its reputation on the "enterprise" quality of its distribution, and the industry is recognizing the platform expertise of Hortonworks.

There are a number of different Hadoop distributions, some of which have been around longer than Hortonworks.  In my expert opinion, the reasons to choose Hortonworks are:
  • Platform Expertise - Hadoop is a platform that frameworks run on.  Hortonworks' entire focus is the enterprise platform for Hadoop; it is not trying to be everything for everybody.  Hortonworks has demonstrated this in a number of ways.  Hadoop is open source and developed as a community, and Hortonworks is by far the largest contributor of source lines of code for Hadoop.
  • Defining the Roadmap - More and more large vendors see Hortonworks as defining the roadmap for Hadoop.  Working with the open source community, Hortonworks has been a key leader in the design and architecture of YARN, the foundational processing component of Hadoop 2.  That platform expertise is moving a number of the largest vendors in the world to the Hortonworks Data Platform (HDP).  This is why you see major vendors such as Microsoft and Teradata choosing HDP as their Hadoop distribution of choice.
  • Enterprise Version of Hadoop - Hortonworks is focused on being the definitive enterprise distribution of Hadoop.
  • Open Source - Hortonworks is based on an open source model: every line of code created goes back into the Apache Software Foundation.  Other distributions are proprietary or partly proprietary.  Proprietary solutions create the vendor lock-in that more and more companies are trying to avoid, while contributing all code back to the ASF minimizes support issues.
  • Windows and Linux - HDP is the only Hadoop distribution that runs on Linux and Windows.

The two main frameworks of Hadoop are the Hadoop Distributed File System (HDFS), which provides the storage and I/O, and YARN, which is a distributed parallel processing framework.

YARN (Yet Another Resource Negotiator) is the foundation for parallel processing in Hadoop.  YARN is:
  • Scalable to 10,000+ data node clusters.
  • Supports different types of workloads such as batch, real-time queries (Tez), streaming, graph processing, in-memory processing, messaging systems, streaming video, etc.  You can think of YARN as a highly scalable, parallel operating system that supports all kinds of workloads.
  • Supports batch processing with high throughput on sequential read scans.
  • Supports real-time interactive queries with low latency and random reads.

HDFS uses NameNodes (master servers) and DataNodes (slave servers) to provide the I/O for Hadoop.  The NameNodes manage the metadata and can be federated (multiple NameNodes) for scalability.  Each NameNode can have a standby NameNode for failover (active-passive).  All user data is stored on the DataNodes, distributed across all the disks in 128MB - 1GB blocks, with 3 replicas (default) for high availability.  HDFS provides a solution similar to striping and mirroring, using local disks.
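A rough model of the block and replica math described above, assuming the 128MB default block size and 3x replication.  The real NameNode handles placement, rack awareness and rebalancing, which this sketch ignores; it is just the arithmetic.

```java
public class HdfsBlocks {
    static final long BLOCK = 128L << 20; // 128 MB default block size
    static final int REPLICAS = 3;        // default replication factor

    // How many blocks a file of the given size splits into (ceiling division).
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK - 1) / BLOCK;
    }

    // Raw disk consumed: every block is written REPLICAS times.
    static long rawBytesStored(long fileBytes) {
        return fileBytes * REPLICAS;
    }

    public static void main(String[] args) {
        long oneTb = 1L << 40; // a 1 TB file
        System.out.println(blockCount(oneTb));     // 8192 blocks
        System.out.println(rawBytesStored(oneTb)); // 3 TB of raw disk
    }
}
```

A 1 TB file becomes 8,192 independent blocks spread across the DataNodes, which is also why a single large file can be processed by thousands of parallel tasks.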

Additional Frameworks
Here is a summary of some of the key frameworks that make up HDP 2.
  • Hive - A data warehouse infrastructure that runs on top of Hadoop.  Hive supports SQL queries, star schemas, partitioning, join optimizations, caching of data, etc. - all the standard features you'd expect in a data warehouse.  Hive lets you process Hadoop data using a SQL language.
  • Pig - A scripting language for processing Hadoop data in parallel.
  • MapReduce - Java applications that can process data in parallel.
  • Ambari - An open source management interface for installing, monitoring and managing a Hadoop cluster. Ambari has also been selected as the management interface for OpenStack.
  • HBase - A NoSQL columnar database providing extremely fast scanning of column data for analytics.
  • Sqoop, Flume and WebHDFS - Tools providing large-scale data ingestion for Hadoop using SQL, streaming and REST API interfaces.
  • Oozie - A workflow manager and scheduler.
  • ZooKeeper - A coordination service for distributed applications.
  • Mahout - A machine learning library supporting recommendation, clustering, classification and frequent itemset mining.
  • Hue - A web interface containing a file browser for HDFS, a job browser for YARN, an HBase browser, query editors for Hive, Pig and Sqoop, and a ZooKeeper browser.
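As one concrete example from the list, WebHDFS exposes HDFS over plain HTTP: every operation is a request against /webhdfs/v1/&lt;path&gt;?op=&lt;NAME&gt;.  The sketch below only builds the request URLs rather than calling a live cluster; the host name is a placeholder, and 50070 was the usual NameNode HTTP port in Hadoop 2 era deployments.

```java
public class WebHdfsUrl {
    // Build a WebHDFS request URL for a given path and operation
    // (e.g. LISTSTATUS to list a directory, OPEN to read a file).
    static String url(String host, int port, String path, String op) {
        return String.format("http://%s:%d/webhdfs/v1%s?op=%s", host, port, path, op);
    }

    public static void main(String[] args) {
        // List a directory, then read a file (both plain HTTP GETs):
        System.out.println(url("namenode.example.com", 50070, "/logs", "LISTSTATUS"));
        System.out.println(url("namenode.example.com", 50070, "/logs/web.log", "OPEN"));
    }
}
```

Because the interface is just HTTP, any tool that can issue a GET - curl, a load balancer health check, a legacy ETL job - can ingest into or read from HDFS without Hadoop client libraries.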

Hadoop - A Super Computing Platform

Hadoop is a solution that leverages commodity hardware to build a high performance super computing environment.  A Hadoop cluster contains master nodes and data nodes.  HDFS, the distributed file system, provides high availability and high performance: it is made up of data nodes that break a file into blocks, usually 128MB - 1GB in size, with each block replicated for high availability.  YARN is a distributed processing architecture that can distribute the workload across the data nodes in a Hadoop cluster.  People new to Hadoop do not realize the massive amount of IOPS that commodity x86 servers can generate with local disks.

In the diagram below:
HDFS - distributes data blocks across all the local disks in a cluster, allowing the cluster to leverage the IOPS those disks can generate across all the servers.  When a process needs to run, the programs are distributed in parallel across all the data nodes, creating an extremely high performance parallel environment.  Without going into the details, the main point is that this is a super computing environment that leverages parallel processing and the massive IOPS of local disks running across multiple data nodes as a single distributed file system.  The diagram shows multiple parallel processes running across a large volume of local disks acting as one distributed file system.
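The IOPS point can be made with back-of-envelope arithmetic: aggregate IOPS grows linearly with the number of nodes and disks.  The ~150 IOPS per 7200 RPM SATA disk figure below is an assumed round number for illustration, not a benchmark.

```java
public class ClusterIops {
    // Aggregate random-I/O capacity of a cluster: every disk on every
    // data node contributes independently.
    static long aggregateIops(int nodes, int disksPerNode, int iopsPerDisk) {
        return (long) nodes * disksPerNode * iopsPerDisk;
    }

    public static void main(String[] args) {
        // 100 data nodes x 12 local disks x ~150 IOPS per SATA disk:
        System.out.println(aggregateIops(100, 12, 150)); // 180000
    }
}
```

A hundred commodity nodes with a dozen cheap disks each already deliver on the order of 180,000 IOPS in aggregate - the kind of number that normally requires a very expensive SAN.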

Hadoop scales linearly with commodity hardware.  If a Hadoop cluster cannot handle the workload, an administrator can add data node servers with local disks to increase processing power and IOPS, all at commodity hardware pricing.

Summary - Demystifying Hadoop

Hadoop is not replacing anything.  Hadoop has become another component in an organization's enterprise data platform.  The diagram below shows that Hadoop (the Big Data Refinery) can ingest data from all types of sources, and then interacts and exchanges data flows with traditional systems that provide transactions and interactions (relational databases) and business intelligence and analytics (data warehouses).


  1. Thanks George for the nice thorough writeup about Hadoop for data architects.

  2. Thanks Jimmy, I'm getting a lot of great feedback. I've been asked to provide more detail for the other frameworks, so I'm going to do that when I get a chance.