Hadoop – Big Data Tutorial

In this Hadoop tutorial, I will discuss why big data technologies are needed, the problems they intend to solve, and some of the technologies and frameworks involved.

Table of Contents

How really big is Big Data?
Characteristics Of Big Data Systems
How Google solved the Big Data problem?
Evolution of Hadoop
Apache Hadoop Distribution Bundle
Apache Hadoop Ecosystem

How really big is Big Data?

Let’s start with some quick facts. The amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously. Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated in the form of photo and video uploads, messages, comments, etc. [reference]. Google processes 20 petabytes of information per day.

See the infographic given below. It will help you realize how much data is generated by these and thousands of similar sources.

Big Data Growth

Image: Erik Fitzpatrick licensed CC BY 2.0

Now you know the amount of data that is being generated. Though such a large amount of data is itself a big challenge, a bigger challenge arises from the fact that this data has no fixed format. It includes images, videos, live streaming records, GPS tracking details, sensor records and many more forms. In short, it is unstructured data. Traditional systems are good at working with structured data (in limited amounts as well), but they cannot handle such a large amount of unstructured data.

One may ask why we even need to care about storing and processing this data. For what purpose? The answer is that we need this data to make smarter and more calculated decisions in whatever field we are working in. Business forecasting is not a new thing. It has been done in the past as well, but with very limited data. To stay ahead of the competition, businesses MUST use this data and make more intelligent decisions. These decisions range from guessing the preferences of consumers to preventing fraudulent activities well in advance. Professionals in every field may find their own reasons for analyzing this data.

Characteristics Of Big Data Systems

When deciding whether you need a big data system for your next project, look at the data your application will produce and check it for these characteristics. These characteristics are called the 4 V’s of big data.

  1. Volume

    Volume is certainly a part of what makes Big Data big. The internet-mobile revolution, bringing with it a torrent of social media updates, sensor data from devices and an explosion of e-commerce, means that every industry is swamped with data, which can be incredibly valuable if you know how to use it.

    Is your data expanding, or may it explode exponentially like in the statistics discussed above?

  2. Variety

    Structured data stored in SQL tables is a thing of the past. Today, 90% of generated data is ‘unstructured’, coming in all shapes and forms, from geospatial data, to tweets that can be analyzed for content and sentiment, to visual data such as photos and videos.

    Is your data always structured? Or is it semi-structured or unstructured?

  3. Velocity

    Every minute of every day, users around the globe upload 100 hours of video to YouTube, send over 200 million emails and post 300,000 tweets. And the speed is increasing fast.

    What is the velocity of your data, and what will it be in the future?

  4. Veracity

    This refers to the uncertainty (or variability) of the data available to marketers. It may also apply to the variability of streaming data, which can be inconsistent, making it harder for organizations to react quickly and appropriately.

    Do you always get data in a consistent form?

How Google solved the Big Data problem?

This problem probably itched Google first because of their search engine data, which exploded with the internet revolution (though I don’t have any proof of it). They smartly solved this problem using the concept of parallel processing. They created an algorithm called MapReduce. This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.

Well, this seems logical when you realize that I/O is the most costly operation in data processing. Traditionally, database systems stored data on a single machine, and when you needed data, you sent them commands in the form of an SQL query. These systems fetch data from the store, put it in a local memory area, process it and send it back to you. This is the best you could do with limited data in hand and limited processing power.

But with Big Data, you cannot store all the data on a single machine. You MUST store it on multiple machines (maybe thousands of machines). And when you need to run a query, you cannot aggregate the data into a single place due to the high I/O cost. So what the MapReduce algorithm does is run your query independently on all the nodes where the data is present, and then aggregate the results and return them to you.

It brings two major improvements: very low I/O cost, because data movement is minimal; and less time, because your job runs in parallel on multiple machines against smaller data sets.
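
To make this concrete, below is a sketch of the classic word-count job written against the Hadoop MapReduce Java API (essentially the canonical example from the Hadoop documentation). Each mapper counts words only in the data split stored near it, and the reducers aggregate those partial counts into the final result. The input and output paths are passed as arguments and would normally point to HDFS directories.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on every node that holds a split of the input data
  // and emits (word, 1) for each word found in its local split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all the counts emitted for a given word
  // (from every node) and aggregates them into the final result.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice that the query (the mapper code) travels to the data, not the other way around; only the small intermediate (word, count) pairs move across the network.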

Evolution of Hadoop

It all started in 1997, when Doug Cutting started writing Lucene (a full-text search library) in an effort to index the whole web (like Google did). Later, Lucene was adopted by the Apache community, and Cutting, along with University of Washington graduate student Mike Cafarella, created a Lucene sub-project “Apache Nutch“. Nutch is known as a web crawler now. Nutch crawls websites, and when it fetches a page, Nutch uses Lucene to index the contents of the page (to make it “searchable”).

Initially, they deployed the application on a single machine with 1 GB of RAM and eight hard drives with a total capacity of 8 TB, achieving an indexing rate of around 1,000 pages per second. But as soon as the application data grew, limitations came forward. And it was quite understandable that you cannot store the whole internet’s data on a single machine. So they added 3 more machines (primarily for storing data). But this had its own challenges, because now they needed to move data from one machine to another manually. They wanted to make the application easily scalable, because even 4 machines would fill up soon.

So they started figuring out a system which could be schema-less with no predefined structure, durable, capable of handling component failures (e.g. hard disk failures) and able to automatically rebalance to even out disk space consumption throughout the cluster of machines. Fortunately, in October 2003, Google published their Google File System paper. This paper solved the exact same problem they were facing. Great !!

They implemented the solution in Java, brilliantly, and called it the Nutch Distributed File System (NDFS). Following the GFS paper, Cutting and Cafarella solved the problems of durability and fault-tolerance by splitting each file into 64 MB chunks and storing each chunk on 3 different nodes (replication factor set to 3). In the event of a component failure, the system would automatically notice the defect and re-replicate the chunks that resided on the failed node using data from the other two replicas. A failed node, therefore, did nothing to the overall state of NDFS.

NDFS solved one problem, i.e. storage, but brought another problem: how to process this data? It was of the utmost importance that the new algorithm should have the same scalability characteristics as NDFS. The algorithm had to be able to achieve the highest possible level of parallelism (the ability to run on multiple nodes at the same time). Again, fortune favored the brave. In December 2004, Google published another paper on a similar algorithm, “MapReduce“. Jackpot !!

The three main problems that the MapReduce paper solved were parallelization, distribution and fault-tolerance. These were the exact problems Cutting and Cafarella were facing. One of the key insights of MapReduce was that one should not be forced to move data in order to process it. Instead, a program is sent to where the data resides. That is a key differentiator when compared to traditional data warehouse systems and relational databases. In July 2005, Cutting reported that MapReduce had been integrated into Nutch as its underlying compute engine.

In February 2006, Cutting pulled NDFS and MapReduce out of the Nutch code base and created Hadoop. It consisted of Hadoop Common (core libraries), HDFS and MapReduce. That’s how Hadoop came into existence.

Plenty of things have happened since then. Yahoo contributed Pig, their higher-level programming language on top of MapReduce, and Facebook contributed Hive, the first incarnation of SQL on top of MapReduce.

Apache Hadoop Distribution Bundle

The open source Hadoop is maintained by the Apache Software Foundation, and the project website is http://hadoop.apache.org/. The current Apache Hadoop project (version 2.7) includes the following modules:

  • Hadoop Common: The common utilities that support other Hadoop modules
  • Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data (a short client-side sketch follows this list)
  • Hadoop YARN: A framework for job scheduling and cluster resource management
  • Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
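
As referenced in the HDFS item above, here is a minimal sketch of what reading and writing application data on HDFS looks like from a Java client. The NameNode address and file path below are assumptions for illustration; in practice they come from your cluster configuration (core-site.xml on the classpath).

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address for a local single-node setup; adjust to your
    // cluster, or omit this line and let core-site.xml provide it.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/hello.txt");   // hypothetical example path

    // Write a small file; HDFS transparently splits files into blocks
    // and replicates each block across DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back and print to the console.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}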

Apache Hadoop can be deployed in the following three modes:

  1. Standalone: It is used for simple analysis or debugging in a single-machine environment.
  2. Pseudo distributed: It helps you to simulate a multi-node installation on a single node. In pseudo-distributed mode, each of the component processes runs in a separate JVM. (A minimal configuration sketch follows this list.)
  3. Distributed: A cluster with multiple nodes, in the tens, hundreds or thousands.
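
As mentioned in the pseudo-distributed item above, a single-node pseudo-distributed setup typically needs only a couple of properties in the Hadoop configuration files. The snippet below is a sketch based on the common defaults for Hadoop 2.x; verify the file locations and values against the single-node setup guide of your Hadoop version.

etc/hadoop/core-site.xml:

<configuration>
  <property>
    <!-- All filesystem requests go to a local single-node HDFS -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <!-- Only one node, so keep a single replica of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>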

Apache Hadoop Ecosystem

Apart from the core components given above and distributed with Hadoop, there are plenty of components which complement the base Hadoop framework and give companies the specific tools they need to get the desired results from Hadoop.

Hadoop Ecosystem
  1. Data Storage Layer: This is where the data is stored in a distributed file system; it consists of HDFS and HBase (column-oriented storage). HBase is a scalable, distributed database that supports structured data storage for large tables.
  2. Data Processing Layer: This is where scheduling, resource management and cluster management take place. YARN (job scheduling and cluster resource management) together with MapReduce are located in this layer.
  3. Data Access Layer: This is the layer through which requests from the Management Layer are sent to the Data Processing Layer. It includes Hive, a data warehouse infrastructure that provides data summarization and ad-hoc querying; Pig, a high-level data-flow language and execution framework for parallel computation; Mahout, a scalable machine learning and data mining library; and Avro, a data serialization system.
  4. Management Layer: This is the layer that meets the user. Users access the system through this layer, which has components like Chukwa, a data collection system for managing large distributed systems, and ZooKeeper, a high-performance coordination service for distributed applications.

In the next set of posts, I will go into the details of the programming concepts involved in a Hadoop cluster.

Happy Learning !!

References

https://en.wikipedia.org/wiki/Doug_Cutting
http://hadoop.apache.org/
http://lucene.apache.org/
http://nutch.apache.org/
http://hive.apache.org/
http://pig.apache.org/
http://research.google.com/archive/gfs.html
http://research.google.com/archive/mapreduce.html
https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704 [Good Read]
https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan [Must Read]


7 thoughts on “Hadoop – Big Data Tutorial”

  1. Great and informative article on Hadoop. It’s important for Java developers to know about Apache Hadoop. You have shared very insightful information here; it will be a guide to Hadoop for us. Thanks for sharing.

  2. Hi Lokesh,

    It is a very nice introduction to Hadoop.

    I have one query regarding Hive,
    When HDFS itself is the storage, why do we need to have a Hive database? e.g. creating databases, altering tables, etc.
    When we run HiveQL, does it process the data in HDFS or in a Hive database?

    Please clarify.

    • Hive is not a database. It’s only a query language. [It’s listed in the “Data Access Layer” of the Hadoop ecosystem]

      Just to make a comparison, think of HDFS as an Oracle database and Hive as MySQL. You use MySQL to query data stored in Oracle. You use Hive (using HQL) to query data stored on HDFS. It is a very rough comparison, but it makes sense for now.

      You can also think of it as a complement to Map/Reduce programming. Hive supports injecting Map/Reduce programs into queries where you cannot express the whole logic in an HQL query.

      Basically, under the hood, all queries submitted via the Hive environment are translated into MapReduce jobs.

  3. One of the best and most to-the-point introductions I’ve found on Big Data and Hadoop. Waiting for the details of the programming concepts.

