Introduction to Hadoop

In this Hadoop tutorial, we will discuss the need for big data technologies, the problems they intend to solve and some information about involved technologies and frameworks.

1. How big really is Big Data?

Let’s start with some quick facts. The amount of data we produced from the beginning of time till 2003 was 5 billion gigabytes. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously. The statistic shows that 500+ terabytes of new data get ingested into the databases of the social media site Facebook, every day. This data is mainly generated through photo and video uploads, messages, comments etc. Google processes 20 petabytes of information per day. [Stats will 2012]

See given below info-graphic. It will help you in realizing how much data is generated by these sources and similar to thousand sources.

Big Data Growth
Big Data Growth

Image: Erik Fitzpatrick licensed CC BY 2.0

Now you know the amount of data that is being generated. Though such a large amount of data is itself a big challenge, a bigger challenge arises because this data is of no fixed format. It has images, videos, line streaming records, GPS tracking details, sensor records and many more forms. In short, it’s unstructured data. Traditional systems are good at working with structured data (limited as well), but they can’t handle such a large amount of unstructured data.

One may ask this question why even need to care about storing this data and processing it? For what purpose? The answer is that we need this data to make smarter and more calculative decisions in whatever field we are working on. Business forecasting is not a new thing. It has been done in the past, but with very limited data. To be ahead of the competition, businesses MUST use this data and then make more intelligent decisions. These decisions range from guessing consumers’ preferences to preventing fraud activities well in advance. Professionals in every field may find their reasons for analyzing this data.

2. Characteristics Of Big Data Systems

When you want to decide that you need to use any big data system for your next project, look into the data that your application will produce and try to look for these characteristics. These characteristics are called the 4 V’s of big data.

2.1. Volume

Volume is certainly a part of what makes Big Data big. The internet-mobile revolution, bringing with it a torrent of social media updates, sensor data from devices and an explosion of e-commerce, means that every industry is swamped with data- which can be incredibly valuable if you know how to use it.

Does your data is expanding/ or may explode exponentially like the above-discussed statistics?

2.2. Variety

Structured data stored in SQL tables are a thing of the past. Today, 90% of data generated is ‘unstructured’, coming in all shapes and forms- from Geo-spatial data to tweets that can be analyzed for content and sentiment, to visual data such as photos and videos.

Is your data always structured always? Or is it semi-structured or unstructured?

2.3. Velocity

Every minute of every day, users around the globe upload 100 hours of video on Youtube, send over 200 million emails and send 300,000 tweets. And the speed is increasing fast.

What is the velocity of your data, or what will it be in the future?

2.4. Veracity

This refers to the uncertainty of the data (or variability) available to marketers. This may also be applied to the variability of data streaming that can be inconsistent, making it harder for organizations to react quickly and more appropriately.

Do you always get data in a consistent form?

3. How Google solved the Big Data problem?

Probably this problem itched google first due to their search engine data, which exploded with the revolution of the internet industry (though don’t have any proof of it). They smartly solved this problem using the concept of parallel processing. They created an algorithm called MapReduce. This algorithm divides the task into small parts and assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.

Well, this seems logical when you realize that I/O is the most costly operation in data processing. Traditionally, database systems store data in a single machine, and when you need data, you send them some commands in form of SQL query. These systems fetch data from the store, put it in the local memory area, process it and send it back to you. This is the best thing that you could do with limited data in hand and limited processing power.

But when you get Big Data, you cannot store all data in a single machine. You MUST store it in multiple machines (maybe thousands of machines). And when you need to run a query, you cannot aggregate data into a single place due to high I/O cost. So what MapReduce algorithm does; it runs your query into all nodes independently where data is present, and then aggregates the result and returns to you.

It brings two major improvements i.e. very low I/O cost because data movement is minimal, and second less time because your job parallelly ran into multiple machines into smaller data sets.

4. Evolution of Hadoop

It all started in 1997 when Doug Cutting started writing Lucene (a full-text search library) in an effort to index the whole web (like google did). Later Lucene was adopted by the Apache community, and Cutting and University of Washington graduate student Mike Cafarella created a Lucene sub-project “Apache Nutch”. Nutch is known as a web crawler now. Nutch crawls websites and when it fetches a page, Nutch uses Lucene to index the contents of the page (to make it “searchable”).

Initially, they deployed the application in a single machine with 1GB of RAM and 8 hard drives with a total capacity of 8 TB with an indexing rate of around 1000 pages per second. But as soon as application data grew, limitations came forward. And it was quite understandable that you can not store whole internet data in a single machine. So they added 3 more machines (primarily for storing data). But it had its own challenge because now they need to move data from one machine to other manually. They wanted to make the application easily scalable because even 4 machines will fill soon.

So they started figuring out a system that could be schema-less with no predefined structure, durable, capable of handling component failure e.g. hard disk failures and automatically rebalanced to even out disk space consumption throughout a cluster of machines. Fortunately, in October 2003, Google published its Google File System paper. This paper was to solve the exact same problem they were facing. Great !!

They implemented the solution in java, brilliantly, and called it Nutch Distributed File System (NDFS). Following the GFS paper, Cutting and Cafarella solved the problems of durability and fault-tolerance by splitting each file into 64MB chunks and storing each chunk on 3 different nodes (replication factor set to 3). In the event of component failure, the system would automatically notice the defect and re-replicate the chunks that resided on the failed node by using data from the other two other replicas. The failed node, therefore, did nothing to the overall state of NDFS.

NDFS solved their one problem i.e. storage, but brought another problem “how to process this data”? It was of the utmost importance that the new algorithm should have the same scalability characteristics as NDFS. The algorithm had to be able to achieve the highest possible level of parallelism (ability to run on multiple nodes at the same time). Again fortune favored the braves. In December 2004, Google published another paper on a similar algorithm “MapReduce“. Jackpot !!

The three main problems that the MapReduce paper solved were Parallelization, Distribution and Fault-tolerance. These were the exact problems Cutting, and Cafarella were facing. One of the key insights of MapReduce was that one should not be forced to move data to process it. Instead, a program is sent to where the data resides. That is a key differentiator when compared to traditional data warehouse systems and relational databases. In July 2005, Cutting reported that MapReduce is integrated into Nutch, as its underlying compute engine.

In February 2006, Cutting pulled NDFS and MapReduce from the Nutch code base and created Hadoop. It consisted of Hadoop Common (core libraries), HDFS and MapReduce. That’s how Hadoop came into existence.

There are plenty of things that happened since then that led to Yahoo contributing their higher level programming language on top of MapReduce “Pig” and Facebook contributing “Hive“, first incarnation of SQL on top of MapReduce.

5. Apache Hadoop Distribution Bundle

The open-source Hadoop is maintained by the Apache Software Foundation, and the website location is The current Apache Hadoop project (version 2.7) includes the following modules:

  • Hadoop common: The common utilities that support other Hadoop modules
  • Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data
  • Hadoop YARN: A framework for job scheduling and cluster resource management
  • Hadoop MapReduce: A YARN-based system for parallel processing of large datasets

Apache Hadoop can be deployed in the following three modes:

  1. Standalone: It is used for simple analysis or debugging in single-machine environment.
  2. Pseudo-distributed: It helps you to simulate a multi-node installation on a single node. In pseudo-distributed mode, each component process runs in a separate JVM.
  3. Distributed: Cluster with multiple nodes in tens or hundreds or thousands.

6. Apache Hadoop Ecosystem

Apart from the above given core components distributed with Hadoop, there are plenty of components that complement the base Hadoop framework and give companies the specific tools they need to get the desired Hadoop results.

Hadoop Ecosystem
Hadoop Ecosystem
  • DataStorage Layer: This is where the data is stored in a distributed file system, consisting of HDFS and HBase ColumnDB Storage. HBase is a scalable, distributed database that supports structured data storage for large tables.
  • Data Processing Layer: Here the scheduling, resource management and cluster management are to be calculated. YARN job scheduling and cluster resource management with Map Reduce are located in this layer.
  • Data Access Layer: This is the layer where the request from the Management layer is sent to Data Processing Layer. Hive, A data warehouse infrastructure that provides data summarization and ad-hoc querying; Pig, A high-level data-flow language and execution framework for parallel computation; Mahout, A Scalable machine learning and data mining library; Avro, data serialization system.
  • Management Layer: This is the layer that meets the user. Users access the system through this layer with components like Chukwa, A data collection system for managing large distributed systems and ZooKeeper, a high-performance coordination service for distributed applications.

In the next set of posts, I will be going into detail about programming concepts involved in Hadoop cluster.

Happy Learning !!


Notify of

Most Voted
Newest Oldest
Inline Feedbacks
View all comments

About Us

HowToDoInJava provides tutorials and how-to guides on Java and related technologies.

It also shares the best practices, algorithms & solutions and frequently asked interview questions.