HDFS (Hadoop Distributed File System) is where big data is stored. The primary objective of HDFS is to store data reliably even in the presence of failures, including NameNode failures, DataNode failures and/or network partitions (the ‘P’ in the CAP theorem). This tutorial looks at the different components involved in implementing HDFS in a distributed, clustered environment.
Table of Contents
- HDFS Architecture
- NameNode
- DataNodes
- CheckpointNode
- BackupNode
- File System Snapshots
- File Operations
- Read/Write operation
- Block Placement
- Replication management
- Balancer
- Block Scanner
HDFS is the file system component of Hadoop. You can think of it as a conventional file system (e.g. FAT or NTFS), but one designed to work with very large datasets and files. The default block size is 64 MB (128 MB in HDFS 2), which is why HDFS performs best when you store large files in it. Many small files waste memory, because the NameNode keeps metadata for every file and block in RAM no matter how little data the file holds.
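To make the block-size argument concrete, here is a small arithmetic sketch. It assumes the 128 MB default mentioned above; `block_count` is a hypothetical helper written for this illustration, not an HDFS API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS 2 default mentioned above


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file of the given size."""
    # Integer ceiling division; an empty file needs no blocks.
    return (file_size_bytes + block_size - 1) // block_size


one_gb = 1024 ** 3
one_mb = 1024 ** 2

# The same 1 GB of data, stored two different ways:
print(block_count(one_gb))         # a single 1 GB file
print(1024 * block_count(one_mb))  # 1024 separate files of 1 MB each
```

The single large file needs 8 blocks, while the 1024 small files need 1024 blocks (plus 1024 file entries), so the NameNode tracks far more metadata for the same amount of data.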
In HDFS, information is stored in two forms: the actual data and its metadata (file size, block locations, timestamps, etc.). Metadata is stored on the NameNode and application data is stored on the DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols. By distributing storage and computation across many DataNodes, the cluster can grow horizontally while keeping costs in check, because these nodes are simple commodity hardware.
The NameNode stores metadata. When you store a large file in HDFS, it is split into blocks (typically 128 MB, but the user can override this). Each block is then stored independently on multiple DataNodes (three by default). The NameNode stores the mapping of file blocks to DataNodes (the physical location of file data). File attributes such as permissions, modification and access timestamps, and namespace and disk space quotas are kept in a data structure called an inode.
Any client that wants to read a file first contacts the NameNode for the locations of the data blocks and then reads block contents from the DataNode closest to the client. Similarly, when writing data, the client asks the NameNode to nominate a set of three DataNodes to host the block replicas. The client then writes data to the DataNodes in a pipeline fashion (described later in this tutorial).
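The division of labour described above can be pictured with a toy model. This is only a sketch: `INode`, `ToyNameNode` and their methods are hypothetical names invented for illustration and do not mirror the real NameNode classes.

```python
from dataclasses import dataclass, field


@dataclass
class INode:
    """File attributes kept by the NameNode (a simplified, assumed layout)."""
    path: str
    permissions: str
    mtime: float
    block_ids: list = field(default_factory=list)


class ToyNameNode:
    def __init__(self):
        self.inodes = {}           # path -> INode (the namespace)
        self.block_locations = {}  # block id -> DataNodes holding a replica

    def add_file(self, inode, locations):
        self.inodes[inode.path] = inode
        for block_id in inode.block_ids:
            self.block_locations[block_id] = list(locations[block_id])

    def get_block_locations(self, path):
        """What a reading client asks for: each block and where it lives."""
        inode = self.inodes[path]
        return [(b, self.block_locations[b]) for b in inode.block_ids]


namenode = ToyNameNode()
namenode.add_file(
    INode("/logs/a.log", "rw-r--r--", 0.0, ["blk_1", "blk_2"]),
    {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]},
)
print(namenode.get_block_locations("/logs/a.log"))
```

Note that the toy NameNode never holds file contents; it only answers "which DataNodes have this block?", which is the point of the design.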
A cluster can have only one NameNode at a time. HDFS keeps the entire namespace in RAM for faster access by client programs, and periodically persists this data to the local file system in the form of a checkpoint and a journal.
The in-memory inode data and the list of blocks belonging to each file make up the metadata of the namespace, which is called the image. The persistent record of the image (stored in the native file system) is called a checkpoint. The modification log of the image, also stored in the local host’s native file system, is called the journal. During restarts, the NameNode restores the namespace by reading the checkpoint and replaying the journal.
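The restart procedure (read the checkpoint, then replay the journal) can be sketched as follows. The journal record format `(operation, path, argument)` and the three operations are assumptions made for this example; the real journal format is different.

```python
def restore_namespace(checkpoint, journal):
    """Rebuild the in-memory image from a checkpoint plus a journal."""
    image = dict(checkpoint)       # start from the persisted checkpoint
    for op, path, arg in journal:  # replay logged changes in order
        if op == "create":
            image[path] = arg             # arg: metadata of the new file
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[arg] = image.pop(path)  # arg: the new path
        else:
            raise ValueError(f"unknown journal operation: {op}")
    return image


checkpoint = {"/a.txt": {"blocks": ["blk_1"]}}
journal = [
    ("create", "/b.txt", {"blocks": ["blk_2"]}),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt", None),
]
print(restore_namespace(checkpoint, journal))
```

The longer the journal, the longer this replay takes, which is one reason merging the journal into a fresh checkpoint from time to time is useful.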
DataNodes store the actual application data. Each block replica on a DataNode is represented by two separate files: one contains the data itself and the other contains the block’s metadata, including checksums for the block data and the block’s generation stamp.
During startup, each DataNode connects to the NameNode and performs a handshake to verify the namespace ID and the software version of the DataNode. If either does not match the NameNode’s records, the DataNode automatically shuts down. The namespace ID is a single unique ID for the whole cluster, and it is stored on every node when the node is formatted for inclusion in the cluster. The software version is the version of HDFS, and it is verified to prevent data loss caused by incompatible changes between versions.
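The handshake check boils down to two comparisons. The sketch below assumes each node's identity is a plain dictionary with `namespace_id` and `software_version` keys; these names and the sample values are illustrative only.

```python
def handshake_ok(namenode, datanode):
    """True only if both the namespace ID and the software version match."""
    return (
        datanode["namespace_id"] == namenode["namespace_id"]
        and datanode["software_version"] == namenode["software_version"]
    )


def start_datanode(namenode, datanode):
    # A mismatch on either field makes the DataNode shut itself down.
    return "registered" if handshake_ok(namenode, datanode) else "shutdown"


namenode = {"namespace_id": 42, "software_version": "2.7"}
print(start_datanode(namenode, {"namespace_id": 42, "software_version": "2.7"}))
print(start_datanode(namenode, {"namespace_id": 7, "software_version": "2.7"}))
```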
After the handshake, the DataNode registers with the NameNode and sends its block report. A block report contains the block ID, the generation stamp and the length of each block replica that the DataNode hosts. The first block report is sent immediately after the DataNode registers. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located in the cluster.
To signal that they are alive, all DataNodes send heartbeats to the NameNode. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode for ten minutes, it considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules the creation of new replicas of those blocks on other DataNodes.
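The liveness rule can be sketched with a small bookkeeping class. `HeartbeatMonitor` is a hypothetical name; the two constants come from the defaults quoted above, and timestamps are passed in explicitly (in seconds) so the example stays deterministic.

```python
HEARTBEAT_INTERVAL_S = 3  # default interval between heartbeats
DEAD_AFTER_S = 10 * 60    # silence longer than this marks a node as dead


class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # DataNode name -> time of its last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """DataNodes that have been silent for ten minutes or more."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen >= DEAD_AFTER_S
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn2", now=300)  # dn2 keeps reporting, dn1 goes silent
print(monitor.dead_nodes(now=600))
```

With the defaults, a node has to miss roughly 200 consecutive heartbeats before it is declared dead, so a brief network hiccup does not trigger re-replication.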
Heartbeats from a DataNode also carry information about total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode’s space allocation and load balancing decisions.
The NameNode never calls DataNodes directly. It uses replies to heartbeats to send instructions to the DataNodes, for example to replicate blocks to other nodes, remove local block replicas, re-register, shut down the node, or send an immediate block report. These commands are important for maintaining overall system integrity, so it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.
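One way to picture "instructions travel in heartbeat replies" is a per-DataNode queue that is drained whenever that node checks in. This is a sketch under that assumption; `CommandQueue` and the command strings are made up for illustration.

```python
from collections import defaultdict, deque


class CommandQueue:
    """Commands wait here until the target DataNode sends its next heartbeat."""

    def __init__(self):
        self.pending = defaultdict(deque)  # DataNode name -> queued commands

    def schedule(self, datanode, command):
        self.pending[datanode].append(command)

    def on_heartbeat(self, datanode):
        """Return (and clear) everything queued for this DataNode."""
        return list(self.pending.pop(datanode, ()))


queue = CommandQueue()
queue.schedule("dn1", "replicate blk_7 to dn3")
queue.schedule("dn1", "remove replica blk_9")
print(queue.on_heartbeat("dn1"))  # both commands ride on one reply
print(queue.on_heartbeat("dn1"))  # nothing left for the next heartbeat
```

Because a command waits at most one heartbeat interval, the frequency of heartbeats directly bounds how quickly the NameNode's decisions reach the DataNodes.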
A CheckpointNode is usually another NameNode in the cluster, hosted on a different machine and not serving clients directly. The CheckpointNode downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint to the NameNode. At this point, the NameNode stores the new checkpoint and empties its existing journal.
Creating periodic checkpoints is one way to protect the file system metadata. The system can start from the most recent checkpoint if all other persistent copies of the namespace image or journal on the NameNode are unavailable.
Like the CheckpointNode, the BackupNode can create periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. The NameNode informs the BackupNode about all transactions in the form of a stream of changes. Because both the NameNode and the BackupNode keep all of this information in memory, their memory requirements are similar.
If the NameNode fails, the BackupNode’s in-memory image and its checkpoint on disk are a record of the latest namespace state. The BackupNode can be viewed as a read-only NameNode. It contains all file system metadata except block locations, and it can perform all operations of the regular NameNode that do not involve modifying the namespace or knowing block locations. This also means that if the NameNode fails, any recent updates that had not yet been persisted will not be persisted by the BackupNode either.
Please note that a BackupNode is not a standby NameNode (an alternate NameNode that takes over if the active NameNode fails); it does not serve client requests to read or write files. The standby NameNode is part of the high availability feature introduced in HDFS 2.x, which we will cover in a separate tutorial.
File System Snapshots
A snapshot is an image of the whole cluster, saved to prevent data loss during system upgrades. The actual data files are not part of this image, because copying them would double the storage space used by the whole cluster.
When an administrator requests a snapshot, the NameNode picks the existing checkpoint file, merges all journal logs into it, and stores the result in the persistent file system. Similarly, each DataNode copies its directory information and creates hard links to its data blocks in the copy. This information is used in the event of a cluster failure.
All of the components above make up the HDFS file system and work continuously to keep the cluster healthy. Now let’s look at how file read and write operations happen in an HDFS cluster.
If you want to write data to HDFS, simply create a file and write data to it; HDFS manages all of the complexity described above for you. To read data, give HDFS a file name and start reading from it.
One important thing to know is that you cannot modify data once it is written to HDFS. You can only append new data to the file, or delete the file and write a new one in its place.
A client application that wants to write data first needs to obtain a “write lease” (similar to a write lock). Once a client holds the write lease, no other client can write to the file until the first client is done or the lease expires. The writer periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked and other clients can request a write lease on the file.
The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client has neither closed the file nor renewed the lease, another client can preempt the lease. If the hard limit (one hour) expires and the client has still not renewed the lease, HDFS assumes that the client has quit, automatically closes the file on behalf of the writer, and recovers the lease.
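The two limits split a lease's lifetime into three phases, which the following sketch classifies. The one-hour hard limit comes from the text above; the 60-second soft limit is an assumption chosen for the example, and `lease_state` is a hypothetical helper.

```python
SOFT_LIMIT_S = 60       # assumed value for illustration; not stated above
HARD_LIMIT_S = 60 * 60  # one hour, as stated above


def lease_state(last_renewal, now):
    """Classify a write lease by the time elapsed since its last renewal."""
    age = now - last_renewal
    if age < SOFT_LIMIT_S:
        return "exclusive"    # writer is guaranteed sole access
    if age < HARD_LIMIT_S:
        return "preemptible"  # another client may take over the lease
    return "expired"          # HDFS closes the file and recovers the lease


for age in (10, 120, 4000):
    print(age, lease_state(last_renewal=0, now=age))
```

A healthy writer never leaves the first phase, because each renewal heartbeat resets `last_renewal`.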
Data is written in a pipeline, as shown in the diagram below.
The diagram above depicts a pipeline of three DataNodes (DN) and a block of five packets. The client writes to the first DataNode only; that DataNode passes the data to the second DataNode, and so on. In this way, all nodes in the pipeline participate in writing the data.
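The forwarding behaviour of the pipeline can be imitated in a few lines. This is a deliberately simplified sketch: `ToyDataNode` is a hypothetical class, packets are plain strings, and acknowledgements and failure handling are left out.

```python
class ToyDataNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next DataNode in the pipeline, if any
        self.packets = []

    def receive(self, packet):
        self.packets.append(packet)          # store the packet locally
        if self.downstream is not None:
            self.downstream.receive(packet)  # then pass it along


# Build the pipeline back to front: client -> dn1 -> dn2 -> dn3
dn3 = ToyDataNode("dn3")
dn2 = ToyDataNode("dn2", downstream=dn3)
dn1 = ToyDataNode("dn1", downstream=dn2)

for seq in range(5):  # a block of five packets
    dn1.receive(f"packet-{seq}")  # the client only ever talks to dn1

print(len(dn1.packets), len(dn2.packets), len(dn3.packets))
```

The client's outbound bandwidth is spent once per packet rather than once per replica, which is the main appeal of pipelining over writing to all three DataNodes directly.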
When a client opens a file to read, it fetches the list of blocks and the locations of each block replica from the NameNode. The locations of each block are ordered by their distance from the reader. When reading the content of a block, the client tries the closest replica first. If the read attempt fails, the client tries the next replica in sequence.
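The "closest replica first, then fall back" read strategy can be sketched like this. `read_block` and the `fetch` callback are hypothetical; the replica list is assumed to arrive already sorted by distance, as described above.

```python
def read_block(replicas, fetch):
    """Try each replica in order of distance until one read succeeds.

    replicas: DataNode names, closest first.
    fetch: callable taking a DataNode name and returning the block data,
           raising IOError if that replica cannot be read.
    """
    errors = []
    for node in replicas:
        try:
            return fetch(node)
        except IOError as exc:
            errors.append((node, str(exc)))  # remember why, try the next one
    raise IOError(f"all replicas failed: {errors}")


def fake_fetch(node):
    # Stand-in for a network read; pretend the closest DataNode is down.
    if node == "dn1":
        raise IOError("connection refused")
    return f"block data from {node}"


print(read_block(["dn1", "dn2", "dn3"], fake_fetch))
```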
When choosing the nodes for writing data, the NameNode follows a replica placement policy. The default HDFS replica placement policy is as follows:
- No DataNode contains more than one replica of any block.
- No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster.
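The two rules above can be expressed as a validity check on a proposed set of DataNodes. This is a sketch only: `satisfies_default_policy` is a hypothetical function, and "sufficient racks" is interpreted here as "enough racks to hold all replicas at two per rack", which is an assumption.

```python
from collections import Counter


def satisfies_default_policy(replica_nodes, rack_of, total_racks):
    """Check a proposed placement against the two rules listed above."""
    # Rule 1: no DataNode holds more than one replica of the block.
    if len(set(replica_nodes)) != len(replica_nodes):
        return False
    # Rule 2 only applies when the cluster has enough racks (assumed
    # meaning: two replicas per rack can accommodate every replica).
    if 2 * total_racks < len(replica_nodes):
        return True
    per_rack = Counter(rack_of[node] for node in replica_nodes)
    return all(count <= 2 for count in per_rack.values())


rack_of = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack1"}
print(satisfies_default_policy(["dn1", "dn2", "dn3"], rack_of, total_racks=2))
print(satisfies_default_policy(["dn1", "dn2", "dn4"], rack_of, total_racks=2))
```

The first placement spreads three replicas over two racks and passes; the second puts all three on `rack1` and fails.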
One primary responsibility of the NameNode is to ensure that every data block has the proper number of replicas. While processing block reports from DataNodes, the NameNode detects when a block has become under- or over-replicated. When a block becomes over-replicated, the NameNode chooses a replica to remove.
When a block becomes under-replicated, it is put in the replication priority queue to create more replicas of that data block. A block with only one replica has the highest priority. A background thread periodically scans the head of the replication queue to decide where to place new replicas based on block placement policy as stated above.
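A priority queue in which fewer live replicas means higher priority can be sketched with Python's `heapq`. `ReplicationQueue` is a hypothetical class; the real NameNode uses more priority levels than this example shows.

```python
import heapq


class ReplicationQueue:
    """Under-replicated blocks, ordered so the most at-risk come out first."""

    def __init__(self):
        self._heap = []

    def add(self, block_id, live_replicas, target=3):
        # Only blocks below their target replication need work; a block
        # with a single live replica sorts to the front of the heap.
        if 0 < live_replicas < target:
            heapq.heappush(self._heap, (live_replicas, block_id))

    def next_block(self):
        """Block the background thread should replicate next, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]


queue = ReplicationQueue()
queue.add("blk_1", live_replicas=2)
queue.add("blk_2", live_replicas=1)
queue.add("blk_3", live_replicas=3)  # fully replicated, ignored
print(queue.next_block())
```

Here `blk_2` is returned first because it has only one surviving replica and would be lost if that DataNode failed.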
The HDFS block placement strategy does not take DataNode disk space utilization into account, so data can become unevenly distributed over time. Imbalance also occurs when new nodes are added to the cluster. The balancer is a tool that balances disk space usage on an HDFS cluster. It is deployed as an application program that can be run by the cluster administrator, and it iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
When choosing a replica to move and deciding its destination, the balancer guarantees that the decision does not reduce either the number of replicas or the number of racks.
The balancer optimizes the balancing process by minimizing the inter-rack data copying. If the balancer decides that a replica A needs to be moved to a different rack and the destination rack happens to have a replica B of the same block, the data will be copied from replica B instead of replica A.
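The iterative nature of the balancer can be illustrated with a heavily simplified loop. This sketch assumes all DataNodes have equal capacity and measures imbalance in whole blocks; it ignores the replica-count and rack constraints described above, and `balance` is a hypothetical function, not the real tool's algorithm.

```python
def balance(blocks_per_node, max_gap=1):
    """Move one block at a time from the fullest node to the emptiest node."""
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1 so the loop can settle")
    counts = dict(blocks_per_node)  # work on a copy
    moves = []
    while True:
        source = max(counts, key=counts.get)  # most utilized DataNode
        target = min(counts, key=counts.get)  # least utilized DataNode
        if counts[source] - counts[target] <= max_gap:
            break                             # balanced enough, stop
        counts[source] -= 1
        counts[target] += 1
        moves.append((source, target))
    return counts, moves


final, moves = balance({"dn1": 9, "dn2": 3, "dn3": 0})
print(final)
print(len(moves), "moves")
```

In this example, `dn3` plays the role of a newly added, empty node, and the loop ends with four blocks on each DataNode after five moves.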
Each DataNode runs a block scanner that periodically scans its block replicas and verifies that the stored checksums match the block data. In addition, if a client reads a complete block and checksum verification succeeds, it informs the DataNode, which treats this as a verification of the replica.
Whenever a reading client or a block scanner detects a corrupt block, it notifies the NameNode. The NameNode marks the replica as corrupt, but does not schedule deletion of the replica immediately. Instead, it starts to replicate a good copy of the block. The corrupt replica is scheduled for removal only when the count of good replicas reaches the replication factor of the block.
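Checksum verification and the "replicate first, delete later" rule can be sketched together. The example assumes a single CRC32 checksum per block, computed with Python's `zlib`; the function names are hypothetical and the real on-disk checksum layout is more fine-grained.

```python
import zlib


def make_replica(data):
    """Store block data together with the checksum computed at write time."""
    return {"data": data, "checksum": zlib.crc32(data)}


def scan_replicas(replicas):
    """Return ids of replicas whose data no longer matches its checksum."""
    return [
        block_id for block_id, replica in replicas.items()
        if zlib.crc32(replica["data"]) != replica["checksum"]
    ]


def can_remove_corrupt(good_replicas, replication_factor):
    # The corrupt copy is kept until enough good copies exist.
    return good_replicas >= replication_factor


replicas = {"blk_1": make_replica(b"hello"), "blk_2": make_replica(b"world")}
replicas["blk_2"]["data"] = b"w0rld"  # simulate on-disk corruption
print(scan_replicas(replicas))
print(can_remove_corrupt(good_replicas=2, replication_factor=3))
```

Keeping the corrupt replica around until the good copies are back at full strength preserves the chance of recovering data from it if every other replica is lost in the meantime.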
That’s all for this pretty complex introduction to HDFS. Let me know if you have any questions.
Happy Learning !!