Hadoop Architecture and Basic Terminology Explained
A node is just a computer system, typically non-enterprise commodity hardware capable of storing data. Please note that Hadoop runs on a combination of nodes, with each node performing several functions, which I will go through in later tutorials. Diagrammatically, a node is just another computer system, as shown below:
We can add several nodes one after another, and this collection of nodes is termed a Rack. A rack is typically a collection of 30 to 40 nodes physically placed close to each other and connected to the same network switch.
Important note: The network bandwidth between any two nodes on the same rack is greater than the network bandwidth between two nodes on different racks.
A Hadoop Cluster (or simply a Cluster) is a collection of racks.
Now that we are familiar with the common terminology involved, let's get on to the architecture of Hadoop.
Hadoop has two major components:
-> Distributed File System Component (also called the Hadoop Distributed File System, or HDFS)
-> MapReduce Component (a framework for performing computations on data stored in the distributed file system)
I am going to explain the MapReduce component briefly in this tutorial.
MapReduce is a programming model that originated at Google. A MapReduce program consists of a map function and a reduce function. A scheduled MapReduce program is called a MapReduce job.
A MapReduce job is broken into map tasks that run in parallel and reduce tasks that also run in parallel. This was a brief explanation of MapReduce; a detailed explanation will be covered in later tutorials.
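To make the model concrete, here is a minimal single-machine sketch of a MapReduce word count in Python. This is purely illustrative: the function names and the in-memory "shuffle" step are my own, and real Hadoop jobs are written against the Java MapReduce API and run distributed across the cluster.

```python
from collections import defaultdict

def map_fn(line):
    # The map function emits intermediate (key, value) pairs --
    # here, (word, 1) for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # The reduce function aggregates all values seen for one key.
    return (key, sum(values))

def run_job(lines):
    # "Shuffle" phase: group intermediate values by key, as the
    # framework does between the map tasks and the reduce tasks.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In the real framework, the map tasks and reduce tasks each run in parallel on different nodes, and the shuffle moves data over the network instead of through a dictionary.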
Now let's go on to HDFS.
Hadoop Distributed File System (HDFS)
HDFS runs on top of the existing file system on each node of the Hadoop cluster. It is designed to handle very large files with streaming data access patterns.
The larger the file, the less time Hadoop spends seeking to the next data location on disk, and much of the time Hadoop can run at the limit of the disks' transfer bandwidth. As you may know, seeks are expensive operations and are worthwhile only when you need to analyze a small subset of a data set.
Since Hadoop is designed to run over the entire data set, it is best to minimize seeks by using large files. Hadoop is designed for streaming, or sequential, data access rather than random access.
Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block and then reads sequentially from there.
Hadoop uses blocks to store a file or parts of a file, as shown below:
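Here is a minimal sketch of how a file is carved into fixed-size blocks. The 128 MB block size is the HDFS default in Hadoop 2.x; the 300 MB file size is a hypothetical example of my own:

```python
BLOCK_SIZE_MB = 128  # default HDFS block size in Hadoop 2.x

def split_into_blocks(file_size_mb):
    # Full blocks first, then one final partial block for the
    # remainder -- the last block only occupies the space it needs.
    blocks = [BLOCK_SIZE_MB] * (file_size_mb // BLOCK_SIZE_MB)
    remainder = file_size_mb % BLOCK_SIZE_MB
    if remainder:
        blocks.append(remainder)
    return blocks

print(split_into_blocks(300))  # [128, 128, 44]
```

Note that, unlike a traditional file system, HDFS does not waste a full block on the final 44 MB chunk; each of these blocks is then replicated across nodes, which later tutorials will cover.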