Tutorial : Hadoop -- Main Thread

Well, I have decided to come over and post stuff that I manage to collect and learn as I go ahead with my Hadoop practice. This thread will be an introduction to what Big Data and Hadoop are and will contain links of other threads that I am going to post so that it leads to easy navigation and a structured learning experience for readers.



Introduction to Hadoop and Big Data

Imagine a scenario wherein you have a Gigabyte of data that you want to process and you have a desktop computer with an RDBMS. The desktop computer will have no problem handling this load. Then your firm starts to grow very quickly and the data goes from 1GB to 10GB and then 100GB. This is when you start to reach the limits of your desktop computer.

Suppose you invest some money and get in a bigger computer with more processing power then you can process up to 10 TB to 100 TB and then you approach the limits of this computer too. Also, now you are asked to feed your application with unstructured data as well coming from Facebook, Sensors, RFID readers etc.


Now, how can you handle this huge amount of data? Hadoop lets you do so.

Hadoop is an open source project of Apache Software Foundation. Its a framework written in Java developed by Doug Cutting and he named it after his son's toy elephant.

Hadoop uses Google's MapReduce and Google's file system technologies as its foundation.

It is optimized to handle massive amounts of structured and unstructured data with the use of hardware that are relatively inexpensive computers. It is capable of doing massive parallel processing but that being a batch process, the response time is not immediate.


Hadoop replicates its data among different computers so that if one computer goes down, the data can be processed from another computer. This is called Mirroring or RAID 2 architecture.

Hadoop is not suitable for OLTP loads where data is randomly scattered on a structured data platform like RDBMS. Hadoop can also not be used for OLAP loads where data is sequentially accessed on a structured data platform to generate reports that provide business intelligence.

Hadoop is used for Big Data. It compliments OLTP and OLAP. It is in no way a replacement to RDBMS.

So now what is Big Data?

Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured, and grow so large and quickly that it is difficult to manage with regular database or statistics tools.

Now with all the advantages of Hadoop there are some cons too. As Hadoop processes data randomly, it is not suitable for processing transactions. It doesn't suit low data latency scenarios and is also not good where work cannot be parallelized.

Please post in any doubts or comments that you have and look out for the index section of this post to go to other tutorials


You are reading an archived discussion.

Related Posts

For other tutorials and index of tutorials in this series click here : Main Thread Hello guys. Once you have an idea of what Hadoop and Big Data really are,...
Some days ago i was hosting an application on a server the thing that is different is that instead of providing mysql database on the they are providing the...
A brief history on Automobile in the world: I would like to share this to all CEans and please post your views on the journey toward the new Cars you...
After spending an awful two days at Neurology department of the very famous Amrita Institute of Medical Sciences, Kochi, Kerala, India with my college mate who is also my good...
hey am building a robot for robowars, i bought 4 used maruti 800 wiper motors i am planing to use it for the wheels, I am new to this, soo...