Introduction to Hadoop

Apache Hadoop is a framework that supports distributed processing and distributed storage of very large data sets on clusters of commodity computers.

Distributed processing is achieved through MapReduce, and distributed storage is achieved through HDFS (Hadoop Distributed File System). The Hadoop framework is designed specifically for clusters of computers, so it is aware of the nodes and their network configuration, and it handles node, storage, and network failures. YARN (NextGen MapReduce) further improves this framework by splitting the two major functions of job management and resource management into two separate daemons.

MapReduce is a new abstraction that allows users to express the simple computations that process large amounts of raw data, while hiding the messy details of managing:

a. Parallelization

b. Fault tolerance

c. Data distribution

d. Load balancing

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model.
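
To make the model concrete, here is a minimal sketch using Hadoop's Java MapReduce API: the classic word-count example, where the map function emits a (word, 1) pair for every word and the reduce function sums the counts for each word. The class and job names are illustrative, and the input/output paths are assumed to arrive as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce: sum all counts emitted for the same word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Setting the reducer as a combiner lets Hadoop pre-aggregate counts on the map side, which reduces the volume of intermediate data shuffled across the network.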

This is explained very well in the white paper released by Google that inspired the creation of Hadoop:

http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Hadoop's Approach:

1. Take advantage of data locality. Hadoop splits data into blocks and stores them on different nodes (with replication to handle faults). The processing logic is then sent over to the nodes to work on their local copy of the data. This works well because processing logic can be expressed in far fewer bytes than the actual data, so instead of moving data through the network, Hadoop transfers the processing logic itself: Moving Computation is Cheaper than Moving Data. (See the HDFS sketch after this list.)

2. Designed for fault tolerance through fault recovery. Instead of designing the system to never fail, it is designed to recover from faults, so processing is impacted only when a real fault occurs. This is handled through replication and redundancy.

3. Split very large data sets into small, manageable blocks and run the logic in parallel on multiple coordinated nodes. By controlling the input and output specifications, Hadoop makes it easy to stitch the processing logic of the mapper and reducer together.

4. Take advantage of network proximity and minimize the use of network bandwidth by replicating data within a data center and within a rack.

5. Let the nodes work independently and report status, coordinating the processing without introducing a bottleneck.

6. Keep the programming model simple so that processing logic can be expressed without complex constructs, minimizing the iterations needed for performance optimization.

7. Scale horizontally as data size increases, simply by adding nodes to the cluster.

8. Handle very large data sets by pooling the storage of multiple nodes; it can even handle data sets that cannot fit on a single node.

9. Read chunks of data in parallel from multiple nodes to provide very high aggregate bandwidth. The emphasis is on high throughput rather than low latency of data access.

10. Use redundant execution to reduce the impact of slow machines, and to handle machine failures and data loss. Hadoop uses speculative execution to handle such stragglers (see the second sketch after this list).
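
To make points 1, 2, and 4 concrete, here is a minimal sketch using the HDFS Java API to write a file with an explicit replication factor and block size. The path is hypothetical, and the replication factor of 3 and the 128 MB block size are illustrative assumptions (they happen to match common defaults), not tuned recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and friends from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/sample.txt"); // hypothetical path
        short replication = 3;                         // 3 copies of each block
        long blockSize = 128L * 1024 * 1024;           // 128 MB blocks

        try (FSDataOutputStream out =
            fs.create(path, true, 4096, replication, blockSize)) {
          out.writeUTF("Moving computation is cheaper than moving data.");
        }

        // The NameNode tracks how the file was split and where replicas live.
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size:  " + status.getBlockSize());
      }
    }

Behind the scenes, the NameNode records how the file was split into blocks and where each replica lives, which is what allows the scheduler to place computation next to the data.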
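
For point 10, speculative execution can be enabled or disabled per job. A minimal sketch, assuming the standard MRv2 configuration keys mapreduce.map.speculative and mapreduce.reduce.speculative (both enabled by default):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeJobSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow the framework to launch backup attempts for tasks that run
        // well behind their siblings; the first attempt to finish wins and
        // the slower duplicates are killed.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "job with speculative execution");
        // ... configure mapper, reducer, and paths as in the word-count sketch ...
      }
    }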

The base Apache Hadoop framework is composed of the following modules:

1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules

2. Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster

3. Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications

4. Hadoop MapReduce – a programming model for large-scale distributed data processing

Posted on Monday, October 19, 2015 9:45 PM

