Hadoop MapReduce

MapReduce is a core component of the Apache Hadoop software framework. As the name implies, MapReduce works in two phases: it distributes work to different nodes within a cluster (Map) and then organizes the returned results into an answer to the query being made (Reduce).


There are three main components of MapReduce:

  1. JobTracker: The node that manages all jobs in a cluster. It is also known as the master node. Jobs are divided into tasks that are assigned to individual machines in the cluster.
  2. TaskTracker: The component that tracks every task assigned to an individual machine.
  3. JobHistoryServer: This component tracks completed jobs.

MapReduce distributes input data and collates the results, operating in parallel across massive clusters; a job can be split across any number of servers. MapReduce libraries are available in several languages. These libraries hide what happens under the hood, freeing programmers from the intricacies of the distributed computing paradigm.
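
To make the model concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names and paths are illustrative; the input and output locations are supplied as command-line arguments:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in this node's input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The mapper emits a count of 1 for every word it sees; the framework groups the intermediate pairs by key, and the reducer sums the counts for each word. Notice that nothing in the code deals with node assignment or data movement; that is exactly what the library handles.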

Each node reports back to the master node. If a child node does not report back, the master node can reassign its task to any other node. This makes MapReduce highly fault-tolerant, with the only single point of failure being the master node.



What is Hadoop?



Hadoop is a framework for processing huge amounts of data across clusters of computers, using commodity hardware in a distributed computing environment. It can run on a single server or scale to thousands of machines, each with its own storage. It is therefore a massively parallel execution environment that brings the power of supercomputing to commodity hardware. Hadoop is primarily used for big data analytics.
Hadoop is best thought of as an ecosystem of many components, ranging from data storage to data integration, data processing, and specialized tools for data analysts.

Hadoop Components


HDFS is a main component of Hadoop. It is a distributed file system designed to run on commodity hardware, and it is where the data is stored. It provides the foundation for other tools, such as HBase.
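
As a small illustration, the sketch below uses the HDFS Java client to copy a local file into the distributed file system. The fs.defaultFS address and both file paths are assumptions for a single-node setup; in practice the address would come from core-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for a local single-node cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS (both paths are illustrative).
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));
        fs.close();
      }
    }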

  1. MapReduce: Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases (thus the name). It enables resilient, distributed processing of massive unstructured data sets across commodity computer clusters, in which each node of the cluster includes its own storage.
  2. HBase: A column-oriented NoSQL database. Simply put, HBase is the data store for Hadoop and big data.
  3. ZooKeeper: Hadoop’s distributed coordination service: a centralized service for maintaining configuration information, naming, distributed synchronization, and group services. Many components of Hadoop depend on ZooKeeper (a minimal client sketch follows this list).
  4. Oozie: Oozie is Hadoop’s workflow scheduler. It schedules Hadoop jobs and is integrated with the rest of the Hadoop stack.
  5. Pig: Pig is a platform for analyzing large data sets. It has its own scripting language, Pig Latin, which a compiler translates into sequences of MapReduce jobs.
  6. Hive: A high-level, SQL-like language. It works like Pig but translates SQL-like (HiveQL) queries into MapReduce sequences.
The Hadoop ecosystem also contains several other frameworks:
  1. Sqoop: Tool to transfer data between Hadoop and relational databases.
  2. Flume: Tool to collect and move streaming data, such as logs, from individual machines into HDFS.
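
To give a feel for how components coordinate through ZooKeeper, here is a minimal sketch using the ZooKeeper Java client: it connects to a server, creates a znode holding a configuration value, and reads it back. The server address, znode path, and stored value are all illustrative assumptions for a local test setup:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
      public static void main(String[] args) throws Exception {
        // Assumed address of a local single-node ZooKeeper server.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        String path = "/demo-config";  // hypothetical znode for this sketch
        // Create the znode with a sample value if it does not exist yet.
        if (zk.exists(path, false) == null) {
          zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Any client in the cluster could now read the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
      }
    }

Because every client sees the same znode tree, services like HBase use this mechanism to share configuration, elect masters, and detect failed nodes.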

Running Drupal in Docker

I will assume that you have already installed Docker. If you haven't installed Docker, please visit https://www.docker.com/ to download a...