Hadoop Ecosystem for Big Data Analytics

 Big Data


Big Data refers to data sets so massive and complex that traditional database systems and software tools cannot manage them practically. Unlike traditional data, big data is generated continuously, every second. Its volume ranges from petabytes (10^15 bytes) to zettabytes (10^21 bytes). Sources of big data include social media platforms, weather forecasting, emails, blogs and e-news, traffic signals and GPS, software logs, and digital pictures and videos. These sources produce all types of data, including text, JSON, XML, images, audio, video, device data, and sensor data. Managing this data is therefore difficult and time-consuming, and this is where Hadoop and its ecosystem come into the picture to ease the overall workload.



Hadoop is an open-source framework for storing large amounts of data and processing it efficiently across clusters of machines. Hadoop is written in Java, but it also supports other programming languages such as Python and C++. The Hadoop ecosystem is composed of multiple modules, including Apache projects and various commercial tools. They work together to provide services such as data ingestion, analysis, storage, and maintenance. In 2008, a Hadoop cluster won the terabyte sort benchmark, outperforming supercomputers to become the fastest system for sorting a terabyte of data.

Basic components of Hadoop Ecosystem


  • HDFS: Hadoop Distributed File System
  • YARN: Yet Another Resource Negotiator
  • MapReduce: Programming based Data Processing
  • Spark: In-Memory Data Processing
  • Pig, Hive: Query-based Processing
  • HBase: NoSQL Database
  • Mahout, Spark MLlib: Machine Learning Algorithm Libraries
  • Solr, Lucene: Searching and Indexing
  • ZooKeeper: Cluster Coordination and Management
  • Oozie: Job Scheduling

Brief introduction


HDFS: It is a distributed file system that stores various types of data: structured, semi-structured, and unstructured. It follows a master-slave architecture in which a NameNode holds the file system metadata and DataNodes store the actual data blocks.
YARN: It handles resource allocation and job scheduling across the cluster. It manages this through three components: the ResourceManager, the NodeManager, and the ApplicationMaster.
MapReduce: It is a core component of Hadoop that helps in writing applications that process large data sets using distributed, parallel algorithms. It breaks a big data set into manageable pieces. It has two functions: Map(), which filters and transforms the input into key-value pairs that the framework then sorts and groups, and Reduce(), which summarizes the output of the Map function.
Pig: It was developed by Yahoo. It uses Pig Latin, a high-level data-flow scripting language. Pig is a platform for structuring data flows and for processing and analyzing massive data sets. It translates Pig Latin commands into MapReduce jobs and runs them in the background. After processing, it either stores the resulting data in HDFS or dumps it on the screen.
Hive: It lets users query large data sets with Hive Query Language (HQL), a SQL-like language. It is best suited to batch processing, and it also supports interactive queries. It uses the JDBC driver for connections and the Hive command line for query processing.
Mahout: It brings machine learning to the ecosystem, which lets systems improve themselves based on patterns, user/environment interactions, or algorithms. It provides functionality such as collaborative filtering, clustering, and classification.
Apache Spark: It is a framework for fast, near-real-time data analytics. It performs computation in memory to speed up data processing over MapReduce. The Spark project has claimed speeds up to 100x faster than Hadoop MapReduce for certain in-memory workloads.
Apache HBase: It is a non-relational, distributed database that runs on top of HDFS. It is fault-tolerant and can work with any type of data. HBase applications can use the native Java API or the REST and Thrift gateways (older releases also shipped an Avro gateway).
ZooKeeper: It is a coordination service that provides inter-component communication, synchronization, and consistent configuration across the cluster. It reduces the time that the different services of the Hadoop ecosystem spend on coordination.
Oozie: It acts as a job scheduler, scheduling jobs and binding them together as a single unit. An Oozie workflow job is a set of actions that must be executed in sequential order, while an Oozie coordinator job is triggered when an external stimulus, such as a scheduled time or the arrival of data, is given to it.
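
The Map() and Reduce() steps described for MapReduce above can be illustrated with a small word-count sketch. This is plain Python running on one machine, not Hadoop's actual API: the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical and chosen only for illustration. On a real cluster, the framework would run many map and reduce tasks in parallel and do the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, imitating what the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: summarize all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big tools", "hadoop handles big data"]
    print(word_count(sample))
```

Each line is mapped independently of the others, which is what allows the map work to be spread across many nodes; only the grouping step needs to see pairs from every line.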


[Figure: Hadoop Ecosystem Distribution]

Advantages of Hadoop Ecosystem


  • Open-source
  • Scalable and Cost-effective
  • Varied Data sources
  • High performance and fault-tolerant
  • High throughput and compatible
  • Multiple languages supported
  • Real-time analytics
  • Distributed Architecture

Disadvantages of Hadoop Ecosystem


  • Issues with small files
  • Processing overhead
  • Iterative processing
  • Security
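
The small-files issue in the list above comes from how HDFS tracks metadata: the NameNode keeps an entry in memory for every file and every block, so many tiny files cost far more memory than a few large ones holding the same data. The rough estimate below is a sketch under two assumptions that are not from this article: a 128 MiB block size and a rule-of-thumb cost of about 150 bytes of NameNode memory per file or block object. The function names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024**2   # assumption: 128 MiB HDFS block size
BYTES_PER_OBJECT = 150       # assumption: rule-of-thumb NameNode memory per file/block object


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def namenode_objects(total_bytes, file_size):
    """Count file objects plus block objects needed to hold total_bytes in files of file_size bytes."""
    n_files = ceil_div(total_bytes, file_size)
    blocks_per_file = ceil_div(file_size, BLOCK_SIZE)
    return n_files * (1 + blocks_per_file)


def namenode_memory(total_bytes, file_size):
    """Estimate NameNode memory in bytes under the assumptions above."""
    return namenode_objects(total_bytes, file_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tib = 1024**4
    small = namenode_memory(one_tib, 1024**2)   # 1 TiB stored as 1 MiB files
    large = namenode_memory(one_tib, 1024**3)   # 1 TiB stored as 1 GiB files
    print(f"1 MiB files: {small / 1024**2:.1f} MiB of NameNode memory")
    print(f"1 GiB files: {large / 1024**2:.1f} MiB of NameNode memory")
```

Under these assumptions, the same terabyte of data needs a few hundred times more NameNode memory when stored as 1 MiB files than as 1 GiB files, which is why small files are usually packed into larger ones before loading.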

Vibhuti is pursuing her Bachelors in Computer Engineering from Marwadi Education Foundation Group of Institutions. She is fond of drawing & traveling.
Vibhuti, Intern at GlobalVox. | Posted on: March 31, 2022