Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running on clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.
Hadoop and big data
Hadoop runs on clusters of commodity servers and can scale out to support thousands of hardware nodes and massive amounts of data. It uses a namesake distributed file system that's designed to provide rapid data access across the nodes in a cluster, plus fault tolerance so applications can continue to run if individual nodes fail. Those capabilities helped Hadoop become a foundational data management platform for big data analytics after it emerged in the mid-2000s.
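The fault tolerance described above comes from HDFS splitting files into blocks and replicating each block across several nodes. The following is a conceptual sketch of that idea in plain Python; the function names, block size, and node names are illustrative, not HDFS's actual implementation or API (HDFS does default to three replicas per block).

```python
import random

BLOCK_SIZE = 4   # bytes per block (tiny, for demonstration; HDFS uses 128 MB)
REPLICATION = 3  # HDFS defaults to 3 replicas per block

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes."""
    return {i: random.sample(nodes, replication) for i in range(len(blocks))}

def read_file(blocks, placement, live_nodes):
    """Reassemble the file using any live replica of each block."""
    out = []
    for i, block in enumerate(blocks):
        if not any(n in live_nodes for n in placement[i]):
            raise IOError(f"block {i} lost: all replicas down")
        out.append(block)
    return b"".join(out)

nodes = [f"node{i}" for i in range(5)]
blocks = split_into_blocks(b"hello hadoop cluster")
placement = place_blocks(blocks, nodes)

# One node fails; with 3 replicas spread over 5 nodes, every block
# still has at least 2 surviving copies, so the read succeeds.
survivors = set(nodes) - {"node0"}
print(read_file(blocks, placement, survivors))  # b'hello hadoop cluster'
```

The design choice this illustrates: because replicas live on different machines, a client can route around a dead node transparently, which is why applications keep running through individual node failures.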
Components of Hadoop
The core components in the first iteration of Hadoop were the Hadoop Distributed File System (HDFS), MapReduce and Hadoop Common; YARN was added as a fourth core component in Hadoop 2.0:
- Hadoop Distributed File System (HDFS): A file system that manages the storage of and access to data distributed across the various nodes of a Hadoop cluster.
- YARN: Hadoop's cluster resource manager, responsible for allocating system resources to applications and scheduling jobs.
- MapReduce: A programming framework and processing engine used to run large-scale batch applications in Hadoop systems.
- Hadoop Common: A set of utilities and libraries that provide underlying capabilities required by the other pieces of Hadoop.
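The MapReduce component listed above can be made concrete with a word count, the canonical example of the model. This is a minimal sketch of the map, shuffle and reduce phases in plain Python, written to illustrate the programming model only; real Hadoop jobs are typically written against the Java `Mapper`/`Reducer` API, and the framework, not user code, performs the shuffle.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Because each map call and each reduce call is independent, Hadoop can run thousands of them in parallel across a cluster, which is what makes the model suit large-scale batch processing.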