What is Hadoop?

Making the complex simple

Managing Director of Intelligent Business Strategies Limited, Intelligent Business Strategies Limited

Apache Hadoop is a set of software technology components that together form a scalable system optimized for analyzing data. Data analyzed on Hadoop has several typical characteristics

  • Structured—for example, customer data, transaction data and clickstream data that is recorded when people click links while visiting websites
  • Unstructured—for example, text from web-based news feeds, text in documents and text in social media such as tweets
  • Very large in volume
  • High rate of speed for creation and arrival

The Hadoop system

A typical Hadoop system is deployed on a hardware cluster, which comprise racks of linked computer servers. Here is a high level diagram of what Hadoop looks like:

What is Hadoop?

In addition to open source Hadoop, a number of commercial distributions of Hadoop are available from various vendors. A Hadoop system comprises a number of key components: 

  • Yet Another Resource Negotiator (YARN)
  • Hadoop Distributed File System (HDFS)
  • Execution engines that run analytics applications at scale
  • Apache Pig
  • Apache Hive and/or third-party SQL on Hadoop engines
  • Apache HBase
  • Search

The Hadoop operating system

Think of YARN as Hadoop’s operating system. It is cluster management software that controls the resources allocated to different applications and execution engines across the cluster.

The Hadoop file system

HDFS provides highly scalable storage in a Hadoop cluster that makes use of the disk storage on every node in the cluster. Many different types of data in many different formats can be loaded into and stored in HDFS. In addition, data in HDFS is partitioned across servers so that it can be accessed in parallel, and it is triple replicated for high availability if disks and/or servers fail.

Hadoop execution engines

The different application execution engines such as Apache Spark, Apache Storm, Apache Tez and MapReduce parallelize the execution of analytics application steps across the cluster. The term engine means software that runs on every server—node—in the cluster, under the control of YARN. The engines execute application logic and analytics in parallel across the cluster to process partitioned data stored in HDFS.

In general, parallelism is achieved by taking the application logic to the data, rather than taking the data to the application. In other words, copies of the application—or each task within an application—are run on every server to process local data physically stored on that server. This approach avoids moving data elsewhere in the cluster to be processed.

The MapReduce engine runs analytics applications in batch and in parallel. To do so, application developers and data scientists need to write applications as two distinct program components—the map component and the reduce component. The MapReduce engine runs the map step on all nodes in the cluster to produce a set of intermediate output files. It then sorts these intermediate files and runs a reduce step to take the sorted intermediate files and aggregate the data in them to get a final result. This process is scalable but relatively slow because of the need to write lots of intermediate files to disk and then re-read them again.

Tez is an alternative to the MapReduce engine. It does not need to write and read intermediate files to disk. For this reason, it is generally faster than the MapReduce engine and can still run MapReduce-style applications. It just does so in a different way.

Spark accelerates application execution even further by enabling data from HDFS to be read into memory and partitioned across the cluster—meaning that Spark analytics applications can analyze data at scale in parallel and in memory. No I/O exists. Speed is the reason why Spark is getting so much attention nowadays.

Storm is also an execution engine, but one that is different because it is for real-time streaming analytics applications, which means data can be analyzed before it is stored anywhere. It can be analyzed as soon as the data is generated. Sensor data and market data offer examples. Storm is designed to analyze data at scale in real time, typically using time-series analysis. This kind of analysis means that applications, running in parallel in a cluster, look for patterns in the data. For example, a sequence of events within a short time window—such as the last 10 seconds, the last 60 seconds or the last 5 minutes—together indicate a business condition. When the condition is detected or predicted, then some kind of action is taken, which may be an alert or an automated action such as shutting down a piece of equipment because it is predicted to fail.

Most Storm applications are written in Java. However, writing these programs can be challenging, so a library of prebuilt Storm components was made available to speed up application development. More recently, tools have emerged to generate Storm analytics applications.

Today, Spark offers an alternative to Storm called Spark Streaming that runs in memory to accelerate processing. Spark is emerging as the general-purpose execution engine of choice for all types of analytics applications, which means that interest in MapReduce and Storm are fading somewhat.

Another programming language

If you don’t want to write analytics applications in a programming language such as Java Python, R or Scala, you can always write them in Pig Latin scripts. The Pig platform includes a high-level language that is optimized for data-flow processing, whereby the output of one step is fed into the next step and so on. The difference is that Pig Latin is a declarative language. In simple terms, you state in Pig what you want to happen, and then the Pig script is compiled into MapReduce, Spark or Tez jobs to run in parallel on Hadoop clusters processing data in HDFS. Pig is very popular for extract, transform and load (ETL) processing.

A Hadoop data warehouse infrastructure

Hive is available on all Hadoop systems and is a complimentary SQL interface to data stored in HDFS files. Hive allows you to connect self-service business intelligence (BI) tools—and applications using SQL to query data—to Hadoop Hive and then use SQL to access data. Note that Hadoop is not a relational database management system (RDBMS). However, Hive enables you to create a schema, or a table structure, on top of a file to describe the structure of data within the file. Then you can use SQL to query the table, and Hive will navigate the file to get your data. Also, many SQL-on-Hadoop alternatives to Hive—for example, IBM BigSQL and Spark SQL that also offer SQL access—are available. The difference is that technologies such as IBM BigSQL offer a much more comprehensive SQL capability and performance than Hive.

Hadoop search capability

Search enables building search indexes by crawling the data in HDFS. Once the indexes are built, you can then explore the data using a familiar search interface and search queries. The queries access the indexes and let you discover what is in your data. This method is particularly useful for unstructured data such as text, but it can also work on structured data and semistructured data such as JavaScript Object Notation (JSON) or XML.

Learn more about Hadoop and how it can increase your speed of innovation, and stay tuned for more installments of the Making the complex simple blog series.