Blogs

Hadoop: A quick and easy guide to this flexible framework

Best-Selling Author, Keynote Speaker and Leading Business and Data Expert

When diving into the topic of big data, one soon comes across an odd-sounding name: Hadoop. What exactly is Apache Hadoop? Put simply, Hadoop can be thought of as a set of programs and procedures that anyone can use as the backbone for big data operations. Hadoop is also open source, which means it is essentially free for anyone to use or modify. Because many people reading this post are not software engineers, here is a brief guide, without oversimplifying the topic, for anyone who wants to know a bit more about the nuts and bolts that make big data analysis possible. 

Hadoop enters the picture

Development of Hadoop began when forward-thinking software engineers realized that it was quickly becoming useful to be able to store and analyze data sets far larger than can practically be stored and accessed on one physical storage device—such as a hard disk drive. This is partly because, as physical storage devices become bigger, the component that reads the data from the disk—which in a hard disk is the head—takes longer to move to a specified disk segment. Instead, many smaller devices working in parallel are more efficient than one large device. 

Hadoop was released in 2006 by the Apache Software Foundation, a nonprofit organization that produces open source software. This kind of software powers much of the Internet behind the scenes. And if you’re wondering where the odd name came from, it was the name of a toy elephant that belonged to the son of one of the original creators.

Hadoop is made up of modules, each of which carries out a particular task that is essential for a computer system designed to handle big data analytics. The two most important modules are the Hadoop Distributed File System (HDFS) and MapReduce. 

HDFS: This module allows data to be stored in an easily accessible format across a large number of linked storage devices. A file system is the method a computer uses to store data so that the data can be found and used. Normally, this function is determined by the computer’s operating system; however, the Hadoop framework uses its own file system that sits above the file system of the host computer. As a result, HDFS can be accessed from any computer running any supported operating system. 
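For readers curious about what that looks like in practice, below is a minimal sketch in Java of writing a file to HDFS and reading it back through Hadoop's FileSystem API. The file path and its contents are made up purely for illustration, and a real cluster would supply its own connection settings through the configuration files on its classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster's settings (for example, core-site.xml) from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // A hypothetical file location, used only for this example.
            Path file = new Path("/user/demo/customers.txt");

            // Write a small piece of data into HDFS...
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("Alice,34\nBob,28\n");
            }

            // ...and read it back, much as if it were a local file.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buffer = new byte[(int) fs.getFileStatus(file).getLen()];
                in.readFully(buffer);
                System.out.println(new String(buffer));
            }
        }
    }

The point to notice is that the code never needs to know which physical machines actually hold the blocks of the file; HDFS takes care of that behind the scenes.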

MapReduce: This module is named for the two basic operations it carries out. It reads data from the database and puts it into a format suitable for analysis (map), and it performs mathematical operations, such as counting the number of people aged 30 or older in a customer database (reduce). 
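To make those two steps a little more concrete, here is a minimal sketch of that age-counting example using Hadoop's Java MapReduce API. The assumption that each input line holds a record in the form "name,age", along with the class names, is made up for illustration only.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: read each customer record (assumed here to be a "name,age" line)
    // and emit a 1 for every customer aged 30 or older.
    class AgeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("age 30 or older");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 2 && Integer.parseInt(fields[1].trim()) >= 30) {
                context.write(KEY, ONE);
            }
        }
    }

    // Reduce: add up the 1s emitted by all the mappers to get the total count.
    class AgeReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

In a real job, these two classes would be wired together in a small driver program and submitted to the cluster, where the map work runs in parallel across the machines holding the data and the reduce step combines their partial counts.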

Hadoop Common: This module provides the Java tools and libraries needed for the end user’s computer systems—whether they are Windows-based, UNIX-based or based on any other operating system—to read data stored within HDFS. 

Hadoop YARN: The YARN module manages the resources for the systems storing the data and running the analysis.  

Various other procedures, libraries and features have come to be considered part of the Hadoop framework in recent years, but HDFS, MapReduce, Hadoop Common and YARN are the four principal modules. 

Hadoop flexibility

The flexible nature of a Hadoop system enables organizations to add to or modify their data system as their needs change. Cost-effective and readily available components from any IT vendor can be used. 

Today, the Hadoop framework is widely used to provide data storage and processing across commodity hardware—relatively cost-effective, off-the-shelf systems that are linked together, as opposed to costly systems custom-made for the job at hand. A significant number of Fortune 500 companies also use Hadoop. 

Just about all the big online names use it. And because everyone is free to alter it for their own purposes, modifications made by expert engineers at Amazon and Google, for example, are fed back to the development community, where they are often used to improve the official product. This form of collaborative development among volunteer and commercial users is a key feature of open source software. 

In its raw state—using the basic modules supplied by Apache—Hadoop can be very complex, even for IT professionals. As a result, various commercial distributions have been developed, such as Cloudera, which simplify the task of installing and running a Hadoop system and also offer training and support services. 

Thanks to the flexible nature of the system, organizations can expand and adjust their data analysis operations as their business grows. And the support and enthusiasm of the open source community behind Hadoop have led to great strides toward making big data analysis readily accessible for everyone. 

For more information, check out my latest book, Big Data – Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. Also, read a complimentary sample chapter, or have a look at my blog posts at Forbes.

Hadoop is the heart of many cognitive computing applications, such as IBM Watson Analytics. Learn more about IBM Watson Analytics and start using it today.