Running Hadoop in the Cloud

Senior Managing Consultant, Big Data and Analytics Practice, IBM

With the growing popularity of cloud computing, enterprises are seriously looking at moving workloads to the cloud. Issues around multi-tenancy, data security, software licensing, data integration, etc., have to be considered before enterprises can make this shift. Even then, not all workloads can be moved easily to the cloud. In recent years, Hadoop has gained a lot of interest as a big data technology that can help enterprises cost-effectively store and analyze massive amounts of data. As enterprises start evaluating Hadoop, one of the most frequently asked questions is “Can we run Hadoop in the cloud?”

To answer this, the following key aspects of the Hadoop infrastructure are important to understand:

1. Hadoop runs best on physical servers. A Hadoop cluster comprises a master node called the name node and multiple child nodes called data nodes. These data nodes are separate physical servers with dedicated storage (much like your PC hard drive), instead of a common shared storage.

2. Hadoop is “Rack Aware” – Hadoop data nodes (servers) are installed in racks. Each rack typically contains multiple data node servers with a top-of-rack switch for network communication. “Rack awareness” means that the name node knows where each data node server is and in which rack. This lets Hadoop place the 3 (default) replicas of each data block on different data nodes spread across more than one physical rack, which helps prevent data loss due to data node or rack failure. When a MapReduce job needs access to data blocks, Hadoop uses the same information to assign the work to the closest data node that contains the data, thereby reducing network traffic. The Hadoop system admin manually maintains this rack awareness information for the cluster. Since a Hadoop cluster generates a lot of network traffic, it is recommended that it be isolated on its own network instead of sharing a network through a VLAN (refer to Brad Hedlund’s article on Hadoop rack awareness).
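
In practice, the admin usually supplies this rack information through a small topology script: Hadoop calls the script with one or more node addresses as arguments and expects one rack path per address on standard output. In Hadoop 2 and later the script is registered with the net.topology.script.file.name property. The following is only a minimal sketch of such a script; the IP addresses and rack names are hypothetical, and a real cluster would more likely load the mapping from a file the admin maintains.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack topology script (hypothetical addresses and racks)."""
import sys

# Hypothetical mapping of data node addresses to rack paths,
# maintained by hand by the cluster admin.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
    "10.0.2.22": "/dc1/rack2",
}

# Rack path reported for any address missing from the mapping.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the addresses as command-line arguments and reads
    # the rack paths back from standard output.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Falling back to a default rack for unknown addresses keeps the script from failing when a new node is added before the mapping is updated, at the cost of that node being treated as if it sat in a separate rack.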

Options for running Hadoop in the Cloud

  • Hadoop as a Service in the Public Cloud – Hadoop distributions (Cloudera CDH, IBM BigInsights, MapR, Hortonworks) can be launched and run on public clouds like AWS, Rackspace, MS Azure, IBM SmartCloud, etc., which offer Infrastructure as a Service (IaaS). In a public cloud, you are sharing the infrastructure with other customers. As a result, you have very limited control over which physical server a VM is spun up on and what other VMs (yours or other customers’) are running on the same physical server. There is no “rack awareness” information that you can access and configure in the name node. The performance and availability of the cluster may be affected because you are running on VMs. Enterprises can use and pay for these Hadoop clusters on demand. There are options for creating your own private network using a VLAN, but the performance recommendation for a Hadoop cluster is a separate isolated network because of the high network traffic between nodes. In all cases, with the exception of AWS EMR, you have to install and configure the Hadoop cluster on the cloud yourself.
    • MapReduce as a Service – Amazon’s EMR (Elastic MapReduce) provides a quick and easy way to run MapReduce jobs without having to install a Hadoop cluster on its cloud. This can be a good option if you want to develop Hadoop programming expertise within your organization, or if your workloads consist only of MapReduce jobs.
    • Hadoop on S3 – You can run Hadoop using Amazon’s S3 instead of HDFS to store data. S3 is slower than HDFS, but it provides other features like bucket versioning and elasticity, as well as its own data loss protection schemes. This may be an option if your business data is already stored in S3 (e.g. Netflix runs a Hadoop cluster on S3).
  • Hadoop in Private Cloud – The same set of considerations applies to a private cloud deployment of Hadoop. However, in a private cloud you may have more control over your infrastructure, which will enable you to provision bare-metal servers or create a separate isolated network for your Hadoop clusters. Some private cloud solutions also provide a PaaS layer that offers pre-built patterns for deploying Hadoop clusters easily (e.g. IBM offers patterns for deploying InfoSphere BigInsights on its SmartCloud Enterprise). In addition, you have the option of deploying a “Cloud in a Box” like the IBM PureData System, which offers Hadoop ready to run in your own data center. The main reasons for a private cloud deployment are data security and access control for your data, as well as better visibility and control of your Hadoop infrastructure.

Key things to consider before deploying a Hadoop cluster in the cloud:

  • Your enterprise should evaluate its security criteria for deploying workloads in a public cloud before moving any data into the Hadoop cluster. Hadoop cluster security is very limited. There is no native data security that will satisfy enterprise requirements around SOX, PII, HIPAA, etc.
  • Evaluate Hadoop distributions that you would want to use and the operating system standards of your enterprise. Preferably go with distributions that are close to the open source Apache distributions. Hadoop distributions typically run on Linux. Hortonworks provides a Hadoop distribution for Windows that is currently available on MS Azure cloud.
  • When using AWS, be aware that using Hadoop with S3 would tie you to Amazon’s cloud. For open standards, look at OpenStack-based cloud providers like Rackspace, IBM SmartCloud, HP, etc.
  • Look at the entire Hadoop ecosystem and not just the basic Hadoop cluster. The value of Hadoop lies in the analytics and data visualization that can be applied to large data sets. Ensure that the tools you want to use for analytics (e.g. Tableau, R, SPSS, etc.) are available for use with the cloud provider.
  • Understand where the data to be loaded into Hadoop comes from. Are you going to load data from internal systems that are not in the cloud, or is the data already in the cloud? Most public clouds charge data transfer fees if you are moving data back and forth.
  • Hadoop clusters on VMs will be slower than clusters on physical servers. You may be able to use them for development and test clusters. VMware’s Project Serengeti is trying to address the deployment of Hadoop clusters on virtual machines without taking a performance hit. However, this approach ties you to VMware’s hypervisor, which should be a criterion to consider when selecting a cloud provider.
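
To make the data transfer consideration above concrete, the fees can be estimated with simple arithmetic before committing to a provider. The sketch below is only illustrative: the function name and the per-GB price are hypothetical placeholders, not any provider’s actual rate, and real pricing is often tiered or differs between inbound and outbound traffic.

```python
def transfer_cost_usd(data_gb, price_per_gb_usd, transfers_per_month=1):
    """Estimate monthly data transfer fees under a flat per-GB price.

    All inputs are assumptions supplied by the caller; the flat-rate
    model ignores tiered pricing and free allowances.
    """
    if data_gb < 0 or price_per_gb_usd < 0 or transfers_per_month < 0:
        raise ValueError("inputs must be non-negative")
    return data_gb * price_per_gb_usd * transfers_per_month


if __name__ == "__main__":
    # Hypothetical example: 500 GB moved four times a month
    # at a placeholder price of 0.10 USD per GB.
    monthly = transfer_cost_usd(500, 0.10, transfers_per_month=4)
    print(f"Estimated monthly transfer fees: {monthly:.2f} USD")
```

Running the same estimate with the provider’s published rates for both directions of transfer gives a quick sense of whether loading data from internal systems on a recurring basis is affordable.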

What concerns or questions do you have about Hadoop in the cloud? Post your questions in the comment section below.