Scalability Without Limits for Big Data
Enterprise-scale data integration has what it takes to handle the volume, variety, and velocity of big data
Big data streams into many organizations at high velocity, so high-performance processing is essential. Data also changes rapidly, and it must be fed quickly to applications across the system so that management can react to changing business conditions as soon as possible. To successfully handle big data, organizations need to implement enterprise-class data integration that is dynamic, extendable, and combined with the Apache Hadoop framework to meet the following kinds of needs:
- Dynamic: Meeting current and future performance requirements
- Extendable: Partitioning for fast and easy scalability
- Hadoop integration: Leveraging Hadoop, which is not an integration platform on its own but can serve as part of an integration architecture, to land data and determine its value while balancing optimization
In addition to performance, scalability is a key consideration when it comes to big data integration. Employee time is a valuable resource, and enhanced performance and scalability help workers use that time more efficiently, boosting productivity. Using technology solutions to reduce the amount of time workers spend on particular tasks also helps improve an enterprise's business success.
The steady growth of data volume, variety, and velocity can impact service-level agreements (SLAs) within IT departments. The time required to process data integration jobs can exceed the window allowed by SLAs, and when this condition occurs IT may no longer be capable of meeting the needs of internal customers. Efficient processing of big data can be a requirement for every part of an organization.
Optimizing big data workloads
A highly critical requirement for processing large, enterprise-class data volumes for big data integration is unlimited data scalability: the capability for a single application to process data volumes that grow in proportion to any added hardware.* Processing unlimited data volumes makes it possible to solve many high-value business problems for the first time while ensuring that added hardware yields predictable benefits.
Unlimited data scalability enables organizations to process vast quantities of data in parallel, helping dramatically reduce the amount of time it takes to handle various workloads. Unlike other processing models, unlimited data scalability systems optimize hardware resource usage and allow the maximum amount of data per node to be processed.
How can organizations know they have an infrastructure that is strong enough to handle big data? The architecture is at the core of successfully implementing a big data initiative. Unlimited data scalability systems have four key elements in common:
- Shared-nothing architecture
- Software data flow
- Data partitioning for linear data scalability
- Design isolation
Shared-nothing architecture
The software is architected from the ground up to capitalize on a shared-nothing, massively parallel processing (MPP) architecture (see Figure 1). Data sets are partitioned across computing nodes, and a single application executes the same application logic against each data partition. As a result, there is no single point of contention, or processing bottleneck, anywhere in the system, and there is no upper limit on data volume, processing throughput, or the number of processors and nodes.
Figure 1. A contention-free, shared-nothing architecture
Software data flow
A shared-nothing architecture can be fully exploited by a software data flow that easily implements and executes data pipelining and data partitioning within a node and across nodes (see Figure 2). Software data flow also hides the complexities of building, tuning, and executing parallel applications from end users.
Figure 2. Parallel software data flow with automatic data repartitioning
Software data flow is an architecture well suited for capitalizing on multi-core processors within a symmetric multiprocessor (SMP) server—application scale-up—and for scaling out to multiple machines—application scale-out. It supports pipelined and partitioned parallelism within and across SMP nodes and provides a single parallelization mechanism across all hardware architectures, helping eliminate complexity. In addition, no upper limit exists for data volumes, processing throughput, or the number of processing nodes.
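The pipelined half of software data flow can be sketched with Python generators, where each stage consumes records as soon as the previous stage yields them rather than waiting for a full batch. The stage names here are illustrative, not product terminology:

```python
def extract(rows):
    for row in rows:            # stage 1: emit one record at a time
        yield row

def cleanse(stream):
    for row in stream:          # stage 2: starts work as soon as stage 1 yields
        yield row.strip().lower()

def load(stream):
    return list(stream)         # stage 3: consume the pipelined stream

raw = ["  Alpha", "BETA ", " Gamma "]
result = load(cleanse(extract(raw)))  # records flow stage to stage in memory
```

Each record moves through all three stages without the intermediate results ever being written out, which is the essence of pipelined parallelism.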
Data partitioning for linear data scalability
Large data sets can be partitioned across separate nodes, and a single job—for example, deploying the IBM® InfoSphere® Information Server integration platform—can execute the same application logic against all partitioned data (see Figure 3). Other approaches such as task partitioning are not capable of delivering linear data scalability as data volumes grow because the amount of data that can be sorted, merged, aggregated, and so on is limited to what can be processed on one node. The following characteristics apply to systems with data partitioning:
- Data partitions are distributed across nodes.
- One job is executed in parallel across nodes.
- Pipelining and repartitioning occur between stages and between nodes without landing to disk.
- Grid hardware for big data is cost-effective.
Figure 3. The same application logic executed for all data partitioned across nodes
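The repartitioning characteristic listed above, shuffling records between stages by key without landing to disk, might be modeled like this. This is a simplified, single-process sketch; `repartition` and its arguments are hypothetical names, not an actual API:

```python
from collections import defaultdict

def repartition(old_parts, key_fn, n_parts):
    # Shuffle records from the upstream partitions into new key-based
    # partitions entirely in memory, without landing to disk.
    new_parts = defaultdict(list)
    for part in old_parts:
        for rec in part:
            new_parts[key_fn(rec) % n_parts].append(rec)
    return [new_parts[i] for i in range(n_parts)]

# Stage 1 output: (id, value) records spread round-robin across two "nodes".
stage1 = [[(1, "a"), (3, "c")], [(2, "b"), (4, "d")]]
# Repartition by id so a downstream aggregation sees matching keys together.
stage2 = repartition(stage1, key_fn=lambda rec: rec[0], n_parts=2)
```

Key-based repartitioning between stages is what lets operations such as sort, merge, and aggregate run in parallel on every node instead of being limited to the data one node can hold.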
Design isolation
The capability for a developer to design a data processing job once, and use it in any hardware configuration without needing to redesign and retune the job, is called design isolation (see Figure 4). This approach allows a job to be built once and run without modification anywhere. It offers one unified mechanism for parallelizing data flow. A single configuration file provides a clean separation between the development of a job and the expression of parallelization at runtime. In addition, performance tuning is not required every time the hardware architecture changes, and there is no upper limit for data scalability as hardware is added.
Figure 4. One-time data processing design for any hardware configuration
Highly scalable platforms—IBM Netezza® data warehouse appliances, IBM PureData™ System data services delivery, IBM DB2® database software, the IBM database partitioning feature (DPF), Teradata, Hadoop, and the InfoSphere Information Server data integration platform—all support design isolation characteristics. And each can seamlessly leverage MPP and commodity grid architectures.
Facilitating efficient productivity
Implementing increasingly efficient big data integration technology can enhance employee productivity. The development environment should be designed to improve efficiency by providing a single design palette in the shared application environment. In this way, developers never have to switch among numerous interfaces; instead, everything they need is easily accessible.
An integrated, shared metadata environment is another significant efficiency enhancement for developers. Tracking job progress and diagnosing problems should be quick and easy. Additionally, visualizing data through a dashboard view can save countless hours by giving developers and IT operations staff a unified picture of what's going on with their data.
As big data stores continue to grow, these high-performance, time-saving features become even more important for IT departments. They can mean the difference between meeting and missing SLA requirements, and between having time to work on innovative new projects and merely managing existing systems. And for the business, performance and time-efficiency features can enable rapid, well-informed decision making, which may lead to cost-effective service enhancements for customers and competitive advantage.
Visit the IBM Data Integration website for more information about IBM data integration offerings, and please share any thoughts or questions in the comments.
* For more information on scalability proportionate to added hardware, see “Bringing Big Data Up to the Big Leagues,” by Leslie Wiggins, IBM Data magazine, October 2013.