The Impact of Big Data on Data Analytics

Searching for advanced big data analytics solutions to increase business agility and reduce costs

Data analytics has always played a key role in harnessing value from electronically stored information. Organizations use data analytics solutions to deliver insights that can lead to increased revenues, market share gains, reduced costs, and scientific breakthroughs.

Today, the increasing automation of business processes is expanding the landscape of data analytics. Information that was once stored in separate on- and off-line repositories, and in a variety of formats, is now available in digital format, ready to be amalgamated and analyzed. Consequently, business executives are asking more of their data and expecting faster, more impactful answers. Organizations are placing a higher priority on data analytics activities and more pressure on existing business analysts and IT teams to deliver.

Defining big data

Big data is part of the new frontier for data analytics. Some of the earliest references to the term relate to the open source Nutch project, where big data referred to massive data sets (in that case, weblogs tens to hundreds of terabytes in size) that needed to be batch processed or analyzed at once to update web search indexes. With the release of Google’s papers on MapReduce and the Google File System (GFS), which inspired the Apache Hadoop open source project, the meaning of big data expanded beyond volume alone. It now also encompasses the speed with which data must be processed, as well as an element of complexity introduced by new structured, unstructured, and multi-structured data types.
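To make the batch-processing idea concrete, the following is a minimal, single-machine Python sketch of the map, shuffle, and reduce steps applied to a handful of made-up weblog lines. It is not Hadoop or Nutch code. The function names, the log format, and the sample data are all assumptions chosen for illustration, and a real framework would distribute each step across many machines.

```python
from collections import defaultdict

# Hypothetical sample of weblog lines in an assumed "host path" format.
# Real inputs would be terabytes of logs spread over many machines.
LOG_LINES = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /about.html",
    "10.0.0.1 /index.html",
    "10.0.0.3 /index.html",
]


def map_phase(line):
    """Emit one (key, value) pair per log line: the requested path and a count of 1."""
    _host, path = line.split()
    yield path, 1


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_batch(lines):
    """Run the whole data set through map, shuffle, and reduce in one batch."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


hits = run_batch(LOG_LINES)
print(hits)
```

The sketch processes the full input in one pass, which mirrors the batch orientation described above. It does not address the speed or data-type complexity aspects of the broader definition.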

ESG has found that vendors are using the term “big data” rather liberally to refer to a broader set of use cases. This trend is evident among vendors offering distributed parallel file systems (for example, GPFS and Lustre), workload-specific storage solutions (such as EMC Isilon and Panasas), and databases designed specifically for complex analytics (including Teradata Aster, HP Vertica, IBM Netezza, and EMC Greenplum). As Figure 1 shows, ESG has updated its definition to reflect the current usage.

Figure 1. Big data defined


Assessing the impact of big data on data analytics

ESG does not believe big data is marketing hype. It is a reality for many organizations across multiple vertical industries, and big data is altering the data center landscape. Big data is increasingly requiring IT teams to consider unconventional approaches to addressing business needs as data volumes, data-processing speeds, and data-type complexities continue to grow beyond standardized front- or back-office data processing capabilities.

Many organizations are still working out how their current analytics platforms and underlying IT infrastructure can handle increasing data volumes along with the pressure to improve performance. ESG recently surveyed 270 decision makers and influencers to better understand how organizations view big data challenges and to identify the benefits they expect to realize by deploying new analytics platforms for their big data needs.

According to the ESG poll, organizations with large and fast-growing database volumes, and those pulling data from multiple sources, are more likely to have encountered challenges related to big data. As more data sources are integrated into business intelligence and data processing tasks, the usual data analytics processes are no longer sufficient. These organizations view improving data analytics capabilities as critical.

More than half of respondents in the ESG survey identified improving data analytics as one of their organization’s top 5 IT priorities over the next 12 to 18 months (see Figure 2). On the other end of the spectrum, only 5 percent indicated that data analytics is not among their organization’s top 20 IT priorities. More than half (54 percent) of enterprise organizations (those with 1,000 or more employees) consider data analytics to be a top 5 IT priority compared to only 42 percent of their large midmarket counterparts (those with 500 to 999 employees).


Figure 2. Relative importance of data analytics


At this time, no dominant data analytics platform has emerged. More than half of organizations currently use custom data analytics solutions. General-purpose databases tuned for specific workloads are also widely used to perform data analytics activities. Organizations with at least 100 TB of data are significantly more inclined to be using cloud-based data analytics services, as well as either massively parallel processing (MPP) or symmetric multi-processing (SMP) analytical databases. Despite the fact that workload-specific appliances (that is, analytical databases bundled together with software, storage, server, and network resources) have been available for years, only 6 percent identified these solutions as their organization’s primary data analytics platform. This small percentage is likely due to the limited number of vendor options for appliances—a limitation that will probably persist for the next 12 to 18 months. These findings confirm that organizations are pushing analytics platforms to their limits and are looking for architectures that are more suited to increasingly demanding analytic tasks.

Data integration is the most common data analytics challenge. More than one-third of respondents believe that data integration processes take too long (39 percent), that data volumes are too large (35 percent), or both. Some of these issues are exacerbated by the number of data sources an organization typically integrates.

Identifying drivers and expected benefits of new data analytics platforms

The need to reduce the costs of existing analytics platforms is the biggest driver of new data analytics platform purchases. Although few organizations identified cost as a data analytics challenge, it was the most frequently cited driver for evaluating new data analytics platforms. Organizations could end up spending significantly more capital by staying on existing platforms than by deploying a solution better suited to their environment. This is especially true for organizations that have short timeframes to process large data sets and complete analytics exercises that involve complex integration schemes. Next-generation big data analytics platforms promise to run complex analytics on massive amounts of data using low-cost commodity hardware. This is a cost-savings promise organizations can’t afford to ignore if the benefits prove to be real.

Improved business agility is the most sought-after benefit of deploying a new data analytics solution. In fact, more than half of the organizations that plan to implement a new data analytics platform expect an increase in that area. In many cases, incorporating new data sets into analytics tools requires additional complex data integration and extraction processes. The sheer data volume, the added transformations, and taxed networks and database servers can all introduce latency. Consequently, a request can become stale before the data is ever useful. Organizations need better ways to handle the incorporation of new data sets so they can accommodate constantly changing business requirements.

Tracking big data analytics trends

Are organizations using the MapReduce framework (for example, Hadoop) to address big data analytics challenges? Adoption so far has been limited, even though MapReduce continues to hold a top spot in marketing hype. In the ESG survey, only 8 percent of respondents indicated that their organizations currently use the technology, but 13 percent plan to do so within the next 12 months. Both current and planned adoption are significantly higher among organizations that process more data per analytics exercise and among those with higher annual rates of data growth.

Because the options for commercial versions of MapReduce frameworks were limited until 2011, it is not surprising that most current MapReduce users run an open source distribution. However, nearly half (47 percent) of planned adopters intend to use commercial distributions rather than the open source version. These new commercial offerings provide enhanced management capabilities, proprietary back-end storage systems to simplify integration, and better-optimized engines, all of which result in higher-performing processes.

Reassessing data quality and data management

Responses to the ESG survey suggest that the move toward big data is spurring organizations to reassess data quality issues and data management as a whole, and to implement both data quality solutions and data governance strategies as a way of alleviating issues related to integrating data from disparate sources. Organizations with multiple data sources—especially larger organizations with overlapping business functions and duplicate business processes—are more likely to recognize the benefits of implementing master data management (MDM) solutions with their data governance strategies.

Finding the best fit

Many organizations will need new data analytics platforms that can handle the challenges introduced by big data. Evaluating a new analytics platform that can scale with big data can be either an exciting or a daunting task. A plethora of vendors is flooding the market, each using marketing terms to promote its solution as the “silver bullet” for big data analytics, but no single platform has emerged as a leading reference architecture. As in any emerging market, confusion abounds.

The good news is that organizations have a variety of big data analytics platforms to choose from. As organizations assess those platforms, it is critical that they pair the appropriate workload with the best-fit solution without getting caught up in marketing hype. They should also remember that not all data is key-value paired, and that no two analytics projects are alike.
