Part II: The Big Data Platform Manifesto
In Part I of this series, we looked at the key considerations for an analytic enterprise to stay competitive in today’s world. To enable those considerations, one needs to define the imperatives for the supporting big data platform. In this post we will explore what that big data platform manifesto would need to look like. The limitations of traditional approaches to support analytics have resulted in failed projects, expensive environments, and non-scalable deployments. Hence, a big data platform has to support all of the data and must be able to run all of the computations that are needed to drive the analytics. It has to enable analysts and modelers to ask the most complex questions on enterprise data, without the infrastructure getting in the way. To achieve these objectives, we believe that any big data platform must include the following six imperatives.
Data Discovery and Exploration
The process of data analysis begins with understanding data sources, figuring out what data is available within a particular data source, and getting a sense of its quality and its relationship to other data elements. This process, known as data discovery, enables data scientists to create the right analytic model and computational strategy. Traditional approaches required data to be physically moved to a central location before it could be discovered. With big data, that can be expensive and impractical.
To facilitate data discovery and unlock resident value within big data, the platform must be able to discover data “in place.” It has to be able to support the indexing, searching, and navigation of different sources of big data. It has to be able to facilitate discovery of a diverse set of data sources, such as databases, flat files, content management systems—pretty much any persistent data store that contains structured, semi-structured, or unstructured data. This capability benefits analysts and data scientists by helping them to quickly incorporate or discover new data sources in their analytic applications.
Extreme Performance: Run Analytics Closer to the Data
Traditional architectures decoupled analytical environments from data environments. Analytical software would run on its own infrastructure and retrieve data from back-end data warehouses or other systems to perform complex analytics. The rationale behind this was that data environments were optimized for faster access to data, but not necessarily for advanced mathematical computations. Hence, analytics were treated as a distinct workload that had to be managed in a separate infrastructure.
This architecture was expensive to manage and operate, created data redundancy, and performed poorly with increasing data volumes.
The analytic architecture of the future needs to run both data processing and complex analytics on the same platform. It needs to deliver petascale high-performance throughput by seamlessly executing analytic models inside the platform, against the entire data set, without replicating or sampling data. It must enable data scientists to iterate through different models more quickly to facilitate discovery and experimentation with a “best fit” yield.
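To make the contrast concrete, here is a minimal sketch of the two approaches. It assumes an in-memory SQLite database as a stand-in for the data platform; the `sales` table, its columns, and the function names are hypothetical and exist only for illustration. The first function pulls every row out to the analytic layer, while the second pushes the computation into the engine that already holds the data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table (hypothetical data, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 100.0), ("east", 300.0), ("west", 50.0)],
    )
    return conn


def average_by_moving_data(conn: sqlite3.Connection) -> dict:
    """Decoupled style: fetch every row, then compute outside the data store."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        running_sum, count = totals.get(region, (0.0, 0))
        totals[region] = (running_sum + amount, count + 1)
    return {region: s / n for region, (s, n) in totals.items()}


def average_in_platform(conn: sqlite3.Connection) -> dict:
    """In-platform style: the engine aggregates; only the result leaves it."""
    rows = conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region")
    return dict(rows)
```

Both functions return the same answer, but the second moves one row per region across the boundary instead of the entire table. At three rows the difference is invisible; at large volumes it is the point of this imperative.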
Manage and Analyze Unstructured Data
For a long time, data has been classified on the basis of its type—structured, semi-structured, or unstructured. Existing infrastructures had barriers that prevented the seamless correlation and holistic analysis of this data; for example, independent systems to store and manage these different data types. We’ve also seen the emergence of hybrid systems that often let us down because they don’t natively manage all data types.
One thing that always strikes us as odd is that nobody ever affirms the obvious: organizational processes don’t distinguish between data types. When you want to analyze customer support effectiveness, structured information about a customer service representative (CSR) conversation (such as call duration, call outcome, customer satisfaction, survey response, and so on) is as important as unstructured information gleaned from that conversation (such as sentiment, customer feedback, and verbally expressed concerns). Effective analysis needs to factor in all components of an interaction, and analyze them within the same context, regardless of whether the underlying data is structured or not. A game-changing analytics platform must be able to manage, store, and retrieve both unstructured and structured data. It also has to provide tools for unstructured data exploration and analysis.
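As a rough illustration of analyzing both kinds of data in the same context, the sketch below combines structured call attributes with a score derived from the free-text transcript. Everything here is an assumption made for the example: the `Call` record, the field names, and especially the keyword-counting sentiment score, which is a toy stand-in for real text analytics.

```python
from dataclasses import dataclass

# Toy lexicons (hypothetical); real text analytics would be far richer.
NEGATIVE = {"frustrated", "cancel", "broken", "unacceptable"}
POSITIVE = {"thanks", "great", "resolved", "helpful"}


@dataclass
class Call:
    call_id: str
    duration_seconds: int   # structured
    resolved: bool          # structured
    transcript: str         # unstructured


def sentiment_score(text: str) -> int:
    """Count positive words minus negative words, ignoring case and punctuation."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def call_summary(call: Call) -> dict:
    """Put structured fields and the text-derived score into one record."""
    return {
        "call_id": call.call_id,
        "duration_seconds": call.duration_seconds,
        "resolved": call.resolved,
        "sentiment": sentiment_score(call.transcript),
    }
```

For example, `call_summary(Call("c1", 420, True, "I was frustrated, but the agent was helpful. Thanks!"))` yields one record in which call duration, outcome, and a sentiment value sit side by side, ready to be analyzed together.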
Analyze Data in Real-Time
Performing analytics on activity as it unfolds presents a huge untapped opportunity for the analytic enterprise. Historically, analytic models and computations ran on data that was stored in databases. This worked well for transpired events from a few minutes, hours, or even days back. These databases relied on disk drives to store and retrieve data. Even the best performing disk drives had unacceptable latencies for reacting to certain events in real time. Enterprises that want to differentiate themselves need the capability to analyze data as it’s being generated, and then to take appropriate action. It’s about deriving insight before the data gets stored on physical disks. We refer to this type of data as streaming data, and the resulting analysis as analytics of data in-motion.
Streaming data has an elastic quality to it. Depending on time of day, or other contexts, the volume of the data stream can vary dramatically. For example, consider a stream of data carrying stock trades in an exchange. Depending on trading activity, that stream can very quickly swell to between 10 and 100 times its normal volume. This implies that the big data platform not only has to be able to support analytics of data in-motion, but also to scale effectively to manage increasing volumes of data streams.
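The idea of analyzing data in motion can be sketched in a few lines. The example below is a minimal illustration, not a description of any particular streaming product: it assumes trades arrive in time order as hypothetical `(timestamp_seconds, price, shares)` tuples with positive share counts and a positive window length, and it computes a volume-weighted average price over a sliding time window. Each trade is processed as it arrives, and only the trades inside the current window are held in memory.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

Trade = Tuple[float, float, int]  # (timestamp_seconds, price, shares), hypothetical layout


def rolling_vwap(
    trades: Iterable[Trade], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, VWAP over the trailing window) after each incoming trade."""
    window = deque()   # trades currently inside the window
    notional = 0.0     # running sum of price * shares in the window
    shares = 0         # running sum of shares in the window

    for ts, price, qty in trades:
        # Fold the new trade into the running totals.
        window.append((ts, price, qty))
        notional += price * qty
        shares += qty

        # Evict trades that have aged out of the window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_price, old_qty = window.popleft()
            notional -= old_price * old_qty
            shares -= old_qty

        yield ts, notional / shares
```

Because `rolling_vwap` is a generator, it can be fed from any iterable source, including one that never ends. Memory use grows with the number of trades inside the window rather than with the length of the stream, so a surge in volume makes the window fuller but does not require storing the stream first. Scaling out across many streams is a separate concern that this sketch does not address.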
A Rich Library of Analytical Functions and Tools
One of the key goals of the big data platform should be to reduce the analytic cycle time: the amount of time that it takes to discover and transform data, develop and score models, and analyze and publish results. When your platform empowers you to run extremely fast analytics, you have a foundation on which to support multiple analytic iterations and speed up model development. Although this is the desired end goal, there also needs to be a focus on improving developer productivity. By making it easy to discover data, develop and deploy models, visualize results, and integrate with front-end applications, your organization can enable practitioners, such as analysts and data scientists, to be more effective in their respective jobs. You shouldn’t just want your big data platform to flatten the time-to-analysis curve; you should demand it, along with a rich set of accelerators, libraries of analytic functions, and a tool set that accelerates the development and visualization process.
Because analytics is an emerging discipline, it’s not uncommon to find data scientists who have their own preferred mechanisms for creating and visualizing models. They might use packaged applications, use emerging open source libraries, or adopt the “roll your own” approach and build the models using procedural languages. Creating a restrictive development environment curtails their productivity.
A big data platform needs to support interaction with the most commonly available analytic packages, with deep integration that facilitates pushing computationally intensive activities from those packages, such as model scoring, into the platform. It needs to have a rich set of “parallelizable” algorithms that have been developed and tested to run on big data. It has to have specific capabilities for unstructured data analytics, such as text analytics routines and a framework for developing additional algorithms. It must also provide the ability to visualize and publish results in an intuitive and easy-to-use manner.
Integrate and Govern All Data Sources
Over the last few years, the information management community has made enormous progress in developing sound data management principles. These include policies, tools, and technologies for data quality, security, governance, master data management, data integration, and information lifecycle management. They establish veracity and trust in the data, and are extremely critical to the success of any analytics program.
A big data platform has to embrace these principles and make them a part of the platform. It’s almost scary how many times we’ve seen data quality and governance considered to be afterthoughts that are “bolted” onto existing processes. We need to treat these principles as foundational and intrinsic to the platform itself.
This post is Part 2 of a four-part series on IBM’s big data platform. In this part I will describe the key analytical capabilities that an enterprise needs to put in place to stay competitive in today’s big data world. In subsequent parts we will look at how IBM translates these considerations into its big data platform components and technology solution stack.
Read Part 1: Key Considerations of the Analytic Enterprise
This series is an abridged excerpt from our new book, Harness the Power of Big Data.