High Powered Profiling: Emerging Best Practice for Big Data

Big Data Evangelist, IBM

If you're not careful, your big data investment can degenerate from a strategic asset into an unmanageable burden.

I'm not speaking about the potential complexity and cost of the underlying big data platform, though those are important concerns. I'm referring to the unfortunate tendency of many data professionals to throw gobs of data into their “Ha-dump” without any clear idea of when, how, or why they might need to use it.

Visibility is critical. Human beings can't effectively consume petabytes of data without high-powered tools for looking deeply into diverse data sources and profiling them all up front. Typically, profiling involves discovering, indexing, categorizing, searching, and visualizing data across diverse internal and external sources, including your enterprise data warehouses (EDWs) and data marts, often in federated, distributed, heterogeneous environments.
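To make the discover, index, and search steps concrete, here is a minimal sketch in Python. It is only an illustration under simple assumptions: each source is a named collection of text records, and the function names (`tokenize`, `profile_sources`, `search`) are hypothetical, not part of any product mentioned in this post. The point it shows is that a profile keeps summary metadata and an index of terms, not a copy of the data itself.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def profile_sources(sources):
    """Build a lightweight profile of several sources.

    `sources` maps a source name to an iterable of text records
    (an assumption made for this sketch). Only summary statistics
    and a term index are kept; the records are not stored.
    """
    stats = {}
    index = defaultdict(set)  # term -> names of sources containing it
    for name, records in sources.items():
        terms = Counter()
        count = 0
        for record in records:
            count += 1
            terms.update(tokenize(record))
        stats[name] = {
            "records": count,
            "top_terms": [term for term, _ in terms.most_common(3)],
        }
        for term in terms:
            index[term].add(name)
    return stats, index


def search(index, query):
    """Return the names of sources that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches
```

With a profile like this in hand, a question such as "which sources mention returned orders?" can be answered from the index alone, before deciding whether any of the underlying data is worth loading into a big data platform.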

High-powered profiling is a critical, but still emerging, big data best practice. Without the ability to assess federated data by its relevance to your needs, you're likely to dump far more of it into Hadoop and your other big data platforms than you need or can afford to persist and manage. While you're biding your time waiting for a clue to emerge on how you might use all this data, you're pouring money into the servers, storage, software, and other resources needed to hold all those precious petabytes in cold storage.

To start realizing big data value and controlling costs, the key is to integrate profiling features into your environment from the very start. This is the pivotal role played by the technology from Vivisimo, a groundbreaking solution provider that IBM has signed a definitive agreement to acquire. In much the same way you've used data profiling in conjunction with your EDW and business intelligence programs, you'll need similar tools for the unstructured sources at the heart of Hadoop and other big data initiatives, and that's where Vivisimo shines. Vivisimo allows you to discover and profile sources, ranging from structured to unstructured, without having to move them, and without having to load and store any of the profiled data in your big data platform.

IBM will integrate Vivisimo's technology into our big data platforms going forward. I'll leave it to the various product teams to bring you up to speed when we're ready to announce our detailed roadmaps.

