
Big Data, Big Discovery

Senior Solution Consultant, IBM

The need to innovate and stay ahead of customer demands is more pressing than ever. IDC estimates that “in 2012 the digital universe will grow to 2.7 zettabytes.” As customers and the market as a whole generate data, companies are compelled to capture and analyze an ever-greater share of the total volume. To drive their innovation and customer experience initiatives, companies must uncover the answers and insights trapped in data repositories on both sides of the firewall. That is a daunting challenge in itself, and it is compounded by the fact that roughly 80% of the data across these repositories is complex and unstructured.

Part of the challenge of working with unstructured data in a big data environment is getting a handle on exactly what types of data you have available. Simply moving everything in bulk into Hadoop clusters and data warehouses is not a viable approach. Successful big data implementations take a phased approach, and deciding which data to roll into your big data platform is part of that process. This data exploration phase is critical to understanding what data exists, what is missing and how the data ties to the use case scenarios most important to the business.
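
To make this exploration step concrete, here is a minimal Python sketch that inventories a file share by extension, tallying counts and raw volume. The mount point is a hypothetical placeholder, and a real exploration phase would also profile databases and content management systems.

```python
from collections import Counter
from pathlib import Path

def inventory(root: str) -> None:
    """Tally file counts and total bytes by extension under a directory tree."""
    counts: Counter = Counter()
    sizes: Counter = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "<none>"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    for ext, n in counts.most_common():
        print(f"{ext:10} {n:8} files  {sizes[ext] / 1e9:8.2f} GB")

inventory("/mnt/shared-docs")  # hypothetical repository mount point
```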

Discovery Through Search

Search is an important tool in this exploration and discovery process. Data analysts must be able to execute queries over a range of repositories and aggregate the results in a meaningful way. A unified search interface, such as the one in InfoSphere Data Explorer, enables this aggregate search capability.
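
InfoSphere Data Explorer provides this capability out of the box; purely as an illustration of the underlying federated pattern, the Python sketch below fans a query out to several repository connectors in parallel and merges the results. The connector functions and their result format are hypothetical stand-ins, not an actual product API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical connectors: each queries one repository and returns a list
# of {"title": ..., "source": ..., "score": ...} result dicts.
def search_crm(query):   return [{"title": f"CRM hit for {query}", "source": "crm", "score": 0.9}]
def search_wiki(query):  return [{"title": f"Wiki hit for {query}", "source": "wiki", "score": 0.7}]
def search_files(query): return [{"title": f"File hit for {query}", "source": "files", "score": 0.8}]

CONNECTORS = [search_crm, search_wiki, search_files]

def federated_search(query):
    """Fan the query out to every repository in parallel and merge results."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda fn: fn(query), CONNECTORS)
    merged = [hit for hits in result_lists for hit in hits]
    # Naive merge: sort by each source's own relevance score. A real engine
    # would normalize scores across sources before ranking.
    return sorted(merged, key=lambda h: h["score"], reverse=True)

for hit in federated_search("contract renewal"):
    print(hit["source"], hit["title"])
```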

In addition to enabling unified search across multiple repositories, the search interface must also help users derive meaning from the returned results. Given the ever-increasing volume of data being searched, simply returning results as one long list will frustrate users.

Add Some Structure

One way to help alleviate the sheer volume of search results is through structured navigation and visualization. Results are categorized into bins that users can then use to further refine their search. But how do you define these bins? One way is to use a static set of tags that have been applied to the source documents. These tags may have been assigned manually or automatically by classification software. The search platform can then index these tags and use them to bucket content into groups at search time.
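
A minimal sketch of this bucketing step, using invented documents and tag names: count how many results fall into each tag bucket, then narrow the result set when a user selects a facet.

```python
from collections import Counter

# Invented example results, each carrying tags assigned at authoring time.
results = [
    {"title": "Q3 sales report",   "tags": ["sales", "report"]},
    {"title": "Support playbook",  "tags": ["support"]},
    {"title": "Q4 sales forecast", "tags": ["sales", "forecast"]},
]

def facet_counts(results):
    """Count how many results fall into each tag bucket."""
    return Counter(tag for doc in results for tag in doc["tags"])

def refine(results, tag):
    """Narrow the result set to documents carrying the selected tag."""
    return [doc for doc in results if tag in doc["tags"]]

print(facet_counts(results))                        # Counter({'sales': 2, ...})
print([d["title"] for d in refine(results, "sales")])
```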

This works well when examining a single repository where content creators have shown good discipline in tagging, but when we consider the big data case of highly varied content stored in multiple repositories, consistent tagging will not be the norm, and additional steps must be taken to categorize the data. The problem is exacerbated by the fact that as the size and variety of the data increase, the set of tags that can adequately cover it must be made more generic. The unfortunate side effect is that structured navigation based on such a tag set also becomes broad and generic, making it difficult for users to drill down to precise results.

For documents and content that lack quality metadata, entity extraction can help fill the void. Entity extraction is the process of automatically deriving document metadata from unstructured text. Extracting key entities such as person names, locations, dates, specialized terms and product terminology from free-form text can improve both keyword search and structured, faceted navigation.
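
IBM's platform includes its own text analytics for this step; as a generic stand-in, the sketch below uses the open-source spaCy library to pull person, location and date entities out of free-form text.

```python
import spacy

# Illustration only: spaCy stands in for the platform's text analytics.
# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Jane Doe visited the Austin data center on March 3, 2012 "
        "to review the InfoSphere deployment.")

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.label_:8} {ent.text}")
# Typical output: PERSON Jane Doe / GPE Austin / DATE March 3, 2012

# Extracted entities can be indexed alongside any manual tags to
# drive faceted navigation over otherwise untagged content.
```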

Entity extraction relies on the adoption of a controlled vocabulary or taxonomy for describing documents. This can be problematic for highly variable data sources. Defining a comprehensive taxonomy that suitably applies across varied data repositories is difficult at best. Furthermore, even if such a taxonomy can be defined, maintaining its relevance on an ongoing basis can be very time consuming and expensive. Even so, terms derived from entity extraction can be a valuable complement to existing metadata tags.

Infer Meaning

Dynamic tagging (or clustering) addresses the problems with static tag sets by inferring labels dynamically from the content itself, thus avoiding tag sets that are too general or simply unavailable. Furthermore, the nature of dynamic tagging allows for the identification of richer descriptive phrases as labels as opposed to simple keywords.

The dynamic tagging technique in IBM’s big data platform automatically organizes search results into groups of related content that are known as clusters. It uses multiple heuristics to quickly identify meaningful groups that can be concisely described, and creates these groups as search results are returned. The costs and disadvantages of taxonomy maintenance therefore do not apply. Apart from some optional cluster tuning, all classification is done on the fly, with no intellectual effort or maintenance required by the organization.
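
IBM's clustering heuristics are proprietary, but the general idea can be approximated with open-source tools. The sketch below groups result snippets with TF-IDF and k-means, then labels each cluster with its top-weighted terms as a crude analogue of the descriptive phrases a production engine would derive; the snippets are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

snippets = [
    "server outage in the Austin data center",
    "data center cooling failure report",
    "quarterly sales forecast for EMEA",
    "EMEA sales pipeline review notes",
]

# Vectorize the result snippets and group them on the fly.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(snippets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)

# Label each cluster with its highest-weighted terms.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = [terms[j] for j in center.argsort()[::-1][:3]]
    members = [s for s, c in zip(snippets, km.labels_) if c == i]
    print(f"cluster {i} [{', '.join(top)}]: {len(members)} results")
```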

There is great value in clustering beyond simply assigning tags to documents dynamically. A federated search can be configured to range over several sources, combine their results and cluster them. Even though some of these sources may have metadata associated with their content and some may not, dynamic clustering draws out common themes from the search results, allowing you to understand relationships between seemingly unrelated data sets.

The ability to navigate and visualize your data is a critical component of any big data initiative. It’s imperative that companies employ applications that facilitate data exploration across repositories without first having to move any of the data. This creates a leaner information environment with faster time to insight.


For more information

IBM big data platform - Addresses the full spectrum of big data business challenges: visualization and discovery, Hadoop-based analytics, stream computing, data warehousing, text analytics

InfoSphere Data Explorer - Discovery and navigation software that provides real-time access and fusion of big data with rich and varied data from enterprise applications for greater insight and ROI