#BigDataMgmt chat recap: Search meets big data

Product Marketing Leader, IBM Watson Group, IBM

Most organizations are feeling the pressure to move big data initiatives past the discussion phase and toward well-planned projects. To what extent can a complete view of all of the available information help organizations gain deeper insights and make better decisions? Special guest John Mancini (@jmancini77), president of the Association for Information and Image Management (AIIM), IBM’s big data evangelist James Kobielus (@jameskobielus) and I joined a highly engaged group on #bigdatamgmt chat to kick this topic around for an hour.

How important is search in delivering on the promise of big data?

As John Mancini pointed out, 65 percent of respondents to an AIIM survey reported that their organizations have “disorganized content” and 81 percent reported “limited search capability.” That’s a deadly combination. In such a chaotic world, search becomes fundamental to delivering on the promise of big data, with a premium on powerful ad hoc query capability. Basic insight: search is where the average user gains benefit from big data.

What are the most important features in big data analytic applications?

Confidence in search results, scalability, faceted navigation, connectivity and an intuitive user interface all received mention, along with a way to visually correlate data from structured and unstructured sources. Throughout much of the discussion, several participants focused on metadata, along with semantic technologies, as a key enabler for navigation and categorization.

What is the role of semantic tech in enabling high-powered big data search?

As James Kobielus noted, semantic technology enables search by concepts and connections, taking users past the limitations of keyword searching. The connection to metadata, which was a recurring theme, was obvious. In fact, several participants felt that semantic technologies can address a host of challenges, including a lack of good metadata. Depending on the maturity level of an organization, semantic technologies can provide a range of benefits, from improving search to enabling fusion of unstructured content with structured business data and processes.

How must search adapt to address greater big data volume, velocity and/or variety?

Some adaptations are architectural. For example, under the traditional model, search platforms crawl and pull data from the sources they target. Big data requirements, and cloud requirements for that matter, demand at the very least the option to push data to the search system for indexing. This lets the systems being indexed, which manage or store the data, regulate when and how the content is transmitted. Elastic scalability is another requirement if you’re setting up a search system to keep up with extreme volume. As James Kobielus pointed out, search also has to adapt to new data types.
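To make the push model concrete, here is a minimal Python sketch of a search index that accepts documents pushed by the systems that own them, rather than crawling those systems itself. The `PushIndex` class, its `push` and `search` methods and the sample documents are all hypothetical names invented for illustration; they do not reflect any particular search product's API.

```python
from collections import defaultdict


class PushIndex:
    """Toy in-memory inverted index that accepts pushed documents.

    Hypothetical illustration only; not any real product's API.
    """

    def __init__(self):
        # Maps each term to the set of document ids that contain it.
        self.postings = defaultdict(set)

    def push(self, doc_id, text):
        # Called by the source system whenever it decides content is ready,
        # so the source controls when and how content is transmitted.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return the ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = PushIndex()
index.push("doc-1", "quarterly sales report")
index.push("doc-2", "sales forecast for next quarter")
hits = index.search("sales report")
```

In a crawl-and-pull design the index would decide when to fetch content; in this sketch the caller of `push` makes that decision, which is the regulation point described above.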

What search features are built into Apache Hadoop or provided by open-source initiatives?

This question didn’t elicit a lot of comment from the participants aside from noting the various ways that you can get data back out of Hadoop. 

It’s interesting to note the close historical linkage of Hadoop and search.

What is the best big data approach for high performance search? Hadoop, NoSQL, in-memory or other?

As James Kobielus pointed out, it depends.

My own view is that, when it comes to search for people, Hadoop and related technologies are better reserved for background processing to prepare and optimize data for search, while search engines are designed for end users.

What are the special challenges of search when applied to streaming data?

The paradigm for searching streaming data is very different from the one for data at rest, which is what traditional search engines handle. Streaming data needs to be continuously analyzed with specialized tools like IBM InfoSphere Streams, which scan and correlate data as it arrives. When a particular condition is met, you have a hit. It’s more of a filtering process, aimed at winnowing out relevant data or detecting a particular condition. As James Kobielus tweeted:

As John Mancini pointed out, streaming data is rarely linked to systems of record.

This disconnect presents a challenge to actually leveraging that data and a challenge for data scientists.
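The filtering paradigm for streaming data described earlier in this section can be sketched in a few lines of Python. This is an illustration of the general idea only, not how IBM InfoSphere Streams is programmed; `stream_hits`, `sensor_feed`, the event fields and the temperature threshold are all hypothetical.

```python
def stream_hits(events, condition):
    """Yield each event that satisfies the condition as it arrives."""
    for event in events:
        if condition(event):
            # A "hit" is an event matching the standing condition.
            yield event


def sensor_feed():
    # Hypothetical stand-in for a live feed; a real stream would not end.
    yield {"sensor": "a", "temp": 71}
    yield {"sensor": "b", "temp": 104}
    yield {"sensor": "a", "temp": 99}
    yield {"sensor": "c", "temp": 110}


# The condition is fixed up front and the data flows past it,
# the reverse of running an ad hoc query against stored data.
hits = list(stream_hits(sensor_feed(), lambda e: e["temp"] > 100))
```

Because the generator inspects one event at a time, nothing has to be stored or indexed before a hit can be reported.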

What does the future hold for innovations in big data search?

My own view is that search will morph into more conversational modes (think Apple's Siri, Google Now and IBM Watson). While dramatically different under the covers, these systems seek to provide answers, recommendations or actions in response to a natural language request. As Marko Pitkanen commented:


For me, the Tweet of the day came from our special guest, John Mancini. Short and sweet:

Keeping that in mind will add balance to any big data endeavor.

Join next week's #BigDataMgmt chat on February 19 at 12:00 p.m. EST, where the discussion will center on "Evolving integration and governance for the era of big data."