Textual Analysis of Big Data: Part 1
Integrate unstructured data in big data repositories through enterprise-scale search and textual analysis approaches
A debate is raging in the industry, fanned by the adoption of big data. It comes down to two simple questions: Should enhanced search techniques be created? Or should text analysis be applied more deeply to integrate unstructured data? The simple answer to both questions is “yes,” but that answer hides layers of complexity. The first installment of this two-part series compares search and analysis approaches and examines the application of search for integrating unstructured information. The second installment then takes a look at a textual analysis approach.
Editor’s note: This article by Krish Krishnan, founder and CEO at Sixth Sense Advisors, Inc., is offered for publication in association with the Data Management Forum’s Big Data and Enterprise Architecture 2014 conference November 19–21, 2014 at the Crystal City Hilton hotel. The conference features Bill Inmon, president and CTO, Inmon Consulting, and James Kobielus, big data evangelist at IBM, and editor in chief for IBM Data magazine.
Comparing search and textual analysis
At a fundamental level, both search and analysis engines operate on textual data, but that is where the similarity ends. A search typically involves looking for patterns and presenting the findings in short order to the end user. There is no further transformation to the text. Analysis, on the other hand, encompasses the discovery of the pattern, which is akin to search; but more importantly, transformations are applied to the text to create a meaningful outcome. Analysis assumes that text must be integrated and transformed before it can be analyzed.
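The distinction can be illustrated with a minimal Python sketch. It is not taken from any particular search or analysis product: the contract snippets, the function names, and the notice-period pattern are all hypothetical and chosen only for illustration. The search function reports which documents match a pattern and leaves the text alone, while the analysis function starts from the same hits and transforms them into structured records.

```python
import re

# Hypothetical contract snippets, invented for illustration only.
DOCS = {
    "c1": "Either party may terminate this contract with 30 days notice.",
    "c2": "Termination of the contract requires 90 days written notice.",
    "c3": "Payment is due within 45 days of the invoice date.",
}

# Assumed pattern for a notice period; real contracts vary far more than this.
NOTICE_RX = re.compile(r"(\d+)\s+days\s+(?:written\s+)?notice", re.IGNORECASE)


def search(pattern, docs):
    """Search: report which documents match; the text is not transformed."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if rx.search(text)]


def analyze(pattern, docs):
    """Analysis: find the pattern as search does, then transform each hit
    into a structured record that can be integrated with other data."""
    records = []
    for doc_id in search(pattern, docs):
        match = NOTICE_RX.search(docs[doc_id])
        if match:
            records.append({"doc_id": doc_id, "notice_days": int(match.group(1))})
    return records


print(search("terminat", DOCS))   # document identifiers only
print(analyze("terminat", DOCS))  # structured records derived from the text
```

Even in this toy form, the analysis path does strictly more work per hit than the search path, which is the cost the rest of this article is concerned with.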
This advanced treatment of text is where complexities arise, and the field, though backed by decades of algorithms, research and development, and published theses, continues to be nascent and niche. The fundamental characteristic of text is best described with the adjective “erose,” not to be confused with “verbose.” The word erose, which derives from the Latin erosus, means “irregular, uneven; specifically, having the margin irregularly notched (an erose leaf)”* and is used mainly in botany to describe the leaves of a plant.
Characterizing text as erose is a way of saying that text is long, complex, and unpredictable. Textual data is often a combination of words and phrases that form contextual statements, which may contain repeatable patterns. That repeatability can itself differ based on context, even within a single document. In discussions of unstructured data, this lack of repeatability and the associated ambiguity are often what set text analysis and its outcomes apart. Structured data is the contrast: its structured and formatted storage architecture gives it great repeatability, which lends itself well to integration and analytics.
Applying search for unstructured integration
Given the search infrastructure and algorithms already available, one can reasonably ask of any effort to integrate unstructured data, “Why not just extend search outputs?” In other words, why create a separate text analysis platform? Despite attempts at that tactic, including integration and transformation as part of a search is not a good approach.
Search engines and enterprise search appliances slow down markedly when integration and transformation are added to their normal workload. For example, assume that every end-user query requires 10,000 searches against a contract database on a content management platform. Each search transaction creates operational structures and returns quick hits on a set of patterns as its output. Adding analysis-style transformation introduces great operational inefficiencies to this exercise. The critical reason is that analysis requires establishing clarity and context around the unstructured information, and both of these operations are highly complex and processing intensive. That additional work, repeated for every search, slows the search task immensely.
Search engines do a lot of pattern matching; metadata-based indexing that draws on taxonomies and ontologies; and large-scale distributed data processing. Metadata and patterns are nimble, agile techniques for transforming the minimal data required for search processing, but these practices do not scale to support the complex nature of unstructured data analysis. Searches are designed to process patterns for every end-user query and are by design inconsistent: no two end users are likely to search for the same pattern at a given time. As a result, the same algorithms are replayed over and over for many types of data patterns. Those patterns have short lifecycles, and the processing remains efficient despite its inconsistencies.
These reasons are key to why search is not the best option for analyzing unstructured data, but they are not the only ones. Analysis of text requires a lot of additional deep processing, including spelling correction, handling of alternate spellings, synonym resolution, user-defined rules, and much more.
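To give a sense of what that additional processing involves, the following Python sketch chains the four steps named above. It is a simplified illustration under stated assumptions, not a description of how any text analysis product works: the vocabulary, the spelling and synonym tables, and the “net 30” rule are hypothetical, and spelling correction is approximated with the standard library's difflib.get_close_matches rather than a real spell checker.

```python
import difflib
import re

# All tables below are hypothetical and exist only for illustration.
VOCABULARY = ["contract", "agreement", "termination", "payment", "notice", "color"]
ALTERNATE_SPELLINGS = {"colour": "color"}   # regional variant -> canonical form
SYNONYMS = {"agreement": "contract"}        # synonym -> preferred term
USER_RULES = [
    # Assumed user-defined rule: rewrite "net 30" style terms into one token.
    (re.compile(r"\bnet\s+(\d+)\b"), r"payment_due_\1_days"),
]


def correct_spelling(token):
    """Replace a token with its closest vocabulary word, if one is close enough."""
    if token in VOCABULARY:
        return token
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token


def normalize(text):
    """Apply user rules, alternate spellings, spelling correction, and synonyms."""
    text = text.lower()
    for pattern, replacement in USER_RULES:
        text = pattern.sub(replacement, text)
    tokens = re.findall(r"[a-z0-9_]+", text)
    normalized = []
    for token in tokens:
        token = ALTERNATE_SPELLINGS.get(token, token)
        token = correct_spelling(token)
        token = SYNONYMS.get(token, token)
        normalized.append(token)
    return normalized


print(normalize("The agreemnt requires notice of terminaton"))
print(normalize("Colour: net 30"))
```

Every token passes through several lookups and a fuzzy comparison against the vocabulary, so the cost grows with both the length of the text and the size of the reference tables. Running this kind of pipeline inside each search transaction is the overhead described in the preceding paragraphs.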
Part 2 of this series looks at the details of an analysis approach.
Please share any thoughts or questions in the comments.
* Definition from the Merriam-Webster website at m-w.com.