Textual Analysis of Big Data: Part 2
Apply textual analytics in multiple transformation stages to process unstructured data in a consistent manner
Text analysis can advance the integration of unstructured data beyond just light indexing and pattern matching for conducting a search. The first installment of this two-part series compares search and analysis operations and the application of search for integrating unstructured information. This concluding installment offers a detailed look at textual analysis as a contrasting approach to using search.
Processing in a consistent manner
Analysis consists of multiple transformation steps, each of which needs to be run once per set of patterns, metadata terms, or context. Unlike search, which returns simple result sets of entire pages, analysis creates multiple iterations of metadata output, and these iterations build a powerful set of indexes within the textual data and its context. Also unlike search, analysis always processes data in a consistent manner.
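The idea of fixed, ordered transformation steps can be sketched roughly as follows. This is a minimal illustration, not a description of any particular product: the stage names (`normalize`, `tokenize`, `tag_terms`) and the `TERMS` set are hypothetical, and a real system would load its metadata terms from a taxonomy or rule base.

```python
import re

# Hypothetical metadata terms; a real system would load these from a taxonomy.
TERMS = {"contract", "liability", "obligation"}

def normalize(doc):
    """Stage 1: collapse whitespace and lowercase the text."""
    text = re.sub(r"\s+", " ", doc["text"]).strip().lower()
    return {**doc, "text": text}

def tokenize(doc):
    """Stage 2: split the normalized text into word tokens."""
    return {**doc, "tokens": re.findall(r"[a-z']+", doc["text"])}

def tag_terms(doc):
    """Stage 3: record which metadata terms occur, with their token positions."""
    hits = [(i, tok) for i, tok in enumerate(doc["tokens"]) if tok in TERMS]
    return {**doc, "metadata": hits}

# The stages always run in the same order, so the same input
# always yields the same metadata output.
STAGES = [normalize, tokenize, tag_terms]

def analyze(text):
    doc = {"text": text}
    for stage in STAGES:
        doc = stage(doc)
    return doc
```

Because every document passes through the same stages in the same order, rerunning `analyze` on the same text produces identical metadata, which is the consistency property described above.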
Consider a popular example from natural language processing (see Figure 1). The sentence “I never said she stole my money” demonstrates the importance of stress: it can take seven different meanings, depending on which of its seven words is stressed in speech. As a result, the sentence presents an inherent challenge for a natural language processor tasked with parsing it.
Figure 1. Seven different meanings of a sentence, depending on which word is stressed when said aloud
A search for this pattern returns all matching statements, and a further search for the extended meaning is needed to interpret them. By processing this example through a text-analysis platform, a context-oriented result set can be created that provides not only the result, but also the associated context, which is far more useful. The need to transform data before it becomes useful for analytics and reporting is not a new consideration. The data warehouse has always been designed to process data in this fashion, which is known as an extract-transform-load (ETL) process. Extending this analysis to text creates a powerful textual ETL concept.
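The difference between a plain result and a result with context can be illustrated with a small sketch. The functions `search` and `analyze_with_context`, and the `window` parameter, are hypothetical names chosen for this example; the "context" here is simply the neighboring words, which is far simpler than what a full text-analysis platform would capture.

```python
import re

def search(docs, term):
    """Plain search: return the ids of documents that contain the term."""
    return [doc_id for doc_id, text in docs.items() if term in text.lower()]

def analyze_with_context(docs, term, window=2):
    """Analysis: return one row per occurrence, with the surrounding words."""
    rows = []
    for doc_id, text in docs.items():
        words = re.findall(r"\w+", text.lower())
        for i, word in enumerate(words):
            if word == term:
                rows.append({
                    "doc": doc_id,
                    "term": term,
                    "left": words[max(0, i - window):i],
                    "right": words[i + 1:i + 1 + window],
                })
    return rows

docs = {
    "a": "I never said she stole my money",
    "b": "She stole the show",
}
matches = search(docs, "stole")               # document ids only
rows = analyze_with_context(docs, "stole")    # one row per hit, with context
```

The search result only says which documents matched, while each analysis row carries enough context to distinguish a theft from a figure of speech.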
This need for transformation and integration of text presents some interesting challenges. One challenge is the size of the data to be transformed. For example, assume the Internet is a data set. Is it possible to transform and analyze all the text found on the Internet? It’s obvious that doing so is not practical or feasible. In such a situation, Internet end users primarily rely on search and can use a subset of data from the result set for deeper analysis.
But there are other data sets such as enterprise data that are made up of large volumes, have complex formats, and have multiple contexts—yet they lend themselves to the rigors of text analysis and processing. For example, contracts exist across the different business divisions in an organization such as purchasing, supply chain, inventory management, logistics, transportation, and human resources. Each of these contracts has a different purpose, and there may be many contracts of a type that can provide insights beyond just start and end dates. Insights include legal terms and conditions with applied context, liabilities, obligations, and much more. After analysis, such text will create a powerful and rich metadata output with context that can be simply integrated into an ecosystem for decision-support systems.
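As a rough sketch of how contract text might be turned into metadata with context, the example below extracts dates and obligations as (division, key, value) rows. Everything in it is an assumption made for illustration: the ISO date format, the trigger phrases ("effective as of", "expires on", "shall", "must"), and the function name `extract_contract_metadata`. Real contracts would need far richer rules.

```python
import re
from datetime import date

DATE = r"(\d{4})-(\d{2})-(\d{2})"

# Hypothetical trigger phrases for the start and end of a contract term.
PATTERNS = {
    "start_date": re.compile(r"effective\s+(?:as\s+of\s+)?" + DATE, re.I),
    "end_date": re.compile(r"(?:terminates|expires)\s+on\s+" + DATE, re.I),
}

# Hypothetical rule: a sentence fragment following "shall" or "must" is an obligation.
OBLIGATION = re.compile(r"\b(shall|must)\s+([^.]+)\.", re.I)

def extract_contract_metadata(division, text):
    """Return (division, key, value) rows extracted from one contract."""
    rows = []
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            year, month, day = map(int, match.groups())
            rows.append((division, key, date(year, month, day).isoformat()))
    for match in OBLIGATION.finditer(text):
        rows.append((division, "obligation", match.group(2).strip()))
    return rows

sample = (
    "This agreement is effective as of 2015-01-01 and expires on 2016-12-31. "
    "The supplier shall deliver parts within 30 days. "
    "The buyer must pay within 45 days."
)
contract_rows = extract_contract_metadata("purchasing", sample)
```

Tagging each row with the business division keeps the context attached to the extracted value, so the same rule set can be reused across purchasing, logistics, or human resources contracts.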
Other challenges include the variety of formats, the volumes of text, the ambiguous nature of the data itself, and lack of formal documentation, to name a few. But once the challenges are addressed, the output from such an analysis can be powerful in creating a huge visualization platform for looking into text and unstructured data within the enterprise. This platform is where organizations can leverage the data that has been stored for years on content management platforms for useful output of trends and behaviors. Key differences between result sets produced by a search approach and a text-analytics system are shown in Figure 2.
Figure 2. Primary contrasting characteristics of search and text analysis approaches to integrating unstructured data
|Approach|Characteristic|
|---|---|
|Search|Oriented to process informational needs of a single user query|
|Search|Under normal circumstances, outputs a proprietary and temporary result set for a specific end user that cannot be shared|
|Search|Transformation rules are repeated with every query and are minimalistic|
|Search|The result set cannot be integrated with a database management system (DBMS)|
|Search|Processing cannot scale for large and complex operations|
|Search|Context-based searching adds significant overhead|
|Text analysis|Defined by end users for processing with business rules, such as ETL|
|Text analysis|The result set is a key-value column pair often stored in a relational database management system (RDBMS)|
|Text analysis|Result sets can be used for further analytical processing and stored as snapshots for repeated processing|
|Text analysis|Transformation of data and associated context is repeatable in multiple processing cycles|
|Text analysis|Text of different languages for global organizations can be stored in the same result database, based on metadata integrations and rules|
|Text analysis|Text analysis can scale easily based on infrastructure capabilities|
Based on this discussion, search is good for finding data on an ad hoc basis in a large set of data. Analysis is good for creating a platform that can be used repeatedly against a large but finite amount of textual data, such as the data used by a corporation.
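The key-value result set listed in Figure 2 can be sketched with Python's built-in `sqlite3` module. The table name `text_metadata`, its columns, and the `snapshot` label are assumptions made for this example, and SQLite stands in for whatever RDBMS an organization actually uses.

```python
import sqlite3

def load_results(conn, snapshot, rows):
    """Store (doc_id, key, value) rows under a snapshot label."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS text_metadata ("
        " snapshot TEXT, doc_id TEXT, meta_key TEXT, meta_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO text_metadata VALUES (?, ?, ?, ?)",
        [(snapshot, doc_id, key, value) for doc_id, key, value in rows],
    )
    conn.commit()

def count_by_key(conn, snapshot):
    """Example of repeatable downstream analysis: count rows per key."""
    cur = conn.execute(
        "SELECT meta_key, COUNT(*) FROM text_metadata"
        " WHERE snapshot = ? GROUP BY meta_key ORDER BY meta_key",
        (snapshot,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
load_results(conn, "2015-Q1", [
    ("c1", "liability", "capped"),
    ("c1", "obligation", "deliver"),
    ("c2", "obligation", "pay"),
])
summary = count_by_key(conn, "2015-Q1")
```

Because the rows persist under a snapshot label, the same query can be rerun later or compared against a newer snapshot, which a temporary search result set does not allow.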
To perform text analysis and deep text mining, process the text rather than extend a search engine or appliance. A robust text-analysis system can provide the following features:
- Antonyms, homonyms, and synonyms
- Business rules integration
- Document fracturing and processing
- Integration with taxonomies
- Reprocessing capabilities
- Spelling correction
Each of these features allows for processing large volumes of textual data and creating the resulting database to support that processing. This database can be used with search operations to create guided search and navigation and can be extended to machine learning using a combined search-and-analysis platform.
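A few of the listed features (synonym handling, spelling correction, and taxonomy integration) can be sketched together using the standard library's `difflib`. The vocabulary, synonym map, taxonomy, and the 0.8 similarity cutoff are all hypothetical values chosen for illustration, and fuzzy matching with `difflib` is a stand-in for a real spelling corrector.

```python
import difflib

# Hypothetical controlled vocabulary, synonym map, and taxonomy.
VOCABULARY = ["contract", "liability", "obligation", "supplier"]
SYNONYMS = {"agreement": "contract", "vendor": "supplier"}
TAXONOMY = {
    "contract": "legal",
    "liability": "legal",
    "obligation": "legal",
    "supplier": "supply chain",
}

def normalize_term(word):
    """Map a raw word to a vocabulary term via synonyms, then fuzzy spelling."""
    word = word.lower()
    word = SYNONYMS.get(word, word)
    if word not in VOCABULARY:
        close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
        if close:
            word = close[0]
    return word

def classify(word):
    """Return the normalized term and its taxonomy category (None if unknown)."""
    term = normalize_term(word)
    return term, TAXONOMY.get(term)
```

For instance, `classify("Vendor")` resolves through the synonym map, and `classify("liabilty")` resolves through the fuzzy match, so both variants land on the same taxonomy entries as their canonical terms.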
The major advantage of text analysis is document midpoint reprocessing, which is the ability to track changes as they occur within the text environment in a manner similar to tracking changes in a dimension. This benefit represents highly powerful output that makes analysis a better alternative than search. And this concept can be extended very easily to emails, Microsoft Excel spreadsheets, and other document types.
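One way to picture tracking text changes "in a manner similar to tracking changes in a dimension" is a version history with validity dates, in the style of a slowly changing dimension. The sketch below is an interpretation rather than a definition of midpoint reprocessing: the function `record_version`, the record fields, and the use of a SHA-256 digest to detect changes are all assumptions.

```python
import hashlib

def record_version(history, doc_id, text, as_of):
    """Append a new version row if the text changed; return True if it did."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    current = next(
        (row for row in history
         if row["doc_id"] == doc_id and row["valid_to"] is None),
        None,
    )
    if current is not None and current["digest"] == digest:
        return False  # unchanged, so nothing needs reprocessing
    if current is not None:
        current["valid_to"] = as_of  # close out the previous version
    history.append({
        "doc_id": doc_id,
        "digest": digest,
        "valid_from": as_of,
        "valid_to": None,
    })
    return True

history = []
first = record_version(history, "c1", "Payment due in 30 days.", "2015-01-01")
same = record_version(history, "c1", "Payment due in 30 days.", "2015-02-01")
changed = record_version(history, "c1", "Payment due in 45 days.", "2015-03-01")
```

Only documents whose digest changes trigger a new version, so downstream analysis can be rerun for just those documents while earlier versions remain available for comparison.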
Taking a processing approach for the right purpose
Search and text analysis serve different purposes in the integration of unstructured data, and both can be leveraged effectively for processing it. Search can be used for data discovery in early stages, and text analysis can be used for detailed analysis and downstream analytical processing. However, organizations should ensure they do not use search as a substitute for textual analytics.