Untangling the Definition of Unstructured Data
Big data has given rise to many ways of defining unstructured data—consider a new approach
What is unstructured data? The question seems simple. Organizations talk about it, and everybody professes to know what it is. But dig beneath the surface a bit, and the question becomes anything but simple. Start with perhaps the most rudimentary understanding of unstructured data—that it is any data not managed by a standard database management system (DBMS).
In a standard DBMS there are keys, records, attributes, and indexes, so there is little doubt that all data managed in a standard DBMS is structured. Simply stated, data that is not structured cannot be placed in a standard DBMS. Under this definition, the reverse is also taken to hold: all data not found in a standard DBMS is unstructured.
Many vendors of many products use these simple definitions of structured and unstructured data. When they call data unstructured, they mean only that it is not managed by a standard DBMS. Under this definition, that statement is correct, and it is a valid interpretation of both terms.
However, there are other interpretations of the meaning of structured and unstructured when applied to data. One definition specifies that unstructured data is everything that is textual. For many purposes, this definition can work quite well. To computer technicians, textual data does not fit into a standard DBMS, and it is therefore highly unstructured.
But some have pointed out that text—language—is anything but unstructured. In certain communities, one can rationally argue that language is highly structured. Most languages involve spelling, punctuation, sentence structure, verbs, nouns, prepositions, and many other characteristics. Many English teachers would contend that the English language is in fact highly structured. And they have a valid point. English students would be wise to agree with this premise if they want an A in their course.
A data classification approach
There are at least two very different schools of thought about what is and what is not structured data. One, as stated previously, holds that everything not in a standard DBMS is unstructured. The other holds that something is unstructured only if there is no rational way to explain its structure. Both viewpoints are perfectly rational and valid, yet they conflict with each other. And these are just two viewpoints; there are undoubtedly others on what constitutes the meaning of structured and unstructured data.
Based on some recent research, there is another, less confusing way to classify data: by how repetitive its occurrences are. Repetitive data is data in which each record appears very similar to every other record. The records are similar in size and structure, and in many cases even their content is the same. There are many examples of repetitive data, including metering data; click-stream data; telephone call records, with fields such as the time of the call, the caller’s telephone number, and the call’s length; analog data; and so on.
The converse of repetitive data, nonrepetitive data, is data in which each occurrence is unique in terms of content—that is, each nonrepetitive record is different from the others. Any similarity of record content, size, or structure that may exist among nonrepetitive data is strictly a matter of chance. There are many different forms of nonrepetitive data, and examples include emails, call center conversations, corporate contracts, warranty claims, insurance claims, and so on.
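The distinction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method given in the research the article refers to: it treats records as dictionaries, compares them by structure alone (field names and value types), and uses an arbitrary 0.9 threshold. The names `structural_signature`, `repetition_ratio`, and `classify`, along with the sample records, are hypothetical.

```python
from collections import Counter


def structural_signature(record: dict) -> tuple:
    """Describe a record by its field names and value types, ignoring the values."""
    return tuple(sorted((key, type(value).__name__) for key, value in record.items()))


def repetition_ratio(records: list) -> float:
    """Return the share of records that have the most common structural signature."""
    if not records:
        return 0.0
    counts = Counter(structural_signature(record) for record in records)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(records)


def classify(records: list, threshold: float = 0.9) -> str:
    """Label a collection; the 0.9 threshold is an arbitrary assumption."""
    return "repetitive" if repetition_ratio(records) >= threshold else "nonrepetitive"


# Made-up call-detail records: every record has the same fields and types.
call_records = [
    {"time": "09:15", "caller": "555-0100", "length_seconds": 42},
    {"time": "09:17", "caller": "555-0101", "length_seconds": 310},
    {"time": "09:21", "caller": "555-0102", "length_seconds": 7},
]

# Made-up documents: each record has its own shape.
documents = [
    {"sender": "a@example.com", "subject": "Renewal", "body": "Please find attached..."},
    {"parties": ["Buyer", "Seller"], "clauses": 14},
    {"product": "washer", "claim_text": "Stopped draining after two months.", "amount": 120.5},
]

print(classify(call_records))
print(classify(documents))
```

A structure-only comparison is the simplest possible choice. It ignores similarity of size and content, which the definition above also mentions, so a fuller version would need to measure those as well.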
The many distinctions between repetitive and nonrepetitive data are important, but perhaps the most important is the pattern of business value. With repetitive data, the typical situation is that only a few of the many records are of real business value. For example, a call-detail database can have millions of records, but only a few may be of interest to an organization. Or with click-stream data, there may be thousands of records, but only one or two may have business value. A low percentage does not mean that repetitive records have no business value. Those few records may in fact have huge business value. There just may not be many of them.
Conversely, when looking at nonrepetitive records, typically there are many that have business value. For example, in a collection of warranty claim data, almost every record likely has business value. On a percentage basis, many more nonrepetitive records with business value can be found than repetitive records.
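The percentage comparison can be made concrete with a small sketch. The counts below are invented purely to illustrate the arithmetic and are not measurements from any real system. The helper `value_share` and the `flagged` and `valid` fields are hypothetical stand-ins for whatever test an organization uses to decide that a record matters.

```python
from typing import Callable, Iterable


def value_share(records: Iterable[dict], has_value: Callable[[dict], bool]) -> float:
    """Return the fraction of records for which has_value(record) is true."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for record in records if has_value(record)) / len(records)


# Illustrative only: 1,000 call records, of which 2 are marked as interesting.
calls = [{"id": i, "flagged": i in (17, 512)} for i in range(1000)]

# Illustrative only: 10 warranty claims, of which 9 are marked as valid.
claims = [{"id": i, "valid": i != 3} for i in range(10)]

call_share = value_share(calls, lambda record: record["flagged"])
claim_share = value_share(claims, lambda record: record["valid"])

print(f"repetitive (calls): {call_share:.1%} of records have business value")
print(f"nonrepetitive (claims): {claim_share:.1%} of records have business value")
```

The share says nothing about how valuable each record is. As noted above, the two flagged call records could still be worth more to the organization than all of the claims combined.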
The distinction between repetitive and nonrepetitive data is far less confusing than the distinction between structured and unstructured data. The difference between repetitive records and nonrepetitive records is stark. A database contains either one type of record or the other, and there is little or no confusion about which type is found in the data collection. By contrast, the confusion surrounding the term unstructured has rendered it almost irrelevant in today’s world. The term is used so widely and in so many ways that its meaning has been lost, or at least badly disfigured.