Unstructured and structured data versus repetitive and non-repetitive data

Owner, Forest Rim Technology

Many people classify data as structured or unstructured. But classifying data in this manner presents a problem because the meaning of unstructured is very unclear. What may be unstructured to one audience can be quite structured to another audience.

Take text, for example. Ask any technician if text is unstructured, and the technician will likely assure you that text is definitely unstructured. But ask an English teacher if text is unstructured, and the English teacher is quite likely to tell you that text is about as structured as it gets.

Who is right, the technician or the English teacher?

Classifying record structure

The term unstructured carries many other ambiguities as well. Because of those ambiguities, a better way to classify data is as repetitive or non-repetitive. What is repetitive data? It is data with a structure that repeats, usually in large numbers.

Examples of repetitive data abound: telephone call detail records, clickstream records, log tape data, metering data, banking ATM data and so forth. Consider a record of telephone call details. Each record has the date and time of the telephone call, the originator of the call, the person to whom the call was made and the length of the call. Structurally, one record looks like every other record, and there are a lot of records. ATM records are very similar. The structure of each ATM record is identical to that of every other record. The only difference from one detailed record to the next is the content of the record.
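To make the idea of a repeating structure concrete, here is a minimal sketch in Python. The class name, the field names and the sample phone numbers are assumptions chosen for illustration; only the four pieces of information (date and time, originator, recipient and length of the call) come from the description above.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CallDetailRecord:
    """Hypothetical layout for one telephone call detail record."""
    started_at: datetime   # date and time of the call
    originator: str        # who placed the call
    recipient: str         # who was called
    duration: timedelta    # length of the call


# Made-up sample records: the content differs, the structure does not.
records = [
    CallDetailRecord(datetime(2016, 9, 15, 9, 30), "555-0100", "555-0199",
                     timedelta(minutes=4)),
    CallDetailRecord(datetime(2016, 9, 15, 9, 31), "555-0142", "555-0117",
                     timedelta(seconds=45)),
]


def structure_of(record) -> tuple:
    """Return the ordered field names, ignoring the values they hold."""
    return tuple(f.name for f in fields(record))


# Collect the distinct structures found across all records.
structures = {structure_of(r) for r in records}
print(len(structures))  # a single structure is shared by every record
```

However many records are added, the set of distinct structures stays at one. That is the property that makes the data repetitive.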

Non-repetitive data, on the other hand, is data in which little or no repetition of the record structure is evident from one record to the next. If any two records happen to have an identical structure, it is accidental.

Some common examples of non-repetitive data are email, call center data, restaurant and hotel feedback data, help desk conversations, warranty claim data, insurance claim data and so on. Consider email. When someone writes an email, no one is telling the author what to write or how to write it. The author is free to write as much or as little as desired. That person can also write the email in any language, use slang and write in complete or incomplete sentences. If any two emails happen to have exactly the same content, it is purely a coincidence.

The same considerations apply to conversations. Conversations can be short or long. They can be gentle and polite or mean-spirited. And they can be in Spanish, English or any other language.

Comparing business value

Several very interesting aspects emerge when classifying data this way. One aspect is that there is no confusion. Data is either repetitive or it is not repetitive. No middle ground exists between the two. For this reason alone, classifying data as repetitive or non-repetitive is worthwhile.

But there is another very important and far from obvious difference between the two classifications of data. Very high business value is associated with non-repetitive data, while very low business value is associated with repetitive data. The implications of this observation are quite far reaching and deserve explanation.

Consider how many repetitive records exist. The number depends on the environment, but in many environments it is enormous. Now consider how many of those records have real value. Take telephone call detail records. In a single day, hundreds of millions of records can be created. Every time someone picks up a telephone and gets a dial tone, a call record is created. Now, how many of those phone calls have value? Suppose that in a day’s time three phone calls are related to terrorism. Within that time frame, then, three of perhaps 100 million records are of interest. Those three records are certainly of very real importance, but the percentage of records that have that value is very low.

Usually, there are fewer non-repetitive records than repetitive records. But typically, every non-repetitive record has business value. Some records have high business value, and some have low business value, but each one has some.

Take call center records as an example. A call center record is a transcript of the conversation between an 800-number operator and a customer or a prospective customer. Every conversation between the operator and the customer or prospect has business value. If 5,000 calls are made in a day, then those 5,000 calls have business value. The business value ratio of the call center data therefore is 5,000:5,000, or one to one, which is substantially different from the 3:100 million ratio in the previous telephone call detail record example.
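The two ratios can be put side by side with a little arithmetic. The sketch below uses the illustrative figures from the examples (three of 100 million call detail records, 5,000 of 5,000 call center records). The function name is an assumption made for this sketch, not an established metric.

```python
from fractions import Fraction


def value_ratio(valuable_records: int, total_records: int) -> Fraction:
    """Share of records that carry business value, kept as an exact fraction."""
    return Fraction(valuable_records, total_records)


# Illustrative figures taken from the two examples in the text.
call_detail = value_ratio(3, 100_000_000)   # repetitive data
call_center = value_ratio(5_000, 5_000)     # non-repetitive data

print(f"call detail records: {call_detail}")
print(f"call center records: {call_center}")

# How many times higher the call center ratio is than the call detail ratio.
print(f"difference factor: {float(call_center / call_detail):,.0f}")
```

Under these assumed figures, the share of valuable records in the call center data is roughly 33 million times higher than in the call detail data.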

This pattern, a low ratio of business value for repetitive data and a very high ratio for non-repetitive data, recurs again and again. As a result, the significance of repetitive data’s business value is very different from the significance of non-repetitive data’s business value.

Catch W.H. Inmon speaking on this topic at the Big Data Seminar 2016, 15–16 September 2016 at the Hotel Pennsylvania in New York, New York. The event is sponsored by Data Management Forum.