Garbage in, garbage out

Is this approach still sustainable for clean data?

Executive Architect, IBM

Early computer programs are no longer around, but the data they captured and manipulated might well still exist, stored on file systems somewhere, perhaps being used to make business decisions even now. What’s more, the importance of data produced by computer programs has only increased with each generation of computing. Consider, for example, the Passenger Name Record, or PNR, that identifies passenger bookings with an airline. This six-character alphanumeric code conveys no obvious meaning to travelers, but it is a crucial piece of information nonetheless, having been integral to the travel and transportation business since the 1960s.

Rescuing data from the garbage heap of history

Not surprisingly, then, a primary principle of data processing is GIGO: garbage in, garbage out. Processing invalid data produces useless results. In moving along a continuum that includes data processing, information processing, distributed processing, personal processing, analytical processing and cognitive processing, one constant remains: We must understand the data that we process. Failure to understand the data input into a process translates into failure to understand the results of that process.

The first few generations of data processing systems could capture and process only a limited amount and variety of data. Technical professionals could define parameters and values of captured data with relative ease, but business professionals encountered great difficulty in understanding what was being captured. The tools used by the technical professionals of the time were blunt instruments, hard pressed to capture the nuances required by business professionals. This hindrance to communication between information technology and business personnel about data and process has left us a heritage of mismatched data artifacts and ambiguous, even contradictory, business processes. In the modern business environment, both IT organizations and business organizations are aware of this problem but struggle to overcome it nonetheless, attempting, with only limited success, to fix or eliminate multiple generations of systems on both sides.

Garbage in, financial crisis out

The volume of data captured and processed in the modern business environment is growing exponentially, with little sign of slowing, but organizations’ ability to understand that data has lagged behind their ability to process it. We need look no further than the subprime mortgage crisis for an example: Although analysis of tremendous amounts of data identified certain mortgage bonds, and the insurance policies written on those bonds, as very safe, that analysis dramatically understated the risks posed by such bonds.

Because the mortgage application process was unable to sufficiently validate data about claimed income, most people who bought bonds had no idea whether a particular mortgage had been granted to a buyer whose level of income should have raised concerns about his or her long-term ability to pay. Not surprisingly, without access to data that had been turned into truly useful information, garbage in produced garbage out—in short, financial crisis.

Recycling your garbage

How can you break free of the GIGO cycle? Begin by assuming that your data, whether structured or unstructured, does not come to you clean, and that includes the data you acquire through the Internet of Things. Such an assumption forces you to implement processes that help you understand what data you need to achieve your desired business outcomes. Moreover, it means thinking about how governance processes turn data into information, and then finding ways of recycling your garbage into information by capturing the context and meaning of your data in business terms accessible to everyone who must use that information.
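As a rough illustration of what "assume the data is not clean" can look like in practice, here is a minimal Python sketch that checks each incoming record before it reaches any analysis and sets aside the ones that fail, along with the reasons. It borrows the mortgage example from earlier; the field names (applicant_id, claimed_income, income_verified) and the rules themselves are hypothetical, chosen only for illustration, and are not taken from any real system or product.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    clean: list = field(default_factory=list)        # records that passed every check
    quarantined: list = field(default_factory=list)  # (record, problems) pairs


def check_application(record: dict) -> list:
    """Return a list of problems found; an empty list means the record passed.

    The fields and rules below are illustrative assumptions only.
    """
    problems = []
    if not record.get("applicant_id"):
        problems.append("missing applicant_id")

    income = record.get("claimed_income")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(income, bool) or not isinstance(income, (int, float)):
        problems.append("claimed_income is missing or not numeric")
    elif income <= 0:
        problems.append("claimed_income must be positive")

    if record.get("income_verified") is not True:
        problems.append("claimed_income has not been verified")
    return problems


def validate(records) -> ValidationResult:
    """Split records into clean and quarantined instead of passing everything through."""
    result = ValidationResult()
    for record in records:
        problems = check_application(record)
        if problems:
            result.quarantined.append((record, problems))
        else:
            result.clean.append(record)
    return result
```

Keeping the rejected records together with the reasons they failed, rather than silently dropping them, gives business users something they can read and act on, which is the point of describing data in business terms.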

Thus, although garbage in may have been garbage out once upon a time, modern tools can help us recycle our garbage, turning it into information. Such solutions are integrated and cloud-enabled, able to handle big data—the perfect foundation for analytical and cognitive applications that can produce actionable business results. An information governance tool set can help you put a stop to the GIGO cycle, giving you a vantage point on your data from which to develop a new paradigm: information governed, information out.

Learn more about data integration and IBM InfoSphere Information Server.