How to manage your database like a water supplier

Portfolio Marketing, Hadoop/BigInsights, IBM Analytics

Databases may be to information as reservoirs are to water, but water suppliers often do a better job at providing clarity than IT organizations. For example, imagine if your local water supplier were to take water from all sources into its reservoirs—and then send that water directly to your tap, unfiltered. You’d start looking for someone else to take charge of your water supply.

But suppose that your supplier caught wind of your concerns and took steps to alleviate the problem. Suppose that whenever someone turned on a tap, the water company cleaned only the amount of water required before supplying that customer. The sludge would be gone, but water delivery would be sluggish, and parched consumers would begin marking their calendars for the next municipal election.

Yet data engineers sometimes fall into exactly these pitfalls. Indeed, IT personnel can learn a great deal from the best practices of water providers. If you know how to clean and treat your data, just as utilities clean and treat water, you can help keep your reservoir of data from becoming a brackish swamp of hard-to-digest information.

Untreated material clogs data on demand

Typically, 80 percent of the development effort in a big data project goes into data integration. Why? Data engineers often do two things that would shock conscientious water management professionals:

  • They take a do-nothing approach, dumping unstructured data into a Hadoop cluster and then putting off data integration issues until they arise.
  • After dumping the data, they use master data management (MDM) to extract the data, run it through a cleansing and governance program, and then load it back into the cluster.

Because they employ these methods, data engineers often resort to hand-coding applications to extract and transform particular kinds of data—then spend more time cleansing dirty data from public sources such as social media. All the while, the data reservoir gets muddier and muddier, making data filtering increasingly complex and time-consuming.

Ongoing filtering clears the pipes

Instead of making the data pool murkier, data engineers should integrate and qualify their data, either within Hadoop or before loading it into Hadoop or a traditional database. And here, again, big data planning and processes can take their cue from water management professionals.

Water suppliers, like data engineers, prefer their sources to be clear. When they are not, water managers reduce contaminants in stages. First, they pump water through pipes that screen out large chunks of debris and into bank-sided reservoirs with impermeable linings. Then they filter the water in tanks before pumping it into the delivery pipes that supply consumers. As a result, most people don’t even think about where their water comes from or how it is processed; they use it for drinking and bathing while assuming that water will be there whenever they need it.
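
The same staged approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular product: the record fields (id, source, text) and the function names are hypothetical. A coarse screen drops records that are missing required fields, and a finer filter then normalizes and de-duplicates what remains, all before anything is loaded into the repository.

```python
from typing import Iterable, Iterator

# Hypothetical schema: the fields every record must carry before loading.
REQUIRED_FIELDS = ("id", "source", "text")


def coarse_screen(records: Iterable[dict]) -> Iterator[dict]:
    """Stage 1: drop records with a missing or empty required field."""
    for record in records:
        if all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS):
            yield record


def fine_filter(records: Iterable[dict]) -> Iterator[dict]:
    """Stage 2: normalize text fields and skip ids that were already seen."""
    seen = set()
    for record in records:
        if record["id"] in seen:
            continue
        seen.add(record["id"])
        yield {
            "id": record["id"],
            "source": record["source"].strip().lower(),
            "text": " ".join(record["text"].split()),  # collapse whitespace
        }


def treat(records: Iterable[dict]) -> list:
    """Run both stages in order, as a utility would before delivery."""
    return list(fine_filter(coarse_screen(records)))


# Illustrative input only; these records are made up for the sketch.
raw = [
    {"id": 1, "source": " Twitter ", "text": "great   service"},
    {"id": 1, "source": "twitter", "text": "great service"},  # duplicate id
    {"id": 2, "source": "web", "text": ""},                    # empty text
    {"id": 3, "source": "Web", "text": "slow\tdelivery"},
]
clean = treat(raw)
```

Because both stages are generators, records stream through one at a time, so the screening cost is paid once on the way in rather than each time a consumer turns on the tap.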

Maintaining repositories keeps data sparkling clear

As data volumes grow, data science professionals find themselves deluged with ever more information, in quantities that clog databases, transforming them into data swamps. To overcome this challenge, take your cue from best practices in water management to help keep your data sparkling clear—regardless of how you store and deliver it.

You’d like your data, like water in a reservoir, to be as unsullied as possible while it sits in its Hadoop cluster, traditional engine or database. After all, your consumers are ready and waiting to drink it in. Unmuddying your repositories of data, and keeping them that way, means using platforms and tools that enable data scientists and IT professionals to do the following:

  • Leverage high-performance native connectors that connect to a wide range of Hadoop and traditional enterprise data sources
  • Scale integration, transformation, filtration, governance and delivery of big data
  • Manage, monitor and automate runtime environments using a web-based dashboard and information governance catalog
  • Understand and manage metadata that would otherwise choke critical information
  • Spot unpalatable inconsistencies across records and tables
  • Identify categories of data that suit your organization’s particular interests
  • Quickly uncover data quality insights, patterns and trends across systems
  • Align quality indicators with business policy
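
Two of the capabilities above, spotting inconsistencies across tables and aligning a quality indicator with business policy, can be illustrated with a minimal Python sketch. It assumes plain in-memory tables rather than any specific tool; the table layouts, the field names and the 0.75 completeness threshold are hypothetical values chosen for the example.

```python
# Hypothetical business policy: at least 75 percent of customers need an email.
MIN_EMAIL_COMPLETENESS = 0.75


def find_orphans(orders: list, customers: list) -> list:
    """Return orders whose customer_id does not appear in the customer table."""
    known_ids = {customer["customer_id"] for customer in customers}
    return [order for order in orders if order["customer_id"] not in known_ids]


def completeness(rows: list, field: str) -> float:
    """Return the fraction of rows in which `field` is present and non-empty."""
    if not rows:
        return 1.0  # assumption: an empty table violates nothing
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


# Illustrative tables only; the values are made up for the sketch.
customers = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": ""},
    {"customer_id": "C3", "email": "c@example.com"},
    {"customer_id": "C4", "email": "d@example.com"},
]
orders = [
    {"order_id": 10, "customer_id": "C1"},
    {"order_id": 11, "customer_id": "C9"},  # no such customer
]

orphans = find_orphans(orders, customers)
email_score = completeness(customers, "email")
meets_policy = email_score >= MIN_EMAIL_COMPLETENESS
```

Here the cross-table check flags order 11 as an orphan, and the completeness score for email is compared against the policy threshold rather than judged in isolation, which is the sense in which a quality indicator is aligned with business policy.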

Learn more about how the IBM BigInsights BigIntegrate and BigInsights BigQuality data solutions can help you keep your data clean and your customers satisfied.