Is the Data Warehouse Dead?
The role of the data warehouse in the big data era
Many technologists have claimed that in the age of big data, the data warehouse is no longer relevant. Some thought leaders predict that implementations of data warehouses—particularly enterprise data warehouses (EDWs)—will decline and eventually disappear altogether.
Big data has spurred the creation of a new paradigm for how we manage and analyze data, and how we deliver insight. It has helped produce exciting opportunities for businesses that can capitalize on fresh sources of information. But does the emergence of big data mean that we should throw away what we’ve learned in the last 30 or so years of data warehousing? No. The data warehouse is not irrelevant, on its way out, nor dead. I would argue that it has evolved.
The need for data warehouses
Companies implement data warehouses to consolidate data from operational applications into a centralized repository built specifically for analysis and reporting. Some persist data for the entire enterprise (as in the EDW); others serve specific business departments as subject-oriented data warehouses or line-of-business data marts.
Regardless of the scope of the data, companies choose to implement data warehouses because the data and analysis they provide are of high value. This data, plus the analysis, helps to drive revenue growth, manage operational and financial risk, and maintain regulatory and legislative compliance. The insights derived from data warehouses are fundamental to the sustainability of the organization. As such, the data stored in data warehouses, and the processes used to manage and query it, need to be governed: the data and processes must be structured, modeled, repeatable, and trustworthy. Accomplishing these goals requires an investment of time and personnel. Data must be processed so that it is presented in a standardized, normalized, and dimensional state fit for broad business consumption.
So why is there a perception that the data warehouse has not fulfilled its promise? The most common business answer is that traditional data warehousing is a costly and slow exercise, since data needs to be modeled and transformed. The practice of data warehousing has provided a successful foundation for organizations that choose to invest in treating their high-value information as an asset. However, data warehousing is stretched when organizations need to deal with volatile data sources that are highly variable in format. In today’s big data landscape, technology is producing large volumes and varieties of data at incredible speeds.
The introduction of big data technologies
Fortunately, technology has caught up to the volume, variety, and velocity of data. Options such as Hadoop, streams-based computing, and high-performance analytic solutions are changing the game by delivering rapid insights from big data. To lower the time-to-insight barrier, these technologies avoid upfront modeling and transformation by using methods such as assemble-on-demand, no-schema, schema-later, and schema-on-read.
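The difference between upfront modeling and these deferred-schema approaches can be sketched in a few lines. The snippet below is a minimal illustration in Python, not any particular product's API; the record fields (`user`, `action`, and so on) are hypothetical. Raw records land in the store untouched, and structure is applied only when a query reads them.

```python
import json

# Raw events land in the store as-is -- no upfront modeling.
# (A schema-on-write warehouse would instead validate and transform
# every record before loading it.)
raw_events = [
    '{"user": "alice", "action": "login", "ts": "2024-01-01T09:00:00"}',
    '{"user": "bob", "action": "purchase", "amount": 42.50}',
    '{"user": "carol"}',  # sparse record: fields may simply be absent
]

def query(events, field):
    """Apply structure at read time: interpret each record against the
    fields the analysis needs, tolerating missing attributes."""
    for line in events:
        record = json.loads(line)
        if field in record:
            yield record["user"], record[field]

print(list(query(raw_events, "action")))
# → [('alice', 'login'), ('bob', 'purchase')]
```

The trade-off is visible even at this scale: ingestion is instant because nothing is validated up front, but every consumer must now cope with sparse, variable records at read time.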
The volatility of these approaches is a challenge for the humble data warehouse, though. As a result, some big data enthusiasts argue that traditional data warehousing methods are no longer applicable in today’s data landscape.
Many companies I work with are embarking on a big data strategy using Hadoop. In most situations, the Hadoop environment becomes a standalone repository for data collection. Universal data stores (UDS) and big data stores (BDS) serve as platforms for collecting all types of data sources, both internal and external (such as social media data), which can then be tapped and mined for potential business benefit.
Using big data solutions to complement data warehousing
Hadoop is an important part of what big data technologies can offer. But it is critical to merge big data with the traditional enterprise data strategy.
Many organizations are exploring and implementing a logical data warehouse (LDW) or a virtual data warehouse (VDW). The premise of an LDW or VDW is that there is no single data repository. Instead, the data warehouse is an ecosystem of multiple fit-for-purpose repositories, technologies, and tools that combine to manage and provide enterprise and personal analytics. In an LDW, Hadoop provides a powerful, low-cost repository for both structured and unstructured data. It complements the EDW by serving as a UDS or operational data store (ODS), in the same way that high-performance analytic appliances complement the EDW by serving as data marts.
The key to this approach is the interoperability of these tools within the ecosystem. For example, the traditional data warehouse must be able to draw insight from Hadoop and vice versa. IT groups should agree on applicable use cases or design patterns (see Figure 1) when deciding which platform to use.
Once data has been explored and judged to be of high value to the organization, there needs to be a path within the LDW for that data and its analyses to be propagated into the data warehouse for repeatability and broad consumption by the business user community.
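That promotion path can be sketched as a small extract-transform-load step. This is a hedged illustration only: the record fields and table name are hypothetical, and an in-memory SQLite database stands in for the governed warehouse. The point is the shape of the step, namely that raw, loosely structured records are validated against a fixed schema before they become a repeatable warehouse table.

```python
import json
import sqlite3

# Exploratory results sitting in the big data store, still schemaless.
raw_store = [
    '{"customer_id": 1, "segment": "gold", "churn_risk": 0.12}',
    '{"customer_id": 2, "segment": "silver", "churn_risk": 0.55}',
]

# The warehouse side: a governed, typed table (SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE customer_risk (
           customer_id INTEGER PRIMARY KEY,
           segment     TEXT NOT NULL,
           churn_risk  REAL NOT NULL
       )"""
)

# Transform: enforce the schema and types before loading, so downstream
# reports can rely on a stable, modeled structure.
rows = [
    (r["customer_id"], r["segment"], float(r["churn_risk"]))
    for r in map(json.loads, raw_store)
]
conn.executemany("INSERT INTO customer_risk VALUES (?, ?, ?)", rows)

# Once promoted, the data supports repeatable, governed queries.
high_risk = conn.execute(
    "SELECT customer_id FROM customer_risk WHERE churn_risk > 0.5"
).fetchall()
print(high_risk)  # → [(2,)]
```

In practice this step is where governance is applied: type enforcement, key constraints, and a documented model replace the anything-goes flexibility of the exploratory store.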
[table id=1 /]
Bringing new life to the data warehouse
The idea that the data warehouse is dead is somewhat far-fetched. Yes, a data warehouse may be expensive and slow. But consider its use for your high-value information and think about why you would implement one.
Big data brings new life to the data warehouse by enriching it and introducing new insights taken from non-traditional sources, as well as unexplored data sources. The integration of big data and traditional data warehousing can produce results that are the best of both worlds. Together, big data solutions and data warehouses can deliver a complete solution for your enterprise data management strategy.
What is your approach to big data and warehousing? Which method has achieved the best outcome for your organization? Feel free to post your comments here or connect with me on Twitter @fooisms.