Fundamentals for sure-fire cloud data warehouse optimization
An interview with James Kobielus
As IBM's big data evangelist, James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker, a social media participant and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management and next best action technologies.
When we talk about cloud data warehouses, we’re really talking about an increasingly fundamental part of many companies’ infrastructure.
For the past decade at least, we have seen organizations get smarter about maintaining repositories of historical data in data warehouses. Today, we’re seeing companies move these vital systems of record from dedicated physical hardware onto cloud infrastructures – whether that means a flexible private cloud environment within the data center, or a public cloud service.
There is a growing need for versatile, hybrid architectures that can combine the best of both data warehousing and big data analytics. The cloud is the perfect solution, because it makes it easier to build a robust data warehouse as a central “hub”, and then add other environments that can be scaled up or down to meet the specific needs of different datasets.
Nevertheless, it is important to think carefully about the design of the entire hybrid architecture, and avoid a number of common pitfalls. We spoke to Jim Kobielus at IBM for his top tips on strategizing and optimizing cloud data warehouses.
Andrea Braida: Jim, thanks for joining us. What’s the first thing that companies should think about when they are looking to design and build a cloud data warehouse?
Jim Kobielus: My first tip would be to consolidate your end-to-end data warehouse storage infrastructure before you migrate.
At many companies, data warehouses have grown organically over several years: first the sales team sets up a repository of historical sales data, then the manufacturing team begins keeping records of production-line activity from previous years, and so on. Ultimately, the business ends up with multiple separate data marts, each housing its own datasets and requiring its own separate instances of databases and analytics software.
When moving to the cloud, it’s important to rethink these fragmented architectures and reengineer them using a smaller core set of data storage technologies. For example, companies often use several different relational databases from multiple vendors – all of which are fundamentally designed to do the same job. Instead of running ten IBM DB2 instances, five Oracle instances and three Microsoft SQL Server instances, plus a random selection of MySQL, PostgreSQL and other open source relational databases, just pick one, and run as few instances of it as possible.
By doing so, you will hugely simplify the work of your database administrators, avoiding the need to burden them with laborious series of tasks that must be repeated every time a new data source is added or a new set of queries is developed. A streamlined, consistent set of core technologies can be a huge time-saver, significantly boosting the productivity of the teams who manage your data, and reducing complexity and cost.
“Companies often use several different relational databases from multiple vendors – all of which are fundamentally designed to do the same job… Just pick one, and run as few instances of it as possible.” -James Kobielus
Andrea Braida: So moving to the cloud is a great opportunity to simplify your data warehouse landscape. But if you want to reduce the number of database instances, does that mean you need to move everything to a single type of database or convert it into a single type of data structure?
Jim Kobielus: I don’t think so – although you do want to minimize the core set of technologies that you build upon, that doesn’t mean that you should try to adopt a one-size-fits-all approach.
Different data warehouse deployment roles and workloads have different requirements, and trying to shoehorn all of your data into a single data model will rob you of the power and flexibility that your business needs.
For example, no matter how well-designed the dimensional data model is within an online analytical processing (OLAP) environment, users will eventually outgrow its constraints and demand more flexible decision-support. By requiring that relational data be de-normalized and pre-joined into star schemas and other fixed, subject-specific structures, traditional multidimensional OLAP denies users the flexibility to drill down, up, and across data sets in ways that were not designed into the underlying cubes.
Other storage approaches – such as columnar, in-memory, and inverted indexing – may be more appropriate for such applications. However, at the same time, they may not be generic enough to address broader deployment roles.
For this reason, your core set of data storage technologies should include multiple types of databases, each of which can play a useful role in the work that your data warehouse will be required to do.
For example, to complement the main relational database at the heart of your architecture, you might also need a NoSQL document store or a key-value store for semi-structured data; a graph database for exploring the relationships between entities; an object store for large-scale file storage; and a Hadoop cluster as a data lake for storing and analyzing large volumes of unstructured data.
Jim Kobielus: Absolutely! And when you’re assembling your core set of technologies, you need to think about more than just the logical data structures and schemas that each database employs.
It’s also vital to understand the strengths and weaknesses of the tools they provide to help you maintain and scale your data warehouse environment. Such tools can often make the difference between a clever technical solution that fails in practice, and a bullet-proof product that you can rely on for a full-scale data warehouse deployment in the cloud.
The first area where database tooling can make a big difference is by automating the management of the underlying storage infrastructure that supports your cloud data warehouse. The platforms you choose should be able to manage and optimize storage continuously, without imposing inordinate manual burdens on your IT staff. They should not require physical tuning, indexes, partitions, manual compression schemes, results caching, cubes, or materialized views. They should not require software installation or hardware upgrades. They should automatically regenerate failed drives rewrite and relocate bad sectors. They should not require RAID-level decisions for database spaces, nor should they require logical volume creation of files. Finally, they should not allow queries or transactions to fail as a result of disk failures.
Most enterprise database platforms will pass these tests – but many of the more emergent technologies in the open source space may still have question-marks around some of them, particularly when you want to deploy them at scale. It is important to do your homework and get expert advice during the design phase to make sure that you pick technologies that won’t create hundreds of hours of work for your database administrators in the future.
“Your core set of data storage technologies should include multiple types of databases, each of which can play a useful role in the work that your data warehouse will be required to do.” -James Kobielus
Andrea Braida: What about storage capacity? In a cloud landscape, the amount you pay is usually directly proportional to your storage requirements – and that’s a major consideration for many companies when they look at cloud data warehousing options.
Jim Kobielus: Great question – again, this is an area where the right choice of database platform can pay dividends. By choosing database schemas that store data efficiently, or database platforms that offer powerful tools to de-duplicate and compress it, you can significantly reduce costs and optimize your use of resources.
Some physical data models are more inherently compact than others (for example, tokenized and columnar storage are more efficient than row-based storage), just as some logical data models are more storage-efficient (third-normal-form relational is typically more compact than large de-normalized tables stored in a dimensional star schema).
However, as we have mentioned above, you will often need to select physical and logical data models on the basis of their ability to optimize performance for specific workloads. If there is a trade-off between performance and efficiency, you may not always be free to choose the most storage-efficient model every time.
In such cases, intelligent compression – which encompasses diverse metadata-driven techniques for generating compact data footprints for storage, transmission, and/or processing – can still help you dramatically reduce your overall storage requirements and costs.
Again, however, the devil is in the detail: it’s important to understand how these intelligent compression capabilities have been implemented. In some cases, they can degrade performance by making CPUs shoulder the workloads associated with on-the-fly compression and decompression.
In such cases, database administrators can achieve an optimal balance by performing the decompression in specialized hardware close to disk, which takes the load off the front-end query processors.
Andrea Braida: And besides automation and compression tools, are there any other types of tooling that are important to consider?
Jim Kobielus: I think the final thing we should mention is that your databases should provide tools that help you adapt your data modeling approaches as new business needs arise.
No fixed cloud data warehouse modeling approach – physical or logical – can do justice to the ever-shifting mix of queries, loads and other operations that you need to perform against your growing, evolving pool of data and storage resources.
The more dynamic the demands on your big-data environment, the more often you will need to revisit your data schemas, join strategies, and partitioning and indexing approaches to maintain acceptable performance for all users and applications.
Make sure you adopt database platforms that let you focus on your logical data models, and that automatically reconfigure the underlying physical data models for optimal query and data load performance.
“If you are building a hybrid cloud data services architecture, you will most likely need a fast, robust and scalable cloud data warehouse at its core… Time invested in architecture design and data consolidation at the start can pay huge dividends in the long run – potentially saving terabytes of storage space and hundreds of hours of work for your busy DBA team.” -James Kobielus
From my conversation with Jim, I learned that if you are building a hybrid cloud data services architecture, you will most likely need a fast, robust and scalable cloud data warehouse at its core. Choosing the right platform for this central node of the architecture is vital, and your choice will have an impact on the technologies you select for the outlying nodes too: the aim is to bring together the smallest possible set of database platforms and other technologies that will give you the broadest possible set of capabilities.
As part of the selection process, it is important to consider both the core data management capabilities of each platform, and their broader capabilities in terms of integration with other platforms, automation of storage management, compression and deduplication capabilities, and flexible self-optimization.
Also, the time you invest in architecture design and data consolidation at the start can pay huge dividends in the long run. Careful planning can save you terabytes of storage space and hundreds of hours of work for your busy DBA team, while ensuring that your cloud data warehouse can scale to support as many users, queries and datasets as your business requires.
If you would like to learn more about the advantages of bridging to a hybrid cloud data services architecture, please check out Jim’s presentation. Learn about IBM’s premier cloud data warehouse offering, IBM dashDB for analytics.