How a data warehouse appliance can help your data scientists deliver insights faster

Product Marketing Manager for Data Lake & Hortonworks Partnership, IBM

A data science team is one of the most valuable assets a company can have. But some businesses overlook how important it is to equip their data scientists with tools that help them unlock value from data.

In particular, it is essential that companies remain adaptive and agile when taking in vast amounts of unstructured data. To get the most value from their data and their data scientists, businesses need a capable, easy-to-use data warehouse that offers built-in data science and parallel processing.

These demands point to a data warehouse appliance, but not all appliances are created equal. Quark + Lepton’s recent TCO report found that IBM Integrated Analytics System (IAS) had 45 percent lower costs over five years than Teradata IntelliFlex.

In this two-part blog series, we’ll examine some of the lessons learned in that report about what you should look for in a data warehouse appliance. We’ll also describe how the right choice can be transformative for your data scientists and your business—increasing functionality, decreasing complexity, and potentially cutting costs almost in half.

An appliance ready for data science

A modern data warehouse appliance needs to support data science so that sophisticated data analytics models can be built and refined to deliver valuable insights at scale. But there’s a world of difference between cobbling together individual data science features and holistic, embedded, in-place data science capabilities.

An effective data-science-ready appliance demands multiple capabilities, including:

  • In-place processing. By allowing data science models to be refined and run where the data is, data scientists save time that would have otherwise been spent on data migration. By avoiding migration delays, businesses can gather insights and act on them faster than their competition. Data scientists can also use time that would have been spent on migration to investigate additional insights, unlocking a competitive advantage.
  • Built-in Spark. Having Spark built directly into the appliance provides better processing speed and near-real-time insights. This allows businesses to act on insights when they have the most impact, maximizing the return that each insight provides.
  • Hadoop integration. Integration with Hadoop is essential for larger workloads with various data types and sources. When data scientists include the often-unstructured data stored in Hadoop, their analysis is more complete. In turn, that means businesses can act on opportunities their competitors haven’t identified.
  • Built-in machine learning. With machine learning capabilities at their fingertips, data scientists can refine models more rapidly than they otherwise would if starting from scratch. Again, this means they can deliver insights faster and have more time to seek new insights, ultimately resulting in businesses acting more quickly than the competition.
  • Integration with familiar tools. Data scientists are already familiar with, and frequently prefer, a number of tools such as RStudio and Jupyter Notebooks. By providing integration with these tools, businesses avoid the time and expense of retraining data scientists on other tools.

Increased speed and integration with massively parallel processing (MPP)

Effective data science capabilities are only one part of what is required to deliver deep insights rapidly. To maximize performance, MPP is a must for data warehouse appliances. It enables workloads to be split up and run on multiple compute nodes in parallel so you can use compute resources more efficiently and reduce time to insight.
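
To make the split-and-merge idea concrete, here is a minimal Python sketch. It is not how IAS or any specific appliance implements MPP: worker threads stand in for compute nodes, and the function names, the round-robin distribution, and the sample sales rows are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_nodes):
    """Spread rows across n_nodes slices round-robin (hypothetical scheme)."""
    slices = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        slices[i % n_nodes].append(row)
    return slices


def partial_sum(rows):
    """Work done on one 'node': total the amounts per region for its slice."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def mpp_sum_by_region(rows, n_nodes=4):
    """Run partial_sum on every slice in parallel, then merge the partials."""
    slices = partition(rows, n_nodes)
    # Threads are only a stand-in for separate compute nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(partial_sum, slices))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds values key by key
    return dict(merged)


if __name__ == "__main__":
    sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
    print(mpp_sum_by_region(sales, n_nodes=3))
```

Each slice is summarized independently, so only small per-region totals need to be combined at the end rather than the full set of rows.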

Because open-source platforms such as Spark and Hadoop also use similar, distributed computing technologies, you can use them and data warehouse appliances with MPP at the same time. Still, the best appliances go beyond MPP with capabilities such as query pushdown optimization, which allows the appliance to check if data can be processed where it resides. If so, you can reduce latency because the system doesn’t have to wait for data to move to a centralized compute location. With MPP, insights are delivered faster, more efficiently, and with better integration across open-source and data science tools.
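
The effect of pushdown can be illustrated with a small, hedged sketch. It uses Python's built-in sqlite3 module as a stand-in for the data store; the table, column names, and helper functions are hypothetical and do not represent any appliance's actual optimizer. The first function moves every row to the caller before filtering, while the second pushes the filter and aggregation down to where the data resides.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table with a few made-up sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10), ("west", 5), ("east", 7), ("north", 3)],
    )
    return conn


def total_without_pushdown(conn, region):
    """Fetch every row, then filter and sum on the caller's side."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    total = sum(amount for r, amount in rows if r == region)
    return total, len(rows)  # (result, rows moved to the caller)


def total_with_pushdown(conn, region):
    """Let the data store filter and sum; only the result is returned."""
    rows = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE region = ?",
        (region,),
    ).fetchall()
    return rows[0][0], len(rows)  # (result, rows moved to the caller)


if __name__ == "__main__":
    conn = make_demo_db()
    print(total_without_pushdown(conn, "east"))
    print(total_with_pushdown(conn, "east"))
```

Both functions return the same total, but the second moves a single row to the caller instead of the whole table, which is where the latency saving comes from.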

Usability: separating the good from the passable

What separates great data warehouse appliances from the rest is usability. Data science tools and MPP mean very little if they introduce too much hassle for the user. Any inherent complexity of an enterprise data warehouse should be handled behind the scenes by a well-designed and well-integrated data platform. This keeps your IT pros, including database administrators and data scientists, from spending their time on administrative tasks or on making sure everything works as it should.

The design philosophy and ecosystem for your data warehouse solution are essential considerations. Older styles of data warehousing, which rely on storage and compute pairings linked with proprietary connectors and maintained by IT experts with years of institutional know-how, are no longer efficient enough. These cobbled-together systems hold disparate, siloed data that often needs to be migrated for analysis, which delays the process of gathering insight. Such systems also require too much involvement by IT experts to make sure the numerous solutions that make up the architecture continue to work with one another. This can steal time that should go towards value-add activities.

Easy connectivity based on similar codebases or data virtualization is a much better option. Unlike in cobbled-together systems, data can be processed where it resides and accessed much more smoothly than it can through migrations. Such efficiencies reduce latency, which saves your experts time and allows them to pursue more important tasks like uncovering new insights. In turn, the business can realize additional cost savings through reduced manual effort, or even revenue boosts through quicker insights.

Centralized data warehouse management tools make an enormous difference to usability as well. Tools that enable administration, alerts, and monitoring in one place, alongside federation and SQL execution support, help users avoid wasting time bouncing between tools or engaging in highly manual processes, up to and including data migration.

Make your data scientists more efficient today

Every day data scientists spend with an outdated, patched-together environment is a day when the business isn’t getting their full value. By selecting an appliance with built-in data science capabilities, MPP, and enhanced usability features, you can boost their productivity rather than hamstring them. Many find that doing so is the most cost-effective option in the long run.

For more insight into the cost savings of a data-science-focused appliance, read Quark + Lepton’s TCO comparison of the IBM Integrated Analytics System and Teradata IntelliFlex. You can discover why IAS showed a 45 percent lower 5-year TCO on average.

If you’d like to discuss questions specific to your business, reach out to one of our data warehouse appliance experts for a no-cost, one-on-one consultation.