Mining Data in a High-Performance Sandbox
Fulfill data analysts’ dreams with data warehouse appliances for in-database analytics and data mining
Data scientists and database administrators (DBAs) do not often work together, which is unfortunate. The synergy between advanced analytics and large-scale databases is hugely valuable to analysts. Analysts want lots of data to analyze, and the big databases hold all that data. However, data analysts generally end up dragging out as much data as possible from the databases into their analytic environments, so-called sandboxes, for advanced analytics processes.
The analytical environments used by data analysts are suitable for small- to medium-sized analysis. But as the scale of data analysis moves to big data—up to hundreds of terabytes—performance in these analytical environments begins to degrade rapidly. Data analysts then become accustomed to waiting hours for processes to complete. This delay not only slows the analytical process, but it also reduces the number and sophistication of the analyses performed, ultimately reducing the overall value of the analytics to the organization. These analytical sandboxes are also far less secure than traditional databases, which can expose organizations to data privacy problems.
Initial attempts to use data warehouse appliances for advanced analytics are often frustrating for DBAs and data analysts. Analytical tools working on large database systems sometimes start by dragging all the data across the network into the tool, which can result in huge delays and apparent performance degradation. Usually, the analyst has no idea the data is moving into the tool from the database and assumes the activity is executing in the database. When this data transfer occurs, the analytical tool issues the equivalent of an unconstrained Select * from table x; query to the database, which is very inefficient.
Developing a sandbox
Creating a scalable, high-performing sandbox for advanced analytics in a data warehouse appliance is a solution that works well for both DBAs and data analysts (see figure). Begin by creating a separate sandbox database instance that may need analytics—and geographic information systems (GIS)—libraries to be initialized. Synonyms—for performance and security—should be created in the sandbox to connect to source data tables in the other databases within the appliance. Analysts should have read/write access to the sandbox, and through the synonyms, read-only access to the other database tables to prevent accidental damage to the main data warehouse tables.
Important analytical functions available as SQL extensions or functions within the database
Appliances such as Teradata, IBM® PureData™ System for Analytics software, and IBM PureData System for Operational Analytics expert integrated data system have a vast array of advanced analytics built into the appliance. All analytics activity and data can stay in the appliance.
Next, select tools that support SQL pushdown or in-database analytics. Tools such as IBM Cognos® business intelligence (BI) software and SAP Business Objects are designed to access databases directly and efficiently. However, many analytical tools assume the tool is accessing stand-alone local data sets, which is not ideal for large database access. Analytical tools like IBM SPSS® Modeler data mining workbench and SAS—with SAS/ACCESS for relational databases—have in-database capabilities that recognize the database type and version and write appropriate SQL.
Most problems occur when the analytical tool is attempting to run a process that cannot be executed in a database. Work with the DBA to find out why certain procedures are not executing in the database. Seemingly simple functions such as data type conversions, counting functions, and matrix functions often cannot run in databases directly. Other ways can be found to run these functions in the database. Running processes in serial and creating intermediate data sets in the database is a good way to break up the processing and simplify the SQL being issued.
Keeping the analytics in the database
When nearly all the analytics are executed in the database, performance can be impressive—which data analysts appreciate. Many data warehouse appliances have sophisticated statistical procedures and data mining algorithms built inside the database. No data moves out of the database when executing these analyses. This lack of data movement is vital when processing up to hundreds of terabytes of data. A little extra work is involved in setting up the in-database analytical environment, but the results can be well worth the trouble. This solution really works. If you have any questions about individual implementations, please share them in the comments.