High-Performance Data Mining
Parallelized scoring performance with an SAS PMML model in InfoSphere Balanced Warehouse
When it comes to predictive analytics and business intelligence (BI), organizations with large information warehouses usually face a choice: create and implement data mining models directly within the database environment, or create them in a separate analytic environment, such as a data mining workbench.
Deploying mining models in a database environment can generate business results faster by improving scoring performance through parallelization and reducing overall software licensing costs. In addition, significant savings in software licensing may be realized by limiting high-cost data mining software to a small development environment and then porting the data mining models to the large-scale production environment. On the other hand, some organizations develop and deploy data mining models in an analytic environment, investing significant resources and experience in their mining models and process.
For many organizations, the best choice may be a hybrid scenario that enables data mining models to be developed in an analytic environment and then deployed in a database environment optimized for high speed, high-volume scoring processes. This approach is made possible by Predictive Model Markup Language (PMML), which defines a format for expressing data mining models. Data mining models that are created with PMML can be easily imported into a database, making them available to a scoring process within the database environment.
The IBM® InfoSphere® Balanced Warehouse (IBW) platform enables organizations to deploy externally created PMML data mining models to create a high-speed, high-volume BI and predictive analytics environment. To demonstrate the capabilities of IBW, IBM conducted a scoring performance study with the following objectives:
- Demonstrate that a PMML data mining model can successfully be used for scoring in a high-speed, high-volume IBW environment.
- Assess the scaling performance of scoring in an IBW environment across a range of hardware configurations and data volumes.
- Develop best-practices recommendations for configuring an IBW data mining environment.
Setting up the test environment
The study consisted of four steps:
- Prepare the server environment.
- Establish a data mining model in IBW by obtaining a SAS logistic regression model in PMML format and importing it into the database.
- Build a set of five data tables in the database, with each table defined on four different partitioning schemes.
- Use SQL scripts to apply the data mining model to each of the five data tables for each of the four partitioning schemes and to report the execution time.
An IBW environment was configured on an IBW E7100 consisting of a cluster of 11 IBM System p6 570 servers running IBM AIX (see Figure 1). The E7100 cluster consisted of one administrator server and 10 data servers. Each server contained four processors and 64 GB of memory. The servers were connected over a Gigabit Ethernet switch. IBM System Storage DS4800 and DS4700 units were used for storage.
Data mining model
We obtained a logistic regression model created using SAS Enterprise Miner 5. This data mining model was exported from SAS Enterprise Miner in PMML format and then imported into a DB2 table in the IBW environment, making it available for incorporation into a DB2 scoring process.(1)
Data creation and partitioning
The data for the study was extracted from a very large database provided by an IBM customer. The extract contained two random samples of 1 million and 10 million account records, respectively. These two samples were used to construct five tables for the scoring runs, comprising 1 million rows, 10 million rows, 30 million rows, 100 million rows, and 300 million rows.
These five tables were created across four different partitioning schemes to demonstrate the “scaling up” (more records) and “scaling out” (more partitions) of data mining queries in an IBW environment. The partitioning schemes consisted of an administrative server and from 1 to 10 data servers, with each scheme having 8 partitions on each server. The four schemes were set up in the following configurations: 1 data server and 8 partitions, 3 data servers and 24 partitions, 5 data servers and 40 partitions, and 10 data servers and 80 partitions.
We assessed scoring performance for each combination of partitions and rows. For each run, the DB2 buffer pools were warmed to a consistent state to help ensure that the recorded times would be consistent and comparable across runs. For each case, the number of records scored per second and the number of records scored per second per partition are reported relative to the base case of 1 million records and eight partitions. To view a table of scoring execution times and performance metrics, go to ibm.com/developerworks/data/library/dmmag/DMMag_2009_Issue3/IBW/index.html.
Figures 2–4 illustrate the scoring performance. In Figure 2, we see that the relative scoring performance measured as the number of records scored per second per partition relative to the base case (i.e., the relative scoring rate per partition) remains constant as the number of partitions increases. For the cases of 100 million records on 8 partitions and of 300 million records on 8, 24, and 40 partitions, performance was limited by available physical memory.Figure 2: Scoring performance: scoring rate per partition vs. number of partitions (relative to 1 million records on 8 partitions)
Investigation of this performance limitation showed it to be the result of buffer pool thrashing. In our tests, the source and target tables were in the same tablespace and buffer pool. Once the number of records became large enough, reads and writes began competing for the same buffer pool resource and thus slowed down the overall scoring rate.
In Figure 3, we see that changing the number of partitions changes the relative scoring time performance (number of records scored per second relative to the base case) by the same factor. For the case of 1 million records, we see that tripling the number of partitions from 8 to 24 reduces the relative scoring time threefold (from 1 to 0.3).Figure 3: Scoring performance: scoring time vs. number of partitions (relative to 1 million records on 8 partitions)
In Figure 4, we see that the relative scoring time increases linearly as the number of records increases. Deviations from linear performance in the cases of 8, 24, and 40 partitions reflect the limitation of available physical memory.Figure 4: Scoring performance: scoring time vs. number of records (relative to 1 million records on 8 partitions)
Findings and conclusions
The results of this study indicate that scoring performance using PMML data mining models in an IBW environment scales linearly with data volumes and hardware configuration. Furthermore, performance is constrained by available physical memory. Specifically:
- Performance is constant on a per-partition basis regardless of data volume and configuration size (see Figure 2).
- Changing the number of partitions changes the relative scoring time performance by the same factor (see Figure 3).
- Performance scales linearly with increasing data volumes (see Figure 4).
- Exhaustion of the available physical memory on the servers indicates a need to increase the total number of servers in the cluster.(2)
- The onset of buffer thrashing indicates a need either to add additional memory to each server or to add more servers to the cluster. Until this memory threshold is reached, performance remains linear.
IBW’s capability to import a PMML data mining model means that analytic and IBW environments can be used synergistically to create and use data mining models for high-speed, high-volume scoring through operational business applications or automated processes. Organizations using this strategy can leverage their existing investment in analytic expertise and tools with an IBW environment to better support the decision-making process.
Our findings and conclusions lead to three best-practices recommendations for configuring an IBW data mining environment:
- To eliminate logging overhead, set an output table for scoring results to Not Logged Initially.
- To facilitate buffer pool tuning and to separate disk activity, source tables and scoring results tables should be placed in different tablespaces and different buffer pools.
- To calculate configuration metrics for a particular data mining model, the formulae displayed in Figure 5 can be used, where the model’s performance (number of rows processed per second per partition) has been determined by scoring a randomly selected subset of the data to be scored.
1 Dynamic Warehousing: Data Mining Made Easy, by C. Ballard, J. Rollins, J. Ramos, A. Perkins, R. Hale, A. Dorneich, E. Milner, and J. Chodagam, IBM Redbooks publication SG24-7418-00, 2007.
2 In an IBW configuration having servers with standard preconfigured memory, the capacity-planning decision would focus on the number of servers, not on the amount of memory per server.