Moving beyond spreadsheets into enterprise-ready statistical analysis

Post Comment
Big Data Evangelist, IBM

For most knowledge workers, spreadsheets can be an attractive choice for performing calculations. If the task is to carry out simple tests on a small number of variables, then a spreadsheet is as good a tool as any. Its principal advantages are ubiquity, familiarity and ease of use. 

However,  spreadsheets may prove to be inadequate for a wide range of statistical analysis purposes. Their statistical features and algorithms are generally not as sophisticated, flexible or accurate as those in solutions that were specifically designed for enterprise-grade statistical analysis. For business analysts and anyone else whose job depends on the ability to build complex statistical models on greater amounts of data from disparate sources, you want a tool that goes beyond spreadsheets. You want an enterprise-grade tool that offers greater reliability, accuracy, speed, flexibility, scalability, functionality and programmability. And if that tool is as easy to use, manage and master as any spreadsheet, all the better. 



Spreadsheet constraints

If we’re going to propose a practical alternative to spreadsheets for daily use, we first need to examine what limitations keep spreadsheets from being used in high-performance enterprise applications. They have three key limitations: superficiality, limited flexibility and inadequate decision support.

Superficial at best 

Spreadsheets are acceptable if you intend to settle for a superficial glimpse at your data. But when the patterns, trends and correlations in your data set aren’t immediately apparent, you want something more. What you need are regression analysis, data mining and other advanced statistical capabilities that a spreadsheet simply can’t offer. Also, spreadsheets typically impose constraints on the number of records that can be analyzed. Consequently, a spreadsheet-based model can’t scale if the data set you’re analyzing is large—as many of them are in this era of big data.

Yes, carrying out mathematically sophisticated calculations on huge data sets using a spreadsheet is possible. But you can’t easily account for complex factors such as seasonality in a business, performing what-if analyses or developing multivariate scenarios unless you have a sophisticated tool such as IBM SPSS Statistics.

Limited flexibility

Spreadsheets are mass-market software programs that often lack the flexibility to be extended to handle tasks for which they weren’t designed. This characteristic is especially true for tasks involving many variables, huge amounts of data and exceptionally complex statistical correlations. Though many people use spreadsheets as database substitutes, they are unsuited to this role because a proper database has built-in rules for structuring data, maintains data integrity, provides audit trails and has other robust data management and extensibility features that spreadsheets generally lack.

In addition, spreadsheet programs such as Microsoft Excel lack the programmability of a true database management system (DBMS). They instead rely on what is known as a nonprocedural programming language rather than the more sophisticated procedural languages such as Basic, C, Fortran and Java that enterprises rely on. Furthermore, spreadsheets often have functional limitations when handling special types of data such as missing or categorical data and performing sophisticated predictive analyses beyond straight-line extrapolations of past trends. They also exhibit drawbacks when propagating complex data updates beyond the clumsy process of propagating changes through tens to hundreds of separate linked cells, rows, columns and formulas without sacrificing visibility into their impact on accuracy.

Unreliable decision support

Spreadsheet software can be unreliable. For one thing, spreadsheet vendors continually need to issue patches and corrections for problems that might lead to calculation errors. Also, spreadsheets are not nearly accurate enough for complicated mathematical procedures or large data sets, even when they are coded correctly. And further, spreadsheet-based models have a high frequency of errors. The most troublesome of these are stealth errors, which produce incorrect results that nobody is likely to catch. The most frequent sources of stealth errors include mistakes in logic, incorrectly copied formulas, accidentally overwritten formulas, misuse of built-in functions, omitted factors and data-input errors.

Analytical power

What would the ideal statistical analysis tool for enterprise adoption look like? At a bare minimum, it would need to possess sophisticated statistical analysis, flexible decision-support analysis and a trusted statistical analysis software resource.

Sophisticated analysis

Users need to be able to perform robust, in-depth statistical information analyses in the tool without anything that smacks of programming. They need to be able to apply more advanced methods immediately because they are already there, in the software, and they need to be able to easily create commonly used charts such as scatterplot matrices, histograms and population pyramids. Charts and tables should be viewable anytime or anywhere on a range of devices. The tool needs to allow analysts to examine the available data and calculate values for missing items while employing diagnostic reports to uncover missing data patterns. Other key data-preparation steps include looking at the distribution of data, checking for outliers and organizing or binning data so that the algorithms you plan to use—such as Naive Bayes or logit models—operate efficiently.

Flexible analysis for decision support 

Statistical analysts need to have the flexibility to generate and deploy results in various formats, including an impressive array of plots and graphs, and to a diverse range of target platforms. When any type of analysis is performed, the tool needs to automatically write a program in the form of syntax that can be saved and run time after time on different data sets without the need to change it.

Advanced users need a tool that enables them to implement new languages, procedures, and functionality through programmability extension hooks. The tools need to allow support extensions for programming in Java, .NET, Python or R and to embed new algorithms or functions directly. And the tools need to enable users to create a native GUI for the new feature they’ve created to give access to it for nonprogrammers who can then perform the analyses on their own—rapidly and efficiently.

Trusted analysis software

Users need to be able to perform statistical calculations that are accurate, valid and optimized at every point, from data entry through to the creation of reports for decision makers. In data entry, users need to be able to start the process with definitions of the data types that are going to be used, at any level of detail. This capability can trigger the first level of error checking.

The tool also needs to ensure that the data types, data relationships and other characteristics of the data layout cannot be modified accidentally during data entry. Tool-driven data validation and error checking mechanisms need to enable users to compare two data files or two data sets either by the entire document metadata or in a case-by-case comparison of selected variable values to identify any discrepancies between them. Automatic procedures locate values that appear to be out of line, which takes care of locating most typos. If, however, the value happened to be within range but somehow abnormal when compared to the other numbers entered, the tool needs to be able to spot it and ask about it. 

Statistical sophistication

If you’re wondering where you can find sophisticated, flexible, trustworthy tools for enterprise-grade statistical analysis, give a complimentary trial version of IBM SPSS Statistics a try.

Check out advanced statistical capabilities