Data Scientists: How Big is Your Big Data Sandbox?

May 7, 2012

Big data is not just about scaling your data analytics processing platforms to keep up with the onslaught of new information. Just as important, big data is about bringing together your best and brightest minds and giving them the tools they need to interactively and collaboratively explore rich information sets.

Without scalable development “sandboxes,” you won't realize the full value of your investment in big data. Developer productivity is a critical concern, especially when you're talking about high-priced data scientists. Today's statistical modelers and business analysts need high-performance platforms where they can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models in search of deep patterns.
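
To make that iterate-and-compare loop concrete, here is a minimal sketch in Python, assuming pandas and scikit-learn are available; the file name, column names, and depth grid are illustrative stand-ins, not part of any particular sandbox platform.

```python
# Minimal sketch of the iterate-and-compare workflow described above.
# Assumes pandas and scikit-learn; the customer file, column names,
# and depth grid are hypothetical illustrations.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Aggregate and prepare the data set (hypothetical customer file).
df = pd.read_csv("customers.csv")
X = df.drop(columns=["churned"])   # predictor attributes
y = df["churned"]                  # target label

# Tweak the tree's depth and iterate through candidate models,
# scoring each with cross-validation to surface the strongest one.
for depth in (3, 5, 8, 12):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy {score:.3f}")
```

The point of the sandbox is to let an analyst run dozens of loops like this one, against full-scale data, without waiting in line for compute.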

To be as productive as possible, these teams must have massively parallel CPU, memory, storage, and I/O capacity at their fingertips to tackle analytics workloads of growing complexity. Teams can obtain that sandbox capacity from a stand-alone analytic data mart, such as IBM Netezza 1000; from a logical partition in an enterprise data warehouse (EDW), such as IBM Smart Analytics System; or from a Hadoop cluster, such as IBM InfoSphere BigInsights Enterprise Edition.

Big data sandboxes are where you develop the all-important intellectual property: the advanced analytic models that extract intelligence from otherwise inchoate gobs of content. Sandbox scalability is critical, but scalability is about more than raw horsepower. You also need the ability to support the growing scope of mission-critical projects that fall under the strategic umbrella of big data. Today your sandboxing requirements may revolve around traditional statistical analysis, data mining, and predictive modeling, but you may be moving rapidly into Hadoop/MapReduce, R, geospatial analysis, matrix manipulation, natural language processing, sentiment analysis, and other resource-intensive types of big data processing.
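
As a taste of the Hadoop/MapReduce and sentiment-analysis workloads just mentioned, the sketch below shows a Hadoop Streaming-style mapper and reducer in Python. The toy word lexicon and the pipeline invocation are assumptions for illustration; a production sentiment model would be far richer.

```python
# Sketch of a Hadoop Streaming-style job a sandbox might run: crude
# sentiment scoring with a toy lexicon (illustrative, not a real model).
# Local test: cat docs.txt | python mr.py map | sort | python mr.py reduce
import sys

POSITIVE = {"good", "great", "love"}   # toy lexicon (assumption)
NEGATIVE = {"bad", "poor", "hate"}

def mapper():
    # Emit (sentiment, 1) for each matching token on each input line.
    for line in sys.stdin:
        for token in line.lower().split():
            if token in POSITIVE:
                print("positive\t1")
            elif token in NEGATIVE:
                print("negative\t1")

def reducer():
    # Sum counts per sentiment key; input arrives sorted by key.
    current, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Trivial as the logic is, running it over billions of documents is exactly the kind of job that demands the parallel capacity described above.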

To avoid choking on the dizzying variety of big data projects, your sandboxing platform—such as IBM Netezza Analytics—must embed comprehensive, extensible libraries of reusable algorithms and models for advanced analytics. Does your sandboxing platform also allow you to plug in your own libraries or those of a preferred analytic vendor? Does it provide an integrated development environment with prepackaged modeling tools, connectors, and language adapters that the team can standardize on to accelerate your geographically wide-ranging big data development programs? Does it come from a vendor that offers a wide range of best-of-breed tools, such as IBM SPSS Modeler, to meet all your development needs? And does that vendor provide a world-class big data professional services capability, such as IBM Business Analytics and Optimization services, to supplement, extend, and bootstrap your internal big data development practice?
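
To illustrate what "plugging in your own libraries" can mean in practice, here is a small Python sketch in which a homegrown model conforms to scikit-learn's estimator interface, which stands in here for a sandbox's shared algorithm library. This is a generic illustration under that assumption, not IBM Netezza Analytics' actual plug-in API.

```python
# Sketch of plugging a custom algorithm into a shared library: the
# scikit-learn estimator contract stands in for the sandbox's common
# interface (an assumption for illustration, not a vendor API).
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Trivial in-house model: always predicts the most common class."""
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

# Because it honors the shared interface, the custom model drops into
# the same evaluation pipeline as any packaged algorithm.
rng = np.random.default_rng(0)
X = rng.random((100, 4))          # synthetic features (illustrative)
y = rng.integers(0, 2, 100)       # synthetic labels
print(cross_val_score(MajorityClassifier(), X, y, cv=5).mean())
```

The design point is the shared contract: when homegrown and packaged models speak the same interface, the whole team can reuse one set of pipelines and tools.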

Your choice of a sandbox is just as important as your commitment to an operational big data platform. Talented people are your most precious resource. The sandbox is where most big data developers will spend most of their productive hours. If you fail to provide them with the scalability they need to run a growing range of jobs, you'll waste their time as they queue up for access to limited processing and storage resources. Likewise, if they don't have access to a common sandboxing platform with a rich library of algorithms and models, you'll make it difficult for them to pool their expertise on common projects using shared tools.

So when it comes to big data development, don't forget to think inside the sandbox—and to grow and deepen that shared resource as your organization's needs evolve.