Big Data, Fractal Geometry, and Pervasively Parallel Processing

Scaling from the macro to the nano levels is the key to powerful performance

Big Data Evangelist, IBM

Big data pushes the scalability frontier. But where does the frontier begin and end?

When people think about big data, they generally envision macroscopic scaling: in other words, bigness that spans the globe or, at the very least, some large data center. They associate big data platforms with server farms that sprawl across huge tracts of real estate and are populated by ever-larger racks of processing, storage, memory, and interconnect nodes arranged in endless rows.

However, big data platforms must also push the nanoscopic frontier of scalability. It's useful to think of a consolidated big data platform, architecturally, as a fractal structure. What that means is that big data must be self-similar on all platform scales—from macro to nano—and leverage the full parallel processing resources that are available at each level.

Big data's self-similarity comes from one of its central architectural features: pervasive parallelization. This architectural principle should operate concurrently at three levels of scalability in the architecture of your big data platform:

  • Scale-out: This level is often referred to as massively parallel processing (MPP) or horizontal scaling. It is the macro-level scaling that involves many nodes, servers, clusters, and grids operating in parallel within and across data centers. To the extent that they leverage a shared-nothing architecture, scale-out (MPP) multi-node big data platforms enable administrators to maintain service levels as their workloads expand into the petabytes. When the number of nodes grows to the dozens, hundreds, or thousands, shared-nothing MPP is generally the best approach. A shared-nothing architecture eliminates the dependencies of nodes on common storage, memory, and processing resources, so that adding nodes adds capacity in close to linear proportion.
  • Scale-up: This is often referred to as "symmetric multiprocessing" (SMP) or "vertical scaling." Query processing, ETL jobs, in-database analytics, and other big data workloads have many fine-grained processes that can be accelerated through the server platform's native shared-memory SMP features, which leverage parallel processing on a meso scale that can be grown through rack/blade architectures within system chassis. Every node in a big data infrastructure should scale up through more intensive application of SMP on machines with ever-speedier CPUs, more RAM, more I/O bandwidth, and bigger, faster disk subsystems.
  • Scale-in: This is parallelization at a nano level or, as I like to think of it, "infinitesimally parallel processing." At this level, scale-in architecture focuses on engineering ever more densely packed interconnections among ever-tinier processing, storage, memory, and other components. It also involves tooling densely packed integrated systems with elastic provisioning and flexible virtualization features. And it involves integrating networking, storage, resilience, and system management capabilities into a single system that is easy to deploy and manage, either in stand-alone mode or within your multi-rack server farms. Scale-in architectures enable you to add workloads and boost performance within existing densely configured nodes, each of which should be an expert integrated system. You can execute dynamic, unexpected workloads with linear performance gains while making the most efficient use of existing server capacity. And you can significantly scale your big data storage, application software, and compute resources per square foot of precious data center space.
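
To make the shared-nothing idea behind scale-out a little more concrete, here is a minimal Python sketch, not a description of any particular product. It hash-partitions records so that each key lands on exactly one "node," lets every node aggregate only its own shard, and then merges the partial results. The function names (partition, local_aggregate, scale_out_sum) and the sample records are hypothetical, and worker threads merely stand in for separate servers so the example runs on a single machine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition(records, num_nodes):
    """Hash-partition (key, value) records so each key lands on exactly one node."""
    shards = [[] for _ in range(num_nodes)]
    for key, value in records:
        # crc32 gives a stable hash, so a key always maps to the same shard.
        shards[zlib.crc32(key.encode("utf-8")) % num_nodes].append((key, value))
    return shards


def local_aggregate(shard):
    """Work done by one node: it reads only its own shard and shares no state."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def scale_out_sum(records, num_nodes=4):
    """Partition the data, aggregate each shard in parallel, then merge the partials."""
    shards = partition(records, num_nodes)
    # Assumption: threads stand in for independent servers in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, shards))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical sample data: (region, amount) pairs.
records = [("eu", 3), ("us", 5), ("eu", 4), ("apac", 1), ("us", 2)]
totals = scale_out_sum(records, num_nodes=3)
```

Because no shard depends on another, the per-key totals come out the same whether the work runs on one node or many; only the merge step touches more than one partition. That independence is the property that lets a scale-out platform add nodes without adding contention.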

Scale-in will steadily increase its importance in your scalability strategy. Miniaturization remains the juggernaut uber-trend, and subatomic density is its frontier. We all know that today's handheld consumer gadgets have far more computing, memory, storage, and networking capacity than the state-of-the-art mainframes that IBM and others were selling back in the Beatles era. And we're all starting to get our heads around quantum computing, atomic storage, synaptic computing, and other "scale-in" approaches that will keep pushing Moore's Law forward for the foreseeable future.

For enterprises that are serious about pervasively scaling their consolidated big data architectures, a balanced strategy should incorporate all three approaches—scale-up, scale-out, and scale-in—depending on the workloads you're trying to optimize.

You will need to scale big data elastically and concurrently toward both the infinite and the infinitesimal.

What do you think? Let me know in the comments.
