Rising Heroically to the Scalability Challenge
Can commodity tools step up to the big leagues of big data analytics?
At a recent proof-of-concept exercise, a vendor took its offering through the necessary paces to demonstrate how it could meet its client’s steep data processing needs. Hopes were quite high. The marketing slicks said it all. If everything were true, this product should handily harness and master the capabilities of IBM® PureData® System for Analytics, powered by IBM Netezza® technology. And yet, the event that unfolded could only be described as a circus, if not a serial tragedy. One test demonstration after another failed, and some failures were catastrophic, crashing the program altogether. Who would purposefully walk into such a failure? Well, nobody would. And yet the fact that this vendor did was simply indicative of it not understanding the Netezza platform. When leaving their offices prior to the demo, the reps no doubt snatched up a spare CD containing the product’s most recent version on their way out, hopped into the car, and drove blissfully toward their doomed presentation. Each demonstration of the product’s capability failed for a common reason: a lack of understanding about the machine, or, put another way, a presumption that it behaves like any other database. Enterprise products generally presume the database is aligned or configured in a way that resembles a commodity database such as Oracle Database or Microsoft SQL Server. These platforms have features, such as indexes, that the product can leverage, and they treat management and performance as a function of the cleverness of a SQL statement, or of several clever SQL statements applied together. This presumption fails with the Netezza architecture because it has no indexes and is otherwise a very physical machine.
The commodity zone for the many
Performance in a Netezza machine cannot be derived through manipulating logical SQL. A quick examination of the architecture’s internal organization reveals why. A table in Netezza is logically represented in the catalog only once, but at the machine level it is a collection of data slices. What’s a data slice? In a one-rack Striper model with 140 data slices, each data slice pairs a processor with disk storage, and each enjoys a shared-nothing relationship with the others at the hardware level. In the IBM TwinFin® model, processors and disk drives are one for one; in the Striper, one processor serves multiple disk drives. Either way, a disk drive interacts with only one processor, and that processor’s resources are self-contained, shared with no other processor or disk drive. If 140 million records are loaded into a randomly distributed table on a single-rack Striper—the Netezza N2001 next-generation servers that IBM branded as Striper—each data slice receives 1 million records. There is another layer, however: each data slice is itself a collection of physical disk pages, and a Netezza machine reads pages, not rows. A Netezza N1001 model—previously, the IBM Netezza TwinFin® server family—can read roughly 600 pages per second, whereas a Striper can read 1,000 pages per second. Once a page is in the field-programmable gate array (FPGA), the FPGA sifts the necessary records from the page using hardware filtration the same way a gold miner sifts gold from ore. These concepts, a collection of data slices and a collection of pages, are quite foreign to many commodity tools that handle only rows. Logically, a table can be considered a collection of rows, but at the physical level a table is ultimately a collection of disk pages per data slice. The more data slices that are active, and the fewer disk pages that are read, the faster the query.
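The arithmetic behind this claim can be sketched in a few lines. The slice count and per-second page rates below come from the figures above; the rows-per-page value is an illustrative assumption, and the article does not say whether the page rates are per slice or aggregate, so this sketch assumes per slice:

```python
# Back-of-the-envelope model of a parallel table scan across data slices.
# SLICES and the page rates are taken from the article; PAGE_ROWS is a
# hypothetical figure chosen only to make the arithmetic concrete.

SLICES = 140            # data slices in a one-rack Striper
ROWS = 140_000_000      # rows loaded into a randomly distributed table
PAGE_ROWS = 1_000       # assumed rows per disk page (illustrative)

rows_per_slice = ROWS // SLICES               # even random distribution
pages_per_slice = rows_per_slice // PAGE_ROWS

# Every slice scans its own pages in parallel, so elapsed scan time is
# governed by pages per slice, not total pages in the table.
striper_secs = pages_per_slice / 1_000   # Striper: ~1,000 pages/sec
twinfin_secs = pages_per_slice / 600     # N1001:  ~600 pages/sec

print(rows_per_slice, pages_per_slice)   # 1000000 1000
```

Under these assumptions, halving the pages each slice must read (say, through better organization) halves the scan time, while adding idle slices back into the scan divides the per-slice page count again, which is exactly the "more slices active, fewer pages read" rule of thumb.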
How distribution is managed is declared in the way the table is physically laid out on the machine—DISTRIBUTE ON ( )—and how the rows themselves are optimized for fast query turnaround is declared with ORGANIZE ON ( ). Both are coupled, of course, with the consuming query that leverages them: the query must regard the distribution, through its join logic, and the organization, through its where-clause filters, to get the maximum boost. Massively parallel processing (MPP) in Netezza is shared-nothing hardware, while symmetric multiprocessing (SMP) is shared-everything hardware. Performance is in the physics, and the shared-everything nature of SMP dissipates the machine’s power, even more so as the data grows. This distinctive difference between an MPP machine such as Netezza and a commodity SMP machine, such as those supporting Oracle or SQL Server, is why SMP will never scale to the level of an MPP machine. It is also why most commodity products for backup, archive, replication, and so on are not prepared for the radical scale that MPP natively supports. These technologies are geared to deliver in the commodity zone, where 80 percent of all solutions reside. After all, not everyone has a need for the Netezza machine’s scale, but those that do need something better than the average commodity tool serving the 80 percent of the market that doesn’t have a scalability problem. The point is, anyone can be a hero in the 80 percent commodity zone; what is needed is a champion in the 20 percent zone, where the big data heat is highest and a technology that can step up is desired.
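The reason join logic must regard the distribution can be illustrated in ordinary Python. The sketch below mimics what DISTRIBUTE ON implies: each row's distribution key is hashed to pick a data slice, so two tables distributed on the same join key land matching rows on the same slice and can be joined with no cross-slice data movement. The hash function, slice count, and table names here are illustrative, not Netezza's actual algorithm or schema:

```python
SLICES = 140  # data slices in a one-rack Striper (from the article)

def slice_for(key) -> int:
    """Map a distribution-key value to a data slice (illustrative hash)."""
    return hash(key) % SLICES

# Two hypothetical tables, both distributed on order_id.
orders = [(oid, f"customer-{oid % 7}") for oid in range(1, 1_001)]
lineitems = [(oid, round(oid * 1.5, 2)) for oid in range(1, 1_001)]

# Because both tables hash the same key the same way, every lineitem's
# matching order row lives on the same data slice: the join is local.
colocated = all(slice_for(o[0]) == slice_for(li[0])
                for o, li in zip(orders, lineitems))
print(colocated)  # True
```

Join the same two tables on a column that is not the distribution key and the matching rows no longer co-locate, which is the shared-nothing version of "the consuming query must regard the distribution."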
Heavy lifting for the few
The aforementioned wannabe product had an unforeseen challenge. When one of the test demonstrations crashed, the vendor discovered that it was because an internal variable was an integer instead of a big integer. This may sound like a geeky discovery, but its significance is profound. A signed 32-bit integer tops out at roughly 2.1 billion, a fatal flaw in a product that must move or manage many billions of rows at a time. Moreover, the solution was standardized on the integer type, meaning the flaw was architectural, not a feature that could be fixed easily. This pattern is typical of the 80 percent heroes; they are standardized on commodity scales and would never imagine hitting hundreds of millions, much less multiple billions, of rows. The reps for this particular vendor said they would go back to the shop, fix the problems, and schedule another proof of concept. Clearly, however, they wouldn’t be back. Why? They saw that modifying the product would require an architectural overhaul, which is nontrivial, and extensive retesting of the entire product is expensive. Given that end users in the 80 percent commodity world would be perfectly happy with the product as is, there was no incentive for the vendor to step up to the needs of one customer, or even a few. Those living in the 20 percent zone tend to forget that commodity vendors have no market incentive to fortify their products for extreme heavy lifting. The market for such a solution is too small, and the investment is too costly and steep. Imagine one of them coming out with such a product and charging more than similar products in the 80 percent realm: sticker shock would ensue, and everyone would wonder what the vendor could have been thinking. And if it plans to sell the product at the same price as an equivalent commodity, knowing that it cannot recoup the development costs, what does that strategy say about the longevity of its business model?
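The failure mode is easy to reproduce. Python's own integers are unbounded, so the helper below mimics the two's-complement truncation of a fixed-width 32-bit counter; it is an illustration of the class of bug, not the vendor's actual code:

```python
# A signed 32-bit row counter tops out at 2,147,483,647 and wraps
# negative one row later -- the architectural flaw described above.

INT32_MAX = 2**31 - 1  # 2,147,483,647

def to_int32(n: int) -> int:
    """Truncate n to a signed 32-bit value, as a fixed-width counter would."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

print(to_int32(INT32_MAX))      # 2147483647 -- the last safe row count
print(to_int32(INT32_MAX + 1))  # -2147483648 -- one more row wraps negative
print(to_int32(5_000_000_000))  # a 5-billion-row batch silently garbles
```

A negative or wrapped row count deep inside a data-movement product is exactly the kind of state that crashes a demo, and because every counter in the product shared the type, the fix was an overhaul, not a patch.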
Quite simply, Netezza solves extreme data problems, but the rest of the marketplace is slow to catch up. The problem is intrinsic not to Netezza, but to the amount and complexity of data the solution must harness.
Test to the limit for all
Some excellent products do support the capabilities of a Netezza machine, but before procuring one for a particular requirement, a proof of concept is warranted to establish objective answers. Make the vendor prove its claims. Don’t softball the process by giving the vendor easy problems to solve. Every domain has a hardest problem, so compel the vendor to solve it as part of the proof of concept; if the product cannot address the hardest problem now, what will happen on day one after installation? Objectively evaluating the tools and technologies already serving other data management needs is also important. If vendors cannot step up, shoehorning them into the process only makes things worse and doesn’t solve the problem. At capacities of extraordinary scale, what is needed are solid, reliable products that simplify data management and logistics. Please share any thoughts or questions in the comments.