Big Data & Analytics Heroes
David Birmingham, senior solutions architect at Brightlight Analytics and this week's big data and analytics hero, discusses the power of simplification in big data and analytics architectures. At the same time, he expands on the notion that scalability is the ultimate criterion of whether promising architectural concepts actually succeed in robust deployments.
When getting started with big data and analytics, what were your biggest challenges?
If a business doesn’t have top-down commitment to analytics, its initiatives in this direction tend to limp along. Moreover, successful analytics programs have one or more internal champions who have a roadmap and a vision. Without these guiding elements in place, standing up an analytics program is an uphill battle.
Another equally daunting challenge arises when organizational technologists learn the technologies and techniques to a competency well beyond their pay grade, tie off their responsibilities and leave for better pay or to become consultants. This departure leaves a knowledge gap in the company that is more difficult than ever to fill.
How did you get organizational support for your big data initiatives?
Typically, the top level of an organization falls into two general camps—those who are skeptical of the value of big data and those who are all-in. The second group, while perhaps seemingly the best fit for adoption and support, actually creates a lot of noise as the leaders attempt to shoehorn big data into places where it might not fit at all. This approach can lead directly to program failure.
The better road to adoption and support runs through the folks who are skeptical of big data’s value, because their skepticism forces extra work up front to justify the use of big data. Once this approach has been fully discussed, vetted and justified, it can be a source of enormous lift. The mysteries have been deliberately sifted from the conversation, and there are deliberate, objective roadmaps with real return on investment (ROI).
At their own level, analysts want more self-service without IT intervention. This desire means vendors need to deliberately simplify the user’s ability to stand up, deploy and operate solutions so that analysts can have a more self-contained solution.
How have big data and analytics impacted the way you do your job today?
Staying in front—or near the front—of the technology’s edge is just as much a challenge today as ever. However, companies such as IBM and others are working diligently to congeal and integrate these technologies into a more cohesive form, and the most recent IBM Fluid Query serves as an example. A primary objective of this roadmap should be to simplify the configuration and implementation of the technologies.
Simplification cannot be overrated. The more spinning plates and moving parts in a solution, the more likely one will spontaneously break and leave us hanging. We need it lights-out, hands-free and simple to deploy and operate. I mean, really dirt simple. The aficionados of big data need to take the time to simplify it and, if necessary, give us the libraries and training to bypass the simplicity if we want to make our own products or offerings. But for the average user, simple is better. This simplification requires power, but IBM PureData for Analytics, powered by IBM Netezza technology, has the power; without it, simplicity is impossible. Only with great power comes great simplicity.
When we cross a large-scale bridge over a raging river, we must trust that the bridge architect followed the primary principles of scale and does not require us to stop the car, get out and perform a laundry list of tasks and double checks before we can proceed. Moreover, the vendors of big data solutions need to look to keystroke reduction for their users. This approach includes fewer mouse clicks as well. The more steps users must remember in managing the large-scale system, the more risk is introduced by simple human error. For rote, mechanical tasks, users may seemingly have more surgical control over their fate, but with big data that control is a liability.
Many years ago, someone quite accidentally truncated a table containing over 60 billion records. I didn’t mention this number when I was explaining the problem to small data folks, and their initial reaction was to restore from backup. But how do you just restore 60 billion records? The client had an in-machine archive that was less than a week in arrears. Restoring data from that system and replaying the prior transactions to catch up was far simpler than spending ten or more days attempting to restore from backup.
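The shape of that recovery can be illustrated with a small sketch. The table names and the in-memory sqlite3 database are stand-ins invented for illustration (the real system was Netezza-scale): rather than restoring billions of rows from backup, we copy from the in-machine archive in one set-based operation and then replay the transactions logged since the archive was last synchronized.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Stand-ins for the production table, the in-machine archive
# (less than a week in arrears) and a log of the transactions
# that occurred after the archive was last synchronized.
cur.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE facts_archive (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE txn_log (id INTEGER, val TEXT)")

# The archive holds the bulk of the data; the log holds the recent tail.
cur.executemany("INSERT INTO facts_archive VALUES (?, ?)",
                [(i, f"row{i}") for i in range(1000)])
cur.executemany("INSERT INTO txn_log VALUES (?, ?)",
                [(i, f"row{i}") for i in range(1000, 1010)])

# The production table has been accidentally truncated.
cur.execute("DELETE FROM facts")

# Recovery: one bulk copy from the archive, then replay the log to catch up.
cur.execute("INSERT INTO facts SELECT * FROM facts_archive")
cur.execute("INSERT INTO facts SELECT * FROM txn_log")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM facts").fetchone()[0])  # 1010
```

Both recovery steps are set-based operations inside the machine, which is why replaying less than a week of transactions beat a ten-day restore from external backup.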
Do you think big data and analytics will handle the data growth in 10–15 years, or will we need another shift in technology? Why?
Big data and analytics in general are in many ways driving the growth, so the capabilities are in a constant feedback loop. Years ago, when attempting analytics against a common, general-purpose database, users had to wait hours for their queries to return, if they returned at all. Today, those same queries on a PureData for Analytics machine come back in five minutes. But as we have seen, history is forgotten and now five minutes is intolerable. The point is that if the technology isn’t able to keep up with such aggressive demands today, it won’t be able to keep up with them tomorrow, when five seconds, not minutes, is an eternity.
The shift in technology will arrive in the form of large-scale data processing. Extract-transform-load (ETL) tools were required when the database could not possibly process the data on the inside. Now, PureData for Analytics can process data faster than any ETL tool, and none of the ETL tools have stepped up to harness this kind of power. For example, one ETL tool was required to execute SQL against the Netezza database to shepherd 170 tables’ worth of data, and it needed four SQL operations per table to make it work. That’s 680 SQL operations total. In Netezza, none of these operations required more than one second, meaning that even if we serialized them, they would not take longer than 680 seconds running back to back.
Why did the ETL tool require three hours to run the operations? The ETL tool does one thing well, and issuing SQL statements to Netezza, while functionally accurate, is grossly inefficient. None of the ETL tools are attempting to reduce this inefficiency, and their architects honestly believe that users issuing SQL statements in this manner will execute only a few. But none of the users we work with execute a few; they execute hundreds, if not thousands, of operations and want a tool that can blast them into the machine to make it work harder in less time. ETL tool vendors are largely in denial that the Netezza platform should be used this way, and certainly not this aggressively. But nobody wants to take the data out of the box, crunch it and put it back. That method was great when the database was too weak to do the job. Not anymore.
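The overhead argument can be sketched in miniature. In this hypothetical sketch, an in-memory sqlite3 database stands in for the warehouse, and a small per-statement delay stands in for the ETL tool’s round-trip cost; the numbers are illustrative, not measurements of any real tool.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (n INTEGER)")

# 680 fast, simple operations, mirroring the 170-table example above.
statements = [f"INSERT INTO t VALUES ({i})" for i in range(680)]

# ETL-tool style: one round trip per statement, each paying fixed overhead.
start = time.perf_counter()
for stmt in statements:
    time.sleep(0.001)          # simulated per-statement tool overhead
    con.execute(stmt)
serial = time.perf_counter() - start

# Push-down style: blast the whole batch into the engine at once.
start = time.perf_counter()
con.executescript(";".join(statements))
batched = time.perf_counter() - start

print(serial > batched)  # True: the tool overhead, not the SQL, dominates
```

Even with a one-millisecond overhead per statement, the serialized run is dominated by round trips rather than by the database’s own work, which is the gap between 680 one-second operations and a three-hour ETL job.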
In a nutshell, reducing the latency and overhead of otherwise fast, simple operations will be a key differentiator between the technologies that accelerate ahead and those left woefully behind. This shift is already happening, and none of the ETL tools are stepping up to the challenge.
The general shift in technology is always underway as the market spontaneously adapts to where users are going. The core differentiators will always be storage capacity and the ability to serve requests in an acceptable time frame. What will be key, however, is the cost-effectiveness of the back-end infrastructure that makes all this technology happen: data management, security, storage management, operational integrity and so on. Absolutely everyone can deliver lightning turnaround and unlimited storage given limitless resources to buy highly expensive hardware, but that scenario is not a viable business model. Differentiation will be found in how efficiently people leverage fewer resources to deliver ever-richer experiences.
Scalability and adaptability always rest on stability, because neither is available on an unstable platform. When adaptability and scalability are first leveraged, however, well-meaning people often make compromise decisions that deliver the request but detract from stability. Once stability is lost, adaptability and scalability are lost with it. A scalable and adaptable platform has to preserve the original investment, becoming more stable, scalable and adaptable with each new challenge. Some of this progress requires experience and forethought, but mainly it requires loyalty to the basic principles of scale. When going big, ideas that work on smaller scales are crushed to powder. Big can be merciless, and it requires deliberate harnessing and control. Many IT shops don’t have the acumen or the will to embrace these issues, so it is incumbent upon vendors to have them well in hand, simplified and harnessed before users even break the plastic.