The Spark that ignites the insight economy

Big Data Evangelist, IBM

Big data analytics is a fundamental force driving innovation and creativity in today’s insight economy. Open source big data tools such as Apache Hadoop are the heart and soul of this revolution. They are designed to deliver data-infused analytics guidance and recommendations into a staggering range of business and consumer applications.

As anybody who attended last week’s Hadoop Summit could see, Hadoop has gained broad adoption, matured rapidly, remained amazingly innovative and spawned a wide range of subprojects that fit together into an increasingly robust ecosystem. The Hadoop ecosystem continues to evolve at a rapid pace, and the future is increasingly close at hand. Today in San Francisco, IBM issued a landmark set of announcements around Apache Spark, a highly promising and widely adopted recent codebase that’s included in the core Hadoop distribution. Taken as a whole, the latest announcements from IBM signal its commitment to fostering an open, mature and innovative industry ecosystem to accelerate Spark adoption. But before launching into a detailed discussion of the announcements, it is important to understand exactly why IBM has made such a substantial strategic bet on Spark.

IBM sees Spark as one of the primary development tools for 21st century data scientists and other data-driven application developers. Spark is a power tool for developers to address challenges involving new and creative blends of in-memory, streaming, graph analysis and machine learning analytics. It can significantly boost the productivity of developers whose companies have put big data analytics at the heart of their business model. Apache Spark provides a platform that brings application developers, data scientists and data engineers together in a unified environment that is easy to use without being resource-intensive. It radically simplifies the process of developing and deploying intelligent applications fueled by data. And it provides a unified open source programming framework, algorithm libraries and high-performance runtime engines that are well suited for tomorrow’s big data analytics challenges.

Spark’s tooling and runtime infrastructure enable developers to rapidly refine their statistical algorithms through fast, iterative modeling on massively parallel server clusters, farms and cloud-based platforms. Drawing on Hadoop and other data lakes, Spark enables streamlined collaboration among data scientists, data-driven application developers and data engineers through an environment that is conducive to reuse and sharing of data, algorithms and other assets.
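To illustrate why fast, iterative modeling benefits from data that stays in memory, here is a minimal sketch in plain Python. It does not use Spark or any Spark API; the function name `fit_line`, the sample data and the step settings are all hypothetical choices made for illustration. The point it shows is that an iterative algorithm such as batch gradient descent re-scans the full dataset on every step, so keeping that dataset resident in memory avoids paying a reload cost thousands of times.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]],
             steps: int = 2000,
             lr: float = 0.05) -> Tuple[float, float]:
    """Fit y = w*x + b by batch gradient descent on mean squared error.

    Every step makes a full pass over `points`, which is why
    iterative algorithms favor data that is already held in memory.
    """
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in points) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in points) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical in-memory dataset generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_line(data)
```

On a cluster the same access pattern applies at much larger scale: the dataset is partitioned across machines and each iteration revisits every partition.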

With so much at stake for its customers, IBM has opted to take a multipronged approach in announcing its deep commitment to Spark.

Integrate Spark throughout the IBM solution and service portfolio

IBM is expected to build Spark into the core of its analytics and commerce platforms. Plans call for Spark to be embedded in solutions that, now or in the near future, incorporate machine learning technology. IBM already supports Spark in IBM InfoSphere BigInsights 4.1 for Apache Hadoop, in both cloud-based and on-premises applications, where users can leverage Spark as an alternative in-memory platform to accelerate MapReduce applications.

IBM also already supports Spark on IBM InfoSphere Streams, using adapters that enable analytics written in Java or the Spark machine-learning library (MLlib) to be shared between Streams (for event-by-event analytics) and Spark (for minibatch, time-based, low-latency applications). The solution road map calls for eventual delivery of diverse software and cloud-based offerings built on Spark, server infrastructure such as IBM Power Systems to host Spark applications, and consulting services that help clients build and deploy Spark applications.
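The difference between event-by-event and minibatch, time-based processing can be sketched in a few lines of plain Python. This is an illustration only: it uses neither the InfoSphere Streams nor the Spark API, and the names `Event`, `process_per_event` and `process_minibatches`, along with the sample data, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Event:
    timestamp: float  # seconds since the start of the stream
    value: float


def process_per_event(events: Iterable[Event], threshold: float) -> List[float]:
    """Event-by-event style: examine each event individually and
    return the timestamps of events whose value exceeds the threshold."""
    return [e.timestamp for e in events if e.value > threshold]


def process_minibatches(events: Iterable[Event], interval: float) -> Dict[int, float]:
    """Minibatch style: group events into fixed time windows of
    `interval` seconds and compute one average value per window."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for e in events:
        window = int(e.timestamp // interval)  # index of the time window
        sums[window] = sums.get(window, 0.0) + e.value
        counts[window] = counts.get(window, 0) + 1
    return {w: sums[w] / counts[w] for w in sorted(sums)}


# Hypothetical sample stream.
stream = [
    Event(0.2, 1.0), Event(0.7, 5.0),
    Event(1.1, 2.0), Event(1.9, 4.0),
    Event(2.5, 9.0),
]
alerts = process_per_event(stream, threshold=4.5)
averages = process_minibatches(stream, interval=1.0)
```

The per-event function can react as soon as a single record arrives, while the minibatch function produces one result per time window, trading a small delay for the ability to run aggregate analytics over each batch.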

IBM is also expected to offer tools that surround the core Spark technology to make it easy to consume. And IBM plans to soon offer Spark as a managed, subscription-based, on-demand cloud service on the IBM Bluemix platform. The service is designed to let developers quickly load data, model it in Spark and develop a predictive Spark algorithm for use in their applications.

Conduct extensive, worldwide research and development into Spark-related projects

IBM Research currently has over 30 active Spark projects that address technology underneath, inside and on top of Spark. In addition, IBM has put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs around the world. Furthermore, IBM is one of four founding members of the University of California, Berkeley’s AMPLab, where Spark was invented in 2009, and it remains a very active sponsor of and collaborator in the lab’s work. IBM engages with the lab in multiday research retreats, provides advice and real-world insight and interacts closely with its researchers on projects of mutual interest. And IBM recently launched a very successful internal Hack Spark Challenge involving 10,000 IBM developers, resulting in over 100 new Spark applications being built in less than two weeks.

Contribute machine learning technology to the Spark community under open source licensing

IBM has open sourced its SystemML machine learning technology for the benefit of the Spark open source ecosystem. Through this move, and by engaging deeply in machine learning projects both with the Spark community and with diverse industry partners, IBM expects to improve adoption of high-quality machine learning algorithms across the Spark ecosystem.

IBM SystemML makes expressing scalable machine learning algorithms fast and easy, thereby improving developers’ ability to deploy statistical models that can automate learning of correlations and patterns directly from the data. By contributing SystemML, IBM can help data scientists iterate rapidly to address the changing needs of business and can enable a growing ecosystem of application developers to build deep intelligence into every project.

Establish a center of excellence for Spark developers

IBM has opened the Spark Technology Center at its facility in downtown San Francisco to help data scientists and other developers foster design-led innovations in applications powered by big data analytics. The center, designed to accommodate up to 300 resident data practitioners, developers and designers, accelerates Spark-enabled innovations, develops machine learning skills and helps developers infuse Spark’s rich analytics into myriad practical applications. It is set up so that designers, developers and data practitioners can collaborate side by side on disruptive, creative and sophisticated applications of Spark technology. The Spark Technology Center is expected to be a focal point for IBM’s engagement with the Spark open source community, and it will also create open and free educational assets to accelerate Spark adoption everywhere.

Engage deeply with the Apache Spark open source community

IBM is expected to engage deeply with the Spark open source community to develop sophisticated machine learning assets, drawing on the open source SystemML algorithms and speeding development of sophisticated Spark applications that make optimized use of these assets. In this way, IBM is simply carrying on its long history of support for and contributions to open source communities. IBM has actively and creatively participated in such diverse open source communities as Hadoop, Linux and Eclipse. To date, IBM has contributed 12.5 million lines of code to Eclipse alone, not to mention Linux, where IBM-contributed code accounts for 6.3 percent of total contributions to date.

Educate IBMers, partners and customers on Spark

IBM is educating more than one million data scientists and data engineers on Spark. It undertakes this ongoing education program in collaboration with strategic Spark partners such as AMPLab, DataCamp, MetiStream, Galvanize and the Big Data University massive open online course (MOOC).

Take your Spark journey to the next step. IBM invites you to a free 3-month trial of IBM Analytics for Apache Spark and IBM Cloudant. Use Spark in the cloud to conduct fast in-memory analytics on your Cloudant JSON data. Sign up today and also receive free SaaS Startup Advisory Services to help you accelerate your time to results.