Build more adaptive applications with Apache Spark

Manager of Portfolio Strategy, IBM

Startups, particularly those in tech hot spots, have the potential to achieve historic success. Amazon, eBay and Facebook are just a few examples. What do these innovative businesses have in common? They had data science expertise and a strong vision.

Unfortunately, most organizations do not fully leverage sophisticated data analysis. The fragmentation of existing data science approaches tends to prevent the integration of data analysis into applications, and data science workflows tend to be unclear, inconsistent or nonstandardized. Without organizational commitments to improving data science skills, building communities among data scientists and investing in new tools, the promise of this technology can seem out of reach for most startups.

Apache Spark is gaining momentum in the data science community because it helps address these challenges. IBM views Spark as the analytics operating system because it unifies the data science workflow, enabling developers, data engineers and data scientists to collaborate in more agile team environments.

The data science workflow consists of several steps: understand the business goal; profile, explore and prepare the data; consult with subject-matter experts; build and deploy the applications; and validate and refresh analytical models often. As the analytics operating system, Spark provides a unified platform that enables data science professionals to execute these steps in a structured, collaborative, non-linear and iterative fashion. Constant iteration in the data science workflow is essential for teams to drive better business outcomes from the statistical models they produce.
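
To make these steps concrete, here is a minimal sketch of what one pass through the workflow might look like in Spark's Scala API. The input file, column names and choice of model are illustrative assumptions, not details from any particular deployment:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.VectorAssembler

    object WorkflowSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

        // Profile and explore: load the data and inspect summary statistics.
        val raw = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("customers.csv")                    // hypothetical input file
        raw.describe("age", "tenure").show()

        // Prepare: drop incomplete rows and assemble a numeric feature vector.
        val prepared = new VectorAssembler()
          .setInputCols(Array("age", "tenure"))    // hypothetical feature columns
          .setOutputCol("features")
          .transform(raw.na.drop())

        // Build: fit a model on a training split.
        val Array(train, test) = prepared.randomSplit(Array(0.8, 0.2), seed = 42)
        val model = new LogisticRegression().setLabelCol("churned").fit(train)

        // Validate: score the held-out split. In practice this step is rerun
        // as new data arrives, which is the "refresh" part of the workflow.
        val auc = new BinaryClassificationEvaluator()
          .setLabelCol("churned")
          .evaluate(model.transform(test))
        println(s"Area under ROC: $auc")

        spark.stop()
      }
    }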

Think about your organization. Hadoop and relational databases are often used to store data, IBM SPSS or R is used to build models, and Microsoft Excel and IBM Watson Explorer help visualize and explore data. But how do these solutions work together? How do the developer, the engineer and the subject-matter expert share a common workflow?

Spark offers:

  • Scalability across users, data and applications
  • In-memory data processing for fast response times
  • Data access through a SQL interface and other standard data access methods
  • Analytics APIs for machine learning and graph processing
  • Programming in Java, Scala and Python so developers can get started quickly in familiar languages (see the sketch below)
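
As a hedged illustration of a few items on this list, the sketch below loads hypothetical JSON event data, caches it in memory and queries it through the SQL interface (the file name and fields are assumptions):

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

        // Load semi-structured JSON into a DataFrame; Spark infers the schema.
        val events = spark.read.json("events.json")   // hypothetical input

        // Cache the DataFrame in memory so repeated queries avoid re-reading disk.
        events.cache()

        // Query through the standard SQL interface.
        events.createOrReplaceTempView("events")
        spark.sql(
          """SELECT userId, COUNT(*) AS clicks
            |FROM events
            |GROUP BY userId
            |ORDER BY clicks DESC
            |LIMIT 10""".stripMargin).show()

        spark.stop()
      }
    }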

A number of IBM clients are already using Spark; take a look at these Ignite talks. Independence Blue Cross is using Spark to analyze clinical data and scanned images to identify patients with hip implants who are at high risk of complications.

Plus, IBM, NASA and the SETI Institute are collaborating to analyze terabytes of complex deep-space radio signals using Spark’s machine-learning capabilities in a hunt for patterns that might indicate the presence of intelligent extraterrestrial life. The IBM jStart team has joined with the SETI Institute to develop a Spark application to analyze the 100 million radio events detected by the Allen Telescope Array (ATA) over several years. The complex nature of the data demands sophisticated mathematical models to tease out faint signals and machine-learning algorithms to separate terrestrial interference from true signals of interest. These requirements are well suited to the scalable in-memory capabilities offered by Apache Spark, especially when combined with the big data capabilities of IBM Cloud Data Services. 
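
The actual jStart application is not shown here, but as a purely illustrative sketch of this kind of job, the snippet below clusters hypothetical per-event features with MLlib's k-means so that clusters dominated by known terrestrial interference can be set aside; the file, columns and choice of algorithm are assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    object SignalClusterSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("signal-clusters").getOrCreate()

        // Hypothetical per-event features extracted from ATA radio events.
        val events = spark.read.parquet("ata_events.parquet")

        val features = new VectorAssembler()
          .setInputCols(Array("frequencyMHz", "driftRate", "snr"))  // assumed columns
          .setOutputCol("features")
          .transform(events)

        // Cluster the events; clusters dominated by known terrestrial
        // interference can then be filtered out so analysts focus on the rest.
        val model = new KMeans().setK(20).setSeed(1L).fit(features)
        model.transform(features)
          .groupBy("prediction").count()
          .orderBy("prediction").show()

        spark.stop()
      }
    }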

Take the next step in your Spark journey. IBM invites you to a free three-month trial of IBM Analytics for Apache Spark and IBM Cloudant. Use Spark in the cloud to run fast in-memory analytics on your Cloudant JSON data. Sign up today and also receive free SaaS Startup Advisory Services to help you accelerate your time to results.
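
As a rough sketch of what that might look like, the snippet below reads a Cloudant database into a Spark DataFrame using the spark-cloudant connector and runs an in-memory aggregate. The host, credentials, database and field names are placeholders, and the option names should be checked against the connector's current documentation:

    import org.apache.spark.sql.SparkSession

    object CloudantSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("cloudant-sketch").getOrCreate()

        // Read a Cloudant database into a DataFrame. This assumes the
        // spark-cloudant connector package is on the classpath; the host,
        // credentials and database name are placeholders.
        val docs = spark.read.format("com.cloudant.spark")
          .option("cloudant.host", "ACCOUNT.cloudant.com")
          .option("cloudant.username", "USERNAME")
          .option("cloudant.password", "PASSWORD")
          .load("my_database")

        // Cache the documents in memory and run a simple aggregate.
        docs.cache()
        docs.groupBy("type").count().show()   // "type" is a placeholder field

        spark.stop()
      }
    }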