Lean data science with Apache Spark
I treat every data science project as if it were a lean start-up. Here’s how it goes: I picture myself as a dashingly handsome entrepreneur who’s determined to change the world—or change how something works at the company that hired me. My journey begins by understanding what works and what’s broken for my customers—the internal or external stakeholders:
- What tasks are they trying to accomplish?
- What pains are they feeling?
- What value can I bring to their lives with data science?
By focusing on the answers to these questions, I generate a vision for my product. Sometimes I envision a new process or tool to aid corporate decision making, sometimes the vision is an optimization of an existing process, and sometimes it's a new, insights-driven data product to be used internally or sold externally.
Armed with my product vision, I prepare to win the battle for funding—company resources such as help from others, computers and so on. Venture capitalists (VCs)—my executive sponsors—are going to be looking to fund a company with an experienced management team and a product with traction. Because I have only an idea at the outset, I have to start with friends, family and fools—work friends, interns, people I can sucker into helping me, and my own 20 percent time. I also turn to angel investors—my manager or the product manager—to raise a seed round.
Once I’ve raised my seed round—usually in the form of approval to spend some of my time on it and some other minimal resources—the clock starts ticking. My goal is to prove that I’m worthy of funding and that my product solves a real problem for customers before I’m out of runway.
Having won the first battle for funding, I continue to fight simultaneously on two fronts: ensuring I can deliver on my vision and ensuring customers will buy into my vision if I successfully deliver. First, I use a variety of methods—discussions, mockups, results-first presentations, prototypes and so on—to anticipate whether potential customers will value my product. Then, I drive the feature set—data integration work, number of data sources, complexity of algorithms and so on—down to the absolute minimum viable product (MVP) to reduce the development effort and the time it takes to gain traction.
Many start-ups—and many data science projects—don't make it through this stage unchanged. Only by doing, exploring, testing their assumptions, and asking hard questions can an entrepreneur learn the true problems and complexities. If I fail to build to my vision, or if I fail to validate the need for my product, my options are to either pivot by modifying my vision based on what I've learned, or walk away from it. Iterating on the vision at least a few times is the norm.
When I manage to produce an MVP and my product starts to show early traction with customers, only then can I begin the process of scaling. I build partnerships with other companies—the data engineering and product management teams. I form close relationships with customers—my stakeholders—through marketing, educating and consulting. Finally, I raise VC funding—the fresh resources granted by executive sponsors. With these new resources, I continue to add features and functionality to the MVP, delivering more value than ever to my customers.
Some data science tools support this philosophy and some do not. When evaluating data science technologies, I’m most influenced by learning curve, ease of use, prototyping speed and the ability to scale my prototype.
The right tool for the job
One modern tool that supports my data-science-as-start-up philosophy is Apache Spark. The community behind Spark seems to be building the tool using these lean product principles. In particular, I applaud its effort to prioritize disciplines that are held in high esteem in the start-up world but are frequently missing from open source and proprietary tools: design and usability. Relative to other tools, Spark's application programming interface (API) design is clean and intuitive, and its programming guides are helpful and easy to read. The web UI that Spark automatically serves on port 4040 is, on occasion, useful for debugging problems or understanding what's happening in an application.
Spark anticipates my need for rapid prototyping by taking developer productivity seriously. Many hours have been poured into enabling developers with different backgrounds to quickly build Spark applications: extending the native Scala API with Java and Python APIs was the first of several big leaps. Spark SQL, by introducing schemas, significantly reduces the complexity of working with structured data. GraphX and the Spark machine-learning library (MLlib) provide the scaffolding for applications that depend on network analysis or machine learning, and DataFrames and the R API allow data scientists to be productive with Spark.
In addition, Spark allows me to scale my data science by improving and refining functionality and by integrating into the rest of my workflow. Its open source design gives me flexibility to inspect and override the functioning of specific code, and its integration with the Apache Hadoop ecosystem combined with its multi-language support means I can easily connect to other parts of my extract, transform, and load (ETL) process and analysis pipeline. These capabilities make it easy to stay within the Spark ecosystem while I improve existing features and extend my MVP with new functionalities based on what I learn about the true needs of my stakeholders.
New and improved
As the Spark 2.0 release nears, I'm excited to see this prioritization of design, developer productivity and ecosystem integration extend to improved functionalities such as structured streaming and the unification of the Dataset and DataFrame APIs.
For inspiration on this topic, read The Four Steps to the Epiphany (K&S Ranch, July 2013) by Steve Blank, my lean start-up guru. I would love to hear reactions and comments from you.
And be sure to attend the Apache Spark Maker Community event on 6 June 2016 at Galvanize in San Francisco's SoMa district. Focused on topics of interest to data scientists, data application developers and data engineers, the event features special announcements, a keynote, a panel discussion and a hall of innovation.
Leading industry figures have committed to participating, including John Akred, CTO at Silicon Valley Data Science; Matthew Conley, data scientist at Tesla Motors; Ritika Gunnar, vice president of offering management for IBM Analytics; and Todd Holloway, director of content science and algorithms at Netflix. Be sure to register for the in-person event. If you aren't able to attend in person, register to watch the livestream instead.