Datapalooza: What’s so funny 'bout peace, love and data science?

Big Data Evangelist, IBM

Data scientists are key application developers in the era of cognitive computing, big data analytics, and cloud-based data services. Today’s developers realize their future success depends on their ability to become data science professionals. At the very least, this ability involves acquiring data science skills and tools, focusing their attention on this line of work and engaging with the worldwide community around topics such as machine learning and streaming analytics.

Surmounting the learning curve requires the passion and determination to enter what some call the sexiest profession of the 21st century. Are you interested in becoming a data scientist? If you already are a data scientist, are you interested in deepening your professional skills and knowledge? Have you shown that you can build and deploy a production-grade data science product at an accelerated pace, in an agile environment, using Apache Spark, Apache Hadoop and other leading tools and platforms?

A range of paths into data science

Data scientists each have their own background, goals and emphases. Some are working to apply statistical methods and mathematical techniques to optimize the business and improve customer relationships. Others seek to build deep analytics such as machine learning and natural language processing in a world that is accustomed to querying databases. No matter which type of data scientist you are striving to become, you must be able to combine creative and analytical skills and recognize data science patterns in disparate real-world problem domains. You also need to unearth undiscovered patterns, bring an artistic flair to how results are visualized and present it all in a data-driven narrative that draws on storytelling skills.

Once you’ve decided what sort of data scientist you wish to become, you may take any of several paths into the profession. Some data scientists acquired the requisite skills and knowledge in degree-granting institutions of higher learning. Others acquired them in whole or in part on the job, while others acquired them on their own. Some have even plugged into open source communities to bootstrap their way into this profession.

Data science professionals often feel disconnected. Typically, they have to go it alone, finding new data sources, domain-specific models and suitable algorithms, along with guidance on how to use and deploy them. Self-study is often required to learn new technologies such as Hadoop or Spark. They don’t have a common way to share work products, iterate quickly and learn from each other.

Unless you’re connected to a company such as Facebook, Google or IBM that has dedicated data science teams, learning about data science independently can be a daunting task. Community engagement is an important path into data science for the new generation of developers.

Core data science skills and knowledge

Imagine a day when every data source is known and accessible, every algorithm is easy to understand and computation technology is available and easy to deploy to meet our needs. To realize that potential, some core skills and knowledge can be gained by plugging into the data science community.

Adopting leading paradigms and best practices

Data scientists need to acquire a grounding in core concepts of data science, analytics and data management. They also need to gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They need to be instructed on the various roles of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they need to learn a standard approach for establishing, managing and operationalizing data science projects in the business.

Applying appropriate algorithms and statistical modeling techniques

Data scientists need to obtain a core understanding of several disciplines:

  • Association rules
  • Basic statistics
  • Bayesian and Monte Carlo statistics
  • Classification
  • Cluster analysis
  • Data mining
  • Decision trees
  • Experimental design
  • Forecasting
  • Linear algebra
  • Linear and logistic regression
  • Machine learning
  • Market-basket analysis
  • Matrix operations
  • Predictive modeling
  • Principal component analysis
  • Sampling
  • Summarization
  • Text analytics
  • Time-series analysis
  • Unsupervised learning
  • Constrained optimization

Mastering the core tools and platforms

Data scientists need to master a core group of modeling, development and visualization tools and the platforms used for discovery, acquisition, preparation, storage, execution, integration and governance of organizational big data. Depending on the environment, and the extent to which data scientists work with both structured and unstructured data, mastering tools and platforms may involve some combination of Hadoop, Spark, stream computing and other platforms. It will likely also entail learning R and other open source development languages.

Driving strategic business outcomes

Data scientists need to learn the chief business applications of data science in their organization and how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next-best offers, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. And gaining an understanding of the key business outcomes—such as maximizing customer lifetime value—that should focus modeling initiatives is essential for every data scientist.

Data science community involvement

To these ends, consider participating in Datapalooza, 10–12 November 2015, in San Francisco, California. Datapalooza is the first program of its kind that brings together data science professionals to build practical skills and knowledge in the era of Spark, open source data platforms and cloud data services. The San Francisco sessions kick off a world tour that will take Datapalooza to Berlin, Germany; Hong Kong; London, England; Moscow, Russia; Tokyo, Japan; Tel Aviv, Israel and other cities in 2016 and beyond.

Datapalooza is a three-day immersive experience for data science professionals and for those who want to take a deeper dive into this profession. Participants from diverse backgrounds share experiences, learn from each other and engage with experienced data scientists from the IBM Spark Technology Center, Galvanize and other organizations. The event incorporates deep dives with industry leaders from data science, data engineering and application development.

Datapalooza participants also acquire hands-on experience building a data product over the course of the three days. They combine analytics and innovative skills to attack real-world challenges using natural language processing, machine learning and the IBM Watson cognitive computing framework. Sessions and talks focus on data wrangling, data pipelines and data variables, models and scoring. Those eager to take applications to the marketplace can explore frameworks, visualizations and product launch. Participants’ work products can be shared, reused and seamlessly integrated within the Datapalooza community.

Register for Datapalooza, and stay tuned to see when the event will take place in a city near you. In addition, register to engage in Spark Summit Europe, 27–29 October 2015, in Amsterdam, the Netherlands.

For additional resources, learn how to use Spark as a service on the IBM Bluemix cloud platform to address urgent business challenges. And deepen your data science explorations with these IBM Analytics and Spark resources: