
Data Scientist: Mastering the Methodology, Learning the Lingo

August 30, 2012

We can argue till we’re blue in the face on the issue of whether a true data scientist must have academic credentials. But no one doubts that credentials mean little if you can’t actually do the work.

You can call yourself a data scientist in good conscience only if you can master the methodology. Yes, there’s a significant–some might say "scary"–learning curve awaiting anybody who seriously wants to enter this field. Many people let their fear of math keep them from getting that degree, cracking open the books, glancing at the journals, or paying close attention when data scientists are speaking.

Statistical patterns are the very heart of big data applications, so it’s a bit disappointing when big data professionals have skimpy knowledge of the quantitative techniques upon which all else rides. For example, I believe that math-phobia is one of the chief reasons we don’t see many industry analysts focus on data mining, predictive modeling and statistical analysis. Many otherwise-technical people tune out when the technical discussion goes deep into equations festooned with garlands of Greek letters.

Data scientists must truly walk the walk through a thicket of statistical algorithms and techniques. It’s not enough to have a passing familiarity with regression modeling, for example, because that’s not the only statistical approach in the data scientist kitbag and, besides, there are several ways to regress variables, none of which is perfectly suited to every modeling scenario. Choosing the right modeling approach is often a creative exercise that demands expert human judgment.
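To make the point about "several ways to regress variables" concrete, here is a minimal sketch that fits the same made-up data two ways: ordinary least squares and the Theil-Sen estimator (the median of all pairwise slopes). The data, the function names and the choice of these two techniques are illustrative assumptions, not anything drawn from a particular project.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(xs, ys):
    """Ordinary least squares slope for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Theil-Sen slope: the median of the slopes over all pairs of points."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip vertical pairs, whose slope is undefined
    ]
    return median(slopes)


# Hypothetical data: y = 2x exactly, except the last point is a gross outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 40]

print("OLS slope:      ", ols_slope(xs, ys))
print("Theil-Sen slope:", theil_sen_slope(xs, ys))
```

On this data the single outlier drags the least-squares slope up to about 6, while the Theil-Sen slope stays at 2. Neither answer is "right" in the abstract: if the outlier is a recording error, the robust fit is the better summary, but if it is a genuine signal, ignoring it may be the mistake. That choice is the judgment call.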

No, you don’t need to have a Ph.D. in statistics to be a data scientist. What you do need are curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor and a skeptical nature. You must also be articulate, because no one will accept the validity of the patterns you surface if you can’t explain clearly how you built your model, what variables and data you used, or what the results truly mean in the context of either a business problem or a scientific endeavor.

Data scientists are a community with a specialized lingo–some might call it a "jargon." You must learn that lingo, but you can’t just talk the talk. Any attempt to glibly bluff your way through this difficult material will quickly expose you as a pretender. Also, if you wish to be a well-rounded data scientist, you should master specialized subject matters and terminologies in each of several key areas: algorithms and modeling, tools and platforms, applications and outcomes, and paradigms and practices.

The classic routes for acquiring this expertise and patois are in academia and on the job. Another great route, especially for the self-taught, is participating in online discussions among data scientists. In the course of my work here at IBM, I participate in several professional forums on LinkedIn devoted to big data. Here, grouped under several broad categories, are some recent LinkedIn discussion threads relevant to data science (click through to observe and participate in each, if you wish):

Algorithms and modeling:

Tools and platforms:

Applications and outcomes:

Paradigms and practices:

If you’re an established data scientist, you might find one very specialized discussion to be compelling, but the rest not worth a moment of your time. If you’re not a data scientist but wish to engage with those professionals in various business initiatives, these sorts of discussions may be the intellectual on-ramp you need to orient yourself.

Yes, some of the methodological discussions can be sleep-inducing and are best followed on a full tank of caffeine. If the thought of obtaining initial cluster components by extracting the required number of principal components and performing an orthoblique rotation makes you break out in hives, don’t say I didn’t warn you.