Data Scientist: Mastering the Methodology, Learning the Lingo
We can argue till we’re blue in the face on the issue of whether a true data scientist must have academic credentials. But no one doubts that credentials mean little if you can’t actually do the work.
You can call yourself a data scientist in good conscience only if you can master the methodology. Yes, there’s a significant (some might say "scary") learning curve awaiting anybody who seriously wants to enter this field. Many people let their fear of math keep them from getting that degree, cracking open the books, glancing at the journals, or paying close attention when data scientists are speaking.
Statistical patterns are the very heart of big data applications, so it’s a bit disappointing when big data professionals have skimpy knowledge of the quantitative techniques upon which all else rides. For example, I believe that math-phobia is one of the chief reasons we don’t see many industry analysts focus on data mining, predictive modeling and statistical analysis. Many otherwise-technical people tune out when the technical discussion goes deep into equations festooned with garlands of Greek letters.
Data scientists must truly walk the walk through a thicket of statistical algorithms and techniques. It’s not enough to have a passing familiarity with regression modeling, for example, because that’s not the only statistical approach in the data scientist kitbag and, besides, there are several ways to regress variables, none of which is perfectly suited to every modeling scenario. Choosing the right modeling approach is often a creative exercise that demands expert human judgment.
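To make the point about regression concrete, here is a minimal sketch that fits the same small data set two ways: ordinary least squares and the Theil-Sen estimator (the median of all pairwise slopes). The data, the single corrupted reading, and the function names `ols_fit` and `theil_sen_fit` are all hypothetical, chosen only for illustration. Theil-Sen is one robust alternative among many, not a recommendation.

```python
from itertools import combinations
from statistics import mean, median


def ols_fit(xs, ys):
    """Ordinary least squares: minimizes the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def theil_sen_fit(xs, ys):
    """Theil-Sen: median of all pairwise slopes, robust to a few outliers."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1  # skip vertical pairs, whose slope is undefined
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


# Hypothetical data: y = 2x + 1 exactly, except for one corrupted reading.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2 * x + 1 for x in xs]
ys[-1] = 60  # the outlier (the clean value would be 19)

ols_slope, ols_intercept = ols_fit(xs, ys)
ts_slope, ts_intercept = theil_sen_fit(xs, ys)
print(f"OLS:       slope={ols_slope:.2f}, intercept={ols_intercept:.2f}")
print(f"Theil-Sen: slope={ts_slope:.2f}, intercept={ts_intercept:.2f}")
```

On this data the single bad reading pulls the least-squares slope well above 2, while the Theil-Sen fit recovers the underlying line. That does not make Theil-Sen the better choice in general: it compares every pair of points, so it scales poorly, and on clean data with well-behaved noise least squares uses the data more efficiently. Which one to use is exactly the kind of judgment call described above.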
No, you don’t need to have a Ph.D. in statistics to be a data scientist. What you do need are curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor and a skeptical nature. You must also be articulate, because no one will accept the validity of the patterns you surface if you can’t explain clearly how you built your model, what variables and data you used, or what the results truly mean in the context of either a business problem or a scientific endeavor.
Data scientists are a community with a specialized lingo (some might call it "jargon"). You must learn that lingo, but you can’t just talk the talk. Any attempt to glibly bluff your way through this difficult material will quickly expose you as a pretender. Also, if you wish to be a well-rounded data scientist, you should master the specialized subject matter and terminology in each of several key areas: algorithms and modeling, tools and platforms, applications and outcomes, and paradigms and practices.
The classic routes for acquiring this expertise and patois are academia and on-the-job experience. Another great route, especially for the self-taught, is participating in online discussions among data scientists. In the course of my work here at IBM, I participate in several professional forums on LinkedIn devoted to big data. Here, grouped under several broad categories, are some recent LinkedIn discussion threads relevant to data science (click through to observe and participate in each, if you wish):
Algorithms and modeling:
- Random Effects Regression and ANOVA
- Share your views on the appropriateness of the following method used for variable reduction
- Confounding variables
- Mathematical optimization: finding minima of functions
Tools and platforms:
- Do you know how to build more efficient Real Time Data Warehouse spending less money and time?
- Please suggest me the best Open Source Tool for extracting Social Networks (Facebook, Twitter or Linkedin) data to Hadoop
- Video: Cassandra - A Real Time Big Data Application
- Looking for alternative design patterns for BI solutions using Hadoop and map reduce with out data ingestion, but rather references to the source from various sources. . . .
- R-integration with BI
- Big Data needs OpenCL
- What cloud based tools are available for Social Media Measurement and Analytics?
Applications and outcomes:
- Single File System Supercomputer Cluster Fuels Climate Research in...
- Using Location Intelligence to Create and Enable a Smarter Enterprise
- Best way to visualize events on a timeline
- What Impact Will Big Data Have on New Product Forecasting and Innovations Planning?
- At the Intersection of Big Data and Healthcare: What 7.2 Million...
- NASA Aims High with Airline Data
Paradigms and practices:
- What should be a starting point training for big data?
- Wow! Not a Data Governor, or Data Czar, but Data Dictator!!!
- 3 Reasons Why Data Mining is (almost) Dead
- Your roboboss called, and he doesn’t like your social media stats ...
- "Hoarders:" Big Data Edition
- Big, Big Data From Little Devices
- The What, Why and How of Becoming a Data-Driven Organization
- Do You Need a Chief Data Officer?
- Will big data fundamentally alter the workplace hierarchy and processes?
- ..Can Big Data Smoke Out the Silent Majority?
If you’re an established data scientist, you might find one very specialized discussion to be compelling, but the rest not worth a moment of your time. If you’re not a data scientist but wish to engage with those professionals in various business initiatives, these sorts of discussions may be the intellectual on-ramp you need to orient yourself.
Yes, some of the methodological discussions can be sleep-inducing and are best followed on a full tank of caffeine. If the thought of obtaining initial cluster components by extracting the required number of principal components and performing an orthoblique rotation makes you break out in hives, don’t say I didn’t warn you.