Putting Big Data Myths to Rest

Avoid giving credence to these misconceptions when making decisions about big data

Solution CTO, IBM

Recent discussions on LinkedIn have surfaced a number of persistent myths about big data that deserve your attention. As you have probably learned by now, I am not shy about trying to keep people from making bad decisions based on these myths. Perhaps addressing the following myths here will help put them to rest by explaining why each deserves to be cast aside. Maybe it will also serve as a friendly call for certain competitors to spend a bit more time researching what they put into circulation through their marketing efforts.

Big data is only for unstructured sources

Not true, not true, not true. As discussed in a recent Big Data Bytes webcast, the notion that big data is only for unstructured data sources is a falsehood that certain database vendors want you to believe because they are scared of the new technologies. In reality, most NoSQL systems store a wide range of information types just fine. More to the point, using structured data in Apache Hadoop is very useful; in fact, as one of our studies showed, many solutions use core structured data as their primary source. For example, one of my current banking projects is focused solely on structured data in our Hadoop platform, IBM® InfoSphere® BigInsights™ technology.

So why does this myth exist? There are two primary reasons: one terrible and one that makes some sense. The terrible one first: simply stated, some legacy database vendors try to nudge you into this way of thinking because they are afraid of losing sales. This tactic, frankly, is a foolish thing to do, because data will flow to the best place to work with it; better to be accurate and embrace the changing landscape. The second reason, which makes some sense, has a lot to do with how Hadoop got started, which was primarily as a log-processing solution. To be precise, log data is actually semi-structured, but the media grabbed hold of it as unstructured data and the association stuck. Unfortunately, legacy vendors engaging in fear, uncertainty, and doubt (FUD) amplified that association.

Do not fall for this myth. Big data technologies such as InfoSphere Streams and BigInsights handle structured data just fine, thank you very much.
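To make the structured-data point concrete, here is a minimal sketch in plain Python standing in for a Hadoop MapReduce job: a map phase and a reduce phase aggregating purely structured CSV records, just as such a job would over files in HDFS. The banking-style field names and values are invented for illustration, not taken from any project mentioned above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical structured records, as they might sit in CSV files on HDFS.
RAW = """account,branch,amount
A1,NYC,100.00
A2,NYC,250.50
A3,BOS,75.25
A1,NYC,20.00
"""

def map_phase(record):
    """Map: emit (branch, amount) pairs from each structured record."""
    yield record["branch"], float(record["amount"])

def reduce_phase(pairs):
    """Reduce: sum the amounts for each branch key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

reader = csv.DictReader(io.StringIO(RAW))
pairs = [kv for row in reader for kv in map_phase(row)]
totals = reduce_phase(pairs)
print(totals)  # → {'NYC': 370.5, 'BOS': 75.25}
```

Nothing about the map/reduce pattern cares whether the input is free text or fixed fields; structured rows are, if anything, the easy case.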

Big data has an inherent data quality problem

Recently, a representative from a large enterprise resource planning (ERP) organization presented this particular myth as fact. As I've written here previously, metadata, accuracy, and lineage do matter, but those considerations do not mean big data has an inherent data quality problem.

First, not all use cases require a traditional focus on data quality. Data exploration zone use cases, where you are discovering information sources or pulling in external sources you are considering using, simply don't require a traditional data quality focus at that phase of the project.

Second, many big data use cases involve trusted source systems directly writing out unaltered information. Writing out unaltered information from these sources does not mean the data is perfect, but if you can't trust your source transaction systems, you have an even bigger problem than data quality, and it needs to be addressed. Oftentimes the challenge isn't accuracy but format, putting aside sheer volume as a consideration for a moment, and an extract-load-transform (ELT) approach can handle that challenge.

Third, there are use cases in which working with a large population of data, what I commonly refer to as a superset, can be more accurate than sample-based methods. This topic is pretty hot and quite deep, so expect to see additional coverage here about using whole populations versus sampling.

Again, these considerations do not mean you can skip data quality concerns, but they also do not mean you have an inherent data quality problem. Data quality and lifecycle management considerations move front and center as you start moving into production-grade analytics and golden-master concepts, and I'll cover those topics in a future posting.
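The whole-population-versus-sampling point can be illustrated with a quick sketch. The synthetic transaction amounts and the normal distribution below are assumptions chosen purely for demonstration, not data from any project: the population mean is a single exact answer, while repeated small samples each give a different estimate.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is reproducible

# Synthetic "whole population" of 100,000 transaction values (illustrative only).
population = [random.gauss(mu=50.0, sigma=15.0) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Twenty small samples: each yields a different, approximate answer.
sample_means = [
    statistics.fmean(random.sample(population, 100)) for _ in range(20)
]
spread = max(sample_means) - min(sample_means)

print(f"population mean: {true_mean:.2f}")
print(f"sample means vary over a range of {spread:.2f} units")
```

With the full population there is nothing to estimate, which is exactly why superset approaches can beat sampling when the volume is tractable.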

Machine learning prevents human bias

A big data–related topic that caught fire this summer was machine learning, which generated a lot of discussion on how best to use it. Most of these discussions were helpful, but a couple were distinctly off base, and I want to make sure bad information does not trip you up as you look to get started. There are many reasons why machine learning should be part of your approach to analytics, but the expectation that it will remove human bias is definitely not one of them. As with the discussion of using a whole data population versus sampling, this topic runs deep, but the reason machine learning doesn't remove the potential for human bias is easy to explain. First, someone has to select the use case to pursue. Second, someone has to select the data to use. Third, someone has to pick the machine-learning methods. And fourth, someone has to interpret and rerun the models until they provide reliable output. If you do not believe that human bias can creep into any of these parts of the process, I have a large bridge I'd like to sell you.
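The second step, selecting the data, is enough on its own to show how bias creeps in. The sketch below is a deliberately toy "learner", a simple threshold classifier with invented numbers, not any real fraud model; the only point is that the same algorithm trained on a human-chosen subset of the data learns a different answer.

```python
import statistics

# Invented transaction amounts labeled fraud (1) or ok (0), for illustration only.
data = [(5, 0), (12, 0), (20, 0), (35, 1), (60, 1), (80, 1), (15, 0), (45, 1)]

def fit_threshold(samples):
    """'Learn' a cutoff: the midpoint between the mean ok and mean fraud amounts."""
    ok = [amount for amount, label in samples if label == 0]
    fraud = [amount for amount, label in samples if label == 1]
    return (statistics.fmean(ok) + statistics.fmean(fraud)) / 2

# Identical algorithm, two different human choices of training data.
model_all = fit_threshold(data)         # trained on the whole data set
model_biased = fit_threshold(data[:5])  # someone "conveniently" kept only the first rows

print(model_all, model_biased)
# The learned cutoffs differ, so a borderline transaction can be classified
# differently depending on which data a person chose to feed the learner.
```

No amount of algorithmic sophistication undoes that upstream choice; the model faithfully learns whatever the selected data implies.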

Machine learning is a real-time activity

This particular myth really irks me, partly because it was put out there by a vendor that recently purchased a real-time engine. (It isn't actually a real-time streaming engine; it is a complex event processing (CEP) solution, which is a subset of a streaming engine, but more on that later.) Evidently, the vendor now sees everything as a real-time problem, including machine learning. To be super-clear here: machine-learning approaches are used offline to build models that are then used in real time. You may score the models generated by machine learning in real time, but building those models is an offline process that is highly iterative and typically quite time intensive. As I tried, without much success, to point out in a LinkedIn discussion, you need to understand, and architect for, the fact that machine learning is iterative and requires a lot of human involvement. These models are built over weeks or months, and weeks or months most certainly do not qualify as real time.

There is certainly a lot of room for debate and differing opinions on the topic of big data, but there is also quite a bit of cut-and-dried, right-versus-wrong, fact-versus-fiction knowledge as well. Stay away from the vendor myths and recycled information from people who have never worked with or helped create these technologies. Hopefully, this post helps lay bare some of the most common myths so you can readily avoid them in your planning. If you have a myth or common misunderstanding that you would like to see addressed in a future column, please let me know.
