Making Data Simple: The big data problem
01:30 Connect with Al Martin on Twitter (@amartin_v) and LinkedIn (linkedin.com/in/al-martin-ku)
04:30 Connect with Daniel Hernandez on Twitter (@danhernandezATX) and LinkedIn (linkedin.com/in/danielghernandez)
06:15 NPS = Net Promoter Score (http://www.medallia.com/net-promoter-score/)
08:40 The four Vs of Big Data (http://www.ibmbigdatahub.com/infographic/four-vs-big-data)
17:30 Accidental Empires written by Robert X. Cringely (1996), Dealers of Lightning: Xerox PARC and the Dawn of the Computer Age, written by Michael A. Hiltzik (2000)
Hungry for more? Check out our other podcast episodes of Making Data Simple:
- Episode 2: Making Data Simple: End of tech companies
- Episode 3: Making Data Simple: A new definition of client care
- Episode 4: Making Data Simple: Will machines take our jobs?
- Episode 5: Making Data Simple: Growth Hacking - Not just for start ups
- Episode 6: Making Data Simple: From 2D to 3D -- Augmented reality data visualization
- Episode 7: Making Data Simple: The 5 areas businesses MUST get right
- Episode 8: Making Data Simple: How data science is helping to improve aviation
- Episode 9: Making Data Simple: Making data fun & easy with Caleb Curry
- Episode 10: Making Data Simple: Data movement at size and scale
- Episode 11: Making Data Simple: Cloud computing, part 1
Intro: You’re listening to Making Data Simple. Where we make the world of data effortless, relevant, and yes, even fun.
Al Martin: Welcome to the series of Making Data Simple. What I’ve done is I’ve commandeered the Analytics Insights podcast and pivoted to more of my interests being the simple notion of making data simple. My name is Al Martin. I’m an executive at IBM in the Analytics space. I hold responsibility for data, appliances, content. So essentially databases, content management, data movement. Those disciplines include development, devops, support on both private and public cloud. But all that really means is that I have a deeply rooted interest in all things data. And, you know, it is often said and maybe it is a cliche that data is now the world’s most valuable natural resource. I buy into that notion. Untapped it is no different than crude oil with little value, but mined with insights and data driven decisions, I do think it will change the world.
Al Martin: While I’m very proud of what IBM has done and continues to do in this space, check it out, if you’d like to reach out to me separately I’d love to talk to you about it, this is a series within a podcast that is not about selling, it is about views and experience that I will provide and my guests will provide about elements of the industry around data that I find interesting and all around making it simple. Because I think it is a complex situation that we need to make simple.
Al Martin: For me, there is a selfish motivation in doing this podcast. And that is to use my network to meet with experts in the industry and simply learn. So, the other thing I’ll say is that I am a podcast junkie. I don’t like to waste any time in a day. If there is any idle time I’ll fill it with a podcast, video cast, Ted Talk, something of that nature. Making Data Simple has a wide aperture. I want to start with why data matters, how data is relevant to every aspect of both professional and personal life. I want to explore the client experience and dynamics around clients, cloud, private cloud. I want to look at challenges like insight, personalization, visualization, mobile — the good news is that I have a topic with almost endless possibilities. And the title just became very clear to me, again, because of making all the complexity very simple and fun for everyone to listen and enjoy. So look, I’ll provide all views and knowledge based on my experience. No question is off limits. I’m not afraid of admitting what I don’t know, in fact, that’s why I’m here. And again the objective is to have fun and stay within scope.
Al Martin: I consider myself a problem solver by nature. So I think I want to start with a problem here and I can work over the next podcasts or series of podcasts to find a solution. I’m going to start with the Big Data Problem. What it is, why should I care, why should you care, and how do we solve it. Today, I’ve asked a guest speaker, Daniel Hernandez, to come and chat with us. Daniel, how are you?
Daniel Hernandez: How are you doing, Al?
Al Martin: I’m doing well. Hey, Dan, what do your friends call you? Dan or Daniel?
Daniel Hernandez: Actually, my friends call me Danny.
Al Martin: Danny! Alright, I’m going to call you Danny.
Daniel Hernandez: You didn’t expect that one, did you?
Al Martin: (Laughing) So Dan, Danny, Daniel is a fellow IBMer. He’s an exec who leads Offering Management in the Analytics space. He’s also, as I know him, a student of data and analytics and he’s always researching and devising solutions built on data, sometimes to my exhaustion because I implement some of those technologies. So that’s a poor intro. I’ll let you give it a minute in terms of your day to day role, Dan, and your interests.
Daniel Hernandez: So maybe the best way to say it is I feel like my job in offering management is to build the stuff that I need in order to do my job. It happens that my job is one where data and using data in order to drive insights is kind of the essential part of what we do. So, basically trying to build this is the job my team and I have. In short.
Al Martin: Very nice. So here’s what I’ll do. I’ll start with a few questions. Hopefully it will help drive a conversation and then we’ll just see where it goes.
Daniel Hernandez: Sweet.
Al Martin: So, the definition of big data. It is really kind of old news. But what do you see as the definition of big data and what are the problems that exist today?
Daniel Hernandez: To be honest, I actually do not like the word Big Data or the phrase Big Data because it is somewhat of a misnomer in my mind. Essentially what it implies is unless you have a huge amount of data that there is no value. And we know that is not true. Think about decisions you are making today and the information you need in order to make that decision. It is often the stuff that might be on your desktop. It might be stuff that is a few tuples inside of a table in a database that you’ve got and NOT Big Data. So the association that Big Data has with size I think is wrong, to be honest, at least in the way that I think of it. And the way we have designed our products and our portfolio and how we’re taking it to market. It is about the scope of the data available to you that matters most. So, depending on your question, for instance, are my clients satisfied? That might be information that is available to you through a small amount of data, like NPS, but it is the kind of data that typically is not stored in your traditional operational databases. And to me that is a more important attribute of Big Data than the size attribute of Big Data. It is any and all data that you need in order to make a decision. That is kind of what I view Big Data to be. Stuff in spreadsheets, stuff on email, stuff in documents. It is everything.
Al Martin: That leads to two questions that I have for you. One is, so did Big Data just come about because of the advent of Hadoop? Before that, I don’t know that I heard that definition arise. And is it a problem or is it an opportunity?
Daniel Hernandez: Yeah, I actually don’t know, to be honest, where the word came from. I think I’m in the camp of Al here where I started hearing it whenever Hadoop really started to emerge. I do think there needed to be a way to distinguish it from other forms of data. I think it is unfortunate we chose Big Data to reflect it. Because we have data that fits that classification that I offered, that is any and all data that offers value, and typically is not the stuff you would see inside of databases. But it also includes stuff like our content management systems, which for many of our customers in every industry actually, have been storing and capturing for years. The biggest challenge is there hadn’t been effective ways to get at it. And so, what I do think the focus on Big Data has done is drive attention to data that often is not accessible to everyday users as part of everyday decision making, and force us to reckon with how do you make it accessible? Which, if you consider for instance parts of our portfolio like Watson Explorer, which is in part focused on search, that is what it is designed to do, to help you find things you would not find in your traditional systems. But I still think that remains generally a challenge for a lot of our clients today, that largely in many cases is unaddressed.
Al Martin: I don’t have any opposition to big data in and of itself as a term, but I’ve always struggled with the academic definition that it came out with. If you recall, it was the Vs. Started out with several Vs: volume, velocity, variety, and I think it grew to like 7 or 8 Vs. It wasn’t meaningful to me. For me, it was more about analytics, insights, and the real interest as we progress into machine learning. That’s where I thought Big Data would lend itself to. So does this imply that we have conquered structured data and now it is off into the unstructured world?
Daniel Hernandez: What would you consider rows inside of a file that happen to be stored in (Apache) Parquet in an HDFS file system? Structured data that happens to be in an unstructured format inside of a file system. And that is where a lot of the information is being captured today. Click stream data that will help you to understand who is using your products, how they are using your products and what not, is often stored that way. And tell me the tools you would have at your disposal to tap into it. There aren’t as many as you need so, no, I don’t think we’re at end of job for helping people get use of data in structured forms because there are more structured forms that are emerging and there are not enough ways to tackle problems like data discovery for traditional forms. Let me give you an example. So you and I spend a lot of time worried about our customers. We spend a lot of time with our own customers. How often do you hear things like, “I bought into self-service analytics, I helped people take advantage of tools that allow them to do data discovery and charting and what not. The biggest problem that I’ve got is that I don’t have an effective way for them to find information in order for those tools to be that useful.” Or maybe, it is easy for them to get at the data but there aren’t appropriate controls around who has access to what data and as a consequence it is locked off for them inside of those tools. So no, absolutely not. I think there is a major governance issue on the structured data side which is what I refer to when I talk about the data discovery problem. I think there is an access issue in terms of new emerging structured data types that are not in your classic relational database (RDBMS). So no, I don’t think we are done by any stretch.
Al Martin: You are right. I visit a lot of clients and most of the clients I visit today still don’t know where to start. Most of them I walk in and they have Machine Learning (ML) aspirations and we start talking about the maturity curve and it is back to spending money to save money, more about cost reduction, more about operations, more about they are still trying to get to the data lake when we are talking data science and bigger things like that. So I’m with you from that perspective.
Al Martin: So where does Hadoop play? Where does that come in?
Daniel Hernandez: It is another way for you to use commodity hardware to store data and to take advantage of open source innovation to make that happen. Whether it is HBase, Hive, or HDFS, which is a file system, the advantage that it confers outside of commodity hardware is that there is a large ecosystem of providers offering tools in and around Hadoop to manage the information, govern it, integrate data into it, to do data profiling and quality. We happen to be the premier vendor there, but because it is open our clients are afforded a bevy of choices that are unlike traditional and sometimes proprietary systems. Hadoop, at the end, helps you store more information on commodity hardware and take advantage of the ecosystem of providers that have rallied around it. So it is quite powerful.
Al Martin: What about Spark? You know Spark, at least from an IBM standpoint, all the products and technologies that I’m driving, boy we damn near put Spark and have embedded Spark into every one of those technologies. Why do you see that as the strategy?
Daniel Hernandez: Similar thing. So what are the characteristics of Spark that we like? First, you’ve got incredibly fast data processing because of the in memory model. It takes advantage of distributed compute, so it allows you to scale and do data processing at a level that is unrivaled certainly in some cases. And because it is open source, you get the benefit from the ecosystem of providers that are building tools and offerings and solutions around it. What I particularly like about it is that it allows you to deal with the data access problem. Aside from doing data processing, including running machine learning pipelines, for instance, on data, there is a federation of data access benefit that we particularly like about it. Let’s say you’ve got data in an analytical warehouse, let’s call it a Db2 Warehouse. You’ve got your operational data, call it Db2, you’ve got click stream data in HDFS on Parquet. Spark lets you basically run data processing across all of it, indifferent of where it is using a standard programming model. The last thing I’ll say about Spark, and part of the reason why I am excited about it, it allows you to run batch operations on your data, so do customer segmentation on my data that happens to be spread across multiple databases, for instance, but it also helps you do the same kind of data processing on data that is trickling in on Kafka, for instance. So not exactly real time, but near real time or mini batches is probably the more accurate way to describe it. And it affords you a lot of options in terms of addressing use cases that you otherwise couldn’t had you been forced to do batch only.
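The batch versus mini-batch point above can be sketched in a few lines. This is plain Python, not the Spark API, and it only illustrates the idea that one piece of processing logic can serve both modes. The event records and the names `count_clicks_by_product` and `mini_batches` are hypothetical, invented for this illustration.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def count_clicks_by_product(events: Iterable[dict]) -> Counter:
    """The single piece of processing logic, shared by both modes."""
    return Counter(event["product"] for event in events)


def mini_batches(stream: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Cut an incoming stream into small lists of at most `size` events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical click stream events, made up for this sketch.
events = [
    {"user": "u1", "product": "warehouse"},
    {"user": "u2", "product": "db"},
    {"user": "u1", "product": "warehouse"},
    {"user": "u3", "product": "explorer"},
    {"user": "u2", "product": "warehouse"},
]

# Batch mode: process everything that has already landed, in one pass.
batch_totals = count_clicks_by_product(events)

# Mini-batch mode: the same function runs on each small slice as it arrives,
# and the partial results are merged into a running total.
running_totals = Counter()
for batch in mini_batches(iter(events), size=2):
    running_totals.update(count_clicks_by_product(batch))

print(batch_totals == running_totals)
```

The running total is only current as of the last completed mini-batch, which is why "near real time" is the more accurate description than "real time".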
Al Martin: Well, certainly, I’m a database guy by nature and the thing I like about it is that Spark allows queries to run very very fast. Obviously the speed of access to RAM is, what, six times faster than disk. With memory so cheap, it makes a great case for Spark. How about open source and this whole gamut of things?
Daniel Hernandez: It gives customers choice. That is kind of the bottom line. There is more that many can do than a few can do. In other words, if we were in a world where all the innovation was coming from tech writers and R&D shops I don’t think we’d have topics like self-service analytics, which is about liberating data to anyone that wants it, getting as much traction as possible. From a vendor standpoint, from IBM, it helps us focus our resources. Instead of 100% focus on plumbing and infrastructure, we’re able to map to “okay, how do we deliver business outcomes that matter for our customers and how do we deliver that by bridging the gap between what is necessary to get that benefit and what is offered by open source?” That could be fanatical support that we deliver from our customer success and support teams or it could be from tools and solutions we build on top of that that are benefiting from all prior art and open source. So, huge fan. I especially like what is happening in the Spark space and machine learning space for sure.
Al Martin: So back to where we started, in terms of the problem, the way I see the problems today are we have a storage problem, we have an insights problem, we have a machine learning opportunity. In terms of managing the whole data ecosystem from ingestion to storage to governance to analytics to visualizations, to your point, we haven’t solved structured or unstructured data in all of those areas. We have a lot to do yet. Do you see it the same way?
Daniel Hernandez: Oh yeah. That is exactly right.
Al Martin: I was listening to a podcast the other day. The way they summarized big data or data is: “big apps on top of big compute on top of big data.” Is that also how you think of it?
Daniel Hernandez: I can get my mind around that. For sure. And for what purpose, I would say, would be a way to put that with a nice little bow.
Al Martin: And then a UI that covers up all the mess and makes it look pretty, right?
Daniel Hernandez: (Laughing) Yes.
Al Martin: So let’s go on to a little speed round. I like to end each podcast to get to know you a little bit. I’ll give you a few questions, just off the cuff, and get your answer. It will only take a minute.
Daniel Hernandez: You got it.
Al Martin: First of all, I think you are from Austin and in Austin, what is it, everything’s weird? What is the saying?
Daniel Hernandez: Keep Austin Weird.
Al Martin: (Laughing) What are you currently reading?
Daniel Hernandez: I just actually re-read Accidental Empires which was, there was a PBS documentary called Triumph of the Nerds, and when I had just started programming it was in part my bible. So I re-read the book. That was pretty fun to re-read it. He is dissing on Microsoft all day long. So I thought that was a little bit of fun. He said some not nice things about IBM too. But I thought it was a good history book. I’m also working on a book, actually I just finished it on a plane yesterday, it is Dealers of Lightning, and that is about the PARC Integration and the personalities behind it and debunks a few myths like they never made any money. So two books highly encouraged, quick reads, definitely worth it.
Al Martin: So wait, Accidental Empires, I’ve got that one. And the other one was…
Daniel Hernandez: Dealers of Lightning.
Al Martin: Oh, I haven’t heard of that one. Alright. It is on my list. Number one role model.
Daniel Hernandez: I’ve had many, actually, throughout my career. It turns out one of the people who I got a start with, his name is Brian Armstrong, I was a kid basically putting up boxes and he hired me to be a system administrator for this company called EXE. He was a role model for a long time when I was growing up programming. Actually I wasn’t a programmer, I was a system admin to start. He gave me my first break in tech, not my first programming break. And it turns out he joined IBM in SoftLayer a couple of months ago as an executive there. That was certainly one. You are one of my role models for customer success. I don’t think that there is, I’m serious, man, I got religion on making clients happy and I think your decision making and focus on fanatical support is something I’m subscribing to, I’m trying to be like you.
Al Martin: Many, many thanks, man. That means a lot. So this is the tough one and then I’ve got one easy one. Greatest professional fear.
Daniel Hernandez: For a long time I never wanted to look stupid, which I realize there is no chance in hell that is something I would be successful at. So I got over that fear factor and now I just go for it all the time. I accept that I’m going to be stupid. Turns out it is professionally beneficial because if you ask what are perceived to be dumb questions in your brain, it often turns out to be the same stuff everyone else is fearful to ask and isn’t asking in their day to day. It turns out to be a quite useful tool. I’m kind of driven by fear all the time, to be honest. I’m scared of everything which is part of the reason I walk around with a chip on my shoulder. I study more than I probably need. It is fuel, I guess.
Al Martin: You and me both, man. I’ll give you the last word, Daniel, you got anything else and then I’ll sign off.
Daniel Hernandez: What got you fascinated with Big Data?
Al Martin: I’m kinda like you. I’m not so fascinated with big data. I’m more fascinated with data in that I think it will change the world. I know that sounds kind of geeky but I think it will. When you put that data together you are able to find trends, able to use augmented intelligence, machine learning. I think it will transform the healthcare industry, I think it is going to transform the way we go about our day to day. I don’t even think we’ve touched IoT. I travel all over the world. I hear about all the smart buildings and I don’t know that I’ve been through one true end-to-end smart building. It just interests me in terms of what the opportunities are. I think it is huge.
Daniel Hernandez: So let’s talk about the application inside of your team. I think the stuff you are doing is super awesome. NPS and how it is changing the way we are going about it. I consider it a data use case as per our definition. Tell us a little bit more on how that is impacting our day to day and how we work.
Al Martin: To me, NPS, that being surveying your customers, and by the way there is more than just support. You survey through offering management, sales has a survey. I think a survey in and of itself is just a survey. I’ve been doing this a long time and there is nothing special about it. Sometimes customers, you know, when they get upset, that is when you get to hear their wrath if you will, but the magic is in taking the time to investigate what they are saying and listen to them. You can find a ton of great nuggets about how to better your business. You not only look at the things that are going negative, everyone always wants to look at what did they say and where are we messing up. I think it is equally important to look at what you are doing well. You can’t forget about that. So we can accentuate our strengths and minimize our weakness. So I think that cadence just to make sure you are on that and always recognize there is a customer at the other end, that drives good behavior and client sat is everything.
Daniel Hernandez: One of the reasons why I like that as a use case is the data just tells you what your customers are thinking, but if you ignore it nothing is going to change. So the entire way of work around what you do with that data and how you are going to respond to the feedback you are getting both positively and negatively is the difference maker. We often, especially as technologists, try to focus on, let’s get the data and present it in the right way but not necessarily how it is ultimately going to be infused in the decisions that drive the outcome. In this case, making clients happy. Or solving problems that are getting in the way.
Al Martin: Totally agree.
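For listeners who have not worked with NPS, the arithmetic behind the score is small enough to show. The sketch below uses the standard definition (ratings of 9 or 10 are promoters, 0 through 6 are detractors, and the score is the percentage of promoters minus the percentage of detractors). It is not a description of how the team discussed here computes or stores its survey data, and the function name and sample ratings are hypothetical.

```python
from typing import Iterable


def net_promoter_score(ratings: Iterable[int]) -> float:
    """Return NPS on the usual -100 to 100 scale for 0-10 survey ratings."""
    scores = list(ratings)
    if not scores:
        raise ValueError("need at least one rating")
    if any(not 0 <= score <= 10 for score in scores):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for score in scores if score >= 9)   # 9 or 10
    detractors = sum(1 for score in scores if score <= 6)  # 0 through 6
    # Passives (7 or 8) count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical survey responses, invented for illustration.
sample_ratings = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
print(net_promoter_score(sample_ratings))  # 5 promoters, 3 detractors of 10 -> 20.0
```

The number alone says little, which matches the point made in the conversation: the value comes from reading the comments behind both the low and the high ratings and acting on them.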
Daniel Hernandez: Last question for you Al, why is Texas barbeque the best barbeque on the planet?
Al Martin: (Laughing) I think you are mistaken. I thought you were one of the smartest guys I knew and now it is going downhill to end the podcast. Kansas City, by and large, number one and then maybe Texas is second or third. You are happy with that, aren’t you?
Daniel Hernandez: I would pick Memphis barbeque over the stuff you guys have over there.
Al Martin: You are hurting me right now. Whose idea was it to invite you on the first podcast?
Daniel Hernandez: Alright, Kate, let’s edit out all that bad stuff about the Texas barbeque.
Al Martin: Alright, thank you for joining us today. You are great. Share it with your friends, colleagues, mother, whoever, check out the show notes and you can find me here or on Twitter at @amartin_v.
Thanks for listening to the Making Data Simple podcast. Where we make data fun. Be sure to visit ibmbigdatahub.com/podcasts to access the show notes and uncover even more great episodes. Remember the views expressed here are those of the host and its guests and do not necessarily represent the views of IBM. Until next time, over and out.